Hive substr with group by cause slow performance - hadoop

I have a table which has about 300,000 records in each partition (ds).
When running the following query in hive (select with substr), it will hanging on the step: map = 0%
select t0.stat_date,t0.plat,t0.soh,t0.page_name,t0.component_name ,count(*) as num,user_id,t0.other_info
from (
select substr(stat_time,1,8) as stat_date,user_id,city_id,soh,dept_id,dph,plat,page_name,component_name,cookie_id,other_info
from db.table
where ds=20170221 and plat='abc'
)t0
group by stat_date,plat,soh,page_name,component_name,other_info,t0.user_id
However, if I replace the inner query from select substr(stat_time,1,8) as stat_date to select stat_time as stat_date, it will execute normally.
stat_time is something formatted in YYYYMMDDHHmm = 201702210900
select t0.stat_date,t0.plat,t0.soh,t0.page_name,t0.component_name ,count(*) as num,user_id,t0.other_info
from (
select stat_time as stat_date,user_id,city_id,soh,dept_id,dph,plat,page_name,component_name,cookie_id,other_info
from db.table
where ds=20170221 and plat='abc'
)t0
group by stat_date,plat,soh,page_name,component_name,other_info,t0.user_id
So why substr leads to slow performance?
--
Edit:
I changed mapred.child.java.opts in mapred-site.xml, and HADOOP_HEAPSIZE in hadoop-env.sh to 4G. And it succeed.
I guess either saving or calculating all the substr causes a lot heap.
If anybody knows why using a plain field value is less memory causing than substr, please leave a comment/answer.

Related

Oracle tuning for query with query annidate

i am trying to better a query. I have a dataset of ticket opened. Every ticket has different rows, every row rappresent an update of the ticket. There is a field (dt_update) that differs it every row.
I have this indexs in the st_remedy_full_light.
IDX_ASSIGNMENT (ASSIGNMENT)
IDX_REMEDY_INC_ID (REMEDY_INC_ID)
IDX_REMDULL_LIGHT_DTUPD (DT_UPDATE)
Now, the query is performed in 8 second. Is high for me.
WITH last_ticket AS
( SELECT *
FROM st_remedy_full_light a
WHERE a.dt_update IN
( SELECT MAX(dt_update)
FROM st_remedy_full_light
WHERE remedy_inc_id = a.remedy_inc_id
)
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
This is the plan
How i could to better this query?
P.S. This is just a part of a big query
Additional information:
- The table st_remedy_full_light contain 529.507 rows
You could try:
WITH last_ticket AS
( SELECT remedy_inc_id, ASSIGNMENT,
rank() over (partition by remedy_inc_id order by dt_update desc) rn
FROM st_remedy_full_light a
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
where rn = 1;
The best alternative query, which is also much easier to execute, is this:
select remedy_inc_id
, max(assignment) keep (dense_rank last order by dt_update)
from st_remedy_full_light
group by remedy_inc_id
This will use only one full table scan and a (hash/sort) group by, no self joins.
Don't bother about indexed access, as you'll probably find a full table scan is most appropriate here. Unless the table is really wide and a composite index on all columns used (remedy_inc_id,dt_update,assignment) would be significantly quicker to read than the table.

Retrieving a large dataset from a very large table

I have a very large table in oracle that contains 140+ million rows. Currently we are doing three full table scans on this table nightly, and using some of the results to populate a tmp table. That tmp table is then turned into a very large report (usually 140K + lines).
The big table is called tasklog and has the following structure has:
tasklog_id (number) - PK
document_id (number)
date_time_in (date)
+ a few more rows that aren't relevant
There are millions of different document ids each repeated between 1 and several hundred times, date_time_in is the time this entry was put into the database.
All of the full table scans looks like this
DECLARE
n_prevdocid number;
cursor tasks is
select *
from tasklog
order by document_id, date_time_in DESC;
BEGIN
for tk in tasks
loop
if n_prevdocid <> tk.document_id then
-- *code snipped*
end if;
n_prevdocid = tk.document_id;
end loop;
END;
/
So my question: is there a quick (ish) way to get a distinct list of document_ids with the row having the most recent date_time_in. This could dramatically speed up the whole thing. Or can anyone think of a better way of retrieving this data daily?
Things that may be relevant, this table only ever has rows inserted with current date time. It is not range paritioned but I can't see how that might help me. No rows are ever updated or deleted. There are about 70k - 80k rows inserted daily.
I don't think that you're going to get away from doing at least one full table scan, as the only way that it would be efficient would be is if the ratio of distinct document_id's to total records was pretty small. The clustering on the document_id is going to be very poor due to the way that the data is generated and inserted.
How about:
create table tmp nologging compress -- or pctfree 0
as
select ...
from (
select t.*,
max(date_time_in) over (partition by document_id) max_date_time_in
from tasklog t)
where date_time_in = max_date_time_in
Possibly, having created this once, you could then optimise further refreshes by merging into this set only the newer records. Something like ...
merge into tmp
using (
select ...
from (
select t.*,
max(date_time_in) over (partition by document_id) max_date_time_in
from tasklog t
where date_time_in > (select max(date_time_in) from tmp))
where date_time_in = max_date_time_in)
on ... blah blah
Have you tried:
select document_id
from tasklog t1
where date_time_in = (select max(date_time_in)
from tasklog t2
where t1.document_id=t2.document_id)
You can do something like this:
select document_id , date_time from tasklog group by date_time,document_id order by date_time desc;
By this you can retrieve distinct document_id with latest date_time colums.

optimizing a dup delete statement Oracle

I have 2 delete statements that are taking a long time to complete. There are several indexes on the columns in where clause.
What is a duplicate?
If 2 or more records have same values in columns id,cid,type,trefid,ordrefid,amount and paydt then there are duplicates.
The DELETEs delete about 1 million record.
Can they be re-written in any way to make it quicker.
DELETE FROM TABLE1 A WHERE loaddt < (
SELECT max(loaddt) FROM TABLE1 B
WHERE
a.id=b.id and
a.cid=b.cid and
NVL(a.type,'-99999') = NVL(b.type,'-99999') and
NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))
);
COMMIT;
DELETE FROM TABLE1 a where rowid > (
Select min(rowid) from TABLE1 b
WHERE
a.id=b.id and
a.cid=b.cid and
NVL(a.type,'-99999') = NVL(b.type,'-99999') and
NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))
);
commit;
Explain Plan:
DELETE TABLE1
HASH JOIN 1296491
Access Predicates
AND
A.ID=ITEM_1
A.CID=ITEM_2
ITEM_3=NVL(TYPE,'-99999')
ITEM_4=NVL(TREFID,'-99999')
ITEM_5=NVL(ORDREFID,'-99999')
ITEM_6=NVL(AMOUNT,(-99999))
ITEM_7=NVL(PAYDT,TO_DATE(' 9999-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
Filter Predicates
LOADDT<MAX(LOADDT)
TABLE ACCESS TABLE1 FULL 267904
VIEW VW_SQ_1 690385
SORT GROUP BY 690385
TABLE ACCESS TABLE1 FULL 267904
How large is the table? If count of deleted rows is up to 12% then you may think about index.
Could you somehow partition your table - like week by week and then scan only actual week?
Maybe this could be more effecient. When you're using aggregate function, then oracle must walk through all relevant rows (in your case fullscan), but when you use exists it stops when the first occurence is found. (and of course the query would be much faster, when there was one function-based(because of NVL) index on all columns in where clause)
DELETE FROM TABLE1 A
WHERE exists (
SELECT 1
FROM TABLE1 B
WHERE
A.loaddt != b.loaddt
a.id=b.id and
a.cid=b.cid and
NVL(a.type,'-99999') = NVL(b.type,'-99999') and
NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))
);
Although some may disagree, I am a proponent of running large, long running deletes procedurally. In my view it is much easier to control and track progress (and your DBA will like you better ;-) Also, not sure why you need to join table1 to itself to identify duplicates (and I'd be curious if you ever run into snapshot too old issues with your current approach). You also shouldn't need multiple delete statements, all duplicates should be handled in one process. Finally, you should check WHY you're constantly re-introducing duplicates each week, and perhaps change the load process (maybe doing a merge/upsert rather than all inserts).
That said, you might try something like:
-- first create mat view to find all duplicates
create materialized view my_dups_mv
tablespace my_tablespace
build immediate
refresh complete on demand
as
select id,cid,type,trefid,ordrefid,amount,paydt, count(1) as cnt
from table1
group by id,cid,type,trefid,ordrefid,amount,paydt
having count(1) > 1;
-- dedup data (or put into procedure and schedule along with mat view refresh above)
declare
-- make sure my_dups_mv is refreshed first
cursor dup_cur is
select * from my_dups_mv;
type duprec_t is record(row_id rowid);
duprec duprec_t;
type duptab_t is table of duprec_t index by pls_integer;
duptab duptab_t;
l_ctr pls_integer := 0;
l_dupcnt pls_integer := 0;
begin
for rec in dup_cur
loop
l_ctr := l_ctr + 1;
-- assuming needed indexes exist
select rowid
bulk collect into duptab
from table1
where id = rec.id
and cid = rec.cid
and type = rec.type
and trefid = rec.trefid
and ordrefid = rec.ordrefid
and amount = rec.amount
and paydt = rec.paydt
-- order by whatever makes sense to make the "keeper" float to top
order by loaddt desc
;
for i in 2 .. duptab.count
loop
l_dupcnt := l_dupcnt + 1;
delete from table1 where rowid = duptab(i).row_id;
end loop;
if (mod(l_ctr, 10000) = 0) then
-- log to log table here (calling autonomous procedure you'll need to implement)
insert_logtable('Table1 deletes', 'Commit reached, deleted ' || l_dupcnt || ' rows');
commit;
end if;
end loop;
commit;
end;
Check your log table for progress status.
1. Parallel
alter session enable parallel dml;
DELETE /*+ PARALLEL */ FROM TABLE1 A WHERE loaddt < (
...
Assuming you have Enterprise Edition, a sane server configuration, and you are on 11g. If you're not on 11g, the parallel syntax is slightly different.
2. Reduce memory requirements
The plan shows a hash join, which is probably a good thing. But without any useful filters, Oracle has to hash the entire table. (Tbone's query, that only use a GROUP BY, looks nicer and may run faster. But it will also probably run into the same problem trying to sort or hash the entire table.)
If the hash can't fit in memory it must be written to disk, which can be very slow. Since you run this query every week, only one of the tables needs to look at all the rows. Depending on exactly when it runs, you can add something like this to the end of the query: ) where b.loaddt >= sysdate - 14. This may significantly reduce the amount of writing to temporary tablespace. And it may also reduce read IO if you use some partitioning strategy like jakub.petr suggested.
3. Active Report
If you want to know exactly what your query is doing, run the Active Report:
select dbms_sqltune.report_sql_monitor(sql_id => 'YOUR_SQL_ID_HERE', type => 'active')
from dual;
(Save the output to an .html file and open it with a browser.)

Get the sysdate -1 in Hive

Is there any way to get the current date -1 in Hive means yesterdays date always?
And in this format- 20120805?
I can run my query like this to get the data for yesterday's date as today is Aug 6th-
select * from table1 where dt = '20120805';
But when I tried doing this way with date_sub function to get the yesterday's date as the below table is partitioned on date(dt) column.
select * from table1 where dt = date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(),
'yyyyMMdd')) , 1) limit 10;
It is looking for the data in all the partitions? Why? Something wrong I am doing in my query?
How I can make the evaluation happen in a subquery to avoid the whole table scanned?
Try something like:
select * from table1
where dt >= from_unixtime(unix_timestamp()-1*60*60*24, 'yyyyMMdd');
This works if you don't mind that hive scans the entire table. from_unixtime is not deterministic, so the query planner in Hive won't optimize for you. For many cases (for example log files), not specifying a deterministic partition key can cause a very large hadoop job to start since it will scan the whole table, not just the rows with the given partition key.
If this matters to you, you can launch hive with an additional option
$ hive -hiveconf date_yesterday=20150331
And in the script or hive terminal use
select * from table1
where dt >= ${hiveconf:date_yesterday};
The name of the variable doesn't matter, nor does the value, you can set them in this case to get the prior date using unix commands. In the specific case of the OP
$ hive -hiveconf date_yesterday=$(date --date yesterday "+%Y%m%d")
In mysql:
select DATE_FORMAT(curdate()-1,'%Y%m%d');
In sqlserver :
SELECT convert(varchar,getDate()-1,112)
Use this query:
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP()-1*24*60*60,'%Y%m%d');
It looks like DATE_SUB assumes date in format yyyy-MM-dd. So you might have to do some more format manipulation to get to your format. Try this:
select * from table1
where dt = FROM_UNIXTIME(
UNIX_TIMESTAMP(
DATE_SUB(
FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd')
, 1)
)
, 'yyyyMMdd') limit 10;
Use this:
select * from table1 where dt = date_format(concat(year(date_sub(current_timestamp,1)),'-', month(date_sub(current_timestamp,1)), '-', day(date_sub(current_timestamp,1))), 'yyyyMMdd') limit 10;
This will give a deterministic result (a string) of your partition.
I know it's super verbose.

What does PARTITION BY 1 mean?

For a pair of cursors where the total number of rows in the resultset is required immediately after the first FETCH, ( after some trial-and-error ) I came up with the query below
SELECT
col_a,
col_b,
col_c,
COUNT(*) OVER( PARTITION BY 1 ) AS rows_in_result
FROM
myTable JOIN theirTable ON
myTable.col_a = theirTable.col_z
GROUP BY
col_a, col_b, col_c
ORDER BY
col_b
Now when the output of the query is X rows, rows_in_result reflects this accurately.
What does PARTITION BY 1 mean?
I think it probably tells the database to partition the results into pieces of 1-row each
It is an unusual use of PARTITION BY. What it does is put everything into the same partition so that if the query returns 123 rows altogether, then the value of rows_in_result on each row will be 123 (as its alias implies).
It is therefore equivalent to the more concise:
COUNT(*) OVER ()
Databases are quite free to add restrictions to the OVER() clause. Sometimes, either PARTITION BY [...] and/or ORDER BY [...] are mandatory clauses, depending on the aggregate function. PARTITION BY 1 may just be a dummy clause used for syntax integrity. The following two are usually equivalent:
[aggregate function] OVER ()
[aggregate function] OVER (PARTITION BY 1)
Note, though, that Sybase SQL Anywhere and CUBRID interpret this 1 as being a column index reference, similar to what is possible in the ORDER BY [...] clause. This might appear to be a bit surprising as it imposes an evaluation order to the query's projection. In your case, this would then mean that the following are equivalent
COUNT(*) OVER (PARTITION BY 1)
COUNT(*) OVER (PARTITION BY col_a)
This curious deviation from other databases' interpretation allows for referencing more complex grouping expressions.

Resources