Optimize join performance with a Hive partitioned table

I have a Hive ORC table test_dev_db.TransactionUpdateTable with some sample data. It holds the incremental data that needs to be merged into the main table (test_dev_db.TransactionMainHistoryTable), which is partitioned on the columns Country and Tran_date.
Hive incremental load table schema (it holds 19 rows that need to be merged):
CREATE TABLE IF NOT EXISTS test_dev_db.TransactionUpdateTable
(
Transaction_date timestamp,
Product string,
Price int,
Payment_Type string,
Name string,
City string,
State string,
Country string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
Hive main table schema (total row count: 77):
CREATE TABLE IF NOT EXISTS test_dev_db.TransactionMainHistoryTable
(
Transaction_date timestamp,
Product string,
Price int,
Payment_Type string,
Name string,
City string,
State string
)
PARTITIONED BY (Country string,Tran_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
I am running the query below to merge the incremental data into the main table.
SELECT
case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)
else t.transaction_date end as transaction_date,
t.product,
case when i.price is not null then i.price else t.price end as price,
t.payment_type,
t.name,
t.city,
t.state,
t.country,
case when i.transaction_date is not null then substring(current_timestamp(),0,10)
else t.tran_date end as tran_date
from
test_dev_db.TransactionMainHistoryTable t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;
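(For context: the merged output is written back to the main table with a dynamic-partition overwrite. A minimal sketch of that step, assuming the usual dynamic-partition settings; the INSERT wrapper itself is not shown in the post:)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- overwrites only the partitions produced by the select;
-- the partition columns (country, tran_date) come last in the select list
INSERT OVERWRITE TABLE test_dev_db.TransactionMainHistoryTable PARTITION (Country, Tran_date)
SELECT
case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)
else t.transaction_date end,
t.product,
case when i.price is not null then i.price else t.price end,
t.payment_type, t.name, t.city, t.state,
t.country,
case when i.transaction_date is not null then substring(current_timestamp(),0,10)
else t.tran_date end
from test_dev_db.TransactionMainHistoryTable t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name);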
This rewrites every partition path of the main table, for example:
/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-03-01
/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-05-01
And I am running the query below to filter out only the specific partitions that need to be merged, to avoid rewriting the partitions that have no updates:
SELECT
case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)
else t.transaction_date end as transaction_date,
t.product,
case when i.price is not null then i.price else t.price end as price,
t.payment_type,
t.name,
t.city,
t.state,
t.country,
case when i.transaction_date is not null then substring(current_timestamp(),0,10) else t.tran_date end as tran_date
from
(SELECT
*
FROM
test_dev_db.TransactionMainHistoryTable
where Tran_date in
(select distinct from_unixtime(to_unix_timestamp (Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd') from test_dev_db.TransactionUpdateTable
))t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;
Only Transaction_date, Price, and the partition column tran_date need to be updated in both cases. Both queries run fine, though the latter takes longer to execute.
The execution plan for the partitioned table is:
Stage: Stage-5
Map Reduce
Map Operator Tree:
TableScan
alias: transactionmainhistorytable
filterExpr: tran_date is not null (type: boolean)
Statistics: Num rows: 77 Data size: 39151 Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Left Semi Join 0 to 1
keys:
0 tran_date (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
Am I doing something wrong with the second query? Do I need to use both partition columns for better pruning?
Any help or advice is greatly appreciated.

Maybe this is not a complete answer but I hope these thoughts will be useful.
where tran_date IN (select ... )
is actually the same as
LEFT SEMI JOIN (SELECT ...)
And this is reflected in the plan:
Map Join Operator
condition map:
Left Semi Join 0 to 1
keys:
0 tran_date (type: string)
1 _col0 (type: string)
And it is executed as a map-join: first the subquery dataset is selected, then it is placed in the distributed cache and loaded into memory to be used in the map-join. All of these steps (select, load into memory, map-join) are slower than simply reading and overwriting the whole table, because the table is so small and over-partitioned: the statistics say Num rows: 77 Data size: 39151, which is too small to be partitioned by two columns, and even too small to be partitioned at all. Try a bigger table, and use EXPLAIN EXTENDED to check what is really being scanned.
Also, replace this:
from_unixtime(to_unix_timestamp(Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd')
with substr(Transaction_date, 0, 10) or date(Transaction_date),
and replace substring(current_timestamp, 0, 10) with current_date, just to simplify the code a bit.
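In other words (a sketch, on Hive 1.2+ where current_date is available; cast to string if your version complains about mixed types in the CASE):
-- simplified pruning subquery: the first 10 characters of the timestamp are the date
select distinct substr(Transaction_date, 0, 10)
from test_dev_db.TransactionUpdateTable;

-- simplified column expression
case when i.transaction_date is not null then cast(current_date as string)
else t.tran_date end as tran_date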
If you want a partition filter displayed in the plan, try substituting the partition filter with an explicit list of partitions: select the list in a separate session and use the shell to pass it into the where clause. See this answer: https://stackoverflow.com/a/56963448/2700344
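A sketch of that shell approach, using the tables from the question (the exact quoting may need adjusting to your shell):
# step 1: collect the affected dates as a quoted, comma-separated list
dates=$(hive -S -e "select concat_ws(',', collect_set(concat('\'', substr(Transaction_date,0,10), '\''))) from test_dev_db.TransactionUpdateTable")

# step 2: pass the static list into the where clause so partitions are pruned at compile time
hive -e "select count(*) from test_dev_db.TransactionMainHistoryTable where Tran_date in ($dates);"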

Related

Hadoop and hive optimisation

I need help with the following scenario:
1) The Memo table is the source table in Hive.
It has 5493656359 records. Its description is as follows:
load_ts timestamp
memo_ban bigint
memo_id bigint
sys_creation_date timestamp
sys_update_date timestamp
operator_id bigint
application_id varchar(6)
dl_service_code varchar(5)
dl_update_stamp bigint
memo_date timestamp
memo_type varchar(4)
memo_subscriber varchar(20)
memo_system_txt varchar(180)
memo_manual_txt varchar(2000)
memo_source varchar(1)
data_dt string
market_cd string
Partition information:
data_dt string
market_cd string
2)
This is the target table:
CREATE TABLE IF NOT EXISTS memo_temprushi (
load_ts TIMESTAMP,
ban BIGINT,
attuid BIGINT,
application VARCHAR(6),
system_note INT,
user_note INT,
note_time INT,
date TIMESTAMP)
PARTITIONED BY (data_dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
3)
This is the initial load statement from the source table Memo into the
target table memo_temprushi. It loads all records up to date 2015-12-14:
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT INTO TABLE memo_temprushi PARTITION (DATA_DT)
SELECT LOAD_TS,MEMO_BAN, OPERATOR_ID, APPLICATION_ID,
CASE WHEN LENGTH(MEMO_SYSTEM_TXT)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(MEMO_MANUAL_TXT)=0 THEN 0 ELSE 1 END,
HOUR(MEMO_DATE), MEMO_DATE, DATA_DT
FROM tlgmob_gold.MEMO WHERE LOAD_TS < DATE('2015-12-15');
4)
For the incremental load I want to insert the rest of the records, i.e. from date 2015-12-15 onward. I'm using the following query:
INSERT INTO TABLE memo_temprushi PARTITION (DATA_DT)
SELECT MS.LOAD_TS,MS.MEMO_BAN, MS.OPERATOR_ID, MS.APPLICATION_ID,
CASE WHEN LENGTH(MS.MEMO_SYSTEM_TXT)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(MS.MEMO_MANUAL_TXT)=0 THEN 0 ELSE 1 END,
HOUR(MS.MEMO_DATE), MS.MEMO_DATE, MS.DATA_DT
FROM tlgmob_gold.MEMO MS JOIN (select max(load_ts) max_load_ts from memo_temprushi) mt
ON 1=1
WHERE
ms.load_ts > mt.max_load_ts;
It launches 2 jobs. Initially it gives a warning that the stage is a cross product.
The first job completes quite soon, but the second job remains stuck at reduce 33%.
The log shows: [EventFetcher for fetching Map Completion Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: EventFetcher is interrupted.. Returning
It shows that the number of reducers is 1.
I am trying to increase the number of reducers with set mapreduce.job.reduces, but it's not working.
Thanks
You can try this.
Run select max(load_ts) max_load_ts from memo_temprushi separately, add the resulting value to the where condition of the query, and remove the join.
If that works, you can develop a shell script in which the first query gets the max value and then the second query runs without the join.
Here is a sample shell script:
max_date=$(hive -e "select max(order_date) from orders" 2>/dev/null)
hive -e "select order_date from orders where order_date >= date_sub('$max_date', 7);"

HIVE: Empty buckets getting created after partitioning in HDFS

I was trying to create partitions and buckets using Hive.
First I set some of the properties:
set hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Below is the code for creating the table:
CREATE TABLE transactions_production
( id string,
dept string,
category string,
company string,
brand string,
date1 string,
productsize int,
productmeasure string,
purchasequantity int,
purchaseamount double)
PARTITIONED BY (chain string) clustered by(id) into 5 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Below is the code for inserting data into the table:
INSERT OVERWRITE TABLE transactions_production PARTITION (chain)
select id, dept, category, company, brand, date1, productsize, productmeasure,
purchasequantity, purchaseamount, chain from transactions_staging;
What went wrong:
Partitions and buckets are getting created in HDFS, but the data is present only in the 1st bucket of every partition; all the remaining buckets are empty.
Please let me know what I did wrong and how to resolve this issue.
When using bucketing, Hive computes a hash of the clustered-by value (here you use id) and splits the table into that many flat files inside each partition.
Because the table is split up by the hashes of the ids, the size of each split depends on the values in your table.
If no values map to buckets other than the first one, all of those other flat files will be empty.
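You can check the distribution yourself: Hive picks the bucket roughly as pmod(hash(value), num_buckets), so a query like this against the staging table from the question shows how many rows would land in each of the 5 buckets:
SELECT pmod(hash(id), 5) AS bucket, count(*) AS rows_in_bucket
FROM transactions_staging
GROUP BY pmod(hash(id), 5);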

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '\t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforce.bucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, min, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing for my partitioned table. If I remove "clustered by (year) into 6 buckets" the script works fine. How do I bucket the partitioned table?
There is an important thing we should consider while doing bucketing in Hive:
the same column cannot be used for both bucketing and partitioning. The reason is as follows:
Clustering and sorting happen within a partition. Inside each partition there is only one value of the partition column (in your case, year), so clustering and sorting on it can have no effect. That is the reason for your error.
You can use the below syntax to create bucketing table with partition.
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
When you're doing dynamic partitioning, create a temporary table with all the columns (including your partition column) and load the data into that temporary table.
Then create the actual partitioned table with the partition column. When you load data from the temporary table, the partition column must come last in the select clause, as in the sketch below.
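For example, loading bckt_movies from an assumed staging table tmp_movies with the same columns (a sketch):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing=true;

-- rel_year, the partition column, comes last in the select
INSERT OVERWRITE TABLE bckt_movies PARTITION (rel_year)
SELECT mov_id, mov_name, prod_studio, col_world, col_us_canada, col_uk, col_aus, rel_year
FROM tmp_movies;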

Hive static partitions issue

I have a csv file which has 600 records, 300 each for male and female.
I have created Temp_Table and filled all these records into that table. Then I created Table_Main with gender as the partition column.
For Temp_Table the query is:
Create table if not exists Temp_Table
(id string, age int, gender string, city string, pin string)
row format delimited
fields terminated by ',';
Then I run the below query:
Insert into Table_Main
partition (gender)
select a,b,c,d,gender from Temp_Table;
Problem: I am getting a single file at /user/hive/warehouse/mydb.db/Table_Main/gender=Male/000000_0
In this file I am getting all 600 records. I am not sure what's happening, but what I expected is 300 records in this file (only Male).
Q1. Where am I mistaken?
Q2. Should I not get one more folder for all the other values (which are not in the static partition)? If not, what will happen to those?
With a static partition we need to specify a where condition while inserting data into the partitioned table (which I had not done).
Alternatively, we can use dynamic partitioning without a where condition, as in the sketch below.
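A sketch of both variants, assuming Table_Main has the columns (id, age, city, pin) from Temp_Table:
-- static partition: the value is fixed, so filter the rows yourself
INSERT INTO TABLE Table_Main PARTITION (gender='Male')
SELECT id, age, city, pin FROM Temp_Table WHERE gender='Male';

-- dynamic partition: Hive routes each row using the last select column
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE Table_Main PARTITION (gender)
SELECT id, age, city, pin, gender FROM Temp_Table;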

Hive partition columns seem to prevent "select distinct"

I have created a table in Hive like this:
CREATE TABLE application_path
(userId STRING, sessId BIGINT, accesstime BIGINT, actionId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Running this query on the table:
SELECT DISTINCT userId FROM application_path;
gives the expected result:
user1#domain.com
user2#domain.com
user3#domain.com
...
Then I've changed the declaration to add a partition:
CREATE TABLE application_path
(sessId BIGINT, accesstime BIGINT, actionId STRING)
PARTITIONED BY(userId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Now the query SELECT DISTINCT userId... runs for seconds as before, but eventually returns nothing.
I've just noticed the syntax:
SHOW PARTITIONS application_path;
but I was wondering if that's the only way to get the unique (distinct) values of a partitioning column. The output of SHOW PARTITIONS is not even an exact replacement for SELECT DISTINCT, since the column name is prefixed to each row:
hive> show partitions application_path;
OK
userid=user1#domain.com
userid=user2#domain.com
userid=user3#domain.com
...
What's strange to me is that userId can be used in a GROUP BY with other columns, as in:
SELECT userId, sessId FROM application_path GROUP BY userId, sessId;
but doesn't return anything in:
SELECT userId FROM application_path GROUP BY userId;
I experienced the same issue; it will be fixed in Hive 0.10:
https://issues.apache.org/jira/browse/HIVE-2955
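Until then, one workaround is to read the values from the metastore with SHOW PARTITIONS and strip the prefix in the shell, e.g. (a sketch, using the output format shown in the question):
hive -S -e "SHOW PARTITIONS application_path" | sed 's/^userid=//'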
