Hive sort operation on high volume skewed dataset - hadoop

I am working on a big dataset of around 3 TB on Hortonworks 2.6.5; the layout of the dataset is pretty straightforward.
The hierarchy of the data is as follows:
- Country
- Warehouse
- Product
- Product Type
- Product Serial Id
We have transaction data in the above hierarchy for 30 countries. Each country has more than 200 warehouses, and a single country, the USA, contributes around 75% of the entire dataset.
Problems:
1) We have transaction data with a transaction date column (trans_dt) for each warehouse in the above dataset. I need to sort by trans_dt in ascending order within each warehouse using Hive (version 1.1.2) on MapReduce. I have created a partition at the Country level and then applied DISTRIBUTE BY Warehouse SORT BY trans_dt ASC (see the sketch after this list). The sort takes around 8 hours to finish, and the last 6 hours are spent with the reducer stuck at the 99% stage. I see a lot of shuffling at this stage.
2) We do a lot of GROUP BY on the combination Country, Warehouse, Product, Product Type, Product Serial Id; any suggestion to optimize this operation would be very helpful.
3) How do we handle the skewed dataset for the USA?
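A minimal sketch of the pattern from (1), extended with a salt to spread the hot USA warehouses for (3). The table and column names (txn, warehouse, product_serial_id, trans_dt) are illustrative, as is the 16-way salt:
-- Each heavy warehouse is split into 16 buckets so it no longer pins a
-- single reducer; rows come out sorted by trans_dt within each
-- (warehouse, salt) bucket. A consumer that needs one globally sorted
-- stream per warehouse must merge the 16 sorted runs.
SELECT *
FROM txn
DISTRIBUTE BY warehouse, pmod(hash(product_serial_id), 16)
SORT BY warehouse, pmod(hash(product_serial_id), 16), trans_dt ASC;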
We are using the Hive properties below.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;
SET hive.groupby.skewindata=true;
SET hive.optimize.skewjoin.compiletime=true;
SET hive.optimize.skewjoin=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.exec.parallel=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.optimize.index.filter=true;
SET hive.optimize.ppd=true;
SET hive.mapjoin.smalltable.filesize=25000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
SET hive.optimize.groupby=true;
SET hive.groupby.orderby.position.alias=true;
SET hive.multigroupby.singlereducer=true;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=268435456;
SET hive.map.aggr=true;
SET hive.optimize.distinct.rewrite=true;
SET mapreduce.map.speculative=false;
SET hive.fetch.task.conversion=more;
SET hive.fetch.task.aggr=true;
SET hive.fetch.task.conversion.threshold=1024000000;

For US and non-US data, use the same query but process them independently:
SELECT * FROM Table WHERE Country = 'US'
UNION ALL
SELECT * FROM Table WHERE Country <> 'US'
(UNION ALL avoids a needless deduplication step, since the two branches are disjoint.)
Alternatively, you can process the data with a script that fires the query for one country at a time, reducing the volume of data that needs to be processed in any single run:
INSERT INTO TABLE <AggregateTable>
SELECT * FROM <SourceTable>
WHERE Country in ('${hiveconf:ProcessCountry}')
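For reference, that parameterized statement can live in its own HQL file and be driven once per country from the shell (a minimal sketch; the file name load_country.hql and the plain table names are hypothetical):
-- load_country.hql (hypothetical file name); invoke once per country, e.g.:
--   hive -hiveconf ProcessCountry=US -f load_country.hql
--   hive -hiveconf ProcessCountry=DE -f load_country.hql
INSERT INTO TABLE AggregateTable
SELECT *
FROM SourceTable
WHERE Country = '${hiveconf:ProcessCountry}';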

Related

Mappers stuck at 2% in simple Hive insert command

I am trying to run an INSERT command that inner joins two tables, one with 34,567,892 rows and the other with 6,754,289 rows. The issue is that the mappers do not progress after reaching 2%. I have tried various properties, like:
set tez.am.resource.memory.mb=16384;
set hive.tez.container.size=16384;
set hive.tez.java.opts=-Xms13107m;
but still no luck.
Can someone please help me figure out what to do?
After a lot of research, I found the following properties helpful; with them my query ran in 2-3 minutes:
set hive.auto.convert.join=false;
set hive.exec.parallel=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

Last few mappers with skewed dataset taking a long time on GROUP BY Hive MapReduce

I am running the simple GROUP BY query shown below on a dataset of 3.5 TB. I know there is skew in my dataset: skewed values of the "partno" column contribute 95% of the data. Because of this the entire job takes 9 hours to complete, and the last few mappers take the longest time of all.
Can you please help me optimize this to run efficiently? Basically I need help tackling skewed data in GROUP BY and joins.
select cntry, partno,
       percentile_approx(part_pr, 0.999) as part_pr_cutoff
from sourceTable
GROUP BY cntry, partno;
Below are the Hive properties I am using in the HQL files.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;
SET hive.groupby.skewindata=true;
SET hive.optimize.skewjoin.compiletime=true;
SET hive.optimize.skewjoin=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.exec.parallel=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.optimize.index.filter=true;
SET hive.optimize.ppd=true;
SET hive.mapjoin.smalltable.filesize=25000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
SET hive.optimize.groupby=true;
SET hive.groupby.orderby.position.alias=true;
SET hive.multigroupby.singlereducer=true;
SET hive.optimize.point.lookup=true;
SET hive.optimize.point.lookup.min=true;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=268435456;
SET hive.map.aggr=true;
SET hive.optimize.distinct.rewrite=true;
SET mapreduce.map.speculative=false;
SET hive.fetch.task.conversion=more;
SET hive.fetch.task.aggr=true;
SET hive.fetch.task.conversion.threshold=1024000000;
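For reference: besides hive.groupby.skewindata=true (already set above), a common manual pattern for skewed GROUP BY keys is a salted two-stage aggregation. Below is a minimal sketch with illustrative names and a COUNT(*) aggregate; note that percentile_approx cannot be recombined across salt buckets in plain SQL, so for the query above the built-in skew setting remains the more direct lever:
-- Stage 1: aggregate per (cntry, partno, salt) so a hot key is spread
-- over up to 32 reducers. Stage 2: collapse the salt and finish.
SELECT cntry, partno, SUM(cnt) AS total_cnt
FROM (
    SELECT cntry, partno, salt, COUNT(*) AS cnt
    FROM (
        SELECT cntry, partno, CAST(rand() * 32 AS INT) AS salt  -- 32-way salt, illustrative
        FROM sourceTable
    ) s
    GROUP BY cntry, partno, salt
) x
GROUP BY cntry, partno;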

Execute Hive Query with IN clause parameters in parallel

I have a Hive query like the one below:
select a.x as column from table1 a where a.y in (<long comma-separated list of parameters>)
union all
select b.x as column from table2 b where b.y in (<long comma-separated list of parameters>)
I have set hive.exec.parallel to true, which helps me achieve parallelism between the two queries on either side of the UNION ALL.
But my IN clause has many comma-separated values, and each value is processed in one job before the next value starts, so the work actually executes sequentially.
Is there any Hive parameter which, if enabled, can fetch data in parallel for the parameters in the IN clause?
Currently my workaround is to fire the SELECT query with = multiple times instead of one IN clause.
There is no need to read the same data many times in separate queries to achieve better parallelism; tune mapper and reducer parallelism properly instead.
First of all, enable PPD with vectorization, and use CBO and Tez:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.execution.engine=tez;
SET hive.stats.fetch.column.stats=true;
SET hive.tez.auto.reducer.parallelism=true;
Example settings for Mappers on Tez:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set tez.grouping.max-size=32000000;
set tez.grouping.min-size=32000;
Example settings for Mappers if you decide to run on MR instead of Tez:
set mapreduce.input.fileinputformat.split.minsize=32000;
set mapreduce.input.fileinputformat.split.maxsize=32000000;
--example settings for reducers:
set hive.exec.reducers.bytes.per.reducer=32000000; --decrease this for more reducers, increase it for fewer
Play with these settings. The success criterion: more mappers/reducers, with map and reduce stages that run faster.
Read this article for a better understanding of how to tune Tez: https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html

Hive : with multi insert query: FAILED: SemanticException Should not happened

I am using a multi-insert query for optimization purposes, and it surely helps me a lot, but on each day's run I find 3 to 4 ids (with counts above 10 million) taking too much time at the reducer. To fix this I implemented the skew-join optimization properties, but the query now throws
"FAILED: SemanticException Should not happened"
The properties I am using:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Kindly suggest how I can optimize this skewed data (the ids differ with each new run).
Change set hive.optimize.skewjoin=true; to set hive.optimize.skewjoin=false;. The runtime skew-join rewrite appears to conflict with multi-insert queries, which is what triggers this SemanticException.
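If the underlying join skew still hurts once the flag is off, one manual pattern is to split the hot ids into their own branch (a sketch with hypothetical table and column names; it assumes id is never NULL and a Hive version with CTE and IN-subquery support, 0.13+):
-- Find this run's hot ids, join them separately, and union with the tail.
WITH hot AS (
    SELECT id FROM fact GROUP BY id HAVING COUNT(*) > 10000000
)
SELECT f.*, d.attr
FROM fact f
JOIN dim d ON f.id = d.id
WHERE f.id IN (SELECT id FROM hot)
UNION ALL
SELECT f.*, d.attr
FROM fact f
JOIN dim d ON f.id = d.id
WHERE f.id NOT IN (SELECT id FROM hot);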

How to Tune Hive Insert overwrite partition?

I have written an INSERT OVERWRITE ... PARTITION statement in Hive to merge all the files in a partition into bigger files.
SQL:
SET hive.exec.compress.output=true;
set hive.merge.smallfiles.avgsize=2560000000;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles =true;
SET mapreduce.max.split.size=256000000;
SET mapreduce.min.split.size=256000000;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET mapreduce.output.fileoutputformat.compress.codec=${v_compression_codec};
INSERT OVERWRITE TABLE ${source_database}.${table_name} PARTITION (${line})
SELECT ${prepare_sel_columns}
FROM ${source_database}.${table_name}
WHERE ${partition_where_clause};
With the above settings I get compressed output, but it takes too long to generate the output files.
Even though it runs map-only jobs, it takes a lot of time.
I am looking for any further Hive-side settings to make the INSERT run faster.
Metrics:
15 GB of files ==> takes 10 min.
SET hive.exec.compress.output=true;
SET mapreduce.input.fileinputformat.split.minsize=512000000;
SET mapreduce.input.fileinputformat.split.maxsize=5120000000;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET mapreduce.output.fileoutputformat.compress.codec=${v_compression_codec};
The above settings helped a lot; the duration came down from 10 minutes to 1 minute. (Larger input splits mean fewer mappers, and therefore fewer, larger output files.)
