Hive multi-insert query: FAILED: SemanticException Should not happened - hadoop

I am using a multi-insert query for optimization, and it helps a lot, but on each daily run 3 to 4 ids (each with a count of more than 10 million rows) take too much time at the reducer. To fix this I enabled the skew-join optimization properties, but the query now throws:
"FAILED: SemanticException Should not happened"
Properties I am using:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.optimize.skewjoin=true;
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
set hive.skewjoin.key=100000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Kindly suggest how I can optimize this skewed data (the skewed ids are different on each new run).

Change set hive.optimize.skewjoin=true; to set hive.optimize.skewjoin=false; the runtime skew-join rewrite is what triggers this SemanticException in the multi-insert query.
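Since the runtime skew-join rewrite cannot be used here and the heavy ids change daily, a manual alternative is to split the skewed keys into their own branch of the query. A sketch, with hypothetical table, column, and id names:

```sql
-- Hypothetical names: fact/dim tables, id column, and the day's heavy ids.
-- Heavy ids go through a map-join; the rest take the normal shuffle join.
SELECT /*+ MAPJOIN(d) */ f.id, f.val, d.attr
FROM fact f JOIN dim d ON f.id = d.id
WHERE f.id IN ('heavy1', 'heavy2')
UNION ALL
SELECT f.id, f.val, d.attr
FROM fact f JOIN dim d ON f.id = d.id
WHERE f.id NOT IN ('heavy1', 'heavy2');
```

Because the skewed ids differ from run to run, the heavy-id list can be regenerated each day with a quick GROUP BY ... HAVING count(*) > threshold query before the main job.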

Related

Mappers stuck at 2% in a simple Hive insert command

I am trying to run an insert command that inner joins two tables, one with 34,567,892 rows and the other with 6,754,289 rows. The issue is that the mappers do not progress after reaching 2%. I have tried various properties:
set tez.am.resource.memory.mb=16384;
set hive.tez.container.size=16384;
set hive.tez.java.opts=-Xms13107m;
but still no luck.
Can someone please help me figure out what to do?
After a lot of research, I found the following properties helpful; with them my query ran in 2-3 minutes:
set hive.auto.convert.join = false;
set hive.exec.parallel=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
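For reference, a minimal sketch of how these settings sit in the script (table and column names here are hypothetical, not from the question):

```sql
-- Hypothetical table/column names; the key settings go before the INSERT.
set hive.auto.convert.join=false;  -- skip map-join conversion, which was stalling the mappers
set hive.exec.parallel=true;       -- run independent stages in parallel

INSERT INTO TABLE target_tbl
SELECT a.id, a.val, b.other_val
FROM big_tbl a
JOIN small_tbl b
  ON a.id = b.id;
```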

Hive sort operation on high volume skewed dataset

I am working on a big dataset, around 3 TB, on Hortonworks 2.6.5; the layout of the dataset is pretty straightforward.
The hierarchy of the data is as follows -
-Country
-Warehouse
-Product
-Product Type
-Product Serial Id
We have transaction data in the above hierarchy for 30 countries; each country has more than 200 warehouses, and a single country, USA, contributes around 75% of the entire dataset.
Problem:
1) We have transaction data with a transaction date column (trans_dt) for each warehouse in the above dataset. I need to sort by trans_dt in ascending order within each warehouse using Hive (version 1.1.2) on MapReduce. I have created a partition at the Country level and then applied DISTRIBUTE BY Warehouse SORT BY trans_dt ASC;. The sort takes around 8 hours to finish, and the last 6 hours are spent in the reducer at the 99% stage. I see a lot of shuffling at this stage.
2) We do a lot of GROUP BY on the combination Country, Warehouse, Product, Product Type, Product Serial Id; any suggestion to optimize this operation would be very helpful.
3) How do we handle the skewed dataset for the USA?
We are using below hive properties.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;
SET hive.groupby.skewindata=true;
SET hive.optimize.skewjoin.compiletime=true;
SET hive.optimize.skewjoin=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.exec.parallel=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.optimize.index.filter=true;
SET hive.optimize.ppd=true;
SET hive.mapjoin.smalltable.filesize=25000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
SET hive.optimize.groupby=true;
SET hive.groupby.orderby.position.alias=true;
SET hive.multigroupby.singlereducer=true;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=268435456;
SET hive.map.aggr=true;
SET hive.optimize.distinct.rewrite=true;
SET mapreduce.map.speculative=false;
set hive.fetch.task.conversion = more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1024000000;
For US and non-US data use the same query, but process the two sets independently:
Select * from Table where Country = 'US'
UNION ALL
Select * from Table where Country <> 'US'
OR
You can process them using a script that runs the query for one country at a time, reducing the volume of data that needs to be processed in one run.
INSERT INTO TABLE <AggregateTable>
SELECT * FROM <SourceTable>
WHERE Country in ('${hiveconf:ProcessCountry}')
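Applied to the warehouse-level sort in problem 1, the per-country approach might look like this sketch (table and column names are illustrative, following the question; the invocation in the comment supplies the country per run):

```sql
-- Run once per country, e.g.: hive -hiveconf ProcessCountry=US -f sort_wh.hql
-- so the USA run gets the cluster to itself instead of skewing a shared job.
-- Table/column names are illustrative, following the question's hierarchy.
INSERT OVERWRITE TABLE trans_sorted PARTITION (country='${hiveconf:ProcessCountry}')
SELECT warehouse, product, product_type, product_serial_id, trans_dt
FROM trans_raw
WHERE country = '${hiveconf:ProcessCountry}'
DISTRIBUTE BY warehouse
SORT BY trans_dt ASC;
```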

Last few mappers on a skewed dataset taking a long time in a Hive MapReduce group-by

I am running a simple GROUP BY query, shown below, on a dataset of about 3.5 TB. I know there is skew in my dataset: a few "partno" values contribute 95% of the rows, and because of this the entire job takes 9 hours to complete, with the last few mappers taking the longest time of all.
Can you please help me optimize this efficiently? Basically I need help tackling skewed data in both GROUP BY and JOIN operations.
select cntry, partno,
percentile_approx(part_pr, 0.999) as part_pr_cutoff
from sourceTable
GROUP BY cntry, partno;
Below are the Hive properties I am using in the HQL files.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;
SET hive.groupby.skewindata=true;
SET hive.optimize.skewjoin.compiletime=true;
SET hive.optimize.skewjoin=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.exec.parallel=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.optimize.index.filter=true;
SET hive.optimize.ppd=true;
SET hive.mapjoin.smalltable.filesize=25000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
SET hive.optimize.groupby=true;
SET hive.groupby.orderby.position.alias=true;
SET hive.multigroupby.singlereducer=true;
SET hive.optimize.point.lookup=true;
SET hive.optimize.point.lookup.min=31;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=268435456;
SET hive.map.aggr=true;
SET hive.optimize.distinct.rewrite=true;
SET mapreduce.map.speculative=false;
set hive.fetch.task.conversion = more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1024000000;
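Beyond properties, the split-and-union technique from the previous answer applies here too: isolate the heavy "partno" values so they do not funnel into the same reducers as everything else. A sketch, with the skewed key values as hypothetical placeholders:

```sql
-- 'heavy_part_1'/'heavy_part_2' are placeholders for the known skewed partno
-- values; each branch aggregates separately, then results are concatenated.
SELECT cntry, partno,
       percentile_approx(part_pr, 0.999) AS part_pr_cutoff
FROM sourceTable
WHERE partno IN ('heavy_part_1', 'heavy_part_2')
GROUP BY cntry, partno
UNION ALL
SELECT cntry, partno,
       percentile_approx(part_pr, 0.999) AS part_pr_cutoff
FROM sourceTable
WHERE partno NOT IN ('heavy_part_1', 'heavy_part_2')
GROUP BY cntry, partno;
```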

Hive query vertex failure in tez mode of execution

I'm trying to execute a Hive query --
Select a, b, c, d, e, f, cast(g as timestamp) - cast(f as timestamp) as runtime
from table ORDER BY d, e desc limit 100
It is failing with the error below:
TaskAttempt 1 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.lang. RuntimeException: Cannot find ExprNodeEvaluator for the exprNodeDesc null
I suspect this is because of the difference calculation between g and f (maybe some NULL values), but I am requesting expert answers to solve the issue, as I don't have access to the data. Thanks in advance.
I'm using the properties below:
set hive.execution.engine=tez;
set hive.exec.parallel=true;
set hive.auto.convert.join=false;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set mapreduce.map.memory.mb=9000;
set mapreduce.map.java.opts=-Xmx7200m;
set mapreduce.reduce.memory.mb=9000;
set mapreduce.reduce.java.opts=-Xmx7200m;
set hive.cbo.enable=true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
I'm running this from the Hive prompt on a UNIX server; the underlying table is actually a view containing some joins. On further research I found that we need to replace the ORDER BY. Unfortunately, DISTRIBUTE BY still needs a sort before the LIMIT, and this leads to the same issue. Can someone please suggest another way to rewrite the query?
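Without access to the data, one way to test the NULL hypothesis is to avoid subtracting casted timestamps altogether. A sketch, assuming f and g hold parseable timestamp strings (column names as in the question):

```sql
-- Compute runtime in seconds via unix_timestamp instead of timestamp
-- subtraction, and guard NULLs explicitly so the evaluator never sees
-- a NULL-typed expression.
SELECT a, b, c, d, e, f,
       CASE WHEN f IS NOT NULL AND g IS NOT NULL
            THEN unix_timestamp(g) - unix_timestamp(f)
       END AS runtime
FROM table
ORDER BY d, e DESC
LIMIT 100;
```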

How to Tune Hive Insert overwrite partition?

I have written an insert-overwrite-partition statement in Hive to merge all the files in a partition into bigger files.
SQL:
SET hive.exec.compress.output=true;
set hive.merge.smallfiles.avgsize=2560000000;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles =true;
SET mapreduce.max.split.size=256000000;
SET mapreduce.min.split.size=256000000;
SET mapreduce.output.fileoutputformat.compress.type =BLOCK;
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET mapreduce.output.fileoutputformat.compress.codec=${v_compression_codec};
INSERT OVERWRITE TABLE ${source_database}.${table_name} PARTITION (${line})
SELECT ${prepare_sel_columns}
FROM ${source_database}.${table_name}
WHERE ${partition_where_clause};
With the above settings I am getting compressed output, but the time it takes to generate the output files is too long.
Even though it runs map-only jobs, it takes much time.
I am looking for any further Hive-side settings to tune the INSERT to run faster.
Metrics:
15 GB of files ==> taking 10 min.
SET hive.exec.compress.output=true;
SET mapreduce.input.fileinputformat.split.minsize=512000000;
SET mapreduce.input.fileinputformat.split.maxsize=5120000000;
SET mapreduce.output.fileoutputformat.compress.type =BLOCK;
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET mapreduce.output.fileoutputformat.compress.codec=${v_compression_codec};
The above settings helped a lot; the duration came down from 10 minutes to 1 minute.
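The gain is mostly mapper-count arithmetic (a rough estimate, assuming splittable input that the combine input format can merge):

```sql
-- ~15 GB / 256 MB splits  ->  roughly 60 mappers
-- ~15 GB / 512 MB splits  ->  roughly 30 mappers, each doing more
--   sequential I/O with far fewer task startups, which dominates
--   a map-only merge job.
SET mapreduce.input.fileinputformat.split.minsize=512000000;   -- ~512 MB lower bound per split
SET mapreduce.input.fileinputformat.split.maxsize=5120000000;  -- ~5 GB upper bound per split
SET hive.hadoop.supports.splittable.combineinputformat=true;   -- pack small files into one split
```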
