Hive multi-insert query: FAILED: SemanticException Should not happened - hadoop

I am using a multi-insert query for optimization, and it helps a lot, but on each daily run 3 to 4 ids (each with a count of more than 10 million rows) take too much time at the reducer. To fix this I enabled the skew-join optimization properties, but the query now throws:
"FAILED: SemanticException Should not happened"
Properties I am using:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.optimize.skewjoin=true;
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
set hive.skewjoin.key=100000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Kindly suggest how I can optimize this skewed data (the skewed ids are different on each new run).

Change set hive.optimize.skewjoin=true; to set hive.optimize.skewjoin=false; the runtime skew-join rewrite is what triggers this SemanticException in the multi-insert query.
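Since the runtime skew-join rewrite cannot be used here and the heavy ids change daily, a manual alternative is to split the skewed keys into their own branch of the query. A sketch, with hypothetical table, column, and id names:

```sql
-- Hypothetical names: fact/dim tables, id column, and the day's heavy ids.
-- Heavy ids go through a map-join; the rest take the normal shuffle join.
SELECT /*+ MAPJOIN(d) */ f.id, f.val, d.attr
FROM fact f JOIN dim d ON f.id = d.id
WHERE f.id IN ('heavy1', 'heavy2')
UNION ALL
SELECT f.id, f.val, d.attr
FROM fact f JOIN dim d ON f.id = d.id
WHERE f.id NOT IN ('heavy1', 'heavy2');
```

Because the skewed ids differ from run to run, the heavy-id list can be regenerated each day with a quick GROUP BY ... HAVING count(*) > threshold query before the main job.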

Related

Mappers stuck at 2% in a simple Hive insert command

I am trying to run an insert command that inner joins two tables, one with 34,567,892 rows and the other with 6,754,289 rows. The issue is that the mappers do not progress after reaching 2%. I have tried various properties:
set tez.am.resource.memory.mb=16384;
set hive.tez.container.size=16384;
set hive.tez.java.opts=-Xms13107m;
but still no luck.
Can someone please help me figure out what to do?
After a lot of research, I found the following properties helpful; with them my query ran in 2-3 minutes:
set hive.auto.convert.join = false;
set hive.exec.parallel=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
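For reference, a minimal sketch of how these settings sit in the script (table and column names here are hypothetical, not from the question):

```sql
-- Hypothetical table/column names; the key settings go before the INSERT.
set hive.auto.convert.join=false;  -- skip map-join conversion, which was stalling the mappers
set hive.exec.parallel=true;       -- run independent stages in parallel

INSERT INTO TABLE target_tbl
SELECT a.id, a.val, b.other_val
FROM big_tbl a
JOIN small_tbl b
  ON a.id = b.id;
```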

Hive sort operation on high volume skewed dataset

I am working on a big dataset, around 3 TB, on Hortonworks 2.6.5; the layout of the dataset is pretty straightforward.
The hierarchy of the data is as follows -
-Country
-Warehouse
-Product
-Product Type
-Product Serial Id
We have transaction data in the above hierarchy for 30 countries; each country has more than 200 warehouses, and a single country, USA, contributes around 75% of the entire dataset.
Problem:
1) We have transaction data with a transaction date column (trans_dt) for each warehouse in the above dataset. I need to sort by trans_dt in ascending order within each warehouse using Hive (version 1.1.2) on MapReduce. I have created a partition at the Country level and then applied DISTRIBUTE BY Warehouse SORT BY trans_dt ASC;. The sort takes around 8 hours to finish, and the last 6 hours are spent in the reducer at the 99% stage. I see a lot of shuffling at this stage.
2) We do a lot of GROUP BY on the combination Country, Warehouse, Product, Product Type, Product Serial Id; any suggestion to optimize this operation would be very helpful.
3) How do we handle the skewed dataset for the USA?
We are using below hive properties.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;
SET hive.groupby.skewindata=true;
SET hive.optimize.skewjoin.compiletime=true;
SET hive.optimize.skewjoin=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.exec.parallel=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.optimize.index.filter=true;
SET hive.optimize.ppd=true;
SET hive.mapjoin.smalltable.filesize=25000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
SET hive.optimize.groupby=true;
SET hive.groupby.orderby.position.alias=true;
SET hive.multigroupby.singlereducer=true;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=268435456;
SET hive.map.aggr=true;
SET hive.optimize.distinct.rewrite=true;
SET mapreduce.map.speculative=false;
set hive.fetch.task.conversion = more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1024000000;
For US and non-US data use the same query, but process the two sets independently:
Select * from Table where Country = 'US'
UNION ALL
Select * from Table where Country <> 'US'
OR
You can process them using a script that runs the query for one country at a time, reducing the volume of data that needs to be processed in one run.
INSERT INTO TABLE <AggregateTable>
SELECT * FROM <SourceTable>
WHERE Country in ('${hiveconf:ProcessCountry}')
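Applied to the warehouse-level sort in problem 1, the per-country approach might look like this sketch (table and column names are illustrative, following the question; the invocation in the comment supplies the country per run):

```sql
-- Run once per country, e.g.: hive -hiveconf ProcessCountry=US -f sort_wh.hql
-- so the USA run gets the cluster to itself instead of skewing a shared job.
-- Table/column names are illustrative, following the question's hierarchy.
INSERT OVERWRITE TABLE trans_sorted PARTITION (country='${hiveconf:ProcessCountry}')
SELECT warehouse, product, product_type, product_serial_id, trans_dt
FROM trans_raw
WHERE country = '${hiveconf:ProcessCountry}'
DISTRIBUTE BY warehouse
SORT BY trans_dt ASC;
```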

Last few mappers on a skewed dataset taking a long time in a Hive MapReduce group-by

I am running a simple GROUP BY query, shown below, on a dataset of about 3.5 TB. I know there is skew in my dataset: a few "partno" values contribute 95% of the rows, and because of this the entire job takes 9 hours to complete, with the last few mappers taking the longest time of all.
Can you please help me optimize this efficiently? Basically I need help tackling skewed data in both GROUP BY and JOIN operations.
select cntry, partno,
percentile_approx(part_pr, 0.999) as part_pr_cutoff
from sourceTable
GROUP BY cntry, partno;
Below are the Hive properties I am using in the HQL files.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;
SET hive.groupby.skewindata=true;
SET hive.optimize.skewjoin.compiletime=true;
SET hive.optimize.skewjoin=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.exec.parallel=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.optimize.index.filter=true;
SET hive.optimize.ppd=true;
SET hive.mapjoin.smalltable.filesize=25000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
SET hive.optimize.groupby=true;
SET hive.groupby.orderby.position.alias=true;
SET hive.multigroupby.singlereducer=true;
SET hive.optimize.point.lookup=true;
SET hive.optimize.point.lookup.min=31;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=268435456;
SET hive.map.aggr=true;
SET hive.optimize.distinct.rewrite=true;
SET mapreduce.map.speculative=false;
set hive.fetch.task.conversion = more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1024000000;
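Beyond properties, the split-and-union technique from the previous answer applies here too: isolate the heavy "partno" values so they do not funnel into the same reducers as everything else. A sketch, with the skewed key values as hypothetical placeholders:

```sql
-- 'heavy_part_1'/'heavy_part_2' are placeholders for the known skewed partno
-- values; each branch aggregates separately, then results are concatenated.
SELECT cntry, partno,
       percentile_approx(part_pr, 0.999) AS part_pr_cutoff
FROM sourceTable
WHERE partno IN ('heavy_part_1', 'heavy_part_2')
GROUP BY cntry, partno
UNION ALL
SELECT cntry, partno,
       percentile_approx(part_pr, 0.999) AS part_pr_cutoff
FROM sourceTable
WHERE partno NOT IN ('heavy_part_1', 'heavy_part_2')
GROUP BY cntry, partno;
```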

Hive query vertex failure in tez mode of execution

I'm trying to execute a Hive query --
Select a, b, c, d, e, f, cast(g as timestamp) - cast(f as timestamp) as runtime
from table ORDER BY d, e desc limit 100
It is failing with the error below:
TaskAttempt 1 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.lang. RuntimeException: Cannot find ExprNodeEvaluator for the exprNodeDesc null
I suspect this is because of the difference calculation between g and f (maybe some NULL values), but I am requesting expert answers to solve the issue, as I don't have access to the data. Thanks in advance.
I'm using the properties below:
set hive.execution.engine=tez;
set hive.exec.parallel=true;
set hive.auto.convert.join=false;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set mapreduce.map.memory.mb=9000;
set mapreduce.map.java.opts=-Xmx7200m;
set mapreduce.reduce.memory.mb=9000;
set mapreduce.reduce.java.opts=-Xmx7200m;
set hive.cbo.enable=true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
I'm running this from the Hive prompt on a UNIX server; the underlying table is actually a view containing some joins. On further research I found that we need to replace the ORDER BY. Unfortunately, DISTRIBUTE BY still needs a sort before the LIMIT, and this leads to the same issue. Can someone please suggest another way to rewrite the query?
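Without access to the data, one way to test the NULL hypothesis is to avoid subtracting casted timestamps altogether. A sketch, assuming f and g hold parseable timestamp strings (column names as in the question):

```sql
-- Compute runtime in seconds via unix_timestamp instead of timestamp
-- subtraction, and guard NULLs explicitly so the evaluator never sees
-- a NULL-typed expression.
SELECT a, b, c, d, e, f,
       CASE WHEN f IS NOT NULL AND g IS NOT NULL
            THEN unix_timestamp(g) - unix_timestamp(f)
       END AS runtime
FROM table
ORDER BY d, e DESC
LIMIT 100;
```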

How to Tune Hive Insert overwrite partition?

I have written an insert-overwrite-partition statement in Hive to merge all the files in a partition into bigger files.
SQL:
SET hive.exec.compress.output=true;
set hive.merge.smallfiles.avgsize=2560000000;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles =true;
SET mapreduce.max.split.size=256000000;
SET mapreduce.min.split.size=256000000;
SET mapreduce.output.fileoutputformat.compress.type =BLOCK;
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET mapreduce.output.fileoutputformat.compress.codec=${v_compression_codec};
INSERT OVERWRITE TABLE ${source_database}.${table_name} PARTITION (${line})
SELECT ${prepare_sel_columns}
FROM ${source_database}.${table_name}
WHERE ${partition_where_clause};
With the above settings I am getting compressed output, but the time it takes to generate the output files is too long.
Even though it runs map-only jobs, it takes much time.
I am looking for any further Hive-side settings to tune the INSERT to run faster.
Metrics:
15 GB of files ==> taking 10 min.
SET hive.exec.compress.output=true;
SET mapreduce.input.fileinputformat.split.minsize=512000000;
SET mapreduce.input.fileinputformat.split.maxsize=5120000000;
SET mapreduce.output.fileoutputformat.compress.type =BLOCK;
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET mapreduce.output.fileoutputformat.compress.codec=${v_compression_codec};
The above settings helped a lot; the duration came down from 10 minutes to 1 minute.
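The gain is mostly mapper-count arithmetic (a rough estimate, assuming splittable input that the combine input format can merge):

```sql
-- ~15 GB / 256 MB splits  ->  roughly 60 mappers
-- ~15 GB / 512 MB splits  ->  roughly 30 mappers, each doing more
--   sequential I/O with far fewer task startups, which dominates
--   a map-only merge job.
SET mapreduce.input.fileinputformat.split.minsize=512000000;   -- ~512 MB lower bound per split
SET mapreduce.input.fileinputformat.split.maxsize=5120000000;  -- ~5 GB upper bound per split
SET hive.hadoop.supports.splittable.combineinputformat=true;   -- pack small files into one split
```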
