I am running a simple query like the one shown below(similar form)
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
There is nothing wrong with query syntax wise.
TABLE2 IS EMPTY and the total size of TABLE1 is 2gb in HDFS(stored as parquet with snappy compression)
When I run the query in hive, I see that 17 map tasks and 0 reducer tasks are launched.
What I notice is that most of the map task complete in a minute.
But one of the map task takes long time. It's like all the data in the table is going to that map task.
The whole query fails eventually with container physical memory limit error.
Any reasons for why this is happening or might happen?
It may happen because some partition is bigger than others.
Try to trigger reducer task by adding distribute by
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
DISTRIBUTE BY COLUMN
Additionally you can add some other evenly distributed column with low cardinality to the DISTRIBUTE BY to increase parallelism:
DISTRIBUTE BY COLUMN, COLUMN2
If COLUMN2 has high cardinality, it will produce too many files in each partition, if column values are distributed not evenly (skewed) then it will result in skew in reducer, so it is important to use low-cardinality, evenly distributed column or deterministic function with the same properties like substr(), etc.
Alternatively also try to increase mapper parallelism and check if it helps: https://stackoverflow.com/a/48487306/2700344
Related
I was told that count(distinct ) may result in data skew because only one reducer is used.
I made a test using a table with 5 billion data with 2 queries,
Query A:
select count(distinct columnA) from tableA
Query B:
select count(columnA) from
(select columnA from tableA group by columnA) a
Actually, query A takes about 1000-1500 seconds while query B takes 500-900 seconds. The result seems expected.
However, I realize that both queries use 370 mappers and 1 reducers and thay have almost the same cumulative CPU seconds. And this means they do not have geneiune difference and the time difference may caused by cluster load.
I am confused why the all use one 1 reducers and I even tried mapreduce.job.reduces but it does not work. Btw, if they all use 1 reducers why do people suggest not to use count(distinct ) and it seems data skew is not avoidable?
Both queries are using the same number of mappers which is expected and single final reducer, which is also expected because you need single scalar count result. Multiple reducers on the same vertex are running independently, isolated and each will produce it's own output, this is why the last stage has single reducer. The difference is in the plan.
In the first query execution single reducer reads each mapper output and does distinct count calculation on all the data, it process too much data.
Second query is using intermediate aggrgation and final reducer receives partially aggregated data (distinct values aggregated on previous step). Final reducer needs to aggregate partial results again to get final result, it can be much less data than in the first case.
As of Hive 1.2.0 there is optimization for count(distinct) and you do not need to rewrite query. Set this property: hive.optimize.distinct.rewrite=true
Also there is mapper aggregation (mapper can pre-aggregate data also and produce distinct values in the scope of their portion of data - splits) Set this property to allow map-side aggregation: hive.map.aggr=true
use EXPLAIN command to check the difference in the execution plan.
See also this answer: https://stackoverflow.com/a/51492032/2700344
I'm having some difficulties to make sure I'm leveraging sorted data within a Hive table. (Using ORC file format)
I understand we can affect how the data is read from a Hive table, by declaring a DISTRIBUTE BY clause in the create DDL.
CREATE TABLE trades
(
trade_id INT,
name STRING,
contract_type STRING,
ts INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (trade_id) SORTED BY (trade_id, time) INTO 8 BUCKETS
STORED AS ORC;
This will mean that every time I make a query to this table, the data will be distributed by trade_id among the various mappers and afterward it will be sorted.
My question is:
I do not want the data to be split into N files (buckets), because the volume is not that much and I would stay with small files.
However, I do want to leverage sorted insertion.
INSERT OVERWRITE TABLE trades
PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY trade_id
SORT BY trade_id;
Do I really need to use CLUSTERED/SORT in the create DLL statement? Or does Hive/ORC knows how to leverage the fact that the insertion process already ensured that the data is sorted?
Could it make sense to do something like:
CLUSTERED BY (trade_id) SORTED BY (trade_id, time) INTO 1 BUCKETS
Bucketed table is an outdated concept.
You do not need to write CLUSTERED BY in table DDL.
When loading table use distribute by partition key to reduce pressure on reducers especially when writing ORC, which requires intermediate buffers for building ORC and if each reducer loads many partitions it may cause OOM exception.
When the table is big, you can limit the max file size using bytes.per.reducer like this:
set hive.exec.reducers.bytes.per.reducer=67108864;--or even less
If you have more data, more reducers will be started, more files created. This is more flexible than loading fixed number of buckets.
This will also work better because for small tables you do not need to create smaller buckets.
ORC has internal indexes and bloom filters. Applying SORT you can improve index and bloom filters efficiency because all similar data will be stored together. Also this can improve compression depending on your data enthropy.
If distribution by partition key is not enough because you have some data skew and the data is big, you can additionally distribute by random. It is better to distribute by column if you have evenly distributed data. If not, distribute by random to avoid single long running reducer problem.
Finally your insert statement may look loke this:
set hive.exec.reducers.bytes.per.reducer=33554432; --32Mb per reducer
INSERT OVERWRITE TABLE trades PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY dt, --partition key is a must for big data
trade_id, --some other key if the data is too big and key is
--evenly distributed (no skew)
FLOOR(RAND()*100.0)%20 --random to distribute additionally on 20 equal parts
SORT BY contract_type; --sort data if you want filtering by this key
--to work better using internal index
Do not use CLUSTERED BY in table DDL because using DISTRIBUTE BY, ORC w indexes and bloom filters + SORT during insert you can achieve the same in more flexible way.
Distribute + sort can reduce the size of ORC files extremely by x3 or x4 times. Similar data can be better compressed and makes internal indexes more efficient.
Read also this: https://stackoverflow.com/a/55375261/2700344
This is related answer about about sorting: https://stackoverflow.com/a/47416027/2700344
The only case when you can use CLUSTER BY in table DDL is when you joining two big tables which can be bucketed by exactly the same number of buckets to be able to use sort-merge-bucket-map-join, but practically it is so rare case when you can bucket two big tables in the same way. Having only 1 bucket makes no sense because for small tables you can use map-join, just sort the data during insert to reduce the compressed data size.
I am running a cross product operation and storing the results in a table. The number of rows in table1 and table2 is ~300K and ~15K respectively. The query is like
create table table3
as
select a.var1*b.var1+......+a.var_n.b.var_n as score
from
table1 a , table2b
I observed that the process is running fastest at 2000 to 3000 mappers as compared to much higher number of mappers allocated (5000).
My questions are :
Does increasing the number of mapper really speed up the process?
Is there any way to to figure out the optimal number of mapper for a process ?
I have table of with huge data like 100TB.
When I am querying the table I used distributed by clause on particular column (say x).
The table contains 200 distinct or unique values of X.
So When I queried the table with distributed by clause on X the maximum reducers should be 200. But I am seeing it is utilizing MAX reducers i.e. 999
Let me explain with example
Suppose the description of the emp_table is as fallows with 3 columns.
1.emp_name
2. emp_ID
3.Group_ID
and Group_ID has **200 distinct** values
Now I want to query the table
select * from emp_table distributed by Group_ID;
This Query should use 200 Reducers as per distributed clause. But I am seeing 999 reducers getting utilized.
I am doing it as part optimization. So how can I make sure it should be utilize 200 reducers?
The number of reducers in Hive is decided by either two properties.
hive.exec.reducers.bytes.per.reducer - The default value is 1GB, This makes hive to create one reducer for each 1GB of input table's size.
mapred.reduce.tasks - takes an intger value, and those many reducers will be prepared for the job.
The distribute by clause doesn't have any role in deciding the number of reducers, all its work is to distribute/partition the key value from mappers to prepared reducers based on the column given in the clause.
Consider setting the mapred.reduce.tasks as 200, and the distribute by will take care of partitioning the key values to the 200 reducers in even manner.
The reduce number of hive depends on the size of your input file.But if the output of the mapper contains only 200 groups.Then I guess most of the reduce job will recieve nothing.
If you really want to control the reduce number.set mapred.reduce.tasks will help.
We have a pig join between a small (16M rows) distinct table and a big (6B rows) skewed table.
A regular join finishes in 2 hours (after some tweaking). We tried using skewed and been able to improve the performance to 20 minutes.
HOWEVER, when we try a bigger skewed table (19B rows), we get this message from the SAMPLER job:
Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner]
at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:817) [ScriptRunner]
This is reproducible every time we try using skewed, and does not happen when we use the regular join.
we tried setting mapreduce.jobtracker.split.metainfo.maxsize=-1 and we can see it's there in the job.xml file, but it doesn't change anything!
What's happening here? Is this a bug with the distribution sample created by using skewed? Why doesn't it help changing the param to -1?
Small table of 1MB is small enough to fit into memory, try replicated join.
Replicated join is Map only, does not cause Reduce stage as other types of join, thus is immune to the skew in the join keys. It should be quick.
big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
Big table is always the first one in the statement.
UPDATE 1:
If small table in its original form does not fit into memory,than as a work around you would need to partition your small table into partitions that are small enough to fit into memory and than apply the same partitioning to the big table, hopefully you could add the same partitioning algorithm to the system which creates big table, so that you do not waste time repartitioning it.
After partitioning, you can use replicated join, but it will require running pig script for each partition separately.
In newer versions of Hadoop (>=2.4.0 but maybe even earlier) you should be able to set the maximum split size at the job level by using the following configuration property:
mapreduce.job.split.metainfo.maxsize=-1