Hive number of reducers in group by and count(distinct) - hadoop

I was told that count(distinct) may result in data skew because only one reducer is used.
I made a test on a table with 5 billion rows, using 2 queries:
Query A:
select count(distinct columnA) from tableA
Query B:
select count(columnA) from
(select columnA from tableA group by columnA) a
Actually, query A takes about 1000-1500 seconds while query B takes 500-900 seconds. The result seemed as expected.
However, I realized that both queries use 370 mappers and 1 reducer, and they have almost the same cumulative CPU seconds. This suggests there is no genuine difference between them, and the time difference may be caused by cluster load.
I am confused why they both use 1 reducer; I even tried mapreduce.job.reduces, but it does not work. By the way, if they both use 1 reducer, why do people suggest not to use count(distinct)? It seems the data skew is unavoidable either way.

Both queries use the same number of mappers, which is expected, and a single final reducer, which is also expected because you need a single scalar count as the result. Reducers on the same vertex run independently and in isolation, each producing its own output, which is why the last stage must have a single reducer. The difference is in the plan.
In the first query's execution, the single reducer reads every mapper's output and does the distinct count calculation over all of the data; it processes far too much data.
The second query uses intermediate aggregation, so the final reducer receives partially aggregated data (distinct values were already aggregated in the previous step). The final reducer only needs to aggregate the partial results again to get the final result, which can be much less data than in the first case.
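As a toy illustration of the difference between the two plans (plain Python with made-up data, not Hive code):

```python
# Toy simulation of the two execution plans: 3 "mappers", each
# holding a slice of columnA values.
from itertools import chain

splits = [
    [1, 2, 2, 3, 3, 3],   # mapper 1 output
    [2, 3, 4, 4, 4, 4],   # mapper 2 output
    [1, 1, 5, 5, 5, 5],   # mapper 3 output
]

# Plan A: the single final reducer receives every row and
# deduplicates all of them itself.
rows_to_final_reducer_a = list(chain.from_iterable(splits))
count_a = len(set(rows_to_final_reducer_a))

# Plan B: each intermediate task first emits only its own distinct
# values; the final reducer merges the much smaller partial sets.
partials = [set(s) for s in splits]
rows_to_final_reducer_b = list(chain.from_iterable(partials))
count_b = len(set(rows_to_final_reducer_b))

print(count_a, count_b)               # same answer: 5 5
print(len(rows_to_final_reducer_a))   # 18 rows shipped in plan A
print(len(rows_to_final_reducer_b))   # 8 rows shipped in plan B
```

Both plans produce the same count; the difference is only in how many rows the final single reducer has to touch.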
As of Hive 1.2.0 there is an optimization for count(distinct) and you do not need to rewrite the query. Set this property: hive.optimize.distinct.rewrite=true
There is also map-side aggregation (a mapper can pre-aggregate data and produce distinct values within its own portion of the data - its split). Set this property to allow map-side aggregation: hive.map.aggr=true
Use the EXPLAIN command to check the difference in the execution plans.
See also this answer: https://stackoverflow.com/a/51492032/2700344

Related

Single map task taking long time and failing in hive map reduce

I am running a simple query like the one shown below (similar form):
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
There is nothing wrong with the query, syntax-wise.
TABLE2 is empty, and the total size of TABLE1 is 2 GB in HDFS (stored as Parquet with Snappy compression).
When I run the query in Hive, I see that 17 map tasks and 0 reduce tasks are launched.
What I notice is that most of the map tasks complete within a minute.
But one map task takes a long time; it's as if all the data in the table is going to that one task.
The whole query eventually fails with a container physical memory limit error.
Any reasons why this is happening, or might happen?
It may happen because some partition is bigger than the others.
Try to trigger a reducer task by adding distribute by:
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
DISTRIBUTE BY COLUMN
Additionally, you can add some other evenly distributed column with low cardinality to the DISTRIBUTE BY to increase parallelism:
DISTRIBUTE BY COLUMN, COLUMN2
If COLUMN2 has high cardinality, it will produce too many files in each partition; if its values are distributed unevenly (skewed), the skew will reappear in a reducer. So it is important to use a low-cardinality, evenly distributed column, or a deterministic function with the same properties, such as substr(), etc.
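The effect of adding a second distribution column can be sketched in plain Python (toy data; crc32 stands in for Hadoop's key hashCode, and the combined key is a simplification):

```python
# Toy sketch of why DISTRIBUTE BY COLUMN alone can still skew
# reducers: a row goes to reducer hash(key) % n_reducers, so one
# giant partition value lands entirely on a single reducer.
import zlib
from collections import Counter

N_REDUCERS = 4

# Assumption: the partition value "big" covers 90% of the rows, and
# COLUMN2 is an evenly distributed low-cardinality column (0..7).
rows = [("big", i % 8) for i in range(900)] + \
       [("small%d" % i, i % 8) for i in range(100)]

def h(s):
    return zlib.crc32(s.encode())

# DISTRIBUTE BY COLUMN: reducer chosen from the partition column
# only, so every "big" row hits the same reducer.
by_col1 = Counter(h(c1) % N_REDUCERS for c1, c2 in rows)

# DISTRIBUTE BY COLUMN, COLUMN2: the second column spreads the
# giant partition across reducers.
by_both = Counter((h(c1) + c2) % N_REDUCERS for c1, c2 in rows)

print(max(by_col1.values()))  # >= 900: heavy skew on one reducer
print(max(by_both.values()))  # much smaller: work is spread out
```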
Alternatively, try increasing mapper parallelism and check whether it helps: https://stackoverflow.com/a/48487306/2700344

Distributed by Clause in HIVE

I have a table with huge data, around 100 TB.
When querying the table I used a distribute by clause on a particular column (say X).
The table contains 200 distinct (unique) values of X.
So when I query the table with distribute by on X, the maximum number of reducers should be 200. But I am seeing it utilize the MAX number of reducers, i.e. 999.
Let me explain with an example.
Suppose the description of emp_table is as follows, with 3 columns:
1. emp_name
2. emp_ID
3. Group_ID
and Group_ID has **200 distinct** values.
Now I want to query the table:
select * from emp_table distribute by Group_ID;
This query should use 200 reducers, per the distribute by clause. But I am seeing 999 reducers being utilized.
I am doing this as part of optimization. So how can I make sure it utilizes 200 reducers?
The number of reducers in Hive is decided by either of two properties:
hive.exec.reducers.bytes.per.reducer - the default value is 1 GB; this makes Hive create one reducer for each 1 GB of the input table's size.
mapred.reduce.tasks - takes an integer value, and that many reducers will be prepared for the job.
The distribute by clause has no role in deciding the number of reducers; all it does is distribute/partition the key values from the mappers to the prepared reducers based on the column given in the clause.
Consider setting mapred.reduce.tasks to 200; distribute by will then take care of partitioning the key values to the 200 reducers in an even manner.
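To see why a 100 TB table hits the 999 cap regardless of how many distinct values the distribute-by column has, here is a rough Python sketch of the size-based estimate (the property names are real; the formula mirrors the estimate's shape, not Hive's exact code):

```python
import math

# One reducer per hive.exec.reducers.bytes.per.reducer of input
# (1 GB default), capped at hive.exec.reducers.max (999 here,
# matching the question). Assumed simplification of Hive's estimate.
def estimated_reducers(input_bytes,
                       bytes_per_reducer=1 << 30,
                       max_reducers=999):
    return max(1, min(max_reducers,
                      math.ceil(input_bytes / bytes_per_reducer)))

print(estimated_reducers(100 * (1 << 40)))  # 100 TB -> capped at 999
print(estimated_reducers(5 * (1 << 30)))    # 5 GB   -> 5 reducers
```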
The number of reducers in Hive depends on the size of your input file. But if the mapper output contains only 200 groups, then I guess most of the reduce tasks will receive nothing.
If you really want to control the number of reducers, setting mapred.reduce.tasks will help.

Relationship between a HIVE query and the number of mapreducers provided by Hadoop?

I am executing a query in HIVE shell as
SELECT tradeId, bookid, foid from trades where bookid='"ABCDEFG"'
The table "trades" has an index on bookid. When the query runs, it shows the details of the mappers and reducers as follows:
Number of reduce tasks is set to 0 since there's no reduce operator
Hadoop job information for Stage-1: number of mappers: 48; number of reducers: 0
Time taken: **606.183 seconds**, Fetched: **18 row(s)**
As you can see, it took an enormous amount of time to fetch just 18 rows. My question is: what am I doing wrong here? Should the reducer count be non-zero? Will it help if I set it using
set mapred.reduce.tasks = some_number
Shouldn't the index help retrieve the data faster?
When you are doing a simple select, all of the filtering and column selection is done by the mappers themselves. There is no work for a reducer here, hence the number of reducers is zero - which is fine. You probably have around 48 × block size of data in your table, so it spawned 48 mappers. How many map slots per DataNode do you have, and how many of them were free when you fired your query? Chances are that not all 48 ran in parallel. Though it returned only 18 rows, it read the full table. Is your table bucketed and clustered on the bookid column? In that case you may use the TABLESAMPLE clause to make it read only the bucket that contains your ABCDEFG value.
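The bucket-pruning idea behind TABLESAMPLE can be sketched in plain Python (toy table layout and data; crc32 stands in for the column's hashCode):

```python
# At load time, each row of a bucketed table is written to bucket
# hash(bookid) % num_buckets, so a point lookup only has to scan
# the one bucket file its key hashes to, not the whole table.
import zlib

NUM_BUCKETS = 4  # small for the demo

def bucket_of(bookid):
    return zlib.crc32(bookid.encode()) % NUM_BUCKETS

# "Load" 1000 trades across 10 books into bucket files.
trades = [("t%d" % i, "BOOK%d" % (i % 10)) for i in range(1000)]
buckets = [[] for _ in range(NUM_BUCKETS)]
for trade_id, bookid in trades:
    buckets[bucket_of(bookid)].append((trade_id, bookid))

# A TABLESAMPLE-style lookup scans only the bucket "BOOK7" hashes to.
scanned = buckets[bucket_of("BOOK7")]
matches = [t for t in scanned if t[1] == "BOOK7"]
print(len(matches))  # 100 trades found without reading every bucket
```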

Amazon EMR not utilizing all the nodes

I am using 4 core nodes..
I am using hive to run queries on a table.
Various queries seem to be under utilizing the capacity.
My table consists of 8 integer fields and about 1000 rows.
queries of the form
select avg(col1-col2) from tbl;
select count(*) from tbl;
and every other query I tried
are producing
number of reducers=1,number of mappers=1
I have tried using set mapred.reduce.tasks=4;
but it doesn't work.
The weirdest thing is that when I use mapred.job.tracker=local, which means one map and one reduce on the local node itself, the task finishes twice as fast.
All the reduce/map slots except one are open all the time.
Why isn't adding capacity improving execution time, even slightly?
Is my data sample so small that increasing capacity doesn't matter, and localizing the map and reduce actually improves the time?
The reason you are getting a single mapper is that your table is so small. I'm assuming your 1000-row table is one file, which is much smaller than your HDFS block size. Try a table with a million rows or more and you will start seeing it utilize multiple mappers. The answers to this question have some more information on how the number of mappers is chosen.
The reason you are getting a single reducer is a combination of two things. First, you are working with a tiny amount of data (for Hive) so you end up with one reducer. Second, some queries (like COUNT(*) FROM some_table) must have one reducer (see the question here)
You nailed it on why running the job locally is faster. 1000-row tables are great for testing the logic of your queries, but not for measuring things like runtime. Running Hive on a cluster instead of locally will probably only start paying off once you have data on the order of GBs. Hive is definitely not the "right tool for the job" until you get into queries that touch at least tens of GBs, though hundreds of GBs or TBs (or more) is easier to justify.
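The mapper-count point above can be sketched roughly in Python (the 128 MB block size is an assumption; the real split computation has more knobs, such as input format and min/max split sizes):

```python
import math

# Rough sketch: roughly one map task per HDFS block (split) of input.
def estimated_mappers(file_bytes, block_bytes=128 * (1 << 20)):
    return max(1, math.ceil(file_bytes / block_bytes))

tiny_table = 1000 * 64            # ~1000 short rows: far below one block
print(estimated_mappers(tiny_table))     # 1 mapper, as observed
print(estimated_mappers(6 * (1 << 30)))  # ~6 GB -> 48 mappers
```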

Why is sort by always using single reducer?

I am trying to execute the following query and it is taking forever to load data as only a single reducer is used for the second job.
INSERT INTO TABLE ddb_table
SELECT * FROM data_dump sort by rank desc LIMIT 1000000;
Two jobs are created for the above query. The first job runs pretty fast, as it uses 80 mappers and about 22 reducers. The second job's mappers are fast, but it is super slow due to a single reducer.
I tried to increase the reducer count with set mapred.reduce.tasks=35, but interestingly it was applied only to the first job and not the second.
Why is a single reducer used? Is it because of the sort by clause?
How can I set max reducers?
Is there a better way of doing it?
I'm not positive, but my intuition is that it's because of the "limit", not the "sort by". In fact, "sort by" explicitly will only sort within each reducer, so you will not get a total ordering.
The issue is that if there are multiple reducers, they aren't coordinated enough to be able to know when they've reached 1000000 records. So to do limit, it must be only one reducer, which maintains a count of the number of records, and stops outputting new ones once the limit is reached.
In fact, even if it were possible to do "sort by" and "limit" with multiple reducers, you could get different output on different runs, depending on which reducer runs fastest, so I don't think what you're trying to do here makes sense in the first place.
It is just the way sorting with the default partitioner works in Hadoop. Default partitioning uses hashcode modulo the number of reducers, so if you want 35 reducers, then you will get 35 output files, each sorted, but with overlapping key ranges. For example, with keys starting with alphabetic characters [a..z]: file1 (a1, a2, a15, d3, d5, f6), file2 (a3, a5, b1, z3), etc.
In order to avoid the overlapping key ranges, you either need one reducer, or you need to make your partitioner more aware of the nature of the keys - for example, make your partitioner direct all keys with the same first character into the same partition. There will then be multiple files in the output, but none of the ranges will overlap, e.g. file1 (a1, a2, a3, a5, a15), file2 (b1), file3 (...), file4 (d3, d6), etc.
It works for me when I use standard Hadoop jobs or Apache Pig. Unfortunately I have no Hive experience, but you could try to use Dynamic Partitioning on the table you are inserting into.
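The overlapping-ranges effect is easy to reproduce in a few lines of Python (toy keys; a character-sum hash stands in for hashCode):

```python
# With a hash partitioner, each reducer's output file is sorted, but
# the key ranges of the files overlap, so the concatenation is not
# globally sorted. A range-style partitioner (here: first character)
# keeps the ranges disjoint.
keys = ["a1", "a2", "a15", "b1", "d3", "d5", "f6", "z3"]
N = 2  # number of reducers

def files(part):
    out = [[] for _ in range(N)]
    for k in keys:
        out[part(k)].append(k)
    return [sorted(f) for f in out]  # each reducer sorts its own file

hash_files = files(lambda k: sum(map(ord, k)) % N)      # hashCode-like
range_files = files(lambda k: 0 if k[0] <= "m" else 1)  # by first char

def ranges_overlap(fs):
    spans = sorted((f[0], f[-1]) for f in fs if f)
    return any(spans[i][1] > spans[i + 1][0]
               for i in range(len(spans) - 1))

print(ranges_overlap(hash_files))   # True: sorted files, ranges overlap
print(ranges_overlap(range_files))  # False: global order recoverable
```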
