Optimal number of mappers required in Hive - Hadoop

I am running a cross product operation and storing the results in a table. The number of rows in table1 and table2 is ~300K and ~15K respectively. The query is like
create table table3 as
select a.var1*b.var1 + ...... + a.var_n*b.var_n as score
from table1 a, table2 b;
I observed that the process runs fastest with 2000 to 3000 mappers, compared to a much higher number of allocated mappers (5000).
My questions are:
Does increasing the number of mappers really speed up the process?
Is there any way to figure out the optimal number of mappers for a process?
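For background, Hive does not take a mapper count directly: the number of mappers follows from the input split size. A minimal sketch of the properties that typically govern it on the MapReduce engine (the byte values below are illustrative assumptions, not tuning advice):
-- a smaller max split size => more, smaller splits => more mappers
set mapreduce.input.fileinputformat.split.maxsize=256000000;
-- a larger min split size merges small splits => fewer mappers
set mapreduce.input.fileinputformat.split.minsize=64000000;
-- on the Tez engine the analogous knobs are tez.grouping.min-size / tez.grouping.max-size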

Related

Single map task taking a long time and failing in Hive map reduce

I am running a simple query like the one shown below (similar form):
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
There is nothing wrong with the query syntax-wise.
TABLE2 is empty, and the total size of TABLE1 is 2 GB in HDFS (stored as Parquet with Snappy compression).
When I run the query in hive, I see that 17 map tasks and 0 reducer tasks are launched.
What I notice is that most of the map tasks complete in a minute.
But one of the map tasks takes a long time. It's like all the data in the table is going to that map task.
The whole query eventually fails with a container physical memory limit error.
Any reasons for why this is happening or might happen?
This may happen because some partition is much bigger than the others.
Try to trigger a reducer stage by adding DISTRIBUTE BY:
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
DISTRIBUTE BY COLUMN
Additionally, you can add some other evenly distributed column with low cardinality to the DISTRIBUTE BY to increase parallelism:
DISTRIBUTE BY COLUMN, COLUMN2
If COLUMN2 has high cardinality, it will produce too many files in each partition; if its values are distributed unevenly (skewed), it will cause skew on the reducers. So it is important to use a low-cardinality, evenly distributed column, or a deterministic function with the same properties, such as substr().
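For example, a hedged sketch using a deterministic function as the second distribution key (COLUMN2 stands in for some string column, following the question's placeholder names):
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
-- substr() is deterministic: rows sharing a prefix land on the same reducer,
-- and the short prefix keeps cardinality (and the file count per partition) low
DISTRIBUTE BY COLUMN, substr(COLUMN2, 1, 2);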
Alternatively, try to increase mapper parallelism and check whether it helps: https://stackoverflow.com/a/48487306/2700344

How is map reduce performed in this HiveQL query?

FROM (
FROM pv_users
SELECT TRANSFORM(pv_users.userid, pv_users.date)
USING 'python mapper.py'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
SELECT TRANSFORM(map_output.dt, map_output.uid)
USING 'python reducer.py'
AS date, count;
How does map reduce work in this query, and what is the significance of "CLUSTER BY" here?
Each mapper will read its file splits, do something with them (for example, pre-aggregation such as distinct) and produce dt, uid grouped and sorted by dt, so that different dt values are put in different files, which will be consumed by the reducers in the next step.
Reducers will read the files prepared by the mappers, so records with the same dt will be read by the same reducer, because the records were distributed by dt and sorted on the mapper.
The reducer will merge the partial results (files from mappers) and do the count aggregation. If several dt values ended up in the same file, the records are sorted, which reduces the amount of work to be done on the reducer.
cluster by dt = distribute by dt sort by dt
Without CLUSTER BY, two reducers may receive the same dt. This would make it impossible to count correctly, because reducers do not know about each other and do not share data: the same dt would be counted partially on different reducers, and the final result would contain multiple records with the same dt.
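For clarity, here is the same query with CLUSTER BY expanded into its two component clauses (everything else unchanged):
FROM (
FROM pv_users
SELECT TRANSFORM(pv_users.userid, pv_users.date)
USING 'python mapper.py'
AS dt, uid
DISTRIBUTE BY dt
SORT BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
SELECT TRANSFORM(map_output.dt, map_output.uid)
USING 'python reducer.py'
AS date, count;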

Hive number of reducers in group by and count(distinct)

I was told that count(distinct) may result in data skew because only one reducer is used.
I ran a test on a table with 5 billion rows, using 2 queries.
Query A:
select count(distinct columnA) from tableA
Query B:
select count(columnA) from
(select columnA from tableA group by columnA) a
Actually, query A takes about 1000-1500 seconds while query B takes 500-900 seconds. The result seems as expected.
However, I noticed that both queries use 370 mappers and 1 reducer, and they have almost the same cumulative CPU seconds. This suggests there is no genuine difference between them, and the time difference may be caused by cluster load.
I am confused why they both use 1 reducer. I even tried mapreduce.job.reduces, but it does not work. Btw, if they both use 1 reducer, why do people suggest not to use count(distinct)? It seems data skew is unavoidable either way.
Both queries use the same number of mappers, which is expected, and a single final reducer, which is also expected because you need a single scalar count result. Multiple reducers on the same vertex run independently, isolated, and each would produce its own output; this is why the last stage has a single reducer. The difference is in the plan.
In the first query's execution, a single reducer reads every mapper's output and does the distinct-count calculation over all the data; it processes too much data.
The second query uses intermediate aggregation, and the final reducer receives partially aggregated data (distinct values were aggregated in the previous step). The final reducer needs to aggregate the partial results again to get the final result, which can be much less data than in the first case.
As of Hive 1.2.0 there is an optimization for count(distinct) and you do not need to rewrite the query. Set this property: hive.optimize.distinct.rewrite=true
Also there is mapper aggregation (a mapper can pre-aggregate data too and produce distinct values within the scope of its own portion of the data, its splits). Set this property to allow map-side aggregation: hive.map.aggr=true
Use the EXPLAIN command to check the difference in the execution plans.
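Putting the pieces together, a minimal session sketch (tableA and columnA are the placeholders from the question; the property names are as given above):
set hive.map.aggr=true;
set hive.optimize.distinct.rewrite=true;
-- compare the resulting plans for the two formulations
explain select count(distinct columnA) from tableA;
explain select count(columnA) from (select columnA from tableA group by columnA) a;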
See also this answer: https://stackoverflow.com/a/51492032/2700344

DISTRIBUTE BY clause in Hive

I have a table with huge data, around 100 TB.
When querying the table, I used the DISTRIBUTE BY clause on a particular column (say X).
The table contains 200 distinct or unique values of X.
So when I query the table with DISTRIBUTE BY on X, the maximum number of reducers should be 200. But I see it utilizing the MAX number of reducers, i.e. 999.
Let me explain with an example.
Suppose the description of emp_table is as follows, with 3 columns:
1. emp_name
2. emp_ID
3. Group_ID
and Group_ID has 200 distinct values.
Now I want to query the table
select * from emp_table distribute by Group_ID;
This query should use 200 reducers as per the DISTRIBUTE BY clause, but I see 999 reducers being utilized.
I am doing this as part of optimization. So how can I make sure it utilizes 200 reducers?
The number of reducers in Hive is decided by either of two properties.
hive.exec.reducers.bytes.per.reducer - the default value is 1 GB; this makes Hive create one reducer for each 1 GB of the input table's size.
mapred.reduce.tasks - takes an integer value, and that many reducers will be prepared for the job.
The DISTRIBUTE BY clause has no role in deciding the number of reducers; its only job is to distribute/partition the key values from the mappers to the prepared reducers based on the column given in the clause.
Consider setting mapred.reduce.tasks to 200, and DISTRIBUTE BY will take care of partitioning the key values to the 200 reducers in an even manner.
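A sketch of that suggestion, using the question's placeholder table:
set mapred.reduce.tasks=200;
-- on newer Hadoop releases the same knob is named mapreduce.job.reduces
select * from emp_table distribute by Group_ID;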
The number of reducers in Hive depends on the size of your input files. But if the mapper output contains only 200 groups, then I guess most of the reducers will receive nothing.
If you really want to control the number of reducers, setting mapred.reduce.tasks will help.

Relationship between a Hive query and the number of mappers/reducers provided by Hadoop?

I am executing a query in the Hive shell as
SELECT tradeId, bookid, foid from trades where bookid='"ABCDEFG"'
The table "trades" has index on bookid. When the query runs, it shows the details of Mappers and Reducers as follows :-
Number of reduce tasks is set to 0 since there's no reduce operator
Hadoop job information for Stage-1: number of mappers: 48; number of reducers: 0
Time taken: 606.183 seconds, Fetched: 18 row(s)
If you see, it took an enormous amount of time to fetch just 18 rows. My question is: what am I doing wrong here? Should the reducer count be non-zero? Will it help if I set it using
set mapred.reduce.tasks = some_number
Shouldn't the indexes help retrieve the data faster?
When you are doing a simple select, all the filtering and column selection are done by the mappers themselves. There is no purpose for a reducer task here, hence the number of reducers is zero, which is fine.
You probably have around 48 * block size worth of data in your table, so it spawned 48 mappers. How many map slots per DataNode do you have, and how many of them were free when you fired your query? Chances are all 48 of them were not running in parallel.
Though it returned only 18 rows, it read the full table. Is your table bucketed and clustered on the bookid column? In that case you may use the TABLESAMPLE clause to make it read only the buckets that contain your ABCDEFG value.
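A hypothetical sketch of that last suggestion, assuming trades had been created bucketed on bookid (the DDL and the bucket count of 32 are assumptions, not the asker's actual schema):
-- assumed DDL: CREATE TABLE trades (...) CLUSTERED BY (bookid) INTO 32 BUCKETS;
SELECT tradeId, bookid, foid
FROM trades TABLESAMPLE(BUCKET 1 OUT OF 32 ON bookid)
WHERE bookid = 'ABCDEFG';
-- BUCKET 1 is illustrative: the bucket that actually holds 'ABCDEFG'
-- is determined by hash('ABCDEFG') mod 32, so pick that bucket number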
