How does Hive choose the number of reducers for a job?

Several places say the default number of reducers in a Hadoop job is 1. You can use the mapred.reduce.tasks property to manually set the number of reducers.
When I run a Hive job (on Amazon EMR, AMI 2.3.3), it has some number of reducers greater than one. Looking at job settings, something has set mapred.reduce.tasks, I presume Hive. How does it choose that number?
Note: here are some messages while running a Hive job that should be a clue:
...
Number of reduce tasks not specified. Estimated from input data size: 500
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
...

The default of 1 is probably for a vanilla Hadoop install; Hive overrides it.
In open source hive (and EMR likely):
# reducers = (# bytes of input to mappers) / (hive.exec.reducers.bytes.per.reducer)
The default hive.exec.reducers.bytes.per.reducer is 1G.
You can limit the number of reducers produced by this heuristic using hive.exec.reducers.max.
If you know exactly the number of reducers you want, you can set mapred.reduce.tasks, and this will override all heuristics. (By default this is set to -1, indicating Hive should use its heuristics.)
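To make the heuristic concrete, here is a hedged sketch of how the estimate of 500 in the log above could come about, assuming roughly 500 GB of mapper input and the older defaults (about 1 GB per reducer, a cap of 999); the input size is an assumption for illustration only:
set hive.exec.reducers.bytes.per.reducer=1000000000;  -- ~1 GB per reducer (older default)
set hive.exec.reducers.max=999;                       -- cap on the heuristic (older default)
set mapred.reduce.tasks=-1;                           -- -1 means "let Hive estimate"
-- estimated reducers = min(ceil(input bytes / bytes.per.reducer), reducers.max)
--                    = min(ceil(500 GB / 1 GB), 999) = 500
-- which matches "Estimated from input data size: 500" in the log above.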
In some cases - say 'select count(1) from T' - Hive will set the number of reducers to 1, irrespective of the size of the input data. These are called 'full aggregates' - and if the only thing the query does is full aggregates, then the compiler knows that the data coming out of the mappers will be reduced to a trivial amount, so there is no point running multiple reducers.

Related

Deciding on the optimal number of reducers to be specified for fastest processing in a Hadoop map reduce program

The default value for the number of reducers is 1. The partitioner makes sure that the same keys from multiple mappers go to the same reducer, but that does not mean the number of reducers will be equal to the number of partitions. From the driver one can specify the number of reducers using JobConf's conf.setNumReduceTasks(int num), or as mapred.reduce.tasks on the command line. If only mappers are required, we can set this to 0.
I have read regarding setting the number of reducers that:
The number of reducers can be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node). I also read in the link below that increasing the number of reducers can have overhead:
Number of reducers in hadoop
Also, I read in the below link that the number of reducers is best set to be the number of reduce slots in the cluster (minus a few to allow for failures):
What determines the number of mappers/reducers to use given a specified set of data
Based on the range specified in (1) and the advice in (2), how do I decide on the optimal number for fastest processing?
Thanks.
I want to know the approach for this in general.
This question can only have an empirical answer. Quoting from an answer in this Q&A
By default on 1 GB of data one reducer would be used.
[...]
Similarly, if your data is 10 GB, 10 reducers would be used.
The defaults already embody this rule of thumb. You can further tune the number by running empirical tests and seeing how performance changes. That is, currently, all there is to it.

What will happen if the Hive number of reducers is different from the number of keys?

In Hive I often run queries like:
select columnA, sum(columnB) from ... group by ...
I read some MapReduce examples in which one reducer can only produce one key. It seems the number of reducers depends entirely on the number of keys in columnA.
So why does Hive let you set the number of reducers manually?
If there are 10 different values in columnA and I set the number of reducers to 2, what will happen? Will each reducer be reused 5 times?
If there are 10 different values in columnA and I set the number of reducers to 20, what will happen? Will Hive only generate 10 reducers?
Normally you should not set the exact number of reducers manually. Use bytes.per.reducer instead:
--The number of reduce tasks is determined at compile time
--The default size is 1G, so if the estimated input size is 10G then 10 reducers will be used
set hive.exec.reducers.bytes.per.reducer=67108864;
If you want to limit cluster usage by job reducers, you can set this property: hive.exec.reducers.max
If you are running on Tez, at execution time Hive can dynamically set the number of reducers if this property is set:
set hive.tez.auto.reducer.parallelism = true;
In this case the number of reducers initially started may be bigger, because it was estimated from the data size; at runtime the extra reducers can be removed.
One reducer can process many keys; how many depends on the data size and on the bytes.per.reducer and reducer-limit configuration settings. For a query like yours, the same keys will all go to the same reducer, because each reducer container runs in isolation and all rows with a particular key need to reach a single reducer for it to be able to calculate the aggregate for that key.
Extra reducers can be forced (mapreduce.job.reduces=N) or started automatically based on a wrong estimate (because of stale stats); if they are not removed at runtime, they will do nothing and finish quickly because there is nothing to process. But such reducers will still be scheduled and containers allocated for them, so it is better not to force extra reducers and to keep stats fresh for better estimation.
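Putting the above together for the group-by query from the question, a minimal sketch (the table name and sizes are hypothetical; the key counts come from the question):
set hive.exec.reducers.bytes.per.reducer=67108864;  -- 64 MB per reducer, so the count follows the input size
set hive.exec.reducers.max=100;                     -- upper bound on the estimate
set hive.tez.auto.reducer.parallelism=true;         -- on Tez, let Hive trim reducers at runtime

select columnA, sum(columnB)
from some_table                                     -- hypothetical table name
group by columnA;
-- With 10 distinct values in columnA and 2 reducers, each reducer processes about 5 keys;
-- with 20 reducers, roughly 10 of them receive no keys and finish almost immediately.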

How many mappers and reducers will get created for a partitioned table in Hive

I am always confused about how many mappers and reducers will get created for a particular task in Hive.
E.g. the block size is 128 MB and there are 365 files, each mapping to a date in a year (file size = 1 MB each), and the table is partitioned on the date column. In this case, how many mappers and reducers will run while loading the data?
Mappers:
The number of mappers depends on various factors, such as how the data is distributed among nodes, the input format, the execution engine and configuration parameters. See also here: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
MR uses CombineInputFormat, while Tez uses grouped splits.
Tez:
set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split
MapReduce:
set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB
Also, mappers run on the data nodes where the data is located, which is why manually controlling the number of mappers is not easy, and it is not always possible to combine input.
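As a rough, hedged illustration for the 365 x 1 MB files in the question (about 365 MB of input in total), using the Tez grouping bounds shown above; the actual count also depends on data locality and available parallelism:
set tez.grouping.min-size=16777216;    -- each grouped split is at least 16 MB
set tez.grouping.max-size=1073741824;  -- and at most 1 GB
-- The 365 files of ~1 MB are grouped rather than getting one mapper each:
-- at most about 365 MB / 16 MB ~= 23 grouped splits (mappers), and since the
-- total is well under 1 GB it could be as low as a single mapper.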
Reducers:
Controlling the number of reducers is much easier.
The number of reducers is determined according to:
mapreduce.job.reduces - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out the number of reducers.
hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.
Also hive.exec.reducers.max - Maximum number of reducers that will be used. If mapreduce.job.reduces is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
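A small sketch of how these three settings interact; the 100 GB input size is made up for illustration:
set mapreduce.job.reduces=-1;                         -- negative: let Hive estimate the count
set hive.exec.reducers.bytes.per.reducer=1073741824;  -- 1 GB per reducer
set hive.exec.reducers.max=50;                        -- hard cap on the estimate
-- For ~100 GB of input the estimate is ceil(100 GB / 1 GB) = 100 reducers,
-- but hive.exec.reducers.max brings it down to 50. Raising the cap or lowering
-- bytes.per.reducer increases reducer parallelism.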

Number of reducers in hadoop

I was learning Hadoop and found the number of reducers very confusing:
1) The number of reducers is the same as the number of partitions.
2) The number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).
3) The number of reducers is set by mapred.reduce.tasks.
4) The number of reducers is closest to: a multiple of the block size, a task time between 5 and 15 minutes, and whatever creates the fewest files possible.
I am very confused. Do we explicitly set the number of reducers, or is it done by the MapReduce program itself?
How is the number of reducers calculated? Please tell me how to calculate the number of reducers.
1 - The number of reducers is the same as the number of partitions - false. A single reducer might work on one or more partitions, but a given partition is processed entirely by the reducer it is assigned to.
2 - That is just a theoretical maximum number of reducers you can configure for a Hadoop cluster, and it is very much dependent on the kind of data you are processing too (which decides how much heavy lifting the reducers are burdened with).
3 - The mapred-site.xml configuration is just a suggestion to YARN; internally the ResourceManager has its own algorithm running, optimizing things on the go, so that value is not necessarily the number of reducer tasks running every time.
4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't always have 128*5 as the minimum number of reducers. That, again, is false, I believe.
There is no fixed number of reducer tasks that can be configured or calculated. It depends on how many resources are actually available to allocate at the moment.
The number of reducers is calculated internally from the size of the data being processed, if you don't explicitly specify it using the API below in the driver program:
job.setNumReduceTasks(x)
By default on 1 GB of data one reducer would be used.
So if you are working with less than 1 GB of data and you are not specifically setting the number of reducers, 1 reducer would be used.
Similarly, if your data is 10 GB, 10 reducers would be used.
You can change the configuration as well: instead of 1 GB you can specify a bigger or smaller size.
The property in Hive for setting the size per reducer is:
hive.exec.reducers.bytes.per.reducer
You can view this property by firing the set command in the Hive CLI.
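For example (the printed value will depend on your Hive version and site configuration):
-- in the Hive CLI, 'set <property>;' with no value prints the current setting
set hive.exec.reducers.bytes.per.reducer;
-- then override it for the session, e.g. 256 MB per reducer for more parallelism
set hive.exec.reducers.bytes.per.reducer=268435456;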
The partitioner only decides which data goes to which reducer.
Your job may or may not need reducers; it depends on what you are trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so, and which produce at least one HDFS block's worth of output. Too many reducers and you end up with lots of small files.
The partitioner makes sure that the same keys from multiple mappers go to the same reducer. This doesn't mean that the number of partitions equals the number of reducers. However, you can specify the number of reduce tasks in the driver program using the job instance, e.g. job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program, it is picked up from mapred.reduce.tasks, which has a default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), i.e. all mapper output will go to the same reducer.
Also, note that the programmer does not have control over the number of mappers, as it depends on the input splits, whereas the programmer can control the number of reducers for any job.

Splitsize for the Hive table in TDCH

I'm using TDCH to export Hive data into a Teradata table. For this I need to specify the number of mappers for my TDCH job. So my question is: is the number-of-mappers option we give to the TDCH job just a hint to TDCH, or will the total number of mappers created by TDCH always be equal to the number of mappers given in the option (of the TDCH job)?
My assumption is that the number of mappers depends mainly on the split size rather than on the given number of mappers (in the options of the TDCH job). Is my assumption correct for TDCH jobs?
Also, for a Hive table how is the split size defined? Is it based on the number of rows, or is it based on the size of the data (like 60 MB or 120 MB etc.), similar to cases like text files?
"is this number of mappers option we give to TDCH job is just a hint to TDCH? or are these total number of mappers created by TDCH will always be equal to number of mappers given in the option (of the TDCH job)"?
Splitsize in TDCH is always equals to "number of mappers" specified (I read this in one of the TDCH tutorials). So, number of mappers is not just a hint (unlike in conventional mapreduce programming) it's nothing but the number of splits.
As it's equals to number of splits total number of mappers generated for a TDCH job is always equals to "number of mappers" (option) specified while running the job.
