If I set the number of reduce tasks to something like 100 and, when I run the job, the number of reduce tasks exceeds that (as per my understanding, the number of reduce tasks depends on the key-value pairs we get from the mapper; suppose I emit (1,abc) and (2,bcd) as key-value pairs in the mapper, then the number of reduce tasks will be 2), how will MapReduce handle it?
as per my understanding the number of reduce tasks depends on the key-value we get from the mapper
Your understanding seems to be wrong. The number of reduce tasks does not depend on the key-value pairs emitted by the mapper.
In a MapReduce job the number of reducers is configurable on a per-job basis and is set in the driver class.
For example, if we need 2 reducers for our job, then we set it in the driver class of our MapReduce job as below:
job.setNumReduceTasks(2);
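In a complete driver this call sits alongside the rest of the job setup. A minimal sketch of such a driver (the class names, job name and the Text/IntWritable output types are illustrative placeholders, and MyMapper/MyReducer stand for your own hypothetical implementations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my job");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class);      // hypothetical Mapper implementation
        job.setReducerClass(MyReducer.class);    // hypothetical Reducer implementation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                // exactly 2 reduce tasks will be created
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}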
In Hadoop: The Definitive Guide, Tom White notes that
choosing the number of reducers is more of an art than a science.
So we have to decide how many reducers we need for our job. For your example, if the intermediate mapper output is (1,abc) and (2,bcd) and you have not set the number of reducers in the driver class, then MapReduce runs only 1 reducer by default; both key-value pairs will be processed by a single reducer and you will get a single output file in the specified output directory.
The default number of reducers in MapReduce is 1, irrespective of the number of (key,value) pairs.
If you set the number of reducers for a MapReduce job, then the number of reducers will not exceed the defined value, irrespective of the number of distinct (key,value) pairs.
Once the mapper tasks are completed, the output is processed by the partitioner, which divides the data among the reducers. The default partitioner in Hadoop is HashPartitioner, which partitions the data based on the hash value of the key. It has a method called getPartition, which takes key.hashCode() & Integer.MAX_VALUE and finds the modulus using the number of reduce tasks.
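The whole method essentially boils down to one line; a simplified sketch of what HashPartitioner does (the key/value types are written generically as K and V here):

// mask with Integer.MAX_VALUE to make the hash non-negative, then take the
// modulus by the configured number of reduce tasks
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}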
So the number of reducers will never exceed what you have defined in the driver class.
Related
In Hadoop, what is the difference between using n mappers and n reducers, versus n mappers and 1 reducer?
In the case of using 1 reducer, which of my computers (the mappers) runs the reduce phase if I have 3 computers?
The number of mappers is controlled by the amount of data being processed. Reducers are controlled either by the developer or by various system parameters.
To override the number of reducers:
set mapreduce.job.reduces=#;
or, if it is a Hive job and you want more control over how much work each reducer has to do, you can tweak certain parameters such as:
hive.exec.reducers.bytes.per.reducer.
You can still override with mapreduce.job.reduces; using bytes per reducer just lets you control how much each reducer processes.
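For example, in a Hive session you could aim for roughly 256 MB of input per reducer, or pin the reducer count explicitly (the values below are purely illustrative):

set hive.exec.reducers.bytes.per.reducer=268435456;
set mapreduce.job.reduces=10;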
As for controlling where the reducers run, you really cannot control that except by using Node Labels, and that would mean controlling where all of the tasks in the job run, not just the reducers.
I was learning Hadoop and found the number of reducers very confusing:
1) Number of reducers is the same as the number of partitions.
2) Number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).
3) Number of reducers is set by mapred.reduce.tasks.
4) Number of reducers is closest to: a multiple of the block size, a task time between 5 and 15 minutes, and the fewest files possible.
I am very confused. Do we explicitly set the number of reducers, or is it done by the MapReduce program itself?
How is the number of reducers calculated? Please tell me how to calculate the number of reducers.
1 - The number of reducers is the same as the number of partitions - False. A single reducer might work on one or more partitions, but a given partition will be processed entirely by the reducer on which it started.
2 - That is just a theoretical maximum number of reducers you can configure for a Hadoop cluster. It also depends very much on the kind of data you are processing (which decides how much heavy lifting the reducers are burdened with); a worked example of this rule of thumb follows below this list.
3 - The mapred-site.xml configuration is just a suggestion to YARN. Internally the ResourceManager runs its own algorithm, optimizing things on the go, so that value is not necessarily the number of reducer tasks that run every time.
4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't always have 128*5 as the minimum number of reducers. That, again, is false, I believe.
There is no fixed number of reducer tasks that can be configured or calculated. It depends on how many resources are actually available to allocate at that moment.
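To put rough numbers on the 0.95 / 1.75 rule of thumb from point 2 (purely illustrative figures): on a cluster with 10 nodes and 8 reduce containers per node, 0.95 * 10 * 8 = 76 reducers and 1.75 * 10 * 8 = 140 reducers. Treat these as ceilings to tune from, not as the count that will actually run.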
The number of reducers is internally calculated from the size of the data we are processing, if you don't explicitly specify it using the API below in the driver program:
job.setNumReduceTasks(x)
By default, one reducer is used per 1 GB of data.
So if you are working with less than 1 GB of data and you do not specifically set the number of reducers, 1 reducer will be used.
Similarly, if your data is 10 GB, 10 reducers will be used.
You can change this configuration as well: instead of 1 GB you can specify a bigger or smaller size.
The property in Hive for setting the size per reducer is:
hive.exec.reducers.bytes.per.reducer
You can view this property by firing the set command in the Hive CLI.
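For example, issuing set with just the property name prints its current value (which varies by Hive version and site configuration):

set hive.exec.reducers.bytes.per.reducer;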
The partitioner only decides which data goes to which reducer.
Your job may or may not need reducers; it depends on what you are trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so and produce at least one HDFS block's worth of output. With too many reducers you end up with lots of small files.
The partitioner makes sure that the same key from multiple mappers goes to the same reducer. This doesn't mean that the number of partitions is equal to the number of reducers. However, you can specify the number of reduce tasks in the driver program using the job instance, like job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program, then it is picked from mapred.reduce.tasks, which has a default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), i.e. all mapper output will go to the same reducer.
Also, note that the programmer does not have control over the number of mappers, as it depends on the input splits, whereas the programmer can control the number of reducers for any job.
If there is a job that has only a map and no reduce, and if all data values that are to be processed are mapped to a single key, will the job only be processed on a single node?
No.
Basically, the number of nodes will be determined by the number of mappers: 1 mapper will run on 1 node, N mappers on N nodes, one node per mapper.
The number of mappers needed for your job is set by Hadoop, depending on the amount of data and on the size of the blocks your data is split into. Each block of data is processed by 1 mapper.
So if, for instance, your data is split into N blocks, you will need N mappers to process it.
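For illustration: assuming the default 128 MB block size, a 1 GB input file is stored as 8 blocks (1024 MB / 128 MB), so Hadoop schedules 8 map tasks for it, taking the usual case where one input split corresponds to one block.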
Directly from Hadoop: The Definitive Guide, Chapter 6, Anatomy of a MapReduce Job Run:
"To create the list of tasks to run, the job scheduler first retrieves
the input splits computed by the client from the shared filesystem. It
then creates one map task for each split. The number of reduce tasks
to create is determined by the mapred.reduce.tasks property in the
Job, which is set by the setNumReduceTasks() method, and the scheduler
simply creates this number of reduce tasks to be run. Tasks are given
IDs at this point."
What is the difference between a mapper and a map task?
Similarly, a reducer and a reduce task?
Also, how are the number of mappers, map tasks, reducers, and reduce tasks determined during the execution of a MapReduce job?
Please give the interrelationships between them, if there are any.
Simply put, a map task is an instance of the Mapper. The map and reduce functions are the methods you implement in a MapReduce job.
When we run a MapReduce job, the number of map tasks spawned depends on the number of input splits (which normally correspond to the blocks of the input). However, the number of reduce tasks can be specified in the MapReduce driver code: either by setting the property mapred.reduce.tasks on the job configuration object, or by using the org.apache.hadoop.mapreduce.Job#setNumReduceTasks(int reducerCount) method.
In the old JobConf API there was a setNumMapTasks() method, but it was removed in the new org.apache.hadoop.mapreduce.Job API, with the intention that the number of mappers should be calculated based on the input splits.
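A minimal sketch of both approaches in the new API (the class name, job name and the value 4 are placeholders; mapreduce.job.reduces is the Hadoop 2.x name of the old mapred.reduce.tasks property):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Option 1: set the property on the configuration before creating the job.
        conf.setInt("mapreduce.job.reduces", 4);
        Job job = Job.getInstance(conf, "reducer count example");
        // Option 2: the Job API call; it sets the same property and overrides option 1.
        job.setNumReduceTasks(4);
    }
}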
In a typical MapReduce setup (like Hadoop), how many reducers are used for one job, for example counting words? My understanding of MapReduce from Google is that only 1 reducer is involved. Is that correct?
For example, word count will divide the input into N chunks, and N maps will be running, producing the (word,#) lists. My question is: once the map phase is done, will there be only ONE reducer instance running to compute the result, or will there be reducers running in parallel?
The simple answer is that the number of reducers does not have to be 1, and yes, reducers can run in parallel. As I mentioned above, this is user defined or derived.
To keep things in context, I will refer to Hadoop in this case so you have an idea of how things work. If you are using the streaming API in Hadoop (0.20.2), you have to explicitly define how many reducers you would like to run, since by default only 1 reduce task will be launched. You do so by passing the number of reducers in the -D mapred.reduce.tasks=<number of reducers> argument. The Java API will try to derive the number of reducers you need, but again you can explicitly set that too. In both cases there is a hard cap on the number of reducers you can run per node, which is set in your mapred-site.xml configuration file using mapred.tasktracker.reduce.tasks.maximum.
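A streaming invocation with an explicit reducer count might look roughly like this (the jar path, input/output paths, and the cat/wc commands are placeholders that vary by installation):

hadoop jar /path/to/hadoop-streaming.jar \
    -D mapred.reduce.tasks=4 \
    -input /user/me/input \
    -output /user/me/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc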
On a more conceptual note, you can look at this post on the hadoop wiki that talks about choosing the number of map and reduce tasks.
In the case of the simple word-count example it would make sense to use only one reducer.
If you want the result of the computation to be only one number, you have to use one reducer (2 or more reducers would give you 2 or more output files).
If this reducer is taking a long time to complete, you can think of chaining multiple reducers, where reducers in the next phase would sum the results from the previous reducers.
This depends entirely on the situation. In some cases you don't have any reducers; everything can be done map-side. In other cases you cannot avoid having one reducer, but generally this comes in a 2nd or 3rd map/reduce job that condenses earlier results. Generally, however, you want to have a lot of reducers, or else you are losing a lot of the power of MapReduce! In word count, for example, the result of your mappers will be (word, count) pairs. These pairs are then partitioned based on the word, such that each reducer receives the same words and can give you the ultimate sum. Each reducer then outputs its result. If you wanted to, you could then shoot off another M/R job that took all of these files and concatenated them; that job would only have one reducer.
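A sketch of that word-count reduce step in the new API (the class name is illustrative; the mappers are assumed to emit (word, 1) counts):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Each reducer receives every count for the words routed to its partition
// and writes one total per word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}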
The default value is 1.
If you are considering Hive or Pig, then the number of reducers depends on the query, like group by, sum, and so on.
In the case of your MapReduce code, it can be defined by setNumReduceTasks on the job/conf:
job.setNumReduceTasks(3);
Most of the time this is done when you override getPartition(), i.e. when you are using a custom partitioner:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// "some logic" below is a placeholder for whatever key-based routing you need.
public class CustomPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        if (numReduceTasks == 0)
            return 0;
        if (some logic)        // e.g. a test on the key -> partition 0
            return 0;
        else if (some logic)   // -> partition 1
            return 1;
        else
            return 2;          // everything else -> partition 2
    }
}
One thing you will notice is that the number of reducers equals the number of part files in the output.
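For example, with job.setNumReduceTasks(3) as above, the output directory will typically contain three part files (plus the _SUCCESS marker), something like:

_SUCCESS
part-r-00000
part-r-00001
part-r-00002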
Let me know if you have doubts.
The reducers run in parallel. Whatever number of reducers you set for your job, whether by changing the mapred-site.xml config file, by setting it on the command line when running the job, or by setting it in the program, that many reducers will run in parallel. It is not necessary to keep it at 1; by default its value is 1.