Creating more partitions than reducers - hadoop

When developing locally on my single machine, I believe the default number of reducers is 6. In a particular MR step, I actually divide the data into n partitions, where n can be greater than 6. From what I have observed, it looks like only 6 of those partitions actually get processed, because I only see output from 6 specific partitions. A few questions:
(a) Do I need to set the number of reducers to be greater than the number of partitions? If so, can I do this before/during/after running the Mapper?
(b) Why is it that the other partitions are not queued up? Is there a way to wait for a reducer to finish processing one partition before working on another partition such that all partitions can be processed regardless of whether the actual number of reducers is less than the number of partitions?

(a) No. You can have any number of reducers based on your needs. Partitioning just decides which set of key/value pairs will go to which reducer; it doesn't decide how many reducers will be generated. But if there is a situation where you want to set the number of reducers as per your requirement, you can do that through the Job:
job.setNumReduceTasks(2);
(b) This is actually what happens. Based on the availability of slots, a set of reducers is initiated to process all the input fed to them. If all the reducers have finished and some data is still left unprocessed, a second batch of reducers will start and finish the rest of the data. All of your data will eventually get processed, irrespective of the number of partitions and reducers.
Please make sure your partition logic is correct.
P.S. : Why do you believe the default number of reducers is 6?
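To make the setNumReduceTasks() call concrete, here is a minimal driver sketch (the class name ReducerCountDemo is just illustrative; it relies on the default identity mapper and reducer so it stays self-contained). The reducer count has to be set before the job is submitted; it cannot be changed mid-run:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReducerCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer count demo");
        job.setJarByClass(ReducerCountDemo.class);
        // No mapper/reducer classes set: the identity mapper and reducer are used,
        // which keeps this sketch self-contained; plug in your own classes here.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(8);  // must be set before the job is submitted
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}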

You can also ask for a number of reducers when you submit the job to hadoop.
$hadoop jar myjarfile mymainclass -Dmapreduce.job.reduces=n myinput myoutputdir
For more options and some details see:
Hadoop Number of Reducers Configuration Options Priority
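Note that the -D generic options are only picked up if the driver parses them, which is what ToolRunner/GenericOptionsParser does for you. A minimal sketch, with MyMainClass as an illustrative name:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyMainClass extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D overrides, e.g. mapreduce.job.reduces
        Job job = Job.getInstance(getConf(), "driver with -D support");
        // ... set mapper, reducer, input/output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyMainClass(), args));
    }
}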

Related

What will happen if the Hive number of reducers is different from the number of keys?

In Hive I often do queries like:
select columnA, sum(columnB) from ... group by ...
I have read some MapReduce examples, and it seems one reducer can only produce one key. So the number of reducers seems to depend completely on the number of keys in columnA.
Therefore, why does Hive let you set the number of reducers manually?
If there are 10 different values in columnA and I set the number of reducers to 2, what will happen? Will each reducer be reused 5 times?
If there are 10 different values in columnA and I set the number of reducers to 20, what will happen? Will Hive only generate 10 reducers?
Normally you should not set the exact number of reducers manually. Use bytes.per.reducer instead:
--The number of reduce tasks determined at compile time
--Default size is 1G, so if the input size estimated is 10G then 10 reducers will be used
set hive.exec.reducers.bytes.per.reducer=67108864;
If you want to limit cluster usage by job reducers, you can set this property: hive.exec.reducers.max
If you are running on Tez, at execution time Hive can dynamically set the number of reducers if this property is set:
set hive.tez.auto.reducer.parallelism = true;
In this case the number of reducers initially started may be bigger because it was estimated based on size; at runtime the extra reducers can be removed.
One reducer can process many keys; how many depends on the data size and on the bytes.per.reducer and reducer-limit configuration settings. For a query like the one in your example, the same key will always go to the same reducer, because each reducer container runs in isolation and all rows with a particular key must reach a single reducer for the aggregate for that key to be computed.
Extra reducers can be forced (mapreduce.job.reduces=N) or started automatically because of a wrong estimate (due to stale stats); if they are not removed at run time, they will do nothing and finish quickly because there is nothing to process. But such reducers will still be scheduled and have containers allocated, so it is better not to force extra reducers and to keep stats fresh for better estimates.
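To see why one reducer handles many keys, it helps to think in plain MapReduce terms (which the Hive MR engine compiles down to): the framework calls reduce() once per distinct key that falls into that reducer's partition. A minimal sum-per-key reducer sketch (class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once for each distinct key assigned to this reducer's partition;
        // all values for that key arrive together, so the sum is complete.
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
    }
}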

In hadoop, 1 reduce or number of reduces = number of mappers

In Hadoop, what is the difference between using n mappers and n reducers, versus n mappers and 1 reducer?
In the case of using 1 reducer, which computer (of the mappers) runs the reduce phase if I have 3 computers?
The number of mappers is controlled by the amount of data being processed. Reducers are controlled either by the developer or different system parameters.
To override the number of reducers:
set mapreduce.job.reduces=#;
or, if it is a Hive job and you want more control over how much work each reducer has to do, you can tweak certain parameters such as:
hive.exec.reducers.bytes.per.reducer.
You can still override this by using mapreduce.job.reduces; using bytes per reducer simply lets you control how much each reducer processes.
As for controlling where the reducers run, you really cannot control that except by using Node Labels. This would mean controlling where all of the tasks in the job run, not just the reducers.

Number of reducers in hadoop

I was learning Hadoop and found the number of reducers very confusing:
1) Number of reducers is same as number of partitions.
2) Number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).
3) Number of reducers is set by mapred.reduce.tasks.
4) Number of reducers is closest to: A multiple of the block size * A task time between 5 and 15 minutes * Creates the fewest files possible.
I am very confused. Do we explicitly set the number of reducers, or is it done by the MapReduce program itself?
How is the number of reducers calculated? Please tell me how to calculate the number of reducers.
1 - The number of reducers is the same as the number of partitions - False. A single reducer might work on one or more partitions. But a chosen partition will be fully processed on the reducer on which it was started.
2 - That is just the theoretical maximum number of reducers you can configure for a Hadoop cluster, and it is very much dependent on the kind of data you are processing too (which decides how much heavy lifting the reducers are burdened with).
3 - The mapred-site.xml configuration is just a suggestion to YARN. Internally the ResourceManager has its own algorithm running, optimizing things on the go. So that value is not really the number of reducer tasks running every time.
4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't always have 128*5 as the minimum number of reducers. That, again, is false, I believe.
There is no fixed number of reducer tasks that can be configured or calculated. It depends on how many resources are actually available to allocate at that moment.
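For reference, point 1 is easy to see from Hadoop's default HashPartitioner: every key is hashed into one of the configured reduce tasks, so when there are fewer reducers than distinct keys, many keys land on the same reducer. Its logic is essentially:

import org.apache.hadoop.mapreduce.Partitioner;

// Essentially org.apache.hadoop.mapreduce.lib.partition.HashPartitioner:
// each key is mapped to one of the numReduceTasks partitions by its hash code.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}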
The number of reducers is internally calculated from the size of the data being processed, unless you explicitly specify it using the API below in the driver program:
job.setNumReduceTasks(x)
By default, one reducer is used per 1 GB of data.
So if you are working with less than 1 GB of data and you don't specifically set the number of reducers, 1 reducer will be used.
Similarly, if your data is 10 GB, 10 reducers will be used.
You can change this configuration as well, specifying a bigger or smaller size instead of 1 GB.
The property in Hive for setting the size per reducer is:
hive.exec.reducers.bytes.per.reducer
You can view this property by firing the set command in the Hive CLI.
The partitioner only decides which data goes to which reducer.
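As a rough, back-of-the-envelope illustration of the size-based estimate described above (this is not Hive's actual code; the constants are just the values mentioned in this answer):

public class ReducerEstimate {
    public static void main(String[] args) {
        long inputBytes = 10L * 1024 * 1024 * 1024;      // e.g. 10 GB of input
        long bytesPerReducer = 1L * 1024 * 1024 * 1024;  // hive.exec.reducers.bytes.per.reducer (1 GB, as above)
        int maxReducers = 1009;                          // hive.exec.reducers.max (commonly 1009)
        long estimated = (inputBytes + bytesPerReducer - 1) / bytesPerReducer;  // ceiling division
        long reducers = Math.min(maxReducers, estimated);
        System.out.println(reducers);                    // prints 10
    }
}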
Your job may or may not need reducers, it depends on what are you trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so, and which produce at least one HDFS block’s worth of output. Too many reducers and you end up with lots of small files.
The partitioner makes sure that the same keys from multiple mappers go to the same reducer. This doesn't mean that the number of partitions is equal to the number of reducers. However, you can specify the number of reduce tasks in the driver program using the job instance, like job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program, it is picked up from mapred.reduce.tasks, which has a default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), i.e. all mapper output will go to the same reducer.
Also, note that the programmer does not have control over the number of mappers, as it depends on the input splits, whereas the programmer can control the number of reducers for any job.

Dynamic Number of Reducers in Hadoop MR Application

Is there any means to set the number of reduce tasks once a job is submitted? For example, if I need to collect English words based on their starting letter, I can directly set the number of reduce tasks to 26. But if a scenario arises where I cannot predetermine the number of reducers required, is there any means to accomplish the requirement? Here the requirement is independent of the number of nodes in the cluster; it depends only on the keys being processed. Say, for example, the number of reducers should increment by one each time a new key is met.
Thanks in advance for any support.
Is there any means to set the number of reduce tasks once a job is submitted?
No
For example if I need to collect English words based on start alphabet, I can directly set the number of reduce tasks as 26.
Even in the above scenario, you need not have 26 reducers, but only 1 reducer. The reduce function is called again and again, once for each key, by the Hadoop framework. MultipleOutputFormat can be used to write the words to different files based on the key/value pair (the first letter).
The criteria for the number of reducers for the job should be the amount of data it's processing. Also, remember that the reducer taking the most time will determine the time for the completion of the job.
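For the newer mapreduce API, MultipleOutputs plays the same role as MultipleOutputFormat. A rough sketch of a single reducer that writes words to per-letter outputs (class name and bucket naming are illustrative, and it assumes words start with a letter):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class LetterBucketReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        // Route the output by the word's first letter, producing files like a-r-00000, b-r-00000, ...
        String bucket = word.toString().substring(0, 1).toLowerCase();
        mos.write(word, new IntWritable(sum), bucket);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}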

number of reducers for 1 task in MapReduce

In a typical MapReduce setup (like Hadoop), how many reducers are used for 1 task, for example, counting words? My understanding of MapReduce from Google is that only 1 reducer is involved. Is that correct?
For example, word count will divide the input into N chunks, and N map tasks will be running, producing the (word, count) lists. My question is: once the map phase is done, will there be only ONE reducer instance running to compute the result, or will there be reducers running in parallel?
The simple answer is that the number of reducers does not have to be 1 and, yes, reducers can run in parallel. As I mentioned above, this is user defined or derived.
To keep things in context I will refer to Hadoop in this case so you have an idea of how things work. If you are using the streaming API in Hadoop (0.20.2) you will have to explicitly define how many reducers you would like to run since by default, only 1 reduce task will be launched. You do so by passing the number of reducers to the -D mapred.reduce.tasks=# of reducers argument. The Java API will try to derive the number of reducers you will need but again you can explicitly set that too. In both cases, there is a hard cap on the number of reducers you can run per node and that is set in your mapred-site.xml configuration file using mapred.tasktracker.reduce.tasks.maximum.
On a more conceptual note, you can look at this post on the hadoop wiki that talks about choosing the number of map and reduce tasks.
In the case of the simple word count example, it would make sense to use only one reducer.
If you want the result of the computation to be only one number, you have to use one reducer (2 or more reducers would give you 2 or more output files).
If this reducer takes a long time to complete, you can think of chaining multiple reduce phases, where the reducers in the next phase sum the results from the previous reducers.
This depends entirely on the situation. In some cases, you don't have any reducers: everything can be done map-side. In other cases, you cannot avoid having one reducer, but generally this comes in a 2nd or 3rd map/reduce job that condenses earlier results. Generally, however, you want to have a lot of reducers, or else you are losing a lot of the power of MapReduce! In word count, for example, the result of your mappers will be (word, count) pairs. These pairs are then partitioned based on the word, such that each reducer will receive the same words and can give you the ultimate sum. Each reducer then outputs the result. If you wanted to, you could then shoot off another M/R job that took all of these files and concatenated them; that job would only have one reducer.
The default value is 1.
If you are considering Hive or Pig, then the number of reducers depends on the query, e.g. group by, sum, etc.
In the case of your own MapReduce code, it can be defined with setNumReduceTasks on the job/conf.
job.setNumReduceTasks(3);
Most of the time this is done when you override getPartition(), i.e. when you are using a custom partitioner:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // With zero reduce tasks there is only one possible partition
        if (numReduceTasks == 0)
            return 0;
        if (some logic)       // placeholder: your own routing condition
            return 0;
        if (some logic)       // placeholder: another routing condition
            return 1;
        else
            return 2;
    }
}
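To use such a partitioner, register it in the driver together with a matching number of reduce tasks, for example:

job.setPartitionerClass(CustomPartitioner.class);
job.setNumReduceTasks(3);  // one reduce task for each possible return value (0, 1, 2)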
One thing you will notice is that the number of reducers equals the number of part files in the output.
Let me know if you have doubts.
The reducers run in parallel. Whatever number of reducers you have set for your job, whether by changing the mapred-site.xml config file, by passing it on the command line when running the job, or by setting it in the program, that number of reducers will run in parallel. It is not necessary to keep it at 1. By default its value is 1.
