In Hadoop, 1 reducer or number of reducers = number of mappers

In Hadoop, what is the difference between using n mappers with n reducers and using n mappers with 1 reducer?
When only 1 reducer is used, which of the computers (the ones running the mappers) runs the reduce phase, if I have 3 computers?

The number of mappers is controlled by the amount of data being processed. The number of reducers is controlled either by the developer or by various system parameters.
To override the number of reducers:
set mapreduce.job.reduces=#;
or, if it is a Hive job and you want finer control over how much work each reducer does, you can tune parameters such as:
hive.exec.reducers.bytes.per.reducer
You can still override the count with mapreduce.job.reduces; the bytes-per-reducer setting simply lets you control how much data each reducer processes.
As for controlling where the reducers run, you really cannot, except by using Node Labels. That would mean controlling where all of the tasks in the job run, not just the reducers.

Related

Number of reducers in hadoop

I was learning Hadoop, and I found the number of reducers very confusing:
1) Number of reducers is same as number of partitions.
2) Number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).
3) Number of reducers is set by mapred.reduce.tasks.
4) Number of reducers is closest to: A multiple of the block size * A task time between 5 and 15 minutes * Creates the fewest files possible.
I am very confused. Do we explicitly set the number of reducers, or is it done by the MapReduce program itself?
How is the number of reducers calculated? Please tell me how to calculate the number of reducers.
1 - The number of reducers is the same as the number of partitions - False. A single reducer might work on one or more partitions, but a given partition is processed entirely on the reducer where it starts.
2 - That is just a theoretical maximum number of reducers you can configure for a Hadoop cluster. It also depends heavily on the kind of data you are processing, which decides how much heavy lifting the reducers have to do.
3 - The mapred-site.xml configuration is just a suggestion to YARN. Internally the ResourceManager runs its own algorithm, optimizing things on the go, so that value is not necessarily the number of reducer tasks running at any given time.
4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't always have a minimum of 128*5 reducers. That, again, is false, I believe.
There is no fixed number of reducer tasks that can be configured or calculated. It depends on how many resources are actually available to allocate at that moment.
The number of reducers is calculated internally from the size of the data being processed if you don't explicitly specify it using the API below in the driver program:
job.setNumReduceTasks(x)
By default, one reducer is used per 1 GB of data.
So if you are working with less than 1 GB of data and you do not explicitly set the number of reducers, 1 reducer is used.
Similarly, if your data is 10 GB, 10 reducers are used.
You can also change the configuration so that a larger or smaller size than 1 GB is used.
The Hive property for setting the amount of data per reducer is:
hive.exec.reducers.bytes.per.reducer
You can view this property by running the set command in the Hive CLI.
The partitioner only decides which data goes to which reducer.
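The size-based rule above (one reducer per 1 GB, so 10 GB gives 10 reducers) can be sketched in a few lines of plain Java. This is only an illustration of the arithmetic, not Hive's actual code: the class and method names are made up, and the cap of 999 stands in for whatever upper limit your installation uses (I am assuming something like hive.exec.reducers.max).

```java
// Hypothetical sketch of a size-based reducer estimate; not Hive's real implementation.
final class HiveReducerEstimate {
    static final long ONE_GB = 1024L * 1024L * 1024L;

    // Returns ceil(inputBytes / bytesPerReducer), clamped to the range [1, maxReducers].
    static int estimateReducers(long inputBytes, long bytesPerReducer, int maxReducers) {
        if (bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("bytesPerReducer and maxReducers must be positive");
        }
        long needed = (inputBytes + bytesPerReducer - 1) / bytesPerReducer; // integer ceiling
        needed = Math.max(1L, needed);                                      // at least one reducer
        return (int) Math.min((long) maxReducers, needed);                  // never above the cap
    }

    public static void main(String[] args) {
        int cap = 999; // assumed cap, for illustration only
        System.out.println(estimateReducers(512L * 1024L * 1024L, ONE_GB, cap)); // under 1 GB of input
        System.out.println(estimateReducers(10L * ONE_GB, ONE_GB, cap));         // 10 GB of input
    }
}
```

Lowering bytesPerReducer in this sketch gives more, smaller reducers for the same input, which is the effect described for hive.exec.reducers.bytes.per.reducer.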
Your job may or may not need reducers; it depends on what you are trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so, and which produce at least one HDFS block’s worth of output. Too many reducers and you end up with lots of small files.
The partitioner makes sure that the same keys from multiple mappers go to the same reducer. This doesn't mean that the number of partitions is equal to the number of reducers. However, you can specify the number of reduce tasks in the driver program using the job instance, for example job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program, it is taken from mapred.reduce.tasks, which has a default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), i.e. all mapper output goes to the same reducer.
Also, note that the programmer does not have control over the number of mappers, as it depends on the input splits, whereas the programmer can control the number of reducers for any job.
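To make the "same key always goes to the same reducer" point concrete, here is a small standalone sketch of hash-based partitioning. It follows the formula used by Hadoop's default hash partitioner as far as I know (hash code, sign bit masked off, modulo the number of reduce tasks), but the class and method names are mine and it uses plain Strings instead of Hadoop's Writable types.

```java
import java.util.Arrays;
import java.util.List;

// Standalone illustration of hash partitioning; uses no Hadoop classes.
final class HashPartitionSketch {
    // Maps a key to a partition index in [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "apple");
        for (int reducers : new int[] {1, 3}) {
            for (String key : keys) {
                // A repeated key ("apple") always lands in the same partition for a given reducer count.
                System.out.println(reducers + " reducer(s): " + key + " -> " + partitionFor(key, reducers));
            }
        }
    }
}
```

With one reduce task every key maps to partition 0, which is the "all mapper output goes to the same reducer" case.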

increase the number of map and reduce function

I have a question.
I want to increase the number of my map and reduce functions to match the number of my input records. When I execute System.out.println(conf.getNumReduceTasks()) and System.out.println(conf.getNumMapTasks()) it shows me:
1 1
and when I execute conf.setNumReduceTasks(1000000) and conf.setNumMapTasks(1000000) and then run the println calls again, it shows me:
1000000 1000000
But I think there is no change in my MapReduce program's execution time. My input comes from Cassandra; specifically, it is the rows of a Cassandra column family, about 362,000 rows.
I want to set the number of my map and reduce functions to the number of input rows.
What should I do?
Setting the number of map/reduce tasks for your map/reduce job does define how many map/reduce processes will be used to process your job. Consider whether you really need that many Java processes.
That said, the number of map tasks is mostly determined automatically; setting the number of map tasks is only a hint that can increase the number of maps that were determined by Hadoop.
For reduce tasks, the default is 1 and the practical limit is around 1,000.
See: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
It's also important to understand that each node of your cluster also has a maximum number of map/reduce tasks that can execute concurrently. This is set by the following configuration settings:
mapred.tasktracker.map.tasks.maximum
and
mapred.tasktracker.reduce.tasks.maximum
The default for both of these is 2.
So increasing the number of map/reduce tasks will be limited to the number of tasks that can run simultaneously per node. This may be one reason you aren't seeing a change in execution time for your job.
See: http://hadoop.apache.org/docs/stable/mapred-default.html
The summary is:
Let Hadoop determine the number of maps, unless you want more map tasks.
Use the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum settings to control how many tasks can run at one time.
The maximum value for the number of reduce tasks should be somewhere between 1 and 2 times (mapred.tasktracker.reduce.tasks.maximum * #nodes). You also have to take into account how many map/reduce jobs you expect to run at once, so that a single job doesn't consume all the available reduce slots.
A value of 1,000,000 is almost certainly too high for either setting; it's not practical to run that many Java processes. I expect that such high values are simply being ignored.
After setting mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to the number of tasks your nodes are able to run simultaneously, try increasing your job's map/reduce tasks incrementally.
You can see the actual number of tasks used by your job in the job.xml file to verify your settings.
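As a rough illustration of why a huge task count does not buy more parallelism, the sketch below works out how many tasks can run at once and how many "waves" are needed to get through them all. It is a simplified model with made-up names: it assumes every task takes one slot and ignores scheduling overhead, and the 3-node cluster is just an example.

```java
// Simplified model of per-node task slots; names and numbers are illustrative only.
final class TaskWaveEstimate {
    // Tasks that can run at the same time: nodes * per-node maximum.
    static long concurrentSlots(int nodes, int maxTasksPerNode) {
        return (long) nodes * maxTasksPerNode;
    }

    // Number of sequential rounds needed to run all tasks: ceil(tasks / slots).
    static long waves(long tasks, long slots) {
        if (slots <= 0) {
            throw new IllegalArgumentException("slots must be positive");
        }
        return (tasks + slots - 1) / slots;
    }

    public static void main(String[] args) {
        long slots = concurrentSlots(3, 2); // 3 nodes, per-node maximum of 2 (the default quoted above)
        System.out.println("slots = " + slots);
        System.out.println("waves for 6 tasks = " + waves(6, slots));
        System.out.println("waves for 1,000,000 tasks = " + waves(1_000_000L, slots));
    }
}
```

In this model, asking for more tasks than there are slots only adds waves; the number of tasks running at any moment stays the same.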

pseudo distributed number map and reduce tasks

I am a newbie to Hadoop. I have successfully configured a Hadoop setup in pseudo-distributed mode. Now I would like to know the logic for choosing the number of map and reduce tasks. What should we refer to?
Thanks
You cannot generalize how the number of mappers/reducers should be set.
Number of Mappers:
You cannot explicitly set the number of mappers to a certain number (there are parameters for this, but they do not take effect). It is decided by the number of input splits created by Hadoop for your given input. You may control this by setting the mapred.min.split.size parameter. For more, read the InputSplit section here. If a lot of mappers are being generated because of a huge number of small files and you want to reduce the number of mappers, you will need to combine data from more than one file. Read this: How to combine input files to get to a single mapper and control number of mappers.
To quote from the wiki page:
The number of maps is usually driven by the number of DFS blocks in
the input files. Although that causes people to adjust their DFS block
size to adjust the number of maps. The right level of parallelism for
maps seems to be around 10-100 maps/node, although we have taken it up
to 300 or so for very cpu-light map tasks. Task setup takes awhile, so
it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The
mapred.map.tasks parameter is just a hint to the InputFormat for the
number of maps. The default InputFormat behavior is to split the total
number of bytes into the right number of fragments. However, in the
default case the DFS block size of the input files is treated as an
upper bound for input splits. A lower bound on the split size can be
set via mapred.min.split.size. Thus, if you expect 10TB of input data
and have 128MB DFS blocks, you'll end up with 82k maps, unless your
mapred.map.tasks is even larger. Ultimately the InputFormat determines
the number of maps.
The number of map tasks can also be increased manually using the
JobConf's conf.setNumMapTasks(int num). This can be used to increase
the number of map tasks, but will not set the number below that which
Hadoop determines via splitting the input data.
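The arithmetic in the quote (10 TB of input with 128 MB blocks giving about 82k maps) can be reproduced with a short sketch. It follows the split-size rule the quote describes, max(minSize, min(goalSize, blockSize)) with goalSize derived from the mapred.map.tasks hint, but it is a simplification: it treats the input as one big file and ignores per-file boundaries and any slack factor the real InputFormat applies. Class and method names are hypothetical.

```java
// Simplified estimate of the number of map tasks for file-based input; not Hadoop's real code.
final class SplitEstimate {
    static final long MB = 1024L * 1024L;
    static final long TB = MB * 1024L * 1024L;

    // The block size acts as an upper bound and minSplitSize as a lower bound on the split size.
    static long splitSize(long goalSize, long minSplitSize, long blockSize) {
        return Math.max(minSplitSize, Math.min(goalSize, blockSize));
    }

    // mapTasksHint plays the role of mapred.map.tasks: it only influences the goal size.
    static long estimateMaps(long totalBytes, long blockSize, long minSplitSize, int mapTasksHint) {
        long goalSize = totalBytes / Math.max(1, mapTasksHint);
        long size = Math.max(1L, splitSize(goalSize, minSplitSize, blockSize));
        return (totalBytes + size - 1) / size; // integer ceiling
    }

    public static void main(String[] args) {
        // 10 TB of input, 128 MB blocks, no lower bound, default hint of 1.
        System.out.println(estimateMaps(10L * TB, 128L * MB, 1L, 1));
        // Raising the minimum split size to 256 MB halves the number of maps.
        System.out.println(estimateMaps(10L * TB, 128L * MB, 256L * MB, 1));
        // A hint larger than the block-based count shrinks the goal size and increases the maps.
        System.out.println(estimateMaps(10L * TB, 128L * MB, 1L, 200_000));
    }
}
```

The first call gives 81,920 maps, which matches the "82k" in the quote. The third call shows the hint only taking effect when it asks for more maps than the block-based split would produce.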
Number of Reducers:
You can explicitly set the number of reducers. Just set the parameter mapred.reduce.tasks. There are guidelines for setting this number, but usually the default number of reducers should be good enough. At times a single report file is required; in those cases you might want to set the number of reducers to 1.
Again to quote from wiki:
The right number of reduces seems to be 0.95 or 1.75 * (nodes *
mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can
launch immediately and start transferring map outputs as the maps
finish. At 1.75 the faster nodes will finish their first round of
reduces and launch a second round of reduces doing a much better job
of load balancing.
Currently the number of reduces is limited to roughly 1000 by the
buffer size for the output files (io.buffer.size * 2 * numReduces <<
heapSize). This will be fixed at some point, but until it is it
provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the
output directory, but usually that is not important because the next
map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as
the map tasks, via JobConf's conf.setNumReduceTasks(int num).
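For reference, the 0.95 / 1.75 rule from the quote is easy to compute. The sketch below uses integer percentages (95 and 175) to avoid floating-point rounding surprises; the names are made up, and the 10-node cluster with 2 reduce slots per node is only an example.

```java
// Illustrative helper for the 0.95 / 1.75 rule of thumb; names are hypothetical.
final class ReducerRuleOfThumb {
    // factorPercent is 95 for "one wave" or 175 for "roughly two waves".
    static int suggestedReducers(int nodes, int maxReduceTasksPerNode, int factorPercent) {
        long slots = (long) nodes * maxReduceTasksPerNode;
        return (int) (slots * factorPercent / 100); // rounds down
    }

    public static void main(String[] args) {
        int nodes = 10;       // example cluster size
        int perNodeMax = 2;   // example per-node reduce task maximum
        System.out.println("0.95 rule: " + suggestedReducers(nodes, perNodeMax, 95));
        System.out.println("1.75 rule: " + suggestedReducers(nodes, perNodeMax, 175));
    }
}
```

For this example the two rules give 19 and 35 reducers: the first leaves one slot of headroom in a single wave, the second deliberately oversubscribes so faster nodes pick up a second round.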
Actually, the number of mappers is primarily governed by the number of InputSplits created by the InputFormat you are using, and the number of reducers by the number of partitions you get after the map phase. Having said that, you should also keep in mind the number of slots available per slave, along with the available memory. But as a rule of thumb you could use this approach:
Take the number of virtual CPUs * 0.75, and that's the number of slots you can configure. For example, if you have 12 physical cores (or 24 virtual cores), you would have (24 * 0.75) = 18 slots. Now, based on your requirements, you can choose how many mappers and reducers you want to use. With 18 MR slots, you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers, or whatever you think is OK.
HTH
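The slot rule of thumb above is just arithmetic, but a tiny sketch makes it easy to check a proposed mapper/reducer split against the budget. The names are hypothetical and the 0.75 factor is taken straight from the answer, written as 3/4 to stay in integer math.

```java
// Illustrative slot budget check based on the "virtual CPUs * 0.75" rule of thumb.
final class SlotBudget {
    // Slots to configure: three quarters of the virtual cores, rounded down.
    static int slots(int virtualCores) {
        return virtualCores * 3 / 4;
    }

    // True if the chosen mapper and reducer slots fit within the budget.
    static boolean fits(int mapperSlots, int reducerSlots, int totalSlots) {
        return mapperSlots + reducerSlots <= totalSlots;
    }

    public static void main(String[] args) {
        int total = slots(24); // 24 virtual cores, as in the example
        System.out.println("total slots = " + total);
        System.out.println("9 + 9 fits: " + fits(9, 9, total));
    }
}
```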

Pseudo distributed : Need to change number of mapper nodes

I am using an Intel(R) Core(TM)2 Duo processor. I have installed Hadoop in pseudo-distributed mode. I have written a program which needs 50 mapper nodes. Is it possible to have 50 mapper nodes in pseudo-distributed mode, or will I be limited to 4 nodes (2 * number of cores)? I have tried setting "mapred.tasktracker.map.tasks.maximum" to 50, but there is no change in concurrency.
The maximum number of map and reduce tasks depends on the number of task trackers in your cluster and the maximum number of map/reduce tasks per node defined using the properties mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum.
I assume your map reduce job needs 50 map tasks in the default block size configuration. The number of map tasks needed for a job depends on the number of InputSplits for the processed data. You should definitely not depend on the number of needed map tasks or hard-code this limit in your program. Doing so would hurt the scaling of your map reduce job.
One option would be to set the maximum number of mapper tasks to 50. The number of available mapper tasks should be visible in the cluster summary section of the job tracker web UI. However, as your processor has only two cores, you should reconsider whether launching 50 mappers concurrently will have any positive impact on the performance of your map reduce job.

Hadoop slowstart configuration

What's an ideal value for "mapred.reduce.slowstart.completed.maps" for a Hadoop job? What are the rules to follow to set it appropriately?
Thanks!
It depends on a number of characteristics of your job, cluster and utilization:
How many map slots your job requires vs. the maximum map capacity: if you have a job that spawns thousands of map tasks but only have 10 map slots in total (an extreme case to demonstrate a point), then starting your reducers early could deprive other reduce tasks of the chance to execute. In this case I would set your slowstart to a large value (0.999 or 1.0). This is also true if your mappers take an age to complete: let someone else use the reducers.
If your cluster is relatively lightly loaded (there isn't contention for the reducer slots) and your mappers output a good volume of data, then a low value for slowstart will help your job finish earlier (while other map tasks execute, the map output data is moved to the reducers).
There are probably more
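To make the effect of the slowstart value concrete, here is a simplified model of the threshold: reducers become eligible to start once the fraction of completed map tasks reaches the configured value. This is my reading of the setting, not the scheduler's actual code, and the names are hypothetical.

```java
// Simplified model of the reduce slowstart threshold; not the scheduler's real implementation.
final class SlowstartSketch {
    // Number of map tasks that must finish before reducers may be scheduled.
    static int mapsRequiredBeforeReducers(int totalMaps, double slowstart) {
        if (slowstart < 0.0 || slowstart > 1.0) {
            throw new IllegalArgumentException("slowstart must be between 0.0 and 1.0");
        }
        return (int) Math.ceil(slowstart * totalMaps);
    }

    static boolean reducersMayStart(int completedMaps, int totalMaps, double slowstart) {
        return completedMaps >= mapsRequiredBeforeReducers(totalMaps, slowstart);
    }

    public static void main(String[] args) {
        int totalMaps = 1000;
        for (double slowstart : new double[] {0.0, 0.5, 1.0}) {
            System.out.println("slowstart=" + slowstart + " -> reducers wait for "
                    + mapsRequiredBeforeReducers(totalMaps, slowstart) + " completed maps");
        }
    }
}
```

With a value of 1.0 the reducers hold no slots until every map has finished, which is the "let someone else use the reducers" case; a low value lets them start copying map output early.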

Resources