Controlling number of map and reduce jobs spawned? - hadoop

I am trying to understand how many map and reduce tasks get started for a job, and how to control that number.
Say I have a 1TB file in HDFS and my block size is 128MB.
For an MR job on this 1TB file, if I specify the input split size as 256MB, how many map and reduce tasks get started? From my understanding this depends on the split size, i.e. number of map tasks = total size of file / split size, which in this case works out to 1024 * 1024 MB / 256 MB = 4096. So the number of map tasks started by Hadoop is 4096.
1) Am I right?
2) If I think this is an inappropriate number, can I tell Hadoop to start fewer (or more) tasks? If yes, how?
And what about the number of reduce tasks spawned? I think this is totally controlled by the user.
3) But how and where should I specify the number of reduce tasks required?

1. Yes, you're right. Number of mappers = (size of data) / (input split size). So, in your case it would be 4096.
As per my understanding, before Hadoop 2.7 you could only hint to the framework how many mappers to create with conf.setNumMapTasks(int num); the actual number of mappers was still decided from the input splits. From Hadoop 2.7 you can cap the number of simultaneously running map tasks with mapreduce.job.running.map.limit. See this JIRA ticket
By default the number of reducers is 1. You can change it with job.setNumReduceTasks(integer_number);
You can also provide this parameter from the CLI:
-Dmapred.reduce.tasks=<num reduce tasks>
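For reference, a minimal driver sketch, assuming hypothetical input/output paths and illustrative values, that applies the two settings mentioned above: the Hadoop 2.7+ running-map limit and an explicit reducer count.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap how many map tasks may run at the same time (Hadoop 2.7+); 100 is illustrative.
        conf.setInt("mapreduce.job.running.map.limit", 100);

        Job job = Job.getInstance(conf, "reducer-count demo");
        job.setJarByClass(MyDriver.class);
        // job.setMapperClass(...); job.setReducerClass(...);  // your mapper/reducer here

        // Explicitly set the number of reduce tasks (the default is 1).
        job.setNumReduceTasks(8);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}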

Related

Hadoop Performance Tuning

I increased the input split size from 128MB to 256MB, and the job's execution time decreased by a minute.
But I could not understand this behavior. Why is it happening? In what scenarios can we tune the input split size?
Is it a consistent or a one-off reading? Is this on your local Hadoop installation or on a cluster?
I would suggest recording the number of mappers over several runs with input split sizes of 128MB and 256MB. That may hint at why the execution time decreased by a minute.
The number of input splits corresponds to the number of mappers needed to process the input. If this number is higher than the map slots available on your cluster, the job has to wait until one set of mappers has run before it can process the remaining ones. However, if the number of input splits is smaller (e.g. with 256MB splits in your case), the number of map tasks to run is correspondingly lower. If this number is less than or equal to the number of map slots on your cluster, there is a chance that all of your map tasks run simultaneously, which may improve your job execution time.
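As an illustration of the knob involved, here is a hedged driver sketch (job wiring omitted; the 256MB figure mirrors the question) that pins the split size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size demo");
        long splitSize = 256L * 1024 * 1024; // 256MB
        // Equivalent to setting mapreduce.input.fileinputformat.split.minsize / .maxsize
        // (mapred.min.split.size / mapred.max.split.size in the old property names).
        FileInputFormat.setMinInputSplitSize(job, splitSize);
        FileInputFormat.setMaxInputSplitSize(job, splitSize);
        // Fewer, larger splits -> fewer map tasks -> less per-task startup overhead,
        // as long as the cluster's map slots are still kept busy.
        // ... the rest of the job setup (mapper, reducer, paths) would go here ...
    }
}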

Number of reducers in hadoop

I was learning Hadoop and found the number of reducers very confusing:
1) Number of reducers is the same as the number of partitions.
2) Number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).
3) Number of reducers is set by mapred.reduce.tasks.
4) Number of reducers is closest to: a multiple of the block size, a task time between 5 and 15 minutes, and whatever creates the fewest files possible.
I am very confused. Do we explicitly set the number of reducers, or is it done by the MapReduce program itself?
How is the number of reducers calculated? Please tell me how to calculate the number of reducers.
1 - The number of reducers is the same as the number of partitions - False. A single reducer might work on one or more partitions, but a given partition is processed entirely by the reducer it is assigned to.
2 - That is just a theoretical maximum number of reducers you can configure for a Hadoop cluster. It also depends very much on the kind of data you are processing (which decides how much heavy lifting the reducers are burdened with).
3 - The mapred-site.xml configuration is just a suggestion to YARN. Internally the ResourceManager runs its own algorithm, optimizing things on the go, so that value is not necessarily the number of reduce tasks that run every time.
4 - This one seems a bit unrealistic. My block size might be 128MB, and I can't always have 128*5 as the minimum number of reducers. That, again, is false, I believe.
There is no fixed number of reducer tasks that can be configured or calculated. It depends on how many resources are actually available to allocate at that moment.
The number of reducers is internally calculated from the size of the data being processed, unless you explicitly specify it in the driver program using the API below:
job.setNumReduceTasks(x)
By default, one reducer is used for roughly every 1 GB of data.
So if you are working with less than 1 GB of data and you don't explicitly set the number of reducers, 1 reducer will be used.
Similarly, if your data is 10 GB, 10 reducers will be used.
You can change this configuration as well, specifying a bigger or smaller size than 1 GB.
The property in Hive for setting the size of data per reducer is:
hive.exec.reducers.bytes.per.reducer
You can view this property by running the set command in the Hive CLI (set hive.exec.reducers.bytes.per.reducer;).
The Partitioner only decides which data goes to which reducer.
Your job may or may not need reducers; it depends on what you are trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so and produce at least one HDFS block's worth of output. With too many reducers you end up with lots of small files.
The Partitioner makes sure that the same keys from multiple mappers go to the same reducer. This doesn't mean that the number of partitions is equal to the number of reducers. However, you can specify the number of reduce tasks in the driver program using the job instance, like job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program, it is picked from mapred.reduce.tasks, which has a default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), i.e. all mapper output will go to a single reducer.
Also, note that the programmer does not have direct control over the number of mappers, since it depends on the input splits, whereas the programmer can control the number of reducers for any job.
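To make the partitioner/reducer relationship concrete, here is a hedged sketch with a made-up partitioning rule; the point is only that identical keys always produce the same partition number, and that the number of partitions per map output follows the reducer count set in the driver.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every record with the same key gets the same partition number,
// so it ends up on the same reducer.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        int firstChar = k.isEmpty() ? 0 : k.charAt(0);
        return firstChar % numReduceTasks;
    }
}

// In the driver:
// job.setPartitionerClass(FirstCharPartitioner.class);
// job.setNumReduceTasks(2);   // each map output is split into 2 partitions

Each map task then produces exactly numReduceTasks partitions, which is why the partition count follows the reducer count rather than the other way around.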

Default number of reducers

In Hadoop, if we have not set the number of reducers, how many reducers will be created?
The number of mappers is dependent on (total data size) / (input split size).
E.g. if the data size is 1 TB and the input split size is 100 MB, then the number of mappers will be (1000*1000)/100 = 10000 (ten thousand).
What factors does the number of reducers depend on? How many reducers are created for a job?
How Many Reduces? (from the official documentation)
The right number of reduces seems to be 0.95 or 1.75 multiplied by
(no. of nodes) * (no. of maximum containers per node).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.
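For example, on a hypothetical cluster of 10 nodes with 8 maximum containers per node, this guideline gives 0.95 * 10 * 8 = 76 reducers for the single-wave case, or 1.75 * 10 * 8 = 140 for the two-wave case.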
This article covers mapper count too.
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
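(The arithmetic: 10 TB / 128 MB = 10 * 1024 * 1024 MB / 128 MB = 81,920 splits, which the documentation rounds to 82,000 maps.)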
If you want to change the default value of 1 for the number of reducers, you can set the property below (from Hadoop 2.x) as a command-line parameter:
mapreduce.job.reduces
OR
you can set programmatically with
job.setNumReduceTasks(integer_numer);
Have a look at one more related SE question: What is Ideal number of reducers on Hadoop?
By default the number of reducers is set to 1.
You can change it by adding the parameter mapred.reduce.tasks on the command line, in the driver code, or in the conf file that you pass.
e.g: Command Line Argument: bin/hadoop jar ... -Dmapred.reduce.tasks=<num reduce tasks>
or, in Driver code as: conf.setNumReduceTasks(int num);
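Note that for -D options to take effect, the driver typically needs to go through ToolRunner (or GenericOptionsParser). A hedged sketch, with the class name and job wiring as placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already holds any -D properties parsed by ToolRunner,
        // e.g. -Dmapreduce.job.reduces=10 or -Dmapred.reduce.tasks=10
        Job job = Job.getInstance(getConf(), "reduce-count via -D");
        job.setJarByClass(MyTool.class);
        // ... set mapper/reducer/input/output here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
    }
}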
Recommended read:
https://wiki.apache.org/hadoop/HowManyMapsAndReduces

Reducing number of Map tasks during Hadoop Streaming

I have a folder with 3072 files of ~50MB each. I'm running a Python script over this input using Hadoop Streaming and extracting some data.
On a single file, the script doesn't take more than 2 seconds. However, running this on an EMR cluster with 40 m1.large task nodes and 3072 files takes 12 minutes.
Hadoop streaming does this:
14/11/11 09:58:51 INFO mapred.FileInputFormat: Total input paths to process : 3072
14/11/11 09:58:52 INFO mapreduce.JobSubmitter: number of splits:3072
And hence 3072 map tasks are created.
Of course the MapReduce overhead comes into play. From some initial research, it seems to be very inefficient if map tasks take less than 30-40 seconds.
What can I do to reduce the number of map tasks here? Ideally, if each task handled around 10-20 files it would greatly reduce the overhead.
I've tried playing around with the block size, but since the files are all around 50MB in size, they're already in separate blocks, and increasing the block size makes no difference.
Unfortunately you can't. The number of map tasks for a given job is driven by the number of input splits. For each input split a map task is spawned. So, over the lifetime of a mapreduce job the number of map tasks is equal to the number of input splits.
mapred.min.split.size specifies the minimum split size a mapper will process, so increasing it should reduce the number of mappers.
Note, however, that with the standard FileInputFormat a split never spans multiple files, so when the input is many separate files that are each smaller than the desired split, raising the minimum split size alone may not lower the mapper count.
Check out the link
Behavior of the parameter "mapred.min.split.size" in HDFS

Hadoop cluster - how to know the ideal maximum number of map/reduce tasks for each tasktracker

I've just set up a Hadoop cluster with Hadoop 0.20.205. I have a master (NameNode and JobTracker) and two other boxes (slaves).
I'm trying to understand, how to define the number of map and reduce tasks to use.
So far I have understood that I can set the maximum number of map and reduce tasks that each TaskTracker can handle simultaneously with mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.
Also, I can define the maximum number of map tasks the whole cluster can run simultaneously with mapred.map.tasks. Is that right?
If so, how can I know what the value for mapred.tasktracker.map.tasks.maximum should be? I see that the default is 2. But why? What are the pros and cons of increasing or decreasing this value?
I don't think that there is a rule for that (like the rule for setting the number of reducers).
What I do is set the number of mappers and reducers to the number of available cores minus 1 on each machine. Intuitively, this leaves each machine some headroom for other processes (like cluster communication). But I may be wrong. Anyway, this is the only guidance I found, in "Pro Hadoop": it suggests using as many mappers as there are available cores, and one or two reducers.
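For example, on a hypothetical slave with 8 cores, that rule of thumb would mean setting mapred.tasktracker.map.tasks.maximum (and the reduce equivalent) to 7, keeping roughly one core's worth of headroom for the TaskTracker/DataNode daemons and cluster communication.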
I hope it helps.
Here is what I propose. Hope it helps!
Run "hadoop fsck /" in the master node to find out the size and number of blocks. For e.g.:
...
Total size: 21600037259 B
Total dirs: 78
Total files: 152
Total blocks (validated): 334 (avg. block size 64670770 B)
...
I set up map tasks as num_of_blocks / 10:
set mapred.map.tasks=33;
I set up reduce tasks as block_size (in MB) * 2:
set mapred.reduce.tasks=124;
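(Plugging in the fsck output above: 334 blocks / 10 ≈ 33 map tasks, and an average block size of roughly 62 MB, times 2, gives roughly 124 reduce tasks.)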
So far that's the best configuration I've found, and you'll have to adjust it according to your cluster's configuration.
