Hadoop MapReduce: default number of mappers

If I don't specify the number of mappers, how would the number be determined? Is there a default setting read from a configuration file (such as mapred-site.xml)?

Adding more to what Chris added above:
The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps.
The right level of parallelism for maps seems to be around 10-100 maps per node, although this can go up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
You can increase the number of map tasks by using JobConf's conf.setNumMapTasks(int num). Note: this can increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
Finally controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits.
A lower bound on the split size can be set via mapred.min.split.size.
Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
Read more: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
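Putting the above together, here is a minimal old-API driver sketch (the class name and the concrete numbers are just examples, not from the question) showing where the map-count hint and the split-size lower bound live: the hint can only raise the number of maps, while mapred.min.split.size can lower it by forcing bigger splits.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapCountHintDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapCountHintDriver.class);
        conf.setJobName("map-count-hint");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Hint only: the InputFormat may still create more splits than this.
        conf.setNumMapTasks(10);

        // Lower bound on split size (bytes); raising it reduces the number of maps.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

        JobClient.runJob(conf);
    }
}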

It depends on a number of factors:
Input format and particular configuration properties for the format
for file based input formats (TextInputFormat, SequenceFileInputFormat etc):
Number of input files / paths
are the files splittable (typically compressed files are not, SequenceFiles are an exception to this)
block size of the files
There are probably more, but you hopefully get the idea
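On the splittability point above, this is roughly the check TextInputFormat itself performs, shown here as a subclass override of the new-API isSplitable() hook purely for illustration: uncompressed files are splittable, compressed files only if the codec supports splitting (bzip2 does, gzip does not).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplittabilityCheckInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (codec == null) {
            return true;                                    // plain, uncompressed file: splittable
        }
        return codec instanceof SplittableCompressionCodec; // bzip2: yes, gzip: no
    }
}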

Related

MapReduce: why is the number of splits (text file) more than 1, even for a tiny file?

I know the difference between a physical block and InputSplits in Hadoop.
BTW I am using Hadoop 2.0 (YARN processing).
I have a very tiny input dataset, maybe 1.5 MB in size. When I run a MapReduce program that consumes this tiny dataset, it shows 2 input splits during the run. Why should the tiny dataset be split into two when it is less than 128 MB in size?
In my understanding, the block size is configured to be 128 MB and an input split is a logical division of the data, i.e. where each split starts (in which node and which block) and where it ends. The starting and ending locations of the data define the split.
I don't get the reason for splits in such a tiny dataset.
Can someone explain?
thanks
nath
First, try to understand how the number of splits is decided. It depends on two things:
If you have not defined a custom split size, the default is used, which is the block size, in your case 128 MB.
This is the important part: if you have two small files, they will be stored in two different blocks, so the number of splits will be two.
Your answer is in the above two points. As extra info, the relation between the number of mappers and the number of splits is one-to-one, so the number of mappers will be the same as the number of splits.
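If you want to see the split count for yourself before running a full job, here is a small diagnostic sketch (new mapreduce API assumed; the input path is hypothetical) that calls the same getSplits() method the framework uses; the number it prints is the number of map tasks that will be launched.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-count");
        // Hypothetical path; point this at your own tiny dataset.
        FileInputFormat.addInputPath(job, new Path("/user/nath/tiny-input"));

        // Ask the input format for its splits, exactly as the framework does.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("number of splits: " + splits.size());
    }
}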
I had the same problem.
I tried a file of size 2.5 MB, and running the job printed this on the console:
number of splits:2
Job Counters
Launched map tasks=2
Logically speaking, it should be one split, as I am using the default settings:
split_size 128M, block_size 128M
However, I think the reason the MapReduce job is splitting the input is this config param:
mapreduce.job.maps: The default number of map tasks per job. Ignored when mapreduce.framework.name is "local".
The default value is 2, as you can see in the MapReduce defaults.
You can set it if you are using MRJob for Python:
MRJob.JOBCONF = {'mapreduce.job.maps': '1'}
By doing so I got one split. I then ran another file of about 450 MB with the same setting, and it was split into 4 (ceil(450/128)), which means this setting is ignored when more splits are needed.
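This behavior matches the sizing logic in the old-API FileInputFormat, where the map-count hint sets a per-split "goal size". Below is a simplified worked example in plain Java (constants chosen to mirror the 2.5 MB case above; this is a sketch of the idea, not the actual Hadoop source):

public class OldApiSplitMath {
    public static void main(String[] args) {
        long totalSize = 2_500_000L;          // ~2.5 MB input file
        int  numSplitsHint = 2;               // default mapreduce.job.maps / mapred.map.tasks
        long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block
        long minSize = 1L;                    // default mapred.min.split.size

        long goalSize  = totalSize / numSplitsHint;                        // ~1.25 MB per split
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize)); // goal size wins here
        long splits    = (totalSize + splitSize - 1) / splitSize;          // 2 splits

        System.out.println("splits = " + splits);
    }
}

With the 450 MB file, the goal size exceeds the 128 MB block size, so the block size becomes the upper bound and you get ceil(450/128) = 4 splits regardless of the hint.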

What is it in Hadoop that prevents us from setting the number of mappers?

From what I understand, there is no way to set the number of mappers in an MR job, even though we can set the number of reducers (see "how to limit the number of mappers" below).
As a concept, I don't see why we can't have a predetermined number of mappers and feed chunks of the text files to them.
To balance performance and workload distribution optimally, the framework determines the number of mappers from the number of input splits.
The Apache Hadoop wiki page http://wiki.apache.org/hadoop/HowManyMapsAndReduces goes into detail:
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.

how to limit the number of mappers

I explicitly specify the number of mappers within my Java program using conf.setNumMapTasks(), but when the job ends, the counter shows that the number of launched map tasks was more than the specified value. How do I limit the number of mappers to the specified value?
According to the Hadoop API, JobConf.setNumMapTasks is just a hint to the Hadoop runtime. The total number of map tasks equals the number of blocks in the input data to be processed.
However, it is possible to configure the number of map/reduce slots per node by using mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml. This way it's possible to configure the total number of mappers/reducers executing in parallel across the entire cluster.
Using conf.setNumMapTasks(int num) the number of mappers can be increased but cannot be reduced.
You cannot explicitly set the number of mappers to a number which is less than the number of mappers calculated by Hadoop. This is decided by the number of InputSplits created by Hadoop for your given set of input. You may control this by setting the mapred.min.split.size parameter.
To quote from the wiki page:
The number of maps is usually driven by the number of DFS blocks in
the input files. Although that causes people to adjust their DFS block
size to adjust the number of maps. The right level of parallelism for
maps seems to be around 10-100 maps/node, although we have taken it up
to 300 or so for very cpu-light map tasks. Task setup takes awhile, so
it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The
mapred.map.tasks parameter is just a hint to the InputFormat for the
number of maps. The default InputFormat behavior is to split the total
number of bytes into the right number of fragments. However, in the
default case the DFS block size of the input files is treated as an
upper bound for input splits. A lower bound on the split size can be
set via mapred.min.split.size. Thus, if you expect 10TB of input data
and have 128MB DFS blocks, you'll end up with 82k maps, unless your
mapred.map.tasks is even larger. Ultimately the InputFormat determines
the number of maps.
The number of map tasks can also be increased manually using the
JobConf's conf.setNumMapTasks(int num). This can be used to increase
the number of map tasks, but will not set the number below that which
Hadoop determines via splitting the input data.
Quoting the javadoc of JobConf#setNumMapTasks():
Note: This is only a hint to the framework. The actual number of
spawned map tasks depends on the number of InputSplits generated by
the job's InputFormat.getSplits(JobConf, int). A custom InputFormat is
typically used to accurately control the number of map tasks for the
job.
Hadoop also relaunches failed or long running map tasks in order to provide high availability.
You can limit the number of map tasks concurrently running on a single node. And you can limit the number of launched tasks, provided that you have big input files: you would have to write your own InputFormat class which is not splittable. Then Hadoop will run one map task for every input file that you have.
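A minimal sketch of such a non-splittable input format (the class name is hypothetical; the new mapreduce API is assumed). Because isSplitable() always returns false, every input file becomes exactly one split and therefore exactly one map task:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, regardless of file size or block size
    }
}

You would then register it in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class).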
According to [Partitioning your job into maps and reduces]:
The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
However, you can learn more about InputFormat.

Pseudo-distributed mode: number of map and reduce tasks

I am a newbie to Hadoop. I have successfully configured a Hadoop setup in pseudo-distributed mode. Now I would like to know the logic for choosing the number of map and reduce tasks. What do we refer to?
Thanks
You cannot generalize how the number of mappers/reducers should be set.
Number of Mappers:
You cannot explicitly set the number of mappers to a certain number (there are parameters to set this, but they don't come into effect). This is decided by the number of InputSplits created by Hadoop for your given set of input. You may control this by setting the mapred.min.split.size parameter. For more, read the InputSplit section here. If a lot of mappers are being generated due to a huge number of small files and you want to reduce the number of mappers, then you will need to combine data from more than one file. Read this: How to combine input files to get to a single mapper and control number of mappers.
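For the small-files case just mentioned, here is a minimal driver sketch (new mapreduce API on Hadoop 2.x assumed; the class name and the 128 MB figure are only examples) that packs many small files into fewer splits with CombineTextInputFormat:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(SmallFilesDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Pack many small files into combined splits of at most 128 MB each,
        // so the mapper count drops to roughly totalInputSize / 128 MB.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // ... set mapper, reducer and output classes as usual, then submit:
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}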
To quote from the wiki page:
The number of maps is usually driven by the number of DFS blocks in
the input files. Although that causes people to adjust their DFS block
size to adjust the number of maps. The right level of parallelism for
maps seems to be around 10-100 maps/node, although we have taken it up
to 300 or so for very cpu-light map tasks. Task setup takes awhile, so
it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The
mapred.map.tasks parameter is just a hint to the InputFormat for the
number of maps. The default InputFormat behavior is to split the total
number of bytes into the right number of fragments. However, in the
default case the DFS block size of the input files is treated as an
upper bound for input splits. A lower bound on the split size can be
set via mapred.min.split.size. Thus, if you expect 10TB of input data
and have 128MB DFS blocks, you'll end up with 82k maps, unless your
mapred.map.tasks is even larger. Ultimately the InputFormat determines
the number of maps.
The number of map tasks can also be increased manually using the
JobConf's conf.setNumMapTasks(int num). This can be used to increase
the number of map tasks, but will not set the number below that which
Hadoop determines via splitting the input data.
Number of Reducers:
You can explicitly set the number of reducers. Just set the parameter mapred.reduce.tasks. There are guidelines for setting this number, but usually the default number of reducers should be good enough. At times a single report file is required; in those cases you might want to set the number of reducers to 1.
Again to quote from wiki:
The right number of reduces seems to be 0.95 or 1.75 * (nodes *
mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can
launch immediately and start transfering map outputs as the maps
finish. At 1.75 the faster nodes will finish their first round of
reduces and launch a second round of reduces doing a much better job
of load balancing.
Currently the number of reduces is limited to roughly 1000 by the
buffer size for the output files (io.buffer.size * 2 * numReduces <<
heapSize). This will be fixed at some point, but until it is it
provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the
output directory, but usually that is not important because the next
map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as
the map tasks, via JobConf's conf.setNumReduceTasks(int num).
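As a concrete illustration of the guideline just quoted, here is a small old-API sketch (the cluster numbers are hypothetical) that applies the 0.95 factor and sets the reducer count, which, unlike the map count, is honored exactly:

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        int nodes = 10;              // hypothetical cluster size
        int reduceSlotsPerNode = 2;  // hypothetical mapred.tasktracker.reduce.tasks.maximum
        int reduces = (int) (0.95 * nodes * reduceSlotsPerNode);

        conf.setNumReduceTasks(reduces); // equivalent to setting mapred.reduce.tasks
        // conf.setNumReduceTasks(1);    // use 1 if a single output file is required

        System.out.println("reduces = " + reduces); // 19
    }
}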
Actually, the number of mappers is primarily governed by the number of InputSplits created by the InputFormat you are using, and the number of reducers by the number of partitions you get after the map phase. Having said that, you should also keep in mind the number of slots available per slave, along with the available memory. But as a rule of thumb you could use this approach:
Take the number of virtual CPUs * 0.75 and that's the number of slots you can configure. For example, if you have 12 physical cores (or 24 virtual cores), you would have 24 * 0.75 = 18 slots. Now, based on your requirements, you can choose how many mappers and reducers you want to use. With 18 MR slots, you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers, or whatever split works for you.
HTH

Hadoop's input splitting- How does it work

I know a bit about Hadoop and I am curious to know how it works.
To be precise, I want to know how exactly it divides/splits the input file.
Does it divide it into equal chunks by size, or is that a configurable thing?
I did go through this post, but I couldn't understand it.
This is dependent on the InputFormat, which for most file-based formats is defined in the FileInputFormat base class.
There are a number of configurable options which denote how hadoop will take a single file and either process it as a single split, or divide the file into multiple splits:
If the input file is compressed, the input format and compression method must be splittable. Gzip, for example, is not splittable (you can't randomly seek to a point in the file and recover the compressed stream). BZip2 is splittable. See the specific InputFormat.isSplitable() implementation for your input format for more information.
If the file size is less than or equal to its defined HDFS block size, then hadoop will most probably process it in a single split (this can be configured, see a later point about split size properties)
If the file size is greater than its defined HDFS block size, then hadoop will most probably divide up the file into splits based upon the underlying blocks (4 blocks would result in 4 splits)
You can configure two properties, mapred.min.split.size and mapred.max.split.size, which help the input format when breaking up blocks into splits. Note that the minimum size may be overridden by the input format (which may have a fixed minimum input size).
If you want to know more, and are comfortable looking through the source, check out the getSplits() method in FileInputFormat (both the new and old API have the same method, but they may have some subtle differences).
When you submit a map-reduce job (or a Pig/Hive job), Hadoop first calculates the input splits; each input split size generally equals the HDFS block size. For example, for a file of 1 GB, there will be 16 input splits if the block size is 64 MB. However, the split size can be configured to be less or more than the HDFS block size. Calculation of input splits is done with FileInputFormat. For each of these input splits, a map task must be started.
But you can change the size of input split by configuring following properties:
mapred.min.split.size: The minimum size chunk that map input should be split into.
mapred.max.split.size: The largest valid size in bytes for a file split.
dfs.block.size: The default block size for new files.
And the formula for input split is:
Math.max("mapred.min.split.size", Math.min("mapred.max.split.size", blockSize));
You can check examples here.
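Here is a minimal driver sketch of how these properties are typically set per job (new mapreduce API assumed; the class name and sizes are only examples). The comment restates the formula above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(SplitSizeDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
        // Forcing a 256 MB minimum makes splits larger than a 128 MB block,
        // roughly halving the number of map tasks for large files.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);

        // ... set mapper, reducer and output classes as usual, then submit:
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}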
