Controlling reducer shuffle merge memory in Hadoop 2

I want to understand how memory is used in the reduce phase of a MapReduce job, so I can control these settings appropriately.
If I understand correctly, the reducer first fetches its map outputs and keeps them in memory up to a certain threshold. The settings that appear to control this are:
mapreduce.reduce.shuffle.merge.percent: The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.
mapreduce.reduce.input.buffer.percent: The percentage of memory (relative to the maximum heap size) to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.
Next, these spilled blocks are merged. It seems the following option controls how much memory is used for the shuffle:
mapreduce.reduce.shuffle.input.buffer.percent: The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
But then, there is the setting:
mapreduce.reduce.shuffle.memory.limit.percent: Maximum percentage of the in-memory limit that a single shuffle can consume.
But it is not clear to what value this percentage applies. Is there more information available regarding these values, i.e. what they control and how they differ?
Finally, after the merge completes, the reduce process is run on the inputs. In the [Hadoop book][1], I found that the final merge step directly feeds the reducers. But the default value mapreduce.reduce.input.buffer.percent=0 seems to contradict this, indicating that everything is spilled to disk BEFORE the reducers start. Is there any reference on which one of these explanations is correct?
[1]: Hadoop: The Definitive Guide, Fourth Edition, p. 200

Here is how mapreduce.reduce.shuffle.memory.limit.percent is used: the percentage applies to the shuffle buffer, which is itself 0.70 of the whole reducer heap. The result is the maximum number of bytes up to which the data of a single shuffle (one map output) can be kept in memory.
// maxInMemCopyUse = mapred.job.shuffle.input.buffer.percent (default 0.70f)
maxSize = (int) (conf.getInt("mapred.job.reduce.total.mem.bytes",
    (int) Math.min(Runtime.getRuntime().maxMemory(), Integer.MAX_VALUE)) * maxInMemCopyUse);
// MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION = mapreduce.reduce.shuffle.memory.limit.percent (default 0.25f)
maxSingleShuffleLimit = (long) (maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
This property is used in the copy phase of the reducer: if the size of a required map output is greater than maxSingleShuffleLimit, that output is written to disk, otherwise it is kept in memory.
The property mapreduce.reduce.input.buffer.percent is completely different.
Once all the data is copied and all the merges are done, just before the reducer starts, it checks whether the data retained in memory exceeds this limit; anything above it must be spilled so that the remaining in-memory map outputs stay below the threshold before the reduce can begin.
You could refer to this code (it is for the old mapred API, but it should give an insight) to see how maxSingleShuffleLimit and the other properties are used.
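As a rough sketch of how these properties combine (the class and variable names here are made up for illustration; it assumes the defaults from mapred-default.xml and takes the reduce task's -Xmx as the heap):
public class ShuffleMemoryMath {
    public static void main(String[] args) {
        // Assumed defaults from mapred-default.xml; adjust if your cluster overrides them.
        long reducerHeap = Runtime.getRuntime().maxMemory();  // the reduce task's -Xmx
        double inputBufferPercent = 0.70;   // mapreduce.reduce.shuffle.input.buffer.percent
        double memoryLimitPercent = 0.25;   // mapreduce.reduce.shuffle.memory.limit.percent
        double mergePercent = 0.66;         // mapreduce.reduce.shuffle.merge.percent
        double reduceInputPercent = 0.0;    // mapreduce.reduce.input.buffer.percent
        long shuffleBuffer = (long) (reducerHeap * inputBufferPercent);           // memory available for fetched map outputs
        long maxSingleShuffleLimit = (long) (shuffleBuffer * memoryLimitPercent); // a single map output above this goes straight to disk
        long mergeTrigger = (long) (shuffleBuffer * mergePercent);                // in-memory merge/spill starts once this much is buffered
        long retainedForReduce = (long) (reducerHeap * reduceInputPercent);       // map output kept in memory when reduce() starts (0 => spill everything)
        System.out.printf("shuffleBuffer=%d maxSingleShuffleLimit=%d mergeTrigger=%d retainedForReduce=%d%n",
                shuffleBuffer, maxSingleShuffleLimit, mergeTrigger, retainedForReduce);
    }
}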

Related

MapReduce: does the reducer automatically sort?

There is something that is not clear to me about the whole functioning of a MapReduce programming environment.
Suppose 1k random, unsorted words in the form (word, 1) come out of one (or more than one) mapper, and with the reducer I want to save them all inside a single huge sorted file. How does it work? I mean, does the reducer itself sort all the words automatically? What should the reducer function do? What if I have just one reducer with limited RAM and disk?
When the reducer gets the data, the data has already been sorted on the map side.
The process works like this:
Map side:
1. Each InputSplit is processed by a map task, and the map output is temporarily placed in a circular memory buffer (the buffer is 100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default at 80% of the buffer size), a spill file is created on the local file system.
2. Before writing to disk, the thread first divides the data into as many partitions as there are reduce tasks, so that one reduce task corresponds to the data of one partition. This avoids some reduce tasks being assigned large amounts of data while others get little or none. The data in each partition is sorted, and if a Combiner is set, it is applied to the sorted result.
3. When the map task writes its last record, there may be many spill files, and these need to be merged. Sorting and combining continue during the merge for two purposes: (1) to minimize the amount of data written to disk each time; (2) to minimize the amount of data transferred over the network during the next copy phase. The spills are finally merged into one partitioned and sorted file. To reduce the amount of data transmitted over the network, you can compress the map output here by setting mapred.compress.map.output to true.
4. The data of each partition is copied to the corresponding reduce task.
Reduce side:
1. The reducer receives data from different map tasks, and the data sent from each map is ordered. If the amount of data received on the reduce side is small enough, it is kept in memory; if it exceeds a certain proportion of the buffer size, it is merged and spilled to disk.
2. As the number of spill files increases, a background thread merges them into a larger, sorted file. In fact, on both the map side and the reduce side, MapReduce repeatedly performs sort and merge operations.
3. The merge process generates many intermediate files (written to disk), but MapReduce keeps the amount of data written to disk as small as possible, and the result of the last merge is not written to disk but fed directly to the reduce function.
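To make the "does the reducer sort automatically" part concrete, here is a minimal word-count style reducer sketch (the class name is made up; the Reducer API is the standard Hadoop one). Keys arrive at reduce() already sorted and grouped by the shuffle, so with a single reducer the output is one big sorted file of (word, count) pairs; the reducer only has to aggregate the values:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();          // all values for this word arrive together
        }
        total.set(sum);
        context.write(word, total);  // written in key-sorted order
    }
}
If the data does not fit in the single reducer's RAM, the framework falls back to merging from disk, so the job still completes; it is just slower than spreading the work over several reducers.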

Is there a limit on the number of key-value pairs emitted by a mapper?

In a Map Reduce program, is there an upper limit on the number of key-value pairs that can be emitted by a single mapper?
I am interested in both Hadoop 1.x and 2.x. I have googled it and could not find any answers, or any mention of it at all.
Thank you
There is no limit on the number of key-value pairs emitted by a single mapper.
The mapper keeps generating output, which is written to a buffer. The size of this buffer is determined by the configuration mapreduce.task.io.sort.mb [default: 256 MB (CDH), 100 MB (source code)].
Whenever the buffer occupancy reaches mapreduce.map.sort.spill.percent [default: 0.8] of its capacity, the buffer contents are spilled (a non-blocking background process) to local disk as a spill file.
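As a small illustration that nothing caps a mapper's output, here is a sketch of a mapper that emits one pair per token of every input line (illustrative class name; the Mapper API is the standard Hadoop one). The framework simply buffers whatever is emitted and spills to disk as described above:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TokenEmitterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // no upper bound on how often this may be called
        }
    }
}
If profiling shows excessive spills, you could enlarge the buffer in the driver, for example job.getConfiguration().setInt("mapreduce.task.io.sort.mb", 256).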

Hadoop reduce side: what really occurs?

I'm reading "Hadoop: The Definitive Guide" and I have some questions.
In Chapter 7, "How MapReduce Works", on page 201, the author says that on the reduce side:
When [A] the in-memory buffer reaches a threshold size (controlled by mapreduce.reduce.shuffle.merge.percent) or [B] reaches a threshold number of map outputs (mapreduce.reduce.merge.inmem.threshold), it is merged and spilled to disk.
My questions (four of them) are about conditions A and B.
For condition A, with the default Hadoop 2 configuration, would you say that:
When 0.462 (0.66 * 0.70) of the reducer's 1 GB memory is full (462 MB), the merge and spill will begin?
For condition B, with the default configuration, would you say that:
When 1,000 key-value pairs of map output from one mapper have been collected in the buffer, the merge and spill will begin?
In the question above, is "one mapper" correct, or can the outputs come from more than one mapper on different machines?
In the following paragraph, the author says:
As the copies accumulate on disk, a background thread merges them into larger, sorted files
Is it correct that this means: when a spill file is about to be written to disk, a previous spill that already exists on disk is merged with the current spill?
Please help me to better understand what is really happening in Hadoop.
Please see this image
Each mapper runs on a different machine.
Can anyone explain the precise meaning as specified?
When 0.462 (0.66 * 0.70) of the reducer's 1 GB memory is full (462 MB), the merge and spill will begin?
Yes. This will trigger the merge and spill to disk.
By default each reducer gets 1 GB of heap space; of this, only 70% (mapreduce.reduce.shuffle.input.buffer.percent) is used as the merge-and-shuffle buffer, since the other 30% may be needed by the reducer code and for other task-specific requirements. When 66% (mapreduce.reduce.shuffle.merge.percent) of that merge-and-shuffle buffer is full, merging (and spilling) is triggered, to make sure some space is left for sorting and so that other incoming map outputs do not have to wait for memory.
Refer to http://www.bigsynapse.com/mapreduce-internals
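To spell out the arithmetic behind the 462 MB figure, assuming the 1 GB reducer heap from the question and the defaults quoted above: 1000 MB * 0.70 (mapreduce.reduce.shuffle.input.buffer.percent) = 700 MB of merge-and-shuffle buffer, and 700 MB * 0.66 (mapreduce.reduce.shuffle.merge.percent) = 462 MB, the point at which the in-memory merge and spill are triggered.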
When 1,000 key-value pairs of map output from one mapper have been collected in the buffer, the merge and spill will begin?
No. Here "map outputs" may refer to the partitions of the map output files (I think).
1,000 key-value pairs might be a very small amount of memory.
In slide 24 of this presentation by Cloudera, the same thing is referred to as segments:
mapreduce.reduce.merge.inmem.threshold segments accumulated (default 1000)
In the question above, is "one mapper" correct, or can the outputs come from more than one mapper on different machines?
From my understanding, the 1,000 segments can come from different mappers, since the partitions of the map output files are copied in parallel from different nodes.
Is it correct that this means: when a spill file is about to be written to disk, a previous spill that already exists on disk is merged with the current spill?
No. A (map output) partition file currently in memory is merged with other (map output) partition files in memory only to create a spill file, which is then written to disk. This can happen several times as the buffer fills up, so many files end up on disk. To increase efficiency, these on-disk files are merged into a larger file by a background process.
source:
Hadoop: The Definitive Guide, 4th edition.
http://answers.mapr.com/questions/3952/reduce-merge-performance-issues.html
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://www.bigsynapse.com/mapreduce-internals
I would like to add to #shanmuga's answer.
1) Note that NOT all map outputs are stored in memory first and then spilled to disk.
Some map outputs are stored directly on disk (OnDiskMapOutput).
Based on the uncompressed map output length (Size), it is determined whether a particular map's output will be kept in memory or on disk. The decision is based on:
Map Output Buffer Limit (MemoryLimit) = mapreduce.reduce.memory.totalbytes [default: -Xmx * mapreduce.reduce.shuffle.input.buffer.percent (default 70%)]
MaxSingleShuffleLimit = MemoryLimit * mapreduce.reduce.shuffle.memory.limit.percent [default: 25%]
If Size > MaxSingleShuffleLimit => OnDiskMapOutput
Else if (total size of the map output buffer currently used) < MemoryLimit => InMemoryMapOutput
Else => stall the fetch until in-memory merges free up space
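A compact sketch of the placement rule above (this is not the actual MergeManagerImpl code; the class, enum and method names are made up purely to restate the three branches):
public class MapOutputPlacement {
    enum Placement { IN_MEMORY, ON_DISK, STALL }
    // size = uncompressed length of the map output being fetched
    static Placement place(long size, long usedBufferBytes,
                           long memoryLimit, long maxSingleShuffleLimit) {
        if (size > maxSingleShuffleLimit) {
            return Placement.ON_DISK;     // too large for a single in-memory shuffle
        }
        if (usedBufferBytes < memoryLimit) {
            return Placement.IN_MEMORY;   // fits in the shuffle buffer
        }
        return Placement.STALL;           // the fetch waits until in-memory merges free space
    }
}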
2) The SORT/MERGE phase is triggered during the COPY phase when either of these conditions is met:
At the end of an InMemoryMapOutput creation: if (total size of the current set of InMemoryMapOutputs) > MemoryLimit * mapreduce.reduce.shuffle.merge.percent [default: 90%], start the InMemoryMerger thread, which merges all the current InMemoryMapOutputs into one OnDiskMapOutput.
At the end of an OnDiskMapOutput creation: if (number of OnDiskMapOutputs) > 2 * mapreduce.task.io.sort.factor [default: 64 (CDH) / 10 (source code)], start the OnDiskMerger thread, which merges mapreduce.task.io.sort.factor OnDiskMapOutputs into one OnDiskMapOutput.
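And the two triggers from point 2), again only as an illustrative sketch with made-up names:
public class MergeTriggers {
    // In-memory merge: fires when the buffered InMemoryMapOutputs exceed
    // MemoryLimit * mapreduce.reduce.shuffle.merge.percent.
    static boolean startInMemoryMerge(long inMemoryBytes, long memoryLimit, double mergePercent) {
        return inMemoryBytes > memoryLimit * mergePercent;
    }
    // On-disk merge: fires when the number of on-disk outputs exceeds
    // 2 * mapreduce.task.io.sort.factor; each pass then merges ioSortFactor files.
    static boolean startOnDiskMerge(int onDiskOutputs, int ioSortFactor) {
        return onDiskOutputs > 2 * ioSortFactor;
    }
}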
3) Regarding Condition B you mentioned, I don't see the flag mapreduce.reduce.merge.inmem.threshold referenced in the Merge related source code for MapReduce. This flag is perhaps deprecated.

Which node sorts/shuffles the keys in Hadoop?

In a Hadoop job, which node does the sorting/shuffling phase? Does increasing the memory of that node improve the performance of sorting/shuffling?
The relevant parameters to tune (in my experience) in mapred-site.xml are:
io.sort.mb: This is the output buffer of a mapper. When this buffer is full, the data is sorted and spilled to disk. Ideally you avoid having too many spills. Note that this memory is part of the map task's heap size.
mapred.map.child.java.opts: This is the heap size of a map task; the higher it is, the larger you can make the output buffer.
In principle the number of reduce tasks also influences the shuffle speed: the number of reduce rounds is the total number of reduce tasks divided by the number of reduce slots (for example, 100 reduce tasks on 25 slots run in 4 waves). Note that the initial shuffle (during the map phase) will only shuffle data to the active reducers, so mapred.reduce.tasks is also relevant.
io.sort.factor is the number of streams merged at once during the merge sort, both on the map and the reduce side.
Compression also has a large impact (it speeds up the transfer from mapper to reducer, but compression/decompression comes at a CPU cost).
mapred.job.shuffle.input.buffer.percent is the percentage of the reducer's heap used to store map outputs in memory during the shuffle.
There are without any doubt more tuning opportunities, but these are the ones I spent quite some time playing around with.
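For reference, here is a minimal driver sketch showing where some of these knobs live under their Hadoop 2 property names (the class name and values are examples only, not recommendations):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
public class ShuffleTuningDriver {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 256);                          // mapper output buffer (old io.sort.mb)
        conf.set("mapreduce.map.java.opts", "-Xmx1536m");                       // map task heap (old mapred.map.child.java.opts)
        conf.setBoolean("mapreduce.map.output.compress", true);                 // compress map output before the shuffle
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);  // reducer heap share for shuffled map outputs
        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        job.setNumReduceTasks(9);                                               // old mapred.reduce.tasks
        // ... set mapper/reducer classes and input/output paths before submitting.
    }
}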
The sort and shuffle phase is divided between the mappers and reducers. That is why you see the reduce percentage increasing (usually up to 33%) while the mappers are still running.
The performance gain from increasing the sort buffer memory will depend on:
a) the size / total number of the keys being emitted by the mapper
b) the nature of the mapper tasks (IO intensive, CPU intensive)
c) the available primary memory and the map/reduce slots occupied on the given node
d) data skew
You can find more information at https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort

Pseudo-distributed mode: number of map and reduce tasks

I am a newbie to Hadoop. I have successfully configured a Hadoop setup in pseudo-distributed mode. Now I would like to know the logic for choosing the number of map and reduce tasks. What should we refer to?
Thanks
You cannot generalize how the number of mappers/reducers should be set.
Number of Mappers:
You cannot explicitly set the number of mappers to a certain value (there are parameters for this, but they do not take effect). It is decided by the number of input splits created by Hadoop for your given input. You can influence this by setting the mapred.min.split.size parameter (a short code sketch appears near the end of this answer). For more, read the InputSplit section here. If a lot of mappers are being generated because of a huge number of small files and you want to reduce the number of mappers, you will need to combine data from more than one file. Read this: How to combine input files to get to a single mapper and control the number of mappers.
To quote from the wiki page:
The number of maps is usually driven by the number of DFS blocks in
the input files. Although that causes people to adjust their DFS block
size to adjust the number of maps. The right level of parallelism for
maps seems to be around 10-100 maps/node, although we have taken it up
to 300 or so for very cpu-light map tasks. Task setup takes awhile, so
it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The
mapred.map.tasks parameter is just a hint to the InputFormat for the
number of maps. The default InputFormat behavior is to split the total
number of bytes into the right number of fragments. However, in the
default case the DFS block size of the input files is treated as an
upper bound for input splits. A lower bound on the split size can be
set via mapred.min.split.size. Thus, if you expect 10TB of input data
and have 128MB DFS blocks, you'll end up with 82k maps, unless your
mapred.map.tasks is even larger. Ultimately the InputFormat determines
the number of maps.
The number of map tasks can also be increased manually using the
JobConf's conf.setNumMapTasks(int num). This can be used to increase
the number of map tasks, but will not set the number below that which
Hadoop determines via splitting the input data.
Number of Reducers:
You can explicitly set the number of reducers: just set the parameter mapred.reduce.tasks. There are guidelines for setting this number, but usually the default number of reducers should be good enough. At times a single report file is required; in those cases you might want to set the number of reducers to 1 (see the sketch after the quoted wiki text below).
Again to quote from wiki:
The right number of reduces seems to be 0.95 or 1.75 * (nodes *
mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can
launch immediately and start transfering map outputs as the maps
finish. At 1.75 the faster nodes will finish their first round of
reduces and launch a second round of reduces doing a much better job
of load balancing.
Currently the number of reduces is limited to roughly 1000 by the
buffer size for the output files (io.buffer.size * 2 * numReduces <<
heapSize). This will be fixed at some point, but until it is it
provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the
output directory, but usually that is not important because the next
map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as
the map tasks, via JobConf's conf.setNumReduceTasks(int num).
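As mentioned above, here is a minimal sketch in the newer mapreduce API of the two knobs you can realistically turn (the class name, paths and values are examples only): raising the minimum split size to reduce the number of map tasks, and setting the number of reducers explicitly.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class TaskCountExample {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration(), "task-count-example");
        // Mappers: only indirectly controllable; a larger minimum split size
        // means fewer input splits and therefore fewer map tasks.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB (mapred.min.split.size)
        FileInputFormat.addInputPath(job, new Path(args[0]));           // input directory from the command line
        // Reducers: set explicitly (mapred.reduce.tasks); use 1 when a single output file is required.
        job.setNumReduceTasks(1);
    }
}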
Actually, the number of mappers is primarily governed by the number of InputSplits created by the InputFormat you are using, and the number of reducers by the number of partitions the map output is divided into (one partition per reducer). Having said that, you should also keep in mind the number of slots available per slave, along with the available memory. As a rule of thumb you could use this approach:
Take the number of virtual CPUs * 0.75 and that's the number of slots you can configure. For example, if you have 12 physical cores (24 virtual cores), you would have 24 * 0.75 = 18 slots. Based on your requirements you can then choose how many mappers and reducers to use: with 18 MR slots, you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers, or whatever split works for you.
HTH
