Hadoop reduce side: what really happens?

I'm reading "Hadoop: The Definitive Guide" and I have some questions.
In Chapter 7, "How MapReduce Works", on page 201, the author says that on the reduce side:
When [A] the in-memory buffer reaches a threshold size (controlled by mapreduce.reduce.shuffle.merge.percent) or [B] reaches a threshold number of map outputs (mapreduce.reduce.merge.inmem.threshold), it is merged and spilled to disk.
My questions (4 questions) are about conditions A and B.
For condition A, with the default Hadoop 2 configuration values, is the following correct:
When 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (462 MB), will the merge and spill begin?
For condition B, with the default configuration values, is the following correct:
When 1000 key-value pairs of map output from one mapper have been collected in the buffer, will the merge and spill begin?
In the question above, is "one mapper" correct, or can the outputs come from more than one mapper on different machines?
In the following paragraph, the author says:
As the copies accumulate on disk, a background thread merges them into larger, sorted files
Is the intended meaning that when a spill file is about to be written to disk, an earlier spill that already exists on disk is merged with the current spill?
Please help me better understand what is really happening in Hadoop.
Please see this image.
Each mapper runs on a different machine.
Can anyone explain the precise meaning of these points?

When 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (462 MB), will the merge and spill begin?
Yes. This will trigger the merge and spill to disk.
By default each reducer gets 1 GB of heap space, and only 70% of it (mapreduce.reduce.shuffle.input.buffer.percent) is used as the shuffle-and-merge buffer; the remaining 30% may be needed by the reducer code and for other task-specific requirements. When 66% (mapreduce.reduce.shuffle.merge.percent) of that shuffle-and-merge buffer is full, merging (and spilling) is triggered, to make sure some space is left for sorting and so that incoming map outputs do not have to wait for memory during the sort.
Refer to http://www.bigsynapse.com/mapreduce-internals
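To make the arithmetic explicit, here is a tiny standalone Java sketch (no Hadoop dependency; the 1 GB heap and the default percentages above are the assumptions) that reproduces the 462 MB figure, using decimal units (1 GB = 1000 MB):

public class ReduceShuffleThresholds {
    public static void main(String[] args) {
        long reducerHeapBytes = 1_000_000_000L;   // assumed 1 GB reducer heap (-Xmx1g), decimal units
        double inputBufferPercent = 0.70;         // mapreduce.reduce.shuffle.input.buffer.percent
        double mergePercent = 0.66;               // mapreduce.reduce.shuffle.merge.percent

        long shuffleBuffer = Math.round(reducerHeapBytes * inputBufferPercent);  // ~700,000,000 bytes
        long mergeThreshold = Math.round(shuffleBuffer * mergePercent);          // ~462,000,000 bytes

        System.out.println("shuffle buffer  = " + shuffleBuffer / 1_000_000 + " MB");   // ~700 MB
        System.out.println("merge threshold = " + mergeThreshold / 1_000_000 + " MB");  // ~462 MB
    }
}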
When 1000 key-value pairs of map output from one mapper have been collected in the buffer, will the merge and spill begin?
No. Here "map outputs" most likely refers to partitions (segments) of the map output files, not to individual key-value pairs (I think).
1000 key-value pairs might amount to a very small amount of memory.
In slide 24 of this presentation by Cloudera, the same thing is referred to as segments:
mapreduce.reduce.merge.inmem.threshold segments accumulated (default 1000)
In the question above, is "one mapper" correct, or can the outputs come from more than one mapper on different machines?
From my understanding, the 1000 segments can come from different mappers, since the partitions of the map output files are being copied in parallel from different nodes.
Is the intended meaning that when a spill file is about to be written to disk, an earlier spill that already exists on disk is merged with the current spill?
No. A (map output) partition file currently in memory is merged only with other (map output) partition files in memory to create a spill file, which is then written to disk. This can be repeated several times as the buffer fills up, so many files may end up on disk. To increase efficiency, a background process merges these on-disk files into a larger file.
Sources:
Hadoop: The Definitive Guide, 4th edition
http://answers.mapr.com/questions/3952/reduce-merge-performance-issues.html
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://www.bigsynapse.com/mapreduce-internals

I would like to add to @shanmuga's answer.
1) Note that NOT all map outputs are stored in memory first and then spilled to disk.
Some map outputs are directly stored on Disk (OnDiskMapOutput).
Based on the uncompressed map output length (Size), it is determined whether a particular map's output will be kept in memory or on disk. The decision is based on the following (a small sketch of this check follows the list):
Map Output Buffer Limit (MemoryLimit) = mapreduce.reduce.memory.totalbytes [default = -Xmx * mapreduce.reduce.shuffle.input.buffer.percent (default 70%)]
MaxSingleShuffleLimit = MemoryLimit * mapreduce.reduce.shuffle.memory.limit.percent [default = 25%]
If Size > MaxSingleShuffleLimit => OnDiskMapOutput
Else if (total size of map outputs currently buffered in memory) < MemoryLimit => InMemoryMapOutput
Else => the fetch stalls until an in-memory merge frees up space
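A minimal Java sketch of that decision, as I read it, is below. The class and method names are made up for illustration (this is not the real MergeManagerImpl code); only the property names and default fractions come from the list above.

// Hypothetical sketch of the placement decision described above, not actual Hadoop code.
enum Placement { ON_DISK, IN_MEMORY, STALL }

class ShuffleReservationSketch {
    final long memoryLimit;            // heap * mapreduce.reduce.shuffle.input.buffer.percent (0.70)
    final long maxSingleShuffleLimit;  // memoryLimit * mapreduce.reduce.shuffle.memory.limit.percent (0.25)
    long usedMemory = 0;               // total size of map outputs currently held in memory

    ShuffleReservationSketch(long heapBytes) {
        this.memoryLimit = (long) (heapBytes * 0.70);
        this.maxSingleShuffleLimit = (long) (memoryLimit * 0.25);
    }

    Placement reserve(long uncompressedMapOutputSize) {
        if (uncompressedMapOutputSize > maxSingleShuffleLimit) {
            return Placement.ON_DISK;    // too large for a single in-memory shuffle
        }
        if (usedMemory < memoryLimit) {  // buffer not yet over the limit
            usedMemory += uncompressedMapOutputSize;
            return Placement.IN_MEMORY;
        }
        return Placement.STALL;          // buffer full: the fetch waits until a merge frees memory
    }
}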
2) The SORT/MERGE phase is triggered during the COPY phase when either of these conditions is met (a small sketch of both checks follows):
At the end of an InMemoryMapOutput creation: if (total size of the current set of InMemoryMapOutputs) > MemoryLimit * mapreduce.reduce.shuffle.merge.percent [default: 90%], the InMemoryMerger thread is started, which merges all current InMemoryMapOutputs into one OnDiskMapOutput.
At the end of an OnDiskMapOutput creation: if (number of OnDiskMapOutputs) > 2 * mapreduce.task.io.sort.factor [default 64 (CDH) / 100 (code)], the OnDiskMerger thread is started, which merges mapreduce.task.io.sort.factor OnDiskMapOutputs into one OnDiskMapOutput.
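And a correspondingly small sketch of the two triggers (again with hypothetical names, not actual Hadoop code; the thresholds are the ones quoted above):

// Hypothetical sketch of the two merge triggers described above.
class MergeTriggerSketch {
    final long memoryLimit;                    // heap * mapreduce.reduce.shuffle.input.buffer.percent
    final double shuffleMergePercent = 0.90;   // mapreduce.reduce.shuffle.merge.percent (code default above)
    final int ioSortFactor = 64;               // mapreduce.task.io.sort.factor (CDH default above)

    MergeTriggerSketch(long memoryLimit) { this.memoryLimit = memoryLimit; }

    // Checked after each InMemoryMapOutput is finished.
    boolean startInMemoryMerge(long totalInMemoryBytes) {
        return totalInMemoryBytes > memoryLimit * shuffleMergePercent;
    }

    // Checked after each OnDiskMapOutput is finished.
    boolean startOnDiskMerge(int onDiskMapOutputCount) {
        return onDiskMapOutputCount > 2 * ioSortFactor;
    }
}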
3) Regarding condition B that you mentioned, I don't see the flag mapreduce.reduce.merge.inmem.threshold referenced in the merge-related source code for MapReduce. This flag has perhaps been deprecated.

Related

MapReduce: does the reducer sort automatically?

There is something I don't have clear about the overall functioning of a MapReduce programming environment.
Suppose 1k random, unsorted words in the form (word, 1) come out of one (or more than one) mapper, and with the reducer I want to save them all in a single huge sorted file. How does it work? I mean, does the reducer itself sort all the words automatically? What should the reducer function do? What if I have just one reducer with limited RAM and disk?
When the reducer gets the data, the data has already been sorted on the map side.
The process is like this:
Map side:
1. Each InputSplit is processed by a map task, and the map output is temporarily placed in a circular memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default at 80% of its size), a spill file is created on the local file system.
2. Before writing to disk, the thread first divides the data into the same number of partitions as there are reduce tasks, that is, one reduce task corresponds to the data of one partition. This is meant to avoid some reduce tasks being assigned large amounts of data while others get little or none. The data within each partition is then sorted. If a Combiner is set, it is applied to the sorted result.
3. By the time the map task writes its last record, there may be many spill files, and these files need to be merged. Sorting and combining are performed continually during the merge, for two purposes: (1) to minimize the amount of data written to disk each time, and (2) to minimize the amount of data transferred over the network during the next copy phase. The spills are finally merged into a single partitioned and sorted file. To further reduce the amount of data transmitted over the network, you can compress the data here by setting mapred.compress.map.out to true.
4. The data of each partition is copied to the corresponding reduce task. (A configuration sketch of these map-side knobs follows this list.)
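As a reference, here is a minimal sketch of how the map-side knobs mentioned in these steps can be set on a Hadoop 2 job. The values shown are just the defaults quoted above plus compression turned on, not a tuning recommendation; mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent and mapreduce.map.output.compress are the Hadoop 2 names of io.sort.mb, io.sort.spill.percent and mapred.compress.map.out.

import org.apache.hadoop.conf.Configuration;

public class MapSideSpillConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Size of the circular in-memory sort buffer (io.sort.mb in Hadoop 1).
        conf.setInt("mapreduce.task.io.sort.mb", 100);

        // Spill when the buffer is 80% full (io.sort.spill.percent in Hadoop 1).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);

        // Compress map output before it crosses the network (mapred.compress.map.out in Hadoop 1).
        conf.setBoolean("mapreduce.map.output.compress", true);

        System.out.println("sort buffer = " + conf.getInt("mapreduce.task.io.sort.mb", -1) + " MB");
    }
}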
Reduce side:
1. The reduce task receives data from different map tasks, and the data sent by each map is sorted. If the amount of data received by the reduce side is quite small, it is kept directly in memory. If the amount of data exceeds a certain proportion of the buffer size, it is merged and written to disk.
2. As the number of spilled files increases, a background thread merges them into a larger, sorted file. In fact, on both the map side and the reduce side, MapReduce repeatedly performs sorting and merging operations.
3. The merge process generates many intermediate files (written to disk), but MapReduce tries to keep the data written to disk as small as possible, and the result of the last merge is not written to disk but is fed directly to the reduce function. (The reduce-side knobs involved here are sketched below.)
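And the corresponding reduce-side knobs, again as a minimal sketch with their default values (a sketch, not a recommendation):

import org.apache.hadoop.conf.Configuration;

public class ReduceSideShuffleConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Fraction of the reducer heap used to hold fetched map outputs during the shuffle.
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);

        // Start the in-memory merge once the shuffle buffer is this full.
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);

        // Fraction of the heap that may still hold map outputs when reduce() starts
        // (the default 0 means everything is flushed to disk before the reduce begins).
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.0f);
    }
}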

Controlling reducer shuffle merge memory in Hadoop 2

I want to understand how memory is used in the reduce phase of a MapReduce job, so I can control the settings in the intended way.
If I understand correctly, the reducer first fetches its map outputs and keeps them in memory up to a certain threshold. The settings that control this are:
mapreduce.reduce.shuffle.merge.percent: The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.
mapreduce.reduce.input.buffer.percent: The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.
Next, these spilled blocks are merged. It seems the following option controls how much memory is used for the shuffle:
mapreduce.reduce.shuffle.input.buffer.percent: The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
But then, there is the setting:
mapreduce.reduce.shuffle.memory.limit.percent: Maximum percentage of the in-memory limit that a single shuffle can consume.
But it is not clear to what value this percentage applies. Is there more information available regarding these values, i.e. what they control and how they differ?
Finally, after the merge completes, the reduce process is run on the inputs. In the [Hadoop book][1], I found that the final merge step directly feeds the reducers. But the default value mapreduce.reduce.input.buffer.percent=0 contradicts this, indicating that everything is spilled to disk BEFORE the reducers start. Is there any reference on which one of these explanations is correct?
[1]: Hadoop, The definitive guide, Fourth edition, p. 200
Here is how mapreduce.reduce.shuffle.memory.limit.percent is used: the percentage applies to the shuffle input buffer, which is itself 70% (mapreduce.reduce.shuffle.input.buffer.percent) of the whole reducer heap. The result is the maximum number of bytes of a single shuffle (one fetched map output) that can be kept in memory.
// maxInMemCopyUse = mapred.job.shuffle.input.buffer.percent (default 0.70f)
maxSize = (int) (conf.getInt("mapred.job.reduce.total.mem.bytes",
    (int) Math.min(Runtime.getRuntime().maxMemory(), Integer.MAX_VALUE)) * maxInMemCopyUse);
// MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION = mapreduce.reduce.shuffle.memory.limit.percent (default 0.25f)
maxSingleShuffleLimit = (long) (maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
This property is used in the copy phase of the reducer: if the size of the fetched map output is greater than maxSingleShuffleLimit, the data is written to disk, otherwise it is kept in memory.
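Putting rough numbers on the snippet above (assuming a 1 GB reducer heap and the quoted defaults; this is plain arithmetic, not Hadoop code):

public class MaxSingleShuffleLimitExample {
    public static void main(String[] args) {
        long reducerHeap = 1_000_000_000L;        // assumed 1 GB reducer heap, decimal units
        double inputBufferPercent = 0.70;         // mapred.job.shuffle.input.buffer.percent
        double singleShuffleFraction = 0.25;      // mapreduce.reduce.shuffle.memory.limit.percent

        long maxSize = Math.round(reducerHeap * inputBufferPercent);              // ~700 MB shuffle buffer
        long maxSingleShuffleLimit = Math.round(maxSize * singleShuffleFraction); // ~175 MB per map output

        System.out.println("maxSize               = " + maxSize);
        System.out.println("maxSingleShuffleLimit = " + maxSingleShuffleLimit);
        // A single fetched map output larger than ~175 MB goes straight to disk.
    }
}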
The property mapreduce.reduce.input.buffer.percent is completely different.
Once all the data has been copied and all merging is done, just before the reducer starts, it checks whether the data kept in memory exceeds this limit.
You could refer to this code (it is for the old mapred API, but it should give an insight) to see how maxSingleShuffleLimit and the other properties are used.

Hadoop Shuffle Questions

I have just read the book "Hadoop: The Definitive Guide". I have several questions about the most important process: the shuffle.
1. The time order of sort, partition and merge
The output of a mapper may be the input of several reducers. From the book, we know that the mapper first writes its output to its memory buffer, and before it spills the buffer to disk, a sort and a partition are performed. I want to know their order in time. My inference is: before the result is spilled to disk, a partition step is performed to determine which reducer each output record belongs to. Then, for each partition, a sort (as far as I know, quicksort) is performed separately. When the buffer is full or reaches the threshold, it is spilled to disk.
2. Does each spill file and merged file belong to one reducer or to multiple reducers?
Again, according to the book, when there are too many spill files, a merge operation occurs. This confuses me again.
2.1 Does each spill file belong to a single reducer, or is it just a simple dump of the memory buffer that belongs to multiple reducers?
2.2 After the spill files are merged, will the merged file contain input data for several reducers? Then, when it comes to the copy phase of the reducer, how can a reducer get the part that actually belongs to it from this merged file?
2.3 Each map task generates a merged file, rather than each TaskTracker, right?

Why has io.sort.record.percent been removed?

Why has the io.sort.record.percent property been removed from Hadoop 1.x onwards?
It's there in 2.x; the only difference is that its name has changed:
mapreduce.task.io.sort.mb: The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
The default value is still 100 MB. Please find more information at this link.
io.sort.record.percent is a percentage that applies to io.sort.mb (the old name).
Map output data is placed in an in-memory buffer. When the buffer fills up, the framework sorts it and then spills it to disk. A separate thread merges the sorted on-disk files into a single larger sorted file. The buffer consists of two parts: a section with contiguous raw output data, and a metadata section that holds a pointer into the raw data section for each record. In MR1, the sizes of these sections were fixed, controlled by io.sort.record.percent, which specifies what percentage of the io.sort.mb space is used for the metadata section. This meant that, without proper tuning of this parameter, a job with many small records could fill the metadata section much faster than the raw data section, so the buffer would be spilled to disk before it was entirely full, which hurts performance.
MAPREDUCE-64 fixed this issue in MR2 by allowing the two sections to share the same space and vary in size, so io.sort.record.percent is no longer needed to minimize the number of spills. That's why this property has been removed.
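A worked example of why that mattered, assuming MR1's commonly cited 16 bytes of accounting metadata per record (the 16-byte figure is my assumption; the 100 MB and 5% values are the MR1 defaults):

public class SortRecordPercentExample {
    public static void main(String[] args) {
        long ioSortMbBytes = 100L * 1024 * 1024;   // io.sort.mb = 100 MB
        double recordPercent = 0.05;               // io.sort.record.percent default in MR1
        long metadataBytesPerRecord = 16;          // assumed accounting size per record in MR1

        long metadataSection = Math.round(ioSortMbBytes * recordPercent);          // 5 MB for metadata
        long dataSection = ioSortMbBytes - metadataSection;                        // 95 MB for raw key/value bytes
        long recordsUntilMetadataFull = metadataSection / metadataBytesPerRecord;  // ~327,680 records

        System.out.println("metadata section bytes         = " + metadataSection);
        System.out.println("data section bytes             = " + dataSection);
        System.out.println("records until metadata is full = " + recordsUntilMetadataFull);
        // With many tiny records the 5 MB metadata section fills long before the 95 MB data
        // section does, forcing an early spill; this is the imbalance MAPREDUCE-64 removed.
    }
}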

How hadoop handle very large individual split file

Suppose you only have a 1 GB heap available for each mapper, but the block size is set to 10 GB and each split is 10 GB. How does the mapper read such a large individual split?
Will the mapper buffer the input to disk and process the input split in a round-robin fashion?
Thanks!
The overall pattern of a mapper is quite simple:
while not end of split
    (key, value) = RecordReader.next()            // read the next record from the split
    (keyOut, valueOut) = map(key, value)          // apply the user-defined map function
    RecordWriter.write(keyOut, valueOut)          // hand the output to the framework
Usually the first two operations only care about the size of a single record. For example, when TextInputFormat is asked for the next line, it stores bytes in a buffer until the next end of line is found; then the buffer is cleared, and so on.
The map implementation is up to you. If you don't store things in your mapper, then you are fine. If you want it to be stateful, then you can be in trouble; make sure that your memory consumption is bounded.
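To illustrate the "stateless is fine" point, here is a minimal word-count style Mapper that keeps no state between records, so its memory use does not grow with the size of the split (standard Hadoop 2 Mapper API; nothing here depends on the 10 GB split from the question):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Processes one line at a time and keeps nothing between map() calls.
public class StatelessLineMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // the framework buffers, partitions, sorts and spills this
            }
        }
    }
}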
In the last step, the keys and values written by your mapper are stored in memory. They are then partitioned and sorted. If the in-memory buffer becomes full, its content is spilled to disk (it will eventually be spilled anyway, because reducers need to be able to download the partition file even after the mapper has exited).
So the answer to your question is: yes it will be fine.
What could cause trouble is:
Large records (exponential buffer growth + memory copies => significant to insane memory overhead)
Storing data from the previous key/value in your mapper
Storing data from the previous key/value in your custom (Input|Output)Format implementation if you have one
If you want to learn more, here are a few entry points:
In Mapper.java you can see the while loop
In LineRecordReader you can see how a line is read by a TextInputFormat
You most likely want to understand the spill mechanism because it impacts the performance of your jobs. See these Cloudera slides, for example. Then you will be able to decide what the best settings are for your use case (large vs. small splits).
