I have just finished reading the book "Hadoop: The Definitive Guide", and I have several questions about the most important process: the shuffle.
1. The order in time of sort, partition, and merge
The output of a mapper may be the input of several reducers. From the book, I know that the mapper first writes its output to an in-memory buffer, and that a partition and a sort happen before the buffer is spilled to disk. I want to know their order in time. My inference is: before the result is spilled to disk, a partition step determines which reducer each record belongs to, and then within each partition a sort (as far as I know, a quicksort) is performed separately. When the buffer is full or reaches the threshold, it is spilled to disk.
2. Does each spill file and merged file belong to a single reducer or to multiple reducers?
Again, according to the book, when there are too many spill files, a merge operation occurs. This confuses me as well.
2.1 Does each spill file belong to a single reducer, or is it just a dump of the memory buffer that contains data for multiple reducers?
2.2 After the spill files are merged, the merged file contains input data for several reducers. When it comes to the copy phase of the reducer, how can a reducer fetch only the part of this merged file that actually belongs to it?
2.3 Each map task generates a merged file, rather than each TaskTracker, right?
Related
There is something that is not clear to me about the overall functioning of a MapReduce program.
Suppose 1k of random, unsorted words in the form (word, 1) come out of one (or more than one) mapper, and with the reducer I want to save them all into a single huge sorted file. How does that work? Does the reducer itself sort all the words automatically? What should the reducer function do? What if I have just one reducer with limited RAM and disk?
When the reducer gets the data, the data has already been sorted on the map side.
The process is as follows:
Map side:
1. Each InputSplit is processed by a map task, and the map output is placed temporarily in a circular in-memory buffer (the size of the buffer is 100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default, at 80% of the buffer size), a spill file is created in the local file system.
2. Before writing to disk, the thread first divides the data into partitions, one per reduce task, so that each reduce task corresponds to the data of one partition. This avoids some reduce tasks being assigned large amounts of data while others get little or none. Within each partition the data is then sorted by key. If a combiner is set, it is run on the sorted result.
3. When the map task writes its last record, there may be many spill files, and these files need to be merged. Sorting and combining are performed continually during the merge, for two purposes: (1) to minimize the amount of data written to disk each time, and (2) to minimize the amount of data transferred over the network during the subsequent copy phase. The spills are finally merged into a single partitioned and sorted file. To further reduce the amount of data transmitted over the network, you can compress the map output here by setting mapred.compress.map.output to true.
4. The data in each partition is copied to the corresponding reduce task.
Reduce side:
1. The reduce task receives data from different map tasks, and the data arriving from each map is sorted. If the amount of data received on the reduce side is fairly small, it is kept in memory; if it exceeds a certain proportion of the buffer size, the data is merged and written to disk.
2. As the number of spilled files increases, a background thread merges them into larger, sorted files. In fact, on both the map side and the reduce side, MapReduce repeatedly performs sorting and merging operations.
3. The merge process can generate many intermediate files (written to disk), but MapReduce keeps the amount of data written to disk as small as possible, and the result of the last merge is not written to disk but is fed directly to the reduce function.
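For concreteness, here is a minimal sketch of how the map-side knobs mentioned above could be set on a job. It uses the old (pre-YARN) property names quoted in this answer; newer releases expose the same settings as mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent, mapreduce.task.io.sort.factor and mapreduce.map.output.compress.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map side: size of the circular in-memory buffer (default 100 MB).
        conf.setInt("io.sort.mb", 200);
        // Spill threshold as a fraction of the buffer (default 0.80).
        conf.setFloat("io.sort.spill.percent", 0.80f);
        // Maximum number of spill streams merged at once (default 10).
        conf.setInt("io.sort.factor", 10);
        // Compress map output to cut network transfer during the copy phase.
        conf.setBoolean("mapred.compress.map.output", true);

        Job job = Job.getInstance(conf, "shuffle-tuning-sketch");
        // ... set mapper, reducer, input/output formats and paths as usual ...
    }
}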
A summary from the book "Hadoop: The Definitive Guide" by Tom White is:
All the logic between the user's map function and the user's reduce function is called the shuffle, so the shuffle spans both the map and reduce sides. After the user's map() function, the output sits in an in-memory circular buffer. When the buffer is 80% full, a background thread starts to run and writes the buffer's contents into a spill file. This spill file is partitioned by key, and within each partition the key-value pairs are sorted by key. After sorting, the combiner function is called if one is enabled. All spill files are merged into one MapOutputFile, and every map task's MapOutputFile is collected over the network by the reduce task. The reduce task does another sort, and then the user's reduce function is called.
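To make the summary concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce API (the combiner is simply the reducer class reused, the usual word-count idiom). Each map task's (word, 1) pairs go through the partitioner, the per-partition sort and the optional combiner before the spill files are merged into the MapOutputFile that the reduce tasks fetch and merge.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1); the framework then partitions, sorts and (optionally) combines.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all values for a key arrive together because the shuffle sorted and merged by key.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // runs on the sorted spill output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}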
So the questions are:
1.) According to the above summary, this is the flow:
Mapper--Partitioner--Sort--Combiner--Shuffle--Sort--Reducer--Output
1a.) Is this the flow or is it something else?
1b.) Can you explain the above flow with an example, say the word count example? (The ones I found online weren't very detailed.)
2.) So the map phase's output is one big file (the MapOutputFile)? And it is this one big file that is broken up so that the key-value pairs are passed on to their respective reducers?
3.) Why does the sorting happen a second time, when the data was already sorted and combined before being passed on to the respective reducers?
4.) Say mapper1 runs on Datanode1; is it then necessary for reducer1 to run on Datanode1, or can it run on any Datanode?
Answering this question in full would be like retelling the whole story. A lot of your doubts have to do with operating-system concepts and not MapReduce.
The mappers' data is written to the local file system. The data is partitioned based on the number of reducers, and in each partition there can be multiple files, depending on how many times spills have happened.
Each small file in a given partition is sorted, because an in-memory sort is done before the file is written.
Why does the data need to be sorted on the mapper side?
a. The data is sorted and merged on the mapper side to decrease the number of files.
b. The files are sorted because otherwise it would be impractical for the reducer to gather all the values for a given key.
After gathering data on the reducer, the number of files on the system first needs to be brought down (remember that ulimit is a fixed limit for every user, in this case the hdfs user).
The reducer then just maintains a file pointer into a small set of sorted files and does a merge of them, as the sketch below illustrates.
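That merge only works because every input is already sorted. The following is an illustrative k-way merge in plain Java (not Hadoop's internal code): it keeps one open reader and one "head" record per sorted file and always emits the globally smallest key next, which is exactly why the reducer never needs to load or re-sort whole files.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.PriorityQueue;

public class KWayMergeSketch {

    // One sorted input file with its current (smallest unread) line.
    static class Head implements Comparable<Head> {
        final String line;
        final BufferedReader reader;
        Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        public int compareTo(Head other) { return this.line.compareTo(other.line); }
    }

    // Merge several already-sorted text files (one key per line) into one sorted stream.
    public static void merge(String... sortedFiles) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        for (String f : sortedFiles) {
            BufferedReader r = new BufferedReader(new FileReader(f));
            String first = r.readLine();
            if (first != null) heap.add(new Head(first, r));
        }
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            System.out.println(h.line);              // next key in global sorted order
            String next = h.reader.readLine();
            if (next != null) heap.add(new Head(next, h.reader));
            else h.reader.close();
        }
    }
}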
For more interesting details, please refer to:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
I'm reading "Hadoop: The Definitive Guide" and I have some questions.
In Chapter 7, "How MapReduce Works", on page 201, the author says that on the reduce side:
When [A] the in-memory buffer reaches a threshold size (controlled by mapreduce.reduce.shuffle.merge.percent) or [B] reaches a threshold number of map outputs (mapreduce.reduce.merge.inmem.threshold), it is merged and spilled to disk.
My questions (four in total) are about conditions A and B.
In condition A, with the default configuration values of Hadoop 2, would you say that:
when 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (462 MB), the merge and spill will begin?
In condition B, with the default configuration values, would you say that:
when 1000 key-value pairs of map output from one mapper have been collected in the buffer, the merge and spill will begin?
In the above question, is "one mapper" correct, or could it be more than one mapper from different machines?
In the following paragraph, the author says:
As the copies accumulate on disk, a background thread merges them into larger, sorted files
Is it correct that what is meant here is: when a spill file is about to be written to disk, a spill that already exists on disk is merged with the current spill?
Please help me to better understand what is really happening in Hadoop.
Please see this image
Each mapper runs on a different machine.
Can anyone explain the precise meaning of the above?
When 0.462 (0.66 * 0.70) of the reducer's 1 GB of memory is full (462 MB), will the merge and spill begin?
Yes. This will trigger the merge and spill to disk.
By default each reducer gets 1 GB of heap space. Of this, only 70% (mapreduce.reduce.shuffle.input.buffer.percent) is used as the merge-and-shuffle buffer; the other 30% may be needed by the reducer code and for other task-specific requirements. When 66% (mapreduce.reduce.shuffle.merge.percent) of that merge-and-shuffle buffer is full, merging (and spilling) is triggered, to make sure some space is left for the sort and so that, during the sort, other incoming files are not left waiting for memory. The arithmetic is spelled out in the sketch below.
refer http://www.bigsynapse.com/mapreduce-internals
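As a back-of-the-envelope check of those numbers (a sketch only; the 1 GB heap comes from the question, which rounds 1 GB to 1000 MB):

public class ReduceShuffleThresholds {
    public static void main(String[] args) {
        double reducerHeapMb = 1000;       // the question treats the reducer's 1 GB heap as 1000 MB
        double inputBufferPercent = 0.70;  // mapreduce.reduce.shuffle.input.buffer.percent
        double mergePercent = 0.66;        // mapreduce.reduce.shuffle.merge.percent

        double shuffleBufferMb = reducerHeapMb * inputBufferPercent;  // ~700 MB for copied map outputs
        double mergeTriggerMb = shuffleBufferMb * mergePercent;       // ~462 MB triggers merge and spill

        System.out.printf("shuffle buffer ~ %.0f MB, merge trigger ~ %.0f MB%n",
                shuffleBufferMb, mergeTriggerMb);
    }
}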
When 1000 key-value pairs of map output from one mapper have been collected in the buffer, will the merge and spill begin?
No. "Map outputs" here may refer to the partitions of the map output file (I think).
1000 key-value pairs might be a very small amount of memory.
In slide 24 of this presentation by Cloudera, the same thing is referred to as segments:
mapreduce.reduce.merge.inmem.threshold segments accumulated (default 1000)
In the above question, is "one mapper" correct, or could it be more than one mapper from different machines?
From my understanding, those 1000 segments can come from different mappers, as the partitions of the map output files are copied in parallel from different nodes.
Is it correct that what is meant here is: when a spill file is about to be written to disk, a spill that already exists on disk is merged with the current spill?
No. A (map output) partition file currently in memory is merged with other (map output) partition files in memory to create a spill file, which is then written to disk. This may be repeated several times as the buffer fills up, so many files can end up on disk. To increase efficiency, these on-disk files are merged into a larger file by a background thread.
Sources:
Hadoop: The Definitive Guide, 4th edition.
http://answers.mapr.com/questions/3952/reduce-merge-performance-issues.html
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://www.bigsynapse.com/mapreduce-internals
I would like to add to #shanmuga's answer.
1) Note that NOT all map outputs are stored in memory first and then spilled to disk.
Some map outputs are directly stored on Disk (OnDiskMapOutput).
Based on the uncompressed map output length (Size), it is determined whether this particular map's output will be kept in memory or on disk. The decision is based on:
Map Output Buffer Limit (MemoryLimit) = mapreduce.reduce.memory.totalbytes [Default = -Xmx * mapreduce.reduce.shuffle.input.buffer.percent [Default = 70%]]
MaxSingleShuffleLimit = MemoryLimit * mapreduce.reduce.shuffle.memory.limit.percent [Default = 25%]
If Size > MaxSingleShuffleLimit => OnDiskMapOutput
Else if (total size of map output buffer currently used) < MemoryLimit => InMemoryMapOutput
Else => halt the process
2) The SORT/MERGE phase is triggered during the COPY phase when either of these conditions is met:
At the end of an InMemoryMapOutput creation: if (total size of the current set of InMemoryMapOutputs) > MemoryLimit * mapreduce.reduce.shuffle.merge.percent [Default: 90%], start the InMemoryMerger thread, which merges all the current InMemoryMapOutputs into one OnDiskMapOutput.
At the end of an OnDiskMapOutput creation: if (number of OnDiskMapOutputs) > 2 * mapreduce.task.io.sort.factor [Default: 64 (CDH) / 100 (code)], start the OnDiskMerger thread, which merges mapreduce.task.io.sort.factor OnDiskMapOutputs into one OnDiskMapOutput. (A small sketch of these rules follows.)
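Putting the rules above together, here is an illustrative sketch (not the actual Hadoop source; the field names simply mirror the quantities described above):

public class CopyPhaseDecisionSketch {

    final long memoryLimit;            // heap * mapreduce.reduce.shuffle.input.buffer.percent
    final long maxSingleShuffleLimit;  // memoryLimit * mapreduce.reduce.shuffle.memory.limit.percent
    final double mergePercent;         // mapreduce.reduce.shuffle.merge.percent
    final int ioSortFactor;            // mapreduce.task.io.sort.factor

    long inMemoryBytesUsed = 0;        // total size of current InMemoryMapOutputs
    int onDiskSegments = 0;            // number of current OnDiskMapOutputs

    CopyPhaseDecisionSketch(long heapBytes, double inputBufferPct, double memoryLimitPct,
                            double mergePct, int ioSortFactor) {
        this.memoryLimit = (long) (heapBytes * inputBufferPct);
        this.maxSingleShuffleLimit = (long) (memoryLimit * memoryLimitPct);
        this.mergePercent = mergePct;
        this.ioSortFactor = ioSortFactor;
    }

    // 1) Where does the next copied map output go? (Conditions as stated above.)
    String destinationFor(long uncompressedSize) {
        if (uncompressedSize > maxSingleShuffleLimit) return "OnDiskMapOutput";
        if (inMemoryBytesUsed < memoryLimit) return "InMemoryMapOutput";
        return "halt (wait until the in-memory merger frees space)";
    }

    // 2a) Trigger for merging the in-memory segments into one on-disk segment.
    boolean shouldStartInMemoryMerge() {
        return inMemoryBytesUsed > memoryLimit * mergePercent;
    }

    // 2b) Trigger for merging on-disk segments into a larger one.
    boolean shouldStartOnDiskMerge() {
        return onDiskSegments > 2 * ioSortFactor;
    }

    public static void main(String[] args) {
        CopyPhaseDecisionSketch s =
                new CopyPhaseDecisionSketch(1L << 30, 0.70, 0.25, 0.90, 10);
        System.out.println(s.destinationFor(500L << 20)); // a 500 MB segment -> OnDiskMapOutput
    }
}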
3) Regarding condition B you mentioned, I don't see the flag mapreduce.reduce.merge.inmem.threshold referenced in the merge-related source code of MapReduce; this flag is perhaps deprecated.
I am reading Hadoop: The Definitive Guide, 3rd edition, by Tom White. It is an excellent resource for understanding the internals of Hadoop, especially MapReduce, which I am interested in.
From the book (page 205):
Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort—and transfers the map outputs to the reducers as inputs—is known as the shuffle.
What I infer from this is that before keys are sent to the reducer, they are sorted, indicating that the output of the map phase of the job is sorted. Please note: I don't call it the mapper, since the map phase includes both the mapper (written by the programmer) and the built-in sort mechanism of the MR framework.
The Map Side
Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default, a size which can be tuned by changing the io.sort.mb property. When the contents of the buffer reaches a certain threshold size (io.sort.spill.percent, default 0.80, or 80%), a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete.
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
My understanding of the above paragraph is that as the mapper produces key-value pairs, they are partitioned and sorted. A hypothetical example:
consider mapper-1 for a word-count program:
>mapper-1 contents
partition-1
xxxx: 2
yyyy: 3
partition-2
aaaa: 15
zzzz: 11
(Note: within each partition the data is sorted by key, but it is not necessary that partition-1's data and partition-2's data follow any sequential order.)
Continuing reading the chapter:
Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.
My understanding here is as follows (note the phrase in bold in the paragraph above, which is what tricked me):
Within a map task, several files may be spilled to disk, but they are merged into a single file that is still partitioned and sorted. Consider the same example as above.
Before a single map task is finished, its intermediate data could be:
mapper-1 contents
spill 1:
partition-1
xxxx: 2
yyyy: 3
partition-2
aaaa: 15
zzzz: 10
spill 2:
partition-1
xxxx: 3
yyyy: 7
partition-2
bbbb: 15
zzzz: 15
spill 3:
partition-1
hhhh: 5
mmmm: 2
yyyy: 9
partition-2
cccc: 15
zzzz: 13
After the map task is completed, the output from the mapper will be a single file (note that the three spill files above are merged now, but no combiner is applied, assuming no combiner was specified in the job configuration):
>Mapper-1 contents:
partition-1:
hhhh: 5
mmmm: 2
xxxx: 2
xxxx: 3
yyyy: 3
yyyy: 7
yyyy: 9
partition-2:
aaaa: 15
bbbb: 15
cccc: 15
zzzz: 10
zzzz: 15
zzzz: 13
So here partition-1 may correspond to reducer-1. That is, the data in the partition-1 segment above is sent to reducer-1, and the data in the partition-2 segment is sent to reducer-2.
If my understanding so far is correct:
How will I be able to get the intermediate file that has both the partitions and the sorted data from the mapper output?
It is interesting to note that running the mapper alone does not produce sorted output, which seems to contradict the point that the data sent to the reducer is sorted. More details here
Also, no combiner is applied if only a mapper (and no reducer) is run: More details here
Map-only jobs work differently than Map-and-Reduce jobs. It's not inconsistent, just different.
How will I be able to get the intermediate file that has both the partitions and the sorted data from the mapper output?
You can't. There isn't a hook to be able to get pieces of data from intermediate stages of MapReduce. Same is true for getting data after the partitioner, or after a record reader, etc.
It is interesting to note that running the mapper alone does not produce sorted output, which seems to contradict the point that the data sent to the reducer is sorted. More details here
It does not contradict it. Mappers sort because the reducer needs the data sorted to be able to do a merge. If there are no reducers, there is no reason to sort, so it doesn't happen. This is the right behavior, because I don't want the output sorted in a map-only job; that would make my processing slower. I've never had a situation where I wanted my map output to be locally sorted.
Also, no combiner is applied if only a mapper is run: More details here
Combiners are an optimization. There is no guarantee that they actually run, or over what data. Combiners are mostly there to make the reducers more efficient. So, again, just like the local sorting, combiners do not run if there are no reducers, because there is no reason to.
If you want combiner-like behavior, I suggest writing data into a buffer (a HashMap, perhaps) and then writing out locally summarized data in the cleanup function that runs when a mapper finishes. Be careful of memory usage if you want to do this. This is a better approach because combiners are specified as a nice-to-have optimization, and you can't count on them running, even when they do run. A sketch of this idea follows.
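A minimal sketch of that suggestion for the word-count case, using the standard Mapper API (the class name is made up; a real implementation should also flush the map once it grows past some size to bound memory):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: buffer partial counts in a HashMap and write them out in cleanup(),
// which runs once when the mapper finishes. Unlike a combiner, this is guaranteed to run.
public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);  // local pre-aggregation instead of emitting (word, 1)
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        Text word = new Text();
        IntWritable count = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            word.set(e.getKey());
            count.set(e.getValue());
            context.write(word, count);  // one pre-summed record per distinct word seen by this mapper
        }
    }
}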
Hadoop: The Definitive Guide (Tom White), page 178,
section "Shuffle and Sort: The Map Side",
just after Figure 6-4:
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
Question :
Does this mean the map writes each key's output to a different file and then combines them later?
Thus, if there were two different keys' outputs to be sent to a reducer, would each key be sent separately to the reducer instead of sending a single file?
If my reasoning above is incorrect, what actually happens?
Only if the two keys' outputs are going to different reducers. If the partitioner decides they should go to the same reducer, they will be in the same file.
-- Updated to include more details, mostly from the book:
The partitioner just sorts the keys into buckets, 0 to n-1 for the n reducers in your job. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. For a given job, the jobtracker knows the mapping between map outputs and hosts, and a thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.
The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk.
As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time merging later on. Note that any map outputs that were compressed (by the map task) have to be decompressed in memory in order to perform a merge on them.
When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map’s merge), there would be five rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files.
Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.
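Following the book's simplified description, the arithmetic for that example is just a ceiling division (a sketch; the real merge planner is a bit more subtle about round sizes so that the final merge feeds exactly io.sort.factor segments to the reduce phase):

public class MergeRoundsSketch {
    // Each round merges up to mergeFactor segments into one.
    static int rounds(int mapOutputs, int mergeFactor) {
        return (int) Math.ceil((double) mapOutputs / mergeFactor);
    }

    public static void main(String[] args) {
        // 50 map outputs with merge factor 10 -> 5 rounds -> 5 intermediate segments,
        // which are then fed directly to the reduce function in the final merge.
        System.out.println(rounds(50, 10) + " intermediate segments");
    }
}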
If we have configured multiple reducers, then during partitioning, keys destined for different reducers are stored in separate files corresponding to each reducer, and at the end of the map task the complete file is sent to the reducer, not one key at a time.
Say you have 3 reducers running. You can then use a partitioner to decide which keys go to which of the three reducers. You could, for example, compute X % 3 in the partitioner to decide which reducer a key goes to. Hadoop by default uses HashPartitioner. A minimal sketch of such a partitioner follows.
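This is a sketch of that X % 3 idea (illustrative only; the class name is made up, and it assumes integer keys). You would plug it in with job.setPartitionerClass(ModuloPartitioner.class) and job.setNumReduceTasks(3); without it, HashPartitioner routes each key by key.hashCode() modulo the number of reducers.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route each integer key to one of numReduceTasks buckets, like HashPartitioner does with hashCode().
public class ModuloPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        // Mask off the sign bit so the result is always in [0, numReduceTasks).
        return (key.get() & Integer.MAX_VALUE) % numReduceTasks;
    }
}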