There is something I don't have clear about the whole functioning of a MapReduce programming environment.
Suppose 1k random, unsorted words in the form (word, 1) come out of one (or more) mappers, and with the reducer I want to save them all in a single huge sorted file. How does that work? I mean, does the reducer itself sort all the words automatically? What should the reducer function do? What if I have just one reducer with limited RAM and disk?
By the time the reducer gets the data, the data has already been sorted on the map side.
The process works like this:
Map side:
1. Each InputSplit is processed by a map task, and the map output is temporarily placed in a circular in-memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default at 80% of the buffer size), a spill file is created on the local file system.
2. Before writing to disk, the thread first divides the data into partitions matching the number of reduce tasks, so that each reduce task corresponds to one partition's data; this avoids some reduce tasks being assigned huge amounts of data while others get little or none. Within each partition the data is sorted by key. If a Combiner is set, it is then run on the sorted result.
3. By the time the map task writes its last record, there may be many spill files, and these files need to be merged. Sorting and combining are performed repeatedly during the merge, for two purposes: (1) to minimize the amount of data written to disk each time, and (2) to minimize the amount of data transferred over the network in the subsequent copy phase. The spills are finally merged into a single partitioned, sorted file. To further reduce the amount of data sent over the network, you can compress the map output here by setting mapred.compress.map.output to true (see the configuration sketch after this walkthrough).
4. The data in each partition is copied to the corresponding reduce task.
Reduce side:
1. The reducer receives data from the different map tasks, and the data sent from each map is already sorted. If the amount of data received on the reduce side is small, it is kept directly in memory; once it exceeds a certain proportion of the buffer size, the data is merged and written to disk.
2. As the number of spill files grows, a background thread merges them into larger sorted files. In fact, on both the map side and the reduce side, MapReduce repeatedly performs sorting and merging.
3. The merge process generates many intermediate files on disk, but MapReduce keeps the amount of data written to disk as small as possible, and the result of the last merge is not written to disk at all; it is fed directly into the reduce function.
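To make the properties mentioned in this walkthrough concrete, here is a minimal configuration sketch, assuming the old mapred-era property names (Hadoop 1.x style); io.sort.spill.percent is not named in the thread but is the matching 1.x property for the 80% threshold, and newer releases use different property names.

    // A minimal sketch, assuming old mapred-era (Hadoop 1.x style) property names.
    import org.apache.hadoop.conf.Configuration;

    public class MapSideSpillConfig {
        public static Configuration configure() {
            Configuration conf = new Configuration();
            conf.setInt("io.sort.mb", 100);                       // size of the circular map-output buffer, in MB
            conf.setFloat("io.sort.spill.percent", 0.80f);        // spill to disk once the buffer is 80% full
            conf.setBoolean("mapred.compress.map.output", true);  // compress map output before the copy phase
            return conf;
        }
    }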
My summary from the book "Hadoop: The Definitive Guide" by Tom White is:
All the logic between the user's map function and the user's reduce function is called the shuffle. The shuffle therefore spans both the map and reduce sides. After the user's map() function, the output sits in an in-memory circular buffer. When the buffer is 80% full, a background thread starts to run and writes the buffer's contents into a spill file. This spill file is partitioned by key, and within each partition the key-value pairs are sorted by key. After sorting, the combiner function is called if one is enabled. All spill files are merged into one MapOutputFile, and every map task's MapOutputFile is collected over the network by the reduce task. The reduce task then does another sort, and finally the user's reduce function is called.
So the questions are:
1.) According to the above summary, this is the flow:
Mapper--Partitioner--Sort--Combiner--Shuffle--Sort--Reducer--Output
1a.) Is this the flow or is it something else?
1b.) Can you explain the above flow with an example, say the word-count example (the ones I found online weren't that elaborate)?
2.) So each map task's output is one big file (the MapOutputFile)? And it is this one big file that is broken into partitions whose key-value pairs are passed on to the respective reducers?
3.) Why does the sorting happen a second time, when the data is already sorted and combined when passed on to the respective reducers?
4.) Say mapper1 runs on datanode1; is it necessary for reducer1 to also run on datanode1, or can it run on any datanode?
Answering this question fully would mean rewriting the whole history. A lot of your doubts have to do with operating-system concepts rather than MapReduce itself.
Mapper output is written to the local file system. The data is partitioned based on the number of reducers, and within each partition there can be multiple files, depending on how many times spills have happened.
Each small file in a given partition is sorted, because an in-memory sort is done before the file is written.
Why does the data need to be sorted on the mapper side?
a. The data is sorted and merged on the mapper side to decrease the number of files.
b. The files are sorted because otherwise it would be impractical for the reducer to gather all the values for a given key.
After the reducer gathers the data, the number of files on the system first needs to be decreased (remember that ulimit imposes a fixed open-file limit for every user, in this case hdfs).
The reducer then just maintains a file pointer into a small set of sorted files and merges them.
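Since question 1b above asked for a word-count walkthrough, here is a minimal sketch using the standard org.apache.hadoop.mapreduce API. It shows the part the answers keep referring to: the mapper emits (word, 1) pairs, the framework partitions, sorts, and merges them, and the reducer then sees each word together with all of its counts.

    // Minimal word-count sketch: the mapper emits (word, 1); the shuffle
    // partitions, sorts, and groups by key; the reducer sums the counts.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);          // emit (word, 1)
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {          // all counts for this word arrive together
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }

Because the shuffle delivers the keys to each reducer in sorted order, the reducer never has to hold more than one key's values at a time.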
For more interesting details, please refer to:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
I am reading the original MapReduce paper. My understanding is that when working with, say, hundreds of GBs of data, the network bandwidth for transferring so much data can be the bottleneck of a MapReduce job. For map tasks, we can reduce network bandwidth by scheduling map tasks on workers that already contain the data for any given split, since reading from local disk does not require network bandwidth.
However, the shuffle phase seems to be a huge bottleneck. A reduce task can potentially receive intermediate key/value pairs from all map tasks, and almost all of these intermediate key/value pairs will be streamed across the network.
When working with hundreds of GBs of data or more, is it necessary to use a combiner to have an efficient MapReduce job?
The combiner plays an important role when it fits the situation: it acts like a local reducer, so instead of sending all the data, the mapper sends only a few locally aggregated values. But a combiner can't be applied in all cases.
If a reduce function is both commutative and associative, then it can be used as a Combiner.
For example, it won't work for computing a median.
So a combiner can't be used in every situation.
There are other parameters that can be tuned as well (a configuration sketch follows after this list):
When the map emits output, it does not go directly to disk; it goes into a 100 MB circular buffer, and when that buffer is 80% full the records are spilled to disk.
You can increase the buffer size and raise the spill threshold, in which case there will be fewer spills.
If there are many spill files, they are merged into a single file; you can tune the merge factor (io.sort.factor) that controls how many are merged at a time.
There are threads that copy data from the local disk to the reducer JVMs, and their number can be increased.
Compression can be used both at the intermediate (map output) level and at the final output level.
So the combiner is not the only solution, and it can't be used in every situation.
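As a rough illustration of those knobs, here is a hedged configuration sketch using the old mapred-era (Hadoop 1.x style) property names; treat the values as examples only, not recommendations.

    // Illustrative only: old-style property names for the tuning knobs listed
    // above; the right values depend entirely on your cluster and job.
    import org.apache.hadoop.conf.Configuration;

    public class ShuffleTuningSketch {
        public static Configuration configure() {
            Configuration conf = new Configuration();
            conf.setInt("io.sort.mb", 200);                       // bigger map-side buffer -> fewer spills
            conf.setFloat("io.sort.spill.percent", 0.90f);        // spill later
            conf.setInt("io.sort.factor", 50);                    // merge more spill files per round
            conf.setInt("mapred.reduce.parallel.copies", 10);     // more copier threads fetching map output
            conf.setBoolean("mapred.compress.map.output", true);  // compress intermediate (map output) data
            return conf;
        }
    }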
Let's say the mapper is emitting (word, count) pairs. Without a combiner, if a mapper sees the word abc 100 times, the reducer has to pull (abc, 1) 100 times. Say the size of one (word, count) pair is 7 bytes: without a combiner the reducer has to pull 7 * 100 = 700 bytes of data, whereas with a combiner it only needs to pull 7 bytes. This example just illustrates how the combiner can reduce network traffic.
Note: this is a rough example just to make the idea easier to understand.
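To make the (abc, 1) example concrete: a word-count style sum is commutative and associative, so the reducer class itself can be registered as the combiner. Here is a minimal driver sketch (it reuses the Mapper/Reducer class names from the word-count sketch earlier in this thread; the input/output paths are placeholders):

    // Driver that registers the sum reducer as a combiner, so (abc, 1) repeated
    // 100 times leaves the mapper as a single (abc, 100) pair.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.SumReducer.class); // local aggregation on the map side
            job.setReducerClass(WordCount.SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }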
What if the output is so big that it does not fit into the reducer's RAM?
For example, a sorting task: in that case the output is as big as the input. If you use a single reducer, all the data does not fit into RAM. How does the sorting take place then?
I think I have got the answer.
Yes, it is possible to handle the whole job's output in a single reducer, even if the data is bigger than the reducer's memory. In the shuffle phase, the reducer copies the data from the mappers into its memory and sorts it until the memory fills up. Once it spills, that part of the data is stored on the reducer's local disk and the reducer starts receiving new values. Once it spills again, the new data is merged with the previously stored file, and the merged file stays sorted (essentially an external merge sort). Once the shuffle is done, the intermediate key-value pairs are stored in sorted order, and the reduce task is then performed on that data. Because the data is sorted, the aggregation is easy to do in memory by taking one chunk of the data at a time.
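This is not the actual Hadoop implementation, just a plain-Java sketch of the external-merge idea described above: many already-sorted spill files are merged into one sorted output while keeping only one record per file in memory.

    // k-way external merge of sorted text files, holding one line per file in RAM.
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ExternalMergeSketch {

        // One open spill file plus the last line read from it.
        private static class Entry {
            final BufferedReader reader;
            String line;
            Entry(BufferedReader reader, String line) { this.reader = reader; this.line = line; }
        }

        public static void merge(List<Path> sortedSpills, Path output) throws IOException {
            PriorityQueue<Entry> heap = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
            for (Path spill : sortedSpills) {
                BufferedReader r = Files.newBufferedReader(spill);
                String first = r.readLine();
                if (first != null) heap.add(new Entry(r, first));
            }
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                while (!heap.isEmpty()) {
                    Entry smallest = heap.poll();          // globally smallest remaining key
                    out.write(smallest.line);
                    out.newLine();
                    String next = smallest.reader.readLine();
                    if (next == null) {
                        smallest.reader.close();
                    } else {
                        smallest.line = next;
                        heap.add(smallest);                // re-insert with the file's next line
                    }
                }
            }
        }
    }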
In my Hadoop job, when I set the number of reducers to 0, the map phase is dramatically faster than when the number of reducers is not 0. At the beginning of the map phase no reducer is running yet, so I don't understand why the map time increases so dramatically.
If you have not configured a reducer, the map output will not be sorted before being written to disk.
The reason is that Hadoop uses an external sort algorithm, which means that the map tasks sort their task output [1]. Then the reducer just merges the sorted map output segments together.
If there is no reducer, there is no need to group the data by key, and thus no need to sort.
[1] Addition for possible nit-pickers: A map task starts to sort once its output buffer is filled up. This sorted segment is spilled to disk and merged at the end of the map task with all other spilled segments until a single sorted file emerges. Sending a single file (maybe even compressed) is much more efficient for bandwidth usage / transfer performance. On the reducer side, the sorted files will then be merged again. The very last merge pass is directly streamed into the reduce method.
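For completeness, a small sketch of the map-only case (class names and paths are just placeholders): setting the number of reducers to zero makes the map output go straight to the output format, with no partitioning, sorting, or shuffling.

    // Map-only job: setNumReduceTasks(0) skips partition, sort, and shuffle entirely.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {
        // The base Mapper's map() already passes (key, value) through unchanged.
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map only");
            job.setJarByClass(MapOnlyJob.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setNumReduceTasks(0); // no reducer -> no partitioning, no sort, no shuffle
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }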
Hadoop: The Definitive Guide (Tom White), page 178,
section "Shuffle and Sort: The Map Side",
just after Figure 6-4:
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
Question:
Does this mean the map writes each key's output to a different file and then combines them later?
That is, if there were two different keys' outputs to be sent to a reducer, would each key be sent separately to the reducer instead of sending a single file?
If my reasoning above is incorrect, what actually happens?
Only if the two keys' outputs are going to different reducers. If the partitioner decides they should go to the same reducer, they will be in the same file.
-- Updated to include more details, mostly from the book:
The partitioner just assigns the keys to buckets, numbered 0 to n-1 for the n reducers in your job. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. For a given job, the jobtracker knows the mapping between map outputs and hosts, and a thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.
The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk.
As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time merging later on. Note that any map outputs that were compressed (by the map task) have to be decompressed in memory in order to perform a merge on them.
When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map’s merge), there would be five rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files.
Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.
If we have configured multiple reducers, then during partitioning, keys destined for different reducers are stored in separate files corresponding to each reducer, and at the end of the map task the complete file is sent to the reducer, not a single key at a time.
Say you have 3 reducers running. You can then use a partitioner to decide which keys go to which of the three reducers; for example, you could do something like X % 3 on the key in the partitioner to decide which reducer a key goes to. Hadoop by default uses HashPartitioner (a sketch follows below).
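A minimal sketch of that idea: a custom Partitioner that does essentially what the default HashPartitioner does (the class name is just illustrative).

    // Routes each (word, count) pair to one of the configured reducers by hash.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask off the sign bit, then mod by the number of reducers (e.g. 3),
            // so every key lands in a partition from 0 to numReduceTasks - 1.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

You would register it with job.setPartitionerClass(WordPartitioner.class) together with job.setNumReduceTasks(3).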