MapReduce shuffle phase bottleneck - hadoop

I am reading the original MapReduce paper. My understanding is that when working with, say hundreds of GBs of data, the network bandwidth for transferring so much data can be the bottleneck of a MapReduce job. For map tasks, we can reduce network bandwidth by scheduling map tasks on workers that already contain the data for any given split, since reading from local disk does not require network bandwidth.
However, the shuffle phase seems to be a huge bottleneck. A reduce task can potentially receive intermediate key/value pairs from all map tasks, and almost all of these intermediate key/value pairs will be streamed across the network.
When working with hundreds of GBs of data or more, is it necessary to use a combiner to have an efficient MapReduce job?

A combiner plays an important role when it fits the problem. It acts like a local reducer, so instead of sending all the data, each mapper sends only a few values or a locally aggregated value. However, a combiner can't be applied in every case.
If a reduce function is both commutative and associative, it can be used as a combiner.
Computing a median, for example, won't work.
So a combiner can't be used in every situation.
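To make the commutative/associative point concrete, here is a minimal Python sketch (not Hadoop code). The per-mapper values are made up for illustration, and the helper names are hypothetical. It applies the same function once on all raw values and once as "combiner, then reducer", and compares the results for sum and for median.

```python
from statistics import median

# Hypothetical values emitted for one key by three different mappers.
mapper_outputs = [[1, 2, 3], [4, 5, 6, 7, 8], [100]]

def reduce_all(fn, groups):
    """No combiner: every raw value reaches the reducer."""
    return fn([v for group in groups for v in group])

def reduce_with_combiner(fn, groups):
    """Apply fn locally on each mapper, then again on the partial results."""
    return fn([fn(group) for group in groups])

sum_direct = reduce_all(sum, mapper_outputs)
sum_combined = reduce_with_combiner(sum, mapper_outputs)
median_direct = reduce_all(median, mapper_outputs)
median_combined = reduce_with_combiner(median, mapper_outputs)

print(sum_direct, sum_combined)        # same result: sum is safe to combine
print(median_direct, median_combined)  # different results: median is not
```

The sum is unchanged by local aggregation, while the median of the per-mapper medians differs from the true median.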
There are other parameters that can be tuned:
When a map task emits output, it does not go straight to disk. It goes to a 100 MB circular buffer, and once the buffer is 80% full, records are spilled to disk.
You can increase the buffer size and raise the threshold so that there are fewer spills.
If there are many spills, they are merged into a single file; you can adjust the merge factor (io.sort.factor).
Copier threads fetch the map output from the mappers' local disks into the reducer JVMs; their number can be increased.
Compression can be used for both the intermediate and the final output.
So a combiner is not the only solution, and it can't be used in every situation.

Let's say the mapper emits (word, count). Without a combiner, if a mapper sees the word abc 100 times, the reducer has to pull (abc, 1) 100 times. Let's say the size of a (word, count) pair is 7 bytes. Without a combiner, the reducer has to pull 7 * 100 = 700 bytes of data, whereas with a combiner it only needs to pull 7 bytes. This example just illustrates how a combiner can reduce network traffic.
Note: this is a rough example, just to make things easier to understand.
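The arithmetic above can be reproduced with a small Python sketch. This is only a simulation of the idea, not Hadoop code; the 7-byte pair size is the assumption from the example, and the function name is hypothetical.

```python
from collections import Counter

PAIR_SIZE_BYTES = 7  # assumed size of one (word, count) pair, as in the example

# One mapper emits ("abc", 1) a hundred times.
map_output = [("abc", 1)] * 100

def combine(pairs):
    """Local aggregation: sum the counts per word before anything is sent."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

combined_output = combine(map_output)

bytes_without_combiner = len(map_output) * PAIR_SIZE_BYTES
bytes_with_combiner = len(combined_output) * PAIR_SIZE_BYTES
print(combined_output)                              # [('abc', 100)]
print(bytes_without_combiner, bytes_with_combiner)  # 700 7
```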

Related

Map Reduce, does reducer automatically sorts?

There is something that is not clear to me about how a MapReduce programming environment works as a whole.
Suppose there are 1k random unsorted words in the form (word, 1) coming out of one or more mappers, and I want the reducer to save them all in a single huge sorted file. How does that work? Does the reducer itself sort all the words automatically? What should the reduce function do? What if I have just one reducer with limited RAM and disk?
By the time the reducer gets the data, it has already been sorted on the map side.
The process works like this:
Map side:
1. Each input split is processed by a map task, and the map output is placed temporarily in a circular memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer reaches its spill threshold (80% of the buffer size by default), a spill file is created on the local file system.
2. Before writing to disk, the thread first divides the data into as many partitions as there are reduce tasks, so that each reduce task corresponds to one partition. The aim is to avoid some reduce tasks being assigned a large amount of data while others get little or none. The data within each partition is sorted. If a combiner is set, it is run on the sorted result at this point.
3. By the time the map task outputs its last record, there may be many spill files, and these need to be merged. Sorting and combining are performed repeatedly during the merge for two purposes: 1. to minimize the amount of data written to disk each time; 2. to minimize the amount of data transferred over the network during the following copy phase. The result is a single partitioned and sorted file. To reduce the amount of data sent over the network, you can compress the map output here by setting mapred.compress.map.output to true.
4. Copy the data from the partition to the corresponding reduce task.
Reduce side:
1. The reduce task receives data from different map tasks, and the data sent from each map is already sorted. If the amount of data received on the reduce side is small, it is kept in memory. If it exceeds a certain proportion of the buffer size, the data is merged and written to disk.
2. As the number of spill files grows, a background thread merges them into larger sorted files. In fact, on both the map side and the reduce side, MapReduce repeatedly performs sorting and merging.
3. The merge process generates many intermediate files (written to disk), but MapReduce keeps the data written to disk as small as possible, and the result of the last merge is not written to disk but fed directly to the reduce function.
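The steps above can be sketched in Python as "sort small runs, then merge sorted runs". This is a toy model, not Hadoop's implementation: the buffer capacity is counted in records rather than bytes, partitioning is left out, and the words and function names are made up for illustration.

```python
import heapq

def map_side_spills(records, buffer_capacity):
    """Sort each full buffer and 'spill' it as one sorted run."""
    spills = []
    for start in range(0, len(records), buffer_capacity):
        spills.append(sorted(records[start:start + buffer_capacity]))
    return spills

def merge_runs(runs):
    """K-way merge of already sorted runs (used on both map and reduce side)."""
    return list(heapq.merge(*runs))

# Hypothetical unsorted (word, 1) pairs from two mappers.
mapper_a = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1), ("kiwi", 1)]
mapper_b = [("plum", 1), ("date", 1), ("apple", 1)]

# Each mapper spills sorted runs, then merges them into one sorted output.
output_a = merge_runs(map_side_spills(mapper_a, buffer_capacity=2))
output_b = merge_runs(map_side_spills(mapper_b, buffer_capacity=2))

# The reducer only merges the sorted map outputs; it never sorts from scratch.
reducer_input = merge_runs([output_a, output_b])
words = [word for word, _ in reducer_input]
print(words)
```

Because the merge is streaming, the reducer can consume keys in sorted order without holding all of the data in memory, which is why a single reducer with limited RAM can still produce a sorted file.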

Can hadoop map/reduce be speeded up by splitting data size?

Can I improve the performance of my Hadoop map/reduce job by splitting the input data into smaller chunks?
First question:
For example, I have 1GB of input file for mapping task. My default block size is 250MB. So only 4 mappers will be assigned to do the job. If I split the data into 10 pieces, each piece will be 100MB, then I have 10 mappers to do the work. But then each split piece will occupy 1 block in the storage, which means 150MB will be wasted for each split data block. What should I do in this case if I don't want to change the block size of my storage?
Second question: If splitting the input data before the map job improves the performance of the map phase, and I want to do the same for the reduce phase, should I ask the mapper to split the data before giving it to the reducer, or should I let the reducer do it?
Thank you very much. Please correct me if I also misunderstand something. Hadoop is quite new to me. So any help is appreciated.
When you store a 100 MB piece in a 250 MB block, the remaining 150 MB is not wasted. HDFS only uses as much disk space as the file actually contains, so that space is still available to the system.
Having more mappers does not necessarily improve performance, because it depends on the number of DataNodes you have. For example, 10 DataNodes with 10 mappers is a good fit. But with 4 DataNodes and 10 mappers, obviously not all mappers can run simultaneously. So if you have 4 DataNodes, it is better to have 4 blocks (with a 250 MB block size).
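As a rough illustration of that point, the sketch below counts how many sequential rounds ("waves") of map tasks are needed. It assumes one map slot per node by default, which is an assumption made here for simplicity; the function name and parameters are hypothetical.

```python
import math

def map_waves(num_map_tasks, num_nodes, slots_per_node=1):
    """Number of sequential rounds needed to run all map tasks."""
    concurrent_tasks = num_nodes * slots_per_node
    return math.ceil(num_map_tasks / concurrent_tasks)

print(map_waves(10, 10))  # 10 mappers on 10 nodes
print(map_waves(10, 4))   # 10 mappers on 4 nodes
print(map_waves(4, 4))    # 4 mappers on 4 nodes
```

With 4 nodes, the 10 smaller tasks need three rounds, while 4 larger tasks need one. Whether that is faster overall depends on how long each task runs and on how many slots each node really has.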
The reducer is something like a merge of all your mappers' output, and you can't ask the mapper to split the data. Instead, you can ask the mapper to do a mini-reduce by defining a combiner. A combiner is nothing but a reducer that runs on the same node where the mapper was executed, before the data is sent to the actual reducer. This minimizes the I/O as well as the work of the actual reducer. Introducing a combiner is the better option for improving performance.
Good luck with Hadoop !!
There can be multiple mappers running in parallel on a node for the same job, depending on the number of map slots available on the node. So yes, making smaller pieces of the input should give you more parallel mappers and speed up the process. (How do you feed all the pieces in as a single input? Put them all in one directory and add that directory as the input path.)
On the reducer side, if you are OK with combining multiple output files in post-processing, you can set a higher number of reducers; the maximum number of reducers running in parallel is the number of reduce slots available in your cluster. This should improve cluster utilisation and speed up the reduce phase.
If possible, you may also use a combiner to reduce disk and network I/O overhead.

Which node sort/shuffle the keys in Hadoop?

In a Hadoop job, which node does the sorting/shuffling phase? Does increasing the memory of that node improve the performance of sorting/shuffling?
The relevant parameters to tune in mapred-site.xml are, in my experience:
io.sort.mb This is the output buffer of a mapper. When this buffer fills up, the data is sorted and spilled to disk. Ideally you avoid having too many spills. Note that this memory is part of the map task's heap.
mapred.map.child.java.opts This sets the heap size of a map task; the higher it is, the larger you can make the output buffer.
In principle the number of reduce tasks also influences the shuffle speed. The number of reduce rounds is the number of reduce tasks divided by the total number of reduce slots, rounded up. Note that the initial shuffle (during the map phase) will only shuffle data to the active reducers. So mapred.reduce.tasks is also relevant.
io.sort.factor is the maximum number of streams merged at once during the merge sort, on both the map and the reduce side.
Compression also has a large impact: it speeds up the transfer from mapper to reducer, but the compression/decompression comes at a cost!
mapred.job.shuffle.input.buffer.percent is the percentage of the reducer's heap used to store map output in memory.
There are without any doubt more tuning opportunities, but these are the ones I spent quite some time playing around with.
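To get a feel for how the sort buffer size affects the number of spills, here is a back-of-the-envelope Python estimate. It is a simplification, not Hadoop's actual accounting: it assumes the buffer spills each time it reaches a fixed fraction of io.sort.mb (80% is used as the default here) and ignores the per-record metadata that also lives in the buffer. The function and parameter names are hypothetical.

```python
import math

def estimated_spills(map_output_mb, io_sort_mb=100, spill_percent=0.80):
    """Rough number of spill files one map task writes."""
    usable_mb = io_sort_mb * spill_percent  # data held before a spill starts
    return math.ceil(map_output_mb / usable_mb)

# Hypothetical map task producing 1000 MB of output.
print(estimated_spills(1000))                  # default 100 MB buffer
print(estimated_spills(1000, io_sort_mb=400))  # larger buffer, fewer spills
```

Keep in mind that the buffer lives inside the map task's heap, so raising io.sort.mb only makes sense if the heap is large enough to hold it.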
The sort and shuffle phase is divided between the mappers and the reducers. That is why we see the reduce % increasing (usually up to 33%) while the mappers are still running.
Increasing the sort buffer memory and the performance gain from that will depend on:
a)The size/total Number of the Keys being emitted by the mapper
b) The Nature of the Mapper Tasks : (IO intensive, CPU intensive)
c) Available Primary Memory, Map/Reduce Slots(occupied) in the given Node
d) Data skewness
You can find more information at https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort

Regarding partitioning of data for reducers

Hadoop the definitive guide (Tom White) Page 178
Section shuffle and sort : The map side.
Just after figure 6-4
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
Question :
Does this mean the map writes the output for each key to a different file and then combines them later?
That is, if there were 2 different keys to be sent to a reducer, would each key be sent separately to the reducer instead of in a single file?
If my reasoning above is incorrect, what actually happens?
Only if the two key outputs are going to different reducers. If the partitioner decides they should go to the same reducer, they will be in the same file.
-- Updated to include more details - Mostly from the book:
The partitioner just sorts the keys into buckets, numbered 0 to n-1, where n is the number of reducers in your job. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. For a given job, the jobtracker knows the mapping between map outputs and hosts. A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.
The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified, it will be run during the merge to reduce the amount of data written to disk.
As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time merging later on. Note that any map outputs that were compressed (by the map task) have to be decompressed in memory in order to perform a merge on them.
When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map’s merge), there would be five rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files.
Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.
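The round arithmetic in the quoted passage can be reproduced with a short Python sketch. This follows the simplified description above (batches of at most io.sort.factor files, one batch per round); Hadoop's real merge scheduling differs in its details, and the function name is hypothetical.

```python
def merge_rounds(num_map_outputs, merge_factor=10):
    """Group files into batches of at most merge_factor; one batch per round."""
    files = list(range(num_map_outputs))
    return [files[i:i + merge_factor]
            for i in range(0, len(files), merge_factor)]

rounds = merge_rounds(50, merge_factor=10)
intermediate_files = len(rounds)  # each round merges its batch into one file

print(len(rounds))               # number of rounds
print([len(r) for r in rounds])  # files merged in each round
print(intermediate_files)        # files left for the final merge into reduce
```

With 50 map outputs and a merge factor of 10 this gives five rounds of ten files each, leaving five intermediate files that are merged directly into the reduce function.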
If multiple reducers are configured, then during partitioning the keys destined for different reducers are stored in separate partitions, one per reducer. When the map task finishes, each reducer receives its complete partition rather than individual keys.
Say you have 3 reducers running. You can then use a partitioner to decide which keys go to which of the three reducers. You could do something like X % 3 in the partitioner to pick the reducer for each key. By default, Hadoop uses HashPartitioner.
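As an illustration of the "hash modulo number of reducers" idea, here is a Python sketch. The hash function imitates Java's String.hashCode() for plain ASCII keys and is only a stand-in chosen here for reproducibility: Hadoop computes the hash on its own key types, so real partition numbers may differ. Both function names are hypothetical.

```python
def java_string_hash(s):
    """Java-style String.hashCode(), kept in the signed 32-bit range."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def hash_partition(key, num_reducers):
    """Mask off the sign bit, then take the remainder, as a hash partitioner does."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

# Hypothetical keys spread over 3 reducers.
keys = ["abc", "hadoop", "shuffle", "abc"]
partitions = [hash_partition(k, 3) for k in keys]
print(partitions)
```

The same key always lands in the same partition, which is what guarantees that all values for a key reach a single reducer.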

how to deal with large map output in hadoop?

I am new to Hadoop and I'm working with a 3-node cluster (each node has 2 GB of RAM).
The input file is small (5 MB), but the map output is very large (about 6 GB).
In the map phase my memory fills up and the tasks run very slowly.
What is the reason?
Can anyone help me make my program faster?
Use NLineInputFormat, where N is the number of input lines each mapper will receive. This way more splits are created, thereby forcing a smaller amount of input data onto each map task. Otherwise, the entire 5 MB will go into one map task.
The size of the map output does not cause memory problems by itself, since a mapper can work in "streaming" mode: it consumes records, processes them and writes them to the output. Hadoop stores some amount of data in memory and then spills it to disk.
So your problem can be caused by one of two things:
a) Your mapper algorithm somehow accumulates data during processing.
b) The cumulative memory given to your mappers is more than the RAM of the nodes. The OS then starts swapping, and performance can fall by orders of magnitude.
Case b is more likely, since 2 GB is really too little for a usual Hadoop configuration. If you are going to keep working with it, I would suggest configuring 1, at most 2, mapper slots per node.
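Case b can be checked with simple arithmetic. The Python sketch below compares the memory a node may be asked for against its RAM. Only the 2 GB node size comes from the question; all slot counts, heap sizes and the daemon overhead are hypothetical placeholders, so substitute the values from your own configuration.

```python
NODE_RAM_MB = 2048  # 2 GB per node, as in the question

def memory_demand_mb(map_slots, reduce_slots, task_heap_mb, daemons_mb):
    """Worst-case memory if every slot runs a task at full heap."""
    return (map_slots + reduce_slots) * task_heap_mb + daemons_mb

def fits_in_ram(demand_mb, ram_mb=NODE_RAM_MB):
    """True if the node can hold everything without swapping."""
    return demand_mb <= ram_mb

# Hypothetical configurations (all numbers are assumptions).
heavy = memory_demand_mb(map_slots=4, reduce_slots=2,
                         task_heap_mb=512, daemons_mb=1024)
light = memory_demand_mb(map_slots=1, reduce_slots=1,
                         task_heap_mb=400, daemons_mb=1024)

print(heavy, fits_in_ram(heavy))  # exceeds 2 GB: the OS will swap
print(light, fits_in_ram(light))  # fits within 2 GB
```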

Resources