Which node sorts/shuffles the keys in Hadoop? - hadoop

In a Hadoop job, which node does the sorting/shuffling phase? Does increasing the memory of that node improve the performance of sorting/shuffling?

The relevant parameters (in my experience) to tune in mapred-site.xml are:
io.sort.mb This is the output buffer of a mapper. When this buffer is full the data is sorted and spilled to disk. Ideally you avoid having too many spills. Note that this memory is part of the map task heap size.
mapred.map.child.java.opts This is the heap size of a map task; the higher it is, the larger you can make the output buffer.
In principle the number of reduce tasks also influences the shuffle speed. The number of reduce rounds is the total number of reduce tasks divided by the number of reduce slots. Note that the initial shuffle (during the map phase) will only shuffle data to the active reducers. So mapred.reduce.tasks is also relevant.
io.sort.factor is the maximum number of streams merged at once during the merge sort, on both the map and the reduce side.
Compression also has a large impact (it speeds up the transfer from mapper to reducer, but the compression/decompression comes at a cost!).
mapred.job.shuffle.input.buffer.percent is the percentage of the reducer's heap to store map output in memory.
There are without any doubt more tuning opportunities, but these are the ones I spent quite some time playing around with.
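As an illustration only, here is a minimal sketch of setting these knobs programmatically with the old (MRv1) property names used above; the values are placeholder assumptions, not recommendations:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 256);                                   // mapper output buffer (MB); part of the map task heap
conf.set("mapred.map.child.java.opts", "-Xmx1024m");              // map task heap, large enough to hold the sort buffer
conf.setInt("io.sort.factor", 50);                                // streams merged at once during the merge sort
conf.setInt("mapred.reduce.tasks", 20);                           // number of reduce tasks
conf.setBoolean("mapred.compress.map.output", true);              // compress intermediate map output
conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);  // fraction of reducer heap for map outputs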

The sort and shuffle phase is divided between the mappers and the reducers. That is the reason we see the reduce % increasing (usually up to 33%) while the mappers are still running.
The performance gain from increasing the sort buffer memory will depend on:
a) The size/total number of the keys being emitted by the mapper
b) The nature of the mapper tasks (IO intensive, CPU intensive)
c) The available primary memory and the occupied map/reduce slots on the given node
d) Data skew
You can find more information at https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort

Related

Why does the time of a Hadoop job decrease significantly when the number of reducers reaches a certain value

I am testing the scalability of a MapReduce-based algorithm with an increasing number of reducers. It looks fine generally (the time decreases with an increasing number of reducers). But the time of the job always decreases significantly when the number of reducers reaches a certain value (30 in my Hadoop cluster), instead of decreasing gradually. What are the possible causes?
Something about My Hadoop Job:
(1) Light map phase. Only a few hundred lines of input. Each line will generate around five thousand key-value pairs. The whole map phase won't take more than 2 minutes.
(2) Heavy reduce phase. Each key in the reduce function will match 1-2 thousand values. And the algorithm in the reduce phase is very compute intensive. Generally the reduce phase will take around 30 minutes to finish.
Time performance plot:
It should be because of the high number of key-value pairs. At a specific number of reducers they get distributed evenly across the reducers, which results in all reducers finishing their work at almost the same time. Otherwise it might be the case that the job keeps waiting for one or two heavily loaded reducers to finish their work.
IMHO it could be that, with a sufficient number of reducers available, the network IO (to transfer intermediate results) to each reducer decreases.
As network IO is usually the bottleneck in most MapReduce programs, this decrease in required network IO will give a significant improvement.

MapReduce shuffle phase bottleneck

I am reading the original MapReduce paper. My understanding is that when working with, say hundreds of GBs of data, the network bandwidth for transferring so much data can be the bottleneck of a MapReduce job. For map tasks, we can reduce network bandwidth by scheduling map tasks on workers that already contain the data for any given split, since reading from local disk does not require network bandwidth.
However, the shuffle phase seems to be a huge bottleneck. A reduce task can potentially receive intermediate key/value pairs from all map tasks, and almost all of these intermediate key/value pairs will be streamed across the network.
When working with hundreds of GBs of data or more, is it necessary to use a combiner to have an efficient MapReduce job?
The combiner plays an important role if it fits the situation: it acts like a local reducer, so instead of sending all the data it will send only a few values or a locally aggregated value. But a combiner can't be applied in all cases.
If a reduce function is both commutative and associative, then it can be used as a Combiner.
For example, it won't work in the case of a median.
A combiner can't be used in every situation.
There are other parameters which can be tuned, like the following (a configuration sketch follows below):
When the map emits output it does not go directly to disk; it goes to a 100 MB circular buffer, which spills the records to disk when it is 80% full.
So you can increase the buffer size and increase the threshold value; in that case there would be fewer spills.
If there are many spills, the spill files are merged into a single file; we can play with the merge factor.
There are several threads which copy data from the local disks to the reducer JVMs, so their number can be increased.
Compression can be used at the intermediate level and at the final level.
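A minimal sketch of how these knobs could be set, assuming the Hadoop 2 property names (the values are illustrative assumptions, not recommendations):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 200);                        // size of the circular sort buffer (MB)
conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);             // spill threshold of the sort buffer
conf.setInt("mapreduce.task.io.sort.factor", 50);                     // streams merged at once when combining spill files
conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);           // copier threads fetching map outputs
conf.setBoolean("mapreduce.map.output.compress", true);               // compress intermediate map output
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);  // compress final job output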
So the combiner is not the only solution, and it can't be used in every situation.
Let's say the mapper is emitting (word, count). If you don't use a combiner, then if a mapper sees the word abc 100 times, the reducer has to pull (abc, 1) 100 times. Let's say the size of (word, count) is 7 bytes. Without the combiner the reducer has to pull 7 * 100 bytes of data, whereas with the combiner the reducer only needs to pull 7 bytes of data. This example just illustrates how the combiner can reduce network traffic.
Note: this is a simplified example just to make the idea easier to understand.
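For the word-count case above, where the summing reduce function is both commutative and associative, the reducer class can simply be reused as the combiner. A minimal sketch inside the job driver (the class names are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration(), "word count");
job.setMapperClass(TokenizerMapper.class);   // hypothetical mapper emitting (word, 1)
job.setCombinerClass(IntSumReducer.class);   // local aggregation on the map side
job.setReducerClass(IntSumReducer.class);    // the same summing class reused as the reducer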

Tips to improve MapReduce Job performance in Hadoop

I have 100 mappers and 1 reducer running in a job. How can I improve the job performance?
As per my understanding, use of a combiner can improve the performance to a great extent. But what else do we need to configure to improve the job's performance?
With the limited data in this question (input file size, HDFS block size, average map processing time, number of mapper slots & reducer slots in the cluster, etc.), we can't suggest specific tips.
But there are some general guidelines to improve the performance.
If each task takes less than 30-40 seconds, reduce the number of tasks
If a job has more than 1TB of input, consider increasing the block size of the input dataset to 256M or even 512M so that the number of tasks will be smaller.
So long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster
Number of reduce tasks per a job should be equal to or a bit less than the number of reduce slots in the cluster.
Some more tips:
Configure the cluster properly with right diagnostic tools
Use compression when you are writing intermediate data to disk
Tune number of Map & Reduce tasks as per above tips
Incorporate Combiner wherever it is appropriate
Use the most appropriate data types for rendering output (do not use LongWritable when the range of output values fits in an int; IntWritable is the right choice in this case)
Reuse Writables
Have right profiling tools
Have a look at this cloudera article for some more tips.
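To illustrate the "Reuse Writables" tip above, here is a minimal mapper sketch (class and field names are illustrative assumptions) that reuses the same Text and IntWritable instances for every output record instead of allocating new objects:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();               // reused for every output key
    private final IntWritable one = new IntWritable(1); // reused for every output value

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);          // overwrite in place instead of allocating a new Text
                context.write(word, one);
            }
        }
    }
}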

Controlling reducer shuffle merge memory in Hadoop 2

I want to understand how memory is used in the reduce phase of a MapReduce job, so I can control the settings accordingly.
If I understand correctly, the reducer first fetches its map outputs and keeps them in memory up to a certain threshold. The settings to control this are:
mapreduce.reduce.shuffle.merge.percent: The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.
mapreduce.reduce.input.buffer.percent: The percentage of memory (relative to the maximum heap size) to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.
Next, these spilled blocks are merged. It seems the following option controls how much memory is used for the shuffle:
mapreduce.reduce.shuffle.input.buffer.percent: The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
But then, there is the setting:
mapreduce.reduce.shuffle.memory.limit.percent: Maximum percentage of the in-memory limit that a single shuffle can consume.
But it is not clear to what value this percentage applies. Is there more information available regarding these values, i.e. what they control and how they differ?
Finally, after the merge completes, the reduce function is run on the inputs. In the Hadoop book [1], I found that the final merge step directly feeds the reducers. But the default value mapreduce.reduce.input.buffer.percent=0 contradicts this, indicating that everything is spilled to disk BEFORE the reducers start. Is there any reference on which one of these explanations is correct?
[1]: Hadoop: The Definitive Guide, Fourth Edition, p. 200
Here is how mapreduce.reduce.shuffle.memory.limit.percent is used. Its percentage applies to the reducer's shuffle buffer, which by default is 0.70 of the whole reducer heap (mapreduce.reduce.shuffle.input.buffer.percent). The result is the maximum number of bytes up to which the data of a single shuffle (one map output) can be kept in memory.
// maxInMemCopyUse = mapred.job.shuffle.input.buffer.percent (default 0.70f)
maxSize = (int) (conf.getInt("mapred.job.reduce.total.mem.bytes",
        (int) Math.min(Runtime.getRuntime().maxMemory(), Integer.MAX_VALUE))
        * maxInMemCopyUse);
// MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION = mapreduce.reduce.shuffle.memory.limit.percent (default 0.25f)
maxSingleShuffleLimit = (long) (maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
This property is used in the copy phase of the reducer. If a given map output is greater than maxSingleShuffleLimit then the data is moved to disk; otherwise it is kept in memory.
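As a worked example (the heap size is an illustrative assumption; the percentages are the defaults mentioned above):

// Assume a reducer heap of 1024 MB
long reducerHeapBytes     = 1024L * 1024 * 1024;
double inputBufferPercent = 0.70;  // mapreduce.reduce.shuffle.input.buffer.percent
double memoryLimitPercent = 0.25;  // mapreduce.reduce.shuffle.memory.limit.percent

long shuffleBuffer    = (long) (reducerHeapBytes * inputBufferPercent); // ~717 MB for all in-memory map outputs
long singleShuffleMax = (long) (shuffleBuffer * memoryLimitPercent);    // ~179 MB; larger map outputs go straight to disk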
The property mapreduce.reduce.input.buffer.percent is completely different.
Once all the data has been copied and all the merges are done, just before the reducer starts it checks whether the data stored in memory exceeds this limit.
You could refer to this code (it is for the old mapred API, but it should give an insight) to see how maxSingleShuffleLimit and the other property are used.

How to number my splits and choose the right number of mappers/reducers

My map reduce job looks like the following:
I map the first 2 blocks to key 1, the next two are mapped to key 2, and so on, as shown in the picture:
Now, in theory, I want to send each of these keys to a reducer.
But my question is:
How do I choose the proper number of mappers/reducers in reality?
It looks like I need to have #mappers = #HDFS blocks,
and the number of reducers will be half of #mappers.
Is that a good approach?
What is the right choice for this case?
Partitioning your job into maps and reduces
Picking the appropriate size for the tasks for your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but increases load balancing and lowers the cost of failures. At one extreme is the 1 map/1 reduce case where nothing is distributed. The other extreme is to have 1,000,000 maps/ 1,000,000 reduces where the framework runs out of resources for the overhead.
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files, which causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
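A minimal sketch of the old-API knobs mentioned above (the values and the MyJob class are illustrative assumptions; as noted, the map count is only a hint and the InputFormat has the final say):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);                    // MyJob is a hypothetical driver class
conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);  // raise the lower bound on split size to 256 MB
conf.setNumMapTasks(500);                                   // a hint to the InputFormat, not a hard setting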
Number of Reduces
The ideal reducers should be the optimal value that gets them closest to:
* A multiple of the block size
* A task time between 5 and 15 minutes
* Creates the fewest files possible
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:
* Terrible performance on the next phase of the workflow
* Terrible performance due to the shuffle
* Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
* Destroying disk IO for no really sane reason
* Lots of network transfers due to dealing with crazy amounts of CFIF/MFIF work
Now, there are always exceptions and special cases. One particular special case is that if following that advice makes the next step in the workflow do ridiculous things, then we need to likely 'be an exception' in the above general rules of thumb.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
I found this; I think it will resolve your confusion regarding the number of reducers.
Let's say that you have 100 reduce slots available in your cluster.
With a load factor of 0.95 all the 95 reduce tasks will start at the same time, since there are enough reduce slots available for all of them. This means that no tasks will be waiting in the queue until one of the others finishes. I would recommend this option when the reduce tasks are "small", i.e., they finish relatively fast, or when they all require more or less the same time.
On the other hand, with a load factor of 1.75, 100 reduce tasks will start at the same time, as many as the reduce slots available, and the remaining 75 will be waiting in the queue until a reduce slot becomes available. This offers better load balancing, since if some tasks are "heavier" than others, i.e., require more time, then they will not be the bottleneck of the job, because the other reduce slots, instead of finishing their tasks and waiting, will be executing the tasks in the queue. This also lightens the load of each reduce task, since the data of the map output is spread over more tasks.
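A minimal sketch of turning those load factors into an actual setting (the slot count and job setup are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration(), "my job");  // hypothetical job
int reduceSlots = 100;                                     // assumed cluster-wide reduce capacity
job.setNumReduceTasks((int) (reduceSlots * 0.95));         // 95 tasks: all start in the first wave
// job.setNumReduceTasks((int) (reduceSlots * 1.75));      // 175 tasks: two waves, better load balancing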
https://github.com/paulhoule/infovore/wiki/Choosing-the-number-of-reducers
