When do reduce tasks start in Hadoop? - hadoop

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?

The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are still generating data, since it is only a data transfer. Sort and reduce, on the other hand, can only start once all the mappers are done. You can tell which one MapReduce is doing by looking at the reducer completion percentage: 0-33% means it's doing the shuffle, 34-66% is sort, 67-100% is reduce. This is why your reducers will sometimes seem "stuck" at 33%: they are waiting for the mappers to finish.
Reducers start shuffling once a threshold percentage of mappers have finished. You can change that parameter to make the reducers start sooner or later.
Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good thing if your network is the bottleneck.
Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data and waiting for the mappers to finish. Another job that starts later, and that would actually use the reduce slots, can't use them in the meantime.
You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In newer versions of Hadoop (at least 2.4.1) the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks user yegor256).
Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, doing 0.1 would probably be appropriate.
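For concreteness, here is a minimal driver sketch (the class name is only illustrative) of setting the property per job from Java; the 0.90 value follows the multi-job recommendation above, and older clusters would use mapred.reduce.slowstart.completed.maps instead of the newer name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowStartConfigExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Newer property name (Hadoop 2.x+); older clusters use
            // mapred.reduce.slowstart.completed.maps instead.
            conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.90f);
            Job job = Job.getInstance(conf, "slowstart-demo");
            // ... setMapperClass / setReducerClass / input and output paths go here ...
            // System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same value can be set cluster-wide in mapred-site.xml as noted above; setting it in the job configuration only affects that one job.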

The reduce phase can start long before a reducer is called. As soon as a mapper finishes, its generated data undergoes some sorting and shuffling (which includes calls to the combiner and partitioner). The reducer "phase" kicks in the moment this post-mapper data processing starts. As that processing proceeds, you will see the reducer percentage make progress, even though none of the reducers have actually been called yet. Depending on the number of processors available/used, the nature of the data and the number of expected reducers, you may want to change the parameter as described by #Donald-miner above.

As far as I understand, the reduce phase starts with the map phase and keeps consuming records from the maps. However, since there is a sort and shuffle phase after the map phase, all the outputs have to be sorted and sent to the reducer. So logically you can say that the reduce phase starts only after the map phase, but for performance reasons the reducers are also initialized along with the mappers.

The percentage shown for the reduce phase is actually the amount of data copied from the map outputs to the reducers' input directories.
When does this copying start? It is a configuration you can set, as Donald showed above. Once all the data has been copied to the reducers (i.e. 100% reduce), that is when the reducers start working, and so they might seem to freeze at "100% reduce" if your reducer code is I/O or CPU intensive.

Reduce starts only after all the mappers have finished their tasks. A reducer has to communicate with all the mappers, so it has to wait until the last mapper has finished its task. However, a mapper starts transferring data the moment it has completed its task.

Consider a WordCount example in order to understand better how a MapReduce task works. Suppose we have a large file, say a novel, and our task is to find the number of times each word occurs in the file. Since the file is large, it might be divided into different blocks and replicated on different worker nodes. The word count job is composed of map and reduce tasks. The map task takes each block as input and produces intermediate key-value pairs. In this example, since we are counting the number of occurrences of words, the mapper, while processing a block, would produce intermediate results of the form (word1,count1), (word2,count2), etc. The intermediate results of all the mappers are passed through a shuffle phase which reorders the intermediate results.
Assume that our map output from different mappers is of the following form:
Map 1:-
(is,24)
(was,32)
(and,12)
Map 2:-
(my,12)
(is,23)
(was,30)
The map outputs are sorted in such a manner that the same key values are given to the same reducer. Here that means the keys corresponding to is, was, etc. go to the same reducer. It is the reducer which produces the final output, which in this case would be:-
(and,12)(is,47)(my,12)(was,62)
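For concreteness, here is a compact sketch of the mapper and reducer just described, using the standard Hadoop new-API base classes (the class names are only illustrative). Note that the classic mapper emits (word, 1) for every occurrence; pre-aggregated per-block counts like (is,24) above would come from a combiner or in-mapper aggregation.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Emits (word, 1) for every word in its input split.
    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // After shuffle and sort have grouped the pairs by key, sums the counts per word.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }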

Reducer tasks start only after the completion of all the mappers.
But the data transfer happens after each map.
Actually it is a pull operation.
That means each reducer keeps asking every map task whether it has some data to retrieve. If it finds that a mapper has completed its task, the reducer pulls that intermediate data.
The intermediate data from a mapper is stored on disk.
The data transfer from mapper to reducer happens over the network (data locality is not preserved in the reduce phase).

When the mappers finish their tasks, the reducers start their job of reducing the data; that is the MapReduce job.

Related

Reducer doesn't start still progress on MapReduce Job

If reducers do not start before all mappers finish, then why does the progress on a MapReduce job show something like Map(50%) Reduce(10%)? Why is reducer progress displayed when the mappers are not finished yet?
It is because of the mapreduce.job.reduce.slowstart.completedmaps property, whose default value is 0.05.
That means the reduce phase is started as soon as at least 5% of the total mappers have completed execution.
The dispatched reducers then stay in the copy phase until all mappers are completed.
If you wish to start the reducers only after all mappers have completed, configure a value of 1.0 for this property in the job configuration, for example as in the driver sketch below.
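As one way to do that without recompiling for every change, a hypothetical driver can implement Tool so the property can be passed per run with -D (a minimal sketch; the class name and jar name are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Driver that honours -D overrides, so the slow-start threshold can be chosen per run.
    public class SlowStartDriver extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf(), "slow-start driver");
            // ... mapper, reducer, input and output path setup omitted in this sketch ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new SlowStartDriver(), args));
        }
    }

It could then be launched with something like: hadoop jar myjob.jar SlowStartDriver -D mapreduce.job.reduce.slowstart.completedmaps=1.0 <input> <output>, so the reducers are not dispatched until every mapper has finished.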
Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account this data transfer, which is done by the reduce process, so reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Although the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.

Which method stops reducers from starting the actual reduce phase in hadoop yarn?

I am new to Hadoop YARN and want the reducers to start the actual reducing process before the completion of all the maps. I tried to find the class where the reducers are invoked, but could not find it. Can anyone help me in this regard?
Before all the mappers are completed, the reducers can only start collecting the output of the mappers. This is called the shuffle phase.
However, they cannot start the sort and reduce phases, since they need to have ALL the map output records before starting to work on them. The reason is simple:
Imagine the wordcount example and that you want to count the frequency of a word. In the reduce phase, if you emit a value (the frequency) for a key (the word) before getting the output of all the mappers (i.e., some counts are still missing for this word), then you may report the wrong frequency for that word.
You can change the time when the reducers start collecting (not reducing) the mappers' outputs by setting the mapreduce.job.reduce.slowstart.completedmaps property to 1, meaning that the reducers will only start when ALL the mappers are complete: conf.set("mapreduce.job.reduce.slowstart.completedmaps", "1.00");. In the old API this property used to be (based on this link):
mapred.reduce.slowstart.completed.maps

Meaning of map time or reduce time in JobHistoryServer

I want to know the exact meaning of the notations in the picture below. The picture comes from the job history server web UI. I definitely know the meaning of Elapsed, but I am not sure about the other values. Where can I find a clear definition of them? Or is there anyone who knows what they mean?
What I want to know is the map time, reduce time, shuffle time and merge time separately. The sum of the four times should be very similar (or equal) to the elapsed time, but the 'Average' keyword confuses me.
There are 396 maps and 1 reduce.
As you probably already know, there are three phases to a MapReduce job:
Map is the 1st phase, where each Map task is provided with an input split, which is a small portion of the total input data. The Map tasks process data from the input split & output intermediate data which needs to go to the reducers.
Shuffle phase is the next step, where the intermediate data that was generated by Map tasks is directed to the correct reducers. Reducers usually handle a subset of the total number of keys generated by the Map task. The Shuffle phase assigns keys to reducers & sends all values pertaining to a key to the assigned reducer. Sorting (or Merging) is also a part of this phase, where values of a given key are sorted and sent to the reducer. As you may realize, the shuffle phase involves transfer of data across the network from Map -> Reduce tasks.
Reduce is the last step of the MapReduce Job. The Reduce tasks process all values pertaining to a key & output their results to the desired location (HDFS/Hive/Hbase).
Now coming to the average times, you said there were 396 map tasks. Each Map task is essentially doing exactly the same processing job, but on different chunks of data. So the Average Map time is basically the average of time taken by all 396 map tasks to complete.
Average Map Time = Total time taken by all Map tasks/ Number of Map Tasks
Similarly,
Average Reduce Time = Total time taken by all Reduce tasks/Number of Reduce tasks
Now, why is the average time significant? It is because most, if not all, of your map tasks and reduce tasks would be running in parallel (depending on your cluster capacity, number of slots per node, etc.). So calculating the average time of all map tasks and reduce tasks gives you good insight into the completion time of the Map or Reduce phase as a whole.
Another observation from your screenshot is that your Shuffle phase took 40 minutes. There can be several reasons for this.
You have 396 map tasks, each generating intermediate data. The shuffle phase had to pass all this data across the network to just 1 reducer, causing a lot of network traffic & hence increasing transfer time. Maybe you can optimize performance by increasing the number of reducers.
The network itself has very low bandwidth, and cannot efficiently handle large amounts of data transfer. In this case, consider deploying a combiner, which will effectively reduce the amount of data flowing through your network between the map and reduce phases.
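As a sketch of that combiner suggestion (reusing the illustrative TokenMapper and SumReducer classes from the WordCount sketch earlier on this page; a combiner is only safe when the reduce function is associative and commutative, as a sum is):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CombinerSetup {
        public static Job newJob(Configuration conf) throws IOException {
            Job job = Job.getInstance(conf, "wordcount with combiner");
            job.setMapperClass(TokenMapper.class);   // illustrative classes from the earlier sketch
            job.setCombinerClass(SumReducer.class);  // pre-aggregates on the map side, shrinking shuffle traffic
            job.setReducerClass(SumReducer.class);
            job.setNumReduceTasks(4);                // more reducers also spreads the shuffle load
            // Output key/value classes and input/output paths are omitted in this sketch.
            return job;
        }
    }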
There are also some hidden costs of execution such as job setup time, time required by job tracker to contact task trackers & assign map/reduce tasks, time taken by slave nodes to send heartbeat signals to JobTracker, time taken by NameNode to assign storage block & create Input splits, etc. which all go into the total elapsed time.
Hope this helps.

How can map and reduce run in parallel

I am a beginner with Hadoop, and when I was running a Hadoop job I noticed the progress log showing map 80% reduce 25%. My understanding of MapReduce is that mappers produce a bunch of intermediate values. After the mappers produce their output there is a shuffle/sort of the intermediate pairs, and these values are sent to the reduce job. Can someone please explain to me how map/reduce can work in parallel?
The outputs from the mappers have to be copied to the appropriate reducer nodes. This is called the shuffle process. That can start even before all the mappers have finished, since the decision of which key goes to which reducer is dependent only on the output key from the mapper. So the 25% progress you see is due to the shuffle phase.
After shuffle, there is a sort phase and then the reduce phase. Sort and reduce cannot happen unless all mappers have completed. Since shuffle can happen before the mappers finish, you can see a maximum of 33.33% reduce completion before the mappers have finished. This is because the default apache implementation considers shuffle, sort and reduce each to take an equal 33.33% of the time.
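A toy illustration (not Hadoop source code) of that equal weighting: with the copy phase about three quarters done and the sort and reduce phases not yet started, the reported figure comes out to roughly the 25% seen in the question's log.

    // Toy model of the reported reduce progress: the shuffle, sort and reduce
    // phases each contribute an equal one third of the displayed percentage.
    public class ReduceProgress {
        static double reported(double shuffle, double sort, double reduce) {
            return (shuffle + sort + reduce) / 3.0;
        }

        public static void main(String[] args) {
            // Shuffle 75% done, sort and reduce not started: about 25% is shown.
            System.out.printf("reduce %.0f%%%n", 100 * reported(0.75, 0.0, 0.0));
        }
    }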

how to start sort and reduce in hadoop before shuffle completes for all mappers?

I understand from "When do reduce tasks start in Hadoop" that the reduce task in Hadoop contains three steps: shuffle, sort and reduce, where the sort (and after that the reduce) can only start once all the mappers are done. Is there a way to start the sort and reduce every time a mapper finishes?
For example, let's say we have only one job with mappers mapperA and mapperB and 2 reducers. What I want to do is:
mapperA finishes
shuffle copies the appropriate partitions of mapperA's output to, say, reducers 1 and 2
reducers 1 and 2 start sorting and reducing and generate some intermediate output
now mapperB finishes
shuffle copies the appropriate partitions of mapperB's output to reducers 1 and 2
sort and reduce on reducers 1 and 2 start again and the reducers merge the new output with the old one
Is this possible? Thanks
You can't with the current implementation. However, people have "hacked" the Hadoop code to do what you want to do.
In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted; plus, you may have some speculative mappers running and you do not know yet which of the duplicate mappers will finish first.
However, as the "Breaking the MapReduce Stage Barrier" paper indicates, for some applications, it may make sense not to wait for all of the output of the mappers. If you would want to implement this sort of behavior (most likely for research purposes), then you should take a look at theorg.apache.hadoop.mapred.ReduceTask.ReduceCopier class, which implements ShuffleConsumerPlugin.
EDIT: Finally, as #teo points out in this related SO question, the ReduceCopier.fetchOutputs() method is the one that holds the reduce task from running until all map outputs are copied (through the while loop in line 2026 of Hadoop release 1.0.4).
You can configure this using the slowstart property, which denotes the percentage of your mappers that need to be finished before the copy to the reducers starts. Its default in stock Apache Hadoop is 0.05 (5%), though some distributions ship a higher default; you can override it, all the way down to 0, if you want:
`mapreduce.job.reduce.slowstart.completedmaps` (in the old API, mapred.reduce.slowstart.completed.maps)
Starting the sort process before all mappers finish is sort of a Hadoop anti-pattern (if I may put it that way!), in that the reducers cannot know that there is no more data to receive until all mappers finish. You, the invoker, may know that, based on your definition of keys, partitioner, etc., but the reducers don't.
