Hadoop - Failure Recovery For Reduce Only

I have a Hadoop job whose map part finished after running for 4 days, and now it seems to be stuck in the reduce stage (the reducer is 30% done).
I would really like a way to re-run only the reduce part, if at all possible, without having to re-process the long-running map part. Any suggestions?
What probably makes it worse is that I only have one reducer.

Hadoop will only restart the reduce step in your case.
However, if the job as a whole fails, you can't just skip the map step.
In that case you should probably split the two stages into separate jobs, especially if your mapper is computationally intensive.
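If you do end up splitting it, the driver could look roughly like the sketch below. This is only an illustration: TokenCounterMapper and IntSumReducer stand in for your own (expensive) mapper and your reducer, and the SequenceFile hand-off directory is an assumption, not a detail from the question.

// Sketch: run the expensive map phase as a map-only job whose output is
// persisted on HDFS, then run a second job that only has to execute the
// reduce phase. If the second job fails, only the reduce work is redone.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class TwoStageDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path mapOutput = new Path(args[1] + "_map"); // durable hand-off directory

        // Job 1: the expensive map phase only, no reducers.
        Job mapJob = new Job(conf, "map-only");
        mapJob.setJarByClass(TwoStageDriver.class);
        mapJob.setMapperClass(TokenCounterMapper.class);   // your real mapper here
        mapJob.setNumReduceTasks(0);
        mapJob.setOutputKeyClass(Text.class);
        mapJob.setOutputValueClass(IntWritable.class);
        mapJob.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(mapJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(mapJob, mapOutput);
        if (!mapJob.waitForCompletion(true)) System.exit(1);

        // Job 2: identity mapper plus the real reducer, reading the saved map output.
        Job reduceJob = new Job(conf, "reduce-only");
        reduceJob.setJarByClass(TwoStageDriver.class);
        reduceJob.setMapperClass(Mapper.class);            // identity mapper
        reduceJob.setReducerClass(IntSumReducer.class);    // your real reducer here
        reduceJob.setOutputKeyClass(Text.class);
        reduceJob.setOutputValueClass(IntWritable.class);
        reduceJob.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(reduceJob, mapOutput);
        FileOutputFormat.setOutputPath(reduceJob, new Path(args[1]));
        System.exit(reduceJob.waitForCompletion(true) ? 0 : 1);
    }
}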

Related

Why should the running time of a mapper be more than 1 minute?

I have read many blogs/web pages that state:
the running time of a mapper should be more than X minutes
I understand there are overheads involved in setting up a mapper, but how exactly is this calculated? Why is the overhead only justified after X minutes? And when we talk about overheads, what exactly are the Hadoop overheads?
It's not a hard-and-fast rule, but it makes sense. Quite a few small steps have to be handled before a mapper even starts; its initialization and other bookkeeping, apart from the real processing, can themselves take 10-15 seconds. So, to reduce the number of splits (which in turn reduces the mapper count), the maximum split size can be set to a higher value; that is what the blog is getting at (see the sketch after this list). If we fail to do that, the MR framework has to handle the following overheads while creating each mapper.
Calculating the splits for that mapper.
The job scheduler in the JobTracker has to create a separate map task, which adds a bit of latency.
When it comes to assignment, the JobTracker has to look for a TaskTracker based on data locality. This again involves creating local temp directories on the TaskTracker, which are used by the setup and cleanup tasks for that mapper; for example, the setup might read from the distributed cache and load it into a HashMap, or initialize and later clean up something in the mapper. And if there are already enough map and reduce tasks running on that TaskTracker, this puts extra load on it.
In the worst case the fixed number of map slots is full, and the JobTracker has to look for a different TaskTracker, which leads to a remote read.
Also, a TaskTracker only sends a heartbeat to the JobTracker once every 3 seconds, which delays job initialization, because the TaskTracker has to contact the JobTracker both to be handed a task and to report its completed status.
Unfortunately, if your mapper fails, that task will be retried (3 more attempts by default) before it finally fails.
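A minimal sketch of the split-size tuning mentioned above, with illustrative values and class name only (for a few large files the knob that actually matters is the minimum split size relative to the HDFS block size; for many small files a combining input format, bounded by the maximum split size, is the usual fix):

import org.apache.hadoop.mapred.JobConf;

public class SplitTuning {
    public static void apply(JobConf conf) {
        // Raise the minimum split size above the HDFS block size (example: 256 MB)
        // so large files are carved into fewer, bigger splits and fewer map tasks
        // are created. Hadoop 2.x also accepts the newer key
        // mapreduce.input.fileinputformat.split.minsize.
        conf.set("mapred.min.split.size", String.valueOf(256L * 1024 * 1024));
        // For lots of small files, use CombineFileInputFormat (its splits are
        // capped by mapred.max.split.size) to pack several files into one split.
    }
}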

Why does Hadoop shuffling take longer than expected?

I am trying to figure out how much time each step takes in the simple Hadoop word count example.
In this example 3 maps and 1 reducer are used, and each map generates ~7 MB of shuffle data. I have a cluster which is connected via 1 Gb switches. When I look at the job details, I realized that shuffling takes ~7 seconds after all map tasks are completed, which is more than I would expect for transferring such a small amount of data. What could be the reason behind this? Thanks
Hadoop uses heartbeats to communicate with nodes. By default Hadoop uses a minimal heartbeat interval of 3 seconds. Consequently Hadoop completes your task within two heartbeats (roughly 6 seconds).
More details: https://issues.apache.org/jira/browse/MAPREDUCE-1906
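If that heartbeat latency is what hurts you, a knob commonly suggested for Hadoop 1.x is out-of-band heartbeats, which let a TaskTracker report task completion immediately instead of waiting for the next regular heartbeat. A minimal sketch, assuming your release ships this property:

import org.apache.hadoop.mapred.JobConf;

public class HeartbeatTuning {
    public static void apply(JobConf conf) {
        // Send an out-of-band heartbeat on task completion for lower latency.
        // Verify the key exists in your release's mapred-default.xml first.
        conf.setBoolean("mapreduce.tasktracker.outofband.heartbeat", true);
    }
}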
The transfer is not the only thing that has to complete after the map step. Each mapper writes its output locally, partitioned and sorted. The reducer that is tasked with a particular partition then gathers the pieces of that partition from each mapper's output, each requiring a transfer of ~7 MB. The reducer then has to merge these segments into a final sorted file.
Honestly though, the scale you are testing on is absolutely tiny. I don't know all the parts of the Hadoop shuffle step, which I understand has some involved details, but you shouldn't expect performance of such small files to be indicative of actual performance on larger files.
I think the shuffling started after the first mapper finished, but then it had to wait for the next two mappers.
There is an option to start the reduce phase (which begins with shuffling) only after all the mappers have finished, but that doesn't really speed anything up.
(BTW, 7 seconds is considered fast in Hadoop. Hadoop performs poorly, especially on small files; unless somebody else is paying for it, don't use Hadoop.)

Why do map and reduce run at the same time?

I am a newbie to Hadoop. I remember learning somewhere that in Hadoop, all map functions have to be completed before any reduce function can start.
But I get output like this when I run a MapReduce program:
map(15%), reduce(5%)
map(20%), reduce(7%)
map(30%), reduce(10%)
map(38%), reduce(17%)
map(40%), reduce(25%)
Why do they run in parallel?
Before the actual Reduce phase starts, Shuffle, Sort and Merge take place as the mappers keep completing; that is what this percentage signifies. It is not the actual Reduce phase. This happens in parallel to reduce the overhead that would otherwise be incurred if the framework waited for all of the mappers to complete first and only then did the Shuffling, Sorting and Merging.

In which part/class of MapReduce is the logic that holds back the reduce tasks implemented?

In Hadoop MapReduce no reducer starts before all mappers have finished. Can someone please explain to me in which part/class/code line this logic is implemented? I am talking about Hadoop MapReduce version 1 (NOT YARN). I have searched through the MapReduce framework, but there are so many classes and I don't understand much of the method calls and their ordering.
In other words, I need (first for test purposes) to let the reducers start reducing even while there are still working mappers. I know that this way I am getting false results for the job, but for now this is the start of some work on changing parts of the framework. So where should I start to look and make changes?
This is done in the shuffle phase. For Hadoop 1.x, take a look at org.apache.hadoop.mapred.ReduceTask.ReduceCopier, which implements ShuffleConsumerPlugin. You may also want to read the "Breaking the MapReduce Stage Barrier" research paper by Verma et al.
EDIT:
After reading Chris White's answer, I realized that my answer needed an extra explanation. In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted; plus, you may have some speculative mappers running and you do not yet know which of the duplicate mappers will finish first. However, as the "Breaking the MapReduce Stage Barrier" paper indicates, for some applications it may make sense not to wait for all of the mapper output. If you want to implement this sort of behavior (most likely for research purposes), then you should take a look at the classes I mentioned above.
Some points for clarification:
A reducer cannot start reducing until all mappers have finished, their partitions have been copied to the node where the reducer task is running, and finally sorted.
What you may see is a reducer starting to copy map outputs while other map tasks are still running. This is controlled via a configuration property known as slowstart (mapred.reduce.slowstart.completed.maps). This value represents the ratio (0.0 - 1.0) of map tasks that need to have completed before the reduce tasks start up (and begin copying over the map outputs from those map tasks that have already completed). The default value is 0.05, meaning that if you have 100 map tasks for your job, 5 of them would need to finish before the job tracker starts to launch the reduce tasks.
This is all controlled by the job tracker, in the JobInProgress class, lines 775, 1610, 1664.
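To see or change that behaviour on your own job, a minimal JobConf-style sketch (0.05 mirrors the stock default; push the value towards 1.0, as the next question does, to make the reducers wait for a larger fraction of the maps):

import org.apache.hadoop.mapred.JobConf;

public class SlowstartTuning {
    public static void apply(JobConf conf) {
        // Launch reduce tasks once 5% of the map tasks have completed (stock default).
        // Use "1.0" to hold the reducers back until every map task has finished.
        conf.set("mapred.reduce.slowstart.completed.maps", "0.05");
    }
}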

MapReduce: require all mappers to finish before the combine stage

I recently had to run a job that required all the mappers to finish before passing the results to the combine stage (due to the way the processed files were structured). This feature is available to the reducer by configuring the following -
// force 100% of the mappers to conclude before reducers start
job.set("mapred.reduce.slowstart.completed.maps", "1.0");
I couldn't find any similar configuration for the combine stage. Eventually I split my job into 2 parts, with the combine stage acting as the reducer of job #1, and my original reducer moved to job #2 (mapper #2 simply passes the data through without modifying it).
I was wondering - is there a way I missed to force 100% map completion before the combiner runs? Thanks.
There is no way to control this - the combiner may or may not run for any given map instance; in fact, the combiner may run multiple times, over the various spills of your map data.
There's a more detailed explanation in Tom White's book, "Hadoop: The Definitive Guide":
http://books.google.com/books?id=Nff49D7vnJcC&pg=PA178&lpg=PA178&dq=hadoop+combiner+spill&source=bl&ots=IiesWqctTu&sig=V5b3Z2EVWp5JzIvc_Fzv1-AJerI&hl=en&sa=X&ei=QUJwT9XBCOna0QGOzpnlBg&ved=0CFMQ6AEwAw#v=onepage&q=hadoop%20combiner%20spill&f=false
So your combiner may even run before your map task finishes.
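For reference, this is how a combiner is wired in with the new mapreduce API (IntSumReducer here stands in for whatever reducer-style class you use as a combiner), with the caveat from the answer above spelled out in the comments:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerSetup {
    public static void configure(Job job) {
        // The framework treats the combiner purely as an optimization: it may run
        // zero, one, or several times per map task (once per spill), so there is
        // no configuration that delays it until all mappers have finished.
        job.setCombinerClass(IntSumReducer.class);
        // If you truly need "all maps done before combining", split the work into
        // two jobs as described in the question above.
    }
}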
