I am not sure if this is something that has been fixed for newer releases of Hadoop, but I'm currently locked into running Hadoop 0.20 (legacy code).
Here's the issue: when I launch a Hadoop job, a "Job setup" task needs to run first. It seems to me that Hadoop randomly picks this task to be either a map task or a reduce task.
We have more capacity configured for map tasks than for reduce tasks, so whenever I get unlucky and the setup runs as a reduce task, it takes forever for my job to even start running. Any ideas how to overcome this?
A Hadoop job first completes all of your map tasks. Only once every map task is finished does the data go across the network for the shuffle and sort, and only then do your reduce tasks start processing. So I suspect there is some other reason for this delay.
I found that job.getReduceProgress() is not updated, but inside the Reducer, context.getProgress() reports a specific progress value for the task. Why is the progress on the job not updated in time?
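For reference, here is a minimal client-side polling sketch using the org.apache.hadoop.mapreduce.Job API; the class name, method name, and the 5-second interval are illustrative, and the Job is assumed to have been submitted already:

import org.apache.hadoop.mapreduce.Job;

public class ProgressPoller {
    // Poll a submitted Job and print map/reduce progress until it finishes.
    public static void poll(Job job) throws Exception {
        while (!job.isComplete()) {
            // mapProgress()/reduceProgress() return the fraction complete
            // (0.0 to 1.0) as reported by the cluster for the whole job.
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
    }
}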
I am running a Pig job that loads around 8 million rows from HBase (several columns) using HBaseStorage. The job finishes successfully and seems to produce the right results, but when I look at the job details in the job tracker it says 50 map tasks were created, of which 28 were successful and 22 were killed. The reduce ran fine. Looking at the logs of the killed map tasks, there is nothing obvious to me as to why they were killed. In fact the logs of successful and killed tasks are practically identical, and both take a reasonable amount of time. Why are all these map tasks created and then killed? Is it normal, or is it a sign of a problem?
This sounds like Speculative Execution in Hadoop. It runs the same task on several nodes and kills the duplicates when at least one attempt completes. See the explanation in this book: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/task-execution
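If the extra attempts are causing trouble, speculative execution can be turned off. A minimal sketch of doing this from a Java driver, assuming the old-style 0.20/1.x property names (the class and job name are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationJob {
    public static Job create() throws Exception {
        Configuration conf = new Configuration();
        // Disable speculative execution so duplicate attempts are not
        // launched (and later killed) for slow-looking tasks.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return new Job(conf, "no-speculation");
    }
}

Since this is a Pig job, the same properties can usually be set at the top of the script with Pig's set command (e.g. set mapred.map.tasks.speculative.execution false;), assuming your Pig version supports it.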
I am running a few map reduce jobs, and one job takes over all the reducer capacity. Is there a way to kill one or two reduce tasks to free up the cluster?
I can go directly to one of the task tracker servers and kill the Java process manually, but I am wondering if there is a more decent way to do this?
You can kill a task attempt with:
hadoop job -kill-task [task_attempt_id]
To get the task attempt ID, you need to go one level deeper into the task (by clicking on the task hyperlink in the job tracker).
First find the job ID:
hadoop job -list
Now, kill the job:
hadoop job -kill <job_ID_goes_here>
hadoop job -kill-task [attempt-id], where the attempt ID can be obtained from the UI.
I ran an MR job. The map phase ran successfully, but the reduce phase stalled at 33% and hung there (for about an hour) with status "reduce > sort".
How can I debug this?
It may have nothing to do with your case, but I had this happen when IPTABLES (~firewall) was misconfigured on one node. When that node was assigned a reducer role, the reduce phase would hang at 33%. Check the error logs to make sure the connections are working, especially if you have recently added new nodes and/or configured them manually.
This is how Hadoop currently works: If a reducer fails (throws a NullPointerException for example), Hadoop will reschedule another reducer to do the task of the reducer that failed.
Is it possible to configure Hadoop not to reschedule failed reducers, i.e. if any reducer fails, Hadoop merely reports the failure and does nothing else?
Of course, the reducers that did not fail will continue to completion.
You can set the mapred.reduce.max.attempts property using the Configuration class, or in job.xml.
Setting it to 0 should solve your problem.
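For example, a minimal sketch of setting it from a Java driver (the class and job name are illustrative; note that the property counts total attempts per reduce task, with a default of 4, so a value of 1 already means a single attempt and no retries):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SingleAttemptReduceJob {
    public static Job create() throws Exception {
        Configuration conf = new Configuration();
        // mapred.reduce.max.attempts is the total number of attempts per
        // reduce task (default 4); 1 means a failed reducer is never retried.
        conf.setInt("mapred.reduce.max.attempts", 1);
        return new Job(conf, "single-attempt-reduce");
    }
}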
If you configure the framework not to reschedule failed tasks, then as soon as the first one fails the jobtracker will fail the job and kill the currently running tasks. So what you want to do is pretty much impossible.