Duplicated tasks get killed - hadoop

After I submit a job to the Hadoop cluster and the job input is split between nodes, I can see that some tasks get two attempts running in parallel.
E.g., on node 39 task attempt attempt_201305230321_0019_m_000073_0 is started, and 3 minutes later attempt_201305230321_0019_m_000073_1 is started on node 25. After another 4 minutes the first attempt (attempt_201305230321_0019_m_000073_0) gets killed (without any notice; the logs contain no information), and the second attempt completes successfully in half an hour.
What is happening? How do I prevent the creation of duplicate attempts? Is it possible that these duplicate attempts cause the mysterious kills?

Did you enable speculative execution? You can use the following code to turn it off:
job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);
job.getConfiguration().setBoolean("mapred.reduce.tasks.speculative.execution", false);
Here is the definition of speculative execution from the Hadoop documentation:
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.

Related

Hive query shows a few reducers killed but the query is still running. Will the output be proper?

I have a complex query with multiple left outer joins that has been running for the last hour in Amazon AWS EMR, but a few reducers are shown as Failed and Killed.
My question is: why do some reducers get killed? And will the final output be correct?
Usually each container has 3 attempts before the final failure (configurable, as #rbyndoor mentioned). If one attempt fails, it is restarted until the number of attempts reaches the limit; if it still fails, the whole vertex fails and all other running tasks are killed.
Rare failures of some task attempts are not a critical issue, especially when running on an EMR cluster with spot nodes, which can be removed during execution, causing failures and partial restarts of some vertices.
In most cases you can find the reason for the failures in the tracker logs.
And of course this is not a reason to switch to the deprecated MR engine. Try to find the root cause and fix it.
In some marginal cases, even if a job with some failed attempts succeeds, the data produced may be partially corrupted, for example when a non-deterministic function such as rand() is used in the distribute by clause. A restarted container may try to copy the data produced by a previous step (a mapper) whose spot node has already been removed; some previous-step containers are then restarted, but the data they produce may differ because of the non-deterministic nature of rand().
About killed tasks.
Mappers or reducers can be killed for many reasons. First of all, when one of the containers fails completely, all the other tasks still running are killed. If speculative execution is switched on, duplicated tasks are killed; tasks that do not respond for a long time are killed, and so on. This is quite normal and usually not an indicator that something is wrong. If the whole job has failed, or you have many attempt failures, you need to inspect the logs of the failed tasks, not the killed ones, to find the reason.
There can be many reasons for reducers to be killed. Some of them are:
Low staging-area memory.
Resource unavailability or deadlock.
A limit on the number of reducers that can be spawned, etc.
Generally, if a reducer gets killed it is restarted on its own and the job completes; there will be no data loss. But if the reducers keep getting killed again and again and your job is stuck because of that, then you might have to look at the YARN logs in order to get to a resolution.
Also, it seems like you are running Hive in Tez mode; try running in MR mode, which might help.
Short answer: yes. If your job completes successfully, then you will see the right result.
There can be many reasons for a runtime failure of a task, mainly related to resources: CPU, disk, or memory.
The Tez AppMaster is responsible for dealing with transient container execution failures and must respond to RM requests regarding allocated and possibly deallocated containers.
The Tez AppMaster tries to reassign the task to other containers, subject to the following constraints:
tez.maxtaskfailures.per.node (default 3): ensures the same node is not reused for reassignment.
tez.am.task.max.failed.attempts (default 4): the maximum number of attempts that can fail for a particular task before the task is failed; killed attempts do not count. A task failure results in DAG failure.
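A minimal sketch of setting these two properties programmatically, using the names quoted above (the exact keys can vary between Tez versions, so check the TezConfiguration constants shipped with your distribution; the class name is a placeholder):
import org.apache.hadoop.conf.Configuration;

public class TezRetrySettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Property names as quoted in the answer above (verify against your Tez version)
        conf.setInt("tez.maxtaskfailures.per.node", 3);     // failures tolerated per node before it is avoided
        conf.setInt("tez.am.task.max.failed.attempts", 4);  // failed attempts before the task (and DAG) fails
        // Pass this Configuration to whatever submits the Tez DAG (e.g. the Hive session).
    }
}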

Should two attempts for the same reduce task continue to run in parallel?

The actions in my Hadoop reduce task have external effects, and they are not idempotent. I have observed in the task tracker that one reduce attempt was started and then another attempt for the same set of keys was started without killing the original one. Have I configured something wrong?
It's due to speculative execution in Hadoop. This option lets Hadoop launch backup tasks when it detects slow tasks on a few of the cluster nodes. The backup tasks are preferentially scheduled on the faster nodes, and whichever of the duplicate tasks finishes first is the one whose output is used in further operations.
You can turn this off by setting the following parameter to false:
mapred.reduce.tasks.speculative.execution

Master and Slaves in Hadoop

I know that Hadoop divides the work into independent chunks. But imagine one mapper finishes handling its tasks before the other mappers; can the master program give this mapper work (i.e., some tasks) that was already assigned to another mapper? If yes, how?
Read up on speculative execution in the Yahoo Tutorial:
(The tutorial passage on speculative execution is the same one quoted from the Hadoop documentation earlier.)
The Yahoo Tutorial information, which only covers MapReduce v1, is a little out of date, though the concepts are the same. The new options for MR v2 are now:
mapreduce.map.speculative
mapreduce.reduce.speculative
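A minimal sketch (assuming Hadoop 2.x / MRv2; the class and job name are placeholders) of disabling both with the new property names:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class Mrv2SpeculationOff {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", false);     // MRv2 name for map speculation
        conf.setBoolean("mapreduce.reduce.speculative", false);  // MRv2 name for reduce speculation
        Job job = Job.getInstance(conf, "mrv2-no-speculation");  // placeholder job name
        // ... configure mapper, reducer, and paths as usual
    }
}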

Why does submitting a job to MapReduce take so much time in general?

Usually, on a 20-node cluster, submitting a job to process 3 GB of data (200 splits) takes about 30 seconds, and the actual execution takes about 1 minute.
I want to understand what the bottleneck in the job submission process is, and to understand the following quote:
Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time
Some processes I'm aware of:
1. data splitting
2. jar file sharing
A few things to understand about HDFS and M/R that help explain this latency:
HDFS stores your files as data chunks (blocks) distributed across multiple machines called datanodes.
M/R runs multiple programs called mappers, one on each of the data chunks or blocks. The (key, value) output of these mappers is combined into the final result by reducers. (Think of summing various partial results from multiple mappers.)
Each mapper and reducer is a full-fledged program that is spawned on this distributed system. It takes time to spawn full-fledged programs, even if, let us say, they do nothing (no-op map-reduce programs).
When the size of the data to be processed becomes very big, these spawn times become insignificant; that is when Hadoop shines.
If you were to process a file with 1000 lines of content, you would be better off using a normal read-and-process program. The Hadoop infrastructure for spawning processes on a distributed system yields no benefit and only adds the overhead of locating the datanodes containing the relevant data chunks, starting the processing programs on them, and tracking and collecting the results.
Now expand that to hundreds of petabytes of data, and these overheads look completely insignificant compared to the time it would take to process it all. The parallelization of the processors (mappers and reducers) shows its advantage here.
So before analyzing the performance of your M/R, you should first look to benchmark your cluster so that you understand the overheads better.
How much time does it take to do a no-operation map-reduce program on a cluster?
Use MRBench for this purpose:
MRbench loops a small job a number of times
Checks whether small job runs are responsive and running efficiently on your cluster.
Its impact on the HDFS layer is very limited
To run this program, try the following (check the correct invocation for the latest versions):
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
Surprisingly, on one of our dev clusters it was 22 seconds.
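If you would rather time the framework overhead with your own do-nothing job instead of mrbench, a rough, self-contained sketch (assuming Hadoop 2.x and the new API; the class name and input/output paths are placeholders) could look like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NoOpJobTimer {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "no-op-overhead-test");
        job.setJarByClass(NoOpJobTimer.class);
        // Identity mapper and reducer: records pass straight through, so the
        // measured wall-clock time is almost entirely framework overhead.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // tiny input file (placeholder path)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet (placeholder path)
        long start = System.currentTimeMillis();
        boolean ok = job.waitForCompletion(true);
        System.out.println("No-op job took " + (System.currentTimeMillis() - start) + " ms, success = " + ok);
        System.exit(ok ? 0 : 1);
    }
}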
Another issue is file size.
If the file sizes are smaller than the HDFS block size, then Map/Reduce programs have significant overhead. Hadoop will typically try to spawn a mapper per block, or per file when a file is smaller than a block. That means if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers even though each file is tiny. This is real wastage, because the per-program overhead is significant compared to the time spent processing each small file.
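One common way to cut the mapper count for directories full of small files is to pack many files into each split with CombineTextInputFormat. This is not from the answer above, just an illustrative sketch (assuming Hadoop 2.x; the class name, paths, and the 128 MB figure are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");
        job.setJarByClass(CombineSmallFilesJob.class);
        // Pack many small files into each split so one mapper handles several files
        // instead of one mapper per file/block.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // ~128 MB per split (illustrative)
        job.setMapperClass(Mapper.class); // identity mapper, just for the sketch
        job.setNumReduceTasks(0);         // map-only: output goes straight to HDFS
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // directory of small files (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}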
As far as I know, there is no single bottleneck that causes the job-run latency; if there were, it would have been solved a long time ago.
There are a number of steps that take time, and there are reasons why the process is slow. I will try to list them and estimate where I can:
Running the hadoop client. It is a Java program, so about 1 second of startup overhead can be assumed.
Putting the job into the queue and letting the current scheduler run it. I am not sure what the overhead is, but because of the asynchronous nature of the process some latency should exist.
Calculating splits.
Running and synchronizing tasks. Here we face the fact that TaskTrackers poll the JobTracker, not the other way around. I think this is done for the sake of scalability: when the JobTracker wants to execute some task, it does not call the TaskTracker but waits for the appropriate tracker to ping it and pick up the job. TaskTrackers cannot ping the JobTracker too frequently; otherwise they would overwhelm it in large clusters.
Running tasks. Without JVM reuse this takes about 3 seconds per task; with reuse the overhead is about 1 second per task (see the configuration sketch after this list).
The client polls the JobTracker for the results (at least I think so), which also adds some latency before the client learns that the job has finished.
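As an aside on the JVM-reuse point above, a minimal sketch of enabling it with the classic MRv1 API (the class name is a placeholder; -1 is the usual "unlimited reuse" value):
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseConfig {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Reuse one JVM for an unlimited number of this job's tasks on a TaskTracker,
        // i.e. mapred.job.reuse.jvm.num.tasks = -1
        conf.setNumTasksToExecutePerJvm(-1);
        // Use this JobConf when submitting the job, e.g. with JobClient.runJob(conf).
    }
}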
I have seen a similar issue, and I can break the solution into the following steps:
When HDFS stores too many small files with a fixed chunk size, there will be efficiency issues in HDFS; the best approach is to remove all unnecessary files and small files where possible, then try again.
Try the following with the datanodes and namenode:
Stop all the services using stop-all.sh.
Format the namenode.
Reboot the machine.
Start all services using start-all.sh.
Check the datanodes and namenode.
Try installing a lower version of Hadoop (Hadoop 2.5.2); this worked in two cases, found through trial and error.

Hadoop Fair Scheduler not assigning tasks to some nodes

I'm trying to run the Fair Scheduler, but it's not assigning Map tasks to some nodes with only one job running. My understanding is that the Fair Scheduler will use the conf slot limits unless multiple jobs exist, at which point the fairness calculations kick in. I've also tried setting all queues to FIFO in fair-scheduler.xml, but I get the same results.
I've set the scheduler in all mapred-site.xml files with the mapreduce.jobtracker.taskscheduler parameter (although I believe only the JobTracker needs it) and some nodes have no problem receiving and running Map tasks. However, other nodes either never get any Map tasks, or get one round of Map tasks (ie, all slots filled once) and then never get any again.
I tried this as a prerequisite to developing my own LoadManager, so I went ahead and put a debug LoadManager together. From log messages, I can see that the problem nodes keep requesting Map tasks, and that their slots are empty. However, they're never assigned any.
All nodes work perfectly with the default scheduler. I just started having this issue when I enabled the Fair Scheduler.
Any ideas? Does someone have this working, and has taken a step that I've missed?
EDIT: It's worth noting that the Fair Scheduler web UI page indicates the correct Fair Share count, but that the Running column is always less. I'm using the default per-user pools and only have 1 user and 1 job at a time.
The reason was the undocumented mapred.fairscheduler.locality.delay parameter. The problematic nodes were located on a different rack with HDFS disabled, making all tasks on these nodes non-rack local. Because of this, they were incurring large delays due to the Fair Scheduler's Delay Scheduling algorithm, described here.
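For reference, a sketch of what tuning that parameter might look like in mapred-site.xml on the JobTracker (not from the original answer; the value 0, which effectively skips delay scheduling, is just an example):
<property>
  <name>mapred.fairscheduler.locality.delay</name>
  <value>0</value> <!-- how long to wait for a local slot; 0 effectively disables delay scheduling (illustrative) -->
</property>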
