How is container failure handled for a YARN MapReduce job? - hadoop

How are software/hardware failures handled in YARN? Specifically, what happens in the case of a container failure or crash?

Container and task failures are detected by the node manager. When a container fails or dies, the node manager reports the failure, the application master is notified, and the task is restarted in a newly allocated container.
In the event of an application master failure, the resource manager detects the failure and starts a new instance of the application master in a new container.
Find the details here

The application master will re-attempt a task that fails with an exception or stops responding (4 times by default, governed by mapreduce.map.maxattempts and mapreduce.reduce.maxattempts).
A job with too many failed tasks is considered a failed job.
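As a minimal sketch of how those limits can be tuned per job through the standard Hadoop MapReduce client API (the property names are the stock mapreduce.* ones; the values shown are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Maximum attempts per map/reduce task before the task is marked as failed
        // (4 is already the default).
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
        // Percentage of map/reduce tasks allowed to fail without failing the whole job
        // (0 by default, i.e. any permanently failed task fails the job).
        conf.setInt("mapreduce.map.failures.maxpercent", 0);
        conf.setInt("mapreduce.reduce.failures.maxpercent", 0);

        Job job = Job.getInstance(conf, "retry-config-example");
        // ... set mapper, reducer, input/output paths and submit as usual ...
    }
}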

Related

Spring batch master is waiting but worker startup failed in remote partitioning

I am stuck in a scenario with a Spring Batch remote partitioning job where the master started successfully but the worker failed to start. The job is deployed on AWS Batch, so the master waits indefinitely for workers to finish, since the worker cannot come up.
Can anyone suggest a way to handle such a scenario? I don't want my master node to wait until a timeout has occurred.
The manager is configurable with a timeout to fail if workers do not reply in time. So it won't wait indefinitely.
And if that happens, the job instance will fail and you can either:
restart it (only failed partitions will be restarted)
or abandon it and start a new instance.
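As a rough sketch of where that timeout lives, assuming the manager side uses Spring Batch Integration's MessageChannelPartitionHandler and that your version exposes setTimeout/setPollInterval/setJobExplorer; the step name, grid size and durations below are purely illustrative:

import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.core.MessagingTemplate;

@Configuration
public class ManagerTimeoutConfig {

    @Bean
    public MessageChannelPartitionHandler partitionHandler(MessagingTemplate messagingTemplate,
                                                           JobExplorer jobExplorer) {
        MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
        handler.setMessagingOperations(messagingTemplate); // sends partition requests to workers
        handler.setStepName("workerStep");                 // illustrative worker step name
        handler.setGridSize(4);                            // illustrative number of partitions
        handler.setJobExplorer(jobExplorer);               // poll the job repository for worker results
        handler.setPollInterval(10_000L);                  // poll every 10 seconds
        handler.setTimeout(600_000L);                      // fail the manager step if workers don't finish within 10 minutes
        return handler;
    }
}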

Spring Batch correctly restart uncompleted jobs in clustered environment

I used the following logic to restart uncompleted jobs in a single-node Spring Batch application:
public void restartUncompletedJobs() {
    try {
        jobRegistry.register(new ReferenceJobFactory(documetPipelineJob));
        List<String> jobs = jobExplorer.getJobNames();
        for (String job : jobs) {
            Set<JobExecution> runningJobs = jobExplorer.findRunningJobExecutions(job);
            for (JobExecution runningJob : runningJobs) {
                runningJob.setStatus(BatchStatus.FAILED);
                runningJob.setEndTime(new Date());
                jobRepository.update(runningJob);
                jobOperator.restart(runningJob.getId());
            }
        }
    } catch (Exception e) {
        LOGGER.error(e.getMessage(), e);
    }
}
Right now I'm trying to make it work on a two-node cluster. The application on each node points to a shared PostgreSQL database.
Let's consider the following example: I have 2 job instances - jobInstance1 is running right now on node1 and jobInstance2 is running on node2. Node1 is restarted for some reason during the jobInstance1 execution. After the node1 restart, the Spring Batch application tries to restart the uncompleted jobs with the logic presented above - it sees that there are 2 uncompleted job instances - jobInstance1 and jobInstance2 (which is correctly running on node2) - and tries to restart both of them. So instead of restarting only jobInstance1, it will restart both jobInstance1 and jobInstance2, but jobInstance2 should not be restarted because it is executing correctly on node2 right now.
How can I correctly restart, during application startup, the jobs that were not completed before the previous application termination, and prevent jobs like jobInstance2 from being restarted as well?
UPDATED
This is the solution provided in the answer below:
Get the job instances of your job with JobOperator#getJobInstances
For each instance, check if there is a running execution using JobOperator#getExecutions.
2.1 If there is a running execution, move to next instance (in order to let the execution finish either successfully or with a failure)
2.2 If there is no currently running execution, check the status of the last execution and restart it if failed using JobOperator#restart.
I have a question regarding #2.1 - will Spring Batch automatically restart uncompleted jobs with a running execution after application restart or do I need to do manual actions to do so?
Your logic is not restarting uncompleted jobs. Your logic takes currently running job executions, sets their status to FAILED and restarts them. Your logic should not find running executions; it should look for executions that are not currently running, especially failed ones, and restart them.
How to correctly restart the failed jobs and prevent the situation when the jobs like jobInstance2 will be also restarted?
In pseudo-code, what you need to do to achieve this is (a Java sketch follows after the scenario below):
Get the job instances of your job with JobOperator#getJobInstances
For each instance, check if there is a running execution using JobOperator#getExecutions.
2.1 If there is a running execution, move to next instance (in order to let the execution finish either successfully or with a failure)
2.2 If there is no currently running execution, check the status of the last execution and restart it if failed using JobOperator#restart.
In your scenario:
jobInstance1 should be restarted in step 2.2
jobInstance2 should be filtered in step 2.1 since there is a running execution for it on node 2.
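Here is a minimal Java sketch of that pseudo-code, reusing the jobExplorer, jobOperator and LOGGER fields from the question's code; it uses JobExplorer for the status checks (since JobOperator only returns execution IDs), and the job-name parameter and instance page size of 100 are illustrative:

public void restartFailedJobInstances(String jobName) {
    try {
        // 1. Get the instances of the job (here: the 100 most recent ones).
        List<JobInstance> instances = jobExplorer.getJobInstances(jobName, 0, 100);
        for (JobInstance instance : instances) {
            List<JobExecution> executions = jobExplorer.getJobExecutions(instance);
            // 2.1 Skip the instance if any execution is still running (e.g. on the other node).
            boolean hasRunningExecution = executions.stream().anyMatch(JobExecution::isRunning);
            if (hasRunningExecution) {
                continue;
            }
            // 2.2 Otherwise restart the most recent execution if it failed.
            JobExecution lastExecution = executions.stream()
                    .max(Comparator.comparing(JobExecution::getCreateTime))
                    .orElse(null);
            if (lastExecution != null && lastExecution.getStatus() == BatchStatus.FAILED) {
                jobOperator.restart(lastExecution.getId());
            }
        }
    } catch (Exception e) {
        LOGGER.error(e.getMessage(), e);
    }
}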

What is the job status when the NameNode fails in YARN?

When a job is running in the cluster and the NameNode suddenly fails, what will be the status of the job (failed or killed)?
If it is failed, who updates the job status?
How does this work internally?
The standby NameNode will become the active NameNode through the failover process. Have a look at How does Hadoop Namenode failover process works?
The YARN architecture revolves around the ResourceManager, NodeManager and ApplicationMaster. Jobs will continue without impact from a NameNode failure (assuming HA failover as above). If any of these three processes fails, job recovery is done depending on the respective process's recovery mechanism.
Resource Manager recovery:
With ResourceManager restart enabled, the RM being promoted (the current standby) to the active state loads the RM's internal state and continues to operate from where the previous active left off, as much as possible depending on the RM restart feature. A new attempt is spawned for each managed application previously submitted to the RM.
Application Master recovery:
For MapReduce running on YARN (aka MR2), the MR ApplicationMaster plays the role of a per-job JobTracker. MR AM failure recovery is controlled by the property mapreduce.am.max-attempts, which may be set per job. If its value is greater than 1, then when the ApplicationMaster dies, a new one is spun up for a new application attempt, up to the maximum number of attempts. When a new application attempt is started, in-flight tasks are aborted and rerun, but completed tasks are not rerun.
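A hedged per-job sketch using the Hadoop client Configuration API (the value 4 is only illustrative, and the effective limit is still capped cluster-wide by yarn.resourcemanager.am.max-attempts):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AmMaxAttemptsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow up to 4 ApplicationMaster attempts for this job
        // (default is 2; capped by yarn.resourcemanager.am.max-attempts).
        conf.setInt("mapreduce.am.max-attempts", 4);
        Job job = Job.getInstance(conf, "am-max-attempts-example");
        // ... configure mapper/reducer/input/output and submit as usual ...
    }
}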
Node Manager Recovery:
During the recovery, the NM loads the applications' state from the state store. The state for each application indicates whether the application has finished or not. Note that for a finished application no more containers will be launched, but it may still be undergoing log aggregation. As each application is recovered, a new Application object is created and initialization events are triggered to reinitialize the bookkeeping for the application within the NM.
During all these phases, the Job History Server plays a critical role. The status of successfully completed map and reduce tasks is restored from the Job History Server; this prevents re-launching map/reduce tasks that have already completed successfully.
Have a look at the Resource Manager HA article, the Node Manager restart article and the YARN HA article.
I'm not completely sure of the following since I haven't tested it out. But it can't hurt to fire up a VM and test it out for yourself.
The NameNode does not handle the status of jobs; that's what YARN does.
If the NameNode is not HA and it dies, you will lose your connection to HDFS (and maybe even suffer data loss). YARN will retry contacting HDFS a few times by default and eventually time out and fail the job.

What happens to orphaned Yarn Child processes?

Hadoop YARN launches instances of YarnChild in a child VM to execute the actual tasks. Those tasks communicate with their ApplicationMaster (AM) through the umbilical interface.
My question is: what happens if the AM dies and the Resource Manager (RM) fails to bring it up (say, due to some code defect in the AM)? In such a case, the child tasks would (a) note the absence of the AM via the heartbeat and then (b) go to the RM to get the new AM location, which in this case they will not get. So what happens to these orphaned tasks? I have a scenario where I would like to terminate them. Is that the default behavior, and does their NodeManager (NM) terminate them?
From Hadoop: The Definitive Guide, Chapter 6, Failures (Failures in YARN):
After a crash, a new resource manager instance is brought up (by the admin), and it recovers from the saved state. The state consists of the node managers in the system, as well as the running applications. Here tasks are not part of the resource manager's state, as they are managed by the application.
Also, it is said that the resource manager is designed to be able to recover from crashes.
All child tasks related to that particular application master would be in a halted state. The Hadoop admin should either restart the application master or kill it. The NodeManager doesn't terminate the failed ApplicationMaster.
If you want to kill an application, you can use the yarn application -kill application_id command. It will kill all running and queued jobs under the application.
If you want to kill a task in YARN, you can use hadoop job -kill-task <task-id> to kill a particular task.
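If you need to do the same thing programmatically rather than from the shell, a hedged sketch using the YarnClient API might look like this (the application ID string is a placeholder, and ApplicationId.fromString assumes a reasonably recent Hadoop release):

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class KillApplicationExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        try {
            // Placeholder application ID; equivalent to `yarn application -kill <application_id>`.
            ApplicationId appId = ApplicationId.fromString("application_1234567890123_0001");
            yarnClient.killApplication(appId);
        } finally {
            yarnClient.stop();
        }
    }
}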

Marathon kills container when requests increase

I deploy Docker containers on Mesos (0.21) and Marathon (0.7.6) on Google Cloud Engine.
I use JMeter to test a REST service that runs on Marathon. When there are fewer than 10 concurrent requests it works normally, but when the concurrent requests go over 50, the container is killed and Mesos starts another container. I increased RAM and CPU, but it still happens.
This is the log in /var/log/mesos/:
E0116 09:33:31.554816 19298 slave.cpp:2344] Failed to update resources for container 10e47946-4c54-4d64-9276-0ce94af31d44 of executor dev_service.2e25332d-964f-11e4-9004-42010af05efe running task dev_service.2e25332d-964f-11e4-9004-42010af05efe on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/612/cgroup: Failed to open file '/proc/612/cgroup': No such file or directory
The error message you're seeing is actually another symptom, not the root cause of the issue. There's a good explanation/discussion in this Apache Jira bug report:
https://issues.apache.org/jira/browse/MESOS-1837
Basically, your container is crashing for one reason or another and the /proc/pid#/ directory is getting cleared without Mesos being aware, so it throws the error message you found when it goes to check that /proc directory.
Try setting your allocated CPU higher in the JSON file describing your task.
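For reference, a hedged sketch of the kind of Marathon app definition JSON being referred to; the app id, image name and resource values are purely illustrative, and the right cpus/mem numbers depend on what your service actually needs under load:

{
  "id": "dev-service",
  "instances": 1,
  "cpus": 1.0,
  "mem": 1024,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "myregistry/dev-service:latest"
    }
  }
}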
