Hadoop YARN launches instances of YarnChild in child VM to execute the actual tasks. Those tasks communicate with their ApplicationMaster (AM) through the umbilical interface.
My question is what happens if AM dies and Resource Manager(RM) fails to bring it up (say, due to some code defect in AM)? In such a case, the children tasks would (a) note the absence of AM due to heartbeat and then, (b) go to RM to get new AM location, which in this case they will not get. So, what happens to these orphaned tasks? I have a scenario where I would like to terminate them. Is that the default behavior and does their NodeManager (NM) terminate them?
From Hadoop -Definitive Guide, Chapter 6, Failures, Failures in yarn
After a crash, a new resource manager instance is brought up(by
admin), and it recovers from the saved state. The state consists of
node managers in system, as well as running applications. Here tasks
are not part of resource managers state, as they are managed by
application.
Also, it is said that the resource manager is designed to be able to recover from crashes.
All child task related to that particular application master would be on halt state. Hadoop admin should either restart the application master or kill it. NodeManager doesn't terminate the failed Application Master.
If you want to kill a application then you can use yarn application -kill application_id command to kill the application. It will kill all running and queued jobs under the application.
If you want to kill a task in YARN then you can use hadoop job -kill-task <task-id> to kill a particular task in YARN
Related
I want to set up a series of spark steps on an EMR spark cluster, and terminate the current step if it's taking too long. However, when I ssh into the master node and run hadoop jobs -list, the master node seems to believe that there is no jobs running. I don't want to terminate the cluster, because doing so would force me to buy a whole new hour of whatever cluster I'm running. Can anyone please help me terminate a spark-step in EMR without terminating the entire cluster?
That's easy:
yarn application -kill [application id]
you can list your running applications with
yarn application -list
You can kill application from the Resource manager (in the links at the top right under cluster status).
In the resource manager, click on the application you want to kill and in the application page there is a small "kill" label (top left) you can click to kill the application.
Obviously you can also SSH but this way I think is faster and easier for some users.
Having read this question, I would like to ask additional questions:
The Cluster Manager is a long-running service, on which node it is running?
Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
In case where the Driver node fails, who is responsible of re-launching the application? and what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
Similarly to the previous question: In case where the Master node fails, what will happen exactly and who is responsible of recovering from the failure?
1. The Cluster Manager is a long-running service, on which node it is running?
Cluster Manager is Master process in Spark standalone mode. It can be started anywhere by doing ./sbin/start-master.sh, in YARN it would be Resource Manager.
2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
Master is per cluster, and Driver is per application. For standalone/yarn clusters, Spark currently supports two deploy modes.
In client mode, the driver is launched in the same process as the client that submits the application.
In cluster mode, however, for standalone, the driver is launched from one of the Worker & for yarn, it is launched inside application master node and the client process exits as soon as it fulfils its responsibility of submitting the application without waiting for the app to finish.
If an application submitted with --deploy-mode client in Master node, both Master and Driver will be on the same node. check deployment of Spark application over YARN
3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If the driver fails, all executors tasks will be killed for that submitted/triggered spark application.
4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
Master node failures are handled in two ways.
Standby Masters with ZooKeeper:
Utilizing ZooKeeper to provide leader election and some state storage,
you can launch multiple Masters in your cluster connected to the same
ZooKeeper instance. One will be elected “leader” and the others will
remain in standby mode. If the current leader dies, another Master
will be elected, recover the old Master’s state, and then resume
scheduling. The entire recovery process (from the time the first
leader goes down) should take between 1 and 2 minutes. Note that this
delay only affects scheduling new applications – applications that
were already running during Master failover are unaffected. check here
for configurations
Single-Node Recovery with Local File System:
ZooKeeper is the best way to go for production-level high
availability, but if you want to be able to restart the Master if
it goes down, FILESYSTEM mode can take care of it. When applications
and Workers register, they have enough state written to the provided
directory so that they can be recovered upon a restart of the Master
process. check here for conf and more details
The Cluster Manager is a long-running service, on which node it is running?
A cluster manager is just a manager of resources, i.e. CPUs and RAM, that SchedulerBackends use to launch tasks.
A cluster manager does nothing more to Apache Spark, but offering resources, and once Spark executors launch, they directly communicate with the driver to run tasks.
You can start a standalone master server by executing:
./sbin/start-master.sh
Can be started anywhere.
To run an application on the Spark cluster
./bin/spark-shell --master spark://IP:PORT
Is it possible that the Master and the Driver nodes will be the same machine?
I presume that there should be a rule somewhere stating that these two nodes should be different?
In standalone mode, when you start your machine certain JVM will start.Your SparK Master will start up and on each machine Worker JVM will start and they will register with the Spark Master.
Both are the resource manager.When you start your application or submit your application in cluster mode a Driver will start up wherever you do ssh to start that application.
Driver JVM will contact to the SparK Master for executors(Ex) and in standalone mode Worker will start the Ex.
So Spark Master is per cluster and Driver JVM is per application.
In case where the Driver node fails, who is responsible of re-launching the application? and what will happen exactly?
i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If a Ex JVM will crashes the Worker JVM will start the Ex and when Worker JVM ill crashes Spark Master will start them.
And with a Spark standalone cluster with cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with non-zero exit code.Spark Master will start Driver JVM
Similarly to the previous question: In case where the Master node fails,
what will happen exactly and who is responsible of recovering from the failure?
failing on master will result in executors not able to communicate with it. So, they will stop working. Failing of master will make driver unable to communicate with it for job status. So, your application will fail.
Master loss will be acknowledged by the running applications but otherwise these should continue to work more or less like nothing happened with two important exceptions:
1.application won't be able to finish in elegant way.
2.if Spark Master is down Worker will try to reregisterWithMaster. If this fails multiple times workers will simply give up.
reregisterWithMaster()-- Re-register with the active master this worker has been communicating with. If there is none, then it means this worker is still bootstrapping and hasn't established a connection with a master yet, in which case we should re-register with all masters.
It is important to re-register only with the active master during failures.worker unconditionally attempts to re-register with all masters,
will may arise race condition.Error detailed in SPARK-4592:
At this moment long running applications won't be able to continue processing but it still shouldn't result in immediate failure.
Instead application will wait for a master to go back on-line (file system recovery) or a contact from a new leader (Zookeeper mode), and if that happens it will continue processing.
When a job is running in the cluster, if suddenly the NameNode fails, then what will be the status of the job (failed or killed)?
If failed means, who is updating the job status?
How does this work internally?
Standby Namenode will become active Namenode with fail over process. Have a look at How does Hadoop Namenode failover process works?
YARN architecture revolves around Resource Manager, Node Manager and Applications Master. Jobs will continue without any of impact with namenode failure. If any of above three processes fails, job recovery will be done depending on respective process recovery.
Resource Manager recovery:
With the ResourceManger Restart enabled, the RM being promoted (current standby) to an active state loads the RM internal state and continues to operate from where the previous active left off as much as possible depending on the RM restart feature. A new attempt is spawned for each managed application previously submitted to the RM.
Application Master recovery:
For MapReduce running on YARN (aka MR2), the MR ApplicationMaster plays the role of a per-job jobtracker. MRAM failure recovery is controlled by the property, mapreduce.am.max-attempts. This property may be set per job. If its value is greater than 1, then when the ApplicationMaster dies, a new one is spun up for a new application attempt, up to the max-attempts. When a new application attempt is started, in-flight tasks are aborted and rerun but completed tasks are not rerun.
Node Manager Recovery:
During the recovery, the NM loads the applications’ state from the state store. The state for each application indicates whether the application has finished or not. Note that for a finished application no more containers will be launched but it may still be undergoing log- aggregation. As each application is recovered, a new Application object is created and initialization events are triggered to reinitialize the bookkeeping for the application within the NM.
During all these phases, Job History plays a critical role. Successfully completed Map & Reduce tasks status will be restored from Job History Server. This status is helpful to stop re-launch of successfully completed Map/Reduce tasks.
Have a look at Resource Manager HA article , Node Manager restart article and YARN HA article
I'm not completely sure of the following since I haven't tested it out. But it can't hurt to fire up a VM and test it out for yourself.
The namenode does not handle the status of jobs, that's what Yarn is doing.
If the namenode is not HA and it dies, you will lose your connection to HDFS (and maybe even have data loss). yarn will try to re-contact hdfs for a few tries by default and eventually time out and fail the job.
What happens when the Resource Manager (RM) goes down in Yarn?
In the middle of running a job, if the Resource Manager goes down, then what will happen to the job?
Does the job gets submitted automatically or do we need to submit the job again?
Thanks,
Venkat
Resource manager (RM) high availability is explained in Apache link as follows.
ResourceManager HA is realized through an Active/Standby architecture.
At any point of time, one of the RMs is Active, and other standby node is waiting to take over if Active RM fails.
The RM being promoted to an active state loads the RM internal state from State-store and continues to operate from where the previous active left off.
A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing any work.
The State-store must be visible from the both of Active/Standby RMs. Currently, there are two RMStateStore implementations for persistence - FileSystemRMStateStore and ZKRMStateStore.
The ZKRMStateStore (ZooKeeper) implicitly allows write access to a single RM at any point in time, and hence is the recommended store to use in an HA cluster.
Using the ZKRMStateStore, there is no need for a separate fencing mechanism to address a potential split-brain situation where multiple RMs can potentially assume the Active role.This situation is handled with ZooKeeper very well.
ZooKeeper is not only used for Resource Manager fail over. Many of applications now a days using ZooKeeper. Example of other fail over use cases in Hadoop - Name Node fail over also happens through ZooKeeper. Have a look at Name node fail over process too.
After Hadoop 2.x and Before Hadoop 2.6.x:
When a ResourceManager dies and is restarted, or fails over to another ResourceManager in the case of an HA cluster, the newly active ResourceManager instructs running ApplicationMasters to abort. This uses up an application attempt.
Also, if the ResourceManager is down for some time and the ApplicationMaster is unable to connect, it will timeout and abort. That uses up an application attempt too.
When a new ResourceManager becomes active, it can recover applications with failed attempts that have not exceeded their max-attempts.
Have a look at this article for more details
From Hadoop 2.6.0:
Resource Manager recovers its running state by taking advantage of the container statuses sent from all Node Managers. Node Manager will not kill the containers when it re-syncs with the restarted Resource Manager.
It continues managing the containers and send the container statuses across to Resource Manager when it re-registers.
Resource Manager reconstructs the container instances and the associated applications’ scheduling status by absorbing these containers’ information
The admin will create a new resource manager.Will take the latest information from all the application managers and update the Persistent Storage which the new Resource Manager will use. It is purely an admin task
No application or task can be launched of RM is unavailable.
If you have HA of RM then it will restart from HA.
Can someone tell how to kill a container? i see nodes are still running containers even after the application is finished and i want to know the command to kill them? Because of this issue, my subsequent applications stays in accepted state.
Thanks
Hadoop job -list
This gives you jobs that are running with JobID's
To kill job
Hadoop job –kill JobID
If yarn application is finished and some containers are still running, I'd say this is a bug somewhere. Is this a MR app? I don't think there's any commands to kill containers and anyway those should be handled by a nodemanager. Resource manager and Node manager should kill all containers when application is finished.
You didn't provide any info on what is this app, hadoop version, operating system, etc. Having said that, I once had a problem in my ubuntu hosts which had HADOOP-9752 bug which prevented nodemanager to kill a container.