Understand Spark: Cluster Manager, Master and Driver nodes - hadoop

Having read this question, I would like to ask additional questions:
The Cluster Manager is a long-running service; on which node does it run?
Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
If the Driver node fails, who is responsible for re-launching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
Similarly to the previous question: if the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

1. The Cluster Manager is a long-running service; on which node does it run?
The Cluster Manager is the Master process in Spark standalone mode. It can be started on any node with ./sbin/start-master.sh; on YARN it is the ResourceManager.
2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
The Master is per cluster, and the Driver is per application. For standalone and YARN clusters, Spark currently supports two deploy modes.
In client mode, the driver is launched in the same process as the client that submits the application.
In cluster mode, however, the driver is launched on one of the Worker nodes (standalone) or inside the ApplicationMaster (YARN), and the client process exits as soon as it fulfils its responsibility of submitting the application, without waiting for the app to finish.
If an application is submitted with --deploy-mode client from the Master node, both the Master and the Driver will be on the same node; see the sketch below, and check the deployment of a Spark application over YARN.
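To make the two deploy modes concrete, here is a minimal spark-submit sketch; the class name, jar path and host names are placeholders, not taken from the question:

# Client mode: the driver runs inside this submitting process, so submitting
# from the Master node puts Master and Driver on the same machine.
./bin/spark-submit --master spark://master-host:7077 --deploy-mode client \
  --class com.example.MyApp /path/to/my-app.jar

# Cluster mode: the driver is launched inside the cluster (on a Worker for
# standalone, in the ApplicationMaster for YARN) and this client exits after submitting.
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp /path/to/my-app.jar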
3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If the driver fails, all executors and their tasks will be killed for that submitted/triggered Spark application.
4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
Master node failures are handled in two ways.
Standby Masters with ZooKeeper:
Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications; applications that were already running during Master failover are unaffected. Check here for configurations.
Single-Node Recovery with Local File System:
ZooKeeper is the best way to go for production-level high availability, but if you just want to be able to restart the Master if it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. Check here for configuration and more details.
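As a rough sketch of how the two recovery modes are switched on (the settings go into conf/spark-env.sh on each Master node; the ZooKeeper hosts and recovery directory below are placeholders):

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# or, for single-node recovery with the local file system:
# export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
#   -Dspark.deploy.recoveryDirectory=/var/spark/recovery"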

The Cluster Manager is a long-running service; on which node does it run?
A cluster manager is just a manager of resources, i.e. CPUs and RAM, that SchedulerBackends use to launch tasks.
A cluster manager does nothing more for Apache Spark than offering resources, and once Spark executors launch, they communicate directly with the driver to run tasks.
You can start a standalone master server by executing:
./sbin/start-master.sh
It can be started anywhere.
To run an application on the Spark cluster:
./bin/spark-shell --master spark://IP:PORT
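For completeness, Worker daemons are attached to that master in much the same way; a small sketch using the same spark://IP:PORT placeholder:
./sbin/start-slave.sh spark://IP:PORT   # named start-worker.sh in newer Spark releases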
Is it possible that the Master and the Driver nodes will be the same machine?
I presume that there should be a rule somewhere stating that these two nodes should be different?
In standalone mode, when you start your machines, certain JVMs start. Your Spark Master will start up, and on each machine a Worker JVM will start and register with the Spark Master.
Both of these make up the resource manager. When you start or submit your application in client mode, a Driver will start up on whichever machine you ssh into to launch that application.
The Driver JVM will contact the Spark Master for executors (Ex), and in standalone mode the Worker will start the Ex.
So the Spark Master is per cluster and the Driver JVM is per application.
If the Driver node fails, who is responsible for re-launching the application? And what will happen exactly?
i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
If an executor (Ex) JVM crashes, the Worker JVM will restart the Ex, and when a Worker JVM crashes, the Spark Master will restart it.
And with a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with a non-zero exit code; the Spark Master will start the Driver JVM, as sketched below.
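A minimal sketch of such a supervised, cluster-mode submission (class, jar and host names are placeholders):
./bin/spark-submit --master spark://master-host:7077 --deploy-mode cluster --supervise \
  --class com.example.MyApp /path/to/my-app.jar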
Similarly to the previous question: if the Master node fails,
what will happen exactly and who is responsible for recovering from the failure?
A failure of the Master means executors are not able to communicate with it, so they will stop working. It also means the driver is unable to communicate with it for job status, so your application will fail.
Master loss will be acknowledged by the running applications, but otherwise they should continue to work more or less as if nothing happened, with two important exceptions:
1. The application won't be able to finish in an elegant way.
2. If the Spark Master is down, the Worker will try to reregisterWithMaster. If this fails multiple times, workers will simply give up.
reregisterWithMaster() re-registers with the active master this worker has been communicating with. If there is none, then it means this worker is still bootstrapping and hasn't established a connection with a master yet, in which case it should re-register with all masters.
It is important to re-register only with the active master during failures; if the worker unconditionally attempts to re-register with all masters, a race condition may arise. The error is detailed in SPARK-4592:
At this moment long-running applications won't be able to continue processing, but it still shouldn't result in immediate failure.
Instead, the application will wait for a master to go back on-line (file-system recovery) or for contact from a new leader (ZooKeeper mode), and if that happens it will continue processing.

Related

What is the job status when the NameNode fails in YARN?

When a job is running in the cluster, if the NameNode suddenly fails, what will be the status of the job (failed or killed)?
If it is failed, who updates the job status?
How does this work internally?
The standby NameNode will become the active NameNode through the failover process. Have a look at: How does the Hadoop NameNode failover process work?
The YARN architecture revolves around the ResourceManager, NodeManager and ApplicationMaster. Jobs will continue without any impact from the NameNode failure. If any of the above three processes fails, job recovery is done depending on the respective process's recovery.
Resource Manager recovery:
With ResourceManager Restart enabled, the RM being promoted (the current standby) to an active state loads the RM internal state and continues to operate from where the previous active left off as much as possible, depending on the RM restart feature. A new attempt is spawned for each managed application previously submitted to the RM.
Application Master recovery:
For MapReduce running on YARN (aka MR2), the MR ApplicationMaster plays the role of a per-job JobTracker. MR AM failure recovery is controlled by the property mapreduce.am.max-attempts. This property may be set per job. If its value is greater than 1, then when the ApplicationMaster dies, a new one is spun up for a new application attempt, up to the max attempts. When a new application attempt is started, in-flight tasks are aborted and rerun, but completed tasks are not rerun.
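For illustration only, assuming the job's driver parses generic options via ToolRunner, the per-job limit could be raised like this (jar, class and paths are hypothetical):
hadoop jar my-job.jar com.example.MyDriver \
  -D mapreduce.am.max-attempts=4 \
  /input /output
The cluster-wide setting yarn.resourcemanager.am.max-attempts in yarn-site.xml acts as an upper bound on whatever a job asks for.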
Node Manager Recovery:
During the recovery, the NM loads the applications’ state from the state store. The state for each application indicates whether the application has finished or not. Note that for a finished application no more containers will be launched, but it may still be undergoing log-aggregation. As each application is recovered, a new Application object is created and initialization events are triggered to reinitialize the bookkeeping for the application within the NM.
During all these phases, Job History plays a critical role. The status of successfully completed Map and Reduce tasks is restored from the Job History Server. This status helps prevent re-launching Map/Reduce tasks that have already completed successfully.
Have a look at the Resource Manager HA article, the Node Manager restart article and the YARN HA article.
I'm not completely sure of the following since I haven't tested it out. But it can't hurt to fire up a VM and test it out for yourself.
The NameNode does not handle the status of jobs; that's what YARN does.
If the NameNode is not HA and it dies, you will lose your connection to HDFS (and maybe even have data loss). YARN will retry contacting HDFS a few times by default and eventually time out and fail the job.

What happens when the Resource Manager (RM) goes down in Yarn?

In the middle of running a job, if the Resource Manager goes down, then what will happen to the job?
Does the job get resubmitted automatically, or do we need to submit it again?
ResourceManager (RM) high availability is explained in the Apache documentation as follows:
ResourceManager HA is realized through an Active/Standby architecture.
At any point in time, one of the RMs is Active, and the other, standby node is waiting to take over if the Active RM fails.
The RM being promoted to an active state loads the RM internal state from State-store and continues to operate from where the previous active left off.
A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing any work.
The State-store must be visible from the both of Active/Standby RMs. Currently, there are two RMStateStore implementations for persistence - FileSystemRMStateStore and ZKRMStateStore.
The ZKRMStateStore (ZooKeeper) implicitly allows write access to a single RM at any point in time, and hence is the recommended store to use in an HA cluster.
Using the ZKRMStateStore, there is no need for a separate fencing mechanism to address a potential split-brain situation where multiple RMs could assume the Active role; this situation is handled very well by ZooKeeper.
ZooKeeper is not only used for ResourceManager failover. Many applications nowadays use ZooKeeper. Another failover use case in Hadoop is NameNode failover, which also happens through ZooKeeper. Have a look at the NameNode failover process too.
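As a quick operational sketch (the rm1/rm2 IDs are whatever yarn.resourcemanager.ha.rm-ids defines in your yarn-site.xml, and the values in the comments are placeholders):
# Check which ResourceManager in the HA pair is currently active:
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
# Typical yarn-site.xml keys behind this setup:
#   yarn.resourcemanager.ha.enabled=true
#   yarn.resourcemanager.recovery.enabled=true
#   yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
#   yarn.resourcemanager.zk-address=zk1:2181,zk2:2181,zk3:2181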
After Hadoop 2.x and Before Hadoop 2.6.x:
When a ResourceManager dies and is restarted, or fails over to another ResourceManager in the case of an HA cluster, the newly active ResourceManager instructs running ApplicationMasters to abort. This uses up an application attempt.
Also, if the ResourceManager is down for some time and the ApplicationMaster is unable to connect, it will time out and abort. That uses up an application attempt too.
When a new ResourceManager becomes active, it can recover applications with failed attempts that have not exceeded their max-attempts.
Have a look at this article for more details
From Hadoop 2.6.0:
The ResourceManager recovers its running state by taking advantage of the container statuses sent from all NodeManagers. A NodeManager will not kill its containers when it re-syncs with the restarted ResourceManager.
It continues managing the containers and sends the container statuses across to the ResourceManager when it re-registers.
The ResourceManager reconstructs the container instances and the associated applications' scheduling status by absorbing these containers' information.
The admin will bring up a new ResourceManager. It will take the latest information from all the ApplicationMasters and update the persistent storage that the new ResourceManager will use; it is purely an admin task.
No application or task can be launched while the RM is unavailable.
If you have RM HA, then the standby RM will take over via HA failover.

Queries about YARN (failure modes, container size, practical example)

I want to ask a few questions to understand how YARN works:
Can anyone explain, or refer me to a document that explains, the failure modes in YARN (i.e. task failure, ApplicationMaster failure, NodeManager failure, ResourceManager failure)?
What is the container size in YARN? Is it the same as a slot in MapReduce 1?
Any practical/working example of YARN?
Refer to the Hadoop: The Definitive Guide textbook; apart from that, there is a lot of info on the Apache website.
Container size is not fixed; it is dynamically allocated by the ResourceManager based on the resources the application requests, as sketched below.
From a developer's perspective, the same old MapReduce will work on YARN.
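As a sketch of how container sizes are driven by what the application asks for rather than by fixed slots (jar, class and paths are hypothetical; the ResourceManager rounds requests up to yarn.scheduler.minimum-allocation-mb and caps them at yarn.scheduler.maximum-allocation-mb):
hadoop jar my-job.jar com.example.MyDriver \
  -D mapreduce.map.memory.mb=2048 \
  -D mapreduce.reduce.memory.mb=4096 \
  /input /output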
ResourceManager failures
In the initial versions of the YARN framework, a ResourceManager failure meant a total cluster failure, as it was a single point of failure. The ResourceManager stores the state of the cluster, such as the metadata of the submitted applications, information on cluster resource containers, information on the cluster's general configuration, and so on. Therefore, if the ResourceManager went down because of some hardware failure, there was no way to avoid manually debugging the cluster and restarting the ResourceManager. During the time the ResourceManager was down, the cluster was unavailable, and once it was restarted all jobs needed a restart, so half-completed jobs lost their work and had to be restarted again. In short, a restart of the ResourceManager used to restart all the running ApplicationMasters.
The latest versions of YARN address this problem in two ways. One way is by creating an active-passive ResourceManager architecture, so that when one goes down, another becomes active and takes responsibility for the cluster. Another way is by using a ZooKeeper ResourceManager quorum, so that the ResourceManager state is stored externally in ZooKeeper, with one ResourceManager in an active state and one or more ResourceManagers in passive mode, waiting for the event that brings them to an active state.
ApplicationMaster failures
When the ApplicationMaster fails, the ResourceManager simply starts another container with a new ApplicationMaster running in it for another application attempt. It is the responsibility of the new ApplicationMaster to recover the state of the older ApplicationMaster, and this is possible only when ApplicationMasters persist their state in an external location so that it can be used for future reference. The ApplicationMaster stores its state to persistent storage, thus all the status up to the failure can be recovered.
NodeManager Failures
If a NodeManager fails, the ResourceManager detects this failure using a time-out (that is, it stops receiving the heartbeats from the NodeManager). The ResourceManager then removes the NodeManager from its pool of available NodeManagers. It also kills all the containers running on that node and reports the failure to all running AMs. The AMs are then responsible for reacting to node failures, by redoing the work done by any containers running on that node during the fault.
Container Failures
Container failures are reported by the NodeManager to the ResourceManager, and the ResourceManager informs the ApplicationMaster. The ApplicationMaster then restarts the container.

Mesos/Marathon checkpointing and HA

Mesos and Marathon mention checkpointing from time to time, but I couldn't find a good explanation of how it works anywhere. Also, what does it mean in practice?
1) Is the Task's current state continuously being stored, or is only the Task ID stored? Where is it stored and what does it contain?
2) There are two Marathon instances. Marathon has been running Nginx for a week, then goes down. Does that mean that the actual Nginx application continues running under the second Marathon instance, or does it just restart the task from the beginning? If the Task's actual state is copied, isn't there a lot of data to be continuously persisted and passed around between slaves?
Slave recovery is a feature of Mesos that:
allows executors/tasks to keep running when the slave process is down, and
allows a restarted slave process to reconnect with the running executors/tasks on that slave
(see Mesos Slave recovery).
So regarding your questions, this means:
Enough information (a little more than the Task ID) is stored so that a new slave process can reconnect to the still-running executor/task.
As the task state is not checkpointed, it would start the task from the beginning.

Why does Marathon not terminate jobs after the quorum is lost?

I'm working with Apache Mesos and Marathon. I have 3 master nodes and 3 slave nodes. I configured Mesos with quorum 2. Later I posted a JSON to run one job with Marathon and everything looked fine.
Then I tried shutting down two master nodes to break the quorum; after this, Mesos unregistered all slaves and everything looked OK, but when I inspected the slaves I found that the started job was still running... Is that normal? I was assuming that Marathon would stop all jobs after the quorum is lost.
Part of the Mesos philosophy, especially for long-running services, is that a failure in one or more Mesos components should not need to stop the user application.
If a slave shuts down and the framework has checkpointing enabled, the executor driver will wait for the slave's --recovery_timeout (default 15min) before shutting down the executor/tasks. To prevent this, disable checkpointing on your framework (in Marathon, just set --checkpoint=false when starting Marathon). See also Marathon's --failover_timeout on https://mesosphere.github.io/marathon/docs/command-line-flags.html
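Roughly, the Marathon flags mentioned here look like this at start-up (however Marathon is launched in your environment; the ZooKeeper URLs and the timeout value are placeholders):
marathon \
  --master zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  --zk zk://zk1:2181,zk2:2181,zk3:2181/marathon \
  --failover_timeout 604800 \
  --checkpoint=false   # only if, as described above, you do not want tasks to outlive a slave restart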
On the other hand, if it's just the Masters/ZKs that shut down, and the Slaves are still up and running, the slaves can still monitor the tasks and queue up status updates, so the tasks can stay alive. If ZK loses quorum, then there is no leading master, and each slave will continue to operate independently until a new leader is detected, at which point it will reregister with the master and send any queued status updates.
