I am using Mesos for container orchestration and get task history from Mesos using the /tasks endpoint.
Mesos is running in a 7-node cluster and ZooKeeper is running in a 3-node cluster. I assume Mesos uses ZooKeeper to store the task history, yet we sometimes lose history when we restart Mesos. Does it store the history in memory? I am trying to understand what is happening here.
My questions are,
Where does it store task histories?
How can we configure the task history cleanup policy?
Why do we lose complete task history on restarting Mesos?
To answer your questions:
Task history/state for Mesos is stored in memory and in the replicated_log (details here). The default is to use the replicated_log; to keep state entirely in memory without the replicated_log, you would have to specify this in your Mesos flags, shown here on the configuration page as --registry=in_memory.
Most users typically configure task history cleanup with these three flags (there are more, but these are the most common): --max_completed_frameworks=VALUE, --max_completed_tasks_per_framework=VALUE, and --max_unreachable_tasks_per_framework=VALUE, as described in the document linked above.
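A hedged sketch of a master invocation combining these flags (the ZooKeeper hosts, quorum size and limit values below are only illustrative, not recommendations):

mesos-master \
  --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  --quorum=2 \
  --work_dir=/var/lib/mesos \
  --registry=replicated_log \
  --max_completed_frameworks=50 \
  --max_completed_tasks_per_framework=1000 \
  --max_unreachable_tasks_per_framework=1000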
Yes, task history for the /tasks endpoint is lost every time a Mesos Master is restarted. However, the /state endpoint will still contain all task status changes over time.
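For example, assuming the master listens on the default port 5050, both endpoints can be queried directly:

curl http://<mesos-master>:5050/tasks
curl http://<mesos-master>:5050/state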
Edited to reflect information about the /tasks endpoint, not the /state endpoint.
How to deploy the Apache Airflow (formerly known as Airbnb's Airflow) scheduler in high availability?
I am not asking about the backend DB or RabbitMQ, which should obviously be deployed in a high-availability configuration.
My main focus is the scheduler - is there something special that needs to be done?
After a bit of digging I found that it is not safe to run multiple schedulers simultaneously, which means that out of the box Airflow schedulers are not safe to use in high-availability environments.
The Airflow team is planning to solve this issue by adding a lock mechanism on the DAG data structure, but this is not implemented yet (I checked by running 2 schedulers and saw that they scheduled the same DAG instances, which is not good).
This is described here:
https://groups.google.com/forum/#!topic/airbnb_airflow/-1wKa3OcwME
I did find a way to work around this high-availability issue by wrapping the schedulers with my own code and using cluster tools for leader election (I personally use Consul for this purpose). This way only the elected master runs the scheduler, and when the master goes down the slave replaces it.
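As a rough sketch of that wrapper (the lock prefix and wrapped command here are just examples), consul lock only runs the wrapped command on the node that currently holds the lock:

# Run this on every scheduler node; only the current lock holder actually executes the scheduler
consul lock airflow/scheduler-leader "airflow scheduler"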
Please consider this when you use Airflow in high-availability environments, since out of the box the Airflow scheduler is currently not suitable for this (unless you solve this issue yourself).
Edit - an alternative approach to the master-slave solution is to use a cluster manager/scheduler to make sure that exactly one Airflow scheduler instance is always running. This approach relies on the self-healing abilities of the cluster manager you have. For example, both Mesos and Nomad support this kind of configuration (I personally chose Nomad for its simplicity).
My personal experience was to follow the best-practice instructions I found: restart the scheduler every 10 runs ( -N 10 ) and use this software when possible:
https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
I also use a DAG which pings a monitoring system to be sure that the scheduler has not gone away.
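For reference, a minimal sketch of that restart-every-10-runs setup (in the Airflow 1.x CLI the run-limit flag is spelled -n / --num_runs), assuming a supervisor such as systemd or the failover controller restarts the process after it exits:

airflow scheduler --num_runs 10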
In my scenario, I have 2 schedulers (on 2 separate Docker swarms), with the standby cluster's scheduler turned off (using docker swarm service scale=0). I needed to make sure the primary scheduler had stopped fully before I started up the standby scheduler. What I found was that having 2 running schedulers (even for a brief period) occasionally resulted in a DAG being scheduled to run on both clusters, leading to duplicate reports generated from two different cluster zones.
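To illustrate that manual failover (the service name airflow_scheduler is hypothetical):

# On the primary swarm: stop the active scheduler
docker service scale airflow_scheduler=0
# Verify no scheduler replica is still running before failing over
docker service ps airflow_scheduler
# On the standby swarm: bring up the standby scheduler
docker service scale airflow_scheduler=1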
Having read this question, I would like to ask additional questions:
The Cluster Manager is a long-running service; on which node is it running?
Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
Similarly to the previous question: in the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
1. The Cluster Manager is a long-running service; on which node is it running?
The Cluster Manager is the Master process in Spark standalone mode. It can be started anywhere by running ./sbin/start-master.sh; in YARN it would be the Resource Manager.
2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
The Master is per cluster, and the Driver is per application. For standalone/YARN clusters, Spark currently supports two deploy modes.
In client mode, the driver is launched in the same process as the client that submits the application.
In cluster mode, however, for standalone the driver is launched on one of the Workers, and for YARN it is launched inside the Application Master node; the client process exits as soon as it fulfils its responsibility of submitting the application, without waiting for the app to finish.
If an application is submitted with --deploy-mode client on the Master node, both the Master and the Driver will be on the same node. Check deployment of a Spark application over YARN.
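For illustration, the deploy mode is chosen at submit time (the class name, jar and master URL below are placeholders):

# Client mode: the driver runs in the submitting process
./bin/spark-submit --master spark://master-host:7077 --deploy-mode client --class com.example.MyApp my-app.jar
# Cluster mode: the driver is launched inside the cluster and the client exits after submission
./bin/spark-submit --master spark://master-host:7077 --deploy-mode cluster --class com.example.MyApp my-app.jar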
3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If the driver fails, all executor tasks will be killed for that submitted/triggered Spark application.
4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
Master node failures are handled in two ways.
Standby Masters with ZooKeeper:
Utilizing ZooKeeper to provide leader election and some state storage,
you can launch multiple Masters in your cluster connected to the same
ZooKeeper instance. One will be elected “leader” and the others will
remain in standby mode. If the current leader dies, another Master
will be elected, recover the old Master’s state, and then resume
scheduling. The entire recovery process (from the time the first
leader goes down) should take between 1 and 2 minutes. Note that this
delay only affects scheduling new applications – applications that
were already running during Master failover are unaffected. check here
for configurations
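This mode is enabled through SPARK_DAEMON_JAVA_OPTS on each Master; a minimal sketch, assuming ZooKeeper runs on zk1-zk3 (the hosts and znode path are placeholders):

# In conf/spark-env.sh on every Master that should participate in the election
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"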
Single-Node Recovery with Local File System:
ZooKeeper is the best way to go for production-level high
availability, but if you want to be able to restart the Master if
it goes down, FILESYSTEM mode can take care of it. When applications
and Workers register, they have enough state written to the provided
directory so that they can be recovered upon a restart of the Master
process. check here for conf and more details
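A corresponding sketch for FILESYSTEM mode (the recovery directory is just an example path):

# In conf/spark-env.sh on the single Master
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
  -Dspark.deploy.recoveryDirectory=/var/spark/recovery"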
The Cluster Manager is a long-running service; on which node is it running?
A cluster manager is just a manager of resources, i.e. CPUs and RAM, that SchedulerBackends use to launch tasks.
A cluster manager does nothing more for Apache Spark than offer resources, and once Spark executors launch, they communicate directly with the driver to run tasks.
You can start a standalone master server by executing:
./sbin/start-master.sh
It can be started anywhere.
To run an application on the Spark cluster:
./bin/spark-shell --master spark://IP:PORT
Is it possible that the Master and the Driver nodes will be the same machine?
I presume that there should be a rule somewhere stating that these two nodes should be different?
In standalone mode, when you start your machines certain JVMs will start. Your Spark Master will start up, and on each machine a Worker JVM will start and register with the Spark Master.
Both act as the resource manager. When you start your application, or submit your application in cluster mode, a Driver will start up wherever you ssh in to start that application.
The Driver JVM will contact the Spark Master for executors (Ex), and in standalone mode the Worker will start the Ex.
So the Spark Master is per cluster and the Driver JVM is per application.
In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly?
i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
If an Ex JVM crashes, the Worker JVM will restart the Ex, and when a Worker JVM crashes, the Spark Master will restart it.
And with a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with a non-zero exit code; the Spark Master will restart the Driver JVM.
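A hedged example of such a submission (the master URL, class and jar are placeholders):

./bin/spark-submit --master spark://master-host:7077 --deploy-mode cluster --supervise --class com.example.MyApp my-app.jar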
Similarly to the previous question: in the case where the Master node fails,
what will happen exactly and who is responsible for recovering from the failure?
A failure of the master will result in executors not being able to communicate with it, so they will stop working. A failure of the master will also make the driver unable to communicate with it for job status, so your application will fail.
Master loss will be acknowledged by the running applications, but otherwise they should continue to work more or less as if nothing happened, with two important exceptions:
1. The application won't be able to finish in an elegant way.
2. If the Spark Master is down, the Worker will try to reregisterWithMaster. If this fails multiple times, workers will simply give up.
reregisterWithMaster()-- Re-register with the active master this worker has been communicating with. If there is none, then it means this worker is still bootstrapping and hasn't established a connection with a master yet, in which case we should re-register with all masters.
It is important to re-register only with the active master during failures. If the worker unconditionally attempts to re-register with all masters, a race condition may arise, as detailed in SPARK-4592:
At this point, long-running applications won't be able to continue processing, but this still shouldn't result in immediate failure.
Instead, the application will wait for a master to come back online (file-system recovery) or for contact from a new leader (ZooKeeper mode), and if that happens it will continue processing.
I have a cluster of 3 Mesos slaves, where I have two applications: "redis" and "memcached". Redis depends on memcached, and the requirement is that both applications/services should start on the same node rather than on different slave nodes.
So I have created the application group and added the dependency properly in the JSON file. After launching the JSON file via the /v2/groups REST API, I observe that sometimes both applications in the group start on the same node, but sometimes they start on different slaves, which breaks our requirement.
So the intent/requirement is: if either application fails to start on a slave, both applications should fail over to another slave node. Also, can I configure the JSON file to tell Marathon to start the application group on slave-1 (a specific slave) first if it is available, and otherwise start it on another slave in the cluster? If for some reason this application group starts on another slave, can Marathon relaunch the application group on slave-1 once it is available to serve the request?
Thanks in advance for help.
Edit/Update (2):
Mesos, Marathon, and DC/OS support for PODs is available now:
DC/OS: https://dcos.io/docs/1.9/usage/pods/using-pods/
Mesos: https://github.com/apache/mesos/blob/master/docs/nested-container-and-task-group.md
Marathon: https://github.com/mesosphere/marathon/blob/master/docs/docs/pods.md
I assume you are talking about Marathon apps.
Marathon application groups don't have any semantics concerning co-location on the same node, and the same is the case for dependencies.
You seem to be looking for a Kubernetes-like Pod abstraction in Marathon, which is on the roadmap but not yet available (see the update above :-)).
Hope this helps!
I think this should be possible (as a workaround) if you specify the correct app constraints within the group's JSON.
Have a look at the example request at
https://mesosphere.github.io/marathon/docs/generated/api.html#v2_groups_post
and the constraints syntax at
https://mesosphere.github.io/marathon/docs/constraints.html
e.g.
"constraints": [["hostname", "CLUSTER", "slave-1"]]
should do it. The downside is that there will be no automatic failover to another slave that way. Still, I'd be curious why both apps specifically need to run on the same slave node...
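For illustration, a constraint like that would be added to each app inside the group definition posted to /v2/groups (the group/app IDs and commands below are made up, not your actual setup):

curl -X POST http://marathon-host:8080/v2/groups \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "/cache-stack",
    "apps": [
      {
        "id": "memcached",
        "cmd": "memcached -u memcache",
        "instances": 1,
        "constraints": [["hostname", "CLUSTER", "slave-1"]]
      },
      {
        "id": "redis",
        "cmd": "redis-server",
        "instances": 1,
        "dependencies": ["/cache-stack/memcached"],
        "constraints": [["hostname", "CLUSTER", "slave-1"]]
      }
    ]
  }'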
What happens when the Resource Manager (RM) goes down in Yarn?
In the middle of running a job, if the Resource Manager goes down, then what will happen to the job?
Does the job get resubmitted automatically, or do we need to submit the job again?
Thanks,
Venkat
Resource Manager (RM) high availability is explained in the Apache documentation as follows.
ResourceManager HA is realized through an Active/Standby architecture.
At any point in time, one of the RMs is Active, and the other standby node is waiting to take over if the Active RM fails.
The RM being promoted to an active state loads the RM internal state from State-store and continues to operate from where the previous active left off.
A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing any work.
The State-store must be visible to both the Active and Standby RMs. Currently, there are two RMStateStore implementations for persistence - FileSystemRMStateStore and ZKRMStateStore.
The ZKRMStateStore (ZooKeeper) implicitly allows write access to a single RM at any point in time, and hence is the recommended store to use in an HA cluster.
Using the ZKRMStateStore, there is no need for a separate fencing mechanism to address a potential split-brain situation where multiple RMs could assume the Active role; this situation is handled very well with ZooKeeper.
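With HA enabled, the state of each RM can be inspected or switched manually (the rm1/rm2 IDs depend on your yarn.resourcemanager.ha.rm-ids setting):

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
# Manual failover, only applicable when automatic failover is not enabled
yarn rmadmin -transitionToStandby rm1
yarn rmadmin -transitionToActive rm2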
ZooKeeper is not only used for Resource Manager failover; many applications nowadays use ZooKeeper. An example of another failover use case in Hadoop: NameNode failover also happens through ZooKeeper. Have a look at the NameNode failover process too.
After Hadoop 2.x and Before Hadoop 2.6.x:
When a ResourceManager dies and is restarted, or fails over to another ResourceManager in the case of an HA cluster, the newly active ResourceManager instructs running ApplicationMasters to abort. This uses up an application attempt.
Also, if the ResourceManager is down for some time and the ApplicationMaster is unable to connect, it will timeout and abort. That uses up an application attempt too.
When a new ResourceManager becomes active, it can recover applications with failed attempts that have not exceeded their max-attempts.
Have a look at this article for more details
From Hadoop 2.6.0:
The Resource Manager recovers its running state by taking advantage of the container statuses sent from all Node Managers. A Node Manager will not kill its containers when it re-syncs with the restarted Resource Manager.
It continues managing the containers and sends the container statuses across to the Resource Manager when it re-registers.
The Resource Manager reconstructs the container instances and the associated applications' scheduling status by absorbing this container information.
The admin will create a new Resource Manager. It will take the latest information from all the application managers and update the persistent storage, which the new Resource Manager will use. It is purely an admin task.
No application or task can be launched if the RM is unavailable.
If you have RM HA, then the standby RM will take over.
Mesos and Marathon mention checkpointing from time to time, but I couldn't find a good explanation of how it works anywhere. Also, what does it mean in practice?
1) Is the task's current state continuously being stored, or is only the task ID stored? Where is it stored and what does it contain?
2) There are two Marathon instances. Marathon has been running Nginx for a week, then goes down. Does that mean the actual Nginx application state continues running on the second Marathon instance, or does it just restart the task from the beginning? If the task's actual state is copied, isn't there a lot of data to be continuously persisted and passed around between slaves?
Slave recovery is a feature of Mesos that:
allows executors/tasks to keep running when the slave process is down, and
allows a restarted slave process to reconnect with the running executors/tasks on that slave
(Mesos Slave Recovery).
So regarding your questions this means:
Enough information (a little more than the task ID) is stored so that a new slave process can reconnect to the still-running executor/task.
As the task state is not checkpointed, it would start the task from the beginning.
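For illustration, these are the agent-side flags that drive this recovery behaviour (the values are only examples; in newer Mesos releases agent checkpointing is always on, and the framework itself opts in via the checkpoint field of its FrameworkInfo):

# Agent (slave) flags relevant to recovery
mesos-slave \
  --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  --work_dir=/var/lib/mesos \
  --recover=reconnect \
  --strict=true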
Hope this helps,
Joerg