DC/OS (Mesos/Marathon): how to set the time before a killed instance of an application is restarted - mesos

I have installed DC/OS (3 master and 7 slave servers - all CentOS 7).
I see a problem: when one of the slave servers shuts down, Mesos/Marathon only starts the killed application instances again after 5 minutes.
For example, I run 8 instances of a simple web application in Mesos/Marathon. When I shut down one slave server or deactivate its network interface, Marathon shows that some instances are killed. From that moment Mesos/Marathon waits 5 minutes before starting the killed instances on another online slave server.
My question is: how can I change this time? 5 minutes is too long. I have read the DC/OS documentation but I can't find the variable responsible for this.
I would be very thankful for your help.

You can have a look at the Marathon command-line flags. Based on your description, I'd guess the default for either task_launch_timeout or scale_apps_interval could be responsible for this.
I'm not sure, though, whether this can be configured on the fly or only during installation in DC/OS. I saw that there's a fairly recent enhancement request to make Marathon flags passable via environment variables.
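If you want to check which values your running Marathon actually uses before changing anything, you can query its /v2/info endpoint, which reports the effective configuration. A minimal sketch (assuming Marathon is reachable at http://localhost:8080; on DC/OS it usually sits behind the admin router under /marathon):

```python
import requests

# Hypothetical Marathon address - adjust for your cluster.
MARATHON_URL = "http://localhost:8080"

# GET /v2/info returns the running Marathon instance's configuration,
# including the effective values of its command-line flags.
info = requests.get(f"{MARATHON_URL}/v2/info").json()
config = info.get("marathon_config", {})

for flag in ("task_launch_timeout", "scale_apps_interval"):
    print(flag, "=", config.get(flag))
```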

Related

TeamCity with AWS cloudformation stuck on AgentService

I followed TeamCity's instructions for running a TeamCity build server on AWS with a CloudFormation template. I launched it, and it gets stuck at AgentService (Resource creation initiated). I waited for half an hour with no progress.
The Resources tab shows the following:
What am I doing wrong here?
(For me) this typically happens when the service cannot be started for some reason, for instance if the cluster does not have enough suitable instances to start your service.
To diagnose it, check your service in the ECS cluster: look at the service's events, and under the service's tasks check the stopped tasks (and the reasons they were stopped).
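If you prefer to script that check instead of clicking through the console, here is a rough sketch using boto3; the cluster and service names are placeholders for whatever the TeamCity CloudFormation stack created:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Placeholder names - substitute the ones from your CloudFormation stack.
cluster = "teamcity-cluster"
service = "teamcity-agent-service"

# Recent service events often explain why tasks are not being placed.
svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
for event in svc["events"][:5]:
    print(event["createdAt"], event["message"])

# Stopped tasks carry a stoppedReason describing why they exited.
stopped = ecs.list_tasks(cluster=cluster, desiredStatus="STOPPED")["taskArns"]
if stopped:
    for task in ecs.describe_tasks(cluster=cluster, tasks=stopped)["tasks"]:
        print(task["taskArn"], "->", task.get("stoppedReason"))
```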
I got a tip from a colleague that if you are creating a service from a CloudFormation template, it may take up to 3(!) hours. I tried again today, and after 3 hours it was up and running.
The reason for this is the ECS setup, which involves DNS configuration for an internet-facing service.

Can I run mesos/marathon application at specific host?

I want to use Marathon for cluster monitoring and management. Is the scenario below possible?
My Scenario
Cassandra (5 nodes) is already deployed and running.
The Cassandra hosts are physical machines.
I want to run a script that verifies the health of Cassandra on each host, e.g. the Cassandra process, disk usage, number of files, ...
If a problem is found on a host, then run a correcting script on that host. The script is launched manually.
Each script can be run as a Marathon application, but I couldn't find a way to run an application on the (specific) host that has the error.
There is no restriction on adding machines and installing Mesos components.
And if you know a more suitable tool, please recommend it!
If you are not running Cassandra on Mesos, I think Marathon is not the best choice. From your description, it looks like you need a monitoring tool (e.g., Nagios) rather than service orchestration.
Please extend your question with more information. It's not clear what you are asking.

Airflow setup for high availability

How do I deploy the Apache Airflow (formerly known as Airbnb's Airflow) scheduler for high availability?
I am not asking about the backend DB or RabbitMQ that should obviously be deployed in high availability configuration.
My main focus is the scheduler - is there something special that needs to be done?
After a bit of digging I found that it is not safe to run multiple schedulers simultaneously, which means that out of the box the Airflow scheduler is not safe to use in high-availability environments.
The Airflow team is planning to solve this issue by adding a lock mechanism on the DAG data structure, but this is not implemented yet (I checked by running 2 schedulers and saw that they scheduled the same DAG instances, which is not good).
This is described here:
https://groups.google.com/forum/#!topic/airbnb_airflow/-1wKa3OcwME
I did find a way to work around this high-availability issue by wrapping the schedulers with my own code and using cluster tools for leader election (I personally use Consul for this purpose). This way only the elected master runs the scheduler, and when the master goes down a slave replaces it.
Please consider this when you use Airflow in high-availability environments, since out of the box the Airflow scheduler is currently not suitable for this (unless you solve the issue yourself).
Edit: an alternative approach to the master-slave solution is to use a cluster manager/scheduler to make sure that exactly one Airflow scheduler instance is always running. This approach relies on the self-healing abilities of your cluster manager. For example, both Mesos and Nomad support this kind of configuration (I personally chose Nomad for its simplicity).
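The wrapper idea is roughly the following - a simplified sketch assuming the python-consul client and a locally reachable Consul agent; the lock key and timeouts are just example values:

```python
import subprocess
import time

import consul  # python-consul client (an assumption - any Consul client works)

LOCK_KEY = "service/airflow/scheduler-leader"  # hypothetical key name

c = consul.Consul()
session = c.session.create(name="airflow-scheduler", ttl=30, lock_delay=5)
scheduler = None

try:
    while True:
        # Try to acquire (or confirm) leadership via a Consul session lock.
        is_leader = c.kv.put(LOCK_KEY, "leader", acquire=session)
        if is_leader and scheduler is None:
            # We are the elected master: start the local scheduler.
            scheduler = subprocess.Popen(["airflow", "scheduler"])
        elif not is_leader and scheduler is not None:
            # Lost leadership: stop the local scheduler so only one runs.
            scheduler.terminate()
            scheduler = None
        c.session.renew(session)  # keep the session (and therefore the lock) alive
        time.sleep(10)
finally:
    if scheduler is not None:
        scheduler.terminate()
    c.session.destroy(session)
```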
My personal experience was to follow the best-practice instructions I found; that is, to restart the scheduler every 10 runs (-N 10) and to use this software when possible:
https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
I also use a DAG which pings a monitoring system to be sure that the scheduler has not gone away.
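Such a heartbeat DAG can be as simple as the sketch below. The monitoring URL is a placeholder for whatever dead-man's-switch or healthcheck endpoint you use, and the import path matches the Airflow 1.x operators module:

```python
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholder endpoint - point it at your monitoring system.
MONITORING_URL = "https://monitoring.example.com/ping/airflow-scheduler"


def ping_monitoring():
    # If the scheduler dies, this DAG stops running, the pings stop,
    # and the monitoring system turns the silence into an alert.
    requests.get(MONITORING_URL, timeout=10)


dag = DAG(
    "scheduler_heartbeat",
    start_date=datetime(2017, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
)

ping = PythonOperator(
    task_id="ping_monitoring",
    python_callable=ping_monitoring,
    dag=dag,
)
```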
In my scenario, I have 2 schedulers (on 2 separate Docker Swarms), with the standby cluster's scheduler turned off (by scaling the Docker Swarm service to 0). I needed to make sure the primary scheduler had stopped fully before I started up the standby scheduler. What I found was that having 2 running schedulers (even for a brief period) occasionally resulted in a DAG being scheduled to run on both clusters, leading to duplicate reports generated from two different cluster zones.

Provision to start group of applications on same Mesos slave

I have a cluster of 3 Mesos slaves running two applications: “redis” and “memcached”. Redis depends on memcached, and the requirement is that both applications/services start on the same node rather than on different slave nodes.
So I have created the application group and added the dependency properly in the JSON file. After launching the JSON file via the “v2/groups” REST API, I observe that sometimes both applications in the group start on the same node, but sometimes they start on different slaves, which breaks our requirement.
So the intent/requirement is: if either application fails to start on a slave, both applications should fail over to another slave node. Also, can I configure the JSON file to tell Marathon to start the application group on slave-1 (a specific slave) first if it is available, and otherwise start it on another slave in the cluster? If for some reason the application group starts on another slave, can Marathon relaunch it onto slave-1 once slave-1 is available to serve requests?
Thanks in advance for help.
Edit/Update (2):
Mesos, Marathon, and DC/OS support for PODs is available now:
DC/OS: https://dcos.io/docs/1.9/usage/pods/using-pods/
Mesos: https://github.com/apache/mesos/blob/master/docs/nested-container-and-task-group.md
Marathon: https://github.com/mesosphere/marathon/blob/master/docs/docs/pods.md
I assume you are talking about Marathon apps.
Marathon application groups don't have any semantics concerning co-location on the same node, and the same is the case for dependencies.
You seem to be looking for a Kubernetes-like pod abstraction in Marathon, which is on the roadmap but not yet available (see the update above :-)).
Hope this helps!
I think this should be possible (as a workaround) if you specify the correct app constraints within the group's JSON.
Have a look at the example request at
https://mesosphere.github.io/marathon/docs/generated/api.html#v2_groups_post
and the constraints syntax at
https://mesosphere.github.io/marathon/docs/constraints.html
e.g.
"constraints": [["hostname", "CLUSTER", "slave-1"]]
should do. Downside is that there will be no automatic failover to another slave that way. Still, I'd be curious why both apps need to specifically run on the same slave node...
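For example, a group definition along these lines, posted to the /v2/groups endpoint mentioned in the question, would pin both apps to slave-1. This is a rough sketch; the Marathon URL, resource sizes, and container images are placeholders:

```python
import requests

# Placeholder Marathon address - adjust for your cluster.
MARATHON_URL = "http://localhost:8080"

group = {
    "id": "/cache",
    "apps": [
        {
            "id": "/cache/memcached",
            "cpus": 0.5,
            "mem": 256,
            "instances": 1,
            "container": {"type": "DOCKER", "docker": {"image": "memcached"}},
            # Pin the app to a specific slave by hostname.
            "constraints": [["hostname", "CLUSTER", "slave-1"]],
        },
        {
            "id": "/cache/redis",
            "cpus": 0.5,
            "mem": 256,
            "instances": 1,
            "container": {"type": "DOCKER", "docker": {"image": "redis"}},
            # Start redis only after memcached is healthy.
            "dependencies": ["/cache/memcached"],
            "constraints": [["hostname", "CLUSTER", "slave-1"]],
        },
    ],
}

resp = requests.post(f"{MARATHON_URL}/v2/groups", json=group)
print(resp.status_code, resp.text)
```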

Mesos/Marathon checkpointing and HA

Mesos and Marathon mention checkpointing from time to time, but I couldn't find a good explanation of how it works anywhere. Also, what does it mean in practice?
1) Is the task's current state continuously being stored, or is only the task ID stored? Where is it stored and what does it contain?
2) There are two Marathon instances. Marathon has been running Nginx for a week, then goes down. Does that mean the actual Nginx application state continues running under the second Marathon instance, or does it just restart the task from the beginning? If the task's actual state is copied, isn't there a lot of data to continuously persist and pass around between slaves?
Slave recovery is a feature of Mesos that allows:
executors/tasks to keep running while the slave process is down, and
a restarted slave process to reconnect with the running executors/tasks on that slave
(Mesos Slave Recovery).
So regarding your questions, this means:
Enough information (a little more than the task ID) is stored so that a new slave process can reconnect to the still-running executors/tasks.
As the actual task state is not checkpointed, the task would be started from the beginning.
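If you want to see how this is configured in your own cluster, you can inspect the framework and agent settings over HTTP. A rough sketch - the Marathon and agent URLs are placeholders, and the flag names are the ones I'd expect to see on a recent Mesos agent:

```python
import requests

# Placeholder addresses for your Marathon master and one Mesos agent.
MARATHON_URL = "http://localhost:8080"
AGENT_URL = "http://slave-1:5051"

# Marathon (the framework) must register with checkpointing enabled.
marathon_config = requests.get(f"{MARATHON_URL}/v2/info").json().get("marathon_config", {})
print("framework checkpointing:", marathon_config.get("checkpoint"))

# The agent's /flags endpoint shows how it will recover after a restart.
agent_flags = requests.get(f"{AGENT_URL}/flags").json().get("flags", {})
for flag in ("recover", "recovery_timeout", "strict"):
    print(flag, "=", agent_flags.get(flag))
```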
Hope this helps,
Joerg
