How do I get logs when a Mesos agent fails to launch a task?

I am trying to write a custom scheduler using the HTTP API. The Mesos master accepts my task launch, but the Mesos agent sends a TASK_FAILED status update. Where can I find more detailed logs explaining why the task failed?
My Mesos version is 1.6.0. Thanks.

There are two places to look:
Check the message and reason fields of the task status update.
Look in the Mesos sandbox and examine stdout, stderr and any other logs generated by your app. Here are instructions on how to do it.
You may also need to work out the problem from the exit code. Here is an explanation of how to do it.
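
As a rough illustration of both steps, the sketch below pulls the state, reason and message out of an UPDATE event shaped like those delivered by the v1 scheduler HTTP API, and interprets an exit code. The sample event and the helper names summarize_update and explain_exit_code are made up for this example, and the "128 + signal number" rule is the usual shell convention, which is assumed to apply to your command.

```python
import signal


def summarize_update(event):
    """Return (task_id, state, reason, message) from a scheduler UPDATE event."""
    status = event["update"]["status"]
    return (
        status["task_id"]["value"],
        status.get("state"),
        status.get("reason"),   # machine-readable cause, may be absent
        status.get("message"),  # free-text detail, may be absent
    )


def explain_exit_code(code):
    """Interpret an exit code using the common shell convention (assumption)."""
    if code == 0:
        return "success"
    if code > 128:
        try:
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            return "exit code %d (no matching signal)" % code
    return "exited with error code %d" % code


# Hypothetical event, for illustration only.
sample_event = {
    "type": "UPDATE",
    "update": {
        "status": {
            "task_id": {"value": "task-1"},
            "state": "TASK_FAILED",
            "reason": "REASON_COMMAND_EXECUTOR_FAILED",
            "message": "Command exited with status 1",
        }
    },
}

print(summarize_update(sample_event))
print(explain_exit_code(137))
```

Logging these fields from your scheduler for every non-successful update is usually the quickest way to see why the agent reported TASK_FAILED before digging into the sandbox.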

Related

How does Nomad know internally about the allocation status of a job (Running, Failed, Queued, Starting, Complete, Lost)?

Allocation status Screenshot
Hi team,
We can see the allocation status of a job in the Nomad UI or on the command line,
but how does Nomad find out whether a job is running, has failed, or has completed?
Basically, I want to know how Nomad determines the allocation status.
Most probably the agent running on each node in the cluster reports it back to the Nomad servers.
How does the agent know?
Most certainly there is a watch loop in the agent's code that monitors each allocation running on the node.
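
The kind of watch loop guessed at above could look roughly like the sketch below. This only illustrates the idea and is not Nomad's actual code: the task-state names, the failed flag, the precedence rules and both function names are assumptions made for the example.

```python
def derive_allocation_status(task_states):
    """Collapse per-task states into one allocation status.

    task_states maps task name -> (state, failed), where state is one of
    "pending", "running" or "dead" (hypothetical names for this sketch).
    """
    states = list(task_states.values())
    if any(state == "dead" and failed for state, failed in states):
        return "failed"
    if any(state == "running" for state, _ in states):
        return "running"
    if not states or any(state == "pending" for state, _ in states):
        return "pending"
    return "complete"  # every task is dead and none of them failed


def report_changes(snapshots, report):
    """Call report(status) whenever the derived status changes."""
    last = None
    for task_states in snapshots:
        status = derive_allocation_status(task_states)
        if status != last:
            report(status)
            last = status


# Three successive observations of one allocation with a single task.
observations = [
    {"web": ("pending", False)},
    {"web": ("running", False)},
    {"web": ("dead", True)},
]
report_changes(observations, print)
```

In this sketch the agent would pass a function that sends the new status to the servers instead of print.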

Can I run a Mesos/Marathon application on a specific host?

I want to use Marathon for cluster monitoring and management. Is the scenario below possible?
My scenario
Five Cassandra nodes are already deployed and running.
The Cassandra hosts are physical machines.
I want to run a script that verifies the health of Cassandra on each host, e.g. the Cassandra process, disk usage, number of files, and so on.
If a problem is found on a host, a correcting script is then run on that host. The script is launched manually.
Each script can be run as a Marathon application, but I couldn't find a way to run an application on the specific host that has the error.
There is no restriction on adding machines or installing Mesos components.
And if you know a more suitable tool, please recommend it!
If you are not running Cassandra on Mesos, I think Marathon is not the best choice. From your description, it looks like you need a monitoring tool (e.g., Nagios) rather than service orchestration.
Please extend your question with more information. It's not clear what you are asking.
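
For the narrower question of pinning an app to one host, Marathon app definitions accept placement constraints, and a hostname constraint with the CLUSTER operator ties an app to a single node. The sketch below only builds such a definition; the app id, script path, host name and resource values are hypothetical, and it presupposes that a Mesos agent is running on the target host.

```python
import json


def pinned_app(app_id, cmd, hostname):
    """Build a Marathon app definition constrained to a single host."""
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": 0.1,   # illustrative resource values
        "mem": 32,
        "instances": 1,
        # Only offers coming from this hostname will be accepted.
        "constraints": [["hostname", "CLUSTER", hostname]],
    }


# Hypothetical names, for illustration only.
app = pinned_app(
    "/cassandra-repair",
    "/opt/scripts/repair.sh",
    "cassandra-03.example.com",
)
print(json.dumps(app, indent=2))
```

The resulting JSON would be submitted to Marathon's /v2/apps endpoint. Keep in mind that Marathon treats apps as long-running services, so a one-off correcting script may be restarted after it exits.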

Mesos task history after restart

I am using Mesos for container orchestration and get the task history from Mesos using the /tasks endpoint.
Mesos is running in a 7-node cluster and ZooKeeper is running in a 3-node cluster. I assumed that Mesos uses ZooKeeper to store the task history, but we sometimes lose the history when we restart Mesos. Is it stored in memory? I am trying to understand what is happening here.
My questions are,
Where does it store task histories?
How can we configure the task history cleanup policy?
Why do we lose complete task history on restarting Mesos?
To answer your questions:
Task history and state for Mesos are stored in memory and in the replicated_log (details here). The default is to use the replicated_log. To store state entirely in memory, without the replicated_log, set --registry=in_memory in your Mesos flags, as described on the configuration page.
Task history cleanup is typically configured with these three flags (there are more, but these are the most common): --max_completed_frameworks=VALUE, --max_completed_tasks_per_framework=VALUE, and --max_unreachable_tasks_per_framework=VALUE, as described on the same configuration page.
Yes, task history for the /tasks endpoint is lost every time a Mesos Master is restarted. However, the /state endpoint will still contain all task status changes over time.
**Edited to reflect information about the /tasks endpoint, not the /state endpoint.
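
Since the /tasks history does not survive a master restart, one workaround is to export it periodically. The sketch below is a minimal example of that idea, assuming the master is reachable at a hypothetical address and that /tasks accepts the limit, offset and order query parameters; the helper names and the sample payload are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def tasks_url(master, limit=100, offset=0, order="desc"):
    """Build the URL for one page of the master's /tasks endpoint."""
    query = urlencode({"limit": limit, "offset": offset, "order": order})
    return "http://%s/tasks?%s" % (master, query)


def fetch_tasks(master, **kwargs):
    """Fetch one page of tasks from the master (performs a network call)."""
    with urlopen(tasks_url(master, **kwargs)) as response:
        return json.load(response)


def count_by_state(payload):
    """Count tasks per state in a /tasks response body."""
    counts = {}
    for task in payload.get("tasks", []):
        counts[task["state"]] = counts.get(task["state"], 0) + 1
    return counts


# Hypothetical response body, used here instead of a live master.
sample = {
    "tasks": [
        {"id": "t1", "state": "TASK_FINISHED"},
        {"id": "t2", "state": "TASK_FAILED"},
        {"id": "t3", "state": "TASK_FINISHED"},
    ]
}
print(tasks_url("master.example.com:5050", limit=10))
print(count_by_state(sample))
```

Running fetch_tasks on a schedule and writing the result to durable storage keeps a copy of the history that is independent of the master's memory.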

Event-hook upon up/down-scaling or deletion of an App

I couldn't find out from the Marathon REST API docs at https://mesosphere.github.io/marathon/docs/rest-api.html whether it is possible to define something like an event hook upon up/down-scaling or deletion of an app.
What I'd like to achieve is to back up some data from a running Docker container before it is destroyed. For example, I run a cluster of Elasticsearch nodes on Marathon, and I would like to delay the deletion of the app until the "Create snapshot to external disk resource" process that it triggers has finished.
Is there currently something I could use?
Marathon provides an Event Bus covering some phases of the lifecycle. Beyond that, currently the only other option I see is to go for Mesos Modules/Hooks.
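
To give an idea of what consuming the Event Bus involves, here is a minimal sketch that parses a Server-Sent Events stream such as the one Marathon exposes at /v2/events and picks out task status updates. The sample lines, the field values and the helper names are made up for illustration, and the exact event types available depend on your Marathon version.

```python
import json


def parse_sse(lines):
    """Yield (event_type, data_dict) pairs from Server-Sent Events lines."""
    event_type, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and data:
            # A blank line terminates one event.
            yield event_type, json.loads("\n".join(data))
            event_type, data = None, []


def killed_tasks(events, app_id):
    """Return task ids of the given app that were reported as killed."""
    return [
        payload["taskId"]
        for event_type, payload in events
        if event_type == "status_update_event"
        and payload.get("appId") == app_id
        and payload.get("taskStatus") == "TASK_KILLED"
    ]


# Hypothetical stream contents, for illustration only.
sample_lines = [
    "event: status_update_event\n",
    'data: {"appId": "/es", "taskId": "es.1", "taskStatus": "TASK_KILLED"}\n',
    "\n",
    "event: status_update_event\n",
    'data: {"appId": "/es", "taskId": "es.2", "taskStatus": "TASK_RUNNING"}\n',
    "\n",
]
print(killed_tasks(parse_sse(sample_lines), "/es"))
```

Note that these events are notifications, so on their own they let you react to a scale-down or deletion but not delay it.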

Spark on EC2: is my application running on just the master or on the whole cluster?

I am using Spark 1.1.1. I followed the instructions given at https://spark.apache.org/docs/1.1.1/ec2-scripts.html and have a cluster of 1 master node and 1 worker running on EC2.
I have built a jar of the application and rsynced it to the slaves. When I run the application using spark-submit with a deploy-mode of client, the application works. However, when I use deploy-mode cluster, it gives me an error saying it cannot find the jar on the worker. The jar's permissions are 755 on both the master and the worker.
I am not sure whether the application uses the worker when I run it with deploy-mode=client. I don't think it does, since the worker URL does not show any completed jobs, although it does show failed jobs when I use deploy-mode=cluster.
Am I doing something wrong? Thank you for your help.
You can check if executors are assigned to the application on the /executors page on port 4040 (e.g. http://localhost:4040/executors/). If you only see <driver> then you are not using the worker. If you see one line for <driver> and one other line (with ID 0, unless it has restarted), then the worker is also providing an executor to your application. Here you can also see how many tasks it has completed for your application, and other stats.
