mesos agent keeps relaunching failed task infinitely - mesos

The scheduler launches a task, the Mesos master (Apache Mesos 1.0.0) accepts it and hands the task to a Mesos agent. After that, the agent keeps relaunching the task indefinitely whenever it fails. Is there a parameter to control retries for a failed task?

Which scheduler are you using? (Restarting tasks is a scheduler's responsibility.)
Could it be that you are using Marathon? Marathon is designed for long-running tasks, i.e., it will restart (failed) tasks.

Related

How can I stop Apache Storm Nimbus, UI and Supervisor?

I run Apache Storm in a cluster and I was looking for ways to stop and/or restart Nimbus, Supervisor and UI. Would writing a service help? What should I write in this service file and where should I place it? Thank you in advance.
Yes, writing a service is the recommended way to run Storm. The commands you want to run are storm nimbus to start Nimbus (minimum 1 per cluster), storm supervisor to run the supervisor (1 per worker machine), storm ui (1 per cluster), and storm logviewer (1 per worker machine). There are other commands you can run as well; you can find them by simply running storm, which will print a list.
Regarding how to write the service, take a look at the upstart cookbook http://upstart.ubuntu.com/cookbook/.
There's an example script here you can probably use to get started https://unix.stackexchange.com/a/84289
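As a rough sketch of that approach, here is what a minimal upstart job for Nimbus might look like; the file name, the /opt/storm install directory, and the dedicated storm user are assumptions, so adjust them to your layout (the cookbook linked above covers the details):
# write a minimal upstart job for Nimbus (paths and user are assumptions)
sudo tee /etc/init/storm-nimbus.conf > /dev/null <<'EOF'
description "Storm Nimbus"
start on runlevel [2345]
stop on runlevel [016]
respawn
setuid storm
chdir /opt/storm
exec bin/storm nimbus
EOF
# start it once the job file is in place
sudo initctl start storm-nimbus
Analogous jobs can be written for storm supervisor, storm ui, and storm logviewer on the appropriate machines.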
You can set them up as services so they start when the node boots, and the same services can be used to stop them:
/etc/rc.d/SERVICE start (or stop, or restart)
You can also stop them by hand: use "ps aux | grep nimbus" (or supervisor, etc.) to find the process ID, then stop it with the "kill" command.
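A minimal sketch of that manual approach, assuming Nimbus shows up in the process list with a command line containing daemon.nimbus (verify with the ps command above first):
# send SIGTERM to the Nimbus JVM; the 'daemon.nimbus' pattern is an assumption,
# so check the ps output on your cluster before relying on it
kill $(pgrep -f 'daemon.nimbus')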

Get list of executed jobs on Hadoop cluster after cluster reboot

I have a Hadoop cluster running version 2.7.4. For some reason I have to restart the cluster, and I need the job IDs of the jobs that were executed on the cluster before the reboot. The command mapred -list only shows details of currently running or waiting jobs.
You can see a list of all jobs on the Yarn Resource Manager Web UI.
In your browser, go to http://ResourceManagerIPAddress:8088/
This is how the history looks on the YARN cluster I am currently testing on (and I have restarted the services several times).
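The same list can also be pulled from the command line; a sketch, with the ResourceManager address as a placeholder:
# list all applications the ResourceManager knows about, including finished ones
yarn application -list -appStates ALL
# the equivalent data from the ResourceManager REST API
curl http://ResourceManagerIPAddress:8088/ws/v1/cluster/apps
YARN reports application IDs (application_<timestamp>_<seq>); for MapReduce jobs these correspond one-to-one to the job IDs (job_<timestamp>_<seq>).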

how to reduce task kill period time when task state is TASK_LOST?

I have been working with Marathon, Mesos, and Docker quite happily, but I recently discovered a problem. When a mesos-slave hits an exception, the task state in Marathon changes to TASK_LOST, and the task cannot be killed until roughly 15 minutes have passed.
I tested this by manually rebooting the operating system that runs the mesos-slave service, Docker, and the task. The task state shown in the Marathon UI then became "Unscheduled (100%)", and the task could not be killed, either automatically or manually, until about 15 minutes had passed.
My question is: how can I reduce this time?
I tried adding these Marathon startup command-line arguments:
task_launch_confirm_timeout=30000
scale_apps_interval = 30000
task_lost_expunge_initial_delay = 30000
task_launch_timeout = 30000
and this mesos-slave startup command-line argument:
recovery_timeout=1mins
but it doesn't work for me.
To change the time after which executors commit suicide when the Mesos agent process fails, you should configure --recovery_timeout on the agent:
Amount of time allotted for the agent to recover. If the agent takes longer than recovery_timeout to recover, any executors that are waiting to reconnect to the agent will self-terminate. (default: 15mins)
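A sketch of an agent started with a shorter recovery window (the master address and work directory are placeholders, and 5mins is only an illustrative value):
# shorten the window in which executors wait for a recovering agent
mesos-slave --master=zk://zk-host:2181/mesos --work_dir=/var/lib/mesos --recovery_timeout=5mins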

Spark jobserver is not finishing YARN processes

I have configured Spark jobserver to run on YARN.
I am able to submit Spark jobs to YARN, but even after the job finishes it does not quit on YARN.
For example:
I tried to create a simple Spark context.
The context shows up in jobserver, but YARN keeps the process running and does not quit; I have to kill the tasks manually.
(screenshot: YARN job)
(screenshot: Spark context)
Jobserver shows the contexts, but as soon as I try to run any task in one, jobserver gives me an error:
{
"status": "ERROR",
"result": "context test-context2 not found"
}
My Spark UI is also not very helpful
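For reference, this is roughly how the context was created and checked through the spark-jobserver REST API; the host, port, and resource parameters are assumptions based on a default setup:
# create the context (resource settings are illustrative)
curl -d "" 'http://jobserver-host:8090/contexts/test-context2?num-cpu-cores=1&memory-per-node=512m'
# list the contexts the jobserver currently knows about
curl 'http://jobserver-host:8090/contexts'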

Restart task tracker and job tracker services in CDH4

How do I restart the task trackers and job tracker using CDH4 from the command line?
I tried the following script but got an error:
[root@adc1210765 bin]# ./stop-mapred.sh
/usr/lib/hadoop-0.20-mapreduce/bin/../conf no jobtracker to stop
cat: /usr/lib/hadoop-0.20-mapreduce/bin/../conf/slaves: No such file or directory
I want to restart all instances of the TaskTracker running on my cluster nodes.
You must do this on each of the task trackers:
sudo service hadoop-0.20-mapreduce-tasktracker restart
And on the job tracker:
sudo service hadoop-0.20-mapreduce-jobtracker restart
You can also use stop and start in place of restart. You might have to change the Hadoop version number.
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_11_3.html?scroll=topic_11_3
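To run the restart across every worker in one go, a rough sketch is to loop over a host list via ssh; the tasktracker-hosts file, passwordless ssh, and passwordless sudo are assumptions:
# restart the TaskTracker on every host listed (one per line) in ./tasktracker-hosts
while read host; do
  ssh "$host" 'sudo service hadoop-0.20-mapreduce-tasktracker restart'
done < tasktracker-hosts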
You can also try starting the JobTracker daemon with
/etc/init.d/hadoop-0.20-mapreduce-jobtracker start
and the TaskTracker with
/etc/init.d/hadoop-0.20-mapreduce-tasktracker start
Ensure the version number is appropriate.
