My Jenkins runs on a Linux machine with one node that has 6 build agents.
I have two pipelines, Master and Slave. Master calls Slave several times (67).
Master code
list.each { element ->
    // Trigger the Slave pipeline once per list element
    build job: "Slave",
        parameters: [
            [$class: 'StringParameterValue', name: 'PARAM', value: element]
        ]
}
The issue is that after 5 Slave builds my node runs out of build agents, even though the 5 previous Slave builds ended successfully. The 6th one stays pending with the message "Waiting for next available executor on {Node Name}". When I kill the Master pipeline, all of the node's agents become available again, as if the build agents were released not when each Slave build ends but only when the Master build ends.
Related
I am installing Ambari (Hortonworks) on CentOS 6. I have a local repo set up. Two data nodes and one master/name node are running, and the Ambari server runs on the master. My issue is that the installation fails for random reasons at the step "Install, Start and Test". One or more nodes randomly fail for reasons such as a failure to install a service (hdp-select, Hive, Hadoop), and when I hit retry a different combination of nodes fails. Am I missing something?
I have been working with Marathon, Mesos and Docker without trouble, but I recently discovered a problem. When mesos-slave encounters an exception, the state of the task in Marathon changes to TASK_LOST, and the task cannot be killed until about 15 minutes have passed.
I tested this by manually rebooting the operating system that runs the mesos-slave service, Docker and the task. The task state shown in the Marathon UI became "Unscheduled(100%)", and the task could not be killed, either automatically or manually, until about 15 minutes had passed.
My question is how to reduce this time?
I tried adding these Marathon startup command-line arguments:
task_launch_confirm_timeout=30000
scale_apps_interval = 30000
task_lost_expunge_initial_delay = 30000
task_launch_timeout = 30000
and this mesos-slave startup command-line argument:
recovery_timeout=1mins
but that did not work for me.
To change how long executors wait before they self-terminate after the Mesos agent process has failed, configure --recovery_timeout on the agent. The Mesos documentation describes the flag as follows:
Amount of time allotted for the agent to recover. If the agent takes longer than recovery_timeout to recover, any executors that are waiting to reconnect to the agent will self-terminate. (default: 15mins)
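As a rough illustration of the flag syntax, the sketch below assembles an agent command line in Python. The ZooKeeper address and the work directory are hypothetical placeholders, and the helper name agent_command is invented for this example. Only --recovery_timeout and its 15mins default come from the description above; --master and --work_dir are the usual agent flags, included so the command is complete.

```python
import shlex


def agent_command(master, work_dir, recovery_timeout="15mins"):
    """Build a mesos-slave argument list with an explicit recovery timeout."""
    return [
        "mesos-slave",
        f"--master={master}",      # hypothetical master address
        f"--work_dir={work_dir}",  # hypothetical work directory
        # Mesos durations take a unit suffix such as secs, mins or hrs.
        f"--recovery_timeout={recovery_timeout}",
    ]


# Shorten the recovery window from the 15mins default to one minute.
cmd = agent_command("zk://zk1:2181/mesos", "/var/lib/mesos", "1mins")
print(shlex.join(cmd))
```

The point of the sketch is that the value is passed to the agent process as a flag with leading dashes and a unit suffix. Executors that are still waiting to reconnect self-terminate once this window has passed.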
A scheduler launches a task, and the Mesos master (Apache Mesos 1.0.0) accepts it and hands it to a Mesos agent. After that, the agent keeps rerunning the task indefinitely when it fails. Is there a parameter to control retries for a failed task?
Which scheduler are you using? (Restarting tasks is the scheduler's responsibility.)
Could it be that you are using Marathon? Marathon is designed for long-running tasks, i.e., it will restart failed tasks.
I've created an environment with one Mesos master and three Mesos slaves. Marathon is running as a framework for Mesos jobs. I am trying to deploy a simple job:
{
"id": "basic-0",
"cmd": "while [ true ] ; do echo 'Hello Marathon' ; sleep 5 ; done",
"cpus": 0.1,
"mem": 10.0,
"instances": 1
}
But this hangs in the Marathon web UI for a long time. I have also tried creating a Marathon job manually, but that one stays in the deployment state forever as well. I have no idea why it is not running. Any ideas?
Please check your Marathon logs; they should state the reason why the app is still waiting. It can be caused by several issues, such as slaves deregistering frequently, insufficient resources, or unmet constraint requirements.
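One of the causes listed here, insufficient resources, can be checked by hand against what each agent currently has free. The sketch below is only an illustration and not Marathon's actual matching logic: the offers list is made-up data, the agent names are placeholders, and the function name can_place is hypothetical. The cpus and mem requirements are the ones from the basic-0 definition in the question.

```python
def can_place(app, offer):
    """Return True if a single offer covers the app's cpus and mem per instance."""
    return offer["cpus"] >= app["cpus"] and offer["mem"] >= app["mem"]


# Requirements taken from the basic-0 app definition.
app = {"id": "basic-0", "cpus": 0.1, "mem": 10.0, "instances": 1}

# Hypothetical free resources per agent (made-up numbers for illustration).
offers = [
    {"agent": "slave-1", "cpus": 0.05, "mem": 512.0},  # too little CPU
    {"agent": "slave-2", "cpus": 2.0, "mem": 8.0},     # too little memory
    {"agent": "slave-3", "cpus": 1.0, "mem": 256.0},   # enough of both
]

# Collect the agents whose free resources satisfy the app.
usable = [o["agent"] for o in offers if can_place(app, o)]
print(usable or "no offer fits; the app would stay waiting")
```

If no agent passes a check like this, the app has nowhere to run, which would match the symptom of a deployment that never finishes. The logs remain the authoritative place to confirm the real reason.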
I am running a YARN MapReduce job with 1 node.
But my job is stuck in the ACCEPTED state at 0% completed. I checked with the jps command on my slave, and there is no MR App Master or YARN Child to run the job. All daemons on my slave, such as the DataNode and NodeManager, are running normally. I do not think the configuration on my master node is wrong, because I tried it before with a different slave and it worked.
How can I fix it? Thanks.