How to remove/run a pending job in Nomad? - nomad

There are some pending jobs in "$ nomad status" output. Is there a way to run a pending job?
$ nomad status
ID Type Priority Status Submit Date
5e74587392a49e5dca9c9c6d-0-build-1004018-z100-solid-octo-potato-14 batch 50 pending (stopped) 2020-03-20T14:45:24+09:00
5e74678bdb1df409005677d6-0-build-1004018-z100-solid-octo-potato-17 batch 50 pending 2020-03-20T15:49:48+09:00
5e746884db1df409005677dc-0-build-1004018-z100-solid-octo-potato-19 batch 50 pending 2020-03-20T15:53:56+09:00
5e746a02db1df409005677e3-0-build-1004018-z100-solid-octo-potato-20 batch 50 pending 2020-03-20T16:00:19+09:00
Best regards,

Pending means the job is about to run but something is preventing it from being placed.
Try this:
Check the status of the job, e.g. nomad job status 5e74587392a49e5dca9c9c6d-0-build-1004018-z100-solid-octo-potato-14, to see whether there are any messages that explain what is happening.
If the previous step doesn't help: Nomad creates allocations (an allocation is the set of tasks of a job that should run on a particular node). Their IDs are visible in the nomad job status output. You can check an allocation's status with nomad alloc status $ID, where $ID is the allocation's ID.
As for removing jobs, you can run nomad job stop -purge 5e74587392a49e5dca9c9c6d-0-build-1004018-z100-solid-octo-potato-14 to remove the job from the job list.
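If you prefer to script these checks, the same information is available over Nomad's HTTP API. A rough sketch in Python using the requests library (it assumes the default agent address http://127.0.0.1:4646 and no ACL token):

import requests

NOMAD = "http://127.0.0.1:4646"  # assumption: default agent address, no ACL token
JOB_ID = "5e74587392a49e5dca9c9c6d-0-build-1004018-z100-solid-octo-potato-14"

# Roughly `nomad job status <job>`: overall status plus the status description.
job = requests.get(f"{NOMAD}/v1/job/{JOB_ID}").json()
print(job["Status"], job.get("StatusDescription"))

# The job's allocations, whose IDs also appear in `nomad job status` output.
for alloc in requests.get(f"{NOMAD}/v1/job/{JOB_ID}/allocations").json():
    print(alloc["ID"], alloc["ClientStatus"])

# Roughly `nomad job stop -purge <job>`: stop the job and drop it from the job list.
requests.delete(f"{NOMAD}/v1/job/{JOB_ID}", params={"purge": "true"})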

Related

Job with multiple tasks on different servers

I need to have a job with multiple tasks, run on different machines one after another (not simultaneously). While the current job is running, another job of the same kind can arrive in the queue, but it should not be started until the previous one has finished. So I came up with this 'solution', which might not be the best but it gets the job done :). I just have one problem.
I figured out I would need a JobQueue (either MongoDb or Redis) with the following structure:
{
  hostname: 'host where to execute the task',
  running: false,
  task: 'current task number',
  tasks: [
    { task_id: 1, commands: 'run these commands', hostname: 'aaa' },
    { task_id: 2, commands: 'another command', hostname: 'bbb' }
  ]
}
Hosts:
search for jobs with the same hostname and running == false
execute the task that is set in that job
upon finishing, the host sets running = false, checks whether there are any other tasks to perform, increments the task number, and sets the hostname to the machine for the next task
Because jobs can accumulate, imagine a situation where jobs are queued for one host like this: A, B, A
Since I have to run all the jobs for the specified machine, how do I avoid starting the third job (A) while the first A is still running?
{
  _id: ObjectId("xxxx"), // unique, generated by MongoDB, indexed, sortable
  hostname: 'host where to execute the task',
  running: false,
  task: 'current task number',
  tasks: [
    { task_id: 1, commands: 'run these commands', hostname: 'aaa' },
    { task_id: 2, commands: 'another command', hostname: 'bbb' }
  ]
}
The question is how the next available "worker" would know whether it's safe to start the next job on a particular host.
You probably need some sort of sortable (indexed) field to indicate the arrival order of the jobs. If you are using MongoDB, you can let it generate _id, which is already unique, indexed and in time-order, since its first four bytes are a timestamp.
You can now query to see if there is a job to run for a particular host like so:
// mongo shell syntax; myHostName is this worker's host name
var jobToRun = db.queue.find({hostname: myHostName}).sort({_id: 1}).limit(1).toArray()[0];
if (jobToRun && jobToRun.running === false) {
  var myJob = db.queue.findAndModify({query: {_id: jobToRun._id, running: false},
                                      update: {$set: {running: true}}, new: true});
  if (myJob === null) print("Someone else already grabbed it");
  else {
    /* now we know that we updated this and we can run it */
  }
} else { /* the oldest job is already running (or there is none): sleep and try again */ }
What this does is check for the oldest/earliest job for a specific host. It then looks to see whether that job is running. If it is, do nothing (sleep and try again). Otherwise try to "lock" it by doing findAndModify on its _id with running still false, setting running to true. If that document is returned, it means this process succeeded with the update and can now start the work. Since two workers can be trying to do this at the same time, getting back null means the document was already changed to running by another worker, so we wait and start again.
I would advise storing a timestamp somewhere to indicate when a job started "running", so that if a worker dies without completing a task it can be "found" - otherwise it will be "blocking" all the jobs behind it for the same host.
What I described works for a queue where you would remove the job when it is finished rather than setting running back to false - if you set running to false so that other "tasks" can be done, then you will probably also be updating the tasks array to indicate what's been done.
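For illustration, here is roughly the same claim-and-lock flow, plus the suggested "started at" timestamp, in Python with pymongo (a sketch only; the database/collection names, the started_at field and the 30-minute staleness threshold are assumptions):

from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

queue = MongoClient()["jobdb"]["queue"]   # assumed database/collection names
my_hostname = "aaa"                       # this worker's host

# Look at the oldest job for this host (ObjectId _id is time-ordered).
job_to_run = queue.find_one({"hostname": my_hostname}, sort=[("_id", ASCENDING)])

if job_to_run is not None and not job_to_run["running"]:
    # Try to lock it atomically; only one worker can flip running to True.
    my_job = queue.find_one_and_update(
        {"_id": job_to_run["_id"], "running": False},
        {"$set": {"running": True, "started_at": datetime.utcnow()}},
    )
    if my_job is None:
        pass  # another worker grabbed it first; sleep and try again
    else:
        pass  # locked: safe to run the current task of this job
# else: the oldest job for this host is still running (or there is none); sleep and retry

# Recovery pass: unlock jobs whose worker appears to have died mid-task.
stale = datetime.utcnow() - timedelta(minutes=30)   # assumed threshold
queue.update_many(
    {"running": True, "started_at": {"$lt": stale}},
    {"$set": {"running": False}},
)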

Understanding delayed_job status

I've implemented long-running tasks in my Rails app using delayed_job along with delayed_job_web. My delayed_job configuration instructs jobs to be attempted once, and for failures to be retained:
config/initializers/delayed_job.rb:
Delayed::Worker.max_attempts = 1
Delayed::Worker.destroy_failed_jobs = false
I tried 2 test jobs that automatically raised errors, in order to see how failures behave. What I get is the following:
My expectation was that Failed jobs would have a count of 2, but that Enqueued / Working / Pending would all be 0. I can't find any documentation on what determines whether a job is Enqueued / Working / Pending, or even what the difference between Working and Pending is (the web interface describes both lists as "contains jobs currently being processed".)
Can anyone provide some clarity?
If you check https://github.com/ejschmitt/delayed_job_web/blob/master/lib/delayed_job_web/application/app.rb , you see the following (starting line 114):
when :working
'locked_at is not null'
when :failed
'last_error is not null'
when :pending
'attempts = 0'
end
Enqueued would be the total number of delayed jobs, i.e. Delayed::Job.count.
Working jobs are those that have been locked by a delayed_job worker and are currently being worked on.
Failed are those that have a non-null last_error.
Pending are those jobs that have never been attempted (attempts = 0).
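To see how those four buckets relate, here is a small self-contained illustration in Python, using an in-memory sqlite3 table as a stand-in for the delayed_jobs table (only the columns the filters above look at are included; the inserted rows mimic two jobs that each failed once and were retained):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE delayed_jobs (
    id INTEGER PRIMARY KEY,
    attempts INTEGER DEFAULT 0,
    locked_at TEXT,
    last_error TEXT
)""")

# Two jobs that were attempted once, failed, and were kept (destroy_failed_jobs = false).
db.executemany(
    "INSERT INTO delayed_jobs (attempts, locked_at, last_error) VALUES (?, ?, ?)",
    [(1, None, "RuntimeError"), (1, None, "RuntimeError")],
)

filters = {
    "enqueued": "1 = 1",                   # Delayed::Job.count, i.e. every row
    "working":  "locked_at IS NOT NULL",
    "failed":   "last_error IS NOT NULL",
    "pending":  "attempts = 0",
}
for name, cond in filters.items():
    count = db.execute("SELECT COUNT(*) FROM delayed_jobs WHERE " + cond).fetchone()[0]
    print(name, count)
# prints: enqueued 2, working 0, failed 2, pending 0

That is consistent with the question's setup: with max_attempts = 1 and failures retained, failed jobs still count toward Enqueued but not toward Working or Pending.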

What can cause hadoop to kill a reducer task and retry it

My hadoop job has a very high ‘Killed Task Attempts’ count on its reducer tasks. I checked the status of a killed task:
Request received to kill task 'attempt_201308122006_41526_r_000030_1' by user
-------
Task has been KILLED_UNCLEAN by the user
and there are no stdout or stderr logs.
What could cause this, and how can I solve it?
If you have speculative execution turned on, then you will potentially see a number of map / reduce tasks that get 'killed'. This is due to Hadoop running long-running tasks on more than a single task tracker; the first one to complete 'wins' while the others are killed off.
In general I would only worry about the task attempts that 'failed' in the job tracker.
Try turning speculative execution off:
mapred.map.tasks.speculative.execution = false
mapred.reduce.tasks.speculative.execution = false
If it's not speculative execution, it could be that the Fair Scheduler kicked in, claiming task trackers for a pool with minMaps and minReduces set.

Circular queue on beanstalkd

I'm using beanstalkc, a Python wrapper for the beanstalkd work queue.
What I'd like to do is have the producer put some jobs (e.g. 'a', 'b', 'c', 'd') once, and have the consumer get the jobs continually (e.g. 'a', 'b', 'c', 'd', 'a', 'b', ...).
In the consumer I get the jobs with job.reserve(). I thought the solution was just reserving the jobs without deleting them, but after I ran some consumer processes I got a TIMEOUT ERROR.
I'm clearly doing something wrong but I couldn't find a way to "re-queue" the jobs the consumers use.
I think this could be a solution:
Producer:
import beanstalkc
queue = beanstalkc.Connection(host='localhost', port=11300)
queue.put('a', priority=0)
Consumer:
job = queue.reserve()
# ... do something with job.body ...
new_priority = job.stats()['pri'] + 1  # a larger pri value means lower priority in beanstalkd
job.release(priority=new_priority)
Why not just, when you've completed a particular job, and after you've released it, put another copy of the same job you've just finished back into the queue?
You'd otherwise be trying to get it to do something that it's not designed to do.
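A minimal sketch of that approach with beanstalkc (it assumes beanstalkd on localhost:11300, and that the finished job is deleted rather than released, so only the fresh copy remains in the queue):

import beanstalkc

queue = beanstalkc.Connection(host='localhost', port=11300)

while True:
    job = queue.reserve()   # blocks until a job is ready
    body = job.body
    # ... do something with body ...
    queue.put(body)         # re-queue a fresh copy of the same job
    job.delete()            # remove the finished job so it is not reserved again

Putting the copy back before deleting the reserved job means the job is never lost if the consumer dies in between.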

hadoop streaming jobs fail to report?

All jobs were running successfully using hadoop-streaming, but all of a sudden I started to see errors from one of the worker machines:
Hadoop job_201110302152_0002 failures on master
Attempt Task Machine State Error Logs
attempt_201110302152_0002_m_000037_0 task_201110302152_0002_m_000037 worker2 FAILED
Task attempt_201110302152_0002_m_000037_0 failed to report status for 622 seconds. Killing!
-------
Task attempt_201110302152_0002_m_000037_0 failed to report status for 601 seconds. Killing!
Questions:
- Why is this happening?
- How can I handle such issues?
Thank you
The description of mapred.task.timeout, which defaults to 600000 ms (600 s), says: "The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string."
Increasing the value of mapred.task.timeout might solve the problem, but you need to figure out if more than 600s is actually required for the map task to complete processing the input data or if there is a bug in the code which needs to be debugged.
According to the Hadoop best practices, on average a map task should take a minute or so to process an InputSplit.
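Note that "updates its status string" is something a streaming task can do for itself: Hadoop Streaming interprets stderr lines of the form reporter:status:<message> as status updates and reporter:counter:<group>,<counter>,<amount> as counter increments. A rough sketch of a Python streaming mapper that keeps a genuinely slow task from hitting the timeout this way (the per-record work and the reporting interval are placeholders):

#!/usr/bin/env python
import sys

def process(line):
    # placeholder for the real (slow) per-record work
    return line.strip().upper()

for i, line in enumerate(sys.stdin, 1):
    print(process(line))
    if i % 1000 == 0:
        # Hadoop Streaming parses these stderr lines as status/counter updates,
        # which counts as progress and keeps the task from being timed out.
        sys.stderr.write("reporter:status:processed %d records\n" % i)
        sys.stderr.write("reporter:counter:MyJob,ProcessedRecords,1000\n")
        sys.stderr.flush()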
