I am running a periodic (cron) job through Nomad every 30 seconds. The job does nothing but
echo "some string"
and hence ends immediately.
When I run
nomad status
I also get all the dead jobs, that is, the jobs that have finished executing, which are useless to me. Is there some way to remove the dead jobs?
PS: One obvious solution is to grep out the dead jobs, but is there any solution provided by Nomad itself?
Dead/completed jobs are cleaned up in accordance with the garbage collection interval.
You can force garbage collection using the System API endpoint, which runs the global garbage collector. The curl command would look like:
$ curl -X PUT http://localhost:4646/v1/system/gc
If you wish to lower the GC interval permanently for jobs, you can use the job_gc_threshold configuration parameter within the server config stanza.
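As a minimal sketch, the forced collection above could be wrapped in a small shell helper. The function names are hypothetical. NOMAD_ADDR is the environment variable the Nomad CLI reads for the agent address, and the fallback is the address from the curl example.

```shell
# Hypothetical helper: build the System GC endpoint URL.
# Uses NOMAD_ADDR when set, otherwise the address from the curl example.
nomad_gc_url() {
  addr=${NOMAD_ADDR:-http://localhost:4646}
  printf '%s/v1/system/gc\n' "${addr%/}"   # drop one trailing slash, if any
}

# Hypothetical helper: trigger a global garbage collection pass.
# --fail makes curl exit non-zero when the server answers with an HTTP error.
nomad_force_gc() {
  curl --fail --silent --show-error -X PUT "$(nomad_gc_url)"
}
```

Calling nomad_force_gc after a batch of periodic runs should clear the dead entries sooner than waiting for the interval. The Nomad CLI also has a nomad system gc command that should have the same effect.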
We have an external service which processes a specific task on given data. Because this takes a while per task, and we have ten thousand tasks, we decided to put the processing into jobs and a queue; otherwise we would get a timeout.
Processing all the tasks can take 15 hours.
So we decided to split them into chunks and put the processing of each chunk into a job, so that a job only takes about 1 minute.
Considering that the receiving service has limited resources, it is important to process the jobs one after another, never concurrently.
We put these jobs into a specifically named queue to separate them from other jobs such as sending email.
In the local test environment, it works properly with the sync, database and sqs drivers.
Now I will explain the issue with the live environment:
When I run the jobs in my local test environment with sqs, invoked by php art queue:listen --queue=name of the queue, all jobs are written to the "message available" column and are then, one by one, removed from the "message available" column and added to the "message in flight" column.
The "message in flight" column never has more than one message.
After deploying everything to production the following happens:
The command that adds the jobs to the queue is invoked by a scheduler instead of from the console, as it is in my local environment.
Then all jobs are added to the "message available" column, and immediately dozens of jobs are moved to "messages in flight", until every job from "message available" has been moved there. So it seems that the jobs are not processed one by one but all at once, in a kind of brute force.
The other thing is that only 5 jobs are executed. After that nothing happens: the receiving service gets no requests, the failed_jobs table is empty, and the jobs still remain in "messages in flight".
I have no idea what I am doing wrong!
Is there another way to process thousands of jobs?
I've set the "queue-concurrency" to 1 and all queues are listed below the "queues" section in vapor.yml.
Furthermore, I've set timeout for cli, general and queue to 900 (seconds).
After checking the sqs log file in cloud watch I see that the job has been executed about 4 times.
The time between first job and last job in the log file is about 6 minutes max.
Does anybody have any ideas?
Thank you in advance.
Best
Michael
We have a job MyPrettyJob that is queued through redis from a controller. When we run this job from the command line like so, the job does succeed. When we run the job with little data the queue stays online, but when we run the job with a lot of data the queue crashes with an exit code of 12, which suggests an "Out of Memory" error.
The large job processes about 300,000 items, which mostly depend on each other. For that reason, we cannot really split up this job without causing a severe performance impact. In some extreme cases it could take up to hours instead of the few minutes it currently takes.
For the large job, the queue outputs the following:
$ php artisan queue:work --queue=myqueue
Processing: App\Jobs\MyPrettyJob
Processed: App\Jobs\MyPrettyJob
$ echo $?
12
The queue worker crashes regardless of whether something is queued behind that job. That seems to suggest that the queue crashes during cleanup of the large job, but it does not give any indication of what that cleanup is. The queue worker also crashes regardless of whether any database interactions are done, which rules out anything related to the database.
What is the queue doing in-between jobs? Can I debug in any way why it is getting out of memory after completing the job? Does the queue write something to a log maybe, or is it doing something in redis in between jobs? It seems like a really weird time for that process to crash.
Exit code 12 occurs when the queue worker determines that it has used more memory than it is allowed (see https://github.com/laravel/framework/blob/5.8/src/Illuminate/Queue/Worker.php#L199-L210 for the specific section of code). If you run php artisan queue:work --memory=<megabytes>, where the value is large enough to fully run your job (for example, 1024 for 1 GB), the worker should be able to finish your job and keep running afterwards.
I have a beanstalkd instance with two workers picking jobs from one tube.
I've noticed that occasionally one of the workers will reserve a job that has already been reserved (and is being worked on) by the other worker.
I know there aren't duplicate jobs in the queue.
Why does beanstalkd allow the same job to be reserved twice?
It sounds to me like you haven't implemented the protocol properly. You need to handle DEADLINE_SOON and send TOUCH.
What does DEADLINE_SOON mean?
DEADLINE_SOON is a response to a reserve command indicating that you have a job reserved whose deadline is real soon (current safety margin is approximately 1 second).
If you are frequently receiving DEADLINE_SOON errors on reserve, you should probably consider increasing the TTR on your jobs as it generally indicates you aren’t completing them in time. It may also be that you are failing to delete tasks when you have completed them.
See the mailing list discussion for more information.
How does TTR work?
TTR only applies to a job at the moment it becomes reserved. At that event, a timer (called “time-left” in the job stats) starts counting down from the job’s TTR.
If the timer reaches zero, the job gets put back in the ready queue.
If the job is buried, deleted, or released before the timer runs out, the timer ceases to exist.
If the job is touched before the timer reaches zero, the timer starts over counting down from TTR.
The "touch" command
Allows a worker to request more time to work on a job.
This is useful for jobs that potentially take a long time, but you still want
the benefits of a TTR pulling a job away from an unresponsive worker. A worker
may periodically tell the server that it's still alive and processing a job
(e.g. it may do this on DEADLINE_SOON). The command postpones the auto
release of a reserved job until TTR seconds from when the command is issued.
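The timer rules above can be illustrated with a toy model in shell. This is not beanstalkd code, and the function name and its arguments are made up for illustration. It only answers whether a reserved job would fall back to the ready queue, given its TTR, its run time, and how often the worker sends touch.

```shell
# Toy model, not beanstalkd code: does a reserved job get released back to ready?
#   $1 = TTR in seconds
#   $2 = job run time in seconds
#   $3 = seconds between touch commands (0 means the worker never touches)
job_is_released() {
  ttr=$1 runtime=$2 touch_every=$3
  if [ "$touch_every" -gt 0 ] && [ "$touch_every" -lt "$ttr" ]; then
    return 1   # each touch restarts the countdown before it reaches zero
  fi
  # No touch arrives in time: the job is released once the run time reaches
  # the TTR (the exact boundary is treated as a timeout here).
  [ "$runtime" -ge "$ttr" ]
}
```

With a TTR of 60, a 90-second job that is never touched is released and can be reserved by a second worker. The same job touched every 20 seconds stays reserved.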
The jobs take longer to run than the TTR, so they were being returned to the queue and picked up by the other worker.
I have now set a larger TTR on the jobs.
I have read the documentation so I know the difference.
My question, however, is whether there is any risk in using .submit instead of .waitForCompletion if I want to run several Hadoop jobs on a cluster in parallel.
I mostly use Elastic Map Reduce.
When I tried doing so, I noticed that only the first job was being executed.
If your aim is to run jobs in parallel, then there is certainly no risk in using job.submit(). The main reason job.waitForCompletion exists is that its method call returns only when the job has finished, and it returns with the job's success or failure status, which can be used to decide whether further steps should run.
Now, getting back to your seeing only the first job being executed: this is because by default Hadoop schedules jobs in FIFO order. You can certainly change this behaviour. Read more here.
I would like to run a script when all of the jobs that I have sent to a server are done.
For example, I send
ssh server 'for i in config*; do qsub ./run 1 "$i"; done'
And I get back a list of the jobs that were started. I would like to automatically start another script on the server to process the output from these jobs once all are completed.
I would appreciate any advice that would help me avoid the following inelegant solution:
If I save each of the 1000 job IDs from the above call in a separate file, I could check the contents of each file against the current list of running jobs, i.e. the output from a call to:
ssh server qstat
I would only need to check every half hour, but I would imagine that there is a better way.
It depends a bit on what job scheduler you are using and what version, but there's another approach that can be taken too if your results-processing can also be done on the same queue as the job.
One very handy way of managing lots of related jobs in more recent versions of torque (and with grid engine, and others) is to launch the individual jobs as a job array (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#-t). This requires mapping the individual runs to numbers somehow, which may or may not be convenient; but if you can do it for your jobs, it greatly simplifies managing them: you can qsub them all in one line, and you can qdel or qhold them all at once (while still being able to deal with jobs individually).
If you do this, then you could submit an analysis job which had a dependency on the array of jobs which would only run once all of the jobs in the array were complete: (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#dependencyExamples). Submitting the job would look like:
qsub -W depend=afterokarray:427[] analyze.sh
where analyze.sh had the script to do the analysis, and 427 would be the job id of the array of jobs you launched. (The [] means only run after all are completed). The syntax differs for other schedulers (eg, SGE/OGE) but the ideas are the same.
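Putting the job array and the dependency together, a rough sketch could look like the following. The wrapper name and its arguments are hypothetical. It assumes that qsub prints the new job id in a form like 427[].server, and that Torque accepts the short 427[] form shown above.

```shell
# Hypothetical wrapper: submit a Torque job array, then an analysis job that
# only runs after every array member has finished successfully.
#   $1 = array range for qsub -t (e.g. 1-1000)
#   $2 = script run by each array member
#   $3 = analysis script
submit_array_then_analyze() {
  range=$1 array_script=$2 analysis_script=$3
  # qsub prints the id of the new array job (assumed form: "427[].server")
  array_id=$(qsub -t "$range" "$array_script") || return 1
  # Keep only the "427[]" part, matching the form used in the example above
  qsub -W "depend=afterokarray:${array_id%%.*}" "$analysis_script"
}
```

A call such as submit_array_then_analyze 1-1000 run_array.sh analyze.sh would replace the for loop over config*. Each array member then has to map its index (Torque exposes it as PBS_ARRAYID) to one of the config files.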
Getting this right can take some doing, and certainly Tristan's approach has the advantage of being simple and working with any scheduler; but if you'll be doing a lot of this, learning to use job arrays may be worth your time.
Something you might consider is having each job script just touch a filename in a dedicated folder like $i.jobdone, and in your master script, you could simply use ls *.jobdone | wc -l to test for the right number of jobs done.
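A minimal sketch of that check, assuming every job script ends with something like touch "$donedir/$i.jobdone". The helper name and the donedir variable are hypothetical. It uses find rather than ls so that an empty folder counts as zero instead of raising an error.

```shell
# Hypothetical helper: succeeds once at least $2 "*.jobdone" marker files
# exist directly inside directory $1.
all_jobs_done() {
  dir=$1 expected=$2
  # tr strips the padding some wc implementations add around the number
  count=$(find "$dir" -maxdepth 1 -type f -name '*.jobdone' | wc -l | tr -d ' ')
  [ "$count" -ge "$expected" ]
}

# Example polling loop for the master script (check every half hour):
#   until all_jobs_done "$donedir" 1000; do sleep 1800; done
#   ./process_output.sh
```

The marker folder should be emptied before each batch is submitted, otherwise leftovers from an earlier run are counted as well.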
You can use wait to stop execution until all your jobs are done. You can even collect all the exit statuses and other running statistics (time it took, count of jobs done at the time, whatever) if you cycle around waiting for specific ids.
I'd write a small C program to do the waiting and collecting (if you have permissions to upload and run executables), but you can easily use the bash wait built-in for roughly the same purpose, albeit with less flexibility.
Edit: small example.
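#!/bin/bash
# ...
waitfor=''
for i in tasks; do
    task &                   # start each task in the background
    waitfor="$waitfor $!"    # $! is the PID of the most recent background job
done
wait $waitfor                # unquoted on purpose: each PID is a separate argument
# ...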
If you run this script in the background, it won't bother you, and whatever comes after the wait line will run when your jobs are over.
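To collect the exit statuses as mentioned earlier, the same idea can be extended to wait for each PID separately. This is only a sketch; the function name and the failed variable are made up for illustration.

```shell
# Hypothetical helper: run each argument as a background command, wait for
# every PID individually, and count how many exited non-zero.
# Leaves the count in the global variable "failed".
run_all_and_wait() {
  pids=''
  failed=0
  for cmd in "$@"; do
    sh -c "$cmd" &
    pids="$pids $!"
  done
  for pid in $pids; do
    # "wait PID" returns the exit status of that particular job
    wait "$pid" || failed=$((failed + 1))
  done
  [ "$failed" -eq 0 ]   # succeed only if every task succeeded
}
```

For example, run_all_and_wait './task a' './task b' && ./process_output.sh would start the post-processing only if both tasks exited with status 0.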