Is there a way to check the position of my jobs within an LSF queue?
If I run:
bjobs -u all -q my_queue
I get a list of jobs from all users within my_queue, but is this list sorted by the position of my jobs within the queue?
I believe you're interested in pending jobs.
This will show all pending jobs in a queue.
bjobs -u all -p -q my_queue
The list is sorted by position in the queue.
The position is based on job priority and submit time.
Keep in mind that what is displayed is not a perfect indication of when a job will run. LSF starts jobs as the required resources become available. A job requiring a lot of resources may be first in the queue, but if those resources are not available, LSF will start other jobs behind it that fit within the resources that are free.
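If your LSF version is recent enough to support the -o output-format option (the field names below are an assumption and may differ between releases), you can display the criteria that determine the ordering alongside each pending job:
bjobs -u all -p -q my_queue -o "jobid user job_priority submit_time stat"
That makes it easier to see why a given pending job sits where it does.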
To check progress on an individual job:
bpeek -f <job_id>
This command shows the output stream that the job redirects its output to, so you can follow the job's progress while it runs.
squeue -l [queue name]
lists the priorities for all users.
Jobs of users with higher priorities are run first.
We have an external service which processes a specific task on given data. Because each task takes a while and we have ten thousand tasks, we decided to put the processing into jobs and a queue; otherwise we would get a timeout.
Processing all the tasks can take about 15 hours.
So we decided to split the tasks into chunks and put the processing of each chunk into a job, so that a single job only takes about 1 minute.
Since the receiving service has limited resources, it is important to process the jobs one after another and not concurrently.
We put these jobs into a specifically named queue to separate them from other jobs such as sending emails.
In the local test environment, it works properly with the sync, database, and SQS drivers.
Now I will explain the issue with the live environment:
When I run the jobs in my local test environment with SQS, invoked by php artisan queue:listen --queue=<name of the queue>, all jobs appear in the "Messages available" column and are moved one by one from "Messages available" to the "Messages in flight" column.
The "Messages in flight" column never contains more than one message.
After deploying everything to production the following happens:
The command that adds the jobs to the queue is invoked by a scheduler, instead of being run manually in the console as on my local environment.
All jobs are added to the "Messages available" column, and immediately dozens of jobs are moved to "Messages in flight", until everything from "Messages available" has been moved over. So it seems the jobs are not processed one by one but all at once, in a kind of brute force.
The other thing is that only 5 jobs are executed. After that nothing happens: the receiving service gets no requests, the failed_jobs table is empty, and the jobs still remain in "Messages in flight".
I have no idea what I am doing wrong!
Is there another way to process thousands of jobs?
I've set queue-concurrency to 1, and all queues are listed under the queues section in vapor.yml.
Furthermore, I've set the timeouts for cli, general and queue to 900 seconds.
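A simplified sketch of the relevant part of my vapor.yml (the key names are my reading of the Vapor docs, and the queue name is a placeholder):
environments:
  production:
    timeout: 900
    cli-timeout: 900
    queue-timeout: 900
    queue-concurrency: 1
    queues:
      - my-task-queue
      - default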
After checking the SQS log in CloudWatch, I see that a job has been executed about 4 times.
The time between the first and the last job in the log file is about 6 minutes at most.
Does anybody have any ideas?
Thank you in advance.
Best
Michael
I've built a system based on Laravel where users are able to begin a "task" which repeats a number of times, with a delay between each repetition. I've accomplished this by queueing a job with an amount argument, which then recursively queues an additional job until the count is up.
For example, I start my task with 3 repetitions:
A job is queued with an amount argument of 3. It is run, and the amount is decremented to 2. The same job is queued again with a delay of 5 seconds.
When the job runs again, the process repeats with an amount of 1.
The last job executes, and now that the amount has reached 0, it is not queued again and the tasks have been completed.
This is working as expected, but I need to know whether a user currently has any tasks being processed. I need to be able to do the following:
Check if a particular queue has any jobs started by a particular user.
Check the value that was set for amount on that job.
I'm using the database driver for a queue named tasks. Is there any existing method to accomplish my goals here?
Thanks!
You shouldn't be using delay to queue multiple repetitions of the same job over and over. That functionality is meant for something like retrying a failed network request. Keeping jobs in the queue for hours at a time can lead to memory issues with your queues if the count gets too high.
I would suggest you use the php artisan schedule:run functionality to run a command every 1-5 minutes that checks the database to see whether it is time to run a user's job. If so, kick off that job and add a status flag to the users table (or whatever table you use to keep track of these things). When the job finishes, mark that same row as completed and wait for the next cron run to do it again.
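For completeness, the scheduler itself is driven by a single cron entry on the server, along these lines (the project path is a placeholder):
* * * * * cd /path-to-your-project && php artisan schedule:run >> /dev/null 2>&1
Inside the scheduler you then register your checking command with a frequency such as everyFiveMinutes().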
I was running a cron (periodic) job through Nomad every 30 seconds. The job does nothing but
echo "some string"
and hence ends immediately.
When I run
nomad status
I also get all the dead jobs, i.e. the jobs that have finished executing, which are useless to me. Is there some way to remove the dead jobs?
PS: one obvious solution is to grep out the dead jobs; is there a solution provided by Nomad itself?
Dead/completed jobs are cleaned up in accordance with the garbage collection interval.
You can force garbage collection using the System API endpoint, which runs the global garbage collector. The curl command would look like:
$ curl -X PUT http://localhost:4646/v1/system/gc
If you wish to lower the GC interval permanently for jobs, you can use the job_gc_threshold configuration parameter within the server config stanza.
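As a sketch (the threshold value here is only an example), the relevant part of the server configuration would look like:
server {
  enabled          = true
  job_gc_threshold = "30m"
}
This threshold controls how long a job must be dead before it becomes eligible for garbage collection.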
My problem:
Each night, my crontab launches several nightly tests on a supercomputer running PBS under CentOS 6.5. When launched, the jobs wait in the queue. When the scheduler allows them to run, my jobs start. It is quite common that the scheduler launches all the jobs at exactly the same time (even though my crontab launched them at separate moments).
I can't modify the main part of the jobs (but I can add things before it). Each job starts with an update of a common SVN repository. But when the jobs start simultaneously, I get an error due to concurrent updates of the same repository. I want to avoid that.
What I expect:
When launched by the scheduler, a job could wait a few seconds before starting. One option would be to wait a random time, but the risk of two jobs drawing the same random delay grows quickly with the number of tests I run in parallel. If I reduce this risk by choosing a large random range, I have to wait too long (locking unused resources on the supercomputer).
I suppose it is possible to store, in a thread-safe manner, the information "I am launching now, the others have to wait for 1 minute" for each job, but I don't know how. What I imagine is a kind of mutex that only introduces a delay rather than blocking until another job ends.
A solution without MPI is preferred.
Of course, I'm open to other solutions. Any help is welcome.
Call your script from a wrapper that attempts to obtain an exclusive lock on a lock file first. For example:
{
    flock -x 200    # block until this process holds an exclusive lock on fd 200
    # your script/code here
} 200> /var/lock/myscript
The name of the lock file doesn't really matter, as long as you have write permission to open it. When this wrapper runs, it will first attempt to get an exclusive lock on /var/lock/myscript. If another script already has the lock, it will block until the lock becomes available.
Note that there are no arbitrary wait times; each script will run as soon as possible, in the order in which they first attempt to obtain the lock. This means you can also start the jobs simultaneously; the operating system will manage the access to the lock and the ordering.
Here is a solution using GNU parallel.
It might seem a bit counter-intuitive to use this tool at first, but if you set the maximum number of concurrent jobs to 1, it can emulate a job queue that runs jobs one after another without any overlap.
You can observe the desired effect with this example:
seq 1 5 | parallel -j1 -k 'echo {}; sleep 1'
-j1 sets the maximum number of jobs running at a time to 1, while -k preserves the order.
To apply this to the original problem, create a file that lists the script files to run, one per line. Then pipe the contents of that file to parallel to run the scripts in sequence and in order:
cat file | parallel -j1 -k bash {}
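For instance, if the scripts happen to live in a jobs/ directory (the path is only an example), the file can be built with:
ls jobs/*.sh > file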
I would like to run a script when all of the jobs that I have sent to a server are done.
for example, I send
ssh server "for i in config*; do qsub ./run 1 $i; done"
And I get back a list of the jobs that were started. I would like to automatically start another script on the server to process the output from these jobs once all are completed.
I would appreciate any advice that would help me avoid the following inelegant solution:
If I save each of the 1000 job IDs from the above call in a separate file, I could check the contents of each file against the current list of running jobs, i.e. the output of a call to:
ssh server qstat
I would only need to check every half hour, but I imagine there is a better way.
It depends a bit on which job scheduler you are using and which version, but there's another approach that can be taken if your results-processing can also be done on the same queue as the jobs.
One very handy way of managing lots of related jobs in more recent versions of Torque (and with Grid Engine, and others) is to launch the individual jobs as a job array (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#-t). This requires mapping the individual runs to numbers somehow, which may or may not be convenient; but if you can do it for your jobs, it greatly simplifies managing them: you can qsub them all in one line, and you can qdel or qhold them all at once (while still having the ability to deal with jobs individually).
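For example (just a sketch; the exact array flag and index variable differ between Torque versions), if the config files can be numbered config1 ... config1000, the whole set can be submitted as a single array job:
qsub -t 1-1000 run_array.sh
where run_array.sh is a small wrapper script (a hypothetical name) that uses the array index to pick its input, e.g. by running ./run 1 config${PBS_ARRAYID} internally.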
If you do this, then you could submit an analysis job with a dependency on the array of jobs, which will only run once all of the jobs in the array are complete (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#dependencyExamples). Submitting the job would look like:
qsub -W depend=afterokarray:427[] analyze.sh
where analyze.sh contains the script to do the analysis, and 427 is the job id of the array of jobs you launched. (The [] means the job runs only after all array members are completed.) The syntax differs for other schedulers (e.g. SGE/OGE), but the ideas are the same.
Getting this right can take some doing, and Tristan's approach certainly has the advantage of being simple and working with any scheduler; but if you'll be doing a lot of this, learning to use job arrays may be worth your time.
Something you might consider is having each job script touch a file in a dedicated folder, e.g. $i.jobdone, when it finishes; in your master script you can then simply use ls *.jobdone | wc -l to test whether the right number of jobs are done.
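A rough sketch of that idea (the paths, the expected count of 1000, and the post-processing script name are all placeholders):
# at the end of each job script:
touch /path/to/done_dir/$i.jobdone

# in the master script: poll until all the marker files are present
while [ "$(ls /path/to/done_dir/*.jobdone 2>/dev/null | wc -l)" -lt 1000 ]; do
    sleep 60
done
./process_output.sh   # hypothetical post-processing step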
You can use wait to stop execution until all your jobs are done. You can even collect all the exit statuses and other running statistics (time taken, count of jobs done at the time, whatever) if you loop around waiting for specific IDs.
I'd write a small C program to do the waiting and collecting (if you have permission to upload and run executables), but you can easily use the bash wait built-in for roughly the same purpose, albeit with less flexibility.
Edit: small example.
#!/bin/bash
...
waitfor=''
for i in "${tasks[@]}"; do      # assuming tasks is an array of the things to run
    task "$i" &                 # start each task in the background
    waitfor="$waitfor $!"       # remember the PID of the background job
done
wait $waitfor                   # block until every listed PID has exited
...
If you run this script in the background, it won't bother you, and whatever comes after the wait line will run once your jobs are over.