Qsub - delaying/staggering job start with a job array - cluster-computing

Is it possible to delay or stagger the start of jobs launched through a job array with qsub, e.g. qsub -t 1-4 launch.pbs?
I could do this by sleeping for a small but random amount of time in my PBS script, but I wonder whether there is a direct way to specify this to the scheduler through qsub.

Yes, it is possible.
From http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html :
-a date_time
Available for qsub and qalter only.
Defines or redefines the time and date at which a job
is eligible for execution. Date_time conforms to
[[CC]YY]MMDDhhmm[.SS], for the details, please see
Date_time in: sge_types(1).
If this option is used with qsub or if a corresponding
value is specified in qmon then a parameter named a and
the value in the format CCYYMMDDhhmm.SS will be passed
to the defined JSV instances (see -jsv option below or
find more information concerning JSV in jsv(1))
You can add this option inside your .pbs.
For example,
#PBS -a 1550
makes the task wait until 15:50; if it is too late to run at 15:50 today, it will run tomorrow. With
#PBS -a 010900
the task will run on the morning of the first day of the next month.
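To stagger the tasks themselves, one option (a sketch, not tested against a live scheduler) is to compute a separate -a time per task and submit each index individually. This assumes GNU date; launch.pbs is the script from the question, and the qsub commands are echoed rather than executed:

```shell
#!/bin/bash
# Stagger 4 tasks by giving each its own -a start time, 2 minutes apart.
# Assumes GNU date; echoes the qsub commands instead of running them.
for t in 1 2 3 4; do
    start=$(date -d "+$((t * 2)) minutes" +%Y%m%d%H%M)  # [[CC]YY]MMDDhhmm
    echo "qsub -a ${start} -t ${t} launch.pbs"
done
```

Note that this trades the single array submission for one qsub per index; if you need a single -t 1-4 submission, the random sleep inside the script remains the simpler approach.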

Related

How to make sbatch wait until last submitted job is *running* when submitting multiple jobs?

I'm running a numerical model whose parameters are in a "parameter.input" file. I use sbatch to submit multiple iterations of the model, changing one parameter in the parameter file each time. Here is the loop I use:
#!/bin/bash -l
for a in {01..30}
do
sed -i "s/control_[0-9][0-9]/control_${a}/g" parameter.input
sbatch --time=21-00:00:00 run_model.sh
sleep 60
done
The sed line changes a parameter in the parameter file. The
run_model.sh file runs the model.
The problem: depending on the resources available, a job might run immediately or stay pending for a few hours. With my default loop, if 60 seconds is not enough time to find resources for job n to run, the parameter file will be modified while job n is pending, meaning job n will run with the wrong parameters. (and I can't wait for job n to complete before submitting job n+1 because each job takes several days to complete)
How can I force sbatch to wait to submit job n+1 until job n is running?
I am not sure how to create an until loop that would grab the status of job n and wait until it changes to 'running' before submitting job n+1. I have experimented with a few things, but the server I use also hosts another 150 people's jobs, and I'm afraid too much experimenting might create some issues...
Use the following to grab the last submitted job's ID and its status, and wait until it isn't pending anymore to start the next job:
sentence=$(sbatch --time=21-00:00:00 run_model.sh) # get the output from sbatch ("Submitted batch job <id>")
stringarray=($sentence) # split the output into words
jobid=${stringarray[3]} # isolate the job ID (the fourth word)
sentence=$(squeue -j "$jobid") # read the job's slurm status
stringarray=($sentence)
jobstatus=${stringarray[12]} # isolate the status of job number jobid
Check that the job status is 'running' before submitting the next job with:
if [ "$jobstatus" = "R" ];then
# insert here relevant code to run next job
fi
You can insert that last snippet in an until loop that checks the job's status every few seconds.
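For illustration, the loop's control flow can be sketched with the scheduler call stubbed out. In real use, get_state would run the squeue snippet above (or simply squeue -h -j "$jobid" -o %t to get the state code directly); the stub below just fakes a job that is pending for two polls, then running:

```shell
#!/bin/bash
# Sketch of the until loop; get_state stubs the scheduler query so the
# control flow can be demonstrated without a live slurm cluster.
polls=0
get_state() {
    polls=$((polls + 1))
    if [ "$polls" -lt 3 ]; then
        jobstatus="PD"   # pending
    else
        jobstatus="R"    # running
    fi
}

jobstatus=""
until [ "$jobstatus" = "R" ]; do
    get_state            # real use: query squeue here
    # sleep 10           # real use: wait between polls
done
echo "job running after $polls polls"
```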

How to dynamically choose PBS queues during job submission

I run a lot of small computing jobs on a remote cluster where job submission is managed by PBS. Normally, in a PBS (bash) script I specify the queue I would like to submit the job to with the directive
#PBS -q <queue_name>
The queue I need to choose depends on the load on each queue. Before submitting a job, I check this with the terminal command
qstat -q
which produces output that looks as follows
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
queue1 -- -- 03:00:00 -- 0 2 -- E R
queue2 -- -- 06:00:00 -- 8 6 -- E R
I would like the job script to select the queue automatically, based on two constraints:
The selected queue must have a walltime longer than the job time specified through the directive #PBS -l walltime=02:30:00.
Among those, the queue must have the fewest jobs in the Que column of the output above.
I'm having trouble identifying which tools I need to use in the terminal to automate the queue selection.
You could wrap your qsub submission in another script that runs qstat -q, parses the output, and selects a queue based on the requested walltime and how many queued jobs each queue has. The script would then submit the job, appending -q <name of desired queue> to the qsub command.
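As a sketch of that wrapper idea, the parsing step could look like this. The here-document stands in for live qstat -q output (replace it with the real command in practice), and the column positions match the sample output above:

```shell
#!/bin/bash
# Pick the queue whose Walltime covers the request and whose Que count
# is smallest. The here-document mimics qstat -q output; swap it for
# the real command in practice.
need_secs=$((2*3600 + 30*60))   # requested walltime: 02:30:00

best=$(awk -v need="$need_secs" '
    # data rows: $1=queue, $4=walltime HH:MM:SS, $7=jobs in Que
    $4 ~ /^[0-9]+:[0-9]+:[0-9]+$/ {
        split($4, t, ":")
        wall = t[1]*3600 + t[2]*60 + t[3]
        if (wall >= need && (min == "" || $7 < min)) { min = $7; q = $1 }
    }
    END { print q }
' <<'EOF'
Queue            Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
queue1           --     --       03:00:00 --   0   2   -- E R
queue2           --     --       06:00:00 --   8   6   -- E R
EOF
)
echo "qsub -q $best job.pbs"   # job.pbs is a placeholder name
```

With the sample data, queue1 satisfies the 02:30:00 requirement and has the fewer queued jobs, so it is selected.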
However, it seems that you are manually trying to do some of what a scheduler - with appropriate policies - does for you. Why do you need to dynamically switch queues? A better setup would be for the queues to essentially categorize the jobs - like you are already doing with walltime - and then allowing the scheduler to run the jobs appropriately. Any setup where a user needs to carefully select the queue seems a little suspect to me.

How can I tell if a PBS script was called by bash or qsub

I have a PBS script that processes several environment variables. PBS is a wrapper for bash that sends the bash script to a job scheduling queue. The processed variables form a command to run a scientific application. A PBS script is written in bash with additional information for the job scheduler encoded in the bash comments.
How can I determine programmatically whether my script was called by qsub, the command that interprets PBS scripts, or whether it was called by bash?
If the script is running under bash I would like to treat the call as a dry run and only print out the command that was generated. In that way it bypasses the job queue entirely.
This may not be completely robust, but one heuristic which may work is to test for the existence of any of the following environment variables, which tend to be defined under qsub, as listed here.
PBS_O_HOST (the name of the host upon which the qsub command is running)
PBS_SERVER (the hostname of the pbs_server which qsub submits the job to)
PBS_O_QUEUE (the name of the original queue to which the job was submitted)
PBS_O_WORKDIR (the absolute path of the current working directory of the qsub command)
PBS_ARRAYID (each member of a job array is assigned a unique identifier)
PBS_ENVIRONMENT (set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job)
PBS_JOBID (the job identifier assigned to the job by the batch system)
PBS_JOBNAME (the job name supplied by the user)
PBS_NODEFILE (the name of the file containing the list of nodes assigned to the job)
PBS_QUEUE (the name of the queue from which the job is executed)
PBS_WALLTIME (the walltime requested by the user or default walltime allotted by the scheduler)
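For instance, a minimal dry-run guard on PBS_ENVIRONMENT (absent when the script is invoked directly with bash) might look like this; my_app and its flags are placeholder names:

```shell
#!/bin/bash
# Dry-run guard: PBS sets PBS_ENVIRONMENT for jobs; plain bash does not.
cmd="./my_app --input data.nc"   # placeholder for the assembled command

if [ -z "${PBS_ENVIRONMENT:-}" ]; then
    mode="dry"
    echo "dry run: $cmd"         # invoked directly: just show the command
else
    mode="batch"
    $cmd                         # running under the scheduler: execute it
fi
```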
You can check the parent caller of bash:
CALLER=$(ps -p "$PPID" -o comm=)
if [[ "$CALLER" == pbs_mom* ]]; then  # the expected process name depends on your PBS implementation
    # script was started by the scheduler, not directly by bash
fi
Extra note: Bash always sets the unexported variable $BASH_VERSION, so if it is set you can be sure the script is running under Bash; the remaining question is just which process called it.
Also, don't run the check inside a subshell (), as $PPID would then likely give you the same shell's process rather than the caller's.
If your script is called through deeper levels of nesting, in which case $PPID is not enough, you can recursively walk the parent PIDs with ps -p <pid> -o ppid=.

Hold a bash script on PBS status without Torque

I have access to a low-priority queue on a large national system. I can have only 1 job in the queue at a time.
The PBS job contains a program that is unlikely to complete before the wall-time ends. Jobs on hold cannot number more than 3.
It means that:
I cannot use -W depend=afterok:$ID_of_previous_job. The script would submit all the jobs at once, but just the first 3 would enter the queue (the last 2 in H state).
I cannot modify the submission script with a last line that submits the next job (it is very likely that the actual program won't finish before the walltime ends, in which case that last line would never execute).
I cannot install any software, so I am limited to a bash script rather than Torque utilities.
I'd rather not use a "time check" script (such as: every 5 minutes, check whether the previous job is over).
Is it possible to use a while and/or sleep?
Option 1
To use a while and sleep requires doing something very similar to a time-check script:
#!/bin/bash
jobid=$(qsub first_job.pbs)  # submit the first job and capture its ID
while [[ -z $(qstat "$jobid" | grep C) ]]; do
    sleep 5
done
# submit the new job once the loop is done, after checking the exit status if desired
Option 2 - may be TORQUE only, not sure:
Perhaps a better way, suggested by Dmitri Chubarov in the comments, would be to use the per-job epilogue option. To do this the compute nodes have to be able to submit jobs, but since you were considering having the final line of the job do it then this seems like a possibility.
Add a per-job epilogue by adding this line to the job script:
#PBS -l epilogue=/path/to/script
And then have the script:
#!/bin/bash
# check the exit code if desired; it is passed to the script as argument 10
# submit the next job
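As a sketch, the epilogue body might look like the following. It is wrapped in a function so the logic can be demonstrated inline; next_job.pbs is a hypothetical script name, and the qsub call is echoed rather than executed:

```shell
#!/bin/bash
# Per-job epilogue sketch: TORQUE passes the job's exit status as the
# 10th argument. next_job.pbs is a placeholder name.
epilogue() {
    exit_status="${10}"
    if [ "$exit_status" -eq 0 ]; then
        echo "qsub next_job.pbs"   # resubmit on success (echoed here)
    else
        echo "not resubmitting: job exited with status $exit_status"
    fi
}

# demonstration call: dummy arguments with exit status 0 in slot 10
epilogue id user group name sid limits used queue acct 0
```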

how to get the job_id in the sun grid system using qsub

Consider a script, "run.sh", to be sent to a cluster job queue via qsub,
qsub ./run.sh
My question is: how do I get the number of the process -- the one that appears as ${PID} in the files *.o${PID} and *.e${PID} -- within the script run.sh?
Does qsub export it? On which name?
Well, apparently the qsub man page does not mention it, but this page says that the variable $JOB_ID is set to the job number I was asking for.
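So inside run.sh (under SGE/Grid Engine) the job number can be read from $JOB_ID; a minimal sketch:

```shell
#!/bin/bash
# Under SGE, $JOB_ID holds the job number used in the .o/.e file names.
jobid="${JOB_ID:-unknown}"   # falls back to "unknown" outside the scheduler
echo "output files for this run end in .o${jobid} and .e${jobid}"
```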
