Is there a way to find out if jobs that I already ran with the SLURM workload manager were on a dependency?
The sacct command has a --state filter with the value "PENDING", which should show whether a job was on hold, but in my case it just prints out all jobs:
sacct -M cm2 --state PENDING -S2020-01-01T11:20:00 -E2021-01-01T11:20:00 --format="JobName, TimeLimit, State, Partition"
cm2 in the above command is the cluster I am working on, and I want to show all "pending" jobs from 01.01.2020 to 01.01.2021.
Am I not using the command correctly, or is there a better way to directly see if jobs were on a dependency?
sacct -M cm2 --state PENDING -S2020-01-01T11:20:00 -E2021-01-01T11:20:00 --format="JobName, TimeLimit, State, Partition, Reason"
Adding Reason to the format will report "Dependency" for jobs that were held waiting on another job.
Just did research on this... I am sure you figured this out by now.
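For example, piping that output through grep keeps only the jobs whose reason was a dependency (a small addition of my own, not part of the original answer):
sacct -M cm2 --state PENDING -S2020-01-01T11:20:00 -E2021-01-01T11:20:00 \
      --format="JobName,TimeLimit,State,Partition,Reason" | grep -w Dependency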
Has anyone had a problem with snakemake not recognizing a timed-out job? I submit jobs to a cluster using qsub with a time-out set per rule:
snakemake --jobs 29 -k -p --latency-wait 60 --use-envmodules \
--cluster "qsub -l walltime={resources.walltime},nodes=1:ppn={threads},mem={resources.mem_mb}mb"
If a job fails within a script, the next one in line is executed. However, when a job hits the time-out defined in a rule, the next job in line is not executed, which reduces the total number of jobs running in parallel on the cluster over time. According to the MOAB scheduler (PBS server), a timed-out job returns a -11 exit status. As far as I understand, any non-zero exit status means failure - or does this only apply to positive integers?
Thanks in advance for any hint:)
If you don't provide a --cluster-status script, snakemake internally tracks job status via hidden marker files that the submitted job script is supposed to touch when it finishes or fails. When a job times out, the snakemake wrapper on the node never gets a chance to report the failure back to the main snakemake instance, because the scheduler kills it first.
You can try a cluster profile, or just grab (or write) a suitable cluster status script and pass it with --cluster-status (be sure to chmod it executable and to have qsub report a parsable job id).
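For reference, here is a minimal sketch of such a status script for a PBS/MOAB cluster (my own illustration, not the one from any profile). It assumes qstat -f <jobid> still reports job_state and exit_status for the job, and relies on snakemake calling the script with the external job id as its only argument and expecting it to print running, success or failed:
#!/bin/bash
# pbs_status.sh (hypothetical name): called by snakemake as "pbs_status.sh <jobid>"
jobid="$1"
state=$(qstat -f "$jobid" 2>/dev/null | awk '/job_state/ {print $3}')
case "$state" in
  C)  # completed: check the exit status recorded by the scheduler
      exit_status=$(qstat -f "$jobid" 2>/dev/null | awk '/exit_status/ {print $3}')
      if [ "$exit_status" = "0" ]; then echo success; else echo failed; fi ;;
  Q|R|H|E|W|T) echo running ;;
  *)  # job no longer known to the scheduler (e.g. purged after being killed)
      echo failed ;;
esac
It would then be passed alongside the usual submission command, e.g.:
snakemake --jobs 29 -k -p --latency-wait 60 --use-envmodules \
    --cluster "qsub -l walltime={resources.walltime},nodes=1:ppn={threads},mem={resources.mem_mb}mb" \
    --cluster-status ./pbs_status.sh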
Is it possible to set some requeue option so that the JOBID is changed when Slurm decides to requeue a job (after a node failure, for instance)?
That way, the folder associated with the first JOBID would not be overwritten.
Thanks,
A requeued job is still the same job, so the job ID will not change.
What you can do is prevent requeuing with the --no-requeue option. But then you will need to re-submit the job yourself, either by hand or using a workflow manager.
Another option is to append the restart count to the folder name. For instance, if your submission script has lines such as
WORKDIR=/some/path/${SLURM_JOB_ID}
mkdir -p $WORKDIR
cd $WORKDIR
you can replace them with
WORKDIR=/some/path/${SLURM_JOB_ID}${SLURM_RESTART_COUNT}
mkdir -p $WORKDIR
cd $WORKDIR
Upon the first run, $SLURM_RESTART_COUNT will be unset, preserving the original behaviour; on subsequent requeues it will be set to 1, 2, and so on, effectively suffixing the job ID with the requeue number.
For the name of the output file, you can use --open-mode=append to avoid overwriting the output file when the job restarts.
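Putting the two suggestions together, the top of a submission script could look like this (paths and file names are only illustrative):
#!/bin/bash
#SBATCH --output=slurm-%j.out
#SBATCH --open-mode=append    # append to the output file after a requeue instead of truncating it

# Suffix the working directory with the restart count so a requeued run
# does not overwrite the files written by the first attempt.
WORKDIR=/some/path/${SLURM_JOB_ID}${SLURM_RESTART_COUNT}
mkdir -p "$WORKDIR"
cd "$WORKDIR" || exit 1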
How do I get the slurm job status (e.g. COMPLETED, FAILED, TIMEOUT, ...) on job completion (within the submission script)?
I.e. I want to separately keep track of jobs which timed out or failed.
Currently I work with the exit code; however, jobs which TIMEOUT also get exit code 0.
For future reference, here is how I finally do it.
Retrieve the job ID at the beginning of the job and write some information (e.g. "${SLURM_JOB_ID} ${PWD}") to a summary file.
Then process this file and use something like sacct -X -n -o State -j ${jid} to get the job status.
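For illustration, a minimal sketch of that approach (the summary file name is made up):
# Inside the submission script: record the job id and working directory.
echo "${SLURM_JOB_ID} ${PWD}" >> "$HOME/job_summary.txt"

# Later, outside the jobs: look up the final state of every recorded job.
while read -r jid dir; do
    state=$(sacct -X -n -o State -j "$jid" | awk '{print $1}')
    echo "$jid $dir $state"
done < "$HOME/job_summary.txt"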
I'm working with LSF, running bsub commands.
I'm using the -Ep switch to run a post-exec script. This works great until the job is killed or hits a memory limit, run limit, etc.
Is there any way for the job to detect that it is running out of resources and then run the script? Or to force it to run the script even if it has been killed?
I guess my other option is running a second job with a dependency on the first, which runs the "post exec" script when the first one finishes.
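Something along these lines, using bsub's -w dependency expression so the second job only starts once the first has ended, whatever its exit state (job and script names are made up):
bsub -J myjob ./work.sh
bsub -J myjob_post -w 'ended(myjob)' ./post_exec.sh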
Any thoughts?
Kind Regards,
TheBigPeeler
From the documentation, you should be seeing the behaviour that you want.
A post-execution command runs after the job finishes, regardless of
the exit state of the job. Once a post-execution command is associated
with a job, that command runs even if the job fails. You cannot
configure the post-execution command to run only under certain
conditions.
I thought that maybe the interaction with JOB_INCLUDE_POSTEXEC (lsb.params) could account for the difference, but from my test the post-exec still runs in both cases. I used runlimit (bsub -W) to trigger the job kill.
Is it possible that the post exec is running, but exits early?
What version of LSF are you using? (What's the output of mbatchd -V and sbatchd -V)
In SGE, we have
qsub -now yes/no <command>
By "-now yes" the job is scheduled immediately(if possible) or not at all . We are not put in pending queue .
By "-now no " the job is put in pending queue if it cannot be executed immediately .
But in LSF , we have qsub's equivalent as bsub .
in bsub, we are put in pending queue, if it cannot be executed immediately. We don't have option as "-now yes" as in qsub .
Do we something in bsub as "qsub -now"
P.S : One solution is that we can check for some time(some secondss) after running bsub, if we are scheduled or not and then exit . I am searching for a more elegant way .
I found the answer, the LSF way.
LSF does provide a way to quit a job if it is unable to schedule the resource. There is an environment variable, LSF_NIOS_PEND_TIMEOUT (specified in minutes), which quits the job if it is still in the pending queue after that long.
env LSF_NIOS_PEND_TIMEOUT=1 bsub -Is -m host /bin/bash
From Somewhere on the web:
LSF_NIOS_PEND_TIMEOUT
Syntax: LSF_NIOS_PEND_TIMEOUT=minutes
Description: Applies only to interactive batch jobs. Maximum amount of time that an interactive batch job can remain pending. If this parameter is defined, and an interactive batch job is pending for longer than the specified time, the interactive batch job is terminated.
Valid values: Any integer greater than zero
LSF doesn't have the same thing. You could use expect with a timeout. LSF will output something like this when the job starts; your expect script could expect <<Starting on. (But this is basically what your P.S. says.)
$ bsub -Is -m hostA /bin/bash
Job <7536> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>
hostA$
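A rough sketch of such an expect wrapper (untested; the timeout value and the bkill clean-up are my own additions):
#!/usr/bin/expect -f
# Give the job 60 seconds to be dispatched; otherwise kill it and bail out.
set timeout 60
spawn bsub -Is -m hostA /bin/bash
expect {
    -re {Job <([0-9]+)> is submitted} { set jobid $expect_out(1,string); exp_continue }
    "<<Starting on"                   { interact }
    timeout                           { exec bkill $jobid; exit 1 }
}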
You could maybe use lsrun. But it won't work with the batch system to allocate a slot or other resource.