SLURM job failing with sbatch, successful with srun

A researcher is submitting a job to our cluster that is failing when run with sbatch, but succeeding when run with srun. Any ideas on why this could be? I’ve included the error messages and the slurm script below:
Error message:
Unable to init server: Could not connect: Connection refused
(canavier_model_changes_no_plots.py:1589287): Gdk-CRITICAL **: 22:46:57.434: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
can't open DISPLAY
My first thought based on that error was that the problem lies in the code Slurm is running rather than in Slurm itself, but I'm not sure why srun would work if that were the case?
Here is the slurm script:
#!/bin/bash
#SBATCH --job-name=networkmodel
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=4G
#SBATCH --time=00-00:05:00
python3 canavier_model_changes_no_plots.py
She thought it might have something to do with matplotlib calls in her code, but it still failed when those were removed. Again, the code runs with srun and fails with sbatch.

The error message indicates that the job is trying to run an X11 application, i.e. something in the script is attempting to open a GUI window. Matplotlib may very well be the cause. The difference between the two launch methods is most likely that srun, run from an interactive session, inherits your DISPLAY environment (and possibly X11 forwarding), whereas a job started with sbatch runs with no display attached. The script should only write files and never try anything involving GUI windows.
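One common workaround, if matplotlib is indeed involved, is to force a non-interactive backend so nothing ever tries to talk to an X server. Below is a minimal sketch of the batch script with that change (assuming matplotlib honours the MPLBACKEND environment variable, which it has done since version 1.5):
#!/bin/bash
#SBATCH --job-name=networkmodel
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=4G
#SBATCH --time=00-00:05:00
# force matplotlib to render off-screen instead of opening a display
export MPLBACKEND=Agg
python3 canavier_model_changes_no_plots.py
Alternatively, calling matplotlib.use('Agg') inside the Python script, before importing matplotlib.pyplot, achieves the same thing.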

Related

Snakemake does not recognise job failure due to timeout with error code -11

Has anyone had a problem with Snakemake recognizing a timed-out job? I submit jobs to a cluster using qsub with a timeout set per rule:
snakemake --jobs 29 -k -p --latency-wait 60 --use-envmodules \
--cluster "qsub -l walltime={resources.walltime},nodes=1:ppn={threads},mem={resources.mem_mb}mb"
If a job fails within a script, the next one in line is executed. However, when a job hits the timeout defined in a rule, the next job in line is not executed, which reduces the total number of jobs running in parallel on the cluster over time. According to the MOAB scheduler (PBS server), a timed-out job exits with status -11. As far as I understand, any non-zero exit status means failure - or does this only apply to positive integers?!
Thanks in advance for any hint:)
If you don't provide a --cluster-status script, Snakemake checks job status internally by having the submitted job script touch some hidden files. When a job times out, the Snakemake process on the node never gets a chance to report the failure back to the main Snakemake instance, because the scheduler kills it first.
You can try a cluster profile, or just grab a suitable cluster-status script (be sure to make it executable with chmod +x and to have qsub report a parseable job id).
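For illustration only, a bare-bones status script for a Torque/PBS setup might look like the sketch below. The file name qsub-status.sh is made up, and the qstat field names and the visibility of finished jobs vary between sites, so it needs local adaptation. Snakemake calls the script with the job id and expects it to print running, success or failed:
#!/bin/bash
# qsub-status.sh <jobid> - report job state to snakemake
jobid="$1"
out=$(qstat -f "$jobid" 2>/dev/null)
state=$(echo "$out" | awk '/job_state/ {print $3}')
case "$state" in
    Q|R|H|W|E) echo running ;;
    C)
        # completed: a non-zero exit_status (e.g. -11 after a walltime kill) means failure
        rc=$(echo "$out" | awk '/exit_status/ {print $3}')
        if [ "$rc" = "0" ]; then echo success; else echo failed; fi
        ;;
    *)
        # job no longer known to the scheduler; without keep_completed (or qstat -x)
        # we cannot tell success from failure here, so report failed
        echo failed ;;
esac
It would then be passed to Snakemake with --cluster-status ./qsub-status.sh (after chmod +x), alongside the existing --cluster option.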

SLURM Array jobs - how to run as many job as possible? How to combine Slurm options most sensibly?

I am quite new to Slurm and this community, so please correct me if I am doing anything wrong! :)
I need to run my executable (a Python script) many times in parallel on an HPC cluster. The executable takes the Slurm array task ID as input. Within the Python script, this input is mapped onto several parameters, which in turn determine what data is imported. Note that the executable itself is not internally parallelised; I think each invocation should be able to run on a single CPU.
My aim: run as many invocations of my executable concurrently as possible, ideally at least 50.
In principle, my scripts are working as intended on the cluster. I use this Slurm submission script:
#!/bin/bash -l
#SBATCH --job-name=NAME
#SBATCH --chdir=/my/dir
#SBATCH --output=.job/NAME%A_%a.out
#SBATCH --error=.job/NAME%A_%a.err
#SBATCH --mail-type=END
#SBATCH --mail-user=USER
# --- resource specification ---
#SBATCH --partition=general
#SBATCH --array=1-130
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16G
#SBATCH --time=13:00:00
# --- start from a clean state and load necessary environment modules ---
module purge
module load anaconda/3
# --- instruct OpenMP to use the number of cpus requested per task ---
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# --- run executable via srun ---
srun ./path/to/executable.py $SLURM_ARRAY_TASK_ID
However, this way, somehow only 8 jobs (that is, 'executable.py 1', 'executable.py 2', ...) get executed in parallel, each on a different node. (Note: I don't quite know what 'export OMP_NUM_THREADS' does; I was told to include it by IT support.) When 'executable.py 1' ends, 'executable.py 9' starts. However, I want more than just 8 concurrently running invocations. So I thought I need to specify that each invocation only needs one CPU; maybe then many more of my jobs could run in parallel on the 8 nodes I somehow seem to receive. My new submission script looks like this (for readability I only show the 'resource specification' part; the rest is unchanged):
# --- resource specification ---
#SBATCH --partition=general
#SBATCH --array=1-130
#SBATCH --ntasks-per-node=10
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=13:00:00
This way, though, it seems that my executable gets run ten times for each Slurm array task ID, that is, 'executable.py 1' is run ten times, as is 'executable.py 2' and so on. This is not what I intended.
I think at the bottom of my problem is that (i) I am seriously confused by the sbatch options --ntasks-per-node, --ntasks, --cpus-per-task, --nodes, etc., and (ii) I don't really know conceptually what a 'job', a 'job step' or a 'task' is meant to be (both in my case and on the sbatch man page).
If anyone knows which SBATCH option combination gives me what I want, I would be very grateful for a hint. Also, if you have general knowledge (in plain English) on how job steps and tasks etc. can be defined, that would be so great.
Please note that I have stared extensively at the man pages and some online documentation. I also asked my local IT support, but sadly they were not terribly helpful. I really need my script to run in parallel on a huge scale, and I also want to understand the workings of Slurm a bit better. I should add that I am not a computer scientist by training, so this is not my usual playing field.
Thanks so much for your time everyone!
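
Not a definitive answer, but for comparison: a resource specification along the following lines asks for exactly one task and one CPU per array element, which normally lets the scheduler pack many array tasks onto the available nodes. The %50 throttle is an assumption about the desired concurrency, and whether 50 really run at once still depends on the partition limits of the cluster:
# --- resource specification ---
#SBATCH --partition=general
#SBATCH --array=1-130%50      # at most 50 array tasks running at the same time
#SBATCH --ntasks=1            # each array element is a single task ...
#SBATCH --cpus-per-task=1     # ... using a single CPU
#SBATCH --mem=16G             # memory per array task
#SBATCH --time=13:00:00
With this, each array element becomes its own small job, and the unchanged line srun ./path/to/executable.py $SLURM_ARRAY_TASK_ID launches exactly one copy of the executable per task ID.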

Submit Slurm job from within a Slurm job

Say we have a Slurm script foo.bash like so:
#SBATCH <options>
SBATCH bar.bash
bar.bash is also a Slurm script:
#SBATCH <options>
<some stuff>
When I submit foo.bash via SBATCH, I'm expecting bar.bash to also get submitted (and executed). But that doesn't seem to be happening.
So my question is: how do I get a Slurm job to successfully submit the Slurm job referenced within it? Or is this just not possible?
I am sorry if this is a rather general question, but I haven't been able to find an answer.
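
One thing worth checking: #SBATCH lines are directives read by the submission command, not commands themselves, so the body of foo.bash has to call the lowercase sbatch program to submit bar.bash. A minimal sketch (the options shown are placeholders, and it assumes the cluster allows submitting jobs from compute nodes):
#!/bin/bash
#SBATCH --job-name=foo
#SBATCH --time=00:05:00
# submit bar.bash as a separate, independent job from inside this job
sbatch bar.bash
If submissions from compute nodes are blocked on the cluster, this will fail no matter how it is written; local admins can confirm whether that is the case.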

sbatch+srun: Large amount of single thread jobs

Hello friendly people,
my question is rather specific.
For more than a week I have been trying to submit thousands of single-thread jobs for a scientific experiment using sbatch and srun.
The problem is that these jobs may take different amounts of time to finish, and some may even be aborted when they exceed the memory limit. Both behaviours are fine and my evaluation deals with them.
But, I am facing the problem that some of the jobs are never started, even though they have been submitted.
My sbatch script looks like this:
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
wait 5s
done
Now, my error log shows the following message:
srun: Job 1846955 step creation temporarily disabled, retrying
1) What does 'step creation temporarily disabled' mean? Are all CPUs busy and the step dropped, or is it started again later when resources become free?
2) Why are some of my jobs not carried out, and how can I fix that? Am I using the correct parameters for srun?
Thanks for your help!
srun: Job 1846955 step creation temporarily disabled, retrying
This is normal: you reserve 4 x 12 CPUs and start 500 instances of srun. Only 48 instances can run at a time, while the others print that message. Whenever a running instance finishes, a pending instance starts.
wait 5s
The wait command is used to wait for processes, not for a certain amount of time. For that, use the sleep command. The wait command must be at the end of the script. Otherwise, the job could stop before all srun instances have finished.
So the script should look like this:
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
done
wait
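If the flood of 'retrying' messages is a nuisance, an optional refinement is to throttle the loop so that no more srun instances are started than there are allocated tasks. This is only a sketch and assumes bash 4.3+ for wait -n:
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
max_running=48   # 4 nodes x 12 tasks per node
for i in {1..500}
do
    # do not launch a new step while the allocation is already full
    while [ "$(jobs -rp | wc -l)" -ge "$max_running" ]; do
        wait -n
    done
    srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
done
wait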

Slurm: how many times will failed jobs be --requeue'd

I have a Slurm job array for which the job file includes a --requeue directive. Here is the full job file:
#!/bin/bash
#SBATCH --job-name=catsss
#SBATCH --output=logs/cats.log
#SBATCH --array=1-10000
#SBATCH --requeue
#SBATCH --partition=scavenge
#SBATCH --mem=32g
#SBATCH --time=24:00:00
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=douglas.duhaime#gmail.com
module load Langs/Python/3.4.3
python3 cats.py ${SLURM_ARRAY_TASK_ID} 'cats'
Several of the array values have restarted at least once. I would like to know, how many times will these jobs restart before they are finally cancelled by the scheduler? Will the restarts carry on indefinitely until a sysadmin manually cancels them, or do jobs like this have a maximum number of retries?
AFAIK, jobs can be requeued an unlimited number of times. You just decide whether the job is prepared to be requeued or not. With --no-requeue it will never be requeued; with --requeue it will be requeued every time the system decides it is necessary (node failure, preemption by a higher-priority job, ...).
The jobs keep restarting until they finish (successfully or not, but finished instead of interrupted).
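If you would rather cap the number of restarts yourself, the batch script can inspect SLURM_RESTART_COUNT, which Slurm sets to the number of times the job has been requeued. A small sketch based on the job file above (the limit of 3 is an arbitrary choice):
#!/bin/bash
#SBATCH --job-name=catsss
#SBATCH --array=1-10000
#SBATCH --requeue
#SBATCH --partition=scavenge
# SLURM_RESTART_COUNT counts how often this job has been requeued;
# give up once it reaches the (arbitrary) limit of 3
if [ "${SLURM_RESTART_COUNT:-0}" -ge 3 ]; then
    echo "Giving up after ${SLURM_RESTART_COUNT} restarts" >&2
    exit 1
fi
module load Langs/Python/3.4.3
python3 cats.py ${SLURM_ARRAY_TASK_ID} 'cats'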
