Submit Slurm job from within a Slurm job - cluster-computing

Say we have a Slurm script foo.bash like so:
#SBATCH <options>
sbatch bar.bash
bar.bash is also a Slurm script:
#SBATCH <options>
<some stuff>
When I submit foo.bash via sbatch, I'm expecting bar.bash to also get submitted (and executed). But that doesn't seem to be happening.
So my question is, how do I get a Slurm job to successfully submit the Slurm job contained within it? Or is this just not possible?
I am sorry if this is a rather general question, but I haven't been able to find an answer.
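For concreteness, the two scripts might look like the following minimal pair (the job names, time limits, and payload command are only illustrative stand-ins for the <options> and <some stuff> placeholders above; on most clusters the nested sbatch call is accepted as long as the compute node running foo.bash can reach the Slurm controller):
#!/bin/bash
# foo.bash -- submitted from the login node with: sbatch foo.bash
#SBATCH --job-name=foo
#SBATCH --time=00:10:00
# submit bar.bash as its own, independent job from inside this one
sbatch bar.bash

#!/bin/bash
# bar.bash -- submitted by foo.bash
#SBATCH --job-name=bar
#SBATCH --time=00:10:00
echo "some stuff"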

Related

SLURM job failing with sbatch, successful with srun

A researcher is submitting a job to our cluster that is failing when run with sbatch, but succeeding when run with srun. Any ideas on why this could be? I’ve included the error messages and the slurm script below:
Error message:
Unable to init server: Could not connect: Connection refused
(canavier_model_changes_no_plots.py:1589287): Gdk-CRITICAL **: 22:46:57.434: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
can't open DISPLAY
My first thought based on that error was that the problem lies in the code the job runs rather than in Slurm itself, but then I'm not sure why srun would work if that were the case?
Here is the slurm script:
#!/bin/bash
#SBATCH --job-name=networkmodel
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=4G
#SBATCH --time=00-00:05:00
python3 canavier_model_changes_no_plots.py
She thought it might have something to do with matplotlib code in her script, but it still failed when those parts were removed. Again, the code runs with srun, and fails with sbatch.
The error message indicates that the job is trying to run an X11 application that attempts to open a GUI window, and there is no display available inside a batch job. Matplotlib may very well be the cause. When the command is launched with srun from an interactive session, it typically inherits that session's environment, including a working DISPLAY (for example via X11 forwarding), which would explain why it succeeds there; a job submitted with sbatch runs with no display attached. The script should only write files and not attempt anything related to GUI windows.
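One way to enforce that, assuming the Python code uses matplotlib, is to force a non-interactive backend in the batch script before Python starts (MPLBACKEND is a standard matplotlib environment variable; the extra line below is a sketch to add to the script above):
# force matplotlib to render to files only, never to an X11 window
export MPLBACKEND=Agg
python3 canavier_model_changes_no_plots.py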

SLURM Array jobs - how to run as many jobs as possible? How to combine Slurm options most sensibly?

I am quite new to Slurm and this community, so please correct me in any way if I am doing anything wrong! :)
I need to run my executable (a Python script) many times in parallel on an HPC cluster. The executable takes the Slurm array task ID as input; within the Python script this input is mapped onto several parameters, on the basis of which data is then imported. Note that the executable itself is not internally parallelised, so I think each invocation should be able to run on a single CPU.
My aim: run as many invocations of my executable in parallel as possible! I was thinking of at least 50 concurrent invocations.
In principle, my scripts are working as intended on the cluster. I use this Slurm submission script:
#!/bin/bash -l
#SBATCH --job-name=NAME
#SBATCH --chdir=/my/dir
#SBATCH --output=.job/NAME%A_%a.out
#SBATCH --error=.job/NAME%A_%a.err
#SBATCH --mail-type=END
#SBATCH --mail-user=USER
# --- resource specification ---
#SBATCH --partition=general
#SBATCH --array=1-130
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16G
#SBATCH --time=13:00:00
# --- start from a clean state and load necessary environment modules ---
module purge
module load anaconda/3
# --- instruct OpenMP to use the number of cpus requested per task ---
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# --- run executable via srun ---
srun ./path/to/executable.py $SLURM_ARRAY_TASK_ID
However, this way, somehow only 8 jobs (that is, 'executable.py 1', 'executable.py 2', ...) get executed in parallel, each on a different node. (Note: I don't quite know what 'export OMP_NUM_THREADS' does; I was told to include it by IT support.) If 'executable.py 1' ends, 'executable.py 9' starts. However, I want more than just 8 concurrently running invocations. So I thought I need to specify that each invocation only needs one CPU; maybe then many more of my jobs can run in parallel on the 8 nodes I somehow seem to receive. My new submission script looks like this (for readability I only show the 'resource specification' part; the rest was not changed):
# --- resource specification ---
#SBATCH --partition=general
#SBATCH --array=1-130
#SBATCH --ntasks-per-node=10
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=13:00:00
This way, though, it seems that my executable gets run ten times for each Slurm array task ID, that is, 'executable.py 1' is run ten times, as is 'executable.py 2' and so on. This is not what I intended.
I think at the bottom of my problem is that (i) I am seriously confused by the SBATCH options --ntasks-per-node, --ntasks, --cpus-per-task, --nodes, etc., and (ii) I don't really know conceptually what a 'job', 'job step' or 'task' is meant to be (both for my case and on the man page for SBATCH).
If anyone knows which SBATCH option combination gives me what I want, I would be very grateful for a hint. Also, if you have general knowledge (in plain English) on how job steps and tasks etc. can be defined, that would be so great.
Please note that I extensively stared at the man pages and some online documentation. I also asked my local IT support, but sadly they were not awfully helpful. I really need my script to run in parallel on a huge scale; I also really want to understand the workings of Slurm a bit better. I should like to add that I am not a computer scientist by training, so this is not my usual playing field.
Thanks so much for your time everyone!
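For reference, a resource block in which every array task requests exactly one task on a single CPU could look like the sketch below; how many array tasks then actually run at once depends on partition and account limits, and the optional %N suffix on --array only caps (never raises) that number. The memory and time values are simply carried over from the question.
# --- resource specification (one single-CPU task per array element) ---
#SBATCH --partition=general
#SBATCH --array=1-130%50
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=13:00:00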

What does the --ntasks or -n option do in Slurm?

I was using Slurm on a computing cluster and came across the --ntasks or -n option. I have obviously read the documentation for it (http://slurm.schedmd.com/sbatch.html):
sbatch does not launch tasks, it requests an allocation of resources
and submits a batch script. This option advises the Slurm controller
that job steps run within the allocation will launch a maximum of
number tasks and to provide for sufficient resources. The default is
one task per node, but note that the --cpus-per-task option will
change this default.
the specific part I do not understand is:
run within the allocation will launch a maximum of number tasks and to
provide for sufficient resources.
I have a few questions:
I guess my first question is what the word "task" means and how it differs from the word "job" in the Slurm context. I usually think of a job as running the bash script under sbatch, as in sbatch my_batch_job.sh. I'm not sure what a task means.
If I equate the word task with job, then I thought it would run the same identical bash script multiple times according to the argument to -n, --ntasks=<number>. However, I tested it out on the cluster: I ran an echo hello script with --ntasks=9 and I expected sbatch to echo hello 9 times to stdout (which is collected in slurm-job_id.out), but to my surprise there was a single execution of my echo hello script. Then what does this option even do? It seems to do nothing, or at least I can't see what it's supposed to be doing.
I do know the -a, --array=<indexes> option exists for multiple jobs. That is a different topic. I simply want to know what --ntasks is supposed to do, ideally with an example so that I can test it out on the cluster.
The --ntasks parameter is useful if you have commands that you want to run in parallel within the same batch script.
This may be two separate commands separated by an & or two commands used in a bash pipe (|).
For example
Using the default ntasks=1
#!/bin/bash
#SBATCH --ntasks=1
srun sleep 10 &
srun sleep 12 &
wait
Will throw the warning:
Job step creation temporarily disabled, retrying
The number of tasks was (by default) specified as one, and therefore the second job step cannot start until the first has finished.
This job will finish in around 22 seconds. To break this down:
sacct -j 515058 --format=JobID,Start,End,Elapsed,NCPUS
JobID Start End Elapsed NCPUS
------------ ------------------- ------------------- ---------- ----------
515058 2018-12-13T20:51:44 2018-12-13T20:52:06 00:00:22 1
515058.batch 2018-12-13T20:51:44 2018-12-13T20:52:06 00:00:22 1
515058.0 2018-12-13T20:51:44 2018-12-13T20:51:56 00:00:12 1
515058.1 2018-12-13T20:51:56 2018-12-13T20:52:06 00:00:10 1
Here step 0 started and finished (taking 12 seconds), followed by step 1 (taking 10 seconds), making a total elapsed time of 22 seconds.
To run both of these commands simultaneously:
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 sleep 10 &
srun --ntasks=1 sleep 12 &
wait
Running the same sacct command as specified above
sacct -j 515064 --format=JobID,Start,End,Elapsed,NCPUS
JobID Start End Elapsed NCPUS
------------ ------------------- ------------------- ---------- ----------
515064 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 2
515064.batch 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 2
515064.0 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 1
515064.1 2018-12-13T21:34:08 2018-12-13T21:34:18 00:00:10 1
Here the whole job takes 12 seconds. There is no risk of the steps waiting for resources, as the number of tasks has been specified in the batch script and therefore the job has the resources to run this many commands at once.
Each job step inherits the parameters specified for the batch script. This is why --ntasks=1 needs to be given to each srun call; otherwise each step would claim --ntasks=2 and the second command could not run until the first step had finished.
Another caveat of the steps inheriting the batch parameters arises if --export=NONE is specified as a batch parameter. In this case --export=ALL should be given to each srun command; otherwise environment variables set within the sbatch script are not passed on to the srun commands, as sketched below.
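A short sketch of that --export situation (the variable name MYVAR is only illustrative):
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --export=NONE
# set inside the batch script, after the clean environment was requested
export MYVAR=42
# without --export=ALL these steps would not see MYVAR
srun --ntasks=1 --export=ALL bash -c 'echo "MYVAR is $MYVAR"' &
srun --ntasks=1 --export=ALL bash -c 'echo "MYVAR is $MYVAR"' &
wait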
Additional notes:
When using bash pipes, it may be necessary to specify --nodes=1 to prevent the commands on either side of the pipe from running on separate nodes.
When using & to run commands simultaneously, the wait at the end is vital. Without it, the batch script would exit as soon as the backgrounded srun steps were launched, and Slurm would then end the job and cancel any steps still running.
The "--ntasks" options specifies how many instances of your command are executed.
For a common cluster setup and if you start your command with "srun" this corresponds to the number of MPI ranks.
In contrast the option "--cpus-per-task" specify how many CPUs each task can use.
Your output surprises me as well. Have you launched your command in the script or via srun?
Does you script look like:
#!/bin/bash
#SBATCH --ntasks=8
## more options
echo hello
This should always output only a single line, because the batch script itself is executed just once, on the first node of the allocation, not once per task.
If your script looks like
#!/bin/bash
#SBATCH --ntasks=8
## more options
srun echo hello
srun launches the command as parallel tasks across the allocation, so you should get 8 lines of hello.
Tasks are processes that a job executes in parallel on one or more nodes. sbatch allocates resources for your job, but even if you request resources for multiple tasks, it will launch your job script in a single process on a single node only. srun is used to launch job steps from the batch script. --ntasks=N instructs srun to execute N copies of the job step.
For example,
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
means that you want to run two processes in parallel, and have each process access two CPUs. sbatch will allocate four CPUs for your job and then start the batch script in a single process. Within your batch script, you can create a parallel job step using
srun --ntasks=2 --cpus-per-task=2 step.sh
This will run two processes in parallel, both of them executing the step.sh script. From the same job, you could also run
srun --ntasks=1 --cpus-per-task=4 step.sh
This would launch a single process that can access all four CPUs (although it would issue a warning).
It's worth noting that within the allocated resources, your job script is free to do anything, and it doesn't have to use srun to create job steps (but you need srun to launch a job step on multiple nodes). For example, the following script will run both steps in parallel:
#!/bin/bash
#SBATCH --ntasks=1
step1.sh &
step2.sh &
wait
If you want to launch job steps using srun and have two different steps run in parallel, then your job needs to allocate two tasks, and your job steps need to request only one task. You also need to provide the --exclusive argument to srun, for the job steps to use separate resources.
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 --exclusive step1.sh &
srun --ntasks=1 --exclusive step2.sh &
wait

Force shell script to run tasks in sequence

I'm running a shell script that executes several tasks. The thing is that the script does not wait for a task to end before starting the next one. My script should work differently, waiting for one task to complete before the next one starts. Is there a way to do that? My script looks like this
sbatch retr.sc 19860101 19860630
scp EN/EN1986* myhostname@myhost.it:/storage/myhostname/MetFiles
The first command runs retr.sc, which retrieves files and takes roughly half an hour. The second command, though, runs right away and so moves only some of the files to the destination. I wish the scp command to run only when the first is complete.
thanks in advance
You have several options:
use srun rather than sbatch: srun retr.sc 19860101 19860630
use sbatch for the second command as well, and make it depend on the first one
like this:
RES=$(sbatch retr.sc 19860101 19860630)
sbatch --depend=afterok:${RES##* } --wrap "scp EN/EN1986* myhostname@myhost.it:/storage/myhostname/MetFiles"
create one script that incorporates both retr.sc and the scp command and submit that script (see the sketch after this list).
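A sketch of that third option, assuming retr.sc is an ordinary shell script that can be called directly (its own #SBATCH lines are then simply treated as comments) and with an illustrative time limit:
#!/bin/bash
#SBATCH --job-name=retr_and_copy
#SBATCH --time=02:00:00
# retrieve the files first; the script blocks until retrieval is done
./retr.sc 19860101 19860630
# only then copy the results to the remote host
scp EN/EN1986* myhostname@myhost.it:/storage/myhostname/MetFiles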
sbatch exits immediately on submitting the job to slurm.
salloc will wait for the job to finish before exiting.
from the man page:
$ salloc -N16 xterm
salloc: Granted job allocation 65537
(at this point the xterm appears, and salloc waits for xterm to exit)
salloc: Relinquishing job allocation 65537
Thanks for your replies.
I've sorted it out this way:
RES=$(sbatch retr.sc $date1 $date2)
array=(${RES// / })
JOBID=${array[3]}
year1=${date1:0:4}
sbatch --dependency=afterok:${JOBID} scp.sh $year1
where scp.sh is the script for transferring the file to my local machine
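For reference, sbatch also has a --parsable flag that prints only the job ID, which avoids the manual string splitting (a sketch of the same idea):
# --parsable makes sbatch print just the job ID (plus the cluster name, if one is set)
JOBID=$(sbatch --parsable retr.sc "$date1" "$date2")
year1=${date1:0:4}
sbatch --dependency=afterok:${JOBID} scp.sh "$year1"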

Slurm: What is the difference for code executing under salloc vs srun

I'm using a cluster managed by slurm to run some yarn/hadoop benchmarks. To do this I am starting the hadoop servers on nodes allocated by slurm and then running the benchmarks on them. I realize that this is not the intended way to run a production hadoop cluster, but needs must.
To do this I started by writing a script that runs with srun eg srun -N 4 setup.sh. This script writes the configuration files and starts the servers on the allocated nodes, with the lowest numbered machine acting as the master. This all works, and I am able to run applications.
However, as I would like to start the servers once and then launch multiple applications on them without restarting/reconfiguring everything at the beginning, I would like to use salloc instead. I had thought that this would be a simple case of running salloc -N 4 and then running srun setup.sh. Unfortunately this does not work, as the different servers are unable to communicate with each other. Could anyone explain to me what the difference in the operating environment is between using srun and using salloc then srun?
Many thanks
Daniel
From the slurm-users mailing list:
sbatch and salloc allocate resources to the job, while srun launches parallel tasks across those resources. When invoked within a job allocation, srun will launch parallel tasks across some or all of the allocated resources. In that case, srun inherits by default the pertinent options of the sbatch or salloc which it runs under. You can then (usually) provide srun different options which will override what it receives by default. Each invocation of srun within a job is known as a job step.
srun can also be invoked outside of a job allocation. In that case, srun requests resources, and when those resources are granted, launches tasks across those resources as a single job and job step.
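In practice the salloc-based pattern sketched in the question would look roughly like this (the node count and setup.sh mirror the question; the benchmark script names are hypothetical placeholders):
# request an allocation of 4 nodes; salloc typically opens a shell on the submit host
salloc -N 4
# inside that shell, each srun becomes a job step running on the allocated nodes
srun -N 4 setup.sh          # write the configs and start the servers
srun -N 1 run_benchmark1.sh # hypothetical: launch one benchmark against the running servers
srun -N 1 run_benchmark2.sh # hypothetical: launch another without restarting anything
exit                        # release the allocation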
