SLURM Submit multiple tasks per node? - job-scheduling

I found some very similar questions which helped me arrive at a script that seems to work, but I'm still not sure I fully understand why, hence this question.
My problem (example): on 3 nodes, I want to run 12 tasks on each node (so 36 tasks in total). Each task uses OpenMP and should use 2 CPUs. In my case a node has 24 CPUs and 64GB memory. My script would be:
#SBATCH --nodes=3
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
export OMP_NUM_THREADS=2
for i in {1..36}; do
srun -N 1 -n 1 ./program input${i} >& out${i} &
done
wait
This seems to work as I require, successively running tasks on a node until all CPUs on that node are in use, and then continuing to run further tasks on the next node until all CPUs are used again, and so on.
My question: I'm not sure whether this is actually what it does, as I didn't fully understand the man page of srun regarding -n, and I have not used srun before.
My confusion mainly comes from "-n": the man page for -n says "The default is one task per node, ..", so I expected that with "srun -n 1" only one task would be run on each node, which doesn't seem to be the case.
Furthermore, when I tried e.g. "srun -n 2 ./program", it seems to just run the exact same program twice as two different tasks, with no way to use different input files. I can't think of why that would ever be useful.

Your setup is correct, except that you must use the --exclusive option of srun (which has a different meaning for srun than for sbatch).
As for your remark regarding the usefulness of srun: the behaviour of the program can be changed based on the environment variable $SLURM_PROCID (the rank of the task), or on the MPI rank in the case of an MPI program. Your confusion arises from the fact that your program is not written to be parallel (apart from the 2 OpenMP threads), while srun is meant to start parallel programs, most of the time based on MPI.
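For reference, a minimal sketch of the original loop with that flag added (same directives as in the question):
#SBATCH --nodes=3
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
export OMP_NUM_THREADS=2
for i in {1..36}; do
# --exclusive here applies to the job step: each srun claims its own CPUs
# within the allocation and waits if none are currently free
srun --exclusive -N 1 -n 1 ./program input${i} >& out${i} &
done
wait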

Another way is to run all your tasks at once.
Since the input and output files depend on the rank, a wrapper is needed.
Your SLURM script would be
#SBATCH --nodes=3
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
export OMP_NUM_THREADS=2
srun -n 36 ./program.sh
and your wrapper program.sh would be
#!/bin/sh
# SLURM_PROCID is the rank of this task within the job step (0..35 here),
# so each copy reads its own input file and writes its own output file.
exec ./program input${SLURM_PROCID} > out${SLURM_PROCID} 2>&1
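Note that the wrapper must be executable (chmod +x program.sh) so that srun can launch it.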

Related

Request maximum number of threads & cores on node via Slurm job scheduler

I have a heterogeneous cluster, containing either 14-core or 16-core CPUs (28 or 32 threads). I manage job submissions using Slurm. Some requirements:
It doesn't matter which CPU is used for a calculation.
I don't want to specify which CPU a job should go to.
A job should consume all available cores on the CPU (14 or 16).
I want mpirun to handle threading.
To illustrate the peculiarities of the problem, I show a job script that works on the 16-core CPUs:
#!/bin/bash
#SBATCH -J test
#SBATCH -o job.%j.out
#SBATCH -N 1
#SBATCH -n 32
mpirun -np 16 vasp
An example job script that works on the 14-core CPUs is:
#!/bin/bash
#SBATCH -J test
#SBATCH -o job.%j.out
#SBATCH -N 1
#SBATCH -n 28
mpirun -np 14 vasp
The second job script runs on the 16-core CPUs but, unfortunately, the job is about 35% slower than when I request 32 threads as is done in the first script. That's an unacceptable performance loss for my application.
I haven't figured out if there is a good way around this challenge. To me, a solution would be to request a variable number of resources, such as
#SBATCH -n [28-32]
and to tailor the mpirun -np x vasp line accordingly. I haven't found a way to do this, however. Are there any suggestions on how to achieve this directly in Slurm or is there a good workaround?
I tried to use the environment variable $SLURM_CPUS_ON_NODE, but this variable is only set after the node is selected, so it cannot be used in a #SBATCH line.
I also looked at the --constraint flag but this does not seem to give sufficiently granular control over threading requests.
Actually it should work as you want it to by simply specifying that you want a full node:
#!/bin/bash
#SBATCH -J test
#SBATCH -o job.%j.out
#SBATCH -N 1
#SBATCH --exclusive
mpirun vasp
mpirun will start as many processes as defined in SLURM_TASKS_PER_NODE, which Slurm sets to the number of tasks that can be created on the node; that is the number of CPUs if you do not request more than one CPU per task.
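If the mpirun installed on the cluster is not Slurm-aware and does not pick that value up automatically, a sketch that passes it explicitly could look like the following (on a single-node allocation SLURM_TASKS_PER_NODE is a plain number, e.g. 28 or 32):
#!/bin/bash
#SBATCH -J test
#SBATCH -o job.%j.out
#SBATCH -N 1
#SBATCH --exclusive
# SLURM_TASKS_PER_NODE holds the task count granted for the full node,
# so the process count adapts to whichever node type Slurm picked.
mpirun -np "${SLURM_TASKS_PER_NODE}" vasp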

How to run two scripts in parallel on the same node but different cores with sbatch?

I am new to SLURM and I want to run two scripts (each of them takes 2 minutes to run) at the same time (in parallel) on the same node, same socket, but on different cores. I have a system where one node has 2 sockets, and each socket has 10 cores.
Based on what I have read in one other question (SLURM: How can I run different executables on parallel on the same compute node or in different nodes?) I came up with this code:
#!/bin/bash
#SBATCH -J script
#SBATCH --time=00:05:00
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --partition=xeon
#SBATCH --output=OUTPUT.txt
#SBATCH --hint=nomultithread
ml intel
module load openmpi
srun -c 1 --exclusive ./runScript1 &
srun -c 1 --exclusive ./runScript2 &
wait
But when I repeatedly run squeue --job [jobID], I see that it takes 4 minutes for both scripts to finish, which makes me think they run sequentially (the second one only starts after the first one is finished).
I have also tried using taskset to select a specific core, but I got some errors.
I am running the script from above using sbatch.
Please tell me if I am assuming wrong or doing something wrong.
Is it possible to select a specific core to run on by including taskset or the --cpu-bind option in my code?
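On the core-selection part of the question: srun does accept a --cpu-bind option. A minimal sketch that pins each step to an explicit core inside the allocation (the core indices 0 and 1 are illustrative):
srun --ntasks=1 --exclusive --cpu-bind=map_cpu:0 ./runScript1 &
srun --ntasks=1 --exclusive --cpu-bind=map_cpu:1 ./runScript2 &
wait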

mpirun gets higher rank than requested with --ntasks from sbatch script

I am launching an MPI program through an sbatch script like this (corresponding to an example script provided by the system administrators):
#! /bin/bash -l
#SBATCH --job-name=test
#SBATCH -o stdout.log
#SBATCH -e stderr.log
#SBATCH --ntasks=160
#SBATCH --time=0-00:10:00
#SBATCH --qos=normal
cd $WORK/
mpirun ./mpiprogram
However, it seems that sometimes more MPI processes are launched than NTASKS: sometimes 200 processes when 160 were requested, sometimes other numbers. Note that the nodes have 16 or 20 cores. Some of the worker processes (all pretty much identical in my case) run much slower than the others, perhaps because of swapping; the swapping may be caused by too many processes on one node using too much memory.
Should I specify the number of processes to mpirun using $SLURM_NTASKS? Or what is going on here?
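A minimal sketch of that idea, passing the allocation's task count to mpirun explicitly (SLURM_NTASKS is set here because --ntasks is given):
#! /bin/bash -l
#SBATCH --job-name=test
#SBATCH -o stdout.log
#SBATCH -e stderr.log
#SBATCH --ntasks=160
#SBATCH --time=0-00:10:00
#SBATCH --qos=normal
cd $WORK/
# Pin the MPI process count to exactly the number of tasks Slurm granted.
mpirun -np "${SLURM_NTASKS}" ./mpiprogram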

What does the --ntasks or -n option do in SLURM?

I was using SLURM on a computing cluster and came across the --ntasks or -n option. I have obviously read the documentation for it (http://slurm.schedmd.com/sbatch.html):
sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.
The specific part I do not understand is: "run within the allocation will launch a maximum of number tasks and to provide for sufficient resources."
I have a few questions:
I guess my first question is what the word "task" means and how it differs from the word "job" in the SLURM context. I usually think of a job as running the bash script under sbatch, as in sbatch my_batch_job.sh. I'm not sure what a task means.
If I equate the word task with job, then I thought it would run the same identical bash script multiple times according to the argument to -n, --ntasks=<number>. However, I tested it out on the cluster: I ran echo hello with --ntasks=9 and expected sbatch to echo hello 9 times to stdout (which is collected in slurm-job_id.out), but to my surprise there was a single execution of my echo hello script. Then what does this option even do? It seems to do nothing, or at least I can't see what it's supposed to be doing.
I do know the -a, --array=<indexes> option exists for multiple jobs; that is a different topic. I simply want to know what --ntasks is supposed to do, ideally with an example so that I can test it out on the cluster.
The --ntasks parameter is useful if you have commands that you want to run in parallel within the same batch script.
This may be two separate commands separated by an & or two commands used in a bash pipe (|).
For example
Using the default ntasks=1
#!/bin/bash
#SBATCH --ntasks=1
srun sleep 10 &
srun sleep 12 &
wait
Will throw the warning:
Job step creation temporarily disabled, retrying
The number of tasks was specified as one, and therefore the second job step cannot start until the first one has finished.
This job will finish in around 22 seconds. To break this down:
sacct -j515058 --format=JobID,Start,End,Elapsed,NCPUS
JobID Start End Elapsed NCPUS
------------ ------------------- ------------------- ---------- ----------
515058 2018-12-13T20:51:44 2018-12-13T20:52:06 00:00:22 1
515058.batch 2018-12-13T20:51:44 2018-12-13T20:52:06 00:00:22 1
515058.0 2018-12-13T20:51:44 2018-12-13T20:51:56 00:00:12 1
515058.1 2018-12-13T20:51:56 2018-12-13T20:52:06 00:00:10 1
Here task 0 started and finished (in 12 seconds), followed by task 1 (in 10 seconds), for a total elapsed time of 22 seconds.
To run both of these commands simultaneously:
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 sleep 10 &
srun --ntasks=1 sleep 12 &
wait
Running the same sacct command as specified above
sacct -j 515064 --format=JobID,Start,End,Elapsed,NCPUS
JobID Start End Elapsed NCPUS
------------ ------------------- ------------------- ---------- ----------
515064 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 2
515064.batch 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 2
515064.0 2018-12-13T21:34:08 2018-12-13T21:34:20 00:00:12 1
515064.1 2018-12-13T21:34:08 2018-12-13T21:34:18 00:00:10 1
Here the total job takes 12 seconds. There is no risk of job steps waiting for resources, because the number of tasks specified in the batch script gives the job the resources to run this many commands at once.
Each task inherits the parameters specified for the batch script. This is why --ntasks=1 needs to be specified for each srun task, otherwise each task uses --ntasks=2 and so the second command will not run until the first task has finished.
Another caveat of the tasks inheriting the batch parameters is if --export=NONE is specified as a batch parameter. In this case --export=ALL should be specified for each srun command otherwise environment variables set within the sbatch script are not inherited by the srun command.
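A minimal sketch of that combination (the variable name MY_VAR is illustrative):
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --export=NONE
export MY_VAR=hello
# Without --export=ALL the job steps would not see MY_VAR set above.
srun --ntasks=1 --export=ALL bash -c 'echo "$MY_VAR"' &
srun --ntasks=1 --export=ALL bash -c 'echo "$MY_VAR"' &
wait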
Additional notes:
When using bash pipes, it may be necessary to specify --nodes=1 to prevent commands either side of the pipes running on separate nodes.
When using & to run commands simultaneously, the wait is vital. Without it, the batch script reaches its end while the background job steps are still running, and those steps are cancelled.
The "--ntasks" options specifies how many instances of your command are executed.
For a common cluster setup and if you start your command with "srun" this corresponds to the number of MPI ranks.
In contrast the option "--cpus-per-task" specify how many CPUs each task can use.
Your output surprises me as well. Have you launched your command in the script or via srun?
Does you script look like:
#!/bin/bash
#SBATCH --ntasks=8
## more options
echo hello
This should always output only a single line, because the batch script itself is executed only once (on the first node of the allocation), not once per task.
If your script looks like this:
#!/bin/bash
#SBATCH --ntasks=8
## more options
srun echo hello
srun launches your command as a job step on the worker nodes, and as a result you should get 8 lines of hello.
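To see that the 8 copies really are distinct tasks, a small variation of this sketch echoes the task rank instead:
#!/bin/bash
#SBATCH --ntasks=8
# Each of the 8 tasks prints its own rank (0..7).
srun bash -c 'echo "hello from task ${SLURM_PROCID}"'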
Tasks are processes that a job executes in parallel on one or more nodes. sbatch allocates resources for your job, but even if you request resources for multiple tasks, it will launch your job script in a single process on a single node only. srun is used to launch job steps from the batch script. --ntasks=N instructs srun to execute N copies of the job step.
For example,
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
means that you want to run two processes in parallel, and have each process access two CPUs. sbatch will allocate four CPUs for your job and then start the batch script in a single process. Within your batch script, you can create a parallel job step using
srun --ntasks=2 --cpus-per-task=2 step.sh
This will run two processes in parallel, both of them executing the step.sh script. From the same job, you could also run
srun --ntasks=1 --cpus-per-task=4 step.sh
This would launch a single process that can access all four CPUs (although it would issue a warning).
It's worth noting that within the allocated resources, your job script is free to do anything, and it doesn't have to use srun to create job steps (but you do need srun to launch a job step on multiple nodes). For example, the following script will run both steps in parallel:
#!/bin/bash
#SBATCH --ntasks=1
step1.sh &
step2.sh &
wait
If you want to launch job steps using srun and have two different steps run in parallel, then your job needs to allocate two tasks, and your job steps need to request only one task. You also need to provide the --exclusive argument to srun, for the job steps to use separate resources.
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 --exclusive step1.sh &
srun --ntasks=1 --exclusive step2.sh &
wait

Running slurm script with multiple nodes, launch job steps with 1 task

I am trying to launch a large number of job steps using a batch script. The different steps can be completely different programs and do need exactly one CPU each. First I tried doing this using the --multi-prog argument to srun. Unfortunately, when using all CPUs assigned to my job in this manner, performance degrades massively. The run time increases to almost its serialized value. By undersubscribing I could ameliorate this a little. I couldn't find anything online regarding this problem, so I assumed it to be a configuration problem of the cluster I am using.
So I tried going a different route. I implemented the following script (launched via sbatch my_script.slurm):
#!/bin/bash
#SBATCH -o $HOME/slurm/slurm_out/%j.%N.out
#SBATCH --error=$HOME/slurm/slurm_out/%j.%N.err_out
#SBATCH --get-user-env
#SBATCH -J test
#SBATCH -D $HOME/slurm
#SBATCH --export=NONE
#SBATCH --ntasks=48
NR_PROCS=$(($SLURM_NTASKS))
for PROC in $(seq 0 $(($NR_PROCS-1)));
do
#My call looks like this:
#srun --exclusive -n1 bash $PROJECT/call_shells/call_"$PROC".sh &
srun --exclusive -n1 hostname &
pids[${PROC}]=$! #Save PID of this background process
done
for pid in ${pids[*]};
do
wait ${pid} #Wait on all PIDs, this returns 0 if ANY process fails
done
I am aware that the --exclusive argument is not really needed in my case. The shell scripts being called contain the different binaries and their arguments. The remaining part of my script relies on the fact that all processes have finished, hence the wait. I changed the calling line to make it a minimal working example.
At first this seemed to be the solution. Unfortunately when increasing the number of nodes used in my job allocation (for example by increasing --ntasks to a number larger than the number of CPUs per node in my cluster), the script does not work as expected anymore, returning
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
and continuing on only one node (i.e. 48 CPUs in my case, which work through the job steps as fast as before; all processes on the other node(s) are subsequently killed).
This seems to be the expected behaviour, but I can't really understand it. Why does every job step in a given allocation need to include at least as many tasks as there are nodes in the allocation? I ordinarily do not care at all about the number of nodes used in my allocation.
How can I implement my batch script, so it can be used on multiple nodes reliably?
Found it! The nomenclature and the many command-line options to Slurm confused me. The solution is given by
#!/bin/bash
#SBATCH -o $HOME/slurm/slurm_out/%j.%N.out
#SBATCH --error=$HOME/slurm/slurm_out/%j.%N.err_out
#SBATCH --get-user-env
#SBATCH -J test
#SBATCH -D $HOME/slurm
#SBATCH --export=NONE
#SBATCH --ntasks=48
NR_PROCS=$(($SLURM_NTASKS))
for PROC in $(seq 0 $(($NR_PROCS-1)));
do
#My call looks like this:
#srun --exclusive -N1 -n1 bash $PROJECT/call_shells/call_"$PROC".sh &
srun --exclusive -N1 -n1 hostname &
pids[${PROC}]=$! #Save PID of this background process
done
for pid in ${pids[*]};
do
wait ${pid} #Wait on all PIDs, this returns 0 if ANY process fails
done
Adding -N1 (the only change relative to the original script) specifies that each job step runs on exactly one node with a single task.
