SLURM: How to start a script once per node

I have a big cluster available through SLURM.
I want to start my script, e.g. ./calc, on every requested node with a specified number of cores, so for example on 2 nodes with 16 cores each.
I start it with an sbatch script:
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
srun -N 1 ./calc 2 &
srun -N 1 ./calc 2 &
wait
It doesn't work as intended though.
I tried many combinations of --ntasks, --nodes, and --cpus-per-task, but nothing worked and I'm very lost.
I also don't understand the difference between a task and a CPU in SLURM.

In your example, you ask Slurm to launch 16 tasks per node on 2 nodes. Each backgrounded srun then inherits that per-node task count, so the job will probably end up running 16 copies of ./calc per srun instead of the 2 copies in total that you intended.
For your needs, you don't have to request 2 nodes explicitly unless the two runs really must land on two different nodes.
For your example, run the following sbatch script:
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
#SBATCH --hint=nomultithread
srun <my program>
In this example, Slurm will run the program 2 times with 16 cores each. The nomultithread hint is optional and depends on the cluster configuration: if hyper-threading is enabled, without it the 16 CPUs would be virtual CPUs (hardware threads) rather than physical cores.
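As a quick sanity check of the task/CPU split, a throwaway step like the one below can help (a sketch, not part of the original answer; on some Slurm versions srun does not inherit --cpus-per-task from sbatch, hence the explicit -c on the srun line):
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
# Each task reports its rank, its host, and the CPUs it was given.
srun -c 16 bash -c 'echo "task $SLURM_PROCID on $(hostname) with $SLURM_CPUS_PER_TASK CPUs"'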

I found this to be a working solution. It turned out the most important thing was to define all the parameters: nodes, tasks, and CPUs.
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
srun -N 1 -n 1 -c 16 ./calc 2 &
srun -N 1 -n 1 -c 16 ./calc 2 &
wait
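If the node count varies between submissions, the same pattern can be generalized with a loop over the allocated nodes. A sketch along those lines (not part of the original solution), assuming one 16-core run of ./calc per node:
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
# Launch one 16-core step per allocated node in the background,
# then wait for all of them to finish.
for i in $(seq 1 "$SLURM_JOB_NUM_NODES"); do
    srun -N 1 -n 1 -c 16 ./calc 2 &
done
wait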

Related

Using OpenMP and OpenMPI together under Slurm

I have written a C++ code that uses both OpenMP and OpenMPI. I want to use (let's say) 3 nodes (so size_Of_Cluster should be 3) and use OpenMP in each node to parallelize the for loop (there are 24 cores in a node). In essence, I want MPI ranks to be assigned to nodes. The Slurm script I have written is as follows. (I have tried many variations but could not come up with the "correct" one. I would be grateful if you could help me.)
#!/bin/bash
#SBATCH -N 3
#SBATCH -n 72
#SBATCH -p defq
#SBATCH -A akademik
#SBATCH -o %J.out
#SBATCH -e %J.err
#SBATCH --job-name=MIXED
module load slurm
module load shared
module load gcc
module load openmpi
export OMP_NUM_THREADS=24
mpirun -n 3 --bynode ./program
Using srun did not help.
The relevant lines are:
#SBATCH -N 3
#SBATCH -n 72
export OMP_NUM_THREADS=24
This means you have 72 MPI processes, and each creates 24 threads. For that to be efficient you would need 24x72 cores, which you don't have. You should specify:
#SBATCH -n 3
Then you will have 3 processes, with 24 threads per process.
You don't have to worry about the placement of the ranks on the nodes: that is done by the runtime. You could, for instance, have each process print the result of MPI_Get_processor_name to confirm the placement.
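Put together, a batch script matching this answer could look roughly like the sketch below (the partition, account, module names, and program path are copied from the question; deriving OMP_NUM_THREADS from SLURM_CPUS_PER_TASK is one common convention, not the only option):
#!/bin/bash
#SBATCH -N 3
#SBATCH --ntasks=3           # one MPI rank per node
#SBATCH --cpus-per-task=24   # 24 cores per rank for the OpenMP threads
#SBATCH -p defq
#SBATCH -A akademik
#SBATCH -o %J.out
#SBATCH -e %J.err
#SBATCH --job-name=MIXED
module load slurm shared gcc openmpi
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./program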

How to run jobs in parallel using one slurm batch script?

I am trying to run multiple python scripts in parallel with one Slurm batch script. Take a look at the example below:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --partition=All
#SBATCH --time=5:00
srun sleep 60
srun sleep 60
wait
How do I tweak the script so that the execution takes only 60 seconds (instead of 120)? Splitting the script into two scripts is not an option.
As written, that script is running two sleep commands in parallel, two times in a row.
Each srun command initiates a step, and since you set --ntasks=2 each step instantiates two tasks (here the sleep command).
If you want to run two 1-task steps in parallel, you should write it this way:
srun --exclusive -n 1 -c 1 sleep 60 &
srun --exclusive -n 1 -c 1 sleep 60 &
wait
Then each step only instantiates one task, and is backgrounded by the & delimiter, meaning the next srun can start immediately. The wait command makes sure the script terminates only when both steps are finished.
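Putting those pieces back into the script from the question, the whole job could look like this (same directives as in the question; only the step lines change):
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --partition=All
#SBATCH --time=5:00
# Two 1-task steps run side by side inside the 2-task allocation,
# so the job finishes in roughly 60 seconds instead of 120.
srun --exclusive -n 1 -c 1 sleep 60 &
srun --exclusive -n 1 -c 1 sleep 60 &
wait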
In that context, the xargs command and GNU parallel can be useful to avoid writing multiple identical srun lines or resorting to a for-loop.
For instance, if you have multiple files you need to run your script over:
find /path/to/data/*.csv -print0 | xargs -0 -n1 -P $SLURM_NTASKS srun -n1 --exclusive python my_python_script.py
This is equivalent to writing as many srun lines as there are files:
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file1.csv &
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file2.csv &
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file3.csv &
[...]
GNU parallel is useful to iterate over parameter values:
parallel -P $SLURM_NTASKS srun -n1 --exclusive python my_python_script.py ::: {1..1000}
will run
python my_python_script.py 1
python my_python_script.py 2
python my_python_script.py 3
...
python my_python_script.py 1000
Another approach is to just run
srun python my_python_script.py
and, inside the Python script, to look for the SLURM_PROCID environment variable and split the work according to its value. The srun command will start multiple instances of the script and each will 'see' a different value for SLURM_PROCID.
import os
print(os.environ['SLURM_PROCID'])
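The batch script driving that approach can stay very small; a sketch (the script name is the same placeholder as above, and the task count of 4 is arbitrary):
#!/bin/bash
#SBATCH --ntasks=4
# srun starts 4 copies of the script; each copy sees a different
# SLURM_PROCID (0..3) and picks its share of the work accordingly.
srun python my_python_script.py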

What does the keyword --exclusive mean in slurm?

This is a follow-up question to [How to run jobs in parallel using one slurm batch script?]. The goal was to create a single sbatch script that starts multiple processes and runs them in parallel. The answer given by damienfrancois was very detailed and looked something like this:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --partition=All
srun -n 1 -c 1 --exclusive sleep 60 &
srun -n 1 -c 1 --exclusive sleep 60 &
....
wait
However, I am not able to understand the --exclusive keyword. If I use it, one node of the cluster is chosen and all processes are launched there, whereas I would like Slurm to distribute the sleep steps over the entire cluster.
So how does the --exclusive keyword work? According to the Slurm documentation, the restriction to one node should not happen, since the keyword is used within a step allocation.
(I am new to Slurm.)

Do I need a single bash file for each task in SLURM?

I am trying to launch several tasks on a SLURM-managed cluster, and would like to avoid dealing with dozens of files.
Right now, I have 50 tasks (subscripted by i; for simplicity, i is also the input parameter of my program), and for each one a separate bash file slurm_run_i.sh which sets the computation configuration and the srun command:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -J pltCV
#SBATCH --mem=30G
srun python plotConvergence.py i
I am then using another bash file to submit all these tasks, slurm_run_all.sh
#!/bin/bash
for i in {1..50}; do
    sbatch slurm_run_$i.sh
done
This works (50 jobs are running on the cluster), but I find it troublesome to have more than 50 input files. Searching for a solution, I came up with the & command, obtaining something like:
#!/bin/bash
#SBATCH --ntasks=50
#SBATCH --cpus-per-task=1
#SBATCH -J pltall
#SBATCH --mem=30G
# Running jobs
srun python plotConvergence.py 1 &
srun python plotConvergence.py 2 &
...
srun python plotConvergence.py 49 &
srun python plotConvergence.py 50 &
wait
echo "All done"
This seems to run as well. However, I cannot manage each of these jobs independently: the output of squeue shows I have a single job (pltall) running on a single node. As there are only 12 cores on each node in the partition I am working in, I am assuming most of my jobs are waiting on the single node I have been allocated. Setting the -N option doesn't change anything either. Moreover, I cannot cancel individual jobs anymore if I realize there is a mistake somewhere, which sounds problematic to me.
Is my interpretation right, and is there a better way than my attempt to process several jobs in Slurm without getting lost among many files?
What you are looking for is the job array feature of Slurm.
In your case, you would have a single submission file (slurm_run.sh) like this:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -J pltCV
#SBATCH --mem=30G
#SBATCH --array=1-50
srun python plotConvergence.py ${SLURM_ARRAY_TASK_ID}
and then submit the array of jobs with
sbatch slurm_run.sh
You will see that you have 50 jobs submitted. You can cancel all of them at once or one by one. See the man page of sbatch for details.
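Managing the array afterwards works with the usual tools, for example (the job ID 1234 is hypothetical):
squeue -u $USER   # array tasks appear as 1234_1, 1234_2, ..., 1234_50
scancel 1234      # cancel the whole array
scancel 1234_7    # cancel only array task 7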

What happens if I am running more subjobs than the number of cores allocated

So I have an sbatch (Slurm job scheduler) script in which I am processing a lot of data through 3 scripts: foo1.sh, foo2.sh and foo3.sh.
foo1.sh and foo2.sh are independent and I want to run them simultaneously.
foo3.sh needs the outputs of foo1.sh and foo2.sh so I am building a dependency.
And then I have to repeat it 30 times.
Let's say:
## Resources config
#SBATCH --ntasks=30
#SBATCH --task-per-core=1
for i in {1..30};
do
srun -n 1 --jobid=foo1_$i ./foo1.sh &
srun -n 1 --jobid=foo2_$i ./foo2.sh &
srun -n 1 --jobid=foo3_$i --dependency=afterok:foo1_$i:foo2_$i ./foo3.sh &
done;
wait
The idea is that I launch foo1_1 and foo2_1, but since foo3_1 has to wait for the two other jobs to finish, I want to move on to the next iteration. The next iteration launches foo1_2 and foo2_2, while foo3_2 waits, and so on.
At some point, the number of subjobs launched with srun will be higher than --ntasks=30. What is going to happen? Will srun wait for a previous step to finish (the behavior I am looking for)?
Thanks
Slurm will run 30 sruns, but the 31st will wait until a core is freed within your 30-core allocation.
Note that the proper argument is --ntasks-per-core=1, not --task-per-core=1.
You can test it by yourself using salloc rather than sbatch to work interactively:
$ salloc --ntasks=2 --ntasks-per-core=1
$ srun -n 1 sleep 10 & srun -n 1 sleep 10 & time srun -n 1 echo ok
[1] 2734
[2] 2735
ok
[1]- Done srun -n 1 sleep 10
[2]+ Done srun -n 1 sleep 10
real 0m10.201s
user 0m0.072s
sys 0m0.028s
You can see that the simple echo took 10 seconds because the third srun had to wait until the first two had finished, as the allocation is only two cores.
What should happen is that, if you kick off more subtasks than you have cores or hyperthreads, the OS scheduling algorithms handle prioritizing the tasks. Depending on which OS you are running (even if they are all Unix-based), the way this is implemented under the hood will differ.
But you are correct in your assumption that if you run out of cores, your parallel tasks must, in a sense, 'wait their turn'.
