I have seen the following two very similar schemes used when submitting jobs with multiple steps to slurm:
On the one hand you can do
#SBATCH -N1 -c1 -n5 # 5 tasks in total on 1 node, 1 cpu per task
for j in {1..4}; do
srun --exclusive -n1 script $j &
done
srun --exclusive -n1 script 5
wait
On the other hand you can do
#SBATCH -N1 -c1 -n5 # 5 tasks in total on 1 node, 1 cpu per task
for j in {1..5}; do
srun --exclusive -n1 script $j &
done
wait
Because the job should have only 5 CPUs allocated to it, I don't really understand how the second one can work correctly: after four job steps have been started with srun, there is no way the scheduler can allocate a fifth step 'in the background' and then return to the original script... where would the original script run? (I admit my knowledge of these things is pretty limited, though.)
However, I have personally tested both ways and they both seem to work exactly the same. The second script is a bit simpler in my opinion, and when dealing with somewhat larger input scripts this can be an advantage, but I'm worried that I don't understand 100% what is going on here. Is there a preferred way to do this? What is the difference? What is Bash/slurm doing behind the scenes?
They both work the same way in principle, though the second one is clearer. Each invocation of srun starts one job step on one of the five allocated CPUs (probably a separate CPU each, although if the steps finish very quickly they could end up reusing a subset of the sbatch-allocated CPUs). The batch script itself runs as the batch step of the allocation and does not take up one of the five task slots, which is why all five steps can run in the background at the same time.
Note that the first example still needs the wait: the last srun runs in the foreground, but the four backgrounded steps may outlive it, and without the wait the batch script would exit and Slurm would kill any steps that are still running.
By the way, be careful with the loop variable in the first example: in Bash, $j is not local to the for-loop and keeps its last value (4) after the loop ends, so if the final command reused $j instead of the literal 5 it would run script 4 a second time.
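If you want to convince yourself of this on your own cluster, here is a small sketch along the same lines (it assumes the compute node has the Linux taskset utility) that prints the host and CPU binding of each step:
#SBATCH -N1 -c1 -n5
for j in {1..5}; do
    # each step reports which host it landed on and which CPU(s) it was bound to
    srun --exclusive -n1 bash -c 'echo "host: $(hostname)  cpus: $(taskset -cp $$)"' &
done
wait
You should normally see five lines with the same host but different CPU numbers, confirming that the five steps ran side by side.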
Hello friendly people,
my question is rather specific.
For more than a week I have been trying to submit thousands of single-thread jobs for a scientific experiment using sbatch and srun.
The problem is that these jobs may take different amounts of time to finish and some may even be aborted as they exceed the memory limit. Both behaviors are fine and my evaluation deals with them.
But I am facing the problem that some of the jobs are never started, even though they have been submitted.
My sbatch script looks like this:
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
wait 5s
done
Now, my error log shows the following message:
srun: Job 1846955 step creation temporarily disabled, retrying
1) What does 'step creation temporarily disabled' mean? Are all CPUs busy and the step simply dropped, or is it started again later when resources are free?
2) Why are some of my jobs never carried out, and how can I fix that? Am I using the correct parameters for srun?
Thanks for your help!
srun: Job 1846955 step creation temporarily disabled, retrying
This is normal: you reserve 4 x 12 = 48 CPUs and start 500 instances of srun. Only 48 instances run at a time, while the others output that message. Whenever a running instance finishes, a pending instance starts.
wait 5s
The wait command is used to wait for processes, not for a certain amount of time. For that, use the sleep command. The wait command must be at the end of the script. Otherwise, the job could stop before all srun instances have finished.
So the script should look like this:
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
done
wait
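The retry messages are harmless, but if you would rather not have hundreds of pending srun processes sitting on the batch node and filling the log, you can throttle the loop. This is only a sketch and assumes Bash 4.3 or newer (for wait -n):
#!/usr/bin/bash
#SBATCH --nodes=4
#SBATCH --tasks-per-node=12
#SBATCH --mem-per-cpu=10000
for i in {1..500}
do
    # block while as many steps are running as there are task slots (48 here)
    while (( $(jobs -rp | wc -l) >= SLURM_NTASKS )); do
        wait -n   # returns as soon as any one background step finishes
    done
    srun -N1 -n1 -c1 --exclusive --time=60 ${mybinary} $i &
done
wait
With this, at most $SLURM_NTASKS steps are in flight at any time, and a new one is launched as soon as any running step finishes.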
I have a complex model written in Matlab. The model was not written by us and is best thought of as a "black box", i.e. fixing the relevant problems from the inside would require rewriting the entire model, which would take years.
If I have an "embarrassingly parallel" problem I can use an array to submit X variations of the same simulation with the option #SBATCH --array=1-X. However, clusters normally have a (frustratingly small) limit on the maximum array size.
Whilst using a PBS/TORQUE cluster I have got around this problem by forcing Matlab to run on a single thread, requesting multiple CPUs and then running multiple instances of Matlab in the background. An example submission script is:
#!/bin/bash
<OTHER PBS COMMANDS>
#PBS -l nodes=1:ppn=5,walltime=30:00:00
#PBS -t 1-600
<GATHER DYNAMIC ARGUMENTS FOR MATLAB FUNCTION CALLS BASED ON ARRAY NUMBER>
# define Matlab options
options="-nodesktop -noFigureWindows -nosplash -singleCompThread"
for sub_job in {1..5}
do
<GATHER DYNAMIC ARGUMENTS FOR MATLAB FUNCTION CALLS BASED ON LOOP NUMBER (i.e. sub_job)>
matlab ${options} -r "run_model(${arg1}, ${arg2}, ..., ${argN}); exit" &
done
wait
<TIDY UP AND FINISH COMMANDS>
Can anyone help me do the equivalent on a SLURM cluster?
The parfor function will not run my model in a parallel loop in Matlab.
The PBS/TORQUE language was very intuitive but SLURM's is confusing me. Assuming a similarly structured submission script as my PBS example, here is what I think certain commands will result in.
--cpus-per-task=5 seems like the most obvious one to me. Would I put srun in front of the matlab command in the loop, or leave it as it is in the PBS script loop?
--ntasks=5, I would imagine, would request 5 CPUs but run in serial unless a program specifically requests them (e.g. MPI or multi-threaded Python). Would I need to put srun in front of the Matlab command in this case?
I am not a big expert on array jobs but I can help you with the inner loop.
I would always use GNU parallel to run several serial processes in parallel, within a single job that has more than one CPU available. It is a simple Perl script, so not difficult to 'install', and its syntax is extremely easy. What it basically does is run some (nested) loop in parallel. Each iteration of this loop contains a (long) process, like your Matlab command. In contrast to your solution it does not start all these processes at once; it runs only N processes at the same time (where N is the number of CPUs you have available). As soon as one finishes, the next one is started, and so on until your entire loop is finished. It is perfectly fine that not all processes take the same amount of time: as soon as one CPU is freed, another process is started.
Then, what you would like to do is launch 600 jobs (for which I substitute 3 below, to show the complete behaviour), each with 5 CPUs. To do that you could do the following (I have not included the actual Matlab run, but it can trivially be added):
#!/bin/bash
#SBATCH --job-name example
#SBATCH --out job.slurm.out
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 5
#SBATCH --mem 512
#SBATCH --time 30:00:00
#SBATCH --array 1-3
cmd="echo matlab array=${SLURM_ARRAY_TASK_ID}"
parallel --max-procs=${SLURM_CPUS_PER_TASK} "$cmd,subjob={1}; sleep 30" ::: {1..5}
Submitting this job using:
$ sbatch job.slurm
submits 3 jobs to the queue. For example:
$ squeue | grep tdegeus
3395882_1 debug example tdegeus R 0:01 1 c07
3395882_2 debug example tdegeus R 0:01 1 c07
3395882_3 debug example tdegeus R 0:01 1 c07
Each job gets 5 CPUs. These are exploited by the parallel command to run your inner loop in parallel. Once again, the range of this inner loop may be (much) larger than 5; parallel takes care of the balancing over the 5 available CPUs within this job.
Let's inspect the output:
$ cat job.slurm.out
matlab array=2,subjob=1
matlab array=2,subjob=2
matlab array=2,subjob=3
matlab array=2,subjob=4
matlab array=2,subjob=5
matlab array=1,subjob=1
matlab array=3,subjob=1
matlab array=1,subjob=2
matlab array=1,subjob=3
matlab array=1,subjob=4
matlab array=3,subjob=2
matlab array=3,subjob=3
matlab array=1,subjob=5
matlab array=3,subjob=4
matlab array=3,subjob=5
You can clearly see that the 3 x 5 processes run at the same time (their output is interleaved).
There is no need in this case to use srun. SLURM creates 3 jobs, and within each job everything happens on its own compute node (i.e. as if you were running on your own system).
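To illustrate the earlier remark that the inner range may be (much) larger than the number of CPUs, the same parallel line with, say, twenty sub-jobs would still keep at most five of them running at any one time:
parallel --max-procs=${SLURM_CPUS_PER_TASK} "$cmd,subjob={1}; sleep 30" ::: {1..20}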
Installing GNU Parallel - option 1
To 'install' GNU Parallel into your home folder, for example in ~/opt:
Download the latest GNU Parallel.
Make the directory ~/opt if it does not yet exist
mkdir $HOME/opt
'Install' GNU Parallel:
tar jxvf parallel-latest.tar.bz2
cd parallel-XXXXXXXX
./configure --prefix=$HOME/opt
make
make install
Add ~/opt to your path:
export PATH=$HOME/opt/bin:$PATH
(To make it permanent, add that line to your ~/.bashrc.)
Installing GNU Parallel - option 2
Use conda.
(Optional) Create a new environment
conda create --name myenv
Load an existing environment:
conda activate myenv
Install GNU parallel:
conda install -c conda-forge parallel
Note that the command is available only when the environment is loaded.
While Tom's suggestion to use GNU Parallel is a good one, I will attempt to answer the question asked.
If you want to run 5 instances of the matlab command with the same arguments (for example if they were communicating via MPI) then you would want to ask for --cpus-per-task=1 and --ntasks=5, and you should preface your matlab line with srun and get rid of the loop.
In your case, as each of your 5 calls to matlab is independent, you want to ask for --cpus-per-task=5 and --ntasks=1. This will ensure that you allocate 5 CPU cores per job to do with as you wish. You can preface your matlab line with srun if you wish, but it will make little difference as you are only running one task.
Of course, this is only efficient if each of your 5 matlab runs takes roughly the same amount of time, since if one takes much longer the other 4 CPU cores will sit idle, waiting for the fifth to finish.
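For completeness, here is a minimal sketch of the original PBS script translated to SLURM, assuming your cluster's maximum array size allows 600 elements and keeping the placeholder structure of the question (arg1 ... argN stand for whatever your dynamic-argument gathering produces):
#!/bin/bash
#SBATCH --array=1-600
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --time=30:00:00
# <GATHER DYNAMIC ARGUMENTS FOR MATLAB FUNCTION CALLS BASED ON ${SLURM_ARRAY_TASK_ID}>
options="-nodesktop -noFigureWindows -nosplash -singleCompThread"
for sub_job in {1..5}
do
    # <GATHER DYNAMIC ARGUMENTS BASED ON ${sub_job}>
    matlab ${options} -r "run_model(${arg1}, ${arg2}, ..., ${argN}); exit" &
done
wait
# <TIDY UP AND FINISH COMMANDS>
As in the PBS version, no srun is needed in front of the matlab line: the five backgrounded Matlab processes simply share the five cores of the allocation, and the wait keeps the job alive until all of them have finished.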
You can do it with Python and subprocess; in what I describe below you just set the number of nodes and tasks and that is it: no need for an array, no need to match the size of the array to the number of simulations, etc. It will just execute the Python code until it is done; more nodes means faster execution.
Also, it is easier to decide on variables, as everything is prepared in Python (which is easier than Bash).
It does assume that the Matlab scripts save their output to file; nothing is returned by this function (that can be changed if needed).
In the sbatch script you need to add something like this:
#!/bin/bash
#SBATCH --output=out_cluster.log
#SBATCH --error=err_cluster.log
#SBATCH --time=8:00:00
#SBATCH --nodes=36
#SBATCH --exclusive
#SBATCH --cpus-per-task=2
export IPYTHONDIR="`pwd`/.ipython"
export IPYTHON_PROFILE=ipyparallel.${SLURM_JOBID}
whereis ipcontroller
sleep 3
echo "===== Beginning ipcontroller execution ======"
ipcontroller --init --ip='*' --nodb --profile=${IPYTHON_PROFILE} --ping=30000 & # --sqlitedb
echo "===== Finish ipcontroller execution ======"
sleep 15
srun ipengine --profile=${IPYTHON_PROFILE} --timeout=300 &
sleep 75
echo "===== Beginning python execution ======"
python run_simulations.py
Depending on your system, read more here: https://ipyparallel.readthedocs.io/en/latest/process.html
and run_simulations.py should contain something like this:
import os
from ipyparallel import Client
import sys
from tqdm import tqdm
import subprocess
from subprocess import PIPE

def run_sim(x):
    # imports repeated here because this function runs on the remote engines
    import os
    import subprocess
    from subprocess import PIPE
    # send job!
    params = [str(i) for i in x]
    p1 = subprocess.Popen(['matlab', '-r', f'run_model({params[0]},{params[1]}); exit'],
                          env=dict(**os.environ))
    p1.wait()
    return

## load ipython parallel
rc = Client(profile=os.getenv('IPYTHON_PROFILE'))
print('Using ipyparallel with %d engines' % len(rc))
lview = rc.load_balanced_view()
view = rc[:]
sys.stdout.flush()
map_function = lview.map_sync
to_send = []
# prepare variables  <-- here you should prepare the arguments for matlab
####################
for param_1 in [1, 2, 3, 4]:
    for param_2 in [10, 20, 40]:
        to_send.append([param_1, param_2])

ind_raw_features = lview.map_async(run_sim, to_send)
all_results = []
print('Sending jobs'); sys.stdout.flush()
for i in tqdm(ind_raw_features, file=sys.stdout):
    all_results.append(i)
You also get a progress bar on stdout, which is nice... You can also easily add a check for whether the output files already exist and skip those runs.
I found some very similar questions which helped me arrive at a script that seems to work; however, I'm still unsure whether I fully understand why, hence this question.
My problem (example): On 3 nodes, I want to run 12 tasks on each node (so 36 tasks in total). Also each task uses OpenMP and should use 2 CPUs. In my case a node has 24 CPUs and 64GB memory. My script would be:
#SBATCH --nodes=3
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
export OMP_NUM_THREADS=2
for i in {1..36}; do
srun -N 1 -n 1 ./program input${i} >& out${i} &
done
wait
This seems to work as I require, successively running tasks on a node until all CPUs on that node are in use, and then continuing to run further tasks on the next node until all CPUs are used again, etc.
My question: I'm not sure whether this is actually what it does, as I didn't fully understand the man page of srun regarding -n, and I have not used srun before.
Mainly my confusion comes from -n: the man page for -n says "The default is one task per node, ...", so I expected that if I use srun -n 1 only one task would be run on each node, which doesn't seem to be the case.
Furthermore, when I tried e.g. srun -n 2 ./program, it seemed to just run the exact same program twice as two different tasks, with no way to use different input files... I can't think of why that would ever be useful.
Your setup is correct except that you must use the --exclusive option of srun (which has a different meaning in this case than for sbatch).
As for your remark regarding the usefulness of srun: the behaviour of the program can be changed based on the environment variable $SLURM_PROCID (the task rank), or on the MPI rank in the case of an MPI program. Your confusion arises from the fact that your program is not written to be parallel (apart from the 2 OpenMP threads), while srun is meant to start parallel programs, most of the time based on MPI.
Another way is to run all your tasks at once.
Since the input and output files depend on the rank, a wrapper is needed.
Your SLURM script would be
#SBATCH --nodes=3
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=2000
export OMP_NUM_THREADS=2
srun -n 36 ./program.sh
and your wrapper program.sh would be
#!/bin/sh
exec ./program input${SLURM_PROCID} > out${SLURM_PROCID} 2>&1
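Two small practical notes: the wrapper has to be executable, e.g.
chmod +x program.sh
and SLURM_PROCID counts from 0, so with -n 36 this version reads input0 ... input35 and writes out0 ... out35, rather than input1 ... input36 as in the loop version.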
I'm trying to run a Particle Swarm Optimization problem on a cluster using SLURM, with the optimization algorithm managed by a single-core matlab process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges. Each MPI call takes up to 20 minutes.
I initially naively submitted each MPI call as a separate SLURM job, but the resulting queue time made it slower than running each job locally in serial. I am now trying to figure out a way to submit an N node job that will continuously run MPI tasks to utilize the available resources. The matlab process would manage this job with text file flags.
Here is a pseudo-code bash file that might help to illustrate what I am trying to do on a smaller scale:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt” == 1>
do
for i in {0..40}
do
if <"RunJob_i.txt" == 1>
then
mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
fi
done
done
wait
This approach doesn't work (just crashes), but I don't know why (probably overutilization of resources?). Some of my peers have suggested using parallel with srun, but as far as I can tell this requires that I call the MPI functions in batches. This will be a huge waste of resources, as a significant portion of the runs finish or fail quickly (this is expected behavior). A concrete example of the problem would be starting a batch of 5 8-core jobs and having 4 of them crash immediately; now 32 cores would be doing nothing while they wait up to 20 minutes for the 5th job to finish.
Since the optimization will likely require upwards of 5000 mpi calls, any increase in efficiency will make a huge difference in absolute walltime. Does anyone have any advice as to how I could run a constant stream of MPI calls on a large SLURM job? I would really appreciate any help.
A couple of things: under SLURM you should be using srun, not mpirun.
The second thing is that the pseudo-code you provided launches an infinite number of jobs without waiting for any completion signal. You should put the wait inside the while loop (after the inner for loop), so you launch one set of jobs, wait for them to finish, evaluate the condition and, maybe, launch the next set:
#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 4 # total number of tasks in this job
#SBATCH -c 8 # number of processor cores (CPUs) for each task
# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0
# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run Command
while <"KeepRunning.txt” == 1>
do
for i in {0..40}
do
if <"RunJob_i.txt" == 1>
then
srun -n 8 --exclusive <job_i> &
fi
done
wait
<Update "KeepRunning.txt”>
done
Take care also to distinguish tasks and cores: -n says how many tasks will be used, -c says how many CPUs per task will be allocated.
The code I wrote launches 41 job steps in the background (from 0 to 40, inclusive), but each of them only starts once the resources are available (--exclusive), waiting while they are occupied. Each step will use 8 CPUs. Then you wait for them to finish, and I assume that you update KeepRunning.txt after that round.
I am trying to launch a large number of job steps using a batch script. The different steps can be completely different programs and do need exactly one CPU each. First I tried doing this using the --multi-prog argument to srun. Unfortunately, when using all CPUs assigned to my job in this manner, performance degrades massively. The run time increases to almost its serialized value. By undersubscribing I could ameliorate this a little. I couldn't find anything online regarding this problem, so I assumed it to be a configuration problem of the cluster I am using.
So I tried going a different route. I implemented the following script (launched via sbatch my_script.slurm):
#!/bin/bash
#SBATCH -o $HOME/slurm/slurm_out/%j.%N.out
#SBATCH --error=$HOME/slurm/slurm_out/%j.%N.err_out
#SBATCH --get-user-env
#SBATCH -J test
#SBATCH -D $HOME/slurm
#SBATCH --export=NONE
#SBATCH --ntasks=48
NR_PROCS=$(($SLURM_NTASKS))
for PROC in $(seq 0 $(($NR_PROCS-1)));
do
#My call looks like this:
#srun --exclusive -n1 bash $PROJECT/call_shells/call_"$PROC".sh &
srun --exclusive -n1 hostname &
pids[${PROC}]=$! #Save PID of this background process
done
for pid in ${pids[*]};
do
wait ${pid} # wait returns the exit status of that PID, so a non-zero value means that step failed
done
I am aware that the --exclusive argument is not really needed in my case. The shell scripts called contain the different binaries and their arguments. The remaining part of my script relies on the fact that all processes have finished, hence the wait. I changed the calling line to make it a minimal working example.
At first this seemed to be the solution. Unfortunately when increasing the number of nodes used in my job allocation (for example by increasing --ntasks to a number larger than the number of CPUs per node in my cluster), the script does not work as expected anymore, returning
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
and continuing using only one node (i.e. 48 CPUs in my case, which go through the job steps as fast as before; all processes on the other node(s) are subsequently killed).
This seems to be the expected behaviour, but I can't really understand it. Why is it that every job step in a given allocation needs to include a minimum number of tasks equal to the number of nodes included in the allocation? I ordinarily really do not care at all about the number of nodes used in my allocation.
How can I implement my batch script, so it can be used on multiple nodes reliably?
Found it! The nomenclature and the many command line options to slurm confused me. The solution is given by
#!/bin/bash
#SBATCH -o $HOME/slurm/slurm_out/%j.%N.out
#SBATCH --error=$HOME/slurm/slurm_out/%j.%N.err_out
#SBATCH --get-user-env
#SBATCH -J test
#SBATCH -D $HOME/slurm
#SBATCH --export=NONE
#SBATCH --ntasks=48
NR_PROCS=$(($SLURM_NTASKS))
for PROC in $(seq 0 $(($NR_PROCS-1)));
do
#My call looks like this:
#srun --exclusive -N1 -n1 bash $PROJECT/call_shells/call_"$PROC".sh &
srun --exclusive -N1 -n1 hostname &
pids[${PROC}]=$! #Save PID of this background process
done
for pid in ${pids[*]};
do
wait ${pid} # wait returns the exit status of that PID, so a non-zero value means that step failed
done
This specifies that each job step is to run on exactly one node with a single task only.
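A closing caveat for readers on more recent Slurm releases: the default way job steps claim CPUs and memory has changed over the years, so if the steps still end up running one after another despite --exclusive, check the srun man page of your installation for the step-level options (for example --exact, or an explicit per-step --mem-per-cpu) that limit each step to only the resources it actually needs.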