Slurm array job with at most one concurrent job per node - cluster-computing

Problem
I have hundreds of files; for each one I'd like to run a job with a fixed number of cores (let's say -c4) such that at no time does more than one of these jobs run on any node.
(Reason, if you're interested: a complicated job setup out of my control. Each job starts a bunch of servers on hard-coded ports, and these clash if run concurrently on one node :-/ (Yep, don't tell me, I know.))
MVCE
I've already played around with all kinds of combinations of -N1, -n1, --ntasks-per-node=1 and an inner srun with --exclusive, but sadly no success:
sbatch -N1 -n1 -c4 --ntasks-per-node=1 --array=1-128 --wrap \
'echo "$(hostname) $(date) $(sleep 15) $(date)"'
or
sbatch -N1 -n1 -c4 --ntasks-per-node=1 --array=1-128 --wrap \
'srun --exclusive -n1 -c4 --ntasks-per-node=1 -- \
bash -c '\''echo "$(hostname) $(date) $(sleep 15) $(date)"'\'
However, if you look at the output (cat slurm-*.out), you'll quickly spot overlapping runs in all cases :-/
Question
Is there a way to constrain an array job to never concurrently run more than 1 of its jobs on any node?
Our cluster is quite heterogeneous with respect to the number of CPUs per node (ranging from 32 to 256), so simple workarounds like requesting a -c high enough that no two jobs fit on any node lead to very long wait times and poor utilization.
Any ideas / pointers?
Is there maybe a way to reserve a certain port per job?

I can think of two ways to achieve this, one with some admin help and one without:
If you ask your Slurm admin very nicely, they may be able to add a 'fake' GRES (generic resource) to the nodes. This allows you to request that GRES for your jobs. If there is only one of it per node, you are limited to one such job per node, however many other resources you request (an example submission is sketched at the end of this answer).
Instead of using an array, you could request one big job with many nodes, one task per node and four cores per task. Inside that job, you start the tasks with srun; since each node gets exactly one task, they are distributed across the nodes. You probably don't want to wait for four cores to be free on 128 nodes at once, so cut your workload into chunks and submit them as dependencies (look into the singleton option).
Elaboration on the second option:
#!/bin/bash
#SBATCH -N16
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --job-name=something
#SBATCH --dependency=singleton

for i in $(seq 1 $SLURM_JOB_NUM_NODES); do
    # one task per node, launched in the background
    srun -N1 -n1 -c4 <your_program> &
done
wait   # wait for all background job steps to finish
You could submit 100 of these in a row and they would run chunks of size 16, sequentially. This is not really efficient, but waiting for 100 nodes to each have a free task at once (i.e., no chunking) might take even longer. I certainly prefer the first option, but this one is a possibility if your admin doesn't want to add a GRES.
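For the first (GRES) option, assuming the admins define a GRES named, say, port1 with a count of one per node (the name is purely illustrative), the submission from the question would become something like:
sbatch -N1 -n1 -c4 --gres=port1:1 --array=1-128 --wrap \
'echo "$(hostname) $(date) $(sleep 15) $(date)"'
Since each node offers only one port1, Slurm can never place two of these array tasks on the same node, no matter how many cores are still free.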

The most efficient way is the GRES approach suggested by @Marcus Boden.
But if the admins are not able to help you with that, you could add a small piece of code at the beginning of the submission script that checks whether the needed ports are available (for instance with the netstat command).
If the port is not available, requeue the job with scontrol requeue $SLURM_JOB_ID. Before requeueing, to prevent the job from landing on the same unavailable node again, edit the job to exclude that node: scontrol update jobid=$SLURM_JOB_ID ExcNodeList=$(hostname -s). Ideally, the code should be a bit cleverer and retrieve the job's current excluded-node list and append the current node to it.
Another option would be to modify the job with scontrol update jobid=$SLURM_JOB_ID StartTime=..., setting the start time to the current time plus the typical wall time of your job, the idea being that by the time the job becomes eligible again, the job currently running on the node will have completed. But of course, there is no guarantee that the node will not be allocated to another job in the meantime.
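Putting the requeue approach together, a minimal sketch of such a check at the top of the submission script could look like this (port 8080 and <your_program> are placeholders, the job must be requeueable, and a robust version would merge ExcNodeList with any nodes already excluded):
#!/bin/bash
#SBATCH -c4
PORT=8080   # placeholder for the hard-coded port your servers bind

# If the port is already in use on this node, exclude the node and requeue the job.
if netstat -ltn | awk '{print $4}' | grep -q ":${PORT}$"; then
    scontrol update jobid=$SLURM_JOB_ID ExcNodeList=$(hostname -s)
    scontrol requeue $SLURM_JOB_ID
    exit 0
fi

# Port is free: start the actual work.
<your_program>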

Related

Having trouble getting slurm sbatch job arrays to assign jobs to cores before assigning to additional nodes

I have a number of jobs that require a single core to run. The cluster I use has 5 nodes, each with 96 cores. When I use Slurm to submit the jobs, they are always assigned to multiple nodes, and if there are more than 5 (i.e., more than the number of nodes) they tend to run sequentially rather than concurrently on each node. The same behaviour is observed when I restrict the nodes; sequential, not concurrent. The configuration is set to "cons_tres" and I have tried many different suggestions and combinations of the script below. I did manage to get the desired operation using $SLURM_PROCID accessed through a wrapper script, but I need to access data throughout the run for each model and have found $SLURM_ARRAY_TASK_ID very convenient for this. I have tried submitting with srun within the sbatch script, but nothing seems to work. The last iteration, with the optional srun inclusion, is shown below. I am pretty new (~1 week) to the development of scheduling scripts, so please forgive any incorrect/inaccurate descriptions. I really appreciate any solutions, but am also looking to more fully understand where I am going wrong. Thanks!
#!/bin/tcsh
## SLURM TEST
#SBATCH --job-name=seatest
#SBATCH --nodes=1-1
#SBATCH --ntasks=5
#SBATCH --ntasks-per-node=5
#SBATCH --array=1-5
#SBATCH --output=slurm-%A_%03a.out
hostname
set CASE_NUM=`printf %03d $SLURM_ARRAY_TASK_ID`
[srun] program-name seatest.$CASE_NUM.in
The jobs were sent to 1 core on each of the five nodes, not to 5 cores of 1 node.
Memory-based scheduling was enabled on the cluster, which required the memory (--mem) for each job to be specified.
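For illustration, a hedged sketch of an adjusted script (assuming each array task really needs just one core; the 2G memory request is a placeholder to satisfy memory-based scheduling):
#!/bin/tcsh
#SBATCH --job-name=seatest
#SBATCH --nodes=1-1
#SBATCH --ntasks=1
#SBATCH --mem=2G
#SBATCH --array=1-5
#SBATCH --output=slurm-%A_%03a.out
hostname
set CASE_NUM=`printf %03d $SLURM_ARRAY_TASK_ID`
srun program-name seatest.$CASE_NUM.in
With --mem set and only one task requested per array element, the scheduler can pack several array tasks onto the cores of a single node.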

separating values of CUDA_VISIBLE_DEVICES variable

I am running a job in a cluster that uses SLURM as a scheduler. I specify the type of GPU card with the option --gres=gpu:k80. However, because the cluster has nodes with a different number of cards, it happens that sometimes one gets 2 or 4. I can see the available devices with:
echo $CUDA_VISIBLE_DEVICES
which reports a list such as 0,1 or 0,1,2,3. I need the maximum value of the list, i.e. 1 or 3 in these cases. Here is my question: is there some option in SLURM to know that?
You can simplify with
export num_def=${CUDA_VISIBLE_DEVICES: -1}
which takes the last character of the list, i.e. the highest device index, as long as there are at most ten devices.
Slurm can provide you with the information, though it will be more difficult to parse. For instance,
squeue -j $SLURM_JOB_ID -O tres-alloc:100
will show something like
cpu=1,mem=1000M,node=1,billing=1,gres/gpu=1
and the last number here is the number of GPUs allocated to the job.
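If you prefer to derive it programmatically, a hedged snippet could extract the GPU count from that output and compute the highest device index (this assumes the gres/gpu= field is present in the tres-alloc string):
ngpus=$(squeue -j $SLURM_JOB_ID -h -O tres-alloc:100 | grep -o 'gres/gpu=[0-9]*' | cut -d= -f2)
max_idx=$((ngpus - 1))   # e.g. 3 when four GPUs are allocated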

Does SLURM support running multiple jobs on one node at the same time?

Our computer cluster runs Slurm version 15.08.13 and MPICH version 3.2.1. My question is: can Slurm support multiple jobs running on one node at the same time? Our cluster has a 16-core CPU per node, and we want to run two jobs at the same time on one node, each using 8 cores.
We have found that if a job uses all of the CPU cores of a node, the node's state becomes "allocated". If a job uses only part of the cores, the node's state becomes "mixed", but subsequent jobs can only be queued and their state stays "pending".
Our command for submitting a job is as follows:
srun -N1 -n8 testProgram
So, does Slurm support running multiple jobs on one node at the same time? Thanks.
Yes, provided Slurm was configured with SelectType=select/cons_res, which does not seem to be the case on your system. You can check with scontrol show config | grep Select. See the Slurm documentation on consumable resources for more information.
Yes, but you need to set SelectType=select/cons_res or SelectType=select/cons_tres
and SelectTypeParameters=CR_CPU_Memory.
The difference between cons_res and cons_tres is that cons_tres adds GPU support.
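For reference, a hedged sketch of the relevant slurm.conf lines (an illustrative excerpt only; a real configuration contains many more settings and changing it requires admin rights and a reconfiguration):
# slurm.conf (excerpt)
SelectType=select/cons_res            # or select/cons_tres on Slurm versions that support it
SelectTypeParameters=CR_CPU_Memory    # schedule CPUs and memory as consumable resources
With that in place, two jobs like srun -N1 -n8 testProgram can run side by side on a 16-core node.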

How to submit a job to any [subset] of nodes from nodelist in SLURM?

I have a couple of thousand jobs to run on a SLURM cluster with 16 nodes. These jobs should run only on a subset of the available nodes of size 7. Some of the tasks are parallelized, hence use all the CPU power of a single node, while others are single-threaded. Therefore, multiple jobs should run at the same time on a single node. None of the tasks should span multiple nodes.
Currently I submit each of the jobs as follow:
sbatch --nodelist=myCluster[10-16] myScript.sh
However, this parameter makes Slurm wait until the submitted job terminates, and hence leaves 3 nodes completely unused; depending on the task (multi- or single-threaded), the currently active node might also be under low CPU load.
What are the best parameters of sbatch that force slurm to run multiple jobs at the same time on the specified nodes?
You can work the other way around: rather than specifying which nodes to use (with the effect that each job is allocated all 7 nodes), specify which nodes not to use:
sbatch --exclude=myCluster[01-09] myScript.sh
and Slurm will never allocate more than 7 nodes to your jobs. Make sure, though, that the cluster configuration allows node sharing, and that your myScript.sh contains #SBATCH --ntasks=1 --cpus-per-task=n with n the number of threads of each job.
Some of the tasks are parallelized, hence use all the CPU power of a single node while others are single threaded.
I understand that you want the single-threaded jobs to share a node, whereas the parallel ones should be assigned a whole node exclusively?
multiple jobs should run at the same time on a single node.
As far as my understanding of SLURM goes, this implies that you must define CPU cores as consumable resources (i.e., SelectType=select/cons_res and SelectTypeParameters=CR_Core in slurm.conf)
Then, to constrain parallel jobs to a whole node you can either use the --exclusive option (but note that the partition configuration takes precedence: you can't have shared nodes if the partition is configured for exclusive access), or use -N 1 --ntasks-per-node="number_of_cores_in_a_node" (e.g., -N 1 --ntasks-per-node=8).
Note that the latter will only work if all nodes have the same number of cores.
None of the tasks should span multiple nodes.
This should be guaranteed by -N 1.
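Putting the pieces together, hedged examples of the two submission styles could look like this (the exclude list matches the question; single.sh and parallel.sh are placeholder scripts):
# single-threaded job: shares a node with other jobs
sbatch --exclude=myCluster[01-09] --ntasks=1 --cpus-per-task=1 single.sh
# parallel job: gets a whole node to itself
sbatch --exclude=myCluster[01-09] -N1 --exclusive parallel.sh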
Actually, I think the way to go is to set up a 'reservation' first, as described in this presentation (last slide): http://slurm.schedmd.com/slurm_ug_2011/Advanced_Usage_Tutorial.pdf
Scenario: Reserve ten nodes in the default SLURM partition starting at noon and with a duration of 60 minutes occurring daily. The reservation will be available only to users alan and brenda.
scontrol create reservation user=alan,brenda starttime=noon duration=60 flags=daily nodecnt=10
Reservation created: alan_6
scontrol show res
ReservationName=alan_6 StartTime=2009-02-05T12:00:00
   EndTime=2009-02-05T13:00:00 Duration=60 Nodes=sun[000-003,007,010-013,017]
   NodeCnt=10 Features=(null) PartitionName=pdebug Flags=DAILY Licenses=(null)
   Users=alan,brenda Accounts=(null)
# submit job with:
sbatch --reservation=alan_6 myScript.sh
Unfortunately I couldn't test this procedure, probably due to a lack of privileges.

Limiting the number of qsub jobs to under the job limit

I am trying to do parameter tuning of my learning model on a Bright compute cluster, which requires a large number of jobs due to the number of parameters being tuned. Each combination of parameters requires around 162 qsub jobs, and there are around 50 combinations I need to check; this amounts to roughly 162*50 ~= 8100 jobs. However, there is a 350-job qsub limit per account on the cluster I am using. I was hence wondering whether there is a way, in a bash script, to check the number of currently active qsub jobs so I could automate the process of initiating new jobs.
Have you already tried job arrays? You didn't specify which scheduler you are using (PBS, OGE, ...), but there should be a way to define a job array and, for the whole array, a limit on the number of tasks actually running at a time. In PBS,
#PBS -t 1-1000%100
creates a one-thousand-task job array while limiting the number of tasks running at any one time to one hundred.
If you really want a way to check active jobs so you can automate the process of initiating new ones, the qstat output should help you, but this should really be the duty of your scheduler, not yours.
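If you do want to automate it yourself, a hedged bash sketch could look like this (joblist.txt, the limit of 300, and the assumption that every line of qstat -u $USER containing the username is one of your jobs are placeholders/approximations):
#!/bin/bash
# Throttle submissions to stay below the per-account job limit.
LIMIT=300                     # stay a bit below the hard limit of 350
while read -r jobscript; do
    # Wait until the number of our queued/running jobs drops below the limit.
    while [ "$(qstat -u "$USER" | grep -c "$USER")" -ge "$LIMIT" ]; do
        sleep 60
    done
    qsub "$jobscript"
done < joblist.txt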
