Having trouble getting Slurm sbatch job arrays to assign jobs to cores before assigning to additional nodes

I have a number of jobs that each require a single core to run. The cluster I use has 5 nodes, each with 96 cores. When I use Slurm to submit the jobs, they are always spread across multiple nodes, and if there are more than 5 of them (i.e., more than the number of nodes) they tend to run sequentially rather than concurrently on each node. I see the same behaviour when I restrict the nodes: sequential, not concurrent. The configuration is set to "cons_tres" and I have tried many different suggestions and combinations of the script below. I did manage to get the desired behaviour using $SLURM_PROCID accessed through a wrapper script, but I need to access data throughout the run for each model and have found $SLURM_ARRAY_TASK_ID very convenient for this. I have also tried submitting with srun within the sbatch script, but nothing seems to work. The last iteration, with the optional srun shown in brackets, is below. I am pretty new (~1 week) to writing scheduling scripts, so please forgive any incorrect or inaccurate descriptions. I really appreciate any solutions, but I am also looking to understand more fully where I am going wrong. Thanks!
#!/bin/tcsh
## SLURM TEST
#SBATCH --job-name=seatest
#SBATCH --nodes=1-1
#SBATCH --ntasks=5
#SBATCH --ntasks-per-node=5
#SBATCH --array=1-5
#SBATCH --output=slurm-%A_%03a.out
hostname
set CASE_NUM=`printf %03d $SLURM_ARRAY_TASK_ID`
[srun] program-name seatest.$CASE_NUM.in
These jobs were sent to 1 core on each of the five nodes, not to 5 cores of 1 node.

Memory-based scheduling was enabled on the cluster, which required the memory (--mem) for each job to be specified.
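With that in mind, here is a minimal sketch (not a guaranteed fix) of how the array job could be adjusted so that every array element asks for exactly one task, one CPU and an explicit amount of memory; with "cons_tres" each element is then small enough to share a node with the others. The 2 GB memory value is an assumed placeholder, and the sketch is written in bash rather than tcsh only for brevity:

#!/bin/bash
## SLURM TEST (adjusted sketch)
#SBATCH --job-name=seatest
#SBATCH --ntasks=1                  # each array element is one task...
#SBATCH --cpus-per-task=1           # ...on one core
#SBATCH --mem=2G                    # assumed per-job memory; required when memory scheduling is on
#SBATCH --array=1-5
#SBATCH --output=slurm-%A_%03a.out
hostname
CASE_NUM=$(printf %03d $SLURM_ARRAY_TASK_ID)
srun program-name seatest.$CASE_NUM.in

Whether Slurm fills one node before spilling onto the next also depends on the cluster's node-selection settings, but with per-core and per-memory requests the five elements can at least run concurrently instead of queueing behind each other.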

Related

Slurm array job with at most one concurrent job per node

Problem
I have hundreds of files; for each I'd like to run a job with a fixed number of cores (let's say -c4) such that at no time more than one of these jobs runs on any of the nodes.
(Reason, if you're interested: complicated job setup out of my control. Each job starts a bunch of servers on hard-coded ports. These clash if run concurrently on one node :-/ (Yepp, don't tell me, I know.))
MVCE
I've already played around with all kinds of combinations of -N1, -n1, --ntasks-per-node=1 and an inner srun with --exclusive, but sadly no success:
sbatch -N1 -n1 -c4 --ntasks-per-node=1 --array=1-128 --wrap \
'echo "$(hostname) $(date) $(sleep 15) $(date)"'
or
sbatch -N1 -n1 -c4 --ntasks-per-node=1 --array=1-128 --wrap \
'srun --exclusive -n1 -c4 --ntasks-per-node=1 -- \
bash -c '\''echo "$(hostname) $(date) $(sleep 15) $(date)"'\'
However, if you look at the output (cat slurm-*.out), you'll quickly spot overlapping runs in all cases :-/
Question
Is there a way to constrain an array job to never concurrently run more than 1 of its jobs on any node?
Our cluster is quite heterogeneous with respect to the CPUs per node (ranging from 32 to 256), so simple workarounds like asking for a -c large enough that no two jobs fit on any node lead to very long wait times and poor utilization.
Any ideas / pointers?
Is there maybe a way to reserve a certain port per job?
I can think of two ways to achieve this, one with some admin help and one without:
If you ask your Slurm admin very nicely, they may be able to add a 'fake' gres to the nodes. This lets you request that gres for your jobs, and if there is only one of it per node, you are limited to one job per node, however many other resources you need.
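For illustration, roughly what that could look like; the gres name "portlock", the node names and the counts are all made-up examples, and the slurm.conf side is the admin's job:

# admin side (slurm.conf), indicative only:
GresTypes=portlock
NodeName=node[01-16] Gres=portlock:1 ...

# user side: each array task requests the single per-node gres,
# so no two array tasks can ever share a node
sbatch -N1 -n1 -c4 --gres=portlock:1 --array=1-128 --wrap \
    'echo "$(hostname) $(date) $(sleep 15) $(date)"'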
Instead of using an array, you could request a big job with lots of nodes, one task per node and four cores per task. Inside that job you start the tasks with srun, and as each node has only one task, they should be distributed across the nodes. You might not want to wait for four cores on 128 nodes to be free at once, so cut your workload into chunks and submit them as dependencies (look into the singleton option).
Elaboration on the second option:
#SBATCH -N16
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=something
#SBATCH --dependency=singleton
for i in `seq 1 $SLURM_JOB_NUM_NODES`; do
srun -N1 -n1 <your_program> &
done
wait
You could submit 100 of these in a row and they would run chunks of size 16, one after another. This is not really efficient, but waiting for 100 nodes each with a free task slot at once (i.e., no chunking) might take even longer. I certainly prefer the first option, but this one works if your admin doesn't want to add a gres.
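Submitting the chunks could then look like the following sketch (chunk_job.sh is a placeholder name for the script above); because every chunk carries the same --job-name together with --dependency=singleton, Slurm lets only one chunk run at a time:

# submit e.g. 8 chunks of 16 nodes each; they run one after another
for chunk in $(seq 1 8); do
    sbatch chunk_job.sh
done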
The most efficient way is the gres approach, as suggested by Marcus Boden.
But if the admins are not able to help you with that, you could add a small piece of code at the beginning of the submission script that checks whether the needed port is available (for instance with the netstat command).
If the port is not available, requeue the job with scontrol requeue $SLURM_JOB_ID. Before requeueing, in order to prevent the job from landing on the same, unavailable node again, you can edit the job to exclude that node: scontrol update jobid=$SLURM_JOB_ID ExcNodeList=$(hostname -s). Ideally, the code should be a bit more clever and retrieve the current excluded node list from the job and append the current node to it.
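A rough sketch of such a check at the top of the submission script, assuming the job is submitted as requeueable (sbatch --requeue) and that the hard-coded port number is known; the refinement of appending to an existing exclude list is left out:

PORT=12345                                   # assumed hard-coded port
if netstat -tln | grep -q ":${PORT} "; then
    # port already taken on this node: exclude the node, then requeue
    scontrol update JobId=$SLURM_JOB_ID ExcNodeList=$(hostname -s)
    scontrol requeue $SLURM_JOB_ID
    exit 0
fi
# ...port is free, start the servers and the real work here...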
Another option would be to modify the job with scontrol update jobid=$SLURM_JOB_ID StartTime=..., with the start time set to the current time plus the typical wall time of your job, the idea being that when the job becomes eligible again, the job currently running on the node will have completed. But of course, there is no guarantee that the node will not be allocated to another job in the meantime.

sbatch script with a number of CPUs different from the total number of CPUs in the nodes?

I'm used to starting an sbatch script on a cluster where the nodes have 32 CPUs and where my code needs a power-of-2 number of processors.
For example, I do this:
#SBATCH -N 1
#SBATCH -n 16
#SBATCH --ntasks-per-node=16
or
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --ntasks-per-node=32
However, I now need to use a different cluster where each node has 40 CPUs. For the moment I'm using only one node and 32 processes for testing:
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=32
(I got this latter script from the documentation of the cluster. They don't use the #SBATCH -N line in this example; I don't know why, but maybe because it is just an example.)
However, I will now need to do larger simulations with 512 processors. The closest number of nodes I would need is 13 (i.e., 40*13 = 520 processors). Now the problem is that the number of tasks per node would not (technically) be an integer.
I think one solution would be to ask for 13 nodes, fully using 12 of them and only partially using the last one.
My question is: how do I do this? Is there a way of doing it without changing the code? (Changing the code is not possible; it is a huge code.)
A simulation with 512 procs will take 10 hours minimum, so running it with only 32 procs would take a week. And I don't need just one simulation but at least 20 for the moment.
Another solution would be to ask for 16 nodes (32*16 = 512) and use only 32 procs per node. However, this would waste processors and the compute hours I'm allowed on the cluster.
OK, the answer is simple but depends on the machine you are working on. However, I think it should work every time.
In the case of the second cluster I don't need to specify the --ntasks-per-node line at all. I just need to tell the machine how many tasks I need in total with --ntasks=512, and it automatically allocates the number of nodes necessary for those tasks.
Important: if your ntasks is not a multiple of the processors per node, then the last node will not be completely used. For example, in my case I need 512 tasks, which corresponds to 13 nodes = 520 processors. The first 12 nodes are fully used, but the last one is not and leaves 8 processors idle.
Note that this can cause some performance problems in some codes, because the processes on the last node need to communicate with the majority of the processes on the other node(s). For me it is not a problem, but I know another code where it is.
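A sketch of what the corresponding header could look like on the 40-core cluster (job name, wall time and executable are placeholders):

#!/bin/bash
#SBATCH --job-name=bigrun          # placeholder name
#SBATCH --ntasks=512               # total tasks; Slurm allocates the nodes (13 of the 40-core nodes here)
#SBATCH --time=10:00:00            # assumed wall time
srun ./my_code                     # placeholder executable, started on all 512 tasks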

make -j across PBS nodes

I would like to run parallel tasks with make -j and have those tasks distributed across PBS nodes. The solution I am looking for is similar to the Slurm Makefile here, except that my admin only allows me to run one job at a time, so I cannot set SHELL=qsub, for example.

How to submit a job to any [subset] of nodes from nodelist in SLURM?

I have a couple of thousand jobs to run on a SLURM cluster with 16 nodes. These jobs should run only on a subset of the available nodes, of size 7. Some of the tasks are parallelized and hence use all the CPU power of a single node, while others are single-threaded. Therefore, multiple jobs should run at the same time on a single node. None of the tasks should span multiple nodes.
Currently I submit each of the jobs as follow:
sbatch --nodelist=myCluster[10-16] myScript.sh
However, this parameter makes Slurm wait until the submitted job terminates, and hence leaves 3 nodes completely unused; depending on the task (multi- or single-threaded), the currently active node might also be under low load in terms of CPU capability.
What are the best parameters of sbatch that force slurm to run multiple jobs at the same time on the specified nodes?
You can work the other way around: rather than specifying which nodes to use (with the effect that each job is allocated all 7 nodes), specify which nodes not to use:
sbatch --exclude=myCluster[01-09] myScript.sh
and Slurm will never allocate more than 7 nodes to your jobs. Make sure, though, that the cluster configuration allows node sharing, and that your myScript.sh contains #SBATCH --ntasks=1 --cpus-per-task=n, with n the number of threads of each job.
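A sketch of what myScript.sh could then contain for a 4-threaded job (the thread count and program name are examples):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4      # n = number of threads of this particular job
./myTask                       # placeholder; a program using up to 4 threads

submitted, as above, with sbatch --exclude=myCluster[01-09] myScript.sh.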
Some of the tasks are parallelized, hence use all the CPU power of a single node while others are single threaded.
I understand that you want the single-threaded jobs to share a node, whereas the parallel ones should be assigned a whole node exclusively?
multiple jobs should run at the same time on a single node.
As far as my understanding of SLURM goes, this implies that you must define CPU cores as consumable resources (i.e., SelectType=select/cons_res and SelectTypeParameters=CR_Core in slurm.conf)
Then, to constrain parallel jobs to a whole node, you can either use the --exclusive option (but note that the partition configuration takes precedence: you cannot have shared nodes if the partition is configured for exclusive access), or use -N 1 --ntasks-per-node=<number_of_cores_in_a_node> (e.g., -N 1 --ntasks-per-node=8).
Note that the latter will only work if all nodes have the same number of cores.
None of the tasks should span multiple nodes.
This should be guaranteed by -N 1.
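As a concrete sketch of the above (cons_res with CR_Core; the 8-core figure, node names and script names are examples only):

# slurm.conf (admin side): make cores the consumable resource
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# single-threaded job: one core, shares its node with other jobs
sbatch --exclude=myCluster[01-09] -N1 -n1 serial.sh

# parallel job: take a whole node, here assumed to have 8 cores
sbatch --exclude=myCluster[01-09] -N1 --ntasks-per-node=8 parallel.sh
# (or: sbatch --exclude=myCluster[01-09] --exclusive parallel.sh)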
Actually, I think the way to go is to set up a 'reservation' first, as described in this presentation: http://slurm.schedmd.com/slurm_ug_2011/Advanced_Usage_Tutorial.pdf (last slide).
Scenario: Reserve ten nodes in the default SLURM partition starting at noon and with a duration of 60 minutes occurring daily. The reservation will be available only to users alan and brenda.
scontrol create reservation user=alan,brenda starttime=noon duration=60 flags=daily nodecnt=10
Reservation created: alan_6
scontrol show res
ReservationName=alan_6 StartTime=2009-02-05T12:00:00
EndTime=2009-02-05T13:00:00 Duration=60 Nodes=sun[000-003,007,010-013,017] NodeCnt=10 Features=(null) PartitionName=pdebug Flags=DAILY Licenses=(null)
Users=alan,brenda Accounts=(null)
# submit job with:
sbatch --reservation=alan_6 myScript.sh
Unfortunately I couldn't test this procedure, probably due to a lack of privileges.

Limiting the number of qsub jobs to under the job limit

I am trying to do parameter tuning of my learning model on a Bright compute cluster, which requires a large number of jobs because of the number of parameters being tuned. Each combination of the parameters requires around 162 qsub jobs, and there are around 50 combinations of parameters that I need to check. This is equivalent to running around 162*50 ~= 8100 jobs. However, there is a 350 qsub job limit per account on the cluster that I am using. I was hence wondering whether there is a way in bash scripting to check the number of currently active qsub jobs, so I could effectively automate the process of initiating new jobs.
Did you already try job arrays? You didn't specify which scheduler you are using (PBS, OGE, ...), but there should be a way to define a job array and, for the whole array, a limit on the number of tasks actually running at a time. In PBS,
#PBS -t 1-1000%100
creates a one-thousand-job array while limiting the number of tasks effectively running at a time to one hundred.
If you really want to find a way to check active jobs in order to automate the submission of new ones, the qstat output should help you, but this should really be the duty of your scheduler, not yours.
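That said, if you do want to automate it yourself, a rough sketch of such a throttle could look like the following; the way the qstat output is counted is an assumption that may need adjusting for your site, and job_script.sh and the parameter files are placeholders:

MAX_JOBS=300                                  # stay safely below the 350-job limit
for params in param_set_*.txt; do             # placeholder list of parameter combinations
    # wait while the number of my queued/running jobs is at the cap
    while [ "$(qstat -u "$USER" | grep -c "$USER")" -ge "$MAX_JOBS" ]; do
        sleep 60
    done
    qsub -v PARAMS="$params" job_script.sh    # placeholder submission
done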
