Optimize performance on a SLURM cluster

I am writing to you after many attempts on a CPU cluster structured as follows:
144 standard compute nodes
2× AMD EPYC 7742, 2× 64 cores, 2.25 GHz
256 (16× 16) GB DDR4, 3200 MHz
InfiniBand HDR100 (Connect-X6)
local disk for operating system (1× 240 GB SSD)
1 TB NVMe
Now, since my core-hours here are limited, I want to maximize performance as much as I can.
I am doing some benchmarking with the following submission script:
#!/bin/bash -x
#SBATCH --account=XXXX
#SBATCH --ntasks=256
#SBATCH --output=mp-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=24:00:00
#SBATCH --partition=batch
srun ./myprogram
The program I am running is GROMACS 2020 (MPI), a software package for molecular dynamics simulations.
In the machine manual I read about these keys:
--ntasks
--ntasks-per-node
--cpu-per-node
However, considering how recent the hardware is, I am getting mediocre performance. Indeed, on a cluster five years older, I get better performance with comparable resources.
So, can you suggest a good combination of those keywords to maximize performance and avoid wasting core-hours?
My system size is ~100K atoms (if that helps).
Any feedback would be very much appreciated,
Looking forward to hearing your opinions.
Best Regards
VG

In your case, the 256 tasks have no constraint to run on the same nodes, rack, or location. Slurm has no clue how to place the job on your cluster: it could schedule 1 task on each of 256 different nodes, which is not efficient at all.
To make sure everything is scheduled sensibly, you should force the placement of the tasks onto the nodes.
#!/bin/bash -x
#SBATCH --account=XXXX
#SBATCH --nodes=2
#SBATCH --ntasks=256
#SBATCH --ntasks-per-core=1
#SBATCH --tasks-per-node=128
#SBATCH --output=mp-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=24:00:00
#SBATCH --partition=batch
srun ./myprogram
This way, the 256 tasks will be scheduled 1 task per core, spread over the two AMD sockets of each of the 2 nodes. This avoids oversubscription and the sharing of CPU cycles, which is inefficient.
To be safe and not be disturbed by other jobs while benchmarking, also ask for --exclusive.
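A sketch of what the full benchmarking script could then look like (the GROMACS launch line is an assumption; keep srun ./myprogram if you prefer, or use whatever wrapper your GROMACS installation provides):
#!/bin/bash -x
#SBATCH --account=XXXX
#SBATCH --nodes=2
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=128
#SBATCH --ntasks-per-core=1
#SBATCH --exclusive
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=24:00:00
#SBATCH --partition=batch
# launch one MPI rank per core; for an MPI build of GROMACS the binary is typically gmx_mpi
srun gmx_mpi mdrun -deffnm benchmark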

Related

Running Slurm array jobs one per virtual core instead of one per physical core

We have a machine with two 64-core CPUs; each core provides 2 virtual cores, so in htop we see 256 distinct (virtual) CPUs. We configured a Slurm quality of service (QOS) to better manage CPU usage per user, i.e. we defined a --qos=cpus50 which, as far as I understand it, gives me a budget of 50 virtual cores for my jobs. I created a test.sbatch script with an array of 100 jobs. Each job takes 10 s to compute, so with the following config I would hope that my jobs finish in 20 s plus some small overhead.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --qos=cpus50
#SBATCH --array=1-100
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-core=1
#SBATCH --open-mode=append
#SBATCH --output=%x.out
python3 -c "import os, time; jobid = int(os.getenv('SLURM_ARRAY_TASK_ID')); start = f'JOBID:{jobid:04d} | Start {time.ctime()}'; time.sleep(10); print(f'{start} | End {time.ctime()} |')"
However, running the script above spawns only 25 jobs at once (according to the squeue output) and finishes in 47 seconds (about 2x the desired duration). Running with --ntasks-per-core=2 results in the same behavior. Running with --ntasks=2 and --ntasks-per-core=2 results in the same behavior.
What am I doing wrong? I just want to run 50 jobs at once since I already have the virtual cores available. Thank you
Answering my own question. A member of our group found an answer here.
The problem was in Slurm configuration. In short, for our setup, we had to change the relevant part in slurm.conf from
SelectTypeParameters=CR_Core
NodeName=nodename CPUs=256 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2
to
SelectTypeParameters=CR_CPU
NodeName=nodename CPUs=256
Now the sbatch script from the question spawns 50 jobs and finishes in a bit more than 20s as expected.
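A quick way to double-check what Slurm sees after the change (nodename is the placeholder from the snippet above):
scontrol show node nodename | grep -E 'CPUTot|ThreadsPerCore'
# with SelectTypeParameters=CR_CPU and CPUs=256, all 256 virtual cores become individually schedulable units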

Problems with Orca and OpenMPI for parallel jobs

Hello to the community:
I recently started using the ORCA software for some quantum-chemistry calculations, but I have been having a lot of problems launching parallel calculations on my university's cluster.
To install Orca I used the static version:
orca_4_2_1_linux_x86-64_openmpi314.tar.xz.
I unpacked it into a shared directory on the cluster (/data/shared/opt/ORCA/)
and added this to my ~/.bash_profile:
export PATH="/data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314:$PATH"
export LD_LIBRARY_PATH="/data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314:$LD_LIBRARY_PATH"
To install the corresponding OpenMPI version (3.1.4):
tar -xvf openmpi-3.1.4.tar.gz
cd openmpi-3.1.4
./configure --prefix="/data/shared/opt/ORCA/openmpi314/"
make -j 10
make install
When I run on the front-end server everything is wonderful:
With a .sh like this:
#! /bin/bash
export PATH="/data/shared/opt/ORCA/openmpi314/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/data/shared/opt/ORCA/openmpi314/lib"
$(which orca) test.inp > test.out
and an input like this:
# Computation of myjob at b3lyp/6-31+G(d,p)
%pal nprocs 10 end
%maxcore 8192
! RKS B3LYP 6-31+G(d,p)
! TightSCF Grid5 NoFinalGrid
! Opt
! Freq
%cpcm
smd true
SMDsolvent "water"
end
* xyz 0 1
C 0 0 0
O 0 0 1.5
*
The problem appears when I use the nodes:
.inp file:
#! Computation at RKS B3LYP/6-31+G(d,p) for cis1_bh267_m_Cell_152
%pal nprocs 12 end
%maxcore 8192
! RKS B3LYP 6-31+G(d,p)
! TightSCF Grid5 NoFinalGrid
! Opt
! Freq
%cpcm
smd true
SMDsolvent "water"
end
* xyz 0 1
C -4.38728130 0.21799058 0.17853303
C -3.02072869 0.82609890 -0.29733316
F -2.96869122 2.10937041 0.07179384
F -3.01136328 0.87651596 -1.63230798
C -1.82118365 0.05327804 0.23420220
O -2.26240947 -0.92805650 1.01540713
C -0.53557484 0.33394113 -0.05236121
C 0.54692198 -0.46942807 0.50027196
O 0.31128292 -1.43114232 1.22440290
C 1.93990391 -0.12927675 0.16510948
C 2.87355011 -1.15536140 -0.00858832
C 4.18738231 -0.82592189 -0.32880964
C 4.53045856 0.52514329 -0.45102225
N 3.63662927 1.52101319 -0.26705841
C 2.36381718 1.20228695 0.03146190
F -4.51788749 0.24084604 1.49796862
F -4.53935644 -1.04617745 -0.19111502
F -5.43718443 0.87033190 -0.30564680
H -1.46980819 -1.48461498 1.39034280
H -0.26291843 1.15748249 -0.71875720
H 2.57132559 -2.20300864 0.10283592
H 4.93858460 -1.60267627 -0.48060140
H 5.55483009 0.83859415 -0.70271364
H 1.67507560 2.05019549 0.17738396
*
.sh file (Slurm job):
#!/bin/bash
#SBATCH -p deflt #which partition I want
#SBATCH -o cis1_bh267_m_Cell_152_myjob.out #path for the slurm output
#SBATCH -e cis1_bh267_m_Cell_152_myjob.err #path for the slurm error output
#SBATCH -c 12 #number of CPUs (logical cores) per task (a task is normally an MPI process; the default is one and the option to change it is -n)
#SBATCH -t 2-00:00 #how much time I want the resources (this impacts the job priority as well)
#SBATCH --job-name=cis1_bh267_m_Cell_152 #(to recognize your jobs when checking them with "squeue -u USERID")
#SBATCH -N 1 #number of nodes, usually 1 when there is no parallelization over nodes
#SBATCH --nice=0 #lowers your priority if >0
#SBATCH --gpus=0 #number of GPUs you want
# This block is echoing some SLURM variables
echo "Jobid = $SLURM_JOBID"
echo "Host = $SLURM_JOB_NODELIST"
echo "Jobname = $SLURM_JOB_NAME"
echo "Subcwd = $SLURM_SUBMIT_DIR"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
# This block is for the execution of the program
export PATH="/data/shared/opt/ORCA/openmpi314/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/data/shared/opt/ORCA/openmpi314/lib"
$(which orca) ${SLURM_JOB_NAME}.inp > ${SLURM_JOB_NAME}.log --use-hwthread-cpus
I used the --use-hwthread-cpus flag following a recommendation, but the same problem appears with and without it.
The full error is:
There are not enough slots available in the system to satisfy the 12 slots that were requested by the application: /data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314/orca_gtoint_mpi
Either request fewer slots for your application, or make more slots available for use. A "slot" is the Open MPI term for an allocatable unit where we can launch a process. The number of slots available are defined by the environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an RM is present, Open MPI defaults to the number of processor cores In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the --use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the number of available slots when deciding the number of processes to launch.
[file orca_tools/qcmsg.cpp, line 458]:
.... aborting the run
When I look at the calculation output, it appears to start running, but when it launches the parallel jobs it fails with:
ORCA finished by error termination in GTOInt
Calling Command: mpirun -np 12 --use-hwthread-cpus /data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314/orca_gtoint_mpi cis1_bh267_m_Cell_448.int.tmp cis1_bh267_m_Cell_448
[file orca_tools/qcmsg.cpp, line 458]:
.... aborting the run
We have two kinds of nodes on the cluster.
A bunch of them are:
Xeon 6-core E-2136 # 3.30GHz (12 logical cores) and Nvidia GTX 1070Ti
And the other ones:
AMD Epyc 24-core (24 logical cores) and 4x Nvidia RTX 2080Ti
Using the command scontrol show node the details of one node of each group are:
First Group:
NodeName=fang1 Arch=x86_64 CoresPerSocket=6
CPUAlloc=12 CPUTot=12 CPULoad=12.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:gtx1070ti:1
NodeAddr=fang1 NodeHostName=fang1 Version=19.05.5
OS=Linux 5.7.12-arch1-1 #1 SMP PREEMPT Fri, 31 Jul 2020 17:38:22 +0000
RealMemory=15923 AllocMem=0 FreeMem=171 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=7961 Weight=1 Owner=N/A MCS_label=N/A
Partitions=deflt,debug,long
BootTime=2020-10-27T09:56:18 SlurmdStartTime=2020-10-27T15:33:51
CfgTRES=cpu=12,mem=15923M,billing=12,gres/gpu=1,gres/gpu:gtx1070ti=1
AllocTRES=cpu=12,gres/gpu=1,gres/gpu:gtx1070ti=1
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Second Group
NodeName=fang50 Arch=x86_64 CoresPerSocket=24
CPUAlloc=48 CPUTot=48 CPULoad=48.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:rtx2080ti:4
NodeAddr=fang50 NodeHostName=fang50 Version=19.05.5
OS=Linux 5.7.12-arch1-1 #1 SMP PREEMPT Fri, 31 Jul 2020 17:38:22 +0000
RealMemory=64245 AllocMem=0 FreeMem=807 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=32122 Weight=1 Owner=N/A MCS_label=N/A
Partitions=deflt,long
BootTime=2020-12-15T10:09:43 SlurmdStartTime=2020-12-15T10:14:17
CfgTRES=cpu=48,mem=64245M,billing=48,gres/gpu=4,gres/gpu:rtx2080ti=4
AllocTRES=cpu=48,gres/gpu=4,gres/gpu:rtx2080ti=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
In the Slurm script I use the flag -c, --cpus-per-task = integer, and in the ORCA input the command %pal nprocs integer end. I tested different combinations of these two parameters to see whether I was requesting more CPUs than are available:
-c, --cpus-per-task = integer    %pal nprocs integer end
None                             6
None                             3
None                             2
1                                2
1                                12
2                                6
3                                4
12                               12
I also tried different amounts of memory: 8000 MB and 2000 MB (my total memory is around 15 GB). In all cases the same error appears. I am not an expert user of ORCA or of computing in general (but maybe you guessed that from the length of this question), so maybe the solution is simple, but I really don't have it; I don't know what's going on!
Many thanks in advance,
Alejandro.
Faced the same issue.
Explicitly passing --prefix ${OMPI_HOME} directly as an ORCA parameter and using the statically linked ORCA version helped me:
export RSH_COMMAND="/usr/bin/ssh"
export PARAMS="--mca routed direct --oversubscribe -machinefile ${HOSTS_FILE} --prefix ${OMPI_HOME}"
$ORCA_DIR/orca $WORKDIR/$JOBFILE.inp "$PARAMS" > $WORKDIR/$JOBFILE.out
Also, it's better to build OpenMPI 3.1.x with the --disable-builtin-atomics flag.
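For example, rebuilding the OpenMPI from the question with that flag would look roughly like this (same prefix as in the question):
tar -xvf openmpi-3.1.4.tar.gz
cd openmpi-3.1.4
./configure --prefix="/data/shared/opt/ORCA/openmpi314/" --disable-builtin-atomics
make -j 10
make install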
Thank you @Alexey for your answer. And sorry for the wrong tag; as I said, I am a rookie at this.
The problem was not in the ORCA or OpenMPI configuration but in the bash script used to schedule the Slurm job.
I thought that the entire ORCA job itself was what Slurm calls a "task". For that reason I declared the flag --cpus-per-task equal to the number of parallel processes that I wanted ORCA to run. But the problem is that each parallel ORCA process (launched via OpenMPI) is a task for Slurm. Therefore, with my Slurm script I was reserving a node with at least 12 CPUs, but when ORCA launched its parallel processes each one asked for 12 CPUs, hence "There are not enough slots available ..." because I would have needed 144 CPUs.
The rest of the cases in the table in my question failed for another reason: I was launching 5 different ORCA calculations at the same time. Since --cpus-per-task could be None, 1, 2, or 3, the five calculations might land on the same node or on another node with that amount of free CPUs, but when ORCA asked for its parallel processes it failed again, because that many CPUs were not free on the node.
The solution I found is pretty simple. In the .sh script for Slurm I put this:
#SBATCH --mincpus=n*m
#SBATCH --ntasks=n
#SBATCH --cpus-per-task m
Instead of only:
#SBATCH --cpus-per-task m
Here n is the number of parallel processes specified in the ORCA input (%pal nprocs n end) and m is the number of CPUs you want to use for each parallel ORCA process.
In my case I used n = 12, m = 1. With the --mincpus flag I made sure to get a node with at least 12 CPUs and allocate them. What --cpus-per-task does is pretty evident (even to me :-) ); by the way, it has a default value of 1, and I don't know whether more than 1 CPU per OpenMPI ORCA process improves the speed of the calculation. And --ntasks tells Slurm how many tasks you will run.
Of course, if you know the number of tasks and the CPUs per task it is easy to work out how many CPUs you need to reserve, but I don't know whether that is as easy for Slurm :-). So, to be sure that I allocate the correct number of CPUs, I used the --mincpus flag, but maybe it is not needed. The thing is that it works now ^_^.
It is also important to take into account the amount of memory that you declare in the ORCA input so that you do not exceed the available memory. For example, if you have 12 tasks and 15000 MB of RAM, the memory to declare per process should be no more than 15000/12 = 1250 MB.
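Putting it all together, a minimal sketch of the Slurm script for my case (n = 12, m = 1; the job and file names are placeholders):
#!/bin/bash
#SBATCH -p deflt
#SBATCH -N 1
#SBATCH --ntasks=12          # n: one Slurm task per parallel ORCA process
#SBATCH --cpus-per-task=1    # m: CPUs per ORCA process
#SBATCH --mincpus=12         # ask for a node with at least n*m CPUs
#SBATCH -t 2-00:00
export PATH="/data/shared/opt/ORCA/openmpi314/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/data/shared/opt/ORCA/openmpi314/lib"
# the ORCA input uses %pal nprocs 12 end and a %maxcore no larger than RAM/12
$(which orca) myjob.inp > myjob.log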
I had a similar problem with parallel jobs before; Slurm also reported the "not enough slots" error.
My solution was to change parallel threads into parallel processes. For my system that meant changing
#SBATCH -c 24
into
#SBATCH -n 24
and everything works just fine.
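For context, a sketch of a minimal header with that change (the executable line is a placeholder):
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 24        # 24 Slurm tasks = 24 MPI slots, instead of -c 24 (24 CPUs for a single task)
./my_parallel_program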

Understanding How to Submit a Parallel Computing Job on Slurm

I am using a fluids solver called IAMR and I am trying to make it execute faster via my school's cluster. I have options to add nodes and specify tasks, but I have no clue what the distinction is or what my simulation needs to run. I am trying to render a single simulation, and so far the following Slurm script has worked:
=============================
#!/bin/bash
#SBATCH --job-name=first_slurm_job
#SBATCH -N 10
#SBATCH -p debug_queue
#SBATCH --time=4:00:00 # format days-hh:mm:ss
./amr3d.gnu.MPI.OMP.ex inputs.3d.rt
==============================
Aside from not knowing how many nodes and tasks to request, I am not sure I am submitting the job correctly. In the IAMR guide it states:
For an MPI build, you can run in parallel using, e.g.:
mpiexec -n 4 ./amr2d.gnu.DEBUG.MPI.ex inputs.2d.bubble
But I am not using that line when I submit the job. I asked a friend and they said: typically "tasks" means "MPI processes", so if you break your problem into 4 grids then, the way AMReX works, you can have each MPI rank update one grid; so with 4 grids you would ask for 4 MPI processes. So does that mean I have to figure out how to make the grid split into 4 parts if I request 4 tasks? Any insight would help! Here are my cluster's specs:
Cluster Specs
Your file name is amr3d.gnu.MPI.OMP.ex. Is this an OpenMP program (parallel using multiple cores), an MPI program (using multiple processes, possibly on multiple nodes), or a hybrid program using both, as the filename suggests?
OK, it is a hybrid program. Say you use 2 nodes with 16 cores each; then you can do it like this:
#!/bin/bash
#SBATCH --job-name=first_slurm_job
#SBATCH -p debug_queue
#SBATCH --time=4:00:00 # format days-hh:mm:ss
#SBATCH --cpus-per-task=16
#SBATCH --ntasks=2
export OMP_NUM_THREADS=16
echo "Used nodes:" $SLURM_NODELIST
mpirun ./amr3d.gnu.MPI.OMP.ex inputs.3d.rt
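A slightly more general variant of this sketch derives the thread count from the Slurm allocation instead of hard-coding it (same assumptions: 2 nodes, 16 cores each, hybrid MPI+OpenMP build):
#!/bin/bash
#SBATCH --job-name=first_slurm_job
#SBATCH -p debug_queue
#SBATCH --time=4:00:00         # format days-hh:mm:ss
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1    # one MPI rank per node
#SBATCH --cpus-per-task=16     # 16 OpenMP threads per rank
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
echo "Used nodes:" $SLURM_NODELIST
mpirun ./amr3d.gnu.MPI.OMP.ex inputs.3d.rt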

Solving SLURM "sbatch: error: Batch job submission failed: Requested node configuration is not available" error

We have 4-GPU nodes, each with two 36-core CPUs and 200 GB of RAM, available at our local cluster. When I try to submit a job with the following configuration:
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:4
#SBATCH --time=0-10:00:00
I'm getting the following error:
sbatch: error: Batch job submission failed: Requested node configuration is not available
What might be the reason for this error? The nodes have exactly the kind of hardware that I need...
The CPUs most likely have 36 threads, not 36 cores, and Slurm is probably configured to allocate cores rather than threads.
Check the output of scontrol show nodes to see what the nodes really offer.
You're requesting 40 tasks on nodes with 36 CPUs. The default Slurm configuration binds tasks to cores, so reducing the number of tasks to 36 or fewer may work. (Or increase the number of nodes to 2, if your application can handle that.)
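For example, a header that fits within 36 cores per node (a sketch based on the request in the question):
#SBATCH --nodes=1
#SBATCH --ntasks=36            # no more tasks than cores available on the node
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:4
#SBATCH --time=0-10:00:00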

slurm jobs are pending but resources are available

I'm having some trouble with resource allocation: based on how I understood the documentation and applied it to the config file, I expect behavior that does not happen.
Here is the relevant excerpt from the config file:
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=45,bf_resolution=90,max_array_tasks=1000
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
...
NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=main_compute Nodes=cn_burebista Shared=YES Default=YES MaxTime=76:00:00 State=UP
According to the above, I have the backfill scheduler enabled with CPUs and memory configured as resources. I have 56 CPUs and 256 GB of RAM in my resource pool. I would expect the backfill scheduler to attempt to allocate the resources so as to fill as many of the cores as possible when multiple processes ask for more resources than are available. In my case I have the following queue:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2361 main_comp training mc PD 0:00 1 (Resources)
2356 main_comp skrf_ori jh R 58:41 1 cn_burebista
2357 main_comp skrf_ori jh R 44:13 1 cn_burebista
Jobs 2356 and 2357 are asking for 16 CPUs each, and job 2361 is asking for 20 CPUs, meaning 52 CPUs in total.
As seen above, job 2361 (which was started by a different user) is marked as pending due to lack of resources, although there are plenty of CPUs and memory available. "scontrol show nodes cn_burebista" gives me the following:
NodeName=cn_burebista Arch=x86_64 CoresPerSocket=14
CPUAlloc=32 CPUErr=0 CPUTot=56 CPULoad=21.65
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=cn_burebista NodeHostName=cn_burebista Version=16.05
OS=Linux RealMemory=256000 AllocMem=64000 FreeMem=178166 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
BootTime=2018-03-09T12:04:52 SlurmdStartTime=2018-03-20T10:35:50
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
I'm going through the documentation again and again but I cannot figure out what I am doing wrong.
Why do I have the above situation? What should I change in my config to make this work?
A similar (but not identical) question was asked here but got no answer.
EDIT:
This is part of my script for the task:
# job parameters
#SBATCH --job-name=training_carlib
#SBATCH --output=training_job_%j.out

# needed resources
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --export=ALL
...
export OMP_NUM_THREADS=20
srun ./super_awesome_app
As can be seen, the request is for 1 task per node and 20 CPUs per task. Since the scheduler is configured to consider CPUs as resources and not cores, and I explicitly ask for CPUs in the script, why would the job ask for cores? This is my reference document.
EDIT 2:
Here's the output from the suggested command:
JobId=2383 JobName=training_carlib
UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
Priority=4294901726 Nice=0 Account=(null) QOS=(null)
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=main_compute AllocNode:Sid=zalmoxis:23690
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=cn_burebista
NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
StdIn=/dev/null
StdOut=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
Power=
In your configuration, Slurm cannot allocate two jobs on two hardware threads of the same core. In your example, Slurm would thus need at least 10 cores completely free to start your job.
Also, if the default block:cyclic task affinity configuration is used, Slurm cycles over sockets to distribute tasks in a node.
So what is happening, I believe, is the following:
Job 2356 submitted, being allocated 16 physical cores because of the default task distribution
Job 2357 submitted, being allocated 2 hardware threads on 8 physical cores, overriding default task distribution to get the job to run
Job 2361 submitted, waiting for at least 10 physical cores to become available.
You can get the exact CPU numbers allocated to a job using
scontrol show -dd job <jobid>
To configure Slurm so that it considers hardware threads exactly as if they were cores, you indeed need to define
SelectTypeParameters=CR_CPU_Memory
but you also need to specify CPUs directly in the node definition
NodeName=cn_burebista CPUs=56 RealMemory=256000 State=UNKNOWN
and not let Slurm compute CPUs from Sockets, CoresPerSocket, and ThreadsPerCore.
See the section about ThreadsPerCore in the slurm.conf manpage section about node definition.
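After updating slurm.conf on the controller and the node, restarting the Slurm daemons and checking the node again should confirm the change (a sketch, assuming systemd-managed services):
sudo systemctl restart slurmctld     # on the controller
sudo systemctl restart slurmd        # on cn_burebista
scontrol show node cn_burebista | grep CPUTot
# with CPUs=56 in the node definition, Slurm allocates the 56 hardware threads individually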
