Running Slurm array jobs one per virtual core instead of one per physical core

We have a machine with two 64-core CPUs; each core has 2 hardware threads, so in htop we see 256 distinct (virtual) CPUs. We configured Slurm quality-of-service levels to better manage CPU usage per user, i.e. we have defined a --qos=cpus50 which, as far as I understand it, gives me a budget of 50 virtual cores for my jobs. I created a test.sbatch script with an array of 100 jobs, each taking 10 s to compute. So with the following configuration, I would hope that my jobs finish in 20 s plus some small overhead.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --qos=cpus50
#SBATCH --array=1-100
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-core=1
#SBATCH --open-mode=append
#SBATCH --output=%x.out
python3 -c "import os, time; jobid = int(os.getenv('SLURM_ARRAY_TASK_ID')); start = f'JOBID:{jobid:04d} | Start {time.ctime()}'; time.sleep(10); print(f'{start} | End {time.ctime()} |')"
However, running the script above spawns only 25 jobs at once (according to the squeue output) and finishes in 47 seconds (roughly twice the desired duration). Running with --ntasks-per-core=2 results in the same behavior, and so does running with --ntasks=2 and --ntasks-per-core=2.
What am I doing wrong? I just want to run 50 jobs at once since I already have the virtual cores available. Thank you

Answering my own question. A member of our group found an answer here.
The problem was in the Slurm configuration. In short, for our setup, we had to change the relevant part of slurm.conf from
SelectTypeParameters=CR_Core
NodeName=nodename CPUs=256 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2
to
SelectTypeParameters=CR_CPU
NodeName=nodename CPUs=256
Now the sbatch script from the question spawns 50 jobs and finishes in a bit more than 20 s, as expected.
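As a quick sanity check after reconfiguring (a sketch; the node name is a placeholder), you can confirm the allocatable CPU count and watch how many array tasks actually run concurrently:
scontrol show node nodename | grep -o 'CPUTot=[0-9]*'   # should report CPUTot=256
squeue -h -u $USER --state=RUNNING | wc -l              # should hover around 50 while the array runs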

Related

Optimize performance on a SLURM cluster

I am writing to you after many attempts on a CPU cluster structured as follows:
144 standard compute nodes
2× AMD EPYC 7742, 2× 64 cores, 2.25 GHz
256 (16× 16) GB DDR4, 3200 MHz
InfiniBand HDR100 (Connect-X6)
local disk for operating system (1× 240 GB SSD)
1 TB NVMe
Now, since my core-hours are limited here, I want to maximize performance as much as I can.
I am doing some benchmarking with the following submission script:
#!/bin/bash -x
#SBATCH --account=XXXX
#SBATCH --ntasks=256
#SBATCH --output=mp-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=24:00:00
#SBATCH --partition=batch
srun ./myprogram
The program I am running is GROMACS 2020 (MPI), a software package for molecular dynamics simulations.
In the machine manual I read about these keys:
--ntasks
--ntasks-per-node
--cpu-per-node
However, considering how recent the hardware is, I am getting mediocre performance. Indeed, on a cluster five years older I get better performance with comparable resources.
So, can you suggest a good combination of those keywords to maximize performance and avoid wasting core-hours?
My system size is ~100K atoms (if that helps).
Any feedback would be very much appreciated.
Looking forward to your opinions.
Best Regards
VG
In your case, the 256 tasks have no placement constraints, so Slurm has no clue about how to lay the job out on your cluster: it could schedule 1 task on each of 256 different nodes, which is not efficient at all.
To make sure everything is placed sensibly, you should force the tasks onto a fixed set of nodes.
#!/bin/bash -x
#SBATCH --account=XXXX
#SBATCH --nodes=2
#SBATCH --ntasks=256
#SBATCH --ntasks-per-core=1
#SBATCH --tasks-per-node=128
#SBATCH --output=mp-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=24:00:00
#SBATCH --partition=batch
srun ./myprogram
This way, the 256 tasks will each be scheduled on their own core, spread across the two AMD sockets of the 2 nodes. This avoids oversubscription and CPU-cycle sharing, which is inefficient.
To make sure the benchmark is not disturbed by other jobs, also request --exclusive.
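For instance, a minimal variation of the script above with exclusive node access (a sketch; the account and program name are placeholders taken from the question):
#!/bin/bash -x
#SBATCH --account=XXXX
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --ntasks=256
#SBATCH --ntasks-per-core=1
#SBATCH --tasks-per-node=128
#SBATCH --output=mp-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=24:00:00
#SBATCH --partition=batch
srun ./myprogram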

Slurm --cpus-per-task command

Hello everyone, I'm using a piece of software called RepeatMasker. In this pipeline I can run a parallelized job via Slurm with the -pa option.
Here is the documentation for this option:
RepeatMasker -h
-pa(rallel) [number]
The number of sequence batch jobs [50kb minimum] to run in parallel.
RepeatMasker will fork off this number of parallel jobs, each
running the search engine specified. For each search engine
invocation (where applicable) a fixed number of cores/threads
is used:
RMBlast 4 cores
To estimate the number of cores a RepeatMasker run will use simply
multiply the -pa value by the number of cores the particular search
engine will use.
So in a Slurm batch script I should add:
#SBATCH --cpus-per-task=8
RepeatMasker -pa 2
right? Since 8/4 = 2.
But I wondered if I should also add other #SBATCH parameters, or if --cpus-per-task is sufficient?
Thanks a lot
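For reference, a minimal sbatch sketch matching the arithmetic in the question (the job name and input file are hypothetical; it assumes RMBlast as the search engine, so -pa 2 × 4 cores = 8 CPUs):
#!/bin/bash
#SBATCH --job-name=repeatmasker     # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
RepeatMasker -pa 2 genome.fa        # genome.fa is a placeholder input file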

Understanding How to Submit a Parallel Computing Job on Slurm

I am using a fluids solver called IAMR and I am trying to make it execute faster via my school's cluster. I have options to add nodes and specify tasks, but I have no clue what the distinction is or what my simulation needs to run. I am trying to render a single simulation, and so far the following Slurm script has worked:
=============================
#!/bin/bash
#SBATCH --job-name=first_slurm_job
#SBATCH -N 10
#SBATCH -p debug_queue
#SBATCH --time=4:00:00 # format days-hh:mm:ss
./amr3d.gnu.MPI.OMP.ex inputs.3d.rt
==============================
Aside from not knowing how many nodes and tasks to request, I am not sure I am submitting the job correctly. In the IAMR guide it states:
For an MPI build, you can run in parallel using, e.g.:
mpiexec -n 4 ./amr2d.gnu.DEBUG.MPI.ex inputs.2d.bubble
But I am not using that line when I make the job submission. I asked a friend and they said: typically "tasks" means "MPI processes", so if you break your problem into 4 grids then, the way AMReX works, you can have each MPI rank update one grid; so with 4 grids you would ask for 4 MPI processes. So does that mean I have to figure out how to make the grid split into 4 parts if I request 4 tasks? Any insight would help! Here are my cluster's specs:
[Cluster specs image]
Your file name is amr3d.gnu.MPI.OMP.ex. Is this an OpenMP program (parallel using multiple cores), an MPI program (using multiple processes, possibly on multiple nodes), or a hybrid program using both, as the filename suggests?
OK, it is a hybrid program. Say you use 2 nodes with 16 cores each; then you can do it like this:
#!/bin/bash
#SBATCH --job-name=first_slurm_job
#SBATCH -p debug_queue
#SBATCH --time=4:00:00 # format days-hh:mm:ss
#SBATCH --cpus-per-task=16
#SBATCH --ntasks=2
export OMP_NUM_THREADS=16
echo "Used nodes:" $SLURM_NODELIST
mpirun ./amr3d.gnu.MPI.OMP.ex inputs.3d.rt
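A small variation of the last two lines (a sketch) derives the thread count from the allocation instead of hardcoding 16, so the two values cannot drift apart:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
mpirun ./amr3d.gnu.MPI.OMP.ex inputs.3d.rt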

Slurm: variable for max SLURM_ARRAY_TASK_ID

I have a simple slurm job file that looks like:
#!/bin/bash
#SBATCH --array=1-1000
#SBATCH -t 60:00
#SBATCH --mail-type=ALL
python cats.py ${SLURM_ARRAY_TASK_ID} 1000
That second argument is there so my script knows the total number of workers in this job.
I'd like to make that 1000 value into a variable though, so I don't need to hardcode the total number of workers. Is there some slurm variable for the maximum array task id in the current job?
You can use the environment variable SLURM_ARRAY_TASK_MAX
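Applied to the job file from the question (a sketch), the hardcoded 1000 becomes:
#!/bin/bash
#SBATCH --array=1-1000
#SBATCH -t 60:00
#SBATCH --mail-type=ALL
python cats.py ${SLURM_ARRAY_TASK_ID} ${SLURM_ARRAY_TASK_MAX}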

Solving SLURM "sbatch: error: Batch job submission failed: Requested node configuration is not available" error

We have 4 GPU nodes with two 36-core CPUs and 200 GB of RAM available at our local cluster. When I'm trying to submit a job with the following configuration:
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:4
#SBATCH --time=0-10:00:00
I'm getting the following error:
sbatch: error: Batch job submission failed: Requested node configuration is not available
What might be the reason for this error? The nodes have exactly the kind of hardware that I need...
The CPUs most likely have 36 threads, not 36 cores, and Slurm is probably configured to allocate cores, not threads.
Check the output of scontrol show nodes to see what the nodes really offer.
You're requesting 40 tasks on nodes with 36 CPUs. The default Slurm configuration binds tasks to cores, so reducing the number of tasks to 36 or fewer may work. (Or increase the node count to 2, if your application can handle that.)
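For example, a header that fits within 36 CPUs on a single node (a sketch based on the suggestion above; the other directives are unchanged from the question):
#SBATCH --nodes=1
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:4
#SBATCH --time=0-10:00:00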
