How to save SLURM job configuration?

Suppose I run a Slurm job with the following configuration:
#!/bin/bash
#SBATCH --nodes=1 # set the number of nodes
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --time=26:59:00 # set max wallclock time
#SBATCH --mem=16000M # set memory limit per node
#SBATCH --job-name=myjobname # set name of job
#SBATCH --mail-type=ALL # mail alert at start, end and abortion of execution
#SBATCH --mail-user=sb@sw.com # send mail to this address
#SBATCH --output=/path/to/output/%x-%j.out # set output path
echo ' mem: ' $SLURM_MEM
echo '\n nodes: ' $SLURM_NODES
echo '\n ntasks: ' $SLURM_NTASKS
echo '\n cpus: ' $SLURM_CPUS_PER_TASK
echo '\n time: ' $SLURM_TIME
I want to save the configuration of this job (such as time, memory, and number of tasks) so that after the job finishes I know under what configuration it was executed.
So I decided to print these variables to the output file; however, there is nothing for time and memory in the output:
\n nodes:
\n ntasks: 1
\n cpus: 1
\n time:
Does anyone know a better way, or how to refer to the time and memory?

You can dump a lot of information about your job with scontrol show job <job_id>. This will give you, among other things, the memory requested. It will not, however, give you the actual memory usage. For that you will need to use sacct -l -j <job_id>.
So, at the end of your submission script, you can add
scontrol show job $SLURM_JOB_ID
sacct -l -j $SLURM_JOB_ID
There are many options for selecting the output of the sacct command; refer to the man page for the complete list.
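As an aside, the variables used in the question ($SLURM_MEM, $SLURM_NODES, $SLURM_TIME) do not exist: Slurm exports names such as SLURM_MEM_PER_NODE, SLURM_JOB_NUM_NODES, SLURM_NTASKS, and SLURM_CPUS_PER_TASK, and the time limit is not exported as an environment variable at all. A minimal sketch of such a logging epilogue, with sacct field names taken from its man page (their availability depends on the accounting setup):
# At the end of the submission script: record how the job was configured.
# Requested values are also available in environment variables, e.g.
# $SLURM_MEM_PER_NODE and $SLURM_JOB_NUM_NODES (there is no $SLURM_TIME).
scontrol show job "$SLURM_JOB_ID" > "config-$SLURM_JOB_ID.txt"
sacct -j "$SLURM_JOB_ID" --format=JobID,JobName,ReqMem,Timelimit,Elapsed,NTasks,AllocCPUS,MaxRSS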

Related

SLURM: How to start a script once per node

I have a Big Cluster available through SLURM.
I want to start my script, e.g. ./calc, on every requested node with a specified number of cores, so for example on 2 nodes with 16 cores each.
I start with an sbatch script:
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
srun -N 1 ./calc 2 &
srun -N 1 ./calc 2 &
wait
It doesn't work as intended, though.
I tried many configurations of --ntasks, --nodes, and --cpus-per-task, but nothing worked and I'm very lost.
I also don't understand the difference between tasks and CPUs in Slurm.
In your example, you ask Slurm to launch 16 tasks per node on 2 nodes, i.e. 32 tasks in total. Each srun -N 1 step then inherits that task count, so Slurm will likely end up running many more copies of ./calc than intended.
For your needs, you don't need to request 2 nodes explicitly unless the two programs must run on two different nodes.
For your example, run the following sbatch script:
#!/bin/bash
#SBATCH --ntasks=2            # two tasks (copies of the program) in total
#SBATCH --cpus-per-task=16    # 16 cores per task
#SBATCH --hint=nomultithread  # request physical cores, not hyper-threads
srun <my program>
In this example, Slurm will run the program 2 times, each with 16 cores. The nomultithread hint is optional and depends on the cluster configuration: if hyper-threading is enabled and the hint is omitted, the 16 CPUs may be virtual (hyper-threaded) CPUs rather than physical cores.
I found this to be a working solution. It turned out the most important thing was to define all the parameters: nodes, tasks, and CPUs:
#!/bin/bash
#SBATCH -N 2                  # two nodes
#SBATCH --ntasks-per-node=1   # one task on each node
#SBATCH --cpus-per-task=16    # 16 cores per task
srun -N 1 -n 1 -c 16 ./calc 2 &
srun -N 1 -n 1 -c 16 ./calc 2 &
wait
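As a variation, the per-node srun lines can be generated in a loop so nothing has to be edited when the node count changes. A sketch, assuming the SLURM_JOB_NUM_NODES variable that Slurm sets inside the allocation:
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
# Launch one background step of ./calc on each allocated node.
for ((n = 0; n < SLURM_JOB_NUM_NODES; n++)); do
    srun -N 1 -n 1 -c 16 ./calc 2 &
done
wait   # block until all steps have finished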

What does the keyword --exclusive mean in slurm?

This is a follow-up question to [How to run jobs in parallel using one slurm batch script?]. The goal was to create a single sbatch script which can start multiple processes and run them in parallel. The answer given by damienfrancois was very detailed and looked something like this:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --partition=All
srun -n 1 -c 1 --exclusive sleep 60 &
srun -n 1 -c 1 --exclusive sleep 60 &
....
wait
However, I am not able to understand the exclusive keyword. If I use it, one node of the cluster is chosen and all processes are launched there, whereas I would like Slurm to distribute the "sleeps" (steps) over the entire cluster.
So how does the exclusive keyword work? According to the Slurm documentation, the restriction to one node should not happen, since the keyword is used within a step allocation.
(I am new to Slurm.)

Do I need a single bash file for each task in SLURM?

I am trying to launch several tasks in a SLURM-managed cluster, and would like to avoid dealing with dozens of files.
Right now, I have 50 tasks (subscripted i; for simplicity, i is also the input parameter of my program), and for each one a single bash file slurm_run_i.sh which specifies the computation configuration and the srun command:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -J pltCV
#SBATCH --mem=30G
srun python plotConvergence.py i
I am then using another bash file, slurm_run_all.sh, to submit all these tasks:
#!/bin/bash
for i in {1..50}; do
    sbatch slurm_run_$i.sh
done
This works (50 jobs are running on the cluster), but I find it troublesome to have more than 50 input files. Searching for a solution, I came up with the & operator, obtaining something like:
#!/bin/bash
#SBATCH --ntasks=50
#SBATCH --cpus-per-task=1
#SBATCH -J pltall
#SBATCH --mem=30G
# Running jobs
srun python plotConvergence.py 1 &
srun python plotConvergence.py 2 &
...
srun python plotConvergence.py 49 &
srun python plotConvergence.py 50 &
wait
echo "All done"
This seems to run as well. However, I cannot manage each of these jobs independently: the output of squeue shows a single job (pltall) running on a single node. As there are only 12 cores on each node in the partition I am working in, I assume most of my tasks are waiting on the single node I've been allocated. Setting the -N option doesn't change anything either. Moreover, I can no longer cancel individual jobs if I realize there's a mistake, which sounds problematic to me.
Is my interpretation right, and is there a better way than my attempt to process several jobs in Slurm without being lost among many files?
What you are looking for is the job array feature of Slurm.
In your case, you would have a single submission file (slurm_run.sh) like this:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -J pltCV
#SBATCH --mem=30G
#SBATCH --array=1-50
srun python plotConvergence.py ${SLURM_ARRAY_TASK_ID}
and then submit the array of jobs with
sbatch slurm_run.sh
You will see that you have 50 jobs submitted. You can cancel all of them at once or one by one. See the man page of sbatch for details.
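For instance, assuming the array is submitted as job 1234 (a made-up ID), the tasks can be inspected and cancelled collectively or individually:
squeue -j 1234   # lists the array tasks as 1234_1, 1234_2, ...
scancel 1234_7   # cancel only task 7
scancel 1234     # cancel the whole array
If the cluster should not be flooded, #SBATCH --array=1-50%10 additionally limits the array to 10 tasks running at any one time.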

Comment in bash script processed by Slurm

I am using Slurm on a cluster to run jobs, and submit a script that looks like the one below with sbatch:
#!/usr/bin/env bash
#SBATCH -o slurm.sh.out
#SBATCH -p defq
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my.email@something.com
echo "hello"
Can I somehow comment out a #SBATCH line, e.g. the #SBATCH --mail-user=my.email@something.com in this script? Since the Slurm instructions are bash comments themselves, I do not know how to achieve this.
Just add another # at the beginning:
##SBATCH --mail-user...
This line will not be processed by Slurm.
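To illustrate: sbatch only interprets lines that start exactly with #SBATCH, so the extra # turns the directive into an ordinary shell comment:
#SBATCH --mail-type=ALL                      # parsed by sbatch
##SBATCH --mail-user=my.email@something.com  # ignored: not a #SBATCH line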

Change the value of external SLURM variable

I am running a bash script to run jobs on Linux clusters, using SLURM. The relevant part of the script is given below (slurm.sh):
#!/bin/bash
#SBATCH -p parallel
#SBATCH --qos=short
#SBATCH --exclusive
#SBATCH -o out.log
#SBATCH -e err.log
#SBATCH --open-mode=append
#SBATCH --cpus-per-task=1
#SBATCH -J hadoopslurm
#SBATCH --time=01:30:00
#SBATCH --mem-per-cpu=1000
#SBATCH --mail-user=amukherjee708@gmail.com
#SBATCH --mail-type=ALL
#SBATCH -N 5
I am calling this script from another script (ext.sh), a part of which is given below:
#!/bin/bash
for i in {1..3}
do
source slurm.sh
done
..
I want to manipulate the value of the N variable in slurm.sh (#SBATCH -N 5) by setting it to values like 3, 6, 8, etc. inside the for loop of ext.sh. How do I access the variable programmatically from ext.sh? Please help.
First note that if you simply source the shell script, you will not submit a job to Slurm; you will simply run the script on the submission node. So you need to write:
#!/bin/bash
for i in {1..3}
do
sbatch slurm.sh
done
Now, if you want to change -N programmatically, one option is to remove it from the file slurm.sh and pass it as an argument to the sbatch command:
#!/bin/bash
for i in {1..3}
do
sbatch -N $i slurm.sh
done
The above script will submit three jobs with 1, 2, and 3 nodes requested, respectively.
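The same mechanism works for any other directive, since options given on the sbatch command line take precedence over the corresponding #SBATCH lines in the script. For the node counts mentioned in the question, a sketch:
#!/bin/bash
# Submit three jobs requesting 3, 6, and 8 nodes respectively;
# -N on the command line overrides any #SBATCH -N left in slurm.sh.
for n in 3 6 8; do
    sbatch -N "$n" slurm.sh
done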
