suppose I run a slurm job with the following configuration:
#!/bin/bash
#SBATCH --nodes=1 # set the number of nodes
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --time=26:59:00 # set max wallclock time
#SBATCH --mem=16000M # set memory limit per node
#SBATCH --job-name=myjobname # set name of job
#SBATCH --mail-type=ALL # mail alert at start, end and abortion of execution
#SBATCH --mail-user=sb@sw.com # send mail to this address
#SBATCH --output=/path/to/output/%x-%j.out # set output path
echo ' mem: ' $SLURM_MEM
echo '\n nodes: ' $SLURM_NODES
echo '\n ntasks: ' $SLURM_NTASKS
echo '\n cpus: ' $SLURM_CPUS_PER_TASK
echo '\n time: ' $SLURM_TIME
I want to save the configuration of this job, such as time, memory, and number of tasks, so that after the job finishes I know under what configuration it was executed.
So I decided to print these variables in the output file; however, there is nothing for time and memory in the output:
\n nodes:
\n ntasks: 1
\n cpus: 1
\n time:
Does anyone know a better way, or how to refer to time and memory?
You can dump a lot of information about your job with scontrol show job <job_id>. This will give you, among other things, the memory requested. It will not, however, give you the actual memory usage. For that you will need to use sacct -l -j <job_id>.
So, at the end of your submission script, you can add
scontrol show job $SLURM_JOB_ID
sacct -l -j $SLURM_JOB_ID
There are many options for selecting the output of the sacct command; refer to the man page for the complete list.
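For example, a narrower query (a sketch; the field names come from the sacct man page, pick the ones you need) that records just the requested configuration and the actual usage:
# requested time/memory/CPUs plus elapsed time and peak memory of a finished job
sacct -j <job_id> --format=JobID,JobName,Timelimit,Elapsed,ReqMem,MaxRSS,AllocCPUS,NNodes,State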
I have a big cluster available through Slurm.
I want to start my script, e.g. ./calc, on every requested node with a specified number of cores. So, for example, on 2 nodes with 16 cores each.
I start with this sbatch script:
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
srun -N 1 ./calc 2 &
srun -N 1 ./calc 2 &
wait
It doesn't work as intended though.
I tried many configurations of --ntasks, --nodes and --cpus-per-task, but nothing worked and I'm very lost.
I also don't understand the difference between tasks and CPUs in Slurm.
In your example, you ask Slurm to launch 16 tasks per node on 2 nodes. Each srun step then inherits that request, so it will likely start 16 copies of ./calc on its node instead of one.
For your needs, you don't need to request 2 nodes specifically; nothing requires the two runs to be placed on two different nodes.
For your example, use the following sbatch script:
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
#SBATCH --hint=nomultithread
srun <my program>
In this example, Slurm will run the program 2 times, each with 16 cores. The nomultithread hint is optional and depends on the cluster configuration: if hyper-threading is activated, the 16 CPUs would otherwise be 16 virtual CPUs (hardware threads) rather than 16 physical cores.
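To see the difference between tasks and CPUs concretely, here is a small test sketch (hypothetical script, not part of the original answer) that prints one line per task together with the node it landed on:
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
# -l prefixes every output line with the task number;
# with 2 tasks of 16 CPUs each you should see exactly two lines
srun -l hostname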
I found this to be a working solution. It turned out the most important thing was to define all the parameters: nodes, tasks, and cpus.
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
srun -N 1 -n 1 -c 16 ./calc 2 &
srun -N 1 -n 1 -c 16 ./calc 2 &
wait
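The explicit -N 1 -n 1 -c 16 on each srun appears to be the key: each step then claims exactly one task with 16 CPUs on one node, so the two backgrounded steps split the two-node allocation cleanly instead of inheriting the whole job's geometry.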
This is a follow-up question to [How to run jobs in parallel using one slurm batch script?]. The goal was to create a single sbatch script which can start multiple processes and run them in parallel. The answer given by damienfrancois was very detailed and looked something like this.
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --partition=All
srun -n 1 -c 1 --exclusive sleep 60 &
srun -n 1 -c 1 --exclusive sleep 60 &
....
wait
However, I am not able to understand the exclusive keyword. If I use it, one node of the cluster is chosen and all processes are launched there. However, I would like Slurm to distribute the "sleeps" (job steps) over the entire cluster.
So how does the exclusive keyword work? According to the Slurm documentation, the restriction to one node should not happen, since the keyword is used within a step allocation.
I am new to Slurm.
I am trying to launch several tasks in a Slurm-managed cluster, and would like to avoid dealing with dozens of files.
Right now, I have 50 tasks (indexed by i; for simplicity, i is also the input parameter of my program), and for each one a single bash file slurm_run_i.sh which indicates the computation configuration and the srun command:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -J pltCV
#SBATCH --mem=30G
srun python plotConvergence.py i
I am then using another bash file to submit all these tasks, slurm_run_all.sh
#!/bin/bash
for i in {1..50}; do
sbatch slurm_run_$i.sh
done
This works (50 jobs are running on the cluster), but I find it troublesome to have more than 50 input files. Searching for a solution, I came up with the & (background) operator, obtaining something like:
#!/bin/bash
#SBATCH --ntasks=50
#SBATCH --cpus-per-task=1
#SBATCH -J pltall
#SBATCH --mem=30G
# Running jobs
srun python plotConvergence.py 1 &
srun python plotConvergence.py 2 &
...
srun python plotConvergence.py 49 &
srun python plotConvergence.py 50 &
wait
echo "All done"
This seems to run as well. However, I cannot manage each of these jobs independently: the output of squeue shows a single job (pltall) running on a single node. As there are only 12 cores on each node in the partition I am working in, I assume most of my jobs are waiting on the single node I have been allocated. Setting the -N option doesn't change anything either. Moreover, I cannot cancel some jobs individually anymore if I realize there's a mistake, which sounds problematic to me.
Is my interpretation right, and is there a better way than my attempt to process several jobs in Slurm without getting lost among many files?
What you are looking for is the job array feature of Slurm.
In your case, you would have a single submission file (slurm_run.sh) like this:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -J pltCV
#SBATCH --mem=30G
#SBATCH --array=1-50
srun python plotConvergence.py ${SLURM_ARRAY_TASK_ID}
and then submit the array of jobs with
sbatch slurm_run.sh
You will see that 50 jobs are submitted. You can cancel all of them at once or one by one. See the man page of sbatch for details.
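For instance (the job ID 12345 below is hypothetical), the array can be managed like this:
sbatch slurm_run.sh   # submits the whole array as a single job, e.g. 12345
squeue -u $USER       # lists the array tasks as 12345_1 ... 12345_50
scancel 12345_7       # cancels a single array task
scancel 12345         # cancels the entire array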
I am using Slurm on a cluster to run jobs. I submit a script that looks like the one below with sbatch:
#!/usr/bin/env bash
#SBATCH -o slurm.sh.out
#SBATCH -p defq
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my.email@something.com
echo "hello"
Can I somehow comment out a #SBATCH line, e.g. the #SBATCH --mail-user=my.email@something.com in this script? Since the Slurm directives are bash comments themselves, I do not know how to achieve this.
Just add another # at the beginning:
##SBATCH --mail-user...
This line will not be processed by Slurm.
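Applied to the script above, the submission file would then look like this, with only the mail-user directive disabled:
#!/usr/bin/env bash
#SBATCH -o slurm.sh.out
#SBATCH -p defq
#SBATCH --mail-type=ALL
##SBATCH --mail-user=my.email@something.com  # extra leading # disables this directive
echo "hello"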
I am running a bash script to run jobs on Linux clusters, using SLURM. The relevant part of the script is given below (slurm.sh):
#!/bin/bash
#SBATCH -p parallel
#SBATCH --qos=short
#SBATCH --exclusive
#SBATCH -o out.log
#SBATCH -e err.log
#SBATCH --open-mode=append
#SBATCH --cpus-per-task=1
#SBATCH -J hadoopslurm
#SBATCH --time=01:30:00
#SBATCH --mem-per-cpu=1000
#SBATCH --mail-user=amukherjee708@gmail.com
#SBATCH --mail-type=ALL
#SBATCH -N 5
I am calling this script from another script (ext.sh), a part of which is given below:
#!/bin/bash
for i in {1..3}
do
source slurm.sh
done
..
I want to manipulate the value of the -N option in slurm.sh (#SBATCH -N 5) by setting it to values like 3, 6, 8, etc., inside the for loop of ext.sh. How do I access this option programmatically from ext.sh? Please help.
First note that if you simply source the shell script, you will not submit a job to Slurm; you will simply run the job on the submission node. So you need to write:
#!/bin/bash
for i in {1..3}
do
sbatch slurm.sh
done
Now, if you want to change the -N value programmatically, one option is to remove it from the file slurm.sh and pass it as an argument to the sbatch command:
#!/bin/bash
for i in {1..3}
do
sbatch -N $i slurm.sh
done
The above script will submit three jobs requesting 1, 2, and 3 nodes respectively.
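If the node counts you want are not consecutive (the question mentions values like 3, 6 and 8), the same idea works with an explicit list:
#!/bin/bash
# submit one job per node count in the list
for n in 3 6 8
do
  sbatch -N $n slurm.sh
done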