Bash workaround: using an environment variable in a #BSUB comment directive

I am using a makefile to run a pipeline; the number of cores is set in the makefile as an environment variable.
At one point in the pipeline the makefile will execute a wrapper script which will start an LSF job array (HPC).
#!/bin/bash
#BSUB -J hybrid_job_name # job name
#BSUB -n 32 # number of cores in job
#BSUB -o output.%J.hybrid # output file name
mpirun.lsf ./program_name.exe
The only problem here is that in the wrapper script the -n flag should be set from the 'CORES' environment variable rather than hard-coded to 32. Is there any way to work around this so I can pass the CORES environment variable to the -n flag?

You could generate the wrapper script that contains the "#BSUB" directives on the fly, before submitting it to LSF. E.g. create a template such as job_script.tmpl in advance:
#!/bin/bash
#BSUB -J hybrid_job_name # job name
#BSUB -n %CORES% # number of cores in job
#BSUB -o output.%J.hybrid # output file name
mpirun.lsf ./program_name.exe
and then in your makefile do:
sed 's/%CORES%/${CORES}/g' job_script.tmpl > job_script.lsf
bsub < job_script.lsf
Alternatively, you can set options to bsub on the command line as well as via #BSUB directives inside the job script. So in the makefile do:
bsub -n ${CORES} < job_script.lsf
The value passed to bsub on the command line overrides the value defined by the #BSUB -n directive inside the job script. This way is a bit simpler, but the first way has the benefit of recording the number of cores used in the job for future reference.
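For completeness, a minimal makefile rule tying the two steps together might look like the sketch below; the target name is a placeholder, it assumes make can see CORES because it is exported in the environment, and recipe lines must be indented with a tab:
submit: job_script.tmpl
	# Substitute the core count into the template, then submit the generated script.
	sed 's/%CORES%/${CORES}/g' job_script.tmpl > job_script.lsf
	bsub < job_script.lsf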

Related

Set number of gpus in PBS script from command line

I'm invoking a job with qsub myjob.pbs. In there, I have some logic to run my experiments, which includes running torchrun, a distributed utility for PyTorch. In that command you can set the number of nodes and the number of processes (+GPUs) per node. Depending on availability, I want to be able to invoke qsub with an arbitrary number of GPUs, so that both -l gpus= and torchrun --nproc_per_node= are set from the command-line argument.
I tried the following:
#!/bin/sh
#PBS -l "nodes=1:ppn=12:gpus=$1"
torchrun --standalone --nnodes=1 --nproc_per_node=$1 myscript.py
and invoked it like so:
qsub --pass "4" myjob.pbs
but I got the following error: ERROR: -l: gpus: expected valid integer, found '"$1"'. Is there a way to pass the number of GPUs to the script so that the PBS directives can read them?
The problem is that your shell sees PBS directives as comments, so it will not expand arguments in them. This means that the expansion of $1 will not occur in:
#PBS -l "nodes=1:ppn=12:gpus=$1"
Instead, you can apply the -l gpus= argument on the command line and remove the directive from your PBS script. For example:
#!/bin/sh
#PBS -l ncpus=12
set -eu
torchrun \
--standalone \
--nnodes=1 \
--nproc_per_node="${nproc_per_node}" \
myscript.py
Then just use a simple wrapper, e.g. run_myjob.sh:
#!/bin/sh
set -eu
qsub \
-l gpus="$1" \
-v nproc_per_node="$1" \
myjob.pbs
Which should let you specify the number of gpus as a command-line argument:
sh run_myjob.sh 4

Read job name from bash script parameters in SGE

I am running Sun Grid Engine for submitting jobs, and I want to have a single bash script that submits any file I need to run, instead of having to run a different qsub command with a different bash file for each of the jobs. I have been able to generate output and error files that share the name of the input file, but now I am struggling with setting a different job name for each file. My approach has been the following:
#!/bin/bash
#
#$ -cwd
#$ -S /bin/bash
#$ -N $1
#
python -u $1 >/output_dir/$1.out 2>/error_dir/$1.error
This way, running qsub send_to_sge.sh foo executes the program, and creates the files foo.error and foo.out with the errors and printouts, respectively. However, the job appears with the name $1 in the SGE queue. Instead, I would like to have foo as the job name. Is there any way to achieve what I am seeking?
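The root cause is the same as in the other questions here: bash treats the #$ lines as comments, so $1 is never expanded inside them. A hedged sketch of the usual workaround (untested; the wrapper name submit.sh is just illustrative) is to drop the #$ -N $1 line and set the job name on the qsub command line from a tiny wrapper, letting the other embedded directives stand:
#!/bin/bash
# submit.sh -- usage: ./submit.sh foo
# -N on the command line sets the job name; send_to_sge.sh keeps its own redirections.
qsub -N "$1" send_to_sge.sh "$1"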

Submitting LSF job array using different arguments for each element of the array

I'm trying to avoid submitting separate jobs. So far I have this at the start of my script:
#!/bin/bash
#BSUB -P account
#BSUB -q queue
#BSUB -W 48:00
#BSUB -n 2
#BSUB -R rusage[mem=40000]
#BSUB -J jobname[1-22]
#BSUB -a 000-176:1
#BSUB -eo jobname.%I.%a.err
#BSUB -oo jobname.%I.%a.out
And then submit the job as follows:
bsub < myscript.sh
I have also tried the -i option, but that doesn't work either.
One more issue is that the ranges of input arguments are different for the different elements of the array: for jobname[1] the input arguments range from 000-176, but for jobname[22] they range from 000-067.
Is there a way to do this without manually submitting the job 22 times or more?
Use the $LSB_JOBINDEX environment variable inside your script, which is set to the index number of the particular array element at execution time.
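As a rough sketch of how that could look for the varying per-element ranges (the lookup file ranges.txt, the program name, and its arguments are made up for illustration), a small text file can hold one line per array element and be indexed with $LSB_JOBINDEX:
#!/bin/bash
#BSUB -J jobname[1-22]
#BSUB -oo jobname.%J.%I.out
#BSUB -eo jobname.%J.%I.err
# ranges.txt has one line per array element, e.g. line 1 is "000 176" and
# line 22 is "000 067"; pick the line matching this element's index.
read START END < <(sed -n "${LSB_JOBINDEX}p" ranges.txt)
./my_program "$LSB_JOBINDEX" "$START" "$END"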

How to prevent multiple executables from running at the same time on cluster

I have submitted a job to a multicore cluster with the LSF platform; the script is shown below. The two executables, exec1 and exec2, start at the same time. My intention was that, separated by a semicolon, the second would start only after the first has finished. Of course, this caused several problems and the job couldn't terminate correctly. Now that I have figured out this behavior, I am writing separate job-submission files for each executable. Can anybody explain why these executables are running at the same time?
#!/bin/bash -l
#
# Batch script for bash users
#
#BSUB -L /bin/bash
#BSUB -n 10
#BSUB -J jobname
#BSUB -oo output.log
#BSUB -eo error.log
#BSUB -q queue
#BSUB -P project
#BSUB -R "span[hosts=1]"
#BSUB -W 4:0
source /etc/profile.d/modules.sh
module purge
module load intel_comp/c4/2013.0.028
module load hdf5/1.8.9
module load platform_mpi/8.2.1
export OMP_NUM_THREADS=1
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
OPT="-aff=automatic:latency"
mpirun $OPT exec1; mpirun $OPT exec2
I assume that both exec1 and exec2 are MPI applications?
Theoretically it should work, but LSF is probably doing something odd and the mpirun for exec1 is exiting before exec1 actually exits. You could instead try:
mpirun $OPT exec1 && mpirun $OPT exec2
so that mpirun $OPT exec1 has to exit with return code 0 before exec2 is launched.
However, it probably isn't a great idea to run two MPI jobs from the same script like this, since, for instance, the MPI environment variable setup may introduce conflicts. What you should really do is use job chaining, so that the job running exec2 is only dispatched after the job running exec1 has finished.
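A hedged sketch of what that chaining could look like with LSF job dependencies (the job names and log file names are placeholders, and OPT is assumed to be set as in the script above): submit the two executables as separate jobs and make the second wait on the first with -w.
# Submit exec1, then submit exec2 with a dependency so it is only
# dispatched once the first job has finished successfully.
bsub -J run_exec1 -n 10 -q queue -oo exec1.log "mpirun $OPT exec1"
bsub -J run_exec2 -n 10 -q queue -oo exec2.log -w 'done(run_exec1)' "mpirun $OPT exec2"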

Pass command line arguments via sbatch

Suppose that I have the following simple bash script which I want to submit to a batch server through SLURM:
#!/bin/bash
#SBATCH -o "outFile"$1".txt"
#SBATCH -e "errFile"$1".txt"
hostname
exit 0
In this script, I simply want to write the output of hostname on a textfile whose full name I control via the command-line, like so:
login-2:jobs$ sbatch -D `pwd` exampleJob.sh 1
Submitted batch job 203775
Unfortunately, it seems that my last command-line argument (1) is not parsed through sbatch, since the files created do not have the suffix I'm looking for and the string "$1" is interpreted literally:
login-2:jobs$ ls
errFile$1.txt exampleJob.sh outFile$1.txt
I've looked around places in SO and elsewhere, but I haven't had any luck. Essentially what I'm looking for is the equivalent of the -v switch of the qsub utility in Torque-enabled clusters.
Edit: As mentioned in the underlying comment thread, I solved my problem the hard way: instead of having one single script that would be submitted several times to the batch server, each with different command line arguments, I created a "master script" that simply echoed and redirected the same content onto different scripts, the content of each being changed by the command line parameter passed. Then I submitted all of those to my batch server through sbatch. However, this does not answer the original question, so I hesitate to add it as an answer to my question or mark this question solved.
I thought I'd offer some insight because I was also looking for the replacement to the -v option in qsub, which for sbatch can be accomplished using the --export option. I found a nice site here that shows a list of conversions from Torque to Slurm, and it made the transition much smoother.
You can export the environment variable ahead of time in your shell and pass its name to --export (note that sbatch options must come before the script name):
$ export var_name='1'
$ sbatch -D `pwd` --export=var_name exampleJob.sh
Or define it directly within the sbatch command, just like qsub allowed:
$ sbatch -D `pwd` --export=var_name='1' exampleJob.sh
Whether this works inside the #SBATCH directives of exampleJob.sh is another question, but I assume it gives the same functionality found in Torque.
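To be clear about that last point: a variable passed with --export is available to the body of exampleJob.sh at run time, but it is not expanded inside the #SBATCH lines, which sbatch parses at submission time. A minimal sketch (the redirection is just illustrative):
#!/bin/bash
#SBATCH -o slurm-%j.out               # %j (the job ID) works here; $var_name would not
hostname > "outFile${var_name}.txt"   # the variable passed via --export is usable here
exit 0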
Using a wrapper is more convenient. I found this solution from this thread.
Basically the problem is that the SBATCH directives are seen as comments by the shell and therefore you can't use the passed arguments in them. Instead you can use a here document to feed in your bash script after the arguments are set accordingly.
In case of your question you can substitute the shell script file with this:
#!/bin/bash
sbatch <<EOT
#!/bin/bash
#SBATCH -o "outFile"$1".txt"
#SBATCH -e "errFile"$1".txt"
hostname
exit 0
EOT
And you run the shell script like this:
bash [script_name].sh [suffix]
And the outputs will be saved to outFile[suffix].txt and errFile[suffix].txt
If you pass your options on the command line, you can bypass the issue of not being able to use command-line arguments in the batch script's directives. For instance, at the command line:
var1="my_error_file.txt"
var2="my_output_file.txt"
sbatch --error=$var1 --output=$var2 batch_script.sh
The lines starting with #SBATCH are not interpreted by bash; they are parsed by sbatch as submission options.
The sbatch options do not support $1-style variables (only filename patterns such as %j and a few others; replacing $1 with %1 will not work).
When you don't have different sbatch processes running in parallel, you could try
#!/bin/bash
#SBATCH -o link_out.sbatch
#SBATCH -e link_err.sbatch
# Note: sbatch only parses #SBATCH directives that appear before the first
# non-comment command, so they have to sit at the top of the script.
touch outFile${1}.txt errFile${1}.txt
rm link_out.sbatch link_err.sbatch 2>/dev/null # remove links from previous runs
ln -s outFile${1}.txt link_out.sbatch
ln -s errFile${1}.txt link_err.sbatch
hostname
# I do not know about the background processing of sbatch, are the jobs still running
# at this point? When they are, you can not delete the temporary symlinks yet.
exit 0
Alternative:
As you said in a comment yourself, you could make a masterscript.
This script can contain lines like
sed -e 's/File.txt/File'"$1"'.txt/' exampleJob.sh.template > exampleJob.sh
# I do not know, is the following needed with sbatch?
chmod +x exampleJob.sh
In your template the #SBATCH lines look like
#SBATCH -o "outFile.txt"
#SBATCH -e "errFile.txt"
This is an old question but I just stumbled into the same task and I think this solution is simpler:
Let's say I have the variable $OUT_PATH in the bash script launch_analysis.bash and I want to pass this variable to task_0_generate_features.sl which is my SLURM file to send the computation to a batch server. I would have the following in launch_analysis.bash:
sbatch --export=OUT_PATH=$OUT_PATH task_0_generate_features.sl
The OUT_PATH value is then directly accessible inside task_0_generate_features.sl.
In @Jason's case above, we would have:
sbatch -D `pwd` --export=hostname=$hostname exampleJob.sh
Reference: Using Variables in SLURM Jobs
Something like this works for me with Torque:
echo "$(pwd)/slurm.qsub 1" | qsub -S /bin/bash -N Slurm-TEST
slurm.qsub:
#!/bin/bash
hostname > outFile${1}.txt 2>errFile${1}.txt
exit 0
