Submitting LSF job array using different arguments for each element of the array - bash

I'm trying to avoid submitting separate jobs. So far I have this at the start of my script:
#!/bin/bash
#BSUB -P account
#BSUB -q queue
#BSUB -W 48:00
#BSUB -n 2
#BSUB -R rusage[mem=40000]
#BSUB -J jobname[1-22]
#BSUB -a 000-176:1
#BSUB -eo jobname.%I.%a.err
#BSUB -oo jobname.%I.%a.out
And then submit the job as follows:
bsub < myscript.sh
I have also tried the -i option, but that doesn't work either.
One more issue is that the range of input arguments differs between elements of the array: for jobname[1] the input arguments range from 000-176, but for jobname[22] they range from 000-067.
Is there a way to do this without manually submitting the job 22 times or more?

Use the $LSB_JOBINDEX environment variable inside your script, which is set to the index number of the particular array element at execution time.
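To handle the per-element ranges, you can key a lookup table on that index. A minimal sketch (the MAX table values and the ./my_analysis command are placeholders; only elements 1 and 22 are known from the question):
#!/bin/bash
#BSUB -P account
#BSUB -q queue
#BSUB -J jobname[1-22]
#BSUB -eo jobname.%I.err
#BSUB -oo jobname.%I.out

# Hypothetical lookup table: MAX[i] is the highest argument number for
# array element i. Fill in the remaining 20 entries for your data.
declare -a MAX=([1]=176 [22]=67)

idx=$LSB_JOBINDEX                     # set by LSF for each array element
for ((arg = 0; arg <= MAX[idx]; arg++)); do
    printf -v padded '%03d' "$arg"    # zero-pad to match the 000-176 style
    ./my_analysis "$idx" "$padded"    # placeholder command
done
This way a single bsub < myscript.sh submits all 22 elements, and each element iterates over its own argument range.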

Related

Set number of gpus in PBS script from command line

I'm invoking a job with qsub myjob.pbs. In there, I have some logic to run my experiments, which includes running torchrun, a distributed utility for PyTorch. In that command you can set the number of nodes and the number of processes (+ GPUs) per node. Depending on availability, I want to be able to invoke qsub with an arbitrary number of GPUs, so that both -l gpus= and torchrun --nproc_per_node= are set from the command-line argument.
I tried the following:
#!/bin/sh
#PBS -l "nodes=1:ppn=12:gpus=$1"
torchrun --standalone --nnodes=1 --nproc_per_node=$1 myscript.py
and invoked it like so:
qsub --pass "4" myjob.pbs
but I got the following error: ERROR: -l: gpus: expected valid integer, found '"$1"'. Is there a way to pass the number of GPUs to the script so that the PBS directives can read it?
The problem is that your shell sees PBS directives as comments, so it cannot expand arguments in this way. In particular, the expansion of $1 will not occur in:
#PBS -l "nodes=1:ppn=12:gpus=$1"
Instead, you can apply the -l gpus= argument on the command line and remove the directive from your PBS script. For example:
#!/bin/sh
#PBS -l ncpus=12
set -eu
torchrun \
--standalone \
--nnodes=1 \
--nproc_per_node="${nproc_per_node}" \
myscript.py
Then just use a simple wrapper, e.g. run_myjob.sh:
#!/bin/sh
set -eu
qsub \
-l gpus="$1" \
-v nproc_per_node="$1" \
myjob.pbs
This should let you specify the number of GPUs as a command-line argument:
sh run_myjob.sh 4
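Note that -v nproc_per_node="$1" injects the value into the job's environment, which is why myjob.pbs can read ${nproc_per_node} without any extra directive.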

LSF ERROR:Project must be 'acc_*'

I need to run a Python script on a supercomputer by submitting a job with LSF. I have been trying to become acquainted with the syntax using a simple example script:
#!/bin/bash
#BSUB -q alloc
#BSUB -n 1
#BSUB -o t.out
echo "Salve Munde!"
I saved this file as example.txt, and on the command line, I ran:
$ bsub < example.txt
This returned the message:
LSF ERROR:Project must be 'acc_*'. Request aborted by esub. Job not submitted.
What is the cause of this error?
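The error comes from the site's esub, which validates submissions before LSF accepts them: on this cluster every job must name a project (allocation) whose name starts with acc_. A minimal fix, assuming your allocation is called acc_myproject (a placeholder; use the name your site gave you), is to add a -P directive:
#!/bin/bash
#BSUB -P acc_myproject   # placeholder; substitute your actual allocation
#BSUB -q alloc
#BSUB -n 1
#BSUB -o t.out
echo "Salve Munde!"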

How to prevent multiple executables from running at the same time on cluster

I have submitted a job to a multicore cluster with the LSF platform; the script is shown at the end. The two executables, exec1 and exec2, start at the same time. My intention was that, separated by a semicolon, the second would only start after the first had finished. Of course, this caused several problems and the job couldn't terminate correctly. Now that I have figured out this behavior, I am writing separate job-submission files for each executable. Can anybody explain why these executables run at the same time?
#!/bin/bash -l
#
# Batch script for bash users
#
#BSUB -L /bin/bash
#BSUB -n 10
#BSUB -J jobname
#BSUB -oo output.log
#BSUB -eo error.log
#BSUB -q queue
#BSUB -P project
#BSUB -R "span[hosts=1]"
#BSUB -W 4:0
source /etc/profile.d/modules.sh
module purge
module load intel_comp/c4/2013.0.028
module load hdf5/1.8.9
module load platform_mpi/8.2.1
export OMP_NUM_THREADS=1
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
OPT="-aff=automatic:latency"
mpirun $OPT exec1; mpirun $OPT exec2
I assume that both exec1 and exec2 are MPI applications?
In theory the semicolon should run them sequentially, but LSF is probably doing something odd and the mpirun for exec1 is returning before exec1 actually finishes. You could instead try:
mpirun $OPT exec1 && mpirun $OPT exec2
so that mpirun $OPT exec1 has to exit with return code 0 before exec2 is launched.
However, it probably isn't a great idea to run two MPI jobs from the same script like this, since, for instance, the MPI environment variable setup may introduce conflicts. What you should really do is use job chaining, so that exec2 only runs after exec1 completes, as sketched below.
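A minimal sketch of that approach using LSF's dependency option -w (the job names and job1.lsf/job2.lsf scripts are placeholders):
# Submit the first job, then a second one that only starts once the
# first has finished successfully.
bsub -J step1 < job1.lsf
bsub -J step2 -w 'done(step1)' < job2.lsf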

Bash workaround, printing environment variable in comment

I am using a makefile to run a pipeline; the number of cores is set in the makefile as an environment variable.
At one point the pipeline executes a wrapper script which starts an LSF job array (HPC).
#!/bin/bash
#BSUB -J hybrid_job_name # job name
#BSUB -n 32 # number of cores in job
#BSUB -o output.%J.hybrid # output file name
mpirun.lsf ./program_name.exe
The only problem here is that in the wrapper script the -n flag should be set by the CORES environment variable, not hard-coded to 32. Is there any way to work around this so I can pass the CORES environment variable to the -n flag?
You could generate the wrapper script that contains the "#BSUB" directives on the fly, before submitting it to LSF. E.g. create a template such as job_script.tmpl in advance:
#!/bin/bash
#BSUB -J hybrid_job_name # job name
#BSUB -n %CORES% # number of cores in job
#BSUB -o output.%J.hybrid # output file name
mpirun.lsf ./program_name.exe
and then in your makefile do:
sed "s/%CORES%/${CORES}/g" job_script.tmpl > job_script.lsf
bsub < job_script.lsf
Alternatively, you can set options to bsub on the command line as well as via #BSUB directives inside the job script. So in the makefile do:
bsub -n ${CORES} < job_script.lsf
The value passed to bsub on the command line overrides the value set by the #BSUB -n directive inside the job script. This way is a bit simpler, but the first way has the benefit of recording the number of cores used in the job for future reference.
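For example, a small wrapper around the second approach (submit.sh is a hypothetical name) could guard against an unset CORES:
#!/bin/bash
# submit.sh -- hypothetical wrapper invoked from the makefile.
set -eu
: "${CORES:?CORES must be set, e.g. exported by the makefile}"
bsub -n "$CORES" < job_script.lsf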

Directly pass parameters to pbs script

Is there a way to directly pass parameters to a .pbs script before submitting a job? I need to loop over a list of files indicated by different numbers and apply a script to analyze each file.
The best I've been able to come up with is the following:
#!/bin/bash
for ((i = 1; i <= 10; i++))
do
export FILENUM=$i
qsub pass_test.pbs
done
where pass_test.pbs is the following script:
#!/bin/sh
#PBS -V
#PBS -S /bin/sh
#PBS -N pass_test
#PBS -l nodes=1:ppn=1,walltime=00:02:00
#PBS -M XXXXXX@XXX.edu
cd /scratch/XXXXXX/pass_test
./run_test $FILENUM
But this feels a bit wonky. In particular, I want to avoid having to create an environment variable to handle this.
The qsub utility can read the script from standard input, so by using a here document you can create the scripts on the fly:
#!/bin/sh
for i in `seq 1 10`
do
cat <<EOS | qsub -
#!/bin/sh
#PBS -V
#PBS -S /bin/sh
#PBS -N pass_test
#PBS -l nodes=1:ppn=1,walltime=00:02:00
#PBS -M XXXXXX@XXX.edu
cd /scratch/XXXXXX/pass_test
./run_test $i
EOS
done
Personally, I would use a more compact version:
#!/bin/sh
for i in `seq 1 10`
do
cat <<EOS | qsub -V -S /bin/sh -N pass_test -l nodes=1:ppn=1,walltime=00:02:00 -M XXXXXX#XXX.edu -
cd /scratch/XXXXXX/pass_test
./run_test $i
EOS
done
You can use the -F option, as described in the qsub documentation:
-F
Specifies the arguments that will be passed to the job script when the script is launched. The accepted syntax is:
qsub -F "myarg1 myarg2 myarg3=myarg3value" myscript2.sh
Note: Quotation marks are required. qsub will fail with an error message if the argument following -F is not a quoted value. The pbs_mom server will pass the quoted value as arguments to the job script when it launches the script.
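Inside the script those arguments arrive as ordinary positional parameters, so a sketch of pass_test.pbs using -F (paths reused from the question) would be:
#!/bin/sh
#PBS -N pass_test
#PBS -l nodes=1:ppn=1,walltime=00:02:00
# $1 holds the value passed on submission, e.g.: qsub -F "3" pass_test.pbs
cd /scratch/XXXXXX/pass_test
./run_test "$1"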
If you just need to pass numbers and run a list of jobs that share the same command except for the input file number, it's better to use a job array instead of a for loop, since a job array puts less load on the job scheduler.
In the .pbs file, refer to the file number through the PBS_ARRAYID environment variable:
./run_test ${PBS_ARRAYID}
And to invoke it, on command line, type:
qsub -t 1-10 pass_test.pbs
where the range of array IDs to use is given after the -t option.
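Putting it together, the array version of pass_test.pbs might look like this:
#!/bin/sh
#PBS -N pass_test
#PBS -l nodes=1:ppn=1,walltime=00:02:00
# PBS_ARRAYID is set by the scheduler for each element of: qsub -t 1-10
cd /scratch/XXXXXX/pass_test
./run_test "${PBS_ARRAYID}"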
