How to run a .sh on different nodes using different data in Slurm? - parallel-processing

I'm a beginner in Slurm and I would like some help with the following problem. I've written a .sh script in which a MATLAB script first creates two arrays (based on the value of a parameter i). Those arrays are then used by a Fortran program that computes a number of other arrays.
I want this process to run at the same time for 10 different values of i (on 10 different nodes?).
I have the following Slurm script, but I'm pretty sure I'm doing it wrong:
#!/bin/bash
#SBATCH --job-name=Name
#SBATCH --nodes=10
#SBATCH --ntasks=10
#SBATCH --ntasks-per-node=1
for i in {1..10}
do
srun -n1 --exclusive ./gaia.sh $i &
done
date
I'm getting the warning "can't run 1 processes on 10 nodes, setting nnodes to 1".
Can someone provide some help?

To get rid of the warning, add --nodes=1 to the parameters already on the line
srun -n1 --exclusive ./gaia.sh $i &
That said, there is probably no reason to force Slurm to use 10 different nodes, so you could also remove all the references to --nodes=.... See the sketch below.
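For illustration, the corrected loop might look like this (a minimal sketch based on the options above; the wait at the end keeps the batch script alive until all the backgrounded srun steps have finished):
#!/bin/bash
#SBATCH --job-name=Name
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=1

for i in {1..10}
do
    # one task per step; --exclusive keeps the steps from sharing the same CPUs
    srun --nodes=1 --ntasks=1 --exclusive ./gaia.sh "$i" &
done
wait   # do not exit before the background steps complete
date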
If, furthermore, there is no real requirement for all runs of the ./gaia.sh script to run concurrently, you can simply use a job array:
#!/bin/bash
#SBATCH --job-name=Name
#SBATCH --array=1-10
#SBATCH --ntasks=1
./gaia.sh $SLURM_ARRAY_TASK_ID
date
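Each array element is scheduled as an independent job, so the ten runs can land on whatever nodes have free resources. If you want a separate log per value of i, a filename pattern like the following should work (a sketch; %A is the array master job ID, %a the array task ID, and the name itself is just an example):
#SBATCH --output=Name_%A_%a.out
Submit the script with sbatch as usual and monitor the array with squeue.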

Related

How to use a variable to define bsub jobname?

I do not know the maximum number of jobs a priori. So when I keep it in a variable:
#!/bin/bash
caselist=/my/caselist.txt
N=`cat $caselist | wc -l`
#BSUB -J myjob[1-$N]
...
...
(I call the above script myjob.lsf)
And submit the job as bsub < myjob.lsf, I get:
Bad job name. Job not submitted.
So is there a way I can use a variable in #BSUB -J myjob[1-$N] within myjob.lsf?
The #BSUB directives in the file are not evaluated as shell code when you pass it to bsub, so $N never gets expanded. You can probably remove the -J directive from the script entirely and instead pass the job name on the command line:
bsub -J "myjob[1-$(wc -l </my/caselist.txt)]" <myjob.lsf
(speculating a bit about the bsub options; the manuals I could find online were rather bad).
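Inside the submitted script, each array element can then pick out its own line from the case list via the LSB_JOBINDEX environment variable that LSF sets for array jobs (a sketch; the processing command is hypothetical):
#!/bin/bash
caselist=/my/caselist.txt
# LSB_JOBINDEX is 1-based, matching the [1-N] array specification
case=$(sed -n "${LSB_JOBINDEX}p" "$caselist")
process_case "$case"   # hypothetical per-case command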

Bash subshell to file

I'm looping over a large file; for each line I run some commands, and when they finish I want their entire output appended to a file.
Since there's nothing stopping me from running multiple commands at once, I tried running each iteration in the background with &.
It doesn't work as expected: the output is appended to the file as each background job finishes, not in the order the lines appear in the input file.
#!/bin/bash
while read -r line; do
    (
        echo -e "$line\n-----------------"
        trivy image --severity CRITICAL "$line"
        # or any other command that might take 1-2 seconds
        echo "============="
    ) >> vulnerabilities.txt &
done < images.txt
Where am I wrong?
Consider using GNU Parallel to get lots of things done in parallel. In your case:
parallel -k -a images.txt trivy image --severity CRITICAL > vulnerabilities.txt
The -k keeps the output in order. Add --bar or --eta for progress reports. Add --dry-run to see what it would do without actually doing anything. Add -j ... to control the number of parallel jobs at any one time - by default, it will run one job per CPU core at a time - so it will basically keep all your cores busy till the jobs are done.
If you want to do more processing on each line, you can declare a bash function, export it, and call it with each line as its parameter, as in the sketch below.
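A sketch of that approach, reproducing the separators from the original loop (assuming GNU Parallel and trivy are available and images.txt lists one image per line):
#!/bin/bash
scan() {
    echo -e "$1\n-----------------"
    trivy image --severity CRITICAL "$1"
    echo "============="
}
export -f scan
# -k keeps the output in input-file order even though jobs finish out of order
parallel -k -a images.txt scan > vulnerabilities.txt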

How to increment a value in a text file and generate corresponding incremented output text files?

I have this type of bash text file:
#!/bin/sh
#SBATCH --partition=mono-shared
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --time=05:00:00
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH --clusters=
module load rdkit
srun /opt/xxx /home/rand/data/DB_split/DB_0001.txt 0.001 /models/computetd/ /home/rand/DBIS_results/DB_0001
I have 455 input text files and I want to generate 455 corresponding bash files. Each bash file should have the numeric parts (DB_0001) incremented in the line "srun /opt/xxx /home/rand/data/DB_split/DB_0001.txt 0.001 /models/computetd/ /home/rand/DBIS_results/DB_0001", and should be named Bash0001.txt, Bash0002.txt, etc. accordingly.
Not sure if I am clear.
Is this possible? Which tool should I use?
Is awk an option? I looked at the tool but couldn't figure out the syntax.
Assuming you have the above file saved as Bash0001.txt, you can use this script:
for ((i=2; i<=455; i++)); do
    printf -v fn '%04d' $i
    sed "s/DB_0001/DB_${fn}/g" Bash0001.txt > "Bash${fn}.txt"
done

Using BSUB to Submit Bash Scripts to Cluster

I use a cluster to process scripts I have written and submit these using code like:
bsub -n 10 < run.sh
The beginning of submitted scripts usually look like:
#BSUB -J align[1-10]
#BSUB -e logs/run.%I.%J.err
#BSUB -o logs/run.%I.%J.out
#BSUB -R "span[hosts=1]"
#BSUB -n 10
My question is: will my script use all of the reserved processors even if the code is not broken up somehow? So if I have something really simple like:
echo "this"
which does not have multiple files to act on, will it still use multiple processors, or just one? And if it uses just a single processor, how do I make the script use more than one?
Just thought I would answer this in case anybody was looking for a simple explanation.
#BSUB -J array[1-3]

# This bash array should have the same number of entries as the job array
# declared on the #BSUB -J line above
files=(first.txt second.txt third.txt)

# LSB_JOBINDEX runs from 1 to 3, so subtract 1 to get the 0-based index
# into the files array and set the current file
currentfile=${files[$(($LSB_JOBINDEX - 1))]}

# Execute some code on each file individually
# (bounded here so the demo job actually finishes)
count=0
while [ "$count" -lt 10 ]
do
    echo "$count" >> /dir/"${currentfile}"
    let "count+=1"
done
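As a usage sketch (the script name is just an example), the array script is submitted the same way as before, and LSF starts three array elements, each with its own LSB_JOBINDEX:
bsub < run_array.sh
Note that reserving slots with -n only sets resources aside; a serial command such as echo "this" still runs on a single core, so the parallelism has to come from a job array like the one above or from a program that is itself multi-threaded.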

Looping files in bash

I want to loop over these kinds of files, where the files with the same Sample_ID have to be used together:
Sample_51770BL1_R1.fastq.gz
Sample_51770BL1_R2.fastq.gz
Sample_52412_R1.fastq.gz
Sample_52412_R2.fastq.gz
e.g. Sample_51770BL1_R1.fastq.gz and Sample_51770BL1_R2.fastq.gz are used together in one command to create an output.
Similarly, Sample_52412_R1.fastq.gz and Sample_52412_R2.fastq.gz are used together to create output.
I want to write a for loop in bash to iterate over them and create the output files.
sourcedir=/sourcepath/
destdir=/destinationpath/
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta Sample_52412_R1.fastq.gz Sample_52412_R2.fastq.gz > $destdir/Sample_52412_R1_R2.sam
How should I pattern match the file names Sample_ID_R1 and Sample_ID_R2 to be used in one command?
Thanks,
for fname in *_R1.fastq.gz
do
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam"
done
In the comments, you ask about running several, but not too many, jobs in parallel. Below is my first stab at that:
#!/bin/bash
# Limit background jobs to no more than $maxproc at once.
maxproc=3
for fname in *_R1.fastq.gz
do
    # Wait until a slot frees up
    while [ "$(jobs | wc -l)" -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo "starting new job with ongoing=$(jobs | wc -l)"
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
wait   # let the last jobs finish before the script exits
The optimal value of maxproc will depend on how many processors your PC has. You may need to experiment to find what works best.
Note that the above script uses jobs, which is a bash builtin. It therefore has to be run under bash, not dash, which is the default /bin/sh on Debian-like distributions.
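If your bash is 4.3 or newer, wait -n gives a slightly cleaner throttle than polling with sleep. A sketch along the same lines (same hypothetical file layout as above):
#!/bin/bash
maxproc=3
for fname in *_R1.fastq.gz
do
    # Once the limit is reached, block until any one background job finishes
    while [ "$(jobs -rp | wc -l)" -ge "$maxproc" ]
    do
        wait -n
    done
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
wait   # wait for the remaining jobs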
