How to process a list of files with SLURM - bash

I'm new to SLURM. I want to process a list of files, assembled_reads/*.sorted.bam, in parallel. With the code below, however, only one process is used over and over again.
#!/bin/bash
#
#SBATCH --job-name=****
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --partition=short
#SBATCH --time=12:00:00
#SBATCH --array=1-100
#SBATCH --mem-per-cpu=16000
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=****#***.edu
srun hostname
for FILE in assembled_reads/*.sorted.bam; do
    echo ${FILE}
    OUTFILE=$(basename ${FILE} .sorted.bam).raw.snps.indels.g.vcf
    PLDY=$(awk -F "," '$1=="$FILE"{print $4}' metadata.csv)
    PLDYNUM=$( [[$PLDY = "haploid" ]] && echo "1" || echo "2")
    srun java -Djava.io.tmpdir="tmp" -jar GenomeAnalysisTK.jar \
        -R scaffs_HAPSgracilaria92_50REF.fasta \
        -T HaplotypeCaller \
        -I ${${SLURM_ARRAY_TASK_ID}} \
        --emitRefConfidence GVCF \
        -ploidy $PLDYNUM \
        -nt 1 \
        -nct 24 \
        -o $OUTFILE
    sleep 1 # pause to be kind to the scheduler
done

You are creating a job array but are not using it: each of the 100 array tasks runs the same loop over all the files. Replace the for-loop with an indexing of the file list based on the Slurm array task ID:
#!/bin/bash
#
#SBATCH --job-name=****
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --partition=short
#SBATCH --time=12:00:00
#SBATCH --array=0-99
#SBATCH --mem-per-cpu=16000
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=****#***.edu
srun hostname
# glob all input files into a bash array and pick the one for this task
FILES=(assembled_reads/*.sorted.bam)
FILE=${FILES[$SLURM_ARRAY_TASK_ID]}
echo ${FILE}
OUTFILE=$(basename ${FILE} .sorted.bam).raw.snps.indels.g.vcf
# pass the shell variable into awk with -v; "$FILE" inside single quotes is never expanded
PLDY=$(awk -F "," -v file="$FILE" '$1==file {print $4}' metadata.csv)
PLDYNUM=$( [[ $PLDY == "haploid" ]] && echo "1" || echo "2")
srun java -Djava.io.tmpdir="tmp" -jar GenomeAnalysisTK.jar \
    -R scaffs_HAPSgracilaria92_50REF.fasta \
    -T HaplotypeCaller \
    -I ${FILE} \
    --emitRefConfidence GVCF \
    -ploidy $PLDYNUM \
    -nt 1 \
    -nct 24 \
    -o $OUTFILE
Just make sure to adapt the value of --array to match the number of files to process: the FILES array is zero-indexed, so for 100 files you want --array=0-99.
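If the file count changes between runs, one option is to size the array at submission time instead; a minimal sketch, assuming the script above is saved as job.sh and submitted from the directory containing assembled_reads/:

N=$(ls assembled_reads/*.sorted.bam | wc -l)
sbatch --array=0-$(( N - 1 )) job.sh

An --array value given on the sbatch command line overrides the #SBATCH directive in the script.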

Related

Addition of two variables in slurm script

I have a Slurm script processing fMRI data. The maximum value I can give in an array is 999, but my subject names go over 1000, so I need to do an addition in my Slurm script. I tried:
a=${SLURM_ARRAY_TASK_ID} sum=$(($a + 1200))
#!/bin/sh
#
#SBATCH --job-name psy-stephan_fmriprep_gsp
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8GB
#SBATCH --output /projects/core-psy/logs/nako/stephan/slurm-%j.log
#SBATCH --error /projects/core-psy/logs/nako/stephan/slurm-%j.err
a=${SLURM_ARRAY_TASK_ID}
sum=$(($a + 1200))
singularity run \
    --home /projects/core-psy/tmp3:/home/fmriprep \
    --cleanenv \
    -B /projects/core-psy/data/nako//swunderl/GSP_new/:/input \
    -B /projects/core-psy/data/nako/swunderl/GSP_new/derivatives:/output \
    -B //projects/core-psy/data/nako/swunderl/GSP_new_workdir/:/workdir \
    -B /projects/core-psy/data/nako/swunderl/license.txt:/license \
    /projects/core-psy/images/fmriprep-20.2.2.simg /input/sub-$sum /output participant \
    --fs-license-file /license \
    --skip-bids-validation \
    --use-aroma \
    --fs-no-reconall \
    -w /workdir/
    #--output-layout bids
    # sbatch --account=core-psy sbatch-multiple-job.slurm
This way I can pass SLURM_ARRAY_TASK_ID on the command line, e.g. as 1. But the addition keeps giving me sub-0+1200 and not the actual sum of the two numbers.
Since you do not need to perform math on a, you can build the four-digit subject label for your fmriprep command with plain string concatenation:
sum="1${SLURM_ARRAY_TASK_ID}"
This way sbatch -a 200 ./your_job_script.sh will run for sub-1200. If you have labels like 1001, you will need to pad the task ID as well, since a leading-zero label like 001 is stored as a SLURM_ARRAY_TASK_ID of just 1.
Here's an example adapted from my own, albeit not the most succinct, code for sbatch scripts:
# task IDs are at most three digits here, so inputNo is always set
if [ ${#SLURM_ARRAY_TASK_ID} == 1 ]; then
    inputNo="100${SLURM_ARRAY_TASK_ID}"
elif [ ${#SLURM_ARRAY_TASK_ID} == 2 ]; then
    inputNo="10${SLURM_ARRAY_TASK_ID}"
elif [ ${#SLURM_ARRAY_TASK_ID} == 3 ]; then
    inputNo="1${SLURM_ARRAY_TASK_ID}"
fi

singularity run \
    --home /projects/core-psy/tmp3:/home/fmriprep \
    --cleanenv \
    -B /projects/core-psy/data/nako/swunderl/GSP_new/:/input \
    -B /projects/core-psy/data/nako/swunderl/GSP_new/derivatives:/output \
    -B /projects/core-psy/data/nako/swunderl/GSP_new_workdir/:/workdir \
    -B /projects/core-psy/data/nako/swunderl/license.txt:/license \
    /projects/core-psy/images/fmriprep-20.2.2.simg /input/sub-${inputNo} /output participant \
    --fs-license-file /license \
    --skip-bids-validation \
    --use-aroma \
    --fs-no-reconall \
    -w /workdir/
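If you prefer, the same padding can be done in one line with printf; a sketch, equivalent to the if/elif above for one- to three-digit task IDs:

inputNo=$(printf '1%03d' "$SLURM_ARRAY_TASK_ID")   # 7 -> 1007, 42 -> 1042, 321 -> 1321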

How to append memory usage for each step within a shell script in slurm output

I have a bash script:
#!/bin/bash
time srun -p my_partition -c 1 --mem=4G my_code -i my_file_1 -o my_output_file_1
time srun -p my_partition -c 1 --mem=4G my_code -i my_file_2 -o my_output_file_2
time srun -p my_partition -c 1 --mem=4G my_code -i my_file_3 -o my_output_file_3
time srun -p my_partition -c 1 --mem=4G my_code -i my_file_4 -o my_output_file_4
I want to know the average memory usage for each step (printed after the real/user/sys time) while the script is running.
I have tried
#!/bin/bash
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_1 -o my_output_file_1
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_2 -o my_output_file_2
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_3 -o my_output_file_3
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_4 -o my_output_file_4
sstat -a -j my_job --format=JobName,AveRSS,MaxRSS
You can try querying each step individually with sstat right after it finishes. Note that step IDs are numbered from 0, in the order the srun commands run:
#!/bin/bash
# step IDs count from 0: the first srun in the job is step ${SLURM_JOB_ID}.0
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_1 -o my_output_file_1
sstat -j ${SLURM_JOB_ID}.0 --format=JobName,AveRSS,MaxRSS
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_2 -o my_output_file_2
sstat -j ${SLURM_JOB_ID}.1 --format=JobName,AveRSS,MaxRSS
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_3 -o my_output_file_3
sstat -j ${SLURM_JOB_ID}.2 --format=JobName,AveRSS,MaxRSS
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_4 -o my_output_file_4
sstat -j ${SLURM_JOB_ID}.3 --format=JobName,AveRSS,MaxRSS
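Note that sstat only reports on steps that are still running; once the job has completed, the same fields are available from the accounting database via sacct (assuming accounting is enabled on your cluster):

sacct -j <jobid> --format=JobID,JobName,AveRSS,MaxRSS,Elapsed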

Pass arguments to a script that is an argument to a different script

I am new to programming, so please bear with the way I try to explain my problem (also, any help with phrasing the title more elegantly is welcome).
I have a bash script (say, script1.sh) that takes in arguments a, b and c (another script). Essentially, argument c for script1.sh is the name of another script (let's say script2.sh). However, script2.sh takes in arguments d, e and f. So my question is: how do I pass arguments to script1.sh? (Example: ./script1.sh -a 1 -b 2 -c script2.sh -d 3 -e 4 -f 5.)
Sorry in advance if the above does not make sense; I'm not sure how else to phrase it...
You should quote the whole sub-command with "" so that it is passed to script1.sh as a single argument:
./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
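Inside script1.sh the quoted string then arrives as a single argument. A minimal sketch of how script1.sh might consume it (the getopts parsing here is an assumption, not part of the original question):

#!/bin/bash
# hypothetical script1.sh
while getopts "a:b:c:" opt; do
    case $opt in
        a) a="$OPTARG" ;;
        b) b="$OPTARG" ;;
        c) c="$OPTARG" ;;  # e.g. "script2.sh -d 3 -e 4 -f 5"
    esac
done
echo "a=$a b=$b"
$c  # deliberately unquoted: word splitting turns the string back into a command plus its arguments
    # (script2.sh must be on PATH, or passed with a path such as ./script2.sh)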
Try script1.sh with this code
#!/bin/bash
for arg in "$#"; { # loop through all arguments passed to the script
echo $arg
}
The output will be
$ ./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
-a
1
-b
2
-c
script2.sh -d 3 -e 4 -f 5
But if you run this
#!/bin/bash
for arg in $@; { # no double quotes around $@
    echo $arg
}
The output will be
$ ./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
-a
1
-b
2
-c
script2.sh
-d
3
4
-f
5
But there is no -e. Why? Because echo treats -e as one of its own options and consumes it.
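A common way around this pitfall (an addition here, not part of the original answer) is to print with printf, which does not interpret its arguments as options the way echo does:

for arg in "$@"; do
    printf '%s\n' "$arg"
done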

Hisat2 with job array

I want to use HISAT2 instead of bowtie2, but I have a problem with my script:
#!/bin/bash
##SBATCH --time=5:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=32G
#SBATCH --cpus-per-task=16 # Nb of threads we want to run on
#SBATCH -o log/slurmjob-%j
#SBATCH --job-name=hist2
#SBATCH --partition=short
#SBATCH --array=0-5
module load gcc/4.8.4 HISAT2/2.0.5 samtools/1.3
SCRATCHDIR=/storage/scratch/"$USER"/"$SLURM_JOB_ID"
DATABANK="$HOME/projet/GRCm38/bwa"
OUTPUT="$HOME"/hisat2
mkdir -p "$OUTPUT"
mkdir -p "$SCRATCHDIR"
cd "$SCRATCHDIR"
#Define an array to optimize tasks
ARRAY=()
hisat2 -p 8 -x "$DATABANK"/all.* \
    -1 "$HOME"/chipseq/${ARRAY[$SLURM_ARRAY_TASK_ID]}_R1_trim.fastq.gz \
    -2 "$HOME"/chipseq/${ARRAY[$SLURM_ARRAY_TASK_ID]}_R2_trim.fastq.gz \
    -S Hisat2_out.${ARRAY[$SLURM_ARRAY_TASK_ID]}.sam \
    | samtools view -b -S - \
    | samtools sort - -o Hisat2_out.${ARRAY[$SLURM_ARRAY_TASK_ID]}.mapped.sorted.bam
samtools idxstats Hisat2_out.${ARRAY[$SLURM_ARRAY_TASK_ID]}.mapped.sorted.bam > $OUTPUT/"$HOME/hisat2/hisat2_indxstat".log
mv "$SCRATCHDIR" "$OUTPUT"
The error occurs on this expansion:
${ARRAY[$SLURM_ARRAY_TASK_ID]}: unbound variable
Thank you for the help !
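A likely cause, judging from the script itself: ARRAY is declared empty (ARRAY=()), so indexing it with $SLURM_ARRAY_TASK_ID yields nothing, which bash reports as an unbound variable when set -u is in effect. A hypothetical sketch of a populated array, one sample prefix per task to match --array=0-5:

ARRAY=(sampleA sampleB sampleC sampleD sampleE sampleF)  # placeholder names; use your own file prefixes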

combine GNU parallel with nested for loops and multiple variables

I have n folders in destdir. Each folder contains two files, *R1.fastq and *R2.fastq. The script below does the job (bowtie2) one folder at a time and outputs {name of the subfolder}.sam in the destdir.
#!/bin/bash
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
for f in $destdir/*
do
    fbase=$(basename "$f")
    echo "Sample $fbase"
    bowtie2 -p 4 -x $mm9_index -X 2000 \
        -1 "$f"/*R1.fastq \
        -2 "$f"/*R2.fastq \
        -S $destdir/${fbase}.sam
done
I want to use the GNU parallel tool to speed this up. Can you help? Thanks.
Use a bash function:
#!/bin/bash
my_bowtie() {
    mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
    destdir=/Users/Desktop/test/outdir/
    f="$1"
    fbase=$(basename "$f")
    echo "Sample $fbase"
    bowtie2 -p 4 -x $mm9_index -X 2000 \
        -1 "$f"/*R1.fastq \
        -2 "$f"/*R2.fastq \
        -S $destdir/${fbase}.sam
}
export -f my_bowtie
parallel my_bowtie ::: $destdir/*
For more details: man parallel or http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Calling-Bash-functions
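One caveat worth adding (not in the original answer): each bowtie2 call already uses 4 threads (-p 4), so you may want to cap the number of concurrent jobs with parallel's -j option rather than run one job per core:

parallel -j 4 my_bowtie ::: $destdir/*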
At its simplest, you can normally just put echo in front of each command and pipe the list of commands you would have run sequentially into GNU Parallel, which then executes them in parallel, like this:
for f in ...; do
echo bowtie2 -p 4 ....
done | parallel
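Applied to the loop from the question, that skeleton might look like this (same variables as above; the globs expand when echo runs, so the generated commands contain literal file names, assuming no spaces in the paths):

for f in "$destdir"/*; do
    fbase=$(basename "$f")
    echo bowtie2 -p 4 -x "$mm9_index" -X 2000 -1 "$f"/*R1.fastq -2 "$f"/*R2.fastq -S "$destdir/${fbase}.sam"
done | parallel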
