Running a command on SLURM that takes command-line arguments - bash

I'm completely new to using HPCs and SLURM, so I'd really appreciate some guidance here.
I need to iteratively run a command that looks like this
kallisto quant -i '/home/myName/genomes/hSapien.idx' \
-o "output-SRR3225412" \
"SRR3225412_1.fastq.gz" \
"SRR3225412_2.fastq.gz"
where the SRR3225412 part will be different in each iteration.
The problem is, as I found out, that I can't just append this command to the end of an sbatch call:
sbatch --nodes=1 \
--ntasks-per-node=1 \
--cpus-per-task=1 \
kallisto quant -i '/home/myName/genomes/hSapien.idx' \
-o "output-SRR3225412" \
"SRR3225412_1.fastq.gz" \
"SRR3225412_2.fastq.gz"
This command doesn't work. I get the error
sbatch: error: This does not look like a batch script. The first
sbatch: error: line must start with #! followed by the path to an interpreter.
sbatch: error: For instance: #!/bin/sh
How do I run the sbatch command, specifying its run parameters, while also passing the command-line arguments for the kallisto program I'm trying to use? In the end, I'd like to have something like:
#!/bin/bash
for sample in ...
do
sbatch --nodes=1 \
--ntasks-per-node=1 \
--cpus-per-task=1 \
kallistoCommandOnSample --arg1 a1 \
--arg2 a2 arg3 a3
done

The error sbatch: error: This does not look like a batch script. occurs because sbatch expects a submission script: a batch script, typically a Bash script, in which comments starting with #SBATCH are interpreted by Slurm as options.
So the typical way of submitting a job is to create a file, let's name it submit.sh:
#! /bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
kallisto quant -i '/home/myName/genomes/hSapien.idx' \
-o "output-SRR3225412" \
"SRR3225412_1.fastq.gz" \
"SRR3225412_2.fastq.gz"
and then submit it with
sbatch submit.sh
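Note that anything you put after the script name on the sbatch command line is passed to the batch script as positional parameters ($1, $2, and so on). So one option, sketched here with an illustrative parameterised submit.sh, is to submit the same script once per sample:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
SAMPLE="$1" # first argument given after the script name on the sbatch command line
kallisto quant -i '/home/myName/genomes/hSapien.idx' \
-o "output-${SAMPLE}" \
"${SAMPLE}_1.fastq.gz" \
"${SAMPLE}_2.fastq.gz"
and then submit it as, for example, sbatch submit.sh SRR3225412.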
If you have multiple similar jobs to submit, it is beneficial for several reasons to use a job array. The loop you want to create can be replaced with a single submission script looking like
#! /bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --array=0-9 # replace 9 with the number of samples minus one (bash arrays are zero-indexed)
SAMPLES=(...) # here put what you would loop over
CURRSAMPLE=${SAMPLES[$SLURM_ARRAY_TASK_ID]}
kallisto quant -i '/home/myName/genomes/hSapien.idx' \
-o "output-${CURRSAMPLE}" \
"${CURRSAMPLE}_1.fastq.gz" \
"${CURRSAMPLE}_2.fastq.gz"
As pointed out by @Carles Fenoy, if you do not want to use a submission script, you can use the --wrap parameter of sbatch:
sbatch --nodes=1 \
--ntasks-per-node=1 \
--cpus-per-task=1 \
--wrap "kallisto quant -i '/home/myName/genomes/hSapien.idx' \
-o 'output-SRR3225412' \
'SRR3225412_1.fastq.gz' \
'SRR3225412_2.fastq.gz'"
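Combined with the loop from the question, the --wrap form would look roughly like this (the sample IDs after in are only placeholders):
#!/bin/bash
for sample in SRR3225412 SRR3225413
do
sbatch --nodes=1 \
--ntasks-per-node=1 \
--cpus-per-task=1 \
--wrap "kallisto quant -i '/home/myName/genomes/hSapien.idx' \
-o 'output-${sample}' \
'${sample}_1.fastq.gz' \
'${sample}_2.fastq.gz'"
done
The job array above remains the preferable option when the number of samples grows, as it is easier to manage and keeps the number of individual submissions down.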

Related

Output multiple files SGE

I have a .sh script that looks like this:
#$ -t 1-8
#$ -tc 8
#list of tasks
task_list=$( sed "${SGE_TASK_ID}q;d" list_of_jobs.txt )
#python script
./ldsc.py \
--h2 ${task_list} \
--ref-ld-chr /baselineLD. \
--out $cts_name \
I am running 8 jobs in parallel, but need each of them to output a separate file using the --out flag.
How can I do this?
The list_of_jobs.txt is a list of eight files (tasks) that get analyzed.
file1.txt
file2.txt
file3.txt
…
file8.txt
Figured it out.
basename="${cts_name##*/}"
and then
--h2 $basename was the trick!
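For reference, ${cts_name##*/} is ordinary bash parameter expansion: it removes the longest prefix matching */, i.e. it strips any leading directory components, much like the basename command. A quick illustration with a made-up path:
cts_name="/some/dir/file3.txt"
basename="${cts_name##*/}"
echo "$basename" # prints: file3.txt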

Anyone know what's causing this linux error?

I'm trying to run DeepVariant via its Singularity container on the HPC; however, I get this error and I can't figure it out!
Code:
#!/bin/bash --login
#SBATCH -J AmyHouseman_deepvariant
#SBATCH -o %x.stdout.%J.%N
#SBATCH -e %x.stderr.%J.%N
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH -p c_compute_wgp
#SBATCH --account=scw1581
#SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=HousemanA@cardiff.ac.uk # Where to send mail
#SBATCH --array=1-33
#SBATCH --time=02:00:00
#SBATCH --time=072:00:00
#SBATCH --mem-per-cpu=32GB
module purge
module load singularity
module load parallel
set -eu
cd /scratch/c.c21087028/
BIN_VERSION="1.3.0"
singularity pull docker://google/deepvariant:"${BIN_VERSION}"
sed -n "${SLURM_ARRAY_TASK_ID}p" Polyposis_Exome_Analysis/fastp/All_fastp_input/List_of_33_exome_IDs | parallel -j 1 "singularity run singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
docker://google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WES \
-ref=Polyposis_Exome_Analysis/bwa/index/HumanRefSeq/GRCh38_latest_genomic.fna \
--reads=Polyposis_Exome_Analysis/samtools/index/indexed_picardbamfiles/{}PE_markedduplicates.bam \
--output_vcf=Polyposis_Exome_Analysis/deepvariant/vcf/{}PE_output.vcf.gz \
--output_gvcf=Polyposis_Exome_Analysis/deepvariant/gvcf/{}PE_output.vcf.gz \
--intermediate_results_dir=Polyposis_Exome_Analysis/deepvariant/intermediateresults/{}PE_output_intermediate"
Error:
FATAL: While making image from oci registry: error fetching image to cache: failed to get checksum for docker://google/deepvariant:1.3.0: pinging container registry registry-1.docker.io: Get "https://registry-1.docker.io/v2/": dial tcp 52.0.218.102:443: connect: network is unreachable
I've asked a lot of people, and I'm still stuck! Thanks, Amy

Why is it not possible to run wget with the background option in a Slurm script?

I used this script for downloading files. Without -b, wget downloads the files one by one. With -b, wget can download the files in the background and therefore simultaneously. Unfortunately, the script doesn't work under Slurm: it only works without -b.
Script for downloading files
#!/bin/bash
mkdir data
cd data
for i in 11 08 15 26 ;
do
wget -c -b -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/0${i}/SRR116802${i}/SRR116802${i}_1.fastq.gz
wget -c -b -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/0${i}/SRR116802${i}/SRR116802${i}_2.fastq.gz
done
cd ..
Slurm Script
#!/bin/bash
#SBATCH --job-name=mytestjob # create a short name for your job
#SBATCH --nodes=2 # node count
#SBATCH --ntasks=2 # total number of tasks across all nodes
#SBATCH --cpus-per-task=2 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G # memory per cpu-core (4G is the default)
#SBATCH --time=10:01:00 # total run time limit (HH:MM:SS)
#SBATCH --array=1-2 # job array with index values 1, 2
#Execution
bash download.sh
On the terminal: sbatch slurmsript.sh (it doesn't work; no job ID is printed).
One likely reason the -b version fails under Slurm is that wget -b puts itself into the background immediately, so the batch script reaches its end while the downloads are still running; when the batch script exits, Slurm considers the job finished and kills any processes still attached to it, including the background downloads.
As an alternative, you can download multiple files at the same time with curl. In your case, this should work:
# Create an empty bash array of urls.
urls=()
# Add each url to the array, such that '-O' and the url are separate
# items in the array. This is necessary so that the curl command will
# look like 'curl -O <url1> -O <url2> ...', since the -O command must
# be provided for each url.
for i in 11 08 15 26; do
urls+=( "-O" "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/0${i}/SRR116802${i}/SRR116802${i}_1.fastq.gz" )
urls+=( "-O" "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/0${i}/SRR116802${i}/SRR116802${i}_2.fastq.gz" )
done
# Simultaneously download from all urls.
curl --silent -C - "${urls[@]}"
To explain each of the curl options:
--silent is the equivalent of wget's -q. Disables curl's progress meter.
-C - is the equivalent of wget's -c. It tells curl to automatically find out where/how to resume a transfer.
-O tells curl to write the output to a file with the same name as the remote file (this is the behavior of wget). This must be specified for each url.
Alternatively, you might want to consider installing and using aria2.
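aria2 can read a list of URLs from a file (one per line) and also resumes partial downloads; a rough sketch, writing the same URL pattern as in the question into a hypothetical urls.txt:
# build the list of URLs once
for i in 11 08 15 26; do
echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/0${i}/SRR116802${i}/SRR116802${i}_1.fastq.gz"
echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/0${i}/SRR116802${i}/SRR116802${i}_2.fastq.gz"
done > urls.txt
# -c resumes partial files, -j limits concurrent downloads, -i reads URLs from the file
aria2c -c -j 4 -i urls.txt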

GNU parallel read from several files

I am trying to use GNU parallel to convert individual files with a bioinformatic tool called vcf2maf.
My command looks something like this:
${parallel} --link "perl ${vcf2maf} --input-vcf ${1} \
--output-maf ${maf_dir}/${2}.maf \
--tumor-id ${3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
VCF_files, results and tumor_ids contain one entry per line and correspond to one another.
When I try and run the command I get the following error for every file:
ERROR: Both input-vcf and output-maf must be defined!
This confused me, because if I run the command manually, the program works as intended, so I don't think the input/output paths are wrong. To confirm this, I also ran
${parallel} --link "cat ${1}" :::: ${VCF_files} ${results} ${tumor_ids},
which correctly prints the contents of the VCF files, whose path is listed in VCF_files.
I am really confused about what I did wrong; if anyone could help me out, I'd be very thankful!
Thanks!
For a command this long I would normally define a function:
doit() {
...
}
export -f doit
Then test this on a single input.
When it works:
parallel --link doit :::: ${VCF_files} ${results} ${tumor_ids}
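As a rough illustration, the body of doit can simply be the vcf2maf command from the question, with the function's own positional parameters in place of the replacement strings (an untested sketch):
doit() {
# $1 = VCF file, $2 = output name, $3 = tumor ID
perl "${vcf2maf}" --input-vcf "$1" \
--output-maf "${maf_dir}/$2.maf" \
--tumor-id "$3" \
--tmp-dir "${vcf_dir}" \
--vep-path "${vep_script}" \
--vep-data "${vep_data}" \
--ref-fasta "${fasta}" \
--filter-vcf "${filter_vcf}"
}
export -f doit
Note that the helper variables (vcf2maf, maf_dir, vcf_dir, and so on) must also be exported so that they are still defined when GNU Parallel runs the function in a child shell.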
But if you want to use a single command it will look something like:
${parallel} --link "perl ${vcf2maf} --input-vcf {1} \
--output-maf ${maf_dir}/{2}.maf \
--tumor-id {3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
GNU Parallel's replacement strings are {1}, {2}, and {3} - not ${1}, ${2}, and ${3}.
--dryrun is your friend when GNU Parallel does not do what you expect it to do.
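For example, adding --dryrun to the command above prints each generated command line instead of running it, so the substituted file names can be checked by eye (the vcf2maf options are shortened here for brevity):
${parallel} --link --dryrun "perl ${vcf2maf} --input-vcf {1} --output-maf ${maf_dir}/{2}.maf --tumor-id {3}" :::: ${VCF_files} ${results} ${tumor_ids}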

Changing script from PBS to SLURM

I have just switched from PBS to SLURM and am trying to change my submission script accordingly. Before, it looked something like:
qsub -N $JK -e $LOGDIR/JK_MASTER.error -o $LOGDIR/JK_MASTER.log -v Z="$ZBIN",NBINS="$nbins",MIN="$Theta_min" submit_MASTER_analysis.sh
Now I need something like:
sbatch --job-name=$JK -e $LOGDIR/JK_MASTER.error -o $LOGDIR/JK_MASTER.log --export=Z="$ZBIN",NBINS="$nbins",MIN="$Theta_min" submit_MASTER_analysis.sh
But for some reason this is not quite executing the job; I think it's a problem with the variables.
I have found out how to do this now, so I thought I had better update the post for anyone else interested.
In my launch script I now have:
sbatch --job-name=REALIZ_${R}_zbin${Z} \
--output=$RAND_DIR/RANDOM_MASTER_${R}_zbin${Z}.log \
--error=$RAND_DIR/RANDOM_MASTER_${R}_zbin${Z}.error \
--ntasks=1 \
--cpus-per-task=1 \
--ntasks-per-core=1 \
--threads-per-core=1 \
submit_RANDOMS_analysis.sh $JK $ZBIN $nbins $R $Theta_min 'LOW'
where $JK $ZBIN $nbins $R $Theta_min 'LOW' are the arguments I pass through to the script I am submitting to the queue, submit_RANDOMS_analysis.sh. These are then read inside the submitted script as positional parameters, for instance the first argument with JK=$1.
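For completeness, a small sketch of how the top of submit_RANDOMS_analysis.sh might pick up those arguments (only JK=$1 is from the post above; the other variable names are assumed for illustration):
#!/bin/bash
JK=$1 # first positional argument from the sbatch command line
ZBIN=$2
NBINS=$3
R=$4
THETA_MIN=$5
MODE=$6 # 'LOW' in the example above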
