Hisat2 with job array - shell

I want to use HISAT2 instead of Bowtie2, but I have a problem with my script:
#!/bin/bash
##SBATCH --time=5:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=32G
#SBATCH --cpus-per-task=16 # number of threads to run on
#SBATCH -o log/slurmjob-%j
#SBATCH --job-name=hist2
#SBATCH --partition=short
#SBATCH --array=0-5
module load gcc/4.8.4 HISAT2/2.0.5 samtools/1.3
SCRATCHDIR=/storage/scratch/"$USER"/"$SLURM_JOB_ID"
DATABANK="$HOME/projet/GRCm38/bwa"
OUTPUT="$HOME"/hisat2
mkdir -p "$OUTPUT"
mkdir -p "$SCRATCHDIR"
cd "$SCRATCHDIR"
#Define an array to optimize tasks
ARRAY=()
hisat2 -p 8 -x "$DATABANK"/all.* \
    -1 "$HOME"/chipseq/${ARRAY[$SLURM_ARRAY_TASK_ID]}_R1_trim.fastq.gz \
    -2 "$HOME"/chipseq/${ARRAY[$SLURM_ARRAY_TASK_ID]}_R2_trim.fastq.gz \
    -S Hisat2_out.${ARRAY[$SLURM_ARRAY_TASK_ID]}.sam \
    | samtools view -b -S - \
    | samtools sort - -o Hisat2_out.${ARRAY[$SLURM_ARRAY_TASK_ID]}.mapped.sorted.bam
samtools idxstats Hisat2_out.${ARRAY[$SLURM_ARRAY_TASK_ID]}.mapped.sorted.bam > $OUTPUT/"$HOME/hisat2/hisat2_indxstat".log
mv "$SCRATCHDIR" "$OUTPUT"
The error occurs on this expansion:
${ARRAY[$SLURM_ARRAY_TASK_ID]}: variable without link
Thank you for the help!
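A likely cause, going only by the script as posted: ARRAY is declared empty (ARRAY=()), so ${ARRAY[$SLURM_ARRAY_TASK_ID]} expands to nothing for every task, which matches the reported error. The array needs one sample basename per index of --array=0-5, along the lines of this sketch (the sample names are hypothetical placeholders):
# hypothetical basenames -- replace with the real prefixes of your
# *_R1_trim.fastq.gz / *_R2_trim.fastq.gz files in ~/chipseq
ARRAY=(sampleA sampleB sampleC sampleD sampleE sampleF)
SAMPLE=${ARRAY[$SLURM_ARRAY_TASK_ID]} # one sample per array task
Two further points, assuming the script is otherwise as posted: -x expects the index basename (for files all.1.ht2, all.2.ht2, ... that is -x "$DATABANK"/all, not a glob), and with -S given, hisat2 writes the SAM to that file instead of stdout, so the samtools commands after the pipe receive no input; dropping -S lets the alignments flow through the pipe.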

Related

I have 2 bash loops with the same structure and only the first works

Issue
I have a few PE fastq files from an infected host. First, I map the reads to the host genome, keep the reads that did not map, and convert that single bam to new paired-end fastq files. The next loop takes the new PE fastq files and maps them to the pathogen. The problem I'm facing is that the beginning of the second loop does not find the associated R2.fastq files. All work is being done on my institution's Linux compute cluster.
The appropriate files are created at the end of the first loop, and the second loop is able to find the R1 files, but not the R2 files in the same directory. I have stared at this for a couple of days now, making changes in an attempt to figure out the naming issue.
Any help determining the issue with the second for loop would be greatly appreciated. Keep in mind this is my first post and my degree is in biology. Please be gentle.
Code
#PBS -S /bin/bash
#PBS -l partition=bigmem,nodes=1:ppn=16,walltime=1:00:00:00
#PBS -A ACF-UTK0011
#PBS -M wbrewer5@vols.utk.edu
#PBS -m abe
#PBS -e /lustre/haven/user/wbrewer5/pandora/lowcov/error/
#PBS -o /lustre/haven/user/wbrewer5/pandora/lowcov/log/
#PBS -N PandoraLowCovMapping1
#PBS -n
cd $PBS_O_WORKDIR
set -x
module load samtools
module load bwa
#create indexes for pea aphid and pandora genomes
#bwa index -p pea_aphid.fna pea_aphid.fna
#bwa index -p pandora_canu_pilon.fasta pandora_canu_pilon.fasta
#map read files to the aphid genome and keep reads that do not map
for r1 in `ls /lustre/haven/user/wbrewer5/pandora/lowcov/reads/*R1.fastq`
do
r2=`sed 's/R1.fastq/R2.fastq/' <(echo $r1)`
BASE1=$(basename $r1 | sed 's/_R1.fastq*//g')
echo "r1 $r1"
echo "r2 $r2"
echo "BASE1 $BASE1"
bwa mem -t 16 -v 3 \
pea_aphid.fna \
$r1 \
$r2 |
samtools view -@ 16 -u -f 12 -F 256 - |
samtools sort -@ 16 -n - |
samtools fastq - \
-1 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/$BASE1\_unmapped_R1.fastq \
-2 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/$BASE1\_unmapped_R2.fastq \
-0 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/$BASE1\_trash.txt \
-s /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/$BASE1\_more_trash.txt
echo "Step 1: mapped reads from $BASE1 to aphid genome and saved to 1_samtools as paired end .fastq"
done
rm /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/*trash*
echo "saving unmapped reads to new fastq files complete!"
for f1 in `ls /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/*unmapped_R1.fastq`
do
f2=`sed 's/R1.fastq/R2.fastq/' >(echo $f1)`
BASE2=$(basename $f1 | sed 's/_R1.fastq*//g')
echo "f1 $f1"
echo "f2 $f2"
echo "BASE2 $BASE2"
bwa mem -t 16 -v 3 \
pandora_canu_pilon.fasta \
$f1 \
$f2 |
samtools sort -@ 16 -o ./2_angsd/$BASE2\.bam -
echo "Step 2: mapped reads from $BASE2 to pandora genome saved to 2_angsd as .bam"
done
echo "Mapping new fastq files to pandora genome complete!!"
Log
First file of first loop
++ ls /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_614_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_686_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/p-251_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/p-614_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/p-686_R1.fastq
+ for r1 in '`ls /lustre/haven/user/wbrewer5/pandora/lowcov/reads/*R1.fastq`'
++ sed s/R1.fastq/R2.fastq/ /dev/fd/63
+++ echo /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq
+ r2=/lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R2.fastq
++ basename /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq
++ sed 's/_R1.fastq*//g'
+ BASE1=Matt_251
+ echo 'r1 /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq'
+ echo 'r2 /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R2.fastq'
+ echo 'BASE1 Matt_251'
+ bwa mem -t 16 -v 3 pea_aphid.fna /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R2.fastq
+ samtools view -@ 16 -u -f 12 -F 256 -
+ samtools sort -@ 16 -n -
+ samtools fastq - -1 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R1.fastq -2 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R2.fastq -0 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_trash.txt -s /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_more_trash.txt
First file of second loop
++ ls /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_614_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_686_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/p-251_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/p-614_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/p-686_unmapped_R1.fastq
+ for f1 in '`ls /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/*unmapped_R1.fastq`'
++ sed s/R1.fastq/R2.fastq/ /dev/fd/63
+++ echo /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R1.fastq
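A likely culprit, going only by the script as posted: the second loop builds f2 with >(echo $f1), an output process substitution, where the first loop uses <(echo $r1), an input one. sed is handed the write end of a pipe to read from, so f2 most likely comes out empty, and the unmapped_R2.fastq files are never found. A plain parameter expansion sidesteps sed and the process substitution entirely:
# inside the second loop: derive f2 from f1 by replacing the first match
f2=${f1/R1.fastq/R2.fastq}
The same expansion would also simplify r2 in the first loop.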

How to append memory usage for each step within a shell script in slurm output

I have a bash script:
#!/bin/bash
time srun -p my_partition -c 1 --mem=4G my_code -i my_file_1 -o my_output_file_1
time srun -p my_partition -c 1 --mem=4G my_code -i my_file_2 -o my_output_file_2
time srun -p my_partition -c 1 --mem=4G my_code -i my_file_3 -o my_output_file_3
time srun -p my_partition -c 1 --mem=4G my_code -i my_file_4 -o my_output_file_4
I want to know the average memory usage for each step (printed after the real/user/sys time) while the script is running.
I have tried
#!/bin/bash
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_1 -o my_output_file_1
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_2 -o my_output_file_2
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_3 -o my_output_file_3
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_4 -o my_output_file_4
sstat -a -j my_job --format=JobName,AveRSS,MaxRSS
but that does not work: sstat selects jobs by job ID (optionally jobid.step), not by job name.
You can try querying each step right after its srun completes. Note that step IDs within a job are numbered from 0 in submission order, so the first srun is step ${SLURM_JOB_ID}.0:
#!/bin/bash
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_1 -o my_output_file_1
sstat -j ${SLURM_JOB_ID}.0 --format=JobName,AveRSS,MaxRSS
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_2 -o my_output_file_2
sstat -j ${SLURM_JOB_ID}.1 --format=JobName,AveRSS,MaxRSS
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_3 -o my_output_file_3
sstat -j ${SLURM_JOB_ID}.2 --format=JobName,AveRSS,MaxRSS
time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_4 -o my_output_file_4
sstat -j ${SLURM_JOB_ID}.3 --format=JobName,AveRSS,MaxRSS
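The same pattern scales better as a loop with an explicit step counter; a minimal sketch, assuming the file naming from the question:
#!/bin/bash
step=0
for i in 1 2 3 4; do
    time srun -p my_partition -c 1 --mem=4G --job-name="my_job" my_code -i my_file_$i -o my_output_file_$i
    # query the step that just finished; steps are numbered 0,1,2,... in submission order
    sstat -j ${SLURM_JOB_ID}.$step --format=JobName,AveRSS,MaxRSS
    step=$((step + 1))
done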

Pass arguments to a script that is an argument to a different script

I am new to programming, so please bear with the way I try to explain my problem (also, any help phrasing the title more elegantly is welcome).
I have a bash script (say, script1.sh) that takes arguments a, b, and c. Argument c is the name of another script (let's say script2.sh), which itself takes arguments d, e, and f. So my question is: how do I pass arguments to script1.sh? (For example, ./script1.sh -a 1 -b 2 -c script2.sh -d 3 -e 4 -f 5.)
Sorry in advance if the above does not make sense; I'm not sure how else to phrase it.
You should wrap the script and its arguments in double quotes, so they are passed as a single argument:
./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
Try script1.sh with this code
#!/bin/bash
for arg in "$#"; { # loop through all arguments passed to the script
echo $arg
}
The output will be
$ ./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
-a
1
-b
2
-c
script2.sh -d 3 -e 4 -f 5
But if you run this
#!/bin/bash
for arg in $@; { # no double quotes around $@
echo "$arg"
}
The output will be
$ ./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
-a
1
-b
2
-c
script2.sh
-d
3
4
-f
5
But there is no -e. Why? Because echo treats -e as one of its own options and consumes it instead of printing it.
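Inside script1.sh, the quoted value of -c then arrives as a single argument and can be run with its inner arguments intact. A minimal sketch, assuming a getopts-style interface (the option parsing is not from the question):
#!/bin/bash
# script1.sh -- sketch only; option names follow the question, getopts is assumed
while getopts "a:b:c:" opt; do
    case $opt in
        a) a=$OPTARG ;;
        b) b=$OPTARG ;;
        c) c=$OPTARG ;; # e.g. "script2.sh -d 3 -e 4 -f 5"
    esac
done
# deliberate unquoted expansion: word splitting turns the string back into
# a command plus arguments (fine for simple arguments without spaces or globs)
$c
Called as ./script1.sh -a 1 -b 2 -c "./script2.sh -d 3 -e 4 -f 5", the last line runs script2.sh with -d 3 -e 4 -f 5.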

How to process a list of files with SLURM

I'm new to SLURM. I want to process a list of files, assembled_reads/*.sorted.bam, in parallel. With the code below, however, only one process is being used over and over again.
#!/bin/bash
#
#SBATCH --job-name=****
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --partition=short
#SBATCH --time=12:00:00
#SBATCH --array=1-100
#SBATCH --mem-per-cpu=16000
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=****@***.edu
srun hostname
for FILE in assembled_reads/*.sorted.bam; do
echo ${FILE}
OUTFILE=$(basename ${FILE} .sorted.bam).raw.snps.indels.g.vcf
PLDY=$(awk -F "," '$1=="$FILE"{print $4}' metadata.csv)
PLDYNUM=$( [[$PLDY = "haploid" ]] && echo "1" || echo "2")
srun java -Djava.io.tmpdir="tmp" -jar GenomeAnalysisTK.jar \
-R scaffs_HAPSgracilaria92_50REF.fasta \
-T HaplotypeCaller \
-I ${${SLURM_ARRAY_TASK_ID}} \
--emitRefConfidence GVCF \
-ploidy $PLDYNUM \
-nt 1 \
-nct 24 \
-o $OUTFILE
sleep 1 # pause to be kind to the scheduler
done
You are creating a job array but are not using it. You should replace the for-loop with an indexing of the files based on the SLURM array task ID:
#!/bin/bash
#
#SBATCH --job-name=****
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --partition=short
#SBATCH --time=12:00:00
#SBATCH --array=0-99
#SBATCH --mem-per-cpu=16000
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=****@***.edu
srun hostname
FILES=(assembled_reads/*.sorted.bam)
FILE=${FILES[$SLURM_ARRAY_TASK_ID]}
echo ${FILE}
OUTFILE=$(basename ${FILE} .sorted.bam).raw.snps.indels.g.vcf
PLDY=$(awk -F "," -v file="$FILE" '$1==file {print $4}' metadata.csv)
PLDYNUM=$( [[ $PLDY = "haploid" ]] && echo "1" || echo "2")
srun java -Djava.io.tmpdir="tmp" -jar GenomeAnalysisTK.jar \
-R scaffs_HAPSgracilaria92_50REF.fasta \
-T HaplotypeCaller \
-I ${FILE} \
--emitRefConfidence GVCF \
-ploidy $PLDYNUM \
-nt 1 \
-nct 24 \
-o $OUTFILE
Just make sure to adapt the value of --array to match the number of files to process: N files need indices 0 to N-1.
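If the file count changes between runs, the array range can also be computed at submission time rather than hard-coded; a sketch, where my_job_script.sh stands in for the batch script above:
# size the array to the number of .sorted.bam files: N files -> indices 0..N-1
N=$(ls assembled_reads/*.sorted.bam | wc -l)
sbatch --array=0-$((N - 1)) my_job_script.sh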

combine GNU parallel with nested for loops and multiple variables

I have n folders in destdir. Each folder contains two files, *R1.fastq and *R2.fastq. This script runs the job (bowtie2) on each folder one by one and writes {name of the subfolder}.sam to destdir.
#!/bin/bash
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
for f in $destdir/*
do
fbase=$(basename "$f")
echo "Sample $fbase"
bowtie2 -p 4 -x $mm9_index -X 2000 \
-1 "$f"/*R1.fastq \
-2 "$f"/*R2.fastq \
-S $destdir/${fbase}.sam
done
I want to use the GNU parallel tool to speed this up. Can you help? Thanks.
Use a bash function:
#!/bin/bash
my_bowtie() {
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
f="$1"
fbase=$(basename "$f")
echo "Sample $fbase"
bowtie2 -p 4 -x $mm9_index -X 2000 \
-1 "$f"/*R1.fastq \
-2 "$f"/*R2.fastq \
-S $destdir/${fbase}.sam
}
export -f my_bowtie
destdir=/Users/Desktop/test/outdir/ # needed at top level too: the assignment inside the function only happens when the function runs
parallel my_bowtie ::: "$destdir"/*
For more details: man parallel or http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Calling-Bash-functions
At its simplest, you can normally just put echo in front of your commands and pipe the list of commands you would have executed sequentially into GNU Parallel, which then executes them in parallel, like this:
for f in ...; do
echo bowtie2 -p 4 ....
done | parallel
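Applied to the loop from the question, that pattern might look like the sketch below (paths are the question's own; the globs expand before echo runs, so each emitted line is a complete, concrete command):
#!/bin/bash
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
for f in "$destdir"/*; do
    fbase=$(basename "$f")
    echo bowtie2 -p 4 -x "$mm9_index" -X 2000 \
        -1 "$f"/*R1.fastq -2 "$f"/*R2.fastq \
        -S "$destdir/$fbase.sam"
done | parallel
This assumes the file names contain no spaces, since each line is re-parsed by the shell that parallel starts.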
