I have 2 bash loops with the same structure and only the first works - bash

Issue
I have a few PE fastq files from an infected host. First I map reads to the host, keep the reads that did not map, and convert that single bam to new paired-end fastq files. The next loop takes the new PE fastq files and maps them to the pathogen. The problem I'm facing is that the beginning of the second loop does not find the associated R2.fastq. All work is being done on my institution's Linux compute cluster.
The appropriate files are created at the end of the first loop, and the second loop is able to find the R1 files, but not the R2 files in the same directory. I have stared at this for a couple of days now, making changes in an attempt to figure out the naming issue.
Any help determining the issue with the second for loop would be greatly appreciated. Keep in mind this is my first post and my degree is in biology, so please be gentle.
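To illustrate what I expect the substitution to do for each pair, here is a toy example of the naming pattern (the sample names are made up, and the parameter-expansion form is just an alternative way of writing the sed call):
# toy illustration of the R1 -> R2 naming pattern (made-up names)
r1=/path/to/reads/Sample_01_R1.fastq
r2=${r1/R1.fastq/R2.fastq}   # same substitution the sed call performs
echo "$r2"                   # /path/to/reads/Sample_01_R2.fastq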
Code
#PBS -S /bin/bash
#PBS -l partition=bigmem,nodes=1:ppn=16,walltime=1:00:00:00
#PBS -A ACF-UTK0011
#PBS -M wbrewer5@vols.utk.edu
#PBS -m abe
#PBS -e /lustre/haven/user/wbrewer5/pandora/lowcov/error/
#PBS -o /lustre/haven/user/wbrewer5/pandora/lowcov/log/
#PBS -N PandoraLowCovMapping1
#PBS -n
cd $PBS_O_WORKDIR
set -x
module load samtools
module load bwa
#create indexes for pea aphid and pandora genomes
#bwa index -p pea_aphid.fna pea_aphid.fna
#bwa index -p pandora_canu_pilon.fasta pandora_canu_pilon.fasta
#map read files to the aphid genome and keep reads that do not map
for r1 in `ls /lustre/haven/user/wbrewer5/pandora/lowcov/reads/*R1.fastq`
do
r2=`sed 's/R1.fastq/R2.fastq/' <(echo $r1)`
BASE1=$(basename $r1 | sed 's/_R1.fastq*//g')
echo "r1 $r1"
echo "r2 $r2"
echo "BASE1 $BASE1"
bwa mem -t 16 -v 3 \
pea_aphid.fna \
$r1 \
$r2 |
samtools view -@ 16 -u -f 12 -F 256 - |
samtools sort -@ 16 -n - |
samtools fastq - \
-1 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/$BASE1\_unmapped_R1.fastq \
-2 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/$BASE1\_unmapped_R2.fastq \
-0 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/$BASE1\_trash.txt \
-s /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/$BASE1\_more_trash.txt
echo "Step 1: mapped reads from $BASE1 to aphid genome and saved to 1_samtools as paired end .fastq"
done
rm /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/*trash*
echo "saving unmapped reads to new fastq files complete!"
for f1 in `ls /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/*unmapped_R1.fastq`
do
f2=`sed 's/R1.fastq/R2.fastq/' >(echo $f1)`
BASE2=$(basename $f1 | sed 's/_R1.fastq*//g')
echo "f1 $f1"
echo "f2 $f2"
echo "BASE2 $BASE2"
bwa mem -t 16 -v 3 \
pandora_canu_pilon.fasta \
$f1 \
$f2 |
samtools sort -@ 16 -o ./2_angsd/$BASE2\.bam -
echo "Step 2: mapped reads from $BASE2 to pandora genome saved to 2_angsd as .bam"
done
echo "Mapping new fastq files to pandora genome complete!!"
Log
First file of first loop
++ ls /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_614_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_686_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/p-251_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/p-614_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/p-686_R1.fastq
+ for r1 in '`ls /lustre/haven/user/wbrewer5/pandora/lowcov/reads/*R1.fastq`'
++ sed s/R1.fastq/R2.fastq/ /dev/fd/63
+++ echo /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq
+ r2=/lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R2.fastq
++ basename /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq
++ sed 's/_R1.fastq*//g'
+ BASE1=Matt_251
+ echo 'r1 /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq'
+ echo 'r2 /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R2.fastq'
+ echo 'BASE1 Matt_251'
+ bwa mem -t 16 -v 3 pea_aphid.fna /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/reads/Matt_251_R2.fastq
+ samtools view -@ 16 -u -f 12 -F 256 -
+ samtools sort -@ 16 -n -
+ samtools fastq - -1 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R1.fastq -2 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R2.fastq -0 /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_trash.txt -s /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_more_trash.txt
First file of second loop
++ ls /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_614_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_686_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/p-251_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/p-614_unmapped_R1.fastq /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/p-686_unmapped_R1.fastq
+ for f1 in '`ls /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/*unmapped_R1.fastq`'
++ sed s/R1.fastq/R2.fastq/ /dev/fd/63
+++ echo /lustre/haven/user/wbrewer5/pandora/lowcov/1_samtools/Matt_251_unmapped_R1.fastq

Related

Bash script does not find index from array

I have written a bash script which I want to use to monitor backups on a synology via pushgateway.
The script should search for subfolders in the backup folder, write the newest file into a variable and write the age and size of the file into an array.
To give it finally to the push gateway, I list all metrics with indexes. All folders or files are available. If I execute the script, often one or more indexes are not found. If I execute the commands manually one by one, I get a correct output.
Here is the script:
#!/bin/bash
set -e
backup_dir=$1
for dir in $(find "$backup_dir" -maxdepth 1 -mindepth 1 -type d \( ! -name @eaDir \)); do
if compgen -G "${dir}/*.vib" > /dev/null; then
latest_vib=$(ls -t1 "$dir"/*.vib | head -1)
age_vib=$(( ( $(date +%s) - $(stat -c %Y "$latest_vib") ) ))
size_vib=$(stat -c %s "$latest_vib")
arrage_vib+=("${age_vib}")
arrsize_vib+=("${size_vib}")
fi
if compgen -G "${dir}/*.vbk" > /dev/null; then
latest_vbk=$(ls -t1 "$dir"/*.vbk | head -1)
age_vbk=$(( ( $(date +%s) - $(stat -c %Y "$latest_vbk") ) ))
size_vbk=$(stat -c %s "$latest_vbk")
arrage_vbk+=("${age_vbk}")
arrsize_vbk+=("${size_vbk}")
fi
min_dir=$(echo "$dir" | cut -d'/' -f4- | tr '[:upper:]' '[:lower:]')
sign_dir=${min_dir//_/-}
arrdir+=("${sign_dir}")
done
echo "${arrdir[4]}"
echo "${arrage_vib[4]}"
cat << EOF | curl -ks -u user:pw --data-binary @- https://pushgateway/metrics/job/backup/instance/instance_name
# HELP backup_age displays the age of backups in seconds
# TYPE backup_age gauge
backup_age_vib{dir="${arrdir[1]}"} ${arrage_vib[1]}
backup_age_vib{dir="${arrdir[2]}"} ${arrage_vib[2]}
backup_age_vib{dir="${arrdir[3]}"} ${arrage_vib[3]}
backup_age_vib{dir="${arrdir[4]}"} ${arrage_vib[4]}
backup_age_vbk{dir="${arrdir[1]}"} ${arrage_vbk[1]}
...
# HELP backup_size displays the size of backups in bytes
# TYPE backup_size gauge
backup_size_vib{dir="${arrdir[1]}"} ${arrsize_vib[1]}
...
EOF
I hope you can help me and point out where I made a mistake. I am also open to general optimizations of the script, since I assume it can be done better and more efficiently; I borrowed a few lines of code from here ;-).
Many thanks in advance.
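For reference, the per-file age/size computation on its own looks like this (the path is made up):
#!/bin/bash
# standalone version of the age/size computation (made-up path)
latest=/volume1/backup/server1/backup.vib
age=$(( $(date +%s) - $(stat -c %Y "$latest") ))   # seconds since last modification
size=$(stat -c %s "$latest")                       # size in bytes
echo "age=${age}s size=${size}B"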

Is there a way to only require one echo in this scenario?

I have the following line of code:
for h in "${Hosts[#]}" ; do echo "$MyLog" | grep -m 1 -B 3 -A 1 $h >> /LogOutput ; done
My hosts variable is a large array of hosts
Is there a better way to do this that doesn't require me to echo on each loop? Like grep on a variable instead?
No echo, no loop
#!/bin/bash
hosts=(host1 host2 host3)
MyLog="
asf host
sdflkj
sadkjf
sdlkjds
lkasf
sfal
asf host2
sdflkj
sadkjf
"
re="${hosts[#]}"
egrep -m 1 -B 3 -A 1 ${re// /|} <<< "$MyLog"
Variant with one echo
echo "$MyLog" | egrep -m 1 -B 3 -A 1 ${re// /|}
Usage
$ ./test
sdlkjds
lkasf
sfal
asf host2
sdflkj
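For clarity, the ${re// /|} expansion above just joins the array into an alternation pattern:
hosts=(host1 host2 host3)
re="${hosts[@]}"        # "host1 host2 host3"
echo "${re// /|}"       # host1|host2|host3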
One echo, no loops, and all grepping done in parallel, with GNU Parallel:
echo "$MyLog" | parallel -k --tee --pipe 'grep -m 1 -B 3 -A 1 {}' ::: "${hosts[#]}"
The -k keeps the output in order.
The --tee and the --pipe ensure that the stdin is duplicated to all processes.
The processes that are run in parallel are enclosed in single quotes.
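A toy demo of the --tee/--pipe behaviour (assumes GNU Parallel is installed):
# every pattern gets its own full copy of stdin; -k keeps output in argument order
printf 'a\nb\nc\n' | parallel -k --tee --pipe 'grep -c {}' ::: a b c
# prints 1, 1, 1 (one count per pattern)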
You could printf your array one element per line and then grep that? Something like:
printf '%s\n' "${Hosts[@]}" | grep -m 1 -B 3 -A 1 $h >> /LogOutput
Assuming you're on a GNU system; otherwise see info grep.
From grep --help
grep --help | head -n1
Output
Usage: grep [OPTION]... PATTERN [FILE]...
So, according to that, you can do:
for h in "${Hosts[#]}" ; do grep -m 1 -B 3 -A 1 "$h" "$MyLog" >> /LogOutput ; done

parallelizing nested for loop with GNU Parallel

I am working in Bash. I have a series of nested for loops that iteratively look for the presence of three lists of 96 barcode sequences. My goal is to find each unique combination of barcodes; there are 96x96x96 (884,736) possible combinations.
for barcode1 in "${ROUND1_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode1" $FASTQ_R > ROUND1_MATCH.fastq
echo barcode1.is.$barcode1 >> outputLOG
if [ -s ROUND1_MATCH.fastq ]
then
# Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
for barcode2 in "${ROUND2_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
if [ -s ROUND2_MATCH.fastq ]
then
# Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step
for barcode3 in "${ROUND3_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode3" ./ROUND2_MATCH.fastq | sed '/^--/d' > ROUND3_MATCH.fastq
# If matches are found we will write them to an output .fastq file iteratively labelled with an ID number
if [ -s ROUND3_MATCH.fastq ]
then
mv ROUND3_MATCH.fastq results/result.$count.2.fastq
fi
count=`expr $count + 1`
done
fi
done
fi
done
This code works and I am able to successfully extract the sequences with each barcode combination. However, I think the speed of this can be improved for working through large files by parallelizing this loop structure. I know that I can use GNU Parallel to do this; however, I am struggling to nest the parallelizations.
# Parallelize nested loops
now=$(date +"%T")
echo "Beginning STEP1.2: PARALLEL Demultiplex using barcodes. Current
time : $now" >> outputLOG
mkdir ROUND1_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} SRR6750041_2_smalltest.fastq > ROUND1_PARALLEL_HITS/{#}_ROUND1_MATCH.fastq' ::: "${ROUND1_BARCODES[@]}"
mkdir ROUND2_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND1_PARALLEL_HITS/*.fastq > ROUND2_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND2_BARCODES[@]}"
mkdir ROUND3_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND2_PARALLEL_HITS/*.fastq > ROUND3_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND3_BARCODES[@]}"
mkdir parallel_results
parallel -j 6 'mv {} parallel_results/result_{#}.fastq' ::: ROUND3_PARALLEL_HITS/*.fastq
How can I successfully recreate the nested structure of the for loops using parallel?
Parallelized only the inner loop:
for barcode1 in "${ROUND1_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode1" $FASTQ_R > ROUND1_MATCH.fastq
echo barcode1.is.$barcode1 >> outputLOG
if [ -s ROUND1_MATCH.fastq ]
then
# Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
for barcode2 in "${ROUND2_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
if [ -s ROUND2_MATCH.fastq ]
then
# Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step
doit() {
grep -B 1 -A 2 "$1" ./ROUND2_MATCH.fastq | sed '/^--/d'
}
export -f doit
parallel -j0 doit {} '>' results/$barcode1-$barcode2-{} ::: "${ROUND3_BARCODES[@]}"
# TODO remove files with 0 length
fi
done
fi
done
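For completeness, a rough sketch of flattening all three loops into a single parallel cross product (untested; assumes the barcode arrays and $FASTQ_R are set, and that results/ exists; note this launches up to 884,736 jobs and re-greps the full input for every combination, so it trades efficiency for simplicity):
doit3() {
    # $1/$2/$3: one barcode from each round; chain the three greps like the nested loops
    grep -B 1 -A 2 "$1" "$FASTQ_R" \
      | grep -B 1 -A 2 "$2" \
      | grep -B 1 -A 2 "$3" \
      | sed '/^--/d' > "results/result.$1.$2.$3.fastq"
}
export -f doit3
export FASTQ_R
# repeated ::: arguments make GNU Parallel generate the full 96x96x96 cartesian product
parallel -j6 doit3 {1} {2} {3} \
    ::: "${ROUND1_BARCODES[@]}" ::: "${ROUND2_BARCODES[@]}" ::: "${ROUND3_BARCODES[@]}"
# zero-length results can be cleaned up afterwards, e.g.:
# find results -name 'result.*.fastq' -size 0 -delete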

Shell command to save output in subfolders after using find?

I have a main directory with several subfolders. Each subfolder contains a *.fna file and I want my script to do the command with the fna file and write the output back in the subfolder.
As my script is now, it makes one big file in the main directory, but I want the output per subfolder, in each subfolder.
find * |grep fna$ |while read fna ; do formatdb -i $fna -p F ; blastall -p blastn \
-d $fna -i plasmiddb_genes_renamed.fsa -m 8 -e 1E-30 |while read hit ; do echo \
$fna $hit ; done ; done > $fna.blast_plasmidrefdb.out
A couple of ways:
Capture the dirname of the $fna value:
find * |grep fna$ |while read fna ; do fdir=$(dirname $fna); formatdb -i $fna -p F ; blastall -p blastn \
-d $fna -i plasmiddb_genes_renamed.fsa -m 8 -e 1E-30 |while read hit ; do echo \
$fna $hit | tee ${fdir}/out.txt; done ; done > $fna.blast_plasmidrefdb.out
Use the $fna path with a suffix, e.g. ${fna}.out:
find * |grep fna$ |while read fna ; do formatdb -i $fna -p F ; blastall -p blastn \
-d $fna -i plasmiddb_genes_renamed.fsa -m 8 -e 1E-30 |while read hit ; do echo \
$fna $hit | tee ${fna}.out; done ; done > $fna.blast_plasmidrefdb.out
I used | tee in case you wanted to ALSO still capture everything in one file.
Compressing the one-liner even more:
for fna in $(find . -type f -name "*.fna"); do formatdb -i $fna -p F ; blastall -p blastn -d $fna -i plasmiddb_genes_renamed.fsa -m 8 -e 1E-30 | tee ${fna}.out; done
If your find supports it, use -execdir:
find . -name '*.fna' -execdir sh -c '{
formatdb -i $0 -p F; blastall -p blastn -d $0 \
-i /path/to/plasmiddb_genes_renamed.fsa -m 8 -e 1E-30;
} > $0.output' {} \;
That will generate an output file for each *.fna file, in the directory containing that *.fna file. Note that you need to use a full path to plasmiddb_genes_renamed.fsa.
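To see what -execdir actually does per file, a quick check (assumes GNU find):
# prints the directory each command runs in and the name the file is passed as
find . -name '*.fna' -execdir sh -c 'echo "running in $PWD on $0"' {} \;
# each line shows the subfolder as $PWD and the file as ./something.fna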

Storing a line in a variable

Hi, I have the following batch script where I submit each file for separate processing as follows:
for file in ../Positive/*.txt_rn; do
bsub <<EOF
#BSUB -L /bin/bash
#BSUB -W 150:00
#BSUB -M 10000
#BSUB -n 3
#BSUB -e /somefolder/errors/%J.err
#BSUB -o /somefolder/errors/%J.out
while read line; do
name=`cat \$line | awk '{print $1":"$2"-"$3}'`
four=`cat \$line | awk '{print $4}' | cut -d\: -f4`
fasta=\$name".fa"
op=\$name".rs"
echo \$name | xargs samtools faidx /somefolder/rn4/Rattus_norvegicus/UCSC/rn4/Sequence/WholeGenomeFasta/genome.fa > \$fasta
Process -F \$fasta -M "list_"\$four".txt" -p 0.003 | awk '(\$5 >= 0.67)' > \$op
if [ -s "\$op" ]
then
cat "\$line" >> ../Positive_Strand/$file".cons"
fi
rm \$lne
rm \$op
rm \$fasta
done < $file
EOF
done
I am somehow unable to store the values of the columns from the line (which is in the $line variable) into the $name and $four variables, and hence unable to carry on with the further processing. Also, any suggestions to improve the code would be welcome.
If you change EOF to 'EOF' then you will more properly disable shell interpretation. Your problem is that your back-ticks (`) are not escaped.
I've fixed your indentation and cleaned up some of your code. Note that the syntax highlighting here doesn't understand cat <<'EOF'. If you paste that into vim with highlighting enabled, you'll see that block is all the same color since it's just a string.
bsub_helper() {
cat <<'EOF'
#BSUB -L /bin/bash
#BSUB -W 150:00
#BSUB -M 10000
#BSUB -n 3
#BSUB -e /somefolder/errors/%J.err
#BSUB -o /somefolder/errors/%J.out
while read line; do
name=`cat $line | awk '{print $1":"$2"-"$3}'`
four=`cat $line | awk '{print $4}' | cut -d: -f4`
fasta="$name.fa"
op="$name.rs"
genome="/somefolder/rn4/Rattus_norvegicus/UCSC/rn4/Sequence/WholeGenomeFasta/genome.fa"
echo $name | xargs samtools faidx "$genome" > "$fasta"
Process -F "$fasta" -M "list_$four.txt" -p 0.003 | awk '($5 >= 0.67)' > "$op"
if [ -s "$op" ]
then
cat "$line" >> "../Positive_Strand/$file.cons"
fi
rm "$lne" "$op" "$fasta"
EOF
echo " done < \"$1\""
}
for file in ../Positive/*.txt_rn; do
bsub_helper "$file" |bsub
done
I created a helper function because I needed to get the input in two commands. I am assuming that $file is the only variable in that block that you want interpreted. I also surrounded that variable (among others) with quotes so that the code can support file names with spaces in them. The final line of the helper has nested double quotes for this reason.
I left your echo $name | xargs … line alone because it's so odd. Without quotes around $name, xargs will take each whitespace-separated entry as its own file. With quotes, xargs will only supply one (likely invalid) file name to samtools.
If $name is a single file, try:
samtools faidx "$genome" "$name" > "$fasta"
If $name is multiple files and none of them have spaces, try:
samtools faidx "$genome" $name > "$fasta"
The only reason to use xargs here would be if you have too much content for one command line, but if you're running echo $name | xargs then you'll run into the same problem.
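To illustrate the 'EOF' point, a minimal comparison of quoted vs. unquoted heredoc delimiters:
x=outer
cat <<EOF
unquoted delimiter: x expands to "$x"
EOF
cat <<'EOF'
quoted delimiter: $x and `commands` stay literal
EOF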
