combine GNU parallel with nested for loops and multiple variables - bash

I have n folders in destdir. Each folder contains two files: *R1.fastq and *R2.fastq. The script below runs the job (bowtie2) on each folder one by one and writes {name of the subfolder}.sam into destdir.
#!/bin/bash
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
for f in "$destdir"/*
do
    fbase=$(basename "$f")
    echo "Sample $fbase"
    bowtie2 -p 4 -x "$mm9_index" -X 2000 \
        -1 "$f"/*R1.fastq \
        -2 "$f"/*R2.fastq \
        -S "$destdir/${fbase}.sam"
done
I want to use the GNU parallel tool to speed this up. Can you help? Thanks.

Use a bash function:
#!/bin/bash
my_bowtie() {
    mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
    destdir=/Users/Desktop/test/outdir/
    f="$1"
    fbase=$(basename "$f")
    echo "Sample $fbase"
    bowtie2 -p 4 -x "$mm9_index" -X 2000 \
        -1 "$f"/*R1.fastq \
        -2 "$f"/*R2.fastq \
        -S "$destdir/${fbase}.sam"
}
export -f my_bowtie
parallel my_bowtie ::: "$destdir"/*
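Note that each my_bowtie job already runs bowtie2 with 4 threads (-p 4), so it may be worth capping how many jobs run at once. The -j 2 below is only an illustrative assumption; pick a value that fits your machine:
parallel -j 2 my_bowtie ::: "$destdir"/*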
For more details: man parallel or http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Calling-Bash-functions

At its simplest, you can normally just put echo in front of your commands and pipe the list of commands that you would have run sequentially into GNU Parallel, which then executes them in parallel, like this:
for f in ...; do
echo bowtie2 -p 4 ....
done | parallel
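For example, applied to the bowtie2 loop from the question, a sketch could look like this (it assumes the paths contain no spaces, since each echoed line is re-split by the shell that parallel starts):
for f in "$destdir"/*; do
    fbase=$(basename "$f")
    echo bowtie2 -p 4 -x "$mm9_index" -X 2000 -1 "$f"/*R1.fastq -2 "$f"/*R2.fastq -S "$destdir/$fbase.sam"
done | parallel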

Related

Cannot get bash function call to work inside another bash call with xargs in it [duplicate]

I am trying to use xargs to call a more complex function in parallel.
#!/bin/bash
echo_var(){
echo $1
return 0
}
seq -f "n%04g" 1 100 |xargs -n 1 -P 10 -i echo_var {}
exit 0
This returns the error
xargs: echo_var: No such file or directory
Any ideas on how I can use xargs to accomplish this, or any other solution(s) would be welcome.
Exporting the function should do it (untested):
export -f echo_var
seq -f "n%04g" 1 100 | xargs -n 1 -P 10 -I {} bash -c 'echo_var "$#"' _ {}
You can use the builtin printf instead of the external seq:
printf "n%04g\n" {1..100} | xargs -n 1 -P 10 -I {} bash -c 'echo_var "$#"' _ {}
Also, using return 0 and exit 0 like that masks any error value that might be produced by the command preceding it. And if there is no error, 0 is returned by default anyway, so they are somewhat redundant.
@phobic mentions that the Bash command could be simplified to
bash -c 'echo_var "{}"'
moving the {} directly inside it. But it's vulnerable to command injection as pointed out by @Sasha.
Here is an example of why you should not use the embedded format:
$ echo '$(date)' | xargs -I {} bash -c 'echo_var "{}"'
Sun Aug 18 11:56:45 CDT 2019
Another example of why not:
echo '\"; date\"' | xargs -I {} bash -c 'echo_var "{}"'
This is what is output using the safe format:
$ echo '$(date)' | xargs -I {} bash -c 'echo_var "$@"' _ {}
$(date)
This is comparable to using parameterized SQL queries to avoid injection.
I'm using date in a command substitution or in escaped quotes here instead of the rm command used in @Sasha's comment since it's non-destructive.
Using GNU Parallel, it looks like this:
#!/bin/bash
echo_var(){
echo $1
return 0
}
export -f echo_var
seq -f "n%04g" 1 100 | parallel -P 10 echo_var {}
exit 0
If you use GNU Parallel version 20170822 you do not even have to export -f, as long as you have run this first:
. `which env_parallel.bash`
seq -f "n%04g" 1 100 | env_parallel -P 10 echo_var {}
Something like this should work also:
function testing() { sleep $1 ; }
echo {1..10} | xargs -n 1 | xargs -I@ -P4 bash -c "$(declare -f testing) ; testing @ ; echo @ "
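If you prefer not to substitute the value straight into the command string (the same injection concern discussed above), a variant that passes it as a positional parameter might look like this (an untested sketch):
echo {1..10} | xargs -n 1 | xargs -n 1 -P 4 bash -c "$(declare -f testing); "'testing "$1"; echo "$1"' _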
Maybe this is bad practice, but if you are defining functions in a .bashrc or other script, you can wrap the file, or at least the function definitions, with a setting of allexport:
set -o allexport
function funcy_town {
echo 'this is a function'
}
function func_rock {
echo 'this is a function, but different'
}
function cyber_func {
echo 'this function does important things'
}
function the_man_from_funcle {
echo 'not gonna lie'
}
function funcle_wiggly {
echo "at this point I'm doing it for the funny names"
}
function extreme_function {
echo 'goodbye'
}
set +o allexport
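Once that file has been sourced in the parent shell, the functions are marked for export, so the bash -c children started by xargs or parallel can call them. A small sketch (~/.bashrc stands for whichever file holds the definitions above):
source ~/.bashrc                     # file containing the allexport-wrapped functions
seq 3 | xargs -n 1 -P 3 bash -c 'funcy_town' _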
I was wondering about the focus on
bash -c 'echo_var "$@"' _ {}
vs
bash -c 'echo_var "{}"'
The 1st substitutes the {} as an arg to bash while the 2nd substitutes it as an arg to the function. The fact that example 1 doesn't expand the $(date) is simply a side effect.
If you don't want the function's args expanded, use single quotes rather than double around the substituted {}; to avoid messy nesting, use double quotes for the outer bash -c string (i.e. expand args on the other one). In the examples below, printit stands in for an exported function like echo_var above:
$ echo '$(date)' | xargs -0 -L1 -I {} bash -c 'printit "{}"'
Fri 11 Sep 17:02:24 BST 2020
$ echo '$(date)' | xargs -0 -L1 -I {} bash -c "printit '{}'"
$(date)

Passing args to defined bash functions through GNU parallel

Let me show you a snippet of my Bash script and how I try to run parallel:
parallel -a "$file" \
-k \
-j8 \
--block 100M \
--pipepart \
--bar \
--will-cite \
_fix_col_number {} | _unify_null_value {} >> "$OUTPUT_DIR/$new_filename"
So, I am basically trying to process each line in a file in parallel using Bash functions defined inside my script. However, I am not sure how to pass each line to my defined functions "_fix_col_number" and "_unify_null_value". Whatever I do, nothing gets passed to the functions.
I am exporting the functions like this in my script:
declare -x NUM_OF_COLUMNS
export -f _fix_col_number
export -f _add_tabs
export -f _unify_null_value
The mentioned functions are:
_unify_null_value()
{
    _string=$(echo "$1" | perl -0777 -pe "s/(?<=\t)\.(?=\s)//g" | \
        perl -0777 -pe "s/(?<=\t)NA(?=\s)//g" | \
        perl -0777 -pe "s/(?<=\t)No Info(?=\s)//g")
    echo "$_string"
}

_add_tabs()
{
    _tabs=""
    for (( c=1; c<=$1; c++ ))
    do
        _tabs="$_tabs\t"
    done
    echo -e "$_tabs"
}

_fix_col_number()
{
    line_cols=$(echo "$1" | awk -F"\t" '{ print NF }')
    if [[ $line_cols -gt $NUM_OF_COLUMNS ]]; then
        new_line=$(echo "$1" | cut -f1-"$NUM_OF_COLUMNS")
        echo -e "$new_line\n"
    elif [[ $line_cols -lt $NUM_OF_COLUMNS ]]; then
        missing_columns=$(( NUM_OF_COLUMNS - line_cols ))
        new_line="${1//$'\n'/}$(_add_tabs $missing_columns)"
        echo -e "$new_line\n"
    else
        echo -e "$1"
    fi
}
I tried removing {} from parallel. Not really sure what I am doing wrong.
I see two problems in the invocation plus additional problems with the functions:
With --pipepart there are no arguments. The blocks read from -a file are passed over stdin to your functions. Try the following commands to confirm this:
seq 9 > file
parallel -a file --pipepart echo
parallel -a file --pipepart cat
Theoretically, you could read stdin into a variable and pass that variable to your functions, ...
parallel -a file --pipepart 'b=$(cat); someFunction "$b"'
... but I wouldn't recommend it, especially since your blocks are 100MB each.
Bash interprets the pipe | in your command before parallel even sees it. To run a pipe, quote the entire command:
parallel ... 'b=$(cat); _fix_col_number "$b" | _unify_null_value "$b"' >> ...
_fix_col_number seems to assume its argument to be a single line, but receives 100MB blocks instead.
_unify_null_value does not read stdin, so _fix_col_number {} | _unify_null_value {} is equivalent to _unify_null_value {}.
That being said, your functions can be drastically improved. They start a lot of processes which becomes incredibly expensive for larger files. You can do some trivial improvements like combining perl ... | perl ... | perl ... into a single perl. Likewise, instead of storing everything in variables, you can process stdin directly: Just use f() { cmd1 | cmd2; } instead of f() { var=$(echo "$1" | cmd1); var=$(echo "$var" | cmd2); echo "$var"; }.
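For example, the null-value cleanup could be a single perl reading stdin instead of three perl processes working on a variable. A sketch of the idea, not a drop-in replacement for the original function:
_unify_null_value() {
    # read the block on stdin, blank out ".", "NA" and "No Info" fields, write to stdout
    perl -pe 's/\t\K(\.|NA|No Info)(?=\s)//g'
}
With export -f it could then be called directly, e.g. as parallel -a file --pipepart _unify_null_value.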
However, don't waste time on small things like these. A complete rewrite in sed, awk, or perl is easy and should outperform every optimization of the existing functions.
Try
n="INSERT NUMBER OF COLUMNS HERE"
tabs=$(perl -e "print \"\t\" x $n")
perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" file |
cut -f "1-$n"
If you still find this too slow, leave out file; pack the command into a function, export that function and then call parallel -a file -k --pipepart nameOfTheFunction. The option --block is not necessary as pipepart will evenly split the input based on the number of jobs (can be specified with -j).
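A sketch of that last suggestion, assuming the snippet above is wrapped as-is (fixTable is a made-up name; untested):
n="INSERT NUMBER OF COLUMNS HERE"
tabs=$(perl -e "print \"\t\" x $n")
fixTable() {
    # reads a block on stdin, writes the cleaned, column-fixed block to stdout
    perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" | cut -f "1-$n"
}
export -f fixTable
export n tabs
parallel -a file -k --pipepart fixTable >> "$OUTPUT_DIR/$new_filename"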

Pass arguments to a script that is an argument to a different script

I am new to programming, so please bear with the way I try to explain my problem (also, any help on how to phrase the title more elegantly is welcome).
I have a bash script (say, script1.sh) that takes arguments a, b and c (another script). Essentially, argument c for script1.sh is the name of another script (let's say script2.sh). However, script2.sh takes arguments d, e and f. So my question is: how do I pass the arguments to script1.sh? (For example, ./script1.sh -a 1 -b 2 -c script2.sh -d 3 -e 4 -f 5.)
Sorry in advance if the above does not make sense, not sure how else to phrase it...
You should use "" for that
./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
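Inside script1.sh you can then parse -a and -b normally and treat the value of -c as a command string to run. A hypothetical getopts-based sketch (running the string with bash -c trusts whoever calls the script):
#!/bin/bash
# hypothetical script1.sh
while getopts "a:b:c:" opt; do
    case "$opt" in
        a) a="$OPTARG" ;;
        b) b="$OPTARG" ;;
        c) cmd="$OPTARG" ;;
    esac
done
echo "a=$a b=$b"
bash -c "$cmd"   # runs: script2.sh -d 3 -e 4 -f 5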
Try script1.sh with this code
#!/bin/bash
for arg in "$#"; { # loop through all arguments passed to the script
echo $arg
}
The output will be
$ ./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
-a
1
-b
2
-c
script2.sh -d 3 -e 4 -f 5
But if you run this
#!/bin/bash
for arg in $@; { # no double quotes around $@
echo $arg
}
The output will be
$ ./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
-a
1
-b
2
-c
script2.sh
-d
3
4
-f
5
But there is no -e. Why? Because echo supports an -e argument of its own, so it consumes it instead of printing it.
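If you want every argument printed verbatim, including -e, printf is the safer choice, e.g.:
#!/bin/bash
for arg in "$@"
do
    printf '%s\n' "$arg"   # printf '%s' never treats the argument as an option
done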

parallelizing nested for loop with GNU Parallel

I am working in Bash. I have a series of nested for loops that iteratively look for the presence of three lists of 96 barcode sequences. My goal is to find each unique combination of barcodes; there are 96x96x96 (884,736) possible combinations.
for barcode1 in "${ROUND1_BARCODES[@]}";
do
    grep -B 1 -A 2 "$barcode1" $FASTQ_R > ROUND1_MATCH.fastq
    echo barcode1.is.$barcode1 >> outputLOG
    if [ -s ROUND1_MATCH.fastq ]
    then
        # Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
        for barcode2 in "${ROUND2_BARCODES[@]}";
        do
            grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
            if [ -s ROUND2_MATCH.fastq ]
            then
                # Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step
                for barcode3 in "${ROUND3_BARCODES[@]}";
                do
                    grep -B 1 -A 2 "$barcode3" ./ROUND2_MATCH.fastq | sed '/^--/d' > ROUND3_MATCH.fastq
                    # If matches are found we will write them to an output .fastq file iteratively labelled with an ID number
                    if [ -s ROUND3_MATCH.fastq ]
                    then
                        mv ROUND3_MATCH.fastq results/result.$count.2.fastq
                    fi
                    count=`expr $count + 1`
                done
            fi
        done
    fi
done
This code works and I am able to successfully extract the sequences with each barcode combination. However, I think the speed of this can be improved for working through large files by parallelizing this loop structure. I know that I can use GNU parallel to do this; however, I am struggling to nest the parallelizations.
# Parallelize nested loops
now=$(date +"%T")
echo "Beginning STEP1.2: PARALLEL Demultiplex using barcodes. Current
time : $now" >> outputLOG
mkdir ROUND1_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} SRR6750041_2_smalltest.fastq > ROUND1_PARALLEL_HITS/{#}_ROUND1_MATCH.fastq' ::: "${ROUND1_BARCODES[#]}"
mkdir ROUND2_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND1_PARALLEL_HITS/*.fastq > ROUND2_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND2_BARCODES[#]}"
mkdir ROUND3_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND2_PARALLEL_HITS/*.fastq > ROUND3_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND3_BARCODES[#]}"
mkdir parallel_results
parallel -j 6 'mv {} parallel_results/result_{#}.fastq' ::: ROUND3_PARALLEL_HITS/*.fastq
How can I successfully recreate the nested structure of the for loops using parallel?
This parallelizes only the inner loop:
for barcode1 in "${ROUND1_BARCODES[@]}";
do
    grep -B 1 -A 2 "$barcode1" $FASTQ_R > ROUND1_MATCH.fastq
    echo barcode1.is.$barcode1 >> outputLOG
    if [ -s ROUND1_MATCH.fastq ]
    then
        # Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
        for barcode2 in "${ROUND2_BARCODES[@]}";
        do
            grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
            if [ -s ROUND2_MATCH.fastq ]
            then
                # Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step
                doit() {
                    grep -B 1 -A 2 "$1" ./ROUND2_MATCH.fastq | sed '/^--/d'
                }
                export -f doit
                parallel -j0 doit {} '>' results/$barcode1-$barcode2-{} ::: "${ROUND3_BARCODES[@]}"
                # TODO remove files with 0 length
            fi
        done
    fi
done
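If you would rather parallelize over all three barcode lists at once, GNU Parallel generates every combination itself when given several ::: inputs. A rough, untested sketch of that idea (it re-greps the full FASTQ for every combination, so it trades repeated work for a flat job list, and empty result files would still need to be removed afterwards):
doit3() {
    # $1, $2, $3: one barcode from each round; prints reads matching all three
    grep -B 1 -A 2 "$1" "$FASTQ_R" \
        | grep -B 1 -A 2 "$2" \
        | grep -B 1 -A 2 "$3" \
        | sed '/^--/d'
}
export -f doit3
export FASTQ_R
mkdir -p results
parallel doit3 {1} {2} {3} '>' results/{1}-{2}-{3}.fastq \
    ::: "${ROUND1_BARCODES[@]}" ::: "${ROUND2_BARCODES[@]}" ::: "${ROUND3_BARCODES[@]}"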

