I want to execute the following command:
for i in {0-months,3-months,6-months,9-months,12-months,EC1,EC2_CZ,EC2,EC3}; do freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/${i}.vcf ${i}.sort.grp.bam; done
But these tasks are independent of each other and can be run in parallel. I was wondering if there is a way to do this with GNU parallel.
Usually when using parallel, I would have a file listing all commands needed to run, in this case it would look like this:
freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/0-months.vcf 0-months.sort.grp.bam
freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/3-months.vcf 3-months.sort.grp.bam
freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/6-months.vcf 6-months.sort.grp.bam
freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/9-months.vcf 9-months.sort.grp.bam
freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/12-months.vcf 12-months.sort.grp.bam
freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/EC1.vcf EC1.sort.grp.bam
freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/EC2_CZ.vcf EC2_CZ.sort.grp.bam
freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/EC2.vcf EC2.sort.grp.bam
freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/EC3.vcf EC3.sort.grp.bam
So when the file is ready, I could simply run:
parallel -j 4 -a FILE
But that requires writing the commands out to a file and then invoking parallel; there must be a simpler way.
This seems to work:
parallel -j 4 -a \
<(for i in {0-months,3-months,6-months,9-months,12-months,EC1,EC2_CZ,EC2,EC3}; do echo "freebayes --fasta-reference ../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous --min-coverage 10 -F 0.01 -C 2 --vcf vcf/${i}.vcf ${i}.sort.grp.bam"; done)
But it just looks silly... any easier way to do this?
Thanks!
I am really puzzled how you came up with your extremely convoluted (but working) way to do this:
parallel -j 4 freebayes --fasta-reference \
../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous \
--min-coverage 10 -F 0.01 -C 2 --vcf vcf/{}.vcf {}.sort.grp.bam \
::: 0-months 3-months 6-months 9-months 12-months EC1 EC2_CZ EC2 EC3
If these are all the vcf files in the vcf-dir and it is a 4-core machine, you can even do:
parallel freebayes --fasta-reference \
../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous \
--min-coverage 10 -F 0.01 -C 2 --vcf {} {/.}.sort.grp.bam \
::: vcf/*.vcf
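If you want to see what GNU Parallel would run before spending CPU time on freebayes, --dry-run prints the composed commands without executing anything (same replacement strings as above):
parallel --dry-run freebayes --fasta-reference \
../Genome/ECIII_Lemming_assembly_masked.fasta --pooled-continuous \
--min-coverage 10 -F 0.01 -C 2 --vcf vcf/{}.vcf {}.sort.grp.bam \
::: 0-months 3-months 6-months 9-months 12-months EC1 EC2_CZ EC2 EC3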
Have you walked through the tutorial? man parallel_tutorial
Have you watched the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Have you looked at the examples: LESS=+/EXAMPLE: man parallel
#!/bin/bash
doone() {
    tracelength="$1"
    short="$2"
    long="$3"
    ratio="$4"
    echo "$tracelength $short $long $ratio" >> results.csv
    python3 main.py "$tracelength" "$short" "$long" "$ratio" >> file.smt2
    gtime -f "%U" /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < file.smt2
}
export -f doone
step=0.1
parallel doone \
::: 200 300 \
:::: <(seq 0 $step 0.2) \
::::+ <(seq 1 -$step 0.8) \
:::: <(seq 0 $step 0.1) \
::: {1..2} &> results.csv
I need the data in results.csv to be in order. Every job prints its inputs, which are the four variables mentioned at the beginning ($tracelength, $short, $long and $ratio), and then the associated execution time of that job, all on one line. So far my results look something like this:
0.00
0.00
0.00
0.00
200 0 1 0
200 0 1 0.1
200 0.1 0.9 0
How can I fix the order? And why is the execution time always 0.00? file.smt2 is a big file; there is no way the execution time can be 0.00.
It is really a bad idea to append to the same file in parallel. You are going to have race conditions all over the place.
You are doing that with both results.csv and file.smt2.
So if you write to a file in doone, make sure it has a unique name (e.g. by using myfile.$$).
To see if race conditions are your problem, you can make GNU Parallel run one job at a time: parallel --jobs 1.
If the problem goes away by that, then you can probably get away with:
doone() {
    tracelength="$1"
    short="$2"
    long="$3"
    ratio="$4"
    # No >> is needed here, as all output is sent to results.csv
    echo "$tracelength $short $long $ratio"
    tmpfile=file.smt.$$
    cp file.smt2 $tmpfile
    python3 main.py "$tracelength" "$short" "$long" "$ratio" >> $tmpfile
    # Be aware that the output from gtime and optimathsat will be put into results.csv - making results.csv not a CSV-file
    gtime -f "%U" /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < $tmpfile
    rm $tmpfile
}
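As a side note, if relying on $$ for uniqueness feels fragile, mktemp is a standard alternative; a minimal sketch of the same doone body (same paths as above):
tmpfile=$(mktemp file.smt.XXXXXX)   # mktemp creates and names a guaranteed-unique file
cp file.smt2 "$tmpfile"
python3 main.py "$tracelength" "$short" "$long" "$ratio" >> "$tmpfile"
gtime -f "%U" /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < "$tmpfile"
rm "$tmpfile"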
If results.csv is just a log file, consider using parallel --joblog my.log instead.
If the problem does not go away by that, then your problem is elsewhere. In that case make an MCVE (https://stackoverflow.com/help/mcve): Your example is not complete as you refer to file.smt2 and optimathsat without providing them, thus we cannot run your example.
#!/bin/bash
for tracelength in 10 20 50 100 ; do
step=0.1
short=0
long=1
for firstloop in {1..10}; do
ratio=0
for secondloop in {1..10} ; do
for repeat in {1..20} ; do
echo $tracelength $short $long $ratio >results.csv
python3 main.py "$tracelength" "$short" "$long" "$ratio" > file.smt2
/usr/bin/time /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < file.smt2 > results.csv
done
ratio=$(echo "scale=10; $ratio + $step" | bc)
done
short=$(echo "scale=10; $short + $step" | bc)
long=$(echo "scale=10; $long - $step" | bc)
done
done
I want to parallelize the inner loop (repeat).
I have installed GNU parallel and I know some of the basics, but because the loop body has more than one command I have no idea how to parallelize it.
I transferred the content of the loop to another script, which I guess is the way to go, but my three commands need to take the variables (tracelength, ratio, short, long) and run according to them. Any idea how to pass the parameters from a script to a subscript? Or do you have a better idea for parallelizing this?
I am editing the question because I used the answer below, but now my execution time is always 0.00 regardless of how big file.smt2 is.
This is the new version of the code:
#!/bin/bash
doone() {
    tracelength="$1"
    short="$2"
    long="$3"
    ratio="$4"
    #echo "$tracelength $short $long $ratio" >> results.csv
    python3 main.py "$tracelength" "$short" "$long" "$ratio" >> file.smt2
    gtime -f "%U" /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < file.smt2
}
export -f doone
step=0.2
parallel doone \
::: 200 300 \
:::: <(seq 0 $step 0.5) \
::::+ <(seq 1 -$step 0.5) \
:::: <(seq 0 $step 0.5) \
::: {1..5} &> results.csv
In your original code you overwrite results.csv again and again. I assume that is a mistake and that you instead want it combined into one big CSV file:
doone() {
    tracelength="$1"
    short="$2"
    long="$3"
    ratio="$4"
    echo "$tracelength $short $long $ratio"
    python3 main.py "$tracelength" "$short" "$long" "$ratio" |
        /usr/bin/time /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat
}
export -f doone
step=0.1
parallel doone \
::: 10 20 50 100 \
:::: <(seq 0 $step 0.9) \
::::+ <(seq 1 -$step 0.1) \
:::: <(seq 0 $step 0.9) \
::: {1..20} > results.csv
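A note on the ::::+ in the middle: it pairs the long values one-to-one with the short values instead of crossing them, which is what keeps short + long = 1 for every job. The difference in miniature (a toy sketch with literal values):
parallel echo ::: a b :::+ 1 2   # paired:  a 1, b 2
parallel echo ::: a b ::: 1 2    # crossed: a 1, a 2, b 1, b 2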
If you want a CSV file per run:
parallel --results outputdir/ doone \
::: 10 20 50 100 \
:::: <(seq 0 $step 0.9) \
::::+ <(seq 1 -$step 0.1) \
:::: <(seq 0 $step 0.9) \
::: {1..20}
If you want a single CSV file containing the arguments and the run time, use:
parallel --results output.csv doone \
::: 10 20 50 100 \
:::: <(seq 0 $step 0.9) \
::::+ <(seq 1 -$step 0.1) \
:::: <(seq 0 $step 0.9) \
::: {1..20}
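When the name given to --results ends in .csv, GNU Parallel writes one row per job; note that this variant needs the Perl module Text::CSV installed. A toy run to see the shape of the file (echo stands in for doone here):
parallel --results output.csv echo ::: 10 20
# output.csv should contain columns such as Seq, Host, Starttime, JobRuntime, the arguments (V1) and Stdout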
Assuming we have a CSV file A.csv:
1
2
3
4
Here is the code:
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
The above is a simplified case, but it captures the bulk of the workflow. The parallel here will run like this:
parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
Now assume 3.sh failed for some reason. Is there an easy way to rerun the failed 3.sh within the same parallel command in the current shell script? I have tried the following, but it doesn't work and is quite lengthy.
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
# The above will do this:
# parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --resume-failed --joblog test.log --jobs 2 -k sh ::: {}
# The above will do this:
# parallel --resume-failed --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
######## 2017-09-25
Thanks Ole. I have tried the following:
doit() {
    myarg="$1"
    if [ $myarg -eq 3 ]
    then
        exit 1
    else
        echo do crazy stuff with "$myarg"
    fi
}
export -f doit
parallel -k --retries 3 --joblog ole.log doit :::: A.csv
It returns the log file like this:
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1506362303.003 0.016 0 22 0 0 doit 1
2 : 1506362303.006 0.013 0 22 0 0 doit 2
3 : 1506362303.026 0.002 0 0 1 0 doit 3
4 : 1506362303.014 0.006 0 22 0 0 doit 4
However, I don't see doit 3 being retried 3 times as expected. Could you enlighten me? Thanks.
First: Generating .sh files seems like a bad idea. You can most likely just make a function instead:
doit() {
    myarg="$1"
    echo do crazy stuff with "$myarg"
}
export -f doit
To retry a failing command use --retries:
parallel --retries 3 doit :::: file.csv
If your CSV-file has multiple columns, use --colsep:
parallel --retries 3 --colsep '\t' doit :::: file.csv
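For example, with a hypothetical two-column, tab-separated args.tsv (the names doit2 and args.tsv are just for illustration), each column arrives in the function as its own positional parameter:
doit2() { echo "first=$1 second=$2"; }
export -f doit2
printf '1\ta\n2\tb\n' > args.tsv
parallel --colsep '\t' doit2 :::: args.tsv
# runs: doit2 1 a    and    doit2 2 b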
Using this:
doit() {
    myarg="$1"
    if [ $myarg -eq 3 ] ; then
        echo do not do crazy stuff with "$myarg"
        exit 1
    else
        echo do crazy stuff with "$myarg"
    fi
}
export -f doit
This will retry the '3' job 3 times:
parallel -k --retries 3 --joblog ole.log doit ::: 1 2 3 4
The joblog will only record the last attempt. To be convinced this actually runs three times, run with the output unbuffered:
parallel -u --retries 3 --joblog ole.log doit ::: 1 2 3 4
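With -u the output arrives as the jobs produce it, so the order is not guaranteed, but given the function above the line from the failing '3' job should appear three times, roughly like:
do crazy stuff with 1
do crazy stuff with 2
do not do crazy stuff with 3
do not do crazy stuff with 3
do not do crazy stuff with 3
do crazy stuff with 4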
I have a .ini file, with one of the options being:
[LDdecay]
;determine if examining LD decay for SNPs
;"0" is false (no), "1" is true for GBS only, "2" is true for SoySNP50K only, "3" is true for Merge only, "4" is true for all (GBS, SoySNP50K and Merge)
decay=1
I am continuously getting the following error for lines 69, 72, 80, 88 and 96 (essentially anywhere there is an if or elif statement):
[: : integer expression expected
I'm clearly overlooking something, as I have worked with if, elif and else successfully in the past, so any help catching the glitch would be greatly appreciated.
Thanks
#!/bin/bash
#################################################
# #
# A base directory must be created and #
# the config file must be placed in there #
# #
#################################################
#check if config file was supplied
if [ $# -ne 1 ]; then #if the number of arguments is not 1
    echo -e "Usage: run_pipeline_cfg.sh <configuration file>" #print error message
    exit 1 #stop script execution
fi
#Parse the info from the ini file
ini=$(readlink -m "$1")
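#grep keeps only the key=value lines, then the sed calls strip spaces around '=',
#drop ;-comments, trim surrounding whitespace, and double-quote the values so the
#result can be sourced as plain shell assignments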
source <(grep = "$ini" | \
sed -e "s/[[:space:]]*\=[[:space:]]*/=/g" \
-e "s/;.*$//" \
-e "s/[[:space:]]*$//" \
-e "s/^[[:space:]]*//" \
-e "s/^\(.*\)=\([^\"']*\)$/\1=\"\2\"/")
#ini="/home/directory/Initiator.ini" #debug
#Retrieve the base directory path
baseDir=$(dirname "$ini")
#Create required directory structure
logs="$baseDir/logs"
LDdecay="$baseDir/LDdecay"
imputed="$baseDir/imputed"
#don't create if it already exists
[[ -d "$logs" ]] || mkdir "$logs"
[[ -d "$LDdecay" ]] || mkdir "$LDdecay"
[[ -d "$imputed" ]] || mkdir "$imputed"
#find imputed vcf files
if [ -e "$imputed" ] ; then
    echo -e "Folder with imputed vcf files exists. Determining if calculating LD decay"
else
    echo -e "Folder with imputed vcf files does not exist. Cannot calculate LD decay."
    exit 1
fi
#######################################################
# #
# Create LD decay files for LD decay plots #
# #
#######################################################
#determine on which files to perform LD decay calculations
#"0" is false (no), "1" is true for GBS only, "2" is true for Microarray only, "3" is true for Integrated only, "4" is true for all (GBS, Microarray and Integrated)
if [ "$decay" -eq 0 ]; then
printf "LD decay rates not calculated" | tee -a $logs/log.txt;
elif [ "$decay" -eq 1 ]; then
#perform LD decay calculation for GBS only
zcat $imputed/GBS_MAF0.01.vcf.gz > $LDdecay/GBS_MAF0.01.vcf
plink --vcf $LDdecay/GBS_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/GBS_LD_decay
rm $LDdecay/GBS_MAF0.01.vcf
printf "LD decay for GBS dataset completed" | tee -a $logs/log.txt
elif [ "$decay" -eq 2 ]; then
#perform LD decay calculation for microarray only
zcat $imputed/microarray_MAF0.01.vcf.gz > $LDdecay/microarray_MAF0.01.vcf
plink --vcf $LDdecay/microarray_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/microarray_LD_decay
rm $LDdecay/microarray_MAF0.01.vcf
printf "LD decay for microarray dataset completed" | tee -a $logs/log.txt
elif [ "$decay" -eq 3 ]; then
#perform LD decay for Merged dataset only
zcat $imputed/Integrated_MAF0.01_sorted.vcf.gz > $LDdecay/Integrated_MAF0.01.vcf
plink --vcf $LDdecay/Integrated_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/Integrated_LD_decay
rm $LDdecay/Integrated_MAF0.01.vcf
printf "LD decay calculation for Integrated dataset complete" | tee -a $logs/log.txt
elif [ "$decay" -eq 4 ]; then
#perform LD decay for Merged, GBS and SoySNP50K datasets
zcat $imputed/Integrated_MAF0.01_sorted.vcf.gz > $LDdecay/Integrated_MAF0.01.vcf
plink --vcf $LDdecay/Integrated_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/Integrated_LD_decay
rm $LDdecay/Integrated_MAF0.01.vcf
zcat $imputed/microarray_MAF0.01.vcf.gz > $LDdecay/microarray_MAF0.01.vcf
plink --vcf $LDdecay/microarray_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/microarray_LD_decay
rm $LDdecay/microarray_MAF0.01.vcf
zcat $imputed/snp_imputed_GBS_MAF0.01.vcf.gz > $LDdecay/GBS_MAF0.01.vcf
plink --vcf $LDdecay/GBS_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/GBS_LD_decay
rm $LDdecay/GBS_MAF0.01.vcf
printf "LD decay calculation completed for Integrated, GBS and SoySNP50K datasets completed" | tee -a $logs/log.txt
else
echo "Wrong LD Decay calculation setup"
exit 1
fi
Consider changing the elif chain to a case statement. It side-steps the need for the tests to have integer arguments, and it is both clearer and easier to extend:
case $decay in
    0)
        printf "LD decay rates not calculated" | tee -a $logs/log.txt
        ;;
    1)
        #perform LD decay calculation for GBS only
        zcat $imputed/GBS_MAF0.01.vcf.gz > $LDdecay/GBS_MAF0.01.vcf
        plink --vcf $LDdecay/GBS_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/GBS_LD_decay
        rm $LDdecay/GBS_MAF0.01.vcf
        printf "LD decay for GBS dataset completed" | tee -a $logs/log.txt
        ;;
    2)
        #perform LD decay calculation for microarray only
        zcat $imputed/microarray_MAF0.01.vcf.gz > $LDdecay/microarray_MAF0.01.vcf
        plink --vcf $LDdecay/microarray_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/microarray_LD_decay
        rm $LDdecay/microarray_MAF0.01.vcf
        printf "LD decay for microarray dataset completed" | tee -a $logs/log.txt
        ;;
    3)
        #perform LD decay for Merged dataset only
        zcat $imputed/Integrated_MAF0.01_sorted.vcf.gz > $LDdecay/Integrated_MAF0.01.vcf
        plink --vcf $LDdecay/Integrated_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/Integrated_LD_decay
        rm $LDdecay/Integrated_MAF0.01.vcf
        printf "LD decay calculation for Integrated dataset complete" | tee -a $logs/log.txt
        ;;
    4)
        #perform LD decay for Merged, GBS and SoySNP50K datasets
        zcat $imputed/Integrated_MAF0.01_sorted.vcf.gz > $LDdecay/Integrated_MAF0.01.vcf
        plink --vcf $LDdecay/Integrated_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/Integrated_LD_decay
        rm $LDdecay/Integrated_MAF0.01.vcf
        zcat $imputed/microarray_MAF0.01.vcf.gz > $LDdecay/microarray_MAF0.01.vcf
        plink --vcf $LDdecay/microarray_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/microarray_LD_decay
        rm $LDdecay/microarray_MAF0.01.vcf
        zcat $imputed/snp_imputed_GBS_MAF0.01.vcf.gz > $LDdecay/GBS_MAF0.01.vcf
        plink --vcf $LDdecay/GBS_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/GBS_LD_decay
        rm $LDdecay/GBS_MAF0.01.vcf
        printf "LD decay calculation completed for Integrated, GBS and SoySNP50K datasets completed" | tee -a $logs/log.txt
        ;;
    *)
        echo "Wrong LD Decay calculation setup"
        exit 1
        ;;
esac
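As an aside, the error message itself is a clue: bash prints '[: : integer expression expected' when the variable under test is empty, so $decay is most likely never being set by the ini parsing. A minimal way to reproduce the message in isolation:
decay=""
[ "$decay" -eq 0 ]
# bash: [: : integer expression expected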
Here's what my makefile looks like:
.PHONY: all clean
all : 1wchaotic.png
1wchaotic.png : plot-script1.gp duffing stroboscopic points1.data
	gnuplot "$<" > "$@"
points1.data : duffing
	./duffing -a 1 -b 1 -w 1 -u 0.1 -A 35.5 -t 1000 | ./stroboscopic > points1.data
duffing : duffing.c Makefile
	make
clean :
	rm *~ *.png *.data
This is what I get when running the makefile:
user@ubuntu:~/HW$ make -f makefile2
./duffing -a 1 -b 1 -w 1 -u 0.1 -A 35.5 -t 1000 | ./stroboscopic > points1.data
make: *** [points1.data] Error 255