parallelizing a part of a shell script - shell

#!/bin/bash
for tracelength in 10 20 50 100 ; do
    step=0.1
    short=0
    long=1
    for firstloop in {1..10}; do
        ratio=0
        for secondloop in {1..10} ; do
            for repeat in {1..20} ; do
                echo $tracelength $short $long $ratio >results.csv
                python3 main.py "$tracelength" "$short" "$long" "$ratio" > file.smt2
                /usr/bin/time /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < file.smt2 > results.csv
            done
            ratio=$(echo "scale=10; $ratio + $step" | bc)
        done
        short=$(echo "scale=10; $short + $step" | bc)
        long=$(echo "scale=10; $long - $step" | bc)
    done
done
I want to parallelize the innermost loop (repeat).
I have installed GNU parallel and I know some of the basics, but because the loop body has more than one command I have no idea how to parallelize it.
I transferred the content of the loop to another script, which I guess is the way to go, but my 3 commands need to take the variables (tracelength, ratio, short, long) and run according to them. Any idea how to pass the parameters from a script to a subscript? Or do you maybe have a better idea for parallelization?
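For illustration, this is roughly what I mean by a subscript (runone.sh is just a placeholder name; the values would arrive as positional parameters $1..$4):
#!/bin/bash
# runone.sh -- expects: tracelength short long ratio
tracelength="$1"; short="$2"; long="$3"; ratio="$4"
python3 main.py "$tracelength" "$short" "$long" "$ratio" > "file.$$.smt2"
# and it would be called from the loop as:
#   runone.sh "$tracelength" "$short" "$long" "$ratio"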
I am editing the question because I used the answer below, but now my execution time is always 0.00 regardless of how big file.smt2 is.
this is the new version of code:
#!/bin/bash
doone() {
    tracelength="$1"
    short="$2"
    long="$3"
    ratio="$4"
    #echo "$tracelength $short $long $ratio" >> results.csv
    python3 main.py "$tracelength" "$short" "$long" "$ratio" >> file.smt2
    gtime -f "%U" /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < file.smt2
}
export -f doone
step=0.2
parallel doone \
    ::: 200 300 \
    :::: <(seq 0 $step 0.5) \
    ::::+ <(seq 1 -$step 0.5) \
    :::: <(seq 0 $step 0.5) \
    ::: {1..5} &> results.csv

In your original code you overwrite results.csv again and again. I assume that is a mistake and that you instead want everything combined into one big CSV file:
doone() {
    tracelength="$1"
    short="$2"
    long="$3"
    ratio="$4"
    echo "$tracelength $short $long $ratio"
    python3 main.py "$tracelength" "$short" "$long" "$ratio" |
        /usr/bin/time /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat
}
export -f doone
step=0.1
parallel doone \
    ::: 10 20 50 100 \
    :::: <(seq 0 $step 0.9) \
    ::::+ <(seq 1 -$step 0.1) \
    :::: <(seq 0 $step 0.9) \
    ::: {1..20} > results.csv
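A side note on the syntax: ::::+ links its input source to the previous one, pairing values line by line instead of taking the full Cartesian product; that is what keeps short and long moving in lockstep. A tiny sketch:
parallel -k echo :::: <(seq 0 0.1 0.2) ::::+ <(seq 1 -0.1 0.8)
# prints the linked pairs:
# 0.0 1.0
# 0.1 0.9
# 0.2 0.8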
If you want a separate output file per run:
parallel --results outputdir/ doone \
    ::: 10 20 50 100 \
    :::: <(seq 0 $step 0.9) \
    ::::+ <(seq 1 -$step 0.1) \
    :::: <(seq 0 $step 0.9) \
    ::: {1..20}
If you want a csv file containing the arguments and run time use:
parallel --results output.csv doone \
    ::: 10 20 50 100 \
    :::: <(seq 0 $step 0.9) \
    ::::+ <(seq 1 -$step 0.1) \
    :::: <(seq 0 $step 0.9) \
    ::: {1..20}
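Because the name ends in .csv, --results writes one row per job with the arguments and timing as columns (roughly Seq, Host, Starttime, JobRuntime, Exitval, the input values, Stdout and Stderr; the exact set depends on your GNU parallel version). A quick way to check the layout on your installation:
head -1 output.csv   # shows the header row written by --results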

parallel execution with a fixed order

#!/bin/bash
doone() {
    tracelength="$1"
    short="$2"
    long="$3"
    ratio="$4"
    echo "$tracelength $short $long $ratio" >> results.csv
    python3 main.py "$tracelength" "$short" "$long" "$ratio" >> file.smt2
    gtime -f "%U" /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < file.smt2
}
export -f doone
step=0.1
parallel doone \
    ::: 200 300 \
    :::: <(seq 0 $step 0.2) \
    ::::+ <(seq 1 -$step 0.8) \
    :::: <(seq 0 $step 0.1) \
    ::: {1..2} &> results.csv
I need the data in results.csv to be in order. Every job prints its inputs, the four variables mentioned at the beginning ($tracelength, $short, $long and $ratio), and then the associated execution time of that job, all on one line. So far my results look something like this:
0.00
0.00
0.00
0.00
200 0 1 0
200 0 1 0.1
200 0.1 0.9 0
How can I fix the order? And why is the execution time always 0.00? file.smt2 is a big file, and in no way can the execution time be 0.00.
It is really a bad idea to append to the same file in parallel: you are going to have race conditions all over the place.
You are doing that with both results.csv and file.smt2.
So if you write to a file in doone, make sure it has a unique name (e.g. by using myfile.$$).
(As for the ordering part: parallel -k / --keep-order makes the jobs' output appear in the order of the arguments, even though the jobs run in parallel.)
To see whether race conditions are your problem, you can make GNU Parallel run one job at a time: parallel --jobs 1.
If the problem goes away, then you can probably get away with:
doone() {
    tracelength="$1"
    short="$2"
    long="$3"
    ratio="$4"
    # No >> is needed here, as all output is sent to results.csv
    echo "$tracelength $short $long $ratio"
    tmpfile=file.smt.$$
    cp file.smt2 $tmpfile
    python3 main.py "$tracelength" "$short" "$long" "$ratio" >> $tmpfile
    # Be aware that the output from gtime and optimathsat will be put into results.csv - making results.csv not a CSV-file
    gtime -f "%U" /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < $tmpfile
    rm $tmpfile
}
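If you want something more robust than $$ (e.g. against PID reuse), the same can be done with mktemp; a sketch of how the middle of doone would look:
tmpfile=$(mktemp file.smt.XXXXXX)   # mktemp creates the file and prints its unique name
cp file.smt2 "$tmpfile"
python3 main.py "$tracelength" "$short" "$long" "$ratio" >> "$tmpfile"
gtime -f "%U" /Users/Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < "$tmpfile"
rm "$tmpfile"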
If results.csv is just a log file, consider using parallel --joblog my.log instead.
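A sketch of what that looks like (my.log is a name of your choosing; it gets one line per job with runtime and exit status):
parallel --joblog my.log doone ::: 200 ::: 0 ::: 1 ::: 0
# my.log columns: Seq Host Starttime JobRuntime Send Receive Exitval Signal Command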
If the problem does not go away, then your problem is elsewhere. In that case make an MCVE (https://stackoverflow.com/help/mcve): your example is not complete, as you refer to file.smt2 and optimathsat without providing them, so we cannot run it.

Bash - assign variables to yad values - sed usage in for loop

In the code below I am attempting to assign variables to the two yad values, Radius and Amount.
This could be done with awk by printing the yad values to a file, but I want to avoid that if I can.
The string (that is, both yad values) is assigned to a variable and trimmed of characters, as required, using sed. However, the script stops at this line:
radius=$(sed 's|[amount*,]||g')
Two questions:
1. Is there a better way of tackling this?
2. Why is the script not completing? I have not been able to figure out the syntax.
EDIT: I don't need the loop, and I am still working on the sed syntax.
#!/bin/bash
#ifs.sh
values=`yad --form --center --width=300 --title="Test" --separator=' ' \
    --button=Skip:1 \
    --button=Apply:0 \
    --field="Radius":NUM \
    '0!0..30!1!0' \
    --field="Amount":NUM \
    '0!0..5!0.01!2'`
radius=$(echo "$values" | sed 's|[amount*,]||g')
amount=$(echo "$values" | sed 's/.a://')
if [ $? = 1 ]; then
    echo " " >/dev/null 2>&1
else
    echo "Radius = $radius"
    echo "Amount = $amount"
fi
exit
Alternatives
# with separator
# radius="${values%????????}"
# amount="${values#????????}"
# without separator
# radius=$(echo "$values" | sed s'/........$//')
# amount=$(echo "$values" | sed 's/^........//')
It's easier than you think. (As to why your script stops: radius=$(sed 's|[amount*,]||g') gives sed no input at all, so it just sits there waiting to read stdin.)
$ values=( $(echo '7.000000 0.100000 ') )
$ echo "${values[0]}"
7.000000
$ echo "${values[1]}"
0.100000
Replace $(echo '7.000000 0.100000 ') with yad ... so the script would be:
values=( $(yad --form --center --width=300 --title="Test" --separator=' ' \
    --button=Skip:1 \
    --button=Apply:0 \
    --field="Radius":NUM \
    '0!0..30!1!0' \
    --field="Amount":NUM \
    '0!0..5!0.01!2') )
if [ $? -eq 0 ]; then
    echo "Radius = ${values[0]}"
    echo "Amount = ${values[1]}"
fi
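The same splitting can also be done without an array by using read (a sketch, reusing the echo stand-in from above in place of the yad call):
values=$(echo '7.000000 0.100000')   # stand-in for the yad invocation
read -r radius amount <<< "$values"
echo "Radius = $radius"              # Radius = 7.000000
echo "Amount = $amount"              # Amount = 0.100000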
EDIT: Changed answer based on @Ed Morton's comment.
#!/bin/bash
#ifs.sh
values=($(yad --form --center --width=300 --title="Test" --separator=' ' \
    --button=Skip:1 \
    --button=Apply:0 \
    --field="Radius":NUM \
    '0!0..30!1!0' \
    --field="Amount":NUM \
    '0!0..5!0.01!2'))
if [ $? -eq 0 ]; then
    radius="${values[0]}"
    amount="${values[1]}"
fi
exit
bash -x output:
+ '[' 0 -eq 0 ']'
+ radius=7.000000
+ amount=1.000000
+ exit
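Worth noting: the array is populated by word splitting of the unquoted command substitution, which is why the single-space --separator matters here. A tiny illustration:
s='7.000000 0.100000'
arr=( $s )            # unquoted: splits on whitespace into two elements
echo "${#arr[@]}"     # prints 2
arr2=( "$s" )         # quoted: no splitting, one element
echo "${#arr2[@]}"    # prints 1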

gnu parallel re-run when it fails with a while loop

Assuming we have a CSV file:
1
2
3
4
Here is the code:
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
The above is a simplified case, but it captures the essentials. The parallel call here will effectively run:
parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
Now assume 3.sh failed for some reason. Is there an easy way to rerun the failed 3.sh within the same parallel command in the current shell script? I have tried the following, but it doesn't work and is quite lengthy.
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
# The above will do this:
# parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --resume-failed --joblog test.log --jobs 2 -k sh ::: {}
# The above will do this:
# parallel --resume-failed --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
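(For context, --resume-failed is meant as a two-pass pattern: the first run records each job's exit status in the joblog, and an identical second invocation re-runs only the failed entries, so the joblog file must be kept between runs. A sketch:)
parallel --joblog test.log sh ::: 1.sh 2.sh 3.sh 4.sh                   # pass 1: 3.sh fails, exit status logged
parallel --resume-failed --joblog test.log sh ::: 1.sh 2.sh 3.sh 4.sh   # pass 2: re-runs only 3.sh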
######## 2017-09-25
Thanks Ole. I have tried the following
doit() {
    myarg="$1"
    if [ $myarg -eq 3 ]; then
        exit 1
    else
        echo do crazy stuff with "$myarg"
    fi
}
export -f doit
parallel -k --retries 3 --joblog ole.log doit :::: A.csv
It returns the log file like this:
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1506362303.003 0.016 0 22 0 0 doit 1
2 : 1506362303.006 0.013 0 22 0 0 doit 2
3 : 1506362303.026 0.002 0 0 1 0 doit 3
4 : 1506362303.014 0.006 0 22 0 0 doit 4
However, I don't see doit 3 being retried 3 times as expected. Could you enlighten me? Thanks.
First: Generating .sh files seems like a bad idea. You can most likely just make a function instead:
doit() {
    myarg="$1"
    echo do crazy stuff with "$myarg"
}
export -f doit
To retry a failing command use --retries:
parallel --retries 3 doit :::: file.csv
If your CSV-file has multiple columns, use --colsep:
parallel --retries 3 --colsep '\t' doit :::: file.csv
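For example, with a (hypothetical) two-column tab-separated file, each column arrives as its own positional parameter:
printf '1\ta\n2\tb\n' > two.csv       # hypothetical input file
doit2() { echo "col1=$1 col2=$2"; }   # hypothetical two-column variant of doit
export -f doit2
parallel -k --colsep '\t' doit2 :::: two.csv
# col1=1 col2=a
# col1=2 col2=b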
Using this:
doit() {
    myarg="$1"
    if [ $myarg -eq 3 ] ; then
        echo do not do crazy stuff with "$myarg"
        exit 1
    else
        echo do crazy stuff with "$myarg"
    fi
}
export -f doit
This will retry the '3' job 3 times:
parallel -k --retries 3 --joblog ole.log doit ::: 1 2 3 4
It will only log the last attempt. To convince yourself that it actually runs three times, run the output unbuffered:
parallel -u --retries 3 --joblog ole.log doit ::: 1 2 3 4

if elif else multiple options error

I have a .ini file, with one of the options being:
[LDdecay]
;determine if examining LD decay for SNPs
;"0" is false (no), "1" is true for GBS only, "2" is true for SoySNP50K only, "3" is true for Merge only, "4" is true for all (GBS, SoySNP50K and Merge)
decay=1
I am continuously getting the following error for lines 69, 72, 80, 88 and 96 (essentially anywhere there is an if or elif statement):
[: : integer expression expected
I'm clearly overlooking something, as I have worked with if, elif and else successfully in the past, so anyone who can catch the glitch would be greatly appreciated.
Thanks
#!/bin/bash
#################################################
# #
# A base directory must be created and #
# the config file must be placed in there #
# #
#################################################
#check if config file was supplied
if [ $# -ne 1 ]; then  #if the number of arguments is not 1
    echo -e "Usage: run_pipeline_cfg.sh <configuration file>"  #print error message
    exit 1  #stop script execution
fi
#Parse the info from the ini file
ini=$(readlink -m "$1")
source <(grep = "$ini" | \
    sed -e "s/[[:space:]]*\=[[:space:]]*/=/g" \
        -e "s/;.*$//" \
        -e "s/[[:space:]]*$//" \
        -e "s/^[[:space:]]*//" \
        -e "s/^\(.*\)=\([^\"']*\)$/\1=\"\2\"/")
#ini="/home/directory/Initiator.ini" #debug
#Retrieve the base directory path
baseDir=$(dirname "$ini")
#Create required directory structure
logs="$baseDir/logs"
LDdecay="$baseDir/LDdecay"
imputed="$baseDir/imputed"
#dont create if already exists
[[ -d "$logs" ]] || mkdir "$logs"
[[ -d "$LDdecay" ]] || mkdir "$LDdecay"
[[ -d "$imputed" ]] || mkdir "$imputed"
#find imputed vcf files
if [ -e "$imputed" ] ; then
    echo -e "Folder with imputed vcf files exists. Determining if calculating LD decay"
else
    echo -e "Folder with imputed vcf files does not exist. Cannot calculate LD decay."
    exit 1
fi
#######################################################
# #
# Create LD decay files for LD decay plots #
# #
#######################################################
#determine on which files to perform LD decay calculations
#"0" is false (no), "1" is true for GBS only, "2" is true for Microarray only, "3" is true for Integrated only, "4" is true for all (GBS, Microarray and Integrated)
if [ "$decay" -eq 0 ]; then
printf "LD decay rates not calculated" | tee -a $logs/log.txt;
elif [ "$decay" -eq 1 ]; then
#perform LD decay calculation for GBS only
zcat $imputed/GBS_MAF0.01.vcf.gz > $LDdecay/GBS_MAF0.01.vcf
plink --vcf $LDdecay/GBS_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/GBS_LD_decay
rm $LDdecay/GBS_MAF0.01.vcf
printf "LD decay for GBS dataset completed" | tee -a $logs/log.txt
elif [ "$decay" -eq 2 ]; then
#perform LD decay calculation for microarray only
zcat $imputed/microarray_MAF0.01.vcf.gz > $LDdecay/microarray_MAF0.01.vcf
plink --vcf $LDdecay/microarray_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/microarray_LD_decay
rm $LDdecay/microarray_MAF0.01.vcf
printf "LD decay for microarray dataset completed" | tee -a $logs/log.txt
elif [ "$decay" -eq 3 ]; then
#perform LD decay for Merged dataset only
zcat $imputed/Integrated_MAF0.01_sorted.vcf.gz > $LDdecay/Integrated_MAF0.01.vcf
plink --vcf $LDdecay/Integrated_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/Integrated_LD_decay
rm $LDdecay/Integrated_MAF0.01.vcf
printf "LD decay calculation for Integrated dataset complete" | tee -a $logs/log.txt
elif [ "$decay" -eq 4 ]; then
#perform LD decay for Merged, GBS and SoySNP50K datasets
zcat $imputed/Integrated_MAF0.01_sorted.vcf.gz > $LDdecay/Integrated_MAF0.01.vcf
plink --vcf $LDdecay/Integrated_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/Integrated_LD_decay
rm $LDdecay/Integrated_MAF0.01.vcf
zcat $imputed/microarray_MAF0.01.vcf.gz > $LDdecay/microarray_MAF0.01.vcf
plink --vcf $LDdecay/microarray_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/microarray_LD_decay
rm $LDdecay/microarray_MAF0.01.vcf
zcat $imputed/snp_imputed_GBS_MAF0.01.vcf.gz > $LDdecay/GBS_MAF0.01.vcf
plink --vcf $LDdecay/GBS_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/GBS_LD_decay
rm $LDdecay/GBS_MAF0.01.vcf
printf "LD decay calculation completed for Integrated, GBS and SoySNP50K datasets completed" | tee -a $logs/log.txt
else
echo "Wrong LD Decay calculation setup"
exit 1
fi
Consider changing the elif chain to a case statement. (That [: : integer expression expected error means "$decay" is empty or non-numeric at the time the test runs, so it is worth verifying that the ini parsing actually sets it.) A case statement side-steps the need for the tests to have integer arguments, and it is both clearer and easier to extend:
case $decay in
    0)
        printf "LD decay rates not calculated" | tee -a $logs/log.txt
        ;;
    1)
        #perform LD decay calculation for GBS only
        zcat $imputed/GBS_MAF0.01.vcf.gz > $LDdecay/GBS_MAF0.01.vcf
        plink --vcf $LDdecay/GBS_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/GBS_LD_decay
        rm $LDdecay/GBS_MAF0.01.vcf
        printf "LD decay for GBS dataset completed" | tee -a $logs/log.txt
        ;;
    2)
        #perform LD decay calculation for microarray only
        zcat $imputed/microarray_MAF0.01.vcf.gz > $LDdecay/microarray_MAF0.01.vcf
        plink --vcf $LDdecay/microarray_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/microarray_LD_decay
        rm $LDdecay/microarray_MAF0.01.vcf
        printf "LD decay for microarray dataset completed" | tee -a $logs/log.txt
        ;;
    3)
        #perform LD decay for Merged dataset only
        zcat $imputed/Integrated_MAF0.01_sorted.vcf.gz > $LDdecay/Integrated_MAF0.01.vcf
        plink --vcf $LDdecay/Integrated_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/Integrated_LD_decay
        rm $LDdecay/Integrated_MAF0.01.vcf
        printf "LD decay calculation for Integrated dataset complete" | tee -a $logs/log.txt
        ;;
    4)
        #perform LD decay for Merged, GBS and SoySNP50K datasets
        zcat $imputed/Integrated_MAF0.01_sorted.vcf.gz > $LDdecay/Integrated_MAF0.01.vcf
        plink --vcf $LDdecay/Integrated_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/Integrated_LD_decay
        rm $LDdecay/Integrated_MAF0.01.vcf
        zcat $imputed/microarray_MAF0.01.vcf.gz > $LDdecay/microarray_MAF0.01.vcf
        plink --vcf $LDdecay/microarray_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/microarray_LD_decay
        rm $LDdecay/microarray_MAF0.01.vcf
        zcat $imputed/snp_imputed_GBS_MAF0.01.vcf.gz > $LDdecay/GBS_MAF0.01.vcf
        plink --vcf $LDdecay/GBS_MAF0.01.vcf --r2 --ld-window 10000000 --ld-window-kb 10000 --ld-window-r2 0 --output $LDdecay/GBS_LD_decay
        rm $LDdecay/GBS_MAF0.01.vcf
        printf "LD decay calculation for Integrated, GBS and SoySNP50K datasets completed" | tee -a $logs/log.txt
        ;;
    *)
        echo "Wrong LD Decay calculation setup"
        exit 1
        ;;
esac

Is this the fastest way to test CPU load using shell scripting?

I'm relatively new to shell scripting and I'm in the process of writing my own health-checking scripts using bash.
Is the following script to test CPU load the best I can have in terms of performance, readability and maintainability?
#!/bin/sh
getloadavg5 () {
    echo $(cat /proc/loadavg | cut -f2 -d' ')
}
getnumcpus () {
    echo $(cat /proc/cpuinfo | grep '^processor' | wc -l)
}
awk \
    -v failthold=0.8 \
    -v warnthold=0.7 \
    -v loadavg=$(getloadavg5) \
    -v numcpus=$(getnumcpus) \
    'BEGIN {
        ratio=loadavg/numcpus
        if (ratio >= failthold) exit 2
        if (ratio >= warnthold) exit 1
        exit 0
    }'
This might be more suitable for Code Review Stack Exchange, but without condoning the use of load averages in this way, here are some ideas:
#!/bin/sh
read -r one five fifteen rest < /proc/loadavg
cpus=$(grep -c '^processor' /proc/cpuinfo)
awk \
    -v failthold=0.8 \
    -v warnthold=0.7 \
    -v loadavg="$five" \
    -v numcpus="$cpus" \
    'BEGIN {
        ratio=loadavg/numcpus
        if (ratio >= failthold) exit 2
        if (ratio >= warnthold) exit 1
        exit 0
    }'
It doesn't have any of the unnecessary cats/echos.
It also happens to run faster thanks to forking 1 or 2 times (depending on the shell) instead of ~10; but if performance is a real concern, shell scripts should be avoided in general.
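Taken one step further, everything can be pushed into a single awk process that reads both /proc files itself (a sketch; assumes the Linux /proc layout, where field 2 of /proc/loadavg is the 5-minute average):
#!/bin/sh
awk -v failthold=0.8 -v warnthold=0.7 '
    NR == FNR { loadavg = $2; next }   # first file (/proc/loadavg): grab the 5-min average
    /^processor/ { numcpus++ }         # second file (/proc/cpuinfo): count processor lines
    END {
        ratio = loadavg / numcpus
        if (ratio >= failthold) exit 2
        if (ratio >= warnthold) exit 1
        exit 0
    }' /proc/loadavg /proc/cpuinfo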
