gnu parallel re-run when it fails with a while loop - bash

Assuming we have a CSV file A.csv containing:
1
2
3
4
Here is the code:
cat A.csv | while read A; do
echo "echo $A" > $A.sh
echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
The above is a simplified case, but it captures the bulk of what I am doing. The parallel command here will run like this:
parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
Now assume 3.sh failed for some reason. Is there any easy way to rerun the failed 3.sh within the same shell-script setting, on the same parallel command line? I have tried the following, but it doesn't work and is quite lengthy.
cat A.csv | while read A; do
echo "echo $A" > $A.sh
echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
# The above will do this:
# parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
cat A.csv | while read A; do
echo "echo $A" > $A.sh
echo "$A.sh"
done | xargs -I {} parallel --resume-failed --joblog test.log --jobs 2 -k sh ::: {}
# The above will do this:
# parallel --resume-failed --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
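For reference, a minimal sketch of how --resume-failed behaves on its own, assuming the same joblog and the same argument list are reused between runs:
parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
# suppose 3.sh exits non-zero; its failure is recorded in test.log
parallel --resume-failed --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
# reruns only the jobs whose Exitval in test.log was non-zero, then resumes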
######## 2017-09-25
Thanks Ole. I have tried the following:
doit() {
myarg="$1"
if [ $myarg -eq 3 ]
then
exit 1
else
echo do crazy stuff with "$myarg"
fi
}
export -f doit
parallel -k --retries 3 --joblog ole.log doit :::: A.csv
The resulting log file looks like this:
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1506362303.003 0.016 0 22 0 0 doit 1
2 : 1506362303.006 0.013 0 22 0 0 doit 2
3 : 1506362303.026 0.002 0 0 1 0 doit 3
4 : 1506362303.014 0.006 0 22 0 0 doit 4
However, I don't see doit 3 being retried 3 times as expected. Could you enlighten me? Thanks.

First: Generating .sh files seems like a bad idea. You can most likely just make a function instead:
doit() {
myarg="$1"
echo do crazy stuff with "$myarg"
}
export -f doit
To retry a failing command use --retries:
parallel --retries 3 doit :::: file.csv
If your CSV file has multiple columns, use --colsep:
parallel --retries 3 --colsep '\t' doit :::: file.csv
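As an illustration (this two-column doit and file layout are assumptions, not from the original answer), if file.csv were tab-separated with two columns, each column would arrive in the function as its own positional parameter:
doit() {
col1="$1"  # first column of the row
col2="$2"  # second column of the row
echo do crazy stuff with "$col1" and "$col2"
}
export -f doit
parallel --retries 3 --colsep '\t' doit :::: file.csv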
Using this:
doit() {
myarg="$1"
if [ $myarg -eq 3 ] ; then
echo do not do crazy stuff with "$myarg"
exit 1
else
echo do crazy stuff with "$myarg"
fi
}
export -f doit
This will retry the '3' job 3 times:
parallel -k --retries 3 --joblog ole.log doit ::: 1 2 3 4
Only the last run is logged. To be convinced this actually runs thrice, run with the output unbuffered:
parallel -u --retries 3 --joblog ole.log doit ::: 1 2 3 4
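With -u you should then see the '3' job's output three times, interleaved with the other jobs in completion order; roughly something like this (exact ordering will vary):
do crazy stuff with 1
do crazy stuff with 2
do not do crazy stuff with 3
do crazy stuff with 4
do not do crazy stuff with 3
do not do crazy stuff with 3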

Related

how to group all arguments as one positional argument for `xargs`

I have a script which takes only one positional parameter, which is a list of values, and I'm trying to get the parameter from stdin with xargs.
However, by default xargs passes every item to my script as a separate positional parameter, e.g. when doing:
echo 1 2 3 | xargs myScript, it will essentially be myScript 1 2 3, and what I'm looking for is myScript "1 2 3". What is the best way to achieve this?
Change the delimiter.
$ echo 1 2 3 | xargs -d '\n' printf '%s\n'
1 2 3
Not all xargs implementations have -d though.
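Where -d is missing but -0 is available (NUL-delimited input is more widely implemented), one possible workaround, sketched here, is to convert newlines to NUL bytes first:
$ echo 1 2 3 | tr '\n' '\0' | xargs -0 printf '%s\n'
1 2 3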
And not sure if there is an actual use case for this, but you can also resort to spawning another shell instance if you have to. Like
$ echo -e '1 2\n3' | xargs sh -c 'printf '\''%s\n'\'' "$*"' sh
1 2 3
Note the trailing sh: it becomes $0 inside the spawned shell, so all the words supplied by xargs land in the positional parameters that "$*" joins into one string.
If the input can be altered, you can quote it: plain xargs (without -d or -0) parses single and double quotes in its input, so a quoted group becomes a single argument. But not sure if this is what you wanted.
echo \"1 2 3\"|xargs ./myScript
Here is the example.
$ cat myScript
#!/bin/bash
echo $1; shift
echo $1; shift
echo $1;
$ echo \"1 2 3\"|xargs ./myScript
1 2 3
$ echo 1 2 3|xargs ./myScript
1
2
3

Pass arguments to a script that is an argument to a different script

I am new to programming, so please bear with the way I try to explain my problem (also, any help on how to elegantly phrase the title is welcome).
I have a bash script (say, script1.sh) that takes arguments a, b and c (another script). Essentially, argument c for script1.sh is the name of another script (let's say script2.sh). However, script2.sh takes arguments d, e and f. So my question is, how do I pass the arguments to script1.sh? (Example: ./script1.sh -a 1 -b 2 -c script2.sh -d 3 -e 4 -f 5)
Sorry in advance if the above does not make sense, not sure how else to phrase it...
You should wrap the inner command in "" for that:
./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
Try script1.sh with this code
#!/bin/bash
for arg in "$#"; { # loop through all arguments passed to the script
echo $arg
}
The output will be
$ ./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
-a
1
-b
2
-c
script2.sh -d 3 -e 4 -f 5
But if you run this
#!/bin/bash
for arg in $@; { # no double quotes around $@
echo $arg
}
The output will be
$ ./script1.sh -a 1 -b 2 -c "script2.sh -d 3 -e 4 -f 5"
-a
1
-b
2
-c
script2.sh
-d
3
4
-f
5
But there is no -e. Why? Because echo recognizes -e as one of its own options and consumes it instead of printing it.
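A sketch of a safer loop body (my variant, not from the original answer): printf does not consume option-like arguments the way echo does, so swapping it in keeps the -e:
#!/bin/bash
for arg in "$@"; { # printf '%s\n' treats -e as plain data, not an option
printf '%s\n' "$arg"
}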

How to suspend the main command when piping the output to another delay command

I have two custom scripts that each implement their own task: one outputs some URLs (represented by the cat command below) and another receives a URL and parses it via network requests (represented by the sleep command below).
Here is the prototype:
Case 1:
cat urls.txt | xargs -I{} sleep 1 && echo "END: {}"
The output is END: {} and the sleep works.
Case 2:
cat urls.txt | xargs -I{} echo "BEGIN: {}" && sleep 1 && echo "END: {}"
The output is
BEGIN: https://www.example.com/1
BEGIN: https://www.example.com/2
BEGIN: https://www.example.com/3
END: {}
but it seems to sleep for only 1 second in total.
Q1: I'm a little confused; why is this the output?
Q2: Are there any solutions that execute the full pipelined xargs delay command for every line of the cat output?
You can put the commands into a separate script:
worker.sh
#!/bin/bash
echo "BEGIN: $*" && sleep 1 && echo "END: $*"
set execute permission:
chmod +x worker.sh
and call it with xargs:
cat urls.txt | xargs -I{} ./worker.sh {}
output
BEGIN: https://www.example.com/1
END: https://www.example.com/1
BEGIN: https://www.example.com/2
END: https://www.example.com/2
BEGIN: https://www.example.com/3
END: https://www.example.com/3
Between BEGIN and END the script sleeps for one second.
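As a side note (an assumption about the input, not part of the original answer): if every line of urls.txt is a single argument with no embedded whitespace or quotes, -n 1 achieves the same thing without -I replacement:
cat urls.txt | xargs -n 1 ./worker.sh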
Thanks to shellter's and UtLox's reminders, I found that xargs is the key.
Here is my finding: the shell/zsh interpreter splits off the sleep 1 and echo "END: {}" as separate commands, so xargs never receives my two expected && inline commands as part of one utility command, and never replaces the {} in the END expression. This can be proved with xargs -t:
cat urls.txt | xargs -I{} -t echo "BEGIN: {}" && sleep 1 && echo "END: {}"
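With -t, xargs prints each command to stderr just before running it, so the trace shows that only the echo part is executed by xargs, while the single sleep and the literal END: {} come from the outer shell; roughly like this (a sketch, exact formatting depends on the xargs implementation):
echo BEGIN: https://www.example.com/1
BEGIN: https://www.example.com/1
echo BEGIN: https://www.example.com/2
BEGIN: https://www.example.com/2
echo BEGIN: https://www.example.com/3
BEGIN: https://www.example.com/3
END: {}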
Inspired by UtLox's answer, I found I could achieve what I expected with sh -c in xargs:
cat urls.txt | xargs -I{} -P 5 sh -c 'echo "BEGIN: {}" && sleep 1 && echo "END: {}"'
As for -P 5, it makes xargs run the utility command in up to 5 parallel subprocesses, to make better use of the available bandwidth.
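One caveat, sketched here as a possible variant: substituting {} directly into the sh -c string means a URL containing shell metacharacters would be executed as code. Passing the URL as a positional parameter instead avoids that:
cat urls.txt | xargs -I{} -P 5 sh -c 'echo "BEGIN: $1" && sleep 1 && echo "END: $1"' sh {}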
Done!

parallelizing nested for loop with GNU Parallel

I am working in Bash. I have a series of nested for loops that iteratively look for the presence of three lists of 96 barcode sequences. My goal is to find each unique combination of barcodes; there are 96x96x96 (884,736) possible combinations.
for barcode1 in "${ROUND1_BARCODES[#]}";
do
grep -B 1 -A 2 "$barcode1" $FASTQ_R > ROUND1_MATCH.fastq
echo barcode1.is.$barcode1 >> outputLOG
if [ -s ROUND1_MATCH.fastq ]
then
# Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
for barcode2 in "${ROUND2_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
if [ -s ROUND2_MATCH.fastq ]
then
# Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step
for barcode3 in "${ROUND3_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode3" ./ROUND2_MATCH.fastq | sed '/^--/d' > ROUND3_MATCH.fastq
# If matches are found we will write them to an output .fastq file iteratively labelled with an ID number
if [ -s ROUND3_MATCH.fastq ]
then
mv ROUND3_MATCH.fastq results/result.$count.2.fastq
fi
count=$((count + 1))
done
fi
done
fi
done
This code works and I am able to successfully extract the sequences for each barcode combination. However, I think the speed of this can be improved for working through large files by parallelizing the loop structure. I know that I can use GNU parallel to do this, but I am struggling to nest the parallelizations.
# Parallelize nested loops
now=$(date +"%T")
echo "Beginning STEP1.2: PARALLEL Demultiplex using barcodes. Current
time : $now" >> outputLOG
mkdir ROUND1_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} SRR6750041_2_smalltest.fastq > ROUND1_PARALLEL_HITS/{#}_ROUND1_MATCH.fastq' ::: "${ROUND1_BARCODES[@]}"
mkdir ROUND2_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND1_PARALLEL_HITS/*.fastq > ROUND2_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND2_BARCODES[@]}"
mkdir ROUND3_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND2_PARALLEL_HITS/*.fastq > ROUND3_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND3_BARCODES[@]}"
mkdir parallel_results
parallel -j 6 'mv {} parallel_results/result_{#}.fastq' ::: ROUND3_PARALLEL_HITS/*.fastq
How can I successfully recreate the nested structure of the for loops using parallel?
Here only the inner loop is parallelized:
for barcode1 in "${ROUND1_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode1" $FASTQ_R > ROUND1_MATCH.fastq
echo barcode1.is.$barcode1 >> outputLOG
if [ -s ROUND1_MATCH.fastq ]
then
# Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
for barcode2 in "${ROUND2_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
if [ -s ROUND2_MATCH.fastq ]
then
# Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step
doit() {
grep -B 1 -A 2 "$1" ./ROUND2_MATCH.fastq | sed '/^--/d'
}
export -f doit
parallel -j0 doit {} '>' results/$barcode1-$barcode2-{} ::: "${ROUND3_BARCODES[@]}"
# TODO remove files with 0 length
fi
done
fi
done
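For the TODO, one possible sketch (assuming a find that supports -empty and -delete, as GNU and BSD find do) is to remove the zero-length result files after each inner run:
find results/ -type f -empty -delete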

append variables from a while loop into a command line option

I have a while loop, where A ranges over 1~3:
mysql -e "select A from xxx;" while read A;
do
whatever
done
The MySQL command returns only numbers, one number per line. So the while loop here will have A=1, A=2, A=3.
I would like to append each integer from the loop (here A=1~3) to a command line that runs outside the while loop. Is there a bash way to do this?
parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh
You probably want something like this:
mysql -e "select A from xxx;" | while read A; do
whatever > standard_out 2>standard_error
echo "$A.sh"
done | xargs parallel --joblog test.log --jobs 2 -k sh :::
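Since there is no -I here, xargs appends all the collected names after :::, so the loop output above turns into a single invocation along the lines of:
parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh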
Thanks for enlightening me. xargs works perfectly here:
Assuming we have A.csv (mimicking the mysql command):
1
2
3
4
We can simply do:
cat A.csv | while read A; do
echo "echo $A" > $A.sh
echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
The above will print the following output as expected:
1
2
3
4
Here -I {} & {} are the argument list markers:
https://www.cyberciti.biz/faq/linux-unix-bsd-xargs-construct-argument-lists-utility/
