How to run two commands with different inputs with GNU Parallel?

I have two programs that serve different purposes; they are called as follows:
./FolderCounter <PATH TO FOLDER> traceX
./VideoCounter <PATH TO VIDEO> traceY
To run these applications I currently have the following GNU Parallel commands:
parallel ./FolderCounter {} trace3 ::: $(cat PatinN_files.txt) &> data_output/Result_PatinN_files.txt
parallel ./FolderCounter {} trace5 ::: $(cat PatinS_files.txt) &> data_output/Result_PatinS_files.txt
parallel ./VideoCounter {} trace3 ::: $(cat PatinN_videos.txt) &> data_output/Result_PatinN_video.txt
parallel ./VideoCounter {} trace5 ::: $(cat PatinS_videos.txt) &> data_output/Result_PatinS_video.txt
My goal is to combine these four lines into a single GNU Parallel command, so that it can better manage the number of parallel jobs and start the next batch of files as soon as processors become available.
How can I do that?

First: Don't do:
parallel ... ::: $(cat foo)
Do:
parallel ... :::: foo
In most cases this will do what you want, whereas the first may cause problems if the file contains lines with spaces.
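To see why, consider a toy example (the file name and contents here are hypothetical):
# args.txt contains a single line with a space in it
printf '%s\n' 'my file.txt' > args.txt
parallel echo ::: $(cat args.txt)   # word-splits: two jobs, "my" and "file.txt"
parallel echo :::: args.txt         # reads lines: one job, "my file.txt"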
I assume that PatinN_files.txt has the same number of lines as PatinN_videos.txt.
Normally I would do 2 runs: a trace3-run and a trace5-run. The ::::+ operator links the two input files line by line (like --link) instead of generating all combinations:
parallel ./FolderCounter {1} trace3 ";" ./VideoCounter {2} trace3 ::::+ PatinN_files.txt PatinN_videos.txt &> data_output/Result_PatinN.txt
parallel ./FolderCounter {1} trace5 ";" ./VideoCounter {2} trace5 ::::+ PatinS_files.txt PatinS_videos.txt &> data_output/Result_PatinS.txt
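If you want to check the pairing before running anything, a --dry-run (my addition, not part of the original answer) prints the composed commands without executing them:
parallel --dry-run ./FolderCounter {1} trace3 ";" ./VideoCounter {2} trace3 ::::+ PatinN_files.txt PatinN_videos.txt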
Alternatively you can simply use GNU Parallel to first generate all the commands to run and then run them (this does not require the txt-files to have the same number of lines):
(
parallel --dry-run ./FolderCounter {} trace3 :::: PatinN_files.txt
parallel --dry-run ./FolderCounter {} trace5 :::: PatinS_files.txt
parallel --dry-run ./VideoCounter {} trace3 :::: PatinN_videos.txt
parallel --dry-run ./VideoCounter {} trace5 :::: PatinS_videos.txt
) | parallel &> data_output/Result.txt
To track which input generated which output, replace the last line with:
) | parallel --tag &> data_output/Result.txt
Getting the log output into 4 different files is a bit harder. If that is really needed it can be done, but it is not as elegant as the above.
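One possible sketch, assuming each run's output is small enough that concurrent appends do not interleave badly (this is my illustration, not a guaranteed race-free solution), is to bake a per-category redirection into each generated command:
(
parallel --dry-run "./FolderCounter {} trace3 >> data_output/Result_PatinN_files.txt 2>&1" :::: PatinN_files.txt
parallel --dry-run "./FolderCounter {} trace5 >> data_output/Result_PatinS_files.txt 2>&1" :::: PatinS_files.txt
parallel --dry-run "./VideoCounter {} trace3 >> data_output/Result_PatinN_video.txt 2>&1" :::: PatinN_videos.txt
parallel --dry-run "./VideoCounter {} trace5 >> data_output/Result_PatinS_video.txt 2>&1" :::: PatinS_videos.txt
) | parallel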
If you simply want to run the jobs when there are spare CPUs sitting idle, you can use --load 100%:
parallel --load 100% ./FolderCounter {} trace3 ::: $(cat PatinN_files.txt) &> data_output/Result_PatinN_files.txt &
parallel --load 100% ./FolderCounter {} trace5 ::: $(cat PatinS_files.txt) &> data_output/Result_PatinS_files.txt &
parallel --load 100% ./VideoCounter {} trace3 ::: $(cat PatinN_videos.txt) &> data_output/Result_PatinN_video.txt &
parallel --load 100% ./VideoCounter {} trace5 ::: $(cat PatinS_videos.txt) &> data_output/Result_PatinS_video.txt &
wait
It will only start a job if the instantaneous load is less than the number of CPUs.

Related

Rewriting nested loops using parallel where the arguments for the second loop depend on the arguments of the first loop

Use case:
I have two nested for loops in which the arguments provided to the second loop depend on the arguments provided to the first loop.
for node in `ls -l $dir | grep $pattern1`
do
    for part in `ls -l $dir/$node | grep $pattern2`
    do
        something using $node and $part
    done
done
To rewrite this using parallel I created a function:
doit() {
    node=$1
    part=$2
    do something using $node and $part
}
parallel doit ::: <arguments for first for loop> ::: <arguments for second for loop>
How do I provide these arguments when running with parallel?
GNU Parallel cannot do that directly, so you need to generate the combinations by hand:
for node in `ls -l $dir | grep $pattern1`; do
    for part in `ls -l $dir/$node | grep $pattern2`; do
        printf '%s\t%s\n' "$node" "$part"
    done
done | parallel --dry-run --colsep '\t' doit {1} {2}
You may be able to use skip() like this; in the {= =} construct, $arg[1] and $arg[2] are the values from the first and second input source, and skip() drops any combination for which the path $arg[1]/$arg[2] does not exist:
parallel --dry-run doit '{= -e "$arg[1]/$arg[2]" or skip() =}' ::: all nodes ::: all parts
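A minimal way to try this out, with hypothetical node and part names (only combinations whose node/part path exists should generate a job):
mkdir -p node1/partA node2/partB   # hypothetical directory layout
parallel --dry-run doit '{= -e "$arg[1]/$arg[2]" or skip() =}' ::: node1 node2 ::: partA partB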

How to suspend the main command when piping the output to another delay command

I have two custom scripts, each implementing its own task: one outputs some URLs (stood in for by the cat command below) and the other receives a URL and parses it via network requests (stood in for by the sleep command below).
Here is the prototype:
Case 1:
cat urls.txt | xargs -I{} sleep 1 && echo "END: {}"
The output is END: {} and the sleep works.
Case 2:
cat urls.txt | xargs -I{} echo "BEGIN: {}" && sleep 1 && echo "END: {}"
The output is
BEGIN: https://www.example.com/1
BEGIN: https://www.example.com/2
BEGIN: https://www.example.com/3
END: {}
but it seems to sleep for only 1 second in total.
Q1: I'm a little confused; why does it produce these outputs?
Q2: Is there a way to execute the full pipelined xargs delay command for every line that cat outputs?
You can put the commands into a separate script:
worker.sh
#!/bin/bash
echo "BEGIN: $*" && sleep 1 && echo "END: $*"
set the execute permission:
chmod +x worker.sh
and call it with xargs:
cat urls.txt | xargs -I{} ./worker.sh {}
output
BEGIN: https://www.example.com/1
END: https://www.example.com/1
BEGIN: https://www.example.com/2
END: https://www.example.com/2
BEGIN: https://www.example.com/3
END: https://www.example.com/3
Between BEGIN and END the script sleeps for one second.
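As an aside (my addition, not part of the original answer): GNU Parallel runs its command line through a shell, so the same effect needs no wrapper script there:
cat urls.txt | parallel 'echo "BEGIN: {}"; sleep 1; echo "END: {}"'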
Thanks to shellter's and UtLox's reminders, I found that xargs is the key.
Here is my finding: the shell/zsh interpreter splits off sleep 1 and echo "END: {}" as separate commands, so xargs never receives my two &&-joined commands as one utility command and never substitutes a value for {} in the END expression. This can be verified with xargs -t:
cat urls.txt | xargs -I{} -t echo "BEGIN: {}" && sleep 1 && echo "END: {}"
Inspired by UtLox's answer, I found I could achieve what I wanted with sh -c in xargs.
cat urls.txt | xargs -I{} -P 5 sh -c 'echo "BEGIN: {}" && sleep 1 && echo "END: {}"'
With -P 5, the utility command runs in up to 5 parallel subprocesses, to make better use of the available bandwidth.
Done!
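One caveat worth adding (my note, not part of the original answer): substituting {} directly into the sh -c string means a line containing quotes or $() could break, or even inject into, the shell command. A safer variant passes the line as a positional parameter instead:
cat urls.txt | xargs -I{} -P 5 sh -c 'echo "BEGIN: $1" && sleep 1 && echo "END: $1"' _ {}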

GNU parallel: re-run when it fails, with a while loop

Assume we have a CSV file A.csv:
1
2
3
4
Here is the code:
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
The above is a simplified case, but it pretty much captures the bulk of it. The intent is for parallel to run like this:
parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
Now assume 3.sh fails for some reason. Is there an easy way to rerun the failed 3.sh within the same parallel command, in the current shell-script setup? I have tried the following, but it doesn't work and is quite lengthy.
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
# The above will do this:
# parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --resume-failed --joblog test.log --jobs 2 -k sh ::: {}
# The above will do this:
# parallel --resume-failed --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh 4.sh
Edit 2017-09-25:
Thanks, Ole. I have tried the following:
doit() {
    myarg="$1"
    if [ $myarg -eq 3 ]
    then
        exit 1
    else
        echo do crazy stuff with "$myarg"
    fi
}
export -f doit
parallel -k --retries 3 --joblog ole.log doit :::: A.csv
It returns the log file like this:
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1506362303.003 0.016 0 22 0 0 doit 1
2 : 1506362303.006 0.013 0 22 0 0 doit 2
3 : 1506362303.026 0.002 0 0 1 0 doit 3
4 : 1506362303.014 0.006 0 22 0 0 doit 4
However, I don't see doit 3 being retried 3 times as expected. Could you enlighten me? Thanks.
First: Generating .sh files seems like a bad idea. You can most likely just make a function instead:
doit() {
    myarg="$1"
    echo do crazy stuff with "$myarg"
}
export -f doit
To retry a failing command use --retries:
parallel --retries 3 doit :::: file.csv
If your CSV file has multiple columns, use --colsep:
parallel --retries 3 --colsep '\t' doit :::: file.csv
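A minimal sketch of the multi-column case, using a hypothetical tab-separated file and a hypothetical helper function; each column arrives as a separate positional argument:
printf '1\tone\n2\ttwo\n' > file.csv          # hypothetical two-column input
showcols() { echo "got col1=$1 col2=$2"; }    # hypothetical helper
export -f showcols
parallel --colsep '\t' showcols :::: file.csv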
Using this:
doit() {
    myarg="$1"
    if [ $myarg -eq 3 ] ; then
        echo do not do crazy stuff with "$myarg"
        exit 1
    else
        echo do crazy stuff with "$myarg"
    fi
}
export -f doit
This will retry the '3' job 3 times:
parallel -k --retries 3 --joblog ole.log doit ::: 1 2 3 4
Only the last run is logged in the joblog. To convince yourself that the job really runs three times, run with the output unbuffered; you should then see the 'do not do crazy stuff with 3' line three times:
parallel -u --retries 3 --joblog ole.log doit ::: 1 2 3 4

append variables from a while loop into a command line option

I have a while loop, where A=1..3:
mysql -e "select A from xxx;" | while read A; do
    whatever
done
The MySQL command returns only numbers, one per line, so the while loop will see A=1, A=2, A=3.
I would like to append each integer from the loop (here A=1..3) to a command line that runs outside the while loop, e.g.:
parallel --joblog test.log --jobs 2 -k sh ::: 1.sh 2.sh 3.sh
You probably want something like this:
mysql -e "select A from xxx;" | while read A; do
    whatever > standard_out 2> standard_error
    echo "$A.sh"
done | xargs parallel --joblog test.log --jobs 2 -k sh :::
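Since GNU Parallel reads its arguments from stdin when no ::: or :::: source is given, you could also drop xargs entirely (my variation on the line above):
mysql -e "select A from xxx;" | while read A; do
    whatever > standard_out 2> standard_error
    echo "$A.sh"
done | parallel --joblog test.log --jobs 2 -k sh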
Thanks for enlightening me. xargs works perfectly here:
Assuming we have A.csv (mimicking the mysql command):
1
2
3
4
We can simply do:
cat A.csv | while read A; do
    echo "echo $A" > $A.sh
    echo "$A.sh"
done | xargs -I {} parallel --joblog test.log --jobs 2 -k sh ::: {}
The above prints the following output, as expected:
1
2
3
4
Here -I {} and {} are the argument-list markers:
https://www.cyberciti.biz/faq/linux-unix-bsd-xargs-construct-argument-lists-utility/

combine GNU parallel with nested for loops and multiple variables

I have n folders in destdir. Each folder contains two files: *R1.fastq and *R2.fastq. This script runs the job (bowtie2) on each sub-folder one by one and writes {name of the sub-folder}.sam to destdir.
#!/bin/bash
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
for f in $destdir/*
do
    fbase=$(basename "$f")
    echo "Sample $fbase"
    bowtie2 -p 4 -x $mm9_index -X 2000 \
        -1 "$f"/*R1.fastq \
        -2 "$f"/*R2.fastq \
        -S $destdir/${fbase}.sam
done
I want to use the GNU Parallel tool to speed this up. Can you help? Thanks.
Use a bash function:
#!/bin/bash
my_bowtie() {
    mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
    destdir=/Users/Desktop/test/outdir/
    f="$1"
    fbase=$(basename "$f")
    echo "Sample $fbase"
    bowtie2 -p 4 -x $mm9_index -X 2000 \
        -1 "$f"/*R1.fastq \
        -2 "$f"/*R2.fastq \
        -S $destdir/${fbase}.sam
}
export -f my_bowtie
parallel my_bowtie ::: $destdir/*
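One thing to watch (my note, not part of the original answer): each bowtie2 job already uses 4 threads (-p 4), while parallel defaults to one job per CPU thread, so running both at full width oversubscribes the machine. You can cap the number of simultaneous jobs, for example to a quarter of the cores:
parallel -j 25% my_bowtie ::: $destdir/*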
For more details: man parallel or http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Calling-Bash-functions
At its simplest, you can normally just put echo in front of your commands and send the list of commands that you would have executed sequentially to GNU Parallel, to execute in parallel, like this:
for f in ...; do
    echo bowtie2 -p 4 ....
done | parallel
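Filled in for the bowtie2 question above (same variables as the original script, assuming the paths contain no spaces):
mm9_index="/Users/bowtie2-2.2.6/indexes/mm9/mm9"
destdir=/Users/Desktop/test/outdir/
for f in $destdir/*; do
    fbase=$(basename "$f")
    echo bowtie2 -p 4 -x $mm9_index -X 2000 -1 $f/*R1.fastq -2 $f/*R2.fastq -S $destdir/$fbase.sam
done | parallel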
