How do I use GNU parallel's retries feature without passing any parameters? - bash

I like the feature
parallel -q --retries 5 ./myprogram
But GNU parallel doesn't seem to work unless I pass it a set of args. So I have to do something like this
seq 1 | parallel -q --retries 5 ./myprogram
Is there a way to tell GNU Parallel I don't want to pass it args, and just want to use it as a wrapper for retries?
Is there a bash way to do retries 5 without doing a bash for loop testing exit code?

You clearly know you are abusing GNU Parallel :) and thus should not be surprised if there is no elegant way of doing it.
One way to do it is to use -N0
parallel -N0 -q --retries 5 ./myprogram ::: dummy
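To see the retry behavior in action, here is a quick sketch with a hypothetical flaky.sh standing in for ./myprogram:
# flaky.sh stands in for ./myprogram; it fails roughly half the time.
printf '%s\n' '#!/bin/bash' 'exit $(( RANDOM % 2 ))' > flaky.sh
chmod +x flaky.sh
# Run it once (no real arguments), retrying up to 5 times on failure.
parallel -N0 -q --retries 5 ./flaky.sh ::: dummy
echo "exit status: $?"   # 0 unless all 5 attempts failed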

Related

Correct order of parallel execution of shell `time` command

I need to execute the command below (as part of a script) but I don't know in what order to put things so that it executes correctly. What I am trying to do is to give file.smt2 as input to optimathsat, execute it, get the execution time. But I want this to be done several times in parallel using all CPU cores.
parallel -j+0 time Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < file.smt2 &>results.csv
I added #!/bin/bash -x at the beginning of my file to look at what is happening and this was the output:
+ parallel -j+0 time file.smt2
parallel: Warning: Input is read from the terminal. You are either an expert
parallel: Warning: (in which case: YOU ARE AWESOME!) or maybe you forgot.
parallel: Warning: ::: or :::: or -a or to pipe data into parallel.
...from the 1st line, I can tell the order is wrong. From lines 2, 3 and 4, the syntax is lacking. How can I fix this?
So I take it you do not care about the results, but only the timing:
seq $(parallel --number-of-threads) |
parallel -j+0 -N0 --joblog my.log 'Desktop/optimathsat-1.5.1-macos-64-bit/bin/optimathsat < file.smt2'
cat my.log
-N0 inserts 0 arguments.
Consider reading GNU Parallel 2018 (printed, online) - at least chapter 1+2. Your command line will thank you for it.
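If only the runtimes matter, they can be pulled straight out of the joblog. This is a small sketch assuming the default --joblog layout, where JobRuntime is the fourth tab-separated column:
# Print each job's runtime in seconds, skipping the header line.
awk -F'\t' 'NR > 1 { print $4 }' my.log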

In bash, is it generally better to use process substitution or pipelines

In the use-case of having the output of a singular command being consumed by only one other, is it better to use | (pipelines) or <() (process substitution)?
Better is, of course, subjective. For my specific use case I am after performance as the primary driver, but also interested in robustness.
I already know about the benefits of while read ... do ... done < <(cmd) and have switched over to that form.
I have several var=$(cmd1|cmd2) instances that I suspect might be better replaced as var=$(cmd2 < <(cmd1)).
I would like to know what specific benefits the latter case brings over the former.
tl;dr: Use pipes, unless you have a convincing reason not to.
Piping and redirecting stdin from a process substitution is essentially the same thing: both will result in two processes connected by an anonymous pipe.
There are three practical differences:
1. Bash defaults to creating a fork for every stage in a pipeline.
Which is why you started looking into this in the first place:
#!/bin/bash
cat "$1" | while IFS= read -r last; do true; done
echo "Last line of $1 is $last"
This script won't work by default with a pipeline, because unlike ksh and zsh, bash forks a subshell for each stage, so the $last set inside the loop is lost when that subshell exits.
If you set shopt -s lastpipe in bash 4.2+, bash mimics the ksh and zsh behavior and works just fine.
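For reference, a minimal sketch of the lastpipe variant (lastpipe only takes effect when job control is off, which is the case in a non-interactive script):
#!/bin/bash
shopt -s lastpipe   # bash 4.2+: run the last pipeline stage in the current shell
cat "$1" | while IFS= read -r last; do true; done
echo "Last line of $1 is $last"   # $last is still set here, since the loop did not run in a subshell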
2. Bash does not wait for process substitutions to finish.
POSIX only requires a shell to wait for the last process in a pipeline, but most shells including bash will wait for all of them.
This makes a notable difference when you have a slow producer, like in a /dev/random password generator:
tr -cd 'a-zA-Z0-9' < /dev/random | head -c 10 # Slow?
head -c 10 < <(tr -cd 'a-zA-Z0-9' < /dev/random) # Fast?
The first example will not benchmark favorably. Once head is satisfied and exits, tr will wait around for its next write() call to discover that the pipe is broken.
Since bash waits for both head and tr to finish, it will seem slower.
In the procsub version, bash only waits for head, and lets tr finish in the background.
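One rough way to observe the difference yourself (results vary by system; on modern kernels /dev/random may not block, so the gap can be small):
time { tr -cd 'a-zA-Z0-9' < /dev/random | head -c 10 > /dev/null; }
time { head -c 10 < <(tr -cd 'a-zA-Z0-9' < /dev/random) > /dev/null; }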
3. Bash does not currently optimize away forks for single simple commands in process substitutions.
If you invoke an external command like sleep 1, then the Unix process model requires that bash forks and executes the command.
Since forks are expensive, bash optimizes the cases that it can. For example, the command:
bash -c 'sleep 1'
Would naively incur two forks: one to run bash, and one to run sleep. However, bash can optimize it because there's no need for bash to stay around after sleep finishes, so it can instead just replace itself with sleep (execve with no fork). This is very similar to tail call optimization.
( sleep 1 ) is similarly optimized, but <( sleep 1 ) is not. The source code does not offer a particular reason why, so it may just not have come up.
$ strace -f bash -c '/bin/true | /bin/true' 2>&1 | grep -c clone
2
$ strace -f bash -c '/bin/true < <(/bin/true)' 2>&1 | grep -c clone
3
Given the above you can create a benchmark favoring whichever position you want, but since the number of forks is generally much more relevant, pipes would be the best default.
And obviously, it doesn't hurt that pipes are the POSIX standard, canonical way of connecting the stdin/stdout of two processes, and work equally well on all platforms.

Is it possible to use a Bash script to process an input file 5 rows at a time?

Suppose an input file "input.txt" contains 10 rows of echo commands. Is it possible to process 5 rows at a time? Once a row completes its command, the next row in the file should run.
e.g.
$ cat input.txt
echo command 1
echo command 2
echo command 3
echo command 4
echo command 5
echo command 6
echo command 7
echo command 8
echo command 9
echo command 10
I realize these are simple commands; the ultimate idea is to run up to 5 rows of commands at a time, and once each one completes successfully, a new command from the input file would start.
Use parallel:
$ cat input.txt | parallel -j5
cat input.txt | xargs -P5 -i bash -c "{}" certainly works for most cases.
xargs -P5 -i bash -c "{}" <input.txt suggested by David below is probably better, and I'd imagine there are simple ways of avoiding the explicit bash usage as well.
Just to break this down a bit: xargs breaks up input in ways you can specify. In this case, the -i and {} tell it WHERE you want the broken-up input and implicitly tell it to use only one piece of input for each command. The -P5 tells it to run up to 5 commands in parallel.
By most cases, I mean commands that don't rely on having variables passed to them or other complicating factors.
Of course, when running 5 commands at a time, command 5 can complete before command 1. If the order matters, you can group commands together:
echo 2;sleep 1
(And the grouped sleep is also pretty useful for testing it to make sure it's behaving how you're expecting.)
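If what matters is only that the printed output comes back in the same order as the input lines, GNU parallel's --keep-order flag covers that (the commands themselves still run 5 at a time):
# -k (--keep-order) prints each command's output in the same order as the input lines
parallel -j5 -k < input.txt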

Is there a way to force xargs to send multiple lines at once?

I have a job that reads data from a \n delimited stream and sends the information to xargs to process 1 line at a time. The problem is, this is not performant enough, but I know that if I altered the program such that the command executed by xargs was sent multiple lines instead of just one line at a time, it could drastically improve the performance of my script.
Is there a way to do this? I haven't been having any luck with various combinations of -L or -n. Unfortunately, I think I'm also stuck with -I to parameterize the input since my command doesn't seem to want to take stdin if I don't use -I.
The basic idea is that I'm trying to simulate mini-batch processing using xargs.
Conceptually, here's something similar to what I currently have written
contiguous-stream | xargs -d '\n' -n 10 -L 10 -I {} bash -c 'process_line {}'
^ in the above, process_line could easily be changed to process many lines at once, and that function is currently the bottleneck. For emphasis: above, -n 10 and -L 10 don't seem to do anything; my lines are still processed one at a time.
Multiple Lines Per Shell Invocation
Don't use -I here. It prevents more than one argument from being passed at a time, and is outright major-security-bug dangerous when being used to substitute values into a string passed as code.
contiguous-stream | xargs -d $'\n' -n 10 \
bash -c 'for line in "$@"; do process_line "$line"; done' _
Here, we're passing arguments added by xargs out-of-band from the code, in positional parameters populated from $1 and later, and then using "$@" to iterate over them.
Note that this reduces overhead inasmuch as it passes multiple arguments to each shell (so you pay shell startup costs fewer times), but it doesn't actually process all those arguments concurrently. For that, you want...
Multiple Lines In Parallel
Assuming GNU xargs, you can use -P to specify a level of parallel processing:
contiguous-stream | xargs -d $'\n' -n 10 -P 8 \
bash -c 'for line in "$@"; do process_line "$line"; done' _
Here, we're passing 10 arguments to each shell, and running 8 shells at a time. Tune your arguments to taste: Higher values of -n spend less time starting up new shells but increase the amount of waste at the end (if one process still has 8 to go and every other process is done, you're operating suboptimally).
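Putting it together, here is a self-contained sketch; process_line is a hypothetical stub standing in for the real per-line work, and GNU xargs is assumed for -d and -P:
# Stub for illustration only; replace with the real per-line work.
process_line() { printf 'handled: %s\n' "$1"; }
export -f process_line
# Generate 100 sample lines and process them 10 at a time, 8 shells in parallel.
printf '%s\n' line{1..100} |
  xargs -d $'\n' -n 10 -P 8 \
    bash -c 'for line in "$@"; do process_line "$line"; done' _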

How to run given function in Bash in parallel?

There have been some similar questions, but my problem is not "run several programs in parallel" - which can be trivially done with parallel or xargs.
I need to parallelize Bash functions.
Let's imagine code like this:
for i in "${list[@]}"
do
for j in "${other[@]}"
do
# some processing in here - 20-30 lines of almost pure bash
done
done
Some of the processing requires calls to external programs.
I'd like to run some (4-10) tasks, each running for different $i. Total number of elements in $list is > 500.
I know I can put the whole for j ... done loop in external script, and just call this program in parallel, but is it possible to do without splitting the functionality between two separate programs?
sem is part of GNU Parallel and is made for this kind of situation.
for i in "${list[@]}"
do
for j in "${other[@]}"
do
# some processing in here - 20-30 lines of almost pure bash
sem -j 4 dolong task
done
done
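If the script continues after the loop, it should also wait for the queued jobs to finish before moving on; sem has a flag for exactly that:
sem --wait   # block until all jobs started with sem have completed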
If you like the function approach better, GNU Parallel can do the dual for loop in one go:
dowork() {
echo "Starting i=$1, j=$2"
sleep 5
echo "Done i=$1, j=$2"
}
export -f dowork
parallel dowork ::: "${list[@]}" ::: "${other[@]}"
Edit: Please consider Ole's answer instead.
Instead of a separate script, you can put your code in a separate bash function. You can then export it, and run it via xargs:
#!/bin/bash
dowork() {
sleep $((RANDOM % 10 + 1))
echo "Processing i=$1, j=$2"
}
export -f dowork
for i in "${list[@]}"
do
for j in "${other[@]}"
do
printf "%s\0%s\0" "$i" "$j"
done
done | xargs -0 -n 2 -P 4 bash -c 'dowork "$@"' --
An efficient solution that can also run multi-line commands in parallel:
for ...your_loop...; do
if test "$(jobs | wc -l)" -ge 8; then
wait -n
fi
{
command1
command2
...
} &
done
wait
In your case:
for i in "${list[@]}"
do
for j in "${other[@]}"
do
if test "$(jobs | wc -l)" -ge 8; then
wait -n
fi
{
your
commands
here
} &
done
done
wait
If there are 8 bash jobs already running, wait will wait for at least one job to complete. If/when there are fewer jobs, it starts new ones asynchronously.
Benefits of this approach:
It's very easy to use with multi-line commands. All your variables are automatically "captured" in scope; there is no need to pass them around as arguments.
It's relatively fast. Compare this, for example, to parallel (I'm quoting official man):
parallel is slow at starting up - around 250 ms the first time and 150 ms after that.
Only needs bash to work.
Downsides:
There is a possibility that there were 8 jobs when we counted them, but fewer when we started waiting. (It happens if a job finishes in those milliseconds between the two commands.) This can make us wait with fewer jobs running than intended. However, it will resume when at least one job completes, or immediately if there are 0 jobs running (wait -n exits immediately in this case).
If you already have some commands running asynchronously (&) within the same bash script, you'll have fewer worker processes in the loop.
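For completeness, here is a self-contained, runnable version of this pattern; the sample arrays and the sleep are placeholders, and wait -n needs bash 4.3 or newer:
#!/bin/bash
list=(a b c d e)   # placeholder data
other=(1 2 3)
for i in "${list[@]}"; do
  for j in "${other[@]}"; do
    if test "$(jobs | wc -l)" -ge 8; then
      wait -n   # at least one of the 8 background jobs has to finish first
    fi
    {
      sleep 1                   # stand-in for the real multi-line work
      echo "done i=$i j=$j"
    } &
  done
done
wait   # wait for the remaining jobs before exiting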
