I have a for loop and I want to process it 4 times in parallel at a time.
I tried the following code from the page https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop:
task(){
sleep 0.5; echo "$1";
}
N=4
(
for thing in a b c d e f g; do
((i=i%N)); ((i++==0)) && wait
task "$thing" &
done
)
I have saved the above code as test.sh; the output I get is as follows:
path$ ./test.sh
a
b
c
d
path$ e
f
g
and the cursor doesn't come back to my terminal after 'g'; it waits/sleeps indefinitely. I want the cursor to come back to my terminal, and I also don't understand why the output 'e' has my prompt (path) preceding it. Shouldn't the output be displayed as 'a' through 'g' continuously, and then the script exit?
It's pretty hard to understand what you want, but I think you want to do 7 things, called a, b, c, ..., g, in parallel, with no more than 4 instances running at a time.
If so, you could try this:
echo {a..g} | xargs -P4 -n1 bash -c 'echo "$1"; sleep 2' {}
That sends the letters a..g into xargs, which starts a new bash shell for each letter, passing one letter at a time (-n1) as an extra argument; the literal {} just serves as $0 for the shell. Each bash shell picks up its letter as its first parameter ($1), echoes it, then waits 2 seconds before exiting - so you can see the pause.
The -P4 tells xargs to run 4 instances of bash at a time in parallel.
With -P4 it runs in groups of 4; with -P2 it would do 2 at a time.
Or, more simply, if you don't mind spending 10 seconds installing GNU Parallel:
parallel -j4 -k 'echo {}; sleep 2' ::: {a..g}
If you press Enter you can see that you are back in a normal shell. If you want the script to wait for the batched processes before exiting, just add wait at the end of the script.
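For example, a minimal sketch of the question's test.sh with that final wait added (same logic as the original, plus one extra line inside the subshell):
task(){
  sleep 0.5; echo "$1";
}
N=4
(
for thing in a b c d e f g; do
  ((i=i%N)); ((i++==0)) && wait
  task "$thing" &
done
wait   # block until the last batch of background tasks finishes
)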
(I have searched and expected this question to have been asked before but couldn't find anything like this although there are plenty of similar questions)
I want this for-loop to run in 3 different threads/processes, and wait seems to be the right command.
for file in 1.txt 2.txt 3.txt 4.txt 5.txt
do something lengthy &
i=$((i + 1))
wait $!
done
But this construct, I guess, just starts one background job and then waits until it is done before it starts the next one. I could place wait outside the loop, but how do I then
Access the PIDs?
Limit it to 3 threads?
The jobs builtin can list the currently running background jobs, so you can use that to limit how many you create. To limit your jobs to three, try something like this:
for file in 1.txt 2.txt 3.txt 4.txt 5.txt; do
if [ $(jobs -r | wc -l) -ge 3 ]; then
wait $(jobs -r -p | head -1)
fi
# Start a slow background job here:
(echo Begin processing $file; sleep 10; echo Done with $file)&
done
wait # wait for the last jobs to finish
GNU Parallel might be worth a look.
My first attempt,
parallel -j 3 'bash -c "sleep {}; echo {};"' ::: 4 1 2 5 3
can, according to the inventor of parallel, be shortened to
parallel -j3 sleep {}\; echo {} ::: 4 1 2 5 3
1
2
4
3
5
and quoting the semicolon instead of escaping it, which is friendlier to type, like this:
parallel -j3 sleep {}";" echo {} ::: 4 1 2 5 3
works too.
It doesn't look trivial, and I have only tested it twice so far, once of those to answer this question. parallel --help points to a source with more info; the man page is a little bit shocking. :)
parallel -j 3 "something lengthy {}" ::: {1..5}.txt
might work, depending on whether something lengthy is a program (fine) or just bash code (AFAIK, you can't just call a bash function in parallel with parallel).
On xUbuntu-Linux 16.04, parallel wasn't installed by default, but it is in the repo.
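If it's missing on a Debian/Ubuntu-style system, installing it from the repo is typically just (assuming the package is named parallel, as it is in the Ubuntu archives):
sudo apt-get install parallel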
Building on Rob Davis' answer:
#!/bin/bash
qty=3
for file in 1.txt 2.txt 3.txt 4.txt 5.txt; do
while [ `jobs -r | wc -l` -ge $qty ]; do
sleep 1
# jobs #(if you want an update every second on what is running)
done
echo -n "Begin processing $file"
something_lengthy $file &
echo $!
done
wait
You can use a subshell approach, for example:
( (sleep 10) &
p1=$!
(sleep 20) &
p2=$!
(sleep 15) &
p3=$!
wait
echo "all finished ..." )
Note that wait waits for all children inside the subshell. You can use the modulo operator (%) with 3 and use the remainder to check for the 1st, 2nd and 3rd process IDs (if needed), or use it to run 3 parallel threads at a time.
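A minimal sketch of that modulo idea, assuming a hypothetical long_task command in place of the real work, waiting after every batch of 3:
i=0
for n in 1 2 3 4 5 6 7; do
  long_task "$n" &               # hypothetical placeholder for the real job
  i=$((i + 1))
  [ $((i % 3)) -eq 0 ] && wait   # block after every 3rd job
done
wait   # catch whatever is left from the final, possibly partial, batch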
Hope this helps.
An input file "input.txt" contains 10 rows of echo commands. Is it possible to process 5 rows at a time? Once a row's command completes, the next row in the file should run.
e.g.
$ cat input.txt
echo command 1
echo command 2
echo command 3
echo command 4
echo command 5
echo command 6
echo command 7
echo command 8
echo command 9
echo command 10
I realize these are simple commands; the ultimate idea is to run up to 5 rows of commands at a time, and once each one completes successfully, a new command from the input file would start.
Use parallel:
$ cat input.txt | parallel -j5
cat input.txt | xargs -P5 -i bash -c "{}" certainly works for most cases.
xargs -P5 -i bash -c "{}" <input.txt suggested by David below is probably better, and I'd imagine there are simple ways of avoiding the explicit bash usage as well.
Just to break this down a bit xargs breaks up input in ways you can specify. In this case, the -i and {} tells it WHERE you want the broken up input and implicitly tells it to only use one piece of input for each command. The -P5 tells it to run up to 5 commands in parallel.
By most cases, I mean commands that don't rely on having variables passed to them or other complicating factors.
Of course, when running 5 commands at a time, command 5 can complete before command 1. If the order matters, you can group commands together:
echo 2;sleep 1
(And the grouped sleep is also pretty useful for testing it to make sure it's behaving how you're expecting.)
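A quick way to try that grouping out, assuming GNU xargs and mirroring the command above (the sleeps make the batches of 5 visible):
# generate ten grouped commands, one per line, then run them 5 at a time
printf 'echo command %s; sleep 1\n' 1 2 3 4 5 6 7 8 9 10 > input.txt
xargs -P5 -i bash -c "{}" < input.txt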
I'm trying to figure out how to perform the laziest possible processing of a standard UNIX shell pipeline. For example, let's say I have a command which does some calculations and produces output along the way, but the calculations get more and more expensive, so that the first few lines of output arrive quickly but subsequent lines get slower. If I'm only interested in the first few lines then I want to obtain those via lazy evaluation, terminating the calculations ASAP before they get too expensive.
This can be achieved with a straight-forward shell pipeline, e.g.:
./expensive | head -n 2
However this does not work optimally. Let's simulate the calculations with a script which gets exponentially slower:
#!/bin/bash
i=1
while true; do
echo line $i
sleep $(( i ** 4 ))
i=$(( i+1 ))
done
Now when I pipe this script through head -n 2, I observe the following:
line 1 is output.
After sleeping one second, line 2 is output.
Despite head -n 2 having already received two (\n-terminated) lines and exiting, expensive carries on running and now waits a further 16 seconds (2 ** 4) before completing, at which point the pipeline also completes.
Obviously this is not as lazy as desired, because ideally expensive would terminate as soon as the head process has received two lines. However, this does not happen; IIUC it actually terminates after trying to write its third line, because at that point it tries to write to its STDOUT, which is connected through a pipe to the STDIN of the head process, which has already exited and is therefore no longer reading from the pipe. This causes expensive to receive a SIGPIPE, which causes the bash interpreter running the script to invoke its SIGPIPE handler, which by default terminates the script (although this can be changed via the trap command).
So the question is, how can I make it so that expensive quits immediately when head quits, not just when expensive tries to write its third line to a pipe which no longer has a listener at the other end? Since the pipeline is constructed and managed by the interactive shell process I typed the ./expensive | head -n 2 command into, presumably that interactive shell is the place where any solution for this problem would lie, rather than in any modification of expensive or head? Is there any native trick or extra utility which can construct pipelines with the behaviour I want? Or maybe it's simply impossible to achieve what I want in bash or zsh, and the only way would be to write my own pipeline manager (e.g. in Ruby or Python) which spots when the reader terminates and immediately terminates the writer?
If all you care about is foreground control, you can run expensive in a process substitution; it still blocks until it next tries to write, but head exits immediately (and your script's flow control can continue) after it has received its input:
head -n 2 < <(exec ./expensive)
# expensive still runs 16 seconds in the background, but doesn't block your program
In bash 4.4, these store their PIDs in $! and allow process management in the same manner as other background processes.
# REQUIRES BASH 4.4 OR NEWER
exec {expensive_fd}< <(exec ./expensive); expensive_pid=$!
head -n 2 <&"$expensive_fd" # read the content we want
exec {expensive_fd}<&- # close the descriptor
kill "$expensive_pid" # and kill the process
Another approach is a coprocess, which has the advantage of only requiring bash 4.0:
# magic: store stdin and stdout FDs in an array named "expensive", and PID in expensive_PID
coproc expensive { exec ./expensive; }
# read two lines from input FD...
head -n 2 <&"${expensive[0]}"
# ...and kill the process.
kill "$expensive_PID"
I'll answer with a POSIX shell in mind.
What you can do is use a fifo instead of a pipe and kill the first link the moment the second finishes.
If the expensive process is a leaf process or if it takes care of killing its children, you can use a simple kill. If it's a process-spawning shell script, you should run it in a process group (doable with set -m) and kill it with a process-group kill.
Example code:
#!/bin/sh -e
expensive()
{
i=1
while true; do
echo line $i
sleep 0.$i #sped it up a little
echo >&2 slept
i=$(( i+1 ))
done
}
echo >&2 NORMAL
expensive | head -n2
#line 1
#slept
#line 2
#slept
echo >&2 SPED-UP
mkfifo pipe
exec 3<>pipe
rm pipe
set -m; expensive >&3 & set +m
<&3 head -n 2
kill -- -$!
#line 1
#slept
#line 2
If you run this, the second run should not have the second slept line, meaning the first link was killed the moment head finished, not when the first link tried to output after head was finished.
All,
I am trying to run a bash script that kicks off several sub-processes. The processes redirect to their own log files, and I must kick them off in parallel. To do this I have written a check_procs function that monitors the number of processes whose parent PID is the script's PID. Once the number reaches 1 again, the script should continue. However, it seems to just hang. I am not sure why, but the code is below:
check_procs() {
while true; do
mypid=$$
backup_procs=`ps -eo ppid | grep -w $mypid | wc -w`
until [ $backup_procs == 1 ]; do
echo $backup_procs
sleep 5
backup_procs=`ps -eo ppid | grep -w $mypid | wc -w`
done
done
}
This function is called after the processes are kicked off, and I can see it echoing the number of processes, but then the echoing stops (suggesting the until loop has finished, since the process count is now 1). But then nothing happens, and I can see the script is still in the process list on the server; I have to kill it manually. The part where the function is called is below:
for ((i=1; i <= $threads; i++)); do
<Some trickery here to generate $cmdfile and $logfile>
nohup rman target / cmdfile=$cmdfile log=$logfile &
x=$(($x+1))
done
check_procs
$threads is a command-line parameter passed to the script and is a small number like 4 or 6. The processes are kicked off using nohup, as shown. When the until condition in check_procs is satisfied, everything hangs instead of executing the remainder of the script. What's wrong with my function?
Maybe I'm mistaken, but isn't that expected? Your outer while true loop runs forever; there is no exit point. The inner until loop finishes once the process count drops to 1, but unless the count increases again, the outer loop just keeps spinning indefinitely (and without any delay, which is not recommended).
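A minimal sketch of check_procs with the outer loop removed, keeping the question's ps-based counting (and assuming, as in the question, that a count of 1 means all sub-processes are done):
check_procs() {
  mypid=$$
  backup_procs=$(ps -eo ppid | grep -w "$mypid" | wc -w)
  until [ "$backup_procs" -le 1 ]; do
    echo "$backup_procs"
    sleep 5
    backup_procs=$(ps -eo ppid | grep -w "$mypid" | wc -w)
  done
}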
There have been some similar questions, but my problem is not "run several programs in parallel" - which can be trivially done with parallel or xargs.
I need to parallelize Bash functions.
Let's imagine code like this:
for i in "${list[#]}"
do
for j in "${other[#]}"
do
# some processing in here - 20-30 lines of almost pure bash
done
done
Some of the processing requires calls to external programs.
I'd like to run some (4-10) tasks, each running for different $i. Total number of elements in $list is > 500.
I know I can put the whole for j ... done loop in an external script and just call this program in parallel, but is it possible to do it without splitting the functionality between two separate programs?
sem is part of GNU Parallel and is made for this kind of situation.
for i in "${list[#]}"
do
for j in "${other[#]}"
do
# some processing in here - 20-30 lines of almost pure bash
sem -j 4 dolong task
done
done
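A slightly more concrete sketch of the same idea, where process_pair is a hypothetical stand-in for the per-iteration work and sem --wait at the end blocks until every queued job has finished:
for i in "${list[@]}"; do
  for j in "${other[@]}"; do
    sem -j 4 process_pair "$i" "$j"   # queue the job, at most 4 at once
  done
done
sem --wait   # wait for all jobs started via sem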
If you like the function approach better, GNU Parallel can do the dual for loop in one go:
dowork() {
echo "Starting i=$1, j=$2"
sleep 5
echo "Done i=$1, j=$2"
}
export -f dowork
parallel dowork ::: "${list[@]}" ::: "${other[@]}"
Edit: Please consider Ole's answer instead.
Instead of a separate script, you can put your code in a separate bash function. You can then export it, and run it via xargs:
#!/bin/bash
dowork() {
sleep $((RANDOM % 10 + 1))
echo "Processing i=$1, j=$2"
}
export -f dowork
for i in "${list[#]}"
do
for j in "${other[#]}"
do
printf "%s\0%s\0" "$i" "$j"
done
done | xargs -0 -n 2 -P 4 bash -c 'dowork "$#"' --
An efficient solution that can also run multi-line commands in parallel:
for ...your_loop...; do
  if test "$(jobs | wc -l)" -ge 8; then
    wait -n
  fi
  {
    command1
    command2
    ...
  } &
done
wait
In your case:
for i in "${list[#]}"
do
for j in "${other[#]}"
do
if test "$(jobs | wc -l)" -ge 8; then
wait -n
fi
{
your
commands
here
} &
done
done
wait
If there are 8 bash jobs already running, wait -n will wait for at least one job to complete. If/when there are fewer jobs, the loop starts new ones asynchronously.
Benefits of this approach:
It's very easy for multi-line commands. All your variables are automatically "captured" in scope, no need to pass them around as arguments
It's relatively fast. Compare this, for example, to parallel (I'm quoting official man):
parallel is slow at starting up - around 250 ms the first time and 150 ms after that.
Only needs bash to work.
Downsides:
There is a possibility that there were 8 jobs when we counted them, but fewer when we started waiting. (It happens if a job finishes in those milliseconds between the two commands.) This can make us wait with fewer jobs than required. However, it will resume when at least one job completes, or immediately if there are 0 jobs running (wait -n exits immediately in this case).
If you already have some commands running asynchronously (&) within the same bash script, you'll have fewer worker processes in the loop.