There have been some similar questions, but my problem is not "run several programs in parallel" - which can be trivially done with parallel or xargs.
I need to parallelize Bash functions.
Let's imagine code like this:
for i in "${list[#]}"
do
for j in "${other[#]}"
do
# some processing in here - 20-30 lines of almost pure bash
done
done
Some of the processing requires calls to external programs.
I'd like to run some (4-10) tasks, each running for a different $i. The total number of elements in $list is > 500.
I know I can put the whole for j ... done loop in an external script, and just call this program in parallel, but is it possible to do this without splitting the functionality between two separate programs?
sem is part of GNU Parallel and is made for this kind of situation.
for i in "${list[#]}"
do
for j in "${other[#]}"
do
# some processing in here - 20-30 lines of almost pure bash
sem -j 4 dolong task
done
done
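Note that sem only throttles and queues the jobs; to make the script block until the last queued job has finished, end it with sem --wait. A minimal sketch, assuming the long-running part is wrapped in a hypothetical exported function dolong_task (an external command would work the same way):
dolong_task() {
    # hypothetical stand-in for the 20-30 lines of processing
    echo "working on i=$1 j=$2"
    sleep 2
}
export -f dolong_task

for i in "${list[@]}"; do
    for j in "${other[@]}"; do
        sem -j4 dolong_task "$i" "$j"   # queues the job, at most 4 at a time
    done
done
sem --wait   # block until every queued job has finished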
If you prefer using a function, GNU Parallel can do the double for loop in one go:
dowork() {
    echo "Starting i=$1, j=$2"
    sleep 5
    echo "Done i=$1, j=$2"
}
export -f dowork
parallel dowork ::: "${list[@]}" ::: "${other[@]}"
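The two ::: argument lists are combined as a Cartesian product, so every i is paired with every j. For example, with two hypothetical two-element arrays this launches four jobs:
list=(1 2)
other=(a b)
export -f dowork
parallel dowork ::: "${list[@]}" ::: "${other[@]}"
# runs: dowork 1 a, dowork 1 b, dowork 2 a, dowork 2 b (completion order may vary)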
Edit: Please consider Ole's answer instead.
Instead of a separate script, you can put your code in a separate bash function. You can then export it, and run it via xargs:
#!/bin/bash
dowork() {
    sleep $((RANDOM % 10 + 1))
    echo "Processing i=$1, j=$2"
}
export -f dowork

for i in "${list[@]}"
do
    for j in "${other[@]}"
    do
        printf "%s\0%s\0" "$i" "$j"
    done
done | xargs -0 -n 2 -P 4 bash -c 'dowork "$@"' --
An efficient solution that can also run multi-line commands in parallel:
for ...your_loop...; do
    if test "$(jobs | wc -l)" -ge 8; then
        wait -n
    fi
    {
        command1
        command2
        ...
    } &
done
wait
In your case:
for i in "${list[#]}"
do
for j in "${other[#]}"
do
if test "$(jobs | wc -l)" -ge 8; then
wait -n
fi
{
your
commands
here
} &
done
done
wait
If there are already 8 bash jobs running, wait -n waits for at least one of them to complete. When there are fewer jobs, the loop starts new ones asynchronously.
Benefits of this approach:
It's very easy for multi-line commands. All your variables are automatically "captured" in scope, so there is no need to pass them around as arguments.
It's relatively fast. Compare this, for example, to parallel (I'm quoting official man):
parallel is slow at starting up - around 250 ms the first time and 150 ms after that.
It only needs bash to work.
Downsides:
There is a possibility that there were 8 jobs when we counted them, but fewer by the time we started waiting (this happens if a job finishes in the milliseconds between the two commands). This can make us wait with fewer jobs running than intended. However, the loop resumes when at least one job completes, or immediately if there are 0 jobs running (wait -n exits immediately in this case).
If you already have some commands running asynchronously (&) within the same bash script, you'll have fewer worker processes in the loop.
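A self-contained sketch of the pattern (the task body is arbitrary filler; note that wait -n requires bash 4.3 or newer):
#!/bin/bash
for i in {1..20}; do
    if test "$(jobs | wc -l)" -ge 8; then
        wait -n                       # bash 4.3+: block until any one job exits
    fi
    {
        echo "task $i starting"
        sleep $((RANDOM % 3 + 1))     # stand-in for the real multi-line work
        echo "task $i done"
    } &
done
wait   # wait for the remaining jobs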
(I have searched and expected this question to have been asked before, but couldn't find anything like this, although there are plenty of similar questions.)
I want this for-loop to run in 3 different threads/processes, and wait seems to be the right command:
for file in 1.txt 2.txt 3.txt 4.txt 5.txt
do
    something lengthy &
    i=$((i + 1))
    wait $!
done
But this construct, I guess, just starts one thread and then waits until it is done before it starts the next. I could place wait outside the loop, but how do I then
Access the pids?
Limit it to 3 threads?
The jobs builtin can list the currently running background jobs, so you can use that to limit how many you create. To limit your jobs to three, try something like this:
for file in 1.txt 2.txt 3.txt 4.txt 5.txt; do
    if [ "$(jobs -r | wc -l)" -ge 3 ]; then
        wait $(jobs -r -p | head -1)
    fi
    # Start a slow background job here:
    (echo "Begin processing $file"; sleep 10; echo "Done with $file") &
done
wait # wait for the last jobs to finish
wait # wait for the last jobs to finish
GNU Parallel might be worth a look.
My first attempt,
parallel -j 3 'bash -c "sleep {}; echo {};"' ::: 4 1 2 5 3
can, according to the inventor of parallel, be shortened to
parallel -j3 sleep {}\; echo {} ::: 4 1 2 5 3
1
2
4
3
5
and quoting the semicolon, which is friendlier to type, like this:
parallel -j3 sleep {}";" echo {} ::: 4 1 2 5 3
works too.
It doesn't look trivial, and I have only tested it twice so far, once to answer this question. parallel --help points to a source with more info; the man page is a little bit shocking. :)
parallel -j 3 "something lengthy {}" ::: {1..5}.txt
might work, depending on something lengthy being a program (fine) or just bash code (afaik, you can't just call a bash function in parallel with parallel — though see the sketch below).
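That said, the export -f trick from an earlier answer does appear to let parallel run a shell function; a sketch with a hypothetical something_lengthy function:
something_lengthy() {
    sleep 1
    echo "processed $1"
}
export -f something_lengthy   # make the function visible to the shells parallel spawns
parallel -j3 something_lengthy ::: {1..5}.txt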
On Xubuntu Linux 16.04, parallel wasn't installed by default, but it is in the repo.
Building on Rob Davis' answer:
#!/bin/bash
qty=3
for file in 1.txt 2.txt 3.txt 4.txt 5.txt; do
    while [ "$(jobs -r | wc -l)" -ge "$qty" ]; do
        sleep 1
        # jobs #(if you want an update every second on what is running)
    done
    echo -n "Begin processing $file"
    something_lengthy "$file" &
    echo $!
done
wait
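If your bash is 4.3 or newer, you can replace the one-second polling with wait -n and react as soon as a slot frees up; a sketch (something_lengthy is the hypothetical command from the question):
qty=3
for file in 1.txt 2.txt 3.txt 4.txt 5.txt; do
    while [ "$(jobs -r | wc -l)" -ge "$qty" ]; do
        wait -n                    # returns as soon as any background job exits
    done
    something_lengthy "$file" &
done
wait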
You can use a subshell approach, for example:
( (sleep 10) &
p1=$!
(sleep 20) &
p2=$!
(sleep 15) &
p3=$!
wait
echo "all finished ..." )
Note that wait waits for all children inside the subshell. You can use the modulo operator (%) with 3 and use the remainder to check the 1st, 2nd and 3rd process id (if needed), or use it to run 3 parallel threads, as sketched below.
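A minimal sketch of the modulo idea, batching in groups of 3 (long_task and the args array are hypothetical):
i=0
for arg in "${args[@]}"; do
    long_task "$arg" &        # hypothetical long-running command
    i=$((i + 1))
    if ((i % 3 == 0)); then
        wait                  # block until the current batch of 3 finishes
    fi
done
wait                          # catch any leftovers from a partial batch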
Hope this helps.
I have a for loop and I want to process it 4 times in parallel at a time.
I tried the following code from the page https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop:
task() {
    sleep 0.5; echo "$1";
}

N=4
(
for thing in a b c d e f g; do
    ((i=i%N)); ((i++==0)) && wait
    task "$thing" &
done
)
I have stored the above file as test.sh, and the output I get is as follows:
path$ ./test.sh
a
b
c
d
path$ e
f
g
and the cursor doesn't come back to my terminal after 'g'; it waits/sleeps indefinitely. I want the cursor to come back to my terminal, and I also don't understand why the output 'e' has my prompt preceding it. Shouldn't the output be displayed as 'a' to 'g' continuously, and then the script stop?
It's pretty hard to understand what you want, but I think you want to do 7 things, called a, b, c ... g, in parallel, with no more than 4 instances at a time.
If so, you could try this:
echo {a..g} | xargs -P4 -n1 bash -c 'echo "$1"; sleep 2' {}
That sends the letters a..g into xargs which then starts a new bash shell for each letter, passing one letter (-n1) to the shell as {}. The bash shell picks up the parameter (its first parameter being $1) and echoes it then waits 2 seconds before exiting - so you can see the pause.
The -P4 tells xargs to run 4 instances of bash at a time in parallel.
(In a short screen recording of it running, the first sequence uses -P4 and runs in groups of 4; the second uses -P2 and does 2 at a time.)
Or, more simply, if you don't mind spending 10 seconds installing GNU Parallel:
parallel -j4 -k 'echo {}; sleep 2' ::: {a..g}
If you press Enter you can see that you are back in a normal shell. If you want to wait for the batched processes before the script exits, just add wait at the end of the script, as shown below.
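In other words, the fixed test.sh would look like this (same task function, with wait added before the subshell closes):
#!/bin/bash
task() {
    sleep 0.5; echo "$1"
}

N=4
(
for thing in a b c d e f g; do
    ((i=i%N)); ((i++==0)) && wait
    task "$thing" &
done
wait   # added: block until the last batch finishes before the subshell exits
)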
The short version: My bash script has a function.
This function then launches several instances (a maximum of 10) of another function in the background (with &).
I keep a count of how many are still active with jobs -p | wc -w in a do loop. When I'm done with the loop, I break.
I then use wait to ensure that all those processes terminate before continuing.
However, when I check the count (with jobs -p) I sometimes find this:
[10] 9311 Done my_background_function_name $param
How can I get wait to only proceed when all the launched child-processes have completely terminated and the jobs list is empty?
Why are jobs sometimes shown with "Done" and sometimes not?
Clearly, my knowledge of how jobs works is deficient. :)
Thanks.
Inside a bash script, it seems that when all jobs have ended, jobs -p still returns the last one that finished.
This works for me in bash:
while true; do
    sleep 5
    jobs_running=($(jobs -l | grep Running | awk '{print $2}'))
    if [ "${#jobs_running[@]}" -eq 0 ]; then
        break
    fi
    echo "Jobs running: ${jobs_running[@]}"
done
Using the "wait" command you cannot tell when each process ends.
With the previous algorithm you can.
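For completeness, newer bash versions narrow this gap: wait -n (bash 4.3+) returns as soon as any one job finishes, and bash 5.1 adds -p to report which one. A sketch, with long_task as a hypothetical command:
long_task & long_task & long_task &     # hypothetical background jobs
while (( $(jobs -rp | wc -l) > 0 )); do
    wait -n -p finished_pid             # -p requires bash 5.1+
    echo "Job $finished_pid ended"
done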
I am writing a shell script in which a command runs and takes 2 minutes every time, and there is nothing we can do about that. But if I want to run this command 100 times in the script, the total time would be 200 minutes, and that is a big issue; nobody wants to wait 200 minutes. What I want is to run all 100 commands in parallel, so that the output comes in 2 minutes (or maybe a bit more), not 200 minutes.
Any help with this, in any way, will be appreciated.
GNU Parallel is what you want, unless you want to reinvent the wheel. Here are some more detailed examples, but the short of it:
ls | parallel gzip # gzip all files in a directory
... run all 100 commands parallely so that output will come in 2min
This is only possible if you have 100 processors on your system (and the command is CPU-bound rather than waiting on I/O).
There's no dedicated shell construct for running commands in parallel. What you can do is run your command in the background:
for ((i=0;i<200;i++))
do
MyCommand &
done
With & (background), each execution is scheduled as soon as possible. But this doesn't guarantee that your code will be executed in less 200 min. It depends how many processors are there on your system.
If you have only one processor and each execution of the command (which takes 2 min) is computing for those 2 min, then the processor is always busy and no cycles are wasted. In this case, running the commands in parallel won't help, because there's only one processor and it's not free either; the processes will just be waiting for their turn to execute.
If you have more than one processor, then the above method (for loop) might help in reducing the total execution time, as sketched below.
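Either way, you will usually want the script itself to block until every background copy has finished, which is just a wait after the loop (MyCommand stands in for the 2-minute command from the question):
for ((i = 0; i < 100; i++))
do
    MyCommand &     # hypothetical 2-minute command from the question
done
wait                # block until all background copies have exited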
As @KingsIndian said, you can background tasks, which sort of lets them run in parallel. Beyond this, you can also keep track of them by process ID:
#!/bin/bash
# Function to be backgrounded
track() {
    sleep "$1"
    printf "\nFinished: %d\n" "$1"
}
start=$(date '+%s')
rand3="$(jot -s\ -r 3 5 10)"
# If you don't have `jot` (*BSD/OSX), substitute your own numbers here.
#rand3="5 8 10"
echo "Random numbers: $rand3"
# Make an associative array in which you'll record pids.
declare -A pids
# Background an instance of the track() function for each number, record the pid.
for n in $rand3; do
    track "$n" &
    pid=$!
    echo "Backgrounded: $n (pid=$pid)"
    pids[$pid]=$n
done
# Watch your stable of backgrounded processes.
# If a pid goes away, remove it from the array.
while [ -n "${pids[*]}" ]; do
sleep 1
for pid in "${!pids[#]}"; do
if ! ps "$pid" >/dev/null; then
unset pids[$pid]
echo "unset: $pid"
fi
done
if [ -z "${!pids[*]}" ]; then
break
fi
printf "\rStill waiting for: %s ... " "${pids[*]}"
done
printf "\r%-25s \n" "Done."
printf "Total runtime: %d seconds\n" "$((`date '+%s'` - $start))"
You should also take a look at the Bash documentation on coprocesses.
I want a simple testing shell script that launches a program N times in parallel and saves each different output to a different file. I have made a start that launches the program in parallel and saves the output, but how can I keep only the outputs that are different? Also, how can I make the echo DONE! actually indicate the end?
#!/bin/bash
N=10
for((i=1; j<=$N; ++i)); do
./test > output-$N &
done
echo DONE!
You'll want to use the wait builtin.
wait [n ...]
Wait for each specified process and return its termination status. Each n may be a process ID or a job specification; if a job spec is given, all processes in that job's pipeline are waited for. If n is not given, all currently active child processes are waited for, and the return status is zero. If n specifies a non-existent process or job, the return status is 127. Otherwise, the return status is the exit status of the last process or job waited for.
You could specify your jobs as %1, %2, ...:
wait %1 %2 %3 ...
but as long as you have no other child processes, you can just use it with no arguments; it'll then wait for all child processes to finish:
for ...; do
    ...
done
wait
echo "All done!"
Your separate question, how to keep only different outputs, is a little trickier. What exactly do you mean - different from what? If you have a baseline, you could do this:
for ...; do
    if diff -q "$this_output" "$base_output" >/dev/null; then
        # files are identical
        rm "$this_output"
    fi
done
If you want to keep all unique outputs, the algorithm is a little more complex, obviously, but you could still use diff -q (or checksums, as sketched below) to test for identical output.
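One way to sketch that: hash each output and keep only the first file per distinct hash (assuming the output-* naming from the question; cksum could be swapped for md5sum or sha1sum):
declare -A seen   # checksum -> first file that produced it
for f in output-*; do
    sum=$(cksum < "$f")
    if [ -n "${seen[$sum]}" ]; then
        rm "$f"               # identical to a file we already kept
    else
        seen[$sum]=$f
    fi
done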
In order to have your output indicate that all the processes are finished, you need to call wait:
#!/bin/bash
N=10
for((i=1; j<=$N; ++i)); do
./test > output-$N &
done
wait # wait until all jobs are finished
echo DONE!
With GNU Parallel http://www.gnu.org/software/parallel/ you can do:
/tmp/test > base; seq 1 10 | parallel -k "/tmp/test >output-{}; if diff -q output-{} base; then rm {}; fi" ; echo DONE
GNU Parallel is useful for other stuff. Watch the intro video to GNU Parallel: http://www.youtube.com/watch?v=OpaiGYxkSuQ