How to parallelize a for-loop in bash while limiting the number of processes

I have a bash script similar to:
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
done
What's the most straightforward way to limit the number of parallel processes to NUM_PROCS? I'm looking for a solution that doesn't require packages/installations/modules (like GNU Parallel) if possible.
When I tried Charles Duffy's latest approach, I got the following error from bash -x:
+ python run.py args 1
+ python run.py ... 3
+ python run.py ... 4
+ python run.py ... 2
+ read -r line
+ python run.py ... 1
+ read -r line
+ python run.py ... 4
+ read -r line
+ python run.py ... 2
+ read -r line
+ python run.py ... 3
+ read -r line
+ python run.py ... 0
+ read -r line
... continuing with other numbers between 0 and 5, until too many processes were started for the system to handle and the bash script was shut down.

bash 4.4 will have an interesting new type of parameter expansion that simplifies Charles Duffy's answer.
#!/bin/bash
num_procs=$1
num_iters=$2
num_jobs="\j" # The prompt escape for number of jobs currently running
for ((i=0; i<num_iters; i++)); do
while (( ${num_jobs@P} >= num_procs )); do
wait -n
done
python foo.py "$i" arg2 &
done

GNU, macOS/OSX, FreeBSD and NetBSD can all do this with xargs -P, no bash versions or package installs required. Here's 4 processes at a time:
printf "%s\0" {1..10} | xargs -0 -I @ -P 4 python foo.py @ arg2

As a very simple implementation, depending on a version of bash new enough to have wait -n (to wait until only the next job exits, as opposed to waiting for all jobs):
#!/bin/bash
# ^^^^ - NOT /bin/sh!
num_procs=$1
num_iters=$2
declare -A pids=( )
for ((i=0; i<num_iters; i++)); do
while (( ${#pids[@]} >= num_procs )); do
wait -n
for pid in "${!pids[@]}"; do
kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
done
done
python foo.py "$i" arg2 & pids["$!"]=1
done
If running on a shell without wait -n, one can (very inefficiently) replace it with a command such as sleep 0.2, to poll every 1/5th of a second.
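A minimal sketch of that polling fallback, assuming only that the jobs builtin is available. The helper name throttle and the sleep 0.3 workload are hypothetical stand-ins, not part of the answer above:

```shell
#!/bin/bash
# Hypothetical helper: block until fewer than "$1" background jobs are running.
throttle() {
  local max=$1
  while (( $(jobs -pr | wc -l) >= max )); do
    sleep 0.2   # polling interval; fractional sleep is a GNU/BSD extension
  done
}

launched=0
for i in 1 2 3 4 5 6; do
  throttle 2                    # never more than 2 jobs at once
  sleep 0.3 &                   # stand-in for: python foo.py "$i" arg2 &
  launched=$(( launched + 1 ))
done
wait                            # wait for the jobs still running
echo "launched $launched jobs"
```

Polling trades responsiveness for portability: a slot may sit idle for up to one polling interval after a job exits.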
Since you're actually reading input from a file, another approach is to start N subprocesses, each of which processes only lines where (linenum % N == threadnum):
num_procs=$1
infile=$2
for ((i=0; i<num_procs; i++)); do
(
while read -r line; do
echo "Thread $i: processing $line"
done < <(awk -v num_procs="$num_procs" -v i="$i" \
'NR % num_procs == i { print }' <"$infile")
) &
done
wait # wait for all the $num_procs subprocesses to finish

A relatively simple way to accomplish this with only two additional lines of code. Explanation is inline.
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
let 'i>=NUM_PROCS-1' && wait -n # once $NUM_PROCS workers are running, wait for one to exit before spawning the next
done
wait # wait for all remaining workers

Are you aware that if you are allowed to write and run your own scripts, then you can also use GNU Parallel? In essence it is a Perl script in one single file.
From the README:
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
seq $2 | parallel -j$1 python foo.py {} arg2
parallel --embed (available since 20180322) even makes it possible to distribute GNU Parallel as part of a shell script (i.e. no extra files needed):
parallel --embed >newscript
Then edit the end of newscript.

This isn't the simplest solution, but if your version of bash doesn't have "wait -n" and you don't want to use other programs like parallel, awk etc, here is a solution using while and for loops.
num_iters=10
total_threads=4
iter=1
while [[ "$iter" -le "$num_iters" ]]; do
iters_remainder=$(echo "(${num_iters}-${iter})+1" | bc)
if [[ "$iters_remainder" -lt "$total_threads" ]]; then
threads=$iters_remainder
else
threads=$total_threads
fi
for ((t=1; t<="$threads"; t++)); do
(
: # do stuff (the ":" no-op keeps the subshell from being empty, which is a syntax error)
) &
((++iter))
done
wait
done

Related

bash: Wait for process substitution subshell to finish

How can bash wait for the subshell used in process substitution to finish in the following construct? (This is of course simplified from the real for loop and subshell which I am using, but it illustrates the intent well.)
for i in {1..3}; do
echo "$i"
done > >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"')
echo "Finished"
Prints:
Finished
Subshell: 1
Subshell: 2
Subshell: 3
Instead of:
Subshell: 1
Subshell: 2
Subshell: 3
Finished
How can I make bash wait for those subshells to complete?
UPDATE
The reason for using process substitution is that I'm wanting to use file descriptors to control what is printed to the screen and what is sent to the process. Here is a fuller version of what I'm doing:
for myFile in file1 file2 file3; do
echo "Downloading $myFile" # Should print to terminal
scp -q $user@$host:$myFile ./ # Might take a long time
echo "$myFile" >&3 # Should go to process substitution
done 3> >(xargs -n1 bash -c 'sleep 1; echo "Processing: $0"')
echo "Finished"
Prints:
Downloading file1
Downloading file2
Downloading file3
Finished
Processing: file1
Processing: file2
Processing: file3
Processing each may take much longer than the transfer. The file transfers should be sequential since bandwidth is the limiting factor. I would like to start processing each file after it is received without waiting for all of them to transfer. The processing can be done in parallel, but only with a limited number of instances (due to limited memory/CPU). So if the fifth file just finished transferring but only the second file has finished processing, the third and fourth files should complete processing before the fifth file is processed. Meanwhile the sixth file should start transferring.
Bash 4.4 lets you collect the PID of a process substitution with $!, so you can actually use wait, just as you would for a background process:
case $BASH_VERSION in ''|[123].*|4.[0123].*)
echo "ERROR: Bash 4.4 required" >&2; exit 1;;
esac
# open the process substitution
exec {ps_out_fd}> >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"'); ps_out_pid=$!
for i in {1..3}; do
echo "$i"
done >&$ps_out_fd
# close the process substitution
exec {ps_out_fd}>&-
# ...and wait for it to exit.
wait "$ps_out_pid"
Beyond that, consider flock-style locking -- though beware of races:
for i in {1..3}; do
echo "$i"
done > >(flock -x my.lock xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"')
# this is only safe if the "for" loop can't exit without the process substitution reading
# something (and thus signalling that it successfully started up)
flock -x my.lock echo "Lock grabbed; the subshell has finished"
That said, given your actual use case, what you want should presumably look more like:
download() {
local retval=0
for arg; do
scp -q "$user@$host:$arg" ./ || (( retval |= $? ))
done
exit "$retval"
}
export -f download
export user host # make these visible to the shells started by xargs
printf '%s\0' file1 file2 file3 |
xargs -0 -P2 -n1 bash -c 'download "$@"' _
You could have the subshell create a file that the main shell waits for:
tempfile=/tmp/finished.$$
for i in {1..3}; do
echo "$i"
done > >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"'; touch $tempfile)
while ! test -f $tempfile; do sleep 1; done
rm $tempfile
echo "Finished"
You can use a bash coproc to hold a readable file descriptor that is closed only once all of the process's children have exited:
coproc read # previously: `coproc cat`, see comments
for i in {1..3}; do
echo "$i"
done > >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"')
exec {COPROC[1]}>&- # close my writing side
read -u ${COPROC[0]} # will wait until all potential writers (ie process children) end
echo "Finished"
If this is to be run on a system where there is an attacker, you should not use a temp file name that can be guessed. So, based on @Barmar's solution, here is one that avoids that:
tempfile=$(mktemp)
for i in {1..3}; do
echo "$i"
done > >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"'; rm "$tempfile")
while test -f "$tempfile"; do sleep 1; done
echo "Finished"
I think you are making it more complicated than it needs to be. Something like this works because the inner bash invocations are direct children of the main process, so wait blocks until all of them have finished before "Finished" is printed.
for i in {1..3}
do
bash -c "sleep 1; echo Subshell: $i" &
done
wait
echo "Finished"
Unix and derivatives (Linux) have the ability to wait for child (sub) processes but not grandchild processes such as occurred in your original. Some would consider the polling solution where you go back and check for completion to be vulgar since it does not use this mechanism.
The solution where the xargs PID was captured was not vulgar, just too complicated.
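As a small illustration of waiting on direct children, here is a sketch that records each child's PID from $! and waits on them one by one, which also recovers each exit status. The workload (a subshell that fails for odd values of i) is a made-up stand-in:

```shell
#!/bin/bash
pids=()
for i in 1 2 3; do
  ( sleep 0.1; exit "$(( i % 2 ))" ) &   # hypothetical workload: odd i exits 1
  pids+=( "$!" )                         # remember the direct child's PID
done

failed=0
for pid in "${pids[@]}"; do
  # wait PID returns that child's exit status
  wait "$pid" || failed=$(( failed + 1 ))
done
echo "Finished; $failed job(s) failed"   # prints: Finished; 2 job(s) failed
```

This only works for processes the shell itself started; a grandchild's PID cannot be passed to wait.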

running each element in array in parallel in bash script

Let's say I have a bash script that looks like this:
array=( 1 2 3 4 5 6 )
for each in "${array[@]}"
do
echo "$each"
command --arg1 $each
done
If I want to run everything in the loop in parallel, I could just change command --arg1 $each to command --arg1 $each &.
But now let's say I want to take the results of command --arg1 $each and do something with those results like this:
array=( 1 2 3 4 5 6 )
for each in "${array[@]}"
do
echo "$each"
lags=($(command --arg1 $each))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
done
If I just add a & to the end of command --arg1 $each, everything after command --arg1 $each will run without command --arg1 $each finishing first. How do I prevent that from happening? Also, how do I also limit the amount of threads the loop can occupy?
Essentially, this block should run in parallel for 1,2,3,4,5,6
echo "$each"
lags=($(command --arg1 $each))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
-----EDIT--------
Here is the original code:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka.client.jaas.conf"
IFS=$'\n'
array=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --list --command-config /etc/kafka/client.properties --new-consumer))
lngth=${#array[*]}
echo "array length: " $lngth
timestamp=$(($(date +%s%N)/1000000))
log_time=`date +%Y-%m-%d:%H`
echo "log time: " $log_time
log_file="/home/ec2-user/laglogs/laglog.$log_time.log"
echo "log file: " $log_file
echo "timestamp: " $timestamp
get_lags () {
echo "$1"
lags=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --describe --group $1 --command-config /etc/kafka/client.properties --new-consumer))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
}
for each in "${array[@]}"
do
get_lags $each &
done
------EDIT 2-----------
Trying with answer below:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka.client.jaas.conf"
IFS=$'\n'
array=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --list --command-config /etc/kafka/client.properties --new-consumer))
lngth=${#array[*]}
echo "array length: " $lngth
timestamp=$(($(date +%s%N)/1000000))
log_time=`date +%Y-%m-%d:%H`
echo "log time: " $log_time
log_file="/home/ec2-user/laglogs/laglog.$log_time.log"
echo "log file: " $log_file
echo "timestamp: " $timestamp
max_proc_count=8
run_for_each() {
local each=$1
echo "Processing: $each" >&2
IFS=$'\n' read -r -d '' -a lags < <(kafka-consumer-groups --bootstrap-server kafka1:9092 --describe --command-config /etc/kafka/client.properties --new-consumer --group "$each" && printf '\0')
for result in "${lags[@]}"; do
printf '%(%Y-%m-%dT%H:%M:%S)T\t%s\t%s\n' -1 "$each" "$result"
done >>"$log_file"
}
export -f run_for_each
export log_file # make log_file visible to subprocesses
printf '%s\0' "${array[@]}" |
xargs -P "$max_proc_count" -n 1 -0 bash -c 'run_for_each "$@"' _
The convenient thing to do is to push your background code into a separate script -- or an exported function. That way xargs can create a new shell, and access the function from its parent. (Be sure to export any other variables that need to be available in the child as well).
array=( 1 2 3 4 5 6 )
max_proc_count=8
log_file=out.txt
run_for_each() {
local each=$1
echo "Processing: $each" >&2
IFS=$' \t\n' read -r -d '' -a lags < <(yourcommand --arg1 "$each" && printf '\0')
for result in "${lags[@]}"; do
printf '%(%Y-%m-%dT%H:%M:%S)T\t%s\t%s\n' -1 "$each" "$result"
done >>"$log_file"
}
export -f run_for_each
export log_file # make log_file visible to subprocesses
printf '%s\0' "${array[@]}" |
xargs -P "$max_proc_count" -n 1 -0 bash -c 'run_for_each "$@"' _
Some notes:
Using echo -e is bad form. See the APPLICATION USAGE and RATIONALE sections in the POSIX spec for echo, explicitly advising using printf instead (and not defining an -e option, and explicitly defining that echo must not accept any options other than -n).
We're including the each value in the log file so it can be extracted from there later.
You haven't specified whether the output of yourcommand is space-delimited, tab-delimited, line-delimited, or otherwise. I'm thus accepting all these for now; modify the value of IFS passed to the read to taste.
printf '%(...)T' to get a timestamp without external tools such as date requires bash 4.2 or newer. Replace with your own code if you see fit.
read -r -a arrayname < <(...) is much more robust than arrayname=( $(...) ). In particular, it avoids treating emitted values as globs -- replacing *s with a list of files in the current directory, or Foo[Bar] with FooB should any file by that name exist (or, if the failglob or nullglob options are set, triggering a failure or emitting no value at all in that case).
Redirecting stdout to your log_file once for the entire loop is somewhat more efficient than redirecting it every time you want to run printf once. Note that having multiple processes writing to the same file at the same time is only safe if all of them opened it with O_APPEND (which >> will do), and if they're writing in chunks small enough to individually complete as single syscalls (which is probably happening unless the individual lags values are quite large).
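One detail worth illustrating when combining xargs with bash -c: the first argument that follows the script string becomes $0, not $1, so a placeholder such as _ is commonly passed to keep the real arguments in "$@". A minimal sketch, where show_args is a hypothetical exported function:

```shell
#!/bin/bash
# Hypothetical helper that reports what the child shell received.
show_args() { printf 'zero=%s first=%s\n' "$0" "$1"; }
export -f show_args   # exported functions are visible to child bash processes

# "_" fills $0, so each item from xargs arrives as $1.
result=$(printf '%s\0' a b |
  xargs -0 -n 1 -P 2 bash -c 'show_args "$@"' _ | sort)
printf '%s\n' "$result"
```

Without the placeholder, the item itself would land in $0 and "$@" would be empty.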
A lot of lengthy and theoretical answers here; I'll try to keep it simple. What about using | (pipe) to connect the commands as usual? ;) (And GNU Parallel, which excels at this type of task.)
seq 6 | parallel -j4 "command --arg1 {} | command2 > results/{}"
The -j4 will limit number of threads (jobs) as requested. You DON'T want to write to a single file from multiple jobs, output one file per job and join them after the parallel processing is finished.
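The one-file-per-job idea can also be sketched in plain bash. This is only an illustration: work is a hypothetical stand-in for the real command, and the sketch leaves out any limit on the number of concurrent jobs:

```shell
#!/bin/bash
outdir=$(mktemp -d)
# Hypothetical stand-in for: command --arg1 "$1" | command2
work() { printf 'result for %s\n' "$1"; }

for each in 1 2 3 4 5 6; do
  work "$each" > "$outdir/$each.out" &   # one output file per job
done
wait                                     # all jobs done; files are complete

combined=$(cat "$outdir"/*.out)          # join in glob (lexical) order
rm -rf "$outdir"
printf '%s\n' "$combined"
```

Because no two jobs share a file, there is nothing to interleave, and the join step decides the final order.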
Using GNU Parallel it looks like this:
array=( 1 2 3 4 5 6 )
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > output
GNU Parallel makes sure output from different jobs is not mixed.
If you prefer the output from jobs mixed:
parallel -0 --bar --line-buffer --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > output-linebuffer
Again GNU Parallel makes sure to only mix with full lines: You will not see half a line from one job and half a line from another job.
It also works if the array is a bit more nasty:
array=( "new
line" 'quotes" '"'" 'echo `do not execute me`')
Or if the command prints long lines or half-lines:
command() {
echo Input: "$@"
echo '" '"'"
sleep 1
echo -n 'Half a line '
sleep 1
echo other half
superlong_a=$(perl -e 'print "a"x1000000')
superlong_b=$(perl -e 'print "b"x1000000')
echo -n $superlong_a
sleep 1
echo $superlong_b
}
export -f command
GNU Parallel strives to be a general solution. This is because I have designed GNU Parallel to care about correctness and try vehemently to deal correctly with corner cases, too, while staying reasonably fast.
GNU Parallel guards against race conditions and does not split words in the output onto separate lines.
array=( $(seq 30) )
max_proc_count=30
command() {
# If 'a', 'b' and 'c' mix: Very bad
perl -e 'print "a"x3000_000," "'
perl -e 'print "b"x3000_000," "'
perl -e 'print "c"x3000_000," "'
echo
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
# 'abc' should always stay together
# and there should only be a single line per job
cat parallel.out | tr -s abc
GNU Parallel works fine if the output has a lot of words:
array=(1)
command() {
yes "`seq 1000`" | head -c 10M
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
GNU Parallel does not eat all your memory - even if the output is bigger than your RAM:
array=(1)
outputsize=1000M
export outputsize
command() {
yes "`perl -e 'print \"c\"x30_000'`" | head -c $outputsize
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
You know how to execute commands in separate processes. The missing part is how to allow those processes to communicate, as separate processes cannot share variables.
Basically, you must choose whether to communicate using regular files, or inter-process communication/FIFOs (which still boils down to using files).
The general approach:
Decide how you want to present tasks to be executed. You could have them as separate files on the filesystem, as a FIFO special file that can be read from, etc. This could be as simple as writing each command to be executed to a separate file, or writing each command to a FIFO (one command per line).
In the main process, prepare the files describing tasks to perform or launch a separate process in the background that will feed the FIFO.
Then, still in the main process, launch worker processes in the background (with &), as many of them as you want parallel tasks being executed (not one per task to perform). Once they have been launched, use wait to, well, wait until all processes are finished. Separate processes cannot share variables, you will have to write any output that needs to be used later to separate files, or a FIFO, etc. If using a FIFO, remember more than one process can write to a FIFO at the same time, so use some kind of mutex mechanism (I suggest looking into the use of mkdir/rmdir for that purpose).
Each worker process must fetch the next task (from a file/FIFO), execute it, generate the output (to a file/FIFO), loop until there are no new tasks, then exit. If using files, you will need to use a mutex to "reserve" a file, read it, and then delete it to mark it as taken care of. This would not be needed for a FIFO.
Depending on the case, your main process may have to wait until all tasks are finished before handling the output, or in some cases may launch a worker process that will detect and handle output as it appears. This worker process would have to either be stopped by the main process once all tasks have been executed, or figure out for itself when all tasks have been executed and exit (while being waited on by the main process).
This is not detailed code, but I hope it gives you an idea of how to approach problems like this.
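One possible shape for the file-based variant, as a rough sketch rather than a definitive implementation: tasks are files in a queue directory, and a worker claims a task by creating a lock directory with mkdir. The directory layout and all names (queue, locks, done, worker) are assumptions made for this illustration:

```shell
#!/bin/bash
# Hypothetical layout: one file per task in queue/, one lock directory per claimed task.
work=$(mktemp -d)
mkdir "$work/queue" "$work/locks" "$work/done"
for t in 1 2 3 4 5 6; do
  echo "payload $t" > "$work/queue/task$t"
done

worker() {
  local id=$1 f name
  for f in "$work"/queue/*; do
    name=${f##*/}
    # mkdir either creates the directory or fails, atomically,
    # so exactly one worker claims each task.
    mkdir "$work/locks/$name" 2>/dev/null || continue
    echo "worker $id handled $name: $(cat "$f")" > "$work/done/$name"
  done
}

for id in 1 2 3; do
  worker "$id" &      # three workers, not one per task
done
wait                  # block until every worker has exited

done_files=( "$work"/done/* )
handled=${#done_files[@]}
echo "handled $handled tasks"   # prints: handled 6 tasks
rm -rf "$work"
```

Which worker handles which task varies from run to run, but each task is handled exactly once because only one mkdir per task can succeed.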
(Community Wiki answer with the OP's proposed self-answer from the question -- now edited out):
So here is one way I can think of doing this, not sure if this is the most efficient way and also, I can't control the amount of threads (I think, or processes?) this would use:
array=( 1 2 3 4 5 6 )
lag_func () {
echo "$1"
lags=($(command --arg1 $1))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
}
for each in "${array[@]}"
do
lag_func $each &
done

How to run a fixed number of processes in a loop?

I have a script like this:
#!/bin/bash
for i=1 to 200000
do
create input file
run ./java
done
I need to run a number (8 or 16) of processes (java) at the same time and I don't know how. I know that wait could help but it should be running 8 processes all the time and not wait for the first 8 to finish before starting the other 8.
bash 4.3 added a useful new flag to the wait command, -n, which causes wait to block until any single background job completes, rather than waiting for a given subset (or all) of them.
#!/bin/bash
cores=8 # or 16, or whatever
for ((i=1; i <= 200000; i++))
do
# create input file and run java in the background.
./java &
# Check how many background jobs there are, and if it
# is equal to the number of cores, wait for anyone to
# finish before continuing.
background=( $(jobs -p) )
if (( ${#background[@]} == cores )); then
wait -n
fi
done
There is a small race condition: if you are at maximum load but a job completes after you run jobs -p, you'll still block until another job completes. There's not much you can do about this, but it shouldn't present too much trouble in practice.
Prior to bash 4.3, you would need to poll the set of background jobs periodically to see when the pool dropped below your threshold.
while :; do
background=( $(jobs -p))
if (( ${#background[@]} < cores )); then
break
fi
sleep 1
done
Use GNU Parallel like this, simplified to 20 jobs rather than 200,000 and the first job is echo rather than "create file" and the second job is sleep rather than "java".
seq 1 20 | parallel -j 8 -k 'echo {}; sleep 2'
The -j 8 says how many jobs to run at once. The -k says to keep the output in order.
With a non-ancient version of GNU utilities or on *BSD/OSX, use xargs with the -P option to run processes in parallel.
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 mytask
where mytask is an auxiliary script, with the sequence number (the input line) available as the argument $1:
#!/bin/bash
echo "Task number $1"
create input file
run ./java
You can put everything in one script if you want:
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 sh -c '
echo "Task number $1"
create input file
run ./java
' mytask
If your system doesn't have seq, you can use the bash snippet
for ((i=1; i<=200000; i++)); do echo "$i"; done
or other shell tools such as
awk 'BEGIN {for (i=1; i<=200000; i++) print i}'
or
</dev/zero tr '\0' '\n' | head -n 200000 | nl -ba
Set up 8 subprocesses that read from a common stream; each subprocess reads one line of input and starts a new job whenever its current job completes.
forker () {
while read; do
# create input file
./java
done
}
cores=8 # or 16, or whatever
for ((i=1; i<=200000; i++)); do
echo $i
done | {
for ((j=0; j<cores; j++)); do
forker <&0 & # <&0 keeps the pipe as stdin; background jobs otherwise read from /dev/null
done
wait # Waiting for the $cores forkers to complete
}

bash: limiting subshells in a for loop with file list

I've been trying to get a for loop to run a bunch of commands more or less simultaneously and was attempting to do it via subshells. I've managed to cobble together the script below to test and it seems to work OK.
#!/bin/bash
for i in {1..255}; do
(
#commands
)&
done
wait
The only problem is that my actual loop is going to be for i in files*, and then it just crashes, I assume because it has started too many subshells to handle. So I added
#!/bin/bash
for i in files*; do
(
#commands
)&
if (( $i % 10 == 0 )); then wait; fi
done
wait
which now fails. Does anyone know a way around this? Either using a different command to limit the number of subshells or provide a number for $i?
Cheers
xargs/parallel
Another solution would be to use tools designed for concurrency:
printf '%s\0' files* | xargs -0 -P6 -n1 yourScript
The -P6 is the maximum number of concurrent processes that xargs will launch. Make it 10 if you like.
I suggest xargs because it is likely already on your system. If you want a really robust solution, look at GNU Parallel.
Filenames in array
For another answer explicit to your question: Get the counter as the array index?
files=( files* )
for i in "${!files[@]}"; do
commands "${files[i]}" &
(( i % 10 )) || wait
done
(The parentheses around the compound command aren't important because backgrounding the job will have the same effects as using a subshell anyway.)
Function
Just different semantics:
simultaneous() {
while [[ $1 ]]; do
for i in {1..10}; do
[[ ${@:i:1} ]] || break
commands "${@:i:1}" &
done
shift 10 || shift "$#"
wait
done
}
simultaneous files*
You may find it useful to count the number of running jobs with jobs, e.g.:
wc -w <<<$(jobs -p)
So, your code would look like this:
#!/bin/bash
for i in files*; do
(
#commands
)&
if (( $(wc -w <<<$(jobs -p)) % 10 == 0 )); then wait; fi
done
wait
As @chepner suggested:
In bash 4.3, you can use wait -n to proceed as soon as any job completes, rather than waiting for all of them
Define the counter explicitly
#!/bin/bash
for f in files*; do
(
#commands
)&
(( i++ % 10 == 0 )) && wait
done
wait
There's no need to initialize i, as it will default to 0 the first time you use it. There's also no need to reset the value, as i %10 will be 0 for i=10, 20, 30, etc.
If you have Bash≥4.3, you can use wait -n:
#!/bin/bash
max_nb_jobs=10
for i in files*; do
# Wait until there are fewer than max_nb_jobs jobs running
while mapfile -t < <(jobs -pr) && ((${#MAPFILE[@]}>=max_nb_jobs)); do
wait -n
done
{
# Your commands here: no useless subshells! use grouping instead
} &
done
wait
If you don't have wait -n available, you can use something like this:
#!/bin/bash
set -m
max_nb_jobs=10
sleep_jobs() {
# This function sleeps until there are less than $1 jobs running
local n=$1
while mapfile -t < <(jobs -pr) && ((${#MAPFILE[@]}>=n)); do
coproc read
trap "echo >&${COPROC[1]}; trap '' SIGCHLD" SIGCHLD
[[ $COPROC_PID ]] && wait $COPROC_PID
done
}
for i in files*; do
# Wait until there are less than 10 jobs running
sleep_jobs "$max_nb_jobs"
{
# Your commands here: no useless subshells! use grouping instead
} &
done
wait
The advantage of proceeding like this is that we make no assumptions about the time taken to finish the jobs. A new job is launched as soon as there's room for it. Moreover, it's all pure Bash, so it doesn't rely on external tools and (maybe more importantly) you may use your Bash environment (variables, functions, etc.) without exporting them (arrays can't be easily exported, so that can be a huge pro).
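To illustrate the point about the environment, here is a small sketch in which background command groups read an array from the parent shell directly. The array contents and the output file are hypothetical:

```shell
#!/bin/bash
names=( "alpha beta" gamma delta )   # hypothetical data
outfile=$(mktemp)

for idx in "${!names[@]}"; do
  {
    # The background group sees the parent's array; nothing is exported.
    printf '%s=%s\n' "$idx" "${names[idx]}" >> "$outfile"
  } &
done
wait

sorted=$(sort "$outfile")            # jobs may finish in any order
rm -f "$outfile"
printf '%s\n' "$sorted"
```

The same loop driven through xargs would need the data passed as arguments or files, since a separately started shell does not inherit arrays.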

Process Scheduling

Let's say, I have 10 scripts that I want to run regularly as cron jobs. However, I don't want all of them to run at the same time. I want only 2 of them running simultaneously.
One solution that I'm thinking of is to create two scripts, put 5 statements in each of them, and add them as separate entries in the crontab. However, that solution seems very ad hoc.
Is there an existing Unix tool to perform the task I mentioned above?
The jobs builtin can tell you how many child processes are running. Some simple shell scripting can accomplish this task:
MAX_JOBS=2
launch_when_not_busy()
{
while [ $(jobs | wc -l) -ge $MAX_JOBS ]
do
# at least $MAX_JOBS are still running.
sleep 1
done
"$@" &
}
launch_when_not_busy bash job1.sh --args
launch_when_not_busy bash jobTwo.sh
launch_when_not_busy bash job_three.sh
...
wait
NOTE: As pointed out by mobrule, my original answer will not work because the wait builtin with no arguments waits for ALL children to finish. Hence the following 'parallelexec' script, which avoids polling at the cost of more child processes:
#!/bin/bash
N="$1"
I=0
{
if [[ "$#" -le 1 ]]; then
cat
else
while [[ "$#" -gt 1 ]]; do
echo "$2"
set -- "$1" "${@:3}"
done
fi
} | {
d=$(mktemp -d /tmp/fifo.XXXXXXXX)
mkfifo "$d"/fifo
exec 3<>"$d"/fifo
rm -rf "$d"
while [[ "$I" -lt "$N" ]] && read C; do
($C; echo >&3) &
let I++
done
while read C; do
read -u 3
($C; echo >&3) &
done
}
The first argument is the number of parallel jobs. If there are more arguments, each one is run as a job; otherwise all commands to run are read from stdin line by line.
I use a named pipe (which is sent to oblivion as soon as the shell opens it) as a synchronization method. Since only single bytes are written there are no race condition issues that could complicate things.
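The token mechanism can be shown in isolation. The following is a simplified sketch, not the script above: it preloads the FIFO with one token per allowed job, and each job returns its token when it finishes. The names max_jobs and log and the sleep workload are assumptions for the illustration:

```shell
#!/bin/bash
max_jobs=2
fifo_dir=$(mktemp -d)
mkfifo "$fifo_dir/tokens"
exec 3<>"$fifo_dir/tokens"   # open read-write so opening does not block
rm -rf "$fifo_dir"           # the open descriptor keeps the pipe alive

for ((t=0; t<max_jobs; t++)); do
  echo >&3                   # preload one token (a single newline) per slot
done

log=$(mktemp)
for i in 1 2 3 4 5; do
  read -r -u 3               # take a token; blocks while max_jobs are running
  ( echo "job $i" >> "$log"; sleep 0.1; echo >&3 ) &   # return the token when done
done
wait
exec 3>&-                    # close the token pipe

count=$(( $(wc -l < "$log") ))
echo "ran $count jobs"       # prints: ran 5 jobs
rm -f "$log"
```

Each token is a single byte, so a reader can never see a partial token.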
GNU Parallel is designed for this kind of task:
sem -j2 do_stuff
sem -j2 do_other_stuff
sem -j2 do_third_stuff
do_third_stuff will only be run when either do_stuff or do_other_stuff has finished.
Watch the intro videos to learn more:
http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
