running each element in array in parallel in bash script - bash

Let's say I have a bash script that looks like this:
array=( 1 2 3 4 5 6 )
for each in "${array[@]}"
do
echo "$each"
command --arg1 $each
done
If I want to run everything in the loop in parallel, I could just change command --arg1 $each to command --arg1 $each &.
But now let's say I want to take the results of command --arg1 $each and do something with those results, like this:
array=( 1 2 3 4 5 6 )
for each in "${array[@]}"
do
echo "$each"
lags=($(command --arg1 $each))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
done
If I just add an & to the end of command --arg1 $each, everything after command --arg1 $each will run without command --arg1 $each finishing first. How do I prevent that from happening? And how do I limit the number of threads the loop can occupy?
Essentially, this block should run in parallel for 1,2,3,4,5,6
echo "$each"
lags=($(command --arg1 $each))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
-----EDIT--------
Here is the original code:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka.client.jaas.conf"
IFS=$'\n'
array=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --list --command-config /etc/kafka/client.properties --new-consumer))
lngth=${#array[*]}
echo "array length: " $lngth
timestamp=$(($(date +%s%N)/1000000))
log_time=`date +%Y-%m-%d:%H`
echo "log time: " $log_time
log_file="/home/ec2-user/laglogs/laglog.$log_time.log"
echo "log file: " $log_file
echo "timestamp: " $timestamp
get_lags () {
echo "$1"
lags=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --describe --group $1 --command-config /etc/kafka/client.properties --new-consumer))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
}
for each in "${array[@]}"
do
get_lags $each &
done
------EDIT 2-----------
Trying with answer below:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka.client.jaas.conf"
IFS=$'\n'
array=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --list --command-config /etc/kafka/client.properties --new-consumer))
lngth=${#array[*]}
echo "array length: " $lngth
timestamp=$(($(date +%s%N)/1000000))
log_time=`date +%Y-%m-%d:%H`
echo "log time: " $log_time
log_file="/home/ec2-user/laglogs/laglog.$log_time.log"
echo "log file: " $log_file
echo "timestamp: " $timestamp
max_proc_count=8
run_for_each() {
local each=$1
echo "Processing: $each" >&2
IFS=$'\n' read -r -d '' -a lags < <(kafka-consumer-groups --bootstrap-server kafka1:9092 --describe --command-config /etc/kafka/client.properties --new-consumer --group "$each" && printf '\0')
for result in "${lags[@]}"; do
printf '%(%Y-%m-%dT%H:%M:%S)T\t%s\t%s\n' -1 "$each" "$result"
done >>"$log_file"
}
export -f run_for_each
export log_file # make log_file visible to subprocesses
printf '%s\0' "${array[@]}" |
xargs -P "$max_proc_count" -n 1 -0 bash -c 'run_for_each "$@"' _

The convenient thing to do is to push your background code into a separate script -- or an exported function. That way xargs can create a new shell, and access the function from its parent. (Be sure to export any other variables that need to be available in the child as well).
array=( 1 2 3 4 5 6 )
max_proc_count=8
log_file=out.txt
run_for_each() {
local each=$1
echo "Processing: $each" >&2
IFS=$' \t\n' read -r -d '' -a lags < <(yourcommand --arg1 "$each" && printf '\0')
for result in "${lags[@]}"; do
printf '%(%Y-%m-%dT%H:%M:%S)T\t%s\t%s\n' -1 "$each" "$result"
done >>"$log_file"
}
export -f run_for_each
export log_file # make log_file visible to subprocesses
printf '%s\0' "${array[@]}" |
xargs -P "$max_proc_count" -n 1 -0 bash -c 'run_for_each "$@"' _
Some notes:
Using echo -e is bad form. See the APPLICATION USAGE and RATIONALE sections in the POSIX spec for echo, which explicitly advise using printf instead (and do not define an -e option, and explicitly state that echo must not accept any options other than -n). A printf replacement is illustrated in the snippet after these notes.
We're including the each value in the log file so it can be extracted from there later.
You haven't specified whether the output of yourcommand is space-delimited, tab-delimited, line-delimited, or otherwise. I'm thus accepting all these for now; modify the value of IFS passed to the read to taste.
printf '%(...)T' to get a timestamp without external tools such as date requires bash 4.2 or newer. Replace with your own code if you see fit.
read -r -a arrayname < <(...) is much more robust than arrayname=( $(...) ). In particular, it avoids treating emitted values as globs -- replacing *s with a list of files in the current directory, or Foo[Bar] with FooB should any file by that name exist (or, if the failglob or nullglob options are set, triggering a failure or emitting no value at all in that case).
Redirecting stdout to your log_file once for the entire loop is somewhat more efficient than redirecting it every time you want to run printf once. Note that having multiple processes writing to the same file at the same time is only safe if all of them opened it with O_APPEND (which >> will do), and if they're writing in chunks small enough to individually complete as single syscalls (which is probably happening unless the individual lags values are quite large).
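A few quick illustrations of those notes (minimal sketches, not part of the original answer; the date fallback is only needed on bash older than 4.2):
# printf instead of echo -e:
printf '%s\t%s\n' "$timestamp" "$result"
# a timestamp without printf '%(...)T', falling back to date(1):
timestamp=$(date '+%Y-%m-%dT%H:%M:%S')
# why read -r -a is safer than arrayname=( $(...) ):
touch file.txt                                # any file in the current directory
out='*'                                       # pretend the command printed a literal *
arr=( $out ); echo "${arr[@]}"                # glob expands: prints file.txt
read -r -a arr <<< "$out"; echo "${arr[@]}"   # no globbing: prints *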

A lot of lengthy and theoretical answers here; I'll try to keep it simple: what about using | (pipe) to connect the commands as usual? ;) (And GNU parallel, which excels at these types of tasks.)
seq 6 | parallel -j4 "command --arg1 {} | command2 > results/{}"
The -j4 will limit the number of threads (jobs) as requested. You DON'T want to write to a single file from multiple jobs; output one file per job and join them after the parallel processing is finished.
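For example, a rough sketch reusing the placeholder command and the results/ directory from above:
mkdir -p results
seq 6 | parallel -j4 "command --arg1 {} > results/{}.out"
cat results/*.out > combined.out   # join the per-job outputs afterwards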

Using GNU Parallel it looks like this:
array=( 1 2 3 4 5 6 )
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > output
GNU Parallel makes sure output from different jobs is not mixed.
If you prefer the output from jobs mixed:
parallel -0 --bar --line-buffer --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > output-linebuffer
Again GNU Parallel makes sure to only mix with full lines: You will not see half a line from one job and half a line from another job.
It also works if the array is a bit more nasty:
array=( "new
line" 'quotes" '"'" 'echo `do not execute me`')
Or if the command prints long lines and half-lines:
command() {
echo Input: "$#"
echo '" '"'"
sleep 1
echo -n 'Half a line '
sleep 1
echo other half
superlong_a=$(perl -e 'print "a"x1000000')
superlong_b=$(perl -e 'print "b"x1000000')
echo -n $superlong_a
sleep 1
echo $superlong_b
}
export -f command
GNU Parallel strives to be a general solution. I have designed it to care about correctness and to try vehemently to deal correctly with corner cases, too, while staying reasonably fast.
GNU Parallel guards against race conditions and does not split words in the output across lines.
array=( $(seq 30) )
max_proc_count=30
command() {
# If 'a', 'b' and 'c' mix: Very bad
perl -e 'print "a"x3000_000," "'
perl -e 'print "b"x3000_000," "'
perl -e 'print "c"x3000_000," "'
echo
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
# 'abc' should always stay together
# and there should only be a single line per job
cat parallel.out | tr -s abc
GNU Parallel works fine if the output has a lot of words:
array=(1)
command() {
yes "`seq 1000`" | head -c 10M
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
GNU Parallel does not eat all your memory - even if the output is bigger than your RAM:
array=(1)
outputsize=1000M
export outputsize
command() {
yes "`perl -e 'print \"c\"x30_000'`" | head -c $outputsize
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out

You know how to execute commands in separate processes. The missing part is how to allow those processes to communicate, as separate processes cannot share variables.
Basically, you must choose whether to communicate using regular files or inter-process communication/FIFOs (which still boils down to using files).
The general approach:
Decide how you want to present the tasks to be executed. You could have them as separate files on the filesystem, as a FIFO special file that can be read from, etc. This could be as simple as writing each command to be executed to a separate file, or writing each command to a FIFO (one command per line).
In the main process, prepare the files describing tasks to perform or launch a separate process in the background that will feed the FIFO.
Then, still in the main process, launch worker processes in the background (with &), as many of them as you want parallel tasks to be executed (not one per task to perform). Once they have been launched, use wait to, well, wait until all processes are finished. Since separate processes cannot share variables, you will have to write any output that needs to be used later to separate files, to a FIFO, etc. If using a FIFO, remember that more than one process can write to it at the same time, so use some kind of mutex mechanism (I suggest looking into the use of mkdir/rmdir for that purpose).
Each worker process must fetch the next task (from a file/FIFO), execute it, generate the output (to a file/FIFO), loop until there are no new tasks, then exit. If using files, you will need to use a mutex to "reserve" a file, read it, and then delete it to mark it as taken care of. This would not be needed for a FIFO.
Depending on the case, your main process may have to wait until all tasks are finished before handling the output, or in some cases may launch a worker process that will detect and handle output as it appears. This worker process would have to either be stopped by the main process once all tasks have been executed, or figure out for itself when all tasks have been executed and exit (while being waited on by the main process).
This is not detailed code, but I hope it gives you an idea of how to approach problems like this.
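As a rough illustration, here is a minimal sketch of the file-based variant described above (all names -- worker, taskdir, results.txt -- are illustrative, not from the question):
#!/bin/bash
# Sketch only: one file per task under $taskdir; a mkdir-based mutex lets a
# worker "reserve" the next task atomically.
taskdir=$(mktemp -d)
lockdir="$taskdir/.lock"
outfile=results.txt
num_workers=4

# Main process: write one file per task to perform.
for t in 1 2 3 4 5 6; do
    printf '%s\n' "$t" > "$taskdir/task.$t"
done

worker() {
    local task arg
    while :; do
        until mkdir "$lockdir" 2>/dev/null; do sleep 0.05; done  # take the mutex
        task=$(ls "$taskdir"/task.* 2>/dev/null | head -n 1)
        [ -n "$task" ] && arg=$(cat "$task") && rm -f "$task"    # claim the task
        rmdir "$lockdir"                                         # release the mutex
        [ -n "$task" ] || break                                  # no tasks left
        # Do the real work here; short appends with >> are safe to interleave.
        printf 'worker %s processed %s\n' "$BASHPID" "$arg" >> "$outfile"
    done
}

# Launch the pool of workers (not one per task), then wait for all of them.
for ((w=0; w<num_workers; w++)); do worker & done
wait
rmdir "$taskdir"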

(Community Wiki answer with the OP's proposed self-answer from the question -- now edited out):
So here is one way I can think of doing this. I'm not sure if this is the most efficient way, and I also can't control the number of threads (I think, or processes?) this would use:
array=( 1 2 3 4 5 6 )
lag_func () {
echo "$1"
lags=($(command --arg1 $1))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
}
for each in "${array[@]}"
do
lag_func $each &
done

Related

Speed up shell script/Performance enhancement of shell script

Is there a way to speed up the below shell script? It's taking me a good 40 minutes to update about 150,000 files every day. Sure, given the volume of files to create and update, this may be acceptable. I don't deny that. However, if there is a much more efficient way to write this, or to rewrite the logic entirely, I'm open to it. I'm looking for some help, please.
#!/bin/bash
DATA_FILE_SOURCE="<path_to_source_data>/${1}"
DATA_FILE_DEST="<path_to_dest>"
for fname in $(ls -1 "${DATA_FILE_SOURCE}")
do
for line in $(cat "${DATA_FILE_SOURCE}"/"${fname}")
do
FILE_TO_WRITE_TO=$(echo "${line}" | awk -F',' '{print $1"."$2".daily.csv"}')
CONTENT_TO_WRITE=$(echo "${line}" | cut -d, -f3-)
if [[ ! -f "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}" ]]
then
echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
else
if ! grep -Fxq "${CONTENT_TO_WRITE}" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
then
sed -i "/${1}/d" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
"${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
fi
fi
done
done
There are still parts of your published script that are unclear, like the sed command. I rewrote it anyway with saner practices and far fewer external calls, which should really speed it up.
#!/usr/bin/env sh
DATA_FILE_SOURCE="<path_to_source_data>/$1"
DATA_FILE_DEST="<path_to_dest>"
for fname in "$DATA_FILE_SOURCE/"*; do
while IFS=, read -r a b content || [ "$a" ]; do
destfile="$DATA_FILE_DEST/$a.$b.daily.csv"
if grep -Fxq "$content" "$destfile"; then
sed -i "/$1/d" "$destfile"
fi
printf '%s\n' "$content" >>"$destfile"
done < "$fname"
done
Make it parallel (as much as you can).
#!/bin/bash
set -e -o pipefail
declare -ir MAX_PARALLELISM=20 # pick a limit
declare -i pid
declare -a pids
# ...
for fname in "${DATA_FILE_SOURCE}/"*; do
if ((${#pids[@]} >= MAX_PARALLELISM)); then
wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
unset 'pids[pid]'
fi
while IFS= read -r line; do
FILE_TO_WRITE_TO="..."
# ...
done < "${fname}" & # forking here
pids[$!]="${fname}"
done
for pid in "${!pids[@]}"; do
wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done
Here’s a directly runnable skeleton showing how the harness above works (with 36 items to process and 20 parallel processes at most):
#!/bin/bash
set -e -o pipefail
declare -ir MAX_PARALLELISM=20 # pick a limit
declare -i pid
declare -a pids
do_something_and_maybe_fail() {
sleep $((RANDOM % 10))
return $((RANDOM % 2 * 5))
}
for fname in some_name_{a..f}{0..5}.txt; do # 36 items
if ((${#pids[@]} >= MAX_PARALLELISM)); then
wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
unset 'pids[pid]'
fi
do_something_and_maybe_fail & # forking here
pids[$!]="${fname}"
echo "${#pids[#]} running" 1>&2
done
for pid in "${!pids[@]}"; do
wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done
Strictly avoid external processes (such as awk, grep and cut) when processing one-liners for each line. fork()ing is extremely inefficient in comparison to:
Running one single awk / grep / cut process on an entire input file (to preprocess all lines at once for easier processing in bash) and feeding the whole output into (e.g.) a bash loop.
Using Bash expansions instead, where feasible, e.g. "${line/,/.}" and other tricks from the EXPANSION section of the man bash page, without fork()ing any further processes.
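For instance, the awk and cut calls from the original inner loop can be replaced by one read plus parameter expansions (a sketch assuming the same comma-separated layout as the question):
line='foo,bar,some,content,here'
IFS=, read -r a b _ <<< "$line"           # first two fields; the rest goes into _
FILE_TO_WRITE_TO="$a.$b.daily.csv"        # replaces the awk call
CONTENT_TO_WRITE=${line#*,}               # strip field 1 ...
CONTENT_TO_WRITE=${CONTENT_TO_WRITE#*,}   # ... and field 2 (replaces cut -d, -f3-)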
Off-topic side notes:
ls -1 is unnecessary. First, ls won’t write multiple columns unless the output is a terminal, so a plain ls would do. Second, bash expansions are usually a cleaner and more efficient choice. (You can use nullglob to correctly handle empty directories / “no match” cases.)
Looping over the output from cat is a (less common) useless use of cat case. Feed the file into a loop in bash instead and read it line by line. (This also gives you more line format flexibility.)
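A minimal illustration of that pattern ("$fname" is the loop variable from the scripts above):
# instead of: for line in $(cat "$fname"); do ...; done
while IFS= read -r line; do
    printf 'read: %s\n' "$line"   # process one whole line at a time
done < "$fname"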

bash: Wait for process substitution subshell to finish

How can bash wait for the subshell used in process substitution to finish in the following construct? (This is of course simplified from the real for loop and subshell which I am using, but it illustrates the intent well.)
for i in {1..3}; do
echo "$i"
done > >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"')
echo "Finished"
Prints:
Finished
Subshell: 1
Subshell: 2
Subshell: 3
Instead of:
Subshell: 1
Subshell: 2
Subshell: 3
Finished
How can I make bash wait for those subshells to complete?
UPDATE
The reason for using process substitution is that I'm wanting to use file descriptors to control what is printed to the screen and what is sent to the process. Here is a fuller version of what I'm doing:
for myFile in file1 file2 file3; do
echo "Downloading $myFile" # Should print to terminal
scp -q $user@$host:$myFile ./ # Might take a long time
echo "$myFile" >&3 # Should go to process substitution
done 3> >(xargs -n1 bash -c 'sleep 1; echo "Processing: $0"')
echo "Finished"
Prints:
Downloading file1
Downloading file2
Downloading file3
Finished
Processing: file1
Processing: file2
Processing: file3
Processing each may take much longer than the transfer. The file transfers should be sequential since bandwidth is the limiting factor. I would like to start processing each file after it is received without waiting for all of them to transfer. The processing can be done in parallel, but only a with a limited number of instances (due to limited memory/CPU). So if the fifth file just finished transferring but only the second file has finished processing, the third and fourth files should complete processing before the fifth file is processed. Meanwhile the sixth file should start transferring.
Bash 4.4 lets you collect the PID of a process substitution with $!, so you can actually use wait, just as you would for a background process:
case $BASH_VERSION in ''|[123].*|4.[0123])
echo "ERROR: Bash 4.4 required" >&2; exit 1;;
esac
# open the process substitution
exec {ps_out_fd}> >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"'); ps_out_pid=$!
for i in {1..3}; do
echo "$i"
done >&$ps_out_fd
# close the process substitution
exec {ps_out_fd}>&-
# ...and wait for it to exit.
wait "$ps_out_pid"
Beyond that, consider flock-style locking -- though beware of races:
for i in {1..3}; do
echo "$i"
done > >(flock -x my.lock xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"')
# this is only safe if the "for" loop can't exit without the process substitution reading
# something (and thus signalling that it successfully started up)
flock -x my.lock echo "Lock grabbed; the subshell has finished"
That said, given your actual use case, what you want should presumably look more like:
download() {
local retval=0
for arg; do
scp -q "$user@$host:$arg" ./ || (( retval |= $? ))
done
exit "$retval"
}
export -f download
printf '%s\0' file1 file2 file3 |
xargs -0 -P2 -n1 bash -c 'download "$#"' _
You could have the subshell create a file that the main shell waits for.
tempfile=/tmp/finished.$$
for i in {1..3}; do
echo "$i"
done > >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"'; touch $tempfile)
while ! test -f $tempfile; do sleep 1; done
rm $tempfile
echo "Finished"
You can use a bash coproc to hold a readable file descriptor that is only closed once all of the process's children have died:
coproc read # previously: `coproc cat`, see comments
for i in {1..3}; do
echo "$i"
done > >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"')
exec {COPROC[1]}>&- # close my writing side
read -u ${COPROC[0]} # will wait until all potential writers (ie process children) end
echo "Finished"
If this is to be run on a system where there is an attacker, you should not use a temp file name that can be guessed. So, based on @Barmar's solution, here is one that avoids that:
tempfile="`tempfile`"
for i in {1..3}; do
echo "$i"
done > >(xargs -n1 bash -c 'sleep 1; echo "Subshell: $0"'; rm "$tempfile")
while test -f "$tempfile"; do sleep 1; done
echo "Finished"
I think you are making it more complicated than it needs to be. Something like this works because the internal bash executions are subprocesses of the main process; the wait causes the script to wait until everything is finished before printing.
for i in {1..3}
do
bash -c "sleep 1; echo Subshell: $i" &
done
wait
echo "Finished"
Unix and its derivatives (Linux) have the ability to wait for child (sub)processes, but not for grandchild processes such as the ones created in your original. Some would consider the polling solution, where you go back and check for completion, to be vulgar since it does not use this mechanism.
The solution where the xargs PID was captured was not vulgar, just too complicated.
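A tiny demonstration of that limitation (not from the original answer):
( sleep 5 & echo "grandchild PID: $!" ) &   # the subshell is our direct child
child=$!
wait "$child"      # fine: the subshell is a direct child of this shell
# wait <grandchild PID>   # would fail: "pid ... is not a child of this shell"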

xargs output buffering -P parallel

I have a bash function that I call in parallel using xargs -P like so
echo ${list} | xargs -n 1 -P 24 -I@ bash -l -c 'myAwesomeShellFunction @'
Everything works fine but output is messed up for obvious reasons (no buffering)
Trying to figure out a way to buffer output effectively. I was thinking I could use awk, but I'm not good enough to write such a script and I can't find anything worthwhile on google? Can someone help me write this "output buffer" in sed or awk? Nothing fancy, just accumulate output and spit it out after process terminates. I don't care the order that shell functions execute, just need their output buffered... Something like:
echo ${list} | xargs -n 1 -P 24 -I@ bash -l -c 'myAwesomeShellFunction @ | sed -u ""'
P.s. I tried to use stdbuf as per
https://unix.stackexchange.com/questions/25372/turn-off-buffering-in-pipe but did not work, i specified buffering on o and e but output still unbuffered:
echo ${list} | xargs -n 1 -P 24 -I@ stdbuf -i0 -oL -eL bash -l -c 'myAwesomeShellFunction @'
Here's my first attempt; it only captures the first line of output:
$ bash -c "echo stuff;sleep 3; echo more stuff" | awk '{while (( getline line) > 0 )print "got ",$line;}'
$ got stuff
This isn't quite atomic if your output is longer than a page (4kb typically), but for most cases it'll do:
xargs -P 24 bash -c 'for arg; do printf "%s\n" "$(myAwesomeShellFunction "$arg")"; done' _
The magic here is the command substitution: $(...) creates a subshell (a fork()ed-off copy of your shell), runs the code ... in it, and then reads that in to be substituted into the relevant position in the outer script.
Note that we don't need -n 1 (if you're dealing with a large number of arguments -- for a small number it may improve parallelization), since we're iterating over as many arguments as each of your 24 parallel bash instances is passed.
If you want to make it truly atomic, you can do that with a lockfile:
# generate a lockfile, arrange for it to be deleted when this shell exits
lockfile=$(mktemp -t lock.XXXXXX); export lockfile
trap 'rm -f "$lockfile"' 0
xargs -P 24 bash -c '
for arg; do
{
output=$(myAwesomeShellFunction "$arg")
flock -x 99
printf "%s\n" "$output"
} 99>"$lockfile"
done
' _

How to parallelize for-loop in bash limiting number of processes

I have a bash script similar to:
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
done
What's the most straightforward way to limit the number of parallel processes to NUM_PROCS? I'm looking for a solution that doesn't require packages/installations/modules (like GNU Parallel) if possible.
When I tried Charles Duffy's latest approach, I got the following error from bash -x:
+ python run.py args 1
+ python run.py ... 3
+ python run.py ... 4
+ python run.py ... 2
+ read -r line
+ python run.py ... 1
+ read -r line
+ python run.py ... 4
+ read -r line
+ python run.py ... 2
+ read -r line
+ python run.py ... 3
+ read -r line
+ python run.py ... 0
+ read -r line
... continuing with other numbers between 0 and 5, until too many processes were started for the system to handle and the bash script was shut down.
bash 4.4 will have an interesting new type of parameter expansion that simplifies Charles Duffy's answer.
#!/bin/bash
num_procs=$1
num_iters=$2
num_jobs="\j" # The prompt escape for number of jobs currently running
for ((i=0; i<num_iters; i++)); do
while (( ${num_jobs@P} >= num_procs )); do
wait -n
done
python foo.py "$i" arg2 &
done
GNU, macOS/OSX, FreeBSD and NetBSD can all do this with xargs -P, no bash versions or package installs required. Here's 4 processes at a time:
printf "%s\0" {1..10} | xargs -0 -I # -P 4 python foo.py # arg2
As a very simple implementation, depending on a version of bash new enough to have wait -n (to wait until only the next job exits, as opposed to waiting for all jobs):
#!/bin/bash
# ^^^^ - NOT /bin/sh!
num_procs=$1
num_iters=$2
declare -A pids=( )
for ((i=0; i<num_iters; i++)); do
while (( ${#pids[@]} >= num_procs )); do
wait -n
for pid in "${!pids[@]}"; do
kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
done
done
python foo.py "$i" arg2 & pids["$!"]=1
done
If running on a shell without wait -n, one can (very inefficiently) replace it with a command such as sleep 0.2, to poll every 1/5th of a second.
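For example, the inner while loop above might become something like this sketch (same variables, with polling in place of wait -n):
while (( ${#pids[@]} >= num_procs )); do
    sleep 0.2                                  # poll instead of wait -n
    for pid in "${!pids[@]}"; do
        kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
    done
done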
Since you're actually reading input from a file, another approach is to start N subprocesses, each of which processes only lines where (linenum % N == threadnum):
num_procs=$1
infile=$2
for ((i=0; i<num_procs; i++)); do
(
while read -r line; do
echo "Thread $i: processing $line"
done < <(awk -v num_procs="$num_procs" -v i="$i" \
'NR % num_procs == i { print }' <"$infile")
) &
done
wait # wait for all the $num_procs subprocesses to finish
A relatively simple way to accomplish this with only two additional lines of code. Explanation is inline.
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
let 'i>=NUM_PROCS' && wait -n # wait for one process at a time once we've spawned $NUM_PROCS workers
done
wait # wait for all remaining workers
Are you aware that if you are allowed to write and run your own scripts, then you can also use GNU Parallel? In essence it is a Perl script in one single file.
From the README:
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
seq $2 | parallel -j$1 python foo.py {} arg2
parallel --embed (available since 20180322) even makes it possible to distribute GNU Parallel as part of a shell script (i.e. no extra files needed):
parallel --embed >newscript
Then edit the end of newscript.
This isn't the simplest solution, but if your version of bash doesn't have "wait -n" and you don't want to use other programs like parallel, awk, etc., here is a solution using while and for loops.
num_iters=10
total_threads=4
iter=1
while [[ "$iter" -lt "$num_iters" ]]; do
iters_remainder=$(echo "(${num_iters}-${iter})+1" | bc)
if [[ "$iters_remainder" -lt "$total_threads" ]]; then
threads=$iters_remainder
else
threads=$total_threads
fi
for ((t=1; t<="$threads"; t++)); do
(
# do stuff
) &
((++iter))
done
wait
done

How to run a fixed number of processes in a loop?

I have a script like this:
#!/bin/bash
for i=1 to 200000
do
create input file
run ./java
done
I need to run a number (8 or 16) of processes (java) at the same time and I don't know how. I know that wait could help but it should be running 8 processes all the time and not wait for the first 8 to finish before starting the other 8.
bash 4.3 added a useful new flag to the wait command, -n, which causes wait to block until any single background job completes, rather than until the members of a given subset (or all of them) do.
#!/bin/bash
cores=8 # or 16, or whatever
for ((i=1; i <= 200000; i++))
do
# create input file and run java in the background.
./java &
# Check how many background jobs there are, and if it
# is equal to the number of cores, wait for anyone to
# finish before continuing.
background=( $(jobs -p) )
if (( ${#background[@]} == cores )); then
wait -n
fi
done
There is a small race condition: if you are at maximum load but a job completes after you run jobs -p, you'll still block until another job
completes. There's not much you can do about this, but it shouldn't present too much trouble in practice.
Prior to bash 4.3, you would need to poll the set of background jobs periodically to see when the pool dropped below your threshold.
while :; do
background=( $(jobs -p) )
if (( ${#background[@]} < cores )); then
break
fi
sleep 1
done
Use GNU Parallel like this, simplified to 20 jobs rather than 200,000, with echo standing in for "create file" and sleep standing in for "java".
seq 1 20 | parallel -j 8 -k 'echo {}; sleep 2'
The -j 8 says how many jobs to run at once. The -k says to keep the output in order.
Here is a little animation of the output so you can see the timing/sequence:
With a non-ancient version of GNU utilities or on *BSD/OSX, use xargs with the -P option to run processes in parallel.
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 mytask
where mytask is an auxiliary script, with the sequence number (the input line) available as the argument $1:
#!/bin/bash
echo "Task number $1"
create input file
run ./java
You can put everything in one script if you want:
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 sh -c '
echo "Task number $1"
create input file
run ./java
' mytask
If your system doesn't have seq, you can use the bash snippet
for ((i=1; i<=200000; i++)); do echo "$i"; done
or other shell tools such as
awk 'BEGIN { for (i=1; i<=200000; i++) print i }' </dev/null
or
</dev/zero tr '\0' '\n' | head -n 200000 | nl -ba
Set up 8 subprocesses that read from a common stream; each subprocess reads one line of input and starts a new job whenever its current job completes.
forker () {
while read; do
# create input file
./java
done
}
cores=8 # or 16, or whatever
for ((i=1; i<=200000; i++)); do
echo $i
done | {
for ((j=0; j<cores; j++)); do
forker &
done
wait # waiting for the $cores forkers to complete
}
