Multiple shell script workers - bash

We'd like to parse tons of coordinates and do something with them using multiple workers.
What we have:
coords.txt
100, 100, 100
244, 433, 233
553, 212, 432
776, 332, 223
...
8887887, 5545554, 2243234
worker.sh
coord_reader='^([0-9]+), ([0-9]+), ([0-9]+)$'
while IFS='' read -r line || [[ -n "$line" ]]; do
    if [[ $line =~ $coord_reader ]]; then
        x=${BASH_REMATCH[1]}
        y=${BASH_REMATCH[2]}
        z=${BASH_REMATCH[3]}
        echo "x is $x, y is $y, z is $z"
    fi
done < "$1"
To execute worker.sh we call bash worker.sh coords.txt
Because we have millions of coordinates, we need to split coords.txt and create multiple workers doing the same task, e.g. one worker each for coordsaa, coordsab, coordsac.
So we split coords.txt using split.
split -l 1000 coords.txt coords
But, how to assign one file per worker?
I am new to stackoverflow, feel free to comment so I can improve my asking skills.

To run workers from bash to process a lot of files:
File layout:
files/ runner.sh worker.sh
files/: a folder containing many files (for example 1000)
runner.sh: launches the workers
worker.sh file: the task that processes one file
For example:
worker.sh:
#!/usr/bin/env bash
sleep 5
echo $1
To run all the files in files/, one worker per file, do:
runner.sh:
#!/usr/bin/env bash
n_processes=$(find files/ -type f | wc -l)
echo "spawning ${n_processes}"
for file in $(find files/ -type f); do
    bash worker.sh "${file}" &
done
wait
/!\ 1000 processes is a lot !!
It is better to create a "pool of processes": this guarantees a maximum number of processes running at the same time (an old child process is not reused for a new task; it dies when its task is done or failed):
#!/usr/bin/env bash
n_processes=8
echo "max of processes: ${n_processes}"
for file in $(find files/ -type f); do
    # Busy-wait until a worker slot is free.
    while [[ $(jobs -r | wc -l) -ge ${n_processes} ]]; do
        :
    done
    bash worker.sh "${file}" &
    echo "launched process pid: $!"
done
wait
It is not really a pool of processes, but it avoids having too many processes alive at the same time; the maximum number of processes alive at once is given by n_processes.
Execute bash runner.sh.
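If you prefer to stay with split from the question, a minimal sketch along the same lines (assuming the chunks were produced by split -l 1000 coords.txt coords, and capping the pool at 8 workers):
#!/usr/bin/env bash
max_workers=8
for chunk in coords??; do   # coordsaa, coordsab, coordsac, ...
    # Poll until a worker slot is free.
    while (( $(jobs -rp | wc -l) >= max_workers )); do
        sleep 1
    done
    bash worker.sh "${chunk}" &
done
wait   # wait for the last workers to finish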

I would do this with GNU Parallel. Say you want 8 workers running at a time till all the processing is done:
parallel -j 8 --pipepart -a coords.txt --fifo bash worker.sh {}
where:
-j 8 means "keep 8 jobs running at a time"
--pipepart means "split the input file into parts"
-a coords.txt means "this is the input file"
--fifo means "create a temporary fifo to send the data to, and save its name in {} to pass to your worker script"
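As a quick sanity check (just a sketch, not needed for the real run), you can swap the worker for wc -l to see how many lines each part receives through its fifo:
parallel -j 8 --pipepart -a coords.txt --fifo wc -l {}
worker.sh itself needs no changes: it already reads the file named by its first argument, and with --fifo that argument is simply the path of the temporary fifo.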

Related

How to wait in bash till a shell script is finished?

right now I'm using this script for a program:
export FREESURFER_HOME=$HOME/freesurfer
source $FREESURFER_HOME/SetUpFreeSurfer.sh
cd /home/ubuntu/fastsurfer
datadir=/home/ubuntu/moya/data
fastsurferdir=/home/ubuntu/moya/output
mkdir -p $fastsurferdir/logs # create log dir for storing nohup output log (optional)
while read p ; do
echo $p
nohup ./run_fastsurfer.sh --t1 $datadir/$p/orig.nii \
--parallel --threads 16 --sid $p --sd $fastsurferdir > $fastsurferdir/logs/out-${p}.log &
sleep 3600s
done < /home/ubuntu/moya/data/subjects-list.txt
Instead of using sleep 3600s (the program needs around an hour), I'd like to wait until all processes (several PIDs) are finished.
If this is the right way, can you tell me how to do that?
BR Alex
wait will wait for all background processes to finish (see help wait). So all you need is to run wait after creating all of the background processes.
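Applied to your script, a minimal sketch (same commands as above, with the sleep dropped and a single wait at the end; note that this launches every subject at once, which the next answer addresses):
while read p ; do
    echo $p
    nohup ./run_fastsurfer.sh --t1 $datadir/$p/orig.nii \
        --parallel --threads 16 --sid $p --sd $fastsurferdir > $fastsurferdir/logs/out-${p}.log &
done < /home/ubuntu/moya/data/subjects-list.txt
wait   # returns once every background run_fastsurfer.sh has exited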
This may be more than what you are asking for but I figured I would provide some methods for controlling the number of threads you want to have running at once. I find that I always want to limit the number for various reasons.
Explanation
The following limits concurrent threads to max_threads at any one time. I am also using a main design pattern: a main function drives the script, and a run_jobs function handles the launching and waiting. I read all of the $p values into an array, then traverse that array as threads are launched. The loop either launches a thread (up to max_threads of them) or sleeps; once fewer than max_threads are running, it starts another thread. When finished, it waits for any remaining threads to be done. If you want something more simplistic I can do that as well.
#!/usr/bin/env bash
export FREESURFER_HOME=$HOME/freesurfer
source $FREESURFER_HOME/SetUpFreeSurfer.sh
typeset max_threads=4
typeset subjects_list="/home/ubuntu/moya/data/subjects-list.txt"
typeset subjectsArray
run_jobs() {
    local child="$$"
    local num_children=0
    local i=0
    while true; do
        # Count our direct children (minus the command-substitution process itself).
        num_children=$(ps --no-headers -o pid --ppid=$child | wc -w) ; ((num_children-=1))
        echo "Children: $num_children"
        # Stop once every subject has been launched and all children have exited.
        if [[ $i -ge ${#subjectsArray[@]} && ${num_children} -le 0 ]]; then
            break
        fi
        if [[ ${num_children} -lt ${max_threads} ]]; then
            if [[ $i -lt ${#subjectsArray[@]} ]]; then
                # RUN COMMAND HERE, in the background, so several can run at once
                ./run_fastsurfer.sh --t1 $datadir/${subjectsArray[$i]}/orig.nii \
                    --parallel --threads 16 --sid ${subjectsArray[$i]} --sd $fastsurferdir \
                    > $fastsurferdir/logs/out-${subjectsArray[$i]}.log 2>&1 &
                ((i+=1))
            fi
        fi
        sleep 10
    done
    wait
}
main() {
    cd /home/ubuntu/fastsurfer
    datadir=/home/ubuntu/moya/data
    fastsurferdir=/home/ubuntu/moya/output
    mkdir -p $fastsurferdir/logs # create log dir for storing per-subject output logs (optional)
    mapfile -t subjectsArray < ${subjects_list}
    run_jobs
}
main
Note: I did not run this code since you have not provided enough information to actually do so.

How to parallelize for-loop in bash limiting number of processes

I have a bash script similar to:
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
done
What's the most straightforward way to limit the number of parallel processes to NUM_PROCS? I'm looking for a solution that doesn't require packages/installations/modules (like GNU Parallel) if possible.
When I tried Charles Duffy's latest approach, I got the following error from bash -x:
+ python run.py args 1
+ python run.py ... 3
+ python run.py ... 4
+ python run.py ... 2
+ read -r line
+ python run.py ... 1
+ read -r line
+ python run.py ... 4
+ read -r line
+ python run.py ... 2
+ read -r line
+ python run.py ... 3
+ read -r line
+ python run.py ... 0
+ read -r line
... continuing with other numbers between 0 and 5, until too many processes were started for the system to handle and the bash script was shut down.
bash 4.4 will have an interesting new type of parameter expansion that simplifies Charles Duffy's answer.
#!/bin/bash
num_procs=$1
num_iters=$2
num_jobs="\j" # The prompt escape for number of jobs currently running
for ((i=0; i<num_iters; i++)); do
    while (( ${num_jobs@P} >= num_procs )); do
        wait -n
    done
    python foo.py "$i" arg2 &
done
GNU, macOS/OSX, FreeBSD and NetBSD can all do this with xargs -P, no bash versions or package installs required. Here's 4 processes at a time:
printf "%s\0" {1..10} | xargs -0 -I # -P 4 python foo.py # arg2
As a very simple implementation, depending on a version of bash new enough to have wait -n (to wait until only the next job exits, as opposed to waiting for all jobs):
#!/bin/bash
# ^^^^ - NOT /bin/sh!
num_procs=$1
num_iters=$2
declare -A pids=( )
for ((i=0; i<num_iters; i++)); do
    while (( ${#pids[@]} >= num_procs )); do
        wait -n
        for pid in "${!pids[@]}"; do
            kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
        done
    done
    python foo.py "$i" arg2 & pids["$!"]=1
done
If running on a shell without wait -n, one can (very inefficiently) replace it with a command such as sleep 0.2, to poll every 1/5th of a second.
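For example, a sketch of the same loop with that polling fallback in place of wait -n:
while (( ${#pids[@]} >= num_procs )); do
    sleep 0.2   # poll instead of blocking on wait -n
    for pid in "${!pids[@]}"; do
        kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
    done
done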
Since you're actually reading input from a file, another approach is to start N subprocesses, each of processes only lines where (linenum % N == threadnum):
num_procs=$1
infile=$2
for ((i=0; i<num_procs; i++)); do
    (
        while read -r line; do
            echo "Thread $i: processing $line"
        done < <(awk -v num_procs="$num_procs" -v i="$i" \
                     'NR % num_procs == i { print }' <"$infile")
    ) &
done
wait # wait for all the $num_procs subprocesses to finish
A relatively simple way to accomplish this with only two additional lines of code. Explanation is inline.
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
let 'i>=NUM_PROCS' && wait -n # wait for one process at a time once we've spawned $NUM_PROC workers
done
wait # wait for all remaining workers
Are you aware that if you are allowed to write and run your own scripts, then you can also use GNU Parallel? In essence it is a Perl script in one single file.
From the README:
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
seq $2 | parallel -j$1 python foo.py {} arg2
parallel --embed (available since 20180322) even makes it possible to distribute GNU Parallel as part of a shell script (i.e. no extra files needed):
parallel --embed >newscript
Then edit the end of newscript.
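For instance (a sketch, since the exact placeholder at the end of the generated file may differ), the tail of newscript could simply become the call shown above:
# end of newscript: the embedded copy of GNU Parallel is defined above,
# so this works even on machines where parallel is not installed
seq "$2" | parallel -j"$1" python foo.py {} arg2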
This isn't the simplest solution, but if your version of bash doesn't have "wait -n" and you don't want to use other programs like parallel, awk etc, here is a solution using while and for loops.
num_iters=10
total_threads=4
iter=1
while [[ "$iter" -lt "$num_iters" ]]; do
    iters_remainder=$(echo "(${num_iters}-${iter})+1" | bc)
    if [[ "$iters_remainder" -lt "$total_threads" ]]; then
        threads=$iters_remainder
    else
        threads=$total_threads
    fi
    for ((t=1; t<="$threads"; t++)); do
        (
            # do stuff
        ) &
        ((++iter))
    done
    wait
done

How to run a fixed number of processes in a loop?

I have a script like this:
#!/bin/bash
for i=1 to 200000
do
create input file
run ./java
done
I need to run a number (8 or 16) of processes (java) at the same time and I don't know how. I know that wait could help but it should be running 8 processes all the time and not wait for the first 8 to finish before starting the other 8.
bash 4.3 added a useful new flag to the wait command, -n, which causes wait to block until any single background job completes, rather than waiting for the members of a given subset (or all of them).
#!/bin/bash
cores=8 # or 16, or whatever
for ((i=1; i <= 200000; i++))
do
    # create input file and run java in the background.
    ./java &
    # Check how many background jobs there are, and if it
    # is equal to the number of cores, wait for anyone to
    # finish before continuing.
    background=( $(jobs -p) )
    if (( ${#background[@]} == cores )); then
        wait -n
    fi
done
There is a small race condition: if you are at maximum load but a job completes after you run jobs -p, you'll still block until another job
completes. There's not much you can do about this, but it shouldn't present too much trouble in practice.
Prior to bash 4.3, you would need to poll the set of background jobs periodically to see when the pool dropped below your threshold.
while :; do
    background=( $(jobs -p) )
    if (( ${#background[@]} < cores )); then
        break
    fi
    sleep 1
done
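Putting the pieces together, a sketch of where that polling loop sits (the input-file creation stays a comment, as in the question):
#!/bin/bash
cores=8 # or 16, or whatever
for ((i=1; i <= 200000; i++))
do
    # Block while the pool is full (pre-4.3 fallback, no wait -n).
    while :; do
        background=( $(jobs -p) )
        if (( ${#background[@]} < cores )); then
            break
        fi
        sleep 1
    done
    # create input file and run java in the background.
    ./java &
done
wait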
Use GNU Parallel like this, simplified to 20 jobs rather than 200,000 and the first job is echo rather than "create file" and the second job is sleep rather than "java".
seq 1 20 | parallel -j 8 -k 'echo {}; sleep 2'
The -j 8 says how many jobs to run at once. The -k says to keep the output in order.
With a non-ancient version of GNU utilities or on *BSD/OSX, use xargs with the -P option to run processes in parallel.
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 mytask
where mytask is an auxiliary script, with the sequence number (the input line) available as the argument $1:
#!/bin/bash
echo "Task number $1"
create input file
run ./java
You can put everything in one script if you want:
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 sh -c '
echo "Task number $1"
create input file
run ./java
' mytask
If your system doesn't have seq, you can use the bash snippet
for ((i=1; i<=200000; i++)); do echo "$i"; done
or other shell tools such as
awk '{for (i=1; i<=200000; i++) print i}' </dev/null
or
</dev/zero tr '\0' '\n' | head -n 200000 | nl
Set up 8 subprocesses that read from a common stream; each subprocess reads one line of input and starts a new job whenever its current job completes.
forker () {
    while read; do
        # create input file
        ./java
    done
}
cores=8 # or 16, or whatever
for ((i=1; i<=200000; i++)); do
    echo $i
done | {
    for ((j=0; j<cores; j++)); do
        forker &
    done
    wait # Waiting for the $cores forkers to complete
}

bash: limiting subshells in a for loop with file list

I've been trying to get a for loop to run a bunch of commands sort of simultaneously and was attempting to do it via subshells. I've managed to cobble together the script below to test and it seems to work ok.
#!/bin/bash
for i in {1..255}; do
(
#commands
)&
done
wait
The only problem is that my actual loop is going to be for i in files* and then it just crashes, I assume because it has started too many subshells to handle. So I added
#!/bin/bash
for i in files*; do
(
#commands
)&
if (( $i % 10 == 0 )); then wait; fi
done
wait
which now fails. Does anyone know a way around this? Either by using a different command to limit the number of subshells, or by providing a number for $i?
Cheers
xargs/parallel
Another solution would be to use tools designed for concurrency:
printf '%s\0' files* | xargs -0 -P6 -n1 yourScript
The -P6 is the maximum number of concurrent processes that xargs will launch. Make it 10 if you like.
I suggest xargs because it is likely already on your system. If you want a really robust solution, look at GNU Parallel.
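Here yourScript is just whatever holds your #commands; because of -n1 it is called with exactly one filename at a time. A minimal sketch (the script name and the echo are placeholders):
#!/bin/bash
# yourScript: xargs passes one filename per invocation as $1
f=$1
#commands operating on "$f" go here
echo "processing $f"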
Filenames in array
For another answer, more specific to your question: get the counter as the array index.
files=( files* )
for i in "${!files[@]}"; do
    commands "${files[i]}" &
    (( i % 10 )) || wait
done
(The parentheses around the compound command aren't important because backgrounding the job will have the same effects as using a subshell anyway.)
Function
Just different semantics:
simultaneous() {
    while [[ $1 ]]; do
        for i in {1..10}; do
            [[ ${@:i:1} ]] || break
            commands "${@:i:1}" &
        done
        shift 10 || shift "$#"
        wait
    done
}
simultaneous files*
You may find it useful to count the number of jobs with jobs, e.g.:
wc -w <<<$(jobs -p)
So, your code would look like this:
#!/bin/bash
for i in files*; do
(
#commands
)&
if (( $(wc -w <<<$(jobs -p)) % 10 == 0 )); then wait; fi
done
wait
As @chepner suggested:
In bash 4.3, you can use wait -n to proceed as soon as any job completes, rather than waiting for all of them
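A sketch of the loop above rewritten with that suggestion (bash 4.3 or newer):
#!/bin/bash
for i in files*; do
    # Once 10 jobs are already running, wait for any single one to finish.
    while (( $(jobs -pr | wc -l) >= 10 )); do
        wait -n
    done
    (
        #commands
    )&
done
wait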
Define the counter explicitly
#!/bin/bash
for f in files*; do
(
#commands
)&
(( i++ % 10 == 0 )) && wait
done
wait
There's no need to initialize i, as it will default to 0 the first time you use it. There's also no need to reset the value, as i %10 will be 0 for i=10, 20, 30, etc.
If you have Bash≥4.3, you can use wait -n:
#!/bin/bash
max_nb_jobs=10
for i in files*; do
    # Wait until there are fewer than max_nb_jobs jobs running
    while mapfile -t < <(jobs -pr) && ((${#MAPFILE[@]}>=max_nb_jobs)); do
        wait -n
    done
    {
        # Your commands here: no useless subshells! use grouping instead
    } &
done
wait
If you don't have wait -n available, you can use something like this:
#!/bin/bash
set -m
max_nb_jobs=10
sleep_jobs() {
    # This function sleeps until there are fewer than $1 jobs running
    local n=$1
    while mapfile -t < <(jobs -pr) && ((${#MAPFILE[@]}>=n)); do
        coproc read
        trap "echo >&${COPROC[1]}; trap '' SIGCHLD" SIGCHLD
        [[ $COPROC_PID ]] && wait $COPROC_PID
    done
}
for i in files*; do
    # Wait until there are fewer than 10 jobs running
    sleep_jobs "$max_nb_jobs"
    {
        # Your commands here: no useless subshells! use grouping instead
    } &
done
wait
The advantage of proceeding like this is that we make no assumptions about the time the jobs take to finish: a new job is launched as soon as there's room for it. Moreover, it's all pure Bash, so it doesn't rely on external tools and (maybe more importantly) you may use your Bash environment (variables, functions, etc.) without exporting it (arrays can't easily be exported, so that can be a huge pro).

Process Scheduling

Let's say, I have 10 scripts that I want to run regularly as cron jobs. However, I don't want all of them to run at the same time. I want only 2 of them running simultaneously.
One solution I'm thinking of is to create two scripts, put 5 statements in each of them, and add them as separate entries in the crontab. However, that solution seems very ad hoc.
Is there existing unix tool to perform the task I mentioned above?
The jobs builtin can tell you how many child processes are running. Some simple shell scripting can accomplish this task:
MAX_JOBS=2
launch_when_not_busy()
{
    while [ $(jobs | wc -l) -ge $MAX_JOBS ]
    do
        # at least $MAX_JOBS are still running.
        sleep 1
    done
    "$@" &
}
launch_when_not_busy bash job1.sh --args
launch_when_not_busy bash jobTwo.sh
launch_when_not_busy bash job_three.sh
...
wait
NOTE: As pointed out by mobrule, my original answer will not work because the wait builtin with no arguments waits for ALL children to finish. Hence the following 'parallelexec' script, which avoids polling at the cost of more child processes:
#!/bin/bash
N="$1"
I=0
{
    if [[ "$#" -le 1 ]]; then
        cat
    else
        while [[ "$#" -gt 1 ]]; do
            echo "$2"
            set -- "$1" "${@:3}"
        done
    fi
} | {
    d=$(mktemp -d /tmp/fifo.XXXXXXXX)
    mkfifo "$d"/fifo
    exec 3<>"$d"/fifo
    rm -rf "$d"
    while [[ "$I" -lt "$N" ]] && read C; do
        ($C; echo >&3) &
        let I++
    done
    while read C; do
        read -u 3
        ($C; echo >&3) &
    done
}
The first argument is the maximum number of parallel jobs. If more arguments are given, each one is run as a job; otherwise, the commands to run are read from stdin line by line.
I use a named pipe (which is sent to oblivion as soon as the shell opens it) as a synchronization method. Since only single bytes are written there are no race condition issues that could complicate things.
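Usage would look something like this (a sketch with placeholder commands; each line is word-split but not re-parsed by the shell, so stick to simple commands rather than one-liners with ; or |):
# run at most 4 of the listed commands at the same time
./parallelexec 4 "sleep 2" "sleep 1" "sleep 3" "sleep 2"
# or feed one command per line on stdin
./parallelexec 4 < commands.txt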
GNU Parallel is designed for this kind of task:
sem -j2 do_stuff
sem -j2 do_other_stuff
sem -j2 do_third_stuff
do_third_stuff will only be run when either do_stuff or do_other_stuff has finished.
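Applied to the cron question above, a sketch (paths, times and the --id name are placeholders; --id makes all the entries share one counting semaphore):
# crontab entries: sem queues each script so that at most 2 run at any moment
0 2 * * * /usr/local/bin/sem -j2 --id nightly /path/to/script1.sh
5 2 * * * /usr/local/bin/sem -j2 --id nightly /path/to/script2.sh
10 2 * * * /usr/local/bin/sem -j2 --id nightly /path/to/script3.sh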
Watch the intro videos to learn more:
http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Resources