download files parallely in a bash script - bash

I am using the below logic to download 3 file from the array at once, once all 3 completed only the next 3 files will be picked up.
parallel=3
downLoad() {
while (( "$#" )); do
for (( i=0; i<$parallel; i++ )); do
echo "downloading ${1}..."
curl -s -o ${filename}.tar.gz <download_URL> &
shift
done
wait
echo "#################################"
done
}
downLoad ${layers[#]}
But how i am expecting is "at any point in time 3 downloads should run" - i mean suppose we sent 3 file-downloads to background and one among the 3 gets completed very soon because of very less size, i want another file from the queue should be send for download.
COMPLETE SCRIPT:
#!/bin/bash
set -eu
reg="registry.hub.docker.com"
repo="hjd48"
image="redhat"
name="${repo}/${image}"
tag="latest"
parallel=3
# Get auth token
token=$( curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:${name}:pull" | jq -r .token )
# Get layers
resp=$(curl -s -H "Authorization: Bearer $token" "https://${reg}/v2/${name}/manifests/${tag}" | jq -r .fsLayers[].blobSum )
layers=( $( echo $resp | tr ' ' '\n' | sort -u ) )
prun() {
PIDS=()
while (( "$#" )); do
if ( kill -0 ${PIDS[#]} 2>/dev/null ; [[ $(( ${#PIDS[#]} - $? )) -lt $parallel ]])
then
echo "Download: ${1}.tar.gz"
curl -s -o $1.tar.gz -L -H "Authorization: Bearer $token" "https://${reg}/v2/${name}/blobs/${1}" &
PIDS+=($!)
shift
fi
done
wait
}
prun ${layers[#]}

If you do not mind using xargs then you can:
xargs -I xxx -P 3 sleep xxx < sleep
and sleep is:
1
2
3
4
5
6
7
8
9
and if you watch the background with:
watch -n 1 -exec ps --forest -g -p your-Bash-pid
(sleep could be your array of link ) then you will see that 3 jobs are run in parallel and when one of these three is completed the next job is added. In fact always 3 jobs are running till the end of array.
sample output of watch(1):
12260 pts/3 S+ 0:00 \_ xargs -I xxx -P 3 sleep xxx
12263 pts/3 S+ 0:00 \_ sleep 1
12265 pts/3 S+ 0:00 \_ sleep 2
12267 pts/3 S+ 0:00 \_ sleep 3
xargs starts with 3 jobs and when one of them is finished it will add the next which bacomes:
12260 pts/3 S+ 0:00 \_ xargs -I xxx -P 3 sleep xxx
12265 pts/3 S+ 0:00 \_ sleep 2
12267 pts/3 S+ 0:00 \_ sleep 3
12269 pts/3 S+ 0:00 \_ sleep 4 # this one was added

I've done just this by using trap to handle SIGCHLD and start another transfer when one ends.
The difficult part is that once your script installs a SIGCHLD handler with that trap line, you can't create any child processes other than your transfer processes. For example, if your shell doesn't have a built-in echo, calling echo would spawn a child process that would cause you to start one more transfer when the echo process ends.
I don't have a copy available, but it was something like this:
startDownload() {
# only start another download if there are URLs left in
# in the array that haven't been downloaded yet
if [ ${ urls[ $fileno ] } ];
# start a curl download in the background and increment fileno
# so the next call downloads the next URL in the array
curl ... ${ urls[ $fileno ] } &
fileno=$((fileno+1))
fi
}
trap startDownload SIGCHLD
# start at file zero and set up an array
# of URLs to download
fileno=0
urls=...
parallel=3
# start the initial parallel downloads
# when one ends, the SIGCHLD will cause
# another one to be started if there are
# remaining URLs in the array
for (( i=0; i<$parallel; i++ )); do
startDownload
done
wait
That's not been tested at all, and probably has all kinds of errors.

I would read all provided filenames into three variables, and then process each stream separately, e.g.
PARALLEL=3
COUNTER=1
for FILENAME in $#
do
eval FILESTREAM${COUNTER}="\$FILESTREAM${COUNTER} \${FILENAME}"
COUNTER=`expr ${COUNTER} + 1`
if [ ${COUNTER} -gt ${PARALLEL} ]
then
COUNTER=1
fi
done
and now call the download function for each of the streams in parallel:
COUNTER=1
while [ ${COUNTER} -le ${PARALLEL} ]
do
eval "download \$FILESTREAM${COUNTER} &"
COUNTER=`expr ${COUNTER} + 1`
done

Besides implementing a parallel bash script from scratch, GNU parallel is an available tool to use which is quite suitable to perform these type of tasks.
parallel -j3 curl -s -o {}.tar.gz download_url ::: "${layers[#]}"
-j3 ensures a maximum of 3 jobs running at the same time
you can add an additional option --dry-run after parallel to make sure the built command is exactly as you want

Related

wait doesn't wait for the processes in the while loop to finish

Here is my code:
count=0
head -n 10 urls.txt | while read LINE; do
curl -o /dev/null -s "$LINE" -w "%{time_total}\n" &
count=$((count+1))
[ 0 -eq $((count % 3)) ] && wait && echo "process wait" # wait for 3 urls
done
echo "before wait"
wait
echo "after wait"
I am expecting the last curl to finish before printing the last echo, but actually it's not the case:
0.595499
0.602349
0.618237
process wait
0.084970
0.084243
0.099969
process wait
0.067999
0.068253
0.081602
process wait
before wait
after wait
➜ Downloads 0.088755 # already exited the script
Does anyone know why it's happening? And how to fix this?
As described in BashFAQ #24, this is caused by your pipeline causing the while loop to be performed in a different shell from the rest of your script.
Consequently, your curls are subprocesses of that subshell, not the outer interpreter; so the outer interpreter cannot wait for them.
This can be resolved by not piping to while read, but instead redirecting its input in a way that doesn't shuffle it into a pipeline element -- as with <(...), a process substitution:
#!/usr/bin/env bash
# ^^^^ - NOT /bin/sh; also, must not start with "sh scriptname"
count=0
while IFS= read -r line; do
curl -o /dev/null -s "$line" -w "%{time_total}\n" &
count=$((count+1))
(( count % 3 == 0 )) && { wait; echo "process wait"; } # wait for 3 urls
done < <(head -n 10 urls.txt)
echo "before wait"
wait
echo "after wait"
why it's happening?
Because you run the processes in the subshell, the parent process can't wait for them.
$ echo | { echo subshell; sleep 100 & }
$ wait # exits immiedately
$
Call wait from the same process the background processes were spawned:
someotherthing | {
while someotherthing; do
something &
done
wait # will wait for something
}
And how to fix this?
I recommend not to use a crude while read loop and use different approach using some tool. Use GNU xargs with -P option to run 3 processes concurently:
head -n 10 urls.txt | xargs -P3 -n1 -d '\n' curl -o /dev/null -w "%{time_total}\n" -s
But you could just use move wait into the subshell as above, or make the while loop to be executed in the parent shell alternatively.

How to check whether a long time task running properly? / How to launch a function after given time while a command is running?

How to check whether a long time task running properly? (How to launch a function after given time while a command is running)?
I'm writing a bash script to download some files regularly. I'd like to informed while a successful download is started.
But I couldn't make it right.
#!/bin/bash
URL="http://testurl"
FILENAME="/tmp/test"
function is_downloading() {
sleep 11
echo -e "$DOWNLOADING" # 0 wanted here with a failed download but always get empty
if [[ $DOWNLOADING -eq 1 ]]; then
echo "Send Message"
# send_msg
fi
}
while [[ 0 ]]; do
is_downloading &
DOWNLOADING=1
curl --connect-timeout 10 --speed-time 10 --speed-limit 1 --location -o "$FILENAME" "$URL"
DOWNLOADING=0
echo -e "$DOWNLOADING"
sleep 3600
done
is_downloading is running in another process, the best it could see is a copy of our variables at the time it started. Variables are not shared, bash does not support multi-threading (yet).
So you need to arrange some form of Inter-Process Communication (IPC). There are many methods available, I favour a named pipe (also known as a FIFO). Something like this:
function is_downloading() {
thepipe="$1"
while :
do
read -r DOWNLOADING < "$thepipe"
echo "$DOWNLOADING"
if [[ $DOWNLOADING -eq 1 ]]; then
echo "Send Message"
# send_msg
fi
done
}
pipename="/tmp/$0$$"
mkfifo "$pipename"
is_downloading "$pipename" &
trap 'kill %1;rm "$pipename"' INT TERM EXIT
while :
do
DOWNLOADING=1
echo "$DOWNLOADING" > "$pipename"
curl --connect-timeout 10 --speed-time 10 --speed-limit 1 --location -o "$FILENAME" "$URL"
DOWNLOADING=0
echo "$DOWNLOADING" > "$pipename"
sleep 3600
done
Modifications: taken the function call out of the loop. Tidy-up code put into a trap statement.

How to control the number of a job executions ended with "&" within a loop [duplicate]

This question already has answers here:
Parallelize Bash script with maximum number of processes
(16 answers)
Closed 1 year ago.
Is there an easy way to limit the number of concurrent jobs in bash? By that I mean making the & block when there are more then n concurrent jobs running in the background.
I know I can implement this with ps | grep -style tricks, but is there an easier way?
If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you can do this:
parallel gzip ::: *.log
which will run one gzip per CPU core until all logfiles are gzipped.
If it is part of a larger loop you can use sem instead:
for i in *.log ; do
echo $i Do more stuff here
sem -j+0 gzip $i ";" echo done
done
sem --wait
It will do the same, but give you a chance to do more stuff for each file.
If GNU Parallel is not packaged for your distribution you can install GNU Parallel simply by:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
It will download, check signature, and do a personal installation if it cannot install globally.
Watch the intro videos for GNU Parallel to learn more:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
A small bash script could help you:
# content of script exec-async.sh
joblist=($(jobs -p))
while (( ${#joblist[*]} >= 3 ))
do
sleep 1
joblist=($(jobs -p))
done
$* &
If you call:
. exec-async.sh sleep 10
...four times, the first three calls will return immediately, the fourth call will block until there are less than three jobs running.
You need to start this script inside the current session by prefixing it with ., because jobs lists only the jobs of the current session.
The sleep inside is ugly, but I didn't find a way to wait for the first job that terminates.
The following script shows a way to do this with functions. You can either put the bgxupdate() and bgxlimit() functions in your script, or have them in a separate file which is sourced from your script with:
. /path/to/bgx.sh
It has the advantage that you can maintain multiple groups of processes independently (you can run, for example, one group with a limit of 10 and another totally separate group with a limit of 3).
It uses the Bash built-in jobs to get a list of sub-processes but maintains them in individual variables. In the loop at the bottom, you can see how to call the bgxlimit() function:
Set up an empty group variable.
Transfer that to bgxgrp.
Call bgxlimit() with the limit and command you want to run.
Transfer the new group back to your group variable.
Of course, if you only have one group, just use bgxgrp variable directly rather than transferring in and out.
#!/bin/bash
# bgxupdate - update active processes in a group.
# Works by transferring each process to new group
# if it is still active.
# in: bgxgrp - current group of processes.
# out: bgxgrp - new group of processes.
# out: bgxcount - number of processes in new group.
bgxupdate() {
bgxoldgrp=${bgxgrp}
bgxgrp=""
((bgxcount = 0))
bgxjobs=" $(jobs -pr | tr '\n' ' ')"
for bgxpid in ${bgxoldgrp} ; do
echo "${bgxjobs}" | grep " ${bgxpid} " >/dev/null 2>&1
if [[ $? -eq 0 ]]; then
bgxgrp="${bgxgrp} ${bgxpid}"
((bgxcount++))
fi
done
}
# bgxlimit - start a sub-process with a limit.
# Loops, calling bgxupdate until there is a free
# slot to run another sub-process. Then runs it
# an updates the process group.
# in: $1 - the limit on processes.
# in: $2+ - the command to run for new process.
# in: bgxgrp - the current group of processes.
# out: bgxgrp - new group of processes
bgxlimit() {
bgxmax=$1; shift
bgxupdate
while [[ ${bgxcount} -ge ${bgxmax} ]]; do
sleep 1
bgxupdate
done
if [[ "$1" != "-" ]]; then
$* &
bgxgrp="${bgxgrp} $!"
fi
}
# Test program, create group and run 6 sleeps with
# limit of 3.
group1=""
echo 0 $(date | awk '{print $4}') '[' ${group1} ']'
echo
for i in 1 2 3 4 5 6; do
bgxgrp=${group1}; bgxlimit 3 sleep ${i}0; group1=${bgxgrp}
echo ${i} $(date | awk '{print $4}') '[' ${group1} ']'
done
# Wait until all others are finished.
echo
bgxgrp=${group1}; bgxupdate; group1=${bgxgrp}
while [[ ${bgxcount} -ne 0 ]]; do
oldcount=${bgxcount}
while [[ ${oldcount} -eq ${bgxcount} ]]; do
sleep 1
bgxgrp=${group1}; bgxupdate; group1=${bgxgrp}
done
echo 9 $(date | awk '{print $4}') '[' ${group1} ']'
done
Here’s a sample run, with blank lines inserted to clearly delineate different time points:
0 12:38:00 [ ]
1 12:38:00 [ 3368 ]
2 12:38:00 [ 3368 5880 ]
3 12:38:00 [ 3368 5880 2524 ]
4 12:38:10 [ 5880 2524 1560 ]
5 12:38:20 [ 2524 1560 5032 ]
6 12:38:30 [ 1560 5032 5212 ]
9 12:38:50 [ 5032 5212 ]
9 12:39:10 [ 5212 ]
9 12:39:30 [ ]
The whole thing starts at 12:38:00 (time t = 0) and, as you can see, the first three processes run immediately.
Each process sleeps for 10n seconds and the fourth process doesn’t start until the first exits (at time t = 10). You can see that process 3368 has disappeared from the list before 1560 is added.
Similarly, the fifth process 5032 starts when 5880 (the second) exits at time t = 20.
And finally, the sixth process 5212 starts when 2524 (the third) exits at time t = 30.
Then the rundown begins, the fourth process exits at time t = 50 (started at 10 with 40 duration).
The fifth exits at time t = 70 (started at 20 with 50 duration).
Finally, the sixth exits at time t = 90 (started at 30 with 60 duration).
Or, if you prefer it in a more graphical time-line form:
Process: 1 2 3 4 5 6
-------- - - - - - -
12:38:00 ^ ^ ^ 1/2/3 start together.
12:38:10 v | | ^ 4 starts when 1 done.
12:38:20 v | | ^ 5 starts when 2 done.
12:38:30 v | | ^ 6 starts when 3 done.
12:38:40 | | |
12:38:50 v | | 4 ends.
12:39:00 | |
12:39:10 v | 5 ends.
12:39:20 |
12:39:30 v 6 ends.
Here's the shortest way:
waitforjobs() {
while test $(jobs -p | wc -w) -ge "$1"; do wait -n; done
}
Call this function before forking off any new job:
waitforjobs 10
run_another_job &
To have as many background jobs as cores on the machine, use $(nproc) instead of a fixed number like 10.
Assuming you'd like to write code like this:
for x in $(seq 1 100); do # 100 things we want to put into the background.
max_bg_procs 5 # Define the limit. See below.
your_intensive_job &
done
Where max_bg_procs should be put in your .bashrc:
function max_bg_procs {
if [[ $# -eq 0 ]] ; then
echo "Usage: max_bg_procs NUM_PROCS. Will wait until the number of background (&)"
echo " bash processes (as determined by 'jobs -pr') falls below NUM_PROCS"
return
fi
local max_number=$((0 + ${1:-0}))
while true; do
local current_number=$(jobs -pr | wc -l)
if [[ $current_number -lt $max_number ]]; then
break
fi
sleep 1
done
}
The following function (developed from tangens answer above, either copy into script or source from file):
job_limit () {
# Test for single positive integer input
if (( $# == 1 )) && [[ $1 =~ ^[1-9][0-9]*$ ]]
then
# Check number of running jobs
joblist=($(jobs -rp))
while (( ${#joblist[*]} >= $1 ))
do
# Wait for any job to finish
command='wait '${joblist[0]}
for job in ${joblist[#]:1}
do
command+=' || wait '$job
done
eval $command
joblist=($(jobs -rp))
done
fi
}
1) Only requires inserting a single line to limit an existing loop
while :
do
task &
job_limit `nproc`
done
2) Waits on completion of existing background tasks rather than polling, increasing efficiency for fast tasks
This might be good enough for most purposes, but is not optimal.
#!/bin/bash
n=0
maxjobs=10
for i in *.m4a ; do
# ( DO SOMETHING ) &
# limit jobs
if (( $(($((++n)) % $maxjobs)) == 0 )) ; then
wait # wait until all have finished (not optimal, but most times good enough)
echo $n wait
fi
done
If you're willing to do this outside of pure bash, you should look into a job queuing system.
For instance, there's GNU queue or PBS. And for PBS, you might want to look into Maui for configuration.
Both systems will require some configuration, but it's entirely possible to allow a specific number of jobs to run at once, only starting newly queued jobs when a running job finishes. Typically, these job queuing systems would be used on supercomputing clusters, where you would want to allocate a specific amount of memory or computing time to any given batch job; however, there's no reason you can't use one of these on a single desktop computer without regard for compute time or memory limits.
It is hard to do without wait -n (for example, shell in busybox does not support it). So here is a workaround, it is not optimal because it calls 'jobs' and 'wc' commands 10x per second. You can reduce the calls to 1x per second for example, if you don't mind waiting a bit longer for each job to complete.
# $1 = maximum concurent jobs
#
limit_jobs()
{
while true; do
if [ "$(jobs -p | wc -l)" -lt "$1" ]; then break; fi
usleep 100000
done
}
# and now start some tasks:
task &
limit_jobs 2
task &
limit_jobs 2
task &
limit_jobs 2
task &
limit_jobs 2
wait
On Linux I use this to limit the bash jobs to the number of available CPUs (possibly overriden by setting the CPU_NUMBER).
[ "$CPU_NUMBER" ] || CPU_NUMBER="`nproc 2>/dev/null || echo 1`"
while [ "$1" ]; do
{
do something
with $1
in parallel
echo "[$# items left] $1 done"
} &
while true; do
# load the PIDs of all child processes to the array
joblist=(`jobs -p`)
if [ ${#joblist[*]} -ge "$CPU_NUMBER" ]; then
# when the job limit is reached, wait for *single* job to finish
wait -n
else
# stop checking when we're below the limit
break
fi
done
# it's great we executed zero external commands to check!
shift
done
# wait for all currently active child processes
wait
Wait command, -n option, waits for the next job to terminate.
maxjobs=10
# wait for the amount of processes less to $maxjobs
jobIds=($(jobs -p))
len=${#jobIds[#]}
while [ $len -ge $maxjobs ]; do
# Wait until one job is finished
wait -n $jobIds
jobIds=($(jobs -p))
len=${#jobIds[#]}
done
Have you considered starting ten long-running listener processes and communicating with them via named pipes?
you can use ulimit -u
see http://ss64.com/bash/ulimit.html
Bash mostly processes files line by line.
So you cap split input file input files by N lines then simple pattern is applicable:
mkdir tmp ; pushd tmp ; split -l 50 ../mainfile.txt
for file in * ; do
while read a b c ; do curl -s http://$a/$b/$c <$file &
done ; wait ; done
popd ; rm -rf tmp;

how to write a process-pool bash shell

I have more than 10 tasks to execute, and the system restrict that there at most 4 tasks can run at the same time.
My task can be started like:
myprog taskname
How can I write a bash shell script to run these task. The most important thing is that when one task finish, the script can start another immediately, making the running tasks count remain 4 all the time.
Use xargs:
xargs -P <maximum-number-of-process-at-a-time> -n <arguments-per-process> <command>
Details here.
I chanced upon this thread while looking into writing my own process pool and particularly liked Brandon Horsley's solution, though I couldn't get the signals working right, so I took inspiration from Apache and decided to try a pre-fork model with a fifo as my job queue.
The following function is the function that the worker processes run when forked.
# \brief the worker function that is called when we fork off worker processes
# \param[in] id the worker ID
# \param[in] job_queue the fifo to read jobs from
# \param[in] result_log the temporary log file to write exit codes to
function _job_pool_worker()
{
local id=$1
local job_queue=$2
local result_log=$3
local line=
exec 7<> ${job_queue}
while [[ "${line}" != "${job_pool_end_of_jobs}" && -e "${job_queue}" ]]; do
# workers block on the exclusive lock to read the job queue
flock --exclusive 7
read line <${job_queue}
flock --unlock 7
# the worker should exit if it sees the end-of-job marker or run the
# job otherwise and save its exit code to the result log.
if [[ "${line}" == "${job_pool_end_of_jobs}" ]]; then
# write it one more time for the next sibling so that everyone
# will know we are exiting.
echo "${line}" >&7
else
_job_pool_echo "### _job_pool_worker-${id}: ${line}"
# run the job
{ ${line} ; }
# now check the exit code and prepend "ERROR" to the result log entry
# which we will use to count errors and then strip out later.
local result=$?
local status=
if [[ "${result}" != "0" ]]; then
status=ERROR
fi
# now write the error to the log, making sure multiple processes
# don't trample over each other.
exec 8<> ${result_log}
flock --exclusive 8
echo "${status}job_pool: exited ${result}: ${line}" >> ${result_log}
flock --unlock 8
exec 8>&-
_job_pool_echo "### _job_pool_worker-${id}: exited ${result}: ${line}"
fi
done
exec 7>&-
}
You can get a copy of my solution at Github. Here's a sample program using my implementation.
#!/bin/bash
. job_pool.sh
function foobar()
{
# do something
true
}
# initialize the job pool to allow 3 parallel jobs and echo commands
job_pool_init 3 0
# run jobs
job_pool_run sleep 1
job_pool_run sleep 2
job_pool_run sleep 3
job_pool_run foobar
job_pool_run foobar
job_pool_run /bin/false
# wait until all jobs complete before continuing
job_pool_wait
# more jobs
job_pool_run /bin/false
job_pool_run sleep 1
job_pool_run sleep 2
job_pool_run foobar
# don't forget to shut down the job pool
job_pool_shutdown
# check the $job_pool_nerrors for the number of jobs that exited non-zero
echo "job_pool_nerrors: ${job_pool_nerrors}"
Hope this helps!
Using GNU Parallel you can do:
cat tasks | parallel -j4 myprog
If you have 4 cores, you can even just do:
cat tasks | parallel myprog
From http://git.savannah.gnu.org/cgit/parallel.git/tree/README:
Full installation
Full installation of GNU Parallel is as simple as:
./configure && make && make install
Personal installation
If you are not root you can add ~/bin to your path and install in
~/bin and ~/share:
./configure --prefix=$HOME && make && make install
Or if your system lacks 'make' you can simply copy src/parallel
src/sem src/niceload src/sql to a dir in your path.
Minimal installation
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
Test the installation
After this you should be able to do:
parallel -j0 ping -nc 3 ::: foss.org.my gnu.org freenetproject.org
This will send 3 ping packets to 3 different hosts in parallel and print
the output when they complete.
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
I would suggest writing four scripts, each one of which executes a certain number of tasks in series. Then write another script that starts the four scripts in parallel. For instance, if you have scripts, script1.sh, script2.sh, script3.sh, and script4.sh, you could have a script called headscript.sh like so.
#!/bin/sh
./script1.sh &
./script2.sh &
./script3.sh &
./script4.sh &
I found the best solution proposed in A Foo Walks into a Bar... blog using build-in functionality of well know xargs tool
First create a file commands.txt with list of commands you want to execute
myprog taskname1
myprog taskname2
myprog taskname3
myprog taskname4
...
myprog taskname123
and then pipe it to xargs like this to execute in 4 processes pool:
cat commands.txt | xargs -I CMD --max-procs=4 bash -c CMD
you can modify no of process
Following #Parag Sardas' answer and the documentation linked here's a quick script you might want to add on your .bash_aliases.
Relinking the doc link because it's worth a read
#!/bin/bash
# https://stackoverflow.com/a/19618159
# https://stackoverflow.com/a/51861820
#
# Example file contents:
# touch /tmp/a.txt
# touch /tmp/b.txt
if [ "$#" -eq 0 ]; then
echo "$0 <file> [max-procs=0]"
exit 1
fi
FILE=${1}
MAX_PROCS=${2:-0}
cat $FILE | while read line; do printf "%q\n" "$line"; done | xargs --max-procs=$MAX_PROCS -I CMD bash -c CMD
I.e.
./xargs-parallel.sh jobs.txt 4 maximum of 4 processes read from jobs.txt
You could probably do something clever with signals.
Note this is only to illustrate the concept, and thus not thoroughly tested.
#!/usr/local/bin/bash
this_pid="$$"
jobs_running=0
sleep_pid=
# Catch alarm signals to adjust the number of running jobs
trap 'decrement_jobs' SIGALRM
# When a job finishes, decrement the total and kill the sleep process
decrement_jobs()
{
jobs_running=$(($jobs_running - 1))
if [ -n "${sleep_pid}" ]
then
kill -s SIGKILL "${sleep_pid}"
sleep_pid=
fi
}
# Check to see if the max jobs are running, if so sleep until woken
launch_task()
{
if [ ${jobs_running} -gt 3 ]
then
(
while true
do
sleep 999
done
) &
sleep_pid=$!
wait ${sleep_pid}
fi
# Launch the requested task, signalling the parent upon completion
(
"$#"
kill -s SIGALRM "${this_pid}"
) &
jobs_running=$((${jobs_running} + 1))
}
# Launch all of the tasks, this can be in a loop, etc.
launch_task task1
launch_task tast2
...
launch_task task99
This tested script runs 5 jobs at a time and will restart a new job as soon as it does (due to the kill of the sleep 10.9 when we get a SIGCHLD. A simpler version of this could use direct polling (change the sleep 10.9 to sleep 1 and get rid of the trap).
#!/usr/bin/bash
set -o monitor
trap "pkill -P $$ -f 'sleep 10\.9' >&/dev/null" SIGCHLD
totaljobs=15
numjobs=5
worktime=10
curjobs=0
declare -A pidlist
dojob()
{
slot=$1
time=$(echo "$RANDOM * 10 / 32768" | bc -l)
echo Starting job $slot with args $time
sleep $time &
pidlist[$slot]=`jobs -p %%`
curjobs=$(($curjobs + 1))
totaljobs=$(($totaljobs - 1))
}
# start
while [ $curjobs -lt $numjobs -a $totaljobs -gt 0 ]
do
dojob $curjobs
done
# Poll for jobs to die, restarting while we have them
while [ $totaljobs -gt 0 ]
do
for ((i=0;$i < $curjobs;i++))
do
if ! kill -0 ${pidlist[$i]} >&/dev/null
then
dojob $i
break
fi
done
sleep 10.9 >&/dev/null
done
wait
Other answer about 4 shell scripts does not fully satisfies me as it assumes that all tasks take approximatelu the same time and because it requires manual set up. But here is how I would improve it.
Main script will create symbolic links to executables following certain namimg convention. For example,
ln -s executable1 ./01-task.01
first prefix is for sorting and suffix identifies batch (01-04).
Now we spawn 4 shell scripts that would take batch number as input and do something like this
for t in $(ls ./*-task.$batch | sort ; do
t
rm t
done
Look at my implementation of job pool in bash: https://github.com/spektom/shell-utils/blob/master/jp.sh
For example, to run at most 3 processes of cURL when downloading from a lot of URLs, you can wrap your cURL commands as follows:
./jp.sh "My Download Pool" 3 curl http://site1/...
./jp.sh "My Download Pool" 3 curl http://site2/...
./jp.sh "My Download Pool" 3 curl http://site3/...
...
Here is my solution. The idea is quite simple. I create a fifo as a semaphore, where each line stands for an available resource. When reading the queue, the main process blocks if there is nothing left. And, we return the resource after the task is done by simply echoing anything to the queue.
function task() {
local task_no="$1"
# doing the actual task...
echo "Executing Task ${task_no}"
# which takes a long time
sleep 1
}
function execute_concurrently() {
local tasks="$1"
local ps_pool_size="$2"
# create an anonymous fifo as a Semaphore
local sema_fifo
sema_fifo="$(mktemp -u)"
mkfifo "${sema_fifo}"
exec 3<>"${sema_fifo}"
rm -f "${sema_fifo}"
# every 'x' stands for an available resource
for i in $(seq 1 "${ps_pool_size}"); do
echo 'x' >&3
done
for task_no in $(seq 1 "${tasks}"); do
read dummy <&3 # blocks util a resource is available
(
trap 'echo x >&3' EXIT # returns the resource on exit
task "${task_no}"
)&
done
wait # wait util all forked tasks have finished
}
execute_concurrently 10 4
The script above will run 10 tasks and 4 each time concurrently. You can change the $(seq 1 "${tasks}") sequence to the actual task queue you want to run.
I made my modifications based on methods introduced in this Writing a process pool in Bash.
#!/bin/bash
#set -e # this doesn't work here for some reason
POOL_SIZE=4 # number of workers running in parallel
#######################################################################
# populate jobs #
#######################################################################
declare -a jobs
for (( i = 1988; i < 2019; i++ )); do
jobs+=($i)
done
echo '################################################'
echo ' Launching jobs'
echo '################################################'
parallel() {
local proc procs jobs cur
jobs=("$#") # input jobs array
declare -a procs=() # processes array
cur=0 # current job idx
morework=true
while $morework; do
# if process array size < pool size, try forking a new proc
if [[ "${#procs[#]}" -lt "$POOL_SIZE" ]]; then
if [[ $cur -lt "${#jobs[#]}" ]]; then
proc=${jobs[$cur]}
echo "JOB ID = $cur; JOB = $proc."
###############
# do job here #
###############
sleep 3 &
# add to current running processes
procs+=("$!")
# move to the next job
((cur++))
else
morework=false
continue
fi
fi
for n in "${!procs[#]}"; do
kill -0 "${procs[n]}" 2>/dev/null && continue
# if process is not running anymore, remove from array
unset procs[n]
done
done
wait
}
parallel "${jobs[#]}"
xargs with -P and -L options does the job.
You can extract the idea from the example below:
#!/usr/bin/env bash
workers_pool_size=10
set -e
function doit {
cmds=""
for e in 4 8 16; do
for m in 1 2 3 4 5 6; do
cmd="python3 ./doit.py --m $m -e $e -m $m"
cmds="$cmd\n$cmds"
done
done
echo -e "All commands:\n$cmds"
echo "Workers pool size = $workers_pool_size"
echo -e "$cmds" | xargs -t -P $workers_pool_size -L 1 time > /dev/null
}
doit
#! /bin/bash
doSomething() {
<...>
}
getCompletedThreads() {
_runningThreads=("$#")
removableThreads=()
for pid in "${_runningThreads[#]}"; do
if ! ps -p $pid > /dev/null; then
removableThreads+=($pid)
fi
done
echo "$removableThreads"
}
releasePool() {
while [[ ${#runningThreads[#]} -eq $MAX_THREAD_NO ]]; do
echo "releasing"
removableThreads=( $(getCompletedThreads "${runningThreads[#]}") )
if [ ${#removableThreads[#]} -eq 0 ]; then
sleep 0.2
else
for removableThread in "${removableThreads[#]}"; do
runningThreads=( ${runningThreads[#]/$removableThread} )
done
echo "released"
fi
done
}
waitAllThreadComplete() {
while [[ ${#runningThreads[#]} -ne 0 ]]; do
removableThreads=( $(getCompletedThreads "${runningThreads[#]}") )
for removableThread in "${removableThreads[#]}"; do
runningThreads=( ${runningThreads[#]/$removableThread} )
done
if [ ${#removableThreads[#]} -eq 0 ]; then
sleep 0.2
fi
done
}
MAX_THREAD_NO=10
runningThreads=()
sequenceNo=0
for i in {1..36}; do
releasePool
((sequenceNo++))
echo "added $sequenceNo"
doSomething &
pid=$!
runningThreads+=($pid)
done
waitAllThreadComplete

Bash: limit the number of concurrent jobs? [duplicate]

This question already has answers here:
Parallelize Bash script with maximum number of processes
(16 answers)
Closed 1 year ago.
Is there an easy way to limit the number of concurrent jobs in bash? By that I mean making the & block when there are more then n concurrent jobs running in the background.
I know I can implement this with ps | grep -style tricks, but is there an easier way?
If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you can do this:
parallel gzip ::: *.log
which will run one gzip per CPU core until all logfiles are gzipped.
If it is part of a larger loop you can use sem instead:
for i in *.log ; do
echo $i Do more stuff here
sem -j+0 gzip $i ";" echo done
done
sem --wait
It will do the same, but give you a chance to do more stuff for each file.
If GNU Parallel is not packaged for your distribution you can install GNU Parallel simply by:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
It will download, check signature, and do a personal installation if it cannot install globally.
Watch the intro videos for GNU Parallel to learn more:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
A small bash script could help you:
# content of script exec-async.sh
joblist=($(jobs -p))
while (( ${#joblist[*]} >= 3 ))
do
sleep 1
joblist=($(jobs -p))
done
$* &
If you call:
. exec-async.sh sleep 10
...four times, the first three calls will return immediately, the fourth call will block until there are less than three jobs running.
You need to start this script inside the current session by prefixing it with ., because jobs lists only the jobs of the current session.
The sleep inside is ugly, but I didn't find a way to wait for the first job that terminates.
The following script shows a way to do this with functions. You can either put the bgxupdate() and bgxlimit() functions in your script, or have them in a separate file which is sourced from your script with:
. /path/to/bgx.sh
It has the advantage that you can maintain multiple groups of processes independently (you can run, for example, one group with a limit of 10 and another totally separate group with a limit of 3).
It uses the Bash built-in jobs to get a list of sub-processes but maintains them in individual variables. In the loop at the bottom, you can see how to call the bgxlimit() function:
Set up an empty group variable.
Transfer that to bgxgrp.
Call bgxlimit() with the limit and command you want to run.
Transfer the new group back to your group variable.
Of course, if you only have one group, just use bgxgrp variable directly rather than transferring in and out.
#!/bin/bash
# bgxupdate - update active processes in a group.
# Works by transferring each process to new group
# if it is still active.
# in: bgxgrp - current group of processes.
# out: bgxgrp - new group of processes.
# out: bgxcount - number of processes in new group.
bgxupdate() {
bgxoldgrp=${bgxgrp}
bgxgrp=""
((bgxcount = 0))
bgxjobs=" $(jobs -pr | tr '\n' ' ')"
for bgxpid in ${bgxoldgrp} ; do
echo "${bgxjobs}" | grep " ${bgxpid} " >/dev/null 2>&1
if [[ $? -eq 0 ]]; then
bgxgrp="${bgxgrp} ${bgxpid}"
((bgxcount++))
fi
done
}
# bgxlimit - start a sub-process with a limit.
# Loops, calling bgxupdate until there is a free
# slot to run another sub-process. Then runs it
# an updates the process group.
# in: $1 - the limit on processes.
# in: $2+ - the command to run for new process.
# in: bgxgrp - the current group of processes.
# out: bgxgrp - new group of processes
bgxlimit() {
bgxmax=$1; shift
bgxupdate
while [[ ${bgxcount} -ge ${bgxmax} ]]; do
sleep 1
bgxupdate
done
if [[ "$1" != "-" ]]; then
$* &
bgxgrp="${bgxgrp} $!"
fi
}
# Test program, create group and run 6 sleeps with
# limit of 3.
group1=""
echo 0 $(date | awk '{print $4}') '[' ${group1} ']'
echo
for i in 1 2 3 4 5 6; do
bgxgrp=${group1}; bgxlimit 3 sleep ${i}0; group1=${bgxgrp}
echo ${i} $(date | awk '{print $4}') '[' ${group1} ']'
done
# Wait until all others are finished.
echo
bgxgrp=${group1}; bgxupdate; group1=${bgxgrp}
while [[ ${bgxcount} -ne 0 ]]; do
oldcount=${bgxcount}
while [[ ${oldcount} -eq ${bgxcount} ]]; do
sleep 1
bgxgrp=${group1}; bgxupdate; group1=${bgxgrp}
done
echo 9 $(date | awk '{print $4}') '[' ${group1} ']'
done
Here’s a sample run, with blank lines inserted to clearly delineate different time points:
0 12:38:00 [ ]
1 12:38:00 [ 3368 ]
2 12:38:00 [ 3368 5880 ]
3 12:38:00 [ 3368 5880 2524 ]
4 12:38:10 [ 5880 2524 1560 ]
5 12:38:20 [ 2524 1560 5032 ]
6 12:38:30 [ 1560 5032 5212 ]
9 12:38:50 [ 5032 5212 ]
9 12:39:10 [ 5212 ]
9 12:39:30 [ ]
The whole thing starts at 12:38:00 (time t = 0) and, as you can see, the first three processes run immediately.
Each process sleeps for 10n seconds and the fourth process doesn’t start until the first exits (at time t = 10). You can see that process 3368 has disappeared from the list before 1560 is added.
Similarly, the fifth process 5032 starts when 5880 (the second) exits at time t = 20.
And finally, the sixth process 5212 starts when 2524 (the third) exits at time t = 30.
Then the rundown begins, the fourth process exits at time t = 50 (started at 10 with 40 duration).
The fifth exits at time t = 70 (started at 20 with 50 duration).
Finally, the sixth exits at time t = 90 (started at 30 with 60 duration).
Or, if you prefer it in a more graphical time-line form:
Process: 1 2 3 4 5 6
-------- - - - - - -
12:38:00 ^ ^ ^ 1/2/3 start together.
12:38:10 v | | ^ 4 starts when 1 done.
12:38:20 v | | ^ 5 starts when 2 done.
12:38:30 v | | ^ 6 starts when 3 done.
12:38:40 | | |
12:38:50 v | | 4 ends.
12:39:00 | |
12:39:10 v | 5 ends.
12:39:20 |
12:39:30 v 6 ends.
Here's the shortest way:
waitforjobs() {
while test $(jobs -p | wc -w) -ge "$1"; do wait -n; done
}
Call this function before forking off any new job:
waitforjobs 10
run_another_job &
To have as many background jobs as cores on the machine, use $(nproc) instead of a fixed number like 10.
Assuming you'd like to write code like this:
for x in $(seq 1 100); do # 100 things we want to put into the background.
max_bg_procs 5 # Define the limit. See below.
your_intensive_job &
done
Where max_bg_procs should be put in your .bashrc:
function max_bg_procs {
if [[ $# -eq 0 ]] ; then
echo "Usage: max_bg_procs NUM_PROCS. Will wait until the number of background (&)"
echo " bash processes (as determined by 'jobs -pr') falls below NUM_PROCS"
return
fi
local max_number=$((0 + ${1:-0}))
while true; do
local current_number=$(jobs -pr | wc -l)
if [[ $current_number -lt $max_number ]]; then
break
fi
sleep 1
done
}
The following function (developed from tangens answer above, either copy into script or source from file):
job_limit () {
# Test for single positive integer input
if (( $# == 1 )) && [[ $1 =~ ^[1-9][0-9]*$ ]]
then
# Check number of running jobs
joblist=($(jobs -rp))
while (( ${#joblist[*]} >= $1 ))
do
# Wait for any job to finish
command='wait '${joblist[0]}
for job in ${joblist[#]:1}
do
command+=' || wait '$job
done
eval $command
joblist=($(jobs -rp))
done
fi
}
1) Only requires inserting a single line to limit an existing loop
while :
do
task &
job_limit `nproc`
done
2) Waits on completion of existing background tasks rather than polling, increasing efficiency for fast tasks
This might be good enough for most purposes, but is not optimal.
#!/bin/bash
n=0
maxjobs=10
for i in *.m4a ; do
# ( DO SOMETHING ) &
# limit jobs
if (( $(($((++n)) % $maxjobs)) == 0 )) ; then
wait # wait until all have finished (not optimal, but most times good enough)
echo $n wait
fi
done
If you're willing to do this outside of pure bash, you should look into a job queuing system.
For instance, there's GNU queue or PBS. And for PBS, you might want to look into Maui for configuration.
Both systems will require some configuration, but it's entirely possible to allow a specific number of jobs to run at once, only starting newly queued jobs when a running job finishes. Typically, these job queuing systems would be used on supercomputing clusters, where you would want to allocate a specific amount of memory or computing time to any given batch job; however, there's no reason you can't use one of these on a single desktop computer without regard for compute time or memory limits.
It is hard to do without wait -n (for example, shell in busybox does not support it). So here is a workaround, it is not optimal because it calls 'jobs' and 'wc' commands 10x per second. You can reduce the calls to 1x per second for example, if you don't mind waiting a bit longer for each job to complete.
# $1 = maximum concurent jobs
#
limit_jobs()
{
while true; do
if [ "$(jobs -p | wc -l)" -lt "$1" ]; then break; fi
usleep 100000
done
}
# and now start some tasks:
task &
limit_jobs 2
task &
limit_jobs 2
task &
limit_jobs 2
task &
limit_jobs 2
wait
On Linux I use this to limit the bash jobs to the number of available CPUs (possibly overriden by setting the CPU_NUMBER).
[ "$CPU_NUMBER" ] || CPU_NUMBER="`nproc 2>/dev/null || echo 1`"
while [ "$1" ]; do
{
do something
with $1
in parallel
echo "[$# items left] $1 done"
} &
while true; do
# load the PIDs of all child processes to the array
joblist=(`jobs -p`)
if [ ${#joblist[*]} -ge "$CPU_NUMBER" ]; then
# when the job limit is reached, wait for *single* job to finish
wait -n
else
# stop checking when we're below the limit
break
fi
done
# it's great we executed zero external commands to check!
shift
done
# wait for all currently active child processes
wait
Wait command, -n option, waits for the next job to terminate.
maxjobs=10
# wait for the amount of processes less to $maxjobs
jobIds=($(jobs -p))
len=${#jobIds[#]}
while [ $len -ge $maxjobs ]; do
# Wait until one job is finished
wait -n $jobIds
jobIds=($(jobs -p))
len=${#jobIds[#]}
done
Have you considered starting ten long-running listener processes and communicating with them via named pipes?
you can use ulimit -u
see http://ss64.com/bash/ulimit.html
Bash mostly processes files line by line.
So you cap split input file input files by N lines then simple pattern is applicable:
mkdir tmp ; pushd tmp ; split -l 50 ../mainfile.txt
for file in * ; do
while read a b c ; do curl -s http://$a/$b/$c <$file &
done ; wait ; done
popd ; rm -rf tmp;

Resources