Queuing commands in Korn shell - shell

I am using the AIX Korn shell to execute a Perl script that accepts a numeric argument from 1 to 50, and I launch all 50 invocations in the background simultaneously. Is there a way to limit the number of background processes, to 5 for example? When one finishes, the next one should be started. My current code just launches all of them in the background at once.
i=1; while [[ $i -le 50 ]]; do perl some_script.pl "$i" & ((i+=1)); done
For example, if job 2 finishes, the next one (job 6) should be started, and so on.

There are various versions of KSH around. The original Korn shell, ksh88, has been the default shell on IBM AIX since version 4 (/usr/bin/ksh), but AIX also ships the Enhanced Korn Shell, ksh93 (/usr/bin/ksh93), which has more bells and whistles. It is those bells and whistles that make life easy in this case:
KSH93: In KSH93, you have a shell variable JOBMAX which does this for you:
JOBMAX: This variable defines the maximum number of background jobs that can run at a time. When this limit is reached, the shell waits for a job to complete before starting a new one.
JOBMAX=5
i=1; while [[ $i -le 50 ]]; do perl some_script.pl "$i" & ((i+=1)); done
By the way, you might be interested in using a for loop instead.
JOBMAX=5
for i in $(seq 1 50); do perl some_script.pl "$i" & done
KSH: If you cannot use KSH93 and have to stick to the POSIX 2 compliant KSH, you might consider using xargs, but only if it supports the --max-procs flag.
seq 1 50 | xargs -I{} --max-procs=5 perl some_script.pl {}
Sadly, AIX does not support the --max-procs flag.
So, you have to build something yourself:
procmax=5
for i in $(seq 1 50); do
    perl some_script.pl "$i" &
    (( i % procmax == 0 )) && wait
done
Unfortunately, this is not really a true parallel version, as it will wait until the first 5 processes have finished before it starts the next batch of 5.
So, you could have a look at jobs and do something with that:
procmax=5
checkinterval=1
for i in $(seq 1 50); do
    perl some_script.pl "$i" &
    while [[ $(jobs -l | wc -l) -ge "$procmax" ]]; do
        sleep "$checkinterval"
    done
done
This is still not perfectly parallel, due to the sleep, but it will have to do.
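If you want to stay on ksh88 but avoid the polling sleep, another option is to remember the PID of each background job and, before launching a new one, wait for the job started procmax iterations earlier. This is only a sketch (untested on AIX) and it is not optimal either, since a slow old job can hold up the queue even when newer jobs have already finished, but it does cap the number of concurrent jobs at procmax:
#!/usr/bin/ksh
procmax=5
i=1
while [[ $i -le 50 ]]; do
    if (( i > procmax )); then
        # wait for the job launched procmax iterations ago
        wait ${pid[$((i - procmax))]}
    fi
    perl some_script.pl "$i" &
    pid[$i]=$!
    (( i += 1 ))
done
wait   # wait for the remaining jobs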

Related

How to parallelize for-loop in bash limiting number of processes

I have a bash script similar to:
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
done
What's the most straightforward way to limit the number of parallel processes to NUM_PROCS? I'm looking for a solution that doesn't require packages/installations/modules (like GNU Parallel) if possible.
When I tried Charles Duffy's latest approach, I got the following trace from bash -x:
+ python run.py args 1
+ python run.py ... 3
+ python run.py ... 4
+ python run.py ... 2
+ read -r line
+ python run.py ... 1
+ read -r line
+ python run.py ... 4
+ read -r line
+ python run.py ... 2
+ read -r line
+ python run.py ... 3
+ read -r line
+ python run.py ... 0
+ read -r line
... continuing with other numbers between 0 and 5, until too many processes were started for the system to handle and the bash script was shut down.
bash 4.4 will have an interesting new type of parameter expansion that simplifies Charles Duffy's answer.
#!/bin/bash
num_procs=$1
num_iters=$2
num_jobs="\j" # The prompt escape for number of jobs currently running
for ((i=0; i<num_iters; i++)); do
    while (( ${num_jobs@P} >= num_procs )); do
        wait -n
    done
    python foo.py "$i" arg2 &
done
GNU, macOS/OSX, FreeBSD and NetBSD can all do this with xargs -P, no bash versions or package installs required. Here's 4 processes at a time:
printf "%s\0" {1..10} | xargs -0 -I # -P 4 python foo.py # arg2
As a very simple implementation, depending on a version of bash new enough to have wait -n (to wait until only the next job exits, as opposed to waiting for all jobs):
#!/bin/bash
# ^^^^ - NOT /bin/sh!
num_procs=$1
num_iters=$2
declare -A pids=( )
for ((i=0; i<num_iters; i++)); do
    while (( ${#pids[@]} >= num_procs )); do
        wait -n
        for pid in "${!pids[@]}"; do
            kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
        done
    done
    python foo.py "$i" arg2 & pids["$!"]=1
done
If running on a shell without wait -n, one can (very inefficiently) replace it with a command such as sleep 0.2, to poll every 1/5th of a second.
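For example, the inner loop above might become something like the following sketch; the 0.2-second interval is arbitrary, and the kill -0 check is what actually prunes finished PIDs:
while (( ${#pids[@]} >= num_procs )); do
    sleep 0.2   # poll roughly five times per second instead of blocking in wait -n
    for pid in "${!pids[@]}"; do
        kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
    done
done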
Since you're actually reading input from a file, another approach is to start N subprocesses, each of which processes only the lines where (linenum % N == threadnum):
num_procs=$1
infile=$2
for ((i=0; i<num_procs; i++)); do
    (
        while read -r line; do
            echo "Thread $i: processing $line"
        done < <(awk -v num_procs="$num_procs" -v i="$i" \
                     'NR % num_procs == i { print }' <"$infile")
    ) &
done
wait # wait for all the $num_procs subprocesses to finish
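If the script above were saved as, say, split_by_line.sh (a made-up name), it would be invoked with the worker count and the input file:
bash split_by_line.sh 4 input.txt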
Here is a relatively simple way to accomplish this with only two additional lines of code. The explanation is inline.
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
let 'i>=NUM_PROCS' && wait -n # wait for one process at a time once we've spawned $NUM_PROC workers
done
wait # wait for all remaining workers
Are you aware that if you are allowed to write and run your own scripts, then you can also use GNU Parallel? In essence it is a Perl script in one single file.
From the README:
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
seq $2 | parallel -j$1 python foo.py {} arg2
parallel --embed (available since 20180322) even makes it possible to distribute GNU Parallel as part of a shell script (i.e. no extra files needed):
parallel --embed >newscript
Then edit the end of newscript.
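For example (a hypothetical edit; the exact layout of the generated file may vary between versions), you could append your own invocation at the end of newscript:
# appended at the end of newscript
num_procs=$1
num_iters=$2
seq "$num_iters" | parallel -j"$num_procs" python foo.py {} arg2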
This isn't the simplest solution, but if your version of bash doesn't have "wait -n" and you don't want to use other programs like parallel or awk, here is a solution using while and for loops.
num_iters=10
total_threads=4
iter=1
while [[ "$iter" -le "$num_iters" ]]; do
    iters_remainder=$(echo "(${num_iters}-${iter})+1" | bc)
    if [[ "$iters_remainder" -lt "$total_threads" ]]; then
        threads=$iters_remainder
    else
        threads=$total_threads
    fi
    for ((t=1; t<="$threads"; t++)); do
        (
            # do stuff
        ) &
        ((++iter))
    done
    wait
done

How to run a fixed number of processes in a loop?

I have a script like this:
#!/bin/bash
for i=1 to 200000
do
create input file
run ./java
done
I need to run a number (8 or 16) of processes (java) at the same time and I don't know how. I know that wait could help, but I want 8 processes running at all times, not waiting for the first 8 to finish before starting the next 8.
bash 4.3 added a useful new flag to the wait command, -n, which causes wait to block until any single background job completes, rather than waiting for all of them (or the members of a given subset).
#!/bin/bash
cores=8 # or 16, or whatever
for ((i=1; i <= 200000; i++))
do
    # create input file and run java in the background.
    ./java &
    # Check how many background jobs there are, and if it
    # is equal to the number of cores, wait for any one to
    # finish before continuing.
    background=( $(jobs -p) )
    if (( ${#background[@]} == cores )); then
        wait -n
    fi
done
There is a small race condition: if you are at maximum load but a job completes after you run jobs -p, you'll still block until another job
completes. There's not much you can do about this, but it shouldn't present too much trouble in practice.
Prior to bash 4.3, you would need to poll the set of background jobs periodically to see when the pool dropped below your threshold.
while :; do
    background=( $(jobs -p) )
    if (( ${#background[@]} < cores )); then
        break
    fi
    sleep 1
done
Use GNU Parallel like this, simplified to 20 jobs rather than 200,000, with echo standing in for "create file" and sleep standing in for "java".
seq 1 20 | parallel -j 8 -k 'echo {}; sleep 2'
The -j 8 says how many jobs to run at once. The -k says to keep the output in order.
With a non-ancient version of GNU utilities or on *BSD/OSX, use xargs with the -P option to run processes in parallel.
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 mytask
where mytask is an auxiliary script, with the sequence number (the input line) available as the argument $1:
#!/bin/bash
echo "Task number $1"
create input file
run ./java
You can put everything in one script if you want:
#!/bin/bash
seq 200000 | xargs -P 8 -n 1 sh -c '
echo "Task number $1"
create input file
run ./java
' mytask
If your system doesn't have seq, you can use the bash snippet
for ((i=1; i<=200000; i++)); do echo "$i"; done
or other shell tools such as
awk 'BEGIN {for (i=1; i<=200000; i++) print i}' </dev/null
or
</dev/zero tr '\0' '\n' | head -n 200000 | nl
Set up 8 subprocesses that read from a common stream; each subprocess reads one line of input and starts a new job whenever its current job completes.
forker () {
    while read; do
        # create input file
        ./java
    done
}

cores=8 # or 16, or whatever

for ((i=1; i<=200000; i++)); do
    echo $i
done | {
    for ((j=0; j<cores; j++)); do
        forker &
    done
    wait # wait for the $cores forkers to complete
}

Monitoring life time of a process

I have a python script called hdsr_writer.py. I can launch this script in shell by calling
"python hdsr_writer.py 1234"
where 1234 is a parameter.
I made a shell script to increase the number and execute the python script with the number every 1 second
for param from 1 to 100000
python hdsr_writer.py $param &
sleep (1)
Usually, the python script executes its task within 0.5 seconds. However, there are times when the python script gets stuck and stays in the system for longer than 30 seconds. I don't want that. So I would like to monitor the lifetime of each python process that is launched: if one has been running for longer than 2 seconds, it should be killed and re-executed, at most 2 times.
Note: I would like to do this in the shell script, not the python script, because I cannot change the python script.
Update: more explanations about my question
Please note that launching a new python process and monitoring the python processes are independent jobs. The launching job doesn't care how many python processes are running or how "old" they are; it just calls "python hdsr_writer.py $param &" every second after incrementing param. The monitoring job, on the other hand, periodically checks the lifetime of all hdsr_writer python processes. If one has been in memory for more than 2 seconds, it kills it and re-runs it, at most 2 times.
Not so short answer
#!/bin/bash
param=1
while [[ $param -lt 100000 ]]; do
    echo "param=$param"
    chances=3
    while [[ $chances -gt 0 ]]; do
        python tst.py $param &
        sleep 2
        if [[ "$(jobs | grep 'Running')" == "" ]]; then
            chances=0
        else
            kill -9 $(jobs -l | awk '{print $2}')
            chances=$(($chances-1))
            if [[ $chances -gt 0 ]]; then
                echo "one more chance for parameter $param"
            fi
        fi
    done
    param=$(($param+1))
done
UPD
This is another answer as requested by OP.
It is still two scripts in one, but they can be split into two files.
Please note that $( ) & is used to run the sub-shells in the background.
#!/bin/bash
# Script launcher

pscript='rand.py'

for param in {1..10}
do
    # start a background sub-shell, where python is started with $param
    echo $(
        left=3
        error_on_exit=1
        # keep going while chances are left and the previous run exited with a non-zero code
        while [[ ( ( $left -gt 0 ) && ( $error_on_exit -ne 0 ) ) ]]; do
            left=$(($left-1))
            echo "param=$param; chances left $left "
            # run python and grab the python exit code (=0 if ok)
            python $pscript $param
            error_on_exit=$?
        done
    ) &
done

# Script controller
# just kills python processes older than 2 seconds
# exits after no python is left
# $(...) & can be removed if this code goes to a separate script
$(while [[ $(ps | grep -v 'grep' | grep -c python) != "0" ]]
do
    sleep 0.5
    killall -9 -q --older-than 2s python
done) &
Use a combination of the sleep and nohup commands. After the sleep time, use kill to finish the execution of the python script. You can check whether the process is still running with the ps command.
#!/usr/bin/ksh
for param in {1..100000}
do
    nohup python hdsr_writer.py $param &
    pid=$!
    sleep 2
    if ps -p $pid > /dev/null
    then
        kill -9 $pid
    fi
done
Re-answer:
I'd use two scripts, the first one (script1.ksh):
#!/usr/bin/ksh
for param in {1..1000000}
do
    nohup script2.ksh $param &
done
And the second (script2.ksh):
#!/usr/bin/ksh
for i in {1..3}
do
    python hdsr_writer.py $1 &
    pid=$!
    sleep 2
    if ps -p $pid > /dev/null
    then
        kill -9 $pid
    else
        echo 'Finished '$1 >> log.txt
        return
    fi
done
The first script will launch all your processes one after the other. The second one will watch its own python process.

bash: limiting subshells in a for loop with file list

I've been trying to get a for loop to run a bunch of commands more or less simultaneously and was attempting to do it via subshells. I've managed to cobble together the script below to test, and it seems to work OK.
#!/bin/bash
for i in {1..255}; do
(
#commands
)&
done
wait
The only problem is that my actual loop is going to be for i in files*, and then it just crashes, I assume because it has started too many subshells to handle. So I added
#!/bin/bash
for i in files*; do
(
#commands
)&
if (( $i % 10 == 0 )); then wait; fi
done
wait
which now fails. Does anyone know a way around this, either using a different command to limit the number of subshells or providing a number for $i?
Cheers
xargs/parallel
Another solution would be to use tools designed for concurrency:
printf '%s\0' files* | xargs -0 -P6 -n1 yourScript
The -P6 is the maximum number of concurrent processes that xargs will launch. Make it 10 if you like.
I suggest xargs because it is likely already on your system. If you want a really robust solution, look at GNU Parallel.
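With GNU Parallel installed, the equivalent of the xargs call might look like this (assuming, as above, that yourScript takes one filename per invocation):
parallel -j6 yourScript ::: files*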
Filenames in array
For another answer more specific to your question: get the counter as the array index:
files=( files* )
for i in "${!files[@]}"; do
    commands "${files[i]}" &
    (( i % 10 )) || wait
done
(The parentheses around the compound command aren't important, because backgrounding the job will have the same effect as using a subshell anyway.)
Function
Just different semantics:
simultaneous() {
    while [[ $1 ]]; do
        for i in {1..10}; do
            [[ ${@:i:1} ]] || break
            commands "${@:i:1}" &
        done
        shift 10 || shift "$#"
        wait
    done
}
simultaneous files*
You may find it useful to count the number of jobs with jobs, e.g.:
wc -w <<<$(jobs -p)
So, your code would look like this:
#!/bin/bash
for i in files*; do
(
#commands
)&
if (( $(wc -w <<<$(jobs -p)) % 10 == 0 )); then wait; fi
done
wait
As @chepner suggested:
In bash 4.3, you can use wait -n to proceed as soon as any job completes, rather than waiting for all of them
Define the counter explicitly
#!/bin/bash
for f in files*; do
(
#commands
)&
(( i++ % 10 == 0 )) && wait
done
wait
There's no need to initialize i, as it will default to 0 the first time you use it. There's also no need to reset the value, as i % 10 will be 0 for i=10, 20, 30, etc.
If you have Bash≥4.3, you can use wait -n:
#!/bin/bash
max_nb_jobs=10
for i in file*; do
    # Wait until there are less than max_nb_jobs jobs running
    while mapfile -t < <(jobs -pr) && (( ${#MAPFILE[@]} >= max_nb_jobs )); do
        wait -n
    done
    {
        # Your commands here: no useless subshells! use grouping instead
    } &
done
wait
If you don't have wait -n available, you can use something like this:
#!/bin/bash
set -m
max_nb_jobs=10
sleep_jobs() {
    # This function sleeps until there are less than $1 jobs running
    local n=$1
    while mapfile -t < <(jobs -pr) && (( ${#MAPFILE[@]} >= n )); do
        coproc read
        trap "echo >&${COPROC[1]}; trap '' SIGCHLD" SIGCHLD
        [[ $COPROC_PID ]] && wait $COPROC_PID
    done
}
for i in files*; do
    # Wait until there are less than 10 jobs running
    sleep_jobs "$max_nb_jobs"
    {
        # Your commands here: no useless subshells! use grouping instead
    } &
done
wait
The advantage of proceeding like this is that we make no assumptions about how long the jobs take to finish. A new job is launched as soon as there is room for it. Moreover, it's all pure Bash, so it doesn't rely on external tools and (maybe more importantly) you may use your Bash environment (variables, functions, etc.) without exporting it (arrays can't easily be exported, so that can be a huge pro).

Process Scheduling

Let's say, I have 10 scripts that I want to run regularly as cron jobs. However, I don't want all of them to run at the same time. I want only 2 of them running simultaneously.
One solution I'm thinking of is to create two scripts, put 5 statements in each of them, and add them as separate entries in the crontab. However, that solution seems very ad hoc.
Is there an existing unix tool to perform the task I mentioned above?
The jobs builtin can tell you how many child processes are running. Some simple shell scripting can accomplish this task:
MAX_JOBS=2
launch_when_not_busy()
{
    while [ $(jobs | wc -l) -ge $MAX_JOBS ]
    do
        # at least $MAX_JOBS are still running.
        sleep 1
    done
    "$@" &
}
launch_when_not_busy bash job1.sh --args
launch_when_not_busy bash jobTwo.sh
launch_when_not_busy bash job_three.sh
...
wait
NOTE: As pointed out by mobrule, my original answer will not work because the wait builtin with no arguments waits for ALL children to finish. Hence the following 'parallelexec' script, which avoids polling at the cost of more child processes:
#!/bin/bash
N="$1"
I=0
{
if [[ "$#" -le 1 ]]; then
cat
else
while [[ "$#" -gt 1 ]]; do
echo "$2"
set -- "$1" "${#:3}"
done
fi
} | {
d=$(mktemp -d /tmp/fifo.XXXXXXXX)
mkfifo "$d"/fifo
exec 3<>"$d"/fifo
rm -rf "$d"
while [[ "$I" -lt "$N" ]] && read C; do
($C; echo >&3) &
let I++
done
while read C; do
read -u 3
($C; echo >&3) &
done
}
The first argument is the maximum number of parallel jobs. If there are more arguments, each one is run as a job; otherwise, the commands to run are read from stdin, line by line.
I use a named pipe (which is sent to oblivion as soon as the shell opens it) as a synchronization method. Since only single bytes are written, there are no race-condition issues that could complicate things.
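Assuming the script above is saved as parallelexec and made executable, it could be used like this (the job scripts are hypothetical):
# commands passed as arguments, at most 2 running at a time
./parallelexec 2 "bash job1.sh --args" "bash jobTwo.sh" "bash job_three.sh"
# or commands read from stdin, one per line, at most 4 running at a time
printf '%s\n' "bash job1.sh" "bash jobTwo.sh" "bash job_three.sh" | ./parallelexec 4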
GNU Parallel is designed for this kind of task:
sem -j2 do_stuff
sem -j2 do_other_stuff
sem -j2 do_third_stuff
do_third_stuff will only be run when either do_stuff or do_other_stuff has finished.
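If the calling script needs to block until everything queued with sem has finished, GNU Parallel also provides a wait option for sem (to the best of my knowledge it has been available for a long time):
sem --wait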
Watch the intro videos to learn more:
http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
