Parallel processing or threading in Shell scripting - shell

I am writing a shell script in which a command takes 2 minutes every time it runs, and there is nothing I can do to speed the command itself up. If I run this command 100 times in the script, the total time becomes 200 minutes, which is a big issue; nobody wants to wait 200 minutes. What I want is to run all 100 commands in parallel so the output comes back in 2 minutes, or perhaps a little more, but not 200 minutes.
Any help with this would be appreciated.

GNU Parallel is what you want, unless you want to reinvent the wheel. Here are some more detailed examples, but the short of it:
ls | parallel gzip # gzip all files in a directory
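For the question as asked (the same command run 100 times, with no per-run argument), a minimal sketch along the same lines, assuming your slow command is called MyCommand (substitute your real command):
seq 100 | parallel -j0 -N0 MyCommand   # -j0 = run as many jobs as possible, -N0 = don't pass the number as an argument
Note that -j0 will try to start all 100 runs at once; use e.g. -j10 instead to cap the number of simultaneous runs.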

... run all 100 commands in parallel so that output will come in 2 min
This is only possible if you have 100 processors on your system (assuming each run keeps a processor busy for its full 2 minutes).
The shell itself has no built-in construct that runs a whole batch of commands in parallel for you. What you can do is run your command in the background:
for ((i=0; i<100; i++))
do
    MyCommand &
done
With & (background), each execution is scheduled as soon as possible. But this doesn't guarantee that your code will finish in less than 200 minutes; that depends on how many processors your system has.
If you have only one processor and each execution of the command is doing actual computation for its 2 minutes, then the processor is already busy and no cycles are being wasted. In this case, running the commands in parallel will not help, because there is only one processor and it is not free; the processes will simply wait for their turn to be executed.
If you have more than one processor, then the above method (for loop) may help reduce the total execution time.
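On the other hand, if the command spends most of its 2 minutes waiting (on the network, on disk, on another machine), backgrounding helps even on a single processor. A minimal sketch that throttles the backgrounded runs so you do not fork all 100 at once, again assuming your slow command is called MyCommand:
#!/bin/bash
for i in $(seq 1 100); do
    MyCommand &                                # start one run in the background
    while [ "$(jobs -r | wc -l)" -ge 10 ]; do
        sleep 1                                # throttle: at most 10 runs at a time
    done
done
wait                                           # block until every backgrounded run has finished
echo "All 100 runs finished."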

As @KingsIndian said, you can background tasks, which sort of lets them run in parallel. Beyond this, you can also keep track of them by process ID:
#!/bin/bash

# Function to be backgrounded
track() {
    sleep "$1"
    printf "\nFinished: %d\n" "$1"
}

start=$(date '+%s')

rand3="$(jot -s ' ' -r 3 5 10)"
# If you don't have `jot` (*BSD/OSX), substitute your own numbers here.
#rand3="5 8 10"
echo "Random numbers: $rand3"

# Make an associative array in which you'll record pids.
declare -A pids

# Background an instance of the track() function for each number, record the pid.
for n in $rand3; do
    track "$n" &
    pid=$!
    echo "Backgrounded: $n (pid=$pid)"
    pids[$pid]=$n
done

# Watch your stable of backgrounded processes.
# If a pid goes away, remove it from the array.
while [ -n "${pids[*]}" ]; do
    sleep 1
    for pid in "${!pids[@]}"; do
        if ! ps "$pid" >/dev/null; then
            unset pids[$pid]
            echo "unset: $pid"
        fi
    done
    if [ -z "${!pids[*]}" ]; then
        break
    fi
    printf "\rStill waiting for: %s ... " "${pids[*]}"
done

printf "\r%-25s \n" "Done."
printf "Total runtime: %d seconds\n" "$(($(date '+%s') - start))"
You should also take a look at the Bash documentation on coprocesses.
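For completeness, a tiny coprocess sketch (bash 4+); the coprocess here is just bc, used as a stand-in for any long-running helper:
coproc CALC { bc -l; }                 # start bc as a coprocess named CALC
echo "2^10" >&"${CALC[1]}"             # write to the coprocess's stdin
read -r answer <&"${CALC[0]}"          # read one line from its stdout
echo "bc says: $answer"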

Related

BASH - After 'wait', why does 'jobs -p' sometimes show 'Done' for a background process?

The short version: My bash script has a function.
This function then launches several instances (a maximum of 10) of another function in the background (with &).
I keep a count of how many are still active with jobs -p | wc -w in a do loop. When I'm done with the loop, I break.
I then use wait to ensure that all those processes terminate before continuing.
However, when I check the count (with jobs -p) I sometimes find this:
[10] 9311 Done my_background_function_name $param
How can I get wait to only proceed when all the launched child-processes have completely terminated and the jobs list is empty?
Why are jobs sometimes shown with "Done" and sometimes not?
Clearly, my knowledge of how jobs works is deficient. :)
Thanks.
Inside a bash script, it seems that even after all jobs have ended, jobs -p still reports the last one that finished.
This works for me in bash:
while true; do
    sleep 5
    jobs_running=($(jobs -l | grep Running | awk '{print $2}'))
    if [ ${#jobs_running[@]} -eq 0 ]; then
        break
    fi
    echo "Jobs running: ${jobs_running[@]}"
done
Using the "wait" command you cannot tell when each process ends.
With the previous algorithm you can.
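If you do want to use wait and still know when each child finishes, an alternative sketch is to record the PIDs yourself and wait on them one by one (my_background_function and its parameters are placeholders here):
pids=()
for param in a b c; do
    my_background_function "$param" &
    pids+=($!)                          # remember each child's PID
done
for pid in "${pids[@]}"; do
    wait "$pid"                         # blocks until that particular child exits
    echo "pid $pid finished with status $?"
done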

Background processes in bash for loop

On Ubuntu 14.04, I have some Python scripts that I need to run on a range of inputs. Some of these can be run in parallel, so I launch them as background processes from a bash script (below).
for m in `seq 3 7`;
do
    python script_1.py $m
    for a in `seq 1 10 201`;
    do
        for b in `seq 0 9`;
        do
            t=$(($a + $b))
            python script_2.py $m $t &
        done
        wait
    done
done
So I would like to run the Python script in batches of 10, then wait until the entire batch has finished before moving on to the next batch of 10, hence the wait command.
However, I find that when I run this bash script, script_2.py runs on the first 20 input values, rather than just the first 10, as background processes. Moreover, the script continues to execute as desired, in batches of 10, after this first batch of 20. Is there an obvious reason why this is happening, and how I can prevent it from doing so?
Cheers!
I don't see anything wrong in your code. The only explanation that comes to my mind is that the first 10 executions of your script_2.py exit almost immediately, giving you the impression that 20 instances ran in parallel the very first time. I'd add some debug code to your script to check this. Something like:
...
for b in {0..9} ; do
    t=$(($a + $b))
    echo "now running script_2.py with t=${t}" >> mylog.txt
    python script_2.py $m $t &
done
echo "now waiting..." >> mylog.txt
wait
...
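To check the same thing from the outside, you could also watch the process table while the batch runs; a rough sketch (assumes pgrep from procps is available and matches on the script name from the question):
while pgrep -fc "script_2.py" > /dev/null; do
    echo "$(date '+%T') instances running: $(pgrep -fc 'script_2.py')"
    sleep 1
done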

Wait for pid end signal

I wrote a script to parallelize MATLAB (I'm having some trouble with the MATLAB Parallel Computing Toolbox). The idea is to launch MATLAB simultaneously on all available processors. At the moment, the script launches more MATLAB instances than there are processors on the machine. I would like to know how to add something to the code that waits for a signal meaning "yep, keep going".
How can I be sure that every task is sent to a different core?
Moreover, I'm working on a remote computer and I would like to be able to close my terminal while the code keeps running, so I use disown. How can I be sure that the disown applies to the job launched on the previous line?
Thanks a lot
#! /bin/bash
#
# parmat.sh File Nb_iteration
#
np=$(nproc)
echo "number of available processors: "$np
nbf=$(( $2 / $np ))    # number of loops over all processors
rmd=$(expr $2 % $np)   # remainder

# Loop
for var1 in $(seq 1 $nbf)
do
    lp=$((var1 * $np - $np + 1))
    le=$(($lp + $np - 1))
    for var in $(seq $lp $le)
    do
        echo $var
        sed s/pl_id/$var/g <$1 >temp_$var.m
        /applications/matlab/r2013a/bin/matlab -nodesktop -r temp_$var &
        #rm temp_$var.m
        disown
    done
    # write something here so the loop waits until all MATLAB instances have finished their run
done

# Remainder
if [ "$rmd" -ne "0" ]
then
    lp=$(($nbf * $np + 1))
    le=$(($lp + $rmd - 1))
    for var in $(seq $lp $le)
    do
        echo $var
        sed s/pl_id/$var/g <$1 >temp_$var.m
        /usr/local/MATLAB/R2011b/bin/matlab -nodesktop -r temp_$var &
        #rm temp_$var.m
        disown
    done
fi
You might try executing each matlab instance in a separate backgrounded subshell, and then just calling wait at the bottom of your outer loop.
Here's an example I came up with that (I think) solves both problems (i.e., how to wait for all instances to finish, and how to run each instance on a specific CPU):
#!/bin/bash
numCpus=$(grep -c ^processor /proc/cpuinfo)  # I don't have nproc on my system
for cpu in $(seq 0 $((numCpus-1))); do
    (
        sleepSecs=$(( RANDOM % 10 + 1 ))
        echo "Sleeping for $sleepSecs seconds on CPU $cpu..."
        taskset -c $cpu sleep $sleepSecs
        echo "Done sleeping on CPU $cpu."
    ) &
done
usleep 500  # This is just here to keep the output ordered correctly
echo "Waiting for subshells to finish..."
wait
echo "All subshells completed."
Each subshell is run in the background with the & suffix and sleeps a random amount of time between 1 and 10 seconds. After spawning the subshells, calling wait with no arguments causes the parent shell to wait for all subshells to complete. Note that this assumes you haven't spawned any other subshells prior to this point in the script. If you have, you'll have to keep track of the PIDs or job numbers of each of the subshells you want to wait on, and pass them as arguments to wait.
Running this on my machine, I get something that looks like this:
Sleeping for 2 seconds on CPU 0...
Sleeping for 9 seconds on CPU 1...
Sleeping for 4 seconds on CPU 3...
Sleeping for 8 seconds on CPU 2...
Sleeping for 10 seconds on CPU 4...
Sleeping for 9 seconds on CPU 5...
Waiting for subshells to finish...
Done sleeping on CPU 0.
Done sleeping on CPU 3.
Done sleeping on CPU 2.
Done sleeping on CPU 5.
Done sleeping on CPU 1.
Done sleeping on CPU 4.
All subshells completed.
Edit: Of course, if you want visual confirmation that each subshell is running on the intended CPU, you should have it do something other than sleep, since sleep (by design) doesn't consume CPU cycles and so won't show up on your CPU monitor. You can still confirm by printing the PID of each spawned subshell and then checking with ps or top which CPU it is running on; these commands don't show that information by default, but they have options to display it. Also, keep in mind that taskset sets a process's CPU affinity mask, and on Linux the scheduler only places the process on CPUs in that mask, so pinning to a single CPU keeps it there unless something later changes the mask.
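For reference, one way to see which CPU a process last ran on is the psr column of ps (Linux procps; $pid here is a placeholder for the PID you printed when backgrounding):
ps -o pid,psr,comm -p "$pid"    # PSR is the processor the process is currently assigned to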

How to run given function in Bash in parallel?

There have been some similar questions, but my problem is not "run several programs in parallel" - which can be trivially done with parallel or xargs.
I need to parallelize Bash functions.
Let's imagine code like this:
for i in "${list[#]}"
do
for j in "${other[#]}"
do
# some processing in here - 20-30 lines of almost pure bash
done
done
Some of the processing requires calls to external programs.
I'd like to run some (4-10) tasks in parallel, each running for a different $i. The total number of elements in $list is > 500.
I know I can put the whole for j ... done loop in an external script and just call this program in parallel, but is it possible to do it without splitting the functionality between two separate programs?
sem is part of GNU Parallel and is made for this kind of situation.
for i in "${list[#]}"
do
for j in "${other[#]}"
do
# some processing in here - 20-30 lines of almost pure bash
sem -j 4 dolong task
done
done
If you like the function better GNU Parallel can do the dual for loop in one go:
dowork() {
    echo "Starting i=$1, j=$2"
    sleep 5
    echo "Done i=$1, j=$2"
}
export -f dowork
parallel dowork ::: "${list[@]}" ::: "${other[@]}"
Edit: Please consider Ole's answer instead.
Instead of a separate script, you can put your code in a separate bash function. You can then export it, and run it via xargs:
#!/bin/bash
dowork() {
    sleep $((RANDOM % 10 + 1))
    echo "Processing i=$1, j=$2"
}
export -f dowork

for i in "${list[@]}"
do
    for j in "${other[@]}"
    do
        printf "%s\0%s\0" "$i" "$j"
    done
done | xargs -0 -n 2 -P 4 bash -c 'dowork "$@"' --
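For a self-contained test of the above, you could define the arrays with throwaway values first (purely illustrative; in the question they already exist):
list=(1 2 3 4)
other=(a b c)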
An efficient solution that can also run multi-line commands in parallel:
for ...your_loop...; do
    if test "$(jobs | wc -l)" -ge 8; then
        wait -n
    fi
    {
        command1
        command2
        ...
    } &
done
wait
In your case:
for i in "${list[#]}"
do
for j in "${other[#]}"
do
if test "$(jobs | wc -l)" -ge 8; then
wait -n
fi
{
your
commands
here
} &
done
done
wait
If there are already 8 bash jobs running, wait -n waits for at least one of them to complete. When there are fewer jobs, the loop starts new ones asynchronously.
Benefits of this approach:
It's very easy for multi-line commands. All your variables are automatically "captured" in scope, so there is no need to pass them around as arguments.
It's relatively fast. Compare this, for example, to parallel (I'm quoting official man):
parallel is slow at starting up - around 250 ms the first time and 150 ms after that.
Only needs bash to work (wait -n requires bash 4.3 or newer).
Downsides:
There is a possibility that there were 8 jobs when we counted them, but fewer by the time we started waiting (this happens if a job finishes in the milliseconds between the two commands). This can make us wait even though fewer jobs than the limit are running. However, the loop resumes as soon as at least one job completes, or immediately if there are 0 jobs running (wait -n exits immediately in that case).
If you already have some commands running asynchronously (&) within the same bash script, the job count includes them, so you'll end up with fewer worker processes in the loop.

Shell script for testing

I want a simple testing shell script that launches a program N times in parallel and saves each different output to a different file. I have made a start that launches the program in parallel and saves the output, but how can I keep only the outputs that are different? Also, how can I make the echo DONE! actually indicate that everything has finished?
#!/bin/bash
N=10
for ((i=1; i<=N; ++i)); do
    ./test > output-$i &
done
echo DONE!
You'll want to use the wait builtin.
wait [n ...]
Wait for each specified process and return its termination status. Each n may be a process ID or a job specification; if a job spec is given, all processes in that job's pipeline are waited for. If n is not given, all currently active child processes are waited for, and the return status is zero. If n specifies a non-existent process or job, the return status is 127. Otherwise, the return status is the exit status of the last process or job waited for.
You could specify your jobs as %1, %2, ...:
wait %1 %2 %3 ...
but as long as you have no other child processes, you can just use it with no arguments; it'll then wait for all child processes to finish:
for ...; do
...
done
wait
echo "All done!"
Your separate question, how to keep only different outputs, is a little trickier. What exactly do you mean - different from what? If you have a baseline, you could do this:
for ...; do
    if diff -q $this_output $base_output; then
        # files are identical
        rm $this_output
    fi
done
If you want to keep all unique outputs, the algorithm's a little more complex, obviously, but you could still use diff -q to test for identical output.
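One rough sketch of that idea (the file names follow the output-* pattern from the question; everything else is illustrative): compare each output against the ones already kept and delete duplicates:
keep=()
for f in output-*; do
    dup=0
    for k in "${keep[@]}"; do
        if diff -q "$f" "$k" >/dev/null; then
            dup=1          # identical to a file we already kept
            break
        fi
    done
    if [ "$dup" -eq 1 ]; then
        rm "$f"
    else
        keep+=("$f")       # first time we've seen this output
    fi
done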
In order to have your output indicate that all the processes are finished, you need to call wait:
#!/bin/bash
N=10
for ((i=1; i<=N; ++i)); do
    ./test > output-$i &
done
wait # wait until all jobs are finished
echo DONE!
With GNU Parallel http://www.gnu.org/software/parallel/ you can do:
/tmp/test > base; seq 1 10 | parallel -k "/tmp/test > output-{}; if diff -q output-{} base; then rm output-{}; fi"; echo DONE
GNU Parallel is useful for other stuff. Watch the intro video to GNU Parallel: http://www.youtube.com/watch?v=OpaiGYxkSuQ
