Shell script for testing - bash

I want a simple testing shell script that launches a program N times in parallel and saves the output of each run to a different file. I have made a start that launches the program in parallel and saves the output, but how can I keep only the outputs that are different? Also, how can I make the echo DONE! actually indicate that all the runs have finished?
#!/bin/bash
N=10
for ((i=1; i<=N; ++i)); do
./test > output-$i &   # $i (not $N), so each run writes to its own file
done
echo DONE!

You'll want to use the wait builtin.
wait [n ...]
Wait for each specified process and return its termination status. Each n may be a process ID or a job specification; if a job spec is given, all processes in that job's pipeline are waited for. If n is not given, all currently active child processes are waited for, and the return status is zero. If n specifies a non-existent process or job, the return status is 127. Otherwise, the return status is the exit status of the last process or job waited for.
You could specify your jobs as %1, %2, ...:
wait %1 %2 %3 ...
but as long as you have no other child processes, you can just use it with no arguments; it'll then wait for all child processes to finish:
for ...; do
...
done
wait
echo "All done!"
Your separate question, how to keep only different outputs, is a little trickier. What exactly do you mean - different from what? If you have a baseline, you could do this:
for ...; do
if diff -q $this_output $base_output; then
# files are identical
rm $this_output
fi
done
If you want to keep all unique outputs, the algorithm's a little more complex, obviously, but you could still use diff -q to test for identical output.
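For instance, here is a minimal sketch of that approach (it assumes the outputs are named output-1 to output-N, as in your loop, and keeps the first occurrence of each distinct output):
keep=()
for f in output-*; do
  dup=
  for k in "${keep[@]}"; do
    if diff -q "$f" "$k" >/dev/null; then
      dup=1          # identical to a file we already kept
      break
    fi
  done
  if [ -n "$dup" ]; then
    rm "$f"
  else
    keep+=("$f")     # first file with this content; keep it
  fi
done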

In order to have your output indicate that all the processes are finished, you need to call wait:
#!/bin/bash
N=10
for ((i=1; i<=N; ++i)); do
./test > output-$i &
done
wait # wait until all jobs are finished
echo DONE!

With GNU Parallel http://www.gnu.org/software/parallel/ you can do:
/tmp/test > base; seq 1 10 | parallel -k "/tmp/test > output-{}; if diff -q output-{} base; then rm output-{}; fi"; echo DONE
GNU Parallel is useful for other stuff. Watch the intro video to GNU Parallel: http://www.youtube.com/watch?v=OpaiGYxkSuQ

Related

lazy (non-buffered) processing of shell pipeline

I'm trying to figure out how to perform the laziest possible processing of a standard UNIX shell pipeline. For example, let's say I have a command which does some calculations and outputting along the way, but the calculations get more and more expensive so that the first few lines of output arrive quickly but then subsequent lines get slower. If I'm only interested in the first few lines then I want to obtain those via lazy evaluation, terminating the calculations ASAP before they get too expensive.
This can be achieved with a straight-forward shell pipeline, e.g.:
./expensive | head -n 2
However, this does not work optimally. Let's simulate the calculations with a script which gets drastically slower with each line:
#!/bin/sh
i=1
while true; do
echo line $i
sleep $(( i ** 4 ))
i=$(( i+1 ))
done
Now when I pipe this script through head -n 2, I observe the following:
line 1 is output.
After sleeping one second, line 2 is output.
Despite head -n 2 having already received two (\n-terminated) lines and exiting, expensive carries on running and now waits a further 16 seconds (2 ** 4) before completing, at which point the pipeline also completes.
Obviously this is not as lazy as desired, because ideally expensive would terminate as soon as the head process receives two lines. However, this does not happen; IIUC it actually terminates after trying to write its third line, because at this point it tries to write to its STDOUT, which is connected through a pipe to the STDIN of the head process, which has already exited and is therefore no longer reading input from the pipe. This causes expensive to receive a SIGPIPE, which causes the shell interpreter running the script to invoke its SIGPIPE handling, which by default terminates the script (although this can be changed via the trap command).
So the question is, how can I make it so that expensive quits immediately when head quits, not just when expensive tries to write its third line to a pipe which no longer has a listener at the other end? Since the pipeline is constructed and managed by the interactive shell process I typed the ./expensive | head -n 2 command into, presumably that interactive shell is the place where any solution for this problem would lie, rather than in any modification of expensive or head? Is there any native trick or extra utility which can construct pipelines with the behaviour I want? Or maybe it's simply impossible to achieve what I want in bash or zsh, and the only way would be to write my own pipeline manager (e.g. in Ruby or Python) which spots when the reader terminates and immediately terminates the writer?
If all you care about is foreground control, you can run expensive in a process substitution; it still runs until it next tries to write, but head exits immediately (and your script's flow control can continue) after it has received its input:
head -n 2 < <(exec ./expensive)
# expensive still runs 16 seconds in the background, but doesn't block your program
In bash 4.4 and newer, process substitutions store their PIDs in $!, which allows process management in the same manner as other background processes.
# REQUIRES BASH 4.4 OR NEWER
exec {expensive_fd}< <(exec ./expensive); expensive_pid=$!
head -n 2 <&"$expensive_fd" # read the content we want
exec {expensive_fd}<&- # close the descriptor
kill "$expensive_pid" # and kill the process
Another approach is a coprocess, which has the advantage of only requiring bash 4.0:
# magic: store stdin and stdout FDs in an array named "expensive", and PID in expensive_PID
coproc expensive { exec ./expensive; }
# read two lines from input FD...
head -n 2 <&"${expensive[0]}"
# ...and kill the process.
kill "$expensive_PID"
I'll answer with a POSIX shell in mind.
What you can do is use a fifo instead of a pipe and kill the first link the moment the second finishes.
If the expensive process is a leaf process or if it takes care of killing its children, you can use a simple kill. If it's a process-spawning shell script, you should run it in a process group (doable with set -m) and kill it with a process-group kill.
Example code:
#!/bin/sh -e
expensive()
{
i=1
while true; do
echo line $i
sleep 0.$i #sped it up a little
echo >&2 slept
i=$(( i+1 ))
done
}
echo >&2 NORMAL
expensive | head -n2
#line 1
#slept
#line 2
#slept
echo >&2 SPED-UP
mkfifo pipe
exec 3<>pipe
rm pipe
set -m; expensive >&3 & set +m
<&3 head -n 2
kill -- -$!
#line 1
#slept
#line 2
If you run this, the second run should not have the second slept line, meaning the first link was killed the moment head finished, not when the first link tried to output after head was finished.

Bash function hangs once conditions are met

All,
I am trying to run a bash script that kicks off several subprocesses. The processes redirect to their own log files and I must kick them off in parallel. To do this I have written a check_procs function that monitors the number of processes with the same parent PID. Once the number drops back to 1, the script should continue. However, it seems to just hang. I am not sure why, but the code is below:
check_procs() {
while true; do
mypid=$$
backup_procs=`ps -eo ppid | grep -w $mypid | wc -w`
until [ $backup_procs == 1 ]; do
echo $backup_procs
sleep 5
backup_procs=`ps -eo ppid | grep -w $mypid | wc -w`
done
done
}
This function is called after the processes are kicked off, and I can see it echoing out the number of processes, but then the echoing stops (suggesting the count has reached 1) and nothing further happens; I can see the script is still in the process list on the server and I have to kill it off manually. The part where the function is called is below:
for ((i=1; i <= $threads; i++)); do
<Some trickery here to generate $cmdfile and $logfile>
nohup rman target / cmdfile=$cmdfile log=$logfile &
x=$(($x+1))
done
check_procs
$threads is a command line parameter passed to the script, and is a small number like 4 or 6. These are kicked off using nohup, as shown. When the condition in check_procs is satisfied, everything hangs instead of executing the remainder of the script. What's wrong with my function?
Maybe I'm mistaken, but isn't that expected? Your outer while true loop has no exit point: once the process count drops to 1, the inner until loop is skipped and the outer loop just spins forever (and without any delay, which is not recommended), so the function never returns. Unless the process count rises above 1 again, nothing more is echoed, which matches what you are seeing.
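A minimal sketch of the function without the outer loop (keeping your counting logic unchanged) would be:
check_procs() {
  mypid=$$
  backup_procs=`ps -eo ppid | grep -w $mypid | wc -w`
  until [ $backup_procs == 1 ]; do
    echo $backup_procs
    sleep 5
    backup_procs=`ps -eo ppid | grep -w $mypid | wc -w`
  done
}
Since the nohup'd rman commands are still direct children of the script, a plain wait after the launch loop (as in the other answers on this page) would achieve the same thing with less code.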

BASH - After 'wait', why does 'jobs -p' sometimes show 'Done' for a background process?

The short version: My bash script has a function.
This function then launches several instances (a maximum of 10) of another function in the background (with &).
I keep a count of how many are still active with jobs -p | wc -w in a do loop. When I'm done with the loop, I break.
I then use wait to ensure that all those processes terminate before continuing.
However, when I check the count (with jobs -p) I sometimes find this:
[10] 9311 Done my_background_function_name $param
How can I get wait to only proceed when all the launched child-processes have completely terminated and the jobs list is empty?
Why are jobs sometimes shown with "Done" and sometimes not?
Clearly, my knowledge of how jobs works is deficient. :)
Thanks.
Inside a bash script, it seems that even when all jobs have ended, jobs -p still returns the last one that finished.
This works for me in bash:
while true; do
sleep 5
jobs_running=($(jobs -l | grep Running | awk '{print $2}'))
if [ ${#jobs_running[@]} -eq 0 ]; then
break
fi
echo "Jobs running: ${jobs_running[@]}"
done
Using the "wait" command you cannot tell when each process ends.
With the previous algorithm you can.
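As for the Done lines: bash keeps a terminated background job in its job table until its status has been reported, so jobs can list it once more with a Done status even after wait has returned. If your bash is 4.3 or newer, a sketch using wait -n lets you react as each job finishes instead of polling; "params" here is just a stand-in for your own list of arguments:
# requires bash 4.3+ for wait -n
for param in "${params[@]}"; do
  my_background_function_name "$param" &
done
while [ "$(jobs -rp | wc -l)" -gt 0 ]; do
  wait -n                                  # returns as soon as any one job exits
  echo "a job finished; $(jobs -rp | wc -l) still running"
done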

How to run given function in Bash in parallel?

There have been some similar questions, but my problem is not "run several programs in parallel" - which can be trivially done with parallel or xargs.
I need to parallelize Bash functions.
Let's imagine code like this:
for i in "${list[#]}"
do
for j in "${other[#]}"
do
# some processing in here - 20-30 lines of almost pure bash
done
done
Some of the processing requires calls to external programs.
I'd like to run some (4-10) tasks, each running for different $i. Total number of elements in $list is > 500.
I know I can put the whole for j ... done loop in external script, and just call this program in parallel, but is it possible to do without splitting the functionality between two separate programs?
sem is part of GNU Parallel and is made for this kind of situation.
for i in "${list[#]}"
do
for j in "${other[#]}"
do
# some processing in here - 20-30 lines of almost pure bash
sem -j 4 dolong task
done
done
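One note: if you throttle the work through sem like this, you would normally add a final
sem --wait
after the outer loop, so the script blocks until all of the queued jobs have finished before it continues.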
If you like the function better, GNU Parallel can do the dual for loop in one go:
dowork() {
echo "Starting i=$1, j=$2"
sleep 5
echo "Done i=$1, j=$2"
}
export -f dowork
parallel dowork ::: "${list[@]}" ::: "${other[@]}"
Edit: Please consider Ole's answer instead.
Instead of a separate script, you can put your code in a separate bash function. You can then export it, and run it via xargs:
#!/bin/bash
dowork() {
sleep $((RANDOM % 10 + 1))
echo "Processing i=$1, j=$2"
}
export -f dowork
for i in "${list[#]}"
do
for j in "${other[#]}"
do
printf "%s\0%s\0" "$i" "$j"
done
done | xargs -0 -n 2 -P 4 bash -c 'dowork "$@"' --
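A quick note on the bash -c 'dowork "$@"' -- part: xargs appends each NUL-separated pair to that command line, the -- is consumed as $0 of the inner bash, and the pair becomes $1 and $2, which is exactly what dowork receives. -n 2 is what groups the values into pairs, and -P 4 is what caps the number of parallel workers.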
An efficient solution that can also run multi-line commands in parallel:
for ...your_loop...; do
if test "$(jobs | wc -l)" -ge 8; then
wait -n
fi
{
command1
command2
...
} &
done
wait
In your case:
for i in "${list[#]}"
do
for j in "${other[#]}"
do
if test "$(jobs | wc -l)" -ge 8; then
wait -n
fi
{
your
commands
here
} &
done
done
wait
If there are already 8 bash jobs running, wait -n waits for at least one of them to complete. When there are fewer jobs, the loop starts new ones asynchronously.
Benefits of this approach:
It's very easy for multi-line commands. All your variables are automatically "captured" in scope, no need to pass them around as arguments
It's relatively fast. Compare this, for example, to parallel (I'm quoting official man):
parallel is slow at starting up - around 250 ms the first time and 150 ms after that.
Only needs bash to work.
Downsides:
There is a possibility that there were 8 jobs when we counted them, but fewer when we started waiting (this happens if a job finishes in the milliseconds between the two commands). This can make us start waiting even though fewer than 8 jobs are running. However, the loop resumes when at least one job completes, or immediately if there are 0 jobs running (wait -n exits immediately in this case).
If you already have some commands running asynchronously (&) within the same bash script, you'll have fewer worker processes in the loop.

Parallel processing or threading in Shell scripting

I am writing a shell script in which a command runs and takes 2 minutes every time, and there is nothing I can do about that. But if I want to run this command 100 times in the script, the total time would be 200 minutes, and that is a big problem; nobody wants to wait 200 minutes. What I want is to run all 100 commands in parallel so that the output comes in 2 minutes, or maybe a little more, but not 200 minutes.
It will be appreciated if anybody can help me with this in any way.
GNU Parallel is what you want, unless you want to reinvent the wheel. Here are some more detailed examples, but the short of it:
ls | parallel gzip # gzip all files in a directory
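For the case described here (the same fixed command run 100 times), a rough sketch with GNU Parallel would be something like the following, where MyCommand stands for your 2-minute command:
seq 100 | parallel -j 100 -N0 MyCommand
-j controls how many run at once, and -N0 tells parallel not to append the numbers from seq as arguments. Whether the whole batch really finishes in about 2 minutes still depends on how many processors you have, as the next answer explains.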
... run all 100 commands parallely so that output will come in 2min
This is only possible if you have at least 100 processors on your system.
There is no built-in shell construct to run commands in parallel; what you can do is run your command in the background:
for ((i=0;i<100;i++))
do
MyCommand &
done
With & (background), each execution is scheduled as soon as possible. But this doesn't guarantee that your code will be executed in less than 200 minutes; that depends on how many processors there are on your system.
If you have only one processor and each execution of the command (which takes 2 minutes) is doing computation for those 2 minutes, then the processor is busy the whole time and no cycles are wasted. In this case, running the commands in parallel is not going to help, because there is only one processor and it is never free; the processes will just be waiting for their turn to be executed.
If you have more than one processor, then the above method (the for loop) might help in reducing the total execution time.
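If you do take the plain background-job route on a machine with fewer processors than commands, a rough sketch that caps the number of simultaneous jobs at the processor count (this assumes bash 4.3+ for wait -n and GNU coreutils for nproc) looks like this:
max_jobs=$(nproc)                    # number of available processors
for ((i=0; i<100; i++)); do
  if [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; then
    wait -n                          # wait for any one running job to finish
  fi
  MyCommand &
done
wait                                 # wait for whatever is still running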
As @KingsIndian said, you can background tasks, which sort of lets them run in parallel. Beyond this, you can also keep track of them by process ID:
#!/bin/bash
# Function to be backgrounded
track() {
sleep $1
printf "\nFinished: %d\n" "$1"
}
start=$(date '+%s')
rand3="$(jot -s\ -r 3 5 10)"
# If you don't have `jot` (*BSD/OSX), substitute your own numbers here.
#rand3="5 8 10"
echo "Random numbers: $rand3"
# Make an associative array in which you'll record pids.
declare -A pids
# Background an instance of the track() function for each number, record the pid.
for n in $rand3; do
track $n &
pid=$!
echo "Backgrounded: $n (pid=$pid)"
pids[$pid]=$n
done
# Watch your stable of backgrounded processes.
# If a pid goes away, remove it from the array.
while [ -n "${pids[*]}" ]; do
sleep 1
for pid in "${!pids[#]}"; do
if ! ps "$pid" >/dev/null; then
unset pids[$pid]
echo "unset: $pid"
fi
done
if [ -z "${!pids[*]}" ]; then
break
fi
printf "\rStill waiting for: %s ... " "${pids[*]}"
done
printf "\r%-25s \n" "Done."
printf "Total runtime: %d seconds\n" "$((`date '+%s'` - $start))"
You should also take a look at the Bash documentation on coprocesses.
