set -e with multiple subshells. non-blocking wait -n - bash

In a CI setting, I'd like to run multiple jobs in the background, and use set -e to exit on the first error.
This requires using wait -n instead of wait, but for increasing throughput I'd then want to move the for i in {1..20}; do wait -n; done to the end of the script.
Unfortunately, this means that it is hard to track the errors.
Rather, what I would want is to do the equivalent to a non-blocking wait -n often, and exit as soon as possible.
Is this possible or do I have to write my bash scripts as a Makefile?

Alternative Approach: Emulate set -e for background jobs
Instead of checking the jobs all the time it could be easier and more efficient to exit the script directly when a job fails. To this end, append ... || kill $$ to every job you start:
# before
myCommand &
myProgram arg1 arg2 &
# after
myCommand || kill $$ &
myProgram arg1 arg2 || kill $$ &
Non-Blocking wait -n
If you really have to, you can write your own non-blocking wait -n with a little trick:
nextJobExitCode() {
sleep 0.1 &
wait -n
exitCode="$?"
kill %%
return "$exitCode"
}
The function nextJobExitCode waits at most 0.1 seconds for your jobs. If none of your jobs were already finished or did finish in that 0.1 seconds, nextJobExitCode will terminate with exit code 0.
Example usage
set -e
sleep 1 & # job 1
(sleep 3; false) & # job 2
nextJobExitCode # won't exit. No jobs finished yet
sleep 2
nextJobExitCode # won't exit. Job 1 finished with 0
sleep 2
nextJobExitCode # will exit! Job 2 finished with 1

Related

shell script - how to stop "watch" command in the shell script [duplicate]

I have a bash script that launches a child process that crashes (actually, hangs) from time to time and with no apparent reason (closed source, so there isn't much I can do about it). As a result, I would like to be able to launch this process for a given amount of time, and kill it if it did not return successfully after a given amount of time.
Is there a simple and robust way to achieve that using bash?
P.S.: tell me if this question is better suited to serverfault or superuser.
(As seen in:
BASH FAQ entry #68: "How do I run a command, and have it abort (timeout) after N seconds?")
If you don't mind downloading something, use timeout (sudo apt-get install timeout) and use it like: (most Systems have it already installed otherwise use sudo apt-get install coreutils)
timeout 10 ping www.goooooogle.com
If you don't want to download something, do what timeout does internally:
( cmdpid=$BASHPID; (sleep 10; kill $cmdpid) & exec ping www.goooooogle.com )
In case that you want to do a timeout for longer bash code, use the second option as such:
( cmdpid=$BASHPID;
(sleep 10; kill $cmdpid) \
& while ! ping -w 1 www.goooooogle.com
do
echo crap;
done )
# Spawn a child process:
(dosmth) & pid=$!
# in the background, sleep for 10 secs then kill that process
(sleep 10 && kill -9 $pid) &
or to get the exit codes as well:
# Spawn a child process:
(dosmth) & pid=$!
# in the background, sleep for 10 secs then kill that process
(sleep 10 && kill -9 $pid) & waiter=$!
# wait on our worker process and return the exitcode
exitcode=$(wait $pid && echo $?)
# kill the waiter subshell, if it still runs
kill -9 $waiter 2>/dev/null
# 0 if we killed the waiter, cause that means the process finished before the waiter
finished_gracefully=$?
sleep 999&
t=$!
sleep 10
kill $t
I also had this question and found two more things very useful:
The SECONDS variable in bash.
The command "pgrep".
So I use something like this on the command line (OSX 10.9):
ping www.goooooogle.com & PING_PID=$(pgrep 'ping'); SECONDS=0; while pgrep -q 'ping'; do sleep 0.2; if [ $SECONDS = 10 ]; then kill $PING_PID; fi; done
As this is a loop I included a "sleep 0.2" to keep the CPU cool. ;-)
(BTW: ping is a bad example anyway, you just would use the built-in "-t" (timeout) option.)
Assuming you have (or can easily make) a pid file for tracking the child's pid, you could then create a script that checks the modtime of the pid file and kills/respawns the process as needed. Then just put the script in crontab to run at approximately the period you need.
Let me know if you need more details. If that doesn't sound like it'd suit your needs, what about upstart?
One way is to run the program in a subshell, and communicate with the subshell through a named pipe with the read command. This way you can check the exit status of the process being run and communicate this back through the pipe.
Here's an example of timing out the yes command after 3 seconds. It gets the PID of the process using pgrep (possibly only works on Linux). There is also some problem with using a pipe in that a process opening a pipe for read will hang until it is also opened for write, and vice versa. So to prevent the read command hanging, I've "wedged" open the pipe for read with a background subshell. (Another way to prevent a freeze to open the pipe read-write, i.e. read -t 5 <>finished.pipe - however, that also may not work except with Linux.)
rm -f finished.pipe
mkfifo finished.pipe
{ yes >/dev/null; echo finished >finished.pipe ; } &
SUBSHELL=$!
# Get command PID
while : ; do
PID=$( pgrep -P $SUBSHELL yes )
test "$PID" = "" || break
sleep 1
done
# Open pipe for writing
{ exec 4>finished.pipe ; while : ; do sleep 1000; done } &
read -t 3 FINISHED <finished.pipe
if [ "$FINISHED" = finished ] ; then
echo 'Subprocess finished'
else
echo 'Subprocess timed out'
kill $PID
fi
rm finished.pipe
Here's an attempt which tries to avoid killing a process after it has already exited, which reduces the chance of killing another process with the same process ID (although it's probably impossible to avoid this kind of error completely).
run_with_timeout ()
{
t=$1
shift
echo "running \"$*\" with timeout $t"
(
# first, run process in background
(exec sh -c "$*") &
pid=$!
echo $pid
# the timeout shell
(sleep $t ; echo timeout) &
waiter=$!
echo $waiter
# finally, allow process to end naturally
wait $pid
echo $?
) \
| (read pid
read waiter
if test $waiter != timeout ; then
read status
else
status=timeout
fi
# if we timed out, kill the process
if test $status = timeout ; then
kill $pid
exit 99
else
# if the program exited normally, kill the waiting shell
kill $waiter
exit $status
fi
)
}
Use like run_with_timeout 3 sleep 10000, which runs sleep 10000 but ends it after 3 seconds.
This is like other answers which use a background timeout process to kill the child process after a delay. I think this is almost the same as Dan's extended answer (https://stackoverflow.com/a/5161274/1351983), except the timeout shell will not be killed if it has already ended.
After this program has ended, there will still be a few lingering "sleep" processes running, but they should be harmless.
This may be a better solution than my other answer because it does not use the non-portable shell feature read -t and does not use pgrep.
Here's the third answer I've submitted here. This one handles signal interrupts and cleans up background processes when SIGINT is received. It uses the $BASHPID and exec trick used in the top answer to get the PID of a process (in this case $$ in a sh invocation). It uses a FIFO to communicate with a subshell that is responsible for killing and cleanup. (This is like the pipe in my second answer, but having a named pipe means that the signal handler can write into it too.)
run_with_timeout ()
{
t=$1 ; shift
trap cleanup 2
F=$$.fifo ; rm -f $F ; mkfifo $F
# first, run main process in background
"$#" & pid=$!
# sleeper process to time out
( sh -c "echo \$\$ >$F ; exec sleep $t" ; echo timeout >$F ) &
read sleeper <$F
# control shell. read from fifo.
# final input is "finished". after that
# we clean up. we can get a timeout or a
# signal first.
( exec 0<$F
while : ; do
read input
case $input in
finished)
test $sleeper != 0 && kill $sleeper
rm -f $F
exit 0
;;
timeout)
test $pid != 0 && kill $pid
sleeper=0
;;
signal)
test $pid != 0 && kill $pid
;;
esac
done
) &
# wait for process to end
wait $pid
status=$?
echo finished >$F
return $status
}
cleanup ()
{
echo signal >$$.fifo
}
I've tried to avoid race conditions as far as I can. However, one source of error I couldn't remove is when the process ends near the same time as the timeout. For example, run_with_timeout 2 sleep 2 or run_with_timeout 0 sleep 0. For me, the latter gives an error:
timeout.sh: line 250: kill: (23248) - No such process
as it is trying to kill a process that has already exited by itself.
#Kill command after 10 seconds
timeout 10 command
#If you don't have timeout installed, this is almost the same:
sh -c '(sleep 10; kill "$$") & command'
#The same as above, with muted duplicate messages:
sh -c '(sleep 10; kill "$$" 2>/dev/null) & command'

Kill not killing process if exiting properly

I have a simple bash script which I have written to simplify some work I am doing. All it needs to do is start one process, process_1, as a background process then start another, process_2. Once process_2 is finished I then need to terminate process_1.
process_1 starts a program which does not actually stop unless it receives the kill signal, or CTRL+C when I run it myself. The program is output into a file via {program} {args} > output_file
process_2 can take an arbitrary amount of time depending on the arguments it is given.
Code:
#!/bin/bash
#Call this on exit to kill all background processes
function killJobs () {
#Check process is still running before killing
if kill -0 "$PID"; then
kill $PID
fi
}
...Check given arguments are valid...
#Start process_1
eval "./process_1 ${Arg1} ${Arg2} ${Arg3}" &
PID=$!
#Lay a trap to catch any exits from script
trap killJobs TERM INT
#Start process_2 - sleep for 5 seconds before and after
#Need space between process_1 and process_2 starting and stopping
sleep 5
eval "./process_2 ${Arg1} ${Arg2} ${Arg3} ${Arg4} 2> ${output_file}"
sleep 5
#Make sure background job is killed on exit
killJobs
I check process_1 has been terminated by checking of its output file is still being updated after my script has ended.
If I run the script and then press CTRL+C the script is terminated and process_1 is also killed, the output file is no longer updated.
If I let the script run to its completion without my intervention process_2 and the script both terminate but when I check the output from process_1 it is still being updated.
To check this I put an echo statement just after process_1 is started and another within the if statement of killJobs, so it would only be echoed if kill $PID is called.
Doing this I can see that both ways of exiting start process_1 and then also enter the if statement to kill it. Yet kill does not actually kill the process in the case of normal exit. No error messages are produced either.
You're backgrounding the eval instead of process_1, which sets $! to the PID of the script itself, not to process_1. Change to:
#!/bin/bash
#Call this on exit to kill all background processes
function killJobs () {
#Check process is still running before killing
if kill -0 "$PID"; then
kill $PID
fi
}
...Check given arguments are valid...
#Start process_1
./process_1 ${Arg1} ${Arg2} ${Arg3} &
PID=$!
#Lay a trap to catch any exits from script
trap killJobs TERM INT
#Start process_2 - sleep for 5 seconds before and after
#Need space between process_1 and process_2 starting and stopping
sleep 5
./process_2 ${Arg1} ${Arg2} ${Arg3} ${Arg4} 2> ${output_file}
sleep 5
#Make sure background job is killed on exit
killJobs

Run shell script after succesfull execution of parellel scripts

I have 4 shell scripts, first 3 scripts i want to execute parallel. Later after successful completion of all 3 scripts i want to execute 4th script
Parellelexecution
sh script1.sh,
sh script2.sh,
sh script3.sh
script4.sh should execute after all 3 execution.
bash 4.3 added a -n flag to wait that lets it wait for any one background job to complete. For a fixed number of background jobs, you could do use something like
script1.sh &
script2.sh &
script3.sh &
wait -n && wait -n && wait -n && script4.sh
For a large or variable number of background jobs, Kurt's answer is better.
In bash you can do:
pids=
for s in script1.sh script2.sh script3.sh; do
$s &
pids="$pids $!"
done
JOBS_FAILED=false
for pid in $pids; do
if ! wait $pid; then
# script didn't exit successfully
JOBS_FAILED=true
fi
done
if [[ $JOBS_FAILED == false ]]; then
script4.sh
fi
First it starts all the first 3 scripts in background and collects their pids. Then it runs through each pid waiting for it to exit and checking its return value. If any of the first three scripts fail, $JOBS_FAILED is set to the string true but all the processes are still waited on. Once all the first 3 scripts finish, the script checks if any jobs failed. If not, script4.sh is run.

How can I silence the "Terminated" message when my command is killed by timeout?

By referencing bash: silently kill background function process and Timeout a command in bash without unnecessary delay, I wrote my own script to set a timeout for a command, as well as silencing the kill message.
But I still am getting a "Terminated" message when my process gets killed. What's wrong with my code?
#!/bin/bash
silent_kill() {
kill $1 2>/dev/null
wait $1 2>/dev/null
}
timeout() {
limit=$1 #timeout limit
shift
command=$* #command to run
interval=1 #default interval between checks if the process is still alive
delay=1 #default delay between SIGTERM and SIGKILL
(
((t = limit))
while ((t > 0)); do
sleep $interval;
#kill -0 $$ || exit 0
((t -= interval))
done
silent_kill $$
#kill -s SIGTERM $$ && kill -0 $$ || exit 0
sleep $delay
#kill -s SIGKILL $$
) &> /dev/null &
exec $*
}
timeout 1 sleep 10
There's nothing wrong with your code, that "Terminated" message doesn't come from your script but from the invoking shell (the one you launch your script
from).
You can deactivate if by disabling job control:
$ set +m
$ bash <your timeout script>
Perhaps bash has moved on in 4 years. I do know you can avoid
getting Terminated by disowning a child process. You can no longer job control it though. Eg:
$ sleep 100 &
[1] 15436
$ disown -r
$ kill -9 15436
help disown:
disown [-h] [-ar] [jobspec ...]
Remove jobs from current shell.
Removes each JOBSPEC argument from the table of active jobs. Without
any JOBSPECs, the shell uses its notion of the current job.
-a remove all jobs if JOBSPEC is not supplied
-h mark each JOBSPEC so that SIGHUP is not sent to the job if the shell receives a SIGHUP
-r remove only running jobs
Internally the shell maintains a list of children it forked and wait()s for any of them to exit or be killed. When a child's exit status was collected, the shell prints a message. This is called monitoring in shell parlance.
It seems you want to turn off monitoring. Monitoring is managed with the m option; to turn it on, use set -m (the default at startup). To turn it off, set +m.
Note that monitoring off also disables messages for asynchronous jobs, e.g. no more messages like
$ sleep 5 &
[1] 59468
$
[1] + done sleep 5
$

How do I terminate all the subshell processes?

I have a bash script to test how a server performs under load.
num=1
if [ $# -gt 0 ]; then
num=$1
fi
for i in {1 .. $num}; do
(while true; do
{ time curl --silent 'http://localhost'; } 2>&1 | grep real
done) &
done
wait
When I hit Ctrl-C, the main process exits, but the background loops keep running. How do I make them all exit? Or is there a better way of spawning a configurable number of logic loops executing in parallel?
Here's a simpler solution -- just add the following line at the top of your script:
trap "kill 0" SIGINT
Killing 0 sends the signal to all processes in the current process group.
One way to kill subshells, but not self:
kill $(jobs -p)
Bit of a late answer, but for me solutions like kill 0 or kill $(jobs -p) go too far (kill all child processes).
If you just want to make sure one specific child-process (and its own children) are tidied up then a better solution is to kill by process group (PGID) using the sub-process' PID, like so:
set -m
./some_child_script.sh &
some_pid=$!
kill -- -${some_pid}
Firstly, the set -m command will enable job management (if it isn't already), this is important, as otherwise all commands, sub-shells etc. will be assigned to the same process group as your parent script (unlike when you run the commands manually in a terminal), and kill will just give a "no such process" error. This needs to be called before you run the background command you wish to manage as a group (or just call it at script start if you have several).
Secondly, note that the argument to kill is negative, this indicates that you want to kill an entire process group. By default the process group ID is the same as the first command in the group, so we can get it by simply adding a minus sign in front of the PID we fetched with $!. If you need to get the process group ID in a more complex case, you will need to use ps -o pgid= ${some_pid}, then add the minus sign to that.
Lastly, note the use of the explicit end of options --, this is important, as otherwise the process group argument will be treated as an option (signal number), and kill will complain it doesn't have enough arguments. You only need this if the process group argument is the first one you wish to terminate.
Here is a simplified example of a background timeout process, and how to cleanup as much as possible:
#!/bin/bash
# Use the overkill method in case we're terminated ourselves
trap 'kill $(jobs -p | xargs)' SIGINT SIGHUP SIGTERM EXIT
# Setup a simple timeout command (an echo)
set -m
{ sleep 3600; echo "Operation took longer than an hour"; } &
timeout_pid=$!
# Run our actual operation here
do_something
# Cancel our timeout
kill -- -${timeout_pid} >/dev/null 2>&1
wait -- -${timeout_pid} >/dev/null 2>&1
printf '' 2>&1
This should cleanly handle cancelling this simplistic timeout in all reasonable cases; the only case that can't be handled is the script being terminated immediately (kill -9), as it won't get a chance to cleanup.
I've also added a wait, followed by a no-op (printf ''), this is to suppress "terminated" messages that can be caused by the kill command, it's a bit of a hack, but is reliable enough in my experience.
You need to use job control, which, unfortunately, is a bit complicated. If these are the only background jobs that you expect will be running, you can run a command like this one:
jobs \
| perl -ne 'print "$1\n" if m/^\[(\d+)\][+-]? +Running/;' \
| while read -r ; do kill %"$REPLY" ; done
jobs prints a list of all active jobs (running jobs, plus recently finished or terminated jobs), in a format like this:
[1] Running sleep 10 &
[2] Running sleep 10 &
[3] Running sleep 10 &
[4] Running sleep 10 &
[5] Running sleep 10 &
[6] Running sleep 10 &
[7] Running sleep 10 &
[8] Running sleep 10 &
[9]- Running sleep 10 &
[10]+ Running sleep 10 &
(Those are jobs that I launched by running for i in {1..10} ; do sleep 10 & done.)
perl -ne ... is me using Perl to extract the job numbers of the running jobs; you can obviously use a different tool if you prefer. You may need to modify this script if your jobs has a different output format; but the above output is also on Cygwin, so it's very likely identical to yours.
read -r reads a "raw" line from standard input, and saves it into the variable $REPLY. kill %"$REPLY" will be something like kill %1, which "kills" (sends an interrupt signal to) job number 1. (Not to be confused with kill 1, which would kill process number 1.) Together, while read -r ; do kill %"$REPLY" ; done goes through each job number printed by the Perl script, and kills it.
By the way, your for i in {1 .. $num} won't do what you expect, since brace expansion is handled before parameter expansion, so what you have is equivalent to for i in "{1" .. "$num}". (And you can't have white-space inside the brace expansion, anyway.) Unfortunately, I don't know of a clean alternative; I think you have to do something like for i in $(bash -c "{1..$num}"), or else switch to an arithmetic for-loop or whatnot.
Also by the way, you don't need to wrap your while-loop in parentheses; & already causes the job to be run in a subshell.
Here's my eventual solution. I'm keeping track of the subshell process IDs using an array variable, and trapping the Ctrl-C signal to kill them.
declare -a subs #array of subshell pids
function kill_subs() {
for pid in ${subs[#]}; do
kill $pid
done
exit 0
}
num=1 if [ $# -gt 0 ]; then
num=$1 fi
for ((i=0;i < $num; i++)); do
while true; do
{ time curl --silent 'http://localhost'; } 2>&1 | grep real
done &
subs[$i]=$! #grab the pid of the subshell
done
trap kill_subs 1 2 15
wait
While these is not an answer, I just would like to point out something which invalidates the selected one; using jobs or kill 0 might have unexpected results; in my case it killed unintended processes which in my case is not an option.
It has been highlighted somehow in some of the answers but I am afraid not with enough stress or it has been not considered:
"Bit of a late answer, but for me solutions like kill 0 or kill $(jobs -p) go too far (kill all child processes)."
"If these are the only background jobs that you expect will be running, you can run a command like this one:"

Resources