limit spawned parallel processes and exit all upon failure of any - bash

I'm running some tests in parallel by calling a process from a script. Each process writes only to stdout, which is redirected to a file, and exits 0 iff successful (otherwise -1).
If and when a process exits with -1, I print something to its (or a related) output file (namely, the arguments it was called with), kill all other processes, and exit.
I have written a script using trap "..." CHLD to run some code when a subprocess exits, and this works under certain conditions, but the script is not very robust. If I send a keyboard interrupt, the subprocesses sometimes keep going, and sometimes the sheer number of subprocesses overwhelms the machine(s) and none of them seem to advance.
I am using this on my quad-core laptop as well as on a 128-CPU cluster, over which subprocesses are distributed automatically. How do I run a large number of background subprocesses in a bash script, limited to some number of them running concurrently, and do something and exit if one of them returns a bad code? I would also like the script to clean up after a keyboard interrupt. Should I use GNU parallel? How?
Here is an MWE of my script so far, which spawns subprocesses unhindered, annotated with what I think each part means. I got the idea to use trap from "shell - get exit code of background process".
$ cat parallel_tests.sh
#!/bin/bash
# some help from https://stackoverflow.com/questions/1570262/shell-get-exit-code-of-background-process
handle_chld() {
    #echo pids are ${pids[@]}
    local tmp=() ### temporary storage for pids that haven't finished
    # for each pid that hadn't finished since the last trap
    for ((i=0; i<${#pids[@]}; ++i)); do
        # if this pid is still running
        if [[ $(ps -p ${pids[i]} -o pid=) ]]
        then
            tmp+=(${pids[i]}) ### add pid to list of pids that are running
        else
            wait ${pids[i]} ### put the exit code of this pid into $?
            if [ "$?" != "0" ] ### if the exit code $? is non-zero
            then
                # kill all remaining processes
                for ((j=0; j<${#pids[@]}; ++j))
                do
                    if [[ $(ps -p ${pids[j]} -o pid=) ]]
                    then
                        echo killing child processes of ${pids[j]}
                        pkill -P ${pids[j]}
                    fi
                done
                cat _tmp${pids[i]}
                # print things to the terminal here
                echo "FAILED process ${pids[i]} args: `cat _tmpargs${pids[i]}`"
                exit 1
            else
                echo "FINISHED process ${pids[i]} args: `cat _tmpargs${pids[i]}`"
            fi
        fi
    done
    # update list of running pids
    pids=(${tmp[@]})
}
# set this to monitor SIGCHLD
set -o monitor
# call handle_chld() when SIGCHLD signal is triggered
trap "handle_chld" CHLD
ALL_ARGS="2 32 87" ### ad nauseam
for A in $ALL_ARGS; do
    (sleep $A; false) > _tmp$! &
    pids+=($!)
    echo $A > _tmpargs${pids[${#pids[@]}-1]}
    echo "STARTED process ${pids[${#pids[@]}-1]} args: `cat _tmpargs${pids[${#pids[@]}-1]}`"
done
echo "Every process started. Now waiting on PIDS:"
echo ${pids[@]}
wait ${pids[@]} ### wait until every process is finished (or exit in the trap)
The output of this version after 2+epsilon seconds is:
$ ./parallel_tests.sh
STARTED process 66369 args: 2
STARTED process 66374 args: 32
STARTED process 66381 args: 87
Every process started. Now waiting on PIDS:
66369 66374 66381
killing child processes of 66374
./parallel_tests.sh: line 43: 66376 Terminated: 15 sleep $A
killing child processes of 66381
./parallel_tests.sh: line 43: 66383 Terminated: 15 sleep $A
FAILED process 66369 args: 2
Essentially, pid 66369 fails first, and the other two processes are dealt with in the trap. I have simplified the construction of the test processes here, so we can't assume that I'll manually insert waits before spawning new ones. Additionally, some of the test processes can be nearly instant. In short, I have a whole mess of test processes, long and short, starting as soon as resources can be allotted.
I'm not sure what's causing the problems I mentioned above, as this script uses several features that are new to me. General pointers are welcomed!
(I have seen this question and it does not answer my question)

cat arguments | parallel --halt now,fail=1 my_prg
Alternatively:
parallel --halt now,fail=1 my_prg ::: $ALL_ARGS
GNU Parallel is designed so it will also kill remote jobs. It does that using process groups and heavy perl scripting on the remote server: https://www.gnu.org/software/parallel/parallel_design.html#The-remote-system-wrapper
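To cover the rest of the question (a concurrency cap, per-job output files, and a message on failure), a sketch along these lines should work; my_prg stands in for the test program, and the job count and file naming are assumptions:

ALL_ARGS="2 32 87"
# Run at most 4 jobs at once; on the first failure, kill everything
# and exit non-zero. {} is the argument, {#} the job sequence number.
parallel --jobs 4 --halt now,fail=1 \
    'my_prg {} > _tmp{#} || { echo "FAILED args: {}" >> _tmp{#}; exit 1; }' \
    ::: $ALL_ARGS

Since parallel and its local jobs run in the foreground process group, a Ctrl-C in the terminal reaches them all as well.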

Related

Bash: why wait returns prematurely with code 145

This problem is very strange and I cannot find any documentation about it online. In the following code snippet I am merely trying to run a bunch of sub-processes in parallel, printing something when they exit and collecting/printing their exit codes at the end. Without catching SIGCHLD, things work as I would expect; however, things break when I catch the signal. Here is the code:
#!/bin/bash
#enabling job control
set -m

cmd_array=( "$@" ) #array of commands to run in parallel
cmd_count=$# #number of commands to run
cmd_idx=0 #current index of command
cmd_pids=() #array of child proc pids
trap 'echo "Child job exited"' SIGCHLD #setting up signal handler on SIGCHLD

#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
    cmd=${cmd_array[$cmd_idx]} #retrieving the job command as a string
    eval "$cmd" &
    cmd_pids[$cmd_idx]=$! #keeping track of the job pid
    echo "Job #$cmd_idx launched '$cmd'"
    (( cmd_idx++ ))
done

#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[@]}"; do
    wait $pid
    child_exit_code=$?
    if [ $child_exit_code -ne 0 ]; then
        echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
    fi
    (( idx++ ))
done
You can tell something is wrong when you run this with the following command:
./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"
The interesting thing here is that as soon as the signal handler is called (when sleep 3 is done), the wait (which is waiting on sleep 20) is interrupted right away with return code 145. I can tell that sleep 20 is still running even after the script is done.
I can't find any documentation about such a return code from wait. Can anyone shed some light as to what is going on here?
(By the way, if I wrap the wait in a loop and keep waiting while the return code is 145, I actually get the result I expect.)
Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:
#!/bin/bash
set -m
trap "echo child_exit" SIGCHLD

function test() {
    sleep $1
    echo "'sleep $1' just returned now"
}

echo sleeping for 6 seconds in the background
test 6 &
pid=$!

echo sleeping for 2 second in the background
test 2 &

echo waiting on the 6 second sleep
wait $pid
echo "wait return code: $?"
If you run this you will get the following output:
linux:~$ sh test2.sh
sleeping for 6 seconds in the background
sleeping for 2 second in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
linux:~$ 'sleep 6' just returned now
Explanation:
As @muru pointed out, "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (cf. the Bash manual on Exit Status).
Now what misled me here is the "fatal" signal. I was looking for a command to fail somewhere when nothing did.
Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
So there you have it, what happens in the script above is the following:
sleep 6 starts in the background
sleep 2 starts in the background
wait starts waiting on sleep 6
sleep 2 terminates and the SIGCHLD trap is fired, interrupting wait, which returns 128 + SIGCHLD(17) = 145
my script exits since it does not wait anymore
the background sleep 6 terminates, hence the "'sleep 6' just returned now" after the script has already exited
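A minimal sketch of the re-wait workaround mentioned in the question: keep re-issuing wait while it returns 145 (i.e. 128 + SIGCHLD):

#!/bin/bash
set -m
trap 'echo child_exit' SIGCHLD
sleep 6 & pid=$!
sleep 2 &
wait $pid; status=$?
# 145 means wait was interrupted by the trapped SIGCHLD,
# not that the waited-for child itself exited.
while [ $status -eq 145 ]; do
    wait $pid; status=$?
done
echo "wait returned $status"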

shell script - how to stop "watch" command in the shell script [duplicate]

I have a bash script that launches a child process that crashes (actually, hangs) from time to time, for no apparent reason (it's closed source, so there isn't much I can do about it). As a result, I would like to launch this process for a given amount of time and kill it if it has not returned successfully within that time.
Is there a simple and robust way to achieve that using bash?
P.S.: tell me if this question is better suited to serverfault or superuser.
(As seen in:
BASH FAQ entry #68: "How do I run a command, and have it abort (timeout) after N seconds?")
Use timeout, which most systems already have installed (it is part of coreutils; otherwise install it with, e.g., sudo apt-get install coreutils), like this:
timeout 10 ping www.goooooogle.com
If you don't want to download something, do what timeout does internally:
( cmdpid=$BASHPID; (sleep 10; kill $cmdpid) & exec ping www.goooooogle.com )
If you want a timeout for a longer piece of bash code, use the second option like this:
( cmdpid=$BASHPID;
  (sleep 10; kill $cmdpid) \
  & while ! ping -w 1 www.goooooogle.com
    do
        echo crap;
    done )
# Spawn a child process:
(dosmth) & pid=$!
# in the background, sleep for 10 secs then kill that process
(sleep 10 && kill -9 $pid) &
or to get the exit codes as well:
# Spawn a child process:
(dosmth) & pid=$!
# in the background, sleep for 10 secs then kill that process
(sleep 10 && kill -9 $pid) & waiter=$!
# wait on our worker process and return the exitcode
exitcode=$(wait $pid && echo $?)
# kill the waiter subshell, if it still runs
kill -9 $waiter 2>/dev/null
# 0 if we killed the waiter, cause that means the process finished before the waiter
finished_gracefully=$?
sleep 999&
t=$!
sleep 10
kill $t
I also had this question and found two more things very useful:
The SECONDS variable in bash.
The command "pgrep".
So I use something like this on the command line (OSX 10.9):
ping www.goooooogle.com & PING_PID=$(pgrep 'ping'); SECONDS=0; while pgrep -q 'ping'; do sleep 0.2; if [ $SECONDS = 10 ]; then kill $PING_PID; fi; done
As this is a loop I included a "sleep 0.2" to keep the CPU cool. ;-)
(BTW: ping is a bad example anyway; you would just use its built-in "-t" (timeout) option.)
Assuming you have (or can easily make) a pid file for tracking the child's pid, you could then create a script that checks the modtime of the pid file and kills/respawns the process as needed. Then just put the script in crontab to run at approximately the period you need.
Let me know if you need more details. If that doesn't sound like it'd suit your needs, what about upstart?
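As a sketch of that approach (the paths, the command, and the staleness threshold here are all assumptions, and stat -c %Y is the GNU form):

#!/bin/bash
# Hypothetical watchdog, run periodically from crontab: respawn the
# child if its pid file has gone stale or the process has died.
PIDFILE=/var/run/myapp.pid   # assumed pid file location
CMD=/usr/local/bin/myapp     # assumed command to (re)start
MAX_AGE=300                  # seconds before the pid file counts as stale

if [ -f "$PIDFILE" ]; then
    pid=$(cat "$PIDFILE")
    age=$(( $(date +%s) - $(stat -c %Y "$PIDFILE") ))
    if kill -0 "$pid" 2>/dev/null && [ "$age" -le "$MAX_AGE" ]; then
        exit 0               # alive and fresh: nothing to do
    fi
    kill "$pid" 2>/dev/null  # stale or dead: clear out the old process
fi
"$CMD" & echo $! > "$PIDFILE"  # respawn and record the new pid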
One way is to run the program in a subshell, and communicate with the subshell through a named pipe with the read command. This way you can check the exit status of the process being run and communicate this back through the pipe.
Here's an example of timing out the yes command after 3 seconds. It gets the PID of the process using pgrep (possibly only works on Linux). There is also some problem with using a pipe in that a process opening a pipe for read will hang until it is also opened for write, and vice versa. So to prevent the read command hanging, I've "wedged" open the pipe for read with a background subshell. (Another way to prevent a freeze is to open the pipe read-write, i.e. read -t 5 <>finished.pipe; however, that also may not work except on Linux.)
rm -f finished.pipe
mkfifo finished.pipe

{ yes >/dev/null; echo finished >finished.pipe ; } &
SUBSHELL=$!

# Get command PID
while : ; do
    PID=$( pgrep -P $SUBSHELL yes )
    test "$PID" = "" || break
    sleep 1
done

# Open pipe for writing
{ exec 4>finished.pipe ; while : ; do sleep 1000; done } &

read -t 3 FINISHED <finished.pipe

if [ "$FINISHED" = finished ] ; then
    echo 'Subprocess finished'
else
    echo 'Subprocess timed out'
    kill $PID
fi

rm finished.pipe
Here's an attempt which tries to avoid killing a process after it has already exited, which reduces the chance of killing another process with the same process ID (although it's probably impossible to avoid this kind of error completely).
run_with_timeout ()
{
    t=$1
    shift
    echo "running \"$*\" with timeout $t"
    (
        # first, run process in background
        (exec sh -c "$*") &
        pid=$!
        echo $pid

        # the timeout shell
        (sleep $t ; echo timeout) &
        waiter=$!
        echo $waiter

        # finally, allow process to end naturally
        wait $pid
        echo $?
    ) \
    | (
        read pid
        read waiter

        if test $waiter != timeout ; then
            read status
        else
            status=timeout
        fi

        # if we timed out, kill the process
        if test $status = timeout ; then
            kill $pid
            exit 99
        else
            # if the program exited normally, kill the waiting shell
            kill $waiter
            exit $status
        fi
    )
}
Use like run_with_timeout 3 sleep 10000, which runs sleep 10000 but ends it after 3 seconds.
This is like other answers which use a background timeout process to kill the child process after a delay. I think this is almost the same as Dan's extended answer (https://stackoverflow.com/a/5161274/1351983), except the timeout shell will not be killed if it has already ended.
After this program has ended, there will still be a few lingering "sleep" processes running, but they should be harmless.
This may be a better solution than my other answer because it does not use the non-portable shell feature read -t and does not use pgrep.
Here's the third answer I've submitted here. This one handles signal interrupts and cleans up background processes when SIGINT is received. It uses the $BASHPID and exec trick used in the top answer to get the PID of a process (in this case $$ in a sh invocation). It uses a FIFO to communicate with a subshell that is responsible for killing and cleanup. (This is like the pipe in my second answer, but having a named pipe means that the signal handler can write into it too.)
run_with_timeout ()
{
    t=$1 ; shift
    trap cleanup 2
    F=$$.fifo ; rm -f $F ; mkfifo $F

    # first, run main process in background
    "$@" & pid=$!

    # sleeper process to time out
    ( sh -c "echo \$\$ >$F ; exec sleep $t" ; echo timeout >$F ) &
    read sleeper <$F

    # control shell. read from fifo.
    # final input is "finished". after that
    # we clean up. we can get a timeout or a
    # signal first.
    ( exec 0<$F
      while : ; do
          read input
          case $input in
              finished)
                  test $sleeper != 0 && kill $sleeper
                  rm -f $F
                  exit 0
                  ;;
              timeout)
                  test $pid != 0 && kill $pid
                  sleeper=0
                  ;;
              signal)
                  test $pid != 0 && kill $pid
                  ;;
          esac
      done
    ) &

    # wait for process to end
    wait $pid
    status=$?
    echo finished >$F
    return $status
}

cleanup ()
{
    echo signal >$$.fifo
}
I've tried to avoid race conditions as far as I can. However, one source of error I couldn't remove is when the process ends near the same time as the timeout. For example, run_with_timeout 2 sleep 2 or run_with_timeout 0 sleep 0. For me, the latter gives an error:
timeout.sh: line 250: kill: (23248) - No such process
as it is trying to kill a process that has already exited by itself.
#Kill command after 10 seconds
timeout 10 command
#If you don't have timeout installed, this is almost the same:
sh -c '(sleep 10; kill "$$") & command'
#The same as above, with muted duplicate messages:
sh -c '(sleep 10; kill "$$" 2>/dev/null) & command'

How do I receive notification in a bash script when a specific child process terminates?

I wonder if anyone can help with this?
I have a bash script. It starts a sub-process which is another GUI-based application. The bash script then goes into an interactive mode, getting input from the user. This interactive mode continues indefinitely. I would like it to terminate when the GUI application in the sub-process exits.
I have looked at SIGCHLD but this doesn't seem to be the answer. Here's what I've tried but I don't get a signal when the prog ends.
set -o monitor

"${prog}" &
prog_pid=$!

function check_pid {
    kill -0 $1 2> /dev/null
}

function cleanup {
    ### does cleanup stuff here
    exit
}

function sigchld {
    check_pid $prog_pid
    [[ $? == 1 ]] && cleanup
}

trap sigchld SIGCHLD
Update, following the answers: I now have this working using the suggestion from 'nosid'. I have another, related issue now: the interactive process that follows is a basic menu-driven process that blocks waiting for key input from the user. If the child process ends, the USR1 signal is not handled until after input is received. Is there any way to force the signal to be handled immediately?
The wait loop looks like this:
stty raw                  # set the tty driver to raw mode
max=$1                    # maximum valid choice
choice=$(expr $max + 1)   # invalid choice
while [[ $choice -gt $max ]]; do
    choice=`dd if=/dev/tty bs=1 count=1 2>/dev/null`
done
stty sane                 # restore tty
Updated with solution. I have solved this. The trick was to use nonblocking I/O for the read. Now, with the answer from 'nosid' and my modifications, I have exactly what I want. For completeness, here is what works for me:
#!/bin/bash -bm

{
    "${1}"
    kill -USR1 $$
} &

function cleanup {
    # cleanup stuff
    exit
}

trap cleanup SIGUSR1

while true ; do
    stty raw   # set the tty driver to raw mode
    max=9      # maximum valid choice
    while [[ $choice -gt $max || -z $choice ]]; do
        choice=`dd iflag=nonblock if=/dev/tty bs=1 count=1 2>/dev/null`
    done
    stty sane  # restore tty
    # process choice
done
Here is a different approach. Instead of using SIGCHLD, you can execute an arbitrary command as soon as the GUI application terminates.
{
    some_command args...
    kill -USR1 $$
} &

function sigusr1() { ... }

trap sigusr1 SIGUSR1
Ok. I think I understand what you need. Have a look at my .xinitrc:
xrdb ~/.Xdefaults
source ~/.xinitrc.hw.settings
xcompmgr &
xscreensaver &
# after starting some arbitrary crap we want to start the main gui.
startfluxbox & PIDOFAPP=$! ## THIS IS THE IMPORTANT PART
setxkbmap genja
wmclockmon -bl &
sleep 1
wmctrl -s 3 && aterms sone &
sleep 1
wmctrl -s 0
wait $PIDOFAPP ## THIS IS THE SECOND PART OF THE IMPORTANT PART
xeyes -geometry 400x400+500+400 &
sleep 2
echo im out!
What happens is that after you send a process to the background, you can use wait to wait until the process dies. Whatever comes after wait will not be executed as long as the application is running. You can use this to exit after the GUI has been shut down.
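Distilled, the pattern is (gui_app and cleanup_and_exit are placeholders):

gui_app &    # start the GUI in the background
pid=$!
wait $pid    # blocks until the GUI exits
# anything from here on runs only after the GUI has shut down
cleanup_and_exit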
PS: I run bash.
I think you need to do:
set -bm
or
set -o monitor -o notify
As per the bash manual:
-b
Cause the status of terminated background jobs to be reported immediately, rather than before printing the next primary prompt.
The shell's main job is executing child processes, and it needs to catch SIGCHLD for its own purposes. This somehow restricts its ability to pass the signal on to the script itself.
Could you just check for the child PID and, based on that, send the alert? You can find the child PID as below:
bash_pid=$$
while true
do
    children=`ps -eo ppid | grep -w $bash_pid`
    if [ -z "$children" ]; then
        cleanup
        alert
        exit
    fi
    sleep 1   # avoid a busy loop while polling
done

How do I terminate all the subshell processes?

I have a bash script to test how a server performs under load.
num=1
if [ $# -gt 0 ]; then
    num=$1
fi

for i in {1 .. $num}; do
    (while true; do
        { time curl --silent 'http://localhost'; } 2>&1 | grep real
    done) &
done
wait
When I hit Ctrl-C, the main process exits, but the background loops keep running. How do I make them all exit? Or is there a better way of spawning a configurable number of logic loops executing in parallel?
Here's a simpler solution -- just add the following line at the top of your script:
trap "kill 0" SIGINT
Killing 0 sends the signal to all processes in the current process group.
One way to kill subshells, but not self:
kill $(jobs -p)
Bit of a late answer, but for me solutions like kill 0 or kill $(jobs -p) go too far (kill all child processes).
If you just want to make sure one specific child-process (and its own children) are tidied up then a better solution is to kill by process group (PGID) using the sub-process' PID, like so:
set -m
./some_child_script.sh &
some_pid=$!
kill -- -${some_pid}
Firstly, the set -m command will enable job control (if it isn't enabled already); this is important, as otherwise all commands, sub-shells etc. will be assigned to the same process group as your parent script (unlike when you run the commands manually in a terminal), and kill will just give a "no such process" error. This needs to be called before you run the background command you wish to manage as a group (or just call it at script start if you have several).
Secondly, note that the argument to kill is negative, this indicates that you want to kill an entire process group. By default the process group ID is the same as the first command in the group, so we can get it by simply adding a minus sign in front of the PID we fetched with $!. If you need to get the process group ID in a more complex case, you will need to use ps -o pgid= ${some_pid}, then add the minus sign to that.
Lastly, note the use of the explicit end of options --, this is important, as otherwise the process group argument will be treated as an option (signal number), and kill will complain it doesn't have enough arguments. You only need this if the process group argument is the first one you wish to terminate.
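For that more complex case, a sketch of signalling the group via an explicit PGID lookup (ps pads its output, hence the tr):

pgid=$(ps -o pgid= -p "$some_pid" | tr -d ' ')
kill -- "-$pgid"   # negative argument: signal the whole process group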
Here is a simplified example of a background timeout process, and how to cleanup as much as possible:
#!/bin/bash
# Use the overkill method in case we're terminated ourselves
trap 'kill $(jobs -p | xargs)' SIGINT SIGHUP SIGTERM EXIT
# Setup a simple timeout command (an echo)
set -m
{ sleep 3600; echo "Operation took longer than an hour"; } &
timeout_pid=$!
# Run our actual operation here
do_something
# Cancel our timeout
kill -- -${timeout_pid} >/dev/null 2>&1
wait -- -${timeout_pid} >/dev/null 2>&1
printf '' 2>&1
This should cleanly handle cancelling this simplistic timeout in all reasonable cases; the only case that can't be handled is the script being terminated immediately (kill -9), as it won't get a chance to cleanup.
I've also added a wait, followed by a no-op (printf ''), this is to suppress "terminated" messages that can be caused by the kill command, it's a bit of a hack, but is reliable enough in my experience.
You need to use job control, which, unfortunately, is a bit complicated. If these are the only background jobs that you expect will be running, you can run a command like this one:
jobs \
| perl -ne 'print "$1\n" if m/^\[(\d+)\][+-]? +Running/;' \
| while read -r ; do kill %"$REPLY" ; done
jobs prints a list of all active jobs (running jobs, plus recently finished or terminated jobs), in a format like this:
[1] Running sleep 10 &
[2] Running sleep 10 &
[3] Running sleep 10 &
[4] Running sleep 10 &
[5] Running sleep 10 &
[6] Running sleep 10 &
[7] Running sleep 10 &
[8] Running sleep 10 &
[9]- Running sleep 10 &
[10]+ Running sleep 10 &
(Those are jobs that I launched by running for i in {1..10} ; do sleep 10 & done.)
perl -ne ... is me using Perl to extract the job numbers of the running jobs; you can obviously use a different tool if you prefer. You may need to modify this script if your jobs builtin has a different output format; but the above output is from Cygwin, so it's very likely identical to yours.
read -r reads a "raw" line from standard input, and saves it into the variable $REPLY. kill %"$REPLY" will be something like kill %1, which "kills" (sends an interrupt signal to) job number 1. (Not to be confused with kill 1, which would kill process number 1.) Together, while read -r ; do kill %"$REPLY" ; done goes through each job number printed by the Perl script, and kills it.
By the way, your for i in {1 .. $num} won't do what you expect, since brace expansion is handled before parameter expansion, so what you have is equivalent to for i in "{1" .. "$num}". (And you can't have white-space inside the brace expansion, anyway.) Unfortunately, I don't know of a clean alternative; I think you have to do something like for i in $(bash -c "echo {1..$num}"), or else switch to an arithmetic for-loop or whatnot.
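For instance, an arithmetic for-loop sidesteps the expansion-ordering problem entirely:

# brace expansion cannot see $num, but an arithmetic loop can
for ((i = 1; i <= num; i++)); do
    echo "$i"   # loop body goes here
done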
Also by the way, you don't need to wrap your while-loop in parentheses; & already causes the job to be run in a subshell.
Here's my eventual solution. I'm keeping track of the subshell process IDs using an array variable, and trapping the Ctrl-C signal to kill them.
declare -a subs # array of subshell pids

function kill_subs() {
    for pid in ${subs[@]}; do
        kill $pid
    done
    exit 0
}

num=1
if [ $# -gt 0 ]; then
    num=$1
fi

for ((i=0; i < $num; i++)); do
    while true; do
        { time curl --silent 'http://localhost'; } 2>&1 | grep real
    done &
    subs[$i]=$! # grab the pid of the subshell
done

trap kill_subs 1 2 15

wait
While this is not an answer, I would just like to point out something that invalidates the selected one: using jobs or kill 0 might have unexpected results; in my case it killed unintended processes, which was not an option for me.
This has been highlighted in some of the answers, but I am afraid not with enough stress, or it has not been taken into account:
"Bit of a late answer, but for me solutions like kill 0 or kill $(jobs -p) go too far (kill all child processes)."
"If these are the only background jobs that you expect will be running, you can run a command like this one:"

Kill process after a given time bash?

I have a script that tries to make a DB connection using another program and the timeout(2.5min) of the program is to long. I want to add this functionality to the script.
If it takes longer than 5 seconds to connect, kill the process.
Otherwise, kill the sleep/kill watchdog process.
The issue I'm having is that bash reports when a process is killed; that's because the processes are in the same shell, just in the background. Is there a better way to do this, or how can I silence the shell for the kill commands?
DB_CONNECTION_PROGRAM > $CONNECTFILE &
pid=$!
(sleep 5; kill $pid) &
sleep_pid=$!
wait $pid

# If the DB failed to connect after 5 seconds and was killed
status=$? # kill returns 128+n (fatal error)
if [ $status -gt 128 ]; then
    no_connection="ERROR: Timeout while trying to connect to $dbserver"
else # If it connected, kill the sleep and collect any errors
    kill $sleep_pid
    no_connection=`sed -n '/^ERROR:/,$p' $CONNECTFILE`
fi
There's a GNU coreutils utility called timeout: http://www.gnu.org/s/coreutils/manual/html_node/timeout-invocation.html
If you have it on your platform, you could do:
timeout 5 CONNECT_TO_DB
if [ $? -eq 124 ]; then
    : # Timeout occurred
else
    : # No hang
fi
I don't know if it's identical but I did fix a similar issue a few years ago. However I'm a programmer, not a Unix-like sysadmin so take the following with a grain of salt because my Bash-fu may not be that strong...
Basically I did fork, fork and fork : )
After digging my old code back up (which I amazingly still use daily), since my memory wasn't good enough, in Bash it worked a bit like this:
commandThatMayHang.sh > /dev/null 2>&1 & # notice that last '&', we're forking
MAYBE_HUNG_PID=$!

sleepAndMaybeKill.sh $MAYBE_HUNG_PID > /dev/null 2>&1 & # we're forking again
SLEEP_AND_MAYBE_KILL_PID=$!

wait $MAYBE_HUNG_PID > /dev/null 2>&1
if [ $? -eq 0 ]; then
    # commandThatMayHang.sh did not hang; fine, no need to monitor it anymore
    kill -9 $SLEEP_AND_MAYBE_KILL_PID > /dev/null 2>&1
fi
where sleepAndMaybeKill.sh sleeps the amount of time you want and then kills commandThatMayHang.sh.
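A minimal sketch of what sleepAndMaybeKill.sh might look like (the fixed 5-second timeout is an assumption):

#!/bin/bash
# sleepAndMaybeKill.sh <pid>: wait out the timeout, then kill <pid>
# if it is still running.
pid=$1
sleep 5                      # assumed timeout
kill -9 "$pid" 2>/dev/null   # no-op if the command already finished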
So basically the two scenarios are:
your command exits fine (before your 5-second timeout or whatever), so the wait stops as soon as the command exits, and kills the "killer" because it's not needed anymore
the command locks up, and the killer ends up killing the command
In any case you're guaranteed to either succeed as soon as the command is done or to fail after the timeout.
You can set a timeout of 2 hours and restart your javaScriptThatStalls up to 100 times this way in a loop:
seq 100|xargs -II timeout $((2 * 60 * 60)) javaScriptThatStalls
Do you mean you don't want the error message printed if the process isn't still running? Then you could just redirect stderr: kill $pid 2>/dev/null.
You could also check whether the process is still running:
if ps -p $pid >/dev/null; then kill $pid; fi
I found the bash script timeout.sh by Anthony Thyssen (see his website). It looks good.
