Using Until to Restart a Process when it dies - bash

I read this Question:
"How do I write a bash script to restart a process if it dies".
A solution is to:
until myserver; do
echo "Server 'myserver' crashed with exit code $?. Respawning.." >&2
sleep 1
done
Will this work for Tomcat/Jetty processes that are started from a script?
Do I test the success with "kill" to see if the process restarts?

If the script returns exit codes as specified in the answer at that link, then it should work. If you go back and read that answer again, it implies that you should not use kill. Using until will test for startup because a failed startup should return a non-zero exit code. Replace "myserver" with the name of your script.
Your script can have traps that handle various signals and other conditions. Those trap handlers can set appropriate exit codes.
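As a minimal sketch of that idea, the following stand-in for a start script installs a TERM handler that picks its own exit code. The function name run_server, the exit code 3, and the self-sent signal are assumptions made for the demo, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical stand-in for a server start script: exits with 3 on SIGTERM.
run_server() {
    sh -c '
        trap "exit 3" TERM   # the handler chooses the exit code
        kill -TERM $$        # simulate an external kill for this demo
        sleep 5              # not reached: the trap fires first
    '
}

run_server
server_status=$?
echo "server exited with $server_status"
```

An until loop wrapped around run_server would see the non-zero status and respawn it.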
Here is a demo. The subshell (echo "running 'dummy'"; sleep 2; exit $result) is a stand-in for your script:
result=0
until (echo "running 'dummy'"; sleep 2; exit $result)
do
echo "Server 'dummy' crashed with exit code $?. Respawning.." >&2
sleep 1
done
Try it with a failing "dummy" by setting result=1 and running the until loop again.
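If a permanently failing server should not respawn forever, the loop can be capped. This is only a sketch: fake_server, max_restarts, and the "fail twice, then succeed" behaviour are hypothetical names and choices made for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for the server: fails twice, then succeeds.
attempts=0
fake_server() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

max_restarts=5   # assumed upper bound on respawns
restarts=0
until fake_server; do
    status=$?
    restarts=$((restarts + 1))
    if [ "$restarts" -ge "$max_restarts" ]; then
        echo "Giving up after $restarts restarts." >&2
        break
    fi
    echo "Server crashed with exit code $status. Respawning.." >&2
done
echo "attempts=$attempts restarts=$restarts"
```

A real loop would probably keep the sleep between attempts; it is omitted here so the demo finishes immediately.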

while true
do
if pgrep 'jetty' >/dev/null; then
sleep 1
else
: # restart your program here (an empty else branch is a syntax error)
fi
done

Related

Bash script: how to give an alert when current program is killed

I'm trying to write a program using bash script. I'd like to give an alert when this program is killed.
The desired action is like this:
#!/bin/bash
... # The original program
if killed ; then
echo "trying to kill the demo program ... "
sleep 5s
echo "demo program killed"
fi
If you expect the signal to be delivered only to the running program and not to the shell running your script, then the basic synopsis might be:
#!/bin/bash
set -euo pipefail
sleep 1 & # The original program
pid="$!"
kill -9 "$pid" # Pick your lethal signal
wait -n "$pid" && status=0 || status="$?"
((status > 128)) && echo "${pid} got signal $((status - 128))" 1>&2 || :
In the snippet above, we run the program in the background only so that we can send it the kill signal from the same snippet. In practice you would probably run it in the foreground and then check its $? return status instead of the status from wait -n.
If the killing signal is delivered to your entire process group, including the shell running your script, that is a different story. For the signal KILL (9) in particular, there is no way to mask it or report it. When the shell gets it, it dies. For other signals you could set up a trap command (see man bash for its syntax) to handle the signal gracefully in the script while still being able to detect and report the child process’ death from the signal.

Obtain the exit code for a known process id

I have a list of processes triggered one after the other, in parallel. And, I need to know the exit code of all of these processes when they complete execution, without waiting for all of the processes to finish.
While status=$?; echo $status would provide the exit code for the last command executed, how do I know the exit code of any completed process, knowing the process id?
You can do that with GNU Parallel like this:
parallel --halt=now,done=1 ::: ./job1 ./job2 ./job3
The --halt=now,done=1 means halt immediately, as soon as any one job is done, killing all outstanding jobs immediately and exiting itself with the exit status of the complete job.
There are options to exit on success, or on failure as well as by completion. The number of successful, failing or complete jobs can be given as a percentage too. See the GNU Parallel documentation for details.
Save the background job id using a wrapper shell function. After that the exit status of each job can be queried:
#!/bin/bash
jobs=()
function run_child() {
"$@" &
jobs+=($!)
}
run_child sleep 1
run_child sleep 2
run_child false
for job in "${jobs[@]}"; do
wait "$job"
echo "Exit Code $?"
done
Output:
Exit Code 0
Exit Code 0
Exit Code 1
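The shell remembers the exit status of its background children, so wait can be given a specific PID in any order, even after that process has already finished. A small sketch, with hypothetical variable names and trivial stand-in jobs:

```shell
#!/bin/sh
true &                 # stand-in job that succeeds
pid_ok=$!
false &                # stand-in job that fails
pid_fail=$!
sh -c 'exit 7' &       # stand-in job with a custom exit code
pid_seven=$!

# Query the jobs in a different order than they were started.
wait "$pid_seven"; st_seven=$?
wait "$pid_ok";    st_ok=$?
wait "$pid_fail";  st_fail=$?

echo "ok=$st_ok fail=$st_fail seven=$st_seven"
```

This only works for PIDs that are children of the current shell, and each status can be collected once.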

Prevent SIGINT from interrupting current task while still passing information about SIGINT (and preserve the exit code)

I have a quite long shell script and I'm trying to add signal handling to it.
The main task of the script is to run various programs and then clean up their temporary files.
I want to trap SIGINT.
When the signal is caught, the script should wait for the current program to finish execution, then do the cleanup and exit.
Here is an MCVE:
#!/bin/sh
stop_this=0
trap 'stop_this=1' 2
while true ; do
result="$(sleep 2 ; echo success)" # run some program
echo "result: '$result'"
echo "Cleaning up..." # clean up temporary files
if [ $stop_this -ne 0 ] ; then
echo 'OK, time to stop this.'
break
fi
done
exit 0
The expected result:
Cleaning up...
result: 'success'
Cleaning up...
^Cresult: 'success'
Cleaning up...
OK, time to stop this.
The actual result:
Cleaning up...
result: 'success'
Cleaning up...
^Cresult: ''
Cleaning up...
OK, time to stop this.
The problem is that the currently running instruction (result="$(sleep 2 ; echo success)" in this case) is interrupted.
What can I do so it would behave more as if I had set trap '' 2?
I'm looking for either a POSIX solution or one that is supported by most of shell interpreters (BusyBox, dash, Cygwin...)
I already saw answers for Prevent SIGINT from closing child process in bash script but this isn't really working for me. All of these solutions require modifying each line that shouldn't be interrupted. My real script is quite long and much more complicated than the example. I would have to modify hundreds of lines.
You need to prevent the SIGINT from going to the echo in the first place (or rewrite the cmd that you are running in the variable assignment to ignore SIGINT). Also, you need to allow the variable assignment to happen, and it appears that the shell is aborting the assignment when it receives the SIGINT. If you're only worried about user generated SIGINT from the tty, you need to disassociate that command from the tty (eg, get it out of the foreground process group) and prevent the SIGINT from aborting the assignment. You can (almost) accomplish both of those with:
#!/bin/sh
stop_this=0
while true ; do
trap 'stop_this=1' INT
{ sleep 1; echo success > tmpfile; } & # run some program
while ! wait; do : ; done
trap : INT
result=$(cat tmpfile& wait)
echo "result: '$result'"
echo "Cleaning up..." # clean up temporary files
if [ $stop_this -ne 0 ] ; then
echo 'OK, time to stop this.'
break
fi
done
exit 0
If you're worried about SIGINT from another source, you'll have to re-implement sleep (or whatever command I presume sleep is a proxy for) to handle SIGINT the way you want. The key here is to run the command in the background and wait for it to prevent the SIGINT from going to it and terminating it early. Note that we've opened at least 2 new cans of worms here. By waiting in a loop, we're effectively ignoring any errors that the subcommand might raise (we're doing this to emulate something like SA_RESTART), so the loop may potentially hang. Also, if the SIGINT arrives during the cat, we have attempted to prevent the cat from aborting by running it in the background, but now the variable assignment will be terminated and you'll get your original behavior. Signal handling is not clean in the shell! But this gets you closer to your desired goal.
Sighandling in shell scripts can get clumsy. It's pretty much impossible to
do it "right" without the support of C.
The problem with:
result="$(sleep 2 ; echo success)" # run some program
is that $() creates a subshell and in subshells, non-ignored (trap '' SIGNAL is how you ignore SIGNAL)
signals are reset to their default dispositions which for SIGINT is to terminate the process
($( ) gets its own process, though it will receive the signal too because the terminal-generated SIGINT
is process-group targeted)
To prevent this, you could do something like:
result="$(
trap '' INT #ignore; could get killed right before the trap command
sleep 2; echo success)"
or
result="$( trap : INT; #no-op handler; same problem
sleep 2; while ! echo success; do :; done)"
but as noted, there will be a small race-condition window between the start of the
subshell and the registration of the signal handler during which
the subshell could get killed by the reset-to-default SIGINT signal.
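One way to avoid that window, at the cost of the script no longer noticing the signal, is to ignore SIGINT in the parent before the subshell starts, since ignored signals stay ignored in subshells and child processes. A minimal sketch, where the inner sh sends SIGINT to itself only to simulate Ctrl+C:

```shell
#!/bin/sh
trap '' INT   # ignore SIGINT; subshells and children inherit the ignored state

# The child signals itself; because SIGINT is ignored, it keeps running.
result=$(sh -c 'kill -INT $$; echo survived')
echo "result: '$result'"
```

Because the signal is ignored rather than handled, the script needs some other channel (a flag file, for example) to learn that an interrupt was requested.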
Both answers from @PSkocik and @WilliamPursell have helped me get on the right track.
I have a fully working solution. It ain't pretty, because it needs an external file to indicate that the signal hasn't occurred, but besides that it should work reliably.
#!/bin/sh
touch ./continue
trap 'rm -f ./continue' 2
( # the whole main body of the script is in a separate background process
trap '' 2 # ignore SIGINT
while true ; do
result="$(sleep 2 ; echo success)" # run some program
echo "result: '$result'"
echo "Cleaning up..." # clean up temporary files
if [ ! -e ./continue ] ; then # exit the loop if file "./continue" is deleted
echo 'OK, time to stop this.'
break
fi
done
) & # end of the main body of the script
while ! wait ; do : ; done # wait for the background process to end (ignore signals)
wait $! # wait again to get the exit code
result=$? # exit code of the background process
rm -f ./continue # clean up if the background process ended without a signal
exit $result
EDIT: There are some problems with this code in Cygwin.
The main functionality regarding signals works.
However, it seems that the finished background process doesn't remain in the system as a zombie, so wait $! fails and the script exits with an incorrect code of 127.
A workaround is to remove the lines wait $! and result=$? so that the script always returns 0.
It should be also possible to keep the proper error code by using another layer of subshell and temporarily store the exit code in a file.
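That last idea could be sketched roughly as follows. The temporary file, the variable names, and the ( exit 42 ) stand-in for the main body are all assumptions made for illustration.

```shell
#!/bin/sh
status_file=$(mktemp)            # hypothetical temp file holding the exit code

(
    ( exit 42 )                  # stand-in for the main body of the script
    echo "$?" > "$status_file"   # record its exit code before the wrapper ends
) &

while ! wait; do :; done         # wait for the wrapper, retrying if interrupted
result=$(cat "$status_file")
rm -f "$status_file"
echo "main body exited with $result"
```

A real script would finish with exit "$result"; since the code is read from the file, it does not depend on wait $! still being able to find the process.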
For disallowing interrupting the program:
trap "" ERR HUP INT QUIT TERM TSTP TTIN TTOU
But if a sub-command handles traps by itself, and that command must really complete, you need to prevent passing signals to it.
For people on Linux that don't mind installing extra commands, you can just use:
waitFor [command]
Alternatively you can adapt the latest source code of waitFor into your program as needed, or use the code from Gilles' answer. Although that has the disadvantage of not benefiting from updates upstream.
Just mind that other terminals and the service manager can still terminate "command". If you want the service manager to be unable to close "command", it shall be run as a service with the appropriate kill mode and kill signal set.
You may want to adapt the following:
#!/bin/sh
tmpfile=".tmpfile"
rm -f $tmpfile
trap : INT
# put the action that should not be interrupted in the innermost brackets
# | |
( set -m; (sleep 10; echo success > $tmpfile) & wait ) &
wait # wait will be interrupted by Ctrl+c
while [ ! -r $tmpfile ]; do
echo "waiting for $tmpfile"
sleep 1
done
result=`cat $tmpfile`
echo "result: '$result'"
This also seems to work with programs that install their own SIGINT handler, such as mpirun and mpiexec.

catching error code of background task

I have a Python script on my Raspberry Pi that runs in an infinite loop. I want to catch its exit code in case it stops. I made a script named run like this:
#!/bin/bash
~/bin/script.py &
wait $! && echo "script exited with code $?" >> ~/bin/log/script.log &
but when I run it I get the following error:
~/bin/run: line 3: wait: pid 2728 is not a child of this shell
Can anyone give me some hint of a solution?
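You are pushing your (single) script to the background and then doing a blocking wait. I think this is unnecessary. The error itself occurs because the trailing & runs the wait in a background subshell, and script.py is not a child of that subshell. You can simply write:
#!/bin/bash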
~/bin/script.py
echo "script exited with code $?" >> ~/bin/log/script.log
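If the script really does need to run in the background (for example, to do other work in the meantime), a sketch of a working variant is to call wait from the same shell that started the job, without a trailing &. Here sh -c 'exit 5' stands in for ~/bin/script.py and log_file is a hypothetical log location.

```shell
#!/bin/bash
log_file=$(mktemp)        # hypothetical log location for the demo

sh -c 'exit 5' &          # stand-in for ~/bin/script.py
pid=$!

# ... other work could happen here ...

wait "$pid"               # runs in the shell that owns the background job
code=$?
echo "script exited with code $code" >> "$log_file"
```

Capturing $? into a variable right after wait also avoids the original wait ... && echo pattern, which would only log when the script exits successfully.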

Exit all called KornShell (ksh) scripts

How can a KornShell (ksh) script exit/kill all the processes started from another ksh script?
If scriptA.ksh calls scriptB.ksh then the following code works well enough, but is there a better solution for this?
scriptA.ksh:
#call scriptBSnippet
scriptBSnippet.ksh ${a}
scriptB.ksh:
#if error: exit this script (scriptB) and calling script (scriptA)#
kill ${PPID}
exit 1
To add complexity, what if scriptA calls scriptB which calls scriptC? How can I then exit out of all three scripts if there is an error in scriptC?
scriptA.ksh:
#call scriptBSnippet
scriptBSnippet.ksh ${a}
scriptB.ksh:
#if error: exit this script (scriptB) and calling script (scriptA)#
kill ${PPID}
exit 1
scriptC.ksh:
#if error: exit this script (scriptC) and calling scripts (scriptA, scriptB)#
#kill ${PPID}
#exit 1
Thanks in advance.
Killing all processes started by the same script is a bit of a brute force method.
It would be best to have some method of communication between the processes that would allow them to gracefully shutdown.
However, if all processes are in the same process group, you can send a signal to the entire process group:
kill -${Signal:?} -${Pgid:?}
Note that two arguments are required in this case. A single argument starting with - is always interpreted as a signal.
Run some tests to see which processes get included in the process group.
parent.sh:
Shell=ksh
($Shell -c :) || exit
$Shell child1.sh & pid1=$!
$Shell child2.sh & pid2=$!
$Shell child3.sh & pid3=$!
ps -o pid,sid,pgid,tty,cmd $PPID $$ $pid1 $pid2 $pid3
exit
child.sh:
sleep 50
If you run parent.sh from a terminal, it will become the process group leader.
granny.sh:
Shell=ksh
($Shell -c :) || exit
$Shell parent.sh &
wait
exit
If you run parent.sh from another script granny.sh, then that will be the process group leader, and will be included when you use the kill -SIG -PGID method.
See also this answer to:
What are “session leaders” in ps? for some background on sessions and process groups.
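The process-group kill described above can be tried in isolation with a sketch like this one. It is written for bash rather than ksh and assumes that set -m (job control) puts each background job into its own process group, so that the group ID equals the PID stored in $!; the nested sh with two sleeps is a stand-in for a script that starts further children.

```shell
#!/bin/bash
set -m                                  # job control: each background job gets its own process group

sh -c 'sleep 10 & sleep 10 & wait' &    # stand-in for a script that starts children
pgid=$!                                 # assumed: the job's PGID equals the leader's PID

kill -TERM "-$pgid"                     # negative ID: signal every process in the group
wait "$pgid"
status=$?
echo "group leader exited with status $status"
```

Checking the result with ps -o pid,pgid before sending the signal, as in parent.sh above, is a reasonable precaution, because the group layout depends on how the scripts were started.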
