Identify whether a process was killed by a signal in bash

Consider these two C programs:
#include <signal.h>
int main(void) {
    raise(SIGTERM);
}

int main(void) {
    return 143;
}
If I run either one, the value of $? in bash will be 143. The wait syscall lets you distinguish them, though:
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 11148
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 143}], 0, NULL) = 11214
And bash clearly uses this knowledge, since the first one results in Terminated being printed to the terminal (oddly, this happens even if I redirect both stdout and stderr elsewhere), and the second one doesn't. How can I differentiate these two cases from a bash script?

I believe getting the full exit status from pure bash/shell is not possible.
The answers on Unix & Linux Stack Exchange are very comprehensive.
What's common between all shells is that $? contains the lowest 8 bits of the exit code (the number passed to exit()) if the process terminated normally.
Where it differs is when the process is terminated by a signal. In all cases, and that's required by POSIX, the number will be greater than 128. POSIX doesn't specify what the value may be. In practice though, in all Bourne-like shells that I know, the lowest 7 bits of $? will contain the signal number. But, where n is the signal number,
in ash, zsh, pdksh, bash and the Bourne shell, $? is 128 + n. What that means is that in those shells, if you get a $? of 129, you don't know whether it's because the process exited with exit(129) or whether it was killed by the signal 1 (HUP on most systems). But the rationale is that shells, when they do exit themselves, by default return the exit status of the last exited command. By making sure $? is never greater than 255, that allows for a consistent exit status:
$ bash -c 'sh -c "kill \$\$"; printf "%x\n" "$?"'
bash: line 1: 16720 Terminated sh -c "kill \$\$"
8f # 128 + 15
$ bash -c 'sh -c "kill \$\$"; exit'; printf '%x\n' "$?"
bash: line 1: 16726 Terminated sh -c "kill \$\$"
8f # here that 0x8f is from an exit(143) done by bash. Though it's
# not from a killed process, that does tell us that probably
# something was killed by a SIGTERM
For this reason, I believe you would need to run a command outside of bash to catch the full exit status.
At a higher level of abstraction, a similar question has been asked about unbuffer, which is a small script written in Tcl. To be more precise, unbuffer uses the library libexpect with a Tcl/Tk wrapper.
From the source of unbuffer I extracted the relevant code to derive a workaround:
#!/bin/bash
expectStat() {
    expect <(cat << EOT
set stty_init "-opost"
set timeout -1
eval [list spawn -noecho ] $@
expect
send_user "[wait]\n"
EOT
)
}
expectStat sleep 5 &
wait
which returns approximately the following line if sleep exits normally:
18383 exp4 0 0
If sleep is killed before it exits on its own, the above script will return approximately:
18383 exp4 0 0 CHILDKILLED SIGTERM {software termination signal}
If a script terminates with exit 143, the above will return approximately:
18383 exp4 0 143
The meaning of these fields is explained in the expect manual; they are produced by expect's built-in wait command.
The first two values are the pid and expect's spawn id for the process.
The fourth is the exit status. If a signal occurred, more information is printed; the sixth value is then the name of the signal that was sent to the process on its termination.
wait
normally returns a list of four integers. The first integer is the pid of the process that was waited upon. The second integer is the corresponding spawn id. The third integer is -1 if an operating system error occurred, or 0 otherwise. If the third integer was 0, the fourth integer is the status returned by the spawned process. If the third integer was -1, the fourth integer is the value of errno set by the operating system. The global variable errorCode is also set.
Additional elements may appear at the end of the return value from wait. An optional fifth element identifies a class of information. Currently, the only possible value for this element is CHILDKILLED in which case the next two values are the C-style signal name and a short textual description.
This means the fourth value and, if present, the sixth value are the ones you are looking for. Store the whole line and extract the signal and exit status, for example with the following code:
RET=$(expectStat script.sh 1>&1)
# Filter status
EXITVALUE="$(echo "$RET" | cut -d' ' -f4)"
SIGNAL=$(echo "$RET" | cut -d' ' -f6)
#echo "Exit value: $EXITVALUE, Signal: $SIGNAL"
if [ -n "$SIGNAL" ]; then
echo "Likely killed by signal"
else
echo "$EXITVALUE"
fi
Admittedly, this workaround is very inelegant. Maybe there is another tool that brings its own C-based helpers to detect the occurrence of a signal.

wait is both a syscall and a bash builtin.
To differentiate the two cases from bash, run the program in the background and use the builtin wait to report the outcome.
Following are examples of both a non-zero exit code and an uncaught signal. These examples use the exit and kill bash builtins in a child bash shell; in place of the child bash shell you would run your own program.
$ bash -c 'kill -s SIGTERM $$' & wait
[1] 36068
[1]+ Terminated: 15 bash -c 'kill -s SIGTERM $$'
$ bash -c 'exit 143' & wait
[1] 36079
[1]+ Exit 143 bash -c 'exit 143'
$
As to why you see Terminated printed to the terminal even when you redirect stdout and stderr: the message is printed by bash itself, not by the program.
Update:
By explicitly using the wait builtin you can now redirect its stderr (with the exit status of the program) to a separate file.
The following examples show the three types of termination: normal exit 0, non-zero exit, and uncaught signal. The results reported by wait are stored in files tagged with the PID of the corresponding program.
$ bash -c 'exit 0' & wait 2> exit_status_pid_$!
[1] 40279
$ bash -c 'exit 143' & wait 2> exit_status_pid_$!
[1] 40291
$ bash -c 'kill -s SIGTERM $$' & wait 2> exit_status_pid_$!
[1] 40303
$ for f in exit_status_pid*; do echo $f: $(cat $f); done
exit_status_pid_40279: [1]+ Done bash -c 'exit 0'
exit_status_pid_40291: [1]+ Exit 143 bash -c 'exit 143'
exit_status_pid_40303: [1]+ Terminated: 15 bash -c 'kill -s SIGTERM $$'
$
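If you want a script to act on that information, one approach (a sketch that builds on the redirection shown above; the helper name is mine, and the "Terminated"/"Killed" text it greps for depends on locale and bash version) is to capture wait's stderr and inspect it:
#!/bin/bash
# Sketch: run a command in the background, capture the job-status line that
# the wait builtin writes to stderr, and classify the outcome from that line.
run_and_classify() {
    "$@" &
    local pid=$!
    local status_file
    status_file=$(mktemp)
    wait "$pid" 2> "$status_file"
    local rc=$?
    if grep -Eq 'Terminated|Killed' "$status_file"; then
        echo "PID $pid: killed by a signal (wait returned $rc)"
    else
        echo "PID $pid: exited with status $rc"
    fi
    rm -f "$status_file"
}

run_and_classify bash -c 'exit 143'
run_and_classify bash -c 'kill -s SIGTERM $$'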

This is straying further from bash, but bcc offers exitsnoop. Using the examples from the description, on Debian Sid:
root@vsid:~# apt install bpfcc-tools linux-headers-amd64
root@vsid:~# exitsnoop-bpfcc
PCOMM PID PPID TID AGE(s) EXIT_CODE
example1 1041 948 1041 0.00 signal 15 (TERM)
example2 1042 948 1042 0.00 code 143
^C
See the install guide for other distributions.

strace can capture most of the signals, but might not work for some syscalls (e.g. kill -9); therefore, as mentioned in this article:
Auditd is a daemon process or service that does as the name implies and produces audit logs of System level activities. It is installed from the usual repository as the audit package and then is configured in /etc/audit/auditd.conf and the rules are in /etc/audit/audit.rules.
The article provides examples of the audit output, which can help you determine whether it is useful for you:
The usual output will look like this:
time->Wed Jun 3 16:34:08 2015
type=SYSCALL msg=audit(1433363648.091:6342): arch=c000003e syscall=62 success=no exit=-3 a0=1e06 a1=0 a2=1e06 a3=fffffffffffffff0 items=0 ppid=10044 pid=10140 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm=4174746163682041504920696E6974 exe="/opt/ibm/WebSphere/AppServer/java/jre/bin/java" subj=unconfined_u:unconfined_r:unconfined_java_t:s0-s0:c0.c1023 key="kill_signals"
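For reference, a rule producing records tagged with key="kill_signals" (as in the output above) might look like this; the exact fields are my assumption based on standard auditctl syntax, not taken from the article:
# audit 64-bit kill(2) calls and tag the records with the key "kill_signals"
auditctl -a always,exit -F arch=b64 -S kill -k kill_signals
# or, as a line in /etc/audit/audit.rules:
# -a always,exit -F arch=b64 -S kill -k kill_signals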
There is also a mention of SystemTap, and a link to a guide.

Related

Why is the KILL signal handler not executing when my child process dies

I have migrated some scripts from ksh to bash and I have observed very strange behavior in bash. I was able to reduce it to a very short snippet.
echo first test
LC_ALL=C xclock &
Active_pid=$!
sleep 1
kill -9 $Active_pid
sleep 1
echo second test
LC_ALL=C xclock &
Active_pid=$!
sleep 1
trap "echo Signal SIGKILL caught" 9
kill -9 $Active_pid
sleep 1
The output is
first test
./mig_bash.sh: line 15: 4471 Killed LC_ALL=C xclock
second test
My problem was the trace produced by the first test. I tried to see whether a signal was received. By trial and error, I wrote the "second test", which solves my problem, but I do not understand it. How does this remove the trace of the first test without executing echo Signal SIGKILL?
I am completely lost.
I couldn't find anything in the bash documentation that would explain the observed behavior, so I turned to the source code. Debugging led to the function notify_of_job_status(). The line that prints the message about a killed subprocess can be reached only if all of the following conditions hold:
the subprocess is registered in the job table (i.e. has not been disown-ed)
the shell was NOT started in interactive mode
the signal that terminated the child process is NOT trapped in the parent shell (see the signal_is_trapped (termsig) == 0 check)
Demonstration:
$ cat test.sh
echo Starting a subprocess
LC_ALL=C sleep 100 &
Active_pid=$!
case "$1" in
disown) disown ;;
trapsigkill) trap "echo Signal SIGKILL caught" 9 ;;
esac
sleep 1
kill -9 $Active_pid
sleep 1
echo End of script
$ # Demonstrate the undesired message
$ bash test.sh
Starting a subprocess
test.sh: line 14: 15269 Killed LC_ALL=C sleep 100
End of script
$ # Suppress the undesired message by disowning the child process
$ bash test.sh disown
Starting a subprocess
End of script
$ # Suppress the undesired message by trapping SIGKILL in the parent shell
$ bash test.sh trapsigkill
Starting a subprocess
End of script
$ # Suppress the undesired message by using an interactive shell
$ bash -i test.sh
Starting a subprocess
End of script
How does this remove the trace of the first test without executing echo Signal SIGKILL?
The trap is not executed since the KILL signal is received by the sub-process rather than the shell process for which the trap has been set. The effect of the trap on the diagnostics is in the (somewhat arguable) logic in the notify_of_job_status() function.

limit spawned parallel processes and exit all upon failure of any

I'm running some tests in parallel by calling a process from a script. Each process prints only to stdout (redirected to a file), and exits 0 iff successful (otherwise -1).
If and when a process exits with -1, I print something to its (or a related) output file (namely, the arguments it was called with), kill all other processes, and exit.
I have written a script using trap "..." CHLD to run some code when a subprocess exits, and this works under certain conditions, but I find my script is not very robust. If I send a keyboard interrupt, sometimes the subprocesses keep going, and sometimes the sheer number of subprocesses simply overwhelms the machine(s) and none of them seem to advance.
I am using this on my quad core laptop as well as a cluster of 128 CPUs, over which subprocesses are distributed automatically. How do I run a large number of background subprocesses in a bash script, limited to some number of them running concurrently, and do something + exit if one of them returns with a bad code? I would also like the script to clean up after keyboard interrupt. Should I use GNU-parallel? how?
Here is a MWE of my script so far, which spawns subprocesses unhindered, annotated with what I think each part means. I got the idea to use trap from shell - get exit code of background process
$ cat parallel_tests.sh
#!/bin/bash
# some help from https://stackoverflow.com/questions/1570262/shell-get-exit-code-of-background-process
handle_chld() {
    #echo pids are ${pids[@]}
    local tmp=() ### temporary storage for pids that haven't finished
    # for each pid that hadn't finished since the last trap
    for ((i=0; i<${#pids[@]}; ++i)); do
        # if this pid is still running
        if [[ $(ps -p ${pids[i]} -o pid=) ]]
        then
            tmp+=(${pids[i]}) ### add pid to list of pids that are running
        else
            wait ${pids[i]} ### put the exit code of this pid into $?
            if [ "$?" != "0" ] ### if the exit code $? is non-zero
            then
                # kill all remaining processes
                for ((j=0; j<${#pids[@]}; ++j))
                do
                    if [[ $(ps -p ${pids[j]} -o pid=) ]]
                    then
                        echo killing child processes of ${pids[j]}
                        pkill -P ${pids[j]}
                    fi
                done
                cat _tmp${pids[i]}
                # print things to the terminal here
                echo "FAILED process ${pids[i]} args: `cat _tmpargs${pids[i]}`"
                exit 1
            else
                echo "FINISHED process ${pids[i]} args: `cat _tmpargs${pids[i]}`"
            fi
        fi
    done
    # update list of running pids
    pids=(${tmp[@]})
}
# set this to monitor SIGCHLD
set -o monitor
# call handle_chld() when SIGCHLD signal is triggered
trap "handle_chld" CHLD
ALL_ARGS="2 32 87" ### ad nauseam
for A in $ALL_ARGS; do
    (sleep $A; false) > _tmp$! &
    pids+=($!)
    echo $A > _tmpargs${pids[${#pids[@]}-1]}
    echo "STARTED process ${pids[${#pids[@]}-1]} args: `cat _tmpargs${pids[${#pids[@]}-1]}`"
done
echo "Every process started. Now waiting on PIDS:"
echo ${pids[@]}
wait ${pids[@]} ### wait until every process is finished (or exit in the trap)
The output of this version after 2+epsilon seconds is:
$ ./parallel_tests.sh
STARTED process 66369 args: 2
STARTED process 66374 args: 32
STARTED process 66381 args: 87
Every process started. Now waiting on PIDS:
66369 66374 66381
killing child processes of 66374
./parallel_tests.sh: line 43: 66376 Terminated: 15 sleep $A
killing child processes of 66381
./parallel_tests.sh: line 43: 66383 Terminated: 15 sleep $A
FAILED process 66369 args: 2
Essentially, pid 66369 fails first, and the other two processes are dealt with in the trap. I have simplified the construction of the test processes here, so we can't assume that I'll manually insert waits before spawning new ones. Additionally, some of the test processes can be nearly instant. Essentially, I have a whole mess of test processes, long and short, starting as soon as resources can be allotted.
I'm not sure what's causing the problems I mentioned above, as this script uses several features that are new to me. General pointers are welcomed!
(I have seen this question and it does not answer my question)
cat arguments | parallel --halt now,fail=1 my_prg
Alternatively:
parallel --halt now,fail=1 my_prg ::: $ALL_ARGS
GNU Parallel is designed so it will also kill remote jobs. It does that using process groups and heavy perl scripting on the remote server: https://www.gnu.org/software/parallel/parallel_design.html#The-remote-system-wrapper
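To also cap concurrency and keep each test's output in its own file, an invocation along these lines should work (the job count, output directory, and run_test.sh are placeholders):
# at most 8 jobs at a time; abort all remaining jobs as soon as one fails;
# store each job's stdout/stderr under outdir/, keyed by its arguments
parallel --jobs 8 --halt now,fail=1 --results outdir/ ./run_test.sh ::: $ALL_ARGS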

SIGALRM waits for subshell processes?

Here is the unexpected situation: in the following script, SIGALRM doesn't invoke the function alarm() at the expected time.
#!/bin/sh -x
alarm() {
echo "alarmed!!!"
}
trap alarm 14
OUTER=$(exec sh -c 'echo $PPID')
#for arg in `ls $0`; do
ls $0 | while read arg; do
INNER=$(exec sh -c 'echo $PPID')
# child A, the timer
sleep 1 && kill -s 14 $$ &
# child B, some other scripts
sleep 60 &
wait $!
done
Expectation:
After 1 second, the function alarm() should be called.
Actually:
alarm() is not invoked until the 60 seconds are up, or until we hit Ctrl+C.
We know that in the script $$ indicates the OUTER process, so I suppose we should see the string printed to the screen after 1 second. However, it is not until child B exits that we see alarm() called.
When we comment out the trap line, the whole program just terminates after 1 second. So I suppose SIGALRM is at least received, but why doesn't it invoke the trapped action?
And, as a side question, is the default behavior of SIGALRM termination? From here I am told that by default it is ignored, so why does OUTER exit after receiving it?
From the bash man page:
If bash is waiting for a command to complete and receives a signal for which a trap has been set, the trap will not be executed until the command completes. When bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed.
Your original script is in the first scenario. The subshell (the while read loop) has called wait, but the top level script is just waiting for the subshell, so when it receives the signal, the trap is not executed until the subshell completes. If you send the signal to the subshell with kill -s 14 $INNER, you get the behavior you expect.
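In other words, the suggested change is a one-liner inside the loop, leaving the rest of the script as it is:
# child A, the timer: signal the subshell running the while loop ($INNER)
# instead of the top-level script ($$)
sleep 1 && kill -s 14 $INNER &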

How do I terminate all the subshell processes?

I have a bash script to test how a server performs under load.
num=1
if [ $# -gt 0 ]; then
num=$1
fi
for i in {1 .. $num}; do
(while true; do
{ time curl --silent 'http://localhost'; } 2>&1 | grep real
done) &
done
wait
When I hit Ctrl-C, the main process exits, but the background loops keep running. How do I make them all exit? Or is there a better way of spawning a configurable number of logic loops executing in parallel?
Here's a simpler solution -- just add the following line at the top of your script:
trap "kill 0" SIGINT
Killing 0 sends the signal to all processes in the current process group.
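Applied to the script from the question, that might look like this (a sketch; I have also replaced the broken {1 .. $num} with an arithmetic loop, which is discussed further below):
#!/bin/bash
# On Ctrl-C, signal every process in the script's process group,
# i.e. the script itself and all of its background subshells.
trap "kill 0" SIGINT

num=${1:-1}
for ((i = 0; i < num; i++)); do
    while true; do
        { time curl --silent 'http://localhost'; } 2>&1 | grep real
    done &
done
wait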
One way to kill subshells, but not self:
kill $(jobs -p)
Bit of a late answer, but for me solutions like kill 0 or kill $(jobs -p) go too far (kill all child processes).
If you just want to make sure one specific child process (and its own children) is tidied up, then a better solution is to kill by process group (PGID) using the sub-process' PID, like so:
set -m
./some_child_script.sh &
some_pid=$!
kill -- -${some_pid}
Firstly, the set -m command will enable job management (if it isn't already), this is important, as otherwise all commands, sub-shells etc. will be assigned to the same process group as your parent script (unlike when you run the commands manually in a terminal), and kill will just give a "no such process" error. This needs to be called before you run the background command you wish to manage as a group (or just call it at script start if you have several).
Secondly, note that the argument to kill is negative, this indicates that you want to kill an entire process group. By default the process group ID is the same as the first command in the group, so we can get it by simply adding a minus sign in front of the PID we fetched with $!. If you need to get the process group ID in a more complex case, you will need to use ps -o pgid= ${some_pid}, then add the minus sign to that.
Lastly, note the use of the explicit end of options --, this is important, as otherwise the process group argument will be treated as an option (signal number), and kill will complain it doesn't have enough arguments. You only need this if the process group argument is the first one you wish to terminate.
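For that more complex case, the lookup might look like this (a sketch, reusing some_pid from above):
# ask ps for the child's process group ID, strip the padding,
# then kill the entire group
pgid=$(ps -o pgid= "${some_pid}" | tr -d ' ')
kill -- "-${pgid}"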
Here is a simplified example of a background timeout process, and how to clean up as much as possible:
#!/bin/bash
# Use the overkill method in case we're terminated ourselves
trap 'kill $(jobs -p | xargs)' SIGINT SIGHUP SIGTERM EXIT
# Setup a simple timeout command (an echo)
set -m
{ sleep 3600; echo "Operation took longer than an hour"; } &
timeout_pid=$!
# Run our actual operation here
do_something
# Cancel our timeout
kill -- -${timeout_pid} >/dev/null 2>&1
wait -- -${timeout_pid} >/dev/null 2>&1
printf '' 2>&1
This should cleanly handle cancelling this simplistic timeout in all reasonable cases; the only case that can't be handled is the script being terminated immediately (kill -9), as it won't get a chance to clean up.
I've also added a wait, followed by a no-op (printf ''); this is to suppress "terminated" messages that can be caused by the kill command. It's a bit of a hack, but it is reliable enough in my experience.
You need to use job control, which, unfortunately, is a bit complicated. If these are the only background jobs that you expect will be running, you can run a command like this one:
jobs \
| perl -ne 'print "$1\n" if m/^\[(\d+)\][+-]? +Running/;' \
| while read -r ; do kill %"$REPLY" ; done
jobs prints a list of all active jobs (running jobs, plus recently finished or terminated jobs), in a format like this:
[1] Running sleep 10 &
[2] Running sleep 10 &
[3] Running sleep 10 &
[4] Running sleep 10 &
[5] Running sleep 10 &
[6] Running sleep 10 &
[7] Running sleep 10 &
[8] Running sleep 10 &
[9]- Running sleep 10 &
[10]+ Running sleep 10 &
(Those are jobs that I launched by running for i in {1..10} ; do sleep 10 & done.)
perl -ne ... is me using Perl to extract the job numbers of the running jobs; you can obviously use a different tool if you prefer. You may need to modify this script if your jobs builtin has a different output format; but the above output is from Cygwin, so it's very likely identical to yours.
read -r reads a "raw" line from standard input, and saves it into the variable $REPLY. kill %"$REPLY" will be something like kill %1, which "kills" (sends an interrupt signal to) job number 1. (Not to be confused with kill 1, which would kill process number 1.) Together, while read -r ; do kill %"$REPLY" ; done goes through each job number printed by the Perl script, and kills it.
By the way, your for i in {1 .. $num} won't do what you expect, since brace expansion is handled before parameter expansion, so what you have is equivalent to for i in "{1" .. "$num}". (And you can't have white-space inside the brace expansion, anyway.) Unfortunately, I don't know of a clean alternative; I think you have to do something like for i in $(bash -c "echo {1..$num}"), or else switch to an arithmetic for-loop or whatnot, as sketched below.
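A sketch of the arithmetic for-loop alternative (assuming num holds the desired count):
for ((i = 1; i <= num; i++)); do
    echo "iteration $i"   # replace with the real loop body
done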
Also by the way, you don't need to wrap your while-loop in parentheses; & already causes the job to be run in a subshell.
Here's my eventual solution. I'm keeping track of the subshell process IDs using an array variable, and trapping the Ctrl-C signal to kill them.
declare -a subs # array of subshell pids
function kill_subs() {
    for pid in ${subs[@]}; do
        kill $pid
    done
    exit 0
}
num=1
if [ $# -gt 0 ]; then
    num=$1
fi
for ((i=0; i < $num; i++)); do
    while true; do
        { time curl --silent 'http://localhost'; } 2>&1 | grep real
    done &
    subs[$i]=$! # grab the pid of the subshell
done
trap kill_subs 1 2 15
wait
While this is not an answer, I would just like to point out something that undermines the selected one: using jobs or kill 0 might have unexpected results; in my case it killed unintended processes, which is not an option for me.
It has been highlighted in some of the other answers, but I am afraid not with enough stress, or it has not been considered:
"Bit of a late answer, but for me solutions like kill 0 or kill $(jobs -p) go too far (kill all child processes)."
"If these are the only background jobs that you expect will be running, you can run a command like this one:"

How do I check the exit code of a command executed by flock?

Greetings all. I'm setting up a cron job to execute a bash script, and I'm worried that the next one may start before the previous one ends. A little googling reveals that a popular way to address this is the flock command, used in the following manner:
flock -n lockfile myscript.sh
if [ $? -eq 1 ]; then
echo "Previous script is still running! Can't execute!"
fi
This works great. However, what do I do if I want to check the exit code of myscript.sh? Whatever exit code it returns will be overwritten by flock's, so I have no way of knowing if it executed successfully or not.
It looks like you can use the alternate form of flock, flock <fd>, where <fd> is a file descriptor. If you put this into a subshell, and redirect that file descriptor to your lock file, then flock will wait until it can write to that file (or error out if it can't open it immediately and you've passed -n). You can then do everything in your subshell, including testing the return value of scripts you run:
(
if flock -n 200
then
myscript.sh
echo $?
fi
) 200>lockfile
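Since a subshell exits with the status of its last command, this also lets you propagate myscript.sh's exit code to the caller. A sketch (the 125 "lock busy" marker is an arbitrary value of mine; pick one myscript.sh never uses):
(
    flock -n 200 || exit 125   # could not get the lock
    myscript.sh                # the subshell exits with myscript.sh's status
) 200> lockfile
rc=$?
if [ "$rc" -eq 125 ]; then
    echo "Previous script is still running! Can't execute!"
else
    echo "myscript.sh exited with status $rc"
fi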
According to the flock man page, flock has a -E or --exit-conflict-code flag you can use to set what the exit code of flock should be in the case a conflict occurs:
-E, --conflict-exit-code number
The exit status used when the -n option is in use, and the conflicting lock exists, or the -w option is in use, and the timeout is reached. The default value is 1. The number has to be in the range of 0 to 255.
The man page also states:
EXIT STATUS
The command uses sysexits.h exit status values for everything, except when using either of the options -n or -w which report a failure to acquire the lock with a exit status given by the -E option, or 1 by default. The exit status given by -E has to be in the range of 0 to 255.
When using the command variant, and executing the child worked, then the exit status is that of the child command.
So, in the case of the -n or -w flags while using the "command" variant, you can see both exit statuses.
Example:
$ flock --exclusive /tmp/flock.lock bash -c 'exit 42'; echo $?
42
$ flock --exclusive /tmp/flock.lock flock --exclusive --nonblock --conflict-exit-code 100 /tmp/flock.lock bash -c 'exit 42'; echo $?
100
In the first example, we see that we get back the exit status of the process we're running with flock. In the second example, we are creating contention for the lock. In that case, flock itself returns the status code we tell it (100). If you do not specify a value with the --conflict-exit-code flag, it will return 1 instead. However, I prefer setting less common values to avoid confusion with other processes/scripts which might also return a value of 1.
#!/bin/bash
if ! pgrep myscript.sh; then
flock -n lockfile myscript.sh
fi
If I understand you right, you want to make sure 'myscript.sh' is not running before cron attempts to run your command again. Assuming that's right, we check to see if pgrep failed to find myscript.sh in the processes list and if so we run the flock command again.
Perhaps something like this would work for you.
#!/bin/bash
RETVAL=0
lockfailed()
{
echo "cannot flock"
exit 1
}
(
flock -w 2 42 || lockfailed
false
RETVAL=$?
echo "original retval $RETVAL"
exit $RETVAL
) 42>|/tmp/flocker
RETVAL=$?
echo "returned $RETVAL"
exit $RETVAL
