Bash: why wait returns prematurely with code 145

This problem is very strange and I cannot find any documentation about it online. In the following code snippet I am merely trying to run a bunch of sub-processes in parallel, printing something when they exit and collecting/printing their exit codes at the end. I find that without catching SIGCHLD things work as I would expect; however, things break when I catch the signal. Here is the code:
#!/bin/bash
#enabling job control
set -m
cmd_array=( "$@" ) #array of commands to run in parallel
cmd_count=$# #number of commands to run
cmd_idx=0 #current index of command
cmd_pids=() #array of child proc pids
trap 'echo "Child job exited"' SIGCHLD #setting up signal handler on SIGCHLD
#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
    cmd=${cmd_array[$cmd_idx]} #retrieving the job command as a string
    eval "$cmd" &
    cmd_pids[$cmd_idx]=$! #keeping track of the job pid
    echo "Job #$cmd_idx launched '$cmd'"
    (( cmd_idx++ ))
done
#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[@]}"; do
    wait $pid
    child_exit_code=$?
    if [ $child_exit_code -ne 0 ]; then
        echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
    fi
    (( idx++ ))
done
You can tell something is wrong when you run the script with the following command:
./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"
The interesting thing is that as soon as the signal handler is called (when sleep 3 is done), the wait (which is waiting on sleep 20) is interrupted right away with return code 145. I can tell the sleep 20 is still running even after the script is done.
I can't find any documentation about such a return code from wait. Can anyone shed some light on what is going on here?
(By the way, if I add a while loop around the wait and keep on waiting while the return code is 145, I actually get the result I expect.)
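For reference, the workaround from the last paragraph looks roughly like this (a sketch of the collection loop above, re-issuing wait while it reports 145):

wait $pid
child_exit_code=$?
#145 means wait was interrupted by the trapped SIGCHLD, not that the job failed
while [ $child_exit_code -eq 145 ]; do
    wait $pid
    child_exit_code=$?
done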

Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:
#!/bin/bash
set -m
trap "echo child_exit" SIGCHLD
function test() {
    sleep $1
    echo "'sleep $1' just returned now"
}
echo sleeping for 6 seconds in the background
test 6 &
pid=$!
echo sleeping for 2 seconds in the background
test 2 &
echo waiting on the 6 second sleep
wait $pid
echo "wait return code: $?"
If you run this you will get the following output:
linux:~$ bash test2.sh
sleeping for 6 seconds in the background
sleeping for 2 seconds in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
linux:~$ 'sleep 6' just returned now
Explanation:
As @muru pointed out, "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (c.f. Bash manual on Exit Status).
Now what misled me here is the "fatal" signal. I was looking for a command to fail somewhere, when nothing did.
Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
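You can check the arithmetic directly (assuming Linux signal numbering, where SIGCHLD is 17):

$ kill -l SIGCHLD
17
$ echo $((128 + 17))
145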
So there you have it, what happens in the script above is the following:
sleep 6 starts in the background
sleep 2 starts in the background
wait starts waiting on sleep 6
sleep 2 terminates and the SIGCHLD trap is fired, interrupting wait, which returns 128 + 17 (SIGCHLD) = 145
my script exits since it is no longer waiting
the background sleep 6 terminates, hence the "'sleep 6' just returned now" after the script has already exited

Related

Why can't I exit from an exit trap when I'm inside of a function in ZSH, unless I'm in a loop?

I'm really trying to understand the difference in how ZSH and Bash handle signal traps, but I'm having a very hard time grasping why ZSH does what it does.
In short, I'm not able to exit a script with exit in ZSH from within a trap if the execution point is within a function, unless it's also within a loop.
Here is an example of how exit in a trap action behaves in the global / file level scope.
#!/bin/zsh
trap 'echo "Trap SIGINT" ; exit 130' SIGINT
sleep 1
echo "1"
sleep 1
echo "2"
sleep 1
echo "3"
If I call the script, I can send an INT signal by pressing Ctrl+C at any time to echo "Trap SIGINT" and exit the script immediately.
If I hit Ctrl+C after I see the first 1, the output looks like this:
$ ./foobar
1
^CTrap SIGINT
But if I wrap the code in a function, then the trap doesn't stop script execution until the function finishes: exit 130 in the trap action simply resumes the code from the execution point within the function.
Here is an example of how using trap behaves in the function level scope.
#!/bin/zsh
trap 'echo "Trap SIGINT" ; exit 130' SIGINT
foobar() {
    sleep 1
    echo "1"
    sleep 1
    echo "2"
    sleep 1
    echo "3"
}
foobar
echo "Finished"
If I call the script, the only thing an INT signal does is end the current sleep command early. The script just keeps going from the same execution point after that.
If I hit Ctrl+C repeatedly, the output looks like this:
$ ./foobar
^CTrap SIGINT
1
^CTrap SIGINT
2
^CTrap SIGINT
3
It doesn't echo "Finished" at the end, so the script does exit when the function finishes, but I can't seem to exit before then.
It doesn't make a difference if I set the trap in the global / file scope or from within the function.
If I change exit 130 to return 130, then it will jump out of that function early but continue script execution. This is expected behavior from what I could read in the ZSH documentation.
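For comparison, the return variant described above can be sketched like this (my minimal illustration of the behavior just stated, not code from the question):

#!/bin/zsh
trap 'echo "Trap SIGINT" ; return 130' SIGINT
foobar() {
    sleep 1
    echo "1"
    sleep 1
    echo "2"
}
foobar            # Ctrl+C here returns from foobar early...
echo "Finished"   # ...but this line is still reached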
If I wrap the code inside of a for or while loop as shown in the code below, the code then has no problem breaking out of the loop.
#!/bin/zsh
trap 'echo "Trap SIGINT" ; exit 130' SIGINT
foobar() {
    for i in 1; do
        sleep 1
        echo "1"
        sleep 1
        echo "2"
        sleep 1
        echo "3"
    done
    sleep 1
    echo "Outside of loop"
}
foobar
echo "Finished"
Even if I have the loop in the global / file scope and call foobar from within the loop, it still has no problem exiting from the trap action.
The one thing that does work correctly is defining a TRAPINT function instead of using the trap builtin, and returning a non-zero code from that function. However, exiting from the TRAPINT function works the same way it does with the trap builtin.
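For reference, the TRAPINT form mentioned above looks like this (a sketch following the zsh manual: the signal number arrives as $1, and returning 128 plus that number makes the shell behave as if it had been interrupted):

TRAPINT() {
    echo "Trap SIGINT"
    return $(( 128 + $1 ))
}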
I've tried to find out why it acts like this, but I couldn't find anything.
So what's actually happening here? Why is ZSH not letting me exit from the trap action when the execution point is inside a function?
One way to make this work as expected is setting the ERR_EXIT option.
From the documentation:
If a command has a non-zero exit status, execute the ZERR trap, if set, and exit. This is disabled while running initialization scripts.
There's also ERR_RETURN:
If a command has a non-zero exit status, return immediately from the enclosing function. The logic is similar to that for ERR_EXIT, except that an implicit return statement is executed instead of an exit. This will trigger an exit at the outermost level of a non-interactive script.
Both options have some caveats and notes; refer to the documentation.
Adding setopt localoptions err_exit as the first line of the foobar function in your script (you probably don't want to do this globally) causes:
$ ./foobar
1
^CTrap SIGINT
$
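Concretely, the change amounts to this (a sketch of the placement described above):

foobar() {
    setopt localoptions err_exit
    sleep 1
    echo "1"
    sleep 1
    echo "2"
    sleep 1
    echo "3"
}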
Now, the interesting bit. In your demonstration script, if you change your exit value from 130 to some other number, and the echo lines to echo "1 - $?" etc., you get:
$ ./foobar
1 - 0
2 - 0
^CTrap SIGINT
3 - 130
The sleep is still exiting with 130, the normal value for a process killed by a SIGINT. What happened to your exit in the trap and its value? Not a clue (I'll update the answer if I figure it out).
I'd just stick with the TRAPnal functions when writing zsh scripts that care about signals.

Prevent CTRL+C being sent to process called from a Bash script

Here is a simplified version of some code I am working on:
#!/bin/bash
term() {
    echo ctrl c pressed!
    # perform cleanup - don't exit immediately
}
trap term SIGINT
sleep 100 &
wait $!
As you can see, I would like to trap CTRL+C / SIGINT and handle these with a custom function to perform some cleanup operation, rather than exiting immediately.
However, upon pressing CTRL+C, what actually seems to happen is that, while I see ctrl c pressed! echoed as expected, the wait command is also killed, which I would not like to happen (part of my cleanup operation kills sleep a bit later, but first does some other things). Is there a way I can prevent this, i.e. stop the CTRL+C from being sent to the wait command?
You can prevent a process called from a Bash script from receiving SIGINT by first ignoring the signal with trap:
#!/bin/bash
# Cannot be interrupted
( trap '' INT; exec sleep 10; )
However, since only a parent process can wait for its children, wait is a shell builtin rather than a separate process, so this approach doesn't apply here.
Instead, just restart the wait after it gets interrupted:
#!/bin/bash
n=0
term() {
    echo "ctrl c pressed!"
    n=$((n+1))
}
trap term INT
sleep 100 &
while
    wait "$!"
    [ "$?" -eq 130 ] # SIGINT results in exit code 128+2
do
    if [ "$n" -ge 3 ]
    then
        echo "Jeez, fine"
        exit 1
    fi
done
I ended up using a modified version of what @thatotherguy suggested:
#!/bin/bash
term() {
    echo ctrl c pressed!
    # perform cleanup - don't exit immediately
}
trap term SIGINT
sleep 100 &
pid=$!
while ps -p $pid > /dev/null; do
    wait $pid
done
This checks if the process is still running and, if so, runs wait again.
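If you also need the child's exit status, here is a variant (my own sketch, not from the original answers; note that a status above 128 can also mean the child itself was killed by a signal):

#!/bin/bash
term() {
    echo ctrl c pressed!
}
trap term SIGINT
sleep 100 &
pid=$!
wait $pid; status=$?
# kill -0 succeeds while the child exists (even as a zombie),
# so an interrupted wait simply loops around and reaps it later
while kill -0 $pid 2> /dev/null; do
    wait $pid
    status=$?
done
echo "child exited with status $status"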

Kill not killing process if exiting properly

I have a simple bash script which I have written to simplify some work I am doing. All it needs to do is start one process, process_1, as a background process, then start another, process_2. Once process_2 is finished I need to terminate process_1.
process_1 starts a program which does not actually stop unless it receives a kill signal, or CTRL+C when I run it myself. The program's output goes into a file via {program} {args} > output_file
process_2 can take an arbitrary amount of time depending on the arguments it is given.
Code:
#!/bin/bash
#Call this on exit to kill all background processes
function killJobs () {
    #Check process is still running before killing
    if kill -0 "$PID"; then
        kill $PID
    fi
}
...Check given arguments are valid...
#Start process_1
eval "./process_1 ${Arg1} ${Arg2} ${Arg3}" &
PID=$!
#Lay a trap to catch any exits from script
trap killJobs TERM INT
#Start process_2 - sleep for 5 seconds before and after
#Need space between process_1 and process_2 starting and stopping
sleep 5
eval "./process_2 ${Arg1} ${Arg2} ${Arg3} ${Arg4} 2> ${output_file}"
sleep 5
#Make sure background job is killed on exit
killJobs
I check whether process_1 has been terminated by checking if its output file is still being updated after my script has ended.
If I run the script and then press CTRL+C the script is terminated and process_1 is also killed, the output file is no longer updated.
If I let the script run to completion without my intervention, process_2 and the script both terminate, but when I check the output from process_1 it is still being updated.
To check this I put an echo statement just after process_1 is started and another within the if statement of killJobs, so it would only be echoed if kill $PID is called.
Doing this I can see that both ways of exiting start process_1 and then also enter the if statement to kill it. Yet kill does not actually kill the process in the case of a normal exit. No error messages are produced either.
You're backgrounding the eval instead of process_1 itself, so $! holds the PID of the subshell running the eval, not the PID of process_1. Run the command directly instead:
#!/bin/bash
#Call this on exit to kill all background processes
function killJobs () {
    #Check process is still running before killing
    if kill -0 "$PID"; then
        kill $PID
    fi
}
...Check given arguments are valid...
#Start process_1
./process_1 ${Arg1} ${Arg2} ${Arg3} &
PID=$!
#Lay a trap to catch any exits from script
trap killJobs TERM INT
#Start process_2 - sleep for 5 seconds before and after
#Need space between process_1 and process_2 starting and stopping
sleep 5
./process_2 ${Arg1} ${Arg2} ${Arg3} ${Arg4} 2> ${output_file}
sleep 5
#Make sure background job is killed on exit
killJobs
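One further hardening idea (my suggestion, not part of the original answer): also trap EXIT, so the cleanup runs on every exit path and the final killJobs call becomes a safety net rather than the only cleanup. The kill -0 guard already makes a double invocation harmless:

trap killJobs EXIT TERM INT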

limit spawned parallel processes and exit all upon failure of any

I'm running some tests in parallel by calling a process from a script. Each process prints only to stdout > a file, and exits 0 iff successful (otherwise -1, which the shell reports as 255).
If and when a process exits with -1, I print something to its (or a related) output file (namely, the arguments it was called with), kill all other processes, and exit.
I have written a script using trap "..." CHLD to run some code when a subprocess exits, and this works under certain conditions, but I find my script is not very robust. If I send a keyboard interrupt, sometimes the subprocesses keep going, and sometimes the number of subprocesses simply overwhelms the machine(s) and none of them seem to advance.
I am using this on my quad core laptop as well as a cluster of 128 CPUs, over which subprocesses are distributed automatically. How do I run a large number of background subprocesses in a bash script, limited to some number of them running concurrently, and do something + exit if one of them returns with a bad code? I would also like the script to clean up after keyboard interrupt. Should I use GNU-parallel? how?
Here is an MWE of my script so far, which spawns subprocesses unhindered, annotated with what I think each part means. I got the idea to use trap from shell - get exit code of background process.
$ cat parallel_tests.sh
#!/bin/bash
# some help from https://stackoverflow.com/questions/1570262/shell-get-exit-code-of-background-process
handle_chld() {
    #echo pids are ${pids[@]}
    local tmp=() ### temporary storage for pids that haven't finished
    #for each pid that hadn't finished since the last trap
    for ((i=0; i<${#pids[@]}; ++i)); do
        #if this pid is still running
        if [[ $(ps -p ${pids[i]} -o pid=) ]]
        then
            tmp+=(${pids[i]}) ### add pid to list of pids that are running
        else
            wait ${pids[i]} ### put the exit code of this pid into $?
            if [ "$?" != "0" ] ### if the exit code $? is non-zero
            then
                #kill all remaining processes
                for ((j=0; j<${#pids[@]}; ++j))
                do
                    if [[ $(ps -p ${pids[j]} -o pid=) ]]
                    then
                        echo killing child processes of ${pids[j]}
                        pkill -P ${pids[j]}
                    fi
                done
                cat _tmp${pids[i]}
                #print things to the terminal here
                echo "FAILED process ${pids[i]} args: `cat _tmpargs${pids[i]}`"
                exit 1
            else
                echo "FINISHED process ${pids[i]} args: `cat _tmpargs${pids[i]}`"
            fi
        fi
    done
    #update list of running pids
    pids=(${tmp[@]})
}
# set this to monitor SIGCHLD
set -o monitor
# call handle_chld() when SIGCHLD signal is triggered
trap "handle_chld" CHLD
ALL_ARGS="2 32 87" ### ad nauseam
for A in $ALL_ARGS; do
    (sleep $A; false) > _tmp$! &
    pids+=($!)
    echo $A > _tmpargs${pids[${#pids[@]}-1]}
    echo "STARTED process ${pids[${#pids[@]}-1]} args: `cat _tmpargs${pids[${#pids[@]}-1]}`"
done
echo "Every process started. Now waiting on PIDS:"
echo ${pids[@]}
wait ${pids[@]} ### wait until every process is finished (or exit in the trap)
The output of this version after 2+epsilon seconds is:
$ ./parallel_tests.sh
STARTED process 66369 args: 2
STARTED process 66374 args: 32
STARTED process 66381 args: 87
Every process started. Now waiting on PIDS:
66369 66374 66381
killing child processes of 66374
./parallel_tests.sh: line 43: 66376 Terminated: 15 sleep $A
killing child processes of 66381
./parallel_tests.sh: line 43: 66383 Terminated: 15 sleep $A
FAILED process 66369 args: 2
Essentially, pid 66369 fails first, and the other two processes are dealt with in the trap. I have simplified the construction of the test processes here, so we can't assume that I'll manually insert waits before spawning new ones. Additionally, some of the test processes can be nearly instant. Essentially, I have a whole mess of test processes, long and short, starting as soon as resources can be allotted.
I'm not sure what's causing the problems I mentioned above, as this script uses several features that are new to me. General pointers are welcomed!
(I have seen this question and it does not answer my question)
cat arguments | parallel --halt now,fail=1 my_prg
Alternatively:
parallel --halt now,fail=1 my_prg ::: $ALL_ARGS
GNU Parallel is designed so it will also kill remote jobs. It does that using process groups and heavy perl scripting on the remote server: https://www.gnu.org/software/parallel/parallel_design.html#The-remote-system-wrapper
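If you'd rather stay in pure bash, the throttling and fail-fast behaviour can be sketched with wait -n (assuming bash >= 4.3; my_prg, out_, and max_jobs are placeholders of mine, not names from the question):

#!/bin/bash
max_jobs=4
fail() {
    echo "a job failed; killing the rest" >&2
    kill $(jobs -pr) 2> /dev/null
    exit 1
}
for A in $ALL_ARGS; do
    # throttle: once max_jobs are running, reap one before launching more
    while (( $(jobs -pr | wc -l) >= max_jobs )); do
        wait -n || fail
    done
    my_prg "$A" > "out_$A" &
done
# drain: wait -n returns 127 once no children remain
while true; do
    wait -n
    rc=$?
    (( rc == 127 )) && break
    (( rc != 0 )) && fail
done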

inconsistent signal behavior? Only works for the first signal?

Trying to have a script that is able to restart itself with exec (so it can pick up any "upgrade"), given a specific signal (I tried SIGHUP & SIGUSR1).
This seems to work the first time, but not the second, even though the registration (trap) does recur in the exec'ed instance (which is still the same PID).
#!/usr/bin/env bash
set -x
readonly PROGNAME="${0}"
function run_prog()
{
    echo hi
    sleep 2
    echo ho
    sleep 1000 &
    wait $!
}
restart()
{
    sleep 5
    exec "${PROGNAME}"
}
trap restart USR1
echo -e "TRAPS:"
trap
echo
run_prog
This is how I run it:
./tst.sh & TSTPID=$! # Starts ok, see both "hi" & "ho" messages
sleep 10
kill -USR1 ${TSTPID} # Restarts ok, see both "hi" & "ho" messages
sleep 10
kill -USR1 ${TSTPID} # NOTHING HAPPENS
sleep 5
kill ${TSTPID}
Any idea why the second signal is ignored? (Some of the code, like de-registering the trap in the cleanup, may just be paranoia.)
Maybe it's because you're exec'ing from inside a signal handler: due to the exec, the signal-handling code never finishes running, which may prevent other cleanup code or daisy-chained handlers from executing.
Who knows what's going on in the black box of the OS signal-handling code, and in bash's own layering over it, that might be circumvented by exec. exec is a very draconian measure :-)
Also check out this cool bash site. I'm looking for the bash source code that handles signals. Just curious.
Your solution here is the right approach:
#!/usr/bin/env bash
set -x
readonly PROGNAME="${0}"
DO_RESTART=
function run_prog()
{
    echo hi
    sleep 2
    echo ho
    sleep 1000 &
    SLEEPPID=$!
    #builtin
    wait ${SLEEPPID}
}
trap DO_RESTART=1 SIGUSR1
echo -e "TRAPS:"
trap -p
echo
run_prog
if [ -n "${DO_RESTART}" ]; then
    sleep 5
    kill ${SLEEPPID}
    exec "${PROGNAME}"
fi
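The crucial difference from the original script: the trap action only sets a flag. The signal still interrupts the wait inside run_prog, but the exec now happens in the script's main flow rather than inside the signal handler, so the freshly exec'ed instance starts with a clean slate and can handle the next signal.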
