Why does asynchronous child become a zombie although parent waits for it? - bash

I use the following code to start some long running task asynchronously but detect if it fails at the very beginning:
sleep 0.3 &
long_running &
wait -n
# [Error handling]
# Do other stuff.
# Wait for completion of 'long_running'.
wait -n
# [Error handling]
If I send SIGINT (using Ctrl+C) to the script while it is waiting for the long-running child, the long-running task just continues and becomes a zombie after completion.
Furthermore, the parent script consumes full CPU. I have to SIGKILL the parent to get rid of the processes.
I know that SIGINT is ignored by the child (which is probably the reason it continues till completion), but why does the parent get into such a confusing state?
It works (as expected) if I kill the child when SIGINT has been received (the commented trap below), but I want to understand why it does not work the other way.
Below is the complete script. Please refer also to https://gist.github.com/doak/08b69c500c91a7fade9f2c61882c93b4 for an even more complete example/try-out:
#!/usr/bin/env bash
count="count=100000" # Adapt that 'dd' lasts about 3s. Comment out to run forever.
#fail=YES # Demonstrates failure of background task.
# This would work.
#trap "jobs -p | xargs kill" SIGINT
echo executing long running asynchronous task ...
sleep 0.3 &
dd if=/dev/zero$fail of=/dev/null bs=1M $count &
wait -n
errcode=$?
if test $errcode -ne 0; then
echo "failed"
exit $errcode
fi
echo waiting for completion ...
wait -n
errcode=$?
echo finished
exit $errcode
It could be that my question is related to this C question, although it discusses the system call wait(): Possible for parent process to HANG on "wait" step if child process becomes a ZOMBIE or CRASHES?

Related

Why doesn't bash script wait for its child processes to finish before exiting the parent script on receiving Sigterm?

trap exit_gracefully TERM
exit_gracefully() {
echo "start.sh got SIGTERM"
echo "Sending TERM to child_process_1_pid: ${child_process_1_pid}"
echo "Sending TERM to child_process_2_pid: ${child_process_2_pid}"
echo "Sending TERM to child_process_3_pid: ${child_process_3_pid}"
kill -TERM ${child_process_1_pid} ${child_process_2_pid} ${child_process_3_pid}
}
consul watch -http-addr=${hostIP}:8500 -type=key -key=${consul_kv_key} /child_process_1.sh 2>&1 &
child_process_1_pid=$!
/child_process_2.sh &
child_process_2_pid=$!
/child_process_3.sh &
child_process_3_pid=$!
/healthcheck.sh &
/configure.sh
# sleep 36500d &
# wait $!
wait ${child_process_1_pid} ${child_process_2_pid} ${child_process_3_pid}
echo 'start.sh exiting'
start.sh is the parent script. When SIGTERM is trapped, it is forwarded to 3 of its child processes. If the lines sleep 36500d & and wait $! are commented out (removed from the code), start.sh does not wait for child_process_1.sh, child_process_2.sh and child_process_3.sh to receive SIGTERM, handle it and exit before the parent process (start.sh) exits. Instead, start.sh exits immediately on receiving SIGTERM, even before the child processes can handle it. But if I keep sleep 36500d & and wait $! uncommented in the code, the parent process (start.sh) waits for the child processes (1, 2 and 3) to receive and handle SIGTERM and exit before exiting itself.
Why does this difference exist even though I wait for 3 pids (of child processes) in either case? Why should I need sleep when I am waiting for 3 pids?
Receiving a signal will cause any wait command in progress to return.
This is because the purpose of a signal is to interrupt a process in whatever it's currently doing.
All the effects you see are simply the result of the current wait returning, the handler running, and the script continuing from where the wait exited.
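One way to act on this, sketched under the assumption that a trapped signal should not end the waiting for good, is to call wait again for as long as the child is still alive. The helper name wait_until_done, the use of USR1 and the short sleeps are illustrative stand-ins, not part of the script in the question.

```bash
#!/usr/bin/env bash
# Hypothetical helper: re-issue 'wait' until the given child has really exited.
wait_until_done() {
    local pid=$1 status
    while :; do
        wait "$pid"
        status=$?
        # A trapped signal makes 'wait' return early while the child is
        # still running; in that case go back to waiting.
        kill -0 "$pid" 2>/dev/null || break
    done
    return "$status"
}

trap 'echo "got USR1, child still running"' USR1

( sleep 1; exit 7 ) &              # stand-in for a long-running child
child=$!
( sleep 0.3; kill -USR1 $$ ) &     # interrupt the first 'wait' from outside

wait_until_done "$child"
child_status=$?
echo "child exited with status $child_status"
```

There is a small window in this sketch: if the child exits right after an interrupted wait but before the kill -0 check, the helper may return the signal status instead of the child's exit code.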

Prevent SIGINT from interrupting current task while still passing information about SIGINT (and preserve the exit code)

I have a quite long shell script and I'm trying to add signal handling to it.
The main task of the script is to run various programs and then clean up their temporary files.
I want to trap SIGINT.
When the signal is caught, the script should wait for the current program to finish execution, then do the cleanup and exit.
Here is an MCVE:
#!/bin/sh
stop_this=0
trap 'stop_this=1' 2
while true ; do
result="$(sleep 2 ; echo success)" # run some program
echo "result: '$result'"
echo "Cleaning up..." # clean up temporary files
if [ $stop_this -ne 0 ] ; then
echo 'OK, time to stop this.'
break
fi
done
exit 0
The expected result:
Cleaning up...
result: 'success'
Cleaning up...
^Cresult: 'success'
Cleaning up...
OK, time to stop this.
The actual result:
Cleaning up...
result: 'success'
Cleaning up...
^Cresult: ''
Cleaning up...
OK, time to stop this.
The problem is that the currently running instruction (result="$(sleep 2 ; echo success)" in this case) is interrupted.
What can I do so that it behaves more as if I had set trap '' 2?
I'm looking for either a POSIX solution or one that is supported by most of shell interpreters (BusyBox, dash, Cygwin...)
I already saw the answers for Prevent SIGINT from closing child process in bash script, but they don't really work for me. All of these solutions require modifying each line that shouldn't be interrupted. My real script is quite long and much more complicated than the example. I would have to modify hundreds of lines.
You need to prevent the SIGINT from going to the echo in the first place (or rewrite the cmd that you are running in the variable assignment to ignore SIGINT). Also, you need to allow the variable assignment to happen, and it appears that the shell is aborting the assignment when it receives the SIGINT. If you're only worried about user generated SIGINT from the tty, you need to disassociate that command from the tty (eg, get it out of the foreground process group) and prevent the SIGINT from aborting the assignment. You can (almost) accomplish both of those with:
#!/bin/sh
stop_this=0
while true ; do
trap 'stop_this=1' INT
{ sleep 1; echo success > tmpfile; } & # run some program
while ! wait; do : ; done
trap : INT
result=$(cat tmpfile& wait)
echo "result: '$result'"
echo "Cleaning up..." # clean up temporary files
if [ $stop_this -ne 0 ] ; then
echo 'OK, time to stop this.'
break
fi
done
exit 0
If you're worried about SIGINT from another source, you'll have to re-implement sleep (or whatever command I presume sleep is a proxy for) to handle SIGINT the way you want. The key here is to run the command in the background and wait for it, which prevents the SIGINT from going to it and terminating it early. Note that we've opened at least 2 new cans of worms here. By waiting in a loop, we're effectively ignoring any errors that the subcommand might raise (we're doing this to try to emulate SA_RESTART), so the script may potentially hang. Also, if the SIGINT arrives during the cat, we have attempted to prevent the cat from aborting by running it in the background, but now the variable assignment will be terminated and you'll get your original behavior. Signal handling is not clean in the shell! But this gets you closer to your desired goal.
Sighandling in shell scripts can get clumsy. It's pretty much impossible to
do it "right" without the support of C.
The problem with:
result="$(sleep 2 ; echo success)" # run some program
is that $() creates a subshell and in subshells, non-ignored (trap '' SIGNAL is how you ignore SIGNAL)
signals are reset to their default dispositions which for SIGINT is to terminate the process
($( ) gets its own process, though it will receive the signal too because the terminal-generated SIGINT
is process-group targeted)
To prevent this, you could do something like:
result="$(
trap '' INT #ignore; could get killed right before the trap command
sleep 2; echo success)"
or
result="$( trap : INT; #no-op handler; same problem
sleep 2; while ! echo success; do :; done)"
but as noted, there will be a small race-condition window between the start of the
subshell and the registration of the signal handler during which
the subshell could get killed by the reset-to-default SIGINT signal.
Both answers from @PSkocik and @WilliamPursell have helped me to get on the right track.
I have a fully working solution. It isn't pretty because it needs an external file to indicate that the signal hasn't occurred, but besides that it should work reliably.
#!/bin/sh
touch ./continue
trap 'rm -f ./continue' 2
( # the whole main body of the script is in a separate background process
trap '' 2 # ignore SIGINT
while true ; do
result="$(sleep 2 ; echo success)" # run some program
echo "result: '$result'"
echo "Cleaning up..." # clean up temporary files
if [ ! -e ./continue ] ; then # exit the loop if file "./continue" is deleted
echo 'OK, time to stop this.'
break
fi
done
) & # end of the main body of the script
while ! wait ; do : ; done # wait for the background process to end (ignore signals)
wait $! # wait again to get the exit code
result=$? # exit code of the background process
rm -f ./continue # clean up if the background process ended without a signal
exit $result
EDIT: There are some problems with this code in Cygwin.
The main functionality regarding signals works.
However, it seems that the finished background process doesn't stay in the system as a zombie, so wait $! does not work. The exit code of the script gets the incorrect value 127.
A solution would be to remove the lines wait $! and result=$? so the script always returns 0.
It should also be possible to keep the proper error code by using another layer of subshell and temporarily storing the exit code in a file.
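The idea of passing the exit code through a file could look roughly like the following. This is a simplified sketch, not a tested Cygwin fix: the function main_body, the status 3 and the mktemp file are made-up stand-ins, and the status is written from the same background subshell rather than from an extra layer.

```bash
#!/usr/bin/env bash
# Sketch: hand the exit code back through a file instead of 'wait $!'.
status_file=$(mktemp)              # hypothetical temp file for the exit code

main_body() {                      # stand-in for the real script body
    sleep 0.2
    return 3
}

(
    trap '' INT                    # ignore SIGINT, as in the script above
    main_body
    echo "$?" > "$status_file"     # record the exit code before exiting
) &

while ! wait; do :; done           # keep waiting even if a signal interrupts
result=$(cat "$status_file")
rm -f "$status_file"
echo "background body exited with status $result"
```

A real script would end with exit "$result"; that line is left out here so the snippet can be pasted into a larger script.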
For disallowing interrupting the program:
trap "" ERR HUP INT QUIT TERM TSTP TTIN TTOU
But if a sub-command handles traps by itself, and that command must really complete, you need to prevent passing signals to it.
For people on Linux that don't mind installing extra commands, you can just use:
waitFor [command]
Alternatively you can adapt the latest source code of waitFor into your program as needed, or use the code from Gilles' answer. Although that has the disadvantage of not benefiting from updates upstream.
Just mind that other terminals and the service manager can still terminate "command". If you want the service manager to be unable to close "command", it shall be run as a service with the appropriate kill mode and kill signal set.
You may want to adapt the following:
#!/bin/sh
tmpfile=".tmpfile"
rm -f $tmpfile
trap : INT
# put the action that should not be interrupted in the innermost brackets
# | |
( set -m; (sleep 10; echo success > $tmpfile) & wait ) &
wait # wait will be interrupted by Ctrl+c
while [ ! -r $tmpfile ]; do
echo "waiting for $tmpfile"
sleep 1
done
result=`cat $tmpfile`
echo "result: '$result'"
This seems also to work with programs that install their own SIGINT handler like mpirun and mpiexec and so on.

WAIT for "1 of many process" to finish

Is there any built in feature in bash to wait for 1 out of many processes to finish? And then kill remaining processes?
pids=""
# Run five concurrent processes
for i in {1..5}; do
( longprocess ) &
# store PID of process
pids+=" $!"
done
if [ "one of them finished" ]; then
kill_rest_of_them;
fi
I'm looking for "one of them finished" command. Is there any?
bash 4.3 added a -n flag to the built-in wait command, which causes the script to wait for the next child to complete. The -p option to jobs also means you don't need to store the list of pids, as long as there aren't any background jobs that you don't want to wait on.
# Run five concurrent processes
for i in {1..5}; do
( longprocess ) &
done
wait -n
kill $(jobs -p)
Note that if there is another background job other than the 5 long processes that completes first, wait -n will exit when it completes. That would also mean you would still want to save the list of process ids to kill, rather than killing whatever jobs -p returns.
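A minimal sketch of that variant, which saves the PIDs and kills only those, might look like this. The sleep commands with different delays stand in for longprocess, and the array name pids is just a choice for the example; wait -n needs bash 4.3 or newer.

```bash
#!/usr/bin/env bash
# Requires bash 4.3 or newer for 'wait -n'.
pids=()
for delay in 0.3 5 5; do           # stand-ins for the long processes
    sleep "$delay" &
    pids+=("$!")
done

wait -n                            # returns when the first job finishes
first_status=$?

# Kill only the processes started above; the one that already
# finished makes 'kill' report an error, which is discarded here.
kill "${pids[@]}" 2>/dev/null
wait                               # reap the killed processes
echo "first finisher exited with $first_status; the rest were killed"
```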
It's actually fairly easy:
#!/bin/bash
set -o monitor
killAll()
{
# code to kill all child processes
}
# call function to kill all children on SIGCHLD from the first one
trap killAll SIGCHLD
# start your child processes here
# now wait for them to finish
wait
You just have to be really careful in your script to use only bash built-in commands. You can't start any utilities that run as a separate process after you issue the trap command - any child process exiting will send SIGCHLD - and you can't tell where it came from.

How To Check When Curl PID is Done from Bash Loop

I can't figure out my bug on OSX. When I try to see when Curl is finished, the process remains loaded. I never see the CURL FINISHED message.
#!/bin/bash
curl -S -o example.com http://example.com/downloads/example.zip &
CURL_PID=$!
echo -e "CURL PID = $CURL_PID"
while :
do
sleep 1
if [ -n $(ps -p$CURL_PID -o pid=) ]; then
echo "CURL NOT FINISHED"
else
echo "CURL FINISHED"
break
fi
done
Note on OSX's version of Bash when I run this:
#!/bin/bash
PIDX=1
if [ -n $(ps -p$PIDX -o pid=) ]; then
echo "PROCESS 1 IS THERE"
else
echo "PROCESS 1 IS NOT THERE"
fi
...it says Process 1 is there. (Everyone has a PID 1, so this is just an example.) So, I know that my if statement is correct. No double quotes necessary on the if line.
Note that I can't use wait on the $CURL_PID because what you don't see here is that I am also using OSX's osascript command to show a dialog that says "Downloading...", which has a Cancel button on it and its own $DLG_PID. So I'm looping endlessly until either they cancel the dialog (meaning $DLG_PID is gone) or $CURL_PID is gone (meaning the download finally completed, so I can run kill $DLG_PID now).
On OSX, note I'm doing this as well before the curl statement.
osascript -e 'tell app "System Events" to display dialog "Downloading..." with title "My App Installer" buttons {"Cancel"}' &
So, if someone cancels the dialog, I kill the curl by PID and exit the infinite loop (and exit the bash script). If they don't cancel that dialog, and the curl finishes, then I kill the dialog by PID and exit the bash script.
Usually you'll use wait for that:
curl http://... &
do_something
wait
echo "CURL has finished"
The portable way for polling a backgrounded job is to use the kill builtin, and send the signal 0 to see if it's deliverable. kill -0 $pid (where $pid is the PID of a child process) will return zero if the child process is still running, and nonzero if it has already died. Note that this is safe and only safe (from PID recycling) for a child process (rather than some random process started elsewhere, with PID written to a PID file), for reasons outlined here:
Each UNIX process also has a parent process. This parent process is the process that started it, but can change to the init process if the parent process ends before the new process does. (That is, init will pick up orphaned processes.) Understanding this parent/child relationship is vital because it is the key to reliable process management in UNIX. A process's PID will NEVER be freed up for use after the process dies UNTIL the parent process waits for the PID to see whether it ended and retrieve its exit code. If the parent ends, the process is returned to init, which does this for you.
This is important for one major reason: if the parent process manages its child process, it can be absolutely certain that, even if the child process dies, no other new process can accidentally recycle the child process's PID until the parent process has waited for that PID and noticed the child died. This gives the parent process the guarantee that the PID it has for the child process will ALWAYS point to that child process, whether it is alive or a "zombie". Nobody else has that guarantee.
Of course, newer versions of OS X don't use init (in its place is launchd), but the principle is the same.
By the way, the whole page is worth a read: http://mywiki.wooledge.org/ProcessManagement.
In light of that, here's an example script that does what you want (it takes one URL argument — the URL to download). Bug me if something's unclear.
#!/usr/bin/env bash
osascript -e 'tell app "System Events" to display dialog "Downloading..." with title "Downloader" buttons {"Cancel"}' &>/dev/null &
dialog_pid=$!
curl -sSLO "$1" &
curl_pid=$!
timer=0
while kill -0 "$curl_pid" &>/dev/null; do
kill -0 "$dialog_pid" &>/dev/null || { echo "User cancelled download from dialog."; kill "$curl_pid" &>/dev/null; exit 1; }
sleep 1
(( timer++ ))
echo "Been downloading for $timer seconds..."
done
echo "Finished."
kill "$dialog_pid" &>/dev/null
wait &>/dev/null
Run it:
> ./download https://github.com/torvalds/linux/archive/v4.4-rc2.tar.gz
Been downloading for 1 seconds...
Been downloading for 2 seconds...
<omitted>
Been downloading for 38 seconds...
Finished.
Cancelling midway:
> ./download https://github.com/torvalds/linux/archive/v4.4-rc2.tar.gz
Been downloading for 1 seconds...
Been downloading for 2 seconds...
Been downloading for 3 seconds...
User cancelled download from dialog.
The ugly thing is that killing the PID of the osascript job doesn't dismiss the dialog box... Which I'm not in the position to solve because I absolutely dread AppleScript.
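A side note on the loop in the question, independent of the kill -0 approach: when the command substitution expands to nothing and is left unquoted, the test collapses to [ -n ], which is true because the single argument "-n" is a non-empty string. That would explain why CURL FINISHED is never printed. The small sketch below uses a variable named output as a stand-in for the empty ps result.

```bash
#!/usr/bin/env bash
output=""                          # stand-in for an empty $(ps -p ... -o pid=)

# Unquoted: the empty expansion vanishes and the test becomes [ -n ],
# a one-argument test that is true because the string "-n" is non-empty.
if [ -n $output ]; then unquoted=true; else unquoted=false; fi

# Quoted: the test receives an empty string and is false.
if [ -n "$output" ]; then quoted=true; else quoted=false; fi

echo "unquoted: $unquoted, quoted: $quoted"   # prints: unquoted: true, quoted: false
```

The PID 1 check in the question prints the "is there" branch either way, so it cannot show whether the quotes matter.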

How does trap / kill work in bash on Linux?

My sample file
traptest.sh:
#!/bin/bash
trap 'echo trapped' TERM
while :
do
sleep 1000
done
$ traptest.sh &
[1] 4280
$ kill %1 <-- kill by job number works
Terminated
trapped
$ traptest.sh &
[1] 4280
$ kill 4280 <-- kill by process id doesn't work?
(sound of crickets, process isn't killed)
If I remove the trap statement completely, kill process-id works again?
Running some RHEL 2.6.18-194.11.4.el5 at work. I am really confused by this behaviour, is it right?
kill [pid]
sends the TERM signal exclusively to the specified PID.
kill %1
sends the TERM signal to job #1's entire process group, in this case to the script's PID plus its children (sleep).
I've verified that with strace on the sleep process and on the script process.
Anyway, someone got a similar problem here (but with SIGINT instead of SIGTERM): http://www.vidarholen.net/contents/blog/?p=34.
Quoting the most important sentence:
kill -INT %1 sends the signal to the job’s process group, not the backgrounded pid!
This is expected behavior. Default signal sent by kill is SIGTERM, which you are catching by your trap. Consider this:
#!/bin/bash
# traptest.sh
trap "echo Booh!" SIGINT SIGTERM
echo "pid is $$"
while : # This is the same as "while true".
do
a=1
done
(sleep really creates a new process and the behavior is clearer with my example I guess).
So if you run traptest.sh in one terminal and kill TRAPTEST_PROCESS_ID from another terminal, output in the terminal running traptest will be Booh! as expected (and the process will NOT be killed). If you try sending kill -s HUP TRAPTEST_PROCESS_ID, it will kill the traptest process.
This should clear up the %1 confusion.
Note: the code example is taken from tldp
Davide Berra explained the difference between kill %<jobspec> and kill <PID>, but not how that difference results in what you observed. After all, Unix signal handlers should be called pretty much instantaneously, so why does sending a SIGTERM to the script alone not trigger its trap handler?
The bash man page explains why, in the last paragraph of the SIGNALS section:
If bash is waiting for a command to complete and receives a signal for
which a trap has been set, the trap will not be executed until the
command completes.
So, the signal was delivered immediately, but the handler execution was deferred until sleep exited.
Hence, with kill %<jobspec>:
Both the script and sleep received SIGTERM
bash registered the signal, noticed that a trap was set for it, and queued the handler for future execution
sleep exited immediately
bash noted sleep's exit, and ran the trap handler
whereas with kill <script_PID>:
Only the script received SIGTERM
bash registered the signal, noticed that a trap was set for it, and queued the handler for future execution
sleep exited after 1000 seconds
bash noted sleep's exit, and ran the trap handler
Obviously, you didn't wait long enough to see that last bit. :)
If you're interested in the gory details, download the bash source code and look in trap.c, specifically the trap_handler() and run_pending_traps() functions.
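If the trap should run right away instead of after sleep exits, a common workaround is to start the command in the background and wait for it, since a trapped signal makes the wait builtin return immediately. The following is a sketch under that assumption; the variable names and the self-sent TERM (standing in for kill <script_PID> from another terminal) are illustrative.

```bash
#!/usr/bin/env bash
trapped=no
trap 'trapped=yes; echo trapped' TERM

sleep 30 &                         # background instead of foreground
sleep_pid=$!
( sleep 0.3; kill -TERM $$ ) &     # stand-in for 'kill <script_PID>' from another terminal

wait "$sleep_pid"                  # a trapped signal makes 'wait' return at once
wait_status=$?                     # greater than 128 when a signal ended the wait
kill "$sleep_pid" 2>/dev/null      # the sleep itself did not get the signal
echo "wait returned $wait_status, trapped=$trapped"
```

Note that sleep keeps running after the trap fires, which is why the sketch kills it explicitly.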
