bash restart sub-process using trap SIGCHLD? - bash

I've seen monitoring programs either in scripts that check process status using 'ps' or 'service status(on Linux)' periodically, or in C/C++ that forks and wait on the process...
I wonder if it is possible to use bash with trap and restart the sub-process when SIGCLD received?
I have tested a basic suite on RedHat Linux with following idea (and certainly it didn't work...)
#!/bin/bash
set -o monitor # can someone explain this? discussion on Internet say this is needed
trap startProcess SIGCHLD
startProcess() {
/path/to/another/bash/script.sh & # the one to restart
while [ 1 ]
do
sleep 60
done
}
startProcess
what the bash script being started just sleep for a few seconds and exit for now.
several issues observed:
when the shell starts in foreground, SIGCHLD will be handled only once. does trap reset signal handling like signal()?
the script and its child seem to be immune to SIGINT, which means they cannot be stopped by ^C
since cannot be closed, I closed the terminal. The script seems to be HUP and many zombie children left.
when run in background, the script caused terminal to die
... anyway, this does not work at all. I have to say I know too little about this topic.
Can someone suggest or give some working examples?
Are there scripts for such use?
how about use wait in bash, then?
Thanks

I can try to answer some of your questions but not all based on what I
know.
The line set -o monitor (or equivalently, set -m) turns on job
control, which is only on by default for interactive shells. This seems
to be required for SIGCHLD to be sent. However, job control is more of
an interactive feature and not really meant to be used in shell scripts
(see also this question).
Also keep in mind this is probably not what you intended to do
because once you enable job control, SIGCHLD will be sent for every
external command that exists (e.g. every time you run ls or grep or
anything, a SIGCHLD will fire when that command completes and your trap
will run).
I suspect the reason the SIGCHLD trap only appears to run once is
because your trap handler contains a foreground infinite loop, so your
script gets stuck in the trap handler. There doesn't seem to be a point
to that loop anyways, so you could simply remove it.
The script's "immunity" to SIGINT seems to be an effect of enabling
job control (the monitor part). My hunch is with job control turned on,
the sub-instance of bash that runs your script no longer terminates
itself in response to a SIGINT but instead passes the SIGINT through to
its foreground child process. In your script, the ^C i.e. SIGINT
simply acts like a continue statement in other programming languages
case, since SIGINT will just kill the currently running sleep 60,
whereupon the while loop will immediately run a new sleep 60.
When I tried running your script and then killing it (from another
terminal), all I ended up with were two stray sleep processes.
Backgrounding that script also kills my shell for me, although
the behavior is not terribly consistent (sometimes it happens
immediately, other times not at all). It seems typing any keys other
than enter causes an EOF to get sent somehow. Even after the terminal
exits the script continues to run in the background. I have no idea
what is going on here.
Being more specific about what you want to accomplish would help. If
you just want a command to run continuously for the lifetime of your
script, you could run an infinite loop in the background, like
while true; do
some-command
echo some-command finished
echo restarting some-command ...
done &
Note the & after the done.
For other tasks, wait is probably a better idea than using job control
in a shell script. Again, it would depend on what exactly you are trying
to do.

Related

Why does bash "forget" about my background processes?

I have this code:
#!/bin/bash
pids=()
for i in $(seq 1 999); do
sleep 1 &
pids+=( "$!" )
done
for pid in "${pids[#]}"; do
wait "$pid"
done
I expect the following behavior:
spin through the first loop
wait about a second on the first pid
spin through the second loop
Instead, I get this error:
./foo.sh: line 8: wait: pid 24752 is not a child of this shell
(repeated 171 times with different pids)
If I run the script with shorter loop (50 instead of 999), then I get no errors.
What's going on?
Edit: I am using GNU bash 4.4.23 on Windows.
POSIX says:
The implementation need not retain more than the {CHILD_MAX} most recent entries in its list of known process IDs in the current shell execution environment.
{CHILD_MAX} here refers to the maximum number of simultaneous processes allowed per user. You can get the value of this limit using the getconf utility:
$ getconf CHILD_MAX
13195
Bash stores the statuses of at most twice as that many exited background processes in a circular buffer, and says not a child of this shell when you call wait on the PID of an old one that's been overwritten. You can see how it's implemented here.
The way you might reasonably expect this to work, as it would if you wrote a similar program in most other languages, is:
sleep is executed in the background via a fork+exec.
At some point, sleep exits leaving behind a zombie.
That zombie remains in place, holding its PID, until its parent calls wait to retrieve its exit code.
However, shells such as bash actually do this a little differently. They proactively reap their zombie children and store their exit codes in memory so that they can deallocate the system resources those processes were using. Then when you wait the shell just hands you whatever value is stored in memory, but the zombie could be long gone by then.
Now, because all of these exit statuses are being stored in memory, there is a practical limit to how many background processes can exit without you calling wait before you've filled up all the memory you have available for this in the shell. I expect that you're hitting this limit somewhere in the several hundreds of processes in your environment, while other users manage to make it into the several thousands in theirs. Regardless, the outcome is the same - eventually there's nowhere to store information about your children and so that information is lost.
I can reproduce on ArchLinux with docker run -ti --rm bash:5.0.18 bash -c 'pids=; for ((i=1;i<550;++i)); do true & pids+=" $!"; done; wait $pids' and any earlier. I can't reproduce with bash:5.1.0 .
What's going on?
It looks like a bug in your version of Bash. There were a couple of improvements in jobs.c and wait.def in Bash:5.1 and Make sure SIGCHLD is blocked in all cases where waitchld() is not called from a signal handler is mentioned in the changelog. From the look of it, it looks like an issue with handling a SIGCHLD signal while already handling another SIGCHLD signal.

Killing Subshell with SIGTERM

I'm sure this is really simple, but it's biting me in the face anyway, and I'm a little frustrated and stumped.
So, I have a script which I've managed to boil down to:
#!/bin/sh
sleep 50 | echo
If I run that at the command line, and hit Ctrl-C it stops, like I would expect.
If I send it sigint, using kill, it does nothing.
I thought that was strange, since I thought those should have been the same.
Then, if I send it sigterm, then it also dies, but if I look in ps, the sleep is still running.
What am I missing, here?
This is obviously not the real script, which runs python, and it's more of a problem when it keeps running after start-stop-daemon tries to kill the daemon.
Help me people. I'm dumb.
The reason this happens is that the Ctrl-C is delivered to the sleep process, whereas the sigint you are sending is delivered only to the script itself. See Child process receives parent's SIGINT for details on this.
You can verify this yourself by using strace -p when hitting ctrl-c or sending sigint; strace will tell you what signals are delivered.
EDIT: I don't think you are dumb. Processes and how they work are seemingly simple, but the details are often complicated, and even experts get confused by this sort of thing.
I did the same thing I written script named as test.sh with below containt.
#!/bin/sh
sleep 50 | echo
After executing , I did Ctrl-C -> its working fine means closing it.
Again executed and in another terminal i checked the PID by ps -ef|grep test.sh after finding the pid , i did kill <pid> and it killed the process , to verify again i executed ps -ef|grep test.sh and didnt get any pid.

Shell script termination notification

I have a requirement in shell script that in case if shell terminates for some reason I.e terminal closed or user terminated, I would like to do some work like clean up and unlocking etc. Is it possible? If so please explain how it can be done.
Thanks in advance,
Praveen P B
Yes it is possible. Check for signal using trap command. See help trap.
Example
cleanup() {
...
exit
}
trap cleanup SIGINT # user terminated
trap cleanup SIGHUP # terminal closed
trap cleanup SIGTERM # killed with kill -15
What #Grzegorz said, is correct, but note the following:
You didn't indicate the OS (and shell, but we assume sh), so to stay POSIXly correct you should use uppercase signal names (like SIGINT). Also see here.
Unless you want to let your script continue running, after receiving the signal, you should put exit as the last line in your "cleanup" function.
Don't forget to manually call your "cleanup" function before exiting the script by other means (i.e. before "falling off" the end or whenever you manually call exit). Alternatively register a handler with trap cleanup EXIT.
Note that you cannot catch SIGKILL, i.e. kill -9 <mypid> by this means (or any other means for that matter). So make sure you can somehow deal with the fact that cleanup may actually not happen. For example, check during your script initialization, whether old (orphaned) lock-files still exist and exit with an error, telling the user/admin to clean up manually. Now, differentiating between lock-files that legitimately exist and remains of a killed instance, might not be trivial and largely depends on your script. You could write the scripts PID into the lock-file, and then, if a lock-file exists when (another instance of) your scripts finds a lock-file make sure that the process with that PID still exists, if it does, chances are another script instance is truely running. Otherwise, manual cleanup might be necessary. But even that is not 100% fail safe.

How to kill all children of the current shell on interrupt?

My scripts cdist-deploy-to and cdist-mass-deploy (from cdist configuration management) run interactively (i.e. are called by a user).
These scripts call a lot of scripts, which again call some scripts:
cdist-mass-deploy ...
cdist-deploy-to ...
cdist-explorer-run-global ...
cdist-dir ....
What I want is to exit / kill all scripts, as soon as cdist-mass-deploy is either stopped by control C (SIGINT) or killed with SIGTERM.
cdist-deploy-to can also be called interactively and should exhibit the same behaviour.
Using ps -ef... and co variants to find out all processes with the ppid looks like it could be quite unportable. Using $! does not work as in the deeper levels the children are no background processes.
I tried using the following code:
__cdist_kill_on_interrupt()
{
__cdist_tmp_removal
kill 0
exit 1
}
trap __cdist_kill_on_interrupt INT TERM
But this leads to ugly Terminated messages as well as to a segfault in the shells (dash, bash, zsh) and seems not to stop everything instantly anyway:
# cdist-mass-deploy -p ikq04.ethz.ch ikq05.ethz.ch
core: Waiting for cdist-deploy-to jobs to finish
^CTerminated
Terminated
Terminated
Terminated
Segmentation fault
So the question is, how to cleanly exit including all (sub-)children in a portable manner (bourne shell, no csh support needed)?
You don't need to handle ^C, that will result in a signal being sent to the whole process group, which will kill all the processes that are not in the background. So you don't need to catch INT.
The only reason you get a Terminated when you kill them is that kill sends TERM by default, but that's reasonable if you are handling a TERM in the first place. You could use kill -INT 0 if you want to avoid the messages.
(responding with extra info)
If the child processes are run in the background, you can get their process ids just after you start them, using the $! special shell variable. Gather these together in a variable and just kill them all when you need to terminate.

Why do unix background processes sometimes die when I exit my shell?

I wanted to know why i am seeing a different behaviour in the background process in Bash shell
Case 1: Logged in to Unix server using Putty(SSH)
By default it uses csh shell
I changed to bash shell
typed sleep 2000 &
press enter
It gave me the job number. Now i killed my session by clicking the x in the putty window
Now open another session and tried to lookup the process..the process died.
Case 2:Case 1: Logged in to Unix server using Putty(SSH)
By default it uses csh shell
I changed to bash shell
vi mysleep.sh
sleep 2000 & Saved mysleep.sh
./mysleep.sh
Diff here is..instead of executing the sleep command directly i am storing the sleep command in a file and executing the file.
Now i killed my session by clicking the x in the putty window
Now open another session and tried to lookup the process..the process is still there
Not sure why this is happening. I thought i need to do disown in bash to run the process even after logging out.
One diff i see in the parent process id..In the second case..the parent process id for the sleep 2000 becomes 1. Looks like as soon as process for mysleep.sh died the kernel assigned the parent process to 1.
The difference here is indeed the intervening process.
When you close the terminal window, a HUP signal (related to "nohup" as an0nymo0usc0ward mentioned) is sent to the processes running in it. The default action on receiving HUP is to die - from the signal(3) manpage,
No Name Default Action Description
1 SIGHUP terminate process terminal line hangup
In your first example, the sleep process directly receives this HUP signal and dies because it isn't set to do anything else. (Some processes catch HUP and use it to perform some action, e.g. reread some configuration files)
In the second example, the shell process running your shell script has already died, so the sleep process never gets the signal. In UNIX, every process must have a parent process due to the internals of how the wait(2) family of calls works and indeed processes in general. So when the parent process dies, the kernel gives it to init (pid 1, as you note) as a foster child.
Orphan process (on wikipedia) has some more information available about it, also see Zombie process for some additional technical details.
Already running process?
^z
bg
disown %<jobid>
New process/script (on local machine's console)?
nohup script.sh &
New process/script (on remote machine's console)?
Depending on your need,
there are two options [ there will be more ;-) ]
ssh remotehost 'nohup /path/to/script.sh </dev/null > nohup.out 2>&1 &'
OR
use 'screen'
Try "nohup cmd args..."
Steven's answer is correct, but I'd like to highlight the tricky part here again:
=> Using a bash script that just executes sleep in the background
The effect of this is that the "script" exits almost immediately (since it's done all its commands). However, it did create a child process (sleep) during its lifetime. The effect of this is that:
The "script" cannot be the parent anymore, and sleep is orphaned to init (which shows nicely in a pstree)
The bash shell where you started the script from has no underlying jobs anymore
Note that this stuff all happens when you executed the script, and has nothing to do with any ssh logout/putty closing.
When you then finally close your putty session, bash receives a "SIGHUP", but doesn't forward it to any other process (since there are no jobs left)
In the other case, bash did still have a job left, which it then sent the SIGHUP to, causing it to end (as you noticed)
Hope this helps

Resources