Why do processes spawned by cron end up defunct? - bash

I have some processes showing up as <defunct> in top (and ps). I've boiled things down from the real scripts and programs.
In my crontab:
* * * * * /tmp/launcher.sh /tmp/tester.sh
The contents of launcher.sh (which is of course marked executable):
#!/bin/bash
# the real script does a little argument processing here
"$#"
The contents of tester.sh (which is of course marked executable):
#!/bin/bash
sleep 27 & # the real script launches a compiled C program in the background
ps shows the following:
user 24257 24256 0 18:32 ? 00:00:00 [launcher.sh] <defunct>
user 24259 1 0 18:32 ? 00:00:00 sleep 27
Note that tester.sh does not appear--it has exited after launching the background job.
Why does launcher.sh stick around, marked <defunct>? It only seems to do this when launched by cron--not when I run it myself.
Additional note: launcher.sh is a common script in the system this runs on, which is not easily modified. The other things (crontab, tester.sh, even the program that I run instead of sleep) can be modiified much more easily.

Because they haven't been the subject of a wait(2) system call.
Since someone may wait for these processes in the future, the kernel can't completely get rid of them or it won't be able to execute the wait system call because it won't have the exit status or evidence of its existence any more.
When you start one from the shell, your shell is trapping SIGCHLD and doing various wait operations anyway, so nothing stays defunct for long.
But cron isn't in a wait state, it is sleeping, so the defunct child may stick around for a while until cron wakes up.
Update: Responding to comment...
Hmm. I did manage to duplicate the issue:
PPID PID PGID SESS COMMAND
1 3562 3562 3562 cron
3562 1629 3562 3562 \_ cron
1629 1636 1636 1636 \_ sh <defunct>
1 1639 1636 1636 sleep
So, what happened was, I think:
cron forks and cron child starts shell
shell (1636) starts sid and pgid 1636 and starts sleep
shell exits, SIGCHLD sent to cron 3562
signal is ignored or mishandled
shell turns zombie. Note that sleep is reparented to init, so when the sleep exits init will get the signal and clean up. I'm still trying to figure out when the zombie gets reaped. Probably with no active children cron 1629 figures out it can exit, at that point the zombie will be reparented to init and get reaped. So now we wonder about the missing SIGCHLD that cron should have processed.It isn't necessarily vixie cron's fault. As you can see here, libdaemon installs a SIGCHLD handler during daemon_fork(), and this could interfere with signal delivery on a quick exit by intermediate 1629Now, I don't even know if vixie cron on my Ubuntu system is even built with libdaemon, but at least I have a new theory. :-)

to my opinion it's caused by process CROND (spawned by crond for every task) waiting for input on stdin which is piped to the stdout/stderr of the command in the crontab. This is done because cron is able to send resulting output via mail to the user.
So CROND is waiting for EOF till the user command and all it's spawned child processes have closed the pipe. If this is done CROND continues with the wait-statement and then the defunct user command disappears.
So I think you have to explicitly disconnect every spawned subprocess in your script form the pipe (e.g. by redirecting it to a file or /dev/null.
so the following line should work in crontab :
* * * * * ( /tmp/launcher.sh /tmp/tester.sh &>/dev/null & )

I suspect that cron is waiting for all subprocesses in the session to terminate. See wait(2) with respect to negative pid arguments. You can see the SESS with:
ps faxo stat,euid,ruid,tty,tpgid,sess,pgrp,ppid,pid,pcpu,comm
Here's what I see (edited):
STAT EUID RUID TT TPGID SESS PGRP PPID PID %CPU COMMAND
Ss 0 0 ? -1 3197 3197 1 3197 0.0 cron
S 0 0 ? -1 3197 3197 3197 18825 0.0 \_ cron
Zs 1000 1000 ? -1 18832 18832 18825 18832 0.0 \_ sh <defunct>
S 1000 1000 ? -1 18832 18832 1 18836 0.0 sleep
Notice that the sh and the sleep are in the same SESS.
Use the command setsid(1). Here's tester.sh:
#!/bin/bash
setsid sleep 27 # the real script launches a compiled C program in the background
Notice you don't need &, setsid puts it in the background.

I’d recommend that you solve the problem by simply not having two separate processes: Have launcher.sh do this on its last line:
exec "$#"
This will eliminate the superfluous process.

I found this question while I was looking for a solution with a similar issue. Unfortunately answers in this question didn't solve my problem.
Killing defunct process is not an option as you need to find and kill its parent process. I ended up killing the defunct processes in the following way:
ps -ef | grep '<defunct>' | grep -v grep | awk '{print "kill -9 ",$3}' | sh
In "grep ''" you can narrow down the search to a specific defunct process you are after.

I have tested the same problem so many times.
And finally I've got the solution.
Just specify the '/bin/bash' before the bash script as shown below.
* * * * * /bin/bash /tmp/launcher.sh /tmp/tester.sh

Related

Kill -INT not working on bash script with trap INT set

Consider the following bash script s:
#!/bin/bash
echo "Setting trap"
trap 'ctrl_c' INT
function ctrl_c() {
echo "CTRL+C pressed"
exit
}
sleep 1000
When calling it ./s, it will sleep. If you press Ctrl+C it prints the message and exits.
Now call it, open another terminal and kill the corresponding bash pid with
-INT flag. It does not work or do anything. Why?
Kill without flags works but does not call the ctrl_c() function.
From the POSIX standard for the shell:
When a signal for which a trap has been set is received while the shell is waiting for the completion of a utility executing a foreground command, the trap associated with that signal shall not be executed until after the foreground command has completed.
To verify this, we can run strace on the shell executing your script. Upon sending the SIGINT from another terminal, bash just notes that the signal has been received and returns to waiting for its child, the sleep command, to finish:
rt_sigaction(SIGINT, {0x808a550, [], 0}, {0x80a4600, [], 0}, 8) = 0
waitpid(-1, 0xbfb56324, 0) = ? ERESTARTSYS
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=3556, si_uid=1000} ---
sigreturn({mask=[CHLD]}) = -1 EINTR (Interrupted system call)
waitpid(-1,
To make kill have the same effect as Ctrl-C, you should send SIGINT to the process group. The shell by default will put every process of a new command into its own pgrp:
$ ps -f -o pid,ppid,pgid,tty,comm -t pts/1
PID PPID PGID TT COMMAND
3460 3447 3460 pts/1 bash
29087 3460 29087 pts/1 \_ foo.sh
29120 29087 29087 pts/1 \_ sleep
$ kill -2 -29087
And now the trap runs and the shell exits:
waitpid(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGINT}], 0) = 29120
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=3556, si_uid=1000} ---
sigreturn({mask=[CHLD]}) = 29120
...
write(1, "CTRL+C pressed\n", 15) = 15
...
exit_group(0) = ?
When we use the ps aux command to find the PID values, we see that there are 2 processes, this program and sleep. Program process does not interrupt with -INT parameter as you say. However, when the -2 or -INT (same meaning) parameter is used for the sleep process, the process is interrupted as expected. The message prints and exits.I did not understand why the main process did not stop, but I wanted to show that the sleep process can be interrupted by this method. I hope it will be good for you.
sleep is not a shell built-in:
$ which sleep
/bin/sleep
This means that when you run sleep it will run as a separate child process. When a child process is created various things are copied from parent to child (inherited) and one of these is the signal mask. Note: only the mask. So if you had set:
trap '' INT
then sleep would have inherited that - it means ignore the signal. However, in your case you expect sleep (which is probably written in C) to execute a bash function. No can do - wrong language for a start. Imagine: it would have to execute bash to execute the bash function.
Caveat - details concerning signal handling can vary across platforms.
It is possible to make sleep a builtin, see here but that's probably more trouble than it is worth.

Can't send signals to process created by PTY.spawn() in Ruby

I'm running ruby in an Alpine docker container (it's a sidekiq worker, if that matters). At a certain point, my application receives instructions to shell out to a subcommand. I need to be able to stream STDOUT rather than have it buffer. This is why I'm using PTY instead of system() or another similar answer. I'm executing the following line of code:
stdout, stdin, pid = PTY.spawn(my_cmd)
When I connect to the docker container and run ps auxf, I see this:
root 7 0.0 0.4 187492 72668 ? Sl 01:38 0:00 ruby_program
root 12378 0.0 0.0 1508 252 pts/4 Ss+ 01:38 0:00 \_ sh -c my_cmd
root 12380 0.0 0.0 15936 6544 pts/4 S+ 01:38 0:00 \_ my_cmd
Note how the child process of ruby is "sh -c my_cmd", which itself then has a child "my_cmd" process.
"my_cmd" can take a significant amount of time to run. It is designed so that sending a signal USR1 to the process causes it to save its state so it can be resumed later and abort cleanly.
The problem is that the pid returned from "PTY.spawn()" is the pid of the "sh -c my_cmd" process, not the "my_cmd" process. So when I execute:
Process.kill('USR1', pid)
it sends USR1 to sh, not to my_cmd, so it doesn't behave as it should.
Is there any way to get the pid that corresponds to the command I actually specified? I'm open to ideas outside the realm of PTY, but it needs to satisfy the following constraints:
1) I need to be able to stream output from both STDOUT and STDERR as they are written, without waiting for them to be flushed (since STDOUT and STDERR get mixed together into a single stream in PTY, I'm redirecting STDERR to a file and using inotify to get updates).
2) I need to be able to send USR1 to the process to be able to pause it.
I gave up on a clean solution. I finally just executed
pgrep -P #{pid}
to get the child pid, and then I could send USR1 to that process. Feels hacky, but when ruby gives you lemons...
You should send your arguments as arrays. So instead of
stdout, stdin, pid = PTY.spawn("my_cmd arg1 arg2")
You should use
stdout, stdin, pid = PTY.spawn("my_cmd", "arg1", "arg2")
etc.
Also see:
Process.spawn child does not receive TERM
https://zendesk.engineering/running-a-child-process-in-ruby-properly-febd0a2b6ec8

Ruby leaves zombies and how to manage command executions

I need to call commands sporadically in my ruby app, most commands will end quite quickly, but some will run for longer (like playing a song) and I need to be able to kill that process if wanted.
The commands are called in a sub thread as not to lock the rest of the app.
I also need to store info on running commands (so I know what to kill if needed), but I also need to send out info when the command ended.
The problems:
wait or waitpid doesn't work (they never return)
I tried to use trap, which detects when the command ended, but it only works for the latest started command.
Ruby leaves the commands as zombie process (until killing the app) and trying to use the kill(both Process.kill and calling the command directly) doesn't remove them. (Perhaps the reason to why wait doesn't work??)
# Play sound (player can be either 'mplayer' or 'aplay' )
pid = spawn player, sound_dir + file
Eddie::common[:commands][:sound_pids][pid] = file
Signal.trap("CLD") {
# This runs, but doesn't clean up zombies
Process.kill "KILL", pid
p = system 'kill ' + pid.to_s
Eddie::common[:commands][:sound_pids].delete pid
}
If I run this twice (once while the first command is running) it will look like this after both commands ended:
Eddie::common[:commands][:sound_pids] => {3018=>"firstfile.wav"}
and this is the result of ps
chris 3018 0.0 0.0 0 0 pts/5 Z+ 23:50 0:00 [aplay] <defunct>
chris 3486 0.2 0.0 0 0 pts/5 Z+ 23:51 0:00 [aplay] <defunct>
Added note: Using system to call the command doesn't leave a zombie process, but in turn it's not killable..

Resource leaking of available PIDs by long running bash scripts

I am currently reading up on some more details on Bash scripting and especially process management here. In the section on "PIDs and Parents" I found the following statement:
A process's PID will NEVER be freed up for use after the process dies UNTIL the parent process waits for the PID to see whether it ended and retrieve its exit code.
So if I understand this correctly, if I start an process in a bash script, then the process terminates, that the PID cannot be used by any other process. Wouldn't this mean, that if I have a long running script, which repeatedly starts other sub-processes but never waits on them, that I'll eventually have a resource leak, because the used PIDs will not be returned back to the system?
How about if I actually wait for the other process, but the wait get's cancelled by a trap. Would this wait somehow still free up the PID, or do I have to wait again after the trap has been caught?
Luckily you won't. I can't tell you exactly why but you can easily test this. Run the following script (stop with Ctrl+C):
#!/bin/bash
while true; do
sleep 5 &
sleep 1
done
You can see you get no zombies (leaked PIDs) after 6+ seconds. To see some zombies use the following python code (again, stop with Ctrl+C):
#!/usr/bin/python
import subprocess, time
pl = []
while True:
pl.append(subprocess.Popen(["sleep", "5"]))
time.sleep(1)
After 6 seconds you'll see one zombie:
ps xaw | grep 'sleep'
...
26470 pts/2 Z+ 0:00 [sleep] <defunct>
...
My guess is that bash does wait and stores the results reaping the zombile processes with or without the builtin wait command. For the python script, if you remove the pl.append part the garbage collection releases the objects and does it's magic again reaping the zombies. Just for info a child may never become a zombie (from wikipedia, Zombie process):
...if the parent explicitly ignores SIGCHLD by setting its handler to SIG_IGN (rather
than simply ignoring the signal by default) or has the SA_NOCLDWAIT flag set, all
child exit status information will be discarded and no zombie processes will be left.
You don't have to explicitly wait on foreground processes because the shell in which your script is running waits on them. The next process won't start until the previous one finishes.
If you start many long running background processes, you could use all available PIDs, but that's subject to the limit of ulimit -u (which could be unlimited).

How to be notified when a script's background job completes?

My question is very similar to this one except that my background process was started from a script. I could be doing something wrong but when I try this simple example:
#!/bin/bash
set -mb # enable job control and notification
sleep 5 &
I never receive notification when the sleep background command finishes. However, if I execute the same directly in the terminal,
$ set -mb
$ sleep 5 &
$
[1]+ Done sleep 5
I see the output that I expect.
I'm using bash on cygwin. I'm guessing that it might have something to do with where the output is directed, but trying various output redirection, I'm not getting any closer.
EDIT: So I have more information about the why (thx to jkramer), but still looking for the how. How do I get "push" notification that a background process started from a script has terminated? Saving a PID to a file and polling is not what I'm looking for.
if you run it as source file . then it can notify. e.g.
cat foo.sh
#!/bin/bash
set -mb # enable job control and notification
sleep 5 &
. foo.sh
[1]+ Done sleep 5
The job control of your shell only affects processes controlled by your terminal, that is tasks started directly from the shell.
When the parent process (your script) dies, the init process automatically becomes the parent of the child process (your sleep command), effectively killing all output. Try this example:
[jkramer/sgi5k:~]# cat foo.sh
#!/bin/bash
sleep 20 &
echo $!
[jkramer/sgi5k:~]# bash foo.sh
19638
[jkramer/sgi5k:~]# ps -ef | grep 19638
jkramer 19638 1 0 23:08 pts/3 00:00:00 sleep 20
jkramer 19643 19500 0 23:08 pts/3 00:00:00 grep 19638
As you can see, the parent process of sleep after the script terminates is 1, the init process. If you need to get notified about the termination of your child process, you can save the content of the $! variable (PID of the last created process) in a file or somewhere and use the `wait´ command to wait for its termination.

Resources