Best way to wait for all child processes to complete in Ruby?

Looking for a way to wait for the completion of all child processes, I found this code:
while true
  p "waiting for child processes"
  begin
    exited_pid = Process.waitpid(-1, Process::WNOHANG)
    if exited_pid and exited_pid > 0 then
      p "Process exited : #{exited_pid} with status #{$?.exitstatus}"
    end
    sleep 5
  rescue SystemCallError
    puts "All children collected!"
    break
  end
end
This looks like it works similarly to process management on Unix systems, as described in a tutorialspoint article I read.
So in summary, it looks like this code:
Calls Process.waitpid for any child process (pid -1), passing WNOHANG so the call returns immediately even if no child has exited.
If a child process has exited, notify the user. Otherwise sleep, and check again.
When all child processes have exited, an exception (a SystemCallError) is raised, which is rescued, and the user is notified that the processes are complete.
But looking at a similar question on waiting for child processes in C (Make parent wait for all child processes), which has as an answer:
POSIX defines a function: wait(NULL);. It's shorthand for waitpid(-1,
NULL, 0);, which will block until all children processes exit.
I tested that Process.wait() in Ruby achieves pretty much the same thing as the more verbose code above.
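For reference, the blocking version I tested boils down to something like this (a minimal sketch; the fork calls just stand in for real child processes):
3.times { fork { sleep rand(3) } }

begin
  loop do
    exited_pid = Process.wait   # blocks until any one child exits
    p "Process exited : #{exited_pid} with status #{$?.exitstatus}"
  end
rescue Errno::ECHILD
  puts "All children collected!"
end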
What is the benefit of the more verbose code above? Or, which is considered the better approach to waiting for child processes? It seems that with the verbose code I would be able to wait for specific processes and listen for specific exit codes, but if I don't need to do that, is there any benefit?
Also, regarding the more verbose code:
Why does the call to Process.waitpid() raise an error if there are no more child processes?
If more than one child process exits within the 5-second sleep period, it seems like there is a queue of completed processes and that Process.waitpid just returns the top member of the queue. What is actually happening here?
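For example, in this reduced experiment (my own sketch) all of the children exit during the sleep, and each subsequent waitpid call hands back exactly one of them:
3.times { fork { exit 0 } }
sleep 1   # all three children exit while we sleep

3.times do
  p Process.waitpid(-1, Process::WNOHANG)   # one already-exited pid per call
end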

Related

Python3 How to gracefully shutdown a multiprocess application

I am trying to fix a Python 3 application where multiple processes and threads are created and controlled by various queues and pipes. I am trying to implement a form of controlled exit when someone tries to break the program with Ctrl-C. However, no matter what I do, it always hangs right at the end.
I've tried using the KeyboardInterrupt exception and signal catching.
The code below is part of the multiprocess code.
import signal
import sys
from multiprocessing import Process, Pipe, JoinableQueue as Queue, Event

class TaskExecutor(Process):
    def __init__(....):
        {inits}

    def signal_handler(self, sig, frame):
        print('TaskExecutor closing')
        self._in_p.close()
        sys.exit(1)

    def run(self):
        signal.signal(signal.SIGINT, self.signal_handler)
        signal.signal(signal.SIGTERM, self.signal_handler)
        while True:
            # Get the Task Group name from the Task queue.
            try:
                ExecCmd = self._in_p.recv()  # type: TaskExecCmd
            except Exception as e:
                self._in_p.close()
                return
            if ExecCmd.Kill:
                self._log.info('{:30} : Kill Command received'.format(self.name))
                self._in_p.close()
                return
            else:
                {other code executing here}
I'm getting the print above saying that it's closing,
but I'm still getting a lot of different exceptions, which I try to catch but cannot.
I am looking for some documentation on how, and in which order, to shut down the child processes and the main process.
I know it's a very general question; however, it's a very large application, so if there is anything I could test to narrow it down, let me know.
Regards
So after investigating this issue further, I found that in a situation where I had a pipe thread, a queue thread and 4 worker processes running, a number of these processes could end up hanging when terminating the application with Ctrl-C. The pipe and queue processes were already shut down.
In the multiprocessing documentation there is a warning:
Warning If this method is used when the associated process is using a
pipe or queue then the pipe or queue is liable to become corrupted and
may become unusable by other process. Similarly, if the process has
acquired a lock or semaphore etc. then terminating it is liable to
cause other processes to deadlock.
And I think this is what's happening.
I also found that even though I have a shutdown mechanism in my multiprocess class, the threads still running would of course be considered alive (according to is_alive()) even though I know that the run() method had returned, i.e. something internal was hanging.
Now for the solution. My processes were by design not daemons, because I wanted to control their shutdown. However, I changed them to daemons so they would always be killed regardless. I first added a handler so that any kill signal raises a ProgramKilled exception throughout my entire program:
def signal_handler(signum, frame):
    raise ProgramKilled('Task Executor killed')
I then changed the shutdown mechanism in my multiprocess class to:
while True:
    # Get the Task Group name from the Task queue.
    try:
        # Reading from pipe
        ExecCmd = self._in_p.recv()  # type: TaskExecCmd
    # If fatal error just close it all
    except BrokenPipeError:
        break
    # This can occur; close the pipe and break the loop
    except EOFError:
        self._in_p.close()
        break
    # Exception for when a kill signal is detected
    # Set the multiprocess as killed (just waiting for the kill command from main)
    except ProgramKilled:
        self._log.info('{:30} : Died'.format(self.name))
        self._KilledStatus = True
        continue

    # Kill command from main received
    # Shut down all we can. Ignore exceptions
    if ExecCmd.Kill:
        self._log.info('{:30} : Kill Command received'.format(self.name))
        try:
            self._in_p.close()
            self._out_p.join()
        except Exception:
            pass
        self._log.info('{:30} : Kill Command executed'.format(self.name))
        break
    elif not self._KilledStatus:
        {Execute code}

# When out of the loop set killed event
KilledEvent.set()
And in my main thread I have added the following clean up process.
# Loop through all my resources
for ThreadInterfaces in ResourceThreadDict.values():
    # Test each process in each resource
    for ThreadIf in ThreadInterfaces:
        # Wait for its event to be set
        ThreadIf['KillEvent'].wait()
        # When the event has been received, see if it's hanging.
        # We know at this point that everything has been closed and all data has been
        # purged correctly, so if it's still alive, terminate it.
        if ThreadIf['Thread'].is_alive():
            try:
                psutil.Process(ThreadIf['Thread'].pid).terminate()
            except (psutil.NoSuchProcess, AttributeError):
                pass
After a lot of testing I know it's really hard to control the termination of an app with multiple processes, because you simply do not know in which order all of your processes receive the signal.
I've tried, in some way, to save most of my data when the app is killed. Some would argue: what do I need that data for when manually terminating the app? But in this case the app runs a lot of external scripts and other applications, and any of those can lock the application; then you need to kill it manually but still retain the information about what has already been executed.
So this is my solution to my current problem with my current knowledge.
Any input or more in-depth knowledge on what is happening is welcome.
Please note that this app runs both on linux and windows.
Regards

How can a process die in a way that Process.wait wouldn't notice?

I have this Ruby script to manage que processes (que doesn't support multi-process mode; see the discussion here):
#!/usr/bin/env ruby

cluster_size = 2

puts "starting Que cluster with #{cluster_size} workers"; STDOUT.flush

%w[INT TERM].each do |signal|
  trap(signal) do
    @pids.each { |pid| Process.kill(signal, pid) }
  end
end

@pids = []
cluster_size.to_i.times do |n|
  puts "Starting Que daemon #{n}"; STDOUT.flush
  @pids << Process.spawn("que --worker-count $MAX_THREADS")
end

Process.waitall
puts "Que cluster has shut down"; STDOUT.flush
The script has been working well for a couple months. The other day I found things in a state where the script was running, but both child processes were dead.
I experimented with trying to replicate this. I killed the children with various signals, had them raise exceptions. In all cases, the script knew the process died and itself died.
How could the child process have died without the parent script knowing?
How could the child process have died without the parent script
knowing?
My guess is that the child process turned into a zombie and was missed by Process.waitall. Did you check whether the child processes are zombies when it happens?
The zombie:
If you have zombie processes it means those zombies have not been waited for by their parent (check the PPID with ps -l). In the end you have three choices: Fix the parent process (make it wait); kill the parent; or get over it.
Could you check your list of signals and trap it?
You can list all the available signals (the output below is from Windows):
Signal.list
=> {"EXIT"=>0, "INT"=>2, "ILL"=>4, "ABRT"=>22, "FPE"=>8, "KILL"=>9, "SEGV"=>11, "TERM"=>15}
Could you try to trap it via e.g. INT (note: you can have one trap per signal)? For example, with SEGV:
Signal.trap('SEGV') { throw :sigsegv }

catch(:sigsegv) do
  start_what_you_need
end

puts 'OMG! Got a SEGV!'
Since your question is a general one, it is hard to give you a specific answer.
Zombies are not the only possible cause for this problem -- stopped children may not be reported for a variety of reasons.
The existence of a zombie typically means that the parent has not properly waited on them. The posted code looks OK, though, so unless there's a framework bug lurking somewhere I'd want to look beyond the zombie apocalypse to explain this problem.
In contrast to zombies, which can't be fully reaped because they have no accessible parent, frozen processes have an intact parent but have stopped responding for some reason (waiting for an external process or I/O operation, memory problems, long or infinite looping, slow database operations, etc.).
On some platforms, Ruby can add a flag requesting return of stopped children that haven't been reported, using the following syntax:
Process.waitpid(pid, Process::WUNTRACED)
AFAIK waitall doesn't have a version that accepts flags, so you'd have to aggregate this yourself, or use pid = -1 to wait for any child process (the default if you omit pid), or pid = 0 to wait for any child with the same process group ID as the calling process.
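Aggregating it yourself could look something like this (a sketch only, assuming WUNTRACED is available on your platform):
statuses = []
begin
  loop do
    pid = Process.waitpid(-1, Process::WUNTRACED)   # also reports stopped children
    statuses << [pid, $?]
  end
rescue Errno::ECHILD
  # raised once there are no child processes left to wait for
end
Note that a merely stopped child is reported once but remains a child, so the loop will block again until it exits or changes state.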
See documentation here.

what does exit do in this ruby if fork block

Some code like the following:
def start
  if fork
    do something
    exit 0
  end
end
fork duplicates the current process into a child, am I right?
But my question is: which process does exit 0 really exit, the parent process or the child process?
fork, if given no block, has two different return values. To the parent it returns the process ID (PID) of the child. To the child it returns nil, which is falsy.
This is taken advantage of like so:
if fork
  ...this is the parent...
else
  ...this is the child...
end
So your code above forks, the parent does something, then the parent exits and the child lives on.
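A runnable way to see this for yourself (my own minimal example, with prints standing in for the real work):
def start
  if fork
    puts "parent #{Process.pid}: doing something, then exiting"
    exit 0
  end
end

start
puts "child #{Process.pid}: still running after the parent exited"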

Ruby - Don't kill process when main thread exits

Basically, all of my logic is in a bunch of event handlers that are fired by threads. After I establish the event handlers in the main thread:
puts 'Now connecting...'
socket = SocketIO::Client::Simple.connect 'http://localhost:3000'

socket.on :connect do
  puts 'Connected'
end
I don't really have anything else to do in the main thread... but when I exit it, the whole process exits! I guess I could just do a while 1 {sleep 3} or something but that seems like a hack.
From what I can tell, daemon threads also don't work on Windows, so what am I supposed to do here?
If you're creating threads then it's your obligation to wait for them to finish before terminating. Normally this is done with join on the thread or threads in question.
Do you have a way of getting the thread out of that SocketIO instance? If so, join it.
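For example, if the work runs on a thread you create yourself, the pattern is simply this (a minimal sketch; whether SocketIO::Client::Simple exposes its internal thread is something you'd have to check in the gem):
worker = Thread.new do
  # the event-driven work happens here
  sleep 10
end

# join blocks the main thread until the worker finishes,
# so the process doesn't exit out from under it
worker.join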

fork() and wait() connection to pid

I know that fork() creates a child process, returns 0 to child and returns child's pid to parent.
From what I understand, wait() also returns some kind of PID of the terminated child process. Is this the same PID as the one that's returned to the parent after fork?
I don't understand how to use wait().
My textbook just shows
int ReturnCode;
while (pid != wait(&ReturnCode));
/* the child has terminated with ReturnCode as its return code */
I don't even understand what this means.
How do I use wait()? I am using execv to create a child process, but I want the parent to wait. Could someone please explain and give an example?
Thanks
wait() does indeed return the PID of the child process that died. If you only have one child process, you don't really need to check the PID (do check that it's not zero or negative though; there are some conditions that may cause the wait call to fail). You can find an example here: http://www.csl.mtu.edu/cs4411/www/NOTES/process/fork/wait.html
wait() takes the address of an integer
variable and returns the process ID of
the completed process.
More about the wait() system call
The
while (pid!=wait(&ReturnCode));
loop is comparing the process id (pid) returned by wait() to the pid received earlier from a fork or any other process starter. If it finds out that the process that has ended IS NOT the same as the one this parent process has been waiting for, it keeps on wait()ing.
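Since the question at the top of this page is about Ruby, the same wait-for-a-specific-pid pattern translates roughly like this (my own sketch, not from the textbook):
child_pid = fork { exec("ls") }   # roughly fork + execv

loop do
  reaped, status = Process.wait2  # blocks until some child terminates
  if reaped == child_pid
    puts "child exited with status #{status.exitstatus}"
    break
  end
end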
