I am trying to fix a Python 3 application where multiple processes and threads are created, controlled by various queues and pipes. I am trying to implement a controlled exit when someone breaks the program with Ctrl-C. However, no matter what I do, it always hangs right at the end.
I've tried using both a KeyboardInterrupt exception handler and a signal handler.
The code below is part of the multiprocessing code.
import signal
import sys
from multiprocessing import Process, Pipe, JoinableQueue as Queue, Event

class TaskExecutor(Process):
    def __init__(self, ...):
        {inits}

    def signal_handler(self, sig, frame):
        print('TaskExecutor closing')
        self._in_p.close()
        sys.exit(1)

    def run(self):
        signal.signal(signal.SIGINT, self.signal_handler)
        signal.signal(signal.SIGTERM, self.signal_handler)
        while True:
            # Get the Task Group name from the Task queue.
            try:
                ExecCmd = self._in_p.recv()  # type: TaskExecCmd
            except Exception as e:
                self._in_p.close()
                return
            if ExecCmd.Kill:
                self._log.info('{:30} : Kill Command received'.format(self.name))
                self._in_p.close()
                return
            else:
                {other code executing here}
I'm getting the above print saying that it's closing, but I'm still getting a lot of different exceptions that I try to catch but cannot.
I'm looking for documentation on how, and in which order, to shut down the child processes and the main process.
I know it's a very general question, but it's a very large application, so if there are any questions or things I could test, I can narrow it down.
Regards
So after investigating this issue further, I found that in a situation where I had a pipe thread, a queue thread and four processes running, some of these processes could end up hanging when terminating the application with Ctrl-C, even though the Pipe and Queue processes had already shut down.
In the multiprocessing documentation there is a warning about terminate():

Warning: If this method is used when the associated process is using a pipe or queue then the pipe or queue is liable to become corrupted and may become unusable by other processes. Similarly, if the process has acquired a lock or semaphore etc. then terminating it is liable to cause other processes to deadlock.
And I think this is what's happening.
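To make the implication concrete, here is a minimal sketch (my own illustration, not code from the app) of the shutdown ordering the docs suggest: signal the worker through the queue, drain and join, and only fall back to terminate() if the process is genuinely stuck:

import multiprocessing as mp

def worker(q):
    while True:
        item = q.get()
        if item is None:      # sentinel value: clean exit path
            q.task_done()
            break
        # ... process item ...
        q.task_done()

if __name__ == '__main__':
    q = mp.JoinableQueue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    q.put(None)        # ask the worker to stop
    q.join()           # wait until everything put on the queue was handled
    p.join(timeout=5)  # then wait for the process itself
    if p.is_alive():   # terminate() only as a last resort, per the warning
        p.terminate()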
I also found that even though I have a shutdown mechanism in my multiprocessing class, threads that were still running would of course be reported as alive (via is_alive()) even though I knew the run() method had returned, i.e. something internal was hanging.
Now for the solution. My processes were, by design, not daemons, because I wanted to control their shutdown. However, I changed them to daemons so they would always be killed regardless. I first arranged for any kill signal to raise a ProgramKilled exception throughout my entire program:
def signal_handler(signum, frame):
    raise ProgramKilled('Task Executor killed')
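ProgramKilled is not a Python built-in; a minimal sketch of how the exception might be defined, and of registering the handler above at the top of each process's run():

import signal

class ProgramKilled(Exception):
    """Raised wherever execution happens to be when a kill signal arrives."""

# installed at the top of run() in each process (and in the main process)
signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)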
I then changed the shutdown mechanism in my multiprocessing class to:
while True:
    # Get the Task Group name from the Task queue.
    try:
        # Reading from the pipe
        ExecCmd = self._in_p.recv()  # type: TaskExecCmd
    # If it is a fatal error, just close it all down
    except BrokenPipeError:
        break
    # This can occur: close the pipe and break the loop
    except EOFError:
        self._in_p.close()
        break
    # Exception for when a kill signal is detected.
    # Mark the process as killed (just waiting for the kill command from main).
    except ProgramKilled:
        self._log.info('{:30} : Died'.format(self.name))
        self._KilledStatus = True
        continue
    # Kill command from main received.
    # Shut down all we can; ignore exceptions.
    if ExecCmd.Kill:
        self._log.info('{:30} : Kill Command received'.format(self.name))
        try:
            self._in_p.close()
            self._out_p.join()
        except Exception:
            pass
        self._log.info('{:30} : Kill Command executed'.format(self.name))
        break
    elif not self._KilledStatus:
        {Execute code}

# When out of the loop, set the killed event
KilledEvent.set()
And in my main thread I added the following clean-up process:
# Loop through all my resources
for ThreadInterfaces in ResourceThreadDict.values():
    # Test each process in each resource
    for ThreadIf in ThreadInterfaces:
        # Wait for its event to be set
        ThreadIf['KillEvent'].wait()
        # When the event has been received, see if the process is hanging.
        # At this point everything has been closed and all data has been
        # purged correctly, so if it is still alive, terminate it.
        if ThreadIf['Thread'].is_alive():
            try:
                psutil.Process(ThreadIf['Thread'].pid).terminate()
            except (psutil.NoSuchProcess, AttributeError):
                pass
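For context, the wiring these two loops assume might look roughly like this (a hypothetical sketch; TaskExecutor taking its kill event via __init__, and the resource names, are my assumptions):

from multiprocessing import Event

ResourceThreadDict = {}
for resource_name in ('res_a', 'res_b'):            # stand-ins for real resources
    interfaces = []
    for _ in range(4):
        kill_event = Event()
        proc = TaskExecutor(kill_event=kill_event)  # assumed constructor
        proc.daemon = True   # daemonic, so it is killed regardless (see above)
        proc.start()
        interfaces.append({'Thread': proc, 'KillEvent': kill_event})
    ResourceThreadDict[resource_name] = interfaces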
After a lot of testing I know it is really hard to control the termination of an app with multiple processes, because you simply do not know in which order each of your processes receives the signal.
I've tried to save as much of my data as possible when the app is killed. Some would argue: what do you need that data for when you terminate the app manually? But in this case this app runs a lot of external scripts and other applications, and any of those can lock up the application; then you need to kill it manually but still retain the information about what has already been executed.
So this is my solution to my current problem, with my current knowledge.
Any input or more in-depth knowledge about what is happening is welcome.
Please note that this app runs on both Linux and Windows.
Regards
Related
I have a data processing app which takes many hours to run. One of our customers is running it on an AWS EC2 Windows 2012 instance which has a shutdown schedule defined. As a result, it keeps getting cut off part way through each run.
This doesn't pose a problem for the data processing itself: each data row is processed atomically, and it can be safely restarted afterwards without re-doing any of the data rows already processed. However, it does write lots of summary data to disk at the end of the process, which I could really do with seeing, and this data is not getting written.
The app is written in VB.NET WinForms. There is a form that displays progress on screen, and the processing logic is done via a BackgroundWorker component.
The form handles cancellation events as follows:
Private Sub MyBase_FormClosing(sender As Object, e As FormClosingEventArgs) Handles MyBase.FormClosing
    ' If the user tried to close the app, ask if they are sure, and if not, then don't close down.
    If e.CloseReason = CloseReason.UserClosing AndAlso ContainsFocus AndAlso Not PromptToClose() Then
        e.Cancel = True
        Return
    End If
    ' If the background worker is running, then politely request it to stop. Otherwise, let the form close down.
    If worker.IsBusy Then
        If Not worker.CancellationPending Then
            worker.CancelAsync()
        End If
        e.Cancel = True
    Else
        Environment.ExitCode = AppExitCode.Cancelled
    End If
End Sub
The background thread polls for cancellation once per data row processed, as follows:
Private Sub ReportProgress(state As UIState)
    worker.ReportProgress(0, state)
    If worker.CancellationPending Then
        Throw New AbortException ' custom exception type
    End If
End Sub
The top-level Catch block in the background thread handles all success and failure paths appropriately, writing all summary data to disk, before exiting the thread.
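For comparison, the same poll-per-row cooperative cancellation pattern sketched in Python (stand-in names only; the app itself is VB.NET as shown above):

import threading

cancel_requested = threading.Event()   # set by the UI / shutdown path

def process_row(row):                  # stand-in for the real per-row work
    pass

def write_summary():                   # stand-in for the summary writer
    print('summary written')

def run_job(rows):
    try:
        for row in rows:
            if cancel_requested.is_set():        # poll once per data row
                raise RuntimeError('cancelled')  # like the AbortException
            process_row(row)
    finally:
        write_summary()  # runs on success, cancellation, or failure

run_job(range(10))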
I have tested this extensively outside AWS. In particular, I have verified that, if you end the process via Task Manager, then it gracefully shuts down straight away without prompting the user, and writes all summary data to disk as expected. The graceful shutdown takes a few seconds at most.
My problem is, the summary data is not written to disk when AWS shuts down the EC2 instance on schedule. I don't know exactly what happens during the EC2 shutdown process, and my searching of the EC2 documentation has not yet shed light on this. I guess it could be one of:
AWS or Windows is terminating running processes without sending WM_CLOSE messages
AWS or Windows is giving insufficient time to each process to shut down
Can anyone clarify how the EC2 shutdown process works, or suggest how I can improve my code to handle it?
It looks like if you create a subprocess via exec.Cmd and Start() it, the Cmd.Process field is populated right away; however, the Cmd.ProcessState field remains nil until the process exits.
// ProcessState contains information about an exited process,
// available after a call to Wait or Run.
ProcessState *os.ProcessState
So it looks like I can't actually check the status of a process I Start()ed while it's still running?
It makes no sense to me that ProcessState is only set when the process exits. There's a ProcessState.Exited() method, which will always return true in this case.
So I tried to go this route instead: the cmd.Process.Pid field exists right after cmd.Start(), but it looks like os.Process doesn't expose any mechanism to check whether the process is running.
os.FindProcess says:
On Unix systems, FindProcess always succeeds and returns a Process for the given pid, regardless of whether the process exists.
which isn't useful, and it seems like there's no way to go from an os.Process to an os.ProcessState unless you call .Wait(), which defeats the whole purpose (I want to know whether the process is running before it has exited).
I think you have two reasonable options here:
Spin off a goroutine that waits for the process to exit. When the wait is done, you know the process exited. (Positive: pretty easy to code correctly; negative: you dedicate an OS thread to waiting.)
Use syscall.Wait4() on the published Pid. A Wait4 with syscall.WNOHANG set returns immediately, filling in the status.
It might be nice if there were an exported os or cmd function that did the Wait4 for you and filled in the ProcessState. You could supply WNOHANG or not, as you see fit. But there isn't.
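For illustration, the WNOHANG idea looks like this in Python on Unix (os.waitpid wraps the same wait4 semantics; this is a sketch of the technique, not Go code):

import os
import subprocess
import time

proc = subprocess.Popen(['sleep', '2'])

# WNOHANG makes waitpid return immediately; pid == 0 means "still running"
pid, status = os.waitpid(proc.pid, os.WNOHANG)
print('still running' if pid == 0 else 'already exited')

time.sleep(3)
# note: reaping by hand like this bypasses Popen's own bookkeeping;
# it is fine for a demonstration
pid, status = os.waitpid(proc.pid, os.WNOHANG)
if pid == proc.pid and os.WIFEXITED(status):
    print('exited with', os.WEXITSTATUS(status))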
The point of ProcessState.Exited() is to distinguish between all the various possibilities, including:
process exited normally (with a status byte)
process died due to receiving an unhandled signal
See the stringer for ProcessState. Note that there are more possibilities than these two ... only there seems to be no way to get the others into a ProcessState. The only calls to syscall.Wait seem to be:
syscall/exec_unix.go: after a failed exec, to collect zombies before returning an error; and
os/exec_unix.go: after a call to p.blockUntilWaitable().
If it were not for the blockUntilWaitable, the exec_unix.go implementation variant for wait() could call syscall.Wait4 with syscall.WNOHANG, but blockUntilWaitable itself ensures that this is pointless (and the goal of this particular wait is to wait for exit anyway).
I have this Ruby script to manage que processes (que doesn't support multi-process; see the discussion here):
#!/usr/bin/env ruby

cluster_size = 2

puts "starting Que cluster with #{cluster_size} workers"; STDOUT.flush

# Forward INT/TERM to every child so the whole cluster shuts down together
%w[INT TERM].each do |signal|
  trap(signal) do
    @pids.each { |pid| Process.kill(signal, pid) }
  end
end

@pids = []
cluster_size.to_i.times do |n|
  puts "Starting Que daemon #{n}"; STDOUT.flush
  @pids << Process.spawn("que --worker-count $MAX_THREADS")
end

# Block until every child has exited
Process.waitall
puts "Que cluster has shut down"; STDOUT.flush
The script has been working well for a couple months. The other day I found things in a state where the script was running, but both child processes were dead.
I experimented with trying to replicate this. I killed the children with various signals, had them raise exceptions. In all cases, the script knew the process died and itself died.
How could the child process have died without the parent script knowing?
How could the child process have died without the parent script knowing?

My guess is that the child process turned into a zombie and was missed by Process.waitall. Did you check whether the child processes were zombies when it happened?
The zombie:
If you have zombie processes, it means those zombies have not been waited for by their parent (check the PPID with ps -l). In the end you have three choices: fix the parent process (make it wait), kill the parent, or get over it.
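A zombie is easy to produce on purpose; a minimal demonstration (Python here, since this thread mixes languages; Unix only):

import os
import sys
import time

pid = os.fork()
if pid == 0:
    sys.exit(0)   # child exits immediately
# The parent deliberately never calls os.wait(), so for the next 30 seconds
# the child shows up as <defunct> (a zombie) in `ps -l` output.
time.sleep(30)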
Could you check your list of signals and trap them?
You can list all available Signal(s) (below is on Windows):

Signal.list
=> {"EXIT"=>0, "INT"=>2, "ILL"=>4, "ABRT"=>22, "FPE"=>8, "KILL"=>9, "SEGV"=>11, "TERM"=>15}

Could you try to trap the one you get (note: you can have one trap per signal), e.g.:
Signal.trap('SEGV') { throw :sigsegv }

catch :sigsegv do
  start_what_you_need
end
puts 'OMG! Got a SEGV!'
Since your question is a general one, it is hard to give you a specific answer.
Zombies are not the only possible cause for this problem -- stopped children may not be reported for a variety of reasons.
The existence of a zombie typically means that the parent has not properly waited on them. The posted code looks OK, though, so unless there's a framework bug lurking somewhere I'd want to look beyond the zombie apocalypse to explain this problem.
In contrast to zombies, which can't be fully reaped because they have no accessible parent, frozen processes have an intact parent but have stopped responding for some reason (waiting for an external process or I/O operation, memory problems, long or infinite looping, slow database operations, etc.).
On some platforms, Ruby can add a flag requesting return of stopped children that haven't been reported, using the following syntax:
waitpid(pid, Process::WUNTRACED)
AFAIK waitall doesn't have a version that accepts flags, so you'd have to aggregate this yourself, or use pid = -1 to wait for any child process (the default if you omit pid), or pid = 0 to wait for any child with the same process group ID as the calling process.
See documentation here.
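Aggregating it yourself amounts to a non-blocking reaping loop; a sketch of the idea (shown in Python, whose os.waitpid takes the same pid conventions and flags):

import os

def reap_children():
    """Collect every child that exited or stopped, without blocking."""
    results = []
    while True:
        try:
            # -1: any child; WNOHANG: never block; WUNTRACED: report stopped too
            pid, status = os.waitpid(-1, os.WNOHANG | os.WUNTRACED)
        except ChildProcessError:
            break               # no children at all
        if pid == 0:
            break               # children exist, but none changed state
        results.append((pid, status))
    return results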
Let's say I have this scenario:
Scenario: Test LDAP access
  Given that the LDAP dummy server is started
  And the LDAP query is executed
  ...
I wish to start an LDAP server in that step. In my case I use ruby-ldapserver, so in theory I could do this in my step:
args = { ... }
@ldap_pid = fork do
  redirect_stdout_stderr_to_logfile()
  wait_for_ldap_requests(args)
  exit # avoid messing with Cucumber/web driver cleanup
end

...

After do
  if @ldap_pid
    Process.kill("HUP", @ldap_pid)
    Process.wait @ldap_pid
  end
end
A totally different approach:
system("some_script_that_starts_ldap_dummy < #{input} >#{tmpfile} 2>&1 &")
This certainly works, but is rather inelegant (starting a Ruby program from inside Ruby means unnecessary process creation, and I need to set up the input parameters for that subprogram as well).
All that said, I'm not altogether happy with either approach (the "warm fuzzy feeling" is not there).
What is your standard approach to these things? Is there one to speak of? Does Cucumber bring something to the table that could support me here? Should I run something to tell Cucumber that it has forked and should handle itself like a child process?
Edit: actually, when playing around with the fork approach, I did not notice any problems with the DB at all. I did notice that if I kill the child with SIGINT, it breaks the web driver (Poltergeist/PhantomJS in my case). A working workaround is to send a SIGHUP, handle it in the child by shutting down gracefully (if needed) but not calling exit, and then, after a few seconds, a SIGKILL (which denies the child any chance to close down any protocols and just rips it away). Not nice... and not free of race conditions, say if the CI server is under load.
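That SIGHUP-then-SIGKILL workaround could be sketched like this (Python, hypothetical helper; the grace period is arbitrary and, as noted above, this is still racy under load):

import os
import signal
import time

def stop_child(pid, grace=5.0):
    """Ask the child to wind down via SIGHUP, then force-kill it."""
    os.kill(pid, signal.SIGHUP)       # child shuts down gracefully, no exit
    deadline = time.time() + grace
    while time.time() < deadline:
        done, _ = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return                    # exited within the grace period
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)      # rip it away, protocols or not
    os.waitpid(pid, 0)                # reap so it doesn't linger as a zombie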
Ryan Tomayko touched off quite a firestorm with this post about using Unix process control commands.
We should be doing more of this. A lot more of this. I'm talking about fork(2), execve(2), pipe(2), socketpair(2), select(2), kill(2), sigaction(2), and so on and so forth. These are our friends. They want so badly just to help us.
I have a bit of code (a delayed_job clone for DataMapper) that I think would fit right in with this, but I'm not clear on how to take advantage of the listed commands. Any ideas on how to improve this code?
def start
  say "*** Starting job worker #{@name}"
  t = Thread.new do
    loop do
      delay = Update.work_off(self)
      break if $exit
      sleep delay
      break if $exit
    end
    clear_locks
  end

  trap('TERM') { terminate_with t }
  trap('INT')  { terminate_with t }
  trap('USR1') do
    say "Wakeup Signal Caught"
    t.run
  end
end
Ahh yes... the dangers of "We should do more of this" without explaining what each of those calls does and in what circumstances you'd use them. For something like delayed_job you may even be using fork without knowing it. That said, it really doesn't matter. Ryan was talking about using fork for preforking servers. delayed_job would use fork for turning a process into a daemon. Same system call, different purposes. Running delayed_job in the foreground (without fork) vs. in the background (with fork) will result in a negligible performance difference.
However, if you write a server that accepts concurrent connections, now Ryan's advice is right on the money.
fork: creates a copy of the original process
execve: stops executing the current file and begins executing a new file in the same process (very useful in rake tasks)
pipe: creates a pipe (two file descriptors, one for read, one for write)
socketpair: like a pipe, but for sockets
select: lets you wait for one or more of multiple file descriptors to be ready, with a timeout
kill: used to send a signal to a process
sigaction: lets you change what happens when a process receives a signal
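As an illustration of the daemonization use of fork described above, the classic double-fork looks like this (sketched in Python; a simplified version of what daemon helpers typically do, Unix only):

import os
import sys

def daemonize():
    """Detach from the controlling terminal via the classic double fork."""
    if os.fork() > 0:
        sys.exit(0)      # first fork: the original parent returns to the shell
    os.setsid()          # new session: drop the controlling terminal
    if os.fork() > 0:
        sys.exit(0)      # second fork: ensure we can never reacquire a tty
    os.chdir('/')
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2): # detach stdin/stdout/stderr
        os.dup2(devnull, fd)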
5 months later, you can view my solution at http://github.com/antarestrader/Updater. Look at lib/updater/fork_worker.rb