How to debug MPI program before bad termination? - debugging

I am currently developing a program written in C++ with the MPI+pthread paradigm.
I added some functionality to my program, and now I get a bad termination message from one MPI process, like this:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 37805 RUNNING AT node165
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0#node162] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:0#node162] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2#node166] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:2#node166] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2#node166] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node162: task 0: Exited with exit code 7
[proxy:0:0#node162] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node166: task 2: Exited with exit code 7
[mpiexec#node162] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec#node162] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec#node162] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec#node162] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
My problem is that I have no idea why I get this kind of message, and thus no idea how to correct it.
I use only some basic MPI functions, and I ensure that no thread makes MPI calls (only my "master process" is allowed to call such functions).
I also checked that no process sends a message to itself, and that the destination process exists before a message is sent.
My question is quite simple: how can I find out where the problem comes from, so that I can debug my application?
Thank you a lot.

One of your processes has had a segmentation fault (the exit code 11 matches signal 11, SIGSEGV, on Linux). This means it read from or wrote to an area of memory that it is not permitted to access.
That is the immediate cause. MPI functions are often difficult to get right the first time; for example, the fault could come from MPI send and receive calls with incorrect sizes or buffer locations.
The best solution is to fire up a parallel debugger so that you can watch all the processes. It looks like you are using a proper HPC system, so there is a chance that one is installed on the system; DDT and TotalView are the most popular.
Take a look at "How to debug an MPI program".
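As a small illustration of how the "EXIT CODE: 11" line can be read, here is a sketch in Python (not part of the original program). It assumes, as the answer above does, that the reported number is a signal number; describe_exit_code is a hypothetical helper name.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Return the name of the signal numbered `code`, or a fallback string."""
    try:
        return signal.Signals(code).name
    except ValueError:
        # The number does not correspond to any signal on this platform.
        return f"no signal numbered {code}"


if __name__ == "__main__":
    # 11 is the value taken from the "EXIT CODE: 11" line in the log above.
    print(describe_exit_code(11))
```

Signal numbers are platform dependent, so the lookup should be run on the same kind of system as the cluster nodes.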

My experience with this problem when writing in C++ with MPI is that it frequently occurred when I did not call MPI_Finalize() before every return statement.

Related

How to check if a process started in the background is still running?

It looks like if you create a subprocess via exec.Cmd and Start() it, the Cmd.Process field is populated right away, however Cmd.ProcessState field remains nil until the process exits.
// ProcessState contains information about an exited process,
// available after a call to Wait or Run.
ProcessState *os.ProcessState
So it looks like I can't actually check the status of a process I Start()ed while it's still running?
It makes no sense to me that ProcessState is only set once the process exits. There's a ProcessState.Exited() method, which will always return true in that case.
So I tried this route instead: the cmd.Process.Pid field exists right after I call cmd.Start(); however, it looks like os.Process doesn't expose any mechanism to check whether the process is running.
os.FindProcess says:
On Unix systems, FindProcess always succeeds and returns a Process for the given pid, regardless of whether the process exists.
which isn't useful, and it seems like there's no way to go from an os.Process to an os.ProcessState unless you .Wait(), which defeats the whole purpose (I want to know whether the process is running before it has exited).
I think you have two reasonable options here:
Spin off a goroutine that waits for the process to exit. When the wait is done, you know the process exited. (Positive: pretty easy to code correctly; negative: you dedicate an OS thread to waiting.)
Use syscall.Wait4() on the published Pid. A Wait4 with syscall.WNOHANG set returns immediately, filling in the status.
It might be nice if there were an exported os or cmd function that did the Wait4 for you and filled in the ProcessState. You could supply WNOHANG or not, as you see fit. But there isn't.
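The second option, a non-blocking wait, can be sketched as follows. The sketch is in Python rather than Go: os.waitpid with os.WNOHANG performs the same kind of non-blocking wait as syscall.Wait4 with WNOHANG. It assumes a Unix system, and start_sleeper, is_running and wait_with_polling are hypothetical helper names used only for illustration.

```python
import os
import sys
import time


def start_sleeper(seconds: float) -> int:
    """Start a child Python process that just sleeps; return its pid."""
    code = f"import time; time.sleep({seconds})"
    return os.spawnv(os.P_NOWAIT, sys.executable, [sys.executable, "-c", code])


def is_running(pid: int) -> bool:
    """Non-blocking liveness check for a direct child process.

    With WNOHANG, waitpid returns (0, 0) while the child is still running.
    Once it returns the child's pid, the child has been reaped, so this
    function must not be called again for that pid.
    """
    reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
    return reaped_pid == 0


def wait_with_polling(pid: int, timeout: float = 10.0, interval: float = 0.05) -> bool:
    """Poll until the child exits; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_running(pid):
            return True
        time.sleep(interval)
    return False


if __name__ == "__main__":
    child = start_sleeper(0.5)
    print("running right after start:", is_running(child))
    print("exited within timeout:", wait_with_polling(child))
```

Note the trade-off this makes visible: the first successful non-blocking wait also collects the exit status, so the caller has to store that status itself if it is needed later.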
The point of ProcessState.Exited() is to distinguish between all the various possibilities, including:
process exited normally (with a status byte)
process died due to receiving an unhandled signal
See the stringer for ProcessState. Note that there are more possibilities than these two ... only there seems to be no way to get the others into a ProcessState. The only calls to syscall.Wait seem to be:
syscall/exec_unix.go: after a failed exec, to collect zombies before returning an error; and
os/exec_unix.go: after a call to p.blockUntilWaitable().
If it were not for the blockUntilWaitable, the exec_unix.go implementation variant for wait() could call syscall.Wait4 with syscall.WNOHANG, but blockUntilWaitable itself ensures that this is pointless (and the goal of this particular wait is to wait for exit anyway).

Python3 How to gracefully shutdown a multiprocess application

I am trying to fix a Python 3 application in which multiple processes and threads are created and controlled by various queues and pipes. I am trying to implement a controlled exit when someone interrupts the program with Ctrl-C. However, no matter what I do, it always hangs right at the end.
I've tried using both the KeyboardInterrupt exception and a signal handler.
The code below is part of the multiprocess code.
import signal
import sys
from multiprocessing import Process, Pipe, JoinableQueue as Queue, Event
class TaskExecutor(Process):
    def __init__(....):
        {inits}
    def signal_handler(self, sig, frame):
        print('TaskExecutor closing')
        self._in_p.close()
        sys.exit(1)
    def run(self):
        signal.signal(signal.SIGINT, self.signal_handler)
        signal.signal(signal.SIGTERM, self.signal_handler)
        while True:
            # Get the Task Group name from the Task queue.
            try:
                ExecCmd = self._in_p.recv()  # type: TaskExecCmd
            except Exception as e:
                self._in_p.close()
                return
            if ExecCmd.Kill:
                self._log.info('{:30} : Kill Command received'.format(self.name))
                self._in_p.close()
                return
            else:
                {other code executing here}
I do get the print above saying that it is closing,
but I still get a lot of different exceptions, which I try to catch without success.
I am looking for documentation on how, and in which order, to shut down the child processes and the main process.
I know it's a very general question; however, it's a very large application, so if there are any questions or anything I could test, I could narrow it down.
Regards
After investigating this issue further, I found that in a situation where I had a pipe thread, a queue thread and 4 multiprocesses running, a number of these processes could end up hanging when the application was terminated with Ctrl-C. The pipe and queue had already been shut down.
The multiprocessing documentation contains a warning about Process.terminate():
Warning If this method is used when the associated process is using a
pipe or queue then the pipe or queue is liable to become corrupted and
may become unusable by other process. Similarly, if the process has
acquired a lock or semaphore etc. then terminating it is liable to
cause other processes to deadlock.
And I think this is what's happening.
I also found that even though I have a shutdown mechanism in my multiprocess class, the threads still running would of course be considered alive (according to is_alive()) even though I know that the run() method had returned, i.e. something internal was hanging.
Now for the solution. By design, my multiprocesses were not daemons, because I wanted to control their shutdown. However, I changed them to daemons so that they would always be killed regardless. I first made any kill signal raise a ProgramKilled exception throughout my entire program.
def signal_handler(signum, frame):
    raise ProgramKilled('Task Executor killed')
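To show how such a handler turns a kill signal into an ordinary exception, here is a minimal single-process sketch. ProgramKilled and signal_handler follow the names used above; run_until_killed and the self-delivered SIGTERM are assumptions made only so the example is self-contained, and the handler must be installed from the main thread.

```python
import signal
import time


class ProgramKilled(Exception):
    """Raised by the signal handler so normal try/except cleanup can run."""


def signal_handler(signum, frame):
    raise ProgramKilled(f"received signal {signum}")


def run_until_killed() -> str:
    """Install the handler, simulate a kill signal, and report what happened."""
    previous = signal.signal(signal.SIGTERM, signal_handler)
    try:
        # Stand-in for an external kill: deliver SIGTERM to this process.
        signal.raise_signal(signal.SIGTERM)
        # Stand-in for the real work loop; the exception interrupts it.
        for _ in range(100):
            time.sleep(0.01)
        return "finished normally"
    except ProgramKilled:
        # This is where pipes would be closed and data saved.
        return "cleaned up after kill signal"
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)


if __name__ == "__main__":
    print(run_until_killed())
```

Because the exception can surface at almost any point in the main thread, every blocking call that should survive a kill signal needs its own except ProgramKilled clause, which is what the loop in this answer does around recv().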
I then changed my shut down mechanism in my multi process class to
while True:
    # Get the Task Group name from the Task queue.
    try:
        # Reading from pipe
        ExecCmd = self._in_p.recv()  # type: TaskExecCmd
    # If fatal error just close it all
    except BrokenPipeError:
        break
    # This can occur; close the pipe and break the loop
    except EOFError:
        self._in_p.close()
        break
    # Exception for when a kill signal is detected
    # Set the multiprocess as killed (just waiting for the kill command from main)
    except ProgramKilled:
        self._log.info('{:30} : Died'.format(self.name))
        self._KilledStatus = True
        continue
    # Kill command from main received
    # Shut down all we can. Ignore exceptions
    if ExecCmd.Kill:
        self._log.info('{:30} : Kill Command received'.format(self.name))
        try:
            self._in_p.close()
            self._out_p.join()
        except Exception:
            pass
        self._log.info('{:30} : Kill Command executed'.format(self.name))
        break
    elif not self._KilledStatus:
        {Execute code}
# When out of the loop set killed event
KilledEvent.set()
And in my main thread I have added the following clean up process.
# Loop through all my resources
for ThreadInterfaces in ResourceThreadDict.values():
    # Test each process in each resource
    for ThreadIf in ThreadInterfaces:
        # Wait for its event to be set
        ThreadIf['KillEvent'].wait()
        # When the event has been received, see if it is hanging.
        # At this point everything has been closed and all data has been purged correctly, so if it is still alive, terminate it.
        if ThreadIf['Thread'].is_alive():
            try:
                psutil.Process(ThreadIf['Thread'].pid).terminate()
            except (psutil.NoSuchProcess, AttributeError):
                pass
After a lot of testing I know it is really hard to control the termination of an app with multiple processes, because you simply do not know in which order your processes receive the signal.
I've tried to save most of my data when the app is killed. Some would ask what I need that data for when manually terminating the app. But this app runs a lot of external scripts and other applications, any of which can lock up the application; then you need to kill it manually while still retaining the information about what has already been executed.
So this is my solution to my current problem, with my current knowledge.
Any input or more in-depth knowledge of what is happening is welcome.
Please note that this app runs on both Linux and Windows.
Regards

R windows GUI halt execution when error

One of my users runs a script in the R GUI on Windows by copying the script and pasting it into the R console. If the user sets some incompatible parameters, the script produces errors, but the rest of it still executes, giving the impression that everything went well. Is there some way to terminate the R session when an error is encountered? Or any other way to stop execution, without terminating the session, as soon as an error occurs?
Just rewrite your script as calls to one or more functions. If this is too much work, you can also just wrap lines of code in a {...} block. Execution will stop at the first encountered error.
halt = function() q('no')
options(error=halt)
will do the job

CreateProcess returns non 0 but GetExitCodeProcess() returns 128

I am creating an application that starts another process using CreateProcess(). In the parent process I use GetExitCodeProcess() to check whether the process is active or not.
Here CreateProcess() is successful (it returned a nonzero value), but GetExitCodeProcess() returns 128 (There are no child processes to wait for). I am not seeing any trace of the child process having started (usually there is some debug output). It happens intermittently.
Any idea what really happened to the child process? Where can we get more information (in the system/application event logs)?
Please guide me.
Thanks,
Naga
Thanks for your comments.
I have found the following MSDN articles, which describe the same symptoms and a resolution for the problem.
Cmd.exe, Perl.exe, or other console-mode applications may fail to initialize properly and terminate prematurely when launched by a service using the CreateProcess() or CreateProcessAsUser() APIs. The calling process has no way of knowing that the launched console-mode application has terminated prematurely.
In some instances, calling GetExitCodeProcess() against the failed process indicates the following exit code:
128L ERROR_WAIT_NO_CHILDREN - There are no child processes to wait for.
http://support.microsoft.com/kb/156484
http://support.microsoft.com/kb/142676/EN-US
http://support.microsoft.com/kb/175687/EN-US
Thanks,
Naga

server using an overlapped named pipe : how to use GetOverlappedResult() with ReadFile()?

I have written a server and a client that use an overlapped named pipe. My problem is mainly with ReadFile() and GetOverlappedResult().
Note that this program is test code. It will be integrated later into a framework (I'm porting Linux code to Unix that uses the AF_UNIX address family for socket connections).
I describe the server part. I have 2 threads :
1) The main thread opens an overlapped named pipe, then loops over WaitForMultipleObjects(). WaitForMultipleObjects() waits for 3 events: the 1st is signaled when a client connects, the 2nd allows me to quit the program cleanly, and the 3rd is signaled when an operation is pending in ReadFile().
2) The second thread is launched when the client is connected. It loops over ReadFile().
Here is the server code:
http://pastebin.com/5rka7dK7
I mainly used the MSDN documentation (named pipe server using overlapped I/O, named pipe client), the SDK, and other documentation on the Internet to write that code. See [1] for the client code. The client code needs some love, but for now I am focusing on making the server work perfectly.
There are 4 functions in the server code (I am leaving out the function that displays error messages):
a) svr_new: it creates the overlapped named pipe and the 3 events, and calls ConnectNamedPipe()
b) svr_del frees all the resources
c) _read_data_cb: the thread that calls ReadFile()
d) the main() function (the main thread), which loops over WaitForMultipleObjects()
My aim is to detect in _read_data_cb() when the client disconnects (ReadFile() fails and GetLastError() returns ERROR_BROKEN_PIPE) and when data comes from the client.
What I don't understand:
Should I call GetOverlappedResult() ?
If yes, where? When ReadFile() fails and GetLastError() returns ERROR_IO_PENDING (line 50 of the paste)? When WaitForMultipleObjects() returns (line 303 of the paste, I commented the code there)? Somewhere else?
I call ResetEvent() on the ReadFile() event when WaitForMultipleObjects() returns (line 302 of the paste). Is that the correct place to call it?
With the code I pasted, here is the result if the client sends these 24 bytes (the ReadFile() buffer is 5 bytes long; I intentionally set that value to test what to do if a client sends data larger than the ReadFile() buffer):
message : "salut, c'est le client !"
output:
$ ./server.exe
waiting for client...
WaitForMultipleObjects : 0
client connected (1)
WaitForMultipleObjects : 2
* ReadFile : 5
WaitForMultipleObjects : 2
* ReadFile : 5
WaitForMultipleObjects : 2
* ReadFile : 5
WaitForMultipleObjects : 2
* ReadFile : 5
WaitForMultipleObjects : 2
* ReadFile : 4
Note: WaitForMultipleObjects() can be called less than that, it seems random.
So, in my code, I do not call GetOverlappedResult(), and ReadFile() succeeds (it reads 5*4 + 4 = 24 bytes), but I don't know when the read operation has finished.
Note: if I add a printf() when ReadFile() fails with ERROR_IO_PENDING, that printf() is called indefinitely.
In addition, the client sends 2 messages: the one above, and another one 3 seconds later. The 2nd message is never read and ReadFile() fails with the error ERROR_SUCCESS (to be precise, ReadFile() returns FALSE and GetLastError() returns ERROR_SUCCESS).
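The 5*4 + 4 = 24 arithmetic can be reproduced with a small sketch. It is written in Python and uses an anonymous pipe from os.pipe() with plain blocking reads, not a Windows overlapped named pipe, so it only illustrates how a 24-byte message is consumed through a 5-byte buffer; read_in_chunks and demo are hypothetical names.

```python
import os

MESSAGE = b"salut, c'est le client !"  # the 24-byte message from the question
BUFFER_SIZE = 5                        # same size as the ReadFile() buffer


def read_in_chunks(fd: int, buffer_size: int) -> list:
    """Read from fd until end of stream, at most buffer_size bytes per read."""
    chunks = []
    while True:
        chunk = os.read(fd, buffer_size)
        if not chunk:
            # An empty read means the writer has closed its end of the pipe.
            break
        chunks.append(chunk)
    return chunks


def demo() -> list:
    """Send MESSAGE through an anonymous pipe and read it back in chunks."""
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, MESSAGE)
    finally:
        os.close(write_fd)
    try:
        return read_in_chunks(read_fd, BUFFER_SIZE)
    finally:
        os.close(read_fd)


if __name__ == "__main__":
    print([len(chunk) for chunk in demo()])
```

In this blocking version the end of the data is signaled by an empty read once the writer closes the pipe; with overlapped I/O that information has to come from the completion result of each operation instead.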
So, I'm completely lost. I have searched hours on Internet, in MSDN, in the SDK code (Server32.c and Client32.c). I still do not know what to do in my specific case.
So, can someone explain to me how to use GetOverlappedResult() (if I have to use it) to check whether the read operation has finished, and where to call it? And, even better, can someone fix my code :-) I provided the code so that everyone can test it (I found a lot of documentation on the Internet, but it is almost always imprecise :-/ )
thank you
[1] http://pastebin.com/fbCH2By8
Take a look at I/O Completion Ports. In my opinion it's the most efficient way to receive and handle notifications about overlapped operations in Windows. So basically you will need to use GetQueuedCompletionStatus and GetQueuedCompletionStatusEx in blocking and non-blocking mode when you're ready to process new completion events, instead of calling GetOverlappedResult from time to time. As a matter of fact, you can even get rid of WaitForMultipleObjects completely.
Also, which flavor of Unix are you targeting? In Solaris there's a very similar abstraction. Check out man port_create.
Unfortunately, there's nothing similar in Linux. Signals (including real-time) can be used to some extent as waitable completion objects, but they are not as comprehensive as the ports in Windows and Solaris.

Resources