Is there any way to catch a fatal error triggered within cgo code? - go

Our applications use the odbc driver to access an Impala database. We've discovered that, in certain hard-to-reproduce situations, the driver triggers a segfault within its cgo code, which surfaces as a fatal error once it propagates back up through the driver to our code. Since we want some cleanup and alerting to happen in these situations, I implemented a deferred panic catcher, hoping it might catch them.
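For reference, a deferred panic catcher of the kind described might look like the following minimal Go sketch (the helper name safeCall is hypothetical, not taken from the question). recover() only stops a panic that is unwinding the goroutine that called it; it has no effect on a runtime fatal error, which ends the process without running deferred functions.

```go
package main

import (
	"fmt"
)

// safeCall runs f and converts an ordinary Go panic into an error.
// It cannot intercept a Go runtime "fatal error" (for example, one caused by
// a SIGSEGV inside C code called through cgo): the runtime prints the fatal
// error and exits without running deferred functions such as this one.
func safeCall(f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	f()
	return nil
}

func main() {
	// An ordinary panic is caught and turned into an error.
	fmt.Println(safeCall(func() { panic("simulated driver failure") }))

	// A call that does not panic returns a nil error.
	fmt.Println(safeCall(func() {}))
}
```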
However, it isn't working. The fatal error goes straight past the deferred function containing the recover() call (so apparently it's not a panic, despite the printed output looking similar), although that function does catch other panics. A GitHub issue suggests that signals raised in cgo code cannot be caught, and that applications should crash gracelessly and immediately when one occurs. This is an unacceptable crash case for our production applications, so I'm wondering whether that has changed in the last 6 years, or whether anyone knows of another way to run some cleanup code when a cgo signal occurs. It seems like extremely poor design to have no way at all to catch and handle these fatal errors.

Related

COMGLB_EXCEPTION_DONOT_HANDLE_ANY does not always work for unhandled C++ exceptions

TL;DR
(Visual Studio 2019, Windows 10; so far tested on 1809 LTSC, because this is my dev machine)
We have an out-of-process COM server
We set COMGLB_EXCEPTION_DONOT_HANDLE_ANY
"Fatal" SEH exceptions are handled OK.
"Non-Fatal" SEH exceptions, among them C++ exceptions, are randomly swallowed or handled by the COM/RPC runtime stack.
Does COMGLB_EXCEPTION_DONOT_HANDLE_ANY work reliably? Are there any additional settings?
Necessary Background
When using COM, the RPC layer catches (and possibly swallows) all structured (SEH) exceptions, which include C++ exceptions. Raymond Chen explains this very well:
Historically, COM placed a giant try/except around your server’s methods. If your server encountered what would normally be an unhandled exception, the giant try/except would catch it and turn it into the error RPC_E_SERVERFAULT. It then marked the exception as handled, so that the server remained running ... Mind you, this was actually a disservice
Now there is a supposed solution, namely IGlobalOptions with COMGLB_EXCEPTION_HANDLING set to COMGLB_EXCEPTION_DONOT_HANDLE_ANY.
This is supposed to (to quote The Old New Thing):
... then go ahead and let the process crash.” In Windows 7, you can ask for the even stronger COMGLB_EXCEPTION_DONOT_HANDLE_ANY, which means “Don’t even try to catch ‘nonfatal’ exceptions.”
You can even find this recommendation in the docs:
It's important for applications that detect crashes and other exceptions that might be generated while executing inbound COM calls, ... to set COMGLB_EXCEPTION_HANDLING to COMGLB_EXCEPTION_DONOT_HANDLE to disable COM behavior of catching exceptions.
And the option is explained as:
COMGLB_EXCEPTION_DONOT_HANDLE_ANY:
When set and a fatal exception occurs in a COM method, this causes the COM runtime to not handle the exception. (caveat A)
When set and a non-fatal exception occurs in a COM method, this causes the COM runtime to create a Windows Error Reporting (WER) dump and terminate the process. Supported in Windows 7 and later. (caveat B)
And here's the thing
Neither of the two statements above is really accurate; specifically, for any non-fatal exception (which C++ exceptions are), we get random behavior:
I have set up a simple client/server COM test program in VS2019 that intentionally throws an unhandled C++ exception. At runtime it shows one of two behaviors, seemingly at random:
The server is terminated by the COM/RPC stack, and we get an event ID 1000 entry in the event log with exception code 0xe06d7363 (and a WER dump is written). In this case the client gets the HRESULT 0x800706BE.
This is the advertised behavior.
When starting the client -> server a second (or third, ...) time, the C++ exception DOES NOT terminate the server, and the client gets 0xe06d7363 as the HRESULT of its server call. No event log entry is written!
For the "fatal" SEH exceptions, termination happens reliably, but not for the non-fatal ones.
What is going on here?

ZeroMQ assertion failed: socket handle no longer valid for some reason

I have a Windows 10 C++ program using ZeroMQ that frequently aborts with assertion failures on the same group of computers.
The assert statement is buried deep in the libzmq code.
On other machines, the same program runs without these problems (but, in all fairness, those have different OS build numbers and program configurations).
The assertion failure seems to happen because ZeroMQ's internal connections or handles (socket- and/or pipe-based) get unexpectedly closed.
What could possibly cause something like that?
More information:
The assertion failure seems to have something to do with the channels/mailboxes that ZeroMQ uses for internal signaling. In older versions of the library this works with several loopback TCP sockets while modern versions rely on a solution involving IOCP (I/O completion ports).
Here's a long standing and possibly related issue where the original author himself talked about a similar crash that happened to him:
https://github.com/zeromq/libzmq/issues/1108
In our application's crash dumps, the stack trace leading to the assert statement usually shows the failure right after an attempt to read from a socket (or socket file descriptor?). The read or receive fails, and then the library aborts.
So, suddenly a socket handle no longer seems valid. Examples of errors that I see are "The resource is temporarily unavailable" and things like "Invalid handle/parameter".
Can it be that something or someone is forcefully closing the socket for us?
What could be causing this behavior?
This happens with an old version of ZeroMQ (4.0.10) as well as a modern one (4.3.5). Since such different implementations fail in roughly the same way, I believe the fault lies elsewhere.
When trying to reproduce the problem, I can trigger a similar assertion failure in 4.0.x by manually force-closing one of ZeroMQ's internal TCP connections with TCPView. The resulting assertion failure is instant, and the crash dump looks identical to what happens in the wild.
The modern version, however, doesn't seem to use loopback sockets, so I couldn't close its "private" connections. Maybe it uses pipes or Unix-domain sockets instead (which I have heard are now possible on Windows 10).
For a moment I considered ephemeral port exhaustion as the cause of all this trouble, but that alone doesn't make sense to me: I don't expect the OS to force-close existing connections, which should keep working. Only new connections should fail in that case.
As @user253751 suggested, the culprit seems to be a particular piece of code in the application that closes the same HANDLE twice. A serious bug in our code, not in ZeroMQ!
On Windows, closed handles get reused immediately, so anything opened right after the first CloseHandle is at risk of being unexpectedly closed when the buggy second CloseHandle runs.

Indy idHTTP continue execution after error

I have an Indy TIdHTTP component that is called repeatedly from a timer (4-5 times a second).
My internet connection is poor, so occasionally timeouts or garbage responses cause errors in the TIdHTTP component.
I have a try/except/finally block around the component, but when an error occurs, an error is shown and execution stops.
I know what causes the errors (my poor internet connection); what I want is to ignore the invalid response or error and just continue, so that my program doesn't break.
I'm getting these errors because of a poor internet connection, I can't fix that. The code is used to access the Betfair API so advising Betfair won't help.

Detect UI operation which will "hang" the application if running in service mode

Fellow experts!
I am facing the following dilemma: some of our tools (executables) are started as scheduled tasks, some as services, and others as regular desktop apps with an interactive Windows user. We use a code-sharing strategy for source management (this is not up for debate in this question).
So the solution I want to find is the following:
Detect, at run time, any UI operation that would hang a service or background task (such as a call to Application.ShowException, ShowMessage, MessageDialog, TForm.Show, etc.), and raise an exception instead when one is detected. The operation then fails and we get a stack trace, but the process does not hang! The most problematic hang occurs when event processing runs inside a transaction and the code handling the event unexpectedly executes UI code (because of an error in the code, the design, or whatever): the process hangs, and parts of the database can remain locked!
What I think I need to do is use the DDetours library to intercept WinAPI calls to certain routines and raise an exception instead (so that the process does not hang, but just fails in some method). I also know that creating forms and windows does not hang the app; only attempting to show them to the user does.
Is there a known method for handling this problem? Or is there perhaps a list of WinAPI routines that hang in service mode?
Thank you in advance.

What errors / exceptions trigger Windows Error Reporting?

When running a Delphi application outside the debugger, most exceptions that occur (such as an access violation) seem to be silently ignored. Sometimes, however, the Windows Error Reporting dialog appears (send or don't send, you probably know what I mean). What exactly does this mean? Which errors trigger this behavior?
Additional info: I have a global exception handler for my application that should log all unhandled exceptions. So, no exceptions should leave the application unhandled.
Thanks.
Most exceptions are not silently ignored when running outside the debugger. They are normally caught by the event loop in VCL applications, or fall through to the main begin/end in console applications, and so on. The default action of the VCL event loop is to display a dialog containing the message associated with the exception.
It is when an exception escapes the application, either by reaching the main begin/end without being caught or by not being caught by the event loop, that Windows Error Reporting steps in. Functionally, WER is an exception handler like any other, except that it sits at the very base of the stack.
It covers exceptions that are not handled by the application: if an exception propagates outside the main entry point of the app, WER steps in. This includes access violations (AVs), division by zero, invalid handle access, and other out-of-band or "chip" exceptions. Your code can sometimes attempt to handle these, but if memory is corrupted too badly (or something similar), your code will die.
You will generally run into problems if exceptions in threads are not handled in the Execute method. The program will usually be killed, but the behavior is unpredictable and seems to depend on many things (such as the number and state of other threads). Often the main window vanishes immediately, so any further exceptions are not handled by the program; this is probably what causes WER to catch them.
I made it a habit to have an outer exception handler in Execute that logs any unhandled exceptions and allows the thread to terminate cleanly.
Exceptions occurring in initialization and finalization sections would escape your global exception handler and trigger WER.
