ZeroMQ assertion failed: socket handle no longer valid for some reason - windows

Got a Windows 10 c++ program using ZeroMQ that aborts very often on the same group of computers due to assertion failures.
The assert statement is buried deep into the libzmq code.
On other machines, the same program runs fine without those problems (but in all fairness, that's with different OS build numbers and program configurations).
The assertion failure seems to happen because internal zeromq (socket and/or pipe based) connection(s)/handles get unexpectedly closed.
What could possibly cause something like that?
More information:
The assertion failure seems to have something to do with the channels/mailboxes that ZeroMQ uses for internal signaling. In older versions of the library this works with several loopback TCP sockets while modern versions rely on a solution involving IOCP (I/O completion ports).
Here's a long standing and possibly related issue where the original author himself talked about a similar crash that happened to him:
https://github.com/zeromq/libzmq/issues/1108
Working with the crash dumps of our application I see that the stack trace leading to the assert statement usually happens at point right after attempting to read from a socket (or socket file descriptor?). The read or receive action fails and then the library panics.
So, suddenly a socket handle no longer seems valid. Examples of errors that I see are "The resource is temporarily unavailable" and things like "Invalid handle/parameter".
Can it be that something or someone is forcefully closing the socket for us?
What could be causing this behavior?
This happens for an old version of zeromq (4.0.10) as well as a modern one (4.3.5). This leads me to believe that the fault is somewhere else if such different implementations fail roughly the same way.
When trying to reproduce the problem I can trigger a similar assertion failure for 4.0.x by manually force closing an internal TCP connection that ZeroMQ uses with TCPView. The resulting assertion failure is instant and the crash dump looks identical to what happens in the wild.
But the modern version doesn't seem to use loopback sockets, so I couldn't close the "private" connections there. Maybe they are using pipes or unix style sockets instead (which is now possible on Windows 10 I have heard).
For a moment I have considered ephemeral port exhaustion as a reason for all this trouble but that alone doesn't make sense to me: I don't expect the OS to force close existing connections, existing connections should keep working. You'd expect only new connections to fail then.

As #user253751 suggested, the culprit seems to be a particular piece of code in the application that closes the same HANDLE twice. A serious bug in our code, not ZeroMQ!
On Windows, closed handles immediately get reused, so anything that is opened right after the first CloseHandle is at risk of being unexpectely closed when the second CloseHandle strikes, due to the bug.

Related

Automatic reconnect in case of network failures

I am testing .NET version of ZeroMQ to understand how to handle network failures. I put the server (pub socket) to one external machine and debugging the client (sub socket). If I stop my local Wi-Fi connection for seconds, then ZeroMQ automatically recovers and I even get remaining values. However, if I disable Wi-Fi for longer time like a minute, then it just gets stuck on a frame waiting. How can I configure this period when ZeroMQ is still able to recover? And how can I reconnect manually after, say, several minutes? How can I understand that the socket is locked and I need to kill/open again?
Q :" How can I configure this ... ?"
A :Use the .NET versions of zmq_setsockopt() detailed parameter settings - family of link-management parameters alike ZMQ_RECONNECT_IVL, ZMQ_RCVTIMEO and the likes.
All other questions depend on your code.
If using blocking-forms of the .recv()-methods, you can easily throw yourself into unsalvageable deadlocks, best never block your own code ( why one would ever deliberately lose one's own code domain-of-control ).
If in a need to indeed understand low-level internal link-management details, do not hesitate to use zmq_socket_monitor() instrumentation ( if not available in .NET binding, still may use another language to see details the monitor-instance reports about link-state and related events ).
I was able to find an answer on their GitHub https://github.com/zeromq/netmq/issues/845. Seems that the behavior is by design as I got the same with native zmq lib via .NET binding.

Resolve Windows socket error WSAENOBUFS (10055)

Our application has a feature to actively connect to the customers' internal factory network and send a message when inspection events occur. The customer enters the IP address and port number of their machine and application into our software.
I'm using a TClientSocket in blocking mode and have provided callback functions for the OnConnect and OnError events. Assuming the abovementioned feature has been activated, when the application starts I call the following code in a separate thread:
// Attempt active connection
try
m_socketClient.Active := True;
except
end;
// Later...
// If `OnConnect` and socket is connected...send some data!
// If `OnError`...call `m_socketClient.Active := True;` again
When IP + port are valid, the feature works well. But if not, after several thousand errors (and many hours or even days) eventually Windows socket error 10055 (WSAENOBUFS) occurs and the application crashes.
Various articles such as this one from ServerFramework and this one from Microsoft talk about exhausting the Windows non-paged pool and mention (1) actively managing the number outstanding asynchronous send operations and (2) releasing the data buffers that were used for the I/O operations.
My question is how to achieve this and is three-fold:
A) Am I doing something wrong that memory is being leaked? For example, is there some missing cleanup code in the OnError handler?
B) How do you monitor I/O buffers to see if they are being exhausted? I've used Process Explorer to confirm my application is the cause of the leak, but ideally I'd need some programmatic way of measuring this.
C) Apart from restarting the application, is there a way to ask Windows to clear out or release I/O operation data buffers?
Code samples in Delphi, C/C++, C# fine.
A) The cause of the resource leak was a programming error. When the OnError event occurs, Socket.Close() should be called to release low-level resources associated with the socket.
B) The memory leak does not show up in the standard Working Set memory use of the process. Open handles belonging to your process need to be monitored which is possible with GetProcessHandleCount. See this answer in Delphi which was tested and works well. This answer in C++ was not tested but the answer is accepted so should work. Of course, you should be able to use GetProcessHandleCount directly in C++.
C) After much research, I must conclude that just like a normal memory leak, you cannot just ask Windows to "clean up" after you! The handle resource has been leaked by your application and you must find and fix the cause (see A and B above).

Getting specific errors when TCP connections disconnect in Windows

I'm trying to improve the usefulness of the error reporting in a server I am working on. The server uses TCP sockets, and it runs on Windows.
The problem is that when a TCP link drops due to some sort of network failure, the error code that I can get from WSARecv() (or the other Windows socket APIs) is not very descriptive. For most network hiccups, I get either WSAECONNRESET (10054) or WSAETIMEDOUT (10060). But there are about a million things that can cause both of these: the local machine is having a problem, the remote machine or process is having a problem, some intermediate router has a problem, etc. This is a problem because the server operator doesn't have a definitive way to investigate the problem, because they don't necessarily even know where the problem is, or who might be responsible.
At the IP level, it's a different story. If the server operator happens to have a network sniffer attached when something bad happens, it's usually pretty easy to sort of what went wrong. For instance, if an intermediate router sent an ICMP unreachable, the router that sent it will put its IP address in there, and that's usually enough to track it down. Put another way, Windows killed the connection for a reason, probably because it got a specific packet that had a specific problem.
However, a large number of failures are experienced in the field, unexpected. It is not realistic to always have a network sniffer attached to a production server. There needs to be a way to track down problems that happen only rarely, intermittently, or randomly.
How can I solve this problem programmatically?
Is there a way to get Windows to cough up a more specific error message? Is there some easy way to capture and mine recent Windows events (perhaps the one Microsoft Network Monitor uses)? One way I've "solved it" before is to keep dumpcap (from Wireshark) running in ring buffer mode, and force it to stop capturing when a bad event happens, that I can mine later.
I'm also open to the possibility that this is not the right way to solve this problem. For instance, perhaps there is some special Windows mode that can be turned on to cause it to log useful information, that a network administrator could use to track this down after-the-fact.

Boost.Asio SSL ungraceful close

I am trying to handle SSL error scenarios where, for example, SSL async_handshake() is taking too long.
After some time (say 20sec) i want to close this connection (lowest_layer().close()).
I pass shared_ptr with connection object as a parameter to async_handshake(), so object still exists, eventually handshake handler is invoked and object gets destroyed.
But, still I'm getting sporadic crashes! Looks like after close() SSL is still trying to read or operate on read buffer.
So, the basic question - is it safe to hard close() SSL connection?
Any ideas?
Typically the method I've used stop outstanding asynchronous operations on a socket is socket::cancel as described in the documentation. Their handlers will be invoked with asio::error::operation_aborted as the error parameter, which you'll need to handle somehow.
That said, I don't see a problem using close instead of cancel. Though it is difficult to offer much help or advice without some code to analyze.
Note that some Windows platforms have problems when canceling outstanding asynchronous operations. The documentation has suggestions for portable cancelation if your application needs to support Windows.

Meaning/cause of RPC Exception 'No interfaces have been exported.'

We have a fairly standard client/server application built using MS RPC. Both client and server are implemented in C++. The client establishes a session to the server, then makes repeated calls to it over a period of time before finally closing the session.
Periodically, however, especially under heavy load conditions, we are seeing an RPC exception show up with code 1754: RPC_S_NOTHING_TO_EXPORT.
It appears that this happens in the middle of a session. The user is logged on for a while, making successful calls, then one of the calls inexplicably returns this error. As far as we can tell, the server receives no indication that anything went wrong - and it definitely doesn't see the call the client made.
The error code appears to have permanent implications, as well. Having the client retry the connection doesn't work, either. However, if the user has multiple user sessions active simultaneously between the same client and server, the other connections are unaffected.
In essence, I have two questions:
Does anyone know what RPC_S_NOTHING_TO_EXPORT means? The MSDN documentation simply says: "No interfaces have been exported." ... Huh? The session was working fine for numerous instances of the same call up until this point...
Does anyone have any ideas as to how to identify the real problem? Note: Capturing network traffic is something we would rather avoid, if possible, as the problem is sporadic enough that we would likely go through multiple gigabytes of traffic before running into an occurrence.
Capturing network traffic would be one of the best ways to tackle this issue. If you can't do that, could you dump the client process and debug with WinDBG or Visual Studio? Perhaps compare a dump when operating normally versus in the error state?

Resources