I am trying to handle SSL error scenarios where, for example, SSL async_handshake() is taking too long.
After some time (say 20sec) i want to close this connection (lowest_layer().close()).
I pass shared_ptr with connection object as a parameter to async_handshake(), so object still exists, eventually handshake handler is invoked and object gets destroyed.
But, still I'm getting sporadic crashes! Looks like after close() SSL is still trying to read or operate on read buffer.
So, the basic question - is it safe to hard close() SSL connection?
Any ideas?
Typically the method I've used stop outstanding asynchronous operations on a socket is socket::cancel as described in the documentation. Their handlers will be invoked with asio::error::operation_aborted as the error parameter, which you'll need to handle somehow.
That said, I don't see a problem using close instead of cancel. Though it is difficult to offer much help or advice without some code to analyze.
Note that some Windows platforms have problems when canceling outstanding asynchronous operations. The documentation has suggestions for portable cancelation if your application needs to support Windows.
Related
Got a Windows 10 c++ program using ZeroMQ that aborts very often on the same group of computers due to assertion failures.
The assert statement is buried deep into the libzmq code.
On other machines, the same program runs fine without those problems (but in all fairness, that's with different OS build numbers and program configurations).
The assertion failure seems to happen because internal zeromq (socket and/or pipe based) connection(s)/handles get unexpectedly closed.
What could possibly cause something like that?
More information:
The assertion failure seems to have something to do with the channels/mailboxes that ZeroMQ uses for internal signaling. In older versions of the library this works with several loopback TCP sockets while modern versions rely on a solution involving IOCP (I/O completion ports).
Here's a long standing and possibly related issue where the original author himself talked about a similar crash that happened to him:
https://github.com/zeromq/libzmq/issues/1108
Working with the crash dumps of our application I see that the stack trace leading to the assert statement usually happens at point right after attempting to read from a socket (or socket file descriptor?). The read or receive action fails and then the library panics.
So, suddenly a socket handle no longer seems valid. Examples of errors that I see are "The resource is temporarily unavailable" and things like "Invalid handle/parameter".
Can it be that something or someone is forcefully closing the socket for us?
What could be causing this behavior?
This happens for an old version of zeromq (4.0.10) as well as a modern one (4.3.5). This leads me to believe that the fault is somewhere else if such different implementations fail roughly the same way.
When trying to reproduce the problem I can trigger a similar assertion failure for 4.0.x by manually force closing an internal TCP connection that ZeroMQ uses with TCPView. The resulting assertion failure is instant and the crash dump looks identical to what happens in the wild.
But the modern version doesn't seem to use loopback sockets, so I couldn't close the "private" connections there. Maybe they are using pipes or unix style sockets instead (which is now possible on Windows 10 I have heard).
For a moment I have considered ephemeral port exhaustion as a reason for all this trouble but that alone doesn't make sense to me: I don't expect the OS to force close existing connections, existing connections should keep working. You'd expect only new connections to fail then.
As #user253751 suggested, the culprit seems to be a particular piece of code in the application that closes the same HANDLE twice. A serious bug in our code, not ZeroMQ!
On Windows, closed handles immediately get reused, so anything that is opened right after the first CloseHandle is at risk of being unexpectely closed when the second CloseHandle strikes, due to the bug.
Our application has a feature to actively connect to the customers' internal factory network and send a message when inspection events occur. The customer enters the IP address and port number of their machine and application into our software.
I'm using a TClientSocket in blocking mode and have provided callback functions for the OnConnect and OnError events. Assuming the abovementioned feature has been activated, when the application starts I call the following code in a separate thread:
// Attempt active connection
try
m_socketClient.Active := True;
except
end;
// Later...
// If `OnConnect` and socket is connected...send some data!
// If `OnError`...call `m_socketClient.Active := True;` again
When IP + port are valid, the feature works well. But if not, after several thousand errors (and many hours or even days) eventually Windows socket error 10055 (WSAENOBUFS) occurs and the application crashes.
Various articles such as this one from ServerFramework and this one from Microsoft talk about exhausting the Windows non-paged pool and mention (1) actively managing the number outstanding asynchronous send operations and (2) releasing the data buffers that were used for the I/O operations.
My question is how to achieve this and is three-fold:
A) Am I doing something wrong that memory is being leaked? For example, is there some missing cleanup code in the OnError handler?
B) How do you monitor I/O buffers to see if they are being exhausted? I've used Process Explorer to confirm my application is the cause of the leak, but ideally I'd need some programmatic way of measuring this.
C) Apart from restarting the application, is there a way to ask Windows to clear out or release I/O operation data buffers?
Code samples in Delphi, C/C++, C# fine.
A) The cause of the resource leak was a programming error. When the OnError event occurs, Socket.Close() should be called to release low-level resources associated with the socket.
B) The memory leak does not show up in the standard Working Set memory use of the process. Open handles belonging to your process need to be monitored which is possible with GetProcessHandleCount. See this answer in Delphi which was tested and works well. This answer in C++ was not tested but the answer is accepted so should work. Of course, you should be able to use GetProcessHandleCount directly in C++.
C) After much research, I must conclude that just like a normal memory leak, you cannot just ask Windows to "clean up" after you! The handle resource has been leaked by your application and you must find and fix the cause (see A and B above).
I'm on a network that usually causes a ton of connection timeout issues, and ocasionally I'm running into read timeout issues as well. Retrying the code whenever a connect timeout happens fixes the problem with connecting to the server. Is is safe to retry the code whenever I get a read_timeout, or whould the response become corrupted? I'm using Ruby, with Net::HTTP client, but I guess this could apply to other languages as well.
A read_timeout means that the server did not send any data within the expected timeout. The response becoming corrupted is less likely as this is TCP.
To answer if it's safe or not to retry depends on what operation you're performing and/or any guarantees the service you're interacting with gives you.
In general GET should be safe to retry.
POST/PUT may need special handling (i.e. rereading some state before deciding to retry) as this usually means that something changes on the server.
MSDN article on CoRevokeGetClassObject() says that when the COM server calls it the class object referenced by clients is not released. Then the following comes:
If other clients still have pointers to the class object and have caused the reference >count to be incremented by calls to IUnknown::AddRef, the reference count will not be >zero. When this occurs, applications may benefit if subsequent calls (with the obvious >exceptions of IUnknown::AddRef and IUnknown::Release) to the class object fail.
What is meant by "applications may benefit"? The class object is not released, but creation requests fail. Sounds reasonable but where's the benefit?
Yeah, it's a pretty strange turn of words...
I think what they're trying to say is that clients may end up in a tricky situation if they create objects from a server that just called CoRevokeClassObjects, because it's likely it'll disappear very soon (CoRevokeClassObjects is routinely called when a server is shut down.)
So, if the activation calls (IClassFactory::CreateInstance) don't fail, the client will get an interface pointer back, and as soon as they call a method on it, they'll get an error from the RPC layer that the server is gone.
I suppose that's 'beneficial' in some way :-)
That said, I'm not sure how to detect the case where IUnknown::Release is called via CoRevokeClassObjects vs some other client, but I suppose the code revoking the factories could set some global state or per-factory state that they can check before letting creation requests come through.
We have a fairly standard client/server application built using MS RPC. Both client and server are implemented in C++. The client establishes a session to the server, then makes repeated calls to it over a period of time before finally closing the session.
Periodically, however, especially under heavy load conditions, we are seeing an RPC exception show up with code 1754: RPC_S_NOTHING_TO_EXPORT.
It appears that this happens in the middle of a session. The user is logged on for a while, making successful calls, then one of the calls inexplicably returns this error. As far as we can tell, the server receives no indication that anything went wrong - and it definitely doesn't see the call the client made.
The error code appears to have permanent implications, as well. Having the client retry the connection doesn't work, either. However, if the user has multiple user sessions active simultaneously between the same client and server, the other connections are unaffected.
In essence, I have two questions:
Does anyone know what RPC_S_NOTHING_TO_EXPORT means? The MSDN documentation simply says: "No interfaces have been exported." ... Huh? The session was working fine for numerous instances of the same call up until this point...
Does anyone have any ideas as to how to identify the real problem? Note: Capturing network traffic is something we would rather avoid, if possible, as the problem is sporadic enough that we would likely go through multiple gigabytes of traffic before running into an occurrence.
Capturing network traffic would be one of the best ways to tackle this issue. If you can't do that, could you dump the client process and debug with WinDBG or Visual Studio? Perhaps compare a dump when operating normally versus in the error state?