Resolve Windows socket error WSAENOBUFS (10055)

Our application has a feature to actively connect to the customers' internal factory network and send a message when inspection events occur. The customer enters the IP address and port number of their machine and application into our software.
I'm using a TClientSocket in blocking mode and have provided callback functions for the OnConnect and OnError events. Assuming the abovementioned feature has been activated, when the application starts I call the following code in a separate thread:
// Attempt active connection
try
  m_socketClient.Active := True;
except
end;
// Later...
// If `OnConnect` and socket is connected...send some data!
// If `OnError`...call `m_socketClient.Active := True;` again
When IP + port are valid, the feature works well. But if not, after several thousand errors (and many hours or even days) eventually Windows socket error 10055 (WSAENOBUFS) occurs and the application crashes.
Various articles such as this one from ServerFramework and this one from Microsoft talk about exhausting the Windows non-paged pool and mention (1) actively managing the number of outstanding asynchronous send operations and (2) releasing the data buffers that were used for the I/O operations.
My question is how to achieve this, and it is three-fold:
A) Am I doing something wrong that memory is being leaked? For example, is there some missing cleanup code in the OnError handler?
B) How do you monitor I/O buffers to see if they are being exhausted? I've used Process Explorer to confirm my application is the cause of the leak, but ideally I'd need some programmatic way of measuring this.
C) Apart from restarting the application, is there a way to ask Windows to clear out or release I/O operation data buffers?
Code samples in Delphi, C/C++, or C# are fine.

A) The cause of the resource leak was a programming error. When the OnError event occurs, Socket.Close() should be called to release low-level resources associated with the socket.
B) The memory leak does not show up in the standard Working Set memory use of the process. Instead, the open handles belonging to your process need to be monitored, which is possible with GetProcessHandleCount. See this answer in Delphi, which was tested and works well. This answer in C++ was not tested, but it is the accepted answer, so it should work. Of course, you should be able to use GetProcessHandleCount directly in C++.
C) After much research, I must conclude that just like a normal memory leak, you cannot just ask Windows to "clean up" after you! The handle resource has been leaked by your application and you must find and fix the cause (see A and B above).
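As a minimal sketch of point B in C++: the Windows branch below calls GetProcessHandleCount on the current process, and the non-Windows branch is only a stand-in (it counts open file descriptors under /dev/fd) so the sketch can be tried elsewhere. The helper names `open_handle_count` and `handle_growth` are hypothetical and do not come from the answers linked above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

#ifdef _WIN32
#include <windows.h>
// Number of kernel handles currently open in this process (0 on failure).
std::size_t open_handle_count() {
    DWORD count = 0;
    return GetProcessHandleCount(GetCurrentProcess(), &count) ? count : 0;
}
#else
#include <dirent.h>
// Stand-in for non-Windows builds: count open file descriptors via /dev/fd.
std::size_t open_handle_count() {
    std::size_t n = 0;
    if (DIR* dir = opendir("/dev/fd")) {
        while (dirent* entry = readdir(dir)) {
            if (entry->d_name[0] != '.') ++n;  // skip "." and ".."
        }
        closedir(dir);
    }
    return n;
}
#endif

// Runs `attempt` `rounds` times and returns the net change in open handles.
// A result that grows with `rounds` points at a leak inside `attempt`.
template <class Attempt>
long handle_growth(Attempt attempt, int rounds) {
    assert(rounds >= 0);
    const long before = static_cast<long>(open_handle_count());
    for (int i = 0; i < rounds; ++i) attempt();
    return static_cast<long>(open_handle_count()) - before;
}
```

In the situation from the question, `attempt` would be one failed connect cycle. With the Socket.Close() fix from point A in place, the growth should stay flat however many cycles are run; logging the count periodically gives the programmatic measurement asked for in B.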

Related

Automatic reconnect in case of network failures

I am testing the .NET version of ZeroMQ to understand how to handle network failures. I put the server (pub socket) on an external machine and am debugging the client (sub socket). If I stop my local Wi-Fi connection for a few seconds, ZeroMQ recovers automatically and I even get the remaining values. However, if I disable Wi-Fi for a longer time, such as a minute, it just gets stuck waiting on a frame. How can I configure the period during which ZeroMQ is still able to recover? And how can I reconnect manually after, say, several minutes? How can I tell that the socket is locked and that I need to kill and reopen it?
Q: "How can I configure this ... ?"
A: Use the .NET equivalents of zmq_setsockopt() to set the detailed link-management parameters, such as ZMQ_RECONNECT_IVL, ZMQ_RCVTIMEO and the like.
All other questions depend on your code.
If you use the blocking forms of the .recv() methods, you can easily end up in an unsalvageable deadlock. It is best never to block your own code (why would one deliberately give up control of one's own code?).
If you really need to understand the low-level link-management details, use the zmq_socket_monitor() instrumentation (if it is not available in the .NET binding, you can still use another language to see what the monitor instance reports about link state and related events).
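The advice about not blocking can be sketched independently of any binding: receive with a timeout, count consecutive silent intervals, and rebuild the socket once a threshold is reached. Everything named here (`Link`, `try_recv`, `reopen`, `pump`, `max_silent`) is a hypothetical stand-in rather than ZeroMQ API; with libzmq, the timed receive would typically come from setting ZMQ_RCVTIMEO and treating a timeout as "no message".

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for a subscriber socket; not part of any ZeroMQ binding.
struct Link {
    // Returns true and fills `out` if a message arrived before the timeout.
    std::function<bool(std::string& out)> try_recv;
    // Closes the current socket and creates and connects a fresh one.
    std::function<void()> reopen;
};

// Polls the link `polls` times. After `max_silent` consecutive timeouts the
// link is reopened and the silence counter restarts.
// Returns how many times the link was reopened.
int pump(Link& link, int polls, int max_silent,
         const std::function<void(const std::string&)>& on_message) {
    assert(max_silent > 0);
    int silent = 0;
    int reopened = 0;
    for (int i = 0; i < polls; ++i) {
        std::string msg;
        if (link.try_recv(msg)) {
            silent = 0;  // traffic seen, link is alive
            on_message(msg);
        } else if (++silent >= max_silent) {
            link.reopen();  // manual reconnect after prolonged silence
            ++reopened;
            silent = 0;
        }
    }
    return reopened;
}
```

Because the loop never blocks for longer than one receive timeout, the caller keeps control and can decide when silence has lasted long enough to justify a manual reconnect.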
I was able to find an answer on their GitHub: https://github.com/zeromq/netmq/issues/845. It seems that the behavior is by design, as I got the same result with the native zmq library via a .NET binding.

ZeroMQ assertion failed: socket handle no longer valid for some reason

I have a Windows 10 C++ program using ZeroMQ that aborts very often on the same group of computers due to assertion failures.
The assert statement is buried deep into the libzmq code.
On other machines, the same program runs fine without those problems (but in all fairness, that's with different OS build numbers and program configurations).
The assertion failure seems to happen because internal zeromq (socket and/or pipe based) connection(s)/handles get unexpectedly closed.
What could possibly cause something like that?
More information:
The assertion failure seems to have something to do with the channels/mailboxes that ZeroMQ uses for internal signaling. In older versions of the library this works with several loopback TCP sockets while modern versions rely on a solution involving IOCP (I/O completion ports).
Here's a long standing and possibly related issue where the original author himself talked about a similar crash that happened to him:
https://github.com/zeromq/libzmq/issues/1108
Working with the crash dumps of our application, I see that the stack trace leading to the assert statement usually ends at a point right after an attempt to read from a socket (or socket file descriptor?). The read or receive action fails and then the library panics.
So, suddenly a socket handle no longer seems valid. Examples of errors that I see are "The resource is temporarily unavailable" and things like "Invalid handle/parameter".
Can it be that something or someone is forcefully closing the socket for us?
What could be causing this behavior?
This happens for an old version of zeromq (4.0.10) as well as a modern one (4.3.5). This leads me to believe that the fault is somewhere else if such different implementations fail roughly the same way.
When trying to reproduce the problem I can trigger a similar assertion failure for 4.0.x by manually force closing an internal TCP connection that ZeroMQ uses with TCPView. The resulting assertion failure is instant and the crash dump looks identical to what happens in the wild.
But the modern version doesn't seem to use loopback sockets, so I couldn't close the "private" connections there. Maybe they are using pipes or unix style sockets instead (which is now possible on Windows 10 I have heard).
For a moment I have considered ephemeral port exhaustion as a reason for all this trouble but that alone doesn't make sense to me: I don't expect the OS to force close existing connections, existing connections should keep working. You'd expect only new connections to fail then.
As @user253751 suggested, the culprit seems to be a particular piece of code in the application that closes the same HANDLE twice. A serious bug in our code, not ZeroMQ!
On Windows, the value of a closed handle can be reused immediately, so anything that is opened right after the first CloseHandle is at risk of being unexpectedly closed when the second, erroneous CloseHandle strikes.
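One common way to rule out this class of bug is to give each handle a single owner that can close it at most once. The following is a minimal sketch, not the application's actual fix: `UniqueHandle` is a hypothetical wrapper, and on Windows its parameters would typically be HANDLE, the invalid value appropriate for the API that produced the handle, and a closer that calls CloseHandle.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Owns one handle value and passes it to `closer` at most once.
// `invalid` is the sentinel meaning "no handle owned".
template <class Handle>
class UniqueHandle {
public:
    UniqueHandle(Handle h, Handle invalid, std::function<void(Handle)> closer)
        : handle_(h), invalid_(invalid), closer_(std::move(closer)) {}

    // Copying would create two owners of one handle, so it is forbidden.
    UniqueHandle(const UniqueHandle&) = delete;
    UniqueHandle& operator=(const UniqueHandle&) = delete;

    // Moving transfers ownership and leaves the source empty.
    UniqueHandle(UniqueHandle&& other)
        : handle_(other.handle_), invalid_(other.invalid_),
          closer_(std::move(other.closer_)) {
        other.handle_ = other.invalid_;
    }

    ~UniqueHandle() { close(); }

    // Safe to call repeatedly: only the first call reaches the closer.
    void close() {
        if (handle_ != invalid_) {
            Handle h = handle_;
            handle_ = invalid_;  // forget the value before closing it
            closer_(h);
        }
    }

    Handle get() const { return handle_; }

private:
    Handle handle_;
    Handle invalid_;
    std::function<void(Handle)> closer_;
};
```

Since the stored value is reset before the closer runs, a second close() is a no-op and can never hit a handle value that the OS has already handed out to someone else.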

Two-way communication between kernel-mode driver and user-mode application?

I need a two-way communication between a kernel-mode WFP driver and a user-mode application. The driver initiates the communication by passing a URL to the application which then does a categorization of that URL (Entertainment, News, Adult, etc.) and passes that category back to the driver. The driver needs to know the category in the filter function because it may block certain web pages based on that information. I had a thread in the application that was making an I/O request that the driver would complete with the URL and a GUID, and then the application would write the category into the registry under that GUID where the driver would pick it up. Unfortunately, as the driver verifier pointed out, this is unstable because the Zw registry functions have to run at PASSIVE_LEVEL. I was thinking about trying the same thing with mapped memory buffers, but I’m not sure what the interrupt requirements are for that. Also, I thought about lowering the interrupt level before the registry function calls, but I don't know what the side effects of that are.
You just need to have two different kinds of I/O request.
If you're using DeviceIoControl to retrieve the URLs (I think this would be the most suitable method) this is as simple as adding a second I/O control code.
If you're using ReadFile or equivalent, things would normally get a bit messier, but as it happens in this specific case you only have two kinds of operations, one of which is a read (driver->application) and the other of which is a write (application->driver). So you could just use WriteFile to send the reply, including of course the GUID so that the driver can match up your reply to the right query.
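The matching step mentioned above (pairing each reply with the right query) can be sketched in plain user-mode C++. The structure layouts, the 64-bit `id` used in place of a GUID, and the `PendingQueries` class are assumptions made for illustration; a real driver would keep this table in kernel memory under an appropriate lock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical message layouts; the 64-bit id stands in for the GUID.
struct UrlQuery   { std::uint64_t id; std::string url; };       // driver -> application
struct UrlVerdict { std::uint64_t id; std::string category; };  // application -> driver

class PendingQueries {
public:
    // "Read" direction: register a URL and produce the query to hand out.
    UrlQuery submit(const std::string& url) {
        UrlQuery q{next_id_++, url};
        pending_[q.id] = url;
        return q;
    }

    // "Write" direction: accept a verdict only if its id is still pending.
    // On success, stores the original URL in *url_out and retires the entry.
    bool complete(const UrlVerdict& v, std::string* url_out) {
        auto it = pending_.find(v.id);
        if (it == pending_.end()) return false;  // unknown or duplicate reply
        *url_out = it->second;
        pending_.erase(it);
        return true;
    }

    std::size_t outstanding() const { return pending_.size(); }

private:
    std::uint64_t next_id_ = 1;
    std::map<std::uint64_t, std::string> pending_;
};
```

Retiring the entry on the first matching reply means a stale or repeated reply is rejected rather than applied to an unrelated query.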
Another approach (more similar to your original one) would be to use a shared memory buffer. See this answer for more details. The problem with that idea is that you would either need to use a spinlock (at the cost of system performance and power consumption, and of course not being able to work on a single-core system) or to poll (which is both inefficient and not really suitable for time-sensitive operations).
There is nothing unstable about PASSIVE_LEVEL. Access to the registry must happen at PASSIVE_LEVEL, so it is not possible directly if the driver is running at a higher IRQL. You can do it by offloading to a work item, though. Lowering the IRQL is usually not recommended as it contradicts the OS's intentions.
Your protocol indeed sounds somewhat cumbersome and doing a direct app-driver communication is probably preferable. You can find useful information about this here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff554436(v=vs.85).aspx
Since the callouts can run at DISPATCH_LEVEL, your processing has to be deferred to a worker thread (a DPC still runs at DISPATCH_LEVEL), which will allow you to use ZwXXX. You should look into inverted callbacks for communication purposes; there's a good document on OSR.
I've just started poking around WFP, but it looks like even in the samples that they provide, Microsoft reinjects the packets. I haven't looked into it that closely, but it seems that they drop the packet and re-inject it once it has been processed. That would be enough for your user-mode engine to make the decision. You should also limit the packet capture to a specific port (80 in your case) so that you don't do extra processing that you don't need.

Best way to communicate from KEXT to Daemon and block until result is returned from Daemon

In the KEXT, I am listening for file close via a vnode or file scope listener. For certain (very few) files, I need to send the file path to my system daemon, which does some processing (this has to happen in the daemon) and returns the result back to the KEXT. The file close call needs to be blocked until I get a response from the daemon. Based on the result, I need to do some operation in the close call and return from the close call successfully. There is a lot of discussion of KEXT communication on the forum, but the threads are not conclusive and appear to be very old (around 2002). On Windows, this requirement can be handled by the FltSendMessage(...) minifilter API. I am looking for the equivalent of that on Mac.
Here is what I have looked at and want to summarize my understanding:
Mach message: Provides a very good way of bidirectional communication using sender and reply ports with a queueing mechanism. However, the Mach message APIs (e.g. mach_msg, mach_port_allocate, bootstrap_look_up) don't appear to be KPIs. The Mach API mach_msg_send_from_kernel can be used, but that alone will not help in bidirectional communication. Is my understanding right?
IOUserClient: This appears to have more to do with communicating from user space to the KEXT and then having some callbacks from the KEXT. I did not find a way to initiate communication from the KEXT to the daemon and then wait for a result from the daemon. Am I missing something?
Sockets: This could be the last option, since I would have to implement the entire bidirectional communication channel from KEXT to daemon.
ioctl/sysctl: I don't know much about them. From what I have read, they are not a recommended option, especially for bidirectional communication.
RPC-Mig: Again, I don't know much about it. It looks complicated from what I have seen. I am not sure if this is the recommended way.
KUNCUserNotification: This appears to be just providing notification to the user from KEXT. It does not meet my requirement.
Supported platform is (10.5 onwards). So looking at the requirement, can someone suggest and provide some pointers on this topic?
Thanks in advance.
The pattern I've used for that process is to have the user-space process initiate a socket connection to the KEXT; the KEXT creates a new thread to handle messages over that socket and sleeps the thread. When the KEXT detects an event it needs to respond to, it wakes the messaging thread and uses the existing socket to send data to the daemon. On receiving a response, control is passed back to the requesting thread to decide whether to veto the operation.
I don't know of any single resource that will describe that whole pattern completely, but the relevant KPIs are discussed in Mac OS X Internals (which seems old, but the KPIs haven't changed much since it was written) and OS X and iOS Kernel Programming (which I was a tech reviewer on).
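The hand-off in that pattern, where the requesting thread blocks until the daemon's answer arrives, can be illustrated in user-space C++ with a mutex and condition variable. This is only a sketch of the shape: a KEXT would use kernel locking and sleep/wakeup primitives instead of the C++ standard library, and the `Rendezvous` class, its method names, and the timeout with a fallback verdict are assumptions added for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// One in-flight request. The requester blocks in wait_for_verdict() until the
// messaging side calls deliver(), or until the timeout expires.
class Rendezvous {
public:
    // Called by the messaging thread when the daemon's reply has arrived.
    void deliver(bool allow) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            allow_ = allow;
            done_ = true;
        }
        ready_.notify_one();
    }

    // Called by the thread handling the file close. Returns the delivered
    // verdict, or `fallback` if no reply arrived within `timeout`.
    bool wait_for_verdict(std::chrono::milliseconds timeout, bool fallback) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!ready_.wait_for(lock, timeout, [this] { return done_; })) {
            return fallback;  // daemon did not answer in time
        }
        return allow_;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    bool done_ = false;
    bool allow_ = false;
};
```

The timeout matters in practice: if the daemon crashes or stalls, a close call that waits forever would hang every process touching the affected files, so the sketch falls back to a default verdict instead.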
For what it's worth, autofs uses what I assume you mean by "RPC-Mig", so it's not too complicated (MIG is used to describe the RPC calls, and the stub code it generates handles calling the appropriate Mach-message sending and receiving code; there are special options to generate kernel-mode stubs).
However, it doesn't need to do any lookups, as automountd (the user-mode daemon to which the autofs kext sends messages) has a "host special port" assigned to it. Doing the lookups to find an arbitrary service would be harder.
If you want to use the socket established with ctl_register() on the KEXT side, then beware: the communication from the KEXT to user space (via ctl_enqueuedata()) works OK. However, the opposite direction is buggy on 10.5.x and 10.6.x.
After about 70,000 or 80,000 send() calls with SOCK_DGRAM in the PF_SYSTEM domain, the complete network stack breaks, with disastrous consequences for the whole system (a hard power-off is the only way out). This has been fixed in 10.7.0. In our project I work around it by using setsockopt() for the direction from user space to the KEXT, as we only send very small data (just to allow or disallow some operation).

Boost.Asio SSL ungraceful close

I am trying to handle SSL error scenarios where, for example, SSL async_handshake() is taking too long.
After some time (say 20 seconds) I want to close this connection (lowest_layer().close()).
I pass shared_ptr with connection object as a parameter to async_handshake(), so object still exists, eventually handshake handler is invoked and object gets destroyed.
But, still I'm getting sporadic crashes! Looks like after close() SSL is still trying to read or operate on read buffer.
So, the basic question - is it safe to hard close() SSL connection?
Any ideas?
Typically, the method I've used to stop outstanding asynchronous operations on a socket is socket::cancel, as described in the documentation. Their handlers will be invoked with asio::error::operation_aborted as the error parameter, which you'll need to handle somehow.
That said, I don't see a problem using close instead of cancel. Though it is difficult to offer much help or advice without some code to analyze.
Note that some Windows platforms have problems when canceling outstanding asynchronous operations. The documentation has suggestions for portable cancelation if your application needs to support Windows.
