Reasons for rare sendto()/recvfrom() issues under Winsock? - windows

We have recently observed rare UDP communication issues with the following symptoms:
A socket sendto() call fails with error WSAENOBUFS (10055)
A subsequent recvfrom() call on this socket does not receive anything, even though Wireshark shows that the network interface actually received the expected datagrams. This situation persists for approximately 8 seconds; afterwards, new incoming datagrams can be received from the socket again.
In the Windows System Log, a Kernel-General information entry appears at the time of the sendto() error:
The access history in hive \??\C:\ProgramData\Microsoft\Provisioning\Microsoft-Desktop-Provisioning-Sequence.dat was cleared updating 0 keys and creating 0 modified pages.
The issue happens on a customer system running Microsoft Windows 10 Pro for Workstations, Version 10.0.17763 Build 17763.
On that system we were able to reproduce the issue with a simple test program written in C++ that echoes UDP datagrams. We verified that the thread receiving from the socket was actually responsive all the time, by specifying a timeout of 1 second using SO_RCVTIMEO, printing some “still alive” output and immediately calling recvfrom() again.
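The test program itself is not shown in the question. As a rough sketch of the kind of echo loop described, here is a Python version (not the original C++); the function name and parameters are illustrative assumptions, and settimeout() stands in for SO_RCVTIMEO.

```python
import socket

def echo_loop(sock, timeout=1.0, max_idle_timeouts=None):
    """Echo UDP datagrams back to their sender and report every receive timeout.

    max_idle_timeouts is a hypothetical knob: stop after that many consecutive
    timeouts (None means run until interrupted).
    """
    sock.settimeout(timeout)  # stands in for SO_RCVTIMEO = 1 second
    idle = 0
    while max_idle_timeouts is None or idle < max_idle_timeouts:
        try:
            data, peer = sock.recvfrom(65535)
        except socket.timeout:
            idle += 1
            print("still alive")  # proves the receiving thread is responsive
            continue
        idle = 0
        try:
            sock.sendto(data, peer)
        except OSError as exc:
            # On Windows, WSAENOBUFS would be expected to show up as winerror 10055.
            print("sendto failed:", getattr(exc, "winerror", exc.errno), exc)
```

Calling echo_loop(sock) on a UDP socket bound to the test port runs until interrupted; a gap between a "sendto failed" line and the next echoed datagram, filled only with "still alive" lines, would correspond to the silence period described above.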
On our own test system, we were unable to observe the issue under the same circumstances as the customer. However, we were able to provoke similar effects when playing around with the network adapter settings while the test was running. Enabling Microsoft LLDP Protocol Driver showed the sendto() error and sometimes also resulted in the 8 second “silence” period, but without any Windows System Log entry.
Any hints are greatly appreciated.

The issue seems to be related to the Microsoft Provisioning Tool, present since Windows 10 1809.
Disabling it fixed the issue in our case:
Open Task Scheduler, go to Microsoft/Windows/Management/Provisioning and disable the Logon task.
Source: Windows TenForums


Windivert fails in Win10 - Solution to TCP 3handshake - Evade windows RST packets

I had a project that used WinDivert to act as a router in my network. It worked fine, but now it fails with the same code. Previous versions that worked successfully no longer work. I always get the same WinDivert error, 997 (Overlapped I/O operation is in progress).
For example, I get the error when I call WinDivertOpen. When I restart the computer to reset the WinDivert driver, I don't get error 997 in WinDivertOpen, but I do get it in WinDivertSend or WinDivertSendEx, and after using them I get the error in WinDivertOpen again. These functions worked fine for me months ago and my router worked as I expected, but now I am stuck with these errors. Maybe this is caused by a Windows security update.
I need to know how to reset the driver without restarting the computer, and what I can do about this problem. I used WinDivert to block the TCP RST packets that Windows sends in response to the traffic my router forwards; Windows does this when there are no sockets associated with the ports being forwarded. How can I block these packets without WinDivert, or with a working WinDivert setup?
The 997 error is ERROR_IO_PENDING, but the error code is meaningless unless WinDivertOpen returns INVALID_HANDLE_VALUE. Otherwise the call will have completed successfully.
Presumably you have upgraded to WinDivert 1.4 from a previous version. Simply replacing the binary files (dll/sys) won't work -- you must instead recompile your program against the new API.

What does "Blocked" really mean in the Firefox developer tools Network monitoring?

In the timing section of the Firefox Network Monitor documentation, "Blocked" is explained as:
Time spent in a queue waiting for a network connection.
The browser imposes a limit on the number of simultaneous connections that can be made to a single server. In Firefox this defaults to 6
Is the limit on the number of connections the only limitation? Or does time the browser spends waiting to get a connection from the OS count as blocked too?
In a fresh browser, on a first connection, before any other connection is made (so the limit should not apply here), I get blocked for 195 ms.
Is this the browser waiting for the OS? What does "Blocked" mean here?
We changed the Firefox setting (about:config) 'network.http.max-persistent-connections-per-server' to 64 and the blocks went away. We changed it back to 6. We changed our design/development method to a more 'asynchronous' loading method so as not to have a large number of simultaneous connections. The blocks were mostly caused by loading a lot of PNG flags for locale settings.
I have a server that takes several seconds to respond, which allowed me to cross-reference the firefox measurement with a wireshark trace. I see that the first SYN is sent out immediately. The end of the "Blocked" time corresponds to when the Server Hello comes back.
I couldn't relate the end of "TLS setup" to any Wireshark packet. It extends a few seconds beyond the last data that is exchanged on the initial TLS connection.
Bottom line: it doesn't look like the time spent in "Blocked" and "TLS setup" is very reliable, at least in some cases.
My setup has a TLS reverse proxy that forwards the connection with SNI. I'm not sure if that might be related.
Time spent in a queue waiting for a network connection.
The browser imposes a limit on the number of simultaneous connections
that can be made to a single server. In Firefox this defaults to 6,
but can be changed using the
network.http.max-persistent-connections-per-server preference. If all
connections are in use, the browser can't download more resources
until a connection is released.
Source : https://developer.mozilla.org/en-US/docs/Tools/Network_Monitor
It's very clear that the browser limits itself to 6 concurrent connections per server (domain/IP), so the OS question is not very relevant.
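To illustrate how that per-server limit turns into "Blocked" time, here is a small back-of-the-envelope model. It is only a sketch under strong assumptions: all requests are queued at the same moment, every request takes the same time, and connection setup is ignored. The function name is made up for this example.

```python
import heapq

def blocked_times(n_requests, duration_ms, max_connections=6):
    """Return how long each request waits in the queue before it gets a connection."""
    free_at = [0] * max_connections  # time at which each connection becomes free
    heapq.heapify(free_at)
    waits = []
    for _ in range(n_requests):
        start = heapq.heappop(free_at)   # earliest available connection
        waits.append(start)              # every request was queued at t = 0
        heapq.heappush(free_at, start + duration_ms)
    return waits
```

With 14 requests of 100 ms each and the default limit of 6, the first six requests are not blocked, the next six wait 100 ms and the last two wait 200 ms. With the limit raised to 64, nothing waits, which matches the observation that the blocks went away after changing the preference.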
In my case, both the wait for a network connection and the DNS lookup times were pretty high, up to 2 seconds each, which caused significant page load times when a page was loaded for the first time. Firefox was freshly installed without add-ons and had just been started with no other tabs open. I tried on both Ubuntu 18.04 LTS and Ubuntu 19.04 with the same results. Although my ISP doesn't provide IPv6 support, my router assigns IPv6 addresses. As it turned out, the problem was the broken IPv6 network, which forced Firefox to fall back to IPv4 (after a time-out, of course). After I turned off IPv6 support in Linux, the requests sped up significantly.
Here is a relevant discussion: https://bugzilla.mozilla.org/show_bug.cgi?id=1452028
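One way to check for this kind of broken IPv6 path outside the browser is to time a TCP connect separately for each address family. The following is a minimal sketch; the helper name and the timeout value are assumptions for illustration, not something Firefox itself uses.

```python
import socket
import time

def time_connect(host, port, family, timeout=3.0):
    """Return (elapsed_seconds, error) for one TCP connect over the given address family."""
    start = time.monotonic()
    error = None
    try:
        # Resolve only addresses of the requested family (AF_INET or AF_INET6).
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        with socket.socket(family, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(infos[0][4])
    except OSError as exc:  # includes resolution failures and timeouts
        error = exc
    return time.monotonic() - start, error
```

Calling it once with socket.AF_INET6 and once with socket.AF_INET for the same host and port 443 should show the difference: on a network like the one described, the IPv6 attempt would be expected to fail or run into the timeout while the IPv4 attempt returns quickly.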
I encountered this error whilst using an Angular 9 'dist' deployment. I discovered that the error appeared because I was trying to access an unreachable API, according to the specified IP address and port.
Therefore to solve it, I just have to reference a valid and accessible API.

Perforce - RpcTransport: partial message read

When using "revert -a" through P4V it waits for a few minutes and throws this error back at me.
RpcTransport: partial message read
TCP receive failed.
read: socket: WSAECONNRESET
The server status returns fine and there are no locked database files.
I suspect this problem is local to this computer as others don't have the same issue. Issuing the same command through the command prompt just has the command prompt sit there indefinitely.
Other commands such as submit and add will have the visual client sit there indefinitely but do not throw an error.
The files are stored on a local drive. This happens with multiple depots/workstations.
The 'WSAECONNRESET' error is issued by Windows, when a network socket is forcibly closed.
Regular occurrences of this error can indicate network problems.
More information is available here:
http://answers.perforce.com/articles/KB/2968/
Hope this helps,
Jen!
I got the same error on a Windows machine. I guess in my case it was caused by corrupted config settings, and because of the popup error message I had no chance to set them correctly via the GUI.
The command line SET command helped to set port and host name again:
p4 set P4PORT=<portnum>
This command reenables the GUI config dialog
A few years late, but for those still facing this:
I faced this error when fetching files from a large repo. I believe the cause was a low internet upload speed: even though I had a high download speed, the TCP acknowledgments from my computer were not getting sent, causing a connection failure.
Perform an upload speed test to determine if it is very low (in my case it had dropped to less than 0.1 Mbps). Fixing upload speeds is a separate topic, but in case it helps try restarting your router as a first step.

Irregular socket errors (10054) on Windows application

I am working on a Windows (Microsoft Visual C++ 2005) application that uses several processes
running on different hosts in an intranet.
Processes communicate with each other using TCP/IP. Different processes can be on the
same host or on different hosts (i.e. the communication can be both within the same
host or between different hosts).
We currently have a bug that appears irregularly. The communication seems to work
for a while, then it stops working. Then it works again for some time.
When the communication does not work, we get an error (apparently while a process
was trying to send data). The call looks like this:
send(socket, (char *) data, (int) data_size, 0);
By inspecting the error code we get from
WSAGetLastError()
we see that it is an error 10054. Here is what I found in the Microsoft documentation
(see here):
WSAECONNRESET
10054
Connection reset by peer.
An existing connection was forcibly closed by the remote host. This normally
results if the peer application on the remote host is suddenly stopped, the
host is rebooted, the host or remote network interface is disabled, or the
remote host uses a hard close (see setsockopt for more information on the
SO_LINGER option on the remote socket). This error may also result if a
connection was broken due to keep-alive activity detecting a failure while
one or more operations are in progress. Operations that were in progress
fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.
So, as far as I understand, the connection was interrupted by the receiving process.
In some cases this error is (AFAIK) correct: one process has terminated and
is therefore not reachable. In other cases both the sender and receiver are running
and logging activity, but they cannot communicate due to the above error (the error
is reported in the logs).
My questions.
What does the SO_LINGER option mean?
What is a keep-alive activity and how can it break a connection?
How is it possible to avoid this problem or recover from it?
Regarding the last question. The first solution we tried (actually, it is rather a
workaround) was resending the message when the error occurs. Unfortunately, the
same error occurs over and over again for a while (a few minutes). So this is not
a solution.
At the moment we do not understand if we have a software problem or a configuration
issue: maybe we should check something in the windows registry?
One hypothesis was that the OS runs out of ephemeral ports (in case connections are
closed but ports are not released because of TcpTimedWaitDelay), but by analyzing
this issue we think that there should be plenty of them: the problem occurs even
if messages are not sent too frequently between processes. However, we still are not
100% sure that we can exclude this: can ephemeral ports get lost in some way (???)
Another detail that might help is that sending and receiving occurs in each process
concurrently in separate threads: are there any shared data structures in the
TCP/IP libraries that might get corrupted?
What is also very strange is that the problem occurs irregularly: communication works
OK for a few minutes, then it does not work for a few minutes, then it works again.
Thank you for any ideas and suggestions.
EDIT
Thanks for the hints confirming that the only possible explanation was a connection closed error. By further analysis of the problem, we found out that the server-side process of the connection had crashed / had been terminated and had been restarted. So there was a new server process running and listening on the correct port, but the client had not detected this and was still trying to use the old connection. We now have a mechanism to detect such situations and reset the connection on the client side.
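The detection mechanism itself is not shown. As a rough idea of what such a client-side reset could look like, here is a Python sketch (the application in question is C++); the class and its parameters are hypothetical.

```python
import socket

class ReconnectingClient:
    """TCP sender that reopens its connection when the old one turns out to be dead."""

    def __init__(self, host, port, timeout=5.0):
        self.address = (host, port)
        self.timeout = timeout
        self.sock = None

    def close(self):
        if self.sock is not None:
            try:
                self.sock.close()
            finally:
                self.sock = None

    def send(self, data, retries=1):
        for attempt in range(retries + 1):
            try:
                if self.sock is None:
                    self.sock = socket.create_connection(self.address, timeout=self.timeout)
                self.sock.sendall(data)
                return
            except (ConnectionResetError, ConnectionAbortedError, BrokenPipeError):
                # On Windows, errors 10054 and 10053 surface as the first two exception types.
                self.close()  # drop the stale connection, then retry on a fresh one
                if attempt == retries:
                    raise
```

Note the limitation: a send can appear to succeed before the reset from the peer has arrived, in which case that message is lost silently. Reliable delivery would still need an acknowledgment at the application level.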
That error means that the connection was closed by the remote side, so your program cannot do anything except accept that the connection is broken.
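For the SO_LINGER and keep-alive questions, a small sketch may help. Enabling SO_LINGER with a linger time of zero makes close() abort the connection with a reset instead of shutting it down gracefully, which is the "hard close" the documentation quote mentions. SO_KEEPALIVE makes the stack probe an idle connection and fail it if the peer no longer answers. The helper names below are made up for this example, which uses Python rather than the Winsock C API.

```python
import socket
import struct
import sys

def hard_close(sock):
    """Close with SO_LINGER enabled and a zero timeout: the peer sees a connection reset."""
    # struct linger holds two unsigned shorts on Windows and two ints on POSIX systems.
    fmt = "HH" if sys.platform == "win32" else "ii"
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack(fmt, 1, 0))
    sock.close()

def enable_keepalive(sock):
    """Ask the TCP stack to probe the connection while it is idle."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
```

If one end calls hard_close(), the next receive on the other end typically fails with a connection reset (10054 on Windows) rather than returning an orderly end of stream.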
I was facing this problem for some days recently and found out that Adobe Acrobat Reader update was the culprit. As soon as you completely uninstall Adobe from the system everything returns back to normal.
I spent a long time debugging a 10054/10053 error in S3 pre-signed uploads.
It turns out that the S3 server will reject pre-signed uploads for the first 15 minutes of its life.
So - If you're debugging s3 check it's not a new bucket.
If you're debugging something else - this is most likely a problem on the server side not client side.

Windows XPe RAS error 756 "connection is being dialled"

I'm working with an embedded system which has a RAS entry already set up, using the API function RasDial from rasapi32.dll.
All works well except if something goes wrong after RasDial and before RasHangUp. In this case any further attempt to dial is met with error 756 "connection is being dialled", whether the dial attempt is done via the API or via the Windows rasdial command line utility.
rasdial connectionname /d doesn't help either.
The com port used for the modem is locked.
The only way to recover is to reboot.
Obviously under normal circumstances the solution is to make sure that RasDial is always followed by RasHangUp. But for cases where this doesn't happen, is there a way of aborting the dial attempt? For example, if the app calls RasDial and then crashes, how do I get out of that other than by rebooting?
Unfortunately, unless your application can properly terminate the connection that's in progress before exiting, the RAS state machine becomes corrupted and you must reboot to fix the problem. I've noticed that Windows 7 handles these sorts of scenarios better than XP and Vista did, but there are still occasions when I've had to reboot.
I've managed to prevent most of these sorts of problems with the DotRas API as long as they're occurring in the event handlers of the RasDialer, but if the application crashes from another thread and not from the background thread which raises the RasDialer events, there's nothing I can do about that.
For asynchronous dialing using the DotRas 1.2 SDK:
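using System.Net;   // for NetworkCredential
using DotRas;
RasDialer dialer = new RasDialer();
dialer.EntryName = "My Connection";   // name of the existing RAS entry
dialer.Credentials = new NetworkCredential("My", "User");
dialer.DialAsync();   // starts dialing without blocking the caller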
From this point you can call dialer.DialAsyncCancel() if you want to cancel the connection attempt that's in progress.
Synchronous dialing using the DotRas 1.2 SDK is very similar to asynchronous dialing: simply replace the DialAsync call with dialer.Dial().
Here's a link to the API I was talking about: http://www.codeplex.com/DotRas
Hope that helps!
