Irregular socket errors (10054) on Windows application - windows

I am working on a Windows (Microsoft Visual C++ 2005) application that uses several processes
running on different hosts in an intranet.
Processes communicate with each other using TCP/IP. Different processes can be on the
same host or on different hosts (i.e. the communication can be both within the same
host or between different hosts).
We have currently a bug that appears irregularly. The communication seems to work
for a while, then it stops working. Then it works again for some time.
When the communication does not work, we get an error (apparently while a process
was trying to send data). The call looks like this:
send(socket, (char *) data, (int) data_size, 0);
By inspecting the error code we get from
WSAGetLastError()
we see that it is an error 10054. Here is what I found in the Microsoft documentation
(see here):
WSAECONNRESET
10054
Connection reset by peer.
An existing connection was forcibly closed by the remote host. This normally
results if the peer application on the remote host is suddenly stopped, the
host is rebooted, the host or remote network interface is disabled, or the
remote host uses a hard close (see setsockopt for more information on the
SO_LINGER option on the remote socket). This error may also result if a
connection was broken due to keep-alive activity detecting a failure while
one or more operations are in progress. Operations that were in progress
fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.
So, as far as I understand, the connection was interrupted by the receiving process.
In some cases this error is (AFAIK) correct: one process has terminated and
is therefore not reachable. In other cases both the sender and receiver are running
and logging activity, but they cannot communicate due to the above error (the error
is reported in the logs).
My questions.
What does the SO_LINGER option mean?
What is a keep-alive activity and how can it break a connection?
How is it possible to avoid this problem or recover from it?
Regarding the last question. The first solution we tried (actually, it is rather a
workaround) was resending the message when the error occurs. Unfortunately, the
same error occurs over and over again for a while (a few minutes). So this is not
a solution.
At the moment we do not understand if we have a software problem or a configuration
issue: maybe we should check something in the windows registry?
One hypothesis was that the OS runs out of ephemeral ports (in case connections are
closed but ports are not released because of TcpTimedWaitDelay), but by analyzing
this issue we think that there should be plenty of them: the problem occurs even
if messages are not sent too frequently between processes. However, we still are not
100% sure that we can exclude this: can ephemeral ports get lost in some way (???)
Another detail that might help is that sending and receiving occurs in each process
concurrently in separate threads: are there any shared data structures in the
TCP/IP libraries that might get corrupted?
What is also very strange is that the problem occurs irregularly: communication works
OK for a few minutes, then it does not work for a few minutes, then it works again.
Thank you for any ideas and suggestions.
EDIT
Thanks for the hints confirming that the only possible explanation was a connection closed error. By further analysis of the problem, we found out that the server-side process of the connection had crashed / had been terminated and had been restarted. So there was a new server process running and listening on the correct port, but the client had not detected this and was still trying to use the old connection. We now have a mechanism to detect such situations and reset the connection on the client side.

That error means that the connection was closed by the
remote site. So you cannot do anything on your programm except to accept that the connection is broken.

I was facing this problem for some days recently and found out that Adobe Acrobat Reader update was the culprit. As soon as you completely uninstall Adobe from the system everything returns back to normal.

I spent a long time debugging a 10054/10053 error in s3 pre-signed uploads
Turns out that the s3 server will reject pre-signed s3 uploads for the first 15 minutes of it's life.
So - If you're debugging s3 check it's not a new bucket.
If you're debugging something else - this is most likely a problem on the server side not client side.

Related

Reasons for rare sendto()/recvfrom() issues under Winsock?

We recently observe rare UDP communication issues that show the following symptoms:
A socket sendto() call fails with error WSAENOBUFS (10055)
A subsequent recvfrom() call on this socket does not receive anything, even though Wireshark shows that the network interface actually received the expected datagrams. This situation persists for approximately 8 seconds, afterwards new incoming datagrams can be received again from the socket.
In Windows System Log, there appears a Kernel-General information entry at the time of the sendto() error:
The access history in hive \??\C:\ProgramData\Microsoft\Provisioning\Microsoft-Desktop-Provisioning-Sequence.dat was cleared updating 0 keys and creating 0 modified pages.
The issue happens on a customer system running Microsoft Windows 10 Pro for Workstations, Version 10.0.17763 Build 17763.
On that system we were able to reproduce the issue with a simple test program written in C++ that echoes UDP datagrams. We verified that the thread receiving from the socket was actually responsive all the time, by specifying a timeout of 1 second using SO_RCVTIMEO, printing some “still alive” output and immediately calling recvfrom() again.
On our own test system, we were unable to observe the issue under the same circumstances as the customer. However, we were able to provoke similar effects when playing around with the network adapter settings while the test was running. Enabling Microsoft LLDP Protocol Driver showed the sendto() error and sometimes also resulted in the 8 second “silence” period, but without any Windows System Log entry.
Any hints are greatly appreciated.
The issue seems to be related to Microsoft Provisioning Tool since Windows 10 1809.
Disabling it fixed the issue in our case:
Open Task Scheduler, go to Microsoft/Windows/Manangement/Provisioning and disable Logon task.
Source: Windows TenForums

What would happen if a process established multiple PostgreSQL connections and terminated without closing them?

I'm writing a DLL for a purchased software.
The software will perform multi-threaded calculations on certain tasks.
My job is to output the relative result into a database.
However, due to the limited support of the software, it is kind of difficult to do multi-threaded output of the data.
The key problem is that there is no info on the last execution of the DLL function.
Therefore, the database connection will not be closed.
So may I ask if I leave the connection open and terminate the process, what would be the potential problems?
My platform is winserver 2008, and PostgreSQL 10.
I don't understand the background information you are giving, but I can answer the question:
If a PostgreSQL client process dies without closing the database (and TCP) connection, the PostgreSQL server process (“backend process”) that servers this connection will not realize this immediately.
Of course, as soon as the server tries to communicate to the client, e.g. to return some results, TCP it will notice that the partner has gone away and will return an error.
However, often the backend process is idle, waiting for the client to send the next request. In this case, it would never notice that its partner has died. This could eventually cause max_connections to be exhausted with dead connections.
Because this is a common problem in networking, TCP provides the “keepalive” functionality: when a connection has been idle for a while (2 hours by default), the operating system will send a so-called “keepalive packet” and wait for a response from the other side. Sending keepalive packets is repeated several times (5 times by default) in short intervals (1 second by default), and if no response is received, the connection is closed by the operating system, the backend process receives an error message and terminates.
PostgreSQL provides parameters with which you can configure these settings on the server side: tcp_keepalives_idle, tcp_keepalives_count and tcp_keepalives_interval. If you set tcp_keepalives_idle to a shorter value, dead connections will be detected and removed faster, at the cost of some little added network traffic.

Perforce - RpcTransport: partial message read

When using "revert -a" through P4V it waits for a few minutes and throws this error back at me.
RpcTransport: partial message read
TCP receive failed.
read: socket: WSAECONNRESET
The server status returns fine and there are no locked database files.
I suspect this problem is local to this computer as others don't have the same issue. Issueing the same command through the command prompt just has the command prompt sit there indefinitly.
Other commands such as submit and add will have the visual client sit there indefinitely but does not throw and error.
The files are stored on a local drive. This happens with multiply depots/workstations.
The 'WSAECONNRESET' error is issued by Windows, when a network socket is forcibly closed.
Regular occurrences of this error can indicate network problems.
More information is available here:
http://answers.perforce.com/articles/KB/2968/
Hope this helps,
Jen!
I got the same on windows machine. I guess in my case it was caused by corrupted config settings and because of popup error message I had no chance to set it correctly via GUI.
The command line SET command helped to set port and host name again:
p4 set P4PORT=<portnum>
This command reenables the GUI config dialog
A few years late, but for those still facing this:
I faced this error when fetching files from a large repo. I believe what caused this for me was low internet upload speeds due to which - even though I had high a download speed - the TCP acknowledgment from my computer was not getting sent, causing a connection failure.
Perform an upload speed test to determine if it is very low (in my case it had dropped to less than 0.1 Mbps). Fixing upload speeds is a separate topic, but in case it helps try restarting your router as a first step.

TCP: Address already in use exception - possible causes for client port? NO PORT EXHAUSTION

stupid problem. I get those from a client connecting to a server. Sadly, the setup is complicated making debugging complex - and we run out of options.
The environment:
*Client/Server system, both running on the same machine. The client is actually a service doing some database manipulation at specific times.
* The cnonection comes from C# going through OleDb to an EasySoft JDBC driver to a custom written JDBC server that then hosts logic in C++. Yeah, compelx - but the third party supplier decided to expose the extension mechanisms for their server through a JDBC interface. Not a lot can be done here ;)
The Symptom:
At (ir)regular intervals we get a "Address already in use: connect" told from the JDBC driver. They seem to come from one particular service we run.
Now, I did read all the stuff about port exhaustion. This is why we have a little tool running now that counts ports and their states every minute. Last time this happened, we had an astonishing 370 ports in use, with the count rising to about 900 AFTER the error. We aleady patched the registry (it is a windows machine) to allow more than the 5000 client ports standard, but even then, we are far far from that limit to start with.
Which is why I am asking here. Ayneone an ide what ELSE could cause this?
It is a Windows 2003 Server machine, 64 bit. The only other thing I can see that may cause it (but this functionality is supposedly disabled) is Symantec Endpoint Protection that is installed on the server - and being capable of actinc as a firewall, it could possibly intercept network traffic. I dont want to open a can of worms by pointing to Symantec prematurely (if pointing to Symantec can ever be seen as such). So, anyone an idea what else may be the cause?
Thanks
"Address already in use", aka WSAEADDRINUSE (10048), means that when the client socket prepared to connect to the server socket, it first tried to bind itself to a specific local IP/Port pair that was already in use by another socket, either an active one or one that has been closed but is still in the FD_WAIT state. This has nothing to do with the number of ports that are available.
I'm having the same issue on a Windows 2000 Server with a .Net application connecting to a SQL Server 7.0. There's like 10 servers with the same configuration and only one is showing this error several times a day. With a small test program I'm able to reproduce the error by just establishing a TCP connection on the SQL Server listening port. Running CurrPorts (http://www.nirsoft.net/utils/cports.html) shows there's still plenty of available ports in range 1024-5000.
I'm out of ideas and would like to know if you've found a solution since you've posted your question.
Edit : I finally found the solution : a worm was present on the server (WORM_DOWNAD.A) and exhausted local ports without being noticed.

What is the difference between "ORA-12571: TNS packet writer failure" and "ORA-03135: connection lost contact"?

I am working in an environment where we get production issues from time to time related to Oracle connections. We use ODP.NET from ASP.NET applications, and we suspect the firewall closes connections that have been in the connection pool too long.
Sometimes we get an "ORA-12571: TNS packet writer failure" error, and sometimes we get "ORA-03135: connection lost contact."
I was wondering if someone has run into this and/or has an understanding of the difference between the 2 errors.
Using a mobile phone analogy:
ORA-12571 (Failure) Means call is dropped.
ORA-03135 (Connection Lost) Other party hung up.
My understanding is that 3135 occurs when a connection is lost. This doesn't tell you why the connection was lost, though. It may have been terminated by the server because the server failed to recieve a response to a probe for a certain amount of time, and assumed that the connection was dead. Or (I'm not sure about this) the exact reverse of that: the client failed to recieve a probe response from the server for a certain amount of time, so it assumed the connection was lost. The "certain amount of time" is cotrolled by SQLNET.EXPIRE_TIME=[minutes] in sqlnet.ora.
As for 12571, my (again vague) understanding is that there was a sudden failure to send a packet during communication with the server, and that this is typically caused by some software or hardware interfering with the connection (either by design, or by error). For instance, if you pull out your ethernet cable and then try to execute a query, you'll probably get this. Or if a firewall or anti-malware application decides to block the traffic.

Resources