Winsock error codes 10054 and 10053 - tcplistener

I have an application that is listening for data received from GPRS units in the field on a normal TCP connection. I'm getting Winsock 10054 and 10053 errors.
As explained by Microsoft:
10053 : Software caused connection abort. An established connection was aborted by the software in your host machine, possibly due to a data transmission time-out or protocol error.
and
10054 : Connection reset by peer. An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, or the remote host uses a hard close (see setsockopt (Windows Sockets) for more information on the SO_LINGER option on the remote socket.) This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.
I'm not sure how to interpret this. How do I determine if the error is caused on the server or by the client sending the information?

Related

IBM MQ client 7.5 MQRC_HOST_NOT_AVAILABLE

We tried to test the connection to the remote queue manager after installing MQ client v7.5 on Windows Server 2019. We used Rfhutilc for this and got 'Host not available', in spite of the fact that a telnet connection to the corresponding address was successfully established. We also tried to connect using MQ client v9.0, with the same result.
AMQERR01.LOG (client v7.5) reported the following details:
29.09.2020 15:36:10 - Process(10828.2) User(Администратор) Program(rfhutilc.exe)
Host(-) Installation(Installation1)
VRMF(7.5.0.6)
AMQ9208: Error on receive from host 'X.X.X.X'.
EXPLANATION: An error occurred receiving data from 'X.X.X.X' over TCP/IP. This may be due to a communications failure.
ACTION: The return code from the TCP/IP recv() call was 10054 (X'2746'). Record these values and tell the systems administrator.
----- amqccita.c : 4065 -------------------------------------------------------
29.09.2020 15:37:56 - Process(10828.1) User(Администратор) Program(rfhutilc.exe)
Host(-) Installation(Installation1)
VRMF(7.5.0.6)
AMQ9202: Remote host 'X.X.X.X' not available, retry later.
EXPLANATION: The attempt to allocate a conversation using TCP/IP to host 'X.X.X.X' was not successful. However the error may be a transitory one and it may be possible to successfully allocate a TCP/IP conversation later.
ACTION: Try the connection again later. If the failure persists, record the error values and contact your systems administrator. The return code from TCP/IP is 10060 (X'274C'). The reason for the failure may be that this host cannot reach the destination host. It may also be possible that the listening program at host 'X.X.X.X' was not running. If this is the case, perform the relevant operations to start the TCP/IP listening program, and try again.
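The AMQ9208 and AMQ9202 entries report the raw Winsock return code in decimal, followed by its hex form in X'nnnn' notation. A small Go sketch decodes the two codes in this log. The lookup table and the describe helper are illustrative; the code values are the standard Winsock ones. 10054 is WSAECONNRESET, the reset discussed in the first question, and 10060 is WSAETIMEDOUT, a connect attempt that got no answer in time.

```go
package main

import "fmt"

// winsockNames maps the Winsock return codes that appear in these logs
// to their symbolic names.
var winsockNames = map[int]string{
	10052: "WSAENETRESET",
	10053: "WSAECONNABORTED",
	10054: "WSAECONNRESET",
	10060: "WSAETIMEDOUT",
	10061: "WSAECONNREFUSED",
}

// describe formats a code the way the AMQ log pairs decimal and hex,
// followed by its symbolic name.
func describe(code int) string {
	name, ok := winsockNames[code]
	if !ok {
		name = "unknown"
	}
	return fmt.Sprintf("%d (X'%X') %s", code, code, name)
}

func main() {
	for _, c := range []int{10054, 10060} {
		fmt.Println(describe(c))
	}
}
```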
Here is an example of what the traffic data looks like when Rfhutilc refuses to connect to the queue.
Since the picture suggested a code page issue, we tried setting the MQCCSID environment variable to 1208, and that helped.
Connecting via Rfhutilc also succeeded when running under another user with the login "admin", even without setting the MQCCSID variable.
However, I couldn't find an explanation for this. Did the CCSID of the MQ client differ from the system code page, or from something else? And how can I find out the default CCSID of the MQ client?
MQ client v7.5 worked fine on Windows Server 2012 R2 right after installation. Rfhutilc v7.5 was used for testing on both Server 2012 and Server 2019.

Windows sockets: How to immediately detect TCP RST on nonblocking connect()?

Our software (the Nmap port scanner) needs to quickly determine the status of a non-blocking TCP socket connect(). We use select() to monitor many sockets, and Windows is good at notifying us when one succeeds. But if the port is closed and the target sends a TCP RST, Windows keeps retrying a few times before flagging the socket in exceptfds, and the socket error is then WSAECONNREFUSED, as expected. Our application has its own timeout, though, and will usually mark the connection as timed out before Windows gives up. We want to get as close as possible to the behavior of Linux, which is to report ECONNREFUSED immediately upon receipt of the first RST.
We have tried the TCP_MAXRT socket option, and it does make select() signal us right away, but the result for closed ports is always WSAETIMEDOUT. That makes it impossible to distinguish closed (RST) from filtered/firewalled (network timeout), which puts us back where we started. Making this distinction is a core feature of our application.
So what is the best way on Windows to find out if a non-blocking socket connect() has received a connection reset?
EDITED TO ADD: A core problem here is this line from Microsoft's documentation on the SO_ERROR socket option: "This per-socket error code is not always immediately set." If it were immediately set, we could check for it prior to the connect timeout.

dial tcp remote_ip:6379: connect: connection timed out

I'm using redigo for both regular commands and subscriptions. Every few days I get this error, which causes a panic:
dial tcp IP:6379: connect: connection timed out
I'm guessing there is some lag or a minor network disturbance that causes the connection to time out.
How can I avoid this? I'm OK with the program waiting a few seconds until the problem resolves, rather than panicking.
Should I define timeouts for Dial, such as
DialReadTimeout
DialWriteTimeout
Use DialConnectTimeout to specify a timeout for dialing a network connection or DialNetDial for complete control over dialing a network connection.
The application supplied NetDial function can set timeouts, throttle connect attempts on failure, and more.
Panics related to a dial failure are probably due to a lack of error checking in the application.
DialWriteTimeout and DialReadTimeout are dial options for specifying the timeout when writing a command to the network connection and reading a reply from the network connection respectively. These options have no bearing on timeouts during connect.

keepalive timeout on unix/windows

What error is returned on AIX/Linux when a connection breaks down due to keepalive activity? Is it a unique error code that can be distinguished from other socket errors?
On Windows this can be either WSAECONNRESET or WSAENETRESET.
Is there a way to differentiate the error due to keepalive activity when WSAECONNRESET is returned?
WSAECONNRESET
10054
Connection reset by peer.
An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.
Is there a way to differentiate the error due to keepalive activity when WSAECONNRESET is returned?
No. The underlying condition is a 'connection reset' in all cases.

What is the difference between "ORA-12571: TNS packet writer failure" and "ORA-03135: connection lost contact"?

I am working in an environment where we get production issues from time to time related to Oracle connections. We use ODP.NET from ASP.NET applications, and we suspect the firewall closes connections that have been in the connection pool too long.
Sometimes we get an "ORA-12571: TNS packet writer failure" error, and sometimes we get "ORA-03135: connection lost contact."
I was wondering if someone has run into this and/or understands the difference between the two errors.
Using a mobile phone analogy:
ORA-12571 (failure): the call is dropped.
ORA-03135 (connection lost): the other party hung up.
My understanding is that 3135 occurs when a connection is lost. This doesn't tell you why the connection was lost, though. It may have been terminated by the server because the server failed to receive a response to a probe for a certain amount of time and assumed that the connection was dead. Or (I'm not sure about this) the exact reverse: the client failed to receive a probe response from the server for a certain amount of time, so it assumed the connection was lost. The "certain amount of time" is controlled by SQLNET.EXPIRE_TIME=[minutes] in sqlnet.ora.
As for 12571, my (again vague) understanding is that there was a sudden failure to send a packet during communication with the server, and that this is typically caused by some software or hardware interfering with the connection (either by design, or by error). For instance, if you pull out your ethernet cable and then try to execute a query, you'll probably get this. Or if a firewall or anti-malware application decides to block the traffic.
