Issue with CLOSE_WAIT status - Windows

I see a number of connections in the CLOSE_WAIT state on my production server, and I have a few questions. Please advise.
I know that the Windows registry has the following parameter:
TcpTimedWaitDelay under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
According to Microsoft (http://social.technet.microsoft.com/Forums/en-US/windowsserver2008r2networking/thread/4288d218-fbf9-4489-b869-384a05dea83d/), the default TIME_WAIT value is 4 minutes. I set the value to 30 seconds, but nothing seems to have changed. Moreover, even after 4 minutes the CLOSE_WAIT connections are still there.
My question is: how can I change the TIME_WAIT value and see it in action? Should I restart my server?
Are there any other settings to control the CLOSE_WAIT state in a Windows environment?
Regards,
Cyril

CLOSE_WAIT means that the peer has closed the connection and you haven't. The operating system is waiting for you (the local application) to close it.
So close it. Somewhere in your code you have missed a close.
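A minimal sketch of what "close it" looks like in practice, written in Python for brevity. The function name serve_until_peer_closes is hypothetical and not taken from the question; the point is only that an empty read signals the peer's close, and that the local close() is what ends CLOSE_WAIT.

```python
import socket


def serve_until_peer_closes(conn: socket.socket) -> int:
    """Read from conn until the peer closes, then close our side too.

    Returns the number of bytes received. (Hypothetical helper.)
    """
    total = 0
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                # An empty read means the peer has closed its side. On a TCP
                # socket the local end now sits in CLOSE_WAIT until we close.
                break
            total += len(data)
    finally:
        # Closing here, even on an exception, is what releases the socket.
        conn.close()
    return total
```

If the close is skipped on some error path, every such connection stays in CLOSE_WAIT for as long as the process holds the handle, which no registry timeout will clean up.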

You're confusing CLOSE_WAIT and TIME_WAIT. They're not the same.
See here: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html for some details on TIME_WAIT and why you might not really want to play with shortening the timeout.
And see here: http://www.sunmanagers.org/pipermail/summaries/2006-January/007068.html and here: http://blogs.msdn.com/b/spike/archive/2008/10/09/tcp-connections-hanging-in-the-close-wait-and-fin-wait-2-state.aspx for details on why you might be collecting sockets stuck in CLOSE_WAIT - in summary, you're possibly not closing your sockets correctly.

I suggest you use Spring WS: I was facing the same issue in my project, and after I switched to Spring Web Services the problem was resolved.
See the following configuration:
<bean id="viewCustomerInfo"
class="org.springframework.remoting.jaxws.JaxWsPortProxyFactoryBean"
p:serviceInterface="com.javaplex.CustomerInfoInterface"
p:wsdlDocumentUrl="http://127.0.0.1:8080/portal/CustomerInfoPort?wsdl"
p:namespaceUri="http://ws.customergen.com/" p:serviceName="CustomerInfo"
p:portName="CustomerInfoPort" />
Here is the complete article on how to set up Spring-based beans for optimum performance:
http://www.javaplex.com/spring-jax-ws-client-for-best-performance/

Related

Enable TCP keepalive on port open by another program

On a Debian machine I'm using an OPCUA server https://github.com/FreeOpcUa/opcua-asyncio. The server does not give the possibility to enable TCP keepalive on the port opened by the server.
Basically, I want to know if it's possible to start the server then in another script, enable the tcp keepalive on that port.
I also found some other information from Redhat https://access.redhat.com/solutions/19029, and https://access.redhat.com/solutions/25773 (requires you to sign up to see the articles). But again I'm still lost as to what to do exactly.
I'll keep reading up on this, but so far I've spent about 10 hours trying to figure out whether it's even possible. So I thought I should ask for some help.
Any advice is welcome, thanks!
To operate on a socket that belongs to another process, that process must share the socket with you (https://docs.python.org/3/library/socket.html#socket.socket.share) or duplicate it.
It's easier to patch your server to enable keepalive.
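As a rough sketch of what such a patch would do: keepalive is a per-socket option, so it has to be set by the process that owns the socket, for example on each accepted connection. The helper name enable_keepalive and the timing values below are assumptions for illustration; where opcua-asyncio exposes its sockets is not shown here. The TCP_KEEP* options are guarded because they are not available on every platform.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 5) -> None:
    """Turn on TCP keepalive for a socket this process owns (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained timers are platform-specific (present on Linux).
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

In an asyncio server, the underlying socket of a connection can usually be reached with transport.get_extra_info("socket"), which is where a call like this would go.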

Why does the SpringBoot website refuse client connections several minutes after the JMeter load test begins?

It is a SpringBoot website deployed on a single Linux server. We use JMeter for the load test.
We simulate 500 users visiting the website's index page simultaneously. The index page is very simple HTML with no database connection, so each connection is quite short.
After about 2 minutes, JMeter starts to throw timeout exceptions as below
I guess this is because the website is reaching its capacity and running out of connections.
I have one question here: why does the website reach its capacity 2 minutes after JMeter starts? If the TCP connection capacity for this website is 1000, I would expect it to reach 1000 very soon after JMeter starts, not after 2 minutes.
Besides, I see that many TCP connections are in the TIME_WAIT state on the Linux server. I guess this may be related to the connection timeout?
Edit: Some think it is running out of ports. Some think it is running out of connections. And some think it is running out of processing threads (e.g. What does this messge java.net.ConnectException/Connection timed out mean in log.jtl file of Jmeter?). I don't know which one is the actual reason...
Most probably this is due to underlying Linux TCP/IP kernel stack configuration, as per Linux TCP/IP tuning for scalability article:
By default, a connection is supposed to stay in the TIME_WAIT state for twice the msl. Its purpose is to make sure any lost packets that arrive after a connection is closed do not confuse the TCP subsystem (the full details of this are beyond the scope of this article, but ask me if you’d like details). The default msl is 60 seconds, which puts the default TIME_WAIT timeout value at 2 minutes. Which means you’ll run out of available ports if you receive more than about 400 requests a second, or if we look back to how nginx does proxies, this actually translates to 200 requests per second. Not good for scaling.
So double-check the timeouts along with the maximum number of ports/sockets/files on the Linux server. My expectation is that these parameters need to be tuned for high loads.
It's also good practice to monitor baseline OS health metrics (CPU, RAM, network, disk, swap usage, etc.). You can use e.g. the JMeter PerfMon Plugin or the JMeter SSHMon Listener for this.
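The port-exhaustion arithmetic in the quoted article can be sketched as a back-of-the-envelope calculation. The function name and the port range used in the usage note are assumptions, not values from the question; the real range on a Linux box is in /proc/sys/net/ipv4/ip_local_port_range, and the estimate applies to one client address talking to one server address and port.

```python
def max_new_connections_per_second(port_low: int, port_high: int,
                                   time_wait_seconds: float) -> float:
    """Rough upper bound on sustained new connections per second.

    Each closed connection keeps its local port busy for the TIME_WAIT
    period, so the port pool divided by that period bounds the rate.
    (Hypothetical helper for illustration.)
    """
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_seconds
```

For an assumed range of 32768 to 60999 this gives roughly 235 connections per second with a 120-second wait and roughly 470 with a 60-second wait, so the exact ceiling depends heavily on which values the server actually uses.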

Stale connection with Pheanstalk

I'm using beanstalkd to offload some work to other machines. The setup is a bit unusual: the server is on the internet (public IP), but the consumers are behind ADSL lines in people's homes. So a Linux machine acts as the client, going out through a dynamic IP and connecting to the server to get a job. It's all PHP, and I'm using the pheanstalk library.
Everything runs smoothly for some time, but when the ADSL line changes its IP (every 24 hours the provider forces a disconnect-reconnect), the client just hangs and never comes out of "reserve".
I thought that putting a timeout on the reserve would help, but it didn't. As it seems, the client issues the command and blocks; it never checks the timeout itself. It just issues a reserve-with-timeout (instead of a simple reserve), and it is the server's responsibility to return TIMED_OUT when the timeout occurs. The problem is that the connection is broken (but TCP/IP doesn't know that until one of the sides tries to talk to the other), so if the client is blocked reading, it will never return.
The library seems to support some kinds of local timeouts (for example when connecting to the server), but it does not seem to cover this scenario.
How could I detect the stale connection and force a reconnect? Is there some kind of keepalive in the protocol (and in pheanstalk itself)?
Thanks!
You could try closing each connection right after the request is answered and opening a new connection each time.
There is no close() function, but deleting the Pheanstalk object with unset($pheanstalk) will close the connection.
This explanation is quite helpful:
Pheanstalk (PHP client for beanstalk) - how do connections work?
I haven't tried it yet, but I came up with the idea of connecting to the beanstalk server through an SSH tunnel. We can enable the ServerAliveCountMax and ServerAliveInterval options on the tunnel, so that a network or server failure will cause the tunnel to close. This should then cause the pheanstalk client to report an error.
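Another way to look at the problem described in the question is a client-side deadline on the blocking read, set a little longer than the reserve-with-timeout value, so that silence past the deadline is treated as a stale connection. The sketch below is in Python on a raw socket, not PHP, and read_with_deadline is a hypothetical helper; whether and how Pheanstalk lets you set a read timeout on its socket is not shown here.

```python
import socket
from typing import Optional


def read_with_deadline(sock: socket.socket, nbytes: int,
                       timeout_s: float) -> Optional[bytes]:
    """Return received bytes, or None if nothing arrived within timeout_s.

    An empty bytes result (b"") still means the peer closed the connection.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # No reply in time: the caller should assume the connection is
        # stale, close it, and reconnect before reserving again.
        return None
```

With a server-side reserve timeout of, say, 30 seconds, a local deadline of 35 to 40 seconds should only fire when the server's own TIMED_OUT reply never arrived.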

Irregular socket errors (10054) on Windows application

I am working on a Windows (Microsoft Visual C++ 2005) application that uses several processes
running on different hosts in an intranet.
Processes communicate with each other using TCP/IP. Different processes can be on the
same host or on different hosts (i.e. the communication can be both within the same
host or between different hosts).
We currently have a bug that appears irregularly. The communication seems to work
for a while, then it stops working. Then it works again for some time.
When the communication does not work, we get an error (apparently while a process
was trying to send data). The call looks like this:
send(socket, (char *) data, (int) data_size, 0);
By inspecting the error code we get from
WSAGetLastError()
we see that it is an error 10054. Here is what I found in the Microsoft documentation
(see here):
WSAECONNRESET
10054
Connection reset by peer.
An existing connection was forcibly closed by the remote host. This normally
results if the peer application on the remote host is suddenly stopped, the
host is rebooted, the host or remote network interface is disabled, or the
remote host uses a hard close (see setsockopt for more information on the
SO_LINGER option on the remote socket). This error may also result if a
connection was broken due to keep-alive activity detecting a failure while
one or more operations are in progress. Operations that were in progress
fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.
So, as far as I understand, the connection was interrupted by the receiving process.
In some cases this error is (AFAIK) correct: one process has terminated and
is therefore not reachable. In other cases both the sender and receiver are running
and logging activity, but they cannot communicate due to the above error (the error
is reported in the logs).
My questions:
What does the SO_LINGER option mean?
What is a keep-alive activity and how can it break a connection?
How is it possible to avoid this problem or recover from it?
Regarding the last question. The first solution we tried (actually, it is rather a
workaround) was resending the message when the error occurs. Unfortunately, the
same error occurs over and over again for a while (a few minutes). So this is not
a solution.
At the moment we do not understand whether we have a software problem or a configuration
issue: maybe we should check something in the Windows registry?
One hypothesis was that the OS runs out of ephemeral ports (in case connections are
closed but ports are not released because of TcpTimedWaitDelay), but by analyzing
this issue we think that there should be plenty of them: the problem occurs even
if messages are not sent too frequently between processes. However, we are still not
100% sure that we can exclude this: can ephemeral ports get lost in some way?
Another detail that might help is that sending and receiving occurs in each process
concurrently in separate threads: are there any shared data structures in the
TCP/IP libraries that might get corrupted?
What is also very strange is that the problem occurs irregularly: communication works
OK for a few minutes, then it does not work for a few minutes, then it works again.
Thank you for any ideas and suggestions.
EDIT
Thanks for the hints confirming that the only possible explanation was a connection closed error. By further analysis of the problem, we found out that the server-side process of the connection had crashed / had been terminated and had been restarted. So there was a new server process running and listening on the correct port, but the client had not detected this and was still trying to use the old connection. We now have a mechanism to detect such situations and reset the connection on the client side.
That error means that the connection was closed by the
remote side. So there is nothing your program can do except accept that the connection is broken.
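The recovery described in the question's edit (detect the dead connection, then reset it on the client side) could look roughly like the sketch below. It is written in Python, not the Winsock C++ of the original application, and the class ReconnectingSender is hypothetical; it only illustrates dropping the old socket and opening a new one before retrying.

```python
import socket


class ReconnectingSender:
    """Send bytes to host:port, reconnecting once if the peer reset us."""

    def __init__(self, host: str, port: int):
        self.addr = (host, port)
        self.sock = None

    def _drop(self) -> None:
        # Discard the dead socket; the old connection cannot be revived.
        if self.sock is not None:
            try:
                self.sock.close()
            finally:
                self.sock = None

    def send(self, payload: bytes, retries: int = 1) -> None:
        for attempt in range(retries + 1):
            try:
                if self.sock is None:
                    self.sock = socket.create_connection(self.addr, timeout=5)
                self.sock.sendall(payload)
                return
            except (ConnectionResetError, ConnectionAbortedError,
                    BrokenPipeError):
                # Roughly the Python counterpart of WSAECONNRESET: the
                # server side went away, so reconnect and try again.
                self._drop()
                if attempt == retries:
                    raise
```

Resending on the same socket, as in the first workaround from the question, cannot succeed because the reset applies to the connection, not to the individual message.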
I was facing this problem for some days recently and found out that Adobe Acrobat Reader update was the culprit. As soon as you completely uninstall Adobe from the system everything returns back to normal.
I spent a long time debugging a 10054/10053 error in S3 pre-signed uploads.
It turns out that the S3 server will reject pre-signed uploads for the first 15 minutes of a bucket's life.
So if you're debugging S3, check that it's not a new bucket.
If you're debugging something else, this is most likely a problem on the server side, not the client side.

TCP: Address already in use exception - possible causes for client port? NO PORT EXHAUSTION

Stupid problem. I get these from a client connecting to a server. Sadly, the setup is complicated, which makes debugging complex, and we are running out of options.
The environment:
* Client/server system, both running on the same machine. The client is actually a service doing some database manipulation at specific times.
* The connection comes from C# going through OleDb to an EasySoft JDBC driver to a custom-written JDBC server that then hosts logic in C++. Yeah, complex, but the third-party supplier decided to expose the extension mechanisms for their server through a JDBC interface. Not a lot can be done here ;)
The Symptom:
At (ir)regular intervals we get an "Address already in use: connect" error from the JDBC driver. The errors seem to come from one particular service we run.
Now, I did read all the stuff about port exhaustion. This is why we now have a little tool running that counts ports and their states every minute. The last time this happened, we had an astonishing 370 ports in use, with the count rising to about 900 AFTER the error. We already patched the registry (it is a Windows machine) to allow more than the standard 5000 client ports, but even so, we are far, far from that limit to start with.
Which is why I am asking here. Does anyone have an idea what ELSE could cause this?
It is a Windows 2003 Server machine, 64 bit. The only other thing I can see that may cause it (although this functionality is supposedly disabled) is Symantec Endpoint Protection, which is installed on the server; being capable of acting as a firewall, it could possibly intercept network traffic. I don't want to open a can of worms by pointing to Symantec prematurely (if pointing to Symantec can ever be seen as such). So, does anyone have an idea what else may be the cause?
Thanks
"Address already in use", aka WSAEADDRINUSE (10048), means that when the client socket prepared to connect to the server socket, it first tried to bind itself to a specific local IP/port pair that was already in use by another socket, either an active one or one that has been closed but is still in the TIME_WAIT state. This has nothing to do with the number of ports that are available.
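The bind conflict described above can be reproduced in a few lines. This is a sketch in Python, not the OleDb/JDBC stack from the question, and bind_conflict_errno is a hypothetical helper; it binds one socket to an OS-chosen port and then tries to bind a second socket to the same address and port.

```python
import errno
import socket


def bind_conflict_errno(host: str = "127.0.0.1"):
    """Bind two sockets to the same host/port; return the second bind's errno."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))          # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same IP/port pair, already taken
        except OSError as exc:
            return exc.errno           # expected: errno.EADDRINUSE
        return None
    finally:
        first.close()
        second.close()
```

Calling bind_conflict_errno() and comparing the result with errno.EADDRINUSE shows that a single occupied IP/port pair is enough to trigger the error, however many other ports are free.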
I'm having the same issue on a Windows 2000 Server with a .Net application connecting to a SQL Server 7.0. There are about 10 servers with the same configuration, and only one is showing this error several times a day. With a small test program I'm able to reproduce the error by just establishing a TCP connection to the SQL Server listening port. Running CurrPorts (http://www.nirsoft.net/utils/cports.html) shows there are still plenty of available ports in the range 1024-5000.
I'm out of ideas and would like to know if you've found a solution since you've posted your question.
Edit : I finally found the solution : a worm was present on the server (WORM_DOWNAD.A) and exhausted local ports without being noticed.

Resources