TCP socket stops receiving data until closed - ruby

I have a really weird problem that is driving me crazy.
I have a Ruby server and a Flash client (Action Script 3). It's a multiplayer game.
The problem is that everything works perfectly and then, suddenly, a random player stops receiving data. When the server closes the connection because of inactivity, about 20-60 seconds later, the client receives all the buffered data.
The client uses XMLSocket for retrieving data, so the way the client receives data is not the problem.
socket.addEventListener(Event.CONNECT, connectHandler);
function connectHandler(event)
{
    sendData(sess);
}
function sendData(dat)
{
    trace("SEND: " + dat);
    addDebugData("SEND: " + dat);
    if (socket.connected) {
        socket.send(dat);
    } else {
        addDebugData("SOCKET NOT CONNECTED");
    }
}
socket.addEventListener(DataEvent.DATA, dataHandler);
function dataHandler(e:DataEvent) {
    var data:String = e.data;
    workData(data);
}
The server flushes data after every write, so it is not a flushing problem:
sock.write(data + DATAEOF)
sock.flush()
DATAEOF is a null character, which the client uses to split the stream into messages.
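To illustrate the null-terminated framing described above, here is a minimal Ruby sketch of the receiving side. It is not the poster's client code (the real client is ActionScript, where XMLSocket handles this internally); the NullFramer class and its feed method are hypothetical names made up for this example. It shows why a receiver has to buffer, since TCP may deliver one message in several chunks or several messages in one chunk.

```ruby
DATAEOF = "\0" # same delimiter the server appends after each message

# Hypothetical helper: accumulates raw chunks and returns complete messages.
class NullFramer
  def initialize
    @buffer = String.new # bytes received so far that do not yet form a full message
  end

  # Feed one chunk as read from the socket; returns an array of the
  # messages completed by this chunk (possibly empty).
  def feed(chunk)
    @buffer << chunk
    messages = []
    while (idx = @buffer.index(DATAEOF))
      # Cut everything up to and including the delimiter, then drop the delimiter.
      messages << @buffer.slice!(0..idx).chomp(DATAEOF)
    end
    messages
  end
end
```

A message is only handed to the application once its terminating null byte has arrived; anything after the last delimiter stays in the buffer for the next call.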
When the server accepts a new socket, it sets sync to true, to autoflush, and TCP_NODELAY to true too:
newsock = serverSocket.accept
newsock.sync = true
newsock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, true)
This is my research:
Info: I was dumping netstat data to a file every second.
When the client stops receiving data, netstat shows that the socket status is still ESTABLISHED.
A few seconds after that, the send queue grows according to the data sent.
tcpflow shows that packets are sent twice.
When the server closes the socket, the socket status changes to FIN_WAIT1, as expected. Then tcpflow shows that all the buffered data is sent to the client, but the client doesn't receive it. A few seconds later, the connection disappears from netstat and tcpflow shows that the same data is sent again; this time the client receives it, so it starts sending data to the server and the server receives it. But it's too late: the server has already closed the connection.
I don't think it's an OS/network problem, because I've changed from a VPS located in Spain to Amazon EC2 located in Ireland and the problem still remains.
I don't think it's a client network problem either, because this occurs dozens of times per day, and the average number of online users is about 45-55, with about 400 unique users a day, so the ratio is extremely high.
EDIT:
I've done more research. I've changed the server to C++.
When a client stops sending data, after a while the server receives a "Connection reset by peer" error. At that moment, tcpdump shows that the client sent an RST packet. This could be because the client closed the connection and the server tried to read, but why would the client close the connection? I think the answer is that it is not the client closing the connection but the kernel. Here is some info: http://scie.nti.st/2008/3/14/amazon-s3-and-connection-reset-by-peer
Basically, as I understand it, Linux kernels 2.6.17+ increased the maximum size of the TCP window/buffer, and this started to cause other gear to wig out, if it couldn’t handle sufficiently large TCP windows. The gear would reset the connection, and we see this as a “Connection reset by peer” message.
I've followed the steps and now it seems that the server is closing connections only when the client loses its connection to the internet.
I'm going to add this as an answer so people know a bit more about this.

I think the answer is that the kernel is the one closing the connection. Here is some info: http://scie.nti.st/2008/3/14/amazon-s3-and-connection-reset-by-peer
Basically, as I understand it, Linux kernels 2.6.17+ increased the maximum size of the TCP window/buffer, and this started to cause other gear to wig out, if it couldn’t handle sufficiently large TCP windows. The gear would reset the connection, and we see this as a “Connection reset by peer” message.
I've followed the steps and now it seems that the server is closing connections only when the client loses its connection to the internet.

Related

MQ | Sender LONGRTR

I see the SENDER channel go into RETRY mode after LONGRTS starts. It remains in RETRY mode and is restarted after LONGTMR(1200) seconds. My question is: does the SENDER channel come back to RUNNING as soon as a message arrives, without waiting for LONGTMR to elapse, or does it wait for the LONGTMR interval?
A SENDER channel will go into STATUS(RETRY) - a.k.a. Retry Mode - when the connection to its partner fails.
To begin with, on the assumption that many network failures are very short lived, a SENDER channel will try a small number of fairly close together attempts to re-make the network connection. It will try 10 times at 60 seconds apart, to re-make the connection. This is known as the "short retries".
These values, 10 attempts at 60 seconds apart, are set in the SENDER channel attributes SHORTRTY and SHORTTMR.
If after these first 10 attempts, the SENDER channel has still not managed to get reconnected to the network partner, it will now move to "long retries". It is now operating with the assumption that the network outage is a longer one, for example the partner queue manager machine is having maintenance applied, or there has been some other major outage, and not just a network blip.
The SENDER channel will now try what it hopes is an infinite number of slightly more spaced apart attempts to re-make the connection. It will try 999999999 times at 1200 seconds apart, to re-make the connection.
These values, 999999999 and 1200, are set in the SENDER channel attributes LONGRTY and LONGTMR.
You can see how many attempts are left by using the DISPLAY CHSTATUS command and looking at the SHORTRTS and LONGRTS fields. These show how many of the 10 or 999999999 attempts are left. If you see SHORTRTS(0), you know the SENDER is in "long retry mode".
If, on any of these attempts to re-make the connection, it is successful, it will stop retrying and you will see the SENDER channel show STATUS(RUNNING). Note that the success is due to the network connection having been successfully made, and is nothing to do with whether a message arrives or not.
It will not continue making retry attempts after it successfully connects to the partner (until the next time the connection is lost of course).
If your channel is in STATUS(RETRY) you should look in the AMQERR01.LOG to discover the reason for the failure. It may be something you can fix at the SENDER end or it may be something that needs to be fixed at the RECEIVER end, for example restarting the queue manager or the listener.
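As a rough illustration of the short/long retry schedule described above, here is a small Ruby sketch of the arithmetic. It is not MQ code: the constants simply mirror the default values quoted in the answer, and the helpers retry_interval and elapsed_before_attempt are hypothetical names. It also assumes each attempt fails instantly and that the first retry happens one full interval after the failure, which a real channel will not match exactly.

```ruby
# Default values quoted in the answer; a real channel definition may override them.
SHORTRTY = 10            # number of short retries
SHORTTMR = 60            # seconds between short retries
LONGRTY  = 999_999_999   # number of long retries
LONGTMR  = 1200          # seconds between long retries

# Hypothetical helper: seconds waited before retry attempt `attempt` (1-based),
# or nil once both retry counts are used up.
def retry_interval(attempt)
  return SHORTTMR if attempt <= SHORTRTY
  return LONGTMR if attempt <= SHORTRTY + LONGRTY
  nil
end

# Hypothetical helper: total seconds waited by the time attempt `attempt` is made
# (assumes `attempt` is within the configured retry counts).
def elapsed_before_attempt(attempt)
  short = [attempt, SHORTRTY].min       # attempts made in short retry mode
  long  = [attempt - SHORTRTY, 0].max   # attempts made in long retry mode
  short * SHORTTMR + long * LONGTMR
end
```

Under these assumptions the channel spends the first 600 seconds in short retry mode, and every attempt after the tenth is 1200 seconds apart. The schedule depends only on connection attempts, never on message arrival.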

How does ZeroMQ connect and bind work internally

I am experimenting with ZeroMQ. And I found it really interesting that in ZeroMQ, it does not matter whether either connect or bind happens first. I tried looking into the source code of ZeroMQ but it was too big to find anything.
The code is as follows.
# client side
import zmq
ctx = zmq.Context()
socket = ctx.socket(zmq.PAIR)
socket.connect('tcp://*:2345') # line [1]
# make it wait here
# server side
import zmq
ctx = zmq.Context()
socket = ctx.socket(zmq.PAIR)
socket.bind('tcp://localhost:2345')
# make it wait here
If I start client side first, the server has not been started yet, but magically the code is not blocked at line [1]. At this point, I checked with ss and made sure that the client is not listening on any port. Nor does it have any open connection. Then I start the server. Now the server is listening on port 2345, and magically the client is connected to it. My question is how does the client know the server is now online?
The best place to ask your question is the ZMQ mailing list, as many of the developers (and founders!) of the library are active there and can answer your question directly, but I'll give it a try. I'll admit that I'm not a C developer so my understanding of the source is limited, but here's what I gather, mostly from src/tcp_connector.cpp (other transports are covered in their respective files and may behave differently).
Line 214 starts the open() method, and here looks to be the meat of what's going on.
To answer your question about why the code is not blocked at Line [1], see line 258. It's specifically calling a method to make the socket behave asynchronously (for specifics on how unblock_socket() works you'll have to talk to someone more versed in C, it's defined here).
On line 278, it attempts to make the connection to the remote peer. If it's successful immediately, you're good, the bound socket was there and we've connected. If it wasn't, on line 294 it sets the error code to EINPROGRESS and fails.
To see what happens then, we go back to the start_connecting() method on line 161. This is where the open() method is called from, and where the EINPROGRESS error is used. My best understanding of what's happening here is that if at first it does not succeed, it tries again, asynchronously, until it finds its peer.
I think the best answer is in the ZeroMQ wiki:
When should I use bind and when connect?
As a very general advice: use bind on the most stable points in your architecture and connect from the more volatile endpoints. For request/reply the service provider might be point where you bind and the client uses connect. Like plain old TCP.
If you can't figure out which parts are more stable (i.e. peer-to-peer) think about a stable device in the middle, where both sides can connect to.
The question of bind or connect is often overemphasized. It's really just a matter of what the endpoints do and if they live long — or not. And this depends on your architecture. So build your architecture to fit your problem, not to fit the tool.
And
Why do I see different behavior when I bind a socket versus connect a socket?
ZeroMQ creates queues per underlying connection, e.g. if your socket is connected to 3 peer sockets there are 3 messages queues.
With bind, you allow peers to connect to you, thus you don't know how many peers there will be in the future and you cannot create the queues in advance. Instead, queues are created as individual peers connect to the bound socket.
With connect, ZeroMQ knows that there's going to be at least a single peer and thus it can create a single queue immediately. This applies to all socket types except ROUTER, where queues are only created after the peer we connect to has acknowledged our connection.
Consequently, when sending a message to bound socket with no peers, or a ROUTER with no live connections, there's no queue to store the message to.
When you call socket.connect('tcp://*:2345') or socket.bind('tcp://localhost:2345') you are not calling these methods directly on an underlying TCP socket. All of ZMQ's IO - including connecting/binding underlying TCP sockets - happens in threads that are abstracted away from the user.
When these methods are called on a ZMQ socket it essentially queues these events within the IO threads. Once the IO threads begin to process them they will not return an error unless the event is truly impossible, otherwise they will continually attempt to connect/reconnect.
This means that a ZMQ socket may return without an error even if socket.connect is not successful. In your example it would likely fail without error, but then quickly reattempt and succeed if you were to run the server side of the script.
It may also allow you to send messages while in this state (depending on the state of the queue in this situation, rather than the state of the network) and will then attempt to transmit queued messages once the IO threads are able to successfully connect. This also applies if a working TCP connection is later lost. The queues may continue to accept messages for the unconnected socket while IO attempts to automatically resolve the lost connection in the background. If the endpoint takes a while to come back online it should still receive its messages.
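The idea of "connect returns immediately, a background thread keeps retrying, and messages queue up in the meantime" can be sketched with plain Ruby sockets. This is only an analogy and not how ZeroMQ is implemented: LazyConnector, enqueue and the retry_interval parameter are hypothetical names, and the sketch only handles a refused connection.

```ruby
require 'socket'

# Hypothetical class, not part of ZeroMQ: connects in a background thread and
# buffers outgoing messages until the peer is reachable.
class LazyConnector
  def initialize(host, port, retry_interval: 0.1)
    @outbox = Queue.new # messages accepted before (or while) connected
    @worker = Thread.new do
      socket = connect_with_retry(host, port, retry_interval)
      # Drain queued messages in order; a nil entry is the shutdown signal.
      while (message = @outbox.pop)
        socket.write(message)
      end
      socket.close
    end
  end

  # Returns immediately, whether or not the connection exists yet.
  def enqueue(message)
    @outbox << message
  end

  # Flushes everything queued so far, then closes the connection.
  def close
    @outbox << nil
    @worker.join
  end

  private

  # Keeps trying until something is listening on host:port.
  def connect_with_retry(host, port, retry_interval)
    TCPSocket.new(host, port)
  rescue Errno::ECONNREFUSED
    sleep retry_interval
    retry
  end
end
```

The caller never sees the refused connection attempts; they are absorbed by the worker thread. Note that in this sketch close blocks until the peer has appeared, which is a simplification.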
To better explain here's another example
<?php
$pid = pcntl_fork();
if($pid)
{
    $context = new ZMQContext();
    $client = new ZMQSocket($context, ZMQ::SOCKET_REQ);
    try
    {
        $client->connect("tcp://0.0.0.0:9000");
    }catch (ZMQSocketException $e)
    {
        var_dump($e);
    }
    $client->send("request");
    $msg = $client->recv();
    var_dump($msg);
}else
{
    // in spawned process
    echo "waiting 2 seconds\n";
    sleep(2);
    $context = new ZMQContext();
    $server = new ZMQSocket($context, ZMQ::SOCKET_REP);
    try
    {
        $server->bind("tcp://0.0.0.0:9000");
    }catch (ZMQSocketException $e)
    {
        var_dump($e);
    }
    $msg = $server->recv();
    $server->send("response");
    var_dump($msg);
}
The binding process will not begin until 2 seconds later than the connecting process. But once the child process wakes and successfully binds the req/rep transaction will successfully take place without error.
jason@jason-VirtualBox:~/php-dev$ php play.php
waiting 2 seconds
string(7) "request"
string(8) "response"
If I were to replace tcp://0.0.0.0:9000 on the binding socket with tcp://0.0.0.0:2345, it would hang because the client is trying to connect to tcp://0.0.0.0:9000, yet still without error.
But if I replace both with tcp://localhost:2345, I get an error on my system because it can't bind on localhost, making the call truly impossible.
object(ZMQSocketException)#3 (7) {
["message":protected]=>
string(38) "Failed to bind the ZMQ: No such device"
["string":"Exception":private]=>
string(0) ""
["code":protected]=>
int(19)
["file":protected]=>
string(28) "/home/jason/php-dev/play.php"
["line":protected]=>
int(40)
["trace":"Exception":private]=>
array(1) {
[0]=>
array(6) {
["file"]=>
string(28) "/home/jason/php-dev/play.php"
["line"]=>
int(40)
["function"]=>
string(4) "bind"
["class"]=>
string(9) "ZMQSocket"
["type"]=>
string(2) "->"
["args"]=>
array(1) {
[0]=>
string(20) "tcp://localhost:2345"
}
}
}
["previous":"Exception":private]=>
NULL
}
If you need real-time information about the state of the underlying sockets, you should look into socket monitors. Using socket monitors along with ZMQ poll allows you to poll for both socket events and queue events.
Keep in mind that polling a monitor socket using ZMQ poll is not the same as polling a ZMQ_FD resource via select, epoll, etc. The ZMQ_FD is edge-triggered and therefore doesn't behave the way you would expect when polling network resources, whereas a monitor socket within ZMQ poll is level-triggered. Also, monitor sockets are very lightweight, and the latency between the system event and the resulting monitor event is typically sub-microsecond.

Windows socket write() is blocked by a read() unexpectedly

(This is an updated repost of a previously asked question, posted with an old account whose credentials I've lost.)
I'm writing a client-server program, where the client is C++/WinAPI and the server is C#/.NET.
The socket is an AF_INET/SOCK_STREAM socket and I'm using ReadFile() and WriteFile() for socket I/O on the client.
The client has a loop where it reads from the server (and may block the calling thread [denote it t1], which is fine with me). It also has another thread [denote it t2] that waits on an Event object with a timeout.
If the timeout is reached (and the Event has not yet been signaled), the t2 thread will write (exactly one byte) on the same socket.
The problem I have is that the write seems not to return until the read on t1 returns (in some legitimate scenarios that will never happen), as if the socket were not full-duplex.
I've checked it bottom to top, and there are no other thread-related complications. I actually ran the client from a different machine and used Wireshark to sniff the traffic. After the timeout was reached, there was nothing on the wire (no matter how long I waited) until I wrote some data from the server side (so the read() on t1 returned), and right after that the client's write() was finally transmitted.
As I explained in your earlier question, sockets are full duplex. A read in one thread will not block a write in another thread, and vice versa.
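The general claim can be checked with a small Ruby sketch over a loopback TCP connection. It uses Ruby's own socket layer rather than ReadFile()/WriteFile() on Windows, so it only illustrates full-duplex behaviour in general and does not reproduce or diagnose the poster's setup. The method name full_duplex_demo is made up for this example.

```ruby
require 'socket'

# Returns [byte seen by the peer, byte seen by the blocked reader].
def full_duplex_demo
  server = TCPServer.new('127.0.0.1', 0)             # port 0: let the OS pick a free port
  client = TCPSocket.new('127.0.0.1', server.addr[1])
  peer   = server.accept

  reader = Thread.new { client.read(1) }             # "t1": blocks, nothing to read yet
  sleep 0.1                                          # give t1 time to block in read

  client.write('x')                                  # "t2": write on the same socket
  seen_by_peer = peer.read(1)                        # arrives while t1 is still blocked

  peer.write('y')                                    # now let t1's read complete
  seen_by_reader = reader.value
  [seen_by_peer, seen_by_reader]
ensure
  [server, client, peer].compact.each(&:close)
end
```

Here the write completes and reaches the peer while another thread is still blocked in a read on the same socket.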

reconnect tcpsocket (or how to detect closed socket)

I have a ruby tcpsocket client that is connected to a server.
How can I check to see if the socket is connected before I send the data ?
Do I try to "rescue" a disconnected tcpsocket, reconnect and then resend ? if so, does anyone have a simple code sample as I don't know where to begin :(
I was quite proud that I managed to get a persistent connected client tcpsocket in rails. Then the server decided to kill the client and it all fell apart ;)
edit
I've used this code to get round some of the problems - it will try to reconnect if not connected, but won't handle the case if the server is down (it will keep retrying). Is this the start of the right approach ? Thanks
def self.write(data)
  begin
    @@my_connection.write(data)
  rescue Exception => e
    @@my_connection = TCPSocket.new 'localhost', 8192
    retry
  end
end
What I usually do in these types of scenarios is keep track of consecutive retries in a variable and have some other variable that sets the retry roof. Once we hit the roof, throw some type of exception that indicates there is a network or server problem. You'll want to reset the retry count variable on success of course.
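A minimal sketch of that approach might look like the following. The class name ReconnectingClient, the GaveUp exception and the max_retries and retry_delay parameters are all assumptions made for this example, not part of the original code.

```ruby
require 'socket'

# Hypothetical wrapper: reconnects on failure, but gives up after a fixed
# number of consecutive retries instead of looping forever.
class ReconnectingClient
  class GaveUp < StandardError; end

  def initialize(host, port, max_retries: 3, retry_delay: 1.0)
    @host = host
    @port = port
    @max_retries = max_retries # the "retry roof"
    @retry_delay = retry_delay # seconds to wait between attempts
    @socket = nil
  end

  def write(data)
    retries = 0 # consecutive failures; starts from zero again on every call
    begin
      @socket ||= TCPSocket.new(@host, @port)
      @socket.write(data)
    rescue SystemCallError, IOError => e
      drop_socket
      retries += 1
      raise GaveUp, "no connection after #{@max_retries} retries: #{e.message}" if retries > @max_retries
      sleep @retry_delay
      retry
    end
  end

  private

  # Forget the broken socket so the next attempt opens a new one.
  def drop_socket
    @socket.close if @socket && !@socket.closed?
    @socket = nil
  end
end
```

Because the counter is local to each call, it is reset implicitly after every successful write. Keep in mind that a write to a connection the server has already closed can still appear to succeed once, since the failure often only shows up on a later write or read.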

How can I tell if my Ruby server script is being overloaded?

I have a daemonized ruby script running on my server that looks like this:
@server = TCPServer.open(61101)
loop do
  @thr = Thread.new(@server.accept) do |sock|
    Thread.current[:myArrayOfHashes] = [] # hashes containing attributes of myObject
    SystemTimer.timeout_after(5) do
      Thread.current[:string] = sock.gets
      sock.close
      # parse the string and load the data into myArrayOfHashes
      Myobject.transaction do # Update the myObjects Table
        Thread.current[:myArrayOfHashes].each do |h|
          Thread.current[:newMyObject] = Myobject.new
          # load up the new object with data
          Thread.current[:newMyObject].save
        end
      end
    end
  end
  @thr.join
end
This server receives and manages data for my rails application which is all running on Mac OS 10.6. The clients call the server every 15 minutes on the 15 and while I currently only have 16 or so clients calling every 15 min on the 15, I'm wondering about the following:
If two clients call at close enough to the same time, will one client's connection attempt fail?
How can I figure out how many client connections my server can accommodate at the same time?
How can I monitor how much memory my server is using?
Also, is there an article you can point me toward that discusses the best way to implement this kind of a server? I mean can I have multiple instances of the server listening on the same port? Would that even help?
I am using Bluepill to monitor my server daemons.
1 and 2
The answer is no, two clients connecting close to each other will not make the connection fail (however multiple clients connecting may fail, see below).
The reason is the operating system has a default so called listening queue built into all server sockets. So even if you are not calling accept fast enough in your program, the OS will still keep buffering incoming connections for you. It will buffer these connections for as long as the listening queue does not get filled.
Now what is the size of this queue then?
In most cases the default size typically used is 5. The size is set after you create the socket and you call listen on this socket (see man page for listen here).
In Ruby, TCPServer automatically calls listen for you, and if you look at the C source code of the socket extension you will find that it indeed sets the size to 5:
https://github.com/ruby/ruby/blob/trunk/ext/socket/ipsocket.c#L108
SOMAXCONN is defined as 5 here:
https://github.com/ruby/ruby/blob/trunk/ext/socket/mkconstants.rb#L693
Now what happens if you don't call accept fast enough and the queue gets filled?
The answer is found in the man page of listen:
The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds.
In your code however there is one problem which can make the queue fill up if more than 5 clients try to connect at the same time: you're calling @thr.join at the end of the loop.
What effectively happens when you do this is that your server will not accept any new incoming connections until all your stuff inside your accept-thread has finished executing.
So if the database stuff and the other things you are doing inside the accept-thread takes a long time, the listening queue may fill up in the meantime. It depends on how long your processing takes, and how many clients could potentially be connecting at the exact same time.
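One way to avoid that, sketched below, is to not join each thread inside the accept loop, so the loop goes straight back to accept while the handlers run. The helper name serve and the backlog value of 64 are assumptions for this example; the sketch also leaves out the 5-second timeout and the database work from the original script.

```ruby
require 'socket'

# Hypothetical helper: accepts connections and hands each one to its own
# thread without joining it, so a slow handler never blocks the accept loop.
def serve(server, backlog: 64)
  server.listen(backlog) # ask for a longer listen queue (the OS may cap it)
  loop do
    Thread.new(server.accept) do |sock|
      begin
        line = sock.gets # read one line from the client
      ensure
        sock.close       # release the connection before the slow work starts
      end
      yield line         # e.g. parse the string and update the database
    end
  end
end
```

It could be used as serve(TCPServer.open(61101)) { |line| ... } with the processing in the block. The trade-off is that the number of threads is no longer bounded, so a burst of clients creates a burst of threads and database connections.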
3
You didn't say which platform you are running on, but on Linux/OS X the easiest way is to just run top in your console. For more advanced memory monitoring options you might want to check these out:
ruby/ruby on rails memory leak detection
track application memory usage on heroku
