Node.js being bombarded with reconnections after restart - socket.io

We have a Node instance with about 2500 client socket connections. Everything runs fine, except that occasionally something happens to the service (a restart or failover event in Azure), and when the Node instance comes back up and all the socket connections try to reconnect, the service comes to a halt and the log just shows repeated socket connects/disconnects. Even if we stop the service and start it again, the same thing happens. We currently send out a package to our on-premise servers to kill the users' Chrome sessions, and then everything works fine as users begin logging in again. We have the clients connecting with 'forceNew' and forcing WebSockets only, rather than the default long-polling-then-upgrade. Has anyone ever seen this or have any ideas?

In your socket.io client code, you can force the reconnects to be spread out more in time. The two configuration variables that appear most relevant here are:
reconnectionDelay
Determines how long socket.io will initially wait before attempting a reconnect (it should back off from there if the server is down for a while). You can increase this to make it less likely that all clients try to reconnect at the same time.
randomizationFactor
This is a number between 0 and 1.0 and defaults to 0.5. It determines how much the above delay is randomly modified so that client reconnects are staggered rather than all happening at the same time. You can increase this value to increase the randomness of the reconnect timing.
See the socket.io client documentation for more details.
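A minimal sketch of how these could be set on the client (the URL and the specific values are illustrative, not recommendations):

    // Illustrative socket.io client options to stagger reconnects
    var socket = io('https://example.com', {
        transports: ['websocket'],     // WebSocket only, no long-polling, as in the question
        reconnectionDelay: 5000,       // wait longer before the first reconnect attempt
        reconnectionDelayMax: 60000,   // cap for the exponential back-off
        randomizationFactor: 0.9       // spread the attempts out more randomly
    });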
You may also want to explore your server configuration to see if it is as scalable as possible with moderate numbers of incoming socket requests. While nobody expects a server to handle 2500 simultaneous connections all at once, the server should be able to queue up these connection requests and serve them as it gets time, without immediately failing any incoming connection it can't yet handle. There is a desirable middle ground: some number of connections held in a queue (usually controllable by server-side TCP configuration parameters such as the listen backlog), and when the queue gets too large, connections are failed immediately so that socket.io backs off and tries again a little later. Adjusting the above variables tells it to wait longer before retrying.
Also, I'm curious why you are using forceNew; that does not seem like it would help you. Forcing WebSockets only (no initial polling) is a good thing.

Related

Reconnect Interval

I am looking for best practices to handle server restarts. Specifically, I push stock prices to users over WebSockets for a day-trading simulation web app, with 10k concurrent users. To ensure a responsive UX, I reconnect to the WebSocket when the onclose event is fired. As our user base has grown, we have had to scale our hardware. In addition to better hardware, we have implemented a random delay before reconnecting, to spread out the influx of handshakes when the server restarts every night (continuous deployment). However, some of our users have poor internet (ISP and/or wifi) and their connection constantly drops. For these users I would prefer they reconnect immediately. Is there a solution to this problem that doesn't have the aforementioned tradeoffs?
The question calls for a subjective response; here is mine :)
Distinguishing a client disconnection from a server shutdown:
This can be achieved by sending a shutdown message over the WebSocket so that active clients can prepare and reconnect with a random delay. A client that encounters an onclose event without a prior shutdown broadcast can then reconnect as soon as possible. This means the client application needs to be modified to account for this special shutdown event, as sketched below.
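A minimal client-side sketch of that idea (the message shape, URL, and 30-second spread are assumptions for illustration):

    // Assumed protocol: the server broadcasts {"type": "shutdown"} before restarting.
    var serverAnnouncedShutdown = false;
    var ws = new WebSocket('wss://example.com/prices');

    ws.onmessage = function (event) {
        var msg = JSON.parse(event.data);
        if (msg.type === 'shutdown') serverAnnouncedShutdown = true;
        // ... handle normal price updates ...
    };

    ws.onclose = function () {
        if (serverAnnouncedShutdown) {
            // Planned restart: spread reconnects over, say, 0-30 seconds.
            setTimeout(reconnect, Math.random() * 30000);
        } else {
            // Unexpected drop (flaky ISP/wifi): reconnect immediately.
            reconnect();
        }
    };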
Handle the handshake load: some web servers can treat incoming connections as an asynchronous parallel event queue, so that at most X connections are initialized at the same time (in parallel) while the others wait in a queue until their turn comes. This safeguards server performance, and the WebSocket handshakes are automatically delayed based on the true processing capabilities of the server. Of course, this may mean a change of web server technology, and it depends on your use case.
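The queueing idea itself is generic; a rough sketch (the limit of 50 is arbitrary):

    // Admit at most MAX_PARALLEL handshakes at once; the rest wait in FIFO order.
    var MAX_PARALLEL = 50;
    var active = 0;
    var waiting = [];

    function admit(handshakeTask) {
        if (active >= MAX_PARALLEL) {
            waiting.push(handshakeTask);
            return;
        }
        active++;
        handshakeTask(function done() {
            active--;
            if (waiting.length > 0) admit(waiting.shift());
        });
    }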

What if the server didn't receive FD_CLOSE

I have a high-performance client-server system programmed from scratch, and I am still improving it. The server uses overlapped I/O to handle connections, and it correctly handles disconnections and resource deallocation. On the client side, I used the shutdown command with SD_RECEIVE to notify the server that the client has no data to receive after the final send from the client. This works well, and the server detects it as a graceful disconnection. Rarely, however, when the connection is very slow, the server doesn't detect this. I feel that the shutdown partial closure doesn't reach the server. How can I handle this? It's important that the server not hold on to this kind of connection; if it does, the server cannot be stopped, and I do not want to forcibly close all such connections.
On the client side, I used the shutdown command with SD_RECEIVE to notify the server that the client has no data to receive after the final send from the client.
It doesn't do that.
This works well
It doesn't work at all. The shutdown command with SD_RECEIVE that you're using is completely pointless. A close, or a shutdown with SD_SEND or SD_BOTH, sends a FIN: shutdown with SD_RECEIVE does exactly nothing on the wire, and specifically it does not 'notify the server' of anything.
I feel that the shutdown partial closure doesn't reach the server.
It never reaches the server. Your code doesn't work the way you think it does. What reaches the server is the FIN, which in turn is the result of the close, not the shutdown SD_RECEIVE.
What you need here is a read timeout at the server end. Since you're using select(), or whatever is delivering the events to you, you will have to implement the timeout manually yourself.
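The same principle, sketched in Node rather than Winsock since the mechanics are easier to show there (the 30-second value is an assumption): track activity per connection and reap any connection that stays silent past the timeout.

    var net = require('net');
    var READ_TIMEOUT_MS = 30000;   // assumed: 30 s of silence means the peer is gone

    var server = net.createServer(function (socket) {
        socket.setTimeout(READ_TIMEOUT_MS);   // the timer resets on socket activity
        socket.on('timeout', function () {
            socket.destroy();                 // reap the half-dead connection
        });
        socket.on('data', function (chunk) {
            // ... normal processing ...
        });
    });

    server.listen(9000);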

Socket.io data loss when Internet speed drops

I am using socket.io 1.4 and I want to know what happens in this scenario:
The client emits like this:
Socket.emit('test',data);
The client does 3 emits to the server, but suddenly the Internet speed drops and those emits may not reach the server.
After a while the Internet speed recovers, but what happens to the previous failed emits?
Will they be emitted again automatically?
How should I handle that?
WebSockets use TCP, which is in general a reliable protocol. There is not exactly such a thing as "the Internet speed dropped and I lost some messages": if some messages are lost, they will be automatically retransmitted at the TCP level, and if retransmission fails completely, the connection will be reset.
So what you are really asking is how socket.io handles this. The answer is that it has some reconnection logic built in, but you may also want to monitor the connection in case it resets (hook up a listener for the disconnect event on the socket) if you want to take some extra action, like notifying the user.
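One concrete option: socket.io lets you pass an acknowledgement callback with an emit, so you can track delivery yourself and re-send anything unacknowledged after a reconnect. A rough sketch (assumes the server calls the ack callback for the 'test' event):

    var pending = [];   // messages the server has not yet acknowledged

    function sendReliably(data) {
        pending.push(data);
        socket.emit('test', data, function ack() {
            // the server invoked the ack callback, so the message arrived
            pending.splice(pending.indexOf(data), 1);
        });
    }

    socket.on('reconnect', function () {
        var unsent = pending.splice(0);   // take and clear the queue
        unsent.forEach(function (data) {
            sendReliably(data);           // re-queue and re-emit each message
        });
    });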

Websockets and uwsgi - detect broken connections client side?

I'm using uwsgi's WebSocket support and so far it's looking great: the server detects when the client disconnects, and the client detects when the server goes down. But I'm concerned this will not work in every case/browser.
In other frameworks, namely SockJS, the connection is monitored by sending regular messages that work as heartbeats/pings. But uwsgi sends PING/PONG frames (i.e. control frames, not regular messages) according to the WebSocket spec, so from the client side I have no way to know when the last ping was received from the server. So my question is this:
If the connection is dropped or blocked by some proxy, will browsers (e.g. Chrome, IE, Firefox, Opera) reliably detect that no PING was received from the server and signal the connection as down, or should I implement some additional ping/pong system so that the connection is detected as closed from the client side?
Thanks
You are totally right. There is no way from the client side to track or send ping/pongs. So if the connection drops, the server is able to detect this condition through the ping/pong, but the client is left hanging... until it tries to send something and the underlying TCP mechanism detects that the other side is not ACKnowledging its packets.
Therefore, if the client application expects to be "listening" most of the time, it may be convenient to implement a keep-alive system that works "both ways", as Stephen Cleary explains in the link you posted. But this keep-alive system would be part of your application layer, rather than part of the transport layer like ping/pongs.
For example, you can have a message "{token:'whatever'}" that the server and the client just echo back with a 5-second delay. The client should have a timer with a 10-second timeout that is reset every time the message is received; if the timer fires, the connection can be considered dropped.
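A minimal client-side sketch of that scheme (the URL, token, and exact intervals are illustrative; the server is assumed to echo the token message back):

    var livenessTimer, pingInterval;

    function armTimer(ws) {
        clearTimeout(livenessTimer);
        livenessTimer = setTimeout(function () {
            ws.close();   // no echo within 10 s: consider the connection dropped
        }, 10000);
    }

    var ws = new WebSocket('wss://example.com/stream');

    ws.onopen = function () {
        armTimer(ws);
        pingInterval = setInterval(function () {
            ws.send(JSON.stringify({ token: 'whatever' }));
        }, 5000);
    };

    ws.onmessage = function (event) {
        var msg = JSON.parse(event.data);
        if (msg.token === 'whatever') armTimer(ws);   // server echoed: still alive
        // ... handle normal messages ...
    };

    ws.onclose = function () {
        clearInterval(pingInterval);
        clearTimeout(livenessTimer);
        // ... reconnect logic here ...
    };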
Although browsers that implement the same RFC as uWSGI should reliably detect when the server closes the connection cleanly, they won't detect when the connection is interrupted midway (half-open connections). So from what I understand, we should employ an extra mechanism like application-level pings.

Why might an EventMachine outbound data buffer stop sending and just fill up forever (while other connections can still send)

I have an EventMachine server sending TCP data down to a Mac client (via GCDAsyncSocket). It always works flawlessly for a while, but inevitably the server suddenly stops sending data on a connection-by-connection basis. The connection is still maintained, and the server still receives data from the client, but it doesn't go the other way.
When this happens, I've discovered via connection#get_outbound_data_size that the connection send buffer is filling up infinitely (via #send_data) and not being sent to the client.
Are there specific (and hopefully fixable) reasons why this might occur? The reactor keeps humming along, and other active connections to the server continue working fine (though they sometimes fall into buffer hell as well).
I see at least one reason: the remote client no longer reads data from its side of the TCP connection (with a recv() call or whatever).
The scenario is then: the receiving TCP buffer on the client side becomes full, and the OS can no longer accept TCP packets from its peer, since it cannot queue them. As a consequence, the sending TCP buffer on the server side becomes full too, as your application continues to send packets on the socket. Soon your server is no longer able to write to the socket, since the send() system call will either:
block indefinitely (waiting for the buffer to empty enough for the new packet), or
return with an EWOULDBLOCK error (if you configured your socket as non-blocking).
I usually run into that kind of situation in a TEST environment when I put a breakpoint in my code on the client side.
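If you'd rather detect this than buffer forever, you can check the outbound buffer size before each send and drop the connection once it grows past a bound. The question already notes that EventMachine exposes this via connection#get_outbound_data_size; here is the same backpressure idea sketched in Node for illustration (the 1 MB cap is arbitrary):

    // Drop connections whose outbound buffer keeps growing because the peer
    // has stopped reading.
    var MAX_BUFFERED = 1024 * 1024;   // assumed cap: 1 MB of queued writes

    function safeSend(socket, data) {
        if (socket.bufferSize > MAX_BUFFERED) {
            socket.destroy();   // peer is not draining: treat it as dead
            return;
        }
        socket.write(data);
    }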
A patch was applied to GCDAsyncSocket on March 23 that prevents the reads from stopping. Did this patch solve your problem?
