we have a major problem with many of our Applications making improper connections (SVRCONN) with queue manager and not issuing MQDISC when connection not required. This causes lot of idle stale connections and prevents Application from making new connections and fails with CONNECTION BROKEN (2009) error. We have been restricting Application connections with clientidle parameter in our Windows MQ on version 7.0.1.8 but when we migrated to MQ v7.5.0.2 in Linux platform we are deciding on the best option available in the new version. We do not have clientidle anymore in ini file for v7.5 but has DISCINT & KAINT in SVRCONN channels. I have been going through the advantages and disadvantages of both for our scenario of Application making connections through SVRCONN channels and leave connections open without issuing a disconnect. Which of these above channel attributes is ideal for us. Any suggestions? Does any of these take precedence over the other??
First off, KAINT controls TCP functions, not MQ functions. That means for it to take effect, the TCP Keepalive function must be enabled in the qm.ini TCP stanza. Nothing wrong with this, but the native HBINT and DISCINT are more responsive than delegating to TCP. This addresses the problem that the OS hasn't recognized that a socket's remote partner is gone and cleaned up the socket. As long as the socket exists and MQ's channel is idle, MQ won't notice. When TCP cleans the socket up, MQ's exception callback routine sees it immediately and closes the channel.
Of the remaining two, DISCINT controls the interval after which MQ will terminate an idle but active socket whereas HBINT controls the interval after which MQ will shut down an MCA attached to an orphan socket. Ideally, you will have a modern MQ client and server so you can use both of these.
The DISCINT should be a value longer than the longest expected interval between messages if you want the channel to stay up during the Production shift. So if a channel should have message traffic at least once every 5 minutes by design, then a DISCINT longer than 5 minutes would be required to avoid channel restart time.
The HBINT actually flows a small heartbeat message over the channel, but only will do so if HBINT seconds have passed without a message. Thsi catches the case that the socket is dead but TCP hasn't yet cleaned it up. HBINT allows MQ to discover this before the OS and take care of it, including tearing down the socket.
In general, really low values for HBINT can cause lots of unnecessary traffic. For example, HBINT(5) would flow a heartbeat every five second interval in which no other channel traffic is passed. chances are, you don't need to terminate orphan channels within 5 seconds of the loss of the socket so a larger value is perhaps more useful. That said, HBINT(5) would cause zero extra traffic in a system with a sustained message rate of 1/second - until the app died, in which case the orphan socket would be killed pretty quick.
For more detail, please go to the SupportPacs page and look for the Morag's "Keeping Channels Running" presentation.
Related
We have a node instance that has about 2500 client socket connections, everything runs fine except occasionally then something happens to the service (restart or failover event in azure), when the node instances comes back up and all socket connections try to reconnect the service comes to a halt and the log just shows repeated socket connect/disconnects. Even if we stop the service and start it the same thing happens, we currently send out a package to our on premise servers to kill the users chrome sessions then everything works fine as users begin logging in again. We have the clients currently connecting with 'forceNew' and force web sockets only and not the default long polling than upgrade. Any one ever see this or have ideas?
In your socket.io client code, you can force the reconnects to be spread out in time more. The two configuration variables that appear to be most relevant here are:
reconnectionDelay
Determines how long socket.io will initially wait before attempting a reconnect (it should back off from there if the server is down awhile). You can increase this to make it less likely they are all trying to reconnect at the same time.
randomizationFactor
This is a number between 0 and 1.0 and defaults to 0.5. It determines how much the above delay is randomly modified to try to make client reconnects be more random and not all at the same time. You can increase this value to increase the randomness of the reconnect timing.
See client doc here for more details.
You may also want to explore your server configuration to see if it is as scalable as possible with moderate numbers of incoming socket requests. While nobody expects a server to be able to handle 2500 simultaneous connections all at once, the server should be able to queue up these connection requests and serve them as it gets time without immediately failing any incoming connection that can't immediately be handled. There is a desirable middle ground of some number of connections held in a queue (usually controllable by server-side TCP configuration parameters) and then when the queue gets too large connections are failed immediately and then socket.io should back-off and try again a little later. Adjusting the above variables will tell it to wait longer before retrying.
Also, I'm curious why you are using forceNew. That does not seem like it would help you. Forcing webSockets only (no initial polling) is a good thing.
I am wondering how ZeroMQ behaves if messages are delivered over a bad quality link, e.g. a very unstable, low level serial connection which might drop individual bytes.
Of course in such a case the affected message will be lost, but will ZeroMQ be able to recover with the next message? Does it find the start again in any case?
Thank you!
The connection reliability is mostly the responsibility of the TCP protocol - if a socket believes it's connected, then the message is getting through. If packets are lost, then TCP detects that and attempts to retransmit them (see here for more info). This all happens "for free" as far as ZMQ is concerned, any connection type using TCP will behave the same way.
When the TCP connection is lost, which, presumably, could occur if the connection is very unreliable and the message never gets through after repeated attempts by TCP, then ZMQ adds another, separate layer of reliability on top of that, allowing your application to reconnect.
What happens with the original or subsequent messages during this outage depends on the ZMQ socket type you've chosen. Some socket types drop messages, some socket types queue them. If the message was already in transit, it may be lost because the sending socket has relinquished control over it.
Generally, if you want absolute reliability in message delivery, you'll be writing that yourself in your application, with your own confirmations that messages have been received. In most cases, something less than total reliability is needed and you'll just rely on TCP and ZMQ to get the job mostly done. If you're so focused on performance that even the reliability of TCP will slow you down too much and you'd rather just discard that data and move on, you'll need to use UDP - I've heard of people using UDP with ZMQ, but I haven't tried it and I don't believe it's fully supported across the board.
We had issues with lot of Applications connecting to MQ server without properly doing a disconnect. Hence we introduced DISCINT on our server connection channels with a value 1800 sec which we found ideal for our transactions. But our Keep Alive interval is pretty high with 900 sec. We would like to reduce that less than 300 as suggested by mqconfig util. But before doing that I would like to know if this is going to affect our disconnect interval value and whether it is going to override our disconnect interval value and make more frequent disconnects which will be a performance hit for us.
How does both these values work and how they are related?
Thanks
TCP KeepAlive works below the application layer in the protocol stack, so it does not affect the disconnecting of the channel configured by the DISCINT.
However lowering the value can result in more frequent disconnects, if your network is unreliable, for example has intermittent very short (shorter then the current KeepAlive, but longer then the new) periods when packets are not flowing.
I think the main difference is, that DISCINT is for disconnecting a technically working channel, which is not used for a given period, while KeepAlive is for detecting a not working TCP connection.
And MQ provides means to detect not working connections in the application layer too, configured by the heartbeat interval.
These may help:
http://www-01.ibm.com/support/knowledgecenter/SSFKSJ_7.5.0/com.ibm.mq.con.doc/q015650_.htm
http://www-01.ibm.com/support/knowledgecenter/SSFKSJ_7.5.0/com.ibm.mq.ref.con.doc/q081900_.htm
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
http://www-01.ibm.com/support/knowledgecenter/SSFKSJ_7.5.0/com.ibm.mq.ref.con.doc/q081860_.htm
I have a client/server setup in which clients send a single request message to the server and gets a bunch of data messages back.
The server is implemented using a ROUTER socket and the clients using a DEALER. The communication is asynchronous.
The clients are typically iPads/iPhones and they connect over wifi so the connection is not 100% reliable.
The issue I’m concern about is if the client connects to the server and sends a request for data but before the response messages are delivered back the communication goes down (e.g. out of wifi coverage).
In this case the messages will be queued up on the server side waiting for the client to reconnect. That is fine for a short time but eventually I would like to drop the messages and the connection to release resources.
By checking activity/timeouts it would be possible in the server and the client applications to identify that the connection is gone. The client can shutdown the socket and in this way free resources but how can it be done in the server?
Per the ZMQ FAQ:
How can I flush all messages that are in the ZeroMQ socket queue?
There is no explicit command for flushing a specific message or all messages from the message queue. You may set ZMQ_LINGER to 0 and close the socket to discard any unsent messages.
Per this mailing list discussion from 2013:
There is no option to drop old messages [from an outgoing message queue].
Your best bet is to implement heartbeating and, when one client stops responding without explicitly disconnecting, restart your ROUTER socket. Messy, I know, this is really something that should have a companion option to HWM. Pieter Hintjens is clearly on board (he created ZMQ) - but that was from 2011, so it looks like nothing ever came of it.
This is a bit late but setting tcp keepalive to a reasonable value will cause dead sockets to close after the timeouts have expired.
Heartbeating is necessary for either side to determine the other side is still responding.
The only thing I'm not sure about is how to go about heartbeating many thousands of clients without spending all available cpu just on dealing with the heartbeats.
In the past, our server application was designed so that a client creates one TCP connection, keeps this connection established indefinitely and sends messages when needed. These messages may come in high volume bursts or with longer idle periods in between. We are switching to a different connection protocol now, where the client will create a new connection per message and then disconnect after sending.
I have a few questions:
I know there is some overhead for the 3-shake handshake to establish the connection. But is this overhead significant (cpu, memory, bandwidth, etc.)?
Is there any difference between the latency of message being transferred for an established TCP connection that has been idle for minutes vs. creating a new connection and sending the message?
Are there any other factors/considerations that I should be considering if I'm trying to determine the performance impact of switching to this new connection protocol compared to the old one?
Any help at all is greatly appreciated.
Opening and closing a lot of TCP sessions may impact connection tracking firewalls and load balancers, causing them to slow down or even fail and reject the connection. Some, like the Linux iptables conntrack, have moderate default values for the number of tracked connections.
The program might run out of available local port numbers if it cycles messages too quickly. There is a TCP timeout before a socket can be considered "closed". There is often an operating system timer to clean up these closed connections. If too many sockets are opened too quickly, the operating system may not have had time to clean up.
The handshake adds about an extra 80 bytes to your bandwidth cost. The TCP connection close also involves FIN or RST packets, although these flags may be combined with the data packet.
Latency in an established TCP session might be a tiny bit higher if the Nagle algorithm is turned on for the message sender. Nagle causes the OS to wait for more data before sending a partially filled packet. The TCP session that is being closed will flush all data. The same effect can be had in the open session by disabling Nagle with the TCP_NODELAY flag.