Relation between DISCINT and Keep Alive interval in WebSphere MQ server?

We had issues with a lot of applications connecting to our MQ server without properly disconnecting. We therefore introduced DISCINT on our server-connection channels with a value of 1800 seconds, which we found ideal for our transactions. But our keepalive interval is pretty high at 900 seconds, and we would like to reduce it to less than 300 seconds, as suggested by the mqconfig utility. Before doing that, I would like to know whether this will affect our disconnect interval value, whether it will override it, and whether it will cause more frequent disconnects, which would be a performance hit for us.
How do these two values work, and how are they related?
Thanks

TCP KeepAlive works below the application layer in the protocol stack, so it does not affect the channel disconnection configured by DISCINT.
However, lowering the value can result in more frequent disconnects if your network is unreliable, for example if it has intermittent, very short periods (shorter than the current KeepAlive interval, but longer than the new one) when packets are not flowing.
I think the main difference is that DISCINT is for disconnecting a technically working channel that has not been used for a given period, while KeepAlive is for detecting a TCP connection that is no longer working.
MQ also provides a means of detecting broken connections at the application layer, configured by the heartbeat interval (HBINT).
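For concreteness, here is a hedged sketch of where each of these knobs lives. The channel name is hypothetical and the values are illustrative, not recommendations:

    * MQSC (runmqsc): DISCINT closes an idle but healthy channel after the
    * given number of seconds; HBINT is MQ's own application-level probe.
    ALTER CHANNEL('APP.SVRCONN') CHLTYPE(SVRCONN) DISCINT(1800) HBINT(300)

    # qm.ini TCP stanza: MQ merely switches TCP KeepAlive on; the timing
    # itself is delegated to the operating system.
    TCP:
       KeepAlive=Yes

    # Linux: the keepalive interval that mqconfig checks is a kernel
    # setting, not an MQ one (shown here set to the suggested 300 seconds).
    sysctl -w net.ipv4.tcp_keepalive_time=300

Note that lowering the kernel keepalive does not change when DISCINT fires; the two timers run independently.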
These may help:
http://www-01.ibm.com/support/knowledgecenter/SSFKSJ_7.5.0/com.ibm.mq.con.doc/q015650_.htm
http://www-01.ibm.com/support/knowledgecenter/SSFKSJ_7.5.0/com.ibm.mq.ref.con.doc/q081900_.htm
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
http://www-01.ibm.com/support/knowledgecenter/SSFKSJ_7.5.0/com.ibm.mq.ref.con.doc/q081860_.htm

Related

Tracking 'TCP/IP failed to establish an outgoing connection' bug that happens rarely

We're seeing TCP/IP warnings and a few connection failures on our server, but they happen pretty rarely, like once a month or so. So far I have built a little monitoring application that just tracks TIME_WAIT statuses in netstat to see if there are any anomalies in the system we are monitoring. This was all done following Microsoft's documentation on handling port exhaustion. I was wondering whether I need to track anything else. Is there a faster way to resolve this problem? I am not sure where the bug originates, but this supposedly seems to be the only way.
The error in question:
TCP/IP failed to establish an outgoing connection because the selected
local endpoint was recently used to connect to the same remote
endpoint. This error typically occurs when outgoing connections are
opened and closed at a high rate, causing all available local ports to
be used and forcing TCP/IP to reuse a local port for an outgoing
connection. To minimize the risk of data corruption, the TCP/IP
standard requires a minimum time period to elapse between successive
connections from a given local endpoint to a given remote endpoint.
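For reference, a minimal sketch of the kind of TIME_WAIT monitor described above, assuming Node.js and a netstat binary on the PATH; the sampling interval and threshold are arbitrary examples:

    // timewait-monitor.js -- count sockets stuck in TIME_WAIT and warn when
    // the count suggests local ports are being exhausted.
    const { execSync } = require('child_process');

    function countTimeWait() {
      // `netstat -an` prints one line per socket on both Windows and Linux.
      const output = execSync('netstat -an').toString();
      return output.split('\n').filter(line => line.includes('TIME_WAIT')).length;
    }

    setInterval(() => {
      const n = countTimeWait();
      console.log(`${new Date().toISOString()} TIME_WAIT sockets: ${n}`);
      if (n > 10000) { // arbitrary example threshold
        console.warn('Possible port exhaustion: check connection churn');
      }
    }, 60 * 1000); // sample once a minute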

How long can a websocket connection last?

How long can a client stay connected to a server via WebSockets? Is there a time limit, or can they potentially stay connected for years?
A WebSocket connection can in theory last forever. Assuming the endpoints remain up, one common reason why long-lived TCP connections eventually terminate is inactivity. WebSockets have a ping-pong mechanism which, among other things, can avoid closure by "smart" network routers, which often have an inactivity timeout of a few hours. But if your application actively sends data (in either direction) at least once an hour, that's probably enough.
Still, "forever" is unlikely to be achieved in practice, because TCP connections on most networks eventually get terminated by some outage or other. You'll want intelligent reconnection logic if reliability is important.

nodeJS being bombarded with reconnections after restart

We have a node instance with about 2500 client socket connections. Everything runs fine, except occasionally something happens to the service (a restart or failover event in Azure); when the node instance comes back up and all socket connections try to reconnect, the service comes to a halt and the log just shows repeated socket connects/disconnects. Even if we stop the service and start it again, the same thing happens. We currently send out a package to our on-premises servers to kill the users' Chrome sessions, after which everything works fine as users begin logging in again. We have the clients connecting with 'forceNew' and forcing WebSockets only, rather than the default of long polling followed by an upgrade. Has anyone seen this, or have any ideas?
In your socket.io client code, you can force the reconnects to be spread out more over time. The two configuration variables that appear to be most relevant here are:
reconnectionDelay
Determines how long socket.io will initially wait before attempting a reconnect (it should back off from there if the server is down for a while). You can increase this to make it less likely that all clients try to reconnect at the same time.
randomizationFactor
This is a number between 0 and 1.0, defaulting to 0.5. It determines how much the above delay is randomly modified, so that client reconnects are more spread out rather than all happening at the same time. You can increase this value to increase the randomness of the reconnect timing.
See the socket.io client documentation for more details.
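As a hedged sketch, those options go straight into the client constructor. The values below are only meant to illustrate spreading reconnects out, and exact defaults vary by socket.io version; it assumes the socket.io client library is loaded so that io is available:

    // socket.io client options to spread reconnect attempts out in time.
    const socket = io('https://example.com', {
      transports: ['websocket'],   // websocket only, as in the question
      reconnectionDelay: 5000,     // wait ~5s before the first retry
      reconnectionDelayMax: 60000, // back off to at most 60s between retries
      randomizationFactor: 0.9     // heavy jitter so clients desynchronize
    });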
You may also want to review your server configuration to see whether it is as scalable as possible for moderate bursts of incoming socket requests. While nobody expects a server to handle 2500 simultaneous connection attempts all at once, the server should be able to queue up these connection requests and serve them as it gets time, without immediately failing any incoming connection that can't be handled right away.
There is a desirable middle ground: some number of connections held in a queue (usually controllable by server-side TCP configuration parameters), and when the queue gets too large, connections are failed immediately, at which point socket.io should back off and try again a little later. Adjusting the above variables tells it to wait longer before retrying.
Also, I'm curious why you are using forceNew; that does not seem like it would help you. Forcing WebSockets only (no initial polling) is a good thing.

Socket.io data loss when Internet speed drop

I am using socket.io 1.4 and I want to know what happens in this scenario:
The client emits like this:
socket.emit('test', data);
The client does 3 emits to the server, but suddenly the Internet speed drops and those emits may not reach the server.
After a while the Internet speed recovers, but what will happen to the previously failed emits?
Will they be emitted again automatically?
How should I handle that?
WebSockets use TCP, which is in general a reliable protocol. There is not exactly such a thing as "the Internet speed dropped and I lost some messages." If some segments are lost, they will be automatically retransmitted at the TCP level; if retransmission fails completely, the connection will be reset.
So what you are really asking is how socket.io handles this. The answer is that it has some amount of reconnection logic, and you may also want to monitor the connection in case it resets (hook up a listener for the disconnect event on the socket) if you want to take some extra action (like notifying the user).
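A short sketch of hooking those events on a socket.io 1.x client (the handler bodies are placeholders; what to resend after a reset is application-specific):

    // React to connection resets rather than assuming delivery.
    socket.on('disconnect', () => {
      // The connection dropped; socket.io will try to reconnect on its own.
      console.log('disconnected -- pause sending, notify the user');
    });

    socket.on('reconnect', () => {
      // Back online: resend or resynchronize whatever must not be lost.
      console.log('reconnected -- resync application state here');
    });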

WebSphere MQ DISC vs KAINT on SVRCONN channels

We have a major problem with many of our applications making improper connections (SVRCONN) to the queue manager and not issuing MQDISC when the connection is no longer required. This causes a lot of idle, stale connections, prevents applications from making new connections, and fails with a CONNECTION BROKEN (2009) error. We had been restricting application connections with the ClientIdle parameter in our Windows MQ at version 7.0.1.8, but now that we have migrated to MQ v7.5.0.2 on Linux, we are deciding on the best option available in the new version. ClientIdle no longer exists in the ini file for v7.5, but SVRCONN channels have DISCINT & KAINT. I have been going through the advantages and disadvantages of both for our scenario of applications connecting through SVRCONN channels and leaving connections open without issuing a disconnect. Which of these channel attributes is ideal for us? Any suggestions? Does either take precedence over the other?
First off, KAINT controls TCP functions, not MQ functions. That means that for it to take effect, the TCP KeepAlive function must be enabled in the qm.ini TCP stanza. There is nothing wrong with this, but the native HBINT and DISCINT are more responsive than delegating to TCP. TCP KeepAlive addresses the problem of the OS not having recognized that a socket's remote partner is gone, so the socket can be cleaned up; as long as the socket exists and MQ's channel is idle, MQ won't notice. When TCP cleans the socket up, MQ's exception callback routine sees it immediately and closes the channel.
Of the remaining two, DISCINT controls the interval after which MQ will terminate an idle but active socket, whereas HBINT controls the interval after which MQ will shut down an MCA attached to an orphaned socket. Ideally, you will have a modern MQ client and server so you can use both.
DISCINT should be set to a value longer than the longest expected interval between messages if you want the channel to stay up during the production shift. So if a channel should see message traffic at least once every 5 minutes by design, then a DISCINT longer than 5 minutes is required to avoid channel restart time.
HBINT actually flows a small heartbeat message over the channel, but only does so if HBINT seconds have passed without a message. This catches the case where the socket is dead but TCP hasn't yet cleaned it up. HBINT allows MQ to discover this before the OS does and take care of it, including tearing down the socket.
In general, really low values for HBINT can cause a lot of unnecessary traffic. For example, HBINT(5) would flow a heartbeat in every five-second interval in which no other channel traffic passes. Chances are you don't need to terminate orphaned channels within 5 seconds of the loss of the socket, so a larger value is probably more useful. That said, HBINT(5) would cause zero extra traffic in a system with a sustained message rate of 1/second, right up until the app died, in which case the orphaned socket would be killed pretty quickly.
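As a hedged sketch, the v7.5 replacement for the old ClientIdle restriction might look like this in runmqsc; the channel name and values are illustrative, not recommendations:

    * DISCINT(1800) ends a connection idle for 30 minutes; HBINT(300) lets
    * MQ notice a dead socket well before the OS does.
    ALTER CHANNEL('APP.SVRCONN') CHLTYPE(SVRCONN) DISCINT(1800) HBINT(300)

    * Confirm stale instances are gone: check status and last-message time.
    DISPLAY CHSTATUS('APP.SVRCONN') STATUS LMSGDA LMSGTI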
For more detail, go to the SupportPacs page and look for Morag's "Keeping Channels Running" presentation.
