MQ | Sender LONGRTY - ibm-mq

I see a SENDER channel go into RETRY mode after the long retries (LONGRTS) start. It remains in RETRY mode and is re-started after LONGTMR(1200) seconds. My question is: does the Sender channel come back to RUNNING as soon as a message arrives, without waiting out LONGTMR, or does it wait for the full LONGTMR interval?

A SENDER channel will go into STATUS(RETRY) - a.k.a. Retry Mode - when the connection to its partner fails.
To begin with, on the assumption that many network failures are very short lived, a SENDER channel will make a small number of closely spaced attempts to re-make the network connection: 10 tries, 60 seconds apart. These are known as the "short retries".
These values, 10 attempts at 60-second intervals, are coded in the SENDER channel attributes SHORTRTY and SHORTTMR.
If after these first 10 attempts, the SENDER channel has still not managed to get reconnected to the network partner, it will now move to "long retries". It is now operating with the assumption that the network outage is a longer one, for example the partner queue manager machine is having maintenance applied, or there has been some other major outage, and not just a network blip.
The SENDER channel will now make what it hopes is an effectively infinite number of more widely spaced attempts to re-make the connection: 999999999 tries, 1200 seconds apart.
These values, 999999999 and 1200, are coded in the SENDER channel attributes LONGRTY and LONGTMR.
You can see how many attempts are left by using the DISPLAY CHSTATUS command and looking at the SHORTRTS and LONGRTS fields. These show how many of the 10 or 999999999 attempts remain. If you see SHORTRTS(0), you know the SENDER is into "long retry mode".
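For example (illustrative MQSC; the channel name, CONNAME and XMITQ are placeholders, and the retry values shown are the defaults described above):

DEFINE CHANNEL(TO.PARTNER) CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('partner.example.com(1414)') XMITQ(PARTNER.XMITQ) +
       SHORTRTY(10) SHORTTMR(60) LONGRTY(999999999) LONGTMR(1200)

* Where is the channel in its retry sequence?
DISPLAY CHSTATUS(TO.PARTNER) STATUS SHORTRTS LONGRTS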
If any of these attempts to re-make the connection is successful, it will stop retrying and you will see the SENDER channel show STATUS(RUNNING). Note that this success means the network connection was successfully re-made; it has nothing to do with whether a message arrives or not.
It will not continue making retry attempts after it successfully connects to the partner (until the next time the connection is lost of course).
If your channel is in STATUS(RETRY) you should look in the AMQERR01.LOG to discover the reason for the failure. It may be something you can fix at the SENDER end or it may be something that needs to be fixed at the RECEIVER end, for example restarting the queue manager or the listener.

Related

zeromq: ZMQ_CONFLATE==1 does not stop queues from saving old messages

With ZeroMQ and CPPZMQ 4.3.2, I want to drop old messages for all my sockets including
PAIR
Pub/Sub
REQ/REP
So I use m_socks[channel].setsockopt(ZMQ_CONFLATE, 1) on all my sockets before binding/connecting.
Test
However, when I made the following test, it seems that the old messages are still flushed out on each reconnection. In this test,
I use a thread to keep sending a generated sinewave to a receiver thread
Every 10 seconds I double the sinewave's frequency
Then after 10 seconds I stop the process
Below is the pseudocode of the sender
// on sender end
auto thenSec = high_resolution_clock::now();
while (m_isRunning) {
    // generate sinewave, double the frequency every 10s or so
    auto nowSec = high_resolution_clock::now();
    if (duration_cast<seconds>(nowSec - thenSec).count() > 10) {
        m_sine.SetFreq(m_sine.GetFreq() * 2);
        thenSec = nowSec;
    }
    m_sine.Generate(audio);
    // send to rendering thread
    m_messenger.send("inproc://sound-ear.pair",
                     (const void*)(audio),
                     audio_size,
                     zmq::send_flags::dontwait);
}
Note that I already use DONTWAIT to mitigate blocking.
On the receiver side I have a zmq::poller_event handler that simply receives the last message on event polling.
In the stop sequence I reset the sinewave frequency to its lowest value, say, 440Hz.
Expected
The expected behaviour would be:
If I stop both the sender and the receiver after 10s when the frequency is doubled,
and I restart both,
then I should see the sinewave reset to 440Hz.
Observed
But the observed behaviour is that the received sinewave is still of the doubled frequency after restarting the communication, i.e., 880Hz.
Question
Am I doing it wrong or should I use some kind of killswitch to force drop all messages in this case?
OK, I think I solved it myself. Kind of.
Actual solution
I finally realized that the behaviour I want is to flush all messages when I stop the rendering. According to the official doc (How can I flush all messages that are in the ZeroMQ socket queue?), this can only be achieved by:
setting the ZMQ_LINGER option to 0 on both the sender's and the receiver's sockets, meaning nothing is kept when those sockets are closed;
closing the sockets on both the sender and receiver ends, which also means re-creating the pollers and all references to the sockets.
This seems like a lot of unnecessary work if I'm to restart rendering my data right after the stop sequence, but I found no other way to solve this cleanly. A sketch of the stop-and-restart sequence is below.
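A minimal sketch of that stop sequence in cppzmq 4.3.x style, assuming a zmq::context_t named ctx and the inproc PAIR endpoint from the question:

// Hypothetical sketch: flush by closing with LINGER=0, then rebuild.
#include <zmq.hpp>

void stop_and_flush(zmq::socket_t &sock) {
    sock.setsockopt(ZMQ_LINGER, 0); // keep nothing on close
    sock.close();                   // queued messages are discarded here
}

zmq::socket_t restart_pair(zmq::context_t &ctx) {
    zmq::socket_t sock(ctx, zmq::socket_type::pair);
    sock.connect("inproc://sound-ear.pair"); // re-register with pollers as needed
    return sock;
}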
Initial effort
It seems to me that ZMQ_CONFLATE does not make a difference on PAIR. I really have to tweak the high water marks on the sender and receiver ends using ZMQ_SNDHWM and ZMQ_RCVHWM.
However, I said "kind of solved" because tweaking the HWM is, in the end, not the optimal solution for a realtime application:
even with ZMQ_SNDHWM / ZMQ_RCVHWM set to the minimum of 1, we still get sizable latency in realtime terms;
also, the consumer thread can fall into an underrun situation, i.e., perceivable jitter, with the lowest HWM.
If I'm not doing anything wrong, I guess the optimal solution for my targeted scenario would still be shared memory. This is sad because I really enjoyed the simplicity of ZMQ's multicast messaging patterns and hate to deal with thread locking littered everywhere.

'senderdisconnected' event is fired 5-10 min after actual disconnection

We have an explicit requirement to tear down the streaming when the sender disconnects.
However, we can see that 'senderdisconnected' fires, and window.castReceiverManager.onSenderDisconnected() is called, only about 10 minutes after the device actually left the network.
Can we somehow force the Receiver to check the connection more aggressively?
appConfig.maxInactivity = 6000;
Set this and you will be fine.
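A sketch of where that setting lives, assuming the standard v2 Receiver start-up flow:

var appConfig = new cast.receiver.CastReceiverManager.Config();
appConfig.maxInactivity = 6000; // value from the answer above
window.castReceiverManager = cast.receiver.CastReceiverManager.getInstance();
window.castReceiverManager.start(appConfig);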

sendReliable message sometimes not received by opposite peer

I've created a real time game for Google Play Game Services. It's in the later alpha stages right now. I have a question about sendReliableMessage. I've noticed certain cases where the other peer doesn't receive the message. I am aware that there is a callback onRealTimeMessageSent and I have some code in my MainActivity:
@Override
public void onRealTimeMessageSent(int i, int i2, String s) {
    // i = status code, i2 = message token ID, s = recipient participant ID
    if (i == GamesStatusCodes.STATUS_OK) {
        // sent OK; nothing to do
    } else {
        lastMessageStatus = i;
        sendToast("lastMessageStatus:" + Integer.toString(lastMessageStatus));
    }
}
My game's render loop checks the value of lastMessageStatus every iteration, and if there was something other than STATUS_OK I'm painting a T-Rex right now.
My question is: is checking the sent status really enough? I could also write code where the sender has to wait for an acknowledgement message: each message would be stamped with a UUID, and if an ACK is not received within a timeout, the sender would send the message again. Is an ACK-based system necessary to create a persistent connection?
I've noticed certain cases where there is some lag before the opposite peer received the reliable message and I was wondering is there a timeout on the sendReliable message? Google Play Services documentation doesn't seem to indicate in the documentation that there is a timeout at all.
Thank you
Reliable messages are just that, reliable. There are not a lot of use cases for the onRealTimeMessageSent callback for reliable messages because, as you said, it does not guarantee that the recipient has processed the message yet. Only that it was sent.
It may seem annoying, but an ACK-based system is the best way to know for sure that your user has received the message. A UUID is one good way to do this. I have done this myself and found it to work great (although now you have round-trip latency).
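A minimal sketch of that UUID-plus-ACK bookkeeping (a hypothetical class, not part of the Play Services API; it assumes you embed the UUID in each payload and the peer echoes it back in an ACK message):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Remember each reliable message by UUID; resend until the peer ACKs it.
public class AckTracker {
    private static final long TIMEOUT_MS = 2000; // resend if no ACK within 2s

    private final Map<UUID, byte[]> payloads = new ConcurrentHashMap<>();
    private final Map<UUID, Long> sentAt = new ConcurrentHashMap<>();

    // Call just before handing the payload to sendReliableMessage().
    public void track(UUID id, byte[] payload) {
        payloads.put(id, payload);
        sentAt.put(id, System.currentTimeMillis());
    }

    // Call when the peer's ACK carrying this UUID arrives.
    public void ack(UUID id) {
        payloads.remove(id);
        sentAt.remove(id);
    }

    // Call once per render-loop iteration; resend everything returned.
    public List<byte[]> due() {
        long now = System.currentTimeMillis();
        List<byte[]> out = new ArrayList<>();
        for (Map.Entry<UUID, Long> e : sentAt.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                out.add(payloads.get(e.getKey()));
                sentAt.put(e.getKey(), now); // restart the timeout
            }
        }
        return out;
    }
}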
As far as timeout, that is not implemented in the RealTime Messaging API. I have personally found round trip latency (send message, receive ACK in callback) to be about 200ms, and I have never found a way to make a message fail to deliver eventually even when purposefully using bad network conditions.

TCP socket stops receiving data until closed

I have a really weird problem that is driving me crazy.
I have a Ruby server and a Flash client (Action Script 3). It's a multiplayer game.
The problem is that everything works perfectly and then, suddenly, a random player stops receiving data. When the server closes the connection because of inactivity, some 20-60 seconds later, the client receives all the buffered data at once.
The client uses XMLSocket for retrieving data, so the way the client receives data is not the problem.
socket.addEventListener(Event.CONNECT, connectHandler);

function connectHandler(event)
{
    sendData(sess);
}

function sendData(dat)
{
    trace("SEND: " + dat);
    addDebugData("SEND: " + dat);
    if (socket.connected) {
        socket.send(dat);
    } else {
        addDebugData("SOCKET NOT CONNECTED");
    }
}

socket.addEventListener(DataEvent.DATA, dataHandler);

function dataHandler(e:DataEvent) {
    var data:String = e.data;
    workData(data);
}
The server flushes data after every write, so it's not a flushing problem:
sock.write(data + DATAEOF)
sock.flush()
DATAEOF is a null char, so the client can parse each string.
When the server accepts a new socket, it sets sync to true, to autoflush, and TCP_NODELAY to true too:
newsock = serverSocket.accept
newsock.sync = true
newsock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, true)
This is my research:
Info: I was dumping netstat data to a file each second.
When the client stops receiving data, netstat shows that the socket status is still ESTABLISHED.
Some seconds after that, the send queue grows in line with the data sent.
tcpflow shows that packets are sent twice.
When the server closes the socket, the socket status changes to FIN_WAIT1, as expected. Then tcpflow shows that all the buffered data is sent to the client, but the client doesn't receive it. Some seconds after that, the connection disappears from netstat, and tcpflow shows the same data being sent again; this time the client receives it and starts sending data to the server, and the server receives it. But it's too late: the server has closed the connection.
I don't think it's an OS/network problem, because I've moved from a VPS located in Spain to Amazon EC2 located in Ireland and the problem remains.
I don't think it's a client network problem either, because this occurs dozens of times per day, with an average of 45-55 users online and about 400 unique users a day, so the rate is extremely high.
EDIT:
I've done more research. I've changed the server to C++.
When a client stops sending data, after a while the server receives a "Connection reset by peer" error. At that moment, tcpdump shows that the client sent a RST packet. This could be because the client closed the connection and the server tried to read, but why did the client close the connection? I think the answer is that it is not the client closing the connection, it is the kernel. Here is some info: http://scie.nti.st/2008/3/14/amazon-s3-and-connection-reset-by-peer
Basically, as I understand it, Linux kernels 2.6.17+ increased the maximum size of the TCP window/buffer, and this started to cause other gear to wig out if it couldn't handle sufficiently large TCP windows. The gear resets the connection, and we see this as a "Connection reset by peer" message.
I've followed the steps in that article, and now it seems that the server closes connections only when the client actually loses its internet connection.
I'm going to add this as an answer so people know a bit more about this.
The kernel is the one closing the connection: Linux kernels 2.6.17+ increased the maximum TCP window/buffer size, which makes some network gear reset connections it can't handle, and that shows up as "Connection reset by peer" (see http://scie.nti.st/2008/3/14/amazon-s3-and-connection-reset-by-peer). After following the steps there, the server now closes connections only when the client actually loses its connection to the internet.
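The steps in that article amount to capping the kernel's TCP buffer sizes in /etc/sysctl.conf so the window never grows beyond what intermediate gear copes with; the values below are only illustrative, take the real ones from the article itself:

# /etc/sysctl.conf - min / default / max buffer sizes, in bytes
net.ipv4.tcp_rmem = 4096 87380 174760
net.ipv4.tcp_wmem = 4096 16384 131072
# apply without rebooting:
# sysctl -p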

WebSphere MQ Receiver Channel Paused

I think I tried to start a channel that was already running, or something like that. Whenever I start the sender channel, the receiver channel goes to a PAUSED state. I looked it up and found something about the AdoptNewMCA configuration, but I'm not sure how to set it at the queue manager level. How do I fix this smoothly? Merely stopping and restarting the channels does not do it.
Error log says:
08/02/2012 12:38:41 PM - Process(19161.269) User(mqm) Program(amqrmppa)
Host() Installation(Installation1)
VRMF(7.1.0.0) QMgr(QM_TEST2)
AMQ9514: Channel 'QM_TEST1.TO.QM_TEST2' is in use.
EXPLANATION: The requested operation failed because channel
'QM_TEST1.TO.QM_TEST2' is currently active. ACTION: Either end the channel
manually, or wait for it to close, and retry the operation.
----- amqrcsia.c : 1042 -------------------------------------------------------
08/02/2012 12:38:41 PM - Process(19161.269) User(mqm) Program(amqrmppa)
Host(...) Installation(Installation1)
VRMF(7.1.0.0) QMgr(QM_TEST2)
AMQ9999: Channel 'QM_TEST1.TO.QM_TEST2' to host '17.2.33.44' ended abnormally.
EXPLANATION: The channel program running under process ID 19161 for
channel 'QM_TEST1.TO.QM_TEST2' ended abnormally. The host name is
'17.2.33.44'; in some cases the host name cannot be
determined and so is shown as '????'. ACTION: Look at previous error
messages for the channel program in the error logs to determine the
cause of the failure. Note that this message can be excluded
completely or suppressed by tuning the "ExcludeMessage" or
"SuppressMessage" attributes under the "QMErrorLog" stanza in qm.ini.
Further information can be found in the System Administration Guide.
----- amqrmrsa.c : 887 --------------------------------------------------------
When looking these things up, I'd start first with the product manuals. In this case, the Infocenter topic on channel states says that a channel in PAUSED state is waiting on a retry interval. The sub-topic on channel errors explains why sending or receiving channels can be in retry:
If a channel is unable to put a message to the target queue because
that queue is full or put inhibited, the channel can retry the
operation a number of times (specified in the message-retry count
attribute) at a time interval (specified in the message-retry interval
attribute). Alternatively, you can write your own message-retry exit
that determines which circumstances cause a retry, and the number of
attempts made. The channel goes to PAUSED state while waiting for the
message-retry interval to finish.
So if you stop your channels, you should see a message in the XMitQ on the sending side. If you GET-enable that queue you can browse the message, look at the header, and see which queue it is destined for (see the example below). On the receiving side, look to see if that queue is full.
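For example (names assumed: on the sender the XMitQ is conventionally named after the remote queue manager, QM_TEST2 here; amqsbcg is the browse sample shipped with MQ):

* In runmqsc on the sending queue manager:
ALTER QLOCAL(QM_TEST2) GET(ENABLED)
* Then browse the message, headers included, from the shell:
* amqsbcg QM_TEST2 QM_TEST1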
Classic fast-sender, slow-consumer problem here. If the consumer can't keep up, messages back up on the receiving QMgr, then the channel goes into retry and they begin to back up on the sending QMgr. You have got to monitor depth and input handles on the request queues.
Make sure a DLQ is set.
Try reducing the message retry count to 1 to speed up use of the DLQ.
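In MQSC terms that is roughly (channel name taken from the log above; the DLQ name shown is the conventional default):

ALTER QMGR DEADQ(SYSTEM.DEAD.LETTER.QUEUE)
ALTER CHANNEL(QM_TEST1.TO.QM_TEST2) CHLTYPE(RCVR) MRRTY(1)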
