When should I reset my MQ channels? - ibm-mq

I've been studying the vagaries of channel statuses, how they get to those states and what to do to get them stopped or started. I've got a pretty solid understanding now, but a colleague brought up the topic of channel resets.
I've done them occasionally when I couldn't explain what was going on, but now I understand things a bit better I'm not sure his advice to "always reset" when stopping troublesome channels is the right advice.
Searching for info online, it's clear that a reset is needed when recreating channels. But when stuff just breaks (a queue manager is unexpectedly dropped, the network breaks, or the like), is a reset a good idea in general, or should I only bother if I see sequence errors or the channel otherwise refuses to start when I know it should?

FYI, if you are resetting from the sending side of the channel, it's OK to set the sequence number to 1. The receiving side will then also go back to 1. QED :-)
If you are resetting from the receiving side of the channel, you must use the sequence number that the sender was expecting.
These numbers are in the queue manager error logs on both sides.
If the channel is in RETRY state, it will try to use the new sequence numbers when it does the next retry. This could be up to 20 minutes away if you are using the default retry attributes on the sender channel. A simple way to bump this is to STOP the channel and then START it again straight away.
HTH, G.
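The 20-minute figure above corresponds to the long retry timer. As a rough sketch, assuming the usual sender-channel defaults of SHORTRTY(10), SHORTTMR(60) and LONGTMR(1200) (check DISPLAY CHANNEL on your own system rather than trusting these values), the wait before each retry attempt looks like this. The function name is made up for illustration.

```python
def retry_wait_seconds(attempt, short_retries=10, short_timer=60, long_timer=1200):
    """Seconds a sender channel in RETRY waits before the given attempt (1-based).

    The default arguments are assumed defaults, not values read from a queue manager.
    """
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    # The first few attempts use the short timer, everything after uses the long one.
    return short_timer if attempt <= short_retries else long_timer


first_long_wait = retry_wait_seconds(11)
print(first_long_wait / 60, "minutes")
```

Once the short retries are used up, every further attempt waits the full long timer, which is why a STOP followed by a START is the quicker way to pick up new sequence numbers.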

Channels get sequence errors for a few reasons:
The local and remote MCAs got out of sync on a batch. Usually the remote MCA committed the batch but the local one did not. If you know the remote side delivered the batch, issue a RESOLVE ACTION(COMMIT) on the channel, otherwise issue RESOLVE ACTION(BACKOUT). After resolving, issue RESET.
The channel points to a new QMgr. Perhaps after failover at the DNS, circuit or firewall NAT, a different QMgr of the same name is now attached to the channel. These should be well known because the failover (hopefully) doesn't happen without some alerts going off.
The contents of the channel sync queue are in error. Sometimes the QMgr can cause this but those issues are resolved (so far as I know) in recent versions. Sometimes people accidentally mess up the sync queue, usually by browsing it with a lock while the channels are trying to use it. This is a little harder to resolve and may require clearing the sync queue but check with IBM Support first.
When the channel is out of sync because of a known exception like failover, go ahead and reset it. Otherwise, you'd be well advised to find out why it's out of sync. You might reset it just to get it up and running, but hopefully not until you've saved off the <QMGR>/errors/AMQERR*.LOG files and any FDCs so you can diagnose the cause.
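For the in-doubt batch case in the list above (commit if the remote side delivered the batch, back out otherwise, then reset), the command sequence can be sketched as a small helper that only builds the MQSC text. The helper name and the channel name are hypothetical; RESOLVE CHANNEL and RESET CHANNEL are the MQSC commands, issued on the sending side, and SEQNUM(1) follows the advice earlier in this thread.

```python
def in_doubt_recovery_commands(channel, remote_committed):
    """Build MQSC text for an in-doubt sender channel. Nothing is executed here."""
    # Commit only when the remote side is known to have delivered the batch.
    action = "COMMIT" if remote_committed else "BACKOUT"
    return [
        f"RESOLVE CHANNEL({channel}) ACTION({action})",
        f"RESET CHANNEL({channel}) SEQNUM(1)",
    ]


# Example channel name, chosen only for illustration.
commands = in_doubt_recovery_commands("MYQMGR.JUPITER", remote_committed=False)
for line in commands:
    print(line)
```

Save the error logs and FDCs before pasting the generated lines into runmqsc, as suggested above.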


How to grab the latest message sent from each connection

I have a ZMQ_PULL/ZMQ_PUSH socket connection.
I have multiple ZMQ_PUSH connections pushing to a single ZMQ_PULL connection.
ZMQ_PUSH connection 1----->
ZMQ_PUSH connection 2-----> ZMQ_PULL
ZMQ_PUSH connection N----->
I do not need every message, I just need the latest message that was sent. I am doing some inference on the back end and am streaming the results to the ZMQ_PULL socket.
I have set the ZMQ_PULL socket to Conflate=true
"If set, a socket shall keep only one message in its inbound/outbound queue, this message being the last message received/the last message to be sent. Ignores ZMQ_RCVHWM and ZMQ_SNDHWM options."
But after testing I realize I actually need the last message of each connection, not just the last message. So, if 3 connections, it grabs in a round robin from each connection, so I constantly have the latest from each connection.
Is there an option that is like Conflate, but instead of for all messages, it is for each connection?
Docs: http://api.zeromq.org/4-0:zmq-setsockopt
Is there an option that is like Conflate, but instead of for all messages, it is for each connection?
No.
The documentation you cite explains that 0MQ does not currently
offer direct support for such a single-socket use case.
You could certainly code it up and submit an upstream PR
so that future revs of 0MQ offer such functionality.
Given that you'll need app-level support to make
this work with 0MQ 4.3, simplest approach would
be to maintain N ZMQ_PULL sockets with ZMQ_CONFLATE
set, as you're already aware.
An alternate approach would be to assign a dedicated
thread or process to keep draining the existing muxed
socket, and update a shared memory data structure
that interested clients could consult.
The idea is to burn a core on keeping the queue
mostly empty, while doing no processing,
just focusing on communications.
Then other cores can examine "most recent message"
and each one then embarks on some expensive processing,
while another core continues to keep the queue drained.
This is essentially offering the 0MQ service proposed
above but at a different place in the stack,
up a level, within your application.
To do this in a distributed way,
the "queue draining service" would need to
know about idle workers.
That is, a worker could publish a brief
"I just completed an expensive task" message,
which would trigger the drainer to post
a fresh work item, never using shared memory at all.
This lets the drainer worry about eliding dup messages
that arrived when no one was available to immediately
start work on them, which have been superseded by a
more recent message.
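The "drain the muxed socket and keep only the newest item" idea can be sketched without any 0MQ calls. This is a minimal illustration, not a 0MQ feature: a plain queue.Queue stands in for the ZMQ_PULL socket, and it assumes each sender puts its own id in the message, since PUSH/PULL does not tell the receiver which connection a message came from. All names here are hypothetical.

```python
import queue
import threading


class LatestPerSender:
    """Keeps only the newest payload for each sender id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}

    def update(self, sender_id, payload):
        with self._lock:
            self._latest[sender_id] = payload  # overwrite any older payload

    def take_all(self):
        """Return the newest payload of every sender seen so far, then forget them."""
        with self._lock:
            snapshot, self._latest = self._latest, {}
        return snapshot


def drain(inbox, store):
    """Move everything currently queued into the store without blocking."""
    while True:
        try:
            sender_id, payload = inbox.get_nowait()
        except queue.Empty:
            return
        store.update(sender_id, payload)


# Stand-in for messages arriving on the single PULL socket.
inbox = queue.Queue()
for item in [("cam1", "a1"), ("cam2", "b1"), ("cam1", "a2"), ("cam3", "c1"), ("cam2", "b2")]:
    inbox.put(item)

store = LatestPerSender()
drain(inbox, store)
latest = store.take_all()
print(latest)  # {'cam1': 'a2', 'cam2': 'b2', 'cam3': 'c1'}
```

In a real application the drain loop would run in its own thread or process and read from the socket, while workers call take_all() whenever they are ready for fresh input.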

Detecting socket connection using ZeroMQ STREAM sockets

I am building a new application that receives data from a number of external devices and needs to make it available to a number of different components. ZeroMQ seems purpose-built for the "data bus" aspect of my architecture.
I recently became aware that zmq STREAM sockets can connect to native TCP sockets and send/receive messages. Using zmq throughout has a lot of appeal, but I have one problem that I don't know how to get around.
One of my devices needs to be set up. That is, I connect a socket to it, send it some configuration information, then sit back and wait for it to send me data. The device also has a "reset" capability (useful in some contexts), that requires re-sending the configuration information. Doing this depends upon having visibility to the setup/tear-down stage of the socket interface. I need to know when a new connection is established, so I can send the necessary configuration messages.
It seems that zmq is purposely designed to shield me from that knowledge. Is there a way to do what I want? Or should I just use regular sockets for this interface?
Well, it turns out that reading (the right version of) the fine manual can be instructive.
When a connection is made, a zero-length message will be received by the application. Similarly, when the peer disconnects (or the connection is lost), a zero-length message will be received by the application.
I guess all that remains is to disambiguate between connect and disconnect. Still looking for advice from the community, if others have dealt with this situation before.
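One way to disambiguate, sketched here under the assumption that zero-length messages strictly alternate per peer (connect first, then disconnect), is to remember which peer identities are currently connected. A STREAM socket delivers each message as an identity frame followed by a data frame; the function below only looks at those two values and makes no zmq calls. The function and variable names are hypothetical.

```python
def classify_stream_event(identity, payload, connected):
    """Classify one (identity, data) pair read from a STREAM socket.

    `connected` is a set of identities the caller keeps between calls.
    """
    if payload:
        return "data"
    if identity in connected:
        # Second empty message from a known peer: it went away.
        connected.discard(identity)
        return "disconnect"
    # First empty message from an unknown peer: new connection.
    connected.add(identity)
    return "connect"


connected = set()
frames = [
    (b"peer-1", b""),
    (b"peer-1", b"hello"),
    (b"peer-2", b""),
    (b"peer-1", b""),
    (b"peer-1", b""),
]
events = [classify_stream_event(i, p, connected) for i, p in frames]
print(events)  # ['connect', 'data', 'connect', 'disconnect', 'connect']
```

The "connect" result is the point where the configuration messages would be sent to the device.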
Following up on your own answer, I would hesitate to rely on that zero length connect/disconnect message as your whole strategy - that seems needlessly fragile. It's not clear to me from your question which end is persistent and which end needs configuration information, but I expect that one end knows it's resetting and reconnecting, and that end needs configuration information from the peer, so it should ask for it with a message when it needs it, to which the peer responds with the requested information.
If the peer does not yet have the required configuration information before it receives some other message, it could either queue up that work or it could respond back with the need for the config, and then have the rest of the network handle that need appropriately.
You shouldn't need stream/tcp sockets to make that work, it should work with more standard ZMQ socket types, you just need to build the robustness into your application rather than trying to get it for free from TCP/socket actions.
If I've missed your point, and what I'm suggesting won't work for some reason, you will have to give more specific information about your network topology for anyone else to understand what a suitable solution might be.

Azure Queue delayed message

I am seeing some strange behaviour with Azure queue messages on our production deployment:
Some of the messages in the queues appear with a big delay: minutes, and sometimes 10 minutes.
Before you ask about setting delayTimeout when we put a message on the queue: we do not set delayTimeout for these messages, so a message should appear almost immediately after it is placed on the queue.
At those moments we do not have a big load. My instances have no workload and are able to process messages fast, but the messages just don't appear.
Our service processes millions of messages per month, and we were able to identify 10-50 messages that were processed with a very big delay, which makes us miss the SLA we promise our customers.
Does anyone have any idea what the reason could be?
How can we overcome it?
Has anyone faced similar issues?
Some general ideas for troubleshooting:
Are you certain that the message was queued up for processing - ie the queue.addmessage operation returned successfully and then you are waiting 10 minutes - meaning you can rule out any client side retry policies etc as being the cause of the problem.
Is there any chance that the time calculation could be subject to some kind of clock skew problem? For example, if one of the worker roles pulling messages has its clock out of sync with the other worker roles, you could see this.
Is it possible that in the situations where the message is appearing to be delayed that a worker role responsible for pulling the messages is actually failing or crashing. If the client calls GetMessage but does not respond with an appropriate acknowledgement within the time specified by the invisibilityTimeout setting then the message will become visible again as the Queue Service assumes the client did not process the message. You could tell if this was a contributing factor by looking at the dequeue count on these messages that are taking longer. More information can be found here: http://msdn.microsoft.com/en-us/library/dd179474.aspx.
Is it possible that the number of workers you have pulling items from the queue is insufficient at certain times of the day and the delays are simply caused by the queue being populated faster than you can pull messages from the queue.
Have you enabled logging for queues and then looked to see if you can find the specific operations (look at e2elatency and serverlatency).
http://blogs.msdn.com/b/windowsazurestorage/archive/tags/analytics+2d00+logging+_2600_amp_3b00_+metrics/. You should also enable client logging and try to determine if the client is having connectivity problems and the retry logic is possibly kicking in.
And finally if none of these appear to help can you please send me the server logs (and ideally the client side logs as well) along with your account information (no passwords) to JAHOGG at Microsoft dot com.
Jason
Azure Service bus has a property in the BrokeredMessage class called ScheduledEnqueueTimeUtc, it allows you to set a time for when the message is added to the queue (effectively creating a delay).
Are you sure that your code is not setting this property? That might be the cause of the delay.
You can find more info on this at this url: https://www.amido.com/azure-service-bus-how-to-delay-a-message-being-sent-to-the-queue/
If you are using WebJobs to process messages from the queue, it can be due to WebJobs configuration.
From an MSDN forum post by pranav rastogi:
Starting with 0.4.0-beta, the (WebJobs) SDK implements a random exponential back-off algorithm. As a result of this if there are no messages on the queue, the SDK will back off and start polling less frequently.
The following setting allows you to configure this behavior.
MaxPollingInterval: when a queue remains empty, the longest period of time to wait before checking for a message. Default is 10 min.
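using System;
using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        // Cap the back-off so an idle queue is still polled at least once a minute.
        JobHostConfiguration config = new JobHostConfiguration();
        config.Queues.MaxPollingInterval = TimeSpan.FromMinutes(1);
        JobHost host = new JobHost(config);
        host.RunAndBlock();
    }
}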
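To see why the polling cap matters, here is a rough Python model of a randomized exponential back-off. It is not the WebJobs SDK's actual algorithm, only an illustration of the general technique; the function name, the 2-second minimum and the doubling rule are assumptions made for the sketch.

```python
import random


def polling_delays(min_interval, max_interval, empty_polls, rng):
    """Delays (seconds) between polls while the queue keeps coming back empty."""
    delays = []
    ceiling = min_interval
    for _ in range(empty_polls):
        # Pick a random delay up to the current ceiling, then raise the ceiling.
        delays.append(rng.uniform(min_interval, ceiling))
        ceiling = min(ceiling * 2, max_interval)
    return delays


rng = random.Random(0)
uncapped = polling_delays(2.0, 600.0, 12, rng)  # ceiling may grow to 10 minutes
capped = polling_delays(2.0, 60.0, 12, rng)     # ceiling stops at 1 minute
print(round(max(uncapped)), round(max(capped)))
```

A message that arrives just after a poll has to wait for the next one, so in this model the worst-case pickup delay on a quiet queue is roughly the maximum interval.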

WebSphere MQ v7.1 Channels going down

The sender and receiver channels that I have configured between two queue managers (WebSphere MQ v7.1 running on Red Hat Linux) are going down pretty frequently. Any idea why? How can I debug this? Thanks.
Channels are expected to go down. The idea is that they stay active as long as there is traffic and then time out. Assuming they've been configured to trigger, the presence of a message on the XMitQ causes the channel to start up again.
The reason for this is that a triggered channel will generally restart if interrupted by a network failure or other adverse event. However if a channel is configured to stay running 24x7 then the only way it stops is due to one of these adverse events and that increases the likelihood that human intervention will be required to restart the channel. On the other hand, a channel that times out can survive all sorts of nasty network events that occur while it is inactive. Allowing it to time out when not in use thus improves overall reliability of the channel.
So how do you cause a channel to trigger? Make sure the transmission queue contains the TRIGGER, TRIGTYPE, TRIGDATA and INITQ attributes. For example, to define a transmission queue to the JUPITER QMgr:
DEF QL(JUPITER) +
USAGE(XMITQ) +
TRIGGER +
TRIGTYPE(FIRST) +
TRIGDATA('MYQMGR.JUPITER') +
INITQ(SYSTEM.CHANNEL.INITQ) +
REPLACE
The only variable of the bunch is TRIGDATA which contains the name of the channel serving this XMitQ.
Of course, the channel initiator must be running but in modern versions of WMQ it starts by default (based on the value of the queue manager's SCHINIT attribute) so generally will in fact be running.
A channel that is in STOPPED state cannot be triggered. By default the STOP CHL command uses STATUS(STOPPED), so most of the time manually stopping a channel prevents triggering. If you want to stop a channel in such a way that it will restart (for example to test triggering), use the STOP CHL(CHLNAME) STATUS(INACTIVE) command. If the channel is already in STOPPED state, either issue the START CHL command to make it start immediately or use STOP CHL(CHLNAME) STATUS(INACTIVE) to change the status from STOPPED to INACTIVE without starting it.
Once the channels are up, the DISCINT attribute of the channel determines how long it will stay up without traffic before timing out. The value is in seconds and defaults to 6000, which is 100 minutes. The DISCINT, KAINT and HBINT attributes combine to determine when the channel comes down. Note that the TCP spec calls for keepalive to be disabled by default, so if you want to use keepalive on your channels, you must enable it in the QMgr tuning as described here.
Please see Triggering Channels in the Infocenter for more on the configuration details. Take a look at SupportPac MD0C WebSphere MQ - Keeping Channels Up and Running if you want to know more about the internals and tuning. (The SupportPac is a bit dated but the principles of tuning mostly still apply. Where there are discrepancies, the Infocenter is the authoritative version.)
If you want to keep channels up continuously, set DISCINT(0), but remember that triggering remains the preferred option. Some shops need to minimize response times during the business day and so set DISCINT to a value that allows the channels to time out at night but generally keeps them running all day. If for some reason you have triggering set up right and the channels go down prior to DISCINT, you should be able to find the reason in the error logs. These reside in the QMgr's directory under errors. For example, on UNIX/Linux they are in /var/mqm/qmgrs/qmgrname/errors and on Windows the default location is C:\Program Files (x86)\WebSphere MQ\QMgrs\qmgrname\errors. Look for the files named AMQERR??.LOG where ?? = 01, 02, or 03. The logs rotate, where 01 is current, 02 is next and so on. If you have a very busy QMgr you need to capture these as soon as the channel goes down or they could roll off.

JMS Messages not consumed till producer connection close :-(

I am relatively new to JMS and have encountered a weird problem implementing my first real application. I'm desperate for any help or advice.
Background: I use ActiveMQ (Java) as the message broker with non-transacted, non-persistent queues.
The Design: I have a straightforward producer/consumer system based around a single queue. A number of nodes (currently 2) place messages onto and consume from the queue. Selectors are used to filter which messages a node receives.
The Problem: The producer successfully places its items onto the queue (I have verified they are there using the web interface), however the consumers remain blocked and do not read them. Only when I close the JMS connection in the producer do the consumers jump into life and consume the messages as expected.
This behaviour seems very weird to me; surely you shouldn't have to completely hang up the producer connection for the consumers to be able to read from the queue. I must have made a mistake somewhere (possibly with sessions), but at the moment the number of things that could be wrong is too large and I have no idea what would cause this behaviour.
Any hints as to a solution, the cause of the problem or just how to continue debugging would be greatly appreciated.
Thanks for your time,
P.S. If you require any additional information I am happy to provide it.
Hard to say without seeing the code, but it sounds like the producer is transacted. You should not have to close the producer in order for the consumers to receive a message, but a transacted producer won't send its messages until you call commit. Another thing to check is that the connection has been started. Also, if you have many consumers you should look at the prefetch setting to ensure that one consumer doesn't hog all the messages. Setting the prefetch to 1 might be needed, but it's hard to say without further insight into your use case.
