WebSphere MQ v7.1 Channels going down - ibm-mq

The sender and receiver channels between two queue managers (WebSphere MQ v7.1 running on Red Hat Linux) that I have configured are going down pretty frequently. Any idea why? How can I debug this? Thanks.

Channels are expected to go down. The idea is that they stay active as long as there is traffic and then time out. Assuming they've been configured to trigger, the presence of a message on the XMitQ causes the channel to start up again.
The reason for this is that a triggered channel will generally restart if interrupted by a network failure or other adverse event. However if a channel is configured to stay running 24x7 then the only way it stops is due to one of these adverse events and that increases the likelihood that human intervention will be required to restart the channel. On the other hand, a channel that times out can survive all sorts of nasty network events that occur while it is inactive. Allowing it to time out when not in use thus improves overall reliability of the channel.
So how do you cause a channel to trigger? Make sure the transmission queue contains the TRIGGER, TRIGTYPE, TRIGDATA and INITQ attributes. For example, to define a transmission queue to the JUPITER QMgr:
DEF QL(JUPITER) +
USAGE(XMITQ) +
TRIGGER +
TRIGTYPE(FIRST) +
TRIGDATA('MYQMGR.JUPITER') +
INITQ(SYSTEM.CHANNEL.INITQ) +
REPLACE
The only variable of the bunch is TRIGDATA which contains the name of the channel serving this XMitQ.
Of course, the channel initiator must be running but in modern versions of WMQ it starts by default (based on the value of the queue manager's SCHINIT attribute) so generally will in fact be running.
A channel that is in STOPPED state cannot be triggered. By default the STOP CHL command uses STATUS(STOPPED), so most of the time manually stopping a channel prevents triggering. If you want to stop a channel in such a way that it will restart (for example to test triggering), use the STOP CHL(CHLNAME) STATUS(INACTIVE) command. If the channel is already in STOPPED state, either issue the START CHL command to make it start immediately or use STOP CHL(CHLNAME) STATUS(INACTIVE) to change the status from STOPPED to INACTIVE without starting it.
Once the channels are up, the DISCINT attribute of the channel determines how long it will run before timing out. The value is in seconds and defaults to 6000, which is 100 minutes. DISCINT, KAINT and HBINT combine to determine when the channel comes down. Note that the TCP spec calls for keepalive to be disabled by default, so if you want to use keepalive on your channels you must enable it in the QMgr tuning as described here.
Please see Triggering Channels in the Infocenter for more on the configuration details. Take a look at SupportPac MD0C WebSphere MQ - Keeping Channels Up and Running if you want to know more about the internals and tuning. (The SupportPac is a bit dated but the principles of tuning mostly still apply. Where there are discrepancies, the Infocenter is the authoritative version.)
If you want to keep channels up continuously, set DISCINT(0) but remember that triggering remains the preferred option. Some shops need to minimize response times during the business day and so set DISCINT to a value that allows the channels to time out at night but generally keeps them running all day. If for some reason you have triggering set up right and the channels go down prior to DISCINT, you should be able to check the error logs for the reason why. These reside in the QMgr's directory under errors. For example, on UNIX/Linux they are in /var/mqm/qmgrs/qmgrname/errors and on Windows the default location is C:\Program Files (x86)\WebSphere MQ\QMgrs\qmgrname\errors. Look for the files named AMQERR??.LOG where ?? = 01, 02, or 03. The logs rotate: 01 is current, 02 is the next oldest and so on. If you have a very busy QMgr you need to capture these as soon as the channel goes down or they could roll off.
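A minimal Python sketch of that log check, for illustration only. It assumes the message IDs of interest are AMQ9001 (channel ended normally), AMQ9545 (disconnect interval expired) and AMQ9999 (channel program ended abnormally); verify the IDs against your own logs. The function name channel_events is hypothetical and not part of WMQ.

```python
import re
from pathlib import Path

# Assumption: these are the message IDs worth flagging when a channel stops.
CHANNEL_MSG_IDS = {
    "AMQ9001": "channel ended normally",
    "AMQ9545": "disconnect interval expired",
    "AMQ9999": "channel program ended abnormally",
}


def channel_events(errors_dir):
    """Yield (file name, line number, message id, text) for channel-related lines."""
    pattern = re.compile(r"(AMQ\d{4})")
    # AMQERR01.LOG is the current file, so sorting by name reads the newest first.
    for log in sorted(Path(errors_dir).glob("AMQERR??.LOG")):
        with log.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                match = pattern.search(line)
                if match and match.group(1) in CHANNEL_MSG_IDS:
                    yield log.name, number, match.group(1), line.strip()
```

Calling channel_events("/var/mqm/qmgrs/qmgrname/errors") and printing each tuple gives a quick list of places to start reading. The surrounding log entry still has to be read in full to see the actual reason.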

Related

what are the retry settings for subscriber in pubsub and how to set them correctly in a spring application?

I have a Spring service subscribing to messages from a topic in Google Cloud Pub/Sub (pulling). It works correctly in general, but I want more control over resent messages. My service sometimes needs to nack a message or just let the ackDeadline pass so that I get the message again later. While testing with single messages, a nacked message comes back to me almost immediately, and the ones I neither ack nor nack come back after the 10 second default ackDeadline. I would like to postpone the redelivery of these messages. I thought the retry settings were designed for such cases.
I should mention as well that I am currently testing locally with an emulator and create the subscription from code. I am using the PubSubAdmin for managing.
Following this documentation, I tried setting these properties in my profile configuration like this:
spring.cloud.gcp.pubsub.subscriber.retry.initial-retry-delay-seconds: 4
spring.cloud.gcp.pubsub.subscriber.retry.max-attempts: 5
spring.cloud.gcp.pubsub.subscriber.retry.initial-rpc-timeout-seconds: 4
spring.cloud.gcp.pubsub.subscriber.retry.max-rpc-timeout-seconds: 8
spring.cloud.gcp.pubsub.subscriber.retry.max-retry-delay-seconds: 7
spring.cloud.gcp.pubsub.subscriber.retry.total-timeout-seconds: 3000
but it had no effect on how soon the messages came back.
Do I misunderstand the meaning of the retry settings? Maybe they only take effect when there are connection problems, and not when a message is nacked or left unacknowledged? Or do I have to set them while using deploymentManager to create the subscriptions, and am I not allowed to set them from code? Or maybe setting them in (development) profile configs won't work with the PubSubAdmin?
Thanks for any suggestions!
edit: I want the first retry to happen after 5 seconds, but next retry 10 seconds later, etc. Plus I want to set the max retry number. So what I am not interested in is setting the ackDeadline just to a bigger number.
edit2: why nacking: one of the services (let's call it a bridge) subscribes to the messages, has to validate each one and, if it is OK, pass it to another external system. This service acts as a bridge for that system, as we can't work on the second system directly. In some cases a message needs some extra information, so the bridge tries to fetch it from somewhere else (there are a lot of microservices involved), and sometimes the extra information is not there (yet) at that moment. So the first idea was to not ack the message and let it come back later. But I don't want to ask every 10 seconds for the next 7 days (with ackDeadline); I want to try just a few times, and if the information is not there after 2 hours, it will never come. So we tried to nack and hoped those retry settings could help manage the resending. As they don't, I suppose the only way forward is to build something in the bridge myself to manage these messages. Maybe store message ids and the number of retries so that I can ack after, for example, 5 attempts and push the message to another topic to deal with it differently. Or are there any better solutions known?
Cloud Pub/Sub does not provide exponential backoff for specific messages. A nack has no effect other than to tell Cloud Pub/Sub that you were not able to handle the message.
I could provide a more useful answer if you were to document why you needed to nack the messages. If you are unable to handle the current load, you can use the flow control options described here to reduce the number of outstanding messages or bytes to your client. If you have messages that are known to be bad, you should instead ack them after pushing to another dead letter topic to be handled separately.
Response to edit 2:
If you have this scenario where the action to supplement the messages can fail, implement whatever backoff mechanism you want on that action yourself in your service. Set the max ack extension period when constructing your subscriber (setMaxAckExtensionPeriod in Java) to ensure that your client will extend the ack deadline for each message long enough for your chain of retries.
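As a rough illustration of such a backoff (shown in Python for brevity rather than as Spring code), the sketch below waits 5 seconds after the first failure, 10 after the second, and so on, then gives up after a fixed number of attempts. The names fetch_with_backoff and GiveUp are hypothetical, and fetch stands for whatever call retrieves the extra information, returning None while it is not yet available.

```python
import time


class GiveUp(Exception):
    """Raised when every attempt failed; the caller can then dead-letter the message."""


def fetch_with_backoff(fetch, first_delay=5.0, factor=2.0, max_attempts=5, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    Waits first_delay seconds after the first failure and multiplies the wait
    by factor after each further failure. Raises GiveUp after max_attempts.
    """
    delay = first_delay
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts:  # no point sleeping after the final attempt
            sleep(delay)
            delay *= factor
    raise GiveUp(f"no result after {max_attempts} attempts")
```

The sum of all waits has to fit inside the max ack extension period, otherwise the message is redelivered while the retries are still running. On GiveUp, the handler can publish the message to a separate topic and then ack it.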
Edit 2
Note that Pub/Sub now has built in support for Dead Lettering.
You can use PubSubSubscriberTemplate.modifyAckDeadline() to programmatically extend the deadlines of a batch of messages retrieved through pull. Each individual AcknowledgeablePubsubMessage also has a modifyAckDeadline() method, if you only need to extend deadline for a select few stragglers.
If all messages on that particular subscription need to have a longer acknowledgement period, a default can be set in GCP Console by editing the subscription and updating the "Acknowledgement Deadline" field.

checking queue manager status through Visual Basic 6

I have to check the IBM MQ queue manager status before opening a queue.
I have to create a requestor app that checks whether the QMgr is active and then puts messages to or gets messages from MQ.
Is it possible to check the status?
Please share some code snippets.
Thanks
You should NEVER have to check the QMgr before opening a queue. As I responded to a similar question today, the design proposed is a very, VERY bad design. The effect is to turn async messaging back into synchronous messaging. This couples message producers to consumers, introduces location and resolution dependencies, breaks clustering, defeats WMQ's load distribution and balancing, embeds network topology into the application, and makes the whole system brittle. Please do not blame WMQ for not working correctly after intentionally defeating all its best features except the actual queue/dequeue operations.
If your requestor app is checking that the QMgr is active, you are much better off using a multi-instance connection name and a layer of two or more functionally equivalent QMgrs that can access the cluster. So long as one of the QMgrs is up, the app will cycle between them until it finds one at which to connect.
If your responder app is checking that the QMgr is active, you are much better off just attempting to connect. Responder apps should never fail over to a different QMgr since doing so breaks transactionality and may leave queues unserviced. Instead just ensure that each queue has at least two input handles from local responder apps that do not fail over across QMgrs. (It is OK if the QMgr itself fails over using hardware clustering or multi-instance QMgr though).
If the intent is to check that there's an open input handle on the queue before putting messages there a better design is to have the requesting app not care to which queue instance the messages are routed and instead use the instrumentation built into WMQ to either restart responder apps that lose their input handle, or to disable the queue when nothing's listening.

Azure Queue delayed message

I see some strange behaviour with Azure queue messages on our production deployment:
Some of the messages in the queues appear with a big delay: minutes, and sometimes 10 minutes.
Before you ask about setting delayTimeout when we put a message on the queue: we do not set delayTimeout for these messages, so a message should appear almost immediately after it is placed on the queue.
At those moments we do not have a big load. So my instances have no workload and are able to process messages fast, but the messages just don't appear.
Our service processes millions of messages per month, and we were able to identify 10-50 messages that were processed with a very big delay. Because of that we fail our SLA with our customers.
Does anyone have any idea what the reason could be?
How can we overcome it?
Has anyone faced similar issues?
Some general ideas for troubleshooting:
Are you certain that the message was queued up for processing - ie the queue.addmessage operation returned successfully and then you are waiting 10 minutes - meaning you can rule out any client side retry policies etc as being the cause of the problem.
Is there any chance that the time calculation could be subject to some kind of clock skew problem? For example, if one of the worker roles pulling messages has its clock out of sync with the other worker roles, you could see this.
Is it possible that in the situations where the message is appearing to be delayed that a worker role responsible for pulling the messages is actually failing or crashing. If the client calls GetMessage but does not respond with an appropriate acknowledgement within the time specified by the invisibilityTimeout setting then the message will become visible again as the Queue Service assumes the client did not process the message. You could tell if this was a contributing factor by looking at the dequeue count on these messages that are taking longer. More information can be found here: http://msdn.microsoft.com/en-us/library/dd179474.aspx.
Is it possible that the number of workers you have pulling items from the queue is insufficient at certain times of the day and the delays are simply caused by the queue being populated faster than you can pull messages from the queue.
Have you enabled logging for queues and then looked to see if you can find the specific operations (look at e2elatency and serverlatency).
http://blogs.msdn.com/b/windowsazurestorage/archive/tags/analytics+2d00+logging+_2600_amp_3b00_+metrics/. You should also enable client logging and try to determine if the client is having connectivity problems and the retry logic is possibly kicking in.
And finally if none of these appear to help can you please send me the server logs (and ideally the client side logs as well) along with your account information (no passwords) to JAHOGG at Microsoft dot com.
Jason
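To make the invisibilityTimeout point above concrete, here is a toy in-memory model in Python. It is not the Azure SDK and all names are hypothetical. It only shows how a worker that takes a message and then crashes makes that message reappear one timeout later with a higher dequeue count.

```python
class ToyQueue:
    """In-memory model of a queue with an invisibility timeout (not the Azure SDK)."""

    def __init__(self):
        self._messages = []  # each entry: [body, visible_at, dequeue_count]

    def add_message(self, body, now):
        self._messages.append([body, now, 0])

    def get_message(self, now, invisibility_timeout):
        """Return (body, dequeue_count) for the first visible message, or None."""
        for entry in self._messages:
            if entry[1] <= now:
                entry[1] = now + invisibility_timeout  # hide it from other readers
                entry[2] += 1
                return entry[0], entry[2]
        return None

    def delete_message(self, body):
        """Acknowledge a message by removing it."""
        self._messages = [m for m in self._messages if m[0] != body]


queue = ToyQueue()
queue.add_message("job", now=0)
first = queue.get_message(now=1, invisibility_timeout=600)   # worker A takes it, then crashes
hidden = queue.get_message(now=300, invisibility_timeout=600)  # nothing visible for worker B
second = queue.get_message(now=601, invisibility_timeout=600)  # visible again 10 minutes later
```

In this model the second successful read carries a dequeue count of 2, which is the signal the answer suggests looking for on the delayed messages.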
Azure Service bus has a property in the BrokeredMessage class called ScheduledEnqueueTimeUtc, it allows you to set a time for when the message is added to the queue (effectively creating a delay).
Are you sure that your code is not setting this property? That might be the cause of the delay.
You can find more info on this at this url: https://www.amido.com/azure-service-bus-how-to-delay-a-message-being-sent-to-the-queue/
If you are using WebJobs to process messages from the queue, it can be due to WebJobs configuration.
From an MSDN forum post by pranav rastogi:
Starting with 0.4.0-beta, the (WebJobs) SDK implements a random exponential back-off algorithm. As a result of this if there are no messages on the queue, the SDK will back off and start polling less frequently.
The following setting allows you to configure this behavior.
MaxPollingInterval for when a queue remains empty, the longest period of time to wait before checking for a message to. Default is 10min.
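using System;
using Microsoft.Azure.WebJobs;
class Program
{
    static void Main()
    {
        // Cap the back-off so an idle queue is polled at least once a minute.
        JobHostConfiguration config = new JobHostConfiguration();
        config.Queues.MaxPollingInterval = TimeSpan.FromMinutes(1);
        JobHost host = new JobHost(config);
        host.RunAndBlock();
    }
}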
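To see why a back-off like this can delay a message on a quiet queue, here is a small Python sketch of a capped exponential polling delay. It is an assumption-laden illustration and not the SDK's actual algorithm: the doubling rule, the 2 second minimum and the function name polling_delays are all made up for the example.

```python
def polling_delays(poll_results, minimum=2.0, maximum=60.0, jitter=lambda: 1.0):
    """Return the delay (seconds) chosen after each poll; True means a message was found.

    The delay doubles after every empty poll, is capped at maximum, and drops
    back to minimum as soon as a message is found. jitter() scales each delay;
    pass e.g. lambda: random.uniform(0.8, 1.0) to randomise it.
    """
    delays = []
    delay = minimum
    for found in poll_results:
        if found:
            delay = minimum
        else:
            delay = min(delay * 2, maximum)
        delays.append(delay * jitter())
    return delays
```

With a 10 minute cap, a message that arrives just after an empty poll on a long-idle queue can sit unread for close to 10 minutes, which matches the symptom in the question. Lowering the cap trades more polling requests for lower worst-case latency.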

When should I reset my MQ channels?

I've been studying the vagaries of channel statuses, how they get to those states and what to do to get them stopped or started. I've got a pretty solid understanding now, but a colleague brought up the topic of channel resets.
I've done them occasionally when I couldn't explain what was going on, but now I understand things a bit better I'm not sure his advice to "always reset" when stopping troublesome channels is the right advice.
Searching for info online, it's clear that a reset is needed when recreating channels. But when stuff just breaks (a queue manager is dropped unexpectedly, the network breaks, or something like that), is a reset a good idea in general, or should I only bother if I see sequence errors or the channel otherwise refuses to start when I know it should?
FYI, if you are resetting from the sending side of the channel, it's OK to set the sequence number to 1. The receiving side will then also go back to 1. QED :-)
If you are resetting from the receiving side of the channel, you must use the sequence number that the sender was expecting.
These numbers are in the queue manager error logs on both sides.
If the channel is in RETRY state, it will try to use the new sequence numbers when it does the next retry. This could be up to 20 minutes away if you are using the default retry attributes on the sender channel. A simple way to bump this is to STOP the channel and then START it again straight away.
HTH, G.
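For a feel of where the "up to 20 minutes" above comes from, here is a tiny Python sketch of the retry schedule. It assumes the sender channel still has what I believe are the default retry attributes, SHORTRTY(10), SHORTTMR(60) and LONGTMR(1200); check the real values with DISPLAY CHANNEL. The function name retry_delays is hypothetical.

```python
def retry_delays(attempts, short_count=10, short_interval=60, long_interval=1200):
    """Return the wait in seconds before each of the first `attempts` retries.

    The first short_count retries use the short interval; every retry after
    that uses the long interval.
    """
    return [short_interval if n < short_count else long_interval for n in range(attempts)]
```

Under those assumptions a channel that has used up its short retries only tries again every 1200 seconds, which is why stopping and starting it by hand after a reset is quicker than waiting.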
Channels get sequence errors for a few reasons:
The local and remote MCAs got out of sync on a batch. Usually the remote MCA committed the batch but the local one did not. If you know the remote side delivered the batch, issue a RESOLVE ACTION(COMMIT) on the channel, otherwise issue RESOLVE ACTION(BACKOUT). After resolving, issue RESET.
The channel points to a new QMgr. Perhaps after failover at the DNS, circuit or firewall NAT, a different QMgr of the same name is now attached to the channel. These should be well known because the failover (hopefully) doesn't happen without some alerts going off.
The contents of the channel sync queue are in error. Sometimes the QMgr can cause this but those issues are resolved (so far as I know) in recent versions. Sometimes people accidentally mess up the sync queue, usually by browsing it with a lock while the channels are trying to use it. This is a little harder to resolve and may require clearing the sync queue but check with IBM Support first.
When the channel is out of sync because of a known exception like failover, go ahead and reset it. Otherwise, you'd be well advised to find out why it's out of sync. You might reset it just to get it up and running, but hopefully not until you've saved off the <QMGR>/errors/AMQERR*.LOG files and any FDCs so you can diagnose the cause.

Scheduling a MDB

I'm looking for a way to schedule an MDB. My requirement is that the MDB feeds one of the company's systems. This system goes down for maintenance every night, but the other systems don't know about it and may keep trying to feed it. A persistent queue is great in that my messages could pile up until the system comes back online.
How could I manage that? I've already run into this: schedule a message driven bean to access a queue during certain times? But it uses Java 7 and, worse, messages are lost if the server restarts (they are taken out of the JMS queue and kept in memory until a timer processes them).
Another use of this would be to implement a "retry" queue. In case of error I want to retry processing my message, not immediately but only after a certain amount of time.
Any ideas to keep my MDB offline for a certain amount of time?
Most versions of JBoss publish a management MBean that allows you to stop delivery on a MDB.
If you're using EJB3, however, they auto-start, so you will need to register a startup class to stop starting MDBs at boot time if boots occur in your MDB's blackout period. Once past that snafu, you can schedule a simple quartz job to start and stop the MDBs according to your delivery windows.
Well, it looks like there is no way to pause a MDB in a generic way. The best solution is, like most people will answer, to use the DLQ (or DMQ).
Now, if I want to introduce a timer on a message, I set the time to live of the producer to the amount of time I want the message to wait. Then I send it to a normal queue, let's say waitingQueue, which has no consumer. After expiration, the message is sent to the default destination (mq.sys.dmq for Glassfish MQ; make sure to create a JMS resource with mq.sys.dmq as imqDestinationName). I have an MDB listening to the error queue that is responsible for sending the message again. Now, if I want to "close" a queue for some time, I check whether the current time is allowed when a message arrives in the queue. If it is not, I just set the time to live to the amount of time before the next opening hours and send the message to waitingQueue.
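The "time before next opening hours" calculation can be sketched like this (in Python for brevity; the same arithmetic applies in the MDB). It assumes a daily opening window that does not cross midnight, and the function name ttl_until_open is made up for the example. The result is in milliseconds because that is the unit of the JMS time to live.

```python
from datetime import datetime, time, timedelta


def ttl_until_open(now, opens_at, closes_at):
    """Return milliseconds until the window opens, or 0 if it is open right now.

    Assumes a daily window with opens_at < closes_at (no wrap past midnight).
    A result of 0 means "process the message now"; do not pass 0 on as a JMS
    time to live, because there 0 means the message never expires.
    """
    if opens_at <= now.time() < closes_at:
        return 0
    next_open = datetime.combine(now.date(), opens_at)
    if now.time() >= closes_at:
        next_open += timedelta(days=1)  # already past today's window
    return int((next_open - now).total_seconds() * 1000)
```

For a system that is available from 06:00 to 22:00, a message arriving at 23:00 would get a time to live of 7 hours and be resent by the MDB on the dead message queue once it expires.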
The reason I didn't use it since the beginning is that I fell into a few pitfalls. Here are a few useful properties to set when using DMQ with Glassfish 3.1.1 and its embedded MQ.
imq.message.expiration.interval=1 that's for the poll interval on each queue before sending timed out messages to the DMQ. Default is 60 seconds. If like me you want to test your application with little latency, this is what you need.
