In the following scenario, I'm curious about what happens in the active log files of the queue manager in question. Linear logging is being used.
What activity (if any) do the MQ active logs see while a queue with, say, 100 messages is being read by a JMS consumer selecting on a context attribute that, for the sake of argument, it will never match? All of the messages are read off the queue, but none of the gets are committed, so the messages are never actually deleted from the queue. Does the queue manager nevertheless record such GET operations so that it can recover these "in flight" conditions should it crash while this is happening? We recently experienced a situation where the dequeue rate on a specific queue was in the 4000-4500 msg/min range while the queue depth was only about 2500. We surmise that more than one such process thread was trying to read a JMS message by context (rather like a correlation ID, I suppose), but with no hope of ever actually finding the message it was looking for (due to a probable misconfiguration). During this time, the active logs filled up rapidly. Is it likely that dequeue rates as wanton as the ones we saw were the culprit?
MQ writes log records for persistent messages during get and put. More details can be found here:
http://pic.dhe.ibm.com/infocenter/wmqv7/v7r5/topic/com.ibm.mq.dev.doc/q023070_.htm
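For reference, the consumer pattern the question describes looks roughly like the sketch below: a transacted session receiving with a selector on a correlation-ID-style attribute. This is only an illustration of the application-side pattern, assuming the IBM MQ classes for JMS; the host, channel, queue manager, queue and selector values are all hypothetical.

    // A minimal sketch of the access pattern described in the question:
    // a transacted session consuming with a selector. All names are hypothetical.
    import javax.jms.*;
    import com.ibm.mq.jms.MQQueueConnectionFactory;
    import com.ibm.msg.client.wmq.WMQConstants;

    public class SelectorConsumerSketch {
        public static void main(String[] args) throws JMSException {
            MQQueueConnectionFactory cf = new MQQueueConnectionFactory();
            cf.setHostName("mqhost.example.com");           // hypothetical host
            cf.setPort(1414);
            cf.setChannel("APP.SVRCONN");                   // hypothetical channel
            cf.setQueueManager("QM1");                      // hypothetical queue manager
            cf.setTransportType(WMQConstants.WMQ_CM_CLIENT);

            QueueConnection conn = cf.createQueueConnection();
            try {
                conn.start();
                // Transacted session: destructive gets stay under syncpoint until commit()
                QueueSession session = conn.createQueueSession(true, Session.AUTO_ACKNOWLEDGE);
                Queue queue = session.createQueue("APP.INPUT.QUEUE");   // hypothetical queue
                // A selector that will never match anything on the queue
                MessageConsumer consumer =
                    session.createConsumer(queue, "JMSCorrelationID = 'ID:does-not-exist'");

                Message msg = consumer.receive(5000);       // null when nothing matches
                if (msg != null) {
                    session.commit();                       // only a commit removes the message
                }
            } finally {
                conn.close();
            }
        }
    }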
I have been exploring EventStoreDB and trying to understand more about the ordering of messages on the consumer side. I have read about persistent subscriptions and also about the Pinned consumer strategy here.
I have a scenario wherein inventory updates get pushed to EventStoreDB, and a different stream gets created for each unique inventoryId in the inventory event.
We have multiple consumers with the same consumerGroup name to read these inventory events. We are using Pinned Persistent Subscription with ResolveLinkTos enabled.
My question:
Will every message from a particular stream always go to the same consumer instance of the consumerGroup?
If the answer to the above question is yes, will every message from that particular stream reach the particular consumer instance in the same order as the events were ingested?
The documentation has a warning that ordered message processing using persistent subscriptions is not guaranteed. Every strategy delivers messages with, at best, best-effort ordering guarantees.
There are a few reasons for this, some of those are:
Spreading messages out across the consumers in a group leads to non-linearised checkpoint commits. It means that some messages can be processed before other messages.
Persistent subscriptions attempt to buffer messages, but when a timeout happens on the client side, the whole buffer is redelivered, which can eventually break the processing order.
Built-in retry policies essentially can break the message order at any time
Most event log-based brokers, if not all, don't even attempt to guarantee ordered message delivery across multiple consumers. I often hear "but Kafka does it", ignoring the fact that Kafka delivers messages from one partition to at most one consumer in a group. There's no load balancing of one partition between multiple consumers due to exactly the same issue. That being said, EventStoreDB is still not a broker, but a database for events.
So, here are the answers:
Will every message from a particular stream always go to the same consumer instance of the consumer group?
No. It might work most of the time, but it will eventually break.
Will every message from that particular stream reach the particular consumer instance in the same order as the events were ingested?
Most of the time, yes, but again, if a message is being retried, you might get the next message before the previous one is Acked.
Overall, load-balanced, ordered processing of messages that aren't pre-partitioned on the server is not an easy task. At most, you get messages redelivered if the checkpoint fails to persist at some point and the consumers restart.
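If redelivery and mild reordering are a problem for the handler, one common, library-agnostic mitigation is to keep a per-stream high-water mark on the consumer side and skip anything at or below it. A minimal sketch of that idea, with hypothetical types standing in for whatever your client hands you:

    // Per-stream high-water-mark guard that tolerates redelivery and reordering.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class OrderingGuard {
        // Highest event number already processed, per stream
        private final Map<String, Long> highWaterMarks = new ConcurrentHashMap<>();

        /** Returns true if the event should be processed, false if it is stale or a duplicate. */
        public boolean shouldProcess(String streamId, long eventNumber) {
            Long last = highWaterMarks.get(streamId);
            return last == null || eventNumber > last;
        }

        /** Call after the event has been handled and acknowledged. */
        public void markProcessed(String streamId, long eventNumber) {
            highWaterMarks.merge(streamId, eventNumber, Math::max);
        }
    }

Events that arrive below the high-water mark are safe to acknowledge and drop, which turns at-least-once, loosely ordered delivery into idempotent processing on your side.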
EDIT: Solved this one while I was writing it up :P -- I love that kind of solution. I figured I'd post it anyway; maybe someone else will have the same problem and find my solution. I don't care about points/karma, etc. I had already written the whole thing up, so I figured I'd post it along with the solution.
I have an SQS FIFO queue. It uses a dead-letter queue. Here is how it had been configured:
I have a single producer microservice, and I have 10 ECS images that are running as consumers.
It is important that we process the messages close to the time they are delivered in the queue for business reasons.
We're using a fairly recent version of the AWS SDK Golang client package for both producer and consumer code (if important, I can go look up the version, but it is not terribly outdated).
I capture the logs for the producer so I know exactly when messages were put in the queue and what the messages were.
I capture aggregate logs for all the consumers, so I have a full view of all 10 consumers and when messages were received and processed.
Here's what I see under normal conditions looking at the logs:
Message put in the queue at time x
Message received by one of the 10 consumers at time x
Message processed by consumer successfully
Message deleted from queue by consumer at time x + (0-2 seconds)
Repeat ad infinitum for up to about 700 messages / day at various times per day
But the problem I am seeing now is that some messages are not being processed in a timely manner. Occasionally we deliberately fail processing of a message because of the state of the system for that message (e.g. maybe a user is still logged in, so it should back off and retry...which it does). The problem is that if a consumer fails a message, the queue stops delivering any other messages to any of the consumers.
"Failure to process a message" here just means the message was received, but the consumer declared it a failure, so we just log an error, and do not proceed to delete it from the queue. Thus, the visibility timeout (here 5m) will expire and it will be re-delivered to another consumer and retried up to 10 times, after which it will go to the dead letter queue.
After delving into the logs and analyzing it, here's what I'm seeing:
1. Process begins like above (message produced, consumed, deleted).
2. New message received at time x by consumer
3. Consumer fails -- logs error and just returns (does not delete)
4. Same message is received again at time x + 5m (visibility timeout)
5. Consumer fails -- logs error and just returns (does not delete)
6. Repeat up to 10x -- message goes to dead-letter queue
7. New message received but it is now 50 minutes late!
Now all messages that were put in the queue between steps 2-7 are 50 minutes late (5m visibility timeout * 10 retries)
All the docs I've read tell me the queue should not behave this way, but I've verified it several times in our logs. Sadly, we don't have a paid AWS support plan, or I'd file a ticket with them. But consider the fact that we have 10 separate consumers all reading from the same queue. They only read from this queue, and they don't use any other queues.
For de-duplication we are using content-based deduplication (the automatic hash of the message body). Messages are small JSON documents.
My expectation would be if we have a single bad message that causes a visibility timeout, that the queue would still happily deliver any other messages it has available while there are available consumers.
OK, so it turns out I missed this little nugget of info about FIFO queues in the documentation:
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html
When you receive a message with a message group ID, no more messages for the same message group ID are returned unless you delete the message or it becomes visible.
I was indeed using the same message group ID for every message; I hadn't given it a second thought. Just be aware that if you do that and any one of your messages fails to process, it will back up all the other messages in the queue until that message is finally dealt with. The solution for me was to change the message group ID. There is a business-logic ID I can append to it that will work for me.
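To make the fix concrete: on the producer side it just means deriving the message group ID from a per-entity business key instead of a constant, so one stuck message can only block its own group. A sketch follows, again in Java with the AWS SDK v2 for illustration (our real producer is Go); the queue URL and business key are placeholders.

    // Send to a FIFO queue with a per-entity message group ID so a failing
    // message only holds back messages for the same entity (illustrative only).
    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

    public class FifoProducerSketch {
        public static void main(String[] args) {
            SqsClient sqs = SqsClient.create();
            String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/example.fifo"; // placeholder
            String businessKey = "tenant-42";   // placeholder per-entity ID from your domain

            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .messageBody("{\"event\":\"example\"}")
                    // Previously a single constant group ID; now partitioned per entity.
                    .messageGroupId("orders-" + businessKey)
                    // Content-based deduplication is enabled on the queue, so no
                    // explicit messageDeduplicationId is set here.
                    .build());
        }
    }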
I want to read messages from a JMS MQ or in-memory message store based on a count.
For example, I want to start reading the messages only when the message count reaches 10; until then, I want the message processor to remain idle.
I want this to be done using WSO2 ESB.
Can someone please help me?
Thanks.
I'm not familiar with WSO2, but from an MQ perspective, the way to do this would be to trigger the application to run once there are 10 messages on the queue. There are trigger settings for this, specifically TRIGTYPE(DEPTH).
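For example, the depth triggering is defined on the queue itself; in MQSC it might look something like the sketch below (the queue, process and script names are made up for the example).

    * Hypothetical MQSC sketch of depth-based triggering: when the queue reaches
    * 10 messages, a trigger message is written to the initiation queue and the
    * trigger monitor starts the process, which launches the application.
    DEFINE PROCESS(READER.PROCESS) APPLICID('/opt/app/start-reader.sh') REPLACE
    DEFINE QLOCAL(APP.INPUT.QUEUE) +
           TRIGGER +
           TRIGTYPE(DEPTH) +
           TRIGDPTH(10) +
           INITQ(SYSTEM.DEFAULT.INITIATION.QUEUE) +
           PROCESS(READER.PROCESS) +
           REPLACE
    * Note: after a depth trigger fires, the queue manager sets the queue to
    * NOTRIGGER, so the triggered application must re-enable TRIGGER when done.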
To expand on Morag's answer, I doubt that WSO2 has built-in triggers that would monitor the queue for depth before reading messages; I suspect it just listens on a queue and processes messages as they arrive. I also doubt that you can conveniently use MQ's triggering mechanism to execute the flow directly based on depth. So although triggering is a great answer, you need a bit of glue code to make it work.
Conveniently, there's a tutorial that provides almost all the information necessary to do this. Please see Mission:Messaging: Easing administration and debugging with circular queues for details. That article has the scripts necessary to make the Q program work with MQ triggering. You just need to make a couple of changes:
Instead of sending a command to Q to delete messages, send a command to move them.
Ditch the math that calculates how many messages to delete and either move them in batches of 10, or else move all messages until the queue drains. In the latter case, make sure to tell Q to wait for any stragglers.
Here's what it looks like when completed: the incoming messages land on some queue other than the WSO2 input queue. That queue is triggered based on depth so that the Q program (SupportPac MA01) copies the messages to the real WSO2 input queue. After the messages are copied, the glue code resets the trigger. This continues until there are fewer than 10 messages on the queue, at which time the cycle idles.
I got it working by pushing the messages to a database and getting them according to the required count; take a look at my answer for the details.
I'm looking for help regarding a strange issue where a slow consumer on a queue causes all the other consumers on the same queue to start consuming messages at 30-second intervals. That is, all consumers except the slow one stop consuming messages as fast as they can; instead they wait for some magical 30-second barrier before consuming.
The basic flow of my application goes like this:
a number of producers place messages onto a single queue. Messages can have different JMSXGroupIDs
a number of consumers listen to messages on that single queue
as standard practice the JMSXGroupIDs get distributed across the consumers
at some point one of the consumers becomes slow and can't process messages very quickly
the slow consumer ends up filling its prefetch buffer on the broker and AMQ recognises that it is slow (default behaviour)
at that point - or some 'random' but close time later - all consumers except the slow one start to only consume messages at the same 30s intervals
if the slow consumer becomes fast again then things very quickly return to normal operation and the 30s barrier goes away
I'm at a loss for what could be causing this issue, or how to fix it, please help.
More background and findings
I've managed to reliably reproduce this issue on AMQ 5.8.0, 5.9.0 (where the issue was originally noticed) and 5.9.1, on fresh installs and existing ops-managed installs and on different machines some vm and some not. All linux installs, different OSs and java versions.
It doesn't appear to be affected by anything prefetch-related; that is, changing the prefetch value from 1 to 10 to 1000 didn't stop the issue from happening.
[red herring?] Enabling debug logs on the amq instance shows logs relating to the periodic check for messages that can be expired. The queue doesn't have an expiry policy so I can only think that the scheduled expireMessagesPeriod time is just waking amq up in such a way that it then sends messages to the non-slow consumers.
If the 30s mode is entered then left then entered again the seconds-past-the-minute time is always the same, for example 14s and 44s past the minute. This is true across all consumers and all machines hosting those consumers. Those barrier points do change after restarts of amq.
While not strictly a solution to the problem, further investigation has uncovered the root cause of this issue.
TL;DR - It's known behaviour and won't be fixed before Apollo
More Details
Ultimately this is caused by the maxPageSize property and the fact that AMQ will only apply selection criteria to messages in memory. Generally these are message selectors (property = value), but in my case they are JMSXGroupID=>Consumer assignments.
As messages are received by the queue they get paged into memory and placed into a collection (named pagedInPendingDispatch in the source). To dispatch messages AMQ will scan through this list of messages and try to find a consumer that will accept it. That includes checking the group id, message selector and prefetch buffer space. For our use case we aren't using message selectors but we are using groups. If no consumer can take the message then it is left in the collection and will be checked again at the next tick.
In order to stop the pagedInPendingDispatch collection from eating up all the resources available, there is a suggested limit to the size of this collection, configured via the maxPageSize property. This property isn't actually a maximum; it's more a hint as to whether, under normal conditions, newly arriving messages should be paged into memory or paged to disk.
With these two pieces of information and a slow consumer, it turns out that eventually all the messages in the pagedInPendingDispatch collection end up being consumable only by the slow consumer, and hence the collection effectively gets blocked and no other messages get dispatched. This explains why the slow consumer wasn't affected by the 30s interval: it already had maxPageSize messages awaiting delivery.
This doesn't explain why I was seeing the non-slow consumers receive messages every 30s though. As it turns out, paging messages into memory has two modes, normal and forced. Normal follows the process outlined above where the size of the collection is compared to the maxPageSize property, when forced, however, messages are always paged into memory. This mode exists to allow you to browse through messages that aren't in memory. As it happens this forced mode is also used by the expiry mechanism to allow AMQ to expire messages that aren't in memory.
So what we have now is a collection of messages in memory that are all targeted for dispatch to the same consumer, a consumer that won't accept them because it is slow or blocked. We also have a backlog of messages awaiting delivery to all consumers. Every expireMessagesPeriod milliseconds, a task runs that force-pages messages into memory to check whether they should be expired or not. This adds those messages to the paged-in collection, which now contains maxPageSize messages for the slow consumer plus N more messages destined for any consumer. Those messages get delivered.
QED.
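For anyone who wants to experiment with the two properties involved, they can be set per destination in the broker's destination policy. A rough Java sketch for an embedded broker follows; the values are only illustrative, and tuning them shifts the symptoms described above rather than eliminating the underlying behaviour.

    // Rough sketch: setting maxPageSize and expireMessagesPeriod on an embedded broker.
    import org.apache.activemq.broker.BrokerService;
    import org.apache.activemq.broker.region.policy.PolicyEntry;
    import org.apache.activemq.broker.region.policy.PolicyMap;

    public class BrokerPolicySketch {
        public static void main(String[] args) throws Exception {
            PolicyEntry queuePolicy = new PolicyEntry();
            queuePolicy.setQueue(">");                    // apply to all queues
            queuePolicy.setMaxPageSize(200);              // illustrative page-in limit
            queuePolicy.setExpireMessagesPeriod(30_000);  // the 30s expiry sweep seen above

            PolicyMap policyMap = new PolicyMap();
            policyMap.setDefaultEntry(queuePolicy);

            BrokerService broker = new BrokerService();
            broker.setDestinationPolicy(policyMap);
            broker.start();
            broker.waitUntilStopped();
        }
    }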
References
Ticket referring to this issue but for message selectors instead
Docs relating to the configuration properties
Somebody else with this issue but for selectors
I have an MDB which listens to WebSphere MQ. It does not pick up the messages in the order in which they were received by the queue. How can I make it read them in that order? Is it possible? Should I not use an MDB?
In general, WMQ delivers messages in the order that they were received. However, several things can impact that...
If the queue is set to priority instead of FIFO delivery and messages arrive in different priorities, they will be delivered "out of order".
Distinguish between order produced and order delivered. If the messages are produced on a remote QMgr and there are multiple paths to the local QMgr, messages may arrive out of order.
Difference in persistence - if messages are produced on a remote QMgr and are of different persistences, the non-persistent messages may arrive faster than the persistent ones, especially with channel NPMSPEED(FAST) set.
Multiple readers/writers - Any dependency on sequence implies a single producer sending to a single consumer over a single path. Any redundancy in producers, consumers or paths between them can result in messages delivered out of sequence.
Syncpoint - To preserve sequence, ALL messages must be written and consumed under syncpoint or else ALL must be written and consumed outside of syncpoint.
Selectors - These specifically are intended to deliver messages out of order with respect to the context of all messages in the queue.
Message groups - Retrieval of grouped messages typically waits until the entire group is present. If groups are interleaved, messages are delivered out of sequence.
DLQ - if the target queue fills, messages may be delivered to the DLQ. As the target queue is drained, messages start going back there. With a queue near capacity, messages can alternate between the target queue and DLQ.
So when an MDB is receiving messages out of order, any of these things, or even several of them in combination, may be the cause. Either eliminate the dependency on message sequence (the best choice) or else go back over the design and reconcile all the factors that may lead to out-of-sequence processing.
To add to T.Rob's list, MDBs use the application server WorkManager to schedule message delivery, so message order is also dependent on the order in which the WorkManager starts Work items. This is outside the control of WMQ. If you limit the MDB ServerSessionPool depth to one, then this limit is removed as there will only ever be one in-flight Work instance, but at the cost of reducing maximum throughput.
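For illustration, with the IBM MQ resource adapter the server session pool depth can be pinned on the activation spec; in annotation form it might look something like the sketch below. The maxPoolDepth property name comes from the WMQ resource adapter and may differ in other environments, and the destination and queue manager names are hypothetical.

    // Sketch of an MDB limited to a single server session so only one message is
    // in flight at a time, trading maximum throughput for ordering.
    import javax.ejb.ActivationConfigProperty;
    import javax.ejb.MessageDriven;
    import javax.jms.Message;
    import javax.jms.MessageListener;

    @MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
        @ActivationConfigProperty(propertyName = "destination",     propertyValue = "APP.INPUT.QUEUE"),
        @ActivationConfigProperty(propertyName = "queueManager",    propertyValue = "QM1"),
        @ActivationConfigProperty(propertyName = "maxPoolDepth",    propertyValue = "1")
    })
    public class OrderedListenerSketch implements MessageListener {
        @Override
        public void onMessage(Message message) {
            // Process messages one at a time; with a pool depth of 1 there is
            // only ever a single in-flight Work instance, per the answer above.
        }
    }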
If you're running in WebSphere Application Server, then non-ASF mode with ListenerPorts can preserve message order, subject to some transactional/backout caveats. There's a support technote here:
http://www-01.ibm.com/support/docview.wss?uid=swg21446463