When a consumer is still executing and the lock on its message is lost in the meantime, either because the lock duration expired or because of an intermittent connection loss, how can this be handled from within the consumer? Currently, when the consumer finishes, a MessageLockLostException is thrown, and by then another consumer instance has already been started for the same message.
To address that we wrote idempotent consumers. However, there is still a problem when the first instance (the one that was executing before the lock was lost) throws an exception: by default that moves the message to {queueName}_error, and the MassTransit pipeline seems to ignore the fact that the message lock has already been lost and that a new consumer instance for the same message has been started.
In summary, the same message can end up both successfully acknowledged and in the error queue. Has anyone dealt with a similar scenario, or are there hooks in MassTransit that let a consumer work out whether it still holds an active lock on the message it is processing?
I have a question about some strange consumer behaviour.
Recently we had a strange situation in our production environment. Two consumers in two different microservices got stuck on some messages. The first was holding 20 messages from a RabbitMQ queue and the second 2 messages, and neither was processing them. These messages were visible as Unacked in RabbitMQ for two days. They went back to the Ready state only when the two microservices were restarted. While the consumers were holding these messages, the system as a whole was processing thousands of messages per hour, so our Saga and all other consumers were working. Once the messages went back to Ready they were processed within a second, so I don't think the problem is with the messages themselves.
The messages are published by the Saga to an exchange. Besides these two stuck consumers we also have an EventLogger consumer subscribed to all messages, and it processed these 22 messages normally, without any problems (from its own queue). We also have Application Insights connected to the consumers, and it has no record of these two consumers receiving the 22 messages (there is a record of the EventLogger receiving them).
The other day we had the same issue with one message in the test environment.
We recently updated MassTransit in our project from version 6.2.0 to 7.1.6. Before that we didn't notice any similar issues with consumers, but maybe it's just a coincidence. We also use the retry, redelivery, circuit breaker and in-memory outbox mechanisms, but I don't think the problem is with them because the consumer didn't even start to process these 22 messages.
Do you have any suggestions as to what could have happened to these consumers?
Usually when a consumer doesn't even start to consume a message once it has been delivered to MassTransit by RabbitMQ, the cause could be a problem resolving the consumer from the container, such as a dependency on another backing service (database, log server, file, network connection, device, etc.).
The message remains unacknowledged on the broker because the transport/delivery mechanism to the consumer is waiting for a resource to become available. If there isn't anything in the logs for that time period indicating an issue with a resource, it's hard to know what could have blocked those messages from being consumed. The fact that they were ultimately consumed once the services were restarted seems to indicate the message content itself was fine.
Monitoring the lack of message consumption (and likely an associated queue depth increase) would give an indication that the situation has occurred. If it happens again, I'd increase the logging detail levels to see if the issue occurs again and can then be identified.
I have a Spring Boot app with a single Kafka consumer that gets messages from a topic.
Sometimes errors occur while a message is being handled.
I want to continue to receive the following messages as usual and at the same time be able not to lose that message and receive it, for example, the next time the service is restarted with the consumer after fixing it.
Is it possible to do this?
I understand that I need to disable auto-commit and commit successful messages manually, but, in this case, if I don't throw any exception for this exception case and commit each next successful message manually, then I will lose the previous unsuccessful one, right?
If I understand your question correctly, your assumption is that the exception occurs due to a problem in your code and not while reading the message from the topic. In that case no retry or other measures will solve your problem.
What we usually do is to catch the exception and send it to another Kafka topic. Ideally, you will also add some details on why or in which code part the exception occurred. After you have fixed the bug in your application you can consume the messages from that other topic.
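The approach described above can be sketched as follows. This is a minimal in-memory model in Python, not Kafka client code: the `topics` dictionary stands in for the cluster, and `process`, `consume` and the topic names are hypothetical. It only illustrates catching the exception and forwarding the record, together with details about the failure, to a separate error topic.

```python
import traceback
from collections import defaultdict

# In-memory stand-in for a Kafka cluster: topic name -> list of records.
topics = defaultdict(list)

def process(value):
    """Hypothetical business logic that fails on one specific payload."""
    if value == "bad":
        raise ValueError("cannot handle this payload")
    return value.upper()

def consume(source_topic, error_topic):
    """Process every record; route failures to error_topic with details."""
    results = []
    for offset, value in enumerate(topics[source_topic]):
        try:
            results.append(process(value))
        except Exception as exc:
            # Keep the original value plus enough context to debug and replay.
            topics[error_topic].append({
                "value": value,
                "source_topic": source_topic,
                "source_offset": offset,
                "error": repr(exc),
                # Name of the function in which the exception was raised.
                "where": traceback.extract_tb(exc.__traceback__)[-1].name,
            })
    return results

topics["orders"] = ["a", "bad", "b"]
results = consume("orders", "orders.errors")
```

After the bug is fixed, a second consumer can read the error topic and feed the stored values back through the fixed logic.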
I understand that I need to disable auto-commit and commit successful messages manually, but, in this case, if I don't throw any exception for this exception case and commit each next successful message manually, then I will lose the previous unsuccessful one, right?
Yes, your understanding is correct. To be more precise, you will not "lose" the message, but as soon as your consumer group commits a higher offset it will never try to read the lower offset again unless the offset is modified manually.
Alternative
If you expect exceptions only in very rare cases and simply skip the failing message, you can always use the consumer.seek() method of the plain Kafka consumer
public void seek(TopicPartition partition, long offset)
to start reading from a particular offset out of a topic partition.
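To make the commit and seek behaviour concrete, here is a small toy model in Python. It is not the Kafka client: `ToyConsumer` and its methods are hypothetical stand-ins that only mimic the idea of a read position, a committed offset and a seek. It assumes the consumer swallows the failure for one message, remembers its offset, and later seeks back to it.

```python
class ToyConsumer:
    """Toy model of one partition read by one consumer group."""

    def __init__(self, records):
        self.records = list(records)
        self.position = 0    # next offset poll() will return
        self.committed = 0   # next offset to read after a restart

    def poll(self):
        if self.position >= len(self.records):
            return None
        offset = self.position
        self.position += 1
        return offset, self.records[offset]

    def commit(self, offset):
        # Same convention as Kafka: store the offset of the NEXT record.
        self.committed = offset + 1

    def seek(self, offset):
        self.position = offset

consumer = ToyConsumer(["m0", "m1", "m2"])
failed_offsets = []
for _ in range(3):
    offset, value = consumer.poll()
    if value == "m1":                   # pretend handling m1 failed
        failed_offsets.append(offset)   # swallow the error, remember the offset
        continue
    consumer.commit(offset)

# After a restart the group resumes at the committed offset, so m1 is skipped.
skipped_after_restart = consumer.records[consumer.committed:]

# Seeking back explicitly makes the failed record readable again.
consumer.seek(failed_offsets[0])
replayed = consumer.poll()
```

The record itself is still in the log; only the group's committed position has moved past it, which is why an explicit seek can bring it back.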
Yes, you have to commit them manually. Retry a particular message 2-3 times. If it still fails after the retries, move it to another topic and consume it from there once you have fixed whatever is causing the failure. This will not block your queue and you won't lose any messages either.
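A rough sketch of that retry-then-park idea, written in Python with a plain list standing in for the second topic. The function name, the `max_attempts` default and the handlers are all assumptions made for illustration.

```python
def handle_with_retries(message, handler, parked_topic, max_attempts=3):
    """Try handler up to max_attempts times; park the message if all fail.

    Returns True when the message was handled, False when it was parked.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = exc
    parked_topic.append({
        "message": message,
        "attempts": max_attempts,
        "error": repr(last_error),
    })
    return False

parked = []
calls = {"flaky": 0}

def flaky(message):
    """Fails once, then succeeds (a transient error)."""
    calls["flaky"] += 1
    if calls["flaky"] < 2:
        raise RuntimeError("transient")

def broken(message):
    """Always fails (a bug that needs a code fix)."""
    raise ValueError("bug")

ok_flaky = handle_with_retries("m1", flaky, parked)
ok_broken = handle_with_retries("m2", broken, parked)
```

In both cases the caller can commit the offset afterwards, because the message has either been handled or stored safely elsewhere.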
I want to continue to receive the following messages as usual and at
the same time be able not to lose that message and receive it, for
example, the next time the service is restarted with the consumer
after fixing it.
Is it possible to do this?
You don't need to commit manually. Instead, you can implement a retry mechanism by publishing the event to another queue and consuming it after a delay. Amazon SQS has delay queues, but Kafka has no built-in equivalent, so you have to write the implementation yourself.
If you are retrying the message processing, then the order of the messages can change based on your implementation. Please do keep it in mind.
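The ordering effect can be seen in a small simulation. This Python sketch uses a logical clock and a heap as a hypothetical delayed-retry queue; it is not tied to Kafka or SQS, and the `run` function and its parameters are invented for illustration.

```python
import heapq

def run(messages, fails_first_time, delay):
    """Process one message per tick; a failed message is retried `delay` ticks later."""
    retry_heap = []        # entries are (due_time, sequence, message)
    pending = list(messages)
    processed = []
    already_failed = set()
    clock = 0
    sequence = 0
    while pending or retry_heap:
        if retry_heap and retry_heap[0][0] <= clock:
            # A delayed retry is due, so it is handled first.
            _, _, msg = heapq.heappop(retry_heap)
        elif pending:
            msg = pending.pop(0)
        else:
            # Nothing to do until the next retry becomes due.
            clock = retry_heap[0][0]
            continue
        if msg in fails_first_time and msg not in already_failed:
            already_failed.add(msg)
            heapq.heappush(retry_heap, (clock + delay, sequence, msg))
            sequence += 1
        else:
            processed.append(msg)
        clock += 1
    return processed

order = run(["a", "b", "c", "d"], fails_first_time={"b"}, delay=2)
```

Here `b` is overtaken by `c` while it waits for its retry, so consumers that depend on strict ordering need to account for that.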
Also remember that Kafka considers a consumer dead if the time between polls, and therefore the processing time for a batch, exceeds max.poll.interval.ms.
I have an application listening to messages on an IBM Websphere MQ queue.
Once a message is consumed, the application performs some processing logic.
If the processing completed OK, I would like the application to acknowledge the message and have it removed from the queue.
If an error occurred while processing, I would like the message to remain in the queue.
How is this implemented? (I'm using the .NET API)
Thanks.
MQ supports a single-phase commit protocol. You specify syncpoint when you get the message, then issue COMMIT or ROLLBACK as required. The default action if the connection is lost is ROLLBACK and if the program deliberately ends without resolving the transaction a COMMIT is assumed. (This is platform dependent so the customary advice is to explicitly call COMMIT and not rely on the class destructors to do it for you.)
This works whether the message is persistent or not. However if the message has an expiry specified and expires after being rolled back there's a chance it won't be seen again.
Of course, if the program issues a ROLLBACK the message will normally be seen again, since it goes back to the same spot in the queue, and for a FIFO queue that's the top. If the problem with the message is not transient, this causes a poison message loop of read/rollback/repeat. To avoid that, the app can check the backout count and, if it exceeds some threshold, requeue the message to an exception queue.
When using JMS or XMS this is done for you by the class libraries. If the input queue's BOQNAME and BOQTHRESH attributes are set, the message is requeued to the queue named in BOQNAME. Otherwise a requeue to the Dead Letter Queue is attempted. If that fails (as it should if the system is properly secured) the listener will stop receiving messages.
The usual advice is to always specify a backout queue and either let the classes use it or code the app to use it.
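For an app that handles the backout queue itself, the check might look roughly like this. The sketch is a Python model with deques standing in for the queues. The threshold value, the `backout_count` field and the `drain` function are assumptions for illustration, not MQ API names.

```python
from collections import deque

BACKOUT_THRESHOLD = 3  # assumed value, playing the role of BOQTHRESH

def drain(input_queue, backout_queue, handler):
    """Process the input queue, moving poison messages to the backout queue."""
    processed = []
    while input_queue:
        msg = input_queue.popleft()              # get under syncpoint
        if msg["backout_count"] >= BACKOUT_THRESHOLD:
            backout_queue.append(msg)            # requeue, then commit
            continue
        try:
            processed.append(handler(msg["body"]))   # success, then commit
        except Exception:
            # Rollback: the message returns to the head of the queue
            # with its backout count incremented.
            msg["backout_count"] += 1
            input_queue.appendleft(msg)
    return processed

def handler(body):
    if body == "poison":
        raise ValueError("cannot process")
    return body.upper()

input_queue = deque([
    {"body": "poison", "backout_count": 0},
    {"body": "good", "backout_count": 0},
])
backout_queue = deque()
processed = drain(input_queue, backout_queue, handler)
```

Without the threshold check the loop would never get past the first message, which is the read/rollback/repeat loop described above.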
Please see Usage Notes for MQGET in the MQAPI Reference and the MQGetMessageOptions.NET page in the .Net class reference.
You may want to look at the MQ Reporting Options.
Expiry, Confirmation of Arrival and Confirmation of Delivery can be requested and sent via a response queue back to the sending application by the receiving Queue Manager.
Positive and Negative Acknowledgements can also be generated by the receiving application provided they use the related reporting attributes found in the Message Descriptor.
Exception can be requested and sent via a response queue back to the sending application by any Queue Manager in the transmission chain or generated by the receiving application.
1. Read the message using MQC.MQGMO_SYNCPOINT.
2. Process it.
3. Call MQQueueManager.Commit().
If Commit() is not called, either explicitly or implicitly (e.g. because an exception is thrown first), all messages that have been dequeued will be re-enqueued.
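The get/process/commit steps above can be modelled as follows. This is a conceptual Python sketch rather than the MQ .NET API: `SyncpointQueue`, `backout` and `consume_one` are hypothetical names that only imitate the unit-of-work behaviour.

```python
class SyncpointQueue:
    """Toy queue where gets stay pending until committed or backed out."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.in_flight = []   # read under syncpoint, not yet resolved

    def get(self):
        msg = self.messages.pop(0)
        self.in_flight.append(msg)
        return msg

    def commit(self):
        # The reads become permanent: messages are gone from the queue.
        self.in_flight.clear()

    def backout(self):
        # The reads are undone: messages return to the front of the queue.
        self.messages = self.in_flight + self.messages
        self.in_flight = []

def consume_one(queue, handler):
    """Get one message, process it, and commit only on success."""
    msg = queue.get()
    try:
        handler(msg)
    except Exception:
        queue.backout()
        return False
    queue.commit()
    return True

def handler(msg):
    if msg == "fails":
        raise RuntimeError("processing error")

queue = SyncpointQueue(["ok", "fails"])
first = consume_one(queue, handler)
second = consume_one(queue, handler)
```

After the second call the failing message is back on the queue, which matches the requirement that a message should remain in the queue when processing fails.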
I have a setup with an ActiveMQ broker and a single consumer. The consumer gets a message that it is not able to process because a service it depends on has a bug (once that is fixed, it will be fine). So the message keeps being redelivered (consumer redelivery); we use JMS sessions. With our current configuration it will be redelivered every 10 minutes for 1 day. That obviously causes a problem because other messages are not being consumed.
In order to solve this problem I have accessed the queue through JMX and tried to delete that message but it is not there. I guess it is cached on the consumer and not visible at the broker.
Is there any way to delete this message other than restarting the application?
Is it possible to configure the redelivery mechanism so that such a message (which eventually causes a livelock) is put at the end of the queue so that other messages can be processed?
The 10 minutes for 1 day redelivery policy should stay as is.
I think you're right that the messages are stuck in the consumer's prefetch buffer, and I don't know of a way to delete them from there.
I'd change your redelivery policy to send to the DLQ after the second failure, with a much shorter interval between them, like 30 seconds, and I'd configure the DLQ strategy as an individualDeadLetterStrategy so you get a separate DLQ containing only messages from this particular queue. Then set up a consumer on this DLQ to move the messages to (the end of) the main queue whenever your reprocessing condition is met (whether that's after a certain delay, or based on reading some flag value from a database, or whatever). This consumer is where you'd implement "every 10 minutes for 1 day" logic, instead of in the redelivery policy where you currently have it.
That will keep the garbage ones out of the main queue so they don't delay other messages from being consumed, but still ensure that they will be reprocessed later. And it will put them on the broker instead of in the consumer's prefetch buffer, where you can view and delete them.
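A sketch of that arrangement, modelled in Python with deques in place of the broker queues. `MAX_DELIVERIES`, `consume_main`, `replay_dlq` and the `ready` predicate are hypothetical; on a real broker the first part would be configured through the redelivery policy and dead letter strategy rather than coded by hand.

```python
from collections import deque

MAX_DELIVERIES = 2  # assumed: give up after the second failure

def consume_main(main, dlq, handler):
    """Consume the main queue; messages failing every delivery go to the DLQ."""
    processed = []
    while main:
        msg = main.popleft()
        for _ in range(MAX_DELIVERIES):
            try:
                processed.append(handler(msg))
                break
            except Exception:
                continue
        else:
            dlq.append(msg)   # every delivery attempt failed
    return processed

def replay_dlq(main, dlq, ready):
    """Move DLQ messages to the END of the main queue once ready(msg) is true."""
    moved = 0
    for _ in range(len(dlq)):
        msg = dlq.popleft()
        if ready(msg):
            main.append(msg)
            moved += 1
        else:
            dlq.append(msg)   # keep waiting
    return moved

state = {"service_fixed": False}

def handler(msg):
    if msg == "needs-service" and not state["service_fixed"]:
        raise RuntimeError("dependency is broken")
    return msg

main, dlq = deque(["needs-service", "x", "y"]), deque()
first_pass = consume_main(main, dlq, handler)
moved_early = replay_dlq(main, dlq, lambda m: state["service_fixed"])
state["service_fixed"] = True
moved_late = replay_dlq(main, dlq, lambda m: state["service_fixed"])
second_pass = consume_main(main, dlq, handler)
```

The `ready` predicate is where a rule such as "every 10 minutes for 1 day" or a database flag check would go.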
The only way to get it to the back of the queue is to produce it to the queue again. Redelivery policies can only be configured down to the destination level on the connection factory.
Given that you already have a connection, it shouldn't be too hard to create a producer that can either move the given message to a DLQ or produce it back to the queue when you run into that particular bug.
Setting jms.nonBlockingRedelivery=true on the connection factory resolved the problem. Now even if a message is being redelivered, it does not block the processing of other messages.
The logic flow is like this
A message is sent to an input queue
A ProcessorMDB's onMessage() is invoked. Within this method several operations/validations are done
In case of a poison message (a message that the application code cannot handle) a RuntimeException is thrown.
This should rollback the transaction. We are seeing evidence in the log file.
There is a backout threshold defined with a backout queue name
once threshold is reached, the message is sent to backout queue
But immediately it starts going back and forth between the input queue and backout queue.
We are using the MQMON tool to observe this weird behavior. It continues almost forever, even after the app server (where the MDB is running) is shut down.
We are using Weblogic 10.3.1 and WebSphere MQ 6.02
Any help will be much appreciated, looks like we are running out of ideas.
This sounds like a syncpoint issue. If the QMgr were to issue a COMMIT when a message is requeued inside of a unit of work it would affect all messages under syncpoint inside of that thread. This would cause serious problems if an application had performed several PUT or GET calls prior to hitting the poison message. Rather than issue a COMMIT outside of the program's control, the QMgr just leaves the message on the backout queue inside the unit of work and waits for the program to issue the COMMIT. This can lead to some unexpected behavior such as what you are seeing where a message lands back on the input queue.
If another message is in the queue behind the "bad" one and it is processed successfully by the same thread, everything works out perfectly. The app issues a COMMIT on the new message and this also affects the poison message on the Backout Queue. However if the thread were to exit uncleanly (without an explicit disconnect or COMMIT) then the transaction is rolled back and the poison message is returned to the input queue.
The usual way of dealing with this is that the next good message (or batch of messages if transactions are batched) in the input queue will force the COMMIT. However in some cases where the owning thread gets no new work (perhaps it was performing a GET by Correlation ID) there is nothing to push the bad message through. In these cases, it is important to make sure that the application issues a COMMIT before ending. One way to do this is to write the code to perform the GET by CORRELID with a wait interval. If the wait interval expires, the application would get a return code of 2033 and then issue a COMMIT before closing the thread. If the reply message is legitimately late for whatever reason, the COMMIT will have no effect. But if the message arrived and had been backed out and requeued, the COMMIT will cause it to stay in the Backout Queue.
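The effect of that final COMMIT can be illustrated with a toy model. In this Python sketch only the reason code 2033 (MQRC_NO_MSG_AVAILABLE) comes from MQ; `UnitOfWork`, `get_by_correl_id` and `finish_thread` are hypothetical names, and the wait interval is assumed to have already expired whenever no matching message is found.

```python
MQRC_NO_MSG_AVAILABLE = 2033  # MQ reason code: no message available

class UnitOfWork:
    """Toy unit of work that holds poison-message requeues until resolved."""

    def __init__(self, input_queue, backout_queue):
        self.input_queue = input_queue
        self.backout_queue = backout_queue
        self.pending = []   # requeued inside the unit of work, not committed

    def requeue_poison(self, msg):
        self.input_queue.remove(msg)
        self.pending.append(msg)

    def commit(self):
        self.backout_queue.extend(self.pending)
        self.pending = []

    def rollback(self):
        self.input_queue[:0] = self.pending   # back to the head of the input queue
        self.pending = []

def get_by_correl_id(queue, correl_id):
    """Return (reason_code, message); 2033 models an expired wait interval."""
    for msg in queue:
        if msg["correl_id"] == correl_id:
            return 0, msg
    return MQRC_NO_MSG_AVAILABLE, None

def finish_thread(uow, correl_id, commit_on_timeout):
    reason, _ = get_by_correl_id(uow.input_queue, correl_id)
    if reason != MQRC_NO_MSG_AVAILABLE:
        return reason       # a reply arrived; a real worker would process it
    if commit_on_timeout:
        uow.commit()        # makes the pending requeue permanent
    else:
        uow.rollback()      # models the thread ending without a COMMIT
    return reason

def scenario(commit_on_timeout):
    poison = {"correl_id": "A", "body": "poison"}
    input_queue, backout_queue = [poison], []
    uow = UnitOfWork(input_queue, backout_queue)
    uow.requeue_poison(poison)
    finish_thread(uow, "A", commit_on_timeout)
    return input_queue, backout_queue

with_commit = scenario(commit_on_timeout=True)
without_commit = scenario(commit_on_timeout=False)
```

With the COMMIT the poison message stays on the backout queue; without it the message reappears on the input queue, which matches the back-and-forth behaviour described in the question.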
One way to see exactly what is going on is to run a trace against the queue in question. You can use the built-in trace function - strmqtrc - which has a few more options in V7 than does the V6 version. However if you want very fine grained control you can use the trace exit in SupportPac MA0W. With MA0W you can see exactly what API calls are made by the program and those made on its behalf.
[EDIT] Updating the response with some info from the PMR:
The following is from the WMQ V7 Infocenter:
"MessageConsumers are single threaded below the Session level, and any requeuing of poison messages takes place within the current unit of work. This does not affect the operation of the application, however when poison messages are requeued under a transacted or Client_acknowledge Session, the requeue action itself will not be committed until the current unit of work is committed by the application code or, if appropriate, the application container code."
Hence, if it is important for the customer to have poison messages committed immediately after they are backed out, it is recommended they either make use of the Application Server Facilities (ConnectionConsumer) which can commit the message immediately, or another mechanism to move poison messages from the queue.
Here is the link to this information in the V6 and V7 Information Centers. Since you are using the V6 client, you would want to refer to the V6 Infocenter. Note that with the V6 client, there is no mention in the Infocenter of ASF being able to commit the poison message immediately, even when using a ConnectionConsumer. The way I read it, this means you will probably need to upgrade to the V7 client to get the behavior you are looking for. I will be interested to see if the PMR results in a similar recommendation.