My OSB instance keeps trying to process a message from a JMS queue, but the message no longer exists. I believe it has already been processed and removed. My immediate concern is that the attempt fails several times per second, and the error logs are now useless because they are flooded with failures for this one message.
I have rebooted the managed servers and the admin server, but each time OSB immediately retries the same message. I believe this is having knock-on effects on performance, and I have had to delete all the logs because the file system keeps filling up.
Where is this "currently processing" message being picked up from, and how can I stop OSB from trying to reprocess it?
As far as I understand, a problematic message is failing continuously in a JMS queue. There are two important actions:
1. Identify the root cause of the failure. Whether I can help with this depends on the error message; if you provide the error details, I may be able to offer suggestions.
2. Protect the environment by configuring the JMS queue's delivery-failure settings, such as "Expiration Policy", "Redelivery Limit", and "Error Destination".
Please check the following Oracle documentation for these configurations.
Related
I have a Spring Boot app with a single Kafka consumer that reads messages from a topic.
Sometimes errors occur during message handling.
I want to keep receiving the subsequent messages as usual without losing the failed message, so that I can receive it again later, for example the next time the service is restarted after the fix.
Is it possible to do this?
I understand that I need to disable auto-commit and commit successful messages manually, but, in this case, if I don't throw any exception for this exception case and commit each next successful message manually, then I will lose the previous unsuccessful one, right?
If I understand your question correctly, your assumption is that the exception occurs due to a problem in your code and not while reading the message from the topic. In that case no retry or other measures will solve your problem.
What we usually do is catch the exception and send the failed message to another Kafka topic. Ideally, you also add some details on why, or in which part of the code, the exception occurred. After you have fixed the bug in your application, you can consume the messages from that other topic.
I understand that I need to disable auto-commit and commit successful messages manually, but, in this case, if I don't throw any exception for this exception case and commit each next successful message manually, then I will lose the previous unsuccessful one, right?
Yes, your understanding is correct. To be more precise, you will not "lose" the message, but as soon as your ConsumerGroup commits a higher offset, it will never try to read the lower offset again without manual intervention.
Alternative
If you only expect very rare cases where an exception could be thrown, but you just ignore it, you can always use the consumer.seek() method in pure Kafka
public void seek(TopicPartition partition, long offset)
to start reading from a particular offset out of a topic partition.
Yes, you have to commit them manually. Retry a particular message 2-3 times. If it still fails after the retries, move it to another topic and consume it from there once you have fixed whatever is causing the failure. This will not block your queue, and you won't lose any messages either.
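The retry-then-park approach described above can be sketched without any Kafka dependency. The following is a minimal in-memory illustration in Python, not a Kafka client: the list dead_letters stands in for the second topic, and consume_with_dead_letter, handler, and max_attempts are hypothetical names chosen for this example.

```python
from typing import Callable, List, Tuple


def consume_with_dead_letter(
    records: List[Tuple[int, str]],
    handler: Callable[[str], None],
    max_attempts: int = 3,
) -> Tuple[int, List[dict]]:
    """Process (offset, value) pairs in order and park repeated failures.

    Returns the next offset to commit and the dead-lettered entries.
    """
    dead_letters: List[dict] = []  # stands in for the second Kafka topic
    next_offset = records[0][0] if records else 0
    for offset, value in records:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(value)
                last_error = None
                break  # processed successfully, stop retrying
            except Exception as exc:  # assumption: any exception counts as a failure
                last_error = exc
        if last_error is not None:
            # Park the message with some context instead of blocking the partition.
            dead_letters.append(
                {
                    "offset": offset,
                    "value": value,
                    "error": repr(last_error),
                    "attempts": max_attempts,
                }
            )
        # The offset advances either way, so later messages keep flowing.
        next_offset = offset + 1
    return next_offset, dead_letters
```

In a real consumer, the returned offset is the one you would commit manually, and the parked entries would be produced to the other topic before that commit, so a crash in between cannot drop them.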
I want to continue to receive the following messages as usual and at
the same time be able not to lose that message and receive it, for
example, the next time the service is restarted with the consumer
after fixing it.
Is it possible to do this?
You don't need to commit manually. Instead, you can implement a retry mechanism: publish the event to another queue and consume it after a delay. Amazon SQS has delay queues, but unfortunately Kafka has no such feature, so you have to write the implementation yourself.
Reference articles:
Article 1
Article 2
If you are retrying the message processing, then the order of the messages can change based on your implementation. Please do keep it in mind.
Do remember that Kafka considers a consumer dead if the message processing time exceeds max.poll.interval.ms. Read this
We are using IBM MQ 8.0. Activity logs are written for the outgoing messages that we send to the external system, but there are no logs for the messages that arrive from the external system at our queue manager.
Is it a problem with the client channel configuration?
Or is it an MQ logging configuration issue?
IBM describes these "activity logs" as recovery logs in the Knowledge Center page "Making sure that messages are not lost (logging)":
IBM MQ records all significant changes to the persistent data controlled by the queue manager in a recovery log.
This includes creating and deleting objects, persistent message updates, transaction states, changes to object attributes, and channel activities. The log contains the information you need to recover all updates to message queues by:
Keeping records of queue manager changes
Keeping records of queue updates for use by the restart process
Enabling you to restore data after a hardware or software failure
Please note that non-persistent messages are not logged to the recovery log.
Based on your question, it is likely that the messages you send to the external system are persistent and the messages you receive from the external system are non-persistent. This would explain why the incoming messages are not logged to the recovery log files.
Persistence is determined at the time the message is first PUT.
IBM has a good Technote "Message persistence FAQs" about this subject.
Q3. What is the best way to be certain that messages are persistent?
A3. Set MQMD message persistence to persistent (MQPER_PERSISTENT), or nonpersistent (MQPER_NOT_PERSISTENT) and your message will always retain that value.
Note: MQPER_PERSISTENCE_AS_Q_DEF is the default setting for the persistence value in the MQMD. See the persistence values listed below.
...
Additional information
MQPER_PERSISTENCE_AS_Q_DEF can lead to unexpected results. If there is more than one definition in the queue-name resolution path, the default persistence attribute is taken from first queue definition in the path at the time of the MQPUT or MQPUT1 call. This queue could be an:
alias queue
local queue
local definition of a remote queue
queue-manager alias
transmission queue
cluster queue
The external system will need to make sure the messages they send you are set as persistent messages if you want them to be logged.
I am seeing some strange behaviour with Azure queue messages in a production deployment:
Some of the messages in the queues appear with a big delay: minutes, and sometimes as much as 10 minutes.
Before you ask about setting delayTimeout when we put a message on the queue: we do not set delayTimeout for these messages, so a message should appear almost immediately after it is placed in the queue.
At those moments we do not have a big load. My instances have no workload and are able to process messages fast, but the messages just don't appear.
Our service processes millions of messages per month, and we have identified 10-50 messages that were processed with a very big delay. Because of this we fail our SLA with our customers.
Does anyone have any idea what the reason could be?
How can we overcome it?
Has anyone faced similar issues?
Some general ideas for troubleshooting:
Are you certain that the message was queued up for processing, i.e. the queue.addmessage operation returned successfully and only then did the 10-minute wait begin? That would rule out client-side retry policies and the like as the cause of the problem.
Is there any chance that the time calculation is subject to some kind of clock skew? For example, if one of the worker roles pulling messages has its clock out of sync with the other worker roles, you could see this.
Is it possible that in the situations where the message is appearing to be delayed that a worker role responsible for pulling the messages is actually failing or crashing. If the client calls GetMessage but does not respond with an appropriate acknowledgement within the time specified by the invisibilityTimeout setting then the message will become visible again as the Queue Service assumes the client did not process the message. You could tell if this was a contributing factor by looking at the dequeue count on these messages that are taking longer. More information can be found here: http://msdn.microsoft.com/en-us/library/dd179474.aspx.
Is it possible that the number of workers you have pulling items from the queue is insufficient at certain times of the day, so that the delays are simply caused by the queue being populated faster than you can pull messages from it?
Have you enabled logging for queues and then looked for the specific operations (look at e2elatency and serverlatency)? See http://blogs.msdn.com/b/windowsazurestorage/archive/tags/analytics+2d00+logging+_2600_amp_3b00_+metrics/. You should also enable client logging and try to determine whether the client is having connectivity problems and the retry logic is possibly kicking in.
And finally if none of these appear to help can you please send me the server logs (and ideally the client side logs as well) along with your account information (no passwords) to JAHOGG at Microsoft dot com.
Jason
Azure Service Bus has a property in the BrokeredMessage class called ScheduledEnqueueTimeUtc. It allows you to set a time at which the message is added to the queue (effectively creating a delay).
Are you sure that your code is not setting this property? That could be the cause of the delay.
You can find more info on this at this url: https://www.amido.com/azure-service-bus-how-to-delay-a-message-being-sent-to-the-queue/
If you are using WebJobs to process messages from the queue, it can be due to WebJobs configuration.
From an MSDN forum post by pranav rastogi:
Starting with 0.4.0-beta, the (WebJobs) SDK implements a random exponential back-off algorithm. As a result of this if there are no messages on the queue, the SDK will back off and start polling less frequently.
The following setting allows you to configure this behavior.
MaxPollingInterval: when a queue remains empty, the longest period of time to wait before checking for a message. Default is 10 minutes.
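using System;
using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        // Cap the back-off so an idle queue is polled at least once a minute.
        JobHostConfiguration config = new JobHostConfiguration();
        config.Queues.MaxPollingInterval = TimeSpan.FromMinutes(1);
        JobHost host = new JobHost(config);
        host.RunAndBlock();
    }
}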
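To see why an idle queue can delay pickup, here is a rough Python sketch of an exponential back-off capped at a maximum interval. It is only an illustration under assumed numbers: the 2-second starting interval and the 20% jitter are my assumptions, and the SDK's actual algorithm and constants may differ. poll_delays is a hypothetical helper, not part of the WebJobs SDK.

```python
import random


def poll_delays(empty_polls, minimum=2.0, maximum=600.0, rng=None):
    """Return the wait in seconds before each poll while the queue stays empty."""
    if rng is None:
        rng = random.Random()
    delays = []
    for n in range(empty_polls):
        # Double the base interval on every empty poll, add jitter, then cap it.
        jittered = minimum * (2 ** n) * rng.uniform(0.8, 1.2)
        delays.append(min(maximum, jittered))
    return delays
```

With a 10-minute cap, a message that arrives just after a long idle period could wait up to roughly that long before the next poll. Lowering the cap, as MaxPollingInterval does, bounds that wait at the cost of more frequent polling of an empty queue.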
The logic flow is like this:
A message is sent to an input queue.
A ProcessorMDB's onMessage() is invoked. Within this method several operations/validations are done.
In case of a poison message (a message that the application code cannot handle), a RuntimeException is thrown.
This should roll back the transaction, and we are seeing evidence of that in the log file.
There is a backout threshold defined with a backout queue name.
Once the threshold is reached, the message is sent to the backout queue.
But immediately it starts going back and forth between the input queue and the backout queue.
We are using the MQMON tool to observe this weird behavior. It continues almost forever, even after the app server (where the MDB is running) is shut down.
We are using WebLogic 10.3.1 and WebSphere MQ 6.02.
Any help will be much appreciated; it looks like we are running out of ideas.
This sounds like a syncpoint issue. If the QMgr were to issue a COMMIT when a message is requeued inside of a unit of work it would affect all messages under syncpoint inside of that thread. This would cause serious problems if an application had performed several PUT or GET calls prior to hitting the poison message. Rather than issue a COMMIT outside of the program's control, the QMgr just leaves the message on the backout queue inside the unit of work and waits for the program to issue the COMMIT. This can lead to some unexpected behavior such as what you are seeing where a message lands back on the input queue.
If another message is in the queue behind the "bad" one and it is processed successfully by the same thread, everything works out perfectly. The app issues a COMMIT on the new message and this also affects the poison message on the Backout Queue. However if the thread were to exit uncleanly (without an explicit disconnect or COMMIT) then the transaction is rolled back and the poison message is returned to the input queue.
The usual way of dealing with this is that the next good message (or batch of messages if transactions are batched) in the input queue will force the COMMIT. However in some cases where the owning thread gets no new work (perhaps it was performing a GET by Correlation ID) there is nothing to push the bad message through. In these cases, it is important to make sure that the application issues a COMMIT before ending. One way to do this is to write the code to perform the GET by CORRELID with a wait interval. If the wait interval expires, the application would get a return code of 2033 and then issue a COMMIT before closing the thread. If the reply message is legitimately late for whatever reason, the COMMIT will have no effect. But if the message arrived and had been backed out and requeued, the COMMIT will cause it to stay in the Backout Queue.
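The behaviour described in the last three paragraphs can be modelled with a few lines of Python. This is a toy in-memory model of a single unit of work, not an MQ client: ToyQueueManager and on_get_result are hypothetical names, and the only MQ fact used is that reason code 2033 means no message was available before the wait interval expired.

```python
MQRC_NO_MSG_AVAILABLE = 2033  # reason code returned when the wait interval expires


class ToyQueueManager:
    """In-memory model of one unit of work; illustration only."""

    def __init__(self, input_messages):
        self.input_queue = list(input_messages)
        self.backout_queue = []
        self._pending = []  # requeues made inside the current unit of work

    def requeue_poison(self, message):
        """Move a message towards the backout queue under syncpoint (uncommitted)."""
        self.input_queue.remove(message)
        self._pending.append(message)

    def commit(self):
        """Make the pending requeues permanent."""
        self.backout_queue.extend(self._pending)
        self._pending.clear()

    def end_without_commit(self):
        """Thread exits uncleanly: the unit of work is rolled back."""
        self.input_queue.extend(self._pending)
        self._pending.clear()


def on_get_result(qm, reason_code):
    """Commit before ending if the GET timed out, as suggested above."""
    if reason_code == MQRC_NO_MSG_AVAILABLE:
        qm.commit()
```

In this model, calling requeue_poison and then end_without_commit puts the message back on the input queue, which is the back-and-forth movement reported in the question. Calling on_get_result with 2033 before the thread ends leaves the message on the backout queue instead.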
One way to see exactly what is going on is to run a trace against the queue in question. You can use the built-in trace function - strmqtrc - which has a few more options in V7 than does the V6 version. However if you want very fine grained control you can use the trace exit in SupportPac MA0W. With MA0W you can see exactly what API calls are made by the program and those made on its behalf.
[EDIT] Updating the response with some info from the PMR:
The following is from the WMQ V7 Infocenter:
"MessageConsumers are single threaded below the Session level, and any requeuing of poison messages takes place within the current unit of work. This does not affect the operation of the application, however when poison messages are requeued under a transacted or Client_acknowledge Session, the requeue action itself will not be committed until the current unit of work is committed by the application code or, if appropriate, the application container code."
Hence, if it is important for the customer to have poison messages committed immediately after they are backed out, it is recommended they either make use of the Application Server Facilities (ConnectionConsumer) which can commit the message immediately, or another mechanism to move poison messages from the queue.
Here is the link to this information in the V6 and V7 Information Centers. Since you are using the V6 client, you will want to refer to the V6 Infocenter. Note that with the V6 client, there is no mention in the Infocenter of ASF being able to commit the poison message immediately, even when using a ConnectionConsumer. The way I read it, this means you will probably need to upgrade to the V7 client to get the behavior you are looking for. I will be interested to see whether the PMR results in a similar recommendation.
This is in a cluster environment. The queue manager lost its identity in the cluster and is unable to connect to other servers. All channels to the repository and to other queue managers were in retrying state.
CPU usage is optimal on this server. This is a UNIX box.
When I checked the logs, I found the following:
AMQ9532: Program cannot set queue attributes.
EXPLANATION: The attempt to set the attributes of queue 'SYSTEM.CLUSTER.TRANSMIT.QUEUE' on queue manager 'QMGR.SERVER6A' failed with reason code 2102.
ACTION: Ensure that the queue is available and retry the operation.
----- amqrmssa.c : 690 --------------------------------------------------------
AMQ9999: Channel program ended abnormally.
EXPLANATION: Channel program 'Channel.Coord00' ended abnormally.
ACTION: Look at previous error messages for channel program 'Channel.Coord00' in the error files to determine the cause of the failure.
----- amqrccca.c : 883 --------------------------------------------------------
03/06/11 08:24:26 AMQ9544: Messages not put to destination queue.
EXPLANATION: During the processing of channel 'Channel.Server6A' one or more messages could not be put to the destination queue and attempts were made to put them to a dead-letter queue. The location of the queue is 1, where 1 is the local dead-letter queue and 2 is the remote dead-letter queue.
ACTION: Examine the contents of the dead-letter queue. Each message is contained in a structure that describes why the message was put to the queue, and to where it was originally addressed. Also look at previous error messages to see if the attempt to put messages to a dead-letter queue failed. The program identifier (PID) of the processing program was '1372200'.
----- amqrmrca.c : 1318 -------------------------------------------------------
I then recycled the queue manager, and it is now fine.
My question is: how did the MQ resource problem occur? CPU usage on this server is never more than 15%. Please advise.
There are three different and unrelated problems shown in the log.
AMQ9532: Program cannot set queue attributes.
EXPLANATION: The attempt to set the attributes of queue 'SYSTEM.CLUSTER.TRANSMIT.QUEUE' on queue manager 'QMGR.SERVER6A' failed with reason code 2102.
The 2102 is MQRC_RESOURCE_PROBLEM and presumably the resource issue referred to in the posting. The 2102 can be any kind of scarce resource, including semaphores, user processes, queue handles, etc. Since the QMgr was attempting to set an attribute of the queue, it would have already had a thread instantiated but it would have required additional queue handles. When something like this occurs, use your admin tool (WMQ Explorer, mqmon or one of the many 3rd party tools) to look into the number of open queue handles, open channels, etc. Note that for a resource error, it will be necessary to maintain an open connection to the QMgr or else the tool will be unable to make a new connection when the resource shortage occurs.
AMQ9999: Channel program ended abnormally.
EXPLANATION: Channel program 'Channel.Coord00' ended abnormally.
ACTION: Look at previous error messages for channel program 'C00.US.MP00' in the error files to determine the cause of the failure.
This error actually appears to be two different errors, since it references two different channels. One of these appears to be an outbound cluster channel and the other a point-to-point channel. Neither channel mentioned in this error is associated with the first or the last error message.
03/06/11 08:24:26 AMQ9544: Messages not put to destination queue.
EXPLANATION: During the processing of channel 'Channel.Server6A' one or more messages could not be put to the destination queue and attempts were made to put them to a dead-letter queue. The location of the queue is 1, where 1 is the local dead-letter queue and 2 is the remote dead-letter queue.
ACTION: Examine the contents of the dead-letter queue. Each message is contained in a structure that describes why the message was put to the queue, and to where it was originally addressed. Also look at previous error messages to see if the attempt to put messages to a dead-letter queue failed. The program identifier (PID) of the processing program was '1372200'.
The last error appears to be an inbound cluster channel. Since the first error was trying to set attributes of the cluster transmit queue, it could only have been associated with an outbound channel. Therefore the first and last error messages are unrelated. This error message appears to show an inbound message that was destined for a queue and that queue was full, PUT-disabled, or otherwise unable to accept the message. The message was therefore routed to the dead letter queue.
For the resource error, I would suggest reviewing the performance report appropriate to your platform. Go to the SupportPacs page and look for those SupportPacs named MP* and then look for the one for your platform. The Performance Reports give you specific tuning advice.
You may also want to review the Problem Determination chapter in the System Administration manual for additional advice on how to identify resource issues.
The WebSphere MQ cluster design and operation article in the developerWorks Mission:Messaging series has specific advice about keeping clusters healthy.
Last but not least, the WebSphere MQ MustGather page has sections on troubleshooting for all major platforms and categorized by problem area.
Reason code 2102 (MQRC_RESOURCE_PROBLEM) after increasing MAXMSGL to 100 MB in IBM MQ
If you are receiving reason code 2102 (MQRC_RESOURCE_PROBLEM), try increasing the number of log files:
Queue manager -> Properties -> Extended, then increase Log -> Log primary files and Log -> Log secondary files to 20.