The AWS SQS -> Lambda integration allows you to process incoming messages in a batch, where you configure the maximum number of messages to receive in a single batch. If you throw an exception during processing to indicate failure, none of the messages in the batch are deleted from the incoming queue, and they can be picked up by another Lambda invocation once the visibility timeout has passed.
Is there any way to keep the batch processing, for performance reasons, but allow some messages from the batch to succeed (and be deleted from the inbound queue) and only leave some of the batch un-deleted?
The problem with manually re-enqueueing the failed messages is that you can get into an infinite loop where those items perpetually fail, get re-enqueued, and fail again. Since they are being re-sent to the queue, their receive count gets reset every time, which means they'll never fail out into a dead-letter queue. You also lose the benefits of the visibility timeout. This is also bad for monitoring purposes, since you'll never know you're in a bad state unless you go and manually check your logs.
A better approach is to manually delete the successful items and then throw an exception to fail the rest of the batch. The successful items will be removed from the queue, all the items that actually failed will hit their normal visibility timeout periods and retain their receive counts, and you'll be able to actually use and monitor a dead-letter queue. This is also less work overall than the other approach (a minimal sketch follows the considerations below).
Considerations
Only override the default behavior if there has been a partial batch failure. If all the items succeeded, let the default behavior take its course
Since you're tracking the failures of each queue item, you'll need to catch and log each exception as they come in so that you can see what's going on later
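A minimal Python sketch of this pattern, assuming the queue URL comes from an environment variable and process() is a hypothetical stand-in for your business logic; boto3's delete_message is the only SQS call needed:
import os
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # hypothetical: queue URL supplied via environment

def process(record):
    """Hypothetical stand-in for your business logic; raise on failure."""
    ...

def handler(event, context):
    succeeded, failed = [], []
    for record in event["Records"]:
        try:
            process(record)
        except Exception as exc:
            # Catch and log each failure so you can see what went wrong later
            print(f"message {record['messageId']} failed: {exc}")
            failed.append(record)
        else:
            succeeded.append(record)

    if not failed:
        return  # everything succeeded: let the default behavior delete the batch

    # Partial failure: delete the successes ourselves, then raise so the failed
    # messages keep their receive counts, respect the visibility timeout, and
    # can eventually land in the dead-letter queue.
    for record in succeeded:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=record["receiptHandle"])
    raise RuntimeError(f"{len(failed)} of {len(event['Records'])} messages failed")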
I recently encountered this problem, and the best way to handle it without writing any code on our side is to use the FunctionResponseTypes property of the EventSourceMapping. With this we just have to return the list of failed message IDs, and the event source mapping will take care of deleting the successful messages.
Please check out Using SQS and Lambda
CloudFormation template to configure the event source mapping for the Lambda (FunctionResponseTypes is the important part):
"FunctionEventSourceMapping": {
  "Type": "AWS::Lambda::EventSourceMapping",
  "Properties": {
    "BatchSize": "100",
    "Enabled": "True",
    "EventSourceArn": {"Fn::GetAtt": ["SQSQueue", "Arn"]},
    "FunctionName": "FunctionName",
    "MaximumBatchingWindowInSeconds": "100",
    "FunctionResponseTypes": ["ReportBatchItemFailures"]
  }
}
After you configure your event source with the above configuration, we just have to return a response in the below-mentioned format from our Lambda:
{"batchItemFailures": [{"itemIdentifier": "85f26da9-fceb-4252-9560-243376081199"}]}
Provide the list of failed message IDs in the batchItemFailures list.
If your Lambda runtime is Python, return a dict in the above-mentioned format; for a Java-based runtime you can use aws-lambda-java-events.
Sample Python code
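A minimal sketch of such a handler, where process() is a hypothetical stand-in for your business logic; the important part is the shape of the return value:
def process(record):
    """Hypothetical stand-in for your business logic; raise on failure."""
    ...

def handler(event, context):
    batch_item_failures = []
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            # Report only the IDs that failed; Lambda deletes the rest of the batch
            batch_item_failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": batch_item_failures}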
Advantages of this approach are:
You don't have to add any code to manually delete the message from the SQS queue
You don't have to include any third-party library or boto just for deleting the message from the queue, which helps reduce your final artifact size
It keeps things simple and stupid
On a side note, make sure your Lambda has the required SQS permissions to receive and delete messages.
Thanks
One option is to manually send the failed messages back to the queue, and then reply with success to SQS so that there are no duplicates.
You could do something like keeping a fail count: if all messages failed, you can simply return a failed status for the whole batch; otherwise, if the fail count is < 10 (10 being the max batch size you can get from the SQS -> Lambda event), you can individually send the failed messages back to the queue and then reply with a success message.
Additionally, to avoid any possible infinite retry loop, add a property to the event such as a "retry" count before sending it back to the queue, and drop the event when "retry" is greater than X.
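A rough Python sketch of that guard, assuming the retry counter lives in the JSON message body and the queue URL and MAX_RETRIES ("X" above) are hypothetical:
import json
import os
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # hypothetical
MAX_RETRIES = 5                      # hypothetical cap, "X" in the text above

def requeue_or_drop(record):
    body = json.loads(record["body"])
    retry = body.get("retry", 0) + 1
    if retry > MAX_RETRIES:
        # Give up instead of looping forever; log it (or DLQ it yourself) here
        print(f"dropping message {record['messageId']} after {retry - 1} retries")
        return
    body["retry"] = retry
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(body))
Keep in mind the caveat from the earlier answer: re-sent messages start over with a fresh receive count, so the queue's own visibility-timeout and dead-letter mechanics no longer apply to them.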
Related
I had a few questions about AWS Lambdas and I couldn't find much detail in the documentation
How can I increase the number of retries in AWS Lambda?
In case the maximum number of retries has been reached and the whole Lambda has failed, how can I get some sort of notification?
Lambda retries are based upon many factors. I suggest you take a look into the official docs to understand every single type of retry, but long story short:
Synchronous event sources can either retry or not; it depends on the service.
Asynchronous event sources are retried automatically (two retries, so three attempts in total). If all attempts fail, you can configure a DLQ to receive the failed events
(Stream-Based && Poll-Based Event Sources) (like Kinesis or DynamoDB streams) will retry until the configured data retention period expires. Be careful: if one message fails and the message itself is poisonous, it will keep being retried until it expires, and no new messages from that shard will be processed in the meantime
(Non-Stream-Based && Poll-Based Event Sources) (SQS) will discard messages in case of failure (unless it was an invocation failure or a timeout). If discarded, they will be sent to a DLQ if you previously configured one.
So, based on the information above, we can then tackle your question around notifications: you can have another Lambda subscribed to your DLQ to receive the message and notify the way you want. Either by sending an e-mail from the function itself or sending it to SNS (to possibly send an e-mail directly or do whatever you want with it).
For the retry amount, that's not configurable for the already built-in values. The furthest you can go is to invoke a Lambda function synchronously from your code and, in case of exception, retry it as you wish (the exponential backoff, if desired, would also need to be coded manually)
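A hedged Python sketch of that manual synchronous invocation with exponential backoff; the function name and attempt count are hypothetical:
import json
import time
import boto3

lambda_client = boto3.client("lambda")

def invoke_with_retries(payload, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            resp = lambda_client.invoke(
                FunctionName="my-function",        # hypothetical function name
                InvocationType="RequestResponse",  # synchronous invocation
                Payload=json.dumps(payload),
            )
            # A function error is reported in the response rather than raised
            if resp.get("FunctionError"):
                raise RuntimeError(resp["FunctionError"])
            return resp
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...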
EDIT: Solved this one while I was writing it up :P -- I love those kind of solutions. I figured I'd post it anyway, maybe someone else will have the same problem and find my solution. Don't care about points/karma, etc. I just already wrote the whole thing up, so figured I'd post it and the solution.
I have an SQS FIFO queue. It is using a dead-letter queue. Here is how things are set up:
I have a single producer microservice, and I have 10 ECS images that are running as consumers.
It is important that we process the messages close to the time they are delivered in the queue for business reasons.
We're using a fairly recent version of the AWS SDK Golang client package for both producer and consumer code (if important, I can go look up the version, but it is not terribly outdated).
I capture the logs for the producer so I know exactly when messages were put in the queue and what the messages were.
I capture aggregate logs for all the consumers, so I have a full view of all 10 consumers and when messages were received and processed.
Here's what I see under normal conditions looking at the logs:
Message put in the queue at time x
Message received by one of the 10 consumers at time x
Message processed by consumer successfully
Message deleted from queue by consumer at time x + (0-2 seconds)
Repeat ad infinitum for up to about 700 messages / day at various times per day
But the problem I am seeing now is that some messages are not being processed in a timely manner. Occasionally we deliberately fail processing a message because of the state of the system for that message (e.g. maybe users are still logged in, so it should back off and retry... which it does). The problem is that when a consumer fails a message, the queue stops delivering any other messages to any other consumers.
"Failure to process a message" here just means the message was received, but the consumer declared it a failure, so we just log an error, and do not proceed to delete it from the queue. Thus, the visibility timeout (here 5m) will expire and it will be re-delivered to another consumer and retried up to 10 times, after which it will go to the dead letter queue.
After delving into the logs and analyzing it, here's what I'm seeing:
Process begins like above (message produced, consumed, deleted).
New message received at time x by consumer
Consumer fails -- logs error and just returns (does not delete)
Same message is received again at time x + 5m (visibility timeout)
Consumer fails -- logs error and just returns (does not delete)
Repeat up to 10x -- message goes to dead-letter queue
New message received but it is now 50 minutes late!
Now all messages that were put in the queue between steps 2-7 are 50 minutes late (5m visibility timeout * 10 retries)
All the docs I've read tell me the queue should not behave this way, but I've verified it several times in our logs. Sadly, we don't have a paid AWS support plan, or I'd file a ticket with them. But consider the fact that we have 10 separate consumers all reading from the same queue. They only read from this queue; we don't have any other queues they are using.
For de-duplication we are using the automated hash of the message body. Messages are small JSON documents.
My expectation would be if we have a single bad message that causes a visibility timeout, that the queue would still happily deliver any other messages it has available while there are available consumers.
OK, so turns out I missed this little nugget of info about FIFO queues in the documentation:
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html
When you receive a message with a message group ID, no more messages for the same message group ID are returned unless you delete the message or it becomes visible.
I was indeed using the same message group ID for every message. Hadn't given it a second thought. Just be aware: if you do that and any one of your messages fails to process, it will hold back all other messages in the queue until that message is finally dealt with. The solution for me was to change the message group ID; there is a business-logic ID I can append to it that will work for me.
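For illustration, a minimal boto3 (Python) sketch of that idea; the question's code is Go, and the group-ID prefix and business_id are hypothetical:
import hashlib
import json
import boto3

sqs = boto3.client("sqs")

def send(queue_url, payload, business_id):
    body = json.dumps(payload)
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        # One group per business entity: a message stuck in its visibility
        # timeout only holds back other messages in the same group.
        MessageGroupId=f"jobs-{business_id}",
        # Explicit deduplication ID (roughly what content-based deduplication
        # does automatically); omit it if the queue already has that enabled.
        MessageDeduplicationId=hashlib.sha256(body.encode()).hexdigest(),
    )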
1) I'm interested to learn whether it is possible to keep the messages that were delivered using Spring Integration. I'm already using the Mongo persistent store (ConfigurableMongoDbMessageStore), but only failed messages remain in the collection. Ideally, I want all messages to remain, with the ability to list them and retry them.
I would use a field "status" or similar to identify queued, successful or failed messages. Not sure if this field already exists, but I'm guessing something similar must be in place.
2) Also, when a message fails and is persisted, there is a lot more data in the message. This data is serialised, so I'm curious how I can extract the original message and retry it.
3) The goal is to create an interface in the webapp where all queued messages can be seen and retried. Not only failed messages, but also successful deliveries (useful for testing).
I looked everywhere for an answer to this, but could not find it.
Thanks
I'd say that isn't a good design for a queue component.
Right, it returns failed messages to the queue for future redelivery, but a successful message should be removed from the queue to avoid duplication on the next poll.
No, there is no "status" field on the message, because you are using the store as a queue.
BTW, Spring Integration provides a separate implementation for queue channels: MongoDbChannelMessageStore.
You can achieve what you want with a separate, parallel Mongo collection and store your message twice: once for the queue and once for later analysis. There you can introduce a "status" field and update it according to whether the message succeeded or not.
From there you can build your UI to manage that collection and provide actions like send and retry: remove the message from that collection and send it again to both collections.
HTH
I'm trying to handle two different types of problems while processing a message.
The first problem is if the remote database is down. In that case, the message should stop processing, and try again later. This message should never go to a DLQ, and should keep trying until the remote database is up.
The second problem is when there is a problem with the message. In that case, it should go to the DLQ.
How should I be structuring the following code?
@Override
public void onMessage(Message message) {
    try {
        // Do some processing
        messageProcessing(message); // Should DLQ if message is bad
        // Save to the database
        putNamedLocation(message); // <<--- Exception when external DB is down
    } catch (Exception e) {
        logger.error(e.getMessage());
        mdc.setRollbackOnly(); // mdc is the injected MessageDrivenContext
    }
}
Assuming you can detect bad messages definitively in the code body of the MDB, I would write the bad messages to the DLQ directly. This gives you a bit more freedom to perhaps categorize the error and optionally send different types of bad messages to different "DLQ-like" queues, and/or apply a time-to-live to DLQed messages so that no-hope-of-ever-being-processed messages don't pile up in the queue forever. You can add @Resource annotated instance variables to your MDB class referencing the ConnectionFactory and Queue references to support sending the messages to the target DLQ. The bottom line is, make sure you detect the error and DLQ the message yourself.
As for the DB being down, you can detect this by catching exceptions when acquiring a connection or writing your updates. In this case, clean up your resources and throw a RuntimeException. This will cause the message to be redelivered, but you will want to check the JMS configuration for two things:
Make sure the max-redelivery count is high enough, otherwise the count will tick over and the message will be DLQed eventually anyway.
If your JMS implementation supports it, add a redelivery delay to rejected messages to allow some time for the DB to come back up, otherwise your messages will endlessly spin in a deliver/reject loop.
To avoid #2 (which is tricky if your JMS implementation does not support redelivery delay, like WebSphereMQ), you can use the JBoss JMX management interface for the MDB to stop (and later restart) delivery on the MDB. However, you can't do this inside the MDB in the same thread that is processing the message because the MDB will wait for the message to complete processing, which it can't because it is waiting for the MDB to stop, which it can't because...[and so on] so... your best bet is to start some sort of sentry that polls the DB and, when it finds it down, stops the MDB and, when it finds it up again, restarts it. See this question for a snippet on how to do that.
That last part should help deal with any unexpected exceptions resulting from message validations. (i.e. the DB is fine, but for some reason the message is totally fubar resulting in uncaught exceptions which causes the message to be redelivered). Since down-DB messages should not be redelivered more than a few times (on account of your sentry), you can check a message's redelivery count and if it is ridiculously high then you know you have poison message and you can ditch it, or DLQ it.
Hope that's helpful.
We have a Quartz process that polls an ActiveMQ JMS queue.
We know that we could get several messages a minute and would like to respond only to the most current message, at a configured polling rate of a minute or more.
We don't need to process any of the previous messages.
Is there a way to configure the queue to get this behavior?
It seems like a topic has the ability to do this via the subscription recovery policy, using a count of 1. We would like to do this using a queue to guarantee (more or less) single delivery of the message.
Or is there a conceptual flaw in our assumptions...
Thanks
In my opinion there is no standard operation for this, so you will have to write some code....
One possible solution would be to use a QueueBrowser together with a QueueReceiver:
Through the QueueBrowser you would get an Enumeration of the messages in the queue. For each message, as long as hasMoreElements() returns true, you can then perform a receive with the QueueReceiver using a MessageSelector on the JMSMessageID. The last message will be the one you want to have....
When using ActiveMQ, you can use "image caching" on topics. One of the settings there is to always keep the last message sent.
Take a look at the Subscription recovery Policy settings:
http://activemq.apache.org/subscription-recovery-policy.html