I had a few questions about AWS Lambda and couldn't find much detail in the documentation:
How can I increase the number of retries in AWS Lambda?
If the maximum number of retries has been reached and the Lambda has failed for good, how can I get some sort of notification?
Lambda retry behavior depends on several factors. I suggest you take a look at the official docs to understand every single type of retry, but long story short:
Synchronous event sources can either retry or not. It depends on the service.
Asynchronous event sources are retried up to two more times (three attempts in total). If all attempts fail, you can configure a DLQ to receive the failed events.
(Stream-Based && Poll-Based Event Sources) (like Kinesis or DynamoDB) will retry until the configured data retention expires. Be careful: if one message fails and the message itself is a poison message, it will keep being retried until it expires, and no new messages will be processed in the meantime.
(Non-Stream-Based && Poll-Based Event Sources) (SQS) will discard messages in case of failure (unless it was an invocation failure or a timeout). If discarded, they will be sent to a DLQ if you previously configured one.
So, based on the information above, we can tackle your question about notifications: you can have another Lambda subscribed to your DLQ that receives the failed message and notifies you however you want, either by sending an e-mail from the function itself or by publishing to SNS (which can send an e-mail directly or fan the message out to whatever you want).
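For example, here is a minimal sketch of such a notifier Lambda, assuming the DLQ is an SQS queue and the SNS topic ARN is provided through an environment variable (both names are placeholders):

import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]  # hypothetical configuration

def handler(event, context):
    # The DLQ delivers the original failed payloads as SQS records.
    for record in event["Records"]:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Lambda processing failed",
            Message=record["body"],
        )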
As for the retry count, the built-in values are not configurable. The furthest you can go is to invoke a Lambda function synchronously from your own code and, in case of an exception, retry as you wish (the exponential backoff, if desired, also needs to be coded manually).
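A minimal sketch of that manual retry loop with boto3 (the function name and the backoff schedule are up to you):

import json
import time

import boto3

lambda_client = boto3.client("lambda")

def invoke_with_backoff(function_name, payload, max_attempts=3):
    """Synchronously invoke a Lambda, retrying with exponential backoff."""
    for attempt in range(max_attempts):
        response = lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="RequestResponse",
            Payload=json.dumps(payload),
        )
        # Lambda reports handled/unhandled function errors in this field.
        if "FunctionError" not in response:
            return json.loads(response["Payload"].read())
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"{function_name} still failing after {max_attempts} attempts")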
Related
We have, basically:
dynamodb streams =>
trigger lambda (batch size XX, concurrency 1, retries YY) =>
write to service
There are multiple shards, so we may have some number of concurrent writes to the service. Under some conditions, too many shards carry too much data, and too many Lambda instances write to the service at once, which then responds with 429s.
Right now a failure simply ends up being a failure: the Lambda retries, but the service is still overwhelmed.
What we would like is for the trigger to delay before retrying the Lambda, essentially an exponential backoff before re-invocation. We can easily implement that "inside" the Lambda: retry and wait for up to the 15-minute Lambda duration.
But then we are billed for the whole Lambda execution time while it sleeps through however many backoffs are required.
Is there a way to configure the Lambda/DynamoDB trigger to have a delay (that we can control up and down) before invoking the retry? For SQS triggers there is some talk of a redrive policy that can somehow control the rate of retries, but it is not clear how, or whether, that applies to DynamoDB streams.
I understand that the stream will "back up" as we slow down the dispatch of Lambdas, but this is assumed to be a transient situation, and the DynamoDB stream will act as a queue. We can also configure a dead-letter queue, but that is somewhat orthogonal to the basic question.
You can configure a wait. And yes, while you are billed for the time used, it's pennies. Seriously, the AWS free tier covers a million Lambda invocations a month. At the enterprise level it's really nothing compared to what EC2 servers cost. But I'm not your CFO, so maybe it is a concern.
You can take your stream, process it into whatever service calls you need, and add all their payloads to the same SQS queue. You can configure SQS to throttle itself, in effect, so that it only sends so many messages over a given time. The messages in your queue would go to another Lambda that makes the service call for you, one at a time, doled out by SQS.
Or set up a Dead Letter Queue instead (possibly in combination with either of the above) to catch the failed ones and try again when traffic is lower.
As an aside, you don't want to "pause" your DynamoDB stream, as it only has a 24-hour retention. If your stream pauses for too long, you will lose data. Better to take the whole stream and put it into an SQS queue as individual writes, because SQS retains messages for up to 14 days.
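A minimal sketch of that fan-out Lambda, assuming the queue URL is passed in as an environment variable (a placeholder):

import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["WRITE_QUEUE_URL"]  # hypothetical configuration

def handler(event, context):
    # Fan the stream batch out into SQS as individual messages; a separate
    # consumer Lambda then drains the queue at its own pace.
    for record in event["Records"]:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(record["dynamodb"]),
        )

The consumer's rate can then be capped, for example by giving it a low reserved concurrency.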
Is it possible to do this?
When dealing with SQS event sources, if the Lambda function does not have adequate reserved concurrency, the function will be throttled, and the unprocessed events/messages can be retried via the SQS redrive policy. I've never liked this limitation, as unprocessed messages eventually end up on the DLQ after some arbitrary number of retries/visibility timeouts.
From my naive perspective, it would appear that the above solution would not be possible with MSK, as placing a message back on an MSK topic for some visibility timeout would effectively lose topic delivery order.
I've searched around but can't find any detail as to how back pressure can be implemented with MSK to Lambda. Does anybody have any insight into how the MSK topic consumer handles Lambda throttling?
Many thanks!
Actually, apart from DLQ support, this scenario is supported, but the way MSK works is a bit different from SQS. In MSK (which is Apache Kafka), records are persistent and durable, and what tells a processor that a given record should be retried is the committed offset that each consumer maintains. If the Lambda function reads a record but doesn't finish processing it, it is just a matter of not committing the respective offset, so that in the next poll cycle the record is picked up again.
Also, Kafka has a polling model instead of a push model: your Lambda function performs a poll, indicating how many records must be read on each poll. So you see, there are lots of controls in Kafka for implementing backpressure; it just doesn't work exactly like SQS.
The example below may give you an idea of how it works:
https://github.com/aws-samples/integration-sample-lambda-msk
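Outside of the managed Lambda integration, the commit-based retry loop looks roughly like this with kafka-python (broker, topic, and group id are placeholders, and process() stands in for your handler):

from kafka import KafkaConsumer  # pip install kafka-python

def process(value):
    """Stand-in for your record handler; raise to signal failure."""
    print(value)

consumer = KafkaConsumer(
    "error-events",                                 # placeholder topic
    bootstrap_servers="broker-1.example.com:9092",  # placeholder broker
    group_id="error-processor",
    enable_auto_commit=False,  # we control when an offset counts as processed
)

while True:
    # Each poll pulls at most max_records -- a natural backpressure knob.
    batch = consumer.poll(timeout_ms=1000, max_records=10)
    for records in batch.values():
        for record in records:
            process(record.value)
    # Commit only after the whole batch succeeded; if processing raised,
    # the consumer restarts from the last committed offset and re-reads.
    consumer.commit()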
The AWS SQS -> Lambda integration allows you to process incoming messages in a batch, where you configure the maximum number you can receive in a single batch. If you throw an exception during processing to indicate failure, none of the messages are deleted from the incoming queue, and they can be picked up by another Lambda for processing once the visibility timeout has passed.
Is there any way to keep the batch processing, for performance reasons, but allow some messages from the batch to succeed (and be deleted from the inbound queue) and only leave some of the batch un-deleted?
The problem with manually re-enqueueing the failed messages to the queue is that you can get into an infinite loop where those items perpetually fail and get re-enqueued and fail again. Since they are being resent to the queue their retry count gets reset every time which means they'll never fail out into a dead letter queue. You also lose the benefits of the visibility timeout. This is also bad for monitoring purposes since you'll never be able to know if you're in a bad state unless you go manually check your logs.
A better approach would be to manually delete the successful items and then throw an exception to fail the rest of the batch. The successful items will be removed from the queue, all the items that actually failed will hit their normal visibility timeout periods and retain their receive count values, and you'll be able to actually use and monitor a dead letter queue. This is also overall less work than the other approach.
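A minimal sketch of that handler, assuming the queue URL comes from an environment variable (in practice you could also derive it from the record's eventSourceARN) and process() stands in for the business logic:

import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # hypothetical configuration

def process(record):
    """Stand-in for your business logic; raise on failure."""
    ...

def handler(event, context):
    failures = 0
    for record in event["Records"]:
        try:
            process(record)
            # Success: delete the message ourselves so a later batch
            # failure doesn't bring it back.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=record["receiptHandle"],
            )
        except Exception as exc:
            print(f"record {record['messageId']} failed: {exc}")
            failures += 1
    if failures:
        # Fail the batch; only the undeleted (failed) messages return to
        # the queue, with their receive counts intact.
        raise RuntimeError(f"{failures} records failed")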
Considerations
Only override the default behavior if there has been a partial batch failure. If all the items succeeded, let the default behavior take its course
Since you're tracking the failures of each queue item, you'll need to catch and log each exception as they come in so that you can see what's going on later
I recently encountered this problem, and the best way to handle it without writing any extra code on our side is to use the FunctionResponseTypes property of the EventSourceMapping. With this, we just have to return the list of failed message IDs, and the event source will take care of deleting the successful messages.
Please check out Using SQS and Lambda
CloudFormation template to configure the event source mapping for the Lambda:
"FunctionEventSourceMapping": {
"Type": "AWS::Lambda::EventSourceMapping",
"Properties": {
"BatchSize": "100",
"Enabled": "True",
"EventSourceArn": {"Fn::GetAtt": ["SQSQueue", "Arn"]},
"FunctionName": "FunctionName",
"MaximumBatchingWindowInSeconds": "100",
"FunctionResponseTypes": ["ReportBatchItemFailures"] # This is important
}
}
Then we just have to return a response in the below-mentioned format from our Lambda:
{"batchItemFailures": [{"itemIdentifier": "85f26da9-fceb-4252-9560-243376081199"}]}
Provide the list of failed message IDs in the batchItemFailures list.
If your Lambda runtime is Python, return a dict in the above-mentioned format; for a Java-based runtime you can use aws-lambda-java-events.
Sample Python code
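Here is a minimal sketch of such a handler (do_work() stands in for your processing logic):

import json

def do_work(payload):
    """Stand-in for your processing logic; raise to signal failure."""
    ...

def handler(event, context):
    batch_item_failures = []
    for record in event["Records"]:
        try:
            do_work(json.loads(record["body"]))
        except Exception:
            # Report only this message as failed; the event source mapping
            # deletes the rest of the batch from the queue.
            batch_item_failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": batch_item_failures}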
Advantages of this approach are
You don't have to add any code to manually delete messages from the SQS queue.
You don't have to include any third-party library, or boto just for deleting messages from the queue, which helps keep your final artifact small.
It keeps things simple and stupid.
On a side note, make sure your Lambda has the required SQS permissions to receive and delete messages.
Thanks
One option is to manually send the failed messages back to the queue and then reply with a success to SQS, so that there are no duplicates.
You could set up a fail count so that, if all messages failed, you simply return a failed status for all of them; otherwise, if the fail count is < 10 (10 being the max batch size you can get from the SQS -> Lambda event source), you individually send the failed messages back to the queue and then reply with a success message.
Additionally, to avoid a possible infinite retry loop, add a property such as a "retry" count to the event before sending it back to the queue, and drop the event when "retry" is greater than X.
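A minimal sketch of that guard (the queue URL and the retry cap are placeholders):

import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # hypothetical configuration
MAX_RETRIES = 5                      # pick whatever X suits you

def requeue_or_drop(record):
    """Send a failed message back to the queue, tracking attempts ourselves."""
    attrs = record.get("messageAttributes", {})
    retries = int(attrs.get("retry", {}).get("stringValue", "0"))
    if retries >= MAX_RETRIES:
        print(f"dropping {record['messageId']} after {retries} retries")
        return
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=record["body"],
        MessageAttributes={
            "retry": {"DataType": "Number", "StringValue": str(retries + 1)}
        },
    )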
I'm trying to create a Lambda function that consumes a stream from a DynamoDB table. However, I was wondering what the best practice is for handling data that may not have been processed due to errors during execution. For example, my Lambda failed and I lost part of the stream; what is the best way to reprocess the lost data?
This is handled for you. DynamoDB Streams, like Kinesis Streams, will resend records until they have been successfully processed. When you are using Lambda to process the stream, that means successfully exiting the function. If there is an error and the function exits unexpectedly, the DynamoDB stream will simply resend the record that was being processed.
The good thing is you are guaranteed at-least-once processing; however, there are some things you need to look out for. Like Kinesis Streams, DynamoDB Streams are guaranteed to process records in order. As a side effect, when a record fails to process, it is retried until it is successfully processed or it expires from the stream (possibly days later), blocking any records behind it in the stream until then.
How you solve this depends on the needs of your application. If you need at-least-once processing but don't need to guarantee that all records are processed in order, I would just drop the records into an SQS queue and do the processing off the queue. SQS queues will also retry records that aren't successfully processed; however, unlike DynamoDB and Kinesis Streams, records will not block each other in the queue. If you encounter an error when transferring a record from the DynamoDB stream to the SQS queue, you can just retry, though this may introduce duplicates in the SQS queue.
If order is critical or duplicates can't be tolerated, you can use an SQS FIFO queue. SQS FIFO queues are similar to (standard) SQS queues, except that they are guaranteed to deliver messages to the consumer in order and have a deduplication window (5 minutes) during which any duplicates added to the queue will be discarded.
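As a rough sketch, forwarding a DynamoDB stream record into such a FIFO queue with content-based grouping and deduplication might look like this (the queue URL and the key attribute pk are placeholders):

import hashlib
import json

import boto3

sqs = boto3.client("sqs")
FIFO_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events.fifo"  # placeholder

def forward(record):
    body = json.dumps(record["dynamodb"])
    sqs.send_message(
        QueueUrl=FIFO_QUEUE_URL,
        MessageBody=body,
        # The same group id keeps records for one item strictly ordered.
        MessageGroupId=record["dynamodb"]["Keys"]["pk"]["S"],
        # A content-based id lets retried transfers deduplicate (5-minute window).
        MessageDeduplicationId=hashlib.sha256(body.encode()).hexdigest(),
    )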
In both cases, when using SQS queues to process messages, you can setup a Dead Letter Queue where messages can automatically be sent if they fail to be processed N number of times.
TLDR: Use SQS Queues.
Updating this thread as all the existing answers are stale.
AWS Lambda now supports DLQs for synchronous stream reads from a DynamoDB table stream.
With this feature in context, here is the flow that I would recommend:
Configure the event source mapping to include the DLQ ARN and set the retry attempts count. After that many retries, the batch metadata is moved to the DLQ (see the sketch below).
Set up an alarm on the DLQ's visible-messages metric to get alerted about impacted records.
The DLQ message can be used to retrieve the impacted stream records, e.g. with the KCL library.
ProTip: you can use the "Bisect on Function Error" attribute to enable batch splitting. With this option, Lambda can narrow down to the impacted record.
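Putting those pieces together, the event source mapping could be created like this with boto3 (all ARNs and the function name are placeholders):

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:...:table/MyTable/stream/...",  # placeholder
    FunctionName="my-stream-processor",                              # placeholder
    StartingPosition="LATEST",
    MaximumRetryAttempts=3,           # retries before the batch is given up on
    BisectBatchOnFunctionError=True,  # split batches to isolate the bad record
    DestinationConfig={
        "OnFailure": {"Destination": "arn:aws:sqs:...:stream-dlq"}   # placeholder
    },
)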
DynamoDB Streams invokes the Lambda function for each event until it is successfully processed (until the code signals success).
If an error occurs during execution, you need to handle it in code; otherwise the Lambda won't continue with the remaining messages in the stream.
If there is a situation where you need to process a message separately because of an error, you can use a dead-letter queue (with Amazon SQS) to push the message aside and continue with the remaining items in the stream, with separate logic to process the messages in this queue, as sketched below.
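A minimal sketch of that pattern (DLQ_URL is a placeholder and process() stands in for the business logic):

import json
import os

import boto3

sqs = boto3.client("sqs")
DLQ_URL = os.environ["DLQ_URL"]  # hypothetical configuration

def process(record):
    """Stand-in for your business logic; raise on failure."""
    ...

def handler(event, context):
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            # Park the bad record on the queue and keep going, so one
            # poison record doesn't block the rest of the stream.
            sqs.send_message(QueueUrl=DLQ_URL, MessageBody=json.dumps(record))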
We are currently working in a message-driven microservice environment, and some of our messages/events are event-sourced (using Apache Kafka). Now we are struggling to implement more complex business requirements, where we have to take multiple events into account to create new events and side effects.
In the current situation we are working with devices that can produce errors, and we already process them: we have a single topic that contains ERROR_OCCURRED and ERROR_RESOLVED events (so they are in order). We also make sure that all messages regarding a specific device always go to the same partition, and both messages share an ID that identifies that specific error incident. We already have a projection that consumes those events and provides an API for our customers, s.t. they can see all occurred errors and their current state.
Now we have to deal with the following requirement:
Reporting Errors
We need a push system that reports errors of devices to our external partners, but only after 15 minutes and if they have not been resolved in that timeframe. Our first approach was to consume all ERROR_RESOLVED events, store the IDs and have another consumer that is handling the ERROR_OCCURRED events in a delayed fashion (e.g. by only consuming the next ERROR_OCCURRED event on the topic if its timestamp is at least 15 minutes old). We would then be able to know if that particular error has already been resolved and does not need to be reported (since they share a common ID with the corresponding ERROR_RESOLVED event). Otherwise we send an HTTP request to our external partner and create an ERROR_REPORTED event on a new topic. Is there any better approach for delayed and conditional message processing?
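A rough sketch of that delayed consumer with kafka-python (topic, broker, and report_to_partner() are placeholders; the incident ID is assumed to be the record key):

import time

from kafka import KafkaConsumer  # pip install kafka-python

resolved_ids = set()  # kept up to date by a consumer of ERROR_RESOLVED events

def report_to_partner(incident_id):
    """Stand-in for the HTTP call to the external partner."""
    ...

consumer = KafkaConsumer(
    "error-occurred",                 # placeholder topic
    bootstrap_servers="broker:9092",  # placeholder broker
    group_id="error-reporter",
    enable_auto_commit=False,
)

DELAY_MS = 15 * 60 * 1000

for record in consumer:
    # Hold back until the event is at least 15 minutes old.
    age_ms = int(time.time() * 1000) - record.timestamp
    if age_ms < DELAY_MS:
        time.sleep((DELAY_MS - age_ms) / 1000)
    incident_id = record.key.decode()
    if incident_id not in resolved_ids:
        report_to_partner(incident_id)
    consumer.commit()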
We also have to take the following special use cases into account:
Service restarts: currently we are planning to keep the list of resolved errors in memory, so if a service restarts, that list has to be created from scratch. We could just replay the ERROR_RESOLVED messages, but that may take some time, and during that time no ERROR_OCCURRED events should be processed, because that might result in reporting errors that were resolved in less than 15 minutes without our being aware of it. Are there any good practices regarding replay vs. "normal" processing?
Scaling: we may increase or decrease the number of instances of our service at any time, so the partition assignment may change during runtime. That should not be a problem if we create a consumer group for each service instance when consuming the ERROR_RESOLVED events, s.t. every instance knows all resolved errors while still only handling the ERROR_OCCURRED events of its assigned partitions (in another consumer group which is shared by all instances). Is there a better approach for handling partition reassignment and internal state?
Thanks in advance!
For side effects, I would record all "side" actions in the event store. In your particular example, when it is time to send a notification, I would issue a SEND_NOTIFICATION command that emits a NOTIFICATION_SENT event. These events would be processed by some worker process that performs the actual HTTP request.
Actually, I would elaborate this even further: since notifications can fail, I would have, say, two events, NOTIFICATION_REQUIRED and NOTIFICATION_SENT, so we can retry failed notifications.
And finally your logic would be: "if the error was not resolved within 15 minutes and no notification was sent, send a notification (or just discard it if it missed its timeframe)".
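A minimal sketch of that worker (topics, broker, and send_http_notification() are placeholders):

import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

def send_http_notification(event):
    """Stand-in for the partner call; raise on failure."""
    ...

consumer = KafkaConsumer(
    "notification-required",          # placeholder topic
    bootstrap_servers="broker:9092",  # placeholder broker
    group_id="notifier",
    enable_auto_commit=False,
)
producer = KafkaProducer(bootstrap_servers="broker:9092")

for record in consumer:
    send_http_notification(json.loads(record.value))
    # Record the side effect as its own event so the flow is replay-safe.
    producer.send("notification-sent", record.value)
    # Commit only after both steps; on failure the worker restarts here.
    consumer.commit()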