Configure DynamoDB stream event source triggers to have a configurable delay for retries - aws-lambda

We have, basically:
dynamodb streams =>
trigger lambda (batch size XX, concurrency 1, retries YY) =>
write to service
There are multiple shards, so we may have some number of concurrent writes to the service. Under some conditions, too many shards have too much data and too many Lambda instances are writing to the service, which then responds with 429.
Right now that 429 simply ends up as a failure; the Lambda retries, but the service is still overwhelmed.
What we would like is for the Lambda trigger to delay before invoking a retry, essentially an exponential backoff before triggering. We can easily implement that "inside" the Lambda: retry and wait for up to the 15-minute Lambda duration.
But then we are billed for the whole Lambda execution time while it is sleeping for however many backoffs are required.
Is there a way to configure the Lambda/DynamoDB trigger to have a delay (that we can control up and down) before invoking the retry? For SQS triggers there is some talk of a redrive policy that can somehow control the rate of retries, but it is not clear how, or whether, that applies to DynamoDB streams.
I understand that the streams will "back up" as we slow down the dispatch of Lambdas, but this is assumed to be a transient situation, and the DynamoDB stream will act as a queue. We can also configure a dead letter queue, but that is somewhat orthogonal to the basic question.
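For reference, the "inside the Lambda" backoff we can already do looks roughly like this (a minimal sketch; the service URL and retry budget are placeholders), and it is exactly the billed-sleep pattern we want to avoid:

```python
import random
import time

import requests  # assumption: the downstream service is a plain HTTP endpoint

SERVICE_URL = "https://example.internal/write"  # placeholder
MAX_ATTEMPTS = 5  # hypothetical retry budget; total sleep must fit inside the 15-minute limit


def handler(event, context):
    for record in event["Records"]:
        for attempt in range(MAX_ATTEMPTS):
            resp = requests.post(SERVICE_URL, json=record["dynamodb"])
            if resp.status_code != 429:
                break
            # Exponential backoff with jitter; the Lambda is billed while it sleeps here.
            time.sleep(min(2 ** attempt + random.random(), 60))
        else:
            # Still throttled after all attempts: fail so the event source mapping retries the batch.
            raise RuntimeError("service still returning 429 after backoff")
```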

You can configure a wait. And yes, while you are billed for the time used, it's pennies. Seriously, the free AWS account covers a million Lambda invocations a month. At the enterprise level it's really nothing compared to what EC2 servers cost. But I'm not your CFO, so maybe it is a concern.
You can take your stream, process it into whatever service calls you need, and have their payloads all added to the same SQS queue. You can configure your SQS queue to throttle itself, in effect, so it only sends so many over a given time. The messages in your queue would go to another Lambda that would make the service call for you, one at a time, doled out by SQS.
Or set up a Dead Letter Queue instead (possibly in combination with either of the above) to catch the failed ones and try again when traffic is lower.
As an aside, you don't want to 'pause' your DynamoDB stream, as it only has a 24-hour retention window. If your stream pauses for too long you will lose data. Better to take the stream in whole and put it into an SQS queue as individual writes, because SQS retains messages for up to 14 days.
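A minimal sketch of that stream-to-SQS relay, with a made-up queue URL and an assumed flat 30-second delay (SQS allows per-message delays of 0-900 seconds):

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/service-writes"  # placeholder


def handler(event, context):
    # Relay each stream record into SQS so the queue, not a sleeping Lambda, absorbs the backlog.
    for record in event["Records"]:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(record["dynamodb"]),
            DelaySeconds=30,  # assumed delay; tune it, or compute it per message
        )
```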

Related

How to ensure DynamoDB Stream records are not lost forever when Lambda fails for over 24 hours?

I am using a DynamoDB Stream (non-Kinesis version) and I've mapped the stream to a Lambda to process events.
Two things I understand about this stream are:
If the Lambda fails, it will automatically retry with the stream event.
DynamoDB stream will only keep the record for up to 24 hours.
My concern is that I want to be able to make sure my Lambda never misses a DynamoDB event, even if the Lambda is failing for more than 24 hours.
How can I ensure that the stream records are not lost forever if my Lambda fails for an extended period of time?
My initial thought is to treat this like I would a Lambda that reads from an SQS queue. I'd like to add a retry policy and DLQ to the Lambda, which would store failed events in a DLQ to reprocess at a later time.
Is this all that needs to be done to achieve what I want? I am struggling to find documentation on how to do this with a DynamoDB Stream. Is DDB Stream behavior any different from an SQS queue?
Why would the lambda fail for 24 hours?
My guess is your Lambda relies on something downstream that you anticipate might be down for a long duration. In that case I'd suggest the Lambda decide when to "give up" and toss its work items onto your own SQS queue for later processing. You can't keep items in the DynamoDB Stream for longer than 24 hours, nor does the Stream itself have a DLQ.
Another option: DynamoDB can stream via Kinesis which has longer retention. The automatic lambda invocation however is only for DynamoDB Streams.
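To add to that: while the Stream itself has no DLQ, the Lambda event source mapping does let you cap retries and record age and route metadata about failed batches to an on-failure destination (it carries pointers into the stream, not the records themselves, so you still have to re-read them within the 24-hour window). A sketch with a placeholder UUID and queue ARN:

```python
import boto3

lambda_client = boto3.client("lambda")

# The event source mapping UUID and the failure queue ARN below are placeholders.
lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",
    MaximumRetryAttempts=5,           # stop retrying a failing batch after 5 attempts
    MaximumRecordAgeInSeconds=21600,  # give up on records older than 6 hours
    BisectBatchOnFunctionError=True,  # split the batch to isolate a poison record
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:us-east-1:123456789012:ddb-stream-failures"
        }
    },
)
```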

Rate-Limiting / Throttling SQS Consumer in conjunction with Step-Functions

Given the following architecture:
The issue with that is that we reach throttling due to the maximum number of concurrent lambda executions (1K per account).
How can this be addressed or circumvented?
We want to have full control of the rate-limiting.
1) Request concurrency increase.
This would probably be the easiest solution, but it would increase the potential workload quite a lot. It neither resolves the root cause nor gives us any flexibility or room for custom rate-limiting.
2) Rate Limiting API
This would only address one component, as the API is not the only trigger of the step functions. Besides, it would impact the clients, as they would receive a 4xx response.
3) Adding SQS in front of SFN
This will be one of our choices nevertheless, as it is always good to have a queue in front of such a number of events. However, a simple queue on top does not provide rate-limiting.
As SQS can't be configured to execute SFN directly, a Lambda in between would be required, which then triggers the SFN in code. Without any more logic this would not solve the concurrency issues.
4) FIFO-SQS in front of SFN
Something along the lines of what this blog post explains.
Summary: by using virtually grouped items we can define the number of items being processed. While this solution works quite well for their use case, I am not convinced it would be a good approach for ours, because the SQS consumer is not the indicator of the workload; it only triggers the step functions.
Due to the uneven workload this is not optimal, as it would be better to have the concurrency distributed by actual workload rather than by chance.
5) Kinesis Data Stream
By using Kinesis data stream with predefined shards and batch-sizes we can implement the logic of rate-limiting. However, this leaves us with the exact same issues described in (3).
6) Provisioned Concurrency
Assuming we have an SQS queue in front of the SFN, the SQS consumer can be configured with a fixed provisioned concurrency. The value could be calculated from the account's maximum allowed concurrency in conjunction with the number of parallel tasks of the step functions. It looks like we can find a proper value here.
But once the quota is reached, SQS will still retry sending the messages, and once the maximum receive count is reached the message will end up in the DLQ. This blog post explains it quite well.
7) EventSourceMapping toggle by CloudWatch metrics (sort of a circuit breaker)
Assuming we have a SQS in front of SFN and a consumer-lambda.
We could create CW metrics and trigger the execution of a Lambda once a metric threshold is hit. That event-Lambda could then temporarily disable the event source mapping between the SQS queue and the consumer-lambda. Once the workload of the system eases, another event could be sent to enable the source mapping again. Something like the toggle sketched after this list.
However, I wasn't able to determine proper metrics to react on before the throttling kicks in. Additionally, CW metrics work with 1-minute frames, so the event might already arrive too late.
8) ???
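Here is roughly what the toggle in (7) could look like; the mapping UUID and the event shape are assumptions:

```python
import boto3

lambda_client = boto3.client("lambda")
MAPPING_UUID = "00000000-0000-0000-0000-000000000000"  # placeholder: the SQS -> consumer-lambda mapping


def handler(event, context):
    # Assumed event shape: {"action": "disable"} or {"action": "enable"},
    # e.g. sent by a CloudWatch alarm action or an EventBridge rule.
    enable = event.get("action") != "disable"
    lambda_client.update_event_source_mapping(UUID=MAPPING_UUID, Enabled=enable)
    return {"enabled": enable}
```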
The question itself is a nice overview of all the major options. Well done.
You could implement throttling directly with API Gateway. This is the easiest option if you can afford to reject the client every once in a while.
If you need stream and buffer control, go for Kinesis. You can even put all your events in an S3 bucket and trigger Lambdas or Step Functions when a new event has been stored (more here). Yes, you will ingest events differently and you will need a bridge Lambda function to trigger Step Functions based on Kinesis events, but this is relatively low implementation effort.
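A sketch of such a bridge Lambda, assuming a Kinesis event source mapping and a state machine ARN of your own:

```python
import base64
import json

import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:processor"  # placeholder


def handler(event, context):
    # Shard count and batch size on the Kinesis mapping bound how many executions start per invocation.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"event": json.loads(payload)}),
        )
```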

DynamoDB re-processing records

I just inherited someone else's code that uses a serverless Lambda function to process records from DynamoDB. The original developer is using DynamoDB much like RabbitMQ: as a temporary staging area with some level of fault tolerance, plus a Lambda function that will process the records at a later date.
We currently have a way to delay message publication in RabbitMQ at my company, but this feature is missing on the AWS side of the fence.
I wrote some code in my serverless Lambda function so that it checks a special property called ProcessAfter (a UTC DateTime) and effectively skips processing any given DynamoDB record if the current UTC date/time is earlier than the ProcessAfter value. However, DynamoDB never sends me that record again. It appears that DynamoDB only ever allows a single attempt at processing a record (excluding the built-in retries on exceptions), so my attempted solution for implementing a delay capability is stuck.
Is there any way to replicate the delay functionality in DynamoDB, or in my Lambda function, so that messages are skipped and then re-processed as often as necessary until the delay is over and the record is successfully processed?
It looks like you are listening to DynamoDB Streams. They work such that whenever a configured event (insert, update, etc.) happens for a record, it is sent to a listener for processing.
For your specific scenario, you need an SQS queue in place if you want to process a record later rather than immediately.
The better architecture I would advise is to add an extra SQS queue and Lambda. The Lambda listens to the DynamoDB stream event, compares ProcessAfter with the current time to compute a delay, sets that delay as DelaySeconds, and sends the message to SQS.
Finally, the Lambda listener on the queue will process it after the specified delay (or zero delay, as required).
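A minimal sketch of that relay Lambda, assuming ProcessAfter is stored as an ISO-8601 UTC string; note that SQS caps per-message DelaySeconds at 900 (15 minutes), so longer delays would need re-queueing:

```python
import json
from datetime import datetime, timezone

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/delayed-records"  # placeholder


def handler(event, context):
    for record in event["Records"]:
        image = record["dynamodb"].get("NewImage", {})
        process_after = image.get("ProcessAfter", {}).get("S")  # assumed ISO-8601 UTC string
        delay = 0
        if process_after:
            due = datetime.fromisoformat(process_after.replace("Z", "+00:00"))
            if due.tzinfo is None:
                due = due.replace(tzinfo=timezone.utc)
            remaining = (due - datetime.now(timezone.utc)).total_seconds()
            delay = max(0, min(900, int(remaining)))  # SQS hard limit: 900 seconds
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(image),
            DelaySeconds=delay,
        )
```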

How to control event handling rate with a serverless stack

I have to call an external API that is limited to a few hundred requests/min in order to process an unknown number of events: last week's events (which I store as DynamoDB objects), calling this API for each of them.
My first idea is to do the following:
Get all the events for a specific day from DynamoDB (but I could fetch fewer)
Put those events in an SQS queue
Have SQS events trigger another Lambda, with a reserved concurrency set low enough (let's say 2), that will call the API.
Since the Lambda has a duration of ~100 ms, will I have a maximum of 20 req/sec here?
Is my logic correct here?
Thanks.
I think your solution generally makes sense. One of the other things you should be aware of is the VisibilityTimeout on the SQS queue. This basically means:
"hide anything that's been read for ${VisibilityTimeout} seconds, before making it visible for processing again."
Keep in mind if you get an error in your Lambda, the queue message will just stay in the queue. For more on that, see this article, which I found helpful.
The other approach you could take if you still run into throttling issues with your external API is to set up a CloudWatch event that wakes up every so often (let's say every 5 minutes) and explicitly calls your lambda. You'd need to retrofit your Lambda to explicitly read messages from the queue, and then process them. This would give you a little more control to "sip" messages using the receiveMessage method on the SQS SDK.
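A sketch of that scheduled "sipping" consumer; the queue URL, per-run cap, and the HTTP call are placeholders:

```python
import json

import boto3
import requests  # assumption: the rate-limited external API is HTTP

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/api-calls"  # placeholder
MAX_CALLS_PER_RUN = 25  # assumed cap per scheduled invocation


def handler(event, context):
    calls = 0
    while calls < MAX_CALLS_PER_RUN:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            requests.post("https://api.example.com/endpoint", json=json.loads(msg["Body"]))  # placeholder call
            # Delete only after a successful call; otherwise the message reappears
            # after the VisibilityTimeout and is retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            calls += 1
```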

Real-time monitoring of SQS queue in AWS

What's the best way to provide real-time monitoring of the total count of messages sent to an SQS queue?
I currently have a Grafana dashboard set up to monitor an SQS queue, but it seems to refresh about every two minutes. I'm looking to get something set up to update almost in real-time, e.g. refresh every second.
The queue I'm using consumes around 6,000 messages per minute.
Colleagues of mine have built something for real-time monitoring of uploads to an S3 bucket, using a lambda to populate a PostgreSQL DB and using Grafana to query this.
Is this the best way of achieving this? Is there a more efficient way?
SQS is not event-driven; it must be polled. Therefore, there isn't an event each time a message is put into the queue or removed from it. With S3 to Lambda, an event is sent in pretty much real time every time an object is created or removed.
You can change the polling interval for SQS and poll as fast as you'd like. But be aware that polling does have a cost. The first 1 million requests a month are free.
I'm not sure exactly what you're trying to accomplish (I'll address that after my ideas), but there are certainly a couple of ways you could do this. Each has positives and negatives.
In every place you produce or consume messages, increment or decrement a CloudWatch metric (or Datadog, Librato, etc.). It's still polling-based, but you could get the granularity down (even with CloudWatch) to 15-60 seconds. The biggest problem here is that it's error-prone (what happens if the SQS message times out and gets reprocessed?).
Create a secondary queue. Each message that goes into this queue is either an "add" or a "delete" message. Attach a Lambda, container, or autoscaling group to process the queue and update metrics in an RDS or DynamoDB table. Query the table as needed.
Use a different queue processing system instead of SQS. I've seen RabbitMQ and Sensu used in very large environments; they will easily handle 6,000 messages per minute.
Keep in mind there are a lot more metrics than just the number of messages in the queue. I've recently become really fond of ApproximateAgeOfOldestMessage, because it indicates whether messages are being processed without error. Here's a blog post about the most helpful SQS metrics; it's called How to Monitor Amazon SQS with CloudWatch.
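For the polling route, the relevant call is get_queue_attributes; a rough sketch of a once-per-second poller (the queue URL is a placeholder, and the counts are approximate by design):

```python
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

while True:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"],
    )["Attributes"]
    # Each call is a billable SQS request; at one per second that is roughly 2.6M requests/month.
    print(attrs["ApproximateNumberOfMessages"], attrs["ApproximateNumberOfMessagesNotVisible"])
    time.sleep(1)
```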
