AWS Lambda Reserved and Unreserved Concurrency Alarm - aws-lambda

In our setup, we have many AWS Lambda functions, developed by different teams. Some of them have set a reserved concurrency, which eats into the total concurrency of the account (1000).
Is there a way to monitor, or set an alarm that is triggered, when the unreserved concurrency drops below a specific level?
This would help us act proactively to alleviate the issue and reduce failures.

In AWS there are pre-defined metrics related to Lambda concurrency that are exposed in AWS CloudWatch:
ConcurrentExecutions: shows the concurrent executions happening at that moment across all the Lambda functions in the account, including both reserved and unreserved.
UnreservedConcurrentExecutions: shows the total concurrent executions happening at that moment that are using the unreserved concurrency.
The information I was looking for, the account-level ConcurrentExecutions and UnreservedConcurrentExecutions limits, can be seen when we run the CLI command:
$ aws lambda get-account-settings
{
    "AccountLimit": {
        "TotalCodeSize": 1231232132,
        "CodeSizeUnzipped": 3242424,
        "CodeSizeZipped": 324343434,
        "ConcurrentExecutions": 10000,
        "UnreservedConcurrentExecutions": 4000
    },
    "AccountUsage": {
        "TotalCodeSize": 36972950817,
        "FunctionCount": 1310
    }
}
It is not possible to get these values into a dashboard directly, as we cannot execute API calls to fetch and display data in the dashboard.
Solution
We can create a Lambda function that extracts, using the API, the account-wide values for ConcurrentExecutions and UnreservedConcurrentExecutions, and then publishes them to CloudWatch as custom metrics. We can schedule this Lambda function using CloudWatch Events.
Once we have the metrics, we can set the required alarm on the Unreserved Concurrency.
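A minimal sketch of such a scheduled function, written in Python with boto3 (the Custom/Lambda namespace and the metric names are arbitrary choices, not fixed by AWS):

import boto3

lambda_client = boto3.client("lambda")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # Account-wide limits, the same values returned by `aws lambda get-account-settings`.
    limits = lambda_client.get_account_settings()["AccountLimit"]
    # Publish both values to CloudWatch as custom metrics.
    cloudwatch.put_metric_data(
        Namespace="Custom/Lambda",  # arbitrary namespace, pick your own
        MetricData=[
            {"MetricName": "ConcurrentExecutions",
             "Value": limits["ConcurrentExecutions"], "Unit": "Count"},
            {"MetricName": "UnreservedConcurrentExecutions",
             "Value": limits["UnreservedConcurrentExecutions"], "Unit": "Count"},
        ],
    )

Scheduled every minute or so by a CloudWatch Events rule, this gives a Custom/Lambda UnreservedConcurrentExecutions metric that a normal CloudWatch alarm can watch.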

Related

DynamoDB:PutItem calls silently ignored

I have a Lambda function bound to CodeBuild notifications; a Lambda instance writes details of the notification that triggered it to a DynamoDB table (BillingMode PAY_PER_REQUEST).
Each CodeBuild notification spawns an independent Lambda instance. A CodeBuild build can spawn 7-8 separate notifications/Lambda instances, many of which often happen simultaneously.
The Lambda function uses DynamoDB:PutItem to put details of the notification to DynamoDB. What I find is that out of 7-8 notifications in a 30 second period, sometimes all 7-8 get written to DynamoDB, but sometimes it can be as low as 0-1; many calls to DynamoDB:PutItem simply seem to be "ignored".
Why is this happening?
My guess is that DynamoDB simply shouldn't be accessed by multiple Lambda instances in this way; that best practice is to push the updates to a SQS queue bound to a separate Lambda, and have that separate Lambda write many updates to DynamoDB as part of a transaction.
Is that right? Why might parallel independent calls to DynamoDB:PutItem fail silently?
TIA.
DynamoDB uses a web endpoint and for that reason it can handle any number of concurrent connections, so the issue is not with how many Lambdas are writing.
I typically see this happen when users do not allow the Lambda to wait until the API requests are complete and the container gets shut down prematurely. I would first check your code and ensure that your Lambda stays alive until all items have been processed; you can do this by adding some simple logging to your code.
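A minimal sketch of that kind of check, assuming a Python handler and a hypothetical table name (the question does not state the runtime or the table):

import logging
import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Hypothetical table name; replace with your own.
table = boto3.resource("dynamodb").Table("codebuild-notifications")

def handler(event, context):
    # Build the item however your function already does; this is a placeholder.
    item = {"id": context.aws_request_id, "detail": str(event)}
    response = table.put_item(Item=item)
    # Log the key and the HTTP status so missing writes show up in CloudWatch Logs.
    logger.info("PutItem %s -> %s", item["id"],
                response["ResponseMetadata"]["HTTPStatusCode"])
    # The handler only returns after put_item has completed.
    return response["ResponseMetadata"]["HTTPStatusCode"]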
What you are describing is a good use case for Step Functions.
As much as Lambda functions are great glue between services, they have their overheads and their limitations. With Step Functions, you can call DynamoDB:PutItem directly, and you can handle various scenarios and flows, such as async calls. These flows are possible to implement in a Lambda function, but with less visibility and less traceability.
BTW, you can also call a Lambda function from Step Functions; however, I recommend trying the direct service call to maximize the benefits of the Step Functions service.
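As an illustration, a minimal sketch of such a state machine definition, built here as a Python dict (the table name and the input paths are placeholders, not taken from the question):

import json

# Amazon States Language definition with a direct DynamoDB putItem service integration.
definition = {
    "StartAt": "PutNotification",
    "States": {
        "PutNotification": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": "NotificationsTable",  # placeholder table name
                "Item": {
                    # Keys ending in ".$" take their values from the execution input;
                    # these paths are examples only.
                    "id": {"S.$": "$.buildId"},
                    "status": {"S.$": "$.buildStatus"},
                },
            },
            "End": True,
        }
    },
}
print(json.dumps(definition, indent=2))

Deployed via the console, CloudFormation, or the CreateStateMachine API, this writes the item without any Lambda code in between.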
My mistake, I had a separate issue which was messing up some of the range keys and causing updates to "fail" silently. But thx for the tip regarding timeouts

Rate-Limiting / Throttling SQS Consumer in conjunction with Step-Functions

Given the following architecture:
The issue with that is that we reach throttling due to the maximum number of concurrent lambda executions (1K per account).
How can this be addressed or circumvented?
We want to have full control of the rate-limiting.
1) Request concurrency increase.
This would probably be the easiest solution, but it would increase the potential workload quite a lot. It doesn't resolve the root cause, nor does it give us any flexibility or room for any custom rate-limiting.
2) Rate Limiting API
This would only address one component, as the API is not the only trigger of the step functions. Besides, it will have an impact on the clients, as they will receive a 4xx response.
3) Adding SQS in front of SFN
This will be one of our choices nevertheless, as it is always good to have a queue in front of such a number of events. However, a simple queue on top does not provide rate-limiting.
As SQS can't be configured to execute SFN directly, a Lambda in between would be required, which then triggers the SFN by code (a sketch is shown after this list). Without any further logic this would not solve the concurrency issues.
4) FIFO-SQS in front of SFN
Something along the lines of what this blog post explains.
Summary: by using virtually grouped items we can define the number of items that are being processed. While this solution works quite well for their use case, I am actually not convinced it would be a good approach for our use case, because the SQS consumer is not the indicator of the workload, as it only triggers the step functions.
Due to the uneven workload this is not optimal, as it would be better to have the concurrency distributed by actual workload rather than by chance.
5) Kinesis Data Stream
By using a Kinesis data stream with predefined shards and batch sizes we can implement the rate-limiting logic. However, this leaves us with exactly the same issues described in (3).
6) Provisioned Concurrency
Assuming we have an SQS queue in front of the SFN, the SQS consumer can be configured with a fixed provisioned concurrency. The value could be calculated from the account's maximum allowed concurrency in conjunction with the number of parallel tasks of the step functions. It looks like we can find a proper value here.
But once the quota is reached, SQS will still retry to send messages, and once the maximum is reached the messages will end up in the DLQ. This blog post explains it quite well.
7) EventSourceMapping toggle by CloudWatch Metrics (sort of a circuit breaker)
Assuming we have an SQS queue in front of the SFN and a consumer Lambda.
We could create CW metrics and trigger the execution of a Lambda once a metric threshold is hit. That event Lambda could then temporarily disable the event source mapping between the SQS queue and the consumer Lambda. Once the workload of the system eases, another event could be sent to enable the source mapping again.
Something like:
However, I wasn't able to determine proper metrics to react on before the throttling kicks in. Additionally, CW metrics work with 1-minute frames, so the event might already come too late.
8) ???
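For option (3), a minimal sketch of the bridging consumer Lambda in Python could look like this (the state machine ARN is a placeholder; on its own this adds no rate-limiting, as noted above):

import boto3

sfn = boto3.client("stepfunctions")
# Placeholder ARN; not taken from the question.
STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:my-flow"

def handler(event, context):
    # One SQS batch can contain several messages; start one execution per message.
    for record in event["Records"]:
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=record["body"],  # assumes the message body is already valid JSON
        )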
The question itself is a nice overview of all the major options. Well done.
You could implement throttling directly with API Gateway. This is the easiest option if you can afford to reject the client every once in a while.
If you need stream and buffer control, go for Kinesis. You can even put all your events in an S3 bucket and trigger Lambdas or a Step Function when a new event has been stored (more here). Yes, you will ingest events differently and you will need a bridge Lambda function to trigger the Step Function based on Kinesis events. But this is a relatively low implementation effort.

why is there a 1MB payload limit in aws lambda + alb configuration?

AWS Lambda can be used to create serverless applications, but it has some limits. One of these is a limit on the payload size: you can only return a maximum of 1MB payload from your function.
Does anyone know why this limit exists?
AWS Lambda limits the amount of compute and storage resources that you can use to run and store functions. AWS has deliberately put in place several Lambda limits, either soft or hard, to prevent misuse or abuse.
Here is the reference from the AWS documentation: AWS Hard and Soft Limits

Publishing high-volume metrics from Lambdas?

I have a bunch of Lambdas written in Go that produce certain events that are pushed out to various systems. I would like to publish metrics to CloudWatch that slice these by the event type. The volume is currently about 20000 events per second with peaks about twice that much.
Due to the load, I can't publish these metrics one by one on each Lambda invocation (each invocation produces a single event). What available approaches are there that are cheap and don't hit any limits?
You can try to utilize the shutdown phase of the Lambda lifecycle to publish your metrics.
https://docs.aws.amazon.com/lambda/latest/dg/runtimes-context.html#runtimes-lifecycle-shutdown
To publish the metrics I would suggest utilizing EMF (Embedded Metric Format), which emits metrics through structured log entries rather than API calls, or batching multiple data points into a single call to the PutMetricData API, whose MetricData parameter also takes an array and acts like a batch.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html
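A minimal sketch of the EMF approach (namespace, dimension, and metric names are arbitrary examples; the question's functions are in Go, but the same JSON structure applies, shown here in Python for brevity):

import json
import time

def emit_event_metric(event_type, count=1):
    # Printing an EMF-formatted JSON line to stdout is enough: it lands in
    # CloudWatch Logs and is extracted as a metric without any PutMetricData call.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "Custom/Events",  # arbitrary example namespace
                "Dimensions": [["EventType"]],
                "Metrics": [{"Name": "EventCount", "Unit": "Count"}],
            }],
        },
        "EventType": event_type,
        "EventCount": count,
    }))

def handler(event, context):
    # Hypothetical: derive the event type from the incoming event.
    emit_event_metric(event.get("type", "unknown"))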

triggering EC2 autoscaling by using SNS

I need to trigger EC2 Auto Scaling from an SNS subscription. Is there any way to do that, something like triggering a Lambda function? Thanks
I have created the SNS topic and am receiving the messages from Alertmanager, which is configured for CPU, memory and thread count. I wish to enable the auto scaling based on the SNS topic and not by using CloudWatch events.
Amazon EC2 Auto Scaling is designed to work by responding to Amazon CloudWatch Alarms. When an alarm enters the ALARM state, it can trigger Auto Scaling to add or remove instances.
Alternatively, Auto Scaling can track a particular metric and will scale to keep that metric close to a given target. For example, an average CPU Utilization of 60% across the group.
If you do not wish to use CloudWatch Alarms to trigger scaling, then you can write your own logic and call SetDesiredCapacity() to change the number of desired instances, or call ExecutePolicy() to trigger a pre-defined scaling policy (eg "add 1 instance").
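A minimal sketch of that custom logic as an SNS-triggered Lambda in Python (the Auto Scaling group and policy names are placeholders):

import boto3

autoscaling = boto3.client("autoscaling")
# Placeholder names; replace with your own group and scaling policy.
ASG_NAME = "my-asg"
SCALE_OUT_POLICY = "add-one-instance"

def handler(event, context):
    # SNS delivers the Alertmanager notification in Records[].Sns.Message.
    for record in event["Records"]:
        print("Received alert:", record["Sns"]["Message"])
        # Either trigger a pre-defined scaling policy...
        autoscaling.execute_policy(
            AutoScalingGroupName=ASG_NAME,
            PolicyName=SCALE_OUT_POLICY,
        )
        # ...or set the desired capacity explicitly instead:
        # autoscaling.set_desired_capacity(
        #     AutoScalingGroupName=ASG_NAME, DesiredCapacity=3, HonorCooldown=True)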
