I need to trigger the EC2 autoscaling from the SNS subscription. Is there any way to do that something like triggering LAMBDA function. Thanks
I have created the SNS Topic and receing the messages from Alertmanager which is configured for CPU, Memory and thread count. I wish to enable the auto scaling based on the SNS topic and not using the cloudwatch events.
Amazon EC2 Auto Scaling is designed to work by responding to Amazon CloudWatch Alarms. When an alarm enters the ALARM state, it can trigger Auto Scaling to add or remove instances.
Alternatively, Auto Scaling can track a particular metric and will scale to keep that metric close to a given target. For example, an average CPU Utilization of 60% across the group.
If you do not wish to use CloudWatch Alarms to trigger scaling, then you can write your own logic and call SetDesiredCapacity() to change the number of desired instances, or call ExecutePolicy() to trigger a pre-defined scaling policy (eg "add 1 instance").
Related
Given following architecture:
The issue with that is that we reach throttling due to the maximum number of concurrent lambda executions (1K per account).
How can this be address or circumvented?
We want to have full control of the rate-limiting.
1) Request concurrency increase.
This would probably be the easiest solution but it would increase the potential workload quite much. It doesn't resolve the root cause nor does it give us any flexibility or room for any custom rate-limiting.
2) Rate Limiting API
This would only address one component, as the API is not the only trigger of the step-functions. Besides, it will have impact to the clients, as they will receive a 4x response.
3) Adding SQS in front of SFN
This will be one of our choices nevertheless, as it is always good to have a queue on top of such number of events. However, a simple queue on top does not provide rate-limiting.
As SQS can't be configured to execute SFN directly a lambda in between would be required, which then triggers then SFN by code. Without any more logic this would not solve the concurrency issues.
4) FIFO-SQS in front of SFN
Something along the line what this blog-post is explaining.
Summary: By using a virtually grouped items we can define the number of items that are being processed. As this solution works quite good for their use-case, I am actually not convinced it would be a good approach for our use-case. Because the SQS-consumer is not the indicator of the workload, as it only triggers the step-functions.
Due to uneven workload this is not optimal as it would be better to have the concurrency distributed by actual workload rather than by chance.
5) Kinesis Data Stream
By using Kinesis data stream with predefined shards and batch-sizes we can implement the logic of rate-limiting. However, this leaves us with the exact same issues described in (3).
6) Provisioned Concurrency
Assuming we have an SQS in front of the SFN, the SQS-consumer can be configured with a fixed provision concurrency. The value could be calculated by the account's maximum allowed concurrency in conjunction with the number of parallel tasks of the step-functions. It looks like we can find a proper value here.
But once the quota is reached, SQS will still retry to send messages. And once max is reached the message will end up in DLQ. This blog-post explains it quite good.
7) EventSourceMapping toogle by CloudWatch Metrics (sort of circuit breaker)
Assuming we have a SQS in front of SFN and a consumer-lambda.
We could create CW-metrics and trigger the execution of a lambda once a metric is hit. The event-lambda could then temporarily disable the event-source-mapping between the SQS and the consumer-lambda. Once the workload of the system eases another event could be send to enable the source-mapping again.
Something like:
However, I wasn't able to determine proper metrics to react on before the throttling kicks in. Additionally, CW-metrics are dealing with 1-minute frames. So the event might happen too late already.
8) ???
Question itself is a nice overview of all the major options. Well done.
You could implement throttling directly with API Gateway. This is the easiest option if you can afford rejecting the client every once in a while.
If you need stream and buffer control, go for Kinesis. You can even put all your events in S3 bucket and trigger lambdas or Step Function when a new event has been stored (more here). Yes, you will ingest events differently and you will need a bridge lambda function to trigger Step Function based on Kinesis events. But this is relatively low implementation effort.
We have basically
dynamodb streams =>
trigger lambda (batch size XX, concurrency 1, retries YY) =>
write to service
There are multiple shards, so we may have some number of concurrent writes to the service. Under some conditions too many streams have too much data, and too many lambda instances are writing to the service, which then responds with 429.
Right now the failure simply ends up being a failure, the lambda retries, but the service is still overwhelmed.
What we would like to do is just have the lambda triggers delay before triggering a lambda retry, essentially have an exponential backoff before triggering. We can easily implement that "inside" the lambda, we can retry and wait for up to the 15m lambda duration.
But then we are billed for whole lambda execution time, while it is sleeping for however many backoffs are required.
Is there a way to configure the lambda/dynamodb trigger to have a delay (that we can control up and down) before invoking the retry? For SQS triggers there is some talk of redrive policy that somehow can control the rate of retries - but not clear how or if that applies to dynamodb streams.
I understand that the streams will "backup" as we slow down the dispatch of lambdas, but this is assumed to be a transient situation, and the dynamodb stream will act as a queue. And we can also configure a dead letter queue, but that is sort of orthogonal to the basic question.
You can configure a wait. And yes, while you are billed by the time use, its pennies. Seriously, the free aws account covers a million lambda invocations a month. At the enterprise level its really nothing compared to what EC2 servers cost. But Im not your CFO so maybe it is a concern.
You can take your stream and process it into whatever service calls you would need and have their paylods all added to the same SQS. You can configure your SQS to throttle it self in effect, so it only sends so many over a given time. The messages in your queue wold go to another lambda that would do the service call for you, one at a time. It would be doled out by the SQS
set up a Dead Letter Queue instead (possibly in combination with either of the above) to catch the failed ones and try again when traffic is lower.
As an aside, you dont want to 'pause' your dynamo stream as it only has a 24 hour TTL on it. If your stream pauses for too long you will loose data. Better to take the stream in whole and put it into an SQS queue as individual writes because SQS has a TTL of up to 14 days.
In our set up, we have lots of AWS Lambda functions, developed by different teams. Some of the them have set a reserved concurrency. This eats out of the total concurrency of the account (1000).
Is there a way to monitor or set an alarm that is triggered when the unreserved currency drops below specific level?
This would be helpful to proactively do something to alleviate the issue and reduce failures.
In AWS there are pre-defined metrics, related to Lambda concurrency, that are exposed in AWS CloudWatch
ConcurrentExecutions: Shows you the concurrent executions that are happening at that moment across the all the Lambda functions in the Account including Reserved and Unreserved.
UnreservedConcurrentExecutions: This shows you the total concurrent executions, that are happening at that moment, that are using the Unreserved Concurrency.
The information I was looking for can be seen when we run the CLI command
ConcurrentExecutions and UnreservedConcurrentExecutions
$ aws lambda get-account-settings
{
"AccountLimit": {
"TotalCodeSize": 1231232132,
"CodeSizeUnzipped": 3242424,
"CodeSizeZipped": 324343434,
"**ConcurrentExecutions**": 10000,
"**UnreservedConcurrentExecutions**": 4000
},
"AccountUsage": {
"TotalCodeSize": 36972950817,
"FunctionCount": 1310
}
}
It is not possible to get these values in a dashboard. As we cannot execute API calls to fetch and display data in the dashboard.
Solution
We can create a lambda function, and, in that function, we can extract, using the API, the account wide values/settings for ConcurrentExecutions and UnreservedConcurrentExecutions. We can then create new metrics that would send the two values to CloudWatch. We can schedule AWS Lambda Functions Using CloudWatch Events.
Once we have the metric, we can set the required alarm for the Unreserved Concurrency.
What's the best way to provide real-time monitoring of the total count of messages sent to an SQS queue?
I currently have a Grafana dashboard set up to monitor an SQS queue, but it seems to refresh about every two minutes. I'm looking to get something set up to update almost in real-time, e.g. refresh every second.
The queue I'm using consumes around 6,000 messages per minute.
Colleagues of mine have built something for real-time monitoring of uploads to an S3 bucket, using a lambda to populate a PostgreSQL DB and using Grafana to query this.
Is this the best way of achieving this? Is there a more efficient way?
SQS is not event driven - it must be polled. Therefore, there isn't an event each time a message is put into the queue or removed from it. With S3 to Lambda there is an event sent in pretty much real time every time an object has been created or removed.
You can change the polling interval for SQS and poll as fast as you'd like. But be aware that polling does have a cost. The first 1 million requests a month are free.
I'm not sure what you're trying to accomplish (I'll address after my idea), but there's certainly a couple ways you could accomplish this. Each has positive and negative.
In every place you produce or consume messages, increment or decrement a cloudwatch metric (or datadog, librato, etc). It's still polling-based, but you could get the granularity down (even by using Cloudwatch) to 15-60 seconds. The biggest problem here is that it's error prone (what happens if the SQS message times out and gets reprocessed?).
Create a secondary queue. Each message that goes into this queue is either a "add" or "delete" message. Attach a lambda, container, autoscale group to process the queue and update metrics in an RDS or DynamoDB table. Query the table as needed.
Use a different queue processing system instead of SQS. I've seen RabbitMQ and Sensu used in very large environments, they will easily handle 6,000 messages per minute.
Keep in mind, there are a lot more metrics than just number of messages in the queue. I've recently become really fond of ApproximateAgeOfOldestMessage, because it indicates whether messages are being processed without error. Here's a blog post about the most helpful SQS metrics. It's called How to Monitor Amazon SQS with CloudWatch
I use 1 spot instance and would like to be emailed when prices for my instance size and region are above a threshold. I can then take appropriate action and shut down and move instance to another region if needed. Any ideas on how to be alerted to the prices?
There's two ways to go about this that I can think of:
1) Since you only have one instance, you could set a CloudWatch alarm for your instance in a region that will notify you when the spot price rises above what you're willing to pay hourly.
If you create an Alarm, and tell it to use the EstimatedCharges metric for the AmazonEC2 service, and choose a period of an hour, then you are basically telling CloudWatch to send you an email whenever the hourly spot price for your instance in the region it's running in is above your threshold for wanting to pay.
Once you get the email, you can then shut the instance down and start one up in another region, and leave it running with its own alarm.
2) You could automate the whole process with a client program that polls for changes in the spot price for your instance size in your desired regions.
This has the advantage that you could go one step further and use the same program to trigger instance shutdowns when the price rises and start another instance in a different region.
Amazon recently released a sample program to detect changes in spot prices by region and instance type: How to Track Spot Instance Activity with the Spot-Notifications Sample Application.
Simply combine that with the ec2 command-line tools to stop and start instances and you don't need to manually do it yourself.