How to detect if AWS has sufficient capacity to launch r4 instances - amazon-ec2

We keep running into this insufficient-capacity error far too often when we launch our CloudFormation stack to create test environments. I haven't found a way to proactively check whether AWS can provide capacity for the number of instances of the specific type we want. Is there really no way to check early, instead of waiting each time for CloudFormation to fail and falling back to r3 instances (besides adding more Availability Zones, i.e. not specifying them)?
2018-01-15T03:45:24.202064 2018-01-15 03:44:34.868000+00:00 | CREATE_FAILED | AWS::EC2::Instance | SlaveNode1 | We currently do not have sufficient r4.xlarge capacity in the Availability Zone you requested (ap-northeast-1c). Our system will be working on provisioning additional capacity. You can currently get r4.xlarge capacity by not specifying an Availability Zone in your request or choosing , ap-northeast-1a.
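The thread offers no capacity check, but the failure message itself names the AZs that currently have r4.xlarge capacity. As a stopgap, a deployment script could parse that message and retry the stack in one of the suggested zones. The sketch below assumes the message keeps the wording shown above ("...or choosing , ap-northeast-1a."). The function name `suggested_azs` and the retry idea are illustrative, not an AWS API.

```python
import re

# Matches AZ names such as "ap-northeast-1a" or "us-gov-west-1b".
AZ_PATTERN = re.compile(r"\b[a-z]{2}(?:-[a-z]+)+-\d[a-z]\b")


def suggested_azs(error_message):
    """Return the AZs EC2 suggests after 'or choosing' in an insufficient-capacity message.

    Assumes the wording seen in the CloudFormation event above; returns an
    empty list if that phrase is absent.
    """
    _, found, tail = error_message.partition("or choosing")
    if not found:
        return []
    return AZ_PATTERN.findall(tail)


msg = (
    "We currently do not have sufficient r4.xlarge capacity in the Availability Zone "
    "you requested (ap-northeast-1c). Our system will be working on provisioning "
    "additional capacity. You can currently get r4.xlarge capacity by not specifying "
    "an Availability Zone in your request or choosing , ap-northeast-1a."
)
print(suggested_azs(msg))  # ['ap-northeast-1a']
```

Only the text after "or choosing" is scanned, so the AZ that just failed (ap-northeast-1c) is not returned.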


AWS CloudWatch Per-Instance CPU monitoring on ASG

I'm trying to set up some metrics on an AWS CloudWatch dashboard for our Auto Scaling groups that would display the current CPU usage of each individual host in the group.
I've tried AWS's built-in searches, but none of them seem to do what I want. The ASG searches don't turn up per-instance CPU metrics, and using EC2 instances as the search filter returns all EC2 instances in our region rather than letting me target just the specific ASG.
I've tried:
SEARCH('{AWS/EC2,InstanceId} MetricName="CPUUtilization"', 'Average', 300)
But this returns all the CPU metrics from our entire fleet.
I've also tried:
Metric name
But this returns only the average CPU usage across the whole ASG.
Is there no way to filter the CPU usage statistics by ASG?
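As far as I know, per-instance `CPUUtilization` datapoints are published with only the `InstanceId` dimension, which would explain why a `SEARCH` on the ASG cannot split them out. One workaround is to look up the group's current instance IDs and request one metric per instance through `GetMetricData`. A minimal sketch, assuming boto3-style clients are passed in (so it can be exercised with a stub); `asg_instance_ids` and `per_instance_cpu_queries` are hypothetical helper names:

```python
def asg_instance_ids(autoscaling, asg_name):
    """Return the IDs of the instances currently in one Auto Scaling group."""
    response = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    groups = response["AutoScalingGroups"]
    if not groups:
        return []
    return [instance["InstanceId"] for instance in groups[0]["Instances"]]


def per_instance_cpu_queries(instance_ids, period=300, stat="Average"):
    """Build one GetMetricData query per instance for AWS/EC2 CPUUtilization."""
    queries = []
    for index, instance_id in enumerate(instance_ids):
        queries.append({
            "Id": f"cpu{index}",  # query IDs must start with a lowercase letter
            "Label": instance_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": period,
                "Stat": stat,
            },
        })
    return queries


# Usage with real clients (not run here; "my-asg" is a placeholder):
#   import boto3, datetime
#   ids = asg_instance_ids(boto3.client("autoscaling"), "my-asg")
#   end = datetime.datetime.now(datetime.timezone.utc)
#   data = boto3.client("cloudwatch").get_metric_data(
#       MetricDataQueries=per_instance_cpu_queries(ids),
#       StartTime=end - datetime.timedelta(hours=1),
#       EndTime=end,
#   )
```

The instance list is a snapshot, so it has to be refreshed whenever the group scales in or out.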

Reducing AWS CloudWatch agent CPU usage

We have the CloudWatch agent installed on one EC2 instance and even with 4 cores the task takes up 24% of total CPU time. Is there a way to configure this to be less of a CPU strain? Perhaps to drop the sample rate or have it idle for periods?
While the documentation mentions a cron job, I can't find any information on how to set up a scheduled task to have the agent work intermittently. For example, it would be nice to start it once every 5 minutes, send the results to the cloud, then shut it down, perhaps with a PowerShell task.
I managed to limit the CPU usage from 15-20% to 0-0.2% by:
Removing old logs from the folder: there were around 500 MB of logs and the agent was processing all of them
Updating to the latest version
I reduced CPU usage significantly by removing the ** (super asterisk) wildcard from my log file paths.
Also, regarding the collection interval, there is a setting in the config file for it (the default is 60 seconds; see the AWS docs):
"metrics_collection_interval": 60
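The two config-side fixes from these answers (a longer collection interval and no `**` in log paths) can be applied to an existing agent config programmatically. A minimal sketch, assuming the usual agent JSON layout with `agent`, `metrics.metrics_collected` and `logs.logs_collected.files.collect_list` sections. The helper names are hypothetical, and per-section intervals are overwritten too because, as I understand it, they take precedence over the `agent`-level value.

```python
import copy
import json


def with_collection_interval(config, seconds):
    """Return a copy of a CloudWatch agent config with all metric intervals set to `seconds`."""
    cfg = copy.deepcopy(config)
    cfg.setdefault("agent", {})["metrics_collection_interval"] = seconds
    collected = cfg.get("metrics", {}).get("metrics_collected", {})
    for section in collected.values():
        # A per-section override would otherwise keep sampling at its own rate.
        if isinstance(section, dict) and "metrics_collection_interval" in section:
            section["metrics_collection_interval"] = seconds
    return cfg


def super_asterisk_paths(config):
    """List log file_path patterns that use the recursive ** wildcard."""
    files = config.get("logs", {}).get("logs_collected", {}).get("files", {})
    return [
        entry["file_path"]
        for entry in files.get("collect_list", [])
        if "**" in entry.get("file_path", "")
    ]


example = {
    "agent": {"metrics_collection_interval": 60},
    "metrics": {"metrics_collected": {"cpu": {"measurement": ["usage_idle"],
                                              "metrics_collection_interval": 10}}},
    "logs": {"logs_collected": {"files": {"collect_list": [
        {"file_path": "/var/log/app/**/*.log"},
        {"file_path": "/var/log/app/app.log"},
    ]}}},
}
print(json.dumps(with_collection_interval(example, 300)["agent"]))  # {"metrics_collection_interval": 300}
print(super_asterisk_paths(example))  # ['/var/log/app/**/*.log']
```

The original dict is left untouched, so the result can be diffed against the current file before the agent is restarted with it.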

What can cause a Cloud Run instance to not be reused despite continuous load?

My Spring-Boot app runs as expected on Cloud Run when I deploy it with max-instances set to 1: It receives a constant stream of pubsub messages via push, and makes anywhere from 0 to 5 writes to an associated CloudSQL instance, depending on the message payload. Typically it handles between 20 and 40 messages per second. Latency/response-time varies between 50ms and 60sec, probably due to some resource contention.
In order to increase throughput and decrease resource contention, I'm looking to experiment with the connection pool size per app instance, as well as the concurrency and max-instances parameters for my Cloud Run app.
I understand that due to Spring-Boot, my app has a relatively high cold-start time of about 30-40 seconds. This is acceptable for how this service is used.
I'm experiencing problems when deploying a Spring Boot app to Cloud Run with max-instances set to a value greater than 1:
Instances start, handle a single request successfully, and then produce no more logs.
This happens a few times per minute, which leads me to believe that instances get started (cold start), handle a single request, die, and then get started again. They are not being reused as described in the docs (official docs on concurrency), as happens when I set max-instances to 1.
Instead, I expect 3 container instances to be started, each of which then handles requests according to the concurrency setting.
Billable container time at max-instances=3:
As shown in the graph, the number of instances is fluctuating wildly, once the new revision with max-instances=3 is deployed.
The graphs for CPU- and memory-usage also look like this.
There are no error logs. As before at max-instances=1, there are warnings indicating that not enough instances are available to handle requests (HTTP 429).
Connection Limit of CloudSQL instance has not been exceeded
Requests are handled at less than 10/s
Finally, this is the command used to deploy:
gcloud beta run deploy my-service --project=[...] --image=[...] --add-cloudsql-instances=[...] --region=[...] --platform=managed --memory=1Gi --max-instances=3 --concurrency=3 --no-allow-unauthenticated
What could cause this behavior?
A few months ago, during the private alpha, I ran tests and observed the same behavior. After discussing it with the Google team, I understood that instances are over-provisioned "just in case": an instance crashes, an instance is preempted, traffic suddenly increases, and so on.
The trade-off is that you will see more cold starts than your max-instances value. Worse, you will be charged for these over-provisioned cold starts, although this is not really an issue because Cloud Run's large free tier covers this kind of glitch.
Going deeper into the logs (you can do this by creating a sink that exports Cloud Run logs to BigQuery and then querying them), even if more instances are up than your max-instances value, only max-instances of them are active at the same time. To put it more clearly with your parameters: if 5 instances are up at the same time, only 3 serve traffic at any given moment.
This part is not documented because it evolves constantly to find the best balance between over-provisioning and lack of resources (and 429 errors).
@Steren @AhmetB can you confirm or correct me?
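To check the claim that only max-instances containers serve traffic at any one moment, you could reduce the request log entries exported to BigQuery to an instance ID plus request start and end times, then compute the peak number of distinct instances serving simultaneously. A sketch of that computation, assuming you have already extracted `(instance_id, start, end)` tuples; the tuple layout is an assumption, since the exported log schema is not shown here.

```python
from collections import defaultdict


def peak_active_instances(requests):
    """Peak number of distinct instances serving at least one request at the same moment.

    `requests` is an iterable of (instance_id, start, end) with start <= end,
    e.g. timestamps in seconds taken from exported request logs.
    """
    spans_by_instance = defaultdict(list)
    for instance_id, start, end in requests:
        spans_by_instance[instance_id].append((start, end))

    events = []  # (time, +1 when an instance becomes busy, -1 when it goes idle)
    for spans in spans_by_instance.values():
        spans.sort()
        busy_start, busy_end = spans[0]
        for start, end in spans[1:]:
            if start <= busy_end:  # overlapping requests on one instance count once
                busy_end = max(busy_end, end)
            else:
                events += [(busy_start, 1), (busy_end, -1)]
                busy_start, busy_end = start, end
        events += [(busy_start, 1), (busy_end, -1)]

    # Sorting tuples puts -1 before +1 at equal timestamps, so back-to-back
    # requests on different instances are not counted as overlapping.
    active = peak = 0
    for _, delta in sorted(events):
        active += delta
        peak = max(peak, active)
    return peak


sample = [("a", 0, 10), ("b", 5, 15), ("c", 12, 20), ("a", 11, 13)]
print(peak_active_instances(sample))  # 3
```

With `--max-instances=3`, a peak above 3 would contradict the explanation above, while a peak of at most 3 alongside more instances being started would be consistent with it.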
When Cloud Run receives and processes requests rapidly, it predicts how many instances it needs and tries to scale to that number. If a sudden burst of requests occurs, Cloud Run starts a larger number of instances in response. This is done to accommodate a possibly higher number of requests than it is currently serving, while trying to account for how long the existing instances will take to finish loading the requests. Per the documentation, the number of container instances can exceed the max-instances value during such a spike.
You mentioned that with max-instances set to 1 it ran fine, but later that it was in fact producing 429s at that setting as well. Seeing 429s together with the instance spikes suggests that the traffic is not being handled smoothly.
It is also worth noting that, given the cold-start time you mention, while instances are serving their first request(s) the number of concurrent requests is, by design, hard-set to 1. Only once an instance is fully ready is the concurrency setting you have chosen applied.
Was there a specific reason you chose 3 for both max-instances and concurrency? Also, what was concurrency set to when max-instances was 1? Perhaps try raising concurrency (max 80) and/or max-instances (upper limit 1000) and see whether that removes the 429s.
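A rough way to sanity-check the 3/3 choice is to compare the revision's maximum number of in-flight requests (max-instances × concurrency) with the number of requests you expect to be in flight, using Little's law (in-flight = arrival rate × mean latency). Little's law is my addition, not something from the thread, and the 1-second mean latency below is an assumption, since the question only gives a 50 ms to 60 s range. Also keep in mind the point above that concurrency is effectively 1 while instances are still starting.

```python
import math


def max_in_flight(max_instances, concurrency):
    """Most requests the revision can hold at once when all instances are warm."""
    return max_instances * concurrency


def expected_in_flight(arrival_rate_per_s, mean_latency_s):
    """Little's law: average requests in the system = arrival rate x mean time in system."""
    return math.ceil(arrival_rate_per_s * mean_latency_s)


def instances_needed(arrival_rate_per_s, mean_latency_s, concurrency):
    """Warm instances required to absorb the expected in-flight requests."""
    return math.ceil(expected_in_flight(arrival_rate_per_s, mean_latency_s) / concurrency)


# Figures from the question: 20-40 messages/s, --max-instances=3 --concurrency=3.
capacity = max_in_flight(3, 3)
needed = expected_in_flight(20, 1.0)  # assumes a 1 s mean latency
print(capacity, needed, instances_needed(20, 1.0, 3))  # 9 20 7
```

When the expected in-flight count exceeds the capacity, some push deliveries have nowhere to go, which would be consistent with the 429 warnings and with the suggestion to raise concurrency and/or max-instances.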

Starting and stopping container instances in-between task definitions on ECS

To reduce costs, I would like to stop and start a container instance in a cluster in between tasks. The task runs only every now and again, so it doesn't seem efficient to keep an EC2 instance running in between.
What is the best way to allow this?
I have looked into Lambda functions triggered by a CloudWatch scheduler and have also thought about autoscaling.
Amazon doesn't make this especially straightforward (though they're trying to with Fargate). The best solution for now (if you're not in a region where Fargate is available) is to keep your desired task count in line with the desired instance count of an Auto Scaling group.
The way we have it set up is through Lambda, triggered by Auto Scaling events (pretty easy to set up). The least trivial part is the Lambda script, though it's not especially difficult. Add tags to your ASG that identify which cluster / service it's associated with. When a scaling event triggers your script, have the script describe the ASG that triggered it, look up the cluster / service in the tags, and update the desired count of that service:
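import boto3

autoscaling = boto3.client('autoscaling')
client_ecs = boto3.client('ecs')
asg_name = '<asg name>'  # placeholder: the ASG that triggered the scaling event
paginator_describe_asg = autoscaling.get_paginator('describe_auto_scaling_groups')
asgDetail = paginator_describe_asg.paginate(AutoScalingGroupNames=[asg_name])
# set service desired count equal to ASG desired capacity
newDesiredCount = next(iter(asgDetail))['AutoScalingGroups'][0]['DesiredCapacity']
response = client_ecs.update_service(
    cluster='<ecs cluster>',  # placeholder
    desiredCount=newDesiredCount,
    service='<ecs service>',  # placeholder
)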
The reason you shouldn't rely on CloudWatch for this is that it doesn't do a great job at granular scaling. What I mean is that the CPU CloudWatch monitors on your ASG is the overall group average (I think). So the scenario we ran into was as follows:
CloudWatch detects hosts are at 90%, desired is 70%
CloudWatch launches 4 hosts
Service detects tasks are at 85%, desired is 70%
Service launches new task
Service detects tasks are at 80%, desired is 70%
Service launches new task
Service detects tasks are at 75%, desired is 70%
Service launches new task
Service detects tasks are at 70%, no action
While this is a trivial example, it's easy to see how the number of instances gets out of sync with the number of tasks actually running (i.e., you may end up with a host sitting idle because ECS doesn't detect that it needs more capacity).
Could we just scale up 3 hosts? Sure, but ECS might still only place 2 tasks (depending on how the usage is per task). Could we scale one host at a time? Sure, but then it's pretty difficult to account for bursts.
All that to say, the best solution I can recommend for now is to have a Lambda script help keep your ASG instance count == your ECS service desired task count.
I decided to create a Lambda function that starts the instance; when the container instance starts, a task is run. Then a CloudWatch event rule watches for the task's status changing to STOPPED and triggers another Lambda function that stops the instance.
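The stop-the-instance half of this approach might look like the following sketch. It assumes an EventBridge (CloudWatch Events) rule matching `ECS Task State Change` events with `lastStatus` `STOPPED` (the pattern is included as a constant), tasks on the EC2 launch type so the event carries `containerInstanceArn`, and only one task per instance, since stopping the instance ends anything else running on it. The handler name is hypothetical.

```python
# Event pattern an EventBridge rule could use to invoke the handler (assumption).
STOPPED_TASK_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["STOPPED"]},
}


def stop_instance_handler(event, context, ecs=None, ec2=None):
    """Stop the EC2 instance that hosted a task which has just stopped."""
    detail = event["detail"]
    if detail.get("lastStatus") != "STOPPED" or "containerInstanceArn" not in detail:
        return None  # ignore other transitions and tasks without a container instance
    if ecs is None or ec2 is None:
        import boto3  # imported lazily so stubs can be injected in tests

        ecs = ecs or boto3.client("ecs")
        ec2 = ec2 or boto3.client("ec2")

    described = ecs.describe_container_instances(
        cluster=detail["clusterArn"],
        containerInstances=[detail["containerInstanceArn"]],
    )
    instance_id = described["containerInstances"][0]["ec2InstanceId"]
    ec2.stop_instances(InstanceIds=[instance_id])
    return instance_id
```

The start side is the mirror image: a scheduled Lambda calling `start_instances`, after which the task is launched once the container instance registers with the cluster.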

What is wrong with my Amazon autoscaling?

I have an EC2 instance that I want to scale based on the number of messages in an SQS queue. If there are many messages (for 5 minutes), I want to launch a new EC2 instance to consume the messages faster. Then, if there are few messages (for 5 minutes), I want to terminate the oldest EC2 instance. This way, if the service that consumes the messages stops for some reason, the old instance gets terminated and the service runs again on a newer one.
I have created an Auto Scaling group for this. I have set the TerminationPolicy to OldestInstance, but it works as I expect only if I use a single zone (e.g. eu-west-1a): it creates a new instance and terminates the oldest each time. But if I use 3 zones (eu-west-1a, eu-west-1b, eu-west-1c), it launches and terminates instances without following OldestInstance, or at least not as I expect (terminating the oldest every time). Is this related to the different zones? On this page I have not found anything about it, except for the default policy.
And even if the multi-zone behavior described for the default policy applies, I can have at most 2 instances running at the same time, and they are always launched in a new zone.
This is probably the key paragraph:
When you customize the termination policy, Auto Scaling first assesses the Availability Zones for any imbalance. If an Availability Zone has more instances than the other Availability Zones that are used by the group, then Auto Scaling applies your specified termination policy on the instances from the imbalanced Availability Zone. If the Availability Zones used by the group are balanced, then Auto Scaling selects an Availability Zone at random and applies the termination policy that you specified.
I interpret this to mean that if you have instances in multiple zones, and those zones are already balanced, AWS will select a zone at random AND THEN pick the oldest instance within that randomly selected zone. It won't pick the oldest instance across AZs; it picks a random AZ and then terminates the oldest instance within that AZ.
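To make that reading concrete, here is a simplified model of the quoted rule combined with `OldestInstance`. It is my interpretation of the documentation paragraph, not AWS's implementation, and it ignores other factors that can affect the choice (e.g. instances protected from scale in). Instances are `(instance_id, az, launch_time)` tuples.

```python
import random
from collections import defaultdict


def pick_instance_to_terminate(instances, rng=random):
    """Simplified model of OldestInstance applied after the AZ-balancing step.

    `instances` is a list of (instance_id, az, launch_time) tuples.
    """
    if not instances:
        raise ValueError("no instances to choose from")
    by_az = defaultdict(list)
    for instance in instances:
        by_az[instance[1]].append(instance)

    counts = {az: len(members) for az, members in by_az.items()}
    largest = max(counts.values())
    if min(counts.values()) == largest:
        # Balanced: the documentation says an AZ is picked at random.
        candidate_azs = sorted(by_az)
    else:
        # Imbalanced: only the most populated AZ(s) are considered.
        candidate_azs = sorted(az for az, n in counts.items() if n == largest)

    az = rng.choice(candidate_azs)
    oldest_in_az = min(by_az[az], key=lambda instance: instance[2])
    return oldest_in_az[0]


# "i-b1" is the oldest instance overall, but AZ "a" is over-populated,
# so the oldest instance in "a" is chosen instead.
fleet = [("i-a1", "a", 3), ("i-a2", "a", 4), ("i-b1", "b", 1)]
print(pick_instance_to_terminate(fleet))  # i-a1
```

In the balanced case the result depends on which AZ is drawn, so across several scale-ins the globally oldest instance can survive, which matches the behavior described in the question.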
