AWS CloudWatch Per-Instance CPU monitoring on ASG - amazon-ec2

I'm trying to set up some metrics on an AWS CloudWatch dashboard for our Auto Scaling groups that would display the current CPU usage of each individual host in the Auto Scaling group.
I've tried to use AWS's built-in searches, but none of them seem to match what I want. The ASG searches don't surface per-instance CPU metrics, and using EC2 instances as the search filter returns all EC2 instances in our region rather than letting me target just the specific ASG.
I've tried:
SEARCH('{AWS/EC2,InstanceId} MetricName="CPUUtilization"', 'Average', 300)
But this returns all the CPU metrics from our entire fleet.
I've also tried:
Region: us-west-2
Namespace: AWS/EC2
Metric name: CPUUtilization
AutoScalingGroupName: production-web-WebAutoScaleGroup-B43QRLMZOCTJ
But this returns only the average CPU usage across the ASG as a whole.
Is there no way to filter the CPU usage statistics by ASG?
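As a side note on the SEARCH syntax used above: bare terms in a search expression match against dimension values, so the query can at least be narrowed to specific instances. A sketch (the instance ID is hypothetical, and this still doesn't follow ASG membership automatically):
SEARCH('{AWS/EC2,InstanceId} MetricName="CPUUtilization" i-0abc1234567890def', 'Average', 300)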

Related

Reducing AWS CloudWatch agent CPU usage

We have the CloudWatch agent installed on one EC2 instance, and even with 4 cores the agent takes up 24% of total CPU time. Is there a way to configure it to be less of a CPU strain? Perhaps drop the sample rate or have it idle for periods?
While the documentation mentions a cron job, I can find no information on how to set up a scheduled task that runs the agent intermittently. For example, it would be nice to have it fire up once every 5 minutes, send results to the cloud, then shut down - perhaps with a PowerShell task.
I managed to limit the CPU usage from 15-20% to 0-0.2% by:
Removing old logs from the folder - there were around 500 MB of logs and the agent was processing all of them
Updating to the latest version
I reduced CPU usage significantly by removing use of the ** "super asterisk" wildcard in the monitored log file paths.
Also, regarding the collection interval, there is a setting in the agent's config file (the default is 60 seconds):
"metrics_collection_interval": 60
AWS Docs
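For context, a minimal sketch of where that setting lives in the agent config file (amazon-cloudwatch-agent.json); the 300-second value is just an example, and the interval can also be overridden per collected metric:
{
  "agent": {
    "metrics_collection_interval": 300
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_active"],
        "metrics_collection_interval": 300
      }
    }
  }
}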

How do I write these queries in Prometheus?

I'm trying to set up some graphs using Prometheus with Grafana in a home lab, on a single-node Kubernetes cluster deployed with minikube. I also have some stress tests to run on the cluster. I want to measure the results of the stress tests using Prometheus, so I need help with the following queries:
CPU usage of the node/cluster and of an individual pod by given name, over a period of time (e.g. 5 min).
Memory usage of the node/cluster and of an individual pod by given name, over a period of time (e.g. 5 min).
Disk or file system usage of the node/cluster and of an individual pod by given name, over a period of time (e.g. 5 min).
Latency of an individual pod by given name, over a period of time (e.g. 5 min).
If anyone can help with that, knows a Grafana dashboard for it (I've already tried 737 and 6417), or can give a hint about which metrics I need to query, I'd appreciate it (I've tried rate(container_cpu_usage_seconds_total[5m]), and this gives me some sort of result for the CPU usage of the whole node).
You can use Prometheus's labels to get metrics for a specific pod:
CPU (you don't have to provide all of the labels; any one of them is enough if it's unique):
sum(rate(container_cpu_usage_seconds_total{pod=~"<your_pod_name>", container=~"<your_container_name>", kubernetes_io_hostname=~"<your_node_name>"}[5m])) by (pod,kubernetes_io_hostname)
Memory:
sum(container_memory_working_set_bytes{pod=~"<your_pod_name>", container=~"<your_container_name>", kubernetes_io_hostname=~"<your_node_name>"}) by (pod,kubernetes_io_hostname)
Disk:
kubelet_volume_stats_used_bytes{kubernetes_io_hostname=~"<your_node_name>$", persistentvolumeclaim=~".*<your_pod_name>"}
Latency:
Latency you would need to collect in your application (e.g., the web server) via a Prometheus client library (application level).
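For the node/cluster-level and latency parts of the question, two sketches. The first assumes node_exporter is running; the second assumes your application exports a request-duration histogram (http_request_duration_seconds is a common naming convention, not a given):
Node CPU utilization (fraction of non-idle CPU time over 5 minutes):
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) / sum(rate(node_cpu_seconds_total[5m]))
95th-percentile request latency for a pod:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{pod=~"<your_pod_name>"}[5m])) by (le))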

Starting and stopping container instances in-between task definitions on ECS

To reduce costs I would like to stop and start a container instance in a cluster in between tasks. The tasks only run every now and again, so it doesn't seem efficient to keep an EC2 instance running in between.
What is the best way to allow this?
I have looked into Lambda functions triggered by a CloudWatch schedule, and have also thought about autoscaling.
Amazon doesn't make this incredibly straightforward (though they're trying to with Fargate). The best solution for now (if you're not in a region where Fargate is an option) is to keep the desired task count of your service in line with the desired instance count of your Auto Scaling group.
The way we have it set up is through Lambda, triggered by Auto Scaling events (pretty easy to set up). The least trivial part is the Lambda script, though it's not incredibly difficult. Add tags to your ASG that identify which cluster / service it's associated with. When a scaling event triggers your script, have the script describe the ASG that triggered it, look up the cluster / service in the tags, and update the desired count of that service:
import boto3

client_asg = boto3.client('autoscaling')
client_ecs = boto3.client('ecs')

# asgName comes from the Auto Scaling event that triggered the Lambda
paginator_describe_asg = client_asg.get_paginator('describe_auto_scaling_groups')
asgDetail = paginator_describe_asg.paginate(
    AutoScalingGroupNames=[
        asgName,
    ]
)

# set service desired count equal to ASG desired capacity
newDesiredCount = next(iter(asgDetail))['AutoScalingGroups'][0]['DesiredCapacity']
response = client_ecs.update_service(
    cluster='<ecs cluster>',    # read from the ASG tags
    service='<ecs service>',    # read from the ASG tags
    desiredCount=newDesiredCount
)
The reason you shouldn't rely on CloudWatch for this is because it doesn't do a great job at granular scaling. What I mean is, the CPU that CloudWatch monitors on your ASG is the overall group average (I think). So the scenario we ran into was as follows:
CloudWatch detects hosts are at 90%, desired is 70%
CloudWatch launches 4 hosts
Service detects tasks are at 85%, desired is 70%
Service launches new task
Service detects tasks are at 80%, desired is 70%
Service launches new task
Service detects tasks are at 75%, desired is 70%
Service launches new task
Service detects tasks are at 70%, no action
While this is a trivial example, it's easy to see how the number of instances gets out of sync with the number of tasks actually running (i.e., you may end up with a host sitting idle because ECS doesn't detect that it needs more capacity).
Could we just scale up 3 hosts? Sure, but ECS might still only place 2 tasks (depending on how the usage is per task). Could we scale one host at a time? Sure, but then it's pretty difficult to account for bursts.
All that to say, the best solution I can recommend for now is to have a Lambda script help keep your ASG instance count == your ECS service desired task count.
I have decided to create a Lambda function that starts the instance; on container instance start, a task is run. Then I have a CloudWatch event watching for the task changing status to STOPPED, which triggers another Lambda that stops the instance.
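For reference, the CloudWatch event rule for that second Lambda can match ECS task state changes directly; a sketch of the event pattern (the cluster ARN is a placeholder):
{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "lastStatus": ["STOPPED"],
    "clusterArn": ["<your cluster arn>"]
  }
}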

How to detect if AWS has sufficient capacity to launch r4 instances

We keep running into this insufficient-capacity error way too often when we launch our CloudFormation stack to create test environments. I've not found a way to proactively check whether AWS can provide the capacity to launch the number of instances of the specific type that we want. Is there really no way to check early, instead of waiting each time for CloudFormation to fail and resorting to r3 instance types (besides adding more Availability Zones, i.e. not specifying them)?
Error:
2018-01-15T03:45:24.202064 2018-01-15 03:44:34.868000+00:00 | CREATE_FAILED | AWS::EC2::Instance | SlaveNode1 | We currently do not have sufficient r4.xlarge capacity in the Availability Zone you requested (ap-northeast-1c). Our system will be working on provisioning additional capacity. You can currently get r4.xlarge capacity by not specifying an Availability Zone in your request or choosing , ap-northeast-1a.
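Taking the error message at its word, one mitigation is to stop pinning the instance to a zone in the template, so EC2 can place it in whichever zone currently has r4 capacity. A minimal sketch, with a placeholder AMI ID:
SlaveNode1:
  Type: AWS::EC2::Instance
  Properties:
    InstanceType: r4.xlarge
    ImageId: ami-xxxxxxxx    # placeholder
    # No AvailabilityZone property, and no SubnetId tied to a single-AZ subnet,
    # so EC2 chooses a zone that has capacity.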

What is wrong with my Amazon Auto Scaling?

I have an EC2 instance that I want to scale based on the number of messages in an SQS queue. If there are many messages (for 5 minutes) I want to spin up a new EC2 instance to consume the messages faster. Then if the messages are few (for 5 minutes), I want to terminate the oldest EC2 instance. This way, if the service that consumes the messages stops for some reason, the oldest EC2 instance will be terminated and the service will keep running.
I have created an Auto Scaling group for this. I have set the TerminationPolicy to OldestInstance, and it works as I expect only if I set just one zone (e.g. eu-west-1a): it creates a new instance and terminates the oldest each time. But if I have three zones (eu-west-1a, eu-west-1b, eu-west-1c), it launches and terminates instances, but not in the OldestInstance manner. Or at least not as I expect: terminating the oldest every time. Is this somehow linked to having multiple zones? On this page I have not found anything about it, except for the default policy.
And even if the multiple-zone case from the default policy applies, I can have at most two instances running at the same time. And they are always launched in a new zone.
This is probably the key paragraph:
When you customize the termination policy, Auto Scaling first assesses the Availability Zones for any imbalance. If an Availability Zone has more instances than the other Availability Zones that are used by the group, then Auto Scaling applies your specified termination policy on the instances from the imbalanced Availability Zone. If the Availability Zones used by the group are balanced, then Auto Scaling selects an Availability Zone at random and applies the termination policy that you specified.
I interpret this to mean that if you have instances in multiple zones, and those zones are already balanced, then AWS will select a zone at random AND THEN pick the oldest instance within the randomly selected zone - it won't pick the oldest instance across AZs; it picks a random AZ, and then the oldest instance within that AZ is terminated.
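If strict oldest-first termination matters more than zone balance, one workaround consistent with the single-zone behavior described in the question is to pin the group to one Availability Zone (or run one group per zone). A sketch with a placeholder group name - note that removing zones from a live group will itself terminate instances in the removed zones:
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <your-asg-name> \
  --availability-zones eu-west-1a \
  --termination-policies OldestInstance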
