Reducing AWS CloudWatch agent CPU usage - performance

We have the CloudWatch agent installed on one EC2 instance and even with 4 cores the task takes up 24% of total CPU time. Is there a way to configure this to be less of a CPU strain? Perhaps to drop the sample rate or have it idle for periods?
While the documentation mentions a cron job, I can't find any information on how to set up a scheduled task to have the agent work intermittently. For example, it would be nice to fire it up once every 5 minutes, have it send results to the cloud, and then shut it down - perhaps with a PowerShell scheduled task.

I managed to limit the CPU usage from 15-20% to 0-0.2% by:
Removing old logs from the log folder - there were around 500 MB of logs and the agent was processing all of them
Updating to the latest version

I reduced CPU usage significantly by removing use of the ** ("super asterisk") wildcard.
Also, regarding the collection interval: there is a setting in the agent config file to control it (the default is 60 seconds):
"metrics_collection_interval": 60
AWS Docs
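For reference, here is a minimal sketch of where that setting lives in the agent's JSON config file (the 300-second interval and the metrics collected are only illustrative choices, not recommendations):
{
  "agent": {
    "metrics_collection_interval": 300
  },
  "metrics": {
    "metrics_collected": {
      "mem": { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  }
}
A larger interval means the agent samples less often, which reduces how much work it does between sends.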

Related

What can cause a Cloud Run instance to not be reused despite continuous load?

Context:
My Spring-Boot app runs as expected on Cloud Run when I deploy it with max-instances set to 1: It receives a constant stream of pubsub messages via push, and makes anywhere from 0 to 5 writes to an associated CloudSQL instance, depending on the message payload. Typically it handles between 20 and 40 messages per second. Latency/response-time varies between 50ms and 60sec, probably due to some resource contention.
In order to increase throughput and decrease resource contention, I'm looking to experiment with the connection pool size per app instance, as well as the concurrency and max-instances parameters for my Cloud Run app.
I understand that due to Spring-Boot, my app has a relatively high cold-start time of about 30-40 seconds. This is acceptable for how this service is used.
Problem:
I'm experiencing problems when deploying the Spring-Boot app to Cloud Run with max-instances set to a value greater than 1:
Instances start, handle a single request successfully, and then produce no more logs.
This happens a few times per minute, leading me to believe that instances get started (cold-start), handle a single request, die, and then get started again. They are not being reused as described in the docs, and as is happening when I set max-instances to 1. Official docs on concurrency
Instead, I expect 3 container instances to be started, each of which then handles requests according to the concurrency setting.
Billable container time at max-instances=3:
As shown in the graph, the number of instances is fluctuating wildly, once the new revision with max-instances=3 is deployed.
The graphs for CPU- and memory-usage also look like this.
There are no error logs. As before at max-instances=1, there are warnings indicating that there are not enough instances available to handle requests (HTTP 429).
Connection Limit of CloudSQL instance has not been exceeded
Requests are handled at less than 10/s
Finally, this is the command used to deploy:
gcloud beta run deploy my-service --project=[...] --image=[...] --add-cloudsql-instances=[...] --region=[...] --platform=managed --memory=1Gi --max-instances=3 --concurrency=3 --no-allow-unauthenticated
What could cause this behavior?
Some months ago, during the private alpha, I ran tests and observed the same behavior. After discussing it with the Google team, I understood that instances are over-provisioned "just in case": an instance crashes, an instance is preempted, the traffic suddenly increases, and so on.
The trade-off is that you will see more cold starts than your max-instances value alone would suggest. Worse, you will be charged for these over-provisioned cold starts; in practice this is not much of an issue, because Cloud Run's large free tier covers this kind of glitch.
Going deeper into the logs (you can do this by creating a sink of the Cloud Run logs into BigQuery and then querying them), you can see that even when more instances are up than your max-instances value, only max-instances of them are active at the same time. To be clear: with your parameters, if 5 instances are up at the same time, only 3 of them serve traffic at any given point in time.
This part is not documented because it evolves constantly to find the best balance between over-provisioning and a lack of resources (and the resulting 429 errors).
@Steren @AhmetB can you confirm or correct me?
When Cloud Run receives and processes requests rapidly, it predicts how many instances it needs and will try to scale to that amount. If a sudden burst of requests occurs, Cloud Run will instantiate a larger number of instances in response. This is done in order to adapt to a possibly higher number of requests beyond what it is currently serving, while trying to take into account how long the existing instances will take to finish loading the requests they already have. Per the documentation, the number of container instances can go above the max-instances value when traffic spikes.
You mentioned that with max-instances set to 1 it was running fine, but later you mentioned it was in fact producing 429s with it set to 1 as well. Seeing 429s as well as the instance count spiking could indicate that the volume of traffic is not being handled smoothly.
It is also worth noting that, because of the cold-start time you mention, while an instance is serving its first request(s) the number of concurrent requests is, by design, hard-set to 1. Only once the instance is fully ready is the concurrency setting you have chosen applied.
Was there a specific reason you chose 3 for both max-instances and concurrency? Also, how was concurrency set when you had max-instances at 1? Perhaps you could try raising the concurrency further (up to 80) and/or max-instances (upper limit of 1000) and see if that removes the 429s.
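As a sketch only (the flag values are illustrative, reusing the flags from the deploy command in the question), a redeploy with higher limits would look like:
gcloud beta run deploy my-service --project=[...] --image=[...] --add-cloudsql-instances=[...] --region=[...] --platform=managed --memory=1Gi --max-instances=10 --concurrency=20 --no-allow-unauthenticated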

Occasional AWS Lambda timeouts, but otherwise sub-second execution

We have an AWS Lambda written in Java that usually completes in about 200 ms. Occasionally, it times out after 5 seconds (our configured timeout value).
I understand that there is occasional added latency due to container setup (though, I'm not clear if that counts against your execution time). I added some debug logging, and it seems like the code just runs slow.
For example, a particularly noticeable log entry shows a call to HttpClients.createDefault usually takes less than 200 ms (based on the fact that the Lambda executes in less than 200 ms), but when the timeout happens, it takes around 2-3 seconds.
2017-09-14 16:31:28 DEBUG Helper:Creating HTTP Client
2017-09-14 16:31:31 DEBUG Helper:Executing request
Unless I'm misunderstanding something, it seems like any latency due to container initialization would have already happened. Am I wrong in assuming that code execution should not have dramatic differences in speed from one execution to the next? Or is this just something we should expect?
Setting up new containers or replacing cold containers takes some time. Both count against your execution time. The time you see in the console is the time you are billed for.
I assume that Amazon doesn't charge for provisioning the container, but they will certainly start the timer as soon as your runtime is started. You are likely paying for the time during which the SDK/JDK gets initialized and loads its classes. They are certainly not charging us for starting the operating system that hosts the containers.
Running a simple Java Lambda twice shows the different times for new and reused instances. In the log output below, the first invocation takes 374.58 ms and the second one 0.89 ms, with billed durations of 400 ms and 100 ms respectively; for the second one the container got reused. While you can try to keep your containers warm, as already pointed out by @dashmug, AWS will occasionally recycle the containers and, as load increases or decreases, spawn new ones. The blogs "How long does AWS Lambda keep your idle functions around before a cold start?" and "How does language, memory and package size affect cold starts of AWS Lambda?" might be worth a look as well. If you include external libraries, your times will increase. As the second blog shows, cold starts for Java with smaller memory allocations can regularly take 2-4 seconds.
Looking at these times, you should probably increase your timeout, and not just look at the log output from your application but also at the START, END and REPORT entries for an actual timeout event. Each running Lambda container instance seems to create its own log stream. Consider keeping your Lambdas warm if they aren't called that often.
05:57:20 START RequestId: bc2e7237-99da-11e7-919d-0bd21baa5a3d Version: $LATEST
05:57:20 Hello from Lambda com.udoheld.aws.lambda.HelloLogSimple.
05:57:20 END RequestId: bc2e7237-99da-11e7-919d-0bd21baa5a3d
05:57:20 REPORT RequestId: bc2e7237-99da-11e7-919d-0bd21baa5a3d Duration: 374.58 ms Billed Duration: 400 ms Memory Size: 128 MB Max Memory Used: 44 MB
05:58:01 START RequestId: d534155b-99da-11e7-8898-2dcaeed855d3 Version: $LATEST
05:58:01 Hello from Lambda com.udoheld.aws.lambda.HelloLogSimple.
05:58:01 END RequestId: d534155b-99da-11e7-8898-2dcaeed855d3
05:58:01 REPORT RequestId: d534155b-99da-11e7-8898-2dcaeed855d3 Duration: 0.89 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 44 MB
Try to keep your function always warm and see if it would make a difference.
If the timeout is really due to container warmup, then keeping it warm will greatly help reduce the frequency of these timeouts. You'd still get cold starts when you deploy changes but at least that's predictable.
https://read.acloud.guru/how-to-keep-your-lambda-functions-warm-9d7e1aa6e2f0
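One way to do that, as a rough sketch (the rule name, region, account id and function name below are placeholders, and the 5-minute rate is just an example), is a scheduled CloudWatch Events rule that pings the function:
aws events put-rule --name keep-warm --schedule-expression "rate(5 minutes)"
aws lambda add-permission --function-name my-function --statement-id keep-warm --action lambda:InvokeFunction --principal events.amazonaws.com --source-arn arn:aws:events:us-east-1:123456789012:rule/keep-warm
aws events put-targets --rule keep-warm --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:my-function"
The handler can detect the warm-up event and return immediately so the pings stay cheap.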
For Java-based applications the warm-up period is longer because, as you know, it's the JVM. Node.js or Python are better in this respect because their warm-up periods are shorter. If switching the tech stack is not an option, simply keep the container warm by triggering it periodically, or increase the memory allocation; that reduces execution time because Lambda allocates more CPU for larger memory sizes.

Heroku clock process: how to ensure jobs weren't skipped?

I'm building a Heroku app that relies on scheduled jobs. We were previously using Heroku Scheduler but clock processes seem more flexible and robust. So now we're using a clock process to enqueue background jobs at specific times/intervals.
Heroku's docs mention that clock dynos, as with all dynos, are restarted at least once per day--and this incurs the risk of a clock process skipping a scheduled job: "Since dynos are restarted at least once a day some logic will need to exist on startup of the clock process to ensure that a job interval wasn’t skipped during the dyno restart." (See https://devcenter.heroku.com/articles/scheduled-jobs-custom-clock-processes)
What are some recommended ways to ensure that scheduled jobs aren't skipped, and to re-enqueue any jobs that were missed?
One possible way is to create a database record whenever a job is run/enqueued, and to check for the presence of expected records at regular intervals within the clock job. The biggest downside to this is that if there's a systemic problem with the clock dyno that causes it to be down for a significant period of time, then I can't do the polling every X hours to ensure that scheduled jobs were successfully run, since that polling happens within the clock dyno.
How have you dealt with the issue of clock dyno resiliency?
Thanks!
You will need to store data about your jobs somewhere. On Heroku, you have no information about, or guarantee of, your code running exactly once and all the time (because of dyno cycling).
You could use a project like this one (though it is not widely used): https://github.com/amitree/delayed_job_recurring
Or, depending on your needs, you could create a scheduler process that schedules jobs for the next 24 hours and runs every 4 hours, to make sure your jobs get scheduled, and hope that the Heroku scheduler runs at least once every 24 hours.
And have at least 2 workers processing the jobs.
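As a rough sketch of the "store data about jobs" idea (the Rails model, columns, and job names here are hypothetical, not from the original post):
# Hypothetical heartbeat table with columns job_name:string and ran_at:datetime
class ClockHeartbeat < ApplicationRecord
end

# Run once when the clock process boots: re-enqueue anything whose last
# recorded run is older than its expected interval.
EXPECTED_INTERVALS = { "daily_report_job" => 24.hours, "hourly_sync_job" => 1.hour }

EXPECTED_INTERVALS.each do |job_name, interval|
  last_run = ClockHeartbeat.where(job_name: job_name).maximum(:ran_at)
  job_name.camelize.constantize.perform_later if last_run.nil? || last_run < interval.ago
end
Each job would also need to write its own heartbeat row when it runs, and you would still need something outside the clock dyno to catch a total outage, as the question points out.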
Though it requires human involvement, we have our scheduled jobs check in with Honeybadger via an after_perform hook in Rails:
# frozen_string_literal: true

class ScheduledJob < ApplicationJob
  after_perform do |job|
    check_in(job)
  end

  private

  def check_in(job)
    token = Rails.application.config_for(:check_ins)[job.class.name.underscore]
    Honeybadger.check_in(token) if token.present?
  end
end
This way, when we happen to have poorly timed restarts from deploys, we at least know that work which should have been scheduled didn't actually happen.
Would be interested to know if someone has a more fully-baked, simple solution!

"Too many fetch-failures" while using Hive

I'm running a Hive query against a Hadoop cluster of 3 nodes and I am getting an error that says "Too many fetch-failures". My Hive query is:
insert overwrite table tablename1 partition(namep)
select id,name,substring(name,5,2) as namep from tablename2;
That's the query I'm trying to run. All I want to do is transfer data from tablename2 to tablename1. Any help is appreciated.
This can be caused by various Hadoop configuration issues. Here are a couple to look for in particular:
DNS issues: examine your /etc/hosts
Not enough HTTP threads on the mapper side for the reducers
Some suggested fixes (from Cloudera troubleshooting):
set mapred.reduce.slowstart.completed.maps = 0.80
tasktracker.http.threads = 80
mapred.reduce.parallel.copies = sqrt(node count), but in any case >= 10
Here is a link to the troubleshooting deck for more details:
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
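If you want to try the job-level settings from a Hive session, the first and third can be set per query (the values here just mirror the suggestions above), while tasktracker.http.threads is a TaskTracker daemon setting that belongs in mapred-site.xml and needs a restart:
set mapred.reduce.slowstart.completed.maps=0.80;
set mapred.reduce.parallel.copies=10;
insert overwrite table tablename1 partition(namep)
select id,name,substring(name,5,2) as namep from tablename2;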
Update for 2020: Things have changed a lot and AWS mostly rules the roost. Here is some troubleshooting guidance for it:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-error-resource-1.html
Too many fetch-failures
The presence of "Too many fetch-failures" or "Error reading task output" error messages in step or task attempt logs indicates the running task is dependent on the output of another task. This often occurs when a reduce task is queued to execute and requires the output of one or more map tasks and the output is not yet available.
There are several reasons the output may not be available:
The prerequisite task is still processing. This is often a map task.
The data may be unavailable due to poor network connectivity if the data is located on a different instance.
If HDFS is used to retrieve the output, there may be an issue with HDFS.
The most common cause of this error is that the previous task is still processing. This is especially likely if the errors are occurring when the reduce tasks are first trying to run. You can check whether this is the case by reviewing the syslog log for the cluster step that is returning the error. If the syslog shows both map and reduce tasks making progress, this indicates that the reduce phase has started while there are map tasks that have not yet completed.
One thing to look for in the logs is a map progress percentage that goes to 100% and then drops back to a lower value. When the map percentage is at 100%, this does not mean that all map tasks are completed. It simply means that Hadoop is executing all the map tasks. If this value drops back below 100%, it means that a map task has failed and, depending on the configuration, Hadoop may try to reschedule the task. If the map percentage stays at 100% in the logs, look at the CloudWatch metrics, specifically RunningMapTasks, to check whether the map task is still processing. You can also find this information using the Hadoop web interface on the master node.
If you are seeing this issue, there are several things you can try:
Instruct the reduce phase to wait longer before starting. You can do this by altering the Hadoop configuration setting mapred.reduce.slowstart.completed.maps to a longer time. For more information, see Create Bootstrap Actions to Install Additional Software.
Match the reducer count to the total reducer capability of the cluster. You do this by adjusting the Hadoop configuration setting mapred.reduce.tasks for the job.
Use a combiner class code to minimize the amount of outputs that need to be fetched.
Check that there are no issues with the Amazon EC2 service that are affecting the network performance of the cluster. You can do this using the Service Health Dashboard.
Review the CPU and memory resources of the instances in your cluster to make sure that your data processing is not overwhelming the resources of your nodes. For more information, see Configure Cluster Hardware and Networking.
Check the version of the Amazon Machine Image (AMI) used in your Amazon EMR cluster. If the version is 2.3.0 through 2.4.4 inclusive, update to a later version. AMI versions in the specified range use a version of Jetty that may fail to deliver output from the map phase. The fetch error occurs when the reducers cannot obtain output from the map phase.
Jetty is an open-source HTTP server that is used for machine-to-machine communication within a Hadoop cluster.

Amazon EC2 AutoScaling CPUUtilization Alarm- INSUFFICIENT DATA

So I've been using Boto in Python to try and configure autoscaling based on CPUUtilization, more or less exactly as specified in this example:
http://boto.readthedocs.org/en/latest/autoscale_tut.html
However both alarms in CloudWatch just report:
State Details: State changed to 'INSUFFICIENT_DATA' at 2012/11/12
16:30 UTC. Reason: Unchecked: Initial alarm creation
Auto scaling is working fine but the alarms aren't picking up any CPUUtilization data at all. Any ideas for things I can try?
Edit: The instance itself reports CPU utilisation data, just not when I try to create an alarm in CloudWatch, whether programmatically in Python or in the interface. Detailed monitoring is also enabled, just in case...
Thanks!
The official answer from AWS goes like this:
Hi, there is an inherent delay in transitioning into the INSUFFICIENT_DATA state (only), as alarms wait for a period of time to compensate for metric-generation latency. For an alarm with a 60-second period, the delay before the transition into the INSUFFICIENT_DATA state will be between 5 and 10 minutes.
John.
Apparently this is a temporary state and will likely resolve itself.
I am not sure what's going on in the backend, but if you compare the alarm history you will see that AWS removes the 'unit' field if you just modify the alarm without any change, as at7000ft said. So remove the unit field from your script.
Make sure that the alarm's Namespace is 'AWS/EC2'.
I know this is a long time after the original question, but in case others find this via Google: I had the same problem, and it turned out I had set the alarm's Namespace improperly.
You need to publish data with the same unit that was used to create the alarm. If you didn't specify one, it will be the <None> unit.
The unit can be specified in aws cloudwatch put-metric-data and aws cloudwatch put-metric-alarm with --unit <value>
Unit <value> can be:
Seconds
Bytes
Bits
Percent
Count
Bytes/Second (bytes per second)
Bits/Second (bits per second)
Count/Second (counts per second)
None (default when no unit is specified)
Units are also case-sensitive, so be careful about that in your scripts.
For CPUUtilization, you can use Percent.
After the first data point is sent to your alarm (it can take up to 5 minutes for an instance without detailed monitoring), the alarm will switch to the OK or ALARM state instead of the INSUFFICIENT_DATA one.
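To tie these points together, here is a minimal boto3 sketch (boto3 rather than the legacy boto from the question; the alarm name, Auto Scaling group name, and threshold are illustrative) that supplies a matching namespace, a Percent unit, and a 5-minute period:
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="scale-up-on-cpu",  # illustrative name
    Namespace="AWS/EC2",          # must match the metric's namespace
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-asg"}],  # hypothetical group
    Statistic="Average",
    Period=300,                   # at least the metric's resolution
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    Unit="Percent",               # case-sensitive; omit Unit entirely if unsure
)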
I am having the same INSUFFICIENT_DATA alarm state show up in CloudWatch for an RDS CPUUtilization > 60 alarm created with CloudFormation ("Reason: Unchecked: Initial alarm creation" shows up under details). This is a very crude fix, but I found that by selecting the alarm, clicking the Modify button, and then the Save button (without changing anything), the alarm goes to the OK state and everything is fine.
I had this problem. Make sure the metric name you use to create the alarm matches the actual metric name.
You can list your metrics with:
aws cloudwatch list-metrics --namespace=<NAMESPACE, e.g. System/Linux, etc>
Find the metric and the MetricName. Make sure your alarm is configured for that metric.
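For instance, to confirm that the standard EC2 CPU metric exists under the name and namespace your alarm expects:
aws cloudwatch list-metrics --namespace AWS/EC2 --metric-name CPUUtilization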
As far as I know, the default metric resolution is 5 minutes (which can be lowered to 1 minute if you pay for detailed monitoring), so if your alarm's measurement period is lower than that, it will remain permanently in an INSUFFICIENT_DATA state. In my case, I had a 1-minute measurement period on CPU utilization, and changing it to 5 minutes fixed the state issue.
I had a similar problem: my alarm was constantly in INSUFFICIENT_DATA status although I could see the metric in the GUI.
It turned out that this happened because I specified the wrong Unit for the metric when I created the alarm. No error was reported back, but the alarm never became green.
It's better not to specify the unit if you are not sure; AWS will do the correct matching in the background.
There is a directory /var/tmp/aws-mon/ that contains a couple of files. One is instance-id. The instance I was on was created from an AMI, and this file retained the old instance id. I just edited it and made sure /var/tmp/aws-mon/placement/availability-zone was also correct. The alarms changed to OK almost instantly.
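For example, you can compare the cached id against what the instance metadata service reports (IMDSv1 endpoint shown for brevity):
cat /var/tmp/aws-mon/instance-id
curl -s http://169.254.169.254/latest/meta-data/instance-id
If the two differ, update the cached file (and the availability-zone file) to match.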
I also ran into this problem, but for a different reason: I passed the ES cluster ARN instead of the domain name in my CloudFormation template. It was pretty frustrating.
