Amazon EC2 AutoScaling CPUUtilization Alarm - INSUFFICIENT DATA

So I've been using Boto in Python to try and configure autoscaling based on CPUUtilization, more or less exactly as specified in this example:
http://boto.readthedocs.org/en/latest/autoscale_tut.html
However both alarms in CloudWatch just report:
State Details: State changed to 'INSUFFICIENT_DATA' at 2012/11/12
16:30 UTC. Reason: Unchecked: Initial alarm creation
Auto scaling is working fine but the alarms aren't picking up any CPUUtilization data at all. Any ideas for things I can try?
Edit: The instance itself reports CPU utilisation data, just not when I try to create an alarm in CloudWatch, whether programmatically in Python or in the interface. Detailed monitoring is also enabled, just in case...
Thanks!

The official answer from AWS goes like this:
Hi, There is an inherent delay in transitioning into INSUFFICIENT_DATA
state (only) as alarms wait for a period of time to compensate for
metric generation latency. For an alarm with a 60 second period, the
delay before transition into I_D state will be between 5 and 10
minutes.
John.
Apparently this is a temporary state and will likely resolve itself.

I am not sure what's going on in the backend, but if you compare the alarm history you will see that AWS removes the 'unit' field when you modify and save the alarm without changing anything, as at7000ft said. So remove the unit from your script.

Make sure that the alarm's Namespace is 'AWS/EC2'.
I know this is a long time after the original question, but in case others find this via Google, I had the same problem, and it turned out I set alarm's Namespace improperly.

You need to publish data with the same unit that was used to create the alarm. If you didn't specify one, the unit will be <None>.
The unit can be specified in aws cloudwatch put-metric-data and aws cloudwatch put-metric-alarm with --unit <value>
Unit <value> can be:
Seconds
Bytes
Bits
Percent
Count
Bytes/Second (bytes per second)
Bits/Second (bits per second)
Count/Second (counts per second)
None (default when no unit is specified)
Units are also case-sensitive, so be careful about that in your scripts.
For CPUUtilization, you can use Percent.
After the first data-set is sent to your alarm (it can take up to 5 minutes for a non-detailed monitored instance), the alarm will switch to the OK or ALARM state instead of the INSUFFICIENT_DATA one.
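To make the unit point concrete, here is a minimal sketch in plain Python. It is a simplified model of the matching behaviour described in this thread, not CloudWatch's actual implementation, and the function name `alarm_sees_datapoints` is made up for illustration.

```python
from typing import List, Optional


def alarm_sees_datapoints(alarm_unit: Optional[str], datapoint_units: List[str]) -> bool:
    """Simplified model: does the alarm match any published datapoint?

    Assumption: an alarm created without a unit matches whatever unit the
    data was published with; an alarm created with a unit only matches
    datapoints published with exactly that unit (case-sensitive).
    """
    if alarm_unit is None:
        # No unit on the alarm: any published datapoint counts.
        return len(datapoint_units) > 0
    # Exact, case-sensitive comparison, so "percent" != "Percent".
    return any(unit == alarm_unit for unit in datapoint_units)


if __name__ == "__main__":
    print(alarm_sees_datapoints("Percent", ["Percent"]))  # matching unit
    print(alarm_sees_datapoints("percent", ["Percent"]))  # wrong case
    print(alarm_sees_datapoints(None, ["Percent"]))       # unit omitted
```

Under this model an alarm with the wrong unit never sees data, which fits the alarm staying in INSUFFICIENT_DATA.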

I am having the same INSUFFICIENT_DATA alarm state show up in CloudWatch for an RDS CPUUtilization > 60 alarm created with CloudFormation ("Reason: Unchecked: Initial alarm creation" shows up under details). This is a very crude fix, but I found that by selecting the alarm, clicking the Modify button, and then the Save button (without changing anything), the alarm goes to the OK state and everything is fine.

I had this problem. Make sure the metric name you use to create the alarm matches the actual metric name.
You can list your metrics with:
aws cloudwatch list-metrics --namespace=<NAMESPACE, e.g. System/Linux, etc>
Find the metric and the MetricName. Make sure your alarm is configured for that metric.
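If you save the JSON output of that command, the comparison can be scripted. The sketch below assumes the usual list-metrics output shape (a top-level "Metrics" list whose entries carry "Namespace" and "MetricName"); the sample document and the instance ID in it are invented for illustration, and `metric_exists` is a hypothetical helper name.

```python
import json


def metric_exists(list_metrics_json: str, namespace: str, metric_name: str) -> bool:
    """Return True if the namespace/metric name pair appears in list-metrics output."""
    metrics = json.loads(list_metrics_json).get("Metrics", [])
    # Both fields are compared exactly; a wrong case or namespace will not match.
    return any(
        m.get("Namespace") == namespace and m.get("MetricName") == metric_name
        for m in metrics
    )


# Hypothetical sample of what `aws cloudwatch list-metrics` might return.
SAMPLE = json.dumps({
    "Metrics": [
        {
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        }
    ]
})

if __name__ == "__main__":
    print(metric_exists(SAMPLE, "AWS/EC2", "CPUUtilization"))
    print(metric_exists(SAMPLE, "EC2", "CPUUtilization"))
```

The same check catches a wrong Namespace as well as a wrong metric name.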

As far as I know, default metric resolution is 5 minutes (which can be lowered to 1 minute if you pay up, or something like that), so if your alarm's measurement period is lower than that, then it'll remain permanently in an INSUFFICIENT_DATA state. In my case, I had a 1 minute measurement period on CPU utilization, and changing it to 5 minutes has fixed the state issue.
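As a quick sanity check of that rule of thumb, something like the following could be run before creating the alarm. It only encodes the claim made in this answer (alarm period should not be shorter than the interval at which the metric is published); the function name is made up.

```python
def period_is_compatible(alarm_period_s: int, metric_interval_s: int) -> bool:
    """True if the alarm period is at least as long as the metric's publishing interval."""
    return alarm_period_s >= metric_interval_s


if __name__ == "__main__":
    # 1-minute alarm on a 5-minute (basic monitoring) metric: the case described above.
    print(period_is_compatible(60, 300))
    # 5-minute alarm on a 5-minute metric: the fix described above.
    print(period_is_compatible(300, 300))
```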

I had a similar problem: my alarm was constantly in INSUFFICIENT_DATA status although I could see the metric in the GUI.
It turned out that this happened because I specified the wrong Unit for the metric when I created the alarm. No error was reported back, but it never became GREEN.
It is better not to specify the unit if you are not sure; AWS will do the correct match in the background.

There is a directory /var/tmp/aws-mon/ that contains a couple files. One is instance-id. The instance I was on was created from an AMI and this file retained the old instance id. I just edited it and made sure /var/tmp/aws-mon/placement/availability-zone was also correct. The alarms changed to OK almost instantly.

Also ran into this problem, but for a different reason: I passed the ES cluster ARN instead of the domain name in my CloudFormation template. It was pretty frustrating.

Related

Scheduling a task periodically in an android app using WorkManager(Coroutine Worker)

I'm trying to run a task periodically every 12 hours. I've used WorkManager and followed its documentation. The task runs periodically as long as the user is actually using the app.
But if I close the app, or even just let my phone sit idle for a while, it seems like the task stops running.
I searched for this problem on Google and came across posts such as this post, which I think explains the behaviour on my phone.
From my understanding, this problem is not related to WorkManager but to all frameworks that try to run background tasks?
Is there a way to still run tasks periodically on most devices, or is WorkManager still the way to go?
Thanks!
Please check this:
https://developer.android.com/topic/libraries/architecture/workmanager/how-to/debugging#use-alb-shell0dumpsys-jobscheduler
Required constraints: TIMING_DELAY CONNECTIVITY [0x90000000]
Satisfied constraints: DEVICE_NOT_DOZING BACKGROUND_NOT_RESTRICTED WITHIN_QUOTA [0x3400000]
Unsatisfied constraints: TIMING_DELAY CONNECTIVITY [0x90000000]
Minimum latency: +1h29m59s687ms
Run time: earliest=+38m29s834ms, latest=none, original latest=none
The periodic work is not exactly Periodic. You have different Constraints:
Explicit - you set them.
Implicit - set by the system related to Battery optimization. You should save the battery and also Internet usage.
When you have a "periodic work" you actually have an explicit constraint called:
TIMING_DELAY. But when that time has passed, it does not mean that the work will start. It means that this constraint is satisfied, and only if all the other constraints are also satisfied will the work start.
And for example, you have your work with a "period" of 12 hours, but you wait an extra 4 hours for the other Constraints - you will have a period of 16 hours.
And after the work is finished - WorkManager will create a completely new job in the JobScheduler with TIMING_DELAY again - 12 hours. It will not account for the extra 4 hours. So you can't imagine something like:
I have 5 days so it means - 10 executions. It might be only 4 or 5 executions.
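The arithmetic can be sketched in a few lines of Python (just a back-of-the-envelope model, not WorkManager code). It assumes the work itself takes no time, that every run waits the same extra time for the other constraints, and that the first run also waits one full period; `count_executions` is a made-up name.

```python
def count_executions(window_h: float, period_h: float, extra_wait_h: float) -> int:
    """Count runs that fit in the window when each run starts
    period_h + extra_wait_h after the previous one finished."""
    elapsed = 0.0
    runs = 0
    while True:
        # The next run becomes possible only after the timing delay
        # plus the time spent waiting for the remaining constraints.
        elapsed += period_h + extra_wait_h
        if elapsed > window_h:
            return runs
        runs += 1


if __name__ == "__main__":
    five_days = 5 * 24
    print(count_executions(five_days, 12, 0))   # ideal case: 10
    print(count_executions(five_days, 12, 4))   # 16-hour effective period: 7
    print(count_executions(five_days, 12, 12))  # long waits: 5
```

So the count drops as soon as the other constraints add any waiting time, because the lost hours are never made up.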
You can improve this by asking the user to exempt you from battery optimization:
https://developer.android.com/training/monitoring-device-state/doze-standby#support_for_other_use_cases
If you need to be really exact - you need to use AlarmManager, but mostly the idea of all of this is for the battery to be saved so it is not only about what the devs need, but also what the user needs.

Nifi DetectDuplicate Not Detecting Duplicates

I am using the DetectDuplicate processor within a flow but am seeing some confusing behavior. The processor is configured as follows:
Cache Entry Identifier: ${rk.id}
FlowFile Description: Empty string set
Age Off Duration: 10s
Distributed Cache Service: DistributedMapCacheClientService
Cache The Entry Identifier: true
The "duplicate" relationship is automatically terminated. Concurrency is set to 1.
However, I'm seeing multiple copies of flowfiles on the output queue with the same rk.id that were run through the processor less than 2 seconds apart. How is this possible? I even tried increasing the age off to 5m and it made no difference. I also tried setting the processor to only run every 500ms, thinking there may be some delay in writing to the cache, and 2 flowfiles that were processed 1s apart with the same rk.id showed up in the output queue. What am I missing?
I think I figured this out. It looks like the cache was full and not accepting new values? Because we had a lot less traffic this morning and it seems to have properly run the deduplication.

Why are there holes in my cloudwatch logs?

I have been running lambdas using C# with serverless.com framework for some months now, and I consistently notice holes in the cloudwatch logs. So far it has only been an annoyance. I have been looking around for some explanation, but it is starting to get to the point where I need to understand/fix the problem.
For instance, today I can see the lambda monitor shows hundreds to thousands of executions between 7AM and 8AM, but the cloudwatch logs show logfiles up until 7:19AM and then nothing again until 8:52AM.
What is going on here?
Logs are by invocation of the lambda, and log group links are by concurrent executions. If you look at your lambda metrics, you will see a stat called ConcurrentExecutions. This is the total number of simultaneous serverless lambda containers you have running at any given moment, but that is NOT the same as Invocations. The headless project I'm on is doing about 5k invocations an hour and we've never been above 5 concurrent executions of any of our 25ish lambdas (it helps that, after start up, they all run in about 300ms).
So if you have 100 invocations in 10 seconds, but they all take less than a second to run, once a given lambda container is spun up it will be reused as long as it is continually receiving events. This is how AWS works around the 'cold start' problem as much as possible where a given lambda may take 10-15 or more seconds to start up. By trying to predict traffic flow (and you can manipulate these settings as well) AWS is attempting to have a warm lambda ready to go for you whenever you need it.
These concurrent executions are slowly shut down as their volume drops off, their calls brought back in to other ones that are still active.
What this means for Log Group logs is two fold:
you may see large 'gaps' in the times but if you look closely any given log group will have multiple invocations in it.
log groups are delayed by several seconds to several minutes depending on the server load, so at any given time you may not actually be seeing all the logs of a given moment.
The other possibility is that your logging is not set up correctly (Python lambdas in particular have difficulty logging properly to CloudWatch: the default logging handler doesn't play nice with the way lambda boots up a handler to attach it to the logGroup), or that what you are getting is a ton of hits that are not actually doing anything, only pings/keep-alive events that do not trigger any of your log statements. In that case you will generally only see the concurrent start up/shutdown log statements (as stated above, they are far fewer).
What do you mean with gaps in log groups?
A log group gets its logs from log streams, and invocations handled by the same lambda container write to the same log stream. So the most recent log stream in your log group may not be the one that holds the latest log entry.
Here you can read more about it:
https://dashbird.io/blog/how-to-save-hundreds-hours-debugging-lambda/
While trying to edit my question with screenshots and tallies of the data, I came upon the answer. I thought it would be helpful for this to be a separate answer as it is extremely specific and enlightening.
The crux of the problem is that I didn't expect such huge gaps between invocation times and log write times. 12 minutes is an eternity compared to the work I have done in the past.
Consider this graph:
12:59 UTC should be 7:59AM CST. Counting the invocations between 12:59 and 13:08, I get roughly ~110.
Cloudwatch shows these log streams:
Looking at these log streams, there seems to be a large gap. The timestamp on the log stream is the "file close" time. The logstream for 8:08:37 includes events from 12 minutes before.
So the timestamps on the log streams are not very useful for finding debug data. The search all has not been very helpful up until now either. Slow and very limited. I will look into some other method for crunching logs.
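One way to work around this is to ignore the stream timestamps and filter on the timestamps of the events themselves. The sketch below does that over already-downloaded events in plain Python; the stream names, dates, and messages are invented sample data, and `events_between` is a hypothetical helper rather than a CloudWatch API.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Event = Tuple[datetime, str]  # (event timestamp, message)


def events_between(
    streams: Dict[str, List[Event]], start: datetime, end: datetime
) -> List[Tuple[str, datetime, str]]:
    """Collect events from all streams whose own timestamp falls in [start, end]."""
    hits = [
        (name, ts, msg)
        for name, events in streams.items()
        for ts, msg in events
        if start <= ts <= end
    ]
    # Sort by event time so the result reads chronologically across streams.
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical data: a stream stamped 08:08:37 that holds an event from 07:56.
STREAMS = {
    "stream-closed-08:08:37": [
        (datetime(2020, 1, 1, 7, 56, 30), "invocation A"),
        (datetime(2020, 1, 1, 8, 8, 30), "invocation B"),
    ],
    "stream-closed-07:19:00": [
        (datetime(2020, 1, 1, 7, 18, 0), "invocation C"),
    ],
}

if __name__ == "__main__":
    window = events_between(
        STREAMS, datetime(2020, 1, 1, 7, 50), datetime(2020, 1, 1, 8, 0)
    )
    for name, ts, msg in window:
        print(name, ts.isoformat(), msg)
```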

What can cause a Cloud Run instance to not be reused despite continuous load?

Context:
My Spring-Boot app runs as expected on Cloud Run when I deploy it with max-instances set to 1: It receives a constant stream of pubsub messages via push, and makes anywhere from 0 to 5 writes to an associated CloudSQL instance, depending on the message payload. Typically it handles between 20 and 40 messages per second. Latency/response-time varies between 50ms and 60sec, probably due to some resource contention.
In order to increase throughput/ decrease resource contention, I'm looking to experiment with the connection pool size per app-instance, as well as the concurrency and max-instances parameters for my cloud run app.
I understand that due to Spring-Boot, my app has a relatively high cold-start time of about 30-40 seconds. This is acceptable for how this service is used.
Problem:
I'm experiencing problems when deploying a spring-boot app to cloud run with max-instances set to a value greater than 1:
Instances start, handle a single request successfully, and then produce no more logs.
This happens a few times per minute, leading me to believe that instances get started (cold-start), handle a single request, die, and then get started again. They are not being reused as described in the docs, and as is happening when I set max-instances to 1. Official docs on concurrency
Instead, I expect 3 container instances to be started, which then each handle requests according to the max-concurrency setting.
Billable container time at max-instances=3:
As shown in the graph, the number of instances is fluctuating wildly, once the new revision with max-instances=3 is deployed.
The graphs for CPU- and memory-usage also look like this.
There are no error logs. As before at max-instances=1, there are warnings indicating that there are not enough instances available to handle requests (HTTP 429).
Connection Limit of CloudSQL instance has not been exceeded
Requests are handled at less than 10/s
Finally, this is the command used to deploy:
gcloud beta run deploy my-service --project=[...] --image=[...] --add-cloudsql-instances=[...] --region=[...] --platform=managed --memory=1Gi --max-instances=3 --concurrency=3 --no-allow-unauthenticated
What could cause this behavior?
Some months ago, in private Alpha, I performed tests and observed the same behavior. After discussing it with the Google team, I understood that instances are over-provisioned "just in case": an instance crashes, an instance is preempted, the traffic suddenly increases, ...
The trade-off is that you will have more cold starts than your max-instances value. Worse, you will be charged for these over-provisioned cold starts. In practice this is not an issue, because Cloud Run has a huge free tier that covers this kind of glitch.
Going deeper into the logs (you can do this by creating a sink of Cloud Run logs into BigQuery and then querying them), even if more instances are up than your max-instances value, only max-instances of them are active at the same time. I'm not sure that is clear, so with your parameters: if you have 5 instances up at the same time, only 3 serve traffic at any given point in time.
This part is not documented because it evolves constantly to find the best balance between over-provisioning and lack of resources (and 429 errors).
#Steren #AhmetB can you confirm or correct me?
When Cloud Run receives and processes requests rapidly, it predicts how many instances it needs, and will try to scale to the amount. If a sudden burst of requests occur, Cloud Run will instantiate a larger number of instances as a response. This is done in order to adapt to a possible higher number of network requests beyond what it is currently serving, with attempts to take into consideration the length of time it will take for the existing instance to complete loading the request. Per the documentation, it is possible that the amount of container instances can go above the max instance value when it spikes.
You mentioned with max-instances set to 1 it was running fine, but later you mentioned it was in fact producing 429s with it set to 1 as well. Seeing behavior of 429s as well as the instances spiking could indicate that the amount of traffic is not being handled fluidly.
It is also worth noting that, because of the cold start time you mention, when instances are serving the first request(s), by design, the number of concurrent requests is actually hard set to 1. Only once things are fully ready is the concurrency setting you have chosen applied.
Was there some specific reason you chose 3 and 3 for Max Instance settings and concurrency? Also how was the concurrency set when you had max instance set to 1? Perhaps you could try tinkering up further the concurrency (max 80) and /or Max instances (high limit up to 1000) and see if that removes the 429s.
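To get a rough feel for whether 3 and 3 can keep up, one can compare the number of concurrent request slots with an estimate of the requests in flight. This is only a back-of-the-envelope sketch: the rate times latency estimate is my own assumption (Little's law), not something Cloud Run documents, and the 2-second average latency below is a made-up figure, since the question only gives a range of 50ms to 60sec.

```python
def concurrent_capacity(max_instances: int, concurrency: int) -> int:
    """Total request slots when every instance is warm and fully used."""
    return max_instances * concurrency


def estimated_in_flight(requests_per_s: float, avg_latency_s: float) -> float:
    """Rough number of requests being processed at once (rate x latency)."""
    return requests_per_s * avg_latency_s


if __name__ == "__main__":
    slots = concurrent_capacity(3, 3)         # settings from the deploy command
    in_flight = estimated_in_flight(10, 2.0)  # 10 req/s, hypothetical 2 s average
    print(slots, in_flight, in_flight > slots)
```

If the estimate exceeds the available slots, some requests have nowhere to go, which would be consistent with the 429 warnings.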

Why is Cacti showing an empty graph, even though the rrd file is created?

I have developed my own SNMP service, and i want to plot a graph of an OID provided.
So, i have created a graph in Cacti.
-) It is showing device up.
-) It is creating rrd file. (RRDTool says OK).
-) Showing the graph, but it's empty.
But when I check it, say
rrdtool fetch <rrd file> AVERAGE
it shows me nan for all the values. The monitored OID has value 47 and i have set min=0 and max=100.
I am using Cacti appliance by rpath:
http://www.rpath.org/ui/#/appliances?id=http://www.rpath.org/api/products/cacti-appliance
Still, I can't show value on graph..
Where is the problem? Can anyone please tell me?
First of all, use Cacti's "Rebuild Poller Cache" function under the Utilities menu.
If that didn't work, check if the RRD file is actually updating with new data.
To do this use the command:
rrdtool last [filename.rrd]
This will output the last time (in unix timestamp) that a new value has been inserted into the RRA file which you can compare to the current time that date +%s will output.
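The comparison itself is simple subtraction. Here is a small sketch that takes the timestamp printed by rrdtool last and tells you how old it is; the "more than two steps old means stale" threshold is my own assumption, and both function names are made up.

```python
import time
from typing import Optional


def seconds_since_last_update(last_update_ts: int, now_ts: Optional[int] = None) -> int:
    """Age of the last RRD update, given the value printed by `rrdtool last`."""
    if now_ts is None:
        # Same value that `date +%s` would print.
        now_ts = int(time.time())
    return now_ts - last_update_ts


def looks_stale(last_update_ts: int, step_s: int, now_ts: Optional[int] = None) -> bool:
    """Assumption: an RRD that missed more than two steps is not being updated."""
    return seconds_since_last_update(last_update_ts, now_ts) > 2 * step_s


if __name__ == "__main__":
    # Hypothetical values: last update at t=1000, current time t=1300.
    print(seconds_since_last_update(1000, 1300))
    print(looks_stale(1000, 300, 1300))
    print(looks_stale(1000, 60, 1300))
```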
If it's not updating with data then you should change the cacti log level to DEBUG via the settings page on Cacti's web UI and look for appropriate messages.
If the poller couldn't get the data then it's usually an issue relating to connectivity/SNMP.
You can further check issues as such by manually polling the specific OID on that host:
snmpwalk -c[SNMP COMMUNITY] -v2c [HOSTNAME OR IP ADDRESS] 1.3.6.1.2.1
You can use the above command and OID (1.3.6.1.2.1) just to see if you're getting a reply.
If that worked then you should change the command from snmpwalk to snmpget and the OID to the actual OID you're trying to poll and retry.
If the RRD is updating with new data but you're still getting NaN in your graphs then I suggest looking into the heartbeat and step values of the data source (via the data template) in relation to your polling interval and poller cronjob interval.
These values determine how many times the RRD file will miss data before inserting a NaN.
The cronjob calls the Cacti poller to start performing its polling cycle.
The poller interval is the actual time that the poller will wait between two polling cycles if it was indeed invoked in time by the cronjob.
So for 1 minute polling (on the poller and the cronjob) you will have to use a step of 60 (seconds) and a heartbeat of 120.
For 5 minutes polling, the step will be 300 and the heartbeat will be 600.
This is mainly caused by someone changing the poller interval on the settings page.
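Both examples above follow the same pattern: step equals the polling interval in seconds, and heartbeat is twice the step. A tiny sketch of that pattern, with the factor of two inferred from those two examples rather than taken from Cacti itself:

```python
from typing import Tuple


def rrd_timing(polling_interval_min: int) -> Tuple[int, int]:
    """Return (step, heartbeat) in seconds for a given polling interval in minutes."""
    step = polling_interval_min * 60
    # Heartbeat of two steps, as in the 1-minute and 5-minute examples above.
    heartbeat = 2 * step
    return step, heartbeat


if __name__ == "__main__":
    print(rrd_timing(1))  # (60, 120)
    print(rrd_timing(5))  # (300, 600)
```

If the values in your data template don't line up with the poller and cronjob intervals in this way, that mismatch is a likely source of the NaN values.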
Gandalf from the Cacti forums wrote a nice Guide that you can use and further help can be found on Cacti forums.
Good luck! :)
Maybe cacti doesn't have the needed permissions to access the rrd file and your test was done with a user who has the required permissions, for example root?
Are you sure you have collected enough data?
If your RRD has a step of 1 minute, and your first RRA has a consolidated count of 1 (1cdp=1pdp), then you should collect data for at least (step x ( count + 1 )) seconds before you expect to see any data in the graph. Make sure you are collecting data at least as often as the step size.
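As a worked example of that formula, here is a minimal sketch; `min_collection_seconds` is just an illustrative name.

```python
def min_collection_seconds(step_s: int, pdp_per_cdp: int) -> int:
    """Minimum time to collect before the first consolidated point can appear:
    step x (count + 1), as described above."""
    return step_s * (pdp_per_cdp + 1)


if __name__ == "__main__":
    # 1-minute step, first RRA consolidating 1 primary data point per CDP.
    print(min_collection_seconds(60, 1))   # 120
    # 5-minute step with the same RRA.
    print(min_collection_seconds(300, 1))  # 600
```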
If you collect data for 10 min and nothing shows up, then make sure you are actually collecting the data, make sure the values you get are within range, and that they are being used. Check the last modification time on the RRD file. Print out the values before you update to verify they are what you think they are.
You should double check the range Cacti is plotting in. I moved the values in the graph filter and spotted a little chunk of data in the graphs, then you just have to adjust it.
