I'm using AWS CloudWatch to monitor memory usage and send an alert if it exceeds 90%. The alarm is configured with an SNS topic to deliver the notification.
Whenever memory usage crosses 90%, I get 2 alert messages for that single event, and I'm not sure why the duplicate mail is being triggered.
Alarm conditions:
Metric name: MemoryUsed
Threshold: MemoryUsed > 14446883635 for 10 datapoints within 15 minutes
Statistic: Maximum
Period: 1 minute
Datapoints to alarm: 10 out of 15
Missing data treatment: Treat missing data as missing
Percentiles with low samples: evaluate
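For reference, a hedged boto3 sketch of an alarm with these settings; the metric namespace, alarm name, and SNS topic ARN are placeholders rather than values from the original configuration.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Recreates the alarm described above: Maximum of MemoryUsed over 1-minute
# periods, alarming when 10 of 15 datapoints exceed the byte threshold.
cloudwatch.put_metric_alarm(
    AlarmName="memory-above-90-percent",        # placeholder
    Namespace="CWAgent",                        # placeholder namespace
    MetricName="MemoryUsed",
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=15,
    DatapointsToAlarm=10,
    Threshold=14446883635,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="missing",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:memory-alerts"],  # placeholder
)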
I have an API that takes a JSON object and forwards it to Azure Event Hub. The API runs on .NET Core 3.1 with Event Hub SDK 3.0, and it also has Application Insights configured to collect dependency telemetry, including Event Hub calls.
Using the following Kusto query in Application Insights, I've found that some calls to Event Hub have really high latency (the highest is 60 seconds; on average they fall around 3-7 seconds).
dependencies
| where timestamp > now()-7d
| where type == "Azure Event Hubs" and duration > 3000
| order by duration desc
It is also worth noting that the query returns 890 results out of 4.6 million Azure Event Hubs dependency results, i.e. roughly 0.02% of calls.
I've checked the Event Hub metrics blade in the Azure Portal: average incoming/outgoing requests (at 1-minute granularity) are well below the throughput unit limit (I have 2 event hubs in one EH namespace, 1 TU, auto-scale to 20 max), at around 50-100 messages per second and around 100 kB of bytes, both incoming and outgoing. There are 0 throttled requests and 1-2 server/user errors from time to time.
There are spikes, but they do not exceed the throughput limit, and the timestamps of the slow dependencies don't match those spikes either.
I also increased the throughput units to 2 manually, and it did not change anything.
My questions are:
Is it normal to have extremely high latency to Event Hub sometimes? Or is it acceptable if it only happens for a small fraction of calls?
Code-wise, I only use 1 EventHubClient instance to send all requests. Is that bad practice, or should I use something else, like a client pool? (See the sketch after these questions.)
A support engineer also told me that, for a timestamp where I see high latency in Application Insights, the Event Hub logs do not show such high latency (322 ms max). Without going into details, is it possible for Application Insights to produce incorrect performance telemetry?
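Since the question doesn't include a code snippet, here is a minimal sketch of the single shared client pattern the second question describes, written in Python with the azure-eventhub package for brevity (the connection string and hub name are placeholders); the same reuse-one-client idea applies to the .NET EventHubClient.

from azure.eventhub import EventData, EventHubProducerClient

# One long-lived client reused for every request, instead of a new client
# (and a new AMQP connection) per call. Values below are placeholders.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUB_NAMESPACE_CONNECTION_STRING>",
    eventhub_name="<EVENT_HUB_NAME>",
)

def forward(json_payload: str) -> None:
    # Each call reuses the same client and its underlying connection.
    batch = producer.create_batch()
    batch.add(EventData(json_payload))
    producer.send_batch(batch)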
I'm creating monitoring for a process using New Relic. The process itself is an AWS Lambda that finishes running in around 15 seconds. Any time this process fails, I want an alert to be triggered and an email to be sent to me per the policy I've configured.
For testing purposes I'm causing the Lambda to fail in a QA environment multiple times in a row to see what gets picked up by New Relic, although in production the failure would only occur a couple of times (fewer than 3) per week, potentially a few days apart.
Here is the chart that depicts all of the failures, the NRQL query, and the thresholds. As we can see, the summed errors are well above the threshold, but for some reason the alert email is not being dispatched. Any ideas?
Try increasing your evaluation offset in Condition Settings -> Advanced Settings -> Evaluation offset.
New Relic polls for Lambda metrics every 5 minutes, so if your offset is lower than this you may find that the alert doesn't fire.
In reality I've found this quite unreliable and I'd suggest setting quite a high offset initially to test the alert - maybe 20 or 30 minutes.
As I see it, the red highlighted area is the timeframe where the alert condition is being violated. The alert should have been triggered; check your notification channel and try sending a test notification.
I have a Lambda that is triggered to run every week, and I want to have a CloudWatch alarm if it ever does not run for more than 7 consecutive days.
My thinking was to alarm if there is < 1 invocation over 8 days, but it does not seem to be possible to set an evaluation period longer than 24 hours:
The alarm evaluation period (number of datapoints times the period of the metric) must be no longer than 24 hours.
Is there another way to ensure execution of Lambdas that are triggered on a period of greater than 24 hours?
The maximum evaluation period is 24 hours.
You can get around that by creating a custom metric using the CloudWatch PutMetricData API. You can publish the time elapsed since the last execution of your Lambda function and then alarm when the value rises above 8 days.
One way of doing this would be to have your Lambda function store the timestamp of its execution in DynamoDB every time it triggers. Then you can create a new function that reads that timestamp from DynamoDB and publishes the difference between it and the current time to a custom metric (have that Lambda trigger every hour, for example).
Once you have the new custom metric flowing, you can create an alarm that will fire if the value goes above 8 days for one 1h datapoint (this will solve your initial issue). You can also set the Treat missing data as option to bad - breaching threshold (this will alert you if the second lambda function doesn't trigger).
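A minimal sketch of that second function in Python with boto3, assuming a hypothetical DynamoDB table named last_run that stores an epoch timestamp; the table, key, and metric names are placeholders, not part of the original answer.

import time
import boto3

dynamodb = boto3.resource("dynamodb")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # Read the epoch timestamp the weekly Lambda wrote on its last run.
    item = dynamodb.Table("last_run").get_item(Key={"pk": "weekly-job"})["Item"]
    elapsed_seconds = time.time() - float(item["last_run_epoch"])

    # Publish the elapsed time as a custom metric; the alarm watches this value.
    cloudwatch.put_metric_data(
        Namespace="Custom/Heartbeat",
        MetricData=[{
            "MetricName": "SecondsSinceLastRun",
            "Value": elapsed_seconds,
            "Unit": "Seconds",
        }],
    )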
You should also set alarms on CloudWatch Events errors and Lambda errors. This will alert you if something goes wrong with the scheduling or the lambda itself. But the custom metric I mentioned above will also alert you in the case of human error where someone disables or deletes the event or the function by mistake for example.
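A hedged sketch of wiring up those alarms with boto3 follows; the 8-day threshold and hourly period come from the description above, while the alarm names, Lambda function name, and SNS topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the custom heartbeat metric: fire when more than 8 days have elapsed,
# and treat missing data as breaching so a dead publisher Lambda also alerts.
cloudwatch.put_metric_alarm(
    AlarmName="weekly-job-not-running",
    Namespace="Custom/Heartbeat",
    MetricName="SecondsSinceLastRun",
    Statistic="Maximum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=8 * 24 * 3600,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder
)

# Alarm on Lambda errors for the weekly function itself.
cloudwatch.put_metric_alarm(
    AlarmName="weekly-job-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "weekly-job"}],  # placeholder
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
)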
Similar to Raising Google Drive API per-user limit does not prevent rate limit exceptions
In the Drive API Console, the quotas look like this:
Despite the Per-user limit being set to an unnecessarily high number of requests/sec, I am still getting rate errors at the user level.
What I'm doing:
I am using approximately 8 threads uploading to Drive, and they are ALL implementing a robust exponential back-off of 1, 2, 4, 8, 16, 32, and 64 seconds respectively (pretty excessive back-off, but necessary IMHO). The problem can still persist through all of this back-off in some of the threads.
Is there some other rate that is not being advertised / cannot be set?
I'm nowhere near the requests/sec limit, and I still have 99.53% of the total quota. Why am I still getting userRateLimitExceeded errors?
userRateLimitExceeded is basically flood protection. It's used to prevent people from sending too many requests too fast.
Indicates that the user rate limit has been exceeded. The maximum rate limit is 10 qps per IP address. The default value set in Google Developers Console is 1 qps per IP address. You can increase this limit in the Google Developers Console to a maximum of 10 qps.
You need to slow your code down by implementing exponential backoff.
Make a request to the API
Receive an error response that has a retry-able error code
Wait 1s + random_number_milliseconds seconds
Retry request
Receive an error response that has a retry-able error code
Wait 2s + random_number_milliseconds seconds
Retry request
Receive an error response that has a retry-able error code
Wait 4s + random_number_milliseconds seconds
Retry request
Receive an error response that has a retry-able error code
Wait 8s + random_number_milliseconds seconds
Retry request
Receive an error response that has a retry-able error code
Wait 16s + random_number_milliseconds seconds
Retry request
If you still get an error, stop and log the error.
The idea is that every time you see that error, you wait a few seconds and then try to send the request again. If you get the error again, you wait a little longer.
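A minimal Python sketch of that loop, assuming a request_fn callable that raises googleapiclient.errors.HttpError on failure; the function name and retry count are illustrative, not part of the original answer.

import random
import time

from googleapiclient.errors import HttpError

def call_with_backoff(request_fn, max_retries=5):
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except HttpError as err:
            if err.resp.status not in (403, 429) or attempt == max_retries:
                raise  # not retry-able, or out of retries: stop and surface the error
            # Wait 1s, 2s, 4s, 8s, 16s plus random jitter, as in the steps above.
            time.sleep(2 ** attempt + random.random())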
Quota user:
Now, I am not sure how your application works, but if all the requests are coming from the same IP, this could cause your issue. As you can see from the quota, you get 10 requests per second per user. How does Google know it's a user? It looks at the IP address. If all your requests are coming from the same IP, then it's one user, and you are locked to 10 requests per second.
You can get around this by adding quotaUser to your request.
quotaUser - Alternative to userIp.
Lets you enforce per-user quotas from a server-side application even in cases when the user's IP address is unknown. This can occur, for example, with applications that run cron jobs on App Engine on a user's behalf.
You can choose any arbitrary string that uniquely identifies a user, but it is limited to 40 characters.
Overrides userIp if both are provided.
Learn more about capping usage.
If you send a different quotaUser on every request, say a random number, then Google treats each one as a different user, and the request is counted against that user's own per-second allowance rather than your IP's. It's a little trick to get around the IP limitation when running server applications that send everything from the same IP.
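For example, here is a hedged sketch with the Python google-api-python-client (the question doesn't say which client library is in use), which accepts quotaUser as a standard per-request parameter; the auth setup, user_id, and file metadata are placeholders.

import uuid

import google.auth
from googleapiclient.discovery import build

# Application default credentials as a stand-in for whatever auth the app uses.
credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/drive"])
drive = build("drive", "v3", credentials=credentials)

def create_file(metadata, user_id=None):
    # A stable per-user string makes Google bucket the quota by user rather
    # than by IP; user_id is a hypothetical identifier from the application.
    quota_user = user_id or uuid.uuid4().hex
    return drive.files().create(body=metadata, quotaUser=quota_user).execute()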
We download data daily from the Google Analytics API. This morning, a number of jobs on one of our accounts hit a 403 "Serving Limit Exceeded" error. However, I checked the statistics posted in the console, and our records don't appear to be anywhere near the limit.
We made 335 requests this morning. This is far less than the 50k daily request limit.
The API chart on our console page shows that we peaked at 0.1322 requests/second, which is much lower than the limit I've read is about 10 requests/second.
We run at most two simultaneous processes from each of two IP addresses; these have a five second delay between jobs and the jobs make only one request each.
Those 335 requests are spread across four different GA accounts, although they are likely queued such that all requests for a single account are contiguous.
The errors occurred between midnight and 6AM Pacific time (-0700).
When I re-ran all the jobs at 8AM Pacific time they ran without error.
Am I missing something in the rate limiting? Can someone explain what factor would cause us to hit this limit?
We're using the google-api-client gem.