Stackdriver monitoring - Metric Absent - google-cloud-stackdriver

I am trying to monitor my instances and get an alert when one of them shuts down. For this I have configured an alert policy in Stackdriver as below:
Metric Absence Condition
Violates when: CPU Usage (GCE Monitoring) is absent for greater than 5 minutes
It worked only the first time and then never created an incident for any of the stopped instances.
What am I missing here?
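
For reference, the condition described above maps roughly onto a metric-absence condition in the Cloud Monitoring `AlertPolicy` resource. The sketch below only builds the JSON body and does not call any API. The metric type `compute.googleapis.com/instance/cpu/utilization` is an assumption about which metric "CPU Usage (GCE Monitoring)" refers to, and the display names and the helper `build_absence_policy` are hypothetical.

```python
import json


def build_absence_policy(metric_type: str, minutes: int) -> dict:
    """Build an AlertPolicy body with a single metric-absence condition.

    metric_type is an assumption; substitute the metric actually used.
    """
    return {
        "displayName": "Instance CPU metric absent",  # hypothetical name
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"CPU metric absent for {minutes} minutes",
                "conditionAbsent": {
                    # Match the chosen metric on GCE VM instances only.
                    "filter": (
                        f'metric.type="{metric_type}" '
                        'AND resource.type="gce_instance"'
                    ),
                    # Durations are expressed in seconds with an "s" suffix.
                    "duration": f"{minutes * 60}s",
                },
            }
        ],
    }


policy = build_absence_policy(
    "compute.googleapis.com/instance/cpu/utilization", 5
)
print(json.dumps(policy, indent=2))
```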

Expected Behavior
@Jyotsna, I investigated this issue a bit more and was able to confirm that this is currently expected behavior: alert conditions aren't triggered by inactive instances, which explains why you won't get any alerts when the VM instance's CPU is not registering any metrics. However, there is currently a feature request in progress to update this behavior.
Known Issue
There is also a known issue that seems to cause no logs to be sent to Stackdriver on subsequent violations of the policy, even after the offending VM is back online. This explains why:
"It worked only for first time and then never created any incident for any of stopped instances."
Hence alerting on stopped instances won't work properly until the issue is fixed. Unfortunately, there is no ETA for this, but it will eventually be addressed.

Related

AWS - Prisma - Python Boto3 Enhanced auto-remediation

I use Prisma Enhanced Remediation (https://github.com/PaloAltoNetworks/Prisma-Enhanced-Remediation#getting-started) to auto-remediate alerts detected in Prisma.
Some alerts cannot be remediated, whether because of missing permissions, errors, a deficient runbook, or something else, and these alerts keep triggering their associated runbooks in Lambda.
I noticed that this repeated triggering happens when an alert is triggered for the first time and cannot be fixed, either because permissions are missing or because the runbook runs correctly but does not actually fix the issue. The Lambda function (runbook) is then triggered every 30 minutes (this looks related to the SQS visibility timeout) for some period of time (this looks related to the SQS message retention period), regardless of whether the issue has since been fixed (manually or via an improved runbook).
When an alert comes in for the first time and is fixed immediately, there is no further triggering of the kind described above.
I suspect that in the second scenario the runbook returns something that allows the alert to be removed from the queue. How can I handle the first scenario?
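
One way to think about the first scenario is sketched below. It assumes the Lambda function is invoked through an SQS event source mapping, in which case Lambda deletes the batch from the queue when the handler returns normally and leaves the messages to reappear after the visibility timeout when the handler raises. `run_runbook` and the returned `failed` list are hypothetical stand-ins, not part of the Prisma Enhanced Remediation code.

```python
import json
import logging

logger = logging.getLogger(__name__)


def run_runbook(alert: dict) -> None:
    """Hypothetical stand-in for the real runbook dispatch."""
    raise PermissionError("missing permission to remediate")


def lambda_handler(event, context, runbook=run_runbook):
    """Process each SQS record and never let a runbook failure escape.

    Because the handler returns normally, the messages are not redelivered
    after the visibility timeout (assumption: SQS event source mapping).
    """
    failed = []
    for record in event.get("Records", []):
        try:
            alert = json.loads(record["body"])
            runbook(alert)
        except Exception as exc:
            # Log the failure for follow-up instead of raising.
            logger.warning("runbook failed for %s: %s", record["messageId"], exc)
            failed.append(record["messageId"])
    return {"failed": failed}


result = lambda_handler(
    {"Records": [{"messageId": "m1", "body": "{}"}]}, None
)
```

The trade-off is that a failed remediation is only logged and not retried; a dead-letter queue with a redrive policy is another option when retries with a bounded count are wanted.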

Kibana Alerting - Monitors Disappearing and Alerts are not triggering automatically

The first issue is that our Kibana monitors disappear by themselves and come back after 2-3 minutes.
Normally alerts were being created as expected
The second and most important issue is that alerts are not triggered even when the condition result is true.
Condition Response true!!
Has anyone faced this issue? I'm open to advice. Thanks a lot in advance.

Consul health check script is not showing output on UI

I'm not able to see the script output in the Consul UI.
The script runs, but the output is not shown.
What am I missing or doing wrong? Please help!
The following information is correct for Consul up to 0.7.2 and is subject to change in the future.
The output of a check is only updated in real time when the state of the check changes (i.e. when it goes from OK to WARNING or CRITICAL to OK). The actual text of the check will be updated periodically based on Consul's anti-entropy runs, which, if I recall correctly, happen every 10 minutes by default. If you're patient, the output will be updated. Or, if you go to the Consul agent running the check and query the appropriate /v1/agent endpoint, it should be real-time. But if you query through the Consul servers' catalog, it can be delayed.
This trade-off in freshness was made for scalability reasons, to avoid continually streaming hojillions of check updates into a single set of servers.
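
As a minimal sketch of querying the agent directly, the snippet below reads `/v1/agent/checks` from the local agent and pulls out each check's status and output. The agent address `http://127.0.0.1:8500` is an assumption (the usual default), and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_agent_checks(agent_url: str = "http://127.0.0.1:8500") -> dict:
    """Fetch all checks registered with the local agent (address is an assumption)."""
    with urlopen(f"{agent_url}/v1/agent/checks", timeout=5) as resp:
        return json.load(resp)


def summarize_checks(checks: dict) -> dict:
    """Map each check ID to a (status, output) pair."""
    return {
        check_id: (check.get("Status", ""), check.get("Output", ""))
        for check_id, check in checks.items()
    }
```

Calling `summarize_checks(fetch_agent_checks())` on the node that runs the check should show the current script output without waiting for the catalog to catch up.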

Windows 2012 Service Shows as Degraded

Does anyone know what the definition of a degraded service is?
I'm monitoring some systems using Nagios and check_wmi_plus, which runs the following WMI query:
select name, displayname, Started, StartMode, State, Status FROM Win32_Service
The State comes back as running, but the Status comes back as degraded for one particular service (an in-house application that is known for crashing).
This status only seems to be mentioned in WMI, so now I'm in a bit of a battle: from the front end everything seems fine, but the monitoring system warns that the system is degraded. Any additional information on this problem and how to resolve it (other than just bouncing the service) would be great.
The most I've found is that the service didn't close down correctly.
Many thanks.
In the check_wmi_plus change log for 1.62 the following was added:
Added some additional text when services were in a degraded state.
Previously they were listed/counted as being "bad" but the display
message was confusing as it still showed them as "running". Thanks to
Paul Jobb.
According to Microsoft, a service in a "degraded" state is defined as follows:
"Service is working, but in a reduced state."
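
To illustrate the distinction between State and Status, here is a small sketch that flags services which are running but report a Status other than "OK". It assumes the rows from the query above have already been collected as dictionaries keyed by the WMI property names; the function and the sample data are hypothetical.

```python
def find_running_but_not_ok(rows):
    """Return names of services whose State is Running but whose Status is not OK."""
    return [
        row["Name"]
        for row in rows
        if row.get("State") == "Running" and row.get("Status") != "OK"
    ]


# Hypothetical sample rows shaped like the Win32_Service query results.
sample_rows = [
    {"Name": "Spooler", "State": "Running", "Status": "OK"},
    {"Name": "InHouseApp", "State": "Running", "Status": "Degraded"},
    {"Name": "Fax", "State": "Stopped", "Status": "OK"},
]
flagged = find_running_but_not_ok(sample_rows)
```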

Azure in role cache exceptions when service scales

I am using Windows Azure SDK 2.2 and have created an Azure cloud service that uses an in-role cache.
I have 2 instances of the service running under normal conditions.
When the service scales (up to 3 instances, or back down to 2 instances), I get lots of DataCacheExceptions. These are often accompanied by Azure DB connection failures from the processing going on inside the cache. (If I don't find the entry I want in the cache, I get it from the DB and put it into the cache. All standard stuff.)
I have implemented retry processes on the cache gets and puts, and use the ReliableSqlConnection object with a retry process for db connection using the Transient Fault Handling application block.
The retry process uses a fixed interval retrying every second for 5 tries.
The failures are typically:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later
Any idea why the scaling might cause these exceptions?
Should I try a less aggressive retry policy?
Any help appreciated.
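
As an illustration of what a less aggressive policy could look like, here is a language-neutral sketch of exponential backoff with jitter, written in Python. It is not the Transient Fault Handling application block's API; `retry_with_backoff` and all of its parameters are hypothetical.

```python
import random
import time


def retry_with_backoff(
    operation,
    max_attempts=5,
    base_delay=1.0,
    max_delay=30.0,
    retry_on=(Exception,),
    sleep=time.sleep,
):
    """Call operation(), retrying on failure with exponentially growing, jittered delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay ceiling doubles each attempt: 1s, 2s, 4s, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries from different instances apart.
            sleep(random.uniform(0, ceiling))
```

Compared with a fixed one-second interval, the growing and randomized delays give a struggling cache or database more room to recover and avoid all instances retrying at the same moment.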
I have also noticed that I am getting a high cache miss rate (> 70%), and that when the system is struggling, CPU utilisation is high (> 80%).
Well, I haven't been able to find any reason for the errors I am seeing, but I have 'fixed' the problem, sort of!
Looking at the last few days' processing stats, it is clear that the high CPU usage corresponds with the cloud service having 'problems'. I have changed the service to use two medium instances instead of two small instances.
This seems to have solved the problem, and the service has been running quite happily: low CPU usage, low memory usage, no exceptions.
So, whilst I still haven't discovered what the source of the problems was, I seem to have overcome them by providing a bigger environment for the service to run in.
Late news: I noticed this morning that from about 06:30 the CPU usage started to climb, along with the time taken for the service to do its processing. Errors started appearing, and I had to restart the service at 10:30 to get things back to 'normal'. Also, when restarting the service, the OnRoleRun process threw loads of DataCacheExceptions before it started running again, 45 minutes later.
Now all seems well again, and I will monitor for the next hours/days...
There seems to be no explanation for this. Remote desktop to the instances shows no exceptions in the event log, and other logging is not showing application problems, so I am still stumped.

Resources