graphite/grafana alert on message queue - metrics

We have the wait time of a message in a queue being fed into Graphite, recorded with Coda Hale’s Metrics library. And then we have Grafana graphing that time. I'm trying to come up with a sensible alert based on the wait time, but I'm seeing constant alerts. It's possible that they are legitimate, but I'm having a hard time finding examples to compare with and I wonder if the alert is doing what I think it's doing.
I would appreciate feedback and any advice from someone who is doing anything similar. What I’m looking to alert on right now is when a message has been waiting longer than the 98th percentile of the recorded message wait time. My queries are
A: aliasByNode(...waitTime.p98, 9)
B: timeShift(...waitTime.p98, '5min')
C: asPercent(#B, #A)
My alert is
when avg() of query(C, 5m, now) is above 100

It sounds like you want to use movingAverage(...waitTime.p98, '1hour') rather than timeShift(...waitTime.p98, '5min').
That will give you a final query of asPercent(...waitTime.p98,movingAverage(...waitTime.p98, '1hour')) which will compute the value for each interval as a percentage of the average over the previous 1 hour rolling window. When you then take a 5-minute average over that with the alert you'll get a value that is effectively the average of the last 5 minutes compared to the average of the last 1hr 5min.
Another way to do this is to use asPercent(movingAverage(...waitTime.p98, '5min'), movingAverage(...waitTime.p98, '1hour')) and just take the current value. It's not 100% identical, but it will be easier to reason about because you can plot it and see exactly what the alert is firing on.
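To get a feel for how the ratio of a short moving average to a long one behaves, here is a minimal Python sketch. It is not Graphite itself: it assumes one sample per minute, the helpers trailing_mean and as_percent are hypothetical stand-ins for movingAverage and asPercent, the sample values are made up, and Graphite's exact window alignment may differ.

```python
from statistics import mean

def trailing_mean(series, window):
    """Mean of the last `window` points ending at each index (shorter at the start)."""
    return [mean(series[max(0, i - window + 1): i + 1]) for i in range(len(series))]

def as_percent(numerator, denominator):
    """Point-wise numerator / denominator * 100."""
    return [100.0 * n / d for n, d in zip(numerator, denominator)]

# Made-up p98 wait times, one sample per minute: steady at 200 ms, then a 5-minute spike.
p98 = [200.0] * 60 + [400.0] * 5

short = trailing_mean(p98, 5)    # stands in for movingAverage(..., '5min')
long_ = trailing_mean(p98, 60)   # stands in for movingAverage(..., '1hour')
ratio = as_percent(short, long_)

steady_value = ratio[59]   # last minute before the spike
spike_value = ratio[-1]    # current value after five minutes of spike
print(round(steady_value, 1), round(spike_value, 1))
```

In this sketch the ratio sits at exactly 100 while the wait time is steady, so a threshold of 100 is right at the steady-state value; a threshold somewhat above 100 is probably needed to avoid firing on ordinary noise.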

Related

SpringBoot - observability on *_max *_count *_sum metrics

Small question regarding Spring Boot, some of the useful default metrics, and how to properly use them in Grafana please.
Currently with a Spring Boot 2.5.1+ (question applicable to 2.x.x.) with Actuator + Micrometer + Prometheus dependencies, there are lots of very handy default metrics that come out of the box.
I am seeing many many of them with pattern _max _count _sum.
Example, just to take a few:
spring_data_repository_invocations_seconds_max
spring_data_repository_invocations_seconds_count
spring_data_repository_invocations_seconds_sum
reactor_netty_http_client_data_received_bytes_max
reactor_netty_http_client_data_received_bytes_count
reactor_netty_http_client_data_received_bytes_sum
http_server_requests_seconds_max
http_server_requests_seconds_count
http_server_requests_seconds_sum
Unfortunately, I am not sure what to do with them, how to correctly use them, and feel like my ignorance makes me miss on some great application insights.
Searching on the web, I am seeing some using like this, to compute what seems to be an average with Grafana:
irate(http_server_requests_seconds_sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds_count{exception="None", uri!~".*actuator.*"}[5m])
But I am not sure whether this is the correct way to use them.
May I ask what sort of queries are possible, usually used when dealing with metrics of type _max _count _sum please?
Thank you
UPD 2022/11: Recently I've had a chance to work with these metrics myself, and I made a dashboard with everything I say in this answer and more. It's available on GitHub or Grafana.com. I hope this will be a good example of how you can use these metrics.
Original answer:
count and sum are generally used to calculate an average. count accumulates the number of times sum was increased, while sum holds the total value of something. Let's take http_server_requests_seconds for example:
http_server_requests_seconds_sum 10
http_server_requests_seconds_count 5
With the example above one can say that there were 5 HTTP requests and their combined duration was 10 seconds. If you divide sum by count you'll get the average request duration of 2 seconds.
Having these you can create at least two useful panels: average request duration (=average latency) and request rate.
Request rate
Using the rate() or irate() function you can get the number of requests per second:
rate(http_server_requests_seconds_count[5m])
rate() works in the following way:
Prometheus takes the samples from the given interval ([5m] in this example) and calculates the difference between the value at the current point in time (not necessarily now) and the value [5m] earlier.
The resulting difference is then divided by the number of seconds in the interval.
A short interval makes the graph look like a saw (every fluctuation is noticeable); a long interval makes the line smoother and slower to show changes.
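The two steps above can be sketched in Python. This is a simplified illustration, not how Prometheus is implemented: the hypothetical simple_rate helper ignores counter resets and the extrapolation that the real rate() performs, and the scrape values are made up.

```python
def simple_rate(samples, window_seconds):
    """Per-second increase of a counter over the trailing window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs sorted by time.
    Simplification: no counter-reset handling and no extrapolation, and the
    window must contain at least two samples.
    """
    end_time = samples[-1][0]
    in_window = [(t, v) for t, v in samples if t >= end_time - window_seconds]
    (first_t, first_v), (last_t, last_v) = in_window[0], in_window[-1]
    return (last_v - first_v) / (last_t - first_t)

# Made-up scrape results for http_server_requests_seconds_count, one every 60 s.
count_samples = [(0, 100), (60, 130), (120, 160), (180, 190), (240, 220), (300, 250)]

requests_per_second = simple_rate(count_samples, 300)
print(requests_per_second)  # 150 requests over 300 s
```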
Average Request Duration
You can proceed with
http_server_requests_seconds_sum / http_server_requests_seconds_count
but it is highly likely that you will only see a straight line on the graph. This is because the values of those metrics grow too big with time, and a really drastic change must occur for this query to show any difference. Because of this, it is better to calculate the average over samples from a recent interval. Using the increase() function you can get an approximate value of how much the metric changed during the interval. Thus:
increase(http_server_requests_seconds_sum[5m]) / increase(http_server_requests_seconds_count[5m])
The value is approximate because under the hood increase() is rate() multiplied by the number of seconds in [interval]. The error is insignificant for fast-moving counters (such as the request rate); just be prepared to see fractional results, such as an increase of 2.5 requests.
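A small worked example, with made-up counter values, shows why the lifetime ratio barely moves while the windowed ratio reacts. It is plain Python arithmetic, not a Prometheus query.

```python
# Made-up counter values at two scrape times, five minutes apart.
# The service has been up for a long time, so the totals are large.
sum_before, count_before = 20000.0, 10000   # total seconds, total requests
sum_after, count_after = 20300.0, 10100     # the last 100 requests took 300 s in total

# Equivalent of sum / count: average over the whole lifetime of the counters.
lifetime_avg = sum_after / count_after

# Equivalent of increase(sum[5m]) / increase(count[5m]): average over the window only.
window_avg = (sum_after - sum_before) / (count_after - count_before)

print(round(lifetime_avg, 3), window_avg)
```

Here the recent requests averaged 3 seconds each, yet the lifetime average only moves from 2.0 to about 2.01.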
Aggregation and filtering
If you already ran one of the queries above, you have noticed that there is not one line, but many. This is due to labels; each unique set of labels that the metric has is considered a separate time series. This can be fixed by using an aggregation function (like sum()). For example, you can aggregate request rate by instance:
sum by(instance) (rate(http_server_requests_seconds_count[5m]))
This will show you a line for each unique instance label. Now if you want to see only some and not all instances, you can do that with a filter. For example, to calculate a value just for nodeA instance:
sum by(instance) (rate(http_server_requests_seconds_count{instance="nodeA"}[5m]))
Read more about selectors here. With labels you can create any number of useful panels. Perhaps you'd like to calculate the percentage of exceptions, or their rate of occurrence, or perhaps a request rate by status code, you name it.
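To illustrate what grouping and filtering by label do, here is a minimal Python sketch. The sum_by helper, the label values, and the rates are all hypothetical; it only mimics the idea behind sum by(instance) (...) with an optional label matcher.

```python
from collections import defaultdict

# Made-up per-series request rates; each dict of labels identifies one time series.
series = [
    ({"instance": "nodeA", "uri": "/orders", "status": "200"}, 4.0),
    ({"instance": "nodeA", "uri": "/orders", "status": "500"}, 0.5),
    ({"instance": "nodeB", "uri": "/orders", "status": "200"}, 3.0),
    ({"instance": "nodeB", "uri": "/users", "status": "200"}, 1.5),
]

def sum_by(series, label, matchers=None):
    """Sum values grouped by `label`, keeping only series whose labels equal `matchers`."""
    matchers = matchers or {}
    totals = defaultdict(float)
    for labels, value in series:
        if all(labels.get(k) == v for k, v in matchers.items()):
            totals[labels[label]] += value
    return dict(totals)

per_instance = sum_by(series, "instance")                        # one line per instance
only_node_a = sum_by(series, "instance", {"instance": "nodeA"})  # filtered to nodeA
print(per_instance, only_node_a)
```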
Note on max
From what I found on the web, max shows the maximum value recorded during some interval configured in the settings (the default is 2 minutes, if the source is to be trusted). This is a somewhat uncommon metric, and whether it is useful is up to you. Since it is a gauge (unlike sum and count, it can go both up and down), you don't need extra functions (such as rate()) to see its dynamics. Thus
http_server_requests_seconds_max
... will show you the maximum request duration. You can augment this with aggregation functions (avg(), sum(), etc) and label filters to make it more useful.

JMeter Max value decreases over time

I'm using JMeter 5 to launch a simple load test. Now I want to understand the console output, but I'm having difficulty with the max value.
I expected max to be the maximum elapsed time of all the requests. But during the load test, its value decreases and increases.
Load test parameter:
loops: 1000
concurrent threads: 5
ramp-up: 1s
The image below shows my console output. You can see the max value decrease and increase, and I don't know why.
Can someone please explain this to me? I also have some trouble understanding the variations of the other values.
It's simple.
There, on that picture, you've got two types of reporting records:
1) Ones with "summary =" are overall, for the whole test duration.
As you can see, the Max values there change gradually, and only ever upward (the Mins do the opposite, as expected).
That is expected; I don't need to go into the why here, right?
2) Ones with "summary + " are deltas.
That's what was added during a certain time period (30 sec here), and all the values you observe there are calculated for that time span ONLY.
Again, obviously, they are different and independent of each other.
So, to conclude: nothing actually "jumps" there; everything works as expected, you just misinterpreted it.
Hope that soothes your concerns.
P.S. You cleared all mentions of InfluxDB & Grafana out of the question, but I have to add that it works in a similar way for that bundle: these values depend on the timeframe and on the grouping by time (smaller time chunks) within that timeframe.
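The difference between the two kinds of lines can be reproduced with a few lines of Python. This is only an illustration with made-up elapsed times, not JMeter's own code.

```python
# Made-up elapsed times (ms), grouped into consecutive 30-second reporting intervals.
intervals = [
    [120, 340, 95],
    [110, 180, 130],
    [105, 410, 90],
    [100, 150, 125],
]

delta_max = []        # like a "summary +" line: max within that interval only
cumulative_max = []   # like a "summary =" line: max since the test started
running = 0
for samples in intervals:
    interval_max = max(samples)
    running = max(running, interval_max)
    delta_max.append(interval_max)
    cumulative_max.append(running)

print(delta_max)        # goes up and down from one interval to the next
print(cumulative_max)   # never decreases
```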

Count number of points for a Librato metric

I'm trying to build a composite metric to find out how many points are sent during a period for a specific metric.
The closest Stack Overflow answer to this is about counting the number of sources, and I failed to adapt it to do what I want (How can i count the total number of sources my metric has with Librato?)
The metric in question is a timing of a function execution that receives around 20k values during peak hour.
At first, I summed the series with a count aggregation, and the resulting pattern was close to what I expected, but it always differs from our logs.
The composite I made looks like this:
sum(s("timing", "%", {function:"count"}))
Any ideas ?
Thanks
Well, Librato support told me the composite does what I want.
The differences from the logs were due to errors while sending the metrics.

iBeacons: distance between bluetooth devices in iOS

I am working on an app that displays notification when user enters a particular area and exits from the area.
I am using 3 beacons for my app. When user is in between second and third Beacon I need to notify the user that he is inside the premises and when user has crossed the first beacon I need to notify him that he is outside the premises.
To some extent I am able to achieve this by using the beacon's accuracy property as the distance between the user's device and each of the three beacons, but it takes about 30 seconds to a minute to display an alert to the user, and it should be instant.
It is important to understand that the CLBeacon accuracy property, which gives you a distance estimate in meters, lags by up to 20 seconds behind a moving mobile device. The same is true with the proximity property, which is derived from accuracy. This lag may be causing the delays you describe.
Why does this lag exist? Distance estimates are based on the bluetooth signal strength (rssi property) which varies a lot with radio noise. In order to filter this noise, iOS uses a 20 second running average in calculating the distance estimate. As a result, it tells you how far a beacon was away (on average) during the last 20 second period.
For applications where you need less lag in your estimate, you can use the rssi property directly. But be aware that due to noise, you will get a much less accurate indication of your distance to a beacon from a single reading.
Read more here: http://developer.radiusnetworks.com/2014/12/04/fundamentals-of-beacon-ranging.html
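The effect of such a running average can be illustrated with a short Python sketch. It assumes one reading per second and averages distance values directly, which is a simplification; the running_average helper and all numbers are made up for illustration and do not reproduce iOS's actual filter.

```python
from collections import deque

def running_average(values, window):
    """Mean of the most recent `window` values at each step."""
    recent = deque(maxlen=window)
    out = []
    for v in values:
        recent.append(v)
        out.append(sum(recent) / len(recent))
    return out

# Assumed: one reading per second. The device is 10 m away for 30 s,
# then moves to 1 m from the beacon and stays there.
true_distance = [10.0] * 30 + [1.0] * 30
estimate = running_average(true_distance, 20)

# Seconds after the move until the averaged estimate is within 0.5 m of the truth.
move_at = 30
lag = next(
    i for i in range(move_at, len(estimate)) if abs(estimate[i] - 1.0) <= 0.5
) - move_at
print(lag)
```

With a 20-sample window, the averaged estimate in this sketch needs most of the window length to catch up after the move.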
You are asking two questions here. I will try to address them separately.
To notify when you are in between 2 beacons - This should be pretty straightforward to do using "accuracy" and/or the "proximity" property of both the beacons.
If you need a closer estimate, use distance. pseudo code -
beaconsRanged:(CLBeacon)beacon{
    if(beacon==BEACON1 && beacon.accuracy > requiredDistanceForBkn1)
        "BEACON 1 IN REQUIRED RANGE"
    if(beacon==BEACON2 && beacon.accuracy > requiredDistanceForBkn2)
        "BEACON 2 IN REQUIRED RANGE"
}
Whenever both the conditions are satisfied, you will be done. Use proximity if you don't want fine tuning.
Code tip - you can raise LocalNotifications when each of these conditions are satisfied and have a separate class which will observe the notifications and perform required operation.
Time taken to raise alert when condition is satisfied - Ensure that you are raising alert on the main thread. If you do so on any other thread it takes a lot of time. I have tried the same thing and it just takes around a second to raise a simple alert.
One way I know of to do this -
dispatch_async(dispatch_get_main_queue(), ^{
    // code that shows the alert
});

Counting events

I'm using Cube and Cubism. It's perfect, except for one thing... I need to display the total number of events numerically. E.g. I have a metric showing API calls per 10 seconds, and I need to know the total number of API calls.
Is there anything built-in that I'm missing?
I thought about adding a (Mongo) count in the evaluator, but events expire so that wouldn't work.
Keeping track of the running total client-side and including it in the event could be an option, but the sources are distributed and the events are not monotonic, so a simple sum on the last 10 seconds won't work. I would need to be able to query 'get the last event for each distinct source'. Is that possible?
I have a lot of metrics, so I really want to keep the number of client requests to a minimum. If I could get e.g. cumulative alongside value in the standard metric query I'd be happy.
EDIT
I was missing something... using sum and a large (e.g. 1 day) step works perfectly.
