I am trying to wrap my head around Prometheus histograms and histogram_quantile() and estimate errors.
We've set up request duration metrics for a service, but I keep seeing inaccurate duration on our Grafana graph in comparison to our logs(which prints duration). In other words, I see one value passed to Prometheus collector and a another bigger value after a query in Grafana. When I say bigger, I mean the logs says request took 1.2sec, but Grafana shows it took somewhere around 2.3sec(for 99th percentile).
Here is the query(same one is used for .95 and .50):
histogram_quantile(0.99,sum(rate(service_request_duration_seconds_bucket{path="some path"}[$__rate_interval])) by (le))
Here are the buckets:
var durationTimeBucketsInSeconds = []float64{.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
I did some research and found out about https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation
Is there anyway to estimate closer to the real value, like adjusting buckets?
Related
I'm new to metric monitoring.
If we want to record the duration of the requests, I think we should use gauge, but in practise, someone would use histogram.
for example, in grpc-ecosystem/go-grpc-prometheus, they prefer to use histogram to record duration. Are there agreed best practices for the use of metric types? Or it is just their own preference.
// ServerMetrics represents a collection of metrics to be registered on a
// Prometheus metrics registry for a gRPC server.
type ServerMetrics struct {
serverStartedCounter *prom.CounterVec
serverHandledCounter *prom.CounterVec
serverStreamMsgReceived *prom.CounterVec
serverStreamMsgSent *prom.CounterVec
serverHandledHistogramEnabled bool
serverHandledHistogramOpts prom.HistogramOpts
serverHandledHistogram *prom.HistogramVec
}
Thanks~
I am new to this but let me try to answer your question. So take my answer with a grain of salt or maybe someone with experience in using metrics to observe their systems jumps in.
as stated in https://prometheus.io/docs/concepts/metric_types/
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
So if your goal would be to display the current value (duration time of requests) you could use a gauge. But I think the goal of using metrics is to find problems within your system or generate alerts if and when certain vaules aren't in a predefined range or getting a performance value (like the Apdex score) for your system.
From https://prometheus.io/docs/concepts/metric_types/#histogram
Use the histogram_quantile() function to calculate quantiles from histograms or even aggregations of histograms. A histogram is also suitable to calculate an Apdex score.
From https://en.wikipedia.org/wiki/Apdex
Apdex (Application Performance Index) is an open standard developed by an alliance of companies for measuring performance of software applications in computing. Its purpose is to convert measurements into insights about user satisfaction, by specifying a uniform way to analyze and report on the degree to which measured performance meets user expectations.
Read up on Quantiles and the calculations in histograms and summaries https://prometheus.io/docs/practices/histograms/#quantiles
Two rules of thumb:
If you need to aggregate, choose histograms.
Otherwise, choose a histogram if you have an idea of the range and distribution of values that will be observed. Choose a summary if you need an accurate quantile, no matter what the range and distribution of the values is.
Or like Adam Woodbeck in his book "Network programming with Go" said:
The general advice is to use summaries when you don’t know the range of expected values, but I’d advise you to use histograms whenever possible
so that you can aggregate histograms on the metrics server.
The main difference between gauge and histogram metric types in Prometheus is that Prometheus captures only a single (last) value of the gauge metric when it scrapes the target exposing the metric, while histogram captures all the metric values by incrementing the corresponding histogram bucket.
For example, if request duration is measured for frequently requested endpoint and Prometheus is set up to scrape your app every 30 seconds (e.g. scrape_interval: 30s in scrape_configs), then the Prometheus will scrape only a single duration for the last request every 30 seconds when the duration is stored in a Gauge metric. All the previous measurements for the request duration are lost.
On the other hand, any number of request duration measurement are registered in Histogram metric, and this doesn't depend on the interval between scrapes of your app. Later the Histogram metric allows obtaining the distribution of request durations on an arbitrary time range.
Prometheus histograms have some issues though:
You need to choose the number and the boundaries of histogram buckets, so they provide good accuracy for observing the distribution of the measured metric. This isn't a trivial task, since you may not know in advance the real distribution of the metric.
If the number of buckets are changed or their boundaries are changed for some measurement, then the histogram_quantile() function returns invalid results over such a measurement.
To big number of buckets per each histogram may result in high cardinality issues, since each bucket in the histogram creates a separate time series.
P.S. these issues are addressed in VcitoriaMetrics histograms (I'm the core developer of VictoriaMetrics).
As valyala suggest, the main difference is that histogram aggregates data, so you would take advantage of prometheus statistics engine over all registered samples (minimum, maximum, average, quantiles, etc.).
A gauge is more used to measure for example "wind velocity", "queue size", or any other kind of "instant data" where it is not so important to ignore old related samples of it as you want to know current picture.
Using gauges for "duration of the requests" would require very small scrape periods to be accurate, which is not practical even if your rate is not very high (if your scrape period is less than your application reception rate, you will ignore data). So, in summary, don't use gauges. Histogram fits much better your needs.
Small question regarding Spring Boot, some of the useful default metrics, and how to properly use them in Grafana please.
Currently with a Spring Boot 2.5.1+ (question applicable to 2.x.x.) with Actuator + Micrometer + Prometheus dependencies, there are lots of very handy default metrics that come out of the box.
I am seeing many many of them with pattern _max _count _sum.
Example, just to take a few:
spring_data_repository_invocations_seconds_max
spring_data_repository_invocations_seconds_count
spring_data_repository_invocations_seconds_sum
reactor_netty_http_client_data_received_bytes_max
reactor_netty_http_client_data_received_bytes_count
reactor_netty_http_client_data_received_bytes_sum
http_server_requests_seconds_max
http_server_requests_seconds_count
http_server_requests_seconds_sum
Unfortunately, I am not sure what to do with them, how to correctly use them, and feel like my ignorance makes me miss on some great application insights.
Searching on the web, I am seeing some using like this, to compute what seems to be an average with Grafana:
irate(http_server_requests_seconds::sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds::count{exception="None", uri!~".*actuator.*"}[5m])
But Not sure if it is the correct way to use those.
May I ask what sort of queries are possible, usually used when dealing with metrics of type _max _count _sum please?
Thank you
UPD 2022/11: Recently I've had a chance to work with these metrics myself and I made a dashboard with everything I say in this answer and more. It's available on Github or Grafana.com. I hope this will be a good example of how you can use these metrics.
Original answer:
count and sum are generally used to calculate an average. count accumulates the number of times sum was increased, while sum holds the total value of something. Let's take http_server_requests_seconds for example:
http_server_requests_seconds_sum 10
http_server_requests_seconds_count 5
With the example above one can say that there were 5 HTTP requests and their combined duration was 10 seconds. If you divide sum by count you'll get the average request duration of 2 seconds.
Having these you can create at least two useful panels: average request duration (=average latency) and request rate.
Request rate
Using rate() or irate() function you can get how many there were requests per second:
rate(http_server_requests_seconds_count[5m])
rate() works in the following way:
Prometheus takes samples from the given interval ([5m] in this example) and calculates difference between current timepoint (not necessarily now) and [5m] ago.
The obtained value is then divided by the amount of seconds in the interval.
Short interval will make the graph look like a saw (every fluctuation will be noticeable); long interval will make the line more smooth and slow in displaying changes.
Average Request Duration
You can proceed with
http_server_requests_seconds_sum / http_server_requests_seconds_count
but it is highly likely that you will only see a straight line on the graph. This is because values of those metrics grow too big with time and a really drastic change must occur for this query to show any difference. Because of this nature, it will be better to calculate average on interval samples of the data. Using increase() function you can get an approximate value of how the metric changed during the interval. Thus:
increase(http_server_requests_seconds_sum[5m]) / increase(http_server_requests_seconds_count[5m])
The value is approximate because under the hood increase() is rate() multiplied by [inverval]. The error is insignificant for fast-moving counters (such as the request rate), just be ready that there can be an increase of 2.5 requests.
Aggregation and filtering
If you already ran one of the queries above, you have noticed that there is not one line, but many. This is due to labels; each unique set of labels that the metric has is considered a separate time series. This can be fixed by using an aggregation function (like sum()). For example, you can aggregate request rate by instance:
sum by(instance) (rate(http_server_requests_seconds_count[5m]))
This will show you a line for each unique instance label. Now if you want to see only some and not all instances, you can do that with a filter. For example, to calculate a value just for nodeA instance:
sum by(instance) (rate(http_server_requests_seconds_count{instance="nodeA"}[5m]))
Read more about selectors here. With labels you can create any number of useful panels. Perhaps you'd like to calculate the percentage of exceptions, or their rate of occurrence, or perhaps a request rate by status code, you name it.
Note on max
From what I found on the web, max shows the maximum recorded value during some interval set in settings (default is 2 minutes if to trust the source). This is somewhat uncommon metric and whether it is useful is up to you. Since it is a Gauge (unlike sum and count it can go both up and down) you don't need extra functions (such as rate()) to see dynamics. Thus
http_server_requests_seconds_max
... will show you the maximum request duration. You can augment this with aggregation functions (avg(), sum(), etc) and label filters to make it more useful.
Say I want to calculate the velocity of two datapoints (A and A'), each having a score, and a time published (A' is a future version of A, and has a higher score). This would be
[A'(score) - A(score)] / [A'(time published) - A (time published)]
What I want to capture are trends with high velocities. This means I want a score going from 20 to 200 having higher weight than 8500 to 9000. So I thought I'd normalize this data by dividing the scores by a baseline.
Ex. if A(score) is 2, and A'(score) is 3, the baseline is 2, so in the formula above,
A'(score) - A(score) would be (3/2 - 2/2)
However, this means that when the numbers are this low, the velocities will be very high (since on the other hand
9000/8500 - 8500/8500
produces very low velocities, given that time difference is constant in this example only, however normally, time differences are variable).
Is there any way to reduce the impact of low starting scores WHILE at the same time allowing jumps from, say, 20 to 200 being significant? Thank you.
There are two ways to look at this. Either could give you what you want.
My first thought was that your question came very close to providing your answer. You gave yourself an important hint by calling your first calculation your velocity - your rate of change of a score over time. You could then look at its acceleration - your rate of change of the velocity over time. That's:
(A''(score) - A'(score)) - (A'(score) - A(score))
Note, I'm not dividing by time, because you say the time difference is constant for each measurement. Then you're dividing each value by a constant, which is inefficient and probably doesn't give you any further clarity.
More likely, though, it seems you want how significant the change is from one score to the next. I suspect what you want is:
(A'(score) - A(score)) / A(score)
This is (a - b) / b, which reduces down to (a/b) - 1. If you don't care about the -1, the simplest way you can see the relevant change in your score is:
A'(score)/A(score)
This shows the rate of growth of the score from one step to the next.
Edit, after clarification:
Given your comment, a variable rate of time makes the logic more complicated, but still do-able.
In that case, you do want to calculate velocity, as you were doing:
V = A'(score) - A(score) / A'(time) - A(time)
But you want to normalize it based on the previous velocity:
result = V'/V
This then becomes similar to the "acceleration" example - it requires 3 samples to have a good idea of the rate of change of the rate of change. If you spell it out all the way, you get something like:
result = (A''(score) - A'(score))/(A''(time) - A'(time)) / (A'(score) - A(score))/(A'(time) - A(time))
You can do some math to shove these numbers around, but there's really no prettier result than that.
I have an API that that processes collections.
The execution time of this API is related to the collection size (the larger the collection, the more it will take).
I am researching how can I do this with prometheus but am unsure whether I am doing things correctly (documentation is a bit lacking in this area).
the first thing I did is define a Summary metric to measure execution time of the API. I am using the canonical rate(sum)/rate(count) as explained here.
Now, since I know that the latency may be affected by the size of the input, I also want to overlay the request size on the avg execution time. Since I dont want to measure each possible size, I figured I'd use a histogram. Like so:
Histogram histogram = Histogram.build().buckets(10, 30, 50)
.name("BULK_REQUEST_SIZE")
.help("histogram of bulk sizes to correlate with duration")
.labelNames("method", "entity")
.register();
Note: the term 'size' does not relate to the size in bytes but to the length of the collection that needs to be processed. 2 items, 5 items, 50 items...
and in the execution I do (simplified):
#PUT
void process(Collection<Entity> entitiesToProcess, string entityName){
Timer t = summary.labels("PUT_BULK", entityName).startTimer()
// process...
t.observeDuration();
histogram.labels("PUT_BULK", entityName).observe(entitiesToProcess.size())
}
Question:
Later when I am looking at the BULK_REQUEST_SIZE_bucket in Grafana, I see that all buckets have the same value, so clearly I am doing something wrong.
Is there a more canonical way to do it?
Your code is correct (though bulk_request_size_bytes would be a better metric name).
The problem is likely that you've suboptimal buckets, as 10, 30 and 50 bytes are pretty small for most request sizes. I'd try larger bucket sizes that cover more typical values.
I want to monitor response time of an API. I can methods like average, median and other for monitoring. But I am facing following problems with those methods:
Problem with average
if one of the request taken very high time. For example in given set average will become high due to value 1000.
S1= [ 1 , 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1000]
Problem with Median
It will be correct value only upto 50%. For example in given set S2=[2,2,2,2,2,50,50,50,50]. median gives as value 2 but most of the user are facing slow response.
Problem with 5-95 span (http://steveakers.com/2013/08/01/span-vs-median-for-response-time-monitors/)
In above article author suggested using value uppser95-uppser5.
But that will not generate alert if response time is like:
s3=[50,50,50,50,50] . In this case all API are response are slow. But span 5-95 is zero.
I am thinking of using either of these two values.
upper95 or (upper95+upper5)/2.
Which one will be better and why ? Is there any better method to calculate QOS ?
You listed three measurements:
Average (mean) response
Median response
5-95 span response
Notice that #3 is not measuring the same thing as #1 and #2!
Mean and median give you a measure of the actual response time. This will pick up a certain class of problem.
5-95 span tells you to what extent your response time varies. i.e. Is your response time consistent or not. This will pick up another class of problem.
You probably need to track both: the absolute response time, as well as the variance. The best approach for the former (mean vs median, whether to clip outliers) probably depends on the results you get for your service.