SpringBoot - observability on *_max *_count *_sum metrics - spring-boot

Small question regarding Spring Boot, some of the useful default metrics, and how to properly use them in Grafana please.
Currently with a Spring Boot 2.5.1+ (question applicable to 2.x.x.) with Actuator + Micrometer + Prometheus dependencies, there are lots of very handy default metrics that come out of the box.
I am seeing many many of them with pattern _max _count _sum.
Example, just to take a few:
spring_data_repository_invocations_seconds_max
spring_data_repository_invocations_seconds_count
spring_data_repository_invocations_seconds_sum
reactor_netty_http_client_data_received_bytes_max
reactor_netty_http_client_data_received_bytes_count
reactor_netty_http_client_data_received_bytes_sum
http_server_requests_seconds_max
http_server_requests_seconds_count
http_server_requests_seconds_sum
Unfortunately, I am not sure what to do with them, how to correctly use them, and feel like my ignorance makes me miss on some great application insights.
Searching on the web, I am seeing some using like this, to compute what seems to be an average with Grafana:
irate(http_server_requests_seconds::sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds::count{exception="None", uri!~".*actuator.*"}[5m])
But Not sure if it is the correct way to use those.
May I ask what sort of queries are possible, usually used when dealing with metrics of type _max _count _sum please?
Thank you

UPD 2022/11: Recently I've had a chance to work with these metrics myself and I made a dashboard with everything I say in this answer and more. It's available on Github or Grafana.com. I hope this will be a good example of how you can use these metrics.
Original answer:
count and sum are generally used to calculate an average. count accumulates the number of times sum was increased, while sum holds the total value of something. Let's take http_server_requests_seconds for example:
http_server_requests_seconds_sum 10
http_server_requests_seconds_count 5
With the example above one can say that there were 5 HTTP requests and their combined duration was 10 seconds. If you divide sum by count you'll get the average request duration of 2 seconds.
Having these you can create at least two useful panels: average request duration (=average latency) and request rate.
Request rate
Using rate() or irate() function you can get how many there were requests per second:
rate(http_server_requests_seconds_count[5m])
rate() works in the following way:
Prometheus takes samples from the given interval ([5m] in this example) and calculates difference between current timepoint (not necessarily now) and [5m] ago.
The obtained value is then divided by the amount of seconds in the interval.
Short interval will make the graph look like a saw (every fluctuation will be noticeable); long interval will make the line more smooth and slow in displaying changes.
Average Request Duration
You can proceed with
http_server_requests_seconds_sum / http_server_requests_seconds_count
but it is highly likely that you will only see a straight line on the graph. This is because values of those metrics grow too big with time and a really drastic change must occur for this query to show any difference. Because of this nature, it will be better to calculate average on interval samples of the data. Using increase() function you can get an approximate value of how the metric changed during the interval. Thus:
increase(http_server_requests_seconds_sum[5m]) / increase(http_server_requests_seconds_count[5m])
The value is approximate because under the hood increase() is rate() multiplied by [inverval]. The error is insignificant for fast-moving counters (such as the request rate), just be ready that there can be an increase of 2.5 requests.
Aggregation and filtering
If you already ran one of the queries above, you have noticed that there is not one line, but many. This is due to labels; each unique set of labels that the metric has is considered a separate time series. This can be fixed by using an aggregation function (like sum()). For example, you can aggregate request rate by instance:
sum by(instance) (rate(http_server_requests_seconds_count[5m]))
This will show you a line for each unique instance label. Now if you want to see only some and not all instances, you can do that with a filter. For example, to calculate a value just for nodeA instance:
sum by(instance) (rate(http_server_requests_seconds_count{instance="nodeA"}[5m]))
Read more about selectors here. With labels you can create any number of useful panels. Perhaps you'd like to calculate the percentage of exceptions, or their rate of occurrence, or perhaps a request rate by status code, you name it.
Note on max
From what I found on the web, max shows the maximum recorded value during some interval set in settings (default is 2 minutes if to trust the source). This is somewhat uncommon metric and whether it is useful is up to you. Since it is a Gauge (unlike sum and count it can go both up and down) you don't need extra functions (such as rate()) to see dynamics. Thus
http_server_requests_seconds_max
... will show you the maximum request duration. You can augment this with aggregation functions (avg(), sum(), etc) and label filters to make it more useful.

Related

When to use gauge or histogram in prometheus in recording request duration?

I'm new to metric monitoring.
If we want to record the duration of the requests, I think we should use gauge, but in practise, someone would use histogram.
for example, in grpc-ecosystem/go-grpc-prometheus, they prefer to use histogram to record duration. Are there agreed best practices for the use of metric types? Or it is just their own preference.
// ServerMetrics represents a collection of metrics to be registered on a
// Prometheus metrics registry for a gRPC server.
type ServerMetrics struct {
serverStartedCounter *prom.CounterVec
serverHandledCounter *prom.CounterVec
serverStreamMsgReceived *prom.CounterVec
serverStreamMsgSent *prom.CounterVec
serverHandledHistogramEnabled bool
serverHandledHistogramOpts prom.HistogramOpts
serverHandledHistogram *prom.HistogramVec
}
Thanks~
I am new to this but let me try to answer your question. So take my answer with a grain of salt or maybe someone with experience in using metrics to observe their systems jumps in.
as stated in https://prometheus.io/docs/concepts/metric_types/
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
So if your goal would be to display the current value (duration time of requests) you could use a gauge. But I think the goal of using metrics is to find problems within your system or generate alerts if and when certain vaules aren't in a predefined range or getting a performance value (like the Apdex score) for your system.
From https://prometheus.io/docs/concepts/metric_types/#histogram
Use the histogram_quantile() function to calculate quantiles from histograms or even aggregations of histograms. A histogram is also suitable to calculate an Apdex score.
From https://en.wikipedia.org/wiki/Apdex
Apdex (Application Performance Index) is an open standard developed by an alliance of companies for measuring performance of software applications in computing. Its purpose is to convert measurements into insights about user satisfaction, by specifying a uniform way to analyze and report on the degree to which measured performance meets user expectations.
Read up on Quantiles and the calculations in histograms and summaries https://prometheus.io/docs/practices/histograms/#quantiles
Two rules of thumb:
If you need to aggregate, choose histograms.
Otherwise, choose a histogram if you have an idea of the range and distribution of values that will be observed. Choose a summary if you need an accurate quantile, no matter what the range and distribution of the values is.
Or like Adam Woodbeck in his book "Network programming with Go" said:
The general advice is to use summaries when you don’t know the range of expected values, but I’d advise you to use histograms whenever possible
so that you can aggregate histograms on the metrics server.
The main difference between gauge and histogram metric types in Prometheus is that Prometheus captures only a single (last) value of the gauge metric when it scrapes the target exposing the metric, while histogram captures all the metric values by incrementing the corresponding histogram bucket.
For example, if request duration is measured for frequently requested endpoint and Prometheus is set up to scrape your app every 30 seconds (e.g. scrape_interval: 30s in scrape_configs), then the Prometheus will scrape only a single duration for the last request every 30 seconds when the duration is stored in a Gauge metric. All the previous measurements for the request duration are lost.
On the other hand, any number of request duration measurement are registered in Histogram metric, and this doesn't depend on the interval between scrapes of your app. Later the Histogram metric allows obtaining the distribution of request durations on an arbitrary time range.
Prometheus histograms have some issues though:
You need to choose the number and the boundaries of histogram buckets, so they provide good accuracy for observing the distribution of the measured metric. This isn't a trivial task, since you may not know in advance the real distribution of the metric.
If the number of buckets are changed or their boundaries are changed for some measurement, then the histogram_quantile() function returns invalid results over such a measurement.
To big number of buckets per each histogram may result in high cardinality issues, since each bucket in the histogram creates a separate time series.
P.S. these issues are addressed in VcitoriaMetrics histograms (I'm the core developer of VictoriaMetrics).
As valyala suggest, the main difference is that histogram aggregates data, so you would take advantage of prometheus statistics engine over all registered samples (minimum, maximum, average, quantiles, etc.).
A gauge is more used to measure for example "wind velocity", "queue size", or any other kind of "instant data" where it is not so important to ignore old related samples of it as you want to know current picture.
Using gauges for "duration of the requests" would require very small scrape periods to be accurate, which is not practical even if your rate is not very high (if your scrape period is less than your application reception rate, you will ignore data). So, in summary, don't use gauges. Histogram fits much better your needs.

Count number of point for a librato metric

I'm trying to build a composite metric to know how many point are sent on a period for a specific metric.
The closer stackoverflow response to this is about counting the number of source, and I failed to change it to do what I want (How can i count the total number of sources my metric has with Librato?)
The metric in question is a timing on a function execution, that receive around 20k values on peak hour
At first, I sum-ed the series with a count aggregation, and the pattern I had then was close to what I expected, but regarding our logs, it always differ
The composite I made was like that
sum(s("timing", "%", {function:"count"}))
Any ideas ?
Thanks
Well, the librato support told me the composite do what I want
The differences with the logs were due to errors during metrics sending

Rate metric per time with ElasticSearch datasource

I am using ElasticSearch as a data source in Grafana.
I have an ES index in which every document represents an HTTP request. I would like to create a graph that would show the rate of request in a given time interval (per second, per minute).
Basically, I am hoping it is possible to reproduce what prometheus offer with the rate() function: https://prometheus.io/docs/prometheus/latest/querying/functions/#rate
Per my actual researches, I think I should use the "derivative" option in Grafana, associated with the Count metric, but I am not sure how to configure it to graph correct results.
Furthermore, I am using a templated interval variable with custom intervals like 2m, 3m... Would it be possible to use $__interval_ms builtin variable to compute the rate. I mean, is this builtin automatically computed based my custom interval, or is it working only with the auto value? If not, how would I usea time interval like 5m to perform arithmetic to compute the rate from it ?
Thanks
Solved this by adding a dummy field for each request I log, where the content is simply the value 1. Then in grafana, I can use the sum aggregator and an inline script that allow me to calculate a rate given a time interval like 5m, where the script is simply *value / 60*5*.

How to correctly use Prometheus Histogram from java client to track size rather than latency?

I have an API that that processes collections.
The execution time of this API is related to the collection size (the larger the collection, the more it will take).
I am researching how can I do this with prometheus but am unsure whether I am doing things correctly (documentation is a bit lacking in this area).
the first thing I did is define a Summary metric to measure execution time of the API. I am using the canonical rate(sum)/rate(count) as explained here.
Now, since I know that the latency may be affected by the size of the input, I also want to overlay the request size on the avg execution time. Since I dont want to measure each possible size, I figured I'd use a histogram. Like so:
Histogram histogram = Histogram.build().buckets(10, 30, 50)
.name("BULK_REQUEST_SIZE")
.help("histogram of bulk sizes to correlate with duration")
.labelNames("method", "entity")
.register();
Note: the term 'size' does not relate to the size in bytes but to the length of the collection that needs to be processed. 2 items, 5 items, 50 items...
and in the execution I do (simplified):
#PUT
void process(Collection<Entity> entitiesToProcess, string entityName){
Timer t = summary.labels("PUT_BULK", entityName).startTimer()
// process...
t.observeDuration();
histogram.labels("PUT_BULK", entityName).observe(entitiesToProcess.size())
}
Question:
Later when I am looking at the BULK_REQUEST_SIZE_bucket in Grafana, I see that all buckets have the same value, so clearly I am doing something wrong.
Is there a more canonical way to do it?
Your code is correct (though bulk_request_size_bytes would be a better metric name).
The problem is likely that you've suboptimal buckets, as 10, 30 and 50 bytes are pretty small for most request sizes. I'd try larger bucket sizes that cover more typical values.

Kibana graphing just the difference of a metric instead of total

I was just wondering if anybody knew of a way to be able to show a graph of the difference of metrics like system.network.in.bytes -
If you look at this graph you can just see that the value continuously gets bigger (at around the same speed) - but I just want to graph the difference between each value not the total.
Example
Anyone have any ideas?
Try a timeseries visualization or timelion.
Assuming your field name is 'bytesIn' (for simplicity) and taking 1 minute intervals (as IMO 30s isn't possible in timelion), your timelion expression should look something like:
.es(*,metric='avg:bytesIn').subtract(.es(*,metric='avg:bytesIn',offset='-1m'))
Explanation
.es(*,metric='avg:bytesIn') gives average of bytesIn over a time interval (here I'm assuming 1m)
Adding offset='-1m', offsets the series retrieval by -1m as if they are happening now
.subtract just subtracts value of one series from another

Resources