I'm trying to build a composite metric to know how many points are sent over a period for a specific metric.
The closest Stack Overflow answer to this is about counting the number of sources, and I failed to adapt it to do what I want (How can I count the total number of sources my metric has with Librato?)
The metric in question is a timing on a function execution that receives around 20k values at peak hour.
At first, I summed the series with a count aggregation, and the resulting pattern was close to what I expected, but compared to our logs it always differs.
The composite I made looks like this:
sum(s("timing", "%", {function:"count"}))
Any ideas?
Thanks
Well, Librato support told me the composite does what I want.
The differences with the logs were due to errors while sending the metrics.
I'm new to metric monitoring.
If we want to record the duration of requests, I think we should use a gauge, but in practice people use a histogram.
For example, in grpc-ecosystem/go-grpc-prometheus, they prefer a histogram to record duration. Are there agreed best practices for the use of metric types, or is it just their own preference?
// ServerMetrics represents a collection of metrics to be registered on a
// Prometheus metrics registry for a gRPC server.
type ServerMetrics struct {
	serverStartedCounter    *prom.CounterVec
	serverHandledCounter    *prom.CounterVec
	serverStreamMsgReceived *prom.CounterVec
	serverStreamMsgSent     *prom.CounterVec

	serverHandledHistogramEnabled bool
	serverHandledHistogramOpts    prom.HistogramOpts
	serverHandledHistogram        *prom.HistogramVec
}
Thanks~
I am new to this, but let me try to answer your question. So take my answer with a grain of salt, or maybe someone with experience in using metrics to observe their systems will jump in.
As stated in https://prometheus.io/docs/concepts/metric_types/:
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
So if your goal is to display the current value (the duration of a request) you could use a gauge. But I think the goal of using metrics is to find problems within your system, to generate alerts when certain values aren't in a predefined range, or to get a performance value (like the Apdex score) for your system.
From https://prometheus.io/docs/concepts/metric_types/#histogram
Use the histogram_quantile() function to calculate quantiles from histograms or even aggregations of histograms. A histogram is also suitable to calculate an Apdex score.
From https://en.wikipedia.org/wiki/Apdex
Apdex (Application Performance Index) is an open standard developed by an alliance of companies for measuring performance of software applications in computing. Its purpose is to convert measurements into insights about user satisfaction, by specifying a uniform way to analyze and report on the degree to which measured performance meets user expectations.
Read up on Quantiles and the calculations in histograms and summaries https://prometheus.io/docs/practices/histograms/#quantiles
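For instance, a typical query for the 95th percentile of request duration over the last 5 minutes (assuming a histogram named http_request_duration_seconds, which is just an illustrative name) looks like this:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))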
Two rules of thumb:
If you need to aggregate, choose histograms.
Otherwise, choose a histogram if you have an idea of the range and distribution of values that will be observed. Choose a summary if you need an accurate quantile, no matter what the range and distribution of the values is.
Or, as Adam Woodbeck said in his book "Network Programming with Go":
The general advice is to use summaries when you don’t know the range of expected values, but I’d advise you to use histograms whenever possible
so that you can aggregate histograms on the metrics server.
The main difference between the gauge and histogram metric types in Prometheus is that Prometheus captures only a single (the last) value of a gauge metric when it scrapes the target exposing the metric, while a histogram captures all the metric values by incrementing the corresponding histogram buckets.
For example, if request duration is measured for a frequently requested endpoint and Prometheus is set up to scrape your app every 30 seconds (e.g. scrape_interval: 30s in scrape_configs), then Prometheus will scrape only a single duration, that of the last request, every 30 seconds when the duration is stored in a Gauge metric. All the previous measurements of the request duration are lost.
On the other hand, any number of request duration measurements are registered in a Histogram metric, and this doesn't depend on the interval between scrapes of your app. The Histogram metric then allows obtaining the distribution of request durations over an arbitrary time range.
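To illustrate the difference, here is a minimal Go sketch (the metric names, help texts and handler function are my own assumptions, not taken from any particular project) that instruments request duration both ways: the gauge only ever holds the last value written to it, while the histogram accumulates every observation into buckets:
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Gauge: only the value from the most recent Set() is visible at scrape time.
	lastRequestDuration = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "last_request_duration_seconds", // hypothetical metric name
		Help: "Duration of the most recent request.",
	})

	// Histogram: every Observe() increments a bucket, so no sample is lost
	// between scrapes.
	requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "request_duration_seconds", // hypothetical metric name
		Help:    "Distribution of request durations.",
		Buckets: prometheus.DefBuckets, // or boundaries tuned to your latencies
	})
)

func init() {
	prometheus.MustRegister(lastRequestDuration, requestDuration)
}

func instrumentedHandler() {
	start := time.Now()
	// ... handle the request ...
	elapsed := time.Since(start).Seconds()

	lastRequestDuration.Set(elapsed) // overwrites the previous value
	requestDuration.Observe(elapsed) // adds to the distribution
}
With the gauge, anything that happened between two scrapes is simply overwritten; with the histogram, the buckets keep counting regardless of when Prometheus scrapes.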
Prometheus histograms have some issues though:
You need to choose the number and the boundaries of histogram buckets, so they provide good accuracy for observing the distribution of the measured metric. This isn't a trivial task, since you may not know in advance the real distribution of the metric.
If the number of buckets or their boundaries are changed for some measurement, then the histogram_quantile() function returns invalid results over such a measurement.
Too big a number of buckets per histogram may result in high-cardinality issues, since each bucket in the histogram creates a separate time series.
P.S. These issues are addressed in VictoriaMetrics histograms (I'm the core developer of VictoriaMetrics).
As valyala suggests, the main difference is that a histogram aggregates data, so you can take advantage of the Prometheus statistics engine over all registered samples (minimum, maximum, average, quantiles, etc.).
A gauge is better suited to measure, for example, "wind velocity", "queue size", or any other kind of "instant data" where old samples can be ignored because you only want to know the current picture.
Using gauges for "duration of the requests" would require a very small scrape period to be accurate, which is not practical even if your request rate is not very high (if requests arrive more often than your scrape interval, you will miss data). So, in summary, don't use gauges here; a histogram fits your needs much better.
Small question regarding Spring Boot, some of the useful default metrics, and how to properly use them in Grafana please.
Currently with Spring Boot 2.5.1+ (the question is applicable to 2.x.x) and the Actuator + Micrometer + Prometheus dependencies, there are lots of very handy default metrics that come out of the box.
I am seeing many of them with the pattern _max / _count / _sum.
Example, just to take a few:
spring_data_repository_invocations_seconds_max
spring_data_repository_invocations_seconds_count
spring_data_repository_invocations_seconds_sum
reactor_netty_http_client_data_received_bytes_max
reactor_netty_http_client_data_received_bytes_count
reactor_netty_http_client_data_received_bytes_sum
http_server_requests_seconds_max
http_server_requests_seconds_count
http_server_requests_seconds_sum
Unfortunately, I am not sure what to do with them or how to use them correctly, and I feel like my ignorance makes me miss out on some great application insights.
Searching on the web, I am seeing some people using them like this, to compute what seems to be an average with Grafana:
irate(http_server_requests_seconds_sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds_count{exception="None", uri!~".*actuator.*"}[5m])
But I'm not sure if that is the correct way to use them.
May I ask what sorts of queries are possible and usually used when dealing with metrics of type _max / _count / _sum?
Thank you
UPD 2022/11: Recently I've had a chance to work with these metrics myself and I made a dashboard with everything I say in this answer and more. It's available on Github or Grafana.com. I hope this will be a good example of how you can use these metrics.
Original answer:
count and sum are generally used to calculate an average. count accumulates the number of times sum was increased, while sum holds the total value of something. Let's take http_server_requests_seconds for example:
http_server_requests_seconds_sum 10
http_server_requests_seconds_count 5
With the example above one can say that there were 5 HTTP requests and their combined duration was 10 seconds. If you divide sum by count you'll get the average request duration of 2 seconds.
Having these you can create at least two useful panels: average request duration (=average latency) and request rate.
Request rate
Using the rate() or irate() function you can get how many requests there were per second:
rate(http_server_requests_seconds_count[5m])
rate() works in the following way:
Prometheus takes samples from the given interval ([5m] in this example) and calculates the difference between the current time point (not necessarily now) and the one [5m] ago.
The obtained value is then divided by the number of seconds in the interval.
A short interval will make the graph look like a saw (every fluctuation will be noticeable); a long interval will make the line smoother and slower to display changes.
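As a worked example (the numbers are made up): if http_server_requests_seconds_count was 1000 at the point [5m] before the current time point and 1300 at the current time point, rate(http_server_requests_seconds_count[5m]) returns (1300 - 1000) / 300 = 1 request per second.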
Average Request Duration
You can proceed with
http_server_requests_seconds_sum / http_server_requests_seconds_count
but it is highly likely that you will only see a straight line on the graph. This is because the values of those metrics grow too big over time, so a really drastic change must occur for this query to show any difference. Because of this, it is better to calculate the average over interval samples of the data. Using the increase() function you can get an approximate value of how much the metric changed during the interval. Thus:
increase(http_server_requests_seconds_sum[5m]) / increase(http_server_requests_seconds_count[5m])
The value is approximate because under the hood increase() is rate() multiplied by [interval]. The error is insignificant for fast-moving counters (such as the request rate); just be ready that there can be an increase of, say, 2.5 requests.
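For example (again with made-up numbers): if the rate over the window is 0.00833 requests per second, increase(http_server_requests_seconds_count[5m]) returns roughly 0.00833 * 300 ≈ 2.5, even though half a request never actually happened.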
Aggregation and filtering
If you have already run one of the queries above, you will have noticed that there is not one line but many. This is due to labels; each unique set of labels the metric has is considered a separate time series. This can be fixed by using an aggregation function (like sum()). For example, you can aggregate the request rate by instance:
sum by(instance) (rate(http_server_requests_seconds_count[5m]))
This will show you a line for each unique instance label. Now if you want to see only some and not all instances, you can do that with a filter. For example, to calculate a value just for nodeA instance:
sum by(instance) (rate(http_server_requests_seconds_count{instance="nodeA"}[5m]))
Read more about selectors here. With labels you can create any number of useful panels. Perhaps you'd like to calculate the percentage of exceptions, or their rate of occurrence, or perhaps a request rate by status code, you name it.
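For example (assuming the exception and status labels that Micrometer adds to http_server_requests_seconds_count by default), the percentage of requests that ended in an exception and the request rate per status code could look like this:
100 * sum(rate(http_server_requests_seconds_count{exception!="None"}[5m])) / sum(rate(http_server_requests_seconds_count[5m]))
sum by(status) (rate(http_server_requests_seconds_count[5m]))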
Note on max
From what I found on the web, max shows the maximum value recorded during some interval set in the settings (the default is 2 minutes, if the source is to be trusted). This is a somewhat uncommon metric, and whether it is useful is up to you. Since it is a Gauge (unlike sum and count, it can go both up and down) you don't need extra functions (such as rate()) to see its dynamics. Thus
http_server_requests_seconds_max
... will show you the maximum request duration. You can augment this with aggregation functions (avg(), sum(), etc) and label filters to make it more useful.
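For example (assuming the standard uri label), the maximum observed request duration per endpoint could be shown with:
max by(uri) (http_server_requests_seconds_max)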
I was just wondering if anybody knew of a way to show a graph of the difference between values of metrics like system.network.in.bytes.
If you look at this graph you can see that the value continuously gets bigger (at around the same speed), but I just want to graph the difference between each value, not the total.
Example
Anyone have any ideas?
Try a timeseries visualization or timelion.
Assuming your field name is 'bytesIn' (for simplicity) and taking 1 minute intervals (as IMO 30s isn't possible in timelion), your timelion expression should look something like:
.es(*,metric='avg:bytesIn').subtract(.es(*,metric='avg:bytesIn',offset='-1m'))
Explanation
.es(*,metric='avg:bytesIn') gives average of bytesIn over a time interval (here I'm assuming 1m)
Adding offset='-1m' offsets the series retrieval by -1m, so that those values appear as if they are happening now
.subtract just subtracts value of one series from another
Anyone who's read the Parse documentation has stumbled upon this:
Caveat: Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
Why's there such limitation and inaccuracy?
To quote the Parse Engineering Blog Post: Building Scalable Apps on Parse
Suppose you are building a product catalog. You might want to display the count of products in each category on the top-level navigation screen. If you run a count query for each of these UI elements, they will not run efficiently on large data sets because MongoDB does not use counting B-trees. Instead, we recommend that you use a separate Parse Object to keep track of counts for each category. Whenever a product gets added or deleted, you can increment or decrement the counts in an afterSave or afterDelete Cloud Code handler.
To add on to this, here is another quote by Hector Ramos from the Parse Developers Google Group
Count queries have always been expensive once you throw some constraints in. If you only care about the total size of the collection, you can run a count query without any constraints and that one should be pretty fast, as getting the total number of records is a different problem than counting how many of these match an arbitrary list of constraints. This is just the reality of working with database systems.
The inaccuracy is not due to the 1000 request object limit. The count query will try to get the total number of records regardless of size, but since the operation may take a large amount of time to complete, it is possible that the database has changed during that window and the count value that is returned may no longer be valid.
The recommended way to handle counts is to essentially maintain your own index using before/after save hooks. However, this is also a non-ideal solution because save hooks can arbitrarily fail part way through and (worse) postSave hooks have no error propagation.
The limitation is simply to stop people from using counts too much; in effect they are just as costly at runtime as full queries.
The inaccuracy is because queries are limited to 1000 result objects (100 by default) and counts have the same hard limit.
You can run a recursive query to build up a count, but it's a crappy option. Hence the only really good option at this point in time (and as far as we can see in the future) is to keep an index of the things you're interested in counting and update the counts when anything changes. You would usually do that with save hooks in cloud code.
I'm using Cube and Cubism. It's perfect, except for one thing... I need to display the total number of events numerically. E.g. I have a metric showing API calls per 10 seconds; I need to know the total API calls.
Is there anything built-in that I'm missing?
I thought about adding a (Mongo) count in the evaluator, but events expire so that wouldn't work.
Keeping track of the running total client-side and including it in the event could be an option, but the sources are distributed and the events are not monotonic, so a simple sum on the last 10 seconds won't work. I would need to be able to query 'get the last event for each distinct source'. Is that possible?
I have a lot of metrics, so I really want to keep the number of client requests to a minimum. If I could get e.g. cumulative alongside value in the standard metric query I'd be happy.
EDIT
I was missing something... using sum and a large (e.g. 1 day) step works.
I was missing something... using sum and a large (e.g. 1 day) step works perfectly.