After updating to Grafana v4.1.2 (commit: v4.1.2), which I am using together with the latest version of InfluxDB, my dashboards changed their behaviour.
I am not using GROUP BY time($interval) in this particular dashboard, but give the possibility to define the group by value via template variable: GROUP BY time($group_time)
I know this is risky, but in our circumstances it needs to be this way.
Before the update, if a very small value for the group by variable was chosen together with a vast time interval, either the dashboard took a very long time to load or the browser froze. Now this behaviour has changed. There seems to be a limit on the number of datapoints that Grafana retrieves or visualizes. To make this clearer, see the example below:
Time interval with 20s GROUP BY
both curves entirely visible with this group by value
Same time interval with 10s GROUP BY
one curve entirely visible with this group by value, the other curve cut off after some points
Same time interval with 10s GROUP BY, first curve hidden
the second curve is entirely visible again
I suspect that Grafana is now sending a LIMIT to InfluxDB. I found this bit of information for the combination of Grafana and Graphite, http://docs.grafana.org/installation/performance/, but nothing for Grafana and InfluxDB.
So where has this been changed and is there a way to set this limit manually? The limit that I am experiencing at the moment is most probably 10000, because I can limit my queries with values below 10000, but any higher value does not bring more points. Any kind of documentation of this now default limit would be very much appreciated.
I'm trying to display the last value of a metric in a QueryValue field on my Datadog dashboard.
For the moment, I'm using:
"queries": [
    {
        "query": "max:blabla.mycount{$env}",
        "data_source": "metrics",
        "name": "query1",
        "aggregator": "last"
    }
]
Is this the right way to do that? For this series of mycount values [20,1,5,3,2], which number will be taken? Is it really the last one in the series (2) or the biggest one in the series (20)?
Regards,
Blured.
So there's going to be 3 levels of aggregation to consider: the Time Aggregation and Space Aggregation of your query, and then the aggregation of the query value widget on the frontend (which is what you're asking about). For now, let's understand time aggregation by thinking of a time series widget, and then we'll see what happens with the query value widget after.
Space aggregation is the simplest one. The idea is that you have multiple time series being submitted from multiple applications or servers. If 20 computers send a metric all at the same time, which one should we pick to display? You decide that with the aggregation chunk of your query; yours is currently set to max.
The idea is that you have to decide which out of the dozens or hundreds of instances of your metric is the one you want to display.
If you don't want to worry about space aggregation, you have to make your query specific enough that only 1 time series exists for that metric. For example, a CPU metric will need to be scoped to at least the hostname. For a container metric, hostname isn't enough; you would need at least the container_id. For a database there should be a db_identifier or something that gets you just 1 result back.
Now for time aggregation, let's look at the docs a bit:
As Datadog stores data at a 1 second granularity, it cannot display all real data on graphs. See How data is aggregated in graphs for more details.
For a graph on a 1-week time window, it would require sending hundreds of thousands of values to your browser—and besides, not all these points could be graphed on a widget occupying a small portion of your screen.
...
The Datadog backend tries to keep the number of intervals to a number below ~300.
https://docs.datadoghq.com/dashboards/guide/query-to-the-graph/#proceed-to-time-aggregation
So, for example, if you are looking at a 5 minute window, the time aggregation will be as granular as possible. There are 300 seconds in 5 minutes, so every interval on the graph will represent 1 second. If we zoomed out to 10 minutes (600 seconds), we can only show data every 2 seconds, so each bucket will represent 2 data points (assuming the metric is submitted every second).
In most scenarios your metrics are being submitted at a 15 second interval. So you won't notice any time aggregation rollups until 15*300=4500 seconds (a bit over an hour).
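The arithmetic above can be sketched in a few lines of Python. This is only an approximation of the reasoning in this answer, not Datadog's actual algorithm; the function names and the hard limit of 300 buckets are assumptions made for illustration.

```python
import math


def bucket_seconds(window_seconds: int, max_buckets: int = 300) -> int:
    """Approximate width of one graph interval, in seconds.

    Assumes the backend keeps the number of intervals at or below max_buckets.
    """
    return max(1, math.ceil(window_seconds / max_buckets))


def rollup_is_visible(window_seconds: int, submit_interval: int = 15) -> bool:
    """True once one bucket spans more than one submitted data point."""
    return bucket_seconds(window_seconds) > submit_interval


print(bucket_seconds(5 * 60))       # 5 minute window
print(bucket_seconds(10 * 60))      # 10 minute window
print(rollup_is_visible(4500))      # exactly 15 * 300 seconds
print(rollup_is_visible(2 * 3600))  # 2 hour window
```

With a 15 second submission interval, the sketch reports no visible rollup up to a 4500 second window and a visible one beyond it, which matches the figure given above.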
You control this with the rollup function, as described in the docs. If you don't want to worry about time aggregation, just make sure your time range is zoomed in enough to not have any bucketing.
And now for the last level of aggregation, the query value widget. You have now obtained a set of about 300 points from the backend; space and time aggregation have already been applied. Out of those datapoints, which one do you want to display? You could choose the last point, or a sum of the points, or whatever.
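To make the three levels concrete, here is a minimal Python sketch applied to the series from the question. It is a toy model, not Datadog code: the function names, the second host and its values are all hypothetical, and it assumes the time range is short enough that no rollup happens.

```python
from typing import Callable, Dict, List, Sequence

Aggregator = Callable[[Sequence[float]], float]


def space_aggregate(series_by_source: Dict[str, List[float]], agg: Aggregator) -> List[float]:
    """Combine several sources into one series, timestamp by timestamp."""
    return [agg(values) for values in zip(*series_by_source.values())]


def time_aggregate(points: List[float], bucket_size: int, agg: Aggregator) -> List[float]:
    """Collapse consecutive points into buckets of bucket_size points."""
    return [agg(points[i:i + bucket_size]) for i in range(0, len(points), bucket_size)]


# Aggregators the widget could apply to the points it receives (illustrative subset).
WIDGET_AGGREGATORS: Dict[str, Aggregator] = {
    "last": lambda pts: pts[-1],
    "max": max,
    "min": min,
    "sum": sum,
    "avg": lambda pts: sum(pts) / len(pts),
}


def query_value(points: List[float], aggregator: str) -> float:
    """Reduce the already aggregated points to the single number shown in the widget."""
    return WIDGET_AGGREGATORS[aggregator](points)


# Hypothetical data: host-a carries the series from the question.
series = {"host-a": [20, 1, 5, 3, 2], "host-b": [4, 0, 2, 1, 1]}

combined = space_aggregate(series, max)      # the "max:" part of the query
bucketed = time_aggregate(combined, 1, max)  # bucket of 1 point, so no rollup
print(query_value(bucketed, "last"))         # 2
print(query_value(bucketed, "max"))          # 20
```

In this model the "last" aggregator picks the final point of the series (2), and only a "max" widget aggregator would yield 20. Once the time range is wide enough for rollups, the final point is itself an aggregate of a bucket.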
Hopefully that helps!
I just started trying to integrate Micrometer, Prometheus and Grafana into my microservices. At first glance, it is very easy to use and there are many existing dashboards you can rely on. But the more I test, the more confusing it gets. Maybe I don't understand the main idea behind this technology stack.
I would like to start my custom Grafana dashboard by showing the number of requests per endpoint for the selected time range (as a single stat), but I am not able to find the right query for that (and I am not sure it exists).
I tried different queries:
http_server_requests_seconds_count{uri="/users"}
This always shows the current value. For example, if I sent 10 requests 30 minutes ago, this query will still return 10 when I change the time range to the last 5 minutes (even though no request entered the system during the last 5 minutes).
When I am using
increase(http_server_requests_seconds_count{uri="/users"}[$__range])
the query does not return the exact value, but something close to the actual number of requests. At least it works for a time range that doesn't include new incoming requests: in that case the query returns 0.
So my question is, is there a way to use this technology stack to get the number of new requests for the selected period of time?
For the sake of performance when operating with millions of time series, many Prometheus functions show approximate and/or interpolated values. For example, the increase() function is basically a per-second rate() multiplied by the number of seconds in the interval. With such a formula and possibly missing data points, an accurate result is the exception rather than the rule.
The reason is that Prometheus trades accuracy for performance and reliability. It doesn't really matter if your server's actual CPU usage is 86.3% instead of 86.4%, but it does matter whether you can get this information instantly. Prometheus even has this statement in its docs:
Prometheus values reliability. You can always view what statistics are available about your system, even under failure conditions. If you need 100% accuracy, such as for per-request billing, Prometheus is not a good choice as the collected data will likely not be detailed and complete enough. In such a case you would be best off using some other system to collect and analyze the data for billing, and Prometheus for the rest of your monitoring.
That being said, if you really need accurate values consider using something else. You can for example store logs and count lines (Grafana Loki, The Elastic Stack), or maybe write and retrieve this information from a traditional database with your own solution.
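To illustrate why increase() tends to return fractional values for an integer counter, here is a simplified Python sketch of the extrapolation idea. It is not the Prometheus implementation: the real function also handles counter resets and limits how far it extrapolates towards the window boundaries. The function name and the sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_increase(samples: List[Sample], window_start: float, window_end: float) -> float:
    """Simplified increase(): raw delta between the first and last sample,
    scaled up to the length of the whole window.
    """
    if len(samples) < 2:
        # Prometheus returns no result here; the sketch returns 0.0 instead.
        return 0.0
    (first_t, first_v), (last_t, last_v) = samples[0], samples[-1]
    raw_delta = last_v - first_v  # assumption: no counter reset in the window
    covered = last_t - first_t    # time actually covered by the samples
    return raw_delta * (window_end - window_start) / covered


# Hypothetical scrapes every 15 s inside a 60 s window: the counter grew by 10,
# but the samples only cover 45 s of the window.
scrapes = [(5.0, 0.0), (20.0, 4.0), (35.0, 7.0), (50.0, 10.0)]
print(simple_increase(scrapes, 0.0, 60.0))  # about 13.33, not 10

# A window without new requests: the counter is flat, so the result is 0.
flat = [(5.0, 10.0), (20.0, 10.0), (35.0, 10.0), (50.0, 10.0)]
print(simple_increase(flat, 0.0, 60.0))     # 0.0
```

This matches what the question observed: a value close to, but not exactly, the real number of requests, and exactly 0 when nothing happened in the range.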
I have a dashboard that shows a sum of requests/sec from Windows Performance Monitor, collected by Prometheus.
sum(Total_Query_Received_persec)
I would like to see any issues right away if those requests/sec drop (which would indicate an issue).
So the singlestat panel could change color: if the number of requests/sec is 50% less than the same number collected 10 minutes ago (for example), change the panel color to yellow, and if the number is 80% less than 10 minutes ago, change the color to red.
I know that you can configure this based on thresholds, but not sure if there is a way to query that info in the metric.
Is this possible at all?
Thanks
I'm not familiar enough with Grafana to provide all the details of handling the color change scenarios with that tool, but within Prometheus the query you are interested in can likely be handled with the irate() function. It's only recommended for 'fast-moving' counters, and the documentation mentions that when combining it with sum() you should apply irate() inside the sum(), that is, take the rate first and then aggregate, so that counter resets are still detected.
You might also get perfectly acceptable performance and results from aggregating the detail with rate() directly, such as rate(Total_Query_Received_persec[10m])
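The thresholds described in the question can be sketched as follows. This is only an illustration of the logic in Python, with a hypothetical function name, and it is not a Grafana feature. One possible way to obtain the input, assuming the metric is a gauge, is to let Prometheus compute the ratio with the offset modifier, for example sum(Total_Query_Received_persec) / sum(Total_Query_Received_persec offset 10m), and to put fixed thresholds on that ratio in the panel.

```python
def drop_color(current: float, ten_minutes_ago: float,
               warn_drop: float = 0.5, crit_drop: float = 0.8) -> str:
    """Map the relative drop against the value from 10 minutes ago to a color.

    warn_drop and crit_drop are the 50% and 80% drops from the question.
    """
    if ten_minutes_ago <= 0:
        # Nothing meaningful to compare against; treat as healthy in this sketch.
        return "green"
    drop = 1 - current / ten_minutes_ago
    if drop >= crit_drop:
        return "red"
    if drop >= warn_drop:
        return "yellow"
    return "green"


print(drop_color(current=90, ten_minutes_ago=100))  # small drop
print(drop_color(current=40, ten_minutes_ago=100))  # 60% drop
print(drop_color(current=10, ten_minutes_ago=100))  # 90% drop
```

Expressed on the ratio instead of the drop, the same thresholds are 0.5 (yellow) and 0.2 (red).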
We use the wmi_exporter to scrape system metrics from a Windows server. Now we want to find out the time periods where this server was used more than a certain percentage / amount.
In our Grafana dashboard I can see that on a certain day there was a spike:
But it is not possible to find this specific day when looking at, for example, 10 days:
It seems like this value is levelled out. My original plan was to check each month for spikes and then drill down to the specific days. But this won't work because I would miss spikes.
How can I find those spikes without inspecting every single day?
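The levelling-out effect can be reproduced with a small Python sketch. It assumes, without knowing the actual dashboard query, that each point on the zoomed-out graph is an average over a longer interval; the data and function names are hypothetical.

```python
from typing import Callable, List, Sequence


def downsample(points: List[float], bucket_size: int,
               agg: Callable[[Sequence[float]], float]) -> List[float]:
    """Collapse consecutive points into buckets and reduce each bucket with agg."""
    return [agg(points[i:i + bucket_size]) for i in range(0, len(points), bucket_size)]


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Hypothetical CPU usage: flat at 10% with one short spike to 95%.
cpu = [10.0] * 12
cpu[7] = 95.0

averaged = downsample(cpu, 6, mean)  # the spike is diluted by its neighbours
peaks = downsample(cpu, 6, max)      # the spike survives in its bucket

print(averaged)
print(peaks)
```

If averaging is indeed the cause, aggregating each interval by its maximum instead (Prometheus offers max_over_time() for this) keeps spikes visible on long time ranges.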
I am using Codahale Metrics for monitoring purposes. Let's say there is a spike in latency at some point and later no values are reported because there is no traffic: the value in the graph stays as it is (I am using a histogram). At times this gives the impression that the spike persists and needs to be addressed, but it actually means that no values were reported after it and hence the graph doesn't decay. Am I missing a config parameter in this case, or is this behaviour expected?
The way we update the metrics is
metrics.processingTime.update(processingTime);
So, when there is no traffic, we don't update this metric.
I know that the histogram takes into consideration datapoints from the past (for an irregular period of time) in order to display a statistical image of the data.
When there are no new datapoints, only the outlier is taken into consideration and averaged on and on.
The meters have the same behavior, displaying the data through moving averages over 1, 5, and 15 minutes.
The solution in the histogram case is to use HdrHistogram and flush it periodically.
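The idea of flushing periodically can be sketched in Python as follows. This is a toy model of the concept only, not the Codahale Metrics or HdrHistogram API; the class and method names are hypothetical.

```python
from typing import Dict, List


class ResettingHistogram:
    """Toy histogram that forgets its values every time it is reported."""

    def __init__(self) -> None:
        self._values: List[float] = []

    def update(self, value: float) -> None:
        """Record one observation, e.g. a processing time."""
        self._values.append(value)

    def snapshot_and_reset(self) -> Dict[str, float]:
        """Return statistics for the current interval and start a fresh one."""
        values, self._values = self._values, []
        if not values:
            # No traffic in this interval: report zeros instead of stale data.
            return {"count": 0, "max": 0.0, "mean": 0.0}
        return {
            "count": len(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }


histogram = ResettingHistogram()
for processing_time in (5.0, 900.0, 7.0):  # one latency spike
    histogram.update(processing_time)

first_report = histogram.snapshot_and_reset()   # contains the spike
second_report = histogram.snapshot_and_reset()  # no traffic since the last report
print(first_report)
print(second_report)
```

Because every report starts from an empty set of values, the spike appears in one reporting interval only and the graph drops back once traffic stops.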