i have a dashboard that shows a sum of requests/sec with from a windows performance monitor collected by prometheus.
sum(Total_Query_Received_persec)
I would like to see any issues right away if those request/sec drop ( which will indicate an issue)
So the singlestat panel could change color if the number of request/sec is 50% less than the same number collected 10 minutes ago (for example), change panel coloring to yellow and if the number is 80% less than 10 minutes ago change color to Red.
I know that you can configure this based on thresholds, but not sure if there is a way to query that info in the metric.
Is this possible at all?
Thanks
I'm not familiar enough to grafana to provide all the details of handling the color change scenarios with that tool, but within prometheus the query you are interested in can likely be handled with the irate operator. It's only recommended for working with 'fast moving' counters, and the documentation mentions that you should track the irate() internal to a sum() to keep from hiding the volatility from the function.
You might also get perfectly acceptable performance and results from aggregating the detail with rate directly, such as rate(total_query_received_persec)[10m]
Related
I just started trying to integrate micrometer, prometheus and Grafana into my microservices. At a first glance, it is very easy to use and there are many existing dashboard you can rely on. But the more I test the more it gets confusing. Maybe I don't understand the main idea behind this technology stack.
I would like to start my custom Grafana dashboard by showing the amount of request per endpoint for the selected time range (as a single stat), but I am not able to find the right query for that (and I am not sure it exists)
I tried different:
http_server_requests_seconds_count{uri="/users"}
Which always shows the current value. For example, if I sent 10 requests 30 minutes ago, this query will also return value 10 when I am changing changing the time range last 5 minutes (even though no request was entering the system during the last 5 minutes)
When I am using
increase(http_server_requests_seconds_count{uri="/users"}[$__range])
the query will not return the accurate value, instead something close to actual request amount. At least it works for a time range that doesn't include new incoming requests. In that case the query return 0.
So my question is, is there a way to use this Technology stack to get the amount of new requests for the selected period of time?
For the sake of performance when operating with millions of time series, many Prometheus functions show approximate and/or interpolated values. For example, the increase() function is basically a per-second rate() multiplied by the number of seconds in the interval. With such formula and possible missing data points, an accurate result is rather an exception than a normal thing.
The reason why it is so is that Prometheus exchanges accuracy for performance and reliability. It doesn't really matter if your server actual CPU usage is 86.3% instead of 86.4%, but it does matter whether you can get this information instantly. Prometheus even have this statement in their docs:
Prometheus values reliability. You can always view what statistics are available about your system, even under failure conditions. If you need 100% accuracy, such as for per-request billing, Prometheus is not a good choice as the collected data will likely not be detailed and complete enough. In such a case you would be best off using some other system to collect and analyze the data for billing, and Prometheus for the rest of your monitoring.
That being said, if you really need accurate values consider using something else. You can for example store logs and count lines (Grafana Loki, The Elastic Stack), or maybe write and retrieve this information from a traditional database with your own solution.
I need to show usage stats at any points on time for last 3 months, 6 months and 1 year. I am planning to use the KStream sliding windows for the durations mentioned above. Most of the examples I see are using durations in minutes or seconds. I would like to know is that OK to use the bigger time duration for sliding windows? Any performance impact? Any specific configuration It should use to get optimum performance?
Thanks,
Jinu
It will really depend on the density of the data and what kind of aggregations you are doing. It could end up with very large number of windows updating and not closing since the end time is so far out. Also if it is too heavy I am not sure the state stores could handle it. But with the correct load and retention times I don't see an obvious reason it wouldn't work.
Edit: If you do end up trying it I would be very interested in seeing how it works out.
We use the wmi_exporter to scrape system metrics from a Windows server. Now we want to find out the time periods where this server was used more than a certain percentage / amount.
In our Grafana dashboard I can see that on a certain day there was a spike:
But it is not possible to find this specific day when looking at for example 10 days:
It seems like this value is levelled out. My original plan was to check each month for spikes and then drill down to the specific days. But this won't work because I would miss spikes.
How can I find those spikes without inspecting every single day ?
After an update to Grafana v4.1.2 (commit: v4.1.2) which I am using together with the latest version of InfluxDB my dashboards changed their behaviour.
I am not using GROUP BY time($interval) in this particular dashboard, but give the possibility to define the group by value via template variable: GROUP BY time($group_time)
I know this is risky, but under our certain circumstances it needs to be this way.
Before the update if a very small value for group by variable was chosen together with a vast time interval obviously or it took a very long time to load or the Browser went in tilt. Now this behaviour changed. There seems to be a limit on the datapoints that grafana visualizes/ retrieves. To make this clearer see the example below:
Time interval with 20s GROUP BY
both curves entirely visible with this group by value
Same Time intervall with 10s GROUP BY
one curve entirely visible with this group by value, other curve cut after some points
Same Time intervall with 10s GROUP BY first curve turned to invisible
the second curve is entirely visible again
I suspect, that Grafana is now sending a LIMIT to InfluxDB. I found this bit of information for the combination Grafana & Graphite http://docs.grafana.org/installation/performance/ but nothing for Grafana & InfluxDB.
So where has this been changed and is there a way to set this limit manually? The limit that I am experiencing at the moment is most probably 10000, because I can limit my queries with values below 10000, but any higher value does not bring more points. Any kind of documentation of this now default limit would be very much appreciated.
I am using codahale metrics for monitoring purposes. Lets say there is a spike in latency at some point and later there are no values reported due to attribute that there are no traffic, the value in the graph stays as is(I am using a histogram). At times it gives a notion that the spike remains and we might need to address it, but it actually means that no values are reported after that and hence the graph doesn't decay. Am I missing any config parameter in this case or is the behaviour expected?
The way we update the metrics is
metrics.processingTime.update(processingTime);
So, when there is no traffic, we don't update this metric.
I know that the histogram takes into consideration datapoints from the past (for an irregular period of time) in order to display a statistical image of the data.
When there are no new datapoints, only the outlier is taken into consideration and averaged on and on.
The meters have the same behavior, displaying the data through moving averages of 1,5,15 minutes.
The solution in the histogram case is to use HDRhistogram and flush it periodically.