Finding a maximum value in a data series using Prometheus or a Grafana dashboard

We use the wmi_exporter to scrape system metrics from a Windows server. Now we want to find out the time periods where this server was used more than a certain percentage / amount.
In our Grafana dashboard I can see that on a certain day there was a spike:
But it is not possible to find this specific day when looking at, for example, a 10-day range:
It seems like the value gets averaged out. My original plan was to check each month for spikes and then drill down to the specific days, but this won't work because I would miss spikes.
How can I find those spikes without inspecting every single day?
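One possible approach (a sketch only; the metric name wmi_cpu_time_total, the mode="idle" label and the 5m/1d windows are assumptions, adjust them to whatever usage expression you actually graph) is to wrap the usage expression in max_over_time() via a subquery, so each plotted point carries the peak of that day rather than its average and spikes stay visible at any zoom level:
max_over_time((100 - 100 * avg(irate(wmi_cpu_time_total{mode="idle"}[5m])))[1d:5m])
In Grafana the same idea works with the panel interval, e.g. a subquery range of [$__interval:], so zooming out to 10 days or a month never averages the peaks away. Note that subqueries require Prometheus 2.7 or later.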

Related

How to determine accurate request count in a time range with Spring Boot + Prometheus + Grafana

I just started trying to integrate Micrometer, Prometheus and Grafana into my microservices. At first glance it is very easy to use, and there are many existing dashboards you can rely on. But the more I test, the more confusing it gets. Maybe I don't understand the main idea behind this technology stack.
I would like to start my custom Grafana dashboard by showing the number of requests per endpoint for the selected time range (as a single stat), but I am not able to find the right query for that (and I am not sure it exists).
I tried different queries:
http_server_requests_seconds_count{uri="/users"}
This always shows the current value of the counter. For example, if I sent 10 requests 30 minutes ago, this query still returns 10 when I change the time range to the last 5 minutes (even though no requests entered the system during the last 5 minutes).
When I am using
increase(http_server_requests_seconds_count{uri="/users"}[$__range])
the query does not return an exact value, just something close to the actual request count. At least it works for a time range that doesn't include any new incoming requests; in that case the query returns 0.
So my question is: is there a way to use this technology stack to get the number of new requests for the selected period of time?
For the sake of performance when operating with millions of time series, many Prometheus functions show approximate and/or interpolated values. For example, the increase() function is basically a per-second rate() multiplied by the number of seconds in the interval. With such a formula and possibly missing data points, an exact result is the exception rather than the rule.
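As a rough illustration of that relationship (the metric name is just the one from the question, and the 30m window is an arbitrary example):
increase(http_server_requests_seconds_count{uri="/users"}[30m])
is essentially
rate(http_server_requests_seconds_count{uri="/users"}[30m]) * 1800
and since rate() extrapolates from the samples at the edges of the window, the result is rarely a whole number.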
The reason is that Prometheus trades accuracy for performance and reliability. It doesn't really matter if your server's actual CPU usage is 86.3% instead of 86.4%, but it does matter whether you can get this information instantly. Prometheus even has this statement in its docs:
Prometheus values reliability. You can always view what statistics are available about your system, even under failure conditions. If you need 100% accuracy, such as for per-request billing, Prometheus is not a good choice as the collected data will likely not be detailed and complete enough. In such a case you would be best off using some other system to collect and analyze the data for billing, and Prometheus for the rest of your monitoring.
That being said, if you really need accurate values, consider using something else. You can, for example, store logs and count lines (Grafana Loki, the Elastic Stack), or write and retrieve this information from a traditional database with your own solution.

Incorrect Google Cloud metrics? or what is going on?

My background is more from the Twitter side, where all stats are recorded per minute, so you might have 120 requests per minute. Inside Twitter someone had the bright idea to divide by 60, so most graphs report requests per second (except for some teams who realize that dividing by 60 is NOT the true rps at all, since the rate fluctuates within a minute). So instead of 120 requests per minute, many graphs report 2 requests per second. Google seems to be doing the same, EXCEPT the math does not show that. At Twitter, we could multiply by 60 and the answer was always a whole integer of how many requests occurred in that minute.
In Google, however, we see 0.02 requests/second, which if we multiply by 60 is 1.2 requests per minute. IF they are at minute granularity, they are definitely counting it wrong, or something is wrong with their math.
This is from the Cloud Run metrics as we click into the instance itself.
What am I missing here? AND better yet, can we please report requests per minute? Requests per second is really the average req/second for that minute, and it can be really confusing when we have these discussions of how you can get 0.5 requests/second.
I AM assuming that this is not requests per second 'at' the minute boundary, because that would be VERY hard to calculate, but it would also be a whole number, i.e. 0 requests or 1, not 0.2, and that would be quite useless to be honest.
EVERY Cloud Run instance creates this chart, so I assume it's the same for everyone, but if I click 'view in metrics explorer' it then gives this picture of how 'Google configured it'....
As stated in the Cloud Run Metrics documentation, the Request Count metric is sampled every 60 seconds, and it excludes requests that never reach your container instances; the examples given are unauthorized requests or requests sent after the maximum number of instances is reached. That is obviously not your case, but again, something to consider.
Assuming that the calculation of the request count is wrong, I did some digging in Google's Issue Tracker for the Monitoring and Cloud Run components to check whether any related bugs are open, but could not find any. I would advise that you create a bug in their system so that Google can address it and you are notified once it is fixed.

Grafana singlestat dashboard based on previous average number

I have a dashboard that shows a sum of requests/sec from a Windows performance counter collected by Prometheus.
sum(Total_Query_Received_persec)
I would like to see right away if those requests/sec drop (which would indicate an issue).
So the singlestat panel could change color if the number of requests/sec is, for example, 50% less than the same number collected 10 minutes ago (turn the panel yellow), and if the number is 80% less than 10 minutes ago, turn it red.
I know that you can configure colors based on thresholds, but I'm not sure if there is a way to query that information from the metric.
Is this possible at all?
Thanks
I'm not familiar enough with Grafana to provide all the details of handling the color-change scenarios in that tool, but within Prometheus the query you are interested in can likely be handled with the irate() function. It is only recommended for fast-moving counters, and the documentation mentions that you should apply irate() inside the sum() to keep the aggregation from hiding the volatility from the function.
You might also get perfectly acceptable performance and results from aggregating the detail with rate() directly, such as rate(total_query_received_persec[10m])
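For the "compared to 10 minutes ago" part, one possible sketch (assuming, as in the question's query, that the metric is already a per-second gauge) uses PromQL's offset modifier to build a ratio that the singlestat thresholds can act on, e.g. yellow below 0.5 and red below 0.2:
sum(Total_Query_Received_persec) / sum(Total_Query_Received_persec offset 10m)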

Grafana cutting curves

After an update to Grafana v4.1.2 (commit: v4.1.2), which I am using together with the latest version of InfluxDB, my dashboards changed their behaviour.
I am not using GROUP BY time($interval) in this particular dashboard, but instead allow the group-by value to be defined via a template variable: GROUP BY time($group_time)
I know this is risky, but under our particular circumstances it needs to be this way.
Before the update, if a very small group-by value was chosen together with a vast time interval, it obviously either took a very long time to load or the browser froze. Now this behaviour has changed: there seems to be a limit on the data points that Grafana visualizes/retrieves. To make this clearer, see the example below:
Time interval with 20s GROUP BY: both curves entirely visible with this group-by value.
Same time interval with 10s GROUP BY: one curve entirely visible, the other cut off after some points.
Same time interval with 10s GROUP BY, first curve toggled invisible: the second curve is entirely visible again.
I suspect that Grafana is now sending a LIMIT to InfluxDB. I found this bit of information for the Grafana & Graphite combination (http://docs.grafana.org/installation/performance/), but nothing for Grafana & InfluxDB.
So where has this been changed, and is there a way to set this limit manually? The limit I am experiencing at the moment is most probably 10000, because I can limit my queries with values below 10000, but any higher value does not return more points. Any documentation of this new default limit would be very much appreciated.

Realtime alerts for page speed

I'm looking for a tool that will send me an alert for page load time.
Think of a downtime alert (e.g. Pingdom), but one that sends alerts once a page load time increases above a certain threshold. E.g.: alert that page X has taken longer than 7 seconds consistently for 30 minutes.
I know of a lot of tools that give you historical records and page speed metrics after the fact, or give you Apdex scores, but nothing that alerts around speed thresholds.
Does anyone know of such a tool?
Almost all website monitoring services have alerts for when the response time is above a certain threshold. Your question, however, is a bit more specific, since you have a time frame (30 min). Depending on the service used and the monitoring frequency, during a 30-minute period you might have between 1 and 30 tests. Do you want an alert if all of those tests are above 7 seconds, or if the average response time is above 7 seconds?
I can speak for WebSitePulse, where you can receive an alert if one or more tests in a row have detected a problem, or if the page-load time is within certain limits.
GTmetrix.com offers daily alerts for YSlow and PageSpeed scores, as well as great breakdowns and grades for specific ticket items. It has a great freemium business model as well: free for 3 sites.
Upgraded versions include videos of your site loading.
Source: Just used it for my company's site.
