Grafana difference between two datapoints - dashboard

In a Grafana dashboard with several datapoints, how can I get the difference between the last value and the previous one for the same metric?
Perhaps the tricky part is that the time between two datapoints for the same metric is not known.
So the desired result is <metric>.$current_value - <metric>.$previous_value for each point in the metric string.
Edit:
The metrics are stored in graphite/Carbon DB.
thanks

You need to use the derivative function
This is the opposite of the integral function. This is useful for taking a running total metric and calculating the delta between subsequent data points.
This function does not normalize for periods of time, as a true derivative would. Instead see the perSecond() function to calculate a rate of change over time.
Use it together with the keepLastValue function:
Takes one metric or a wildcard seriesList, and optionally a limit to the number of ‘None’ values to skip over.
Continues the line with the last received value when gaps (‘None’ values) appear in your data, rather than breaking your line.
Like this:
derivative(keepLastValue(your_metric))
A good example can be found here http://www.perehospital.cat/blog/graphite-getting-derivative-to-work-with-empty-data-points
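To make the combination concrete, here is a minimal Python sketch of what the two functions do to a series with gaps. It is a simplified model written for illustration, not Graphite's own code: the helper names keep_last_value and derivative are hypothetical, gaps are modelled as None, and the optional gap limit of keepLastValue is left out.

```python
from typing import List, Optional

Series = List[Optional[float]]


def keep_last_value(series: Series) -> Series:
    """Replace each None with the most recent non-None value (no gap limit)."""
    filled: Series = []
    last: Optional[float] = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def derivative(series: Series) -> Series:
    """Delta between consecutive points; None where either neighbour is missing."""
    deltas: Series = []
    previous: Optional[float] = None
    for value in series:
        if value is None or previous is None:
            deltas.append(None)
        else:
            deltas.append(value - previous)
        previous = value
    return deltas


# A running total that was reported at irregular times (None = no datapoint).
raw = [10.0, None, None, 13.0, 18.0, None, 20.0]

print(derivative(raw))                   # gaps wipe out most of the deltas
print(derivative(keep_last_value(raw)))  # gaps become zero-change steps
```

Without keepLastValue almost every delta is lost because one of the two neighbours is missing; with it, the delta shows up at the point where the next real value arrives.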

Related

SpringBoot - observability on *_max *_count *_sum metrics

Small question regarding Spring Boot, some of the useful default metrics, and how to properly use them in Grafana please.
Currently with a Spring Boot 2.5.1+ (question applicable to 2.x.x.) with Actuator + Micrometer + Prometheus dependencies, there are lots of very handy default metrics that come out of the box.
I am seeing a great many of them with the suffixes _max, _count and _sum.
Example, just to take a few:
spring_data_repository_invocations_seconds_max
spring_data_repository_invocations_seconds_count
spring_data_repository_invocations_seconds_sum
reactor_netty_http_client_data_received_bytes_max
reactor_netty_http_client_data_received_bytes_count
reactor_netty_http_client_data_received_bytes_sum
http_server_requests_seconds_max
http_server_requests_seconds_count
http_server_requests_seconds_sum
Unfortunately, I am not sure what to do with them, how to correctly use them, and feel like my ignorance makes me miss on some great application insights.
Searching on the web, I am seeing some using like this, to compute what seems to be an average with Grafana:
irate(http_server_requests_seconds_sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds_count{exception="None", uri!~".*actuator.*"}[5m])
But I am not sure if this is the correct way to use them.
May I ask what sort of queries are possible, usually used when dealing with metrics of type _max _count _sum please?
Thank you
UPD 2022/11: Recently I've had a chance to work with these metrics myself and I made a dashboard with everything I say in this answer and more. It's available on GitHub or Grafana.com. I hope this will be a good example of how you can use these metrics.
Original answer:
count and sum are generally used to calculate an average. count accumulates the number of times sum was increased, while sum holds the total value of something. Let's take http_server_requests_seconds for example:
http_server_requests_seconds_sum 10
http_server_requests_seconds_count 5
With the example above one can say that there were 5 HTTP requests and their combined duration was 10 seconds. If you divide sum by count you'll get the average request duration of 2 seconds.
Having these you can create at least two useful panels: average request duration (=average latency) and request rate.
Request rate
Using the rate() or irate() function you can get how many requests there were per second:
rate(http_server_requests_seconds_count[5m])
rate() works in the following way:
Prometheus takes samples from the given interval ([5m] in this example) and calculates the difference between the value at the current time point (not necessarily now) and the value [5m] earlier.
The obtained value is then divided by the number of seconds in the interval.
A short interval will make the graph look like a saw (every fluctuation will be noticeable); a long interval will make the line smoother and slower to reflect changes.
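The two steps above can be sketched in a few lines of Python. This is only a model of the simplified description given here: the function name simple_rate and the sample data are made up, and the real rate() additionally handles counter resets and extrapolates to the window boundaries, which this sketch ignores.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample], now: float, window: float) -> float:
    """Per-second increase over [now - window, now].

    Simplified: no counter-reset handling and no extrapolation.
    """
    in_window = [s for s in samples if now - window <= s[0] <= now]
    if len(in_window) < 2:
        raise ValueError("need at least two samples inside the window")
    first_value = in_window[0][1]
    last_value = in_window[-1][1]
    return (last_value - first_value) / window


# Hypothetical scrapes of http_server_requests_seconds_count, one per minute.
samples = [(0.0, 0.0), (60.0, 30.0), (120.0, 60.0),
           (180.0, 90.0), (240.0, 120.0), (300.0, 150.0)]

requests_per_second = simple_rate(samples, now=300.0, window=300.0)
print(requests_per_second)  # 150 requests in 300 seconds
```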
Average Request Duration
You can proceed with
http_server_requests_seconds_sum / http_server_requests_seconds_count
but it is highly likely that you will only see a straight line on the graph. Both counters accumulate over the whole lifetime of the process, so their values grow large with time and only a really drastic change would move the ratio noticeably. It is therefore better to calculate the average over interval samples of the data. Using the increase() function you can get an approximate value of how much the metric changed during the interval. Thus:
increase(http_server_requests_seconds_sum[5m]) / increase(http_server_requests_seconds_count[5m])
The value is approximate because under the hood increase() is rate() multiplied by the [interval] in seconds. The error is insignificant for fast-moving counters (such as the request count); just be prepared to see non-integer results, such as an increase of 2.5 requests.
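A small worked example shows why the windowed ratio reacts and the lifetime ratio barely moves. The numbers are invented for illustration: a service that has already handled 5000 requests averaging 2 seconds each, followed by five minutes in which 100 requests take 10 seconds each.

```python
def average(total_seconds: float, request_count: float) -> float:
    """Average duration = accumulated seconds / number of requests."""
    return total_seconds / request_count


# Counter values 5 minutes ago and now (hypothetical).
sum_before, count_before = 10_000.0, 5_000.0
sum_after, count_after = 11_000.0, 5_100.0

# Equivalent of: sum / count
lifetime_avg = average(sum_after, count_after)

# Equivalent of: increase(sum[5m]) / increase(count[5m])
window_avg = average(sum_after - sum_before, count_after - count_before)

print(f"lifetime average: {lifetime_avg:.2f} s")
print(f"5-minute average: {window_avg:.2f} s")
```

The lifetime average only creeps from 2 s to about 2.16 s, while the 5-minute average shows the real 10 s latency.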
Aggregation and filtering
If you already ran one of the queries above, you have noticed that there is not one line, but many. This is due to labels; each unique set of labels that the metric has is considered a separate time series. This can be fixed by using an aggregation function (like sum()). For example, you can aggregate request rate by instance:
sum by(instance) (rate(http_server_requests_seconds_count[5m]))
This will show you a line for each unique instance label. Now if you want to see only some and not all instances, you can do that with a filter. For example, to calculate a value just for the nodeA instance:
sum by(instance) (rate(http_server_requests_seconds_count{instance="nodeA"}[5m]))
Read more about selectors here. With labels you can create any number of useful panels. Perhaps you'd like to calculate the percentage of exceptions, or their rate of occurrence, or perhaps a request rate by status code, you name it.
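If it helps to see what the aggregation and the filter do to the individual series, here is a rough Python model. The label sets and values are invented, and sum_by is a hypothetical helper that only mimics sum by(label) with equality matchers; it is not how Prometheus evaluates queries.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

LabelSet = Dict[str, str]

# One entry per time series: (labels, current per-second rate). Made-up data.
series: List[Tuple[LabelSet, float]] = [
    ({"instance": "nodeA", "uri": "/orders", "status": "200"}, 4.0),
    ({"instance": "nodeA", "uri": "/orders", "status": "500"}, 0.5),
    ({"instance": "nodeB", "uri": "/orders", "status": "200"}, 3.0),
    ({"instance": "nodeB", "uri": "/users", "status": "200"}, 1.5),
]


def sum_by(label: str,
           data: List[Tuple[LabelSet, float]],
           matchers: Optional[LabelSet] = None) -> Dict[str, float]:
    """Keep series whose labels equal all matchers, then sum per label value."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in data:
        if matchers and any(labels.get(k) != v for k, v in matchers.items()):
            continue  # filtered out, like {instance="nodeA"} would do
        totals[labels[label]] += value
    return dict(totals)


print(sum_by("instance", series))                         # one line per instance
print(sum_by("instance", series, {"instance": "nodeA"}))  # only nodeA
print(sum_by("status", series))                           # request rate by status code
```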
Note on max
From what I found on the web, max shows the maximum value recorded during some interval configured in the settings (the default is 2 minutes, if the source is to be trusted). This is a somewhat uncommon metric, and whether it is useful is up to you. Since it is a Gauge (unlike sum and count, it can go both up and down), you don't need extra functions (such as rate()) to see its dynamics. Thus
http_server_requests_seconds_max
... will show you the maximum request duration. You can augment this with aggregation functions (avg(), sum(), etc) and label filters to make it more useful.

Statistics/Algorithm: How do I compare a weekly graph with its own history to see when in the past it was almost the same?

I’ve got a statistical/mathematical problem I’m stumped on and I was really hoping to get some help. I’m working on a research project where I need to compare a weekly graph with its own history to see when in the past it was almost the same. Think of this as “finding the closest match”. The information is displayed as a line graph, but it’s readily available as raw data:
Date        Result
08/10/18    52.5
08/07/18    60.2
08/06/18    58.5
08/05/18    55.4
08/04/18    55.2
and so on...
What I really want is the output to be a form of correlation between the current data points with the other set of 5 concurrent data points in history. So, something like:
Date range           Correlation
07/10/18-07/15/18    0.98
We’ll be having code written in Python so the software does this automatically (so that as new data is added, it automatically runs and finds the closest set of numbers to match the current one).
Here’s where the difficulty sets in: Since numbers are on a general upward trend over time, we don’t want it to compare the absolute value (since the numbers might never really match). One suggestion has been to compare the delta (rate of change as a percentage over the previous day), or using a log scale.
I’m wondering: how do I go about this? What kind of calculation can I use to get the desired results? I’ve looked at the different kinds of correlation equations, but they don’t account for the “shape” of the data, and they generally just average it out. The shape of the line chart is the important thing.
Thanks very much in advance!
I would simply divide the data of each week by their average (i.e., normalize them to an average of 1), then sum the squares of the differences of each day of each pair of weeks. This sum is what you want to minimize.
If you don't care about how much a graph oscillates relative to its mean, you can also normalize the variance. For each week, calculate the mean and variance, then subtract the mean and divide by the square root of the variance. Each week will then have mean 0 and variance 1. Then minimize the sum of squares of differences as before.
If the normalization of data is all you can change in your workflow, just leave out the sum of squares of differences minimization part.
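Since the question mentions Python, here is a minimal sketch of both variants using only the standard library. The function names, the sliding-window search and the sample numbers are my own assumptions for illustration; it also assumes the history passed in does not contain the current window itself.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def normalize_mean(window: Sequence[float]) -> List[float]:
    """Scale a window so that its mean becomes 1."""
    m = mean(window)
    return [x / m for x in window]


def standardize(window: Sequence[float]) -> List[float]:
    """Shift and scale a window to mean 0 and variance 1.

    Raises ZeroDivisionError for a perfectly flat window.
    """
    m = mean(window)
    s = pstdev(window)  # square root of the (population) variance
    return [(x - m) / s for x in window]


def sum_sq_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared day-by-day differences between two windows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_window(history: Sequence[float],
                   current: Sequence[float],
                   normalize: Callable[[Sequence[float]], List[float]] = normalize_mean
                   ) -> Tuple[int, float]:
    """Return (start index, score) of the historical window closest in shape."""
    size = len(current)
    target = normalize(current)
    best_start, best_score = -1, float("inf")
    for start in range(len(history) - size + 1):
        candidate = normalize(history[start:start + size])
        score = sum_sq_diff(candidate, target)
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Made-up data: the current week is 5x the values found at history[5:10].
history = [10, 11, 13, 12, 14, 20, 30, 25, 35, 30, 40, 41, 39, 42, 40]
current = [100, 150, 125, 175, 150]

start, score = closest_window(history, current)
print(start, score)
```

Passing normalize=standardize switches to the second variant, which also ignores how strongly each week oscillates around its mean.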

Kibana graphing just the difference of a metric instead of total

I was just wondering if anybody knew of a way to show a graph of the difference of a metric like system.network.in.bytes.
If you look at this graph you can see that the value continuously gets bigger (at around the same speed), but I just want to graph the difference between consecutive values, not the total.
Example
Anyone have any ideas?
Try a timeseries visualization or timelion.
Assuming your field name is 'bytesIn' (for simplicity) and taking 1 minute intervals (as IMO 30s isn't possible in timelion), your timelion expression should look something like:
.es(*,metric='avg:bytesIn').subtract(.es(*,metric='avg:bytesIn',offset='-1m'))
Explanation
.es(*,metric='avg:bytesIn') gives average of bytesIn over a time interval (here I'm assuming 1m)
Adding offset='-1m' shifts the series retrieval by one minute, so values from one minute earlier are plotted as if they were happening now
.subtract just subtracts value of one series from another

Kalman filter, multiple lines tracking

I have a problem with multiple lines tracking by using Kalman filter.
Input data - number of items and set of structures with x1,y1, x2,y2 (coordinates). For each iteration the number of items can be different so some lines can appear or disappear.
For a single line it looks simple - we have input data, equations etc. and we can create output. We always know the line can exist. If not, and it appears later, it will still be the same line.
But for multiple lines I don't know how to start - in one iteration I can get a few objects - ok, I will use this set of equations for each of them. But in the next iteration I can get fewer or more lines. I'm not sure what the correct approach is - I have data from the previous iteration, but I need to use it for the same object. So:
1. I need to find it - by checking the distance between midpoints for each pair (previous estimate <-> line N) and choosing the smallest value? Is this the correct approach, or is there a different method?
2. Storing old data - a line was visible for a long time but after the next iteration never appears again. I get a new line and the same thing happens. It would be good to store old results, but then, after a long time, I will have a lot of zombie data. Are there special criteria for cleaning it up, or do I need to use my own ideas, like a maximum number of iterations without detection?

Estimating number of results in Google App Engine Query

I'm attempting to estimate the total amount of results for app engine queries that will return large amounts of results.
In order to do this, I assigned a random floating point number between 0 and 1 to every entity. Then I executed the query for which I wanted to estimate the total results with the following 3 settings:
* I ordered by the random numbers that I had assigned in ascending order
* I set the offset to 1000
* I fetched only one entity
I then plugged the entity's random value that I had assigned for this purpose into the following equation to estimate the total results (since I used 1000 as the offset above, the value of OFFSET would be 1000 in this case):
1 / RANDOM * OFFSET
The idea is that since each entity has a random number assigned to it, and I am sorting by that random number, the entity's random number assignment should be proportionate to the beginning and end of the results with respect to its offset (in this case, 1000).
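The idea can be tried out locally with a small simulation. This is only a sketch with an in-memory list standing in for the datastore; the total of 60000 entities, the fixed seed and the function name estimate_total are arbitrary choices for illustration.

```python
import random
from typing import List


def estimate_total(random_values: List[float], offset: int) -> float:
    """Apply 1 / RANDOM * OFFSET to the entity found after skipping `offset` results."""
    ordered = sorted(random_values)  # order by the assigned random number, ascending
    sampled = ordered[offset]        # skip `offset` entities, fetch one
    return 1 / sampled * offset


rng = random.Random(42)              # fixed seed so runs are repeatable
true_total = 60_000                  # assumed size of the full result set
values = [rng.random() for _ in range(true_total)]

for offset in (1000, 5000, 10000):
    estimate = estimate_total(values, offset)
    error = (estimate - true_total) / true_total
    print(offset, round(estimate), f"{error:+.1%}")
```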
The problem I am having is that the estimates I am getting are too low, and the lower the offset, the lower the estimates. I had anticipated that the lower the offset, the less accurate the estimate would be, but I thought that the margin of error would fall both above and below the actual number of results.
Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000. But then the predictions predictably follow a fourth-degree polynomial (y = -5E-15x^4 + 7E-10x^3 - 3E-05x^2 + 0.3781x + 51608).
Am I making a mistake here, or does the standard python random number generator not distribute numbers evenly enough for this purpose?
Thanks!
Edit:
It turns out that this problem is due to my mistake. In another part of the program, I was grabbing entities from the beginning of the series, doing an operation, then re-assigning the random number. This resulted in a denser distribution of random numbers towards the end.
I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results is different from above). I found that this idea can be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close together. When I did a scatter chart in Excel, I expected the accuracy of the predictions at each offset to form a "cloud", meaning that offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. This is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.
This is a chart of the maximum after each offset. For example the maximum error for any offset after 10000 was less than 1%:
When using GAE it makes a lot more sense not to try to do large amounts of work on reads - it's built and optimized for very fast request turnarounds. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.
If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities. You can seed this using a map reduce job to get the initial count.
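In case the sharded counter idea is unfamiliar, here is a plain in-memory Python sketch of the principle. It deliberately does not use any App Engine API: in a real application each shard would be a separate datastore entity updated in its own transaction, and the class name and shard count below are just illustrative assumptions.

```python
import random
from typing import List, Optional


class ShardedCounter:
    """Spread increments over several shards; the total is the sum of all shards."""

    def __init__(self, num_shards: int = 20, rng: Optional[random.Random] = None) -> None:
        self._shards: List[int] = [0] * num_shards
        self._rng = rng or random.Random()

    def increment(self, amount: int = 1) -> None:
        # Picking a random shard spreads write contention across shards.
        index = self._rng.randrange(len(self._shards))
        self._shards[index] += amount

    def total(self) -> int:
        return sum(self._shards)


counter = ShardedCounter(num_shards=5)
for _ in range(1000):       # e.g. call increment() whenever an entity is created
    counter.increment()
print(counter.total())
```

Reading the count then only requires fetching the handful of shards, not the entities themselves.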
If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.
If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.
If you tell us the query you're trying to run, there might be someone who has a better idea.
Some quick thoughts:
Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your entity set very frequently.
http://code.google.com/appengine/docs/python/datastore/stats.html
[EDIT1.]
I did some math, and I think the estimation method you proposed here could be rephrased as an "order statistic" problem.
http://en.wikipedia.org/wiki/Order_statistic#The_order_statistics_of_the_uniform_distribution
For example:
If the actual number of entities is 60000, the question becomes: "What is the probability that your 1000th [2000th, 3000th, ...] sample falls in the interval [l, u], so that the estimated total number of entities based on this sample has an acceptable error relative to 60000?"
If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806]
I think the probability won't be very large.
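For completeness, the interval above can be reproduced with a couple of lines of Python. The function name is made up; it simply inverts the estimate OFFSET / RANDOM for the largest and smallest totals that are still within the tolerance.

```python
from typing import Tuple


def acceptable_interval(true_total: int, offset: int, tolerance: float) -> Tuple[float, float]:
    """Range of sampled random values whose estimate is within +/- tolerance."""
    lower = offset / (true_total * (1 + tolerance))  # gives the highest accepted estimate
    upper = offset / (true_total * (1 - tolerance))  # gives the lowest accepted estimate
    return lower, upper


l, u = acceptable_interval(true_total=60000, offset=1000, tolerance=0.05)
print(l, u)
```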
This doesn't directly deal with the calculations aspect of your question, but would using the count attribute of a query object work for you? Or have you tried that out and it's not suitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.
http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count

Resources