Small question regarding Spring Boot, some of the useful default metrics, and how to properly use them in Grafana please.
Currently with a Spring Boot 2.5.1+ (question applicable to 2.x.x.) with Actuator + Micrometer + Prometheus dependencies, there are lots of very handy default metrics that come out of the box.
I am seeing many of them with the suffixes _max, _count, and _sum.
Example, just to take a few:
spring_data_repository_invocations_seconds_max
spring_data_repository_invocations_seconds_count
spring_data_repository_invocations_seconds_sum
reactor_netty_http_client_data_received_bytes_max
reactor_netty_http_client_data_received_bytes_count
reactor_netty_http_client_data_received_bytes_sum
http_server_requests_seconds_max
http_server_requests_seconds_count
http_server_requests_seconds_sum
Unfortunately, I am not sure what to do with them or how to use them correctly, and I feel like my ignorance makes me miss out on some great application insights.
Searching on the web, I am seeing some using like this, to compute what seems to be an average with Grafana:
irate(http_server_requests_seconds_sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds_count{exception="None", uri!~".*actuator.*"}[5m])
But I am not sure if this is the correct way to use those.
May I ask what sort of queries are possible and commonly used when dealing with metrics of type _max, _count, and _sum, please?
Thank you
UPD 2022/11: Recently I've had a chance to work with these metrics myself and I made a dashboard with everything I say in this answer and more. It's available on Github or Grafana.com. I hope this will be a good example of how you can use these metrics.
Original answer:
count and sum are generally used to calculate an average. count accumulates the number of times sum was increased, while sum holds the total value of something. Let's take http_server_requests_seconds for example:
http_server_requests_seconds_sum 10
http_server_requests_seconds_count 5
With the example above one can say that there were 5 HTTP requests and their combined duration was 10 seconds. If you divide sum by count you'll get the average request duration of 2 seconds.
Having these you can create at least two useful panels: average request duration (=average latency) and request rate.
Request rate
Using the rate() or irate() function you can get how many requests there were per second:
rate(http_server_requests_seconds_count[5m])
rate() works in the following way:
Prometheus takes the samples from the given interval ([5m] in this example) and calculates the difference between the value at the current timepoint (not necessarily now) and the value [5m] earlier.
The obtained value is then divided by the number of seconds in the interval.
A short interval will make the graph look like a saw (every fluctuation will be noticeable); a long interval will make the line smoother and slower to reflect changes.
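The two steps above can be sketched in Python. This is a simplification, not how Prometheus actually implements rate(): it ignores counter resets and the extrapolation Prometheus applies at the window edges. The function name simple_rate and the scrape data are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample], now: float, window_s: float) -> float:
    """Per-second increase of a counter over the window ending at `now`.

    Simplified sketch: assumes samples are sorted by time, the counter never
    resets, and no extrapolation is applied.
    """
    in_window = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if len(in_window) < 2:
        raise ValueError("need at least two samples inside the window")
    first_value = in_window[0][1]
    last_value = in_window[-1][1]
    # Step 1: difference over the window. Step 2: divide by its length.
    return (last_value - first_value) / window_s


# Hypothetical scrapes of http_server_requests_seconds_count, one per minute.
scrapes = [(0, 100), (60, 130), (120, 160), (180, 220), (240, 250), (300, 400)]

print(simple_rate(scrapes, now=300, window_s=300))  # 1.0 (smooth, long window)
print(simple_rate(scrapes, now=300, window_s=60))   # 2.5 (spiky, short window)
```

The same data gives 1.0 requests per second over five minutes but 2.5 over the last minute, which is the smooth-versus-saw trade-off described above.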
Average Request Duration
You can proceed with
http_server_requests_seconds_sum / http_server_requests_seconds_count
but it is highly likely that you will only see a straight line on the graph. This is because values of those metrics grow too big with time and a really drastic change must occur for this query to show any difference. Because of this nature, it will be better to calculate average on interval samples of the data. Using increase() function you can get an approximate value of how the metric changed during the interval. Thus:
increase(http_server_requests_seconds_sum[5m]) / increase(http_server_requests_seconds_count[5m])
The value is approximate because under the hood increase() is rate() multiplied by [interval]. The error is insignificant for fast-moving counters (such as the request count), but be prepared to see fractional results, such as an increase of 2.5 requests.
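To see why the windowed form reacts while the lifetime ratio stays flat, here is a small Python sketch. The counter values are hypothetical and the helper average_duration is not part of any library; it just divides the two increases.

```python
def average_duration(sum_start, count_start, sum_end, count_end):
    """Average request duration inside a window, from counter values
    observed at the start and at the end of that window."""
    requests = count_end - count_start
    if requests == 0:
        return None  # no requests in the window, so the average is undefined
    return (sum_end - sum_start) / requests


# Hypothetical long-running service that suddenly slows down.
# Five minutes ago: 1,000,000 requests, 2,000,000 s in total (2 s each).
sum_before, count_before = 2_000_000.0, 1_000_000
# Now: 300 more requests that took 10 s each.
sum_now, count_now = 2_003_000.0, 1_000_300

lifetime_avg = sum_now / count_now
window_avg = average_duration(sum_before, count_before, sum_now, count_now)

print(round(lifetime_avg, 4))  # 2.0024 (barely moves)
print(window_avg)              # 10.0   (shows the slowdown)
```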
Aggregation and filtering
If you already ran one of the queries above, you have noticed that there is not one line, but many. This is due to labels; each unique set of labels that the metric has is considered a separate time series. This can be fixed by using an aggregation function (like sum()). For example, you can aggregate request rate by instance:
sum by(instance) (rate(http_server_requests_seconds_count[5m]))
This will show you a line for each unique instance label. Now if you want to see only some and not all instances, you can do that with a filter. For example, to calculate a value just for nodeA instance:
sum by(instance) (rate(http_server_requests_seconds_count{instance="nodeA"}[5m]))
Read more about selectors here. With labels you can create any number of useful panels. Perhaps you'd like to calculate the percentage of exceptions, or their rate of occurrence, or perhaps a request rate by status code, you name it.
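As a rough mental model of what sum by(...) with a label filter does, the following Python sketch groups per-series values by one label. The series data and the helper sum_by are invented for illustration and only support equality matching.

```python
from collections import defaultdict

# Hypothetical per-series request rates, each identified by its label set.
series = [
    ({"instance": "nodeA", "uri": "/users", "status": "200"}, 4.0),
    ({"instance": "nodeA", "uri": "/users", "status": "500"}, 0.5),
    ({"instance": "nodeB", "uri": "/users", "status": "200"}, 3.0),
    ({"instance": "nodeB", "uri": "/orders", "status": "200"}, 1.5),
]


def sum_by(label, all_series, **equals):
    """Sum values grouped by `label`, keeping only series whose labels
    equal every key/value pair given in `equals`."""
    totals = defaultdict(float)
    for labels, value in all_series:
        if all(labels.get(k) == v for k, v in equals.items()):
            totals[labels[label]] += value
    return dict(totals)


print(sum_by("instance", series))                    # {'nodeA': 4.5, 'nodeB': 4.5}
print(sum_by("instance", series, instance="nodeA"))  # {'nodeA': 4.5}
print(sum_by("status", series))                      # {'200': 8.5, '500': 0.5}
```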
Note on max
From what I found on the web, max shows the maximum value recorded during some interval configured in the settings (the default is 2 minutes, if the source is to be trusted). This is a somewhat uncommon metric, and whether it is useful is up to you. Since it is a Gauge (unlike sum and count, it can go both up and down), you don't need extra functions (such as rate()) to see its dynamics. Thus
http_server_requests_seconds_max
... will show you the maximum request duration. You can augment this with aggregation functions (avg(), sum(), etc) and label filters to make it more useful.
I'm trying to build a composite metric to know how many points are sent over a period for a specific metric.
The closest Stack Overflow answer to this is about counting the number of sources, and I failed to adapt it to do what I want (How can i count the total number of sources my metric has with Librato?)
The metric in question is a timing of a function execution that receives around 20k values at peak hour.
At first, I summed the series with a count aggregation, and the pattern I got was close to what I expected, but it always differs from our logs.
The composite I made was like that
sum(s("timing", "%", {function:"count"}))
Any ideas ?
Thanks
Well, Librato support told me the composite does what I want.
The differences from the logs were due to errors while sending the metrics.
I want to visualize the amount of correct auto-responses my system sent in regards to the percentage of questions it has already learned.
So my idea was to filter all my test-results where a boolean field didSendCorrectAutoResponse is true, make the bucket on the x axis over a field called learnPercentage and on the y axis simply take the count as a metric.
The only problem with this is that the values on the y-axis are absolute and only count the number of responses sent but I want it to show it as a percentage of the total number of tests per percentage learned.
Here is how I defined my chart:
I can calculate the total number of test-cases for each percentage learned with this learnPercentage: 100 && strategy.keyword: "sum" (it only counts them for 100% questions learned, but the number of tests for each percentage is the same).
So what I want on the y-axis is not the plain count but count / totalNumberOfTestCases
edit:
In order for you to better understand what I need here is what I do with my system:
Lets say I have 100 known questions my system can learn. And I have 2500 test questions. Now I do the following:
Let my system learn none of the known questions
Ask the 2500 test questions
Save how many questions have been correctly answered (let's say 600)
Save this test result in elastic
Repeat with 10 questions learned:
Let my system learn 10% of the known questions
Ask the 2500 test questions
Save how many questions have been correctly answered (let's say 590)
Save this result in elastic
Repeat with 20 questions learned...
Now I want to plot how many questions have been correctly answered in each learning step:
600 at 0%
590 at 10%
900 at 20%
...
But instead of showing these absolute numbers I want 600/2500, 590/2500 etc on the y-axis.
To show your Y axis as a percentage when it is not already one, first create a scripted field for the column you are interested in, and then visualize that scripted field in Kibana.
Check the photos; in the scripted field code, the removed part is your column name.
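The normalisation asked for in the question (count / totalNumberOfTestCases) is plain arithmetic. A minimal Python sketch with the numbers from the question follows; the variable and function names are made up, and this is not Kibana scripted-field code.

```python
# Numbers from the question: learnPercentage -> correct auto-responses.
correct_by_learn_percentage = {0: 600, 10: 590, 20: 900}
TOTAL_TEST_CASES = 2500  # the same number of test questions in every run


def as_percentage(count, total):
    """Return count expressed as a percentage of total."""
    return 100.0 * count / total


for learned, correct in sorted(correct_by_learn_percentage.items()):
    share = as_percentage(correct, TOTAL_TEST_CASES)
    print(f"{learned}% learned: {share:.1f}% answered correctly")
```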
I have a database with user subscriptions to topics.
There is currently about 20 000 topics,
20 mln users and 200 mln subscriptions stored in SQL database.
Because of its size, the database is partitioned by topics,
so I can't get the info in one database query.
There are couple of topics with 10 mln subscriptions, couple with 100 000 and others have hundreds or less.
When an event occurs, it usually matches couple of topics, so to inform users, I need to perform query like "give me all users subscribed to topics x, y, z and perform union of sets", so that one user gets the news once even if he subscribed both topics x and z.
The constraints are:
There must be no duplicates in the union set. (users can't get the content twice)
There can be bounded amount of users missing from the union set. (if sometimes user doesn't get the content, it is not that bad, but it can't be always the same user for the same topic)
It is possible to subscribe to new topic without rebuilding whole thing.
I thought about using a set of bloom filters for every topic, but their guarantees are the other way round: "user either not subscribed for sure or probably subscribed". I need something like "user subscribed for sure or probably not".
Lossy hash tables might be a good idea, but I am not sure if they can be as memory efficient as bloom filters, and I am afraid that it would always be the same user who misses the content in their topic.
Do you know any other data structures that may be good for solving this problem?
What if each user record had a BIT FIELD representing all of the topics?
TABLE Users(ID INT, UserName VARCHAR(16), Topics BINARY(8000))
A binary 8k would allow you to have 64000 topics. I would probably use multiple columns of BINARY(1024) each so I could add more topics easily.
Now when an event comes in that's tagged for topics 1, 10, 20, 30, 40.
I have to search every User, but this can be parallelized and will always be O(N), where N is the total number of users.
SELECT ID
FROM Users WITH (READPAST)
WHERE
-- SUBSTRING is 1-based, so topic n lives in byte n / 8 + 1, bit n % 8
SUBSTRING(Topics, 1 / 8 + 1, 1) & (1 * POWER(2, (1 % 8))) > 0
OR
SUBSTRING(Topics, 10 / 8 + 1, 1) & (1 * POWER(2, (10 % 8))) > 0
OR
SUBSTRING(Topics, 20 / 8 + 1, 1) & (1 * POWER(2, (20 % 8))) > 0
OR
SUBSTRING(Topics, 30 / 8 + 1, 1) & (1 * POWER(2, (30 % 8))) > 0
OR
SUBSTRING(Topics, 40 / 8 + 1, 1) & (1 * POWER(2, (40 % 8))) > 0
OPTION (MAXDOP 64)
No duplicates: we're scanning Users once, so we don't have to worry about unions.
Some users missing: the READPAST hint will skip any rows that are currently locked (being updated), so some users may be missing from the result.
Subscribe: you can [un]subscribe to a topic simply by toggling the topic's bit in the Topics column.
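The bit-field idea can also be sketched outside the database. The Python below is an illustration only: the class TopicBits and the user names are hypothetical, and it uses 0-based byte offsets, unlike SQL's SUBSTRING.

```python
class TopicBits:
    """Fixed-size bit field; bit n set means 'subscribed to topic n'."""

    def __init__(self, max_topics: int = 64000) -> None:
        # 64000 topics fit in 8000 bytes, as in the BINARY(8000) column.
        self.bits = bytearray((max_topics + 7) // 8)

    def subscribe(self, topic: int) -> None:
        self.bits[topic // 8] |= 1 << (topic % 8)

    def unsubscribe(self, topic: int) -> None:
        # Mask with 0xFF so the value stays a valid byte.
        self.bits[topic // 8] &= ~(1 << (topic % 8)) & 0xFF

    def is_subscribed(self, topic: int) -> bool:
        return bool(self.bits[topic // 8] & (1 << (topic % 8)))

    def matches_any(self, topics) -> bool:
        return any(self.is_subscribed(t) for t in topics)


users = {"alice": TopicBits(), "bob": TopicBits(), "carol": TopicBits()}
users["alice"].subscribe(10)
users["alice"].subscribe(40)  # matches two event topics, still listed once
users["bob"].subscribe(7)     # not one of the event topics
users["carol"].subscribe(30)

event_topics = [1, 10, 20, 30, 40]
# One pass over all users, so no union and no duplicates.
recipients = [name for name, bits in users.items() if bits.matches_any(event_topics)]
print(recipients)  # ['alice', 'carol']
```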
As I said in comments, a memory-based exact solution is certainly feasible.
But if you really want an approximate data structure, then what you're looking for is a size-limited set (of users for each topic) with random eviction.
You also need to compute unions rapidly on the fly when queries arrive. There's no helpful pre-computation here. If topic sets tend to repeat, you can look at caching the frequently used unions.
All the usual methods of representing a set apply. Hash tables (both closed and open), trees, and skip lists (all containing user id keys; no values required) are most likely.
If you use a closed hash table with a good hash function, pseudo-random eviction happens automatically. On collision, just replace the previous value. The problem with closed hashes is always picking a good table size for the set you need to represent. Remember that to recover the set elements, you'll have to traverse the whole table including null entries, so starting with a huge table isn't a good idea; rather, start with a small one and reorganize, growing by a factor each time so that reorganization amortizes to constant time overhead per element stored.
With the other schemes, you can literally do pseudo-random eviction when the table gets too big. The easiest way to evict fairly is to store the user ids in a table and have the size-limited set store indices. Evict by generating a random index into the table and removing that id before adding a new one.
It's also possible to evict fairly from a BST set representation by using an order statistic tree: store the number of descendants in each node. Then you can always find the n'th element in key sorted order, where n is pseudo-random, and evict it.
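A minimal sketch of the "table of ids plus indices" variant, in Python. The class name BoundedRandomSet is made up, and the swap-with-last trick is one common way to make random removal O(1); it is not taken from the answer.

```python
import random
from typing import Optional


class BoundedRandomSet:
    """Set with a maximum size; when full, a uniformly random member is
    evicted to make room for the new one."""

    def __init__(self, capacity: int, rng: Optional[random.Random] = None) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []      # member ids, in arbitrary order
        self.position = {}   # member id -> index into self.items
        self.rng = rng or random.Random()

    def __contains__(self, user_id) -> bool:
        return user_id in self.position

    def __len__(self) -> int:
        return len(self.items)

    def _remove_at(self, index: int) -> None:
        # Move the last element into the victim's slot so removal is O(1).
        victim = self.items[index]
        last = self.items[-1]
        self.items[index] = last
        self.position[last] = index
        self.items.pop()
        del self.position[victim]

    def add(self, user_id) -> None:
        if user_id in self.position:
            return  # already a member, nothing to do
        if len(self.items) >= self.capacity:
            self._remove_at(self.rng.randrange(len(self.items)))
        self.position[user_id] = len(self.items)
        self.items.append(user_id)


topic_x = BoundedRandomSet(capacity=3)
for uid in range(10):
    topic_x.add(uid)          # only 3 random survivors remain
topic_y = BoundedRandomSet(capacity=3)
for uid in (2, 3, 4):
    topic_y.add(uid)

# The union is computed on the fly when an event matches both topics.
recipients = set(topic_x.items) | set(topic_y.items)
print(len(topic_x), sorted(recipients))
```

Because the evicted member is chosen at random on every overflow, no particular user is systematically the one who misses content.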
I know you were looking for the bitwise space efficiency of a Bloom filter, but guaranteeing no false positives seems to rule that out.
This might not be the solution you were looking for, but you could utilize ElasticSearch's terms filter and have one document like this for each user:
{
"id": 12345,
"topics": ["Apache", "GitHub", "Programming"]
}
Terms filters directly answer the query "which users subscribe to at least one of these topics", and ES is very smart about how it caches and re-utilizes filters.
It wouldn't be a probabilistic data structure, but it would solve this problem very efficiently. You'd need to use the scan API to serialize the retrieval of potentially large JSON responses. If necessary you can scale this solution to billions of users spread over multiple computers and have response times like 10 - 100 milliseconds. You could also find correlations between topics (significant terms aggregation) and use ES as an engine for further analysis.
Edit: I implemented searching and scan / scroll API usage in Python and got some interesting results. I did "users who subscribe to any of these three topics" queries with the 20m users and 200m subscriptions dataset, and in general the search itself finishes in 4 - 8 milliseconds. Queries return 350,000 - 750,000 users.
Problems arise from getting user ids out of ES, even with the scan/scroll API. On a Core i5 I seem to get only 8200 users / second, so it is less than 0.5 million / minute (with "_source": false). The query itself looks like this:
{
"filtered": {
"filter": {
"terms": {
"topics": [ 123, 234, 345 ],
"execution": "plain",
"_cache": false
}
}
}
}
In production I would use "execution": "bool" so that partial query results can be cached and re-utilized by other queries. I don't know what the bottleneck is in getting the results out; the server's CPU usage is 50% and I run the client's Python script on the same machine, utilizing elasticsearch.helpers.scan.
[This solution is similar to Louis Ricci's, except inverted to the Topics table - which could make subscription updates less practical, be warned!]
(The probabilistic data structure approach is cool, but unnecessary for your current data size. I was initially looking at compressed bitsets for a non-probabilistic solution, as they are great at performing set operations in memory, but I think that's overkill as well. Here is a good implementation for this type of use case, if you're interested.)
But looking at the sparsity of your data, bitsets waste space over integer-arrays. And even with integer-arrays, the union operation is still pretty inexpensive given that you only have an average of 10,000 subscriptions per topic.
So maybe, just maybe, a dead-simple data-structure given your use-case is simply:
Topic 1 => [array of subscriber IDs]
Topic 2 => [array of subscriber IDs]
...
Topic 20,000 => [array of subscriber IDs]
Storing (avg) 10,000 subscriber IDs (assuming 32-bit integers) only requires about 40kb of space per topic.
[In an array-type or BLOB, depending on your database]
With 20,000 topics, this adds only 800mb of data to your topic table ... and very little of this (~200kb avg) needs to be loaded to memory when a notification event occurs!
Then when an average event (affecting 5 topics) occurs, all that needs to happen is:
Query / Pull the data for the relevant topics (avg 5 records) into memory
(avg ~200kb of I/O)
Dump them into a Set data structure (de-duplicates subscriber list)
Alert the subscribers in the Set.
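The three steps above can be sketched in Python. The in-memory dict stands in for the topic table, and all names and IDs are hypothetical; array("I") is used as a packed unsigned-integer array (typically 4 bytes per item).

```python
from array import array

# Stand-in for the topic table: topic id -> packed array of subscriber IDs.
topic_subscribers = {
    1: array("I", [101, 102, 103]),
    2: array("I", [102, 104]),
    3: array("I", [105]),
}


def recipients_for(event_topics, table):
    """Pull the subscriber arrays of the matching topics and union them,
    so that every subscriber appears exactly once."""
    recipients = set()
    for topic in event_topics:
        recipients.update(table.get(topic, ()))
    return recipients


# Subscriber 102 follows both topics but is alerted only once.
print(sorted(recipients_for([1, 2], topic_subscribers)))  # [101, 102, 103, 104]
```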
I'm trying to develop a rating system for an application I'm working on. Basically the app allows you to rate an object from 1 to 5 (represented by stars). But I of course know that keeping a rating count and adding the rating number itself is not feasible.
So the first thing that came to mind was dividing the received rating by the total number of ratings given. For example, if the object has received a rating of 2 from a user and it has been rated 100 times, maybe add 2/100. However, I believe this method is not good enough because 1) it is a naive approach, and 2) to get the number of times the object has been rated I have to do a lookup on the db, which might end up having time complexity O(n).
So I was wondering what alternative and possibly better ways to approach this problem?
You can keep 2 additional values in the DB: the number of times the object was rated and the total sum of all ratings. This way, to update an object's rating you only need to:
Add the new rating to the total sum.
Increment the number of times it was rated.
Divide the total sum by the number of times it was rated.
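A minimal Python sketch of this running sum and count, with no database involved. The class name RatingStats and its fields are made up for illustration; in practice the two values would be columns on the rated object.

```python
from dataclasses import dataclass


@dataclass
class RatingStats:
    """The two extra values stored per rated object."""
    total: int = 0   # sum of all ratings received
    count: int = 0   # number of ratings received

    def add(self, stars: int) -> float:
        """Record one rating and return the new average in O(1)."""
        if not 1 <= stars <= 5:
            raise ValueError("rating must be between 1 and 5")
        self.total += stars
        self.count += 1
        return self.total / self.count


stats = RatingStats()
print(stats.add(5))  # 5.0
print(stats.add(2))  # 3.5
print(stats.add(2))  # 3.0
```

No scan over past ratings is needed, which addresses the O(n) lookup concern from the question.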
There are many approaches to this, but before that, check:
Whether all feedback givers are treated equally or some have more weight than others (like a panel review, etc.)
Whether the objective is to provide only an average, or a score band or similar. Consider a scenario like this website, which shows a total reputation score
And yes, if an average is to be computed, you need to have the total and count of feedback and then compute it; that's plain maths. But if you need any other method, be prepared for more compute cycles. You will have to balance database hits against compute cycles, but that's the next stage of design. First get your requirements and approach to the solution in place.
I think you should keep separate counters for 1 star, 2 stars, and so on. To calculate the rating, you'd compute rating = (1*numOneStars+2*numTwoStars+3*numThreeStars+4*numFourStars+5*numFiveStars)/(numOneStars+numTwoStars+numThreeStars+numFourStars+numFiveStars)
This way you can, like amazon also show how many ppl voted 1 stars and how many voted 5 stars...
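The per-star counters and the weighted average can be sketched in Python as follows; the function name average_rating and the vote counts are made up for illustration.

```python
def average_rating(star_counts):
    """star_counts maps a star value (1..5) to the number of votes it got.
    Returns the weighted average, or None when there are no votes yet."""
    votes = sum(star_counts.values())
    if votes == 0:
        return None
    weighted = sum(stars * n for stars, n in star_counts.items())
    return weighted / votes


# Hypothetical vote distribution for one object.
counts = {1: 2, 2: 0, 3: 1, 4: 3, 5: 4}
print(average_rating(counts))  # (2 + 0 + 3 + 12 + 20) / 10 = 3.7
print(counts[1], counts[5])    # the histogram itself can be shown as well
```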
Have you considered a vote up/down mechanism over numbers of stars? It doesn't directly solve your problem but it's worth noting that other sites such as YouTube, Facebook, StackOverflow etc all use +/- voting as it is often much more effective than star based ratings.