Counting events - d3.js

I'm using Cube and Cubism. It's perfect, except for one thing... I need to display the total events numerically. E.g. I have a metric showing API calls per 10 seconds; I need to know the total number of API calls.
Is there anything built-in that I'm missing?
I thought about adding a (Mongo) count in the evaluator, but events expire so that wouldn't work.
Keeping track of the running total client-side and including it in the event could be an option, but the sources are distributed and the events are not monotonic, so a simple sum over the last 10 seconds won't work. I would need to be able to query 'get the last event for each distinct source'. Is that possible?
I have a lot of metrics, so I really want to keep the number of client requests to a minimum. If I could get e.g. a cumulative total alongside the value in the standard metric query, I'd be happy.
EDIT
I was missing something... using sum and a large (e.g. 1 day) step works.
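For reference, a rough sketch of what that metric query against Cube's evaluator looks like (host, port, metric name, and time range are illustrative; Cube expresses the step in milliseconds, so 864e5 is one day):
GET http://localhost:1081/1.0/metric?expression=sum(api_call)&start=2013-01-01&stop=2013-01-02&step=864e5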

Related

SpringBoot - observability on *_max *_count *_sum metrics

A small question regarding Spring Boot, some of its useful default metrics, and how to use them properly in Grafana, please.
With Spring Boot 2.5.1+ (the question applies to any 2.x.x) and the Actuator + Micrometer + Prometheus dependencies, there are lots of very handy default metrics that come out of the box.
I am seeing many of them following the pattern _max _count _sum.
Examples, just to take a few:
spring_data_repository_invocations_seconds_max
spring_data_repository_invocations_seconds_count
spring_data_repository_invocations_seconds_sum
reactor_netty_http_client_data_received_bytes_max
reactor_netty_http_client_data_received_bytes_count
reactor_netty_http_client_data_received_bytes_sum
http_server_requests_seconds_max
http_server_requests_seconds_count
http_server_requests_seconds_sum
Unfortunately, I am not sure what to do with them or how to use them correctly, and I feel like my ignorance makes me miss out on some great application insights.
Searching the web, I see people using queries like the following to compute what seems to be an average in Grafana:
irate(http_server_requests_seconds_sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds_count{exception="None", uri!~".*actuator.*"}[5m])
But I am not sure if this is the correct way to use those metrics.
May I ask what sorts of queries are possible and usually used when dealing with metrics of type _max _count _sum, please?
Thank you
UPD 2022/11: Recently I've had a chance to work with these metrics myself and I made a dashboard with everything I say in this answer and more. It's available on Github or Grafana.com. I hope this will be a good example of how you can use these metrics.
Original answer:
count and sum are generally used to calculate an average. count accumulates the number of times sum was increased, while sum holds the total value of something. Let's take http_server_requests_seconds for example:
http_server_requests_seconds_sum 10
http_server_requests_seconds_count 5
With the example above one can say that there were 5 HTTP requests and their combined duration was 10 seconds. If you divide sum by count you'll get the average request duration of 2 seconds.
Having these you can create at least two useful panels: average request duration (=average latency) and request rate.
Request rate
Using the rate() or irate() function you can get how many requests there were per second:
rate(http_server_requests_seconds_count[5m])
rate() works in the following way:
Prometheus takes the samples in the given interval ([5m] in this example) and calculates the difference between the value at the current time point (not necessarily now) and the value [5m] ago.
The obtained value is then divided by the number of seconds in the interval.
A short interval will make the graph look like a saw (every fluctuation is noticeable); a long interval will make the line smoother but slower to reflect changes.
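As a purely illustrative calculation: if http_server_requests_seconds_count read 1000 at the start of the 5-minute window and 1150 at the end, then
rate(http_server_requests_seconds_count[5m]) = (1150 - 1000) / 300 s = 0.5 requests per second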
Average Request Duration
You can proceed with
http_server_requests_seconds_sum / http_server_requests_seconds_count
but it is highly likely that you will only see a straight line on the graph. This is because the values of those metrics grow too large over time, and a really drastic change must occur for this query to show any difference. Because of this, it is better to calculate the average over interval samples of the data. Using the increase() function you can get an approximate value of how much the metric changed during the interval. Thus:
increase(http_server_requests_seconds_sum[5m]) / increase(http_server_requests_seconds_count[5m])
The value is approximate because under the hood increase() is just rate() multiplied by [interval]. The error is insignificant for fast-moving counters (such as the request rate); just be prepared that it can report an increase of, say, 2.5 requests.
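As a purely illustrative example: if over the last 5 minutes the sum grew by 30 seconds and the count grew by 60 requests, the query above reports 30 / 60 = 0.5 s average request duration for that window.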
Aggregation and filtering
If you have already run one of the queries above, you may have noticed that there is not one line, but many. This is due to labels; each unique set of labels on the metric is considered a separate time series. This can be fixed by using an aggregation function (like sum()). For example, you can aggregate the request rate by instance:
sum by(instance) (rate(http_server_requests_seconds_count[5m]))
This will show you a line for each unique instance label. Now if you want to see only some instances and not all of them, you can do that with a filter. For example, to calculate a value just for the nodeA instance:
sum by(instance) (rate(http_server_requests_seconds_count{instance="nodeA"}[5m]))
Read more about selectors in the Prometheus documentation. With labels you can create any number of useful panels. Perhaps you'd like to calculate the percentage of exceptions, or their rate of occurrence, or perhaps a request rate by status code; you name it.
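For example (assuming the standard status and exception labels shown earlier; the groupings are only illustrative), a request rate per status code and a rough exception percentage could look like:
sum by(status) (rate(http_server_requests_seconds_count[5m]))
100 * sum(rate(http_server_requests_seconds_count{exception!="None"}[5m])) / sum(rate(http_server_requests_seconds_count[5m]))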
Note on max
From what I found on the web, max shows the maximum value recorded during some interval set in the configuration (the default is 2 minutes, if the source is to be trusted). This is a somewhat uncommon metric, and whether it is useful is up to you. Since it is a Gauge (unlike sum and count, it can go both up and down), you don't need extra functions (such as rate()) to see its dynamics. Thus
http_server_requests_seconds_max
... will show you the maximum request duration. You can augment this with aggregation functions (avg(), sum(), etc.) and label filters to make it more useful.
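For instance, one possible (illustrative) aggregation is the maximum duration per URI:
max by(uri) (http_server_requests_seconds_max)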

graphite/grafana alert on message queue

We have the wait time of a message in a queue being fed into Graphite, recorded with Coda Hale's Metrics library, and Grafana graphing that time. I'm trying to come up with a sensible alert based on the wait time, but I'm seeing constant alerts. It's possible that they are legitimate, but I'm having a hard time finding examples to compare against, and I wonder if the alert is doing what I think it's doing.
I would appreciate feedback and any advice from someone who is doing anything similar. What I’m looking to alert on right now is when a message has been waiting longer than the 98th percentile of the recorded message wait time. My queries are
A: aliasByNode(...waitTime.p98, 9)
B: timeShift(...waitTime.p98, '5min'),
C: asPercent(#B, #A)
My alert is
when avg() of query(C, 5m, now) is above 100
It sounds like you want to use movingAverage(...waitTime.p98, '1hour') rather than timeShift(...waitTime.p98, '5min').
That will give you a final query of asPercent(...waitTime.p98, movingAverage(...waitTime.p98, '1hour')), which computes the value for each interval as a percentage of the average over the previous 1-hour rolling window. When you then take a 5-minute average over that in the alert, you'll get a value that is effectively the average of the last 5 minutes compared to the average of the last 1 h 5 min.
Another way to do this is to use asPercent(movingAverage(...waitTime.p98, '5min'), movingAverage(...waitTime.p98, '1hour')) and just take the current value. It's not 100% identical, but it will be easier to reason about because you can plot it and see exactly what the alert is firing on.
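Written in the same A/B/C layout as the original queries (keeping the metric path elided as '...', and assuming Grafana's classic last() reducer), that second variant could look roughly like:
A: asPercent(movingAverage(...waitTime.p98, '5min'), movingAverage(...waitTime.p98, '1hour'))
alert: when last() of query(A, 5m, now) is above 100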

How to structure an efficient GeoLocation and Time query

Hey rethinkers!
I've got a query optimization question that I can't quite figure out. It deals with geolocation and time. I've got a ton of events that all have a startTime, endTime (indexed), and location (indexed). If I want to get the events near a certain location that haven't happened yet, I can do it one of two ways:
I can get and filter all the events that haven't happened yet based on the end time, then calculate the location of all those events and only return the ones within the specified radius.
I can use the getNearest() command (which would also return expired events) and then filter out the events that have already ended. My one worry with this approach is that getNearest() specifies how many results to return, but I essentially need all of them within the given radius so I don't miss any events that haven't happened yet.
I'm just unsure how to figure out the fastest/most efficient query for this.
The best option to me would seem to be to filter and get all the events that haven't happened yet, then use getNearest() to take advantage of the indexes. But I can't call getNearest() on a filtered set. Please help!?!?!
For getting all events within a radius, I recommend using getIntersecting() together with r.circle.
That is not only more efficient than getNearest, but also doesn't have any limit on the number of returned documents.
You might need to make the radius you pass into r.circle slightly larger, to account for the fact that the generated polygon will be slightly smaller than the specified radius between the vertices.
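A rough sketch of that combination in JavaScript ReQL (table, index, and field names come from the question; the coordinates, radius, and connection handling are illustrative, and endTime is assumed to be stored as a ReQL time):
r.table('events')
  .getIntersecting(
    r.circle([-122.42, 37.77], 5500, {unit: 'm'}),  // pass a slightly larger radius than you actually need
    {index: 'location'}
  )
  .filter(r.row('endTime').gt(r.now()))             // keep only events that haven't ended yet
  .run(conn, function(err, cursor) { /* consume the cursor */ });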

Parse: limitations of count()

Anyone who's read the Parse documentation has stumbled upon this:
Caveat: Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
Why is there such a limitation and inaccuracy?
To quote the Parse Engineering Blog Post: Building Scalable Apps on Parse
Suppose you are building a product catalog. You might want to display
the count of products in each category on the top-level navigation
screen. If you run a count query for each of these UI elements, they
will not run efficiently on large data sets because MongoDB does not
use counting B-trees. Instead, we recommend that you use a separate
Parse Object to keep track of counts for each category. Whenever a
product gets added or deleted, you can increment or decrement the
counts in an afterSave or afterDelete Cloud Code handler.
To add on to this, here is another quote by Hector Ramos from the Parse Developers Google Group
Count queries have always been expensive once you throw some
constraints in. If you only care about the total size of the
collection, you can run a count query without any constraints and that
one should be pretty fast, as getting the total number of records is a
different problem than counting how many of these match an arbitrary
list of constraints. This is just the reality of working with database
systems.
The inaccuracy is not due to the 1000 request object limit. The count query will try to get the total number of records regardless of size, but since the operation may take a large amount of time to complete, it is possible that the database has changed during that window and the count value that is returned may no longer be valid.
The recommended way to handle counts is to essentially maintain your own index using before/after save hooks. However, this is also a non-ideal solution because save hooks can arbitrarily fail part way through and (worse) postSave hooks have no error propagation.
The limitation is simply there to stop people from using counts too much; in effect, they are just as costly at runtime as full queries.
The inaccuracy is because queries are limited to 1000 result objects (100 by default) and counts have the same hard limit.
You can run a recursive query to build up a count, but it's a crappy option. Hence the only really good option at this point in time (and as far as we can see in the future) is to keep an index of the things you're interested in counting and update the counts when anything changes. You would usually do that with save hooks in cloud code.
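As a rough illustration of that counter pattern (the ProductCount class and field names are made up for the example, and error handling is omitted), the Cloud Code hooks could look something like this:
// Keep one ProductCount object per category, adjusted whenever a Product is created or deleted.
Parse.Cloud.afterSave("Product", function(request) {
  var query = new Parse.Query("ProductCount");
  query.equalTo("category", request.object.get("category"));
  query.first().then(function(counter) {
    if (!counter) {
      counter = new Parse.Object("ProductCount");
      counter.set("category", request.object.get("category"));
      counter.set("count", 0);
    }
    if (!request.object.existed()) {   // only count newly created products, not updates
      counter.increment("count");
      return counter.save();
    }
  });
});

Parse.Cloud.afterDelete("Product", function(request) {
  var query = new Parse.Query("ProductCount");
  query.equalTo("category", request.object.get("category"));
  query.first().then(function(counter) {
    if (counter) {
      counter.increment("count", -1);  // decrement on delete
      return counter.save();
    }
  });
});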

Getting the most frequent items without counting every item

I was wondering if there is an algorithm for finding the "most frequent items" without having to keep a count of each item. For example, let's say I was a search engine and wanted to keep track of the 10 most popular searches. What I don't want to do is keep a counter for every query, since there could be too many queries for me to count (and most of them will be singletons). Is there a simple algorithm for this? Maybe something probabilistic? Thanks!
Well, if you have a very large number of queries (like a search engine presumably would), then you could just do "sampling" of queries. So you might be getting 1,000 queries per second, but if you only count one per second, then over a longish period of time you'd get an answer that is relatively close to the "real" answer.
This is how, for example, a "sampling" profiler works. Every n milliseconds it looks at what function is currently being executed. Over a long period of time (several seconds) you get a good idea of the "expensive" functions, because they're the ones that appear in your samples more often.
You still have to do "counting", but by taking periodic samples instead of counting every single query, you put an upper bound on the amount of data you actually have to store (e.g. a maximum of one query per second).
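A minimal sketch of that sampling approach in JavaScript (the interval and names are made up for illustration):
function makeSampler(intervalMs) {
  var lastSampleAt = 0;
  var counts = new Map();                              // query -> number of sampled hits
  return {
    onQuery: function(query) {
      var now = Date.now();
      if (now - lastSampleAt < intervalMs) return;     // not sampled this time
      lastSampleAt = now;
      counts.set(query, (counts.get(query) || 0) + 1);
    },
    top: function(n) {
      return Array.from(counts.entries())
        .sort(function(a, b) { return b[1] - a[1]; })  // most sampled first
        .slice(0, n);
    }
  };
}
var sampler = makeSampler(1000);                       // count at most one query per second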
If you want the most frequent searches at any given time, you don't need endless counters keeping track of every query submitted. Instead, you need an algorithm that measures the number of submissions for any given query over a set period of time. This is a pretty simple algorithm. Any search submitted to your search engine, for example the word “cache”, is stored for a fixed period of time called the refresh rate (the length of your refresh rate depends on the kind of traffic your search engine is getting and the number of “top results” you want to keep track of). If the refresh-rate period expires and searches for the word “cache” have not persisted, the query is deleted from memory. If searches for the word “cache” do persist, your algorithm only needs to keep track of the rate at which the word “cache” is being searched. To do this, simply store all searches in a “leaky counter”: every entry is pushed onto the counter with an expiration date, after which the query is deleted. Your active counters are the indicators of your top queries.
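A minimal JavaScript sketch of that leaky-counter idea (the refresh rate and structure are illustrative, not from any particular library):
function makeLeakyCounter(refreshRateMs) {
  var entries = new Map();                               // query -> { count, expiresAt }
  function prune(now) {
    entries.forEach(function(entry, query) {
      if (entry.expiresAt < now) entries.delete(query);  // the query did not persist; forget it
    });
  }
  return {
    record: function(query) {
      var now = Date.now();
      var entry = entries.get(query) || { count: 0, expiresAt: 0 };
      entry.count += 1;
      entry.expiresAt = now + refreshRateMs;             // every new search pushes the expiry forward
      entries.set(query, entry);
    },
    top: function(n) {
      prune(Date.now());
      return Array.from(entries.entries())
        .sort(function(a, b) { return b[1].count - a[1].count; })
        .slice(0, n);
    }
  };
}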
Storing each and every query would be expensive, yet necessary to ensure the top 10 are actually the top 10. You'll have to cheat.
One idea is to store a table of URLs, hit counters, and timestamps, indexed by count and then by timestamp. When the table reaches some arbitrary near-maximum size, start removing low-end entries that are older than a given number of days. Although old, infrequent queries won't be counted, the queries likely to make the top 10 should make it onto the table because of their faster query rate.
Another idea would be to write a 16-bit (or larger) hash function for search queries. Have a 65536-entry table holding counters and URLs. When a search is performed, increment the respective table entry and set the URL if necessary. However, this approach has a major drawback. A spam bot could make repeated queries like "cheap viagra", possibly causing legitimate queries to increment the spam query's counter instead, placing its messages on your main page.
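A small JavaScript sketch of that hash-table variant (the hash function and 16-bit table size are illustrative; note the collision/spam caveat just mentioned):
var TABLE_SIZE = 65536;                                // 16-bit hash space
var table = new Array(TABLE_SIZE).fill(null);          // each slot: { query, count }

function hash16(s) {
  var h = 0;
  for (var i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) & 0xffff;           // keep only the low 16 bits
  }
  return h;
}

function recordSearch(query) {
  var slot = hash16(query);
  if (table[slot]) {
    table[slot].count += 1;                            // colliding queries share this counter
    table[slot].query = query;                         // last writer wins for the stored text
  } else {
    table[slot] = { query: query, count: 1 };
  }
}

function topSearches(n) {
  return table.filter(Boolean)
    .sort(function(a, b) { return b.count - a.count; })
    .slice(0, n);
}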
You want a cache, of which there are many kinds; see the Wikipedia articles on cache algorithms and the aging page replacement algorithm.
