How to correctly use a Prometheus Histogram from the Java client to track size rather than latency?

I have an API that processes collections.
The execution time of this API is related to the collection size (the larger the collection, the longer it takes).
I am researching how I can do this with Prometheus, but I am unsure whether I am doing things correctly (the documentation is a bit lacking in this area).
The first thing I did was define a Summary metric to measure the execution time of the API. I am using the canonical rate(sum)/rate(count) as explained here.
Now, since I know that the latency may be affected by the size of the input, I also want to overlay the request size on the average execution time. Since I don't want to measure each possible size, I figured I'd use a histogram. Like so:
Histogram histogram = Histogram.build().buckets(10, 30, 50)
        .name("BULK_REQUEST_SIZE")
        .help("histogram of bulk sizes to correlate with duration")
        .labelNames("method", "entity")
        .register();
Note: the term 'size' does not relate to the size in bytes but to the length of the collection that needs to be processed. 2 items, 5 items, 50 items...
and in the execution I do (simplified):
@PUT
void process(Collection<Entity> entitiesToProcess, String entityName) {
    Summary.Timer t = summary.labels("PUT_BULK", entityName).startTimer();
    // process...
    t.observeDuration();
    histogram.labels("PUT_BULK", entityName).observe(entitiesToProcess.size());
}
Question:
Later when I am looking at the BULK_REQUEST_SIZE_bucket in Grafana, I see that all buckets have the same value, so clearly I am doing something wrong.
Is there a more canonical way to do it?

Your code is correct (though a lowercase name such as bulk_request_size would follow Prometheus naming conventions better).
The problem is likely that you've chosen suboptimal buckets: 10, 30 and 50 are pretty small if your typical collections are larger than that. Keep in mind that histogram buckets are cumulative (each le bucket counts all observations at or below its boundary), so if almost every observation falls below your lowest boundary or above your highest one, all of the BULK_REQUEST_SIZE_bucket series will show the same value. I'd try boundaries that span the collection sizes you actually see.
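For example, with the Java simpleclient you can either spell out boundaries that match the collection sizes you actually expect or let the builder generate them; the names and boundaries below are just illustrative:
import io.prometheus.client.Histogram;

class BulkMetrics {
    // Explicit boundaries covering the collection lengths we expect to see (tune to your data).
    static final Histogram BULK_SIZE = Histogram.build()
            .name("bulk_request_size")
            .help("Histogram of bulk collection sizes, to correlate with duration")
            .labelNames("method", "entity")
            .buckets(1, 5, 10, 25, 50, 100, 250, 500)
            .register();

    // Or exponentially growing boundaries: 1, 2, 4, ..., 256.
    static final Histogram BULK_SIZE_EXP = Histogram.build()
            .name("bulk_request_size_exp")
            .help("Histogram of bulk collection sizes (exponential buckets)")
            .exponentialBuckets(1, 2, 9)
            .register();
}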

Related

When to use gauge or histogram in prometheus in recording request duration?

I'm new to metric monitoring.
If we want to record the duration of requests, I think we should use a gauge, but in practice some people use a histogram.
For example, in grpc-ecosystem/go-grpc-prometheus, they prefer to use a histogram to record durations. Are there agreed best practices for the use of metric types, or is it just their own preference?
// ServerMetrics represents a collection of metrics to be registered on a
// Prometheus metrics registry for a gRPC server.
type ServerMetrics struct {
    serverStartedCounter          *prom.CounterVec
    serverHandledCounter          *prom.CounterVec
    serverStreamMsgReceived       *prom.CounterVec
    serverStreamMsgSent           *prom.CounterVec
    serverHandledHistogramEnabled bool
    serverHandledHistogramOpts    prom.HistogramOpts
    serverHandledHistogram        *prom.HistogramVec
}
Thanks~
I am new to this, but let me try to answer your question. Take my answer with a grain of salt, or maybe someone with more experience in using metrics to observe their systems will jump in.
As stated in https://prometheus.io/docs/concepts/metric_types/
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
So if your goal were to display the current value (the duration of requests), you could use a gauge. But I think the goal of using metrics is to find problems within your system, to generate alerts when certain values aren't in a predefined range, or to derive a performance value (like the Apdex score) for your system.
From https://prometheus.io/docs/concepts/metric_types/#histogram
Use the histogram_quantile() function to calculate quantiles from histograms or even aggregations of histograms. A histogram is also suitable to calculate an Apdex score.
From https://en.wikipedia.org/wiki/Apdex
Apdex (Application Performance Index) is an open standard developed by an alliance of companies for measuring performance of software applications in computing. Its purpose is to convert measurements into insights about user satisfaction, by specifying a uniform way to analyze and report on the degree to which measured performance meets user expectations.
Read up on Quantiles and the calculations in histograms and summaries https://prometheus.io/docs/practices/histograms/#quantiles
Two rules of thumb:
If you need to aggregate, choose histograms.
Otherwise, choose a histogram if you have an idea of the range and distribution of values that will be observed. Choose a summary if you need an accurate quantile, no matter what the range and distribution of the values is.
Or like Adam Woodbeck in his book "Network programming with Go" said:
The general advice is to use summaries when you don’t know the range of expected values, but I’d advise you to use histograms whenever possible
so that you can aggregate histograms on the metrics server.
The main difference between the gauge and histogram metric types in Prometheus is that Prometheus captures only a single (the last) value of a gauge metric when it scrapes the target exposing the metric, while a histogram captures all the metric values by incrementing the corresponding histogram buckets.
For example, if request duration is measured for a frequently requested endpoint and Prometheus is set up to scrape your app every 30 seconds (e.g. scrape_interval: 30s in scrape_configs), then Prometheus will only capture the duration of the last request every 30 seconds when the duration is stored in a Gauge metric. All the previous measurements of the request duration are lost.
On the other hand, any number of request duration measurements are registered in a Histogram metric, and this doesn't depend on the interval between scrapes of your app. The Histogram metric later allows obtaining the distribution of request durations over an arbitrary time range.
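For illustration, a minimal Java simpleclient sketch (metric names are made up) showing the difference: the gauge only ever exposes the last value written, while the histogram accumulates every observation into its buckets:
import io.prometheus.client.Gauge;
import io.prometheus.client.Histogram;

class DurationMetrics {
    static final Gauge LAST_DURATION = Gauge.build()
            .name("request_duration_last_seconds")
            .help("Duration of the most recent request only; earlier values are lost between scrapes")
            .register();

    static final Histogram DURATION = Histogram.build()
            .name("request_duration_seconds")
            .help("All request durations, bucketed; independent of the scrape interval")
            .buckets(0.01, 0.05, 0.1, 0.5, 1, 5)
            .register();

    static void record(double seconds) {
        LAST_DURATION.set(seconds);    // overwrites the previous value
        DURATION.observe(seconds);     // increments the matching bucket, _sum and _count
    }
}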
Prometheus histograms have some issues though:
You need to choose the number and the boundaries of histogram buckets, so they provide good accuracy for observing the distribution of the measured metric. This isn't a trivial task, since you may not know in advance the real distribution of the metric.
If the number of buckets or their boundaries change for some measurement, then the histogram_quantile() function returns invalid results over such a measurement.
Too many buckets per histogram may result in high-cardinality issues, since each bucket in the histogram creates a separate time series.
P.S. These issues are addressed in VictoriaMetrics histograms (I'm the core developer of VictoriaMetrics).
As valyala suggests, the main difference is that a histogram aggregates data, so you can take advantage of the Prometheus statistics engine over all registered samples (minimum, maximum, average, quantiles, etc.).
A gauge is better suited to measuring things like "wind velocity", "queue size", or any other kind of "instant data" where old samples matter less because you mainly want the current picture.
Using gauges for "duration of the requests" would require very small scrape periods to be accurate, which is not practical even if your request rate is not very high (if requests arrive more often than you scrape, you will lose data). So, in summary, don't use gauges here; a histogram fits your needs much better.

SpringBoot - observability on *_max *_count *_sum metrics

Small question regarding Spring Boot, some of the useful default metrics, and how to properly use them in Grafana please.
Currently with a Spring Boot 2.5.1+ (question applicable to 2.x.x.) with Actuator + Micrometer + Prometheus dependencies, there are lots of very handy default metrics that come out of the box.
I am seeing many many of them with pattern _max _count _sum.
Example, just to take a few:
spring_data_repository_invocations_seconds_max
spring_data_repository_invocations_seconds_count
spring_data_repository_invocations_seconds_sum
reactor_netty_http_client_data_received_bytes_max
reactor_netty_http_client_data_received_bytes_count
reactor_netty_http_client_data_received_bytes_sum
http_server_requests_seconds_max
http_server_requests_seconds_count
http_server_requests_seconds_sum
Unfortunately, I am not sure what to do with them, how to correctly use them, and feel like my ignorance makes me miss on some great application insights.
Searching on the web, I see some people using something like this to compute what seems to be an average with Grafana:
irate(http_server_requests_seconds_sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds_count{exception="None", uri!~".*actuator.*"}[5m])
But I am not sure if it is the correct way to use those.
May I ask what sort of queries are possible, usually used when dealing with metrics of type _max _count _sum please?
Thank you
UPD 2022/11: Recently I've had a chance to work with these metrics myself and I made a dashboard with everything I say in this answer and more. It's available on Github or Grafana.com. I hope this will be a good example of how you can use these metrics.
Original answer:
count and sum are generally used to calculate an average. count accumulates the number of times sum was increased, while sum holds the total value of something. Let's take http_server_requests_seconds for example:
http_server_requests_seconds_sum 10
http_server_requests_seconds_count 5
With the example above one can say that there were 5 HTTP requests and their combined duration was 10 seconds. If you divide sum by count you'll get the average request duration of 2 seconds.
Having these you can create at least two useful panels: average request duration (=average latency) and request rate.
Request rate
Using the rate() or irate() function you can get how many requests per second there were:
rate(http_server_requests_seconds_count[5m])
rate() works in the following way:
Prometheus takes the samples from the given interval ([5m] in this example) and calculates the difference between the current timepoint (not necessarily now) and the one [5m] ago.
The obtained value is then divided by the number of seconds in the interval.
A short interval will make the graph look like a saw (every fluctuation will be noticeable); a long interval will make the line smoother and slower to display changes.
Average Request Duration
You can proceed with
http_server_requests_seconds_sum / http_server_requests_seconds_count
but it is highly likely that you will only see a straight line on the graph. This is because the values of those metrics grow too large over time, and a really drastic change must occur for this query to show any difference. Because of this, it is better to calculate the average over interval samples of the data. Using the increase() function you can get an approximate value of how much the metric changed during the interval. Thus:
increase(http_server_requests_seconds_sum[5m]) / increase(http_server_requests_seconds_count[5m])
The value is approximate because under the hood increase() is rate() multiplied by [interval]. The error is insignificant for fast-moving counters (such as the request rate); just be prepared to see non-integer values such as an increase of 2.5 requests.
Aggregation and filtering
If you already ran one of the queries above, you have noticed that there is not one line, but many. This is due to labels; each unique set of labels that the metric has is considered a separate time series. This can be fixed by using an aggregation function (like sum()). For example, you can aggregate request rate by instance:
sum by(instance) (rate(http_server_requests_seconds_count[5m]))
This will show you a line for each unique instance label. Now if you want to see only some and not all instances, you can do that with a filter. For example, to calculate a value just for nodeA instance:
sum by(instance) (rate(http_server_requests_seconds_count{instance="nodeA"}[5m]))
Read more about selectors here. With labels you can create any number of useful panels. Perhaps you'd like to calculate the percentage of exceptions, or their rate of occurrence, or perhaps a request rate by status code, you name it.
Note on max
From what I found on the web, max shows the maximum recorded value during some interval set in settings (the default is 2 minutes, if the source is to be trusted). This is a somewhat uncommon metric, and whether it is useful is up to you. Since it is a Gauge (unlike sum and count it can go both up and down), you don't need extra functions (such as rate()) to see its dynamics. Thus
http_server_requests_seconds_max
... will show you the maximum request duration. You can augment this with aggregation functions (avg(), sum(), etc) and label filters to make it more useful.

Counting against a threshold, without knowing the actual count

I'm looking for a counting bloom filter kind of data structure that exposes two methods:
interface ModifiedCountingBloomFilter {
    void increment();              // 1. called every time a user hits the website
    boolean isThresholdBreached(); // 2. true once >= threshold increments have happened
}
e.g. let's say threshold = 100 users. I'll keep calling increment() every time a user hits my website. And I only want to know whether >=100 users have hit my website.
Requirements
At any point of time, I shouldn't be able to tell how many users actually visited my website. Just a boolean value of >=100 or <100.
Note that I'm not looking for a class implementation, but for an algorithm (like a Bloom filter) that can appropriately hide any counts < 100, i.e. one that enforces that there is no physical way of obtaining those counts.
Problems with using counting bloom filter
I'd be able to tell the difference between zero and non-zero counts.
In fact, it allows querying against any threshold; I want to be able to query only against threshold = 100.
Analogy
It's like an opaque glass of water: I should only be able to add more water to it, and I should only be able to tell when it's full (when the water overflows). Everything else should be hidden.
Any pointers would be very much appreciated.

Estimating number of results in Google App Engine Query

I'm attempting to estimate the total amount of results for app engine queries that will return large amounts of results.
In order to do this, I assigned a random floating point number between 0 and 1 to every entity. Then I executed the query for which I wanted to estimate the total results with the following 3 settings:
* I ordered by the random numbers that I had assigned in ascending order
* I set the offset to 1000
* I fetched only one entity
I then plugged the entity's random value that I had assigned for this purpose into the following equation to estimate the total results (since I used 1000 as the offset above, the value of OFFSET would be 1000 in this case):
1 / RANDOM * OFFSET
The idea is that since each entity has a random number assigned to it, and I am sorting by that random number, the entity's random value should be roughly proportional to its position within the full result set, i.e. to its offset (in this case, 1000) divided by the total count.
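To make the arithmetic concrete (numbers invented for illustration): if the entity returned at offset 1000 carries the random value 0.0125, the estimate is 1000 / 0.0125 = 80,000 total results, since roughly 1.25% of uniformly distributed random values should fall below 0.0125.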
The problem I am having is that the results I am getting are giving me low estimates, and the estimates get lower as the offset gets lower. I had anticipated that the lower the offset I used, the less accurate the estimate would be, but I thought that the margin of error would be both above and below the actual number of results.
Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000. But then the predictions closely follow a fourth-degree polynomial (y = -5E-15x^4 + 7E-10x^3 - 3E-05x^2 + 0.3781x + 51608).
Am I making a mistake here, or does the standard python random number generator not distribute numbers evenly enough for this purpose?
Thanks!
Edit:
It turns out that this problem is due to my mistake. In another part of the program, I was grabbing entities from the beginning of the series, doing an operation, then re-assigning the random number. This resulted in a denser distribution of random numbers towards the end.
I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results is different from above). I found that this idea can be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close by. When I did a scatter chart in Excel, I expected the accuracy of the predictions at each offset to form a "cloud", meaning that offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. This is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.
This is a chart of the maximum after each offset. For example the maximum error for any offset after 10000 was less than 1%:
When using GAE it makes a lot more sense not to try to do large amounts of work on reads - it's built and optimized for very fast request turnarounds. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.
If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities. You can seed this using a map reduce job to get the initial count.
If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.
If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.
If you tell us the query you're trying to run, there might be someone who has a better idea.
Some quick thoughts:
Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your entity set very frequently.
http://code.google.com/appengine/docs/python/datastore/stats.html
[EDIT1.]
I did some math; I think the estimation method you proposed here could be rephrased as an "order statistic" problem.
http://en.wikipedia.org/wiki/Order_statistic#The_order_statistics_of_the_uniform_distribution
For example:
If the actual number of entities is 60000, the question is equivalent to: "what's the probability that your 1000th [2000th, 3000th, ...] sample falls in the interval [l, u], such that the estimated total number of entities based on that sample has an acceptable error relative to 60000?"
If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806], i.e. 1000/63000 and 1000/57000, the random values that would yield estimates of 63000 and 57000 respectively.
I think the probability won't be very large.
This doesn't directly deal with the calculations aspect of your question, but would using the count attribute of a query object work for you? Or have you tried that out and it's not suitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.
http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count

Percentiles of Live Data Capture

I am looking for an algorithm that determines percentiles for live data capture.
For example, consider the development of a server application.
The server might have response times as follows:
17 ms
33 ms
52 ms
60 ms
55 ms
etc.
It is useful to report the 90th percentile response time, 80th percentile response time, etc.
The naive algorithm is to insert each response time into a list. When statistics are requested, sort the list and get the values at the proper positions.
Memory usage scales linearly with the number of requests.
Is there an algorithm that yields "approximate" percentile statistics given limited memory usage? For example, let's say I want to solve this problem in a way that I process millions of requests but only want to use say one kilobyte of memory for percentile tracking (discarding the tracking for old requests is not an option since the percentiles are supposed to be for all requests).
I also require that there be no a priori knowledge of the distribution. For example, I do not want to specify any bucket ranges ahead of time.
If you want to keep the memory usage constant as you get more and more data, then you're going to have to resample that data somehow. That implies that you must apply some sort of rebinning scheme. You can wait until you acquire a certain amount of raw inputs before beginning the rebinning, but you cannot avoid it entirely.
So your question is really asking "what's the best way of dynamically binning my data"? There are lots of approaches, but if you want to minimise your assumptions about the range or distribution of values you may receive, then a simple approach is to average over buckets of fixed size k, with logarithmically distributed widths. For example, let's say you want to hold 1000 values in memory at any one time. Pick a size for k, say 100. Pick your minimum resolution, say 1ms. Then
The first bucket deals with values between 0-1ms (width=1ms)
Second bucket: 1-3ms (w=2ms)
Third bucket: 3-7ms (w=4ms)
Fourth bucket: 7-15ms (w=8ms)
...
Tenth bucket: 511-1023ms (w=512ms)
This type of log-scaled approach is similar to the chunking systems used in hash table algorithms, used by some filesystems and memory allocation algorithms. It works well when your data has a large dynamic range.
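A concrete sketch of that bucketing rule (my own formulation of the boundaries listed above, assuming a 1 ms minimum resolution and doubling widths):
final class LogBuckets {
    // Bucket i covers [2^i - 1, 2^(i+1) - 1) ms: 0-1, 1-3, 3-7, 7-15, ..., 511-1023, ...
    // so each bucket is twice as wide as the previous one.
    static int indexFor(long millis) {
        return 63 - Long.numberOfLeadingZeros(millis + 1);
    }
}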
As new values come in, you can choose how you want to resample, depending on your requirements. For example, you could track a moving average, use a first-in-first-out, or some other more sophisticated method. See the Kademlia algorithm for one approach (used by Bittorrent).
Ultimately, rebinning must lose you some information. Your choices regarding the binning will determine the specifics of what information is lost. Another way of saying this is that the constant size memory store implies a trade-off between dynamic range and the sampling fidelity; how you make that trade-off is up to you, but like any sampling problem, there's no getting around this basic fact.
If you're really interested in the pros and cons, then no answer on this forum can hope to be sufficient. You should look into sampling theory. There's a huge amount of research on this topic available.
For what it's worth, I suspect that your server times will have a relatively small dynamic range, so a more relaxed scaling to allow higher sampling of common values may provide more accurate results.
Edit: To answer your comment, here's an example of a simple binning algorithm.
You store 1000 values, in 10 bins. Each bin therefore holds 100 values. Assume each bin is implemented as a dynamic array (a 'list', in Perl or Python terms).
When a new value comes in:
Determine which bin it should be stored in, based on the bin limits you've chosen.
If the bin is not full, append the value to the bin list.
If the bin is full, remove the value at the top of the bin list, and append the new value to the bottom of the bin list. This means old values are thrown away over time.
To find the 90th percentile, sort bin 10. The 90th percentile is the first value in the sorted list (element 900/1000).
If you don't like throwing away old values, then you can implement some alternative scheme to use instead. For example, when a bin becomes full (reaches 100 values, in my example), you could take the average of the oldest 50 elements (i.e. the first 50 in the list), discard those elements, and then append the new average element to the bin, leaving you with a bin of 51 elements that now has space to hold 49 new values. This is a simple example of rebinning.
Another example of rebinning is downsampling; throwing away every 5th value in a sorted list, for example.
I hope this concrete example helps. The key point to take away is that there are lots of ways of achieving a constant memory aging algorithm; only you can decide what is satisfactory given your requirements.
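To make that concrete, here is a rough Java sketch of the scheme just described, using the same log-width boundaries as above (10 bins, 100 values per bin, FIFO eviction; the class and method names are placeholders of mine, not a reference implementation):
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BinnedPercentiles {
    private static final int BIN_COUNT = 10;      // 10 bins of 100 values = 1000 values held
    private static final int BIN_CAPACITY = 100;
    private final List<ArrayDeque<Long>> bins = new ArrayList<>();

    BinnedPercentiles() {
        for (int i = 0; i < BIN_COUNT; i++) bins.add(new ArrayDeque<>());
    }

    // Log-width bins: bin i covers [2^i - 1, 2^(i+1) - 1) ms; anything larger lands in the last bin.
    private static int binIndex(long millis) {
        return Math.min(63 - Long.numberOfLeadingZeros(millis + 1), BIN_COUNT - 1);
    }

    void observe(long millis) {
        ArrayDeque<Long> bin = bins.get(binIndex(millis));
        if (bin.size() == BIN_CAPACITY) bin.removeFirst();   // FIFO: old values are thrown away
        bin.addLast(millis);
    }

    // As described above: if all bins are full, the smallest value in the top bin
    // sits at position ~900 of 1000, i.e. roughly the 90th percentile.
    long approximateP90() {
        List<Long> top = new ArrayList<>(bins.get(BIN_COUNT - 1));
        Collections.sort(top);
        return top.isEmpty() ? -1 : top.get(0);
    }
}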
I've once published a blog post on this topic. The blog is now defunct but the article is included in full below.
The basic idea is to give up the requirement for an exact calculation in favor of a statement like "95 percent of responses take 500ms-600ms or less" (the same answer is reported for any exact percentile that falls within 500ms-600ms).
As we’ve recently started feeling that response times of one of our webapps got worse, we decided to spend some time tweaking the apps’ performance. As a first step, we wanted to get a thorough understanding of current response times. For performance evaluations, using minimum, maximum or average response times is a bad idea: “The ‘average’ is the evil of performance optimization and often as helpful as ‘average patient temperature in the hospital'” (MySQL Performance Blog). Instead, performance tuners should be looking at the percentile: “A percentile is the value of a variable below which a certain percent of observations fall” (Wikipedia). In other words: the 95th percentile is the time in which 95% of requests finished. Therefore, a performance goal related to the percentile could be something like “The 95th percentile should be lower than 800 ms”. Setting such performance goals is one thing, but efficiently tracking them for a live system is another one.
I’ve spent quite some time looking for existing implementations of percentile calculations (e.g. here or here). All of them required storing response times for each and every request and calculating the percentile on demand, or inserting new response times in sorted order. This was not what I wanted. I was hoping for a solution that would allow memory- and CPU-efficient live statistics for hundreds of thousands of requests. Storing response times for hundreds of thousands of requests and calculating the percentile on demand sounds neither CPU nor memory efficient.
Such a solution as I was hoping for simply seems not to exist. On second thought, I came up with another idea: For the type of performance evaluation I was looking for, it’s not necessary to get the exact percentile. An approximate answer like “the 95th percentile is between 850ms and 900ms” would totally suffice. Lowering the requirements this way makes an implementation extremely easy, especially if upper and lower borders for the possible results are known. For example, I’m not interested in response times higher than several seconds – they are extremely bad anyway, regardless of being 10 seconds or 15 seconds.
So here is the idea behind the implementation:
Define an arbitrary number of response time buckets (e.g. 0-100ms, 100-200ms, 200-400ms, 400-800ms, 800-1200ms, …)
Count the total number of responses and the number of responses in each bucket (for a response time of 360ms, increment the counter for the 200ms – 400ms bucket)
Estimate the n-th percentile by summing the bucket counters until the sum exceeds n percent of the total
It’s that simple. And here is the code.
Some highlights:
public void increment(final int millis) {
    final int i = index(millis);
    if (i < _limits.length) {
        _counts[i]++;
    }
    _total++;
}

public int estimatePercentile(final double percentile) {
    if (percentile < 0.0 || percentile > 100.0) {
        throw new IllegalArgumentException("percentile must be between 0.0 and 100.0, was " + percentile);
    }
    for (final Percentile p : this) {
        if (percentile - p.getPercentage() <= 0.0001) {
            return p.getLimit();
        }
    }
    return Integer.MAX_VALUE;
}
This approach only requires two int values (= 8 bytes) per bucket, allowing you to track 128 buckets with 1 KB of memory. More than sufficient for analysing the response times of a web application using a granularity of 50 ms. Additionally, for the sake of performance, I've intentionally implemented this without any synchronization (e.g. using AtomicIntegers), knowing that some increments might get lost.
By the way, using Google Charts and 60 percentile counters, I was able to create a nice graph out of one hour of collected response times:
I believe there are many good approximate algorithms for this problem. A good first-cut approach is to simply use a fixed-size array (say 1K worth of data). Fix some probability p. For each request, with probability p, write its response time into the array (replacing the oldest time in there). Since the array is a subsampling of the live stream and since subsampling preserves the distribution, doing the statistics on that array will give you an approximation of the statistics of the full, live stream.
This approach has several advantages: it requires no a-priori information, and it's easy to code. You can build it quickly and experimentally determine, for your particular server, at what point growing the buffer has only a negligible effect on the answer. That is the point where the approximation is sufficiently precise.
If you find that you need too much memory to give you statistics that are precise enough, then you'll have to dig further. Good keywords are: "stream computing", "stream statistics", and of course "percentiles". You can also try "ire and curses"'s approach.
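A minimal sketch of that first-cut approach (fixed-size buffer, coin-flip admission, oldest slot overwritten; the buffer size and probability are arbitrary choices for illustration):
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

class SampledPercentiles {
    private final double[] samples;
    private final double admitProbability;
    private int next = 0;     // oldest slot, overwritten first
    private int filled = 0;

    SampledPercentiles(int capacity, double admitProbability) {
        this.samples = new double[capacity];
        this.admitProbability = admitProbability;
    }

    void observe(double responseTimeMs) {
        // With probability p, write the new time over the oldest entry in the buffer.
        if (ThreadLocalRandom.current().nextDouble() < admitProbability) {
            samples[next] = responseTimeMs;
            next = (next + 1) % samples.length;
            if (filled < samples.length) filled++;
        }
    }

    double percentile(double p) {                 // p in [0, 100]
        if (filled == 0) return Double.NaN;
        double[] copy = Arrays.copyOf(samples, filled);
        Arrays.sort(copy);
        int rank = Math.max((int) Math.ceil(p / 100.0 * filled) - 1, 0);
        return copy[rank];
    }
}
With a capacity of 128 doubles the buffer stays around 1 KB, which matches the memory budget mentioned in the question.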
(It's been quite some time since this question was asked, but I'd like to point out a few related research papers)
There has been a significant amount of research on approximate percentiles of data streams in the past few years. A few interesting papers with full algorithm definitions:
A fast algorithm for approximate quantiles in high speed data streams
Space-and time-efficient deterministic algorithms for biased quantiles over data streams
Effective computation of biased quantiles over data streams
All of these papers propose algorithms with sub-linear space complexity for the computation of approximate percentiles over a data stream.
Try the simple algorithm defined in the paper “Sequential Procedure for Simultaneous Estimation of Several Percentiles” (Raatikainen). It’s fast, requires 2*m+3 markers (for m percentiles) and tends to an accurate approximation quickly.
Use a dynamic array T[] of large integers, or something similar, where T[n] counts the number of times the response time was n milliseconds. If you really are doing statistics on a server application then possibly 250 ms response times are your absolute limit anyway. So your 1 KB holds one 32-bit integer for every ms between 0 and 250, and you have some room to spare for an overflow bin.
If you want something with more bins, go with 8-bit numbers for 1000 bins, and the moment a counter would overflow (i.e. the 256th request at that response time) you shift the bits in all bins down by 1 (effectively halving the value in every bin). This means you disregard all bins that capture less than 1/127th of the delays that the most visited bin catches.
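A sketch of that halving trick (conceptually 8-bit counters over 1000 one-millisecond bins; clamping out-of-range values into the last bin is my own choice):
class CompactLatencyBins {
    private final int[] counts = new int[1000];   // conceptually 8-bit counters, capped at 255

    void observe(int millis) {
        int bin = Math.min(Math.max(millis, 0), counts.length - 1);
        if (counts[bin] == 255) {
            // A counter is about to overflow: halve every bin so the overall shape is preserved.
            for (int i = 0; i < counts.length; i++) counts[i] >>= 1;
        }
        counts[bin]++;
    }

    // Percentile lookup by walking the cumulative counts, as in the bucket schemes above.
    int percentileMillis(double p) {
        long total = 0;
        for (int c : counts) total += c;
        long target = (long) Math.ceil(p / 100.0 * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target) return i;
        }
        return counts.length - 1;
    }
}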
If you really, really need a set of specific bins I'd suggest using the first day of requests to come up with a reasonable fixed set of bins. Anything dynamic would be quite dangerous in a live, performance-sensitive application. If you choose that path you'd better know what you're doing, or one day you're going to get called out of bed to explain why your statistics tracker is suddenly eating 90% CPU and 75% memory on the production server.
As for additional statistics: for mean and variance there are some nice recursive algorithms that take up very little memory. These two statistics can be useful enough in themselves for a lot of distributions, because the central limit theorem states that distributions that arise from a sufficiently large number of independent variables approach the normal distribution (which is fully defined by mean and variance). You can use one of the normality tests on the last N values (where N is sufficiently large but constrained by your memory requirements) to monitor whether the assumption of normality still holds.
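One such recursive scheme (my pick, not named above) is Welford's online algorithm, which maintains the mean and variance in constant memory:
// Welford's online algorithm: running mean and variance without storing samples.
class RunningStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;    // sum of squared deviations from the current mean

    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    double mean() { return mean; }
    double variance() { return n > 1 ? m2 / (n - 1) : 0.0; }   // sample variance
}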
#thkala started off with some literature citations. Let me extend that.
Implementations
T-digest from the 2019 Dunning paper has a reference implementation in Java, and ports on that page to Python, Go, JavaScript, C++, Scala, C, Clojure, C#, and Kotlin, plus a C++ port by Facebook and a further Rust port of that C++ port.
Spark implements "GK01" from the 2001 Greenwald/Khanna paper for approximate quantiles
Beam: org.apache.beam.sdk.transforms.ApproximateQuantiles has approximate quantiles
Java: Guava's com.google.common.math.Quantiles implements exact quantiles, thus taking more memory.
Rust: quantiles crate has implementations for the 2001 GK algorithm "GK01", and the 2005 CKMS algorithm. (caution: I found the CKMS implementation slow - issue)
C++: boost quantiles has some code, but I didn't understand it.
I did some profiling of the options in Rust [link] for up to 100M items, and found GK01 the best, T-digest the second, and "keep 1% top values in priority queue" the third.
Literature
2001: Space-efficient online computation of quantile summaries (by Greenwald, Khanna). Implemented in Rust: quantiles::greenwald_khanna.
2004: Medians and beyond: new aggregation techniques for sensor networks (by Shrivastava, Buragohain, Agrawal, Suri). Introduces "q-digests", used for fixed-universe data.
2005: Effective computation of biased quantiles over data streams (by Cormode, Korn, Muthukrishnan, Srivastava)... Implemented in Rust: quantiles::ckms, which notes that the IEEE presentation is correct but the self-published one has flaws. With carefully crafted data, space can grow linearly with input size. "Biased" means it focuses on P90/P95/P99 rather than all the percentiles.
2006: Space-and time-efficient deterministic algorithms for biased quantiles over data streams (by Cormode, Korn, Muthukrishnan, Srivastava)... improved space bound over 2005 paper
2007: A fast algorithm for approximate quantiles in high speed data streams (by Zhang, Wang). Claims 60-300x speedup over GK. The 2020 literature review below says this has state-of-the-art space upper bound.
2019: Computing extremely accurate quantiles using t-digests (by Dunning, Ertl). Introduces t-digests, O(log n) space, O(1) updates, O(1) final calculation. Its neat feature is that you can build partial digests (e.g. one per day) and merge them into months, then merge months into years. This is what the big query engines use.
2020: A survey of approximate quantile computation on large-scale data (technical report) (by Chen, Zhang).
2021: The t-digest: Efficient estimates of distributions - an approachable wrap-up paper about t-digests.
Cheap hack for P99 of <10M values: just store top 1% in a priority queue!
This'll sound stupid, but if I want to calculate the P99 of 10M float64s, I just create a priority queue of 100k float32s (which takes 400kB). This takes only 4x as much space as "GK01" and is much faster. For 5M or fewer items, it takes less space than GK01!!
struct TopValues {
    values: std::collections::BinaryHeap<std::cmp::Reverse<ordered_float::NotNan<f32>>>,
}

impl TopValues {
    fn new(count: usize) -> Self {
        let capacity = std::cmp::max(count / 100, 1);
        let values = std::collections::BinaryHeap::with_capacity(capacity);
        TopValues { values }
    }

    fn render(&mut self) -> String {
        let p99 = self.values.peek().unwrap().0;
        let max = self.values.drain().min().unwrap().0;
        format!("TopValues, p99={:.4}, max={:.4}", p99, max)
    }

    fn insert(&mut self, value: f64) {
        let value = value as f32;
        let value = std::cmp::Reverse(unsafe { ordered_float::NotNan::new_unchecked(value) });
        if self.values.len() < self.values.capacity() {
            self.values.push(value);
        } else if self.values.peek().unwrap().0 < value.0 {
            self.values.pop();
            self.values.push(value);
        } else {
        }
    }
}
You can try the following structure:
Take an input n, e.g. n = 100.
We'll keep an array of ranges [min, max], sorted by min, each with a count.
Insertion of value x: binary search for the range whose min matches x. If not found, take the preceding range (the one with the greatest min < x). If the value belongs to that range (x <= max), increment its count. Otherwise insert a new range [min = x, max = x, count = 1].
If the number of ranges hits 2*n, collapse/merge the array into n ranges (half) by taking the min from odd entries and the max from even entries, summing their counts.
To get e.g. p95, walk from the end summing the counts until the next addition would hit the threshold sum >= 95%, and take p95 = min + (max - min) * partial.
It will settle on dynamic ranges of measurements. n can be modified to trade accuracy for memory (and, to a lesser extent, CPU). If you make the values more discrete, e.g. by rounding to 0.01 before insertion, it'll stabilise on a set of ranges sooner.
You could improve accuracy by not assuming that each range holds uniformly distributed entries; keeping something cheap like the sum of values (which gives you avg = sum / count) would help you read a closer p95 value from the range where it sits.
You can also rotate them, i.e. after m = 1,000,000 entries start filling a new array and take p95 as a count-weighted sum over the arrays (if array B has 10% of the count of A, then it contributes 10% of the p95 value).
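Here is a rough Java sketch of this structure under my reading of the steps above (the interpolation walks from the low end, which targets the same range as walking from the end and stopping at the last 5%; all naming and merge details are mine, so treat it as an illustration rather than the author's exact algorithm):
import java.util.ArrayList;
import java.util.List;

class RangeQuantiles {
    private static class Range {
        double min, max;
        long count;
        Range(double min, double max, long count) { this.min = min; this.max = max; this.count = count; }
    }

    private final int n;                                     // target number of ranges
    private final List<Range> ranges = new ArrayList<>();    // kept sorted by min

    RangeQuantiles(int n) { this.n = n; }

    void insert(double x) {
        // Find the last range whose min <= x (linear here for clarity; a binary search also works).
        int i = ranges.size() - 1;
        while (i >= 0 && ranges.get(i).min > x) i--;
        if (i >= 0 && x <= ranges.get(i).max) {
            ranges.get(i).count++;                           // x falls inside an existing range
        } else {
            ranges.add(i + 1, new Range(x, x, 1));           // insert a new single-point range
        }
        if (ranges.size() >= 2 * n) merge();
    }

    // Collapse pairs of adjacent ranges: keep the min of the first, the max of the second, sum the counts.
    private void merge() {
        List<Range> merged = new ArrayList<>();
        for (int i = 0; i + 1 < ranges.size(); i += 2) {
            Range a = ranges.get(i), b = ranges.get(i + 1);
            merged.add(new Range(a.min, b.max, a.count + b.count));
        }
        if (ranges.size() % 2 == 1) merged.add(ranges.get(ranges.size() - 1));
        ranges.clear();
        ranges.addAll(merged);
    }

    // Estimate a quantile (q in [0, 1]) assuming values are uniform within each range.
    double quantile(double q) {
        long total = ranges.stream().mapToLong(r -> r.count).sum();
        double target = q * total;
        long seen = 0;
        for (Range r : ranges) {
            if (seen + r.count >= target) {
                double partial = (target - seen) / r.count;
                return r.min + (r.max - r.min) * partial;
            }
            seen += r.count;
        }
        return ranges.isEmpty() ? Double.NaN : ranges.get(ranges.size() - 1).max;
    }
}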
