Cache hit ratio effect on performance

I'm currently reading the second edition of Systems Performance by Brendan Gregg and had a question on the section about caching in Chapter 2. This section defines cache hit ratio as
hit ratio = hits / (hits + misses)
It goes on to say that the relationship between cache hit ratio and "performance" (for some hypothetical measure of system performance) is nonlinear. Specifically,
The performance difference between 98% and 99% is much greater than that between 10% and 11%. This is a nonlinear profile because of the difference in speed between cache hits and misses - the two storage tiers at play. The greater the difference, the steeper the slope becomes.
I don't quite understand where the nonlinearity in this relationship comes from. To explain it to myself, I concocted the following example: we model performance by some function f, where a lower value of f denotes better performance.
f(hit) = 10
f(miss) = 100
i.e. misses are 10x more expensive than hits. Assuming a hit ratio of 0, the "expected" performance of this system will be (0*10) + (1*100) = 100. A hit ratio of .01 (1% hits) yields (.01*10)+(.99*100) = 99.1. Finally a hit ratio of .02 (2% hits) yields (.02*10) + (.98*100) = 98.2. AFAICT, this is a linear relationship. What am I missing?
Thanks

I had the same question after reading the same book, and now that I have answered it for myself, I will share the answer.
Performance is nonlinear with hit rate because execution time and performance are different things.
As you wrote, the expression with f (e.g. (0*10) + (1*100) = 100) is linear in the hit ratio.
However, it does not actually represent performance, but average execution time.
Performance is relative, and when comparing two measurements we use the ratio of the inverses of their execution times (i.e. the speedup).
This taking of the inverse is the source of the behavior that looks linear at first glance but is actually nonlinear.
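To make the distinction concrete, here is a small sketch (Python, using the f(hit) = 10 and f(miss) = 100 costs from the question; the chosen hit-ratio pairs are just illustrative). The average time falls linearly with the hit ratio, but the speedup gained per extra percentage point grows sharply near 100%:

HIT, MISS = 10, 100  # cost of a hit and a miss, from the question

def avg_time(hit_ratio):
    return hit_ratio * HIT + (1 - hit_ratio) * MISS

for lo, hi in [(0.10, 0.11), (0.50, 0.51), (0.98, 0.99)]:
    speedup = avg_time(lo) / avg_time(hi)  # performance ratio = ratio of the inverse times
    print(f"{lo:.0%} -> {hi:.0%}: time {avg_time(lo):.1f} -> {avg_time(hi):.1f}, speedup {speedup:.3f}x")

Going from 10% to 11% is roughly a 1% speedup, while going from 98% to 99% is roughly an 8% speedup, even though both lower the average time by the same 0.9 units.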
I'm a non-native English speaker, so sorry if it's hard to read.
I hope my answer helps you.

Related

What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + currAvg * (totalSize - sizeDone)
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of prediction is important, the way to go about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
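As an illustration of fitting those weights, here is a minimal sketch (Python with NumPy rather than R; the argument names are hypothetical arrays of ETAs and actually observed remaining times collected from past copies). It shows only the least-squares step, not the full train/validation/test split:

import numpy as np

def fit_blend_weights(eta_long_run, eta_recent, actual_remaining):
    # Each argument is a 1-D array of observations from past transfers:
    # the two candidate predictions and the remaining time actually observed.
    X = np.column_stack([eta_long_run, eta_recent])
    w, *_ = np.linalg.lstsq(X, np.asarray(actual_remaining), rcond=None)
    return w  # blended prediction = w[0]*eta_long_run + w[1]*eta_recent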
An excellent resource for studying statistical learning methods is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
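For what it's worth, a one-dimensional Kalman filter for the copy speed fits in a few lines; this is only a sketch, and the process-noise and measurement-noise values below are made-up tuning knobs rather than anything prescribed here:

class SpeedKalman:
    def __init__(self, q=1e-3, r=1.0):
        self.q, self.r = q, r           # process noise and measurement noise (tuning knobs)
        self.speed, self.p = None, 1.0  # state estimate and its uncertainty

    def update(self, measured_speed):
        if self.speed is None:          # first measurement initialises the state
            self.speed = measured_speed
            return self.speed
        self.p += self.q                # predict: speed assumed roughly constant, uncertainty grows
        k = self.p / (self.p + self.r)  # Kalman gain acts as an adaptive window length
        self.speed += k * (measured_speed - self.speed)
        self.p *= (1 - k)
        return self.speed

Feeding it the measured throughput once per second gives a smoothed speed that reacts quickly while its own uncertainty is high and more slowly once the estimate has settled.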
Imho, badly implemented ETCs are wildly overused, which at least gives us all a good laugh. Sometimes, it might be better to display facts instead of estimates, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Users know that ETCs are often completely meaningless, and it is hard to distinguish meaningful ETCs from meaningless ones, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Apart from statistical approaches, one simple way to get a good estimate of the current speed, while smoothing out noise and spikes, is to take a weighted approach.
You already experimented with a sliding window; the idea here is to take a fairly large sliding window, but instead of a plain average, give more weight to the more recent measures, since they are more indicative of the current trend (a bit like a derivative).
Example: Suppose you have 10 previous windows (most recent x0, least recent x9), then you could compute the speed:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + 1 * x9) / (55 * window-time), where 55 is the sum of the weights 10 + 9 + ... + 1.
Once you have a good assessment of the likely speed, you are close to getting a good estimated time.
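A minimal sketch of that weighted window (Python; the window contents and the window length are whatever you record in your copy loop):

def weighted_speed(progress_per_window, window_time):
    # progress_per_window[0] is the most recent window, progress_per_window[-1] the oldest
    n = len(progress_per_window)
    weights = list(range(n, 0, -1))  # n for the newest sample, down to 1 for the oldest
    weighted_sum = sum(w * x for w, x in zip(weights, progress_per_window))
    return weighted_sum / (sum(weights) * window_time)  # sum(weights) is 55 for n = 10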
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have shown that users react very badly to slow-downs and very positively to speed-ups. Therefore, a good progress bar / estimated time should be conservative in the estimates it presents at first (reserving time for a potential slow-down).
A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor maps [0..1] to [0..1], with factor(x) <= x and factor(1) = 1. For example, the cubic factor(x) = x^3 produces a nice speed-up toward the completion time. Other functions with the same properties could be used, for example an exponential form such as (e^(k*x) - 1) / (e^k - 1) for some k > 0.
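A tiny sketch of that presentation tweak with the cubic factor, following the formula above literally (the 0.4 value is the example completion from above):

def presented_completion(real_completion, factor=lambda x: x ** 3):
    # factor must map [0, 1] to [0, 1], stay at or below the identity, and satisfy factor(1) = 1
    return real_completion * factor(real_completion)

print(presented_completion(0.4))  # 0.4 * 0.4**3 ≈ 0.0256: deliberately pessimistic early on
print(presented_completion(1.0))  # 1.0: no distortion at the end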

Percentiles of Live Data Capture

I am looking for an algorithm that determines percentiles for live data capture.
For example, consider the development of a server application.
The server might have response times as follows:
17 ms
33 ms
52 ms
60 ms
55 ms
etc.
It is useful to report the 90th percentile response time, 80th percentile response time, etc.
The naive algorithm is to insert each response time into a list. When statistics are requested, sort the list and get the values at the proper positions.
Memory usage scales linearly with the number of requests.
Is there an algorithm that yields "approximate" percentile statistics given limited memory usage? For example, let's say I want to solve this problem in a way that I process millions of requests but only want to use say one kilobyte of memory for percentile tracking (discarding the tracking for old requests is not an option since the percentiles are supposed to be for all requests).
I also require that there be no a priori knowledge of the distribution. For example, I do not want to specify any ranges of buckets ahead of time.
If you want to keep the memory usage constant as you get more and more data, then you're going to have to resample that data somehow. That implies that you must apply some sort of rebinning scheme. You can wait until you acquire a certain amount of raw inputs before beginning the rebinning, but you cannot avoid it entirely.
So your question is really asking "what's the best way of dynamically binning my data"? There are lots of approaches, but if you want to minimise your assumptions about the range or distribution of values you may receive, then a simple approach is to average over buckets of fixed size k, with logarithmically distributed widths. For example, let's say you want to hold 1000 values in memory at any one time. Pick a size for k, say 100. Pick your minimum resolution, say 1ms. Then
The first bucket deals with values between 0-1ms (width=1ms)
Second bucket: 1-3ms (w=2ms)
Third bucket: 3-7ms (w=4ms)
Fourth bucket: 7-15ms (w=8ms)
...
Tenth bucket: 511-1023ms (w=512ms)
This type of log-scaled approach is similar to the chunking systems used in hash table algorithms, used by some filesystems and memory allocation algorithms. It works well when your data has a large dynamic range.
As new values come in, you can choose how you want to resample, depending on your requirements. For example, you could track a moving average, use a first-in-first-out, or some other more sophisticated method. See the Kademlia algorithm for one approach (used by Bittorrent).
Ultimately, rebinning must lose you some information. Your choices regarding the binning will determine the specifics of what information is lost. Another way of saying this is that the constant size memory store implies a trade-off between dynamic range and the sampling fidelity; how you make that trade-off is up to you, but like any sampling problem, there's no getting around this basic fact.
If you're really interested in the pros and cons, then no answer on this forum can hope to be sufficient. You should look into sampling theory. There's a huge amount of research on this topic available.
For what it's worth, I suspect that your server times will have a relatively small dynamic range, so a more relaxed scaling to allow higher sampling of common values may provide more accurate results.
Edit: To answer your comment, here's an example of a simple binning algorithm.
You store 1000 values, in 10 bins. Each bin therefore holds 100 values. Assume each bin is implemented as a dynamic array (a 'list', in Perl or Python terms).
When a new value comes in:
Determine which bin it should be stored in, based on the bin limits you've chosen.
If the bin is not full, append the value to the bin list.
If the bin is full, remove the value at the top of the bin list, and append the new value to the bottom of the bin list. This means old values are thrown away over time.
To find the 90th percentile, sort bin 10. The 90th percentile is the first value in the sorted list (element 900/1000).
If you don't like throwing away old values, then you can implement some alternative scheme to use instead. For example, when a bin becomes full (reaches 100 values, in my example), you could take the average of the oldest 50 elements (i.e. the first 50 in the list), discard those elements, and then append the new average element to the bin, leaving you with a bin of 51 elements that now has space to hold 49 new values. This is a simple example of rebinning.
Another example of rebinning is downsampling; throwing away every 5th value in a sorted list, for example.
I hope this concrete example helps. The key point to take away is that there are lots of ways of achieving a constant memory aging algorithm; only you can decide what is satisfactory given your requirements.
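Here is a minimal sketch of that scheme (Python; the power-of-two bin edges and the FIFO eviction are exactly the toy choices described above, and details such as boundary handling are arbitrary):

import bisect

class LogBins:
    def __init__(self, num_bins=10, per_bin=100, min_res_ms=1):
        # bin i covers roughly [2^i - 1, 2^(i+1) - 1) ms: 0-1, 1-3, 3-7, 7-15, ...
        self.edges = [(2 ** i - 1) * min_res_ms for i in range(1, num_bins + 1)]
        self.bins = [[] for _ in range(num_bins)]
        self.per_bin = per_bin

    def add(self, value_ms):
        i = bisect.bisect_right(self.edges, value_ms)
        i = min(i, len(self.bins) - 1)  # clamp anything larger into the last bin
        b = self.bins[i]
        if len(b) >= self.per_bin:
            b.pop(0)                    # FIFO: the oldest value in the bin is thrown away
        b.append(value_ms)

    def percentile(self, p):
        # flatten and sort; with full bins this matches "sort bin 10, take its first value" for p = 0.9
        values = sorted(v for b in self.bins for v in b)
        return values[min(int(p * len(values)), len(values) - 1)]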
I once published a blog post on this topic. The blog is now defunct but the article is included in full below.
The basic idea is to drop the requirement for an exact calculation in favor of a statement like "95% of responses take 500ms-600ms or less" (i.e. an interval rather than an exact percentile value).
As we’ve recently started feeling that response times of one of our webapps got worse, we decided to spend some time tweaking the apps’ performance. As a first step, we wanted to get a thorough understanding of current response times. For performance evaluations, using minimum, maximum or average response times is a bad idea: “The ‘average’ is the evil of performance optimization and often as helpful as ‘average patient temperature in the hospital'” (MySQL Performance Blog). Instead, performance tuners should be looking at the percentile: “A percentile is the value of a variable below which a certain percent of observations fall” (Wikipedia). In other words: the 95th percentile is the time in which 95% of requests finished. Therefore, a performance goal related to the percentile could be something like “The 95th percentile should be lower than 800 ms”. Setting such performance goals is one thing, but efficiently tracking them for a live system is another one.
I’ve spent quite some time looking for existing implementations of percentile calculations (e.g. here or here). All of them required storing response times for each and every request and calculating the percentile on demand, or inserting new response times in sorted order. This was not what I wanted. I was hoping for a solution that would allow memory- and CPU-efficient live statistics for hundreds of thousands of requests. Storing response times for hundreds of thousands of requests and calculating the percentile on demand sounds neither CPU nor memory efficient.
The kind of solution I was hoping for simply does not seem to exist. On second thought, I came up with another idea: for the type of performance evaluation I was looking for, it’s not necessary to get the exact percentile. An approximate answer like “the 95th percentile is between 850ms and 900ms” would totally suffice. Lowering the requirements this way makes an implementation extremely easy, especially if upper and lower borders for the possible results are known. For example, I’m not interested in response times higher than several seconds – they are extremely bad anyway, regardless of being 10 seconds or 15 seconds.
So here is the idea behind the implementation:
Define any random number of response time buckets (e.g. 0-100ms, 100-200ms, 200-400ms, 400-800ms, 800-1200ms, …)
Count the total number of responses and the number of responses in each bucket (for a response time of 360ms, increment the counter for the 200ms – 400ms bucket)
Estimate the n-th percentile by summing the bucket counters until the sum exceeds n percent of the total
It’s that simple. And here is the code.
Some highlights:
public void increment(final int millis) {
    final int i = index(millis);    // find the bucket for this response time
    if (i < _limits.length) {
        _counts[i]++;               // per-bucket counter
    }
    _total++;                       // total number of responses seen
}

public int estimatePercentile(final double percentile) {
    if (percentile < 0.0 || percentile > 100.0) {
        throw new IllegalArgumentException("percentile must be between 0.0 and 100.0, was " + percentile);
    }
    // walk the buckets in order and return the upper limit of the first bucket
    // whose cumulative percentage reaches the requested percentile
    for (final Percentile p : this) {
        if (percentile - p.getPercentage() <= 0.0001) {
            return p.getLimit();
        }
    }
    return Integer.MAX_VALUE;
}
This approach only requires two int values (= 8 bytes) per bucket, allowing 128 buckets to be tracked with 1K of memory - more than sufficient for analysing the response times of a web application at a granularity of 50ms. Additionally, for the sake of performance, I've intentionally implemented this without any synchronization (e.g. AtomicIntegers), knowing that some increments might get lost.
By the way, using Google Charts and 60 percentile counters, I was able to create a nice graph out of one hour of collected response times.
I believe there are many good approximate algorithms for this problem. A good first-cut approach is to simply use a fixed-size array (say 1K worth of data). Fix some probability p. For each request, with probability p, write its response time into the array (replacing the oldest time in there). Since the array is a subsampling of the live stream and since subsampling preserves the distribution, doing the statistics on that array will give you an approximation of the statistics of the full, live stream.
This approach has several advantages: it requires no a-priori information, and it's easy to code. You can build it quickly and experimentally determine, for your particular server, at what point growing the buffer has only a negligible effect on the answer. That is the point where the approximation is sufficiently precise.
If you find that you need too much memory to give you statistics that are precise enough, then you'll have to dig further. Good keywords are: "stream computing", "stream statistics", and of course "percentiles". You can also try "ire and curses"'s approach.
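A minimal sketch of that sampling buffer (Python; the buffer size and sampling probability are the knobs you would tune experimentally, as described above):

import random
import statistics

class SampledTimes:
    def __init__(self, capacity=1000, p=0.1):
        self.capacity, self.p = capacity, p
        self.buf, self.next_slot = [], 0

    def record(self, response_time):
        if random.random() >= self.p:
            return                                    # ignore most requests
        if len(self.buf) < self.capacity:
            self.buf.append(response_time)
        else:
            self.buf[self.next_slot] = response_time  # overwrite the oldest sample
            self.next_slot = (self.next_slot + 1) % self.capacity

    def percentile(self, q):
        # q in 1..99; requires at least two recorded samples
        return statistics.quantiles(self.buf, n=100)[q - 1]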
(It's been quite some time since this question was asked, but I'd like to point out a few related research papers)
There has been a significant amount of research on approximate percentiles of data streams in the past few years. A few interesting papers with full algorithm definitions:
A fast algorithm for approximate quantiles in high speed data streams
Space-and time-efficient deterministic algorithms for biased quantiles over data streams
Effective computation of biased quantiles over data streams
All of these papers propose algorithms with sub-linear space complexity for the computation of approximate percentiles over a data stream.
Try the simple algorithm defined in the paper “Sequential Procedure for Simultaneous Estimation of Several Percentiles” (Raatikainen). It’s fast, requires 2*m+3 markers (for m percentiles) and tends to an accurate approximation quickly.
Use a dynamic array T[] of large integers or something where T[n] counts the number of times the response time was n milliseconds. If you really are doing statistics on a server application then possibly 250 ms response times are your absolute limit anyway. So your 1 KB holds one 32-bit integer for every ms between 0 and 250, and you have some room to spare for an overflow bin.
If you want something with more bins, go with 8-bit numbers for 1000 bins, and the moment a counter would overflow (i.e. the 256th request at that response time) you shift the bits in all bins down by 1 (effectively halving the value in every bin). This means you disregard all bins that capture less than 1/127th of the delays that the most-visited bin catches.
If you really, really need a set of specific bins I'd suggest using the first day of requests to come up with a reasonable fixed set of bins. Anything dynamic would be quite dangerous in a live, performance-sensitive application. If you choose that path you'd better know what you're doing, or one day you're going to get called out of bed to explain why your statistics tracker is suddenly eating 90% CPU and 75% memory on the production server.
As for additional statistics: for mean and variance there are some nice recursive algorithms that take up very little memory. These two statistics can be useful enough in themselves for a lot of distributions, because the central limit theorem states that distributions arising from a sufficiently large number of independent variables approach the normal distribution (which is fully defined by mean and variance). You can run one of the normality tests on the last N samples (where N is sufficiently large but constrained by your memory requirements) to monitor whether the assumption of normality still holds.
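One such recursive algorithm is Welford's online method for mean and variance; a minimal sketch that keeps only three numbers of state:

class RunningStats:
    # Welford's online mean/variance: constant memory, numerically stable.
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # note: uses the updated mean

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0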
#thkala started off with some literature citations. Let me extend that.
Implementations
T-digest from the 2019 Dunning paper has a reference implementation in Java, with ports (listed on that page) to Python, Go, JavaScript, C++, Scala, C, Clojure, C#, and Kotlin, plus a C++ port by Facebook and a further Rust port of that C++ port
Spark implements "GK01" from the 2001 Greenwald/Khanna paper for approximate quantiles
Beam: org.apache.beam.sdk.transforms.ApproximateQuantiles has approximate quantiles
Java: Guava's com.google.common.math.Quantiles implements exact quantiles, thus taking more memory
Rust: quantiles crate has implementations for the 2001 GK algorithm "GK01", and the 2005 CKMS algorithm. (caution: I found the CKMS implementation slow - issue)
C++: boost quantiles has some code, but I didn't understand it.
I did some profiling of the options in Rust [link] for up to 100M items, and found GK01 the best, T-digest the second, and "keep 1% top values in priority queue" the third.
Literature
2001: Space-efficient online computation of quantile summaries (by Greenwald, Khanna). Implemented in Rust: quantiles::greenwald_khanna.
2004: Medians and beyond: new aggregation techniques for sensor networks (by Shrivastava, Buragohain, Agrawal, Suri). Introduces "q-digests", used for fixed-universe data.
2005: Effective computation of biased quantiles over data streams (by Cormode, Korn, Muthukrishnan, Srivastava)... Implemented in Rust: quantiles::ckms, which notes that the IEEE presentation is correct but the self-published one has flaws. With carefully crafted data, space can grow linearly with input size. ("Biased" means it focuses on P90/P95/P99 rather than on all the percentiles.)
2006: Space-and time-efficient deterministic algorithms for biased quantiles over data streams (by Cormode, Korn, Muthukrishnan, Srivastava)... improved space bound over 2005 paper
2007: A fast algorithm for approximate quantiles in high speed data streams (by Zhang, Wang). Claims 60-300x speedup over GK. The 2020 literature review below says this has state-of-the-art space upper bound.
2019: Computing extremely accurate quantiles using t-digests (by Dunning, Ertl). Introduces t-digests, O(log n) space, O(1) updates, O(1) final calculation. Its neat feature is that you can build partial digests (e.g. one per day) and merge them into months, then merge months into years. This is what the big query engines use.
2020: A survey of approximate quantile computation on large-scale data (technical report) (by Chen, Zhang).
2021: The t-digest: Efficient estimates of distributions - an approachable wrap-up paper about t-digests.
Cheap hack for P99 of <10M values: just store top 1% in a priority queue!
This'll sound stupid, but if I want to calculate the P99 of 10M float64s, I just create a priority queue of 100k float32s (which takes 400kB). This takes only 4x as much space as "GK01" and is much faster. For 5M or fewer items, it takes less space than GK01!!
struct TopValues {
    // min-heap (via Reverse) holding the largest 1% of the values seen so far
    values: std::collections::BinaryHeap<std::cmp::Reverse<ordered_float::NotNan<f32>>>,
}

impl TopValues {
    fn new(count: usize) -> Self {
        let capacity = std::cmp::max(count / 100, 1);
        let values = std::collections::BinaryHeap::with_capacity(capacity);
        TopValues { values }
    }

    fn render(&mut self) -> String {
        let p99 = self.values.peek().unwrap().0;        // smallest of the top 1% = P99
        let max = self.values.drain().min().unwrap().0; // min of Reverse = largest value
        format!("TopValues, p99={:.4}, max={:.4}", p99, max)
    }

    fn insert(&mut self, value: f64) {
        let value = value as f32;
        let value = std::cmp::Reverse(unsafe { ordered_float::NotNan::new_unchecked(value) });
        if self.values.len() < self.values.capacity() {
            self.values.push(value);
        } else if self.values.peek().unwrap().0 < value.0 {
            // new value beats the current P99: evict the smallest, keep the new one
            self.values.pop();
            self.values.push(value);
        }
    }
}
You can try the following structure:
Take an input n, e.g. n = 100.
We'll keep an array of ranges [min, max] sorted by min with count.
Insertion of value x: binary search for the range whose min is closest to x. If none matches exactly, take the preceding range (the one with min < x). If the value belongs to that range (x <= max), increment its count. Otherwise insert a new range [min = x, max = x, count = 1].
If the number of ranges hits 2*n, collapse/merge the array back down to n (half) by pairing adjacent entries: take min from the first of each pair and max from the second, and sum their counts.
To get e.g. p95, walk the ranges summing the counts until the next addition would cross the threshold (sum >= 95% of the total); take p95 = min + (max - min) * partial, where partial is the fraction of that range's count still needed.
It will settle on dynamic ranges of measurements. n can be modified to trade accuracy for memory (and to a lesser extent CPU). If you make the values more discrete, e.g. by rounding to 0.01 before insertion, it'll stabilise on ranges sooner.
You could improve accuracy by not assuming that each range holds uniformly distributed entries: track something cheap like the sum of values, which gives you avg = sum / count and helps you read a closer p95 value from the range where it sits.
You can also rotate them: e.g. after m = 1,000,000 entries start filling a new array, and take p95 as a count-weighted combination of the arrays (if array B has 10% of the count of A, then it contributes 10% of the p95 value).
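A minimal sketch of this structure (Python; the pairing rule in the merge and the interpolation inside a range follow the steps above, with remaining details chosen arbitrarily):

import bisect

class RangeQuantiles:
    def __init__(self, n=100):
        self.n = n
        self.ranges = []  # [lo, hi, count] entries, kept sorted by lo

    def insert(self, x):
        lows = [r[0] for r in self.ranges]
        i = bisect.bisect_right(lows, x) - 1   # rightmost range with lo <= x
        if i >= 0 and x <= self.ranges[i][1]:
            self.ranges[i][2] += 1
        else:
            self.ranges.insert(i + 1, [x, x, 1])
        if len(self.ranges) >= 2 * self.n:     # collapse adjacent pairs back down to n
            merged = [[a[0], b[1], a[2] + b[2]]
                      for a, b in zip(self.ranges[0::2], self.ranges[1::2])]
            if len(self.ranges) % 2:
                merged.append(self.ranges[-1])
            self.ranges = merged

    def quantile(self, q):  # q in (0, 1], e.g. 0.95 for p95
        target = q * sum(r[2] for r in self.ranges)
        acc = 0
        for lo, hi, count in self.ranges:
            if acc + count >= target:
                partial = (target - acc) / count
                return lo + (hi - lo) * partial
            acc += count
        return self.ranges[-1][1]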

Algorithm(s) for spotting anomalies ("spikes") in traffic data

I find myself needing to process network traffic captured with tcpdump. Reading the traffic is not hard, but what gets a bit tricky is spotting where there are "spikes" in the traffic. I'm mostly concerned with TCP SYN packets and what I want to do is find days where there's a sudden rise in the traffic for a given destination port. There's quite a bit of data to process (roughly one year).
What I've tried so far is an exponential moving average. This was good enough to let me get some interesting measures out, but comparing what I've seen with external data sources suggests it is a bit too aggressive in flagging things as abnormal.
I've considered using a combination of the exponential moving average plus historical data (possibly from 7 days in the past, thinking that there ought to be a weekly cycle to what I am seeing), as some papers I've read seem to have managed to model resource usage that way with good success.
So, does anyone know of a good method, or somewhere to go and read up on this sort of thing?
The moving average I've been using looks roughly like:
avg = avg + 0.96 * (new - avg)
With avg being the EMA and new being the new measure. I have been experimenting with what thresholds to use, but found that a combination of "must be a given factor higher than the average prior to weighing the new value in" and "must be at least 3 higher" gives the least bad result.
This is widely studied in intrusion detection literature. This is a seminal paper on the issue which shows, among other things, how to analyze tcpdump data to gain relevant insights.
This is the paper: http://www.usenix.org/publications/library/proceedings/sec98/full_papers/full_papers/lee/lee_html/lee.html. There they use the RIPPER rule induction system; I guess you could replace that old one with something newer, such as http://www.newty.de/pnc2/ or http://www.data-miner.com/rik.html
I would apply two low-pass filters to the data, one with a long time constant, T1, and one with a short time constant, T2. You would then look at the magnitude difference in output from these two filters and when it exceeds a certain threshold, K, then that would be a spike. The hardest part is tuning T1, T2 and K so that you don't get too many false positives and you don't miss any small spikes.
The following is a single pole IIR low-pass filter:
new = k * old + (1 - k) * new
The value of k determines the time constant and is usually close to 1.0 (but < 1.0 of course).
I am suggesting that you apply two such filters in parallel, with different time constants, e.g. start with say k = 0.9 for one (short time constant) and k = 0.99 for the other (long time constant) and then look at the magnitude difference in their outputs. The magnitude difference will be small most of the time, but will become large when there is a spike.
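A minimal sketch of the two-filter spike detector (Python; k_fast, k_slow and threshold correspond to the tuning knobs T2, T1 and K discussed above, with made-up default values):

def detect_spikes(samples, k_fast=0.9, k_slow=0.99, threshold=3.0):
    # Yields the indices where the fast and slow low-pass outputs diverge by more than threshold.
    fast = slow = samples[0]
    for i, x in enumerate(samples):
        fast = k_fast * fast + (1 - k_fast) * x  # short time constant: follows the signal quickly
        slow = k_slow * slow + (1 - k_slow) * x  # long time constant: tracks the baseline
        if abs(fast - slow) > threshold:
            yield i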

Smart progress bar ETA computation

In many applications, we have some progress bar for a file download, for a compression task, for a search, etc. We all often use progress bars to let users know something is happening. And if we know some details like just how much work has been done and how much is left to do, we can even give a time estimate, often by extrapolating from how much time it's taken to get to the current progress level.
But we've also seen programs whose Time Left "ETA" display is just comically bad. It claims a file copy will be done in 20 seconds, then one second later it says it's going to take 4 days, then it flickers again to be 20 minutes. It's not only unhelpful, it's confusing!
The reason the ETA varies so much is that the progress rate itself can vary and the programmer's math can be overly sensitive.
Apple sidesteps this by just avoiding any accurate prediction and just giving vague estimates!
That's annoying too, do I have time for a quick break, or is my task going to be done in 2 more seconds? If the prediction is too fuzzy, it's pointless to make any prediction at all.
Easy but wrong methods
As a first-pass ETA computation, probably we all just make a function like this: if p is the fraction that's done already, and t is the time it's taken so far, we output t*(1-p)/p as the estimate of how long it's going to take to finish. This simple ratio works "OK", but it's also terrible, especially at the end of the computation. If a slow download speed keeps a copy crawling along overnight, and finally in the morning something kicks in and the copy starts going at full speed, 100X faster, your ETA at 90% done may say "1 hour", and 10 seconds later you're at 95% and the ETA will say "30 minutes", which is clearly an embarrassingly poor guess. In this case "10 seconds" is a much, much, much better estimate.
When this happens you may think to change the computation to use recent speed, not average speed, to estimate ETA. You take the average download rate or completion rate over the last 10 seconds, and use that rate to project how long completion will be. That performs quite well in the previous overnight-download-which-sped-up-at-the-end example, since it will give very good final completion estimates at the end. But this still has big problems.. it causes your ETA to bounce wildly when your rate varies quickly over a short period of time, and you get the "done in 20 seconds, done in 2 hours, done in 2 seconds, done in 30 minutes" rapid display of programming shame.
The actual question:
What is the best way to compute an estimated time of completion of a task, given the time history of the computation? I am not looking for links to GUI toolkits or Qt libraries. I'm asking about the algorithm to generate the most sane and accurate completion time estimates.
Have you had success with math formulas? Some kind of averaging, maybe by using the mean of the rate over 10 seconds with the rate over 1 minute with the rate over 1 hour? Some kind of artificial filtering like "if my new estimate varies too much from the previous estimate, tone it down, don't let it bounce too much"? Some kind of fancy history analysis where you integrate progress versus time advancement to find standard deviation of rate to give statistical error metrics on completion?
What have you tried, and what works best?
Original Answer
The company that created this site apparently makes a scheduling system that answers this question in the context of employees writing code. The way it works is with Monte Carlo simulation of future based on the past.
Appendix: Explanation of Monte Carlo
This is how this algorithm would work in your situation:
You model your task as a sequence of microtasks, say 1000 of them. Suppose an hour later you have completed 100 of them. Now you run the simulation for the remaining 900 steps by randomly selecting 90 completed microtasks, adding their times and multiplying by 10. There you have one estimate; repeat N times and you have N estimates for the time remaining. Note that the average of these estimates will be about 9 hours -- no surprises here. But by presenting the resulting distribution to the user you'll honestly communicate the odds, e.g. 'with probability 90% this will take another 3-15 hours'.
This algorithm, by definition, produces a complete result if the task in question can be modeled as a bunch of independent, random microtasks. You can get a better answer only if you know how the task deviates from this model: for example, installers typically have a download/unpacking/installing task list, and the speed of one phase cannot predict the speed of another.
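A minimal Monte Carlo sketch along those lines (Python; instead of the sample-90-and-multiply-by-10 shortcut it simply resamples one completed time per remaining microtask, which is the same idea):

import random

def eta_interval(completed_times, total_steps, n_sims=1000, low=0.05, high=0.95):
    # Returns a (low, high) quantile interval for the remaining time, in the units of the inputs.
    remaining = total_steps - len(completed_times)
    sims = sorted(
        sum(random.choice(completed_times) for _ in range(remaining))
        for _ in range(n_sims)
    )
    return sims[int(low * n_sims)], sims[int(high * n_sims)]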
Appendix: Simplifying Monte Carlo
I'm not a statistics guru, but I think if you look closer into the simulation in this method, it will always return a normal distribution, being a sum of a large number of independent random variables. Therefore, you don't need to perform it at all. In fact, you don't even need to store all the completed times, since you'll only need their sum and the sum of their squares.
In maybe not very standard notation,
sigma   = sqrt( sum_of_times_squared - sum_of_times^2 / elapsedSteps )  // spread of the time spent so far
scaling = 900/100   // that is (totalSteps - elapsedSteps) / elapsedSteps
lowerBound = sum_of_times*scaling - 3*sigma*sqrt(scaling)
upperBound = sum_of_times*scaling + 3*sigma*sqrt(scaling)
With this, you can output the message saying that the thing will end between [lowerBound, upperBound] from now with some fixed probability (should be about 95%, but I probably missed some constant factor).
Here's what I've found works well! For the first 50% of the task, you assume the rate is constant and extrapolate. The time prediction is very stable and doesn't bounce much.
Once you pass 50%, you switch computation strategy. You take the fraction of the job left to do (1-p), then look back in time in a history of your own progress, and find (by binary search and linear interpolation) how long it's taken you to do the last (1-p) percentage and use that as your time estimate completion.
So if you're now 71% done, you have 29% remaining. You look back in your history and find how long ago you were at (71-29=42%) completion. Report that time as your ETA.
This is naturally adaptive. If you have X amount of work to do, it looks only at the time it took to do the X amount of work. At the end when you're at 99% done, it's using only very fresh, very recent data for the estimate.
It's not perfect of course but it smoothly changes and is especially accurate at the very end when it's most useful.
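A rough sketch of that look-back scheme (Python; the history bookkeeping and the plain linear fallback for the first half are my own packaging, not something prescribed above):

import bisect
import time

class LookbackETA:
    def __init__(self):
        self.history = []  # (timestamp, fraction_done) samples, fraction in [0, 1]

    def update(self, fraction_done, now=None):
        self.history.append((time.time() if now is None else now, fraction_done))

    def eta_seconds(self):
        now, p = self.history[-1]
        if p <= 0.5:  # first half: constant-rate extrapolation from the start
            start = self.history[0][0]
            return (now - start) * (1 - p) / p if p > 0 else float("inf")
        # second half: how long did the most recent (1 - p) of progress take?
        fractions = [f for _, f in self.history]
        i = bisect.bisect_left(fractions, p - (1 - p))
        return now - self.history[i][0]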
Whilst all the examples are valid, for the specific case of 'time left to download', I thought it would be a good idea to look at existing open source projects to see what they do.
From what I can see, Mozilla Firefox is the best at estimating the time remaining.
Mozilla Firefox
Firefox keeps track of the last estimate for time remaining, and by using this and the current estimate for time remaining, it performs a smoothing function on the time.
See the ETA code here. This uses a 'speed' which is previously calculated here and is a smoothed average of the last 10 readings.
This is a little complex, so to paraphrase:
Take a smoothed average of the speed based 90% on the previous speed and 10% on the new speed.
With this smoothed average speed work out the estimated time remaining.
Use this estimated time remaining, and the previous estimated time remaining, to create a new estimated time remaining (in order to avoid jumping)
Google Chrome
Chrome seems to jump about all over the place, and the code shows this.
One thing I do like with Chrome though is how they format time remaining.
For > 1 hour it says '1 hrs left'
For < 1 hour it says '59 mins left'
For < 1 minute it says '52 secs left'
You can see how it's formatted here
DownThemAll! Manager
It doesn't use anything clever, meaning the ETA jumps about all over the place.
See the code here
pySmartDL (a python downloader)
Takes the average ETA of the last 30 ETA calculations. Sounds like a reasonable way to do it.
See the code here (pySmartDL/pySmartDL.py#L651, commit 916f2592db326241a2bf4d8f2e0719c58b71e385).
Transmission
Gives a pretty good ETA in most cases (except when starting off, as might be expected).
Uses a smoothing factor over the past 5 readings, similar to Firefox but not quite as complex. Fundamentally similar to Gooli's answer.
See the code here
I usually use an Exponential Moving Average to compute the speed of an operation with a smoothing factor of say 0.1 and use that to compute the remaining time. This way all the measured speeds have influence on the current speed, but recent measurements have much more effect than those in the distant past.
In code it would look something like this:
alpha = 0.1 # smoothing factor
...
speed = (speed * (1 - alpha)) + (currentSpeed * alpha)
If your tasks are uniform in size, currentSpeed would simply be the time it took to execute the last task. If the tasks have different sizes and you know that one task is supposed to be, e.g., twice as long as another, you can divide the time it took to execute the task by its relative size to get the current speed. Using speed you can compute the remaining time by multiplying it by the total size of the remaining tasks (or just by their number if the tasks are uniform).
Hopefully my explanation is clear enough; it's a bit late in the day.
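A small sketch of that EMA-of-speed approach (Python; 'speed' here is work units per second, and alpha = 0.1 is the smoothing factor mentioned above):

class EmaEta:
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.speed = None  # work units per second

    def task_done(self, work_units, seconds):
        current = work_units / seconds
        if self.speed is None:
            self.speed = current
        else:
            self.speed = self.speed * (1 - self.alpha) + current * self.alpha

    def seconds_remaining(self, work_units_left):
        return work_units_left / self.speed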
In certain instances, when you need to perform the same task on a regular basis, it might be a good idea of using past completion times to average against.
For example, I have an application that loads the iTunes library via its COM interface. The size of a given iTunes library generally does not increase dramatically from launch to launch in terms of the number of items, so in this example it might be possible to track the last three load times and load rates, average against that, and compute your current ETA.
This would be hugely more accurate than an instantaneous measurement and probably more consistent as well.
However, this method depends on the size of the task being relatively similar to the previous ones, so it would not work for a decompression routine or anything else where any given byte stream is the data to be crunched.
Just my $0.02
First off, it helps to generate a running moving average. This weights more recent events more heavily.
To do this, keep a bunch of samples around (circular buffer or list), each a pair of progress and time. Keep the most recent N seconds of samples. Then generate a weighted average of the samples:
totalProgress += (curSample.progress - prevSample.progress) * scaleFactor
totalTime += (curSample.time - prevSample.time) * scaleFactor
where scaleFactor goes linearly from 0...1 as an inverse function of time in the past (thus weighing more recent samples more heavily). You can play around with this weighting, of course.
At the end, you can get the average rate of change:
averageProgressRate = (totalProgress / totalTime);
You can use this to figure out the ETA by dividing the remaining progress by this number.
However, while this gives you a good trending number, you have one other issue - jitter. If, due to natural variations, your rate of progress moves around a bit (it's noisy) - e.g. maybe you're using this to estimate file downloads - you'll notice that the noise can easily cause your ETA to jump around, especially if it's pretty far in the future (several minutes or more).
To avoid jitter from affecting your ETA too much, you want this average rate of change number to respond slowly to updates. One way to approach this is to keep around a cached value of averageProgressRate, and instead of instantly updating it to the trending number you've just calculated, you simulate it as a heavy physical object with mass, applying a simulated 'force' to slowly move it towards the trending number. With mass, it has a bit of inertia and is less likely to be affected by jitter.
Here's a rough sample:
// desiredAverageProgressRate is computed from the weighted average above
// m_averageProgressRate is a member variable also in progress units/sec
// lastTimeElapsed = the time delta in seconds (since last simulation)
// m_averageSpeed is a member variable in units/sec, used to hold the
// the velocity of m_averageProgressRate
const float frictionCoeff = 0.75f;
const float mass = 4.0f;
const float maxSpeedCoeff = 0.25f;
// lose 25% of our speed per sec, simulating friction
m_averageSeekSpeed *= pow(frictionCoeff, lastTimeElapsed);
float delta = desiredAvgProgressRate - m_averageProgressRate;
// update the velocity
float oldSpeed = m_averageSeekSpeed;
float accel = delta / mass;
m_averageSeekSpeed += accel * lastTimeElapsed; // v += at
// clamp the top speed to 25% of our current value
float sign = (m_averageSeekSpeed > 0.0f ? 1.0f : -1.0f);
float maxVal = m_averageProgressRate * maxSpeedCoeff;
if (fabs(m_averageSeekSpeed) > maxVal)
{
m_averageSeekSpeed = sign * maxVal;
}
// make sure they have the same sign
if ((m_averageSeekSpeed > 0.0f) == (delta > 0.0f))
{
float adjust = (oldSpeed + m_averageSeekSpeed) * 0.5f * lastTimeElapsed;
// don't overshoot.
if (fabs(adjust) > fabs(delta))
{
adjust = delta;
// apply damping
m_averageSeekSpeed *= 0.25f;
}
m_averageProgressRate += adjust;
}
Your question is a good one. If the problem can be broken up into discrete units, an accurate calculation often works best. Unfortunately this may not be the case: even if you are installing 50 components, each one might nominally be 2%, but one of them can be massive. One thing that I have had moderate success with is to clock the CPU and disk and give a decent estimate based on observational data. Knowing that certain checkpoints are really point x allows you some opportunity to correct for environment factors (network, disk activity, CPU load). However, this solution is not general in nature due to its reliance on observational data. Using ancillary data, such as RPM file size, helped me make my progress bars more accurate, but they are never bulletproof.
Uniform averaging
The simplest approach would be to predict the remaining time linearly:
t_rem := t_spent * ( n - prog ) / prog
where t_rem is the predicted ETA, t_spent is the time elapsed since the commencement of the operation, prog the number of microtasks completed out of their full quantity n. To explain—n may be the number of rows in a table to process or the number of files to copy.
This method having no parameters, one need not worry about the fine-tuning of the exponent of attenuation. The trade-off is poor adaptation to a changing progress rate, because all samples contribute equally to the estimate, whereas it is only meet that recent samples should have more weight than old ones, which leads us to
Exponential smoothing of rate
in which the standard technique is to estimate progress rate by averaging previous point measurements:
rate := 1 / (n * dt);                       { rate equals normalized progress per unit time }
if prog = 1 then                            { if the first microtask just completed }
  rate_est := rate                          { initialize the estimate }
else
begin
  weight   := Exp( - dt / DECAY_T );
  rate_est := rate_est * weight + rate * (1.0 - weight);
  t_rem    := (1.0 - prog / n) / rate_est;
end;
where dt denotes the duration of the last completed microtask and is equal to the time passed since the previous progress update. Notice that weight is not a constant and must be adjusted according to the length of time during which a certain rate was observed, because the longer we observed a certain speed the higher the exponential decay of the previous measurements. The constant DECAY_T denotes the length of time during which the weight of a sample decreases by a factor of e. SPWorley himself suggested a similar modification to gooli's proposal, although he applied it to the wrong term. An exponential average for equidistant measurements is:
Avg_e(n) = Avg_e(n-1) * alpha + m_n * (1 - alpha)
but what if the samples are not equidistant, as is the case with times in a typical progress bar? Take into account that alpha above is but an empirical quotient whose true value is:
alpha = Exp( - lambda * dt ),
where lambda is the parameter of the exponential window and dt the amount of change since the previous sample, which need not be time, but any linear and additive parameter. alpha is constant for equidistant measurements but varies with dt.
Mark that this method relies on a predefined time constant and is not scalable in time. In other words, if the exact same process is uniformly slowed down by a constant factor, this rate-based filter will become proportionally more sensitive to signal variations, because at every step weight will be decreased. If we, however, desire a smoothing independent of the time scale, we should consider
Exponential smoothing of slowness
which is essentially the smoothing of rate turned upside down, with the added simplification of a constant weight, because prog is growing by equidistant increments:
slowness := n * dt;                         { slowness is the amount of time per unit of progress }
if prog = 1 then                            { if the first microtask just completed }
  slowness_est := slowness                  { initialize the estimate }
else
begin
  weight       := Exp( - 1 / (n * DECAY_P) );
  slowness_est := slowness_est * weight + slowness * (1.0 - weight);
  t_rem        := (1.0 - prog / n) * slowness_est;
end;
The dimensionless constant DECAY_P denotes the normalized progress difference between two samples of which the weights are in the ratio of one to e. In other words, this constant determines the width of the smoothing window in progress domain, rather than in time domain. This technique is therefore independent of the time scale and has a constant spatial resolution.
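A small Python rendering of the slowness-smoothing pseudocode above (the class wrapper and the default DECAY_P value are my own packaging, not part of the answer):

import math

class SlownessEta:
    # Exponential smoothing of slowness: time per unit of normalized progress.
    def __init__(self, n_steps, decay_p=0.1):
        self.n = n_steps        # total number of microtasks
        self.decay_p = decay_p  # smoothing window width, as a fraction of total progress
        self.slowness_est = None

    def microtask_done(self, dt):
        slowness = self.n * dt  # dt = duration of the microtask that just finished
        if self.slowness_est is None:
            self.slowness_est = slowness
        else:
            w = math.exp(-1.0 / (self.n * self.decay_p))
            self.slowness_est = self.slowness_est * w + slowness * (1.0 - w)

    def t_rem(self, prog):
        return (1.0 - prog / self.n) * self.slowness_est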
Further research: adaptive exponential smoothing
You are now equipped to try the various algorithms of adaptive exponential smoothing. Only remember to apply it to slowness rather than to rate.
I always wish these things would tell me a range. If it said, "This task will most likely be done between 8 and 30 minutes from now," then I have some idea of what kind of break to take. If it's bouncing all over the place, I'm tempted to watch it until it settles down, which is a big waste of time.
I have tried and simplified your "easy"/"wrong"/"OK" formula and it works best for me:
t / p - t
In Python:
>>> done=0.3; duration=10; "time left: %i" % (duration / done - duration)
'time left: 23'
That saves one op compared to (dur*(1-done)/done). And, in the edge case you describe, possibly ignoring the dialog for 30 minutes extra hardly matters after waiting all night.
Comparing this simple method to the one used by Transmission, I found it to be up to 72% more accurate.
I don't sweat it, it's a very small part of an application. I tell them what's going on, and let them go do something else.

Understanding algorithms for measuring trends

What's the rationale behind the formula used in the hive_trend_mapper.py program of this Hadoop tutorial on calculating Wikipedia trends?
There are actually two components: a monthly trend and a daily trend. I'm going to focus on the daily trend, but similar questions apply to the monthly one.
In the daily trend, pageviews is an array of number of page views per day for this topic, one element per day, and total_pageviews is the sum of this array:
# pageviews for most recent day
y2 = pageviews[-1]
# pageviews for previous day
y1 = pageviews[-2]
# Simple baseline trend algorithm
slope = y2 - y1
trend = slope * log(1.0 + int(total_pageviews))
error = 1.0 / sqrt(int(total_pageviews))
return trend, error
I know what it's doing superficially: it just looks at the change over the past day (slope), and scales this up to the log of 1+total_pageviews (log(1)==0, so this scaling factor is non-negative). It can be seen as treating the month's total pageviews as a weight, but tempered as it grows - this way, the total pageviews stop making a difference for things that are "popular enough," but at the same time big changes on insignificant topics don't get weighed as much.
But why do this? Why do we want to discount things that were initially unpopular? Shouldn't big deltas matter more for items that have a low constant popularity, and less for items that are already popular (for which the big deltas might fall well within a fraction of a standard deviation)? As a strawman, why not simply take y2-y1 and be done with it?
And what would the error be useful for? The tutorial doesn't really use it meaningfully again. Then again, it doesn't tell us how trend is used either - this is what's plotted in the end product, correct?
Where can I read up for a (preferably introductory) background on the theory here? Is there a name for this madness? Is this a textbook formula somewhere?
Thanks in advance for any answers (or discussion!).
As the in-line comment says, this is a simple "baseline trend algorithm", which basically means that before you compare the trends of two different pages, you have to establish a baseline. In many cases the mean value is used; it's straightforward if you plot the pageviews against the time axis. This method is widely used in monitoring water quality, air pollutants, etc. to detect any significant changes w.r.t. the baseline.
In the OP's case, the slope of pageviews is weighted by the log of total_pageviews. This sorta uses total_pageviews as a baseline correction for the slope. As Simon put it, this strikes a balance between two pages with very different total_pageviews. For example, A has a slope of 500 over 1,000,000 total pageviews, while B has 1,000 over 1,000. The log basically means 1,000,000 is ONLY twice as important as 1,000 (rather than 1000 times). If you only consider the slope, A is less popular than B. But with the weight, the measure of popularity of A comes out the same as B. I think it is quite intuitive: though A's increase is only 500 pageviews, that's because it's saturating, so you still have to give it enough credit.
As for the error, I believe it comes from the (relative) standard error, which has a factor 1/sqrt(n), where n is the number of data points. In the code, the error is equal to (1/sqrt(n)) * (1/sqrt(mean)). It roughly translates into: the more data points, the more accurate the trend. I don't see it as an exact math formula, just a rough trend-analysis heuristic; anyway, the relative value is what matters in this context.
In summary, I believe it's just an empirical formula. More advanced topics can be found in some biostatistics textbooks (very similar to monitoring the breakout of a flu or the like.)
The code implements statistics (in this case the "baseline trend"); educate yourself on that and everything becomes clearer. Wikibooks has a good introduction.
The algorithm takes into account that new pages are by definition more unpopular than existing ones (because - for example - they are linked from relatively few other places) and suggests that those new pages will grow in popularity over time.
error is the error margin the system expects for its predictions. The higher the error, the less likely it is that the trend will continue as expected.
The reason for moderating the measure by the volume of clicks is not to penalise popular pages but to make sure that you can compare large and small changes with a single measure. If you just use y2 - y1 you will only ever see the click changes on large-volume pages. What this is trying to express is "significant" change: a 1000-click change when you normally attract 100 clicks is really significant; a 1000-click change when you attract 100,000 is less so. What this formula is trying to do is make both of these visible.
Try it out at a few different scales in Excel, you'll get a good view of how it operates.
Hope that helps.
Another way to look at it is this:
Suppose your page and my page are created on the same day, and up to some point your page gets about ten million total views while mine gets about one million. Then suppose that at that point my slope is a million views per day and yours is half a million. If you just use the slope, I win; but your page was already getting more views per day at that point (yours was at 5 million a day and mine at 1 million), so my extra million still only takes me to 2 million for that day, while yours goes to 5.5 million. So maybe this scaling concept is there to adjust the results and show that your page is also a good trend setter: its slope is smaller, but it was already more popular. And since the scaling is only a log factor, it doesn't seem too problematic to me.
