Improving scaling of sample from discrete distribution - performance

I have recently started playing around with Julia and I am currently working on a Monte Carlo simulation of some stochastic process on a 2-dimensional lattice. Each site has some associated rate of activation (the number of times it "does something" per second on average) which we can assume to be approximately constant. The order in which lattice sites activate is relevant so we need a method for randomly picking one particular site with probability proportional to its rate of activation.
It looks like sample(sites,weights(rates)) from the StatsBase package is exactly what I am looking for BUT from testing out my code structure (no logic, just loops and RNG), it turns out that sample() scales linearly with the number of sites. This means that overall my runtimes scale like N^(2+2), where N is the side length of my 2-dimensional lattice (one factor of 2 from the increase in total rate of activity, the other from the scaling of sample()).
Now, the increase in total rate of activity is unavoidable but I think the scaling of the "random pick with weights" method can be improved. More specifically, one should be able to achieve a logarithmic scaling with the number of sites (rather than linear). Consider for example the following function (and, please, forgive the poor coding)
function randompick(indices,rates)
cumrates = [sum(rates[1:i]) for i in indices]
pick = rand()*cumrates[end]
tick = 0
lowb = 0
highb = indices[end]
while tick == 0
mid = floor(Int,(highb+lowb)/2)
midrate = cumrates[mid]
if pick > midrate
lowb = mid
else
highb = mid
end
if highb-lowb == 1
tick = 1
end
end
return(highb)
end
Because we halve the number of "pickable" sites at each step, it would take n steps to pick one specific site out of 2^n (hence the logarithmic scaling). However, in its current state randompick() is so much slower than sample() that scaling is practically irrelevant. Is there any way of reducing this method to a form that can compete with sample() and hence take advantage of the improved scaling?
EDIT: calculating cumrates scales like N^2 as well but that can be solved by working with rates in the correct (cumulative) form throughout the code.

A simpler version of what I think you were trying for is:
function randompick(rates)
cumrates = cumsum(rates)
pick = rand()*cumrates[end]
searchsortedfirst(cumrates, pick)
end
The call to searchsortedfirst does scale logarithmically, but cumsum only scales linearly, thus eliminating any advantage this might have.
If the rates are constant, you could preprocess cumrates ahead of time, but if this was the case you would be better off using an alias table which can sample in constant time. There is an implementation available in the Distributions.jl package:
using Distributions
s = Distributions.AliasTable(rates)
rand(s)
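To make the constant-time claim concrete, here is a minimal Python sketch of an alias table built with Vose's method. It is not the Distributions.jl implementation; the names build_alias_table and alias_sample are made up for this illustration, and it assumes the rates are non-negative with a positive sum.

```python
import random

def build_alias_table(rates):
    """Preprocess non-negative rates into (prob, alias) tables in O(n)."""
    n = len(rates)
    total = sum(rates)
    scaled = [r * n / total for r in rates]  # rescale so the mean weight is 1
    prob = [1.0] * n
    alias = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # column s keeps its own index with this probability
        alias[s] = l          # otherwise it hands the draw over to index l
        scaled[l] = scaled[l] + scaled[s] - 1.0  # l donated mass to fill column s
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover columns keep prob 1.0 and alias themselves (set at initialisation).
    return prob, alias

def alias_sample(prob, alias, rng=random):
    """Draw one index in O(1): one uniform column pick plus one biased coin flip."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

prob, alias = build_alias_table([1.0, 2.0, 3.0, 4.0])
site = alias_sample(prob, alias)
```

The table costs O(n) to build, so it only pays off when the rates stay fixed over many draws, which matches the caveat above.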

I found out about an alternative sampling method in this paper by P. Hanusse that does not seem to scale with N, at least when the allowed activity rates are of the same order of magnitude.
The idea is to assume that all sites have the same rate of activity, equal to the rate of activity of the most active site maxrate (so that the random pick is reduced to a single RNG call rand(1:N)). Once we have picked a site, we separate its (constant) rate of activity into two contributions, the original rate of activity and a "do-nothing" rate (the second being the constant rate minus its original rate). Now we generate a second random number c = rand() * maxrate. If c<rate[site], we keep that site choice and proceed to activate the site, otherwise we go back to the uniform random pick.
The function containing the two RNG calls would look like this, with the second returned value determining whether the call has to be repeated.
function HanussePick(rates,maxrate)
site = rand(1:length(rates)) # uniform pick over all N^2 sites; N itself is not defined inside the function
slider = rand() * maxrate
return(site,rates[site]-slider)
end
The advantage of this approach is that, if the allowed rates of activity are comparable to each other, there should be no scaling with N, as we only need to generate O(1) random numbers.
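For illustration, the full accept/reject loop described above could look like the following Python sketch. The function name hanusse_pick and the returned attempt counter are my own additions, and the sketch assumes maxrate is at least as large as every entry of rates and that at least one rate is positive.

```python
import random

def hanusse_pick(rates, maxrate, rng=random):
    """Pick a site with probability proportional to its rate by rejection.

    Returns (site, attempts), where attempts counts the uniform picks
    that were needed before one was accepted.
    """
    attempts = 0
    while True:
        attempts += 1
        site = rng.randrange(len(rates))     # uniform pick, as if all rates were maxrate
        slider = rng.random() * maxrate      # second random number
        if slider < rates[site]:             # accept with probability rates[site] / maxrate
            return site, attempts
        # otherwise the "do-nothing" part was hit: go back to the uniform pick

site, attempts = hanusse_pick([0.8, 1.0, 0.9, 0.7], maxrate=1.0)
```

The expected number of attempts is maxrate divided by the mean rate, which is why the method stays O(1) only while the rates are of comparable size.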


Finding the time in which a specific value is reached in time-series data when peaks are found

I would like to find the time instant at which a certain value is reached in a time-series data with noise. If there are no peaks in the data, I could do the following in MATLAB.
Code from here
% create example data
d=1:100;
t=d/100;
ts = timeseries(d,t);
% define threshold
thr = 55;
data = ts.data(:);
time = ts.time(:);
ind = find(data>thr,1,'first');
time(ind) %time where data>threshold
But when there is noise, I am not sure what has to be done.
In the time-series data plotted in the above image I want to find the time instant at which the y-axis value 5 is reached. The data actually stabilizes to 5 at t>=100 s. But due to the presence of noise in the data, we see a peak that reaches 5 somewhere around 20 s. I would like to know how to detect e.g. 100 seconds as the right time and not 20 s. The code posted above will only give 20 s as the answer. I saw a post here that explains using a sliding window to find when the data equilibrates. However, I am not sure how to implement the same. Suggestions will be really helpful.
The sample data plotted in the above image can be found here
Suggestions on how to implement in Python or MATLAB code will be really helpful.
EDIT:
I don't want to capture when the peak (/noise/overshoot) occurs. I want to find the time when equilibrium is reached. For example, around 20 s the curve rises and dips below 5. After ~100 s the curve equilibrates to a steady-state value 5 and never dips or peaks.
Precise data analysis is a serious business (and my passion) that involves a lot of understanding of the system you are studying. Here are some comments. Unfortunately, I doubt there is a simple, nice answer to your problem at all -- you will have to think about it. Data analysis basically always requires "discussion".
First to your data and problem in general:
When you talk about noise, in data analysis this means a statistical random fluctuation, most often Gaussian (sometimes also other distributions, e.g. Poisson). Gaussian noise is a) random in each bin and b) symmetric in negative and positive direction. Thus, what you observe in the peak at ~20s is not noise. It has very different, very systematic and extended characteristics compared to random noise. This is an "artifact" that must have an origin, but about which we can only speculate here. In real-world applications, studying and removing such artifacts is the most expensive and time-consuming task.
Looking at your data, the random noise is negligible. This is very precise data. For example, after ~150s and later there are no visible random fluctuations up to fourth decimal number.
After concluding that this is not noise in the common sense, it could be at least two things: a) a feature of the system you are studying, thus something for which you could develop a model/formula and which you could "fit" to the data. b) a characteristic of limited bandwidth somewhere in the measurement chain, thus here a high-frequency cutoff. See e.g. https://en.wikipedia.org/wiki/Ringing_artifacts . Unfortunately, for both a and b, there are no catch-all generic solutions. And your problem description (even with code and data) is not sufficient to propose an ideal approach.
After spending ~one hour on your data and making some plots, I believe (speculate) that the extremely sharp feature at ~10s cannot be a "physical" property of the data. It simply is too extreme/steep. Something fundamental happened here. A guess of mine could be that some device was just switched on (was off before). Thus, the data before is meaningless, and there is a short period of time afterwards to stabilize the system. There is not really an alternative in this scenario but to entirely discard the data until the system has stabilized at around 40s. This also makes your problem trivial. Just delete the first 40s, then the maximum becomes evident.
So what technical solutions could you use? Please don't be too upset that you have to think about this yourself and assemble the best possible solution for your case. I copied your data into two numpy arrays x and y and ran the following tests in Python:
Remove unstable time
This is the trivial solution -- I prefer it.
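import numpy as np
import matplotlib.pyplot as plt
plt.figure()
plt.xlabel('time')
plt.ylabel('signal')
plt.plot(x, y, label="original")
y_cut = y.copy()  # copy, so that y itself stays untouched for the later tests
y_cut[:40] = 0  # zero the first 40 samples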
plt.plot(x, y_cut, label="cut 40s")
plt.legend()
plt.grid()
plt.show()
Note: carry on reading below only if you are a bit crazy (about data).
Sliding window
You mentioned "sliding window" which is best suited for random noise (which you don't have) or periodic fluctuations (which you also don't really have). Sliding window just averages over consecutive bins, averaging out random fluctuations. Mathematically this is a convolution.
Technically, you can actually solve your problem like this (try even larger values of Nwindow yourself):
Nwindow=10
y_slide_10 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow=20
y_slide_20 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow=30
y_slide_30 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
plt.xlabel('time')
plt.ylabel('signal')
plt.plot(x,y, label="original")
plt.plot(x,y_slide_10, label="window=10")
plt.plot(x,y_slide_20, label='window=20')
plt.plot(x,y_slide_30, label='window=30')
plt.legend()
#plt.xscale('log') # useful
plt.grid()
plt.show()
Thus, technically you can succeed in suppressing the initial "hump". But don't forget this is a hand-tuned and not a general solution...
Another caveat of any sliding window solution: this always distorts your timing. Since you average over an interval in time, depending on rising or falling signals your convolved trace is shifted back/forth in time (slightly, but significantly). In your particular case this is not a problem since the main signal region has basically no time-dependence (very flat).
Frequency domain
This should be the silver bullet, but it also does not work well/easily for your example. The fact that this doesn't work better is the main hint to me that the first 40s of data are better discarded.... (i.e. in a scientific work)
You can use fast Fourier transform to inspect your data in frequency-domain.
import scipy.fft
y_fft = scipy.fft.rfft(y)
# original frequency domain plot
plt.plot(np.abs(y_fft), label="original")  # rfft returns complex values, so plot the magnitude
plt.xlabel('frequency')
plt.ylabel('signal')
plt.yscale('log')
plt.show()
The structure in frequency represents the features of your data. The peak at zero is the stabilized region after ~100s, the humps are associated with (rapid) changes in time. You can now play around and change the frequency spectrum (--> filter) but I think the spectrum is so artificial that this doesn't yield great results here. Try it with other data and you may be very impressed! I tried two things: first, cut high-frequency regions out (set to zero), and second, apply a sliding-window filter in frequency domain (sparing the peak at 0, since this cannot be touched. Try it and you will know why).
# cut high-frequency by setting to zero
y_fft_2 = np.array(y_fft)
y_fft_2[50:70] = 0
# sliding window in frequency
Nwindow = 15
Start = 10
y_fft_slide = np.array(y_fft)
y_fft_slide[Start:] = np.convolve(y_fft[Start:], np.ones((Nwindow,))/Nwindow, mode='same')
# frequency-domain plot
plt.plot(np.abs(y_fft), label="original")  # magnitudes, since the spectra are complex
plt.plot(np.abs(y_fft_2), label="high-frequency, filter")
plt.plot(np.abs(y_fft_slide), label="frequency sliding window")
plt.xlabel('frequency')
plt.ylabel('signal')
plt.yscale('log')
plt.legend()
plt.show()
Converting this back into time-domain:
# reverse FFT into time-domain for plotting
y_filtered = scipy.fft.irfft(y_fft_2, n=len(y))  # n keeps the original length, also for odd len(y)
y_filtered_slide = scipy.fft.irfft(y_fft_slide, n=len(y))
# time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_filtered[:500], label="high-f filtered")
plt.plot(x[:500], y_filtered_slide[:500], label="frequency sliding window")
# plt.xscale('log') # useful
plt.grid()
plt.legend()
plt.show()
yields
There are apparent oscillations in those solutions which make them essentially useless for your purpose. This leads me to my final exercise: again apply a sliding-window filter, this time on the "frequency sliding window" time-domain trace:
# extra time-domain sliding window
Nwindow=90
y_fft_90 = np.convolve(y_filtered_slide, np.ones((Nwindow,))/Nwindow, mode='same')
# final time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_fft_90[:500], label="frequency-sliding window, slide")
# plt.xscale('log') # useful
plt.legend()
plt.show()
I am quite happy with this result, but it still has very small oscillations and thus does not solve your original problem.
Conclusion
How much fun. One hour well wasted. Maybe it is useful to someone. Maybe even to you Natasha. Please don't be mad at me...
Let's assume your data is in data variable and time indices are in time. Then
import numpy as np
threshold = 0.025
stable_index = np.where(np.abs(data[-1] - data) > threshold)[0][-1] + 1
print('Stabilizes after', time[stable_index], 'sec')
Stabilizes after 96.6 sec
Here data[-1] - data is a difference between last value of data and all the data values. The assumption here is that the last value of data represents the equilibrium point.
np.where( * > threshold )[0] are all the indices where the absolute deviation of data from its last value is greater than the threshold, that is, where the series has still not stabilized. We take only the last such index. The next one is where the time series is considered stabilized, hence the + 1.
If you're dealing with deterministic data which is eventually converging monotonically to some fixed value, the problem is pretty straightforward. Your last observation should be the closest to the limit, so you can define an acceptable tolerance threshold relative to that last data point and scan your data from back to front to find where you exceeded your threshold.
Things get a lot nastier once you add random noise into the picture, particularly if there is serial correlation. This problem is common in simulation modeling (see (*) below), and is known as the issue of initial bias. It was first identified by Conway in 1963, and has been an active area of research since then with no universally accepted definitive answer on how to deal with it. As with the deterministic case, the most widely accepted answers approach the problem starting from the right-hand side of the data set since this is where the data are most likely to be in steady state. Techniques based on this approach use the end of the dataset to establish some sort of statistical yardstick or baseline to measure where the data start looking significantly different as observations get added by moving towards the front of the dataset. This is greatly complicated by the presence of serial correlation.
If a time series is in steady state, in the sense of being covariance stationary, then a simple average of the data is an unbiased estimate of its expected value, but the standard error of the estimated mean depends heavily on the serial correlation. The correct standard error squared is no longer s^2/n, but instead it is (s^2/n)*W where W is a properly weighted sum of the autocorrelation values. A method called MSER was developed in the 1990s, and avoids the issue of trying to correctly estimate W by trying to determine where the standard error is minimized. It treats W as a de-facto constant given a sufficiently large sample size, so if you consider the ratio of two standard error estimates the W's cancel out and the minimum occurs where s^2/n is minimized. MSER proceeds as follows:
1. Starting from the end, calculate s^2 for half of the data set to establish a baseline.
2. Now update the estimate of s^2 one observation at a time using an efficient technique such as Welford's online algorithm, and calculate s^2/n where n is the number of observations tallied so far. Track which value of n yields the smallest s^2/n. Lather, rinse, repeat.
3. Once you've traversed the entire data set from back to front, the n which yielded the smallest s^2/n is the number of observations from the end of the data set which are not detectable as being biased by the starting conditions.
Justification - with a sufficiently large baseline (half your data), s^2/n should be relatively stable as long as the time series remains in steady state. Since n is monotonically increasing, s^2/n should continue decreasing subject to the limitations of its variability as an estimate. However, once you start acquiring observations which are not in steady state the drift in mean and variance will inflate the numerator of s^2/n. Hence the minimal value corresponds to the last observation where there was no indication of non-stationarity. More details can be found in this proceedings paper. A Ruby implementation is available on BitBucket.
Your data has such a small amount of variation that MSER concludes that it is still converging to steady state. As such, I'd advise going with the deterministic approach outlined in the first paragraph. If you have noisy data in the future, I'd definitely suggest giving MSER a shot.
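The back-to-front procedure described above can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the Ruby implementation mentioned above; the function name mser_truncation and the synthetic data are assumptions made for the example.

```python
import math

def mser_truncation(data):
    """Return the index at which the retained (steady-state) tail starts.

    Walks the data from back to front, keeps a running variance with
    Welford's online algorithm, and remembers the tail length n with
    the smallest s^2 / n once at least half of the data is included.
    """
    total = len(data)
    if total < 4:
        raise ValueError("need at least 4 observations")
    baseline = total // 2
    count, mean, m2 = 0, 0.0, 0.0
    best_n, best_score = total, math.inf
    for x in reversed(data):
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)               # running sum of squared deviations
        if count >= baseline:
            score = (m2 / (count - 1)) / count  # s^2 / n for the last `count` points
            if score < best_score:
                best_score, best_n = score, count
    return total - best_n

# Synthetic example: a decaying warm-up followed by a noisy steady state.
warm_up = [20.0 - 0.5 * i for i in range(20)]
steady = [1.0 if i % 2 == 0 else -1.0 for i in range(180)]
cut = mser_truncation(warm_up + steady)
```

Everything before index cut would be discarded as warm-up. On almost noise-free data the minimum tends to sit at the baseline itself, which is the behaviour reported above for the data in the question.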
(*) - In a nutshell, a simulation model is a computer program and hence has to have its state set to some set of initial values. We generally don't know what the system state will look like in the long run, so we initialize it to an arbitrary but convenient set of values and then let the system "warm up". The problem is that the initial results of the simulation are not typical of the steady state behaviors, so including that data in your analyses will bias them. The solution is to remove the biased portion of the data, but how much should that be?

Distribution of the Training Data vs Distribution of the Test/Prediction

Does the Distribution represented by the training data need to reflect the distribution of the test data and the data that you predict on? Can I measure the quality of the training data by looking at the distribution of each feature and compare that distribution to the data I am predicting or testing with? Ideally the training data should be sufficiently representative of the real world distribution.
Short answer: similar ranges would be a good idea.
Long answer: sometimes it won't be an issue (rarely) but let's examine when.
In an ideal situation, your model will capture the true phenomenon perfectly. Imagine the simplest case: the linear model y = x. If the training data are noiseless (or have tolerable noise), your linear regression will naturally land on a model approximately equal to y = x. The model will generalize nearly perfectly even outside of the training range. If your training data were {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10}, the test point 500 will map nicely onto the function, returning 500.
In most modeling scenarios, this will almost certainly not be the case. If the training data are ample and the model is appropriately complex (and no more), you're golden.
The trouble is that few functions (and corresponding natural phenomena) -- especially when we consider nonlinear functions -- extend to data outside of the training range so cleanly. Imagine sampling office temperature against employee comfort. If you only look at temperatures from 40 deg to 60 deg, a linear function will behave brilliantly in the training data. Oddly enough, if you test on 60 to 80, the mapping will break down. Here, the issue is confidence in your claim that the data are sufficiently representative.
Now let's consider noise. Imagine that you know EXACTLY what the real world function is: a sine wave. Better still, you are told its amplitude and phase. What you don't know is its frequency. You have a really solid sampling between 1 and 100, and the function you fit maps against the training data really well. Now if there is just enough noise, you might estimate the frequency incorrectly by a hair. When you test near the training range, the results aren't so bad. Outside of the training range, things start to get wonky. As you move further and further from the training range, the real function and the fitted function diverge and converge based on their relative frequencies. Sometimes, the residuals are seemingly fine; sometimes they are dreadful.
There is an issue with your idea of examining the variable distributions: interaction between variables. Even if each variable is appropriately balanced in train and test, it is possible that the relationships between variables will differ (joint distributions). For a purely contrived example, consider you were predicting an individual's likelihood of being pregnant at any given time. In your training set, you had women aged 20 to 30 and men aged 30 to 40. In testing, you had the same percentage of men and women, but the age ranges were flipped. Independently, the variables look very nicely matched! But in your training set, you could very easily conclude, "only people under 30 get pregnant." Oddly enough, your testing set would demonstrate the exact opposite! The trouble is that your predictions are being made from a multivariate space, but the distributions you are thinking about are univariate. Considering the joint distributions of continuous variables against one another (and considering categorical variables appropriately) is, however, a good idea. Ideally, your fit model should have access to a similar range to your testing data.
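The contrived example above can be written out as a small Python check. The row counts are invented for illustration; the point is only that every univariate comparison passes while the joint comparison fails.

```python
from collections import Counter

# Each row is (sex, age_group); the counts are made up for the illustration.
train = [("F", "20-30")] * 50 + [("M", "30-40")] * 50
test = [("F", "30-40")] * 50 + [("M", "20-30")] * 50

def marginal(rows, column):
    """Distribution of a single column, ignoring the other one."""
    return Counter(row[column] for row in rows)

same_sex_marginal = marginal(train, 0) == marginal(test, 0)
same_age_marginal = marginal(train, 1) == marginal(test, 1)
same_joint = Counter(train) == Counter(test)

print(same_sex_marginal, same_age_marginal, same_joint)
```

Both marginal comparisons agree while the joint one does not, so a per-feature check alone would report the two sets as well matched.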
Fundamentally, the question is about extrapolation from a limited training space. If the model fit in the training space generalizes, you can generalize; ultimately, it is usually safest to have a really well distributed training set to maximize the likelihood that you have captured the complexity of the underlying function.
Really interesting question! I hope the answer was somewhat insightful; I'll continue to build on it as resources come to mind! Let me know if any questions remain!
EDIT: a point made in the comments that I think should be read by future readers.
Ideally, training data should NEVER influence testing data in ANY way. That includes examining the distributions, joint distributions etc. With sufficient data, distributions in the training data should converge on distributions in the testing data (think of the mean and the law of large numbers). Manipulation to match distributions (like z-scoring before train/test split) fundamentally skews performance metrics in your favor. An appropriate technique for splitting train and test data would be something like stratified k fold for cross validation.
Sorry for the delayed response. After going through a few months of iterating, I implemented and pushed the following solution to production and it is working quite well.
The issue here boils down to how one can reduce the training/test score variance when performing cross validation. This is important because if your variance is high, the confidence in picking the best model goes down. The more representative the test data is of the train data, the less variance you get in your test scores across the cross validation set. Stratified cross validation tackles this issue, especially when there is significant class imbalance, by ensuring that the label class proportions are preserved across all test/train sets. However, this doesn't address the issue with the feature distribution.
In my case, I had a few features that were very strong predictors but also very skewed in their distribution. This caused significant variance in my test scores which made it harder to pick a model with any confidence. Essentially, the solution is to ensure that the joint distribution of the label with the feature set is maintained across test/train sets. There are many ways of doing this, but a very simple approach is to take each column bucket range (if continuous) or label (if categorical) one by one and sample from these buckets when generating the test and train sets. Note that the buckets quickly get very sparse, especially when you have a lot of categorical variables. Also, the column order in which you bucket affects the sampling output greatly. Below is a solution where I bucket the label first (the same as in stratified CV) and then sample 1 other feature (the most important feature, called score_percentage, which is known upfront).
import random
import numpy as np

def train_test_folds(self, label_column="label"):
    # train_test is an array of tuples where each tuple is a test numpy array and train numpy array pair.
    # The final iterator would return these individual elements separately.
    n_folds = self.n_folds
    label_classes = np.unique(self.label)
    train_test = []
    fmpd_copy = self.fm.copy()
    fmpd_copy[label_column] = self.label
    fmpd_copy = fmpd_copy.reset_index(drop=True).reset_index()
    fmpd_copy = fmpd_copy.sort_values("score_percentage")
    for lbl in label_classes:
        fmpd_label = fmpd_copy[fmpd_copy[label_column] == lbl]
        # Calculate the fold # using the label specific dataset
        if (fmpd_label.shape[0] < n_folds):
            raise ValueError("n_folds=%d cannot be greater than the"
                             " number of rows in each class."
                             % n_folds)
        # let's get some variance -- shuffle within each buck
        # let's go through the data set, shuffling items in buckets of size nFolds
        s = 0
        shuffle_array = fmpd_label["index"].values
        maxS = len(shuffle_array)
        while s < maxS:
            max = min(maxS, s + n_folds) - 1
            for i in range(s, max):
                j = random.randint(i, max)
                if i < j:
                    tempI = shuffle_array[i]
                    shuffle_array[i] = shuffle_array[j]
                    shuffle_array[j] = tempI
            s = s + n_folds
            # print("shuffle s =",s," max =",max, " maxS=",maxS)
        fmpd_label["index"] = shuffle_array
        fmpd_label = fmpd_label.reset_index(drop=True).reset_index()
        fmpd_label["test_set_number"] = fmpd_label.iloc[:, 0].apply(
            lambda x: x % n_folds)
        print("label ", lbl)
        for n in range(0, n_folds):
            test_set = fmpd_label[fmpd_label["test_set_number"]
                                  == n]["index"].values
            train_set = fmpd_label[fmpd_label["test_set_number"]
                                   != n]["index"].values
            print("for label ", lbl, " test size is ",
                  test_set.shape, " train size is ", train_set.shape)
            print("len of total size", len(train_test))
            if (len(train_test) != n_folds):
                # Split doesnt exist. Add it in.
                train_test.append([train_set, test_set])
            else:
                temp_arr = train_test[n]
                temp_arr[0] = np.append(temp_arr[0], train_set)
                temp_arr[1] = np.append(temp_arr[1], test_set)
                train_test[n] = [temp_arr[0], temp_arr[1]]
    return train_test
Over time, I realized that this whole issue falls under the umbrella of covariate shift, which is a well studied area within machine learning. Link below, or just search Google for covariate shift. The concept is how to detect and ensure that your prediction data is of similar distribution to your training data. This is in the feature space, but in theory you could have label drift as well.
https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/

Algorithm for smoothing wifi signal strength

When measuring the strength of a Wifi signal between two static points, the measurement constantly fluctuates due to environmental factors.
What is a good algorithm to smooth out small fluctuations and detect significant changes? Exponential moving average?
Some sort of low pass filtering usually works for things like this:
y[i] = alpha * x[i] + (1-alpha) * y[i-1]
where alpha is chosen based on the amount of smoothing desired. x contains the raw input samples and y contains the filtered result.
Exponential moving average is a good way of estimating the current true value of the signal, which as you can see above has popped up under a number of disguises with a number of different justifications.
The problem of detecting significant changes is slightly different, and has been studied as part of statistical quality control. One simple tool for this is http://en.wikipedia.org/wiki/CUSUM. The Wikipedia page tells you enough to implement this, but not how to set W in S[n+1] = Min(0, S[n] + X[n] - W), or what value of S[n] means that it has detected something. You could search further than I have, look in texts such as "Introduction to Statistical Quality Control" by Montgomery, or just grab lots of data and see what works in real life.
I would start by setting W to be the average of the typical value of long term signal strength when everything is OK and the first value of long term signal strength that should make you actually do something, and then plot the results of this on historical data to see if it looks sane and, if so, what value of S[n] should make you actually do something. (X[n] is of course the raw measured signal strength).
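A one-sided CUSUM for detecting a drop in signal strength could be sketched in Python as follows. It uses the lower-sided update S = min(0, S + X - W); the function name, the alarm level and the sample values are all assumptions chosen for illustration and would have to be tuned on real data as described above.

```python
def cusum_drop_index(samples, w, alarm_level):
    """Return the index of the first sample at which a sustained drop is flagged.

    s accumulates how far the samples fall below w and is reset to 0
    whenever the signal is back above w. Returns None if no alarm is raised.
    """
    s = 0.0
    for n, x in enumerate(samples):
        s = min(0.0, s + x - w)
        if s <= -alarm_level:
            return n
    return None

# Invented readings in dBm: normal level -50, degraded level -70.
normal, degraded = -50.0, -70.0
w = (normal + degraded) / 2.0          # midpoint, as suggested in the text
samples = [normal] * 10 + [degraded] * 10
print(cusum_drop_index(samples, w, alarm_level=25.0))
```

A single outlier of 10 dB below W moves S only to -10 and is forgotten once the signal recovers, whereas a sustained drop keeps accumulating until the alarm level is crossed.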

Smart progress bar ETA computation

In many applications, we have some progress bar for a file download, for a compression task, for a search, etc. We all often use progress bars to let users know something is happening. And if we know some details like just how much work has been done and how much is left to do, we can even give a time estimate, often by extrapolating from how much time it's taken to get to the current progress level.
But we've also seen programs in which this Time Left "ETA" display is just comically bad. It claims a file copy will be done in 20 seconds, then one second later it says it's going to take 4 days, then it flickers again to be 20 minutes. It's not only unhelpful, it's confusing!
The reason the ETA varies so much is that the progress rate itself can vary and the programmer's math can be overly sensitive.
Apple sidesteps this by just avoiding any accurate prediction and just giving vague estimates!
That's annoying too, do I have time for a quick break, or is my task going to be done in 2 more seconds? If the prediction is too fuzzy, it's pointless to make any prediction at all.
Easy but wrong methods
As a first-pass ETA computation, probably we all just write a function like this: if p is the fraction that's done already, and t is the time it's taken so far, we output t*(1-p)/p as the estimate of how long it's going to take to finish. This simple ratio works "OK" but it's also terrible, especially at the end of the computation. If your slow download speed keeps a copy slowly advancing overnight, and finally in the morning something kicks in and the copy starts going at full speed, 100X faster, your ETA at 90% done may say "1 hour", and 10 seconds later you're at 95% and the ETA will say "30 minutes", which is clearly an embarrassingly poor guess. In this case "10 seconds" is a much, much, much better estimate.
When this happens you may think to change the computation to use recent speed, not average speed, to estimate ETA. You take the average download rate or completion rate over the last 10 seconds, and use that rate to project how long completion will be. That performs quite well in the previous overnight-download-which-sped-up-at-the-end example, since it will give very good final completion estimates at the end. But this still has big problems: it causes your ETA to bounce wildly when your rate varies quickly over a short period of time, and you get the "done in 20 seconds, done in 2 hours, done in 2 seconds, done in 30 minutes" rapid display of programming shame.
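To put numbers on the two naive estimators, here is a small Python sketch of the overnight-copy scenario. The function names and the 9-hour figure are assumptions picked so that the numbers roughly match the ones quoted above.

```python
def eta_from_average(elapsed, fraction_done):
    """Whole-history estimate: t * (1 - p) / p."""
    return elapsed * (1.0 - fraction_done) / fraction_done

def eta_from_recent(history, window):
    """Estimate from the rate over the last `window` seconds.

    history is a list of (time, fraction_done) pairs in increasing time order.
    """
    t_now, p_now = history[-1]
    # Oldest recorded sample that still lies inside the window.
    t_old, p_old = next((t, p) for t, p in history if t >= t_now - window)
    if t_now == t_old or p_now <= p_old:
        return float("inf")  # no measurable recent progress
    rate = (p_now - p_old) / (t_now - t_old)
    return (1.0 - p_now) / rate

# 9 hours to reach 90%, then 5% more in the following 10 seconds.
history = [(0.0, 0.0), (32400.0, 0.90), (32410.0, 0.95)]
average_eta = eta_from_average(32410.0, 0.95)   # still close to half an hour
recent_eta = eta_from_recent(history, window=10.0)  # about ten seconds
```

The same recent-rate function would return infinity after a ten-second stall, which is exactly the bouncing behaviour described above.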
The actual question:
What is the best way to compute an estimated time of completion of a task, given the time history of the computation? I am not looking for links to GUI toolkits or Qt libraries. I'm asking about the algorithm to generate the most sane and accurate completion time estimates.
Have you had success with math formulas? Some kind of averaging, maybe by using the mean of the rate over 10 seconds with the rate over 1 minute with the rate over 1 hour? Some kind of artificial filtering like "if my new estimate varies too much from the previous estimate, tone it down, don't let it bounce too much"? Some kind of fancy history analysis where you integrate progress versus time advancement to find standard deviation of rate to give statistical error metrics on completion?
What have you tried, and what works best?
Original Answer
The company that created this site apparently makes a scheduling system that answers this question in the context of employees writing code. The way it works is with Monte Carlo simulation of future based on the past.
Appendix: Explanation of Monte Carlo
This is how this algorithm would work in your situation:
You model your task as a sequence of microtasks, say 1000 of them. Suppose an hour later you have completed 100 of them. Now you run the simulation for the remaining 900 steps by randomly selecting 90 completed microtasks, adding their times and multiplying by 10. That gives you one estimate; repeat N times and you have N estimates of the time remaining. Note that the average of these estimates will be about 9 hours, so no surprises there. But by presenting the resulting distribution to the user you honestly communicate the odds, e.g. 'with 90% probability this will take another 3-15 hours'.
This algorithm, by definition, produces a complete result if the task in question can be modeled as a bunch of independent, random microtasks. You can get a better answer only if you know how the task deviates from this model: for example, installers typically have a download/unpacking/installing task list, and the speed of one stage cannot predict the speed of another.
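A minimal Python sketch of such a simulation, under the same independence assumption. The function names are hypothetical. Note that this version draws one past duration for each remaining step instead of drawing 90 and multiplying by 10.

```python
import random


def monte_carlo_remaining(completed_times, remaining_steps, trials=2000, rng=None):
    """Simulate the remaining time by resampling observed microtask durations.

    Returns a sorted list of simulated totals, one per trial.
    """
    rng = rng or random.Random()
    totals = []
    for _ in range(trials):
        # Draw one past duration (with replacement) for each remaining step.
        totals.append(sum(rng.choices(completed_times, k=remaining_steps)))
    totals.sort()
    return totals


def interval(sorted_totals, coverage=0.90):
    """Central interval holding roughly `coverage` of the simulated totals."""
    n = len(sorted_totals)
    lo = int(n * (1.0 - coverage) / 2.0)
    hi = n - 1 - lo
    return sorted_totals[lo], sorted_totals[hi]
```

With the 100 measured durations in a list called times, interval(monte_carlo_remaining(times, 900)) gives the two numbers for a message like 'with 90% probability this will take between X and Y'.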
Appendix: Simplifying Monte Carlo
I'm not a statistics guru, but I think that if you look closer at the simulation in this method, it will always return approximately a normal distribution, because the result is a sum of a large number of independent random variables. Therefore, you don't need to perform the simulation at all. In fact, you don't even need to store all the completed times, since you only need their sum and the sum of their squares.
In maybe not very standard notation,
sigma = sqrt ( sum_of_times_squared - sum_of_times^2 / elapsedSteps )
scaling = 900/100 // that is (totalSteps - elapsedSteps) / elapsedSteps
lowerBound = sum_of_times*scaling - 3*sigma*sqrt(scaling)
upperBound = sum_of_times*scaling + 3*sigma*sqrt(scaling)
With this, you can output a message saying that the task will finish between lowerBound and upperBound from now with some fixed probability (about 99.7% for the 3*sigma bounds above if the normal approximation holds; use 2*sigma for about 95%).
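Here is a small Python sketch of the running-sums idea. It assumes the per-step variance is estimated as mean(t^2) - mean(t)^2 and that the remaining steps are independent; the class name RunningEta is made up for the example.

```python
import math


class RunningEta:
    """Normal-approximation bounds from a running sum and sum of squares."""

    def __init__(self, total_steps):
        self.total_steps = total_steps
        self.done = 0
        self.sum_t = 0.0
        self.sum_t2 = 0.0

    def add(self, duration):
        """Record the duration of one completed microtask."""
        self.done += 1
        self.sum_t += duration
        self.sum_t2 += duration * duration

    def bounds(self, z=3.0):
        """Return (lower, upper) remaining-time bounds at z standard deviations."""
        n = self.done
        m = self.total_steps - n
        mean = self.sum_t / n
        var = max(self.sum_t2 / n - mean * mean, 0.0)  # per-step variance
        center = m * mean
        spread = z * math.sqrt(m * var)  # std of a sum of m independent steps
        return max(center - spread, 0.0), center + spread
```

Only three numbers are stored no matter how many microtasks have completed.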
Here's what I've found works well! For the first 50% of the task, you assume the rate is constant and extrapolate. The time prediction is very stable and doesn't bounce much.
Once you pass 50%, you switch computation strategy. You take the fraction of the job left to do (1-p), then look back in the history of your own progress and find (by binary search and linear interpolation) how long it took you to do the most recent (1-p) fraction of the work, and use that as your estimate of the time to completion.
So if you're now 71% done, you have 29% remaining. You look back in your history and find how long ago you were at (71-29=42%) completion. Report that time as your ETA.
This is naturally adaptive. If you have X amount of work to do, it looks only at the time it took to do the X amount of work. At the end when you're at 99% done, it's using only very fresh, very recent data for the estimate.
It's not perfect of course but it smoothly changes and is especially accurate at the very end when it's most useful.
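A sketch of this two-phase estimator in Python, using the standard bisect module for the binary search. It assumes progress is a non-decreasing fraction and that the first recorded sample is the start of the task at progress 0; the class name LookbackEta is hypothetical.

```python
import bisect


class LookbackEta:
    """First half: linear extrapolation. Second half: look back in history."""

    def __init__(self):
        self.times = []     # sample times, in arrival order
        self.progress = []  # fraction done at each time, non-decreasing

    def update(self, now, fraction_done):
        self.times.append(now)
        self.progress.append(fraction_done)

    def _time_at(self, p):
        """Linearly interpolated time at which progress p was reached."""
        i = bisect.bisect_left(self.progress, p)
        if i == 0:
            return self.times[0]
        if i == len(self.progress):
            return self.times[-1]
        p0, p1 = self.progress[i - 1], self.progress[i]
        t0, t1 = self.times[i - 1], self.times[i]
        if p1 == p0:
            return t0
        return t0 + (t1 - t0) * (p - p0) / (p1 - p0)

    def eta(self):
        if not self.times or self.progress[-1] <= 0.0:
            return None
        now, p = self.times[-1], self.progress[-1]
        if p <= 0.5:
            # Constant-rate extrapolation from the first sample.
            return (now - self.times[0]) * (1.0 - p) / p
        # Time it took to complete the most recent (1 - p) of the work.
        return now - self._time_at(p - (1.0 - p))
```

At 71% done, the call to _time_at looks up when progress was 42% and the ETA is the time elapsed since then.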
Whilst all the examples are valid, for the specific case of 'time left to download', I thought it would be a good idea to look at existing open source projects to see what they do.
From what I can see, Mozilla Firefox is the best at estimating the time remaining.
Mozilla Firefox
Firefox keeps track of the last estimate of the time remaining, and it smooths the displayed time using this previous estimate together with the current one.
See the ETA code here. This uses a 'speed' which is previously calculated here and is a smoothed average of the last 10 readings.
This is a little complex, so to paraphrase:
Take a smoothed average of the speed based 90% on the previous speed and 10% on the new speed.
With this smoothed average speed work out the estimated time remaining.
Use this estimated time remaining, and the previous estimated time remaining, to create a new estimated time remaining (in order to avoid jumping).
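The three steps above can be sketched in Python as follows. This is not Firefox's code: the 90%/10% speed weights come from the paraphrase, while the class name and the 50/50 blend of old and new estimates in the last step are my own assumptions.

```python
class TwoStageEta:
    """Smooth the speed first, then smooth the resulting time estimate."""

    def __init__(self, speed_weight=0.1, eta_weight=0.5):
        self.speed_weight = speed_weight  # share given to the newest speed
        self.eta_weight = eta_weight      # assumed share given to the newest ETA
        self.speed = None
        self.eta = None

    def update(self, current_speed, remaining):
        # Step 1: smoothed speed, 90% previous and 10% new by default.
        if self.speed is None:
            self.speed = current_speed
        else:
            w = self.speed_weight
            self.speed = (1.0 - w) * self.speed + w * current_speed
        if self.speed <= 0.0:
            return self.eta  # cannot estimate; keep the last value
        # Step 2: raw time remaining from the smoothed speed.
        raw = remaining / self.speed
        # Step 3: blend with the previous estimate to avoid jumps.
        if self.eta is None:
            self.eta = raw
        else:
            w = self.eta_weight
            self.eta = (1.0 - w) * self.eta + w * raw
        return self.eta
```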
Google Chrome
Chrome seems to jump about all over the place, and the code shows this.
One thing I do like with Chrome though is how they format time remaining.
For > 1 hour it says '1 hrs left'
For < 1 hour it says '59 mins left'
For < 1 minute it says '52 secs left'
You can see how it's formatted here
DownThemAll! Manager
It doesn't use anything clever, meaning the ETA jumps about all over the place.
See the code here
pySmartDL (a python downloader)
Takes the average ETA of the last 30 ETA calculations. Sounds like a reasonable way to do it.
See the code here (pySmartDL/pySmartDL.py#L651 at commit 916f2592db326241a2bf4d8f2e0719c58b71e385).
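Averaging the last 30 ETA values needs little more than a bounded queue. A minimal Python sketch follows; it is not pySmartDL's code, and the class name is hypothetical.

```python
from collections import deque


class AveragedEta:
    """Report the mean of the most recent raw ETA values."""

    def __init__(self, history=30):
        self.recent = deque(maxlen=history)  # oldest values fall off automatically

    def update(self, raw_eta):
        self.recent.append(raw_eta)
        return sum(self.recent) / len(self.recent)
```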
Transmission
Gives a pretty good ETA in most cases (except when starting off, as might be expected).
Uses a smoothing factor over the past 5 readings, similar to Firefox but not quite as complex. Fundamentally similar to Gooli's answer.
See the code here
I usually use an Exponential Moving Average to compute the speed of an operation with a smoothing factor of say 0.1 and use that to compute the remaining time. This way all the measured speeds have influence on the current speed, but recent measurements have much more effect than those in the distant past.
In code it would look something like this:
alpha = 0.1 # smoothing factor
...
speed = (speed * (1 - alpha)) + (currentSpeed * alpha)
If your tasks are uniform in size, currentSpeed would simply be the time it took to execute the last task. If the tasks have different sizes and you know that one task is supposed to take, e.g., twice as long as another, you can divide the time it took to execute the task by its relative size to get the current speed. Note that speed here is really time per unit of work, so you compute the remaining time by multiplying it by the total size of the remaining tasks (or just by their number if the tasks are uniform).
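Put together as a small Python sketch, with the quantity stored as time per unit of work as described. The class name EmaEta and its methods are hypothetical.

```python
class EmaEta:
    """Exponential moving average of the time taken per unit of work."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha          # smoothing factor
        self.time_per_unit = None   # the smoothed 'speed'

    def task_done(self, duration, size=1.0):
        """Record a finished task; size is its relative size."""
        current = duration / size
        if self.time_per_unit is None:
            self.time_per_unit = current
        else:
            a = self.alpha
            self.time_per_unit = self.time_per_unit * (1.0 - a) + current * a

    def remaining(self, remaining_size):
        """Estimated time for the remaining total size (or task count)."""
        return self.time_per_unit * remaining_size
```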
Hopefully my explanation is clear enough, it's a bit late in the day.
In certain instances, when you need to perform the same task on a regular basis, it might be a good idea to use past completion times as the basis for your estimate.
For example, I have an application that loads the iTunes library via its COM interface. The size of a given iTunes library generally does not increase dramatically from launch to launch in terms of the number of items, so in this example it might be possible to track the last three load times and load rates, average them, and compute your current ETA from that.
This would be hugely more accurate than an instantaneous measurement and probably more consistent as well.
However, this method depends upon the size of the task being relatively similar to the previous ones, so this would not work for a decompressing method or something else where any given byte stream is the data to be crunched.
Just my $0.02
First off, it helps to generate a running moving average. This weights more recent events more heavily.
To do this, keep a bunch of samples around (circular buffer or list), each a pair of progress and time. Keep the most recent N seconds of samples. Then generate a weighted average of the samples:
totalProgress += (curSample.progress - prevSample.progress) * scaleFactor
totalTime += (curSample.time - prevSample.time) * scaleFactor
where scaleFactor goes linearly from 0...1 as an inverse function of time in the past (thus weighing more recent samples more heavily). You can play around with this weighting, of course.
At the end, you can get the average rate of change:
averageProgressRate = (totalProgress / totalTime);
You can use this to figure out the ETA by dividing the remaining progress by this number.
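As a concrete version of the weighted average, here is a Python sketch that assumes the samples are (time, progress) pairs ordered oldest first and that the weight falls linearly from 1 for the newest interval to 0 at the edge of the window. The function names are hypothetical.

```python
def weighted_rate(samples, window):
    """Average progress rate, weighting recent intervals more heavily."""
    now = samples[-1][0]
    total_progress = 0.0
    total_time = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        age = now - t1
        scale = max(0.0, 1.0 - age / window)  # 1 for newest, 0 at window edge
        total_progress += (p1 - p0) * scale
        total_time += (t1 - t0) * scale
    if total_time <= 0.0:
        return None
    return total_progress / total_time


def eta(samples, window, total=1.0):
    """Remaining progress divided by the weighted average rate."""
    rate = weighted_rate(samples, window)
    if not rate:
        return None
    return (total - samples[-1][1]) / rate
```

For a constant rate the weights cancel and this reduces to the plain average; after a recent speed-up it reports a higher rate than the plain average would.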
However, while this gives you a good trending number, you have one other issue - jitter. If, due to natural variations, your rate of progress moves around a bit (it's noisy) - e.g. maybe you're using this to estimate file downloads - you'll notice that the noise can easily cause your ETA to jump around, especially if it's pretty far in the future (several minutes or more).
To avoid jitter from affecting your ETA too much, you want this average rate of change number to respond slowly to updates. One way to approach this is to keep around a cached value of averageProgressRate, and instead of instantly updating it to the trending number you've just calculated, you simulate it as a heavy physical object with mass, applying a simulated 'force' to slowly move it towards the trending number. With mass, it has a bit of inertia and is less likely to be affected by jitter.
Here's a rough sample:
// desiredAvgProgressRate is computed from the weighted average above
// m_averageProgressRate is a member variable also in progress units/sec
// lastTimeElapsed = the time delta in seconds (since last simulation)
// m_averageSeekSpeed is a member variable in units/sec, used to hold
// the velocity of m_averageProgressRate
const float frictionCoeff = 0.75f;
const float mass = 4.0f;
const float maxSpeedCoeff = 0.25f;
// lose 25% of our speed per sec, simulating friction
m_averageSeekSpeed *= pow(frictionCoeff, lastTimeElapsed);
float delta = desiredAvgProgressRate - m_averageProgressRate;
// update the velocity
float oldSpeed = m_averageSeekSpeed;
float accel = delta / mass;
m_averageSeekSpeed += accel * lastTimeElapsed; // v += at
// clamp the top speed to 25% of our current value
float sign = (m_averageSeekSpeed > 0.0f ? 1.0f : -1.0f);
float maxVal = m_averageProgressRate * maxSpeedCoeff;
if (fabs(m_averageSeekSpeed) > maxVal)
{
    m_averageSeekSpeed = sign * maxVal;
}
// make sure they have the same sign
if ((m_averageSeekSpeed > 0.0f) == (delta > 0.0f))
{
    float adjust = (oldSpeed + m_averageSeekSpeed) * 0.5f * lastTimeElapsed;
    // don't overshoot.
    if (fabs(adjust) > fabs(delta))
    {
        adjust = delta;
        // apply damping
        m_averageSeekSpeed *= 0.25f;
    }
    m_averageProgressRate += adjust;
}
Your question is a good one. If the problem can be broken up into discrete units, an accurate calculation often works best. Unfortunately this may not be the case: even if you are installing 50 components, each nominally 2% of the total, one of them can be massive. One thing that I have had moderate success with is to clock the CPU and disk and give a decent estimate based on observational data. Knowing that certain checkpoints really correspond to point x gives you some opportunity to correct for environmental factors (network, disk activity, CPU load). However, this solution is not general in nature due to its reliance on observational data. Using ancillary data such as RPM file size helped me make my progress bars more accurate, but they are never bulletproof.
Uniform averaging
The simplest approach would be to predict the remaining time linearly:
t_rem := t_spent ( n - prog ) / prog
where t_rem is the predicted ETA, t_spent is the time elapsed since the commencement of the operation, prog the number of microtasks completed out of their full quantity n. To explain—n may be the number of rows in a table to process or the number of files to copy.
This method having no parameters, one need not worry about the fine-tuning of the exponent of attenuation. The trade-off is poor adaptation to a changing progress rate because all samples contribute equally to the estimate, whereas it is only meet that recent samples should have more weight than old ones, which leads us to
Exponential smoothing of rate
in which the standard technique is to estimate progress rate by averaging previous point measurements:
rate := 1 / (n * dt); { rate equals normalized progress per unit time }
if prog = 1 then { if first microtask just completed }
  rate_est := rate { initialize the estimate }
else
begin
  weight := Exp( - dt / DECAY_T );
  rate_est := rate_est * weight + rate * (1.0 - weight);
end;
t_rem := (1.0 - prog / n) / rate_est; { estimate after every update }
where dt denotes the duration of the last completed microtask and is equal to the time passed since the previous progress update. Notice that weight is not a constant and must be adjusted according to the length of time during which a certain rate was observed, because the longer we observed a certain speed, the greater the exponential decay of the previous measurements. The constant DECAY_T denotes the length of time during which the weight of a sample decreases by a factor of e. SPWorley himself suggested a similar modification to gooli's proposal, although he applied it to the wrong term. An exponential average for equidistant measurements is:
Avg_e(n) = Avg_e(n-1) * alpha + m_n * (1 - alpha)
but what if the samples are not equidistant, as is the case with times in a typical progress bar? Take into account that alpha above is but an empirical quotient whose true value is:
alpha = Exp( - lambda * dt ),
where lambda is the parameter of the exponential window and dt the amount of change since the previous sample, which need not be time, but any linear and additive parameter. alpha is constant for equidistant measurements but varies with dt.
Mark that this method relies on a predefined time constant and is not scalable in time. In other words, if exactly the same process is uniformly slowed down by a constant factor, this rate-based filter will become proportionally more sensitive to signal variations, because at every step weight will be decreased. If we, however, desire a smoothing independent of the time scale, we should consider
Exponential smoothing of slowness
which is essentially the smoothing of rate turned upside down, with the added simplification of a constant weight, because prog grows by equidistant increments:
slowness := n * dt; { slowness is the amount of time per unity progress }
if prog = 1 then { if first microtask just completed }
  slowness_est := slowness { initialize the estimate }
else
begin
  weight := Exp( - 1 / (n * DECAY_P ) );
  slowness_est := slowness_est * weight + slowness * (1.0 - weight);
end;
t_rem := (1.0 - prog / n) * slowness_est; { estimate after every update }
The dimensionless constant DECAY_P denotes the normalized progress difference between two samples of which the weights are in the ratio of one to e. In other words, this constant determines the width of the smoothing window in progress domain, rather than in time domain. This technique is therefore independent of the time scale and has a constant spatial resolution.
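The time-scale independence is easy to check numerically. Below is a Python sketch of the slowness smoothing, assuming one duration per completed microtask; the function name eta_by_slowness and the default decay_p value are mine.

```python
import math


def eta_by_slowness(durations, n, decay_p=0.1):
    """Remaining time after the given microtask durations, out of n microtasks."""
    weight = math.exp(-1.0 / (n * decay_p))  # constant: progress steps are equal
    slowness_est = None
    for dt in durations:
        slowness = n * dt  # time per unit of normalized progress
        if slowness_est is None:
            slowness_est = slowness  # first microtask initializes the estimate
        else:
            slowness_est = slowness_est * weight + slowness * (1.0 - weight)
    prog = len(durations)
    return (1.0 - prog / n) * slowness_est


durations = [1.0, 2.0, 1.5, 3.0]
slowed = [5.0 * d for d in durations]  # the same process, uniformly 5x slower
print(eta_by_slowness(durations, 10), eta_by_slowness(slowed, 10))
```

The second number is five times the first up to floating-point rounding, because weight does not depend on dt.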
Further research: adaptive exponential smoothing
You are now equipped to try the various algorithms of adaptive exponential smoothing. Only remember to apply it to slowness rather than to rate.
I always wish these things would tell me a range. If it said, "This task will most likely be done in between 8 min and 30 minutes," then I have some idea of what kind of break to take. If it's bouncing all over the place, I'm tempted to watch it until it settles down, which is a big waste of time.
I have tried and simplified your "easy"/"wrong"/"OK" formula and it works best for me:
t / p - t
In Python:
>>> done=0.3; duration=10; "time left: %i" % (duration / done - duration)
'time left: 23'
That saves one op compared to (dur*(1-done)/done). And, in the edge case you describe, possibly ignoring the dialog for 30 minutes extra hardly matters after waiting all night.
Comparing this simple method to the one used by Transmission, I found it to be up to 72% more accurate.
I don't sweat it, it's a very small part of an application. I tell them what's going on, and let them go do something else.

Aging a dataset

For reasons I'd rather not go into, I need to filter a set of values to reduce jitter. To that end, I need to be able to average a list of numbers, with the most recent having the greatest effect, and the least recent having the smallest effect. I'm using a sample size of 10, but that could easily change at some point.
Are there any reasonably simple aging algorithms that I can apply here?
Have a look at the exponential smoothing. Fairly simple, and might be sufficient for your needs. Basically recent observations are given relatively more weight than the older ones.
Also (depending on the application) you may want to look at various reinforcement learning techniques, for example Q-Learning or TD-Learning or generally speaking any method involving the discount.
I ran into something similar in an embedded control application.
The simplest option that I came across was a 3/4 filter. This gets applied continuously over the entire data set:
current_value = (3*current_value + new_value)/4
I eventually decided to go with a 16-tap FIR filter instead:
Overview
FIR FAQ
Wikipedia article
Many weighted averaging algorithms could be used.
For example, for items I(n) for n = 1 to N in sequence (newest to oldest):
SUM(I(n) * (N + 1 - n)) / SUM(n)
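In Python, assuming the list is ordered newest first as above, this could look like the following. The function name is hypothetical.

```python
def linear_weighted_average(items):
    """Weighted mean where the newest item has weight N and the oldest weight 1.

    `items` must be ordered newest first.
    """
    n = len(items)
    if n == 0:
        raise ValueError("need at least one item")
    weights = range(n, 0, -1)  # N, N-1, ..., 1
    return sum(x * w for x, w in zip(items, weights)) / (n * (n + 1) / 2)
```

For a sample size of 10 the newest value gets weight 10/55 and the oldest 1/55.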
It's not exactly clear from the question whether you're dealing with fixed-length data or whether data is continuously coming in. A nice physical model for the latter would be a low-pass filter, using a resistor and a capacitor (R and C). Assuming your data is equidistantly spaced in time (is it?), this leads to the update prescription
U_aged[n+1] = U_aged[n] + deltat/Tau * (U_raw[n+1] - U_aged[n])
where Tau is the time constant of the filter. In the limit of zero deltat, this gives an exponential decay (old values will be reduced to 1/e of their value after time Tau). In an implementation, you only need to keep a running weighted sum U_aged. With one sample per step, deltat would be 1 and Tau would specify the 'aging constant', the number of steps it takes to reduce a sample's contribution to 1/e.
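The update rule translates directly into a few lines of Python. This sketch assumes equally spaced samples and deltat <= Tau; the function name low_pass is mine.

```python
def low_pass(raw, tau, deltat=1.0):
    """Apply U_aged[n+1] = U_aged[n] + deltat/tau * (U_raw[n+1] - U_aged[n])."""
    alpha = deltat / tau
    aged = raw[0]  # start from the first raw value
    out = [aged]
    for u in raw[1:]:
        aged = aged + alpha * (u - aged)
        out.append(aged)
    return out


print(low_pass([0.0, 1.0, 1.0, 1.0], tau=2.0))
```

Only the latest aged value has to be kept between updates, so this works for data that keeps arriving.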

Resources