How can I normalize trending data? - algorithm

Say I want to calculate the velocity of two datapoints (A and A'), each having a score, and a time published (A' is a future version of A, and has a higher score). This would be
[A'(score) - A(score)] / [A'(time published) - A (time published)]
What I want to capture are trends with high velocities. This means I want a score going from 20 to 200 to carry more weight than one going from 8500 to 9000. So I thought I'd normalize this data by dividing the scores by a baseline.
Ex. if A(score) is 2, and A'(score) is 3, the baseline is 2, so in the formula above,
A'(score) - A(score) would be (3/2 - 2/2)
However, this means that when the numbers are this low, the velocities will be very high, whereas
9000/8500 - 8500/8500
produces a very low velocity. (The time difference is constant in this example only; normally, time differences are variable.)
Is there any way to reduce the impact of low starting scores WHILE at the same time allowing jumps from, say, 20 to 200 being significant? Thank you.

There are two ways to look at this. Either could give you what you want.
My first thought was that your question came very close to providing your answer. You gave yourself an important hint by calling your first calculation your velocity - your rate of change of a score over time. You could then look at its acceleration - your rate of change of the velocity over time. That's:
(A''(score) - A'(score)) - (A'(score) - A(score))
Note that I'm not dividing by time, because you say the time difference is constant for each measurement. Dividing each value by the same constant adds work and probably doesn't give you any further clarity.
More likely, though, it seems you want how significant the change is from one score to the next. I suspect what you want is:
(A'(score) - A(score)) / A(score)
This is (a - b) / b, which reduces down to (a/b) - 1. If you don't care about the -1, the simplest way you can see the relevant change in your score is:
A'(score)/A(score)
This shows the rate of growth of the score from one step to the next.
Edit, after clarification:
Given your comment, a variable rate of time makes the logic more complicated, but still do-able.
In that case, you do want to calculate velocity, as you were doing:
V = (A'(score) - A(score)) / (A'(time) - A(time))
But you want to normalize it based on the previous velocity:
result = V'/V
This then becomes similar to the "acceleration" example - it requires 3 samples to have a good idea of the rate of change of the rate of change. If you spell it out all the way, you get something like:
result = [(A''(score) - A'(score)) / (A''(time) - A'(time))] / [(A'(score) - A(score)) / (A'(time) - A(time))]
You can do some math to shove these numbers around, but there's really no prettier result than that.
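As a rough sketch in Python (assuming three (score, time) samples, oldest first; the function names are illustrative, not from the question):

def velocity(s0, t0, s1, t1):
    """Rate of change of score between two samples."""
    return (s1 - s0) / (t1 - t0)

def normalized_trend(samples):
    """samples: three (score, time) pairs, oldest first. Returns V'/V."""
    (s0, t0), (s1, t1), (s2, t2) = samples
    return velocity(s1, t1, s2, t2) / velocity(s0, t0, s1, t1)

print(normalized_trend([(2, 0.0), (3, 1.0), (9, 2.0)]))   # -> 6.0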

Related

How to efficiently compute the first, second, and third derivatives of live updating data?

I have a running/decaying sum that updates over time with live data. I would like to efficiently compute the first, second, and third derivatives.
The simplest way I can think of doing this is to calculate deltas over some time difference in the running/decaying sum. e.g.
t_0 sum_0
t_1 sum_1
first_derivative = (sum_1 - sum_0) / (t_1 - t_0)
I can continue this process further with the second and third derivatives, which I think should work, but I'm not sure if this is the best way.
This running/decaying sum is not a defined function and relies on live updating data, so I can't just do a normal derivative.
I don't know what your real use case is, but it sounds like you're going about this the wrong way. For most cases I can imagine, what you really want to do is:
First determine the continuous signal that your time series represents; and then
You can exactly calculate the derivatives of this signal at any point.
Since you have already decided that your time series represents exponential decay with discontinuous jumps, you have decided that all your derivatives are simply proportional to the current value and provide no extra information.
This probably isn't what you really want.
You would probably be better off applying a more sophisticated low-pass filter to your samples. In situations like yours, where you receive intermittent updates, it can be convenient to design the impulse response as a weighted sum of exponential decays with different (and possibly complex) time scales.
If you use 4 or 5 exponentials, then you can ensure that the value and first 3 derivatives of the impulse response are all smooth, so none of the derivatives you have to report are discontinuous.
The impulse response of any all-pole IIR filter can be written as a sum of exponentials in this way, through "partial fraction decomposition", but I guess there is a lot of learning between you and there right now. Those terms are all Googlable.
An example impulse response that is smoother than a plain exponential decay is this one, whose value and first 3 derivatives are all 0 at t = 0:
5( e^(-t) - 4e^(-2t) + 6e^(-3t) - 4e^(-4t) + e^(-5t) )
You can scale the decay times however you like. (The original answer included a Wolfram Alpha plot of this curve.)
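As a quick sanity check of that claim (a sketch, not part of the original answer): the n-th derivative of e^(-kt) at t = 0 is (-k)^n, so the value and first three derivatives of the sum reduce to simple coefficient sums.

import math

COEFFS = [1, -4, 6, -4, 1]          # weights of e^(-t), e^(-2t), ..., e^(-5t)

def impulse(t):
    """The smoothed impulse response 5*(e^-t - 4e^-2t + 6e^-3t - 4e^-4t + e^-5t)."""
    return 5 * sum(c * math.exp(-(k + 1) * t) for k, c in enumerate(COEFFS))

def nth_derivative_at_zero(n):
    """d^n/dt^n of e^(-kt) at t = 0 is (-k)^n, so the derivative is a coefficient sum."""
    return 5 * sum(c * (-(k + 1)) ** n for k, c in enumerate(COEFFS))

print([nth_derivative_at_zero(n) for n in range(4)])   # -> [0, 0, 0, 0]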
To be clear, you are looking to smooth out data AND to estimate rate of change. But rate of change inherently amplifies noise. Any solution is going to have to make some tradeoffs.
Here is a simple hack based on your existing technique.
First, let's look at a general version of a basic decaying sum. Let's keep the following variables:
average_value
average_time
average_weight
And you have a decay rate decay.
To update with a new observation (value, time) you simply:
average_weight *= (1 - decay)**(time - average_time)   # decay the old weight toward the new timestamp
average_value = (average_value * average_weight + value) / (1 + average_weight)   # blend in the new value
average_time = (average_time * average_weight + time) / (1 + average_weight)      # ...and the new timestamp
average_weight += 1   # the new observation contributes a weight of 1
This moving average therefore represents roughly where your value was some time ago (around average_time). The slower the decay, the farther back it goes and the more smoothed out it is. Given that we want a rate of change, the when is going to matter.
Now let's look at a first derivative. You have correctly written down a formula for estimating one. But at what time is that estimated derivative valid? The answer turns out to be (t_0 + t_1) / 2. Pick any other time and the estimate will be systematically off whenever the underlying signal has curvature.
So you can play around with it, but you can estimate a derivative based on any source of values and timestamps. You can do it from your first derivative, or do it from a weighted average. You can even combine them. You can also do a running weighted average of the first derivative! But whatever you do, you need to keep track of WHEN it is a derivative FOR. (This is why I went through and discussed how far back a weighted average is: you need to think clearly about timestamping every piece of data you have, averaged or not.)
And now we have your second derivative. You have all the same choices for the second derivative that you had for the first, except that your raw measurements don't give you first-derivative values directly; you have to build them from the estimates above.
The third derivative follows the same pattern of choices.
However you do it, keep in mind the following.
Each derivative will be delayed.
The more up to date you keep them, the more noise will be a problem.
Make sure to think clearly about both what the measurement is, and when it is as of.
It may require experimentation to find what works best for your application.
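Here is a rough sketch of that bookkeeping in Python (the class and names are mine, not from the answer): each level keeps a decaying average of (value, time), and a derivative is estimated between two points and timestamped at their midpoint.

class DecayingAverage:
    """Decaying weighted average that also tracks the average timestamp."""
    def __init__(self, decay):
        self.decay = decay
        self.value = 0.0
        self.time = 0.0
        self.weight = 0.0

    def update(self, value, time):
        self.weight *= (1 - self.decay) ** (time - self.time)
        self.value = (self.value * self.weight + value) / (1 + self.weight)
        self.time = (self.time * self.weight + time) / (1 + self.weight)
        self.weight += 1
        return self.value, self.time

def derivative(p0, p1):
    """Finite-difference derivative between two (value, time) points.
    Returns (derivative, midpoint_time): the time the derivative is FOR."""
    (v0, t0), (v1, t1) = p0, p1
    return (v1 - v0) / (t1 - t0), (t0 + t1) / 2

# avg = DecayingAverage(decay=0.1)
# smoothed = [avg.update(v, t) for v, t in live_samples]     # smoothed (value, time) stream
# d1, d1_time = derivative(smoothed[-2], smoothed[-1])       # first derivative, timestamped
# ...repeat the same pattern on the (d1, d1_time) stream for higher derivatives.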

Calculate enemy's hitpoints (extrapolate), based on change rate?

In our game, we have a boss (NPC), whose life is checked at a time interval, say 1 minute.
I need to find a way to extrapolate the known points (life, time) and approximately predict the life after one minute (after 1 minute the life will be checked again and added to the data set).
Also, extrapolation needs to consider mostly recent change (for instance, if we have 10 points, and last two have changed rapidly, it should be able to predict even more rapid change at next point).
I found multiple examples (this one and this one), but it seems I'm not able to translate all this into AS3 code. Basically what I was looking for was 2D extrapolation.
P.S. The point is that any calculated value should not go above any previous value, because the boss's hit points cannot increase, and also cannot stay the same; they can only decrease. I guess that means plain extrapolation wouldn't do, so I'm looking for another algorithm that will.
Consider a calculus-inspired approach. If we have a list d[i] of the damage at each past time i, and the current time is n, then we can estimate d[n+1] using the previous values in the list. d[n] - d[n-1] provides an estimate of the change from d[n] to d[n+1] based on recent values, (d[n] - d[n-1]) - (d[n-1] - d[n-2]) provides an estimate of the change of that change, and so forth. The idea is to use differencing to estimate change: if you have a time-series list d[i] = [a, b, c, ...] and another list d2[i] = d[i] - d[i-1], then d2[] is the change in d[] for all times i > 1. Since d2[] is also a time series, you can use it to create a d3[], chaining the terms to build an estimate:
d[n+1] ~ d[n] + (d[n] - d[n-1]) + ((d[n] - d[n-1]) - (d[n-1] - d[n-2])) + ...
where d[n] is the last value, (d[n] - d[n-1]) = d2[n] is the estimated change, and ((d[n] - d[n-1]) - (d[n-1] - d[n-2])) = d3[n] is the estimated change of that change.
Granted, this makes a lot of assumptions about the incoming data. The two most important problems I can think of:
It assumes that the most recent changes are representative of future values; where the change terms are non-linear, this causes the estimate to lag behind the actual data. The lag becomes stronger as more terms are added, so a better estimate (more terms) must be balanced against better agility (fewer terms).
Outliers in the incoming data figure directly into the equation, and thus directly affect the resulting estimate.
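A small sketch of that chained differencing in Python (the function name and the clamping suggestion are mine):

def predict_next(history, depth=3):
    """Extrapolate the next value from past values by chained differencing.
    'depth' is how many levels of differences to include."""
    estimate = history[-1]
    diffs = list(history)
    for _ in range(depth):
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
        if not diffs:
            break
        estimate += diffs[-1]
    return estimate

# Hit points dropping faster and faster:
print(predict_next([1000, 950, 880, 790]))   # -> 680, a steeper drop than the last one
# Since the boss's HP can only decrease, you may want to clamp the
# prediction to be strictly below history[-1].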

Formula for popularity? (based on "like it", "comments", "views")

I have some pages on a website and I have to create an ordering based on "popularity"/"activity"
The parameters that I have to use are:
views to the page
comments made on the page (there is a form at the bottom where users can make comments)
clicks made to the "like it" icon
Are there any standards for what a formula for popularity would be? (if not opinions are good too)
(initially I thought of views + 10*comments + 10*likeit)
Actually there is an accepted best way to calculate this:
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
You may need to combine 'likes' and 'comments' into a single score, assigning your own weighting factor to each, before plugging it into the formula as the 'positive vote' value.
from the link above:
Score = Lower bound of Wilson score confidence interval for a
Bernoulli parameter
We need to balance the proportion of positive ratings with
the uncertainty of a small number of observations. Fortunately, the
math for this was worked out in 1927 by Edwin B. Wilson. What we want
to ask is: Given the ratings I have, there is a 95% chance that the
"real" fraction of positive ratings is at least what? Wilson gives the
answer. Considering only positive and negative ratings (i.e. not a
5-star scale), the lower bound on the proportion of positive ratings
is given by:
( p̂ + z^2/(2n) ± z·sqrt( (p̂(1 − p̂) + z^2/(4n)) / n ) ) / ( 1 + z^2/n )
(Use minus where it says plus/minus to calculate the lower bound.)
Here p̂ is the observed fraction of positive ratings, z_(α/2) is the
(1 − α/2) quantile of the standard normal distribution, and n is the
total number of ratings. The same formula implemented in Ruby:
require 'statistics2'

def ci_lower_bound(pos, n, confidence)
  if n == 0
    return 0
  end
  z = Statistics2.pnormaldist(1 - (1 - confidence) / 2)
  phat = 1.0 * pos / n
  (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat) + z*z/(4*n)) / n)) / (1 + z*z/n)
end
pos is the number of positive ratings, n is the total number of
ratings, and confidence refers to the statistical confidence level:
pick 0.95 to have a 95% chance that your lower bound is correct, 0.975
to have a 97.5% chance, etc. The z-score in this function never
changes, so if you don't have a statistics package handy or if
performance is an issue you can always hard-code a value here for z.
(Use 1.96 for a confidence level of 0.95.)
The same formula as an SQL query:
SELECT widget_id,
       ((positive + 1.9208) / (positive + negative) -
        1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) /
        (positive + negative)) / (1 + 3.8416 / (positive + negative))
       AS ci_lower_bound
FROM widgets
WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
There is no standard formula for this (how could there be?)
What you have looks like a fairly normal solution, and would probably work well. Of course, you should play around with the 10's to find values that suit your needs.
Depending on your requirements, you might also want to add in a time factor (i.e. -X points per week) so that old pages become less popular. Alternatively, you could change your "page views" to "page views in the last month". Again, this depends on your needs, it may not be relevant.
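A small sketch of that idea (the weights and decay constant here are placeholders, not recommendations):

from datetime import datetime, timezone

def popularity(views, comments, likes, published,
               w_comments=10, w_likes=10, decay_per_week=5, now=None):
    """Naive weighted popularity with a simple age penalty.
    'published' is assumed to be a timezone-aware UTC datetime."""
    now = now or datetime.now(timezone.utc)
    weeks_old = (now - published).total_seconds() / (7 * 24 * 3600)
    return views + w_comments * comments + w_likes * likes - decay_per_week * weeks_old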
You could do something like what YouTube does - just have it sorted by largest count per category. For example - most viewed, most commented, most liked. In each category a different page could come first, though the rankings might likely be correlated. If you only need a single ranking, then you would have to come up with a formula of some sort, preferably derived empirically by analyzing a bunch of data you already have and deciding what should be calculated as good/bad, and working backwards to fit an equation that fits your decision.
You could even attempt a machine learning approach to "learn" what a good weighting is for combining each of these numbers as in your example formula. Doing it manually might also not be too hard.
I use
(C*comments + L*likeit) * 100 / views
where C and L are weights you choose depending on how much you value each attribute. I use C = 1 and L = 1.
This gives you the percentage of views that generated a positive action, making the items with the highest percentage the most "popular".
I like this because it lets newer items be very popular at first, showing up first and getting more views, and then becoming less popular (or more) until they stabilize.
Anyway, I hope it helps.
PS: It would work just the same without the "*100", but I like percentages.
I would value comments more than 'like it's if the content invites a discussion. If it's just stating facts, an equal ratio for comments and the like count seems OK (though 10 is a bit too much, I think...).
Does a view take into account the time the user spent on the page somehow? You might use that as well, since a 2-second view means less than a 3-minute one.
Java code for Anentropic's answer:
public static double getRank(double thumbsUp, double thumbsDown) {
    double totalVotes = thumbsUp + thumbsDown;
    if (totalVotes > 0) {
        return ((thumbsUp + 1.9208) / totalVotes -
                1.96 * Math.sqrt((thumbsUp * thumbsDown) / totalVotes + 0.9604) /
                totalVotes) / (1 + (3.8416 / totalVotes));
    } else {
        return 0;
    }
}

Smart progress bar ETA computation

In many applications, we have some progress bar for a file download, for a compression task, for a search, etc. We all often use progress bars to let users know something is happening. And if we know some details like just how much work has been done and how much is left to do, we can even give a time estimate, often by extrapolating from how much time it's taken to get to the current progress level.
(progress bar screenshot; source: jameslao.com)
But we've also seen programs in which this Time Left "ETA" display is just comically bad. It claims a file copy will be done in 20 seconds, then one second later it says it's going to take 4 days, then it flickers again to 20 minutes. It's not only unhelpful, it's confusing!
The reason the ETA varies so much is that the progress rate itself can vary and the programmer's math can be overly sensitive.
Apple sidesteps this by avoiding any accurate prediction and just giving vague estimates!
(vague progress dialog screenshot; source: autodesk.com)
That's annoying too: do I have time for a quick break, or is my task going to be done in 2 more seconds? If the prediction is too fuzzy, it's pointless to make any prediction at all.
Easy but wrong methods
As a first pass at ETA computation, we probably all just write a function like: if p is the fraction that's done already and t is the time it's taken so far, output t*(1-p)/p as the estimate of how long it will take to finish. This simple ratio works "OK", but it's also terrible, especially at the end of the computation. If a slow download speed keeps a copy creeping along overnight, and finally in the morning something kicks in and the copy starts going at full speed, 100X faster, your ETA at 90% done may say "1 hour"; 10 seconds later you're at 95% and the ETA says "30 minutes", which is clearly an embarrassingly poor guess. In this case "10 seconds" is a much, much better estimate.
When this happens you may think to change the computation to use recent speed, not average speed, to estimate the ETA: take the average download or completion rate over the last 10 seconds and use that rate to project the completion time. That performs quite well in the previous overnight-download-which-sped-up-at-the-end example, since it gives very good completion estimates near the end. But it still has big problems: it causes your ETA to bounce wildly when your rate varies quickly over a short period of time, and you get the "done in 20 seconds, done in 2 hours, done in 2 seconds, done in 30 minutes" rapid display of programming shame.
The actual question:
What is the best way to compute an estimated time of completion of a task, given the time history of the computation? I am not looking for links to GUI toolkits or Qt libraries. I'm asking about the algorithm to generate the most sane and accurate completion time estimates.
Have you had success with math formulas? Some kind of averaging, maybe by using the mean of the rate over 10 seconds with the rate over 1 minute with the rate over 1 hour? Some kind of artificial filtering like "if my new estimate varies too much from the previous estimate, tone it down, don't let it bounce too much"? Some kind of fancy history analysis where you integrate progress versus time advancement to find standard deviation of rate to give statistical error metrics on completion?
What have you tried, and what works best?
Original Answer
The company that created this site apparently makes a scheduling system that answers this question in the context of employees writing code. The way it works is with a Monte Carlo simulation of the future based on the past.
Appendix: Explanation of Monte Carlo
This is how this algorithm would work in your situation:
You model your task as a sequence of microtasks, say 1000 of them. Suppose an hour later you have completed 100 of them. Now you run the simulation for the remaining 900 steps by randomly selecting 90 completed microtasks, adding their times, and multiplying by 10. That gives you one estimate; repeat N times and you have N estimates for the time remaining. Note that the average of these estimates will be about 9 hours, no surprise there. But by presenting the resulting distribution to the user you honestly communicate the odds, e.g. "with 90% probability this will take another 3 to 15 hours".
This algorithm, by definition, produces a complete result if the task in question can be modeled as a bunch of independent, random microtasks. You can get a better answer only if you know how the task deviates from this model: for example, installers typically have a download/unpacking/installing task list, and the speed of one phase cannot predict the others.
Appendix: Simplifying Monte Carlo
I'm not a statistics guru, but I think that if you look closer at the simulation in this method, it will always return a normal distribution, being a sum of a large number of independent random variables. Therefore, you don't need to perform it at all. In fact, you don't even need to store all the completed times, since you only need their sum and the sum of their squares.
In maybe not very standard notation,
sigma = sqrt ( sum_of_times_squared-sum_of_times^2 )
scaling = 900/100 // that is (totalSteps - elapsedSteps) / elapsedSteps
lowerBound = sum_of_times*scaling - 3*sigma*sqrt(scaling)
upperBound = sum_of_times*scaling + 3*sigma*sqrt(scaling)
With this, you can output the message saying that the thing will end between [lowerBound, upperBound] from now with some fixed probability (should be about 95%, but I probably missed some constant factor).
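For reference, a small sketch of the full (non-simplified) Monte Carlo version described above (the resampling scheme and the confidence band are my choices):

import random

def eta_distribution(completed_times, remaining_steps, trials=1000):
    """Bootstrap estimates of the remaining time from completed microtask times.
    Each trial resamples the completed times and scales the total up to the
    number of remaining steps."""
    n = len(completed_times)
    scale = remaining_steps / n
    estimates = []
    for _ in range(trials):
        sample = random.choices(completed_times, k=n)
        estimates.append(sum(sample) * scale)
    estimates.sort()
    return estimates

def eta_band(estimates, low=0.05, high=0.95):
    """e.g. "with 90% probability this will take between lo and hi from now"."""
    return estimates[int(low * len(estimates))], estimates[int(high * len(estimates)) - 1]

# est = eta_distribution(times_of_completed_tasks, remaining_steps=900)
# lo, hi = eta_band(est)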
Here's what I've found works well! For the first 50% of the task, you assume the rate is constant and extrapolate. The time prediction is very stable and doesn't bounce much.
Once you pass 50%, you switch computation strategy. You take the fraction of the job left to do (1-p), then look back in a history of your own progress and find (by binary search and linear interpolation) how long it took you to do the last (1-p) worth of progress, and use that as your completion time estimate.
So if you're now 71% done, you have 29% remaining. You look back in your history and find how long ago you were at (71-29=42%) completion. Report that time as your ETA.
This is naturally adaptive. If you have X amount of work to do, it looks only at the time it took to do the X amount of work. At the end when you're at 99% done, it's using only very fresh, very recent data for the estimate.
It's not perfect of course but it smoothly changes and is especially accurate at the very end when it's most useful.
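A rough sketch of that look-back strategy (using bisect for the binary search; the history format is an assumption):

from bisect import bisect_left

def eta(history, now, progress):
    """history: list of (time, progress) pairs, progress in [0, 1], sorted by time.
    Below 50% done, extrapolate linearly; above 50%, report how long the last
    (1 - progress) worth of progress actually took."""
    if progress <= 0:
        return None
    start_time = history[0][0]
    if progress < 0.5:
        return (now - start_time) * (1 - progress) / progress
    target = progress - (1 - progress)              # e.g. at 71% done, look back to 42%
    progresses = [p for _, p in history]
    idx = min(max(bisect_left(progresses, target), 1), len(history) - 1)
    (t0, p0), (t1, p1) = history[idx - 1], history[idx]
    # linear interpolation for the time at which progress hit 'target'
    t_target = t0 + (t1 - t0) * (target - p0) / (p1 - p0) if p1 != p0 else t0
    return now - t_target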
Whilst all the examples are valid, for the specific case of 'time left to download', I thought it would be a good idea to look at existing open source projects to see what they do.
From what I can see, Mozilla Firefox is the best at estimating the time remaining.
Mozilla Firefox
Firefox keeps track of the last estimate for time remaining, and by using this and the current estimate for time remaining, it performs a smoothing function on the time.
See the ETA code here. This uses a 'speed' which is previously calculated here and is a smoothed average of the last 10 readings.
This is a little complex, so to paraphrase:
Take a smoothed average of the speed based 90% on the previous speed and 10% on the new speed.
With this smoothed average speed work out the estimated time remaining.
Use this estimated time remaining, together with the previous estimated time remaining, to create a new estimated time remaining (in order to avoid jumping; see the sketch below)
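A minimal sketch of that scheme (the 90/10 speed split comes from the description above; the ETA blend factor and the function shape are assumptions):

def firefox_style_eta(prev_speed, new_speed, prev_eta, bytes_remaining, blend=0.9):
    """Smooth the speed, derive a raw ETA, then smooth the ETA against the previous one."""
    speed = 0.9 * prev_speed + 0.1 * new_speed if prev_speed else new_speed
    raw_eta = bytes_remaining / speed if speed > 0 else float("inf")
    eta = blend * prev_eta + (1 - blend) * raw_eta if prev_eta else raw_eta
    return speed, eta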
Google Chrome
Chrome seems to jump about all over the place, and the code shows this.
One thing I do like with Chrome though is how they format time remaining.
For > 1 hour it says '1 hrs left'
For < 1 hour it says '59 mins left'
For < 1 minute it says '52 secs left'
You can see how it's formatted here
DownThemAll! Manager
It doesn't use anything clever, meaning the ETA jumps about all over the place.
See the code here
pySmartDL (a python downloader)
Takes the average ETA of the last 30 ETA calculations. Sounds like a reasonable way to do it.
See the code here (blob/916f2592db326241a2bf4d8f2e0719c58b71e385/pySmartDL/pySmartDL.py#L651)
Transmission
Gives a pretty good ETA in most cases (except when starting off, as might be expected).
Uses a smoothing factor over the past 5 readings, similar to Firefox but not quite as complex. Fundamentally similar to Gooli's answer.
See the code here
I usually use an Exponential Moving Average to compute the speed of an operation with a smoothing factor of say 0.1 and use that to compute the remaining time. This way all the measured speeds have influence on the current speed, but recent measurements have much more effect than those in the distant past.
In code it would look something like this:
alpha = 0.1 # smoothing factor
...
speed = (speed * (1 - alpha)) + (currentSpeed * alpha)
If your tasks are uniform in size, currentSpeed would simply be the time it took to execute the last task. If the tasks have different sizes and you know that one task is supposed to be, e.g., twice as long as another, you can divide the time it took to execute the task by its relative size to get the current speed. Using speed, you can compute the remaining time by multiplying it by the total size of the remaining tasks (or just by their number if the tasks are uniform).
Hopefully my explanation is clear enough, it's a bit late in the day.
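A short sketch of that approach (the class and names are mine); the smoothed quantity is the time per unit of work, as in the description, and the remaining time is that value times the work left:

class EmaEta:
    """Exponential moving average of time-per-unit-of-work, used to project remaining time."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha              # smoothing factor
        self.time_per_unit = None       # smoothed seconds per unit of work

    def update(self, seconds_taken, relative_size=1.0):
        current = seconds_taken / relative_size
        if self.time_per_unit is None:
            self.time_per_unit = current
        else:
            self.time_per_unit = self.time_per_unit * (1 - self.alpha) + current * self.alpha

    def remaining(self, units_left):
        return units_left * self.time_per_unit if self.time_per_unit is not None else None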
In certain instances, when you need to perform the same task on a regular basis, it might be a good idea to use past completion times to average against.
For example, I have an application that loads the iTunes library via its COM interface. The size of a given iTunes library generally does not increase dramatically from launch to launch in terms of the number of items, so in this example you could track the last three load times and load rates, average against those, and compute your current ETA.
This would be hugely more accurate than an instantaneous measurement and probably more consistent as well.
However, this method depends upon the size of the task being relatively similar to the previous ones, so this would not work for a decompressing method or something else where any given byte stream is the data to be crunched.
Just my $0.02
First off, it helps to generate a running moving average. This weights more recent events more heavily.
To do this, keep a bunch of samples around (circular buffer or list), each a pair of progress and time. Keep the most recent N seconds of samples. Then generate a weighted average of the samples:
totalProgress += (curSample.progress - prevSample.progress) * scaleFactor
totalTime += (curSample.time - prevSample.time) * scaleFactor
where scaleFactor goes linearly from 0...1 as an inverse function of time in the past (thus weighing more recent samples more heavily). You can play around with this weighting, of course.
At the end, you can get the average rate of change:
averageProgressRate = (totalProgress / totalTime);
You can use this to figure out the ETA by dividing the remaining progress by this number.
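A sketch of that windowed, linearly weighted average (the window length and weighting shape are placeholders):

from collections import deque

class WindowedRate:
    """Keep the last N seconds of (progress, time) samples and compute a
    linearly weighted average rate, weighing recent samples more heavily."""
    def __init__(self, window_seconds=10.0):
        self.window = window_seconds
        self.samples = deque()          # (progress, time) pairs, oldest first

    def add(self, progress, time):
        self.samples.append((progress, time))
        while self.samples and time - self.samples[0][1] > self.window:
            self.samples.popleft()

    def average_rate(self):
        if len(self.samples) < 2:
            return None
        pairs = list(self.samples)
        newest_time = pairs[-1][1]
        total_progress = total_time = 0.0
        for (p0, t0), (p1, t1) in zip(pairs, pairs[1:]):
            scale = 1.0 - (newest_time - t1) / self.window   # 0..1, newer samples weigh more
            total_progress += (p1 - p0) * scale
            total_time += (t1 - t0) * scale
        return total_progress / total_time if total_time > 0 else None

# eta_seconds = remaining_progress / rate.average_rate()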
However, while this gives you a good trending number, you have one other issue - jitter. If, due to natural variations, your rate of progress moves around a bit (it's noisy) - e.g. maybe you're using this to estimate file downloads - you'll notice that the noise can easily cause your ETA to jump around, especially if it's pretty far in the future (several minutes or more).
To avoid jitter from affecting your ETA too much, you want this average rate of change number to respond slowly to updates. One way to approach this is to keep around a cached value of averageProgressRate, and instead of instantly updating it to the trending number you've just calculated, you simulate it as a heavy physical object with mass, applying a simulated 'force' to slowly move it towards the trending number. With mass, it has a bit of inertia and is less likely to be affected by jitter.
Here's a rough sample:
// desiredAvgProgressRate is computed from the weighted average above
// m_averageProgressRate is a member variable, also in progress units/sec
// lastTimeElapsed = the time delta in seconds (since last simulation)
// m_averageSeekSpeed is a member variable in units/sec, used to hold
// the velocity of m_averageProgressRate
const float frictionCoeff = 0.75f;
const float mass = 4.0f;
const float maxSpeedCoeff = 0.25f;

// lose 25% of our speed per sec, simulating friction
m_averageSeekSpeed *= pow(frictionCoeff, lastTimeElapsed);

float delta = desiredAvgProgressRate - m_averageProgressRate;

// update the velocity
float oldSpeed = m_averageSeekSpeed;
float accel = delta / mass;
m_averageSeekSpeed += accel * lastTimeElapsed; // v += at

// clamp the top speed to 25% of our current value
float sign = (m_averageSeekSpeed > 0.0f ? 1.0f : -1.0f);
float maxVal = m_averageProgressRate * maxSpeedCoeff;
if (fabs(m_averageSeekSpeed) > maxVal)
{
    m_averageSeekSpeed = sign * maxVal;
}

// make sure they have the same sign
if ((m_averageSeekSpeed > 0.0f) == (delta > 0.0f))
{
    float adjust = (oldSpeed + m_averageSeekSpeed) * 0.5f * lastTimeElapsed;
    // don't overshoot.
    if (fabs(adjust) > fabs(delta))
    {
        adjust = delta;
        // apply damping
        m_averageSeekSpeed *= 0.25f;
    }
    m_averageProgressRate += adjust;
}
Your question is a good one. If the problem can be broken up into discrete units, an accurate calculation often works best. Unfortunately, this may not be the case: even if you are installing 50 components, each one might be 2% of the total, but one of them can be massive. One thing that I have had moderate success with is to clock the CPU and disk and give a decent estimate based on observational data. Knowing that certain checkpoints are really point x allows you some opportunity to correct for environment factors (network, disk activity, CPU load). However, this solution is not general in nature due to its reliance on observational data. Using ancillary data, such as RPM file size, helped me make my progress bars more accurate, but they are never bulletproof.
Uniform averaging
The simplest approach would be to predict the remaining time linearly:
t_rem := t_spent * ( n - prog ) / prog
where t_rem is the predicted remaining time (the ETA), t_spent is the time elapsed since the commencement of the operation, and prog is the number of microtasks completed out of their full quantity n. To explain: n may be the number of rows in a table to process or the number of files to copy.
This method having no parameters, one need not worry about fine-tuning the exponent of attenuation. The trade-off is poor adaptation to a changing progress rate, because all samples have equal contribution to the estimate, whereas it is only meet that recent samples should have more weight than old ones, which leads us to
Exponential smoothing of rate
in which the standard technique is to estimate progress rate by averaging previous point measurements:
rate := 1 / (n * dt);            { rate equals normalized progress per unit time }
if prog = 1 then                 { if the first microtask just completed }
  rate_est := rate               { initialize the estimate }
else
begin
  weight := Exp( - dt / DECAY_T );
  rate_est := rate_est * weight + rate * (1.0 - weight);
  t_rem := (1.0 - prog / n) / rate_est;
end;
where dt denotes the duration of the last completed microtask and is equal to the time passed since the previous progress update. Notice that weight is not a constant and must be adjusted according to the length of time during which a certain rate was observed, because the longer we have observed a certain speed, the more the previous measurements should decay. The constant DECAY_T denotes the length of time during which the weight of a sample decreases by a factor of e. SPWorley himself suggested a similar modification to gooli's proposal, although he applied it to the wrong term. An exponential average for equidistant measurements is:
Avg_e(n) = Avg_e(n-1) * alpha + m_n * (1 - alpha)
but what if the samples are not equidistant, as is the case with times in a typical progress bar? Take into account that alpha above is but an empirical quotient whose true value is:
alpha = Exp( - lambda * dt ),
where lambda is the parameter of the exponential window and dt the amount of change since the previous sample, which need not be time, but any linear and additive parameter. alpha is constant for equidistant measurements but varies with dt.
Mark that this method relies on a predefined time constant and is not scalable in time. In other words, if the exact same process were uniformly slowed down by a constant factor, this rate-based filter would become proportionally more sensitive to signal variations, because at every step the weight would be decreased. If, however, we desire a smoothing independent of the time scale, we should consider
Exponential smoothing of slowness
which is essentially the smoothing of rate turned upside down, with the added simplification of a constant weight, because prog is growing by equidistant increments:
slowness := n * dt;              { slowness is the amount of time per unit of progress }
if prog = 1 then                 { if the first microtask just completed }
  slowness_est := slowness       { initialize the estimate }
else
begin
  weight := Exp( - 1 / (n * DECAY_P) );
  slowness_est := slowness_est * weight + slowness * (1.0 - weight);
  t_rem := (1.0 - prog / n) * slowness_est;
end;
The dimensionless constant DECAY_P denotes the normalized progress difference between two samples of which the weights are in the ratio of one to e. In other words, this constant determines the width of the smoothing window in progress domain, rather than in time domain. This technique is therefore independent of the time scale and has a constant spatial resolution.
Further research: adaptive exponential smoothing
You are now equipped to try the various algorithms of adaptive exponential smoothing. Only remember to apply it to slowness rather than to rate.
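For concreteness, here is a rough Python transliteration of the slowness-smoothing variant (the class wrapper and names are mine):

import math

class SlownessEta:
    """Exponentially smoothed 'slowness' (time per unit of normalized progress)."""
    def __init__(self, n_total, decay_p=0.1):
        self.n = n_total                # total number of microtasks
        self.decay_p = decay_p          # smoothing window width, in the progress domain
        self.slowness_est = None

    def update(self, dt):
        """dt: time taken by the microtask that just completed."""
        slowness = self.n * dt
        if self.slowness_est is None:
            self.slowness_est = slowness
        else:
            weight = math.exp(-1.0 / (self.n * self.decay_p))
            self.slowness_est = self.slowness_est * weight + slowness * (1.0 - weight)

    def remaining(self, prog):
        """prog: number of microtasks completed so far."""
        return (1.0 - prog / self.n) * self.slowness_est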
I always wish these things would tell me a range. If it said, "This task will most likely be done in between 8 min and 30 minutes," then I have some idea of what kind of break to take. If it's bouncing all over the place, I'm tempted to watch it until it settles down, which is a big waste of time.
I have tried and simplified your "easy"/"wrong"/"OK" formula and it works best for me:
t / p - t
In Python:
>>> done=0.3; duration=10; "time left: %i" % (duration / done - duration)
'time left: 23'
That saves one op compared to (dur*(1-done)/done). And, in the edge case you describe, possibly ignoring the dialog for 30 minutes extra hardly matters after waiting all night.
Comparing this simple method to the one used by Transmission, I found it to be up to 72% more accurate.
I don't sweat it, it's a very small part of an application. I tell them what's going on, and let them go do something else.

Aging a dataset

For reasons I'd rather not go into, I need to filter a set of values to reduce jitter. To that end, I need to be able to average a list of numbers, with the most recent having the greatest effect, and the least recent having the smallest effect. I'm using a sample size of 10, but that could easily change at some point.
Are there any reasonably simple aging algorithms that I can apply here?
Have a look at exponential smoothing. It's fairly simple, and might be sufficient for your needs. Basically, recent observations are given relatively more weight than older ones.
Also (depending on the application) you may want to look at various reinforcement learning techniques, for example Q-Learning or TD-Learning or generally speaking any method involving the discount.
I ran into something similar in an embedded control application.
The simplest option that I came across was a 3/4 filter. This gets applied continuously over the entire data set:
current_value = (3*current_value + new_value)/4
I eventually decided to go with a 16-tap FIR filter instead:
Overview
FIR FAQ
Wikipedia article
Many weighted averaging algorithms could be used.
For example, for items I(n) for n = 1 to N in sequence (newest to oldest):
SUM(I(n) * (N + 1 - n)) / SUM(n)
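As a sketch (treating index 0 as the newest item, consistent with "newest to oldest" above):

def recency_weighted_average(items):
    """Linearly weighted average: the newest item gets weight N, the oldest weight 1."""
    n_total = len(items)
    weights = [n_total - i for i in range(n_total)]   # newest first
    return sum(v * w for v, w in zip(items, weights)) / sum(weights)

print(recency_weighted_average([10.0, 12.0, 11.0, 15.0]))   # the newest sample weighs the most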
It's not exactly clear from the question whether you're dealing with fixed-length data or with data that is continuously coming in. A nice physical model for the latter would be a low-pass filter, using a capacitor and a resistor (R and C). Assuming your data is equidistantly spaced in time (is it?), this leads to the update prescription
U_aged[n+1] = U_aged[n] + (deltat / Tau) * (U_raw[n+1] - U_aged[n])
where Tau is the time constant of the filter. In the limit of zero deltat, this gives an exponential decay (old values will be reduced to 1/e of their value after time Tau). In an implementation, you only need to keep a running weighted sum U_aged. deltat would be 1, and Tau would specify the 'aging constant': the number of steps it takes to reduce a sample's contribution to 1/e.
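A minimal sketch of that filter (the class and names are placeholders):

class AgingFilter:
    """First-order low-pass ('RC') filter: new samples are blended in, old ones decay away."""
    def __init__(self, tau, initial=0.0):
        self.tau = tau              # aging constant, in number of steps (deltat = 1)
        self.aged = initial

    def update(self, raw, deltat=1.0):
        self.aged += (deltat / self.tau) * (raw - self.aged)
        return self.aged

# f = AgingFilter(tau=10)
# smoothed = [f.update(sample) for sample in samples]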
