I am working on an assignment where I am supposed to compute the density of an event. Let's say that a certain event happens 5 times within seconds, it would mean that it would have a higher density than if it were to happen 5 times within hours.
I have in my possession, the time at which the event happens.
I was first thinking about computing the elapsed time between each two successive events and then play with the average and mean of these values.
My problem is that I do not know how to accurately represent this notion of density through mathematics. Let's say that I have 5 events happening really close to each other, and then a long break, and then again 5 events happening really close to each other. I would like to be able to represent this as high density. How should I go about it?
In the last example, I understand that my mean won't be truly representative but that my standard deviation will show that. However, how could I have a single density value (let's say between 0 and 1) with which I could rank different events?
Thank you for your help!
I would try the harmonic mean, which represents the rate at which your events happen, by still giving you an averaged time value. It is defined by :
I think its behaviour is close to what you expect as it measures what you want, but not between 0 and 1 and with inverse tendencies (small values mean dense, large values mean sparse). Let us go through a few of your examples :
~5 events in an hour. Let us suppose for simplicity there is 10 minutes between each event. Then we have H = 6 /(6 * 1/10) = 10
~5 events in 10 minutes, then nothing until the end of the hour (50 minutes). Let us suppose all short intervals are 2.5 minutes, then H = 6 / (5/2.5 + 1/50) = 6 * 50 / 101 = 2.97
~5 events in 10 minutes, but this cycle restarts every half hour thus we have 20 minutes as the last interval instead of 50. Then we get H = 6 / (5/2.5 + 1/20) = 6 * 20 / 41 = 2.92
As you can see the effect of the longer and rarer values in a set is diminished by the fact that we use inverses, thus less weight to the "in between bursts" behaviour. Also you can compare behaviours with the same "burst density" but that do not happen at the same frequency, and you will get numbers that are close but whose ordering still reflects this difference.
For density to make sense you need to define 2 things:
the range where you look at it,
and the unit of time
After that you can say for example, that from 12:00 to 12:10 the density of the event was an average of 10/minute.
What makes sense in your case obviously depends on what your input data is. If your measurement lasts for 1 hour and you have millions of entries then probably seconds or milliseconds are better choice for unit. If you measure for a week and have a few entries then day is a better unit.
Related
I'm creating an app to monitor water quality. The temperature data is updated every 2 min to firebase real-time database. App has two requirements
1) It should alert the user when temperature exceed 33 degree or drop below 23 degree - This part is done
2) It should alert user when it has big temperature fluctuation after analysing data every 30min - This part i'm confused.
I don't know what algorithm to use to detect big temperature fluctuation over a period of time and alert the user. Can someone help me on this?
For a period of 30 minutes, your app would give you 15 values.
If you want to figure out a big change in this data, then there is one way to do so.
You can use implement the following method:
Calculate the mean and the standard deviation of the values.
Subtract the data you have from the mean and then take the absolute value of the result.
Compare if the absolute value is greater than one standard deviation, if it is greater then you have a big data.
See this example for better understanding:
Lets suppose you have these values for 10 minutes:
25,27,24,35,28
First Step:
Mean = 27 (apprx)
One standard deviation = 3.8
Second Step: Absolute(Data - Mean)
abs(25-27) = 2
abs(27-27) = 0
abs(24-27) = 3
abs(35-27) = 8
abs(28-27) = 1
Third Step
Check if any of the subtraction is greater than standard deviation
abs(35-27) gives 8 which is greater than 3.8
So, there is a big fluctuation. If all the subtracted results are less than standard deviation, then there is no fluctuation.
You can still improvise the result by selecting two or three standard deviation instead of one standard deviation.
Start by defining what you mean by fluctuation.
You don't say what temperature scale you're using. Fahrenheit, Celsius, Rankine, or Kelvin?
Your sampling rate is a new data value every two minutes. Do you define fluctuation as the absolute value of the difference between the last point and current value? That's defensible.
If the max allowable absolute value is some multiple of your 33-23 = 10 degrees you're in business.
I am reading some papers on simulation and performance modeling. The Y axis in some figures is labeled "Seconds per Simulation Day". I am not sure what it actually means. It span from 0, 20, 40 to 120.
Another label is "Simulation years per day". I guess it means the guest OS inside simulation environment thinks it has passed several years while actually it just passed a day in the real world? But I guess simulation should slow down the execution, so I guess inside simulation environment passed several hours while actually it just passed a day in the real world would be more reasonable.
Thanks.
Without seeing the paper, I assume they are trying to compare the CPU time it takes to get to some physical time in a simulation.
So "Seconds per Simulation Day" is likely the walltime it took to get 24 hours in the simulation.
Likewise, "Simulation Years per Day" is physical time of simulation/real life day.
Of course, without seeing the paper it's impossible to know for sure. It's also possible they are looking at CPU-seconds or CPU-days, which would be nCPUs*walltime.
Simulations typically run in discrete time units, called time steps. If you'd like to simulate a certain process that spans certain time in the simulation, you would have to perform certain number of time steps. If the length of a time step is fixed, the number of steps is then just the simulated time divided by the length of the time step. Calculations in each time step take certain amount of time and the total run time for the simulation would equal the number of time steps times the time it takes to perform one time step:
(1) total_time = (simulation_time / timestep_length) * run_time_per_timestep
Now several benchmark parameters can be obtained by placing different parameters on the left hand side. E.g. if you fix simulation_time = 1 day then total_time would give you the total simulation run time, i.e.
(2) seconds_per_sim_day = (1 day / timestep_length) * run_time_per_timestep
Large values of seconds_per_sim_day could mean:
it takes too much time to compute a single time step, i.e. run_time_per_timestep is too high -> the computation algorithm should be optimised for speed;
the time step is too short -> search for better algorithms that can accept larger time steps and still produce (almost) the same result.
On the other hand, if you solve (1) for simulation_time and fix total_time = 1 day, you get the number of time steps that can be performed per day times the length of the time step, or the total simulation time that can be achieved per day of computation:
(3) simulation_time_per_day = (1 day / run_time_per_step) * timestep_length
Now one can observe that:
larger time steps lead to larger values of simulation_time_per_day, i.e. longer simulations can be computed;
if it takes too much time to compute a time step, the value of simulation_time_per_day would go down.
Usually those figures could be used when making decisions about buying CPU time at some computing centre. For example, you would like to simulate 100 years, then just divide that by the amount of simulation years per day and you get how many compute days you would have to pay (or wait) for. Larger values of simulation_time_per_day definitely benefit you in this case. If, on the other hand, you only have 10 compute days at your disposal, then you can compute how long of a simulation could be computed and make some decisions, e.g. more short simulations but with many different parameters vs. less but longer simulations with some parameters that you have predicted to be the optimal ones.
In real life things are much more complicated. Usually computing each time step could take different time (although there are cases where each time step takes exactly the same amount of time as all other time steps) and it would strongly depend on the simluation size, configuration, etc. That's why standardised tests exist and usually some averaged value is reported.
Just to summarise: given that all test parameters are kept equal,
faster computers would give less "seconds per simulation day" and more "simulation years per day"
slower computers would give more "seconds per simulation day" and less "simulation years per day"
By the way, both quantites are reciprocial and related by this simple equation:
simuation_years_per_day = 236,55 / seconds_per_simulation_day
(that is "simulation years per day" equals 86400 divided by "seconds per simulation day" /which gives you the simulation days per day/ and then dividied by 365.25 to convert the result into years)
So it doesn't really matter if "simulation years per day" or "seconds per simulation day" is presented. One just have to chose the representation which clearly shows how much better the newer system is from the previous/older/existing one :)
For example, if I wanted to ensure that I had one winner every four hours, and I expected to have 125 plays per hour, what is the best way to provide for the highest chance of having a winner and the lowest chance of having no winners at the end of the four hour period?
The gameplay is like a slot-machine, not a daily number. i.e. the entrant enters the game and gets notified right away if they have won or lost.
Sounds like a homework problem, I know, but it's not :)
Thanks.
There's really only so much you can do to keep things fair (i.e. someone who enters at the beginning of a four hour period has the same odds of winning as someone who enters at the end) if you want to enforce this constraint. In general, the best you can do while remaining legal is to take a guess at how many entrants you're going to have and set the probability accordingly (and if there's no winner at the end of a given period, give it to a random entrant from that period).
Here's what I'd do to adjust your sweepstakes probability as you go (setting aside the legal ramifications of doing so):
For each period, start the probability at 1 / (number of expected entries * 2)
At any time, if you get a winner, the probability goes to 0 for the rest of that period.
Every thirty minutes, if you're still without a winner, set the probability at 1 / ((number of expected entries * (1 - percentage of period complete)) * 2). So here, the percentage of period complete is the number of hours elapsed in that current period / number of total hours in the period (4). Basically as you go, the probability will scale upwards.
Practical example: expected entries is 200.
Starting probability = 1 / 400 = 0.0025.
After first half hour, we don't have a winner, so we reevaluate probability:
probability = 1 / ((200 * (1 - 0.125) * 2) = 1 / (200 * 2 * 0.875) = 1/350
This goes down all the way until the probability is a maximum of 1/50, assuming no winner occurs before then.
You can adjust these parameters if you want to maximize the acceleration or whatever. But I'd be remiss if I didn't emphasize that I don't believe running a sweepstakes like this is legal. I've done a few sweepstakes for my company and am somewhat familiar with the various laws and regulations, and the general rule of thumb, as I understand it, is that no one entrant should have an advantage over any other entrant that the other entrant doesn't know about. I'm no expert, but consult with your lawyer before running a sweepstakes like this. That said, the solution above will help you maximize your odds of giving away a prize.
If you're wanting a winner for every drawing, you'd simply pick a random winner from your entrants.
If you're doing it like a lottery, where you don't have to have a winner for every drawing, the odds are as high or low as you care to make them based on your selection scheme. For instance, if you have 125 entries per hour, and you're picking every four hours, that's 500 entries per contest. If you generate a random number between 1 and 1000, there's a 50% chance that someone will win, 1 and 750 is a 75% chance that someone will win, and so forth. Then you just pick the entry that corresponds to the random number generated.
There's a million different ways to implement selecting a winner, in the end you just need to pick one and use it consistently.
There's this question but it has nothing close to help me out here.
Tried to find information about it on the internet yet this subject is so swarmed with articles on "how to win" or other non-related stuff that I could barely find anything. None worth posting here.
My question is how would I assure a payout of 95% over a year?
Theoretically, of course.
So far I can think of three obvious variables to consider within the calculation: Machine payout term (year in my case), total paid and total received in that term.
Now I could simply shoot a random number between the paid/received gap and fix slots results to be shown to the player but I'm not sure this is how it's done.
This method however sounds reasonable, although it involves building the slots results backwards..
I could also make a huge list of all possibilities, save them in a database randomized by order and simply poll one of them each time.
This got many flaws - the biggest one is the huge list I'm going to get (millions/billions/etc' records).
I certainly hope this question will be marked with an "Answer" (:
You have to make reel strips instead of huge database. Here is brief example for very basic 3-reel game containing 3 symbols:
Paytable:
3xA = 5
3xB = 10
3xC = 20
Reels-strip is a sequence of symbols on each reel. For the calculations you only need the quantity of each symbol per each reel:
A = 3, 1, 1 (3 symbols on 1st reel, 1 symbol on 2nd, 1 symbol on 3rd reel)
B = 1, 1, 2
C = 1, 1, 1
Full cycle (total number of all possible combinations) is 5 * 3 * 4 = 60
Now you can calculate probability of each combination:
3xA = 3 * 1 * 1 / full cycle = 0.05
3xB = 1 * 1 * 2 / full cycle = 0.0333
3xC = 1 * 1 * 1 / full cycle = 0.0166
Then you can calculate the return for each combination:
3xA = 5 * 0.05 = 0.25 (25% from AAA)
3xB = 10 * 0.0333 = 0.333 (33.3% from BBB)
3xC = 20 * 0.0166 = 0.333 (33.3% from CCC)
Total return = 91.66%
Finally, you can shuffle the symbols on each reel to get the reels-strips, e.g. "ABACA" for the 1st reel. Then pick a random number between 1 and the length of the strip, e.g. 1 to 5 for the 1st reel. This number is the middle symbol. The upper and lower ones are from the strip. If you picked from the edge of the strip, use the first or last one to loop the strip (it's a virtual reel). Then score the result.
In real life you might want to have Wild-symbols, free spins and bonuses. They all are pretty complicated to describe in this answer.
In this sample the Hit Frequency is 10% (total combinations = 60 and prize combinations = 6). Most of people use excel to calculate this stuff, however, you may find some good tools for making slot math.
Proper keywords for Google: PAR-sheet, "slot math can be fun" book.
For sweepstakes or Class-2 machines you can't use this stuff. You have to display a combination by the given prize instead. This is a pretty different task, so you may try to prepare a database storing the combinations sorted by the prize amount.
Well, the first problem is with the keyword assure, if you are dealing with random, you cannot assure, unless you change the logic of the slot machine.
Consider the following algorithm though. I think this style of thinking is more reliable then plotting graphs of averages to achive 95%;
if( customer_able_to_win() )
{
calculate_how_to_win();
}
else
no_win();
customer_able_to_win() is your data log that says how much intake you have gotten vs how much you have paid out, if you are under 95%, payout, then customer_able_to_win() returns true; in that case, calculate_how_to_win() calculates how much the customer would be able to win based on your %, so, lets choose a sampling period of 24 hours. If over the last 24 hours i've paid out 90% of the money I've taken in, then I can pay out up to 5%.... lets give that 5% a number such as 100$. So calculate_how_to_win says I can pay out up to 100$, so I would find a set of reels that would pay out 100$ or less, and that user could win. You could add a little random to it, but to ensure your 95% you'll have to have some other rules such as a forced max payout if you get below say 80%, and so on.
If you change the algorithm a little by adding random to the mix you will have to have more of these caveats..... So to make it APPEAR random to the user, you could do...
if( customer_able_to_win() && payout_percent() < 90% )
{
calculate_how_to_win(); // up to 5% payout
}
else
no_win();
With something like that, it will go on a losing streak after you hit 95% until you reach 90%, then it will go on a winning streak of random increments until you reach 95%.
This isn't a full algorithm answer, but more of a direction on how to think about how the slot machine works.
I've always envisioned this is the way slot machines work especially with video poker. Because the no_win() function would calculate how to lose, but make it appear to be 1 card off to tease you to think you were going to win, instead of dealing with a 'fair' game and the random just happens to be like that....
Think of the entire process of.... first thinking if you are going to win, how are you going to win, if you're not going to win, how are you going to lose, instead of random number generators determining if you will win or not.
I worked many years ago for an internet casino in Australia, this one being the only one in the world that was regulated completely by a government body. The algorithms you speak of that produce "structured randomness" are obviously extremely complex especially when you are talking multiple lines in all directions, double up, pick the suit, multiple progressive jackpots and the like.
Our poker machine laws for our state demand a payout of 97% of what goes in. For rudely to be satisfied that our machine did this, they made us run 10 million mock turns of the machine and then wanted to see that our game paid off at what the law states with the tiniest range of error (we had many many machines running a script to auto playing using a script to simulate the click for about a week before we hit the 10 mil).
Anyhow the algorithms you speak of are EXPENSIVE! They range from maybe $500k to several million per machine so as you can understand, no one is going to hand them over for free, that's for sure. If you wanted a single line machine it would be easy enough to do. Just work out you symbols/cards and what pay structure you want for each. Then you could just distribute those payouts amongst non-payouts till you got you respective figure. Obviously the more options there are means the longer it will take to pay out at that respective rate, it may even payout more early in the piece. Hit frequency and prize size are also factors you may want to consider
A simple way to do it, if you assume that people win a constant number of times a time period:
Create a collection of all possible tumbler combinations with how much each one pays out.
The first time someone plays, in that time period, you can offer all combinations at equal probability.
If they win, take that amount off the total left for the time period, and remove from the available options any combination that would payout more than you have left.
Repeat with the reduced combinations until all the money is gone for that time period.
Reset and start again for the next time period.
In many applications, we have some progress bar for a file download, for a compression task, for a search, etc. We all often use progress bars to let users know something is happening. And if we know some details like just how much work has been done and how much is left to do, we can even give a time estimate, often by extrapolating from how much time it's taken to get to the current progress level.
(source: jameslao.com)
But we've also seen programs which this Time Left "ETA" display is just comically bad. It claims a file copy will be done in 20 seconds, then one second later it says it's going to take 4 days, then it flickers again to be 20 minutes. It's not only unhelpful, it's confusing!
The reason the ETA varies so much is that the progress rate itself can vary and the programmer's math can be overly sensitive.
Apple sidesteps this by just avoiding any accurate prediction and just giving vague estimates!
(source: autodesk.com)
That's annoying too, do I have time for a quick break, or is my task going to be done in 2 more seconds? If the prediction is too fuzzy, it's pointless to make any prediction at all.
Easy but wrong methods
As a first pass ETA computation, probably we all just make a function like if p is the fractional percentage that's done already, and t is the time it's taken so far, we output t*(1-p)/p as the estimate of how long it's going to take to finish. This simple ratio works "OK" but it's also terrible especially at the end of computation. If your slow download speed keeps a copy slowly advancing happening overnight, and finally in the morning, something kicks in and the copy starts going at full speed at 100X faster, your ETA at 90% done may say "1 hour", and 10 seconds later you're at 95% and the ETA will say "30 minutes" which is clearly an embarassingly poor guess.. in this case "10 seconds" is a much, much, much better estimate.
When this happens you may think to change the computation to use recent speed, not average speed, to estimate ETA. You take the average download rate or completion rate over the last 10 seconds, and use that rate to project how long completion will be. That performs quite well in the previous overnight-download-which-sped-up-at-the-end example, since it will give very good final completion estimates at the end. But this still has big problems.. it causes your ETA to bounce wildly when your rate varies quickly over a short period of time, and you get the "done in 20 seconds, done in 2 hours, done in 2 seconds, done in 30 minutes" rapid display of programming shame.
The actual question:
What is the best way to compute an estimated time of completion of a task, given the time history of the computation? I am not looking for links to GUI toolkits or Qt libraries. I'm asking about the algorithm to generate the most sane and accurate completion time estimates.
Have you had success with math formulas? Some kind of averaging, maybe by using the mean of the rate over 10 seconds with the rate over 1 minute with the rate over 1 hour? Some kind of artificial filtering like "if my new estimate varies too much from the previous estimate, tone it down, don't let it bounce too much"? Some kind of fancy history analysis where you integrate progress versus time advancement to find standard deviation of rate to give statistical error metrics on completion?
What have you tried, and what works best?
Original Answer
The company that created this site apparently makes a scheduling system that answers this question in the context of employees writing code. The way it works is with Monte Carlo simulation of future based on the past.
Appendix: Explanation of Monte Carlo
This is how this algorithm would work in your situation:
You model your task as a sequence of microtasks, say 1000 of them. Suppose an hour later you completed 100 of them. Now you run the simulation for the remaining 900 steps by randomly selecting 90 completed microtasks, adding their times and multiplying by 10. Here you have an estimate; repeat N times and you have N estimates for the time remaining. Note the average between these estimates will be about 9 hours -- no surprises here. But by presenting the resulting distribution to the user you'll honestly communicate to him the odds, e.g. 'with the probability 90% this will take another 3-15 hours'
This algorithm, by definition, produces complete result if the task in question can be modeled as a bunch of independent, random microtasks. You can gain a better answer only if you know how the task deviates from this model: for example, installers typically have a download/unpacking/installing tasklist and the speed for one cannot predict the other.
Appendix: Simplifying Monte Carlo
I'm not a statistics guru, but I think if you look closer into the simulation in this method, it will always return a normal distribution as a sum of large number of independent random variables. Therefore, you don't need to perform it at all. In fact, you don't even need to store all the completed times, since you'll only need their sum and sum of their squares.
In maybe not very standard notation,
sigma = sqrt ( sum_of_times_squared-sum_of_times^2 )
scaling = 900/100 // that is (totalSteps - elapsedSteps) / elapsedSteps
lowerBound = sum_of_times*scaling - 3*sigma*sqrt(scaling)
upperBound = sum_of_times*scaling + 3*sigma*sqrt(scaling)
With this, you can output the message saying that the thing will end between [lowerBound, upperBound] from now with some fixed probability (should be about 95%, but I probably missed some constant factor).
Here's what I've found works well! For the first 50% of the task, you assume the rate is constant and extrapolate. The time prediction is very stable and doesn't bounce much.
Once you pass 50%, you switch computation strategy. You take the fraction of the job left to do (1-p), then look back in time in a history of your own progress, and find (by binary search and linear interpolation) how long it's taken you to do the last (1-p) percentage and use that as your time estimate completion.
So if you're now 71% done, you have 29% remaining. You look back in your history and find how long ago you were at (71-29=42%) completion. Report that time as your ETA.
This is naturally adaptive. If you have X amount of work to do, it looks only at the time it took to do the X amount of work. At the end when you're at 99% done, it's using only very fresh, very recent data for the estimate.
It's not perfect of course but it smoothly changes and is especially accurate at the very end when it's most useful.
Whilst all the examples are valid, for the specific case of 'time left to download', I thought it would be a good idea to look at existing open source projects to see what they do.
From what I can see, Mozilla Firefox is the best at estimating the time remaining.
Mozilla Firefox
Firefox keeps a track of the last estimate for time remaining, and by using this and the current estimate for time remaining, it performs a smoothing function on the time.
See the ETA code here. This uses a 'speed' which is previously caculated here and is a smoothed average of the last 10 readings.
This is a little complex, so to paraphrase:
Take a smoothed average of the speed based 90% on the previous speed and 10% on the new speed.
With this smoothed average speed work out the estimated time remaining.
Use this estimated time remaining, and the previous estimated time remaining to created a new estimated time remaining (in order to avoid jumping)
Google Chrome
Chrome seems to jump about all over the place, and the code shows this.
One thing I do like with Chrome though is how they format time remaining.
For > 1 hour it says '1 hrs left'
For < 1 hour it says '59 mins left'
For < 1 minute it says '52 secs left'
You can see how it's formatted here
DownThemAll! Manager
It doesn't use anything clever, meaning the ETA jumps about all over the place.
See the code here
pySmartDL (a python downloader)
Takes the average ETA of the last 30 ETA calculations. Sounds like a reasonable way to do it.
See the code here/blob/916f2592db326241a2bf4d8f2e0719c58b71e385/pySmartDL/pySmartDL.py#L651)
Transmission
Gives a pretty good ETA in most cases (except when starting off, as might be expected).
Uses a smoothing factor over the past 5 readings, similar to Firefox but not quite as complex. Fundamentally similar to Gooli's answer.
See the code here
I usually use an Exponential Moving Average to compute the speed of an operation with a smoothing factor of say 0.1 and use that to compute the remaining time. This way all the measured speeds have influence on the current speed, but recent measurements have much more effect than those in the distant past.
In code it would look something like this:
alpha = 0.1 # smoothing factor
...
speed = (speed * (1 - alpha)) + (currentSpeed * alpha)
If your tasks are uniform in size, currentSpeed would simply be the time it took to execute the last task. If the tasks have different sizes and you know that one task is supposed to be i,e, twice as long as another, you can divide the time it took to execute the task by its relative size to get the current speed. Using speed you can compute the remaining time by multiplying it by the total size of the remaining tasks (or just by their number if the tasks are uniform).
Hopefully my explanation is clear enough, it's a bit late in the day.
In certain instances, when you need to perform the same task on a regular basis, it might be a good idea of using past completion times to average against.
For example, I have an application that loads the iTunes library via its COM interface. The size of a given iTunes library generally do not increase dramatically from launch-to-launch in terms of the number of items, so in this example it might be possible to track the last three load times and load rates and then average against that and compute your current ETA.
This would be hugely more accurate than an instantaneous measurement and probably more consistent as well.
However, this method depends upon the size of the task being relatively similar to the previous ones, so this would not work for a decompressing method or something else where any given byte stream is the data to be crunched.
Just my $0.02
First off, it helps to generate a running moving average. This weights more recent events more heavily.
To do this, keep a bunch of samples around (circular buffer or list), each a pair of progress and time. Keep the most recent N seconds of samples. Then generate a weighted average of the samples:
totalProgress += (curSample.progress - prevSample.progress) * scaleFactor
totalTime += (curSample.time - prevSample.time) * scaleFactor
where scaleFactor goes linearly from 0...1 as an inverse function of time in the past (thus weighing more recent samples more heavily). You can play around with this weighting, of course.
At the end, you can get the average rate of change:
averageProgressRate = (totalProgress / totalTime);
You can use this to figure out the ETA by dividing the remaining progress by this number.
However, while this gives you a good trending number, you have one other issue - jitter. If, due to natural variations, your rate of progress moves around a bit (it's noisy) - e.g. maybe you're using this to estimate file downloads - you'll notice that the noise can easily cause your ETA to jump around, especially if it's pretty far in the future (several minutes or more).
To avoid jitter from affecting your ETA too much, you want this average rate of change number to respond slowly to updates. One way to approach this is to keep around a cached value of averageProgressRate, and instead of instantly updating it to the trending number you've just calculated, you simulate it as a heavy physical object with mass, applying a simulated 'force' to slowly move it towards the trending number. With mass, it has a bit of inertia and is less likely to be affected by jitter.
Here's a rough sample:
// desiredAverageProgressRate is computed from the weighted average above
// m_averageProgressRate is a member variable also in progress units/sec
// lastTimeElapsed = the time delta in seconds (since last simulation)
// m_averageSpeed is a member variable in units/sec, used to hold the
// the velocity of m_averageProgressRate
const float frictionCoeff = 0.75f;
const float mass = 4.0f;
const float maxSpeedCoeff = 0.25f;
// lose 25% of our speed per sec, simulating friction
m_averageSeekSpeed *= pow(frictionCoeff, lastTimeElapsed);
float delta = desiredAvgProgressRate - m_averageProgressRate;
// update the velocity
float oldSpeed = m_averageSeekSpeed;
float accel = delta / mass;
m_averageSeekSpeed += accel * lastTimeElapsed; // v += at
// clamp the top speed to 25% of our current value
float sign = (m_averageSeekSpeed > 0.0f ? 1.0f : -1.0f);
float maxVal = m_averageProgressRate * maxSpeedCoeff;
if (fabs(m_averageSeekSpeed) > maxVal)
{
m_averageSeekSpeed = sign * maxVal;
}
// make sure they have the same sign
if ((m_averageSeekSpeed > 0.0f) == (delta > 0.0f))
{
float adjust = (oldSpeed + m_averageSeekSpeed) * 0.5f * lastTimeElapsed;
// don't overshoot.
if (fabs(adjust) > fabs(delta))
{
adjust = delta;
// apply damping
m_averageSeekSpeed *= 0.25f;
}
m_averageProgressRate += adjust;
}
Your question is a good one. If the problem can be broken up into discrete units having an accurate calculation often works best. Unfortunately this may not be the case even if you are installing 50 components each one might be 2% but one of them can be massive. One thing that I have had moderate success with is to clock the cpu and disk and give a decent estimate based on observational data. Knowing that certain check points are really point x allows you some opportunity to correct for environment factors (network, disk activity, CPU load). However this solution is not general in nature due to its reliance on observational data. Using ancillary data such as rpm file size helped me make my progress bars more accurate but they are never bullet proof.
Uniform averaging
The simplest approach would be to predict the remaining time linearly:
t_rem := t_spent ( n - prog ) / prog
where t_rem is the predicted ETA, t_spent is the time elapsed since the commencement of the operation, prog the number of microtasks completed out of their full quantity n. To explain—n may be the number of rows in a table to process or the number of files to copy.
This method having no parameters, one need not worry about the fine-tuning of the exponent of attenuation. The trade-off is poor adaptation to a changing progress rate because all samples have equal contribution to the estimate, whereas it is only meet that recent samples should be have more weight that old ones, which leads us to
Exponential smoothing of rate
in which the standard technique is to estimate progress rate by averaging previous point measurements:
rate := 1 / (n * dt); { rate equals normalized progress per unit time }
if prog = 1 then { if first microtask just completed }
rate_est := rate; { initialize the estimate }
else
begin
weight := Exp( - dt / DECAY_T );
rate_est := rate_est * weight + rate * (1.0 - weight);
t_rem := (1.0 - prog / n) / rate_est;
end;
where dt denotes the duration of the last completed microtask and is equal to the time passed since the previous progress update. Notice that weight is not a constant and must be adjusted according the length of time during which a certain rate was observed, because the longer we observed a certain speed the higher the exponential decay of the previous measurements. The constant DECAY_T denotes the length of time during which the weight of a sample decreases by a factor of e. SPWorley himself suggested a similar modification to gooli's proposal, although he applied it to the wrong term. An exponential average for equidistant measurements is:
Avg_e(n) = Avg_e(n-1) * alpha + m_n * (1 - alpha)
but what if the samples are not equidistant, as is the case with times in a typical progress bar? Take into account that alpha above is but an empirical quotient whose true value is:
alpha = Exp( - lambda * dt ),
where lambda is the parameter of the exponential window and dt the amount of change since the previous sample, which need not be time, but any linear and additive parameter. alpha is constant for equidistant measurements but varies with dt.
Mark that this method relies on a predefined time constant and is not scalable in time. In other words, if the exactly same process be uniformly slowed-down by a constant factor, this rate-based filter will become proportionally more sensitive to signal variations because at every step weight will be decreased. If we, however, desire a smoothing independent of the time scale, we should consider
Exponential smoothing of slowness
which is essentially the smoothing of rate turned upside down with the added simplification of a constant weight of because prog is growing by equidistant increments:
slowness := n * dt; { slowness is the amount of time per unity progress }
if prog = 1 then { if first microtask just completed }
slowness_est := slowness; { initialize the estimate }
else
begin
weight := Exp( - 1 / (n * DECAY_P ) );
slowness_est := slowness_est * weight + slowness * (1.0 - weight);
t_rem := (1.0 - prog / n) * slowness_est;
end;
The dimensionless constant DECAY_P denotes the normalized progress difference between two samples of which the weights are in the ratio of one to e. In other words, this constant determines the width of the smoothing window in progress domain, rather than in time domain. This technique is therefore independent of the time scale and has a constant spatial resolution.
Futher research: adaptive exponential smoothing
You are now equipped to try the various algorithms of adaptive exponential smoothing. Only remember to apply it to slowness rather than to rate.
I always wish these things would tell me a range. If it said, "This task will most likely be done in between 8 min and 30 minutes," then I have some idea of what kind of break to take. If it's bouncing all over the place, I'm tempted to watch it until it settles down, which is a big waste of time.
I have tried and simplified your "easy"/"wrong"/"OK" formula and it works best for me:
t / p - t
In Python:
>>> done=0.3; duration=10; "time left: %i" % (duration / done - duration)
'time left: 23'
That saves one op compared to (dur*(1-done)/done). And, in the edge case you describe, possibly ignoring the dialog for 30 minutes extra hardly matters after waiting all night.
Comparing this simple method to the one used by Transmission, I found it to be up to 72% more accurate.
I don't sweat it, it's a very small part of an application. I tell them what's going on, and let them go do something else.