What are some good approaches to predicting the completion time of a long process? - algorithm

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + currAvg * (totalSize - sizeDone)
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.

If you feel that the accuracy of prediction is important, the way to go about about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
An excellent resource for studying statistical learning methods is The Elements of
Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing
noise (random variations) and other inaccuracies, and produce values
that tend to be closer to the true values of the measurements and
their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.

Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and then it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.

I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.

There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Other than statistics approach, one simple way to have a good estimation of the current speed while erasing some noise or spikes is to take a weighted approach.
You already experimented with the sliding window, the idea here is to take a fairly large sliding window, but instead of a plain average, giving more weight to more recent measures, since they are more indicative of the evolution (a bit like a derivative).
Example: Suppose you have 10 previous windows (most recent x0, least recent x9), then you could compute the speed:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + x9) / (10 * window-time) / 55
When you have a good assessment of the likely speed, then you are close to get a good estimated time.
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have demonstrated that users reacted very badly to slow-down and very positively to speed-up. Therefore, a good progress bar / estimated time should be conservative in the estimates presented (reserving time for a potential slow-down) at first.
A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, the cubic function produces the nice speed-up toward the completion time. Other functions could use an exponential form 1 - e^x, etc...

Related

Rapid change detection algorithm

I'm logging temperature values in a room, saving them to the database. I'd like to be alerted when temperature rises suddenly. I can't set fixed values, because 18°C is acceptable in winter and 25°C is acceptable in summer. But if it jumps from 20°C to 25°C during, let's say, 30 minutes and stays like this for 5 minutes (to eliminate false readouts), I'd like to be informed.
My current idea is to take readouts from last 30 minutes (A) and readouts from last 5 minutes (B), calculate median of A and B and check if difference between them is less then my desired threshold.
Is this correct way to solve this or is there a better algorithm? I searched for a specific one but most of them seem overcomplicated.
Thanks!
Detecting changes in a time-series is a well-researched subject, and hundreds if not thousands of papers have been written on this subject. As you've seen many methods are quite advanced, but proved to be quite useful for many use cases. Whatever method you choose, you should evaluate it against real of simulated data, and optimize its parameters for your use case.
As you require, let me suggest a very simple method that in many cases prove to be good enough, and is quite similar to that you considered.
Basically, you have two concerns:
Detecting a monotonous change in a sampled noisy signal
Ignoring false readouts
First, note that medians are not commonly used for detecting trends. For the series (1,2,3,30,35,3,2,1) the medians of 5 consecutive terms is be (3, 3, 3, 3). It is much more common to use averages.
One common trick is to throw the extreme values before averaging (e.g. for each 7 values average only the middle 5). If many false readouts are expected - try to take measurements at a faster rate, and throw more extreme values (e.g. for each 13 values average the middle 9).
Also, you should throw away unfeasible values and replace them with the last measured value (unfeasible means out of range, or non-physical change rate).
Your idea of comparing a short-period measure with a long-period measure is a good idea, and indeed it is commonly used (e.g. in econometrics).
Quoting from "Financial Econometric Models - Some Contributions to the Field [Nicolau, 2007]:
Buy and sell signals are generated by two moving averages of the price
level: a long-period average and a short-period average. A typical
moving average trading rule prescribes a buy (sell) when the
short-period moving average crosses the long-period moving average
from below (above) (i.e. when the original time series is rising
(falling) relatively fast).
When you say "rises suddenly," mathematically you are talking about the magnitude of the derivative of the temperature signal.
There is a nice algorithm to simultaneously smooth a signal and calculate its derivative called the Savitzky–Golay filter. It's explained with examples on Wikipedia, or you can use Matlab to help you generate the convolution coefficients required. Once you have the coefficients the calculation is very simple.

How many simulations need to do?

Hello my problem is more related with the validation of a model. I have done a program in netlogo that i'm gonna use in a report for my thesis but now the question is, how many repetitions (simulations) i need to do for justify my results? I already have read some methods using statistical approach and my colleagues have suggested me some nice mathematical operations, but i also want to know from people who works with computational models what kind of statistical test or mathematical method used to know that.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else. If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2 then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tishbiani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
specially the sections on resampling methods (Cross-Validation and bootstrap).
They also have a shorter book that covers the possible relevant methods to your case along with the commands in R to run this. However, this book, as a far as a I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, could perturb the initial conditions to see you the outcome doesn't change after small perturbations of the initial conditions or parameters. On a larger scale, sometimes you can break down the space of parameters with regard to final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variance Cv = s / u, here s and u are standard deviation and mean of the result respectively. It is explained in detail in this paper Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide scrupulous analyzing methods and refer to other papers which may be relevant to your question and your research.

Comparison of Sorting Algorithms using running time in terms of seconds

I have devised a test in order to compare the different running times of my sorting algorithm with Insertion sort, bubble sort, quick sort, selection sort, and shell sort. I have based my test using the test done in this website http://warp.povusers.org/SortComparison/index.html, but I modified my test a bit.
I set up a test manager program server which generates the data, and the test manager sends it to the clients that run the different algorithms, therefore they are sorting the same data to have no bias.
I noticed that the insertion sort, bubble sort, and selection sort algorithms really did run for a very long time (some more than 15 minutes) just to sort one given data for sizes of 100,000 and 1,000,000.
So I changed the number of runs per test case for those two data sizes. My original runs for the 100,000 was 500 but I reduced it to 15, and for 1,000,000 was 100 and I reduced it to 3.
Now my professor doubts the credibility as to why I've reduced it that much, but as I've observed the running time for sorting a specific data distribution varied only by a small percentage, which is why I still find it that even though I've reduced it to that much I'd still be able to approximate the average runtime for that specific test case of that algorithm.
My question now is, is my assumption wrong? Does the machine at times make significant running time changes (>50% changes), like say for example sorting the same data over and over if a first run would give it 0.3 milliseconds will the second run give as much difference as making it run for 1.5 seconds? Because from my observation, the running times don't vary largely given the same type of test distribution (e.g. completely random, completely sorted, completely reversed).
What you are looking for is a way to measure error in your experiments. My favorite book on subject is Error Analysis by Taylor and Chapter 4 has what you need which I'll summarize here.
You need to calculate Standard error of the mean or SDOM. First calculate mean and standard deviation (formulas are on Wikipedia and quite simple). Your SDOM is standard deviation divided by square root of number of measurements. Assuming your timings have Normal distribution (which it should), the twice the value of SDOM is a very common way to specify +/- error.
For example, let's say you run sorting algorithm 5 times and get following numbers: 5, 6, 7, 4, 5. Then mean is 5.4 and standard deviation is 1.1. Therefor SDOM is 1.1/sqrt(5) = 0.5. So 2*SDOM = 1. Now you can say that algorithm rum time was 5.4 ± 1. You professor can determine if this is acceptable error in measurement. Notice that as you take more readings, your SDOM, i.e. plus or minus error, goes down inversely proportional to square root of N. Twice of SDOM interval has 95% probability or confidence that the true value lies within the interval which is accepted standard.
Also you most likely want to measure performance by measuring CPU time instead of simple timer. Modern CPUs are too complex with various cache level and pipeline optimizations and you might end up getting less accurate measurement if you are using timer. More about CPU time is in this answer: How can I measure CPU time and wall clock time on both Linux/Windows?
It absolutely does. You need a variety of "random" samples in order to be able to draw proper conclusions about the population.
Look at it this way. It takes a long time to poll 100,000 people in the U.S. about their political stance. If we reduce the sample size to 100 people in order to complete it faster, we not only reduce the precision of our final result (2 decimal places rather than 5), we also introduce a larger chance that the members of the sample have a specific bias (there is a greater chance that 100 people out of 3xx,000,000 think the same way than 100,000 out of those same 3xx,000,000).
Your professor is right, however he's not provided the details that I mention some of them here :
Sampling issue: It's right that you generate some random numbers and feed them to your sorting methods, but with a few test cases indeed you're biased cause almost all of the random functions are biased to some extent (specially to the state of machine or time at the moment), so you should use more and more test cases to be more confident about the randomness.
Machine state: Suppose you've provide perfect data (fully representative of a uniform distribution), the performance of the electro-mechanical devises like computers may vary in different situations, so you should try for considerable times to smooth the effects of these phenomena.
Note : In advanced technical reports, you should provide a confidence coefficient for the answers you provide derived from statistical analysis, and proven step by step, but if you don't need to be that much exact, simply increase these :
The size of the data
The number of tests

Algorithm(s) for spotting anomalies ("spikes") in traffic data

I find myself needing to process network traffic captured with tcpdump. Reading the traffic is not hard, but what gets a bit tricky is spotting where there are "spikes" in the traffic. I'm mostly concerned with TCP SYN packets and what I want to do is find days where there's a sudden rise in the traffic for a given destination port. There's quite a bit of data to process (roughly one year).
What I've tried so far is to use an exponential moving average, this was good enough to let me get some interesting measures out, but comparing what I've seen with external data sources seems to be a bit too aggressive in flagging things as abnormal.
I've considered using a combination of the exponential moving average plus historical data (possibly from 7 days in the past, thinking that there ought to be a weekly cycle to what I am seeing), as some papers I've read seem to have managed to model resource usage that way with good success.
So, does anyone knows of a good method or somewhere to go and read up on this sort of thing.
The moving average I've been using looks roughly like:
avg = avg+0.96*(new-avg)
With avg being the EMA and new being the new measure. I have been experimenting with what thresholds to use, but found that a combination of "must be a given factor higher than the average prior to weighing the new value in" and "must be at least 3 higher" to give the least bad result.
This is widely studied in intrusion detection literature. This is a seminal paper on the issue which shows, among other things, how to analyze tcpdump data to gain relevant insights.
This is the paper: http://www.usenix.org/publications/library/proceedings/sec98/full_papers/full_papers/lee/lee_html/lee.html here they use the RIPPER rule induction system, I guess you could replace that old one for something newer such as http://www.newty.de/pnc2/ or http://www.data-miner.com/rik.html
I would apply two low-pass filters to the data, one with a long time constant, T1, and one with a short time constant, T2. You would then look at the magnitude difference in output from these two filters and when it exceeds a certain threshold, K, then that would be a spike. The hardest part is tuning T1, T2 and K so that you don't get too many false positives and you don't miss any small spikes.
The following is a single pole IIR low-pass filter:
new = k * old + (1 - k) * new
The value of k determines the time constant and is usually close to 1.0 (but < 1.0 of course).
I am suggesting that you apply two such filters in parallel, with different time constants, e.g. start with say k = 0.9 for one (short time constant) and k = 0.99 for the other (long time constant) and then look at the magnitude difference in their outputs. The magnitude difference will be small most of the time, but will become large when there is a spike.

Understanding algorithms for measuring trends

What's the rationale behind the formula used in the hive_trend_mapper.py program of this Hadoop tutorial on calculating Wikipedia trends?
There are actually two components: a monthly trend and a daily trend. I'm going to focus on the daily trend, but similar questions apply to the monthly one.
In the daily trend, pageviews is an array of number of page views per day for this topic, one element per day, and total_pageviews is the sum of this array:
# pageviews for most recent day
y2 = pageviews[-1]
# pageviews for previous day
y1 = pageviews[-2]
# Simple baseline trend algorithm
slope = y2 - y1
trend = slope * log(1.0 +int(total_pageviews))
error = 1.0/sqrt(int(total_pageviews))
return trend, error
I know what it's doing superficially: it just looks at the change over the past day (slope), and scales this up to the log of 1+total_pageviews (log(1)==0, so this scaling factor is non-negative). It can be seen as treating the month's total pageviews as a weight, but tempered as it grows - this way, the total pageviews stop making a difference for things that are "popular enough," but at the same time big changes on insignificant don't get weighed as much.
But why do this? Why do we want to discount things that were initially unpopular? Shouldn't big deltas matter more for items that have a low constant popularity, and less for items that are already popular (for which the big deltas might fall well within a fraction of a standard deviation)? As a strawman, why not simply take y2-y1 and be done with it?
And what would the error be useful for? The tutorial doesn't really use it meaningfully again. Then again, it doesn't tell us how trend is used either - this is what's plotted in the end product, correct?
Where can I read up for a (preferably introductory) background on the theory here? Is there a name for this madness? Is this a textbook formula somewhere?
Thanks in advance for any answers (or discussion!).
As the in-line comment goes, this is a simple "baseline trend algorithm",
which basically means before you compare the trends of two different pages, you have to establish
a baseline. In many cases, the mean value is used, it's straightforward if you
plot the pageviews against the time axis. This method is widely used in monitoring
water quality, air pollutants, etc. to detect any significant changes w.r.t the baseline.
In OP's case, the slope of pageviews is weighted by the log of totalpageviews.
This sorta uses the totalpageviews as a baseline correction for the slope. As Simon put it, this puts a balance
between two pages with very different totalpageviews.
For exmaple, A has a slope 500 over 1000,000 total pageviews, B is 1000 over 1,000.
A log basically means 1000,000 is ONLY twice more important than 1,000 (rather than 1000 times).
If you only consider the slope, A is less popular than B.
But with a weight, now the measure of popularity of A is the same as B. I think it is quite intuitive:
though A's pageviews is only 500 pageviews, but that's because it's saturating, you still gotta give it enough credit.
As for the error, I believe it comes from the (relative) standard error, which has a factor 1/sqrt(n), where
n is the number of data points. In the code, the error is equal to (1/sqrt(n))*(1/sqrt(mean)).
It roughly translates into : the more data points, the more accurate the trend. I don't see
it is an exact math formula, just a brute trend analysis algorithm, anyway the relative
value is more important in this context.
In summary, I believe it's just an empirical formula. More advanced topics can be found in some biostatistics textbooks (very similar to monitoring the breakout of a flu or the like.)
The code implements statistics (in this case the "baseline trend"), you should educate yourself on that and everything becomes clearer. Wikibooks has a good instroduction.
The algorithm takes into account that new pages are by definition more unpopular than existing ones (because - for example - they are linked from relatively few other places) and suggests that those new pages will grow in popularity over time.
error is the error margin the system expects for its prognoses. The higher error is, the more unlikely the trend will continue as expected.
The reason for moderating the measure by the volume of clicks is not to penalise popular pages but to make sure that you can compare large and small changes with a single measure. If you just use y2 - y1 you will only ever see the click changes on large volume pages. What this is trying to express is "significant" change. 1000 clicks change if you attract 100 clicks is really significant. 1000 click change if you attract 100,000 is less so. What this formula is trying to do is make both of these visible.
Try it out at a few different scales in Excel, you'll get a good view of how it operates.
Hope that helps.
another way to look at it is this:
suppose your page and my page are made at same day, and ur page gets total views about ten million, and mine about 1 million till some point. then suppose the slope at some point is a million for me, and 0.5 million for you. if u just use slope, then i win, but ur page already had more views per day at that point, urs were having 5 million, and mine 1 million, so that a million on mine still makes it 2 million, and urs is 5.5 million for that day. so may be this scaling concept is to try to adjust the results to show that ur page is also good as a trend setter, and its slope is less but it already was more popular, but the scaling is only a log factor, so doesnt seem too problematic to me.

Resources