Hello my problem is more related with the validation of a model. I have done a program in netlogo that i'm gonna use in a report for my thesis but now the question is, how many repetitions (simulations) i need to do for justify my results? I already have read some methods using statistical approach and my colleagues have suggested me some nice mathematical operations, but i also want to know from people who works with computational models what kind of statistical test or mathematical method used to know that.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else. If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2 then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tishbiani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
specially the sections on resampling methods (Cross-Validation and bootstrap).
They also have a shorter book that covers the possible relevant methods to your case along with the commands in R to run this. However, this book, as a far as a I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, could perturb the initial conditions to see you the outcome doesn't change after small perturbations of the initial conditions or parameters. On a larger scale, sometimes you can break down the space of parameters with regard to final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variance Cv = s / u, here s and u are standard deviation and mean of the result respectively. It is explained in detail in this paper Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide scrupulous analyzing methods and refer to other papers which may be relevant to your question and your research.
Related
Problem Introduction
Assume we have a parallel algorithm f(<params>) running on P cores whereas
<params>: Parameters for algorithm
P: Number of cores it runs on (i.e. threads, cores, processors)
We further assume that out implementation actually consists of three parts:
A - Distribution: We distribute the input to all processors
B - Run the algorithm: We run f(<params>) ("on each processor")
C - Collection: We collect the computed data from all processors
After fixing <params> and P like input size, number of processors etc. the algorithm itself is deterministic i.e. we can write down an exact cDAG for it.
I'm now trying to answer the question: "For a given set of parameters, what is the execution time for a given system?"
With "given system" I mean e.g. "my computer" or "the university super computer" because obviously, the runtime does depend on the system it runs on and obviously the system itself does introduce non-determinism because you never really know the state of the system.
So in short: While the algorithm might be deterministic, runtime measurements aren't. (but e.g. communication measurements would be deterministic.) So we need to do a proper statistical analysis. And this is where I'm unsure.
Measuring Runtime: Basic idea
We are interested in how long part "B - Run the Algorithm" takes. Since the algorithm actually runs on P cores we'll make a measurement on each core and so get P values, let call those P values P_measurements. Some cores might finish before others, so which value does represent the runtime of the whole algorithm? I think a good choice is to simply take value of the core that took the longest i.e. max(P_measurements).
Now there are two things that need consideration here:
We have to repeat the measurement n times since it's a non-deterministic value
Once we have those n*P values, we need to know how to properly summarize them.
(And additional concern would be how to communicate those results in the end, but that's not part of this question.)
Measuring Runtime: Statistical Analysis
So here's what I'd do and this is also the part where I'm very unsure.
We measure the runtime of f(<params>) on each of the P cores. We get P_measurements
We take max(P_measurements)
We repeat 1. & 2. n times and we end up with maxes. Whereas maxes is a list of the n values max(P_measurements)
We check if maxesis normally distributed using a Q-Q-Plot. If not, we normalize. We do expect it to be right-skewed.
Now we take the median of maxes. (If we normalized, we use the normalized values)
We compute the standard deviation, the population mean and the 95% confidence interval.
We might want to say that all values are of an error of e.g. 5% so we check if all the values lie between +-5% of the population mean i.e. the confidence interval should be rather "thin".
We got ourselves some nice runtime measurement.
Clarifications:
Step 4. was necessary because computing the CI in step 6 uses the t-distribution and because later on I want to measure a different implementation of the same algorithm. So I'll have to compare two values and for that I need to do e.g. a t-test. So I need to make sure, the prerequisites for the t-test are met, which are: iid & normally distributed. Iid is assumed.
Question
I am very unsure what I did is statistically sound. Especially step 1-3. I'm not sure if I can do that kind of summarization (just take the max) here. I know that we might have an outsider value that's "especially" high but since we only measure on super computers we can assume the noise to be low and since we take the median in the end any outliners shouldn't have a big impact.
I hope for good input since it's a rather complex topic and I'm very interested in doing it right. I mostly followed the following paper, which I can recommend: http://spcl.inf.ethz.ch/Teaching/2020-dphpc/hoefler-scientific-benchmarking.pdf
But even with the paper, I'm not used to use statistical analysis and thus would just like to get some input from people who actually know this stuff. :)
I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly
separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and
also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified until none are misclassified anymore, I guess what I'm having trouble understanding is
What do I use the test data for and how does that relate to the
training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the
training data?
Think about a Perceptron as young child. You want to teach a child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shape) while telling it what it sees at every turn ("this is a an apple. this is an orange). Assuming the child has perfect memory, it will learn to understand what makes an apple an apple and an orange an orange if you show him enough examples. He will eventually start to use meta-features (like shapes) without you actually telling him. This is what a Perceptron does. After you showed him all examples, you start at the beginning, this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new. A green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as before during training? Because the child has perfect memory, it will only tell you what you told him. You won't see how good it generalizes from known to unseen data unless you have different training data that you never showed him during training. If the child has a horrible performance on the test data but a 100% performance on the training data, you will know that he has learned nothing - it's simply repeating what he has been told during training - you trained him too long, he only memorized your examples without understanding what makes an apple an apple because you gave him too many details - this is called overfitting. To prevent your Perceptron from only (!) recognizing training data you'll have to stop training at a reasonable time and find a good balance between the size of the training and testing set.
How do I know if a point is misclassified?
If it's different from what it should be. Let's say an apple has class 0 and an orange has 1 (here you should start reading into Single/MultiLayer Perceptrons and how Neural Networks of multiple Perceptrons work). The network will take your input. How it's coded is irrelevant for this, let's say input is a string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1).....}. Since you know the class beforehand, the network will either output 1 or 0 for the input "apple1". If it outputs 1, you perform (targetValue-actualValue) = (1-0) = 1. 1 in this case means that the network gives a wrong output. Compare this to the delta rule and you will understand that this small equation is part of the larger update equation. In case you get a 1 you will perform a weight update. If target and actual value are the same, you will always get a 0 and you know that the network didn't misclassify.
How do I go about choosing test points, training points, threshold or
a bias?
Practically the bias and threshold isn't "chosen" per se. The bias is trained like any other unit using a simple "trick", namely using the bias as an additional input unit with value 1 - this means the actual bias value is encoded in this additional unit's weight and the algorithm we use will make sure it learns the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, the classification will occur as follows:
Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5 since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: This is quite difficult, you do that by experience. Where you're at, you start off by implementing simple logical functions like AND, OR, XOR etc. There's it's trivial. You put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs 00, 10, 01, 11). For complex data like images, audio etc. you'll have to try and tweak your data and features until you feel like the network can work with it as good as you want it to.
What do I use the test data for and how does that relate to the training data?
Usually, to asses how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.
I got this interview question and need to write a function for it. I failed.
Because it is a phone interview question, I don't think what I am supposed to code really need to be perfect random tester.
Any ideas?
How to write some code to be a reasonable randomness tester within like 30 minutes during an interview?
edit
The distribution in this question is uniformly distributed
As this is an interview question, I think the interviewers are looking to assess in two ways:
Ability to understand what the requirements of the problem really are.
Ability to think of some code that would address those requirements.
This could be a really good interview question in certain settings, especially if the interviewer were willing to prompt the candidate with questions as and when necessary.
In terms of understanding the requirements of the question, it helps if you know that this is a really difficult problem, witness the Diehard tests mentioned in pjs's answer. Fundamentally I think a candidate would need to demonstrate appreciation of two things:
(a) The overall distribution of the numbers should match the desired distribution (I'm assuming it is uniform in this case, but as #pjs points out in comments this assumption should be made explicit).
(b) Each number drawn should be independent from the previous numbers drawn.
With half an hour to code something up in a phone interview, you can't go very far. If I were answering this question I would try to suggest something like:
(a) To test the distribution, come up with a set of equal-sized bins for the floating point numbers, and count the numbers that fall into each bin. Plot a histogram and eyeball it (plotting the data is always a good idea). To extend this, you could use a chi-squared test, as described in amit's answer.
However, as discussed in the comments, and here
The main problem with chi squared test is the choice of number and size of the intervals. Although rules of thumb can help produce good results, there is no panacea for all kinds of applications.
To this end, the Kolmogorov-Smirnov test can be used. The idea behind this test is that if you a plot of the ordered data should be a good fit against the perfect ordered data (known as the cumulative distribution). For a uniform distribution the perfect ordered data is a straight line: you expect the 10th percentile of the data to be 10% of the way through the range, the 20th to be 20% of the way through the range and so on. So, programmatically, you could sort the data, plot it against the ideal value and you should get a straight line. There is also a formal, quantitative statistical test you can apply, which is based on the differences between the actual and ideal values.
(b) To test independence, there are multiple approaches. Autocorrelation at various time lags is one fairly obvious one: to what extent is the value at time t similar to the value at time t+1, for example. The runs test is another nice one: you convert all the numbers into 1 or 0 depending on whether they fall above or below the median, and then the distribution of the length of runs can be used to construct a statistical test. The runs test can also be used to test for runs in one direction or another, as described here and here (this might be more useful in your case). Both of these have fairly straightforward implementations so long as you have the formulas to hand!
Apart from the diehard tests, other good sources discussing random number generators include here and here.
The way to check if a random number generator (or any other probability for that matter) is matching a desired model (in your case, uniform distribution) - you should use a statistical test, the Pearson's chi squared test.
The test is based on collecting observations, and matching them to the expected probability in according to the theoretic model you are assuming the numbers come from.
At the end, the test gives you the probability that the collected sample indeed came from the given model.
A simple example:
Given a cube, and the draws: [5,3,5,5,1,1] Is the cube balanced? (p=1/6 for each of {1,...,6})
Given the above observations we create the Expected vector: E = [1,1,1,1,1,1] (each entry is N/6 - 6 because this is the number of outcomes and N is the number of draws, 6 in the above example). And the Observed vector: O=[2,0,1,0,3,0]
From this we compute the statistic:
Xi^2 = sum((O_i - E_i)^2 / E_i) = 1/1 + 1/1 + 0/1 + 1/1 + 4/1 + 1/1 = 8
Now, we need to check what is the probability for P(Xi^2>=8), according to the chi^2 distribution (one degree of freedom). This probability is ~0.005 (a bit less..). So we can reject the hypothesis that the sample comes from unbiased cube with pretty high probability.
You're saying that they wanted you to recreate/reinvent the "diehard" battery of tests that it took Marsaglia many years to develop? I'd call them on unreasonable expectations.
Whatever distribution the random floats are suppposed to have, say uniform distribution over the interval [0,1], you can use the Kolmogorov-Smirnov test http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test to test to see if a sample does not follow the desired distribution. This can have advantages over chi-squared test if you have many possible values (because if you have more possible values than samples, then you have to define buckets for the chi-squared test, which makes the test less powerful compared to general distribution checking like Kolmogorov-Smirnov)
I have devised a test in order to compare the different running times of my sorting algorithm with Insertion sort, bubble sort, quick sort, selection sort, and shell sort. I have based my test using the test done in this website http://warp.povusers.org/SortComparison/index.html, but I modified my test a bit.
I set up a test manager program server which generates the data, and the test manager sends it to the clients that run the different algorithms, therefore they are sorting the same data to have no bias.
I noticed that the insertion sort, bubble sort, and selection sort algorithms really did run for a very long time (some more than 15 minutes) just to sort one given data for sizes of 100,000 and 1,000,000.
So I changed the number of runs per test case for those two data sizes. My original runs for the 100,000 was 500 but I reduced it to 15, and for 1,000,000 was 100 and I reduced it to 3.
Now my professor doubts the credibility as to why I've reduced it that much, but as I've observed the running time for sorting a specific data distribution varied only by a small percentage, which is why I still find it that even though I've reduced it to that much I'd still be able to approximate the average runtime for that specific test case of that algorithm.
My question now is, is my assumption wrong? Does the machine at times make significant running time changes (>50% changes), like say for example sorting the same data over and over if a first run would give it 0.3 milliseconds will the second run give as much difference as making it run for 1.5 seconds? Because from my observation, the running times don't vary largely given the same type of test distribution (e.g. completely random, completely sorted, completely reversed).
What you are looking for is a way to measure error in your experiments. My favorite book on subject is Error Analysis by Taylor and Chapter 4 has what you need which I'll summarize here.
You need to calculate Standard error of the mean or SDOM. First calculate mean and standard deviation (formulas are on Wikipedia and quite simple). Your SDOM is standard deviation divided by square root of number of measurements. Assuming your timings have Normal distribution (which it should), the twice the value of SDOM is a very common way to specify +/- error.
For example, let's say you run sorting algorithm 5 times and get following numbers: 5, 6, 7, 4, 5. Then mean is 5.4 and standard deviation is 1.1. Therefor SDOM is 1.1/sqrt(5) = 0.5. So 2*SDOM = 1. Now you can say that algorithm rum time was 5.4 ± 1. You professor can determine if this is acceptable error in measurement. Notice that as you take more readings, your SDOM, i.e. plus or minus error, goes down inversely proportional to square root of N. Twice of SDOM interval has 95% probability or confidence that the true value lies within the interval which is accepted standard.
Also you most likely want to measure performance by measuring CPU time instead of simple timer. Modern CPUs are too complex with various cache level and pipeline optimizations and you might end up getting less accurate measurement if you are using timer. More about CPU time is in this answer: How can I measure CPU time and wall clock time on both Linux/Windows?
It absolutely does. You need a variety of "random" samples in order to be able to draw proper conclusions about the population.
Look at it this way. It takes a long time to poll 100,000 people in the U.S. about their political stance. If we reduce the sample size to 100 people in order to complete it faster, we not only reduce the precision of our final result (2 decimal places rather than 5), we also introduce a larger chance that the members of the sample have a specific bias (there is a greater chance that 100 people out of 3xx,000,000 think the same way than 100,000 out of those same 3xx,000,000).
Your professor is right, however he's not provided the details that I mention some of them here :
Sampling issue: It's right that you generate some random numbers and feed them to your sorting methods, but with a few test cases indeed you're biased cause almost all of the random functions are biased to some extent (specially to the state of machine or time at the moment), so you should use more and more test cases to be more confident about the randomness.
Machine state: Suppose you've provide perfect data (fully representative of a uniform distribution), the performance of the electro-mechanical devises like computers may vary in different situations, so you should try for considerable times to smooth the effects of these phenomena.
Note : In advanced technical reports, you should provide a confidence coefficient for the answers you provide derived from statistical analysis, and proven step by step, but if you don't need to be that much exact, simply increase these :
The size of the data
The number of tests
tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + currAvg * (totalSize - sizeDone)
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of prediction is important, the way to go about about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
An excellent resource for studying statistical learning methods is The Elements of
Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing
noise (random variations) and other inaccuracies, and produce values
that tend to be closer to the true values of the measurements and
their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and then it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Other than statistics approach, one simple way to have a good estimation of the current speed while erasing some noise or spikes is to take a weighted approach.
You already experimented with the sliding window, the idea here is to take a fairly large sliding window, but instead of a plain average, giving more weight to more recent measures, since they are more indicative of the evolution (a bit like a derivative).
Example: Suppose you have 10 previous windows (most recent x0, least recent x9), then you could compute the speed:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + x9) / (10 * window-time) / 55
When you have a good assessment of the likely speed, then you are close to get a good estimated time.
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have demonstrated that users reacted very badly to slow-down and very positively to speed-up. Therefore, a good progress bar / estimated time should be conservative in the estimates presented (reserving time for a potential slow-down) at first.
A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, the cubic function produces the nice speed-up toward the completion time. Other functions could use an exponential form 1 - e^x, etc...