Does the distribution of the training data need to reflect the distribution of the test data and of the data that you predict on? Can I measure the quality of the training data by looking at the distribution of each feature and comparing it to the distribution of the data I am predicting on or testing with? Ideally the training data should be sufficiently representative of the real-world distribution.
Short answer: similar ranges would be a good idea.
Long answer: sometimes it won't be an issue (rarely) but let's examine when.
In an ideal situation, your model will capture the true phenomenon perfectly. Imagine the simplest case: the linear model y = x. If the training data are noiseless (or have tolerable noise), your linear regression will naturally land on a model approximately equal to y = x. The model will generalize nearly perfectly, even outside of the training range. If your training data were {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10}, the test point 500 would map nicely onto the function, returning 500.
In most modeling scenarios, this will almost certainly not be the case. If the training data are ample and the model is appropriately complex (and no more), you're golden.
The trouble is that few functions (and corresponding natural phenomena) -- especially when we consider nonlinear functions -- extend to data outside of the training range so cleanly. Imagine sampling office temperature against employee comfort. If you only look at temperatures from 40 to 60 degrees, a linear function will behave brilliantly on the training data. Oddly enough, if you test on 60 to 80 degrees, the mapping will break down. Here, the issue is confidence in your claim that the data are sufficiently representative.
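To make the contrast concrete, here is a minimal sketch in plain Python. The first case is the y = x example from above. The second uses a made-up comfort curve that peaks at 70 degrees; that curve (`comfort`) is purely my assumption for illustration, not measured data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_abs_error(slope, intercept, xs, ys):
    """Largest absolute residual of the fitted line on the given points."""
    return max(abs(slope * x + intercept - y) for x, y in zip(xs, ys))


# Case 1: the true relationship really is y = x, so extrapolation works.
train_x = list(range(1, 11))
slope, intercept = fit_line(train_x, train_x)
prediction_at_500 = slope * 500 + intercept


# Case 2: a hypothetical comfort curve that peaks at 70 degrees.
def comfort(temp):
    return 100 - (temp - 70) ** 2 / 10


train_t = list(range(40, 61))   # temperatures seen in training
test_t = list(range(60, 81))    # temperatures only seen at test time
slope_c, intercept_c = fit_line(train_t, [comfort(t) for t in train_t])
error_train = max_abs_error(slope_c, intercept_c, train_t,
                            [comfort(t) for t in train_t])
error_test = max_abs_error(slope_c, intercept_c, test_t,
                           [comfort(t) for t in test_t])

print(prediction_at_500, error_train, error_test)
```

The line fitted on 40 to 60 keeps climbing past 70, while the assumed comfort curve turns back down, which is why the test error is so much larger than the training error.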
Now let's consider noise. Imagine that you know EXACTLY what the real world function is: a sine wave. Better still, you are told its amplitude and phase. What you don't know is its frequency. You have a really solid sampling between 1 and 100, and the function you fit maps against the training data really well. Now if there is just enough noise, you might estimate the frequency incorrectly by a hair. When you test near the training range, the results aren't so bad. Outside of the training range, things start to get wonky. As you move further and further from the training range, the real function and the fitted function diverge and converge based on their relative frequencies. Sometimes, the residuals are seemingly fine; sometimes they are dreadful.
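A rough sketch of the sine example, assuming a true angular frequency of 1.0 and a fitted frequency that is off by 0.1% (both numbers are made up for illustration):

```python
import math

TRUE_FREQ = 1.0        # hypothetical true angular frequency
FITTED_FREQ = 1.001    # hypothetical estimate, off by a hair


def max_abs_error(start, stop, step=0.01):
    """Largest gap between the true and the fitted sine on [start, stop]."""
    worst = 0.0
    n_steps = int(round((stop - start) / step))
    for k in range(n_steps + 1):
        x = start + k * step
        gap = abs(math.sin(TRUE_FREQ * x) - math.sin(FITTED_FREQ * x))
        worst = max(worst, gap)
    return worst


error_inside = max_abs_error(1, 100)      # the training range
error_far = max_abs_error(1000, 1100)     # far outside the training range

print(error_inside, error_far)
```

Inside the training range the two curves never drift far apart. Far outside it, the accumulated phase difference is large enough that the gap can approach the full amplitude of the wave.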
There is an issue with your idea of examining the variable distributions: interaction between variables. Even if each variable is appropriately balanced in train and test, it is possible that the relationships between variables will differ (joint distributions). For a purely contrived example, consider you were predicting an individual's likelihood of being pregnant at any given time. In your training set, you had women aged 20 to 30 and men aged 30 to 40. In testing, you had the same percentage of men and women, but the age ranges were flipped. Independently, the variables look very nicely matched! But in your training set, you could very easily conclude, "only people under 30 get pregnant." Oddly enough, your testing set would demonstrate the exact opposite! The trouble is that your predictions are being made from a multivariate space, but the distributions you are thinking about are univariate. Considering the joint distributions of continuous variables against one another (and considering categorical variables appropriately) is, however, a good idea. Ideally, your fit model should have access to a similar range to your testing data.
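The contrived example can be checked in a few lines. The rows below are invented to mirror it (50 people per group is an arbitrary choice): the per-variable counts match exactly, while the joint counts have nothing in common.

```python
from collections import Counter

# Hypothetical rows of (sex, age_band), mirroring the contrived example.
train = [("F", "20-30")] * 50 + [("M", "30-40")] * 50
test = [("F", "30-40")] * 50 + [("M", "20-30")] * 50


def marginals(rows):
    """Count each variable on its own (the univariate view)."""
    sexes = Counter(sex for sex, _ in rows)
    ages = Counter(age for _, age in rows)
    return sexes, ages


def joint(rows):
    """Count each (sex, age_band) combination (the multivariate view)."""
    return Counter(rows)


same_marginals = marginals(train) == marginals(test)
same_joint = joint(train) == joint(test)

print(same_marginals, same_joint)
```

Comparing feature by feature would report a perfect match here, which is exactly the trap.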
Fundamentally, the question is about extrapolation from a limited training space. If the model fit in the training space generalizes, you can generalize; ultimately, it is usually safest to have a really well distributed training set to maximize the likelihood that you have captured the complexity of the underlying function.
Really interesting question! I hope the answer was somewhat insightful; I'll continue to build on it as resources come to mind! Let me know if any questions remain!
EDIT: a point made in the comments that I think should be read by future readers.
Ideally, training data should NEVER influence testing data in ANY way. That includes examination of the distributions, joint distributions, etc. With sufficient data, distributions in the training data should converge on distributions in the testing data (think of the mean and the law of large numbers). Manipulation to match distributions (like z-scoring before the train/test split) fundamentally skews performance metrics in your favor. An appropriate technique for splitting train and test data would be something like stratified k-fold for cross validation.
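As a small illustration of the z-scoring point, the sketch below computes the scaling statistics from the training split only and then reuses them on the test split. The helper names `fit_scaler` and `apply_scaler` and the toy numbers are my own, not from any particular library.

```python
import statistics


def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training split only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def apply_scaler(values, mean, std):
    """Z-score any split using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 11.0]

mean, std = fit_scaler(train)            # the test split is never looked at here
train_z = apply_scaler(train, mean, std)
test_z = apply_scaler(test, mean, std)

print(train_z, test_z)
```

The test values come out far from zero, which is honest: they really are unlike the training data, and scaling both splits together would have hidden that.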
Sorry for the delayed response. After going through a few months of iterating, I implemented and pushed the following solution to production and it is working quite well.
The issue here boils down to how one can reduce the variance of the train/test scores when performing cross validation. This is important because if your variance is high, your confidence in picking the best model goes down. The more representative the test data is of the train data, the less variance you get in your test scores across the cross validation sets. Stratified cross validation tackles this issue, especially when there is significant class imbalance, by ensuring that the label class proportions are preserved across all test/train sets. However, this doesn't address the issue with the feature distribution.
In my case, I had a few features that were very strong predictors but also very skewed in their distribution. This caused significant variance in my test scores, which made it harder to pick a model with any confidence. Essentially, the solution is to ensure that the joint distribution of the label with the feature set is maintained across test/train sets. There are many ways of doing this, but a very simple approach is to take each column's bucket range (if continuous) or label (if categorical) one by one and sample from these buckets when generating the test and train sets. Note that the buckets quickly get very sparse, especially when you have a lot of categorical variables. Also, the column order in which you bucket affects the sampling output greatly. Below is a solution where I bucket the label first (the same as stratified CV) and then sample one other feature (the most important feature, called score_percentage, which is known upfront).
import random

import numpy as np


def train_test_folds(self, label_column="label"):
    # train_test is a list of [train_indices, test_indices] pairs, one per fold,
    # where each element of a pair is a numpy array of row positions.
    # The final iterator would return these individual elements separately.
    n_folds = self.n_folds
    label_classes = np.unique(self.label)
    train_test = []
    fmpd_copy = self.fm.copy()
    fmpd_copy[label_column] = self.label
    # Keep the original row position in an "index" column, then sort by the
    # most important feature so that neighbouring rows have similar values.
    fmpd_copy = fmpd_copy.reset_index(drop=True).reset_index()
    fmpd_copy = fmpd_copy.sort_values("score_percentage")
    for lbl in label_classes:
        # Work on a copy so the column assignment below does not hit a view.
        fmpd_label = fmpd_copy[fmpd_copy[label_column] == lbl].copy()
        # Calculate the fold # using the label specific dataset
        if fmpd_label.shape[0] < n_folds:
            raise ValueError("n_folds=%d cannot be greater than the"
                             " number of rows in each class."
                             % n_folds)
        # let's get some variance -- shuffle within each bucket
        # let's go through the data set, shuffling items in buckets of size n_folds
        s = 0
        # Copy the values so the in-place swaps act on a writable array.
        shuffle_array = fmpd_label["index"].to_numpy().copy()
        max_s = len(shuffle_array)
        while s < max_s:
            last = min(max_s, s + n_folds) - 1  # last position of this bucket
            for i in range(s, last):
                j = random.randint(i, last)
                if i < j:
                    temp_i = shuffle_array[i]
                    shuffle_array[i] = shuffle_array[j]
                    shuffle_array[j] = temp_i
            s = s + n_folds
            # print("shuffle s =", s, " last =", last, " max_s =", max_s)
        fmpd_label["index"] = shuffle_array
        # The new positional column becomes the first column of the frame.
        fmpd_label = fmpd_label.reset_index(drop=True).reset_index()
        fmpd_label["test_set_number"] = fmpd_label.iloc[:, 0].apply(
            lambda x: x % n_folds)
        print("label ", lbl)
        for n in range(0, n_folds):
            test_set = fmpd_label[fmpd_label["test_set_number"]
                                  == n]["index"].values
            train_set = fmpd_label[fmpd_label["test_set_number"]
                                   != n]["index"].values
            print("for label ", lbl, " test size is ",
                  test_set.shape, " train size is ", train_set.shape)
            print("len of total size", len(train_test))
            if len(train_test) != n_folds:
                # Split doesn't exist. Add it in.
                train_test.append([train_set, test_set])
            else:
                temp_arr = train_test[n]
                temp_arr[0] = np.append(temp_arr[0], train_set)
                temp_arr[1] = np.append(temp_arr[1], test_set)
                train_test[n] = [temp_arr[0], temp_arr[1]]
    return train_test
Over time, I realized that this whole issue falls under the umbrella of covariate shift, which is a well-studied area within machine learning. See the link below, or just search Google for covariate shift. The concept is how to detect whether, and ensure that, your prediction data has a similar distribution to your training data. This is in the feature space, but in theory you could have label drift as well.
https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/
Hello, my problem is more related to the validation of a model. I have written a program in NetLogo that I'm going to use in a report for my thesis, but now the question is: how many repetitions (simulations) do I need to run to justify my results? I have already read about some methods that use a statistical approach, and my colleagues have suggested some nice mathematical operations, but I also want to know from people who work with computational models what kind of statistical test or mathematical method they use to decide that.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2 then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
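Here is a minimal sketch of that calculation. The ten results are invented to match the 124.8 to 125.2 example, and `runs_needed` is a helper name of my own that simply inverts the same formula. Note that 1.96 is the large-sample normal value; with only a handful of runs a t-distribution value would be somewhat larger.

```python
import math
import statistics


def mean_confidence_interval(results, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    n = len(results)
    mean = statistics.fmean(results)
    std = statistics.stdev(results)          # sample standard deviation
    return mean, z * std / math.sqrt(n)


def runs_needed(std, target_half_width, z=1.96):
    """Smallest N for which z * std / sqrt(N) <= target_half_width."""
    return math.ceil((z * std / target_half_width) ** 2)


# Hypothetical outputs of 10 runs at one parameter setting.
runs = [124.8, 125.2, 125.0, 124.9, 125.1, 125.0, 124.8, 125.2, 125.1, 124.9]
mean, half_width = mean_confidence_interval(runs)

print(mean, half_width)
print(runs_needed(statistics.stdev(runs), target_half_width=0.05))
```

Because the half-width shrinks with sqrt(N), halving the interval costs roughly four times as many runs.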
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tibshirani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
especially the sections on resampling methods (cross-validation and bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the commands in R to run them. However, this book, as far as I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, you could perturb the initial conditions to check that the outcome doesn't change after small perturbations of the initial conditions or parameters. On a larger scale, sometimes you can break down the space of parameters with regard to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation Cv = s / u, where s and u are the standard deviation and mean of the results respectively. It is explained in detail in this paper Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide scrupulous analyzing methods and refer to other papers which may be relevant to your question and your research.
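As a rough sketch of the first point, one way to use Cv is to recompute it as runs accumulate and see where it settles. This is my reading of the idea, not necessarily the procedure in the papers, and `run_model` is a hypothetical stand-in for one simulation run.

```python
import random
import statistics


def coefficient_of_variation(results):
    """Cv = s / u: sample standard deviation divided by the mean."""
    return statistics.stdev(results) / statistics.fmean(results)


def run_model(rng):
    """Hypothetical stand-in for one simulation run."""
    return rng.gauss(125.0, 5.0)


rng = random.Random(42)
results = [run_model(rng) for _ in range(500)]

# Cv over the first n runs, for increasing n.
cv_by_n = {n: coefficient_of_variation(results[:n]) for n in (10, 50, 100, 500)}

for n, cv in cv_by_n.items():
    print(n, round(cv, 4))
```

If Cv barely moves between two successive sample sizes, adding more runs at that setting is unlikely to change the picture much.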
I got this interview question and need to write a function for it. I failed.
Because it is a phone interview question, I don't think what I am supposed to code really needs to be a perfect randomness tester.
Any ideas?
How to write some code to be a reasonable randomness tester within like 30 minutes during an interview?
edit
The distribution in this question is uniformly distributed
As this is an interview question, I think the interviewers are looking to assess in two ways:
Ability to understand what the requirements of the problem really are.
Ability to think of some code that would address those requirements.
This could be a really good interview question in certain settings, especially if the interviewer were willing to prompt the candidate with questions as and when necessary.
In terms of understanding the requirements of the question, it helps if you know that this is a really difficult problem, witness the Diehard tests mentioned in pjs's answer. Fundamentally I think a candidate would need to demonstrate appreciation of two things:
(a) The overall distribution of the numbers should match the desired distribution (I'm assuming it is uniform in this case, but as @pjs points out in comments this assumption should be made explicit).
(b) Each number drawn should be independent from the previous numbers drawn.
With half an hour to code something up in a phone interview, you can't go very far. If I were answering this question I would try to suggest something like:
(a) To test the distribution, come up with a set of equal-sized bins for the floating point numbers, and count the numbers that fall into each bin. Plot a histogram and eyeball it (plotting the data is always a good idea). To extend this, you could use a chi-squared test, as described in amit's answer.
However, as discussed in the comments, and here
The main problem with chi squared test is the choice of number and size of the intervals. Although rules of thumb can help produce good results, there is no panacea for all kinds of applications.
To this end, the Kolmogorov-Smirnov test can be used. The idea behind this test is that a plot of the ordered data should be a good fit against the perfect ordered data (known as the cumulative distribution). For a uniform distribution the perfect ordered data is a straight line: you expect the 10th percentile of the data to be 10% of the way through the range, the 20th to be 20% of the way through the range and so on. So, programmatically, you could sort the data, plot it against the ideal values and you should get a straight line. There is also a formal, quantitative statistical test you can apply, which is based on the differences between the actual and ideal values.
(b) To test independence, there are multiple approaches. Autocorrelation at various time lags is one fairly obvious one: to what extent is the value at time t similar to the value at time t+1, for example. The runs test is another nice one: you convert all the numbers into 1 or 0 depending on whether they fall above or below the median, and then the distribution of the length of runs can be used to construct a statistical test. The runs test can also be used to test for runs in one direction or another, as described here and here (this might be more useful in your case). Both of these have fairly straightforward implementations so long as you have the formulas to hand!
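For what it's worth, here is roughly what I would try to get down in half an hour: one check for (a) and two for (b), in plain Python. The function names are mine, and the cut-offs in `looks_random` are the usual large-sample 5% values, so treat it as a sketch rather than a rigorous tester.

```python
import math
import random
import statistics


def ks_distance_uniform(samples, low=0.0, high=1.0):
    """Largest gap between the empirical CDF and the Uniform(low, high) CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    worst = 0.0
    for i, value in enumerate(ordered):
        ideal = (value - low) / (high - low)
        # The empirical CDF jumps from i/n to (i+1)/n at this value.
        worst = max(worst, abs((i + 1) / n - ideal), abs(ideal - i / n))
    return worst


def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation between values[t] and values[t + lag]."""
    mean = statistics.fmean(values)
    var = sum((v - mean) ** 2 for v in values)
    cov = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(len(values) - lag))
    return cov / var


def runs_test_z(values):
    """Runs test above/below the median; returns an approximate z-score."""
    median = statistics.median(values)
    signs = [v > median for v in values if v != median]
    n_above = sum(signs)
    n_below = len(signs) - n_above
    n = n_above + n_below
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2 * n_above * n_below / n + 1
    variance = (2 * n_above * n_below * (2 * n_above * n_below - n)
                / (n ** 2 * (n - 1)))
    return (runs - expected) / math.sqrt(variance)


def looks_random(samples, low=0.0, high=1.0):
    """Rough pass/fail using conventional 5% cut-offs for each check."""
    n = len(samples)
    return (ks_distance_uniform(samples, low, high) < 1.36 / math.sqrt(n)
            and abs(lag_autocorrelation(samples)) < 1.96 / math.sqrt(n)
            and abs(runs_test_z(samples)) < 1.96)


rng = random.Random(0)
uniform = [rng.random() for _ in range(2000)]
skewed = [rng.random() ** 2 for _ in range(2000)]   # wrong distribution
alternating = [0.1, 0.9] * 1000                     # right spread, not independent

for name, data in [("uniform", uniform), ("skewed", skewed),
                   ("alternating", alternating)]:
    print(name, looks_random(data))
```

Keep in mind that each check has about a 5% false alarm rate on its own, so a genuinely good generator will occasionally fail one of them.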
Apart from the diehard tests, other good sources discussing random number generators include here and here.
The way to check whether a random number generator (or any other random process, for that matter) matches a desired model (in your case, the uniform distribution) is to use a statistical test such as Pearson's chi-squared test.
The test is based on collecting observations and comparing them to the expected counts according to the theoretical model you assume the numbers come from.
At the end, the test gives you a p-value: the probability of seeing a discrepancy at least as large as the observed one if the sample really came from the given model.
A simple example:
Given a die, and the draws: [5,3,5,5,1,1]. Is the die balanced? (p=1/6 for each of {1,...,6})
Given the above observations we create the Expected vector: E = [1,1,1,1,1,1] (each entry is N/6, where 6 is the number of outcomes and N is the number of draws, also 6 in the above example). And the Observed vector: O=[2,0,1,0,3,0]
From this we compute the statistic:
chi^2 = sum((O_i - E_i)^2 / E_i) = 1/1 + 1/1 + 0/1 + 1/1 + 4/1 + 1/1 = 8
Now we need the probability P(chi^2 >= 8) under the chi^2 distribution with 5 degrees of freedom (the number of outcomes minus one). This probability is about 0.16, so this sample does not let us reject the hypothesis that the die is unbiased at the usual 5% level. (With only 6 draws the expected counts are also too small for the chi^2 approximation to be reliable; in practice you would collect many more draws.)
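The statistic for the six draws above is easy to compute by hand or in a few lines of Python. The sketch below avoids any statistics library by comparing against the standard table value for 5 degrees of freedom (six outcomes minus one) at the 5% level; the constant name is my own.

```python
from collections import Counter

# Table value: chi-squared critical point for 5 degrees of freedom at 5%.
CRITICAL_5_DOF_5_PERCENT = 11.070


def chi_squared_statistic(draws, outcomes):
    """Pearson statistic against equal expected counts for every outcome."""
    counts = Counter(draws)
    expected = len(draws) / len(outcomes)
    return sum((counts.get(o, 0) - expected) ** 2 / expected for o in outcomes)


draws = [5, 3, 5, 5, 1, 1]
statistic = chi_squared_statistic(draws, outcomes=range(1, 7))
reject_fair = statistic > CRITICAL_5_DOF_5_PERCENT

print(statistic, reject_fair)
```

For a generator of floats you would first map each value to a bin number and then pass the bin numbers in as the draws, with the critical value looked up for (number of bins - 1) degrees of freedom.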
You're saying that they wanted you to recreate/reinvent the "diehard" battery of tests that it took Marsaglia many years to develop? I'd call them on unreasonable expectations.
Whatever distribution the random floats are supposed to have, say a uniform distribution over the interval [0,1], you can use the Kolmogorov-Smirnov test http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test to check whether a sample deviates from the desired distribution. This can have advantages over the chi-squared test if you have many possible values (because if you have more possible values than samples, you have to define buckets for the chi-squared test, which makes the test less powerful than a general distribution check like Kolmogorov-Smirnov).
I have used three point estimation for one of my projects.
The formula is
Three Point Estimate = (O + 4M + L) / 6
That means,
(Best Estimate + 4 x Most Likely Estimate + Worst Case Estimate) divided by 6
Here
dividing by 6 averages over the six weighted values (1 + 4 + 1)
and there is less chance of the worst case or the best case happening. In good faith, the most likely estimate (M) is what it will take to get the job done.
But I don't know why they use 4M. Why do they multiply by 4? Why not 5, 6, 7, etc.?
Why is the most likely estimate weighted four times as much as the other two values?
There is a derivation here:
http://www.deepfriedbrainproject.com/2010/07/magical-formula-of-pert.html
In case the link goes dead, I'll provide a summary here.
So, taking a step back from the question for a moment, the goal here is to come up with a single mean (average) figure that we can say is the expected figure for any given 3 point estimate. That is to say, If I was to attempt the project X times, and add up all the costs of the project attempts for a total of $Y, then I expect the cost of one attempt to be $Y/X. Note that this number may or may not be the same as the mode (most likely) outcome, depending on the probability distribution.
An expected outcome is useful because we can do things like add up a whole list of expected outcomes to create an expected outcome for the project, even if we calculated each individual expected outcome differently.
A mode on the other hand, is not even necessarily unique per estimate, so that's one reason that it may be less useful than an expected outcome. For example, every number from 1-6 is the "most likely" for a dice roll, but 3.5 is the (only) expected average outcome.
The rationale/research behind a 3 point estimate is that in many (most?) real-world scenarios, these numbers can be more accurately/intuitively estimated by people than a single expected value:
A pessimistic outcome (P)
An optimistic outcome (O)
The most likely outcome (M)
However, to convert these three numbers into an expected value we need a probability distribution that interpolates all the other (potentially infinite) possible outcomes beyond the 3 we produced.
The fact that we're even doing a 3-point estimate presumes that we don't have enough historical data to simply lookup/calculate the expected value for what we're about to do, so we probably don't know what the actual probability distribution for what we're estimating is.
The idea behind the PERT estimates is that if we don't know the actual curve, we can plug some sane defaults into a Beta distribution (which is basically just a curve we can customise into many different shapes) and use those defaults for every problem we might face. Of course, if we know the real distribution, or have reason to believe that default Beta distribution prescribed by PERT is wrong for the problem at hand, we should NOT use the PERT equations for our project.
The Beta distribution has two parameters A and B that set the shape of the left and right hand side of the curve respectively. Conveniently, we can calculate the mode, mean and standard deviation of a Beta distribution simply by knowing the minimum/maximum values of the curve, as well as A and B.
PERT sets A and B to the following for every project/estimate:
If M > (O + P) / 2 then A = 3 + √2 and B = 3 - √2, otherwise the values of A and B are swapped.
Now, it just so happens that if you make that specific assumption about the shape of your Beta distribution, the following formulas are exactly true:
Mean (expected value) = (O + 4M + P) / 6
Standard deviation = (P - O) / 6
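If you want to convince yourself, the check below plugs A = 3 + √2 and B = 3 - √2 into the textbook mode, mean and variance formulas for a Beta distribution stretched over [O, P]. The numbers 10 and 40 are arbitrary, and the function name is mine.

```python
import math


def pert_beta_summary(optimistic, pessimistic, mode_above_midpoint=True):
    """Mode, mean and standard deviation of the PERT Beta on [O, P]."""
    a, b = 3 + math.sqrt(2), 3 - math.sqrt(2)
    if not mode_above_midpoint:      # otherwise the two shapes are swapped
        a, b = b, a
    width = pessimistic - optimistic
    mode = optimistic + width * (a - 1) / (a + b - 2)
    mean = optimistic + width * a / (a + b)
    variance = width ** 2 * a * b / ((a + b) ** 2 * (a + b + 1))
    return mode, mean, math.sqrt(variance)


opt, pess = 10.0, 40.0               # hypothetical O and P
mode, mean, std = pert_beta_summary(opt, pess)

print(mean, (opt + 4 * mode + pess) / 6)   # the two means agree
print(std, (pess - opt) / 6)               # so do the standard deviations
```

The mean works out because A + B = 6, and the standard deviation works out because A x B = 7 cancels the (A + B + 1) = 7 in the variance formula.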
So, in summary
The PERT formulas are not based on a normal distribution, they are based on a Beta distribution with a very specific shape
If your project's probability distribution matches the PERT Beta distribution then the PERT formula are exactly correct, they are not approximations
It is pretty unlikely that the specific curve chosen for PERT matches any given arbitrary project, and so the PERT formulas will be an approximation in practise
If you don't know anything about the probability distribution of your estimate, you may as well leverage PERT as it's documented, understood by many people and relatively easy to use
If you know something about the probability distribution of your estimate that suggests something about PERT is inappropriate (like the 4x weighting towards the mode), then don't use it, use whatever you think is appropriate instead
The reason why you multiply by 4 to get the Mean (and not 5, 6, 7, etc.) is because the number 4 is tied to the shape of the underlying probability curve
Of course, PERT could have been based off a Beta distribution that yields 5, 6, 7 or any other number when calculating the Mean, or even a normal distribution, or a uniform distribution, or pretty much any other probability curve, but I'd suggest that the question of why they chose the curve they did is out of scope for this answer and possibly quite open ended/subjective anyway
I dug into this once. I cleverly neglected to write down the trail, so this is from memory.
So far as I can make out, the standards documents got it from the textbooks. The textbooks got it from the original 1950s write-up in a statistics journal. The write-up in the journal was based on an internal report done by RAND as part of the overall work done to develop PERT for the Polaris program.
And that's where the trail goes cold. Nobody seems to have a firm idea of why they chose that formula. The best guess seems to be that it's based on a rough approximation of a normal distribution -- strictly, it's a triangular distribution. A lumpy bell curve, basically, that assumes that the "likely case" falls within 1 standard deviation of the true mean estimate.
4/6ths approximates 66.7%, which approximates 68%, which approximates the area under a normal distribution within one standard deviation of the mean.
All that being said, there are two problems:
It's essentially made up. There doesn't seem to be a firm basis for picking it. There's some Operational Research literature arguing for alternative distributions. In what universe are estimates normally distributed around the true outcome? I'd very much like to move there.
The accuracy-improving effect of the 3-point / PERT estimation method might be more about the breaking down of tasks into subtasks than from any particular formula. Psychologists studying what they call "the planning fallacy" have found that breaking down tasks -- "unpacking", in their terminology -- consistently improves estimates by making them higher and thus reducing inaccuracy. So perhaps the magic in PERT/3-point is the unpacking, not the formulae.
Isn't it a well working thumb-number?
The cone of uncertainty uses the factor 4 for the beginning phase of the project.
The book "Software Estimation" by Steve McConnell is based around the "cone of uncertainty" model and gives many "thumb-rules". However, every approximated number or thumb-rule is based on statistics from COCOMO or similar solid research, models or studies.
Ideally these factors for O, M and L are derived using historical data for other projects in the same company in the same environment. In other words, the company should have 4 projects completed within M estimate, 1 within O and 1 within L. If my company/team had got 1 project completed within original O estimate, 2 projects within M and 2 within L, I would use another formula - (O + 2M + 2L) / 5. Does it make sense?
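In code, the idea is just a weighted average where the weights are a parameter. The function name and the sample estimates (2, 4 and 12) are made up for illustration.

```python
def weighted_estimate(optimistic, most_likely, worst, weights=(1, 4, 1)):
    """Weighted three point estimate; the default weights give (O + 4M + L) / 6."""
    w_o, w_m, w_l = weights
    total = w_o * optimistic + w_m * most_likely + w_l * worst
    return total / sum(weights)


classic = weighted_estimate(2, 4, 12)                      # (O + 4M + L) / 6
from_history = weighted_estimate(2, 4, 12, (1, 2, 2))      # (O + 2M + 2L) / 5

print(classic, from_history)
```

Shifting weight from M towards L pulls the estimate up, which is what you would want if your own history shows more projects landing near the worst case.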
The cone of uncertainty was referenced above ... it's a well-known foundational element used in agile estimation practices.
What's the problem with it though? Doesn't it look too symmetrical - as if it's not natural, not really based on real data?
If you ever thought that, then you're right. The cone of uncertainty shown in the picture above is made up based on probabilities ... not actual raw data from real projects (but most of the time it's used as such).
Laurent Bossavit wrote a book and also gave a presentation where he presented his research on how that cone came to be (and other 'facts' we often believe in software engineering):
The Leprechauns of Software Engineering
https://www.amazon.com/Leprechauns-Software-Engineering-Laurent-Bossavit/dp/2954745509/
https://www.youtube.com/watch?v=0AkoddPeuxw
Is there some real data to support a cone of uncertainty? The closest he was able to find was a cone that can go up to 10x in the positive Y direction (so we can be up to 10 times off on our estimation in terms of the project taking 10 times as long in the end).
Hardly anybody estimates a project that ends up finishing 4 times earlier ... or ... gasp ... 10 times earlier.
For reasons I'd rather not go into, I need to filter a set of values to reduce jitter. To that end, I need to be able to average a list of numbers, with the most recent having the greatest effect, and the least recent having the smallest effect. I'm using a sample size of 10, but that could easily change at some point.
Are there any reasonably simple aging algorithms that I can apply here?
Have a look at the exponential smoothing. Fairly simple, and might be sufficient for your needs. Basically recent observations are given relatively more weight than the older ones.
Also (depending on the application) you may want to look at various reinforcement learning techniques, for example Q-Learning or TD-Learning or generally speaking any method involving the discount.
I ran into something similar in an embedded control application.
The simplest option that I came across was a 3/4 filter. This gets applied continuously over the entire data set:
current_value = (3*current_value + new_value)/4
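A minimal sketch of that filter applied to a list, assuming the filter is seeded with the first sample (the seeding choice is mine; in an embedded loop you would just keep `current_value` between readings):

```python
def smooth(values, old_weight=3, new_weight=1):
    """Apply current = (3*current + new) / 4 across a sequence of readings."""
    total = old_weight + new_weight
    current = values[0]              # assumption: start from the first sample
    smoothed = [current]
    for new_value in values[1:]:
        current = (old_weight * current + new_weight * new_value) / total
        smoothed.append(current)
    return smoothed


jittery = [10, 14, 9, 15, 10, 14, 9, 15]
print(smooth(jittery))
```

Raising `old_weight` relative to `new_weight` gives heavier smoothing at the cost of a slower response to real changes.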
I eventually decided to go with a 16-tap FIR filter instead:
Overview
FIR FAQ
Wikipedia article
Many weighted averaging algorithms could be used.
For example, for items I(n) for n = 1 to N in sequence (newest to oldest):
SUM(I(n) * (N + 1 - n)) / SUM(n)
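Written out in Python, with the list ordered newest first as in the formula (the function name is mine):

```python
def aged_average(items_newest_first):
    """Linearly weighted mean: the newest item gets weight N, the oldest 1."""
    n_items = len(items_newest_first)
    weights = range(n_items, 0, -1)          # N, N-1, ..., 1
    weighted_sum = sum(v * w for v, w in zip(items_newest_first, weights))
    return weighted_sum / sum(weights)


print(aged_average([10, 0, 0]))   # a spike in the newest sample
print(aged_average([0, 0, 10]))   # the same spike in the oldest sample
```

The same spike moves the average three times as much when it is the newest of three samples as when it is the oldest.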
It's not exactly clear from the question whether you're dealing with fixed-length data or if data is continuously coming in. A nice physical model for the latter would be a low pass filter, using a capacitor and a resistor (R and C). Assuming your data is equidistantly spaced in time (is it?), this leads to an update prescription
U_aged[n+1] = U_aged[n] + (deltat/Tau) * (U_raw[n+1] - U_aged[n])
where Tau is the time constant of the filter. In the limit of zero deltat, this gives an exponential decay (old values will be reduced to 1/e of their value after time Tau). In an implementation, you only need to keep a running weighted sum U_aged. deltat would be 1 and Tau would specify the 'aging constant', the number of steps it takes to reduce a sample's contribution to 1/e.
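To see the 1/e behaviour, the sketch below feeds a single spike through the update rule and looks at what is left of it after Tau steps. Tau = 100 is an arbitrary choice, and seeding the filter with the first sample is my assumption.

```python
import math


def low_pass(raw, tau, deltat=1.0):
    """U_aged[n+1] = U_aged[n] + (deltat / tau) * (U_raw[n+1] - U_aged[n])."""
    aged = raw[0]                    # assumption: start from the first sample
    output = [aged]
    for sample in raw[1:]:
        aged = aged + (deltat / tau) * (sample - aged)
        output.append(aged)
    return output


TAU = 100
impulse = [1.0] + [0.0] * TAU        # one sample of 1, then nothing
response = low_pass(impulse, TAU)
remaining = response[TAU] / response[0]

print(remaining, math.exp(-1))
```

With deltat = 1 each step multiplies the old contribution by (1 - 1/Tau), so after Tau steps it is close to, but not exactly, 1/e; the match improves as Tau grows.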