Does the Distribution represented by the training data need to reflect the distribution of the test data and the data that you predict on? Can I measure the quality of the training data by looking at the distribution of each feature and compare that distribution to the data I am predicting or testing with? Ideally the training data should be sufficiently representative of the real world distribution.
Short answer: similar ranges would be a good idea.
Long answer: sometimes it won't be an issue (rarely) but let's examine when.
In an ideal situation, your model will capture the true phenomenon perfectly. Imagine the simplest case: the linear model y = x. If the training data are noiseless (or have tolerable noise), your linear regression will naturally land on a model approximately equal to y = x. The model will then generalize nearly perfectly even outside of the training range. If your training data were {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10}, the test point 500 will map nicely onto the function, returning 500.
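A minimal sketch of the y = x case, assuming NumPy and using np.polyfit as the linear regression (my choice for illustration; any least-squares fit would do). A line fitted to the ten noiseless training points extrapolates cleanly to 500:

```python
import numpy as np

# Noiseless training data drawn from y = x on the range 1..10
x_train = np.arange(1, 11, dtype=float)
y_train = x_train.copy()

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept]
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Extrapolate far outside the training range
prediction = slope * 500 + intercept
print(round(prediction, 6))
```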
In most modeling scenarios, this will almost certainly not be the case. If the training data are ample and the model is appropriately complex (and no more), you're golden.
The trouble is that few functions (and corresponding natural phenomena), especially nonlinear ones, extend to data outside of the training range so cleanly. Imagine sampling office temperature against employee comfort. If you only look at temperatures from 40 deg to 60 deg, a linear function will behave brilliantly in the training data. But if you test on 60 to 80, the mapping will break down, because comfort stops rising with temperature somewhere in that range. Here, the issue is confidence in your claim that the data are sufficiently representative.
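To make the temperature example concrete, here is a rough sketch. The comfort function below is purely hypothetical (a score that peaks at 70 deg); the point is only that a line fitted on 40 to 60 looks acceptable there and falls apart on 60 to 80:

```python
import numpy as np

def comfort(temp):
    # Hypothetical comfort score peaking at 70 degrees (an assumption for illustration)
    return -(temp - 70.0) ** 2

train_t = np.linspace(40, 60, 21)
test_t = np.linspace(60, 80, 21)

# Fit a straight line on the training range only
slope, intercept = np.polyfit(train_t, comfort(train_t), deg=1)

def linear(temp):
    return slope * temp + intercept

# Worst-case absolute error inside and outside the training range
train_err = np.max(np.abs(linear(train_t) - comfort(train_t)))
test_err = np.max(np.abs(linear(test_t) - comfort(test_t)))
print(train_err, test_err)
```

The error on the unseen range is roughly an order of magnitude larger than on the training range, even though nothing about the fitting procedure changed.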
Now let's consider noise. Imagine that you know EXACTLY what the real world function is: a sine wave. Better still, you are told its amplitude and phase. What you don't know is its frequency. You have a really solid sampling between 1 and 100, and the function you fit maps against the training data really well. Now if there is just enough noise, you might estimate the frequency incorrectly by a hair. When you test near the training range, the results aren't so bad. Outside of the training range, things start to get wonky. As you move further and further from the training range, the real function and the fitted function diverge and converge based on their relative frequencies. Sometimes the residuals are seemingly fine; sometimes they are dreadful.
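A small sketch of the sine-wave argument, assuming a true frequency of 1.0 and a fitted frequency that is off by 0.1% (both numbers are made up for illustration). It compares the worst residual inside the training range with the worst residual far outside it:

```python
import numpy as np

true_freq = 1.0
fitted_freq = 1.001  # hypothetical estimate, off by a hair because of noise

def max_residual(x):
    # Largest absolute gap between the real and the fitted sine wave on x
    return np.max(np.abs(np.sin(true_freq * x) - np.sin(fitted_freq * x)))

inside = max_residual(np.linspace(1, 100, 2000))         # training range
far_outside = max_residual(np.linspace(3000, 3200, 4000))  # far from training
print(inside, far_outside)
```

Inside the training range the two curves are nearly indistinguishable; far away they can be almost completely out of phase.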
There is an issue with your idea of examining the variable distributions: interaction between variables. Even if each variable is appropriately balanced in train and test, it is possible that the relationships between variables will differ (joint distributions). For a purely contrived example, consider you were predicting an individual's likelihood of being pregnant at any given time. In your training set, you had women aged 20 to 30 and men aged 30 to 40. In testing, you had the same percentage of men and women, but the age ranges were flipped. Independently, the variables look very nicely matched! But in your training set, you could very easily conclude, "only people under 30 get pregnant." Oddly enough, your testing set would demonstrate the exact opposite! The trouble is that your predictions are being made from a multivariate space, but the distributions you are thinking about are univariate. Considering the joint distributions of continuous variables against one another (and considering categorical variables appropriately) is, however, a good idea. Ideally, your fit model should have access to a similar range to your testing data.
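The pregnancy example can be written out in a few lines. This is a contrived sketch with made-up rows: the per-variable (marginal) counts match between train and test, but the joint counts do not.

```python
from collections import Counter

# (sex, age_band) rows; contrived data following the example above
train = [("F", "20-30")] * 50 + [("M", "30-40")] * 50
test = [("F", "30-40")] * 50 + [("M", "20-30")] * 50

def marginal(rows, column):
    # Count the values of a single column, ignoring the other one
    return Counter(row[column] for row in rows)

same_sex_marginal = marginal(train, 0) == marginal(test, 0)
same_age_marginal = marginal(train, 1) == marginal(test, 1)
same_joint = Counter(train) == Counter(test)
print(same_sex_marginal, same_age_marginal, same_joint)  # True True False
```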
Fundamentally, the question is about extrapolation from a limited training space. If the model fit in the training space generalizes, you can generalize; ultimately, it is usually safest to have a really well distributed training set to maximize the likelihood that you have captured the complexity of the underlying function.
Really interesting question! I hope the answer was somewhat insightful; I'll continue to build on it as resources come to mind! Let me know if any questions remain!
EDIT: a point made in the comments that I think should be read by future readers.
Ideally, training data should NEVER influence testing data in ANY way. That includes examining the distributions, joint distributions, etc. With sufficient data, distributions in the training data should converge on distributions in the testing data (think of the mean and the law of large numbers). Manipulation to match distributions (like z-scoring before the train/test split) fundamentally skews performance metrics in your favor. An appropriate technique for splitting train and test data would be something like stratified k-fold cross-validation.
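As a sketch of the z-scoring point, assuming NumPy and synthetic features: split first, estimate the mean and standard deviation from the training rows only, and reuse those numbers on the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features

# Split first ...
order = rng.permutation(len(X))
train_idx, test_idx = order[:80], order[80:]
X_train, X_test = X[train_idx], X[test_idx]

# ... then estimate the scaling parameters from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_z = (X_train - mean) / std
X_test_z = (X_test - mean) / std  # test rows reuse the training statistics
```

The scaled test set will generally not have mean exactly 0 and standard deviation exactly 1, and that is the honest outcome.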
Sorry for the delayed response. After going through a few months of iterating, I implemented and pushed the following solution to production and it is working quite well.
The issue here boils down to how one can reduce the variance of the training/test scores when performing cross-validation. This is important because if your variance is high, the confidence in picking the best model goes down. The more representative the test data is of the train data, the less variance you get in your test scores across the cross-validation folds. Stratified cross-validation tackles this issue, especially when there is significant class imbalance, by ensuring that the label class proportions are preserved across all test/train sets. However, this doesn't address the issue with the feature distribution.
In my case, I had a few features that were very strong predictors but also very skewed in their distribution. This caused significant variance in my test scores, which made it harder to pick a model with any confidence. Essentially, the solution is to ensure that the joint distribution of the label with the feature set is maintained across test/train sets. There are many ways of doing this, but a very simple approach is to take each column's bucket range (if continuous) or label (if categorical) one by one and sample from these buckets when generating the test and train sets. Note that the buckets quickly get very sparse, especially when you have a lot of categorical variables. Also, the column order in which you bucket affects the sampling output greatly. Below is a solution where I bucket the label first (the same as stratified CV) and then sample on one other feature (the most important feature, called score_percentage, which is known upfront).
import random

import numpy as np


def train_test_folds(self, label_column="label"):
    # Method of a class holding the feature DataFrame (self.fm), the labels
    # (self.label) and the number of folds (self.n_folds).
    # train_test is a list of [train, test] pairs of numpy index arrays, one per fold.
    # The final iterator would return these individual elements separately.
    n_folds = self.n_folds
    label_classes = np.unique(self.label)
    train_test = []
    fmpd_copy = self.fm.copy()
    fmpd_copy[label_column] = self.label
    # Keep the original row position in an "index" column, then order by the key feature
    fmpd_copy = fmpd_copy.reset_index(drop=True).reset_index()
    fmpd_copy = fmpd_copy.sort_values("score_percentage")
    for lbl in label_classes:
        fmpd_label = fmpd_copy[fmpd_copy[label_column] == lbl].copy()
        # Calculate the fold # using the label specific dataset
        if fmpd_label.shape[0] < n_folds:
            raise ValueError("n_folds=%d cannot be greater than the"
                             " number of rows in each class (%d)."
                             % (n_folds, fmpd_label.shape[0]))
        # let's get some variance -- shuffle within each bucket
        # let's go through the data set, shuffling items in buckets of size n_folds
        s = 0
        shuffle_array = fmpd_label["index"].values.copy()
        maxS = len(shuffle_array)
        while s < maxS:
            last = min(maxS, s + n_folds) - 1  # last position of this bucket
            for i in range(s, last):
                j = random.randint(i, last)
                if i < j:
                    shuffle_array[i], shuffle_array[j] = shuffle_array[j], shuffle_array[i]
            s = s + n_folds
        fmpd_label["index"] = shuffle_array
        # The new first column is the row position within this label
        fmpd_label = fmpd_label.reset_index(drop=True).reset_index()
        fmpd_label["test_set_number"] = fmpd_label.iloc[:, 0].apply(
            lambda x: x % n_folds)
        print("label ", lbl)
        for n in range(0, n_folds):
            test_set = fmpd_label[fmpd_label["test_set_number"]
                                  == n]["index"].values
            train_set = fmpd_label[fmpd_label["test_set_number"]
                                   != n]["index"].values
            print("for label ", lbl, " test size is ",
                  test_set.shape, " train size is ", train_set.shape)
            print("len of total size", len(train_test))
            if len(train_test) != n_folds:
                # Split doesn't exist. Add it in.
                train_test.append([train_set, test_set])
            else:
                # Append this label's rows to the existing fold
                temp_arr = train_test[n]
                temp_arr[0] = np.append(temp_arr[0], train_set)
                temp_arr[1] = np.append(temp_arr[1], test_set)
                train_test[n] = [temp_arr[0], temp_arr[1]]
    return train_test
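For readers who want to try the idea without the surrounding class, here is a stripped-down, self-contained sketch. It is not the production code above: the function name and the synthetic data are made up, and it skips the within-bucket shuffle. It only shows the core trick of sorting each class by the key feature and dealing rows out to folds round-robin.

```python
import numpy as np

def stratified_feature_folds(labels, feature, n_folds):
    """Return a fold number per row, balanced on label and on the sorted feature."""
    labels = np.asarray(labels)
    feature = np.asarray(feature)
    fold_of = np.empty(len(labels), dtype=int)
    for lbl in np.unique(labels):
        rows = np.flatnonzero(labels == lbl)
        # Order this class by the feature, then deal rows out round-robin
        rows = rows[np.argsort(feature[rows], kind="stable")]
        fold_of[rows] = np.arange(len(rows)) % n_folds
    return fold_of

rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 10)   # imbalanced synthetic labels
feature = rng.exponential(size=100)      # a skewed synthetic feature
fold_of = stratified_feature_folds(labels, feature, n_folds=5)

# Rows with fold_of == k form the test set of fold k; the rest are its train set
test_idx = np.flatnonzero(fold_of == 0)
train_idx = np.flatnonzero(fold_of != 0)
```

Because neighbouring rows in feature order land in different folds, every fold sees the full range of the skewed feature within each class.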
Over time, I realized that this whole issue falls under the umbrella of covariate shift, which is a well-studied area within machine learning. See the link below or just search for covariate shift. The concept is how to detect shift and ensure that your prediction data has a similar distribution to your training data. This is in the feature space, but in theory you could have label drift as well.
https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/
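One simple way to check a single feature for shift is a two-sample Kolmogorov-Smirnov statistic. The sketch below is my own NumPy-only illustration on synthetic data, not something taken from the linked article, and it only looks at one feature at a time, so it will not catch shifts that show up only in the joint distribution.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two 1-D samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=2000)
same_feature = rng.normal(0.0, 1.0, size=2000)     # drawn from the same distribution
shifted_feature = rng.normal(1.0, 1.0, size=2000)  # hypothetical drifted feature

no_shift = ks_statistic(train_feature, same_feature)
shift = ks_statistic(train_feature, shifted_feature)
print(no_shift, shift)
```

A value near 0 means the two samples look alike; a large value flags a feature worth a closer look.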
I would like to find the time instant at which a certain value is reached in a time-series data with noise. If there are no peaks in the data, I could do the following in MATLAB.
Code from here
% create example data
d=1:100;
t=d/100;
ts = timeseries(d,t);
% define threshold
thr = 55;
data = ts.data(:);
time = ts.time(:);
ind = find(data>thr,1,'first');
time(ind) %time where data>threshold
But when there is noise, I am not sure what has to be done.
In the time-series data plotted in the above image, I want to find the time instant at which the y-axis value 5 is reached. The data actually stabilizes to 5 at t >= 100 s. But due to the presence of noise in the data, we see a peak that reaches 5 somewhere around 20 s. I would like to know how to detect, e.g., 100 seconds as the right time and not 20 s. The code posted above will only give 20 s as the answer. I saw a post here that explains using a sliding window to find when the data equilibrates. However, I am not sure how to implement the same. Suggestions will be really helpful.
The sample data plotted in the above image can be found here
Suggestions on how to implement in Python or MATLAB code will be really helpful.
EDIT:
I don't want to capture when the peak (/noise/overshoot) occurs. I want to find the time when equilibrium is reached. For example, around 20 s the curve rises and dips below 5. After ~100 s the curve equilibrates to a steady-state value 5 and never dips or peaks.
Precise data analysis is a serious business (and my passion) that involves a lot of understanding of the system you are studying. Here are some comments. Unfortunately, I doubt there is a simple, nice answer to your problem at all; you will have to think about it. Data analysis basically always requires "discussion".
First to your data and problem in general:
When you talk about noise, in data analysis this means a statistical random fluctuation, most often Gaussian (sometimes other distributions, e.g. Poisson). Gaussian noise is a) random in each bin and b) symmetric in the negative and positive directions. Thus, what you observe in the peak at ~20s is not noise. It has very different, very systematic and extended characteristics compared to random noise. This is an "artifact" that must have an origin, but about which we can only speculate here. In real-world applications, studying and removing such artifacts is the most expensive and time-consuming task.
Looking at your data, the random noise is negligible. This is very precise data. For example, from ~150s onwards there are no visible random fluctuations up to the fourth decimal place.
Having concluded that this is not noise in the common sense, it could be at least two things: a) a feature of the system you are studying, thus something for which you could develop a model/formula and which you could "fit" to the data; b) a characteristic of limited bandwidth somewhere in the measurement chain, thus here a high-frequency cutoff. See e.g. https://en.wikipedia.org/wiki/Ringing_artifacts . Unfortunately, for both a and b there are no catch-all generic solutions. And your problem description (even with code and data) is not sufficient to propose an ideal approach.
After spending ~one hour on your data and making some plots, I believe (speculate) that the extremely sharp feature at ~10s cannot be a "physical" property of the data. It simply is too extreme/steep. Something fundamental happened here. A guess of mine could be that some device was just switched on (was off before). Thus, the data before is meaningless, and there is a short period of time afterwards to stabilize the system. There is not really an alternative in this scenario but to entirely discard the data until the system has stabilized at around 40s. This also makes your problem trivial. Just delete the first 40s, then the maximum becomes evident.
So what are the technical solutions you could use? Please don't be too upset that you have to think about this yourself and assemble the best possible solution for your case. I copied your data into two numpy arrays x and y and ran the following tests in Python:
Remove unstable time
This is the trivial solution -- I prefer it.
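import numpy as np
import matplotlib.pyplot as plt

plt.figure()
plt.xlabel('time')
plt.ylabel('signal')
plt.plot(x, y, label="original")
# copy first so the original y is left untouched for the later steps
y_cut = y.copy()
# zero out everything before t = 40 s (x holds the time stamps)
y_cut[x < 40] = 0
plt.plot(x, y_cut, label="cut 40s")
plt.legend()
plt.grid()
plt.show()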
Note: carry on reading below only if you are a bit crazy (about data).
Sliding window
You mentioned "sliding window" which is best suited for random noise (which you don't have) or periodic fluctuations (which you also don't really have). Sliding window just averages over consecutive bins, averaging out random fluctuations. Mathematically this is a convolution.
Technically, you can actually solve your problem like this (try even larger values of Nwindow yourself):
Nwindow=10
y_slide_10 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow=20
y_slide_20 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow=30
y_slide_30 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
plt.xlabel('time')
plt.ylabel('signal')
plt.plot(x,y, label="original")
plt.plot(x,y_slide_10, label="window=10")
plt.plot(x,y_slide_20, label='window=20')
plt.plot(x,y_slide_30, label='window=30')
plt.legend()
#plt.xscale('log') # useful
plt.grid()
plt.show()
Thus, technically you can succeed in suppressing the initial "hump". But don't forget this is a hand-tuned and not a general solution...
Another caveat of any sliding window solution: it always distorts your timing. Since you average over an interval in time, your convolved trace is shifted back or forth in time depending on whether the signal is rising or falling (slightly, but significantly). In your particular case this is not a problem since the main signal region has basically no time-dependence (very flat).
Frequency domain
This should be the silver bullet, but it also does not work well/easily for your example. The fact that this doesn't work better is the main hint to me that the first 40s of data are better discarded.... (i.e. in a scientific work)
You can use fast Fourier transform to inspect your data in frequency-domain.
import scipy.fft
y_fft = scipy.fft.rfft(y)
# original frequency domain plot (magnitude, since the FFT output is complex)
plt.plot(np.abs(y_fft), label="original")
plt.xlabel('frequency')
plt.ylabel('signal')
plt.yscale('log')
plt.show()
The structure in frequency represents the features of your data. The peak at zero is the stabilized region after ~100s; the humps are associated with (rapid) changes in time. You can now play around and change the frequency spectrum (--> filter), but I think the spectrum is so artificial that this doesn't yield great results here. Try it with other data and you may be very impressed! I tried two things: first, cutting high-frequency regions out (setting them to zero), and second, applying a sliding-window filter in the frequency domain (sparing the peak at 0, since this cannot be touched; try it and you will know why).
# cut high-frequency by setting to zero
y_fft_2 = np.array(y_fft)
y_fft_2[50:70] = 0
# sliding window in frequency
Nwindow = 15
Start = 10
y_fft_slide = np.array(y_fft)
y_fft_slide[Start:] = np.convolve(y_fft[Start:], np.ones((Nwindow,))/Nwindow, mode='same')
# frequency-domain plot (magnitudes, since the FFT output is complex)
plt.plot(np.abs(y_fft), label="original")
plt.plot(np.abs(y_fft_2), label="high-frequency, filter")
plt.plot(np.abs(y_fft_slide), label="frequency sliding window")
plt.xlabel('frequency')
plt.ylabel('signal')
plt.yscale('log')
plt.legend()
plt.show()
Converting this back into time-domain:
# reverse FFT into time-domain for plotting
y_filtered = scipy.fft.irfft(y_fft_2)
y_filtered_slide = scipy.fft.irfft(y_fft_slide)
# time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_filtered[:500], label="high-f filtered")
plt.plot(x[:500], y_filtered_slide[:500], label="frequency sliding window")
# plt.xscale('log') # useful
plt.grid()
plt.legend()
plt.show()
yields
There are apparent oscillations in those solutions which make them essentially useless for your purpose. This leads me to my final exercise: applying a sliding-window filter again, this time to the "frequency sliding window" time-domain trace.
# extra time-domain sliding window
Nwindow=90
y_fft_90 = np.convolve(y_filtered_slide, np.ones((Nwindow,))/Nwindow, mode='same')
# final time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_fft_90[:500], label="frequency-sliding window, slide")
# plt.xscale('log') # useful
plt.legend()
plt.show()
I am quite happy with this result, but it still has very small oscillations and thus does not solve your original problem.
Conclusion
How much fun. One hour well wasted. Maybe it is useful to someone. Maybe even to you, Natasha. Please don't be mad at me...
Let's assume your data is in data variable and time indices are in time. Then
import numpy as np
threshold = 0.025
stable_index = np.where(np.abs(data[-1] - data) > threshold)[0][-1] + 1
print('Stabilizes after', time[stable_index], 'sec')
Stabilizes after 96.6 sec
Here data[-1] - data is a difference between last value of data and all the data values. The assumption here is that the last value of data represents the equilibrium point.
np.where( * > threshold )[0] gives all the indices at which the absolute difference from the final value is greater than the threshold, that is, where the series has still not stabilized. We take only the last such index. The next one is where the time series is considered stabilized, hence the + 1.
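To see the difference from a plain "first crossing" search, here is a self-contained sketch on synthetic data. The curve is a made-up stand-in for the question's data (a slow rise towards 5 plus an early overshoot near 20 s), so the exact times it prints are not the ones from the real file.

```python
import numpy as np

# Synthetic stand-in for the question's data (an assumption, not the real file):
# a slow rise towards 5 plus an early overshoot around t = 20 s
time = np.arange(0, 200, 0.1)
data = 5 * (1 - np.exp(-time / 15)) + 1.5 * np.exp(-((time - 20) / 3) ** 2)

threshold = 0.025

# Naive approach: first sample that comes within the threshold of the final level
naive_index = np.where(data > data[-1] - threshold)[0][0]

# Backward-looking approach: one past the last sample still outside the tolerance band
stable_index = np.where(np.abs(data[-1] - data) > threshold)[0][-1] + 1

naive_time = time[naive_index]
stable_time = time[stable_index]
print('First crossing at', naive_time, 'sec; stabilizes after', stable_time, 'sec')
```

The first-crossing search is fooled by the overshoot, while the search from the end only reports the point after which the curve stays inside the tolerance band.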
If you're dealing with deterministic data which is eventually converging monotonically to some fixed value, the problem is pretty straightforward. Your last observation should be the closest to the limit, so you can define an acceptable tolerance threshold relative to that last data point and scan your data from back to front to find where you exceeded your threshold.
Things get a lot nastier once you add random noise into the picture, particularly if there is serial correlation. This problem is common in simulation modeling(see (*) below), and is known as the issue of initial bias. It was first identified by Conway in 1963, and has been an active area of research since then with no universally accepted definitive answer on how to deal with it. As with the deterministic case, the most widely accepted answers approach the problem starting from the right-hand side of the data set since this is where the data are most likely to be in steady state. Techniques based on this approach use the end of the dataset to establish some sort of statistical yardstick or baseline to measure where the data start looking significantly different as observations get added by moving towards the front of the dataset. This is greatly complicated by the presence of serial correlation.
If a time series is in steady state, in the sense of being covariance stationary then a simple average of the data is an unbiased estimate of its expected value, but the standard error of the estimated mean depends heavily on the serial correlation. The correct standard error squared is no longer s2/n, but instead it is (s2/n)*W where W is a properly weighted sum of the autocorrelation values. A method called MSER was developed in the 1990's, and avoids the issue of trying to correctly estimate W by trying to determine where the standard error is minimized. It treats W as a de-facto constant given a sufficiently large sample size, so if you consider the ratio of two standard error estimates the W's cancel out and the minimum occurs where s2/n is minimized. MSER proceeds as follows:
1. Starting from the end, calculate s2 for half of the data set to establish a baseline.
2. Now update the estimate of s2 one observation at a time using an efficient technique such as Welford's online algorithm, and calculate s2/n where n is the number of observations tallied so far. Track which value of n yields the smallest s2/n. Lather, rinse, repeat.
3. Once you've traversed the entire data set from back to front, the n which yielded the smallest s2/n is the number of observations from the end of the data set which are not detectable as being biased by the starting conditions.
Justification - with a sufficiently large baseline (half your data), s2/n should be relatively stable as long as the time series remains in steady state. Since n is monotonically increasing, s2/n should continue decreasing subject to the limitations of its variability as an estimate. However, once you start acquiring observations which are not in steady state the drift in mean and variance will inflate the numerator of s2/n. Hence the minimal value corresponds to the last observation where there was no indication of non-stationarity. More details can be found in this proceedings paper. A Ruby implementation is available on BitBucket.
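A rough Python sketch of the procedure described above, written from that description rather than ported from the Ruby implementation, so treat the function name and details as my assumptions. It walks the series from back to front with Welford's update and reports how many leading observations to discard. The test series is synthetic: white noise plus a decaying start-up bias.

```python
import numpy as np

def mser_truncation_index(data):
    """Number of leading observations to drop according to the s2/n criterion."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    baseline = max(n // 2, 2)          # only start comparing after half the data
    count, mean, m2 = 0, 0.0, 0.0      # Welford running statistics
    best_stat, best_count = np.inf, n
    for value in x[::-1]:              # traverse from the end towards the front
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
        if count >= baseline:
            stat = (m2 / (count - 1)) / count   # s2 / n for the last `count` points
            if stat < best_stat:
                best_stat, best_count = stat, count
    return n - best_count

rng = np.random.default_rng(0)
n_obs = 2000
series = rng.normal(size=n_obs) + 10 * np.exp(-np.arange(n_obs) / 30)
cut = mser_truncation_index(series)
print('Discard the first', cut, 'observations')
```

Everything from index cut onwards is what the criterion treats as steady state; on nearly noise-free data it can behave as described in the paragraph that follows.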
Your data has such a small amount of variation that MSER concludes that it is still converging to steady state. As such, I'd advise going with the deterministic approach outlined in the first paragraph. If you have noisy data in the future, I'd definitely suggest giving MSER a shot.
(*) - In a nutshell, a simulation model is a computer program and hence has to have its state set to some set of initial values. We generally don't know what the system state will look like in the long run, so we initialize it to an arbitrary but convenient set of values and then let the system "warm up". The problem is that the initial results of the simulation are not typical of the steady state behaviors, so including that data in your analyses will bias them. The solution is to remove the biased portion of the data, but how much should that be?
I want to predict a value. I have a time series as well as a bunch of other time series that may be interesting to use to augment the prediction.
Someone is arguing with me that it is the same thing to find the correlation between 2 non stationary time series and finding the correlation when making both stationary by some sort of differencing. Their logic is that a state space model doesn't care.
Isn't the whole idea of regression to exploit correlations to predict values? Doesn't there have to exist a correlation to incorporate an explanation of variance in the data and not increase the variance in the predictions? Also, I am 100% convinced that finding the correlation between two non stationary time series without doing anything is wrong.... And you'll end up with correlations to time and not the variables themselves.
Any input is helpful. Thanks.
It depends on the models you're employing later on. You say that there has to exist a correlation or else the variance in the predictions will increase. That might hold for some models. Instead, I'd recommend going for models that have some model selection built in.
Think of LASSO, for example, that gives sparse vectors for the coefficients. Or think of a model that allows you to calculate Variable Importance and base your decisions on that outcome.
Second, let's do some math:
Correlation original = E[X(t)*Y(t)]
Correlation differencing = E[(X(t)-X(t-1))*(Y(t)-Y(t-1))] = E[X(t)Y(t)] - E[X(t-1)Y(t)] + E[X(t-1)Y(t-1)] - E[X(t)Y(t-1)]
If you assume that one time series is not correlated with the other time-series previous sample, then this reduces to
= E[X(t)Y(t)] + E[X(t-1)Y(t-1)]
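A quick numerical check of the concern raised in the question, using made-up data: two series that share nothing but a time trend look almost perfectly correlated in levels, and essentially uncorrelated once differenced.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1000)

# Two series that share nothing but a time trend; the noise terms are independent
x = 1.0 * t + rng.normal(size=t.size)
y = 2.0 * t + rng.normal(size=t.size)

corr_levels = np.corrcoef(x, y)[0, 1]                   # dominated by the trend
corr_diffs = np.corrcoef(np.diff(x), np.diff(y))[0, 1]  # trend removed
print(corr_levels, corr_diffs)
```

So the two correlations are not the same thing: the first mostly measures the common dependence on time.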
I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified until none are misclassified anymore. I guess what I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the training data?
Think about a Perceptron as a young child. You want to teach a child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shapes) while telling it what it sees at every turn ("this is an apple, this is an orange"). Assuming the child has perfect memory, it will learn to understand what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shapes) without you actually telling it to. This is what a Perceptron does. After you have shown it all the examples, you start again at the beginning; this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new. A green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as before during training? Because the child has perfect memory, it will only tell you what you told it. You won't see how well it generalizes from known to unseen data unless you have separate test data that you never showed it during training. If the child has a horrible performance on the test data but a 100% performance on the training data, you will know that it has learned nothing. It is simply repeating what it was told during training: you trained it too long, and it only memorized your examples without understanding what makes an apple an apple because you gave it too many details. This is called overfitting. To prevent your Perceptron from only (!) recognizing training data you'll have to stop training at a reasonable time and find a good balance between the size of the training and testing set.
How do I know if a point is misclassified?
If it's different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading into Single/MultiLayer Perceptrons and how Neural Networks of multiple Perceptrons work). The network will take your input. How it's coded is irrelevant for this; let's say the input is a string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1).....}. The network will output either 1 or 0 for the input "apple1", and since you know the class beforehand you can check it. If it outputs 1, you compute (targetValue-actualValue) = (0-1) = -1. A nonzero result (+1 or -1) means that the network gave a wrong output. Compare this to the delta rule and you will understand that this small equation is part of the larger update equation. Whenever the result is nonzero you perform a weight update. If target and actual value are the same, you will always get a 0 and you know that the network didn't misclassify.
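Here is a minimal sketch of that update loop, assuming NumPy, a step activation, zero initial weights and a learning rate of 1 (all of these are my choices for illustration, not requirements of the homework). It also counts weight updates and passes over the training set, since the assignment asks for both.

```python
import numpy as np

def step(z):
    # Binary step activation: class 1 if the weighted sum is positive, else class 0
    return 1 if z > 0 else 0

def train_perceptron(X, targets, learning_rate=1.0, max_epochs=100):
    weights = np.zeros(X.shape[1])
    bias = 0.0
    updates = 0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, target in zip(X, targets):
            actual = step(np.dot(weights, x) + bias)
            error = target - actual  # 0 if correct, +1 or -1 if misclassified
            if error != 0:
                weights += learning_rate * error * x
                bias += learning_rate * error
                updates += 1
                errors += 1
        if errors == 0:  # a full pass without mistakes: stop
            break
    return weights, bias, updates, epoch

# Logical AND as a tiny, linearly separable training set
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 0, 0, 1])
weights, bias, updates, epochs = train_perceptron(X, targets)
predictions = [step(np.dot(weights, x) + bias) for x in X]
print(weights, bias, updates, epochs, predictions)
```

If the training points are not linearly separable the loop never reaches a mistake-free pass, which is why max_epochs is there as a safety stop.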
How do I go about choosing test points, training points, threshold or a bias?
Practically the bias and threshold isn't "chosen" per se. The bias is trained like any other unit using a simple "trick", namely using the bias as an additional input unit with value 1 - this means the actual bias value is encoded in this additional unit's weight and the algorithm we use will make sure it learns the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, the classification will occur as follows:
Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5 since that's exactly the middle of the range [0,1].
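As a sketch of how the bias trick and the threshold fit together (one common formulation, with hypothetical weights; the original formula may have been written differently): the input is extended with a constant 1, the last weight acts as the bias, and the weighted sum is compared against the threshold.

```python
import numpy as np

def classify(weights, point, threshold=0.5):
    # Append the constant input 1 so the last weight plays the role of the bias
    augmented = np.append(point, 1.0)
    activation = np.dot(weights, augmented)
    return 1 if activation >= threshold else 0

# Hypothetical weights for a 2-D problem: [w1, w2, bias_weight]
weights = np.array([1.0, 1.0, -1.0])
print(classify(weights, np.array([1.0, 1.0])))  # activation 1.0 -> class 1
print(classify(weights, np.array([0.0, 1.0])))  # activation 0.0 -> class 0
```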
Now to your last question about choosing training and test points: this is quite difficult, and you do it by experience. Where you're at, you start off by implementing simple logical functions like AND, OR, XOR etc. There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs: 00, 10, 01, 11). Keep in mind that XOR is not linearly separable, so a single perceptron cannot learn it. For complex data like images, audio etc. you'll have to try and tweak your data and features until you feel like the network can work with it as well as you want it to.
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
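Your training data has labels, which means that for each point in the training set, you know what class it belongs to. If the class the perceptron predicts for a point differs from its label, that point is misclassified.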
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.
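A short sketch of such a split with NumPy, on made-up 2-D points (the labelling rule is arbitrary and only there to have something to split):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(20, 2))                 # 20 synthetic 2-D points
labels = (points[:, 0] + points[:, 1] > 0).astype(int)    # arbitrary labelling rule

# Shuffle the row numbers, then cut at 80%
order = rng.permutation(len(points))
n_train = len(points) * 4 // 5
train_idx, test_idx = order[:n_train], order[n_train:]

train_points, train_labels = points[train_idx], labels[train_idx]
test_points, test_labels = points[test_idx], labels[test_idx]
print(len(train_points), len(test_points))  # 16 4
```

Shuffling before cutting matters when the data are stored in some order, for example all of one class first.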
I've always thought from what I read that cross validation is performed like this:
In k-fold cross-validation, the original sample is randomly
partitioned into k subsamples. Of the k subsamples, a single subsample
is retained as the validation data for testing the model, and the
remaining k − 1 subsamples are used as training data. The
cross-validation process is then repeated k times (the folds), with
each of the k subsamples used exactly once as the validation data. The
k results from the folds then can be averaged (or otherwise combined)
to produce a single estimation
So k models are built and the final one is the average of those.
The Weka guide says that each model is always built using ALL of the data set. So how does cross-validation in Weka work? Is the model built from all the data, and does "cross-validation" mean that k folds are created, each fold is evaluated, and the final output is simply the averaged result from the folds?
So, here is the scenario again: you have 100 labeled data
Use training set
weka will take 100 labeled data
it will apply an algorithm to build a classifier from these 100 data
it applies that classifier AGAIN on these 100 data
it provides you with the performance of the classifier (applied to the same 100 data from which it was developed)
Use 10 fold CV
Weka takes 100 labeled data
it produces 10 equal sized sets. Each set is divided into two groups: 90 labeled data are used for training and 10 labeled data are used for testing.
it produces a classifier with an algorithm from 90 labeled data and applies that on the 10 testing data for set 1.
It does the same thing for set 2 to 10 and produces 9 more classifiers
it averages the performance of the 10 classifiers produced from 10 equal sized (90 training and 10 testing) sets
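The bookkeeping of the 10-fold scenario can be sketched in a few lines of NumPy. This only illustrates how 100 rows end up in ten 90/10 splits; it is not Weka's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_folds = 100, 10

# Shuffle the row numbers and cut them into 10 non-overlapping folds
shuffled = rng.permutation(n_samples)
folds = np.array_split(shuffled, n_folds)

splits = []
for k in range(n_folds):
    test_idx = folds[k]  # fold k is held out for testing
    train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
    splits.append((train_idx, test_idx))

print([(len(tr), len(te)) for tr, te in splits])
```

Each row appears in exactly one test fold and in nine training sets.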
Let me know if that answers your question.
I would have answered in a comment but my reputation still doesn't allow me to:
In addition to Rushdi's accepted answer, I want to emphasize that the models which are created for the cross-validation fold sets are all discarded after the performance measurements have been carried out and averaged.
The resulting model is always based on the full training set, regardless of your test options. Since M-T-A was asking for an update to the quoted link, here it is: https://web.archive.org/web/20170519110106/http://list.waikato.ac.nz/pipermail/wekalist/2009-December/046633.html/. It's an answer from one of the WEKA maintainers, pointing out just what I wrote.
I think I figured it out. Take (for example) weka.classifiers.rules.OneR -x 10 -d outmodel.xxx. This does two things:
It creates a model based on the full dataset. This is the model that is written to outmodel.xxx. This model is not used as part of cross-validation.
Then cross-validation is run. cross-validation involves creating (in this case) 10 new models with the training and testing on segments of the data as has been described. The key is the models used in cross-validation are temporary and only used to generate statistics. They are not equivalent to, or used for the model that is given to the user.
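The "temporary models for statistics, one final model on everything" idea can be sketched like this. It is only an illustration with a toy majority-class "classifier" and made-up labels; none of these function names come from Weka, and unlike a real CV run the folds here are not shuffled or stratified:

```python
from collections import Counter

def fit_majority(labels):
    """Toy 'model': always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(model, labels):
    """Fraction of labels that the constant prediction gets right."""
    return sum(1 for y in labels if y == model) / len(labels)

def cross_validate_then_fit(labels, k=10):
    folds = [labels[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [y for m, fold in enumerate(folds) if m != i for y in fold]
        temp_model = fit_majority(train)  # temporary, used only for scoring
        scores.append(accuracy(temp_model, folds[i]))
    final_model = fit_majority(labels)    # separate fit on ALL the data
    return final_model, sum(scores) / k

labels = ["yes"] * 70 + ["no"] * 30       # made-up data
model, cv_accuracy = cross_validate_then_fit(labels)
```

The k temporary models never leave the loop; only their averaged score and the model fitted on the full data are returned.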
Weka follows the conventional k-fold cross-validation you mentioned here. You take the full data set and divide it into k equal-sized, non-overlapping sets (k1, k2, ..., k10 for 10-fold CV, for example). On the first run, take k1 to k9 as the training set and develop a model, then apply that model to k10 to get the performance. Next, take k1 to k8 plus k10 as the training set, develop a model from them, and apply it to k9. Continue this way through all the folds, so that each fold is used as the test set exactly once.
Then Weka averages the performances and presents that on the output pane.
Once we've done 10-fold cross-validation by dividing the data into 10 segments, building a decision tree on each training portion and evaluating it, Weka runs the algorithm an eleventh time on the whole dataset. That produces the classifier we might deploy in practice. We use 10-fold cross-validation to get an evaluation result and an estimate of the error, and then we run the learner one more time to get an actual classifier to use in practice.
Each fold of the cross-validation produces a different decision tree, but the final one is built on the whole dataset. CV is used to see whether we have an overfitting or high-variance problem.
According to "Data Mining with Weka" at The University of Waikato:
Cross-validation is a way of improving upon repeated holdout.
Cross-validation is a systematic way of doing repeated holdout that actually improves upon it by reducing the variance of the estimate.
We take a training set and we create a classifier
Then we’re looking to evaluate the performance of that classifier, and there’s a certain amount of variance in that evaluation, because it’s all statistical underneath.
We want to keep the variance in the estimate as low as possible.
Cross-validation is a way of reducing the variance, and a variant on cross-validation called “stratified cross-validation” reduces it even further.
(In contrast to the “repeated holdout” method, in which we hold out 10% for testing and repeat that 10 times.)
So how does cross validation in Weka work ?:
With cross-validation, we divide our dataset just once, but we divide it into k pieces, for example 10 pieces. Then we take 9 of the pieces and use them for training, and use the last piece for testing. Then, with the same division, we take another 9 pieces for training and the held-out piece for testing. We do the whole thing 10 times, using a different segment for testing each time. In other words, we divide the dataset into 10 pieces, hold out each of these pieces in turn for testing, train on the rest, do the testing, and average the 10 results.
That would be 10-fold cross-validation. Divide the dataset into 10 parts (these are called “folds”);
hold out each part in turn;
and average the results.
So each data point in the dataset is used once for testing and 9 times for training.
That’s 10-fold cross-validation.
What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and putting the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable to results from different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing temperature cannot be compared to numbers generated with other data such as wind speed or precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the mean square error generates numbers in the hundreds of thousands, compared to the tens or hundreds generated for temperature.
I think the mean square error metric might work for applications such as weather comparisons. It's easy to calculate and gives numbers that make sense.
Since you want to compare measurements over time, you can just leave missing values out of the calculation.
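Something along these lines, as a sketch. I'm assuming missing readings are stored as None and that the two series are aligned hour by hour; the sample numbers are invented:

```python
def mean_squared_error(a, b):
    """MSE over the positions where both series have a value (None = missing)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no overlapping observations")
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)

forecast = [50, 52, None, 58]  # invented hourly temperatures
history = [51, 50, 55, None]
print(mean_squared_error(forecast, history))  # 2.5
```

Lower is better, and the result is in squared units of whatever you measured, which is why scores for different kinds of data are not directly comparable.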
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
Use the Pearson correlation coefficient. I figured out how to calculate it in an SQL query, which can be found here: http://vanheusden.com/misc/pearson.php
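If you would rather compute it in application code than in SQL, a plain-Python sketch of the standard formula looks roughly like this (the function name is mine):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

The result is always between -1 and 1 regardless of the units, so scores for temperature and for much larger-valued data are on the same scale. The flip side is that it ignores offsets: a day that is uniformly 20 degrees warmer still correlates perfectly.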
In finance they use Beta to measure how two series of numbers move together. E.g., Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the two series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
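As a sketch of that covariance-over-variance formula (the function name and the return figures below are made up for illustration):

```python
def beta(asset_returns, index_returns):
    """Beta = Covariance(asset, index) / Variance(index)."""
    n = len(index_returns)
    if n != len(asset_returns) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_a = sum(asset_returns) / n
    mean_i = sum(index_returns) / n
    cov = sum((a - mean_a) * (i - mean_i)
              for a, i in zip(asset_returns, index_returns)) / n
    var = sum((i - mean_i) ** 2 for i in index_returns) / n
    return cov / var

index_returns = [0.01, -0.02, 0.03, 0.00]      # invented daily % moves
asset_returns = [2 * r for r in index_returns]  # moves twice as much as the index
print(beta(asset_returns, index_returns))       # approximately 2.0
```

Note that it takes percentage moves (returns) as input, not raw prices, which is what makes series with different scales comparable.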
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
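The 1.25 and 1.5 multipliers fall out of dividing the temperature span by each feature's span. A small sketch, using the ranges assumed above (the dictionary layout and function name are just for illustration):

```python
RANGES = {"temp": (-50, 100), "wind": (0, 120), "precip": (0, 100)}

def scale_factors(ranges, reference="temp"):
    """Multiplier per feature that stretches its span to the reference span."""
    ref_lo, ref_hi = ranges[reference]
    ref_span = ref_hi - ref_lo
    return {name: ref_span / (hi - lo) for name, (lo, hi) in ranges.items()}

factors = scale_factors(RANGES)
print(factors)  # {'temp': 1.0, 'wind': 1.25, 'precip': 1.5}
```

If you want to weigh a feature less, multiply its factor by something below 1 after this step.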
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
    distance = sqrt(
        pow(currentpoint.temp - historicalpoint.temp, 2) +
        pow(currentpoint.wind - historicalpoint.wind, 2) +
        pow(currentpoint.precip - historicalpoint.precip, 2))
    if distance is smaller than the largest distance in our match collection
        add historicalpoint to our match collection
        remove the match with the largest distance from our match collection
next
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
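For reference, the brute-force search can be written in a few lines of Python. This is a sketch assuming each historical day is a (label, (temp, wind, precip)) pair whose values have already been scaled; the sample days are invented, and `math.dist` needs Python 3.8 or later:

```python
import heapq
import math

def closest_days(current, history, n=3):
    """Return the n historical days nearest to current (Euclidean distance)."""
    return heapq.nsmallest(n, history, key=lambda day: math.dist(day[1], current))

history = [
    ("day_a", (70, 10, 0)),
    ("day_b", (30, 40, 20)),
    ("day_c", (68, 12, 1)),
    ("day_d", (90, 5, 0)),
]
current = (69, 11, 0)
best = closest_days(current, history, n=2)
print([label for label, _ in best])  # ['day_a', 'day_c']
```

`heapq.nsmallest` does the "keep only the closest matches" bookkeeping, and because `math.dist` works for any number of dimensions, adding a feature just means making the tuples longer.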
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all; it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situations where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
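A sketch of that tolerance count, assuming the two collections are aligned slot by slot (the function name is mine):

```python
def matching_slots(a, b, tolerance):
    """Count positions where the two series agree to within tolerance."""
    return sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)

print(matching_slots([50, 55, 60], [51, 60, 60.5], tolerance=1))  # 2
```

Dividing the count by the number of slots gives a score between 0 and 1, which stays comparable across data types as long as you pick a sensible tolerance for each one.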
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
    double c = correlation(historical_set, forecast_set);
    double avg_history = average(historical_set);
    double avg_forecast = average(forecast_set);
    // relative difference of the averages; assumes avg_forecast is positive
    double penalty = abs(avg_history - avg_forecast) / avg_forecast;
    return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" it with the percentage difference between the averages of the two sets of numbers.
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degrees F with 2000 km/h winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, say it's a windy, hot day: it might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3 km/h away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
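A rough sketch of the quantile idea: look up where each value falls in the sorted historical record, then compare days by the absolute difference of those quantiles. The function names and the tiny histories are invented for illustration:

```python
from bisect import bisect_right

def empirical_quantile(sorted_history, value):
    """Fraction of historical values that are <= value."""
    return bisect_right(sorted_history, value) / len(sorted_history)

def quantile_distance(day_a, day_b, history_by_feature):
    """Sum over features of the absolute difference in empirical quantiles."""
    total = 0.0
    for feature, sorted_history in history_by_feature.items():
        q_a = empirical_quantile(sorted_history, day_a[feature])
        q_b = empirical_quantile(sorted_history, day_b[feature])
        total += abs(q_a - q_b)
    return total

# each feature's history must be sorted ascending
history_by_feature = {"temp": [10, 20, 30, 40], "wind": [0, 5, 10, 15]}
hot_windy = {"temp": 30, "wind": 10}
other_day = {"temp": 20, "wind": 15}
print(empirical_quantile(history_by_feature["temp"], 30))         # 0.75
print(quantile_distance(hot_windy, other_day, history_by_feature))  # 0.5
```

Every feature contributes a number between 0 and 1, so temperature, wind, and much larger-valued data all end up on the same footing without hand-picked multipliers.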
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second, and then test that the slope is equal to 1 and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions, of the y-values against their indices. http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
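For the Kolmogorov-Smirnov comparison mentioned above, the test statistic itself is simple to compute: it is the largest gap between the two empirical cumulative distribution functions. A plain-Python sketch follows (it computes only the statistic, not a p-value, and the function name is mine):

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # fraction of the sample that is <= x
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    # the empirical CDFs only change at observed values, so checking those is enough
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)
```

It returns 0 when the two samples have identical distributions and 1 when they do not overlap at all, so it can double as a scale-free dissimilarity score for ranking.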
Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use the dot product to compute the similarity of two given vectors (i.e. sets of numbers).
You might need to normalize your vectors.
More: cosine similarity
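A sketch of the normalized dot product (cosine similarity) in plain Python, with a function name chosen for illustration:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Keep in mind that this only compares direction: [1, 2, 3] and [2, 4, 6] score as identical, so if absolute magnitude matters for your data you will want to combine it with something else.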