Hi I'm using the Maxent software 3.4.0 for Mac and I'm trying to understand an issue about k-fold Cross-validation.
Basically, I understood that my dataset is splitted in k folds and each fold more or less has the same size. Therefore, if my dataset has 100 observations, a 10-fold cross validation will split the dataset in 10 folds of 10 observations, and Maxent will train 10 models, each with 9 folds, and the 10th will test it.
My question is: can I split my dataset in more than 10 folds (e.g. 50 folds), BUT with 10 observations per fold?
In this case of course occurences would not be used once, but as many times as they appear in the different folds.
Can I do it (without the command line, that I don't know how to use it)? Could be the result meaningful?
The point of cross-validation is that each iteration your model is tested on observations that it has not been calibrated on. In your example, it will be inevitable that your validation folds will contain observations used in model calibration, inflating the cross-validated AUC.
What you could look at is using the bootstrapping option in Maxent. A question on cross validation and bootstrapping with Maxent was asked previously here FYI
https://gis.stackexchange.com/questions/366513/difference-between-bootstrap-and-cross-validation-maxent
I built a 3D image classification model with CNN for my research. I only have 5000 images and used 4500 images for training and 500 image for test set.
I tried different architectures and parameters for the training and
the F1 score and the accuracy on the training sets were as high as 0.9. It was fortunate that I didn't have to spend a lot of time to find these settings for the high accuracy.
Now I applied this model for the test set and I got a quite satisfying prediction with F1 score of 0.8~0.85.
My question here is, is it necessary to do validation? When I was taking a machine learning course back then, I was taught to use a validation set for tuning hyper parameters. One reason why I did not do k-fold cross validation is because I do not have much data and wanted to use as many training data as possible. And my model shows a quite good prediction on the test set. Can my model still convince people as long as the accuracy/f1 score/ROC are good enough? Or can I try to convince people only by doing k-fold cross validation without making and testing on a test set separately?
Thank you!
unfortunately i think that the single result won't be enough. This is due to the fact that your result could be just pure luck.
Using a 10 fold CV you use 90% of your data (4500 images) for training and the remaining 10% for testing. So basically you are not using less images in the training with the advantage of more reliable results.
The validation scheme proposed by Martin is already a good one but if you are looking for something more robust you should use a nested cross validation:
Split the data-set in K folds
The i-th training set is composed by {1,2,..,K} \ i folds.
Split the training set in N folds.
Set a hyper-parameter values grid
For each hyper-parameter set of values:
train on {1,2,..,N} \ j folds and test on the j-th fold;
Iterate for all the N folds and compute the average F-score.
Choose the set of hyper-parameters that maximize your metric.
Train the model using the i-th training set and the optimal set of hyper-parameters and test on the i-th fold.
Repeat for all the K folds and compute the average metrics.
The average metrics could be not sufficient to prove the stability of the method so it's advisable to provide also the confidence interval or the variance of the results.
Finally, to have a really stable validation of your method, you could consider to substitute the initial K-fold cross validation with a re-sampling procedure. Instead of splitting the data in K fold you resample the dataset at random using 90% of the samples as training and 10% of samples for testing. Repeat this M times with M>K. If the computation is fast enough you can consider to do this 20-50 or 100 times.
A cross validation dataset is used to adjust hyperparameters. You should never touch the test set, except when you are finished with everything!
As suggested in the comments, I recommend k-fold cross validation (e.g. k=10):
Split your dataset into k=10 sets
For i=1..10: Use sets {1, 2,..., 10} \ i as a training set (and to find the hyper parameters) and set i to evaluate.
Your final score is the average among those k=10 evaluation scores.
Does the Distribution represented by the training data need to reflect the distribution of the test data and the data that you predict on? Can I measure the quality of the training data by looking at the distribution of each feature and compare that distribution to the data I am predicting or testing with? Ideally the training data should be sufficiently representative of the real world distribution.
Short answer: similar ranges would be a good idea.
Long answer: sometimes it won't be an issue (rarely) but let's examine when.
In an ideal situation, your model will capture the true phenomenon perfectly. Imagine the simplest case: the linear model y = x. If the training data are noiseless (or have tolerable noise). Your linear regression will naturally land on a model approximately equal to y = x. The generalization of the model will work nearly perfect even outside of the training range. If your train data were {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10}. The test point 500, will nicely map onto the function, returning 500.
In most modeling scenarios, this will almost certainly not be the case. If the training data are ample and the model is appropriately complex (and no more), you're golden.
The trouble is that few functions (and corresponding natural phenomena) -- especially when we consider nonlinear functions -- extend to data outside of the training range so cleanly. Imagine sampling office temperature against employee comfort. If you only look at temperatures from 40 deg to 60 deg. A linear function will behave brilliantly in the training data. Oddly enough, if you test on 60 to 80, the mapping will break down. Here, the issue is confidence in your claim that the data are sufficiently representative.
Now let's consider noise. Imagine that you know EXACTLY what the real world function is: a sine wave. Better still, you are told its amplitude and phase. What you don't know is its frequency. You have a really solid sampling between 1 and 100, the function you fit maps against the training data really well. Now if there is just enough noise, you might estimate the frequency incorrectly by a hair. When you test near the training range, the results aren't so bad. Outside of the training range, things start to get wonky. As you move further and further from the training range, the real function and the function diverge and converge based on their relative frequencies. Sometimes, the residuals are seemingly fine; sometimes they are dreadful.
There is an issue with your idea of examining the variable distributions: interaction between variables. Even if each variable is appropriately balanced in train and test, it is possible that the relationships between variables will differ (joint distributions). For a purely contrived example, consider you were predicting an individual's likelihood of being pregnant at any given time. In your training set, you had women aged 20 to 30 and men aged 30 to 40. In testing, you had the same percentage of men and women, but the age ranges were flipped. Independently, the variables look very nicely matched! But in your training set, you could very easily conclude, "only people under 30 get pregnant." Oddly enough, your testing set would demonstrate the exact opposite! The trouble is that your predictions are being made from a multivariate space, but the distributions you are thinking about are univariate. Considering the joint distributions of continuous variables against one another (and considering categorical variables appropriately) is, however, a good idea. Ideally, your fit model should have access to a similar range to your testing data.
Fundamentally, the question is about extrapolation from a limited training space. If the model fit in the training space generalizes, you can generalize; ultimately, it is usually safest to have a really well distributed training set to maximize the likelihood that you have captured the complexity of the underlying function.
Really interesting question! I hope the answer was somewhat insightful; I'll continue to build on it as resources come to mind! Let me know if any questions remain!
EDIT: a point made in the comments that I think should be read by future readers.
Ideally, training data should NEVER influence testing data in ANY way. That includes examining of the distributions, joint distributions etc. With sufficient data, distributions in the training data should converge on distributions in the testing data (think the mean, law of large nums). Manipulation to match distributions (like z-scoring before train/test split) fundamentally skews performance metrics in your favor. An appropriate technique for splitting train and test data would be something like stratified k fold for cross validation.
Sorry for the delayed response. After going through a few months of iterating, I implemented and pushed the following solution to production and it is working quite well.
The issue here boils down to how can one reduce the training/test score variance when performing cross validation. This is important as if your variance is high, the confidence in picking the best model goes down. The more representative the test data is to the train data, the less variance you get in your test scores across the cross validation set. Stratified cross validation tackles this issue especially when there is significant class imbalance, by ensuring that the label class proportions are preserved across all test/train sets. However, this doesnt address the issue with the feature distribution.
In my case, I had a few features that were very strong predictors but also very skewed in their distribution. This caused significant variance in my test scores which made it harder to pick a model with any confidence. Essentially, the solution is to ensure that the joint distribution of the label with the feature set is maintained across test/train sets. Many ways of doing this but a very simple approach is to simply take each column bucket range (if continuous) or label (if categorical) one by one and sample from these buckets when generating the test and train sets. Note that the buckets quickly gets very sparse especially when you have a lot of categorical variables. Also, the column order in which you bucket affects the sampling output greatly. Below is a solution where I bucket the label first (same like stratified CV) and then sample 1 other feature (most important feature (called score_percentage) that is known upfront).
def train_test_folds(self, label_column="label"):
# train_test is an array of tuples where each tuple is a test numpy array and train numpy array pair.
# The final iterator would return these individual elements separately.
n_folds = self.n_folds
label_classes = np.unique(self.label)
train_test = []
fmpd_copy = self.fm.copy()
fmpd_copy[label_column] = self.label
fmpd_copy = fmpd_copy.reset_index(drop=True).reset_index()
fmpd_copy = fmpd_copy.sort_values("score_percentage")
for lbl in label_classes:
fmpd_label = fmpd_copy[fmpd_copy[label_column] == lbl]
# Calculate the fold # using the label specific dataset
if (fmpd_label.shape[0] < n_folds):
raise ValueError("n_folds=%d cannot be greater than the"
" number of rows in each class."
% (fmpd_label.shape[0]))
# let's get some variance -- shuffle within each buck
# let's go through the data set, shuffling items in buckets of size nFolds
s = 0
shuffle_array = fmpd_label["index"].values
maxS = len(shuffle_array)
while s < maxS:
max = min(maxS, s + n_folds) - 1
for i in range(s, max):
j = random.randint(i, max)
if i < j:
tempI = shuffle_array[i]
shuffle_array[i] = shuffle_array[j]
shuffle_array[j] = tempI
s = s + n_folds
# print("shuffle s =",s," max =",max, " maxS=",maxS)
fmpd_label["index"] = shuffle_array
fmpd_label = fmpd_label.reset_index(drop=True).reset_index()
fmpd_label["test_set_number"] = fmpd_label.iloc[:, 0].apply(
lambda x: x % n_folds)
print("label ", lbl)
for n in range(0, n_folds):
test_set = fmpd_label[fmpd_label["test_set_number"]
== n]["index"].values
train_set = fmpd_label[fmpd_label["test_set_number"]
!= n]["index"].values
print("for label ", lbl, " test size is ",
test_set.shape, " train size is ", train_set.shape)
print("len of total size", len(train_test))
if (len(train_test) != n_folds):
# Split doesnt exist. Add it in.
train_test.append([train_set, test_set])
else:
temp_arr = train_test[n]
temp_arr[0] = np.append(temp_arr[0], train_set)
temp_arr[1] = np.append(temp_arr[1], test_set)
train_test[n] = [temp_arr[0], temp_arr[1]]
return train_test
Over time, I realized that this whole issue falls under the umbrella of covariate shift which is a well studied area within machine learning. Link below or just search google for covariate shift. The concept is how to detect and ensure that your prediction data is of similar distribution with your training data. THis is in the feature space but in theory you could have label drift as well.
https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/
I'm trying to optimise the number of hidden units in my MLP.
I'm using k-fold cross validation, with 10 folds - 16200 training points and 1800 validation points in each fold.
When I run the network with hidden units varying from 1:10, I find the minimum error always occurs at 2 (NMSE of about 7).
3 is slightly higher (NMSE of about 11) and 4 or more hidden units and the error remains constant at about 14 or 15 regardless of many I add.
Why is this?
I find it hard to believe that overfitting is occurring, because of the very large amount of data points being used (with all 10 folds, that's 162000 training points, albeit each repeated 9 times).
Many thanks for any help or advice!
If the input is voltage and current, and question is about the power generated, then it's just P=V*I. Even if you have some noise, the relationship will be still linear. In this case simple linear model would do just fine - and would be far nicer to interpret! That's why simple ANN works best and more complex is overfitting, as it looks for non-linear relationships (which are not there, but it does whatever will minimise cost function).
To summarise, I would recommend to check a simple linear model. Also, since you have a lot of data points, make a 50-25-25 split for training, test and validation sets. Look at your cost function and see how it changes with error rate.
I used 10-fold cross validation in Weka.
I know this usually means that the data is split in 10 parts, 90% training, 10% test and that this is alternated 10 times.
I am wondering on what Weka calculates the resulting AUC. Is it the average of all 10 test sets? Or (and I hope this is true), does it use a holdout test set? I can't seem to find a description of this in the weka book.
Weka averages the test results. And this is a better approach then the holdout set, I don't understand why you would hope for such approach. If you hold out the test set (of what size?) your test would not be statisticaly significant, It would only say, that for best chosen parameters on the training data you achieved some score on arbitrary small part of data. The whole point of cross validation (as the evaluation technique) is to use all the data as training and as testing in turns, so the resulting metric is approximation of the expected value of the true evaluation measure. If you use the hold out test it would not converge to expected value (at least not in a reasonable time) and what is even more important - you would have to choose another constant (how big hold out set and why?) and reduce the number of samples used for training (while cross validation has been developed due to the problem with to small datasets for both training and testing).
I performed cross validation on my own (made my own random folds and created 10 classifiers) and checked the average AUC. I also checked to see if the entire dataset was used to report the AUC (similar as to when Weka outputs a decision tree under 10-fold).
The AUC for the credit dataset with a naive Bayes classifier as found by...
10-fold weka = 0.89559
10-fold mine = 0.89509
original train = 0.90281
There is a slight discrepancy between my average AUC and Weka's, but this could be from a failure in replicating the folds (although I did try to control the seeds).