Good results when training and cross-validating a model, but test data set shows poor results - validation

My problem is that I obtain a model with very good results during training and cross-validation, but when I test it with a different data set the results are poor.
I have a model that has been trained and tested with cross-validation. The model shows AUC=0.933, TPR=0.90 and FPR=0.04.
Judging by the plots of the learning curve (error), learning curve (score), and deviance curve, I assume there is no overfitting.
The problem is that when I test this model with a different test data set, I obtain poor results, nothing like my previous ones: AUC=0.52, TPR=0.165 and FPR=0.105.
I used a Gradient Boosting Classifier to train my model, with learning_rate=0.01, max_depth=12, max_features='auto', min_samples_leaf=3, n_estimators=750.
I used SMOTE to balance the classes. It is a binary model. I vectorized my categorical attributes. I used 75% of my data set to train and 25% to test. My model has a very low training error and a low test error, so I guess it is not overfitted. The training error is very low, so there should not be outliers in the training and CV-test data sets. What can I do from here to find the problem? Thanks.

If the process generating your datasets is non-stationary, it could cause the behavior you describe.
In that case, the distribution of the dataset you are using for testing was not represented in the data used for training.
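One quick way to check for such a shift is to compare the feature distributions of the data you trained and cross-validated on against the new test set. Below is a minimal sketch, assuming the numeric features live in two hypothetical pandas DataFrames with the same columns, using a two-sample Kolmogorov-Smirnov test per feature:

```python
# Sketch: flag features whose distribution differs between the original
# data and the new test set (a sign of non-stationarity / dataset shift).
# X_old and X_new are hypothetical DataFrames with identical numeric columns.
import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(X_old: pd.DataFrame, X_new: pd.DataFrame, alpha: float = 0.01):
    """Return columns whose two-sample KS test rejects 'same distribution'."""
    flagged = []
    for col in X_old.columns:
        stat, p_value = ks_2samp(X_old[col].dropna(), X_new[col].dropna())
        if p_value < alpha:
            flagged.append((col, stat, p_value))
    return flagged

# Hypothetical usage:
# for col, stat, p in drifted_features(X_train, X_new_test):
#     print(f"{col}: KS statistic {stat:.3f}, p-value {p:.2e}")
```

Features flagged this way would point to exactly the kind of distribution mismatch described above.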

Related

The value of K in KNN & the accuracy of the model

I am working with the KNN algorithm and have a few questions I need answered:
I tried different values of K, such as 3, 5, 7, and sqrt(n)=73, and I get different accuracies for these different values of K. Which K should I use in my model, and why?
What is the best percentage to use when splitting the dataset into train and test sets?
Why is the accuracy on the train set always greater than the accuracy on the test set?
Which accuracy (train accuracy or test accuracy) describes the overall model accuracy?
Choosing the value of K is not an exact science. In this post, user20160 explained a procedure for choosing a good K using k-fold cross-validation.
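As an illustration of that procedure, here is a minimal sketch using scikit-learn's GridSearchCV, where X and y are hypothetical feature and label arrays:

```python
# Sketch: choose K for KNN by k-fold cross-validation.
# X and y are hypothetical; scaling is included because KNN is distance-based.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 11, 21, 73]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best K:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
```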
Usually, 80/20 and 70/30 ratios are used, but once again, this is not an absolute rule. If your train set is too large relative to the test set, your model could overfit, which means it learns the specifics of the training set and will not perform well on real cases. On the other hand, if your train set is too small, your model could underfit.
The accuracy on the train set is often greater than on the test set because your model is trained only on the train set, while the test set contains cases your model has never seen before. It is like learning to ride a bike on one bike and being evaluated on a different bike.
Train accuracy is not a realistic measure of your model's performance, since the training data is already well known to the model. Test accuracy is more relevant because the test data is new and has never been seen by your model.
To better understand model evaluation, I strongly recommend you take a look at the cross-validation link above.

80-20 or 80-10-10 for training machine learning models?

I have a very basic question.
1) When is it recommended to hold out part of the data for validation, and when is it unnecessary? For example, when is it better to have an 80% training, 10% validation and 10% test split, and when is it enough to have a simple 80% training and 20% test split?
2) Also, can K-fold cross-validation be combined with the simple (training-testing) split?
I find it more valuable to have both a training and a validation set if I have a limited-size data set. The validation set is essentially a test set anyway. The reason is that you want your model to go from having high accuracy on the data it is trained on to also having high accuracy on data it has not seen before. The validation set allows you to determine whether that is the case. I generally take at least 10% of the data set and make it a validation set. It is important to select the validation data randomly so that its probability distribution matches that of the training set.

Next, I monitor the validation loss and save the model with the lowest validation loss. I also use an adjustable learning rate. Keras has two useful callbacks for this purpose, ModelCheckpoint and ReduceLROnPlateau. Documentation is here.

With a validation set you can monitor the validation loss during training and ascertain whether your model is training properly (training accuracy) and whether it is extrapolating properly (validation loss). The validation loss should, on average, decrease as the model accuracy increases. If the validation loss starts to increase while training accuracy is high, your model is overfitting and you can take remedial action such as including dropout layers, adding regularizers, or reducing your model's complexity. Documentation for that is here and here. To see why I use an adjustable learning rate, see the answer to a Stack Overflow question here.
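A minimal sketch of that callback setup, assuming a hypothetical compiled Keras model and training arrays x_train and y_train:

```python
# Sketch: save the best weights by validation loss and lower the learning
# rate when the validation loss plateaus. `model`, `x_train`, `y_train`
# are hypothetical and assumed to be defined and compiled already.
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    ModelCheckpoint("best_model.keras", monitor="val_loss", save_best_only=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, verbose=1),
]

history = model.fit(
    x_train, y_train,
    validation_split=0.1,  # Keras takes the last 10% of the arrays as validation
    epochs=50,
    callbacks=callbacks,
)
```

Note that validation_split takes the last fraction of the arrays rather than a random sample, so shuffle the data beforehand if you want the random validation selection described above.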

Which model to pick from K fold Cross Validation

I was reading about cross-validation and how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and go for 10-fold cross-validation. I think each of the 10 folds will give different coefficient values; from those 10 different sets, which should I pick as my final model or parameter estimates?
Or do we use cross-validation only for the purpose of finding an average error (the average over the 10 models in our case) and comparing it against another model?
If you build a linear regression model and go for 10-fold cross-validation, then indeed each of the 10 folds will give different coefficient values. The reason you use cross-validation is to get a robust idea of the error of your linear model, rather than evaluating it on one train/test split only, which could be unusually unlucky or lucky. CV is more robust because the ten splits cannot all be lucky or all unfortunate.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
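A minimal sketch of this workflow with scikit-learn, where X and y are hypothetical:

```python
# Sketch: use 10-fold CV only to estimate the error, then refit the final
# model on all of the training data to obtain one set of coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X, y, cv=10,
                         scoring="neg_mean_squared_error")
print("CV estimate of MSE:", -scores.mean(), "+/-", scores.std())

final_model = LinearRegression().fit(X, y)  # the final coefficients
print("Final coefficients:", final_model.coef_)
```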
Cross-validation is used to see how good your model's predictions are. It is a clever way of running multiple tests on the same data by splitting it, as you probably know (it is particularly useful if you don't have much training data).
As an example, it can be used to make sure you aren't overfitting. So basically you evaluate your finished model with cross-validation, and if you see that the error grows a lot somewhere, you go back and tweak the parameters.
Edit:
Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we repeatedly split the data into random training and validation partitions and evaluate the model's performance. It should be noted that the parameters of the model remain the same throughout the cross-validation process.
In grid search we try to find the parameters that give the best results over a specific split of the data (say 70% train and 30% test). So in this case, the data split remains constant across the different parameter combinations of the same model.
Read more about cross-validation here.
Cross-validation is mainly used for the comparison of different models.
For each model, you get the average generalization error over the k validation sets. Then you can choose the model with the lowest average generalization error as your optimal model.
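For example, a rough sketch comparing two candidate models this way (X and y are hypothetical, and the two models are just illustrative choices):

```python
# Sketch: compare candidate models by their average cross-validated score
# and keep the one with the better mean. X and y are hypothetical.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

candidates = {
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```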
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to find out which method (SVM, Random Forest, etc.) will perform best, and we can pick that method to work with further.
(For each of these methods, different models are generated and evaluated, an average metric is calculated per method, and the best average metric helps in selecting the method.)
After identifying the best method (or the best parameters), we can train/retrain our model on the full training dataset.
Parameters or coefficients can be determined by grid search techniques. See grid search.
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing. Dividing such a small amount of data into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV comes to the rescue here. In this case, we don't need a separate validation set, but we still need to hold out the test data.
A model is trained on k-1 folds of the training data and the remaining fold is used for validation. The mean and standard deviation of the metric are computed to see how well the model will perform in practice.
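A minimal sketch of that setup, holding out a test set once and cross-validating on the rest (X and y are hypothetical):

```python
# Sketch: with little data, hold out a test set once, then run k-fold CV
# on the remaining training data and report the mean and standard deviation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Only once model selection/tuning is finished, evaluate on the held-out test set.
print("Test accuracy:", model.fit(X_train, y_train).score(X_test, y_test))
```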

Keras «Powerful image classification with little data»: disparity between training and validation

I followed this post and first made it work on the «Cats vs dogs» dataset. Then I substituted this set with my own images, which show the presence vs the absence of a particular object. My dataset is even smaller than the one in the post: I only have 496 images containing that object for training and 160 images with that object for validation. For the «absent» class I have numerous samples (images without that object).
So far I haven't tried class_weight to tackle the imbalanced-data problem; I just randomly chose 496 and 160 images without that object for training and validation, respectively. Basically, I am doing a two-class image classification with a smaller dataset using the techniques in this post. I therefore expected somewhat worse performance due to the insufficient data, but the actual problem is that training does not converge, as shown in the figures.
Could you tell me possible reasons for this lack of convergence? I guess the problem is related to my dataset, since the model works perfectly for «cats vs dogs», but I don't know how to address it. Are there any good techniques to make it converge?
Thank you.
This performance plot is based on VGG16, keeping all layers up to the fully connected layer and training a small fully connected layer with 256 neurons.
This performance plot is also based on VGG16, but using 128 neurons instead of 256. I also set the number of epochs to 80.
Based on the suggestions provided so far, I'm thinking of building a customized convnet to fight the overfitting problem. But how should I do this? One of my worries is that a model with fewer layers will degrade the training performance. Are there any guidelines for designing a good model for little data? Thank you.
Updates:
Now I think I know half of the reason for the convergence problem. Actually, I only have 100+ images of my own; the rest were downloaded from Flickr. I thought that images with centered objects and better quality would help the model, but later I found that they did not improve the accuracy and even made the output class probabilities worse. After removing these downloaded images, the performance went up a little and the non-convergence is gone. Note that I now use only 64*2 images for training and 48*2 images for testing. I also found that image augmentation could not improve the performance on my dataset: without image augmentation, the training accuracy can reach 1, but with some image augmentation the training accuracy is only around 85%. Has anybody had a similar experience? Why doesn't data augmentation always work? Is it because of our specific dataset? Thank you very much.
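For reference, the kind of image augmentation referred to here looks roughly like the following in Keras; the parameter values and the data directory are illustrative assumptions, not the asker's actual settings:

```python
# Sketch: typical Keras image augmentation for a small two-class dataset.
# The directory layout and parameter values are hypothetical.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
train_generator = train_datagen.flow_from_directory(
    "data/train",            # hypothetical directory with one subfolder per class
    target_size=(150, 150),
    batch_size=16,
    class_mode="binary",
)
```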
Your model is working great, but it's "overfitting". It means it's capable of memorizing all your training data without really "thinking". That leads to great training results and bad test results.
Common ways to avoid overfitting are:
More data - If you have little data, the chance of overfitting increases
Fewer units/layers - make the model less capable, so it will stop memorizing and start thinking.
Add "dropouts" to your layers (something that randomly discards part of the activations to prevent the model from being too powerful); see the sketch after this list.
Do more layers mean more power and performance?
If by performance you mean capability of learning, yes. (If you mean "speed", no)
Yes, more layers mean more power. But too much power leads to overfitting: the model is so capable that it can memorize training data.
So there is an optimal point:
A model that is not very capable will not give you the proper results (both training and test results will be bad)
A model that is too capable will memorize the training data (excellent training results, but bad test results)
A balanced model will learn the right things (good training and test results)
That's exactly why we use test data: it is data that is not presented during training, so the model doesn't learn from it.

Validation of hurdle model?

I built a hurdle model, and then used that model to predict from known to unknown data points using the predict command. Is there a way to validate the model and these predictions? Do I have to do this in two parts, for example using sensitivity and specificity for the binomial part of the model?
Any other ideas for how to assess the validity of this model?
For validating predictive models, I usually trust Cross-Validation.
In short: with cross-validation you can measure the predictive performance of your model using only the training data (data with known results). Thus you can get a general idea of how well your model works. Cross-validation works quite well for a wide variety of models. The downside is that it can get quite computationally heavy.
With large data sets, 10-fold cross-validation is enough. The smaller your dataset, the more "folds" you need (with very small datasets, you may have to do leave-one-out cross-validation).
With cross-validation, you get predictions for the whole data set. You can then compare these predictions to the actual outputs and measure how well your model performed.
Cross-validated results can take some effort to interpret in more complicated comparisons, but for your general question of how to assess the validity of the model, the results should be quite easy to use.
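As a generic illustration of that idea (not a hurdle model specifically; X and y are hypothetical), scikit-learn's cross_val_predict returns an out-of-fold prediction for every observation, which you can then compare with the actual outputs:

```python
# Sketch: out-of-fold predictions for every observation via cross-validation.
# X and y are hypothetical (y as non-negative counts for this example model).
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict

model = PoissonRegressor()
preds = cross_val_predict(model, X, y, cv=10)
print("Out-of-fold MAE:", mean_absolute_error(y, preds))
```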
