How to handle overfitting properly? - overfitting-underfitting

I am training a CNN regression model on a dataset of 34,560 samples. I already have a training error rate of less than 5%, but the validation error rate is over 60%, which looks like an overfitting problem. I tried the following four ways to address it, but none of them worked well:
Increase the dataset size
Reduce the model complexity
Add a dropout layer before the output layer
Use L2 regularization / weight decay
Probably I did not use them in the right way. Can someone give some details on these methods? Or are there other ways to solve the overfitting problem?
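As a starting point, the second, third, and fourth remedies can all live in one model definition. Below is a minimal sketch (assuming TensorFlow/Keras and illustrative 64x64 single-channel inputs; adjust the shapes, filter counts, dropout rate, and L2 strength to your data, since none of those values come from the question):

```python
# Hypothetical small CNN regressor combining reduced complexity,
# a Dropout layer before the output, and L2 weight decay.
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_regressor(l2=1e-4, drop=0.3):
    return models.Sequential([
        layers.Input(shape=(64, 64, 1)),          # illustrative input shape
        layers.Conv2D(16, 3, activation="relu",
                      kernel_regularizer=regularizers.l2(l2)),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu",
                      kernel_regularizer=regularizers.l2(l2)),
        layers.GlobalAveragePooling2D(),
        layers.Dense(32, activation="relu",
                     kernel_regularizer=regularizers.l2(l2)),
        layers.Dropout(drop),                     # active only during training
        layers.Dense(1),                          # linear output for regression
    ])

model = build_regressor()
model.compile(optimizer="adam", loss="mae")
```

A common pitfall is applying only one remedy at full strength; mild L2 plus moderate dropout on a smaller model often works better than any single remedy pushed hard.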

Related

80-20 or 80-10-10 for training machine learning models?

I have a very basic question.
1) When is it recommended to hold out part of the data for validation, and when is it unnecessary? For example, when is it better to have an 80% training / 10% validation / 10% testing split, and when is a simple 80% training / 20% testing split enough?
2) Also, does using K-fold cross-validation go with the simple split (training-testing)?
I find it more valuable to have a training and validation set if I have a limited-size dataset; the validation set is essentially a test set anyway. The reason is that you want a model with high accuracy on the data it was trained on to also have high accuracy on data it has not seen before, and the validation set lets you determine whether that is the case. I generally take at least 10% of the dataset and make it a validation set. It is important to select the validation data randomly so that its probability distribution matches that of the training set.

Next, I monitor the validation loss and save the model with the lowest validation loss. I also use an adjustable learning rate. Keras has two useful callbacks for this purpose, ModelCheckpoint and ReduceLROnPlateau. Documentation is here. With a validation set you can monitor the validation loss during training and ascertain whether your model is training properly (training accuracy) and extrapolating properly (validation loss). The validation loss should, on average, decrease as the model accuracy increases. If the validation loss starts to increase while training accuracy is high, your model is overfitting, and you can take remedial action such as adding dropout layers, adding regularizers, or reducing model complexity. Documentation for that is here and here. To see why I use an adjustable learning rate, see the answer to a Stack Overflow question here.
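The callback setup described above can be sketched as follows (TensorFlow/Keras assumed; the file path, `factor`, and `patience` values are illustrative choices, not from the answer):

```python
# Save the best-so-far weights and reduce the learning rate on a plateau.
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    # Keep only the weights that achieved the lowest validation loss.
    ModelCheckpoint("best_model.keras", monitor="val_loss",
                    save_best_only=True),
    # Halve the learning rate when validation loss stops improving.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                      patience=3, verbose=1),
]

# Typical usage (model and data not shown):
# model.fit(x_train, y_train, validation_split=0.1,
#           epochs=100, callbacks=callbacks)
```

Monitoring `val_loss` in both callbacks ties checkpointing and learning-rate refinement to the same signal, which is the point of the answer: the validation set drives every decision made during training.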

Validation loss when using Dropout

I am trying to understand the effect of dropout on validation Mean Absolute Error (non-linear regression problem).
Without dropout
With dropout of 0.05
With dropout of 0.075
Without any dropout, the validation loss is higher than the training loss, as shown in 1. My understanding is that for a good fit the validation loss should be only slightly higher than the training loss.
I carefully increased the dropout so that the validation loss came close to the training loss, as seen in 2. Dropout is only applied during training and not during validation, hence the validation loss ends up lower than the training loss.
Finally, when the dropout was increased further, the validation loss again became higher than the training loss, as in 3.
Which of these three should be called a good fit?
Following Marcin Możejko's response, I predicted against three test sets, as shown in 4. The 'Y' axis shows RMS error instead of MAE. The model without dropout gave the best result.
Well, this is a really good question. In my opinion, the lowest validation score (confirmed on a separate test set) is the best fit. Remember that, in the end, the performance of your model on totally new data is the most crucial thing, and the fact that it performed even better on the training set is not so important.
Moreover, I think your model might generally be underfitting: you could try extending it, e.g. with more layers or neurons, and prune it a little with dropout to prevent example memorization.
If my hypothesis turns out to be false, remember that it is still possible that certain data patterns are present only in the validation set (this happens relatively often with medium-sized datasets), which causes the divergence between training and test loss. Also, even though your loss values saturated in the case without dropout, there may still be room for improvement by simply increasing the number of epochs, as the losses seem to be trending downward.
Another technique I recommend trying is reducing the learning rate on a plateau (using, for example, this callback), as your model seems to need refinement with a lower learning rate.
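The asymmetry discussed in this thread, that dropout is active during training but not at inference, is easy to verify directly (TensorFlow/Keras assumed; the tensor size is arbitrary):

```python
# The same Dropout layer behaves differently depending on the training flag:
# at inference it is an identity map; during training it zeroes units and
# rescales the survivors by 1/(1 - rate).
import numpy as np
import tensorflow as tf

x = tf.ones((1, 1000))
drop = tf.keras.layers.Dropout(0.5)

y_infer = drop(x, training=False)   # inference: nothing is dropped
y_train = drop(x, training=True)    # training: roughly half the units zeroed

print(float(tf.reduce_mean(y_infer)))  # 1.0
```

This is why the training loss can look worse than the validation loss under dropout: the training loss is computed with units being dropped, while the validation loss is computed on the full network.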

Keras «Powerful image classification with little data»: disparity between training and validation

I followed this post and first made it work on the dataset «Cats vs dogs». Then I substituted this set with my own images, which show the presence of an object vs the absence of that object. My dataset is even smaller than the one in the post. I only have 496 images containing that object for training and 160 images with that object for validation. For the «absent» class I have numerous samples (without that object in an image).
So far I haven't tried class_weight to tackle the imbalanced-data problem. I just randomly chose 496 and 160 images without that object for training and validation, respectively. Basically, I am doing two-class image classification with a smaller dataset using the techniques in this post. I therefore expected somewhat worse performance due to the insufficient data, but the actual problem is that training does not converge, as shown in the figures.
Could you tell me possible reasons for the non-convergence? I guess the problem is related to my dataset, since the model works perfectly for «Cats vs dogs», but I don't know how to address it. Are there any good techniques to make it converge?
Thank you.
This performance plot is based on VGG16, keeping all layers up to the fully connected layer and training a small fully connected layer with 256 neurons.
This performance plot is also based on VGG16, but using 128 neurons instead of 256. I also set the number of epochs to 80.
Based on the suggestions provided so far, I'm thinking of building a customized convnet model to fight the overfitting problem. But how? One of my worries is that a model with fewer layers will degrade the training performance. Are there any guidelines for designing a good model for little data? Thank you.
Updates:
Now I think I know half the reason for the non-convergence. Actually, I only have 100+ images of my own; the rest were downloaded from Flickr. I thought images with centered objects and better quality would work for the model, but later I found they did not improve the accuracy and even worsened the output class probabilities. After removing these downloaded images, the performance improved a little and the non-convergence is gone. Note that I now only use 64*2 images for training and 48*2 images for testing. I also found that image augmentation could not improve performance on my dataset: without augmentation, the training accuracy can reach 1, but with some augmentation it is only around 85%. Has anyone had a similar experience? Why doesn't data augmentation always work? Is it because of our specific dataset? Thank you very much.
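For reference, the kind of augmentation setup used in the post being followed can be sketched like this (TensorFlow/Keras assumed; the transform ranges below are illustrative, not the ones from the post). If the transforms are too aggressive for the data, e.g. the object is shifted out of frame or flipped into a pose that never occurs, augmented samples stop resembling real ones, which is one plausible reason augmentation can lower training accuracy:

```python
# Mild augmentation pipeline for a small two-class image dataset.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,        # keep transforms mild for small datasets
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,     # only valid if a mirrored object is realistic
)

# Typical usage (directory layout assumed, one subfolder per class):
# train_gen = datagen.flow_from_directory(
#     "data/train", target_size=(150, 150),
#     batch_size=16, class_mode="binary")
```

A useful sanity check is to plot a batch of augmented images: if a human would struggle to label them, the network likely will too.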
Your model is working great, but it's overfitting: it is capable of memorizing all your training data without really "thinking". That leads to great training results and bad test results.
Common ways to avoid overfitting are:
More data - If you have little data, the chance of overfitting increases
Fewer units/layers - Make the model less capable, so it will stop memorizing and start thinking.
Add "dropouts" to your layers (something that randomly discards part of the results to prevent the model from being too powerful)
Do more layers mean more power and performance?
If by performance you mean capability of learning, yes. (If you mean "speed", no)
Yes, more layers mean more power. But too much power leads to overfitting: the model is so capable that it can memorize training data.
So there is an optimal point:
A model that is not very capable will not give you the proper results (both training and test results will be bad)
A model that is too capable will memorize the training data (excellent training results, but bad test results)
A balanced model will learn the right things (good training and test results)
That's exactly why we use test data: it's data that is not presented during training, so the model doesn't learn from it.

Good results when training and cross-validating a model, but test data set shows poor results

My problem is that I obtain a model with very good results (in training and cross-validation), but when I test it on a different dataset, the results are poor.
I have a model that has been trained and tested via cross-validation. The model shows AUC = 0.933, TPR = 0.90 and FPR = 0.04.
Looking at the pictures, corresponding to the learning curve (error), learning curve (score), and deviance curve, I guess there is no overfitting present:
The problem is that when I test this model with a different test dataset, I obtain poor results, nothing like my previous ones: AUC = 0.52, TPR = 0.165 and FPR = 0.105.
I used a Gradient Boosting Classifier to train my model, with learning_rate=0.01, max_depth=12, max_features='auto', min_samples_leaf=3, n_estimators=750.
I used SMOTE to balance the classes. It is a binary model. I vectorized my categorical attributes and used 75% of my dataset to train and 25% to test. My model has a very low training error and a low test error, so I guess it is not overfitted. Since the training error is very low, I assume there are no outliers in the training and cv-test datasets. What can I do from now on to find the problem? Thanks.
If the process generating your datasets is non-stationary, it could cause the behavior you describe.
In that case, the distribution of the dataset you're using for testing was not seen during training.
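The non-stationarity point can be illustrated with a synthetic experiment (scikit-learn assumed; the data and the drift below are entirely made up). A model that scores well on held-out data from its own training distribution can collapse once the relationship between features and labels changes:

```python
# Train on one feature/label relationship, then evaluate on (a) fresh data
# from the same process and (b) data where the process has drifted.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 5))
y_train = (X_train[:, 0] > 0).astype(int)     # original relationship

X_iid = rng.normal(size=(500, 5))
y_iid = (X_iid[:, 0] > 0).astype(int)         # same generating process
X_shift = rng.normal(size=(500, 5))
y_shift = (X_shift[:, 0] < 0).astype(int)     # the process has drifted

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
auc_iid = roc_auc_score(y_iid, clf.predict_proba(X_iid)[:, 1])
auc_shift = roc_auc_score(y_shift, clf.predict_proba(X_shift)[:, 1])
# auc_iid is near 1; auc_shift collapses, as in the question's AUC drop
```

A practical diagnostic along these lines is to compare feature distributions between the cross-validation folds and the new test set before blaming the model.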

Difficulty with initially training Neural Networks

I am researching artificial neural networks (ANNs). I am trying to train many different ANNs, with the main emphasis of the research being the correlation between structural changes and prediction rate.
I have noticed that it is quite common for the training algorithms to converge within the first 100 or so iterations to a point near the initial state because the training step is too small. I have no clear idea why this happens. Has anyone encountered the same problem? What could be the reason? Is there a better way to overcome the problem than just forcing the iteration schemes to work their way through the beginning, where the problem seems to lie?
I have been training my networks in Octave using fmincg and fminunc. Backpropagation is used to get the gradient, and the cost function is the same as for logistic regression. The problem occurred for a network with 10 neurons in the first and 10 neurons in the second hidden layer. The MNIST database is used for both the training and test sets.
Addition:
fminunc seems not to do well at all on a three-layer ANN, but with some random initializations a two-layer ANN seems to converge without a problem. Conjugate gradient seems to work if forced through the initial phase.
Could the problem be the random initialization of weights? Could too low a variability, [-0.12; 0.12], be causing the problem?
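One way to sanity-check that range is to compare it against a standard initialization rule (pure NumPy; the layer sizes come from the question, and Glorot/Xavier uniform initialization is used here as the comparison, which is my choice, not something from the question). For the 10 -> 10 hidden layer described, Glorot suggests a noticeably wider range than [-0.12, 0.12]:

```python
# Glorot/Xavier uniform limit: sqrt(6 / (fan_in + fan_out)).
import numpy as np

fan_in, fan_out = 10, 10                   # second hidden layer in the question
limit = np.sqrt(6.0 / (fan_in + fan_out))  # Glorot uniform limit
print(round(limit, 3))                     # 0.548 -- vs. the 0.12 used

rng = np.random.default_rng(0)
w = rng.uniform(-limit, limit, size=(fan_in, fan_out))
```

With very narrow layers, an overly tight initial range keeps the pre-activations small and the initial gradients correspondingly small, which is consistent with the optimizer stalling near the initial state.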
Edit: Made the network structure part a little clearer.