80-20 or 80-10-10 for training machine learning models? - validation

I have a very basic question.
1) When is it recommended to hold part of the data for validation and when is it unnecessary? For example, when can we say it is better to have 80% training, 10% validating and 10% testing split and when can we say it is enough to have a simple 80% training and 20% testing split?
2) Also, does K-fold cross-validation go with the simple training-testing split?

I find it more valuable to have a training and validation set if I have a limited-size data set. The validation set is essentially a test set anyway. The reason is that you want a model with high accuracy on the data it is trained on to also have high accuracy on data it has not seen before. The validation set allows you to determine whether that is the case. I generally take at least 10% of the data set and make it a validation set. It is important to select the validation data randomly so that its probability distribution matches that of the training set.
Next I monitor the validation loss and save the model with the lowest validation loss. I also use an adjustable learning rate. Keras has two useful callbacks for this purpose, ModelCheckpoint and ReduceLROnPlateau. Documentation is here. With a validation set you can monitor the validation loss during training and ascertain whether your model is training properly (training accuracy) and whether it is generalizing properly (validation loss). The validation loss should, on average, decrease as the model accuracy increases.
If the validation loss starts to increase while the training accuracy is high, your model is overfitting and you can take remedial action such as adding dropout layers, adding regularizers or reducing your model complexity. Documentation for that is here and here. To see why I use an adjustable learning rate, see the answer to a Stack Overflow question here.
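To make the "save the model with the lowest validation loss and lower the learning rate when it stalls" idea concrete, here is a minimal framework-free sketch. It is not the Keras implementation: the function name `track_validation`, the `factor` and `patience` defaults, and the loss values are all made up for illustration, and the real callbacks have more options than this.

```python
def track_validation(val_losses, lr=0.01, factor=0.5, patience=2):
    """Walk through per-epoch validation losses.

    Returns (best_epoch, best_loss, final_lr). All names and defaults
    are hypothetical; this only mimics the general idea of checkpointing
    on the best validation loss and reducing the learning rate on a plateau.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement (or last LR reduction)
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # A real training loop would save the model weights here.
            best_loss, best_epoch = loss, epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # plateau: shrink the learning rate
                wait = 0
    return best_epoch, best_loss, lr


# Made-up validation losses: improvement stops after epoch 2.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.61, 0.66]
best_epoch, best_loss, final_lr = track_validation(losses)
print(best_epoch, best_loss, final_lr)
```

In this toy run the "checkpoint" would be the weights from the epoch with the lowest validation loss, regardless of what happens in later epochs.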

Related

Can I update weights of keras neural net only if validation improves?

I am training a neural network in keras and I reach a classical limit - my training accuracy improves with increasing epochs, but my validation accuracy decreases after 9 epochs (see figure).
I wonder if I can avoid the decrease of validation accuracy by doing the following: make the keras net only accept the changes to the weights after each epoch if the epoch led to an improvement of the validation accuracy, else reset to the state before the epoch? I assume that the validation is starting to diverge in a big part because after each epoch >9 the weights of the neural net diverge away from similarity to the validation data.
So, is my suggestion a good practice and can I achieve it in keras (are there callbacks or options that allow me to update the net only if the validation improved)?
Side question: Is my suggestion maybe violating the principle of "don't use your validation data for training"? Because I am making implicitly the performance of the neural net a function of my validation data.
The point of the validation set is to give you an idea of the generalizability your model achieves by learning using the training data. You don't HAVE to have a validation dataset. If your validation data is a random sample of your training data, then your best bet is probably modifying your architecture.
In short, if you want your model to train based on your validation data, then train the model on the training set, then take the resulting model, and train it on the validation data (i.e. make the validation data the training data). This obviously defeats the point of having a validation set.

What is the difference between test and validation specifically in Mask-R-CNN?

I have my own image dataset and use Mask-R-CNN for training. There you divide your dataset into train, validation and test.
I want to know the difference between validation and test.
I know that validation in general is used to see the quality of the NN after each epoch. Based on that you can see how good the NN is and if overfitting is happening.
But I want to know whether the NN learns based on the validation set.
Based on the training set the NN learns after each image and adjusts each neuron to reduce the loss. And after the NN is finished learning, we use the test set to see how good our NN really is with new unseen images.
But what exactly happens in Mask-R-CNN based on the validation set? Is the validation set only there for seeing the results? Or will some parameters be adjusted based on the validation result to avoid overfitting? And even if this is the case, how much influence does the validation set have on the parameters? Will the neurons themselves be adjusted or not?
If the influence is very small, then I will choose the validation set equal to the test set, because I don't have many images (800).
So basically I want to know the difference between test and validation in Mask-R-CNN, that is, how and how much the validation set influences the NN.
The model does not learn from the validation set. The validation set is just used to give an approximation of the generalization error at any epoch but also, crucially, for hyperparameter optimization. So I can iterate over several different hyperparameter configurations and evaluate the accuracy of each on the validation set.
Then after we choose the best model based on the validation set accuracies we can then calculate the test error based on the test set. Ideally there is not a large difference between test set and validation set accuracies. Sometimes your model can essentially 'overfit' to the validation set if you iterate over lots of different hyperparameters.
Reserving another set, the test set, to evaluate on after this validation set evaluation is a luxury you may have if you have a lot of data. Lots of times you may be lacking enough labelled data for it even to be worth having a separate test set held back.
Lastly, these things are not specific to Mask R-CNN. Validation sets never directly affect the training of a model, i.e. the weights or biases are not updated from them. Validation sets, like test sets, are purely for evaluation purposes.
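The workflow described above (fit on the training set, pick the hyperparameter on the validation set, touch the test set once at the end) can be sketched in plain Python. Everything here is an assumption for illustration: the data is synthetic, the model is a tiny 1-D k-nearest-neighbour classifier rather than Mask R-CNN, and `k` stands in for whatever hyperparameter you are tuning.

```python
import random


def knn_predict(train_pts, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train_pts, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train_pts, eval_pts, k):
    """Fraction of eval_pts classified correctly using train_pts as the model."""
    hits = sum(knn_predict(train_pts, x, k) == y for x, y in eval_pts)
    return hits / len(eval_pts)


# Synthetic data: label is 1 when x > 0.5, with 10% label noise.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.random()
    y = 1 if x > 0.5 else 0
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

# Random 80/10/10 split.
rng.shuffle(data)
train, val, test = data[:240], data[240:270], data[270:]

# Hyperparameter selection uses the validation set only.
val_scores = {k: accuracy(train, val, k) for k in (1, 5, 15)}
best_k = max(val_scores, key=val_scores.get)

# The test set is used once, after the choice has been made.
test_acc = accuracy(train, test, best_k)
print(val_scores, best_k, test_acc)
```

Note that the validation set never changes the stored training points (the "weights" of this toy model); it only influences which `k` is kept.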

Validation loss when using Dropout

I am trying to understand the effect of dropout on validation Mean Absolute Error (non-linear regression problem).
Figure 1: Without dropout
Figure 2: With dropout of 0.05
Figure 3: With dropout of 0.075
Without any dropouts the validation loss is more than training loss as shown in 1. My understanding is that the validation loss should only be slightly more than the training loss for a good fit.
Carefully, I increased the dropout so that validation loss is close to the training loss as seen in 2. The dropout is only applied during training and not during validation, hence the validation loss is lower than the training loss.
Finally the dropout was increased further and the validation loss again became more than the training loss in 3.
Which of these three should be called a good fit?
Following the response of Marcin Możejko, I predicted against three tests as shown in 4. The 'Y' axis shows RMS error instead of MAE. The model 'without dropout' gave the best result.
Well - this is a really good question. In my opinion, the lowest validation score (confirmed on a separate test set) is the best fit. Remember that in the end the performance of your model on totally new data is the most crucial thing, and the fact that it performed even better on the training set is not so important.
Moreover, I think that your model might generally underfit. You could try extending it, e.g. with more layers or neurons, and then prune it a little using dropout in order to prevent memorization of examples.
If my hypothesis turns out to be false, remember that it is still possible that certain data patterns are present only in the validation set (this happens relatively often with medium-size datasets), which causes the training and test losses to diverge. Moreover, I think that even though your loss values saturated in the case without dropout, there is still room for improvement by simply increasing the number of epochs, as there seems to be a trend for the losses to get smaller.
Another technique I recommend you try is reducing the learning rate on plateau (using, for example, this callback), as your model seems to need refinement with a lower learning rate.
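On the point from the question that dropout is applied only during training and not during validation, a small sketch may help when comparing the two loss curves. This assumes the common "inverted dropout" formulation, where kept activations are scaled by 1 / (1 - rate) during training; the function name and the `training` flag are illustrative, not a library API.

```python
import random


def dropout(activations, rate, training, rng):
    """Inverted dropout on a list of activations (illustrative sketch).

    Training: each unit is zeroed with probability `rate`; the survivors
    are scaled by 1 / (1 - rate) so the expected value is unchanged.
    Evaluation: the input is returned untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
acts = [1.0] * 10000

train_out = dropout(acts, 0.05, True, rng)   # noisy, as seen by the training loss
eval_out = dropout(acts, 0.05, False, rng)   # clean, as seen by the validation loss

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(dropped_fraction, train_mean, eval_out == acts)
```

Because the training loss is computed through the noisy path and the validation loss through the clean one, the two numbers are not measured under identical conditions when dropout is on.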

Does validation accuracy/loss impact training in caffe

Have a simple question about validation set in caffe, was wondering if validation set has any impact on training? I know that you use validation set to check if the network isn't overfitting and as I understand validation set has no impact on weight update, but does it have some kind of impact on selecting or modifying hyper-parameters or is it just for user to see and estimate how well network has learned?
No, the results of the validation set are not used by the neural network during training to adjust any hyperparameters. Using the validation set during training is the same as applying the network at some point in time to predict values for the validation set, and then scoring how well it did.
You might decide that you want to run the same network training procedure many times over using different values for hyperparameters. In its fully exhaustive form, that would mean you would do a grid search over the hyperparameter space with many different training sessions of separate networks. In practice, it's not a great idea to do a fully exhaustive grid search with neural networks because the number of hyperparameter combinations can be extremely large.
Often with neural networks you can tune one parameter at a time until they each seem "about right". Of course this might not get you the absolute best result, but it's not a bad first approach.
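To give a feel for the difference in cost, here is a small sketch comparing an exhaustive grid with tuning one hyperparameter at a time. The hyperparameter names, their values, and `toy_validation_score` are all invented; in practice each call to the score function would be a full training run followed by evaluation on the validation set.

```python
from itertools import product

# Hypothetical search space.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout": [0.0, 0.05, 0.075],
    "hidden_units": [32, 64, 128],
}


def toy_validation_score(cfg):
    """Stand-in for 'train with cfg, then score on the validation set'."""
    return (-abs(cfg["learning_rate"] - 0.01)
            - abs(cfg["dropout"] - 0.05)
            - abs(cfg["hidden_units"] - 64) / 1000)


# Exhaustive grid search: one training run per combination.
all_cfgs = [dict(zip(grid, values)) for values in product(*grid.values())]
best_grid = max(all_cfgs, key=toy_validation_score)

# One-at-a-time tuning: vary a single hyperparameter, keep the others fixed.
current = {name: values[0] for name, values in grid.items()}
runs = 0
for name, values in grid.items():
    scored = []
    for v in values:
        trial = dict(current, **{name: v})
        scored.append((toy_validation_score(trial), v))
        runs += 1
    current[name] = max(scored)[1]

print(len(all_cfgs), runs)
print(best_grid, current)
```

The toy score is separable, so both strategies land on the same configuration here. With a real network the hyperparameters interact, which is why the one-at-a-time approach is cheaper but not guaranteed to find the best combination.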

Is it good to do Cross Validation with the exact same dataset that is used in training phase?

I am using the Weka API to test the performance of some algorithms. Suppose I want to divide the dataset as follows:
70% for training
10% for validation
20% for testing
For the validation phase, should I use cross-validation on the fresh 10% of the data? Or is it better to apply cross-validation to the 70% of the data that has already been used for training? And why?
It is actually very problem specific, but in general it depends on the size of the dataset. If you have a big dataset then even a subsample is representative, so you can split everything once into train/valid/test and just run a typical optimization and testing routine. On the other hand, if you have a rather small dataset (~1000 samples) then both testing and validation require CV (or another technique, such as the .632 bootstrap error estimate). It is all about the statistical significance of the obtained error estimates. If the data is small, you need to generate multiple experiments (CV) to get a reasonable estimator; if you have 100,000 samples then even 10% should be enough to use as a valid estimator of error.
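For the small-dataset case, the "multiple experiments" that CV generates can be sketched as index splits in plain Python. This is only an illustration of k-fold splitting, not what Weka does internally; the function name `k_fold_indices` and the choice of 5 folds are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of k folds.

    Indices are shuffled once, then dealt into k folds; each fold serves
    as the validation set exactly once while the rest is used for training.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val


# About 1000 samples, as in the small-dataset case above.
splits = list(k_fold_indices(1000, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

You would train and score one model per split and then average the k validation scores, which gives a steadier error estimate than a single small hold-out set.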
