What is the best metric to evaluate how well a CNN is trained: validation error or training loss?

I want to train a CNN, but I want to use all the data to train the network, thus not performing validation. Is this a good choice? Am I risking overfitting my CNN if I use only the training loss as the criterion for early stopping?
In other words, what is the best 'monitor' parameter in Keras (for example) for early stopping, among the options below?
early_stopper=EarlyStopping(monitor='train_loss', min_delta=0.0001, patience=20)
early_stopper=EarlyStopping(monitor='train_acc', min_delta=0.0001, patience=20)
early_stopper=EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=20)
early_stopper=EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=20)
There is a similar discussion on Stack Overflow (Keras: Validation error is a good measure for stopping criteria or validation accuracy?), but it only compares the validation metrics. Is it better to use a validation-based or a training-based criterion for early stopping a CNN?

I want to train a CNN, but I want to use all the data to train the network, thus not performing validation. Is this a good choice? Am I risking overfitting my CNN if I use only the training loss as the criterion for early stopping?
Answer: No. Your purpose is to predict on new samples; even if you reach 100% training accuracy, you may still get bad predictions on new samples. Without held-out data you have no way to check whether you are overfitting.
In other words, what is the best 'monitor' parameter in KERAS (for example) for early stopping, among the options below?
Answer: It should be the criterion closest to reality:
early_stopper=EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=20)
In addition, you may need train, validation, and test data. The training set is used to fit your model, the validation set is used to compare models and parameters and select the best, and the test set is used to verify your result independently (it is never used for choosing models or parameters, so it is equivalent to new samples).
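As a minimal sketch of this setup in Keras (assuming a compiled classification model named model and arrays x and y, which are placeholders rather than anything from the original post; newer Keras versions spell the metric 'val_accuracy' instead of 'val_acc'):

from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

# Hold out 20% of the data purely for validation; the model never trains on it.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)

# Stop when validation accuracy has not improved by at least min_delta for 20 epochs,
# and roll back to the best weights seen during training.
early_stopper = EarlyStopping(monitor='val_accuracy',  # 'val_acc' in older Keras versions
                              min_delta=0.0001,
                              patience=20,
                              restore_best_weights=True)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=200,
          callbacks=[early_stopper])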

I've already up-voted Tin Luu's answer, but wanted to refine one critical, practical point: the best criterion is the one that best matches your success criteria. To wit, you have to define your practical scoring function before your question makes complete sense for us.
What is important to the application for which you're training this model? If it's nothing more than top-1 prediction accuracy, then validation accuracy (val_acc) is almost certainly your sole criterion. If you care about confidence levels (e.g. hedging your bets when there's a 48% chance it's a cat, 42% it's a wolf, and 10% it's a Ferrari), then a properly implemented error function will make validation loss (val_loss) a better choice.
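To make that concrete (a hypothetical three-class prediction, not taken from the original post): top-1 accuracy only looks at the arg-max, while cross-entropy, the usual validation loss, also scores the confidence itself:

import numpy as np

# Hypothetical softmax output for one image whose true label is "cat" (class 0).
p_hedged    = np.array([0.48, 0.42, 0.10])    # cat / wolf / Ferrari
p_confident = np.array([0.99, 0.005, 0.005])
true_class = 0

# Top-1 accuracy treats both predictions identically: both pick "cat".
print(np.argmax(p_hedged) == true_class, np.argmax(p_confident) == true_class)  # True True

# Cross-entropy distinguishes them: the hedged prediction is penalized far more.
print(-np.log(p_hedged[true_class]))     # ~0.73
print(-np.log(p_confident[true_class]))  # ~0.01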
Finally, I stress again that the ultimate metric is actual performance according to your chosen criteria. Test data are a representative sampling of your actual input. You can use an early stopping criterion for faster training turnaround, but you're not ready for deployment until your real-world criteria are tested and satisfied.

Related

80-20 or 80-10-10 for training machine learning models?

I have a very basic question.
1) When is it recommended to hold out part of the data for validation and when is it unnecessary? For example, when can we say it is better to have an 80% training, 10% validation and 10% testing split, and when can we say it is enough to have a simple 80% training and 20% testing split?
2) Also, does using k-fold cross-validation go with the simple (training-testing) split?
I find it more valuable to have a training and validation set if I have a limited-size data set. The validation set is essentially a test set anyway. The reason for this is that you want your model to extrapolate from high accuracy on the data it is trained on to high accuracy on data it has not seen before, and the validation set allows you to determine if that is the case. I generally take at least 10% of the data set and make it a validation set. It is important that you select the validation data randomly so that its probability distribution matches that of the training set.
Next I monitor the validation loss and save the model with the lowest validation loss. I also use an adjustable learning rate. Keras has two useful callbacks for this purpose, ModelCheckpoint and ReduceLROnPlateau. Documentation is here.
With a validation set you can monitor the validation loss during training and ascertain whether your model is training properly (training accuracy) and whether it is extrapolating properly (validation loss). The validation loss should, on average, decrease as the model accuracy increases. If the validation loss starts to increase while training accuracy is high, your model is overfitting and you can take remedial action, such as adding dropout layers or regularizers, or reducing your model complexity. Documentation for that is here and here. To see why I use an adjustable learning rate, see the answer to a Stack Overflow question here.
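A minimal sketch of the callback setup described above (assuming a compiled Keras model named model and arrays x and y, which are placeholders rather than anything from the original post):

from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

# Randomly held-out 10% validation set, as described above.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.10, random_state=0)

callbacks = [
    # Keep only the weights with the lowest validation loss seen so far.
    ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
    # Halve the learning rate when the validation loss stops improving.
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6),
]

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,
          callbacks=callbacks)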

What happens if we feed rule-based labels to train a Neural Network?

I don't have hand-labeled data right now, but data in which the labels are created by a rule-based algorithm using other features. Can I train my Neural Network with this data?
If it gives a good score, can I use the same algorithm to train with hand-labeled data? Would it give similar accuracy?
You don't realize it, but you've already answered your own question: "I don't trust the result by rule based algorithm". Now think back to the purpose of a label: it's your ground truth for training. If you do not trust those results, then those results cannot be any sort of ground truth for you.
The training results of your neural network would be based entirely on the rule-based labels. The NN operation would, at best, reproduce the results you don't trust. There is no way it will serve as an independent check on the rule-based system.
However, you can certainly develop your neural network with that data and learn something about the viability of the model topology. If your goal is to have a viable NN that will be ready once you do have hand-labeled data, then you're very much on the right track for methodology.
Do not put much faith in getting a "good score" from the rule-based data; your purpose is to build a model that will give a good score with the hand-labeled data. The preliminary result's accuracy is self-referential, and is no better than the quality of the input data.

Alternatives to validate multiple linear regression time series

I am using multiple linear regression to do sales quantity forecasting in retail. Due to practical issues, I cannot use ARIMA or Neural Networks.
I split the historical data into training and validation sets. Using a walk-forward validation method would be computationally quite expensive at this point, so I have to take the x weeks preceding the current date as my validation set; the time series prior to x is my training set. The problem I am noting with this method is that accuracy is far higher during the validation period than for future predictions. That is, the further we move from the end of the training period, the less accurate the prediction/forecast becomes. How best can I control this problem?
Perhaps a smaller validation period will allow the training period to come closer to the current date and hence provide a more accurate forecast, but this hurts the value of validation.
Another thought is to cheat and give both the training and validation historical data during training. As I am not using neural nets, the selected algorithm should not overfit. Please correct me if this assumption is not right.
Any other thoughts or solution would be most welcome.
If you're not using ARIMA or DNN, how about using rolling windows of regressions to train and test the historical data?
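A rough sketch of what rolling-window (rolling-origin) evaluation could look like with scikit-learn, assuming a feature matrix X and target y already ordered by week; the window sizes below are illustrative, not from the original post:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def rolling_window_error(X, y, train_weeks=104, test_weeks=4, step=4):
    """Refit the regression on a sliding window of history and score it
    on the weeks immediately following that window."""
    errors = []
    for start in range(0, len(y) - train_weeks - test_weeks + 1, step):
        train_end = start + train_weeks
        test_end = train_end + test_weeks
        model = LinearRegression().fit(X[start:train_end], y[start:train_end])
        preds = model.predict(X[train_end:test_end])
        errors.append(mean_absolute_error(y[train_end:test_end], preds))
    return np.mean(errors)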

Model tuning with Cross validation

I have a model tuning object that fits multiple models and tunes each one of them to find the best hyperparameter combination for each of the models. I want to perform cross-validation on the model tuning part and this is where I am facing a dilemma.
Let's assume that I am fitting just one model, a random forest classifier, and performing 5-fold cross-validation. Currently, for the first fold that I leave out, I fit the random forest model and perform the model tuning. I am performing model tuning using the dlib package. I calculate the evaluation metric (accuracy, precision, etc.) and select the best hyper-parameter combination.
Now when I am leaving out the second fold, should I be tuning the model again? Because if I do, I will get a different combination of hyperparameters than I did in the first case. If I do this across the five folds, what combination do I select?
The cross-validators present in Spark and sklearn use grid search, so for each fold they have the same hyper-parameter combination and don't have to worry about hyper-parameter combinations changing across folds.
Choosing the best hyper-parameter combination that I get when I leave out the first fold and using it for the subsequent folds doesn't sound right because then my entire model tuning is dependent on which fold got left out first. However, if I am getting different hyperparameters each time, which one do I settle on?
TLDR:
If you are performing, let's say, derivative-based model tuning along with cross-validation, your hyper-parameter combination changes as you iterate over folds. How do you select the best combination then? Generally speaking, how do you use cross-validation with derivative-based model tuning methods?
PS: Please let me know if you need more details
This is more of a comment, but it is too long for this, so I post it as an answer instead.
Cross-validation and hyperparameter tuning are two separate things. Cross-validation is done to get a sense of the out-of-sample prediction error of the model. You can do this by having a dedicated validation set, but this raises the question of whether you are overfitting to this particular validation data. As a consequence, we often use cross-validation, where the data are split into k folds and each fold is used once for validation while the others are used for fitting. After you have done this for each fold, you combine the prediction errors into a single metric (e.g. by averaging the error from each fold). This then tells you something about the expected performance on unseen data, for a given set of hyperparameters.
Once you have this single metric, you can change your hyperparameters, repeat, and see if you get a lower error with the new hyperparameters. This is the hyperparameter tuning part. The CV part is just about getting a good estimate of the model performance for the given set of hyperparameters, i.e. you do not change hyperparameters 'between' folds.
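In plain scikit-learn terms the pattern looks roughly like the sketch below (X, y and the candidate settings are placeholders; the same loop applies when dlib's optimizer proposes the next candidate instead of a fixed list):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

candidates = [{'n_estimators': 100, 'max_depth': 5},
              {'n_estimators': 300, 'max_depth': 10},
              {'n_estimators': 500, 'max_depth': None}]

results = []
for params in candidates:
    # The same hyperparameters are used on all 5 folds; the CV score only
    # estimates how well this particular setting generalizes.
    scores = cross_val_score(RandomForestClassifier(**params, random_state=0),
                             X, y, cv=5, scoring='accuracy')
    results.append((scores.mean(), params))

best_score, best_params = max(results, key=lambda r: r[0])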
I think one source of confusion might be the distinction between hyperparameters and parameters (sometimes also referred to as 'weights', 'feature importances', 'coefficients', etc). If you use a gradient-based optimization approach, these change between iterations until convergence or a stopping rule is reached. This is however different from hyperparameter search (e.g. how many trees to plant in the random forest?).
By the way, I think questions like these are better posted on Cross Validated or Data Science Stack Exchange.

4 fold cross validation | Caffe

So I am trying to perform 4-fold cross-validation on my training set. I have divided my training data into four quarters. I use three quarters for training and one quarter for validation, and I repeat this three more times until each quarter has served as the validation set once.
Now after training I have four caffemodels. I test the models on my validation sets and get a different accuracy in each case. How should I proceed from here? Should I just choose the model with the highest accuracy?
Maybe it is a late reply, but in any case...
The short answer is that, if the performances of the four models are similar and good enough, then you re-train the model on all the data available, because you don't want to waste any of them.
n-fold cross-validation is a practical technique to get some insight into the learning and generalization properties of the model you are trying to train when you don't have a lot of data to start with. You can find details everywhere on the web, but I suggest the freely available book An Introduction to Statistical Learning, Chapter 5.
The general rule says that after you have trained your n models, you average the prediction error (MSE, accuracy, or whatever) to get a general idea of the performance of that particular model (in your case, the network architecture and learning strategy) on that dataset.
The main idea is to assess the models learned on the training splits by checking whether they have acceptable performance on the validation sets. If they do not, your models have probably overfitted the training data. If the errors on both the training and validation splits are high, then the models should be reconsidered, since they have no predictive capacity.
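In plain terms the procedure looks like this (illustrative numbers, not from the thread; the steps are the same regardless of the framework):

import numpy as np

# Accuracy of each of the four models on its held-out quarter (illustrative values).
fold_accuracies = [0.91, 0.89, 0.93, 0.90]

# Cross-validation estimate for this architecture + training strategy.
print(f"estimated accuracy: {np.mean(fold_accuracies):.3f} +/- {np.std(fold_accuracies):.3f}")

# If this estimate is acceptable, discard the four fold models and retrain a
# single final model with the same settings on ALL of the training data.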
In any case, I would also consider the advice of Yoshua Bengio, who says that for the kind of problems deep learning is meant for, you usually have enough data to simply go with a training/test split. In that case, this answer on Stack Overflow could be useful to you.

Resources