Huge difference between test and validation accuracy

I have an XGBoost model that I'm using for stock price prediction.
My model works well on a validation set (for example, it makes a small profit on the 3 months of data I used for validation), but it does not generate good predictions on the test dataset (no profit on another 3 months of data).
I'm not sure why that is, and I'm looking for possible hypotheses about why the model is not working properly, so I can design experiments and test them.
So far I have these hypotheses:
the time difference makes the validation dataset unrepresentative
the parameters are heavily tuned on the validation set and I have a selection bias
there is a problem with the code (normalization, data leakage, noise, ...)
stock price prediction is simply not possible
the model is really behaving randomly and we just selected the best random model on the validation set
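One way to probe the first hypothesis is walk-forward (expanding-window) cross-validation, which scores the model on several consecutive time windows instead of a single 3-month slice; a large spread across windows suggests the single validation period is unrepresentative. A minimal sketch with scikit-learn's TimeSeriesSplit on toy random data, using GradientBoostingRegressor as a stand-in for XGBoost (swap in xgboost.XGBRegressor if it is installed):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))       # toy features (e.g. lagged returns)
y = rng.normal(size=300)            # toy target

fold_errors = []
tscv = TimeSeriesSplit(n_splits=5)  # each fold trains on the past only
for train_idx, val_idx in tscv.split(X):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

# A large spread across folds suggests a single validation window
# is not representative of other periods.
print(np.round(fold_errors, 3))
```

Because TimeSeriesSplit never puts future rows in the training fold, it also helps surface look-ahead data leakage (hypothesis 3).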

Related

How to validate my YOLO model trained on custom data set?

I am doing research on object detection using YOLO, although I am from the civil engineering field and not familiar with computer science. My advisor is asking me to validate my YOLO detection model trained on a custom dataset. But my problem is that I really don't know how to validate my model. So please kindly point out to me how to validate my model.
Thanks in advance.
I think first you need to make sure that all the cases you are interested in (location of objects, their size, general view of the scene, etc.) are represented in your custom dataset - in other words, that the collected data reflects your task. You can discuss it with your advisor. The main rule: label your data consistently, in the same manner as you want to see it in the output. More information can be found here.
This is really important - garbage in, garbage out: the quality of the output of your trained model is determined by the quality of the input (labelled data).
Once this is done, it is common practice to split your data into training and test sets. During model training only the train set is used, and you can later validate the quality (generalizing ability, robustness, etc.) on data that the model did not see - the test set. It is also important that these two subsets don't overlap - otherwise your model will be overfitted and will not perform the task properly.
Then you can train a few different models (with some architectural changes, for example) on the same train set and validate them on the same test set; this is a regular validation process.
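As a rough sketch of the split step, assuming the labelled images are simply a list of file names (the names here are hypothetical), scikit-learn's train_test_split produces two disjoint subsets:

```python
from sklearn.model_selection import train_test_split

# Hypothetical list of labelled image files in the custom dataset
images = [f"img_{i:04d}.jpg" for i in range(1000)]

# Disjoint 80/20 split: the model never sees the test images during training
train_imgs, test_imgs = train_test_split(images, test_size=0.2, random_state=42)

assert not set(train_imgs) & set(test_imgs)   # the two subsets do not overlap
print(len(train_imgs), len(test_imgs))        # 800 200
```

For YOLO specifically, the same split is usually expressed as two text files (or folders) of image paths that the training config points at, but the disjointness requirement is identical.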

How effective is early stopping when the validation accuracy varies

I am building a time series model for forecasting which consists of 2 BiLSTM layers followed by a dense layer. I have a total of 120 products whose values I need to forecast. And I have a relatively small dataset (monthly data for a period of 2 years => a maximum of 24 time steps). Consequently, when I look at the overall validation accuracy, I get this:
At every epoch, I saved the model weights to disk so that I can load any model at any time in the future.
When I look at the validation accuracy for different products, I get the following (this is roughly for a few products):
For this product, can I use the model saved at epoch ~90 to forecast?
And for the following product, can I use the model saved at epoch ~40 for forecasting?
Am I cheating? Please note that the products are quite diverse and their purchasing behavior differs from one to another. To me, following this strategy is equivalent to training 120 models (given 120 products), while at the same time feeding more data per model as a bonus, hoping to achieve better results per product. Am I making a fair assumption?
Any help is much appreciated!
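The per-product selection described above can be sketched in plain Python, with toy validation-loss histories standing in for the real training logs (product names and numbers here are made up):

```python
# Toy per-product validation-loss histories (index = epoch); in practice these
# would come from the training logs of the shared model.
histories = {
    "product_a": [0.9, 0.7, 0.5, 0.6, 0.8],
    "product_b": [1.2, 1.0, 0.9, 0.4, 0.7],
}

# Pick, per product, the epoch whose saved weights had the lowest validation loss
best_epoch = {p: min(range(len(h)), key=h.__getitem__) for p, h in histories.items()}
print(best_epoch)  # {'product_a': 2, 'product_b': 3}
```

Note the caveat implicit in the question: choosing the epoch on the same data you later report results on is itself a form of validation-set selection bias, so the chosen per-product checkpoints should still be confirmed on a held-out period.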

80-20 or 80-10-10 for training machine learning models?

I have a very basic question.
1) When is it recommended to hold out part of the data for validation, and when is it unnecessary? For example, when can we say it is better to have an 80% training, 10% validation and 10% testing split, and when can we say it is enough to have a simple 80% training and 20% testing split?
2) Also, does K-fold cross-validation go with the simple split (training-testing)?
I find it more valuable to have a training and validation set if I have a limited-size dataset. The validation set is essentially a test set anyway. The reason for this is that you want your model to extrapolate from having high accuracy on the data it is trained on to also having high accuracy on data it has not seen before. The validation set allows you to determine if that is the case.

I generally take at least 10% of the dataset and make it a validation set. It is important that you select the validation data randomly so that its probability distribution matches that of the training set. Next, I monitor the validation loss and save the model with the lowest validation loss. I also use an adjustable learning rate. Keras has two useful callbacks for this purpose, ModelCheckpoint and ReduceLROnPlateau. Documentation is here.

With a validation set you can monitor the validation loss during training and ascertain whether your model is training properly (training accuracy) and whether it is extrapolating properly (validation loss). The validation loss on average should decrease as the model accuracy increases. If the validation loss starts to increase while training accuracy is high, your model is overfitting, and you can take remedial action such as including dropout layers, adding regularizers, or reducing your model complexity. Documentation for that is here and here. To see why I use an adjustable learning rate, see the answer to a Stack Overflow question here.
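To make the mechanism concrete, here is a minimal plain-Python sketch of roughly what ModelCheckpoint(save_best_only=True) and ReduceLROnPlateau do, using a toy validation-loss history; the real Keras callbacks are simply passed to model.fit rather than written by hand:

```python
# Toy validation-loss history, one entry per epoch
val_losses = [0.80, 0.65, 0.60, 0.62, 0.61, 0.63, 0.64]

lr, best, best_epoch, wait = 1e-3, float("inf"), -1, 0
for epoch, loss in enumerate(val_losses):
    if loss < best:                      # checkpoint: remember only the best epoch
        best, best_epoch, wait = loss, epoch, 0
    else:
        wait += 1
        if wait >= 2:                    # patience=2: halve the LR on a plateau
            lr, wait = lr * 0.5, 0

print(best_epoch, best, lr)  # 2 0.6 0.00025
```

With the actual callbacks, the saved weights from the best epoch are what you restore at the end, so the increasing tail of the loss curve never reaches the final model.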

What is the difference between test and validation specifically in Mask-R-CNN?

I have my own image dataset and use Mask R-CNN for training. There you divide your dataset into train, validation and test.
I want to know the difference between validation and test.
I know that validation in general is used to assess the quality of the NN after each epoch. Based on that you can see how good the NN is and whether overfitting is happening.
But I want to know whether the NN learns from the validation set.
Based on the train set, the NN learns after each image and adjusts each neuron to reduce the loss. And after the NN has finished learning, we use the test set to see how good our NN really is on new, unseen images.
But what exactly happens in Mask R-CNN based on the validation set? Is the validation set only there for seeing the results? Or will some parameters be adjusted based on the validation result to avoid overfitting? And even if this is the case, how much influence does the validation set have on the parameters? Will the neurons themselves be adjusted or not?
If the influence is very, very small, then I will choose the validation set equal to the test set, because I don't have many images (800).
So basically I want to know the difference between test and validation in Mask R-CNN, that is, how and how much the validation set influences the NN.
The model does not learn from the validation set. The validation set is just used to give an approximation of the generalization error at any epoch, but also, crucially, for hyperparameter optimization. So I can iterate over several different hyperparameter configurations and evaluate their accuracy on the validation set.
Then, after we choose the best model based on the validation set accuracies, we can calculate the test error on the test set. Ideally there is not a large difference between test set and validation set accuracies. Sometimes your model can essentially 'overfit' to the validation set if you iterate over lots of different hyperparameters.
Reserving another set, the test set, to evaluate on after this validation-set evaluation is a luxury you may have if you have a lot of data. Often you may lack enough labelled data for it even to be worth holding back a separate test set.
Lastly, these things are not specific to Mask R-CNN. Validation sets never affect the training of a model, i.e. the weights or biases. Validation sets, like test sets, are purely for evaluation purposes.
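A minimal sketch of this train/validation/test workflow, using scikit-learn with a toy dataset and logistic regression as a hypothetical stand-in model (the idea carries over unchanged to Mask R-CNN hyperparameters):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
# 60/20/20 split: train / validation (model selection) / test (final report)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Hyperparameters are chosen on the validation set only; training never sees it
scores = {}
for C in (0.01, 0.1, 1.0, 10.0):
    scores[C] = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr).score(X_val, y_val)
best_C = max(scores, key=scores.get)

# The test set is touched exactly once, after all choices have been made
test_acc = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr, y_tr).score(X_test, y_test)
print(best_C, round(test_acc, 3))
```

Note that the weights are only ever fit on X_tr; the validation set influences which configuration you keep, never the gradient updates themselves.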

Which model to pick from K fold Cross Validation

I was reading about cross-validation and how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and go for 10-fold cross-validation. I think each of the 10 folds will have different coefficient values. Now, of the 10 different models, which should I pick as my final model or parameter estimate?
Or do we use cross-validation only for the purpose of finding an average error (the average of 10 models in our case) and comparing it against another model?
If you build a linear regression model and go for 10-fold cross-validation, indeed each of the 10 folds will produce different coefficient values. The reason you use cross-validation is that you get a robust idea of the error of your linear model, rather than just evaluating it on one train/test split, which could be unfortunate or too lucky. CV is more robust, as no ten splits can all be lucky or all unfortunate.
Your final model is then trained on the whole training set; this is where your final coefficients come from.
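This two-step procedure - CV for the error estimate, then one final refit on everything - can be sketched with scikit-learn on toy data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# 10-fold CV: 10 fits, 10 sets of coefficients, used only to estimate the error
cv_scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print(cv_scores.mean(), cv_scores.std())

# The final model (and its final coefficients) is refit on all the training data
final_model = LinearRegression().fit(X, y)
print(final_model.coef_)
```

None of the 10 per-fold coefficient sets is "the" answer; they are discarded once the error estimate is in hand.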
Cross-validation is used to see how good your model's predictions are. It cleverly runs multiple tests on the same data by splitting it, as you probably know (i.e. if you don't have enough training data, this is good to use).
As an example, it might be used to make sure you aren't overfitting the function. So basically you try your function with cross-validation when you've finished it, and if you see that the error grows a lot somewhere, you go back to tweaking the parameters.
Edit:
Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we repeatedly split the data in different proportions and validate its performance. It should be noted that the parameters of the model remain the same throughout the cross-validation process.
In grid search we try to find the parameters that give the best results over a specific split of the data (say 70% train and 30% test). So in this case, for the different configurations of the same model, the dataset remains constant.
Read more about cross-validation here.
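In practice the two are often combined: scikit-learn's GridSearchCV scores every parameter combination with cross-validation rather than on a single split. A minimal sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Grid search supplies the candidate parameters; cross-validation (cv=5)
# scores each candidate on five splits instead of one
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

This addresses the weakness noted above: a parameter choice that only looked good on one lucky 70/30 split will not survive five splits.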
Cross-validation is mainly used for the comparison of different models.
For each model, you get the average generalization error over the k validation sets. Then you are able to choose the model with the lowest average generalization error as your optimal model.
Cross-validation (CV) allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (directly related to the question)
Yes, CV can be used to find out which method (SVM, random forest, etc.) will perform best, and we can pick that method to work with further.
(For each method, different models are generated and evaluated, an average metric is calculated per method, and the best average metric helps in selecting the method.)
After getting the information about the best method or the best parameters, we can train/retrain our model on the training dataset.
The parameters or coefficients themselves can be determined by grid search techniques. See grid search.
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing on it. Dividing such a small amount of data into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV comes to the rescue here. In this case, we don't need a separate validation set, but we still need to hold out the test data.
A model is trained on k-1 folds of the training data, and the remaining fold is used for validation. The mean and standard deviation of the metric are computed to see how well the model will perform in practice.
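A minimal sketch of Scenario-2 with scikit-learn on toy data: the test set is held out once, and k-fold CV inside the remaining data produces the mean and standard deviation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
# Hold the test set out entirely; CV happens only inside the training portion
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = []
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_tr):
    model = LogisticRegression(max_iter=1000).fit(X_tr[tr_idx], y_tr[tr_idx])
    scores.append(model.score(X_tr[val_idx], y_tr[val_idx]))

# Mean and standard deviation summarize the expected real-world performance
print(np.mean(scores), np.std(scores))
```

Every sample in the training portion serves as validation data exactly once, which is what rescues the small-data case.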