Perform data transformation on training data inside cross-validation

I would like to do cross validation with 5 folds. In each fold, I have a training and a validation set. However, due to a data issue, I need to transform my data. First, I transform the training data, train the model, apply the transformation rule to the validation data, and then test the model. I need to redo the transformation for every fold. How would I do that in H2O? I can't find a way to separate the transformation part out. Does anyone have any suggestions?
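Not an H2O-specific answer, but a minimal sketch of the per-fold pattern described above, using scikit-learn with a StandardScaler standing in for the transformation (both the library and the transformation are illustrative assumptions, not part of the original question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler  # stand-in for "the transformation"

X, y = make_classification(n_samples=500, random_state=0)
scores = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]

    # learn the transformation on the training fold only
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)

    # apply the *same* fitted transformation to the validation fold, then score
    scores.append(model.score(scaler.transform(X_valid), y_valid))

print(np.mean(scores))
```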

Related

How to validate my YOLO model trained on custom data set?

I am doing research on object detection using YOLO, although I am from the civil engineering field and not familiar with computer science. My advisor is asking me to validate my YOLO detection model trained on a custom dataset, but my problem is that I really don't know how to validate my model. So, please kindly point out how to validate my model.
Thanks in advance.
I think first you need to make sure that all the cases you are interested in (location of objects, their size, general view of the scene, etc.) are represented in your custom dataset - in other words, that the collected data reflects your task. You can discuss it with your advisor. The main rule: label your data qualitatively, in the same manner as you want to see it in the output. More information can be found here.
This is really important - garbage in, garbage out: the quality of the output of your trained model is determined by the quality of the input (the labelled data).
If this is done, it is common practice to split your data into training and test sets. During model training only the train set is used, and you can later validate the quality (generalizing ability, robustness, etc.) on data that the model did not see - the test set. It is also important that these two subsets don't overlap - otherwise your model will be overfitted and will not perform the task properly.
Then you can train a few different models (with some architectural changes, for example) on the same train set and validate them on the same test set, and this is a regular validation process.
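For illustration, a minimal train/test split sketch in Python with scikit-learn (the file names and labels below are placeholders, not from the question):

```python
from sklearn.model_selection import train_test_split

# placeholders for your own labelled data
image_paths = [f"img_{i}.jpg" for i in range(100)]
labels = [i % 2 for i in range(100)]

train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels,
    test_size=0.2,        # e.g. 80% train / 20% test
    stratify=labels,      # keep class proportions similar in both sets
    random_state=42,
)
```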

Is it okay if we augment the data first then randomly choose the data and split the data afterward?

I am doing a science project about classifying medical images, but I do not have a lot of data. Is it okay if I augment the data first, then randomly select the data to keep, and split the kept data afterward? At first, my teacher told me to augment the data first and then split the data into train, validation, and test sets. But I think that approach will make the training dataset collide with the testing dataset, which will cause the accuracy to be unrealistic (way too high). So I thought my method of randomly choosing the files after doing data augmentation should keep the augmented data from being too similar to each other and also solve the imbalanced dataset problem.
We want our model to generalize well beyond the training set, so technically we should do data augmentation only on the training set. I would suggest that you split your dataset into training, validation, and testing sets first, then do data augmentation only on the training set.
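A minimal sketch of that order, assuming scikit-learn for the split and a hypothetical `augment` function standing in for whatever augmentation you use (flips, rotations, etc.):

```python
from sklearn.model_selection import train_test_split

images = [f"scan_{i}.png" for i in range(200)]  # placeholder file names
labels = [i % 2 for i in range(200)]

# split first: train / validation / test
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)

def augment(paths, targets):
    # stand-in: in practice this would create transformed copies of each image
    return paths + [p + "_flipped" for p in paths], targets + targets

# augment the training set only; validation and test stay untouched
X_train_aug, y_train_aug = augment(X_train, y_train)
```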

Which model to pick from K fold Cross Validation

I was reading about cross validation and about how it is used to select the best model and estimate parameters, but I did not really understand the meaning of it.
Suppose I build a linear regression model and go for 10-fold cross validation. I think each of the 10 models will have different coefficient values, so from the 10 different models, which should I pick as my final model, or how do I estimate the parameters?
Or do we use cross validation only for the purpose of finding an average error (the average of the 10 models in our case) and comparing it against another model?
If you build a linear regression model and go for 10-fold cross validation, indeed each of the 10 models will have different coefficient values. The reason you use cross validation is that you get a robust idea of the error of your linear model - rather than just evaluating it on one train/test split only, which could be unfortunate or too lucky. CV is more robust because it is very unlikely that all ten splits are lucky or all ten unfortunate.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
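As an illustrative sketch (the question does not name a library, so scikit-learn is assumed here): 10-fold CV estimates the error, and the final coefficients come from one fit on the whole training set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# 10-fold CV gives a robust estimate of the model's error ...
cv_scores = cross_val_score(LinearRegression(), X, y,
                            cv=10, scoring="neg_mean_squared_error")
print("CV MSE:", -np.mean(cv_scores))

# ... while the final coefficients come from fitting on the whole training set
final_model = LinearRegression().fit(X, y)
print(final_model.coef_)
```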
Cross-validation is used to see how good your model's predictions are. It is a clever way of running multiple tests on the same data by splitting it, as you probably know (i.e. it is good to use if you don't have a lot of training data).
As an example, it might be used to make sure you aren't overfitting the function. So basically, once you've finished your function, you evaluate it with cross-validation, and if you see that the error grows a lot somewhere, you go back to tweaking the parameters.
Edit:
Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we try randomly splitting the data in different proportions and validate its performance. It should be noted that the parameters of the model remain the same throughout the cross-validation process.
In Grid-search we try to find the best possible parameters that would give the best results over a specific split of data (say 70% train and 30% test). So in this case, for different combinations of the same model, the dataset remains constant.
Read more about cross-validation here.
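For illustration, a minimal sketch of the grid-search idea described above - one fixed 70/30 split, the same model tried with different parameter values (scikit-learn and the SVC parameter C are assumptions made for the example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)   # the data split stays constant

best_score, best_C = None, None
for C in [0.1, 1, 10, 100]:                # candidate parameter values
    score = SVC(C=C).fit(X_train, y_train).score(X_test, y_test)
    if best_score is None or score > best_score:
        best_score, best_C = score, C
print(best_C, best_score)
```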
Cross Validation is mainly used for the comparison of different models.
For each model, you get the average generalization error over the k validation sets. Then you can choose the model with the lowest average generalization error as your optimal model.
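A minimal illustrative sketch of that comparison, assuming scikit-learn and two arbitrary candidate models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
for name, model in [("SVM", SVC()),
                    ("Random Forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)  # score on each of the 5 validation folds
    print(name, "mean accuracy:", np.mean(scores))
```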
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to know which method (SVM, Random Forest, etc.) will perform best, and we can pick that method to work with further.
(For each method, different models will be generated and evaluated, an average metric is calculated per method, and the best average metric helps in selecting the method.)
After getting the information about the best method or best parameters, we can train/retrain our model on the training dataset.
For parameters or coefficients, these can be determined by grid search techniques. See grid search
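For illustration, a minimal grid-search sketch; scikit-learn's GridSearchCV is shown here as one common implementation (an assumption, since the answer does not name a library):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)  # the parameter combination with the best average CV score
```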
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing on it. Dividing such a small amount of data into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV comes to the rescue here. In this case, we don't need a separate validation set, but we still need to hold out the test data.
A model will be trained on k-1 folds of the training data and the remaining fold will be used for validation. The mean and standard deviation of the metric across folds give a sense of how well the model will perform in practice.
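A minimal sketch of Scenario-2, assuming scikit-learn: hold out a test set, run k-fold CV on the rest, and report the mean and standard deviation of the validation scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("validation accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
# the untouched test set is kept for the final evaluation only
```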

Should I split my data into training/testing/validation sets with k-fold-cross validation?

When evaluating a recommender system, one could split his data into three pieces: training, validation and testing sets. In such case, the training set would be used to learn the recommendation model from data and the validation set would be used to choose the best model or parameters to use. Then, using the chosen model, the user could evaluate the performance of his algorithm using the testing set.
I have found a documentation page for scikit-learn cross validation (http://scikit-learn.org/stable/modules/cross_validation.html) where it says that it is not necessary to split the data into three pieces when using k-fold cross validation, but only into two: training and testing.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles).
I am wondering if this would be a good approach. And if so, could someone show me a reference to an article/book backing this theory up?
Cross validation does not avoid the validation set, it simply uses many of them. In other words, instead of one split into three parts, you have one split into two, and what you now call "training" is actually what previously was training plus validation. CV is simply about repeated splits (in a slightly smarter manner than just randomly) into train and test, and then averaging the results. The theory backing it up is widely available in pretty much any good ML book; the crucial bit is "should I use it", and the answer is surprisingly simple - only if you do not have enough data to do one split. CV is used when you do not have enough data for each of the splits to be representative of the distribution you are interested in; doing repeated splits then simply reduces the variance. Furthermore, for really small datasets one does nested CV - one for the [train+val][test] split and an internal one for [train][val] - so the variance of both model selection and its final evaluation is reduced.
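A minimal sketch of nested CV as described above, assuming scikit-learn: an inner loop (here via GridSearchCV) for model selection and an outer loop for the final evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)  # [train][val] splits
outer_scores = cross_val_score(inner, X, y, cv=5)                  # [train+val][test] splits
print("nested CV accuracy: %.3f +/- %.3f" % (np.mean(outer_scores), np.std(outer_scores)))
```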

How is cross validation implemented?

I'm currently trying to train a neural network using cross validation, but I'm not sure if I'm getting how cross validation works. I understand the concept, but I can't totally see yet how the concept translates to code implementation. The following is a description of what I've got implemented, which is more-or-less guesswork.
I split the entire data set into K-folds, where 1 fold is the validation set, 1 fold is the testing set, and the data in the remaining folds are dumped into the training set.
Then, I loop K times, each time reassigning the validation and testing sets to other folds. Within each loop, I continuously train the network (update the weights) using only the training set until the error produced by the network meets some threshold. However, the error that is used to decide when to stop training is produced using the validation set, not the training set. After training is done, the error is once again produced, but this time using the testing set. This error from the testing set is recorded. Lastly, all the weights are re-initialized (using the same random number generator used to initialize them originally) or reset in some fashion to undo the learning that was done before moving on to the next set of validation, training, and testing sets.
Once all K loops finish, the errors recorded in each iteration of the K-loop are averaged.
I have bolded the parts I'm most confused about. Please let me know if I made any mistakes!
I believe your implementation of Cross Validation is generally correct. To answer your questions:
However, the error that is used to decide when to stop training is produced using the validation set, not the training set.
You want to use the error on the validation set because it reduces overfitting. This is the reason you always want to have a validation set. If you stopped based on the training error instead, you could keep pushing the threshold lower, and your algorithm would achieve a higher training accuracy than validation accuracy. However, this would generalize poorly to unseen examples in the real world, which is what your validation set is supposed to model.
Lastly, all the weights are re-initialized (using the same random number generator used to initialize them originally) or reset in some fashion to undo the learning that was done before moving on to the next set of validation, training, and testing sets.
The idea behind cross validation is that each iteration is like training the algorithm from scratch. This is desirable since by averaging your validation score, you get a more robust value. It protects against the possibility of a biased validation set.
My only suggestion would be to not use a test set in your cross validation scheme: since your validation set already models unseen examples, a separate test set during the cross validation is redundant. I would instead split the data into a training and test set before you start cross validation. I would then not touch the test set until you want to gain an objective score for your algorithm.
You could use your cross validation score as an indication of performance on unseen examples; I assume, however, that you will be choosing parameters based on this score, optimizing your model for your training set. Again, the possibility arises that this does not generalize well to unseen examples, which is why it is good practice to keep a separate unseen test set, which is only used after you have optimized your algorithm.
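A minimal sketch of that scheme, assuming scikit-learn (the MLP and its parameter grid are placeholders for your own network and hyperparameters): split off a test set first, tune with CV on the rest, and use the test set exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# CV on the training data only, to pick parameters (hidden layer size here)
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                      param_grid={"hidden_layer_sizes": [(16,), (32,)]}, cv=5)
search.fit(X_train, y_train)

# the held-out test set is used once, for the final objective score
print("test accuracy:", search.score(X_test, y_test))
```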
