What is the difference between cross-validation and grid search? - cross-validation

In simple words, what is the difference between cross-validation and grid search? How does grid search work? Should I do first a cross-validation and then a grid search?

Cross-validation is when you reserve part of your data to use in evaluating your model. There are different cross-validation methods. The simplest conceptually is to just take 70% (just making up a number here, it doesn't have to be 70%) of your data and use that for training, and then use the remaining 30% of the data to evaluate the model's performance. The reason you need different data for training and evaluating the model is to protect against overfitting. There are other (slightly more involved) cross-validation techniques, of course, like k-fold cross-validation, which often used in practice.
Grid search is a method to perform hyper-parameter optimisation, that is, it is a method to find the best combination of hyper-parameters (an example of an hyper-parameter is the learning rate of the optimiser), for a given model (e.g. a CNN) and test dataset. In this scenario, you have several models, each with a different combination of hyper-parameters. Each of these combinations of parameters, which correspond to a single model, can be said to lie on a point of a "grid". The goal is then to train each of these models and evaluate them e.g. using cross-validation. You then select the one that performed best.
To give a concrete example, if you're using a support vector machine, you could use different values for gamma and C. So, for example, you could have a grid with the following values for (gamma, C): (1, 1), (0.1, 1), (1, 10), (0.1, 10). It's a grid because it's like a product of [1, 0.1] for gamma and [1, 10] for C. Grid-search would basically train a SVM for each of these four pair of (gamma, C) values, then evaluate it using cross-validation, and select the one that did best.

Cross-validation is a method for robustly estimating test-set performance (generalization) of a model.
Grid-search is a way to select the best of a family of models, parametrized by a grid of parameters.
Here, by "model", I don't mean a trained instance, more the algorithms together with the parameters, such as SVC(C=1, kernel='poly').

Cross-validation, simply separating test and training data and validate training results with test data. There are two cross validation techniques that I know.
First, Test/Train cross validation. Splitting data as test and train.
Second, k-fold cross-validation split your data into k bins, use each bin as testing data and use rest of the data as training data and validate against testing data. Repeat the process k times. And Get the average performance. k-fold cross validation especially useful for small dataset since it maximizes both the test and training data.
Grid Search; systematically working through multiple combinations of parameter tunes, cross validate each and determine which one gives the best performance.You can work through many combination only changing parameters a bit.

Cross-validation is a method of reserving a particular subset of your dataset on which you do not train the model. Later, you test your model on this subset before finalizing it.
The main steps you need to perform to do cross-validation are:
Split the whole dataset in training and test datasets (e.g. 80% of the whole dataset is the training dataset and the remaining 20% is the test dataset)
Train the model using the training dataset
Test your model on the test dataset. If your model performs well on the test dataset, continue the training process
There are other cross-validation methods, for example
Leave-one-out cross-validation (LOOCV)
K-fold cross-validation
Stratified K-fold cross-validation
Adversarial cross-validation strategies (used when train and rest datasets are differ largely from each other).

In simple terms,
consider making pasta as building a model:
Cross validation - choosing the quantity of pasta
Grid search - choosing the right proportion of ingredients.

Related

The value of K in KNN & the accuracy of the model

I am working on KNN algorithm, I have some questions please I need answers:
I tried different values of K such as 3, 5, 7, and sqrt(n)=73. I get different accuracies according to these different values of K. What K should I use in my model and why ??
What is the best percentage that I should use to split the dataset into train and test sets ??
why the accuracy of the train set is always greater than the accuracy of test set ??
Which accuracy (train accuracy or test accuracy) is used to describe the overall model accuracy ??
Choosing the value of K is not a exact science. In this post, user20160 explained a procedure to choose a good K using k-fold cross validation.
Usually, 80/20 and 70/30 ratios are used but once again, it is not a absolute truth. If your train set ratio is too large, your model could overfit which means your model learns specially for the train set and will not be performant with real cases. On another hand, if your train set is too small, your model could underfit.
The accuracy of the train set is often greater than the test one because your model is only trained with the train set and test dataset are cases your model has never seen before. It is like learning to ride a bike with only one bike and being evaluated with another bike.
The train accuracy is not realistic to evaluate the performance of your model since train dataset is well known by your model. Test accuracy is more relevant because test dataset is new and has never been seen by your model.
To better understand everything about model evaluation I strongly recommend you to take look at the cross validation link above.

Model tuning with Cross validation

I have a model tuning object that fits multiple models and tunes each one of them to find the best hyperparameter combination for each of the models. I want to perform cross-validation on the model tuning part and this is where I am facing a dilemma.
Let's assume that I am fitting just the one model- a random forest classifier and performing a 5 fold cross-validation. Currently, for the first fold that I leave out, I fit the random forest model and perform the model tuning. I am performing model tuning using the dlib package. I calculate the evaluation metric(accuracy, precision, etc) and select the best hyper-parameter combination.
Now when I am leaving out the second fold, should I be tuning the model again? Because if I do, I will get a different combination of hyperparameters than I did in the first case. If I do this across the five folds, what combination do I select?
The cross validators present in spark and sklearn use grid search so for each fold they have the same hyper-parameter combination and don't have to bother about hyper-parameter combinations changing across folds
Choosing the best hyper-parameter combination that I get when I leave out the first fold and using it for the subsequent folds doesn't sound right because then my entire model tuning is dependent on which fold got left out first. However, if I am getting different hyperparameters each time, which one do I settle on?
TLDR:
If you are performing let's say a derivative based model tuning along with cross-validation, your hyper-parameter combination changes as you iterate over folds. How do you select the best combination then? Generally speaking, how do you use cross-validation with derivative-based model tuning methods.
PS: Please let me know if you need more details
This is more of a comment, but it is too long for this, so I post it as an answer instead.
Cross-validation and hyperparameter tuning are two separate things. Cross Validation is done to get a sense of the out-of-sample prediction error of the model. You can do this by having a dedicated validation set, but this raises the question if you are overfitting to this particular validation data. As a consequence, we often use cross-validation where the data are split in to k folds and each fold is used once for validation while the others are used for fitting. After you have done this for each fold, you combine the prediction errors into a single metric (e.g. by averaging the error from each fold). This then tells you something about the expected performance on unseen data, for a given set of hyperparameters.
Once you have this single metric, you can change your hyperparameter, repeat, and see if you get a lower error with the new hyperparameter. This is the hpyerparameter tuning part. The CV part is just about getting a good estimate of the model performance for the given set of hyperparameters, i.e. you do not change hyperparameters 'between' folds.
I think one source of confusion might be the distinction between hyperparameters and parameters (sometimes also referred to as 'weights', 'feature importances', 'coefficients', etc). If you use a gradient-based optimization approach, these change between iterations until convergence or a stopping rule is reached. This is however different from hyperparameter search (e.g. how many trees to plant in the random forest?).
By the way, I think questions like these should better be posted to the Cross-Validated or Data Science section here on StackOverflow.

Which model to pick from K fold Cross Validation

I was reading about cross validation and about how it it is used to select the best model and estimate parameters , I did not really understand the meaning of it.
Suppose I build a Linear regression model and go for a 10 fold cross validation, I think each of the 10 will have different coefficiant values , now from 10 different which should I pick as my final model or estimate parameters.
Or do we use Cross Validation only for the purpose of finding an average error(average of 10 models in our case) and comparing against another model ?
If your build a Linear regression model and go for a 10 fold cross validation, indeed each of the 10 will have different coefficient values. The reason why you use cross validation is that you get a robust idea of the error of your linear model - rather than just evaluating it on one train/test split only, which could be unfortunate or too lucky. CV is more robust as no ten splits can be all ten lucky or all ten unfortunate.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
Cross-validation is used to see how good your models prediction is. It's pretty smart making multiple tests on the same data by splitting it as you probably know (i.e. if you don't have enough training data this is good to use).
As an example it might be used to make sure you aren't overfitting the function. So basically you try your function when you've finished it with Cross-validation and if you see that the error grows a lot somewhere you go back to tweaking the parameters.
Edit:
Read the wikipedia for deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing Grid-search with cross-validation. The idea behind cross-validation is basically to check how well a model will perform in say a real world application. So we basically try randomly splitting the data in different proportions and validate it's performance. It should be noted that the parameters of the model remain the same throughout the cross-validation process.
In Grid-search we try to find the best possible parameters that would give the best results over a specific split of data (say 70% train and 30% test). So in this case, for different combinations of the same model, the dataset remains constant.
Read more about cross-validation here.
Cross Validation is mainly used for the comparison of different models.
For each model, you may get the average generalization error on the k validation sets. Then you will be able to choose the model with the lowest average generation error as your optimal model.
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to know which method (SVM, Random Forest, etc) will perform best and we can pick that method to work further.
(From these methods different models will be generated and evaluated for each method and an average metric is calculated for each method and the best average metric will help in selecting the method)
After getting the information about the best method/ or best parameters we can train/retrain our model on the training dataset.
For parameters or coefficients, these can be determined by grid search techniques. See grid search
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing on data. Then dividing such a small amount of data into three sets reduce the training samples drastically and the result will depend on the choice of pairs of training and validation sets.
CV will come to the rescue here. In this case, we don't need the validation set but we still need to hold the test data.
A model will be trained on k-1 folds of training data and the remaining 1 fold will be used for validating the data. A mean and standard deviation metric will be generated to see how well the model will perform in practice.

Validation of hurdle model?

I built a hurdle model, and then used that model to predict from known to unknown data points using the predict command. Is there a way to validate the model and these predictions? Do I have to do this in two parts, for example using sensitivity and specificity for the binomial part of the model?
Any other ideas for how to assess the validity of this model?
For validating predictive models, I usually trust Cross-Validation.
In short: With cross-validation you can measure the predictive performance of your model using only the training data (data with known results). Thus you can get a general opinion on how your model works. Cross-validation works quite well for wide variety of different models. The downside is that it can get quite computation heavy.
With large data sets, 10-fold cross-validation is enough. The smaller your dataset is, the more "folds" you have to do (i.e. with very small datasets, you have to do leave-one-out cross-validation)
With cross-validation, you get predictions for the whole data set. You can then compare these predictions to the actual outputs and measure how well your model performed.
Cross-validated results can take a bit to understand in more complicated comparisons, but for your general purpose question "how to assess the validity of the model", the results should be quite easy to use.

what is the theoretical benchmark upon which to make select the artifical network to predicting data?

I have trained neural networks for 50 times(training) now I am not sure which network (from 50nets) I should choose to predict my data?Among MSE(mean square error) of network or R-squred or validation performance or test performance?....Thanks for any suggestion
If you have a training data, which you actually use to build a predictive model (so you need network for some actual application, and not for research purposes) then you should just have split with two subsets: training and validation, nothing else. Thus when you run your 50 trainings (over k-fold CV or random splits) you are supposed to use it to find out what set of hyperparameters is the best. And you select hyperparameters which lead to the best mean score over these 50 splits. Then you retrain your model on the whole dataset with these hyperparameters. Similarly - if you want to select between K algorithms, you use each of them to approximate their generalization capabilities, select the one with biggest mean score, and retrain this particular model on the whole dataset.

Resources