Validation of hurdle model? - validation

I built a hurdle model, and then used that model to predict from known to unknown data points using the predict command. Is there a way to validate the model and these predictions? Do I have to do this in two parts, for example using sensitivity and specificity for the binomial part of the model?
Any other ideas for how to assess the validity of this model?

For validating predictive models, I usually trust Cross-Validation.
In short: with cross-validation you can measure the predictive performance of your model using only the training data (data with known results), which gives you a general picture of how well your model works. Cross-validation works quite well for a wide variety of models. The downside is that it can be computationally heavy.
With large data sets, 10-fold cross-validation is usually enough. The smaller your dataset is, the more "folds" you have to do (i.e. with very small datasets, you have to do leave-one-out cross-validation).
With cross-validation, you get predictions for the whole data set. You can then compare these predictions to the actual outputs and measure how well your model performed.
Cross-validated results can take a bit to understand in more complicated comparisons, but for your general purpose question "how to assess the validity of the model", the results should be quite easy to use.
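As a rough illustration of the idea above applied to the binomial part of a hurdle model, here is a minimal sketch in Python with scikit-learn. The data are synthetic and the logistic regression is only a stand-in for the zero/non-zero part of the model (both are assumptions, not the asker's setup). It collects out-of-fold predictions and then computes sensitivity and specificity from them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))

# Synthetic hurdle-style counts (made up for illustration): a binary
# "is the count non-zero" step, then a positive count for rows that cross it.
p_nonzero = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x[:, 0])))
nonzero = rng.random(n) < p_nonzero
counts = np.where(nonzero, 1 + rng.poisson(np.exp(0.3 * x[:, 1])), 0)

# Out-of-fold predictions for the binary part: every row is predicted by a
# model that did not see that row during fitting.
y_binary = (counts > 0).astype(int)
oof_pred = cross_val_predict(LogisticRegression(), x, y_binary, cv=10)

tn, fp, fn, tp = confusion_matrix(y_binary, oof_pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

The count part could be assessed the same way, by comparing out-of-fold predicted counts with the observed counts on the non-zero rows.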

Related

Why a "single variable model" overperforms a multivariate model for classification?

I have a dataset with 2 possible outcomes, disease vs healthy. When looking for biomarkers, there is one variable that yields a higher AUROC than a model built with 5 variables including that same feature.
It is hard to answer your question without more information about the data and model you're using.
Generally speaking, making a model more complex (e.g. by adding additional predictors) increases the risk of overfitting to the training data, which can lead to bad performance on the test data.
Another possible reason for decreasing predictive performance when adding additional predictors is multicollinearity between the predictors. You can check this by looking at the correlations between them or, in regression models, at variance inflation factors.
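For the multicollinearity check, a variance inflation factor can be computed directly from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below uses NumPy only; the function name and the toy data are made up for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)   # nearly a copy of a
c = rng.normal(size=200)             # unrelated to a and b
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

In this toy data the first two columns should show large values and the third a value close to 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.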

Model tuning with Cross validation

I have a model tuning object that fits multiple models and tunes each one of them to find the best hyperparameter combination for each of the models. I want to perform cross-validation on the model tuning part and this is where I am facing a dilemma.
Let's assume that I am fitting just one model, a random forest classifier, and performing 5-fold cross-validation. Currently, for the first fold that I leave out, I fit the random forest model and perform the model tuning. I am performing model tuning using the dlib package. I calculate the evaluation metric (accuracy, precision, etc.) and select the best hyper-parameter combination.
Now when I am leaving out the second fold, should I be tuning the model again? Because if I do, I will get a different combination of hyperparameters than I did in the first case. If I do this across the five folds, what combination do I select?
The cross validators present in Spark and sklearn use grid search, so for each fold they have the same hyper-parameter combination and don't have to worry about hyper-parameter combinations changing across folds.
Choosing the best hyper-parameter combination that I get when I leave out the first fold and using it for the subsequent folds doesn't sound right because then my entire model tuning is dependent on which fold got left out first. However, if I am getting different hyperparameters each time, which one do I settle on?
TLDR:
If you are performing, say, derivative-based model tuning along with cross-validation, your hyper-parameter combination changes as you iterate over folds. How do you select the best combination then? Generally speaking, how do you use cross-validation with derivative-based model tuning methods?
PS: Please let me know if you need more details
This is more of a comment, but it is too long for this, so I post it as an answer instead.
Cross-validation and hyperparameter tuning are two separate things. Cross-validation is done to get a sense of the out-of-sample prediction error of the model. You can do this by having a dedicated validation set, but this raises the question of whether you are overfitting to this particular validation data. As a consequence, we often use cross-validation, where the data are split into k folds and each fold is used once for validation while the others are used for fitting. After you have done this for each fold, you combine the prediction errors into a single metric (e.g. by averaging the error from each fold). This then tells you something about the expected performance on unseen data, for a given set of hyperparameters.
Once you have this single metric, you can change your hyperparameters, repeat, and see if you get a lower error with the new hyperparameters. This is the hyperparameter tuning part. The CV part is just about getting a good estimate of the model performance for the given set of hyperparameters, i.e. you do not change hyperparameters 'between' folds.
I think one source of confusion might be the distinction between hyperparameters and parameters (sometimes also referred to as 'weights', 'feature importances', 'coefficients', etc). If you use a gradient-based optimization approach, these change between iterations until convergence or a stopping rule is reached. This is however different from hyperparameter search (e.g. how many trees to plant in the random forest?).
By the way, I think questions like these are better suited to Cross Validated or Data Science Stack Exchange than to Stack Overflow.
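The procedure described in this answer can be sketched as follows. This is only an illustration under stated assumptions: ridge regression with a single hyperparameter `alpha` stands in for the random forest, the data are synthetic, and the helper names `ridge_fit` and `cv_error` are made up. The point is that each candidate value is held fixed across all folds, the fold errors are averaged into one score per candidate, and the winner is refit on all the data.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution (no intercept, for brevity)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def cv_error(X, y, alpha, folds):
    # Mean squared error averaged over folds, with alpha held fixed throughout
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        w = ridge_fit(X[train_mask], y[train_mask], alpha)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

# The same 5 folds are reused for every candidate so the scores are comparable
folds = np.array_split(rng.permutation(len(y)), 5)
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {alpha: cv_error(X, y, alpha, folds) for alpha in candidates}
best_alpha = min(scores, key=scores.get)

# Refit on all the data with the chosen hyperparameter value
final_w = ridge_fit(X, y, best_alpha)
print(best_alpha, round(scores[best_alpha], 3))
```

A derivative-based or other sequential tuner fits the same pattern: the quantity it optimizes would be the averaged score returned by `cv_error`, not the error on a single fold.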

Which model to pick from K fold Cross Validation

I was reading about cross validation and how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and go for a 10-fold cross validation. I think each of the 10 fits will have different coefficient values. Which of these 10 should I pick as my final model or parameter estimates?
Or do we use Cross Validation only for the purpose of finding an average error(average of 10 models in our case) and comparing against another model ?
If you build a linear regression model and go for a 10-fold cross validation, each of the 10 fits will indeed have different coefficient values. The reason to use cross validation is that it gives you a robust idea of the error of your linear model, rather than evaluating it on a single train/test split, which could be unusually unlucky or unusually lucky. CV is more robust because it is unlikely that all ten splits are lucky or all ten are unlucky.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
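A minimal NumPy sketch of this answer, on synthetic data invented for illustration: the ten per-fold fits are used only to estimate the error, and the final coefficients come from one fit on all the training data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Intercept column plus 2 predictors; the true coefficients are made up
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=1.0, size=200)

folds = np.array_split(rng.permutation(200), 10)
fold_coefs, fold_mse = [], []
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(200), val_idx)
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    fold_coefs.append(coef)   # differs slightly from fold to fold
    fold_mse.append(np.mean((X[val_idx] @ coef - y[val_idx]) ** 2))

# The cross-validated error estimate is the average over the 10 folds
cv_mse = float(np.mean(fold_mse))

# The final model is fit once on all the training data
final_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("CV MSE:", round(cv_mse, 3))
print("final coefficients:", np.round(final_coef, 2))
```

The ten entries of `fold_coefs` are discarded after the error has been estimated; none of them is "the" model.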
Cross-validation is used to see how good your model's predictions are. It makes smart use of the data by running multiple tests on the same data set, splitting it differently each time, which is especially useful if you don't have much training data.
As an example it might be used to make sure you aren't overfitting the function. So basically you try your function when you've finished it with Cross-validation and if you see that the error grows a lot somewhere you go back to tweaking the parameters.
Edit:
Read the wikipedia for deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we try randomly splitting the data in different ways and validate its performance on each split. It should be noted that the hyperparameters of the model remain the same throughout the cross-validation process.
In Grid-search we try to find the best possible parameters that would give the best results over a specific split of data (say 70% train and 30% test). So in this case, for different combinations of the same model, the dataset remains constant.
Read more about cross-validation here.
Cross Validation is mainly used for the comparison of different models.
For each model, you can compute the average generalization error on the k validation sets. You can then choose the model with the lowest average generalization error as your optimal model.
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to know which method (SVM, Random Forest, etc) will perform best and we can pick that method to work further.
(For each method, a model is trained and evaluated on every fold, the metric is averaged over the folds, and the method with the best average metric is selected.)
After identifying the best method or the best hyperparameters, we can retrain the model on the whole training dataset.
Hyperparameters can be chosen with grid search techniques. See grid search
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing on it. Dividing such a small amount of data into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV will come to the rescue here. In this case, we don't need the validation set but we still need to hold the test data.
A model will be trained on k-1 folds of training data and the remaining 1 fold will be used for validating the data. A mean and standard deviation metric will be generated to see how well the model will perform in practice.
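Scenario 2 can be sketched with scikit-learn as follows. The data are synthetic and the logistic regression is an arbitrary stand-in model (both assumptions); the split sizes are also just examples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

# Hold back a test set that is not touched during model development
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 5-fold CV on the training portion replaces a separate validation set
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final check on the held-out test data after refitting on all training data
final_model = LogisticRegression().fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```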

4 fold cross validation | Caffe

I am trying to perform a 4-fold cross validation on my training set. I have divided my training data into four quarters. I use three quarters for training and one quarter for validation. I repeat this three more times until every quarter has been used as the validation set once.
Now, after training, I have four caffemodels. I test the models on their validation sets. I am getting a different accuracy in each case. How should I proceed from here? Should I just choose the model with the highest accuracy?
Maybe it is a late reply, but in any case...
The short answer is that, if the performances of the four models are similar and good enough, then you re-train the model on all the data available, because you don't want to waste any of them.
n-fold cross validation is a practical technique for gaining some insight into the learning and generalization properties of the model you are trying to train when you don't have a lot of data to start with. You can find details everywhere on the web, but I suggest the freely available book An Introduction to Statistical Learning, Chapter 5.
The general rule says that after you trained your n models, you average the prediction error (MSE, accuracy, or whatever) to get a general idea of the performance of that particular model (in your case maybe the network architecture and learning strategy) on that dataset.
The main idea is to assess the models learned on the training splits by checking whether they have acceptable performance on the validation set. If they do not, then your models have probably overfitted the training data. If the errors on both the training and validation splits are high, then the models should be reconsidered, since they don't have predictive capacity.
In any case, I would also consider the advice of Yoshua Bengio who says that for the kind of problem deep learning is meant for, you usually have enough data to simply go with a training/test split. In this case this answer on Stackoverflow could be useful to you.

What is the difference between cross-validation and grid search?

In simple words, what is the difference between cross-validation and grid search? How does grid search work? Should I do first a cross-validation and then a grid search?
Cross-validation is when you reserve part of your data to use in evaluating your model. There are different cross-validation methods. The simplest conceptually is to just take 70% (just making up a number here, it doesn't have to be 70%) of your data and use that for training, and then use the remaining 30% of the data to evaluate the model's performance. The reason you need different data for training and evaluating the model is to protect against overfitting. There are other (slightly more involved) cross-validation techniques, of course, like k-fold cross-validation, which is often used in practice.
Grid search is a method to perform hyper-parameter optimisation, that is, it is a method to find the best combination of hyper-parameters (an example of a hyper-parameter is the learning rate of the optimiser), for a given model (e.g. a CNN) and test dataset. In this scenario, you have several models, each with a different combination of hyper-parameters. Each of these combinations of hyper-parameters, which corresponds to a single model, can be said to lie on a point of a "grid". The goal is then to train each of these models and evaluate them, e.g. using cross-validation. You then select the one that performed best.
To give a concrete example, if you're using a support vector machine, you could use different values for gamma and C. So, for example, you could have a grid with the following values for (gamma, C): (1, 1), (0.1, 1), (1, 10), (0.1, 10). It's a grid because it's like a product of [1, 0.1] for gamma and [1, 10] for C. Grid search would basically train an SVM for each of these four pairs of (gamma, C) values, then evaluate it using cross-validation, and select the one that did best.
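The (gamma, C) example above maps fairly directly onto scikit-learn's `GridSearchCV`. The sketch below uses a synthetic dataset from `make_classification`, and the RBF kernel and 5 folds are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The grid is the product of the two lists: 2 x 2 = 4 (gamma, C) pairs
param_grid = {"gamma": [1, 0.1], "C": [1, 10]}

# Each pair is scored with 5-fold cross-validation on the same data
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

By default `GridSearchCV` also refits the best combination on all the data passed to `fit`, so `search` can be used for prediction afterwards.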
Cross-validation is a method for robustly estimating test-set performance (generalization) of a model.
Grid-search is a way to select the best of a family of models, parametrized by a grid of parameters.
Here, by "model", I don't mean a trained instance, more the algorithms together with the parameters, such as SVC(C=1, kernel='poly').
Cross-validation simply means separating test and training data and validating the training results with the test data. There are two cross-validation techniques that I know of.
First, test/train cross-validation: splitting the data into a test set and a training set.
Second, k-fold cross-validation: split your data into k bins, use each bin in turn as testing data and the rest of the data as training data, and validate against the testing data. Repeat the process k times and take the average performance. k-fold cross-validation is especially useful for small datasets, since it makes the most of the data for both testing and training.
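The k-fold splitting described above can be written in a few lines of plain Python. This is a sketch only: the function name is made up, and it uses contiguous folds without shuffling, whereas in practice the rows are usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

splits = list(k_fold_indices(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every row lands in the test set of exactly one fold and in the training set of the other k-1 folds.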
Grid search: systematically working through multiple combinations of parameter settings, cross-validating each one, and determining which gives the best performance. You can work through many combinations by changing the parameters only slightly each time.
Cross-validation is a method of reserving a particular subset of your dataset on which you do not train the model. Later, you test your model on this subset before finalizing it.
The main steps you need to perform to do cross-validation are:
Split the whole dataset in training and test datasets (e.g. 80% of the whole dataset is the training dataset and the remaining 20% is the test dataset)
Train the model using the training dataset
Test your model on the test dataset. If your model performs well on the test dataset, continue the training process
There are other cross-validation methods, for example
Leave-one-out cross-validation (LOOCV)
K-fold cross-validation
Stratified K-fold cross-validation
Adversarial cross-validation strategies (used when the train and test datasets differ largely from each other).
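Two of the methods listed above, stratified k-fold and leave-one-out, are available as splitter classes in scikit-learn. The sketch below uses a tiny made-up label vector to show how they divide the data; it does not train a model.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, StratifiedKFold

y = np.array([0] * 8 + [1] * 4)        # imbalanced toy labels (made up)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

# Stratified k-fold keeps the class proportions similar in every fold
skf = StratifiedKFold(n_splits=4)
stratified_tests = [test for _, test in skf.split(X, y)]
for test in stratified_tests:
    print("stratified test labels:", y[test])

# Leave-one-out uses each single row as the test set once
loo_splits = list(LeaveOneOut().split(X))
print("leave-one-out fits:", len(loo_splits))
```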
In simple terms,
consider making pasta as building a model:
Cross validation - choosing the quantity of pasta
Grid search - choosing the right proportion of ingredients.
