I am working on a logistic regression model and would appreciate any help with this. I built a logistic regression model on a training data set and got an AUC of 0.87; however, when I score the validation data set with the model, the AUC drops to 0.62. What might be the cause? Thank you in advance.
The ROC curve assesses the fit between the model and the data. If you overfit your training data, you will get an overly optimistic estimate of the model's performance when you assess it on the training data itself (resubstitution).
This is why you must always test your model on a dataset that was not used to train it. Have a look at cross-validation for a common way to do it.
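For instance, a minimal sketch of estimating AUC with cross-validation in scikit-learn (the data here is a synthetic placeholder standing in for your training set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real training data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validated AUC: each fold is scored on data the model did not see,
# so this is far less optimistic than the resubstitution (training-set) AUC.
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("CV AUC: %.3f (+/- %.3f)" % (cv_auc.mean(), cv_auc.std()))
```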
I am trying to use the adjusted R^2 value to measure the performance of a regression model (to see which algorithm works better for a specific dataset).
Do I need to split the data into training and test sets to measure adjusted R^2?
If not, why is this different from a classification model (I believe separating train and test sets is necessary to measure accuracy/precision/... for a classification model)?
In addition, do I need to split the data into train and test sets to use a regression model to predict future values?
Thanks
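For reference, adjusted R^2 can be computed on a held-out test set just like any other metric, using R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - p - 1). A minimal sketch (the data and model here are illustrative placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

def adjusted_r2(r2, n, p):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    # where n is the number of observations and p the number of predictors.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2_test = r2_score(y_test, model.predict(X_test))
print("Test adjusted R^2:", adjusted_r2(r2_test, n=len(y_test), p=X_test.shape[1]))
```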
I am working on the KNN algorithm and I have some questions I would appreciate answers to:
I tried different values of K, such as 3, 5, 7, and sqrt(n) = 73, and I get different accuracies depending on the value of K. Which K should I use in my model, and why?
What is the best percentage to use when splitting the dataset into train and test sets?
Why is the accuracy on the train set always greater than the accuracy on the test set?
Which accuracy (train accuracy or test accuracy) describes the overall accuracy of the model?
Choosing the value of K is not an exact science. In this post, user20160 explained a procedure for choosing a good K using k-fold cross-validation.
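As an illustration, a minimal sketch of selecting K by k-fold cross-validation with scikit-learn (the data here is a synthetic placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Evaluate each candidate K with 10-fold cross-validation and keep the best one.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 11, 21, 51, 73]},
    cv=10,
    scoring="accuracy",
)
grid.fit(X, y)
print("Best K:", grid.best_params_["n_neighbors"], "CV accuracy:", grid.best_score_)
```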
Usually, 80/20 and 70/30 ratios are used, but once again, this is not an absolute truth. If your train set is too large relative to the test set, your model could overfit, which means it learns the particulars of the train set and will not perform well on real cases. On the other hand, if your train set is too small, your model could underfit.
The accuracy on the train set is often greater than on the test set because your model is trained only on the train set, while the test set contains cases your model has never seen before. It is like learning to ride a bike on a single bike and then being evaluated on a different one.
The train accuracy is not a realistic measure of your model's performance, since the train dataset is already well known to the model. The test accuracy is more relevant because the test data is new and has never been seen by the model.
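To make this concrete, a minimal sketch comparing train and test accuracy for a KNN classifier (synthetic data, illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (illustrative only), split 80/20.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

# Train accuracy is typically higher because these are cases the model has already seen;
# the test accuracy is the number to report as the model's expected performance.
print("Train accuracy:", knn.score(X_train, y_train))
print("Test accuracy:", knn.score(X_test, y_test))
```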
To better understand model evaluation, I strongly recommend you take a look at the cross-validation link above.
I am training a neural network in Keras and I have hit a classic problem: my training accuracy improves with increasing epochs, but my validation accuracy decreases after 9 epochs (see figure).
I wonder if I can avoid the decrease in validation accuracy by doing the following: make the Keras net accept the weight changes from an epoch only if that epoch led to an improvement in validation accuracy, and otherwise reset to the state before the epoch. I assume the validation accuracy starts to diverge largely because, after each epoch beyond 9, the weights of the network drift further away from anything that fits the validation data.
So, is my suggestion a good practice and can I achieve it in keras (are there callbacks or options that allow me to update the net only if the validation improved)?
Side question: does my suggestion perhaps violate the principle of "don't use your validation data for training"? Because I am implicitly making the performance of the neural net a function of my validation data.
The point of the validation set is to give you an idea of the generalizability your model achieves by learning from the training data. You don't HAVE to have a validation dataset. If your validation data is a random sample of your training data, then your best bet is probably to modify your architecture.
In short, if you want your model to train based on your validation data, then train the model on the training set, take the resulting model, and train it on the validation data (i.e., make the validation data the training data). This obviously defeats the point of having a validation set.
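For the practical part of the question, Keras does ship callbacks that approximate "keep only the weights that improved validation performance" without blocking updates: EarlyStopping with restore_best_weights=True and ModelCheckpoint with save_best_only=True. A minimal sketch, assuming a generic binary classifier (the architecture and data below are placeholders):

```python
import numpy as np
from tensorflow import keras

# Placeholder data: a generic binary classification problem (illustrative only).
X_train = np.random.rand(1000, 20)
y_train = (X_train.sum(axis=1) > 10).astype("float32")
X_val = np.random.rand(200, 20)
y_val = (X_val.sum(axis=1) > 10).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # Stop once validation accuracy stops improving and roll back to the best weights.
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                  restore_best_weights=True),
    # Additionally keep on disk only the model with the best validation accuracy.
    keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_accuracy",
                                    save_best_only=True),
]

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=50, callbacks=callbacks)
```

Note that selecting weights by validation accuracy does use the validation data for model selection, which is why a separate test set is usually kept for the final performance estimate.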
I built a hurdle model, and then used that model to predict from known to unknown data points using the predict command. Is there a way to validate the model and these predictions? Do I have to do this in two parts, for example using sensitivity and specificity for the binomial part of the model?
Any other ideas for how to assess the validity of this model?
For validating predictive models, I usually trust Cross-Validation.
In short: with cross-validation you can measure the predictive performance of your model using only the training data (data with known results). This gives you a general sense of how well your model works. Cross-validation works quite well for a wide variety of models. The downside is that it can get quite computationally heavy.
With large data sets, 10-fold cross-validation is usually enough. The smaller your dataset, the more "folds" you have to use (i.e., with very small datasets, you have to do leave-one-out cross-validation).
With cross-validation, you get predictions for the whole data set. You can then compare these predictions to the actual outputs and measure how well your model performed.
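For example, a minimal sketch of getting out-of-fold predictions for every observation with scikit-learn's cross_val_predict (the model and data are placeholders; for a hurdle model you would apply the same idea to each component, e.g. the binomial part shown here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Out-of-fold predictions: each observation is predicted by a model
# that never saw it during training.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)

# Compare the cross-validated predictions to the known outcomes,
# e.g. sensitivity and specificity for the binary part of the model.
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print("Sensitivity:", tp / (tp + fn))
print("Specificity:", tn / (tn + fp))
```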
Cross-validated results can take some effort to interpret in more complicated comparisons, but for your general question of "how to assess the validity of the model", the results should be quite easy to use.
My problem is that I obtain a model with very good results (in training and cross-validation), but when I test it on a different data set, poor results appear.
I have a model that has been trained and tested with cross-validation. The model shows AUC = 0.933, TPR = 0.90 and FPR = 0.04.
Judging by the learning curve (error), learning curve (score), and deviance curve, I don't think there is any overfitting.
The problem is that when I test this model on a different test data set, I obtain poor results, nothing like my previous ones: AUC = 0.52, TPR = 0.165 and FPR = 0.105.
I used a Gradient Boosting Classifier to train my model, with learning_rate=0.01, max_depth=12, max_features='auto', min_samples_leaf=3, n_estimators=750.
I used SMOTE to balance the classes. It is a binary model. I vectorized my categorical attributes. I used 75% of my data set to train and 25% to test. My model has a very low training error and a low test error, so I don't think it is overfitted. The training error is very low, so there are no outliers in the training and cv-test data sets. What can I do from here to find the problem? Thanks
If the process generating your datasets is non-stationary, it could cause the behavior you describe.
In that case, the new data set you are testing on is drawn from a distribution that was not represented in the data used for training, so the performance estimated on the original train/cv split will not carry over.
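One common diagnostic (not specific to this answer) is to compare feature distributions between the original training data and the new test set, for example with a two-sample Kolmogorov-Smirnov test per feature; the arrays below are placeholders:

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder arrays: rows are observations, columns are the vectorized features.
X_train = np.random.rand(1000, 5)          # data used for training / cross-validation
X_new = np.random.rand(300, 5) + 0.3       # the new, poorly-scoring test set

# A small KS p-value for a feature suggests its distribution shifted
# between training time and the new test set (non-stationarity / covariate shift).
for j in range(X_train.shape[1]):
    stat, p = ks_2samp(X_train[:, j], X_new[:, j])
    print(f"feature {j}: KS statistic = {stat:.3f}, p-value = {p:.3g}")
```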