Best way to calculate error rate for kNN classification

I am trying to find the error rate for my kNN classification model, and I am not sure which method I should use to calculate it.
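A standard approach is to report the misclassification rate, i.e. 1 minus accuracy, either on a held-out test set or averaged over cross-validation folds. Below is a minimal sketch assuming scikit-learn (the question does not name a library); the synthetic dataset, k=5, and the 75/25 split are illustrative placeholders, not details from the question.

```python
# Sketch: estimating the error rate of a kNN classifier with a held-out
# test set and with cross-validation (scikit-learn assumed). The dataset
# and hyperparameters are placeholders, not values from the question.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Hold-out error rate = 1 - accuracy on the unseen test set.
test_error = 1.0 - knn.score(X_test, y_test)
print(f"hold-out error rate: {test_error:.3f}")

# Cross-validated error rate gives a less split-dependent estimate.
cv_accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)
print(f"10-fold CV error rate: {1.0 - cv_accuracy.mean():.3f}")
```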

Related

How to handle overfitting properly?

I am training a CNN regression model on a dataset of 34,560 samples. The training error rate is already below 5%, but the validation error rate is over 60%, which looks like an overfitting problem. I have tried four ways to address it, but none of them works well:
Increase the dataset size
Reduce the model complexity
Add a dropout layer before the output layer
Use L2 regularization / weight decay
Probably I did not apply them in the right way. Can someone explain how to use these methods correctly, or suggest other ways to deal with the overfitting problem?
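For reference, here is a minimal sketch, assuming a Keras CNN (the question does not name a framework), of where the dropout layer and L2 weight decay from the list above would typically go; the layer sizes, input shape, and regularization strength are placeholder assumptions, not values from the question.

```python
# Sketch (Keras assumed; the question does not name a framework):
# a small CNN regressor showing where dropout and L2 weight decay
# would be inserted. Layer sizes and input shape are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4),  # L2 / weight decay
                  input_shape=(64, 64, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),   # dropout just before the output layer
    layers.Dense(1)        # single regression output
])

model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50)
```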

Can my testing error be lower than validation error but higher than training error?

I am training a multilayer perceptron with 5 hidden nodes. The test error is lower than the validation error but higher than the training error, and the coefficient of correlation on the test set is also higher than on the validation set. Is that acceptable? I've included the regression and performance plots. To me the generalization looks adequate, if not the best.
This is an ANN-GARCH-type model for volatility.

Concept on "best constant's loss" in vowpal wabbit's output, and the stated rule of thumb in tutorial

I am trying to understand Vowpal Wabbit a bit better and came across this statement in the Linear Regression tutorial (https://vowpalwabbit.org/tutorials/getting_started.html):
"At the end, some more straightforward totals are printed. The best constant and best constant's loss only work if you are using squared loss. Squared loss is the Vowpal Wabbit default. They compute the best constant’s predictor and the loss of the best constant predictor.
If average loss is not better than best constant's loss, something is wrong. In this case, we have too few examples to generalize."
Based on that context, I have 2 related questions:
Is the best constant's loss based on the loss of the null model in linear regression?
Is the general rule of thumb for "average loss" not being better than "best constant's loss" applicable to all loss functions (since the statement does state that the "best constant" only works for the default squared loss function)?
Thanks in advance for any responses!
Is the best constant's loss based on the loss of the null model in linear regression?
If by null model you mean the model that always predicts the best constant, then yes.
Is the general rule of thumb for "average loss" not being better than "best constant's loss" applicable to all loss functions?
Yes. If always making the same prediction (some best constant appropriate to the given loss function) does better than the learned model, then the learned model is inferior to the simplest possible model. The simplest model for a given loss function always predicts the same (best constant) result, ignoring the input features in the data.
One of the most common reasons for a learned model being inferior to the best-constant model is a data set that is too small: the learning process has not had a chance to fully converge yet. This is also known as under-fitting.
How is the best constant calculated (for completeness)?
In the case of linear regression (a least-squares hyperplane; vw --loss_function squared, which is the default), the best constant is the simple average (i.e. the mean) of the labels, which minimizes the squared loss.
In the case of quantile loss (a.k.a. absolute error; vw --loss_function quantile), the best constant is the median of the labels, which minimizes the sum of absolute distances between the labels and the prediction.
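To make the rule of thumb concrete, here is a small NumPy sketch (not VW itself) that computes the best constant and its loss for both loss functions described above and compares them with the average loss of a hypothetical model; the labels and predictions are made up for illustration.

```python
# Sketch: computing the "best constant" and its loss for squared and
# absolute (quantile, tau=0.5) loss, then comparing against a model's
# average loss. Labels and predictions here are illustrative only.
import numpy as np

labels = np.array([1.0, 2.0, 2.5, 4.0, 10.0])
model_predictions = np.array([1.2, 2.1, 2.4, 3.5, 9.0])  # hypothetical model output

# Squared loss: the best constant is the mean of the labels.
best_const_sq = labels.mean()
best_const_sq_loss = np.mean((labels - best_const_sq) ** 2)
model_sq_loss = np.mean((labels - model_predictions) ** 2)

# Absolute / quantile (tau=0.5) loss: the best constant is the median.
best_const_abs = np.median(labels)
best_const_abs_loss = np.mean(np.abs(labels - best_const_abs))
model_abs_loss = np.mean(np.abs(labels - model_predictions))

print(f"squared loss : model={model_sq_loss:.3f}  best constant={best_const_sq_loss:.3f}")
print(f"absolute loss: model={model_abs_loss:.3f}  best constant={best_const_abs_loss:.3f}")

# Rule of thumb: if the model's average loss is not lower than the best
# constant's loss, the model is doing worse than always predicting a
# single constant, so something is wrong.
```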

Good results when training and cross-validating a model, but test data set shows poor results

My problem is that I get very good results during training and cross-validation, but when I test the model on a different data set the results are poor.
I have a model that has been trained and evaluated with cross-validation. It shows AUC=0.933, TPR=0.90 and FPR=0.04.
Looking at the plots (learning curve for the error, learning curve for the score, and the deviance curve), I see no sign of overfitting.
The problem is that when I test this model with a different test data set, I obtain poor results, nothing like my previous ones: AUC=0.52, TPR=0.165 and FPR=0.105.
I used a Gradient Boosting Classifier to train the model, with learning_rate=0.01, max_depth=12, max_features='auto', min_samples_leaf=3, n_estimators=750.
I used SMOTE to balance the classes. It is a binary model, and I vectorized my categorical attributes. I used 75% of my data set for training and 25% for testing. The model has a very low training error and a low test error, so I do not think it is overfitted; the very low training error also suggests there are no outliers in the training and CV-test sets. What can I do from here to find the problem? Thanks.
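For reference, a sketch of the setup as described (not the asker's actual code) might look like the following; the data is a random placeholder, imbalanced-learn's SMOTE is assumed, and the hyperparameters are the ones stated above.

```python
# Sketch of the pipeline as described in the question (placeholder data;
# imbalanced-learn's SMOTE assumed). Hyperparameters are the stated ones.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = np.random.rand(2000, 30), np.random.randint(0, 2, 2000)  # placeholder data

# 75% train / 25% test, as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# SMOTE applied to the training split only (resampling before splitting
# can leak synthetic neighbours of test points into the training set).
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = GradientBoostingClassifier(
    learning_rate=0.01, max_depth=12,
    max_features='sqrt',  # the question uses 'auto', no longer accepted by recent scikit-learn
    min_samples_leaf=3, n_estimators=750)
clf.fit(X_res, y_res)

print("test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```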
If the process generating your data sets is non-stationary, it could cause the behavior you describe.
In that case, the distribution of the data set you are using to test was not represented in the data used for training.
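One way to probe for such a shift, not mentioned in the original answer but offered here as a diagnostic sketch, is to compare each feature's distribution in the original training data against the new test set, for example with a two-sample Kolmogorov–Smirnov test. The arrays below are placeholders for the two data sets.

```python
# Sketch: per-feature two-sample KS test between the original training
# data and the new test set, as a rough check for distribution shift.
# X_train and X_new are placeholders for the two feature matrices.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 5))  # placeholder "old" data
X_new = rng.normal(0.5, 1.2, size=(400, 5))     # placeholder "new" data (shifted)

for j in range(X_train.shape[1]):
    stat, p_value = ks_2samp(X_train[:, j], X_new[:, j])
    flag = "  <-- possible shift" if p_value < 0.01 else ""
    print(f"feature {j}: KS statistic={stat:.3f}, p={p_value:.3g}{flag}")
```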

ROC AUC drops when the model is applied to a validation dataset

I am working on a logistic regression model and would appreciate it if anyone could help me with this. I built a logistic regression model on a training data set and got an AUC of 0.87; however, when I score the validation data set with the model, the AUC drops to 0.62. What might be the cause? Thank you in advance.
The ROC curve assesses the fit between the model and the data. If you overfit your training data, you will get an overly optimistic estimate of the model's performance when you assess it with the training data itself (resubstitution).
This is why you must always test your model on a dataset that was not used to train it. Have a look at cross-validation for a common way to do this.
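To illustrate the resubstitution-versus-held-out gap described above, here is a small scikit-learn sketch on synthetic data comparing the AUC measured on the training data itself with a cross-validated AUC; the dataset and model settings are illustrative assumptions.

```python
# Sketch: resubstitution AUC vs. cross-validated AUC for a logistic
# regression model (synthetic data; for illustration only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=40, n_informative=5,
                           random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Resubstitution: scoring on the same data used for training is optimistic.
resub_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Cross-validation: each fold is scored on data the model did not see.
cv_auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc").mean()

print(f"resubstitution AUC: {resub_auc:.3f}")
print(f"cross-validated AUC: {cv_auc:.3f}")
```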
