I am trying to use the adjusted R^2 value to measure the performance of a regression model (to see which algorithm works better for a specific dataset).
Do I need to split the data into training and test sets to measure adjusted R^2?
If not, why is this different from a classification model? (I believe separating the data into train and test sets is necessary to measure accuracy/precision/... for a classification model.)
In addition, do I need to split the data into train and test sets when using a regression model to predict future values?
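For context, this is a minimal sketch of how adjusted R^2 could be computed on a held-out test split, using synthetic data and scikit-learn (all names and sizes here are placeholders, just to illustrate the quantity being asked about):

```python
# Illustration only: adjusted R^2 evaluated on a held-out test split.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# placeholder synthetic data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))

n, p = X_test.shape  # number of test samples, number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R^2={r2:.3f}, adjusted R^2={adj_r2:.3f}")
```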
Thanks
I have a dataset with several indicators related to some geographical entities. I want to study the factors that influence an indicator A (among the other indicators); I need to determine which indicators affect it the most (correlation).
Which ML algorithm should I use?
I also want a kind of scoring function for my indicator A to allow its prediction.
What you are looking for are correlation coefficients. You have multiple choices for that; the most common are:
Pearson's coefficient, which only measures the linear relationship between two variables, see Scipy's implementation
Spearman's coefficient, which can also capture non-linear (monotonic) relationships, see Scipy's implementation
You can also normalize your data using z-normalization and then fit a simple linear regression. The regression coefficients can give you an idea of the influence of each variable on the outcome. However, this method is highly sensitive to multi-collinearity, which might be present, especially if your variables are geographical.
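As a minimal sketch of both ideas in Python, assuming a pandas DataFrame `df` whose numeric columns are the indicators and whose column `"A"` is the target (the column names and synthetic data below are placeholders):

```python
# Sketch: per-indicator correlation coefficients and standardized regression
# coefficients as rough measures of influence on indicator A.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import LinearRegression

def correlation_report(df: pd.DataFrame, target: str = "A") -> pd.DataFrame:
    rows = []
    for col in df.columns.drop(target):
        r_p, _ = pearsonr(df[col], df[target])   # linear relationship
        r_s, _ = spearmanr(df[col], df[target])  # monotonic relationship
        rows.append({"indicator": col, "pearson": r_p, "spearman": r_s})
    return pd.DataFrame(rows).set_index("indicator")

def standardized_coefficients(df: pd.DataFrame, target: str = "A") -> pd.Series:
    X = df.drop(columns=[target])
    y = df[target]
    # z-normalize so coefficients are comparable across indicators
    Xz = (X - X.mean()) / X.std(ddof=0)
    yz = (y - y.mean()) / y.std(ddof=0)
    model = LinearRegression().fit(Xz, yz)
    return pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)

if __name__ == "__main__":
    # placeholder data standing in for the geographical indicators
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["ind1", "ind2", "ind3"])
    df["A"] = 2 * df["ind1"] - df["ind2"] + rng.normal(scale=0.5, size=200)
    print(correlation_report(df))
    print(standardized_coefficients(df))
```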
Could you provide an example of the dataset? Discrete or continuous variables? Which software are you using?
Anyway, an easy way to test correlation (without going into ML algorithms in the strict sense) is to simply compute Pearson's or Spearman's correlation coefficient on selected features, or on the whole dataset by creating a correlation matrix of the data. You can do that in Python with NumPy (see this) or in R (see this).
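For instance, a rough sketch with NumPy (the array `X` below is placeholder data, one column per feature):

```python
# Sketch: Pearson correlation matrix over the whole dataset.
import numpy as np

X = np.random.rand(100, 5)           # placeholder data, shape (n_samples, n_features)
corr = np.corrcoef(X, rowvar=False)  # (n_features, n_features) correlation matrix
print(np.round(corr, 2))
```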
You can also use simple linear regression or logistic/multinomial logistic regression (depending on the nature of your data) to quantify the influence of the other features on your target variable. Just keep in mind that "correlation is not causation". Look here to see some models.
Then it depends on the goal of your analysis whether to aggregate all the features of all the geographical points or to create covariance matrices for each "subset" of observations related to the geographical points.
I am working on the KNN algorithm and I have some questions I need answered:
I tried different values of K such as 3, 5, 7, and sqrt(n) = 73, and I get different accuracies for these different values of K. Which K should I use in my model, and why?
What is the best percentage to use when splitting the dataset into train and test sets?
Why is the accuracy on the train set always greater than the accuracy on the test set?
Which accuracy (train accuracy or test accuracy) is used to describe the overall model accuracy?
Choosing the value of K is not an exact science. In this post, user20160 explained a procedure to choose a good K using k-fold cross-validation.
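A sketch of that idea with scikit-learn (the synthetic data and candidate K values below are placeholders):

```python
# Sketch: pick K for KNN by k-fold cross-validation on the training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# placeholder data standing in for your training set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidate_ks = [3, 5, 7, 15, 31, 73]
scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy, computed on the training data only
    scores[k] = cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()

best_k = max(scores, key=scores.get)
print(scores, "-> best K:", best_k)
```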
Usually, 80/20 and 70/30 ratios are used, but once again this is not an absolute truth. If your training-set ratio is too large, your model could overfit, meaning it learns the training set too specifically and will not perform well on real cases. On the other hand, if your training set is too small, your model could underfit.
The accuracy on the train set is often greater than on the test set because your model is trained only on the train set, while the test dataset contains cases your model has never seen before. It is like learning to ride a bike with only one bike and being evaluated with another bike.
Train accuracy is not a realistic way to evaluate the performance of your model, since the training dataset is already well known to the model. Test accuracy is more relevant because the test dataset is new and has never been seen by your model.
To better understand model evaluation, I strongly recommend you take a look at the cross-validation link above.
I am using multiple linear regression to do sales-quantity forecasting in retail. Due to practical issues, I cannot use ARIMA or neural networks.
I split the historical data into train and validation sets. Using a walk-forward validation method would be computationally quite expensive at this point, so I take the x weeks preceding the current date as my validation set; the time series prior to x is my training set. The problem I am noticing with this method is that accuracy is far higher during the validation period than for the future prediction. That is, the further we move from the end of the training period, the less accurate the prediction/forecast becomes. How best can I control this problem?
Perhaps a smaller validation period will allow the training period to come closer to the current date and hence provide a more accurate forecast, but this hurts the value of the validation.
Another thought is to cheat and give both the training and validation historical data during training. As I am not using neural nets, the selected algorithm should not overfit. Please correct me if this assumption is not right.
Any other thoughts or solution would be most welcome.
Thanks
Regards,
Adeel
If you're not using ARIMA or DNNs, how about using rolling windows of regressions to train and test on the historical data?
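A rough sketch of that rolling-window idea with scikit-learn (the synthetic weekly data and window lengths below are placeholders, not values from the question):

```python
# Sketch: rolling-window evaluation of a linear regression forecaster.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# placeholder weekly data; in practice X/y are your features and sales quantities
rng = np.random.default_rng(0)
n_weeks = 200
X = rng.normal(size=(n_weeks, 4))
y = X @ np.array([1.5, -0.7, 0.3, 0.0]) + rng.normal(scale=0.5, size=n_weeks)

train_window = 104  # assumed: two years of weekly history per window
test_window = 4     # assumed: forecast four weeks ahead at a time

errors = []
for start in range(0, n_weeks - train_window - test_window + 1, test_window):
    train_idx = slice(start, start + train_window)
    test_idx = slice(start + train_window, start + train_window + test_window)
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print("mean MAE across rolling windows:", np.mean(errors))
```

Because each window retrains on the most recent history, the gap between the end of training and the forecast horizon stays small, which directly addresses the degradation the question describes.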
I use an ImageNet-pretrained ResNet101 (with BatchNorm layers) to train on another dataset.
After training is finished, how should I evaluate the model? Should I avoid setting chainer.using_config('train', False)?
I found the evaluation accuracy is too low (only about 80%) even when I evaluate on the training dataset, not the validation dataset. But when I switch to chainer.using_config('train', True), the accuracy reaches 99%.
I have also put the question on https://github.com/chainer/chainer/issues/4553
One of the reviewer comments:
I think the problem is caused by BatchNorm using different statistics for training and testing.
My answer is based on the assumption that you are applying a pre-trained model on a new dataset (including training/validation/test set). Maybe I'm wrong 😅
Specifically, if you use a pre-trained model, the batch statistics from the original dataset (maybe ImageNet) are reused. As a result, during training, the statistics (mean, std) are actually a combination of the previous dataset and your current training split. Then if you evaluate the training split again with chainer.using_config('train', False), the statistics are reset and thus come purely from the training split. These differences may cause the performance degradation, as I have previously encountered.
Anyway, I think it's important to consider what data are used for computing the running average of statistics for BatchNorm, since this will make a big difference for evaluation even if using the same data.
Then if you evaluate the training split again with chainer.using_config('train', False), the statistics are reset and thus come purely from the training split.
I guess this understanding is wrong. As written in the docstring of BatchNormalization,
In testing mode, it uses pre-computed population statistics to normalize
the input variable. The population statistics is approximated if it is
computed by training mode, or accurate if it is correctly computed by
fine-tuning mode.
Thus, when you use the with chainer.using_config('train', False) statement, the statistics learned at training time are used.
So the conclusion is you should use with chainer.using_config('train', False) during evaluation.
Note that in training mode,
In training mode, it normalizes the input by *batch statistics*. It also
maintains approximated population statistics by moving averages, which can
be used for instant evaluation in testing mode.
Therefore, the population statistics are not used during training; batch statistics are used instead.
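As a minimal sketch of that conclusion (here `model`, `x_test`, and `y_test` are placeholders for your trained chain and evaluation data):

```python
# Sketch: evaluate a trained Chainer model with BatchNorm in test mode.
import chainer
import chainer.functions as F

def evaluate(model, x_test, y_test):
    # 'train' = False makes BatchNorm use its accumulated population statistics
    # instead of per-batch statistics; no_backprop_mode skips graph construction.
    with chainer.using_config('train', False), chainer.no_backprop_mode():
        logits = model(x_test)
        return float(F.accuracy(logits, y_test).array)
```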
My problem is that I obtain a model with very good results (in training and cross-validation), but when I test it with a different dataset, I get poor results.
I have a model that has been trained and tested with cross-validation. The model shows AUC=0.933, TPR=0.90 and FPR=0.04.
Looking at the plots, corresponding to the learning curve (error), learning curve (score), and deviance curve, I guess there is no overfitting present:
The problem is that when I test this model with a different test dataset, I obtain poor results, nothing like my previous results: AUC=0.52, TPR=0.165 and FPR=0.105.
I used a Gradient Boosting Classifier to train my model, with learning_rate=0.01, max_depth=12, max_features='auto', min_samples_leaf=3, n_estimators=750.
I used SMOTE to balance the classes. It is a binary model. I vectorized my categorical attributes. I used 75% of my dataset to train and 25% to test. My model has a very low training error and a low test error, so I guess it is not overfitted. The training error is very low, so there are no outliers in the training and cv-test datasets. What can I do from here to find the problem? Thanks
If the process generating your datasets is non-stationary, it could cause the behavior you describe.
In that case, the distribution of the dataset you're using for testing was not seen during training.
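A minimal sketch of how such a shift could be checked, comparing each feature's distribution in the original training data against the new test data with a two-sample Kolmogorov-Smirnov test (the arrays below are placeholder data):

```python
# Sketch: per-feature distribution-shift check between the original training
# data and the new test set, using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

# placeholder data; X_train / X_new stand in for your two feature matrices
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 5))
X_new = rng.normal(0.5, 1.5, size=(300, 5))

for j in range(X_train.shape[1]):
    stat, p = ks_2samp(X_train[:, j], X_new[:, j])
    flag = "possible shift" if p < 0.01 else "ok"
    print(f"feature {j}: KS={stat:.3f}, p={p:.3g} ({flag})")
```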