Pycaret and cross_val_score show extremely different scores - cross-validation

I'm trying to get the best model for a regression with pycaret
s = setup(train, target="PUE", fold=5, data_split_shuffle=False)
best = compare_models()
Whereas cross_val_score gives:
How can the scores be so dramatically different?
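For reference, here is a minimal sketch of how to make the comparison apples-to-apples on the sklearn side: the same unshuffled 5-fold split and an explicit metric. The estimator (LinearRegression) and the scorer (MAE) are illustrative assumptions, not taken from the original post; substitute whichever model compare_models() ranked best and whichever metric you are comparing.
# Minimal sketch: evaluate one candidate with an unshuffled 5-fold split,
# matching the fold=5 / no-shuffle setup given to PyCaret.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

X = train.drop(columns=["PUE"])  # `train` is the same DataFrame passed to setup()
y = train["PUE"]

cv = KFold(n_splits=5, shuffle=False)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")
print(-scores.mean())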

Related

Catboost overfits training data but test performance increases

I'm training CatBoost on a dataset of 41k observations and ~60 features. The dataset is a longitudinal series (9 years) that is spatially distributed. At the moment I'm just using random resampling of the data, ignoring spatial and temporal dependencies. Model selection is performed with 5-fold CV, and some data are kept aside as an external test / held-out set.
Best result I get with catboost is with following hps:
mtry=37, min_n = 458, tree_depth = 10, learn rate = 0.05
training AUC = .962
internal validation AUC = .867
external test AUC = .870
The difference between the training and test AUC is quite big and this suggests overfitting.
A second hp configuration, instead, reduces the difference between the training and test set but the test performance decreases as well.
mtry=19, min_n = 976, tree_depth = 8, learn rate = 0.0003
training AUC = .846
internal validation AUC = .841
external test AUC = .836
I'd be tempted to go with the first hp configuration since it gives me the best result on the test set. On the other hand, the second result seems more robust to me, since training and test performance are quite similar. In addition, the second result might be closer to the "true" performance I would get using a spatially or temporally blocked resampling strategy.
So my question is: should I be concerned about the difference between training and test performance, or, as long as the test performance doesn't decrease (the usual consequence of overfitting), should I not care about it and pick the first hp configuration?
Your intuition that "the second result might be closer to the 'true' result" is good. When a model is overfitting, take even its performance on the validation and test sets with a grain of salt. The pattern the model memorized during training may still perform well on validation and test for now, but the gap is a strong signal that the model is brittle to variance, which in most cases will show up over time.
Therefore, yes, you should be concerned about differences between training and test, and not simply select the model with the best test performance. The difference in test performance between these two models is relatively small. Based on the little I know of what you have tried, I'd suggest iterating more to see if you can recapture a few points of performance while still eliminating the overfitting.
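If you want to see how much the blocked-resampling concern matters in practice, here is a rough sketch comparing random 5-fold CV with year-blocked CV for one CatBoost configuration. The DataFrame df, the target and year column names, and the hyperparameters are all assumptions for illustration, not taken from the question.
# Rough sketch (column names and hyperparameters are placeholders): compare
# random 5-fold CV with temporally blocked CV, where folds never mix years.
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score, KFold, GroupKFold

X = df.drop(columns=["target", "year"])
y = df["target"]
groups = df["year"]

model = CatBoostClassifier(depth=10, learning_rate=0.05, verbose=0)

random_auc = cross_val_score(model, X, y, scoring="roc_auc",
                             cv=KFold(n_splits=5, shuffle=True, random_state=1))
blocked_auc = cross_val_score(model, X, y, groups=groups, scoring="roc_auc",
                              cv=GroupKFold(n_splits=5))

print("random 5-fold AUC :", random_auc.mean())
print("year-blocked AUC  :", blocked_auc.mean())
If the blocked estimate drops well below the random-fold estimate, that supports the suspicion that the first configuration's advantage partly comes from leakage across the spatial/temporal structure.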

Is it correct to test model performance over the entire dataset?

The dataset is divided into training and testing sets using the function train_test_split() in a 75:25 ratio.
The model is trained on x_train and y_train (classifier models like Gaussian naive Bayes, random forest, k-nearest neighbours, etc.).
Can we now test the model using the complete dataset, i.e. x and y?
Or should we only use x_test and y_test for testing the model?
train_test_split() is meant to give you a simple way of creating training and test subsets from your original dataset. x_train and y_train hold the training features and targets, used to train a model like the ones you mention; that part is the model's training, i.e. its practice.
Testing on the entire dataset is wrong, because your model will clearly be biased towards the data it was trained on from x_train and y_train.
You should test your model on the never-before-seen x_test data and score its predictions against y_test (see the sketch below).
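As a concrete illustration of that advice, here is a minimal sklearn sketch on a bundled toy dataset (not the asker's data), contrasting the honest held-out score with the inflated full-dataset score.
# Minimal sketch: fit on the training split only, then report accuracy only on the
# held-out test split. Scoring on the full dataset mixes in rows the model has
# already seen, so the estimate is optimistically biased.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = GaussianNB().fit(x_train, y_train)
print(accuracy_score(y_test, clf.predict(x_test)))  # honest estimate on unseen data
print(accuracy_score(y, clf.predict(X)))            # inflated: 75% of these rows were used for training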

Get cross_validation_holdout_predictions() of models from a grid search

I'm trying to calculate performance in a different way than what is currently built into the models.
I would like to access raw predictions during cross-validation, so I can calculate performance on my own.
import h2o

rrc = {}  # model_id -> cross-validated holdout predictions
g = h2o.get_grid(grid_id)
for m in g.models:
    print("Model %s" % m.model_id)
    rrc[m.model_id] = m.cross_validation_holdout_predictions()
I could just run predictions with a model on my dataset, but I think that test might be biased because the model has seen this data before, or not? Can I take new predictions made on the same dataset and use them to calculate performance?
I would like to access raw predictions during cross-validation, so I can calculate performance on my own.
If you want to calculate a custom metric on the cross-validated predictions, then set keep_cross_validation_predictions = True and you can access the raw predicted values using the .cross_validation_holdout_predictions() method like you have above.
Can I take new predictions made on the same data set and use it to calculate performance?
It sounds like you're asking if you can use only training data to estimate model performance? Yes, using cross-validation. If you set nfolds > 1, H2O will do cross-validation and compute a handful of cross-validated performance metrics for you. Also, if you tell H2O to save the cross-validated predictions, you can compute "cross-validated metrics" of your own.
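Putting that together, here is a minimal sketch of the workflow. The toy frame and the GBM estimator are placeholders for whatever model or grid you are actually training; only nfolds and keep_cross_validation_predictions matter for the point being made.
# Minimal sketch: nfolds > 1 turns on cross-validation, and keeping the holdout
# predictions lets you compute any metric yourself afterwards.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Toy frame purely for illustration; substitute your own training H2OFrame.
n = 100
train = h2o.H2OFrame({"x1": list(range(n)),
                      "x2": [i % 7 for i in range(n)],
                      "y":  [i % 2 for i in range(n)]})
train["y"] = train["y"].asfactor()

model = H2OGradientBoostingEstimator(nfolds=5,
                                     keep_cross_validation_predictions=True,
                                     seed=1)
model.train(x=["x1", "x2"], y="y", training_frame=train)

# One prediction per training row, each made by a fold model that did not see that row.
cv_preds = model.cross_validation_holdout_predictions()
print(cv_preds.head())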

In general, when does TF-IDF reduce accuracy?

I'm training a Naive Bayes model on a corpus of 200,000 reviews labeled positive or negative, and I noticed that applying TF-IDF actually reduced the accuracy (when testing on a test set of 50,000 reviews) by about 2%. So I was wondering whether TF-IDF has any underlying assumptions about the data or model it works with, i.e. whether there are cases where accuracy is reduced by using it?
The IDF component of TF*IDF can harm your classification accuracy in some cases.
Suppose the following artificial, easy classification task, made up for the sake of illustration:
Class A: texts containing the word 'corn'
Class B: texts not containing the word 'corn'
Suppose now that in Class A, you have 100 000 examples and in class B, 1000 examples.
What will happen with TF-IDF? The inverse document frequency of 'corn' will be very low (because it is found in almost all documents), so the feature 'corn' will get a very small TF-IDF value, which is the weight of the feature used by the classifier. Obviously, 'corn' was THE best feature for this classification task. This is an example where TF-IDF may reduce your classification accuracy (a small numeric sketch follows the list below). In more general terms, it can hurt:
when there is class imbalance: if you have many more instances in one class, the good word features of the frequent class risk having a low IDF, so the best features of that class end up with a low weight;
when you have high-frequency words that are very predictive of one of the classes (words found in most documents of that class).
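To make the 'corn' case concrete, here is a small numeric sketch using sklearn's smoothed IDF formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1; the counts mirror the artificial example above, and the rare-word document frequency is just an assumption for contrast.
# A term found in nearly every document gets an IDF close to 1 (the minimum),
# while a rare, less informative term gets a much larger weight.
import math

n_docs = 101_000   # 100 000 class-A documents + 1 000 class-B documents
df_corn = 100_000  # 'corn' appears in every class-A document
df_rare = 50       # some rare word, assumed for contrast

def idf(df):
    return math.log((1 + n_docs) / (1 + df)) + 1

print(idf(df_corn))  # ~1.01 -> 'corn', the single best feature, is almost entirely down-weighted
print(idf(df_rare))  # ~8.59 -> a rare word dominates the representation instead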
You can heuristically determine whether the usage of IDF on your training data decreases your predictive accuracy by performing grid search as appropriate.
For example, if you are working in sklearn, and you want to determine whether IDF decreases the predictive accuracy of your model, you can perform a grid search on the use_idf parameter of the TfidfVectorizer.
As an example, this code runs the grid search over the use of IDF for classification with SGDClassifier (imports included):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X = ...  # your training documents
y = ...  # your labels

pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('sgd', SGDClassifier())])
params = {'tfidf__use_idf': (False, True)}
gridsearch = GridSearchCV(pipeline, params)
gridsearch.fit(X, y)
print(gridsearch.best_params_)
The printed best parameters will be either:
{'tfidf__use_idf': False}
or
{'tfidf__use_idf': True}
TF-IDF, as far as I understand, is a feature. TF is term frequency, i.e. how often a term occurs in a document. IDF is inverse document frequency, i.e. the inverse of how many documents the term occurs in.
Here, the model uses the TF-IDF information from the training corpus to classify new documents. For a very simple example, say documents in the training set where the word 'bad' has a high term frequency tend to carry a negative sentiment label. Then any new document containing 'bad' will be more likely to be classified as negative.
To improve accuracy, you can manually curate a training corpus that contains the most commonly used negative and positive words. This should boost the accuracy.

Does Weka test results on a separate holdout set with 10CV?

I used 10-fold cross validation in Weka.
I know this usually means that the data is split into 10 parts, with 90% used for training and 10% for testing, alternated 10 times so that each part is used for testing once.
I am wondering what Weka calculates the resulting AUC on. Is it the average over all 10 test sets? Or (and I hope this is true) does it use a separate holdout test set? I can't seem to find a description of this in the Weka book.
Weka averages the test results. And this is a better approach than a holdout set, so I don't understand why you would hope for one. If you held out a test set (of what size?), your test would not be statistically significant: it would only say that, for the best parameters chosen on the training data, you achieved some score on an arbitrarily small part of the data. The whole point of cross-validation (as an evaluation technique) is to use all the data as training and as testing in turns, so that the resulting metric approximates the expected value of the true evaluation measure. A holdout test would not converge to that expected value (at least not in a reasonable time), and, more importantly, you would have to choose another constant (how big a holdout set, and why?) and reduce the number of samples used for training, while cross-validation was developed precisely because datasets are often too small for separate training and testing sets.
I performed cross-validation on my own (made my own random folds and trained 10 classifiers) and checked the average AUC. I also checked whether the entire dataset was used to report the AUC (similar to when Weka outputs a decision tree under 10-fold CV).
The AUC for the credit dataset with a naive Bayes classifier, as found by:
10-fold Weka = 0.89559
10-fold mine = 0.89509
original train = 0.90281
There is a slight discrepancy between my average AUC and Weka's, but this could come from a failure to replicate the folds exactly (although I did try to control the seeds).
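For readers working in Python rather than Weka, here is a minimal sklearn analogue of the same check (a bundled toy dataset stands in for the credit data): each fold's AUC is computed on its own test part only, and the 10 values are then averaged.
# Minimal sketch: 10-fold CV where the reported figure is the mean of the 10
# per-test-fold AUCs, i.e. the "average of the test results" discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
fold_aucs = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="roc_auc")

print(fold_aucs)          # one AUC per test fold
print(fold_aucs.mean())   # the averaged figure, comparable to Weka's summary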
