Why can ROC_AUC score decrease after cross-validation?

Please help, I need advice or a link to some manual.
While training the model, I performed these steps:
Chose the model (CatBoostRegressor)
Tuned hyperparameters with the Optuna framework
The model with study.best_trial.params scored 0.92
Used this model as the input to StratifiedKFold()
Finished with an average of 0.95 over 10 folds
Submitted...
And got a roc_auc_score of 0.77 on the test data.
Have I missed a step here?
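For context, the workflow described above might look roughly like the sketch below. This is not the asker's actual code: X and y are assumed to be NumPy arrays with a binary target, and the search space and trial count are purely illustrative.
import numpy as np
import optuna
from catboost import CatBoostRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hold out a test set once, up front; it is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

def objective(trial):
    params = {
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "iterations": trial.suggest_int("iterations", 200, 1000),
    }
    # Score each trial on an inner validation split of the training data.
    tr_X, va_X, tr_y, va_y = train_test_split(
        X_train, y_train, test_size=0.2, stratify=y_train, random_state=trial.number)
    model = CatBoostRegressor(**params, verbose=0)
    model.fit(tr_X, tr_y)
    return roc_auc_score(va_y, model.predict(va_X))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

# 10-fold cross-validation with the tuned parameters. Note that the folds come
# from the same data the tuning already saw, so this estimate can be optimistic
# compared to the untouched test set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []
for tr_idx, va_idx in cv.split(X_train, y_train):
    model = CatBoostRegressor(**study.best_trial.params, verbose=0)
    model.fit(X_train[tr_idx], y_train[tr_idx])
    fold_scores.append(roc_auc_score(y_train[va_idx], model.predict(X_train[va_idx])))
print("Mean CV ROC AUC:", np.mean(fold_scores))

# Final evaluation on the held-out test set.
final_model = CatBoostRegressor(**study.best_trial.params, verbose=0)
final_model.fit(X_train, y_train)
print("Test ROC AUC:", roc_auc_score(y_test, final_model.predict(X_test)))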

Related

How to properly finetune t5 model

I'm fine-tuning a t5-base model following this notebook.
However, the loss on both the validation set and the training set decreases very slowly. I changed the learning_rate to a larger number, but it did not help. Eventually, the BLEU score on the validation set was low (around 13.7), and the translation quality was low as well.
***** Running Evaluation *****
Num examples = 1000
Batch size = 32
{'eval_loss': 1.06500244140625, 'eval_bleu': 13.7229, 'eval_gen_len': 17.564, 'eval_runtime': 16.7915, 'eval_samples_per_second': 59.554, 'eval_steps_per_second': 1.906, 'epoch': 5.0}
If I use the "Helsinki-NLP/opus-mt-en-ro" model, the loss decreases properly, and at the end, the finetuned model works pretty well.
How to fine-tune t5-base properly? Did I miss something?
I think the metrics shown in the tutorial are for the already trained EN>RO opus-mt model, which was then fine-tuned. I don't see a before-and-after comparison of the metrics for it, so it is hard to tell how much of a difference that fine-tuning really made.
You generally shouldn't expect the same results from fine-tuning T5, which is not a (pure) machine translation model. What matters more is the difference in metrics before and after the fine-tuning.
Two things I could imagine having gone wrong with your training (see the sketch after these two points):
Did you add the proper T5 prefix to the input sequences ("translate English to Romanian: ") for both your training and your evaluation? If you did not, you might have been training a new task from scratch instead of using the bit of pre-training the model did on MT to Romanian (and German, and perhaps some other languages). You can see how that affects model behavior, for example, in this inference demo: Language used during pretraining and Language not used during pretraining.
If you chose a relatively small model like t5-base but stuck with num_train_epochs=1 from the tutorial, your number of training epochs is probably far too low to make a noticeable difference. Try increasing the epochs for as long as you get significant performance boosts from it; in this example that is probably the case for at least the first 5 to 10 epochs.
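A rough sketch of these two points (assuming a recent transformers version and a hypothetical dataset with "en" and "ro" text columns; this is not the notebook's code):
from transformers import AutoTokenizer, Seq2SeqTrainingArguments

tokenizer = AutoTokenizer.from_pretrained("t5-base")
prefix = "translate English to Romanian: "  # point 1: the T5 task prefix

def preprocess(batch):
    inputs = [prefix + text for text in batch["en"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["ro"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# point 2: raise num_train_epochs well above the tutorial's single epoch
training_args = Seq2SeqTrainingArguments(
    output_dir="t5-base-en-ro-finetuned",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    predict_with_generate=True,
)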
I actually did something very similar to what you are doing, for EN>DE (German). I fine-tuned both opus-mt-en-de and t5-base on a custom dataset of 30,000 samples for 10 epochs. The opus-mt-en-de BLEU increased from 0.256 to 0.388 and t5-base from 0.166 to 0.340, just to give you an idea of what to expect. Romanian (or the dataset you use) might be more of a challenge for the model and result in different scores, though.

How can you pass a pretrained LDA Model to ldaseq in Gensim for DTM?

I have a tuned and pretrained LDA model that I want to pass on to the ldaseq model in gensim, but I don't understand how to do it. I've tried lda_model and sstats but it doesn't seem to work, I still get this from the logging:
running online (multi-pass) LDA training, 10 topics, 10 passes over
the supplied corpus of 1699 documents, updating model once every 1699
documents, evaluating perplexity every 1699 documents, iterating 50x
with a convergence threshold of 0.001000
In case anyone ever wonders this:
With initialize='own' you need to supply the sstats of the previously trained model in the shape (vocab_len, num_topics), and with initialize='lda_model' you need to supply the previously trained LDA model.
I found the answer here.
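Based on that, a minimal sketch might look like the following (corpus, dictionary, and time_slice are assumed to already exist; the topic count is illustrative):
import numpy as np
from gensim.models.ldamodel import LdaModel
from gensim.models.ldaseqmodel import LdaSeqModel

# Previously trained / tuned LDA model
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)

# initialize='own' expects sstats of shape (vocab_len, num_topics),
# i.e. the transpose of lda.state.sstats.
ldaseq = LdaSeqModel(
    corpus=corpus,
    id2word=dictionary,
    time_slice=time_slice,   # e.g. number of documents in each time period
    num_topics=10,
    initialize='own',
    sstats=np.transpose(lda.state.sstats),
)
# Alternatively, per the answer above, pass the trained model itself through
# the lda_model argument together with the corresponding initialize mode.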

How to tune the hyperparameters for time series data using models built with pytorch

I have read many posts related to time-series data with walk-forward validation. I am currently facing the issue of how to split the data. I have to use the previous 5 days of data to predict the class of the current day. There are two classes, 0 or 1, i.e. the price will either go up or down. I have already windowed the entire data into 5-day windows, with every window having the next day's label as its label.
I want to keep 20% of the data in the test set and 80% as training data for the final model evaluation. Now how do I apply walk-forward validation with a grid search of the hyperparameters? Keeping 20% of the data in training and the next 20% in test each time will generate 4 such models, with the final (4th) model having 80% training data and 20% test data. Let's say we have 100 windowed samples; then
train | test
model1 - 20|20
model2 - 20+20|20
model3 - 20+20+20|20
model4 - 20+20+20+20|20
or is it like
model1 - 80|1
model2 - 81|1
model3 - 82|1
model4 - 83|1
.
.
.
model20 - 99|1
Further, the performance score will be the mean score across all the models for every configuration of the grid search. Also, every model will be trained for 100 epochs, which will take a lot of time. So can we just evaluate multiple time steps at once?
I am new to time-series hyperparameter tuning.
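For what it's worth, scikit-learn's TimeSeriesSplit (0.24+) can reproduce both split patterns sketched above. A minimal example with 100 placeholder windowed samples (the data and labels here are purely synthetic stand-ins):
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)        # placeholder for the 100 windowed samples
y = np.random.randint(0, 2, size=100)    # placeholder up/down labels

# Expanding-window scheme: 20|20, 40|20, 60|20, 80|20
for fold, (train_idx, test_idx) in enumerate(
        TimeSeriesSplit(n_splits=4, test_size=20).split(X), start=1):
    print(f"model{fold}: train={len(train_idx)} | test={len(test_idx)}")

# One-step-ahead scheme: 80|1, 81|1, ..., 99|1
for fold, (train_idx, test_idx) in enumerate(
        TimeSeriesSplit(n_splits=20, test_size=1).split(X), start=1):
    print(f"model{fold}: train={len(train_idx)} | test={len(test_idx)}")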

AutoML Vision: Predictions include --other-- field

I have just trained a new model with a binary outcome (elite/non-elite). The model trained well, but when I tested a new image on it in the GUI, it returned a third label, --other--. I am not sure how or why that has appeared. Any ideas?
When multi-class (single-label) classification is used, there is an assumption that the confidences of all predictions must sum to 1 (as one and exactly one valid label is assumed). This is achieved by using the softmax function, which normalizes all predictions to sum to 1. That has some drawbacks: for example, if both predictions are very low, say the prediction for "elite" is 0.0001 and for Non_elite is 0.0002, after normalization the predictions would be 0.333 and 0.667 respectively.
To work around that, the AutoML system allows an extra label (--other--) to indicate that none of the allowed predictions seems valid. This label is an implementation detail and shouldn't be returned by the system (it should be filtered out). This should get fixed in the near future.
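To make the arithmetic in that answer concrete, a tiny sketch of the normalization it describes:
# Two very low raw confidences still sum to 1 after normalization,
# which is why the system needs an --other-- escape hatch.
raw = {"elite": 0.0001, "Non_elite": 0.0002}
total = sum(raw.values())
normalized = {label: score / total for label, score in raw.items()}
print(normalized)  # {'elite': 0.333..., 'Non_elite': 0.666...}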

Cross-validation in Lenskit

I'm trying to understand how exactly cross-validation is performed in LensKit. The documentation says that, by default, the data are partitioned by user. Does that mean that, in each fold, none of the users in the test set has been used for training? Is this achieved through the "holdout" option? If so, does this option break the user-based partitioning and yield folds in which each user shows up in both the training and test sets?
Right now, my evaluation code looks something like this:
dataset crossfold("data") {
    source csvfile(sourceFile) {
        delimiter "\t"
        domain {
            minimum 0.0
            maximum 10.0
            precision 0.1
        }
    }
    // order RandomOrder
    holdoutFraction 0.1
}
I commented out the "order" option because, when using it, lenskit eval throws an error.
Cheers!!!
Each user appears in both the training and the test sets, no matter the holdout, holdoutFraction, or retain options.
However, for each test user (when using 5 partitions, 20% of the users), part of their ratings (the test ratings) are held out and placed in the test set. The remainder of their ratings are placed in the training set, along with all ratings from other users.
This simulates the common case of a recommender system: you have users, for whom some of their history is already known and can be used in model training, and you're trying to recommend or predict their future behavior.
The holdout, holdoutFraction, and retain options are different ways of deciding how many ratings are put in the test set. If you say holdout 5, then 5 ratings from each test user are put in the test set, and the rest are used for training. If you say holdoutFraction 0.2, then 20% are used for testing and 80% for training. If you say retain 5, then 5 are used for training and the rest are used for testing.
