H2O AutoML use in class imbalance mode

I have a use case with a very imbalanced data set. I undersampled the training dataset
and ran AutoML in H2O, but it gave me a great AUC (over 0.99) and a very bad AUCPR (0.09).
Is this related to the imbalance issue?
I ran with the weight_column option (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/weights_column.html),
but it didn't help.
Should I use the balance_classes option instead? (When I run both options together it fails with an "h2oFrame is empty" message.)
The train and test sets are split on a date-time range, and the test dataset has the proper ratio between the majority and minority classes.

The large difference between AUC and AUCPR is most likely caused, as you suggest, by the class imbalance. You can either set balance_classes = True or point weights_column at a column that weights the minority class more heavily, e.g. using the inverse of the class frequency. If you have a really small number of observations in the minority class, you can also try to synthesize more using e.g. SMOTE.
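A minimal sketch of both options, not your exact pipeline: it assumes a pandas DataFrame loaded from a placeholder path with a binary target column named "response", and that your H2O version exposes balance_classes on H2OAutoML (available in recent releases).

```python
# Sketch only: two ways to handle imbalance with H2O AutoML.
# "train.csv", "response" and "weight" are placeholder names.
import h2o
import pandas as pd
from h2o.automl import H2OAutoML

h2o.init()
df = pd.read_csv("train.csv")  # placeholder path

# Inverse-class-frequency weights, computed in pandas before conversion
class_freq = df["response"].value_counts(normalize=True)
df["weight"] = df["response"].map(lambda c: 1.0 / class_freq[c])

train = h2o.H2OFrame(df)
train["response"] = train["response"].asfactor()

# Option 1: pass the weight column explicitly
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y="response", training_frame=train, weights_column="weight")

# Option 2: let H2O resample the classes internally (do not combine both;
# you report an "h2oFrame is empty" error when the two are used together)
# aml = H2OAutoML(max_models=20, balance_classes=True, seed=1)
# aml.train(y="response", training_frame=train)
```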

Related

How to properly fine-tune a T5 model

I'm fine-tuning a t5-base model following this notebook.
However, the loss on both the validation set and the training set decreases very slowly. I changed the learning_rate to a larger value, but it did not help. In the end, the BLEU score on the validation set was low (around 13.7), and the translation quality was low as well.
***** Running Evaluation *****
Num examples = 1000
Batch size = 32
{'eval_loss': 1.06500244140625, 'eval_bleu': 13.7229, 'eval_gen_len': 17.564, 'eval_runtime': 16.7915, 'eval_samples_per_second': 59.554, 'eval_steps_per_second': 1.906, 'epoch': 5.0}
If I use the "Helsinki-NLP/opus-mt-en-ro" model, the loss decreases properly, and at the end the fine-tuned model works pretty well.
How do I fine-tune t5-base properly? Did I miss something?
I think the metrics shown in the tutorial are for the already-trained EN>RO opus-mt model, which was then fine-tuned. I don't see a before-and-after comparison of its metrics, so it is hard to tell how much of a difference that fine-tuning really made.
You generally shouldn't expect the same results from fine-tuning T5, which is not a (pure) machine translation model. What matters more is the difference in metrics before and after the fine-tuning.
Two things I could imagine having gone wrong with your training:
Did you add the proper T5 prefix to the input sequences ("translate English to Romanian: ") for both your training and your evaluation (see the preprocessing sketch after this list)? If you did not, you might have been training a new task from scratch instead of using the bit of pre-training the model did on MT to Romanian (and German and perhaps some other languages). You can see how that affects the model's behavior, for example, in this inference demo: Language used during pretraining and Language not used during pretraining.
If you chose a relatively small model like t5-base but stuck with the num_train_epochs=1 from the tutorial, your epoch count is probably far too low to make a noticeable difference. Try increasing the epochs for as long as you get significant performance boosts from it; in the example this is probably the case for at least the first 5 to 10 epochs.
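For the first point, a rough preprocessing sketch. It assumes the WMT-style layout used in the notebook (each example has a "translation" dict with "en" and "ro" keys) and a reasonably recent transformers version; the max lengths are arbitrary.

```python
# Sketch of adding the T5 task prefix during tokenization.
# Assumed dataset layout: each example has a "translation" dict with "en"/"ro" keys.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
prefix = "translate English to Romanian: "

def preprocess(examples):
    # Prepend the task prefix to every source sentence
    inputs = [prefix + ex["en"] for ex in examples["translation"]]
    targets = [ex["ro"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    # text_target tokenizes the labels appropriately (transformers >= 4.21)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# tokenized = raw_datasets.map(preprocess, batched=True)
```

The same prefix has to be used again at evaluation and inference time, otherwise the model is effectively asked to do a different task.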
I actually did something very similar to what you are doing, for EN>DE (German). I fine-tuned both opus-mt-en-de and t5-base on a custom dataset of 30,000 samples for 10 epochs. opus-mt-en-de's BLEU increased from 0.256 to 0.388 and t5-base's from 0.166 to 0.340, just to give you an idea of what to expect. Romanian, or the dataset you use, might be more of a challenge for the model and result in different scores, though.

Small dataset: train/test split, or train/val/test?

I did some forecasting (stock) for my thesis. I only used a fixed set of 600 samples (can't change that). Because of the small dataset, I only did a train/test split (no validation, etc.). I found some settings where I get very good results (MAPE and R2) for both train and test, but I only have the loss curve for the training set. I am wondering if that is enough, or is it a must to have both the train and validation loss curves?
Because of that thought, I also split it three ways: 10% holdout test, 70% train, and 20% validation. There I have both loss curves and get good MAPE scores (around 3-5% for all three) on train, validation, and test; only the R2 is bad on the validation set (0.7, versus 0.95 on train/test).
So can I use the first option and rely only on the training loss curve?
I do not think a validation set will be necessary in this case if you are only training a single model. To my understanding, the validation set is more useful when you are training multiple models (or tuning hyperparameters), because it helps you decide which one fits best.
https://machinelearningmastery.com/difference-test-validation-datasets/
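If you do keep a validation set, a minimal sketch of a chronological 70/20/10 split, assuming the 600 samples are already sorted by date (the array name is a placeholder); for time-ordered stock data you generally don't want to shuffle before splitting.

```python
# Chronological 70/20/10 split for time-ordered data (no shuffling).
import numpy as np

data = np.arange(600)            # placeholder for the 600 samples, sorted by date
n = len(data)
n_train = int(0.7 * n)           # 420 samples
n_val = int(0.2 * n)             # 120 samples

train = data[:n_train]
val = data[n_train:n_train + n_val]
test = data[n_train + n_val:]    # final 10% holdout (60 samples)
```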

H2O document question for stopping_tolerance, score_each_iteration, score_tree_interval, etc

I have the following questions that still confuse me after reading the H2O documentation. Can someone provide some explanation?
1. For stopping_tolerance = 0.001, let's use AUC as an example and say the current AUC is 0.8. Does that mean the AUC needs to increase to 0.8 + 0.001 (absolute), or to 0.8 * (1 + 0.1%) (relative)?
2. For score_each_iteration, the H2O documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/score_each_iteration.html) just says "iteration". But what exactly is the definition of an "iteration": each tree, each grid search, each fold of cross-validation, or something else?
3. Can I define score_tree_interval and set score_each_iteration = True at the same time, or can I only use one of them to make the grid search repeatable?
4. Is there any difference between putting stopping_metric, stopping_tolerance, and stopping_rounds in H2OGradientBoostingEstimator versus in the search_criteria of H2OGridSearch? I found that putting them in H2OGradientBoostingEstimator makes the code run much faster when I test it in a Spark environment.
1. 0.001 is the same as 0.1%. For AUC, since bigger is better, you will want to see an increase of at least 0.001 after the specified number of scoring rounds.
2. You have linked to a portion of the documentation that is specific to the algorithms listed under "Available in" at the top of the page, so let's answer this with respect to individual models and not grid search. If you want to see what is being scored at each iteration, take a look at your model results in Flow or use my_model.plot() (in the Python API). For GBM and DRF an iteration corresponds to ntrees, but since different algorithms have different aspects that change, the more generic word "iteration" is used.
3. Did you test this out? What did you find? Take a look at the scoring-history plot in Flow and notice what happens when you set both score_tree_interval and score_each_iteration = True versus when you only set score_tree_interval (I would recommend trying to understand these parameters at the individual-model level before you use grid search).
4. Yes: in one case you are specifying early stopping as you build an individual model; in the case of grid search you are indicating whether or not to build more models.
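An illustrative sketch of these parameters on a single GBM, assuming an existing H2OFrame named train with a target column "response"; the values are arbitrary and only show where the parameters live.

```python
# Illustration only: early stopping and scoring-frequency parameters on one GBM.
# Frame and column names are placeholders, not the asker's data.
from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    ntrees=500,
    score_tree_interval=5,       # score the model every 5 trees
    stopping_metric="AUC",
    stopping_rounds=3,           # compare moving averages over 3 scoring events
    stopping_tolerance=0.001,    # require at least this much improvement
    seed=1,
)
gbm.train(y="response", training_frame=train)
gbm.plot()                       # scoring history: what gets computed at each "iteration"
```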

Is a validation set necessary for training a model?

I built a 3D image classification model with a CNN for my research. I only have 5000 images, and I used 4500 images for the training set and 500 images for the test set.
I tried different architectures and parameters for training, and the F1 score and accuracy on the training set were as high as 0.9. It was fortunate that I didn't have to spend a lot of time finding settings that gave high accuracy.
I then applied this model to the test set and got a quite satisfying prediction, with an F1 score of 0.8~0.85.
My question is: is it necessary to do validation? When I took a machine learning course, I was taught to use a validation set for tuning hyperparameters. One reason I did not do k-fold cross-validation is that I do not have much data and wanted to use as much training data as possible. And my model shows quite good predictions on the test set. Can my model still convince people as long as the accuracy/F1 score/ROC are good enough? Or can I convince people by doing only k-fold cross-validation, without building and testing on a separate test set?
Thank you!
Unfortunately, I think a single result won't be enough, because it could be down to pure luck.
Using 10-fold CV, you use 90% of your data (4500 images) for training and the remaining 10% for testing, so you are not training on fewer images and you get the advantage of more reliable results.
The validation scheme proposed by Martin is already a good one, but if you are looking for something more robust you should use nested cross-validation:
1. Split the dataset into K folds.
2. The i-th training set is composed of folds {1, 2, ..., K} \ i.
3. Split that training set into N folds.
4. Set up a grid of hyper-parameter values.
5. For each set of hyper-parameter values: train on folds {1, 2, ..., N} \ j and test on the j-th fold; iterate over all N folds and compute the average F-score.
6. Choose the set of hyper-parameters that maximizes your metric.
7. Train the model on the i-th training set with the optimal hyper-parameters and test on the i-th fold.
8. Repeat for all K folds and compute the average metrics.
The average metrics alone may not be sufficient to prove the stability of the method, so it is advisable to also provide the confidence interval or the variance of the results.
Finally, to have a really stable validation of your method, you could consider substituting a resampling procedure for the initial K-fold cross-validation: instead of splitting the data into K folds, you resample the dataset at random, using 90% of the samples for training and 10% for testing. Repeat this M times with M > K; if the computation is fast enough you can do this 20, 50 or 100 times.
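A minimal sketch of that nested scheme using scikit-learn, purely for illustration (your CNN would need a Keras/PyTorch wrapper or a hand-rolled loop); the data, classifier, and hyper-parameter grid are placeholders.

```python
# Nested cross-validation sketch (scikit-learn used only for illustration).
# Outer K folds estimate performance; inner N folds pick hyper-parameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)   # placeholder data

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}   # illustrative grid
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # N inner folds
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # K outer folds

# Inner loop: choose hyper-parameters by F1 on the inner folds
clf = GridSearchCV(SVC(), param_grid, scoring="f1", cv=inner_cv)

# Outer loop: estimate of the tuned model's performance on unseen folds
scores = cross_val_score(clf, X, y, scoring="f1", cv=outer_cv)
print(scores.mean(), scores.std())   # report the spread, not just the mean
```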
A validation set is used to adjust hyperparameters; you should never touch the test set until you are finished with everything.
As suggested in the comments, I recommend k-fold cross-validation (e.g. k = 10):
Split your dataset into k = 10 sets.
For i = 1..10: use sets {1, 2, ..., 10} \ i as the training set (and to find the hyperparameters) and set i to evaluate.
Your final score is the average of those k = 10 evaluation scores.

Does Weka test results on a separate holdout set with 10-fold CV?

I used 10-fold cross-validation in Weka.
I know this usually means that the data is split into 10 parts, 90% training and 10% test, and that this is alternated 10 times.
I am wondering what Weka calculates the resulting AUC on. Is it the average over all 10 test sets? Or (and I hope this is true) does it use a holdout test set? I can't seem to find a description of this in the Weka book.
Weka averages the test results, and this is a better approach than a holdout set; I don't understand why you would hope for the latter. If you hold out a test set (of what size?), your test would not be statistically significant: it would only show that, for the best parameters chosen on the training data, you achieved some score on an arbitrarily small part of the data. The whole point of cross-validation (as an evaluation technique) is to use all the data for training and for testing in turns, so the resulting metric approximates the expected value of the true evaluation measure. A holdout test would not converge to that expected value (at least not in a reasonable time), and, even more importantly, you would have to choose another constant (how big a holdout set, and why?) and reduce the number of samples used for training, while cross-validation was developed precisely because datasets are often too small for both training and testing.
I performed cross-validation on my own (made my own random folds and created 10 classifiers) and checked the average AUC. I also checked whether the entire dataset was used to report the AUC (similar to when Weka outputs a decision tree under 10-fold CV).
The AUC for the credit dataset with a naive Bayes classifier as found by...
10-fold Weka = 0.89559
10-fold mine = 0.89509
original train = 0.90281
There is a slight discrepancy between my average AUC and Weka's, but this could be from a failure to replicate the folds exactly (although I did try to control the seeds).
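For anyone wanting to reproduce this kind of check outside Weka, a rough sketch with scikit-learn's Gaussian naive Bayes (not Weka's NaiveBayes, and on placeholder data, so the exact AUC values will differ): it averages the per-fold test AUCs and compares them with the AUC of a model trained on the full dataset.

```python
# Rough re-creation of the experiment above with scikit-learn (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder for the credit data

fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=1).split(X, y):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], probs))

print("mean AUC over 10 test folds:", np.mean(fold_aucs))   # what Weka reports

full_model = GaussianNB().fit(X, y)                           # trained on everything
print("AUC on the training data:", roc_auc_score(y, full_model.predict_proba(X)[:, 1]))
```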
