Proper way to evaluate policy + exploration offline in Vowpal Wabbit

My use case is to retrain/make predictions using VW CB in batch mode (retrain/inference occurs nightly).
I'm reading this tutorial for offline policy evaluation in the batch scenario. I'm training on a logged dataset using:
--cb_adf --save_resume -f {MODEL_PATH} -d ./data/train.txt
and in order to tune the epsilon hyperparameter on batch predictions, I run the following command 3 times on a separate dataset:
-i {MODEL_PATH} -t --cb_explore_adf --epsilon 0.1/0.2/0.3 -d ./data/eval.txt
Whichever gives the lowest average loss is the optimal epsilon.
Am I using the right options? My confusion mostly comes from another option, --explore_eval. What is the difference between --explore_eval and --cb_explore_adf, and what is the right way to evaluate the model plus exploration offline? Should I just run
--explore_eval --epsilon 0.1/0.2/0.3 -d ./data/train+eval.txt
and pick whichever gives the lowest average loss as the optimal epsilon?

-i {MODEL_PATH} -t --cb_explore_adf --epsilon 0.1/0.2/0.3 -d ./data/eval.txt
I predict the result of this experiment: the optimal epsilon is the smallest. This is because after data has been collected, there is no value to exploration. In order to assess exploration, you have to change the data available at training in a manner sensitive to the exploration algorithm. Which brings us to ...
--explore_eval --epsilon 0.1/0.2/0.3 -d ./data/train+eval.txt
'--explore_eval' is designed to assess exploration. It requires more data to work well (since it discards examples when the logged action doesn't match what the exploration would have chosen), but it lets you assess exploration because it simulates the fog of war.
If you are testing other model hyperparameters such as base learning algorithm or interactions, the extra data overhead of '--explore_eval' is unnecessary.
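In practice, the epsilon sweep with '--explore_eval' could look like this minimal sketch (assuming the vw binary is on the PATH and the combined logged dataset from the question at ./data/train+eval.txt; the loop over epsilon values is just for convenience):

import subprocess

# Sweep epsilon with --explore_eval on the combined logged data; vw reports
# its progress table and the final average loss on stderr.
for epsilon in (0.1, 0.2, 0.3):
    result = subprocess.run(
        ["vw", "--explore_eval", "--epsilon", str(epsilon),
         "-d", "./data/train+eval.txt"],
        capture_output=True, text=True, check=True,
    )
    print(f"epsilon={epsilon}")
    print(result.stderr)

The run with the lowest reported average loss would be the epsilon to keep, subject to the data-overhead caveat above.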

Related

ElasticNet extremely slow

I am running an Elastic Net model using sklearn. My dataset has 70k observations and 20 features. I want to test different parameters and use the following code:
alpha_plot, l1_ratio_plot = np.linspace(min_xlim, max_xlim, 50), np.linspace(0, 1, 10)
alpha_grid, l1_ratio_grid = np.meshgrid(alpha_plot, l1_ratio_plot)
l1_ratio_alpha_grid = np.array([l1_ratio_grid.ravel(), alpha_grid.ravel()]).T
model_coefficients_analysis = []
for i in l1_ratio_alpha_grid:
    model_analysis = ElasticNet(alpha=i[1], l1_ratio=i[0], fit_intercept=True, max_iter=10000).fit(self.features_train_std, self.labels_train)
    model_coefficients_analysis.append(model_analysis.coef_)
I am aware that this can be done with GridSearchCV, but it doesn't do the job for me because I need to store the coefficients for every combination of parameters tested. The current code snippet is exceptionally slow: it takes roughly 10 minutes for each of the 50*10 iterations. Is there a way to speed up the process? For example, GridSearchCV has an n_jobs parameter which can be set to -1 to speed things up, but here I do not seem to find it.
It takes roughly 10 minutes for each of the 50*10 iterations
That seems very high, but you also have rather large data; I can't fit a random dataset of that size into memory in Colab (where I usually run examples for answers here). You might not be able to shrink the first fit time very much, but you may be able to reduce the subsequent fit times by warm-starting.
If you set warm_start=True and reuse the same model object for each iteration, the coefficients are kept as the starting point for the solver in the next iteration:
model_analysis = ElasticNet(fit_intercept=True, warm_start=True, max_iter=10000)
for i in l1_ratio_alpha_grid:
    model_analysis.set_params(alpha=i[1], l1_ratio=i[0])
    model_analysis.fit(self.features_train_std, self.labels_train)
    model_coefficients_analysis.append(model_analysis.coef_)
You might consider using ElasticNetCV, since it uses warm-starting internally, and it provides some other niceties. You can use a PredefinedSplit if adding k-fold cross-validation is too much of an added expense, but I believe the n_jobs parameter is only useful in splitting up jobs across hyperparameters and folds, so using more cores might mitigate the issues with k-fold (but then you'll also have k times as many coefficients).
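A rough sketch of that ElasticNetCV route with a single predefined validation fold (synthetic data stands in for the question's features_train_std and labels_train; the alpha and l1_ratio grids are placeholders):

import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import PredefinedSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))                       # stand-in for features_train_std
y = X @ rng.normal(size=20) + rng.normal(size=5000)   # stand-in for labels_train

# PredefinedSplit: -1 = always in training, 0 = the single validation fold.
test_fold = np.full(len(y), -1)
test_fold[-1000:] = 0
ps = PredefinedSplit(test_fold)

model_cv = ElasticNetCV(
    l1_ratio=np.linspace(0.1, 1.0, 10),
    alphas=np.linspace(0.01, 1.0, 50),
    cv=ps,
    n_jobs=-1,
    max_iter=10000,
).fit(X, y)

print(model_cv.alpha_, model_cv.l1_ratio_)  # best combination found by the internal search

Note that, unlike the manual loop, this keeps only the coefficients of the winning combination, so it fits the tuning use case rather than a full coefficient-path analysis.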
Your large max_iter is a bit worrying; do you get nonconvergence? From the name of your feature variable it seems like you're scaling, but if not, that's the place to start: fast (and maybe correct) convergence depends on features with similar scales. You might also consider increasing the convergence tolerance tol. I have no experience with the selection parameter, but the docstring suggests changing it to 'random' may speed up convergence.

Iterative training using VowpalWabbit

I am trying to perform iterative training using VW.
Ideally I would be able to:
Train and save a model initial_model.vw (I have tested this and it works)
Load this model, add additional data to it, and save it again (to new_model.vw)
Use this new model to make predictions that the first model was not able to make to prove the iterative training has been successful.
I found one person also trying to do this (how to retrain the model for sequence of files in vowpal wabbit) but when I run my code and try to retrain with additional data, it seems to overwrite the old data instead of adding to it.
Here is the basic outline of the code I am using:
Initial training and saving:
vw initial_data.txt -b 26 --learning_rate 1.5 --passes 10 --probabilities --loss_function=logistic --oaa 80 --save_resume --kill_cache --cache_file a.cache -f initial_model.vw
Retraining with new data:
vw new_data.txt -b 26 --learning_rate 1.5 --passes 10 -i initial_model.vw --probabilities --loss_function=logistic --oaa 80 --save_resume --kill_cache --cache_file a.cache -f new_model.vw
I know that this is not enough to reproduce what I am doing but I just want to know if there are any problems with my arguments and if this should be working in theory. When I use my retrained model to make predictions, it is only accurate for test cases which are included in the new data, not anything that was covered in the original training file. Help appreciated!
I can see 2 potential issues with the arguments given in the question.
These may be ok, if you actually meant to use them this way, and you really know what you're doing, but they seem a bit suspect.
1) Whenever you run vw with multiple passes over the same data (--passes <n>), vw implicitly switches to hold-out mode, holding out 1 in every 10 examples. Held-out examples are used only for error estimation, not for learning, in order to avoid over-fitting. If this is what you meant to do, then fine, but if you don't want any of your examples held out, you should use the option --holdout_off, and be aware that the chances of over-fitting are increased.
2) The initial learning rate (--learning_rate 1.5) seems high; it increases the chances of over-fitting. If you use it because you end up with a lower training loss, that is the wrong thing to do: in ML the goal is not to minimize training loss but generalization loss.
Also: setting the initial learning rate on the 2nd batch seems to contradict the --save_resume option. The goal of --save_resume is to start a new batch with the low (already decayed, as saved in the model) per-feature learning rates (AdaGrad style). Making the learning rate jump at the start may make the first examples in the 2nd batch much more important than all the decayed features from the 1st batch.
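Putting both points together, the second-batch invocation might look something like this sketch (using only flags already discussed here, with --holdout_off added and the explicit --learning_rate dropped so the decayed rates saved by --save_resume are kept; run via Python's subprocess only for convenience):

import subprocess

# Retrain on the new batch, resuming from the saved model. --holdout_off is
# only appropriate if you accept the increased risk of over-fitting noted above.
subprocess.run(
    ["vw", "new_data.txt", "-b", "26", "--passes", "10", "--holdout_off",
     "-i", "initial_model.vw", "--probabilities", "--loss_function=logistic",
     "--oaa", "80", "--save_resume", "--kill_cache",
     "--cache_file", "a.cache", "-f", "new_model.vw"],
    check=True,
)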
Tip: you can get a feel for how well you're doing by piping the progress output into the plotting utility vw-convergence:
vw -P 1.1 ... data.txt 2>&1 | vw-convergence
(note: vw-convergence requires R)

Is validation set necessary for training a model?

I built a 3D image classification model with a CNN for my research. I only have 5000 images, and I used 4500 images for the training set and 500 images for the test set.
I tried different architectures and parameters for training, and the F1 score and accuracy on the training set were as high as 0.9. Fortunately, I didn't have to spend a lot of time finding settings that gave this high accuracy.
I then applied this model to the test set and got quite satisfying predictions, with an F1 score of 0.8-0.85.
My question here is: is it necessary to do validation? When I took a machine learning course, I was taught to use a validation set for tuning hyperparameters. One reason I did not do k-fold cross-validation is that I do not have much data and wanted to use as much training data as possible, and my model shows quite good predictions on the test set. Can my model still convince people as long as the accuracy/F1 score/ROC are good enough? Or can I convince people by doing only k-fold cross-validation, without building and testing on a separate test set?
Thank you!
Unfortunately, I think a single result won't be enough, because it could just be pure luck.
Using 10-fold CV, you use 90% of your data (4500 images) for training and the remaining 10% for testing, so you are not using fewer images for training, and you get the advantage of more reliable results.
The validation scheme proposed by Martin is already a good one, but if you are looking for something more robust you could use nested cross-validation:
Split the dataset into K folds.
The i-th training set is composed of folds {1, 2, ..., K} \ i.
Split that training set into N folds.
Set up a grid of hyper-parameter values.
For each set of hyper-parameter values:
train on folds {1, 2, ..., N} \ j and test on the j-th fold;
iterate over all N folds and compute the average F-score.
Choose the set of hyper-parameters that maximizes your metric.
Train the model using the i-th training set and the optimal set of hyper-parameters, and test on the i-th fold.
Repeat for all K folds and compute the average metrics.
The average metrics may not be sufficient to prove the stability of the method, so it's advisable to also report a confidence interval or the variance of the results.
Finally, to have a really stable validation of your method, you could consider substituting a re-sampling procedure for the initial K-fold cross-validation. Instead of splitting the data into K folds, you resample the dataset at random, using 90% of the samples for training and 10% for testing. Repeat this M times with M > K; if the computation is fast enough, you can consider doing this 20, 50, or 100 times.
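As an illustration of that nested scheme in scikit-learn terms (an SVC and a small synthetic dataset stand in for the actual 3D CNN and images; the parameter grid is a placeholder):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # the N inner folds
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # the K outer folds

# Inner loop: choose hyper-parameters by grid search on each outer training set.
tuned = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, scoring="f1", cv=inner_cv)

# Outer loop: evaluate the tuned model on each held-out outer fold.
scores = cross_val_score(tuned, X, y, scoring="f1", cv=outer_cv)
print(scores.mean(), scores.std())  # report the variance alongside the mean, as suggested above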
A validation set is used to adjust hyperparameters. You should never touch the test set until you are finished with everything!
As suggested in the comments, I recommend k-fold cross validation (e.g. k=10):
Split your dataset into k=10 sets
For i = 1..10: use sets {1, 2, ..., 10} \ i as the training set (and to find the hyperparameters) and set i to evaluate.
Your final score is the average among those k=10 evaluation scores.

Does Weka test results on a separate holdout set with 10CV?

I used 10-fold cross validation in Weka.
I know this usually means that the data is split in 10 parts, 90% training, 10% test and that this is alternated 10 times.
I am wondering what Weka calculates the resulting AUC on. Is it the average over all 10 test sets? Or (and I hope this is true) does it use a holdout test set? I can't seem to find a description of this in the Weka book.
Weka averages the test results. And this is a better approach than a holdout set; I don't understand why you would hope for such an approach. If you hold out a test set (of what size?), your test would not be statistically significant: it would only say that, for the best parameters chosen on the training data, you achieved some score on an arbitrarily small part of the data. The whole point of cross-validation (as an evaluation technique) is to use all the data as training and as testing in turns, so the resulting metric is an approximation of the expected value of the true evaluation measure. If you use a holdout test, it would not converge to the expected value (at least not in a reasonable time), and, even more importantly, you would have to choose another constant (how big a holdout set, and why?) and reduce the number of samples used for training (while cross-validation was developed precisely because datasets are often too small for both training and testing).
I performed cross-validation on my own (made my own random folds and created 10 classifiers) and checked the average AUC. I also checked to see whether the entire dataset was used to report the AUC (similar to when Weka outputs a decision tree under 10-fold).
The AUC for the credit dataset with a naive Bayes classifier as found by...
10-fold weka = 0.89559
10-fold mine = 0.89509
original train = 0.90281
There is a slight discrepancy between my average AUC and Weka's, but this could be from a failure in replicating the folds (although I did try to control the seeds).
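For reference, the manual procedure above (own folds, 10 classifiers, average the per-fold AUC, then compare with training on everything) looks roughly like this in scikit-learn, with synthetic data and GaussianNB standing in for the credit dataset and Weka's naive Bayes:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Average AUC over 10 manually created folds.
aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=1).split(X, y):
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))
print("10-fold average AUC:", np.mean(aucs))

# AUC of a model trained and evaluated on the full set (the "original train" number above).
full = GaussianNB().fit(X, y)
print("train-on-all AUC:", roc_auc_score(y, full.predict_proba(X)[:, 1]))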

Cross Validation in Weka

I've always thought from what I read that cross validation is performed like this:
In k-fold cross-validation, the original sample is randomly partitioned into k subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimation.
So k models are built and the final one is the average of those.
The Weka guide says that each model is always built using ALL of the data set. So how does cross-validation in Weka work? Is the model built from all the data, and does "cross-validation" mean that k folds are created, the model is evaluated on each fold, and the final output is simply the averaged result across the folds?
So, here is the scenario again: you have 100 labeled instances.
Use training set
Weka will take the 100 labeled instances
it will apply an algorithm to build a classifier from these 100 instances
it applies that classifier AGAIN on these same 100 instances
it provides you with the performance of the classifier (applied to the same 100 instances from which it was developed)
Use 10-fold CV
Weka takes the 100 labeled instances
it produces 10 equal-sized train/test splits; in each split, 90 labeled instances are used for training and 10 labeled instances for testing
it produces a classifier with an algorithm from the 90 training instances and applies it to the 10 testing instances for split 1
it does the same thing for splits 2 to 10, producing 9 more classifiers
it averages the performance of the 10 classifiers produced from the 10 equal-sized (90 training / 10 testing) splits
Let me know if that answers your question.
I would have answered in a comment but my reputation still doesn't allow me to:
In addition to Rushdi's accepted answer, I want to emphasize that the models which are created for the cross-validation fold sets are all discarded after the performance measurements have been carried out and averaged.
The resulting model is always based on the full training set, regardless of your test options. Since M-T-A was asking for an update to the quoted link, here it is: https://web.archive.org/web/20170519110106/http://list.waikato.ac.nz/pipermail/wekalist/2009-December/046633.html/. It's an answer from one of the WEKA maintainers, pointing out just what I wrote.
I think I figured it out. Take (for example) weka.classifiers.rules.OneR -x 10 -d outmodel.xxx. This does two things:
It creates a model based on the full dataset. This is the model that is written to outmodel.xxx. This model is not used as part of cross-validation.
Then cross-validation is run. Cross-validation involves creating (in this case) 10 new models, training and testing on segments of the data as has been described. The key is that the models used in cross-validation are temporary and only used to generate statistics. They are not equivalent to, or used for, the model that is given to the user.
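The same split of responsibilities can be sketched outside Weka; in scikit-learn terms (a DecisionTreeClassifier on synthetic data purely as a placeholder), the cross-validation models exist only to produce statistics, while the model you keep is fit on all of the data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=2)

# 10 temporary models, used only to generate the evaluation statistics.
scores = cross_val_score(DecisionTreeClassifier(random_state=2), X, y, cv=10)
print("10-fold accuracy estimate:", scores.mean())

# The model actually kept (what -d writes out in Weka) is trained on the full dataset.
final_model = DecisionTreeClassifier(random_state=2).fit(X, y)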
Weka follows the conventional k-fold cross-validation you mentioned here. You have the full data set, then divide it into k equal-sized sets (k1, k2, ..., k10, for example, for 10-fold CV) without overlaps. Then in the first run, take k1 to k9 as the training set and develop a model. Use that model on k10 to get the performance. Next, take k1 to k8 and k10 as the training set, develop a model from them, and apply it to k9 to get the performance. In this way, each fold is used exactly once as the test set.
Then Weka averages the performances and presents that on the output pane.
Once we've done the 10-fold cross-validation (dividing the data into 10 segments, creating a decision tree on each training portion, and evaluating it), what Weka does is run the algorithm an eleventh time on the whole dataset. That produces a classifier that we might deploy in practice. We use 10-fold cross-validation to get an evaluation result and an estimate of the error, and then finally we run the learner one more time to get an actual classifier to use in practice.
During each of the k cross-validation runs we get a different decision tree, but the final one is created on the whole dataset. CV is used to see whether we have an overfitting or high-variance issue.
According to "Data Mining with Weka" at The University of Waikato:
Cross-validation is a way of improving upon repeated holdout.
Cross-validation is a systematic way of doing repeated holdout that actually improves upon it by reducing the variance of the estimate.
We take a training set and we create a classifier
Then we’re looking to evaluate the performance of that classifier, and there’s a certain amount of variance in that evaluation, because it’s all statistical underneath.
We want to keep the variance in the estimate as low as possible.
Cross-validation is a way of reducing the variance, and a variant on cross-validation called “stratified cross-validation” reduces it even further.
(In contrast to the “repeated holdout” method, in which we hold out 10% for testing and repeat that 10 times.)
So how does cross-validation in Weka work?
With cross-validation, we divide our dataset just once, but we divide it into k pieces, for example, 10 pieces. Then we take 9 of the pieces and use them for training, and the last piece we use for testing. Then, with the same division, we take another 9 pieces and use them for training and the held-out piece for testing. We do the whole thing 10 times, using a different segment for testing each time. In other words, we divide the dataset into 10 pieces, then hold out each of these pieces in turn for testing, train on the rest, do the testing, and average the 10 results.
That would be 10-fold cross-validation. Divide the dataset into 10 parts (these are called “folds”);
hold out each part in turn;
and average the results.
So each data point in the dataset is used once for testing and 9 times for training.
That’s 10-fold cross-validation.
