How can I train a CRF on two datasets with CRF++? - crf

I have two datasets: dataset A and dataset B. I want to use CRF++ (mirror) to train a conditional random field (CRF) on dataset A, then train the CRF on dataset B. Is it possible to achieve that with CRF++?
I do not want to train the CRF on two datasets at the same time.

I think its absolutely possible to train 2 individual models ( one on dataset A and another on dataset B ).
In case you want only 1 model , then you can combine both datasets and train 1 model.
If I was unable to answer your question, then I'm not sure what you are trying to do. Could you please elaborate your doubt?

Related

Use of validation_frame in H2O AutoML

Just started with H2O AutoML so apologies in advance if I have missed something basic.
I have a binary classification problem where data are observations from K years. I want to train on the K-1 years and tune the models and select the best one explicitly based on the remaining K year.
If I switch off cross-validation (with nfolds=0) to avoid randomly blending of years into the N folds and define data of year K as the validation_frame then I don't have the ensemble created (as expected according to the documentation) which in fact I need.
If I train with cross-validation (default nfolds) and defining a validation frame to be the K-year data
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(x=x,y=y, training_frame=k-1_years, validation_frame=k_year)
then according to
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
the validation_frame is ignored
"...By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored."
Is there a way to get the tuning of the models and the selection of the best one(ensemble or not) based on the K-year data only, and while the ensemble of models is also available in the output?
Thanks a lot!
You don't want to have cross-validation (CV) if you are dealing with times-series (non-IID) data, since you won't want folds from the future to the predict the past.
I would explicitly add nfolds=0 so that CV is disabled in AutoML:
aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
aml.train(x=x,y=y, training_frame=k-1_years, validation_frame=k_year)
To have an ensemble, add a blending_frame which also applies to time-series. See more info here.
Additionally, since you are dealing with time-series data. I would recommend adding time-series transformations (e.g. lags), so that your model gets info from previous years and their aggregates (e.g. weighted moving average).

Is it correct to test model performance over the entire dataset?

The dataset is divided into training and testing sets using the function train_test_split() in 75:25 ratio.
The model is trained on the data set x_train and y_train.(classifier models like gaussian naive bayes, random forest, k nearest neighous ,etc)
Can we now test the model using the complete data set i.e, x and y?
Or should we only use x_test and y_test for testing the model?
train_test_split() is meant to give you a simpler way of creating training and test subsets from your original dataset. x_train and y_train both represent training data and target data, useful to train a model like the ones mentioned to finally test on the test subsets.
this is for training, i.e. practice.
testing on the entire dataset is wrong, because your model will crearly be biased on data it was trained on from x_train y_train.
you should test your models on never-before-seen y_test data

Get cross_validation_holdout_predictions() of models from a grid search

I'm trying to calculate performance in a different way how it is built in for models right now.
I would like to access raw predictions during cross-validation, so I can calculate performance on my own.
g = h2o.get_grid(grid_id)
for m in g.models:
print "Model %s" % m.model_id
rrc[m.model_id] = m.cross_validation_holdout_predictions()
I could just run prediction with a model on my dataset, but I think then this test might be biased because the model has seen this data before, or not? Can I take new predictions made on the same data set and use it to calculate performance?
I would like to access raw predictions during cross-validation, so I can calculate performance on my own.
If you want to calculate a custom metric on the cross-validated predictions, then set keep_cross_validation_predictions = True and you can access the raw predicted values using the .cross_validation_holdout_predictions() method like you have above.
Can I take new predictions made on the same data set and use it to calculate performance?
It sounds like you're asking if you can use only training data to estimate model performance? Yes, using cross-validation. If you set nfolds > 1, H2O will do cross-validation and compute a handful of cross-validated performance metrics for you. Also, if you tell H2O to save the cross-validated predictions, you can compute "cross-validated metrics" of your own.

Cross Validation in Weka

I've always thought from what I read that cross validation is performed like this:
In k-fold cross-validation, the original sample is randomly
partitioned into k subsamples. Of the k subsamples, a single subsample
is retained as the validation data for testing the model, and the
remaining k − 1 subsamples are used as training data. The
cross-validation process is then repeated k times (the folds), with
each of the k subsamples used exactly once as the validation data. The
k results from the folds then can be averaged (or otherwise combined)
to produce a single estimation
So k models are built and the final one is the average of those.
In Weka guide is written that each model is always built using ALL the data set. So how does cross validation in Weka work ? Is the model built from all data and the "cross-validation" means that k fold are created then each fold is evaluated on it and the final output results is simply the averaged result from folds?
So, here is the scenario again: you have 100 labeled data
Use training set
weka will take 100 labeled data
it will apply an algorithm to build a classifier from these 100 data
it applies that classifier AGAIN on
these 100 data
it provides you with the performance of the
classifier (applied to the same 100 data from which it was
developed)
Use 10 fold CV
Weka takes 100 labeled data
it produces 10 equal sized sets. Each set is divided into two groups: 90 labeled data are used for training and 10 labeled data are used for testing.
it produces a classifier with an algorithm from 90 labeled data and applies that on the 10 testing data for set 1.
It does the same thing for set 2 to 10 and produces 9 more classifiers
it averages the performance of the 10 classifiers produced from 10 equal sized (90 training and 10 testing) sets
Let me know if that answers your question.
I would have answered in a comment but my reputation still doesn't allow me to:
In addition to Rushdi's accepted answer, I want to emphasize that the models which are created for the cross-validation fold sets are all discarded after the performance measurements have been carried out and averaged.
The resulting model is always based on the full training set, regardless of your test options. Since M-T-A was asking for an update to the quoted link, here it is: https://web.archive.org/web/20170519110106/http://list.waikato.ac.nz/pipermail/wekalist/2009-December/046633.html/. It's an answer from one of the WEKA maintainers, pointing out just what I wrote.
I think I figured it out. Take (for example) weka.classifiers.rules.OneR -x 10 -d outmodel.xxx. This does two things:
It creates a model based on the full dataset. This is the model that is written to outmodel.xxx. This model is not used as part of cross-validation.
Then cross-validation is run. cross-validation involves creating (in this case) 10 new models with the training and testing on segments of the data as has been described. The key is the models used in cross-validation are temporary and only used to generate statistics. They are not equivalent to, or used for the model that is given to the user.
Weka follows the conventional k-fold cross validation you mentioned here. You have the full data set, then divide it into k nos of equal sets (k1, k2, ... , k10 for example for 10 fold CV) without overlaps. Then at the first run, take k1 to k9 as training set and develop a model. Use that model on k10 to get the performance. Next comes k1 to k8 and k10 as training set. Develop a model from them and apply it to k9 to get the performance. In this way, use all the folds where each fold at most 1 time is used as test set.
Then Weka averages the performances and presents that on the output pane.
once we've done the 10-cross-validation by dividing data in 10 segments & create Decision tree and evaluate, what Weka does is run the algorithm an eleventh time on the whole dataset. That will then produce a classifier that we might deploy in practice. We use 10-fold cross-validation in order to get an evaluation result and estimate of the error, and then finally we do classification one more time to get an actual classifier to use in practice.
During kth cross validation, we will going to have different Decision tree but final one is created on whole datasets. CV is used to see if we have overfitting or large variance issue.
According to "Data Mining with Weka" at The University of Waikato:
Cross-validation is a way of improving upon repeated holdout.
Cross-validation is a systematic way of doing repeated holdout that actually improves upon it by reducing the variance of the estimate.
We take a training set and we create a classifier
Then we’re looking to evaluate the performance of that classifier, and there’s a certain amount of variance in that evaluation, because it’s all statistical underneath.
We want to keep the variance in the estimate as low as possible.
Cross-validation is a way of reducing the variance, and a variant on cross-validation called “stratified cross-validation” reduces it even further.
(In contrast to the the “repeated holdout” method in which we hold out 10% for the testing and we repeat that 10 times.)
So how does cross validation in Weka work ?:
With cross-validation, we divide our dataset just once, but we divide into k pieces, for example , 10 pieces. Then we take 9 of the pieces and use them for training and the last piece we use for testing. Then with the same division, we take another 9 pieces and use them for training and the held-out piece for testing. We do the whole thing 10 times, using a different segment for testing each time. In other words, we divide the dataset into 10 pieces, and then we hold out each of these pieces in turn for testing, train on the rest, do the testing and average the 10 results.
That would be 10-fold cross-validation. Divide the dataset into 10 parts (these are called “folds”);
hold out each part in turn;
and average the results.
So each data point in the dataset is used once for testing and 9 times for training.
That’s 10-fold cross-validation.

Strange results for NaiveBayes under Weka's GUI

I'm using Weka's GUI to classify text documents. My data set is in the .arff format.
I apply the StringToWordVector filter. Then, I apply the RemovePercentage filter to divide my data set into train and test set. It contains 99 instances in total and 934 attributes. After train-test splits, train set contains 66 instances and test set contains 33 instances.
I learn the model in the train set: result is 100% as accuracy
Then, I test the model learned on the test set: result is 3.0303 %.
Could anyone help me to understand why I get 3.0303 % and how to improve this result?
The model Naive Bayes learns is overfitted. You can try different train/test splits (or cross validation) to prevent this. You can also try adjusting the parameters of the Naive Bayes algorithm or using a different one.

Resources