pycrfsuite: how to load a saved model for further training - methods

I am using pycrfsuite now. I know how to save a trained CRF model:
import pycrfsuite

crf_trainer = pycrfsuite.Trainer()
crf_trainer.append(xseq, yseq)  # feature sequences and their label sequences
crf_trainer.train('crf.crfsuite')  # train() also writes the model to this file
So, when I want to tag, I use this code:
crf_tagger = pycrfsuite.Tagger()
crf_tagger.open('crf.crfsuite')
But I don't know how to load the saved model for further training.


Is validation set necessary when fine-tuning a model using synthetic images?

Update
Can I split the small test set into a validation set realB-v and a test set realB-t, then fine-tune the model and test on the test set realB-t? Then I swap the validation set and the test set and train a new model. Can I report the average results of the two trainings?
Original post
I have a pre-trained model M trained on the real dataset realA. I test it on another real dataset realB and get very poor results because realA and realB have a domain gap. Since real images like those in realB are difficult to acquire, I decided to generate synthetic images resembling realB and use these images, syntheticA, to fine-tune the model M.
I wonder if I still need a validation set. If so, should the validation set be split from syntheticA or from realB? realB is already a very small set (300 images).
In my view, a validation set is not necessary in this case. If I directly fine-tune the model and choose hyperparameters according to the accuracy on realB, it won't cause generalization problems, because the images I use for fine-tuning are all synthetic.
I'd like to hear your views. Thank you.

How to provide parameter input for interaction variable in H2OGradientBoostingEstimator?

I need to use the interaction variable feature for multiclass classification in H2OGradientBoostingEstimator in H2O in Python. I am not sure which parameter to use or how to use it. Can anyone please help me out with this?
Currently, I am using the below code -
pros_gbm = H2OGradientBoostingEstimator(
    nfolds=0, seed=1234, keep_cross_validation_predictions=False,
    ntrees=10, max_depth=3, learn_rate=0.01, distribution='multinomial')
hist_gbm = pros_gbm.train(x=predictors, y=target, training_frame=hf_train,
                          validation_frame=hf_test, verbose=True)
GBM inherently creates interactions. You can extract information about feature interactions using the .feature_interaction() extractor method (for an H2O Model). More information is provided in the user guide and the Python docs.
If you want to explicitly add a new column that is the interaction between two numerics, you could create that manually by multiplying the two (or more) columns together to get a new interaction column.
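For example, a minimal sketch of that manual approach, assuming hf_train has two numeric columns x1 and x2 (hypothetical names):
hf_train['x1_x2'] = hf_train['x1'] * hf_train['x2']  # elementwise product as a new feature
The new column then just needs to be included in the predictors list before training.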
For categorical interactions, there is also the h2o.interaction() method in Python, which creates interaction columns in the data (prior to sending it to the GBM or any other algorithm).
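A minimal sketch of that route, assuming two categorical columns cat1 and cat2 (hypothetical names) in hf_train:
import h2o

# build the pairwise interaction of the two factors; max_factors and
# min_occurrence keep the number of resulting levels manageable
interactions = h2o.interaction(hf_train, factors=['cat1', 'cat2'],
                               pairwise=True, max_factors=100, min_occurrence=1)
hf_train = hf_train.cbind(interactions)  # append the interaction column to the frame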

In Gensim Word2vec, how to reduce the vocab size of an existing model?

In Gensim's Word2Vec API, I trained a model that I initialized with max_final_vocab = 100000 and saved with model.save().
(This gives me one .model file, one .model.trainables.syn1neg.npy and one .model.wv.vectors.npy file.)
I do not need to train the model any further, so I'm fine with using just
model = gensim.models.Word2Vec.load("train.fr.model")
kv = model.wv
del model
and working with the kv variable shown here. I now want to use only the top N (N=40000 in my case) vocabulary items instead of the entire vocabulary. The only way I could find to even attempt cutting down the vocabulary was:
import numpy as np
emb_matrix = np.load("train.fr.model.wv.vectors.npy")
emb_matrix.shape
# (100000, 300)
new_emb_matrix = emb_matrix[:40000]  # keep the 40000 most frequent entries
np.save("train.fr.model.wv.vectors.npy", new_emb_matrix)  # overwrites the original file
If I load this model again though, the vocabulary still has length 100000.
I want to reduce the vocabulary of the model or model.wv while retaining a working model. Retraining is not an option.
from gensim.models import KeyedVectors
# limit applies to word2vec-format files, not to gensim's native .model save,
# so export the vectors first and then reload only the top N of them
kv.save_word2vec_format('train.fr.vec')
kv_small = KeyedVectors.load_word2vec_format('train.fr.vec', limit=40000)
Use the optional limit parameter to reduce the number of vectors that will be loaded from the word2vec-format file. The vectors are stored most-frequent first, so this keeps exactly the top N vocabulary items.
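As a quick check (gensim 3.x attribute names assumed, matching the .trainables file naming in the question):
print(len(kv_small.vocab))  # should now print 40000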

Reusing h2o model mojo or pojo file from python

As H2O models are only reusable with the same major version of H2O they were saved with, an alternative is to save the model in MOJO/POJO format. Is there a way these saved models can be reused/loaded from Python code? Or is there any way to keep the model for further development when upgrading the H2O version?
If you want to use your model for scoring via Python, you could use either h2o.mojo_predict_pandas or h2o.mojo_predict_csv. Otherwise, if you want to load a binary model that you previously saved, you will need compatible versions.
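For example, a minimal sketch of the pandas route (the file paths are placeholders; h2o-genmodel.jar ships alongside the MOJO download):
import h2o
import pandas as pd

df = pd.read_csv('test_data.csv')  # placeholder scoring data
predictions = h2o.mojo_predict_pandas(df, mojo_zip_path='gbm_model.zip',
                                      genmodel_jar_path='h2o-genmodel.jar')
print(predictions.head())
This runs the genmodel scorer in a local Java process, so no running H2O cluster is required.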
Outside of H2O-3, you can look into pyjnius as Tom recommended: https://github.com/kivy/pyjnius
Another alternative is to use pysparkling, if you only need it for scoring:
from pysparkling.ml import H2OMOJOModel
# Load test data to predict
df = spark.read.parquet(test_data_path)
# Load mojo model
mojo = H2OMOJOModel.createFromMojo(mojo_path)
# Make predictions
predictions = mojo.transform(df)
# Show predictions with ground truth (y_true and y_pred)
predictions.select('your_target_column', 'prediction').show()

Is there a supported way to get list of features used by a H2O model during its training?

This is my situation. I have over 400 features, many of which are probably useless and often zero. I would like to be able to:
train a model with a subset of those features
query that model for the features actually used to build it
build an H2OFrame containing just those features (I get a sparse list of non-zero values for each row I want to predict)
pass this newly constructed frame to H2OModel.predict() to get a prediction
I am pretty sure what I found is unsupported but works for now (v 3.13.0.341). Is there a more robust/supported way of doing this?
model._model_json['output']['names']
The response variable appears to be the last item in this list.
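If that holds, the predictor list can be recovered by dropping the last entry; a one-line sketch relying on those unsupported internals:
feature_names = model._model_json['output']['names'][:-1]  # drop the response column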
In a similar vein, it would be nice to have a supported way of finding out which H2O version the model was built under. I cannot find the version number in the JSON.
If you want to know which feature columns the model used after you have built it, you can do the following in Python:
my_training_frame = your_model.actual_params['training_frame']
which will return some frame id
and then you can do
col_used = h2o.get_frame(my_training_frame)
col_used
EDITED (after comment was posted)
To get the columns use:
col_used.columns
Also, a quick way to check the version of a saved binary model is to try to load it into H2O: if it loads, it was built with the same version; if not, you will get a warning.
You can also open the saved model file; the first line will list the version of H2O used to create it.
For a model saved as a MOJO, you can look at the model.ini file, which lists the version of H2O.
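Since a MOJO is just a zip archive, here is a short sketch for pulling that line out from Python (the exact h2o_version key name inside model.ini is an assumption):
import zipfile

with zipfile.ZipFile('gbm_model.zip') as mojo:  # placeholder MOJO path
    with mojo.open('model.ini') as ini:  # model.ini sits at the archive root
        for line in ini:
            if b'h2o_version' in line:
                print(line.decode().strip())
                break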
