Reusing H2O model MOJO or POJO file from Python - h2o

As H2O models are only reusable with the same major version of H2O they were saved with, an alternative is to save the model in MOJO/POJO format. Is there a way these saved models can be reused/loaded from Python code? Or is there a way to keep the model for further development when upgrading the H2O version?

If you want to use your model for scoring via Python, you can use either h2o.mojo_predict_pandas or h2o.mojo_predict_csv. Otherwise, if you want to load a binary model that you previously saved, you will need compatible versions.
Outside of H2O-3 you can look into pyjnius as Tom recommended: https://github.com/kivy/pyjnius
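For example, here is a minimal sketch of scoring a saved MOJO from plain Python with h2o.mojo_predict_pandas, without starting an H2O cluster; the file paths (model.zip, h2o-genmodel.jar, test_rows.csv) are hypothetical placeholders:
import h2o
import pandas as pd

# Rows to score; the column names must match the MOJO's training columns.
df = pd.read_csv("test_rows.csv")

# Score the MOJO directly via the genmodel jar (shipped with the h2o distribution).
preds = h2o.mojo_predict_pandas(
    dataframe=df,
    mojo_zip_path="model.zip",
    genmodel_jar_path="h2o-genmodel.jar",
)
print(preds.head())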

Another alternative is to use pysparkling, if you only need it for scoring:
from pysparkling.ml import H2OMOJOModel
# Load test data to predict
df = spark.read.parquet(test_data_path)
# Load mojo model
mojo = H2OMOJOModel.createFromMojo(mojo_path)
# Make predictions
predictions = mojo.transform(df)
# Show predictions with ground truth (y_true and y_pred)
predictions.select('your_target_column', 'prediction').show()

Related

How to provide parameter input for interaction variable in H2OGradientBoostingEstimator?

I need to use the interaction variable feature for multiclass classification with H2OGradientBoostingEstimator in H2O in Python, but I am not sure which parameter to use or how to use it. Can anyone please help me out with this?
Currently, I am using the code below:
pros_gbm = H2OGradientBoostingEstimator(nfolds=0, seed=1234,
                                        keep_cross_validation_predictions=False,
                                        ntrees=10, max_depth=3, learn_rate=0.01,
                                        distribution='multinomial')
hist_gbm = pros_gbm.train(x=predictors, y=target, training_frame=hf_train,
                          validation_frame=hf_test, verbose=True)
GBM inherently creates interactions. You can extract information about feature interactions using the .feature_interaction() extractor method (for an H2O Model). More information is provided in the user guide and the Python docs.
If you want to explicitly add a new column that is the interaction between two numerics, you could create that manually by multiplying the two (or more) columns together to get a new interaction column.
For categorical interactions, there is also the h2o.interaction() method in Python to create interaction columns in the data (prior to sending it to the GBM or any other algorithm), as sketched below.
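A minimal sketch of both approaches, assuming hf_train is an H2OFrame with numeric columns 'x1' and 'x2' and categorical columns 'cat1' and 'cat2' (all column names are hypothetical):
# Manual numeric interaction: multiply two numeric columns into a new one.
hf_train['x1_x2'] = hf_train['x1'] * hf_train['x2']

# Categorical interaction columns via h2o.interaction(), then append them.
inter = h2o.interaction(hf_train,
                        factors=['cat1', 'cat2'],
                        pairwise=True,
                        max_factors=100,
                        min_occurrence=1)
hf_train = hf_train.cbind(inter)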

Download pre-trained sentence-transformers model locally

I am using the SentenceTransformers library (here: https://pypi.org/project/sentence-transformers/#pretrained-models) to create embeddings of sentences with the pre-trained model bert-base-nli-mean-tokens. I have an application that will be deployed to a device that does not have internet access. How to save the model has already been answered here: Download pre-trained BERT model locally. Yet I'm stuck at loading the saved model from the locally saved path.
When I try to save the model using the above-mentioned technique, these are the output files:
('/bert-base-nli-mean-tokens/tokenizer_config.json',
'/bert-base-nli-mean-tokens/special_tokens_map.json',
'/bert-base-nli-mean-tokens/vocab.txt',
'/bert-base-nli-mean-tokens/added_tokens.json')
When I try to load it into memory using
tokenizer = AutoTokenizer.from_pretrained(to_save_path)
I'm getting
Can't load config for '/bert-base-nli-mean-tokens'. Make sure that:
- '/bert-base-nli-mean-tokens' is a correct model identifier listed on 'https://huggingface.co/models'
- or '/bert-base-nli-mean-tokens' is the correct path to a directory containing a config.json
You can download and load the model like this:
from sentence_transformers import SentenceTransformer

modelPath = "local/path/to/model"

model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
model.save(modelPath)

model = SentenceTransformer(modelPath)
This worked for me. You can check the SBERT documentation for details of the SentenceTransformer class here: https://www.sbert.net/docs/package_reference/SentenceTransformer.html
There are many ways to solve this issue:
Assuming you have trained your BERT base model locally (colab/notebook), in order to use it with the Huggingface AutoClass, the model (along with the tokenizer, vocab.txt, configs, special tokens, and tf/pytorch weights) has to be uploaded to Huggingface. The steps to do this are described here. Once it is uploaded, a repository will be created under your username, and the model can be accessed as follows:
from transformers import AutoTokenizer
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("<username>/<model-name>")
The second way is to use the trained model locally, and this can be done by using pipelines. The following is an example of how to use a model trained (and saved) locally for your use case (the example is from my locally trained QA model):
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

nlp_QA = pipeline('question-answering',
                  model='./abhilash1910/distilbert-squadv1',
                  tokenizer='./abhilash1910/distilbert-squadv1')
QA_inp = {
    'question': 'What is the fund price of Huggingface in NYSE?',
    'context': 'Huggingface Co. has a total fund price of $19.6 million dollars'
}
result = nlp_QA(QA_inp)
result
The third way is to directly use Sentence Transformers from the Huggingface models repo.
There are also other ways to resolve this, but these might help. The list of pretrained models on the Huggingface hub might also help.
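The error above usually means config.json is missing from the saved directory because only the tokenizer was saved. A minimal sketch that saves both the model and the tokenizer with save_pretrained and then loads them offline; the local directory is a hypothetical placeholder:
from transformers import AutoModel, AutoTokenizer

save_path = "./bert-base-nli-mean-tokens"  # hypothetical local directory

# save_pretrained writes config.json alongside the weights, which fixes the error.
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

# Later, on the offline device:
model = AutoModel.from_pretrained(save_path)
tokenizer = AutoTokenizer.from_pretrained(save_path)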

How to fetch details of non-leader models generated by H2O AutoML?

After running AutoML (classification with 3 classes), I can see a list of models as follows:
model_id                                                mean_per_class_error
StackedEnsemble_BestOfFamily_0_AutoML_20180420_174925   0.262355
StackedEnsemble_AllModels_0_AutoML_20180420_174925      0.262355
XRT_0_AutoML_20180420_174925                            0.266606
DRF_0_AutoML_20180420_174925                            0.278428
GLM_grid_0_AutoML_20180420_174925_model_0               0.442917
But mean_per_class_error is not a good metric for my case, where the classes are unbalanced (one class has a very small population). How can I fetch details of the non-leader models and calculate other metrics? Thanks.
python version: 3.6.0
h2o version: 3.18.0.5
Actually, I just figured this out myself (assuming aml is the H2O AutoML object after training):
for m in aml.leaderboard.as_data_frame()['model_id']:
    print(m)
    print(h2o.get_model(m))
You can also grab the corresponding model you're interested in using the following line:
model6 = h2o.get_model(aml.leaderboard.as_data_frame()['model_id'][6])
where 6 is the index number of the model in the leaderboard.
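Once you have a model handle, you can compute whichever metric suits the class imbalance. A minimal sketch, assuming hf_test is a held-out H2OFrame (the name is hypothetical):
m = h2o.get_model(aml.leaderboard.as_data_frame()['model_id'][2])

# Evaluate on the held-out frame and pull metrics other than mean_per_class_error.
perf = m.model_performance(test_data=hf_test)
print(perf.logloss())
print(perf.confusion_matrix())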

Is it possible to train XGBoost4J-Spark with tree_method='exact' ?

I intend to use a trained xgboost model with tree_method='exact' in a SparkML pipeline, so I need to use XGBoost4J-Spark; however, the documentation says "Distributed and external memory version only support approximate algorithm." (https://xgboost.readthedocs.io/en/latest//parameter.html). Is there any way to work around this?
Alternatively, I could train the model with C-based xgboost and somehow convert the trained model to an XGBoostEstimator, which is a SparkML estimator and seamless to integrate into a SparkML pipeline. Has anyone come across such a converter?
I don't mind running on a single node instead of a cluster, as I can afford to wait.
Any insight is appreciated.
So there is this way:
import ml.dmlc.xgboost4j.scala.XGBoost
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel

val xgb1 = XGBoost.loadModel("xgb1")
val xgbSpark = new XGBoostRegressionModel(xgb1)
where xgb1 is the model trained with C-based xgboost. There is a problem, however: their predictions don't match. I have reported the issue on the GitHub repo: https://github.com/dmlc/xgboost/issues/3190

Is there a supported way to get list of features used by a H2O model during its training?

This is my situation. I have over 400 features, many of which are probably useless and often zero. I would like to be able to:
train a model with a subset of those features
query that model for the features actually used to build it
build an H2OFrame containing just those features (I get a sparse list of non-zero values for each row I want to predict)
pass this newly constructed frame to H2OModel.predict() to get a prediction
I am pretty sure what I found is unsupported but works for now (v 3.13.0.341). Is there a more robust/supported way of doing this?
model._model_json['output']['names']
The response variable appears to be the last item in this list.
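For example, a minimal sketch of the whole loop, assuming model is a trained H2O model and full_frame is an H2OFrame holding the rows to score (both names hypothetical):
# Columns the model was trained on; the last entry is the response.
used_cols = model._model_json['output']['names'][:-1]

# Build a frame with just those columns and predict on it.
scoring_frame = full_frame[used_cols]
preds = model.predict(scoring_frame)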
In a similar vein, it would be nice to have a supported way of finding out which H2O version the model was built under. I cannot find the version number in the JSON.
If you want to know which feature columns the model used after you have built it, you can do the following in Python:
my_training_frame = your_model.actual_params['training_frame']
which will return some frame id
and then you can do
col_used = h2o.get_frame(my_training_frame)
col_used
EDITED (after comment was posted)
To get the columns use:
col_used.columns
Also, a quick way to check the version of a saved binary model is to try to load it into h2o: if it loads, it is the same version of h2o; if it doesn't, you will get a warning.
You can also open the saved model file; the first line will list the version of H2O used to create it.
For a model saved as a MOJO, you can look at the model.ini file; it will list the version of H2O.
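Since a MOJO is just a zip archive, you can read model.ini without unpacking it. A minimal sketch, assuming the MOJO was saved as model.zip (the path is hypothetical):
import zipfile

with zipfile.ZipFile("model.zip") as z:
    with z.open("model.ini") as f:
        for raw in f:
            line = raw.decode("utf-8").strip()
            if line.startswith("h2o_version"):
                print(line)
                break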
