DRF model Predictions using H2O and scoring - algorithm

My goal is to create a DRF model in H2O with the TRAIN, VALIDATION and TEST datasets I have, and then compute the RMSE, R2, MSE, etc. on the TEST dataset.
Below is the piece of code:
DRFParameters rfParms = (DRFParameters) algParameter;
rfParms._response_column = trainDataFrame._names[responseColumn(trainDataFrame)]; //The response column
rfParms._train = trainDataFrame._key;
//rfParms._valid = testDataFrame._key;
rfParms._nfolds = 5;
DRF job = new DRF(rfParms);
DRFModel drf = job.trainModel().get(); // Train the model
Frame pred = drf.score(testDataFrame); //Score the test
Here I don't know how to proceed with computing the metrics (R2, RMSE, MSE, MAE, etc.) after scoring.
Could you please help with H2O DRF modeling and metrics calculation using Java?

Depending on whether your model is a regression, binomial or multinomial model, you'll have to use one of ModelMetricsRegression.make(), ModelMetricsBinomial.make() or ModelMetricsMultinomial.make(). They have slightly different signatures; you can find them in our Java docs.
For the trainDataFrame you can get the metrics from your drf model; they are in drf._output._training_metrics (you might need to cast it to an appropriate type, as this one is a generic ModelMetrics). If you use your test dataset as a validation frame, you can get the metrics from drf._output._validation_metrics.
Edit:
DRFModel drf = job.trainModel().get(); // Train the model
Frame pred = drf.score(testDataFrame); //Score the test
ModelMetricsBinomial mm = ModelMetricsBinomial.make(pred.vec(2), testDataFrame.vec(rfParms._response_column)); // class-1 probabilities from the scored frame vs. actual labels from the test frame
double auc = mm.auc();
double rmse = mm.rmse();
double r2 = mm.r2();
// etc.

Related

How to use Huggingface pretrained models to get the output of the dataset that was used to train the model?

I am working on getting the abstractive summaries of the XSUM and the CNN DailyMail datasets using Huggingface's pre-trained BART, Pegasus, and T5 models.
I am confused because there already exist checkpoints of models pre-trained on the same dataset.
So even if I do:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("mwesner/pretrained-bart-CNN-Dailymail-summ")
model = AutoModelForSeq2SeqLM.from_pretrained("mwesner/pretrained-bart-CNN-Dailymail-summ")
I can't understand how to get the summaries of either dataset since I don't have any new sentences that I can feed in.
This is how a pretrained model is normally used:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
# Generate Summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
But I need the summaries generated by the pre-trained models on the datasets that were used to train them (XSUM and CNN DailyMail).
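A minimal sketch of one way to do that, assuming the Hugging Face datasets package and the public cnn_dailymail dataset (config "3.0.0", which exposes an "article" field) are available, is to pull the articles from the dataset itself and run them through the same generate loop as above:
from datasets import load_dataset
from transformers import BartTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# Assumption: articles are taken from the public cnn_dailymail dataset on the Hub
dataset = load_dataset('cnn_dailymail', '3.0.0', split='test[:5]')

for example in dataset:
    inputs = tokenizer(example['article'], max_length=1024, truncation=True, return_tensors='pt')
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=60, early_stopping=True)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
The same pattern should apply to XSUM, where the text field is named "document".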

gensim.interfaces.TransformedCorpus - how to use it?

I'm relatively new to the world of Latent Dirichlet Allocation.
I am able to generate an LDA model following the Wikipedia tutorial, and I'm able to generate an LDA model with my own documents.
My next step is to understand how I can use a previously generated model to classify unseen documents.
I'm saving my "lda_wiki_model" with
id2word = gensim.corpora.Dictionary.load_from_text('ptwiki_wordids.txt.bz2')
mm = gensim.corpora.MmCorpus('ptwiki_tfidf.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
lda.save('lda_wiki_model.lda')
And I'm loading the same model with:
new_lda = gensim.models.LdaModel.load(path + 'lda_wiki_model.lda')  # load the model
I have a "new_doc.txt"; I turned the document into an id <-> term dictionary and converted the tokenized document into a document-term matrix (bag of words).
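Roughly, that conversion step looks like the following sketch (an illustration only, assuming new_doc.txt contains plain text; for the transformation to be meaningful, the bag-of-words ids must come from the same dictionary the model was trained with):
tokens = open('new_doc.txt').read().lower().split()
# reuse the wiki dictionary loaded above so the word ids match the model
corpus = [id2word.doc2bow(tokens)]  # bag-of-words ("document-term matrix") for the new document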
But when I run new_topics = new_lda[corpus], I receive a
'gensim.interfaces.TransformedCorpus object at 0x7f0ecfa69d50'
How can I extract topics from that?
I already tried
lsa = models.LdaModel(new_topics, id2word=dictionary, num_topics=1, passes=2)
corpus_lda = lsa[new_topics]
print(lsa.print_topics(num_topics=1, num_words=7))
and
print(corpus_lda.print_topics(num_topics=1, num_words=7))
but that returns topics not related to my new document.
Where is my mistake? Am I misunderstanding something?
If I run a new model using the dictionary and corpus created above, I get the correct topics. My point is: how do I re-use my model? Is it correct to re-use that wiki_model?
Thank you.
I was facing the same problem. This code will solve your problem:
new_topics = new_lda[corpus]
for topic in new_topics:
    print(topic)
This will give you a list of tuples of the form (topic number, probability).
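For a single unseen document, the output might look like this (hypothetical values):
[(17, 0.52), (42, 0.31), (63, 0.08)]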
From the 'Topics_and_Transformation.ipynb' tutorial prepared by the RaRe Technologies people:
Converting the entire corpus at the time of calling corpus_transformed = model[corpus] would mean storing the result in main memory, and that contradicts gensim's objective of memory-independence.
If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
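If you do go the serialization route, a minimal sketch using gensim's MmCorpus (the file name is arbitrary) would be:
from gensim import corpora

# write the transformed corpus to disk once...
corpora.MmCorpus.serialize('corpus_transformed.mm', corpus_transformed)
# ...then stream it back as often as needed without recomputing the transformation
corpus_transformed = corpora.MmCorpus('corpus_transformed.mm')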
Hope it helps.
This has been answered, but here is some code for anyone looking to also export the classification of unseen documents to a CSV file.
import pandas as pd

# Access the unseen corpus
corpus_test = [id2word.doc2bow(doc) for doc in data_test_lemmatized]

# Transform into LDA space based on the old model
lda_unseen = lda_model[corpus_test]

# Print results
for topic in lda_unseen:
    print(topic)

# Collect the (topic, probability) tuples and export to csv
topic_probability = []
for t in lda_unseen:
    topic_probability.append(t)

results_test = pd.DataFrame(topic_probability,
                            columns=['Topic 1', 'Topic 2', 'Topic 3',
                                     'Topic 4', 'Topic 5', 'Topic n'])
results_test.to_csv('test_results.csv', index=True, header=True)
Code inspired by this post.

Reduce precision of multi polygon field with django rest framework gis

I'm using django-rest-framework-gis to load up Leaflet maps, and at the top level of my app I'm looking at a map of the world. The basemap is from Mapbox. I make a call to my REST API and return an outline of all of the individual countries that are included in the app. Currently, the GeoJSON file that is returned is 1.1 MB in size, and I have more countries to add, so I'd like to reduce the size to improve performance.
Here is an example of the contents:
{"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"MultiPolygon","coordinates":[[[[-64.54916992187498,-54.71621093749998],[-64.43881835937495,-54.739355468749984],[-64.22050781249999,-54.721972656249996],[-64.10532226562495,-54.72167968750003],[-64.054931640625,-54.72988281250001],[-64.03242187499995,-54.74238281249998],[-63.881933593750006,-54.72294921875002],[-63.81542968749997,-54.725097656250014],[-63.83256835937499,-54.76796874999995],[-63.97124023437499,-54.810644531250034],[-64.0283203125,-54.79257812499999],[-64.32290039062497,-54.79648437499999],[-64.45327148437497,-54.84033203124995],[-64.50869140625,-54.83994140624996],[-64.637353515625,-54.90253906250001],
The size of the file is a function of the number of points and the precision of those points. I was thinking that the most expedient way to reduce the size, while preserving my original data, would be to reduce the precision of the geometry points. But I'm at a bit of a loss as to how to do this. I've looked through the documentation on GitHub and haven't found any clues.
Is there a field option to reduce the precision of the GeoJSON returned? Or is there another way to achieve what I'm trying to do?
Many thanks.
I ended up simplifying the geometry using PostGIS and then passing that queryset to the serializer. I started by creating a raw query in the model manager.
class RegionQueryset(models.query.QuerySet):
    def simplified(self):
        return self.raw(
            "SELECT region_code, country_code, name, slug, ST_SimplifyVW(geom, 0.01) as geom FROM regions_region "
            "WHERE active=TRUE AND region_type = 'Country'"
        )


class RegionsManager(models.GeoManager):
    def get_queryset(self):
        return RegionQueryset(self.model, using=self._db)

    def simplified(self):
        return self.get_queryset().simplified()
The view is quite simple:
class CountryApiGeoListView(ListAPIView):
    queryset = Region.objects.simplified()
    serializer_class = CountryGeoSerializer
And the serializer:
class CountryGeoSerializer(GeoFeatureModelSerializer):
    class Meta:
        model = Region
        geo_field = 'geom'
        queryset = Region.objects.filter(active=True)
        fields = ('name', 'slug', 'region_code', 'geom')
I ended up settling on the PostGIS function ST_SimplifyVW() after running some tests.
My dataset has 20 countries with geometry provided by Natural Earth. Without optimizing, the GeoJSON file was 1.2 MB in size, the query took 17 ms to run, and the page took 1.15 seconds to load in my browser. Of course, the quality of the rendered outline was great. I then tried the ST_Simplify() and ST_SimplifyVW() functions with different parameters. From these very rough tests, I decided on ST_SimplifyVW(geom, 0.01).
Function                   Size    Query time  Load time  Appearance
None                       1.2 MB  17 ms       1.15 s     Great
ST_Simplify(geom, 0.1)     240 KB  15.94 ms    371 ms     Barely acceptable
ST_Simplify(geom, 0.01)    935 KB  22.45 ms    840 ms     Good
ST_SimplifyVW(geom, 0.01)  409 KB  25.92 ms    628 ms     Good
My setup was Postgres 9.4 and PostGIS 2.2. ST_SimplifyVW is not included in PostGIS 2.1, so you must use 2.2.
You could save some space by setting the precision with GeometryField during serialization. This is an extract of my code modelling the same WorldBorder model defined in the GeoDjango tutorial. For serializers.py:
from rest_framework_gis.serializers import (
    GeoFeatureModelSerializer, GeometryField)

from .models import WorldBorder


class WorldBorderSerializer(GeoFeatureModelSerializer):
    # set a custom precision for the geometry field
    mpoly = GeometryField(precision=2, remove_duplicates=True)

    class Meta:
        model = WorldBorder
        geo_field = "mpoly"
        fields = (
            "id", "name", "area", "pop2005", "fips", "iso2", "iso3",
            "un", "region", "subregion", "lon", "lat",
        )
Explicitly defining the precision with mpoly = GeometryField(precision=2) will do the trick. The remove_duplicates=True option removes identical points generated by truncating the numbers. You need to keep the geo_field reference to your geometry field in the Meta class, or the REST framework will not work. This is my views.py code to expose the GeoJSON object using a ViewSet:
from rest_framework import viewsets, permissions

from .models import WorldBorder
from .serializers import WorldBorderSerializer


class WorldBorderViewSet(viewsets.ModelViewSet):
    queryset = WorldBorder.objects.all()
    serializer_class = WorldBorderSerializer
    permission_classes = (permissions.IsAuthenticatedOrReadOnly, )
However, the most effective way to save space is to simplify the geometries, as described by geoAndrew. Here I calculate the geometry simplification on the fly in the serializer:
from rest_framework_gis.serializers import (
    GeoFeatureModelSerializer, GeometrySerializerMethodField)

from .models import WorldBorder


class WorldBorderSerializer(GeoFeatureModelSerializer):
    # in order to simplify polygons on the fly
    simplified_mpoly = GeometrySerializerMethodField()

    def get_simplified_mpoly(self, obj):
        # Returns a new GEOSGeometry, simplified to the specified tolerance
        # using the Douglas-Peucker algorithm. A higher tolerance value implies
        # fewer points in the output. If no tolerance is provided, it
        # defaults to 0.
        return obj.mpoly.simplify(tolerance=0.01, preserve_topology=True)

    class Meta:
        model = WorldBorder
        geo_field = "simplified_mpoly"
        fields = (
            "id", "name", "area", "pop2005", "fips", "iso2", "iso3",
            "un", "region", "subregion", "lon", "lat",
        )
The two solutions are different and can't be merged (see how rest_framework_gis.fields is implemented). Maybe simplifying the geometry is the better solution to preserve quality and save space. Hope it helps!

MLlib RandomForest (Spark 2.0): predict a single vector

After training a RandomForestRegressor in a PipelineModel using the DataFrame-based ML API (Spark 2.0), I loaded the saved model into my real-time environment in order to predict with it. Each request is handled and transformed through the loaded PipelineModel, but in the process I had to convert the single request vector into a one-row DataFrame using spark.createDataFrame, and all of this takes around 700 ms, compared to 2.5 ms if I use the RDD-based mllib RandomForestRegressor.predict(VECTOR).
Is there any way to use the new ML API to predict on a single vector without converting it to a DataFrame, or to do something else to speed things up?
The DataFrame-based org.apache.spark.ml.regression.RandomForestRegressionModel also takes a Vector as input, so I don't think you need to convert a vector to a DataFrame for every call.
Here is how I think your code could work:
// load the trained RF model
val rfModel = RandomForestRegressionModel.load("path")

// a DataFrame containing a column "feature" of type Vector
val predictionData = ???

predictionData.map { row =>
  val feature = row.getAs[Vector]("feature")
  val result = rfModel.predict(feature)
  (feature, result)
}

Save a trained model of Spark's Naive Bayes classifier

Does somebody know whether it is possible to save a trained model of Spark's Naive Bayes classifier (for example to a file), and load it in the future if required?
Thank You.
I tried saving and loading the model, but I was not able to recreate the model from the stored weights (couldn't find the proper constructor). However, the whole model is serializable, so you can store and load it as follows.
Store it as:
import java.io.{FileOutputStream, ObjectOutputStream}

val fos = new FileOutputStream(<storage path>)
val oos = new ObjectOutputStream(fos)
oos.writeObject(model)
oos.close()
and load it back with:
import java.io.{FileInputStream, ObjectInputStream}

val fis = new FileInputStream(<storage path>)
val ois = new ObjectInputStream(fis)
// cast to the concrete model class that was serialized (NaiveBayesModel here)
val newModel = ois.readObject().asInstanceOf[org.apache.spark.mllib.classification.NaiveBayesModel]
It worked for me.
It is discussed in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-td11953.html
You can use the built-in functions (Spark version 2.1.0): use NaiveBayesModel#save to store the model and NaiveBayesModel#load to read a previously stored model.
The save method comes from Saveable and is implemented by a wide range of classification models. The load method appears to be static in each classification model implementation.
