Does anybody know whether it is possible to save a trained Spark Naive Bayes classifier model (for example, to a text file) and load it later if required?
Thank you.
I tried saving and loading the model. I was not able to recreate the model from the stored weights (I couldn't find the proper constructor), but the whole model is serializable, so you can store and load it as follows.
Store it with:
val fos = new FileOutputStream(<storage path>)
val oos = new ObjectOutputStream(fos)
oos.writeObject(model)
oos.close
and load it with:
val fis = new FileInputStream(<storage path>)
val ois = new ObjectInputStream(fis)
val newModel = ois.readObject().asInstanceOf[org.apache.spark.mllib.classification.NaiveBayesModel]
It worked for me. It is discussed in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-td11953.html
You can use the built-in functions (Spark version 2.1.0). Use NaiveBayesModel#save to store the model and NaiveBayesModel#load to read a previously stored model.
The save method comes from Saveable and is implemented by a wide range of classification models. The load method appears to be static in each classification model implementation.
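If you happen to be working from Python, the same built-in pair is exposed in PySpark's mllib API as well; below is only a minimal sketch, assuming the RDD-based API, with a toy training set and a placeholder path:
from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="nb-save-load")

# tiny illustrative training set; replace with your real RDD of LabeledPoint
training_data = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([1.0, 0.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 1.0])),
])

model = NaiveBayes.train(training_data)

# persist the model, then read it back later (the path is a placeholder)
model.save(sc, "/tmp/nb_model")
same_model = NaiveBayesModel.load(sc, "/tmp/nb_model")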
Related
I'm relatively new to the world of Latent Dirichlet Allocation.
I am able to generate an LDA model following the Wikipedia tutorial, and I am able to generate an LDA model with my own documents.
My next step is to try to understand how I can use a previously generated model to classify unseen documents.
I'm saving my "lda_wiki_model" with
id2word =gensim.corpora.Dictionary.load_from_text('ptwiki_wordids.txt.bz2')
mm = gensim.corpora.MmCorpus('ptwiki_tfidf.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
lda.save('lda_wiki_model.lda')
And I'm loading the same model with:
new_lda = gensim.models.LdaModel.load(path + 'lda_wiki_model.lda') # load the model
I have a "new_doc.txt", and I turn my document into a id<-> term dictionary and converted this tokenized document to "document-term matrix"
But when I run new_topics = new_lda[corpus] I receive a gensim.interfaces.TransformedCorpus object at 0x7f0ecfa69d50.
How can I extract topics from that?
I already tried:
lsa = models.LdaModel(new_topics, id2word=dictionary, num_topics=1, passes=2)
corpus_lda = lsa[new_topics]
print(lsa.print_topics(num_topics=1, num_words=7))
and
print(corpus_lda.print_topics(num_topics=1, num_words=7))
but that returns topics not related to my new document.
Where is my mistake? Am I misunderstanding something?
If I run a new model using the dictionary and corpus created above, I receive the correct topics. My point is: how do I re-use my model? Is it correct to re-use that wiki_model?
Thank you.
I was facing the same problem. This code will solve your problem:
new_topics = new_lda[corpus]
for topic in new_topics:
    print(topic)
This will give you a list of tuples of the form (topic number, probability).
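If you only need the single most probable topic per document, here is a small follow-up sketch reusing new_lda and corpus from above:
# each item yielded by the transformed corpus is a list of (topic_id, probability) tuples
for doc_topics in new_lda[corpus]:
    topic_id, prob = max(doc_topics, key=lambda pair: pair[1])
    print(topic_id, prob)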
From the 'Topics_and_Transformation.ipynb' tutorial prepared by the RaRe Technologies people:
Converting the entire corpus at the time of calling corpus_transformed = model[corpus] would mean storing the result in main memory, and that contradicts gensim's objective of memory-independence.
If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
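In practice, with gensim that serialization step might look like the following sketch (the file name is a placeholder):
import gensim

# transform once, then persist the result so it is not recomputed on every pass
corpus_lda = new_lda[corpus]
gensim.corpora.MmCorpus.serialize('corpus_lda.mm', corpus_lda)

# later, stream the serialized corpus straight from disk
corpus_lda = gensim.corpora.MmCorpus('corpus_lda.mm')
for doc_topics in corpus_lda:
    print(doc_topics)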
Hope it helps.
This has been answered, but here is some code for anyone looking to also export the classification of unseen documents to a CSV file.
import pandas as pd

# Access the unseen corpus
corpus_test = [id2word.doc2bow(doc) for doc in data_test_lemmatized]

# Transform into LDA space based on the old model
lda_unseen = lda_model[corpus_test]

# Print results
for topic in lda_unseen:
    print(topic)

# Collect the per-document topic distributions and export to csv
topic_probability = []
for t in lda_unseen:
    topic_probability.append(t)

results_test = pd.DataFrame(topic_probability, columns=['Topic 1', 'Topic 2',
                                                        'Topic 3', 'Topic 4',
                                                        'Topic 5', 'Topic n'])
results_test.to_csv('test_results.csv', index=True, header=True)
Code inspired by this post.
My goal is to create a DRF model in H2O with the TRAIN, VALIDATION and TEST datasets I have, and to compute RMSE, R2, MSE, etc. on the TEST dataset.
Below is the piece of code:
DRFParameters rfParms = (DRFParameters) algParameter;
rfParms._response_column = trainDataFrame._names[responseColumn(trainDataFrame)]; //The response column
rfParms._train = trainDataFrame._key;
//rfParms._valid = testDataFrame._key;
rfParms._nfolds = 5;
DRF job = new DRF(rfParms);
DRFModel drf = job.trainModel().get(); // Train the model
Frame pred = drf.score(testDataFrame); //Score the test
From here I don't know how to proceed to compute the prediction metrics (R2, RMSE, MSE, MAE, etc.) after scoring.
Could you please help with H2O DRF modeling and metrics calculation using Java?
Depending on whether your model is a regression, binomial or multinomial model, you'll have to use one of ModelMetricsRegression.make(), ModelMetricsBinomial.make() or ModelMetricsMultinomial.make(). They have slightly different signatures - you can find them in our Java docs.
For the trainDataFrame you can get the metrics from your drf model; they are in drf._output._training_metrics (you might need to cast it to an appropriate type, as this one is a generic ModelMetrics). If you use your test dataset as a validation frame, you can get the metrics from drf._output._validation_metrics.
Edit:
DRFModel drf = job.trainModel().get(); // Train the model
Frame pred = drf.score(testDataFrame); //Score the test
ModelMetricsBinomial mm = ModelMetricsBinomial.make(pred.vec(2), testDataFrame.vec(rfParms._response_column));
double auc = mm.auc();
double rmse = mm.rmse();
double r2 = mm.r2();
// etc.
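For reference, if you also have the H2O Python client available, the same metrics can be read from a model_performance() call; this is only a sketch, assuming a regression DRF, with placeholder CSV paths and a placeholder response column name:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

# placeholder paths and column name, purely for illustration
train = h2o.import_file("train.csv")
test = h2o.import_file("test.csv")

drf = H2ORandomForestEstimator(ntrees=50, nfolds=5)
drf.train(y="response", training_frame=train)  # "response" is a placeholder target column

# score against the held-out test frame and read the regression metrics
perf = drf.model_performance(test_data=test)
print(perf.rmse(), perf.mse(), perf.mae(), perf.r2())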
I'm using django rest gis to load up Leaflet maps, and at the top level of my app I'm looking at a map of the world. The basemap is from Mapbox. I make a call to my REST API and return an outline of all of the individual countries that are included in the app. Currently, the GeoJSON file that is returned is 1.1MB in size, and I have more countries to add, so I'd like to reduce the size to improve performance.
Here is an example of the contents:
{"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"MultiPolygon","coordinates":[[[[-64.54916992187498,-54.71621093749998],[-64.43881835937495,-54.739355468749984],[-64.22050781249999,-54.721972656249996],[-64.10532226562495,-54.72167968750003],[-64.054931640625,-54.72988281250001],[-64.03242187499995,-54.74238281249998],[-63.881933593750006,-54.72294921875002],[-63.81542968749997,-54.725097656250014],[-63.83256835937499,-54.76796874999995],[-63.97124023437499,-54.810644531250034],[-64.0283203125,-54.79257812499999],[-64.32290039062497,-54.79648437499999],[-64.45327148437497,-54.84033203124995],[-64.50869140625,-54.83994140624996],[-64.637353515625,-54.90253906250001],
The size of the file is a function of the number of points and the precision of those points. I was thinking that the most expedient way to reduce the size, while preserving my original data, would be to reduce the precision of the geom points. But I'm at a bit of a loss as to how to do this. I've looked through the documentation on GitHub and haven't found any clues.
Is there a field option to reduce the precision of the GeoJSON returned? Or is there another way to achieve what I'm trying to do?
Many thanks.
I ended up simplifying the geometry using PostGIS and then passing that queryset to the serializer. I started by creating a raw query in the model manager.
class RegionQueryset(models.query.QuerySet):
    def simplified(self):
        return self.raw(
            "SELECT region_code, country_code, name, slug, ST_SimplifyVW(geom, 0.01) as geom FROM regions_region "
            "WHERE active=TRUE AND region_type = 'Country'"
        )

class RegionsManager(models.GeoManager):
    def get_queryset(self):
        return RegionQueryset(self.model, using=self._db)

    def simplified(self):
        return self.get_queryset().simplified()
The view is quite simple:
class CountryApiGeoListView(ListAPIView):
    queryset = Region.objects.simplified()
    serializer_class = CountryGeoSerializer
And the serializer:
class CountryGeoSerializer(GeoFeatureModelSerializer):
    class Meta:
        model = Region
        geo_field = 'geom'
        queryset = Region.objects.filter(active=True)
        fields = ('name', 'slug', 'region_code', 'geom')
I ended up settling on the PostGIS function ST_SimplifyVW() after running some tests.
My dataset has 20 countries with geometry provided by Natural Earth. Without optimizing, the geojson file was 1.2MB in size, the query took 17ms to run and 1.15 seconds to load in my browser. Of course, the quality of the rendered outline was great. I then tried the ST_Simplify() and ST_SimplifyVW() functions with different parameters. From these very rough tests, I decided on ST_SimplifyVW(geom, 0.01)
| Function | Size | Query time | Load time | Appearance |
| --- | --- | --- | --- | --- |
| None | 1.2MB | 17ms | 1.15s | Great |
| ST_Simplify(geom, 0.1) | 240K | 15.94ms | 371ms | Barely acceptable |
| ST_Simplify(geom, 0.01) | 935K | 22.45ms | 840ms | Good |
| ST_SimplifyVW(geom, 0.01) | 409K | 25.92ms | 628ms | Good |
My setup was Postgres 9.4 and PostGIS 2.2. ST_SimplifyVW is not included in PostGIS 2.1, so you must use 2.2.
You could save some space by setting the precision with GeometryField during serialization. This is an extract of my code modelling the same WorldBorder model defined in the GeoDjango GIS tutorial. For serializers.py:
from rest_framework_gis.serializers import (
    GeoFeatureModelSerializer, GeometryField)
from .models import WorldBorder

class WorldBorderSerializer(GeoFeatureModelSerializer):
    # set a custom precision for the geometry field
    mpoly = GeometryField(precision=2, remove_duplicates=True)

    class Meta:
        model = WorldBorder
        geo_field = "mpoly"
        fields = (
            "id", "name", "area", "pop2005", "fips", "iso2", "iso3",
            "un", "region", "subregion", "lon", "lat",
        )
Explicitly defining the precision with mpoly = GeometryField(precision=2) will do the trick. The remove_duplicates=True option will remove identical points generated by truncating numbers. You need to keep the geo_field reference to your geometry field in the Meta class, or the REST framework will not work. This is my views.py code to see the GeoJSON object using a ViewSet:
from rest_framework import viewsets, permissions
from .models import WorldBorder
from .serializers import WorldBorderSerializer

class WorldBorderViewSet(viewsets.ModelViewSet):
    queryset = WorldBorder.objects.all()
    serializer_class = WorldBorderSerializer
    permission_classes = (permissions.IsAuthenticatedOrReadOnly, )
However, the most effective improvement in saving space is to simplify geometries, as described by geoAndrew. Here I calculate the geometry simplification on the fly in the serializer:
from rest_framework_gis.serializers import (
    GeoFeatureModelSerializer, GeometrySerializerMethodField)
from .models import WorldBorder

class WorldBorderSerializer(GeoFeatureModelSerializer):
    # in order to simplify polygons on the fly
    simplified_mpoly = GeometrySerializerMethodField()

    def get_simplified_mpoly(self, obj):
        # Returns a new GEOSGeometry, simplified to the specified tolerance
        # using the Douglas-Peucker algorithm. A higher tolerance value implies
        # fewer points in the output. If no tolerance is provided, it
        # defaults to 0.
        return obj.mpoly.simplify(tolerance=0.01, preserve_topology=True)

    class Meta:
        model = WorldBorder
        geo_field = "simplified_mpoly"
        fields = (
            "id", "name", "area", "pop2005", "fips", "iso2", "iso3",
            "un", "region", "subregion", "lon", "lat",
        )
The two solutions are different and can't be merged (see how rest_framework.gis.fields is implemented). Maybe simplifying the geometry is the better solution to preserve quality and save space. Hope it helps!
After training a RandomForestRegressor in a PipelineModel using ml and DataFrames (Spark 2.0), I loaded the saved model into my real-time environment in order to predict with it. Each request is handled and transformed through the loaded PipelineModel, but in the process I have to convert the single request vector into a one-row DataFrame using spark.createDataFrame. All of this takes around 700ms, compared to 2.5ms if I use the mllib RDD-based RandomForestRegressor.predict(VECTOR).
Is there any way to use the new ml API to predict on a single vector without converting to a DataFrame, or to do something else to speed things up?
The DataFrame-based org.apache.spark.ml.regression.RandomForestRegressionModel also takes a Vector as input. I don't think you need to convert a vector to a DataFrame for every call.
Here is how I think your code should work:
//load the trained RF model
val rfModel = RandomForestRegressionModel.load("path")
val predictionData = //a dataframe containing a column 'feature' of type Vector
predictionData.map { row =>
  val feature = row.getAs[Vector]("feature")
  val result = rfModel.predict(feature)
  (feature, result)
}
I have an abstract type called Product, and five "Types" that inherit from Product in a table-per-type hierarchy fashion, as below:
I want to get all of the information for all of the Products, including a smattering of properties from the different objects that inherit from Product, to project them into a new class for use in an MVC web page. My LINQ query is below:
//Return the required products
var model = from p in Product.Products
where p.archive == false && ((Prod_ID == 0) || (p.ID == Prod_ID))
select new SearchViewModel
{
ID = p.ID,
lend_name = p.Lender.lend_name,
pDes_rate = p.pDes_rate,
pDes_details = p.pDes_details,
pDes_totTerm = p.pDes_totTerm,
pDes_APR = p.pDes_APR,
pDes_revDesc = p.pDes_revDesc,
pMax_desc = p.pMax_desc,
dDipNeeded = p.dDipNeeded,
dAppNeeded = p.dAppNeeded,
CalcFields = new DAL.SearchCalcFields
{
pDes_type = p.pDes_type,
pDes_rate = p.pDes_rate,
pTFi_fixedRate = p.pTFi_fixedRate
}
}
The problem I have is accessing p.pTFi_fixedRate: this is not returned with the Products collection of entities, as it lives on the derived type Fixed. How do I return properties from the derived types of Product (such as Fixed) using LINQ and the Entity Framework? I actually need to return some fields from all the different subtypes (Disc, Track, etc.) for use in calculations. Should I return these as separate LINQ queries, checking the type of "Product" that is returned?
This is a really good question. I've had a look in the Julie Lerman book and scouted around the internet and I can't see an elegant answer.
If it were me, I would create a data transfer object with all the properties of the types, then have a separate query for each type and union them all up. I would insert blanks into the DTO properties where a property isn't relevant to that type. Then I would hope that the EF engine makes a reasonable stab at creating decent SQL.
Example
var results = (from p in context.Products.OfType<Disc>()
               select new ProductDTO { basefield1 = p.val1, discField = p.val2, fixedField = "" })
              .Union(
               from p in context.Products.OfType<Fixed>()
               select new ProductDTO { basefield1 = p.val1, discField = "", fixedField = p.val2 });
But that can't be the best answer, can it? Are there any others?
So Fixed is inherited from Product? If so, you should probably be querying for Fixed instead, and the Product properties will be pulled into it.
If you are just doing calculations and getting some totals or something, you might want to look at using a stored procedure. It will amount to fewer database calls and allow for much faster execution.
Well it depends on your model, but usually you need to do something like:
var model = from p in Product.Products.Include("SomeNavProperty")
.... (rest of query)
Where SomeNavProperty is the entity type that loads pTFi_fixedRate.