I'm relatively new to the world of Latent Dirichlet Allocation. I am able to generate an LDA model following the Wikipedia tutorial, and I can also generate an LDA model from my own documents.
My next step is to understand how I can use a previously generated model to classify unseen documents.
I'm saving my "lda_wiki_model" with
id2word = gensim.corpora.Dictionary.load_from_text('ptwiki_wordids.txt.bz2')
mm = gensim.corpora.MmCorpus('ptwiki_tfidf.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
lda.save('lda_wiki_model.lda')
And I'm loading the same model with:
new_lda = gensim.models.LdaModel.load(path + 'lda_wiki_model.lda')  # load the model
I have a "new_doc.txt", and I turn my document into a id<-> term dictionary and converted this tokenized document to "document-term matrix"
But when I run new_topics = new_lda[corpus], I receive a
'gensim.interfaces.TransformedCorpus object at 0x7f0ecfa69d50'.
How can I extract topics from that?
I already tried
lsa = models.LdaModel(new_topics, id2word=dictionary, num_topics=1, passes=2)
corpus_lda = lsa[new_topics]
print(lsa.print_topics(num_topics=1, num_words=7))
and
print(corpus_lda.print_topics(num_topics=1, num_words=7))
but that returned topics not related to my new document.
Where is my mistake? Am I misunderstanding something?
If I run a new model using the dictionary and corpus created above, I receive the correct topics. My point is: how do I re-use my model? Is it correct to re-use that wiki_model?
Thank you.
I was facing the same problem. This code will solve your problem:
new_topics = new_lda[corpus]
for topic in new_topics:
    print(topic)
This will give you a list of tuples of the form (topic number, probability).
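If you only want the dominant topic for each document, a small sketch building on the loop above:
# pick the single most probable topic for each unseen document
for doc_topics in new_topics:
    top_topic, prob = max(doc_topics, key=lambda t: t[1])
    print(top_topic, prob)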
From the 'Topics_and_Transformation.ipynb' tutorial prepared by the RaRe Technologies people:
Converting the entire corpus at the time of calling corpus_transformed = model[corpus] would mean storing the result in main memory, and that contradicts gensim's objective of memory-independence. If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
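In gensim terms, that serialization is a one-liner (a sketch; the file name is arbitrary):
import gensim

# write the transformed corpus to disk once...
gensim.corpora.MmCorpus.serialize('corpus_transformed.mm', corpus_transformed)

# ...then iterate over the serialized version from now on
corpus_transformed = gensim.corpora.MmCorpus('corpus_transformed.mm')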
Hope it helps.
This has been answered, but here is some code for anyone looking to also export the classification of unseen documents to a CSV file.
import pandas as pd

# Access the unseen corpus (assumes id2word, lda_model and
# data_test_lemmatized already exist from training)
corpus_test = [id2word.doc2bow(doc) for doc in data_test_lemmatized]

# Transform into LDA space based on the old model
lda_unseen = lda_model[corpus_test]

# Print results and collect them for export
topic_probability = []
for topics in lda_unseen:
    print(topics)
    topic_probability.append(topics)

# Export to CSV
results_test = pd.DataFrame(topic_probability,
                            columns=['Topic 1', 'Topic 2', 'Topic 3',
                                     'Topic 4', 'Topic 5', 'Topic n'])
results_test.to_csv('test_results.csv', index=True, header=True)
Code inspired by this post.
Does anyone know if there's an easy way to negate a Parse query? Something like this:
Parse.Query.not(query)
More specifically I want to do a relational query that gets everything except for the objects within the query. For example:
const relation = myParseObject.relation("myRelation");
const query = relation.query();
const negatedQuery = Parse.Query.not(query);
return await negatedQuery.find();
I know one solution would be to fetch the objects in the relation and then create a new query by looping through the objectIds using query.notEqualTo("objectId", fetchedObjectIds[i]), but this seems really circuitous...
Any help would be much appreciated!
doesNotMatchKeyInQuery is the solution as Davi Macedo pointed out in the comments.
For example, if I wanted to get all of the Comments that are not in an Article's relation, I would do the following:
const relationQuery = article.relation("comments").query();
const notInRelationQuery = new Parse.Query("Comment");
notInRelationQuery.doesNotMatchKeyInQuery("objectId", "objectId", relationQuery);
const notRelatedComments = await notInRelationQuery.find();
As I understand it: the first argument names the key on the objects you are fetching; the second names the key on the objects matched by the query you pass in; and the last argument is a query for the objects you don't want. Essentially, it finds the objects you don't want, compares the values of the objects you do want against the values of the objects you don't want for the argued keys, and returns all the objects you do want.
I'm having a problem ordering numbers that are saved as strings in CRM.
It works fine until 10; after that it says that 9 > 10. I know a simple solution where I pad the strings with zeros to a fixed length.
I'm wondering if there is a way to order a string column as an int in some way.
My code:
QueryExpression query = new QueryExpression(entity);
query.ColumnSet.AddColumn(ID);
query.AddOrder(ID, OrderType.Descending); //there is a problem because the type is string.
EntityCollection entityCollection = organizationService.RetrieveMultiple(query);
I don't think there is any easy way of achieving this. I faced the same issue with post codes and ended up storing both values, i.e. string and int, and while querying I used the int field to sort.
Hope this helps,
Ravi Kashyap
Another possible solution to this problem would be to sort the results of the query using LINQ's OrderBy() method instead of using QueryExpression's built in ordering.
EntityCollection results = _client.RetrieveMultiple(query);
var sortedResults = results.Entities.OrderBy((e) =>
int.Parse(e.GetAttributeValue<string>("nameofattribute"))
);
This will yield the results you're looking for. It isn't an ideal solution, but at least you don't have to store everything twice.
I've got a lot of lat/lon points in a CSV file, and I've created a table which has a point field in the 4326 projection (table postcode, field location).
I'm building data like this:-
factory = ::RGeo::Cartesian.preferred_factory(:has_z_coordinate => false)
p = factory.point(data_hash[:latitude], data_hash[:longitude])
and storing p in the location field.
The issue is that I then want to find records "near" a given point.
I've seen some promising code at:-
https://github.com/rgeo/activerecord-postgis-adapter/blob/master/test/spatial_queries_test.rb
so I wrote the following:-
factory = ::RGeo::Cartesian.preferred_factory(:has_z_coordinate => false)
p = factory.point(53.7492, 1.6023)
res = Postcode.where(Postcode.arel_table[:location].st_distance(p).lt(1000));
res.each do |single|
puts single.postcode
end
But I'm getting exceptions (unsupported: RGeo::Cartesian::PointImpl).
I assume I need to do some conversion or something; any pointers appreciated!
I think your problem lies in the factory you use. Try generating the point with a spherical factory:
p = RGeo::Geographic.spherical_factory(srid: 4326).point(53.7492, 1.6023)
Also check the Rails logs to see the generated query, and run it manually in PG to make sure the query runs without problems.
My query is this:
DB[:expense_projects___p].where(:project_company_id=>user_company_id).
left_join(:expense_items___i, :expense_project_id=>:project_id).
select_group(:p__project_name, :p__project_id).
select_more{count(:i__item_id)}.
select_more{sum(:i__amount)}.to_a.to_json
which works.
However, payment methods include cash, card and invoice, so I would like to sum each of those separately for summary purposes, to achieve a discrete total for payments by cash, card, and invoice respectively. I included the following line in the query:
select_more{sum(:i__amount).where(:i__mop => 'card')}.
and the error message was
NoMethodError - undefined method `where' for #<Sequel::SQL::Function:0x007fddd88b5ed0>:
so I created the dataset separately with
ds1 = expense_items.where(:mop=>'card', :expense_company_id=>user_company_id).sum(:amount)
and appended it, at the end of the original query, with
select_append{ds1}
which achieved partial success, as the returned JSON is now:
{"project_name":"project 2","project_id":2,"count":4,"sum":"0.40501E3","?column?":"0.2381E2"}
As can be seen, there is no name for this last element, which I need in order to reference it in my getJSON call. I tried to add an identifier by appending ___a to the ds1 query, as below:
ds1 = expense_items.where(:mop=>'card', :expense_company_id=>user_company_id).sum(:amount___a)
but that failed.
In summary: is this the right approach, and in any case, how can I provide an identifier when doing a Sequel sum query? In other words, sum(:a_column).as(a_name).
Many thanks.
Dataset#sum returns the sum, not a modified dataset. You probably want something like:
ds1 = expense_items.where(:mop=>'card', :expense_company_id=>user_company_id).select{sum(:amount)}
select_append{ds1.as(:sum)}
I am not sure about the approach (better ask Jeremy Evans), but it works.
You just change .sum(...) to .select_more{sum(:amount).as(:desired_name)}:
ds1 = expense_items.where(:mop=>'card', :expense_company_id=>user_company_id).select_more{sum(:amount).as(:desired_name)}
and you actually get that desired_name in the DB response.
Does anybody know whether it is possible to save a trained model of Spark's Naive Bayes classifier (for example, to a file), and load it in the future if required?
Thank you.
I tried saving and loading the model. I was not able to recreate the model using the stored weights (I couldn't find the proper constructor), but the whole model is serializable, so you can store and load it as follows.
Store it with:
val fos = new FileOutputStream(<storage path>)
val oos = new ObjectOutputStream(fos)
oos.writeObject(model)
oos.close
and load it with:
val fis = new FileInputStream(<storage path>)
val ois = new ObjectInputStream(fis)
val newModel = ois.readObject().asInstanceOf[org.apache.spark.mllib.classification.NaiveBayesModel]
It worked for me.
It is discussed in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-td11953.html
You can use the built-in functions (as of Spark 2.1.0). Use NaiveBayesModel#save to store the model and NaiveBayesModel#load to read a previously stored model.
The save method comes from Saveable and is implemented by a wide range of classification models. The load method seems to be static in each classification model implementation.
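For illustration, a minimal PySpark sketch (assuming an existing SparkContext sc and a trained model; the path is an arbitrary example):
from pyspark.mllib.classification import NaiveBayesModel

# persist the trained model to the given path...
model.save(sc, '/tmp/nb_model')

# ...and load it back later via the static load method
same_model = NaiveBayesModel.load(sc, '/tmp/nb_model')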