How to get vocabulary word count from gensim word2vec?

I am using the gensim word2vec package in Python. I know how to get the vocabulary from the trained model. But how do I get the word count for each word in the vocabulary?

Each word in the vocabulary has an associated vocabulary object, which contains an index and a count.
vocab_obj = w2v.vocab["word"]
vocab_obj.count
Output for the Google News w2v model: 2998437
So to get the count for each word, you would iterate over all words and vocab objects in the vocabulary.
for word, vocab_obj in w2v.vocab.items():
    count = vocab_obj.count  # do something with the count

The vocab attribute was removed from KeyedVectors in Gensim 4.0.0.
Instead:
word2vec_model.wv.get_vecattr("my-word", "count") # returns count of "my-word"
len(word2vec_model.wv) # returns size of the vocabulary
See the notes on migrating from Gensim 3.x to 4.

To create a dictionary mapping each word to its count for easy retrieval later, you can do the following:
w2c = dict()
for item in model.wv.vocab:
    w2c[item] = model.wv.vocab[item].count
If you want to sort it to see the most frequent words in the model, you can do that like so:
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
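The loop above reads model.wv.vocab, which the previous answer notes was removed in Gensim 4. A minimal sketch of the same word-to-count dictionary built from get_vecattr and index_to_key might look like the following. The helper name word_counts and the StubKeyedVectors class are made up for illustration; the stub only stands in for model.wv so the snippet runs without a trained model.

```python
def word_counts(kv):
    """Map each word to its training count, most frequent first.

    `kv` is assumed to be a Gensim 4 KeyedVectors object such as `model.wv`.
    """
    counts = {word: kv.get_vecattr(word, "count") for word in kv.index_to_key}
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))


class StubKeyedVectors:
    """Hypothetical stand-in for `model.wv`, not part of gensim."""

    def __init__(self, counts):
        self._counts = counts
        self.index_to_key = list(counts)

    def get_vecattr(self, key, attr):
        if attr != "count":
            raise KeyError(attr)
        return self._counts[key]


# Made-up counts; with a real model, call word_counts(model.wv) instead.
w2c = word_counts(StubKeyedVectors({"cat": 3, "the": 10, "dog": 5}))
print(w2c)
```

With a real trained model, passing model.wv in place of the stub should give the same kind of dictionary.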

Related

Doc2Vec How to find most similar document

I am using Gensim's Doc2Vec, and was wondering if there is a way to get the most similar document to another document that is outside the list of TaggedDocuments used to train the Doc2Vec model.
Right now I can infer a vector from a document not in the training set:
from nltk.tokenize import word_tokenize
# 'model' here is an instance of the Doc2Vec class that has been trained
# Inferring a vector
doc_not_in_training_set = "Foo Foo Foo Foo Foo Foo Fie"
v1 = model.infer_vector(word_tokenize(doc_not_in_training_set.lower()))
print("V1_infer", v1)
This prints out a vector representation of the 'doc_not_in_training_set' string. However, is there a way to use this vector to find the n most similar documents to the 'doc_not_in_training_set' string (in the TaggedDocuments training set for this Doc2Vec model)?
Looking through the documentation, the closest I could find was the model.docvecs.most_similar() method:
# Finding most similar to first
similar_doc = model.docvecs.most_similar('0')
This returns the document in the training set most similar to the document in the training set with tag '0'.
In the documentation of this method, it looks like there is not yet the functionality I am looking for:
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Is there another method I can use to find documents similar to a document not in the training set?

The .most_similar() method will also take a raw vector as the target position.
It helps to name the positive parameter explicitly. Otherwise the method, which tries to guess what strings and other arguments might mean, may misinterpret a single raw vector.
So try:
similar_docs = model.docvecs.most_similar(positive=[v1])
You should get back a list of nearest-neighbors to the v1 vector that you'd previously inferred.
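To make the idea of nearest neighbors to a raw vector concrete, here is a small pure-Python sketch that ranks stored document vectors by cosine similarity to a query vector. It is not gensim's implementation; the function names and the three-dimensional vectors are made up for illustration, and v1 only plays the role of an inferred vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_to_vector(query, doc_vectors, topn=2):
    """Return the `topn` (tag, similarity) pairs closest to `query`."""
    scored = [(tag, cosine(query, vec)) for tag, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up document vectors keyed by tag.
doc_vectors = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.0, 1.0, 0.0],
    "2": [0.7, 0.7, 0.0],
}
v1 = [0.9, 0.1, 0.0]  # stands in for model.infer_vector(...)
neighbors = most_similar_to_vector(v1, doc_vectors)
print([tag for tag, _score in neighbors])
```

The result has the same shape as the list described above: tags paired with similarity scores, best match first.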

Doc2Vec input format

I am running gensim Doc2Vec on Ubuntu.
Doc2Vec rejects my input with the error
AttributeError: 'list' object has no attribute 'words'
import gensim
from gensim.models import doc2vec as dtv
from nltk.corpus import brown
documents = brown.tagged_sents()
d2vmodel = dtv.Doc2Vec(documents, size=100, window=1, min_count=1, workers=1)
I have already tried the suggestions from this SO question, and many variations, with the same result:
documents = [brown.tagged_sents()]
adding a hash function
If the corpus is a .txt file I can use
documents = TaggedLineDocument(documents)
but that is often not possible.

Gensim's Doc2Vec requires each document to be in the form of an object with a words property that is a list of string tokens, and a tags property that is a list of tags. These tags are usually strings, but expert users with large datasets can save a little memory by using plain ints, starting from 0, instead.
A class TaggedDocument is included that is of the right 'shape', and is used in most of the Gensim documentation/tutorial examples. But given Python's 'duck typing', any object with words and tags properties will do.
But a plain list won't.
And if I understand correctly, brown.tagged_sents() will return lists of (word, part-of-speech-tag) tuples. That isn't even the kind of list of word tokens that would work as a words property, and it doesn't supply any of the full-document tags that Doc2Vec needs as keys to the doc-vectors that get trained.
Separately: it is unlikely you'd want to use min_count=1. Discarding very-low-frequency words usually makes retained Word2Vec/Doc2Vec vectors better.
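As a rough sketch of the shape described above, the snippet below turns lists of (word, part-of-speech-tag) tuples into objects with words and tags properties. The namedtuple is a stand-in with the same two fields as gensim's TaggedDocument so the example runs without gensim; the helper name to_tagged_documents and the sample sentences are made up, and using the sentence position as a plain-int tag is just one possible choice.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim.models.doc2vec.TaggedDocument.
# With gensim installed, import the real class instead.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def to_tagged_documents(tagged_sents):
    """Convert lists of (word, pos_tag) tuples into Doc2Vec-style documents."""
    docs = []
    for i, sent in enumerate(tagged_sents):
        words = [word for word, _pos in sent]  # keep tokens, drop POS tags
        docs.append(TaggedDocument(words=words, tags=[i]))  # plain-int tag
    return docs


# Made-up data in the shape brown.tagged_sents() is described as returning.
sample = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("It", "PPS"), ("rained", "VBD")],
]
documents = to_tagged_documents(sample)
print(documents[0].words, documents[0].tags)
```

Each resulting object has a list of string tokens in words and a one-element list in tags, which is the structure the answer says Doc2Vec expects.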

Is there a way to set min_df and max_df in gensim's tfidf model?

I am using gensim's tfidf model like so:
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
mapped_corpus = [dictionary.doc2bow(text) for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)
Now I'd like to apply thresholds to remove terms that appear too frequently (max_df) and too infrequently (min_df). I know that scikit's CountVectorizer allows you to do this, but I can't seem to find how to set these thresholds in gensim's tfidf. Could someone please help?

You can filter your dictionary with
dictionary.filter_extremes(no_below=min_df, no_above=rel_max_df)
Note that no_below expects the minimum number of documents in which a token must appear, whereas no_above expects a maximum fraction of documents, e.g. 0.5. Afterwards you can construct your corpus with the filtered dictionary. According to the gensim docs, it is also possible to construct a TfidfModel with only a dictionary.
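To illustrate how the two thresholds differ (an absolute document count versus a fraction of all documents), here is a small pure-Python sketch of that filtering rule. It is not gensim's code; the function name and the toy documents are made up for illustration.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens that appear in at least `no_below` documents (a count)
    and in at most `no_above` of all documents (a fraction from 0 to 1)."""
    # Count each token once per document to get its document frequency.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Toy corpus: "the" is in every document, several words are in only one.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "the" is dropped for being too frequent and the single-document words are dropped for being too rare, leaving only the mid-frequency tokens.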

Get word from array in word2vec in gensim

I just started to experiment with word2vec from gensim using the tutorial provided at http://rare-technologies.com/word2vec-tutorial/. If we need the raw output vectors, we write:
model['computer']
And the result is:
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
How can I get the word back, given the array? So if I write:
f = model['computer']
how can I get the word 'computer' using f?

I found the solution at https://github.com/piskvorky/gensim/issues/381:
word = model.most_similar(positive=[f], topn=1)
print(word[0][0])

Condense nested for loop to improve processing time with text analysis python

I am working on an untrained classifier model. I am working in Python 2.7. I have a loop. It looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
    for i in xrange(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
I have a "dictionary" of bigrams that I have collected from a data set containing customer reviews and I would like to construct feature arrays of each review corresponding to the dictionary I have created. It would contain the frequencies of the bigrams found within the review of the features in the dictionary (I hope that makes sense). new_scored is a list of tuples which contains the bigrams found within a particular review paired with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary with few non zero entries.
The above works fine, but I am looking at a data set of 13000 reviews, and looping through this code for each review is going to take forever (if my computer doesn't run out of RAM first). I have been sitting with it for a while and cannot see how I can condense it.
I am very new to Python, so I was hoping someone more experienced could help with condensing it or perhaps point me towards a library that contains the function I need.
Thank you in advance!

Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (it seems to be a list or tuple now). dictionary could map bigrams as keys to an integer identifier that gives each bigram's feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
features = [0 for key in dictionary]
for bgram in new_scored:
    try:
        features[dictionary[bgram[0]]] = int(bgram[1])
    except KeyError:
        # do something if the bigram is not in the dictionary for some reason
        pass
This should convert what was an O(n) traversal through dictionary into a hash lookup.
Hope this helps.
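A self-contained sketch of that refactoring, written for Python 3, could look like this. The bigram list, the scores, and the names bigram_index and build_features are made up for illustration, and the scores are assumed to be integer counts.

```python
# Made-up data: the known bigrams and the scored bigrams of one review.
bigram_list = [("good", "food"), ("bad", "service"), ("great", "value")]
new_scored = [(("great", "value"), 2), (("good", "food"), 1), (("not", "seen"), 4)]

# Build the bigram -> position mapping once and reuse it for every review.
bigram_index = {bigram: pos for pos, bigram in enumerate(bigram_list)}


def build_features(scored, index):
    """Return a feature list aligned with the positions stored in `index`."""
    features = [0] * len(index)
    for bigram, score in scored:
        pos = index.get(bigram)  # hash lookup instead of a linear scan
        if pos is not None:      # ignore bigrams that are not in the index
            features[pos] = int(score)
    return features


features = build_features(new_scored, bigram_index)
print(features)
```

Using dict.get with a None check is an alternative to the try/except shown above; both avoid scanning the whole dictionary for every bigram.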
