Doc2Vec input format - gensim

running gensim Doc2Vec over ubuntu
Doc2Vec rejects my input with the error
AttributeError: 'list' object has no attribute 'words'
import gensim from gensim.models
import doc2vec as dtv
from nltk.corpus import brown
documents = brown.tagged_sents()
d2vmodel = > dtv.Doc2Vec(documents, size=100, window=1, min_count=1, workers=1)
I have tried already from
this SO question and many variations with the same result
documents = [brown.tagged_sents()}
adding a hash function
If corpus is a .txt file I can utilize
documents=TaggedLineDocument(documents)
but that is often not possible

Gensim's Doc2Vec requires each document to be in the form of an object with a words property that is a list of string tokens, and a tags property that is a list of tags. These tags are usually strings, but expert users with large datasets can save a little memory by using plain-ints, starting from 0, instead.
A class TaggedDocument is included that is of the right 'shape', and used in most of the Gensim documentation/tutorial examples – but given Python's 'duck typing', any object with words and tags properties will do.
But a plain list won't.
And if I understand correctly, brown.tagged_sents() will return lists of (word, part-of-speech-tag) tuples, which isn't even the kind of list-of-word-tokens that would work as a words, and doesn't supply any of the full-document tags that are what Doc2Vec needs as keys to the doc-vectors that get trained.
Separately: it is unlikely you'd want to use min_count=1. Discarding very-low-frequency words usually makes retained Word2Vec/Doc2Vec vectors better.

Related

Gensim: How to load corpus from saved lda model?

When I saved my LdaModel lda_model.save('model'), it saved 4 files:
model
model.expElogbeta.npy
model.id2word
model.state
I want to use pyLDAvis.gensim to visualize the topics, which seems to need the model, corpus and dictionary. I was able to load the model and dictionary with:
lda_model = LdaModel.load('model')
dict = corpora.Dictionary.load('model.id2word')
Is it possible to load the corpus? How?
Sharing this here because it took me awhile to find out the answer to this as well. Note that dict is not a valid name for a dictionary and we use lda_dict instead.
# text array is a list of lists containing text you are analysing
# eg. text_array = [['volume', 'eventually', 'metric', 'rally'], ...]
# lda_dict is a gensim.corpora.Dictionary object
bow_corpus = [lda_dict.doc2bow(doc) for doc in text_array]
Jireh answered correctly but it may be confusing how to load all the previous LDA files. I'm not sure why gensim saves the *.state and *.npy files (I'd appreciate insights in the comments). To reuse a previous LDA model you load the *.model and *.id2word files along with your original corpus.
For instance, if I have a dataframe of my documents in column 'docs' then you load that dataframe again as you will need it to recreate your corpus.
import pandas as pd
from gensim import corpora, models
from gensim.corpora.dictionary import Dictionary
from pyLDAvis import gensim_models
df = pd.read_csv('your_file.csv')
texts = df['docs'].values
You load your previously created dictionary as follows:
dictionary = corpora.Dictionary.load('your_file.id2word')
... and then create the corpus from the dictionary and your original texts (created from the dataframe['docs'] above):
corpus = [dictionary.doc2bow(text) for text in texts]
The previously created LDA model is loaded via gensim:
lda_model = gensim.models.ldamodel.LdaModel.load('your_file.model')
These objects are then fed into your pyLDAvis instance:
lda_viz = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
If you don't use the .id2word file you can run into issues with not having the correct shape (IndexError). I've had this happen when I ran LDA multicore so I use the .id2word rather than recreating the dictionary from the corpus.
in the gensim python code, they said ignore expElogbeta and state file. It is possible to load the corpus, corpus is a set of list contain 2 numbers. It will be complex to load it out, I suggest load corpus from the origin text data and using id2word

Gensim's FastText KeyedVector out of vocab

I want to use the read-only version of Gensim's FastText Embedding to save some RAM compared to the full model.
After loading the KeyVectors version, I get the following Error when fetching a vector:
IndexError: index 878080 is out of bounds for axis 0 with size 761210
The error occurs when using words that should be out-of-vocabulary e.g. "lawyerxy" instead of "lawyer". The full model returns a vector for both.
from gensim.models import KeyedVectors
model = KeyedVectors.load("model.kv")
model .wv.__getitem__("lawyerxy")
So, my assumption is that the KeyedVectors do not offer FastText's out of vacabulary function - a key feature for my usecase. This limitation is not given in the documentation:
https://radimrehurek.com/gensim/models/word2vec.html
Can anyone prove that assumption and/or name a fix to allow vectors for "lawyerxy" etc. ?
The KeyedVectors name is (as of gensim-3.8.0) just an alias for class Word2VecKeyedVectors, which only maintains a simple word (as key) to vector (as value) mapping.
You shouldn't expect FastText's advanced ability to synthesize vectors for out-of-vocabulary words to appear in any model/representation that doesn't explicitly claim to offer that ability.
(I would expect a lookup of an out-of-vocabulary word to give a clearer KeyError rather than the IndexError you've reported. But, you'd need to show exactly what code created the file you're loading, and triggered the error, and the full error stack, to further guess what's going wrong in your case.)
Depending on how your model.kv file was saved, you might be able to load it, with retained OOV-vector functionality, by using the class FastTextKeyedVectors instead of plain KeyedVectors.

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with Fasttext Word vectors for return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun", in english) return a series of words with a dot in it (like sole., sole.Ma, ecc...). Where is the problem? Why most_similar return this meaningless word?
EDIT
I tried with english word vector and the word "sun" return this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if choosing to use FastText, you may be interesting it using it "fully".)
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

Gensim Predict Output Word Function Syntax

How do you use the Gensim predict output word function?
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.predict_output_word(['Hi', 'how', 'you'], topn=10)
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'predict_output_word'
I tried Word2Vec.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True), which was deprecated as well.
A file like GoogleNews-vectors-negative300.bin only contains the word vectors, not the complete model used for training. So it is not possible to use predict_output_word in this case. If you would have trained a full model yourself and saved it with model.save(), then the method predict_output_word would be available.

Is there a way to set min_df and max_df in gensim's tfidf model?

I am using gensim's tdidf model like so:
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
mapped_corpus = [dictionary.doc2bow(text)
for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)
Now I'd like to apply thresholds to remove terms that appear too frequently (max_df) and too infrequently (min_df). I know that scikit's CountVectorizer allows you to do this, but I can't seem to find how to set these thresholds in gensim's tfidf. Could someone please help?
You can filter your dictionary with
dictionary.filter_extremes(no_below=min_df, no_above=rel_max_df)
Note that no_below expects the minimum number of documents in which tokens must appear, whereas no_above expects a maximum relative frequency, e.g. 0.5. Afterwards you can then construct your corpus with the filtered dictionary. According to the gensim docs it is also possible to construct a TfidfModel with only a dictionary.

Resources