Gensim FastText - KeyError: "word not in vocabulary"

I was having trouble with the most_similar call on a FastText model. From my understanding, FastText should be able to return results for words that aren't in the vocabulary, but I'm getting a "not in vocabulary" error, even though the same call worked perfectly before saving and loading.
Here's the code from Jupyter.
import gensim
model = gensim.models.FastText(my_sentences, size=100, window=5, min_count=3, workers=4, sg=1)
model.wv.most_similar(positive=['iPhone 6'])
Returns
[('iPhone7', 0.942690372467041),
('iPhone7.', 0.9395840764045715),
('iPhone5s', 0.9379133582115173),
('iPhone6s', 0.9338586330413818),
('iPhone5S', 0.9335439801216125),
('iPhone5.', 0.9318809509277344),
('iPhone®', 0.9314558506011963),
('iPhone6', 0.9268479347229004),
('iPhone4s', 0.9223971366882324),
('iPhone5', 0.9212019443511963)]
So far so good, now I save the model.
model.wv.save_word2vec_format("example_fasttext.txt", binary=False)
Then load it up again:
from gensim.models import KeyedVectors
new_model = KeyedVectors.load_word2vec_format('example_fasttext.txt', binary=False, limit=50000)
Then I make the exact same most_similar call on the model I just loaded:
new_model.most_similar(positive=['iPhone 6'])
But now the result is:
KeyError: "word 'iPhone 6' not in vocabulary"
Any idea what I did wrong?

Your problem is probably the limit parameter of the load_word2vec_format method. With limit=50000 you are loading only the 50000 most frequent words. If iPhone 6 does not appear often enough, you are not loading it.
Try with
new_model = KeyedVectors.load_word2vec_format('example_fasttext.txt', binary=False)

I'm having the same problem as you, and I think I am starting to understand what's going on.
Basically, when you save your model as a .txt or a .vec file, you are only saving the word-vectors, not the n-grams (which are saved in the binary version of your model) that let you generalize/approximate out-of-vocabulary words.
I suggest you save your model with:
your_fasttext_model.save(file_path)
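For what it's worth, a minimal sketch (file name hypothetical, gensim 3.x assumed) of the save/load round-trip that keeps the n-gram buckets, so the out-of-vocabulary query still works afterwards:
from gensim.models import FastText

# native save() writes the full model, including the n-gram weights
model.save("example_fasttext.model")

loaded = FastText.load("example_fasttext.model")
loaded.wv.most_similar(positive=['iPhone 6'])  # OOV query still works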

Related

In gensim with pretrained model, wmdistance is working well, but n_similarity is not

I have calculated distances between two sentences using the wmdistance() function of gensim with a pre-trained model.
Now I want the similarity between them, and I tried the n_similarity() function, but a KeyError occurred:
KeyError: "word not in vocabulary"
Does anyone have an idea about this, please?
When you get an error that a word is not in the vocabulary, it means the word is not in that model.
Any attempt to look it up will generate a KeyError, to let you know you are trying to get a word-vector that isn't there.
You should filter your lists-of-tokens, before passing them to n_similarity(), to only include valid words.
Of course, that means you can't get a meaningful result about the word 'selfie'. It's unknown nonsense to the model, as if you asked for the word 'asruhfglaiwurfliuawiufsdfsdfs'.
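For example, a small sketch of that filtering (token lists hypothetical, a gensim 3.x KeyedVectors loaded as model assumed):
tokens_a = ['I', 'took', 'a', 'selfie']
tokens_b = ['I', 'took', 'a', 'photo']

# keep only tokens the model knows (in gensim 4.x, check model.key_to_index)
known_a = [w for w in tokens_a if w in model.vocab]
known_b = [w for w in tokens_b if w in model.vocab]

if known_a and known_b:
    print(model.n_similarity(known_a, known_b))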

Gensim's FastText KeyedVectors out of vocab

I want to use the read-only version of Gensim's FastText Embedding to save some RAM compared to the full model.
After loading the KeyedVectors version, I get the following error when fetching a vector:
IndexError: index 878080 is out of bounds for axis 0 with size 761210
The error occurs when using words that should be out-of-vocabulary, e.g. "lawyerxy" instead of "lawyer". The full model returns a vector for both.
from gensim.models import KeyedVectors
model = KeyedVectors.load("model.kv")
model.wv.__getitem__("lawyerxy")
So my assumption is that the KeyedVectors do not offer FastText's out-of-vocabulary function, a key feature for my use case. This limitation is not mentioned in the documentation:
https://radimrehurek.com/gensim/models/word2vec.html
Can anyone confirm that assumption and/or name a fix to allow vectors for "lawyerxy" etc.?
The KeyedVectors name is (as of gensim-3.8.0) just an alias for class Word2VecKeyedVectors, which only maintains a simple word (as key) to vector (as value) mapping.
You shouldn't expect FastText's advanced ability to synthesize vectors for out-of-vocabulary words to appear in any model/representation that doesn't explicitly claim to offer that ability.
(I would expect a lookup of an out-of-vocabulary word to give a clearer KeyError rather than the IndexError you've reported. But, you'd need to show exactly what code created the file you're loading, and triggered the error, and the full error stack, to further guess what's going wrong in your case.)
Depending on how your model.kv file was saved, you might be able to load it, with retained OOV-vector functionality, by using the class FastTextKeyedVectors instead of plain KeyedVectors.
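A hedged sketch of that last suggestion (it assumes model.kv was produced by saving a FastText model's .wv attribute under gensim 3.8.x):
from gensim.models.keyedvectors import FastTextKeyedVectors

kv = FastTextKeyedVectors.load("model.kv")
vec = kv["lawyerxy"]  # synthesized from n-gram buckets, if they were saved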

gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars

/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
# dynamic topic model
from collections import Counter
from gensim import corpora
from gensim.models import ldaseqmodel

def run_dtm(num_topics=18):
    docs, years, titles = preprocessing(datasetType=2)  # user-defined loader
    # re-sort documents by year
    Z = zip(years, docs)
    Z = sorted(Z, reverse=False)
    years_new, docs_new = zip(*Z)
    # generate time slices
    time_slice = Counter(years_new).values()
    for year in Counter(years_new):
        print year, ' --- ', Counter(years_new)[year]
    print '********* data set loaded ********'
    dictionary = corpora.Dictionary(docs_new)
    corpus = [dictionary.doc2bow(text) for text in docs_new]
    print '********* train lda seq model ********'
    ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary,
                                     time_slice=time_slice, num_topics=num_topics)
    print '********* lda seq model done ********'
    ldaseq.print_topics(time=1)
Hey guys, I'm using the dynamic topic models in the gensim package for topic analysis, following this tutorial: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb. However, I always get the same unexpected error. Can anyone give me some guidance? I'm really puzzled, even though I have tried several different datasets for generating the corpus and dictionary.
The error is like this:
/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
The np.fabs warning comes from NumPy. What NumPy and gensim versions are you using?
NumPy no longer supports Python 2.7, and LdaSeqModel was added to gensim in 2016, so you might simply not have a compatible version available. If you are recoding a Python 3+ tutorial into a 2.7 variant, you clearly understand a bit about the version differences: try running it in, say, a 3.6.8 environment (you will have to upgrade sometime anyway; 2020 is the end of Python 2.7 support). That might already help; I've gone through the tutorial and did not encounter this with my own data.
That being said, I have encountered the same error before when running LdaMulticore, and it was caused by an empty corpus.
Instead of running your code entirely inside a function, can you go through it line by line (or look at your DEBUG-level log) and check whether your output has the expected properties: that, for example, your corpus is not empty and contains no empty documents?
If it does, fix the preprocessing steps and try again; that at least helped me, and the same fix helped with an identical ldamodel error on the mailing list.
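A minimal sketch of such a check (it assumes corpus is the list of bag-of-words documents built in the question):
empty_docs = [i for i, bow in enumerate(corpus) if len(bow) == 0]
print('{} empty documents found'.format(len(empty_docs)))

# drop empty documents before training; if you do, remember to rebuild
# time_slice, since the document counts per year change
corpus = [bow for bow in corpus if bow]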
PS: not commenting because I lack the reputation, feel free to edit this.
This is an issue with the source code of ldaseqmodel.py itself.
With the latest gensim package (version 3.8.3) I am getting the same error at line 293:
ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
If you go through the source, you will see that the difference between bound and old_bound is divided by old_bound (which is also visible in the warning). If you look further, you will see that at line 263 old_bound is initialized to zero, and that is the real reason you are getting this divide-by-zero warning.
To confirm, I put a print statement at line 294:
print('bound = {}, old_bound = {}'.format(bound, old_bound))
The output showed that old_bound is indeed 0 on the first pass.
So, in short, you are getting this warning because of the source code of ldaseqmodel.py, not because of any empty documents. That said, if you do not remove empty documents from your corpus, you will receive another warning. So I suggest: if there are any empty documents in your corpus, remove them, and just ignore this divide-by-zero warning.
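If the warning bothers you, a hedged sketch for silencing just this case (training call copied from the question; np.errstate governs NumPy's floating-point warnings):
import numpy as np

# the first-iteration division by old_bound == 0 is expected, so ignore it
with np.errstate(divide='ignore'):
    ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary,
                                     time_slice=time_slice, num_topics=num_topics)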

Gensim Predict Output Word Function Syntax

How do you use the Gensim predict output word function?
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.predict_output_word(['Hi', 'how', 'you'], topn=10)
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'predict_output_word'
I also tried Word2Vec.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True), but that method is deprecated as well.
A file like GoogleNews-vectors-negative300.bin only contains the word vectors, not the complete model used for training. So it is not possible to use predict_output_word in this case. If you had trained a full model yourself and saved it with model.save(), the method predict_output_word would be available.
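For example, a minimal sketch (toy corpus hypothetical, gensim 3.x assumed) of training and saving a full model so that predict_output_word is available:
from gensim.models import Word2Vec

sentences = [['hi', 'how', 'are', 'you'], ['hi', 'how', 'you', 'doing']]
model = Word2Vec(sentences, size=100, min_count=1)  # full model, not just vectors
model.save('my_word2vec.model')

loaded = Word2Vec.load('my_word2vec.model')
print(loaded.predict_output_word(['hi', 'how', 'you'], topn=10))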

Doc2Vec input format

Running gensim Doc2Vec on Ubuntu, Doc2Vec rejects my input with the error:
AttributeError: 'list' object has no attribute 'words'
import gensim
from gensim.models import doc2vec as dtv
from nltk.corpus import brown
documents = brown.tagged_sents()
d2vmodel = dtv.Doc2Vec(documents, size=100, window=1, min_count=1, workers=1)
I have already tried suggestions from this SO question and many variations, with the same result:
documents = [brown.tagged_sents()]
adding a hash function
If the corpus is a .txt file I can use
documents = TaggedLineDocument(documents)
but that is often not possible.
Gensim's Doc2Vec requires each document to be in the form of an object with a words property that is a list of string tokens, and a tags property that is a list of tags. These tags are usually strings, but expert users with large datasets can save a little memory by using plain-ints, starting from 0, instead.
A class TaggedDocument is included that is of the right 'shape', and used in most of the Gensim documentation/tutorial examples – but given Python's 'duck typing', any object with words and tags properties will do.
But a plain list won't.
And if I understand correctly, brown.tagged_sents() will return lists of (word, part-of-speech-tag) tuples, which isn't even the kind of list-of-word-tokens that would work as a words value, and it doesn't supply any of the full-document tags that Doc2Vec needs as keys to the doc-vectors that get trained.
Separately: it is unlikely you'd want to use min_count=1. Discarding very-low-frequency words usually makes retained Word2Vec/Doc2Vec vectors better.
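To illustrate, a hedged sketch (gensim 3.x or later assumed) that wraps the Brown sentences as TaggedDocument objects, with plain word tokens and one string tag per sentence:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.corpus import brown

# brown.sents() yields plain token lists, without the POS tags
documents = [TaggedDocument(words=list(sent), tags=[str(i)])
             for i, sent in enumerate(brown.sents())]

d2vmodel = Doc2Vec(documents, vector_size=100, window=1, min_count=5, workers=1)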
