Gensim Predict Output Word Function Syntax

How do you use the Gensim predict output word function?
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.predict_output_word(['Hi', 'how', 'you'], topn=10)
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'predict_output_word'
I tried Word2Vec.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True), but that method is deprecated as well.

A file like GoogleNews-vectors-negative300.bin contains only the word vectors, not the complete model used for training, so it is not possible to use predict_output_word in this case. If you had trained a full model yourself and saved it with model.save(), then the predict_output_word method would be available.
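For example, a minimal sketch of that workflow, assuming a tiny toy corpus and the gensim-3.x style parameters used elsewhere on this page:

from gensim.models import Word2Vec

# toy corpus; a real corpus would be far larger
sentences = [["hi", "how", "are", "you"], ["hi", "how", "is", "it", "going"]]

# train and save the full model, not just its word vectors
model = Word2Vec(sentences, size=100, window=5, min_count=1)
model.save("full_word2vec.model")

# reload the full model; predict_output_word is available here
model = Word2Vec.load("full_word2vec.model")
print(model.predict_output_word(["hi", "how"], topn=3))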

Related

Doc2Vec How to find most similar document

I am using Gensim's Doc2Vec, and was wondering if there is a way to get the most similar document to another document that is outside the list of TaggedDocuments used to train the Doc2Vec model.
Right now I can infer a vector from a document not in the training set:
# 'model' here is a instance of Doc2Vec class that has been trained
# Inferring a vector
doc_not_in_training_set = "Foo Foo Foo Foo Foo Foo Fie"
v1 = model.infer_vector(word_tokenize(doc_not_in_training_set.lower()))
print("V1_infer", v1)
This prints out a vector representation of the 'doc_not_in_training_set' string. However, is there a way to use this vector to find the n most similar documents to the 'doc_not_in_training_set' string (in the TaggedDocuments training set for this word2vec model)?
Looking through the documentation, the closest I could find was the model.docvecs.most_similar() method:
# Finding most similar to first
similar_doc = model.docvecs.most_similar('0')
This returns the document in the training set most similar to the document in the training set with tag '0'.
In the documentation of this method, it looks like there is not yet the functionality I am looking for:
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Is there another method I can use to find documents similar to a document not in the training set?
The .most_similar() method will also take a raw vector as the target position.
It helps to explicitly name the positive parameter: that prevents the method's other logic, which tries to intuit what supplied strings or other arguments might mean, from misinterpreting a single raw vector.
So try:
similar_docs = model.docvecs.most_similar(positive=[v1])
You should get back a list of nearest-neighbors to the v1 vector that you'd previously inferred.
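Putting it together, a minimal sketch, assuming a trained Doc2Vec model and NLTK's word_tokenize as in the question:

from nltk.tokenize import word_tokenize

# infer a vector for a document outside the training set
doc_not_in_training_set = "Foo Foo Foo Foo Foo Foo Fie"
v1 = model.infer_vector(word_tokenize(doc_not_in_training_set.lower()))

# topn controls how many nearest neighbors are returned
for tag, similarity in model.docvecs.most_similar(positive=[v1], topn=5):
    print(tag, similarity)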

In gensim with pretrained model, wmdistance is working well, but n_similarity is not

I have calculated the distance between two sentences using the wmdistance() function of gensim with a pre-trained model.
Now I want the similarity between them, and tried the n_similarity() function, but a KeyError occurred:
KeyError: word not in vocabulary
(A screenshot of the error example was attached.)
Does anyone have an idea about this?
When you get an error that a word is not in the vocabulary, it means the word is not in that model.
Any attempt to look it up will generate a KeyError, to let you know you are trying to get a word-vector that isn't there.
You should filter your lists-of-tokens, before passing them to n_similarity(), to only include valid words.
Of course, that means you can't get a meaningful result about the word 'selfie'. It's unknown nonsense to the model, as if you asked for the word 'asruhfglaiwurfliuawiufsdfsdfs'.
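For example, a minimal sketch of such filtering, assuming gensim-3.x style KeyedVectors and hypothetical token lists:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

sent1 = ['photo', 'selfie', 'camera']
sent2 = ['picture', 'image']

# keep only tokens the model actually has vectors for
sent1 = [w for w in sent1 if w in kv.vocab]
sent2 = [w for w in sent2 if w in kv.vocab]

if sent1 and sent2:
    print(kv.n_similarity(sent1, sent2))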

Gensim's FastText KeyedVector out of vocab

I want to use the read-only version of Gensim's FastText Embedding to save some RAM compared to the full model.
After loading the KeyedVectors version, I get the following error when fetching a vector:
IndexError: index 878080 is out of bounds for axis 0 with size 761210
The error occurs when using words that should be out-of-vocabulary e.g. "lawyerxy" instead of "lawyer". The full model returns a vector for both.
from gensim.models import KeyedVectors
model = KeyedVectors.load("model.kv")
model.wv.__getitem__("lawyerxy")
So my assumption is that the KeyedVectors do not offer FastText's out-of-vocabulary function, a key feature for my use case. This limitation is not mentioned in the documentation:
https://radimrehurek.com/gensim/models/word2vec.html
Can anyone confirm that assumption and/or name a fix to allow vectors for "lawyerxy" etc.?
The KeyedVectors name is (as of gensim-3.8.0) just an alias for class Word2VecKeyedVectors, which only maintains a simple word (as key) to vector (as value) mapping.
You shouldn't expect FastText's advanced ability to synthesize vectors for out-of-vocabulary words to appear in any model/representation that doesn't explicitly claim to offer that ability.
(I would expect a lookup of an out-of-vocabulary word to give a clearer KeyError rather than the IndexError you've reported. But, you'd need to show exactly what code created the file you're loading, and triggered the error, and the full error stack, to further guess what's going wrong in your case.)
Depending on how your model.kv file was saved, you might be able to load it, with retained OOV-vector functionality, by using the class FastTextKeyedVectors instead of plain KeyedVectors.
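For example, a minimal sketch of that alternative load, assuming model.kv was saved from a FastText model's vectors and using the gensim-3.x class path:

from gensim.models.keyedvectors import FastTextKeyedVectors

# if the saved file retains the n-gram buckets, OOV lookups work again
kv = FastTextKeyedVectors.load("model.kv")
print(kv["lawyerxy"])  # synthesized from character n-grams if out-of-vocabulary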

Doc2Vec input format

Running gensim Doc2Vec on Ubuntu, Doc2Vec rejects my input with the error:
AttributeError: 'list' object has no attribute 'words'
import gensim
from gensim.models import doc2vec as dtv
from nltk.corpus import brown
documents = brown.tagged_sents()
d2vmodel = dtv.Doc2Vec(documents, size=100, window=1, min_count=1, workers=1)
I have already tried the suggestions from this SO question, and many variations, with the same result:
documents = [brown.tagged_sents()]
adding a hash function
If the corpus is a .txt file I can use
documents = TaggedLineDocument(documents)
but that is often not possible.
Gensim's Doc2Vec requires each document to be in the form of an object with a words property that is a list of string tokens, and a tags property that is a list of tags. These tags are usually strings, but expert users with large datasets can save a little memory by using plain-ints, starting from 0, instead.
A class TaggedDocument is included that is of the right 'shape', and used in most of the Gensim documentation/tutorial examples – but given Python's 'duck typing', any object with words and tags properties will do.
But a plain list won't.
And if I understand correctly, brown.tagged_sents() returns lists of (word, part-of-speech-tag) tuples, which isn't even the kind of list-of-word-tokens that would work as words, and doesn't supply any of the per-document tags that Doc2Vec needs as keys to the doc-vectors that get trained.
Separately: it is unlikely you'd want to use min_count=1. Discarding very-low-frequency words usually makes retained Word2Vec/Doc2Vec vectors better.
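For example, a minimal sketch of building suitably shaped documents from the Brown corpus, assuming plain sentences rather than POS-tagged tuples:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.corpus import brown

# each sentence becomes one document; its index serves as the single tag
documents = [TaggedDocument(words=sent, tags=[i])
             for i, sent in enumerate(brown.sents())]

# min_count raised from 1, per the advice above
d2vmodel = Doc2Vec(documents, size=100, window=1, min_count=5, workers=1)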

Gensim FastText - KeyError: "word not in vocabulary"

I was having trouble with the most_similar call on a FastText model. From my understanding, FastText should be able to obtain results for words that aren't in the vocabulary, but I'm getting a "Not in Vocabulary" error, even though prior to saving and loading the same call worked perfectly fine.
Here's the code, from Jupyter.
import gensim
model = gensim.models.FastText(my_sentences, size=100, window=5, min_count=3, workers=4, sg=1)
model.wv.most_similar(positive=['iPhone 6'])
Returns
[('iPhone7', 0.942690372467041),
('iPhone7.', 0.9395840764045715),
('iPhone5s', 0.9379133582115173),
('iPhone6s', 0.9338586330413818),
('iPhone5S', 0.9335439801216125),
('iPhone5.', 0.9318809509277344),
('iPhone®', 0.9314558506011963),
('iPhone6', 0.9268479347229004),
('iPhone4s', 0.9223971366882324),
('iPhone5', 0.9212019443511963)]
So far so good, now I save the model.
model.wv.save_word2vec_format("example_fasttext.txt", binary=False)
Then load it up again:
from gensim.models import KeyedVectors
new_model = KeyedVectors.load_word2vec_format('example_fasttext.txt', binary=False, limit=50000)
Then I do the exact most_similar call from the model I just loaded:
new_model.most_similar(positive=['iPhone 6'])
But results now are:
KeyError: "word 'iPhone 6' not in vocabulary"
Any idea what I did wrong?
Your problem is probably the limit parameter of the load_word2vec_format method. What you are doing here is loading vectors for only the 50000 most frequent words. If 'iPhone 6' does not appear often enough, you are not loading it.
Try with
new_model = KeyedVectors.load_word2vec_format('example_fasttext.txt', binary=False)
I'm having the same problem as you, and I think I am starting to understand what's going on.
Basically, when you save your model as a .txt or a .vec file, you are only saving the word-vectors, not the n-grams (saved in the binary version of your model) which allow you to generalize to / approximate out-of-vocabulary words.
I suggest you save your model with:
your_fasttext_model.save(file_path)
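For example, a minimal sketch of a save/load round-trip that keeps the n-grams, assuming a gensim-3.x FastText model like the one trained above:

from gensim.models import FastText

# .save() keeps the full model, including the n-gram buckets
model.save("example_fasttext.model")

loaded = FastText.load("example_fasttext.model")
print(loaded.wv.most_similar(positive=['iPhone 6']))  # OOV tokens still resolve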
