How to get word vectors from a gensim Doc2Vec?

I trained a gensim.models.doc2vec.Doc2Vec model
d2v_model = Doc2Vec(sentences, size=100, window=8, min_count=5, workers=4)
and I can get document vectors by
docvec = d2v_model.docvecs[0]
How can I get word vectors from the trained model?

Doc2Vec inherits from Word2Vec, and thus you can access word vectors the same as in Word2Vec, directly by indexing the model:
wv = d2v_model['apple']
Note, however, that a Doc2Vec training mode like pure DBOW (dm=0) doesn't need or create word vectors. (Pure DBOW still works pretty well and fast for many purposes!) If you do access word vectors from such a model, they'll just be the automatic randomly-initialized vectors, with no meaning.
Only when the Doc2Vec mode itself co-trains word-vectors, as in the DM mode (default dm=1) or when adding optional word-training to DBOW (dm=0, dbow_words=1), are word-vectors and doc-vectors both learned simultaneously.
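For illustration, here is a minimal sketch of those two word-vector-training configurations. The toy corpus and tags are made up; note that recent gensim versions use vector_size where older ones used size, and word-vectors are accessed via the .wv attribute:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# hypothetical toy corpus; every document needs at least one unique tag
tagged_docs = [
    TaggedDocument(words=['the', 'cat', 'sat', 'on', 'the', 'mat'], tags=['doc_0']),
    TaggedDocument(words=['the', 'dog', 'chased', 'the', 'cat'], tags=['doc_1']),
]

# DM mode (dm=1, the default): doc-vectors and word-vectors are co-trained
dm_model = Doc2Vec(tagged_docs, vector_size=100, window=8, min_count=1, workers=4, dm=1)

# pure DBOW (dm=0) would leave word-vectors untrained; adding dbow_words=1
# interleaves skip-gram word training, so word-vectors become meaningful too
dbow_model = Doc2Vec(tagged_docs, vector_size=100, window=8, min_count=1,
                     workers=4, dm=0, dbow_words=1)

print(dm_model.wv['cat'])      # trained word-vector
print(dbow_model.wv['cat'])    # also trained, thanks to dbow_words=1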

If you want to get all the trained doc vectors, you can easily use model.docvecs.doctag_syn0. If you want the doc-vector for a particular index or tag, you can use model.docvecs[i].
If you are training a Word2Vec model, you can get the word vectors from model.wv.syn0.
If you want more detail, check this GitHub issue: https://github.com/RaRe-Technologies/gensim/issues/1513
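As a quick sketch of those attribute names (assuming model is a trained Doc2Vec; doctag_syn0 and syn0 are the pre-4.0 names, with rough 4.0+ equivalents noted in comments):
# pre-4.0 attribute names, as used in the answer above
all_doc_vectors = model.docvecs.doctag_syn0   # array of all trained doc-vectors
one_doc_vector = model.docvecs[0]             # doc-vector for index/tag 0
all_word_vectors = model.wv.syn0              # array of all trained word-vectors

# rough gensim 4.0+ equivalents
all_doc_vectors = model.dv.vectors
all_word_vectors = model.wv.vectors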

Related

Concatenated Doc2Vec - calculate similarities

I have two Doc2Vec models trained on the same corpus but with different parameters. I would like to concatenate the two of them and calculate the similarity of a given input word, choosing the returned vectors from the concatenated model. I have read many comments saying that this method may not particularly improve performance, and that it might be necessary to change the source code of the KeyedVectors class in gensim to enable it. So far I have attempted to do this using the Translation Matrix, but it returns 5 features from the second model and I am not sure whether it is performing the translations correctly or not.
Has anybody already encountered this issue? Is there another way to calculate the similarity for an input word in a concatenated doc2vec model?
Up to now I have been able to produce this:
import numpy as np

vocab1 = model1.wv
vocab2 = model2.wv
concatenated_vectors = {}
vocab_concatenated = vocab1
for i in range(len(vocab1.vectors)):
    v1 = vocab1.vectors[i]
    v2 = vocab2.vectors[i]
    vocab_concatenated[list(vocab1.vocab.keys())[i]] = np.concatenate((v1, v2))
In order to re-calculate the most_similar() top-n features for a passed argument, how should I re-instantiate the newly created object? It seems that
.add_vectors(list(vocab1.vocab.keys()), vocab_concatenated[list(vocab1.vocab.keys())])
is not working, but I am sure I am missing something.
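One possible direction, sketched here under the assumption that both models cover the same vocabulary and that you are on gensim 4.x (where add_vectors() and index_to_key exist; model1, model2 and the probe word 'apple' are placeholders), is to build a fresh KeyedVectors of the combined dimensionality instead of mutating vocab1:
import numpy as np
from gensim.models import KeyedVectors

words = list(model1.wv.index_to_key)                           # shared vocabulary order
combined_dim = model1.wv.vector_size + model2.wv.vector_size

# stack each model's vectors side by side, one row per word
combined = np.hstack([
    np.vstack([model1.wv[w] for w in words]),
    np.vstack([model2.wv[w] for w in words]),
])

concat_kv = KeyedVectors(combined_dim)
concat_kv.add_vectors(words, combined)

print(concat_kv.most_similar('apple', topn=5))
Because the result is an ordinary KeyedVectors, most_similar() and the other standard lookup operations work on it directly.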

Is a gensim vocab index the index in the corresponding 1-hot-vector?

I am doing research that requires direct manipulation & embedding of one-hot vectors and I am trying to use gensim to load a pretrained word2vec model for this.
The problem is that gensim doesn't seem to have a direct API for working with one-hot vectors, so I am looking for workarounds.
So I wanted to know if anyone knows of a way to do this, or, more specifically, whether these vocab indices (which are defined quite ambiguously) could be indices into the corresponding one-hot vectors?
Context I have found:
This question seems related, but when I tried accessing the 'input embeddings' (assuming they were one-hot representations) via model.syn0 (from the link in the answer), I got a non-sparse matrix...
It also appears they refer to word indices as 'doctags' (search for Doctag/index).
Here is another question giving some context to the indices (although not quite answering my question).
Here is the official documentation:
################################################
class gensim.models.keyedvectors.Vocab(**kwargs)
Bases: object
A single vocabulary item, used internally for collecting per-word frequency/sampling info, and for constructing binary trees (incl. both word leaves and inner nodes).
################################################
Yes, you can think of the index (position) of a word in gensim's Word2Vec as the one dimension that would be 1.0 in its one-hot vector, with all the other V dimensions (where V is the count of unique words) being 0.0.
The implementation doesn't actually ever create one-hot vectors, as a sparse or explicit representation. It's just using the word's index as a look-up for its dense vector – following in the path of the word2vec.c code from Google on which the gensim implementation was originally based.
(The term 'doctags' is only relevant in the Doc2Vec – aka 'Paragraph Vector' – implementation. There it is the name for the distinct tokens/ints that are used for looking up document-vectors, using a different namespace from in-document words. That is, in Doc2Vec you could use 'doc_007' as a doc-vector name, aka a 'doctag', and even if the string-token 'doc_007' also appears as a word inside documents, the doc-vector referenced by doctag-key 'doc_007' and the word-vector referenced by word-key 'doc_007' wouldn't be the same internal vector.)
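A small sketch of that correspondence (assuming model is a trained Word2Vec on the pre-4.0 API, where the Vocab objects quoted above live in model.wv.vocab; 'apple' is a placeholder word):
import numpy as np

word = 'apple'
V = len(model.wv.vocab)                 # vocabulary size

idx = model.wv.vocab[word].index        # gensim 4.0+: model.wv.key_to_index[word]
one_hot = np.zeros(V)
one_hot[idx] = 1.0                      # the position that "would be" 1.0 for this word

# gensim never builds this sparse vector; the index is used directly as a
# row lookup into the dense embedding matrix
dense = model.wv.vectors[idx]           # same vector as model.wv[word] (.syn0 in older releases)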

Combining a word array and a vector array to make a Gensim W2V model

I have a word array from a pickle file, and a corresponding vector array from an npy file. How do I combine them to make a Gensim W2V model?
That's not enough to make a full Word2Vec model instance, which is usually created via a survey of, & then training of, a text corpus. (Those steps also compile necessary word frequencies & train internal model weights that aren't part of a set of word-vectors.)
You could create a gensim KeyedVectors instance of the right dimensionality, then use its .add() method (renamed .add_vectors() in gensim 4.0+) to add your values. That requires you have a list of the words, and the array of vectors, in the same order. This would allow lots of standard operations on the word-vectors, like .most_similar(), but not further word2vec-training.
For example:
from gensim.models import KeyedVectors

kv = KeyedVectors(vector_size)           # vector_size: dimensionality of your vectors, e.g. 100
kv.add(list_of_words, array_of_vectors)  # words and vectors in matching order; .add_vectors() in gensim 4.0+
print(kv.most_similar('apple'))

Doc2vec - About getting document vector

I'm a very new student of doc2vec and have some questions about document vector.
What I'm trying to get is a vector for a phrase like 'cat-like mammal'.
So far, using a pre-trained doc2vec model, I have tried the code below:
import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase]
When I tried this code, I could get a vector for one word 'cat', but not 'cat-like mammal'.
Because word2vec only provides the vector for one word like 'cat', right? (If I'm wrong, please correct me.)
So I've searched and found infer_vector() and tried the code below
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
When I tried this code, I could get a vector, but I get a different value every time I run
phrase_vec = m.infer_vector(phrase)
because infer_vector() has a 'steps' parameter.
When I set steps=0, I always get the same vector:
phrase_vec = m.infer_vector(phrase, steps=0)
However, I also found that a document vector can be obtained by averaging the words in the document:
for example, if the document is composed of the three words 'cat', 'like', 'mammal', you add their three vectors and then average them, and that would be the document vector. (If I'm wrong, please correct me.)
So here are some questions.
Is using infer_vector() with 0 steps the right way to get a vector for a phrase?
If averaging word vectors is the right way to get a document vector, is there no need to use infer_vector()?
What is model.docvecs for?
Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can see also an entry about this non-determinism in Doc2Vec training or inference in the gensim FAQ.
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.
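A short sketch putting that together (assuming m is the loaded Doc2Vec model from the question; the steps parameter is called epochs in gensim 4.0+):
import numpy as np

phrase = 'cat like mammal'.lower().split()

# more inference passes give more stable (though still not identical) results
vec_a = m.infer_vector(phrase, epochs=50)   # use steps=50 on older gensim versions
vec_b = m.infer_vector(phrase, epochs=50)

# the two runs should be very similar, even if not bit-identical
cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(cos)

# compare the inferred vector against doc-vectors learned during training
print(m.docvecs.most_similar([vec_a], topn=5))   # m.dv.most_similar(...) in gensim 4.0+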

Is there any way to get the vocabulary size from doc2vec model?

I am using gensim doc2vec. I want to know if there is any efficient way to get the vocabulary size from doc2vec. One crude way is to count the total number of words, but if the data is huge (1 GB or more) then this won't be efficient.
If model is your trained Doc2Vec model, then the number of unique word tokens in the surviving vocabulary after applying your min_count is available from:
len(model.wv.vocab)
The number of trained document tags is available from:
len(model.docvecs)
The vocab attribute is a dictionary, so you can use keys() as follows:
model.wv.vocab.keys()
This returns the words (as a dict view; wrap it in list() if you need a list).
An update for gensim version 4: you can get the vocabulary size with
vocab_len = len(model.wv)
See the Migrating to Gensim 4.0 page.
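Putting the versions together, a short sketch (assuming model is a trained Doc2Vec on gensim 4.0+, with the 3.x equivalents noted in comments):
# gensim 4.0+ (on 3.x, use len(model.wv.vocab) and model.wv.vocab.keys() instead)
vocab_len = len(model.wv)             # number of surviving unique words
words = list(model.wv.key_to_index)   # the words themselves

num_doc_tags = len(model.dv)          # trained document tags (model.docvecs on 3.x)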
