Combining a word array and a vector array to make a Gensim W2V model - gensim

I have a word array from a pickle file and a corresponding vector array from an npy file. How do I combine them to make a Gensim W2V model?

That's not enough to make a full Word2Vec model instance, which is usually created by surveying a text corpus and then training on it. (Those steps also compile necessary word frequencies and train internal model weights that aren't part of a set of word-vectors.)
You could create a gensim KeyedVectors instance of the right dimensionality, then use its .add() method (named .add_vectors() in gensim 4.0 and later) to add your values. That requires a list of the words and the array of vectors, in the same order. This would allow lots of standard operations on the word-vectors, like .most_similar(), but not further word2vec training.
For example:
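from gensim.models import KeyedVectors
# vector_size must equal the width of each vector
vector_size = array_of_vectors.shape[1]
kv = KeyedVectors(vector_size)
# words and vectors must be in the same order
kv.add(list_of_words, array_of_vectors)
print(kv.most_similar('apple'))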
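Before building the KeyedVectors, it may help to confirm that the two files really line up. A minimal sketch, assuming the pickle holds a plain list of strings and the .npy file holds a 2-D array with one row per word. The file names and the helper load_aligned are made up for illustration, and the demo writes tiny stand-in files to a temporary directory:

```python
import os
import pickle
import tempfile

import numpy as np


def load_aligned(words_path, vectors_path):
    """Load a word list and a vector array, checking that they correspond row by row."""
    with open(words_path, "rb") as f:
        words = list(pickle.load(f))
    vectors = np.load(vectors_path)
    if vectors.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_words, vector_size)")
    if len(words) != vectors.shape[0]:
        raise ValueError("word count and vector count differ")
    if len(set(words)) != len(words):
        raise ValueError("duplicate words found")
    return words, vectors


# Demo with tiny made-up data written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
words_path = os.path.join(tmp_dir, "words.pkl")      # hypothetical file name
vectors_path = os.path.join(tmp_dir, "vectors.npy")  # hypothetical file name
with open(words_path, "wb") as f:
    pickle.dump(["apple", "banana", "cherry"], f)
np.save(vectors_path, np.arange(12, dtype=np.float32).reshape(3, 4))

list_of_words, array_of_vectors = load_aligned(words_path, vectors_path)
vector_size = array_of_vectors.shape[1]
print(len(list_of_words), vector_size)
```

The checks only catch gross mismatches (different counts, duplicates). They cannot tell whether the rows are in the right order, so that still has to be guaranteed by however the two files were produced.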

Related

Concatenated Doc2Vec - calculate similarities

I have two Doc2Vec models trained on the same corpus but with different parameters. I would like to concatenate the two and calculate similarities for a given input word, using vectors from the concatenated model. I have read many comments suggesting that this method may not improve performance much, and that it might require changing the source code of the KeyedVectors class in gensim. So far I have tried the Translation Matrix, but it returns 5 features from the second model and I am not sure whether it is performing the translations correctly.
Has anybody already encountered this issue? Is there another way to calculate the similarity for an input word in a concatenated doc2vec model?
Up to now I have been able to reproduce this:
vocab1 = model1.wv
vocab2 = model2.wv
concatenated_vectors = {}
vocab_concatenated = vocab1
for i in range(len(vocab1.vectors)):
    v1 = vocab1.vectors[i]
    v2 = vocab2.vectors[i]
    vocab_concatenated[list(vocab1.vocab.keys())[i]] = np.concatenate((v1, v2))
To re-calculate the most_similar() top-n features for a passed argument, how should I re-instantiate the newly created object? It seems that
.add_vectors(list(vocab1.vocab.keys()), vocab_concatenated[list(vocab1.vocab.keys())])
is not working, but I am sure I am missing something.
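One point worth checking in the loop above is that row i of one model need not hold the same word as row i of the other, so pairing by word is safer than pairing by position. A minimal NumPy-only sketch, with small made-up dictionaries standing in for the two models' word vectors and a hypothetical most_similar helper (not gensim's method):

```python
import numpy as np

# Made-up stand-ins for the word vectors of two models (word -> vector).
vectors1 = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
}
vectors2 = {
    "car": np.array([0.0, 0.0, 1.0]),
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.8, 0.2, 0.0]),
}

# Keep only words present in both models, in a fixed order.
words = sorted(set(vectors1) & set(vectors2))
# Pair vectors by word, not by row position.
matrix = np.vstack([np.concatenate((vectors1[w], vectors2[w])) for w in words])


def most_similar(word, topn=2):
    """Rank the other words by cosine similarity to `word` (hypothetical helper)."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[words.index(word)]
    order = [i for i in np.argsort(-sims) if words[i] != word]
    return [(words[i], float(sims[i])) for i in order[:topn]]


print(matrix.shape)
print(most_similar("cat"))
```

The resulting words and matrix are in matching order, so they could be loaded into a fresh KeyedVectors of size matrix.shape[1] rather than written back into one of the original models, whose dimensionality is already fixed.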

Is a gensim vocab index the index in the corresponding 1-hot-vector?

I am doing research that requires direct manipulation and embedding of one-hot vectors, and I am trying to use gensim to load a pretrained word2vec model for this.
The problem is that gensim doesn't seem to have a direct API for working with 1-hot-vectors, so I am looking for workarounds.
Does anyone know of a way to do this? More specifically, could these vocab indices (which are defined quite ambiguously) be indices into the corresponding 1-hot-vectors?
Context I have found:
Seems this question is related but I tried accessing the 'input embeddings' (assuming they were one-hot representations), via model.syn0 (from link in answer), but I got a non-sparse matrix...
Also appears they refer to word indices as 'doctags' (search for Doctag/index).
Here is another question giving some context to the indices (although not quite answering my question).
Here is the official documentation:
################################################
class gensim.models.keyedvectors.Vocab(**kwargs)
Bases: object
A single vocabulary item, used internally for collecting per-word frequency/sampling info, and for constructing binary trees (incl. both word leaves and inner nodes).
################################################
Yes, you can think of the index (position) of a word in gensim's Word2Vec word-vectors as being the one dimension that would be 1.0, with all the other V-1 dimensions, where V is the count of unique words, being 0.0.
The implementation doesn't actually ever create one-hot vectors, as a sparse or explicit representation. It's just using the word's index as a look-up for its dense vector – following in the path of the word2vec.c code from Google on which the gensim implementation was originally based.
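The equivalence between a one-hot vector and an index look-up can be checked with plain NumPy. This is a minimal sketch with a small random matrix standing in for a model's dense word-vector array; the sizes and the index are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, vector_size = 5, 3  # made-up toy dimensions
embeddings = rng.normal(size=(vocab_size, vector_size))  # stand-in for the dense word-vector array

index = 2  # a word's index (position) in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[index] = 1.0

# Multiplying the one-hot vector by the matrix selects exactly one row...
via_one_hot = one_hot @ embeddings
# ...which is what a direct index look-up returns without building the one-hot vector.
via_lookup = embeddings[index]

print(np.allclose(via_one_hot, via_lookup))
```

So if the research needs explicit one-hot vectors, they can be constructed from the word index as above and multiplied against the loaded array of word-vectors.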
(The term 'doctags' is only relevant in the Doc2Vec – aka 'Paragraph Vector' – implementation. There it is the name for the distinct tokens/ints that are used for looking up document-vectors, using a different namespace from in-document words. That is, in Doc2Vec you could use 'doc_007' as a doc-vector name, aka a 'doctag', and even if the string-token 'doc_007' also appears as a word inside documents, the doc-vector referenced by doctag-key 'doc_007' and the word-vector referenced by word-key 'doc_007' wouldn't be the same internal vector.)

FastTextKeyedVectors difference between vectors, vectors_vocab and vectors_ngrams instance variables

I downloaded wiki-news-300d-1M-subword.bin.zip and loaded it as follows:
import gensim
print(gensim.__version__)
model = gensim.models.fasttext.load_facebook_model('./wiki-news-300d-1M-subword.bin')
print(type(model))
model_keyedvectors = model.wv
print(type(model_keyedvectors))
model_keyedvectors.save('./wiki-news-300d-1M-subword.keyedvectors')
As expected, I see the following output:
3.8.1
<class 'gensim.models.fasttext.FastText'>
<class 'gensim.models.keyedvectors.FastTextKeyedVectors'>
I also see the following three numpy arrays serialized to the disk:
$ du -h wiki-news-300d-1M-subword.keyedvectors*
127M wiki-news-300d-1M-subword.keyedvectors
2.3G wiki-news-300d-1M-subword.keyedvectors.vectors_ngrams.npy
2.3G wiki-news-300d-1M-subword.keyedvectors.vectors.npy
2.3G wiki-news-300d-1M-subword.keyedvectors.vectors_vocab.npy
I understand vectors_vocab.npy and vectors_ngrams.npy. However, what is vectors.npy used for internally in gensim.models.keyedvectors.FastTextKeyedVectors? Looking at the source code for word-vector lookup, I do not see the vectors attribute being used anywhere; I only see vectors_vocab and vectors_ngrams being used. Yet if I remove the vectors.npy file, I am not able to load the model using the gensim.models.keyedvectors.FastTextKeyedVectors.load method.
Can someone please explain where this variable is used? Can I remove it if all I am interested in is looking up word vectors (to reduce the memory footprint)?
Thanks.
vectors_ngrams are the buckets storing the vectors that are learned from word-fragments (character-n-grams). It's a fixed size no matter how many n-grams are encountered - as multiple n-grams can 'collide' into the same slot.
vectors_vocab are the full-word-token vectors as trained by the FastText algorithm, for full-words of interest. However, note that the actual word-vector, as returned by FastText for an in-vocabulary word, is defined as being this vector plus all the subword vectors.
vectors stores the actual, returnable full-word vectors for in-vocabulary words. That is: it's the precalculated combination of the vectors_vocab value plus all the word's n-gram vectors.
So, vectors is never directly trained, and can always be recalculated from the other arrays. It probably should not be stored as part of the saved model (as it's redundant info that could be reconstructed on demand).
(It could possibly even be made an optional optimization, for the specific case of FastText – with users who are willing to save memory, but have slower per-word lookup, discarding it. However, this would complicate the very common and important most_similar()-like operations, which are far more efficient if they have a full, ready array of all potential-answer word-vectors.)
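How the three arrays relate can be sketched with toy data. This is an illustration under stated assumptions, not gensim's code: the bucket hash is a stand-in (zlib.crc32, not the hash FastText uses), the n-gram lengths and array sizes are made up, and combining by averaging is an assumption of the sketch:

```python
import zlib

import numpy as np


def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def bucket_of(ngram, num_buckets):
    """Toy hash into a fixed number of buckets (NOT the hash FastText uses)."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


rng = np.random.default_rng(0)
vocab = ["cat", "cats"]
vector_size, num_buckets = 4, 10  # tiny made-up sizes; collisions are expected

vectors_vocab = rng.normal(size=(len(vocab), vector_size))    # trained full-word vectors
vectors_ngrams = rng.normal(size=(num_buckets, vector_size))  # trained n-gram buckets


def combine(word_index):
    """Rebuild one returnable word vector from the two trained arrays."""
    ngrams = char_ngrams(vocab[word_index])
    total = vectors_vocab[word_index].copy()
    for ngram in ngrams:
        total += vectors_ngrams[bucket_of(ngram, num_buckets)]
    return total / (len(ngrams) + 1)  # averaging is an assumption of this sketch


# `vectors` is derived, never trained: it can be recomputed at any time.
vectors = np.vstack([combine(i) for i in range(len(vocab))])
print(vectors.shape)
```

In this picture, vectors is a cache with one row per in-vocabulary word: it costs memory the same size as vectors_vocab but saves repeating the n-gram summation on every look-up.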
If you don't see vectors being directly accessed, perhaps you're not considering methods inherited from superclasses.
While any model that was saved with vectors present will need that file when later .load()ed, you could conceivably save on disk-storage by discarding the model.wv.vectors property before saving, then forcing its reconstruction after loading. You would still be paying the RAM cost, when the model is loaded.
After vectors is calculated, and if you're completely done training, you could conceivably discard the vectors_vocab property to save RAM. (For any known word, the vectors can be consulted directly for instant look-up, and vectors_vocab is only needed in the case of further training or needing to re-generate vectors.)

Doc2vec - About getting document vector

I'm very new to doc2vec and have some questions about document vectors.
What I'm trying to get is a vector for a phrase like 'cat-like mammal'.
So far, using a pre-trained doc2vec model, I have tried the code below:
import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
# fails: the whole phrase is not a single vocabulary word
phrase_vec = m[phrase]
With this code, I could get a vector for the single word 'cat', but not for 'cat-like mammal'.
Is that because word2vec only provides vectors for single words like 'cat'? (If I'm wrong, please correct me.)
So I searched, found infer_vector(), and tried the code below:
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
With this code I do get a vector, but I get a different value every time I run
phrase_vec = m.infer_vector(phrase)
This is because infer_vector has 'steps'.
When I set steps=0, I always get the same vector.
phrase_vec = m.infer_vector(phrase, steps=0)
However, I also read that a document vector is obtained by averaging the vectors of the words in the document.
For example, if the document is composed of the three words 'cat-like mammal', you add the three vectors for 'cat', 'like', and 'mammal' and then average them, and that would be the document vector. (If I'm wrong, please correct me.)
So here are my questions:
Is using infer_vector() with 0 steps the right way to get a vector for a phrase?
If averaging word vectors is the right way to get a document vector, is there no need to use infer_vector()?
What is model.docvecs for?
Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can also see an entry about this non-determinism in Doc2Vec training and inference in the gensim FAQ.
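Why zero steps gives a repeatable but meaningless result, while more steps gives varying but mutually similar results, can be seen in a toy analogy. This is not gensim's inference algorithm: it simply nudges a vector from a fixed starting point toward a made-up target, with a little random noise on each step. The function toy_infer and all of its parameters are invented for illustration:

```python
import numpy as np

target = np.linspace(-1.0, 1.0, 20)  # made-up "ideal" vector for some text


def toy_infer(steps, noise_seed, lr=0.1, noise=0.01):
    """Toy stand-in for iterative inference; not gensim's algorithm."""
    # Same starting point on every call, unrelated to the target.
    vec = np.random.default_rng(12345).uniform(-0.01, 0.01, size=target.shape)
    rng = np.random.default_rng(noise_seed)  # run-to-run randomness
    for _ in range(steps):
        vec = vec + lr * (target - vec) + rng.normal(scale=noise, size=vec.shape)
    return vec


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


zero_a, zero_b = toy_infer(0, noise_seed=1), toy_infer(0, noise_seed=2)
many_a, many_b = toy_infer(100, noise_seed=1), toy_infer(100, noise_seed=2)

print(np.array_equal(zero_a, zero_b))  # identical, but never moved from the start
print(np.array_equal(many_a, many_b))  # not identical...
print(cosine(many_a, many_b))          # ...yet close to each other
print(cosine(zero_a, target), cosine(many_a, target))
```

The same kind of cosine check can be applied to two real infer_vector() results for the same text to confirm that they agree within a small tolerance.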
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
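The averaging baseline mentioned above takes only a few lines. A minimal sketch, with a made-up dictionary standing in for a trained model's word vectors and a hypothetical helper average_vector:

```python
import numpy as np

# Made-up word vectors standing in for a trained model's look-up table.
word_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "like": np.array([0.0, 1.0, 0.0]),
    "mammal": np.array([0.0, 0.0, 1.0]),
}


def average_vector(tokens, vectors):
    """Mean of the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens to average")
    return np.mean(known, axis=0)


phrase_vec = average_vector("cat like mammal".split(), word_vectors)
print(phrase_vec)
```

An average like this is built from word-vectors, so it is a baseline in its own right rather than something to compare against the doc-vectors learned during Doc2Vec training.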
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.

How to get word vectors from a gensim Doc2Vec?

I trained a gensim.models.doc2vec.Doc2Vec model
d2v_model = Doc2Vec(sentences, size=100, window=8, min_count=5, workers=4)
and I can get document vectors by
docvec = d2v_model.docvecs[0]
How can I get word vectors from the trained model?
Doc2Vec inherits from Word2Vec, and thus you can access word vectors the same as in Word2Vec, directly by indexing the model (in newer gensim versions, index the model's .wv property instead, as in d2v_model.wv['apple']):
wv = d2v_model['apple']
Note, however, that a Doc2Vec training mode like pure DBOW (dm=0) doesn't need or create word vectors. (Pure DBOW still works pretty well and fast for many purposes!) If you do access word vectors from such a model, they'll just be the automatic randomly-initialized vectors, with no meaning.
Only when the Doc2Vec mode itself co-trains word-vectors, as in the DM mode (default dm=1) or when adding optional word-training to DBOW (dm=0, dbow_words=1), are word-vectors and doc-vectors both learned simultaneously.
If you want all the trained doc vectors, you can use model.docvecs.doctag_syn0. If you want the vector of the doc at index i, you can use model.docvecs[i].
If you trained a Word2Vec model, you can get its word vectors from model.wv.syn0.
For more details, see this GitHub issue: https://github.com/RaRe-Technologies/gensim/issues/1513
