Doc2vec - About getting document vector - word-embedding

I'm a very new student of doc2vec and have some questions about document vector.
What I'm trying to get is a vector of phrase like 'cat-like mammal'.
So, what I've tried so far is by using doc2vec pre-trained model, I tried the code below
import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g. Doc2vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase_vec]
When I tried this code, I could get a vector for one word 'cat', but not 'cat-like mammal'.
Because word2vec only provide the vector for one word like 'cat' right? (If I'm wrong, plz correct me)
So I've searched and found infer_vector() and tried the code below
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
When I tried this code, I could get a vector, but every time I get different value when I tried
phrase_vec = m.infer_vector(phrase)
Because infer_vector has 'steps'.
When I set steps=0, I get always the same vector.
phrase_vec = m.infer_vector(phrase, steps=0)
However, I also found that document vector is obtained from averaging words in document.
like if the document is composed of three words, 'cat-like mammal', add three vectors of 'cat', 'like', 'mammal', and then average it, that would be the document vector. (If I'm wrong, plz correct me)
So here are some questions.
Is it the right way to use infer_vector() with 0 steps to getting a vector of phrase?
If it is the right averaging vector of words to get document vector, is there no need to use infer_vector()?
What is a model.docvecs for?

Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can see also an entry about this non-determinism in Doc2Vec training or inference in the gensim FAQ.
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.

Related

Concatenated Doc2Vec - calculate similarities

I have two Doc2Vec models trained on the same corpus but with different parameters. I would like to concatenate the two of them and calculate the similarity of a given input word, choosing the returned vectors from the concatenated model. I read a lot of comments regarding the fact that this method may not be particularly suited for performance improvement and that it might be necessary to change the source code to the KeyedVector class in gensim to enable it. Up to now I attempted to do that using the Translation Matrix but it returns 5 features from the second model and I am not sure about whether it is performing the translations correctly or not.
Has anybody already encountered this issue? Is there another way to calculate the similarity for an input word in a concatenated doc2vec model?
Up to now I have been able to reproduce this:
vocab1 = model1.wv
vocab2 = model2.wv
concatenated_vectors = {}
vocab_concatenated = vocab1
for i in range(len(vocab1.vectors)):
v1 = vocab1.vectors[i]
v2 = vocab2.vectors[i]
vocab_concatenated[list(vocab1.vocab.keys())[i]] = np.concatenate((v1, v2))
In order to re-calculate the most_similar() top-n features for a passed argument, how should I re-istantiate the newly created object? It seems that
.add_vectors(list(vocab1.vocab.keys()), vocab_concatenated[list(vocab1.vocab.keys())])
is not working, but I am sure I am missing something.

Assessing doc2vec accuracy

I am trying to assess a doc2vec model based on the code from here. Basically, I want to know the percentual of inferred documents are found to be most similar to itself. This is my current code an:
for doc_id, doc in enumerate(cur.execute('SELECT Text FROM Patents')):
docs += 1
doc = clean_text(doc)
inferred_vector = model.infer_vector(doc)
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims].index(doc_id)
ranks.append(rank)
counter = collections.Counter(ranks)
accuracy = counter[0] / docs
This code works perfectly with smaller datasets. However, since I have a huge file with millions of documents, this code becomes too slow, it would take months to compute. I profiled my code and most of the time is consumed by the following line: sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs)).
If I am not mistaken, this is having to measure each document to every other document. I think computation time might be massively reduced if I change this to topn=1 instead since the only thing I want to know is if the most similar document is itself or not. Doing this will basically take each doc (i.e., inferred_vector), measure its most similar document (i.e., topn=1), and then I just see if it is itself or not. How could I implement this? Any help or idea is welcome.
To have most_similar() return only the single most-similar document, that is as simple as specifying topn=1.
However, to know which one document of the millions is the most-similar to a single target vector, the similarities to all the candidates must be calculated & sorted. (If even one document was left out, it might have been the top-ranked one!)
Making sure absolutely no virtual-memory swapping is happening will help ensure that brute-force comparison happens as fast as possible, all in RAM – but with millions of docs, it will still be time-consuming.
What you're attempting is a fairly simple "self-check" as to whether training led to self-consistent model: whether the re-inference of a document creates a vector "very close to" the same doc-vector left over from bulk training. Failing that will indicate some big problems in doc-prep or training, but it's not a true measure of the model's "accuracy" for any real task, and the model's value is best evaluated against your intended use.
Also, because this "re-inference self-check" is just a crude sanity check, there's no real need to do it for every document. Picking a thousand (or ten thousand, or whatever) random documents will give you a representative idea of whether most of the re-inferred vectors have this quality, or not.
Similarly, you could simply check the similarity of the re-inferred vector against the single in-model vector for that same document-ID, and check whether they are "similar enough". (This will be much faster, but could also be done on just a random sample of docs.) There's no magic proper threshold for "similar enough"; you'd have to pick one that seems to match your other goals. For example, using scikit-learn's cosine_similarity() to compare the two vectors:
from sklearn.metrics.pairwise import cosine_similarity
# ...
inferred_vector = model.infer_vector(doc_words)
looked_up_vector = model.dv[doc_id]
self_similarity = cosine_similarity([inferred_vector], [looked_up_vector])[0]
# then check that value against some threshold
(You have to wrap the single vectors in lists as arguments to cosine_similarity(), then access the 0th element of the return value, because it is designed to usually work on larger lists of vectors.)
With this calculation, you wouldn't know if, for example, some of the other stored-doc-vectors are a little closer to your inferred target - but that may not be that important, anyway. The docs might be really similar! And while the original "closest to itself" self-check will fail miserably if there were major defects in training, even a well-trained model will likely have some cases where natural model jitter prevents a "closest to itself" for every document. (With more documents inside the same number of dimensions, or certain corpuses with lots of very-similar documents, this would become more common... but not be a concerning indicator of any model problems.)

Similarity measure using vectors in gensim

I have a pair of word and semantic types of those words. I am trying to compute the relatedness measure between these two words using semantic types, for example: word1=king, type1=man, word2=queen, type2=woman
we can use gensim word_vectors.most_similar to get 'queen' from 'king-man+woman'. However, I am looking for similarity measure between vector represented by 'king-man+woman' and 'queen'.
I am looking for a solution to above (or)
way to calculate vector that is representative of 'king-man+woman' (and)
calculating similarity between two vectors using vector values in gensim (or)
way to calculate simple mean of the projection weight vectors(i.e king-man+woman)
You should look at the source code for the gensim most_similar() method, which is used to propose answers to such analogy questions. Specifically, when you try...
sims = wv_model.most_similar(positive=['king', 'woman'], negative=['man'])
...the top result will (in a sufficiently-trained model) often be 'queen' or similar. So, you can look at the source code to see exactly how it calculates the target combination of wv('king') - wv('man') + wv('woman'), before searching all known vectors for those closest vectors to that target. See...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L486
...and note that the local variable mean is the combination of the positive and negative values provided.
You might also find other methods there useful, either directly or as models for your own code, such as distances()...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L934
...or n_similarity()...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L1005

How to rotate a word2vec onto another word2vec?

I am training multiple word2vec models with Gensim. Each of the word2vec will have the same parameter and dimension, but trained with slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Their similarity among other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both of the word2vec representation in such way that same words occupy same position in vector space, or at least they are as close as possible.
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
But, there are some other techniques you could also consider:
transform & concatenate the training corpuses instead, to both retain some words that are the same across all corpuses (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying-words.)
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)

How to get word vectors from a gensim Doc2Vec?

I trained a gensim.models.doc2vec.Doc2Vec model
d2v_model = Doc2Vec(sentences, size=100, window=8, min_count=5, workers=4)
and I can get document vectors by
docvec = d2v_model.docvecs[0]
How can I get word vectors from trained model ?
Doc2Vec inherits from Word2Vec, and thus you can access word vectors the same as in Word2Vec, directly by indexing the model:
wv = d2v_model['apple']
Note, however, that a Doc2Vec training mode like pure DBOW (dm=0) doesn't need or create word vectors. (Pure DBOW still works pretty well and fast for many purposes!) If you do access word vectors from such a model, they'll just be the automatic randomly-initialized vectors, with no meaning.
Only when the Doc2Vec mode itself co-trains word-vectors, as in the DM mode (default dm=1) or when adding optional word-training to DBOW (dm=0, dbow_words=1), are word-vectors and doc-vectors both learned simultaneously.
If you want to get all the trained doc vectors, you can easily use
model.docvecs.doctag_syn0. If you want to get the indexed doc, you can use model.docvecs[i].
If you are training a Word2Vec model, you can get model.wv.syn0.
If you want to get more, check this github issue link: (https://github.com/RaRe-Technologies/gensim/issues/1513)

Resources