How to interpret doc2vec results on previously seen data? - gensim

I use gensim 4.0.1 and train doc2vec:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
sentences = [['hello', 'world'], ['james', 'bond'], ['adam', 'smith']]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]
model = Doc2Vec(documents, vector_size=5, window=5, min_count=0, workers=4)
documents
[TaggedDocument(words=['hello', 'world'], tags=[0]),
TaggedDocument(words=['james', 'bond'], tags=[1]),
TaggedDocument(words=['adam', 'smith'], tags=[2])]
model.dv[0],model.dv[1],model.dv[2]
(array([-0.10461631, -0.11958256, -0.1976151 , 0.1710569 , 0.0713223 ],
dtype=float32),
array([ 0.00526548, -0.19761242, -0.10334401, -0.19437183, 0.04021204],
dtype=float32),
array([ 0.05662392, 0.09290017, -0.08597242, -0.06293383, -0.06159503],
dtype=float32))
I expect to get a match on TaggedDocument #1
seen = ['james','bond']
Surprisingly, that known text (james bond) produces a completely "unseen" vector:
new_vector = model.infer_vector(seen)
new_vector
array([-0.07762126, 0.03976333, -0.02985927, 0.07899596, -0.03556045],
dtype=float32)
most_similar_cosmul() does not point to the expected tag 1. Moreover, all 3 scores are quite weak, as if the data were completely unseen.
model.dv.most_similar_cosmul(positive=[new_vector])
[(0, 0.5322251915931702), (2, 0.4972134530544281), (1, 0.46321794390678406)]
What is wrong here, any ideas?

Five dimensions is still too many for a toy-sized dataset of just 6 unique words across 3 two-word texts.
None of the Word2Vec/Doc2Vec/FastText-type algorithms works well on tiny amounts of contrived data. They only learn their patterns from many, subtly-contrasting usages of words in varied contexts.
Their real strengths only emerge with vectors that are 50, 100, or hundreds-of-dimensions wide - and training that many dimensions requires a unique vocabulary of (at least) many thousands of words – ideally tens or hundreds of thousands of words – with many usage examples of each. (For a variant like Doc2Vec, you'd similarly want many thousands of varied documents.)
You'll see improved correlations with expected results when using sufficient training data.
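For what it's worth, you can reduce run-to-run noise a little by giving infer_vector() more passes and using plain cosine most_similar() instead of the cosmul variant. This is only a hedged sketch reusing the model and seen variables from the question (the epochs value is an arbitrary illustration), and it does not fix the underlying tiny-data problem:
new_vector = model.infer_vector(seen, epochs=200)   # more inference passes than the default; 200 is just an illustration
print(model.dv.most_similar([new_vector], topn=3))  # plain cosine lookup; still unreliable with only 3 two-word docs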

Related

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to integrate my LDA model into a web application, and if there were a way to use matrix multiplication to get a similar result, then I could use the model in JavaScript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))
    return processedList
lda_model = models.LdaModel.load('LDAModel\GoldModel') #Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, and get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel JavaScript code that, given the right parts of the model's state, can reproduce its results.
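If it helps, here is a rough sketch of what that code path does, reusing lda_model and bowDoc from your snippet: get_document_topics() runs variational inference to obtain a gamma vector for the document and then normalizes it, rather than multiplying the raw term-topic matrix by the bag-of-words counts (it also drops topics below a minimum_probability threshold before returning):
import numpy as np
gamma, _ = lda_model.inference([bowDoc])   # variational inference step used internally
topic_dist = gamma[0] / np.sum(gamma[0])   # normalize gamma into a topic distribution
print(topic_dist)                          # should line up much more closely with lda_model[bowDoc]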

Why does a Gensim Doc2vec object return empty doctags?

My question is: how should I interpret this situation?
I trained a Doc2Vec model following this tutorial https://blog.griddynamics.com/customer2vec-representation-learning-and-automl-for-customer-analytics-and-personalization/.
For some reason, doc_model.docvecs.doctags returns {}, but doc_model.docvecs.vectors_docs seems to return a proper value.
Why does the Doc2Vec object return no doctags but proper vectors_docs?
Thank you for any comments and answers in advance.
This is the code I used to train a Doc2Vec model.
from gensim.models.doc2vec import LabeledSentence, TaggedDocument, Doc2Vec
import timeit
import gensim
embeddings_dim = 200 # dimensionality of user representation
filename = f'models/customer2vec.{embeddings_dim}d.model'
if TRAIN_USER_MODEL:
    class TaggedDocumentIterator(object):
        def __init__(self, df):
            self.df = df
        def __iter__(self):
            for row in self.df.itertuples():
                yield TaggedDocument(words=dict(row._asdict())['all_orders'].split(), tags=[dict(row._asdict())['user_id']])
    it = TaggedDocumentIterator(combined_orders_by_user_id)
    doc_model = gensim.models.Doc2Vec(vector_size=embeddings_dim,
                                      window=5,
                                      min_count=10,
                                      workers=mp.cpu_count()-1,
                                      alpha=0.055,
                                      min_alpha=0.055,
                                      epochs=20)  # use fixed learning rate
    train_corpus = list(it)
    doc_model.build_vocab(train_corpus)
    for epoch in tqdm(range(10)):
        doc_model.alpha -= 0.005  # decrease the learning rate
        doc_model.min_alpha = doc_model.alpha  # fix the learning rate, no decay
        doc_model.train(train_corpus, total_examples=doc_model.corpus_count, epochs=doc_model.iter)
        print('Iteration:', epoch)
    doc_model.save(filename)
    print(f'Model saved to [{filename}]')
else:
    doc_model = Doc2Vec.load(filename)
    print(f'Model loaded from [{filename}]')
doc_model.docvecs.vectors_docs returns
If all of the tags you supply are plain Python ints, those ints are used as the direct-indexes into the vectors-array.
This saves the overhead of maintaining a mapping from arbitrary tags to indexes.
But, it may also cause an over-allocation of the vectors array, to be large enough for the largest int tag you provided, even if other lower ints are never used. (That is: if you provided a single document, with a tags=[1000000], it will allocate an array sufficient for tags 0 to 1000000, even if most of those never appear in your training data.)
If you want model.docvecs.doctags to collect a list of all your tags, use string tags rather than plain ints.
Separately: don't call train() multiple times in your own loop, or manage the alpha learning-rate in your own code, unless you have an overwhelmingly good reason to do so. It's inefficient and error-prone. (Your code, for example, is actually performing 200 training epochs, and if you were to increase the loop count without carefully adjusting the alpha decrement, you could wind up with nonsensical negative alpha values – a very common error in code following this bad practice.) Call train() once with your desired number of epochs, set alpha and min_alpha to reasonable starting and nearly-zero values – probably just the defaults, unless you're sure your change is helping – and then leave them alone.
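As a hedged sketch of both suggestions, reusing the names from your snippet (combined_orders_by_user_id, embeddings_dim, mp): build the corpus with string tags, then call train() exactly once and let gensim manage the learning-rate decay internally.
train_corpus = [TaggedDocument(words=row['all_orders'].split(), tags=[str(row['user_id'])])
                for _, row in combined_orders_by_user_id.iterrows()]   # string tags populate doctags
doc_model = gensim.models.Doc2Vec(vector_size=embeddings_dim, window=5,
                                  min_count=10, workers=mp.cpu_count() - 1)
doc_model.build_vocab(train_corpus)
doc_model.train(train_corpus, total_examples=doc_model.corpus_count, epochs=20)  # one call, default alpha handling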

How to understand the gensim's iter parameter and its implication on preprocessing?

from gensim.test.utils import datapath
from gensim import utils
class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""
    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        i = 1
        print(str(i))
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)
import gensim.models
sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences, iter=1)
This is the code from gensim's documentation at https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html.
I have 2 questions regarding the iter parameter:
1) When it is set to 1, why is the print(str(i)) executed twice?
2) When iter=10, simple_preprocess is executed 11 times. If my own customized preprocessing is very heavy, is this going to be very slow? How can I avoid these preprocessing repetitions when using gensim word2vec?
The gensim Word2Vec class needs to iterate through your corpus once to discover the full vocabulary, then iter times for training. So you'll see your corpus iterable used iter + 1 times.
Yes, if your preprocessing is expensive, it's wasteful to repeat it for every iteration. You can do it once, writing the results to a separate interim file where each token is separated by a space (' '). Then you've only spent the preprocessing effort once, and when you later train Word2Vec, you perform only the (very simple and cheap) splitting-by-space tokenization.
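A minimal sketch of that approach, where my_expensive_preprocess stands in for your heavy pipeline (it's a hypothetical name, not a gensim function): write one document per line with space-separated tokens, then train from the file with the cheap LineSentence reader.
from gensim.test.utils import datapath
from gensim import utils
from gensim.models.word2vec import Word2Vec, LineSentence
def my_expensive_preprocess(line):
    # placeholder for your real, costly preprocessing
    return utils.simple_preprocess(line)
with open('preprocessed.cor', 'w') as out:
    for line in open(datapath('lee_background.cor')):
        out.write(' '.join(my_expensive_preprocess(line)) + '\n')  # preprocessing happens exactly once
model = Word2Vec(sentences=LineSentence('preprocessed.cor'), iter=10)  # only cheap splitting-by-space repeats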

What is the Gensim word2vec output

I want to use gensim word2vec as input for a neural network. I have 2 questions:
1) gensim.models.Word2Vec takes a size parameter. How is this parameter used, and the size of what?
2) Once trained, what is the output of gensim word2vec? As far as I can see, these are not probability values (they are not between 0 and 1). It seems to me that for each word we get a vector of distances (cosine) between this word and some other words (but which words exactly?)
Thanks for your response.
Ans to 1 -> The size parameter is the dimensionality of the word vectors, i.e. each vector will have 100 dimensions if size=100.
Ans to 2 -> You can save the word vectors using save_word2vec_format(fname="vectors.txt", fvocab=None, binary=False). This will save a file "vectors.txt" whose first line is <size of the vocabulary> <dimensions> and whose remaining lines are of the form <word> <vector of size dimensions>.
Sample for "vectors.txt":
297820 100
the -0.18542234751 0.138813291635 0.0392148854213 0.0238721499736 -0.0443151295365 0.03226302388 -0.168626211895 -0.17397777844 -0.0547546409461 0.166621666046 0.0534506882806 0.0774947957067 -0.180520283779 -0.0938140452702 -0.0354599008902 -0.0533488133527 -0.0667684564816 -0.0210904306995 -0.103069115604 -0.138712344952 -0.035142440978 -0.125067138202 0.0514192233164 -0.142052171747 0.0795726729387 0.0310433094806 -0.00666224898992 0.047268806263 0.0339849190176 -0.181107631029 0.0477396587205 0.0483130822899 -0.090229393762 0.0224528628225 0.190814060668 -0.179506639849 0.00034066604609 0.0639057478 0.156444383949 -0.0366888977431 -0.170674385275 -0.053907152935 0.106572313582 0.0724497821903 -0.00848717936216 0.124053494271 -0.0420715605081 0.0460277422205 -0.0514693485657 0.132215091575 -0.0429308836475 -0.111784875385 -0.0543172053216 0.0849476776796 -0.015301892652 0.00992711997251 -0.00566113637219 0.00136359242972 -0.0382116842516 0.0681229985191 0.0685156463052 0.0759072640845 -0.0238136705161 0.168710450161 0.00879930186352 -0.179756801973 -0.210286559709 -0.161832152064 -0.0212640125813 -0.0115905356526 -0.0949562511822 0.126493155131 0.0215821686774 -0.164276918273 -0.0573806470616 -0.0147266125919 0.0566350339785 -0.0276969849679 0.0178970346094 0.0599163813161 0.0919867942845 0.172071394538 0.0714226787026 0.109037733251 0.00403647493576 0.044853743905 -0.0915639785243 -0.0242494817113 0.0705554654776 0.255584701079 0.001309754199 0.0872413719572 -0.0376289782286 0.158184379871 0.109245196088 -0.0727554069742 0.168820215174 0.0454895919746 0.0741726055733 -0.134467710995
...
...
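For reference, a small sketch of writing and re-loading vectors in this text format, using a toy corpus and an illustrative file name (note that in gensim 4.x the size parameter was renamed vector_size):
from gensim.models import Word2Vec, KeyedVectors
sentences = [['the', 'quick', 'brown', 'fox'], ['the', 'lazy', 'dog']]  # toy corpus
model = Word2Vec(sentences, size=100, min_count=1)                      # size = vector dimensionality
model.wv.save_word2vec_format('vectors.txt', binary=False)              # produces the format shown above
word_vectors = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
print(word_vectors['the'])                                              # the 100-dimensional vector for "the"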
Dimension Size in word2vec
Word2vec is used to create a vector space that represents words based on the trained corpus.
Each vector is a mathematical representation of a word relative to the other words in the given corpus. The dimension size is the vector length.
Performing mathematical operations on the vectors expresses relationships between words: for example, the vectors for "man" and "king" will be close, and likewise the vectors for "Paris" and "France".
If the size is too small, like two or three dimensions, the information the representation can carry is very limited.
The dimensions can be thought of as linkages between different words: words are linked to each other in different dimensions based on how they are positioned relative to each other in the corpus.
How to use the vectors
The vector by itself is not very useful; the numbers represent the position of the word in relation to all the other words in the corpus.
A vector becomes meaningful when measured against another vector.
Cosine similarity is one of the common methods to measure the similarity between different words.
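For example, assuming model is a trained Word2Vec model and the words below are in its vocabulary (they are only illustrative), a short sketch of cosine-similarity lookups:
similarity = model.wv.similarity('man', 'king')       # cosine similarity between two word vectors
neighbours = model.wv.most_similar('france', topn=5)  # the 5 nearest words by cosine similarity
print(similarity)
print(neighbours)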
Good luck

word2vec's probabilistic output

I'm new to the world of word2vec and I have just started using gensim's implementation of word2vec.
I use two naive sentences as my first document set,
[['first', 'sentence'], ['second', 'sentence']]
The vectors I get are like this:
'first', -0.07386458, -0.17405555
'second', 0.0761444 , -0.21217766
'sentence', 0.0545655 , -0.07535963
However, when I type in another toy document sets:
[['a', 'c'], ['b', 'c']]
I get the following result:
'a', 0.02936198, -0.05837455
'b', -0.05362414, -0.06813956
'c', 0.11918657, -0.10411404
Again, I'm new to word2vec, but according to my understanding,
my two document sets are structurally identical, so the results for the corresponding words should be the same.
But why am I getting different results?
Is the algorithm always giving probabilistic output, or are the document sets too small?
The function call I used is the following:
model = word2vec.Word2Vec(sentences, size=2, min_count=1, window=2)
The prime reason you are getting different vectors is the random initialisation of vectors in word2vec (there are other sources of difference, such as negative sampling and multi-threading, which can also lead to differences in vector values).
The philosophy behind word2vec is that if the number of documents (training data) >> the number of unique words (vocabulary size), the vectors for the words will stabilise after a few iterations.
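If you want to see the randomness disappear, a minimal sketch is to fix the seed and restrict training to a single worker thread (gensim's results are only fully deterministic with one thread, and you may also need to pin the PYTHONHASHSEED environment variable across separate runs):
from gensim.models import word2vec
sentences = [['a', 'c'], ['b', 'c']]
model_1 = word2vec.Word2Vec(sentences, size=2, min_count=1, window=2, seed=42, workers=1)
model_2 = word2vec.Word2Vec(sentences, size=2, min_count=1, window=2, seed=42, workers=1)
print(model_1.wv['c'], model_2.wv['c'])  # identical vectors across the two trainings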
