In gensim's documentation, the window size is defined as:
window is the maximum distance between the current and predicted word within a sentence.
which should mean that, when looking at context, it doesn't go beyond the sentence boundary, right?
What I did was create a document with several thousand tweets, pick a word (q1), and get the most similar words to q1 (using model.most_similar('q1')). But if I randomly shuffle the tweets in the input document and then run the same experiment (without changing the word2vec parameters), I get a different set of most_similar words for q1.
I can't really understand why that happens if all it looks at is sentence-level information. Can anyone explain this?
EDIT: added model parameters and a graph
Model parameters used:
model1 = word2vec.Word2Vec(sents1, size=100, window=5, min_count=5, iter=n_iter, sg=0)
Graph:
To draw the graph, I ran word2vec with the above parameters on the original document (D) and the shuffled document (D'), took the top 10 or 20 (the two bars) most_similar('q') words for a specific query word q, and calculated the Jaccard similarity between the two sets of words for iter=1, 10, 100.
It seems that as the number of iterations increases, there is less and less overlap between the two sets of words obtained from running word2vec on D and D'.
I can't really understand why this is happening or what's going on.
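For concreteness, here is a minimal sketch of the comparison procedure (the helper names are placeholders, the corpus is assumed to be a list of tokenized tweets, and the parameters match the call above):

import random
from gensim.models import word2vec

def top_similar_set(sentences, query, topn=10, n_iter=1):
    # Train a fresh model and return the set of top-N most similar words to `query`.
    model = word2vec.Word2Vec(sentences, size=100, window=5,
                              min_count=5, iter=n_iter, sg=0)
    return {w for w, _ in model.most_similar(query, topn=topn)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def compare_orderings(sents, query, n_iters=(1, 10, 100), topn=10):
    # Jaccard overlap of most_similar(query) between the original and a shuffled corpus.
    shuffled = list(sents)
    random.shuffle(shuffled)  # shuffle tweet order only, not words within tweets
    for n_iter in n_iters:
        s1 = top_similar_set(sents, query, topn, n_iter)
        s2 = top_similar_set(shuffled, query, topn, n_iter)
        print(n_iter, jaccard(s1, s2))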
I don't understand how word vectors are involved at all in the training process with gensim's doc2vec in DBOW mode (dm=0). I know that it's disabled by default with dbow_words=0. But what happens when we set dbow_words to 1?
In my understanding of DBOW, the context words are predicted directly from the paragraph vectors. So the only parameters of the model are the N p-dimensional paragraph vectors plus the parameters of the classifier.
But multiple sources hint that it is possible in DBOW mode to co-train word and doc vectors. For instance:
section 5 of An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
this SO answer: How to use Gensim doc2vec with pre-trained word vectors?
So, how is this done? Any clarification would be much appreciated!
Note: for DM, the paragraph vectors are averaged/concatenated with the word vectors to predict the target words. In that case, it's clear that word vectors are trained simultaneously with document vectors. And there are N*p + M*q + classifier parameters (where M is the vocabulary size and q the word-vector dimensionality).
If you set dbow_words=1, then skip-gram word-vector training is added to the training loop, interleaved with the normal PV-DBOW training.
So, for a given target word in a text, first the candidate doc-vector is used (alone) to try to predict that word, with backpropagation adjustments then made to the model & doc-vector. Then, a bunch of the surrounding words are each used, one at a time in skip-gram fashion, to try to predict that same target word, with the follow-up adjustments made.
Then, the next target word in the text gets the same PV-DBOW plus skip-gram treatment, and so on, and so on.
As some logical consequences of this:
training takes longer than plain PV-DBOW - by about a factor equal to the window parameter
word-vectors overall wind up getting more total training attention than doc-vectors, again by a factor equal to the window parameter
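As a quick illustration (a sketch only; the corpus variable is a placeholder and parameter names follow recent gensim releases), the interleaved skip-gram training is just a constructor flag:

from gensim.models.doc2vec import Doc2Vec

# docs: an iterable of TaggedDocument objects (placeholder name)
def build_models(docs):
    # Plain PV-DBOW: only doc-vectors are trained; word-vectors stay at their random init.
    plain = Doc2Vec(docs, dm=0, dbow_words=0, vector_size=100, window=5, min_count=5)
    # PV-DBOW plus interleaved skip-gram: word-vectors and doc-vectors share one model,
    # at the cost of roughly `window`-times more training work.
    mixed = Doc2Vec(docs, dm=0, dbow_words=1, vector_size=100, window=5, min_count=5)
    return plain, mixed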
I'm a very new student of doc2vec and have some questions about document vector.
What I'm trying to get is a vector for a phrase like 'cat-like mammal'.
So far, using a pre-trained doc2vec model, I've tried the code below:
import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase]
When I tried this code, I could get a vector for one word 'cat', but not 'cat-like mammal'.
Because word2vec only provides vectors for single words like 'cat', right? (If I'm wrong, please correct me.)
So I searched and found infer_vector(), and tried the code below:
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
When I tried this code, I could get a vector, but I get a different value every time I run
phrase_vec = m.infer_vector(phrase)
Because infer_vector has a 'steps' parameter.
When I set steps=0, I always get the same vector.
phrase_vec = m.infer_vector(phrase, steps=0)
However, I also found that a document vector can be obtained by averaging the vectors of the words in the document.
For example, if the document is composed of the three words 'cat like mammal', you add the vectors for 'cat', 'like', and 'mammal' and average them, and that would be the document vector. (If I'm wrong, please correct me.)
So here are some questions.
Is using infer_vector() with 0 steps the right way to get a vector for a phrase?
If averaging word vectors is the right way to get a document vector, is there no need to use infer_vector()?
What is model.docvecs for?
Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can see also an entry about this non-determinism in Doc2Vec training or inference in the gensim FAQ.
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.
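Putting that together, a minimal sketch of the intended usage (the model path is the placeholder from the question; the steps value is chosen arbitrarily):

import gensim.models as g

m = g.Doc2Vec.load("path/pre-trained doc2vec model.bin")  # placeholder path

tokens = "cat like mammal".lower().split()

# Repeated inference gives slightly different, but similar, vectors;
# a larger steps value usually makes them more stable.
v1 = m.infer_vector(tokens, steps=50)
v2 = m.infer_vector(tokens, steps=50)

# Compare the inferred vector against the doc-vectors learned during training.
print(m.docvecs.most_similar([v1], topn=10))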
I am training multiple word2vec models with Gensim. Each of the word2vec will have the same parameter and dimension, but trained with slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. The similarities among words remain about the same, but the whole vector space seems to be rotated.
Is there any way I can rotate both word2vec representations so that the same words occupy the same positions in vector space, or are at least as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
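If it helps to see the underlying idea rather than the TranslationMatrix API itself, here is a rough numpy sketch of the learn-a-projection step via orthogonal Procrustes (the anchor-word list and model variables are placeholders):

import numpy as np

def learn_alignment(source_model, target_model, anchor_words):
    # Stack the anchor words' vectors from each space.
    S = np.vstack([source_model.wv[w] for w in anchor_words])
    T = np.vstack([target_model.wv[w] for w in anchor_words])
    # Orthogonal Procrustes: rotation R minimizing ||S @ R - T|| in the Frobenius norm.
    U, _, Vt = np.linalg.svd(S.T @ T)
    return U @ Vt

# Usage: R = learn_alignment(model_a, model_b, shared_words)
#        aligned = model_a.wv["tamale"] @ R   # now comparable to model_b's space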
But, there are some other techniques you could also consider:
transform & concatenate the training corpora instead, to both retain some words that are the same across all corpora (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying words.) A rough sketch of this renaming trick appears below, after the second option.
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)
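Here is the promised rough sketch of the word-renaming trick from the first option (the anchor list and tag scheme are purely illustrative):

ANCHOR_WORDS = {"hot", "cold", "the", "a", "and"}  # frequent words left unchanged

def tag_corpus(sentences, tag):
    # Append a per-subcorpus tag to every non-anchor word.
    return [[w if w in ANCHOR_WORDS else "%s(%s)" % (w, tag) for w in sent]
            for sent in sentences]

# combined = tag_corpus(corpus_a, "A") + tag_corpus(corpus_b, "B")
# Shuffle `combined`, train a single Word2Vec on it, then compare e.g.
# model.wv.similarity("tamale(A)", "tamale(B)").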
Suppose you have two users with sets of attributes like so:
userA = {"happy", "excited"}
userB = {"sad", "anxious"}
Now, if we were to compute the Jaccard similarity of these, it would be 0. However, we want to capture that "excited" is pretty similar to "anxious".
My question is, how can this be structured?
Would I define another set of words that are synonyms of "excited"? How would I then factor this into the Jaccard index computation?
I suggest making clusters of synonyms using some sort of thesaurus. Each word would belong to at most one cluster.
For every cluster, choose a "canonical" representative.
Now when you have to compute Jaccard similarity, substitute every word with the representative from its cluster. Then proceed as usual.
Example clusters (the representative is listed first in each):
1. Good, great, excellent, positive, valuable
2. Bad, poor, sad, awful
Say you want to compute similarity of two users:
userA = {"positive"}
userB = {"good"}
Then you convert them to
userA' = {"good"} (because "good" is the representative for cluster, which "positive" belongs to)
userB' = {"good"}
Similarity = 1 / 1 = 1.
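In code, the substitution step is tiny (cluster contents are just the examples above):

# Map every word to its cluster's canonical representative before computing Jaccard.
REPRESENTATIVE = {
    "good": "good", "great": "good", "excellent": "good",
    "positive": "good", "valuable": "good",
    "bad": "bad", "poor": "bad", "sad": "bad", "awful": "bad",
}

def canonical(words):
    # Words outside any cluster stand for themselves.
    return {REPRESENTATIVE.get(w, w) for w in words}

def jaccard(a, b):
    a, b = canonical(a), canonical(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(jaccard({"positive"}, {"good"}))  # 1.0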
You can't do this with raw words, since they can be ambiguous, but if you were able to derive what WordNet calls "word senses", you could map from those to synsets, which group together all the synonyms whose word senses match.
See, for example, this Python NLTK example for Word Sense Disambiguation: http://www.nltk.org/howto/wsd.html
Clustering on the synset ID would give the result you want (assuming that anxious and excited actually have at least one synonymous word sense in the database you're using for disambiguation).
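For example, a rough sketch with NLTK's Lesk-based disambiguation (requires the WordNet corpus; how well this works depends entirely on whether the senses actually line up):

from nltk.wsd import lesk  # needs nltk.download('wordnet') beforehand

def synset_ids(words):
    # Map each word to a WordNet synset name, using the word set itself as context;
    # fall back to the raw word when no sense is found.
    ids = set()
    for w in words:
        sense = lesk(list(words), w)  # may return None
        ids.add(sense.name() if sense else w)
    return ids

userA = {"happy", "excited"}
userB = {"sad", "anxious"}
a, b = synset_ids(userA), synset_ids(userB)
print(len(a & b) / len(a | b))  # Jaccard over synset IDs instead of raw words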
I'm writing a crawler to get content from some websites, but the content can be duplicated, and I want to avoid that. So I need a function that returns the percentage of sameness between two texts, to detect content that may be duplicated. Example:
Text 1:"I'm writing a crawler to"
Text 2:"I'm writing a some text crawler to get"
The compare function should report that text 2 is the same as text 1 with a score of 5/8 (where 5 is the number of words in text 2 that also appear in text 1, compared in word order, and 8 is the total number of words in text 2). If "some text" were removed, then text 2 would be the same as text 1 (I need to detect that situation as well). How can I do that?
You are facing a problem which is known in the field of Information Retrieval as Near Duplicates Detection.
One of the known solutions to it is to use Jaccard-Similarity for getting the difference between two documents.
Jaccard similarity is basically: take the set of words from each document, call these sets s1 and s2; the Jaccard similarity is |s1 ∩ s2| / |s1 ∪ s2|.
Usually when facing near duplicates, however, the order of words has some importance. To deal with it, when generating the sets s1 and s2 you actually generate sets of k-shingles instead of sets of single words.
In your example, with k=2, the sets will be:
s1 = { I'm writing, writing a, a crawler, crawler to }
s2 = { I'm writing, writing a, a some, some text, text crawler, crawler to, to get }
s1 ∪ s2 = { I'm writing, writing a, a crawler, crawler to, a some, some text, text crawler, to get }
s1 ∩ s2 = { I'm writing, writing a, crawler to }
In the above, the Jaccard similarity will be 3/8. If you use single words with the same approach (k=1 shingles), you will get your desired 5/8 - but this is a worse solution in my (and most IR experts') opinion.
This procedure scales nicely to deal very efficiently with huge collections, without checking all pairs or creating huge numbers of sets. More details can be found in these lecture notes (I gave this lecture a few months ago, based on the author's notes).
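A compact sketch of the shingling computation (k=2, as in the example above):

def shingles(text, k=2):
    # Return the set of k-word shingles of a text.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

s1 = shingles("I'm writing a crawler to")
s2 = shingles("I'm writing a some text crawler to get")
print(jaccard(s1, s2))  # 3/8 = 0.375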
A good approach for comparing two texts is tf-idf. It gives a similarity score between two documents.
1. Calculate tf-idf vectors for the documents.
2. Calculate the cosine similarity between the two given texts.
3. The cosine similarity indicates how closely the two documents match.
This is a very good tutorial for calculating tf-idf and cosine similarity in Java. It would be simple to extend it to C#.
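If Python is an option, the same pipeline is only a few lines with scikit-learn (a sketch, not the linked tutorial's code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["I'm writing a crawler to",
        "I'm writing a some text crawler to get"]

tfidf = TfidfVectorizer().fit_transform(docs)        # step 1: tf-idf vectors
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]  # step 2: cosine similarity
print(score)                                         # step 3: higher means closer match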
In bioinformatics there is an algorithm that should do the job. It is called Needleman-Wunsch and is normally used for global sequence alignment of nucleotide sequences.
Using this algorithm you can easily calculate the degree of agreement between two strings. You can use my code, but this method only returns the alignment; you would have to calculate the degree of agreement yourself.
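For reference, here is a bare-bones, word-level Needleman-Wunsch scoring sketch (the match/mismatch/gap scores are arbitrary; it returns only the alignment score, so turning that into a percentage of agreement is still up to you):

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    # Global alignment score between two token sequences, by dynamic programming.
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # aligning a prefix of `a` against nothing
    for j in range(1, m + 1):
        score[0][j] = j * gap          # aligning a prefix of `b` against nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

print(needleman_wunsch("I'm writing a crawler to".split(),
                       "I'm writing a some text crawler to get".split()))
# 5 matches and 3 gaps -> 5*1 + 3*(-1) = 2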