spaCy/Gensim - creating similar sentences

I have around 1,000 pairs of sentences. Each pair consists of two sentences, one producing a high CTR and one a low CTR. I want to build a mechanism that automatically produces sentences optimized for high CTR. Iterating over the pairs, I can get a vector for each sentence (using spaCy). I take the difference of the vectors (sent1.vector - sent2.vector) and then average over all pairs with numpy's mean. Once I have this "difference vector" in hand, I want to add it to any given text and get a new sentence back. Any ideas how to achieve this? Gensim's most_similar only works on single words... Thanks
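A minimal sketch of the averaging step described above, assuming spaCy's en_core_web_md model (which ships with word vectors) and a made-up pairs list of (high_ctr_sentence, low_ctr_sentence) tuples:
# Minimal sketch, not a full solution: compute the mean "difference vector".
# Assumes spaCy's en_core_web_md model; `pairs` here is made-up example data.
import numpy as np
import spacy
nlp = spacy.load("en_core_web_md")
pairs = [
    ("Get your free trial today", "A trial is available"),
    ("Save 50% on your first order", "Discounts may apply to first orders"),
]
# Per-pair difference: high-CTR sentence vector minus low-CTR sentence vector.
diffs = [nlp(high).vector - nlp(low).vector for high, low in pairs]
direction = np.mean(diffs, axis=0)
# Adding the direction to a new sentence's vector gives a target point in the
# vector space, but there is no built-in way to decode a vector back into a
# sentence -- that generation step is the open part of the question.
new_vec = nlp("Try our product now").vector + direction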

Related

Doc2vec - About getting document vector

I'm very new to doc2vec and have some questions about document vectors.
What I'm trying to get is a vector for a phrase like 'cat-like mammal'.
So far, using a pre-trained doc2vec model, I have tried the code below:
import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase]
When I tried this code, I could get a vector for the single word 'cat', but not for 'cat-like mammal'.
Is that because word2vec only provides vectors for single words like 'cat'? (If I'm wrong, please correct me.)
So I searched around, found infer_vector(), and tried the code below:
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
When I tried this code I could get a vector, but I get a different value every time I run
phrase_vec = m.infer_vector(phrase)
because infer_vector has 'steps'.
When I set steps=0, I always get the same vector:
phrase_vec = m.infer_vector(phrase, steps=0)
However, I have also read that a document vector can be obtained by averaging the word vectors in the document.
For example, if the document consists of the three words 'cat-like mammal', add the three vectors for 'cat', 'like', and 'mammal' and average them; that would be the document vector. (If I'm wrong, please correct me.)
So here are my questions:
Is using infer_vector() with 0 steps the right way to get a vector for a phrase?
If averaging word vectors is the right way to get a document vector, is there no need to use infer_vector()?
What is model.docvecs for?
Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can also see an entry about this non-determinism in Doc2Vec training or inference in the gensim FAQ.
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.
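As a rough illustration of the points above (using more steps, and looking things up via model.docvecs), here is a sketch in the older gensim API used in the question, where infer_vector still accepts steps; the model path is the placeholder from the question:
# Sketch only, matching the gensim-3.x-era API from the question (newer gensim
# renames `steps` to `epochs`); the model path is a placeholder.
import gensim.models as g
m = g.Doc2Vec.load("path/pre-trained doc2vec model.bin")
words = 'cat like mammal'.lower().split()
# More steps lets the inferred vector converge; repeated calls then give
# similar (though still not identical) results.
phrase_vec = m.infer_vector(words, steps=50)
# model.docvecs holds the doc-vectors learned during training, so an inferred
# vector can be compared against them by tag or by vector.
print(m.docvecs.most_similar(positive=[phrase_vec], topn=5))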

How to rotate a word2vec onto another word2vec?

I am training multiple word2vec models with Gensim. Each model has the same parameters and dimensionality, but is trained on slightly different data. Then I want to compare how the change in data affected the vector representations of some words.
But every time I train a model, the vector representation of the same word is wildly different. Its similarities to other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both word2vec representations so that the same words occupy the same positions in vector space, or at least are as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
But, there are some other techniques you could also consider:
transform & concatenate the training corpora instead, to both retain some words that are the same across all corpora (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying-words.) A sketch of this tagging idea follows after the next option.
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)
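An illustrative sketch of the first (corpus-tagging) option, not from the original answer; the anchor-word set and the toy corpora are made up, and the vector_size parameter is named size in older gensim:
# Illustrative only: keep frequent "anchor" words shared across subcorpora,
# suffix the words of interest per subcorpus, then train one model on the
# shuffled combination. Anchor set and toy corpora here are made up.
import random
from gensim.models import Word2Vec
anchor_words = {"hot", "cold", "is", "the", "a"}
def tag(sentences, label):
    # Give non-anchor words a per-subcorpus suffix, e.g. "tamale(A)".
    return [[w if w in anchor_words else f"{w}({label})" for w in sent]
            for sent in sentences]
corpus_a = [["the", "tamale", "is", "hot"], ["skiing", "is", "cold"]]
corpus_b = [["a", "tamale", "is", "spicy"], ["skiing", "is", "fun"]]
combined = tag(corpus_a, "A") + tag(corpus_b, "B")
random.shuffle(combined)  # single, interleaved training session
model = Word2Vec(combined, vector_size=100, min_count=1)  # `size` in gensim 3.x
# "tamale(A)" and "tamale(B)" now live in one shared space and can be compared.
print(model.wv.similarity("tamale(A)", "tamale(B)"))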

How to align two GloVe models in text2vec?

Let's say I have trained two separate GloVe vector space models (using text2vec in R) based on two different corpora. There could be different reasons for doing so: the two base corpora may come from two different time periods, or two very different genres, for example. I would be interested in comparing the usage/meaning of words between these two corpora. If I simply concatenated the two corpora and their vocabularies, that would not work (the location in the vector space for word pairs with different usages would just be somewhere in the "middle").
My initial idea was to train just one model, but when preparing the texts, append a suffix (_x, _y) to each word (where x and y stand for the usage of word A in corpus x/y), as well as keep a separate copy of each corpus without the suffixes, so that the vocabulary of the final concatenated training corpus would consist of: A, A_x, A_y, B, B_x, B_y ... etc, e.g.:
this is an example of corpus X
this be corpus Y yo
this_x is_x an_x example_x of_x corpus_x X_x
this_y be_y corpus_y Y_y yo_y
I figured the "mean" usages of A and B would serve as sort of "coordinates" of the space, and I could measure the distance between A_x and A_y in the same space. But then I realized since A_x and A_y never occur in the same context (due to the suffixation of all words, including the ones around them), this would probably distort the space and not work. I also know there is something called an orthogonal procrustes problem, which relates to aligning matrices, but I wouldn't know how to implement it for my case.
What would be a reasonable way to fit two GloVe models (preferably in R and so that they work with text2vec) into a common vector space, if my final goal is to measure the cosine similarity of word pairs, which are orthographically identical, but occur in two different corpora?
I see two possible solutions:
try to initialize the second GloVe model with the solution from the first, and hope that the coordinate system won't change too much during the fit of the second model
fit two models and get word vector matrices A and B, then find the rotation matrix that minimizes the angles between corresponding rows of A and B (essentially the orthogonal Procrustes problem you mention; a rough sketch follows below)
Also check http://nlp.stanford.edu/projects/histwords/, maybe it will help with the methodology.
This also seems like a good question for https://math.stackexchange.com/
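Not part of the original answer, but a rough numpy sketch of that rotation (orthogonal Procrustes) step, assuming A and B hold vectors for the same shared words in the same row order (the question asks about R/text2vec, but the idea carries over directly):
# Rough sketch, not from the original answer: solve min_R ||A @ R - B||_F over
# orthogonal R (orthogonal Procrustes) via the SVD of A^T B. Assumes the rows
# of A and B correspond to the same shared vocabulary in the same order.
import numpy as np
def align(A, B):
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt
# Stand-in random matrices; in practice build them from the two GloVe models'
# vectors for their common vocabulary.
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 100))
B = rng.normal(size=(1000, 100))
R = align(A, B)
A_aligned = A @ R  # now comparable to B, e.g. via row-wise cosine similarity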

text classification: how many dimensions does my data have?

I am classifying text using the bag of words model. I read in 800 text files, each containing a sentence.
The sentences are then represented like this:
[{"OneWord":True,"AnotherWord":True,"AndSoOn":True},{"FirstWordNewSentence":True,"AnSoOn":True},...]
How many dimensions does my data have?
Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?
For each doc, the bag-of-words model has a set of sparse features. For example (using the first sentence in your example):
OneWord
AnotherWord
AndSoOn
The above three are the three active features for the document. The representation is sparse because we never list the inactive features explicitly AND we have a very large vocabulary (all possible unique words that you consider as features). In other words, we did not say:
OneWord
AnotherWord
AndSoOn
FirstWordNewSentence: false
We only include those words that are "true".
How many dimensions does my data have?
Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?
If you stick with the sparse feature representation, you might want to estimate the average number of active features per document instead. That number is 2.5 in your example ((3 + 2) / 2 = 2.5).
If you use a dense representation (e.g., one-hot encoding; not a good idea, though, if the vocabulary is large), the input dimension is equal to your vocabulary size.
If you use a 100-dimensional word embedding and combine all the words' embeddings to form a new input vector representing a document, your input dimension is then 100. In this case, you convert your sparse features into dense features via the embedding.
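One way to make the vocabulary-size-equals-dimension point concrete, assuming scikit-learn is available (its DictVectorizer accepts exactly the list-of-dicts format from the question):
# Sketch: scikit-learn's DictVectorizer turns the question's list-of-dicts
# format into a document-term matrix whose column count is the vocabulary size,
# i.e. the dense dimensionality of the data.
from sklearn.feature_extraction import DictVectorizer
docs = [
    {"OneWord": True, "AnotherWord": True, "AndSoOn": True},
    {"FirstWordNewSentence": True, "AnSoOn": True},
]
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(docs)
print(X.shape)               # (number of documents, vocabulary size)
print(len(vec.vocabulary_))  # vocabulary size = dense input dimension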

Algorithm for generating a 'top list' using word frequency

I have a big collection of human generated content. I want to find the words or phrases that occur most often. What is an efficient way to do this?
Don't reinvent the wheel. Use a full text search engine such as Lucene.
The simple/naive way is to use a hashtable. Walk through the words and increment the count as you go.
At the end of the process sort the key/value pairs by count.
The basic idea is simple -- in executable pseudocode:
from collections import defaultdict
def process(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d
Of course, the devil is in the details -- how do you turn the big collection into an iterator yielding words? Is it big enough that you can't process it on a single machine but rather need a mapreduce approach e.g. via hadoop? Etc, etc. NLTK can help with the linguistic aspects (isolating words in languages that don't separate them cleanly).
On a single-machine execution (net of mapreduce), one issue that can arise is that the simple idea gives you far too many singletons or thereabouts (words occurring once or just a few times), which fill memory. A probabilistic retort to that is to do two passes: one with random sampling (get only one word in ten, or one in a hundred) to make a set of words that are candidates for the top ranks, then a second pass skipping words that are not in the candidate set. Depending on how many words you're sampling and how many you want in the result, it's possible to compute an upper bound on the probability that you're going to miss an important word this way (and for reasonable numbers, and any natural language, I assure you that you'll be just fine).
Once you have your dictionary mapping words to numbers of occurrences you just need to pick the top N words by occurrences -- a heap-queue will help there, if the dictionary is just too large to sort by occurrences in its entirety (e.g. in my favorite executable pseudocode, heapq.nlargest will do it, for example).
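For example, a small sketch of that heapq.nlargest step (the counts here are made-up numbers):
# Sketch of the top-N selection: heapq.nlargest avoids sorting the whole
# dictionary when only the N most frequent words are needed.
import heapq
counts = {"the": 1500, "on": 900, "cat": 42, "sat": 37, "mat": 12}
top_3 = heapq.nlargest(3, counts.items(), key=lambda kv: kv[1])
print(top_3)  # [('the', 1500), ('on', 900), ('cat', 42)]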
Look into the Apriori algorithm. It can be used to find frequent items and/or frequent sets of items.
As the Wikipedia article states, there are more efficient algorithms that do the same thing, but this could be a good start to see if it applies to your situation.
Maybe you can try using a PATRICIA trie (Practical Algorithm To Retrieve Information Coded In Alphanumeric)?
Why not use a simple map with the word as the key and a counter as the value?
It will give the most-used words by taking the entries with the highest counts.
Building the map is just an O(N) operation.
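In Python, the standard library's collections.Counter implements exactly this map-plus-counts idea, including the top-N extraction:
# Counter is the "map of word -> count" idea, with the top-N extraction built
# in via most_common().
from collections import Counter
words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
counts = Counter(words)
print(counts.most_common(2))  # [('the', 3), ('cat', 2)]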
