I'm looking for a solution to use something like most_similar() from Gensim but using Spacy.
I want to find the most similar sentence in a list of sentences using NLP.
I tried to use similarity() from Spacy (e.g. https://spacy.io/api/doc#similarity) one by one in a loop, but it takes a very long time.
To go deeper:
I would like to put all these sentences in a graph (like this) to find sentence clusters.
Any idea?
This is a simple, built-in solution you could use:
import spacy

nlp = spacy.load("en_core_web_lg")

text = (
    "Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity."
    " These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature."
    " The term semantic similarity is often confused with semantic relatedness."
    " Semantic relatedness includes any relation between two terms, while semantic similarity only includes 'is a' relations."
    " My favorite fruit is apples."
)

doc = nlp(text)

max_similarity = 0.0
most_similar = None, None
for i, sent in enumerate(doc.sents):
    for j, other in enumerate(doc.sents):
        if j <= i:
            continue  # compare each pair of sentences only once
        similarity = sent.similarity(other)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar = sent, other

print("Most similar sentences are:")
print(f"-> '{most_similar[0]}'")
print("and")
print(f"-> '{most_similar[1]}'")
print(f"with a similarity of {max_similarity}")
(text from Wikipedia)
It will yield the following output:
Most similar sentences are:
-> 'Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.'
and
-> 'These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.'
with a similarity of 0.9583859443664551
Note the following information from spacy.io:
To make them compact and fast, spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:
- python -m spacy download en_core_web_sm
+ python -m spacy download en_core_web_lg
Also see Document similarity in Spacy vs Word2Vec for advice on how to improve the similarity scores.
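If looping over similarity() for many sentences is too slow, one vectorized alternative (a rough sketch, with a made-up sentence list) is to take each sentence's averaged word vector once and compute all pairwise cosine similarities in a single numpy step, which is essentially what similarity() boils down to for vector-bearing pipelines:
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
sentences = ["First example sentence.", "Second example sentence.", "Something unrelated."]

# one averaged word vector per sentence
# (assumes every sentence has at least one in-vocabulary token)
vectors = np.array([nlp(s).vector for s in sentences])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# full pairwise cosine-similarity matrix in one step
sims = vectors @ vectors.T
np.fill_diagonal(sims, -1.0)  # exclude self-matches
i, j = np.unravel_index(sims.argmax(), sims.shape)
print(f"Most similar pair: {sentences[i]!r} and {sentences[j]!r} ({sims[i, j]:.3f})")
The resulting similarity matrix can also serve as the weighted graph for the clustering mentioned in the question.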
Related
I have a pair of words and the semantic types of those words. I am trying to compute a relatedness measure between the two words using their semantic types, for example: word1=king, type1=man, word2=queen, type2=woman.
We can use gensim's word_vectors.most_similar to get 'queen' from 'king - man + woman'. However, I am looking for the similarity measure between the vector represented by 'king - man + woman' and 'queen'.
I am looking for a solution to the above, or:
a way to calculate the vector that represents 'king - man + woman', and
a way to calculate the similarity between two vectors using the vector values in gensim, or
a way to calculate the simple mean of the projection weight vectors (i.e. king - man + woman).
You should look at the source code for the gensim most_similar() method, which is used to propose answers to such analogy questions. Specifically, when you try...
sims = wv_model.most_similar(positive=['king', 'woman'], negative=['man'])
...the top result will (in a sufficiently-trained model) often be 'queen' or similar. So, you can look at the source code to see exactly how it calculates the target combination of wv('king') - wv('man') + wv('woman'), before searching all known vectors for those closest to that target. See...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L486
...and note that the local variable mean is the combination of the positive and negative values provided.
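For instance, here is a minimal sketch of doing that combination yourself (the vector file path is hypothetical; since cosine similarity is scale-invariant, summing the unit vectors ranks the same as gensim's internal mean):
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # hypothetical path

def unit(v):
    return v / np.linalg.norm(v)

# combine positives and negatives the way most_similar() builds its target
target = unit(wv["king"]) - unit(wv["man"]) + unit(wv["woman"])

# cosine similarity between the combined vector and 'queen'
print(np.dot(unit(target), unit(wv["queen"])))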
You might also find other methods there useful, either directly or as models for your own code, such as distances()...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L934
...or n_similarity()...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L1005
I'm a very new student of doc2vec and have some questions about document vectors.
What I'm trying to get is a vector for a phrase like 'cat-like mammal'.
So, what I've tried so far is to use a pre-trained doc2vec model with the code below:
import gensim.models as g
model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase]
When I tried this code, I could get a vector for one word 'cat', but not 'cat-like mammal'.
Because word2vec only provides vectors for single words like 'cat', right? (If I'm wrong, please correct me.)
So I've searched and found infer_vector() and tried the code below
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
When I tried this code, I could get a vector, but I get a different value every time I run
phrase_vec = m.infer_vector(phrase)
because infer_vector() has a 'steps' parameter.
When I set steps=0, I always get the same vector.
phrase_vec = m.infer_vector(phrase, steps=0)
However, I also found that a document vector can be obtained by averaging the word vectors in the document.
For example, if the document is composed of the three words 'cat like mammal', you add the three vectors for 'cat', 'like', and 'mammal' and then average them, and that would be the document vector. (If I'm wrong, please correct me.)
So here are some questions.
Is using infer_vector() with 0 steps the right way to get a vector for a phrase?
If averaging word vectors is the right way to get a document vector, is there no need to use infer_vector()?
What is model.docvecs for?
Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can see also an entry about this non-determinism in Doc2Vec training or inference in the gensim FAQ.
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.
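A minimal sketch along those lines (the model path is hypothetical; recent gensim releases call the inference parameter epochs and expose the trained doc-vectors as model.dv, while older releases use steps and model.docvecs):
import gensim.models as g

m = g.Doc2Vec.load("pre-trained-doc2vec-model.bin")  # hypothetical path
tokens = "cat like mammal".lower().split()

# more inference passes -> vectors that are close (though never identical) across repeated calls
phrase_vec = m.infer_vector(tokens, epochs=100)

# look up the most similar doc-vectors learned during training
print(m.dv.most_similar([phrase_vec], topn=5))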
I have found a weighting scheme for adding word vectors which seems to work well for sentence comparison in my case:
query1 = vectorize_query("human cat interaction")
query2 = vectorize_query("people and cats talk")
query3 = vectorize_query("monks predicted frost")
query4 = vectorize_query("man found his feline in the woods")
print(1 - spatial.distance.cosine(query1, query2))
0.7154500319
print(1 - spatial.distance.cosine(query1, query3))
0.415183904078
print(1 - spatial.distance.cosine(query1, query4))
0.690741014142
When I add additional information to the sentence which acts as noise, I get a decrease:
query4 = vectorize_query("man found his feline in the dark woods while picking white mushrooms and watching unicorns")
print(1 - spatial.distance.cosine(query1, query4))
0.618269123349
Are there any ways to deal with additional information when comparing using word vectors, when I know that some subset of the text can provide a better match?
UPD: edited the code above to make it more clear.
vectorize_query in my case does so-called smooth inverse frequency weighting: word vectors from a GloVe model (it could equally be word2vec, etc.) are added with weights a/(a+w), where w should be the word frequency. I use the word's inverse tf-idf score there, i.e. w = 1/tfidf(word). The coefficient a is typically taken to be 1e-3 in this approach. Taking just the tf-idf score as the weight instead of that fraction gives an almost similar result; I also played with normalization, etc.
But I wanted to have just "vectorize sentence" in my example so as not to overload the question, as I think the problem does not depend on how I add word vectors using a weighting scheme - the issue is only that the comparison works best when sentences have approximately the same number of meaningful words.
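For concreteness, here is roughly what that weighting looks like as a minimal sketch (the word-vector lookup and the tfidf() helper are hypothetical stand-ins for whatever vectors and tf-idf scores are at hand):
import numpy as np

def vectorize_query(sentence, word_vectors, tfidf, a=1e-3):
    """Weighted average of word vectors with weights a / (a + w), where w = 1 / tfidf(word)."""
    vectors, weights = [], []
    for word in sentence.lower().split():
        if word not in word_vectors:
            continue  # skip out-of-vocabulary tokens
        w = 1.0 / tfidf(word)  # the frequency proxy described above
        vectors.append(word_vectors[word])
        weights.append(a / (a + w))
    return np.average(vectors, axis=0, weights=weights)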
I am aware of another approach, where the distance between a sentence and a text is computed using the sum or mean of minimal pairwise word distances, e.g.
"Obama speaks to the media in Illinois" <-> "The President greets the press in Chicago", where we have dist = d(Obama, President) + d(speaks, greets) + d(media, press) + d(Illinois, Chicago). But this approach does not take into account that an adjective can change the meaning of a noun significantly, etc. - which is more or less incorporated in vector models. Words like the adjectives 'good', 'bad', 'nice', etc. become noise there, as they match in the two texts and contribute zero or low distances, thus decreasing the distance between sentence and text.
I played a bit with doc2vec-style models (the gensim doc2vec implementation and skip-thoughts embeddings), but in my case (matching a short query against a much larger amount of text) I got unsatisfactory results.
If you want part-of-speech to drive similarity (e.g. you are only interested in nouns and noun phrases and want to ignore adjectives), you might want to look at sense2vec, which incorporates word classes into the model: https://explosion.ai/blog/sense2vec-with-spacy. You can then weight by word class while performing a dot product across all terms, effectively de-boosting what you consider the 'noise'.
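A possible sketch of that word-class weighting, simplified here to a plain spaCy part-of-speech filter rather than sense2vec itself: keep only nouns and proper nouns before averaging, which de-boosts every other class to zero.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

def noun_vector(text):
    # average the vectors of nouns/proper nouns only
    doc = nlp(text)
    kept = [t.vector for t in doc if t.pos_ in ("NOUN", "PROPN") and t.has_vector]
    return np.mean(kept, axis=0) if kept else doc.vector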
It's not clear that your original result (the similarity decreasing when a bunch of words are added) is 'bad' in general. A sentence that says a lot more is a very different sentence!
If that result is specifically bad for your purposes, because you need a model that captures whether a sentence says "the same and then more", you'll need to find or invent some other tricks. In particular, you might need a non-symmetric 'contains-similar' measure, so that the longer sentence is still a good match for the shorter one, but not vice versa.
Any shallow, non-grammar-sensitive embedding that's fed by word-vectors will likely have a hard time with single-word reversals-of-meaning, as for example the difference between:
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *worst* Presidents
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *best* Presidents
The words 'worst' and 'best' will already be quite-similar, as they serve the same functional role and appear in the same sorts of contexts, and may only contrast with each other a little in the full-dimensional space. And then their influence may be swamped in the influence of all the other words. Only more sophisticated analyses may highlight their role as reversing the overall import of the sentence.
While it's not yet an option in gensim, there are alternative ways to calculate the "Word Mover's Distance" that report the unmatched 'remainder' after all the easy pairwise-meaning-measuring is finished. While I don't know any prior analysis or code that'd flesh out this idea for your needs, or prove its value, I have a hunch such an analysis might help better discover cases of "says the same and more", or "says mostly the same but with reversal in a few words/aspects".
I'm classifying a corpus of 200,000 reviews into positive and negative reviews using a Naive Bayes model, and I noticed that applying TF-IDF actually reduced the accuracy (when testing on a test set of 50,000 reviews) by about 2%. So I was wondering if TF-IDF has any underlying assumptions about the data or the model it works with, i.e. any cases where accuracy is reduced by using it?
The IDF component of TF*IDF can harm your classification accuracy in some cases.
Suppose the following artificial, easy classification task, made up for the sake of illustration:
Class A: texts containing the word 'corn'
Class B: texts not containing the word 'corn'
Suppose now that in Class A, you have 100 000 examples and in class B, 1000 examples.
What will happen to TFIDF? The inverse document frequency of corn will be very low (because it is found in almost all documents), and the feature 'corn' will get a very small TFIDF, which is the weight of the feature used by the classifier. Obviously, 'corn' was THE best feature for this classification task. This is an example where TFIDF may reduce your classification accuracy. In more general terms:
when there is class imbalance: if you have many more instances in one class, the good word features of the frequent class risk having a lower IDF, so the best features of that class will get a lower weight
when you have words with high frequency that are very predictive of one of the classes (words found in most documents of that class)
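To make the 'corn' example concrete, using the plain idf = log(N / df) form (sklearn's smoothed variant differs slightly):
import math

N = 101_000        # 100,000 class-A documents + 1,000 class-B documents
df_corn = 100_000  # documents containing 'corn' (all of class A)
print(math.log(N / df_corn))  # ~0.00995: a near-zero weight for the single best feature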
You can heuristically determine whether the usage of IDF on your training data decreases your predictive accuracy by performing grid search as appropriate.
For example, if you are working in sklearn, and you want to determine whether IDF decreases the predictive accuracy of your model, you can perform a grid search on the use_idf parameter of the TfidfVectorizer.
As an example, this code would run a grid search over the use of IDF for classification with SGDClassifier:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X = ...  # your training data (a list of documents)
y = ...  # your labels

pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('sgd', SGDClassifier())])
params = {'tfidf__use_idf': (False, True)}
gridsearch = GridSearchCV(pipeline, params)
gridsearch.fit(X, y)
print(gridsearch.best_params_)
The output will be either:
{'tfidf__use_idf': False}
or
{'tfidf__use_idf': True}
TF-IDF, as far as I understand, is a feature. TF is term frequency, i.e. how often a term occurs in a document. IDF is inverse document frequency, i.e. the inverse of how many documents the term occurs in.
Here, the model uses the TF-IDF information from the training corpus to estimate labels for new documents. For a very simple example: say the word 'bad' has a pretty high term frequency in the training documents labeled negative. Then any new document containing 'bad' will be more likely to be classified negative.
For accuracy, you can manually select a training corpus that contains mostly the commonly used negative or positive words. This will boost the accuracy.
I'm trying to devise a method that will be able to classify a given number of English words into 2 sets, "rare" and "common", based on how much they are used in the language.
The number of words I would like to classify is bounded, currently at around 10,000, and includes everything from articles to proper nouns that may be borrowed from other languages (and would thus be classified as "rare"). I've done some frequency analysis within the corpus, and I have a distribution of these words (ranging from 1 use to at most about 100).
My intuition for such a system was to use word lists (such as the BNC word frequency corpus, WordNet, internal corpus frequency), and assign weights to a word's occurrence in each of them.
For instance, a word that has a mid-level frequency in the corpus (say 50), but appears in a word list W, can be regarded as common since it's one of the most frequent in the entire language. My question is: what's the best way to create a weighted score for something like this? Should I go discrete or continuous? In either case, what kind of classification system would work best for this?
Or do you recommend an alternative method?
Thanks!
EDIT:
To answer Vinko's question on the intended use of the classification -
These words are tokenized from a phrase (e.g. a book title), and the intent is to figure out a strategy to generate a search query string for the phrase when searching a text corpus. The query string can support multiple parameters such as proximity, etc., so if a word is common, these parameters can be tweaked.
To answer Igor's question -
(1) How big is your corpus?
Currently, the list is limited to 10k tokens, but this is just a training set. It could go up to a few hundred thousand once I start testing it on the test set.
(2) Do you have some kind of expected proportion of common/rare words in the corpus?
Hmm, I do not.
Assuming you have a way to evaluate the classification, you can use the "boosting" approach to machine learning. Boosting classifiers combine a set of weak classifiers into a strong classifier.
Say, you have your corpus and K external wordlists you can use.
Pick N frequency thresholds. For example, you may have 10 thresholds: 0.1%, 0.2%, ..., 1.0%.
For your corpus and each of the external word lists, create N "experts", one expert per threshold per wordlist/corpus, total of N*(K+1) experts. Each expert is a weak classifier, with a very simple rule: if the frequency of the word is higher than its threshold, they consider the word to be "common". Each expert has a weight.
The learning process is as follows: assign the weight 1 to each expert. For each word in your corpus, make the experts vote. Sum their votes: 1 * weight(i) for "common" votes and (-1) * weight(i) for "rare" votes. If the result is positive, mark the word as common.
Now, the overall idea is to evaluate the classification and increase the weight of experts that were right and decrease the weight of the experts that were wrong. Then repeat the process again and again, until your evaluation is good enough.
The specifics of the weight adjustment depends on the way how you evaluate the classification. For example, if you don't have per-word evaluation, you may still evaluate the classification as "too many common" or "too many rare" words. In the first case, promote all the pro-"rare" experts and demote all pro-"common" experts, or vice-versa.
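A minimal sketch of that expert-voting scheme; freq_sources is a hypothetical stand-in for your corpus plus the external word lists, each mapping a word to its relative frequency (the weight-update loop after evaluation is omitted):
freq_sources = {
    "corpus":   {"the": 0.05, "unicorn": 0.0001},
    "wordlist": {"the": 0.06, "unicorn": 0.00005},
}
thresholds = [i / 1000 for i in range(1, 11)]  # 0.1%, 0.2%, ..., 1.0%

# one weak expert per (source, threshold) pair, each starting with weight 1
experts = [(src, t) for src in freq_sources for t in thresholds]
weights = {e: 1.0 for e in experts}

def classify(word):
    # each expert votes +weight ("common") if the word's frequency in its source
    # exceeds its threshold, otherwise -weight ("rare")
    score = sum(
        weights[(src, t)] if freq_sources[src].get(word, 0.0) > t else -weights[(src, t)]
        for (src, t) in experts
    )
    return "common" if score > 0 else "rare"

print(classify("the"), classify("unicorn"))  # common rare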
Your distribution is most likely a Pareto distribution (a superset of Zipf's law as mentioned above). I am shocked that the most common word is used only 100 times - does this include "a" and "the" and words like that? You must have a small corpus if that is the case.
Anyways, you will have to choose a cutoff for "rare" and "common". One potential choice is the mean expected number of appearances (see the linked wiki article above to calculate the mean). Because of the "fat tail" of the distribution, a fairly small number of words will have appearances above the mean -- these are the "common". The rest are "rare". This will have the effect that many more words are rare than common. Not sure if that is what you are going for but you can just move the cutoff up and down to get your desired distribution (say, all words with > 50% of expected value are "common").
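A small sketch of that mean-based cutoff, with toy counts standing in for your corpus frequencies:
counts = {"the": 100, "of": 87, "book": 12, "unicorn": 1, "sesquipedalian": 1}  # toy data
mean_count = sum(counts.values()) / len(counts)  # 40.2 here
common = {w for w, c in counts.items() if c > mean_count}
rare = set(counts) - common
print(common, rare)  # {'the', 'of'} vs the long tail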
While this is not an answer to your question, you should know that you are reinventing the wheel here.
Information Retrieval experts have devised ways to weight search words according to their frequency. A very popular weight is TF-IDF, which uses a word's frequency in a document and its frequency in a corpus. TF-IDF is also explained here.
An alternative score is the Okapi BM25, which uses similar factors.
See also the Lucene Similarity documentation for how TF-IDF is implemented in a popular search library.
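For reference, a quick sketch of one classic tf-idf weighting (Lucene's scoring and BM25 use more elaborate variants):
import math

def tfidf(term_count_in_doc, doc_length, num_docs, docs_with_term):
    tf = term_count_in_doc / doc_length              # within-document frequency
    idf = math.log(num_docs / (1 + docs_with_term))  # corpus rarity
    return tf * idf

print(tfidf(3, 100, 10_000, 50))  # a term frequent in the doc but rare in the corpus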