How do I compare phrases for similarity? - algorithm

When entering a question, Stack Overflow presents you with a list of questions that it thinks are likely to cover the same topic. I have seen similar features on other sites and in other programs, too (help file systems, for example), but I've never programmed something like this myself. Now I'm curious to know what sort of algorithm one would use for that.
The first approach that comes to my mind is splitting the phrase into words and looking for phrases containing these words. Before you do that, you probably want to throw away insignificant words (like 'the', 'a', 'does' etc.), and then you will want to rank the results.
Hey, wait - let's do that for web pages, and then we can have a ... watchamacallit ... - a "search engine", and then we can sell ads, and then ...
No, seriously, what are the common ways to solve this problem?

One approach is the so-called bag-of-words model.
As you guessed, first you count how many times each word appears in the text (usually called a document in NLP lingo). Then you throw out the so-called stop words, such as "the", "a", "or" and so on.
You're left with words and word counts. Do this for a while and you get a comprehensive set of words that appear in your documents. You can then create an index for these words:
"aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.
Now you can take your word bags and turn them into vectors. For example, if your document contains two references to aardvarks and nothing else, it would look like this:
[2 0 0 ... 70k zeroes ... 0].
After this you can compute the "angle" between two vectors from their dot product (this is cosine similarity). The smaller the angle, the closer the documents are.
This is a simple version and there are other, more advanced techniques. May the Wikipedia be with you.
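For illustration, a minimal bag-of-words/cosine sketch in plain Python; the stop-word list and tokenization here are simplified placeholders:
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "or", "and", "of", "to", "in", "does"}

def bag_of_words(document):
    """Count the non-stop words in a document."""
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two word-count vectors; 1.0 means identical direction."""
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity(bag_of_words("how to compare phrases for similarity"),
                        bag_of_words("comparing two phrases to measure similarity")))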

@Hanno, you should try the Levenshtein distance algorithm. Given an input string s and a list of strings t, iterate over each string u in t and return the one with the minimum Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance
See a Java implementation example in http://www.javalobby.org/java/forums/t15908.html
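A short sketch of that approach in Python (to match the other snippets in this thread), using the standard dynamic-programming formulation of Levenshtein distance:
def levenshtein(a, b):
    """Minimum number of single-character edits (insert, delete, substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def closest(s, candidates):
    """Return the candidate string with the smallest Levenshtein distance to s."""
    return min(candidates, key=lambda u: levenshtein(s, u))

print(closest("similarty", ["similarity", "singularity", "simulation"]))  # similarity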

To augment the bag-of-words idea:
There are a few ways you can also pay some attention to n-grams, strings of two or more words kept in order. You might want to do this because a search for "space complexity" is much more than a search for things with "space" AND "complexity" in them, since the meaning of this phrase is more than the sum of its parts; that is, if you get a result that talks about the complexity of outer space and the universe, this is probably not what the search for "space complexity" really meant.
A key idea from natural language processing here is that of mutual information, which allows you (algorithmically) to judge whether or not a phrase is really a specific phrase (such as "space complexity") or just words which are coincidentally adjacent. Mathematically, the main idea is to ask, probabilistically, if these words appear next to each other more often than you would guess by their frequencies alone. If you see a phrase with a high mutual information score in your search query (or while indexing), you can get better results by trying to keep these words in sequence.
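For illustration, a tiny pointwise mutual information sketch over unigram and bigram counts; the one-line corpus here is just a placeholder:
import math
from collections import Counter

corpus = "space complexity matters more than outer space in complexity theory".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total_unigrams = sum(unigrams.values())
total_bigrams = sum(bigrams.values())

def pmi(w1, w2):
    """log p(w1, w2) / (p(w1) * p(w2)); high values suggest a real phrase rather than coincidence."""
    p_joint = bigrams[(w1, w2)] / total_bigrams
    p1 = unigrams[w1] / total_unigrams
    p2 = unigrams[w2] / total_unigrams
    return math.log(p_joint / (p1 * p2)) if p_joint else float("-inf")

print(pmi("space", "complexity"))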

From my (rather small) experience developing full-text search engines: I would look up questions which contain some words from the query (in your case, the query is your question).
Sure, noise words should be ignored, and we might want to check the query for 'strong' words like 'ASP.Net' to narrow down the search scope.
Inverted indexes (http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices) are commonly used to find questions containing the words we are interested in.
After finding questions with words from the query, we might want to calculate the distance between the words of interest within each question, so that a question containing the text 'phrases similarity' ranks higher than one containing 'discussing similarity, you hear the following phrases...'.
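A toy inverted-index sketch along those lines; the stored questions and the tokenization are simplified placeholders:
from collections import defaultdict

questions = [
    "How do I compare phrases for similarity?",
    "Discussing similarity, you hear the following phrases...",
    "How do I parse XML in ASP.Net?",
]

# Build the inverted index: word -> set of question ids containing it
index = defaultdict(set)
for qid, text in enumerate(questions):
    for word in text.lower().split():
        index[word.strip("?.,!")].add(qid)

def search(query):
    """Rank questions by how many query words they contain."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for qid in index.get(word, ()):
            scores[qid] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("phrases similarity"))  # question ids ordered by word overlap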

Here is a bag-of-words solution with TfidfVectorizer in Python 3 (train_x is assumed to be a list of raw text strings and train_y the matching labels):
# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
s = set(stopwords.words('english'))

# Remove stop words from each training sentence before vectorizing
train_x_cleaned = []
for i in train_x:
    sentence = filter(lambda w: w not in s, i.split())
    train_x_cleaned.append(' '.join(sentence))

vectorizer = TfidfVectorizer(binary=True)
train_x_vectors = vectorizer.fit_transform(train_x_cleaned)
print(vectorizer.get_feature_names_out())
print(train_x_vectors.toarray())

from sklearn import svm
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

test_x = vectorizer.transform(["test phrase 1", "test phrase 2", "test phrase 3"])
print(type(test_x))
clf_svm.predict(test_x)
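As a follow-up sketch (not part of the original answer): since the thread is about similarity rather than classification, the same TF-IDF vectors can also be compared directly with cosine similarity instead of training a classifier:
from sklearn.metrics.pairwise import cosine_similarity

# Compare each test phrase against every training phrase
similarities = cosine_similarity(test_x, train_x_vectors)
# Index of the most similar training phrase for each test phrase
best_match = similarities.argmax(axis=1)
print(best_match)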

Related

Is it possible to search for part of the text using word embeddings?

I have found a weighting scheme for adding word vectors which seems to work for sentence comparison in my case:
from scipy import spatial  # cosine distance

# vectorize_query is my own weighting function, described below
query1 = vectorize_query("human cat interaction")
query2 = vectorize_query("people and cats talk")
query3 = vectorize_query("monks predicted frost")
query4 = vectorize_query("man found his feline in the woods")

print(1 - spatial.distance.cosine(query1, query2))  # 0.7154500319
print(1 - spatial.distance.cosine(query1, query3))  # 0.415183904078
print(1 - spatial.distance.cosine(query1, query4))  # 0.690741014142
When I add extra information to the sentence, which acts as noise, the similarity decreases:
query4 = vectorize_query("man found his feline in the dark woods while picking white mushrooms and watching unicorns")
print(1 - spatial.distance.cosine(query1, query4))  # 0.618269123349
Are there any ways to deal with this additional information when comparing using word vectors, when I know that some subset of the text can provide a better match?
UPD: edited the code above to make it clearer.
vectorize_query in my case does so-called smooth inverse frequency weighting: word vectors from a GloVe model (it could be word2vec as well, etc.) are summed with weights a/(a+w), where w should be the word frequency. I use the word's inverse tf-idf score there, i.e. w = 1/tfidf(word). The coefficient a is typically taken as 1e-3 in this approach. Taking just the tf-idf score as the weight instead of that fraction gives an almost similar result; I also played with normalization, etc.
But I wanted to have just "vectorize sentence" in my example so as not to overload the question, as I think it does not depend on how I add word vectors using a weighting scheme; the problem is only that the comparison works best when sentences have approximately the same number of meaningful words.
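A minimal sketch of that kind of weighted averaging (not the asker's exact code); word_vectors and word_freq are hypothetical dicts mapping words to embedding vectors and relative frequencies:
import numpy as np

def vectorize_query_sif(sentence, word_vectors, word_freq, a=1e-3):
    """Sum word vectors with smooth-inverse-frequency weights a / (a + freq)."""
    vectors = []
    for word in sentence.lower().split():
        if word in word_vectors:
            weight = a / (a + word_freq.get(word, 0.0))
            vectors.append(weight * word_vectors[word])
    if not vectors:
        return None
    return np.sum(vectors, axis=0)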
I am aware of another approach where the distance between a sentence and a text is computed using the sum or mean of minimal pairwise word distances, e.g.
"Obama speaks to the media in Illinois" <-> "The President greets the press in Chicago", where we have dist = d(Obama, President) + d(speaks, greets) + d(media, press) + d(Chicago, Illinois). But this approach does not take into account that an adjective can change the meaning of a noun significantly, etc., which is more or less incorporated in vector models. Adjectives like 'good', 'bad', 'nice', etc. become noise there, as they match in the two texts and contribute zero or low distances, thus decreasing the distance between the sentence and the text.
I played a bit with doc2vec models (it seems it was the gensim doc2vec implementation) and skip-thoughts embeddings, but in my case (matching a short query against a much larger amount of text) I had unsatisfactory results.
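For reference, a rough sketch of the sum-of-minimal-pairwise-word-distances idea mentioned a couple of paragraphs above; word_vectors is again a hypothetical dict of word embeddings:
import numpy as np

def min_pairwise_distance(query_words, text_words, word_vectors):
    """For each query word, add the distance to its nearest text word in embedding space."""
    total = 0.0
    for q in query_words:
        if q not in word_vectors:
            continue
        distances = [np.linalg.norm(word_vectors[q] - word_vectors[t])
                     for t in text_words if t in word_vectors]
        if distances:
            total += min(distances)
    return total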
If you want part of speech to factor into similarity (e.g. you're only interested in nouns and noun phrases and want to ignore adjectives), you might want to look at sense2vec, which incorporates word classes into the model: https://explosion.ai/blog/sense2vec-with-spacy ...after which you can weight by word class while performing a dot product across all terms, effectively deboosting what you consider the 'noise'.
It's not clear that your original result (the similarity decreasing when a bunch of words are added) is 'bad' in general. A sentence that says a lot more is a very different sentence!
If that result is specifically bad for your purposes, because you need a model that captures whether a sentence says "the same and then more", you'll need to find or invent some other tricks. In particular, you might need a non-symmetric 'contains-similar' measure, so that the longer sentence is still a good match for the shorter one, but not vice versa.
Any shallow, non-grammar-sensitive embedding that's fed by word-vectors will likely have a hard time with single-word reversals-of-meaning, as for example the difference between:
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *worst* Presidents
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *best* Presidents
The words 'worst' and 'best' will already be quite similar, as they serve the same functional role and appear in the same sorts of contexts, and may only contrast with each other a little in the full-dimensional space. And then their influence may be swamped by the influence of all the other words. Only more sophisticated analyses may highlight their role in reversing the overall import of the sentence.
While it's not yet an option in gensim, there are alternative ways to calculate the "Word Mover's Distance" that report the unmatched 'remainder' after all the easy pairwise meaning-measuring is finished. While I don't know of any prior analysis or code that would flesh out this idea for your needs, or prove its value, I have a hunch such an analysis might help better discover cases of "says the same and more", or "says mostly the same but with a reversal in a few words/aspects".

Algorithm to reform a sentence from a sentence whose spaces are removed and whose words' letters are reordered?

I was looking around some puzzles online to improve my knowledge on algorithms...
I came upon below question:
"You have a sentence with several words with spaces remove and words having their character order shuffled. You have a dictionary. Write an algorithm to produce the sentence back with spaces and words with normal character order."
I do not know what a good way to solve this is.
I am new to algorithms, but just looking at the problem I think I would make the program do what an intelligent mind would do.
Here is something I can think of:
-First, manually find common short English words from the dictionary, like "is", "the", "if" etc., and put them in dataset-1.
-Then find the permutations of the words in dataset-1 (e.g. "si", "eht" or "eth" or "fi") and put them in dataset-2.
-Then find which character sequences in the input sentence match the words of dataset-2, put them in dataset-3, and insert spaces in the input sentence where they were found.
-For the rest of the words I would compute permutations to find matching words in the dictionary.
I am a newbie to algorithms... is it a bad solution?
This seems like a perfectly fine solution.
In general there are 2 parameters for judging an algorithm:
correctness - does the algorithm provide the correct answer?
resources - the time or storage needed to provide an answer.
Usually there is a tradeoff between these two parameters.
So, for example, the size of your dictionary dictates which scrambled sentences you can reconstruct: a larger dictionary gives you a correct answer for more inputs, but the whole searching process takes longer and requires more storage.
The hard part of the problem you presented is the fact that you need to compute permutations, and there are a LOT of them.
Checking them all is expensive, so a good approach would be to do what you suggested: create a small subset of commonly used words and check them first; that way the average case is better.
Note: just saying that you check the permutations/search is OK, but in the end you would need to specify the exact way of doing that.
Currently what you wrote is an idea for an algorithm, but it would not allow you to take a given input and mechanically work out the output.
Actually, it might be wise to start by partitioning the dictionary by word length.
Then try to find the largest words that can be made using the letters available, instead of finding the smallest ones. Short words are more common and thus will be harder to narrow down (for example: is it really "if" or "fig"?).
Then for each word length w, you can proceed w characters at a time.
There are still a lot of possible combinations, though; simply because you found a valid word doesn't mean it's the right word. You have to go through all the substrings, of which there should be something like O(c^4*d), where d is the number of words in the dictionary and c is the number of characters in the sentence. Practically speaking, if the dictionary is sorted by word length, it'll be a fair bit less than that. Then you have to take the valid words and figure out an ordering that works, so that all characters are used. There might be multiple solutions.
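A minimal sketch of the overall idea, assuming the dictionary fits in memory: group dictionary words by their sorted-letter signature, then recursively split the scrambled input (the three-word dictionary is just a placeholder):
from collections import defaultdict
from functools import lru_cache

def build_signature_index(dictionary):
    """Map sorted-letter signatures to the dictionary words that share them."""
    index = defaultdict(list)
    for word in dictionary:
        index[''.join(sorted(word))].append(word)
    return index

def unscramble(text, index, max_word_len=20):
    """Return one way to split `text` into dictionary words (letters within words may be shuffled)."""
    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return []
        for end in range(start + 1, min(len(text), start + max_word_len) + 1):
            signature = ''.join(sorted(text[start:end]))
            if signature in index:
                rest = solve(end)
                if rest is not None:
                    return [index[signature][0]] + rest
        return None
    return solve(0)

index = build_signature_index(["the", "cat", "sat"])
print(unscramble("ehtactast", index))  # ['the', 'cat', 'sat'] (one possible reconstruction)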

How to find similarity in texts

I have a database where users upload articles.
I would like to make an algorithm where my web app will suggest similar texts according to the one the user reads.
I saw some examples like Levenshtein distance. But those algorithms measure distance between strings, not whole articles. Is there a way to extract the most significant keywords from a text? Surely, I understand that "most significant" is an ambiguous term.
How do other sites manage this?
thanks a lot
Is there a way to extract the most significant keywords from a text?
Yes. Basically, you extract all the words from the text, sort the words by frequency, eliminate the common words (a, an, the, etc.) by matching them against a common word dictionary, and save the top 20 or more words, along with their frequency, from each article.
The number of top words you save is related to both the length of the article and the subject matter of all the articles. Fewer words work for general-interest articles, while more words are necessary for special-interest articles, like answers to programming questions.
Articles that match more than half of the top words could be considered related. The degree of relatedness would depend on the number of matching top words and the frequencies of the matching words.
You could calculate a relatedness score by multiplying the frequencies of each matched word from the two articles and summing all the products. The higher the score, the more the articles are related.
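A rough sketch of the scoring scheme just described, with hypothetical helper names and a trimmed stop-word list:
from collections import Counter
import re

STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in"}  # trimmed example list

def top_words(text, n=20):
    """Return the n most frequent non-stop words of an article with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return dict(counts.most_common(n))

def relatedness(article_a, article_b, n=20):
    """Sum of products of frequencies of the top words the two articles share."""
    a, b = top_words(article_a, n), top_words(article_b, n)
    return sum(a[w] * b[w] for w in a.keys() & b.keys())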
You might try to correct the 'weight' of each word by the frequency with which it appears across all the articles. Then the best indicators of similarity would be the words that appear only in the two compared articles and nowhere else. This would automatically disregard the common words (a, an, the, etc.) mentioned by @Gilbert Le Blanc.

Algorithm wanted: Find all words of a dictionary that are similar to words in a free text

We have a list of about 150,000 words, and when the user enters a free text, the system should present a list of words from the dictionary, that are very close to words in the free text.
For instance, the user enters: "I would like to buy legoe toys in Walmart". If the dictionary contains "Lego", "Car" and "Walmart", the system should present "Lego" and "Walmart" in the list. "Walmart" is obvious because it is identical to a word in the sentence, but "Lego" is similar enough to "Legoe" to be mentioned, too. However, nothing is similar to "Car", so that word is not shown.
Showing the list should happen in real time, meaning that when the user has entered the sentence, the list of words must be present on the screen. Does anybody know a good algorithm for this?
The dictionary actually contains concepts which may include a space. For instance, "Lego spaceship". The perfect solution recognizes these multi-word concepts, too.
Any suggestions are appreciated.
Take a look at http://norvig.com/spell-correct.html for a simple algorithm. The article uses Python, but there are links to implementations in other languages at the end.
You will be doing quite a few lookups of words against a fixed dictionary. Therefore you need to prepare your dictionary. Logically, you can quickly eliminate candidates that are "just too different".
For instance, the words car and dissimilar may share a suffix, but they're obviously not misspellings of each other. Now why is that so obvious to us humans? For starters, the length is entirely different. That's an immediate disqualification (but with one exception - below). So, your dictionary should be sorted by word length. Match your input word with words of similar length. For short words that means +/- 1 character; longer words should have a higher margin (exactly how well can your demographic spell?)
Once you've restricted yourself to candidate words of similar length, you'd want to strip out words that are entirely dissimilar. By this I mean that they use entirely different letters. This is easiest to compare if you sort the letters in a word alphabetically. E.g. car becomes "acr"; rack becomes "ackr". You'll do this in preprocessing for your dictionary, and for each input word. The reason is that it's cheap to determine the (size of the) difference of two sorted sets. (Add a comment if you need an explanation.) car and rack have a difference of size 1, car and hat have a difference of size 2. This narrows down your set of candidates even further. Note that for longer words, you can bail out early when you've found too many differences. E.g. dissimilar and biography have a total difference of 13, but considering the length (8/9) you can probably bail out once you've found 5 differences.
This leaves you with a set of candidate words that use almost the same letters, and also are almost the same length. At this point you can start using more refined algorithms; you don't need to run 150.000 comparisons per input word anymore.
Now, for the length exception mentioned before: the problem is in "words" like greencar. It doesn't really match a word of length 8, and yet for humans it's quite obvious what was meant. In this case, you can't really break the input word at every possible boundary and run an additional N-1 inexact matches against both halves. However, it is feasible to check for just a missing space. Just do a lookup for all possible prefixes. This is efficient because you'll be using the same part of the dictionary over and over, e.g. g, gr, gre, gree, etc. For every prefix that you find, check if the remaining suffix is also in the dictionary, e.g. reencar, eencar. If both halves of the input word are in the dictionary, but the word itself isn't, you can assume a missing space.
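A small sketch of the missing-space check, assuming the dictionary is a Python set (the example dictionary is hypothetical):
def missing_space_candidates(word, dictionary):
    """If `word` is two dictionary words with the space dropped, return the possible splits."""
    splits = []
    for i in range(1, len(word)):
        prefix, suffix = word[:i], word[i:]
        if prefix in dictionary and suffix in dictionary:
            splits.append((prefix, suffix))
    return splits

print(missing_space_candidates("greencar", {"green", "car", "lego"}))  # [('green', 'car')]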
You would likely want to use an algorithm which calculates the Levenshtein distance.
However, since your data set is quite large, and you'll be comparing lots of words against it, a direct implementation of typical algorithms that do this won't be practical.
In order to find words in a reasonable amount of time, you will have to index your set of words in some way that facilitates fuzzy string matching.
One of these indexing methods would be to use a suffix tree. Another approach would be to use n-grams.
I would lean towards using a suffix tree since I find it easier to wrap my head around it and I find it more suited to the problem.
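For illustration (not from the answer itself), a tiny character-trigram index sketch: index dictionary entries by their trigrams, and then only words that share enough trigrams with the query word need a full edit-distance comparison:
from collections import defaultdict

def trigrams(word):
    padded = f"  {word} "  # pad so prefixes/suffixes also form trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_trigram_index(dictionary):
    index = defaultdict(set)
    for word in dictionary:
        for gram in trigrams(word):
            index[gram].add(word)
    return index

def candidates(query, index, min_shared=2):
    """Dictionary words sharing at least `min_shared` trigrams with the query."""
    counts = defaultdict(int)
    for gram in trigrams(query):
        for word in index[gram]:
            counts[word] += 1
    return [w for w, c in counts.items() if c >= min_shared]

index = build_trigram_index({"lego", "walmart", "car"})
print(candidates("legoe", index))  # ['lego']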
It might be of interest to look at some algorithms such as the Levenshtein distance, which can calculate the amount of difference between 2 strings.
I'm not sure what language you are thinking of using but PHP has a function called levenshtein that performs this calculation and returns the distance. There's also a function called similar_text that does a similar thing. There's a code example here for the levenshtein function that checks a word against a dictionary of possible words and returns the closest words.
I hope this gives you a bit of insight into how a solution could work!

Classifying English words into rare and common

I'm trying to devise a method that will be able to classify a given number of English words into 2 sets - "rare" and "common" - the reference being to how much they are used in the language.
The number of words I would like to classify is bounded - currently at around 10,000 - and includes everything from articles to proper nouns that could be borrowed from other languages (and would thus be classified as "rare"). I've done some frequency analysis within the corpus, and I have a distribution of these words (ranging from 1 use to at most about 100).
My intuition for such a system was to use word lists (such as the BNC word frequency corpus, wordnet, internal corpus frequency), and assign weights to its occurrence in one of them.
For instance, a word that has a mid-level frequency in the corpus (say 50) but appears in a word list W can be regarded as common, since it's one of the most frequent in the entire language. My question is: what's the best way to create a weighted score for something like this? Should I go discrete or continuous? In either case, what kind of classification system would work best?
Or do you recommend an alternative method?
Thanks!
EDIT:
To answer Vinko's question on the intended use of the classification -
These words are tokenized from a phrase (eg: book title) - and the intent is to figure out a strategy to generate a search query string for the phrase, searching a text corpus. The query string can support multiple parameters such as proximity, etc - so if a word is common, these params can be tweaked.
To answer Igor's question -
(1) how big is your corpus?
Currently, the list is limited to 10k tokens, but this is just a training set. It could go up to a few 100k once I start testing it on the test set.
2) do you have some kind of expected proportion of common/rare words in the corpus?
Hmm, I do not.
Assuming you have a way to evaluate the classification, you can use the "boosting" approach to machine learning. Boosting classifiers use a set of weak classifiers combined into a strong classifier.
Say, you have your corpus and K external wordlists you can use.
Pick N frequency thresholds. For example, you may have 10 thresholds: 0.1%, 0.2%, ..., 1.0%.
For your corpus and each of the external word lists, create N "experts", one expert per threshold per wordlist/corpus, total of N*(K+1) experts. Each expert is a weak classifier, with a very simple rule: if the frequency of the word is higher than its threshold, they consider the word to be "common". Each expert has a weight.
The learning process is as follows: assign the weight 1 to each expert. For each word in your corpus, make the experts vote. Sum their votes: 1 * weight(i) for "common" votes and (-1) * weight(i) for "rare" votes. If the result is positive, mark the word as common.
Now, the overall idea is to evaluate the classification and increase the weight of experts that were right and decrease the weight of the experts that were wrong. Then repeat the process again and again, until your evaluation is good enough.
The specifics of the weight adjustment depend on how you evaluate the classification. For example, if you don't have per-word evaluation, you may still evaluate the classification as having "too many common" or "too many rare" words. In the first case, promote all the pro-"rare" experts and demote all the pro-"common" experts, or vice versa.
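A toy sketch of the voting scheme described above (the weight-update loop is omitted); the frequency tables and thresholds are made-up placeholders:
def make_expert(freq_lookup, threshold):
    """Weak classifier: votes +1 ('common') if the word's frequency exceeds the threshold, else -1."""
    return lambda word: 1 if freq_lookup.get(word, 0.0) > threshold else -1

def classify(word, experts, weights):
    """Weighted vote of all experts; a positive total means 'common'."""
    total = sum(w * expert(word) for expert, w in zip(experts, weights))
    return "common" if total > 0 else "rare"

# Toy frequency tables (word -> relative frequency); real ones would come from the corpus and word lists
corpus_freq = {"the": 0.05, "aardvark": 0.00001}
wordlist_freq = {"the": 0.06, "aardvark": 0.00005}

thresholds = [0.001, 0.002, 0.005, 0.01]
experts = [make_expert(src, t) for src in (corpus_freq, wordlist_freq) for t in thresholds]
weights = [1.0] * len(experts)  # all experts start with weight 1; adjust up/down after evaluation

print(classify("the", experts, weights))       # common
print(classify("aardvark", experts, weights))  # rare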
Your distribution is most likely a Pareto distribution (a superset of Zipf's law as mentioned above). I am shocked that the most common word is used only 100 times - this includes "a" and "the" and words like that? You must have a small corpus if that is the case.
Anyways, you will have to choose a cutoff for "rare" and "common". One potential choice is the mean expected number of appearances (see the linked wiki article above to calculate the mean). Because of the "fat tail" of the distribution, a fairly small number of words will have appearances above the mean -- these are the "common". The rest are "rare". This will have the effect that many more words are rare than common. Not sure if that is what you are going for but you can just move the cutoff up and down to get your desired distribution (say, all words with > 50% of expected value are "common").
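A small sketch of that cutoff idea, using the empirical mean of the observed counts as a stand-in for the expected value; counts is a hypothetical word-to-count dict:
def split_by_mean(counts):
    """Words with counts above the mean count are 'common'; the rest are 'rare'."""
    mean = sum(counts.values()) / len(counts)
    common = {w for w, c in counts.items() if c > mean}
    rare = set(counts) - common
    return common, rare

counts = {"the": 100, "of": 80, "aardvark": 1, "zyzzyva": 1}
common, rare = split_by_mean(counts)  # mean = 45.5 -> common: {'the', 'of'}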
While this is not an answer to your question, you should know that you are reinventing the wheel here.
Information Retrieval experts have devised ways to weight search words according to their frequency. A very popular weight is TF-IDF, which uses a word's frequency in a document and its frequency in a corpus. TF-IDF is also explained here.
An alternative score is the Okapi BM25, which uses similar factors.
See also the Lucene Similarity documentation for how TF-IDF is implemented in a popular search library.
