why does Porter Stemmer yield a string which can be stemmed again? - stemming

stem('apples')='apple'
stem('apple')='appl'
stem('appl')='appl'
isn't this a flaw in the stemming algorithm?
(this is using the Porter Stemming Algorithm)

That looks more like a bug in the implementation of the algorithm you're using.
When I follow the steps in the original algorithm (from the page you linked to), the final "s" from "apples" is removed in step 1a, and the "e" in step 5a, so the stem of "apples" is also "appl".

I found an implementation of porter stemming algorithm with a dictionary support here http://preciselyconcise.com/apis_and_installations/smart_stemmer.php.
This API was really simple to use and the stemmed words were corrected for mistakes in spelling. I would suggest you using this stemmer since this API has an auto correct option of the stemmed words.

Related

Do those stemming make sense in OpenNLP?

I just installed openNLP and tested some stemming. Those stemming results look suspicious to me.
people => peopl
excellent => excel
beautiful => beauti
I am not sure these are the original output of OpenNLP, or my installation has some problem which can't produce correct results.
Can someone help me verify these? Thank you really.
Yes, that makes sense. From Wikipedia:
The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
Lemmatizer is the tool that returns the morphological root. It gets the inflected word and a pos tag and returns the lemma. You can check how to use it in OpenNLP Manual.

Algorithm to compare similarity of English sentences

I have a collection of sentences, and I need to analyse them to see how similar they are.
Are there any established algorithms to do this?
I care about:
containing the same words (ignoring inflexions for now)
containing the same words in a similar order
I've used Levenshtein distance and n-grams for spelling before, although I'm not entirely confident if these translate to my purposes.
Naively, "I don't care about spelling differences, typos can be treated as different words" although perhaps it would be nice to account for this.
perhaps some hybrid of splitting the sentence at spaces and one of the above (or other) algorithms would be a starting point
What options are available? Any advice?
Thanks!
This paper compares several sentence similarity measures. Perhaps you can use one of them as is, or modify it for your needs.
Otherwise sentence similarity measure is a good key term to google for.
To ignore inflections you should look into stemming algorithms: http://en.wikipedia.org/wiki/Porter_stemmer
They reduce words to their root forms.

How do I approximate "Did you mean?" without using Google?

I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
These questions are interested in how the algorithm actually works. My question is more like: Let's assume Google did not exist or maybe this feature did not exist and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Have a dictionary handy and guess the most probably words again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check with all 1-grams (all dictionary words) and find a closest match that's pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), and then 3-grams and so on. When we try a 4-gram, we find that there is a phrase that is at 0 distance from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here suggests clearly that Google uses spell correcters to generate it's suggestions. Since Google has massive paralellization capabilities, they can accomplish this task very quickly.
Impressive tutroail one how its work you can found here http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html.
In few word it is trade off of query modification(on character or word level) to increasing coverage in search documents. For example "aple" lead to 2mln documents, but "apple" lead to 60mln and modification is only one character, therefore it is obvious that you mean apple.
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
Note: links removed as I'm a new user - sorry.
#Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English language words they should get you very close. You can also check out the wikipedia page for Phonetic Algorithms for a list of other similar algorithms.
Take a look at this: How does the Google "Did you mean?" Algorithm work?

Is there an algorithm that tells the semantic similarity of two phrases

input: phrase 1, phrase 2
output: semantic similarity value (between 0 and 1), or the probability these two phrases are talking about the same thing
You might want to check out this paper:
Sentence similarity based on semantic nets and corpus statistics (PDF)
I've implemented the algorithm described. Our context was very general (effectively any two English sentences) and we found the approach taken was too slow and the results, while promising, not good enough (or likely to be so without considerable, extra, effort).
You don't give a lot of context so I can't necessarily recommend this but reading the paper could be useful for you in understanding how to tackle the problem.
Regards,
Matt.
There's a short and a long answer to this.
The short answer:
Use the WordNet::Similarity Perl package. If Perl is not your language of choice, check the WordNet project page at Princeton, or google for a wrapper library.
The long answer:
Determining word similarity is a complicated issue, and research is still very hot in this area. To compute similarity, you need an appropriate represenation of the meaning of a word. But what would be a representation of the meaning of, say, 'chair'? In fact, what is the exact meaning of 'chair'? If you think long and hard about this, it will twist your mind, you will go slightly mad, and finally take up a research career in Philosophy or Computational Linguistics to find the truth™. Both philosophers and linguists have tried to come up with an answer for literally thousands of years, and there's no end in sight.
So, if you're interested in exploring this problem a little more in-depth, I highly recommend reading Chapter 20.7 in Speech and Language Processing by Jurafsky and Martin, some of which is available through Google Books. It gives a very good overview of the state-of-the-art of distributional methods, which use word co-occurrence statistics to define a measure for word similarity. You are not likely to find libraries implementing these, however.
For anyone just coming at this, i would suggest taking a look at SEMILAR - http://www.semanticsimilarity.org/ . They implement a lot of the modern research methods for calculating word and sentence similarity. It is written in Java.
SEMILAR API comes with various similarity methods based on Wordnet, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), BLEU, Meteor, Pointwise Mutual Information (PMI), Dependency based methods, optimized methods based on Quadratic Assignment, etc. And the similarity methods work in different granularities - word to word, sentence to sentence, or bigger texts.
You might want to check into the WordNet project at Princeton University. One possible approach to this would be to first run each phrase through a stop-word list (to remove "common" words such as "a", "to", "the", etc.) Then for each of the remaining words in each phrase, you could compute the semantic "similarity" between each of the words in the other phrase using a distance measure based on WordNet. The distance measure could be something like: the number of arcs you have to pass through in WordNet to get from word1 to word2.
Sorry this is pretty high-level. I've obviously never tried this. Just a quick thought.
I would look into latent semantic indexing for this. I believe you can create something similar to a vector space search index but with semantically related terms being closer together i.e. having a smaller angle between them. If I learn more I will post here.
Sorry to dig up a 6 year old question, but as I just came across this post today, I'll throw in an answer in case anyone else is looking for something similar.
cortical.io has developed a process for calculating the semantic similarity of two expressions and they have a demo of it up on their website. They offer a free API providing access to the functionality, so you can use it in your own application without having to implement the algorithm yourself.
One simple solution is to use the dot product of character n-gram vectors. This is robust over ordering changes (which many edit distance metrics are not) and captures many issues around stemming. It also prevents the AI-complete problem of full semantic understanding.
To compute the n-gram vector, just pick a value of n (say, 3), and hash every 3-word sequence in the phrase into a vector. Normalize the vector to unit length, then take the dot product of different vectors to detect similarity.
This approach has been described in
J. Mitchell and M. Lapata, “Composition in Distributional Models of Semantics,” Cognitive Science, vol. 34, no. 8, pp. 1388–1429, Nov. 2010., DOI 10.1111/j.1551-6709.2010.01106.x
I would have a look at statistical techniques that take into consideration the probability of each word to appear within a sentence. This will allow you to give less importance to popular words such as 'and', 'or', 'the' and give more importance to words that appear less regurarly, and that are therefore a better discriminating factor. For example, if you have two sentences:
1) The smith-waterman algorithm gives you a similarity measure between two strings.
2) We have reviewed the smith-waterman algorithm and we found it to be good enough for our project.
The fact that the two sentences share the words "smith-waterman" and the words "algorithms" (which are not as common as 'and', 'or', etc.), will allow you to say that the two sentences might indeed be talking about the same topic.
Summarizing, I would suggest you have a look at:
1) String similarity measures;
2) Statistic methods;
Hope this helps.
Try SimService, which provides a service for computing top-n similar words and phrase similarity.
This requires your algorithm actually knows what your talking about. It can be done in some rudimentary form by just comparing words and looking for synonyms etc, but any sort of accurate result would require some form of intelligence.
Take a look at http://mkusner.github.io/publications/WMD.pdf This paper describes an algorithm called Word Mover distance that tries to uncover semantic similarity. It relies on the similarity scores as dictated by word2vec. Integrating this with GoogleNews-vectors-negative300 yields desirable results.

Levenshtein distance based methods Vs Soundex

As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex.
Soundex is rather primitive - it was originally developed to be hand calculated. It results in a key that can be compared.
Soundex works well with western names, as it was originally developed for US census data. It's intended for phonetic comparison.
Levenshtein distance looks at two values and produces a value based on their similarity. It's looking for missing or substituted letters.
Basically Soundex is better for finding that "Schmidt" and "Smith" might be the same surname.
Levenshtein distance is better for spotting that the user has mistyped "Levnshtein" ;-)
I would suggest using Metaphone, not Soundex. As noted, Soundex was developed in the 19th century for American names. Metaphone will give you some results when checking the work of poor spellers who are "sounding it out", and spelling phonetically.
Edit distance is good at catching typos such as repeated letters, transposed letters, or hitting the wrong key.
Consider the application to decide which will fit your users best—or use both together, with Metaphone complementing the suggestions produced by Levenshtein.
With regard to the original question, I've used n-grams successfully in information retrieval applications.
I agree with you on Daitch-Mokotoff, Soundex is biased because the original US census takers wanted 'Americanized' names.
Maybe an example on the difference would help:
Soundex puts addition value in the start of a word - in fact it only considers the first 4 phonetic sounds. So while "Schmidt" and "Smith" will match "Smith" and "Wmith" won't.
Levenshtein's algorithm would be better for finding typos - one or two missing or replaced letters produces a high correlation, while the phonetic impact of those missing letters is less important.
I don't think either is better, and I'd consider both a distance algorithm and a phonetic one for helping users correct typed input.
#Keith:
As I posted on the other question, Daitch-Mokotoff is better for us Europeans (and I'd argue the US).
I've also read the Wiki on Levenshtein. But I don't see why (in real life) it's better for the user than Soundex.

Resources