Do these stemming results make sense in OpenNLP? - opennlp

I just installed OpenNLP and tested some stemming. These stemming results look suspicious to me:
people => peopl
excellent => excel
beautiful => beauti
I am not sure whether this is the expected output of OpenNLP, or whether my installation has a problem and is producing incorrect results.
Can someone help me verify this? Thank you.

Yes, that makes sense. From Wikipedia:
The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
The lemmatizer is the tool that returns the morphological root. It takes an inflected word and a POS tag and returns the lemma. You can check how to use it in the OpenNLP Manual.
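For example, here is a minimal sketch of both tools, assuming opennlp-tools is on the classpath and that you have a lemma dictionary file locally (the file name en-lemmatizer.dict below is just an assumption):

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.lemmatizer.DictionaryLemmatizer;
import opennlp.tools.stemmer.PorterStemmer;

public class StemVsLemma {
    public static void main(String[] args) throws Exception {
        // The Porter stemmer strips suffixes heuristically; the output is a stem,
        // not necessarily a dictionary word ("beautiful" -> "beauti").
        PorterStemmer stemmer = new PorterStemmer();
        System.out.println(stemmer.stem("people"));
        System.out.println(stemmer.stem("excellent"));
        System.out.println(stemmer.stem("beautiful"));

        // The dictionary lemmatizer maps (token, POS tag) pairs to lemmas,
        // so it returns dictionary words rather than truncated stems
        // (the exact output depends on the dictionary's entries).
        try (InputStream dictIn = new FileInputStream("en-lemmatizer.dict")) {
            DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictIn);
            String[] tokens = {"people", "excellent", "beautiful"};
            String[] tags = {"NNS", "JJ", "JJ"};   // Penn Treebank POS tags
            String[] lemmas = lemmatizer.lemmatize(tokens, tags);
            for (String lemma : lemmas) {
                System.out.println(lemma);
            }
        }
    }
}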

Related

Elasticsearch handling German compound words

I have been working for a while on the issue of German compound words and would like to know if someone has a modern approach to the problem.
The Problem
Many German words are actually combinations of several words, for example Baumhaus, which consists of the words Baum and Haus. To achieve a well-performing search, it must be possible to search by each individual word.
Possible Solutions
Using a dictionary approach like the one shown here. This approach is probably the most precise solution I tested. However, the dictionary is far from complete and hence produces a lot of undesired results.
The second approach I tried utilizes prebuilt Compact Patricia Tries in the form of a plugin and makes up for the incomplete dictionary. However, it of course also causes a lot of undesired results.
Both approaches I found and tested worked but were far from perfect. A combination of both using a multiplexer token filter created an okay result.
Not A Solution
My first attempt was the usage of the N-gram token filter. However, this presented a crucial problem: while a search for Haus should return matches on Baumhaus, a search for Baumhaus should not match on just Haus.
The Question
Since both of the mentioned solutions are rather old, impact performance when combined, and aren't really actively maintained anymore, I feel like there must be another approach to this problem.
Any input is appreciated ;)
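For reference, the dictionary approach I tested looks roughly like the following as index settings (the word_list here is a toy placeholder; the real setup points word_list_path at a full German dictionary file):

PUT /compound_test
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["baum", "haus"]
        }
      },
      "analyzer": {
        "german_compound_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  }
}

With this analyzer, Baumhaus is indexed as baumhaus plus the subwords baum and haus, so a search for Haus matches it.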

Find basic words and estimate their difficulty

I'm looking for a possibly simple solution to the following problem:
Given input of a sentence like
"Absence makes the heart grow fonder."
Produce a list of basic words followed by their difficulty/complexity
[["absence", 0.5], ["make", 0.05], ["the", 0.01"], ["grow", 0.1"], ["fond", 0.5]]
Let's assume that:
all the words in the sentence are valid English words
popularity is an acceptable measure of difficulty/complexity
base word can be understood in any constructive way (see below)
difficulty/complexity is on a scale from 0 (piece of cake) to 1 (mind-boggling)
a difficulty bias is OK; it's better to mistakenly call an easy word tough than the other way around
a simple working solution is preferred to flawless but complicated stuff
[edit] there is no interaction with user
[edit] we can handle any proper English input
[edit] a word is not more difficult than its basic form (because as smart beings we can create unhappily if we know happy), unless it creates a new word (unlikely is not the same difficulty as like)
General ideas:
I considered using Google searches or sites like Wordcount to estimate a word's popularity, which could indicate its difficulty. However, both solutions give different results depending on the form of the entered word: Google gives 316m results for fond but 11m for fonder, whereas Wordcount gives them ranks of 6k and 54k.
Transforming words to their basic forms is not a must, but it solves the ambiguity problem (and makes it easy to create dictionary links); however, it's not a simple task and its usefulness could be considered arguable. Obviously fond should be taken instead of fonder, but investigating believe instead of unbelievable seems to be overkill ([edit] it might not be the best example, but there is a point at which modifying a basic word creates a new one, like -> likely), and words like doorkeeper shouldn't be cut into two.
Some ideas of what should be considered a basic word can be found here on Wikipedia, but maybe a simpler way of determining it would be the use of a dictionary. For instance, according to dictionary.reference.com, unbelievable is a basic word whereas fonder comes from fond; but then grow is not the same as growing.
Idea of a solution:
It seems to me that the best way to handle the problem would be using a dictionary to find basic words, apply some of the Wikipedia rules and then use Wordcount (maybe combined with number of Google searches) to estimate difficulty.
Still, there might be (and probably is) a simpler and better way, or ready-to-use algorithms. I would appreciate any solution that deals with this problem and is easy to put into practice. Maybe I'm just trying to reinvent the wheel (or maybe you know my approach would work just fine and I'm wasting my time deliberating instead of coding what I have). I would, however, prefer to avoid implementing frequency analysis algorithms or preparing a corpus of texts.
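To make the idea concrete, here is a rough sketch of the popularity-to-difficulty mapping I have in mind; the frequency counts are made-up placeholders standing in for Wordcount or corpus numbers, and the log scaling is just one possible choice:

import java.util.LinkedHashMap;
import java.util.Map;

public class WordDifficulty {
    // Toy frequency counts for already-stemmed base words (placeholder numbers).
    private static final Map<String, Long> FREQ = new LinkedHashMap<>();
    static {
        FREQ.put("the", 56_000_000L);
        FREQ.put("make", 9_500_000L);
        FREQ.put("grow", 1_200_000L);
        FREQ.put("absence", 310_000L);
        FREQ.put("fond", 120_000L);
    }

    // Map a frequency onto [0, 1]: very frequent -> near 0 (easy), rare -> near 1 (hard).
    static double difficulty(String baseWord) {
        long count = FREQ.getOrDefault(baseWord, 1L);   // unknown words default to "hard"
        double maxLog = Math.log(56_000_000L);
        return 1.0 - Math.min(1.0, Math.log(count) / maxLog);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"the", "make", "grow", "absence", "fond"}) {
            System.out.printf("[\"%s\", %.2f]%n", w, difficulty(w));
        }
    }
}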
Some terminology:
The core part of the word is called a stem or a root. More on this distinction later. You can think of the root/stem as the part that carries the main meaning of the word and will appear in the dictionary.
(In English) most words are composed of one root (exception: compounds like "windshield") / one stem and zero or more affixes: the affixes that come after the root/stem are called suffixes, and the affixes that precede the root/stem are called prefixes. Examples: "driver" = "drive" (root/stem) + suffix "-er"; "unkind" = "kind" (root/stem) + "un-" (prefix).
Suffixes/prefixes (= affixes) can be inflectional or derivational. For example, in English, third-person singular verbs have an s on the end: "I drive" but "He drive-s". These kinds of agreement suffixes don't change the category of the word: "drive" is a verb regardless of the inflectional "s". On the other hand, a suffix like "-er" is derivational: it takes a verb (e.g. "drive") and turns it into a noun (e.g. "driver").
The stem is the piece of the word without any inflectional affixes, whereas the root is the piece of the word without any derivational affixes. For instance, the plural noun "drivers" is decomposable into "drive" (root) + "er" (derivational affix, makes a new stem "driver") + "s" (plural).
The process of deriving the "base" form of the word is called "stemming".
So, armed with this terminology, it seems that for your task the most useful thing to do would be to stem each form you come across, i.e. remove all the inflectional affixes and keep the derivational ones, since derivational affixes can change how common the word is considered to be. Think about it this way: if I tell you a new word in English, you will always know how to make it plural or 3rd-person singular; however, you may not know some of the other words you can derive from it. English being an inflection-poor language, there aren't a lot of inflectional suffixes to worry about (and Google search is pretty good about stripping them off, so maybe you can use Google's stemming engine just by running your word forms through a Google search and picking out the highlighted results):
Third singular verbal -s: "I drive"/"He drive-s"
Nominal plural -s: "One wug"/"Two wug-s". Note that there are some irregular forms here, such as "children", "oxen", "geese", etc. I think I wouldn't worry about these.
Verbal past-tense and participial forms. The regular ones are easy: -ed for both the past tense and the past participle ("I walk"/"I walk-ed"/"I had walk-ed"), but there are quite a few irregular ones (fall/fell/fallen, dive/dove/dived?, etc.). Maybe make a list of these?
Verbal -ing forms: "walk"/"walk-ing"
Adjectival comparative -er and superlative -est. There are a few irregular/suppletive ones ("good"/"better"/"best"), but these should not present a huge problem.
These are the main inflectional affixes in English; I may be forgetting a few that you could discover by picking up an introductory Linguistics book. Also, there are going to be borderline cases, such as "un-", which is so promiscuous that we might consider it inflectional. For more information on these types, see Level 1 vs. Level 2 affixation, but I would treat these cases as derivational for your purposes and not stem them.
As far as "grading" how common various stems are, besides Google you could use various freely available text corpora. The Wikipedia article linked to has a few links to free corpora, and you can find a bunch more by googling. From these corpora you can build a frequency count of each stem, and use that to judge how common the form is.
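For instance, a deliberately naive sketch of "strip inflection, keep derivation", covering only the regular suffixes listed above; irregular forms (children, fell, better, ...) would need an exception table, and the comparative -er is skipped here because it collides with the derivational -er of driver:

import java.util.Arrays;
import java.util.List;

public class NaiveInflectionStripper {
    // Rough heuristics for the regular inflectional suffixes; this is not a full stemmer.
    static String stripInflection(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && w.length() > 4) return w.substring(0, w.length() - 3) + "y"; // parties -> party
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);       // walking -> walk
        if (w.endsWith("ed") && w.length() > 4)  return w.substring(0, w.length() - 2);       // walked  -> walk
        if (w.endsWith("est") && w.length() > 5) return w.substring(0, w.length() - 3);       // longest -> long
        if (w.endsWith("s") && !w.endsWith("ss")) return w.substring(0, w.length() - 1);      // drives  -> drive
        return w;  // derivational affixes (driver, unkind) are left alone
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("drives", "walking", "walked", "driver", "unkind", "wugs");
        words.forEach(w -> System.out.println(w + " -> " + stripInflection(w)));
    }
}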
I'm afraid there is no simple solution to the task of finding "basic" forms. I'm basing that on my memory of my Machine Learning textbook, of which language analysis was a part. You need some database from which you can get them.
At the same time, please note that the number of words people use in everyday language is not that big. You can always ask a user what the base form of a word you have not seen before is (unless this is your homework, which will be checked automatically).
Eventually, if you don't care about covering all words, you can create a simple database containing the different forms of the most common words, and then try to use grammatical rules for the less common ones (which would be a good approximation, as the most common words in English are actually irregular, whereas the uncommon ones are regular, because their original forms have been forgotten).
Note, however, that I'm no specialist; I'm simply trying to help :-)

How do I approximate "Did you mean?" without using Google?

I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
These questions are about how the algorithm actually works. My question is more like: let's assume Google did not exist, or maybe this feature did not exist, and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Have a dictionary handy and guess the most probable words again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
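Here is a minimal sketch of that decomposition step, assuming a toy in-memory dictionary (a real one would be much larger, and you would memoize the recursion to avoid re-splitting the same suffix):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Segmenter {
    // Tiny stand-in dictionary; in practice load a full word list.
    static final Set<String> DICT = new HashSet<>(Arrays.asList("try", "to", "reconnect", "connect", "you"));

    // Return one decomposition of s into dictionary words, or null if none exists.
    static List<String> segment(String s) {
        if (s.isEmpty()) return new ArrayList<>();
        for (int i = 1; i <= s.length(); i++) {
            String head = s.substring(0, i);
            if (DICT.contains(head)) {
                List<String> rest = segment(s.substring(i));   // recurse on the remaining string
                if (rest != null) {
                    rest.add(0, head);
                    return rest;
                }
            }
        }
        return null;  // no valid word starts here, so this split fails
    }

    public static void main(String[] args) {
        System.out.println(segment("trytoreconnectyou"));  // [try, to, reconnect, you]
        System.out.println(segment("qualfy"));              // null -> fall back to spell correction
    }
}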
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
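Norvig's corrector is written in Python, but the core of it ports easily. Below is a hedged Java sketch of the candidate-generation step plus a frequency-based pick; the counts map stands in for whatever word counts you build from such a corpus:

import java.util.Comparator;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NorvigStyleCorrector {
    static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";

    // All strings one edit away from word: deletes, transposes, replaces, inserts.
    static Set<String> edits1(String word) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i), right = word.substring(i);
            if (!right.isEmpty()) out.add(left + right.substring(1));                                       // delete
            if (right.length() > 1) out.add(left + right.charAt(1) + right.charAt(0) + right.substring(2)); // transpose
            for (char c : LETTERS.toCharArray()) {
                if (!right.isEmpty()) out.add(left + c + right.substring(1));                               // replace
                out.add(left + c + right);                                                                  // insert
            }
        }
        return out;
    }

    // Keep the word if it is known; otherwise pick the known candidate with the highest corpus count.
    static String correct(String word, Map<String, Integer> counts) {
        if (counts.containsKey(word)) return word;
        return edits1(word).stream()
                .filter(counts::containsKey)
                .max(Comparator.comparingInt(counts::get))
                .orElse(word);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("qualify", 120, "quality", 300, "quilt", 10);
        System.out.println(correct("qualfy", counts));  // qualify
    }
}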
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check with all 1-grams (all dictionary words) and find that the closest match is pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), and then 3-grams and so on. When we try a 4-gram, we find that there is a phrase that is at distance 0 from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here suggests clearly that Google uses spell correctors to generate its suggestions. Since Google has massive parallelization capabilities, it can accomplish this task very quickly.
An impressive tutorial on how it works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html.
In a few words, it is a trade-off between query modification (at the character or word level) and increased coverage of the search documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million, and the modification is only one character; therefore it is obvious that you mean apple.
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
Note: links removed as I'm a new user - sorry.
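LSH itself takes more machinery, but the representation it works over is easy to show. As a brute-force stand-in, here is what character trigrams and the set similarity you would be approximating look like; an LSH index would let you find the closest stored term without scanning every candidate:

import java.util.HashSet;
import java.util.Set;

public class TrigramSimilarity {
    // Character trigrams of a string: "apple" -> {app, ppl, ple}
    static Set<String> trigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) grams.add(s.substring(i, i + 3));
        return grams;
    }

    // Jaccard similarity of the two trigram sets.
    static double jaccard(String a, String b) {
        Set<String> ga = trigrams(a), gb = trigrams(b);
        Set<String> inter = new HashSet<>(ga);
        inter.retainAll(gb);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("qualfy", "qualify"));  // ~0.29
        System.out.println(jaccard("qualfy", "banana"));   // 0.0
    }
}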
#Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English-language words they should get you very close. You can also check out the Wikipedia page on phonetic algorithms for a list of other similar algorithms.
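Assuming Apache Commons Codec is on the classpath (it ships ready-made Soundex and DoubleMetaphone implementations), a quick sketch of using them:

import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.commons.codec.language.Soundex;

public class PhoneticDemo {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        DoubleMetaphone metaphone = new DoubleMetaphone();

        // A misspelling and the intended word collapse to the same phonetic code.
        System.out.println(soundex.soundex("qualfy"));    // Q410
        System.out.println(soundex.soundex("qualify"));   // Q410

        System.out.println(metaphone.doubleMetaphone("qualfy"));   // same code for both
        System.out.println(metaphone.doubleMetaphone("qualify"));
        System.out.println(metaphone.isDoubleMetaphoneEqual("qualfy", "qualify"));  // true
    }
}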
Take a look at this: How does the Google "Did you mean?" Algorithm work?

Detect similar sounding words in Ruby

I'm aware of SOUNDEX and (double) Metaphone, but these don't let me test for the similarity of words as a whole - for example "Hi" sounds very similar to "Bye", but both of these methods will mark them as completely different.
Are there any libraries in Ruby, or any methods you know of, that are capable of determining the similarity between two words? (Either a boolean is/isn't similar, or numerical 40% similar)
edit: Extra bonus points if there is an easy method to 'drop in' a different dialect or language!
I think you're describing Levenshtein distance. And yes, there are gems for that. If you're into pure Ruby, go for the text gem.
$ gem install text
The docs have more details, but here's the crux of it:
Text::Levenshtein.distance('test', 'test') # => 0
Text::Levenshtein.distance('test', 'tent') # => 1
If you're ok with native extensions...
$ gem install levenshtein
Its usage is similar. Its performance is very good. (It handles ~1000 spelling corrections per minute on my systems.)
If you need to know how similar two words are, use distance over word length.
If you want a simple similarity test, consider something like this:
Untested, but straightforward:
require 'text'

String.module_eval do
  # Two strings count as similar if their Levenshtein distance is within the threshold.
  def similar?(other, threshold = 2)
    distance = Text::Levenshtein.distance(self, other)
    distance <= threshold
  end
end
What you need is a pronunciation dictionary. The best free one is the CMU Pronouncing Dictionary.
Map the strings to their pronunciations, then do a bit of preprocessing (for example, you'll probably want to remove the numbers that cmudict uses to indicate stress), then you could use one of the techniques others have suggested, such as levenshtein distance, on the pronunciation strings instead of the input strings.
For an example of something similar, see dict/dict.rb in Rhyme Ninja.
You might first preprocess the words using a thesaurus database, which will convert words with similar meanings to the same word. There are various thesaurus databases out there; unfortunately, I couldn't find a decent free one for English (http://www.gutenberg.org/etext/3202 is the one I found, but it doesn't show what relation the specific words have (similar, opposite, alternate meaning, etc.), so all words on the same line have some relation, but you won't know what that relation is).
For Hungarian, for example, there is a good free thesaurus database, but you don't have Soundex/Metaphone for Hungarian texts...
If you have the database, writing a program that preprocesses the texts isn't too hard (ultimately it's a simple search-and-replace, but you might want to preprocess the thesaurus database using Soundex or Metaphone too).

Need a highly efficient algorithm to check if a string contains English speech

I have many strings. All of them contain only letters; characters and words are not separated from each other by spaces. Some of the characters form English words and others are just bufflegab. The strings may not contain a whole sentence.
I need to find out which of them are written in valid English. What I mean by that is that the string could be built by concatenating well-written English words. I know I could do something with a wordlist, but the words are not separated from each other, so it could be very time-consuming to test every possible word combination.
I am searching for a high-performance algorithm or method that checks whether the strings are built of English words. Maybe there is something that gives me the likelihood that the string contains English speech.
Do you know a method or algorithm that helps me?
Does something like Sphinx help me?
This is called the segmentation problem.
There is no trivial way to solve this. What I can suggest, based on my guess of your knowledge level, is to build a trie out of your dictionary and, at the first chance you detect a possible word, try assuming that it is the word.
If later on you find out that the rest of the string is gibberish, then you backtrack to the last time you decided a sequence of letters was a word, and ignore that word.
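A hedged sketch of that trie-plus-backtracking idea with a toy dictionary (a real implementation would also memoize failing positions so long strings stay fast):

import java.util.HashMap;
import java.util.Map;

public class EnglishConcatChecker {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) n = n.children.computeIfAbsent(c, k -> new Node());
        n.isWord = true;
    }

    // Try every dictionary word starting at pos; if the rest cannot be segmented,
    // fall through and try a longer candidate (this is the backtracking described above).
    boolean canSegment(String s, int pos) {
        if (pos == s.length()) return true;
        Node n = root;
        for (int j = pos; j < s.length(); j++) {
            n = n.children.get(s.charAt(j));
            if (n == null) return false;   // no dictionary word continues from here
            if (n.isWord && canSegment(s, j + 1)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        EnglishConcatChecker checker = new EnglishConcatChecker();
        for (String w : new String[] {"try", "to", "tore", "connect", "reconnect", "you"}) checker.add(w);
        System.out.println(checker.canSegment("trytoreconnectyou", 0));  // true
        System.out.println(checker.canSegment("xqzzyreconnectyou", 0));  // false
    }
}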
If your strings are long enough or your bufflegab strange enough, letter frequencies - possibly also bigram frequencies, trigram frequencies, etc. - might be sufficient (instead of the more general N-grams). For example, some browsers use that to guess the code page.
Check N-gram language model.
See http://en.wikipedia.org/wiki/N-gram
Sphinx probably won't help you. Try the Rabin-Karp algorithm. It is awful for standard search but should work well for this particular problem. Basically, you'll want to have a dictionary of English words and search with it. An overly large dictionary will still be pretty slow, but if you use a small dictionary of common words and only switch to the big one when the small one finds no match, you probably still won't get too many false negatives.
Why not store your wordlist in a trie? Then you iterate through the input looking for matching words in the trie - this can be done very efficiently. If you find one, advance to the end of the word and continue.
It depends on what accuracy you want, how efficient you need it to be, and what kind of text you are processing.
