Related
I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
These questions are interested in how the algorithm actually works. My question is more like: Let's assume Google did not exist or maybe this feature did not exist and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Have a dictionary handy and guess the most probable words, again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
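The recursive split described above can be sketched roughly as follows. This is only a minimal illustration: `WORDS` is a made-up toy dictionary and `segment` is a hypothetical helper name, and it handles exact dictionary words only (no typo tolerance or tie-breaking by hit counts).

```python
from functools import lru_cache

# Hypothetical toy dictionary; a real system would load a full word list.
WORDS = {"try", "to", "reconnect", "you", "a", "i"}

def segment(text, words=WORDS):
    """Split text into dictionary words with no characters left over.

    Returns a list of words, or None if no complete split exists.
    """
    text = text.lower()

    @lru_cache(maxsize=None)  # memoise so each suffix is solved only once
    def solve(start):
        if start == len(text):
            return ()  # nothing left: successful split
        for end in range(start + 1, len(text) + 1):
            prefix = text[start:end]
            if prefix in words:
                rest = solve(end)  # repeat the algorithm on the remainder
                if rest is not None:
                    return (prefix,) + rest
        return None  # no valid word starts here

    result = solve(0)
    return list(result) if result is not None else None

print(segment("Trytoreconnectyou"))
```

With this toy dictionary the call prints `['try', 'to', 'reconnect', 'you']`; a string that cannot be fully decomposed gives `None`.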
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check with all 1-grams (all dictionary words) and find a closest match that's pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), and then 3-grams and so on. When we try a 4-gram, we find that there is a phrase that is at 0 distance from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here clearly suggests that Google uses spell correctors to generate its suggestions. Since Google has massive parallelization capabilities, they can accomplish this task very quickly.
An impressive tutorial on how it works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html.
In a few words, it is a trade-off between modifying the query (at the character or word level) and increasing coverage in the searched documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million and the modification is only one character, so it is very likely that you meant apple.
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
Note: links removed as I'm a new user - sorry.
@Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English language words they should get you very close. You can also check out the wikipedia page for Phonetic Algorithms for a list of other similar algorithms.
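For reference, here is a small sketch of the classic American Soundex coding mentioned above (not Metaphone or Double Metaphone). The function name `soundex` and the table `CODES` are just illustrative choices; words that share a code are candidates for "sounds like" matches.

```python
# Consonant groups of classic Soundex; vowels, h, w and y get no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code of word (first letter + 3 digits)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (result + "000")[:4]  # pad or truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))
```

Both names map to `R163`, which is the kind of collision a phonetic index relies on.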
Take a look at this: How does the Google "Did you mean?" Algorithm work?
I want to generate Keywords for my CMS.
Does someone know a good PHP Script (or something else) which generates keywords?
I have a HTML Site like this: http://pastebin.com/ZU8vdyeP
This is a very hard problem for a computer to solve. It would be much easier to get somebody (else?) to do it manually, or simply not do it at all.
If you'd really need a computer to do it, I'd head over to the excellent Python library NLTK which has many tools for this sort of thing (=natural language processing), and it's a lot of fun to work with.
For example, you could calculate a frequency distribution of the words, and then search for the most common hypernyms of larger (above say 5 char) words that appear most frequently and use that as a hint of what the keywords could be.
Again, it is much easier to get it done by a human, however.
To automate this, get the words from the article, match them against a blacklist, and don't include words under 4 characters.
Additionally, let the user edit the keywords manually, so only automate if no keywords are present.
This can be done by a trigger or in the application layer.
regards,
/t
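A rough sketch of that idea, in Python rather than PHP. The `BLACKLIST` contents, the function name `auto_keywords` and the `limit` parameter are all made up for illustration; ranking by how often a word occurs is an added assumption, since the answer only describes the filtering.

```python
import re
from collections import Counter

# Hypothetical blacklist of words that should never become keywords.
BLACKLIST = {"this", "that", "with", "from", "have", "your", "about"}

def auto_keywords(text, existing=None, limit=10):
    """Suggest keywords, but only when the user has not supplied any."""
    if existing:
        return list(existing)  # manual keywords always win
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words
                     if len(w) >= 4 and w not in BLACKLIST)
    # Most frequent surviving words first (assumed ranking).
    return [word for word, _ in counts.most_common(limit)]

print(auto_keywords("Python makes this parser fast; the parser is in Python, with Python tests"))
```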
If I understand the problem, you have text and you want to determine keywords that are most relevant to the text.
Three approaches:
1) Have user enter keywords
2) Statistical analysis of text, for example determine the words that are far more common in the text than they are in the language overall. Any good text on Information Retrieval will have some algorithms.
3) If you have a set of documents that are already classified (perhaps previously classified by humans) then you can use a machine learning algorithm (perhaps a Bayesian classifier) to train the system to classify the new documents. If you let the users override/correct the suggested keywords, the system can learn over time.
Personally, I'd do #3, since it is more adaptive.
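To make approach 2 a bit more concrete, here is a minimal sketch that scores each word by how much more common it is in the text than in the language overall. The `background` table of relative frequencies is assumed to come from some reference corpus, and the `1e-6` default for unseen words is an arbitrary illustrative choice.

```python
import re
from collections import Counter

def keywords(text, background, top_n=5, min_len=4):
    """Rank words by (rate in this text) / (rate in the language overall)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if len(w) >= min_len]
    counts = Counter(words)
    total = sum(counts.values())

    def score(word):
        doc_rate = counts[word] / total
        bg_rate = background.get(word, 1e-6)  # assumed rate for unseen words
        return doc_rate / bg_rate

    # Highest ratio first; ties broken alphabetically for stable output.
    return sorted(counts, key=lambda w: (-score(w), w))[:top_n]

# Hypothetical background frequencies (fraction of all words in a corpus).
background = {"when": 0.005, "full": 0.001, "stores": 0.0005,
              "pages": 0.0005, "cache": 0.00001, "evicts": 0.00001}
text = "The cache stores pages. The cache evicts old pages when the cache is full."
print(keywords(text, background, top_n=2))
```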
I need to code a solution for a certain requirement, and I wanted to know if anyone is either familiar with an off-the-shelf library that can achieve it, or can direct me at the best practice. Description:
The user inputs a word that is supposed to be one of several fixed options (I hold the options in a list). I know the input must be in a member in the list, but since it is user input, he/she may have made a mistake. I'm looking for an algorithm that will tell me what is the most probable word the user meant. I don't have any context and I can’t force the user to choose from a list (i.e. he must be able to input the word freely and manually).
For example, say the list contains the words "water", “quarter”, "beer", “beet”, “hell”, “hello” and "aardvark".
The solution must account for different types of "normal" errors:
Speed typos (e.g. doubling characters, dropping characters etc)
Keyboard adjacent-character typos (e.g. "qater" for “water”)
Non-native English typos (e.g. "quater" for “quarter”)
And so on...
The obvious solution is to compare letter-by-letter and give "penalty weights" to each different letter, extra letter and missing letter. But this solution ignores thousands of "standard" errors I'm sure are listed somewhere. I'm sure there are heuristics out there that deal with all the cases, both specific and general, probably using a large database of standard mismatches (I’m open to data-heavy solutions).
I'm coding in Python but I consider this question language-agnostic.
Any recommendations/thoughts?
You want to read how google does this: http://norvig.com/spell-correct.html
Edit: Some people have mentioned algorithms that define a metric between a user-given word and a candidate word (Levenshtein, Soundex). This is however not a complete solution to the problem, since one would also need a data structure to efficiently perform a non-Euclidean nearest neighbour search. This can be done e.g. with the Cover Tree: http://hunch.net/~jl/projects/cover_tree/cover_tree.html
A common solution is to calculate the Levenshtein distance between the input and your fixed texts. The Levenshtein distance of two strings is just the number of simple operations - insertions, deletions, and substitutions of a single character - required to turn one of the strings into the other.
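A minimal sketch of that approach, using the word list from the question. The function names `levenshtein` and `closest` are illustrative, and the linear scan over all options is only sensible for a short list.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

OPTIONS = ["water", "quarter", "beer", "beet", "hell", "hello", "aardvark"]

def closest(word, options=OPTIONS):
    """Return the option with the smallest edit distance (first one wins ties)."""
    return min(options, key=lambda option: levenshtein(word, option))

print(closest("qater"), closest("quater"))
```

For the two examples from the question this prints `water quarter`.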
Have you considered algorithms that compare by phonetic sounds, such as soundex? It shouldn't be too hard to produce soundex representations of your list of words, store them, and then get a soundex of the user input and find the closest match there.
Look for the Bitap algorithm. It qualifies well for what you want to do, and even comes with a source code example in Wikipedia.
If your data set is really small, simply comparing the Levenshtein distance on all items independently ought to suffice. If it's larger, though, you'll need to use a BK-Tree or similar indexing system. The article I linked to describes how to find matches within a given Levenshtein distance, but it's fairly straightforward to adapt to do nearest-neighbor searches (and left as an exercise to the reader ;).
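Since the linked article is not reproduced here, the following is only a compact sketch of how a BK-tree lookup could look. The class `BKTree`, its tuple-based node layout and the helper `levenshtein` are my own illustrative choices, not the article's code.

```python
def levenshtein(a, b):
    """Plain edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """Each node is (word, children), with children keyed by distance to word."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of word."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_dist:
                results.append((d, candidate))
            # The triangle inequality lets us skip every other subtree.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(levenshtein)
for w in ["water", "quarter", "beer", "beet", "hell", "hello", "aardvark"]:
    tree.add(w)
print(tree.search("qater", 1))
```

A nearest-neighbour query can be approximated by calling `search` with a growing `max_dist` until something is returned.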
Though it may not solve the entire problem, you may want to consider using the soundex algorithm as part of the solution. A quick google search of "soundex" and "python" showed some python implementations of the algorithm.
Try searching for "Levenshtein distance" or "edit distance". It counts the number of edit operations (delete, insert, change letter) you need to transform one word into another. It's a common algorithm, but depending on the problem you might need something special with different weights for the different types of typos.
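As one example of such weighting, here is a sketch of an edit distance where replacing a letter with a neighbouring key costs less than any other substitution. The grid in `ROWS` is a simplified QWERTY layout that ignores the stagger between rows, and the 0.5 cost is an arbitrary assumption you would need to tune.

```python
# Simplified QWERTY grid (assumption: rows are treated as aligned columns).
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

def sub_cost(a, b):
    """0 for a match, 0.5 for adjacent keys, 1 for any other substitution."""
    if a == b:
        return 0.0
    if a in POS and b in POS:
        (r1, c1), (r2, c2) = POS[a], POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5
    return 1.0

def weighted_distance(a, b):
    """Edit distance with keyboard-aware substitution costs."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [float(i)]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1.0,                     # delete
                            curr[j - 1] + 1.0,                 # insert
                            prev[j - 1] + sub_cost(ca, cb)))   # substitute
        prev = curr
    return prev[-1]

print(weighted_distance("qater", "water"), weighted_distance("pater", "water"))
```

Here "qater" is closer to "water" (0.5) than "pater" is (1.0), because q and w are neighbouring keys.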
I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines' ability to very quickly respond to queries with "Did you mean: xxxx".
I need to be able to intelligently take a user query and respond with not only raw search results but also with a "Did you mean?" response when there is a highly likely alternative answer.
[I'm developing in ASP.NET (VB - don't hold it against me! )]
UPDATE:
OK, how can I mimic this without the millions of 'unpaid users'?
Generate typos for each 'known' or 'correct' term and perform lookups?
Some other more elegant method?
Here's the explanation directly from the source (almost):
Search 101!
at min 22:03
Worth watching!
Basically, according to Douglas Merrill, former CTO of Google, it works like this:
1) You write a (misspelled) word in Google.
2) You don't find what you wanted (you don't click on any results).
3) You realize you misspelled the word, so you rewrite it in the search box.
4) You find what you want (you click on one of the first links).
This pattern, multiplied millions of times, shows what the most common misspellings are and what the most common corrections are.
This way Google can offer spell correction almost instantaneously in every language.
This also means that if overnight everyone started to spell night as "nigth", Google would suggest that word instead.
EDIT
@ThomasRutter: Douglas describes it as "statistical machine learning".
They know who corrects the query, because they know which query comes from which user (using cookies).
If users perform a query and only 10% of them click on a result, while 90% go back and type another query (with the corrected word) and this time that 90% click on a result, then they know they have found a correction.
They can also know if those are "related" queries of two different, because they have information of all the links they show.
Furthermore, they are now including context in the spell check, so they can even suggest a different word depending on the context.
See this demo of Google Wave (@ 44m 06s) that shows how the context is taken into account to automatically correct the spelling.
Here it is explained how that natural language processing works.
And finally, here is an awesome demo of what can be done by adding automatic machine translation (@ 1h 12m 47s) to the mix.
I've added minute-and-second anchors to the videos to skip directly to the content. If they don't work, try reloading the page or scrolling by hand to the mark.
I found this article some time ago: How to Write a Spelling Corrector, written by Peter Norvig (Director of Research at Google Inc.).
It's an interesting read about the "spelling correction" topic. The examples are in Python but it's clear and simple to understand, and I think that the algorithm can be easily translated to other languages.
Below follows a short description of the algorithm.
The algorithm consists of two steps, preparation and word checking.
Step 1: Preparation - setting up the word database
It is best if you can use actual search words and their occurrence.
If you don't have that, a large set of text can be used instead.
Count the occurrence (popularity) of each word.
Step 2. Word checking - finding words that are similar to the one checked
Similar means that the edit distance is low (typically 0-1 or 0-2). The edit distance is the minimum number of inserts/deletes/changes/swaps needed to transform one word to another.
Choose the most popular word from the previous step and suggest it as a correction (if other than the word itself).
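The two steps can be sketched along the lines of Norvig's article. This is a condensed paraphrase rather than his exact code: `make_corrector` and the alphabetical tie-break are my own additions, and the corpus passed in is whatever text you have available.

```python
import re
from collections import Counter

def edits1(word):
    """All strings one insert, delete, change or swap away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    changes = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + swaps + changes + inserts)

def make_corrector(corpus_text):
    # Step 1: preparation, count how often each word occurs.
    counts = Counter(re.findall(r"[a-z]+", corpus_text.lower()))

    def known(candidates):
        return {w for w in candidates if w in counts}

    # Step 2: word checking, prefer distance 0, then 1, then 2.
    def correct(word):
        word = word.lower()
        candidates = (known([word])
                      or known(edits1(word))
                      or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                      or {word})
        # Most popular candidate wins; ties are broken alphabetically.
        return max(candidates, key=lambda w: (counts[w], w))

    return correct

correct = make_corrector("to qualify you must qualify early; quality matters")
print(correct("qualfy"))
```

With this tiny corpus the call prints `qualify`; a real setup would use a large body of text or query logs, as described above.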
For the theory of "did you mean" algorithm you can refer to Chapter 3 of Introduction to Information Retrieval. It is available online for free. Section 3.3 (page 52) exactly answers your question. And to specifically answer your update you only need a dictionary of words and nothing else (including millions of users).
Hmm... I thought that google used their vast corpus of data (the internet) to do some serious NLP (Natural Language Processing).
For example, they have so much data from the entire internet that they can count the number of times a three-word sequence occurs (known as a trigram). So if they see a sentence like: "pink frugr concert", they could see it has few hits, then find the most likely "pink * concert" in their corpus.
They apparently just do a variation of what Davide Gualano was saying, though, so definitely read that link. Google does of course use all web-pages it knows as a corpus, so that makes its algorithm particularly effective.
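The "pink * concert" idea can be illustrated with a toy trigram table. The three example sentences are invented, and `trigram_counts` and `best_middle` are hypothetical helper names; a real system would count over a very large corpus.

```python
from collections import Counter

def trigram_counts(sentences):
    """Count every three-word sequence in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:], words[2:]))
    return counts

def best_middle(counts, first, last):
    """Most frequent word seen between first and last, or None."""
    options = {tri[1]: n for tri, n in counts.items()
               if tri[0] == first and tri[2] == last}
    if not options:
        return None
    return max(options, key=lambda mid: (options[mid], mid))

corpus = ["pink floyd concert tickets",
          "the pink floyd concert",
          "a pink panther concert"]
print(best_middle(trigram_counts(corpus), "pink", "concert"))
```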
My guess is that they use a combination of a Levenshtein distance algorithm and the masses of data they collect regarding the searches that are run. They could pull a set of searches that have the shortest Levenshtein distance from the entered search string, then pick the one with the most results.
Normally a production spelling corrector utilizes several methodologies to provide a spelling suggestion. Some are:
Decide on a way to determine whether spelling correction is required. These may include insufficient results, results which are not specific or accurate enough (according to some measure), etc. Then:
Use a large body of text or a dictionary, where all, or most, words are known to be correctly spelled. These are easily found online, in places such as LingPipe. Then, to determine the best suggestion, you look for the word which is the closest match based on several measures. The most intuitive one is similar characters. What has been shown through research and experimentation is that two- or three-character sequence matches (bigrams and trigrams) work better. To further improve results, give a higher score to a match at the beginning or end of the word. For performance reasons, index all these words as trigrams or bigrams, so that when you are performing a lookup, you convert the query to n-grams and look it up via a hashtable or trie.
Use heuristics related to potential keyboard mistakes based on character location. So that "hwllo" should be "hello" because 'w' is close to 'e'.
Use a phonetic key (Soundex, Metaphone) to index the words and lookup possible corrections. In practice this normally returns worse results than using n-gram indexing, as described above.
In each case you must select the best correction from a list. This may be a distance metric such as levenshtein, the keyboard metric, etc.
For a multi-word phrase, only one word may be misspelled, in which case you can use the remaining words as context in determining a best match.
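A small sketch of the n-gram indexing step: words are indexed by their character trigrams in a hash table, and candidates are ranked by trigram overlap. The `$` padding (which makes the first and last letters produce their own trigrams), the Jaccard score and the names `trigrams` and `TrigramIndex` are illustrative assumptions, not a fixed recipe.

```python
from collections import defaultdict

def trigrams(word):
    """Character trigrams of word, padded so both ends form their own grams."""
    padded = f"${word.lower()}$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

class TrigramIndex:
    def __init__(self, words):
        self.grams = {w: trigrams(w) for w in words}
        self.index = defaultdict(set)  # trigram -> words containing it
        for word, grams in self.grams.items():
            for gram in grams:
                self.index[gram].add(word)

    def suggest(self, query, limit=3):
        q = trigrams(query)
        candidates = set()
        for gram in q:
            candidates |= self.index.get(gram, set())

        def jaccard(word):
            grams = self.grams[word]
            return len(q & grams) / len(q | grams)

        # Best overlap first; alphabetical order breaks ties.
        return sorted(candidates, key=lambda w: (-jaccard(w), w))[:limit]

index = TrigramIndex(["hello", "help", "world", "yellow"])
print(index.suggest("helo"))
```

In practice the shortlist returned by `suggest` would then be re-ranked with a distance metric such as Levenshtein or the keyboard metric.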
Use Levenshtein distance, then create a Metric Tree (or Slim tree) to index words.
Then run a 1-Nearest Neighbour query, and you got the result.
Google apparently suggests the queries with the best results, not those which are spelled correctly. But in this case, a spell corrector would probably be more feasible. Of course, you could store some value for every query, based on some metric of how good its results are.
So,
You need a dictionary (english or based on your data)
Generate a word trellis and calculate probabilities for the transitions using your dictionary.
Add a decoder to calculate the minimum error distance using your trellis. Of course, you should take care of insertions and deletions when calculating distances. Fun thing is that the QWERTY keyboard maximizes the distance if you hit keys close to each other. ("cae" would turn into "car", "cay" would turn into "cat".)
Return the word which has the minimum distance.
Then you could compare that to your query database and check if there is better results for other close matches.
Here is the best answer I found, Spelling corrector implemented and described by Google's Director of Research Peter Norvig.
If you want to read more about the theory behind this, you can read his book chapter.
The idea of this algorithm is based on statistical machine learning.
I saw something on this a few years back, so may have changed since, but apparently they started it by analysing their logs for the same users submitting very similar queries in a short space of time, and used machine learning based on how users had corrected themselves.
As a guess... it could
search for words
if it is not found use some algorithm to try to "guess" the word.
Could be something from AI like Hopfield network or back propagation network, or something else "identifying fingerprints", restoring broken data, or spelling corrections as Davide mentioned already ...
Simple. They have tons of data. They have statistics for every possible term, based on how often it is queried, and what variations of it usually yield results the users click... so, when they see you typed a frequent misspelling for a search term, they go ahead and propose the more usual answer.
Actually, if the misspelling is in effect the most frequently searched term, the algorithm will take it for the right one.
Regarding your question about how to mimic the behavior without having tons of data: why not use the tons of data collected by Google? Download the Google search results for the misspelled word and search for "Did you mean:" in the HTML.
I guess that's called mashup nowadays :-)
Apart from the above answers, in case you want to implement something by yourself quickly, here is a suggestion -
Algorithm
You can find the implementation and detailed documentation of this algorithm on GitHub.
Create a Priority Queue with a comparator.
Create a Ternary Search Tree and insert all English words (from Norvig's post) along with their frequencies.
Start traversing the TST and, for every word encountered in the TST, calculate its Levenshtein Distance (LD) from input_word.
If LD ≤ 3, then put it in the Priority Queue.
Finally, extract 10 words from the Priority Queue and display them.
Do you mean a spell checker? If it is a spell checker rather than a whole phrase, then I've got a link about spell checking where the algorithm is developed in Python. Check this link.
Meanwhile, I am also working on a project that includes searching databases using text. I guess this would solve your problem.
This is an old question, and I'm surprised that nobody suggested the OP using Apache Solr.
Apache Solr is a full-text search engine that, besides much other functionality, also provides spellchecking and query suggestions. From the documentation:
By default, the Lucene Spell checkers sort suggestions first by the
score from the string distance calculation and second by the frequency
(if available) of the suggestion in the index.
There is a specific data structure - ternary search tree - that naturally supports partial matches and near-neighbor matches.
Easiest way to figure it out is to Google dynamic programming.
It's an algorithm that's been borrowed from Information Retrieval and is used heavily in modern-day bioinformatics to see how similar two gene sequences are.
Optimal solution uses dynamic programming and recursion.
This is a very solved problem with lots of solutions. Just google around until you find some open source code.
Computer Haiku
How would you write a program
To make them for you
Measure syllables
Understand semantic flow
Your goal can be met
Do not attempt it
Poetry does not mix well
With metal and bits
More seriously, good haiku (and even bad haiku) is a lot more about condensing meaning and imagery than counting syllables. It is generally also based on themes gathered from nature. Random word generation and syllable counting will get you measured gibberish, but not poetry...
First, you'll want to look into Markov chains, and second, there's a book about computer-generated poetry called Virtual Muse.
count the syllables
randomly generate words
arrange sensibly
Haikus are easy, that I'll note
Solutions well documented, and functions rote
They're overdone and cheesy
Coding far too easy
Code me a limerick, then I'll vote
//I actually like haikus
Not all haikus have the same number of syllables, but it's a place to start.
In terms of actually picking the words, I think that parts of speech would not be the place where I would start. Instead, I would look at Markov chains, and train your vocabulary on existing haikus.
On Haiku Village, we have the technology to easily do this in a variety of ways. One idea is to simply read the global twitter feed, and detect unintentional haikus. Since the back-end also has a dictionary, it would be possible to produce questionable haikus, but I think the quality would be lacking.
I think if we had a star rating system, then I suppose machine learning could be used to decide what is 'good'.
for (i is 0
and i is less than thirteen)
print s i plus plus
"To convey one's mood in seventeen Syllables is very diffic . . ."
(The great John Cooper Clarke Check out Beasley Street, one of my favourites)
How much more diffic for a computer? Logic knows no moods :)
To make it readable, separate the dictionary into Nouns, Verbs, Adjectives, with syllable count.
Come up with some templates of the form:
[Noun] [Verb]"s"
[Verb] a(n) [Adjective] [Noun]
[adjective] [noun]
and trim your dictionaries to the beautiful words.
implement a genetic algorithm to generate haikus drawn from a dictionary annotated with syllable counts, then pay people to read and rate them as the fitness function [mechanical turk would help]. Over time your program should evolve some good ones.
EDIT:
a GA you need
evolves at CPU speed
if fitness you heed
Your program must grok
Metaphors and imagery
And be creative.
I would look up syntactical programming and linguistic and try to find libraries for grammatical structure. From there it should be a simple step to add the word count and syllable count constraints.
Some people here suggested using a dictionary and generating word sequences using a Markov Chain. That seems like a theoretically viable solution, especially if you use a high-order Markov Chain (not bi- or trigrams).
But I think in practice it would work better if you could collect a database of existing haikus and selectively change single words in them (e.g., change a given word to another, semantically related word). The existing haikus give you some kind of structure and cohesion, and you just need to (ex-)change little parts in them in order to create a new haiku (a variation on the old one).
Of course they won't be completely new haikus with this method, but at least they will be somewhat enjoyable for the readers.
Parse existing haikus in a relational order, like word xx used after yy n times.
So when creating, possibility of xx coming after yy will be (n / sum of count of all words used after yy). This way it will be selectively randomized and can still be a valid haiku.
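Here is a rough sketch of that counting scheme. The two training lines are invented stand-ins for a real collection of haikus, and the helper names are hypothetical; `random.choices` does the weighted pick, so the chance of xx following yy is n divided by the total count after yy.

```python
import random
from collections import Counter, defaultdict

def build_chain(lines):
    """chain[yy][xx] = number of times xx was used directly after yy."""
    chain = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for prev, nxt in zip(words, words[1:]):
            chain[prev][nxt] += 1
    return chain

def next_word(chain, word, rng=random):
    """Pick a follower of word with probability proportional to its count."""
    followers = chain.get(word)
    if not followers:
        return None  # word was never followed by anything
    choices, weights = zip(*followers.items())
    return rng.choices(choices, weights=weights, k=1)[0]

def generate(chain, start, max_words=5, rng=random):
    out = [start]
    while len(out) < max_words:
        nxt = next_word(chain, out[-1], rng)
        if nxt is None:
            break
        out.append(nxt)
    return out

chain = build_chain(["old pond frog jumps", "old pond still water"])
print(" ".join(generate(chain, "old")))
```

This only handles word order; syllable counts per line would still have to be enforced separately.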
Write your program to generate haikus in Japanese. It will be far easier to measure your syllable count, plus you are staying faithful to the original language of the poetry. If you have flexibility with the project, why not make the original Japanese - then show the English word-by-word literal translation by its side. It will look mysterious to say the least.
Anyways, just a different take on the problem.
Markov Sequences
A syllabic Database
Three lines of python
I'd start with some kind of dictionary file that contains a syllable count of each word in it. Then pick words from that add up to the required syllables/line
As to making it poetry, and not just random words, I have no idea.
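The syllable-counting half is easy to sketch. `SYLLABLES` below is a tiny hand-made table standing in for a real dictionary file, and the result is exactly the "random words" problem described above: 5/7/5 lines with no meaning.

```python
import random

# Hypothetical syllable dictionary; a real one would be loaded from a file.
SYLLABLES = {"pond": 1, "frog": 1, "old": 1, "rain": 1,
             "water": 2, "silent": 2, "autumn": 2, "morning": 2,
             "butterfly": 3, "yesterday": 3}

def fill_line(target, rng=random):
    """Pick random words until the line has exactly target syllables."""
    line, remaining = [], target
    while remaining > 0:
        # Only words that still fit; one-syllable words guarantee progress.
        fits = [w for w, n in SYLLABLES.items() if n <= remaining]
        word = rng.choice(fits)
        line.append(word)
        remaining -= SYLLABLES[word]
    return line

def haiku(rng=random):
    return [" ".join(fill_line(n, rng)) for n in (5, 7, 5)]

print("\n".join(haiku()))
```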
You could, in addition to using Ian's idea of syllable counts, also categorize the words by part of speech and generate phrases.
From the semantic side of the story, use sampling and Fourier transformation. Pick significant parts of some detailed description, reduced to single words, and leave it to the reader to fill in the gaps with her own imagination.
The algorithm for having a computer output high quality haiku works something like this:
Setup Phase
loop:
find the email address of a world-renowned writer of haiku
confirm that this person is willing to generate haiku on demand
until sucker^H^H^H^H^H^Hwriter is found
Execution Phase
loop:
wait for a haiku request
when a haiku request is received, email the previously-stored master and ask for a haiku
wait for the haiku to return by reply
output haiku
There are, of course, various enhancements which can be made upon this fundamental architecture. For example the setup phase can be extended to set up a pool of haiku experts. The execution phase can be used to generate haiku during idle times and cache them against future demand. The specifics of such tweaking are left as an exercise for the student.
I love this question.
It is very imaginative.
Answer Below.
Many people have suggested Markov chains, but I really don't think it would be possible. You need to know intelligently whether the syllable is a phoneme, and then you have to know where the syllable ends.
If you ever did this I would be amazed.