How does the Google "Did you mean?" Algorithm work? [closed] - algorithm

I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines ability to very quickly respond to queries with "Did you mean: xxxx".
I need to be able to intelligently take a user query and respond not only with raw search results but also with a "Did you mean?" suggestion when there is a highly likely alternative answer.
[I'm developing in ASP.NET (VB - don't hold it against me! )]
UPDATE:
OK, how can I mimic this without the millions of 'unpaid users'?
Generate typos for each 'known' or 'correct' term and perform lookups?
Some other more elegant method?

Here's the explanation directly from the source (almost):
Search 101!
at minute 22:03
Worth watching!
Basically, according to Douglas Merrill, former CTO of Google, it works like this:
1) You type a (misspelled) word into Google.
2) You don't find what you wanted (you don't click on any results).
3) You realize you misspelled the word, so you retype it in the search box.
4) You find what you want (you click on the first links).
This pattern, multiplied millions of times, shows which misspellings are most common and what the most likely corrections are.
This way Google can offer spelling correction almost instantaneously, in every language.
It also means that if overnight everyone started to spell night as "nigth", Google would suggest that word instead.
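As a rough illustration of that log-mining idea, here is a toy sketch only (not Google's actual pipeline); the session format and the min_support threshold are invented for the example:

    from collections import Counter

    # Each session is a list of (query, clicked) events; clicked is True when the
    # user clicked a result for that query.
    def mine_corrections(sessions, min_support=3):
        pair_counts = Counter()
        for events in sessions:
            for (q1, clicked1), (q2, clicked2) in zip(events, events[1:]):
                # A retype that led to a click, right after a query that did not.
                if not clicked1 and clicked2 and q1 != q2:
                    pair_counts[(q1, q2)] += 1
        # Keep only pairs seen often enough to be a plausible (misspelling, correction).
        return {pair: n for pair, n in pair_counts.items() if n >= min_support}

    sessions = [
        [("nigth sky", False), ("night sky", True)],
        [("nigth sky", False), ("night sky", True)],
        [("nigth sky", False), ("night sky", True)],
    ]
    print(mine_corrections(sessions))   # {('nigth sky', 'night sky'): 3}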
EDIT
@ThomasRutter: Douglas describes it as "statistical machine learning".
They know who corrects the query, because they know which query comes from which user (using cookies).
If users perform a query, and only 10% of them click on a result while 90% go back and type another query (with the corrected word), and this time that 90% clicks on a result, then they know they have found a correction.
They can also tell whether two different queries are "related", because they have information about all the links they show.
Furthermore, they now include context in the spell check, so they can even suggest different words depending on the context.
See this demo of Google Wave (at 44m 06s) that shows how the context is taken into account to automatically correct the spelling.
Here it is explained how that natural language processing works.
And finally, here is an awesome demo of what can be done by adding automatic machine translation (at 1h 12m 47s) to the mix.
I've added minute-and-second anchors to the videos to skip directly to the content; if they don't work, try reloading the page or scrolling to the mark by hand.

I found this article some time ago: How to Write a Spelling Corrector, written by Peter Norvig (Director of Research at Google Inc.).
It's an interesting read about the "spelling correction" topic. The examples are in Python, but it is clear and simple to understand, and I think the algorithm can easily be translated to other languages.
Below follows a short description of the algorithm.
The algorithm consists of two steps, preparation and word checking.
Step 1: Preparation - setting up the word database
It's best if you can use actual search words and their frequency of occurrence.
If you don't have that, a large body of text can be used instead.
Count the occurrence (popularity) of each word.
Step 2: Word checking - finding words that are similar to the one being checked
Similar means that the edit distance is low (typically 0-1 or 0-2). The edit distance is the minimum number of inserts/deletes/changes/swaps needed to transform one word into another.
Choose the most popular of the similar words and suggest it as a correction (if it differs from the word itself).
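A condensed Python sketch of those two steps, closely following Norvig's post (big.txt is the corpus file his article uses; any large plain-text file works):

    import re
    from collections import Counter

    # Step 1: preparation - count word popularity from a large plain-text corpus.
    WORDS = Counter(re.findall(r'[a-z]+', open('big.txt').read().lower()))

    def edits1(word):
        """All strings at edit distance 1 (deletes, transposes, replaces, inserts)."""
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(words):
        return {w for w in words if w in WORDS}

    # Step 2: word checking - prefer the word itself, then edit distance 1, then 2,
    # and among the surviving candidates pick the most popular one.
    def correct(word):
        candidates = (known([word]) or known(edits1(word))
                      or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                      or [word])
        return max(candidates, key=lambda w: WORDS[w])

With a reasonable corpus, correct("speling") comes back as "spelling".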

For the theory behind the "did you mean" algorithm you can refer to Chapter 3 of Introduction to Information Retrieval. It is available online for free. Section 3.3 (page 52) answers your question exactly. And to specifically address your update: you only need a dictionary of words and nothing else (no millions of users required).

Hmm... I thought that Google used its vast corpus of data (the internet) to do some serious NLP (Natural Language Processing).
For example, they have so much data from the entire internet that they can count the number of times a three-word sequence occurs (known as a trigram). So if they see a sentence like "pink frugr concert", they could see it has few hits, then find the most likely "pink * concert" in their corpus.
They apparently just do a variation of what Davide Gualano was saying, though, so definitely read that link. Google does of course use all the web pages it knows as a corpus, which makes its algorithm particularly effective.
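To make the trigram idea concrete, here is a toy sketch with made-up counts standing in for a web-scale corpus (the numbers and phrases are illustrative only):

    from collections import Counter

    # Hypothetical trigram counts; a real system would derive these from a huge corpus.
    trigram_counts = Counter({
        ("pink", "floyd", "concert"): 120000,
        ("pink", "panther", "concert"): 450,
        ("pink", "frugr", "concert"): 0,
    })

    def fill_middle(first, last):
        """Return the most frequent trigram matching 'first * last', if any."""
        candidates = {t: n for t, n in trigram_counts.items()
                      if t[0] == first and t[2] == last and n > 0}
        return max(candidates, key=candidates.get, default=None)

    print(fill_middle("pink", "concert"))   # ('pink', 'floyd', 'concert')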

My guess is that they use a combination of a Levenshtein distance algorithm and the masses of data they collect regarding the searches that are run. They could pull a set of searches that have the shortest Levenshtein distance from the entered search string, then pick the one with the most results.
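For reference, here is a standard dynamic-programming Levenshtein distance, plus the ranking rule this answer suggests (smallest distance first, result count as tie-breaker); the sample query log is invented:

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance (inserts, deletes, substitutions)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # delete from a
                                curr[j - 1] + 1,      # insert into a
                                prev[j - 1] + cost))  # substitute
            prev = curr
        return prev[-1]

    def suggest(query, logged_queries_with_hits):
        """Pick the logged query with the smallest distance, then the most results."""
        return min(logged_queries_with_hits,
                   key=lambda qh: (levenshtein(query, qh[0]), -qh[1]))[0]

    print(levenshtein("qualfy", "qualify"))                               # 1
    print(suggest("qualfy", [("qualify", 90000), ("quality", 150000)]))   # qualify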

Normally a production spelling corrector utilizes several methodologies to provide a spelling suggestion. Some are:
Decide on a way to determine whether spelling correction is required. These may include insufficient results, results which are not specific or accurate enough (according to some measure), etc. Then:
Use a large body of text or a dictionary where all, or most, words are known to be correctly spelled. These are easily found online, in places such as LingPipe. Then, to determine the best suggestion, you look for the word which is the closest match based on several measures. The most intuitive measure is shared characters, but research and experimentation have shown that matching on two- or three-character sequences (bigrams and trigrams) works better. To further improve results, give a higher weight to a match at the beginning or end of the word. For performance, index all these words as trigrams or bigrams, so that when you perform a lookup you convert the input to n-grams and look them up via a hashtable or trie (see the sketch after this list).
Use heuristics related to potential keyboard mistakes based on character location, so that "hwllo" becomes "hello" because 'w' is close to 'e'.
Use a phonetic key (Soundex, Metaphone) to index the words and look up possible corrections. In practice this normally returns worse results than using n-gram indexing, as described above.
In each case you must select the best correction from a list. This may use a distance metric such as Levenshtein, the keyboard metric, etc.
For a multi-word phrase, only one word may be misspelled, in which case you can use the remaining words as context in determining a best match.
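A minimal sketch of the character-trigram indexing idea (the padding, the Dice-style scoring and the tiny dictionary are illustrative choices, not a production design):

    from collections import defaultdict

    def trigrams(word):
        padded = f"  {word} "      # padding makes matches at word boundaries count more
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    def build_index(dictionary):
        index = defaultdict(set)
        for word in dictionary:
            for g in trigrams(word):
                index[g].add(word)
        return index

    def candidates(word, index, top=5):
        scores = defaultdict(int)
        for g in trigrams(word):
            for cand in index.get(g, ()):
                scores[cand] += 1            # count shared trigrams
        # Dice-style normalisation so longer words are not unfairly favoured.
        ranked = sorted(scores, key=lambda c: 2 * scores[c] /
                        (len(trigrams(c)) + len(trigrams(word))), reverse=True)
        return ranked[:top]

    index = build_index(["hello", "help", "hollow", "yellow", "hell"])
    print(candidates("hwllo", index))        # 'hello' should rank near the top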

Use Levenshtein distance, then create a Metric Tree (or Slim tree) to index words.
Then run a 1-Nearest Neighbour query, and you have your result.
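A BK-tree is one simple kind of metric tree that works with Levenshtein distance; here is a minimal sketch of that idea (not the Slim tree mentioned above):

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    class BKTree:
        """Metric tree keyed by edit distance; supports pruned nearest-neighbour search."""
        def __init__(self, words=()):
            self.root = None                      # node = [word, {distance: child}]
            for w in words:
                self.add(w)

        def add(self, word):
            if self.root is None:
                self.root = [word, {}]
                return
            node = self.root
            while True:
                d = levenshtein(word, node[0])
                child = node[1].get(d)
                if child is None:
                    node[1][d] = [word, {}]
                    return
                node = child

        def nearest(self, word):
            """1-nearest-neighbour query, pruning subtrees via the triangle inequality."""
            best_word, best_dist = None, float("inf")
            stack = [self.root]
            while stack:
                node = stack.pop()
                d = levenshtein(word, node[0])
                if d < best_dist:
                    best_word, best_dist = node[0], d
                for edge, child in node[1].items():
                    if abs(edge - d) < best_dist:   # only these subtrees can hold something closer
                        stack.append(child)
            return best_word, best_dist

    tree = BKTree(["hello", "help", "yellow", "world"])
    print(tree.nearest("helo"))                     # ('hello', 1)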

Google apparently suggests queries with the best results, not those which are spelled correctly. But in this case a spell corrector would probably be more feasible. Of course you could store some value for every query, based on some metric of how good the results it returns are.
So,
You need a dictionary (English, or based on your data).
Generate a word trellis and calculate probabilities for the transitions using your dictionary.
Add a decoder to calculate the minimum edit distance over your trellis. Of course you should take care of insertions and deletions when calculating distances. A fun wrinkle is the QWERTY keyboard: typos tend to hit adjacent keys, so weighting those substitutions as cheaper helps ("cae" would turn into "car", "cay" would turn into "cat"); a sketch of such a keyboard-weighted distance follows below.
Return the word which has the minimum distance.
Then you could compare that to your query database and check whether there are better results for other close matches.
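A minimal sketch of a keyboard-weighted edit distance (the adjacency table is deliberately partial and the 0.5 substitution cost is an arbitrary choice for illustration):

    # Substituting two keys that are adjacent on a QWERTY keyboard costs less
    # than substituting distant keys.
    QWERTY_NEIGHBOURS = {
        'a': 'qwsz', 'e': 'wsdr', 'r': 'edft', 't': 'rfgy',
        'y': 'tghu', 'c': 'xdfv', 'v': 'cfgb',
    }

    def sub_cost(a, b):
        if a == b:
            return 0.0
        adjacent = b in QWERTY_NEIGHBOURS.get(a, '') or a in QWERTY_NEIGHBOURS.get(b, '')
        return 0.5 if adjacent else 1.0

    def keyboard_distance(a, b):
        prev = [float(j) for j in range(len(b) + 1)]
        for i, ca in enumerate(a, 1):
            curr = [float(i)]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1.0,                    # deletion
                                curr[j - 1] + 1.0,                # insertion
                                prev[j - 1] + sub_cost(ca, cb)))  # substitution
            prev = curr
        return prev[-1]

    # "cae" is closer to "car" (e and r are neighbours) than to "cat".
    print(keyboard_distance("cae", "car"), keyboard_distance("cae", "cat"))   # 0.5 1.0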

Here is the best answer I found: a spelling corrector implemented and described by Google's Director of Research, Peter Norvig.
If you want to read more about the theory behind this, you can read his book chapter.
The idea of this algorithm is based on statistical machine learning.

I saw something on this a few years back, so it may have changed since, but apparently they started by analysing their logs for the same users submitting very similar queries in a short space of time, and used machine learning based on how users had corrected themselves.

As a guess... it could:
search for the words;
if a word is not found, use some algorithm to try to "guess" it.
That could be something from AI like a Hopfield network or a back-propagation network, or something else that "identifies fingerprints", restores broken data, or does spelling corrections as Davide already mentioned...

Simple. They have tons of data. They have statistics for every possible term, based on how often it is queried and which variations of it usually yield results the users click on... so, when they see you typed a frequent misspelling of a search term, they go ahead and propose the more usual answer.
Actually, if the misspelling is in effect the most frequently searched term, the algorithm will take it as the right one.

Regarding your question of how to mimic the behaviour without having tons of data - why not use the tons of data collected by Google? Download the Google search results for the misspelled word and search for "Did you mean:" in the HTML.
I guess that's called mashup nowadays :-)

Apart from the above answers, in case you want to implement something by yourself quickly, here is a suggestion -
Algorithm
You can find the implementation and detailed documentation of this algorithm on GitHub.
Create a Priority Queue with a comparator.
Create a Ternary Search Tree and insert all English words (from Norvig's post) along with their frequencies.
Start traversing the TST, and for every word encountered, calculate its Levenshtein distance (LD) from the input word.
If LD ≤ 3, put the word in the Priority Queue.
Finally, extract 10 words from the Priority Queue and display them.
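Putting the steps above together, here is a condensed sketch (the tiny word list and frequencies are placeholders for a real dictionary such as the one in Norvig's post; tuple ordering on the heap plays the role of the comparator):

    import heapq

    class TSTNode:
        __slots__ = ("ch", "left", "eq", "right", "freq")
        def __init__(self, ch):
            self.ch, self.left, self.eq, self.right, self.freq = ch, None, None, None, None

    def tst_insert(node, word, freq, i=0):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.left = tst_insert(node.left, word, freq, i)
        elif ch > node.ch:
            node.right = tst_insert(node.right, word, freq, i)
        elif i + 1 < len(word):
            node.eq = tst_insert(node.eq, word, freq, i + 1)
        else:
            node.freq = freq                      # marks the end of a word
        return node

    def tst_words(node, prefix=""):
        if node is None:
            return
        yield from tst_words(node.left, prefix)
        if node.freq is not None:
            yield prefix + node.ch, node.freq
        yield from tst_words(node.eq, prefix + node.ch)
        yield from tst_words(node.right, prefix)

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def suggest(root, input_word, max_ld=3, top=10):
        heap = []   # priority queue: smaller LD first, then higher frequency
        for word, freq in tst_words(root):
            ld = levenshtein(input_word, word)
            if ld <= max_ld:
                heapq.heappush(heap, (ld, -freq, word))
        return [heapq.heappop(heap)[2] for _ in range(min(top, len(heap)))]

    root = None
    for w, f in [("hello", 900), ("help", 500), ("hell", 300), ("yellow", 200)]:
        root = tst_insert(root, w, f)
    print(suggest(root, "helo"))   # ['hello', 'help', 'hell', 'yellow']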

Do you mean a spell checker? If it is a spell checker for single words rather than a whole phrase, then I've got a link about spell checking where the algorithm is developed in Python. Check this link.
Meanwhile, I am also working on a project that includes searching databases using free text. I guess this would solve your problem.

This is an old question, and I'm surprised that nobody has suggested Apache Solr to the OP.
Apache Solr is a full-text search engine that, besides much other functionality, also provides spellchecking and query suggestions. From the documentation:
By default, the Lucene spell checkers sort suggestions first by the score from the string distance calculation and second by the frequency (if available) of the suggestion in the index.

There is a specific data structure - ternary search tree - that naturally supports partial matches and near-neighbor matches.

The easiest way to figure it out is to Google "dynamic programming".
It's an algorithm that has been borrowed from information retrieval and is used heavily in modern-day bioinformatics to see how similar two gene sequences are.
The optimal solution uses dynamic programming and recursion.
This is a well-studied problem with lots of solutions. Just google around until you find some open source code.

Related

How do I get a quick and dirty recognition of possible typos in .net?

I have to manually go through a long list of terms (~3500) which have been entered by users through the years. Among other things, I want to reduce the list by looking for synonyms, typos and alternate spellings.
My work will be much easier if I can group the list into clusters of possible typos before starting. I was imagining using some metric which can calculate the similarity of two terms, e.g. as a percentage, and then clustering everything which has a similarity higher than some threshold. As I am going through it manually anyway, I don't mind a high failure rate, if it keeps the whole thing simple.
Ideally, there exists some easily available library to do this for me, implemented by people who know what they are doing. If there is no such, then at least one calculating a similarity metric for a pair of strings would be great, I can manage the clustering myself.
If this is not available either, do you know of a good algorithm which is simple to implement? I was first thinking that a Hamming distance divided by word length would be a good metric, but noticed that while it will catch swapped letters, it won't handle deletions and insertions well (ptgs-1 will be caught as very similar to ptgs/1, but hematopoiesis won't be caught as very similar to haematopoiesis).
As for the requirements on the library/algorithm: it has to rely completely on spelling. I know that the usual NLP libraries don't work this way, but
there is no full text available for it to consider context.
it can't use a dictionary corpus of words, because the terms are far outside of any everyday language, frequently abbreviations of highly specialized terms.
Finally, I am most familiar with C# as a programming language, and I already have a C# pseudoscript which does some preliminary cleanup. If there is no one-step solution (feed list in, get grouped list out), I will prefer a library I can call from within a .NET program.
The whole thing should be relatively quick to learn for somebody with almost no previous knowledge in information retrieval. This will save me maybe 5-6 hours of manual work, and I don't want to spend more time than that in setting up an automated solution. OK, maybe up to 50% longer if I get the chance to learn something awesome :)
The question: What should I use, a library, or an algorithm? Which ones should I consider? If what I need is a library, how do I recognize one which is capable of delivering results based on spelling alone, as opposed to relying on context or dictionary use?
edit To clarify, I am not looking for actual semantic relatedness the way search or recommendation engines need it. I need to catch typos. So, I am looking for a metric by which mouse and rodent have zero similarity, but mouse and house have a very high similarity. And I am afraid that tools like Lucene use a metric which gets these two examples wrong (for my purposes).
Basically you are looking to cluster terms according to Semantic Relatedness.
One (hard) way to do it is to follow the Markovitch and Gabrilovich approach.
A quicker way consists of the following steps:
download wikipedia dump and an open source Information Retrieval library such as Lucene (or Lucene.NET).
Index the files.
Search each term in the index - and get a vector - denoting how relevant the term (the query) is for each document. Note that this will be a vector of size |D|, where |D| is the total number of documents in the collection.
Cluster your vectors in any clustering algorithm. Each vector represents one term from your initial list.
If you are only interested in "visual" similarity (words that are written similarly to each other) then you can settle for Levenshtein distance, but it won't be able to give you semantic relatedness of terms. For example, you won't be able to relate "fall" and "autumn".
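Since the question (see the edit above) is really about catching typos rather than semantics, here is a minimal sketch of greedy clustering on a normalised Levenshtein similarity; the 0.8 threshold is just a starting point to tune:

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def similarity(a, b):
        """1.0 for identical strings, approaching 0.0 for completely different ones."""
        return 1.0 - levenshtein(a, b) / max(len(a), len(b))

    def greedy_cluster(terms, threshold=0.8):
        clusters = []
        for term in terms:
            for cluster in clusters:
                if similarity(term, cluster[0]) >= threshold:
                    cluster.append(term)
                    break
            else:
                clusters.append([term])
        return clusters

    terms = ["hematopoiesis", "haematopoiesis", "ptgs-1", "ptgs/1", "mouse", "house"]
    for cluster in greedy_cluster(terms):
        print(cluster)
    # Groups the spelling variants together; note that "mouse" and "house" also
    # cluster, which matches the purely spelling-based behaviour asked for.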

How do I approximate "Did you mean?" without using Google?

I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
These questions are interested in how the algorithm actually works. My question is more like: Let's assume Google did not exist or maybe this feature did not exist and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Does it have a dictionary handy and guess the most probable words again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
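A minimal sketch of that decomposition step (the tiny dictionary is a stand-in for a real word list; it prefers segmentations with fewer, longer words):

    from functools import lru_cache

    DICTIONARY = {"try", "to", "reconnect", "you", "re", "connect"}

    def segment(text):
        """Split a run-together string into dictionary words, or return None."""
        @lru_cache(maxsize=None)
        def best_split(s):
            if not s:
                return []
            best = None
            for i in range(1, len(s) + 1):
                prefix = s[:i]
                if prefix in DICTIONARY:
                    rest = best_split(s[i:])
                    if rest is not None:
                        candidate = [prefix] + rest
                        # Prefer segmentations with fewer (i.e. longer) words.
                        if best is None or len(candidate) < len(best):
                            best = candidate
            return best
        return best_split(text.lower())

    print(segment("Trytoreconnectyou"))   # ['try', 'to', 'reconnect', 'you']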
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check against all 1-grams (all dictionary words) and find that the closest match is pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), then 3-grams and so on. When we try a 4-gram, we find a phrase that is at distance 0 from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here suggests clearly that Google uses spell correctors to generate its suggestions. Since Google has massive parallelization capabilities, it can accomplish this task very quickly.
There is an impressive tutorial on how it works here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html.
In a few words, it is a trade-off between query modification (at the character or word level) and increased coverage in the search documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million and the modification is only one character, therefore it is obvious that you mean apple.
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
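A minimal sketch of Charikar-style random-hyperplane LSH over character-bigram vectors (the vector size, number of bits and single hash table are simplifications; real systems use several tables and probe neighbouring buckets):

    import random
    from collections import defaultdict

    def bigram_vector(text, dim=1024):
        """Hash character bigrams into a fixed-size count vector."""
        v = [0.0] * dim
        padded = f" {text.lower()} "
        for i in range(len(padded) - 1):
            v[hash(padded[i:i + 2]) % dim] += 1.0
        return v

    def make_hyperplanes(n_bits=16, dim=1024, seed=42):
        rnd = random.Random(seed)
        return [[rnd.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def signature(vec, hyperplanes):
        """One bit per random hyperplane: the sign of the dot product (Charikar's scheme)."""
        return tuple(1 if sum(h[i] * x for i, x in enumerate(vec) if x) > 0 else 0
                     for h in hyperplanes)

    def build_lsh(terms, hyperplanes):
        buckets = defaultdict(list)
        for term in terms:
            buckets[signature(bigram_vector(term), hyperplanes)].append(term)
        return buckets

    def query(term, buckets, hyperplanes):
        # Exact-bucket lookup only; similar strings are likely, not guaranteed, to collide.
        return buckets.get(signature(bigram_vector(term), hyperplanes), [])

    planes = make_hyperplanes()
    buckets = build_lsh(["britney spears", "brittany spears", "bank of america"], planes)
    print(query("britny spears", buckets, planes))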
Note: links removed as I'm a new user - sorry.
@Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English language words they should get you very close. You can also check out the wikipedia page for Phonetic Algorithms for a list of other similar algorithms.
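For a feel of how the phonetic approach works, here is a simplified classic Soundex (Double Metaphone is considerably more involved; see the Dr. Dobb's article linked above):

    def soundex(word):
        """Simplified classic Soundex: first letter plus three digits."""
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        if not word:
            return ""
        result = word[0].upper()
        prev = codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                result += code
            if ch not in "hw":          # h and w do not separate equal codes
                prev = code
        return (result + "000")[:4]

    # Words that sound alike map to the same key, so misspellings can be matched.
    print(soundex("robert"), soundex("rupert"))   # R163 R163
    print(soundex("smith"), soundex("smyth"))     # S530 S530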
Take a look at this: How does the Google "Did you mean?" Algorithm work?

Algorithm for computing the relevance of a keyword to a short text (50 - 100 words)

I want to compute the relevance of a keyword to a short description text. What would be the best approach in terms of efficiency and ease of implementation? I am using C++.
Simple solution: Count the occurrences of the word in the text.
To do a good job, though, is a hard problem that companies like Google have been working on for years. If possible, you might want to take a look at using their technology.
To expand, try the following:
Use a dictionary (e.g. WordNet) to replace all synonyms with a common word
Detect similar words using Levenshtein distance
That's still only going to get you so far. You'll need to perform some natural language processing to truly understand what the description is about to distinguish between multiple texts containing the keyword the same number of times.
Refer to these previous Stack Overflow questions:
What are Useful Ranking Algorithms for Documents without Links (e.g. PDF, MS Documents, etc…)?
Algorithm for generating a 'top list' using word frequency.

Building or Finding a "relevant terms" suggestion feature

Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful.
For example, submitting "baseball" would return
["shortstop", "Babe Ruth", "foul ball", "steroids", ... ]
Google Sets is the best example I can find of this kind of feature, but I can't use it since they have no public API (and I won't go against their TOS). Also, single-word input doesn't garner a very diverse set of results. I'm looking for a solution that goes off on tangents.
The closest I've experimented with is using WikiPedia's API to search Categories and Backlinks, but there's no way to directly sort those results by "relevance" or "popularity". Without that, the suggestion list is massive and all over the place, which is not immediately useful and very hard to whittle down.
Using a thesaurus could also work minimally, but that would leave out any proper nouns or tangentially relevant terms (like any of the results listed above).
I would happily reuse an open service, if one exists, but I haven't found anything sufficient.
I'm looking for either a way to implement this either in-house with a decently-populated starting set, or reuse a free service that offers this.
Have a solution? Thanks ahead of time!
UPDATE: Thank you for the incredibly dense & informative answers. I'll choose a winning answer in 6 to 12 months, when I'll hopefully understand what you've all suggested =)
You might be interested in WordNet. It takes a bit of linguistic knowledge to understand the API, but basically the system is a database of meaning-based links between English words, which is more or less what you're searching for. I'm sure I can dig up more information if you want it.
Peter Norvig (director of research at Google) spoke about how they do this at Google (specifically mentioning Google Sets) in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) is much better than a complicated algorithm on a small data set.
You could look at Google's n-gram collection as a starting point. You'd start to see what concepts are grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.
If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.
The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.
It's not an easy problem, but it is useful as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.
Take a look at the following two papers:
Clustering User Queries of a Search Engine [pdf]
Topic Detection by Clustering Keywords [pdf]
Here is my attempt at a very simplified explanation:
If we have a database of past user queries, we can define a similarity function between two queries. For example: number of words in common. Now for each query in our database, we compute its similarity with each other query, and remember the k most similar queries. The non-overlapping words from these can be returned as "related terms".
We can also take this approach with a database of documents containing information users might be searching for. We can define the similarity between two search terms as the number of documents containing both divided by the number of documents containing either. To decide which terms to test, we can scan the documents and throw out words that are either too common ('and', 'the', etc.) or that are too obscure.
If our data permits, then we could see which queries led users to choosing which results, instead of comparing documents by content. For example if we had data that showed us that users searching for "Celtics" and "Lakers" both ended up clicking on espn.com, then we could call these related terms.
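A minimal sketch of the document co-occurrence measure described above (the tiny documents are invented; a real system would also filter out very common and very obscure words, as noted):

    def related_terms(target, documents, top=5):
        """Score terms by |docs containing both| / |docs containing either|."""
        doc_terms = [set(doc.lower().split()) for doc in documents]
        vocabulary = set().union(*doc_terms) - {target}
        scores = {}
        for term in vocabulary:
            both = sum(1 for terms in doc_terms if target in terms and term in terms)
            either = sum(1 for terms in doc_terms if target in terms or term in terms)
            if both:
                scores[term] = both / either
        return sorted(scores, key=scores.get, reverse=True)[:top]

    documents = [
        "celtics lakers nba finals espn",
        "lakers nba playoffs espn",
        "celtics nba espn score",
        "stock market dow jones",
    ]
    print(related_terms("celtics", documents))   # e.g. ['nba', 'espn', ...]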
If you're starting from scratch with no data about past user queries, then you can try Wikipedia, or the Bag of Words dataset as a database of documents. If you are looking for a database of user search terms and results, and if you are feeling adventurous, then you can take a look at the AOL Search Data.

Is there an algorithm that extracts meaningful tags of english text

I would like to extract a reduced collection of "meaningful" tags (10 max) out of an english text of any size.
http://tagcrowd.com/ is quite interesting but the algorithm seems very basic (just word counting)
Is there any other existing algorithm to do this?
There are existing web services for this. Three examples:
Yahoo's Term Extraction API
Topicalizer
OpenCalais
When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and it is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
Basically, this is a text categorization problem/document classification problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use this information for tagging new documents.
If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to filter out interesting words.
Going one step further, you can use WordNet to find synonyms and replace words with their synonym if the frequency of the synonym is higher.
Manning & Schütze contains a lot more introduction on text categorization.
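A minimal, library-free tf.idf sketch for picking candidate tags (the background corpus and the smoothing are illustrative; real systems use far larger corpora and stop-word lists):

    import math
    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z]{3,}", text.lower())

    def top_tags(document, corpus, k=5):
        """Rank the document's words by tf.idf against a background corpus."""
        corpus_tokens = [set(tokenize(doc)) for doc in corpus]
        tf = Counter(tokenize(document))
        total = sum(tf.values())
        n_docs = len(corpus)
        scores = {}
        for word, count in tf.items():
            df = sum(1 for tokens in corpus_tokens if word in tokens)
            idf = math.log((1 + n_docs) / (1 + df)) + 1      # smoothed idf
            scores[word] = (count / total) * idf
        return sorted(scores, key=scores.get, reverse=True)[:k]

    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat around the garden",
        "stock markets fell sharply on inflation fears",
    ]
    print(top_tags("the cat chased a red laser pointer across the mat", corpus, k=3))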
In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.
You want to do the semantic analysis of a text.
Word frequency analysis is one of the easiest ways to do semantic analysis. Unfortunately (and obviously) it is the least accurate one. It can be improved by using special dictionaries (for synonyms or word forms), "stop lists" of common words, other texts (to find those "common" words and exclude them)...
As for other algorithms they could be based on:
Syntax analysis (like trying to find the main subject and/or verb in a sentence)
Format analysis (analyzing headers, bold text, italic... where applicable)
Reference analysis (if the text is in Internet, for example, then a reference can describe it in several words... used by some search engines)
BUT... you should understand that these algorithms are merely heuristics for semantic analysis, not strict algorithms for achieving the goal.
The problem of semantic analysis is one of the main problems in Artificial Intelligence/Machine Learning studies since the first computers appeared.
Perhaps "Term Frequency - Inverse Document Frequency" TF-IDF would be useful...
You can use this in two steps:
1 - Try topic modeling algorithms:
Latent Dirichlet Allocation
Latent word Embeddings
2 - After that you can select the most representative word of every topic as a tag
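A minimal sketch of steps 1 and 2 using scikit-learn's LDA implementation (assuming scikit-learn ≥ 1.0 is available; gensim would work just as well, and the toy documents are invented):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    documents = [
        "the striker scored a late goal in the championship final",
        "the goalkeeper saved a penalty during the football match",
        "the central bank raised interest rates to curb inflation",
        "markets reacted to the new monetary policy announcement",
    ]

    # Step 1: fit a topic model over the document collection.
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    # Step 2: take the most representative word of every topic as a tag.
    words = vectorizer.get_feature_names_out()
    for topic in lda.components_:
        print(words[topic.argmax()])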
