How does spell checker and spell fixer of Google (or any search engine) work? - algorithm

When searching for something in Google, if you misspell a word (maybe by mistake, or maybe because you really mean the non-dictionary word), Google says:
"Showing results for ..... Search instead for .......".
I am trying to figure out how this would work.
This basically means finding the closest dictionary word to the non-dictionary word entered. How does it work? One way I can guess is:
count the number of instances of each character, then scan the dictionary for a word with the same number of instances of each character (allowing a difference of ±1). But this will also return anagrams.
Is some kind of probabilistic model, such as a Markov model, of any use here? I don't understand Markov models well enough to throw the term around; it's just a very wild guess.
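The character-count idea can be sketched in a few lines of Python, which also makes the anagram problem visible. This is only an illustration of the guess described here, not how any search engine works. The function names and the tiny word list are made up, and "±1 difference" is interpreted as a total count difference of at most 1, which is an assumption.

```python
from collections import Counter

def count_difference(a, b):
    """Sum of per-character count differences between two words."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in set(ca) | set(cb))

def candidates(word, dictionary, tolerance=1):
    """Dictionary words whose character counts are within `tolerance` of `word`."""
    return [w for w in dictionary if count_difference(word, w) <= tolerance]

# Hypothetical word list, for illustration only.
words = ["listen", "silent", "list", "banana"]
print(candidates("lsiten", words))  # both "listen" and its anagram "silent" match
```

Because character counts ignore order, the misspelling "lsiten" is equally close to "listen" and "silent", which is exactly the weakness noted above.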
Any insights?

You're forgetting that Google has a lot more information available to it than you do. They track when people type in a word, don't select a result, and then do another search shortly afterwards. They then use this information to suggest better searches for you.
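As a rough sketch of that idea (not Google's actual implementation), the following Python counts how often a query with no clicked result is quickly followed by a different query. The log format, the function name `build_suggestions`, and the `max_gap` and `min_count` thresholds are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_suggestions(sessions, max_gap=60, min_count=2):
    """Map a query to its most common quick follow-up query.

    `sessions` is a list of sessions; each session is a time-ordered list of
    (timestamp_seconds, query, clicked_a_result) tuples (hypothetical format).
    """
    rewrites = defaultdict(Counter)
    for events in sessions:
        for (t1, q1, clicked1), (t2, q2, _) in zip(events, events[1:]):
            # First query got no click, and a different query followed soon after.
            if not clicked1 and q1 != q2 and t2 - t1 <= max_gap:
                rewrites[q1][q2] += 1
    suggestions = {}
    for query, followups in rewrites.items():
        best, count = followups.most_common(1)[0]
        if count >= min_count:
            suggestions[query] = best
    return suggestions

sessions = [
    [(0, "recieve", False), (5, "receive", True)],
    [(0, "recieve", False), (8, "receive", True)],
    [(0, "recieve", False), (500, "pizza", True)],  # too late to count
]
print(build_suggestions(sessions))  # {'recieve': 'receive'}
```

The `min_count` threshold keeps a single odd session from producing a suggestion; a real system would need far more data and filtering than this.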
See How does the Google "Did you mean?" Algorithm work? for a fuller explanation.
Note that this approach makes sense when you consider that Google aren't actually doing spell-checking. Instead, they are trying to work out what search term will give you the answer you are looking for. Obviously there is a lot of overlap between this and spell-checking, but it means they are not always trying to correct a search for, e.g., "Flickr".

When your search is close to earlier searches by other people that returned more results, Google shows those searches as suggestions.
This is not spell checking; it shows what other people searched for using related keywords.

Related

Search-For Utility Mainframe Algorithm

Can someone please give me some pointers on how the IBM mainframe Search-For Utility algorithm works?
How does it compare strings? What kind of matching algorithm does it use? How should I enter different strings in order to make the fewest comparisons possible?
I am using the utility but I do not know how it works, and I believe I am not using it as well as I should.
Thank you very much for your help!
Think of it as a very dumb search.
It doesn't have the capacity to enter a REGEX or anything like that. I don't think anyone will be able to tell you what algorithm is used.
Search-For uses the SuperC program to actually perform the search. What it appears to do is search line by line for a match to the string you provided. So if I do a search for:
'PIC 9(9)'
I am going to get back results for every line that has that string in it. The only way I could bring back fewer search results would be to add more to that string. So maybe search for:
'PIC 9(9).' 'PIC 9(9) VALUE' 'PIC 9(9) COMP'
Any of these three would return fewer results than the first search. But if the text you are looking for is split across two lines, like:
05 WS-SOME-VARIABLE PIC 9(9)
VALUE 123456.
a search for 'PIC 9(9) VALUE' will not return anything, but a search for 'PIC 9(9)' would.
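The line-by-line behaviour described above can be mimicked with a short Python sketch. This only illustrates plain per-line substring matching, which is what Search-For appears to do; it is not the SuperC algorithm, and the third sample line is made up.

```python
def search_lines(lines, needle):
    """Return (line_number, line) for every line containing `needle`."""
    return [(n, line) for n, line in enumerate(lines, start=1) if needle in line]

source = [
    "05 WS-SOME-VARIABLE PIC 9(9)",
    "   VALUE 123456.",
    "05 WS-OTHER-VARIABLE PIC 9(9) VALUE 0.",  # hypothetical extra line
]

print(search_lines(source, "PIC 9(9)"))        # matches lines 1 and 3
print(search_lines(source, "PIC 9(9) VALUE"))  # matches line 3 only
```

Because each line is tested on its own, the longer string never matches the declaration that continues onto the next line.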
The more specific you are, the fewer search results you will get back. Depending on what you are looking for, you may be able to get better results by using Search-For in batch, or using File-Aid instead. Every scenario is different, so without knowing exactly what you are searching for and what your requirement is, it's hard to tell you how to proceed.
You might consider IBM Developer for z, which can do regular-expression-based searches. When the Remote Systems Explorer Daemon (RSED) is set up and running on the z/OS LPAR, you can search across a single PDS or groups of PDSs using IDz filters. Very powerful. It also searches in the background, so you can do other tasks while it searches. The searches can be saved for future reference.

Regular Expression for Address/Zip/City&State

Anybody have an example of a regular expression that matches for address, zip, or [city,state]?
Update:
Admittedly, this is a weak question because I don't have enough information regarding user behavior at this point to really qualify the parameters of the problem. Here is what I'm trying to do though:
Create a search function that depending on what information has been entered in chooses one of two divergent paths, the first being address proximity search and the second being organization name search.
It is proving a difficult problem to solve, so any input out there, besides .* (okay, okay I deserved that) would be much appreciated.
Check out geocoder (http://www.rubygeocoder.com/). It will get lat/long from text input. What you could do for your search is first try to match organization names, and then try to match locations.
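A minimal Python sketch of that routing, assuming US-style input: try the organization names first, then fall back to a few simple address patterns. The function name, the set of known organization names, and the regular expressions are illustrative assumptions; real addresses are far messier, which is why a geocoding service is the more robust choice.

```python
import re

# Assumed US formats: 5-digit ZIP with optional +4, and "City, ST".
ZIP_RE = re.compile(r"^\d{5}(?:-\d{4})?$")
CITY_STATE_RE = re.compile(r"^[A-Za-z .'-]+,\s*[A-Za-z]{2}$")
STREET_RE = re.compile(r"^\d+\s+\S+")  # starts with a house number

def choose_search_path(text, organization_names):
    """Return 'organization' or 'address' for a search box entry.

    `organization_names` is a set of known names in lower case (hypothetical).
    """
    query = text.strip()
    if query.lower() in organization_names:
        return "organization"
    if ZIP_RE.match(query) or CITY_STATE_RE.match(query) or STREET_RE.match(query):
        return "address"
    return "organization"

known = {"red cross"}
for entry in ["90210", "Portland, OR", "123 Main St", "Red Cross", "Acme"]:
    print(entry, "->", choose_search_path(entry, known))
```

Anything that does not look like an address falls through to the organization search, so the patterns only need to catch the obvious cases.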
Luckily, Google figured out how to do proximity searches a while ago.

What is the best way to find all forms of a word?

If a user enters a form of the word "look" such as "looked" or "looking", how can I identify it as a modified version of the verb look? I imagine others have run into and have solved this problem before ...
This is part of a fairly complicated problem called stemming.
However, it's easier if you only want to take care of verbs. To begin with, you can try the naive lookup table approach, since the English vocabulary is not that big.
If you want something fancier, check the wiki page above.
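The naive lookup table approach could look something like this in Python. The table here holds only two verbs for illustration; a real one would be loaded from a full list of verb forms, and the names `VERB_FORMS` and `base_form` are made up.

```python
# Tiny hand-written table; a real one would cover the whole vocabulary.
VERB_FORMS = {
    "look": ["looks", "looked", "looking"],
    "go": ["goes", "went", "gone", "going"],
}

# Invert the table: every inflected form (and the base itself) maps to the base.
BASE_FORM = {form: base for base, forms in VERB_FORMS.items() for form in forms}
BASE_FORM.update({base: base for base in VERB_FORMS})

def base_form(word):
    """Return the base verb for `word`, or None if it is not in the table."""
    return BASE_FORM.get(word.lower())

print(base_form("Looking"))  # look
print(base_form("went"))     # go
print(base_form("table"))    # None
```

A table handles irregular forms such as "went" that suffix-stripping rules would miss, at the cost of having to list every form up front.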
If a regex is what you're looking for, something like look.*?\b works to match look, looked and looking.
Depending on your task, WordNet can be your friend for stuff like this. It's not a stemmer, but most stem words will return hits for what you're looking for. It also provides synonyms and a lot of other information if you care about the concept 'look' rather than the word itself.
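A small Python check of the regex approach. Note that the pattern above has no leading word boundary, so it also matches the "look" inside "outlook"; the variant below, `\blook\w*\b`, is a suggested tightening and not part of the original answer. It still matches unrelated words such as "lookup", so it is a prefix match rather than real stemming.

```python
import re

# Anchored variant: the word must start with "look".
LOOK_RE = re.compile(r"\blook\w*\b", re.IGNORECASE)

text = "I looked, she is looking; look at the outlook"
print(LOOK_RE.findall(text))  # ['looked', 'looking', 'look']
```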

How does google know if I type in redflower.jpg I mean Red Flower?

I'm curious what programming term or methodology is used when Google shows you the "did you mean" link for a word that is made up of multiple words.
For example, if I type in "redflower.jpg", it knows to break that up into Red Flower.
Is there a common paradigm for doing that sort of operation? Would a Lucene search give you that?
thanks!
If Google does not see a lot of matching results for redflower.jpg, it might then try to split the string into multiple words until it finds a lot of matching results.
It might also recognize the extension (.jpg) as an image extension and then try to find images with a similar name.
If I had to write an algorithm like this, I would use a huge EXISTING database (either a dictionary or a search engine) and then try what I said at the beginning of my post.
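Here is a minimal Python sketch of the dictionary variant of that idea: drop the file extension, then try to cut the remaining string into known words. The vocabulary is a made-up toy set, the function names are hypothetical, and a real system would score candidate splits by result counts or word frequency instead of taking the first split that works.

```python
import os
from functools import lru_cache

def segment(text, vocabulary):
    """Split `text` into words from `vocabulary`, or return None if impossible."""
    @lru_cache(maxsize=None)
    def split(rest):
        if not rest:
            return ()
        # Try longer first words before shorter ones.
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head in vocabulary:
                tail = split(rest[end:])
                if tail is not None:
                    return (head,) + tail
        return None

    result = split(text.lower())
    return list(result) if result is not None else None

def suggest(filename, vocabulary):
    """Suggest a multi-word query for a run-together file name."""
    stem, _extension = os.path.splitext(filename)
    words = segment(stem, vocabulary)
    return " ".join(words) if words else None

vocabulary = {"red", "flower", "flow", "er"}  # toy dictionary
print(suggest("redflower.jpg", vocabulary))   # red flower
```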
Perhaps they could look at what other people do when they have searched for redflower.jpg? Maybe a number of people searched for "redflower.jpg", didn't click on any links, and then searched for "Red Flower" and found some results worth clicking on.
Of course they would have to take into account that the queries are similar (contain matching strings), otherwise some strange results might appear.
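One simple way to check that two queries "contain matching strings" is a string similarity ratio. The Python sketch below uses `difflib.SequenceMatcher` from the standard library; the normalization step and the 0.6 threshold are arbitrary assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

def normalize(query):
    """Lower-case the query and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", query.lower())

def queries_are_similar(first, second, threshold=0.6):
    """True if the two queries share enough characters in the same order."""
    ratio = SequenceMatcher(None, normalize(first), normalize(second)).ratio()
    return ratio >= threshold

print(queries_are_similar("redflower.jpg", "Red Flower"))     # True
print(queries_are_similar("redflower.jpg", "cheap flights"))  # False
```

Only follow-up queries that pass a check like this would be counted as possible corrections, which filters out people who simply moved on to an unrelated search.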

Searching a datastore for related topics by keyword

For example, how does StackOverflow decide other questions are similar?
When I typed in the question above and then tabbed to this memo control I saw a list of existing questions which might be the same as the one I am asking.
What technique is used to find similar questions?
I got an email from team#stackoverflow.com on Mar 20 that mentions how it works:
the "ask a question" search is exclusively on title and will not match anything in the body. It is a mystery to me why people think it's better.
The last sentence refers to the search bar, which I've found is less useful when I'm trying to find a specific question I've already seen.
I think it's plain old word matching. However, I might add that this feature does not work as well as I would like it to. It's much better to do a Google search with the site:stackoverflow.com prefix than to rely on SO to provide relevant suggestions.
Poorly -- using MS SQL Full Text Search, I believe. You'll have better luck using Lucene, IMO. For more background on the topic see the Wikipedia article on Lucene or the general topic of information retrieval.
The matching program would store an index of all questions. When you ask a question, all keywords in your question are matched against the index. This is similar to Google Search. Lucene open source search can be (and with high probability has been) used for this. Since the results are not quite accurate, I presume they index just the headlines of the questions, as an approximation.
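A toy version of that keyword index in Python, restricted to titles as the answer presumes. This is a generic inverted-index sketch, not Stack Overflow's or Lucene's implementation; the stop-word list, function names, and sample titles are invented for illustration.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"how", "does", "the", "a", "for", "to", "is", "of", "in", "what"}

def tokenize(text):
    """Lower-case keywords of `text`, without stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def build_index(titles):
    """Map each keyword to the set of question ids whose title contains it."""
    index = defaultdict(set)
    for question_id, title in titles.items():
        for word in tokenize(title):
            index[word].add(question_id)
    return index

def similar_questions(new_title, index, limit=5):
    """Question ids ranked by the number of keywords shared with `new_title`."""
    scores = Counter()
    for word in set(tokenize(new_title)):
        for question_id in index.get(word, ()):
            scores[question_id] += 1
    return [question_id for question_id, _ in scores.most_common(limit)]

titles = {
    1: "How does the Google did you mean algorithm work",
    2: "Regular expression for zip code",
    3: "How does Google spell checker work",
}
index = build_index(titles)
print(similar_questions("google spell checker", index))  # [3, 1]
```

A real search library also weights rare words more heavily than common ones, which this simple count of shared keywords does not do.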
The other related keyword is collaborative filtering, the algorithm popularized by Amazon to recommend products based on behavior of other similar customers. In the current case, an alternative algorithm based on collaborative filtering is: keywords are extracted from the question, then tags associated (in the history) with the keywords are found. Questions which have those tags are returned. Well, experiments are needed to see whether it works well at all.
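The tag-based alternative can be sketched as follows. As the answer says, this is untested as a ranking method; the data layout (question id, keywords, tags), the function name, and the `top_tags` cut-off are assumptions made for the example.

```python
from collections import Counter

def related_by_tags(new_keywords, history, top_tags=1):
    """Find past questions that carry the tags most associated with the keywords.

    `history` is a list of (question_id, keywords, tags) tuples (hypothetical).
    """
    new_keywords = set(new_keywords)
    tag_votes = Counter()
    for _, keywords, tags in history:
        # A past question sharing any keyword votes for all of its tags.
        if new_keywords & set(keywords):
            tag_votes.update(tags)
    best_tags = {tag for tag, _ in tag_votes.most_common(top_tags)}
    return [qid for qid, _, tags in history if best_tags & set(tags)]

history = [
    (1, {"regex", "zip"}, {"regex"}),
    (2, {"stemming", "verb"}, {"nlp"}),
    (3, {"lucene", "index"}, {"search"}),
    (4, {"regex", "address"}, {"regex", "geocoding"}),
]
print(related_by_tags({"regex"}, history))  # [1, 4]
```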

Resources