Detecting misspelled words - algorithm

I have a list of airport names, and my users can enter an airport name to select it for further processing.
How would you handle misspelled names and present a list of suggestions?

Look up Levenshtein distances to match a correct name against a given user input.
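As a minimal sketch of that idea, the function below computes the Levenshtein distance with the usual dynamic-programming table and ranks a list of names by distance to the input. The names `levenshtein`, `suggest` and the `AIRPORTS` sample list are made up for illustration; a real list would come from your own data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def suggest(user_input, names, max_results=3):
    """Return the names closest to the input, best match first."""
    query = user_input.casefold()
    ranked = sorted(names, key=lambda n: levenshtein(query, n.casefold()))
    return ranked[:max_results]


# Sample data for illustration only.
AIRPORTS = ["Heathrow", "Gatwick", "Schiphol", "Charles de Gaulle"]
print(suggest("Heatrow", AIRPORTS))
```

In practice you would probably also cap the distance, so that an input that matches nothing does not produce arbitrary suggestions.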

http://norvig.com/spell-correct.html
It does something like Levenshtein but, because it only considers candidates within a small edit distance of the input rather than going all the way, it's more efficient.
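A rough sketch in the spirit of that article: generate every string one edit away from the input and keep those that are known names. This is a simplified illustration, not Norvig's code; the function names and the sample names in the usage line are assumptions, and a space is added to the alphabet because airport names can contain one.

```python
import string


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase + " "
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(user_input, known_names):
    """Known names equal to the input or exactly one edit away from it."""
    lookup = {name.casefold(): name for name in known_names}
    word = user_input.casefold()
    if word in lookup:
        return [lookup[word]]
    return sorted(lookup[e] for e in edits1(word) if e in lookup)


print(candidates("gatwik", ["Gatwick", "Heathrow"]))
```

The work per query depends on the length of the input, not on the size of the name list, which is where the efficiency comes from.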

Employ spell check in your code. The list of words should contain only correct spellings of airports.
This is not a great way to do this. You should either go for a control that provides an autocomplete option or a drop-down, as someone else suggested.
Use AJAX if your technology supports it.

I know it's not what you asked, but if this is an application where getting the right airport is important (e.g. booking tickets), then you might want to have a confirmation stage to make sure you have the right one. There have been cases of people getting tickets for the wrong Sydney, for instance.

It may be better to let the user select from the list of airport names instead of letting them type in their own. No mistakes can be made that way.

While it won't help right away, you could keep track of typos and record which correct name the user finally enters after each one. That way you can learn the most common typos and offer the best suggestions.
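One way this could look, as a sketch only: count, for each rejected input, which name the user ended up choosing, and suggest the most frequent choices next time. `TypoTracker` and its methods are hypothetical names, the data lives in memory here where a real application would persist it, and the place names in the usage lines are just sample strings.

```python
from collections import Counter, defaultdict


class TypoTracker:
    """Remembers which correct name users picked after a failed input."""

    def __init__(self):
        # rejected input (case-folded) -> Counter of final choices
        self._corrections = defaultdict(Counter)

    def record(self, rejected_input, final_choice):
        self._corrections[rejected_input.casefold()][final_choice] += 1

    def suggestions(self, user_input, limit=3):
        counts = self._corrections.get(user_input.casefold())
        if not counts:
            return []
        return [name for name, _ in counts.most_common(limit)]


tracker = TypoTracker()
tracker.record("sidney", "Sydney")
tracker.record("sidney", "Sydney")
tracker.record("sidney", "Sidney")
print(tracker.suggestions("Sidney"))
```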

Adding to Kevin's suggestion, it might be the best of both worlds if you use an input box with JavaScript autocomplete, such as jQuery Autocomplete.
edit: danish beat me :(

There may be an existing spell-check library you can use. The code to do this sort of thing well is non-trivial. If you do want to write this yourself, you might want to look at dictionary tries.
One method that may work is to just generate a huge list of possible error words and their corrections (here's an implementation in Python), which you could cache for greater performance.
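For the dictionary trie mentioned above, a minimal sketch might look like the following. It only does case-insensitive prefix lookup, which is the part that helps with suggestions as the user types; the class and method names are assumptions, and the names inserted in the usage lines are sample data.

```python
class Trie:
    """Character trie over case-folded names; each node may end a name."""

    def __init__(self):
        self.children = {}
        self.name = None  # original spelling if a name ends at this node

    def insert(self, name):
        node = self
        for ch in name.casefold():
            node = node.children.setdefault(ch, Trie())
        node.name = name

    def with_prefix(self, prefix):
        """All stored names that start with prefix, in sorted order."""
        node = self
        for ch in prefix.casefold():
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            if current.name is not None:
                found.append(current.name)
            stack.extend(current.children.values())
        return sorted(found)


trie = Trie()
for airport in ["Gatwick", "Geneva", "Heathrow"]:
    trie.insert(airport)
print(trie.with_prefix("g"))
```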

Related

What's best UX practice for a OR/AND search box?

I am creating a search box that allows searching for 2 terms, separated by ; but I want to give the user the option to choose between searching only the profiles that have BOTH terms (if "all" is checked) and searching the profiles that have at least one of the 2 terms (if "any" is checked). In other words, you can consider that semicolon to be either an AND or an OR between the 2 terms entered in the search textbox.
In the screenshot you can see the 2 instances of the checkbox at this moment.
Someone implied that it's not very intuitive, and the learning curve for a new user is pretty high... => My question: Is there any other best UX practice for such a search box?
thank you in advance :)
I agree, it does not look very intuitive. I think an input with a dropdown would be a standard way to do this. Bootstrap has a simple one you can use:
https://getbootstrap.com/docs/4.0/components/input-group/#buttons-with-dropdowns
The problem with the toggle you've used in your mocks is that the user doesn't know up front what options they have. Something like this is more intuitive:
Also, why do you need the ; in the first place? Can't this just be a whitespace?
This feels very confusing
Does the user have to type two terms in? Why are they in the same box?
If it is always exactly two entries, I would do something like this:
Otherwise, I think this is more human friendly. Having a user think through "and" versus "or" logic might not be a good UX for them (those terms are used a lot in computer science but not in other day-to-day things), so this avoids that:

Find a single specific place with the Google Places API

I was wondering if it is possible to find a specific place using the Google Places API.
I know the name of the place, the address or website url, and the coordinates.
I need this to get the ratings this place has.
Is this possible? If not, is it going to be?
I think your best bet would be to do a nearbysearch with the location (lat,long), a small radius, and the name and types parameters to narrow it down. If you are targeting a specific place, then you can just manually find it in the results and use its reference for a Details request in your solution.
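A sketch of how such a request URL could be assembled is below. It only builds the URL and sends nothing. The endpoint and the `location`, `radius`, `name`, `types` and `key` parameters are those of the Places API web service as described in this answer; the API may have changed since, so check the current documentation. The helper name, the coordinates, the place name and `YOUR_API_KEY` are placeholders.

```python
from urllib.parse import urlencode

NEARBY_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"


def build_nearby_search_url(lat, lng, name, place_type, api_key, radius_m=50):
    """Build a Nearby Search URL narrowed by position, name and type."""
    params = {
        "location": f"{lat},{lng}",  # latitude,longitude of the known place
        "radius": radius_m,          # small radius in metres
        "name": name,
        "types": place_type,         # parameter name as used in the answer
        "key": api_key,
    }
    return f"{NEARBY_SEARCH_URL}?{urlencode(params)}"


# Placeholder values for illustration only.
print(build_nearby_search_url(48.8584, 2.2945, "Example Cafe", "cafe", "YOUR_API_KEY"))
```

You would then pick the matching entry from the JSON results and use its identifier for the Details request.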
If the target place can be dynamic, for example based on user input, then you might want to show the user the list of results and let them choose the correct one. I don't think there's a way to guarantee that you will always get exactly the result you're looking for as, say, the first result in the list. Experiment with different types of requests and parameters and try to get a sense for the behaviour of the responses to find what will work best for your solution.

What is the best way to find all forms of a word?

If a user enters a form of the word "look" such as "looked" or "looking", how can I identify it as a modified version of the verb look? I imagine others have run into and have solved this problem before ...
This is part of a fairly complicated problem called stemming.
However, it's easier if you only want to take care of verbs. To begin with, you can try the naive lookup table approach, since English vocabulary is not that big.
If you want something fancier, check the wiki page above.
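The naive lookup table could be as simple as the sketch below: a hand-written table from each base verb to its forms, inverted once so that any form maps back to its base. The two entries are only examples, and `VERB_FORMS`, `BASE_FORM` and `base_verb` are made-up names; a real table would have to be filled in from a word list.

```python
# Hand-maintained table: base verb -> all of its inflected forms.
VERB_FORMS = {
    "look": {"look", "looks", "looked", "looking"},
    "go": {"go", "goes", "went", "gone", "going"},
}

# Inverted once at import time: inflected form -> base verb.
BASE_FORM = {form: base for base, forms in VERB_FORMS.items() for form in forms}


def base_verb(word):
    """Return the base verb for a known form, or None if it is unknown."""
    return BASE_FORM.get(word.casefold())


print(base_verb("Looking"))  # look
print(base_verb("went"))     # go
```

Unlike suffix stripping, a table like this handles irregular forms such as "went", at the cost of having to list every verb you care about.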
If a regex is what you're looking for, something like look.*?\b works to match look, looked and looking.
Depending on your task, WordNet can be your friend for stuff like this. It's not a stemmer, but most stem words will return hits for what you're looking for. It also provides synonyms and a lot of other information if you care about the concept 'look' rather than the word itself.
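Here is that pattern tried out with Python's `re` module, next to a stricter variant. The stricter pattern is my own assumption about what is usually wanted: it anchors the start of the word and lists the suffixes explicitly, because the loose pattern also fires on words that merely contain "look".

```python
import re

LOOSE = re.compile(r"look.*?\b")                # pattern from the answer
STRICT = re.compile(r"\blook(?:s|ed|ing)?\b")   # assumed stricter variant

text = "I looked, she is looking, they look"
print(LOOSE.findall(text))   # ['looked', 'looking', 'look']
print(STRICT.findall(text))  # ['looked', 'looking', 'look']

# The loose pattern also matches inside unrelated words.
other = "outlook lookout looked"
print(LOOSE.findall(other))   # ['look', 'lookout', 'looked']
print(STRICT.findall(other))  # ['looked']
```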

How do the spell checker and spell fixer of Google (or any search engine) work?

When searching for something in Google, if you misspell a word (maybe by mistake, or maybe because you really mean this non-dictionary word), Google says:
"Showing results for ..... Search instead for .......".
I am trying to figure out how this would work.
This basically means being able to find the closest dictionary word to the non-dictionary word entered. How does it work? One way I can guess is:
count the number of instances of each character and then scan the dictionary to find a word with the same number of instances of each character (allowing only a +-1 difference). But this will also return anagrams.
Is some kind of probabilistic model, such as a Markov model, of any use here? I don't understand Markov models well enough to throw the term around, so this is just a very wild guess.
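To make the character-count idea and its anagram problem concrete, here is a small sketch. It interprets "+-1 difference" as a total difference of at most one letter across the whole word, which is only one possible reading; the function names and the word list are made up for the example.

```python
from collections import Counter


def letter_count_distance(a, b):
    """Total number of letters that are surplus in one word versus the other."""
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values())


def close_by_letter_counts(word, dictionary, tolerance=1):
    """Dictionary words whose letter counts differ from word by <= tolerance."""
    return [w for w in dictionary if letter_count_distance(word, w) <= tolerance]


words = ["from", "farm", "for", "zebra"]
print(close_by_letter_counts("form", words))  # ['from', 'for']
```

"from" is returned with a distance of 0 even though it is a different word, which is exactly the anagram problem: letter counts ignore the order of the letters.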
Any insights?
You're forgetting that Google has a lot more information available to it than you do. They track when people type in a word, don't select a result, and then do another search shortly afterwards. They then use this information to suggest better searches for you.
See How does the Google "Did you mean?" Algorithm work? for a fuller explanation.
Note that this approach makes sense when you consider that Google aren't actually doing spell-checking. Instead, they are trying to work out what search term will give you the answer you are looking for. Obviously there is a lot of overlap between this and spell-checking, but it means they are not always trying to correct a search for, e.g., "Flickr".
When your search is close to other searches that were performed earlier and returned more results, Google suggests those searches instead.
We can be fairly sure that it is not spell checking as such; rather, it shows what other people searched for with related keywords.

Algorithms recognizing physical address on a webpage

What are the best algorithms for recognizing structured data on an HTML page?
For example Google will recognize the address of home/company in an email, and offers a map to this address.
A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.
If you have the markup proper, and not just the text from the page, I second the Beautiful Soup suggestion. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only fall back to regexes if the first two didn't pull enough info or I didn't have the necessary data to look for the first two.
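As a sketch of the address-tag idea, the code below collects the text inside each `<address>` element. To stay dependency-free it uses the standard library's `html.parser` rather than Beautiful Soup, so it is a stand-in for the suggestion above, not the same tool. The class and function names and the sample HTML are made up.

```python
from html.parser import HTMLParser


class AddressExtractor(HTMLParser):
    """Collects the visible text inside every <address> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while inside an <address> element
        self._parts = []
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag == "address":
            self._depth += 1
        elif tag == "br" and self._depth:
            self._parts.append(" ")  # treat a line break as a space

    def handle_endtag(self, tag):
        if tag == "address" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                self.addresses.append(" ".join("".join(self._parts).split()))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_addresses(html):
    parser = AddressExtractor()
    parser.feed(html)
    parser.close()
    return parser.addresses


sample = "<p>Visit us:</p><address>1 Example Street<br>Sampletown</address>"
print(extract_addresses(sample))  # ['1 Example Street Sampletown']
```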
If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.
I'd guess that Google takes a two-step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up that string and see if they get any matches. If they do, it's probably an address; if they don't, it probably isn't. If you can use a map database in your code, that will probably make your life easier.
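A toy version of that two-step approach, under heavy assumptions: a deliberately broad regex proposes candidates, and a small in-memory set stands in for the map database. The pattern only covers a simple "number, capitalised words, street type" shape, and `KNOWN_STREETS` and all names here are hypothetical; a real system would query an actual geocoder or gazetteer.

```python
import re

# Step 1: broad pattern for "house number + capitalised words + street type".
CANDIDATE = re.compile(
    r"\b\d+\s+(?:[A-Z][a-z]+\s+)+(?:Street|Boulevard|Avenue|Road)\b"
)

# Step 2: hypothetical stand-in for a map database lookup.
KNOWN_STREETS = {"example street", "sample boulevard"}


def street_of(candidate):
    """Drop the leading house number and normalise case."""
    return candidate.split(maxsplit=1)[1].casefold()


def find_addresses(text, known_streets=KNOWN_STREETS):
    """Keep only the regex candidates that the 'map database' confirms."""
    return [c for c in CANDIDATE.findall(text) if street_of(c) in known_streets]


text = "Meet at 12 Example Street, not at 99 Imaginary Road."
print(CANDIDATE.findall(text))  # ['12 Example Street', '99 Imaginary Road']
print(find_addresses(text))     # ['12 Example Street']
```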
Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
Do not use regular expressions on the raw HTML. Use an existing HTML parser; in Python, for example, I strongly recommend BeautifulSoup. This holds even if you then use a regular expression on the contents of the HTML elements BeautifulSoup grabs.
If you do it with your own regexes, you not only have to worry about finding the data you require, you also have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.
What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that will get it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several cases there is no single unambiguous answer. Most web sites that I've seen do a pretty bad job handling all but the simplest URLs.
If you want to go down the regexp route, your best bet is probably to check out the source code of
http://metacpan.org/pod/Regexp::Common::URI::http
Again, regular expressions should do the trick.
Because of the wide variety of addresses, you can only guess whether a string is an address or not, using an expression like "(number), (name) Street|Boulevard|Main", etc.
You can consider looking into some Firefox extensions which aim to map addresses found in text, to see how they work.
You can check this USA extraction example: http://code.google.com/p/graph-expression/wiki/USAAddressExtraction
It depends upon your requirements.
For email and contact details, regex is more than enough.
For addresses, regex alone will not help. Think about NLP (NER) and POS tagging.
For finding people-related information, you can't do anything without NER.
If you need information like paragraphs, get the contents by using tags.

Resources