What algorithm to use to match beginning of strings

I have a lot of strings that I would like to match against a search term.
Example:
folks
fort
garage
grabbed
grandmother
habit
happily
harry
heading
hunter
I'd like to search for the string "ha" and have the algorithm return the position in the list where strings begin with "ha", in this case "habit".
Of course I don't want to go one by one, since the list is huge. I can do some preprocessing to sort the list or put it into a structure that makes this sort of search fast.
Any suggestions?

Well, you want a sorted structure of some type. You could get away with a TreeMap or a radix tree (the radix tree will save you some space). The overhead of this will be the sort operation, or the overhead of inserting into a sorted data structure. However, once sorted, a binary search will give you log N + 1 worst-case lookup performance.
Of note Lucene uses Radix Trees afaik
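For illustration, a rough Python sketch of the sort-then-binary-search approach (the sentinel trick below assumes none of your strings contain the character U+10FFFF):

    import bisect

    words = sorted(["folks", "fort", "garage", "grabbed", "grandmother",
                    "habit", "happily", "harry", "heading", "hunter"])

    def prefix_range(sorted_words, prefix):
        # two binary searches give the half-open range of words with this prefix
        lo = bisect.bisect_left(sorted_words, prefix)
        hi = bisect.bisect_left(sorted_words, prefix + chr(0x10FFFF))
        return lo, hi

    lo, hi = prefix_range(words, "ha")
    print(lo, words[lo:hi])   # 5 ['habit', 'happily', 'harry']

The preprocessing cost is one sort; every query after that is just two binary searches.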

You can always look at Patricia Trees. They are almost perfectly suited for this kind of thing.

A Trie is what you are looking for.
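As a minimal sketch of what that looks like (nested dicts, with '$' as an end-of-word marker; a real implementation would be more careful about memory):

    def build_trie(words):
        # each node is a dict from character to child node
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node['$'] = True      # end-of-word marker
        return root

    def words_with_prefix(trie, prefix):
        node = trie
        for ch in prefix:
            if ch not in node:
                return            # no stored word has this prefix
            node = node[ch]
        stack = [(node, prefix)]
        while stack:              # walk the subtree under the prefix
            node, word = stack.pop()
            if '$' in node:
                yield word
            for ch, child in node.items():
                if ch != '$':
                    stack.append((child, word + ch))

    trie = build_trie(["folks", "fort", "habit", "happily", "harry"])
    print(sorted(words_with_prefix(trie, "ha")))   # ['habit', 'happily', 'harry']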

Your post leaves too many questions unanswered. My interpretation is that you want to create a dictionary from an unordered list of words. But then when you search for ha, what is it that you really want?
Do you want:
1. the first word that starts with ha?
2. the index of the first word that starts with ha?
3. to have easy access to all the words that start with ha?
If you want 1 and/or 3, then the person who says trie is correct. (The link I give you has an easy to read implementation).
If 2 is what you want, then can you talk about a use-case? If not, then you are looking at using a string search algorithm. Without more details, it's difficult to give more precise advice.

Your question has many fuzzy areas. Depending on exactly what your requirements are you might find that the Rabin-Karp string searching method is of use to you.

Related

What's the best data structure to store a string?

Recently I had an interview and I was asked this question.
Given a string which can have insert, delete and substring functions.
The substring function returns the string from a start index to an end index, which are given as parameters.
All three operations come in random order; what is the most efficient data structure to use?
I'm assuming the insert and delete operations here can be carried out in the middle of the string, not just at the end. Otherwise anything like a C++ vector or a Python list is good enough.
If they can, the rope data structure is a very good candidate. It supports all of those operations in O(log N), which I think is the best anyone could hope for. It's a good choice for editors, or for manipulating huge strings, genome data for example.
Another related, and more common, choice for editors is the gap buffer.
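For illustration, a minimal gap-buffer sketch in Python (a toy, not production code; the rope has the better asymptotics, but the gap buffer is easier to show in a few lines):

    class GapBuffer:
        def __init__(self, text=""):
            self.before = list(text)   # characters left of the cursor
            self.after = []            # characters right of the cursor, reversed

        def move_to(self, index):
            # shuffle characters across the gap until the cursor sits at `index`
            while len(self.before) > index:
                self.after.append(self.before.pop())
            while len(self.before) < index and self.after:
                self.before.append(self.after.pop())

        def insert(self, text):
            self.before.extend(text)   # cheap: appends at the gap

        def delete(self, count=1):
            for _ in range(count):     # delete characters right of the cursor
                if self.after:
                    self.after.pop()

        def text(self):
            return "".join(self.before) + "".join(reversed(self.after))

        def substring(self, start, end):
            return self.text()[start:end]

    buf = GapBuffer("hello world")
    buf.move_to(5)
    buf.insert(",")
    print(buf.text())   # hello, world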

What are data structures behind letter by letter search?

Recently I've been reading some information regarding various data structures and their use in practice. I am especially interested in those that are used in searching, for example search suggestions from Google, or search in Windows.
If the text is fully typed, something like a hash table should work to find it in O(1), because we assume the entries are already in the hash table. However, what happens when we type letter by letter and the search runs on letter 1, letters [1-2], [1-3], and so on? Is some kind of suffix array or trie used in the process?
I think this page describing "string searching" or "string matching" algorithms is what you are looking for:
https://en.wikipedia.org/wiki/String_searching_algorithm

find repeated word in infinite stream of words

You are given an infinite supply of words, arriving one by one; the length of the words can be huge and you don't know in advance how big it is. How will you find out whether a new word is repeated, and what data structure will you use to store the words? This was the question asked to me in the interview; please help me verify my answer.
Normally you would use a hash table to keep track of the count of each word. Since you only have to answer whether the words are duplicated, you can reduce the word count to a bitmask, so that you only store a single bit for each hash index.
If the question is related to big data, like how to write a search engine for Google, your answer may need to relate to MapReduce or similar distributed techniques (which take root somewhat in the same hash-table techniques as described above).
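A tiny sketch of that bitmask idea (hash each word to a bit index and keep one bit per slot; like any single-hash scheme, a "seen" answer can occasionally be a false positive):

    import hashlib

    class SeenFilter:
        def __init__(self, num_bits=1 << 24):
            self.num_bits = num_bits
            self.bits = bytearray(num_bits // 8)   # one bit per hash index

        def _index(self, word):
            digest = hashlib.blake2b(word.encode()).digest()
            return int.from_bytes(digest[:8], "big") % self.num_bits

        def seen_before(self, word):
            i = self._index(word)
            byte, mask = i // 8, 1 << (i % 8)
            already = bool(self.bits[byte] & mask)
            self.bits[byte] |= mask                # remember the word
            return already

    f = SeenFilter()
    print(f.seen_before("stream"))   # False
    print(f.seen_before("stream"))   # True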
As with most sequential data, a trie would be a good choice here. Using a trie you can store new words very cost-efficiently and still be sure to find new words. Tries can actually be seen as a form of multiple hashing of the words. If this still leads to problems because the size of the words is too big, you can make it more efficient by producing a directed acyclic word graph (DAWG) from the words in order to merge common suffixes as well as prefixes.
If all you need to do is efficiently detect if each word is one you've seen before, a Bloom filter is one nice option. It's kind of like a set and a hash table combined in one, and therefore can result in false positives -- for this reason they are sometimes adapted to use additional techniques to reduce that risk. The advantage of Bloom filters is that they are very space efficient (important if you really don't know how large the list will be). They are also fast. On the downside, you can't get the words out again, you can only tell whether you've seen them or not.
There's a nice description at: http://en.wikipedia.org/wiki/Bloom_filter.
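A minimal Bloom filter sketch (here the k hash functions are derived by salting a single digest, which is just one common convention):

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 20, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _indexes(self, item):
            for salt in range(self.num_hashes):
                digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item):
            for i in self._indexes(item):
                self.bits[i // 8] |= 1 << (i % 8)

        def __contains__(self, item):
            # all bits set -> probably seen; any bit clear -> definitely not seen
            return all(self.bits[i // 8] & (1 << (i % 8))
                       for i in self._indexes(item))

    bf = BloomFilter()
    bf.add("grandmother")
    print("grandmother" in bf)   # True
    print("garage" in bf)        # almost certainly False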

How do I approximate "Did you mean?" without using Google?

I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
These questions are interested in how the algorithm actually works. My question is more like: Let's assume Google did not exist or maybe this feature did not exist and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Have a dictionary handy and guess the most probable words again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
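A condensed sketch in the spirit of that post, assuming you have some corpus file to count word frequencies from (corpus.txt below is a placeholder name):

    import re
    from collections import Counter

    # word frequencies from a mostly-correct corpus, e.g. Project Gutenberg text
    WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))

    def edits1(word):
        # every string one delete, transpose, replace or insert away
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correction(word):
        # prefer the word itself if known, else the most frequent close edit
        candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
        return max(candidates, key=lambda w: WORDS[w])

    print(correction("qualfy"))   # 'qualify', if the corpus contains it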
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check with all 1-grams (all dictionary words) and find a closest match that's pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), and then 3-grams and so on. When we try a 4-gram, we find that there is a phrase that is at 0 distance from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here suggests clearly that Google uses spell correctors to generate its suggestions. Since Google has massive parallelization capabilities, it can accomplish this task very quickly.
An impressive tutorial on how it works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html.
In a few words, it is a trade-off between query modification (on the character or word level) and increased coverage in the search documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million, and the modification is only one character; therefore it is obvious that you mean "apple".
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
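As a rough sketch of the fingerprinting side of that idea, here is a Charikar-style simhash over character trigrams; strings that share most of their trigrams end up with fingerprints that differ in only a few bits, which is what the LSH buckets exploit:

    import hashlib

    def trigrams(s):
        s = "  " + s.lower() + " "          # pad so short strings still yield trigrams
        return [s[i:i + 3] for i in range(len(s) - 2)]

    def simhash(s, num_bits=64):
        # each trigram votes +1/-1 on every bit according to its hash;
        # the sign of each running total becomes one bit of the fingerprint
        totals = [0] * num_bits
        for gram in trigrams(s):
            digest = int.from_bytes(
                hashlib.blake2b(gram.encode(), digest_size=8).digest(), "big")
            for bit in range(num_bits):
                totals[bit] += 1 if (digest >> bit) & 1 else -1
        return sum(1 << bit for bit in range(num_bits) if totals[bit] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    query = simhash("granmother")           # a typo for "grandmother"
    for term in ["grandmother", "grabbed", "garage"]:
        print(term, hamming(simhash(term), query))

The real LSH step then consists of bucketing the fingerprints (for example by bit bands) so a query is only compared against a handful of candidates instead of the whole term list.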
#Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English language words they should get you very close. You can also check out the wikipedia page for Phonetic Algorithms for a list of other similar algorithms.
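For reference, a simplified classic Soundex in Python (it glosses over the h/w separator rules of the official definition, but shows the idea):

    def soundex(word):
        codes = {}
        for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"),
                               ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
            for ch in letters:
                codes[ch] = digit

        word = word.lower()
        result = word[0].upper()               # keep the first letter
        previous = codes.get(word[0], "")
        for ch in word[1:]:
            digit = codes.get(ch, "")
            if digit and digit != previous:    # skip vowels, collapse repeats
                result += digit
            previous = digit
        return (result + "000")[:4]            # pad/truncate to 4 characters

    print(soundex("Robert"), soundex("Rupert"))    # R163 R163
    print(soundex("qualify"), soundex("qualfy"))   # Q410 Q410

Words that sound alike (or misspellings that preserve the sound) collapse to the same code, so you can bucket your dictionary by Soundex code and only run edit distance within the matching bucket.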
Take a look at this: How does the Google "Did you mean?" Algorithm work?

Need a highly efficient algorithm to check if a string contains English speech

I have got many strings. All of them contain only characters. Characters and words are not separated from each other by spaces. Some of the characters form English words and others are just bufflegab. The strings may not contain a whole sentence.
I need to find out which of them are written in valid English speech. What I mean by that is that the string could be built by concatenating well-written English words. I know I could do something with a wordlist, but the words are not separated from each other, so it could be very time-consuming to test every possible word combination.
I am searching for a high-performance algorithm or method that checks whether the strings are built of English words. Even something that just gives me the likelihood that a string contains English speech would help.
Do you know a method or algorithm that helps me?
Does something like Sphinx help me?
This is called the segmentation problem.
There is no trivial way to solve this. What I can suggest to you, based on my guess of your knowledge level, is to build a trie out of your dictionary, and at the first chance you detect a possible word, try assuming that it is the word.
If, later on, you find out that the remaining part of the string is gibberish, then you backtrack to the last time you decided a sequence of letters was a word, and ignore that word.
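A small backtracking sketch of that idea (using a plain set as the dictionary for brevity; a trie lets you abandon dead-end prefixes earlier, and memoization avoids re-solving the same suffix twice):

    def segment(text, dictionary):
        if not text:
            return []
        for end in range(len(text), 0, -1):       # try longer words first
            word = text[:end]
            if word in dictionary:
                rest = segment(text[end:], dictionary)
                if rest is not None:              # the remainder worked out
                    return [word] + rest
        return None   # no way to split this text into dictionary words

    words = {"try", "to", "reconnect", "you", "a", "habit"}
    print(segment("trytoreconnectyou", words))   # ['try', 'to', 'reconnect', 'you']
    print(segment("trytoxyz", words))            # None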
If your strings are long enough or your bufflegab strange enough, letter frequencies - possibly also bigram frequencies, trigram frequencies, etc. - might be sufficient (instead of the more general N-grams). For example, some browsers use that to guess the code page.
Check N-gram language model.
See http://en.wikipedia.org/wiki/N-gram
Sphinx probably won't help you. Try the Rabin-Karp algorithm. It is awful for standard search but should work well for this particular problem. Basically, you'll want to have a dictionary of English words and search the string against it. An overly large dictionary will still be pretty slow, but if you use a small dictionary for common words and switch to a big one only when you hit common words, you probably still won't get too many false negatives.
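A hedged sketch of that dictionary-driven Rabin-Karp idea: one rolling-hash pass per distinct word length in the dictionary, with every hash hit re-checked against the real word set so collisions don't produce false matches:

    def poly_hash(s, base, mod):
        value = 0
        for ch in s:
            value = (value * base + ord(ch)) % mod
        return value

    def find_dictionary_words(text, dictionary, base=257, mod=(1 << 61) - 1):
        hits = []
        by_length = {}
        for word in dictionary:
            by_length.setdefault(len(word), set()).add(word)
        for length, words in by_length.items():
            if length == 0 or length > len(text):
                continue
            targets = {poly_hash(w, base, mod) for w in words}
            high = pow(base, length - 1, mod)      # weight of the leading character
            h = poly_hash(text[:length], base, mod)
            for start in range(len(text) - length + 1):
                window = text[start:start + length]
                if h in targets and window in words:
                    hits.append((start, window))
                if start + length < len(text):     # roll the hash one character
                    h = (h - ord(text[start]) * high) % mod
                    h = (h * base + ord(text[start + length])) % mod
        return hits

    print(sorted(find_dictionary_words("heapofgibberishhabit", {"habit", "heap", "of"})))
    # [(0, 'heap'), (4, 'of'), (15, 'habit')]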
Why not store your word list in a trie? Then you iterate through the input looking for matching words in the trie; this can be done very efficiently. If you find one, advance to the end of the word and continue.
It depends on what accuracy you want, how efficient you need it to be, and what kind of text you are processing.
