What is the best data structure for text autocompletion?

I have a long list of words and I want to show the words starting with the text entered by the user. As the user types a character, the application should update the list shown to the user, like AutoCompleteTextView on Android. I am just curious about the best data structure to store the words so that the search is very fast.

A trie could be used: http://en.wikipedia.org/wiki/Trie (see also https://stackoverflow.com/search?q=trie)
A nice article - http://www.sarathlakshman.com/2011/03/03/implementing-autocomplete-with-trie-data-structure/
PS: If you have sub-sequences that don't branch, you can save space by using a radix trie, a trie variant that stores several characters per node when possible - http://en.wikipedia.org/wiki/Radix_tree
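To make the idea concrete, here is a minimal Python sketch of prefix completion with a plain trie (names are my own; a real implementation would cap the number of results and rank them, e.g. by word frequency):

class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix):
        # Walk down to the prefix node, then collect every word below it.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.is_word:
                results.append(text)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return results

trie = Trie()
for w in ["ant", "an", "and", "tan"]:
    trie.insert(w)
print(trie.complete("an"))   # ['an', 'and', 'ant'] in some order

Each keystroke extends the walk by one node, so updating the suggestion list as the user types costs O(length of prefix) plus the size of the output.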

You may find this thread interesting:
String similarity algorithms?
It's not exactly what you asked for; rather, it addresses a slightly extended version of your problem.

Ternary search trees (TSTs) are also used to implement autocomplete:
http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/
However, if you want to find an arbitrary substring within a string rather than just a prefix, try a generalised suffix tree.
http://en.wikipedia.org/wiki/Generalised_suffix_tree
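For reference, a minimal TST sketch in Python (structure assumed from the article above; real implementations usually add balancing and prefix-collection for completions):

class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.left = None      # subtree for chars < ch
        self.eq = None        # subtree for the next char of the word
        self.right = None     # subtree for chars > ch
        self.is_word = False

def tst_insert(node, word, i=0):
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.left = tst_insert(node.left, word, i)
    elif ch > node.ch:
        node.right = tst_insert(node.right, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def tst_search(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_search(node.left, word, i)
    if ch > node.ch:
        return tst_search(node.right, word, i)
    if i + 1 == len(word):
        return node.is_word
    return tst_search(node.eq, word, i + 1)

root = None
for w in ["cat", "cap", "car"]:
    root = tst_insert(root, w)
print(tst_search(root, "cap"))   # True

Compared with a plain trie, a TST trades a little search speed for lower memory use, since each node stores one character and three pointers instead of a full child map.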

Tries (and their many variants) are useful here. A more detailed treatment of the topic is in this paper. Maybe you can implement a completion trie for Android?

New features for an autocomplete algorithm

I met this problem in an interview. It is easy to implement a basic autocomplete system (https://www.futurice.com/blog/data-structures-for-fast-autocomplete/) that returns a list of strings matching a prefix string. Now we want to add some new features.
For example:
User input: lun pla Output: lunch plan (multiple-word autocomplete)
User input: pla Output: lunch plan
User input: unc Output: lunch (autocomplete from part of a word)
How can these features be implemented?
You can try the following basic approach; I will give suggestions for extensions afterwards:
load a dictionary of accepted words
build a BK-tree out of these words, using Damerau-Levenshtein distance as the underlying metric
split the input sequence on whitespace to get words
for each word, check whether it is an accepted word; if it isn't, find the nearest word (within an acceptable distance) in the BK-tree (a sketch follows this list)
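Here is a minimal BK-tree sketch in Python, using plain Levenshtein distance for brevity (swapping in Damerau-Levenshtein only changes the dist function):

def levenshtein(a, b):
    # Standard dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, dist=levenshtein):
        self.dist = dist
        self.root = None          # node = (word, {distance: child})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d == 0:
                return            # already present
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, tol):
        # Return all (distance, word) pairs within edit distance tol.
        out, stack = [], ([self.root] if self.root else [])
        while stack:
            w, children = stack.pop()
            d = self.dist(word, w)
            if d <= tol:
                out.append((d, w))
            # The triangle inequality prunes children outside [d-tol, d+tol].
            for k, child in children.items():
                if d - tol <= k <= d + tol:
                    stack.append(child)
        return sorted(out)

tree = BKTree()
for w in ["lunch", "plan", "plane", "punch"]:
    tree.add(w)
print(tree.query("lnch", 1))      # [(1, 'lunch')]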
Now for the improvements:
As you indicated, sometimes a match makes more sense when two words are grouped together.
Use Google's word2phrase algorithm for this. You can find a C++ version here.
Use a smarter approach to finding word boundaries. A stochastic method like an HMM (Hidden Markov Model) might be useful, to avoid dates, times, abbreviations, etc. being split.
Use a more intelligent error metric. You could take into account common misspellings and keyboard-layout errors (there are very specific errors made by people used to typing on QWERTY who are suddenly faced with AZERTY), etc.
Try to determine each word's part of speech (adjective, noun, verb, etc.). This allows much better completions.

Representing a string numerically with different properties than a hashcode

Is there a function that works similarly to a hash code, where a string or set of bits is passed in and converted to a number, but where strings that are more similar to one another result in numbers closer to one another?
i.e.
f("abcdefg") - f("abcdef") < f("lorem ipsum dolor") - f("abcde")
The algorithm doesn't have to be perfect; I'm just trying to turn some descriptions into a numerical representation as one more input for an ML experiment. I know this string data has value to the model; I'm just trying to come up with a simple way to turn it into a number.
What I understand from your post is very similar to a topic of interest of mine. There is a great tool for exactly the task you are asking about.
The tool I am referring to is word2vec. It produces a vector for each word in a corpus. It was developed at Google. In this model, each word's vector is learned from the vocabulary and the word's neighbours (the previous and next words). Go through the word2vec material from Google or on YouTube and you will get a clear idea of it.
The tool is powerful enough to capture surprising regularities. An example would be:
King - Man + Woman = Queen
It is mainly used for semantic analysis.
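As an illustration, here is how training and querying might look with the gensim library (a sketch only; the toy corpus is far too small for meaningful vectors, and the parameter names assume gensim 4.x):

from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens. A real experiment
# would use a large corpus or pretrained vectors.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# Each word is now a dense vector; similar words end up close together,
# which is the "similar strings -> nearby numbers" property asked for.
vec = model.wv["king"]
print(vec.shape)                  # (50,)

# The famous analogy (only meaningful with a large training corpus):
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])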

Storing arrangement of same word efficiently

I have a question about dictionary storage.
I have been reading about the trie data structure, and so far I have read that it works well as a prefix tree. But I came to the trie hoping it could also reduce the storage of different arrangements of the same letters.
For example, the words "ANT", "TAN" and "NAT" contain the same letters, but a trie creates a separate path for each of them. I understand that a trie is meant for prefix storage and reduces that kind of redundancy, but can anyone help me reduce the redundancy here?
One way I was thinking of was to change the trie so that each node has a 'word complete' flag; if I also add a 'word start' flag, I could make this work as below:
A
N - A - T
T - A - N
Then, each time, I can check whether a word starts at a node and follow it to the end.
Does this make sense? Is it feasible?
Or is there a better method?
Thanks
You can keep two tries: one normal and one built from the reversed words. Then you can support a wildcard anywhere in the search term: split the search word into two halves, look up one half by prefix in the normal trie and the other half by suffix in the reverse trie (http://phpir.com/tries-and-wildcards/). Intersecting the two result sets gives an efficient wildcard search.
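A simplified sketch of that idea in Python, indexing all prefixes and suffixes in plain dictionaries (a pair of tries gives the same lookups without materialising every prefix and suffix up front):

from collections import defaultdict

by_prefix = defaultdict(set)
by_suffix = defaultdict(set)
for w in ["lunch", "launch", "punch", "plan"]:
    for i in range(len(w) + 1):
        by_prefix[w[:i]].add(w)
        by_suffix[w[i:]].add(w)

def wildcard(pattern):
    # Supports exactly one '*'; overlapping prefix/suffix matches
    # are not guarded against in this sketch.
    pre, suf = pattern.split("*")
    return sorted(by_prefix[pre] & by_suffix[suf])

print(wildcard("l*ch"))    # ['launch', 'lunch']
print(wildcard("*unch"))   # ['launch', 'lunch', 'punch']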
If you add a status field to each node, you will increase the memory cost of your tree (assuming 8-bit chars) by a possibly significant amount.
I understand that you want to reduce the number of letters in the data structure, but you have to consider what happens when some contents are subsets of other contents, e.g. how would ANTAN be represented? Think of the minimal set of chars (128) as the nodes of a fully connected graph. Every word is "stored" in this graph, yet it cannot represent any specific set of words, because there is no way of telling where words end. The information stored in a trie is not just letters, but complete, properly terminated words.
If you add a marker as you suggest, how would you encode this: SUPERCHARGED, SUPER, PERCH? You would set word_start at S and P and word_end at R and H. How would you know that SUPERCH and PER are not contained? You could instead use non-zero labels and assign number pairs to the beginnings and ends of words: S:1 P:2 R:1 H:2. To allow a start and an end to occur at the same letter, you would have to use specific bits as labels.
You could then use NATANT as a minimal flat representation with N:001 A:000 T:011 A:100 N:010 T:100. This requires #words bits per marker in the worst case (consider A, AA, AAA, ...). If you stored that in a tree, however, you would have to search for the matching marker, which is not an operation trees support. So I see no good way of using a marker.
From an information-theoretic point of view, I think the critical issue is to encode the length, ordering and contents of each word uniquely for every possible combination of these.
I originally meant to just comment, but this got a bit lengthy. I am not sure it answers your question, but I hope it helps.
Are you hoping that any search for "ant" also brings up "tan" and "nat"?
If so, then use a TrieMap: always sort a word's letters before reads and writes, and map each sorted key to a container of all the words in that "anagram class."
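A tiny sketch of the anagram-class idea (a plain dict keeps it short; a TrieMap keyed on the sorted letters works the same way):

from collections import defaultdict

# Key each word by its sorted letters: all anagrams share one key.
classes = defaultdict(list)
for word in ["ANT", "TAN", "NAT", "SUPER", "PURSE"]:
    classes["".join(sorted(word))].append(word)

def anagrams(word):
    return classes["".join(sorted(word))]

print(anagrams("ANT"))     # ['ANT', 'TAN', 'NAT']

This stores each word once per class member but makes every anagram lookup a single keyed read, which is the redundancy reduction the question is after.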
If you're just looking for ideas to reduce the space overhead of using a trie, then look no further: I've found burst tries to be very space-efficient. I wrote my own burst trie in Scala that also reuses some ideas from GWT's trie implementation.

Creating a suggested words algorithm

I'm designing a cool spell checker (I know, I know, modern browsers already have this). I am wondering what kind of effort it would take to develop a fairly simple but decent word-suggestion algorithm.
My idea is to first look through the misspelled word's characters and count the number of characters it shares with each word in the dictionary (sounds resource-intensive), and then pick the top 5 matches (so if the misspelled word shares the most characters with 7 dictionary words, it will randomly display 5 of those words as suggested spellings).
Obviously, to get more advanced, we would look at "common words" and have a dictionary file ranked by frequency of use in English. Maybe that's taking it a bit overboard.
What do you think? Anyone have ideas for this?
First of all, you will have to consider the complexity of finding the words "nearest" to the misspelled word. I see that you are using a dictionary, perhaps a hash table, but that may not be enough. The best (and cooler) solution here is a trie: finding these so-called nearer words takes time linear in the word length, and it is easy to exhaust the relevant part of the tree.
A small example
Take the word "njce". This is a simple case where one character is clearly mistyped; the obvious expected suggestion is "nice". The first step is to check whether the word is present in the dictionary. Using the search function of a trie, this takes O(m) time for a word of length m, comparable to a hash lookup. The cooler part is finding the suggestions. You try substituting each position in turn: for the first character, that means candidates from 'ajce' through 'zjce', and finding occurrences of this type is again linear in the character count. Don't be put off by the factor of 26 per position: the trie prunes branches immediately as the prefix grows, because it only contains character sequences that actually occur. Once the search at the first position yields nothing, move to the next position and try 'nace', 'nbce', up to 'nzce'. In fact, you won't explore all these combinations, because the trie won't contain most of the intermediate characters; perhaps it has only 'na', 'ne', 'ni', 'no', 'nu', so the search space becomes very small. The same holds for the remaining positions. You could develop this concept further for second- and third-order matches (two or three edits). Hope this helped.
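A sketch of the substitution search described above: for each position in the misspelled word, try only the letters the trie actually branches to, and keep candidates that complete to real words (the tiny dictionary and all names are made up for illustration):

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True              # end-of-word marker
    return root

def contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

def one_substitution(root, word):
    suggestions = []
    for i in range(len(word)):
        node, ok = root, True
        for ch in word[:i]:           # walk the unchanged prefix
            if ch not in node:
                ok = False
                break
            node = node[ch]
        if not ok:
            continue
        for ch in node:               # only letters that actually occur here
            if ch == "$" or ch == word[i]:
                continue
            candidate = word[:i] + ch + word[i + 1:]
            if contains(root, candidate):
                suggestions.append(candidate)
    return suggestions

trie = build_trie(["nice", "mice", "nine", "race"])
print(one_substitution(trie, "njce"))  # ['nice']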
I'm not sure how much of the wheel you're trying to reinvent, so you may want to check out Lucene.
Apache Lucene Core™ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Efficient methods for finding the most common phrases in a body of text, AKA trending topics

I previously asked a similar question on this topic. I ended up deriving several solutions that worked, one based on Bloom filters + n-grams, the other based on hash tables + n-grams. Both perform fine on small data sets (<1000 texts, usually tweets), but the computation time grew exponentially, meaning 10,000 could take hours.
I am currently working in Ruby, and perhaps that is the problem, but are there other solutions or approaches I could try?
If you are looking to do text searching over large sets of data, you might want to look into something like Solr. There is a really easy-to-set-up Solr gem called Sunspot: http://outoftime.github.com/sunspot/
Your problem can be solved by following the steps below:
(Optional, for performance) Run through all the documents and create a mapping between each unique word and an integer. Also create a special mapping for sentence terminators (. ! ? etc.); this makes it easy to ensure phrases do not cross sentence boundaries.
Concatenate all the documents into one huge array of the mapped integers (from the previous step). This can be done online (to save space) as we go through the next steps.
Construct a suffix array of the string from the previous step, augmented with the longest common prefix (LCP) array. The fastest known implementation is SA-IS, which runs in O(n) worst-case time. See here. Some special handling is required to ensure that a common prefix does not cross a sentence boundary.
The LCP array is basically the result you need. You can do whatever you want with it, such as sort it to find the longest repeated phrases among the documents, or find all 5-word, 4-word, 3-word phrases, etc. The most common phrases (I assume at least 2-word phrases here) can be found by looking at both the LCP array and the suffix array.
A quick Google search shows that this library contains a Ruby suffix array implementation, and you can generate the LCP array from it in O(n) (reference). A toy sketch of the whole pipeline follows.
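A compact Python sketch of steps 1-4 on a toy corpus (the suffix array is built naively here, O(n^2 log n); swap in SA-IS for real workloads):

docs = ["we made a lunch plan", "the lunch plan changed"]

# Step 1: map words (plus an end-of-sentence marker) to integers.
tokens = []
for doc in docs:
    tokens.extend(doc.split())
    tokens.append("<eos>")            # sentence/document boundary
ids = {w: i for i, w in enumerate(sorted(set(tokens)))}
seq = [ids[w] for w in tokens]

# Steps 2-3: suffix array over the integer sequence, plus LCP array.
sa = sorted(range(len(seq)), key=lambda i: seq[i:])

def lcp_len(i, j):
    n = 0
    while i + n < len(seq) and j + n < len(seq) and seq[i + n] == seq[j + n]:
        if tokens[i + n] == "<eos>":  # never extend across a boundary
            break
        n += 1
    return n

lcp = [0] + [lcp_len(sa[k - 1], sa[k]) for k in range(1, len(sa))]

# Step 4: adjacent suffixes sharing >= 2 words reveal repeated phrases.
words_by_id = {i: w for w, i in ids.items()}
for k, l in enumerate(lcp):
    if l >= 2:
        phrase = " ".join(words_by_id[t] for t in seq[sa[k]:sa[k] + l])
        print(phrase)                 # prints "lunch plan"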
