How do I create a tree containing a dictionary of words? - algorithm

Introduction: write a program to do text prediction suggestion (like Google does when you start typing a search term). That is, as the user
types, the program will show a list of N words that the user might be in the process of typing.
enter image description here
Requirements: read in a file with words and build an internal representation
Part 1: The internal representation that you are to use is a tree with a branching factor of 26, one branch for each possible letter. Each node should also signify if ending in that node represents a word.
Example: For instance, given the string ”parrot”, following a path from the
root should end in a node that signifies that a word has occurred. Following a path for the string ”subantiq” should arrive at a node that signifies that a word does not end at that
node.
Confusion: I do not know how to create the tree in order to fill it with words from a list. Also, there is no constraint on the language.
My question are:
1. Which language would be best for implementing this?
2. how do I create the tree that will read in a list of words, in the desired structure? Pseudo code in the best language?

You can use Trie data structure for this with a branching factor of 26.r
Now to answer your questions.
Which language would be best for implementing.
ans: You can implement it either in python or in C++.
2.How do I create the tree that will read in a list of words in the desired structure.
ans: You can take reference from this.
Additionally you can take reference for pseudo code from here.

Related

How do I modify the prefix trie data structure to handle words that are in the middle?

I want to implement simple autocomplete functionality for a website. I first wanted to use the prefix trie data structure for this, and that's how autocomplete usually works, you enter a prefix and you can search the trie for the possible suffixes, however the product owner wants to handle the words that are in the middle as well.
Let me explain what I mean. Imagine I have these product names:
tile for bathroom
tile for living room
kitchen tile
kitchen tile, black
some other tile, green
The user searches for "tile" and they will only see the first 2 results if I use the prefix trie, but I want all those results to pop up, however I don't know about any efficient data structure to handle this. Can you please suggest something? Can a prefix trie be modified to handle this?
I have thought about some modifications, such as inserting all suffixes, etc, but they will give wrong results, for example, I have inserted suffixes for
kitchen tile, black
some other tile, green
and kept the prefixes in the first node for each suffix (kind of like cartesian product), that way I can get the result of "some other tile, black" which doesn't exist. So this solution is bad. Also this solution will use a lot of memory...
The trie data structure indeed works for prefix match operations, not for in the middle text search
The usual data structure to support in the middle text search is the suffix tree: https://en.wikipedia.org/wiki/Suffix_tree
It requires enough space to store about 20 times your list of words in memory, so yes it costs more memory
Suffix array is a space efficient alternative: https://en.wikipedia.org/wiki/Suffix_array
Don't over-think this. Computers are fast. If you're talking on the order of thousands of products in memory, then a sequential search doing a contains check is going to be plenty fast enough: just a few milliseconds, if that.
If you're talking a high-traffic site with thousands of requests per second, or a system with hundreds of thousands of different products, you'll need a better approach. But for a low-traffic site and a few thousand products, do the simple thing first. It's easy to implement and easy to prove correct. Then, if it's not fast enough, you can worry about optimizing it.
I have an approach that will work using simple tries.
Assumption:- User will see Sentence once the whole word is complete
Let's take above example to understand this approach.
1. Take each sentence, say tile for bathroom.
2. Split the sentences into words as - tile, for, bathroom.
3. Create a tuple of [String, String], so for above example we will get three tuples.
(i) [tile, tile for bathroom]
(ii) [for, tile for bathroom]
(iii)[bathroom, tile for bathroom]
4. Now insert the first String of the above tuple into your trie and store the
other tuple (which is the whole sentence) as a String object to the
last character node for the word. i.e when inserting tile, the node at
character e will store the sentence string value.
5. One case to handle here is, like the tile word appears in two strings, so in that case
the last character e will store a List of string having values - tile for bathroom and tile for living room.
Once you have the trie ready based on the above approach, you would be able to search the sentence based on any word being used in that sentence. In short, we are creating each word of the sentence as tag.
Let me know, if you need more clarity on the above approach.
Hope this helps!

Storing arrangement of same word efficiently

I had a question related to Dictionary storage.
I was reading about Trie Data-structures and so far I have read that it works pretty well as prefix tree. But, I came to Trie - DS in efforts to see if it can reduce the storage of arrangement of letters formed through same word efficiently.
For ex : words "ANT", "TAN" and NAT have same letters but according to Trie it goes on to create two separate paths for these words. I can understand that Trie is meant for prefix storage and reduce redundancy. But can anyone help me in reducing the redundancy here.
One way I was thinking was to change the behavior of Trie as to each node has a status of 'word complete'; In addition if I put 'word start' status too I can make this work as below :
A
N - A - T
T - A - N
Now, every time I can check if the word is starting form here and go till the end.
Does this makes sense ? and if this is feasible ?
Or is their any better method to do this ?
Thanks
You can use 2 tries and also store the reverse trie. Then you can use a wildcard expansion everywhere in the search for example you can split the search word into 2 half and search for one half by the prefix and the other half by its suffix:http://phpir.com/tries-and-wildcards/. When you concatenate the 2 you can efficient search with a wildcard.
If you add a status field to each node you will increase the memory cost of your tree (assuming 8-bit chars) by a possibly not insignificant portion.
I understand that you want to reduce the number of letters in the DS, but you have to consider what happens if some contents are subsets of other contents, e.g. how ANTAN would be represented. Think about the minimal number of chars (128) as nodes of a fully connected graph. Obviously all words are stored in this graph, however it is not suitable to store any specific words. There is no way of telling where words end. The information stored in a trie is not just letters, but complete and properly terminated words.
If you add a marker as you suggest, how will you be able to encode this: SUPERCHARGED, SUPER, PERCH. You would set word_starts at S and P and word_ends at R and H. How would you know that SUPERCH and PER are not contained? You could instead use a non-zero label and assign number-pairs to the beginning and end of words: S:1 P:2 R:1 H:2. To make sure that start and end can occur at the same letter, you would have to use specific bits as labels.
You could then use NATANT as minimal flat representation and N:001 A:000 T:011 A:100 N:010 T: 100. This requires #words bit for the marker in the worst case: A, AA, AAA.... If you would store that in a tree however, you would have to look for the other marker, which is not an operation supported by trees. So I see no good way of using a marker.
From an information theoretical point I think the critical issue here is to properly encode the length, ordering and contents of a word in a unique way for each possible combination of these.
I originally meant to just comment, but it got a bit lengthy. I am not sure if this answers your question, but I hope it helps.
Are you hoping that any search for "ant" also brings up "tan" and "nat"?
If so, then use a TrieMap, always sort keys before reads/writes and map to a container of all words in that "anagram class."
If you're just looking for ideas to reduce the space overhead of using a Trie, then look no further. I've found burst trie's to be very space efficient. I wrote my own burst trie in Scala that also re-uses some ideas that I found in GWT's trie implementation.

What is the best data structure for text auto completion?

I have a long list of words and I want to show words starting with the text entered by the user. As user enters a character, application should update the list showed to user. It should be like AutoCompleteTextView on Android. I am just curious about the best data structure to store the words so that the search is very fast.
A trie could be used . http://en.wikipedia.org/wiki/Trie https://stackoverflow.com/search?q=trie
A nice article - http://www.sarathlakshman.com/2011/03/03/implementing-autocomplete-with-trie-data-structure/
PS : If you have some sub-sequences that "don't branch" then you may save space by using a radix trie, which is a trie implementation that puts several characters in node when possible - http://en.wikipedia.org/wiki/Radix_tree
You may find this thread interesting:
String similarity algorithms?
It's not exactly what you want, instead it's a slightly extended version of your problem.
For implementation of autocomplete feature, ternary search trees(TST) are also used:
http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/
However, if you want to find any random substring within a string, try a Generalised suffix tree.
http://en.wikipedia.org/wiki/Generalised_suffix_tree
Tries (and their various varieties) are useful here. A more detailed treatment on this topic is in this paper. Maybe you can implement a completion trie for Android?

Compression and Lookup of huge list of words

I have a huge list of multi-byte sequences (lets call them words) that I need to store in a file and that I need to be able to lookup quickly. Huge means: About 2 million of those, each 10-20 bytes in length.
Furthermore, each word shall have a tag value associated with it, so that I can use that to reference more (external) data for each item (hence, a spellchecker's dictionary is not working here as that only provides a hit-test).
If this were just in memory, and if memory was plenty, I could simply store all words in a hashed map (aka dictionary, aka key-value pairs), or in a sorted list for a binary search.
However, I'd like to compress the data highly, and would also prefer not to have to read the data into memory but rather search inside the file.
As the words are mostly based on the english language, there's a certain likelyness that certain "sillables" in the words occur more often than others - which is probably helpful for an efficient algorithm.
Can someone point me to an efficient technique or algorithm for this?
Or even code examples?
Update
I figure that DAWG or anything similar routes the path into common suffixes this way won't work for me, because then I won't be able to tag each complete word path with an individual value. If I were to detect common suffixes, I'd have to put them into their own dictionary (lookup table) so that a trie node could reference them, yet the node would keep its own ending node for storing that path's tag value.
In fact, that's probably the way to go:
Instead of building the tree nodes for single chars only, I could try to find often-used character sequences, and make a node for those as well. That way, single nodes can cover multiple chars, maybe leading to better compression.
Now, if that's viable, how would I actually find often-used sub-sequences in all my phrases?
With about 2 million phrases consisting of usually 1-3 words, it'll be tough to run all permutations of all possible substrings...
There exists a data structure called a trie. I believe that this data structure is perfectly suited for your requirements. Basically a trie is a tree where each node is a letter and each node has child nodes. In an letter based trie, there would be 26 children per node.
Depending on what language you are using this may be easier or better to store as a variable length list while creation.
This structure gives:
a) Fast searching. Following a word of length n, you can find the string in n links in the tree.
b) Compression. Common prefixes are stored.
Example: The word BANANA and BANAL both will have B,A,N,A nodes equal and then the last (A) node will have 2 children, L and N. Your Nodes can also stored other information about the word.
(http://en.wikipedia.org/wiki/Trie)
Andrew JS
I would recommend using a Trie or a DAWG (directed acyclic word graph). There is a great lecture from Stanford on doing exactly what you want here: http://academicearth.org/lectures/lexicon-case-study
Have a look at the paper "How to sqeeze a lexicon". It explains how to build a minimized finite state automaton (which is just another name for a DAWG) with a one-to-one mapping of words to numbers and vice versa. Exactly what you need.
You should get familiar with Indexed file.
Have you tried just using a hash map? Thing is, on a modern OS architecture, the OS will use virtual memory to swap out unused memory segments to disk anyway. So it may turn out that just loading it all into a hash map is actually efficient.
And as jkff points out, your list would only be about 40 MB, which is not all that much.

Algorithm for generating a 'top list' using word frequency

I have a big collection of human generated content. I want to find the words or phrases that occur most often. What is an efficient way to do this?
Don't reinvent the wheel. Use a full text search engine such as Lucene.
The simple/naive way is to use a hashtable. Walk through the words and increment the count as you go.
At the end of the process sort the key/value pairs by count.
the basic idea is simple -- in executable pseudocode,
from collections import defaultdict
def process(words):
d = defaultdict(int)
for w in words: d[w] += 1
return d
Of course, the devil is in the details -- how do you turn the big collection into an iterator yielding words? Is it big enough that you can't process it on a single machine but rather need a mapreduce approach e.g. via hadoop? Etc, etc. NLTK can help with the linguistic aspects (isolating words in languages that don't separate them cleanly).
On a single-machine execution (net of mapreduce), one issue that can arise is that the simple idea gives you far too many singletons or thereabouts (words occurring once or just a few times), which fill memory. A probabilistic retort to that is to do two passes: one with random sampling (get only one word in ten, or one in a hundred) to make a set of words that are candidates for the top ranks, then a second pass skipping words that are not in the candidate set. Depending on how many words you're sampling and how many you want in the result, it's possible to compute an upper bound on the probability that you're going to miss an important word this way (and for reasonable numbers, and any natural language, I assure you that you'll be just fine).
Once you have your dictionary mapping words to numbers of occurrences you just need to pick the top N words by occurrences -- a heap-queue will help there, if the dictionary is just too large to sort by occurrences in its entirety (e.g. in my favorite executable pseudocode, heapq.nlargest will do it, for example).
Look into the Apriori algorithm. It can be used to find frequent items and/or frequent sets of items.
Like the wikipedia article states, there are more efficient algorithms that do the same thing, but this could be a good start to see if this will apply to your situation.
Maybe you can try using a PATRICIA trie or practical algorithm to retrieve information coded in alphanumeric trie?
Why not a simple map with key as the word and the Counter as the Value.
It will give the top used words, by taking the high value counter.
It is just a O(N) operation.

Resources