Implementation of a t9 dictionary using trie - data-structures

In a technical interview I was asked to implement a T9 dictionary. I knew it could be done using tries, but didn't know how to go about it. Could anyone please explain it?
Note: Don't mark it as duplicate due to this, as it doesn't contain any explanation.

1) Build a trie (add all words from the dictionary to it).
2) Initially, the current node is the root of the trie.
3) When a new character is typed in, you can simply go to the next node from the current node by the edge that corresponds to this character (or report an error if there is nowhere to go).
4) To get all (or the first k) possible words with a given prefix, you can just traverse the trie in depth-first order starting from the current node (if you need only the first k words, you may stop the search once k words are found).
5) When the entire word has been typed in and a new word is started, move to the root again and repeat steps 3) - 5) for the next word.
P.S. All nodes that correspond to a word (not just a prefix of a word, but an entire word) can be marked when the trie is built, so that it is easy to tell whether a new word has been found when you traverse the trie during step 4).
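For illustration, here is a minimal Python sketch of these steps, adapted to T9 by keying the trie edges on digits (2 = abc, 3 = def, ...) rather than on letters; the names (Node, build, suggest) are made up for this sketch.

T9 = {c: d for d, letters in {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
                              '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}.items()
      for c in letters}

class Node:
    def __init__(self):
        self.children = {}   # digit -> Node
        self.words = []      # words whose digit sequence ends here (see P.S.)

def build(words):                          # step 1
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(T9[ch], Node())
        node.words.append(w)
    return root

def suggest(node, k):                      # step 4: DFS from the current node
    found, stack = [], [node]
    while stack and len(found) < k:
        n = stack.pop()
        found.extend(n.words)
        stack.extend(n.children.values())
    return found[:k]

root = build(["man", "mane", "name"])
cur = root                                 # step 2
for digit in "626":                        # step 3: one edge per keypress
    cur = cur.children.get(digit)
    if cur is None:
        break                              # nowhere to go: report an error
if cur is not None:
    print(suggest(cur, 5))                 # ['man', 'mane', 'name']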

Related

How to find longest accepted word by automata?

I need to write code in Java that will find the longest word a DFA accepts. Firstly, if there is a transition to one of the previous states (or a self-transition) on a path that leads to a final state, that means there are infinitely many words and the longest one doesn't exist (i.e. a Kleene star is applied to some subword). I was thinking of forming a queue by BFS, where each level is separated by null, so that when I iterate through the queue and come across null, the length of the word is increased by one, but it would be hard to track the set of previous states, so I'm kind of idealess. If you can't code in Java I would appreciate pseudocode or an algorithm.
I don't think this is strictly necessary, but it would not hurt performance too terribly in practice and might be sufficient for your needs. I would suggest, as a first pass, minimizing the DFA. This can be done in O(n log n) in the number of states, using e.g. Hopcroft's algorithm. This is probably conceptually similar to what Christian Sloper suggests in the comments regarding reversing the transitions to find unproductive states; indeed, there is a minimization algorithm that does this as well, but you might be able to get away with just removing unproductive states and not minimizing here (though minimizing does make the reasoning a little easier).
Doing that is nice because it will remove all unproductive loops and combine them into a single dead state, if indeed there are any unproductive prefixes. It is easy to find the one dead state, if there is one, and remove it from the directed graph formed by the DFA's states and transitions. To do this, do either DFS or BFS and check each state you come to, asking whether (1) all of its transitions are self-loops and (2) it is not accepting.
With the one dead state removed (if any), any loops or cycles we detect in the remaining directed graph imply there are infinitely many strings in the language, since by definition any remaining state has a path to acceptance. If we find a loop or cycle, we know the language is infinite and can respond accordingly.
If there are no loops or cycles remaining after removing the dead state from the minimal DFA, what remains is a tree rooted at the start state and whose leaves are accepting states (think about this for a moment and you will see it must be true). Therefore, the length of the longest string accepted is the length (in edges) of the longest path from the root to a leaf; so basically the height of the tree or something close to it (depending on how you define depth/height, whether edges or nodes). You can take any old algorithm for finding the depth and modify it so that in addition to returning the depth, it returns the string corresponding to the deepest subtree, so you can get the string without having to go back through the tree. Something like this:
GetLongestStringInTree(root)
    if root is null return ""
    result = ""
    maxlen = -1   // -1 so that an empty string from an accepting leaf child still counts
    for each transition out of root
        child = transition.target
        symbol = transition.symbol
        str = GetLongestStringInTree(child)
        if str.length > maxlen then
            maxlen = str.length
            result = symbol + str   // prepend the symbol on the edge we took
    return result
This could be pretty easily modified to find all words of maximum length by adding str to a collection when its length equals the maximum so far, emptying that collection whenever a new longer string is found, and returning the collection (using the length of the first element in the collection for the comparison). That can be left as an exercise; as written, this will just find some arbitrary longest string accepted by the DFA.
This problem becomes a lot simpler if you split it in two. (Sorry, no Java.)
Step 1: Determine if there is a loop.
If there is a loop, there exist arbitrarily long accepted inputs. Detecting a loop in a directed graph can be done with DFS.
Step 2 (no loop): You now have a directed acyclic graph (DAG) and you can find the longest path using this algorithm: Longest path in Directed acyclic graph
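Putting the two steps together, here is a Python sketch; it assumes the DFA is given as a dict of state -> {symbol: next_state} plus a start state and a set of accepting states, with unproductive states already removed as discussed above (the representation is made up for this sketch).

def longest_accepted(transitions, start, accepting):
    # Step 1: detect a cycle reachable from the start state with DFS
    # ('gray' = on the current DFS path, 'black' = fully explored).
    color = {}
    def has_cycle(s):
        color[s] = 'gray'
        for t in transitions.get(s, {}).values():
            if color.get(t) == 'gray' or (t not in color and has_cycle(t)):
                return True
        color[s] = 'black'
        return False
    if has_cycle(start):
        return None   # a reachable cycle means infinitely many accepted words

    # Step 2: longest path in the remaining DAG via memoized DFS.
    memo = {}
    def best(s):   # longest accepted string readable from state s, or None
        if s in memo:
            return memo[s]
        result = "" if s in accepting else None
        for sym, t in transitions.get(s, {}).items():
            sub = best(t)
            if sub is not None and (result is None or len(sub) + 1 > len(result)):
                result = sym + sub
        memo[s] = result
        return result
    return best(start)

dfa = {0: {'a': 1, 'b': 2}, 1: {'c': 2}, 2: {}}
print(longest_accepted(dfa, 0, {2}))   # 'ac'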

How do I use a Trie for spell checking

I have a trie that I've built from a dictionary of words. I want to use this for spell checking (and to suggest the closest matches in the dictionary, maybe for a given number of edits x). I'm thinking I'd use the Levenshtein distance between the target word and the words in my dictionary, but is there a smart way to traverse the trie without actually running the edit distance logic over each word separately? How should I do the traversal and the edit distance matching?
For example, if I have the words MAN and MANE, I should be able to reuse the edit distance computation on MAN for MANE. Otherwise the trie wouldn't serve any purpose.
I think you should instead give BK-trees a try; it's a data structure that fits spell-checking well, as it will allow you to compute the edit distance against the words of your dictionary efficiently.
This link gives a good insight into BK-trees applied to spell-checking
Try computing for each trie node an array A where A[x] is the smallest edit distance to be at that position in the trie after matching the first x letters of the target word.
You can then stop examining any nodes if every element in the array is greater than your target distance.
For example, with a trie containing MAN and MANE and an input BANE:
Node 0 representing '', A=[0,1,2,3,4]
Node 1 representing 'M', A=[1,1,2,3,4]
Node 2 representing 'MA', A=[2,2,1,2,3]
Node 3 representing 'MAN', A=[3,3,2,1,2]
Node 4 representing 'MANE', A=[4,4,3,2,1]
The smallest value of A[end] is 1, reached with the word 'MANE', so this is the best match.
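A Python sketch of this pruned traversal follows; the dict-based trie and the names insert, search, and max_dist are made up, and the row update is the standard Levenshtein recurrence.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None     # set on nodes that end a dictionary word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def search(root, target, max_dist):
    matches = []
    def walk(node, ch, prev_row):
        # Build this node's array A from its parent's, one letter deeper.
        row = [prev_row[0] + 1]
        for x in range(1, len(target) + 1):
            row.append(min(row[x - 1] + 1,                            # insertion
                           prev_row[x] + 1,                           # deletion
                           prev_row[x - 1] + (target[x - 1] != ch)))  # match/substitution
        if node.word is not None and row[-1] <= max_dist:
            matches.append((node.word, row[-1]))
        if min(row) <= max_dist:          # otherwise prune this whole subtree
            for c, child in node.children.items():
                walk(child, c, row)
    root_row = list(range(len(target) + 1))   # A for '': [0, 1, ..., len(target)]
    for c, child in root.children.items():
        walk(child, c, root_row)
    return matches

root = TrieNode()
for w in ["man", "mane"]:
    insert(root, w)
print(search(root, "bane", 1))   # [('mane', 1)]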
There is a smart way to get every element, though the measure is not quite an edit distance with transpositions, since the following algorithm does not incorporate them.
Assuming we have the trie structure, we can implement a recursive search of the tree. The recursive search starts with a cost row representing the cost of deleting every letter of the target. As we recursively search the tree, the information we have is:
You are at node n, which has been indexed in your trie structure by letter l.
You are considering a distance from a word w.
Your current path carries a previous cost row up to this point; we wish to update it to form a new cost row for this node n.
We want to update our cost row at the letter under consideration according to four situations: l is the next letter in the word (the cost stays the same), a letter needs to be inserted (new cost +1), a letter has been deleted (cost of the previous step +1), and l replaces a previous letter (substitution, new cost +1).
The cost of proceeding down this path in your tree is the minimum of these costs. At this point, if you're at a node in the trie structure that marks a word, append it to a list, and then recursively search all children for more words, as long as the current cost is within a defined maximum cost. An implementation in Python can be found in another post:
https://stackoverflow.com/a/62823597/8249836
I also have this in C for piping. Since the algorithm is pretty fast even for high edit distances (less than the length of the word), one may use a fast, efficient implementation of the Levenshtein distance to correct this method.

Trie iterator, suffix ordering

I need to implement an iterator for a trie. Let's say I have
    a
   / \
  b   c
 / \
d   e
If the current iterator.state="abd", I would like to have iterator.next.state="abe", then "ac". At each level, nodes are sorted in lexicographical order (e.g. on level 2, c comes after b). Also this should happen in log(n) time, where n is the number of nodes.
One solution I can think of is to consider a special case first, where each branch has the same height. A rather cool implementation, I think, would be to maintain a balanced tree for each "level". On asking "what string follows after abd", when positioned on d, one could search for the first element bigger than "d" in the tree associated with the third level, giving "abe".
However that might be impractical, due to having to create the trees.
If I understand the question correctly, the iterator state could be the current string and a pointer to the current location in the trie. Then, to move to the next element:
if your current location has a sibling, move to it and replace the last character in the current string with the current location's character.
else, remove the last character and go up the tree. If you're trying to go up from the root, you're done. Otherwise, go to step 1.
So for example when you're at abd (in your example), the current string is "abd" and the pointer points to the 'd'. To move to the next element you change the string to "ab", move to the sibling node ('e') and add it to the string, yielding "abe". After that, you'll be going up since there's no sibling and then to b's sibling, yielding the correct next value 'ac'.
As you can see, in the worst case each of those steps needs to go all the way back up to the root before it can find a sibling, so a step costs at most the height of the trie; for a reasonably balanced trie, that is the log(n) you were asking for.
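Here is a Python sketch of that iterator; it assumes parent pointers and lexicographically sorted child lists, and all names are made up.

class Node:
    def __init__(self, ch='', parent=None):
        self.ch, self.parent, self.children = ch, parent, []

def next_sibling(node):
    sibs = node.parent.children
    i = sibs.index(node)
    return sibs[i + 1] if i + 1 < len(sibs) else None

def advance(node, state):
    while node.parent is not None:
        sib = next_sibling(node)
        if sib is not None:                       # step 1: move to the sibling
            return sib, state[:-1] + sib.ch
        node, state = node.parent, state[:-1]     # step 2: drop a char, go up
    return None, None                             # went up from the root: done

# The example trie: a -> (b -> (d, e), c)
root = Node()
a = Node('a', root); root.children = [a]
b, c = Node('b', a), Node('c', a); a.children = [b, c]
d, e = Node('d', b), Node('e', b); b.children = [d, e]

node, state = d, "abd"
while node is not None:
    node, state = advance(node, state)
    if node is not None:
        print(state)   # abe, then ac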

Find strings which are prefixes of other strings

This is an interview question. Given a number of strings, find the ones that are prefixes of others. For example, given strings = {"a", "aa", "ab", "abb"} the result is {"a", "ab"}.
The simplest solution is just to sort the strings and check, for each pair of subsequent strings, whether the 1st one is a prefix of the 2nd one. The running time of the algorithm is the running time of the sorting.
I guess there is another solution, which uses a trie, and has complexity O(N), where N is the number of strings. Could you suggest such an algorithm?
I have the following idea regarding the trie, with complexity O(N):
You start with an empty trie.
You take the words one by one and add each word to the trie.
After you add a word (let's call it Wi) to the trie, there are two cases to consider:
Wi is a prefix of some of the words you added before.
That statement is true if you didn't add any nodes to the trie while adding word Wi.
In that case, Wi is a prefix and part of our solution.
Some of the words added before are prefixes of Wi.
That statement is true if you passed through a node that represents the end of some word added before (let's call that word Wj). In that case, Wj is a prefix of Wi and part of our solution.
In more detail (pseudocode):
for word in words
    add word to trie
    if size of trie did not change then   // first case
        add word to result
    if ending nodes found while adding word   // second case
        add words defined by those nodes to result
return result
Adding a new word to the trie:
node = trie.root()
for letter in word
    if node.hasChild(letter) == false then   // if the letter doesn't exist, add it
        node.addChild(letter)
    node = node.getChild(letter)             // move
    if letter is last_letter_of_word then    // mark the node that ends the word
        node.setIsLastLetterOf(word)
While you are adding a new word, you can also check whether you passed through any nodes that represent the last letters of other words.
The complexity of the algorithm I described is O(N) (more precisely, linear in the total length of all the words).
Another important thing is that this way you can know how many times word Wi prefixes other words, which may be useful.
Example for {aab, aaba, aa}:
Green nodes are nodes detected as case 1.
Red nodes are nodes detected as case 2.
Each column (trie) is one step. At the beginning the trie is empty.
Black arrows show which nodes we visited (added) in that step.
Nodes that represent the last letter of some word have that word written in parentheses.
In step 1 we add word aab.
In step 2 we add word aaba, recognize one case 2 (word aab) and add word aab to result.
In step 3 we add word aa, recognize case 1 and add word aa to result.
At the end we have result = {aab, aa} which is correct.
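A Python sketch of both cases, using a dict-based trie (find_prefixes and is_word are names invented for this sketch):

class Node:
    def __init__(self):
        self.children = {}
        self.is_word = False

def find_prefixes(words):
    root, result = Node(), set()
    for word in words:
        node, added_nodes = root, False
        for i, ch in enumerate(word):
            if ch not in node.children:
                node.children[ch] = Node()
                added_nodes = True
            node = node.children[ch]
            if node.is_word and i < len(word) - 1:   # case 2: passed the end of some Wj
                result.add(word[:i + 1])
        if not added_nodes:                          # case 1: no new nodes were added
            result.add(word)
        node.is_word = True
    return result

print(find_prefixes(["a", "aa", "ab", "abb"]))   # {'a', 'ab'}
print(find_prefixes(["aab", "aaba", "aa"]))      # {'aab', 'aa'}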
The original answer below addresses a different question, whether a string a is a substring of b (I misread).
Using a trie, you can simply add all the strings to it in a first iteration, and in the 2nd iteration read each word w through the trie. If you finish reading w without reaching the string terminator (usually $), you end at some node v in the trie.
By doing a DFS from v, you can get all the strings that have w as a prefix.
High-level pseudocode:
t <- new trie
for each word w:
    t.add(w)
for each word w:
    node <- t.getLastNode(w)
    if node.val != $:
        collection <- DFS(node)   // excluding w itself
        w is a prefix of each word in collection
Note: in order to optimize it, you might need to do some extra work: if a is a prefix of b, and b is a prefix of c, then a is a prefix of c, so when you do the DFS, if you reach some node that was already searched, just append its strings to the current prefix.
Still, since there can be a quadratic number of pairs ("a", "aa", "aaa", ...), getting all of them requires quadratic time.
Original answer: finding if a is a substring of b:
The suggested solution runs in quadratic complexity: you would need to check each pair of strings, giving you O(n*(n-1)*|S|).
You can build a suffix tree from the strings in the first iteration, and in the 2nd iteration check whether each string occurs non-trivially (i.e. not as itself) inside another string.
This solution is O(n*|S|)

What is the most efficient way to keep track of a specific character's index in a string?

Take the following string as an example:
"The quick brown fox"
Right now the q in quick is at index 4 of the string (starting at 0) and the f in fox is at index 16. Now let's say the user enters some more text into this string.
"The very quick dark brown fox"
Now the q is at index 9 and the f is at index 26.
What is the most efficient method of keeping track of the index of the original q in quick and f in fox no matter how many characters are added by the user?
Language doesn't matter to me, this is more of a theory question than anything so use whatever language you want just try to keep it to generally popular and current languages.
The sample string I gave is short, but I'm hoping for a way that can efficiently handle a string of any size. So updating an array with the offsets would work for a short string but will bog down with too many characters.
Even though in the example I was looking for the index of unique characters in the string I also want to be able to track the index of the same character in different locations such as the o in brown and the o in fox. So searching is out of the question.
I was hoping for the answer to be both time and memory efficient but if I had to choose just one I care more about performance speed.
Let's say that you have a string and some of its letters are interesting. To make things easier let's say that the letter at index 0 is always interesting and you never add something before it—a sentinel. Write down pairs of (interesting letter, distance to the previous interesting letter). If the string is "+the very Quick dark brown Fox" and you are interested in q from 'quick' and f from 'fox' then you would write: (+,0), (q,10), (f,17). (The sign + is the sentinel.)
Now you put these in a balanced binary tree whose in-order traversal gives the sequence of letters in the order they appear in the string. You might now recognize the partial sums problem: You enhance the tree so that nodes contain (letter, distance, sum). The sum is the sum of all distances in the left subtree. (Therefore sum(x)=distance(left(x))+sum(left(x)).)
You can now query and update this data structure in logarithmic time.
To say that you added n characters to the left of character c, you say distance(c) += n and then go and update sum for all parents of c.
To ask what is the index of c you compute sum(c)+sum(parent(c))+sum(parent(parent(c)))+...
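For illustration, here is a Python sketch that solves the same partial-sums problem with a Fenwick (binary indexed) tree over the markers instead of an augmented balanced tree. Both operations take O(log m) for m markers; the trade-off is that this version assumes the set of tracked characters is fixed up front, whereas the balanced tree also lets you add markers later. All names are made up.

class Markers:
    def __init__(self, distances):   # distances[i] = gap to the previous marker
        self.n = len(distances)
        self.tree = [0] * (self.n + 1)
        for i, d in enumerate(distances):
            self.add(i, d)

    def add(self, i, delta):         # distance(i) += delta, in O(log m)
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def index_of(self, i):           # sum of distances 0..i = index of marker i
        i, total = i + 1, 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

# "The quick brown fox": sentinel at 0, q at 4, f at 16.
m = Markers([0, 4, 12])
print(m.index_of(1), m.index_of(2))   # 4 16
m.add(1, 5)   # insert "very " (5 chars) before q: q and f both shift
m.add(2, 5)   # insert "dark " (5 chars) between q and f: only f shifts
print(m.index_of(1), m.index_of(2))   # 9 26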
Your question is a little ambiguous - are you looking to keep track of the first instances of every letter? If so, an array of length 26 might be the best option.
Whenever you insert text into a string at a position lower than the index you have, just compute the offset based on the length of the inserted string.
It would also help if you had a target language in mind as not all data structures and interactions are equally efficient and effective in all languages.
The standard trick that usually helps in similar situations is to keep the characters of the string as leaves in a balanced binary tree. Additionally, internal nodes of the tree should keep sets of letters (if the alphabet is small and fixed, they could be bitmaps) that occur in the subtree rooted at a particular node.
Inserting or deleting a letter in this structure only needs O(log(N)) operations (update the bitmaps on the path to the root), and finding the first occurrence of a letter also takes O(log(N)) operations: you descend from the root, going to the leftmost child whose bitmap contains the interesting letter.
Edit: The internal nodes should also keep number of leaves in the represented subtree, for efficient computation of the letter's index.
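A Python sketch of the query side follows; for brevity the tree is built statically over the string rather than kept self-balancing (a production version would rebalance on insert and delete so the O(log(N)) bounds above hold), and all names are made up.

class Leaf:
    def __init__(self, ch):
        self.ch, self.letters, self.size = ch, {ch}, 1

class Inner:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.letters = left.letters | right.letters   # letters occurring below
        self.size = left.size + right.size            # number of leaves below

def build(s):
    if len(s) == 1:
        return Leaf(s[0])
    mid = len(s) // 2
    return Inner(build(s[:mid]), build(s[mid:]))

def first_index(node, ch, offset=0):
    # Descend toward the leftmost leaf whose subtree contains ch.
    if ch not in node.letters:
        return -1
    if isinstance(node, Leaf):
        return offset
    if ch in node.left.letters:
        return first_index(node.left, ch, offset)
    return first_index(node.right, ch, offset + node.left.size)

t = build("The quick brown fox")
print(first_index(t, 'q'), first_index(t, 'f'))   # 4 16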
