How do I use a Trie for spell checking - algorithm

I have a trie that I've built from a dictionary of words. I want to use this for spell checking( and suggest closest matches in the dictionary , maybe for a given number of edits x). I'm thinking I'd use levenshtein distance between the target word and words in my dictionary, but is there a smart way to traverse the trie without actually running the edit distance logic over each word separately? How should I do the traversal and the edit distance matching?
For e.g, if I have words MAN, MANE, I should be able to reuse the edit distance computation on MAN in MANE. Otherwise the Trie wouldnt serve any purpose

I think you should instead give a try to bk-trees; it's a data structure that fits well spell-checking as it will allow you to compute efficiently the edit distance with the words of your dictionary.
This link gives a good insight into BK-trees applied to spell-checking

Try computing for each tree node an array A where A[x] the smallest edit distance to be at that position in the trie after matching the first x letters of the target word.
You can then stop examining any nodes if every element in the array is greater than your target distance.
For example, with a trie containing MAN and MANE and an input BANE:
Node 0 representing '', A=[0,1,2,3,4]
Node 1 representing 'M', A=[1,1,2,3,4]
Node 2 representing 'MA', A=[2,1,1,2,3]
Node 3 representing 'MAN' A=[3,2,2,1,2]
Node 4 representing 'MANE' A=[4,3,2,2,1]
The smallest value for A[end] is 1 reached with the word 'MANE' so this is the best match.

There is a smart way to get every element that is not quite a Levenstein distance since the following algorithm does not incorporate transpositions.
Assuming we have the Tree structure, we can implement a recursive search of the tree. Your recursive search assumes we start with a cost-row representing the cost of deleting every letter. As we recursively search the tree, the information we have is
You are at node n, that has been indexed in your Trie structure by letter l.
You are considering a distance from a word w
Your current path assumes a previous cost-row up to this point, we wish to update this to form a new cost row for this node n.
We want to update our cost-row at the letter you are considering in accordance with 4 situations; l is the next letter in the word (cost row remains the same), the letter needs to be inserted (new cost +1), a letter has been deleted (cost of previous step +1), and the letter replaces a previous word (new cost +1).
The cost of proceeding down this path on your tree is the minimum of these costs. At this point, if your at a point in the Trie structure defining a word, append it to a list, and then recursively search all children for more words assuming the current cost is within a defined maximum cost. An implementation in Python can be found in another post:
https://stackoverflow.com/a/62823597/8249836
I also have this in C for piping. Since the algorithm is pretty fast even for high edit distances (< len of word) one may use a fast efficient implementation of the Levenstein distance to correct this method.

Related

Minimum number of deletions for a given word to become a dictionary word

Given a dictionary as a hashtable. Find the minimum # of
deletions needed for a given word in order to make it match any word in the
dictionary.
Is there some clever trick to solve this problem in less than exponential complexity (trying all possible combinations)?
For starters, suppose that you have a single word w in the the hash table and that your word is x. You can delete letters from x to form w if and only if w is a subsequence of x, and in that case the number of letters you need to delete from x to form w is given by |x - w|. So certainly one option would be to just iterate over the hash table and, for each word, to see if x is a subsequence of that word, taking the best match you find across the table.
To analyze the runtime of this operation, let's suppose that there are n total words in your hash table and that their total length is L. Then the runtime of this operation is O(L), since you'll process each character across all the words at most once. The complexity of your initial approach is O(|x| · 2|x|) because there are 2|x| possible words you can make by deleting letters from x and you'll spend O(|x|) time processing each one. Depending on the size of your dictionary and the size of your word, one algorithm might be better than the other, but we can say that the runtime is O(min{L, |x|·2|x|) if you take the better of the two approaches.
You can build a trie and then see where your given word fits into it. The difference in the depth of your word and the closest existing parent is the number of deletions required.

Given a continuous stream of words, remove the duplicates

I was asked this question recently.
Given a continuous stream of words, remove the duplicates while reading the input.
Example:
Input: This is next stream of question see it is a question
Output: This next stream of see it is a question
Starting from end, question as well as is already appeared once, so the second time it's ignored.
My solution:
Use hashing in this scenario for each word coming through stream.
If there is a collision then then ignore that word.
It's definitely not a good solution. I was asked to optimize it.
What is the best approach to solve this problem?
Hashing isn't a particularly bad solution.
It gives expected O(wordLength) lookup time, but O(wordLength * wordCount) in the worst case, and uses O(maxWordLength * wordCount) space.
Alternatives:
Trie
A trie is a tree data structure where each edge corresponds to a letter and the path from the root defines the value of the node.
This will give O(wordLength) lookup time and uses O(wordCount * maxWordLength) space, although the actual space usage may be lower as repeated prefixes (e.g. te in the below example) only use space once.
Binary search tree
A binary search tree is a tree data structure where each node in the subtree rooted at the left child is smaller than its parent, and similarly all nodes to the right are greater.
A self-balancing one gives O(wordLength * log wordCount) lookup time and uses O(wordCount * maxWordLength) space.
Bloom filter
A bloom filter is a data structure consisting of some number of bits and a few hash functions which maps a word to a bit, sets the output of each hash function on add and checks if any are not set on query.
This uses less space than the above solutions, but at the cost of false positives - some words will be marked as duplicates that aren't.
Specifically, it uses 1.44 log2(1/e) bits per key, where e is the false positive rate, giving O(wordCount) space usage, but with an incredibly low constant factor.
This will give O(wordLength) lookup time.
An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3.

Data structure supporting Add and Partial-Sum

Let A[1..n] be an array of real numbers. Design an algorithm to perform any sequence of the following operations:
Add(i,y) -- Add the value y to the ith number.
Partial-sum(i) -- Return the sum of the first i numbers, i.e.
There are no insertions or deletions; the only change is to the values of the numbers. Each operation should take O(logn) steps. You may use one additional array of size n as a work space.
How to design a data structure for above algorithm?
Construct a balanced binary tree with n leaves; stick the elements along the bottom of the tree in their original order.
Augment each node in the tree with "sum of leaves of subtree"; a tree has #leaves-1 nodes so this takes O(n) setup time (which we have).
Querying a partial-sum goes like this: Descend the tree towards the query (leaf) node, but whenever you descend right, add the subtree-sum on the left plus the element you just visited, since those elements are in the sum.
Modifying a value goes like this: Find the query (left) node. Calculate the difference you added. Travel to the root of the tree; as you travel to the root, update each node you visit by adding in the difference (you may need to visit adjacent nodes, depending if you're storing "sum of leaves of subtree" or "sum of left-subtree plus myself" or some variant); the main idea is that you appropriately update all the augmented branch data that needs updating, and that data will be on the root path or adjacent to it.
The two operations take O(log(n)) time (that's the height of a tree), and you do O(1) work at each node.
You can probably use any search tree (e.g. a self-balancing binary search tree might allow for insertions, others for quicker access) but I haven't thought that one through.
You may use Fenwick Tree
See this question

Ordering a dictionary to maximize common letters between adjacent words

This is intended to be a more concrete, easily expressable form of my earlier question.
Take a list of words from a dictionary with common letter length.
How to reorder this list tto keep as many letters as possible common between adjacent words?
Example 1:
AGNI, CIVA, DEVA, DEWA, KAMA, RAMA, SIVA, VAYU
reorders to:
AGNI, CIVA, SIVA, DEVA, DEWA, KAMA, RAMA, VAYU
Example 2:
DEVI, KALI, SHRI, VACH
reorders to:
DEVI, SHRI, KALI, VACH
The simplest algorithm seems to be: Pick anything, then search for the shortest distance?
However, DEVI->KALI (1 common) is equivalent to DEVI->SHRI (1 common)
Choosing the first match would result in fewer common pairs in the entire list (4 versus 5).
This seems that it should be simpler than full TSP?
What you're trying to do, is calculate the shortest hamiltonian path in a complete weighted graph, where each word is a vertex, and the weight of each edge is the number of letters that are differenct between those two words.
For your example, the graph would have edges weighted as so:
DEVI KALI SHRI VACH
DEVI X 3 3 4
KALI 3 X 3 3
SHRI 3 3 X 4
VACH 4 3 4 X
Then it's just a simple matter of picking your favorite TSP solving algorithm, and you're good to go.
My pseudo code:
Create a graph of nodes where each node represents a word
Create connections between all the nodes (every node connects to every other node). Each connection has a "value" which is the number of common characters.
Drop connections where the "value" is 0.
Walk the graph by preferring connections with the highest values. If you have two connections with the same value, try both recursively.
Store the output of a walk in a list along with the sum of the distance between the words in this particular result. I'm not 100% sure ATM if you can simply sum the connections you used. See for yourself.
From all outputs, chose the one with the highest value.
This problem is probably NP complete which means that the runtime of the algorithm will become unbearable as the dictionaries grow. Right now, I see only one way to optimize it: Cut the graph into several smaller graphs, run the code on each and then join the lists. The result won't be as perfect as when you try every permutation but the runtime will be much better and the final result might be "good enough".
[EDIT] Since this algorithm doesn't try every possible combination, it's quite possible to miss the perfect result. It's even possible to get caught in a local maximum. Say, you have a pair with a value of 7 but if you chose this pair, all other values drop to 1; if you didn't take this pair, most other values would be 2, giving a much better overall final result.
This algorithm trades perfection for speed. When trying every possible combination would take years, even with the fastest computer in the world, you must find some way to bound the runtime.
If the dictionaries are small, you can simply create every permutation and then select the best result. If they grow beyond a certain bound, you're doomed.
Another solution is to mix the two. Use the greedy algorithm to find "islands" which are probably pretty good and then use the "complete search" to sort the small islands.
This can be done with a recursive approach. Pseudo-code:
Start with one of the words, call it w
FindNext(w, l) // l = list of words without w
Get a list l of the words near to w
If only one word in list
Return that word
Else
For every word w' in l do FindNext(w', l') //l' = l without w'
You can add some score to count common pairs and to prefer "better" lists.
You may want to take a look at BK-Trees, which make finding words with a given distance to each other efficient. Not a total solution, but possibly a component of one.
This problem has a name: n-ary Gray code. Since you're using English letters, n = 26. The Wikipedia article on Gray code describes the problem and includes some sample code.

What is the most efficient way to keep track of a specific character's index in a string?

Take the following string as an example:
"The quick brown fox"
Right now the q in quick is at index 4 of the string (starting at 0) and the f in fox is at index 16. Now lets say the user enters some more text into this string.
"The very quick dark brown fox"
Now the q is at index 9 and the f is at index 26.
What is the most efficient method of keeping track of the index of the original q in quick and f in fox no matter how many characters are added by the user?
Language doesn't matter to me, this is more of a theory question than anything so use whatever language you want just try to keep it to generally popular and current languages.
The sample string I gave is short but I'm hoping for a way that can efficiently handle any size string. So updating an array with the offset would work with a short string but will bog down with to many characters.
Even though in the example I was looking for the index of unique characters in the string I also want to be able to track the index of the same character in different locations such as the o in brown and the o in fox. So searching is out of the question.
I was hoping for the answer to be both time and memory efficient but if I had to choose just one I care more about performance speed.
Let's say that you have a string and some of its letters are interesting. To make things easier let's say that the letter at index 0 is always interesting and you never add something before it—a sentinel. Write down pairs of (interesting letter, distance to the previous interesting letter). If the string is "+the very Quick dark brown Fox" and you are interested in q from 'quick' and f from 'fox' then you would write: (+,0), (q,10), (f,17). (The sign + is the sentinel.)
Now you put these in a balanced binary tree whose in-order traversal gives the sequence of letters in the order they appear in the string. You might now recognize the partial sums problem: You enhance the tree so that nodes contain (letter, distance, sum). The sum is the sum of all distances in the left subtree. (Therefore sum(x)=distance(left(x))+sum(left(x)).)
You can now query and update this data structure in logarithmic time.
To say that you added n characters to the left of character c you say distance(c)+=n an then go and update sum for all parents of c.
To ask what is the index of c you compute sum(c)+sum(parent(c))+sum(parent(parent(c)))+...
Your question is a little ambiguous - are you looking to keep track of the first instances of every letter? If so, an array of length 26 might be the best option.
Whenever you insert text into a string at a position lower than the index you have, just compute the offset based on the length of the inserted string.
It would also help if you had a target language in mind as not all data structures and interactions are equally efficient and effective in all languages.
The standard trick that usually helps in similar situations is to keep the characters of the string as leaves in a balanced binary tree. Additionally, internal nodes of the tree should keep sets of letters (if the alphabet is small and fixed, they could be bitmaps) that occur in the subtree rooted at a particular node.
Inserting or deleting a letter into this structure only needs O(log(N)) operations (update the bitmaps on the path to root) and finding the first occurence of a letter also takes O(log(N)) operations - you descend from the root, going for the leftmost child whose bitmap contains the interesting letter.
Edit: The internal nodes should also keep number of leaves in the represented subtree, for efficient computation of the letter's index.

Resources