Weighted trie for autosuggest feature

I have a trie (prefix tree) that I am using for an auto-suggest feature on my website.
Now I want to show the most popular (most highly weighted) text above text with a lower weight. How can I change my trie so that suggestions come out in weighted order?
Or should I just sort by weight in memory?

You could add a count or weight property to each node and update it as you build the trie from your words. Each character starts with a weight of 0, but if a character is the terminal character of a word, its node gets a weight of 1. As you keep adding words, you keep adjusting the weights on the terminal characters.
So, for example, you could have:
t:0
 |
o:1
 |
w:3 ------ e:0
 |   \        \
n:2   a:0     l:4
       |
      r:0
       |
      d:2
For the strings to (appearing once), tow (appearing thrice), towel (appearing four times), town (appearing twice), and toward (also appearing twice).
Then if you had the prefix tow, you can look at non-zero weighted strings such as tow:3, towel:4, town:2, and toward:2.
After that, you can sort based on weight.
I haven't tried out this implementation in practice; this is just an idea.
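Here is a minimal Python sketch of that idea (class and method names such as WeightedTrie and suggest are illustrative, not from the question); the terminal weight is bumped on every insert and suggestions are sorted heaviest first:

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.weight = 0      # > 0 only if some word terminates here

class WeightedTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, count=1):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.weight += count   # bump the weight on the terminal node

    def suggest(self, prefix):
        # Walk down to the prefix node, then collect every word below it
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        self._collect(node, prefix, results)
        return sorted(results, key=lambda pair: -pair[1])   # heaviest first

    def _collect(self, node, path, results):
        if node.weight > 0:
            results.append((path, node.weight))
        for ch, child in node.children.items():
            self._collect(child, path + ch, results)

trie = WeightedTrie()
for word, count in [("to", 1), ("tow", 3), ("towel", 4), ("town", 2), ("toward", 2)]:
    trie.insert(word, count)
print(trie.suggest("tow"))   # [('towel', 4), ('tow', 3), ('town', 2), ('toward', 2)]

If sorting on every query ever became a bottleneck, you could instead cache the top few completions at each node, but for most autosuggest lists the sort above is cheap.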

Related

Big-O time to generate all letter-replacement words?

Let's say you have a word, such as 'cook', and you want to generate a graph of all possible words that could be made from that word by replacing each letter with all the other letters. An important restriction: you can ask the dictionary if a collection of letters is a word, but that is the limit of your interface to the dictionary. You cannot just ask the dictionary for all n-letter words.
I would imagine this would be a recursive algorithm generating a DAG such as follows:
            cook
          /   |   \
      aook   ...   zook
     /  |  \      /  |  \
 aaok  ...  azok zaok ... zzok
And so on. Obviously in reality many of these permutations would be rejected as not being real words, but eventually you would have a DAG that contains all 'words' that can be generated. The height of the graph would be the length of the input word plus 1.
In the worst case, each permutation would be a word. The first level has 1 word, the second level 25, the next level (25*25) and so on. Thus if n is the length of the word, am I correct in thinking this would mean the algorithm has a worst-case time complexity of 25^n, and also a worst-case storage complexity of 25^n?
It is probably better not to view it as a tree, but as a graph.
Let n be the input, which is the length of the words.
Let W be the set of all (meaningful) words of length n.
Construct a simple graph G as follows: the vertices of G are the words in W; two words w1 and w2 have an edge connecting them if and only if they differ in exactly one character.
Then what you want to do is the following:
Given a word w in W, find the connected component of w in the graph G.
This is typically done with depth-first search (DFS) or breadth-first search (BFS). Your algorithm is essentially a BFS, but not quite correct: at each step you should generate all neighbours of a word, not just the words obtained by replacing one fixed position.
Since we may assume n is small, the time complexity is theoretically linear in the size of the result (although with a very large big-O constant).
However, you also need a comparable amount of memory to record which words have already been checked.
If you generate only words that can be found in the dictionary, then the time complexity is bounded by the dictionary size, so it should be O(min(|D|, 25^n)).
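A minimal BFS sketch, assuming the dictionary is exposed only as a membership test is_word(candidate), as the question requires (function names are illustrative):

from collections import deque
from string import ascii_lowercase

def connected_words(start, is_word):
    # BFS over the implicit word graph: neighbours differ in exactly one letter
    seen = {start}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        yield word
        for i in range(len(word)):
            for ch in ascii_lowercase:
                if ch == word[i]:
                    continue
                candidate = word[:i] + ch + word[i + 1:]
                if candidate not in seen and is_word(candidate):
                    seen.add(candidate)
                    queue.append(candidate)

# Toy usage: any container's membership test can stand in for the dictionary
toy = {"cook", "book", "look", "loon", "cool"}
print(list(connected_words("cook", toy.__contains__)))
# ['cook', 'book', 'look', 'cool', 'loon']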

How do I use a Trie for spell checking

I have a trie that I've built from a dictionary of words. I want to use it for spell checking (and to suggest the closest matches in the dictionary, perhaps within a given number of edits x). I'm thinking I'd use the Levenshtein distance between the target word and the words in my dictionary, but is there a smart way to traverse the trie without actually running the edit distance logic over each word separately? How should I do the traversal and the edit distance matching?
For example, if I have the words MAN and MANE, I should be able to reuse the edit distance computation on MAN when processing MANE. Otherwise the trie wouldn't serve any purpose.
I think you should instead give BK-trees a try; it's a data structure that fits spell checking well, as it lets you compute the edit distance against the words of your dictionary efficiently.
This link gives a good insight into BK-trees applied to spell checking.
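For concreteness, here is a minimal BK-tree sketch (my own illustration, not the linked article's code), using Levenshtein distance as the metric; search(word, max_dist) returns every stored word within max_dist edits:

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})   # node = (word, {distance: child node})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def search(self, word, max_dist):
        results, stack = [], [self.root]
        while stack:
            current, children = stack.pop()
            d = edit_distance(word, current)
            if d <= max_dist:
                results.append((current, d))
            # Triangle inequality: only children stored at a distance within
            # [d - max_dist, d + max_dist] can possibly contain matches.
            for dist, child in children.items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree(["man", "mane", "main", "moon", "cat"])
print(tree.search("mane", 1))   # [('man', 1), ('mane', 0)]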
Try computing, for each trie node, an array A where A[x] is the smallest edit distance achievable at that node after matching the first x letters of the target word.
You can then stop examining a node (and its subtree) if every element in its array is greater than your target distance.
For example, with a trie containing MAN and MANE and an input BANE:
Node 0 representing '', A=[0,1,2,3,4]
Node 1 representing 'M', A=[1,1,2,3,4]
Node 2 representing 'MA', A=[2,2,1,2,3]
Node 3 representing 'MAN', A=[3,3,2,1,2]
Node 4 representing 'MANE', A=[4,4,3,2,1]
The smallest value for A[end] is 1 reached with the word 'MANE' so this is the best match.
There is a smart way to score every word during the traversal, though the result is not quite a Levenshtein distance, since the following algorithm does not incorporate transpositions.
Assuming we have the trie structure, we can implement a recursive search of the tree. The recursive search starts with a cost row representing the cost of deleting 0, 1, 2, ... letters of the target word (i.e. [0, 1, ..., len(word)]). As we recursively search the trie, the information we have is:
You are at a node n, which is indexed in your trie structure by the letter l.
You are considering the distance from a target word w.
Your current path carries the previous cost row up to this point; we wish to update it to form a new cost row for this node n.
We want to update each entry of the cost row according to 4 situations: l is the next letter in the word (cost stays the same), a letter needs to be inserted (new cost +1), a letter has been deleted (cost of the previous entry +1), and l replaces a letter of the word (previous cost +1).
The cost of proceeding down this path in the trie is the minimum of these costs. At this point, if you are at a node in the trie that marks the end of a word, append it to a result list, and then recursively search all children, as long as the current cost is within a defined maximum cost. An implementation in Python can be found in another post:
https://stackoverflow.com/a/62823597/8249836
I also have this in C for piping. Since the algorithm is pretty fast even for high edit distances (< length of the word), one may use a fast, efficient implementation of the Levenshtein distance afterwards to correct the results of this method.
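A minimal Python sketch of this cost-row search, assuming a simple dict-of-dicts trie in which the key '$' marks the end of a word (this is my own illustration, not the implementation linked above):

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def search(trie, target, max_cost):
    results = []
    # Initial row: cost of deleting 0..len(target) letters of the target
    first_row = list(range(len(target) + 1))
    for letter, child in trie.items():
        if letter != '$':
            _search(child, letter, target, first_row, letter, max_cost, results)
    return results

def _search(node, letter, target, prev_row, word_so_far, max_cost, results):
    # Build the new cost row for this trie node (classic Levenshtein recurrence)
    row = [prev_row[0] + 1]
    for i in range(1, len(target) + 1):
        insert_cost = row[i - 1] + 1
        delete_cost = prev_row[i] + 1
        replace_cost = prev_row[i - 1] + (target[i - 1] != letter)
        row.append(min(insert_cost, delete_cost, replace_cost))
    if row[-1] <= max_cost and '$' in node:
        results.append((word_so_far, row[-1]))
    # Prune: only continue if some entry of the row could still lead to a match
    if min(row) <= max_cost:
        for next_letter, child in node.items():
            if next_letter != '$':
                _search(child, next_letter, target, row, word_so_far + next_letter,
                        max_cost, results)

trie = build_trie(["man", "mane", "many", "moon"])
print(search(trie, "mane", 1))   # [('man', 1), ('mane', 0), ('many', 1)]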

Check if given string can be created by a set of characters cut out from magazine article

"Observe that when you cut a character out of a magazine, the character on the reverse side of the page is also removed. Give an algorithm to determine whether you can generate a given string by pasting cutouts from a given magazine. Assume that you are given a function that will identify the character and its position on the reverse side of the page for any given character position."
How can I do it?
I can do some initial pruning, so that if a needed character has only one way of being picked up, it's taken before handing the remaining sub-problem to the dynamic programming technique, but what comes after this initial pruning?
What is the time and space complexity?
As @LiKao suggested, this can be solved using max flow. To construct the network we make two "layers" of vertices: one with all the distinct characters in the input string and one with each position on the page. Make an edge with capacity 1 from a character to a position if that position has that character on one side. Make edges of capacity 1 from each position to the sink, and make edges from the source to each character with capacity equal to the multiplicity of that character in the input string.
For example, let's say we're searching for the word "FOO" on a page with four positions:
pos    1  2  3  4
front  F  C  O  Z
back   O  O  K  Z
We then generate the corresponding network, ignoring position 4 since it does not provide any of the required characters.
Now we only need to determine whether there is a flow from the source to the sink of value at least length("FOO") = 3.
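A small sketch of this construction in Python, using plain Ford-Fulkerson augmenting paths (enough for a toy network; the helper names are mine):

from collections import defaultdict

def max_flow(cap, source, sink):
    # Ford-Fulkerson with DFS augmenting paths, pushing one unit per path
    def augment(u, visited):
        if u == sink:
            return True
        visited.add(u)
        for v in list(cap[u]):
            if cap[u][v] > 0 and v not in visited and augment(v, visited):
                cap[u][v] -= 1
                cap[v][u] += 1
                return True
        return False

    flow = 0
    while augment(source, set()):
        flow += 1
    return flow

word = "FOO"
page = [("F", "O"), ("C", "O"), ("O", "K"), ("Z", "Z")]   # (front, back) per position

cap = defaultdict(lambda: defaultdict(int))
for ch in set(word):
    cap["source"][ch] = word.count(ch)        # multiplicity of the character
for pos, sides in enumerate(page):
    for ch in set(word):
        if ch in sides:
            cap[ch][("pos", pos)] = 1         # this position can supply ch
    cap[("pos", pos)]["sink"] = 1             # each cutout can be used at most once

print(max_flow(cap, "source", "sink") >= len(word))   # True: "FOO" can be built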
You can use dynamic programming directly.
We are given string s with n letters. We are given a set of pieces P = {p_1, ..., p_k}. Each piece has one letter in the front p_i.f and one in the back p_i.b.
Denote with f(j, p) the function that returns true if it is feasible to create substring s_1...s_j using pieces in p \subseteq P, and false otherwise.
The following recurrence holds:
f(n, P) = f(n-1, P-p_1) | f(n-1, P-p_2) | ... | f(n-1, P-p_k)
In plain English: the feasibility of s using all pieces in P depends on the feasibility of the substring s_1...s_{n-1} given one less piece, and we try removing every possible piece (of course, in practice we do not have to remove all pieces one by one; we only need to remove those pieces for which p_i.f == s_n || p_i.b == s_n).
The initial condition is that f(1, P-p_1) = f(1, P-p_2) = ... = true, assuming that we have already checked a priori (in linear time) that there are enough letters in P to cover all the letters in s.
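A memoized sketch of this recurrence, written left to right (which is equivalent); the bitmask used plays the role of "P minus the removed pieces". Note that it is exponential in the number of pieces, so it is only reasonable for small instances:

from functools import lru_cache

def feasible(s, pieces):
    # pieces: list of (front, back) letter pairs, one per cutout
    @lru_cache(maxsize=None)
    def f(j, used):
        if j == len(s):
            return True
        return any(f(j + 1, used | (1 << i))
                   for i, sides in enumerate(pieces)
                   if not (used >> i & 1) and s[j] in sides)
    return f(0, 0)

print(feasible("FOO", [("F", "O"), ("C", "O"), ("O", "K"), ("Z", "Z")]))   # True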
While this problem can be formulated as a max-flow problem, as shown in the accepted answer, it is simpler and more efficient to formulate it as a maximum-cardinality matching problem in a bipartite graph. Max-flow algorithms like Dinic's are slower than special-case algorithms like the Hopcroft-Karp algorithm.
The bipartite graph is formed by adding an edge from every character position of the given string to each cutout, one edge for each side of the cutout that shows that character. We then run Hopcroft-Karp. In the end, we simply check whether the cardinality of the matching equals the length of the string.
For a working implementation (in Scala) using JGraphT, see my GitHub.
I'd like to come up with a more efficient DP solution, since Skiena's book has this problem in the DP section, but so far haven't found any.
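For illustration, here is a minimal Python sketch of the matching formulation, using Kuhn's augmenting-path algorithm rather than Hopcroft-Karp for brevity (function names are mine):

def can_build(word, page):
    # page: list of (front, back) pairs, one per cutout position
    # adj[i] = cutouts usable for the i-th character of the word
    adj = [[j for j, sides in enumerate(page) if word[i] in sides]
           for i in range(len(word))]
    match = [-1] * len(page)          # match[j] = word position using cutout j

    def try_assign(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match[j] == -1 or try_assign(match[j], seen):
                match[j] = i
                return True
        return False

    matched = sum(try_assign(i, set()) for i in range(len(word)))
    return matched == len(word)

print(can_build("FOO", [("F", "O"), ("C", "O"), ("O", "K"), ("Z", "Z")]))    # True
print(can_build("FOOO", [("F", "O"), ("C", "O"), ("O", "K"), ("Z", "Z")]))   # False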

Ordering a dictionary to maximize common letters between adjacent words

This is intended to be a more concrete, easily expressible form of my earlier question.
Take a list of words from a dictionary, all with the same letter length.
How can this list be reordered to keep as many letters as possible in common between adjacent words?
Example 1:
AGNI, CIVA, DEVA, DEWA, KAMA, RAMA, SIVA, VAYU
reorders to:
AGNI, CIVA, SIVA, DEVA, DEWA, KAMA, RAMA, VAYU
Example 2:
DEVI, KALI, SHRI, VACH
reorders to:
DEVI, SHRI, KALI, VACH
The simplest algorithm seems to be: Pick anything, then search for the shortest distance?
However, DEVI->KALI (1 common) is equivalent to DEVI->SHRI (1 common)
Choosing the first match would result in fewer common pairs in the entire list (4 versus 5).
It seems like this should be simpler than full TSP?
What you're trying to do is calculate the shortest Hamiltonian path in a complete weighted graph, where each word is a vertex and the weight of each edge is the number of letters that differ between those two words.
For your example, the graph would have edges weighted as so:
      DEVI  KALI  SHRI  VACH
DEVI    X     3     3     4
KALI    3     X     3     3
SHRI    3     3     X     4
VACH    4     3     4     X
Then it's just a simple matter of picking your favorite TSP solving algorithm, and you're good to go.
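For small lists you can even solve it exactly by brute force; a minimal sketch (helper names are mine), which reproduces the ordering from Example 2:

from itertools import permutations

def diff(a, b):
    # Edge weight: number of positions where the two words differ
    return sum(x != y for x, y in zip(a, b))

def best_order(words):
    # Shortest Hamiltonian path by trying every ordering (O(n!), small lists only)
    return min(permutations(words),
               key=lambda order: sum(diff(a, b) for a, b in zip(order, order[1:])))

print(best_order(["DEVI", "KALI", "SHRI", "VACH"]))
# ('DEVI', 'SHRI', 'KALI', 'VACH'), total weight 9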
My pseudo code:
Create a graph of nodes where each node represents a word
Create connections between all the nodes (every node connects to every other node). Each connection has a "value" which is the number of common characters.
Drop connections where the "value" is 0.
Walk the graph by preferring connections with the highest values. If you have two connections with the same value, try both recursively.
Store the output of a walk in a list along with the sum of the distance between the words in this particular result. I'm not 100% sure ATM if you can simply sum the connections you used. See for yourself.
From all outputs, choose the one with the highest value.
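A quick Python sketch of the greedy part of this walk (illustrative only: it always takes the first best neighbour instead of recursing on ties), which happens to reproduce Example 1:

def common(a, b):
    return sum(x == y for x, y in zip(a, b))

def greedy_order(words):
    remaining = list(words)
    order = [remaining.pop(0)]                    # start from the first word
    while remaining:
        best = max(remaining, key=lambda w: common(order[-1], w))
        remaining.remove(best)
        order.append(best)
    return order

print(greedy_order(["AGNI", "CIVA", "DEVA", "DEWA", "KAMA", "RAMA", "SIVA", "VAYU"]))
# ['AGNI', 'CIVA', 'SIVA', 'DEVA', 'DEWA', 'KAMA', 'RAMA', 'VAYU']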
This problem is probably NP-complete, which means that the runtime of the algorithm will become unbearable as the dictionaries grow. Right now, I see only one way to optimize it: cut the graph into several smaller graphs, run the code on each and then join the lists. The result won't be as perfect as when you try every permutation, but the runtime will be much better and the final result might be "good enough".
[EDIT] Since this algorithm doesn't try every possible combination, it's quite possible to miss the perfect result. It's even possible to get caught in a local maximum: say you have a pair with a value of 7, but if you choose this pair, all other values drop to 1, while if you don't take this pair, most other values would be 2, giving a much better overall final result.
This algorithm trades perfection for speed. When trying every possible combination would take years, even with the fastest computer in the world, you must find some way to bound the runtime.
If the dictionaries are small, you can simply create every permutation and then select the best result. If they grow beyond a certain bound, you're doomed.
Another solution is to mix the two. Use the greedy algorithm to find "islands" which are probably pretty good and then use the "complete search" to sort the small islands.
This can be done with a recursive approach. Pseudo-code:
Start with one of the words; call it w
FindNext(w, l)                          // l = list of words without w
    Get the list m of the words in l nearest to w
    If there is only one word in m
        Return that word
    Else
        For every word w' in m do FindNext(w', l')   // l' = l without w'
You can add some score to count common pairs and to prefer "better" lists.
You may want to take a look at BK-Trees, which make finding words with a given distance to each other efficient. Not a total solution, but possibly a component of one.
This problem has a name: n-ary Gray code. Since you're using English letters, n = 26. The Wikipedia article on Gray code describes the problem and includes some sample code.

What is the most efficient way to keep track of a specific character's index in a string?

Take the following string as an example:
"The quick brown fox"
Right now the q in quick is at index 4 of the string (starting at 0) and the f in fox is at index 16. Now let's say the user enters some more text into this string.
"The very quick dark brown fox"
Now the q is at index 9 and the f is at index 26.
What is the most efficient method of keeping track of the index of the original q in quick and f in fox no matter how many characters are added by the user?
Language doesn't matter to me; this is more of a theory question than anything, so use whatever language you want, but try to keep it to generally popular and current languages.
The sample string I gave is short, but I'm hoping for a way that can efficiently handle a string of any size. So updating an array with the offsets would work for a short string but will bog down with too many characters.
Even though in the example I was looking for the index of unique characters in the string, I also want to be able to track the index of the same character in different locations, such as the o in brown and the o in fox. So searching is out of the question.
I was hoping for the answer to be both time and memory efficient but if I had to choose just one I care more about performance speed.
Let's say that you have a string and some of its letters are interesting. To make things easier, let's say that the letter at index 0 is always interesting and you never add anything before it; it acts as a sentinel. Write down pairs of (interesting letter, distance to the previous interesting letter). If the string is "+the very Quick dark brown Fox" and you are interested in the q from 'quick' and the f from 'fox', then you would write: (+,0), (q,10), (f,17). (The sign + is the sentinel.)
Now you put these in a balanced binary tree whose in-order traversal gives the sequence of letters in the order they appear in the string. You might now recognize the partial sums problem: you enhance the tree so that each node contains (letter, distance, sum), where sum is the total of the distances stored in the node's left subtree.
You can now query and update this data structure in logarithmic time.
To record that you added n characters just to the left of character c, you set distance(c) += n and then update sum for every ancestor of c that has c in its left subtree.
To ask what the index of c is, you walk down from the root to c, adding distance(a) + sum(a) for every node a at which you step into the right child, and finally adding distance(c) + sum(c) for c itself.
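A minimal sketch of the partial-sums idea using a Fenwick (binary indexed) tree over a fixed set of markers; this is a simplification of the answer above, since the balanced tree additionally allows new interesting letters to be added later (class and method names are mine):

class MarkerIndex:
    def __init__(self, distances):
        # distances[k] = gap between marker k and the previous marker (marker 0 is the sentinel)
        self.n = len(distances)
        self.tree = [0] * (self.n + 1)
        for k, d in enumerate(distances):
            self._add(k + 1, d)

    def _add(self, i, delta):
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def insert_before(self, k, length):
        # `length` characters were typed in the gap just before marker k
        self._add(k + 1, length)

    def index_of(self, k):
        # current index of marker k = prefix sum of the gaps up to and including k
        i, total = k + 1, 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

# "The quick brown fox": sentinel 'T' at 0, 'q' at 4, 'f' at 16
m = MarkerIndex([0, 4, 12])
print(m.index_of(1), m.index_of(2))    # 4 16
m.insert_before(1, 5)                  # "very " typed before "quick"
m.insert_before(2, 5)                  # " dark" typed between the q and the f markers
print(m.index_of(1), m.index_of(2))    # 9 26, matching "The very quick dark brown fox"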
Your question is a little ambiguous - are you looking to keep track of the first instances of every letter? If so, an array of length 26 might be the best option.
Whenever you insert text into the string at a position lower than a tracked index, just offset that index by the length of the inserted string.
It would also help if you had a target language in mind as not all data structures and interactions are equally efficient and effective in all languages.
The standard trick that usually helps in similar situations is to keep the characters of the string as leaves in a balanced binary tree. Additionally, internal nodes of the tree should keep sets of letters (if the alphabet is small and fixed, they could be bitmaps) that occur in the subtree rooted at a particular node.
Inserting or deleting a letter in this structure only needs O(log(N)) operations (update the bitmaps on the path to the root), and finding the first occurrence of a letter also takes O(log(N)) operations: you descend from the root, going to the leftmost child whose bitmap contains the interesting letter.
Edit: The internal nodes should also keep the number of leaves in the represented subtree, for efficient computation of a letter's index.
