Big-O time to generate all letter-replacement words? - algorithm

Let's say you have a word, such as 'cook', and you want to generate a graph of all possible words that could be made from that word by replacing each letter with all the other letters. An important restriction: you can ask the dictionary if a collection of letters is a word, but that is the limit of your interface to the dictionary. You cannot just ask the dictionary for all n-letter words.
I would imagine this would be a recursive algorithm generating a DAG such as follows:
                 cook
               /   |   \
           aook   ...   zook
          /  |  \      /  |  \
      aaok  ...  azok zaok ... zzok
And so on. Obviously in reality many of these permutations would be rejected as not being real words, but eventually you would have a DAG that contains all 'words' that can be generated. The height of the graph would be the length of the input word plus 1.
In the worst case, each permutation would be a word. The first level has 1 word, the second level 25, the next level (25*25) and so on. Thus if n is the length of the word, am I correct in thinking this would mean the algorithm has a worst-case time complexity of 25^n, and also a worst-case storage complexity of 25^n?

It is probably better not to view it as a tree, but as a graph.
Let n be the length of the words (the input).
Let W be the set of all (meaningful) words of length n.
Construct a simple graph G as follows: the vertices of G are the words in W; two words w1 and w2 are connected by an edge if and only if they differ in exactly one character.
Then what you want to do is the following:
Given a word w in W, find the connected component of w in the graph G.
This is typically done with depth-first search (DFS) or breadth-first search (BFS). Your algorithm is essentially a BFS (but not quite right: at each step you should generate all neighbours of a word, not just those obtained by replacing one particular position).
Since we may assume n is small, the time complexity is theoretically linear in the size of the result (although with a very large big-O constant).
However, you also need a comparable amount of memory to remember which words have already been checked.
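For concreteness, here is a minimal Python sketch of that BFS, where is_word stands in for the limited dictionary query the question allows (the toy word set is made up):

    from collections import deque
    from string import ascii_lowercase

    def connected_words(start, is_word):
        """BFS over the graph described above: the neighbours of a word are all
        single-letter replacements that the dictionary accepts."""
        seen = {start}                  # words already discovered (the memory noted above)
        queue = deque([start])
        while queue:
            w = queue.popleft()
            for i in range(len(w)):                      # every position...
                for c in ascii_lowercase:                # ...times every other letter
                    if c == w[i]:
                        continue
                    candidate = w[:i] + c + w[i + 1:]
                    if candidate not in seen and is_word(candidate):
                        seen.add(candidate)
                        queue.append(candidate)
        return seen

    # Toy stand-in dictionary; is_word is just membership in the set.
    words = {"cook", "book", "boot", "loot"}
    print(connected_words("cook", words.__contains__))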

If you generate only words that can be found in the dictionary, then the time complexity is bounded by the dictionary size, so it should be O(min(|D|, 25^n)).

Related

Huffman code with very large variety of elements

I am trying to implement the Huffman algorithm, as taught in my math class. However, I noticed that in the worst-case scenario, the produced code would be larger than the actual charset encoding.
Currently I only have flowcharts to show. This is strictly a theoretical question.
The issue is that the tree-building algorithm does not guarantee a balanced tree in which the leaves are evenly distributed, which would minimize the tree height and therefore the produced codeword lengths.
If the element frequencies follow a Fibonacci-like sequence,
N(1) = 1, N(2) = 2, N(i) = N(i-1) + N(i-2),
then N(k-1) < N(k) < N(k+1) always holds, and the following tree-building algorithm needs n - 1 tree levels, where n is the number of elements present:
pop the two smallest elements from the list
make a new node
assign both popped nodes to either branch of the new node
the weight of the new node is the sum of the weight of its two branches
insert new node in list at the appropriate location according to weight
repeat until only one node remains
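For illustration only (the question asks for theory, not code), here is a minimal Python sketch of that merge loop, using a binary heap in place of the sorted list; feeding it Fibonacci-like weights produces the fully skewed tree shown further down:

    import heapq
    from itertools import count

    def huffman_code_lengths(freqs):
        """Build a Huffman tree with the merge loop described above and return
        the codeword length (leaf depth) of every symbol."""
        tick = count()  # tie-breaker so the heap never has to compare trees
        # Each heap entry is (weight, tie, tree); a tree is a symbol or a (left, right) pair.
        heap = [(w, next(tick), sym) for sym, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)                        # pop the two smallest
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tick), (t1, t2)))  # insert the merged node
        _, _, root = heap[0]

        lengths = {}
        def walk(tree, depth):
            if isinstance(tree, tuple):
                walk(tree[0], depth + 1)
                walk(tree[1], depth + 1)
            else:
                lengths[tree] = depth
        walk(root, 0)
        return lengths

    # Fibonacci-like weights (1, 2, 3, 5, 8, ...) force the fully skewed tree:
    fib = {chr(ord('A') + i): w for i, w in enumerate([1, 2, 3, 5, 8, 13, 21, 34])}
    print(huffman_code_lengths(fib))   # the deepest symbols end up with n - 1 = 7 bits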
Given that Huffman codewords have a 1:1 relation to the depth at which an element sits in the tree, you could potentially need more bits than the uncompressed encoding of the character.
Imagine encoding a file written in English. Most of the time there will be more than 10 different characters present. These characters would use 8 bits in ASCII or UTF-8. However, to encode them using Huffman, in the worst-case scenario you would need up to 9 bits, which is worse than the original encoding.
This problem would be exacerbated with larger sets or if you used combinations of characters where the number of possible combinations would be C(n, k), where n is the number of elements present and k is the number of characters to be represented by one codeword.
Obviously, there is something I am missing. I would dearly appreciate an explanation on how to solve this, or links to quality resources where I can learn more. Thank you.
Worst-case scenario:

      /\
     A  \
        /\
       B  \
          /\
         C  ...
           /\
          Y  Z

Minimum number of deletions for a given word to become a dictionary word

Given a dictionary as a hashtable. Find the minimum # of deletions needed for a given word in order to make it match any word in the dictionary.
Is there some clever trick to solve this problem in less than exponential complexity (trying all possible combinations)?
For starters, suppose that you have a single word w in the hash table and that your word is x. You can delete letters from x to form w if and only if w is a subsequence of x, and in that case the number of letters you need to delete from x to form w is |x| - |w|. So certainly one option would be to just iterate over the hash table and, for each word, check whether it is a subsequence of x, taking the best match you find across the table.
To analyze the runtime of this operation, let's suppose that there are n total words in your hash table and that their total length is L. Then the runtime of this operation is O(L), since you'll process each character across all the words at most once. The complexity of your initial approach is O(|x| · 2^|x|) because there are 2^|x| possible words you can make by deleting letters from x and you'll spend O(|x|) time processing each one. Depending on the size of your dictionary and the size of your word, one algorithm might be better than the other, but we can say that the runtime is O(min{L, |x| · 2^|x|}) if you take the better of the two approaches.
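Here is a small Python sketch of that subsequence scan; the dictionary and query word are made-up examples:

    def is_subsequence(w, x):
        """True if w can be obtained from x by deleting letters (w is a subsequence of x)."""
        it = iter(x)
        return all(ch in it for ch in w)

    def min_deletions(x, dictionary):
        """Scan every dictionary word, keep the longest one that is a subsequence of x;
        the answer is then len(x) - len(best). Returns None if no dictionary word fits."""
        best = max((w for w in dictionary if is_subsequence(w, x)), key=len, default=None)
        return None if best is None else len(x) - len(best)

    # Made-up example:
    print(min_deletions("planet", {"plan", "net", "plane", "lane"}))   # 1, via "plane"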
You can build a trie and then see where your given word fits into it. The difference between the depth of your word and that of the closest existing parent is the number of deletions required.

Radix Tries, Tries and Ternary Search Tries

I'm currently trying to get my head around the variations of Trie and was wondering if anyone would be able to clarify a couple of points. (I got quite confused with the answer to this question Tries versus ternary search trees for autocomplete?, especially within the first paragraph).
From what I have read, is the following correct? Supposing we have stored n elements in the data structures, and L is the length of the string we are searching for:
A Trie stores its keys at the leaf nodes, and if we have a positive hit for the search, then it will perform O(L) comparisons. For a miss, however, the average performance is O(log2(n)).
Similarly, a Radix tree (with R = 2^r) stores the keys at the leaf nodes and will perform O(L) comparisons for a positive hit. However misses will be quicker, and occur on average in O(logR(n)).
A Ternary Search Trie is essentially a BST with operations <,>,= and with a character stored in every node. Instead of comparing the whole key at a node (as with BST), we only compare a character of the key at that node. Overall, supposing our alphabet size is A, then if there is a hit we must perform (at most) O(L*A) = O(L) comparisons. If there is not a hit, on average we have O(log3(n)).
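For illustration (this is not from the question or the linked answer), a minimal Python sketch of such a node and of the one-character-per-node comparison:

    class TSTNode:
        """One node of a ternary search trie: one character plus three children."""
        __slots__ = ("ch", "lo", "eq", "hi", "is_word")
        def __init__(self, ch):
            self.ch = ch
            self.lo = self.eq = self.hi = None
            self.is_word = False

    def tst_insert(node, key, i=0):
        ch = key[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.lo = tst_insert(node.lo, key, i)
        elif ch > node.ch:
            node.hi = tst_insert(node.hi, key, i)
        elif i + 1 < len(key):
            node.eq = tst_insert(node.eq, key, i + 1)
        else:
            node.is_word = True
        return node

    def tst_contains(node, key, i=0):
        # Each step compares ONE character of the key with the node's character.
        if node is None:
            return False
        ch = key[i]
        if ch < node.ch:
            return tst_contains(node.lo, key, i)
        if ch > node.ch:
            return tst_contains(node.hi, key, i)
        if i + 1 < len(key):
            return tst_contains(node.eq, key, i + 1)
        return node.is_word

    root = None
    for w in ["cat", "car", "cart", "dog"]:
        root = tst_insert(root, w)
    print(tst_contains(root, "car"), tst_contains(root, "care"))   # True False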
With regards to the Radix tree, if for example our alphabet is {0,1} and we set R = 4, then for a binary string 0101 we would only need two comparisons, right? Hence if the size of our alphabet is A, we would actually only perform L * (A / R) comparisons? If so then I guess this just becomes O(L), but I was curious whether this is correct reasoning.
Thanks for any help you folks can give!

Find the pair of bitstrings with the largest number of common set bits

I want to find an algorithm to find the pair of bitstrings in an array that have the largest number of common set bits (among all pairs in the array). I know it is possible to do this by comparing all pairs of bitstrings in the array, but this is O(n^2). Is there a more efficient algorithm? Ideally, I would like the algorithm to work incrementally by processing one incoming bitstring in each iteration.
For example, suppose we have this array of bitstrings (of length 8):
B1:01010001
B2:01101010
B3:01101010
B4:11001010
B5:00110001
The best pair here is B2 and B3, which have four common set bits.
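For reference, the O(n^2) baseline mentioned above is easy to write down; here is a small Python sketch using integers as bitstrings, with the example values:

    from itertools import combinations

    # The example bitstrings, as Python integers keyed by their labels.
    strings = {
        "B1": 0b01010001,
        "B2": 0b01101010,
        "B3": 0b01101010,
        "B4": 0b11001010,
        "B5": 0b00110001,
    }
    # Compare every pair and count the set bits of the AND (quadratic in the array size).
    best = max(combinations(strings, 2),
               key=lambda p: bin(strings[p[0]] & strings[p[1]]).count("1"))
    print(best, bin(strings[best[0]] & strings[best[1]]).count("1"))   # ('B2', 'B3') 4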
I found a paper that appears to describe such an algorithm (S. Taylor & T. Drummond (2011); "Binary Histogrammed Intensity Patches for Efficient and Robust Matching"; Int. J. Comput. Vis. 94:241–265), but I don't understand this description from page 252:
This can be incrementally updated in each iteration as the only [bitstring] overlaps that need recomputing are those for the new parent feature and any other [bitstrings] in the root whose “most overlapping feature” was one of the two selected for combination. This avoids the need for the O(N2) overlap comparison in every iteration and allows a forest for a typically-sized database of 700 features to be built in under a second.
As far as I can tell, Taylor & Drummond (2011) do not purport to give an O(n) algorithm for finding the pair of bitstrings in an array with the largest number of common set bits. They sketch an argument that a record of the best such pairs can be updated in O(n) after a new bitstring has been added to the array (and two old bitstrings removed).
Certainly the explanation of the algorithm on page 252 is not very clear, and I think their sketch argument that the record can be updated in O(n) is incomplete at best, so I can see why you are confused.
Anyway, here's my best attempt to explain Algorithm 1 from the paper.
Algorithm
The algorithm takes an array of bitstrings and constructs a lookup tree. A lookup tree is a binary forest (set of binary trees) whose leaves are the original bitstrings from the array, whose internal nodes are new bitstrings, and where if node A is a parent of node B, then A & B = A (that is, all the set bits in A are also set in B).
For example, if the input is this array of bitstrings:
then the output is the lookup tree:
The algorithm as described in the paper proceeds as follows:
1. Let R be the initial set of bitstrings (the root set).
2. For each bitstring f1 in R that has no partner in R, find and record its partner (the bitstring f2 in R − {f1} which has the largest number of set bits in common with f1) and record the number of bits they have in common.
3. If there is no pair of bitstrings in R with any common set bits, stop.
4. Let f1 and f2 be the pair of bitstrings in R with the largest number of common set bits.
5. Let p = f1 & f2 be the parent of f1 and f2.
6. Remove f1 and f2 from R; add p to R.
7. Go to step 2.
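Here is a straightforward, non-incremental Python sketch of those steps, using integers as bitstrings; it makes no attempt at the paper's O(n) partner-update trick:

    from itertools import combinations

    def build_lookup_forest(bitstrings):
        """Straightforward (non-incremental) rendering of the steps above.
        Bitstrings are Python ints; nodes are numbered so that duplicate
        bitstrings stay distinct. Returns the bits of every node, the final
        root set, and a child -> parent map (all by node id)."""
        bits = list(bitstrings)         # bits[i] is the bitstring of node i
        R = list(range(len(bits)))      # step 1: the root set, as node ids
        parent = {}
        while True:
            pairs = list(combinations(R, 2))
            if not pairs:
                break
            # Steps 2 and 4: the pair in R with the most common set bits.
            f1, f2 = max(pairs, key=lambda p: bin(bits[p[0]] & bits[p[1]]).count("1"))
            if bin(bits[f1] & bits[f2]).count("1") == 0:
                break                                      # step 3: nothing overlaps any more
            bits.append(bits[f1] & bits[f2])               # step 5: p = f1 & f2
            p = len(bits) - 1
            parent[f1] = parent[f2] = p                    # record the tree edges
            R = [x for x in R if x not in (f1, f2)] + [p]  # step 6
        return bits, R, parent

    bits, roots, parent = build_lookup_forest(
        [0b01010001, 0b01101010, 0b01101010, 0b11001010, 0b00110001])
    print([format(bits[r], "08b") for r in roots])         # the roots of the forest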
Analysis
Suppose that the array contains n bitstrings of fixed length. Then the algorithm as described is O(n^3), because step 2 is O(n^2) and there are O(n) iterations, because at each iteration we remove two bitstrings from R and add one.
The paper contains an argument that step 2 is Ω(n^2) only on the first time around the loop, and on other iterations it is O(n) because we only have to find the partner of p "and any other bitstrings in R whose partner was one of the two selected for combination." However, this argument is not convincing to me: it is not clear that there are only O(1) other such bitstrings. (Maybe there's a better argument?)
We could bring the algorithm down to O(n^2) by storing the number of common set bits between every pair of bitstrings. This requires O(n^2) extra space.
Reference
S. Taylor & T. Drummond (2011). "Binary Histogrammed Intensity Patches for Efficient and Robust Matching". Int. J. Comput. Vis. 94:241–265.
Well, for each bit position you could maintain two sets: those with that position on and those with it off. The sets could be placed in two binary trees, for example.
Then you just perform set intersections: first with all eight bits, then with every combination of 7, and so on, until you find an intersection that contains two elements.
The complexity here grows exponentially in the bit size, but if it is small and fixed this isn't a problem.
Another way to do it might be to look at the n k-bit strings as n points in a k-dimensional space; your task is then to find the two points closest together. There are a number of geometric algorithms for this.

Ordering a dictionary to maximize common letters between adjacent words

This is intended to be a more concrete, easily expressable form of my earlier question.
Take a list of words from a dictionary, all of the same length.
How can this list be reordered to keep as many letters as possible common between adjacent words?
Example 1:
AGNI, CIVA, DEVA, DEWA, KAMA, RAMA, SIVA, VAYU
reorders to:
AGNI, CIVA, SIVA, DEVA, DEWA, KAMA, RAMA, VAYU
Example 2:
DEVI, KALI, SHRI, VACH
reorders to:
DEVI, SHRI, KALI, VACH
The simplest algorithm seems to be: Pick anything, then search for the shortest distance?
However, DEVI -> KALI (1 letter in common) is equivalent to DEVI -> SHRI (1 letter in common).
Choosing the first match would result in fewer common letters across the entire list (4 versus 5).
This seems like it should be simpler than full TSP?
What you're trying to do is calculate the shortest Hamiltonian path in a complete weighted graph, where each word is a vertex and the weight of each edge is the number of letters that are different between those two words.
For your example, the graph would have edges weighted as so:
         DEVI  KALI  SHRI  VACH
DEVI      X     3     3     4
KALI      3     X     3     3
SHRI      3     3     X     4
VACH      4     3     4     X
Then it's just a simple matter of picking your favorite TSP solving algorithm, and you're good to go.
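For concreteness, a small Python sketch that builds those edge weights; any Hamiltonian-path/TSP solver can then be run over weights:

    from itertools import combinations

    def diff(a, b):
        """Number of positions at which two equal-length words differ (the edge weight)."""
        return sum(x != y for x, y in zip(a, b))

    words = ["DEVI", "KALI", "SHRI", "VACH"]
    weights = {(a, b): diff(a, b) for a, b in combinations(words, 2)}
    print(weights)
    # {('DEVI', 'KALI'): 3, ('DEVI', 'SHRI'): 3, ('DEVI', 'VACH'): 4,
    #  ('KALI', 'SHRI'): 3, ('KALI', 'VACH'): 3, ('SHRI', 'VACH'): 4}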
My pseudo code:
Create a graph of nodes where each node represents a word
Create connections between all the nodes (every node connects to every other node). Each connection has a "value" which is the number of common characters.
Drop connections where the "value" is 0.
Walk the graph by preferring connections with the highest values. If you have two connections with the same value, try both recursively.
Store the output of a walk in a list along with the sum of the distance between the words in this particular result. I'm not 100% sure ATM if you can simply sum the connections you used. See for yourself.
From all outputs, choose the one with the highest value.
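A rough Python sketch of that walk (it skips the drop-zero-connections step and simply keeps extending, recursing on ties):

    def common(a, b):
        """Number of positions at which two equal-length words share a letter."""
        return sum(x == y for x, y in zip(a, b))

    def best_ordering(words):
        """Start from every word, always follow a connection with the highest
        value, branch on ties, and keep the ordering whose summed values are largest."""
        best_order, best_score = None, -1

        def extend(order, remaining, score):
            nonlocal best_order, best_score
            if not remaining:
                if score > best_score:
                    best_order, best_score = order, score
                return
            top = max(common(order[-1], w) for w in remaining)
            for w in remaining:                      # recurse into every tie
                if common(order[-1], w) == top:
                    extend(order + [w], remaining - {w}, score + top)

        for start in words:
            extend([start], set(words) - {start}, 0)
        return best_order, best_score

    print(best_ordering(["DEVI", "KALI", "SHRI", "VACH"]))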
This problem is probably NP-complete, which means that the runtime of the algorithm will become unbearable as the dictionaries grow. Right now, I see only one way to optimize it: cut the graph into several smaller graphs, run the code on each, and then join the lists. The result won't be as perfect as when you try every permutation, but the runtime will be much better and the final result might be "good enough".
[EDIT] Since this algorithm doesn't try every possible combination, it's quite possible to miss the perfect result. It's even possible to get caught in a local maximum. Say you have a pair with a value of 7, but if you choose this pair, all other values drop to 1; if you don't take it, most other values would be 2, giving a much better overall final result.
This algorithm trades perfection for speed. When trying every possible combination would take years, even with the fastest computer in the world, you must find some way to bound the runtime.
If the dictionaries are small, you can simply create every permutation and then select the best result. If they grow beyond a certain bound, you're doomed.
Another solution is to mix the two. Use the greedy algorithm to find "islands" which are probably pretty good and then use the "complete search" to sort the small islands.
This can be done with a recursive approach. Pseudo-code:
Start with one of the words; call it w
FindNext(w, l)                         // l = the list of words without w
    Get the list m of words in l closest to w
    If there is only one word in m
        Return that word
    Else
        For every word w' in m, do FindNext(w', l')   // l' = l without w'
You can add some score to count common pairs and to prefer "better" lists.
You may want to take a look at BK-Trees, which make finding words with a given distance to each other efficient. Not a total solution, but possibly a component of one.
This problem has a name: n-ary Gray code. Since you're using English letters, n = 26. The Wikipedia article on Gray code describes the problem and includes some sample code.
