Find the Most Frequent Ordered Word Pair In a Document - algorithm

This is a problem from S. Skiena's "The Algorithm Design Manual". The problem statement is:
Give an algorithm for finding an ordered word pair (e.g. "New York")
occurring with the greatest frequency in a given webpage.
Which data structure would you use? Optimize both time and space.
One obvious solution is to insert each ordered pair into a hash map and then iterate over all of them to find the most frequent one. However, there should be a better way. Can anyone suggest anything?

In a text with n words, we have exactly n - 1 ordered word pairs (not distinct of course). One solution is to use a max priority queue; we simply insert each pair in the max PQ with frequency 1 if not already present. If present, we increment the key. However, if we use a Trie, we don't need to represent all n - 1 pairs separately. Take for example the following text:
A new puppy in New York is happy with it's New York life.
The resulting Trie would look like the following:
If we store the number of occurrences of a pair in the leaf nodes, we can easily compute the maximum occurrence in linear time. Since we need to look at each word, that's the best we can do, time-wise.
Working Scala code below. The official site has a solution in Python.
import scala.annotation.tailrec
import scala.collection.mutable.{ListBuffer, Map => MutableMap}

class TrieNode(val parent: Option[TrieNode] = None,
               val children: MutableMap[Char, TrieNode] = MutableMap.empty,
               var n: Int = 0) {

  // Add a child for character c (creating it if needed) and bump its count.
  def add(c: Char): TrieNode = {
    val child = children.getOrElseUpdate(c, new TrieNode(parent = Some(this)))
    child.n += 1
    child
  }

  // The character on the edge leading from the parent to the given node.
  def letter(node: TrieNode): Char = {
    node.parent
      .flatMap(_.children.find(_._2 eq node))
      .map(_._1)
      .getOrElse('\u0000')
  }

  // Reconstruct the stored string by walking back up to the root.
  override def toString: String = {
    Iterator
      .iterate((ListBuffer.empty[Char], Option(this))) {
        case (buffer, node) =>
          node
            .filter(_.parent.isDefined)
            .map(letter)
            .foreach(buffer.prepend(_))
          (buffer, node.flatMap(_.parent))
      }
      .dropWhile(_._2.isDefined)
      .take(1)
      .map(_._1.mkString)
      .next()
  }
}
def mostCommonPair(text: String): (String, Int) = {
  val root = new TrieNode()

  @tailrec
  def loop(s: String,
           mostCommon: TrieNode,
           count: Int,
           parent: TrieNode): (String, Int) = {
    s.split("\\s+", 2) match {
      case Array(head, tail @ _*) if head.nonEmpty =>
        // Insert the current word under `parent` (the root for the first word
        // of a pair, or the node ending in ' ' for the second word).
        val word = head.foldLeft(parent)((tn, c) => tn.add(c))
        val (common, n, p) =
          if (parent eq root) (mostCommon, count, word.add(' '))
          else if (word.n > count) (word, word.n, root)
          else (mostCommon, count, root)
        loop(tail.headOption.getOrElse(""), common, n, p)
      case _ => (mostCommon.toString, count)
    }
  }

  loop(text, new TrieNode(), -1, root)
}
Inspired by the question here.

I think the first point to note is that finding the most frequent ordered word pair is no more (or less) difficult than finding the most frequent word. The only difference is that instead of words made up of the letters a..z+A..Z separated by punctuation or spaces, you are looking for word-pairs made up of the letters a..z+A..Z+exactly_one_space, similarly separated by punctuation or spaces.
If your web-page has n words then there are only n-1 word-pairs. So hashing each word-pair and then iterating over the hash table will be O(n) in both time and memory. This should be pretty quick to do even if n is ~10^6 (i.e. the length of an average novel). I can't imagine anything more efficient unless n is fairly small, in which case the memory savings resulting from constructing an ordered list of word pairs (instead of a hash table) might outweigh the cost of increasing the time complexity to O(n log n).
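For illustration, here is a minimal Python sketch of that hash-based counting approach (a sketch only: the tokenization is deliberately naive and just takes runs of letters, and ties are broken arbitrarily):

import re
from collections import Counter

def most_common_pair(text):
    # Naive tokenization: words are runs of letters; adjust as needed.
    words = re.findall(r"[A-Za-z]+", text)
    # n words give n - 1 ordered pairs; count them in a hash map.
    pairs = Counter(zip(words, words[1:]))
    if not pairs:
        return None
    (w1, w2), count = pairs.most_common(1)[0]
    return f"{w1} {w2}", count

print(most_common_pair("A new puppy in New York is happy with its New York life."))
# -> ('New York', 2)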

Why not keep all the ordered pairs in an AVL tree, with a 10-element array to track the top 10 ordered pairs? In the AVL tree we would keep each ordered pair with its occurrence count, and the top 10 would be kept in the array. That way searching for any ordered pair is O(log N) and a full traversal is O(N).

I don't think we can do better than O(n) time, since one has to look at each element at least once, so the time complexity cannot be optimised further.
But we can use a trie to optimise the space used. In a page there are often words which are repeated, so this can lead to a significant reduction in space usage. The leaf nodes in the trie can store the frequency of each ordered pair; to build it, use two pointers to iterate over the text, where one points at the current word and the other at the previous word. A rough sketch of this idea follows.
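Here is a rough Python sketch of that trie idea (an illustration under simplifying assumptions: naive tokenization, a plain dict-of-dicts trie, and a '#' key marking the end of a pair and holding its count):

import re

def most_common_pair_trie(text):
    words = re.findall(r"[A-Za-z]+", text)
    root = {}                              # children keyed by character
    best_pair, best_count = None, 0
    for prev, cur in zip(words, words[1:]):
        node = root
        for ch in prev + " " + cur:        # repeated words share trie prefixes
            node = node.setdefault(ch, {})
        node["#"] = node.get("#", 0) + 1   # terminal marker stores the pair count
        if node["#"] > best_count:
            best_pair, best_count = f"{prev} {cur}", node["#"]
    return best_pair, best_count

print(most_common_pair_trie("A new puppy in New York is happy with its New York life."))
# -> ('New York', 2)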


What is the time and space complexity of the following solution?

Problem statement:
Given a non-empty string s and a dictionary wordDict containing a list of non-empty words, add spaces in s to construct a sentence where each word is a valid dictionary word. Return all such possible sentences.
Note:
The same word in the dictionary may be reused multiple times in the segmentation.
You may assume the dictionary does not contain duplicate words.
Sample test case:
Input:
s = "catsanddog"
wordDict = ["cat", "cats", "and", "sand", "dog"]
Output:
[
"cats and dog",
"cat sand dog"
]
My Solution:
#include <string>
#include <vector>
#include <unordered_map>
#include <unordered_set>
using namespace std;

class Solution {
    unordered_set<string> words;
    unordered_map<string, vector<string> > memo;
public:
    vector<string> getAllSentences(string s) {
        if (s.size() == 0) {
            return {""};
        }
        if (memo.count(s)) {
            return memo[s];
        }
        string curWord = "";
        vector<string> result;
        for (int i = 0; i < s.size(); i++) {
            curWord += s[i];
            if (words.count(curWord)) {
                // Recursively break the remainder of the string.
                auto sentences = getAllSentences(s.substr(i + 1));
                for (string rest : sentences) {
                    string sentence = curWord + ((int)rest.size() > 0 ? (" " + rest) : "");
                    result.push_back(sentence);
                }
            }
        }
        return memo[s] = result;
    }

    vector<string> wordBreak(string s, vector<string>& wordDict) {
        for (auto word : wordDict) {
            words.insert(word);
        }
        return getAllSentences(s);
    }
};
I am not sure about the time and space complexity. I think it should be 2^n, where n is the length of the given string s. Can anyone please help me prove the time and space complexity?
I also have the following questions:
If I don't use memo in the getAllSentences function, what will the time complexity be in that case?
Is there any better solution than this?
Let's try to go through the algorithm step by step, but for a specific wordDict to simplify things.
So let wordDict be all the characters from a to z,
wordDict = ["a",..., "z"]
In this case if(words.count(curWord)) would be true whenever i = 0 and false otherwise.
Also, let's skip using the memo cache (we'll add it later).
In the case above, we just go through string s recursively until we reach the end, without any additional memory except the result vector, which gives the following:
time complexity is O(n!)
space complexity is O(1) - just 1 solution exists
where n is the length of s
Now let's examine how using the memo cache changes the situation in our case. The cache would contain n items - the size of our string s - which changes the space complexity to O(n). The time is the same, since there will be no cache hits.
This is the basis for us to move forward.
Now let's try to find out how things change if wordDict contains all the pairs of letters (and the length of s is 2*something, so we can reach the end).
So, wordDict = ['aa','ab',...,'zz']
In this case we move forward by 2 letters instead of 1 and everything else is the same, which gives us the following complexity without using the memo cache:
time complexity is O((n/2)!)
space complexity is O(1) - just 1 solution exists
The memo cache would contain n/2 items, which again changes the space complexity to O(n), but all the cached keys there are of different lengths.
Let's now imagine that wordDict contains both dictionaries we mentioned before ('a'...'z','aa'...'zz').
In this case we have the following complexity without using memo cache
time complexity is O(n!), as we need to check the cases i=0 and i=1, which roughly doubles the number of checks we need to do at each step; on the other hand it reduces the number of checks we have to do later, since we move forward by 2 letters instead of one (this is the trickiest part for me).
Space complexity is ~O(2^n), since every additional character doubles the number of results.
Now let's think about the memo cache we have. It would be useful for every 3 letters because, for example, '...ab c...' gives the same remainder as '...a bc...', so it reduces the number of calculations by 2 at every step, and our complexity would be the following:
time complexity is roughly O((n/2)!), and we need O(2*n) = O(n) memory to store the memo. Note that in the n/2 expression, the 2 reflects the cache effectiveness.
space complexity is O(2^n) - the 2 here is a characteristic of the wordDict we've constructed
These were 3 cases to help us understand how the complexity changes depending on the circumstances. Now let's try to generalize to the generic case:
time complexity is O((n/(l*e))!), where l is the minimum length of the words in wordDict and e is the cache effectiveness (I would assume 1 in the general case, but there may be situations where it's different, as we saw in the case above).
space complexity is O(a^n), where a reflects the similarity of the words in wordDict; it could be very roughly estimated as P(h/l) = (h/l)!, where h is the maximum word length in the dictionary and l is the minimum word length (for example, if wordDict contains all combinations of up to 3 letters, this gives us 3! combinations for every 6 letters).
This is how I see your approach and its complexity.
As for improving the solution itself, I don't see any simple way to do so. There might be an alternative approach that divides the string into 3 parts and processes each part separately, but it would only really pay off if we could stop collecting the results themselves and just count the number of results without displaying them; a small sketch of that counting idea is below.
I hope it helps.
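To illustrate that counting idea, here is a short memoized Python sketch (an illustration, not part of the solution above) that counts the number of possible sentences without building them; counting keeps the memoized values small even when the number of sentences is exponential:

from functools import lru_cache

def count_sentences(s, word_dict):
    words = set(word_dict)

    @lru_cache(maxsize=None)
    def count_from(i):
        # Number of ways to segment the suffix s[i:].
        if i == len(s):
            return 1
        total = 0
        for j in range(i + 1, len(s) + 1):
            if s[i:j] in words:
                total += count_from(j)
        return total

    return count_from(0)

print(count_sentences("catsanddog", ["cat", "cats", "and", "sand", "dog"]))
# -> 2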

How to "sort" elements of 2 possible values in place in linear time? [duplicate]

Suppose I have a function f and array of elements.
The function returns A or B for any element; you could visualize the elements this way ABBAABABAA.
I need to sort the elements according to the function, so the result is: AAAAAABBBB
The number of A values doesn't have to equal the number of B values. The total number of elements can be arbitrary (not fixed). Note that you don't sort chars, you sort objects that have a single char representation.
A few more things:
the sort should take linear time - O(n),
it should be performed in place,
it should be a stable sort.
Any ideas?
Note: if the above is not possible, do you have ideas for algorithms sacrificing one of the above requirements?
If it has to be linear and in-place, you could do a semi-stable version. By semi-stable I mean that A or B could be stable, but not both. Similar to Dukeling's answer, but you move both iterators from the same side:
a = first A
b = first B
loop while next A exists
    if b < a
        swap a,b elements
        b = next B
        a = next A
    else
        a = next A
With the sample string ABBAABABAA, you get:
ABBAABABAA
AABBABABAA
AAABBBABAA
AAAABBBBAA
AAAAABBBBA
AAAAAABBBB
On each turn, if you make a swap you move both iterators; if not, you just move a. This will keep A stable, but B will lose its ordering. To keep B stable instead, start from the end and work your way left.
It may be possible to do it with full stability, but I don't see how.
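For illustration, here is a small Python version of the semi-stable idea (not a literal transcription of the pseudocode above, but the same single pass with both positions moving from the left: the A's keep their relative order, the B's may not):

def semi_stable_partition(items, f):
    # f(x) returns 'A' or 'B'. Keeps the relative order of the A's only.
    write = 0                          # next position an A should occupy
    for read in range(len(items)):
        if f(items[read]) == 'A':
            items[write], items[read] = items[read], items[write]
            write += 1
    return items

print("".join(semi_stable_partition(list("ABBAABABAA"), lambda c: c)))
# -> AAAAAABBBB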
A stable sort might not be possible with the other given constraints, so here's an unstable sort that's similar to the partition step of quick-sort.
1. Have 2 iterators, one starting on the left, one starting on the right.
2. While there's a B at the right iterator, decrement the iterator.
3. While there's an A at the left iterator, increment the iterator.
4. If the iterators haven't crossed each other, swap their elements and repeat from 2.
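A minimal Python sketch of this partition step (for illustration; order within each class is not preserved):

def unstable_partition(items, f):
    # Quick-sort-style partition: all A's end up before all B's.
    left, right = 0, len(items) - 1
    while True:
        while right >= 0 and f(items[right]) == 'B':
            right -= 1
        while left < len(items) and f(items[left]) == 'A':
            left += 1
        if left >= right:
            return items
        items[left], items[right] = items[right], items[left]

print("".join(unstable_partition(list("ABBAABABAA"), lambda c: c)))
# -> AAAAAABBBB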
Let's say:
Object_Array[1...N]
Type_A objects are A1, A2, ..., Ai
Type_B objects are B1, B2, ..., Bj
i + j = N

FOR k = 1 : N
    if Object_Array[k] is of Type_A
        obj_A_count = obj_A_count + 1
    else
        obj_B_count = obj_B_count + 1
LOOP

Fill the resultant array with obj_A and obj_B according to their respective counts.
The following should work in linear time for a doubly-linked list. Because up to N insertions/deletions are involved, it may take quadratic time for arrays, though.
1. Find the location where the first B should be after "sorting". This can be done in linear time by counting As.
2. Start with 3 iterators: iterA starts from the beginning of the container, iterB starts from the above location where As and Bs should meet, and iterMiddle starts one element prior to iterB.
3. With iterA skip over As, find the 1st B, and move that object from iterA to the position just before iterB. Now iterA points to the next element after where the moved element used to be, and the moved element is just before iterB.
4. Continue with step 3 until you reach iterMiddle. After that, all elements between first() and iterB-1 are As.
5. Now set iterA to iterB-1.
6. Skip over Bs with iterB. When an A is found, move it to just after iterA and increment iterA.
7. Continue step 6 until iterB reaches end().
This would work as a stable sort for any container. The algorithm includes O(N) insertions/deletions, which is linear time for containers with O(1) insertion/deletion, but, alas, O(N^2) for arrays. Applicability in your case depends on whether the container is an array rather than a list.
If your data structure is a linked list instead of an array, you should be able to meet all three of your constraints: you just skim through the list, and accumulating and moving the "B"s amounts to trivial pointer changes. Pseudo code below:
sort(list) {
    node = list.head, blast = null, bhead = null
    while(node != null) {
        nextnode = node.next
        if(node.val == "a") {
            if(blast != null) {
                //move the 'a' to the front of the 'B' list
                bhead.prev.next = node, node.prev = bhead.prev
                blast.next = node.next, node.next.prev = blast
                node.next = bhead, bhead.prev = node
            }
        }
        else if(node.val == "b") {
            if(blast == null)
                bhead = blast = node
            else //accumulate the "b"s..
                blast = node
        }
        node = nextnode
    }
}
So, you can do this in an array, but the memcopies that emulate the list swaps will make it quite slow for large arrays.
Firstly, assuming the array of A's and B's is either generated or read-in, I wonder why not avoid this question entirely by simply applying f as the list is being accumulated into memory into two lists that would subsequently be merged.
Otherwise, we can posit an alternative solution in O(n) time and O(1) space that may be sufficient depending on Sir Bohumil's ultimate needs:
Traverse the list and sort each segment of 1,000,000 elements in-place using the permutation cycles of the segment (once this step is done, the list could technically be sorted in-place by recursively swapping the inner-blocks, e.g., ABB AAB -> AAABBB, but that may be too time-consuming without extra space). Traverse the list again and use the same constant space to store, in two interval trees, the pointers to each block of A's and B's. For example, segments of 4,
ABBAABABAA => AABB AABB AA + pointers to blocks of A's and B's
Sequential access to A's or B's would be immediately available, and random access would come from using the interval tree to locate a specific A or B. One option could be to have the intervals number the A's and B's; e.g., to find the 4th A, look for the interval containing 4.
For sorting, an array of 1,000,000 four-byte elements (3.8MB) would suffice to store the indexes, using one bit in each element for recording visited indexes during the swaps; and two temporary variables the size of the largest A or B. For a list of one billion elements, the maximum combined interval trees would number 4000 intervals. Using 128 bits per interval, we can easily store numbered intervals for the A's and B's, and we can use the unused bits as pointers to the block index (10 bits) and offset in the case of B (20 bits). 4000*16 bytes = 62.5KB. We can store an additional array with only the B blocks' offsets in 4KB. Total space under 5MB for a list of one billion elements. (Space is in fact dependent on n but because it is extremely small in relation to n, for all practical purposes, we may consider it O(1).)
Time for sorting the million-element segments would be - one pass to count and index (here we can also accumulate the intervals and B offsets) and one pass to sort. Constructing the interval tree is O(nlogn) but n here is only 4000 (0.00005 of the one-billion list count). Total time O(2n) = O(n)
This should be possible with a bit of dynamic programming.
It works a bit like counting sort, but with a key difference. Make arrays of size n for both a and b: count_a[n] and count_b[n]. Fill these arrays with how many As or Bs there have been before index i.
After just one loop, we can use these arrays to look up the correct index for any element in O(1). Like this:
int final_index(char id, int pos) {
    if (id == 'A')
        return count_a[pos];
    else
        return count_a[n-1] + count_b[pos];
}
Finally, to meet the total O(n) requirement, the swapping needs to be done in a smart order. One simple option is to have a recursive swapping procedure that doesn't actually perform any swapping until both elements would be placed in their correct final positions. EDIT: This is actually not true. Even naive swapping will have O(n) swaps. But doing this recursive strategy will give you the absolute minimum number of required swaps.
Note that in the general case this would be a very bad sorting algorithm, since it has a memory requirement of O(n * element value range).
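For illustration, a rough Python sketch of this prefix-count idea (not the answer's code): it computes each element's final index from running counts and then applies that permutation by following cycles. It is stable and linear time, but note it uses an O(n) auxiliary index array, so it is not strictly in place:

def stable_two_value_sort(items, f):
    # f(x) returns 'A' or 'B'.
    n = len(items)
    total_a = sum(1 for x in items if f(x) == 'A')
    # dest[i] = final index of the element currently at position i.
    dest = [0] * n
    a_seen = b_seen = 0
    for i, x in enumerate(items):
        if f(x) == 'A':
            dest[i] = a_seen
            a_seen += 1
        else:
            dest[i] = total_a + b_seen
            b_seen += 1
    # Apply the permutation by following its cycles; O(n) swaps overall.
    for i in range(n):
        while dest[i] != i:
            j = dest[i]
            items[i], items[j] = items[j], items[i]
            dest[i], dest[j] = dest[j], dest[i]
    return items

print("".join(stable_two_value_sort(list("ABBAABABAA"), lambda c: c)))
# -> AAAAAABBBB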

clustering words based on their char set

Say there is a word set and I would like to cluster the words based on their char bag (multiset). For example
{tea, eat, abba, aabb, hello}
will be clustered into
{{tea, eat}, {abba, aabb}, {hello}}.
abba and aabb are clustered together because they have the same char bag, i.e. two a and two b.
To make it efficient, a naive way I can think of is to convert each word into a char-count string; for example, abba and aabb will both be converted to a2b2, and tea/eat will be converted to a1e1t1. Then I can build a dictionary and group words with the same key.
Two issues here: first I have to sort the chars to build the key; second, the string key looks awkward and performance is not as good as char/int keys.
Is there a more efficient way to solve the problem?
For detecting anagrams you can use a hashing scheme based on the product of prime numbers: A->2, B->3, C->5, etc. This will give "abba" == "aabb" == 36 (but a different letter-to-prime-number mapping will be better).
See my answer here.
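A small Python sketch of that scheme (for illustration; the letter-to-prime mapping here is just the first 26 primes in order, the simple mapping the answer suggests improving on): words with the same char bag multiply to the same product, so they land in the same bucket.

from collections import defaultdict

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def prime_key(word):
    # Product of one prime per letter; anagrams multiply to the same value.
    key = 1
    for c in word.lower():
        key *= PRIMES[ord(c) - ord('a')]
    return key

def cluster(words):
    buckets = defaultdict(list)
    for w in words:
        buckets[prime_key(w)].append(w)
    return list(buckets.values())

print(cluster(["tea", "eat", "abba", "aabb", "hello"]))
# -> [['tea', 'eat'], ['abba', 'aabb'], ['hello']]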
Since you are going to sort words, I assume all characters' ASCII values are in the range 0-255. Then you can do a counting sort over the characters of each word.
The counting sort is going to take time proportional to the size of the input word. Reconstruction of the string obtained from the counting sort will take O(wordLen). You cannot make this step less than O(wordLen) because you will have to iterate over the string at least once, i.e. O(wordLen). There is no predefined order; you cannot make any assumptions about the word without iterating through all the characters in that word. Traditional sorting implementations (i.e. comparison-based ones) will give you O(n * lg n), but non-comparison ones give you O(n).
Iterate over all the words of the list and sort them using our counting sort. Keep a map of sorted words to the list of known words they map to. Adding an element to a list takes constant time. So overall the complexity of the algorithm is O(n * avgWordLength).
Here is a sample implementation
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterGen {

    static String sortWord(String w) {
        int freq[] = new int[256];
        for (char c : w.toCharArray()) {
            freq[c]++;
        }
        StringBuilder sortedWord = new StringBuilder();
        // It is at most O(n)
        for (int i = 0; i < freq.length; ++i) {
            for (int j = 0; j < freq[i]; ++j) {
                sortedWord.append((char) i);
            }
        }
        return sortedWord.toString();
    }

    static Map<String, List<String>> cluster(List<String> words) {
        Map<String, List<String>> allClusters = new HashMap<String, List<String>>();
        for (String word : words) {
            String sortedWord = sortWord(word);
            List<String> cluster = allClusters.get(sortedWord);
            if (cluster == null) {
                cluster = new ArrayList<String>();
            }
            cluster.add(word);
            allClusters.put(sortedWord, cluster);
        }
        return allClusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
        System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));
    }
}
Returns
{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}
Using an alphabet of x characters and a maximum word length of y, you can create hashes of (x + y) bits such that every anagram has a unique hash. A value of 1 for a bit means there is another of the current letter, a value of 0 means to move on to the next letter. Here's an example showing how this works:
Let's say we have a 7-letter alphabet (abcdefg) and a maximum word length of 4. Every word hash will be 11 bits. Let's hash the word "fade": 10001010100
The first bit is 1, indicating there is an a present. The second bit indicates that there are no more a's. The third bit indicates that there are no more b's, and so on. Another way to think about this is the number of ones in a row represents the number of that letter, and the total zeroes before that string of ones represents which letter it is.
Here is the hash for "dada": 11000110000
It's worth noting that because there is a one-to-one correspondence between possible hashes and possible anagrams, this is the smallest possible hash guaranteed to give unique hashes for any input, which eliminates the need to check everything in your buckets when you are done hashing.
I'm well aware that using large alphabets and long words will result in a large hash size. This solution is geared towards guaranteeing unique hashes in order to avoid comparing strings. If you can design an algorithm to compute this hash in constant time(given you know the values of x and y) then you'll be able to solve the entire grouping problem in O(n).
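A small Python sketch of this encoding (for illustration, using the 7-letter alphabet and the examples above): each letter contributes one '1' bit per occurrence, followed by a single '0' separator per alphabet letter.

from collections import Counter

def unary_hash(word, alphabet="abcdefg"):
    # One '1' per occurrence of each letter, then a '0' to move to the next letter.
    counts = Counter(word)
    return "".join("1" * counts[c] + "0" for c in alphabet)

print(unary_hash("fade"))  # -> 10001010100
print(unary_hash("dada"))  # -> 11000110000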
I would do this in two steps, first sort all your words according to their length and work on each subset separately(this is to avoid lots of overlaps later.)
The next step is harder and there are many ways to do it. One of the simplest would be to assign every letter a number (a = 1, b = 2, etc., for example) and add up all the values for each word, thereby assigning each word an integer. Then you can sort the words according to this integer value, which drastically cuts the number you have to compare.
Depending on your data set you may still have a lot of overlaps ("bad" and "cac" would generate the same integer hash), so you may want to set a threshold where, if you have too many words in one bucket, you repeat the previous step with another hash (just assigning different numbers to the letters). Unless someone has looked at your code and designed a wordlist to mess you up, this should cut the overlaps to almost none.
Keep in mind that this approach will be efficient when you are expecting small numbers of words to be in the same char bag. If your data is a lot of long words that only go into a couple char bags, the number of comparisons you would do in the final step would be astronomical, and in this case you would be better off using an approach like the one you described - one that has no possible overlaps.
One thing I've done that's similar to this, but allows for collisions, is to sort the letters, then get rid of duplicates. So in your example, you'd have buckets for "aet", "ab", and "ehlo".
Now, as I say, this allows for collisions. So "rod" and "door" both end up in the same bucket, which may not be what you want. However, the collisions will be a small set that is easily and quickly searched.
So once you have the string for a bucket, you'll notice you can convert it into a 32-bit integer (at least for ASCII). Each letter in the string becomes a bit in a 32-bit integer. So "a" is the first bit, "b" is the second bit, etc. All (English) words make a bucket with a 26-bit identifier. You can then do very fast integer compares to find the bucket a new word goes into, or find the bucket an existing word is in.
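For illustration, a small Python sketch of that bitmask idea: the distinct letters of a word become bits of an integer, so words with the same letter set share a bucket (including collisions like "rod"/"door", as noted above).

from collections import defaultdict

def letter_set_key(word):
    # Bit i is set if the letter chr(ord('a') + i) occurs in the word.
    key = 0
    for c in word.lower():
        key |= 1 << (ord(c) - ord('a'))
    return key

buckets = defaultdict(list)
for w in ["tea", "eat", "abba", "aabb", "hello", "rod", "door"]:
    buckets[letter_set_key(w)].append(w)

print(list(buckets.values()))
# -> [['tea', 'eat'], ['abba', 'aabb'], ['hello'], ['rod', 'door']]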
Count the frequency of characters in each of the strings, then build a hash table based on the frequency table. For example, for the strings aczda and aacdz we get 20110000000000000000000001. Using the hash table we can partition all these strings into buckets in O(N).
26-bit integer as a hash function
If your alphabet isn't too large, for instance, just lower case English letters, you can define this particular hash function for each word: a 26 bit integer where each bit represents whether that English letter exists in the word. Note that two words with the same char set will have the same hash.
Then just add them to a hash table. It will automatically be clustered by hash collisions.
It will take O(max length of the word) to calculate a hash, and insertion into a hash table is constant time. So the overall complexity is O(max length of a word * number of words)

trie or balanced binary search tree to store dictionary?

I have a simple requirement (perhaps hypothetical):
I want to store an English word dictionary (n words), and given a word (of character length m), the dictionary should be able to tell if the word exists in the dictionary or not.
What would be an appropriate data structure for this?
a balanced binary search tree? as done in C++ STL associative data structures like set,map
or
a trie on strings
Some complexity analysis:
in a balanced BST, time would be (log n)*m (comparing 2 strings takes O(m) time, character by character)
in a trie, if we could branch out in O(1) time at each node, we could look a word up in O(m), but the assumption that we can branch in O(1) time at each node is not valid: at each node, the maximum possible number of branches is 26. If we want O(1) at a node, we have to keep a short array indexable by character at each node. This blows up the space. After a few levels in the trie, branching reduces, so it's better to keep a linked list of next-node characters and pointers.
What looks more practical? Are there any other trade-offs?
Thanks,
I'd say use a Trie, or better yet use its more space efficient cousin the Directed Acyclic Word Graph (DAWG).
It has the same runtime characteristics (insert, look up, delete) as a Trie but overlaps common suffixes as well as common prefixes which can be a big saving on space.
If this is C++, you should also consider std::tr1::unordered_set. (If you have C++0x, you can use std::unordered_set.)
This just uses a hash table internally, which I would wager will out-perform any tree-like structure in practice. It is also trivial to implement because you have nothing to implement.
Binary search is going to be easier to implement and it's only going to involve comparing tens of strings at the most. Given you know the data up front, you can build a balanced binary tree so performance is going to be predictable and easily understood.
With that in mind, I'd use a standard binary tree (probably using set from C++ since that's typically implemented as a tree).
A simple solution is to store the dict as sorted, \n-separated words on disk, load it into memory and do a binary search. The only non-standard part here is that you have to scan backwards for the start of a word when you're doing the binary search.
Here's some code! (It assumes globals wordlist pointing to the loaded dict, and wordlist_end which points to just after the end of the loaded dict.)
// Return >0 if word > word at position p.
// Return <0 if word < word at position p.
// Return 0 if word == word at position p.
static int cmp_word_at_index(size_t p, const char *word) {
    while (p > 0 && wordlist[p - 1] != '\n') {
        p--;
    }
    while (1) {
        if (wordlist[p] == '\n') {
            if (*word == '\0') return 0;
            else return 1;
        }
        if (*word == '\0') {
            return -1;
        }
        int char0 = toupper(*word);
        int char1 = toupper(wordlist[p]);
        if (char0 != char1) {
            return (int)char0 - (int)char1;
        }
        ++p;
        ++word;
    }
}

// Test if a word is in the dictionary.
int is_word(const char* word_to_find) {
    size_t index_min = 0;
    size_t index_max = wordlist_end - wordlist;
    while (index_min < index_max - 1) {
        size_t index = (index_min + index_max) / 2;
        int c = cmp_word_at_index(index, word_to_find);
        if (c == 0) return 1;  // Found word.
        if (c < 0) {
            index_max = index;
        } else {
            index_min = index;
        }
    }
    return 0;
}
A huge benefit of this approach is that the dict is stored in a human-readable way on disk, and that you don't need any fancy code to load it (allocate a block of memory and read() it in in one go).
If you want to use a trie, you could use a packed and suffix-compressed representation. Here's a link to one of Donald Knuth's students, Franklin Liang, who wrote about this trick in his thesis.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.123.7018&rep=rep1&type=pdf
It uses half the storage of the straightforward textual dict representation, gives you the speed of a trie, and you can (like the textual dict representation) store the whole thing on disk and load it in one go.
The trick it uses is to pack all the trie nodes into a single array, interleaving them where possible. As well as a next pointer (and an end-of-word marker bit) in each array location, like in a regular trie, you store the letter that this node is for -- this lets you tell whether the node is valid for your current state or whether it belongs to an overlapping node. Read the linked doc for a fuller and clearer explanation, as well as an algorithm for packing the trie into this array.
It's not trivial to implement the suffix-compression and greedy packing algorithm described, but it's easy enough.
The industry standard is to store the dictionary in a hashtable and get amortized O(1) lookup time. Space is no longer as critical in industry, especially due to the advances in distributed computing.
A hashtable is how Google implements its autocomplete feature. Specifically, have every prefix of a word as a key and put the word as the value in the hashtable.
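A tiny Python sketch of that prefix-table idea (purely illustrative, and certainly not how any production autocomplete is actually built): every prefix maps to the words it can complete, trading memory for O(1) lookups.

from collections import defaultdict

def build_prefix_table(words):
    table = defaultdict(list)
    for w in words:
        for i in range(1, len(w) + 1):
            table[w[:i]].append(w)   # every prefix points at the full word
    return table

table = build_prefix_table(["cat", "cats", "car", "dog"])
print(table["ca"])   # -> ['cat', 'cats', 'car']
print(table["dog"])  # -> ['dog']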

Most common substring of length X

I have a string s and I want to search for the substring of length X that occurs most often in s. Overlapping substrings are allowed.
For example, if s="aoaoa" and X=3, the algorithm should find "aoa" (which appears 2 times in s).
Does an algorithm exist that does this in O(n) time?
You can do this using a rolling hash in O(n) time (assuming good hash distribution). A simple rolling hash would be the xor of the characters in the string; you can compute it incrementally from the previous substring hash using just 2 xors. (See the Wikipedia entry for better rolling hashes than xor.) Compute the hash of your n-x+1 substrings using the rolling hash in O(n) time. If there were no collisions, the answer is clear - if collisions happen, you'll need to do more work. My brain hurts trying to figure out if that can all be resolved in O(n) time.
Update:
Here's a randomized O(n) algorithm. You can find the top hash in O(n) time by scanning the hashtable (keeping it simple, assume no ties). Find one X-length string with that hash (keep a record in the hashtable, or just redo the rolling hash). Then use an O(n) string searching algorithm to find all occurrences of that string in s. If you find the same number of occurrences as you recorded in the hashtable, you're done.
If not, that means you have a hash collision. Pick a new random hash function and try again. If your hash function has log(n)+1 bits and is pairwise independent [Prob(h(s) == h(t)) < 1/2^{log(n)+1} if s != t], then the probability that the most frequent x-length substring in s has a collision with the <=n other length-x substrings of s is at most 1/2. So if there is a collision, pick a new random hash function and retry; you will need only a constant number of tries before you succeed.
Now we only need a randomized pairwise independent rolling hash algorithm.
Update2:
Actually, you need 2log(n) bits of hash to avoid all (n choose 2) collisions because any collision may hide the right answer. Still doable, and it looks like hashing by general polynomial division should do the trick.
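For illustration, a short Python sketch of the rolling-hash counting idea (using a simple polynomial hash modulo a large prime rather than the xor hash mentioned above; the final verification step against collisions is omitted):

from collections import defaultdict

def most_common_substring(s, x, base=256, mod=(1 << 61) - 1):
    if len(s) < x:
        return None
    counts = defaultdict(int)
    first_pos = {}                 # hash -> start index of one occurrence
    power = pow(base, x - 1, mod)  # base^(x-1), used to drop the oldest char
    h = 0
    for i, c in enumerate(s):
        h = (h * base + ord(c)) % mod
        if i >= x:
            h = (h - ord(s[i - x]) * power * base) % mod
        if i >= x - 1:
            counts[h] += 1
            first_pos.setdefault(h, i - x + 1)
    top = max(counts, key=counts.get)
    return s[first_pos[top]:first_pos[top] + x], counts[top]

print(most_common_substring("aoaoa", 3))
# -> ('aoa', 2)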
I don't see an easy way to do this in strictly O(n) time, unless X is fixed and can be considered a constant. If X is a parameter to the algorithm, then most simple ways of doing this will actually be O(n*X), as you will need to do comparison operations, string copies, hashes, etc., on a substring of length X at every iteration.
(I'm imagining, for a minute, that s is a multi-gigabyte string, and that X is some number over a million, and not seeing any simple ways of doing string comparison, or hashing substrings of length X, that are O(1), and not dependent on the size of X)
It might be possible to avoid string copies during scanning, by leaving everything in place, and to avoid re-hashing the entire substring -- perhaps by using an incremental hash algorithm where you can add a byte at a time, and remove the oldest byte -- but I don't know of any such algorithms that wouldn't result in huge numbers of collisions that would need to be filtered out with an expensive post-processing step.
Update
Keith Randall points out that this kind of hash is known as a rolling hash. It still remains, though, that you would have to store the starting string position for each match in your hash table, and then verify after scanning the string that all of your matches were true. You would need to sort the hashtable, which could contain n-X entries, based on the number of matches found for each hash key, and verify each result -- probably not doable in O(n).
It should be O(n*m) where m is the average length of a string in the list. For very small values of m the algorithm will approach O(n).
Build a hashtable of counts for each length-X substring.
Iterate over your collection of strings, updating the hashtable accordingly, and keep the current most prevalent count in an integer variable separate from the hashtable.
done.
Naive solution in Python
from collections import defaultdict
from operator import itemgetter

def naive(s, X):
    freq = defaultdict(int)
    for i in range(len(s) - X + 1):
        freq[s[i:i+X]] += 1
    return max(freq.items(), key=itemgetter(1))

print(naive("aoaoa", 3))
# -> ('aoa', 2)
In plain English
Create a mapping: substring of length X -> how many times it occurs in the string s
    for i in range(len(s) - X + 1):
        freq[s[i:i+X]] += 1
Find the pair in the mapping with the largest second item (the frequency)
    max(freq.items(), key=itemgetter(1))
Here is a version I did in C. Hope that it helps.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *string = NULL, *maxstring = NULL, *tmpstr = NULL, *tmpstr2 = NULL;
    unsigned int n = 0, i = 0, j = 0, matchcount = 0, maxcount = 0;

    string = "aoaoa";
    n = 3;
    for (i = 0; i <= (strlen(string) - n); i++) {
        tmpstr = (char *)malloc(n + 1);
        strncpy(tmpstr, string + i, n);
        tmpstr[n] = '\0';  /* strncpy copies exactly n bytes and does not null-terminate here */
        for (j = 0; j <= (strlen(string) - n); j++) {
            tmpstr2 = (char *)malloc(n + 1);
            strncpy(tmpstr2, string + j, n);
            tmpstr2[n] = '\0';
            if (!strcmp(tmpstr, tmpstr2))
                matchcount++;
            free(tmpstr2);
        }
        if (matchcount > maxcount) {
            free(maxstring);       /* drop the previous best, if any */
            maxstring = tmpstr;
            maxcount = matchcount;
        } else {
            free(tmpstr);
        }
        matchcount = 0;
    }
    printf("max string: \"%s\", count: %d\n", maxstring, maxcount);
    free(maxstring);
    return 0;
}
You can build a tree of sub-strings. The idea is to organise your sub-strings like a telephone book. You then look up the sub-string and increase its count by one.
In your example above, the tree will have sections (nodes) starting with the letters: 'a' and 'o'. 'a' appears three times and 'o' appears twice. So those nodes will have a count of 3 and 2 respectively.
Next, under the 'a' node a sub-node of 'o' will appear corresponding to the sub-string 'ao'. This appears twice. Under the 'o' node 'a' also appears twice.
We carry on in this fashion until we reach the end of the string.
A representation of the tree for 'abac' might be (nodes on the same level are separated by a comma, sub-nodes are in brackets, counts appear after the colon).
a:2(b:1(a:1(c:1())),c:1()),b:1(a:1(c:1())),c:1()
If the tree is drawn out it will be a lot more obvious! What this all says for example is that the string 'aba' appears once, or the string 'a' appears twice etc. But, storage is greatly reduced and more importantly retrieval is greatly speeded up (compare this to keeping a list of sub-strings).
To find out which sub-string is most repeated, do a depth first search of the tree, every time a leaf node is reached, note the count, and keep a track of the highest one.
The running time is probably something like O(log(n)), I'm not sure, but it's certainly better than O(n^2).
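A rough Python sketch of the trie idea restricted to length-X substrings (an illustration only; the answer above describes a more general substring tree with counts at every node):

def most_common_via_trie(s, x):
    root = {}
    best, best_count = None, 0
    for i in range(len(s) - x + 1):
        node = root
        for ch in s[i:i + x]:          # walk/extend one path per substring
            node = node.setdefault(ch, {})
        node["count"] = node.get("count", 0) + 1
        if node["count"] > best_count:
            best, best_count = s[i:i + x], node["count"]
    return best, best_count

print(most_common_via_trie("aoaoa", 3))
# -> ('aoa', 2)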
Python-3 Solution:
from collections import Counter

string = "aoaoa"   # example input from the question
K = 3              # substring length

# all substrings of length K
substrings = [string[i:j] for i in range(len(string))
              for j in range(i + 1, len(string) + 1) if len(string[i:j]) == K]

# now find the most common value in this list
# you can do this natively, but I prefer using collections
most_frequent = Counter(substrings).most_common(1)[0][0]
print(most_frequent)
Here is the native way to get the most common (for those that are interested):
most_occurrences = 0
current_most = ""
for sub in substrings:
    frequency = substrings.count(sub)
    if frequency > most_occurrences:
        most_occurrences = frequency
        current_most = sub
print(f"{current_most}, Occurrences: {most_occurrences}")
[Extract K length substrings (geeks for geeks)][1]
[1]: https://www.geeksforgeeks.org/python-extract-k-length-substrings/
LZW algorithm does this
This is exactly what the Lempel-Ziv-Welch (LZW, used in the GIF image format) compression algorithm does. It finds prevalent repeated byte sequences and replaces them with something short.
LZW on Wikipedia
There's no way to do this in O(n).
Feel free to downvote me if you can prove me wrong on this one, but I've got nothing.
