How can we optimise the creation of a trie if we know the input is in alphabetical order? - algorithm

I am implementing a prefix tree, with a standard insertion mechanism. If we know we will be given a list of words in alphabetical order, is there any way we can change the insertion to skip a few steps? I am coding in Java, although I'm not looking for code in any particular language. I have considered adding the Nodes for each word to a queue, then hopping backwards through it until we're at a prefix of the next word, but this may be circumventing the whole point of the prefix tree!
Any thoughts on something like this? I'm finding it hard to come up with an implementation that's of any use unless the input is many many very similar words ("aaaaaaaaaab", "aaaaaaaaaac", "aaaaaaaaaad", ...) or something. But even then doing a string comparison on the prefixes is probably a similar cost to just using the prefix tree normally.

There is no way that you can avoid looking at all the characters in the input strings from which you're building the tree. If there was a way to do this, then I could make your algorithm incorrect. In particular, suppose that there is a word w and you don't look at one of its characters (say, the kth character). Then when your algorithm runs and tries to place the word somewhere in the trie, it must be able to place it without knowing all the characters. Therefore, if I change the kth character of the word to something else, your algorithm would put it in exactly the same place as before, which is incorrect because one of the characters in the word won't be correct.
Since the normal algorithm for constructing a trie takes time proportional to the number of characters in the input, you won't be able to asymptotically outperform it without doing some crazy tricks like parallelizing the construction code or packing the characters into machine words and hitting them with your Hammer of Bit Hackery.
However, you could potentially get a constant factor speedup. Following large numbers of pointers in a linked structure can be slow due to cache performance, so you could speed up the algorithm by minimizing the number of pointers you have to follow. One thing you could do would be to maintain the position of the end of the last string that you inserted, along with a list (preferably as a dynamic array) of nodes tracing the path back up to the root. To insert a new character, you could do the following:
Find the longest prefix of the string that matches the last string you inserted.
Jump to the pointer in the array marking where that would take you.
Trace the rest of the path down as normal, adding all the nodes that you trace out to the array and overwriting the previous pointers.
This way, if you insert a lot of words with a common prefix of a reasonable length, you can avoid doing a bunch of pointer-chasing back through a shared part of the structure. This could conceivably give you a performance boost if you have lots of words with the same prefix. It's not asymptotically better than before (and, in fact, uses more memory), but the savings from not following pointers could add up. I haven't tested this, but it seems like it might work.
Hope this helps!

Related

Generate or find a shortest text given list of words

Let's say I have a list of 1000+ words and I would like to generate a text that includes these words from the list. I would like to use as few extra words outside of the list as possible. How would one tackle such a problem? Or alternatively, is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
I am not sure how you'd like the text to be generated, so I'll attempt to answer the second question:
Is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
This is obviously a computationally demanding endeavour so I'll assume you are alright with spending like a gig of RAM on this and some time (but maybe not too long). Since you are looking for the shortest continuous text which satisfies some condition, one can conclude the following:
If the text satisfies the condition, you want to shorten it.
If it doesn't, you want to make it longer so that hopefully it will start satisfying the condition.
Now, when it comes to the condition, it is whatever predicate that will say whether the continuous section of the text is "good enough" or not, based on some relatively simple statistics. For instance, the predicate could check if some cumulative index based on what ratio of the words from your list are included in the section, modified by the number of words from outside the list, is greater than some expected value.
What my mind races to when I see something like this is the sliding window technique, described in this article. I do not know if this is a good article, I did not take the time to read it, but scanning through it seems to be decent. It's also known as the caterpillar method, which is a particularly common name for it in Poland.
Basically, you have two pointers, a left pointer and a right pointer. In the case of looking for the shortest continuous fragment of a larger text, such that the fragment satisfies a condition and given that if a condition is met for a fragment, then it is met for a larger fragment containing the previous fragment, you advance the right pointer forward as long as the condition is unmet, and then once it is met, you advance the left pointer, until the condition isn't met. This repeats until either or both pointers reach the end of the text.
This is a neat technique, which allows you to iterate over the whole text exactly once, linearly. It is clearly desirable in your case to have an algorithm linear with respect to the length of the text.
Now, we have to consider the statistics you will be collecting. You will probably want to know how many words from the list, and how many words from outside of the list are present in a continuous fragment. An extra condition for these statistics is that they will need to be relatively easily modifiable (preferably in constant time, but that will be hard to achieve) every time one of the pointers advances.
In order to keep track of the words, we will use a hashmap of ordered sets of indeces. In Java these data structures are called HashMap and TreeSet, in C++ they're unordered_map and set. The keys to the hashmap will be strings representing words. The values will be sets of indices of where the words appear in the text. Note that lookup in a hashmap is linear relative to the length of the key, so we can assume constant as most words are like <10 characters long, and checking how many values in a set there are between two given values is logarithmic relative to the size of the set. So getting the number of times a word appears in a fragment of the text is easy and fast. Keeping track of whether a word exists in the given list or not can also be achieved with a hashmap (or a hashset).
So let's get back to the statistics. Say you want to keep track of the number of words from inside and from outside your list in a given fragment. This can be achieved very simply:
Every time you add a word to the fragment by advancing its right end, you check if it appears in the list in constant time and if so, you add one to the "good words" number, and otherwise, you add one to the "bad words" number.
Every time you remove a word from the fragment by advancing the left end, you do the same but you decrement the counters instead.
Now if you want to track how many unique words from inside and from outside the list there are in the fragment, every time you will need to check the number of times a given word exists in the fragment. We established earlier that this can be done logarithmically relative to the length of the fragment, so now the trick is simple. You only modify the counters if the number of appearances of a word in the fragment either
rose from 0 to 1 when advancing the right pointer, or
fell from 1 to 0 when advancing the left pointer.
Otherwise, you ignore the word, not changing the counters.
Additional memory optimisations include removing indices from the sets of indices when they are out of scope of the fragment and removing hashmap entries from the hashmap if a set of indices becomes empty.
It is now up to you to perhaps find a better heuristic, some other statistical values which you can easily track whatever it is you intend to check in your predicate. Although it is important that whenever a fragment meets your condition, a bigger fragment must meet it too.
In the case described above you could keep track of all the fragments which had at least... I don't know... 90% of the words from your list and from those choose the shortest one or the one with the fewest foreign words.

What would be a good data structure to store a dictionary(of words) to optimize the search time?

Provided a list of valid words, and a search word, I want to find whether the search word is a valid word or not ALLOWING 2 typo characters.
What would be a good data structure to store a dictionary of words(assumingly it contains a million words) and algorithm to find whether the word exists in the dictionary(allowing 2 typo characters).
If no typo characters where allowed, then a Trie would be a good way to store the words but not sure if it stays the best way to store dictionary when typos are allowed. Not sure what the complexity for a backtracking algorithm(to search for a word in Trie allowing 2 typos) would be. Any idea about it?
You might want to checkout Directed Acyclic Word Graph or DAWG. It has more of an automata structure than a tree of graph structure. Multiple possibilities from one place may provide you with your solution.
If there is no need to also store all mistyped words I would consider to use a two-step approach for this problem.
Build a set containing hashes of all valid words (not including
typos). So probably we are talking here about some 10.000 entries,
which should still allow quite fast lookups with a binary search. If
the hash of a word is found in the set it is typed correctly.
If a words hash is not found in the set the word is probably
mistyped. So calculate a the Damerau-Levenshtein distance between
the word and all known words to figure out what the user might have
meant. To gain some performance here modify the DL-algorithm to
abort calculation if the distance gets bigger than your allowed
threshold of 2 typos.

Compressing words into one word consisting of them as subwords [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
I think you can use a Radix Tree. It costs some memory because of pointers to leafs and parents, but it is easy to match up strings (O(k) (where k is the longest string size).
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words under consideration of these prefixes and postfixes. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is not a "definitive" algorithm.
I did a lab back in college where we tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint* there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
Refine step 3.
Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
If yes, then add the distinct prefix to this word as a prefix to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. There has already gone an enormous amount of manpower into compression algorithms, why not take one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit bitter compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what do you want to do.
Do you want a data structure that lets to you store in a memory-conscious manner the strings while letting operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression techinique, like that:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress like that:
0aaa
3b
2sd
1baco
2ad
The number is the length of the largest common prefix with the preceding string.
You can tweak that schema, for ex. planning a "restart" of the common prefix after just K words, for a fast reconstruction

data structure for NFA representation

In my lexical analyzer generator I use McNaughton and Yamada algorithm for NFA construction, and one of its properties that transition form I to J marked with char at J position.
So, each node of NFA can be represented simply as list of next possible states.
Which data structure best suit for storing this type of data? It must provide fast lookup for all possible states and use less space, but insertion time is not so important.
My understanding is that you want to encode a graph, where the nodes are states and the edges are transitions, and where every edge is labelled with a character. Is that correct?
The dull but practical answer is to have a object for each state, and to encode the transitions in some little structure in that object.
The simplest one would be an array, indexed by character code: that's as fast as it gets, but not naturally space-efficient. You can make it more space efficient by using a sort of offset, truncated array: store only the part of the array which contains transitions, along with the start and end indices of that part. When looking up a character in it, check that its code is within the bounds; if it isn't, treat it as a null edge (or an edge back to the start state or whatever), and if it is, fetch the element at index (character code - start). Does that make sense?
A more complex option would be a little hashtable, which would be more compact but slightly slower. I would suggest closed hashing, because collision lists will use too much memory; linear probing should be enough. You could look into using perfect hashing (look it up), which takes a lot of time to generate the table but then gives collision-free lookup. The generation process is quite complex, though.
A clever approach is to use both arrays and hashtables, and to pick one or the other based on the number of edges: if the compacted array would be more than, say, a third full, use it, but if not, use a hashtable.
Now, something a bit more radical you could do would be to use arrays, but to overlap them - if they're sparse, they'll have lots of holes in, and if you're clever, you can arrange them so that the entries in each array lines up with holes in the others. That will give you fast lookups, but also excellent memory efficiency. You will need some scheme for distinguishing when a lookup has found something from when it's found an empty slot with some other state's transition in, but i'm sure you can think of something.

What algorithm can you use to find duplicate phrases in a string?

Given an arbitrary string, what is an efficient method of finding duplicate phrases? We can say that phrases must be longer than a certain length to be included.
Ideally, you would end up with the number of occurrences for each phrase.
In theory
A suffix array is the 'best' answer since it can be implemented to use linear space and time to detect any duplicate substrings. However - the naive implementation actually takes time O(n^2 log n) to sort the suffixes, and it's not completely obvious how to reduce this down to O(n log n), let alone O(n), although you can read the related papers if you want to.
A suffix tree can take slightly more memory (still linear, though) than a suffix array, but is easier to implement to build quickly since you can use something like a radix sort idea as you add things to the tree (see the wikipedia link from the name for details).
The KMP algorithm is also good to be aware of, which is specialized for searching for a particular substring within a longer string very quickly. If you only need this special case, just use KMP and no need to bother building an index of suffices first.
In practice
I'm guessing you're analyzing a document of actual natural language (e.g. English) words, and you actually want to do something with the data you collect.
In this case, you might just want to do a quick n-gram analysis for some small n, such as just n=2 or 3. For example, you could tokenize your document into a list of words by stripping out punctuation, capitalization, and stemming words (running, runs both -> 'run') to increase semantic matches. Then just build a hash map (such as hash_map in C++, a dictionary in python, etc) of each adjacent pair of words to its number of occurrences so far. In the end you get some very useful data which was very fast to code, and not crazy slow to run.
Like the earlier folks mention that suffix tree is the best tool for the job. My favorite site for suffix trees is http://www.allisons.org/ll/AlgDS/Tree/Suffix/. It enumerates all the nifty uses of suffix trees on one page and has a test js applicaton embedded to test strings and work through examples.
Suffix trees are a good way to implement this. The bottom of that article has links to implementations in different languages.
Like jmah said, you can use suffix trees/suffix arrays for this.
There is a description of an algorithm you could use here (see Section 3.1).
You can find a more in-depth description in the book they cite (Gusfield, 1997), which is on google books.
suppose you are given sorted array A with n entries (i=1,2,3,...,n)
Algo(A(i))
{
while i<>n
{
temp=A[i];
if A[i]<>A[i+1] then
{
temp=A[i+1];
i=i+1;
Algo(A[i])
}
else if A[i]==A[i+1] then
mark A[i] and A[i+1] as duplicates
}
}
This algo runs at O(n) time.

Resources