Ukkonen's Algorithm: Generalized Suffix Trees

I understand Ukkonen's algorithm. I am only curious how to extend it to hold more than one string (each ending with a special character, say "$").
I read somewhere that given strings s1 (say "abcddefx$") and s2 (say "abddefgh$"), I should insert s1 normally with Ukkonen's algorithm, then traverse down the tree with s2; that is, I should search for s2 in the tree.
Once I get to the point where the search ends ("ab", after 'b'), I should resume Ukkonen's algorithm from there.
I understand the basic logic behind this. But what I am curious about is: what happens to the old suffix links? Are they still valid?
Also I am confused about my triple (active_node, active_length, remainder): should it be (node representing "ab", 0, 0) as I start the new pass?

For dealing with special characters you can use the Unicode Private Use Areas. These are a few special ranges of code points reserved for your own use; the main range (U+E000–U+F8FF) is only 6,400 characters in size. Depending on the Unicode support of the language you are using, this can be really easy or difficult.
If that does not work, instead of inserting characters into your tree, wrap them in some other sort of variable (struct, object, dictionary) to 'extend' their meaning. That way you can provide the extra information needed (is this the end of a string? which string is this the end of?). Then you can provide custom equality operators on this new wrapper instead of using characters directly.
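For illustration, a minimal Java sketch of such a wrapper; the class name, fields, and factory methods are hypothetical:

    // Hypothetical wrapper: a tree symbol is either an ordinary character
    // or a terminator tagged with the index of the string it ends.
    final class Symbol {
        final char ch;          // the character; unused for terminators
        final int endOfString;  // -1 for ordinary characters, else string index

        private Symbol(char ch, int endOfString) {
            this.ch = ch;
            this.endOfString = endOfString;
        }

        static Symbol of(char ch)        { return new Symbol(ch, -1); }
        static Symbol terminator(int id) { return new Symbol('\0', id); }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Symbol)) return false;
            Symbol s = (Symbol) o;
            // Terminators compare by string index, ordinary symbols by character.
            return endOfString == s.endOfString && ch == s.ch;
        }

        @Override public int hashCode() { return 31 * ch + endOfString; }
    }

Each terminator is unique to its string, so no reserved character can ever collide with real input.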

Related

What would be a good data structure to store a dictionary (of words) to optimize the search time?

Given a list of valid words and a search word, I want to find whether the search word is a valid word or not, ALLOWING 2 typo characters.
What would be a good data structure to store a dictionary of words (assume it contains a million words), and what algorithm would find whether the word exists in the dictionary (allowing 2 typo characters)?
If no typo characters were allowed, then a trie would be a good way to store the words, but I'm not sure it stays the best way to store the dictionary when typos are allowed. I'm also not sure what the complexity of a backtracking algorithm (to search for a word in a trie allowing 2 typos) would be. Any idea about it?
You might want to check out the Directed Acyclic Word Graph, or DAWG. It has more of an automaton structure than a tree or graph structure. The multiple possible transitions out of each node may provide you with your solution.
If there is no need to also store all mistyped words, I would consider a two-step approach for this problem.
1. Build a set containing hashes of all valid words (not including typos). So we are probably talking about some 10,000 entries here, which should still allow quite fast lookups with a binary search. If the hash of a word is found in the set, it is typed correctly.
2. If a word's hash is not found in the set, the word is probably mistyped. So calculate the Damerau-Levenshtein distance between the word and all known words to figure out what the user might have meant. To gain some performance here, modify the DL algorithm to abort the calculation if the distance gets bigger than your allowed threshold of 2 typos (see the sketch below).
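A minimal Java sketch of the early-abort idea, using plain Levenshtein distance for brevity (the full Damerau-Levenshtein variant additionally counts adjacent transpositions):

    // Edit distance with early abort: returns max + 1 as soon as the
    // distance is guaranteed to exceed max. Two rolling rows of the DP table.
    static int boundedDistance(String a, String b, int max) {
        if (Math.abs(a.length() - b.length()) > max) return max + 1;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > max) return max + 1; // cells never shrink along diagonals
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

Calling boundedDistance(input, candidate, 2) against each dictionary word keeps the inner loop cheap for words that are clearly too far away.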

Comparing word to targeted words in dictionary

I'm trying to write a program in Java that stores a dictionary in a HashMap (each word under a different key) and compares a given word to the words in the dictionary, coming up with a spelling suggestion if it is not found in the dictionary -- basically a spell check program.
I already came up with the comparison algorithm (i.e. Needleman-Wunsch, then Levenshtein distance), etc., but got stuck when it came to figuring out which words in the dictionary HashMap to compare the given word (e.g. "hellooo") to.
I cannot compare "ohelloo" (which should be corrected to "hello") to each word in the dictionary, because that would take too long, and I cannot compare it only to the words in the dictionary starting with 'o', because it's supposed to be "hello".
Any ideas?
The most common spelling mistakes are:
Delete a letter (smaller word OR word split)
Swap adjacent letters
Alter letter (QWERTY adjacent letters)
Insert letter
Some reports say that 70-90% of mistakes fall in the above categories (edit distance 1)
Take a look at the article below, which provides a solution for single or double mistakes (edit distance 1 or 2). Almost everything you'll need is there!
How to write a spelling corrector
FYI: You can find implementations in various programming languages at the bottom of the aforementioned article. I've used it in some of my projects; practical accuracy is really good, sometimes 95% or more, as claimed by the author.
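The core of the article's approach is to generate every string at edit distance 1 from the input and look those candidates up in the dictionary. A rough Java sketch of that generation step (method and variable names are illustrative, not the article's):

    import java.util.HashSet;
    import java.util.Set;

    // Generates all strings at edit distance 1 from `word`:
    // deletions, adjacent transpositions, replacements, and insertions.
    static Set<String> edits1(String word) {
        final String letters = "abcdefghijklmnopqrstuvwxyz";
        Set<String> result = new HashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i), right = word.substring(i);
            if (!right.isEmpty())
                result.add(left + right.substring(1));              // deletion
            if (right.length() > 1)
                result.add(left + right.charAt(1) + right.charAt(0)
                           + right.substring(2));                   // transposition
            for (char c : letters.toCharArray()) {
                if (!right.isEmpty())
                    result.add(left + c + right.substring(1));      // replacement
                result.add(left + c + right);                       // insertion
            }
        }
        return result;
    }

Candidates that appear in the dictionary are the suggestions; applying edits1 to each result again covers edit distance 2.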
--Based on OP's comment--
If you don't want to pre-compute every possible alteration and then search the map, I suggest that you use a Patricia trie (radix tree) instead of a HashMap. Unfortunately, you will again need to handle the "first-letter mistake" explicitly (e.g. remove the first letter, swap the first letter with the second, or replace it with a QWERTY-adjacent one), but then you can limit your search with high probability.
You can even combine it with an extra index map or trie of "reversed" words, or an extra index that omits the first N characters (e.g. the first 2), so you can catch errors that occur in the prefix only.

How can we optimise the creation of a trie if we know the input is in alphabetical order?

I am implementing a prefix tree, with a standard insertion mechanism. If we know we will be given a list of words in alphabetical order, is there any way we can change the insertion to skip a few steps? I am coding in Java, although I'm not looking for code in any particular language. I have considered adding the Nodes for each word to a queue, then hopping backwards through it until we're at a prefix of the next word, but this may be circumventing the whole point of the prefix tree!
Any thoughts on something like this? I'm finding it hard to come up with an implementation that's of any use unless the input consists of many very similar words ("aaaaaaaaaab", "aaaaaaaaaac", "aaaaaaaaaad", ...). But even then, doing a string comparison on the prefixes probably costs about the same as just using the prefix tree normally.
There is no way that you can avoid looking at all the characters in the input strings from which you're building the tree. If there was a way to do this, then I could make your algorithm incorrect. In particular, suppose that there is a word w and you don't look at one of its characters (say, the kth character). Then when your algorithm runs and tries to place the word somewhere in the trie, it must be able to place it without knowing all the characters. Therefore, if I change the kth character of the word to something else, your algorithm would put it in exactly the same place as before, which is incorrect because one of the characters in the word won't be correct.
Since the normal algorithm for constructing a trie takes time proportional to the number of characters in the input, you won't be able to asymptotically outperform it without doing some crazy tricks like parallelizing the construction code or packing the characters into machine words and hitting them with your Hammer of Bit Hackery.
However, you could potentially get a constant factor speedup. Following large numbers of pointers in a linked structure can be slow due to cache performance, so you could speed up the algorithm by minimizing the number of pointers you have to follow. One thing you could do would be to maintain the position of the end of the last string that you inserted, along with a list (preferably as a dynamic array) of nodes tracing the path back up to the root. To insert a new string, you could do the following:
Find the longest prefix of the string that matches the last string you inserted.
Jump to the pointer in the array marking where that would take you.
Trace the rest of the path down as normal, adding all the nodes that you trace out to the array and overwriting the previous pointers.
This way, if you insert a lot of words with a common prefix of a reasonable length, you can avoid doing a bunch of pointer-chasing back through a shared part of the structure. This could conceivably give you a performance boost if you have lots of words with the same prefix. It's not asymptotically better than before (and, in fact, uses more memory), but the savings from not following pointers could add up. I haven't tested this, but it seems like it might work.
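A minimal Java sketch of this idea, assuming a simple child-map trie (all names here are illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class SortedInputTrie {
        static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean isWord;
        }

        private final Node root = new Node();
        private String lastWord = "";
        // path.get(i) is the node reached after consuming i characters of lastWord
        private final List<Node> path = new ArrayList<>();

        SortedInputTrie() { path.add(root); }

        // Words must arrive in sorted order so the cached path of the
        // previous word covers the shared prefix of the next one.
        void insert(String word) {
            int common = 0;
            int limit = Math.min(word.length(), lastWord.length());
            while (common < limit && word.charAt(common) == lastWord.charAt(common))
                common++;
            // Jump straight to the node for the shared prefix...
            Node node = path.get(common);
            path.subList(common + 1, path.size()).clear(); // drop the stale tail
            // ...then trace the rest of the path down as normal.
            for (int i = common; i < word.length(); i++) {
                node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
                path.add(node);
            }
            node.isWord = true;
            lastWord = word;
        }
    }

Only the characters past the shared prefix go through the child maps; the shared part is reached by a single array index.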
Hope this helps!

Why do we need a sentinel character in a Suffix Tree?

Why do we need to append "$" to the original string when we implement a suffix tree?
There can be special reasons for appending one (or even more) special characters to the end of the string when specific construction algorithms are used – both in the case of suffix trees and suffix arrays.
However, the most fundamental underlying reason in the case of suffix trees is a combination of two properties of suffix trees:
Suffix trees are PATRICIA trees, i.e. the edge labels are, unlike the edge labels of tries, strings consisting of one or more characters
Internal nodes exist only at branching points
This means you can potentially have a situation where one edge label is a prefix of another. Consider the string aa: its suffix aa forms one continuous edge from the root, ending in a leaf node, i.e. a suffix ends there. But the text also has the suffix a, consisting of the single character a, and there is no way for us to store the information that a suffix ends after the first a, because aa forms one continuous edge of the tree (property 1 above). We would have to introduce an intermediate node after the first a in which we could store that information.
But this would be illegal because of property 2: no inner node may exist unless it is a branching point.
The problem is solved if we can guarantee that the last character of the text is a character that occurs nowhere else in the entire string. The dollar sign is normally used as a symbol for that.
Clearly, if the last character occurs nowhere else, there can't possibly be any repetition (such as aa, or even a more complex one like abcabc) at the end of the string, hence the problem of non-branching inner nodes does not occur. In the example above, after putting $ at the end of the string, there are three suffixes: aa$, a$ and $, but none is a prefix of another. Obviously, this means we do introduce an inner node after the first a after all, but now it is a legitimate branching point, and there are a total of three leaves. So, to be sure, the advantage of this is not that we save space or that anything becomes more efficient. It's just a way to guarantee the two properties above. These properties are important when we prove certain useful characteristics of suffix trees, including the fact that the number of inner nodes is linear in the length of the string (you could not prove this if non-branching inner nodes were allowed).
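As a tiny, purely illustrative Java check of this prefix-freeness argument:

    // Returns true if some suffix of s is a proper prefix of another suffix --
    // the situation that forces a suffix to end in the middle of an edge.
    static boolean hasNestedSuffix(String s) {
        for (int i = 0; i < s.length(); i++)
            for (int j = i + 1; j < s.length(); j++)
                if (s.startsWith(s.substring(j), i)) // shorter suffix j prefixes suffix i
                    return true;
        return false;
    }

    // hasNestedSuffix("aa")  == true  : suffix "a" is a prefix of suffix "aa"
    // hasNestedSuffix("aa$") == false : the unique sentinel breaks every nesting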
This also means that in practice, you might use different ways of dealing with suffixes that are prefixes of other suffixes, and with non-branching inner nodes. For example, if you use the well-known Ukkonen algorithm to construct the tree, you can do that without appending a unique character to the end; you just have to make sure that at the end, after the final iteration, you place non-branching inner nodes at the end of every implicit suffix (i.e. every suffix that ends in the middle of an edge).
Again, there can be further, very specific reasons for putting $ at the end of the text before constructing a suffix tree or array. For example, in construction algorithms for suffix arrays that are based on the DC (difference cover) principle, you must put two $ signs at the end of the string to ensure that even the last character of the string is part of a complete character trigram, as the algorithm is based on trigram sorting. Furthermore, there are specific situations where the unique $ character must be interpreted in a special way. For the Ukkonen construction algorithm, it is sufficient for $ to be unique; for the DC suffix array algorithms it is necessary, in addition to uniqueness, that $ is lexicographically smaller than all other characters; and in the suffix-tree-based circular string cutting algorithm it is actually necessary to interpret $ as the lexicographically largest character.
I suspect it is for traversing purposes. When you are generating something from the suffix tree, you need to know whether you are at a node where a string finishes or not; if not, then you know that you have to keep going. Looking at the longest common substring problem, to which a suffix tree provides a linear-time solution, you need the $ sentinels to determine that you've arrived at a node where a string terminates. In the "banana" suffix tree pictured on Wikipedia, you can't finish after A-NA.
1. NOT A PATRICIA
A suffix tree is NOT a PATRICIA tree, which is radix 2. A suffix tree node may have 2 or MORE children.
2. NO VALID REASON TODAY
There are no reasons to add a special character other than:
the requirement to have 2 or more children at every internal node
the requirement to have exactly n leaves for a string of n characters
A suffix tree can be implemented the same way as a compressed trie (or a radix tree, which is one kind of compressed trie), without any special symbols, and there are no functional disadvantages in this case.
3. OLD TRAILS
If you look into the old book from 1973, you'll see a structure very similar to a trie, named an "uncompressed suffix tree", but with values and termination symbols. Then they compact it.
4. BUT WHAT'S DIFFERENT?
Prefix and suffix trees both have metadata in their nodes, right? It is implemented as the value of the node.
But with a suffix tree we've got one interesting requirement: we need to store the index of the suffix. So in the last node we have to keep TWO metadata fields, TWO values. And you need to keep nodes of the same size, byte-to-byte. SO THEY DID IT THROUGH AN ADDITIONAL NODE, AN END NODE.
In the modern world, you can keep as many fields as you want; you are not going to save each and every byte spent, so you don't need this trick, as the sketch below shows.
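For instance, a modern node layout can simply carry the suffix index inline (a hypothetical sketch):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical node layout: the suffix index lives in the node itself,
    // so no separate end node (and no sentinel character) is required.
    final class SuffixNode {
        final Map<Character, SuffixNode> children = new HashMap<>();
        int edgeStart, edgeEnd;  // edge label = text[edgeStart, edgeEnd)
        int suffixIndex = -1;    // >= 0 means "the suffix starting there ends here"
    }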
5. SO, DO WE HAVE A REASON FOR AN END SYMBOL?
Yes, potentially we have a non-functional reason: to save a few bytes in each non-leaf node.
6. STILL... ANY FUNCTIONAL REASON FOR AN END SYMBOL?
Yes, we may have one case where end symbolS are useful: the GENERALIZED suffix tree, not just a suffix tree.
A generalized suffix tree requires distinct end markers, either in a collection on the node or as separate end symbols. Again, you can implement it with or without special symbols.
7. BOTTOM LINE
These requirements seem to be a legacy of old systems.
Feel free to implement a suffix tree the same way as a compressed prefix tree; there are no caveats except a few bytes wasted in each node for the unused end-index flag.
A generalized suffix tree is a structure where end symbolS may be useful (but you can still build it without them).
I hope this makes the situation clearer.

Find all (English word) substrings of a given string

This is an interview question: find all (English word) substrings of a given string (e.g. "every" contains the words every, ever, very).
Obviously, we can loop over all substrings and check each one against an English dictionary, organized as a set. I believe the dictionary is small enough to fit in RAM. How should we organize the dictionary? As far as I remember, the original spell command loaded the words file into a bitmap representing the set of word hash values. I would start from that.
Another solution is a trie built from the dictionary. Using the trie we can loop over all starting positions in the string and walk the trie from each one. I guess the complexity of this solution would be the same in the worst case (O(n^2)).
Does it make sense? Would you suggest other solutions?
One option is the Aho-Corasick string matching algorithm, which "constructs a finite state machine that resembles a trie with additional links between the various internal nodes."
But all things considered, "build a trie from the English dictionary and do a simultaneous search on it for all suffixes of the given string" should be pretty good for an interview, as sketched below.
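A minimal Java sketch of that simultaneous search, assuming a simple child-map trie (names are illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class DictionaryTrie {
        static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean isWord;
        }

        private final Node root = new Node();

        void add(String word) {
            Node node = root;
            for (char c : word.toCharArray())
                node = node.children.computeIfAbsent(c, k -> new Node());
            node.isWord = true;
        }

        // Walks the trie from every starting position of `text`, reporting
        // each dictionary word found along the way (including overlaps).
        List<String> wordsIn(String text) {
            List<String> found = new ArrayList<>();
            for (int start = 0; start < text.length(); start++) {
                Node node = root;
                for (int i = start; i < text.length(); i++) {
                    node = node.children.get(text.charAt(i));
                    if (node == null) break;  // no dictionary word extends this path
                    if (node.isWord) found.add(text.substring(start, i + 1));
                }
            }
            return found;
        }
    }

For the example above, adding ever, every, and very to the trie and calling wordsIn("every") reports all three.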
I'm not sure a trie alone will easily match sub-words that begin in the middle of the string.
Another solution with a similar concept is to use a state machine or a regular expression.
The regular expression is just word1|word2|....
I'm not sure whether standard regular expression engines can handle an expression covering the whole English language, but it shouldn't be hard to build the equivalent state machine given the dictionary.
Once the regular expression is compiled / the state machine is built, the complexity of analyzing a specific string is O(n).
The first solution can be refined to use a different hash map for each word length (to reduce collisions), but other than that I can't think of anything significantly better.
