Why do we need a sentinel character in a Suffix Tree?

Why do we need to append "$" to the original string when we implement a suffix tree?

There can be special reasons for appending one (or even more) special characters to the end of the string when specific construction algorithms are used – both in the case of suffix trees and suffix arrays.
However, the most fundamental underlying reason in the case of suffix trees is a combination of two properties of suffix trees:
1. Suffix trees are PATRICIA trees, i.e. the edge labels are, unlike the edge labels of tries, strings consisting of one or more characters.
2. Internal nodes exist only at branching points.
This means you can potentially have a situation where one edge label is a prefix of another. For example, if the text has a suffix aa, the edge leading to the corresponding leaf is labelled aa; but then the single character a must also be a suffix, and there is no way for us to store the information that a suffix ends after the first a, because aa forms one continuous edge of the tree (property 1 above). We would have to introduce an intermediate node in which we could store that information.
But this would be illegal because of property 2: no inner node may exist unless there is a branching point.
The problem is solved if we can guarantee that the last character of the text is a character that occurs nowhere else in the entire string. The dollar sign is normally used as a symbol for that.
Clearly, if the last character occurs nowhere else, there can't possibly be any repetition (such as aa, or even a more complex one like abcabc) at the end of the string, hence the problem of non-branching inner nodes does not occur. In the example above, the effect of putting $ at the end of the string is:
There are three suffixes now: aa$, a$ and $, but none is a prefix of another. Obviously, this means we need to introduce an inner node after all, and there are a total of three leaves now. So, to be sure, the advantage of this is not that we save space or that anything becomes more efficient; it's just a way to guarantee the two properties above. These properties are important when we prove certain useful characteristics of suffix trees, including the fact that the number of inner nodes is linear in the length of the string (which you could not prove if non-branching inner nodes were allowed).
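To see the effect concretely, here is a throwaway check in Python (not a suffix-tree implementation, just a test of the prefix-free property):

    def suffixes(s):
        return [s[i:] for i in range(len(s))]

    def is_prefix_free(strings):
        return not any(a != b and b.startswith(a)
                       for a in strings for b in strings)

    print(is_prefix_free(suffixes("aa")))    # False: "a" is a prefix of "aa"
    print(is_prefix_free(suffixes("aa$")))   # True: "aa$", "a$", "$"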
This also means that, in practice, you might use different ways of dealing with suffixes that are prefixes of other suffixes, and with non-branching inner nodes. For example, if you use the well-known Ukkonen algorithm to construct the tree, you can do so without appending a unique character to the end; you just have to make sure that at the end, after the final iteration, you place non-branching inner nodes at the end of every implicit suffix (i.e. every suffix that ends in the middle of an edge).
Again, there can be further, very specific reasons for putting $ at the end of the text before constructing a suffix tree or array. For example, in construction algorithms for suffix arrays that are based on the DC (difference cover) principle, you must append two $ signs to the string to ensure that even the last character of the string is part of a complete character trigram, as the algorithm is based on trigram sorting. Furthermore, there are specific situations where the unique $ character must be interpreted in a special way. For the Ukkonen construction algorithm, it is sufficient for $ to be unique; for the DC suffix array algorithms it is necessary, in addition to uniqueness, that $ is lexicographically smaller than all other characters; and in the suffix-tree based circular string cutting algorithm (mentioned recently here), it is actually necessary to interpret $ as the lexicographically largest character.

I suspect that it is for traversal purposes. When you are generating something from the suffix tree, you need to know whether you are at a node where the string finishes or not; if not, then you know that you have to keep going. Looking at the longest common substring, to which a suffix tree provides a linear-time solution, you need the $ sentinels to determine that you've arrived at a node where the string terminates. You can't finish after A-NA.
(figure: the suffix tree of BANANA, from Wikipedia)

1. NOT A PATRICIA
A suffix tree is NOT a PATRICIA tree: a PATRICIA tree is radix 2 (binary), while a suffix tree node may have 2 or MORE children.
2. NO VALID REASON TODAY
There is no reason to add a special character other than:
the requirement that every internal node have 2 or more children
the requirement to have exactly n leaves for a string of n characters
A suffix tree can be implemented the same way as a compressed trie (or a radix tree, as one kind of compressed trie), without any special symbols, and there is no functional disadvantage in that case.
3. OLD TRAILS
If you look into the old book from 1973, you'll see a structure very similar to a trie, named an "uncompressed suffix tree", but with values and termination symbols. They then compact it.
4. BUT WHAT'S DIFFERENT?
Prefix and suffix trees both have metadata in nodes, right? This is implemented as the value of the node.
But with a suffix tree we've got one interesting requirement: we need to keep the index of the suffix. So, in the last node we have to keep TWO metadata fields, TWO values. And you need to keep nodes of the same size, byte for byte. SO THEY DID IT THROUGH AN ADDITIONAL NODE, THE END NODE.
In the modern world, you can keep as many fields as you want; you are not going to count each and every byte spent, so you don't need this trick.
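For illustration, a leaf in a modern implementation can simply carry the suffix index as an extra field (a minimal sketch; the field names are made up):

    class Node:
        def __init__(self):
            self.children = {}        # first character -> (edge label, child)
            self.suffix_index = None  # set on leaves only; no end node needed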
5. SO, DO WE HAVE ANY REASON FOR AN END SYMBOL?
Yes, potentially we have a non-functional reason: saving a few bytes in each non-leaf node.
6. STILL ... ANY FUNCTIONAL REASON FOR AN END SYMBOL?
Yes, there is one case where end symbolS are useful: the GENERALIZED suffix tree, not just a plain suffix tree.
A generalized suffix tree requires distinct end markers, either collected on the node or as separate end symbols. Again, you can implement it with or without special symbols.
7. BOTTOM LINE
These requirements seem to be a legacy of old systems.
Feel free to implement a suffix tree the same way as a compressed prefix tree; there are no caveats except a few bytes wasted in each node for the unused end-index flag.
A generalized suffix tree is a structure where end symbolS may be useful (but you can still build it without them).
I hope this makes the situation clearer.

Related

Compressing words into one word consisting of them as subwords

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the 'name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings with the longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
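For what it's worth, the greedy merging step is short to sketch in Python (a toy illustration of the heuristic, assuming fully-contained strings like doll and house have already been removed in the first step):

    # Toy sketch of the greedy heuristic: repeatedly merge the two strings
    # with the longest suffix/prefix overlap. The O(n^2) scans are fine for
    # a preprocessing step, as the question notes.

    def overlap(a, b):
        # length of the longest suffix of a that is a prefix of b
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_superstring(words):
        words = list(words)
        while len(words) > 1:
            best = (-1, 0, 1)   # (overlap length, i, j)
            for i in range(len(words)):
                for j in range(len(words)):
                    if i != j:
                        k = overlap(words[i], words[j])
                        if k > best[0]:
                            best = (k, i, j)
            k, i, j = best
            merged = words[i] + words[j][k:]
            words = [w for n, w in enumerate(words) if n not in (i, j)]
            words.append(merged)
        return words[0]

    print(greedy_superstring(["ragdoll", "dollhouse"]))   # ragdollhouse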
I think you can use a Radix Tree. It costs some memory because of pointers to leaves and parents, but it makes matching up strings easy (O(k), where k is the longest string size).
My first thought here is: use a data structure to determine the common prefixes and suffixes of your strings, then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.
This looks similar to the Knapsack problem, which is NP-complete, so there is no "definitive" algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
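To make the first two stages concrete, here is a naive sketch in Python (sorting all rotations is fine for a demo; real implementations use the mathematical shortcuts hinted at above):

    def bwt(s, sentinel="\x00"):
        s += sentinel                        # unique end marker
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    def mtf(s):
        table = sorted(set(s))               # dynamic symbol list
        out = []
        for c in s:
            i = table.index(c)
            out.append(i)
            table.insert(0, table.pop(i))    # move the symbol to the front
        return out

    transformed = bwt("banana")
    print(repr(transformed))   # 'annb\x00aa' -- identical letters clustered
    print(mtf(transformed))    # runs of equal letters become small indices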
Refine step 3.
Look through the current list and see whether any word in the list starts with a suffix of the current word. (You might want to require the suffix to be longer than some length, longer than 1, for example.)
If yes, then prepend the distinct prefix of the current word to that existing word, and adjust all existing references appropriately (slow!)
If no, add the word to the end of the list as in the current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. An enormous amount of manpower has already gone into compression algorithms, so why not take one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
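The same goes for most languages; for example, a minimal sketch using Python's standard gzip module:

    import gzip

    # redundant input compresses well; the words share long substrings
    data = " ".join(["dollhouse", "house", "doll"] * 100).encode()
    packed = gzip.compress(data)
    print(len(data), "->", len(packed))
    assert gzip.decompress(packed) == data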
It's not clear what you want to do.
Do you want a data structure that lets you store the strings in a memory-conscious manner, while keeping operations like search possible in a reasonable amount of time?
Or do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress it like this:
0aaa
3b
2sd
1baco
3d
The number is the length of the longest common prefix with the preceding string.
You can tweak that scheme, e.g. scheduling a "restart" of the common prefix every K words, for fast reconstruction.
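A minimal sketch of this front-coding scheme in Python (illustrative, with the "restart" tweak omitted):

    def encode(words):                    # words must be sorted
        out, prev = [], ""
        for w in words:
            k = 0
            while k < min(len(w), len(prev)) and w[k] == prev[k]:
                k += 1
            out.append((k, w[k:]))        # (shared prefix length, rest)
            prev = w
        return out

    def decode(pairs):
        words, prev = [], ""
        for k, rest in pairs:
            w = prev[:k] + rest
            words.append(w)
            prev = w
        return words

    pairs = encode(["aaa", "aaab", "aasd", "abaco", "abad"])
    print(pairs)   # [(0, 'aaa'), (3, 'b'), (2, 'sd'), (1, 'baco'), (3, 'd')]
    assert decode(pairs) == ["aaa", "aaab", "aasd", "abaco", "abad"]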

what's the best data structure for pattern matching within a vocabulary?

Given a vocabulary, what's the best data structure that can be used to find all the words within it that match a given substring?
Suppose "Ap" is the substring,
"Apple" and "Application" should be returned.
Since in this case "Ap" is at the start of the two strings, I can think of using a trie.
But what if the substring to be matched can be found anywhere in the words of the vocabulary?
Eg: If "ap" is given, "shape" should also be returned since "ap" occurs in "shape".
The vocabulary set is very large.
What you want is a suffix tree. This stores all suffixes of a (set of) strings in a trie (in your case, the set of words). Each leaf of the trie is associated with the set of strings that have that suffix.
When searching for a substring, you simply match the substring at the root of the trie; your substring must be a prefix of some suffix or there is no match. Discovering the existence of a match takes time linear in the length of the substring. To determine all matching words, you have to enumerate all leaves of the trie reachable from the point where the match completes. That is a tree-walk problem; if the tree has significant branching, it might be a bit expensive.
You could precompute, for each trie node, the set of associated words; this is likely to be pretty big, but it gives you an extremely fast determination of the matching words.
If you only need to examine the members of the set until you find one with some nice property, I'd stick with the enumeration.
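As a rough sketch of the idea (an uncompressed suffix trie, which costs much more space than a real suffix tree but shows both the lookup and the per-node precomputation suggested above; names are illustrative):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.words = set()     # ids of words with a suffix through here

    def build(vocabulary):
        root = TrieNode()
        for wid, word in enumerate(vocabulary):
            for i in range(len(word)):          # insert every suffix
                node = root
                for c in word[i:]:
                    node = node.children.setdefault(c, TrieNode())
                    node.words.add(wid)
        return root

    def find(root, substring, vocabulary):
        node = root
        for c in substring:
            if c not in node.children:
                return []
            node = node.children[c]
        return [vocabulary[w] for w in sorted(node.words)]

    vocab = ["apple", "application", "shape"]
    root = build(vocab)
    print(find(root, "ap", vocab))   # ['apple', 'application', 'shape']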

How to match a tree against a large set of patterns?

I have a potentially infinite set of symbols: A, B, C, ... There is also a distinct special placeholder symbol ? (its meaning will be explained below).
Consider non-empty finite trees such that every node has a symbol attached to it and 0 or more non-empty sub-trees. The order of sub-trees of a given node is significant (so, for example, if there is a node with 2 sub-trees, we can distinguish which one is left and which one is right). Any given symbol can appear in the tree 0 or more times, attached to different nodes. The placeholder symbol ? can be attached only to leaf nodes (i.e. nodes having no sub-trees). It follows from the usual definition of a tree that trees are acyclic.
The finiteness requirement means that the total number of nodes in a tree is a positive finite integer. It follows that the total number of attached symbols, the tree depth and the total number of nodes in every sub-tree are all finite.
Trees are given in a functional notation: a node is represented by a symbol attached to it and, if there are any sub-trees, it is followed by parentheses containing comma-separated list of sub-trees represented recursively in the same way. So, for example the tree
    A
   / \
  ?   B
     / \
    A   C
   /|\
  A C Q
       \
        ?
is represented as A(?,B(A(A,C,Q(?)),C)).
I have a pre-calculated, unchanging set of trees S that will be used as patterns to match. The set will typically have ~10^5 trees, and each of its elements will typically have ~10-30 nodes. I can use plenty of time to create beforehand any representation of S that best suits my problem stated below.
I need to write a function that accepts a tree T (typically with ~10^2 nodes) and checks as fast as possible whether T contains as a subtree any element of S, provided that any node with the placeholder symbol ? matches any non-empty subtree (whether it appears in T or in an element of S).
Please suggest a data structure to store the set S and an algorithm to check for a match. Any programing language or a pseudo-code is OK.
This paper describes a variant of the Aho–Corasick algorithm, where instead of using a finite state machine (which the standard Aho–Corasick algorithm uses for string matching) the algorithm instead uses a pushdown automaton for subtree matching. Like the Aho-Corasick string-matching algorithm, their variant only requires one pass through the input tree to match against the entire dictionary of S.
The paper is quite complex - it may be worth it to contact the author to see if he has any source code available.
What you need is a finite state machine that tracks the set of potential matches you might have.
In essence, such a machine is the result of matching the patterns against each other and determining which parts of the individual matches they share. This is analogous to how lexers take sets of regular expressions for tokens and compose them into a large FSA that can match any of the regular expressions by processing characters one at a time.
You can find references to methods for doing this under term rewriting systems.
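For reference, here is a naive baseline in Python that matches the problem statement directly: parse the functional notation, then test every node of T against every pattern. It ignores the automaton-based speedups discussed above, so it is only a sketch for checking a faster implementation against.

    def parse(s):
        # parse "A(?,B(A,C))" into (symbol, [children])
        def node(i):
            j = i
            while j < len(s) and s[j] not in "(),":
                j += 1
            sym, kids = s[i:j], []
            if j < len(s) and s[j] == "(":
                j += 1
                while s[j] != ")":
                    child, j = node(j)
                    kids.append(child)
                    if s[j] == ",":
                        j += 1
                j += 1
            return (sym, kids), j
        tree, _ = node(0)
        return tree

    def matches(pattern, tree):
        (psym, pkids), (tsym, tkids) = pattern, tree
        if psym == "?" or tsym == "?":   # ? matches any non-empty subtree
            return True
        return (psym == tsym and len(pkids) == len(tkids)
                and all(matches(p, t) for p, t in zip(pkids, tkids)))

    def contains_any(tree, patterns):
        return (any(matches(p, tree) for p in patterns)
                or any(contains_any(child, patterns) for child in tree[1]))

    T = parse("A(?,B(A(A,C,Q(?)),C))")
    S = [parse("Q(?)"), parse("B(?,C)")]
    print(contains_any(T, S))   # True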

How can we optimise the creation of a trie if we know the input is in alphabetical order?

I am implementing a prefix tree, with a standard insertion mechanism. If we know we will be given a list of words in alphabetical order, is there any way we can change the insertion to skip a few steps? I am coding in Java, although I'm not looking for code in any particular language. I have considered adding the Nodes for each word to a queue, then hopping backwards through it until we're at a prefix of the next word, but this may be circumventing the whole point of the prefix tree!
Any thoughts on something like this? I'm finding it hard to come up with an implementation that's of any use unless the input is many many very similar words ("aaaaaaaaaab", "aaaaaaaaaac", "aaaaaaaaaad", ...) or something. But even then doing a string comparison on the prefixes is probably a similar cost to just using the prefix tree normally.
There is no way that you can avoid looking at all the characters in the input strings from which you're building the tree. If there was a way to do this, then I could make your algorithm incorrect. In particular, suppose that there is a word w and you don't look at one of its characters (say, the kth character). Then when your algorithm runs and tries to place the word somewhere in the trie, it must be able to place it without knowing all the characters. Therefore, if I change the kth character of the word to something else, your algorithm would put it in exactly the same place as before, which is incorrect because one of the characters in the word won't be correct.
Since the normal algorithm for constructing a trie takes time proportional to the number of characters in the input, you won't be able to asymptotically outperform it without doing some crazy tricks like parallelizing the construction code or packing the characters into machine words and hitting them with your Hammer of Bit Hackery.
However, you could potentially get a constant factor speedup. Following large numbers of pointers in a linked structure can be slow due to cache performance, so you could speed up the algorithm by minimizing the number of pointers you have to follow. One thing you could do would be to maintain the position of the end of the last string that you inserted, along with a list (preferably as a dynamic array) of nodes tracing the path back up to the root. To insert a new word, you could do the following:
Find the longest prefix of the string that matches the last string you inserted.
Jump to the pointer in the array marking where that would take you.
Trace the rest of the path down as normal, adding all the nodes that you trace out to the array and overwriting the previous pointers.
This way, if you insert a lot of words with a common prefix of a reasonable length, you can avoid doing a bunch of pointer-chasing back through a shared part of the structure. This could conceivably give you a performance boost if you have lots of words with the same prefix. It's not asymptotically better than before (and, in fact, uses more memory), but the savings from not following pointers could add up. I haven't tested this, but it seems like it might work.
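A sketch of that idea in Python (illustrative only; the node layout and names are made up):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_word = False

    def build_from_sorted(words):
        root = TrieNode()
        last_word, path = "", [root]   # path[i] = node reached after i chars
        for word in words:
            # longest common prefix with the previously inserted word
            k = 0
            while k < min(len(word), len(last_word)) and word[k] == last_word[k]:
                k += 1
            node = path[k]             # jump straight to the shared node
            del path[k + 1:]           # drop the stale tail of the old path
            for c in word[k:]:
                node = node.children.setdefault(c, TrieNode())
                path.append(node)
            node.is_word = True
            last_word = word
        return root

    root = build_from_sorted(["aaaaaaaaaab", "aaaaaaaaaac", "aaaaaaaaaad"])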
Hope this helps!

What is a good algorithm to traverse a Trie to check for spelling suggestions?

Assuming that a general Trie of dictionary words is built, what would be the best method to check for the 4 cases of spelling mistakes - substitution, deletion, transposition and insertion during traversal?
One method is to figure out all the words within n edits of a given word and then check for them in the Trie. This isn't a bad option, but a better intuition here seems to be to use a dynamic programming (or a recursive equivalent) method to determine the best sub-tries after having modified the words during traversal.
Any ideas would be welcome!
PS, would appreciate actual inputs rather than just links in answers.
I actually wrote some code to do this the other day:
https://bitbucket.org/teoryn/spell-checker/src/tip/spell_checker.py
It's based on the code by Peter Norvig (http://norvig.com/spell-correct.html) but stores the dictionary in a trie for finding words within a given edit distance faster.
The algorithm walks the trie recursively applying the possible edits (or not) at each step along the way by consuming letters from the input word. A parameter to the recursive call states how many more edits can be made. The trie helps narrow the search space by checking which letters can actually be reached from our given prefix. For example, when inserting a character, instead of adding each letter in the alphabet, we only add letters that are reachable from the current node. Not making an edit is equivalent to taking the branch from the current node in the trie along the current letter from the input word. If that branch is not there then we can backtrack and avoid searching a possibly large space where no real words could be found.
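The gist of such a walk can be sketched like this (an illustrative reimplementation, not the code from the link above; transpositions are left out for brevity, and the nested-dict trie layout is an assumption):

    def build_trie(words):
        trie = {}
        for w in words:
            node = trie
            for c in w:
                node = node.setdefault(c, {})
            node["$"] = w            # "$" marks the end of a word
        return trie

    def search(node, word, budget, results):
        if budget < 0:
            return
        if not word:
            if "$" in node:
                results.add(node["$"])
            for c, child in node.items():    # dictionary word keeps going:
                if c != "$":                 # each extra letter costs an edit
                    search(child, "", budget - 1, results)
            return
        for c, child in node.items():
            if c == "$":
                continue
            if c == word[0]:
                search(child, word[1:], budget, results)      # match
            else:
                search(child, word[1:], budget - 1, results)  # substitution
            search(child, word, budget - 1, results)          # letter missing from input
        search(node, word[1:], budget - 1, results)           # extra letter in input

    results = set()
    search(build_trie(["apple", "ample", "maple"]), "aple", 1, results)
    print(sorted(results))   # ['ample', 'apple', 'maple']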
I think you can do this with a straightforward breadth-first search on the tree: choose a threshold for the number of errors you are looking for, then run through the letters of the word to be matched one at a time, generating a set of (prefix, subtrie) pairs reached so far that match the prefix; while you are beneath your error threshold, add to your set of next subgoals:
No error at this character place: add the subgoal of the trie at the next character in the word
An inserted, deleted, or substituted character at this place: find the appropriate trie there, and increment the error count;
Not an additional goal, but note that transpositions are either an insertion or deletion that matches an earlier deletion or insertion: if this test holds, then don't increment the error count.
This seems pretty naive: is there a problem with this that led you to think of dynamic programming?
Presuming each successive character in your word represents one level in your tree, you have five cases to check at each character (a match, deletion, insertion, substitution and transposition). I'm assuming transpositions swap two adjacent characters.
You will need a function (CheckNode) that accepts a tree node and a character to check. It will need to return a set of (child/grand-child) nodes representing matches.
You will need a function (CheckWord) that accepts a word. It checks each character in turn against a set of nodes. It will return a set of (leaf) nodes representing matched words.
The idea is that each level in the tree (child, grand-child, etc.) matches the position of the character in the word. If you call the top-level tree node level 0, then you'll have level 1, level 2, etc.
Clearly for a word without errors, there is a one to one match between the character position and the level in the tree.
For deletions, you need to skip a level in the tree.
For insertions, you need to skip a character in the word.
For substitutions, you need to skip both a level and a character.
For transpositions, you need to (temporarily) swap the characters in the word.
Take a look at calculating the Levenshtein distance which provides a dynamic programming solution for finding the distance between two sequences.
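For reference, the standard dynamic-programming formulation (where d[i][j] is the distance between the first i characters of a and the first j characters of b):

    def levenshtein(a, b):
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution / match
        return d[-1][-1]

    print(levenshtein("kitten", "sitting"))   # 3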
