Conceptually simple linear-time suffix tree constructions - algorithm

In 1973 Weiner gave the first linear-time construction of suffix trees. The algorithm was simplified in 1976 by McCreight, and in 1995 by Ukkonen. Nevertheless, I find Ukkonen's algorithm relatively involved conceptually.
Has there been simplifications to Ukkonen's algorithm since 1995?

A more direct answer to the original question is the top-down (and lazy) suffix tree construction by Giegerich, Kurtz, Stoye: https://pub.uni-bielefeld.de/luur/download?func=downloadFile&recordOId=1610397&fileOId=2311132
In addition, suffix arrays (as mentioned in the previous answer) are not only easier to construct, but they can be enhanced so as to emulate anything you'd expect from a suffix tree: http://www.daimi.au.dk/~cstorm/courses/StrAlg_e04/papers/KurtzOthers2004_EnhancedSuffixArrays.pdf
Since the data structures involved in an enhanced suffix array can be compressed, compressed (emulated) suffix trees become possible: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.8644&rep=rep1&type=pdf

It's not a direct answer, however it can help you.
Last year, while working on the subject, I ended using suffix-arrays instead of suffix-trees, and IIRC, I used the paper "An incomplex algorithm for fast suffix array construction " KB Schürmann (2007) [1] as a reference. IIRC, it's a two pass linear algorithm to build suffix-arrays.
[1] http://scholar.google.com/scholar?q=An+incomplex+algorithm+for+fast+suffix+array+construction+&hl=en&btnG=Search&as_sdt=1%2C5&as_sdtp=on

Related

Why the Suffix Array use less space than the Suffix Tree?

I'm researching about Suffix Array and Suffix Tree for my project.
In several papers such as :
"Suffix arrays: A new method for on-line string searches" by Manber and Myers - 1993.
"Simple Linear Work Suffix Array Construction" by Juha Karkkainen and Peter Sanders - 2003.
The author said that: "The advantage of the Suffix Array use less space than the Suffix Tree".
My question is "How can we know that ? Do we have any mathematical proof for that or we base on the practical experiments ?"
By observation, an array data structure seems to use less space than a tree data structure. But I want to know exactly why.
Seems like the paper mentioned at below link (section 1.2.4) will give answer to your query:
http://web.cs.iastate.edu/~cs548/suffix.pdf

really hard to understand suffix tree

I've been searching for tutorials about suffix tree for quite a while. In SO, I found 2 posts about understanding suffix tree: 1, 2.
But I can't say that I understand how to build one, Oops. In Skiena's book Algorithm design manual, he says:
Since linear time suffix tree construction algorithms are nontrivial,
I recommend using an existing implementation.
Well, is the on-line construction algorithm for suffix tree so hard? Anybody can put me in the right direction to understand it?
Anyway, cut to the chase, besides the construction, there is one more thing I don't understand about suffix tree. Because the edges in suffix tree are just a pair of integers (right?) specifying the starting and ending pos of the substring, then if I want to search a string x in this suffix tree, how should I do it? De-reference those integers in the suffix tree, then compare them one by one with x? Couldn't be this way.
Firstly, there are many ways to construct a suffix tree. There is the original O(n) method by Weiner (1973), the improved one by McCreight (1976), the most well-known by Ukkonen (1991/1992), and a number of further improvements, largely related to implementation and storage efficiency considerations. Most notable among those is perhaps the Efficient implementation of lazy suffix trees by Giegerich and Kurtz.
Moreover, since the direct construction of suffix arrays has become possible in O(n) time in 2003 (e.g. using the Skew algorithm, but there are others as well), and since there are well-studied methods for
emulating suffix trees using suffix arrays (e.g. Abouelhoda/Kurtz 2004)
compressing suffix arrays (see Navarro/Mäkinen 2007 for a survey)
suffix arrays are usually preferred over suffix trees. Therefore, if your intention is to build a highly optimised implementation for a specific purpose, you might want to look into studying suffix array construction algorithms.
However, if your interest is in suffix tree construction, and in particular the Ukkonen algorithm, I would like to suggest that you take a close look at the description in this SO post, which you mentioned already, and we try to improve that description together. It's certainly far from a perfectly intuitive explanation.
To answer the question about how to compare input string to edge labels: For efficiency reasons during construction and look-up, the initial character of every edge label is usually stored in the node. But the rest must be looked up in the main text string, just like you said, and indeed this can cause issues, in particular when the string is so large that it cannot readily be held in memory. That (plus the fact that, like any direct implementation of a tree, the suffix tree is a data structure that contains a lot of pointers, which consume much memory and make it hard to maintain locality of reference and to benefit from memory caching) is one of the main reasons why suffix trees are so much harder to handle than e.g. inverted indexes.
If you combine the suffix array with an lcp table and a child table, which of course you should do, you essentially get a suffix tree. This point is made in the paper: Linearized Suffix Trees by Kim, Park and Kim. The lcp table enables a rather awkward bottom-up traversal, and the child table enables an easy traversal of either kind. So the story about suffix trees using pointers causing locality of reference problems is in my opinion obsolete information. The suffix tree is therefore``the right and easy way to go,'' as long as you implement the tree using an underlying suffix array.
The paper by Kim, Park and Kim describes a variant of the approach in Abouelhoda et al's misleadingly titled paper: Replacing suffix trees with enhanced suffix arrays. The Kim et al paper get it right that this is an implementation of suffix trees, and not a replacement. Moreover, the details of Abouelhoda et al's construction are more simply and intuitively described in Kim et al.
,
there's an implementation of Ukkonen's linear construction of suffix trees (plus suffix arrays, lcp array) here: http://code.google.com/p/text-indexing/ . the visualization provided along with the suffixtree.js may help

Where would a suffix array be preferable to a suffix tree?

Two closely-related data structures are the suffix tree and suffix array. From what I've read, the suffix tree is faster, more powerful, more flexible, and more memory-efficient than a suffix array. However, in this earlier question, one of the top answers mentioned that suffix arrays are more widely used in practice. I don't have any experience using either of these structures, but right now it seems like I would always prefer a suffix tree over a suffix array for problems that needed the functionality they provide (fast substring checking, for example).
In what circumstances would a suffix array be preferable to a suffix tree?
(By the way, while this question is related to the one I've linked, I don't think it's an exact duplicate as I'm interested solely in a comparison of suffix arrays and suffix trees, leaving tries completely out of the picture. If you disagree, though, I would understand if this question were to be closed.)
Citing from http://www.youtube.com/watch?v=1DGZxd-PP7U
Suffix Arrays and Suffix Trees used to be different. But nowadays
Suffix Arrays are just a way of implementing a Suffix Tree (or vice
versa). See: Kim, Kim, and Park. Linearized suffix tree: an efficient
index data structure with the capabilities of suffix trees and suffix
arrays. Algorithmica, 2007.
The Kim et al paper is well written, accessible and has references to other important papers, such as the one by Abouelhoda et al.
A suffix array is nearly always preferable, except:
If you are going to index little ammounts of data.
If you are doing research on protein matching or dna mutations and have access to extremely expensive computers.
If you must at all cost use the error search with wildcards.
A suffix array can be used to implement the suffix tree. Meaning a suffix tree can be a suffix array and a few additional data structures to simulate the suffix tree functionality.
Therefore:
Suffix arrays use less space (a lot less)
Suffix trees are slower to build
Suffix trees are faster doing pattern matching operations
Suffix trees can do more operations, the best is error pattern matching with wildcards (suffix array also does pattern matching but not with wildcards)
If you want to index a lot of data, like more than 50 megabytes. The suffix tree uses so much space that your computer does not have enough ram to keep it in central memory. Therefore it starts to use secondary memory and you will see a huge degradation in speed. (for example the human dna uses 700 megabytes, a suffix tree of that data "can" use 40 gigabytes -> * "can" depending on the implementation * )
Because of this the suffix tree is nearly never used in practice. In practice the suffix array is used and small additional data structures give it some extra functionality (never a complete suffix tree).
However they are different. There are many cases where a pure suffix array is preferable for pattern matching due to efficient speed, fast construction speed and low space use.

Analyzing Recursive Algorithms

I've often been slightly stumped by recursive algorithms that seem to require magical leaps (with a large dose of shrunken notation born of an ink shortage) of logic.
I realize that the alternative is to simply memorize the Big O notation for all the common algorithms but at a certain point, that approach fails. For example, I am happy to disclose the performance for bubble sort, insertion sort, binary tree insertion/removal, mergesort, and quicksort but don't ask me to come up with the performance of AVL trees or Djikstra's shortest path algorithm off the top of my head.
Where can I go to get:
A discussion of recursive algorithm analysis that uses words instead of a profusion of symbols
Practice problems to confirm that my newly-obtained understanding is actually correct
Example:
Bad:
Sigma v e T (1+cv)
Possible 'good' equivalent:
The amount of work required for 1 node in the tree (which is 1+the # of children of a node), which is then executed once for every element in the tree where the original node is the root.
Side commentary:
I could simply watch a video for every single algorithm because there's no way to make one's voice turn into a subscript (or any of the other contortions) but I suspect that would take an inordinate amount of time compared to reading textual descriptions.
Update:
Here's 1 source of solved problems: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/ (this tackles #2 above)
TopCoders has a great source of tutorials and thorough explanations. Have you tried them out?
http://www.topcoder.com/tc?d1=tutorials&d2=alg_index&module=Static
To 'officially' answer this question...
Source of sample problems: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/
An explanation of analysis for CS: http://en.wikipedia.org/wiki/Concrete_Mathematics.

Finding the most frequent subtrees in a collection of (parse) trees

I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.
EDIT #mjv. We are working in a limited domain (chemistry) which has a stylised language so the varirty of the trees is not huge - probably similar to children's readers. Simple tree for "the cat sat on the mat".
<sentence>
<nounPhrase>
<article/>
<noun/>
</nounPhrase>
<verbPhrase>
<verb/>
<prepositionPhrase>
<preposition/>
<nounPhrase>
<article/>
<noun/>
</nounPhrase>
</prepositionPhrase>
</verbPhrase>
</sentence>
Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat". "mat" are not important in matching). So the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:
<nounPhrase>
<article/>
<adjective/>
<adjective/>
<noun/>
</nounPhrase>
The length of sentences will be longer - between 15 to 30 nodes. I would expect to get useful results from 1000 trees. If this does not take more than a day or so that's acceptable.
Obviously the shorter the tree the more frequent, so nounPhrase will be very common.
EDIT If this is to be solved by flattening the tree then I think it would be related to Longest Common Substring, not Longest Common Sequence. But note that I don't necessarily just want the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).
Finding the most frequent subtrees in the collection, create a compact form of the subtree, then iterate every subtree and use a hashset to count their occurrences. 30 nodes is too big for a perfect hash - it's only about one bit per node, and you need that much to indicate whether it's a sibling or a child.
That problem isn't LCS - the most common sequence isn't related to the longest common subsequence. The most frequent subtree is that which occurs the most.
It should be at worst case O(N L^2) for N trees of length L (assuming testing equality of a subtree containing L nodes is O(L)).
I think, although you say that performance isn't yet an issue, this is an NP-hard problem, so it may never be possible to make it fast. If I've understood correctly, you can consider this a variant of the Longest common subsequence problem; if you flatten your tree into a straight sequence like
(nounphrase)(DOWN)(article:the)(adjective:big)(adjective:black)(noun:cat)(UP)
Then your problem becomes LCS.
Wikibooks has a java implementation of LCS here
This is a well-known problem in computer science, for which there are efficient solutions.
Here are some relevant references:
Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa, Optimized Substructure Discovery for Semi-structured Data, Proc. 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2002), LNAI 2431, Springer-Verlag, 1-14, August 2002.
Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest, 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.
Or, if you just want fast code, go here:
FREQT
(transforming xml to S-expressions shouldn't give you too much problems, and is left as an exercise for the reader)
I found tool called gspan very useful in this case. Its available for free download at http://www.cs.ucsb.edu/~xyan/software/gSpan.htm . Its c++ version with matlab interface is at http://www.nowozin.net/sebastian/gboost/

Resources