Why does a suffix array use less space than a suffix tree? - data-structures

I'm researching suffix arrays and suffix trees for my project.
In several papers, such as:
"Suffix arrays: A new method for on-line string searches" by Manber and Myers, 1993.
"Simple Linear Work Suffix Array Construction" by Juha Kärkkäinen and Peter Sanders, 2003.
the authors state that the advantage of the suffix array is that it uses less space than the suffix tree.
My question is: how do we know that? Is there a mathematical proof, or is it based on practical experiments?
By observation, an array data structure seems to use less space than a tree data structure, but I want to know exactly why.

It seems the paper linked below (section 1.2.4) answers your question:
http://web.cs.iastate.edu/~cs548/suffix.pdf
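To make the intuition concrete: a plain suffix array is literally one integer per suffix (n integers total), whereas a suffix tree for a string of length n has up to 2n - 1 nodes, each carrying several pointers and integers, so its constant factor is far larger even though both are O(n). A minimal Python sketch of a suffix array (naive O(n^2 log n) construction, for illustration only; linear-time constructions exist, e.g. the Skew algorithm):

    # The entire data structure is just this list of n integers:
    # the starting positions of the suffixes in sorted order.
    def suffix_array(text):
        return sorted(range(len(text)), key=lambda i: text[i:])

    print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]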

Related

Given a prefix, how to find the most frequent words efficiently

This is an interview question extended from this one: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
But this question requires one more thing: a GIVEN PREFIX.
For example, given "Bl", return the most frequent words, such as "bloom, blame, bloomberg", etc.
So using a TRIE is a must. But then, how do you efficiently construct the heap? It's not right or practical to construct a heap for each prefix at run time. What could be a good solution?
[Suppose this TRIE or data structure is static and pre-built.]
Thanks!
Keep a trie of all the words appearing in the file, together with their counts. Now, if you are asked to return all the words with the prefix "Bl", you can do so efficiently using that trie. And since you are interested in the most frequently occurring words with the prefix "Bl", you can use a min-heap to give the answer efficiently, just as is done in the post you linked to.
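A minimal Python sketch of how that trie-plus-heap idea might look (the names and structure are illustrative, not taken from the linked post):

    import heapq

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.count = 0  # occurrences of the word ending at this node

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def top_k_with_prefix(root, prefix, k):
        # Walk down to the node representing the prefix.
        node = root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect (count, word) pairs in the subtree, then keep the k largest.
        results, stack = [], [(node, prefix)]
        while stack:
            cur, word = stack.pop()
            if cur.count:
                results.append((cur.count, word))
            for ch, child in cur.children.items():
                stack.append((child, word + ch))
        return [w for c, w in heapq.nlargest(k, results)]

    root = TrieNode()
    for w in ["bloom", "blame", "bloomberg", "bloom"]:
        insert(root, w)
    print(top_k_with_prefix(root, "bl", 2))  # ['bloom', 'bloomberg']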
Also note that since the space usage of a trie can grow very large, you could suggest to the interviewer using a ternary search tree instead of a trie.

What is the advantage of a generalized suffix tree over a prefix tree?

It would be of great help if someone could explain the reason in a bit of detail, and in which scenarios one is more advantageous than the other. Thanks in advance!
Prefix trees (tries) and generalized suffix trees are designed for different problems. Typically, you'd use tries to answer queries like "is string w contained in this set?" or "is w a prefix of some string in the set?" Generalized suffix trees are designed for queries like "what strings in this set contain w as a substring?" as well as many other queries, like longest common substring. For standard programming purposes, tries usually cover what's needed, but in specialized applications (particularly genomics) generalized suffix trees are more flexible.
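To illustrate the difference in query types, here is a naive Python sketch: inserting every suffix of every string into a trie yields a (quadratic-space) generalized suffix trie, which answers substring queries that a plain word-trie cannot. Real generalized suffix trees compress edges to stay linear; this is illustration only.

    def build_suffix_trie(strings):
        # Naive: insert every suffix of every string.
        root = {}
        for sid, s in enumerate(strings):
            for i in range(len(s)):
                node = root
                for ch in s[i:]:
                    node = node.setdefault(ch, {})
                    node.setdefault('$ids', set()).add(sid)
        return root

    def strings_containing(root, w):
        # "Which strings contain w as a substring?"
        node = root
        for ch in w:
            if ch not in node:
                return set()
            node = node[ch]
        return node['$ids']

    trie = build_suffix_trie(["banana", "bandana", "cabana"])
    print(strings_containing(trie, "ana"))  # {0, 1, 2}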
Hope this helps!

Really hard to understand suffix trees

I've been searching for tutorials about suffix trees for quite a while. On SO, I found two posts about understanding suffix trees: 1, 2.
But I can't say that I understand how to build one. In Skiena's book The Algorithm Design Manual, he says:
Since linear time suffix tree construction algorithms are nontrivial,
I recommend using an existing implementation.
Well, is the on-line construction algorithm for suffix trees really that hard? Can anybody point me in the right direction to understand it?
Anyway, to cut to the chase: besides the construction, there is one more thing I don't understand about suffix trees. The edges in a suffix tree are just pairs of integers (right?) specifying the starting and ending positions of a substring. If I want to search for a string x in this suffix tree, how should I do it? Dereference those integers in the suffix tree, then compare the characters one by one with x? Surely it can't be that way.
Firstly, there are many ways to construct a suffix tree. There is the original O(n) method by Weiner (1973), the improved one by McCreight (1976), the most well-known one by Ukkonen (1991/1992), and a number of further improvements, largely related to implementation and storage efficiency considerations. Most notable among those is perhaps the "Efficient implementation of lazy suffix trees" by Giegerich and Kurtz.
Moreover, since the direct construction of suffix arrays became possible in O(n) time in 2003 (e.g. using the Skew algorithm, but there are others as well), and since there are well-studied methods for emulating suffix trees using suffix arrays (e.g. Abouelhoda/Kurtz 2004) and for compressing suffix arrays (see Navarro/Mäkinen 2007 for a survey), suffix arrays are usually preferred over suffix trees. Therefore, if your intention is to build a highly optimised implementation for a specific purpose, you might want to look into studying suffix array construction algorithms.
However, if your interest is in suffix tree construction, and in particular the Ukkonen algorithm, I would like to suggest that you take a close look at the description in this SO post, which you mentioned already, and we try to improve that description together. It's certainly far from a perfectly intuitive explanation.
To answer the question about how to compare the input string to edge labels: for efficiency reasons during construction and look-up, the initial character of every edge label is usually stored in the node. But the rest must be looked up in the main text string, just as you said, and indeed this can cause issues, in particular when the string is so large that it cannot readily be held in memory. That, plus the fact that, like any direct implementation of a tree, the suffix tree contains a lot of pointers, which consume much memory and make it hard to maintain locality of reference and to benefit from memory caching, is one of the main reasons why suffix trees are so much harder to handle than e.g. inverted indexes.
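As a concrete illustration of that dereferencing step, here is a minimal Python sketch (the function name and signature are made up for illustration): an edge stores only a (start, end) pair into the text, and matching a pattern against the edge means comparing characters of the text itself.

    def match_edge(text, pattern, j, start, end):
        # The edge label is text[start:end]; compare it against
        # pattern[j:] character by character and return how many
        # characters matched on this edge.
        k = 0
        while j + k < len(pattern) and start + k < end \
                and pattern[j + k] == text[start + k]:
            k += 1
        return k

    # The search continues into a child node if k == end - start,
    # and stops with a mismatch if pattern[j + k] exists and differs.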
If you combine the suffix array with an LCP table and a child table, which of course you should do, you essentially get a suffix tree. This point is made in the paper Linearized Suffix Trees by Kim, Park and Kim. The LCP table enables a rather awkward bottom-up traversal, and the child table enables an easy traversal of either kind. So the story about suffix trees using pointers and causing locality-of-reference problems is, in my opinion, obsolete information. The suffix tree is therefore "the right and easy way to go," as long as you implement the tree using an underlying suffix array.
The paper by Kim, Park and Kim describes a variant of the approach in Abouelhoda et al.'s misleadingly titled paper Replacing suffix trees with enhanced suffix arrays. The Kim et al. paper gets it right that this is an implementation of suffix trees, not a replacement. Moreover, the details of Abouelhoda et al.'s construction are described more simply and intuitively in Kim et al.
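To make the "suffix array plus LCP table" point concrete, here is a Python sketch of Kasai's O(n) LCP construction, assuming a suffix array sa is already available; the child table can then be derived from the LCP values as described in the Abouelhoda et al. and Kim et al. papers.

    def lcp_array(text, sa):
        # Kasai's algorithm: lcp[r] is the length of the longest common
        # prefix of the suffixes sa[r] and sa[r-1]; lcp[0] is 0.
        n = len(text)
        rank = [0] * n
        for r, i in enumerate(sa):
            rank[i] = r
        lcp, h = [0] * n, 0
        for i in range(n):
            if rank[i] > 0:
                j = sa[rank[i] - 1]
                while i + h < n and j + h < n and text[i + h] == text[j + h]:
                    h += 1
                lcp[rank[i]] = h
                if h:
                    h -= 1  # the LCP can drop by at most 1 for the next suffix
            else:
                h = 0
        return lcp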
There's an implementation of Ukkonen's linear construction of suffix trees (plus suffix arrays and the LCP array) here: http://code.google.com/p/text-indexing/. The visualization provided along with suffixtree.js may help.

Conceptually simple linear-time suffix tree constructions

In 1973 Weiner gave the first linear-time construction of suffix trees. The algorithm was simplified in 1976 by McCreight, and in 1995 by Ukkonen. Nevertheless, I find Ukkonen's algorithm relatively involved conceptually.
Have there been any simplifications of Ukkonen's algorithm since 1995?
A more direct answer to the original question is the top-down (and lazy) suffix tree construction by Giegerich, Kurtz, Stoye: https://pub.uni-bielefeld.de/luur/download?func=downloadFile&recordOId=1610397&fileOId=2311132
In addition, suffix arrays (as mentioned in the previous answer) are not only easier to construct, but they can be enhanced so as to emulate anything you'd expect from a suffix tree: http://www.daimi.au.dk/~cstorm/courses/StrAlg_e04/papers/KurtzOthers2004_EnhancedSuffixArrays.pdf
Since the data structures involved in an enhanced suffix array can be compressed, compressed (emulated) suffix trees become possible: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.8644&rep=rep1&type=pdf
It's not a direct answer, but it may help you.
Last year, while working on the subject, I ended up using suffix arrays instead of suffix trees, and IIRC I used the paper "An incomplex algorithm for fast suffix array construction" by KB Schürmann (2007) [1] as a reference. IIRC, it's a two-pass linear algorithm for building suffix arrays.
[1] http://scholar.google.com/scholar?q=An+incomplex+algorithm+for+fast+suffix+array+construction+&hl=en&btnG=Search&as_sdt=1%2C5&as_sdtp=on

What algorithm can you use to find duplicate phrases in a string?

Given an arbitrary string, what is an efficient method of finding duplicate phrases? We can say that phrases must be longer than a certain length to be included.
Ideally, you would end up with the number of occurrences for each phrase.
In theory
A suffix array is the 'best' answer since it can be implemented to use linear space and time to detect any duplicate substrings. However - the naive implementation actually takes time O(n^2 log n) to sort the suffixes, and it's not completely obvious how to reduce this down to O(n log n), let alone O(n), although you can read the related papers if you want to.
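A naive Python sketch of that idea (using the slow O(n^2 log n) sort cautioned about above): after sorting the suffixes, every repeated substring shows up as a common prefix of two adjacent suffixes.

    def repeated_substrings(text, min_len):
        suffixes = sorted(range(len(text)), key=lambda i: text[i:])
        found = set()
        for a, b in zip(suffixes, suffixes[1:]):
            # Length of the common prefix of two adjacent sorted suffixes.
            k = 0
            while a + k < len(text) and b + k < len(text) \
                    and text[a + k] == text[b + k]:
                k += 1
            if k >= min_len:
                found.add(text[a:a + k])
        return found

    print(repeated_substrings("a rose is a rose is a rose", 7))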
A suffix tree can take slightly more memory (still linear, though) than a suffix array, but it is easier to implement so that it builds quickly, since you can use something like a radix sort idea as you add things to the tree (see the Wikipedia link from the name for details).
The KMP algorithm is also good to be aware of; it is specialized for searching for a particular substring within a longer string very quickly. If you only need this special case, just use KMP; there is no need to bother building an index of suffixes first.
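For completeness, a standard KMP sketch in Python (this is the textbook algorithm, not tied to any particular library):

    def kmp_search(text, pattern):
        # Return every index where pattern starts in text, in O(n + m).
        if not pattern:
            return []
        # fail[i]: length of the longest proper prefix of pattern[:i+1]
        # that is also a suffix of it.
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        hits, k = [], 0
        for i, ch in enumerate(text):
            while k and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                hits.append(i - k + 1)
                k = fail[k - 1]
        return hits

    print(kmp_search("abracadabra", "abra"))  # [0, 7]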
In practice
I'm guessing you're analyzing a document of actual natural language (e.g. English) words, and you actually want to do something with the data you collect.
In this case, you might just want to do a quick n-gram analysis for some small n, such as n = 2 or 3. For example, you could tokenize your document into a list of words by stripping out punctuation and capitalization and by stemming words ('running' and 'runs' both -> 'run') to increase semantic matches. Then just build a hash map (such as hash_map in C++, a dictionary in Python, etc.) from each adjacent pair of words to its number of occurrences so far. In the end you get some very useful data which was very fast to code and not crazy slow to run.
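A rough Python sketch of that bigram (n = 2) idea; the tokenizer is deliberately crude and stemming is omitted:

    import collections
    import re

    def bigram_counts(document):
        # Lowercase, strip punctuation, then count adjacent word pairs.
        words = re.findall(r"[a-z']+", document.lower())
        return collections.Counter(zip(words, words[1:]))

    text = "The quick fox and the quick dog and the quick fox."
    print(bigram_counts(text).most_common(2))
    # [(('the', 'quick'), 3), (('quick', 'fox'), 2)]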
As earlier folks have mentioned, the suffix tree is the best tool for the job. My favorite site for suffix trees is http://www.allisons.org/ll/AlgDS/Tree/Suffix/. It enumerates all the nifty uses of suffix trees on one page and has a test JS application embedded for trying out strings and working through examples.
Suffix trees are a good way to implement this. The bottom of that article has links to implementations in different languages.
Like jmah said, you can use suffix trees/suffix arrays for this.
There is a description of an algorithm you could use here (see Section 3.1).
You can find a more in-depth description in the book they cite (Gusfield, 1997), which is on google books.
Suppose you are given a sorted array A with n entries. Because the array is sorted, equal values are adjacent, so a single pass suffices (a corrected, runnable Python version of the original pseudocode):

    def mark_duplicates(a):
        # a is sorted, so duplicates sit next to each other.
        duplicates = set()
        for i in range(len(a) - 1):
            if a[i] == a[i + 1]:
                duplicates.add(a[i])  # mark a[i] and a[i+1] as duplicates
        return duplicates

This algorithm runs in O(n) time.
