Suffix Array sentinel character lexicographical order - algorithm

This question is based on this answer by jogojapan.
In that answer, he notes that for some suffix tree/suffix array algorithms, just having a unique sentinel character $ is sufficient, while others require $ to either lexicographically compare smallest/largest.
Reading Abouelhoda et al.'s paper Replacing suffix trees with enhanced suffix arrays, I see they make the choice that $ must be larger than any other character. With this choice, they are able to construct efficient algorithms which can simulate both bottom-up and top-down suffix tree traversal, as well as various potential applications based on these traversal schemes.
On the other hand, algorithms for efficiently constructing the suffix array or LCP array using induced sorting make the opposite choice: $ must be lexicographically smallest. (see: Linear Suffix Array Construction by Almost Pure Induced-Sorting by Nong et al., and Inducing the LCP-Array by Johannes Fischer).
It's not immediately obvious to me if these choices for what properties $ has are necessary or were just done for convenience.
It would strike me as extremely unfortunate if the fastest SA/LCP-Array construction algorithms can't be used with many efficient algorithms which utilize suffix arrays.
1. Do the induced sorting construction methods strictly require that $ be lexicographically smallest, or do they work equally well (or with minor modifications) if I choose $ to be lexicographically largest?
2. If the answer to 1 is no, do the algorithms Abouelhoda presents for emulating top-down/bottom-up suffix tree traversal apply if $ is lexicographically smallest, and if not, can they be slightly modified so they can be used?
3. If the answer to both 1 and 2 is no, are there completely different algorithms which may be used to perform similar tasks when I make the choice that $ is lexicographically smallest? What are they, if they exist?

If it ever matters, then you can just add another sentinel.
I'm pretty sure you can get induced sorting to work with a largest-value sentinel, but if you can't, or if you just don't want to bother figuring out how, then just add a largest-value sentinel before adding the smallest-value sentinel that the algorithm requires.
This would add just one extra suffix to the suffix array, which you could easily remove, and the remaining ones will be in the order you require.
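For illustration, here is a toy sketch of that double-sentinel trick in Python (the names and the naive construction are mine; '~' stands in for a largest-value sentinel and '\0' for the smallest-value sentinel that an induced-sorting construction would expect):

def suffix_array(s):
    # placeholder construction; a real implementation would use SA-IS or similar
    return sorted(range(len(s)), key=lambda i: s[i:])

text = "banana"
augmented = text + "~" + "\0"      # largest sentinel first, then smallest sentinel
sa = suffix_array(augmented)
# drop the one extra suffix contributed by the trailing "\0"
sa = [i for i in sa if i != len(augmented) - 1]
print(sa)                          # [1, 3, 5, 0, 2, 4, 6]

The remaining entries are exactly the suffix order you would get for "banana~" with '~' treated as the largest character.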

Related

really hard to understand suffix tree

I've been searching for tutorials about suffix trees for quite a while. On SO, I found 2 posts about understanding suffix trees: 1, 2.
But I can't say that I understand how to build one, oops. In Skiena's book The Algorithm Design Manual, he says:
Since linear time suffix tree construction algorithms are nontrivial, I recommend using an existing implementation.
Well, is the online construction algorithm for suffix trees really so hard? Can anybody point me in the right direction to understand it?
Anyway, to cut to the chase: besides the construction, there is one more thing I don't understand about suffix trees. Since the edges in a suffix tree are just a pair of integers (right?) specifying the starting and ending positions of the substring, if I want to search for a string x in this suffix tree, how should I do it? De-reference those integers in the suffix tree, then compare the characters one by one with x? Surely it can't work that way.
Firstly, there are many ways to construct a suffix tree. There is the original O(n) method by Weiner (1973), the improved one by McCreight (1976), the most well-known by Ukkonen (1991/1992), and a number of further improvements, largely related to implementation and storage efficiency considerations. Most notable among those is perhaps the Efficient implementation of lazy suffix trees by Giegerich and Kurtz.
Moreover, since the direct construction of suffix arrays has become possible in O(n) time in 2003 (e.g. using the Skew algorithm, but there are others as well), and since there are well-studied methods for
emulating suffix trees using suffix arrays (e.g. Abouelhoda/Kurtz 2004)
compressing suffix arrays (see Navarro/Mäkinen 2007 for a survey)
suffix arrays are usually preferred over suffix trees. Therefore, if your intention is to build a highly optimised implementation for a specific purpose, you might want to look into studying suffix array construction algorithms.
However, if your interest is in suffix tree construction, and in particular the Ukkonen algorithm, I would like to suggest that you take a close look at the description in this SO post, which you mentioned already, and we try to improve that description together. It's certainly far from a perfectly intuitive explanation.
To answer the question about how to compare the input string to edge labels: For efficiency reasons during construction and look-up, the initial character of every edge label is usually stored in the node. But the rest must be looked up in the main text string, just like you said, and indeed this can cause issues, in particular when the string is so large that it cannot readily be held in memory. That, plus the fact that a suffix tree (like any direct implementation of a tree) contains a lot of pointers, which consume much memory and make it hard to maintain locality of reference and to benefit from memory caching, is one of the main reasons why suffix trees are so much harder to handle than e.g. inverted indexes.
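To make the part about edge labels concrete, here is a small sketch (the names and structure are mine, not taken from any particular implementation) of matching a pattern against an edge whose label is stored only as a pair of indices into the original text:

text = "banana$"

def match_edge(pattern, p, edge_start, edge_end):
    # Compare pattern[p:] against text[edge_start:edge_end], i.e. de-reference
    # the integer pair into the main text and compare character by character.
    i = 0
    while p + i < len(pattern) and edge_start + i < edge_end:
        if pattern[p + i] != text[edge_start + i]:
            break
        i += 1
    return i            # number of characters matched along this edge

# an edge labelled "ana" would be stored as (1, 4):
print(match_edge("anab", 0, 1, 4))   # 3 characters match

So yes, lookup really does dereference into the main string, which is also why keeping the text in memory (or at least well cached) matters.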
If you combine the suffix array with an lcp table and a child table, which of course you should do, you essentially get a suffix tree. This point is made in the paper Linearized Suffix Trees by Kim, Park and Kim. The lcp table enables a rather awkward bottom-up traversal, and the child table enables an easy traversal of either kind. So the story about suffix trees using pointers causing locality of reference problems is in my opinion obsolete information. The suffix tree is therefore "the right and easy way to go," as long as you implement the tree using an underlying suffix array.
The paper by Kim, Park and Kim describes a variant of the approach in Abouelhoda et al.'s misleadingly titled paper Replacing suffix trees with enhanced suffix arrays. The Kim et al. paper gets it right that this is an implementation of suffix trees, and not a replacement. Moreover, the details of Abouelhoda et al.'s construction are more simply and intuitively described in Kim et al.
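As a concrete illustration of the lcp table mentioned above, here is a sketch of computing it from the suffix array in O(n), following the idea of Kasai et al.; the child table and the traversal routines of Abouelhoda et al. / Kim et al. are left out, and the suffix array is built naively just for the example:

def lcp_array(text, sa):
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n                  # lcp[r] = lcp of suffixes sa[r-1] and sa[r]
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp

text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])   # [5, 3, 1, 0, 4, 2]
print(lcp_array(text, sa))                               # [0, 1, 3, 0, 0, 2]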
There's an implementation of Ukkonen's linear construction of suffix trees (plus suffix arrays and the lcp array) here: http://code.google.com/p/text-indexing/ . The visualization provided along with suffixtree.js may help.

Where would a suffix array be preferable to a suffix tree?

Two closely-related data structures are the suffix tree and suffix array. From what I've read, the suffix tree is faster, more powerful, more flexible, and more memory-efficient than a suffix array. However, in this earlier question, one of the top answers mentioned that suffix arrays are more widely used in practice. I don't have any experience using either of these structures, but right now it seems like I would always prefer a suffix tree over a suffix array for problems that needed the functionality they provide (fast substring checking, for example).
In what circumstances would a suffix array be preferable to a suffix tree?
(By the way, while this question is related to the one I've linked, I don't think it's an exact duplicate as I'm interested solely in a comparison of suffix arrays and suffix trees, leaving tries completely out of the picture. If you disagree, though, I would understand if this question were to be closed.)
Citing from http://www.youtube.com/watch?v=1DGZxd-PP7U
Suffix Arrays and Suffix Trees used to be different. But nowadays Suffix Arrays are just a way of implementing a Suffix Tree (or vice versa). See: Kim, Kim, and Park. Linearized suffix tree: an efficient index data structure with the capabilities of suffix trees and suffix arrays. Algorithmica, 2007.
The Kim et al paper is well written, accessible and has references to other important papers, such as the one by Abouelhoda et al.
A suffix array is nearly always preferable, except:
If you are going to index small amounts of data.
If you are doing research on protein matching or DNA mutations and have access to extremely expensive computers.
If you must at all costs use error search with wildcards.
A suffix array can be used to implement a suffix tree; that is, a suffix tree can be built as a suffix array plus a few additional data structures that simulate the suffix tree functionality.
Therefore:
Suffix arrays use less space (a lot less)
Suffix trees are slower to build
Suffix trees are faster doing pattern matching operations
Suffix trees can do more operations, the best is error pattern matching with wildcards (suffix array also does pattern matching but not with wildcards)
If you want to index a lot of data, say more than 50 megabytes, the suffix tree uses so much space that your computer does not have enough RAM to keep it in main memory. It then starts to use secondary memory and you will see a huge degradation in speed. (For example, the human DNA is about 700 megabytes, and a suffix tree of that data "can" use 40 gigabytes, where "can" depends on the implementation.)
Because of this the suffix tree is nearly never used in practice. In practice the suffix array is used and small additional data structures give it some extra functionality (never a complete suffix tree).
However, they are different: there are many cases where a pure suffix array is preferable for pattern matching, thanks to its efficient speed, fast construction and low space use.
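To make that last point concrete, here is a rough sketch (mine, not from the answer above) of pattern matching with nothing but a plain suffix array and binary search, O(m log n) per query; the suffix array itself is built naively just to keep the example short:

text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(pattern):
    m = len(pattern)
    # lower bound: first suffix whose m-character prefix is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # upper bound: first suffix whose m-character prefix is > pattern
    lo, hi = start, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])    # all positions where pattern occurs

print(occurrences("issi"))   # [1, 4]
print(occurrences("ss"))     # [2, 5]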

Lexicographically smallest perfect matching

I want to find the lexicographically smallest perfect matching in a bipartite graph. I'm supposed to use Kuhn's algorithm, but I don't understand how to make the matching lexicographically smallest. Is it at all possible in Kuhn's algorithm? I can provide my code, but it's classic enough.
As a hint, consider how you could determine where just the first node should be matched in the lex-min matching.
In most cases like this it is usually easier to make a reduction instead of modifying the algorithm:
1. Find a way to change the input in your problem so that lexicographical order breaks any ties (but in a manner that perfect matchings still have a higher score than imperfect ones).
2. Run the modified graph through Kuhn's algorithm.
3. If needed, translate the answer back to the original problem.
I haven't tried to actually solve this myself or read the problem in detail. But this seems to be a textbook exercise and I feel this answer is enough :-)
Think about how you can create assignment prices that encourage lexicographically early matchings.
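To make the hints above a bit more concrete, here is a rough sketch (my own, not code from either answer) of the greedy idea: match each left vertex, in order, to its smallest neighbour that still leaves a perfect matching possible for the rest, checking feasibility with plain Kuhn's (augmenting-path) algorithm. The graph representation and names are assumptions for the example.

def kuhn_max_matching(adj, n_left, forced=None):
    # Size of a maximum matching; `forced` maps left -> right pairs that must
    # be used, so their vertices are excluded from the augmenting search.
    forced = forced or {}
    used_right = set(forced.values())
    match_r = {}                            # right vertex -> left vertex

    def try_augment(u, visited):
        for v in adj[u]:
            if v in used_right or v in visited:
                continue
            visited.add(v)
            if v not in match_r or try_augment(match_r[v], visited):
                match_r[v] = u
                return True
        return False

    size = len(forced)
    for u in range(n_left):
        if u not in forced and try_augment(u, set()):
            size += 1
    return size

def lex_smallest_perfect_matching(adj, n_left):
    forced = {}
    for u in range(n_left):
        for v in sorted(adj[u]):
            if v in forced.values():
                continue
            forced[u] = v
            if kuhn_max_matching(adj, n_left, forced) == n_left:
                break                       # keep the smallest feasible choice
            del forced[u]
        else:
            return None                     # no perfect matching exists
    return forced

adj = {0: [1, 2], 1: [0, 1], 2: [1, 2]}       # left 0..2, right 0..2
print(lex_smallest_perfect_matching(adj, 3))  # {0: 1, 1: 0, 2: 2}

This does one feasibility check per candidate edge in the worst case, each check being a full matching run, so it is not the fastest possible approach, but it shows the reduction idea.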

Is the BeechickSort algorithm better than quicksort?

We know quicksort is an efficient sorting algorithm; now here they say this:
BeechickSort (patent 5,218,700) has these characteristics:
Sorts two to three times faster than the quicksort algorithm, depending on the list.
Unlike quicksort algorithms, it provides stable sorting of duplicate keys.
Whether the list is previously sorted or shuffled makes no difference.
Uses no compares.
Uses no swaps.
Uses no pivot point.
Works equally well with short or long lists.
Is economical with memory.
The first sorted results are available for other processes almost immediately, while the rest of the list is still being sorted.
Do you know the implementation, or do we have to wait until the release?
It appears to be basically a radix sort: that is, classify items by their "most significant part" (leading bits/digits for an integer, first character(s) for a string), then recursively by "less significant" parts. You can do this by, e.g., setting up an array with one entry per possible most-significant part, then doing a single pass over all the items and assigning each to the appropriate element.
Most versions of radix sort actually process the least-significant part first; this turns out to make things easier. "Beechick sort" apparently involves processing the most-significant part first; apparently the inventor has, or claims to have, a novel way of doing this that doesn't incur enough overhead to outweigh the advantage of not needing to process parts of the data that aren't needed to establish the ordering.
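For what it's worth, here is a rough illustration (my own sketch, not the patented algorithm) of the MSD radix-sort idea described above: bucket items by their most significant character, then recurse on each bucket with the next character, with no comparisons between whole keys, no swaps and no pivot:

def msd_radix_sort(strings, pos=0):
    if len(strings) <= 1:
        return strings
    buckets = {}                 # character at `pos` -> strings in that bucket
    done = []                    # strings with no character at `pos` sort first
    for s in strings:
        if pos >= len(s):
            done.append(s)
        else:
            buckets.setdefault(s[pos], []).append(s)
    result = done
    for ch in sorted(buckets):               # visit buckets in character order
        result.extend(msd_radix_sort(buckets[ch], pos + 1))
    return result

print(msd_radix_sort(["banana", "band", "ban", "apple", "bandana"]))
# ['apple', 'ban', 'banana', 'band', 'bandana']

Note that this version is stable (equal keys keep their input order), matching the stability claim in the patent blurb.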
You can read the whole thing at http://www.freepatentsonline.com/5218700.pdf if you want to figure out exactly what contribution this patent allegedly makes beyond plain ol' radix sort (which has been pretty well known for ages) and don't mind wading through a load of patentese. Alternatively, there's some explanation at http://www.beechick-sort.bizhosting.com/abcsort.html. The latter includes C code for a simple version of the algorithm.

What algorithm can you use to find duplicate phrases in a string?

Given an arbitrary string, what is an efficient method of finding duplicate phrases? We can say that phrases must be longer than a certain length to be included.
Ideally, you would end up with the number of occurrences for each phrase.
In theory
A suffix array is the 'best' answer since it can be implemented to use linear space and time to detect any duplicate substrings. However - the naive implementation actually takes time O(n^2 log n) to sort the suffixes, and it's not completely obvious how to reduce this down to O(n log n), let alone O(n), although you can read the related papers if you want to.
A suffix tree can take slightly more memory (still linear, though) than a suffix array, but is easier to implement to build quickly since you can use something like a radix sort idea as you add things to the tree (see the wikipedia link from the name for details).
The KMP algorithm is also good to be aware of; it is specialized for searching for a particular substring within a longer string very quickly. If you only need this special case, just use KMP and there is no need to bother building an index of suffixes first.
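For a feel of the suffix-array idea, here is a naive sketch (mine): sort the suffixes, then any substring that occurs more than once shows up as a common prefix of two adjacent suffixes in that order. This version is nowhere near linear time and does not count occurrences; it just shows the principle.

def repeated_substrings(text, min_len):
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix array
    found = set()
    for a, b in zip(sa, sa[1:]):
        # longest common prefix of two adjacent suffixes
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        if k >= min_len:
            found.add(text[a:a + k])
    return found

print(repeated_substrings("banana", 2))   # {'ana', 'na'}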
In practice
I'm guessing you're analyzing a document of actual natural language (e.g. English) words, and you actually want to do something with the data you collect.
In this case, you might just want to do a quick n-gram analysis for some small n, such as just n=2 or 3. For example, you could tokenize your document into a list of words by stripping out punctuation, capitalization, and stemming words (running, runs both -> 'run') to increase semantic matches. Then just build a hash map (such as hash_map in C++, a dictionary in python, etc) of each adjacent pair of words to its number of occurrences so far. In the end you get some very useful data which was very fast to code, and not crazy slow to run.
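A quick sketch of that bigram-counting idea (the tokenizer here is deliberately crude and no real stemmer is used):

import re
from collections import Counter

def bigram_counts(text):
    words = re.findall(r"[a-z']+", text.lower())   # lowercase, strip punctuation
    return Counter(zip(words, words[1:]))

text = "the quick brown fox jumps over the quick brown dog"
for pair, count in bigram_counts(text).most_common(3):
    print(pair, count)
# ('the', 'quick') 2
# ('quick', 'brown') 2
# ('brown', 'fox') 1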
As the earlier folks mention, a suffix tree is the best tool for the job. My favorite site for suffix trees is http://www.allisons.org/ll/AlgDS/Tree/Suffix/. It enumerates all the nifty uses of suffix trees on one page and has a test JS application embedded to test strings and work through examples.
Suffix trees are a good way to implement this. The bottom of that article has links to implementations in different languages.
Like jmah said, you can use suffix trees/suffix arrays for this.
There is a description of an algorithm you could use here (see Section 3.1).
You can find a more in-depth description in the book they cite (Gusfield, 1997), which is on google books.
Suppose you are given a sorted array A with n entries. Then a single linear scan over adjacent pairs finds every duplicate:

def mark_duplicates(A):
    # A is sorted, so equal entries are adjacent
    duplicates = set()
    for i in range(len(A) - 1):
        if A[i] == A[i + 1]:
            duplicates.add(A[i])    # A[i] and A[i+1] are duplicates
    return duplicates

This runs in O(n) time (on top of whatever it cost to sort the array).

Resources