How is the worst-case time complexity of constructing a suffix tree linear?

I have trouble understanding how the worst case time complexity of constructing a suffix tree is linear - particularly when we need to build a suffix tree for a string that may be composed of repeating single character such as "aaaaa".
Even if I were to construct a compressed suffix tree for "aaaaa", I won't be really able to compress any nodes since no two edges starting out of a node can have string-labels beginning with the same character.
This would result in a suffix tree of height 5, and at each insertion of the suffix, I would need to keep traversing from the root to the leaf.
Here is how I approached it:
suffixes: a, aa, aaa, aaaa, aaaaa
Create the root node, create an edge labelled 'a' and connect it to a new node whose left child bears "$", and repeat this process until we reach aaaaa.
This would result in O(n^2) instead of O(n).
What am I missing here?

I agree with the comments, but here are some more details:
The procedure you describe for forming the aaaaa tree is O(n), not O(n^2). Node and leaf creation are constant-time operations, and you perform them n==5 times. Your assumption of O(n^2) seems to be based on the idea that you'd traverse from root to leaf in each step, but there is no need to do this; e.g. in Ukkonen's algorithm:
You keep a pointer to the node you left off at before inserting the next suffix.
In case of repetitions you don't perform any work until the repetitions end, and then you insert the final $ marks one by one, following down the characters on the edge you have created, as well as the chain of suffix links in case of more complex repetitions.
The key to why Ukkonen's algorithm is O(n) is that it maintains a "memory" of where to make inserts, in the form of (a) a pointer to where the previous insert was made, and (b) a network of suffix links. That network can be vast, but there is only one suffix link per internal node, so it's still O(n) in size.

Related

How to get the n-th value of a b-tree

Is there general pseudocode or related data structure to get the nth value of a b-tree? For example, the eighth value of this tree is 13 [1,4,9,9,11,11,12,13].
If I have some values sorted in a b-tree, I would like to find the nth value without having to go through the entire tree. Is there a better structure for this problem? The data order could update anytime.
You are looking for an order statistic tree. The idea is that, in addition to the data stored in each node, you also store the size of the node's subtree and keep it updated on insertions and deletions.
Since you are "touching" O(log n) nodes for each insert/delete operation, keeping the sizes up to date preserves the O(log n) behavior of these operations.
FindKth() is then done by skipping every subtree whose largest index is still smaller than k and checking the next one. Since you never need to descend into the skipped subtrees, only into the one that contains the k-th element (checking the nodes on the path to it), you "touch" O(log n) nodes, which makes this operation O(log n) as well.
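The size-augmented lookup can be sketched roughly as follows. This is only an illustration in Python, assuming a plain (unbalanced) binary search tree rather than a b-tree or a self-balancing tree; the names Node, insert and find_kth are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    size: int = 1  # number of keys stored in the subtree rooted here


def size(node):
    return node.size if node else 0


def insert(node, key):
    """Plain BST insert (no rebalancing) that keeps subtree sizes up to date."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node


def find_kth(node, k):
    """Return the k-th smallest key (1-based) using only subtree sizes."""
    while node is not None:
        left = size(node.left)
        if k <= left:
            node = node.left       # answer is inside the left subtree
        elif k == left + 1:
            return node.key        # the current node is the k-th key
        else:
            k -= left + 1          # skip the left subtree and this node
            node = node.right
    raise IndexError("k out of range")


root = None
for key in [11, 4, 9, 13, 1, 12, 9, 11]:
    root = insert(root, key)
print(find_kth(root, 8))  # 13
```

Without rebalancing the O(log n) bound only holds when the tree happens to stay shallow; a balanced variant would have to repair the size fields during rotations or node splits as well.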

Data structure for inverting a subarray in log(n)

Build a Data structure that has functions:
set(arr,n) - initialize the structure with array arr of length n. Time O(n)
fetch(i) - fetch arr[i]. Time O(log(n))
invert(k,j) - (when 0 <= k <= j <= n) inverts the sub-array [k,j]. meaning [4,7,2,8,5,4] with invert(2,5) becomes [4,7,4,5,8,2]. Time O(log(n))
How about saving the indices in a binary search tree and using a flag saying the index is inverted? But if I do more than one invert, it messes things up.
Here is how we can approach designing such a data structure.
Indeed, using a balanced binary search tree is a good idea to start.
First, let us store array elements as pairs (index, value).
Naturally, the elements are sorted by index, so that the in-order traversal of a tree will yield the array in its original order.
Now, if we maintain a balanced binary search tree, and store the size of the subtree in each node, we can already do fetch in O(log n).
Next, let us only pretend we store the index.
Instead, we still arrange elements as we did with (index, value) pairs, but store only the value.
The index is now stored implicitly and can be calculated as follows.
Start from the root and go down to the target node.
Whenever we move to a left subtree, the index does not change.
When moving to a right subtree, add the size of the left subtree plus one (the size of the current vertex) to the index.
What we got at this point is a fixed-length array stored in a balanced binary search tree. It takes O(log n) to access (read or write) any element, as opposed to O(1) for a plain fixed-length array, so it is about time to get some benefit for all the trouble.
The next step is to devise a way to split our array into left and right parts in O(log n) given the required size of the left part, and merge two arrays by concatenation.
This step introduces dependency on our choice of the balanced binary search tree.
Treap is the obvious candidate since it is built on top of the split and merge primitives, so this improvement comes for free.
Perhaps it is also possible to split a Red-black tree or a Splay tree in O(log n) (though I admit I didn't try to figure out the details myself).
Right now, the structure is already more powerful than an array: it allows splitting and concatenation of "arrays" in O(log n), although element access is as slow as O(log n) too.
Note that this would not be possible if we still stored index explicitly at this point, since indices would be broken in the right part of a split or merge operation.
Finally, it is time to introduce the invert operation.
Let us store a flag in each node to signal whether the whole subtree of this node has to be inverted.
This flag will be lazily propagating: whenever we access a node, before doing anything, check if the flag is true.
If this is the case, swap the left and right subtrees, toggle (true <-> false) the flag in the root nodes of both subtrees, and set the flag in the current node to false.
Now, when we want to invert a subarray:
split the array into three parts (before the subarray, the subarray itself, and after the subarray) by two split operations,
toggle (true <-> false) the flag in the root of the middle (subarray) part,
then merge the three parts back in their original order by two merge operations.
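Putting the pieces together, a minimal sketch of such a structure could look like this. It assumes a treap with implicit indices, 0-based positions and an inclusive [k, j] range as in the question's example; all names are illustrative. Note that build below uses repeated merges, which is O(n log n) expected rather than the O(n) the question asks for.

```python
import random


class Node:
    def __init__(self, value):
        self.value = value
        self.priority = random.random()
        self.left = None
        self.right = None
        self.size = 1
        self.flip = False  # lazy flag: this whole subtree must be inverted


def size(t):
    return t.size if t else 0


def update(t):
    t.size = 1 + size(t.left) + size(t.right)


def push(t):
    """Apply a pending inversion to t and hand it down to its children."""
    if t and t.flip:
        t.left, t.right = t.right, t.left
        for child in (t.left, t.right):
            if child:
                child.flip = not child.flip
        t.flip = False


def split(t, k):
    """Split t into (first k elements, remaining elements)."""
    if t is None:
        return None, None
    push(t)
    if size(t.left) >= k:
        a, b = split(t.left, k)
        t.left = b
        update(t)
        return a, t
    a, b = split(t.right, k - size(t.left) - 1)
    t.right = a
    update(t)
    return t, b


def merge(a, b):
    """Concatenate two treaps, keeping the larger priority on top."""
    if a is None or b is None:
        return a or b
    if a.priority > b.priority:
        push(a)
        a.right = merge(a.right, b)
        update(a)
        return a
    push(b)
    b.left = merge(a, b.left)
    update(b)
    return b


def build(arr):
    root = None
    for v in arr:
        root = merge(root, Node(v))
    return root


def fetch(t, i):
    """Return the element at 0-based position i."""
    while t:
        push(t)
        left = size(t.left)
        if i < left:
            t = t.left
        elif i == left:
            return t.value
        else:
            i -= left + 1
            t = t.right
    raise IndexError(i)


def invert(t, k, j):
    """Invert positions k..j (inclusive) and return the new root."""
    left, rest = split(t, k)
    mid, right = split(rest, j - k + 1)
    if mid:
        mid.flip = not mid.flip
    return merge(merge(left, mid), right)


def to_list(t):
    return [fetch(t, i) for i in range(size(t))]


root = build([4, 7, 2, 8, 5, 4])
root = invert(root, 2, 5)
print(to_list(root))  # [4, 7, 4, 5, 8, 2]
```

Because the flag is only toggled, any number of overlapping inverts compose correctly, which is exactly what breaks when indices are stored explicitly.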

Nearest node in a tree

Recently I encountered this problem on trees, for which I found an O(n*q) solution. I am wondering whether there is a better way to handle it with lower complexity.
The problem is here as follows :
Given an unweighted tree of n nodes (n >= 1, and n can go up to 10^5), each node is either special or non-special. Node 1 is always special and the rest are non-special initially. There are two operations:
1. Update any non-special node to a special node with "U Node_Number"
OR
2. At any time, ask "Q Node_Number", which should return the special node in the tree closest to Node_Number.
The number of operations can also go up to 10^5.
My Solution :
I thought of creating adjacency list. For operation 1, I can keep record of special or Non special by boolean flag. But for operation 2 , my solution comprises of doing BFS whenever "Q Node_Number" is asked taking "Node_Number" as root to begin my BFS.
But complexity is quadratic. Is this the most optimal way of going about this problem ?
Here's an O(n^1.5 + n^0.5 q)-time algorithm via a sqrt decomposition. We need a constant-time distance oracle (this is basically least common ancestors). The idea is, every n^0.5 times a node is made special, perform a breadth-first search from all special nodes, which yields for each node in the tree the closest node that is currently special. On each query, take the closest of (i) the nodes that were special as of the last breadth-first search (ii) the at most n^0.5 newly special nodes.
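A rough sketch of this scheme, under some simplifying assumptions: nodes are numbered 1..n, the class and method names are invented for the example, and the distance function simply walks both nodes up to their common ancestor instead of using a constant-time oracle, so the stated bound is not reached by this code as written.

```python
from collections import deque
from math import isqrt


class NearestSpecial:
    def __init__(self, n, edges):
        self.n = n
        self.adj = [[] for _ in range(n + 1)]
        for u, v in edges:
            self.adj[u].append(v)
            self.adj[v].append(u)
        # Root the tree at node 1 to get parent/depth for the distance function.
        self.parent = [0] * (n + 1)
        self.depth = [0] * (n + 1)
        seen = [False] * (n + 1)
        seen[1] = True
        queue = deque([1])
        while queue:
            u = queue.popleft()
            for v in self.adj[u]:
                if not seen[v]:
                    seen[v] = True
                    self.parent[v] = u
                    self.depth[v] = self.depth[u] + 1
                    queue.append(v)
        self.block = max(1, isqrt(n))  # rebuild after this many updates
        self.special = [1]             # every special node so far
        self.pending = []              # special nodes added since last rebuild
        self._rebuild()

    def dist(self, u, v):
        """Tree distance by climbing to the common ancestor (not O(1))."""
        d = 0
        while self.depth[u] > self.depth[v]:
            u = self.parent[u]
            d += 1
        while self.depth[v] > self.depth[u]:
            v = self.parent[v]
            d += 1
        while u != v:
            u = self.parent[u]
            v = self.parent[v]
            d += 2
        return d

    def _rebuild(self):
        """Multi-source BFS from all special nodes."""
        self.near = [0] * (self.n + 1)
        self.near_dist = [-1] * (self.n + 1)
        for s in self.special:
            self.near[s] = s
            self.near_dist[s] = 0
        queue = deque(self.special)
        while queue:
            u = queue.popleft()
            for v in self.adj[u]:
                if self.near_dist[v] < 0:
                    self.near_dist[v] = self.near_dist[u] + 1
                    self.near[v] = self.near[u]
                    queue.append(v)
        self.pending = []

    def update(self, node):
        self.special.append(node)
        self.pending.append(node)
        if len(self.pending) >= self.block:
            self._rebuild()

    def query(self, node):
        # Best answer as of the last BFS, improved by the few newer nodes.
        best, best_d = self.near[node], self.near_dist[node]
        for s in self.pending:
            d = self.dist(node, s)
            if d < best_d:
                best, best_d = s, d
        return best


tree = NearestSpecial(5, [(1, 2), (2, 3), (3, 4), (4, 5)])  # path 1-2-3-4-5
print(tree.query(5))  # 1
tree.update(4)
print(tree.query(5))  # 4
```

Swapping dist for an LCA-based constant-time oracle (for example an Euler tour with a sparse table) is what would bring the query cost down to the n^0.5 per query mentioned above.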
As I mentioned in the comments, I expect that there's a very complicated O((n + q) log n)-time algorithm via top trees.

Generating suffix tree of string S[2..m] from suffix tree of string S[1..m]

Is there a fast (O(1) time complexity) way of generating a suffix tree of string S[2..m] from suffix tree of string S[1..m]?
I am familiar with Ukkonen's, so I know how to make fast suffix tree of string S[1..m+1] from suffix tree of string S[1..m], but I couldn't apply the algorithm for reverse situation.
Well, as @jogojapan says, to get the S[2..m] tree from the S[1..m] tree we need to:
Find the position-0 leaf L.
If L has more than one sibling, delete the pointer from L's parent to L
If L has exactly one sibling, change the pointer from L's grandparent to L's parent so it instead points to L's sibling.
@jogojapan further suggests that you keep a pointer to the deepest leaf in the tree. There are two problems with that: first, L isn't necessarily the deepest leaf in the tree, as Wikipedia's example shows; second, if you want to be able to output the same type of data structure as you received, then after removing L you need to find the new position-0 leaf, which will take O(m) time anyway.
(What you could do is construct an array of pointers to each leaf in O(m) time and count-sort them by position in another O(m) time. Then you'd be able to construct all the trees { S[t..m] : 1 <= t <= m } in constant amortized time each.)
Assuming you're not interested in amortized time though, let's prove what you ask is impossible.
We know any algorithm to modify the suffix tree of S[1..m] must start at the root: it can't start anywhere else because we know nothing about the underlying concrete data structure, and we don't know that the tree's nodes have parent pointers, so the only position the whole tree is accessible from is the root.
We also know that it must locate the position-0 leaf before it can hope to modify the data structure into the suffix tree for S[2..m]. To do this, it must obviously traverse every node between the root and the position-0 leaf.
Thing is, consider the suffix tree of a^m (the character a repeated m times): the length of the path is m-1. So any algorithm must visit at least m-1 nodes, and therefore takes Ω(m) time in the worst case, which rules out O(1).

Data structure supporting Add and Partial-Sum

Let A[1..n] be an array of real numbers. Design an algorithm to perform any sequence of the following operations:
Add(i,y) -- Add the value y to the ith number.
Partial-sum(i) -- Return the sum of the first i numbers, i.e. A[1] + A[2] + ... + A[i].
There are no insertions or deletions; the only change is to the values of the numbers. Each operation should take O(logn) steps. You may use one additional array of size n as a work space.
How to design a data structure for above algorithm?
Construct a balanced binary tree with n leaves; stick the elements along the bottom of the tree in their original order.
Augment each node in the tree with "sum of leaves of subtree"; a binary tree with n leaves has n-1 internal nodes, so this takes O(n) setup time (which we have).
Querying a partial-sum goes like this: Descend the tree towards the query (leaf) node, but whenever you descend right, add the subtree-sum on the left plus the element you just visited, since those elements are in the sum.
Modifying a value goes like this: Find the query (leaf) node. Calculate the difference you added. Travel to the root of the tree; as you travel to the root, update each node you visit by adding in the difference (you may need to visit adjacent nodes, depending on whether you're storing "sum of leaves of subtree" or "sum of left-subtree plus myself" or some variant); the main idea is that you appropriately update all the augmented branch data that needs updating, and that data will be on the root path or adjacent to it.
The two operations take O(log(n)) time (that's the height of a tree), and you do O(1) work at each node.
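One possible way to write this down, as a sketch only: the tree is stored in a flat array with the leaf count padded to a power of two, every node holds the "sum of leaves of subtree" variant, and indices are 1-based as in the question. The class name SumTree and its layout are assumptions for the example, and it uses more than n extra cells.

```python
class SumTree:
    def __init__(self, values):
        self.size = 1
        while self.size < len(values):
            self.size *= 2
        # Node 1 is the root, node k has children 2k and 2k+1,
        # and the leaves occupy positions size .. 2*size-1.
        self.tree = [0] * (2 * self.size)
        self.tree[self.size:self.size + len(values)] = values
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def add(self, i, y):
        """Add y to A[i] by updating every node on the leaf-to-root path."""
        node = self.size + i - 1
        while node >= 1:
            self.tree[node] += y
            node //= 2

    def partial_sum(self, i):
        """Return A[1] + ... + A[i] by descending from the root to leaf i."""
        if i <= 0:
            return 0
        node, total = 1, 0
        lo, hi = 1, self.size  # range of leaf indices below the current node
        while lo < hi:
            mid = (lo + hi) // 2
            if i <= mid:
                node, hi = 2 * node, mid
            else:
                total += self.tree[2 * node]  # whole left subtree is included
                node, lo = 2 * node + 1, mid + 1
        return total + self.tree[node]


t = SumTree([3, 1, 4, 1, 5])
print(t.partial_sum(3))  # 8
t.add(2, 10)
print(t.partial_sum(3))  # 18
```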
You can probably use any search tree (e.g. a self-balancing binary search tree might allow for insertions, others for quicker access) but I haven't thought that one through.
You may use a Fenwick tree (also known as a binary indexed tree).
See this question
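For reference, a minimal Fenwick tree sketch in Python, with 1-based indices as in the question. It uses a single extra array of about n cells, which fits the workspace constraint; the class and method names are just illustrative.

```python
class Fenwick:
    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)  # tree[0] is unused
        # O(n) build: push each finished cell into the next cell covering it.
        for i, v in enumerate(values, start=1):
            self.tree[i] += v
            parent = i + (i & -i)
            if parent <= self.n:
                self.tree[parent] += self.tree[i]

    def add(self, i, y):
        """Add y to A[i]."""
        while i <= self.n:
            self.tree[i] += y
            i += i & -i  # move to the next cell whose range contains i

    def partial_sum(self, i):
        """Return A[1] + ... + A[i]."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # drop the lowest set bit
        return total


f = Fenwick([3, 1, 4, 1, 5])
print(f.partial_sum(5))  # 14
f.add(2, 10)
print(f.partial_sum(2))  # 14
```

Both loops change one bit of i per step, so each operation touches O(log n) cells.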
