O(1) algorithm to determine if node is descendant of another node in a multiway tree? - algorithm

Imagine the following tree:
A
/ \
B C
/ \ \
D E F
I'm looking for a way to query if for example F is a descendant of A (note: F doesn't need to be a direct descendant of A), which, in this particular case would be true. Only a limited amount of potential parent nodes need to be tested against a larger potential descendants node pool.
When testing whether a node is a descendant of a node in the potential parent pool, it needs to be tested against ALL potential parent nodes.
This is what a came up with:
Convert multiway tree to a trie, i.e. assign the following prefixes to every node in the above tree:
A = 1
B = 11
C = 12
D = 111
E = 112
F = 121
Then, reserve a bit array for every possible prefix size and add the parent nodes to be tested against, i.e. if C is added to the potential parent node pool, do:
1 2 3 <- Prefix length
*[1] [1] ...
[2] *[2] ...
[3] [3] ...
[4] [4] ...
... ...
When testing if a node is a descendant of a potential parent node, take its trie prefix, lookup the first character in the first "prefix array" (see above) and if it is present, lookup the second prefix character in the second "prefix array" and so on, i.e. testing F leads to:
F = 1 2 1
*[1] [1] ...
[2] *[2] ...
[3] [3] ...
[4] [4] ...
... ...
so yes F, is a descendant of C.
This test seems to be worst case O(n), where n = maximum prefix length = maximum tree depth, so its worst case is exactly equal to the obvious way of just going up the tree and comparing nodes. However, this performs much better if the tested node is near the bottom of the tree and the potential parent node is somewhere at the top. Combining both algorithms would mitigate both worst case scenarios. However, memory overhead is a concern.
Is there another way for doing that? Any pointers greatly appreciated!

Are your input trees always static? If so, then you can use a Lowest Common Ancestor algorithm to answer the is descendant question in O(1) time with an O(n) time/space construction. An LCA query is given two nodes and asked which is the lowest node in the tree whose subtree contains both nodes. Then you can answer the IsDescendent query with a single LCA query, if LCA(A, B) == A or LCA(A, B) == B, then one is the descendent of the other.
This Topcoder algorithm tuorial gives a thorough discussion of the problem and a few solutions at various levels of code complexity/efficiency.

I don't know if this would fit your problem, but one way to store hierarchies in databases, with quick "give me everything from this node and downwards" features is to store a "path".
For instance, for a tree that looks like this:
+-- b
|
a --+ +-- d
| |
+-- c --+
|
+-- e
you would store the rows as follows, assuming the letter in the above tree is the "id" of each row:
id path
a a
b a*b
c a*c
d a*c*d
e a*c*e
To find all descendants of a particular node, you would do a "STARTSWITH" query on the path column, ie. all nodes with a path that starts with a*c*
To find out if a particular node is a descendant of another node, you would see if the longest path started with the shortest path.
So for instance:
e is a descendant of a since a*c*e starts with a
d is a descendant of c since a*c*d starts with a*c
Would that be useful in your instance?

Traversing any tree will require "depth-of-tree" steps. Therefore if you maintain balanced tree structure it is provable that you will need O(log n) operations for your lookup operation. From what I understand your tree looks special and you can not maintain it in a balanced way, right? So O(n) will be possible. But this is bad during creation of the tree anyways, so you will probably die before you use the lookup anyway...
Depending on how often you will need that lookup operation compared to insert, you could decide to pay during insert to maintain an extra data structure. I would suggest a hashing if you really need amortized O(1). On every insert operation you put all parents of a node into a hashtable. By your description this could be O(n) items on a given insert. If you do n inserts this sounds bad (towards O(n^2)), but actually your tree can not degrade that bad, so you probably get an amortized overall hastable size of O(n log n). (actually, the log n part depends on the degration-degree of your tree. If you expect it to be maximal degraed, don't do it.)
So, you would pay about O(log n) on every insert, and get hashtable efficiency O(1) for a lookup.

For a M-way tree, instead of your bit array, why not just store the binary "trie id" (using M bits per level) with each node? For your example (assuming M==2) : A=0b01, B=0b0101, C=0b1001, ...
Then you can do the test in O(1):
bool IsParent(node* child, node* parent)
{
return ((child->id & parent->id) == parent->id)
}
You could compress the storage to ceil(lg2(M)) bits per level if you have a fast FindMSB() function which returns the position of the most significant bit set:
mask = (1<<( FindMSB(parent->id)+1) ) -1;
retunr (child->id&mask == parent->id);

In a pre-order traversal, every set of descendants is contiguous. For your example,
A B D E C F
+---------+ A
+---+ B
+ D
+ E
+-+ C
+ F
If you can preprocess, then all you need to do is number each node and compute the descendant interval.
If you can't preprocess, then a link/cut tree offers O(log n) performance for both updates and queries.

You can answer query of the form "Is node A a descendant of node B?" in constant time, by just using two auxiliary arrays.
Preprocess the tree, by visiting in Depth-First order, and for each node A store its starting and ending time in the visit in the two arrays Start[] and End[].
So, let us say that End[u] and Start[u] are respectively the ending and starting time of the visit of node u.
Then node u is a descendant of node v if and only if:
Start[v] <= Start[u] and End[u] <= End[v].
and you are done, checking this condition requires just two lookup in the arrays Start and End

Take a look at Nested set model It's very effective to select but too slow to update

For what it's worth, what you're asking for here is equivalent to testing if a class is a subtype of another class in a class hierarchy, and in implementations like CPython this is just done the good old fashioned "iterate the parents looking for the parent" way.

Related

when exactly should root split in a B Tree

I learned B trees recently and from what I understand a node can have minimum t-1 keys and maximum 2t-1 keys given minimum degree t. Exception being root can have even 1 key.
Here is the example from CLRS 3rd edition Fig 18.7 (Page 498) where t=3
min keys = 3-1 = 2
max keys = 2*3-1 = 5
In the d) example when L is inserted why is the root splitted when it doesn't violate the B tree properties at the moment (It has 5 keys which is maximum allowed).
Why isn't inserting L into [J K L] without splitting [G M P T X] considered.
Should I always split the root when it reaches the maximum?
There are several variants of the insertion algorithm for B-trees. In this case the insertion algorithm is the "single pass down the tree" variant.
The background for this variant is given on page 493:
Since we cannot insert a key into a leaf node that is full, we introduce an operation that splits a full node 𝑦 (having 2𝑡 − 1 keys) around its median key 𝑦:key𝑡 into two nodes having only 𝑡 − 1 keys each. The median key moves up into 𝑦’s parent to identify the dividing point between the two new trees. But if 𝑦’s parent is also full, we must split it before we can insert the new key, and thus we could end up splitting full nodes all the way up the tree.
As with a binary search tree, we can insert a key into a B-tree in a single pass down the tree from the root to a leaf. To do so, we do not wait to find out whether we will actually need to split a full node in order to do the insertion. Instead, as we travel down the tree searching for the position where the new key belongs, we split each full node we come to along the way (including the leaf itself). Thus whenever we want to split a full node 𝑦, we are assured that its parent is not full.
In other words, this insertion algorithm will split a node earlier than might be strictly needed, in order to avoid to have to split nodes while backtracking out of recursion.
This algorithm is further described on page 495 with pseudo code.
This explains why at the insertion of L the root node is split immediately before any recursive call is made.
Alternative algorithms would not do this, and would delay the split up to the point when it is inevitable.

locating lowest common ancestor in AVL tree

I have an AVL tree and 2 keys in it. how do I find the lowest common ancestor (by lowest I mean hight, not value) with O(logn) complexity?
I've seen an answer here on stackoverflow, but I admit I didn't exactly understand it. it involved finding the routes from each key to the root and then comparing them. I'm not sure how this meets the complexity requirements
For the first node you move up and mark the nodes. For the second node you move up and look if a node on the path is marked. As soon as you find a marked node you can stop. (And remove the marks by doing the first path again).
If you cannot mark nodes in the tree directly then modify the values contained to include a place where you can mark. If you cannot do this either then add a hashmap that stores which nodes are marked.
This is O(logn) because the tree is O(logn) deep and at worst you walk 3 times to the root.
Also, if you wish you can alternate steps of the two paths instead of first walking the first path completely. (Note that then both paths have to check for marks.) This might be better if you expect the two nodes to have their ancestor somewhat locally. The asymptotic runtime is the same as above.
A better solution for the AVL tree (balanced binary search tree) is (I have used C pointers like notation)-
Let K1 and K2 be 2 keys, for which LCA is to be found. Assume K1 < K2
A pointer P = root of tree
If P->key >= K1 and P->key <= K2 : return P
Else if P->key > K1 and P->key > K2 : P = P->left
Else P = P->right
Repeat step 3 to 5
The returned P points to the required LCA.
Note that this approach works only for BST, not any other Binary tree.

Checking if A is a part of binary tree B

Let's say I have binary trees A and B and I want to know if A is a "part" of B. I am not only talking about subtrees. What I want to know is if B has all the nodes and edges that A does.
My thoughts were that since tree is essentially a graph, and I could view this question as a subgraph isomorphism problem (i.e. checking to see if A is a subgraph of B). But according to wikipedia this is an NP-complete problem.
http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem
I know that you can check if A is a subtree of B or not with O(n) algorithms (e.g. using preorder and inorder traversals to flatten the trees to strings and checking for substrings). I was trying to modify this a little to see if I can also test for just "parts" as well, but to no avail. This is where I'm stuck.
Are there any other ways to view this problem other than using subgraph isomorphism? I'm thinking there must be faster methods since binary trees are much more restricted and simpler versions of graphs.
Thanks in advance!
EDIT: I realized that the worst case for even a brute force method for my question would only take O(m * n), which is polynomial. So I guess this isn't a NP-complete problem after all. Then my next question is, is there an algorithm that is faster than O(m*n)?
I would approach this problem in two steps:
Find the root of A in B (either BFS of DFS)
Verify that A is contained in B (giving that starting node), using a recursive algorithm, as below (I concocted same crazy pseudo-language, because you didn't specify the language. I think this should be understandable, no matter your background). Note that a is a node from A (initially the root) and b is a node from B (initially the node found in step 1)
function checkTrees(node a, node b) returns boolean
if a does not exist or b does not exist then
// base of the recursion
return false
else if a is different from b then
// compare the current nodes
return false
else
// check the children of a
boolean leftFound = true
boolean rightFound = true
if a.left exists then
// try to match the left child of a with
// every possible neighbor of b
leftFound = checkTrees(a.left, b.left)
or checkTrees(a.left, b.right)
or checkTrees(a.left, b.parent)
if a.right exists then
// try to match the right child of a with
// every possible neighbor of b
leftFound = checkTrees(a.right, b.left)
or checkTrees(a.right, b.right)
or checkTrees(a.right, b.parent)
return leftFound and rightFound
About the running time: let m be the number of nodes in A and n be the number of nodes in B. The search in the first step takes O(n) time. The running time of the second step depends on one crucial assumption I made, but that might be wrong: I assumed that every node of A is equal to at most one node of B. If that is the case, the running time of the second step is O(m) (because you can never search too far in the wrong direction). So the total running time would be O(m + n).
While writing down my assumption, I start to wonder whether that's not oversimplifying your case...
you could compare the trees in bottom-up as follows:
for each leaf in tree A, identify the corresponding node in tree B.
start a parallel traversal towards the root in both trees from the nodes just matched.
specifically, move to the parent of a node in A and subsequently move towards the root in B until you either encounter the corresponding node in B (proceed) or a marked node in A (see below, if a match in B is found proceed, else fail) or the root of B (fail)
mark all nodes visited in A.
you succeed, if you haven't failed ;-).
the main part of the algorithm runs in O(e_B) - in the worst case, all edges in B are visited a constant number of times. the leaf node matching will run in O(n_A * log n_B) if there the B vertices are sorted, O(n_A * log n_A + n_B * log n_B + n) = O(n_B * log n_B) (sort each node set, lienarly scan the results thereafter) otherwise.
EDIT:
re-reading your question, abovementioned step 2 is even easier, as for matching nodes in A, B, their parents must match too (otheriwse there would be a mismatch between the edge sets). no effect on worst-case run time, of course.

Generating suffix tree of string S[2..m] from suffix tree of string S[1..m]

Is there a fast (O(1) time complexity) way of generating a suffix tree of string S[2..m] from suffix tree of string S[1..m]?
I am familiar with Ukkonen's, so I know how to make fast suffix tree of string S[1..m+1] from suffix tree of string S[1..m], but I couldn't apply the algorithm for reverse situation.
Well, as #jogojapan says, to get the S[2..m] tree from the S[1..m] tree we need to:
Find the position-0 leaf L.
If L has more than one sibling, delete the pointer from L's parent to L
If L has exactly one sibling, change the pointer from L's grandparent to L's parent so it instead points to L's sibling.
#jogojapan further suggests that you keep a pointer to the deepest leaf in the tree. There are two problems with that: L isn't necessarily the deepest leaf in the tree, as Wikipedia's example shows, and second if you want to be able to output the same type of data structure as you received, once removing L you need to find the new position-0 leaf, which will take O(m) time anyway.
(What you could do is construct an array of pointers to each leaf in O(m) time and count-sort them by position in another O(m) time. Then you'd be able to construct all the trees { S[t..n] : 1 <= t <= m } in constant amortized time each)
Assuming you're not interested in amortized time though, let's prove what you ask is impossible.
We know any algorithm to modify the suffix tree of S[1..m] must start at the root: it can't start anywhere else because we know nothing about the underlying concrete data structure, and we don't know that the tree's nodes have parent pointers, so the only position the whole tree is accessible from is the root.
We also know that it must locate the position-0 leaf before it can hope to modify the data structure into the suffix tree for S[2..m]. To do this, it must obviously traverse every node between the root and the position-0 leaf.
Thing is, consider the suffix tree of a^m (the character a repeated m times): the length of the path is m-1. So any algorithm must visit at least m-1 nodes, and therefore take O(m) time in the worst case.

Why in B-tree and B+_tree store from half-full to complete-full in each non-leaf node

I've just learn B-tree and B+-tree in DBMS.
I don't understand why a non-leaf node in tree has between [n/2] and n children, when n is fix for particular tree.
Why is that? and advantage of that?
Thanks !
This is the feature that makes the B+ and B-tree balanced, and due to it, we can easily compute the complexity of ops on the tree and bound it to O(logn) [where n is the number of elements in the data set].
If a node could have more then B sons, we could create a tree with depth 2: a root, and all other nodes will be leaves, from the root. searching for an element will be then O(n), and not the desired O(logn).
If a node could have less then B/2 sons, we could create a tree which is actually a linked list [n nodes, each with 1 son], with height n - and a search op will again be O(n) instead of O(logn)
Small currection: every non-leaf node - except the root, has B/2 to B children. the root alone is allowed to have less then B/2 sons.
The basic assumption of this structure is to have a fixed block size, this is why each internal block has n slots for indexing its children.
When there is a need to add a child to a block that is full (has exactly n children), the block is split into two blocks, which then replace the original block in its parent's index. The number of children in each of the two blocks is obviously n div 2 (assuming n is even). This is what the lower limit comes from.
If the parent is full, the operation repeats, potentially up to the root itself.
The split operation and allowing for n/2-filled blocks allows for most of the insertions/deletions to only cause local changes instead of re-balancing huge parts of the tree.

Resources