Checking if A is a part of binary tree B - algorithm

Let's say I have binary trees A and B and I want to know if A is a "part" of B. I am not only talking about subtrees. What I want to know is if B has all the nodes and edges that A does.
My thought was that since a tree is essentially a graph, I could view this question as a subgraph isomorphism problem (i.e. checking whether A is a subgraph of B). But according to Wikipedia this is an NP-complete problem.
http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem
I know that you can check if A is a subtree of B or not with O(n) algorithms (e.g. using preorder and inorder traversals to flatten the trees to strings and checking for substrings). I was trying to modify this a little to see if I can also test for just "parts" as well, but to no avail. This is where I'm stuck.
Are there any other ways to view this problem other than using subgraph isomorphism? I'm thinking there must be faster methods since binary trees are much more restricted and simpler versions of graphs.
Thanks in advance!
EDIT: I realized that even a brute-force method for my question would only take O(m * n) in the worst case, which is polynomial. So I guess this isn't an NP-complete problem after all. Then my next question is: is there an algorithm that is faster than O(m * n)?

I would approach this problem in two steps:
1. Find the root of A in B (using either BFS or DFS).
2. Verify that A is contained in B (given that starting node), using a recursive algorithm, as below. (I concocted some crazy pseudo-language, because you didn't specify the language. I think this should be understandable, no matter your background.) Note that a is a node from A (initially the root) and b is a node from B (initially the node found in step 1).
function checkTrees(node a, node b) returns boolean
    if a does not exist or b does not exist then
        // base of the recursion
        return false
    else if a is different from b then
        // compare the current nodes
        return false
    else
        // check the children of a
        boolean leftFound = true
        boolean rightFound = true
        if a.left exists then
            // try to match the left child of a with
            // every possible neighbor of b
            leftFound = checkTrees(a.left, b.left)
                     or checkTrees(a.left, b.right)
                     or checkTrees(a.left, b.parent)
        if a.right exists then
            // try to match the right child of a with
            // every possible neighbor of b
            rightFound = checkTrees(a.right, b.left)
                      or checkTrees(a.right, b.right)
                      or checkTrees(a.right, b.parent)
        return leftFound and rightFound
About the running time: let m be the number of nodes in A and n be the number of nodes in B. The search in the first step takes O(n) time. The running time of the second step depends on one crucial assumption I made, but that might be wrong: I assumed that every node of A is equal to at most one node of B. If that is the case, the running time of the second step is O(m) (because you can never search too far in the wrong direction). So the total running time would be O(m + n).
While writing down my assumption, I start to wonder whether that's not oversimplifying your case...

You could compare the trees bottom-up as follows:
1. For each leaf in tree A, identify the corresponding node in tree B.
2. Start a parallel traversal towards the root in both trees from the nodes just matched.
   Specifically, move to the parent of a node in A and subsequently move towards the root in B until you either encounter the corresponding node in B (proceed), or a marked node in A (see below; if a match in B is found proceed, else fail), or the root of B (fail).
3. Mark all nodes visited in A.
4. You succeed if you haven't failed ;-).
The main part of the algorithm runs in O(e_B): in the worst case, all edges in B are visited a constant number of times. The leaf node matching runs in O(n_A * log n_B) if the vertices of B are sorted; otherwise it takes O(n_A * log n_A + n_B * log n_B + n) = O(n_B * log n_B) (sort each node set, then scan the results linearly).
EDIT:
Re-reading your question, step 2 above is even easier: for matching nodes in A and B, their parents must match too (otherwise there would be a mismatch between the edge sets). This has no effect on the worst-case run time, of course.
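The observation that matching nodes must also have matching edges can be turned into a short check. The following is only a sketch under two assumptions that the answers above also make or hint at: node values are unique within each tree, and an edge is treated as an unordered parent/child pair (left versus right is ignored). The names Node, nodes_and_edges and is_part_of are made up for this illustration.

```python
class Node:
    """Hypothetical binary tree node with a unique value."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def nodes_and_edges(root):
    """Collect the set of node values and the set of unordered edges."""
    nodes, edges = set(), set()
    stack = [root] if root is not None else []
    while stack:
        node = stack.pop()
        nodes.add(node.value)
        for child in (node.left, node.right):
            if child is not None:
                edges.add(frozenset((node.value, child.value)))
                stack.append(child)
    return nodes, edges


def is_part_of(a_root, b_root):
    """True if every node and every edge of A also appears in B."""
    a_nodes, a_edges = nodes_and_edges(a_root)
    b_nodes, b_edges = nodes_and_edges(b_root)
    return a_nodes <= b_nodes and a_edges <= b_edges


# B:      1            A (a "part" of B, not a full subtree):  2
#        / \                                                     \
#       2   3                                                     5
#      / \
#     4   5
B = Node(1, Node(2, Node(4), Node(5)), Node(3))
A = Node(2, None, Node(5))
NOT_A_PART = Node(2, Node(3))  # edge 2-3 does not exist in B
```

With hash sets, both traversals and the subset tests take expected O(m + n) time, which matches the bound claimed in the first answer under the same uniqueness assumption.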

Related

Why is the number of sub-trees gained from a range tree query is O(log(n))?

I'm trying to figure out this data structure, but I don't understand how we can tell that there are O(log(n)) subtrees that represent the answer to a query.
Here is a picture for illustration:
Thanks!
If we make the assumption that the above is a purely functional binary tree, that is, one whose nodes are immutable, then we can make a "copy" of this tree such that only elements with a value larger than x1 and lower than x2 are in the tree.
Let us start with a very simple case to illustrate the point. Imagine that we simply do not have any bounds; then we can simply return the entire tree. So instead of constructing a new tree, we return a reference to the root of the tree. Without any bounds we can thus return a tree in O(1), given that the tree is not edited (at least not as long as we use the subtree).
The above case is of course quite simple. We simply make a "copy" (not really a copy since the data is immutable, we can just return the tree) of the entire tree. So let us aim to solve a more complex problem: we want to construct a tree that contains all elements larger than a threshold x1. Basically we can define a recursive algorithm for that:
the cut version of None (or whatever represents a null reference, or a reference to an empty tree) is None;
if the node has a value that is not greater than the threshold, we return a cut version of the right subtree; and
if the node has a value greater than the threshold, we return a node that has the same right subtree, and as left child the cut version of the left subtree.
So in pseudo-code it looks like:
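def treelarger(some_node, min):
    if some_node is None:
        return None
    if some_node.value > min:
        # keep this node and its whole right subtree; only the left side is cut
        return Node(treelarger(some_node.left, min), some_node.value, some_node.right)
    else:
        # this node and its left subtree are too small; continue on the right
        return treelarger(some_node.right, min)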
This algorithm thus runs in O(h), with h the height of the tree, since in each case (except the first one) we recurse on one (not both) of the children, and it ends when we reach a node without children (or at least one that does not have a subtree in the direction in which we need to cut).
We thus do not make a complete copy of the tree. We reuse a lot of nodes in the old tree. We only construct a new "surface", but most of the "volume" is part of the old binary tree. Although the tree itself contains O(n) nodes, we construct at most O(h) new nodes. We can optimize the above such that, if the cut version of one of the subtrees is the same as the original, we do not create a new node. But that does not even matter much in terms of time complexity: we generate at most O(h) new nodes, and the total number of nodes is either less than the original number, or the same.
In case of a complete tree, the height of the tree h scales with O(log n), and thus this algorithm will run in O(log n).
Then how can we generate a tree with elements between two thresholds? We can easily rewrite the above into an algorithm treesmaller that generates a subtree that contains all elements that are smaller:
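def treesmaller(some_node, max):
    if some_node is None:
        return None
    if some_node.value < max:
        # keep this node and its whole left subtree; only the right side is cut
        return Node(some_node.left, some_node.value, treesmaller(some_node.right, max))
    else:
        # this node and its right subtree are too large; continue on the left
        return treesmaller(some_node.left, max)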
so roughly speaking there are two differences:
we change the condition from some_node.value > min to some_node.value < max; and
we recurse on the right subchild in case the condition holds, and on the left if it does not hold.
Now the conclusions we draw from the previous algorithm are also conclusions that can be applied to this algorithm, since again it only introduces O(h) new nodes, and the total number of nodes can only decrease.
Although we can construct an algorithm that takes the two thresholds concurrently into account, we can simply reuse the above algorithms to construct a subtree containing only elements within range: we first pass the tree to the treelarger function, and then that result through a treesmaller (or vice versa).
Since in both algorithms, we introduce O(h) new nodes, and the height of the tree can not increase, we thus construct at most O(2 h) and thus O(h) new nodes.
Given the original tree was a complete tree, then it thus holds that we create O(log n) new nodes.
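To make the node sharing concrete, here is a runnable sketch of the two cuts composed into a range cut. It follows the pseudo-code above, but the names tree_larger, tree_smaller, tree_between, build and inorder, as well as the NamedTuple layout of Node, are assumptions made for this illustration; build simply constructs a balanced search tree from a sorted list.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node, in the (left, value, right) order used above."""
    left: Optional["Node"]
    value: int
    right: Optional["Node"]


def tree_larger(node, lo):
    """Tree of all values > lo; reuses untouched right subtrees."""
    if node is None:
        return None
    if node.value > lo:
        return Node(tree_larger(node.left, lo), node.value, node.right)
    return tree_larger(node.right, lo)


def tree_smaller(node, hi):
    """Tree of all values < hi; reuses untouched left subtrees."""
    if node is None:
        return None
    if node.value < hi:
        return Node(node.left, node.value, tree_smaller(node.right, hi))
    return tree_smaller(node.left, hi)


def tree_between(node, lo, hi):
    """Tree of all values strictly between lo and hi."""
    return tree_smaller(tree_larger(node, lo), hi)


def build(sorted_values):
    """Build a balanced binary search tree from a sorted list."""
    if not sorted_values:
        return None
    mid = len(sorted_values) // 2
    return Node(build(sorted_values[:mid]), sorted_values[mid],
                build(sorted_values[mid + 1:]))


def inorder(node):
    """Values of the tree in ascending order."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)


root = build(list(range(1, 16)))   # values 1..15, root value 8
cut = tree_between(root, 4, 10)    # keeps 5..9
```

Checking `tree_larger(root, 4).right is root.right` shows that the untouched right subtree is the very same object as in the original tree: only the nodes along one root-to-leaf path are rebuilt.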
Consider the search for the two endpoints of the range. This search will continue until finding the lowest common ancestor of the two leaf nodes that span your interval. At that point, the search branches with one part zigging left and one part zagging right. For now, let's just focus on the part of the query that branches to the left, since the logic is the same but reversed for the right branch.
In this search, it helps to think of each node as not representing a single point, but rather a range of points. The general procedure, then, is the following:
If the query range fully subsumes the range represented by this node, stop searching in x and begin searching the y-subtree of this node.
If the query range is purely in range represented by the right subtree of this node, continue the x search to the right and don't investigate the y-subtree.
If the query range overlaps the left subtree's range, then it must fully subsume the right subtree's range. So process the right subtree's y-subtree, then recursively explore the x-subtree to the left.
In all cases, we add at most one y-subtree in for consideration and then recursively continue exploring the x-subtree in only one direction. This means that we essentially trace out a path down the x-tree, adding in at most one y-subtree per step. Since the tree has height O(log n), the overall number of y-subtrees visited this way is O(log n). And then, including the number of y-subtrees visited in the case where we branched right at the top, we get another O(log n) subtrees for a total of O(log n) total subtrees to search.
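The "each node represents a range of points" view can be illustrated in one dimension. The sketch below is not the range tree itself: it assumes a balanced tree over the indices 0..n-1 in which each node covers an index interval and splits it in the middle, and it lists the nodes whose interval lies fully inside the query. In a real range tree these would be the nodes whose y-subtrees get searched. The function name canonical_ranges is made up for this example.

```python
def canonical_ranges(node_lo, node_hi, q_lo, q_hi):
    """Return the intervals of the tree nodes that are fully inside
    the query [q_lo, q_hi], found by walking down from [node_lo, node_hi]."""
    if q_hi < node_lo or node_hi < q_lo:
        return []                          # no overlap: prune this branch
    if q_lo <= node_lo and node_hi <= q_hi:
        return [(node_lo, node_hi)]        # fully subsumed: report, stop descending
    mid = (node_lo + node_hi) // 2         # partial overlap: look at both children
    return (canonical_ranges(node_lo, mid, q_lo, q_hi)
            + canonical_ranges(mid + 1, node_hi, q_lo, q_hi))


# 16 leaves (indices 0..15), query range [3, 12]
pieces = canonical_ranges(0, 15, 3, 12)
```

For this query the result is four intervals, (3, 3), (4, 7), (8, 11) and (12, 12), which together cover exactly 3..12. At most two nodes per level are ever partially overlapped (one for each end of the query), which is where the O(log n) bound on the number of reported subtrees comes from.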
Hope this helps!

linearizing a tree to an array and answering "sum" queries on paths

The question is motivated by the travtree problem in codechef. In the editorial they recommend linearizing the tree to an array by recording for each node its discovery and exit times in a DFS traversal. Now we can quickly answer queries about sum subtree - by summing events that happened in the segment [discovery time, exit time] of that node. (we are using a Fenwick tree to answer these queries fast).
HOWEVER, to solve that problem we also need to quickly answer sum path queries. That is - summing events that happened along the shortest path between a, b. How is that possible? The answer they give is this:
For each interesting event they update this:
update(BT2,event_node,1);
update(BT2,out[event_node],-1);
and the sum path(a,b) is now this:
int l = lca(a,b);
ans = query(BT2,a) + query(BT2,b) - query(BT2,l) - (l==1 ? 0 : query(BT2, parent[0][l]));
Where query is the prefix sum. How is that correct?? when you look at the prefix sum till a you might encounter lots of nodes which are irrelevant to the path between l and a!
In order to linearize a sum path query - sum of events that happened on the shortest path between tree nodes a, b we indeed have to do the following:
When an event happens in node v, we update(IN[v], 1) and update(OUT[v], -1). IN being the node's DFS discovery time and OUT the DFS exit time.
Now the query would be query(IN[b]) - query(IN[a]-1) (this form assumes that a is an ancestor of b; for two arbitrary nodes you go through their LCA, as in the formula from the question). The query(IN[b]) is a prefix sum: it starts from the root and follows the DFS until it reaches b. Note that every node v that we pass before reaching b, but that is not on the direct path from the root to b, is both discovered and exited, so its +1 and -1 cancel. Only the nodes on the path are discovered and not yet exited. Because of the way we updated, this means that we effectively sum the nodes on the path from the root to b (including b).
Now it's clear that the same happens in query(IN[a]-1): it is the sum of the nodes on the path from the root to a (not including a this time). Subtracting them gives us the path from a to b. Draw a sketch and you'll see it for yourself.
For completeness - the method for sum subtree is different both in update and in query. For each event you only update(IN[v]). Now for querying sum subtree(a) we do query(OUT[a]) - query(IN[a]-1). This time in query(OUT[a]) we sum all nodes we traversed until we discover a, and then all nodes in a's subtree until we exit it. Now we subtract query(IN[a] - 1) - all the nodes until we discover a. We're left exactly with only the a subtree.
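Both schemes can be tried out with a small Fenwick tree. This is a sketch, not the editorial's code: the class and function names are invented here, the tree is given as a dict from node to list of children, every node gets two distinct timestamps (one on discovery, one on exit), and path_sum assumes that a is an ancestor of b (or a == b).

```python
class Fenwick:
    """1-indexed binary indexed tree: point update, prefix-sum query."""
    def __init__(self, size):
        self.tree = [0] * (size + 1)

    def update(self, i, delta):
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def query(self, i):
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total


def euler_times(children, root):
    """Iterative DFS; the timer advances on both discovery and exit."""
    tin, tout, timer = {}, {}, 0
    stack = [(root, False)]
    while stack:
        node, exiting = stack.pop()
        timer += 1
        if exiting:
            tout[node] = timer
        else:
            tin[node] = timer
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return tin, tout


class TreeEventCounter:
    def __init__(self, children, root):
        self.tin, self.tout = euler_times(children, root)
        size = 2 * len(self.tin)
        self.path_bit = Fenwick(size)      # +1 at IN, -1 at OUT
        self.subtree_bit = Fenwick(size)   # +1 at IN only

    def add_event(self, v):
        self.path_bit.update(self.tin[v], 1)
        self.path_bit.update(self.tout[v], -1)
        self.subtree_bit.update(self.tin[v], 1)

    def path_sum(self, a, b):
        """Events on the path a..b, assuming a is an ancestor of b."""
        return (self.path_bit.query(self.tin[b])
                - self.path_bit.query(self.tin[a] - 1))

    def subtree_sum(self, a):
        """Events in the subtree rooted at a."""
        return (self.subtree_bit.query(self.tout[a])
                - self.subtree_bit.query(self.tin[a] - 1))


# Tree: 1 -> (2, 3), 2 -> (4, 5), 3 -> (6); events happen at nodes 2, 5 and 3.
counter = TreeEventCounter({1: [2, 3], 2: [4, 5], 3: [6]}, root=1)
for v in (2, 5, 3):
    counter.add_event(v)
```

Here `counter.path_sum(1, 5)` counts the events at 2 and 5, while `counter.path_sum(1, 6)` counts only the event at 3: the contributions of node 2's subtree have cancelled out by the time node 6 is discovered.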

Time cost analysis of trees

I am having trouble calculating the running time of the following algorithm on an arbitrary tree of size N.
Question is:
Consider the following algorithm,
which makes the following assumptions. x and y are the roots of two binary
trees, Tx and Ty. Left(z) is a pointer to the left child of node z in either
tree, and Right(z) points to the right child. If the node doesn't have a
left or right child, the pointer returns "NIL". Each node z also has a field
Size(z) which returns the number of nodes in the sub-tree rooted at z.
Size(NIL) is defined to be 0. The algorithm SameTree(x, y) returns a
boolean answer that says whether or not the trees rooted at x and y are
the same if you ignore the difference between left and right pointers.
Program: SameTree(x, y: Nodes): Boolean;
IF Size(x) ≠ Size(y) THEN return False; halt.
IF x = NIL THEN return True; halt.
IF (SameTree(Left(x), Left(y)) AND SameTree(Right(x), Right(y)))
OR (SameTree(Right(x), Left(y)) AND SameTree(Left(x), Right(y)))
THEN return True; halt.
Return False; halt.
Give the time analysis to run the above algorithm on any arbitrary tree of size N. I got O(nlog2^3) for dense graphs and O(n) for less dense graphs. Am I right? Can someone help me determine the time costs please?
Well, let's use the Master theorem. We shall consider the worst case, where line 4 checks the condition before the OR and then checks the condition after it on EACH recursive call.
We will also simplify by assuming the binary trees are more or less balanced (each node has almost the same number of nodes in each of its children's subtrees).
You have:
T(n) = 4*T(n/2)+2.
Look at http://en.wikipedia.org/wiki/Master_theorem to understand what I will do next:
We have case 1 from the Master theorem.
The log base 2 of 4 is 2, so the answer is O(n^2). This is the analysis for the general case. If you want a more precise analysis, you need to tell us much more about how likely your tree is to be balanced or unbalanced, and how likely it is to be built in such a way that line 4 evaluates both conditions in each recursive call.
Average cases are much more complicated.
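Spelled out, the Master theorem step for the recurrence above looks as follows. This only restates the answer's worst-case, balanced-tree assumptions (four recursive calls on halves, constant extra work); it says nothing about how often that worst case actually occurs.

```latex
\begin{align*}
T(n) &= a\,T(n/b) + f(n), \qquad a = 4,\quad b = 2,\quad f(n) = 2 = O(1) \\
n^{\log_b a} &= n^{\log_2 4} = n^{2} \\
f(n) &= O\!\left(n^{2-\varepsilon}\right) \text{ for any } 0 < \varepsilon \le 2
  \;\Longrightarrow\; T(n) = \Theta\!\left(n^{2}\right) \quad \text{(case 1)}
\end{align*}
```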

locating lowest common ancestor in AVL tree

I have an AVL tree and 2 keys in it. How do I find the lowest common ancestor (by lowest I mean height, not value) with O(log n) complexity?
I've seen an answer here on Stack Overflow, but I admit I didn't exactly understand it. It involved finding the routes from each key to the root and then comparing them. I'm not sure how this meets the complexity requirements.
For the first node you move up and mark the nodes. For the second node you move up and look if a node on the path is marked. As soon as you find a marked node you can stop. (And remove the marks by doing the first path again).
If you cannot mark nodes in the tree directly then modify the values contained to include a place where you can mark. If you cannot do this either then add a hashmap that stores which nodes are marked.
This is O(logn) because the tree is O(logn) deep and at worst you walk 3 times to the root.
Also, if you wish you can alternate steps of the two paths instead of first walking the first path completely. (Note that then both paths have to check for marks.) This might be better if you expect the two nodes to have their ancestor somewhat locally. The asymptotic runtime is the same as above.
A better solution for the AVL tree (a balanced binary search tree) is the following (I have used C-pointer-like notation):
1. Let K1 and K2 be the 2 keys for which the LCA is to be found. Assume K1 < K2.
2. Set a pointer P = root of the tree.
3. If P->key >= K1 and P->key <= K2: return P.
4. Else if P->key > K1 and P->key > K2: P = P->left.
5. Else P = P->right.
6. Repeat steps 3 to 5.
The returned P points to the required LCA.
Note that this approach works only for BST, not any other Binary tree.
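The steps above can be sketched in a few lines. The Node class and the name bst_lca are assumptions for this example; the function also assumes both keys are actually present in the tree, since it never checks for them.

```python
class Node:
    """Hypothetical binary search tree node."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def bst_lca(root, k1, k2):
    """Walk down from the root until the two keys fall on different
    sides of the current node (or one of them equals its key)."""
    lo, hi = min(k1, k2), max(k1, k2)
    node = root
    while node is not None:
        if hi < node.key:        # both keys are in the left subtree
            node = node.left
        elif lo > node.key:      # both keys are in the right subtree
            node = node.right
        else:                    # lo <= node.key <= hi: this is the LCA
            return node
    return None


#         8
#       /   \
#      4     12
#     / \   /  \
#    2   6 10  14
tree = Node(8,
            Node(4, Node(2), Node(6)),
            Node(12, Node(10), Node(14)))
```

The loop visits one node per level, so on an AVL tree it takes O(log n) steps and needs neither parent pointers nor marks.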

O(1) algorithm to determine if node is descendant of another node in a multiway tree?

Imagine the following tree:
      A
     / \
    B   C
   / \   \
  D   E   F
I'm looking for a way to query if for example F is a descendant of A (note: F doesn't need to be a direct descendant of A), which, in this particular case would be true. Only a limited amount of potential parent nodes need to be tested against a larger potential descendants node pool.
When testing whether a node is a descendant of a node in the potential parent pool, it needs to be tested against ALL potential parent nodes.
This is what I came up with:
Convert multiway tree to a trie, i.e. assign the following prefixes to every node in the above tree:
A = 1
B = 11
C = 12
D = 111
E = 112
F = 121
Then, reserve a bit array for every possible prefix size and add the parent nodes to be tested against, i.e. if C is added to the potential parent node pool, do:
  1     2     3    <- Prefix length
*[1]   [1]   ...
 [2]  *[2]   ...
 [3]   [3]   ...
 [4]   [4]   ...
 ...   ...
When testing if a node is a descendant of a potential parent node, take its trie prefix, lookup the first character in the first "prefix array" (see above) and if it is present, lookup the second prefix character in the second "prefix array" and so on, i.e. testing F leads to:
F = 1     2     1
  *[1]   [1]   ...
   [2]  *[2]   ...
   [3]   [3]   ...
   [4]   [4]   ...
   ...   ...
so yes, F is a descendant of C.
This test seems to be worst case O(n), where n = maximum prefix length = maximum tree depth, so its worst case is exactly equal to the obvious way of just going up the tree and comparing nodes. However, this performs much better if the tested node is near the bottom of the tree and the potential parent node is somewhere at the top. Combining both algorithms would mitigate both worst case scenarios. However, memory overhead is a concern.
Is there another way for doing that? Any pointers greatly appreciated!
Are your input trees always static? If so, then you can use a Lowest Common Ancestor algorithm to answer the is descendant question in O(1) time with an O(n) time/space construction. An LCA query is given two nodes and asked which is the lowest node in the tree whose subtree contains both nodes. Then you can answer the IsDescendent query with a single LCA query, if LCA(A, B) == A or LCA(A, B) == B, then one is the descendent of the other.
This Topcoder algorithm tutorial gives a thorough discussion of the problem and a few solutions at various levels of code complexity/efficiency.
I don't know if this would fit your problem, but one way to store hierarchies in databases, with quick "give me everything from this node and downwards" features is to store a "path".
For instance, for a tree that looks like this:
    +-- b
    |
a --+       +-- d
    |       |
    +-- c --+
            |
            +-- e
you would store the rows as follows, assuming the letter in the above tree is the "id" of each row:
id   path
a    a
b    a*b
c    a*c
d    a*c*d
e    a*c*e
To find all descendants of a particular node, you would do a "STARTSWITH" query on the path column, ie. all nodes with a path that starts with a*c*
To find out if a particular node is a descendant of another node, you would see if the longest path started with the shortest path.
So for instance:
e is a descendant of a since a*c*e starts with a
d is a descendant of c since a*c*d starts with a*c
Would that be useful in your instance?
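As a rough illustration of the stored-path idea outside a database, the check is a plain string prefix test. One detail here is an assumption that goes beyond the answer: the separator is appended to the ancestor's path before comparing, so that an id like "c" cannot accidentally match a longer id such as "cc". The names PATHS, is_descendant and descendants are made up for this sketch.

```python
SEP = "*"

# id -> stored path, as in the table above
PATHS = {
    "a": "a",
    "b": "a*b",
    "c": "a*c",
    "d": "a*c*d",
    "e": "a*c*e",
}


def is_descendant(paths, node, ancestor):
    """True if `node` lies strictly below `ancestor`."""
    return paths[node].startswith(paths[ancestor] + SEP)


def descendants(paths, ancestor):
    """All ids whose path starts with the ancestor's path."""
    prefix = paths[ancestor] + SEP
    return sorted(n for n, p in paths.items() if p.startswith(prefix))
```

The cost of one test is proportional to the length of the ancestor's path, so it depends on the depth of the tree rather than on the number of nodes.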
Traversing any tree will require "depth-of-tree" steps. Therefore if you maintain balanced tree structure it is provable that you will need O(log n) operations for your lookup operation. From what I understand your tree looks special and you can not maintain it in a balanced way, right? So O(n) will be possible. But this is bad during creation of the tree anyways, so you will probably die before you use the lookup anyway...
Depending on how often you will need that lookup operation compared to insert, you could decide to pay during insert to maintain an extra data structure. I would suggest hashing if you really need amortized O(1). On every insert operation you put all parents of a node into a hashtable. By your description this could be O(n) items on a given insert. If you do n inserts this sounds bad (towards O(n^2)), but actually your tree cannot degrade that badly, so you probably get an amortized overall hashtable size of O(n log n). (Actually, the log n part depends on the degree of degradation of your tree. If you expect it to be maximally degraded, don't do it.)
So, you would pay about O(log n) on every insert, and get hashtable efficiency O(1) for a lookup.
For a M-way tree, instead of your bit array, why not just store the binary "trie id" (using M bits per level) with each node? For your example (assuming M==2) : A=0b01, B=0b0101, C=0b1001, ...
Then you can do the test in O(1):
bool IsParent(node* child, node* parent)
{
    return ((child->id & parent->id) == parent->id);
}
You could compress the storage to ceil(lg2(M)) bits per level if you have a fast FindMSB() function which returns the position of the most significant bit set:
mask = (1 << (FindMSB(parent->id) + 1)) - 1;
return ((child->id & mask) == parent->id);  /* parentheses needed: == binds tighter than & in C */
In a pre-order traversal, every set of descendants is contiguous. For your example,
A B D E C F
+---------+ A
  +---+     B
    +       D
      +     E
        +-+ C
          + F
If you can preprocess, then all you need to do is number each node and compute the descendant interval.
If you can't preprocess, then a link/cut tree offers O(log n) performance for both updates and queries.
You can answer query of the form "Is node A a descendant of node B?" in constant time, by just using two auxiliary arrays.
Preprocess the tree, by visiting in Depth-First order, and for each node A store its starting and ending time in the visit in the two arrays Start[] and End[].
So, let us say that End[u] and Start[u] are respectively the ending and starting time of the visit of node u.
Then node u is a descendant of node v if and only if:
Start[v] <= Start[u] and End[u] <= End[v].
And you are done: checking this condition requires just two lookups in each of the arrays Start and End.
Take a look at the nested set model. It's very efficient for selects but slow to update.
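A minimal sketch of this preprocessing, using the tree from the question. The representation is an assumption: the tree is a dict from node to list of children, Start is the pre-order number of a node, and End is the largest pre-order number inside its subtree (so the interval [Start, End] is exactly the contiguous block from the pre-order picture). Under this definition a node also counts as a descendant of itself.

```python
def dfs_times(children, root):
    """Iterative DFS assigning Start (pre-order number) and End
    (largest pre-order number in the node's subtree)."""
    start, end = {}, {}
    timer = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            end[node] = timer          # last number handed out inside the subtree
            continue
        timer += 1
        start[node] = timer
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return start, end


def is_descendant(start, end, u, v):
    """True if u is in the subtree of v (constant time)."""
    return start[v] <= start[u] and end[u] <= end[v]


CHILDREN = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
START, END = dfs_times(CHILDREN, "A")
```

The preprocessing is a single O(n) traversal, after which every query is a pair of comparisons. The price is that the numbering has to be recomputed when the tree changes.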
For what it's worth, what you're asking for here is equivalent to testing if a class is a subtype of another class in a class hierarchy, and in implementations like CPython this is just done the good old fashioned "iterate the parents looking for the parent" way.

Resources