Counting p-cousins on a directed tree - algorithm

We're given a directed tree to work with. We define the concepts of p-ancestor and p-cousin as follows:
p-ancestor: A node is a 1-ancestor of another node if it is its parent. It is the p-ancestor of a node if it is the parent of that node's (p-1)-ancestor.
p-cousin: A node is a p-cousin of another node if the two share the same p-ancestor.
For example, consider the tree below.
Node 4 has three 1-cousins, i.e., 3, 4 and 5, since they all share the common 1-ancestor, which is 1.
For a particular tree, the problem is as follows. You are given multiple pairs of (node,p) and are supposed to count (and output) the number of p-cousins of the corresponding nodes.
A slow algorithm would be to crawl up to the p-ancestor and run a BFS for each node.
What is the (asymptotically) fastest way to solve the problem?

If an offline solution is acceptable, two depth-first searches can do the job.
Assume that we can index all of the n queries (node, p) from 0 to n - 1.
We can convert each query (node, p) into another type of query (ancestor, p) as follows:
The answer to query (node, p), where node is at level a (the distance from the root to this node is a), is the number of level-a descendants of its ancestor at level a - p. So, for each query, we first find that ancestor:
Pseudocode:
dfs(int node, int level, int[] path, int[] ancestorForQuery, List<Query>[] data){
    path[level] = node;
    // ... recurse into all children with level + 1 ...
    for(Query query : data[node])
        if(query.p <= level)
            ancestorForQuery[query.index] = path[level - query.p];
}
Now, after the first DFS, instead of the original queries, we have the new type of query (ancestor, p).
Assume that we have an array count, where index i stores the number of nodes of level i visited so far. For a node a at level x, we need to count its descendants at level x + p, so the result for this query is:
query result = count[x + p] after we finish visiting a's subtree - count[x + p] before we visit a
Pseudocode:
dfs2(int node, int level, int[] result, int[] count, List<TransformedQuery>[] data){
    count[level]++;
    for(TransformedQuery query : data[node]){
        result[query.index] -= count[level + query.p];
    }
    // ... recurse into all children with level + 1 ...
    for(TransformedQuery query : data[node]){
        result[query.index] += count[level + query.p];
    }
}
The result of each query is stored in the result array.

If p is fixed, I suggest the following algorithm:
Let count[v] be the number of p-children of v, i.e. the number of nodes that have v as their p-ancestor. Initially all count[v] are set to 0. Also let pparent[v] be the p-ancestor of v.
Now run a DFS on the tree and keep a stack of visited nodes, i.e. when we visit some v, we push it onto the stack, and once we leave v, we pop it.
Suppose we've come to some node v in our DFS and the stack currently holds at least p of its ancestors. Then do count[stack[size - p]]++, indicating that v is a p-child of stack[size - p], and also set pparent[v] = stack[size - p].
Once the DFS is finished, you can read off the desired number of p-cousins of v like this:
count[pparent[v]]
The complexity of this is O(n + m) for the DFS and O(1) for each query.
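Here is a minimal Java sketch of the fixed-p idea (the class, method and array names are mine, not from the answer). One DFS fills count[] and pparent[]; each query is then a single array lookup. As in the example above, a node is counted among its own p-cousins.

import java.util.*;

// Sketch only: children[v] lists the children of v, the tree is rooted somewhere,
// and path[] must be at least as long as the deepest level + 1.
class FixedPCousins {
    static void dfs(int v, int depth, int p, List<Integer>[] children,
                    int[] path, int[] count, int[] pparent) {
        path[depth] = v;
        if (depth >= p) {
            pparent[v] = path[depth - p];   // the p-ancestor of v
            count[pparent[v]]++;            // v is one of its p-ancestor's p-children
        } else {
            pparent[v] = -1;                // v has no p-ancestor
        }
        for (int c : children[v]) dfs(c, depth + 1, p, children, path, count, pparent);
    }

    // Number of p-cousins of v (v itself included, as in the definition above).
    static int pCousins(int v, int[] count, int[] pparent) {
        return pparent[v] == -1 ? 0 : count[pparent[v]];
    }
}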

First I'll describe a fairly simple way to answer each query in O(p) time that uses O(n) preprocessing time and space, and then mention a way that query times can be sped up to O(log p) time for a factor of just O(log n) extra preprocessing time and space.
O(p)-time query algorithm
The basic idea is that if we write out the sequence of nodes visited during a DFS traversal of the tree in such a way that every node is written out at a vertical position corresponding to its level in the tree, then the set of p-cousins of a node form a horizontal interval in this diagram. Note that this "writing out" looks very much like a typical tree diagram, except without lines connecting nodes, and (if a postorder traversal is used; preorder would be just as good) parent nodes always appearing to the right of their children. So given a query (v, p), what we will do is essentially:
Find the p-th ancestor u of the given node v. Naively this takes O(p) time.
Find the p-th left-descendant l of u -- that is, the node you reach after repeating the process of visiting the leftmost child of the current node, p times. Naively this takes O(p) time.
Find the p-th right-descendant r of u (defined similarly). Naively this takes O(p) time.
Return the value x[r] - x[l] + 1, where x[i] is a precalculated value that records the number of nodes in the sequence described above that are at the same level as, and at or to the left of, node i. This takes constant time.
The preprocessing step is where we calculate x[i], for each 1 <= i <= n. This is accomplished by performing a DFS that builds up a second array y[] that records the number y[d] of nodes visited so far at depth d. Specifically, y[d] is initially 0 for each d; during the DFS, when we visit a node v at depth d, we simply increment y[d] and then set x[v] = y[d].
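Below is a small Java sketch of this scheme following the four steps above (the Node layout and names are mine, not from the answer). Note that, exactly as described, steps 2 and 3 assume the leftmost/rightmost child chains of u really reach down p levels.

import java.util.*;

// Sketch only. Any DFS that visits children left to right numbers each level
// left to right, which is all that x needs.
class PCousinsOP {
    static class Node {
        Node parent;
        List<Node> children = new ArrayList<>();   // in left-to-right order
        int x;                                     // same-level nodes at or to the left of this one
    }

    // Preprocessing: y[d] counts nodes seen so far at depth d.
    static void preprocess(Node v, int depth, int[] y) {
        y[depth]++;
        v.x = y[depth];
        for (Node c : v.children) preprocess(c, depth + 1, y);
    }

    // Number of p-cousins of v (v itself included), O(p) time.
    static int query(Node v, int p) {
        Node u = v;
        for (int i = 0; i < p; i++) {               // step 1: p-th ancestor
            if (u.parent == null) return 0;         // v has no p-ancestor
            u = u.parent;
        }
        Node l = u, r = u;
        for (int i = 0; i < p; i++) l = l.children.get(0);                       // step 2
        for (int i = 0; i < p; i++) r = r.children.get(r.children.size() - 1);   // step 3
        return r.x - l.x + 1;                                                    // step 4
    }
}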
O(log p)-time query algorithm
The above algorithm should already be fast enough if the tree is fairly balanced -- but in the worst case, when each node has just a single child, O(p) = O(n). Notice that it is the navigation up and down the tree in the first 3 of the above 4 steps that forces the O(p) time -- the last step takes constant time.
To fix this, we can add some extra pointers to make navigating up and down the tree faster. A simple and flexible way uses "pointer doubling": For each node v, we will store log2(depth(v)) pointers to successively higher ancestors. To populate these pointers, we perform log2(maxDepth) DFS iterations, where on the i-th iteration we set each node v's i-th ancestor pointer to its (i-1)-th ancestor's (i-1)-th ancestor: this takes just two pointer lookups per node per DFS. With these pointers, moving any distance p up the tree always takes at most log(p) jumps, because the distance can be reduced by at least half on each jump. The exact same procedure can be used to populate corresponding lists of pointers for "left-descendants" and "right-descendants" to speed up steps 2 and 3, respectively, to O(log p) time.
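For reference, here is a hedged Java sketch of the pointer-doubling ("binary lifting") table for the upward direction. The answer above populates the pointers with repeated DFS passes; this sketch fills the same table from a plain parent[] array, which is equivalent. The analogous tables for left- and right-descendants are built the same way.

// Sketch only: up[k][v] is the 2^k-th ancestor of v, or -1 if it does not exist.
class BinaryLifting {
    final int LOG;
    final int[][] up;

    BinaryLifting(int[] parent, int n) {            // parent[root] == -1
        LOG = 32 - Integer.numberOfLeadingZeros(Math.max(1, n));
        up = new int[LOG][];
        up[0] = parent.clone();
        for (int k = 1; k < LOG; k++) {
            up[k] = new int[n];
            for (int v = 0; v < n; v++)
                up[k][v] = up[k - 1][v] == -1 ? -1 : up[k - 1][up[k - 1][v]];
        }
    }

    // p-th ancestor of v in O(log p) jumps, or -1 if it does not exist.
    int ancestor(int v, int p) {
        for (int k = 0; k < LOG && v != -1; k++)
            if (((p >> k) & 1) == 1)
                v = up[k][v];
        return v;
    }
}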

Related

What is an Efficient Algorithm to find Graph Centroid?

A graph centroid is a vertex such that, when it is removed, each of the remaining connected components has size less than or equal to N/2, where N is the number of vertices?! [Needs Correction?!]
Here's a problem at CodeForces that asks to find whether each vertex is a centroid, but after removing and replacing exactly one edge at a time.
Problem Statement
I need help to refine this PseudoCode / Algorithm.
Loop over all vertices v:
    Loop over all edges (a, b):
        Remove edge (a, b) and try re-inserting it in every empty edge position between two unconnected nodes
        Count the size of each connected component (*1).
        If all sizes are less than or equal to N/2,
            Then return true for v
The problem is that this algorithm will run in at least O(N*M^2) time. That's not acceptable.
I looked up the answers, but I couldn't come up with the high level abstraction of the algorithm used by others. Could you please help me understand how these solutions work?
Solutions' Link
(*1) DFS Loop
I will try to describe a not-so-complex algorithm for solving this problem in linear time; for future reference see my code (it has some comments).
The main idea is that you can root the tree T at an arbitrary vertex and traverse it; for each vertex V you can do this:
Cut subtree V from T.
Find the heaviest vertex H having size <= N/2 (H can be in either T or subtree V).
Move subtree H to become a child of vertex V.
Re-root T at V and check whether the heaviest vertex now has size <= N/2.
The previous algorithm can be implemented carefully to get linear time complexity; the issue is that it has a lot of cases to handle.
A better idea is to find the centroid C of T and root T at vertex C.
Having vertex C as the root of T is useful because it guarantees that every descendant of C has size <= N/2.
When traversing the tree we can then avoid checking for the heaviest vertex down the tree and only look up: every time we visit a child W, we can pass down the heaviest size (being <= N/2) that would lie above W if we re-rooted T at W.
Try to understand what I explained; let me know if something is not clear.
Well, the centroid of a tree can be determined in O(N) space and time complexity.
Construct an adjacency structure representing the tree, with the row indexes representing the N nodes and the elements in the i-th row representing the nodes to which the i-th node is connected. Any other representation works as well.
Maintain 2 linear arrays of size N, with index i representing the depth of the i-th node (depth) and the parent of the i-th node (parent), respectively.
Also maintain 2 more linear arrays: the first one containing the BFS traversal sequence of the tree (queue), and the other one (leftOver) containing the value [N - number of nodes in the subtree rooted at that node]. In other words, the i-th index contains the number of nodes that are left in the whole tree when the i-th node is removed from the tree along with all its children.
Now, perform a BFS traversal taking any arbitrary node as root and fill the arrays parent and depth. This takes O(N) time. Also, record the traversal sequence in the array queue.
Starting from the leaf nodes, add the number of nodes present in the subtree rooted at each node to the value at its parent's index in the array leftOver. This also takes O(N) time, since you can use the already prepared queue array and traverse it backwards.
Finally, traverse through the array leftOver and modify each value to [N - 1 - initial value]. The leftOver array is now prepared. Cost: another O(N).
Your work is almost done. Now iterate over this leftOver array and find the index whose value is closest to floor(N/2); however, this value must not exceed floor(N/2) at any cost.
This index is the index of the centroid of the tree. Overall time complexity: O(N).
Java Code:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.Scanner;

class Find_Centroid
{
    static final int MAXN = 100_005;
    static ArrayList<Integer>[] graph;
    static int[] depth, parent;     // Step 2
    static int N;
    static Scanner io = new Scanner(System.in);

    public static void main(String[] args)
    {
        int i;
        N = io.nextInt();           // Number of nodes in the Tree
        graph = new ArrayList[N];
        for (i = 0; i < graph.length; ++i)
            graph[i] = new ArrayList<>();
        // Initialisation
        for (i = 1; i < N; ++i)
        {
            int a = io.nextInt() - 1, b = io.nextInt() - 1;   // Assuming 1-based indexing
            graph[a].add(b);
            graph[b].add(a);        // Step 1
        }
        int centroid = findCentroid(new java.util.Random().nextInt(N));
        // Arbitrary indeed... ;)
        System.out.println("Centroid: " + (centroid + 1));    // '+1' for output in 1-based index
    }

    static int[] queue = new int[MAXN], leftOver;   // Step 3

    static int findCentroid(int r)
    {
        leftOver = new int[N];
        int i, target = N / 2, ach = -1;
        bfs(r);                     // Step 4
        for (i = N - 1; i >= 0; --i)
            if (queue[i] != r)
                leftOver[parent[queue[i]]] += leftOver[queue[i]] + 1;   // Step 5
        for (i = 0; i < N; ++i)
            leftOver[i] = N - 1 - leftOver[i];       // Step 6
        for (i = 0; i < N; ++i)
            if (leftOver[i] <= target && leftOver[i] > ach)
            {   // Closest to target(=N/2) but does not exceed it.
                r = i;
                ach = leftOver[i];
            }
        // Step 7
        return r;
    }

    static void bfs(int root)       // Iterative
    {
        parent = new int[N];
        depth = new int[N];
        int st = 0, end = 0;
        parent[root] = -1;          // Parent of root is obviously undefined. Hence -1.
        depth[root] = 1;            // Assuming depth of root = 1
        queue[end++] = root;
        while (st < end)
        {
            int node = queue[st++], h = depth[node] + 1;
            Iterator<Integer> itr = graph[node].iterator();
            while (itr.hasNext())
            {
                int ch = itr.next();
                if (depth[ch] > 0)  // 'ch' is parent of 'node'
                    continue;
                depth[ch] = h;
                parent[ch] = node;
                queue[end++] = ch;  // Recording the Traversal sequence
            }
        }
    }
}
Now, for the problem http://codeforces.com/contest/709/problem/E, iterate over each node i, consider it as the root, keep descending down the child which has > N/2 nodes, and try to arrive at a node which has just under N/2 nodes (closest to N/2 nodes) under it. If removing this node along with all its children makes i the centroid, print 1; otherwise print 0. This process can be carried out efficiently, as the leftOver array is already there for you.
Actually, you are detaching the disturbing node (the node which is preventing i from being the centroid) along with its children and attaching it to the i-th node itself. The detached subtree is guaranteed to have at most N/2 nodes (as checked earlier) and so won't cause a problem now.
Happy Coding..... :)

Segment tree time complexity analysis

How can we prove that the update and query operations on a segment tree (http://letuskode.blogspot.in/2013/01/segtrees.html) (not to be confused with an interval tree) are O(log n)?
I thought of a way which goes like this - At every node, we make at most two recursive calls on the left and right sub-trees. If we could prove that one of these calls terminates fairly quickly, the time complexity would be logarithmically bounded. But how do we prove this?
Lemma: at most 2 nodes are used at each level of the tree (a level is the set of nodes at a fixed distance from the root).
Proof: Assume that at level h at least 3 nodes were used (call them L, M and R, from left to right). Then the entire interval from the left bound of L to the right bound of R lies inside the query range. That's why M is fully covered by its parent (call it UP) at level h - 1, and UP itself lies fully inside the query range. But this implies that M could not be visited at all, because the traversal would stop at UP or higher. Here are some pictures to clarify this step of the proof:
h - 1:      UP          UP          UP
            /\          /\          /\
h:       L  M  R     L  M  R     L  M  R
That's why at most two nodes at each level are used. There are only log N levels in a segment tree, so at most 2 * log N nodes are used in total.
The claim is that there are at most 2 nodes which are expanded at each level. We will prove this by contradiction.
Consider the segment tree given below.
Let's say that there are 3 nodes that are expanded at some level. This means that the query range stretches from the leftmost colored node to the rightmost colored node. But notice that if the range extends to the rightmost node, then the full range of the middle node is covered. Thus, that node will immediately return its value and won't be expanded. So we have shown that at each level we expand at most 2 nodes, and since there are log n levels, the number of nodes that are expanded is 2 * log n = Θ(log n).
If we prove that there are at most a constant number of nodes to visit on each level, then, knowing that a binary segment tree has height at most log N, we can say that the query operation has O(log N) complexity. Other answers tell you that at most 2 nodes are visited on each level, but I claim that at most 4 nodes are visited per level. You can find the same statement, without proof, in other sources such as GeeksforGeeks.
Other answers show you too small a segment tree. Consider this example: a segment tree with 16 leaf nodes, indexes starting from zero, where you are looking for the range [0-14].
See the example: the crossed nodes are the nodes that we visit.
At each level L of the tree there will be at most 2 nodes which can have a partial overlap with the query range. (If you are unable to prove why, please say so.)
So, at level (L+1) we have to explore at most 4 nodes, and the total height/number of levels in the tree is O(log(N)) (where N is the number of nodes). Hence the time complexity is O(4*log(N)) ~ O(log(N)).
PS: Please refer to the diagram attached by @Oleksandr Papchenko to get a better understanding.
I will try to give a simple mathematical explanation.
Look at the code below, as per the segment tree implementation of range_query:
int query(int node, int st, int end, int l, int r)
{
    /* if range lies inside the query range */
    if(l <= st && end <= r)
    {
        return tree[node];
    }
    /* if range is totally outside the query range */
    if(st > r || end < l)
        return INT_MAX;
    /* if query range intersects both the children */
    int mid = (st + end) / 2;
    int ans1 = query(2*node, st, mid, l, r);
    int ans2 = query(2*node + 1, mid + 1, end, l, r);
    return min(ans1, ans2);
}
You go left and right, and whenever a node's range lies fully inside the query range you return its value.
So at each level at most 2 nodes are really expanded; call them the leftmost and the rightmost ones. If some other node in between (call it the middle one) were selected as well, then its whole range would lie between those two and hence inside the query range, so the traversal would have returned a value at that node (or at an ancestor) instead of expanding it further. Thus,
for a segment tree with log N levels:
Search at each level = 2
Total search = (search at each level) * (number of levels) = 2 log N
Therefore search complexity = O(2 log N) ~ O(log N).
P.S. For space complexity, see https://codeforces.com/blog/entry/49939
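If you want to convince yourself empirically, the following small Java program (mine, not from any of the answers above) mirrors the query code and simply counts how many nodes one range-minimum query touches; the count stays within a small constant times log N.

// Sketch only: a standard min segment tree plus a visit counter.
class SegTreeVisitCount {
    final int n;
    final int[] tree;     // 1-based implicit tree; 4n is a safe size bound
    int visited;          // nodes touched by the last query

    SegTreeVisitCount(int[] a) {
        n = a.length;
        tree = new int[4 * n];
        build(1, 0, n - 1, a);
    }

    void build(int node, int st, int end, int[] a) {
        if (st == end) { tree[node] = a[st]; return; }
        int mid = (st + end) / 2;
        build(2 * node, st, mid, a);
        build(2 * node + 1, mid + 1, end, a);
        tree[node] = Math.min(tree[2 * node], tree[2 * node + 1]);
    }

    int query(int node, int st, int end, int l, int r) {
        visited++;
        if (l <= st && end <= r) return tree[node];        // fully inside
        if (st > r || end < l) return Integer.MAX_VALUE;   // fully outside
        int mid = (st + end) / 2;
        return Math.min(query(2 * node, st, mid, l, r),
                        query(2 * node + 1, mid + 1, end, l, r));
    }

    public static void main(String[] args) {
        int n = 1 << 16;
        SegTreeVisitCount s = new SegTreeVisitCount(new int[n]);
        s.visited = 0;
        s.query(1, 0, n - 1, 1, n - 2);   // a near-worst-case range
        System.out.println("visited " + s.visited + " nodes; log2(n) = 16");
    }
}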

Time analysis of a Binary Search Tree in-order traversal algorithm

Below is an iterative algorithm to traverse a Binary Search Tree in in-order fashion (first the left child, then the parent, finally the right child) without using a stack:
(Idea: the whole idea is to find the left-most node of the tree and then repeatedly find the successor of the node at hand and print its value, until there are no more nodes left.)
void In-Order-Traverse(Node root){
    Node current = Min-Tree(root); // start from the left-most node
    while (current != null){
        print-on-screen(current.key);
        current = Successor(current);
    }
    return;
}

Node Min-Tree(Node root){ // find the leftmost node
    Node current = root;
    while (current.leftChild != null)
        current = current.leftChild;
    return current;
}

Node Successor(Node root){
    if (root.rightChild != null) // if root has a right child, find the leftmost node of the right sub-tree
        return Min-Tree(root.rightChild);
    else{
        Node current = root;
        while (current.parent != null && current.parent.leftChild != current)
            current = current.parent; // climb while current is a right child
        return current.parent;
    }
}
It's been claimed that the time complexity of this algorithm is Theta(n), assuming there are n nodes in the BST, which is certainly correct. However, I cannot convince myself of this, as I suspect some of the nodes are traversed more than a constant number of times (depending on the number of nodes in their sub-trees), and summing up all of these visits wouldn't obviously yield a time complexity of Theta(n).
Any idea or intuition on how to prove it ?
It is easier to reason with edges rather than nodes. Let us reason based on the code of Successor function.
Case 1 (then branch)
For all nodes with a right child, we will visit the right subtree once (a "right-turn" edge), then always descend into the left subtree ("left-turn" edges) via the Min-Tree function. We can prove that such a traversal creates a path whose edges are unique - the edges will not be repeated in any traversal made from any other node with a right child, since the traversal ensures that you never follow the "right-turn" edge of any other node on the tree. (Proof by construction.)
Case 2 (else branch)
For all nodes without a right child (else branch), we visit the ancestors by following "right-turn" edges until we have to make a "left-turn" edge or we encounter the root of the binary tree. Again, the edges in the path generated are unique - they will never be repeated in any other traversal made from any other node without a right child. This is because:
Except for the starting node and the node reached by following the final "left-turn" edge, all other nodes in between have a right child (which means they are excluded from the else branch). The starting node, of course, does not have a right child.
Each node has a unique parent (only the root node does not have a parent), and the path to a parent is either a "left-turn" or a "right-turn" (the node is a left child or a right child). Given any node (ignoring the right-child condition), there is only one path that creates the pattern: many "right-turns" then a "left-turn".
Since the nodes in between have a right child, there is no way for an edge to appear in 2 traversals starting at different nodes. (Recall that we are currently considering nodes without a right child.)
(The proof here is quite hand-waving, but I think it can be formally proven by contradiction).
Since the edges are unique, the total number of edges traversed in case 1 only (or case 2 only) will be O(n) (since the number of edges in a tree is equal to the number of vertices - 1). Therefore, after summing the 2 cases up, In-Order Traversal will be O(n).
Note that I only know each edge is visited at most once - I don't know whether all edges are visited or not from the proof, but the number of edges is bounded by the number of vertices, which is just right.
We can easily see that it is also Omega(n) (each node is visited once), so we can conclude that it is Theta(n).
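As a quick empirical check of this edge-counting argument (my addition, not part of the answer), the following Java sketch runs the stack-free traversal on a random BST and counts every parent/child link it walks; the total stays at about 2*(n - 1).

import java.util.Random;

// Sketch only: build a BST with parent pointers, then traverse it with the
// successor-based walk from the question while counting edge traversals.
class InOrderCost {
    static class Node {
        int key; Node left, right, parent;
        Node(int key) { this.key = key; }
    }

    static Node insert(Node root, int key) {
        if (root == null) return new Node(key);
        if (key < root.key) { root.left = insert(root.left, key); root.left.parent = root; }
        else                { root.right = insert(root.right, key); root.right.parent = root; }
        return root;
    }

    static long steps = 0;   // every edge we walk over, up or down

    static Node minTree(Node n) {
        while (n.left != null) { n = n.left; steps++; }
        return n;
    }

    static Node successor(Node n) {
        if (n.right != null) { steps++; return minTree(n.right); }
        while (n.parent != null && n.parent.left != n) { n = n.parent; steps++; }
        steps++;             // the final left-turn edge up (or one phantom step off the root)
        return n.parent;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        Node root = null;
        int n = 100_000;
        for (int i = 0; i < n; i++) root = insert(root, rnd.nextInt());
        for (Node cur = minTree(root); cur != null; cur = successor(cur)) { /* "print" cur.key */ }
        System.out.println("edges walked: " + steps + ", 2*(n - 1) = " + 2L * (n - 1));
    }
}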
The given program runs in Θ(N) time. Θ(N) doesn't mean that each node is visited exactly once. Remember there is a constant factor. So Θ(N) could actually be bounded by 5 N or 10 N or even 1000 N. So as such it doesn't give you an exact count of the number of times a node is visited.
The Time complexity of in-order iterative traversal of Binary Search Tree can be analyzed as follows,
Consider a Tree with N nodes,
Let the execution time be denoted by the complexity function T(N).
Let the left sub tree and right sub tree contain X and N-X-1 nodes respectively,
Then the time complexity T(N) = T(X) + T(N-X-1) + c,
Now consider the two extreme cases of a BST,
CASE 1: A BST which is perfectly balanced, i.e. both the sub trees have equal number of nodes. For example consider the BST shown below,
        10
       /  \
      5    14
     / \   / \
    1   6 11  16
For such a Tree the complexity function is,
T(N) = 2 T(⌊N/2⌋) + c
Master Theorem gives us a complexity of Θ(N) in this case.
CASE 2: A fully unbalanced BST, i.e. either the left sub tree or the right sub tree is empty. Therefore X = 0. For example, consider the BST shown below,
          10
         /
        9
       /
      8
     /
    7
Now T(N) = T(0) + T(N-1) + c,
T(N) = T(N-1) + c
T(N) = T(N-2) + c + c
T(N) = T(N-3) + c + c + c
.
.
.
T(N) = T(0) + N c
Since T(0) = K, where K is a constant,
T(N) = K + N c
Therefore T(N) = Θ(N).
Thus the complexity is Θ(N) for all the cases.
We focus on edges instead of nodes.
( to have a better intuition look at this picture : http://i.stack.imgur.com/WlK5O.png)
We claim that in this algorithm every edge is visited at most twice, (actually it's visited exactly twice);
The first time is when it's traversed downward, and the second time is when it's traversed upward.
To visit an edge more than twice, we would have to traverse that edge downward again: down, up, down, ...
We prove that it's not possible to have a second downward visit of an edge.
Let's assume that we traverse an edge (u, v) downward for the second time; this means that one of the ancestors of u has a successor which is a descendant of u.
This is not possible :
We know that when we are traversing an edge upward, we are looking for a left-turn edge to find a successor. So while u lies on the left side of that successor, the successor of this successor lies on its right side; by moving to the right side of a successor (to find its successor), reaching u again, and therefore edge (u, v) again, is impossible. (To find a successor we only ever move to the right or up, never to the left.)

How to find largest common sub-tree in the given two binary search trees?

Two BSTs (Binary Search Trees) are given. How does one find the largest common sub-tree of the two given binary trees?
EDIT 1:
Here is what I have thought:
Let r1 = current node of the 1st tree,
    r2 = current node of the 2nd tree.
These are the cases I think we need to consider:
Case 1: r1.data < r2.data
    2 subproblems to solve:
    - first, check r1 and r2.left
    - second, check r1.right and r2
Case 2: r1.data > r2.data
    2 subproblems to solve:
    - first, check r1.left and r2
    - second, check r1 and r2.right
Case 3: r1.data == r2.data
    Again, 2 cases to consider here:
    (a) the current node is part of the largest common BST:
        compute the size of the common subtree rooted at r1 and r2
    (b) the current node is NOT part of the largest common BST:
        2 subproblems to solve:
        - first, solve r1.left and r2.left
        - second, solve r1.right and r2.right
I can think of the cases we need to check, but I am not able to code it as of now. And it is NOT a homework problem. Does it look like one?
Just hash the children and key of each node and look for duplicates. This would give a linear expected time algorithm. For example, see the following pseudocode, which assumes that there are no hash collisions (dealing with collisions would be straightforward):
ret = -1
// T is a tree node, H is a hash set, and first is a boolean flag
hashTree(T, H, first):
    if (T is null):
        return 0 // leaf case
    h = hash(hashTree(T.left, H, first), hashTree(T.right, H, first), T.key)
    if (first):
        // store hashes of T1's nodes in the set H
        H.insert(h)
    else:
        // check for hashes of T2's nodes in the set H containing T1's hashes
        if H.contains(h):
            ret = max(ret, size(T)) // size is recursive and memoized to get O(n) total time
    return h

H = {}
hashTree(T1, H, true)
hashTree(T2, H, false)
return ret
Note that this is assuming the standard definition of a subtree of a BST, namely that a subtree consists of a node and all of its descendants.
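A small Java rendering of the same idea (the names are mine; like the pseudocode, it ignores hash collisions, which a real implementation should verify before trusting a match):

import java.util.*;

// Sketch only: store a structural hash of every subtree of T1, then scan T2 for
// matching hashes and remember the size of the largest match.
class LargestCommonSubtreeHash {
    static class Node { int key; Node left, right; Node(int k) { key = k; } }

    static int best = -1;

    // Returns {hash, size} of the subtree rooted at t.
    static long[] walk(Node t, Map<Long, Integer> seen, boolean first) {
        if (t == null) return new long[]{17, 0};                       // empty tree
        long[] l = walk(t.left, seen, first);
        long[] r = walk(t.right, seen, first);
        long h = ((l[0] * 1_000_003L + r[0]) * 1_000_003L) + t.key;
        int size = (int) (l[1] + r[1] + 1);
        if (first) seen.put(h, size);                                  // record T1's subtrees
        else if (seen.containsKey(h)) best = Math.max(best, size);     // match found in T2
        return new long[]{h, size};
    }

    static int largestCommonSubtree(Node t1, Node t2) {
        best = -1;
        Map<Long, Integer> seen = new HashMap<>();
        walk(t1, seen, true);
        walk(t2, seen, false);
        return best;   // -1 if there is no common subtree
    }
}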
Assuming there are no duplicate values in the trees:
LargestSubtree(Tree tree1, Tree tree2)
    Int bestMatch := 0
    Int bestMatchCount := 0
    For each Node n in tree1   //should iterate breadth-first
        //possible optimization: we can skip every node that is part of each subtree we find
        Node n2 := BinarySearch(tree2(n.value))
        Int matchCount := CountMatches(n, n2)
        If (matchCount > bestMatchCount)
            bestMatch := n.value
            bestMatchCount := matchCount
        End
    End
    Return ExtractSubtree(BinarySearch(tree1(bestMatch)), BinarySearch(tree2(bestMatch)))
End

CountMatches(Node n1, Node n2)
    If (!n1 || !n2 || n1.value != n2.value)
        Return 0
    End
    Return 1 + CountMatches(n1.left, n2.left) + CountMatches(n1.right, n2.right)
End

ExtractSubtree(Node n1, Node n2)
    If (!n1 || !n2 || n1.value != n2.value)
        Return nil
    End
    Node result := New Node(n1.value)
    result.left := ExtractSubtree(n1.left, n2.left)
    result.right := ExtractSubtree(n1.right, n2.right)
    Return result
End
To briefly explain, this is a brute-force solution to the problem. It does a breadth-first walk of the first tree. For each node, it performs a BinarySearch of the second tree to locate the corresponding node in that tree. Then using those nodes it evaluates the total size of the common subtree rooted there. If the subtree is larger than any previously found subtree, it remembers it for later so that it can construct and return a copy of the largest subtree when the algorithm completes.
This algorithm does not handle duplicate values. It could be extended to do so by using a BinarySearch implementation that returns a list of all nodes with the given value, instead of just a single node. Then the algorithm could iterate this list and evaluate the subtree for each node and then proceed as normal.
The running time of this algorithm is O(n log m) (it traverses n nodes in the first tree, and performs a log m binary-search operation for each one), putting it on par with most common sorting algorithms. The space complexity is O(1) while running (nothing allocated beyond a few temporary variables), and O(n) when it returns its result (because it creates an explicit copy of the subtree, which may not be required depending upon exactly how the algorithm is supposed to express its result). So even this brute-force approach should perform reasonably well, although as noted by other answers an O(n) solution is possible.
There are also possible optimizations that could be applied to this algorithm, such as skipping over any nodes that were contained in a previously evaluated subtree. Because the tree-walk is breadth-first, we know that any node that was part of some prior subtree cannot ever be the root of a larger subtree. This could significantly improve the performance of the algorithm in certain cases, but the worst-case running time (two trees with no common subtrees) would still be O(n log m).
I believe that I have an O(n + m)-time, O(n + m)-space algorithm for solving this problem, assuming the trees are of size n and m, respectively. This algorithm assumes that the values in the trees are unique (that is, each element appears in each tree at most once), but they do not need to be binary search trees.
The algorithm is based on dynamic programming and works with the following intuition: suppose that we have some tree T with root r and subtrees T1 and T2, and suppose the other tree is S. Now, suppose that we know the maximum common subtree of T1 and S and of T2 and S. Then the maximum common subtree of T and S either
Is completely contained in T1,
Is completely contained in T2, or
Uses both T1, T2, and r.
Therefore, we can compute the maximum common subtree (I'll abbreviate this as MCS) as follows. If MCS(T1, S) or MCS(T2, S) has the roots of T1 or T2 as roots, then the MCS we can get from T and S is given by the larger of MCS(T1, S) and MCS(T2, S). If exactly one of MCS(T1, S) and MCS(T2, S) has the root of T1 or T2 as a root (assume w.l.o.g. that it's T1), then look up r in S. If r has the root of T1 as a child, then we can extend that tree by a node and the MCS is given by the larger of this augmented tree and MCS(T2, S). Otherwise, if both MCS(T1, S) and MCS(T2, S) have the roots of T1 and T2 as roots, then look up r in S. If it has as a child the root of T1, we can extend the tree by adding in r. If it has as a child the root of T2, we can extend that tree by adding in r. Otherwise, we just take the larger of MCS(T1, S) and MCS(T2, S).
The formal version of the algorithm is as follows:
Create a new hash table mapping nodes in tree S from their value to the corresponding node in the tree. Then fill this table in with the nodes of S by doing a standard tree walk in O(m) time.
Create a new hash table mapping nodes in T from their value to the size of the maximum common subtree of the tree rooted at that node and S. Note that this means that the MCS-es stored in this table must be directly rooted at the given node. Leave this table empty.
Create a list of the nodes of T using a postorder traversal. This takes O(n) time. Note that this means that we will always process all of a node's children before the node itself; this is very important!
For each node v in the postorder traversal, in the order they were visited:
Look up the corresponding node in the hash table for the nodes of S.
If no node was found, set the size of the MCS rooted at v to 0.
If a node v' was found in S:
If neither of the children of v' match the children of v, set the size of the MCS rooted at v to 1.
If exactly one of the children of v' matches a child of v, set the size of the MCS rooted at v to 1 plus the size of the MCS of the subtree rooted at that child.
If both of the children of v' match the children of v, set the size of the MCS rooted at v to 1 plus the size of the MCS of the left subtree plus the size of the MCS of the right subtree.
(Note that step (4) runs in expected O(n) time, since it visits each node in T exactly once, makes O(n) hash table lookups, makes n hash table inserts, and does a constant amount of processing per node).
Iterate across the hash table and return the maximum value it contains. This step takes O(n) time as well. If the hash table is empty (S has size zero), return 0.
Overall, the runtime is O(n + m) time expected and O(n + m) space for the two hash tables.
To see a correctness proof, we proceed by induction on the height of the tree T. As a base case, if T has height zero (it is empty), then we just return zero because the loop in (4) does not add anything to the hash table. If T has height one (a single node), then that node either appears in S or it does not. If it appears in S, then it can't have any matching children at all, so we execute branch 4.3.1 and record an MCS of size one; step (6) then reports that the MCS has size one, which is correct. If it does not appear, then we execute 4.2, putting zero into the hash table, so step (6) reports that the MCS has size zero, as expected.
For the inductive step, assume that the algorithm works for all trees of height k' < k and consider a tree of height k. During our postorder walk of T, we will visit all of the nodes in the left subtree, then in the right subtree, and finally the root of T. By the inductive hypothesis, the table of MCS values will be filled in correctly for the left subtree and right subtree, since they have height ≤ k - 1 < k. Now consider what happens when we process the root. If the root doesn't appear in the tree S, then we put a zero into the table, and step (6) will pick the largest MCS value of some subtree of T, which must be fully contained in either its left subtree or right subtree. If the root appears in S, then we compute the size of the MCS rooted at the root of T by trying to link it with the MCS-es of its two children, which (inductively!) we've computed correctly.
Whew! That was an awesome problem. I hope this solution is correct!
EDIT: As was noted by @jonderry, this will find the largest common subgraph of the two trees, not the largest common complete subtree. However, you can restrict the algorithm to work only on subtrees quite easily. To do so, you would modify the inner code of the algorithm so that it records a subtree of size 0 if both subtrees aren't present with nonzero size. A similar inductive argument will show that this finds the largest complete subtree.
Though, admittedly, I like the "largest common subgraph" problem a lot more. :-)
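For concreteness, here is a compact Java sketch of the dynamic-programming idea from this answer, in the "largest common subgraph" form mentioned in the EDIT (class and method names are mine; values are assumed unique within each tree):

import java.util.*;

// Sketch only: index S by value, then a postorder walk over T computes, for each
// node of T, the size of the largest common structure rooted there.
class LargestCommonSubtreeDP {
    static class Node { int value; Node left, right; Node(int v) { value = v; } }

    static void index(Node s, Map<Integer, Node> byValue) {   // step 1
        if (s == null) return;
        byValue.put(s.value, s);
        index(s.left, byValue);
        index(s.right, byValue);
    }

    // step 4: returns the size of the MCS rooted at t, updating the running best.
    static int solve(Node t, Map<Integer, Node> sByValue, int[] best) {
        if (t == null) return 0;
        int l = solve(t.left, sByValue, best);
        int r = solve(t.right, sByValue, best);
        Node s = sByValue.get(t.value);
        int here = 0;
        if (s != null) {                                       // t's value appears in S
            here = 1;
            if (t.left != null && s.left != null && t.left.value == s.left.value) here += l;
            if (t.right != null && s.right != null && t.right.value == s.right.value) here += r;
        }
        best[0] = Math.max(best[0], here);
        return here;
    }

    static int largestCommon(Node t, Node s) {                 // O(n + m) expected
        Map<Integer, Node> byValue = new HashMap<>();
        index(s, byValue);
        int[] best = {0};
        solve(t, byValue, best);
        return best[0];
    }
}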
The following algorithm computes all the largest common subtrees of two binary trees (with no assumption that it is a binary search tree). Let S and T be two binary trees. The algorithm works from the bottom of the trees up, starting at the leaves. We start by identifying leaves with the same value. Then consider their parents and identify nodes with the same children. More generally, at each iteration, we identify nodes provided they have the same value and their children are isomorphic (or isomorphic after swapping the left and right children). This algorithm terminates with the collection of all pairs of maximal subtrees in T and S.
Here is a more detailed description:
Let S and T be two binary trees. For simplicity, we may assume that for each node n, the left child has value <= the right child. If exactly one child of a node n is NULL, we assume the right node is NULL. (In general, we consider two subtrees isomorphic if they are up to permutation of the left/right children for each node.)
(1) Find all leaf nodes in each tree.
(2) Define a bipartite graph B with edges from nodes in S to nodes in T, initially with no edges. Let R(S) and R(T) be empty sets. Let R(S)_next and R(T)_next also be empty sets.
(3) For each leaf node in S and each leaf node in T, create an edge in B if the nodes have the same value. For each edge created from nodeS in S to nodeT in T, add all the parents of nodeS to the set R(S) and all the parents of nodeT to the set R(T).
(4) For each node nodeS in R(S) and each node nodeT in R(T), draw an edge between them in B if they have the same value AND
{
    (i): nodeS->left is connected to nodeT->left and nodeS->right is connected to nodeT->right, OR
    (ii): nodeS->left is connected to nodeT->right and nodeS->right is connected to nodeT->left, OR
    (iii): nodeS->left is connected to nodeT->right and nodeS->right == NULL and nodeT->right == NULL
}
(5) For each edge created in step (4), add their parents to R(S)_next and R(T)_next.
(6) If R(S)_next is nonempty {
    (i) Swap R(S) and R(S)_next, and swap R(T) and R(T)_next.
    (ii) Empty the contents of R(S)_next and R(T)_next.
    (iii) Return to step (4).
}
When this algorithm terminates, R(S) and R(T) contain the roots of all maximal subtrees in S and T. Furthermore, the bipartite graph B identifies all pairs of nodes in S and nodes in T that give isomorphic subtrees.
I believe this algorithm has complexity O(n log n), where n is the total number of nodes in S and T, since the sets R(S) and R(T) can be stored in BSTs ordered by value; however, I would be interested to see a proof.

Finding last element of a binary heap

quoting Wikipedia:
It is perfectly acceptable to use a traditional binary tree data structure to implement a binary heap. There is an issue with finding the adjacent element on the last level on the binary heap when adding an element which can be resolved algorithmically...
Any ideas on how such an algorithm might work?
I was not able to find any information about this issue, for most binary heaps are implemented using arrays.
Any help appreciated.
Recently, I registered an OpenID account and am not able to edit my initial post nor comment on answers. That's why I am responding via this answer. Sorry for this.
quoting Mitch Wheat:
@Yse: is your question "How do I find the last element of a binary heap"?
Yes, it is.
Or to be more precise, my question is: "How do I find the last element of a non-array-based binary heap?".
quoting Suppressingfire:
Is there some context in which you're asking this question? (i.e., is there some concrete problem you're trying to solve?)
As stated above, I would like to know a good way to "find the last element of a non-array-based binary heap" which is necessary for insertion and deletion of nodes.
quoting Roy:
It seems most understandable to me to just use a normal binary tree structure (using a pRoot and Node defined as [data, pLeftChild, pRightChild]) and add two additional pointers (pInsertionNode and pLastNode). pInsertionNode and pLastNode will both be updated during the insertion and deletion subroutines to keep them current when the data within the structure changes. This gives O(1) access to both insertion point and last node of the structure.
Yes, this should work. If I am not mistaken, it could be a little bit tricky to find the insertion node and the last node when their locations change to another subtree due to a deletion/insertion. But I'll give this a try.
quoting Zach Scrivena:
How about performing a depth-first search...
Yes, this would be a good approach. I'll try that out, too.
Still I am wondering if there is a way to "calculate" the locations of the last node and the insertion point. The height of a binary heap with N nodes can be calculated by taking the log (of base 2) of the smallest power of two that is larger than N. Perhaps it is possible to calculate the number of nodes on the deepest level, too. Then it might be possible to determine how the heap has to be traversed to reach the insertion point or the node for deletion.
Basically, the statement quoted refers to the problem of resolving the location for insertion and deletion of data elements into and from the heap. In order to maintain "the shape property" of a binary heap, the lowest level of the heap must always be filled from left to right leaving no empty nodes. To maintain the average O(1) insertion and deletion times for the binary heap, you must be able to determine the location for the next insertion and the location of the last node on the lowest level to use for deletion of the root node, both in constant time.
For a binary heap stored in an array (with its implicit, compacted data structure as explained in the Wikipedia entry), this is easy. Just insert the newest data member at the end of the array and then "bubble" it into position (following the heap rules). Or replace the root with the last element in the array "bubbling down" for deletions. For heaps in array storage, the number of elements in the heap is an implicit pointer to where the next data element is to be inserted and where to find the last element to use for deletion.
For a binary heap stored in a tree structure, this information is not as obvious, but because it's a complete binary tree, it can be calculated. For example, in a complete binary tree with 4 elements, the point of insertion will always be the right child of the left child of the root node. The node to use for deletion will always be the left child of the left child of the root node. And for any given arbitrary tree size, the tree will always have a specific shape with well defined insertion and deletion points. Because the tree is a "complete binary tree" with a specific structure for any given size, it is very possible to calculate the location of insertion/deletion in O(1) time. However, the catch is that even when you know where it is structurally, you have no idea where the node will be in memory. So, you have to traverse the tree to get to the given node which is an O(log n) process making all inserts and deletions a minimum of O(log n), breaking the usually desired O(1) behavior. Any search ("depth-first", or some other) will be at least O(log n) as well because of the traversal issue noted and usually O(n) because of the random nature of the semi-sorted heap.
The trick is to be able to both calculate and reference those insertion/deletion points in constant time, either by augmenting the data structure ("threading" the tree, as mentioned in the Wikipedia article) or by using additional pointers.
The implementation which seems to me to be the easiest to understand, with low memory and extra coding overhead, is to just use a normal simple binary tree structure (using a pRoot and Node defined as [data, pParent, pLeftChild, pRightChild]) and add two additional pointers (pInsert and pLastNode). pInsert and pLastNode will both be updated during the insertion and deletion subroutines to keep them current when the data within the structure changes. This implementation gives O(1) access to both insertion point and last node of the structure and should allow preservation of overall O(1) behavior in both insertion and deletions. The cost of the implementation is two extra pointers and some minor extra code in the insertion/deletion subroutines (aka, minimal).
EDIT: added pseudocode for an O(1) insert()
Here is pseudo code for an insert subroutine which is O(1), on average:
define Node = [T data, *pParent, *pLeft, *pRight]

void insert(T data)
{
    do_insertion( data );   // do insertion, update count of data items in tree

    # assume: pInsert points to the node of the tree where the insertion just took place
    # (aka, either shuffle only data during the insertion or keep pInsert updated during the bubble process)

    int N = this->CountOfDataItems + 1;   # note: CountOfDataItems will always be > 0 (and pRoot != null) after an insertion

    p = new Node( <null>, null, null, null );   // new empty node for the next insertion

    # update pInsert (three cases to handle)
    if ( int(log2(N)) == log2(N) )
    {   # #1 - N is an exact power of two
        # O(log2(N))
        # tree is currently a full complete binary tree ("perfect")
        # ... must start a new lower level
        # traverse from pRoot down the tree through each pLeft until an empty pLeft is found for insertion
        pInsert = pRoot;
        while (pInsert->pLeft != null) { pInsert = pInsert->pLeft; }   # log2(N) iterations
        p->pParent = pInsert;
        pInsert->pLeft = p;
    }
    else if ( isEven(N) )
    {   # #2 - N is even (and NOT a power of 2)
        # O(1)
        p->pParent = pInsert->pParent;
        pInsert->pParent->pRight = p;
    }
    else
    {   # #3 - N is odd
        # O(1)
        p->pParent = pInsert->pParent->pParent->pRight;
        pInsert->pParent->pParent->pRight->pLeft = p;
    }
    pInsert = p;

    // update pLastNode
    // ... [similar process]
}
So, insert(T) is O(1) on average: exactly O(1) in all cases except when the tree must be grown by one level, which costs O(log N) and only happens when the tree is a full complete ("perfect") binary tree, i.e. roughly every time the number of items doubles (assuming no deletions). The addition of another pointer (pLeftmostLeaf) could make insert() O(1) for all cases and avoids the possible pathologic case of alternating insertion & deletion in a full complete binary tree. (Adding pLeftmostLeaf is left as an exercise [it's fairly easy].)
This is my first time participating in Stack Overflow.
Yes, the above answer by Zach Scrivena (sorry, I don't know how to properly refer to other people) is right. What I want to add is a simplified way to find the position if we are given the count of nodes.
The basic idea is:
Given the count N of the target position in this complete binary tree, repeatedly compute "N % 2" and push the result onto a stack, replacing N by N / 2 (integer division), until N == 1. Then pop the results out. A result of 1 means go right, 0 means go left. The popped sequence is the route from the root to the target position.
Example:
The tree currently has 10 nodes and I want to insert another node at position 11. How do I route to it?
11 % 2 = 1 --> right (the quotient is 5, and push right into stack)
5 % 2 = 1 --> right (the quotient is 2, and push right into stack)
2 % 2 = 0 --> left (the quotient is 1, and push left into stack. End)
Then pop the stack: left -> right -> right. This is the path from the root.
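The same trick in a short Java sketch (names are mine): compute the root-to-position path for the 1-based, level-order position n by repeated division by two.

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: pathTo(11) yields [L, R, R], matching the worked example above.
class HeapPath {
    static Deque<Character> pathTo(int n) {
        Deque<Character> path = new ArrayDeque<>();   // used as a stack
        while (n > 1) {
            path.push((n % 2 == 1) ? 'R' : 'L');      // odd position -> right child
            n /= 2;
        }
        return path;                                  // read front to back = route from the root
    }

    public static void main(String[] args) {
        System.out.println(HeapPath.pathTo(11));      // prints [L, R, R]
    }
}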
You could use the binary representation of the size of the binary heap to find the location of the last node in O(log N). The size could be stored and incremented, which would take O(1) time. The fundamental concept behind this is the structure of the binary tree.
Suppose our heap size is 7. The binary representation of 7 is "111". Now, remember to always omit the first bit. So we are left with "11". Read it from left to right. The first bit is '1', so go to the right child of the root node. The remaining string is "1", and its first bit is '1', so again go to the right child of the current node you are at. As you no longer have bits to process, this indicates that you have reached the last node. So, the raw working of the process is: convert the size of the heap into bits, omit the first bit, and according to each remaining bit from the left, go to the right child of the current node if it is '1' and to the left child if it is '0'.
As you always go all the way to the bottom of the binary tree, this operation always takes O(log N) time. This is a simple and accurate procedure to find the last node.
You may not understand it on the first reading. Try working this method on paper for different sizes of binary heap; I'm sure you'll get the intuition behind it. I'm sure this knowledge is enough to solve your problem; if you want more explanation with figures, you can refer to my blog.
Hope my answer has helped you, if it did, let me know...! ☺
How about performing a depth-first search, visiting the left child before the right child, to determine the height of the tree. Thereafter, the first leaf you encounter with a shorter depth, or a parent with a missing child would indicate where you should place the new node before "bubbling up".
The depth-first search (DFS) approach above doesn't assume that you know the total number of nodes in the tree. If this information is available, then we can "zoom-in" quickly to the desired place, by making use of the properties of complete binary trees:
Let N be the total number of nodes in the tree, and H be the height of the tree.
Some values of (N,H) are (1,0), (2,1), (3,1), (4,2), ..., (7,2), (8, 3).
The general formula relating the two is H = ceil[log2(N+1)] - 1.
Now, given only N, we want to traverse from the root to the position for the new node, in the least number of steps, i.e. without any "backtracking".
We first compute the total number of nodes M in a perfect binary tree of height H = ceil[log2(N+1)] - 1, which is M = 2^(H+1) - 1.
If N == M, then our tree is perfect, and the new node should be added in a new level. This means that we can simply perform a DFS (left before right) until we hit the first leaf; the new node becomes the left child of this leaf. End of story.
However, if N < M, then there are still vacancies in the last level of our tree, and the new node should be added to the leftmost vacant spot.
The number of nodes that are already at the last level of our tree is just (N - 2^H + 1).
This means that the new node takes spot X = (N - 2^H + 2) from the left, at the last level.
Now, to get there from the root, you will need to make the correct turns (L vs R) at each level so that you end up at spot X at the last level. In practice, you would determine the turns with a little computation at each level. However, I think the following table shows the big picture and the relevant patterns without getting mired in the arithmetic (you may recognize this as a form of arithmetic coding for a uniform distribution):
0 0 0 0 0 X 0 0   <--- represents the last level in our tree, X marks the spot!
          ^
L L L L R R R R   <--- at level 0, proceed to the R child
L L R R L L R R   <--- at level 1, proceed to the L child
L R L R L R L R   <--- at level 2, proceed to the R child
          ^           (which is the position of the new node)
          this column tells us
          if we should proceed to the L or R child at each level
EDIT: Added a description on how to get to the new node in the shortest number of steps assuming that we know the total number of nodes in the tree.
Solution in case you don't have a reference to the parent !!!
To find the right place for the next node you have 3 cases to handle:
case (1): the tree's last level is completely full
case (2): the tree's node count is even
case (3): the tree's node count is odd
Insert:
void Insert(Node root, Node n)
{
    Node parent = findRequiredParentToInsertNewNode(root);
    if(parent.left == null)
        parent.left = n;
    else
        parent.right = n;
}
Find the parent of the node in order to insert it
Node findRequiredParentToInsertNewNode(Node root){
    Node last = findLastNode(root);
    //Case 1
    if(2 * Math.pow(2, levelNumber) == NodeCount){
        while(root.left != null)
            root = root.left;
        return root;
    }
    //Case 2
    else if(Even(NodeCount)){
        Node n = findParentOfLastNode(root, findParentOfLastNode(root, last));
        return n.right;
    }
    //Case 3
    else if(Odd(NodeCount)){
        Node n = findParentOfLastNode(root, last);
        return n;
    }
}
To find the last node you need to perform a BFS (breadth first search) and get the last element in the queue
Node findLastNode(Node root)
{
    if (root.left == null)
        return root;
    Queue q = new Queue();
    q.enqueue(root);
    Node n = null;
    while(!q.isEmpty()){
        n = q.dequeue();
        if ( n.left != null )
            q.enqueue(n.left);
        if ( n.right != null )
            q.enqueue(n.right);
    }
    return n;
}
Find the parent of the last node, in order to set its child link to null when the last node replaces the root in the removal case:
Node findParentOfLastNode(Node root, Node lastNode)
{
    if(root == null)
        return root;
    if( root.left == lastNode || root.right == lastNode )
        return root;
    Node n1 = findParentOfLastNode(root.left, lastNode);
    Node n2 = findParentOfLastNode(root.right, lastNode);
    return n1 != null ? n1 : n2;
}
I know this is an old thread, but I was looking for an answer to the same question. However, I could not afford an O(log n) solution, as I had to find the last node thousands of times in a few seconds. I did have an O(log n) algorithm, but my program was crawling because of the number of times it performed this operation. So after much thought I finally found a fix for this. Not sure if anybody finds this interesting.
This solution is O(1) for search. For insertion it is definitely less than O(log n), although I cannot say it is O(1).
Just wanted to add that if there is interest, I can provide my solution as well.
The solution is to add the nodes of the binary heap to a queue. Every queue node has front and back pointers. We keep adding nodes to the end of this queue from left to right until we reach the last node in the binary heap. At this point, the last node in the binary heap will be at the rear of the queue.
Every time we need to find the last node, we dequeue from the rear, and the second-to-last node now becomes the last node in the tree.
When we want to insert, we search backwards from the rear for the first node where we can insert, and put the new node there. It is not exactly O(1), but it reduces the running time dramatically.

Resources