Hi all, I have an algorithmic problem and I am struggling to find an optimal solution. I have a tree that I want to traverse. Each node of the tree consists of a value and a rank (the value as well as the rank can be an arbitrary number).
What I want to do is traverse the tree and, for each node, sum the values of all its descendants, except descendants with a lower rank and all nodes under them (irrespective of rank). The tree does not have any special properties: every node can have anywhere from 0 to Integer.MAX_VALUE children. There are no rules relating the rank or value of a parent to those of its children.
My naive solution is to recursively traverse the subtree of each node and stop the recursion as soon as I find a node with a lower rank, summing values on my way back to the root of the subtree. However, this feels suboptimal: in the worst case, where each node has only one child (so the tree is basically a linked list) and the ranks are sorted ascending to the root, this solution would be O(n^2).
Is it possible to have the sums for all nodes settled after one traversal?
Edit:
A solution slightly better than my naive approach could be: for every visited node, propagate its value back up to the root while keeping track of the minimum rank of the nodes visited on the way up, and add the value only to the nodes whose rank is lower than that minimum.
Edit from my phone: since we only ever inspect the value of tree roots, we can use a disjoint sets structure with path compression only instead of a dynamic tree. Don't bother updating non-roots.
Here's an O(n log n)-time algorithm with dynamic trees. (I know. They're a pain to implement. I try not to include them in answers.)
Sort the nodes from greatest to least rank and initialize a totally disconnected dynamic tree with their values. For each node in order, issue dynamic tree operations to
1. Report its current value (O(log n) amortized, output this value eventually), and
2. If it's not the root, add its value to each of its ancestors' values (O(log n) amortized) and then link it to its parent (also O(log n) amortized).
The effect of Step 2 is that each node's dynamic value is the sum of its descendants' (relative to the present tree) original values.
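To make the disjoint-set variant concrete, here is a minimal Python sketch. It is my own reading of the idea above, not the answerer's code. Assumptions that are mine: the tree is given as arrays `parent`, `value` and `rank` with nodes numbered so that `parent[v] < v` and `parent[0] == -1`; equal-rank descendants count towards the sum; ties are broken by processing deeper nodes first; and a node's own value is excluded from its result.

```python
def rank_limited_sums(parent, value, rank):
    """For each node, sum the values of the descendants that can be reached
    without passing through (or ending at) a node of lower rank.

    Assumes parent[v] < v for every non-root node and parent[0] == -1.
    """
    n = len(parent)
    depth = [0] * n
    for v in range(1, n):
        depth[v] = depth[parent[v]] + 1

    dsu = list(range(n))  # disjoint-set forest, one set per node initially
    acc = list(value)     # acc[root] = total value stored in root's set

    def find(v):
        root = v
        while dsu[root] != root:
            root = dsu[root]
        while dsu[v] != root:  # path compression
            dsu[v], v = root, dsu[v]
        return root

    result = [0] * n
    # Highest rank first; among equal ranks, deeper nodes first, so that an
    # equal-rank descendant is already linked when its ancestor is reported.
    for v in sorted(range(n), key=lambda u: (-rank[u], -depth[u])):
        result[v] = acc[v] - value[v]  # v is still the root of its own set
        if parent[v] != -1:
            r = find(parent[v])
            dsu[v] = r
            acc[r] += acc[v]
    return result


# Root (value 5, rank 0) with three subtrees, numbered in preorder.
parent = [-1, 0, 1, 0, 3, 4, 0, 6, 6, 6]
value = [5, 4, 3, 7, 11, 7, 8, 3, 9, 4]
rank = [0, 2, 1, 2, 1, 8, 4, 3, 5, 2]
print(rank_limited_sums(parent, value, rank))  # [56, 0, 0, 0, 7, 0, 9, 0, 0, 0]
```

Sorting dominates the running time, so this sketch is O(n log n) overall; the union-find part uses path compression only, as suggested above.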
Edit
Not correct answer, does not solve the problem asked by the OP (cf comment)
Old answer (before edit)
You can see that this problem (like most problems on trees) can be solved with a recursive approach. That is because the sum value of a node depends only on the sum values and respective ranks of its children.
Here is a pseudo-code describing a solution.
get_sum(my_node, result_arr):
    my_sum = 0
    for child in my_node.children():
        get_sum(child, result_arr) // we compute the sum value of the children
        if rank(child) >= rank(my_node): // if child node is big enough add its sum
            my_sum += result_arr[child]
    result_arr[my_node] = my_sum // store the result somewhere
This is a DFS-based (post-order) algorithm, which should run in O(n), with n the number of nodes in your tree. To get the values for all the nodes, call this recursive function on the root node of your tree.
I suggest a post-order DFS. For every node, keep a reference to its predecessor.
the sum_by_rank for a leaf is an empty dict;
the sum_by_rank for any other node is a dict mapping the ranks of all its subnodes to the values of those subnodes. If two or more subnodes have the same rank, just add their values.
The post-order DFS allows you to compute the sums from the bottom up.
Here's a Python 3.7 program to play with (the code is probably not optimized):
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class Node:
    value: int
    rank: int
    sum_by_rank: Dict[int, int]
    children: List[object]

tree = Node(5, 0, {}, [
    Node(4,2, {}, [Node(3,1, {}, [])]),
    Node(7,2, {}, [Node(11,1, {}, [Node(7,8, {}, [])])]),
    Node(8,4, {}, [Node(3,3, {}, []), Node(9,5, {}, []), Node(4,2, {}, [])]),
])

def dfs(node, previous=None):
    for child in node.children:
        dfs(child, node)
        node.sum_by_rank[child.rank] = node.sum_by_rank.get(child.rank, 0) + child.value # add this children rank -> value
        # add the subtree rank -> value
        for r,v in child.sum_by_rank.items():
            node.sum_by_rank[r] = node.sum_by_rank.get(r, 0)+v

dfs(tree)
print (tree)
# Node(value=5, rank=0, sum_by_rank={2: 15, 1: 14, 8: 7, 4: 8, 3: 3, 5: 9}, children=[Node(value=4, rank=2, sum_by_rank={1: 3}, children=[Node(value=3, rank=1, sum_by_rank={}, children=[])]), Node(value=7, rank=2, sum_by_rank={1: 11, 8: 7}, children=[Node(value=11, rank=1, sum_by_rank={8: 7}, children=[Node(value=7, rank=8, sum_by_rank={}, children=[])])]), Node(value=8, rank=4, sum_by_rank={3: 3, 5: 9, 2: 4}, children=[Node(value=3, rank=3, sum_by_rank={}, children=[]), Node(value=9, rank=5, sum_by_rank={}, children=[]), Node(value=4, rank=2, sum_by_rank={}, children=[])])])
Hence, to get the sum of a node, just add the values associated with the ranks greater than or equal to the node's rank. In Python:
sum(value for rank, value in node.sum_by_rank.items() if rank >= node.rank)
Let a, b, c be nodes. Assume that a is an ancestor of b, and b is an ancestor of c.
Observe: if b.rank > c.rank and a.rank > b.rank then a.rank > c.rank.
This leads us to the conclusion that the sum_by_rank of a is equal to the sum of sum_by_rank(b) + b.value for every b direct child of a, having a rank lower than a.
That suggests the following recursion:
ComputeRank(v)
    if v is null
        return
    let sum = 0
    foreach child in v.children
        ComputeRank(child)
        if child.rank <= v.rank
            sum += child.sumByRank + child.value
    v.sumByRank = sum
By the end of the algorithm each node will have its sumByRank as you required (if I understood correctly).
Observe that for each node n in the input tree, the algorithm will visit n exactly once, and query it once again while visiting its predecessor. This is a constant number of times, meaning the algorithm will take O(N) time.
Hope it helps :)
Related
Let us say we have a dynamic set S of integers and an index i. We wish to find the i-th smallest negative number in S (written in increasing order), if any.
Example:
S = {-5, -2, -1, 2, 5}: the answer is -1 for i = 3 and is undefined for i = 4.
The objective is to choose the red-black tree as the underlying data structure and define an additional attribute that allows the problem to be solved in O(lg n) time. Any guidance on which algorithm should be used to solve such a question?
It's called an Order Statistic Tree (https://en.wikipedia.org/wiki/Order_statistic_tree).
In general, you extend your tree node with an extra attribute, the size of its subtree. For a leaf, it's 1; for an inner node, it's
size(left_subtree) + size(right_subtree) + 1
Wiki has a clear explanation and pseudocode. It works with any kind of balanced tree (RB/AVL/Treap/etc.); you need to maintain the subtree sizes during rotations (or any other tree modification).
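As a rough illustration of the size attribute, here is a small Python sketch. To keep it short it uses a plain, unbalanced BST rather than a red-black tree, so the O(lg n) bound only holds if the tree happens to stay balanced; in a real red-black tree the same `size` field would also be updated during rotations. The names `OSNode`, `insert`, `select` and `ith_smallest_negative` are made up for this example. It relies on the observation that the negative numbers are the smallest elements of S, so the i-th smallest negative number is simply the i-th smallest element, provided that element is negative.

```python
class OSNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here


def size(node):
    return node.size if node else 0


def insert(node, key):
    """Plain (unbalanced) BST insert that keeps the size attribute current."""
    if node is None:
        return OSNode(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node


def select(node, i):
    """Return the i-th smallest key (1-based), or None if i is out of range."""
    while node:
        left = size(node.left)
        if i == left + 1:
            return node.key
        if i <= left:
            node = node.left
        else:
            i -= left + 1  # skip the left subtree and the current node
            node = node.right
    return None


def ith_smallest_negative(root, i):
    key = select(root, i)
    return key if key is not None and key < 0 else None


root = None
for k in [2, -5, 5, -1, -2]:
    root = insert(root, k)
print(ith_smallest_negative(root, 3))  # -1
print(ith_smallest_negative(root, 4))  # None (the 4th smallest element is 2)
```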
I have a complete binary tree numbered in level order, with the root numbered 2, as in the figure below.
Given a root value and another value v, how can I decide whether v is in the left or the right subtree of that root, without traversing the tree?
For example: Let's say root = 2, v = 15. I want to decide using a mathematical function or something that v is in right subtree.
Another example could be, root = 3, v = 10. Answer should be left subtree.
I know I can do this by a tree traversal. I want to know if this is possible in O(1).
It is unclear from your question if you want O(1) to be the time complexity or space complexity.
But, I assume you are talking about the time complexity as space is abundant these days.
If the space complexity permits, there is an approach using which you can query the subtree with a search value in constant time.
The idea is to store all the ancestors of the node with proper direction.
For example:
Let's assume Node 11 to be the target node.
In a single traversal, we can maintain a separate ancestors map for all the nodes containing the respective ancestor and direction to reach the target node.
Starting from the root, Node 2.
Node 2 has no parent, therefore, its ancestors map will be empty.
For Node 3, store a key value pair <2, L> (2 for parent and L for left).
Likewise, for Node 4, store a key value pair <2, R> (2 for parent and R for right).
For Node 6, the ancestors map looks like:
{
2 : "L",
3 : "R"
}
Repeat the procedure until we have covered every node.
Now, the ancestors map for Node 11 will look as follows:
{
2 : "L",
3 : "R",
6 : "L"
}
Just check if the value of the root of the subtree is present in the ancestors map of Node 11.
If present, just return its value, which denotes the left/right subtree, in constant time.
PS: Using an unordered map can be beneficial in such a case.
Also, as it is a complete binary tree, the maximum height for N nodes will be about log2(N).
Therefore, the space complexity required is O(N * log2(N)).
The time complexity for insertion into an unordered map is O(1) on average.
Therefore, time complexity for building all the maps = O(N * log2(N) * some constant factor).
Time complexity for querying = constant ~ O(1).
For N <= 10^5, the logic for building the ancestors map can be executed within 1 second.
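A small Python sketch of the ancestors-map idea follows. The figure is not reproduced here, so the numbering rule is an assumption inferred from the examples in the question and in this answer: the root is 2 and the children of node k are 2k - 1 (left) and 2k (right), which gives parent(v) = (v + 1) // 2. The function names are made up for the illustration.

```python
def build_ancestor_maps(last):
    """Build ancestors[v] for every node v in 2..last.

    ancestors[v][a] is 'L' or 'R': the side of ancestor a on which v lies.
    Assumes the children of node k are 2k - 1 (left) and 2k (right).
    """
    ancestors = {2: {}}  # the root has no ancestors
    for v in range(3, last + 1):
        parent = (v + 1) // 2
        side = 'L' if v == 2 * parent - 1 else 'R'
        # v lies on the same side of every older ancestor as its parent does,
        # so copy the parent's map and then add the parent itself.
        m = dict(ancestors[parent])
        m[parent] = side
        ancestors[v] = m
    return ancestors


def which_subtree(ancestors, root, v):
    """Return 'L', 'R', or None if v is not below root. One dict lookup."""
    return ancestors[v].get(root)


anc = build_ancestor_maps(16)
print(anc[11])                    # {2: 'L', 3: 'R', 6: 'L'}
print(which_subtree(anc, 2, 15))  # R
print(which_subtree(anc, 3, 10))  # L
```

Under the same numbering assumption the answer can also be computed without any stored maps by repeatedly applying parent(v) = (v + 1) // 2 until the root is reached, which takes O(log N) time per query instead of O(1).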
We're given a directed tree to work with. We define the concepts of p-ancestor and p-cousin as follows
p-ancestor: A node is a 1-ancestor of another if it is its parent. It is the p-ancestor of a node if it is the parent of that node's (p-1)-ancestor.
p-cousin: A node is the p-cousin of another, if they share the same p-ancestor.
For example, consider the tree below.
4 has three 1-cousins, i.e. 3, 4 and 5, since they all share the common 1-ancestor, which is 1.
For a particular tree, the problem is as follows. You are given multiple pairs of (node,p) and are supposed to count (and output) the number of p-cousins of the corresponding nodes.
A slow algorithm would be to crawl up to the p-ancestor and run a BFS for each node.
What is the (asymptotically) fastest way to solve the problem?
If an off-line solution is acceptable, two Depth first searches can do the job.
Assume that we can index all n queries (node, p) from 0 to n - 1.
We can convert each query (node, p) into another type of query (ancestor, p) as follows:
The answer for query (node, p), where node is at level a (the distance from the root to this node is a), is the number of level-a descendants of the node's ancestor at level a - p. So, for each query, we can find that ancestor:
Pseudo code
dfs(int node, int level, int[] path, int[] ancestorForQuery, List<Query>[] data){
    path[level] = node;
    visit all child node;
    for(Query query : data[node])
        if(query.p <= level)
            ancestorForQuery[query.index] = path[level - query.p];
}
Now, after the first DFS, instead of the original query, we have a new type of query (ancestor, p)
Assume that we have an array count which, at index i, stores the number of nodes at level i visited so far. For a node a at level x, we need to count the number of its descendants that are p levels below it, so the result for this query is:
query result = count[x + p] after we visit a - count[x + p] before we visit a
Pseudo code
dfs2(int node, int level, int[] result, int[] count, List<TransformedQuery>[] data){
    count[level]++;
    for(TransformedQuery query : data[node]){
        result[query.index] -= count[level + query.p];
    }
    visit all child node;
    for(TransformedQuery query : data[node]){
        result[query.index] += count[level + query.p];
    }
}
Result of each query is stored in result array.
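For readers who prefer something runnable, here is a Python sketch of the same two-pass offline idea. The input format is an assumption of mine: `children` is a dict mapping a node to the list of its children, and `queries` is a list of (node, p) pairs with p >= 1. As in the question's example, the count includes the queried node itself, and a query whose node has no p-ancestor gets 0. The recursion is fine for small trees; a very deep tree would need an explicit stack.

```python
from collections import defaultdict


def count_level_mates(children, root, queries):
    """For each query (node, p), count the nodes (node included) that share
    node's p-ancestor. Returns 0 for a query with no p-ancestor."""
    by_node = defaultdict(list)
    for idx, (node, p) in enumerate(queries):
        by_node[node].append((idx, p))

    # Pass 1: turn (node, p) into (ancestor, p) using the root-to-node path.
    by_ancestor = defaultdict(list)
    path = []

    def dfs1(node):
        path.append(node)
        level = len(path) - 1
        for idx, p in by_node[node]:
            if p <= level:
                by_ancestor[path[level - p]].append((idx, p))
        for child in children.get(node, []):
            dfs1(child)
        path.pop()

    # Pass 2: count[d] is the number of nodes seen so far at depth d. The
    # growth of count[level + p] while inside an ancestor's subtree is the
    # number of its descendants p levels below it.
    result = [0] * len(queries)
    count = defaultdict(int)

    def dfs2(node, level):
        count[level] += 1
        for idx, p in by_ancestor[node]:
            result[idx] -= count[level + p]
        for child in children.get(node, []):
            dfs2(child, level + 1)
        for idx, p in by_ancestor[node]:
            result[idx] += count[level + p]

    dfs1(root)
    dfs2(root, 0)
    return result


# Hypothetical tree: 1 -> 3, 4, 5; 3 -> 6; 5 -> 7, 8
children = {1: [3, 4, 5], 3: [6], 5: [7, 8]}
print(count_level_mates(children, 1, [(4, 1), (6, 2), (7, 1), (3, 2), (6, 1)]))
# [3, 3, 2, 0, 1]
```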
If p is fixed, I suggest the following algorithm:
Let's say that count[v] is number of p-children of v. Initially all count[v] are set to 0. And pparent[v] is p-parent of v.
Let's now run a dfs on the tree and keep the stack of visited nodes, i.e. when we visit some v, we put it into the stack. Once we leave v, we pop.
Suppose we've come to some node v in our dfs. Let's do count[stack[size - p]]++, indicating that v is a p-child of stack[size - p]. Also set pparent[v] = stack[size - p].
Once your dfs is finished, you can calculate the desired number of p-cousins of v like this:
count[pparent[v]]
The complexity of this is O(n + m) for dfs and O(1) for each query
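A minimal Python sketch of this fixed-p variant, under assumptions of my own: `children` is a dict from a node to the list of its children, p >= 1, and the stack is inspected before the current node is pushed, so that `stack[len(stack) - p]` is exactly the p-ancestor. Dicts are used instead of arrays, so nodes without a p-ancestor simply have no entry in `pparent`.

```python
def cousin_counts_fixed_p(children, root, p):
    """Return (count, pparent) for a fixed p >= 1.

    count[u]   = number of nodes whose p-ancestor is u
    pparent[v] = the p-ancestor of v (no entry if v has none)
    """
    count, pparent = {}, {}
    stack = []  # ancestors of the node currently being entered

    def dfs(v):
        if len(stack) >= p:
            u = stack[len(stack) - p]  # the p-ancestor of v
            pparent[v] = u
            count[u] = count.get(u, 0) + 1
        stack.append(v)
        for c in children.get(v, []):
            dfs(c)
        stack.pop()

    dfs(root)
    return count, pparent


# Hypothetical tree: 1 -> 3, 4, 5; 3 -> 6; 5 -> 7, 8
children = {1: [3, 4, 5], 3: [6], 5: [7, 8]}
count, pparent = cousin_counts_fixed_p(children, 1, 1)
print(count[pparent[4]])  # 3 (nodes 3, 4 and 5 share the 1-ancestor 1)
```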
First I'll describe a fairly simple way to answer each query in O(p) time that uses O(n) preprocessing time and space, and then mention a way that query times can be sped up to O(log p) time for a factor of just O(log n) extra preprocessing time and space.
O(p)-time query algorithm
The basic idea is that if we write out the sequence of nodes visited during a DFS traversal of the tree in such a way that every node is written out at a vertical position corresponding to its level in the tree, then the set of p-cousins of a node form a horizontal interval in this diagram. Note that this "writing out" looks very much like a typical tree diagram, except without lines connecting nodes, and (if a postorder traversal is used; preorder would be just as good) parent nodes always appearing to the right of their children. So given a query (v, p), what we will do is essentially:
1. Find the p-th ancestor u of the given node v. Naively this takes O(p) time.
2. Find the p-th left-descendant l of u -- that is, the node you reach after repeating the process of visiting the leftmost child of the current node, p times. Naively this takes O(p) time.
3. Find the p-th right-descendant r of u (defined similarly). Naively this takes O(p) time.
4. Return the value x[r] - x[l] + 1, where x[i] is a precalculated value that records the number of nodes in the sequence described above that are at the same level as, and at or to the left of, node i. This takes constant time.
The preprocessing step is where we calculate x[i], for each 1 <= i <= n. This is accomplished by performing a DFS that builds up a second array y[] that records the number y[d] of nodes visited so far at depth d. Specifically, y[d] is initially 0 for each d; during the DFS, when we visit a node v at depth d, we simply increment y[d] and then set x[v] = y[d].
O(log p)-time query algorithm
The above algorithm should already be fast enough if the tree is fairly balanced -- but in the worst case, when each node has just a single child, O(p) = O(n). Notice that it is navigating up and down the tree in the first 3 of the above 4 steps that force O(p) time -- the last step takes constant time.
To fix this, we can add some extra pointers to make navigating up and down the tree faster. A simple and flexible way uses "pointer doubling": For each node v, we will store log2(depth(v)) pointers to successively higher ancestors. To populate these pointers, we perform log2(maxDepth) DFS iterations, where on the i-th iteration we set each node v's i-th ancestor pointer to its (i-1)-th ancestor's (i-1)-th ancestor: this takes just two pointer lookups per node per DFS. With these pointers, moving any distance p up the tree always takes at most log(p) jumps, because the distance can be reduced by at least half on each jump. The exact same procedure can be used to populate corresponding lists of pointers for "left-descendants" and "right-descendants" to speed up steps 2 and 3, respectively, to O(log p) time.
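Here is a small Python sketch of the pointer-doubling table for the ancestor step only; the left-descendant and right-descendant tables would be built the same way. It assumes the tree is given as a `parent` array with -1 for the root, and it fills each level with a plain pass over the array rather than a DFS, which performs the same two lookups per node per level. The function names are made up for this example.

```python
def build_jump_table(parent):
    """up[i][v] is the 2**i-th ancestor of v, or -1 if there is none.

    parent[v] is the parent of v, with -1 for the root.
    """
    n = len(parent)
    up = [list(parent)]
    while any(a != -1 for a in up[-1]):
        prev = up[-1]
        # The 2**(i+1)-th ancestor is the 2**i-th ancestor's 2**i-th ancestor.
        up.append([prev[prev[v]] if prev[v] != -1 else -1 for v in range(n)])
    return up


def kth_ancestor(up, v, p):
    """Return the p-th ancestor of v, or -1 if v has fewer than p ancestors."""
    i = 0
    while p and v != -1:
        if p & 1:
            if i >= len(up):  # a jump of 2**i is longer than any path
                return -1
            v = up[i][v]
        p >>= 1
        i += 1
    return v


# A chain 0 <- 1 <- 2 <- ... <- 7 (the worst case for the O(p) version).
up = build_jump_table([-1, 0, 1, 2, 3, 4, 5, 6])
print(kth_ancestor(up, 7, 5))  # 2
print(kth_ancestor(up, 7, 8))  # -1
```

Each query touches one table level per bit of p, which is where the O(log p) bound comes from.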
Let us assume that we have a tree consisting of N nodes. The task is to find all longest unique paths in the tree. For example, if the tree looks like the following:
Then there are three longest unique paths in the tree: 1 - 2 - 3 - 4 - 5, 6 - 2 - 3 - 4 - 5 and 1 - 2 - 6.
I want to programmatically find and store all such paths for a given tree.
One way to do it would be to compute paths between each pair of node in the tree and then reject the paths which are contained in any other path. However, I am looking for an efficient way to do it. My questions are as follows:
Is it possible to compute this information in less than O(N^2)? I have not been able to think of a solution which would be faster than O(N^2).
If yes, could you be kind enough to guide me towards the solution.
The reason why I want to try it out is because I am trying to solve this problem: KNODES
An algorithm with a time complexity below O(N^2) may only exist, if every solution for a tree with N nodes can be encoded in less than O(N^2) space.
Suppose a complete binary tree with n leaves (N=n log n). The solution to the problem will contain a path for every set of 2 leaves. That means, the solution will have O(n^2) elements. So for this case we can encode the solution as the 2-element sets of leaves.
Now consider a nearly complete binary tree with m leaves, which was created by only removing arbitrary leaves from a complete binary tree with n leaves. When comparing the solution of this tree to that of the complete binary tree, both will share a possibly empty set of paths. In fact for every subset of paths of a solution of a complete binary tree, there will exist at least one binary tree with m leaves as mentioned above, that contains every solution of such a subset. (We intentionally ignore the fact that a tree with m leaves may have some more paths in the solution where at least some of the path ends are not leaves of the complete binary tree.)
Only that part of the solution for a binary tree with m leaves will be encoded by a number with (n^2)/2 bits. The index of a bit in this number represents an element in the upper right half of a matrix with n columns and rows.
For n=4 this would be:
x012
xx34
xxx5
The bit at index i will be set if the undirected path row(i),column(i) is contained in the solution.
As we have already stated that a solution for a tree with m leaves may contain any subset of the solution to the complete binary tree with n>=m leaves, every binary number with (n^2)/2 bits will represent a solution for a tree with m leaves.
Now encoding every possible number with (n^2)/2 bits using fewer than (n^2)/2 bits is not possible. So we have shown that solutions require at least O(n^2) space to be represented. Using N=n log n from above, we get a space requirement of at least O(N^2).
Therefore there doesn't exist an algorithm with time complexity less than O(N^2).
As far as I could understand, you have a tree without a selected root. Your admissible paths are the paths that do not allow to visit tree nodes twice (you are not allowed to return back). And you need to find all such admissible paths that are not subpaths of any admissible path.
So if I understood right, then if a node has only one edge, then an admissible path either starts or stops at this node. If the tree is connected, then you can get from any node to any other node by one admissible path.
So you select all nodes with one edge, call it S. Then select one of S and walk the whole tree saving the paths to the ends (path, not the walk order). Then you do this with every other item in S and remove duplicated paths (they can be in reverse order: like starting from 1: 1 - 2 - 6 and starting from 6: 6 - 2 - 1).
So here you have to visit all the nodes in the tree as many times as you have leaves in the tree. So the complexity depends on the branching factor (in the worst case it is O(n^2)). There are some optimizations that can reduce the number of operations; for example, you don't have to walk the tree from the last item of S.
In this picture the longest paths are {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 3, 7}, {1, 2, 3, 8}, {1, 2, 3, 9}, {1, 2, 3, 10}
For a tree like this, storing all longest paths will cost you O(N^2).
Let's take the tree which looks like the star with n nodes and n-1 edges.
Then you have got C(n-1, 2) unique longest paths.
So the lower limit of complexity can't be less than O(n^2).
Suppose you have a binary tree like the one in your picture.
class Node(object):
    def __init__(self, key):
        self.key = key
        self.right = None
        self.left = None
        self.parent = None

    def append_left(self, node):
        self.left = node
        node.parent = self

    def append_right(self, node):
        self.right = node
        node.parent = self

root = Node(3)
root.append_left(Node(2))
root.append_right(Node(4))
root.left.append_left(Node(1))
root.left.append_right(Node(6))
root.right.append_right(Node(5))
And we need to get all paths between all leaves. So in your tree they are:
1, 2, 6
6, 2, 3, 4, 5
1, 2, 3, 4, 5
You can do this in (edit: not linear, quadratic) time.
# Python 3. Note that the default arguments are shared between calls,
# so call leaf_paths(root) only once per process.
def leaf_paths(node, paths = [], stacks = {}, visited = set()):
    visited.add(node)
    if node.left is None and node.right is None:
        # a leaf closes one path for every leaf seen so far
        for leaf, path in stacks.items():
            path.append(node)
            paths.append(path[:])
        stacks[node] = [node]
    else:
        for leaf, path in stacks.items():
            if len(path) > 1 and path[-2] == node:
                path.pop()  # walking back up: drop the node we just left
            else:
                path.append(node)
    if node.left and node.left not in visited:
        leaf_paths(node.left, paths, stacks, visited)
    elif node.right and node.right not in visited:
        leaf_paths(node.right, paths, stacks, visited)
    elif node.parent:
        leaf_paths(node.parent, paths, stacks, visited)
    return paths

for path in leaf_paths(root):
    print([n.key for n in path])
An output for your tree will be (in Python 3.7+, where dicts keep insertion order):
[1, 2, 6]
[1, 2, 3, 4, 5]
[6, 2, 3, 4, 5]
The idea is to track all visited leaves while traversing a tree. And to keep stack of paths for each leaf. So here is memory/performance tradeoff.
Draw the tree. Let v be a vertex and p its parent. The length of the longest path including v but not p = (height of left subtree of v) + (height of right subtree of v).
The maximum over all v is the longest path in the graph. You can calculate this in O(n):
First calculate all the intermediate heights. Start at the leaves and work up: (height below v) = 1 + max(height below left child, height below right child)
Then calculate the sum (height of left subtree of v) + (height of right subtree of v) for each vertex v, and take the maximum. This is the length of longest path in the graph.
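A compact Python sketch of this height-based computation follows. Here the height of a subtree is taken to be the number of nodes on its longest downward path (0 for an empty subtree), so that the sum of the two child heights is the number of edges on the longest path bending at v. The `Node` class is a stand-in defined only for this example, and the post-order recursion plays the role of "start at the leaves and work up".

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def longest_path_length(root):
    """Number of edges on the longest path in a binary tree (0 if empty)."""
    best = 0

    def height(v):
        nonlocal best
        if v is None:
            return 0
        hl, hr = height(v.left), height(v.right)
        best = max(best, hl + hr)  # longest path whose top node is v
        return 1 + max(hl, hr)

    height(root)
    return best


# Root 3 with left child 2 (children 1 and 6) and right child 4 (right child 5).
tree = Node(3, Node(2, Node(1), Node(6)), Node(4, None, Node(5)))
print(longest_path_length(tree))  # 4, e.g. the path 1 - 2 - 3 - 4 - 5
```

This gives only the length of a longest path in O(n); listing every leaf-to-leaf path can still take quadratic time, as the other answers point out.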
Two BSTs (Binary Search Trees) are given. How to find largest common sub-tree in the given two binary trees?
EDIT 1:
Here is what I have thought:
Let r1 = current node of 1st tree
    r2 = current node of 2nd tree
There are some cases I think we need to consider:
Case 1 : r1.data < r2.data
    2 subproblems to solve:
    - first, check r1 and r2.left
    - second, check r1.right and r2
Case 2 : r1.data > r2.data
    2 subproblems to solve:
    - first, check r1.left and r2
    - second, check r1 and r2.right
Case 3 : r1.data == r2.data
    Again, 2 cases to consider here:
    (a) current node is part of largest common BST
        compute common subtree size rooted at r1 and r2
    (b) current node is NOT part of largest common BST
        2 subproblems to solve:
        - first, solve r1.left and r2.left
        - second, solve r1.right and r2.right
I can think of the cases we need to check, but I am not able to code it as of now. And it is NOT a homework problem. Does it look like one?
Just hash the children and key of each node and look for duplicates. This would give a linear expected time algorithm. For example, see the following pseudocode, which assumes that there are no hash collisions (dealing with collisions would be straightforward):
ret = -1
// T is a tree node, H is a hash set, and first is a boolean flag
hashTree(T, H, first):
    if (T is null):
        return 0 // leaf case
    h = hash(hashTree(T.left, H, first), hashTree(T.right, H, first), T.key)
    if (first):
        // store hashes of T1's nodes in the set H
        H.insert(h)
    else:
        // check for hashes of T2's nodes in the set H containing T1's nodes
        if H.contains(h):
            ret = max(ret, size(T)) // size is recursive and memoized to get O(n) total time
    return h

H = {}
hashTree(T1, H, true)
hashTree(T2, H, false)
return ret
Note that this is assuming the standard definition of a subtree of a BST, namely that a subtree consists of a node and all of its descendants.
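A runnable Python sketch of the same idea is below. To sidestep hash collisions it swaps the hash function for a dictionary that assigns a small integer id to every distinct (left id, right id, key) signature, which is a substitution of mine rather than part of the pseudocode above. The subtree size is returned alongside the id instead of being memoized separately. The `Node` class is a stand-in for illustration.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def largest_common_subtree_size(t1, t2):
    """Size of the largest subtree (a node plus all of its descendants) that
    occurs in both trees, or -1 if the trees share no subtree."""
    ids = {}       # signature -> integer id (stands in for the hash value)
    in_t1 = set()  # ids of all subtrees of t1
    best = -1

    def walk(node, first):
        """Return (id, size) of the subtree at node; id 0 is the empty tree."""
        nonlocal best
        if node is None:
            return 0, 0
        lid, lsize = walk(node.left, first)
        rid, rsize = walk(node.right, first)
        nid = ids.setdefault((lid, rid, node.key), len(ids) + 1)
        nsize = lsize + rsize + 1
        if first:
            in_t1.add(nid)
        elif nid in in_t1:
            best = max(best, nsize)
        return nid, nsize

    walk(t1, True)
    walk(t2, False)
    return best


a = Node(5, Node(3, Node(2), Node(4)), Node(8))
b = Node(6, Node(3, Node(2), Node(4)), Node(9, Node(8)))
print(largest_common_subtree_size(a, b))  # 3 (the subtree rooted at 3)
```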
Assuming there are no duplicate values in the trees:
LargestSubtree(Tree tree1, Tree tree2)
    Int bestMatch := 0
    Int bestMatchCount := 0
    For each Node n in tree1 //should iterate breadth-first
        //possible optimization: we can skip every node that is part of each subtree we find
        Node n2 := BinarySearch(tree2, n.value)
        Int matchCount := CountMatches(n, n2)
        If (matchCount > bestMatchCount)
            bestMatch := n.value
            bestMatchCount := matchCount
        End
    End
    Return ExtractSubtree(BinarySearch(tree1, bestMatch), BinarySearch(tree2, bestMatch))
End

CountMatches(Node n1, Node n2)
    If (!n1 || !n2 || n1.value != n2.value)
        Return 0
    End
    Return 1 + CountMatches(n1.left, n2.left) + CountMatches(n1.right, n2.right)
End

ExtractSubtree(Node n1, Node n2)
    If (!n1 || !n2 || n1.value != n2.value)
        Return nil
    End
    Node result := New Node(n1.value)
    result.left := ExtractSubtree(n1.left, n2.left)
    result.right := ExtractSubtree(n1.right, n2.right)
    Return result
End
To briefly explain, this is a brute-force solution to the problem. It does a breadth-first walk of the first tree. For each node, it performs a BinarySearch of the second tree to locate the corresponding node in that tree. Then using those nodes it evaluates the total size of the common subtree rooted there. If the subtree is larger than any previously found subtree, it remembers it for later so that it can construct and return a copy of the largest subtree when the algorithm completes.
This algorithm does not handle duplicate values. It could be extended to do so by using a BinarySearch implementation that returns a list of all nodes with the given value, instead of just a single node. Then the algorithm could iterate this list and evaluate the subtree for each node and then proceed as normal.
The running time of this algorithm is O(n log m) (it traverses n nodes in the first tree, and performs a log m binary-search operation for each one), putting it on par with most common sorting algorithms. The space complexity is O(1) while running (nothing allocated beyond a few temporary variables), and O(n) when it returns its result (because it creates an explicit copy of the subtree, which may not be required depending upon exactly how the algorithm is supposed to express its result). So even this brute-force approach should perform reasonably well, although as noted by other answers an O(n) solution is possible.
There are also possible optimizations that could be applied to this algorithm, such as skipping over any nodes that were contained in a previously evaluated subtree. Because the tree-walk is breadth-first we know that any node that was part of some prior subtree cannot ever be the root of a larger subtree. This could significantly improve the performance of the algorithm in certain cases, but the worst-case running time (two trees with no common subtrees) would still be O(n log m).
I believe that I have an O(n + m)-time, O(n + m) space algorithm for solving this problem, assuming the trees are of size n and m, respectively. This algorithm assumes that the values in the trees are unique (that is, each element appears in each tree at most once), but they do not need to be binary search trees.
The algorithm is based on dynamic programming and works with the following intuition: suppose that we have some tree T with root r and children T1 and T2. Suppose the other tree is S. Now, suppose that we know the maximum common subtree of T1 and S and of T2 and S. Then the maximum subtree of T and S either
1. Is completely contained in T1 and r.
2. Is completely contained in T2 and r.
3. Uses all of T1, T2, and r.
Therefore, we can compute the maximum common subtree (I'll abbreviate this as MCS) as follows. If neither MCS(T1, S) nor MCS(T2, S) has the root of T1 or T2 as its root, then the MCS we can get from T and S is given by the larger of MCS(T1, S) and MCS(T2, S). If exactly one of MCS(T1, S) and MCS(T2, S) has the root of T1 or T2 as a root (assume w.l.o.g. that it's T1), then look up r in S. If r has the root of T1 as a child, then we can extend that tree by a node and the MCS is given by the larger of this augmented tree and MCS(T2, S). Otherwise, if both MCS(T1, S) and MCS(T2, S) have the roots of T1 and T2 as roots, then look up r in S. If it has as a child the root of T1, we can extend the tree by adding in r. If it has as a child the root of T2, we can extend that tree by adding in r. Otherwise, we just take the larger of MCS(T1, S) and MCS(T2, S).
The formal version of the algorithm is as follows:
1. Create a new hash table mapping nodes in tree S from their value to the corresponding node in the tree. Then fill this table in with the nodes of S by doing a standard tree walk in O(m) time.
2. Create a new hash table mapping nodes in T from their value to the size of the maximum common subtree of the tree rooted at that node and S. Note that this means that the MCS-es stored in this table must be directly rooted at the given node. Leave this table empty.
3. Create a list of the nodes of T using a postorder traversal. This takes O(n) time. Note that this means that we will always process all of a node's children before the node itself; this is very important!
4. For each node v in the postorder traversal, in the order they were visited:
   4.1. Look up the corresponding node in the hash table for the nodes of S.
   4.2. If no node was found, set the size of the MCS rooted at v to 0.
   4.3. If a node v' was found in S:
      4.3.1. If neither of the children of v' match the children of v, set the size of the MCS rooted at v to 1.
      4.3.2. If exactly one of the children of v' matches a child of v, set the size of the MCS rooted at v to 1 plus the size of the MCS of the subtree rooted at that child.
      4.3.3. If both of the children of v' match the children of v, set the size of the MCS rooted at v to 1 plus the size of the MCS of the left subtree plus the size of the MCS of the right subtree.
5. (Note that step (4) runs in expected O(n) time, since it visits each node in T exactly once, makes O(n) hash table lookups, makes n hash table inserts, and does a constant amount of processing per node).
6. Iterate across the hash table and return the maximum value it contains. This step takes O(n) time as well. If the hash table is empty (T has size zero), return 0.
Overall, the runtime is O(n + m) time expected and O(n + m) space for the two hash tables.
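The procedure above can be sketched in Python roughly as follows. This is an illustration under a few assumptions of mine: values are unique within each tree, a child of v "matches" a child of v' only when it sits on the same side (left with left, right with right) and has the same value, and the `Node` class and function name are made up for the example.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def mcs_size(t, s):
    """Size of the largest common piece of t and s that is rooted at equal
    values and grows downwards through matching children."""
    # Index the nodes of s by value.
    s_index = {}
    stack = [s] if s else []
    while stack:
        node = stack.pop()
        s_index[node.key] = node
        stack.extend(c for c in (node.left, node.right) if c)

    # List the nodes of t so that reversing the list puts children first.
    order = []
    stack = [t] if t else []
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(c for c in (node.left, node.right) if c)

    size = {}  # value -> size of the common piece rooted at that node of t

    def matched(child, s_child):
        if child and s_child and child.key == s_child.key:
            return size[child.key]
        return 0

    for v in reversed(order):
        sv = s_index.get(v.key)
        if sv is None:
            size[v.key] = 0
        else:
            size[v.key] = 1 + matched(v.left, sv.left) + matched(v.right, sv.right)

    return max(size.values(), default=0)


t = Node(5, Node(3, Node(2), Node(4)), Node(8, Node(7)))
s = Node(5, Node(3, Node(2), Node(1)), Node(9, Node(8, Node(7))))
print(mcs_size(t, s))  # 3 (the nodes 5, 3 and 2)
```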
To see a correctness proof, we proceed by induction on the height of the tree T. As a base case, if T has height zero, then we just return zero because the loop in (4) does not add anything to the hash table. If T has height one, then either its single node exists in S or it does not. If it exists in S, then it can't have any children at all, so we execute branch 4.3.1 and say that it has size one. Step (6) then reports that the MCS has size one, which is correct. If it does not exist, then we execute 4.2, putting zero into the hash table, so step (6) reports that the MCS has size zero as expected.
For the inductive step, assume that the algorithm works for all trees of height k' < k and consider a tree of height k. During our postorder walk of T, we will visit all of the nodes in the left subtree, then in the right subtree, and finally the root of T. By the inductive hypothesis, the table of MCS values will be filled in correctly for the left subtree and right subtree, since they have height ≤ k - 1 < k. Now consider what happens when we process the root. If the root doesn't appear in the tree S, then we put a zero into the table, and step (6) will pick the largest MCS value of some subtree of T, which must be fully contained in either its left subtree or right subtree. If the root appears in S, then we compute the size of the MCS rooted at the root of T by trying to link it with the MCS-es of its two children, which (inductively!) we've computed correctly.
Whew! That was an awesome problem. I hope this solution is correct!
EDIT: As was noted by @jonderry, this will find the largest common subgraph of the two trees, not the largest common complete subtree. However, you can restrict the algorithm to only work on subtrees quite easily. To do so, you would modify the inner code of the algorithm so that it records a subtree of size 0 if both subtrees aren't present with nonzero size. A similar inductive argument will show that this will find the largest complete subtree.
Though, admittedly, I like the "largest common subgraph" problem a lot more. :-)
The following algorithm computes all the largest common subtrees of two binary trees (with no assumption that it is a binary search tree). Let S and T be two binary trees. The algorithm works from the bottom of the trees up, starting at the leaves. We start by identifying leaves with the same value. Then consider their parents and identify nodes with the same children. More generally, at each iteration, we identify nodes provided they have the same value and their children are isomorphic (or isomorphic after swapping the left and right children). This algorithm terminates with the collection of all pairs of maximal subtrees in T and S.
Here is a more detailed description:
Let S and T be two binary trees. For simplicity, we may assume that for each node n, the left child has value <= the right child. If exactly one child of a node n is NULL, we assume the right node is NULL. (In general, we consider two subtrees isomorphic if they are up to permutation of the left/right children for each node.)
(1) Find all leaf nodes in each tree.
(2) Define a bipartite graph B with edges from nodes in S to nodes in T, initially with no edges. Let R(S) and R(T) be empty sets. Let R(S)_next and R(T)_next also be empty sets.
(3) For each leaf node in S and each leaf node in T, create an edge in B if the nodes have the same value. For each edge created from nodeS in S to nodeT in T, add all the parents of nodeS to the set R(S) and all the parents of nodeT to the set R(T).
(4) For each node nodeS in R(S) and each node nodeT in R(T), draw an edge between them in B if they have the same value AND
{
(i): nodeS->left is connected to nodeT->left and nodeS->right is connected to nodeT->right, OR
(ii): nodeS->left is connected to nodeT->right and nodeS->right is connected to nodeT->left, OR
(iii): nodeS->left is connected to nodeT->left and nodeS->right == NULL and nodeT->right == NULL
}
(5) For each edge created in step (4), add their parents to R(S)_next and R(T)_next.
(6) If (R(S)_next) is nonempty {
(i) swap R(S) and R(S)_next and swap R(T) and R(T)_next.
(ii) Empty the contents of R(S)_next and R(T)_next.
(iii) Return to step (4).
}
When this algorithm terminates, R(S) and R(T) contain the roots of all maximal subtrees in S and T. Furthermore, the bipartite graph B identifies all pairs of nodes in S and nodes in T that give isomorphic subtrees.
I believe this algorithm has complexity O(n log n), where n is the total number of nodes in S and T, since the sets R(S) and R(T) can be stored in BSTs ordered by value; however, I would be interested to see a proof.