Sum up child values and save the values calculated in intermediate steps - binary-tree

struct node {
    int value;
    struct node* left;
    struct node* right;
    int left_sum;
    int right_sum;
};
In a binary tree, from a particular node, there is a simply recursive algorithm to sum up all its child values. Is there a way to save the values calculated in the intermediate steps and store them as left_sum and right_sum in child nodes?
Will it be easier to do this bottom up by adding a struct node* parent link to the node definition?

No, this is clearly an exercise in recursion. Think about what the sum means. It's zero plus the "sum of all values from the root down".
Interestingly enough, the "sum of all values from the root down" is the value of the root node plus the "sum of all values from its left node down" plus the "sum of all values from its right node down".
Hopefully, you can see where I'm going here.
The essence of recursion is to define an operation in terms of similar, simpler, operations with a terminating condition.
The terminating condition, in this case, is the leaf nodes of the tree or, to make the code simpler, beyond the leaf nodes.
Examine the following pseudo-code:
def sumAllNodes (node):
    if node == NULL:
        return 0
    return node.value + sumAllNodes (node.left) + sumAllNodes (node.right)

fullSum = sumAllNodes (rootnode)
That's really all there is to it. With the following tree:
        __A(9)__
       /        \
     B(3)       C(2)
    /    \         \
 D(21)   E(7)      F(1)
Using the pseudo-code, the sum is the value of A (9) plus the sums of the left and right subtrees.
The left subtree of A is the value of B (3) plus the sums of its left and right subtrees.
The left subtree of B is the value of D (21) plus the sums of its left and right subtrees.
The left subtree of D is the value of NULL (0).
Later on, the right subtree of A is the value of C (2) plus the sums of its left and right subtrees, its left subtree being empty, its right subtree being F (1).
Because you're doing this recursively, you don't explicitly ever walk your way up the tree. It's the fact that the recursive calls are returning with the summed values which gives that ability. In other words, it happens under the covers.
As for the other part of your question, it's not really useful - though, of course, there may be unstated requirements that I'm not taking into account, because they're, well, ... unstated :-)
Is there a way to save the values calculated in the intermediate steps and store them as left_sum and right_sum in child nodes?
You never actually re-use the sums for a given sub-tree. During a sum calculation, you would calculate the B-and-below subtree only once as part of adding it to A and the C-and-below subtree.
You could store those values so that B contained both the value and the two sums (left and right) - this would mean that every change to the tree would have to propagate itself up to the root as well but it's doable.
Now there are some situations where that may be useful. For example, if the tree itself changes very rarely but you want the sum very frequently, it makes sense performance wise to do it on update so that the cost is amortised across lots of reads.
I sometimes use this method with databases (which are mostly read far more often than written) but it's unusual to see it in "normal" binary trees.
Another possible optimisation: just maintain the sum as a separate variable in the tree object. Initialise it to zero then, whenever you add a node, add its value to the sum.
When you delete a node, subtract its value from the sum. That gives you your very fast O(1) "return sum" function without having to propagate upwards on update.
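For instance, a minimal sketch of that running-total idea (the class and names are mine, and the tree storage is stubbed out; a real implementation would do BST inserts and deletes where the comments say so):

class SumTree:
    def __init__(self):
        self.values = []   # stand-in for the real tree storage
        self.total = 0     # running sum of all node values

    def insert(self, value):
        self.values.append(value)  # a real tree would do a BST insert here
        self.total += value

    def delete(self, value):
        self.values.remove(value)  # a real tree would do a BST delete here
        self.total -= value

    def sum(self):
        return self.total          # O(1), no traversal needed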
The downside is that you only have a sum for the tree as a whole but I'm having a hard time coming up with a valid use case for needing the sum of subtrees. If you have such a use case, then I'd go for something like:
def updateAllNodes (node):
    if node == NULL:
        return 0
    node.leftSum = updateAllNodes (node.left)
    node.rightSum = updateAllNodes (node.right)
    return node.value + node.leftSum + node.rightSum

change the tree somehow (possibly many times)
fullSum = updateAllNodes (root)
In other words, just update the entire tree after each change (or batch the changes then update if you know there's quite a few changes happening). This will probably be a little simpler than trying to do it as part of the tree update itself.
You can even use a separate dirtyFlag which is set to true whenever the tree changes and set to false whenever you calculate and store the sum. Then use that in the sum calculation code to only do the recalc if it's dirty (in other words, a cache of the sums).
That way, code like:
fullSum = updateAllNodes (root)
fullSum = updateAllNodes (root)
fullSum = updateAllNodes (root)
fullSum = updateAllNodes (root)
fullSum = updateAllNodes (root)
will only incur a cost on the first invocation. The other four should be blindingly fast since the sum is cached.
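A sketch of that dirty-flag cache (my code; here the flag guards the plain recursive sum, but the same wrapper works around updateAllNodes):

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def sum_all_nodes(node):
    if node is None:
        return 0
    return node.value + sum_all_nodes(node.left) + sum_all_nodes(node.right)

class CachedSumTree:
    def __init__(self, root=None):
        self.root = root
        self.dirty = True       # no sum cached yet
        self.cached_sum = 0

    def on_tree_changed(self):
        self.dirty = True       # call this from every mutating operation

    def full_sum(self):
        if self.dirty:          # recalc only if the tree changed
            self.cached_sum = sum_all_nodes(self.root)
            self.dirty = False
        return self.cached_sum  # repeat calls are O(1)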


Job Interview Question Using Trees, What data to save?

I was solving the following job interview question and solved most of it but failed at the last requirement.
Q: Build a data structure which supports the following functions:
Init - Initialise an empty DS. O(1) time complexity.
SetPositiveInDay(d,x) - Record in the DS that on day d exactly x new people were infected with covid-19. O(log n) time complexity.
WorseBefore(d) - Among the days inserted into the DS that are smaller than d, return the last one which has more newly infected people than day d. O(log n) time complexity.
For example:
Init()
SetPositiveInDay(1,10)
SetPositiveInDay(2,20)
SetPositiveInDay(3,15)
SetPositiveInDay(5,17)
SetPositiveInDay(23,180)
SetPositiveInDay(8,13)
SetPositiveInDay(13,18)
WorstBefore(13) // Returns day #2
SetPositiveInDay(10,19)
WorstBefore(13) // Returns day #10
Important note: you can't assume that days will be entered in order, and you can't assume that there won't be "gaps" between days. (Some days may not be saved in the DS while days after them are.)
What I did?
I used an AVL tree (I could use a 2-3 tree too).
For each node I have:
Sick - Number of newly infected people on that day.
maxLeftSick - Max number of infected people in the left subtree.
maxRightSick - Max number of infected people in the right subtree.
When inserting a new node I made sure that no data gets lost during rotations, and for each node from the new one up to the root I updated maxLeftSick and maxRightSick.
But I wasn't successful implementing WorseBefore(d).
Where to search?
First you need to find the node (call it node) corresponding to d in the tree ordered by days. Let x = Sick(node). This can be done in O(log n).
If maxLeftSick(node) > x, the solution must be in the left subtree of node. Search for the solution there and return the answer. This can be done in O(log n) - see below.
Otherwise, traverse the tree upwards towards the root, starting from node, until you find the first node nextPredecessor satisfying this property (this takes O(log n)):
nextPredecessor is smaller than node, and either
1. Sick(nextPredecessor) > x, or
2. maxLeftSick(nextPredecessor) > x.
If no such node exists, we give up. In case 1, just return nextPredecessor since that is the best solution.
In case 2, we know that the solution must be in the left subtree of nextPredecessor, so search there and return the answer. Again, this takes O(log n) - see below.
Note that there is no need to search in the right subtree of nextPredecessor since the only nodes that are smaller than node in that subtree would be the left subtree of node itself, and we have already excluded that.
Note also that it is not necessary to traverse further up the tree than nextPredecessor since those nodes are even smaller, and we are looking for the largest node satisfying all constraints.
How to search?
OK, so how do we search for the solution in a subtree? Finding the largest day within a subtree rooted in q that is worse than an infection number x is simple using the maxLeftSick and maxRightSick information:
If q has a right child and maxRightSick(q) > x, search in the right subtree of q.
Otherwise, if Sick(q) > x, return Day(q): q itself has the largest remaining day in this subtree.
Otherwise, if q has a left child and maxLeftSick(q) > x, search in the left subtree of q.
Otherwise there is no solution within the subtree q.
We are effectively using maxLeftSick and maxRightSick to prune the search tree to include only "worse" nodes, and within that pruned tree we get the rightmost node, i.e. the one with the largest day.
It is easy to see that this algorithm runs in O(log n) where n is the total number of nodes since the number of steps is bounded by the height of the tree.
Pseudocode
Here is the pseudocode (assuming maxLeftSick and maxRightSick return -1 if no corresponding child node exists):
// Returns the largest day smaller than d such that its
// infection number is larger than the infection number on day d.
// Returns -1 if no such day exists.
int WorstBefore(int d) {
    node = find(d);

    // try to find the solution in the left subtree
    if (maxLeftSick(node) > Sick(node)) {
        return FindLastWorseThan(node->left, Sick(node));
    }

    // move up towards the root until we find the first node
    // that is smaller than `node` and such that
    // Sick(nextPredecessor) > Sick(node) or
    // maxLeftSick(nextPredecessor) > Sick(node).
    nextPredecessor = findNextPredecessor(node);
    if (nextPredecessor == null) return -1;

    // Case 1
    if (Sick(nextPredecessor) > Sick(node)) return Day(nextPredecessor);

    // Case 2: maxLeftSick(nextPredecessor) > Sick(node)
    return FindLastWorseThan(nextPredecessor->left, Sick(node));
}

// Finds the latest day within the subtree rooted at q where the
// infection number is larger than x. Runs in O(log(size(q))).
int FindLastWorseThan(Node q, int x) {
    if (maxRightSick(q) > x) return FindLastWorseThan(q->right, x);
    if (Sick(q) > x) return Day(q);
    if (maxLeftSick(q) > x) return FindLastWorseThan(q->left, x);
    return -1;
}
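The pseudocode leaves findNextPredecessor undefined. Assuming parent pointers and per-node fields sick and maxLeftSick (my naming; the question's structure has no parent link, in which case you can remember the search path instead, as the next answer does), a sketch in Python:

def find_next_predecessor(node):
    # An ancestor holds an earlier day exactly when we reach it from its
    # right child; return the first such ancestor that is worse itself
    # (case 1) or has a worse day in its left subtree (case 2).
    x = node.sick
    current = node
    while current.parent is not None:
        p = current.parent
        if current is p.right and (p.sick > x or p.maxLeftSick > x):
            return p
        current = p
    return None  # no earlier day is worse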
First of all, your chosen data structure looks fine to me. You did not mention it explicitly, but I assume that the "key" you use in the AVL tree is the day number, i.e. an in-order traversal of the tree would list the nodes in their chronological order.
I would just suggest a cosmetic change: store the maximum value of sick in the node itself, so that you don't have two similar pieces of information (maxLeftSick and maxRightSick) stored in one node instance. Move those two pieces of information to the child nodes, so that your node.maxLeftSick is actually stored in node.left.maxSick, and similarly node.maxRightSick is stored in node.right.maxSick. This is of course not done when that child does not exist, but then we don't need that information either. In your structure maxLeftSick would be 0 when left is not defined; in my proposed structure you would not have that value - the 0 would follow naturally from the fact that there is no left child. In my proposal, the root node would carry a value in maxSick which is not present in yours: the maximum over the whole tree. That value would not really be used, but it is just there to make the structure consistent throughout the tree.
So you would just store one maxSick, which considers the current node's sick value also in that maximum. The processing you do during rotations will need to change accordingly, but will not become more complex.
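For the maintenance itself, a single bottom-up recomputation per affected node is enough; a sketch (with the field names proposed above):

def recompute_max_sick(node):
    # maxSick covers the node's own value and both subtrees; call this
    # for every node whose children changed (insertion, rotations),
    # from the lowest changed node up to the root.
    m = node.sick
    if node.left is not None:
        m = max(m, node.left.maxSick)
    if node.right is not None:
        m = max(m, node.right.maxSick)
    node.maxSick = m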
I will assume that your AVL tree is single-threaded, i.e. you don't keep track of parent-pointers. So create a find method which will return the path to the node to be found. For instance, in Python syntax, it could look like this:
def find(self, day):
    node = self.root
    path = []  # an array of nodes
    while node:
        path.append(node)
        if node.day == day:  # bingo
            return path
        if day < node.day:
            node = node.left
        else:
            node = node.right
    return None  # day is not in the tree
Then the worstBefore method could look like this:
def worstBefore(self, day):
    path = self.find(day)
    if not path:
        return  # day not found
    # get number of sick people on that day:
    sick = path[-1].sick
    # look for the most recent earlier day with a greater number of sick
    while path:
        node = path.pop()  # walk upward, starting with found node
        if node.day < day and node.sick > sick:
            return node.day
        # dive into a left subtree only when it lies entirely before `day`
        if node.day <= day and node.left and node.left.maxSick > sick:
            # we will find the result in this subtree
            node = node.left
            while True:
                if node.right and node.right.maxSick > sick:
                    node = node.right
                elif node.sick > sick:  # bingo
                    return node.day
                else:
                    node = node.left
So the path returned by the find method is used to get the parents of a node when you need to backtrack upwards in the tree along that path.
If along that path you find a left child (one not itself on the path) whose maxSick is greater, then you know that the targeted node must be in that subtree. It is then a matter of walking down that subtree in a controlled way, choosing the right child while it still has a greater maxSick. Otherwise check the current node's sick value and return its day if that value is greater. Otherwise go left, and repeat.
While there is no such left subtree, go up along the path. If a parent is a match, return it (make sure to verify the day number). Keep checking for left subtrees that have a larger maxSick.
This runs in O(log n) because you first walk zero or more steps upward and then zero or more steps downward (in a left subtree).
You can see your example scenario run on repl.it. There I focussed on this question, and didn't implement the rotations.

How to adapt Fenwick tree to answer range minimum queries

Fenwick tree is a data structure that gives an efficient way to answer two main queries:
add a value at a particular index of an array: update(index, value)
find the sum of elements from 1 to N: find(n)
Both operations are done in O(log(n)) time and I understand the logic and implementation. It is not hard to implement a bunch of other operations, like finding the sum from N to M.
I wanted to understand how to adapt a Fenwick tree for RMQ (range minimum queries). It is obvious how to change the Fenwick tree for the first two operations, but I am failing to figure out how to find the minimum on the range from N to M.
After searching for solutions, the majority of people think that this is not possible, and a small minority claims that it actually can be done (approach1, approach2).
The first approach (written in Russian; based on my Google Translate it has zero explanation and only two functions) relies on three arrays (initial, left and right) and, upon my testing, was not working correctly for all possible test cases.
The second approach requires only one array and, based on the claims, runs in O(log^2(n)); it also has close to no explanation of why and how it should work. I have not tried to test it.
In light of controversial claims, I wanted to find out whether it is possible to augment Fenwick tree to answer update(index, value) and findMin(from, to).
If it is possible, I would be happy to hear how it works.
Yes, you can adapt Fenwick Trees (Binary Indexed Trees) to
Update value at a given index in O(log n)
Query minimum value for a range in O(log n) (amortized)
We need 2 Fenwick trees and an additional array holding the real values for nodes.
Suppose we have the following array:
index  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
value  1  0  2  1  1  3  0  4  2  5  2  2  3  1  0
We wave a magic wand and two trees, BIT1 and BIT2, appear (they are shown in the image whose source is linked at the end of this answer).
Note that in both trees each node represents the minimum value for all nodes within that subtree. For example, in BIT2 node 12 has value 0, which is the minimum value for nodes 12,13,14,15.
Queries
We can efficiently query the minimum value for any range by calculating the minimum of several subtree values and one additional real node value. For example, the minimum value for range [2,7] can be determined by taking the minimum value of BIT2_Node2 (representing nodes 2,3) and BIT1_Node7 (representing node 7), BIT1_Node6 (representing nodes 5,6) and REAL_4 - therefore covering all nodes in [2,7]. But how do we know which sub trees we want to look at?
Query(int a, int b) {
    int val = infinity // always holds the known min value for our range

    // Start traversing the first tree, BIT1, from the beginning of range, a
    int i = a
    while (parentOf(i, BIT1) <= b) {
        val = min(val, BIT2[i]) // Note: traversing BIT1, yet looking up values in BIT2
        i = parentOf(i, BIT1)
    }

    // Start traversing the second tree, BIT2, from the end of range, b
    i = b
    while (parentOf(i, BIT2) >= a) {
        val = min(val, BIT1[i]) // Note: traversing BIT2, yet looking up values in BIT1
        i = parentOf(i, BIT2)
    }

    val = min(val, REAL[i]) // Explained below
    return val
}
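The parentOf helper is where the binary indexing comes in. On my reading of the scheme (the answer doesn't spell it out), a BIT1 node i covers [i - lowbit(i) + 1, i] and a BIT2 node i covers [i, i + lowbit(i) - 1], so stepping towards each tree's fictional root just adds or strips the lowest set bit:

def lowbit(i):
    return i & (-i)        # lowest set bit of i

def parent_of_bit1(i):
    return i + lowbit(i)   # towards the fictional root n+1

def parent_of_bit2(i):
    return i - lowbit(i)   # towards the fictional root 0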
It can be mathematically proven that both traversals will end in the same node. That node is a part of our range, yet it is not a part of any subtrees we have looked at. Imagine a case where the (unique) smallest value of our range is in that special node. If we didn't look it up our algorithm would give incorrect results. This is why we have to do that one lookup into the real values array.
To help understand the algorithm I suggest you simulate it with pen & paper, looking up data in the example trees above. For example, a query for range [4,14] would return the minimum of values BIT2_4 (rep. 4,5,6,7), BIT1_14 (rep. 13,14), BIT1_12 (rep. 9,10,11,12) and REAL_8, therefore covering all possible values [4,14].
Updates
Since a node represents the minimum value of itself and its children, changing a node will affect its parents, but not its children. Therefore, to update a tree we start from the node we are modifying and move up all the way to the fictional root node (0 or N+1 depending on which tree).
Suppose we are updating some node in some tree:
If new value < old value, we will always overwrite the value and move up
If new value == old value, we can stop since there will be no more changes cascading upwards
If new value > old value, things get interesting.
If the old value still exists somewhere within that subtree, we are done
If not, we have to find the new minimum value between real[node] and each tree[child_of_node], change tree[node] and move up
Pseudocode for updating node with value v in a tree:
while (node <= n+1) {
    if (v > tree[node]) {
        if (oldValue == tree[node]) {
            v = min(v, real[node])
            for-each child {
                v = min(v, tree[child])
            }
        } else break
    }
    if (v == tree[node]) break
    tree[node] = v
    node = parentOf(node, tree)
}
Note that oldValue is the original value we replaced, whereas v may be reassigned multiple times as we move up the tree.
Binary Indexing
In my experiments Range Minimum Queries were about twice as fast as a Segment Tree implementation, and updates were marginally faster. The main reason for this is using super efficient bitwise operations for moving between nodes. They are very well explained here. Segment Trees are really simple to code, so think about whether the performance advantage is really worth it. The update method of my Fenwick RMQ is 40 lines and took a while to debug. If anyone wants my code I can put it on github. I also produced a brute-force checker and test generators to make sure everything works.
I had help understanding this subject & implementing it from the Finnish algorithm community. Source of the image is http://ioinformatics.org/oi/pdf/v9_2015_39_44.pdf, but they credit Fenwick's 1994 paper for it.
The Fenwick tree structure works for addition because addition is invertible. It doesn't work for minimum, because as soon as you have a cell that's supposed to be the minimum of two or more inputs, you've potentially lost information.
If you're willing to double your storage requirements, you can support RMQ with a segment tree that is constructed implicitly, like a binary heap. For an RMQ with n values, store the n values at locations [n, 2n) of an array. Locations [1, n) are aggregates, with the formula A(k) = min(A(2k), A(2k+1)). Location 2n is an infinite sentinel. The update routine should look something like this.
def update(n, a, i, x): # value[i] = x
    i += n
    a[i] = x
    # update the aggregates
    while i > 1:
        i //= 2
        a[i] = min(a[2*i], a[2*i+1])
The multiplies and divides here can be replaced by shifts for efficiency.
The RMQ pseudocode is more delicate. Here's another untested and unoptimized routine.
def rmq(n, a, i, j): # min(value[i:j])
    i += n
    j += n
    x = inf
    while i < j:
        if i%2 == 0:
            i //= 2
        else:
            x = min(x, a[i])
            i = i//2 + 1
        if j%2 == 0:
            j //= 2
        else:
            x = min(x, a[j-1])
            j //= 2
    return x
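For example, a minimal build plus the two routines above (my own glue code; the layout and the sentinel slot follow the description):

inf = float('inf')

values = [5, 1, 2, 3]
n = len(values)
a = [0] * n + values + [inf]   # values live at [n, 2n); a[2n] is the sentinel
for k in range(n - 1, 0, -1):  # fill the aggregates bottom-up
    a[k] = min(a[2 * k], a[2 * k + 1])

print(rmq(n, a, 1, 3))  # min(value[1:3]) -> 1
update(n, a, 0, 0)      # value[0] = 0
print(rmq(n, a, 0, 2))  # min(value[0:2]) -> 0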

Check if a tree is a mirror image?

Given a binary tree which is huge and cannot be placed in memory, how do you check if the tree is a mirror image?
I got this as an interview question
If a tree is a mirror image of another tree, the inorder traversal of one tree would be the reverse of the other's.
So just do an inorder traversal on the first tree and a reverse inorder traversal on the other, and check if all the elements are the same.
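A lazy-generator sketch of that idea (my code; note it compares values only, so differently shaped trees that happen to produce the same sequences would also pass):

from itertools import zip_longest

def inorder(node):
    if node is not None:
        yield from inorder(node.left)
        yield node.value
        yield from inorder(node.right)

def reverse_inorder(node):
    if node is not None:
        yield from reverse_inorder(node.right)
        yield node.value
        yield from reverse_inorder(node.left)

def is_mirror(t1, t2):
    # Generators keep only O(height) nodes in memory at a time, which is
    # what matters when the tree itself doesn't fit in memory.
    missing = object()
    return all(a == b for a, b in
               zip_longest(inorder(t1), reverse_inorder(t2), fillvalue=missing))

# For "is this tree its own mirror image?", call is_mirror(root, root).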
I can't take full credit for this reply of course; a handful of my colleagues helped with some assumptions and for poking holes in my original idea. Much thanks to them!
Assumptions
We can't have the entire tree in memory, so it's not ideal to use recursion. Let's assume, for simplicity's sake, that we can only hold a maximum of two nodes in memory.
We know n, the total number of levels in our tree.
We can perform seeks on the data with respect to the character or line position it's in.
The data that is on disk is ordered by depth. That is to say, the first entry on disk is the root, and the next two are its children, and the next four are its children's children, and so forth.
There are cases in which the data is perfectly mirrored, and cases in which it isn't. Blank data interlaced with non-blank data is considered "acceptable", unless otherwise specified.
We have freedom over using any data type we wish so long as the values can be compared for equivalence. Testing for object equivalence may not be ideal, so let's assume we're comparing primitives.
"Mirrored" means mirrored between the root's children. To use different terminologies, the grandparent's left child is mirrored with its right child, and the left child (parent)'s left child is mirrored with the grandparent's right child's right child. This is illustrated in the graph below; the matching symbols represent the mirroring we want to check for.
             G
      P*            P*
  C1&    C2^    C3^    C4&
Approach
We know how many nodes on each level we should expect when we're reading from disk - a power of two, 2^k. We can establish a double loop to iterate over the total depth of the tree and the count of the nodes in each level. Inside of this, we can simply compare the outermost values for equivalence, and short-circuit if we find an unequal value.
We can determine the location of each outer node by using powers of 2. The leftmost node of any level k will always be 2^k, and the rightmost node will always be 2^(k+1)-1.
Small Proof: Outermost nodes on level 1 are 2 and 3; 2^1 = 2, 2^(1+1)-1 = 2^2-1 = 3. Outermost nodes on level 2 are 4 and 7; 2^2 = 4, 2^(2+1)-1 = 2^3-1 = 7. One could expand this all the way to the nth case.
Pseudocode
int k, i;
for (k = 1; k < n; k++) { // Skip root, trivially mirrored
    for (i = 0; i < pow(2, k) / 2; i++) {
        if (node_retrieve(pow(2, k) + i) != node_retrieve(pow(2, k + 1) - 1 - i)) {
            return false;
        }
    }
}
return true;
Thoughts
This sort of question is a great interview question because, more than likely, they want to see how you would approach this problem. This approach may be horrible, it may be immaculate, but an employer would want you to take your time, draw things on a piece of paper or whiteboard, and ask them questions about how the data is stored, how it can be read, what limitations there are on seeks, etc etc.
It's not the coding aspect that interviewers are interested in, but the problem solving aspect.
Recursion is easy.
struct node {
    struct node *left;
    struct node *right;
    int payload;
};

/* compare() returns nonzero when the payloads differ */
int is_not_mirror(struct node *one, struct node *two)
{
    if (!one && !two) return 0;   /* two empty subtrees mirror each other */
    if (!one) return 1;           /* one empty, the other not */
    if (!two) return 1;
    if (compare(one->payload, two->payload)) return 1;
    if (is_not_mirror(one->left, two->right)) return 1;
    if (is_not_mirror(one->right, two->left)) return 1;
    return 0;
}

Determine if two binary trees are equal

What would be the efficient algorithm to find if two given binary trees are equal - in structure and content?
It's a minor issue, but I'd adapt the earlier solution as follows...
eq(t1, t2) =
t1.data=t2.data && eq(t1.left, t2.left) && eq(t1.right, t2.right)
The reason is that mismatches are likely to be common, and it is better to detect (and stop comparing) early - before recursing further. Of course, I'm assuming a short-circuit && operator here.
I'll also point out that this is glossing over some issues with handling structurally different trees correctly, and with ending the recursion. Basically, there need to be some null checks for t1.left etc. If one tree has a null .left but the other doesn't, you have found a structural difference. If both have null .left, there's no difference, but you have reached a leaf - don't recurse further. Only if both .left values are non-null do you recurse to check the subtree. The same applies, of course, for .right.
You could include checks for e.g. (t1.left == t2.left), but this only makes sense if subtrees can be physically shared (same data structure nodes) for the two trees. This check would be another way to avoid recursing where it is unnecessary - if t1.left and t2.left are the same physical node, you already know that those whole subtrees are identical.
A C implementation might be...
bool tree_compare (const node* t1, const node* t2)
{
    // Same node check - also handles both NULL case
    if (t1 == t2) return true;

    // Gone past leaf on one side check
    if ((t1 == NULL) || (t2 == NULL)) return false;

    // Do data checks and recursion of tree
    return ((t1->data == t2->data) && tree_compare (t1->left,  t2->left)
                                   && tree_compare (t1->right, t2->right));
}
EDIT In response to a comment...
The running time for a full tree comparison using this is most simply stated as O(n) where n is kinda the size of a tree. If you're willing to accept a more complex bound you can get a smaller one such as O(minimum(n1, n2)) where n1 and n2 are the sizes of the trees.
The explanation is basically that the recursive call is only made (at most) once for each node in the left tree, and only made (at most) once for each node in the right tree. As the function itself (excluding recursions) only specifies at most a constant amount of work (there are no loops), the work including all recursive calls can only be as much as the size of the smaller tree times that constant.
You could analyse further to get a more complex but smaller bound using the idea of the intersection of the trees, but big O just gives an upper bound - not necessarily the lowest possible upper bound. It's probably not worthwhile doing that analysis unless you're trying to build a bigger algorithm/data structure with this as a component, and as a result you know that some property will always apply to those trees which may allow you a tighter bound for the larger algorithm.
One way to form a tighter bound is to consider the sets of paths to nodes in both trees. Each step is either an L (left subtree) or an R (right subtree). So the root is specified with an empty path. The right child of the left child of the root is "LR". Define a function "paths (T)" (mathematically - not part of the program) to represent the set of valid paths into a tree - one path for every node.
So we might have...
paths(t1) = { "", "L", "LR", "R", "RL" }
paths(t2) = { "", "L", "LL", "R", "RR" }
The same path specifications apply to both trees. And each recursion always follows the same left/right link for both trees. So the recursion visits the paths in the intersection of these sets, and the tightest bound we can specify using this is the cardinality of that intersection (still with the constant bound on work per recursive call).
For the tree structures above, we do recursions for the following paths...
paths(t1) intersection paths(t2) = { "", "L", "R" }
So our work in this case is bounded to at most three times the maximum cost of non-recursive work in the tree_compare function.
This is normally an unnecessary amount of detail, but clearly the intersection of the path-sets is at most as large as the number of nodes in the smallest original tree. And whether the n in O(n) refers to the number of nodes in one original tree or to the sum of the nodes in both, this is clearly no smaller than either the minimum or our intersection. Therefore O(n) isn't such a tight bound, but it's still a valid upper bound, even if we're a bit vague which size we're talking about.
Modulo stack overflow, something like
eq(t1, t2) =
eq(t1.left, t2.left) && t1.data=t2.data && eq(t1.right, t2.right)
(This generalizes to an equality predicate for all tree-structured algebraic data types - for any piece of structured data, check if each of its sub-parts are equal to each of the other one's sub-parts.)
We can also do any of the traversals (pre-order, post-order or in-order) on both trees and then compare the results. If they are the same, we can be sure of their equivalence.
A more general term for what you are probably trying to accomplish is graph isomorphism. There are some algorithms to do this on that page.
Since it's a proven fact that it is possible to recreate a binary tree as long as we have the following:
The sequence of nodes that are encountered in an in-order traversal.
The sequence of nodes that are encountered in a pre-order OR post-order traversal.
If two binary trees have the same in-order and [pre-order OR post-order] sequences, then they should be equal both structurally and in terms of values.
Each traversal is an O(n) operation. The traversals are done 4 times in total and the results from the same type of traversal are compared.
O(n) * 4 + 2 => O(n)
Hence, the total time complexity would be O(n).
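A sketch of that approach (my code; note that the reconstruction guarantee assumes distinct values, so with duplicate values this can report false positives):

def inorder(t):
    return inorder(t.left) + [t.value] + inorder(t.right) if t else []

def preorder(t):
    return [t.value] + preorder(t.left) + preorder(t.right) if t else []

def trees_equal(t1, t2):
    # Two O(n) traversals per tree plus two O(n) list comparisons: O(n).
    return inorder(t1) == inorder(t2) and preorder(t1) == preorder(t2)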
I would write it as follows. The following works in most functional languages, and even in Python if your datatypes are hashable (e.g. not dictionaries or lists):
topological equality (same structure, ignoring child order, so that Tree(1, Tree(2,3)) == Tree(Tree(2,3), 1)):
tree1 == tree2 means set(tree1.children) == set(tree2.children)
ordered equality:
tree1 == tree2 means tree1.children == tree2.children
(Tree.children is an ordered list of children)
You don't need to handle the base cases (leaves), because equality has been defined for them already.
bool identical(node* root1, node* root2) {
    if (root1 == NULL && root2 == NULL)
        return true;
    if ((root1 == NULL && root2 != NULL) || (root1 != NULL && root2 == NULL))
        return false;
    if (root1->data == root2->data) {
        bool lIdentical = identical(root1->left, root2->left);
        if (!lIdentical)
            return false;
        bool rIdentical = identical(root1->right, root2->right);
        return lIdentical && rIdentical;
    }
    else {
        printf("data1:%d vs data2:%d", root1->data, root2->data);
        return false;
    }
}
I do not know if this is the most efficient, but I think it works.

Hashing a Tree Structure

I've just come across a scenario in my project where I need to compare different tree objects for equality with already known instances, and have considered that some sort of hashing algorithm that operates on an arbitrary tree would be very useful.
Take for example the following tree:
          O
         / \
        /   \
       O     O
      /|\    |
     / | \   |
    O  O  O  O
             / \
            /   \
           O     O
Where each O represents a node of the tree, is an arbitrary object, and has an associated hash function. So the problem reduces to: given the hash codes of the nodes of a tree, and a known structure, what is a decent algorithm for computing a (relatively) collision-free hash code for the entire tree?
A few notes on the properties of the hash function:
The hash function should depend on the hash code of every node within the tree as well as its position.
Reordering the children of a node should distinctly change the resulting hash code.
Reflecting any part of the tree should distinctly change the resulting hash code
If it helps, I'm using C# 4.0 here in my project, though I'm primarily looking for a theoretical solution, so pseudo-code, a description, or code in another imperative language would be fine.
UPDATE
Well, here's my own proposed solution. It has been helped much by several of the answers here.
Each node (sub-tree/leaf node) has the following hash function:
public override int GetHashCode()
{
    int hashCode = unchecked(this.Symbol.GetHashCode() * 31 +
        this.Value.GetHashCode());
    for (int i = 0; i < this.Children.Count; i++)
        hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode());
    return hashCode;
}
The nice thing about this method, as I see it, is that hash codes can be cached and only recalculated when the node or one of its descendants changes. (Thanks to vatine and Jason Orendorff for pointing this out).
Anyway, I would be grateful if people could comment on my suggested solution here - if it does the job well, then great, otherwise any possible improvements would be welcome.
If I were to do this, I'd probably do something like the following:
For each leaf node, compute the concatenation of 0 and the hash of the node data.
For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.
This will lead to a cascade up the tree every time you change anything, but that MAY be a low enough overhead to be worthwhile. If changes are relatively infrequent compared to the number of reads, it may even make sense to go for a cryptographically secure hash.
Edit1: There is also the possibility of adding a "hash valid" flag to each node and simply propagate a "false" up the tree (or "hash invalid" and propagate "true") up the tree on a node change. That way, it may be possible to avoid a complete recalculation when the tree hash is needed and possibly avoid multiple hash calculations that are not used, at the risk of slightly less predictable time to get a hash when needed.
Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions, if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node, with "symbol hash" 30 and "value hash" 25 and a two-node tree, where the root has a "symbol hash" of 0 and a "value hash" of 30 and the child node has a total hash of 25. The examples are entirely invented, I don't know what expected hash ranges are so I can only comment on what I see in the presented code.
Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.
However, if the hash performs decently on expected data, it looks as if it will do the job. It's certainly faster than using a cryptographic hash (as done in the example code listed below).
Edit2: As for specific algorithms and minimum data structure needed, something like the following (Python, translating to any other language should be relatively easy).
#! /usr/bin/env python
import Crypto.Hash.SHA

class Node:
    def __init__ (self, parent=None, contents="", children=None):
        self.parent = parent
        self.valid = False
        self.hash = None
        self.contents = contents
        self.children = children if children is not None else []

    def append_child (self, child):
        child.parent = self
        self.children.append(child)
        self.invalidate()

    def invalidate (self):
        self.valid = False
        if self.parent:
            self.parent.invalidate()

    def gethash (self):
        if self.valid:
            return self.hash
        digester = Crypto.Hash.SHA.new()
        digester.update(self.contents)
        if self.children:
            for child in self.children:
                digester.update(child.gethash())
            self.hash = "1" + digester.hexdigest()
        else:
            self.hash = "0" + digester.hexdigest()
        self.valid = True
        return self.hash

    def setcontents (self, contents):
        self.contents = contents
        self.invalidate()
Okay, after your edit where you've introduced a requirement that the hashing result should be different for different tree layouts, you're left with only one option: traverse the whole tree and write its structure to a single array.
That's done like this: you traverse the tree and dump the operations you do. For an original tree that could be (for a left-child-right-sibling structure):
[1, child, 2, child, 3, sibling, 4, sibling, 5, parent, parent, //we're at root again
sibling, 6, child, 7, child, 8, sibling, 9, parent, parent]
You may then hash the list (that is, effectively, a string) the way you like. As another option, you may even return this list as the result of the hash function, so it becomes a collision-free tree representation.
But adding precise information about the whole structure is not what hash functions usually do. The way proposed computes the hash of every node as well as traversing the whole tree, so you may consider other ways of hashing, described below.
If you don't want to traverse the whole tree:
One algorithm that immediately came to my mind is like this. Pick a large prime number H (that's greater than maximal number of children). To hash a tree, hash its root, pick a child number H mod n, where n is the number of children of root, and recursively hash the subtree of this child.
This seems to be a bad option if trees differ only deeply near the leaves. But at least it should run fast for not very tall trees.
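A sketch of that pick-one-path idea (my code; the 31 multiplier and the 32-bit mask are arbitrary choices of mine):

H = 1000003  # a large prime, greater than the maximal number of children

def quick_hash(node):
    # Hash the root, then recurse into just one child chosen by H mod n.
    if node is None:
        return 0
    h = hash(node.contents)
    if node.children:
        picked = node.children[H % len(node.children)]
        h = (h * 31 + quick_hash(picked)) & 0xFFFFFFFF
    return h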
If you want to hash less elements but go through the whole tree:
Instead of hashing a subtree, you may want to hash layer-wise. I.e. hash the root first, then hash one of its children, then one of the children of the children, etc. So you cover the whole tree instead of one specific path. This makes the hashing procedure slower, of course.
---------- O ------------- layer 0, n=1
           / \
          /   \
-------- O --- O ---------- layer 1, n=2
        /|\    |
       / | \   |
      /  |  \  |
---- O   O   O O ----------- layer 2, n=4
               / \
              /   \
------------ O --- O -------- layer 3, n=2
A node from a layer is picked with the H mod n rule.
The difference between this version and the previous one is that the tree would have to undergo quite an illogical transformation to retain the same hash value.
The usual technique of hashing any sequence is combining the values (or hashes thereof) of its elements in some mathematical way. I don't think a tree would be any different in this respect.
For example, here is the hash function for tuples in Python (taken from Objects/tupleobject.c in the source of Python 2.6):
static long
tuplehash(PyTupleObject *v)
{
    register long x, y;
    register Py_ssize_t len = Py_SIZE(v);
    register PyObject **p;
    long mult = 1000003L;
    x = 0x345678L;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (long)(82520L + len + len);
    }
    x += 97531L;
    if (x == -1)
        x = -2;
    return x;
}
It's a relatively complex combination with constants experimentally chosen for best results for tuples of typical lengths. What I'm trying to show with this code snippet is that the issue is very complex and very heuristic, and the quality of the results probably depends on the more specific aspects of your data - i.e. domain knowledge may help you reach better results. However, for good-enough results you shouldn't look too far. I would guess that taking this algorithm and combining all the nodes of the tree instead of all the tuple elements, plus adding their position into play, will give you a pretty good algorithm.
One option of taking the position into account is the node's position in an inorder walk of the tree.
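A sketch of combining that recipe with in-order positions (my code, loosely transcribing the constants from the C above; the 64-bit mask is my own choice):

def tree_hash(root):
    def inorder(node):
        if node is not None:
            yield from inorder(node.left)
            yield node
            yield from inorder(node.right)

    x, mult = 0x345678, 1000003
    for pos, node in enumerate(inorder(root)):
        # the in-order position enters through the evolving multiplier
        x = ((x ^ hash(node.value)) * mult) & 0xFFFFFFFFFFFFFFFF
        mult += 82520 + 2 * pos
    return x + 97531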
Any time you are working with trees recursion should come to mind:
public override int GetHashCode() {
    int hash = 5381;
    foreach (var node in this.BreadthFirstTraversal()) {
        hash = 33 * hash + node.GetHashCode();
    }
    return hash;
}
The hash function should depend on the hash code of every node within the tree as well as its position.
Check. We are explicitly using node.GetHashCode() in the computation of the tree's hash code. Further, because of the nature of the algorithm, a node's position plays a role in the tree's ultimate hash code.
Reordering the children of a node should distinctly change the resulting hash code.
Check. They will be visited in a different order in the traversal, leading to a different hash code. (Note that if there are two children with the same hash code, you will end up with the same hash code upon swapping the order of those children.)
Reflecting any part of the tree should distinctly change the resulting hash code
Check. Again the nodes would be visited in a different order leading to a different hash code. (Note that there are circumstances where the reflection could lead to the same hash code if every node is reflected into a node with the same hash code.)
The collision-free property of this will depend on how collision-free the hash function used for the node data is.
It sounds like you want a system where the hash of a particular node is a combination of the child node hashes, where order matters.
If you're planning on manipulating this tree a lot, you may want to pay the price in space of storing the hashcode with each node, to avoid the penalty of recalculation when performing operations on the tree.
Since the order of the child nodes matters, a method which might work here would be to combine the node data and children using prime number multiples and addition modulo some large number.
To go for something similar to Java's String hashcode:
Say you have n child nodes.
hash(node) = hash(nodedata) +
             hash(childnode[0]) * 31^(n-1) +
             hash(childnode[1]) * 31^(n-2) +
             <...> +
             hash(childnode[n-1])
Some more detail on the scheme used above can be found here: http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
I can see that if you have a large set of trees to compare, then you could use a hash function to retrieve a set of potential candidates, then do a direct comparison.
A string that would work: just use Lisp syntax to put brackets around the tree and write out the identifier of each node in pre-order. But this is computationally equivalent to a pre-order comparison of the tree, so why not just do that?
I've given 2 solutions: one is for comparing the two trees when you're done (needed to resolve collisions) and the other is for computing the hashcode.
TREE COMPARISON:
The most efficient way to compare will be to simply recursively traverse each tree in a fixed order (pre-order is simple and as good as anything else), comparing the node at each step.
So, just create a Visitor pattern that successively returns the next node in pre-order for a tree; i.e. its constructor can take the root of the tree.
Then, just create two instances of the Visitor that act as generators for the next node in pre-order, i.e. Visitor v1 = new Visitor(root1), Visitor v2 = new Visitor(root2).
Write a comparison function so that a node can compare itself to another node.
Then just visit each node of the trees, comparing, and returning false if comparison fails. i.e.
Module
    Function Compare(Node root1, Node root2)
        Visitor v1 = new Visitor(root1)
        Visitor v2 = new Visitor(root2)
        loop
            Node n1 = v1.next
            Node n2 = v2.next
            if (n1 == null) and (n2 == null) then
                return true
            if (n1 == null) or (n2 == null) then
                return false
            if n1.compare(n2) != 0 then
                return false
        end loop
        // unreachable
    End Function
End Module
HASH CODE GENERATION:
if you want to write out a string representation of the tree, you can use the lisp syntax for a tree, then sample the string to generate a shorter hashcode.
Module
    Function TreeToString(Node n1) : String
        if n1 == null
            return ""
        String s1 = "(" + n1.toString()
        for each child of n1
            s1 = s1 + TreeToString(child)
        return s1 + ")"
    End Function
End Module
The node.toString() can return the unique label/hash code/whatever for that node. Then you can just do a substring comparison on the strings returned by the TreeToString function to determine if the trees are equivalent. For a shorter hashcode, just sample the output of the TreeToString function, i.e. take every 5th character.
I think you could do this recursively: Assume you have a hash function h that hashes strings of arbitrary length (e.g. SHA-1). Now, the hash of a tree is the hash of a string that is created as a concatenation of the hash of the current element (you have your own function for that) and hashes of all the children of that node (from recursive calls of the function).
For a binary tree you would have:
Hash( h(node->data) || Hash(node->left) || Hash(node->right) )
You may need to carefully check if tree geometry is properly accounted for. I think that with some effort you could derive a method for which finding collisions for such trees could be as hard as finding collisions in the underlying hash function.
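A sketch of that construction with a concrete hash (my code, using SHA-1 from Python's hashlib; the empty-subtree marker is my own choice, there to keep the tree geometry in play):

import hashlib

def merkle_hash(node):
    # Hash( h(node.data) || merkle_hash(node.left) || merkle_hash(node.right) )
    if node is None:
        return b"empty"  # fixed marker so the shape affects the digest
    h = hashlib.sha1()
    h.update(hashlib.sha1(node.data).digest())  # node.data assumed to be bytes
    h.update(merkle_hash(node.left))
    h.update(merkle_hash(node.right))
    return h.digest()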
A simple enumeration (in any deterministic order) together with a hash function that depends on when the node is visited should work.
int hash(Node root) {
    ArrayList<Node> worklist = new ArrayList<Node>();
    worklist.add(root);
    int h = 0;
    int n = 0;
    while (!worklist.isEmpty()) {
        Node x = worklist.remove(worklist.size() - 1);
        worklist.addAll(x.children());
        h ^= place_hash(x.hash(), n);
        n++;
    }
    return h;
}

int place_hash(int hash, int place) {
    return (Integer.toString(hash) + "_" + Integer.toString(place)).hashCode();
}
class TreeNode
{
    public static int QualityAgainstPerformance = 3; // tune this for your needs
    public static int PositionMarkConstant = 23498735; // just anything
    public object TargetObject; // the subject of this TreeNode, which has to add its hashcode

    IEnumerable<TreeNode> GetChildParticipants()
    {
        yield return this;
        foreach (var child in Children)
        {
            yield return child;
            foreach (var grandchild in child.GetChildParticipants())
                yield return grandchild;
        }
    }

    IEnumerable<TreeNode> GetParentParticipants()
    {
        TreeNode parent = Parent;
        while (parent != null)
        {
            yield return parent;
            parent = parent.Parent;
        }
    }

    public override int GetHashCode()
    {
        int computed = 0;
        var nodesToCombine =
            (Parent != null ? Parent : this).GetChildParticipants()
                .Take(QualityAgainstPerformance / 2)
                .Concat(GetParentParticipants().Take(QualityAgainstPerformance / 2));
        foreach (var node in nodesToCombine)
        {
            if (ReferenceEquals(node, this))
                computed = AddToMix(computed, PositionMarkConstant);
            computed = AddToMix(computed, node.GetPositionInParent());
            computed = AddToMix(computed, node.TargetObject.GetHashCode());
        }
        return computed;
    }
}
AddToMix is a function which combines the two hashcodes, so the sequence matters.
I don't know exactly what it should be, but you can figure that out. Some bit shifting, rounding, you know...
The idea is that you have to analyse some environment of the node, depending on the quality you want to achieve.
I have to say that your requirements are somewhat against the entire concept of hashcodes.
A hash function's computational complexity should be very limited.
Its computational complexity should not depend linearly on the size of the container (the tree), otherwise it totally breaks hashcode-based algorithms.
Considering the position as a major property of the node's hash function also somewhat goes against the concept of the tree, but it is achievable if you relax the requirement that it HAS to depend on the position.
The overall principle I would suggest is replacing MUST requirements with SHOULD requirements.
That way you can come up with an appropriate and efficient algorithm.
For example, consider building a limited sequence of integer hashcode tokens, and add what you want to this sequence, in the order of preference.
Order of the elements in this sequence is important, it affects the computed value.
for example, for each node you want to compute:
add the hashcode of the underlying object
add the hashcodes of the underlying objects of the nearest siblings, if available. I think even the single left sibling would be enough.
add the hashcode of the underlying object of the parent and its nearest siblings, just like for the node itself, same as 2.
repeat this with the grandparents, to a limited depth.
//--------5------- ancestor depth 2 and it's left sibling;
//-------/|------- ;
//------4-3------- ancestor depth 1 and it's left sibling;
//-------/|------- ;
//------2-1------- this;
the fact that you are adding a direct sibling's underlying object's hashcode gives a positional property to the hash function.
if this is not enough, add the children:
You shouldn't add every child, just some, to give a decent hashcode.
Add the first child, and its first child, and its first child... Limit the depth to some constant, and do not compute anything recursively - just the underlying node objects' hashcodes.
//----- this;
//-----/--;
//----6---;
//---/--;
//--7---;
This way the complexity is linear to the depth of the underlying tree, not the total number of elements.
Now you have a sequence of integers; combine them with a known algorithm, like Ely suggests above.
1,2,...7
This way, you will have a lightweight hash function with a positional property, not dependent on the total size of the tree, not even dependent on the tree depth, and not requiring recomputation of the hash function of the entire tree when you change the tree structure.
I bet these 7 numbers would give a hash distribution near to perfect.
Writing your own hash function is almost always a bug, because you basically need a degree in mathematics to do it well. Hash functions are incredibly nonintuitive, and have highly unpredictable collision characteristics.
Don't try directly combining hashcodes for child nodes - this will magnify any problems in the underlying hash functions. Instead, concatenate the raw bytes from each node in order, and feed this as a byte stream to a tried-and-true hash function. All the cryptographic hash functions can accept a byte stream. If the tree is small, you may want to just create a byte array and hash it in one operation.
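A sketch of that byte-stream approach (my code; node.payload_bytes() is a hypothetical accessor for the node's raw bytes):

import hashlib

def tree_digest(root):
    h = hashlib.sha256()

    def walk(node):
        if node is None:
            h.update(b"()")             # structure marker for empty subtrees
            return
        h.update(b"(")
        h.update(node.payload_bytes())  # hypothetical raw-bytes accessor
        walk(node.left)
        walk(node.right)
        h.update(b")")

    walk(root)
    return h.hexdigest()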
