Related
I was thinking about a recursive algorithm (it's a theoretical question, so it's not important the programming language). It consists of finding the minimum of a set of numbers
I was thinking of this way: let "n" be the number of elements in the set. let's rearrange the set as:
(a, (b, c, ..., z) ).
the function moves from left to right, and the first element is assumed as minimum in the first phase (it's, of course, the 0-th element, a). next steps are defined as follows:
(a, min(b, c, ..., z) ), check if a is still minimum, or if b is to be assumed as minimum, then (a or b, min(c, d, ..., z) ), another check condition, (a or b or c, min(d, e, ..., z)), check condition, etc.
I think the theoretical pseudocode may be as follows:
f(x) {
// base case
if I've reached the last element, assume it's a possible minimum, and check if y < z. then return a value to stop recursive calls.
// inductive steps
if ( f(i-th element) < f(i+1, next element) ) {
/* just assume the current element is the current minimum */
}
}
I'm having trouble with the base case. I don't know how to formalize it. I think I've understood the basic idea about it: it's basically what I've written in the pseudocode, right?
does what I've written so far make sense? Sorry if it's not clear but I'm a beginner, and I'm studying recursion for the first time, and I personally find it confusing. So, I've tried my best to explain it. If it's not clear, let me know, and I'll try to explain it better with different words.
Recursive problems can be hard to visualize. Let's take an example : arr = [3,5,1,6]
This is a relatively small array but still it's not easy to visualize how recursion will work here from start to end.
Tip : Try to reduce the size of the input. This will make it easy to visualize and help you with finding the base case. First decide what our function should do. In our case it finds the minimum number from an array. If our function works for array of size n then it should also work for array of size n-1 (Recursive leap of faith). Now using this we can reduce the size of input until we cannot reduce it any further, which should give us our base case.
Let's use the above example: arr = [3,5,1,6]
Let create a function findMin(arr, start) which takes an array and a start index and returns the minimum number from start index to end of array.
1st Iteration : [3,5,1,6]
// arr[start] = 3, If we can somehow find minimum from the remaining array,
// then we can compare it with current element and return the minimum of the two.
// so our input is now reduced to the remaining array [5,1,6]
2nd Iteration : [5,1,6]
// arr[start] = 5, If we can somehow find minimum from the remaining array,
// then we can compare it with current element and return the minimum of the two.
// so our input is now reduced to the remaining array [1,6]
3rd Iteration : [1,6]
// arr[start] = 1, If we can somehow find minimum from the remaining array,
// then we can compare it with current element and return the minimum of the two.
// so our input is now reduced to the remaining array [6]
4th Iteration : [6]
// arr[start] = 6, Since it is the only element in the array, it is the minimum.
// This is our base case as we cannot reduce the input any further.
// We will simply return 6.
------------ Tracking Back ------------
3rd Iteration : [1,6]
// 1 will be compared with whatever the 4th Iteration returned (6 in this case).
// This iteration will return minimum(1, 4th Iteration) => minimum(1,6) => 1
2nd Iteration : [5,1,6]
// 5 will be compared with whatever the 3th Iteration returned (1 in this case).
// This iteration will return minimum(5, 3rd Iteration) => minimum(5,1) => 1
1st Iteration : [3,5,1,6]
// 3 will be compared with whatever the 2nd Iteration returned (1 in this case).
// This iteration will return minimum(3, 2nd Iteration) => minimum(3,1) => 1
Final answer = 1
function findMin(arr, start) {
if (start === arr.length - 1) return arr[start];
return Math.min(arr[start], findMin(arr, start + 1));
}
const arr = [3, 5, 1, 6];
const min = findMin(arr, 0);
console.log('Minimum element = ', min);
This is a good problem for practicing recursion for beginners. You can also try these problems for practice.
Reverse a string using recursion.
Reverse a stack using recursion.
Sort a stack using recursion.
To me, it's more like this:
int f(int[] x)
{
var minimum = head of X;
if (x has tail)
{
var remainder = f(tail of x);
if (remainder < minimum)
{
minimum = remainder;
}
}
return minimum;
}
You have the right idea.
You've correctly observed that
min_recursive(array) = min(array[0], min_recursive(array[1:]))
The function doesn't care about who's calling it or what's going on outside of it -- it just needs to return the minimum of the array passed in. The base case is when the array has a single value. That value is the minimum of the array so it should just return it. Otherwise find the minimum of the rest of the array by calling itself again and compare the result with the head of the array.
The other answers show some coding examples.
This is a recursive solution to the problem that you posed, using JavaScript:
a = [5,12,3,5,34,12]
const min = a => {
if (!a.length) { return 0 }
if (a.length === 1) { return a[0] }
return Math.min(a[0], min(a.slice(1)))
}
min(a)
Note the approach which is to first detect the simplest case (empty array), then a more complex case (single element array), then finally a recursive call which will reduce more complex cases to functions of simpler ones.
However, you don't need recursion to traverse a one dimensional array.
I have approached it this way:
suppose the nodes are A and B.
start from node A
if node is unvisited, then push it into the stack.
move onto its children.
if we backtrack from any node then pop it from the stack.
continue it until you reach B and then print the stack
Here is a pseudo code that will allow to find a path (any path, not the shortest) between two nodes of a graph if such a path exist
function find_path(G):
s = init_stack()
visited_vertices = set()
stack.push(A)
return rec_function(s, B, visited_vertices)
function rec_function(s, B, visited_vertices):
A = stack.head
if visited_vertices.contains(A) then:
return false
if A == B then:
print_stack(s)
return true
for C in A.children do:
s.push(C)
found_dest = rec_function(s, B)
if found_dest then:
return true
s.pop()
return false
It is not written in a very intuitive manner, since usually we will take advantage of the fact that we use a recursive function to 'hide' the stack within the recursive calls stack.
In order to understand how this code runs, you can take a small example, with a few vertices in the graph (say 5) and run it by hand.
datatype tree = br of tree*int*tree | lf
The tree is valid if the values of branches on the left are always lower than the root and branches on the right are always higher.
For instance:
valid(br(br(lf,2,lf),1,lf)) = false;
valid(br (br (lf, 2, br (lf, 7, lf)), 8, lf)) = true;
I'm looking for somethine like this, except i have no idea how to compare the integer of inner branches to the integer of roots.
fun valid(lf)=true
| valid(br(left,x,right)) = valid(left) andalso valid(right);
EDIT:
So far i've come up with this (but it still doesn't check all the integers against all the upper nodes, just 1 node above.. it's close but no cigar)
fun valid(lf)=true
| valid(br(lf,x,lf)) = true
| valid(br(lf,x,br(left2,z,right2))) = if x<z then valid(br(left2,z,right2)) else false
| valid(br(br(left,y,right),x,lf)) = if y<x then valid(br(left,y,right)) else false
| valid(br(br(left,y,right),x,br(left2,z,right2))) = if y<x andalso x<z then valid(br(left,y,right)) andalso valid(br(left2,z,right2)) else false;
Assuming that this is homework, I'll give you a hint rather than a complete answer. It would help to first define a function, maybe call it all_nodes which has type
tree * (int -> bool) -> bool
this should be a function which, when passed a tree and a Boolean function on integers returns true if all integers in the tree satisfy the predicate.
Then your function valid should have 4 (rather than 2) clauses. For br(left,x,right) to be valid:
1) left must be valid
2) all integer nodes in left must satisfy that they are < x . This can be checked using all_nodes and an appropriate anonymous function
3) right must be valid
4) all integer nodes in right must be > x, which again can be checked by passing an appropriate anonymous function to all_nodes
On Edit: the above will work but entails a certain amount of inefficiency due to applying two rather than one recursive function to each subtree. An alternative approach would be to define a helper function which only applies to trees of the form br(left,x,right). This helper function could return a tuple of type bool*int*int which tells you whether or not the tree is valid, together with the minimum and the maximum int in the tree. The main function would simply take care of the lf pattern and invoke and interpret the helper function. The crucial point is that at the key recursive step, it is enough to check the max of the left and the min of the right against the int at the node. Some care needs to be taken in how the recursive step is formulated so that you don't call the helper on lf, but it is certainly doable.
This question may be old, but I couldn't think of an answer.
Say, there are two lists of different lengths, merging at a point; how do we know where the merging point is?
Conditions:
We don't know the length
We should parse each list only once.
The following is by far the greatest of all I have seen - O(N), no counters. I got it during an interview to a candidate S.N. at VisionMap.
Make an interating pointer like this: it goes forward every time till the end, and then jumps to the beginning of the opposite list, and so on.
Create two of these, pointing to two heads.
Advance each of the pointers by 1 every time, until they meet. This will happen after either one or two passes.
I still use this question in the interviews - but to see how long it takes someone to understand why this solution works.
Pavel's answer requires modification of the lists as well as iterating each list twice.
Here's a solution that only requires iterating each list twice (the first time to calculate their length; if the length is given you only need to iterate once).
The idea is to ignore the starting entries of the longer list (merge point can't be there), so that the two pointers are an equal distance from the end of the list. Then move them forwards until they merge.
lenA = count(listA) //iterates list A
lenB = count(listB) //iterates list B
ptrA = listA
ptrB = listB
//now we adjust either ptrA or ptrB so that they are equally far from the end
while(lenA > lenB):
ptrA = ptrA->next
lenA--
while(lenB > lenA):
prtB = ptrB->next
lenB--
while(ptrA != NULL):
if (ptrA == ptrB):
return ptrA //found merge point
ptrA = ptrA->next
ptrB = ptrB->next
This is asymptotically the same (linear time) as my other answer but probably has smaller constants, so is probably faster. But I think my other answer is cooler.
If
by "modification is not allowed" it was meant "you may change but in the end they should be restored", and
we could iterate the lists exactly twice
the following algorithm would be the solution.
First, the numbers. Assume the first list is of length a+c and the second one is of length b+c, where c is the length of their common "tail" (after the mergepoint). Let's denote them as follows:
x = a+c
y = b+c
Since we don't know the length, we will calculate x and y without additional iterations; you'll see how.
Then, we iterate each list and reverse them while iterating! If both iterators reach the merge point at the same time, then we find it out by mere comparing. Otherwise, one pointer will reach the merge point before the other one.
After that, when the other iterator reaches the merge point, it won't proceed to the common tail. Instead will go back to the former beginning of the list that had reached merge-point before! So, before it reaches the end of the changed list (i.e. the former beginning of the other list), he will make a+b+1 iterations total. Let's call it z+1.
The pointer that reached the merge-point first, will keep iterating, until reaches the end of the list. The number of iterations it made should be calculated and is equal to x.
Then, this pointer iterates back and reverses the lists again. But now it won't go back to the beginning of the list it originally started from! Instead, it will go to the beginning of the other list! The number of iterations it made should be calculated and equal to y.
So we know the following numbers:
x = a+c
y = b+c
z = a+b
From which we determine that
a = (+x-y+z)/2
b = (-x+y+z)/2
c = (+x+y-z)/2
Which solves the problem.
Well, if you know that they will merge:
Say you start with:
A-->B-->C
|
V
1-->2-->3-->4-->5
1) Go through the first list setting each next pointer to NULL.
Now you have:
A B C
1-->2-->3 4 5
2) Now go through the second list and wait until you see a NULL, that is your merge point.
If you can't be sure that they merge you can use a sentinel value for the pointer value, but that isn't as elegant.
If we could iterate lists exactly twice, than I can provide method for determining merge point:
iterate both lists and calculate lengths A and B
calculate difference of lengths C = |A-B|;
start iterating both list simultaneously, but make additional C steps on list which was greater
this two pointers will meet each other in the merging point
Here's a solution, computationally quick (iterates each list once) but uses a lot of memory:
for each item in list a
push pointer to item onto stack_a
for each item in list b
push pointer to item onto stack_b
while (stack_a top == stack_b top) // where top is the item to be popped next
pop stack_a
pop stack_b
// values at the top of each stack are the items prior to the merged item
You can use a set of Nodes. Iterate through one list and add each Node to the set. Then iterate through the second list and for every iteration, check if the Node exists in the set. If it does, you've found your merge point :)
This arguably violates the "parse each list only once" condition, but implement the tortoise and hare algorithm (used to find the merge point and cycle length of a cyclic list) so you start at List A, and when you reach the NULL at the end you pretend it's a pointer to the beginning of list B, thus creating the appearance of a cyclic list. The algorithm will then tell you exactly how far down List A the merge is (the variable 'mu' according to the Wikipedia description).
Also, the "lambda" value tells you the length of list B, and if you want, you can work out the length of list A during the algorithm (when you redirect the NULL link).
Maybe I am over simplifying this, but simply iterate the smallest list and use the last nodes Link as the merging point?
So, where Data->Link->Link == NULL is the end point, giving Data->Link as the merging point (at the end of the list).
EDIT:
Okay, from the picture you posted, you parse the two lists, the smallest first. With the smallest list you can maintain the references to the following node. Now, when you parse the second list you do a comparison on the reference to find where Reference [i] is the reference at LinkedList[i]->Link. This will give the merge point. Time to explain with pictures (superimpose the values on the picture the OP).
You have a linked list (references shown below):
A->B->C->D->E
You have a second linked list:
1->2->
With the merged list, the references would then go as follows:
1->2->D->E->
Therefore, you map the first "smaller" list (as the merged list, which is what we are counting has a length of 4 and the main list 5)
Loop through the first list, maintain a reference of references.
The list will contain the following references Pointers { 1, 2, D, E }.
We now go through the second list:
-> A - Contains reference in Pointers? No, move on
-> B - Contains reference in Pointers? No, move on
-> C - Contains reference in Pointers? No, move on
-> D - Contains reference in Pointers? Yes, merge point found, break.
Sure, you maintain a new list of pointers, but thats not outside the specification. However the first list is parsed exactly once, and the second list will only be fully parsed if there is no merge point. Otherwise, it will end sooner (at the merge point).
I have tested a merge case on my FC9 x86_64, and print every node address as shown below:
Head A 0x7fffb2f3c4b0
0x214f010
0x214f030
0x214f050
0x214f070
0x214f090
0x214f0f0
0x214f110
0x214f130
0x214f150
0x214f170
Head B 0x7fffb2f3c4a0
0x214f0b0
0x214f0d0
0x214f0f0
0x214f110
0x214f130
0x214f150
0x214f170
Note becase I had aligned the node structure, so when malloc() a node, the address is aligned w/ 16 bytes, see the least 4 bits.
The least bits are 0s, i.e., 0x0 or 000b.
So if your are in the same special case (aligned node address) too, you can use these least 4 bits.
For example when travel both lists from head to tail, set 1 or 2 of the 4 bits of the visiting node address, that is, set a flag;
next_node = node->next;
node = (struct node*)((unsigned long)node | 0x1UL);
Note above flags won't affect the real node address but only your SAVED node pointer value.
Once found somebody had set the flag bit(s), then the first found node should be the merge point.
after done, you'd restore the node address by clear the flag bits you had set. while an important thing is that you should be careful when iterate (e.g. node = node->next) to do clean. remember you had set flag bits, so do this way
real_node = (struct node*)((unsigned long)node) & ~0x1UL);
real_node = real_node->next;
node = real_node;
Because this proposal will restore the modified node addresses, it could be considered as "no modification".
There can be a simple solution but will require an auxilary space. The idea is to traverse a list and store each address in a hash map, now traverse the other list and match if the address lies in the hash map or not. Each list is traversed only once. There's no modification to any list. Length is still unknown. Auxiliary space used: O(n) where 'n' is the length of first list traversed.
this solution iterates each list only once...no modification of list required too..though you may complain about space..
1) Basically you iterate in list1 and store the address of each node in an array(which stores unsigned int value)
2) Then you iterate list2, and for each node's address ---> you search through the array that you find a match or not...if you do then this is the merging node
//pseudocode
//for the first list
p1=list1;
unsigned int addr[];//to store addresses
i=0;
while(p1!=null){
addr[i]=&p1;
p1=p1->next;
}
int len=sizeof(addr)/sizeof(int);//calculates length of array addr
//for the second list
p2=list2;
while(p2!=null){
if(search(addr[],len,&p2)==1)//match found
{
//this is the merging node
return (p2);
}
p2=p2->next;
}
int search(addr,len,p2){
i=0;
while(i<len){
if(addr[i]==p2)
return 1;
i++;
}
return 0;
}
Hope it is a valid solution...
There is no need to modify any list. There is a solution in which we only have to traverse each list once.
Create two stacks, lets say stck1 and stck2.
Traverse 1st list and push a copy of each node you traverse in stck1.
Same as step two but this time traverse 2nd list and push the copy of nodes in stck2.
Now, pop from both stacks and check whether the two nodes are equal, if yes then keep a reference to them. If no, then previous nodes which were equal are actually the merge point we were looking for.
int FindMergeNode(Node headA, Node headB) {
Node currentA = headA;
Node currentB = headB;
// Do till the two nodes are the same
while (currentA != currentB) {
// If you reached the end of one list start at the beginning of the other
// one currentA
if (currentA.next == null) {
currentA = headA;
} else {
currentA = currentA.next;
}
// currentB
if (currentB.next == null) {
currentB = headB;
} else {
currentB = currentB.next;
}
}
return currentB.data;
}
We can use two pointers and move in a fashion such that if one of the pointers is null we point it to the head of the other list and same for the other, this way if the list lengths are different they will meet in the second pass.
If length of list1 is n and list2 is m, their difference is d=abs(n-m). They will cover this distance and meet at the merge point.
Code:
int findMergeNode(SinglyLinkedListNode* head1, SinglyLinkedListNode* head2) {
SinglyLinkedListNode* start1=head1;
SinglyLinkedListNode* start2=head2;
while (start1!=start2){
start1=start1->next;
start2=start2->next;
if (!start1)
start1=head2;
if (!start2)
start2=head1;
}
return start1->data;
}
Here is naive solution , No neeed to traverse whole lists.
if your structured node has three fields like
struct node {
int data;
int flag; //initially set the flag to zero for all nodes
struct node *next;
};
say you have two heads (head1 and head2) pointing to head of two lists.
Traverse both the list at same pace and put the flag =1(visited flag) for that node ,
if (node->next->field==1)//possibly longer list will have this opportunity
//this will be your required node.
How about this:
If you are only allowed to traverse each list only once, you can create a new node, traverse the first list to have every node point to this new node, and traverse the second list to see if any node is pointing to your new node (that's your merge point). If the second traversal doesn't lead to your new node then the original lists don't have a merge point.
If you are allowed to traverse the lists more than once, then you can traverse each list to find our their lengths and if they are different, omit the "extra" nodes at the beginning of the longer list. Then just traverse both lists one step at a time and find the first merging node.
Steps in Java:
Create a map.
Start traversing in the both branches of list and Put all traversed nodes of list into the Map using some unique thing related to Nodes(say node Id) as Key and put Values as 1 in the starting for all.
When ever first duplicate key comes, increment the value for that Key (let say now its value became 2 which is > 1.
Get the Key where the value is greater than 1 and that should be the node where two lists are merging.
We can efficiently solve it by introducing "isVisited" field. Traverse first list and set "isVisited" value to "true" for all nodes till end. Now start from second and find first node where flag is true and Boom ,its your merging point.
Step 1: find lenght of both the list
Step 2 : Find the diff and move the biggest list with the difference
Step 3 : Now both list will be in similar position.
Step 4 : Iterate through list to find the merge point
//Psuedocode
def findmergepoint(list1, list2):
lendiff = list1.length() > list2.length() : list1.length() - list2.length() ? list2.lenght()-list1.lenght()
biggerlist = list1.length() > list2.length() : list1 ? list2 # list with biggest length
smallerlist = list1.length() < list2.length() : list2 ? list1 # list with smallest length
# move the biggest length to the diff position to level both the list at the same position
for i in range(0,lendiff-1):
biggerlist = biggerlist.next
#Looped only once.
while ( biggerlist is not None and smallerlist is not None ):
if biggerlist == smallerlist :
return biggerlist #point of intersection
return None // No intersection found
int FindMergeNode(Node *headA, Node *headB)
{
Node *tempB=new Node;
tempB=headB;
while(headA->next!=NULL)
{
while(tempB->next!=NULL)
{
if(tempB==headA)
return tempB->data;
tempB=tempB->next;
}
headA=headA->next;
tempB=headB;
}
return headA->data;
}
Use Map or Dictionary to store the addressess vs value of node. if the address alread exists in the Map/Dictionary then the value of the key is the answer.
I did this:
int FindMergeNode(Node headA, Node headB) {
Map<Object, Integer> map = new HashMap<Object, Integer>();
while(headA != null || headB != null)
{
if(headA != null && map.containsKey(headA.next))
{
return map.get(headA.next);
}
if(headA != null && headA.next != null)
{
map.put(headA.next, headA.next.data);
headA = headA.next;
}
if(headB != null && map.containsKey(headB.next))
{
return map.get(headB.next);
}
if(headB != null && headB.next != null)
{
map.put(headB.next, headB.next.data);
headB = headB.next;
}
}
return 0;
}
A O(n) complexity solution. But based on an assumption.
assumption is: both nodes are having only positive integers.
logic : make all the integer of list1 to negative. Then walk through the list2, till you get a negative integer. Once found => take it, change the sign back to positive and return.
static int findMergeNode(SinglyLinkedListNode head1, SinglyLinkedListNode head2) {
SinglyLinkedListNode current = head1; //head1 is give to be not null.
//mark all head1 nodes as negative
while(true){
current.data = -current.data;
current = current.next;
if(current==null) break;
}
current=head2; //given as not null
while(true){
if(current.data<0) return -current.data;
current = current.next;
}
}
You can add the nodes of list1 to a hashset and the loop through the second and if any node of list2 is already present in the set .If yes, then thats the merge node
static int findMergeNode(SinglyLinkedListNode head1, SinglyLinkedListNode head2) {
HashSet<SinglyLinkedListNode> set=new HashSet<SinglyLinkedListNode>();
while(head1!=null)
{
set.add(head1);
head1=head1.next;
}
while(head2!=null){
if(set.contains(head2){
return head2.data;
}
}
return -1;
}
Solution using javascript
var getIntersectionNode = function(headA, headB) {
if(headA == null || headB == null) return null;
let countA = listCount(headA);
let countB = listCount(headB);
let diff = 0;
if(countA > countB) {
diff = countA - countB;
for(let i = 0; i < diff; i++) {
headA = headA.next;
}
} else if(countA < countB) {
diff = countB - countA;
for(let i = 0; i < diff; i++) {
headB = headB.next;
}
}
return getIntersectValue(headA, headB);
};
function listCount(head) {
let count = 0;
while(head) {
count++;
head = head.next;
}
return count;
}
function getIntersectValue(headA, headB) {
while(headA && headB) {
if(headA === headB) {
return headA;
}
headA = headA.next;
headB = headB.next;
}
return null;
}
If editing the linked list is allowed,
Then just make the next node pointers of all the nodes of list 2 as null.
Find the data value of the last node of the list 1.
This will give you the intersecting node in single traversal of both the lists, with "no hi fi logic".
Follow the simple logic to solve this problem:
Since both pointer A and B are traveling with same speed. To meet both at the same point they must be cover the same distance. and we can achieve this by adding the length of a list to another.
I've just come across a scenario in my project where it I need to compare different tree objects for equality with already known instances, and have considered that some sort of hashing algorithm that operates on an arbitrary tree would be very useful.
Take for example the following tree:
O
/ \
/ \
O O
/|\ |
/ | \ |
O O O O
/ \
/ \
O O
Where each O represents a node of the tree, is an arbitrary object, has has an associated hash function. So the problem reduces to: given the hash code of the nodes of tree structure, and a known structure, what is a decent algorithm for computing a (relatively) collision-free hash code for the entire tree?
A few notes on the properties of the hash function:
The hash function should depend on the hash code of every node within the tree as well as its position.
Reordering the children of a node should distinctly change the resulting hash code.
Reflecting any part of the tree should distinctly change the resulting hash code
If it helps, I'm using C# 4.0 here in my project, though I'm primarily looking for a theoretical solution, so pseudo-code, a description, or code in another imperative language would be fine.
UPDATE
Well, here's my own proposed solution. It has been helped much by several of the answers here.
Each node (sub-tree/leaf node) has the following hash function:
public override int GetHashCode()
{
int hashCode = unchecked((this.Symbol.GetHashCode() * 31 +
this.Value.GetHashCode()));
for (int i = 0; i < this.Children.Count; i++)
hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode());
return hashCode;
}
The nice thing about this method, as I see it, is that hash codes can be cached and only recalculated when the node or one of its descendants changes. (Thanks to vatine and Jason Orendorff for pointing this out).
Anyway, I would be grateful if people could comment on my suggested solution here - if it does the job well, then great, otherwise any possible improvements would be welcome.
If I were to do this, I'd probably do something like the following:
For each leaf node, compute the concatenation of 0 and the hash of the node data.
For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.
This will lead to a cascade up the tree every time you change anything, but that MAY be low-enough of an overhead to be worthwhile. If changes are relatively infrequent compared to the amount of changes, it may even make sense to go for a cryptographically secure hash.
Edit1: There is also the possibility of adding a "hash valid" flag to each node and simply propagate a "false" up the tree (or "hash invalid" and propagate "true") up the tree on a node change. That way, it may be possible to avoid a complete recalculation when the tree hash is needed and possibly avoid multiple hash calculations that are not used, at the risk of slightly less predictable time to get a hash when needed.
Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions, if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node, with "symbol hash" 30 and "value hash" 25 and a two-node tree, where the root has a "symbol hash" of 0 and a "value hash" of 30 and the child node has a total hash of 25. The examples are entirely invented, I don't know what expected hash ranges are so I can only comment on what I see in the presented code.
Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.
However, if the hash performs decently on expected data, it looks as if it will do the job. It's certainly faster than using a cryptographic hash (as done in the example code listed below).
Edit2: As for specific algorithms and minimum data structure needed, something like the following (Python, translating to any other language should be relatively easy).
#! /usr/bin/env python
import Crypto.Hash.SHA
class Node:
def __init__ (self, parent=None, contents="", children=[]):
self.valid = False
self.hash = False
self.contents = contents
self.children = children
def append_child (self, child):
self.children.append(child)
self.invalidate()
def invalidate (self):
self.valid = False
if self.parent:
self.parent.invalidate()
def gethash (self):
if self.valid:
return self.hash
digester = crypto.hash.SHA.new()
digester.update(self.contents)
if self.children:
for child in self.children:
digester.update(child.gethash())
self.hash = "1"+digester.hexdigest()
else:
self.hash = "0"+digester.hexdigest()
return self.hash
def setcontents (self):
self.valid = False
return self.contents
Okay, after your edit where you've introduced a requirement that the hashing result should be different for different tree layouts, you're only left with option to traverse the whole tree and write its structure to a single array.
That's done like this: you traverse the tree and dump the operations you do. For an original tree that could be (for a left-child-right-sibling structure):
[1, child, 2, child, 3, sibling, 4, sibling, 5, parent, parent, //we're at root again
sibling, 6, child, 7, child, 8, sibling, 9, parent, parent]
You may then hash the list (that is, effectively, a string) the way you like. As another option, you may even return this list as a result of hash-function, so it becomes collision-free tree representation.
But adding precise information about the whole structure is not what hash functions usually do. The way proposed should compute hash function of every node as well as traverse the whole tree. So you may consider other ways of hashing, described below.
If you don't want to traverse the whole tree:
One algorithm that immediately came to my mind is like this. Pick a large prime number H (that's greater than maximal number of children). To hash a tree, hash its root, pick a child number H mod n, where n is the number of children of root, and recursively hash the subtree of this child.
This seems to be a bad option if trees differ only deeply near the leaves. But at least it should run fast for not very tall trees.
If you want to hash less elements but go through the whole tree:
Instead of hashing subtree, you may want to hash layer-wise. I.e. hash root first, than hash one of nodes that are its children, then one of children of the children etc. So you cover the whole tree instead of one of specific paths. This makes hashing procedure slower, of course.
--- O ------- layer 0, n=1
/ \
/ \
--- O --- O ----- layer 1, n=2
/|\ |
/ | \ |
/ | \ |
O - O - O O------ layer 2, n=4
/ \
/ \
------ O --- O -- layer 3, n=2
A node from a layer is picked with H mod n rule.
The difference between this version and previous version is that a tree should undergo quite an illogical transformation to retain the hash function.
The usual technique of hashing any sequence is combining the values (or hashes thereof) of its elements in some mathematical way. I don't think a tree would be any different in this respect.
For example, here is the hash function for tuples in Python (taken from Objects/tupleobject.c in the source of Python 2.6):
static long
tuplehash(PyTupleObject *v)
{
register long x, y;
register Py_ssize_t len = Py_SIZE(v);
register PyObject **p;
long mult = 1000003L;
x = 0x345678L;
p = v->ob_item;
while (--len >= 0) {
y = PyObject_Hash(*p++);
if (y == -1)
return -1;
x = (x ^ y) * mult;
/* the cast might truncate len; that doesn't change hash stability */
mult += (long)(82520L + len + len);
}
x += 97531L;
if (x == -1)
x = -2;
return x;
}
It's a relatively complex combination with constants experimentally chosen for best results for tuples of typical lengths. What I'm trying to show with this code snippet is that the issue is very complex and very heuristic, and the quality of the results probably depend on the more specific aspects of your data - i.e. domain knowledge may help you reach better results. However, for good-enough results you shouldn't look too far. I would guess that taking this algorithm and combining all the nodes of the tree instead of all the tuple elements, plus adding their position into play will give you a pretty good algorithm.
One option of taking the position into account is the node's position in an inorder walk of the tree.
Any time you are working with trees recursion should come to mind:
public override int GetHashCode() {
int hash = 5381;
foreach(var node in this.BreadthFirstTraversal()) {
hash = 33 * hash + node.GetHashCode();
}
}
The hash function should depend on the hash code of every node within the tree as well as its position.
Check. We are explicitly using node.GetHashCode() in the computation of the tree's hash code. Further, because of the nature of the algorithm, a node's position plays a role in the tree's ultimate hash code.
Reordering the children of a node should distinctly change the resulting hash code.
Check. They will be visited in a different order in the in-order traversal leading to a different hash code. (Note that if there are two children with the same hash code you will end up with the same hash code upon swapping the order of those children.)
Reflecting any part of the tree should distinctly change the resulting hash code
Check. Again the nodes would be visited in a different order leading to a different hash code. (Note that there are circumstances where the reflection could lead to the same hash code if every node is reflected into a node with the same hash code.)
The collision-free property of this will depend on how collision-free the hash function used for the node data is.
It sounds like you want a system where the hash of a particular node is a combination of the child node hashes, where order matters.
If you're planning on manipulating this tree a lot, you may want to pay the price in space of storing the hashcode with each node, to avoid the penalty of recalculation when performing operations on the tree.
Since the order of the child nodes matters, a method which might work here would be to combine the node data and children using prime number multiples and addition modulo some large number.
To go for something similar to Java's String hashcode:
Say you have n child nodes.
hash(node) = hash(nodedata) +
hash(childnode[0]) * 31^(n-1) +
hash(childnode[1]) * 31^(n-2) +
<...> +
hash(childnode[n])
Some more detail on the scheme used above can be found here: http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
I can see that if you have a large set of trees to compare, then you could use a hash function to retrieve a set of potential candidates, then do a direct comparison.
A substring that would work is just use lisp syntax to put brackets around the tree, write out the identifiere of each node in pre-order. But this is computationally equivalent to a pre-order comparison of the tree, so why not just do that?
I've given 2 solutions: one is for comparing the two trees when you're done (needed to resolve collisions) and the other to compute the hashcode.
TREE COMPARISON:
The most efficient way to compare will be to simply recursively traverse each tree in a fixed order (pre-order is simple and as good as anything else), comparing the node at each step.
So, just create a Visitor pattern that successively returns the next node in pre-order for a tree. i.e. it's constructor can take the root of the tree.
Then, just create two insces of the Visitor, that act as generators for the next node in preorder. i.e. Vistor v1 = new Visitor(root1), Visitor v2 = new Visitor(root2)
Write a comparison function that can compare itself to another node.
Then just visit each node of the trees, comparing, and returning false if comparison fails. i.e.
Module
Function Compare(Node root1, Node root2)
Visitor v1 = new Visitor(root1)
Visitor v2 = new Visitor(root2)
loop
Node n1 = v1.next
Node n2 = v2.next
if (n1 == null) and (n2 == null) then
return true
if (n1 == null) or (n2 == null) then
return false
if n1.compare(n2) != 0 then
return false
end loop
// unreachable
End Function
End Module
HASH CODE GENERATION:
if you want to write out a string representation of the tree, you can use the lisp syntax for a tree, then sample the string to generate a shorter hashcode.
Module
Function TreeToString(Node n1) : String
if node == null
return ""
String s1 = "(" + n1.toString()
for each child of n1
s1 = TreeToString(child)
return s1 + ")"
End Function
The node.toString() can return the unique label/hash code/whatever for that node. Then you can just do a substring comparison from the strings returned by the TreeToString function to determine if the trees are equivalent. For a shorter hashcode, just sample the TreeToString Function, i.e. take every 5 character.
End Module
I think you could do this recursively: Assume you have a hash function h that hashes strings of arbitrary length (e.g. SHA-1). Now, the hash of a tree is the hash of a string that is created as a concatenation of the hash of the current element (you have your own function for that) and hashes of all the children of that node (from recursive calls of the function).
For a binary tree you would have:
Hash( h(node->data) || Hash(node->left) || Hash(node->right) )
You may need to carefully check if tree geometry is properly accounted for. I think that with some effort you could derive a method for which finding collisions for such trees could be as hard as finding collisions in the underlying hash function.
A simple enumeration (in any deterministic order) together with a hash function that depends when the node is visited should work.
int hash(Node root) {
ArrayList<Node> worklist = new ArrayList<Node>();
worklist.add(root);
int h = 0;
int n = 0;
while (!worklist.isEmpty()) {
Node x = worklist.remove(worklist.size() - 1);
worklist.addAll(x.children());
h ^= place_hash(x.hash(), n);
n++;
}
return h;
}
int place_hash(int hash, int place) {
return (Integer.toString(hash) + "_" + Integer.toString(place)).hash();
}
class TreeNode
{
public static QualityAgainstPerformance = 3; // tune this for your needs
public static PositionMarkConstan = 23498735; // just anything
public object TargetObject; // this is a subject of this TreeNode, which has to add it's hashcode;
IEnumerable<TreeNode> GetChildParticipiants()
{
yield return this;
foreach(var child in Children)
{
yield return child;
foreach(var grandchild in child.GetParticipiants() )
yield return grandchild;
}
IEnumerable<TreeNode> GetParentParticipiants()
{
TreeNode parent = Parent;
do
yield return parent;
while( ( parent = parent.Parent ) != null );
}
public override int GetHashcode()
{
int computed = 0;
var nodesToCombine =
(Parent != null ? Parent : this).GetChildParticipiants()
.Take(QualityAgainstPerformance/2)
.Concat(GetParentParticipiants().Take(QualityAgainstPerformance/2));
foreach(var node in nodesToCombine)
{
if ( node.ReferenceEquals(this) )
computed = AddToMix(computed, PositionMarkConstant );
computed = AddToMix(computed, node.GetPositionInParent());
computed = AddToMix(computed, node.TargetObject.GetHashCode());
}
return computed;
}
}
AddToTheMix is a function, which combines the two hashcodes, so the sequence matters.
I don't know what it is, but you can figure out. Some bit shifting, rounding, you know...
The idea is that you have to analyse some environment of the node, depending on the quality you want to achieve.
I have to say, that you requirements are somewhat against the entire concept of hashcodes.
Hash function computational complexity should be very limited.
It's computational complexity should not linearly depend on the size of the container (the tree), otherwise it totally breaks the hashcode-based algorithms.
Considering the position as a major property of the nodes hash function also somewhat goes against the concept of the tree, but achievable, if you replace the requirement, that it HAS to depend on the position.
Overall principle i would suggest, is replacing MUST requirements with SHOULD requirements.
That way you can come up with appropriate and efficient algorithm.
For example, consider building a limited sequence of integer hashcode tokens, and add what you want to this sequence, in the order of preference.
Order of the elements in this sequence is important, it affects the computed value.
for example for each node you want to compute:
add the hashcode of underlying object
add the hashcodes of underlying objects of the nearest siblings, if available. I think, even the single left sibling would be enough.
add the hashcode of underlying object of the parent and it's nearest siblings like for the node itself, same as 2.
repeat this to with the grandparents to a limited depth.
//--------5------- ancestor depth 2 and it's left sibling;
//-------/|------- ;
//------4-3------- ancestor depth 1 and it's left sibling;
//-------/|------- ;
//------2-1------- this;
the fact that you are adding a direct sibling's underlying object's hashcode gives a positional property to the hashfunction.
if this is not enough, add the children:
You should add every child, just some to give a decent hashcode.
add the first child and it's first child and it's first child.. limit the depth to some constant, and do not compute anything recursively - just the underlying node's object's hashcode.
//----- this;
//-----/--;
//----6---;
//---/--;
//--7---;
This way the complexity is linear to the depth of the underlying tree, not the total number of elements.
Now you have a sequence if integers, combine them with a known algorithm, like Ely suggests above.
1,2,...7
This way, you will have a lightweight hash function, with a positional property, not dependent on the total size of the tree, and even not dependent on the tree depth, and not requiring to recompute hash function of the entire tree when you change the tree structure.
I bet this 7 numbers would give a hash destribution near to perfect.
Writing your own hash function is almost always a bug, because you basically need a degree in mathematics to do it well. Hashfunctions are incredibly nonintuitive, and have highly unpredictable collision characteristics.
Don't try directly combining hashcodes for Child nodes -- this will magnify any problems in the underlying hash functions. Instead, concatenate the raw bytes from each node in order, and feed this as a byte stream to a tried-and-true hash function. All the cryptographic hash functions can accept a byte stream. If the tree is small, you may want to just create a byte array and hash it in one operation.