Determine distance between two random nodes in a tree - algorithm

Given a general tree, I want the distance between two nodes v and w.
Wikipedia states the following:
Computation of lowest common ancestors may be useful, for instance, as part of a procedure for determining the distance between pairs of nodes in a tree: the distance from v to w can be computed as the distance from the root to v, plus the distance from the root to w, minus twice the distance from the root to their lowest common ancestor.
Let's say d(x) denotes the distance of node x from the root, which we label 1. d(x,y) denotes the distance between two vertices x and y, and lca(x,y) denotes the lowest common ancestor of the vertex pair x and y.
Thus if we have 4 and 8, lca(4,8) = 2 therefore, according to the description above, d(4,8) = d(4) + d(8) - 2 * d(lca(4,8)) = 2 + 3 - 2 * 1 = 3. Great, that worked!
However, the formula stated above seems to fail for the vertex pair (8,3) (lca(8,3) = 2): d(8,3) = d(8) + d(3) - 2 * d(2) = 3 + 1 - 2 * 1 = 2. This is incorrect, however; the distance d(8,3) = 4, as can be seen on the graph. The algorithm seems to fail for anything that crosses over the defined root.
What am I missing?

You missed that lca(8,3) = 1, not 2. Hence d(1) == 0, which makes it:
d(8,3) = d(8) + d(3) - 2 * d(1) = 3 + 1 - 2 * 0 = 4
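
To make the formula concrete, here is a minimal Python sketch. The parent map is a hypothetical tree consistent with the depths used above (node 1 is the root, node 8 sits at depth 3 below node 2; node 7 is a made-up intermediate node), and the naive lca walk is only there to illustrate the formula:

# hypothetical child -> parent map; the root's parent is None
parent = {1: None, 2: 1, 3: 1, 4: 2, 7: 2, 8: 7}

def depth(v):
    # d(v): number of edges from v up to the root, so d(root) = 0
    d = 0
    while parent[v] is not None:
        v = parent[v]
        d += 1
    return d

def lca(v, w):
    # naive lowest common ancestor: collect v's ancestors, then walk up from w
    ancestors = set()
    while v is not None:
        ancestors.add(v)
        v = parent[v]
    while w not in ancestors:
        w = parent[w]
    return w

def distance(v, w):
    # d(v, w) = d(v) + d(w) - 2 * d(lca(v, w))
    return depth(v) + depth(w) - 2 * depth(lca(v, w))

print(distance(4, 8))  # 3  (lca is 2, at depth 1)
print(distance(8, 3))  # 4  (lca is the root 1, at depth 0)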

For the appropriate 2 node, namely the one on the right, d(lca(8,2)) == 0, not 1 as you have it in your derivation. The distance from the root--which is the lca in this case--to itself is zero. So
d(8,2) = d(8) + d(2) - 2 * d(lca(8,2)) = 3 + 1 - 2 * 0 = 4
The fact that you have two nodes labeled 2 is probably confusing things.
Edit: The post has been edited so that a node originally labeled 2 is now labeled 3. In this case, the derivation is now correct but the statement
the distance d(8,2) = 4 as can be seen on the graph
is incorrect, d(8,2) = 2.

Related

How to find all possible reachable numbers from a position?

Given two integers n and s and an array A of size m, where s is the initial position and 1 <= s <= n, the task is to perform m operations on s. In each operation we either make s = s + A[i] or s = s - A[i]. We have to print all the values that are possible after the m operations, and all those values should lie between 1 and n (inclusive).
Important Note: If during an operation we get a value s < 1 or s > n,
we don't go further with that value of s.
I solved the problem using BFS, but the BFS approach is not optimal here. Can someone suggest a more optimal approach or algorithm?
For example:-
If n = 3, s = 3, and A = {1, 1, 1}
                            3
                        /       \
operation 1:        2               4     (we don't proceed with 4 as it is > n)
                  /   \           /   \
operation 2:    1       3       3       5
               / \     / \     / \     / \
operation 3:  0   2   2   4   2   4   4   6
So the final values reachable by following the above rules are 2 and 2 (that is, two times 2). We don't consider the third 2 because it has an intermediate state which is > n (the same applies if a value is < 1).
There is this dynamic programming solution, which runs in O(nm) time and requires O(n) space.
First establish a boolean array called reachable, initialize it to false everywhere except for reachable[s], which is true.
This array now represents whether a number is reachable in 0 steps. Now for every i from 1 to m, we update the array so that reachable[x] represents whether the number x is reachable in i steps. This is easy: x is reachable in i steps if and only if either x - A[i] or x + A[i] is reachable in i - 1 steps.
In the end, the array becomes the final result you want.
EDIT: pseudo-code here.
// initialization:
for x = 1 to n:
    r[x] = false
r[s] = true

// main loop:
for k = 1 to m:
    for x = 1 to n:
        last_r[x] = r[x]
    for x = 1 to n:
        r[x] = (last_r[x + A[k]] or last_r[x - A[k]])
Here last_r[x] is by convention false if x is not in the range [1 .. n].
If you want to maintain the number of ways that each number can be reached, then you do the following changes:
Change the array r to an integer array;
In the initialization, initialize all r[x] to 0, except r[s] to 1;
In the main loop, change the key line to:
r[x] = last_r[x + A[k]] + last_r[x - A[k]]
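
For completeness, a minimal Python sketch of this DP (the function name reachable_counts is my own); it returns the counting version directly, and the reachable set is just the positions with a non-zero count:

def reachable_counts(n, s, A):
    # r[x] = number of ways to reach position x so far (index 0 unused)
    r = [0] * (n + 1)
    r[s] = 1
    for a in A:
        last_r = r[:]                      # values after the previous operation
        for x in range(1, n + 1):
            ways = 0
            if 1 <= x + a <= n:
                ways += last_r[x + a]
            if 1 <= x - a <= n:
                ways += last_r[x - a]
            r[x] = ways
    return r

r = reachable_counts(3, 3, [1, 1, 1])
print([x for x in range(1, 4) if r[x] > 0])   # [2]
print(r[2])                                   # 2, i.e. "two times 2" as in the example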

How can you calculate depth of a binary tree with less complexity?

Given a binary search tree t, it is rather easy to get its depth using recursion, as follows:
def node_height(t):
    if t is None:
        return 0
    else:
        height_left = node_height(t.left)
        height_right = node_height(t.right)
        return 1 + max(height_left, height_right)
However, I noticed that its complexity increases exponentially, and thus should perform very badly when we have a deep tree. Is there any faster algorithm for doing this?
If you store the height as a field in the Node object, you can add 1 as you add nodes to the tree (and subtracting during remove).
That'll make the operation constant time for getting the height of any node, but it adds some additional complexity into the add/remove operations.
This kind of extends from what @cricket_007 mentioned in his answer.
So, if you do a ( 1 + max(height_left,height_right) ), you end up having to visit every node, which is essentially an O(N) operation. For an average case with a balanced tree, you would be looking at something like T(n) = 2T(n/2) + Θ(1).
Now, this can be improved to a time of O(1) if you can store the height of a certain node. In that case, the height of the tree would be equal to the height of the root. So, the modification you would need to make would be to your insert(value) method. At the beginning, the root is given a default height of 0. The node to be added is assigned a height of 0. For every node you encounter while trying to add this new node, increase node.height by 1 if needed, and ensure it is set to 1 + max(left child's height, right child's height). So, the height function will simply return node.height, hence allowing for constant time. The time complexity for the insert will also not change; we just need some extra space to store n integer values, where n is the number of nodes.
The following is shown to give an understanding of what I am trying to say.
        5 [0]

- insert 2 [increase height of root by 1]

        5 [1]
       /
      /
 [0] 2

- insert 1 [increase height of node 2 by 1, increase height of node 5 by 1]

        5 [2]
       /
      /
 [1] 2
    /
   /
[0] 1

- insert 3 [new height of node 2 = 1 + max(height of node 1, height of node 3)
            = 1 + 0 = 1; height of node 5 also does not change]

        5 [2]
       /
      /
 [1] 2
    / \
   /   \
[0] 1   3 [0]

- insert 6 [new height of node 5 = 1 + max(height of node 2, height of node 6)
            = 1 + 1 = 2]

        5 [2]
       / \
      /   \
 [1] 2     6 [0]
    / \
   /   \
[0] 1   3 [0]
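
A minimal Python sketch of the idea walked through above, assuming a plain (unbalanced) binary search tree; the class and function names are my own. The stored height is refreshed on the way back out of each insert, so reading the height afterwards is O(1):

class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.height = 0                 # a leaf has height 0, as in the diagrams

def height(node):
    return node.height if node else -1  # an empty subtree counts as -1

def insert(node, value):
    # ordinary BST insert that also refreshes the stored heights
    if node is None:
        return Node(value)
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    node.height = 1 + max(height(node.left), height(node.right))
    return node

root = None
for v in [5, 2, 1, 3, 6]:
    root = insert(root, v)
print(root.height)  # 2, matching the last diagram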

Number of ways of distributing n identical balls into groups such that each group has at least k balls?

I am trying to do this using recursion with memoization, and I have identified the following base cases:
I) when n == k there is only one group, with all the balls;
II) when k > n no group can have at least k balls, hence zero.
I am unable to move forward from here. How can this be done?
As an illustration, when n = 6, k = 2:
(2,2,2)
(4,2)
(3,3)
(6)
That is, 4 different groupings can be formed.
This can be represented by the two dimensional recursive formula described below:
T(0, k) = 1
T(n, k) = 0                             (n < k, n != 0)
T(n, k) = T(n - k, k)  +  T(n, k + 1)
               ^                ^
   some box has exactly k    no box has exactly k balls,
   balls; place them and     advance to the next k
   recurse on n - k
In the above, T(n,k) is the number of distributions of n balls such that each box gets at least k.
And the trick is to think of k as the lowest possible number of balls in a box, and to separate the problem into two scenarios: is there a box with exactly k balls (if so, place them and recurse with n - k balls), or not (and then recurse with the minimal value k + 1 and the same number of balls)?
For example, to calculate your case T(6,2) (6 balls, minimum 2 per box):
T(6,2) = T(4,2) + T(6,3)
T(4,2) = T(2,2) + T(4,3) = T(0,2) + T(2,3) + T(1,3) + T(4,4) =
= T(0,2) + T(2,3) + T(1,3) + T(0,4) + T(4,5) =
= 1 + 0 + 0 + 1 + 0
= 2
T(6,3) = T(3,3) + T(6,4) = T(0,3) + T(3,4) + T(2,4) + T(6,5)
= T(0,3) + T(3,4) + T(2,4) + T(1,5) + T(6,6) =
= T(0,3) + T(3,4) + T(2,4) + T(1,5) + T(0,6) + T(6,7) =
= 1 + 0 + 0 + 0 + 1 + 0
= 2
T(6,2) = T(4,2) + T(6,3) = 2 + 2 = 4
Using Dynamic Programming, it can be calculated in O(n^2) time.
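
A minimal Python sketch of this recurrence with memoization (the function name is my own):

from functools import lru_cache

@lru_cache(maxsize=None)
def T(n, k):
    # number of ways to split n identical balls into groups of at least k balls
    if n == 0:
        return 1
    if n < k:
        return 0
    # either some group has exactly k balls, or every group has at least k + 1
    return T(n - k, k) + T(n, k + 1)

print(T(6, 2))  # 4: (2,2,2), (4,2), (3,3), (6)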
This case can be solved pretty simply:
Number of buckets
The maximum number of buckets b can be determined as follows:
b = roundDown(n / k)
Each valid distribution can use at most b buckets.
Number of distributions with x buckets
For a given number of buckets x, the number of distributions can be found as follows:
Distribute k balls to each bucket, then count the ways to distribute the remaining balls (r = n - k * x) to the x buckets:
total_distributions(x) = bincoefficient(r + x - 1, x - 1)
EDIT: this will only work if order matters. Since it doesn't for this question, we can use a few tricks here:
Each distribution can be mapped to a sequence of numbers, e.g. d = {d1, d2, ..., dx}. We can easily generate all of these sequences, starting with the "first" sequence {r, 0, ..., 0} and subsequently moving 1s from the left to the right; the next sequence would thus be {r - 1, 1, ..., 0}. If only sequences matching d1 >= d2 >= ... >= dx are generated, no duplicates are produced. This constraint can also be used to prune the search a bit: we can only move a 1 from da to db (with a = b - 1) if da - 1 >= db + 1, since otherwise the constraint that the sequence is sorted would be violated. The 1 to move is always the rightmost one that can be moved. Another way to think of this is to view r as a unary number and split that string into groups such that each group is at least as long as its successor.
countSequences(x)
    sequence[]                  // length x, all entries 0
    sequence[0] = r
    sequenceCount = 1
    while true
        int i = findRightmostMoveable(sequence)
        if i == -1
            return sequenceCount
        sequence[i] -= 1
        sequence[i + 1] += 1
        sequenceCount += 1

findRightmostMoveable(sequence)
    for i in [length(sequence) - 1 , 0)
        if sequence[i - 1] > sequence[i] + 1
            return i - 1
    return -1
Actually findRightmostMoveable could be optimized a bit, if we look at the structure-transitions of the sequence (to be more precise the difference between two elements of the sequence). But to be honest I'm by far too lazy to optimize this further.
Putting the pieces together
range(1 , roundDown(n / k)).map(b -> countSequences(b)).sum()
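
Here is a minimal Python sketch of the same bucket-by-bucket idea; instead of enumerating the sequences explicitly, it counts the nonincreasing sequences (partitions of the leftover r balls into at most b parts) with a small recursive helper. All names are my own and this is only meant as a cross-check of the approach:

from functools import lru_cache

@lru_cache(maxsize=None)
def partitions_at_most(r, parts, largest):
    # ways to write r as a nonincreasing sequence of at most `parts`
    # numbers, each no larger than `largest`
    if r == 0:
        return 1
    if parts == 0 or largest == 0:
        return 0
    total = 0
    for first in range(min(r, largest), 0, -1):
        total += partitions_at_most(r - first, parts - 1, first)
    return total

def count_groupings(n, k):
    total = 0
    for b in range(1, n // k + 1):    # b = number of groups used
        r = n - k * b                 # balls left after giving every group its k
        total += partitions_at_most(r, b, r)
    return total

print(count_groupings(6, 2))  # 4, matching the example above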

How do you manually compute for silhouette, cohesion and separation of Cluster

Good day!
I have been looking all over the Internet for how to compute the silhouette coefficient, cohesion and separation, but unfortunately, despite the resources, I just can't understand the formulas posted. I know there are implementations in some tools, but I want to know how to compute them manually, especially given a vector space model.
Assuming that I have the following clusters:
Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}},
Cluster 3 ={{3,1},{3,3},{2,1}}
The way I understood it according to [1] is that I have to get the average of the points per cluster:
C1 X = 1; Y = .5
C2 X = 1.5; Y = 2.25
C3 X = 2.67; Y = 1.67
Given the mean, I have to compute for my cohesion by Sum of Square Error (SSE):
Cohesion(C1) = (1-1)^2 + (1-1)^2 + (0-.5)^2 + (0-.5)^2 = 0.5
Cohesion(C2) = (1-1.5)^2 + (2-1.5)^2 + (2-1.5)^2 + (1-1.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (2-2.5)^2 +(2-2.5)^2 = 2
Cohesion(C3) = (3-2.67)^2 + (3-2.67)^2 + (2-2.67)^2 + (1-1.67)^2 + (3-1.67)^2 + (1-1.67)^2 = 3.3334
Cluster(C) = 0.5 + 2 + 3.3334 = 5.8334
My questions are:
1. Did I perform cohesion correctly?
2. How do I compute for Separation?
3. How do I compute for Silhouette Coefficient?
Thank you.
References: [1] http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf
Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}},
Cluster 3 ={{3,1},{3,3},{2,1}}
Take a point {1,0} in cluster 1
Calculate its average distance to all other points in its cluster, i.e. cluster 1
So a1 =√( (1-1)^2 + (0-1)^2) =√(0+1)=√1=1
Now for the object {1,0} in cluster 1 calculate its average distance from all the objects in cluster 2 and cluster 3. Of these take the minimum average distance.
So for cluster 2
{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2
{1,0} ----> {2,3} = distance = √((1-2)^2 + (0-3)^2) =√(1+9)=√10=3.16
{1,0} ----> {2,2} = distance = √((1-2)^2 + (0-2)^2) =√(1+4)=√5=2.24
{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2
Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 2 =
(2 + 3.16 + 2.24 + 2)/4 = 2.35
Similarly, for cluster 3
{1,0} ----> {3,1} = distance = √((1-3)^2 + (0-1)^2) =√(4+1)=√5=2.24
{1,0} ----> {3,3} = distance = √((1-3)^2 + (0-3)^2) =√(4+9)=√13=3.61
{1,0} ----> {2,1} = distance = √((1-2)^2 + (0-1)^2) =√(1+1)=√2=2.24
Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 3 =
(2.24+3.61+2.24)/3 = 2.7
Now, the minimum average distance of the point {1,0} in cluster 1 to the other clusters 2 and 3 is,
b1 = 2.35 (2.35 < 2.7)
So the silhouette coefficient of cluster 1
s1 = 1 - (a1/b1) = 1 - (1/2.35) = 1 - 0.4255 = 0.5745
In a similar fashion you need to calculate the silhouette coefficient for cluster 2 and cluster 3 separately by taking any single object point in each of the clusters and repeating the steps above. Of these the cluster with the greatest silhouette coefficient is the best as per evaluation.
Note: The distance here is the Euclidean Distance! You can also have a look at this video for further explanation:
https://www.coursera.org/learn/cluster-analysis/lecture/RJJfM/6-2-clustering-evaluation-measuring-clustering-quality
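
As a cross-check on the arithmetic, here is a small Python sketch for one point, using exact Euclidean distances (so the values differ slightly from the rounded figures above). The names are my own, and the last line uses the standard (b - a) / max(a, b) form, which reduces to 1 - a/b whenever a < b, as used above:

from math import dist   # Euclidean distance, available in Python 3.8+

c1 = [(1, 0), (1, 1)]
c2 = [(1, 2), (2, 3), (2, 2), (1, 2)]
c3 = [(3, 1), (3, 3), (2, 1)]

def avg_dist(p, cluster, skip_self=False):
    others = [q for q in cluster if not (skip_self and q == p)]
    return sum(dist(p, q) for q in others) / len(others)

p = (1, 0)                                  # the point worked through above
a = avg_dist(p, c1, skip_self=True)         # mean distance within its own cluster
b = min(avg_dist(p, c2), avg_dist(p, c3))   # mean distance to the nearest other cluster
s = (b - a) / max(a, b)                     # silhouette value of this point
print(a, b, s)                              # roughly 1.0, 2.35, 0.57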
Computation of Silhouette is straightforward, but it does not involve the centroids.
So don't try to compute it from what you did for cohesion; compute it from your original data.
In the way you have calculated the Cohesion of C1, there is a mistake; it should be:
Cohesion(C1) = (1 - 1) ^ 2 + (1 - 1) ^ 2 + (0 - .5) ^ 2 + (1 - .5) ^ 2 = 0.5
This is the Prototype-Based (Centroid in this case) Cohesion calculation.
For calculating Separation: {Between clusters i.e. (C1,C2) , (C1,C3) & (C2,C3)}
Separation(C1,C2) = SSE(Centroid(C1), Centroid(C2))
= (1 - 1.5) ^ 2 + (0.5 - 2.25) ^ 2 = 0.25 + 3.0625 = 3.3125
Silhouette Coefficient: Combines both the Cohesion and Separation.
Refer https://cs.fit.edu/~pkc/classes/ml-internet/silhouette.pdf
Thanks for your answer.
'Calculate its average distance to all other points in its cluster, i.e. cluster 1' --> this part has to be corrected.
So
a1 = √((1-1)^2 + (0-1)^2) = √(0+1) = √1 = 1
{1,0} ----> {2,1} = distance = √((1-2)^2 + (0-1)^2) = √(1+1) = √2 = 2.24
This is an error, because the root of 2 is approximately 1.41.

Analytical solution to predict array size of binary tree

I'm constructing a binary tree for a sequence of data, and the tree is stored in a 1-based array. So if the index of a parent node is idx,
the left child is at 2 * idx and the right child at 2 * idx + 1.
Every iteration, I sort the current sequence based on certain criteria, select the median element as the parent, set tree[index] = sequence[median], and then do the same operation recursively on the left part (the subsequence before the median) and the right part (the subsequence after the median).
Eg, if 3 elements in total, the tree will be:
  1
 / \
2   3
the array size to store the tree is also 3
4 elements:
    1
   / \
  2   3
 /
4
the array size to store the tree is also 4
5 elements:
       1
     /   \
    2     3
   / \   /
  4 null 5
the array size to store the tree has to be 6, since there is a hole between 4 and 5.
Thus, the array size is determined only by the number of elements. I believe there is an analytical solution for it, I just can't prove it.
Any suggestion will be appreciated.
Thanks.
Every level of a binary tree can contain twice as many nodes as the previous level. If you have n nodes, then the number of levels required (the height of the tree) is floor(log2(n)) + 1. So if you have 5 nodes, your binary tree will have a height of 3.
The number of nodes in a full binary tree of height h is (2^h) - 1. So you know that the maximum size array you need for 5 items is 7. Assuming all the levels are filled except possibly the last one.
The last row of your tree will contain n - (2^(h-1) - 1) nodes: whatever is left over after the h - 1 full levels above it are filled. The last level of a full tree contains 2^(h-1) nodes. Assuming you want it balanced so half of the nodes are on the left and half are on the right, and the right side is left-filled, that is, you want this:
         1
     2       3
   4   5   6   7
  8  9 10 11
The number of array spaces required for the last level of your tree, then, is either 1, or it is half the number required by a full tree plus half the number of nodes your tree actually has on that level.
So:
n = 5
height = roundDown(log2(n)) + 1
fullTreeNodes = (2^height) - 1
fullTreeLeafNodes = 2^(height-1)
nodesOnLeafLevel = n - (fullTreeLeafNodes - 1)
Now comes the fun part. If there is more than 1 node required on the leaf level, and you want to balance the sides, you need half of fullTreeLeafNodes, plus half of nodesOnLeafLevel. In the tree above, for example, the leaf level has a potential for 8 nodes. But you only have 4 leaf nodes. You want two of them on the left side, and two on the right. So you need to allocate space for 4 nodes on the left side (2 for the left side items, and 2 empty spaces), plus two more for the two right side items.
if (nodesOnLeafLevel == 1)
    arraySize = n
else
    arraySize = (fullTreeNodes - fullTreeLeafNodes/2) + (nodesOnLeafLevel / 2)
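
A quick Python transcription of the steps above (helper names are mine, and integer division is assumed for the halvings); it reproduces the array sizes 3, 4 and 6 that the question lists for 3, 4 and 5 elements:

from math import log2, floor

def array_size(n):
    # size of the 1-based array needed for n items, following the formula above
    height = floor(log2(n)) + 1
    full_tree_nodes = 2 ** height - 1
    full_tree_leaves = 2 ** (height - 1)
    nodes_on_leaf_level = n - (full_tree_leaves - 1)
    if nodes_on_leaf_level == 1:
        return n
    return (full_tree_nodes - full_tree_leaves // 2) + nodes_on_leaf_level // 2

for n in (3, 4, 5):
    print(n, array_size(n))   # 3 -> 3, 4 -> 4, 5 -> 6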
You really shouldn't have any holes. They are created by your partitioning algorithm, but that algorithm is incorrect.
For 1-5 items, your trees should look like:
1    2     2      3       4
    /     / \    / \     / \
   1     1   3  2   4   2   5
               /       / \
              1       1   3
The easiest way to populate the tree is to do an in-order traversal of the node locations, filling items from the sequence in order.
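
A small Python sketch of that in-order filling, assuming the sequence is already sorted by the desired criteria (the names are my own); it uses exactly n array slots, with no holes:

def fill_tree(sequence):
    # 1-based array layout: children of idx are 2*idx and 2*idx + 1; slot 0 unused
    n = len(sequence)
    tree = [None] * (n + 1)
    it = iter(sequence)

    def visit(idx):
        # in-order traversal over the node positions 1..n
        if idx > n:
            return
        visit(2 * idx)
        tree[idx] = next(it)
        visit(2 * idx + 1)

    visit(1)
    return tree

print(fill_tree([1, 2, 3, 4, 5]))  # [None, 4, 2, 5, 1, 3] -- the 5-item tree above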
I'm close to formalizing a solution. By intuition, first find the maximal power of 2 below N, then check whether N - 2^m is even or odd to decide which part of the leaf level needs to be grown.
// roundUpPower2 and roundDownLog2 are helpers assumed to be defined elsewhere
int32_t rup2 = roundUpPower2(nPoints);
if (rup2 == nPoints || rup2 == nPoints + 1)
{
    return nPoints;
}
int32_t leaveLevelCapacity = rup2 / 2;        // capacity of the leaf level
int32_t allAbove = leaveLevelCapacity - 1;    // nodes on all full levels above
int32_t pointsOnLeave = nPoints - allAbove;   // points that land on the leaf level
int32_t iteration = roundDownLog2(pointsOnLeave);
int32_t leaveSize = 1;
int32_t gap = leaveLevelCapacity;
for (int32_t i = 1; i <= iteration; ++i)
{
    leaveSize += gap / 2;
    gap /= 2;
}
return (allAbove + leaveSize);
