How do you manually compute silhouette, cohesion and separation for cluster validation?

Good day!
I have been looking all over the Internet for how to compute the silhouette coefficient, cohesion and separation. Unfortunately, despite the resources, I just can't understand the formulas posted. I know that there are implementations of them in some tools, but I want to know how to compute them manually, especially given a vector space model.
Assuming that I have the following clusters:
Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}},
Cluster 3 ={{3,1},{3,3},{2,1}}
The way I understood it according to [1] is that I have to get the average of the points per cluster:
C1 X = 1; Y = .5
C2 X = 1.5; Y = 2.25
C3 X = 2.67; Y = 1.67
Given the mean, I have to compute for my cohesion by Sum of Square Error (SSE):
Cohesion(C1) = (1-1)^2 + (1-1)^2 + (0-.5)^2 + (0-.5)^2 = 0.5
Cohesion(C2) = (1-1.5)^2 + (2-1.5)^2 + (2-1.5)^2 + (1-1.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (2-2.5)^2 +(2-2.5)^2 = 2
Cohesion(C3) = (3-2.67)^2 + (3-2.67)^2 + (2-2.67)^2 + (1-1.67)^2 + (3-1.67)^2 + (1-1.67)^2 = 3.3334
Cluster(C) = 0.5 + 2 + 3.3334 = 5.8334
My questions are:
1. Did I perform cohesion correctly?
2. How do I compute for Separation?
3. How do I compute for Silhouette Coefficient?
Thank you.
References: [1] http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf

Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}},
Cluster 3 ={{3,1},{3,3},{2,1}}
Take a point {1,0} in cluster 1
Calculate its average distance to all other points in its cluster, i.e. cluster 1
So a1 =√( (1-1)^2 + (0-1)^2) =√(0+1)=√1=1
Now for the object {1,0} in cluster 1 calculate its average distance from all the objects in cluster 2 and cluster 3. Of these take the minimum average distance.
So for cluster 2
{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2
{1,0} ----> {2,3} = distance = √((1-2)^2 + (0-3)^2) =√(1+9)=√10=3.16
{1,0} ----> {2,2} = distance = √((1-2)^2 + (0-2)^2) =√(1+4)=√5=2.24
{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2
Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 2 =
(2+3.16+2.24+2)/4 = 2.35
Similarly, for cluster 3
{1,0} ----> {3,1} = distance = √((1-3)^2 + (0-1)^2) =√(4+1)=√5=2.24
{1,0} ----> {3,3} = distance = √((1-3)^2 + (0-3)^2) =√(4+9)=√13=3.61
{1,0} ----> {2,1} = distance = √((1-2)^2 + (0-1)^2) =√(1+1)=√2=1.41
Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 3 =
(2.24+3.61+1.41)/3 = 2.42
Now, the minimum average distance of the point {1,0} in cluster 1 to the other clusters 2 and 3 is,
b1 =2.35 (2.35 < 2.42)
So the silhouette coefficient of the point {1,0} in cluster 1 is
s1= 1-(a1/b1) = 1- (1/2.35) ≈ 1-0.426 = 0.574
(In general s = (b-a)/max(a,b); this reduces to 1-(a/b) when a < b, as here.)
In a similar fashion you need to calculate the silhouette coefficient for every other point by repeating the steps above. The silhouette coefficient of a cluster is the average over its points, and the average over all points rates the clustering as a whole. Values closer to 1 indicate a better fit.
Note: The distance here is the Euclidean Distance! You can also have a look at this video for further explanation:
https://www.coursera.org/learn/cluster-analysis/lecture/RJJfM/6-2-clustering-evaluation-measuring-clustering-quality
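The steps above can be sketched in Python. This is a minimal sketch, assuming Euclidean distance and the general definition s = (b - a) / max(a, b); the function names and the convention of returning 0 for a single-point cluster are illustrative choices, not taken from the answer. Using unrounded distances (and √2 ≈ 1.41 for the distance from {1,0} to {2,1}), it gives roughly 0.574 for the point {1,0}.

```python
from math import dist  # Euclidean distance, Python 3.8+

clusters = {
    1: [(1, 0), (1, 1)],
    2: [(1, 2), (2, 3), (2, 2), (1, 2)],
    3: [(3, 1), (3, 3), (2, 1)],
}

def silhouette(point, own, clusters):
    """Silhouette of one point: (b - a) / max(a, b)."""
    members = list(clusters[own])
    members.remove(point)          # drop one copy of the point itself
    if not members:
        return 0.0                 # assumed convention for single-point clusters
    # a: mean distance to the other points of the same cluster
    a = sum(dist(point, q) for q in members) / len(members)
    # b: smallest mean distance to the points of any other cluster
    b = min(
        sum(dist(point, q) for q in pts) / len(pts)
        for label, pts in clusters.items()
        if label != own
    )
    return (b - a) / max(a, b)

def cluster_silhouette(label, clusters):
    """Mean silhouette over all points of one cluster."""
    pts = clusters[label]
    return sum(silhouette(p, label, clusters) for p in pts) / len(pts)

s = silhouette((1, 0), 1, clusters)
print(round(s, 3))
```

Calling `cluster_silhouette(2, clusters)` repeats the same steps for every point of cluster 2 and averages the results.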

Computation of Silhouette is straightforward, but it does not involve the centroids.
So don't try to compute it from what you did for cohesion; compute it from your original data.

There is a mistake in how you calculated the cohesion of C1: the last term should use the Y value of the second point, which is 1, not 0.
Cohesion(C1) = (1 - 1) ^ 2 + (1 - 1) ^ 2 + (0 - .5) ^ 2 + (1 - .5) ^ 2 = 0.5
This is the Prototype-Based (Centroid in this case) Cohesion calculation.
For calculating Separation: {Between clusters i.e. (C1,C2) , (C1,C3) & (C2,C3)}
Separation(C1,C2) = SSE(Centroid(C1), Centroid(C2))
= (1 - 1.5) ^ 2 + (0.5 - 2.25) ^ 2 = 0.25 + 3.0625 = 3.3125
Silhouette Coefficient: Combines both the Cohesion and Separation.
Refer https://cs.fit.edu/~pkc/classes/ml-internet/silhouette.pdf
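As a rough check of the centroid-based cohesion and separation described here, a small Python sketch follows. The function names are illustrative. Note that it uses the centroid Y = 2.25 for cluster 2, as computed in the question, so its cohesion for C2 comes out as 1.75 rather than the 2 obtained in the question with 2.5.

```python
clusters = {
    "C1": [(1, 0), (1, 1)],
    "C2": [(1, 2), (2, 3), (2, 2), (1, 2)],
    "C3": [(3, 1), (3, 3), (2, 1)],
}

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def cohesion(points):
    """Sum of squared errors of the points around their centroid."""
    c = centroid(points)
    return sum((x - m) ** 2 for p in points for x, m in zip(p, c))

def separation(points_a, points_b):
    """Squared Euclidean distance between the two centroids."""
    ca, cb = centroid(points_a), centroid(points_b)
    return sum((x - y) ** 2 for x, y in zip(ca, cb))

for name, pts in clusters.items():
    print(name, centroid(pts), round(cohesion(pts), 4))
print(separation(clusters["C1"], clusters["C2"]))
```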

Thanks for your answer.
"Calculate its average distance to all other points in its cluster, i.e. cluster 1" --> This part has to be corrected.
So
a1 =√( (1-1)^2 + (0-1)^2) =√(0+1)=√1=1

{1,0} ----> {2,1} = distance = √((1-2)^2 + (0-1)^2) =√(1+1)=√2=2.24
this is an error because the root of 2 is approximately 1.41


Set values of a matrix on positions inside a triangle

I have an N x N matrix with all values equal to zero. I need to take the coordinates of a triangle and set the values inside this triangle to one (1).
How can I determine the position of each element in the matrix that forms the triangle faces?
Like this 10x10 matrix, I have a triangle set at (9,1),(5,5) and (9,5):
0000000000
0000000000
0000000000
0000000000
0000000000
0000010000
0000110000
0001010000
0010010000
0111110000
I don't need the code made for me, I want to check if there is a proper way (maybe using math) to get the "coordinates".
When you have two points x1,y1 and x2,y2, you can use these to create a formula for the line using "point-slope form"
Calculate slope with m = (y1 - y2) / (x1 - x2)
Then you have a formula of y - y1 = m(x - x1)
This further goes to y = m(x - x1) + y1
So in your example of (9,1),(5,5) you calculate the m = (1 - 5) / (9 - 5) = (-4) / (4) = -1
Then your formula becomes, for that line, y = (-1)(x - 9) + 1
Then iterate between 5 and 9.
f(5) = -(5-9) + 1 = -(-4) + 1 = 4 + 1 = 5
f(6) = -(6-9) + 1 = -(-3) + 1 = 3 + 1 = 4
f(7) = -(7-9) + 1 = -(-2) + 1 = 2 + 1 = 3
f(8) = -(8-9) + 1 = -(-1) + 1 = 1 + 1 = 2
f(9) = -(9-9) + 1 = -(0) + 1 = 0 + 1 = 1
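The point-slope iteration above can be sketched in Python to draw all three edges. This is only a sketch: `line_cells` is a made-up helper name, coordinates are treated as (row, column) pairs as in the question, and vertical edges are handled separately because their slope is undefined. Stepping along one axis leaves gaps on steep edges (slope magnitude above 1), which do not occur in this example.

```python
def line_cells(p, q):
    """Integer cells on the segment from p to q, using point-slope form."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:                       # vertical edge: slope is undefined
        lo, hi = sorted((y1, y2))
        return [(x1, y) for y in range(lo, hi + 1)]
    m = (y1 - y2) / (x1 - x2)          # slope
    lo, hi = sorted((x1, x2))
    return [(x, round(m * (x - x1) + y1)) for x in range(lo, hi + 1)]

n = 10
grid = [[0] * n for _ in range(n)]
vertices = [(9, 1), (5, 5), (9, 5)]

# Draw each of the three edges of the triangle.
for i in range(3):
    for r, c in line_cells(vertices[i], vertices[(i + 1) % 3]):
        grid[r][c] = 1

for row in grid:
    print("".join(map(str, row)))
```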
Triangles have nice properties allowing a very simple algorithm to suffice.
Find Ymax, the topmost Y coordinate set in the triangle. Then for Ymax, find Xmin and Xmax, of the left and rightmost pixels set in that row. Now there are 2 cases. If Xmin == Xmax, then one vertex is (Xmin,Ymax), otherwise two of the coordinates are (Xmin, Ymax) and (Xmax, Ymax).
With this you've found the topmost coordinate or coordinates.
It's pretty simple to continue this reasoning to find the other ones. I'll let you puzzle it out for the fun...
You can combine the min and max-finding in the algorithm above with the algorithm that does the filling as required in the second part of the problem.
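For the filling part, one common alternative (not the min/max row scan described above) is to test every cell against the three edges with a cross-product sign check. The sketch below assumes (row, column) coordinates and counts cells on an edge as inside; the helper names are illustrative.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def inside(p, a, b, c):
    """True if p lies in triangle abc (edges included)."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # p is inside when it is never on opposite sides of two edges
    return not (has_neg and has_pos)

n = 10
a, b, c = (9, 1), (5, 5), (9, 5)
filled = [
    [1 if inside((r, col), a, b, c) else 0 for col in range(n)]
    for r in range(n)
]

for row in filled:
    print("".join(map(str, row)))
```

This visits every cell, so it costs O(N^2) for an N x N matrix; restricting the loops to the bounding box of the three vertices is a simple improvement.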

Why don't we include 0 matches while calculating jaccard distance between binary numbers?

I am working on a program based on Jaccard Distance, and I need to calculate the Jaccard Distance between two binary bit vectors. I came across the following on the net:
If p1 = 10111 and p2 = 10011,
The total number of each combination attributes for p1 and p2:
M11 = total number of attributes where p1 & p2 have a value 1,
M01 = total number of attributes where p1 has a value 0 & p2 has a value 1,
M10 = total number of attributes where p1 has a value 1 & p2 has a value 0,
M00 = total number of attributes where p1 & p2 have a value 0.
Jaccard similarity coefficient = J =
intersection/union = M11/(M01 + M10 + M11)
= 3 / (0 + 1 + 3) = 3/4,
Jaccard distance = J' = 1 - J = 1 - 3/4 = 1/4,
Or J' = 1 - (M11/(M01 + M10 + M11)) = (M01 + M10)/(M01 + M10 + M11)
= (0 + 1)/(0 + 1 + 3) = 1/4
Now, while calculating the coefficient, why was "M00" not included in the denominator? Can anyone please explain?
The Jaccard coefficient is a measure for asymmetric binary attributes, e.g., a scenario where the presence of an item is more important than its absence.
Since M00 deals only with absence, we do not consider it while calculating the Jaccard coefficient.
For example, while checking for the presence/absence of a disease, the presence of the disease is the more significant outcome.
Hope it helps!
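The counts from the question can be reproduced with a short Python sketch. The function names are illustrative, and returning 0 when both vectors are all zeros is an assumed convention. Padding both vectors with extra zeros only increases M00, so the distance does not change, which illustrates why M00 is left out.

```python
def match_counts(p1, p2):
    """Return (M11, M01, M10, M00) for two equal-length bit strings."""
    pairs = list(zip(p1, p2))
    m11 = pairs.count(("1", "1"))
    m01 = pairs.count(("0", "1"))   # p1 is 0, p2 is 1
    m10 = pairs.count(("1", "0"))   # p1 is 1, p2 is 0
    m00 = pairs.count(("0", "0"))
    return m11, m01, m10, m00

def jaccard_distance(p1, p2):
    m11, m01, m10, _m00 = match_counts(p1, p2)   # M00 is deliberately unused
    denominator = m01 + m10 + m11
    if denominator == 0:
        return 0.0                  # assumed convention: two all-zero vectors
    return (m01 + m10) / denominator

print(match_counts("10111", "10011"))
print(jaccard_distance("10111", "10011"))
print(jaccard_distance("10111000", "10011000"))
```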
The Jaccard index of A and B is |A∩B|/|A∪B| = |A∩B|/(|A| + |B| - |A∩B|).
We have: |A∩B| = M11, |A| = M11 + M10, |B| = M11 + M01.
So |A∩B|/(|A| + |B| - |A∩B|) = M11 / (M11 + M10 + M11 + M01 - M11) = M11 / (M10 + M01 + M11).
This Venn diagram may help:

Number of ways of distributing n identical balls into groups such that each group has at least k balls?

I am trying to do this using recursion with memoization. I have identified the following base cases:
I) When n == k, there is only one group, with all the balls.
II) When k > n, no group can have at least k balls, hence zero.
I am unable to move forward from here. How can this be done?
As an illustration when n=6 ,k=2
(2,2,2)
(4,2)
(3,3)
(6)
That is 4 different groupings can be formed.
This can be represented by the two dimensional recursive formula described below:
T(0, k) = 1
T(n, k) = 0                        if n < k and n != 0
T(n, k) = T(n-k, k) + T(n, k+1)    otherwise
  T(n-k, k): there is a box with exactly k balls; put them in it
  T(n, k+1): no box has exactly k balls; advance to the next k
In the above, T(n,k) is the number of distributions of n balls such that each box gets at least k.
The trick is to think of k as the lowest possible number of balls in a box, and to separate the problem into two scenarios: either there is a box with exactly k balls (if so, place them and recurse with n-k balls), or there is not (then recurse with a minimal value of k+1 and the same number of balls).
Example, to calculate your example: T(6,2) (6 balls, minimum 2 per box):
T(6,2) = T(4,2) + T(6,3)
T(4,2) = T(2,2) + T(4,3) = T(0,2) + T(2,3) + T(1,3) + T(4,4) =
= T(0,2) + T(2,3) + T(1,3) + T(0,4) + T(4,5) =
= 1 + 0 + 0 + 1 + 0
= 2
T(6,3) = T(3,3) + T(6,4) = T(0,3) + T(3,4) + T(2,4) + T(6,5)
= T(0,3) + T(3,4) + T(2,4) + T(1,5) + T(6,6) =
= T(0,3) + T(3,4) + T(2,4) + T(1,5) + T(0,6) + T(6,7) =
= 1 + 0 + 0 + 0 + 1 + 0
= 2
T(6,2) = T(4,2) + T(6,3) = 2 + 2 = 4
Using Dynamic Programming, it can be calculated in O(n^2) time.
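The recurrence translates almost directly into memoized Python. A minimal sketch, assuming k >= 1 (with k = 0 the first branch would never shrink n) and small enough n that the recursion depth is not a concern:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(n, k):
    """Ways to split n identical balls into boxes holding at least k each."""
    if n == 0:
        return 1          # nothing left to place: one valid distribution
    if n < k:
        return 0          # leftover balls cannot fill a box
    # Either some box holds exactly k balls, or every box holds at least k + 1.
    return T(n - k, k) + T(n, k + 1)

print(T(6, 2))
```

For n = 6 and k = 2 this yields 4, matching the groupings (2,2,2), (4,2), (3,3) and (6) listed in the question.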
This case can be solved pretty simply:
Number of buckets
The maximum-number of buckets b can be determined as follows:
b = roundDown(n / k)
Each valid distribution can use at most b buckets.
Number of distributions with x buckets
For a given number of buckets, the number of distributions can be found pretty simply:
Distribute k balls to each bucket. Find the number of ways to distribute the remaining balls (r = n - k * x) to x buckets:
total_distributions(x) = bincoefficient(x , n - k * x)
EDIT: this will only work if order matters. Since it doesn't for the question, we can use a few tricks here:
Each distribution can be mapped to a sequence of numbers, e.g. d = {d1 , d2 , ... , dx}. We can easily generate all of these sequences starting with the "first" sequence {r , 0 , ... , 0} and subsequently moving 1s from the left to the right. So the next sequence would look like this: {r - 1 , 1 , ... , 0}. If only sequences matching d1 >= d2 >= ... >= dx are generated, no duplicates will be generated. This constraint can easily be used to optimize this search a bit: we can only move a 1 from da to db (with a = b - 1) if da - 1 >= db + 1, since otherwise the constraint that the array is sorted is violated. The 1s to move are always the rightmost that can be moved. Another way to think of this would be to view r as a unary number and simply split that string into groups such that each group is at least as long as its successor.
countSequences(x)
    sequence[]
    sequence[0] = r
    sequenceCount = 1
    while true
        int i = findRightmostMoveable(sequence)
        if i == -1
            return sequenceCount
        // move one ball from position i to position i + 1
        sequence[i] -= 1
        sequence[i + 1] += 1
        sequenceCount += 1
findRightmostMoveable(sequence)
    for i in [length(sequence) - 1 , 0)
        if sequence[i - 1] > sequence[i] + 1
            return i - 1
    return -1
Actually findRightmostMoveable could be optimized a bit, if we look at the structure-transitions of the sequence (to be more precise the difference between two elements of the sequence). But to be honest I'm by far too lazy to optimize this further.
Putting the pieces together
range(1 , roundDown(n / k)).map(b -> countSequences(b)).sum()

Explanation of Bicubic interpolation in Matlab?

I am confused by Matlab's example on Bicubic interpolation at http://www.mathworks.com/help/vision/ug/interpolation-methods.html#f13689
I think I understand their Bilinear example. It seems like they took the averages of the adjacent translated values on either side. So, to get the 0.5 in their first row, first column, the average of 0 and 1 was taken.
For their Bicubic interpolation example, I am rather confused by their method. They say that they take the "weighted average of the two translated values on either side".
In their example, they have
1 2 3
4 5 6
7 8 9
and in their first step of Bicubic interpolation, they add zeros to the matrix and translate it by 0.5 pixel to the right to get the following:
0 0 0 1 1 2 2 3 3 0 0 0 0
0 0 0 4 4 5 5 6 6 0 0 0 0
0 0 0 7 7 8 8 9 9 0 0 0 0
Then, using weighted average, they get
0.375 1.500 3.000 1.625
1.875 4.875 6.375 3.125
3.375 8.250 9.750 4.625
However, I am not sure how they got those numbers. Instead of 0.375 in the first row, first column, I would have done instead (1 * 3/8 + 2 * 1/8) = 5/8 . This is because the format seems to be
0 _ 0 1 1 _ 2
3d d d 3d
where d is the distance.
So to take the weighted average of the translated values, we can note that the we can first do (3d + d + d + 3d) = 1 and so d = 1/8. That means we should put 3/8 weight on each of the closer translated values and 1/8 weight on each of the further translated values. That leads to (0 * 1/8 + 0 * 3/8 + 1 * 3/8 + 2 * 1/8), which is 5/8 and does not match their 3/8 result. I was wondering where I went wrong.
Thanks!
Bicubic interpolation uses negative weights (this sometimes results in overshoot when filtering).
In this example, the weights used are:
-1/8 5/8 5/8 -1/8
These weights sum to 1, but give larger weight to the middle samples and smaller (negative) weights to the outer samples.
Using these weights we get the observed values, e.g.
0.375 = 5/8*1 -1/8*2
1.5 = 5/8*1+5/8*2 -1/8*3
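Applying these four weights to a zero-padded row reproduces the first row of the example output. The sketch below is in Python rather than Matlab; the function name and the padding of two zeros on each side are assumptions chosen to mirror the zero padding in the example.

```python
WEIGHTS = (-1 / 8, 5 / 8, 5 / 8, -1 / 8)

def shift_half_pixel(row):
    """Weighted average of the four nearest samples around each half-pixel."""
    padded = [0, 0, *row, 0, 0]
    return [
        sum(w * v for w, v in zip(WEIGHTS, padded[i:i + 4]))
        for i in range(len(row) + 1)
    ]

for row in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
    print(shift_half_pixel(row))
```

The first row evaluates to 0.375, 1.5, 3.0 and 1.625, the same values as in the example.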
I found that the topic "imresize - trying to understand the bicubic interpolation" could resolve your confusion, especially the comment with 7 upvotes. By the way, in that comment the author states that alpha = -0.5 in Matlab, which contrasts with my experience. I wrote 2 functions to test, and I figured out that Matlab sets alpha = -0.9.
Here is the code I can provide:
Cubic:
function f = cubic(x)
a = -0.9;
absx = abs(x);
absx2 = absx.^2;
absx3 = absx.^3;
f = ((a+2)*absx3 - (a+3)*absx2 + 1) .* (absx <= 1) + ...
(a*absx3 -5*a*absx2 + 8*a*absx - 4*a) .* ((1 < absx) & (absx <= 2));
end
Interpolation with Bi-cubic:
function f = intpolcub(x1,x2,x3,x4,d)
f = x1*cubic(-d-1) + x2*cubic(-d) + x3*cubic(-d+1) + x4*cubic(-d+2);
end
You could test with the following line of code:
intpolcub(0,0,1,2,0.5)
This comes close to the first number in the output matrix of the Matlab bicubic interpolation example mentioned above (it evaluates to 0.3875, against 0.375 in the example).
Matlab (R2017a) works with a=-1 so:
For cubic:
function f_c = cubic(x)
a = -1;
absx = abs(x);
absx2 = absx.^2;
absx3 = absx.^3;
f_c = ((a+2)*absx3 - (a+3)*absx2 + 1) .* (absx <= 1) + ...
(a*absx3 -5*a*absx2 + 8*a*absx - 4*a) .* ((1 < absx) & (absx <= 2));
end
And for Bicubic interpolation:
function f_bc = intpolcub(x1,x2,x3,x4,d)
f_bc = x1*cubic(-d-1) + x2*cubic(-d) + x3*cubic(-d+1) + x4*cubic(-d+2);
end
Test:
intpolcub(0,0,1,2,0.5)
Explicitly it goes:
f_bc = 0*cubic(-0.5-1)+0*cubic(-0.5)+1*cubic(-0.5+1)+2*cubic(-0.5+2) = 1*cubic(0.5)+2*cubic(1.5);
Now the calculation of cubic for 0.5 (|x| <= 1) and 1.5 (1 < |x| <= 2) is:
cubic(0.5) = (-1+2)*0.5^3-(-1+3)*0.5^2+1 = 5/8
cubic(1.5) = (-1)*1.5^3-5*(-1)*1.5^2+8*(-1)*1.5-4*(-1) = -1/8
So that f_bc is:
f_bc = 5/8+2*(-1/8) = 0.375
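For readers without Matlab, the same kernel can be sketched in Python with the parameter a exposed, so the values of a mentioned in this thread can be compared. This is a scalar port under the same formula, not Matlab's actual implementation; the default a = -1 is simply the value that reproduces 0.375 here.

```python
def cubic(x, a=-1.0):
    """Cubic convolution kernel with free parameter a."""
    ax = abs(x)
    if ax <= 1:
        return (a + 2) * ax**3 - (a + 3) * ax**2 + 1
    if ax <= 2:
        return a * ax**3 - 5 * a * ax**2 + 8 * a * ax - 4 * a
    return 0.0

def intpolcub(x1, x2, x3, x4, d, a=-1.0):
    """Interpolate at offset d between x2 and x3 using four samples."""
    return (x1 * cubic(-d - 1, a) + x2 * cubic(-d, a)
            + x3 * cubic(-d + 1, a) + x4 * cubic(-d + 2, a))

for a in (-1.0, -0.9, -0.5):
    print(a, intpolcub(0, 0, 1, 2, 0.5, a))
```

Under this kernel only a = -1 gives 0.375 for the first output value; a = -0.9 gives about 0.3875 and a = -0.5 gives 0.4375.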

Determine distance between two random nodes in a tree

Given a general tree, I want the distance between two nodes v and w.
Wikipedia states the following:
Computation of lowest common ancestors may be useful, for instance, as part of a procedure for determining the distance between pairs of nodes in a tree: the distance from v to w can be computed as the distance from the root to v, plus the distance from the root to w, minus twice the distance from the root to their lowest common ancestor.
Let's say d(x) denotes the distance of node x from the root, which we take to be node 1. d(x,y) denotes the distance between two vertices x and y. lca(x,y) denotes the lowest common ancestor of vertex pair x and y.
Thus if we have 4 and 8, lca(4,8) = 2 therefore, according to the description above, d(4,8) = d(4) + d(8) - 2 * d(lca(4,8)) = 2 + 3 - 2 * 1 = 3. Great, that worked!
However, the formula above seems to fail for the vertex pair (8,3) (lca(8,3) = 2): d(8,3) = d(8) + d(3) - 2 * d(2) = 3 + 1 - 2 * 1 = 2. This is incorrect, however: the distance d(8,3) = 4, as can be seen on the graph. The algorithm seems to fail for anything that crosses over the defined root.
What am I missing?
You missed that lca(8,3) = 1, not 2. Since d(1) = 0, this gives:
d(8,3) = d(8) + d(3) - 2 * d(1) = 3 + 1 - 2 * 0 = 4
For the appropriate 2 node, namely the one on the right, d(lca(8,2)) == 0, not 1 as you have it in your derivation. The distance from the root--which is the lca in this case--to itself is zero. So
d(8,2) = d(8) + d(2) - 2 * d(lca(8,2)) = 3 + 1 - 2 * 0 = 4
The fact that you have two nodes labeled 2 is probably confusing things.
Edit: The post has been edited so that a node originally labeled 2 is now labeled 3. In this case, the derivation is now correct but the statement
the distance d(8,2) = 4 as can be seen on the graph
is incorrect, d(8,2) = 2.
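The formula can be checked with a small Python sketch. The graph from the question is not reproduced here, so the tree below is an assumed one, chosen only to be consistent with the distances quoted in the thread (root 1; node 5 is a made-up intermediate node between 2 and 8).

```python
# Assumed tree, given as child -> parent (the root has parent None).
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 8: 5}

def path_to_root(v):
    """Nodes from v up to the root, inclusive."""
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path

def depth(v):
    """Number of edges between v and the root."""
    return len(path_to_root(v)) - 1

def lca(v, w):
    """Lowest common ancestor: first node on w's path that is also above v."""
    ancestors = set(path_to_root(v))
    for node in path_to_root(w):
        if node in ancestors:
            return node

def distance(v, w):
    return depth(v) + depth(w) - 2 * depth(lca(v, w))

print(lca(8, 3), distance(8, 3))
print(lca(4, 8), distance(4, 8))
print(lca(8, 2), distance(8, 2))
```

Walking both paths to the root costs time proportional to the tree height per query, which is fine for a check like this; faster LCA structures exist for many queries.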
