Clustering points to reach global minimum - algorithm

Given a number of points, for example 100, where each pair has a 'connection' (a number), the goal of the algorithm is to split those points into a given number of clusters (say 5), minimizing the total connection inside the clusters.
input:
A matrix of shape n * n, where matrix[i][j] describes the connection between points i and j (the matrix should be symmetric), and the cluster count m.
output:
m clusters covering the n points, such that the total connection inside the clusters is minimized.
T = ∑_{C ∈ clusters} ∑_{i,j ∈ C, i < j} M_ij
(The goal is to minimize T.)
For example, 5 points with the matrix:
       1    2    3    4    5
  1  0.1  0.1  0.3  0.5  0.7
  2  0.1  0.1  0.7  0.9  1.1
  3  0.3  0.7  0.5  0.1  0.2
  4  0.5  0.9  0.1  0.3  0.5
  5  0.7  1.1  0.2  0.5  0.1
When splitting into 2 clusters, the split
Cluster 1: {1,2}
Cluster 2: {3,4,5}
has a total internal connection of T = C1 + C2 = M12 + M34 + M35 + M45 = 0.1 + 0.1 + 0.2 + 0.5 = 0.9.
The split
Cluster 1: {1,3,4}
Cluster 2: {2,5}
has a total internal connection of T = C1 + C2 = M13 + M14 + M34 + M25 = 0.3 + 0.5 + 0.1 + 1.1 = 2.0.
The goal is to find the split with the lowest total internal connection.
This is easy when n and m are small: just loop over all possible partitions to find the global minimum. But when n and m become bigger, exhaustive iteration is not possible.
I have tried the Kernighan–Lin algorithm on this problem. Initialize with a random split, then define two moves: moving a point into another cluster, and swapping two points between two clusters. At each step, find the move that lowers the total internal connection the most, apply it, recalculate, and repeat until no insertion/swap can lower the total (a greedy strategy).
However, it only reaches a local minimum, and with different initializations the results differ. Is there a standard way to solve this problem that reaches the global minimum?

The problem is supposedly NP-hard, so either you accept a local optimum, or you have to try all O(m^n) possibilities.
You can use a local optimum to bound your search, but there is no guarantee that this helps much.
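For small instances, the exhaustive search can be sketched like this (a minimal brute-force demo; it enumerates every assignment of points to clusters, so it is only feasible for tiny n, and it revisits each partition once per relabeling of the clusters):

```python
from itertools import product

def brute_force_clustering(M, m):
    """Enumerate every assignment of n points to m clusters and return
    the minimum total internal connection T (summing M[i][j] over pairs
    i < j that land in the same cluster)."""
    n = len(M)
    best_t, best_assign = float("inf"), None
    for assign in product(range(m), repeat=n):
        if len(set(assign)) < m:          # skip splits with an empty cluster
            continue
        t = sum(M[i][j]
                for i in range(n)
                for j in range(i + 1, n)
                if assign[i] == assign[j])
        if t < best_t:
            best_t, best_assign = t, assign
    return best_t, best_assign

# The 5-point example matrix from the question
M = [[0.1, 0.1, 0.3, 0.5, 0.7],
     [0.1, 0.1, 0.7, 0.9, 1.1],
     [0.3, 0.7, 0.5, 0.1, 0.2],
     [0.5, 0.9, 0.1, 0.3, 0.5],
     [0.7, 1.1, 0.2, 0.5, 0.1]]

best_t, assignment = brute_force_clustering(M, 2)
# best_t is 0.9, achieved by the split {1,2} | {3,4,5}
```

The loop visits m^n assignments, which is exactly the blowup the answer warns about; it is useful only as a ground-truth check against heuristics like Kernighan–Lin.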

Related

Efficiently computing all the perfect square numbers for very large numbers like 10**20

Examples of perfect square numbers are 1,4,9,16,25....
How do we compute all the perfect square numbers up to a very large number like 10^20? Up to 10^20 there are 10^10 perfect squares.
So far, this is what I have done:
Brute force: calculate x**2 for x in the range 1 to 10^10. Since my system handles only about 10^6 iterations, this didn't work.
Two-pointer approach: I have taken the upper and lower bounds:
Upper bound: 10^20
Lower bound: 1
Now I have taken two pointers, one at the start and the other at the end. The next perfect square after the lower bound will be
lower_bound + (sqrt(lower_bound) * 2 + 1)
Example: for 4, the next perfect square is
4 + (sqrt(4) * 2 + 1) = 9
In the same way, the upper bound will be decreasing:
upper_bound - (sqrt(upper_bound) * 2 - 1)
Example: for 25, the previous perfect square is
25 - (sqrt(25) * 2 - 1) = 16
Neither of the above approaches worked well because the upper bound, 10^20, is a very large number.
How can we efficiently compute all the perfect squares up to 10^20 in less time?
It's easy to note the difference between perfect squares:
0   1   4   9   16   25 ...
  1   3   5   7    9        (the consecutive differences)
So we have:
answer = 0;
for (i = 1; answer <= 10^20; i = i + 2) {
    answer = answer + i;
    print(answer);
}
Since you want all the perfect squares up to x, the time complexity will be O(sqrt(x)), which may be slow for x = 10^20, whose square root is 10^10.
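In Python the same idea can be written as a generator, so the squares are produced lazily instead of being stored (a sketch; materializing all 10^10 squares up to 10^20 is impractical anyway):

```python
from itertools import islice

def squares_up_to(limit):
    """Yield every perfect square <= limit, using the fact that
    consecutive squares differ by the odd numbers 1, 3, 5, ..."""
    square, odd = 1, 1
    while square <= limit:
        yield square
        odd += 2
        square += odd   # next square, no multiplication needed

# consume only as many squares as you actually need
print(list(islice(squares_up_to(10**20), 5)))  # [1, 4, 9, 16, 25]
```

Because the generator never builds a list, memory stays constant no matter how large the limit is; only the O(sqrt(x)) iteration count remains.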

A variant of the Knapsack algorithm

I have a list of items, a, b, c,..., each of which has a weight and a value.
The 'ordinary' Knapsack algorithm will find the selection of items that maximises the value of the selected items, whilst ensuring that the weight is below a given constraint.
The problem I have is slightly different. I wish to minimise the value (easy enough by using the reciprocal of the value), whilst ensuring that the weight is at least the given constraint, rather than at most.
I have tried re-routing the idea through the ordinary Knapsack algorithm, but couldn't make it work. I was hoping there is another combinatorial algorithm that I am not aware of that does this.
In the German wiki it's formalized as:
finite set of objects U
weight function w: U -> R
value function v: U -> R
B in R  # constraint rhs
Find a subset K of U subject to:
sum of w(u) over all u in K <= B
such that:
sum of v(u) over all u in K is maximized
So there is no restriction like nonnegativity.
Just use negative weights, negative values and a negative B.
The basic concept is:
sum over u in K of w(u) <= B
<->
-(sum over u in K of w(u)) >= -B
So in your case:
classic constraint: x0 + x1 <= B   |  3 + 7 <= 12 Y   |  3 + 10 <= 12 N
becomes: -x0 - x1 <= -B            | -3 - 7 <= -12 N  | -3 - 10 <= -12 Y
So for a given implementation it depends on the software if this is allowed. In terms of the optimization-problem, there is no problem. The integer-programming formulation for your case is as natural as the classic one (and bounded).
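If the weights are integers, this variant can also be solved directly with a small dynamic program over the required weight, with no IP solver involved (a sketch under that integrality assumption; `need` is the minimum total weight the selection must reach):

```python
def min_value_cover(weights, values, need):
    """0-1 knapsack variant: minimize total value subject to
    total weight >= need. dp[j] = least value of a subset whose
    weight is at least j; surplus weight beyond need clamps to 0."""
    INF = float("inf")
    dp = [INF] * (need + 1)
    dp[0] = 0
    for wt, val in zip(weights, values):
        # iterate j downwards so each item is used at most once
        for j in range(need, 0, -1):
            prev = dp[max(0, j - wt)]
            if prev + val < dp[j]:
                dp[j] = prev + val
    return dp[need]   # INF if `need` is unreachable

# Same instance as the IP demo: optimal value is 5 ({43, 8} weighs 51 >= 50)
print(min_value_cover([37, 43, 12, 8, 9], [11, 5, 15, 0, 16], 50))  # 5
```

This runs in O(n * need) time, so it is practical when the weight bound is modest; for large or fractional weights the integer-programming route below remains the better fit.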
Python Demo based on Integer-Programming
Code
import numpy as np
import scipy.sparse as sp
from cylp.cy import CyClpSimplex
np.random.seed(1)
""" INSTANCE """
weight = np.random.randint(50, size = 5)
value = np.random.randint(50, size = 5)
capacity = 50
""" SOLVE """
n = weight.shape[0]
model = CyClpSimplex()
x = model.addVariable('x', n, isInt=True)
model.objective = value # MODIFICATION: default = minimize!
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int) # assumes existence
print("INSTANCE")
print(" weights: ", weight)
print(" values: ", value)
print(" capacity: ", capacity)
print("Solution")
print(x_sol)
print("sum weight: ", x_sol.dot(weight))
print("value: ", x_sol.dot(value))
Small remarks
This code is just a demo using a somewhat low-level library; there are other tools available that might be better suited (e.g. on Windows: pulp)
it's the classic integer-programming formulation from the wiki, modified as mentioned above
it will scale very well as the underlying solver is pretty good
as written, it's solving the 0-1 knapsack (only variable bounds would need to be changed)
Small look at the core-code:
# create model
model = CyClpSimplex()
# create one variable for each how-often-do-i-pick-this-item decision
# variable needs to be integer (or binary for 0-1 knapsack)
x = model.addVariable('x', n, isInt=True)
# the objective value of our IP: a linear-function
# cylp only needs the coefficients of this function: c0*x0 + c1*x1 + c2*x2...
# we only need our value vector
model.objective = value # MODIFICATION: default = minimize!
# WARNING: typically one should always use variable-bounds
# (cylp problems...)
# workaround: express bounds lower_bound <= var <= upper_bound as two constraints
# a constraint is an affine-expression
# sp.eye creates a sparse-diagonal with 1's
# example: sp.eye(3) * x >= 5
# 1 0 0 -> 1 * x0 + 0 * x1 + 0 * x2 >= 5
# 0 1 0 -> 0 * x0 + 1 * x1 + 0 * x2 >= 5
# 0 0 1 -> 0 * x0 + 0 * x1 + 1 * x2 >= 5
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
# cylp somewhat outdated: need numpy's matrix class
# apart from that it's just the weight-constraint as defined at wiki
# same affine-expression as above (but only a row-vector-like matrix)
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
# internal type conversion needed to treat it as an IP (or else it would be an LP)
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
# type-casting
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int)
Output
Welcome to the CBC MILP Solver
Version: 2.9.9
Build Date: Jan 15 2018
command line - ICbcModel -solve -quit (default strategy 1)
Continuous objective value is 4.88372 - 0.00 seconds
Cgl0004I processed model has 1 rows, 4 columns (4 integer (4 of which binary)) and 4 elements
Cutoff increment increased from 1e-05 to 0.9999
Cbc0038I Initial state - 0 integers unsatisfied sum - 0
Cbc0038I Solution found of 5
Cbc0038I Before mini branch and bound, 4 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.00 seconds)
Cbc0038I After 0.00 seconds - Feasibility pump exiting with objective of 5 - took 0.00 seconds
Cbc0012I Integer solution of 5 found by feasibility pump after 0 iterations and 0 nodes (0.00 seconds)
Cbc0001I Search completed - best objective 5, took 0 iterations and 0 nodes (0.00 seconds)
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from 5 to 5
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Result - Optimal solution found
Objective value: 5.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.00
Time (Wallclock seconds): 0.00
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.00
INSTANCE
weights: [37 43 12 8 9]
values: [11 5 15 0 16]
capacity: 50
Solution
[0 1 0 1 0]
sum weight: 51
value: 5

What is the probability of the survival of a tribble?

You have a population of k Tribbles. This particular species of Tribbles live for exactly one day and then die. Just before death, a single Tribble has the probability P_i of giving birth to i more Tribbles. What is the probability that after m generations, every Tribble will be dead?
Is my analysis right? If it is, why does it not match the expected output?
Case 1:
Number of tribbles: k = 1
Number of generations: m = 1
Probabilities: P_0 = 0.33 P_1 = 0.34 P_2 = 0.33
The probability that after 1 generation every Tribble would be dead = P_0 = 0.33
Case 2:
Number of tribbles: k = 1
Number of generations: m = 2
Probabilities: P_0 = 0.33 P_1 = 0.34 P_2 = 0.33
Each tribble can have either 0 or 1 or 2 children.
At the end of the first year there has to be at least one tribble to ensure that there are tribbles in the second generation also.
The tribble of the first generation should have 1 or 2 children. So the number of tribbles at the end of the first year would be either 1 or 2, with probabilities P_1 = 0.34 and P_2 = 0.33 respectively.
If there is to be no children after the second generation, none of these children should have children of their own.
If there is 1 child in the second generation, the probability it would have no children is P_0=0.33
If there are 2 children in the second generation, the probability that none of them would have children is (P_0)^2=(0.33)^2=0.1089
The probability that after 2 generations every tribble would be dead is the probability of there being 1 child times the probability of it not having children, plus the probability of there being 2 children times the probability of none of them having children = 0.34 × 0.33 + 0.33 × 0.1089 = 0.148137.
You missed the case where the first generation has 0 children.
The correct equation is
P_0 × 1 + P_1 × P_0 + P_2 × P_0^2
= 0.33 + 0.34 x 0.33 + 0.33 x (0.33)^2
= 0.478137
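The corrected equation generalizes: if q_m is the probability that one tribble's line is extinct after m generations, then q_0 = 0 and q_m = P_0 + P_1·q_{m-1} + P_2·q_{m-1}^2, and k independent starting tribbles give q_m^k. A short sketch:

```python
def extinction_probability(P, m, k=1):
    """P[i] = probability a tribble has i children. Returns the
    probability that all descendants of k starting tribbles are
    dead after m generations (iterates the generating function of P)."""
    q = 0.0
    for _ in range(m):
        q = sum(p * q**i for i, p in enumerate(P))
    return q ** k

P = [0.33, 0.34, 0.33]
print(extinction_probability(P, 1))            # 0.33
print(round(extinction_probability(P, 2), 6))  # 0.478137
```

This also handles longer offspring distributions (P_3, P_4, ...) unchanged, since the sum runs over all entries of P.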

Chain Matrix Multiplication

I'm trying to learn chain matrix multiplication.
Suppose A is a 10 × 30 matrix, B is a 30 × 5 matrix, and C is a 5 × 60 matrix. Then,
How do we get the following numbers of operations? (Is it rows times columns?)
(AB)C = (10×30×5) + (10×5×60) = 1500 + 3000 = 4500 operations
A(BC) = (30×5×60) + (10×30×60) = 9000 + 18000 = 27000 operations.
http://www.geeksforgeeks.org/dynamic-programming-set-8-matrix-chain-multiplication/
The number of operations is the number of scalar multiplications required to calculate the result. A * B results in a 10 x 5 matrix. Each entry in this matrix is the dot product of the respective row of A with the column of B of the same index. Thus A * B requires calculating 10 x 5 cells, where each cell is a sum of 30 multiplications, hence 10 x 5 x 30. Though counting only the multiplications is a rather coarse representation of the cost.
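The rule in the answer fits in a couple of lines: multiplying a p × q matrix by a q × r matrix costs p·q·r scalar multiplications, and a parenthesization's cost is the sum over its binary products. A quick check of the numbers above:

```python
def cost(p, q, r):
    # multiplying (p x q) by (q x r): p*r entries, each a length-q dot product
    return p * q * r

# A: 10x30, B: 30x5, C: 5x60
ab_then_c = cost(10, 30, 5) + cost(10, 5, 60)    # (AB)C
a_then_bc = cost(30, 5, 60) + cost(10, 30, 60)   # A(BC)
print(ab_then_c, a_then_bc)  # 4500 27000
```

The dynamic-programming algorithm in the linked article simply minimizes this sum over all parenthesizations of a longer chain.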

How do you manually compute for silhouette, cohesion and separation of Cluster

Good day!
I have been looking all over the Internet for how to compute the silhouette coefficient, cohesion, and separation of clusters. Unfortunately, despite the resources, I just can't understand the formulas posted. I know there are implementations in some tools, but I want to know how to compute them manually, especially given a vector space model.
Assuming that I have the following clusters:
Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}}
Cluster 3 ={{3,1},{3,3},{2,1}}
The way I understood it according to [1] is that I have to get the average of the points per cluster:
C1: x = 1, y = 0.5
C2: x = 1.5, y = 2.25
C3: x = 2.67, y = 1.67
Given the mean, I have to compute for my cohesion by Sum of Square Error (SSE):
Cohesion(C1) = (1-1)^2 + (1-1)^2 + (0-.5)^2 + (0-.5)^2 = 0.5
Cohesion(C2) = (1-1.5)^2 + (2-1.5)^2 + (2-1.5)^2 + (1-1.5)^2 + (2-2.25)^2 + (3-2.25)^2 + (2-2.25)^2 + (2-2.25)^2 = 1.75
Cohesion(C3) = (3-2.67)^2 + (3-2.67)^2 + (2-2.67)^2 + (1-1.67)^2 + (3-1.67)^2 + (1-1.67)^2 = 3.3334
Cluster(C) = 0.5 + 1.75 + 3.3334 = 5.5834
My questions are:
1. Did I perform cohesion correctly?
2. How do I compute for Separation?
3. How do I compute for Silhouette Coefficient?
Thank you.
References: [1] http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf
Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}}
Cluster 3 ={{3,1},{3,3},{2,1}}
Take a point {1,0} in cluster 1
Calculate its average distance to all other points in its cluster, i.e. cluster 1
So a1 =√( (1-1)^2 + (0-1)^2) =√(0+1)=√1=1
Now for the object {1,0} in cluster 1 calculate its average distance from all the objects in cluster 2 and cluster 3. Of these take the minimum average distance.
So for cluster 2
{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2
{1,0} ----> {2,3} = distance = √((1-2)^2 + (0-3)^2) =√(1+9)=√10=3.16
{1,0} ----> {2,2} = distance = √((1-2)^2 + (0-2)^2) =√(1+4)=√5=2.24
{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2
Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 2 =
(2+3.16+2.24+2)/4 = 2.35
Similarly, for cluster 3
{1,0} ----> {3,1} = distance = √((1-3)^2 + (0-1)^2) =√(4+1)=√5=2.24
{1,0} ----> {3,3} = distance = √((1-3)^2 + (0-3)^2) =√(4+9)=√13=3.61
{1,0} ----> {2,1} = distance = √((1-2)^2 + (0-1)^2) =√(1+1)=√2=2.24
Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 3 =
(2.24+3.61+2.24)/3 = 2.7
Now, the minimum average distance of the point {1,0} in cluster 1 to the other clusters 2 and 3 is
b1 = 2.35 (2.35 < 2.7)
So the silhouette coefficient of cluster 1 is
s1 = 1 - (a1/b1) = 1 - (1/2.35) = 1 - 0.4255 = 0.5745
In a similar fashion you need to calculate the silhouette coefficient for clusters 2 and 3 separately, by taking a single point in each cluster and repeating the steps above. The cluster with the greatest silhouette coefficient is the best according to this evaluation.
Note: The distance here is the Euclidean Distance! You can also have a look at this video for further explanation:
https://www.coursera.org/learn/cluster-analysis/lecture/RJJfM/6-2-clustering-evaluation-measuring-clustering-quality
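The walk-through above can be reproduced in a few lines (a sketch using the general silhouette formula s = (b - a) / max(a, b), which equals the answer's 1 - a/b whenever b > a; small differences from the hand calculation come from rounding in the intermediate steps):

```python
from math import dist  # Euclidean distance (Python 3.8+)

clusters = [
    [(1, 0), (1, 1)],
    [(1, 2), (2, 3), (2, 2), (1, 2)],
    [(3, 1), (3, 3), (2, 1)],
]

def silhouette(point, own, others):
    """Silhouette of one point: a = mean distance to the rest of its
    own cluster, b = smallest mean distance to any other cluster.
    Assumes `point` occurs exactly once in `own`."""
    a = sum(dist(point, q) for q in own if q != point) / (len(own) - 1)
    b = min(sum(dist(point, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

s1 = silhouette((1, 0), clusters[0], clusters[1:])
print(round(s1, 2))  # 0.57
```

Averaging this value over every point (rather than one point per cluster) gives the usual per-cluster and overall silhouette scores.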
Computation of Silhouette is straightforward, but it does not involve the centroids.
So don't try to compute it from what you did for cohesion; compute it from your original data.
There is a mistake in your Cohesion(C1) calculation. It should be:
Cohesion(C1) = (1 - 1) ^ 2 + (1 - 1) ^ 2 + (0 - .5) ^ 2 + (1 - .5) ^ 2 = 0.5
This is the Prototype-Based (Centroid in this case) Cohesion calculation.
For calculating Separation: {Between clusters i.e. (C1,C2) , (C1,C3) & (C2,C3)}
Separation(C1,C2) = SSE(Centroid(C1), Centroid(C2))
= (1 - 1.5) ^ 2 + (0.5 - 2.25) ^ 2 = 0.25 + 3.0625 = 3.3125
Silhouette Coefficient: Combines both the Cohesion and Separation.
Refer https://cs.fit.edu/~pkc/classes/ml-internet/silhouette.pdf
Thanks for your answer.
'Calculate its average distance to all other points in its cluster, i.e. cluster 1' --> this part has to be corrected.
So
a1 =√( (1-1)^2 + (0-1)^2) =√(0+1)=√1=1
{1,0} ----> {2,1} = distance = √((1-2)^2 + (0-1)^2) =√(1+1)=√2=2.24
This is an error: the square root of 2 is approximately 1.41, not 2.24.