Clustering with a distance matrix - algorithm

I have a (symmetric) matrix M that represents the distance between each pair of nodes. For example,
A B C D E F G H I J K L
A 0 20 20 20 40 60 60 60 100 120 120 120
B 20 0 20 20 60 80 80 80 120 140 140 140
C 20 20 0 20 60 80 80 80 120 140 140 140
D 20 20 20 0 60 80 80 80 120 140 140 140
E 40 60 60 60 0 20 20 20 60 80 80 80
F 60 80 80 80 20 0 20 20 40 60 60 60
G 60 80 80 80 20 20 0 20 60 80 80 80
H 60 80 80 80 20 20 20 0 60 80 80 80
I 100 120 120 120 60 40 60 60 0 20 20 20
J 120 140 140 140 80 60 80 80 20 0 20 20
K 120 140 140 140 80 60 80 80 20 20 0 20
L 120 140 140 140 80 60 80 80 20 20 20 0
Is there any method to extract clusters from M (if needed, the number of clusters can be fixed), such that each cluster contains nodes with small distances between them. In the example, the clusters would be (A, B, C, D), (E, F, G, H) and (I, J, K, L).
Thanks a lot :)

Hierarchical clustering works directly with the distance matrix instead of the actual observations. If you know the number of clusters, you will already know your stopping criterion (stop when there are k clusters). The main trick here will be to choose an appropriate linkage method. Also, this paper(pdf) gives an excellent overview of all kinds of clustering methods.

One more possible way is using Partitioning Around Medoids which often called K-Medoids. If you look at R-clustering package you will see pam function which recieves distance matrix as input data.

Well, It is possible to perform K-means clustering on a given similarity matrix, at first you need to center the matrix and then take the eigenvalues of the matrix. The final and the most important step is multiplying the first two set of eigenvectors to the square root of diagonals of the eigenvalues to get the vectors and then move on with K-means . Below the code shows how to do it. You can change similarity matrix. fpdist is the similarity matrix.
mds.tau <- function(H)
{
n <- nrow(H)
P <- diag(n) - 1/n
return(-0.5 * P %*% H %*% P)
}
B<-mds.tau(fpdist)
eig <- eigen(B, symmetric = TRUE)
v <- eig$values[1:2]
#convert negative values to 0.
v[v < 0] <- 0
X <- eig$vectors[, 1:2] %*% diag(sqrt(v))
library(vegan)
km <- kmeans(X,centers= 5, iter.max=1000, nstart=10000) .
#embedding using MDS
cmd<-cmdscale(fpdist)

Related

Quickly compute `dot(a(n:end), b(1:end-n))`

Suppose we have two, one dimensional arrays of values a and b which both have length N. I want to create a new array c such that c(n)=dot(a(n:N), b(1:N-n+1)) I can of course do this using a simple loop:
for n=1:N
c(n)=dot(a(n:N), b(1:N-n+1));
end
but given that this is such a simple operation which resembles a convolution I was wondering if there isn't a more efficient method to do this (using Matlab).
A solution using 1D convolution conv:
out = conv(a, flip(b));
c = out(ceil(numel(out)/2):end);
In conv the first vector is multiplied by the reversed version of the second vector so we need to compute the convolution of a and the flipped b and trim the unnecessary part.
This is an interesting problem!
I am going to assume that a and b are column vectors of the same length. Let us consider a simple example:
a = [9;10;2;10;7];
b = [1;3;6;10;10];
% yields:
c = [221;146;74;31;7];
Now let's see what happens when we compute the convolution of these vectors:
>> conv(a,b)
ans =
9
37
86
166
239
201
162
170
70
>> conv2(a, b.')
ans =
9 27 54 90 90
10 30 60 100 100
2 6 12 20 20
10 30 60 100 100
7 21 42 70 70
We notice that c is the sum of elements along the lower diagonals of the result of conv2. To show it clearer we'll transpose to get the diagonals in the same order as values in c:
>> triu(conv2(a.', b))
ans =
9 10 2 10 7
0 30 6 30 21
0 0 12 60 42
0 0 0 100 70
0 0 0 0 70
So now it becomes a question of summing the diagonals of a matrix, which is a more common problem with existing solution, for example this one by Andrei Bobrov:
C = conv2(a.', b);
p = sum( spdiags(C, 0:size(C,2)-1) ).'; % This gives the same result as the loop.

Sum data in one column in a specific order in Spotfire

Does anyone know how to create a calculated column (in Spotfire) that will sum data in order of increasing values contained within another column?
For example, what would the expression be to Sum data in [P] in increasing order of [K], for each [Well]
Some example data:
Well Depth P K
A 85 0.191 108
A 85.5 0.192 102
A 87 0.17 49
A 88 0.184 47
A 89 0.192 50
B 298 0.215 177
B 298.5 0.2 177
B 300 .017 105
B 301 0.23 200
You can use:
Sum([P]) OVER (intersect([Well],AllPrevious([K])))
This returns the cumulative sum of P in order of K per Well in ascending order of K.
Well K P Cumulative Sum of P
A 47 0,184 0,184
A 49 0,17 0,354
A 50 0,192 0,546
A 102 0,192 0,738
A 108 0,191 0,929
B 105 0,017 0,017
B 177 0,215 0,432
B 177 0,2 0,432
B 200 0,23 0,662
Edit Based on OP's comment:
you can use to get the cumulative sum in descending order of K:
Sum([P]) OVER (intersect([Well],AllNExt([K])))

How to divide N numbers into N/2 groups (2 numbers each group) such that the sum of diffrence between the 2 numbers in each group is minmal?

General Problem Description
Hi, it is actually a special assignment problem( check wiki if interested). Suppose I have 10 agents denoted as A1, A2, ... A10 and I need them to work in pairs. While, according to previous experience, I know the working efficiency of each two-agent pair so that I have an efficiency matrix shown as follows whose ( i, j ) element represents the working efficiency of agent pair ( Ai, Aj ). Hence, we know it should be a symmetric matrix, which means E( i, j )=E( j, i ) and E( i, i ) should be 0. Now, I need divide these 10 agents into 5 groups such that the overall (sum) efficiency is maximal.
E =
0 25 28 23 39 77 56 58 85 41
25 0 18 77 32 52 69 59 47 18
28 18 0 20 55 75 63 38 5 56
23 77 20 0 59 76 24 82 68 64
39 32 55 59 0 49 70 28 42 31
77 52 75 76 49 0 33 84 50 29
56 69 63 24 70 33 0 15 49 83
58 59 38 82 28 84 15 0 68 40
85 47 5 68 42 50 49 68 0 56
41 18 56 64 31 29 83 40 56 0
N.B.
From the matrix view of this problem, I need pick 5 elements from above matrix such that none of them share a same index with others. ( if you pick E( 2, 3 ), then you cannot pick any elments with index containing 2 or 3 since A2 and A3 are assigned to work. In other words, you cannot pick any elements from the 2nd, 3rd row and 2nd, 3rd column.)
The title of this problem is an equivalent problem to the special assignment problem mentinoned above.
You may find the Hungarian(munkres) algorithm helpful! Here is the matlab code.
Another view of this problem is to solve a normal assignment problem, but we need to find a solution whose elements are symmetrically distributed about the diagonal.
Directly applying Hungarian(munkres) algorithm to the symmetric efficiency matrix does not always work. Sometimes it will give asymmetric permutations e.g.
E =
0 30 63 32 20 40
30 0 67 84 75 63
63 67 0 37 79 88
32 84 37 0 43 59
20 75 79 43 0 56
40 63 88 59 56 0
The optimal solution is:
assignment =
0 0 0 0 0 1
0 0 0 1 0 0
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 0 1 0
This can be solved as weighted maximum matching problem, where:
G = (V,E,w)
V = { all numbers }
E = { (v,u) | v,u in V, v!=u }
w(u,v) = -|u-v|
The solution to maximum matching will pair all your vertices (numbers), such that sum of: sum { -|u-v| : u,v paired } is maximum, which means sum { |u-v| : u,v paired is minimum.

Understanding "median of medians" algorithm

I want to understand "median of medians" algorithm on the following example:
We have 45 distinct numbers divided into 9 group with 5 elements each.
48 43 38 33 28 23 18 13 8
49 44 39 34 29 24 19 14 9
50 45 40 35 30 25 20 15 10
51 46 41 36 31 26 21 16 53
52 47 42 37 32 27 22 17 54
The first step is sorting every group (in this case they are already sorted)
Second step recursively, find the "true" median of the medians (50 45 40 35 30 25 20 15 10) i.e. the set will be divided into 2 groups:
50 25
45 20
40 15
35 10
30
sorting these 2 groups
30 10
35 15
40 20
45 25
50
the medians is 40 and 15 (in case the numbers are even we took left median)
so the returned value is 15 however "true" median of medians (50 45 40 35 30 25 20 15 10) is 30, moreover there are 5 elements less then 15 which are much less than 30% of 45 which are mentioned in wikipedia
and so T(n) <= T(n/5) + T(7n/10) + O(n) fails.
By the way in the Wikipedia example, I get result of recursion as 36. However, the true median is 47.
So, I think in some cases this recursion may not return true median of medians. I want to understand where is my mistake.
The problem is in the step where you say to find the true median of the medians. In your example, you had these medians:
50 45 40 35 30 25 20 15 10
The true median of this data set is 30, not 15. You don't find this median by splitting the groups into blocks of five and taking the median of those medians, but instead by recursively calling the selection algorithm on this smaller group. The error in your logic is assuming that median of this group is found by splitting the above sequence into two blocks
50 45 40 35 30
and
25 20 15 10
then finding the median of each block. Instead, the median-of-medians algorithm will recursively call itself on the complete data set 50 45 40 35 30 25 20 15 10. Internally, this will split the group into blocks of five and sort them, etc., but it does so to determine the partition point for the partitioning step, and it's in this partitioning step that the recursive call will find the true median of the medians, which in this case will be 30. If you use 30 as the median as the partitioning step in the original algorithm, you do indeed get a very good split as required.
Hope this helps!
Here is the pseudocode for median of medians algorithm (slightly modified to suit your example). The pseudocode in wikipedia fails to portray the inner workings of the selectIdx function call.
I've added comments to the code for explanation.
// L is the array on which median of medians needs to be found.
// k is the expected median position. E.g. first select call might look like:
// select (array, N/2), where 'array' is an array of numbers of length N
select(L,k)
{
if (L has 5 or fewer elements) {
sort L
return the element in the kth position
}
partition L into subsets S[i] of five elements each
(there will be n/5 subsets total).
for (i = 1 to n/5) do
x[i] = select(S[i],3)
M = select({x[i]}, n/10)
// The code to follow ensures that even if M turns out to be the
// smallest/largest value in the array, we'll get the kth smallest
// element in the array
// Partition array into three groups based on their value as
// compared to median M
partition L into L1<M, L2=M, L3>M
// Compare the expected median position k with length of first array L1
// Run recursive select over the array L1 if k is less than length
// of array L1
if (k <= length(L1))
return select(L1,k)
// Check if k falls in L3 array. Recurse accordingly
else if (k > length(L1)+length(L2))
return select(L3,k-length(L1)-length(L2))
// Simply return M since k falls in L2
else return M
}
Taking your example:
The median of medians function will be called over the entire array of 45 elements like (with k = 45/2 = 22):
median = select({48 49 50 51 52 43 44 45 46 47 38 39 40 41 42 33 34 35 36 37 28 29 30 31 32 23 24 25 26 27 18 19 20 21 22 13 14 15 16 17 8 9 10 53 54}, 45/2)
The first time M = select({x[i]}, n/10) is called, array {x[i]} will contain the following numbers: 50 45 40 35 30 20 15 10.
In this call, n = 45, and hence the select function call will be M = select({50 45 40 35 30 20 15 10}, 4)
The second time M = select({x[i]}, n/10) is called, array {x[i]} will contain the following numbers: 40 20.
In this call, n = 9 and hence the call will be M = select({40 20}, 0).
This select call will return and assign the value M = 20.
Now, coming to the point where you had a doubt, we now partition the array L around M = 20 with k = 4.
Remember array L here is: 50 45 40 35 30 20 15 10.
The array will be partitioned into L1, L2 and L3 according to the rules L1 < M, L2 = M and L3 > M. Hence:
L1: 10 15
L2: 20
L3: 30 35 40 45 50
Since k = 4, it's greater than length(L1) + length(L2) = 3. Hence, the search will be continued with the following recursive call now:
return select(L3,k-length(L1)-length(L2))
which translates to:
return select({30 35 40 45 50}, 1)
which will return 30 as a result. (since L has 5 or fewer elements, hence it'll return the element in kth i.e. 1st position in the sorted array, which is 30).
Now, M = 30 will be received in the first select function call over the entire array of 45 elements, and the same partitioning logic which separates the array L around M = 30 will apply to finally get the median of medians.
Phew! I hope I was verbose and clear enough to explain median of medians algorithm.

how to group photos with similar faces together

In most face recognition SDK, it only provides two major functions
detecting faces and extracting templates from photos, this is called detection.
comparing two templates and returning the similar score, this is called recognition.
However, beyond those two functions, what I am looking for is an algorithm or SDK for grouping photos with similar faces together, e.g. based on similar scores.
Thanks
First, perform step 1 to extract the templates, then compare each template with all the others by applying step two on all the possible pairs, obtaining their similarity scores.
Sort the matches based on this similarity score, decide on a threshold and group together those templates that exceed it.
Take, for instance, the following case:
Ten templates: A, B, C, D, E, F, G, H, I, J.
Scores between: 0 and 100.
Similarity threshold: 80.
Similarity table:
A B C D E F G H I J
A 100 85 8 0 1 50 55 88 90 10
B 85 100 5 30 99 60 15 23 8 2
C 8 5 100 60 16 80 29 33 5 8
D 0 30 60 100 50 50 34 18 2 66
E 1 99 16 50 100 8 3 2 19 6
F 50 60 80 50 8 100 20 55 13 90
G 55 15 29 34 3 20 100 51 57 16
H 88 23 33 18 2 55 51 100 8 0
I 90 8 5 2 19 13 57 8 100 3
J 10 2 8 66 6 90 16 0 3 100
Sorted matches list:
AI 90
FJ 90
BE 99
AH 88
AB 85
CF 80
------- <-- Threshold cutoff line
DJ 66
.......
Iterate through the list until the threshold cutoff point, where the values no longer exceed it, maintain a full templates set and association sets for each template, obtaining the final groups:
// Empty initial full templates set
fullSet = {};
// Iterate through the pairs list
foreach (templatePair : pairList)
{
// If the full set contains the first template from the pair
if (fullSet.contains(templatePair.first))
{
// Add the second template to its group
templatePair.first.addTemplateToGroup(templatePair.second);
// If the full set also contains the second template
if (fullSet.contains(templatePair.second))
{
// The second template is removed from the full set
fullSet.remove(templatePair.second);
// The second template's group is added to the first template's group
templatePair.first.addGroupToGroup(templatePair.second.group);
}
}
else
{
// If the full set contains only the second template from the pair
if (fullSet.contains(templatePair.second))
{
// Add the first template to its group
templatePair.second.addTemplateToGroup(templatePair.first);
}
}
else
{
// If none of the templates are present in the full set, add the first one
// to the full set and the second one to the first one's group
fullSet.add(templatePair.first);
templatePair.first.addTemplateToGroup(templatePair.second);
}
}
Execution details on the list:
AI: fullSet.add(A); A.addTemplateToGroup(I);
FJ: fullSet.add(F); F.addTemplateToGroup(J);
BE: fullSet.add(B); B.addTemplateToGroup(E);
AH: A.addTemplateToGroup(H);
AB: A.addTemplateToGroup(B); fullSet.remove(B); A.addGroupToGroup(B.group);
CF: C.addTemplateToGroup(F);
In the end, you end up with the following similarity groups:
A - I, H, B, E
C - F, J

Resources