Finding the center of a cluster - algorithm

I have the following problem, made abstract to bring out the key issues.
I have 10 points, each of which is some distance from the others. I want to be able to:
a) find the center of the cluster, i.e. the point for which the sum of the pairwise distances to every other point is minimised. Let p(j) ~ p(k) represent the pairwise distance between points j and k. Then p(i) is the center point of the cluster iff i minimises sum over j of (p(i) ~ p(j)), for 0 < j <= n, where n is the number of points in the cluster;
b) determine how to split the cluster into two clusters once the number of data points in the cluster goes above some threshold t.
This is not Euclidean space, but the distances can be summarised as follows, where p(i) is point i:
       p(1) p(2) p(3) p(4) p(5) p(6) p(7) p(8) p(9) p(10)
p(1)    0    2    1    3    2    3    3    2    3    4
p(2)    2    0    1    3    2    3    3    2    3    4
p(3)    1    1    0    2    0    1    2    1    2    3
p(4)    3    3    2    0    1    2    3    2    3    4
p(5)    2    2    1    1    0    1    2    1    2    3
p(6)    3    3    2    2    1    0    3    2    3    4
p(7)    3    3    2    3    2    3    0    1    2    3
p(8)    2    2    1    2    1    2    1    0    1    2
p(9)    3    3    2    3    2    3    2    1    0    1
p(10)   4    4    3    4    3    4    3    2    1    0
How would I calculate which is the center point of this cluster?

As far as I understand, this looks like k-means clustering, and what you are looking for is usually known as a 'medoid'.
See here: http://en.wikipedia.org/wiki/Medoids or here: http://en.wikipedia.org/wiki/K-medoids

I may be about to have that frisson that comes just before displaying utter stupidity. But doesn't this yield easily to brute force? In Python:
distances = [
[ 0 , 2 , 1 , 3 , 2 , 3 , 3 , 2 , 3 , 4 , ],
[ 2 , 0 , 1 , 3 , 2 , 3 , 3 , 2 , 3 , 4 , ],
[ 1 , 1 , 0 , 2 , 0 , 1 , 2 , 1 , 2 , 3 , ],
[ 3 , 3 , 2 , 0 , 1 , 2 , 3 , 2 , 3 , 4 , ],
[ 2 , 2 , 1 , 1 , 0 , 1 , 2 , 1 , 2 , 3 , ],
[ 3 , 3 , 2 , 2 , 1 , 0 , 3 , 2 , 3 , 4 , ],
[ 3 , 3 , 2 , 3 , 2 , 3 , 0 , 1 , 2 , 3 , ],
[ 2 , 2 , 1 , 2 , 1 , 2 , 1 , 0 , 1 , 2 , ],
[ 3 , 3 , 2 , 3 , 2 , 3 , 2 , 1 , 0 , 1 , ],
[ 4 , 4 , 3 , 4 , 3 , 4 , 3 , 2 , 1 , 0 , ],
]
current_minimum = float('inf')
centre = None
for point in range(len(distances)):
    # Sum the distances from this point to every other point.
    distance_sum = 0
    for second_point in range(len(distances)):
        if point == second_point:
            continue
        distance_sum += distances[point][second_point]
    print('>>>>>', point, distance_sum)
    # Keep the point with the smallest total seen so far.
    if distance_sum < current_minimum:
        current_minimum = distance_sum
        centre = point
print(centre)

a)
Find the median or average of all the distances; call it avgAll.
For each point p(i), find its average distance to the other points; call it avgP(i).
Pick as the center the point whose avgP(i) is closest to avgAll.
b)
No idea for now.
Maybe, for each p, find the closest point,
and build a graph from these links.
Then somehow (I don't know how yet) divide the graph.

What you're trying to do, or at least (b), belongs to cluster analysis: a branch of mathematics / statistics / econometrics in which data points (e.g. points in n-dimensional space) are divided among groups or clusters. How to do this is not a trivial question; there are many, many possible ways.
Read more in the Wikipedia article on cluster analysis.
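To make part (b) a bit more concrete, here is a minimal sketch of one of those many possible ways: try every pair of points as the two new centers (a brute-force 2-medoid split) and assign each point to the nearer one. The function name `split_in_two` and the toy matrix are made up for illustration; nothing in the question says this is the right criterion for the real data, and the pair search is O(n^3), so it only suits small clusters.

```python
from itertools import combinations

def split_in_two(dist):
    """Split points into two clusters around the best pair of medoids.

    dist[p][q] is the distance from point p to point q.
    Returns ((medoid_a, members_a), (medoid_b, members_b)).
    """
    n = len(dist)
    best_cost, best_pair = None, None
    for a, b in combinations(range(n), 2):
        # Cost of a pair: every point pays the distance to its nearer medoid.
        cost = sum(min(dist[p][a], dist[p][b]) for p in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_pair = cost, (a, b)
    a, b = best_pair
    # Ties go to the first medoid.
    members_a = [p for p in range(n) if dist[p][a] <= dist[p][b]]
    members_b = [p for p in range(n) if dist[p][a] > dist[p][b]]
    return (a, members_a), (b, members_b)

# Toy example: two obvious pairs, {0, 1} and {2, 3}.
toy = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
print(split_in_two(toy))  # ((0, [0, 1]), (2, [2, 3]))
```

The same function accepts the 10x10 `distances` list from the brute-force answer unchanged.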


Combining add cases and add variables by merging files in SPSS

I would like to merge different SPSS files. The PAID indicates different persons. The files also contain the variable ID which indicates the moment of measurement. So ID=1 means that the data are results of measurement one (ID=2 ; measurement two etc.). However, not all data files contain the same moments of measurement.
I have already read the following post, but that has not completely answered my question:
SPSS - merging files with duplicate cases of ID variable and new cases/variables
Example data files
Data file 1:
PAID ID X1 X2 X3 X4
1 1 3 4 4 5
2 1 3 4 5 6
3 1 3 4 4 6
4 1 . . . .
Data file 2:
PAID ID X5 X6 X7
1 1 1 1 2
1 2 1 2 1
2 1 1 2 2
2 2 2 2 2
3 1 1 1 1
3 2 1 . .
4 1 1 1 1
4 2 2 2 2
I want the following result:
PAID ID X1 X2 X3 X4 X5 X6 X7
1 1 3 4 4 5 1 1 2
1 2 . . . . 1 2 1
2 1 3 4 5 6 1 2 2
2 2 . . . . 2 2 2
3 1 3 4 4 6 1 1 1
3 2 . . . . 1 . .
4 1 . . . . 1 1 1
4 2 . . . . 2 2 2
I think I have to use some combination of the functions add cases and add variables. However, is this possible within SPSS? And if so, how can I do this?
Thanks in advance!
This will do the job:
match files /file='path\DataFile1.sav' /file='path\DataFile2.sav'/by paid id.
Please note though, both files need to be sorted by paid id before running the match.
To demonstrate with your sample data:
*first preparing demonstration data.
DATA LIST list/paid id x1 to x4 (6f).
begin data.
1,1,3,4,4,5
2,1,3,4,5,6
3,1,3,4,4,6
4,1, , , ,
end data.
* instead of creating the data, you can get your original data:
* get file="path\file name 1.sav".
sort cases by paid id.
dataset name DataFile1.
DATA LIST list/paid id x5 to x7 (5f).
begin data.
1,1,1,1,2
1,2,1,2,1
2,1,1,2,2
2,2,2,2,2
3,1,1,1,1
3,2,1, ,
4,1,1,1,1
4,2,2,2,2
end data.
sort cases by paid id.
dataset name DataFile2.
match files /file=DataFile1 /file=DataFile2/by paid id.
exe.
The result looks like this (a dot marks a missing value):
paid id x1 x2 x3 x4 x5 x6 x7
1 1 3 4 4 5 1 1 2
1 2 . . . . 1 2 1
2 1 3 4 5 6 1 2 2
2 2 . . . . 2 2 2
3 1 3 4 4 6 1 1 1
3 2 . . . . 1 . .
4 1 . . . . 1 1 1
4 2 . . . . 2 2 2
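For anyone who wants to see what the keyed match is doing outside SPSS, here is a rough plain-Python sketch of the same idea: a full outer merge on (paid, id), with `None` standing in for a missing value. The helper name `match_files` is just borrowed from the SPSS command for readability; it is a hypothetical function, not an SPSS API. It assumes each key occurs at most once per file and that, for a variable present in both files, the first file's value is kept.

```python
def match_files(first, second, keys):
    """Full outer merge of two lists of row dicts on the given key names."""
    merged = {}
    variables = list(keys)
    for table in (first, second):
        for row in table:
            key = tuple(row[k] for k in keys)
            record = merged.setdefault(key, {})
            for name, value in row.items():
                record.setdefault(name, value)  # assumption: first file wins
                if name not in variables:
                    variables.append(name)
    # Variables a case never received come out as None (missing).
    return [
        {name: merged[key].get(name) for name in variables}
        for key in sorted(merged)
    ]

file1 = [
    {"paid": 1, "id": 1, "x1": 3, "x2": 4, "x3": 4, "x4": 5},
    {"paid": 2, "id": 1, "x1": 3, "x2": 4, "x3": 5, "x4": 6},
    {"paid": 3, "id": 1, "x1": 3, "x2": 4, "x3": 4, "x4": 6},
    {"paid": 4, "id": 1, "x1": None, "x2": None, "x3": None, "x4": None},
]
file2 = [
    {"paid": 1, "id": 1, "x5": 1, "x6": 1, "x7": 2},
    {"paid": 1, "id": 2, "x5": 1, "x6": 2, "x7": 1},
    {"paid": 2, "id": 1, "x5": 1, "x6": 2, "x7": 2},
    {"paid": 2, "id": 2, "x5": 2, "x6": 2, "x7": 2},
    {"paid": 3, "id": 1, "x5": 1, "x6": 1, "x7": 1},
    {"paid": 3, "id": 2, "x5": 1, "x6": None, "x7": None},
    {"paid": 4, "id": 1, "x5": 1, "x6": 1, "x7": 1},
    {"paid": 4, "id": 2, "x5": 2, "x6": 2, "x7": 2},
]

result = match_files(file1, file2, ["paid", "id"])
for row in result:
    print(row)
```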

Efficiently construct a square matrix with unique numbers in each row

A matrix of size nxn needs to be constructed with the desired properties.
n is even. (given as input to the algorithm)
Matrix should contain integers from 0 to n-1
Main diagonal should contain only zeroes and matrix should be symmetric.
All numbers in each row should be different.
For a given n, any one of the possible outputs is acceptable.
input
2
output
0 1
1 0
input
4
output
0 1 3 2
1 0 2 3
3 2 0 1
2 3 1 0
Now the only idea that comes to my mind is to brute-force build combinations recursively and prune.
How can this be done in an iterative and, ideally, efficient way?
IMO, you can generate the answer with an algorithm like this:
For 8x8 the result is:
0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
2 3 0 1 6 7 4 5
3 2 1 0 7 6 5 4
4 5 6 7 0 1 2 3
5 4 7 6 1 0 3 2
6 7 4 5 2 3 0 1
7 6 5 4 3 2 1 0
You actually have a matrix made of two 4x4 matrices in the pattern below:
m0 => 0 1 2 3    m1 => 4 5 6 7    pattern => m0 m1
      1 0 3 2          5 4 7 6               m1 m0
      2 3 0 1          6 7 4 5
      3 2 1 0          7 6 5 4
And each 4x4 is in turn a matrix of two 2x2 matrices, related by a power of 2:
m0 => 0 1    m1 => 2 3    pattern => m0 m1
      1 0          3 2               m1 m0
In other words, you start with a 2x2 matrix of 0 and 1, then expand it to a 4x4 matrix by replacing each cell with a new 2x2 matrix:
0 => 0+2*0 1+2*0    1 => 0+2*1 1+2*1
     1+2*0 0+2*0         1+2*1 0+2*1
result => 0 1 2 3
          1 0 3 2
          2 3 0 1
          3 2 1 0
Now expand it again:
0,1 => as above    2 => 0+2*2 1+2*2    3 => 0+2*3 1+2*3
                        1+2*2 0+2*2         1+2*3 0+2*3
I can calculate the value of each cell with this C# sample code:
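// i: row, j: column, n: matrix dimension (this construction assumes n is a power of 2)
var v = 0;
var m = 2;
do
{
    var p = m / 2;
    // Append one bit: 0 if i and j fall in the same half of the current block, 1 otherwise.
    v = v * 2 + ((i % (n / p) < n / m) == (j % (n / p) < n / m) ? 0 : 1);
    m *= 2;
} while (m <= n);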
We know each row must contain each number. Likewise, each column contains each number.
Let us take the CS convention of indices starting from 0.
First, consider how to place the 1's in the matrix. Choose a random number k0 from 1 to n-1 and place the 1 in row 0 at position (0,k0). In row 1, if k0 = 1 then a 1 has already been placed (by symmetry). Otherwise, there are n-2 free positions; choose one, k1, and place the 1 at position (1,k1). Continue in this way until all the 1's are placed. In the final row there is exactly one free position.
Next, repeat with the 2's, which have to fit in the remaining places.
Now the problem is that we might not be able to actually complete the square. We may find there are some constraints which make it impossible to fill in the last digits. Checking whether a partially filled Latin square can be completed is NP-complete (Wikipedia). This basically means it is pretty compute-intensive and there is no known shortcut algorithm. So I think the best you can do is generate squares and test whether they work or not.
If you only want one particular square for each n, then there might be simpler ways of generating them.
The link Ted Hopp gave in his comment, "Latin Squares. Simple Construction", does provide a method for generating a square starting with the addition of integers mod n.
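As one possible "simpler way" for a single square per n, here is a sketch based on round-robin tournament scheduling (the circle method) rather than on anything in the linked page; I'm naming the technique explicitly because it is my own substitution. It uses addition mod n-1 for the first n-1 rows and columns and puts the one value each row is still missing into the last column and row. The function name `symmetric_latin_square` is made up for this example.

```python
def symmetric_latin_square(n):
    """Return a symmetric n x n square with a zero diagonal, for even n.

    Every row (and so every column) contains 0..n-1 exactly once.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even number")
    m = n - 1  # odd modulus, so i -> (2*i) % m is a permutation
    square = [[0] * n for _ in range(n)]
    for i in range(m):
        for j in range(m):
            if i != j:
                square[i][j] = (i + j) % m + 1
        # Row i is still missing the value (2*i) % m + 1; it goes in the
        # last column and, by symmetry, in the last row.
        square[i][m] = square[m][i] = (2 * i) % m + 1
    return square

for row in symmetric_latin_square(6):
    print(" ".join(map(str, row)))
```

This runs in O(n^2) with no backtracking and works for any even n, not just powers of 2. For n = 4 it gives a different (but equally valid) square from the one in the question.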
I might be wrong, but if you are just looking to print a symmetric table, there is a special case of Latin squares that is isomorphic to the symmetric-difference operation table over the powerset of a k-element set, mapped to the ring {0,1,2,..,2^k-1}.
One can also produce such a table using XOR(i,j), where i and j are the row and column indexes of the n*n table. Note that i^j stays within 0..n-1 for every pair of indexes only when n is a power of 2.
For example:
def latin_powerset(n):
    # Yield (row, column, value) cells; the value is the XOR of the two indexes.
    for i in range(n):
        for j in range(n):
            yield (i, j, i ^ j)
Printing the tuples coming from the special-case generator of symmetric Latin squares defined above:
def print_latin_square(sq, n=None):
    cells = list(sq)
    if n is None:
        # find the length of the square side
        n = 1
        n2 = len(cells)
        while n * n < n2:
            n += 1
    rows = []
    for i in range(n):
        rows.append(" ".join(str(cells[i * n + j][2]) for j in range(n)))
    print("\n".join(rows))

square = latin_powerset(8)
print_latin_square(square)
outputs:
0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
2 3 0 1 6 7 4 5
3 2 1 0 7 6 5 4
4 5 6 7 0 1 2 3
5 4 7 6 1 0 3 2
6 7 4 5 2 3 0 1
7 6 5 4 3 2 1 0
See also
This covers more generic cases of latin squares, rather than that super symmetrical case with the trivial code above:
https://www.cut-the-knot.org/arithmetic/latin2.shtml (also pointed in the comments above for symmetric latin square construction)
https://doc.sagemath.org/html/en/reference/combinat/sage/combinat/matrices/latin.html

Java algorithm to generate all the possible permutation of a matrix

I need a Java algorithm to generate all the possible permutations of a given matrix.
For example,
    1 2
A = 3 4
The algorithm should return:
    1 2      1 2      2 1      2 1      3 4      4 3      3 4      4 3
A = 3 4  B = 4 3  C = 3 4  D = 4 3  E = 1 2  F = 1 2  G = 2 1  H = 2 1
Any idea?
Thank you
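Not Java, but a short Python sketch of one reading of the question may help pin down what is being asked. The eight matrices listed are exactly what you get by reordering the rows and, independently, reordering the entries inside each row (2! * 2! * 2! = 8), so the sketch assumes that is the intended set; the question does not say so explicitly, and the function name is made up. The same nested-permutation structure carries over directly to Java.

```python
from itertools import permutations, product

def row_preserving_arrangements(matrix):
    """Yield every matrix obtained by reordering the rows and reordering
    the entries inside each row (entries never move between rows)."""
    for row_order in permutations(matrix):
        # For each row in this order, list all orderings of its entries.
        per_row = [list(permutations(row)) for row in row_order]
        for choice in product(*per_row):
            yield tuple(choice)

results = list(row_preserving_arrangements([[1, 2], [3, 4]]))
print(len(results))  # 8
for m in results:
    print(m)
```

If instead every one of the 4! = 24 placements of the values is wanted, a single `permutations` call over the flattened matrix would be the thing to use.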

Adding zeros between every 2 elements of a matrix in matlab/octave

I am interested in how I can add rows and columns of zeros to a matrix so that it looks like this:
              1 0 2 0 3
1 2 3         0 0 0 0 0
2 3 4   =>    2 0 3 0 4
5 4 3         0 0 0 0 0
              5 0 4 0 3
Specifically, I am interested in how to do this efficiently, because walking the matrix and adding zeros takes a lot of time if you work with a big matrix.
Update:
Thank you very much.
Now I'm trying to replace the zeroes with the sum of their neighbors:
              1 0 2 0 3        1 3 2 5 3
1 2 3         0 0 0 0 0        3 8 5 12... and so on
2 3 4   =>    2 0 3 0 4   =>
5 4 3         0 0 0 0 0
              5 0 4 0 3
As you can see, I'm considering all 8 neighbors of an element, but again, using for loops to walk the matrix slows me down quite a bit. Is there a faster way?
Let your little matrix be called m1. Then:
m2 = zeros(5)
m2(1:2:end,1:2:end) = m1(:,:)
Obviously this is hard-wired to your example, I'll leave it to you to generalise.
Here are two ways to do part 2 of the question. The first does the shifts explicitly, and the second uses conv2. The second way should be faster.
M=[1 2 3; 2 3 4 ; 5 4 3];
% this matrix (M expanded) has zeros inserted, but also an extra row and column of zeros
Mex = kron(M,[1 0 ; 0 0 ]);
% The sum matrix is built from shifts of the original matrix
Msum = Mex + circshift(Mex,[1 0]) + ...
             circshift(Mex,[-1 0]) + ...
             circshift(Mex,[0 -1]) + ...
             circshift(Mex,[0 1]) + ...
             circshift(Mex,[1 1]) + ...
             circshift(Mex,[-1 1]) + ...
             circshift(Mex,[1 -1]) + ...
             circshift(Mex,[-1 -1]);
% trim the extra line
Msum = Msum(1:end-1,1:end-1)
% another version, a bit more fancy:
MexTrimmed = Mex(1:end-1,1:end-1);
MsumV2 = conv2(MexTrimmed,ones(3),'same')
Output:
Msum =
1 3 2 5 3
3 8 5 12 7
2 5 3 7 4
7 14 7 14 7
5 9 4 7 3
MsumV2 =
1 3 2 5 3
3 8 5 12 7
2 5 3 7 4
7 14 7 14 7
5 9 4 7 3
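If you want to sanity-check the numbers above outside MATLAB/Octave, here is a plain-Python sketch of the two steps: zero insertion generalised to any size (the result is (2r-1) x (2c-1)), and the 3x3 window sum that conv2 with ones(3) and 'same' computes. The function names are made up for this example, and it is meant as a reference check, not as a fast replacement for the vectorised code.

```python
def insert_zeros(m):
    """Place m[r][c] at (2r, 2c) of a zero matrix of size (2r-1) x (2c-1)."""
    rows, cols = len(m), len(m[0])
    out = [[0] * (2 * cols - 1) for _ in range(2 * rows - 1)]
    for r in range(rows):
        for c in range(cols):
            out[2 * r][2 * c] = m[r][c]
    return out

def neighbour_sum(m):
    """Sum of each cell and its (up to) 8 neighbours, zero beyond the edges."""
    rows, cols = len(m), len(m[0])
    return [
        [
            sum(
                m[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

M = [[1, 2, 3], [2, 3, 4], [5, 4, 3]]
for row in neighbour_sum(insert_zeros(M)):
    print(row)
```

Because the non-zero cells of the expanded matrix keep their own value (all their neighbours are zero), including the centre cell in the window is harmless here.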

Evaluating the distribution of words in a grid

I'm creating a word search and am trying to calculate quality of the generated puzzles by verifying the word set is "distributed evenly" throughout the grid. For example placing each word consecutively, filling them up row-wise is not particularly interesting because there will be clusters and the user will quickly notice a pattern.
How can I measure how 'evenly distributed' the words are?
What I'd like to do is write a program that takes in a word search as input and output a score that evaluates the 'quality' of the puzzle. I'm wondering if anyone has seen a similar problem and could refer me to some resources. Perhaps there is some concept in statistics that might help? Thanks.
The basic problem is the distribution of lines in a square or rectangle. You can either do this geometrically or using integer arrays. I will try the integer arrays here.
Let M be a matrix of your puzzle,
A B C D
E F G H
I J K L
M N O P
Let "EFGH" be an existing word, as well as "CGKO". Then create a matrix which contains, for each cell, the number of words that cell belongs to:
0 0 1 0
1 1 2 1
0 0 1 0
0 0 1 0
Apply a rule: each cell's new value is the sum of its 4-way neighbours, multiplied by the cell's original value if that original value is 2 or higher.
0 0 1 0        1 2 2 2
1 1 2 1   -\   1 3 8 2
0 0 1 0   -/   1 2 3 2
0 0 1 0        0 1 1 1
Then sum up the values in each row and each column of the matrix:
1 2 2 2 = 7
1 3 8 2 = 14
1 2 3 2 = 8
0 1 1 1 = 3
| | | |
3 8 14 7
Then calculate the average of both result sets:
(7 + 14 + 8 + 3) / 4 = 32 / 4 = 8
(3 + 8 + 14 + 7) / 4 = 32 / 4 = 8
And calculate the average difference from the average for each result set:
 3 <-> 8 = 5       7 <-> 8 = 1
 8 <-> 8 = 0      14 <-> 8 = 6
14 <-> 8 = 6       8 <-> 8 = 0
 7 <-> 8 = 1       3 <-> 8 = 5
      ___avg            ___avg
        3                 3
And multiply them together:
3 * 3 = 9
You then treat this as the distribution score. You might need to tweak it a little to make it work better, but this should calculate distribution scores quite nicely.
Here is an example of a bad distribution:
1 0 0 0        1 1 0 0 = 2
1 0 0 0   -\   2 1 0 0 = 3
1 0 0 0   -/   2 1 0 0 = 3
1 0 0 0        1 1 0 0 = 2
               | | | |
               6 4 0 0
Row average: 2.5, average difference from it: 0.5
Column average: 2.5, average difference from it: 2.5
Score: 0.5 * 2.5 = 1.25
Edit: calc. errors fixed.
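Since the steps are mechanical, here is a small Python sketch of the whole scoring procedure as I read it: apply the neighbour rule, sum rows and columns, take the mean absolute deviation of each set of sums, and multiply. The function name `distribution_score` is made up for this example, and the input is assumed to be the per-cell word-membership count matrix described above.

```python
def distribution_score(counts):
    """Score a grid of per-cell word counts using the procedure above."""
    rows, cols = len(counts), len(counts[0])

    def neighbour_sum(r, c):
        # Sum of the 4-way neighbours that lie inside the grid.
        total = 0
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                total += counts[rr][cc]
        return total

    spread = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            value = neighbour_sum(r, c)
            if counts[r][c] >= 2:
                value *= counts[r][c]  # emphasise cells shared by words
            spread[r][c] = value

    row_sums = [sum(row) for row in spread]
    col_sums = [sum(col) for col in zip(*spread)]

    def mean_abs_dev(values):
        mean = sum(values) / len(values)
        return sum(abs(v - mean) for v in values) / len(values)

    return mean_abs_dev(row_sums) * mean_abs_dev(col_sums)

crossing = [[0, 0, 1, 0], [1, 1, 2, 1], [0, 0, 1, 0], [0, 0, 1, 0]]
stacked = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
print(distribution_score(crossing))  # 9.0
print(distribution_score(stacked))   # 1.25
```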
