Find the second minimum for each row of a matrix

I have a set I of customers and a set J of facilities, and two binary variables: y[i,j], which is 1 if client i is served by primary facility j, 0 otherwise; and b[i,j], which is 1 if client i is served by backup facility j, 0 otherwise.
Given the starting distance matrix d:
- I must set y[i,j] = 1 based on the minimum distance in each row of the matrix (and this I have done);
- I have to fix b[i,j] = 1 according to the second minimum distance in each row of the matrix (I don't know how to do this; I wrote max, but that is not what I need). I've tried removing the first minimum from each row with pop!, deleteat!, splice!, etc., but the solver gives me an error.
using JuMP
using Gurobi
using DelimitedFiles
import Random
import LinearAlgebra
import Plots

n = 3
m = 5

model = Model(Gurobi.Optimizer);
@variable(model, y[1:m, 1:n], Bin);
@variable(model, b[1:m, 1:n], Bin);

d = [
    80 20 40
    71 55 24
    56 47 81
    10 20 30
    31 41 21
];

# PRIMARY ASSIGNMENTS
# 1) For each customer, find the minimum d[i,j] and its position in the matrix,
#    and collect all the d[i,j] just found in a vector V
V = [];
for i = 1:m
    c = findmin(d[i, j] for j = 1:n)
    push!(V, [c[1], c[2], i])
end
println(V)

# 2) Sort the vector's elements from the smallest to the largest
S = sort(V)
println(S)
for i = 1:m
    println(S[i][2])
    println(S[i][3])
end

# 3) Fix primary assignments for the first 50% of customers
for i = 1:3
    fix(y[S[i][3], S[i][2]], 1.0, force = true)
end
# SECONDARY ASSIGNMENTS
# 1) For each customer, find the second minimum d[i,j] and its position in the matrix,
#    and collect all the d[i,j] just found in a vector W
W = [];
for i = 1:m
    f = findmax(d[i, j] for j = 1:n)
    push!(W, [f[1], f[2], i])
end
println(W)

# 2) Sort the vector's elements from the smallest to the largest
T = sort(W)
println(T)
for i = 1:3
    println(T[i][2])
    println(T[i][3])
end

# 3) Fix secondary assignments for the first 50% of customers
for i = 1:3
    fix(b[T[i][3], T[i][2]], 1.0, force = true)
end

optimize!(model)
I tried to find the second minimum of each row, but I could not.
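The missing piece is just "the column index of the second-smallest entry of each row". A minimal sketch of that logic in NumPy (Python used purely for illustration, with the d from the question; the same argsort idea carries over to Julia's sortperm):

import numpy as np

d = np.array([[80, 20, 40],
              [71, 55, 24],
              [56, 47, 81],
              [10, 20, 30],
              [31, 41, 21]])

order = np.argsort(d, axis=1)   # per-row column order, ascending by distance
first = order[:, 0]             # 0-based column of each row's minimum        -> y[i, j]
second = order[:, 1]            # 0-based column of each row's second minimum -> b[i, j]

for i in range(d.shape[0]):
    print(i + 1, first[i] + 1, second[i] + 1)   # 1-based, as Julia would index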

Related

Clustering one-dimensional data with an unknown distance metric

I have a 2-dimensional array which describes the distance between objects:
  A B C
A 0 1 2
B 1 0 3
C 2 3 0
For example distance(A,B) = 1, distance(B,C) = 3, distance(A,C) = 2, and distance(x,y) = distance(y,x). I do not know anything more about this distance; it is not the Euclidean distance or any other commonly known distance function.
How do I find the number of groups and how to partition the points?
I have found a solution:
D = ...  # two-dimensional structure with the distance between x and y in D[x][y]
sorted_distance = sorted_distances(D)  # all values appearing in D, duplicates deleted, sorted from max to min

for distance in sorted_distance:
    V = D.keys()
    E = []
    for x in V:
        for y in V:
            if x == y: continue
            if D[x][y] <= distance:
                E.append((x, y))
    G = Graph(V, E)
    connected_components = get_connected_components(G)
    if len(connected_components) > 1:  # this value could be increased if the result is not rewarding
        return connected_components
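For what it's worth, the idea above can be made runnable with a small union-find in place of the Graph/get_connected_components helpers (a sketch; the function name is my own): sweep the threshold from the largest distance downwards and stop as soon as the points split into more than one connected component.

def find_clusters(D):
    n = len(D)
    thresholds = sorted({D[x][y] for x in range(n) for y in range(n) if x != y},
                        reverse=True)
    for t in thresholds:
        parent = list(range(n))                  # union-find over the n points
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]    # path compression
                a = parent[a]
            return a
        for x in range(n):
            for y in range(x + 1, n):
                if D[x][y] <= t:
                    parent[find(x)] = find(y)    # connect points within threshold
        groups = {}
        for x in range(n):
            groups.setdefault(find(x), []).append(x)
        if len(groups) > 1:                      # the graph split: report groups
            return list(groups.values())
    return [list(range(n))]                      # everything in one group

D = [[0, 1, 2],
     [1, 0, 3],
     [2, 3, 0]]
print(find_clusters(D))   # [[0, 1], [2]] once the 3-edge is dropped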

What kind of algorithm would find a grid of squares in a reasonable time?

I found a result that there is a grid of size 9x13 with the following properties:
Every cell contains a digit in base 10.
One can read numbers from the grid by selecting a starting square, moving to one of its 8 neighboring squares, maintaining that direction, and concatenating digits.
For example, if we have the following grid:
340934433
324324893
455423343
then one can select the upper-left 3 and the direction diagonally right and down to read the numbers 3, 32 and 325.
Now one has to prove that there is a grid of size 9x13 where one can read the squares of 1 to 100, i.e. one can read all of the integers of the form i^2, where i = 1,...,100, from the grid.
The best grid I found on the net is of size 11x11, given in Solving a recreational square packing problem. But it looks like it is hard to modify that program to find the integers in a rectangular grid.
So what kind of algorithm would output a suitable grid in a reasonable time?
I just got a key error from this code:
import random, time, sys

N = 9
M = 13
K = 100

# These are the numbers we would like to pack
numbers = [str(i*i) for i in xrange(1, K+1)]

# Build the global list of digits (used for weighted random guess)
digits = "".join(numbers)

def random_digit(n=len(digits)-1):
    return digits[random.randint(0, n)]

# By how many lines each of the numbers is currently covered
count = dict((x, 0) for x in numbers)

# Number of actually covered numbers
covered = 0

# All lines in current position (rows, cols, diags, counter-diags)
lines = (["*"*N for x in xrange(N)] +
         ["*"*M for x in xrange(M)] +
         ["*"*x for x in xrange(1, N)] + ["*"*x for x in xrange(N, 0, -1)] +
         ["*"*x for x in xrange(1, M)] + ["*"*x for x in xrange(M, 0, -1)])

# lines_of[x, y] -> list of line/char indexes
lines_of = {}
def add_line_of(x, y, L):
    try:
        lines_of[x, y].append(L)
    except KeyError:
        lines_of[x, y] = [L]
for y in xrange(N):
    for x in xrange(N):
        add_line_of(x, y, (y, x))
        add_line_of(x, y, (M + x, y))
        add_line_of(x, y, (2*M + (x + y), x - max(0, x + y - M + 1)))
        add_line_of(x, y, (2*M + 2*N-1 + (x + N-1 - y), x - max(0, x + (M-1 - y) - M + 1)))
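# NOTE: lines_of is only populated for y < N here, while setValue() below is
# called with y up to M-1; lines_of[x, y] with y >= N is presumably where the
# KeyError mentioned above comes from.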
# Numbers covered by each line
covered_numbers = [set() for x in xrange(len(lines))]

# Which numbers the string x covers
def cover(x):
    c = x + "/" + x[::-1]
    return [y for y in numbers if y in c]

# Set a matrix element
def setValue(x, y, d):
    global covered
    for i, j in lines_of[x, y]:
        L = lines[i]
        C = covered_numbers[i]
        newL = L[:j] + d + L[j+1:]
        newC = set(cover(newL))
        for lost in C - newC:
            count[lost] -= 1
            if count[lost] == 0:
                covered -= 1
        for gained in newC - C:
            count[gained] += 1
            if count[gained] == 1:
                covered += 1
        covered_numbers[i] = newC
        lines[i] = newL
def do_search(k, r):
    start = time.time()
    for i in xrange(r):
        x = random.randint(0, N-1)
        y = random.randint(0, M-1)
        setValue(x, y, random_digit())
    best = None
    attempts = k
    while attempts > 0:
        attempts -= 1
        old = []
        for ch in xrange(1):
            x = random.randint(0, N-1)
            y = random.randint(0, M-1)
            old.append((x, y, lines[y][x]))
            setValue(x, y, random_digit())
        if best is None or covered > best[0]:
            now = time.time()
            sys.stdout.write(str(covered) + chr(13))
            sys.stdout.flush()
            attempts = k
        if best is None or covered >= best[0]:
            best = [covered, lines[:N][:]]
        else:
            for x, y, o in old[::-1]:
                setValue(x, y, o)
    print
    sys.stdout.flush()
    return best
for y in xrange(N):
    for x in xrange(N):
        setValue(x, y, random_digit())

best = None
while True:
    if best is not None:
        for y in xrange(M):
            for x in xrange(N):
                setValue(x, y, best[1][y][x])
    x = do_search(100000, M)
    if best is None or x[0] > best[0]:
        print x[0]
        print "\n".join(" ".join(y) for y in x[1])
    if best is None or x[0] >= best[0]:
        best = x[:]
To create such a grid, I'd start with a list of strings representing the squares of the first K (100) numbers.
Reduce those strings as much as possible, since many are contained within others (for example, 625 contains 25, so 625 covers the squares of both 5 and 25).
This should yield an initial list of 81 unique squares, requiring a minimum of about 312 digits:
def construct_optimal_set(K):
    # compute a minimal solution:
    numbers = [str(n*n) for n in range(0, K+1)]
    min_numbers = []
    # note: go in reverse direction, biggest to smallest, to maximize elimination of smaller numbers later
    while len(numbers) > 0:
        i = 0
        while i < len(min_numbers):
            q = min_numbers[i]
            qr = min_numbers[i][::-1]
            # check if the current number is contained within any element of min_numbers
            if numbers[-1] in q or numbers[-1] in qr:
                break
            # check if any element of min_numbers is contained within the current number
            elif q in numbers[-1] or qr in numbers[-1]:
                min_numbers[i] = numbers[-1]
                break
            i += 1
        # if not found, add it
        if i >= len(min_numbers):
            min_numbers.append(numbers[-1])
        numbers = numbers[:-1]
    min_numbers.sort()
    return min_numbers
This will return a minimal set of squares, with any squares that are subsets of other squares removed. Extend this by concatenating any mostly-overlapping elements (such as 484 and 841 into 4841); I leave that as an exercise, since it will build familiarity with this code.
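For the exercise just mentioned, a greedy sketch of the overlap merge might look like this (my own code, not part of the answer above; it only merges on overlaps of at least two digits and is not necessarily optimal):

def merge_overlapping(strs, min_overlap=2):
    strs = list(strs)
    merged = True
    while merged:
        merged = False
        for i in range(len(strs)):
            for j in range(len(strs)):
                if i == j:
                    continue
                a, b = strs[i], strs[j]
                # longest suffix of a that is also a prefix of b
                for k in range(min(len(a), len(b)), min_overlap - 1, -1):
                    if a.endswith(b[:k]):
                        strs[i] = a + b[k:]   # e.g. 484 + 841 -> 4841
                        del strs[j]
                        merged = True
                        break
                if merged:
                    break
            if merged:
                break
    return strs

print(merge_overlapping(["484", "841"]))   # ['4841']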
Then you assemble these, sort of like a crossword puzzle. As you assemble the values, pack based on the probability of possible future overlaps, by computing a weight for each digit (for example, 1s are fairly common and 9s are less common, so given the choice you would favor overlapping 9s rather than 1s).
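One way to compute such weights (a small sketch of my own, counting digit frequencies over all the squares to pack):

from collections import Counter

digits = "".join(str(i*i) for i in range(1, 101))
freq = Counter(digits)                      # how often each digit occurs
weight = {d: 1.0 / freq[d] for d in freq}   # rarer digits get larger weights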
Use something like the following code to build a list of all possible values that are represented in the current grid. Use this periodically while building, in order to eliminate squares that are already represented, as well as to test whether your grid is a full solution.
def merge(digits):
    result = 0
    for i in range(len(digits)-1, -1, -1):
        result = result * 10 + digits[i]
    return result

def merge_reverse(digits):
    result = 0
    for i in range(0, len(digits)):
        result = result * 10 + digits[i]
    return result
# given a grid where each element contains a single numeric digit,
# return a list of every ordering of those digits less than SQK,
# such that you pick a starting point and one of eight directions,
# and assemble digits until either the end of the grid or a number larger than SQK;
# this will construct only the unique combinations;
# also note that this will not construct a large number of values,
# since for any given direction, there are at most
#   (sqrt(n*n + m*m))!
# possible arrangements, and there will rarely be that many.
def construct_lines(grid, k):
    # rather than build a dictionary type, use a little more memory to get faster simple array indexes;
    # the index is the number, and the value at the index indicates existence:
    # 0 = does not exist, >0 = exists in the grid
    sqk = k*k
    combinations = [0]*(sqk+1)
    # do all horizontals, since they are easiest
    for y in range(len(grid)):
        digits = []
        for x in range(len(grid[y])):
            digits.append(grid[y][x])
        # for every possible starting point...
        for q in range(1, len(digits)):
            number = merge(digits[q:])
            if number <= sqk:
                combinations[number] += 1
    # now do all verticals
    # note that if the grid is really square, grid[0] will give an accurate width of all grid[y][] rows
    for x in range(len(grid[0])):
        digits = []
        for y in range(len(grid)):
            digits.append(grid[y][x])
        # for every possible starting point...
        for q in range(1, len(digits)):
            number = merge(digits[q:])
            if number <= sqk:
                combinations[number] += 1
    # the longer axis (x or y) in both directions will contain every possible diagonal
    # e.g. x is the longer axis here (using random characters to more easily distinguish the idea):
    #   [1 2 3 4]
    #   [a b c d]
    #   [. , $ !]
    # 'a,' can be obtained by reversing the diagonal starting on the bottom and working up and to the left;
    # this means that every set must be reversed as well
    if len(grid) > len(grid[0]):
        # for each y, grab top and bottom in each of two diagonal directions, for a total of four sets,
        # and include the reverse of each set
        for y in range(len(grid)):
            digitsul = []  # origin point upper-left, heading down and right
            digitsur = []  # origin point upper-right, heading down and left
            digitsll = []  # origin point lower-left, heading up and right
            digitslr = []  # origin point lower-right, heading up and left
            revx = len(grid[y])-1  # pre-adjust this for computing the reverse x coordinate
            for deltax in range(len(grid[y])):  # this may go off the grid, so check bounds
                if y+deltax < len(grid):
                    digitsul.append(grid[y+deltax][deltax])
                    digitsll.append(grid[y+deltax][revx - deltax])
                    for q in range(1, len(digitsul)):
                        number = merge(digitsul[q:])
                        if number <= sqk:
                            combinations[number] += 1
                        number = merge_reverse(digitsul[q:])
                        if number <= sqk:
                            combinations[number] += 1
                    for q in range(1, len(digitsll)):
                        number = merge(digitsll[q:])
                        if number <= sqk:
                            combinations[number] += 1
                        number = merge_reverse(digitsll[q:])
                        if number <= sqk:
                            combinations[number] += 1
                if y-deltax >= 0:
                    digitsur.append(grid[y-deltax][deltax])
                    digitslr.append(grid[y-deltax][revx - deltax])
                    for q in range(1, len(digitsur)):
                        number = merge(digitsur[q:])
                        if number <= sqk:
                            combinations[number] += 1
                        number = merge_reverse(digitsur[q:])
                        if number <= sqk:
                            combinations[number] += 1
                    for q in range(1, len(digitslr)):
                        number = merge(digitslr[q:])
                        if number <= sqk:
                            combinations[number] += 1
                        number = merge_reverse(digitslr[q:])
                        if number <= sqk:
                            combinations[number] += 1
    else:
        # for each x, ditto the above
        for x in range(len(grid[0])):
            digitsul = []  # origin point upper-left, heading down and right
            digitsur = []  # origin point upper-right, heading down and left
            digitsll = []  # origin point lower-left, heading up and right
            digitslr = []  # origin point lower-right, heading up and left
            revy = len(grid)-1  # pre-adjust this for computing the reverse y coordinate
            for deltay in range(len(grid)):  # this may go off the grid, so check bounds
                if x+deltay < len(grid[0]):
                    digitsul.append(grid[deltay][x+deltay])
                    digitsll.append(grid[revy - deltay][x+deltay])
                    for q in range(1, len(digitsul)):
                        number = merge(digitsul[q:])
                        if number <= sqk:
                            combinations[number] += 1
                        number = merge_reverse(digitsul[q:])
                        if number <= sqk:
                            combinations[number] += 1
                    for q in range(1, len(digitsll)):
                        number = merge(digitsll[q:])
                        if number <= sqk:
                            combinations[number] += 1
                        number = merge_reverse(digitsll[q:])
                        if number <= sqk:
                            combinations[number] += 1
                if x-deltay >= 0:
                    digitsur.append(grid[deltay][x-deltay])
                    digitslr.append(grid[revy - deltay][x - deltay])
                    for q in range(1, len(digitsur)):
                        number = merge(digitsur[q:])
                        if number <= sqk:
                            combinations[number] += 1
                        number = merge_reverse(digitsur[q:])
                        if number <= sqk:
                            combinations[number] += 1
                    for q in range(1, len(digitslr)):
                        number = merge(digitslr[q:])
                        if number <= sqk:
                            combinations[number] += 1
                        number = merge_reverse(digitslr[q:])
                        if number <= sqk:
                            combinations[number] += 1
    # now filter for squares only
    return [i for i in range(0, k+1) if combinations[i*i] > 0]
Constructing the grid will be computationally expensive overall, but you will only need to run the check function once for each possible placement, to select the best placement.
Optimize placement by finding the subset of overlapping areas where you can place a sequence of numbers. This should be tolerable in terms of time required, because you can cap the number of possible locations to check; e.g. you might cap it at 10 (find the optimal cap experimentally), so that you test the first 10 possible placements against the function above to determine which placement, if any, adds the most possible squares. As you progress, you will have fewer possible locations in which to insert the numbers, so testing which placement is best becomes computationally less expensive at the same time that your search for possible placements becomes more expensive, the two balancing each other out.
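A rough sketch of that capped placement test (my own code; apply_placement is a hypothetical helper that returns a copy of the grid with the candidate written in):

def best_placement(grid, candidates, k, max_checked=10):
    # candidates: placements to try, already ordered by preference
    best, best_gain = None, -1
    base = len(construct_lines(grid, k))       # squares covered right now
    for cand in candidates[:max_checked]:      # cap how many placements we test
        trial = apply_placement(grid, cand)    # hypothetical helper, see above
        gain = len(construct_lines(trial, k)) - base
        if gain > best_gain:
            best, best_gain = cand, gain
    return best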
This will not handle all combinations, and will not pack as tightly as trying every possible arrangement and computing how many squares are covered, so some squares might be missed. But compared to O((N*M)!), this algorithm will actually complete in your lifetime (I'd estimate a few minutes on a decent computer; less if you parallelize the placement check).

Indexing a matrix with predetermined rule

I have P <151x1 double> and D <6x1 double>. An example of D would be [24;7;9;11;10;12]. I have to index P based on D such that I keep 6 blocks of 12 elements from P, where each block is separated from the next by n elements, with n given by D. The first 12 elements of P form the first block. Thus, the first block would be P(1:12,1); the second block would be P(37:48,1), because we want to skip 24 elements after the first block (24 is D(1,1)); the third block would be P(56:67,1), because we want to skip 7 elements after the second block (7 is D(2,1)); and so on. After indexing I should end up with 72 elements.
Could anyone help me find a solution to indexing this efficiently?
Thanks!
One approach -
%// Parameters
block_size = 12;
num_blocks = 6;

step_add = [0 ; cumsum(D(1:num_blocks-1))];
start_ind = [0:block_size:block_size*(num_blocks-1)]' + 1 + step_add;
all_valid_ind = bsxfun(@plus, start_ind, 0:block_size-1)';
out = P(all_valid_ind(:)); %// desired output
Please note that you won't be using the last element of D in the calculations, because each element of D defines the "gap" between consecutive blocks of elements that you are picking up from P. So you only need 5 elements to define the 5 gaps between 6 blocks of elements.
Benchmarking
Loop approach from the other solution (shown further below):
function blocks = loop1(P,D)
    blocks = zeros(12, numel(D)); %// Pre-allocate blocks matrix
    %// We start accessing values at 1
    startIndex = 1;
    %// For each index in D
    for idx = 1 : numel(D)
        %// Grab the 12 elements
        blocks(:,idx) = P(startIndex : startIndex + 11);
        %// Skip over 12 elements PLUS the number specified at D
        startIndex = startIndex + 12 + D(idx);
    end
return;
No-loop approach (as discussed earlier in this solution):
function out = no_loop1(P,D)
    %// Parameters
    block_size = 12;
    num_blocks = numel(D);

    step_add = [0 ; cumsum(D(1:num_blocks-1))];
    start_ind = [0:block_size:block_size*(num_blocks-1)]' + 1 + step_add;
    all_valid_ind = bsxfun(@plus, start_ind, 0:block_size-1)';
    out = P(all_valid_ind(:)); %// desired output
return;
Actual benchmarking and plotting results:
P = rand(200000,1);
N_arr = [100 200 500 1000 2000 5000]; %// No. of D elements
timeall = zeros(2, numel(N_arr));
for k1 = 1:numel(N_arr)
    N = N_arr(k1);
    D = randi(10,N,1) + 10;

    f = @() loop1(P,D);
    timeall(1,k1) = timeit(f);
    clear f

    f = @() no_loop1(P,D);
    timeall(2,k1) = timeit(f);
    clear f
end
figure, hold on
plot(N_arr, timeall(1,:), '-ro')
plot(N_arr, timeall(2,:), '-kx')
legend('Loop Method','No-loop Method')
xlabel('Datasize (No. of D elements) ->')
ylabel('Time (sec) ->')
Results (timing plot omitted)
Conclusions
The no-loop approach looks like the more efficient one across a varying range of data sizes.
Because this is using a recurrence relation, the only option I can see is using for loops. We must use output values from the previous iteration as input into the next iteration, and I personally can't see any technique in my arsenal that can vectorize this.
If there is anyone else (Divakar, Luis Mendo, natan, Ben, Daniel, Amro, etc.) who can propose a more optimal solution, please feel free to answer. Without further ado:
D = [24;7;9;11;10;12]; %// Define number of elements to skip over
blocks = zeros(12, numel(D)); %// Pre-allocate blocks matrix

%// We start accessing values at 1
startIndex = 1;

%// For each index in D
for idx = 1 : numel(D)
    %// Grab the 12 elements
    blocks(:,idx) = P(startIndex : startIndex + 11);
    %// Skip over 12 elements PLUS the number specified at D
    startIndex = startIndex + 12 + D(idx);
end
This should give you a 12 x 6 matrix, where each column corresponds to one set of elements extracted from P. As a small test, we can display the start and end indices used to access P. These are generated by replacing blocks(:,idx) = ... with disp([startIndex startIndex + 11]); in the loop. The indices generated are:
1 12
37 48
56 67
77 88
100 111
122 133
This can be vectorized, no problem.
P = 1:200; % a generic P
D = [24;7;9;11;10;12];
D = [0; D(1:end-1)]; % shift D: the last element defines no gap
basis = repmat(0:11, [6 1]);
startingIndices = cumsum(D + 12) - 11; % 1, 37, 56, 77, 100, 122
usefulIndices = bsxfun(@plus, basis, startingIndices);
P(usefulIndices)
Without some more context, it's hard to suggest a method that indexes this "efficiently" -- if you're only doing this operation a few times, clarity of code is the most important. But I think this will give you a good starting point.
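For comparison, the same cumulative-gap indexing idea sketched in NumPy (my own translation; array names assumed):

import numpy as np

P = np.arange(1, 201)                    # a generic P
D = np.array([24, 7, 9, 11, 10, 12])
gaps = np.concatenate(([0], D[:-1]))     # the last gap is never used
starts = np.cumsum(gaps + 12) - 12       # 0-based block starts: 0, 36, 55, 76, 99, 121
idx = starts[:, None] + np.arange(12)    # 6 x 12 matrix of indices
blocks = P[idx]                          # each row is one 12-element block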

optimization of pairwise L2 distance computations

I need help optimizing this loop. Matrix_1 is an (n x 2) int matrix and Matrix_2 is an (m x 2) matrix; m and n vary.
index_j = 1;
for index_k = 1:size(Matrix_1,1)
    for index_l = 1:size(Matrix_2,1)
        M2_Index_Dist(index_j,:) = [index_l, sqrt(bsxfun(@plus,sum(Matrix_1(index_k,:).^2,2),sum(Matrix_2(index_l,:).^2,2)')-2*(Matrix_1(index_k,:)*Matrix_2(index_l,:)'))];
        index_j = index_j + 1;
    end
end
I need M2_Index_Dist to provide a ((n*m) x 2) matrix with the index of matrix_2 in the first column and the distance in the second column.
Output example:
M2_Index_Dist = [ 1, 5.465
2, 56.52
3, 6.21
1, 35.3
2, 56.52
3, 0
1, 43.5
2, 9.3
3, 236.1
1, 8.2
2, 56.52
3, 5.582]
Here's how to apply bsxfun with your formula (||A-B|| = sqrt(||A||^2 + ||B||^2 - 2*A*B')):
d = real(sqrt(bsxfun(@plus, dot(Matrix_1,Matrix_1,2), ...
    bsxfun(@minus, dot(Matrix_2,Matrix_2,2).', 2 * Matrix_1*Matrix_2.')))).';
You can avoid the final transpose if you change your interpretation of the matrix.
Note: there shouldn't be any complex values for real to handle, but it's there in case very small differences lead to tiny negative numbers under the square root.
Edit: It may be faster without dot:
d = sqrt(bsxfun(@plus, sum(Matrix_1.*Matrix_1,2), ...
    bsxfun(@minus, sum(Matrix_2.*Matrix_2,2)', 2 * Matrix_1*Matrix_2.'))).';
Or with just one call to bsxfun:
d = sqrt(bsxfun(@plus, sum(Matrix_1.*Matrix_1,2), sum(Matrix_2.*Matrix_2,2)') ...
    - 2 * Matrix_1*Matrix_2.').';
Note: this last order of operations gives results identical to yours, rather than differing by an error of ~1e-14.
Edit 2: To replicate M2_Index_Dist:
II = ndgrid(1:size(Matrix_2,1), 1:size(Matrix_1,1));
M2_Index_Dist = [II(:) d(:)];
If I understand correctly, this does what you want:
ind = repmat((1:size(Matrix_2,1)).', size(Matrix_1,1), 1); %// first column: index
d = pdist2(Matrix_2, Matrix_1); %// compute distance between each pair of rows
d = d(:);                       %// second column: distance
result = [ind d];               %// build result from the two columns
As you see, this code calls pdist2 to compute the distance between every pair of rows of your matrices. By default this function uses Euclidean distance.
If you don't have pdist2 (it is part of the Statistics Toolbox), you can replace the second line above with bsxfun:
d = squeeze(sqrt(sum(bsxfun(@minus,Matrix_2,permute(Matrix_1, [3 2 1])).^2,2)));
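For reference, the same ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b expansion written in NumPy (my own sketch; matrix names are assumed):

import numpy as np

M1 = np.random.rand(4, 2)                 # plays the role of Matrix_1 (n x 2)
M2 = np.random.rand(3, 2)                 # plays the role of Matrix_2 (m x 2)

sq = (M1**2).sum(1)[:, None] + (M2**2).sum(1)[None, :] - 2 * M1 @ M2.T
d = np.sqrt(np.maximum(sq, 0))            # clamp tiny negatives, like real() above

ind = np.tile(np.arange(1, M2.shape[0] + 1), M1.shape[0])  # 1-based Matrix_2 indices
result = np.column_stack((ind, d.ravel()))                 # the (n*m) x 2 layout asked for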

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is as in the original K-Means algorithm.
Is the probability function used based on distance or on a Gaussian?
Also, is the most distant point (from the other centroids) picked as a new centroid?
I would appreciate a step-by-step explanation and an example. The one in Wikipedia is not clear enough. A very well commented source code would also help. If you are using 6 arrays, then please tell us which one is for what.
Interesting question. Thank you for bringing this paper to my attention: K-Means++: The Advantages of Careful Seeding.
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
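Those numbers are easy to check with a couple of lines of Python (a small sketch of the same computation):

X = [0, 1, 2, 3, 4]
C = [0]                                          # centers chosen so far
d2 = [min((x - c)**2 for c in C) for x in X]     # [0, 1, 4, 9, 16]
probs = [d / float(sum(d2)) for d in d2]         # P(c2 = x) for each x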
I've coded the initialization procedure in Python; I don't know if this helps you.
import scipy

def initialize(X, K):
    C = [X[0]]
    for k in range(1, K):
        D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        r = scipy.rand()
        for j,p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C
EDIT with clarification: The output of cumsum gives us boundaries that partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So, since r is uniformly chosen in [0,1], it falls into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.
Example:
probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
One Liner.
Say we need to select 2 cluster centers. Instead of selecting them all randomly (as we do in plain k-means), we select the first one randomly, then find the points that are farthest from the first center (these points most probably do not belong to the first cluster center, as they are far from it) and assign the second cluster center near those far points.
I have prepared a full source implementation of k-means++, based on the book "Programming Collective Intelligence" by Toby Segaran and the k-means++ initialization provided here.
Indeed there are two distance functions here. For the initial centroids, a standard one based on numpy.inner is used; afterwards, for fitting the centroids, the Pearson one is used. Maybe the Pearson one could also be used for the initial centroids. They say it is better.
from __future__ import division

def readfile(filename):
    lines = [line for line in file(filename)]
    rownames = []
    data = []
    for line in lines:
        p = line.strip().split(' ')  # single space as separator
        # First column in each row is the rowname
        rownames.append(p[0])
        # The data for this row is the remainder of the row
        data.append([float(x) for x in p[1:]])
    return rownames, data

from math import sqrt

def pearson(v1, v2):
    # Simple sums
    sum1 = sum(v1)
    sum2 = sum(v2)
    # Sums of the squares
    sum1Sq = sum([pow(v,2) for v in v1])
    sum2Sq = sum([pow(v,2) for v in v2])
    # Sum of the products
    pSum = sum([v1[i]*v2[i] for i in range(len(v1))])
    # Calculate r (Pearson score)
    num = pSum - (sum1*sum2/len(v1))
    den = sqrt((sum1Sq - pow(sum1,2)/len(v1)) * (sum2Sq - pow(sum2,2)/len(v1)))
    if den == 0: return 0
    return 1.0 - num/den
import numpy
from numpy.random import *

def initialize(X, K):
    C = [X[0]]
    for _ in range(1, K):
        #D2 = numpy.array([min([numpy.inner(c-x,c-x) for c in C]) for x in X])
        D2 = numpy.array([min([numpy.inner(numpy.array(c)-numpy.array(x), numpy.array(c)-numpy.array(x)) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        #print "cumprobs=",cumprobs
        r = rand()
        #print "r=",r
        i = -1
        for j,p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

def kcluster(rows, distance=pearson, k=4):
    # choose the initial centroids with the k-means++ seeding above,
    # then iterate as in the standard k-means from the book
    clusters = initialize(rows, k)
    lastmatches = None
    for t in range(100):
        bestmatches = [[] for i in range(k)]
        # find which centroid is the closest for each row
        for j in range(len(rows)):
            row = rows[j]
            bestmatch = 0
            for i in range(k):
                d = distance(clusters[i], row)
                if d < distance(clusters[bestmatch], row): bestmatch = i
            bestmatches[bestmatch].append(j)
        # if the results are the same as last time, this is complete
        if bestmatches == lastmatches: break
        lastmatches = bestmatches
        # move the centroids to the average of their members
        for i in range(k):
            avgs = [0.0]*len(rows[0])
            if len(bestmatches[i]) > 0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m] += rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j] /= len(bestmatches[i])
                clusters[i] = avgs
    return bestmatches
rows, data = readfile('/home/toncho/Desktop/data.txt')

kclust = kcluster(data, k=4)

print "Result:"
for c in kclust:
    out = ""
    for r in c:
        out += rows[r] + ' '
    print "[" + out[:-1] + "]"
print 'done'
data.txt:
p1 1 5 6
p2 9 4 3
p3 2 3 1
p4 4 5 6
p5 7 8 9
p6 4 5 4
p7 2 5 6
p8 3 4 5
p9 6 7 8
