Clustering one-dimensional data with an unknown distance metric - algorithm

I have a 2-dimensional array that describes the distances between objects:
    A  B  C
A   0  1  2
B   1  0  3
C   2  3  0
For example, distance(A,B) = 1, distance(B,C) = 3, distance(A,C) = 2, and the distance is symmetric:
distance(x,y) = distance(y,x). I do not know anything more about this distance; it is not the Euclidean distance or any other commonly known distance function.
How can I find the number of groups and the partition of the points into those groups?

I have found a solution:
# D[x][y] is a two-dimensional array with the distance between x and y
# Graph and get_connected_components are assumed to come from a graph library
def find_groups(D):
    # all values that appear in D, duplicates removed, sorted from max to min
    sorted_distances = sorted({D[x][y] for x in D for y in D}, reverse=True)
    for distance in sorted_distances:
        V = list(D.keys())
        E = []
        for x in V:
            for y in V:
                if x == y:
                    continue
                # connect two objects when they are at most `distance` apart
                if D[x][y] <= distance:
                    E.append((x, y))
        G = Graph(V, E)
        connected_components = get_connected_components(G)
        # the required number of groups could be increased if the result is not satisfactory
        if len(connected_components) > 1:
            return connected_components
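The loop above leaves the graph helpers undefined, so here is a minimal self-contained sketch of the same idea in plain Python. It assumes D is a dict of dicts and replaces the graph library with a small union-find; the names threshold_clusters and min_groups are illustrative and not part of the original post.

```python
def threshold_clusters(D, min_groups=2):
    """Lower the distance threshold until at least `min_groups` components appear."""
    nodes = list(D)
    # distinct off-diagonal distances, largest first
    thresholds = sorted({D[x][y] for x in nodes for y in nodes if x != y}, reverse=True)
    for threshold in thresholds:
        parent = {v: v for v in nodes}

        def find(v):
            # follow parent links to the representative, compressing the path
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        # join every pair of objects that is at most `threshold` apart
        for x in nodes:
            for y in nodes:
                if x != y and D[x][y] <= threshold:
                    parent[find(x)] = find(y)

        groups = {}
        for v in nodes:
            groups.setdefault(find(v), []).append(v)
        if len(groups) >= min_groups:
            return sorted(groups.values())
    # no threshold produced enough groups: put every object in its own group
    return [[v] for v in nodes]


D = {
    "A": {"A": 0, "B": 1, "C": 2},
    "B": {"A": 1, "B": 0, "C": 3},
    "C": {"A": 2, "B": 3, "C": 0},
}
clusters = threshold_clusters(D)
print(clusters)
```

For the three-object example this stops at threshold 1, where only A and B are still linked. This is essentially single-linkage clustering cut at the first level that splits the data, so it only needs the pairwise distances and no coordinates.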

Related

Find second minimum for each row of a matrix

I have a set i of customers and a set j of facilities. I have two binary variables: y[i,j], which is 1 if customer i is served by a primary facility and 0 otherwise; and b[i,j], which is 1 if customer i is served by a backup facility and 0 otherwise.
Given the starting matrix d:
- I must set y[i,j] = 1 based on the minimum distance in each row of the matrix (this I have done);
- I have to fix b[i,j] = 1 according to the second minimum distance in each row of the matrix. I don't know how to do this: I wrote max in the code below, but that is not what I need. I've tried removing the first minimum from each row with pop, deleteat, splice, etc., but the solver gives me an error.
using JuMP
using Gurobi
using DelimitedFiles
import Random
import LinearAlgebra
import Plots
n = 3
m = 5
model = Model(Gurobi.Optimizer);
@variable(model, y[1:m,1:n] >= 0, Bin);
@variable(model, b[1:m,1:n] >= 0, Bin);
d = [
    [80 20 40]
    [71 55 24]
    [56 47 81]
    [10 20 30]
    [31 41 21]
];
# PRIMARY ASSIGNMENTS
# 1) For each customer, find the minimum d[i,j] and its position in the matrix, and build a vector V of all the values just found
V = [];
for i = 1:m
    c = findmin(d[i, :]);   # (minimum value, column index)
    push!(V, [c[1], c[2], i]);
end
println(V)
# 2) Sort the vector's elements from the smallest to the largest
S = sort(V)
println(S)
for i = 1:m
    println(S[i][2])
    println(S[i][3])
end
# 3) Fix primary assignments for the first 50% of customers
for i = 1:3
    fix(y[S[i][3], S[i][2]], 1.0, force = true);
end
# SECONDARY ASSIGNMENTS
# 1) For each customer, find the second minimum d[i,j] and its position in the matrix, and build a vector W of all the values just found
W = [];
for i = 1:m
    f = findmax(d[i, :]);   # placeholder: this finds the maximum, not the second minimum
    push!(W, [f[1], f[2], i]);
end
println(W)
# 2) Sort the vector's elements from the smallest to the largest
T = sort(W)
println(T)
for i = 1:3
    println(T[i][2])
    println(T[i][3])
end
# 3) Fix secondary assignments for the first 50% of customers
for i = 1:3
    fix(b[T[i][3], T[i][2]], 1.0, force = true);
end
optimize!(model)
I tried to find the second minimum of each row, but I could not.
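One way to get the second minimum without deleting anything from the matrix is to rank the (value, column) pairs of each row and take the second one. The sketch below shows the idea in Python rather than Julia, using the same matrix d and 1-based indices so the numbers line up with the Julia code; the helper name two_smallest is illustrative. In Julia, sortperm(d[i, :])[2] should give the same column.

```python
d = [
    [80, 20, 40],
    [71, 55, 24],
    [56, 47, 81],
    [10, 20, 30],
    [31, 41, 21],
]


def two_smallest(row):
    """Return (value, column) for the smallest and the second-smallest entry of a row."""
    ranked = sorted((value, col) for col, value in enumerate(row, start=1))
    return ranked[0], ranked[1]


# W mirrors the vector in the question: (second minimum, its column j, customer i)
W = []
for i, row in enumerate(d, start=1):
    _, (second_value, second_col) = two_smallest(row)
    W.append((second_value, second_col, i))

W.sort()  # smallest second-minimum first
print(W)
```

Because the row itself is never modified, nothing that the solver model depends on is changed; only the indices of the chosen cells are passed on to fix.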

Print the elements that make up the min cost path from a start point to an end point in a grid

We can calculate the min cost using this recurrence relation:
min(mat[i-1][j], mat[i][j-1]) + mat[i][j];
0  1  2  3
4  5  6  7
8  9 10 11
Using the above recurrence relation, we get min-cost(1,2) = 0+1+2+6 = 9.
I am getting the min cost sum, that's not a problem. Now I want to print the elements 0, 1, 2, 6, because these elements make up the min cost path.
Any help is really appreciated.
Suppose your endpoint is [x, y] and your start point is [a, b]. After the recursion step, start from the endpoint and backtrack to the start point.
Here is the pseudocode:
# Assuming grid is the given input 2D grid and mat is the min cost matrix
output = []
p = x, q = y
# keep going until both coordinates match the start point
while(p != a || q != b):
    output.add(grid[p][q])
    min = infinity
    newP = -1, newQ = -1
    # candidate predecessor: the cell above (never step past the start row)
    if(p - 1 >= a && mat[p - 1][q] < min):
        min = mat[p - 1][q]
        newP = p - 1
        newQ = q
    # candidate predecessor: the cell to the left (never step past the start column)
    if(q - 1 >= b && mat[p][q - 1] < min):
        min = mat[p][q - 1]
        newP = p
        newQ = q - 1
    p = newP, q = newQ
end
output.add(grid[a][b])
# print output in reverse order to list the path from start to end
Notice that we used mat and grid here: two 2D matrices, where grid is the given input and mat is the matrix generated by the recursion step mat[i][j] = min(mat[i - 1][j], mat[i][j - 1]) + grid[i][j]
Hope it helps!
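The backtracking idea can be sketched as runnable Python. This is only an illustration under the assumptions of the question (moves go right or down, the path starts at the top-left cell by default); the function name min_cost_path and its parameters are made up for the example.

```python
def min_cost_path(grid, start=(0, 0), end=None):
    """Return (cost, elements on one cheapest right/down path from start to end)."""
    rows, cols = len(grid), len(grid[0])
    a, b = start
    x, y = end if end is not None else (rows - 1, cols - 1)
    INF = float("inf")

    # mat[i][j] = cheapest cost of reaching (i, j) from the start cell
    mat = [[INF] * cols for _ in range(rows)]
    for i in range(a, x + 1):
        for j in range(b, y + 1):
            if (i, j) == (a, b):
                mat[i][j] = grid[i][j]
                continue
            up = mat[i - 1][j] if i > a else INF
            left = mat[i][j - 1] if j > b else INF
            mat[i][j] = min(up, left) + grid[i][j]

    # walk back from the end, always stepping to the cheaper predecessor
    path = []
    p, q = x, y
    while (p, q) != (a, b):
        path.append(grid[p][q])
        up = mat[p - 1][q] if p > a else INF
        left = mat[p][q - 1] if q > b else INF
        if up <= left:
            p -= 1
        else:
            q -= 1
    path.append(grid[a][b])
    path.reverse()
    return mat[x][y], path


grid = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
cost, path = min_cost_path(grid, end=(1, 2))
print(cost, path)
```

When the two predecessors tie, this sketch arbitrarily prefers the cell above, so it reports one optimal path rather than all of them.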
Besides computing the min cost matrix using the relation that you mentioned, you can also create a predecessor matrix.
For each cell (i, j), you should also store which neighbour was the "min" in the relation that you mentioned (was it the element to the left, or the element above?). In this way, you will know for each cell which cell precedes it on an optimal path.
Afterwards, you can generate the path by starting from the final cell and moving backwards according to the "predecessor" matrix, until you reach the top-left cell.
Note that the going backwards idea can be applied also without explicitly constructing a predecessor matrix. At each point, you would need to look which of the candidate predecessors has a lower total cost.
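A small Python sketch of the predecessor-matrix variant, assuming the path starts at the top-left cell and only moves right or down as in the question; the function names min_cost_with_predecessors and path_to are illustrative.

```python
def min_cost_with_predecessors(grid):
    """Build the min cost matrix together with a predecessor matrix."""
    rows, cols = len(grid), len(grid[0])
    cost = [[0] * cols for _ in range(rows)]
    pred = [[None] * cols for _ in range(rows)]  # None marks the top-left cell
    for i in range(rows):
        for j in range(cols):
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))  # from above
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))  # from the left
            if candidates:
                best_cost, best_cell = min(candidates)
                cost[i][j] = best_cost + grid[i][j]
                pred[i][j] = best_cell
            else:
                cost[i][j] = grid[i][j]
    return cost, pred


def path_to(grid, pred, target):
    """Follow predecessors back from `target` and return the elements in path order."""
    cells = []
    cell = target
    while cell is not None:
        cells.append(cell)
        cell = pred[cell[0]][cell[1]]
    return [grid[i][j] for i, j in reversed(cells)]


grid = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
cost, pred = min_cost_with_predecessors(grid)
print(cost[1][2], path_to(grid, pred, (1, 2)))
```

The predecessor matrix costs extra memory, but one pass answers path queries for every target cell without recomputing any comparisons.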

Finding a Summation in range

I am given two arrays, A and B, where B contains indexes into A (1-based).
Let
A = [2,3,5,6,7,8,9]
B = [1,3,1,4,5,2,6]
I am given Q queries in which I have to find the sum over the range L to R using B.
Example: L=2, R=4
Sum = A[B[2]] + A[B[3]] + A[B[4]]
Sum = A[3] + A[1] + A[4] = 5+2+6 = 13
There are also update queries, e.g.
U 2 10
A[2] = 10 // previously 3
S 6 7
Sum = A[B[6]] + A[B[7]] = A[2] + A[6] = 10+8 = 18
Is there any solution better than O(Q*N)? Is there a data structure that supports both update and summation in this case (something like a segment tree or a BIT)?
Constraints:
N, Q, Value < 10^6
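One possible direction, offered as a sketch rather than a definitive answer: keep a Fenwick tree (BIT) over the positions of B, where position i stores A[B[i]], and remember at which positions each index of A occurs. A sum query then costs O(log N); an update costs O(c log N), where c is the number of times the updated index occurs in B, so the worst case is still slow when one index dominates B. The class name IndirectSum and its methods are illustrative.

```python
class IndirectSum:
    """Range sums of A[B[i]] with point updates to A (1-based indices throughout)."""

    def __init__(self, A, B):
        self.A = [0] + list(A)          # pad so that A[1] is the first element
        self.n = len(B)
        self.tree = [0] * (self.n + 1)  # Fenwick tree over the positions of B
        self.positions = {}             # index k of A -> positions i with B[i] == k
        for i, k in enumerate(B, start=1):
            self.positions.setdefault(k, []).append(i)
            self._add(i, self.A[k])

    def _add(self, i, delta):
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def _prefix(self, i):
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def update(self, k, value):
        """Set A[k] = value and refresh every position of B that points at k."""
        delta = value - self.A[k]
        self.A[k] = value
        for i in self.positions.get(k, []):
            self._add(i, delta)

    def query(self, left, right):
        """Sum of A[B[i]] for left <= i <= right."""
        return self._prefix(right) - self._prefix(left - 1)


s = IndirectSum([2, 3, 5, 6, 7, 8, 9], [1, 3, 1, 4, 5, 2, 6])
first = s.query(2, 4)
s.update(2, 10)
second = s.query(6, 7)
print(first, second)
```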

Enumerate matrix combinations with fixed row and column sums

I'm trying to find an algorithm (not a MATLAB command) to enumerate all possible NxM matrices under two constraints: every cell holds a non-negative integer, and the sum of each row and each column is fixed (these sums are the parameters of the algorithm).
Example:
Enumerate all 2x3 matrices with row totals 2, 1 and column totals 0, 1, 2:
| 0 0 2 | = 2
| 0 1 0 | = 1
  0 1 2
| 0 1 1 | = 2
| 0 0 1 | = 1
  0 1 2
This is a rather simple example, but as N and M increase, as well as the sums, there can be a lot of possibilities.
Edit 1
I might have a valid arrangement to start the algorithm:
matrix = new Matrix(N, M) // NxM matrix filled with 0s
FOR i FROM 0 TO matrix.rows().count()
    FOR j FROM 0 TO matrix.columns().count()
        a = target_row_sum[i] - matrix.rows[i].sum()
        b = target_column_sum[j] - matrix.columns[j].sum()
        matrix[i, j] = min(a, b)
    END FOR
END FOR
target_row_sum[i] being the expected sum on row i.
In the example above it gives the 2nd arrangement.
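The greedy starting arrangement from Edit 1 can be written as a few lines of Python. This is a sketch that tracks the remaining row and column totals directly instead of re-summing the matrix; the function name greedy_fill is illustrative.

```python
def greedy_fill(row_sums, col_sums):
    """Fill cells left to right, top to bottom, with as much as the row and column still allow."""
    rows_left = list(row_sums)
    cols_left = list(col_sums)
    matrix = [[0] * len(col_sums) for _ in row_sums]
    for i in range(len(row_sums)):
        for j in range(len(col_sums)):
            value = min(rows_left[i], cols_left[j])
            matrix[i][j] = value
            rows_left[i] -= value
            cols_left[j] -= value
    return matrix


start = greedy_fill([2, 1], [0, 1, 2])
print(start)
```

For the example totals this yields the second arrangement listed in the question.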
Edit 2:
(based on j_random_hacker's last statement)
Let M be any matrix satisfying the given conditions (fixed row and column sums, non-negative cell values).
Let (a, b, c, d) be 4 cell values in M where a and b are on the same row, c and d are on the same row, a and c are on the same column, and b and d are on the same column.
Let Xa be the row number of the cell containing a and Ya be its column number.
Example:
| 1 a b |
| 1 2 3 |
| 1 c d |
-> Xa = 0, Ya = 1
-> Xb = 0, Yb = 2
-> Xc = 2, Yc = 1
-> Xd = 2, Yd = 2
Here is an algorithm to get all the combinations that satisfy the initial conditions while varying only a, b, c and d:
// A matrix array containing a single element, M
// It will be filled with all possible combinations
matrices = [M]
I = min(a, d)
J = min(b, c)
// a and d decrease, b and c increase: every row and column sum is unchanged
FOR i FROM 1 TO I
    tmp_matrix = copy of M
    tmp_matrix[Xa, Ya] = a - i
    tmp_matrix[Xb, Yb] = b + i
    tmp_matrix[Xc, Yc] = c + i
    tmp_matrix[Xd, Yd] = d - i
    matrices.add(tmp_matrix)
END FOR
// b and c decrease, a and d increase
FOR j FROM 1 TO J
    tmp_matrix = copy of M
    tmp_matrix[Xa, Ya] = a + j
    tmp_matrix[Xb, Yb] = b - j
    tmp_matrix[Xc, Yc] = c - j
    tmp_matrix[Xd, Yd] = d + j
    matrices.add(tmp_matrix)
END FOR
It should then be possible to find every possible combination of matrix values:
- Apply the algorithm to the first matrix for every possible group of 4 cells;
- Recursively apply the algorithm to each matrix obtained in the previous iteration, for every possible group of 4 cells except any group already used in a parent execution.
The recursion depth should be (N*(N-1)/2)*(M*(M-1)/2), each execution resulting in ((N*(N-1)/2)*(M*(M-1)/2) - depth)*(I+J+1) sub-matrices. But this creates a LOT of duplicate matrices, so this could probably be optimized.
Do you need this to calculate Fisher's exact test? Because that requires what you're doing, and based on that page, it seems there will in general be a vast number of solutions, so you probably can't do better than a brute-force recursive enumeration if you want every solution. On the other hand, it seems Monte Carlo approximations are successfully used by some software instead of full-blown enumerations.
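A brute-force recursive enumeration of the kind mentioned here could look like the following Python sketch. It fills the matrix one row at a time and never lets a column exceed its remaining total; the function name enumerate_matrices and its inner helpers are illustrative, and the running time grows with the number of solutions, which can be huge.

```python
def enumerate_matrices(row_sums, col_sums):
    """Yield every non-negative integer matrix with the given row and column sums."""
    if sum(row_sums) != sum(col_sums):
        return  # no matrix can satisfy both sets of totals
    n_cols = len(col_sums)

    def rows_with_sum(total, caps, j=0):
        # all rows that sum to `total` with entry j bounded by caps[j]
        if j == n_cols:
            if total == 0:
                yield []
            return
        for v in range(min(total, caps[j]) + 1):
            for rest in rows_with_sum(total - v, caps, j + 1):
                yield [v] + rest

    def fill(i, caps):
        # caps holds what each column may still receive from rows i, i+1, ...
        if i == len(row_sums):
            if all(c == 0 for c in caps):
                yield []
            return
        for row in rows_with_sum(row_sums[i], caps):
            remaining = [c - v for c, v in zip(caps, row)]
            for below in fill(i + 1, remaining):
                yield [row] + below

    yield from fill(0, list(col_sums))


solutions = list(enumerate_matrices([2, 1], [0, 1, 2]))
for matrix in solutions:
    print(matrix)
```

Unlike the 4-cell rearrangement approach, this generates each matrix exactly once, so no duplicate filtering is needed.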
I asked a similar question, which might be helpful. Although that question deals with preserving frequencies of letters in each row and column rather than sums, some results can be translated across. E.g. if you find any submatrix (pair of not-necessarily-adjacent rows and pair of not-necessarily-adjacent columns) with numbers
xy
yx
Then you can rearrange these to
yx
xy
without changing any row or column sums. However:
- mhum's answer proves that there will in general be valid matrices that cannot be reached by any sequence of such 2x2 swaps. This can be seen by taking his 3x3 matrices and mapping A -> 1, B -> 2, C -> 4 and noticing that, because no element appears more than once in a row or column, frequency preservation in the original matrix is equivalent to sum preservation in the new matrix. However...
- someone's answer links to a mathematical proof that it actually will work for matrices whose entries are just 0 or 1.
More generally, if you have any submatrix
ab
cd
where the (not necessarily unique) minimum is d, then you can replace this with any of the d+1 matrices
ef
gh
where h = d-i, g = c+i, f = b+i and e = a-i, for any integer 0 <= i <= d.
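The 2x2 move described above can be sketched in Python as follows. The function name shift_variants is illustrative, and the sketch bounds i by min(a, d) so it also works when d is not the minimum of the four cells.

```python
def shift_variants(matrix, r1, r2, c1, c2):
    """All matrices obtained by moving i units off the a/d diagonal onto the b/c diagonal.

    a = matrix[r1][c1], b = matrix[r1][c2], c = matrix[r2][c1], d = matrix[r2][c2].
    """
    a, d = matrix[r1][c1], matrix[r2][c2]
    variants = []
    for i in range(min(a, d) + 1):
        m = [row[:] for row in matrix]  # copy so the input stays untouched
        m[r1][c1] -= i  # e = a - i
        m[r1][c2] += i  # f = b + i
        m[r2][c1] += i  # g = c + i
        m[r2][c2] -= i  # h = d - i
        variants.append(m)
    return variants


variants = shift_variants([[0, 1, 1], [0, 0, 1]], 0, 1, 1, 2)
for m in variants:
    print(m)
```

Each row and each column gains i in one cell and loses i in another, which is why all the sums stay fixed.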
For an NxM matrix you have NxM unknowns and N+M equations. Put random numbers into the top-left (N-1)x(M-1) sub-matrix, except for the (N-1, M-1) element. Now you can trivially find the closed form for the remaining N+M elements.
More details: there are a total of T = N*M elements.
There are R = (N-1)*(M-1) - 1 randomly filled elements.
Remaining number of unknowns: T-R = N*M - (N-1)*(M-1) + 1 = N+M

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is the same as in the original K-Means algorithm.
Is the probability function used based on distance, or is it Gaussian?
At the same time, the most distant point (from the other centroids) is picked as a new centroid.
I would appreciate a step-by-step explanation and an example. The one on Wikipedia is not clear enough. Well-commented source code would also help. If you are using 6 arrays, then please tell us which one is for what.
Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
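The numbers in this example can be checked with a few lines of Python. The sketch uses exact fractions and one-dimensional points only; the function name seeding_probabilities is illustrative, and it assumes at least one point does not coincide with a chosen center.

```python
from fractions import Fraction


def seeding_probabilities(points, centers):
    """Probability of each point becoming the next center: D(x)^2 / sum of D^2."""
    # D(x)^2 is the squared distance from x to its nearest already-chosen center
    d2 = [min((x - c) ** 2 for c in centers) for x in points]
    total = sum(d2)
    return [Fraction(v, total) for v in d2]


points = [0, 1, 2, 3, 4]
after_c1 = seeding_probabilities(points, [0])
after_c2 = seeding_probabilities(points, [0, 4])
print(after_c1)
print(after_c2)
```

Points that are already centers get probability 0, which is why they drop out of the lists in the example.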
I've coded the initialization procedure in Python; I don't know if this helps you.
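import numpy as np

def initialize(X, K):
    # X is an array of observation vectors; the first center is X[0]
    C = [X[0]]
    for k in range(1, K):
        # squared distance from each point to its nearest chosen center
        D2 = np.array([min([np.inner(c-x, c-x) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        r = np.random.rand()
        # pick the first point whose cumulative probability exceeds r
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C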
EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.
Example:
probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
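The chain of comparisons above is a search for the first cumulative sum that exceeds r, so it can also be written with a binary search. A minimal sketch using only the standard library; the function name sample_index and the optional r argument (handy for checking specific draws) are illustrative.

```python
import bisect
import itertools
import random


def sample_index(probs, r=None):
    """Return index i with probability probs[i], via the cumulative-sum partition of [0, 1)."""
    cumprobs = list(itertools.accumulate(probs))
    if r is None:
        r = random.random()
    # bisect_right finds the first position whose cumulative sum is greater than r;
    # the min() guards against round-off leaving the last sum slightly below 1
    return min(bisect.bisect_right(cumprobs, r), len(probs) - 1)


probs = [0.1, 0.2, 0.3, 0.4]
picks = [sample_index(probs, r) for r in (0.05, 0.25, 0.5, 0.9)]
print(picks)
```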
One Liner.
Say we need to select 2 cluster centers. Instead of selecting them all randomly (as we do in simple k-means), we select the first one randomly, then find the points that are farthest from the first center (these points most probably do not belong to the first cluster, as they are far from it) and place the second cluster center near those far points.
I have prepared a full source implementation of k-means++ based on the book "Collective Intelligence" by Toby Segaran and the k-means++ initialization provided here.
Indeed, there are two distance functions here. For the initial centroids a standard one based on numpy.inner is used, and then for the centroid fixation the Pearson one is used. Maybe the Pearson one can also be used for the initial centroids. They say it is better.
from __future__ import division
def readfile(filename):
    with open(filename) as f:
        lines=[line for line in f]
    rownames=[]
    data=[]
    for line in lines:
        p=line.strip().split(' ') #single space as separator
        #print p
        # First column in each row is the rowname
        rownames.append(p[0])
        # The data for this row is the remainder of the row
        data.append([float(x) for x in p[1:]])
        #print [float(x) for x in p[1:]]
    return rownames,data
from math import sqrt
def pearson(v1,v2):
    # Simple sums
    sum1=sum(v1)
    sum2=sum(v2)
    # Sums of the squares
    sum1Sq=sum([pow(v,2) for v in v1])
    sum2Sq=sum([pow(v,2) for v in v2])
    # Sum of the products
    pSum=sum([v1[i]*v2[i] for i in range(len(v1))])
    # Calculate r (Pearson score)
    num=pSum-(sum1*sum2/len(v1))
    den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
    if den==0: return 0
    # Turn the correlation into a distance: 0 means perfectly correlated
    return 1.0-num/den
import numpy
from numpy.random import rand
def initialize(X, K):
    C = [X[0]]
    for _ in range(1, K):
        #D2 = numpy.array([min([numpy.inner(c-x,c-x) for c in C]) for x in X])
        D2 = numpy.array([min([numpy.inner(numpy.array(c)-numpy.array(x),numpy.array(c)-numpy.array(x)) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        #print "cumprobs=",cumprobs
        r = rand()
        #print "r=",r
        i=-1
        for j,p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C
# ... (lines missing in the source: the beginning of kcluster(), up to the centroid-update step) ...
            if len(bestmatches[i])>0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m]+=rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j]/=len(bestmatches[i])
                clusters[i]=avgs
    return bestmatches
rows,data=readfile('/home/toncho/Desktop/data.txt')
kclust = kcluster(data,k=4)
print("Result:")
for c in kclust:
    out = ""
    for r in c:
        out+=rows[r] +' '
    print("["+out[:-1]+"]")
print('done')
data.txt:
p1 1 5 6
p2 9 4 3
p3 2 3 1
p4 4 5 6
p5 7 8 9
p6 4 5 4
p7 2 5 6
p8 3 4 5
p9 6 7 8
