First use of PyMC fails - pymc

I am new to PyMC and would like to know why this code doesn't work. I have already spent hours on this but I am missing something. Could anyone help me?
The question I want to address:
I have a set of Npts measurements that show 3 bumps, so I want to model them as the sum of 3 Gaussians (assuming the measurements are noisy and the Gaussian approximation is OK) ==> I want to estimate 8 parameters: the relative weights of the bumps (i.e. 2 params), their 3 means and their 3 variances.
I want this approach to be general enough to apply to other data sets that may not have the same bumps, so I use loose flat priors.
The problem:
My code below gives me poor estimates. What's wrong? Thanks.
"""
hypothesis: multimodal distrib sum of 3 gaussian distributions
model description:
* p1, p2, p3 are the probabilities for a point to belong to gaussian 1, 2 or 3
==> p1, p2, p3 are the relative weights of the 3 gaussians
* once a point is associated with a gaussian,
it is distributed normally according to the parameters mu_i, sigma_i of the gaussian
but instead of sigma, PyMC works with the precision tau = 1/sigma**2
* thus, PyMC must estimate 8 parameters: p1, p2, mu1, mu2, mu3, tau1, tau2, tau3
* priors on p1, p2 are flat between 0.1 and 0.9 ==> 'pm.Uniform' variables
with the constraint p2<=1-p1. p3 is deterministic ==1-p1-p2
* the 'assignment' variable assigns each point to a gaussian, according to probabilities p1, p2, p3
* priors on mu1, mu2, mu3 are flat between 40 and 120 ==> 'pm.Uniform' variables
* priors on sigma1, sigma2, sigma3 are flat between 4 and 12 ==> 'pm.Uniform' variables
"""
import numpy as np
import pymc as pm
data = np.loadtxt('distrib.txt')
Npts = len(data)
mumin = 40
mumax = 120
sigmamin=4
sigmamax=12
p1 = pm.Uniform("p1",0.1,0.9)
p2 = pm.Uniform("p2",0.1,1-p1)
p3 = 1-p1-p2
assignment = pm.Categorical('assignment',[p1,p2,p3],size=Npts)
mu = pm.Uniform('mu',[mumin,mumin,mumin],[mumax,mumax,mumax])
sigma = pm.Uniform('sigma',[sigmamin,sigmamin,sigmamin],
[sigmamax,sigmamax,sigmamax])
tau = 1/sigma**2
@pm.deterministic
def assign_mu(assi=assignment, mu=mu):
    return mu[assi]

@pm.deterministic
def assign_tau(assi=assignment, tau=tau):
    return tau[assi]
hypothesis = pm.Normal("obs", assign_mu, assign_tau, value=data, observed=True)
model = pm.Model([hypothesis, p1, p2, tau, mu])
test = pm.MCMC(model)
test.sample(50000,burn=20000) # conservative values, let's take a coffee...
print('\nguess\n* p1, p2 = ',
np.mean(test.trace('p1')[:]),' ; ',
np.mean(test.trace('p2')[:]),' ==> p3 = ',
1-np.mean(test.trace('p1')[:])-np.mean(test.trace('p2')[:]),
'\n* mu = ',
np.mean(test.trace('mu')[:,0]),' ; ',
np.mean(test.trace('mu')[:,1]),' ; ',
np.mean(test.trace('mu')[:,2]))
print('why does this guess suck ???!!!')
I can send the data file 'distrib.txt'. It is ~500 kb and the data are plotted below. For instance, the last run gave me:
p1, p2 = 0.366913192214 ; 0.583816452532 ==> p3 = 0.04927035525400003
mu = 77.541619286 ; 75.3371615466 ; 77.2427165073
while there are obviously bumps around ~55, ~75 and ~90, with probabilities around ~0.2, ~0.5 and ~0.3

You have the problem described here: Negative Binomial Mixture in PyMC
The problem is the Categorical variable converges too slowly for the three component distributions to get even close.
First, we generate your test data:
data1 = np.random.normal(55,5,2000)
data2 = np.random.normal(75,5,5000)
data3 = np.random.normal(90,5,3000)
data=np.concatenate([data1, data2, data3])
np.savetxt("distrib.txt", data)
Then we plot the histogram, colored by the posterior group assignment:
tablebyassignment = [data[np.nonzero(np.round(test.trace("assignment")[:].mean(axis=0)) == i)] for i in range(0,3)]
plt.hist(tablebyassignment, bins=30, stacked=True)
This will eventually converge, but not quickly enough to be useful to you.
You can fix this problem by guessing the values of assignment before starting MCMC:
from sklearn.cluster import KMeans
kme = KMeans(3)
kme.fit(np.atleast_2d(data).T)
assignment = pm.Categorical('assignment',[p1,p2,p3],size=Npts, value=kme.labels_)
Which gives you:
Using k-means to initialize the categorical may not work all of the time, but it is better than not converging.
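As a further refinement (my own suggestion, not part of the original answer), the same k-means fit can also seed the component means and weights, so that every latent variable starts near a plausible value; PyMC2 stochastics accept an initial state through the value= keyword. A minimal sketch, assuming the other variables are defined as in the question:
from sklearn.cluster import KMeans

kme = KMeans(3)
kme.fit(np.atleast_2d(data).T)

centers = kme.cluster_centers_.ravel()            # one initial mean per component
weights = np.bincount(kme.labels_) / float(Npts)  # empirical component weights

# Illustrative seeding (not the answer's exact code): labels, centers and
# weights all refer to the same k-means component ordering, and the seed
# values must lie inside the prior bounds.
p1 = pm.Uniform("p1", 0.1, 0.9, value=weights[0])
p2 = pm.Uniform("p2", 0.1, 1 - p1, value=weights[1])
p3 = 1 - p1 - p2
mu = pm.Uniform('mu', [mumin]*3, [mumax]*3, value=centers)
assignment = pm.Categorical('assignment', [p1, p2, p3], size=Npts, value=kme.labels_)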

Related

3D gaussian generator/transform

How do I generate 3 Gaussian variables? I know that the Box-Muller algorithm can be used to convert two (U1,U2) uniform variables into two (X,Y) Gaussian variables, but how do I generate the 3rd one (Z)?
A simple way:
It is unlikely in this sort of game that you will need 3 Gaussian variates just once.
You need some store variable that can contains either a triplet of Gaussian variates or nothing (Null, Nothing, Empty, whatever that is in your programming language, you didn't tell us which one).
Initially, the store contains nothing (empty).
When asked for a triplet:
* if the store contains a triplet, return that triplet and mark the store as empty;
* if the store is empty, run Box-Muller 3 times. That gives you 2 triplets: put the second triplet in the store and return the first one.
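A minimal sketch of this store-based approach in Python (the function and variable names are my own illustration):
import math, random

_store = []  # holds the spare triplet between calls

def box_muller():
    # one Box-Muller draw: two uniforms in, two unit Gaussians out
    u1, u2 = random.random(), random.random()
    r = math.sqrt(-2.0 * math.log(1.0 - u1))   # 1-u1 avoids log(0)
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def gauss_triplet():
    global _store
    if _store:                    # a spare triplet is waiting: return it and empty the store
        triplet, _store = _store, []
        return triplet
    g = [v for _ in range(3) for v in box_muller()]  # 3 runs -> 6 Gaussian variates
    _store = g[3:]                # keep the second triplet for the next call
    return g[:3]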
An alternative way for the mathematically inclined programmer:
If one just tries to adapt Box-Muller to 3 dimensions, the sole tricky part is to get the norm of the random 3D vector. The rest is about the 2 spherical angles θ (theta) and φ (phi), which is easy stuff.
It turns out that in 3 dimensions, that norm involves the inverse of the incomplete gamma function.
And if you have Python and Numpy/Scipy, this is function scipy.special.gammaincinv.
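To make the connection explicit (my own summary, consistent with the code below): for a 3D unit Gaussian vector the squared norm follows a chi-squared distribution with 3 degrees of freedom, whose CDF is the regularized lower incomplete gamma function P(3/2, r^2/2). Inverse-transform sampling a uniform u therefore gives r^2 = 2 * gammaincinv(1.5, u), which is exactly the norm2 line in the code.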
We can thus write this code:
import math
import numpy.random as rd
import scipy.special as sp
# convert 3 uniform [0,1) variates into 3 unit Gaussian variates:
def boxMuller3d(u3):
    u0,u1,u2 = u3 # 3 uniform random numbers in [0,1)
    gamma = u0
    norm2 = 2.0 * sp.gammaincinv(1.5, gamma) # "regularized" versions
    norm = math.sqrt(norm2)
    zr = (2.0 * u1) - 1.0 # sin(theta)
    hr = math.sqrt(1.0 - zr*zr) # cos(theta)
    phi = 2.0 * math.pi * u2
    xr = hr * math.cos(phi)
    yr = hr * math.sin(phi)
    g3 = list(map(lambda c: c*norm, [xr, yr, zr]))
    return g3

# generate 3 uniform variates and convert them into 3 unit Gaussian variates:
def gauss3(rng):
    u3 = rng.uniform(0.0, 1.0, 3)
    g3 = boxMuller3d(u3)
    return g3
To (partly) check correctness, we can use this small main program, which displays the statistical moments of order 1 to 4 of the resulting random series:
randomSeed = 42
rng = rd.default_rng(randomSeed)
count = 3000000 # (X,Y,Z) triplet count
variates = []
for i in range(count):
    g3 = gauss3(rng)
    variates += g3
ln = len(variates)
print("length=%d\n" % ln)
# Checking statistical moments of order 1 to 4:
m1 = sum(variates) / ln
m2 = sum( map(lambda x: x*x, variates) ) / ln
m3 = sum( map(lambda x: x**3, variates) ) / ln
m4 = sum( map(lambda x: x**4, variates) ) / ln
print("m1=%g m2=%g m3=%g m4=%g\n" % (m1,m2,m3,m4))
Test program output:
length=9000000
m1=-0.000455911 m2=1.00025 m3=-0.000563454 m4=3.00184
We can thus see that these moments are reasonably close to their mathematically expected values of 0, 1, 0 and 3, respectively.

Discrete path tracking with python gekko

I have some discrete data points representing a path, and I want to minimize the distance between an object's trajectory and these path points, along with some other constraints. I'm trying out GEKKO as a tool to solve this problem, so I made a simple test problem with data points from a parabola and a constraint on the path. My attempt to solve it is:
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
import time
#path data points
x_ref = np.linspace(0, 4, num=21)
y_ref = - np.square(x_ref) + 16
#constraint for visualization purposes
x_bound = np.linspace(0, 4, num=10)
y_bound = 1.5*x_bound + 4
def distfunc(x, y, xref, yref, p):
    '''
    Shortest distance from (x,y) to (xref, yref)
    '''
    dtemp = []
    for i in range(len(xref)):
        d = (x - xref[i])**2 + (y - yref[i])**2
        dtemp.append(d)
    min_id = dtemp.index(min(dtemp))
    if min_id == 0:
        next_id = min_id + 1
    elif min_id == len(xref) - 1:
        next_id = min_id - 1
    else:
        d2 = (x - xref[min_id-1])**2 + (y - yref[min_id-1])**2
        d1 = (x - xref[min_id+1])**2 + (y - yref[min_id+1])**2
        d_next = [d2, d1]
        next_id = min_id + 2*d_next.index(min(d_next)) - 1
    n1 = xref[next_id] - xref[min_id]
    n2 = yref[next_id] - yref[min_id]
    nnorm = p.sqrt(n1**2 + n2**2)
    n1 = n1 / nnorm
    n2 = n2 / nnorm
    difx = x - xref[min_id]
    dify = y - yref[min_id]
    dot = difx*n1 + dify*n2
    deltax = difx - dot*n1
    deltay = dify - dot*n2
    return deltax**2 + deltay**2
v_ref = 3
now = time.time()
p = GEKKO(remote=False)
p.time = np.linspace(0,10,21)
x = p.Var(value=0)
y = p.Var(value=16)
vx = p.Var(value=1)
vy = p.Var(value=0)
ax = p.Var(value=0)
ay = p.Var(value=0)
p.options.IMODE = 6
p.options.SOLVER = 3
p.options.WEB = 0
x_refg = p.Param(value=x_ref)
y_refg = p.Param(value=y_ref)
v_ref = p.Const(value=v_ref)
p.Obj(distfunc(x,y,x_refg,y_refg,p))
p.Obj( (p.sqrt(vx**2+vy**2) - v_ref)**2 + ax**2 + ay**2)
p.Equation(x.dt()==vx)
p.Equation(y.dt()==vy)
p.Equation(vx.dt()==ax)
p.Equation(vy.dt()==ay)
p.Equation(y>=1.5*x+4)
p.solve(disp=False, debug=True)
print(f'run time: {time.time()-now}')
plt.plot(x_ref, y_ref)
plt.plot(x_bound, y_bound)
plt.plot(x.value, y.value)
plt.show()
This is the result that I get. As you can see, it's not exactly the solution one should expect. For reference to a solution you may expect, here is what I get using the cost function below:
p.Obj((x-x_refg)**2 + (y-y_refg)**2 + ax**2 + ay**2)
However since what I actually wanted is the shortest distance to a path described by these points I expect the distfunc to be closer to what I want since the shortest distance is most likely to some interpolated point. So my question is twofold:
Is this the correct gekko expression/formulation for the objective function?
My other goal is solution speed so is there a more efficient way of expressing this problem for gekko?
You can't define an objective function that changes based on conditions unless you insert logical conditions that are continuously differentiable such as with the if2 or if3 function. Gekko evaluates the symbolic model once and then passes that off to an executable for solution. It only calls the Python model build once because it is compiling the model to efficient byte-code for execution. You can see the model that you created with p.open_folder(). The model file ends in the apm extension: gk_model0.apm.
Model
Constants
i0 = 3
End Constants
Parameters
p1
p2
p3
p4
End Parameters
Variables
v1 = 0
v2 = 16
v3 = 1
v4 = 0
v5 = 0
v6 = 0
End Variables
Equations
v3=$v1
v4=$v2
v5=$v3
v6=$v4
v2>=(((1.5)*(v1))+4)
minimize (((((v1-0.0)-((((((v1-0.0))*((0.2/sqrt(0.04159999999999994))))+(((v2-16.0))&
*((-0.03999999999999915/sqrt(0.04159999999999994))))))*&
((0.2/sqrt(0.04159999999999994))))))^(2))+((((v2-16.0)&
-((((((v1-0.0))*((0.2/sqrt(0.04159999999999994))))+(((v2-16.0))&
*((-0.03999999999999915/sqrt(0.04159999999999994))))))&
*((-0.03999999999999915/sqrt(0.04159999999999994))))))^(2)))
minimize (((((sqrt((((v3)^(2))+((v4)^(2))))-i0))^(2))+((v5)^(2)))+((v6)^(2)))
End Equations
End Model
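For completeness, here is a minimal sketch (my own, not taken from the original answer) of how the if3 function mentioned above can encode a discrete "pick the nearer reference point" decision inside a gekko model; the reference points and values are made up for illustration:
from gekko import GEKKO

m = GEKKO(remote=False)
m.options.SOLVER = 1              # if3 introduces a binary variable, so use the mixed-integer solver (APOPT)
x = m.Var(value=0.5)
d1 = (x - 1.0)**2                 # squared distance to reference point 1
d2 = (x - 3.0)**2                 # squared distance to reference point 2
# if3(cond, a, b) evaluates to a when cond < 0 and to b otherwise,
# so dmin is the squared distance to whichever reference point is closer
dmin = m.if3(d1 - d2, d1, d2)
m.Obj(dmin)
m.solve(disp=False)
print(x.value[0])                 # x should move onto one of the reference points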
One strategy is to split your problem into multiple optimization problems that are all minimal time problems where you navigate to the first way-point and then re-initialize the problem to navigate to the second way-point, and so on. If you want to preserve momentum and anticipate the turning then you'll need to use more advanced methods such as shown in the Pigeon / Eagle tracking problem (see source files) or similar to a trajectory optimization with UAVs or HALE UAVs (see references below).
Martin, R.A., Gates, N., Ning, A., Hedengren, J.D., Dynamic Optimization of High-Altitude Solar Aircraft Trajectories Under Station-Keeping Constraints, Journal of Guidance, Control, and Dynamics, 2018, doi: 10.2514/1.G003737.
Gates, N.S., Moore, K.R., Ning, A., Hedengren, J.D., Combined Trajectory, Propulsion and Battery Mass Optimization for Solar-Regenerative High-Altitude Long Endurance Unmanned Aircraft, AIAA Science and Technology Forum (SciTech), 2019.

Finding center point given distance matrix

I have a matrix (really a loaded image) in which every element is a L2 distance from some unknown center point.
Here is a trivial example
A = [1.4142 1.0000 1.4142 2.2361]
[1.0000 0.0000 1.0000 2.0000]
[1.4142 1.0000 1.4142 2.2361]
In this case, the center is obviously at coordinate (1,1) (index A[1,1] in a 0-indexed matrix or 2D array).
However, in the case where my centers are not constrained to be integer indices, it's no longer as obvious. For example, given this matrix B, where is my center coordinate?
B = [3.0292 1.9612 2.8932 5.8252]
[1.2292 0.1612 1.0932 4.0252]
[1.4292 0.3612 1.2932 4.2252]
How would you find that the answer in this case is at row 1.034 and column 1.4?
I am aware of the trilateration solution (having provided MATLAB code to visualize that in 3D previously), but is there a more efficient way (e.g. one without a matrix inversion)?
This question is sort of language agnostic, as I am looking more for algorithmic help. If you could stick to MATLAB, Python, or C++ though in a solution, that would be great ;-).
While I have no experience with similar tasks, I read some material and also tried something.
When you're unfamiliar with this topic it seems hard to grasp, and the resources I found are a bit chaotic.
Still unclear to me in terms of theory:
is the problem as stated above a convex optimization problem (local minimum = global minimum; that would mean access to powerful solvers!)
there are many more resources about the more general problem (sensor network localization), which is non-convex and for which extremely complex methods have been developed
is your trilateration approach able to exploit > 3 points (trilateration vs. multilateration)? At least this code does not seem able to, which means: bad performance with noise!
Here is some example code with two approaches:
A: Convex optimization: SOCP relaxation
Follows SECOND-ORDER CONE PROGRAMMING RELAXATION OF SENSOR NETWORK LOCALIZATION
Not impressive performance, but should be useful as an approximation for big data
Guaranteed global optimum for this relaxation!
Implemented with cvxpy
B: Nonlinear programming optimization
Implemented using scipy.optimize
Pretty much perfect in my synthetic experiments; even good results in the noisy case, despite the fact that we are using numerical differentiation (automatic differentiation is hard to use here)
One additional remark:
Your example B surely has some (pretty bad) noise or some other problem, in my opinion, as my approaches are completely off, while approach B in particular shines on my synthetic data (at least that's my impression)
Code:
import numpy as np
import cvxpy as cvx
from scipy.spatial.distance import cdist
from scipy.optimize import minimize
np.random.seed(1)
""" Create noise-free (not anymore!) fake-problem """
real_x = np.random.random(size=2) * 3
M, N = 5, 10
NOISE_DISTS = 0.1
pos = np.array([(i,j) for i in range(M) for j in range(N)]) # ugly -> tile/repeat/stack
real_x_stacked = np.vstack([real_x for i in range(pos.shape[0])])
Y = cdist(pos, real_x[np.newaxis])
Y += np.random.normal(size=Y.shape)*NOISE_DISTS # Let's add some noise!
print('-----')
print('PROBLEM')
print('-------')
print('real x: ', real_x)
print('dist mat: ', np.round(Y,3).T)
""" Helper """
def cost(x, Y, pos):
    res = np.linalg.norm(pos - x, ord=2, axis=1) - Y.ravel()
    return np.linalg.norm(res, 2)
print('cost with real_x (check vs. noisy): ', cost(real_x, Y, pos))
""" SOLVER SOCP """
def solve_socp_relax(pos, Y):
    x = cvx.Variable(2)
    y = cvx.Variable(pos.shape[0])
    fake_stack = [x for i in range(pos.shape[0])]                      # hacky
    objective = cvx.sum_entries(cvx.norm(y - Y))
    x_stacked = cvx.reshape(cvx.vstack(*fake_stack), pos.shape[0], 2)  # hacky
    constraints = [cvx.norm(pos - x_stacked, 2, axis=1) <= y]
    problem = cvx.Problem(cvx.Minimize(objective), constraints)
    problem.solve(solver=cvx.ECOS, verbose=False)
    return x.value.T

""" SOLVER NLP """
def solve_nlp(pos, Y):
    sol = minimize(cost, np.zeros(pos.shape[1]), args=(Y, pos), method='BFGS')
    # print(sol)
    return sol.x
""" TEST """
print('-----')
print('SOLVE')
print('-----')
socp_relax_sol = solve_socp_relax(pos, Y)
print('SOCP RELAX SOL: ', socp_relax_sol)
nlp_sol = solve_nlp(pos, Y)
print('NLP SOL: ', nlp_sol)
Output:
-----
PROBLEM
-------
real x: [ 1.25106601 2.16097348]
dist mat: [[ 2.444 1.599 1.348 1.276 2.399 3.026 4.07 4.973 6.118 6.746
2.143 1.149 0.412 0.766 1.839 2.762 3.851 4.904 5.734 6.958
2.377 1.432 0.856 1.056 1.973 2.843 3.885 4.95 5.818 6.84
2.711 2.015 1.689 1.939 2.426 3.358 4.385 5.22 6.076 6.97
3.422 3.153 2.759 2.81 3.326 4.162 4.734 5.627 6.484 7.336]]
cost with real_x (check vs. noisy): 0.665125233772
-----
SOLVE
-----
SOCP RELAX SOL: [[ 1.95749275 2.00607253]]
NLP SOL: [ 1.23560791 2.16756168]
Edit: Further speedup can be achieved (especially at large scale) by using nonlinear least squares instead of the more general NLP approach! My results are still the same (as expected if the problem were convex). Timings between NLP/NLS can look like 9 vs. 0.5 seconds!
This is my recommended method!
from scipy.optimize import least_squares

def solve_nls(pos, Y):
    def res(x, Y, pos):
        return np.linalg.norm(pos - x, ord=2, axis=1) - Y.ravel()
    sol = least_squares(res, np.zeros(pos.shape[1]), args=(Y, pos), method='lm')
    # print(sol)
    return sol.x
The second approach (NLP) in particular will also run for much bigger instances (cvxpy's overhead hurts; that's not a downside of the SOCP solver itself, which should scale much, much better!).
Here some output for M, N = 500, 1000 with some more noise:
-----
PROBLEM
-------
real x: [ 12.51066014 21.6097348 ]
dist mat: [[ 24.706 23.573 23.693 ..., 1090.29 1091.216
1090.817]]
cost with real_x (check vs. noisy): 353.354267797
-----
SOLVE
-----
NLP SOL: [ 12.51082419 21.60911561]
used: 5.9552763315495625 # SECONDS
So in my experiments it works, but I won't give any global-convergence or reconstruction guarantees (some theory is still missing).
At first I thought about using the global optimum of the relaxed SOCP problem as the initial point for the NLP solver, but I did not find any example where this was needed!
Some just-for-fun visuals using:
M, N = 20, 30
NOISE_DISTS = 0.2
...
import matplotlib.pyplot as plt
plt.imshow(Y.reshape(M, N), cmap='viridis', interpolation='none')
plt.colorbar()
plt.scatter(nlp_sol[1], nlp_sol[0], color='red', s=20)
plt.xlim((0, N))
plt.ylim((0, M))
plt.show()
And some super noisy case (nice performance!):
M, N = 50, 100
NOISE_DISTS = 5
-----
PROBLEM
-------
real x: [ 12.51066014 21.6097348 ]
dist mat: [[ 22.329 18.745 27.588 ..., 94.967 80.034 91.206]]
cost with real_x (check vs. noisy): 354.527196716
-----
SOLVE
-----
NLP SOL: [ 12.44158986 21.50164637]
used: 0.01050068340320306
If I understand correctly, you have a matrix A, where A[i,j] holds the distance from (i,j) to some unknown point (y,x). You could find (y,x) like this:
Square each element of A, to make a matrix B say.
We then want to find (y,x) so
(y-i)*(y-i) + (x-j)*(x-j) = B[i,j]
Subtracting each equation from the 0,0 equation and rearranging:
2*i*y + 2*j*x = B[0,0] + i*i + j*j - B[i,j]
This can be solved by linear least squares. Note that since there are 2 unknowns, the matrix inversion (better, factorisation) involved will be on a 2x2 matrix and so is not time consuming. You could indeed, given just the dimensions of A, work out the required matrix and its inverse analytically.
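A minimal sketch of that least-squares solution in Python/numpy (my own illustration of the approach described above, not code from the answer):
import numpy as np

def center_from_distances(A):
    # B[i,j] = (y-i)^2 + (x-j)^2, so subtracting the (0,0) equation gives
    # 2*i*y + 2*j*x = B[0,0] + i^2 + j^2 - B[i,j], a linear system in (y, x)
    B = np.asarray(A, dtype=float) ** 2
    rows, cols = np.indices(B.shape)
    i = rows.ravel()[1:]                      # skip the (0,0) reference equation
    j = cols.ravel()[1:]
    rhs = B[0, 0] + i**2 + j**2 - B.ravel()[1:]
    M = np.column_stack([2*i, 2*j])
    (y, x), *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return y, x

# With the first example matrix A from the question this returns approximately (1.0, 1.0)
A = [[1.4142, 1.0000, 1.4142, 2.2361],
     [1.0000, 0.0000, 1.0000, 2.0000],
     [1.4142, 1.0000, 1.4142, 2.2361]]
print(center_from_distances(A))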

Optimizing a program by vectorized notation

Hi all, I am working on image processing and have written a short piece of code in MATLAB. The code is quite slow.
I am giving my code snippet here
for i=1:10
    % find c1, c2, c3
    % c1, c2 and c3 change at each iteration
    u = (1./((abs(P-c1))^m) + 1./((abs(P-c2))^m) + 1./((abs(P-c3))^m));
    u1 = 1./((abs(P-c1))^m)./u;
    u2 = 1./((abs(P-c2))^m)./u;
    u3 = 1./((abs(P-c3))^m)./u;
end
Let me explain the variables here:
P,u,u1,u2 and u3 are all matrices of size 512x512
c1,c2 and c3 are constants of dimension 1x1
m is a constant with value = 2
I want to repeat these operations in a loop (say 10 times). However, my code is quite slow.
The results of the profiler are given below:
The total running time of the program was 4.6 seconds, and the four steps listed above take about 80% of that time.
So I wanted to make my code run faster.
MY FIRST EDIT
My changed code snippet
for i=1:10
    % find c1 and c2
    % c1 and c2 change at each iteration
    a=((abs(P-c1))^m);
    b=((abs(P-c2))^m);
    c=((abs(P-c3))^m);
    x=1./a; y=1./b; z=1./c;
    u = (x + y + z);
    u1 = x./u;
    u2 = y./u;
    u3 = z./u;
end
Now the program runs in 2.47 seconds; the computation times for the above steps are given below:
So this is much faster than my first method.
2nd edit
for i=1:10
    % find c1, c2, c3
    % c1, c2 and c3 change at each iteration
    a=(P-c1).*(P-c1);
    b=(P-c2).*(P-c2);
    c=(P-c3).*(P-c3);
    x=1./a; y=1./b; z=1./c;
    u = (x + y + z);
    u1 = x./u;
    u2 = y./u;
    u3 = z./u;
end
Now the program runs in 0.808 seconds.
The four steps described above now compute very quickly.
I am sure it can be made even faster. Can you please help me optimize my code further?
It would be extremely helpful for matrices larger than 512x512, such as 1024x1024, 2048x2048 or likewise.
Thanks in advance.
Your current code is:
a=((abs(P-c1))^m);
b=((abs(P-c2))^m);
c=((abs(P-c3))^m);
x=1./a; y=1./b; z=1./c;
u = (x + y + z);
u1 = x./u;
u2 = y./u;
u3 = z./u;
Firstly, realize that the absolute value function is multiplicative. So |AB| = |A|x|B|. Now, abs(P-C1)^m is equivalent to abs( (P-C1)^m ).
Just a preliminary glance at it suggests that some of the computation in the bottleneck can be reused. Specifically, since c1,c2 and c3 are constants, the computation can be sped up a little bit if you try to reuse them (at the expense of additional memory).
temp_P2 = P*P;
temp_PCA = P*ones(size(P));
temp_PCB = ones(size(P))*P;
a = abs(temp_P2 - c1*temp_PCA - c1*temp_PCB + c1^2 * length(P))
The computation of temp_PCA and temp_PCB can also be avoided since multiplication by a constant matrix always amounts to the construction of a rank 1 matrix with either constant rows or columns.
I don't claim that any of these modifications will speed up your code but they are definitely worth trying.
The first suggestion is:
if m = 2 and it is not changing, why don't you try this alternative:
A.*A
and if m = 2, do you really need abs?
The part where you are doing
1./a
is faster than
a.^(-1)
so I don't see any better option for this part.
Another thing you can try: instead of
x=1./a; y=1./b; z=1./c;
u = (x + y + z);
u1 = x./u;
u2 = y./u;
u3 = z./u;
You can have this:
u = (x + y + z);
u1 = 1./(a.*u);
u2 = 1./(b.*u);
u3 = 1./(c.*u);
this way I guess it is a little bit faster by removing 3 variables, but the code becomes less readable.

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like the original K-Means algorithm.
Is the probability function used based on distance or on a Gaussian?
At the same time, is the point most distant from the other centroids picked as a new centroid?
I would appreciate a step-by-step explanation and an example. The one on Wikipedia is not clear enough. A very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.
Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
I've coded the initialization procedure in Python; I don't know if this helps you.
import scipy

def initialize(X, K):
    C = [X[0]]
    for k in range(1, K):
        D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        r = scipy.rand()
        for j,p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C
EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.
Example:
probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
One liner.
Say we need to select 2 cluster centers. Instead of selecting them all randomly (as we do in plain k-means), we select the first one randomly, then find the points that are farthest from the first center (these points most probably do not belong to the first cluster center, as they are far from it) and assign the second cluster center near those far points.
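A compact numpy sketch of that seeding rule (my own illustration; the names are made up), where each new center is drawn with probability proportional to the squared distance to the nearest existing center:
import numpy as np

def kpp_seed(X, k, rng=np.random.default_rng(0)):
    centers = [X[rng.integers(len(X))]]                 # first center: uniform at random
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c)**2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                           # far points get high probability
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# two well-separated blobs: the two seeds should land in different blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(kpp_seed(X, 2))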
I have prepared a full source implementation of k-means++ based on the book "Programming Collective Intelligence" by Toby Segaran and the k-means++ initialization provided here.
Indeed there are two distance functions here. For the initial centroids a standard one based on numpy.inner is used, and then for fixing the centroids the Pearson one is used. Maybe the Pearson one can also be used for the initial centroids. They say it is better.
from __future__ import division
def readfile(filename):
    lines=[line for line in file(filename)]
    rownames=[]
    data=[]
    for line in lines:
        p=line.strip().split(' ') #single space as separator
        #print p
        # First column in each row is the rowname
        rownames.append(p[0])
        # The data for this row is the remainder of the row
        data.append([float(x) for x in p[1:]])
        #print [float(x) for x in p[1:]]
    return rownames,data
from math import sqrt
def pearson(v1,v2):
    # Simple sums
    sum1=sum(v1)
    sum2=sum(v2)
    # Sums of the squares
    sum1Sq=sum([pow(v,2) for v in v1])
    sum2Sq=sum([pow(v,2) for v in v2])
    # Sum of the products
    pSum=sum([v1[i]*v2[i] for i in range(len(v1))])
    # Calculate r (Pearson score)
    num=pSum-(sum1*sum2/len(v1))
    den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
    if den==0: return 0
    return 1.0-num/den
import numpy
from numpy.random import *
def initialize(X, K):
    C = [X[0]]
    for _ in range(1, K):
        #D2 = numpy.array([min([numpy.inner(c-x,c-x) for c in C]) for x in X])
        D2 = numpy.array([min([numpy.inner(numpy.array(c)-numpy.array(x),numpy.array(c)-numpy.array(x)) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        #print "cumprobs=",cumprobs
        r = rand()
        #print "r=",r
        i = -1
        for j,p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

# kcluster: the standard k-means loop from the book, seeded with the
# k-means++ initialization above instead of purely random centroids
def kcluster(rows, distance=pearson, k=4):
    clusters = initialize(rows, k)
    lastmatches = None
    for t in range(100):
        bestmatches = [[] for i in range(k)]
        # Find which centroid is the closest for each row
        for j in range(len(rows)):
            row = rows[j]
            bestmatch = 0
            for i in range(k):
                d = distance(clusters[i], row)
                if d < distance(clusters[bestmatch], row): bestmatch = i
            bestmatches[bestmatch].append(j)
        # If the results are the same as last time, this is complete
        if bestmatches == lastmatches: break
        lastmatches = bestmatches
        # Move the centroids to the average of their members
        for i in range(k):
            avgs = [0.0]*len(rows[0])
            if len(bestmatches[i]) > 0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m] += rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j] /= len(bestmatches[i])
                clusters[i] = avgs
    return bestmatches
rows,data=readfile('/home/toncho/Desktop/data.txt')
kclust = kcluster(data,k=4)
print "Result:"
for c in kclust:
out = ""
for r in c:
out+=rows[r] +' '
print "["+out[:-1]+"]"
print 'done'
data.txt:
p1 1 5 6
p2 9 4 3
p3 2 3 1
p4 4 5 6
p5 7 8 9
p6 4 5 4
p7 2 5 6
p8 3 4 5
p9 6 7 8

Resources