3D Gaussian generator/transform - random

How do I generate 3 Gaussian variables? I know that the Box-Muller algorithm can be used to convert two uniform variables (U1, U2) into two Gaussian variables (X, Y), but how do I generate the third one (Z)?

A simple way:
In this sort of application, it is unlikely that you will need 3 Gaussian variates just once.
You need a store variable that can contain either a triplet of Gaussian variates or nothing (Null, Nothing, Empty, whatever that is in your programming language; you didn't tell us which one).
Initially, the store contains nothing (empty).
When asked for a triplet:
if the store contains a triplet, just return that triplet and mark the store as empty;
if the store is empty, run Box-Muller 3 times. That gives you 6 variates, i.e. 2 triplets: put the second triplet in the store and return the first one.
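Since you didn't tell us your language, here is a minimal sketch of that bookkeeping in Python (the class and method names are my own invention):

import math
import random

class GaussianTripletSource:
    # Yields unit Gaussian variates three at a time, caching the spare triplet.
    def __init__(self):
        self._store = None  # holds a spare (x, y, z) triplet, or None when empty

    def _box_muller(self):
        # One Box-Muller run: two uniform variates -> two independent Gaussians.
        u1, u2 = random.random(), random.random()
        r = math.sqrt(-2.0 * math.log(1.0 - u1))  # 1-u1 is in (0,1], so log() is safe
        return (r * math.cos(2.0 * math.pi * u2),
                r * math.sin(2.0 * math.pi * u2))

    def next_triplet(self):
        if self._store is not None:  # the store contains a triplet: return it...
            triplet, self._store = self._store, None  # ...and mark the store empty
            return triplet
        # Store is empty: 3 Box-Muller runs give 6 variates, i.e. 2 triplets.
        g = [v for _ in range(3) for v in self._box_muller()]
        self._store = tuple(g[3:])  # the second triplet goes into the store
        return tuple(g[:3])         # the first triplet is returned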
An alternative way for the mathematically inclined programmer:
If one just tries to adapt Box-Muller to 3 dimensions, the sole tricky part is to get the norm of the random 3D vector. The rest is about the 2 spherical angles θ (theta) and φ (phi), which is easy stuff.
It turns out that in 3 dimensions, that norm involves the inverse of the incomplete gamma function.
And if you have Python and Numpy/Scipy, this is function scipy.special.gammaincinv.
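Why the inverse incomplete gamma function appears, in two lines: the squared norm of a 3D standard Gaussian vector is chi-squared distributed with 3 degrees of freedom, so P(||X||^2 <= s) = P(3/2, s/2), where P(a, x) denotes the regularized lower incomplete gamma function. Inverting this CDF at a uniform variate u gives ||X||^2 = 2 * P^-1(3/2, u), which is exactly the norm2 = 2.0 * sp.gammaincinv(1.5, u0) line in the code below.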
We can thus write this code:
import math
import numpy.random as rd
import scipy.special as sp

# convert 3 uniform [0,1) variates into 3 unit Gaussian variates:
def boxMuller3d(u3):
    u0, u1, u2 = u3  # 3 uniform random numbers in [0,1)
    norm2 = 2.0 * sp.gammaincinv(1.5, u0)  # inverse of the "regularized" incomplete gamma function
    norm = math.sqrt(norm2)
    zr = (2.0 * u1) - 1.0        # cos(theta), uniform in [-1,1)
    hr = math.sqrt(1.0 - zr*zr)  # sin(theta)
    phi = 2.0 * math.pi * u2
    xr = hr * math.cos(phi)
    yr = hr * math.sin(phi)
    g3 = [c * norm for c in (xr, yr, zr)]
    return g3
# generate 3 uniform variates and convert them into 3 unit Gaussian variates:
def gauss3(rng):
    u3 = rng.uniform(0.0, 1.0, 3)
    g3 = boxMuller3d(u3)
    return g3
To (partly) check correctness, we can use this small main program, which displays the statistical moments of orders 1 to 4 of the resulting random series:
randomSeed = 42
rng = rd.default_rng(randomSeed)
count = 3000000  # (X,Y,Z) triplet count
variates = []
for i in range(count):
    g3 = gauss3(rng)
    variates += g3
ln = len(variates)
print("length=%d\n" % ln)

# Checking statistical moments of orders 1 to 4:
m1 = sum(variates) / ln
m2 = sum(x*x for x in variates) / ln
m3 = sum(x**3 for x in variates) / ln
m4 = sum(x**4 for x in variates) / ln
print("m1=%g m2=%g m3=%g m4=%g\n" % (m1, m2, m3, m4))
Test program output:
length=9000000
m1=-0.000455911 m2=1.00025 m3=-0.000563454 m4=3.00184
We can thus see that these moments are reasonably close to their mathematically expected values: 0, 1, 0 and 3 respectively.
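For a stronger check than the first four moments, one could also (assuming SciPy is at hand) run a Kolmogorov-Smirnov test of the sample against the standard normal CDF:

import scipy.stats as st

# Kolmogorov-Smirnov test against N(0,1); a large p-value means the test
# found no evidence that the sample deviates from a standard normal.
stat, pvalue = st.kstest(variates, 'norm')
print("KS statistic=%g  p-value=%g" % (stat, pvalue))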

Related

Why does finding the eigenvalues of a 4x4 matrix with z3py take so much time and give no solutions?

I'm trying to calculate the eigenvalues of a 4x4 matrix called A in my code (I know that the eigenvalues are real values). All the elements of A are z3 expressions that need to be calculated from the previous constraints. The code below is the last part of a long program that tries to calculate matrix A and then its eigenvalues. The code is written as a whole, but I've split it into two separate parts in order to debug it: part 1, in which the code tries to find the matrix A, and part 2, which is the eigenvalue calculation. In part 1 the code works very fast and calculates A in less than a second, but when I add part 2, it doesn't give me any solutions at all.
I was wondering what the reason could be. Is it because of the order of the polynomial (which is 4), or something else? I would appreciate it if anyone could help me find an alternative way to calculate the eigenvalues or give me some hints on how to rewrite the code so it can solve the problem.
(Note that A2 in the actual code is a matrix whose elements are all z3 expressions defined by previous constraints in the code. Here I've defined the elements as real values just to make the code executable. In this form the code gives a solution very fast, but in the real situation it takes extremely long, like days.
For example, one of the elements of A looks roughly like this:
0 +
1*Vq0__1 +
2 * -Vd0__1 +
0 +
((5.5 * Iq0__1 - 0)/64/5) *
(0 +
0 * (Vq0__1 - 0) +
-521702838063439/62500000000000 * (-Vd0__1 - 0)) +
((.10 * Id0__1 - Etr_q0__1)/64/5) *
(0 +
521702838063439/62500000000000 * (Vq0__1 - 0) +
0.001 * (-Vd0__1 - 0)) +
0 +
0 + 0 +
0 +
((100 * Iq0__1 - 0)/64/5) * 0 +
((20 * Id0__1 - Etr_q0__1)/64/5) * 0 +
0 +
-5/64
All the variables in this example are z3 variables.)
from z3 import *
import numpy as np

def sub(*arg):
    counter = 0
    for matrix in arg:
        if counter == 0:
            counter += 1
            Sub = []
            for i in range(len(matrix)):
                Sub1 = []
                for j in range(len(matrix[0])):
                    Sub1 += [matrix[i][j]]
                Sub += [Sub1]
        else:
            row = len(matrix)
            colmn = len(matrix[0])
            for i in range(row):
                for j in range(colmn):
                    Sub[i][j] = Sub[i][j] - matrix[i][j]
    return Sub

Landa = RealVector('Landa', 2)  # eigenvalues considered as real values
LandaI0 = np.diag([Landa[0] for i in range(4)]).tolist()
ALandaz3 = RealVector('ALandaz3', 4 * 4)

############# Building ( A - \lambda * I ) to find the eigenvalues ############
A2 = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [3, 7, 4, 1],
      [4, 9, 7, 1]]

s = Solver()
for i in range(4):
    for j in range(4):
        s.add(ALandaz3[4 * i + j] == sub(A2, LandaI0)[i][j])

ALanda = [[ALandaz3[0],  ALandaz3[1],  ALandaz3[2],  ALandaz3[3]],
          [ALandaz3[4],  ALandaz3[5],  ALandaz3[6],  ALandaz3[7]],
          [ALandaz3[8],  ALandaz3[9],  ALandaz3[10], ALandaz3[11]],
          [ALandaz3[12], ALandaz3[13], ALandaz3[14], ALandaz3[15]]]

Determinant = (
    ALandaz3[0] * ALandaz3[5] * (ALandaz3[10] * ALandaz3[15] - ALandaz3[14] * ALandaz3[11]) -
    ALandaz3[1] * ALandaz3[4] * (ALandaz3[10] * ALandaz3[15] - ALandaz3[14] * ALandaz3[11]) +
    ALandaz3[2] * ALandaz3[4] * (ALandaz3[9]  * ALandaz3[15] - ALandaz3[13] * ALandaz3[11]) -
    ALandaz3[3] * ALandaz3[4] * (ALandaz3[9]  * ALandaz3[14] - ALandaz3[13] * ALandaz3[10]))

tol = 0.001
s.add(And(Determinant >= -tol, Determinant <= tol))  # allow some tolerance instead of requiring exactly zero
print(s.check())
print(s.model())
Note that you seem to be using Z3 for a type of equation it absolutely isn't meant for. Z3 is a SAT/SMT solver. Such a solver works internally with a huge number of boolean equations. Integers and fractions can be converted to boolean expressions, but with general floats Z3 quickly reaches its limits. See here and here for a lot of typical examples, and note how floats are avoided.
Z3 can work with floats in a limited way, converting them to fractions, but it doesn't work with the approximations and accuracies needed in numerical algorithms. Therefore, the results are usually not what you are hoping for.
Finding eigenvalues is a typical numerical problem, where accuracy issues are very tricky. Python has libraries such as numpy and scipy to deal with those efficiently. See e.g. numpy.linalg.eig.
If, however, your A2 matrix contains some symbolic expressions (and uses fractions instead of floats), sympy's matrix functions could be an interesting alternative.
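For the concrete A2 above, a minimal numerical sketch (assuming nothing beyond NumPy) would be:

import numpy as np

A2 = np.array([[1, 2, 3, 4],
               [5, 6, 7, 8],
               [3, 7, 4, 1],
               [4, 9, 7, 1]], dtype=float)

# eig returns (eigenvalues, eigenvectors); for a general real matrix
# the eigenvalues may come back as complex numbers.
eigenvalues, eigenvectors = np.linalg.eig(A2)
print(eigenvalues)

With symbolic entries and exact fractions, sympy.Matrix(A2).eigenvals() plays the analogous role on the sympy side.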

First use of PyMC fails

I am new to PyMC and would like to know why this code doesn't work. I have already spent hours on this, but I'm missing something. Could anyone help me?
The question I want to address:
I have a set of Npts measurements that show 3 bumps, so I want to model this as the sum of 3 Gaussians (assuming the measurements are noisy and the Gaussian approximation is OK) ==> I want to estimate 8 parameters: the relative weights of the bumps (i.e. 2 params), their 3 means and their 3 variances.
I want this approach to be general enough to be applicable to other sets that may not have the same bumps, so I take loose flat priors.
The problem:
My code below gives me poor estimates. What's wrong?
"""
hypothesis: multimodal distrib sum of 3 gaussian distributions
model description:
* p1, p2, p3 are the probabilities for a point to belong to gaussian 1, 2 or 3
==> p1, p2, p3 are the relative weights of the 3 gaussians
* once a point is associated with a gaussian,
it is distributed normally according to the parameters mu_i, sigma_i of the gaussian
but instead of considering sigma, pymc prefers considering tau=1/sigma**2
* thus, PyMc must guess 8 parameters: p1, p2, mu1, mu2, mu3, tau1, tau2, tau3
* priors on p1, p2 are flat between 0.1 and 0.9 ==> 'pm.Uniform' variables
with the constraint p2<=1-p1. p3 is deterministic ==1-p1-p2
* the 'assignment' variable assigns each point to a gaussian, according to probabilities p1, p2, p3
* priors on mu1, mu2, mu3 are flat between 40 and 120 ==> 'pm.Uniform' variables
* priors on sigma1, sigma2, sigma3 are flat between 4 and 12 ==> 'pm.Uniform' variables
"""
import numpy as np
import pymc as pm

data = np.loadtxt('distrib.txt')
Npts = len(data)
mumin = 40
mumax = 120
sigmamin = 4
sigmamax = 12

p1 = pm.Uniform("p1", 0.1, 0.9)
p2 = pm.Uniform("p2", 0.1, 1 - p1)
p3 = 1 - p1 - p2
assignment = pm.Categorical('assignment', [p1, p2, p3], size=Npts)
mu = pm.Uniform('mu', [mumin, mumin, mumin], [mumax, mumax, mumax])
sigma = pm.Uniform('sigma', [sigmamin, sigmamin, sigmamin],
                   [sigmamax, sigmamax, sigmamax])
tau = 1 / sigma**2

@pm.deterministic
def assign_mu(assi=assignment, mu=mu):
    return mu[assi]

@pm.deterministic
def assign_tau(assi=assignment, sig=tau):
    return sig[assi]

hypothesis = pm.Normal("obs", assign_mu, assign_tau, value=data, observed=True)
model = pm.Model([hypothesis, p1, p2, tau, mu])
test = pm.MCMC(model)
test.sample(50000, burn=20000)  # conservative values, let's take a coffee...

print('\nguess\n* p1, p2 = ',
      np.mean(test.trace('p1')[:]), ' ; ',
      np.mean(test.trace('p2')[:]), ' ==> p3 = ',
      1 - np.mean(test.trace('p1')[:]) - np.mean(test.trace('p2')[:]),
      '\n* mu = ',
      np.mean(test.trace('mu')[:, 0]), ' ; ',
      np.mean(test.trace('mu')[:, 1]), ' ; ',
      np.mean(test.trace('mu')[:, 2]))
print('why does this guess suck ???!!!')
I can send the data file 'distrib.txt'. It is ~500 kB and the data are plotted below (plot not reproduced here). For instance, the last run gave me:
p1, p2 = 0.366913192214 ; 0.583816452532 ==> p3 = 0.04927035525400003
mu = 77.541619286 ; 75.3371615466 ; 77.2427165073
while there are obviously bumps around ~55, ~75 and ~90, with probabilities around ~0.2, ~0.5 and ~0.3.
You have the problem described here: Negative Binomial Mixture in PyMC.
The problem is that the Categorical variable converges too slowly for the three component distributions to get even close.
First, we generate your test data:
data1 = np.random.normal(55,5,2000)
data2 = np.random.normal(75,5,5000)
data3 = np.random.normal(90,5,3000)
data=np.concatenate([data1, data2, data3])
np.savetxt("distrib.txt", data)
Then we plot the histogram, colored by the posterior group assignment:
import matplotlib.pyplot as plt

tablebyassignment = [data[np.nonzero(np.round(test.trace("assignment")[:].mean(axis=0)) == i)]
                     for i in range(0, 3)]
plt.hist(tablebyassignment, bins=30, stacked=True)
This will eventually converge, but not quickly enough to be useful to you.
You can fix this problem by guessing the values of assignment before starting MCMC:
from sklearn.cluster import KMeans
kme = KMeans(3)
kme.fit(np.atleast_2d(data).T)
assignment = pm.Categorical('assignment',[p1,p2,p3],size=Npts, value=kme.labels_)
Which gives you: [stacked histogram of the three recovered components; image not included here]
Using k-means to initialize the categorical may not work all of the time, but it is better than not converging.

Efficient way of computing multivariate gaussian varying the mean - Matlab

Is there an efficient way to do the computation of a multivariate Gaussian (as below) that returns the matrix p, that is, making use of some sort of vectorization? I am aware that the matrix p is symmetric, but for a matrix of size 40000x3, for example, this will still take quite a long time.
Matlab code example:
DataMatrix = [3 1 4; 1 2 3; 1 5 7; 3 4 7; 5 5 1; 2 3 1; 4 4 4];
[rows, cols ] = size(DataMatrix);
I = eye(cols);
p = zeros(rows);
for k = 1:rows
    p(k,:) = mvnpdf(DataMatrix(:,:), DataMatrix(k,:), I);
end
Stage 1: Hack into source code
Iteratively we are performing mvnpdf(DataMatrix(:,:),DataMatrix(k,:),I)
The syntax is : mvnpdf(X,Mu,Sigma).
Thus, the correspondence with our input becomes :
X = DataMatrix(:,:);
Mu = DataMatrix(k,:);
Sigma = I
For the sizes relevant to our situation, the source code mvnpdf.m reduces to -
%// Store size parameters of X
[n,d] = size(X);

%// Get vector mean, and use it to center data
X0 = bsxfun(@minus, X, Mu);

%// Make sure Sigma is a valid covariance matrix
[R,err] = cholcov(Sigma,0);

%// Create array of standardized data, and compute log(sqrt(det(Sigma)))
xRinv = X0 / R;
logSqrtDetSigma = sum(log(diag(R)));

%// Finally get the quadratic form and thus, the final output
quadform = sum(xRinv.^2, 2);
p_out = exp(-0.5*quadform - logSqrtDetSigma - d*log(2*pi)/2)
Now, if the Sigma is always an identity matrix, we would have R as an identity matrix too. Therefore, X0 / R would be same as X0, which is saved as xRinv. So, essentially quadform = sum(X0.^2, 2);
Thus, the original code -
for k = 1:rows
    p(k,:) = mvnpdf(DataMatrix(:,:), DataMatrix(k,:), I);
end
reduces to -
[n,d] = size(DataMatrix);
[R,err] = cholcov(I,0);
p_out = zeros(rows);
K = sum(log(diag(R))) + d*log(2*pi)/2;
for k = 1:rows
    X0 = bsxfun(@minus, DataMatrix, DataMatrix(k,:));
    quadform = sum(X0.^2, 2);
    p_out(k,:) = exp(-0.5*quadform - K);
end
Now, if the input matrix is of size 40000x3, you might want to stop here. But with system resources permitting, you can vectorize everything as discussed next.
Stage 2: Vectorize everything
Now that we see what's actually going on and that the computations look parallelizable, it's time to step up to bsxfun in 3D with its good friend permute for a vectorized solution, like so -
%// Get size params and R
[n,d] = size(DataMatrix);
[R,err] = cholcov(I,0);
%// Calculate constants: "logSqrtDetSigma" and "d*log(2*pi)/2"
K1 = sum(log(diag(R)));
K2 = d*log(2*pi)/2;
%// Major thing happening here as we calculate "X0" for all iterations
%// in one go with permute and bsxfun
diffs = bsxfun(@minus, DataMatrix, permute(DataMatrix, [3 2 1]));

%// "Sigma" is an identity matrix, so it plays no part in "/R" at "xRinv = X0 / R".
%// Perform elementwise squaring and sum rows to get the vectorized "quadform"
quadform1 = squeeze(sum(diffs.^2, 2))

%// Finally use "quadform1" and get vectorized output as a 2D array
p_out = exp(-0.5*quadform1 - K1 - K2)
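For reference, a rough NumPy translation of the same vectorized idea (my own sketch, still assuming the identity covariance used above):

import numpy as np

def mvn_unit_pdf_matrix(data):
    # p[k, j] = N(data[j]; mean=data[k], cov=I), computed for all pairs at once.
    n, d = data.shape
    # pairwise differences via broadcasting: shape (n, n, d)
    diffs = data[None, :, :] - data[:, None, :]
    quadform = np.sum(diffs**2, axis=2)  # squared Euclidean distances
    return np.exp(-0.5 * quadform - d * np.log(2.0 * np.pi) / 2.0)

data = np.array([[3, 1, 4], [1, 2, 3], [1, 5, 7], [3, 4, 7],
                 [5, 5, 1], [2, 3, 1], [4, 4, 4]], dtype=float)
p = mvn_unit_pdf_matrix(data)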

implementing a simple big bang big crunch (BB-BC) in matlab

I want to implement a simple BB-BC in MATLAB, but there is a problem.
Here is the code to generate the initial population:
pop = zeros(N,m);
for j = 1:m
    % formula used to generate a random number between a and b:
    % a + (b-a) .* rand(N,1)
    pop(:,j) = const(j,1) + (const(j,2) - const(j,1)) .* rand(N,1);
end
const is an mx2 matrix which holds the constraints for the control variables, and m is the number of control variables. A random initial population is generated.
Here is the code to compute the center of mass in each iteration:
sum = zeros(1,m);
sum_f = 0;
for i = 1:N
    f = fitness(new_pop(i,:));
    %keyboard
    sum = sum + (1 / f) * new_pop(i,:);
    %keyboard
    sum_f = sum_f + 1/f;
    %keyboard
end
CM = sum / sum_f;
new_pop holds the newly generated population at each iteration and is initialized with pop.
CM is a 1xm matrix.
fitness is a function that returns the fitness value of each particle in the generation; the lower the fitness, the better the particle.
Here is the code to generate the new population in each iteration:
for i = 1:N
    new_pop(i,:) = CM + rand(1) * alpha1 / (n_itr+1) .* (const(:,2)' - const(:,1)');
end
alpha1 is 0.9.
The problem is that when I run the code for 100 iterations, the fitness just decreases and becomes negative. That shouldn't happen at all, because all particles are in the search space and CM should be there too, but it goes way beyond the limits.
For example, if these are the limits (m=4):
const = [1 10;
         1  9;
         0  5;
         1  4];
then running yields this CM:
57.6955 -2.7598 15.3098 20.8473
which is beyond all limits.
I tried limiting CM in my code, but then it just sticks at all the upper boundaries, which in this example gives CM =
10 9 5 4
I am confused: is there something wrong in my implementation, or have I misunderstood something about BB-BC?

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like in the original K-Means algorithm.
Is the probability function used based on distance or Gaussian?
At the same time, the point most distant from the other centroids is picked for a new centroid.
I would appreciate a step-by-step explanation and an example. The one on Wikipedia is not clear enough. Also, very well commented source code would help. If you are using 6 arrays, then please tell us which one is for what.
Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
I've coded the initialization procedure in Python; I don't know if this helps you.
import scipy

def initialize(X, K):
    C = [X[0]]
    for k in range(1, K):
        D2 = scipy.array([min([scipy.inner(c-x, c-x) for c in C]) for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        r = scipy.rand()
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C
EDIT with clarification: The output of cumsum gives us boundaries that partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So, since r is uniformly chosen in [0,1], it falls into exactly one of these intervals (and break stops the loop there). The for loop checks which partition r is in.
Example:
probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
One Liner.
Say we need to select 2 cluster centers. Instead of selecting them all randomly (as in simple k-means), we select the first one randomly, then find the points that are farthest from the first center (these points most probably do not belong to the first cluster center, as they are far from it) and assign the second cluster center near those far points.
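The same idea in compact form (a sketch of mine using numpy.random.choice to sample each next center proportionally to the squared distances):

import numpy as np

def kmeanspp_init(X, k, rng=None):
    # Pick k initial centers from X (an n x d array) with D^2 weighting.
    rng = rng or np.random.default_rng()
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(1, k):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([np.sum((X - c)**2, axis=1) for c in centers], axis=0)
        # sample the next center with probability proportional to d2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)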
I have prepared a full source implementation of k-means++ based on the book "Programming Collective Intelligence" by Toby Segaran and the k-means++ initialization provided here.
Indeed, there are two distance functions here. For the initial centroids a standard one is used, based on numpy.inner, and then for the clustering itself the Pearson one is used. Maybe the Pearson one can also be used for the initial centroids. They say it is better.
from __future__ import division
def readfile(filename):
    lines = [line for line in file(filename)]
    rownames = []
    data = []
    for line in lines:
        p = line.strip().split(' ')  # single space as separator
        # First column in each row is the row name
        rownames.append(p[0])
        # The data for this row is the remainder of the row
        data.append([float(x) for x in p[1:]])
    return rownames, data
from math import sqrt
def pearson(v1, v2):
    # Simple sums
    sum1 = sum(v1)
    sum2 = sum(v2)
    # Sums of the squares
    sum1Sq = sum([pow(v,2) for v in v1])
    sum2Sq = sum([pow(v,2) for v in v2])
    # Sum of the products
    pSum = sum([v1[i]*v2[i] for i in range(len(v1))])
    # Calculate r (Pearson score)
    num = pSum - (sum1*sum2/len(v1))
    den = sqrt((sum1Sq - pow(sum1,2)/len(v1)) * (sum2Sq - pow(sum2,2)/len(v1)))
    if den == 0: return 0
    return 1.0 - num/den
import numpy
from numpy.random import *
def initialize(X, K):
    C = [X[0]]
    for _ in range(1, K):
        D2 = numpy.array([min([numpy.inner(numpy.array(c) - numpy.array(x),
                                           numpy.array(c) - numpy.array(x)) for c in C])
                          for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        r = rand()
        i = -1
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

# k-means loop following Segaran's reference implementation; the middle of
# this function was garbled in the original post and is restored from that
# reference, seeded with the k-means++ centers from initialize().
def kcluster(rows, distance=pearson, k=4):
    # Start with k-means++ centroids instead of random ones
    clusters = initialize(rows, k)
    lastmatches = None
    for t in range(100):
        bestmatches = [[] for i in range(k)]
        # Find which centroid is the closest for each row
        for j in range(len(rows)):
            bestmatch = 0
            for i in range(k):
                d = distance(clusters[i], rows[j])
                if d < distance(clusters[bestmatch], rows[j]): bestmatch = i
            bestmatches[bestmatch].append(j)
        # If the results are the same as last time, this is complete
        if bestmatches == lastmatches: break
        lastmatches = bestmatches
        # Move the centroids to the average of their members
        for i in range(k):
            avgs = [0.0] * len(rows[0])
            if len(bestmatches[i]) > 0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m] += rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j] /= len(bestmatches[i])
                clusters[i] = avgs
    return bestmatches
rows, data = readfile('/home/toncho/Desktop/data.txt')
kclust = kcluster(data, k=4)
print "Result:"
for c in kclust:
    out = ""
    for r in c:
        out += rows[r] + ' '
    print "[" + out[:-1] + "]"
print 'done'
data.txt:
p1 1 5 6
p2 9 4 3
p3 2 3 1
p4 4 5 6
p5 7 8 9
p6 4 5 4
p7 2 5 6
p8 3 4 5
p9 6 7 8
