Converting a Uniform Distribution to a Normal Distribution - algorithm

How can I convert a uniform distribution (as most random number generators produce, e.g. between 0.0 and 1.0) into a normal distribution? What if I want a mean and standard deviation of my choosing?

There are plenty of methods:
Do not use Box-Muller, especially if you draw many Gaussian numbers. Box-Muller yields a result which is clamped between -6 and 6 (assuming double precision; things worsen with floats), and it is really less efficient than other available methods.
Ziggurat is fine, but needs a table lookup (and some platform-specific tweaking due to cache-size issues).
Ratio-of-uniforms is my favorite: only a few additions/multiplications, and a logarithm only about 1/50th of the time (a minimal sketch follows this list).
Inverting the CDF is efficient (and oddly overlooked); fast implementations are available if you search for them. It is mandatory for quasi-random numbers.
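For reference, here is a minimal Python sketch of the ratio-of-uniforms idea (the Kinderman-Monahan construction for N(0,1); the function name and the omission of the usual quick accept/reject bounds are my own simplifications):

import math
import random

def ratio_of_uniforms_normal():
    """Kinderman-Monahan ratio-of-uniforms sampler for N(0, 1)."""
    b = math.sqrt(2.0 / math.e)          # half-width of the bounding box for v
    while True:
        u = random.random()
        if u == 0.0:
            continue                     # avoid log(0)
        v = random.uniform(-b, b)
        x = v / u
        if x * x <= -4.0 * math.log(u):  # accept when (u, v) lies in the acceptance region
            return x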

The Ziggurat algorithm is pretty efficient for this, although the Box-Muller transform is easier to implement from scratch (and not crazy slow).

Transforming uniform random values into values with another distribution involves the inverse of the target distribution function.
In other words, if you aim for a specific probability density p(x), you get the cumulative distribution by integrating it, d(x) = integral(p(x)), and then use its inverse, Inv(d(x)). Draw a value from your uniform random generator and pass it through Inv(d(x)); the results will be distributed according to the density you chose.
This is the generic mathematical approach (inverse transform sampling): with it you can use any probability density or distribution function you like, as long as its CDF has an inverse or a good approximation of one.
Hope this helped, and thanks for the small remark about using the distribution and not the probability itself.
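For instance, here is a minimal Python sketch of that inverse-transform idea using the exponential distribution, whose CDF d(x) = 1 - exp(-lam*x) inverts analytically (the normal CDF has no closed-form inverse, which is why approximations or the other methods above are used for it):

import math
import random

def sample_exponential(lam):
    """Inverse transform sampling: push a uniform variate through the
    inverse CDF of the target distribution, here Exponential(lam)."""
    u = random.random()                 # uniform in [0, 1)
    return -math.log(1.0 - u) / lam     # Inv(d(x)) for d(x) = 1 - exp(-lam*x)

samples = [sample_exponential(2.0) for _ in range(10000)]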

Here is a JavaScript implementation using the polar form of the Box-Muller transformation.
/*
 * Returns a member of the set with a given mean and standard deviation
 * mean: mean
 * standard deviation: std_dev
 */
function createMemberInNormalDistribution(mean, std_dev){
    return mean + (gaussRandom() * std_dev);
}

/*
 * Returns a random number from a normal distribution centered on 0.
 * ~95% of the numbers returned should fall between -2 and 2,
 * i.e. within two standard deviations.
 */
function gaussRandom() {
    var u = 2 * Math.random() - 1;
    var v = 2 * Math.random() - 1;
    var r = u*u + v*v;
    /* if (u, v) is outside the unit circle (or at the origin), start over */
    if (r == 0 || r >= 1) return gaussRandom();

    var c = Math.sqrt(-2 * Math.log(r) / r);
    return u * c;

    /* todo: optimize this algorithm by caching (v*c)
     * and returning it the next time gaussRandom() is called.
     * Left out for simplicity. */
}

Where R1, R2 are random uniform numbers:
NORMAL DISTRIBUTION, with SD of 1:
sqrt(-2*log(R1))*cos(2*pi*R2)
This is exact... no need to do all those slow loops!
Reference: dspguide.com/ch2/6.htm

Use the central limit theorem (see the Wikipedia and MathWorld entries) to your advantage.
Generate n uniformly distributed numbers, sum them, and subtract n*0.5: you have the output of an approximately normal distribution with mean 0 and variance n/12 (see the Wikipedia article on the uniform distribution for that last one).
n = 10 already gives you something half decent, fast. If you want something more than half decent, go for Tyler's solution (as noted in the Wikipedia entry on normal distributions).
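A quick Python sketch of this (summing n uniforms, subtracting n/2, and also dividing by sqrt(n/12) so the result is approximately N(0, 1); that last rescaling step is my addition). With n = 12 the divisor is exactly 1, which gives the classic "sum twelve uniforms and subtract six" trick:

import math
import random

def clt_normal(n=12):
    """Approximate N(0, 1) via the central limit theorem."""
    s = sum(random.random() for _ in range(n))    # mean n/2, variance n/12
    return (s - n / 2.0) / math.sqrt(n / 12.0)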

I would use Box-Muller. Two things about it:
You end up with two values per iteration. Typically, you cache one value and return the other; on the next call for a sample, you return the cached value.
Box-Muller gives a Z-score. You then have to scale the Z-score by the standard deviation and add the mean to get the full value in the normal distribution.
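A minimal Python sketch of that bookkeeping (polar Box-Muller with one z-score cached between calls, then scaled by the requested mean and standard deviation; the class and its names are mine):

import math
import random

class BoxMullerGaussian:
    """Polar Box-Muller: each iteration yields two z-scores; one is cached."""
    def __init__(self):
        self._spare = None

    def sample(self, mean=0.0, std_dev=1.0):
        if self._spare is not None:
            z, self._spare = self._spare, None
        else:
            while True:
                u = 2.0 * random.random() - 1.0
                v = 2.0 * random.random() - 1.0
                s = u * u + v * v
                if 0.0 < s < 1.0:
                    break
            c = math.sqrt(-2.0 * math.log(s) / s)
            z = u * c
            self._spare = v * c          # second z-score, returned on the next call
        return mean + std_dev * z        # scale the z-score as described above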

It seems incredible that I could add something to this after eight years, but for the case of Java I would like to point readers to the Random.nextGaussian() method, which generates a Gaussian distribution with mean 0.0 and standard deviation 1.0 for you.
A simple addition and/or multiplication will change the mean and standard deviation to your needs.

The standard Python library module random has what you want:
normalvariate(mu, sigma)
Normal distribution. mu is the mean, and sigma is the standard deviation.
For the algorithm itself, take a look at the function in random.py in the Python library.
The manual entry is here
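For example (both functions are in the standard library; random.gauss is documented as slightly faster than normalvariate but not thread-safe):

import random

x = random.normalvariate(10.0, 2.0)   # mean 10, standard deviation 2
y = random.gauss(10.0, 2.0)           # same distribution, Box-Muller based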

This is a Matlab implementation using the polar form of the Box-Muller transformation:
Function randn_box_muller.m:
function [values] = randn_box_muller(n, mean, std_dev)
    if nargin == 1
        mean = 0;
        std_dev = 1;
    end
    r = gaussRandomN(n);
    values = r.*std_dev + mean;   % scale by std_dev, then shift by the mean
end

function [values] = gaussRandomN(n)
    [u, v, r] = gaussRandomNValid(n);
    c = sqrt(-2*log(r)./r);
    values = u.*c;
end

function [u, v, r] = gaussRandomNValid(n)
    r = zeros(n, 1);
    u = zeros(n, 1);
    v = zeros(n, 1);
    filter = r==0 | r>=1;
    % redraw any pair that falls outside the unit circle (or at the origin)
    while n ~= 0
        u(filter) = 2*rand(n, 1)-1;
        v(filter) = 2*rand(n, 1)-1;
        r(filter) = u(filter).*u(filter) + v(filter).*v(filter);
        filter = r==0 | r>=1;
        n = size(r(filter),1);
    end
end
And invoking histfit(randn_box_muller(10000000),100); this is the result:
Obviously it is really inefficient compared with the Matlab built-in randn.

This is my JavaScript implementation of Algorithm P (Polar method for normal deviates) from Section 3.4.1 of Donald Knuth's book The Art of Computer Programming:
function normal_random(mean, stddev)
{
    var V1
    var V2
    var S
    do {
        var U1 = Math.random()   // uniformly distributed in [0, 1)
        var U2 = Math.random()
        V1 = 2*U1 - 1
        V2 = 2*U2 - 1
        S = V1*V1 + V2*V2
    } while (S >= 1)
    if (S === 0) return 0
    return mean + stddev*(V1*Math.sqrt(-2*Math.log(S)/S))
}

I think you should try this in Excel: =NORMINV(RAND();0;1). This will produce random numbers which are normally distributed with zero mean and unit variance. The "0" can be replaced with any value to get the mean you want, and by changing the "1" (the standard deviation) you will get a variance equal to the square of your input.
For example: =NORMINV(RAND();50;3) will yield normally distributed numbers with MEAN = 50 and VARIANCE = 9.

Q: How can I convert a uniform distribution (as most random number generators produce, e.g. between 0.0 and 1.0) into a normal distribution?
For a software implementation I know a couple of random generator names which give you a pseudo-uniform random sequence in [0,1] (Mersenne Twister, Linear Congruential Generator). Let's call it U(x).
There is a mathematical area called probability theory.
First thing: if you want to model a random variable with cumulative distribution F, you can just try to evaluate F^-1(U(x)). In probability theory it is proved that such a random variable will have the cumulative distribution F.
This is applicable for generating a random variable distributed as F without any numerical methods whenever F^-1 can be derived analytically without problems (e.g. the exponential distribution).
To model the normal distribution you can calculate y2*cos(y1), where y1 is uniform in [0, 2*pi] and y2 has a Rayleigh distribution.
Q: What if I want a mean and standard deviation of my choosing?
You can calculate sigma*N(0,1) + m.
It can be shown that such shifting and scaling leads to N(m, sigma).

I have the following R code, which may help; it is a rejection-style sampler that draws from an exponential proposal and keeps the points that fall under the normal density:
set.seed(123)
n <- 1000
u <- runif(n)                                        # uniform draws U
x <- -log(u)                                         # exponential proposal
y <- runif(n, max = u*sqrt((2*exp(1))/pi))           # uniform height under the envelope
z <- ifelse(y < dnorm(x)/2, -x, NA)                  # accept with a negative sign
z <- ifelse((y > dnorm(x)/2) & (y < dnorm(x)), x, z) # accept with a positive sign
z <- z[!is.na(z)]                                    # drop the rejected draws
It is also easier to use the built-in function rnorm(), since it is faster than writing your own random number generator for the normal distribution. See the following code as proof:
n <- length(z)
t0 <- Sys.time()
z <- rnorm(n)
t1 <- Sys.time()
t1-t0

function distRandom(){
    do{
        x = random(DISTRIBUTION_DOMAIN);
    } while(random(DISTRIBUTION_RANGE) >= distributionFunction(x));
    return x;
}
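As a concrete (if inefficient) Python sketch of that accept/reject idea, specialized to a normal density on a truncated domain (the domain width and the use of the density's peak as the vertical range are my own assumptions):

import math
import random

def normal_pdf(x, mean=0.0, std_dev=1.0):
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2.0 * math.pi))

def dist_random(mean=0.0, std_dev=1.0, half_width=6.0):
    """Accept/reject: throw darts under the density curve on a truncated domain."""
    lo, hi = mean - half_width * std_dev, mean + half_width * std_dev
    peak = normal_pdf(mean, mean, std_dev)           # maximum of the density
    while True:
        x = random.uniform(lo, hi)
        if random.uniform(0.0, peak) < normal_pdf(x, mean, std_dev):
            return x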

Related

random number generator with x,y coordinates as seed

I'm looking for an efficient, uniformly distributed PRNG that generates one random integer for any whole-number point in the plane, with coordinates x and y as input to the function.
int rand(int x, int y)
It has to deliver the same random number each time you input the same coordinate.
Do you know of algorithms, that can be used for this kind of problem and also in higher dimensions?
I already tried to use normal PRNGs like a LFSR and merged the x,y coordinates together to use it as a seed value. Something like this.
int seed = x << 16 | (y & 0xFFFF)
The obvious problem with this method is that the seed is not iterated over multiple times but is initialized again for every x,y-point. This results in very ugly non random patterns if you visualize the results.
I already know of the method which uses shuffled permutation tables of some size like 256 and you get a random integer out of it like this.
int r = P[x + P[y & 255] & 255];
But I don't want to use this method because of the very limited range, restricted period length and high memory consumption.
Thanks for any helpful suggestions!
I found a very simple, fast and sufficient hash function based on the xxhash algorithm.
// cash stands for chaos hash :D
// 'seed' is an externally supplied value (e.g. a global) that selects the hash
int cash(int x, int y){
    int h = seed + x*374761393 + y*668265263; // all constants are prime
    h = (h ^ (h >> 13)) * 1274126177;
    return h ^ (h >> 16);
}
It is now much faster than the lookup table method I described above and it looks equally random. I don't know if the random properties are good compared to xxhash but as long as it looks random to the eye it's a fair solution for my purpose.
This is what it looks like with the pixel coordinates as input:
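For experimentation, here is a rough Python port of that function; I mask to 32 bits to mimic C integer wrap-around and treat the state as unsigned, and the seed becomes a parameter, so it is only an approximation of the C version:

def cash(x, y, seed=0):
    """32-bit coordinate hash ('cash'), ported from the C snippet above."""
    mask = 0xFFFFFFFF
    h = (seed + x * 374761393 + y * 668265263) & mask   # all constants are prime
    h = ((h ^ (h >> 13)) * 1274126177) & mask
    return (h ^ (h >> 16)) & mask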
My approach
In general I think you want some hash function (most of these are designed to produce random-looking output: the avalanche effect for RNGs, explicitly required randomness for cryptographic PRNGs). Compare with this thread.
The following code uses this approach:
1) build something hashable from your input
2) hash -> random bytes (non-cryptographically)
3) somehow convert these random bytes to your integer range (hard to do correctly/uniformly!)
The last step is done by this approach, which seems to be not that fast, but has strong theoretical guarantees (the selected answer was used).
The hash function I used supports seeds, which will be used in step 3!
import xxhash
import math
import numpy as np
import matplotlib.pyplot as plt
import time
def rng(a, b, maxExclN=100):
    # preprocessing
    bytes_needed = int(math.ceil(maxExclN / 256.0))
    smallest_power_larger = 2
    while smallest_power_larger < maxExclN:
        smallest_power_larger *= 2
    counter = 0
    while True:
        random_hash = xxhash.xxh32(str((a, b)).encode('utf-8'), seed=counter).digest()
        random_integer = int.from_bytes(random_hash[:bytes_needed], byteorder='little')
        if random_integer < 0:
            counter += 1
            continue  # inefficient but safe; could be improved
        random_integer = random_integer % smallest_power_larger
        if random_integer < maxExclN:
            return random_integer
        else:
            counter += 1
test_a = rng(3, 6)
test_b = rng(3, 9)
test_c = rng(3, 6)
print(test_a, test_b, test_c) # OUTPUT: 90 22 90
random_as = np.random.randint(100, size=1000000)
random_bs = np.random.randint(100, size=1000000)
start = time.time()
rands = [rng(*x) for x in zip(random_as, random_bs)]
end = time.time()
plt.hist(rands, bins=100)
plt.show()
print('needed secs: ', end-start)
# OUTPUT: needed secs: 15.056888341903687 -> about 0.015 ms (15 microseconds) per sample
# -> possibly heavy-dependence on range of output
Possible improvements
Add additional entropy from some source (urandom; could be put into str)
Make a class and initialize to memorize preprocessing (costly if done for each sampling)
Handle negative integers; maybe just use abs(x)
Assumptions:
the output-range is [0, N) -> just shift for others!
the output-range is smaller (bits) than the hash-output (may use xxh64)
Evaluation:
Check randomness/uniformity
Check if deterministic regarding input
You can use various randomness extractors to achieve your goals. There are at least two sources you can look for a solution.
Dodis et al., "Randomness Extraction and Key Derivation Using the CBC, Cascade and HMAC Modes"
NIST SP 800-90, "Recommendation for the Entropy Sources Used for Random Bit Generation"
All in all, you can preferably use:
AES-CBC-MAC using a random key (may be fixed and reused)
HMAC, preferably with SHA2-512
SHA-family hash functions (SHA1, SHA256 etc); using a random final block (eg use a big random salt at the end)
Thus, you can concatenate your coordinates, get their bytes, add a random key (for AES and HMAC) or a salt for SHA and your output has an adequate entropy.
According to NIST, the output entropy relies on the input entropy:
Assuming you use SHA1, n = 160 bits. Let's suppose that m = input_entropy (your coordinates' entropy):
if m >= 2n, then output_entropy = n = 160 bits
if n < m < 2n, then maximum output_entropy = m (but full entropy is not guaranteed)
if m < n, then maximum output_entropy = m (this is your case)
see NIST SP 800-90C (page 11)

finding the best scale/shift between two vectors

I have two vectors: one represents a function f(x), and the other represents f(ax+b), i.e. a scaled and shifted version of f(x). I would like to find the best scale and shift factors.
*best - by means of least squares error , maximum likelihood, etc.
any ideas?
for example:
f1 = [0;0.450541598502498;0.0838213779969326;0.228976968716819;0.91333736150167;0.152378018969223;0.825816977489547;0.538342435260057;0.996134716626885;0.0781755287531837;0.442678269775446;0];
f2 = [-0.029171964726699;-0.0278570165494982;0.0331454732535324;0.187656956432487;0.358856370923984;0.449974662483267;0.391341738643094;0.244800719791534;0.111797007617227;0.0721767235173722;0.0854437239807415;0.143888234591602;0.251750993723227;0.478953530572365;0.748209818420035;0.908044924557262;0.811960826711455;0.512568916956487;0.22669198638799;0.168136111568694;0.365578085161896;0.644996661336714;0.823562159983554;0.792812945867018;0.656803251999341;0.545799498053254;0.587013303815021;0.777464637372241;0.962722388208354;0.980537136457874;0.734416947254272;0.375435649393553;0.106489547770962;0.0892376361668696;0.242467741982851;0.40610516900965;0.427497319032133;0.301874099075184;0.128396341665384;0.00246347624097456;-0.0322120242872125]
*note that f(x) may be irreversible...
Thanks,
Ohad
For each f(x), take the absolute value of f(x) and normalize it such that it can be considered a probability mass function over its support. Calculate the expected value E[x] and the variance Var[x]. Then, we have that
E[a x + b] = a E[x] + b
Var[a x + b] = a^2 Var[x]
Use the above equations and the known values of E[x] and Var[x] to calculate a and b. Taking your values of f1 and f2 from your example, the following Octave script performs this procedure:
% Octave script
% f1, f2 are defined as given in your example
f1 = [zeros(length(f2) - length(f1), 1); f1];
save_f1 = f1; save_f2 = f2;
f1 = abs( f1 ); f2 = abs( f2 );
f1 = f1 ./ sum( f1 ); f2 = f2 ./ sum( f2 );
mean = @(x)sum(((1:length(x))' .* x));
var = @(x)sum((((1:length(x))'-mean(x)).^2) .* x);
m1 = mean(f1); m2 = mean(f2);
v1 = var(f1); v2 = var(f2)
a = sqrt( v2 / v1 ); b = m2 - a * m1;
plot( a .* (1:length( save_f1 )) + b, save_f1, ...
1:length( save_f2 ), save_f2 );
axis([0 length( save_f1 )]);
And the output is
Here's a simple, effective, but perhaps somewhat naive approach.
First make sure you build a generic interpolator through both functions. That way you can evaluate both functions in between the given data points. I used a cubic-spline interpolator, since that seems general enough for the type of smooth functions you provided (and does not require additional toolboxes).
Then you evaluate the source function ("original") at a large number of points. Use this number also as a parameter in an inline function that takes as input X, where
X = [a b]
(as in ax+b). For any input X, this inline function computes the function values of the target function at the same x-locations, but scaled and offset by a and b, respectively, and returns the sum of the squared differences between the resulting function values and the ones of the source function computed earlier.
Use this inline function in fminsearch with some initial estimate (one that you have obtained visually or via automatic means). For the example you provided, I used a few random ones, which all converged to near-optimal fits.
All of the above in code:
function s = findScaleOffset
%% initialize
f2 = [0;0.450541598502498;0.0838213779969326;0.228976968716819;0.91333736150167;0.152378018969223;0.825816977489547;0.538342435260057;0.996134716626885;0.0781755287531837;0.442678269775446;0];
f1 = [-0.029171964726699;-0.0278570165494982;0.0331454732535324;0.187656956432487;0.358856370923984;0.449974662483267;0.391341738643094;0.244800719791534;0.111797007617227;0.0721767235173722;0.0854437239807415;0.143888234591602;0.251750993723227;0.478953530572365;0.748209818420035;0.908044924557262;0.811960826711455;0.512568916956487;0.22669198638799;0.168136111568694;0.365578085161896;0.644996661336714;0.823562159983554;0.792812945867018;0.656803251999341;0.545799498053254;0.587013303815021;0.777464637372241;0.962722388208354;0.980537136457874;0.734416947254272;0.375435649393553;0.106489547770962;0.0892376361668696;0.242467741982851;0.40610516900965;0.427497319032133;0.301874099075184;0.128396341665384;0.00246347624097456;-0.0322120242872125];
figure(1), clf, hold on
h(1) = subplot(2,1,1); hold on
plot(f1);
legend('Original')
h(2) = subplot(2,1,2); hold on
plot(f2);
linkaxes(h)
axis([0 max(length(f1),length(f2)), min(min(f1),min(f2)),max(max(f1),max(f2))])
%% make cubic interpolators and test points
pp1 = spline(1:numel(f1), f1);
pp2 = spline(1:numel(f2), f2);
maxX = max(numel(f1), numel(f2));
N = 100 * maxX;
x2 = linspace(1, maxX, N);
y1 = ppval(pp1, x2);
%% search for parameters
s = fminsearch(@(X) sum( (y1 - ppval(pp2,X(1)*x2+X(2))).^2 ), [0 0])
%% plot results
y2 = ppval( pp2, s(1)*x2+s(2));
figure(1), hold on
subplot(2,1,2), hold on
plot(x2,y2, 'r')
legend('before', 'after')
end
Results:
s =
2.886234493867320e-001 3.734482822175923e-001
Note that this computes the opposite transformation from the one you generated the data with. Reversing the numbers:
>> 1/s(1)
ans =
3.464721948700991e+000 % seems pretty decent
>> -s(2)
ans =
-3.734482822175923e-001 % hmmm...rather different from 7/11!
(I'm not sure about the 7/11 value you provided; using the exact values you gave to make a plot results in a less accurate approximation to the source function...Are you sure about the 7/11?)
Accuracy can be improved by either
using a different optimizer (fmincon, fminunc, etc.)
demanding a higher accuracy from fminsearch through optimset
having more sample points in both f1 and f2 to improve the quality of the interpolations
Using a better initial estimate
Anyway, this approach is pretty general and gives nice results. It also requires no toolboxes.
It has one major drawback though: the solution found may not be the global optimum; the quality of the outcome of this method can be quite sensitive to the initial estimate you provide. So, always make a (difference) plot to make sure the final solution is accurate, or, if you have a large number of such fits to do, compute some sort of quality factor upon which you decide to re-start the optimization with a different initial estimate.
It is of course very possible to use the results of the Fourier+Mellin transforms (as suggested by chaohuang below) as an initial estimate to this method. That might be overkill for the simple example you provide, but I can easily imagine situations where this could indeed be very useful.
For the scale factor a, you can estimate it by computing the ratio of the amplitude spectra of the two signals since the Fourier transform is invariant to shift.
Similarly, you can estimate the shift factor b by using the Mellin transform, which is scale invariant.
Here's a super simple approach to estimate the scale a that works on your example data:
a = length(f2) / length(f1)
This gives 3.4167 which is close to your stated value of 3.4. If that estimate is good enough, you can use correlation to estimate the shift.
I realize that this is not exactly what you asked, but it may be an acceptable alternative depending on the data.
Both Rody Oldenhuis's and jstarr's answers are correct. I'm adding my own answer just to sum things up and connect them.
I've tweaked Rody's code a little bit and ended up with the following:
function findScaleShift
    load f1f2
    x0 = [length(f1)/length(f2) 0]; % initial guess, can do better
    n = length(f1);
    costFunc = @(z) sum((eval_f1(z,f2,n)-f1).^2);
    opt.TolFun = eps;
    xopt = fminsearch(costFunc,x0,opt);
    f1r = eval_f1(xopt,f2,n);
    subplot(211);
    plot(1:n,f1,1:n,f1r,'--','linewidth',5)
    title(xopt);
    subplot(212);
    plot(1:n,(f1-f1r).^2);
    title('squared error')
end

function y = eval_f1(x,f2,n)
    t = maketform('affine',[x(1) 0 x(2); 0 1 0 ; 0 0 1]');
    y = imtransform(f2',t,'cubic','xdata',[1 n ],'ydata',[1 1])';
end
This gives zero results:
This method is accurate but exhaustive and may take some time. Another disadvantage is that it finds only a local minimum, and may give false results if the initial guess (x0) is far off.
On the other hand, jstarr's method gave the following results:
xopt = [ 3.49655562549115 -0.676062367063033]
which is a 10% deviation from the correct answer. It is a pretty fast solution, but not as accurate as I wanted; still, it should be noted.
I think that in order to get the best results, jstarr's method should be used as an initial guess for the method proposed by Rody, giving an accurate solution.
Ohad

example algorithm for generating random value in dataset with normal distribution?

I'm trying to generate some random numbers with a simple non-uniform probability to mimic lifelike data for testing purposes. I'm looking for a function that accepts mu and sigma as parameters and returns x, where the probability of x being within certain ranges follows a standard bell curve, or thereabouts. It needn't be super precise or even efficient. The resulting dataset needn't match the exact mu and sigma that I set. I'm just looking for a relatively simple non-uniform random number generator. Limiting the set of possible return values to ints would be fine. I've seen many suggestions out there, but none that seem to fit this simple case.
Box-Muller transform in a nutshell:
First, get two independent, uniform random numbers from the interval (0, 1], call them U and V.
Then you can get two independent, unit-normal distributed random numbers from the formulae
X = sqrt(-2 * log(U)) * cos(2 * pi * V);
Y = sqrt(-2 * log(U)) * sin(2 * pi * V);
This gives you iid random numbers for mu = 0, sigma = 1; to set sigma = s, multiply your random numbers by s; to set mu = m, add m to your random numbers.
My first thought is why can't you use an existing library? I'm sure that most languages already have a library for generating Normal random numbers.
If for some reason you can't use an existing library, then the method outlined by @ellisbben is fairly simple to program. An even simpler (approximate) algorithm is just to sum 12 uniform numbers:
X = -6  ## start at minus the mean of the sum of 12 uniforms
for i in 1 to 12:
    X += U  ## U is a fresh uniform random number in [0, 1) on each iteration
The value of X is approximately normal. The following figure shows 10^5 draws from this algorithm compared to the Normal distribution.

Generate random numbers according to distributions

I want to generate random numbers according some distributions. How can I do this?
The standard random number generator you've got (rand() in C after a simple transformation, equivalents in many languages) is a fairly good approximation to a uniform distribution over the range [0,1]. If that's what you need, you're done. It's also trivial to convert that to a random number generated over a somewhat larger integer range.
Conversion of a Uniform distribution to a Normal distribution has already been covered on SO, as has going to the Exponential distribution.
[EDIT]: For the triangular distribution, converting a uniform variable is relatively simple (in something C-like):
double triangular(double a, double b, double c) {
    double U = rand() / (double) RAND_MAX;
    double F = (c - a) / (b - a);
    if (U <= F)
        return a + sqrt(U * (b - a) * (c - a));
    else
        return b - sqrt((1 - U) * (b - a) * (b - c));
}
That's just converting the formula given on the Wikipedia page. If you want others, that's the place to start looking; in general, you use the uniform variable to pick a point on the vertical axis of the cumulative distribution function (CDF) of the distribution you want (assuming it's continuous), and invert the CDF to get the random value with the desired distribution.
The right way to do this is to decompose the distribution into n-1 binary distributions. That is if you have a distribution like this:
A: 0.05
B: 0.10
C: 0.10
D: 0.20
E: 0.55
You transform it into 4 binary distributions:
1. A/E: 0.20/0.80
2. B/E: 0.40/0.60
3. C/E: 0.40/0.60
4. D/E: 0.80/0.20
Select uniformly from the n-1 distributions, and then select the first or second symbol based on the probability of each in the chosen binary distribution.
Code for this is here
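Since the linked code is not reproduced here, this is a hedged Python sketch of the decomposition described above, hard-coded to the example table (a general construction along these lines is essentially the alias method):

import random

# The four binary distributions from the example: (symbol, P(symbol), alias)
binary_dists = [
    ('A', 0.20, 'E'),
    ('B', 0.40, 'E'),
    ('C', 0.40, 'E'),
    ('D', 0.80, 'E'),
]

def sample():
    symbol, p, alias = random.choice(binary_dists)  # pick one of the n-1 tables uniformly
    return symbol if random.random() < p else alias

# e.g. P(A) = 1/4 * 0.20 = 0.05, P(E) = 1/4 * (0.80 + 0.60 + 0.60 + 0.20) = 0.55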
It actually depends on the distribution. The most general way is the following. Let P(X) be the probability that a random number generated according to your distribution is less than X.
You start by generating a uniform random X between zero and one. After that you find Y such that P(Y) = X and output Y. You can find such a Y using binary search (since P(X) is an increasing function of X).
This is not very efficient, but it works for distributions where P(X) can be efficiently computed.
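A small Python sketch of that idea for a discrete distribution, doing the search with bisection over the cumulative probabilities (for a continuous P(X) you would bisect on x itself):

import bisect
import itertools
import random

values  = ['A', 'B', 'C', 'D', 'E']
weights = [0.05, 0.10, 0.10, 0.20, 0.55]
cumulative = list(itertools.accumulate(weights))   # running totals of P

def sample():
    x = random.random()                            # uniform in [0, 1)
    i = bisect.bisect_left(cumulative, x)          # first index whose cumulative P >= x
    return values[min(i, len(values) - 1)]         # guard against float round-off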
You can look up inverse transform sampling and rejection sampling, as well as the book by Devroye, "Non-Uniform Random Variate Generation", Springer-Verlag, 1986.
You can convert from discrete bins to float/double with interpolation. Simple linear works well. If your table memory is constrained other interpolation methods can be used. -jlp
It's a standard textbook matter. See here for some code, or here at Section 3.2 for some reference mathematical background (actually very quick and simple to read).

How to calculate the sum of two normal distributions

I have a value type that represents a gaussian distribution:
struct Gauss {
    double mean;
    double variance;
}
I would like to perform an integral over a series of these values:
Gauss eulerIntegrate(double dt, Gauss iv, Gauss[] values) {
    Gauss r = iv;
    foreach (Gauss v in values) {
        r += v*dt;
    }
    return r;
}
My question is how to implement addition for these normal distributions.
The multiplication by a scalar (dt) seemed simple enough. But it wasn't simple! Thanks FOOSHNICK for the help:
public static Gauss operator * (Gauss g, double d) {
    return new Gauss(g.mean * d, g.variance * d * d);
}
However, addition eludes me. I assume I can just add the means; it's the variance that's causing me trouble. Either of these definitions seems "logical" to me.
public static Gauss operator + (Gauss a, Gauss b) {
    double mean = a.mean + b.mean;
    // Is it this? (Yes, it is!)
    return new Gauss(mean, a.variance + b.variance);
    // Or this? (nope)
    //return new Gauss(mean, Math.Max(a.variance, b.variance));
    // Or how about this? (nope)
    //return new Gauss(mean, (a.variance + b.variance)/2);
}
Can anyone help define a statistically correct - or at least "reasonable" - version of the + operator?
I suppose I could switch the code to use interval arithmetic instead, but I was hoping to stay in the world of prob and stats.
The sum of two normal distributions is itself a normal distribution:
N(mean1, variance1) + N(mean2, variance2) ~ N(mean1 + mean2, variance1 + variance2)
This is all on wikipedia page.
Be careful that these really are variances and not standard deviations.
// X + Y
public static Gauss operator + (Gauss a, Gauss b) {
    // NOTE: this is valid if X, Y are independent normal random variables
    return new Gauss(a.mean + b.mean, a.variance + b.variance);
}
// X*b
public static Gauss operator * (Gauss a, double b) {
    return new Gauss(a.mean*b, a.variance*b*b);
}
To be more precise:
If a random variable Z is defined as the linear combination of two uncorrelated Gaussian random variables X and Y, then Z is itself a Gaussian random variable, e.g.:
if Z = aX + bY,
then mean(Z) = a * mean(X) + b * mean(Y), and variance(Z) = a^2 * variance(X) + b^2 * variance(Y).
If the random variables are correlated, then you have to account for that. Variance(X) is defined by the expected value E([X - mean(X)]^2). Working this through for Z = aX + bY, we get:
variance(Z) = a^2 * variance(X) + b^2 * variance(Y) + 2ab * covariance(X,Y)
If you are summing two uncorrelated random variables which do not have Gaussian distributions, then the distribution of the sum is the convolution of the two component distributions.
If you are summing two correlated non-Gaussian random variables, you have to work through the appropriate integrals yourself.
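A quick numerical sanity check of that variance formula (a NumPy sketch of my own; the constants a, b and the covariance matrix are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, -3.0

# Two correlated Gaussian variables X and Y
cov = np.array([[1.0, 0.4],
                [0.4, 2.0]])
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

z = a * x + b * y
predicted = a**2 * cov[0, 0] + b**2 * cov[1, 1] + 2 * a * b * cov[0, 1]
print(z.var(), predicted)   # the two values should agree closely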
Well, your multiplication by scalar is wrong - you should multiply variance by the square of d. If you're adding a constant, then just add it to the mean, the variance stays the same. If you're adding two distributions, then add the means and add the variances.
Can anyone help define a statistically correct - or at least "reasonable" - version of the + operator?
Arguably not, as adding two distributions means different things. Having worked in reliability and maintainability, my first reaction from the title would be the distribution of a system's MTBF, if the MTBF of each part is normally distributed and the system has no redundancy. You are talking about the distribution of the sum of two normally distributed independent variates, not the (logical) sum of two normal distributions' effect. Very often, operator overloading has surprising semantics. I'd leave it as a function and call it 'normalSumDistribution' unless your code has a very specific target audience.
Hah, I thought you couldn't add Gaussian distributions together, but you can!
http://mathworld.wolfram.com/NormalSumDistribution.html
In fact, the mean of the sum is the sum of the individual means, and the variance of the sum is the sum of the individual variances.
I'm not sure that I like what you're calling "integration" over a series of values. Do you mean that word in a calculus sense? Are you trying to do numerical integration? There are other, better ways to do that. Yours doesn't look right to me, let alone optimal.
The Gaussian distribution is a nice, smooth function. I think a nice quadrature approach or Runge-Kutta would be a much better idea.
I would have thought it depends on what type of addition you are doing. If you just want to get a normal distribution with properties (mean, standard deviation etc.) equal to the sum of two distributions then the addition of the properties as given in the other answers is fine. This is the assumption used in something like PERT where if a large number of normal probability distributions are added up then the resulting probability distribution is another normal probability distribution.
The problem comes when the two distributions being added are not similar. Take for instance adding a probability distribution with a mean of 2 and a standard deviation of 1 and a probability distribution with a mean of 10 and a standard deviation of 2. If you add these two distributions up, you get a probability distribution with two peaks, one at 2-ish and one at 10-ish. The result is therefore not a normal distribution. The assumption about adding distributions is only really valid if the original distributions are either very similar, or you have a lot of original distributions so that the peaks and troughs can be evened out.
