Reason to normalize individual's fitness when using roulette-wheel selection - genetic-algorithm

In every example of roulette-wheel selection I have read, the individual's normalized fitness is checked against a uniform random value.
http://en.wikipedia.org/wiki/Selection_(genetic_algorithm)
http://en.wikipedia.org/wiki/Fitness_proportionate_selection
Are there any pros or cons to not using the normalized value, so that the last part of the algorithm
could look like this (pseudo code):
while (candidates.length < target_size) {
    var r = random() * fitness_sum; // vs: random()
    var cumulative = 0;
    for (items as item) {
        cumulative += item.fitness; // running total: each item owns a slice of the wheel
        if (cumulative > r) { // vs: cumulative / fitness_sum > r
            candidates.push(item);
            break;
        }
    }
}

There is no practical reason. However, there is a theoretical reason, that is, if you normalize the fitnesses you get probabilities instead of some arbitrary numbers and you can treat it like probability (it sums up to one, etc.).
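The equivalence can be checked with a small sketch. The Python below is illustrative only: the function names (`roulette_index`, `select_raw`, `select_normalized`) are made up for this example, and fitness values are assumed to be non-negative. Both variants walk a running total, so for the same uniform draw `u` they should land on the same individual.

```python
def roulette_index(weights, threshold):
    """Return the index of the first item whose running total exceeds threshold."""
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if running > threshold:
            return i
    return len(weights) - 1  # guard against floating-point round-off at the top end


def select_raw(fitness, u):
    """Scale the random draw u in [0, 1) up to the fitness sum."""
    return roulette_index(fitness, u * sum(fitness))


def select_normalized(fitness, u):
    """Scale every fitness down so that the values sum to one."""
    total = sum(fitness)
    return roulette_index([f / total for f in fitness], u)
```

Normalizing costs one extra division per individual; skipping it costs one multiplication per draw. The selection probabilities are the same either way.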

Related

Get N samples given iterator

Given are an iterator it over data points, the number of data points we have n, and the maximum number of samples we want to use to do some calculations (maxSamples).
Imagine a function calculateStatistics(Iterator it, int n, int maxSamples). This function should use the iterator to retrieve the data and do some (heavy) calculations on the data element retrieved.
if n <= maxSamples we will of course use each element we get from the iterator
if n > maxSamples we will have to choose which elements to look at and which to skip
I've been spending quite some time on this. The problem is of course how to choose when to skip an element and when to keep it. My approaches so far:
I don't want to take the first maxSamples coming from the iterator, because the values might not be evenly distributed.
Another idea was to use a random number generator to create maxSamples (distinct) random numbers between 0 and n and take the elements at these positions. But if e.g. n = 101 and maxSamples = 100, it gets more and more difficult to find a new distinct number not yet in the list, losing a lot of time just in the random number generation.
My last idea was to do the contrary: to generate n - maxSamples random numbers and exclude the data elements at these positions. But this also doesn't seem to be a very good solution.
Do you have a good idea for this problem? Are there maybe standard known algorithms for this?
To provide some answer: a good way to collect a random set of elements when the collection size is greater than the number of elements needed is the following (in C++-ish pseudo code).
EDIT: you may need to iterate over and create the "someElements" vector first. If your elements are large they can be "pointers" to these elements to save space.
vector randomCollectionFromVector(someElements, numElementsToGrab) {
    vector resultVector;
    while (numElementsToGrab--) {
        randPosition = rand() % someElements.size();
        resultVector.push(someElements.get(randPosition));
        someElements.remove(randPosition); // removal guarantees no element is picked twice
    }
    return resultVector;
}
If you don't care about changing your vector of elements, you could also remove random elements from someElements, as you mentioned. The algorithm would look very similar, and again, this is conceptually the same idea, you just pass someElements by reference, and manipulate it.
Something worth noting is that the quality of a pseudo-random distribution, in terms of how random it looks, grows with the number of values you draw from it. So you may tend to get better results if you pick the method that uses more random numbers. Example: if you have 100 values and need 99, you should probably pick 99 values, as this uses 99 pseudo-random numbers instead of just 1. Conversely, if you have 1000 values and need 99, you should probably prefer the version where you remove 901 values, because you use more numbers from the pseudo-random distribution. If what you want is a solid random distribution, this is a very simple optimization that will greatly increase the quality of the "fake randomness" that you see. Alternatively, if performance matters more than distribution, you would take the other approach, or even just grab the first 99 values.
interval = n/(n-maxSamples) // Euclidean (integer) division, of course
offset = random(0..(n-1)) // a random number between 0 and n-1
totalSkip = 0
indexSample = 0
FOR it IN samples DO
    indexSample++ // goes from 1 to n
    IF totalSkip < (n-maxSamples) AND (indexSample+offset) % interval == 0 THEN
        // do nothing with this sample
        totalSkip++
    ELSE
        // work with this sample
    ENDIF
ENDFOR
ASSERT(totalSkip == n-maxSamples) // to be sure
interval represents the distance between two samples to skip.
offset is not mandatory, but it adds a little diversity.
Based on the discussion, and greater understanding of your problem, I suggest the following. You can take advantage of a property of prime numbers that I think will net you a very good solution, that will appear to grab pseudo random numbers. It is illustrated in the following code.
#include <iostream>
using namespace std;
int main() {
    // This prime must not divide the size of your data set;
    // choosing one larger than the data set guarantees that.
    const int SOME_LARGE_PRIME = 577;
    const int NUM_ELEMENTS = 100;
    int lastValue = 0;
    for (int i = 0; i < NUM_ELEMENTS; i++) {
        lastValue += SOME_LARGE_PRIME;
        cout << lastValue % NUM_ELEMENTS << endl;
    }
}
Using the logic presented here, you can create a table of all values from 0 to "NUM_ELEMENTS - 1". Because of the properties of prime numbers, you will not get any duplicates until you rotate all the way around back to the size of your data set. If you then take the first "NUM_SAMPLES" of these and sort them, you can iterate through your data structure and grab a pseudo-random distribution of numbers (not very good random, but more random than a pre-determined interval), with only one pass over your data. Better yet, you can change the layout of the distribution by grabbing a random prime number each time; again, it must be larger than your data set, or it fails as in the following example.
PRIME = 3, data set size = 99. Won't work.
Of course, ultimately this is very similar to the pre-determined interval, but it inserts a level of randomness that you do not get by simply grabbing every "size/num_samples"th element.
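A short Python sketch of the same stride idea follows. The helper name `stride_indices` is hypothetical. The check it performs is that the stride shares no common factor with the data set size, which is the property that actually prevents duplicates; a prime larger than the data set is one easy way to satisfy it.

```python
from math import gcd


def stride_indices(n, k, stride):
    """Pick k distinct indices in range(n) by stepping through it with a fixed stride."""
    if gcd(stride, n) != 1:
        raise ValueError("stride must share no common factor with n")
    if k > n:
        raise ValueError("cannot pick more indices than there are elements")
    # Sorted so the indices can be consumed in a single pass over an iterator.
    return sorted((i * stride) % n for i in range(1, k + 1))
```

With `n = 99` and `stride = 3` the function raises, which corresponds to the "won't work" case above: the sequence would repeat after only 33 values.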
This is called reservoir sampling.
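For reference, a minimal sketch of reservoir sampling in its simplest form (often called Algorithm R) might look like this in Python. The function name and the optional `rng` parameter are choices made for this example. It needs only one pass over the iterator and does not need to know n in advance.

```python
import random


def reservoir_sample(iterable, k, rng=None):
    """Return up to k items chosen uniformly at random from iterable in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # keep the new item with probability k/(i+1)
    return reservoir
```

If the iterator yields fewer than k items, every item is returned, which matches the n <= maxSamples case in the question.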

How to generate a number in arbitrary range using random()={0..1} preserving uniformness and density?

Generate a random number in range [x..y] where x and y are any arbitrary floating point numbers. Use function random(), which returns a random floating point number in range [0..1] from P uniformly distributed numbers (call it "density"). Uniform distribution must be preserved and P must be scaled as well.
I think, there is no easy solution for such problem. To simplify it a bit, I ask you how to generate a number in interval [-0.5 .. 0.5], then in [0 .. 2], then in [-2 .. 0], preserving uniformness and density? Thus, for [0 .. 2] it must generate a random number from P*2 uniformly distributed numbers.
The obvious simple solution random() * (x - y) + y will generate not all possible numbers because of the lower density for all abs(x-y)>1.0 cases. Many possible values will be missed. Remember, that random() returns only a number from P possible numbers. Then, if you multiply such number by Q, it will give you only one of P possible values, scaled by Q, but you have to scale density P by Q as well.
If I understand your problem correctly, here is a solution; note that I exclude 1 from the range.
N = numbers_in_your_random // [0, 0.2, 0.4, 0.6, 0.8] will be 5
// This turns your random number generator to return integer values between [0..N[;
function randomInt()
{
    return floor(random()*N);
}
// This turns the integer random number generator to return arbitrary
// integer
function getRandomInt(maxValue)
{
    if (maxValue < N)
    {
        return randomInt() % maxValue;
    }
    else
    {
        baseValue = randomInt();
        bRate = maxValue DIV N;
        bMod = maxValue % N;
        if (baseValue < bMod)
        {
            bRate++;
        }
        return N*getRandomInt(bRate) + baseValue;
    }
}
// This will return random number in range [lower, upper[ with the same density as random()
function extendedRandom(lower, upper)
{
    diff = upper - lower;
    ndiff = diff * N;
    baseValue = getRandomInt(ndiff);
    baseValue /= N;
    return lower + baseValue;
}
If you really want to generate all possible floating point numbers in a given range with uniform numeric density, you need to take into account the floating point format. For each possible value of your binary exponent, you have a different numeric density of codes. A direct generation method will need to deal with this explicitly, and an indirect generation method will still need to take it into account. I will develop a direct method; for the sake of simplicity, the following refers exclusively to IEEE 754 single-precision (32-bit) floating point numbers.
The most difficult case is any interval that includes zero. In that case, to produce an exactly even distribution, you will need to handle every exponent down to the lowest, plus denormalized numbers. As a special case, you will need to split zero into two cases, +0 and -0.
In addition, if you are paying such close attention to the result, you will need to make sure that you are using a good pseudorandom number generator with a large enough state space that you can expect it to hit every value with near-uniform probability. This disqualifies the C/Unix rand() and possibly the *rand48() library functions; you should use something like the Mersenne Twister instead.
The key is to dissect the target interval into subintervals, each of which is covered by different combination of binary exponent and sign: within each subinterval, floating point codes are uniformly distributed.
The first step is to select the appropriate subinterval, with probability proportional to its size. If the interval contains 0, or otherwise covers a large dynamic range, this may potentially require a number of random bits up to the full range of the available exponent.
In particular, for a 32-bit IEEE-754 number, there are 256 possible exponent values. Each exponent governs a range which is half the size of the next greater exponent, except for the denormalized case, which is the same size as the smallest normal exponent region. Zero can be considered the smallest denormalized number; as mentioned above, if the target interval straddles zero, the probability of each of +0 and -0 should perhaps be cut in half, to avoid doubling its weight.
If the subinterval chosen covers the entire region governed by a particular exponent, all that is necessary is to fill the mantissa with random bits (23 bits, for 32-bit IEEE-754 floats). However, if the subinterval does not cover the entire region, you will need to generate a random mantissa that covers only that subinterval.
The simplest way to handle both the initial and secondary random steps may be to round the target interval out to include the entirety of all exponent regions partially covered, then reject and retry numbers that fall outside it. This allows the exponent to be generated with simple power-of-2 probabilities (e.g., by counting the number of leading zeroes in your random bitstream), as well as providing a simple and accurate way of generating a mantissa that covers only part of an exponent interval. (This is also a good way of handling the +/-0 special case.)
As another special case: to avoid inefficient generation for target intervals which are much smaller than the exponent regions they reside in, the "obvious simple" solution will in fact generate fairly uniform numbers for such intervals. If you want exactly uniform distributions, you can generate the sub-interval mantissa by using only enough random bits to cover that sub-interval, while still using the aforementioned rejection method to eliminate values outside the target interval.
well, [0..1] * 2 == [0..2] (still uniform)
[0..1] - 0.5 == [-0.5..0.5] etc.
I wonder where have you experienced such an interview?
Update: well, if we want to start caring about losing precision on multiplication (which is weird, because somehow you did not care about that in the original task) and pretend we care about the "number of values", we can start iterating. In order to do that, we need one more function, which would return uniformly distributed random values in [0..1), which can be done by dropping the 1.0 value should it ever appear. After that, we can slice the whole range into equal parts small enough to not care about losing precision, choose one randomly (we have enough randomness to do that), and choose a number in this bucket using the [0..1) function for all parts but the last one.
Or, you can come up with a way to code enough values to care about—and just generate random bits for this code, in which case you don't really care whether it's [0..1] or just {0, 1}.
Let me rephrase your question:
Let random() be a random number generator with a discrete uniform distribution over [0,1). Let D be the number of possible values returned by random(), each of which is precisely 1/D greater than the previous. Create a random number generator rand(L, U) with a discrete uniform distribution over [L, U) such that each possible value is precisely 1/D greater than the previous.
--
A couple quick notes.
The problem in this form, and as you phrased it, is unsolvable in general. That is, if D = 1 there is nothing we can do.
I don't require that 0.0 be one of the possible values for random(). If it is not, then it is possible that the solution below will fail when U - L < 1 / D. I'm not particularly worried about that case.
I use all half-open ranges because it makes the analysis simpler. Using your closed ranges would be simple, but tedious.
Finally, the good stuff. The key insight here is that the density can be maintained by independently selecting the whole and fractional parts of the result.
First, note that given random() it is trivial to create randomBit(). That is,
randomBit() { return random() >= 0.5; }
Then, if we want to select one of {0, 1, 2, ..., 2^N - 1} uniformly at random, that is simple using randomBit(), just generate each of the bits. Call this random2(N).
Using random2() we can select one of {0, 1, 2, ..., N - 1}:
randomInt(N) { while ((val = random2(ceil(log2(N)))) >= N); return val; }
Now, if D is known, then the problem is trivial as we can reduce it to simply choosing one of floor((U - L) * D) values uniformly at random and we can do that with randomInt().
So, let's assume that D is not known. Now, let's first make a function to generate random values in the range [0, 2^N) with the proper density. This is simple.
rand2D(N) { return random2(N) + random(); }
rand2D() is where we require that the difference between consecutive possible values for random() be precisely 1/D. If not, the possible values here would not have uniform density.
Next, we need a function that selects a value in the range [0, V) with the proper density. This is similar to randomInt() above.
randD(V) { while ((val = rand2D(ceil(log2(V)))) >= V); return val; }
And finally...
rand(L, U) { return L + randD(U - L); }
We may now have offset the discrete positions if L * D is not an integer, but that is unimportant.
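The chain of helpers above can be put together as a runnable Python sketch. Everything here is an illustration under stated assumptions: `make_coarse_random` is a made-up stand-in for the low-density random(), it assumes D is even so that the `>= 0.5` test yields a fair bit, and the source function is passed in explicitly as `rnd`.

```python
import math
import random


def make_coarse_random(d, rng):
    """Stand-in for random(): uniform over {0, 1/d, ..., (d-1)/d}. Assumes d is even."""
    return lambda: rng.randrange(d) / d


def random2(bits, rnd):
    """Uniform integer in [0, 2**bits), built one random bit at a time."""
    value = 0
    for _ in range(bits):
        value = (value << 1) | (1 if rnd() >= 0.5 else 0)
    return value


def rand2d(bits, rnd):
    """Value in [0, 2**bits) with the same spacing as rnd()."""
    return random2(bits, rnd) + rnd()


def rand_d(v, rnd):
    """Value in [0, v): rejection sampling over the next power of two."""
    bits = max(0, math.ceil(math.log2(v)))
    while True:
        val = rand2d(bits, rnd)
        if val < v:
            return val


def rand_range(lower, upper, rnd):
    """Value in [lower, upper) with the same spacing as rnd()."""
    return lower + rand_d(upper - lower, rnd)
```

With D = 4 and the range [-2, 0.5), repeated calls should produce the ten values -2, -1.75, ..., 0.25, each 1/4 apart, rather than only four values stretched across the range.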
--
A last note, you may have noticed that several of these functions may never terminate. That is essentially a requirement. For example, random() may have only a single bit of randomness. If I then ask you to select from one of three values, you cannot do so uniformly at random with a function that is guaranteed to terminate.
Consider this approach:
I'm assuming the base random number generator in the range [0..1]
generates among the numbers
0, 1/(p-1), 2/(p-1), ..., (p-2)/(p-1), (p-1)/(p-1)
If the target interval length is less than or equal to 1,
return random()*(y-x) + x.
Else, map each number r from the base RNG to an interval in the
target range:
[r*(p-1)*(y-x)/p, (r+1/(p-1))*(p-1)*(y-x)/p]
(i.e. for each of the P numbers assign one of P intervals with length (y-x)/p)
Then recursively generate another random number in that interval and
add it to the interval begin.
Pseudocode:
const p;
function rand(x, y)
    r = random()
    if y-x <= 1
        return x + r*(y-x)
    else
        low = r*(p-1)*(y-x)/p
        high = low + (y-x)/p
        return x + rand(low, high)
In real math: the solution is just the provided:
return random() * (upper - lower) + lower
The problem is that floating point numbers only have a certain resolution. So what you can do is apply the above function and add another random() value scaled to the missing part.
If I make a practical example it becomes clear what I mean:
E.g. take random() return value from 0..1 with 2 digits accuracy, ie 0.XY, and lower with 100 and upper with 1100.
So with above algorithm you get as result 0.XY * (1100-100) + 100 = XY0.0 + 100.
You will never see 201 as result, as the final digit has to be 0.
The solution here would be to generate another random value and add it, multiplied by 10, so you gain one more digit of accuracy (you have to take care that you don't exceed your given range, which can happen; in that case you have to discard the result and generate a new number).
You may have to repeat this; how often depends on how many places the random() function delivers and how many you expect in your final result.
A standard IEEE format has limited precision (e.g. 53 bits for a double). So when you generate a number this way, you never need to generate more than one additional number.
But you have to be careful that when you add the new number, you don't exceed your given upper limit. There are multiple solutions to this. First, if you exceed your limit, you start over and generate a new number (don't cut off or similar, as this changes the distribution).
The second possibility is to check the interval size of the missing lower bit range, find the middle value, and generate an appropriate value that guarantees that the result will fit.
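The two-digit example can be made concrete with a small Python sketch. It works on integers to keep the arithmetic exact, and the function names are invented for this illustration. One assumption differs slightly from the text: the draws are taken from 0..99 (that is, 0.00 to 0.99, excluding 1.00), so the refined result can never exceed the upper limit and no retry is needed.

```python
def naive_scale(xy, lower=100, upper=1100):
    """Map a two-digit draw xy (0..99, i.e. 0.XY) straight onto [lower, upper)."""
    return lower + xy * (upper - lower) // 100


def refined_scale(xy, second_xy, lower=100, upper=1100):
    """Use the leading digit of a second draw to fill in the missing last digit."""
    step = (upper - lower) // 100      # gap between neighbouring naive results (10 here)
    fill = second_xy * step // 100     # 0..step-1, taken from the second draw
    return naive_scale(xy, lower, upper) + fill
```

Enumerating all draws shows the difference: `naive_scale` only ever yields 100, 110, 120, and so on, so 201 is unreachable, while `refined_scale` reaches every integer from 100 to 1099.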
You have to consider the amount of entropy that comes from each call to your RNG. Here is some C# code I just wrote that demonstrates how you can accumulate entropy from low-entropy source(s) and end up with a high-entropy random value.
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
namespace SO_8019589
{
    class LowEntropyRandom
    {
        public readonly double EffectiveEntropyBits;
        public readonly int PossibleOutcomeCount;
        private readonly double interval;
        private readonly Random random = new Random();
        public LowEntropyRandom(int possibleOutcomeCount)
        {
            PossibleOutcomeCount = possibleOutcomeCount;
            EffectiveEntropyBits = Math.Log(PossibleOutcomeCount, 2);
            interval = 1.0 / PossibleOutcomeCount;
        }
        public LowEntropyRandom(int possibleOutcomeCount, int seed)
            : this(possibleOutcomeCount)
        {
            random = new Random(seed);
        }
        public int Next()
        {
            return random.Next(PossibleOutcomeCount);
        }
        public double NextDouble()
        {
            return interval * Next();
        }
    }
    class EntropyAccumulator
    {
        private List<byte> currentEntropy = new List<byte>();
        public double CurrentEntropyBits { get; private set; }
        public void Clear()
        {
            currentEntropy.Clear();
            CurrentEntropyBits = 0;
        }
        public void Add(byte[] entropy, double effectiveBits)
        {
            currentEntropy.AddRange(entropy);
            CurrentEntropyBits += effectiveBits;
        }
        public byte[] GetBytes(int count)
        {
            using (var hasher = new SHA512Managed())
            {
                count = Math.Min(count, hasher.HashSize / 8);
                var bytes = new byte[count];
                var hash = hasher.ComputeHash(currentEntropy.ToArray());
                Array.Copy(hash, bytes, count);
                return bytes;
            }
        }
        public byte[] GetPackagedEntropy()
        {
            // Returns a compact byte array that represents almost all of the entropy.
            return GetBytes((int)(CurrentEntropyBits / 8));
        }
        public double GetDouble()
        {
            // returns a uniformly distributed number on [0-1)
            return (double)BitConverter.ToUInt64(GetBytes(8), 0) / ((double)UInt64.MaxValue + 1);
        }
        public int GetInt(int maxValue)
        {
            // returns a uniformly distributed integer on [0-maxValue)
            return (int)(maxValue * GetDouble());
        }
    }
    class Program
    {
        static void Main(string[] args)
        {
            var random = new LowEntropyRandom(2); // this only provides 1 bit of entropy per call
            var desiredEntropyBits = 64; // enough for a double
            while (true)
            {
                var adder = new EntropyAccumulator();
                while (adder.CurrentEntropyBits < desiredEntropyBits)
                {
                    adder.Add(BitConverter.GetBytes(random.Next()), random.EffectiveEntropyBits);
                }
                Console.WriteLine(adder.GetDouble());
                Console.ReadLine();
            }
        }
    }
}
Since I'm using a 512-bit hash function, that is the maximum amount of entropy that you can get out of the EntropyAccumulator. This could be fixed, if necessary.
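A much simpler variant of the same accumulation idea, sketched in Python, concatenates the bits directly instead of hashing them. This is not the C# approach above, just a smaller illustration; the name `double_from_bits` and the `random_bit` callable (assumed to return a fair 0 or 1) are made up for the example.

```python
import random


def double_from_bits(random_bit, bits=53):
    """Assemble a float in [0, 1) from `bits` calls to a one-bit source."""
    value = 0
    for _ in range(bits):
        value = (value << 1) | random_bit()  # shift in one bit per call
    return value / (1 << bits)               # 53 bits fit a double's significand exactly
```

For example, `double_from_bits(lambda: rng.getrandbits(1))` with `rng = random.Random()` draws 53 single bits and returns one double built from them.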
If I understand your problem correctly, it's that rand() generates finely spaced but ultimately discrete random numbers. And if we multiply it by (y-x) which is large, this spreads these finely spaced floating point values out in a way that is missing many of the floating point values in the range [x,y]. Is that all right?
If so, I think we have a solution already given by Dialecticus. Let me explain why he is right.
First, we know how to generate a random float and then add another floating point value to it. This may produce a round off error due to addition, but it will be in the last decimal place only. Use doubles or something with finer numerical resolution if you want better precision. So, with that caveat, the problem is no harder than finding a random float in the range [0,y-x] with uniform density. Let's say y-x = z. Obviously, since z is a floating point it may not be an integer. We handle the problem in two steps: first we generate the random digits to the left of the decimal point and then generate the random digits to the right of it. Doing both uniformly means their sum is uniformly distributed across the range [0,z] too. Let w be the largest integer <= z. To answer our simplified problem, we can first pick a random integer from the range {0,1,...,w}. Then, step #2 is to add a random float from the unit interval to this random number. This isn't multiplied by any possibly large values, so it has as fine a resolution as the numerical type can have. (Assuming you're using an ideal random floating point number generator.)
So what about the corner case where the random integer was the largest one (i.e. w) and the random float we added to it was larger than z - w, so that the random number exceeds the allowed maximum? The answer is simple: do all of it again and check the new result. Repeat until you get a number in the allowed range. It's an easy proof that a uniformly generated random number which is tossed out and generated again if it's outside an allowed range results in a uniformly generated random number in the allowed range. Once you make this key observation, you see that Dialecticus met all your criteria.
When you generate a random number with random(), you get a floating point number between 0 and 1 having an unknown precision (or density, you name it).
And when you multiply it with a number (NUM), you lose this precision, by lg(NUM) (10-based logarithm). So if you multiply by 1000 (NUM=1000), you lose the last 3 digits (lg(1000) = 3).
You may correct this by adding a smaller random number to the original, which supplies the missing 3 digits. But you don't know the precision, so you can't determine exactly where they are.
I can imagine two scenarios:
(X = range start, Y = range end)
1: you define the precision (PREC, eg. 20 digits, so PREC=20), and consider it enough to generate a random number, so the expression will be:
( random() * (Y-X) + X ) + ( random() / 10 ^ (PREC-trunc(lg(Y-X))) )
with numbers: (X = 500, Y = 1500, PREC = 20)
( random() * (1500-500) + 500 ) + ( random() / 10 ^ (20-trunc(lg(1000))) )
( random() * 1000 + 500 ) + ( random() / 10 ^ (17) )
There are some problems with this:
2 phase random generation (how much will it be random?)
the first random returns 1 -> result can be out of range
2: guess the precision by random numbers
you define some tries (eg. 4) to calculate the precision by generating random numbers and count the precision every time:
- 0.4663164 -> PREC=7
- 0.2581916 -> PREC=7
- 0.9147385 -> PREC=7
- 0.129141 -> PREC=6 -> 7, correcting by the average of the other tries
That's my idea.

Linear fitness scaling in Genetic Algorithm produces negative fitness values

I have a GA with a fitness function that can evaluate to negative or positive values. For the sake of this question let's assume the function
u = 5 - (x^2 + y^2)
where
x in [-5.12 .. 5.12]
y in [-5.12 .. 5.12]
Now in the selection phase of GA I am using simple roulette wheel. Since to be able to use simple roulette wheel my fitness function must be positive for concrete cases in a population, I started looking for scaling solutions. The most natural seems to be linear fitness scaling. It should be pretty straightforward, for example look at this implementation. However, I am getting negative values even after linear scaling.
For example for the above mentioned function and these fitness values:
-9.734897 -7.479017 -22.834280 -9.868979 -13.180669 4.898595
after linear scaling I am getting these values
-9.6766040 -11.1755111 -0.9727897 -9.5875139 -7.3870793 -19.3997490
Instead, I would like to scale them to positive values, so I can do roulette wheel selection in the next phase.
I must be doing something fundamentally wrong here. How should I approach this problem?
The main mistake was that the input to linear scaling must already be positive (by definition), whereas I was also feeding it negative values.
The talk about negative values is not about input to the algorithm, but about output (scaled values) from the algorithm. The check is to handle this case and then correct it so as not to produce negative scaled values.
if (p->min > (p->scaleFactor * p->avg - p->max) /
    (p->scaleFactor - 1.0)) { /* if nonnegative smin */
    d = p->max - p->avg;
    p->scaleConstA = (p->scaleFactor - 1.0) * p->avg / d;
    p->scaleConstB = p->avg * (p->max - (p->scaleFactor * p->avg)) / d;
} else { /* if smin becomes negative on scaling */
    d = p->avg - p->min;
    p->scaleConstA = p->avg / d;
    p->scaleConstB = -p->min * p->avg / d;
}
On the image below, if f'min is negative, go to else clause and handle this case.
Well, the solution is then to prescale the above-mentioned function so it gives only non-negative values. As Hyperboreus suggested, this can be done by subtracting the smallest possible value (which is negative here, so every fitness is shifted upward)
u = 5 - (2*5.12^2)
It is best if we separate real fitness values that we are trying to maximize from scaled fitness values that are input to selection phase of GA.
I agree with the previous answer. Linear scaling by itself tries to preserve the average fitness value, so it needs to be offset if the function is negative. For more details, please have a look in Goldberg's Genetic Algorithms book (1989), Chapter 7, pp. 76-79.
Your smallest possible value for u is 5 - (2*5.12^2), about -47.43. Why not just subtract this from your u?

How to calculate the sum of two normal distributions

I have a value type that represents a gaussian distribution:
struct Gauss {
    double mean;
    double variance;
}
I would like to perform an integral over a series of these values:
Gauss eulerIntegrate(double dt, Gauss iv, Gauss[] values) {
    Gauss r = iv;
    foreach (Gauss v in values) {
        r += v*dt;
    }
    return r;
}
My question is how to implement addition for these normal distributions.
The multiplication by a scalar (dt) seemed simple enough. But it wasn't simple! Thanks FOOSHNICK for the help:
public static Gauss operator * (Gauss g, double d) {
    return new Gauss(g.mean * d, g.variance * d * d);
}
However, addition eludes me. I assume I can just add the means; it's the variance that's causing me trouble. Either of these definitions seems "logical" to me.
public static Gauss operator + (Gauss a, Gauss b) {
    double mean = a.mean + b.mean;
    // Is it this? (Yes, it is!)
    return new Gauss(mean, a.variance + b.variance);
    // Or this? (nope)
    //return new Gauss(mean, Math.Max(a.variance, b.variance));
    // Or how about this? (nope)
    //return new Gauss(mean, (a.variance + b.variance)/2);
}
Can anyone help define a statistically correct - or at least "reasonable" - version of the + operator?
I suppose I could switch the code to use interval arithmetic instead, but I was hoping to stay in the world of prob and stats.
The sum of two normal distributions is itself a normal distribution:
N(mean1, variance1) + N(mean2, variance2) ~ N(mean1 + mean2, variance1 + variance2)
This is all on the Wikipedia page.
Be careful that these really are variances and not standard deviations.
// X + Y
public static Gauss operator + (Gauss a, Gauss b) {
    //NOTE: this is valid if X,Y are independent normal random variables
    return new Gauss(a.mean + b.mean, a.variance + b.variance);
}
// X*b
public static Gauss operator * (Gauss a, double b) {
    return new Gauss(a.mean*b, a.variance*b*b);
}
To be more precise:
If a random variable Z is defined as the linear combination of two uncorrelated Gaussian random variables X and Y, then Z is itself a Gaussian random variable, e.g.:
if Z = aX + bY,
then mean(Z) = a * mean(X) + b * mean(Y), and variance(Z) = a^2 * variance(X) + b^2 * variance(Y).
If the random variables are correlated, then you have to account for that. Variance(X) is defined by the expected value E([X-mean(X)]^2). Working this through for Z = aX + bY, we get:
variance(Z) = a^2 * variance(X) + b^2 * variance(Y) + 2ab * covariance(X,Y)
If you are summing two uncorrelated random variables which do not have Gaussian distributions, then the distribution of the sum is the convolution of the two component distributions.
If you are summing two correlated non-Gaussian random variables, you have to work through the appropriate integrals yourself.
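The formulas above translate directly into code. The following Python sketch is only an illustration: the `Gauss` dataclass mirrors the struct from the question, and `linear_combination` is a hypothetical helper that adds the covariance term for the correlated case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gauss:
    mean: float
    variance: float

    def __add__(self, other):
        # Valid for independent normal variables: means add, variances add.
        return Gauss(self.mean + other.mean, self.variance + other.variance)

    def __mul__(self, scalar):
        # Scaling by a constant multiplies the variance by the constant squared.
        return Gauss(self.mean * scalar, self.variance * scalar * scalar)


def linear_combination(a, x, b, y, covariance=0.0):
    """Distribution of a*X + b*Y for jointly normal X and Y with the given covariance."""
    mean = a * x.mean + b * y.mean
    variance = a * a * x.variance + b * b * y.variance + 2 * a * b * covariance
    return Gauss(mean, variance)
```

For instance, `Gauss(1.0, 4.0) + Gauss(2.0, 9.0)` gives `Gauss(mean=3.0, variance=13.0)`, and `Gauss(1.0, 4.0) * 3.0` gives `Gauss(mean=3.0, variance=36.0)`.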
Well, your multiplication by scalar is wrong - you should multiply variance by the square of d. If you're adding a constant, then just add it to the mean, the variance stays the same. If you're adding two distributions, then add the means and add the variances.
Can anyone help define a statistically correct - or at least "reasonable" - version of the + operator?
Arguably not, as adding two distributions means different things. Having worked in reliability and maintainability, my first reaction from the title would be the distribution of a system's MTBF, if the MTBF of each part is normally distributed and the system had no redundancy. You are talking about the distribution of the sum of two normally distributed independent variates, not the (logical) sum of two normal distributions' effect. Very often, operator overloading has surprising semantics. I'd leave it as a function and call it 'normalSumDistribution' unless your code has a very specific target audience.
Hah, I thought you couldn't add gaussian distributions together, but you can!
http://mathworld.wolfram.com/NormalSumDistribution.html
In fact, the mean is the sum of the individual means, and the variance is the sum of the individual variances.
I'm not sure that I like what you're calling "integration" over a series of values. Do you mean that word in a calculus sense? Are you trying to do numerical integration? There are other, better ways to do that. Yours doesn't look right to me, let alone optimal.
The Gaussian distribution is a nice, smooth function. I think a nice quadrature approach or Runge-Kutta would be a much better idea.
I would have thought it depends on what type of addition you are doing. If you just want to get a normal distribution with properties (mean, standard deviation etc.) equal to the sum of two distributions then the addition of the properties as given in the other answers is fine. This is the assumption used in something like PERT where if a large number of normal probability distributions are added up then the resulting probability distribution is another normal probability distribution.
The problem comes when the two distributions being added are not similar. Take for instance adding a probability distribution with a mean of 2 and a standard deviation of 1 and a probability distribution with a mean of 10 and a standard deviation of 2. If you add these two distributions up, you get a probability distribution with two peaks, one at 2ish and one at 10ish. The result is therefore not a normal distribution. The assumption about adding distributions is only really valid if the original distributions are either very similar or you have a lot of original distributions so that the peaks and troughs can be evened out.

How best to sum up lots of floating point numbers?

Imagine you have a large array of floating point numbers, of all kinds of sizes. What is the most correct way to calculate the sum, with the least error? For example, when the array looks like this:
[1.0, 1e-10, 1e-10, ..., 1e-10]
and you add up from left to right with a simple loop, like
sum = 0.0
numbers.each do |val|
  sum += val
end
then with each addition the smaller numbers may fall partly below the precision threshold of the running sum, so the error gets bigger and bigger. As far as I know, the best way is to sort the array and add the numbers up from lowest to highest, but I am wondering if there is an even better way (faster, more precise)?
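To make the effect concrete, here is a small Python sketch of the situation described. The array length of 100,000 is an arbitrary choice, and math.fsum from the standard library is used only as a reference value. The loop is written out by hand on purpose, because the built-in sum() may itself use a compensated algorithm in recent Python versions.

```python
import math

def naive_sum(numbers):
    """Plain left-to-right accumulation in a double."""
    total = 0.0
    for x in numbers:
        total += x
    return total

values = [1.0] + [1e-10] * 100_000

reference = math.fsum(values)                    # reference value
left_to_right = naive_sum(values)                # big value first
ascending = naive_sum(sorted(values, key=abs))   # smallest magnitudes first

print(abs(left_to_right - reference))
print(abs(ascending - reference))
```

With the large value first, every small addend is rounded against a running total near 1.0; summing in ascending order lets the small values accumulate among themselves before they meet the large one.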
EDIT: Thanks for the answer. I now have working code that sums up double values in Java. It is a straight port of the Python recipe from the winning answer, and it passes all of my unit tests. (A longer but optimized version of this is available here: Summarizer.java)
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/**
 * Adds up numbers in an array with perfect precision, and in O(n).
 *
 * @see http://code.activestate.com/recipes/393090/
 */
public class Summarizer {

    /**
     * Perfectly sums up numbers, without rounding errors (if at all possible).
     *
     * @param values
     *            The values to sum up.
     * @return The sum.
     */
    public static double msum(double... values) {
        // Partial sums whose exact total equals the exact sum seen so far.
        List<Double> partials = new ArrayList<Double>();
        for (double x : values) {
            int i = 0;
            for (double y : partials) {
                // Make sure |x| >= |y| so the rounding error can be recovered.
                if (Math.abs(x) < Math.abs(y)) {
                    double tmp = x;
                    x = y;
                    y = tmp;
                }
                double hi = x + y;
                double lo = y - (hi - x); // rounding error of x + y
                if (lo != 0.0) {
                    partials.set(i, lo);
                    ++i;
                }
                x = hi;
            }
            // Keep the first i partials, then store x as the last one.
            if (i < partials.size()) {
                partials.set(i, x);
                partials.subList(i + 1, partials.size()).clear();
            } else {
                partials.add(x);
            }
        }
        return sum(partials);
    }

    /**
     * Sums up the rest of the partial numbers which cannot be summed up without
     * loss of precision.
     */
    public static double sum(Collection<Double> values) {
        double s = 0.0;
        for (Double d : values) {
            s += d;
        }
        return s;
    }
}
For "more precise": this recipe in the Python Cookbook has summation algorithms which keep the full precision (by keeping track of the subtotals). Code is in Python but even if you don't know Python it's clear enough to adapt to any other language.
All the details are given in this paper.
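If you are working in Python anyway, the standard library's math.fsum tracks intermediate partial sums in the same spirit, so you may not need to port the recipe at all. A minimal sketch follows; the naive loop is written out by hand because the built-in sum() may use a more accurate algorithm in newer Python versions.

```python
import math

values = [0.1] * 10

naive = 0.0
for x in values:
    naive += x          # rounding error accumulates at each step

print(naive)              # 0.9999999999999999
print(math.fsum(values))  # 1.0
```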
See also: the Kahan summation algorithm. It requires only O(1) storage rather than O(n).
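A minimal sketch of Kahan (compensated) summation in Python, for illustration only. The variable names are mine, and math.fsum is used just to check the result on the array from the question (with an arbitrary length of 100,000).

```python
import math

def kahan_sum(numbers):
    """Compensated summation with O(1) extra storage."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in numbers:
        y = x - compensation            # apply the correction to the next addend
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost (algebraically zero)
        total = t
    return total

values = [1.0] + [1e-10] * 100_000
print(kahan_sum(values))
print(math.fsum(values))
```

The extra cost is a few floating-point operations per element, with no sorting and no additional list.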
There are many algorithms, depending on what you want. Usually they require keeping track of the partial sums. If you keep only the sums x[k+1] - x[k], you get the Kahan algorithm. If you keep track of all the partial sums (hence yielding an O(n^2) algorithm), you get @dF's answer.
Note that, in addition to your problem, summing numbers of different signs is very problematic.
Now, there are simpler recipes than keeping track of all the partial sums:
Sort the numbers before summing, and sum all the negatives and the positives independently. If your numbers are already sorted, fine; otherwise you have an O(n log n) algorithm. Sum by increasing magnitude.
Sum by pairs, then pairs of pairs, etc.
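The "sum by pairs" recipe can be sketched as a recursive pairwise summation. This is only an illustration: the function name, the index-based recursion, and the block size of 8 (below which a plain loop is used) are my own choices.

```python
import math

def pairwise_sum(numbers, lo=0, hi=None, block=8):
    """Sum numbers[lo:hi] by splitting the range in half recursively."""
    if hi is None:
        hi = len(numbers)
    if hi - lo <= block:
        # Small ranges: a plain loop is accurate enough and avoids overhead.
        total = 0.0
        for k in range(lo, hi):
            total += numbers[k]
        return total
    mid = (lo + hi) // 2
    return (pairwise_sum(numbers, lo, mid, block)
            + pairwise_sum(numbers, mid, hi, block))

values = [1.0] + [1e-10] * 100_000
print(pairwise_sum(values))
print(math.fsum(values))
```

The rounding error then grows roughly with the depth of the recursion (about log n) rather than with n, and no sorting is needed.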
Personal experience shows that you usually don't need fancier things than Kahan's method.
Well, if you don't want to sort then you could simply keep the total in a variable with a type of higher precision than the individual values (e.g. use a double to keep the sum of floats, or a "quad" to keep the sum of doubles). This will impose a performance penalty, but it might be less than the cost of sorting.
If your application relies on numeric processing, look for an arbitrary-precision arithmetic library, although I don't know whether there are Python libraries of this kind. Of course, it all depends on how many digits of precision you want: you can achieve good results with standard IEEE floating point if you use it with care.
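On the Python side, the standard library does include the fractions and decimal modules. As a rough sketch of the arbitrary-precision route, one could accumulate in exact rational arithmetic and round once at the end; exact_sum is a name made up for this example, and the approach is much slower than the floating-point methods.

```python
from fractions import Fraction

def exact_sum(numbers):
    """Sum finite floats exactly as rationals, rounding only at the end."""
    total = Fraction(0)
    for x in numbers:
        total += Fraction(x)  # every finite float converts to a Fraction exactly
    return float(total)       # single rounding back to a double

print(exact_sum([0.1] * 10))  # 1.0
```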

Resources