Efficient random sampling of constrained n-dimensional space - algorithm

I'm about to optimize a problem that is defined by n (n>=1, typically n=4) non-negative variables. This is not an n-dimensional problem, since the sum of all the variables needs to be 1.
The most straightforward approach would be for each x_i to scan the entire range 0<=x_i<1, and then to normalize all the values by the sum of all the x's. However, this approach introduces redundancy, which is a problem for many optimization algorithms that rely on stochastic sampling of the solution space (genetic algorithms, tabu search and others). Is there any alternative algorithm that can perform this task?
What do I mean by redundancy?
Take the two-dimensional case as an example. Without the constraint, this would be a two-dimensional problem which would require optimizing two variables. However, due to the requirement that X1 + X2 == 1, one only needs to optimize one variable, since X2 is determined by X1 and vice versa. Had one decided to scan X1 and X2 independently and normalize them to a sum of 1, then many solution candidates would have been identical vis-a-vis the problem. For example, (X1==0.1, X2==0.1) is identical to (X1==0.5, X2==0.5).

If you are dealing with real-valued variables, then arriving at 2 samples that become identical is quite unlikely. However, you do have the problem that your samples would not be uniform: you are much more likely to choose (0.5, 0.5) than (1.0, 0). One way of fixing this is subsampling. Basically, when you are shrinking space onto a certain point, you shrink the probability of choosing it.
So basically what you are doing is mapping all the points inside the unit cube that lie in the same direction to a single point. These points in the same direction form a line. The longer the line, the larger the probability that you will choose the projected point. Hence you want to bias the probability of choosing a point by the inverse of the length of that line.
Here is code that can do it (assuming you want the x_i to sum to 1):
while (true) {
    maximum = 0;
    norm = 0;
    sum = 0;
    for (i = 0; i < N; i++) {
        x[i] = random(0, 1);
        maximum = max(x[i], maximum);
        sum += x[i];
        norm += x[i] * x[i];
    }
    norm = sqrt(norm);
    // length of the line through x, from the origin to the cube boundary
    length_of_line = norm / maximum;
    sample_probability = 1 / length_of_line;
    if (sum == 0 || random(0, 1) > sample_probability) {
        continue;  // reject this sample and draw again
    }
    for (i = 0; i < N; i++) {
        x[i] = x[i] / sum;
    }
    return x;
}

Here is the same function provided earlier by Amit Prakash, translated to Python:
import numpy as np

def f(N):
    while True:
        x = np.random.rand(N)
        mxm = np.max(x)
        theSum = np.sum(x)
        nrm = np.sqrt(np.sum(x * x))
        length_of_line = nrm / mxm
        sample_probability = 1 / length_of_line
        if theSum == 0 or np.random.rand() > sample_probability:
            continue  # reject this sample and draw again
        return x / theSum


Evenly space n items over m iterations

For context, this is to control multiple stepper motors simultaneously in a high-accuracy application.
Problem statement
Say I have a loop that will run i iterations. Over the course of those iterations, expression E_x should evaluate to true x times (x <= i is guaranteed).
- E_x must evaluate to true exactly x times
- E_x must evaluate to true at more or less evenly spaced intervals*
* "evenly spaced intervals" means that the maximum interval size is minimized
For: i = 10, x = 7
E_x will be true on iterations marked 1: 1101101101
For: i = 10, x = 3
E_x will be true on iterations marked 1: 0010010010
For: i = 10, x = 2
E_x will be true on iterations marked 1: 0001000100
What is the best (or even "a good") way to have E_x evaluate to true at evenly spaced intervals while guaranteeing that it is true exactly x times?
This question is close to mine, however it assumes that E_x will always evaluate to true in the 1st and last iterations, which does not meet my requirements (see 2nd example above).
I'll use a slightly different naming convention: let there be T intervals [1..T] and N events to be fired. Also, let's solve the problem as a cyclic one. To do that, let's add one fake step at the end at which we are guaranteed to fire an event (this will also be the event at time 0, i.e. before the cycle). So my T is your i+1 and my N is your x+1.
If you divide T by N with remainder you'll get T = w*N + r. If r=0 the case is trivial. If r != 0 the best you can achieve is r intervals of size w+1 and (N-r) intervals of size w. A fast and simple but good enough solution would be something like this (pseudocode):
events = []
w = T / N   // integer division
r = T % N
current = 0
for (i = 1; i <= N; i++) {
    current += w;
    if (i <= r)
        current += 1;
    events[i] = current;
}
You can see that the last value in the array will be T as was promised by our re-statement as a cyclic problem. It will be T because over the cycle we'll add w to current N times and add r times 1, so the sum will be w*N+r which is T.
The main drawback of this solution is that all the "long" intervals will be at the start while all the "short" intervals will be at the end.
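The pseudocode above could be sketched in Python roughly as follows. The function name event_times is made up for illustration, and the fake closing event at time T is dropped before returning, on the assumption that only the real iterations 1..i matter to the caller.

```python
def event_times(i, x):
    """Return the 1-based iterations at which the event fires."""
    T, N = i + 1, x + 1            # add the fake closing step and event
    w, r = divmod(T, N)            # T = w*N + r
    current = 0
    events = []
    for k in range(1, N + 1):
        # the first r intervals are "long" (w + 1), the rest are "short" (w)
        current += w + (1 if k <= r else 0)
        events.append(current)
    return events[:-1]             # drop the fake event at time T


print(event_times(10, 3))  # [3, 6, 9]
print(event_times(10, 2))  # [4, 8]
```

These two calls reproduce the i = 10, x = 3 and i = 10, x = 2 patterns from the question.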
You can spread intervals more evenly if you are a bit smarter. And the resulting logic will be essentially the same as it is behind Bresenham's line algorithm referenced in comments. Imagine you are drawing a line on a plane, where X-axis represents time and Y-axis represents events, from (0,0) (which is the 0-th event, before your timeframe) to (i+1, x+1) (which is the x+1-th event, just after your timeframe). The moment to raise an event is when you switch to the next Y i.e. draw the first pixel at a given Y.
If you want to do x increments over n iterations, you can do it like this:
int incCount = 0;
int iterCount = 0;

boolean step() {
    ++iterCount;
    int nextCount = (iterCount*x + n/2) / n; // this is rounding division
    if (nextCount > incCount) {
        ++incCount;
        return true;
    }
    else {
        return false;
    }
}
That's the easy-to-understand way. If you're on an embedded CPU where division is more expensive, you can accomplish exactly the same thing like this:
int accum = n/2;

boolean step() {
    accum += x;
    if (accum >= n) {
        accum -= n;
        return true;
    }
    else {
        return false;
    }
}
The total amount added to accum here is iterCount*x + n/2 just like the first example, but the division is replaced with an incremental repeated subtraction. This is the way that Bresenham's line drawing algorithm works.
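For anyone who wants to try the accumulator variant quickly, here is a small Python sketch of it. The closure-based make_stepper helper is only an illustration and not part of the answer above; x and n have the same meaning as in the text.

```python
def make_stepper(x, n):
    """Return a step() function that is True exactly x times in every n calls."""
    accum = n // 2                 # start half-way so that the division rounds

    def step():
        nonlocal accum
        accum += x
        if accum >= n:
            accum -= n             # incremental replacement for the division
            return True
        return False

    return step


step = make_stepper(7, 10)
pattern = "".join("1" if step() else "0" for _ in range(10))
print(pattern)  # 1011101101
```

Only additions, subtractions and comparisons happen per step, which is the point of this variant on CPUs where division is expensive.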

Improving performance of interpolation (Barycentric formula)

I have been given an assignment in which I am supposed to write an algorithm which performs polynomial interpolation by the barycentric formula. The formula states that:
p(x) = (SIGMA_(j=0 to n) w(j)*f(j)/(x - x(j))) / (SIGMA_(j=0 to n) w(j)/(x - x(j)))
I have written an algorithm which works just fine, and I get the polynomial output I desire. However, this requires the use of some quite long loops, and for a large number of grid points, lots of nasty loop operations will have to be done. Thus, I would appreciate it greatly if anyone has any hints as to how I may improve this, so that I can avoid all these loops.
In the algorithm, x and f stand for the given points we are supposed to interpolate. w stands for the barycentric weights, which have been calculated before running the algorithm. And grid is the linspace over which the interpolation should take place:
function p = barycentric_formula(x,f,w,grid)
%Assert x-vectors and f-vectors have same length.
if length(x) ~= length(f)
    sprintf('Not equal amounts of x- and y-values. Function is terminated.')
    return;
end
n = length(x);
m = length(grid);
p = zeros(1,m);
% Loops for finding polynomial values at grid points. All values are
% calculated by the barycentric formula.
for i = 1:m
    var = 0;
    sum1 = 0;
    sum2 = 0;
    for j = 1:n
        if grid(i) == x(j)
            p(i) = f(j);
            var = 1;
        else
            sum1 = sum1 + (w(j)*f(j))/(grid(i) - x(j));
            sum2 = sum2 + (w(j)/(grid(i) - x(j)));
        end
    end
    if var == 0
        p(i) = sum1/sum2;
    end
end
This is a classical case for matlab 'vectorization'. I would say - just remove the loops. It is almost that simple. First, have a look at this code:
function p = bf2(x, f, w, grid)
m = length(grid);
p = zeros(1,m);
for i = 1:m
    var = grid(i)==x;
    if any(var)
        p(i) = f(var);
    else
        sum1 = sum((w.*f)./(grid(i) - x));
        sum2 = sum(w./(grid(i) - x));
        p(i) = sum1/sum2;
    end
end
I have removed the inner loop over j. All I did here was in fact removing the (j) indexing and changing the arithmetic operators from / to ./ and from * to .* - the same, but with a dot in front to signify that the operation is performed on element by element basis. This is called array operators in contrast to ordinary matrix operators. Also note that treating the special case where the grid points fall onto x is very similar to what you had in the original implementation, only using a vector var such that x(var)==grid(i).
Now, you can also remove the outermost loop. This is a bit more tricky and there are two major approaches how you can do that in MATLAB. I will do it the simpler way, which can be less efficient, but more clear to read - using repmat:
function p = bf3(x, f, w, grid)
% Find grid points that coincide with x.
% The below compares all grid values with all x values
% and returns a matrix of 0/1. 1 is in the (row,col)
% for which grid(row)==x(col)
var = bsxfun(@eq, grid', x);
% find the logical indexes of those x entries
varx = sum(var, 1)~=0;
% and of those grid entries
varp = sum(var, 2)~=0;
% Outer-most loop removal - use repmat to
% replicate the vectors into matrices.
% Thus, instead of having a loop over j
% you have matrices of values that would be
% referenced in the loop
ww = repmat(w, numel(grid), 1);
ff = repmat(f, numel(grid), 1);
xx = repmat(x, numel(grid), 1);
gg = repmat(grid', 1, numel(x));
% perform the calculations element-wise on the matrices
sum1 = sum((ww.*ff)./(gg - xx),2);
sum2 = sum(ww./(gg - xx),2);
p = sum1./sum2;
% fix the case where grid==x and return
p(varp) = f(varx);
The fully vectorized version can be implemented with bsxfun rather than repmat. This can potentially be a bit faster, since the matrices are not explicitly formed. However, the speed difference may not be large for small system sizes.
Also, the first solution with one loop is also not too bad performance-wise. I suggest you test those and see, what is better. Maybe it is not worth it to fully vectorize? The first code looks a bit more readable..
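For comparison, the same fully vectorized idea can be sketched in Python with NumPy broadcasting, which plays the role of bsxfun here. This is only an illustration under my own assumptions: the function names are made up, and barycentric_weights is included just so the example runs on its own (the question computes w beforehand).

```python
import numpy as np


def barycentric_weights(x):
    """w_j = 1 / prod_{k != j} (x_j - x_k); helper so the example is self-contained."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]
    np.fill_diagonal(diff, 1.0)            # skip the k == j factor
    return 1.0 / diff.prod(axis=1)


def barycentric_interp(x, f, w, grid):
    x, f, w, grid = (np.asarray(a, dtype=float) for a in (x, f, w, grid))
    d = grid[:, None] - x[None, :]          # (m, n) matrix of grid(i) - x(j)
    hit = d == 0.0                          # grid points that coincide with a node
    d[hit] = 1.0                            # placeholder to avoid division by zero
    terms = w / d
    p = (terms * f).sum(axis=1) / terms.sum(axis=1)
    rows, cols = np.nonzero(hit)
    p[rows] = f[cols]                       # fix the case where grid == x
    return p


x = np.array([0.0, 1.0, 2.0, 3.0])
f = x ** 2
grid = np.linspace(0.0, 3.0, 7)
p = barycentric_interp(x, f, barycentric_weights(x), grid)
```

Since f is a quadratic sampled at four nodes, p should agree with grid**2 up to rounding, which makes a convenient sanity check.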

Randomly Generate a set of numbers of n length totaling x

I'm working on a project for fun and I need an algorithm to do as follows:
Generate a list of numbers of Length n which add up to x
I would settle for list of integers, but ideally, I would like to be left with a set of floating point numbers.
I would be very surprised if this problem wasn't heavily studied, but I'm not sure what to look for.
I've tackled similar problems in the past, but this one is decidedly different in nature. Before I've generated different combinations of a list of numbers that will add up to x. I'm sure that I could simply bruteforce this problem but that hardly seems like the ideal solution.
Anyone have any idea what this may be called, or how to approach it? Thanks all!
Edit: To clarify, I mean that the list should be length N while the numbers themselves can be of any size.
edit2: Sorry for my improper use of 'set', I was using it as a catch all term for a list or an array. I understand that it was causing confusion, my apologies.
This is how to do it in Python
import random

def random_values_with_prescribed_sum(n, total):
    x = [random.random() for i in range(n)]
    k = total / sum(x)
    return [v * k for v in x]
Basically you pick n random numbers, compute their sum and compute a scale factor so that the sum will be what you want it to be.
Note that this approach will not produce "uniform" slices, i.e. the distribution you get will tend to be more "egalitarian" than it would be if it were picked at random among all distributions with the given sum.
To see the reason you can just picture what the algorithm does in the case of two numbers with a prescribed sum (e.g. 1):
The point P is a generic point obtained by picking two random numbers, and it is uniform inside the square [0,1]x[0,1]. The point Q is the point obtained by scaling P so that the sum is 1. Points close to the center of the segment from (1,0) to (0,1) have a higher probability; for example, the exact center (0.5, 0.5) is obtained by projecting any point on the diagonal (0,0)-(1,1), while the point (0, 1) is obtained only by projecting points on (0,0)-(0,1). The diagonal length is sqrt(2)=1.4142..., while the square side is only 1.0.
Actually, you need to generate a partition of x into n parts. This is usually done in the following way: a partition of x into n non-negative parts can be represented by reserving n + x - 1 free places, putting n - 1 borders in some arbitrary places, and stones in the rest. The stone groups add up to x, thus the number of possible partitions is the binomial coefficient (n + x - 1 \atop n - 1).
So your algorithm could be as follows: choose an arbitrary (n - 1)-subset of an (n + x - 1)-set; it uniquely determines a partition of x into n parts.
In Knuth's TAOCP, section 3.4.2 discusses random sampling. See Algorithm S there.
Algorithm S: (choose n arbitrary records from a total of N)
1. t = 0, m = 0;
2. u = random, uniformly distributed on (0, 1)
3. if (N - t)*u >= n - m, skip the t-th record and increase t by 1; otherwise include the t-th record in the sample and increase m and t by 1
4. if m < n, return to step 2; otherwise, the algorithm is finished
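As a rough Python sketch of the integer case: the code below assumes the standard stars-and-bars layout (n - 1 borders among n + x - 1 places) and uses random.sample from the standard library in place of Algorithm S to pick the border positions. The function name is made up for illustration.

```python
import random


def random_integer_parts(n, x, rng=random):
    """Split the integer x into n non-negative integer parts."""
    places = n + x - 1
    # choose where the n - 1 borders go; everything else is a stone
    borders = sorted(rng.sample(range(places), n - 1))
    parts = []
    prev = -1
    for b in borders + [places]:
        parts.append(b - prev - 1)   # stones between two consecutive borders
        prev = b
    return parts


parts = random_integer_parts(4, 20)
print(parts, sum(parts))
```

Every border placement is equally likely, so each of the possible splits is produced with the same probability.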
The solution for non-integers is algorithmically trivial: you just select arbitrary n numbers that don't sum up to 0, and norm them by their sum.
If you want to sample uniformly in the region of N-1-dimensional space defined by x1 + x2 + ... + xN = x, then you're looking at a special case of sampling from a Dirichlet distribution. The sampling procedure is a little more involved than generating uniform deviates for the xi. Here's one way to do it, in Python:
import random
xs = [random.gammavariate(1, 1) for a in range(N)]
s = sum(xs)  # compute the normalizing sum once
xs = [x * v / s for v in xs]
If you don't care too much about the sampling properties of your results, you can just generate uniform deviates and correct their sum afterwards.
Here is a version of the above algorithm in Javascript
function getRandomArbitrary(min, max) {
    return Math.random() * (max - min) + min;
}

function getRandomArray(min, max, n) {
    var arr = [];
    for (var i = 0, l = n; i < l; i++) {
        arr.push(getRandomArbitrary(min, max));
    }
    return arr;
}

function randomValuesPrescribedSum(min, max, n, total) {
    var arr = getRandomArray(min, max, n);
    var sum = arr.reduce(function(pv, cv) { return pv + cv; }, 0);
    var k = total/sum;
    var delays = arr.map(function(x) { return k*x; });
    return delays;
}
You can call it with
var myarray = randomValuesPrescribedSum(0,1,3,3);
And then check it with
var sum = myarray.reduce(function(pv, cv) { return pv + cv;},0);
This code does a reasonable job. I think it produces a different distribution than 6502's answer, but I am not sure which is better or more natural. Certainly his code is clearer/nicer.
import random

def parts(total_sum, num_parts):
    # cut [0, 1] at num_parts - 1 random points and use the gaps between cuts
    points = [random.random() for i in range(num_parts-1)]
    points.append(0)
    points.append(1)
    points.sort()
    ret = []
    for i in range(1, len(points)):
        ret.append((points[i] - points[i-1]) * total_sum)
    return ret

def test(total_sum, num_parts):
    ans = parts(total_sum, num_parts)
    assert abs(sum(ans) - total_sum) < 1e-7
    print(ans)

test(5.5, 3)
test(10, 1)
test(10, 5)
In python:
a: create a list of (random #'s 0 to 1) times total; append 0 and total to the list
b: sort the list, measure the distance between each element
c: round the list elements
import random
import time

TOTAL = 15
PARTS = 4
PLACES = 3

def random_sum_split(parts, total, places):
    a = [0, total] + [random.random()*total for i in range(parts-1)]
    a.sort()
    b = [(a[i] - a[i-1]) for i in range(1, (parts+1))]
    if places is None:
        return b
    # round all parts but the last; the last part absorbs the rounding error
    c = [round(x, places) for x in b[:-1]]
    c.append(round(total-sum(c), places))
    return c

# info and log come from the environment this snippet was run in
def tick():
    if info.tick == 1:
        start = time.time()
        alpha = random_sum_split(PARTS, TOTAL, PLACES)
        end = time.time()
        log('alpha: %s' % alpha)
        log('total: %.7f' % sum(alpha))
        log('parts: %s' % PARTS)
        log('places: %s' % PLACES)
        log('elapsed: %.7f' % (end-start))
[2014-06-13 01:00:00] alpha: [0.154, 3.617, 6.075, 5.154]
[2014-06-13 01:00:00] total: 15.0000000
[2014-06-13 01:00:00] parts: 4
[2014-06-13 01:00:00] places: 3
[2014-06-13 01:00:00] elapsed: 0.0005839
To the best of my knowledge, this distribution is uniform.

Rolling variance algorithm

I'm trying to find an efficient, numerically stable algorithm to calculate a rolling variance (for instance, a variance over a 20-period rolling window). I'm aware of the Welford algorithm that efficiently computes the running variance for a stream of numbers (it requires only one pass), but am not sure if this can be adapted for a rolling window. I would also like the solution to avoid the accuracy problems discussed at the top of this article by John D. Cook. A solution in any language is fine.
I've run across this problem as well. There are some great posts out there on computing the running cumulative variance, such as John D. Cook's Accurately computing running variance post and the post from Digital explorations, Python code for computing sample and population variances, covariance and correlation coefficient. I just could not find any that were adapted to a rolling window.
The Running Standard Deviations post by Subluminal Messages was critical in getting the rolling window formula to work. Jim takes the power sum of the squared differences of the values versus Welford’s approach of using the sum of the squared differences of the mean. Formula as follows:
PSA_today = PSA_yesterday + ((x_today * x_today) - PSA_yesterday) / n
x = value in your time series
n = number of values you've analyzed so far.
But to convert the Power Sum Average formula to a windowed variety, you need to tweak the formula to the following:
PSA_today = PSA_yesterday + ((x_today * x_today) - (x_(today-n) * x_(today-n))) / n
x = value in your time series
n = period used for your rolling window.
You'll also need the Rolling Simple Moving Average formula:
SMA_today = SMA_yesterday + (x_today - x_(today-n)) / n
x = value in your time series
n = period used for your rolling window.
From there you can compute the Rolling Population Variance:
Population Var_today = (PSA_today * n - n * SMA_today * SMA_today) / n
Or the Rolling Sample Variance:
Sample Var_today = (PSA_today * n - n * SMA_today * SMA_today) / (n - 1)
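To make the formulas above concrete, here is a minimal Python sketch of the rolling population variance. The function name and the warm-up handling (plain running averages until the window is full) are my own assumptions. Note that this is the E(x^2) - E(x)^2 form, so it can lose precision when the variance is tiny compared to the mean.

```python
from collections import deque


def rolling_population_variance(values, n):
    """Population variance of each full window of length n."""
    window = deque()
    sma = 0.0          # simple moving average
    psa = 0.0          # power sum average (mean of the squares)
    out = []
    for x in values:
        window.append(x)
        if len(window) <= n:
            # warm-up: running averages over the first k values
            k = len(window)
            sma += (x - sma) / k
            psa += (x * x - psa) / k
        else:
            # full window: swap the oldest value for the newest
            old = window.popleft()
            sma += (x - old) / n
            psa += (x * x - old * old) / n
        if len(window) == n:
            out.append((psa * n - n * sma * sma) / n)
    return out


print(rolling_population_variance([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3))
```

For the sample variance, divide by (n - 1) instead of n in the last step.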
I've covered this topic along with sample Python code in a blog post a few years back, Running Variance.
Hope this helps.
I have been dealing with the same issue.
Mean is simple to compute iteratively, but you need to keep the complete history of values in a circular buffer.
next_index = (index + 1) % window_size; // oldest x value is at next_index, wrapping if necessary.
new_mean = mean + (x_new - xs[next_index])/window_size;
I have adapted Welford's algorithm and it works for all the values that I have tested with.
varSum = varSum + (x_new - mean) * (x_new - new_mean) - (xs[next_index] - mean) * (xs[next_index] - new_mean);
xs[next_index] = x_new;
index = next_index;
mean = new_mean;
To get the current variance just divide varSum by the window size: variance = varSum / window_size;
If you prefer code over words (heavily based on DanS' post):
public IEnumerable<double> RollingSampleVariance(IEnumerable<double> data, int sampleSize)
{
    double mean = 0;
    double accVar = 0;
    int n = 0;
    var queue = new Queue<double>(sampleSize);
    foreach (var observation in data)
    {
        queue.Enqueue(observation);
        if (n < sampleSize)
        {
            // Calculating first variance
            n++;
            double delta = observation - mean;
            mean += delta / n;
            accVar += delta * (observation - mean);
        }
        else
        {
            // Adjusting variance
            double then = queue.Dequeue();
            double prevMean = mean;
            mean += (observation - then) / sampleSize;
            accVar += (observation - prevMean) * (observation - mean) - (then - prevMean) * (then - mean);
        }
        if (n == sampleSize)
            yield return accVar / (sampleSize - 1);
    }
}
Actually, Welford's algorithm can AFAICT easily be adapted to compute weighted variance.
And by setting weights to -1, you should be able to effectively cancel out elements. I haven't checked whether the math allows negative weights, but at first glance it should!
I did perform a small experiment using ELKI:
void testSlidingWindowVariance() {
  MeanVariance mv = new MeanVariance(); // ELKI implementation of weighted Welford!
  MeanVariance mc = new MeanVariance(); // Control.
  Random r = new Random();
  double[] data = new double[1000];
  for (int i = 0; i < data.length; i++) {
    data[i] = r.nextDouble();
  }
  // Pre-roll:
  for (int i = 0; i < 10; i++) {
    mv.put(data[i]);
  }
  // Compare to window approach
  for (int i = 10; i < data.length; i++) {
    mv.put(data[i-10], -1.); // Remove
    mv.put(data[i]); // Add
    mc.reset(); // Reset statistics
    for (int j = i - 9; j <= i; j++) {
      mc.put(data[j]);
    }
    assertEquals("Variance does not agree.", mv.getSampleVariance(),
      mc.getSampleVariance(), 1e-14);
  }
}
I get around ~14 digits of precision compared to the exact two-pass algorithm; this is about as much as can be expected from doubles. Note that Welford does come at some computational cost because of the extra divisions - it takes about twice as long as the exact two-pass algorithm. If your window size is small, it may be much more sensible to actually recompute the mean and then in a second pass the variance every time.
I have added this experiment as unit test to ELKI, you can see the full source here: http://elki.dbs.ifi.lmu.de/browser/elki/trunk/test/de/lmu/ifi/dbs/elki/math/TestSlidingVariance.java
it also compares to the exact two-pass variance.
However, on skewed data sets, the behaviour might be different. This data set obviously is uniformly distributed; but I've also tried a sorted array and it worked.
Update: we published a paper with details on different weighting schemes for (co-)variance:
Schubert, Erich, and Michael Gertz. "Numerically stable parallel computation of (co-) variance." Proceedings of the 30th International Conference on Scientific and Statistical Database Management. ACM, 2018. (Won the SSDBM best-paper award.)
This also discusses how weighting can be used to parallelize the computation, e.g., with AVX, GPUs, or on clusters.
Here's a divide and conquer approach that has O(log k)-time updates, where k is the number of samples. It should be relatively stable for the same reasons that pairwise summation and FFTs are stable, but it's a bit complicated and the constant isn't great.
Suppose we have a sequence A of length m with mean E(A) and variance V(A), and a sequence B of length n with mean E(B) and variance V(B). Let C be the concatenation of A and B. We have
p = m / (m + n)
q = n / (m + n)
E(C) = p * E(A) + q * E(B)
V(C) = p * (V(A) + (E(A) + E(C)) * (E(A) - E(C))) + q * (V(B) + (E(B) + E(C)) * (E(B) - E(C)))
Now, stuff the elements in a red-black tree, where each node is decorated with mean and variance of the subtree rooted at that node. Insert on the right; delete on the left. (Since we're only accessing the ends, a splay tree might be O(1) amortized, but I'm guessing amortized is a problem for your application.) If k is known at compile-time, you could probably unroll the inner loop FFTW-style.
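The combination step above is the part that each tree node would apply to its two children. A minimal Python sketch of just that step follows; the function name and the tuple layout are made up for illustration, the variances are population variances, and the tree bookkeeping itself is not shown.

```python
def merge_stats(m, mean_a, var_a, n, mean_b, var_b):
    """Combine (count, mean, population variance) of A and B into those of C = A + B."""
    p = m / (m + n)
    q = n / (m + n)
    mean_c = p * mean_a + q * mean_b
    var_c = (p * (var_a + (mean_a + mean_c) * (mean_a - mean_c))
             + q * (var_b + (mean_b + mean_c) * (mean_b - mean_c)))
    return m + n, mean_c, var_c


# A = [1, 2, 3, 4] has mean 2.5 and variance 1.25;
# B = [10, 20, 30] has mean 20 and variance 200 / 3.
print(merge_stats(4, 2.5, 1.25, 3, 20.0, 200.0 / 3.0))
```

A leaf holding a single sample would simply be (1, value, 0.0).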
I know this question is old, but in case someone else is interested, here follows the Python code. It is inspired by the johndcook blog post, @Joachim's and @DanS's code, and @Jaime's comments. The code below still gives small imprecisions for small data window sizes. Enjoy.
from __future__ import division
import collections
import math

class RunningStats:
    def __init__(self, WIN_SIZE=20):
        self.n = 0
        self.mean = 0
        self.run_var = 0
        self.WIN_SIZE = WIN_SIZE
        self.windows = collections.deque(maxlen=WIN_SIZE)

    def clear(self):
        self.n = 0
        self.mean = 0
        self.run_var = 0
        self.windows.clear()

    def push(self, x):
        if self.n < self.WIN_SIZE:
            # Calculating first variance
            self.windows.append(x)
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.run_var += delta * (x - self.mean)
        else:
            # Adjusting variance
            x_removed = self.windows.popleft()
            self.windows.append(x)
            old_m = self.mean
            self.mean += (x - x_removed) / self.WIN_SIZE
            self.run_var += (x + x_removed - old_m - self.mean) * (x - x_removed)

    def get_mean(self):
        return self.mean if self.n else 0.0

    def get_var(self):
        # n equals WIN_SIZE once the window is full
        return self.run_var / (self.n - 1) if self.n > 1 else 0.0

    def get_std(self):
        return math.sqrt(self.get_var())

    def get_all(self):
        return list(self.windows)

    def __str__(self):
        return "Current window values: {}".format(list(self.windows))
I look forward to be proven wrong on this but I don't think this can be done "quickly." That said, a large part of the calculation is keeping track of the EV over the window which can be done easily.
I'll leave with the question: are you sure you need a windowed function? Unless you are working with very large windows it is probably better to just use a well known predefined algorithm.
I guess keeping track of your 20 samples, Sum(X^2 from 1..20), and Sum(X from 1..20) and then successively recomputing the two sums at each iteration isn't efficient enough? It's possible to recompute the new variance without adding up, squaring, etc., all of the samples each time.
As in:
Sum(X^2 from 2..21) = Sum(X^2 from 1..20) - X_1^2 + X_21^2
Sum(X from 2..21) = Sum(X from 1..20) - X_1 + X_21
Here's another O(log k) solution: square the original sequence, then sum pairs, then quadruples, etc. (You'll need a bit of a buffer to be able to find all of these efficiently.) Then add up the values that you need to get your answer. For example:
||||||||||||||||||||||||| // Squares
| | | | | | | | | | | | | // Sum of squares for pairs
| | | | | | | // Pairs of pairs
| | | | // (etc.)
| |
^------------------^ // Want these 20, which you can get with
| | // one...
| | | | // two, three...
| | // four...
|| // five stored values.
Now you use your standard E(x^2)-E(x)^2 formula and you're done. (Not if you need good stability for small sets of numbers; this was assuming that it was only accumulation of rolling error that was causing issues.)
That said, summing 20 squared numbers is very fast these days on most architectures. If you were doing more--say, a couple hundred--a more efficient method would clearly be better. But I'm not sure that brute force isn't the way to go here.
For only 20 values, it's trivial to adapt the method exposed here (I didn't say fast, though).
You can simply pick up an array of 20 of these RunningStat classes.
The first 20 elements of the stream are somewhat special; however, once this is done, it's much simpler:
- when a new element arrives, clear the current RunningStat instance, add the element to all 20 instances, and increment the "counter" (modulo 20) which identifies the new "full" RunningStat instance
- at any given moment, you can consult the current "full" instance to get your running variance.
You will obviously note that this approach isn't really scalable...
You can also note that there is some redundancy in the numbers we keep (if you go with the full RunningStat class). An obvious improvement would be to keep the last 20 Mk and Sk directly.
I cannot think of a better formula using this particular algorithm, I am afraid that its recursive formulation somewhat ties our hands.
This is just a minor addition to the excellent answer provided by DanS. The following equations are for removing the oldest sample from the window and updating the mean and variance. This is useful, for example, if you want to take smaller windows near the right edge of your input data stream (i.e. just remove the oldest window sample without adding a new sample).
window_size -= 1; % decrease window size by 1 sample
new_mean = prev_mean + (prev_mean - x_old) / window_size
varSum = varSum - (prev_mean - x_old) * (new_mean - x_old)
Here, x_old is the oldest sample in the window you wish to remove.
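As a quick way to check these removal equations, here is a small Python sketch. The function name is made up for illustration, var_sum is the sum of squared deviations (the varSum above), and the window is assumed to hold at least two samples before the removal.

```python
def remove_oldest(window_size, mean, var_sum, x_old):
    """Drop x_old from the window and return the updated (size, mean, var_sum)."""
    window_size -= 1                     # requires at least one sample to remain
    new_mean = mean + (mean - x_old) / window_size
    var_sum -= (mean - x_old) * (new_mean - x_old)
    return window_size, new_mean, var_sum


# Window [2, 4, 4, 4, 5, 5, 7, 9] has mean 5 and var_sum 32.
size, mean, var_sum = remove_oldest(8, 5.0, 32.0, 2.0)
print(size, mean, var_sum / size)        # population variance of the 7 remaining samples
```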
For those coming here now, here's a reference containing the full derivation, with proofs, of DanS's answer and Jaime's related comment.
DanS and Jaime's response in concise C.
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t n, i;
    float *samples, mean, var;
} rolling_var_t;

void rolling_var_init(rolling_var_t *c, size_t window_size) {
    size_t ss;
    memset(c, 0, sizeof(*c));
    c->n = window_size;
    c->samples = (float *) malloc(ss = sizeof(float)*window_size);
    memset(c->samples, 0, ss);
}

void rolling_var_add(rolling_var_t *c, float x) {
    float nmean; // new mean
    float xold; // oldest x
    float dx;
    c->i = (c->i + 1) % c->n;
    xold = c->samples[c->i];
    dx = x - xold;
    nmean = c->mean + dx / (float) c->n; // walk mean
    //c->var += ((x - c->mean)*(x - nmean) - (xold - c->mean) * (xold - nmean)) / (float) c->n;
    c->var += ((x + xold - c->mean - nmean) * dx) / (float) c->n;
    c->mean = nmean;
    c->samples[c->i] = x;
}

Algorithm for sampling without replacement?

I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value.
I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?
For example (these data are just a simplified example):
Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5
For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.
To do this, I could generate random numbers in the range 1-12, assign the first element of group A, then generate random numbers in the range 1-11 and assign the second element in group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but the sampling without replacement seems like a problem that might have been solved many times before.
Logic or pseudocode preferred.
Here's some code for sampling without replacement based on Algorithm 3.4.2S of Knuth's book Seminumerical Algorithms.
void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indices to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;
    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;
    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator
        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}
There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.
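For readers who prefer something they can run directly, here is a rough Python sketch of the same selection-sampling loop (Algorithm S, not Vitter's method). The function name and the argument check are my own additions for illustration.

```python
import random


def sample_without_replacement(population_size, sample_size, rng=random):
    """Return sample_size distinct indices in [0, population_size), in increasing order."""
    if not 0 <= sample_size <= population_size:
        raise ValueError("sample_size must be between 0 and population_size")
    samples = []
    t = 0  # total input records dealt with
    m = 0  # number of items selected so far
    while m < sample_size:
        u = rng.random()  # uniform on [0, 1)
        if (population_size - t) * u >= sample_size - m:
            t += 1        # skip record t
        else:
            samples.append(t)
            t += 1
            m += 1
    return samples


print(sample_without_replacement(100, 5))
```

The indices come out sorted, so shuffle the result afterwards if the order of the sample matters.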
A C++ working code based on the answer by John D. Cook.
#include <random>
#include <vector>

// John D. Cook, https://stackoverflow.com/a/311716/15485
void SampleWithoutReplacement
(
    int populationSize,         // size of set sampling from
    int sampleSize,             // size of each sample
    std::vector<int> & samples  // output, zero-offset indices to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;
    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    std::default_random_engine re;
    std::uniform_real_distribution<double> dist(0,1);
    while (m < n)
    {
        double u = dist(re); // call a uniform(0,1) random number generator
        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

#include <iostream>
int main(int,char**)
{
    const size_t sz = 10;
    std::vector< int > samples(sz);
    SampleWithoutReplacement(10*sz, sz, samples); // draw sz indices out of 10*sz
    for (size_t i = 0; i < sz; i++ ) {
        std::cout << samples[i] << "\t";
    }
    return 0;
}
See my answer to this question Unique (non-repeating) random numbers in O(1)?. The same logic should accomplish what you are looking to do.
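The linked answer is not reproduced here. Since every value gets reassigned in each replicate, one option is to shuffle all the values and then cut the shuffled list into groups of the original sizes. The sketch below uses a Fisher-Yates shuffle for that; this is my assumption about the kind of logic meant, and the function name is made up for illustration.

```python
import random


def reassign_groups(values, group_sizes, rng=random):
    """Randomly reassign values to groups, keeping the group sizes fixed."""
    shuffled = list(values)
    # Fisher-Yates shuffle: every permutation is equally likely
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)            # inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    groups = []
    start = 0
    for size in group_sizes:
        groups.append(shuffled[start:start + size])
        start += size
    return groups


data = [0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.4, 0.5, 0.2, 0.2, 0.3, 0.5]
print(reassign_groups(data, [3, 3, 2, 4]))
```

Each replicate costs O(n) swaps and needs no extra allocation beyond the copy of the values, which fits the pre-allocated setup described in the question.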
Inspired by @John D. Cook's answer, I wrote an implementation in Nim. At first I had difficulties understanding how it works, so I commented extensively, also including an example. Maybe it helps to understand the idea. Also, I have changed the variable names slightly.
iterator uniqueRandomValuesBelow*(N, M: int): int =
  ## Returns a total of M unique random values i with 0 <= i < N
  ## These indices can be used to construct e.g. a random sample without replacement
  assert(M <= N)
  var t = 0 # total input records dealt with
  var m = 0 # number of items selected so far
  while (m < M):
    let u = random(1.0) # call a uniform(0,1) random number generator
    # meaning of the following terms:
    # (N - t) is the total number of remaining draws left (initially just N)
    # (M - m) is the number of these remaining draws that must be positive (initially just M)
    # => Probability for next draw = (M-m) / (N-t)
    #    i.e.: (required positive draws left) / (total draws left)
    # This is implemented by the inequality expression below:
    # - the larger (M-m), the larger the probability of a positive draw
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
    # - for (N-t) >> (M-m), we must get a very small u
    # example: (N-t) = 7, (M-m) = 5
    # => we draw the next with prob 5/7
    #    let's assume the draw fails
    # => t += 1 => (N-t) = 6
    # => we draw the next with prob 5/6
    #    let's assume the draw succeeds
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
    # => we draw the next with prob 4/5
    #    let's assume the draw fails
    # => t += 1 => (N-t) = 4
    # => we draw the next with prob 4/4, i.e.,
    #    we will draw with certainty from now on
    #    (in the next steps we get prob 3/3, 2/2, ...)
    if (N - t).toFloat * u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
      # no draw -- happens mainly for (N-t) >> (M-m) and/or high u
      t += 1
    else:
      # draw t -- happens when (M-m) gets large and/or low u
      yield t # this is where we output an index, can be used to sample
      t += 1
      m += 1

# example use
for i in uniqueRandomValuesBelow(100, 5):
  echo i
When the population size is much greater than the sample size, the above algorithms become inefficient, since they have complexity O(n), n being the population size.
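To make the cost difference concrete, here is a minimal C++ sketch of a different technique, Robert Floyd's sampling algorithm, whose expected running time depends only on the sample size. It is not one of the algorithms from this thread; the function name `floydSample` and the choice of `std::mt19937_64` and `std::unordered_set` are my own assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Floyd's algorithm: returns sampleSize distinct integers from
// [0, populationSize). Expected cost is O(sampleSize) hash-set operations,
// independent of populationSize.
std::vector<std::uint64_t> floydSample(std::uint64_t populationSize,
                                       std::uint64_t sampleSize,
                                       std::mt19937_64& rng)
{
    assert(sampleSize <= populationSize);
    std::unordered_set<std::uint64_t> chosen;
    std::vector<std::uint64_t> result;
    result.reserve(sampleSize);

    for (std::uint64_t j = populationSize - sampleSize; j < populationSize; ++j) {
        // Draw t uniformly from the closed range [0, j].
        std::uniform_int_distribution<std::uint64_t> dist(0, j);
        std::uint64_t t = dist(rng);
        // If t is already in the sample, take j instead; j cannot have been
        // chosen before, because earlier draws were all below j.
        if (!chosen.insert(t).second) {
            t = j;
            chosen.insert(j);
        }
        result.push_back(t);
    }
    return result;
}
```

With this sketch, a call such as `floydSample(1000000000, 5, rng)` performs five iterations rather than scanning the population. The selected set is uniformly distributed, but the order of the returned values is not a uniformly random permutation, so shuffle the result if order matters.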
When I was a student I wrote some algorithms for uniform sampling without replacement with average complexity O(s log s), where s is the sample size. Here is the R code for the binary tree algorithm:
# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
if (s>n) {
stop("Sample size is greater than the number of items to choose from")
# upv=integer(s) #level up edge is pointing to
leftv=integer(s) #left edge is pointing to; must be filled with zeros
rightv=integer(s) #right edge is pointing to; must be filled with zeros
samp=integer(s) #the sample
ordn=integer(s) #relative ordinal number
ordn[1L]=1L #initial value for the root vertex
if (s > 1L) for (j in 2L:s) {
curn=sample(n-j+1L,1L) #current number sampled
curordn=0L #current ordinal number
v=1L #current vertex
from=1L #how we got here: 0 - by left edge, 1 - by right edge
repeat {
if (curn+curordn>samp[v]) { #going down by the right edge
if (from == 0L) {
if (rightv[v]!=0L) {
} else { #creating a new vertex
# upv[j]=v
} else { #going down by the left edge
if (from==1L) {
if (leftv[v]!=0L) {
} else { #creating a new vertex
# upv[j]=v
The complexity of this algorithm is discussed in:
Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54.
If you find the algorithm useful, please cite it.
See also:
P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209.
DOI: 10.1080/00207168408803438
J. Teuhola, O. Nevalainen. (1982) Two efficient algorithms for random sampling without replacement. International Journal of Computer Mathematics 11:2, pages 127-140.
DOI: 10.1080/00207168208803304
In the last paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R):
I wrote a survey of algorithms for sampling without replacement. I may be biased but I recommend my own algorithm, implemented in C++ below, as providing the best performance for many k, n values and acceptable performance for others. randbelow(i) is assumed to return a fairly chosen random non-negative integer less than i.
void cardchoose(uint32_t n, uint32_t k, uint32_t* result) {
    auto t = n - k + 1;
    for (uint32_t i = 0; i < k; i++) {
        uint32_t r = randbelow(t + i);
        if (r < t) {
            result[i] = r;
        } else {
            result[i] = result[r - t];
        }
    }
    std::sort(result, result + k);
    for (uint32_t i = 0; i < k; i++) {
        result[i] += i;
    }
}
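To try the function out, `randbelow` needs a definition. The following is a minimal sketch, assuming a `std::mt19937`-backed `randbelow`; the answer does not say how `randbelow` is implemented, so that helper, the global generator `g_rng`, and the wrapper `chooseSorted` are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

using std::uint32_t;

// Hypothetical generator and helper (not from the answer):
// randbelow(i) returns a uniform integer in [0, i). Requires i > 0.
static std::mt19937 g_rng{std::random_device{}()};

uint32_t randbelow(uint32_t i) {
    assert(i > 0);
    std::uniform_int_distribution<uint32_t> dist(0, i - 1);
    return dist(g_rng);
}

// cardchoose as given in the answer: writes k distinct values from [0, n)
// into result, in increasing order. Requires k <= n.
void cardchoose(uint32_t n, uint32_t k, uint32_t* result) {
    auto t = n - k + 1;
    for (uint32_t i = 0; i < k; i++) {
        uint32_t r = randbelow(t + i);
        if (r < t) {
            result[i] = r;
        } else {
            result[i] = result[r - t];
        }
    }
    std::sort(result, result + k);
    for (uint32_t i = 0; i < k; i++) {
        result[i] += i;
    }
}

// Hypothetical convenience wrapper that owns the output buffer.
std::vector<uint32_t> chooseSorted(uint32_t n, uint32_t k) {
    assert(k <= n);
    std::vector<uint32_t> out(k);
    cardchoose(n, k, out.data());
    return out;
}
```

Calling `chooseSorted(100, 5)` yields five distinct values below 100 in increasing order. The sort is what makes the final `+= i` step produce distinct values, so the output is always sorted; shuffle it afterwards if a random order is needed.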
Another algorithm for sampling without replacement is described here.
It is similar to the one described by John D. Cook in his answer, and it also comes from Knuth, but it rests on a different assumption: the population size is unknown, but the sample fits in memory. This one is called "Knuth's algorithm S".
Quoting the rosettacode article:
1. Select the first n items as the sample as they become available;
2. For the i-th item where i > n, have a random chance of n/i of keeping it. If failing this chance, the sample remains the same. If not, have it randomly (1/n) replace one of the previously selected n items of the sample.
3. Repeat #2 for any subsequent items.
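The quoted steps can be sketched in C++ roughly as follows. This is only an illustration under my own assumptions: the class name `Reservoir`, the `offer` method, and the use of `int` items with `std::mt19937` are not from the rosettacode article. Steps 2 and 3 are folded into a single draw: picking j uniformly from the i items seen so far gives j < n with probability n/i, and in that case j is itself a uniformly chosen slot to replace.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Keeps a uniform random sample of up to n items from a stream whose
// length is not known in advance.
class Reservoir {
public:
    explicit Reservoir(std::size_t n, unsigned seed = std::random_device{}())
        : capacity_(n), rng_(seed) { assert(n > 0); }

    // Feed the next item of the stream.
    void offer(int item) {
        ++seen_;
        if (sample_.size() < capacity_) {
            sample_.push_back(item);  // step 1: keep the first n items
            return;
        }
        // Step 2: j is uniform over [0, seen_ - 1]; j < n has probability
        // n / seen_, and then j is a uniform index into the sample.
        std::uniform_int_distribution<std::size_t> dist(0, seen_ - 1);
        std::size_t j = dist(rng_);
        if (j < capacity_) {
            sample_[j] = item;
        }
    }

    const std::vector<int>& sample() const { return sample_; }

private:
    std::size_t capacity_;
    std::size_t seen_ = 0;
    std::mt19937 rng_;
    std::vector<int> sample_;
};
```

A caller would construct `Reservoir r(n)`, call `r.offer(x)` for each incoming item, and read `r.sample()` at any point; the memory used stays proportional to n no matter how long the stream runs.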
