Given an array of n word-frequency pairs:
[ (w_0, f_0), (w_1, f_1), ..., (w_{n-1}, f_{n-1}) ]
where w_i is a word, f_i is an integer frequency, and the sum of the frequencies is ∑ f_i = m,
I want to use a pseudo-random number generator (pRNG) to select p words w_{j_0}, w_{j_1}, ..., w_{j_{p-1}} such that
the probability of selecting any word is proportional to its frequency:
P(w_i = w_{j_k}) = P(i = j_k) = f_i / m
(Note that this is selection with replacement, so the same word could be chosen every time.)
I've come up with three algorithms so far:
Create an array of size m, and populate it so the first f_0 entries are w_0, the next f_1 entries are w_1, and so on, so the last f_{n-1} entries are w_{n-1}:
[ w_0, ..., w_0, w_1, ..., w_1, ..., w_{n-1}, ..., w_{n-1} ]
Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices.
This takes O(n + m + p) work, which isn't great, since m can be much much larger than n.
Step through the input array once, computing m_i = ∑_{h≤i} f_h = m_{i-1} + f_i,
and after computing m_i, use the pRNG to generate a number x_k in the range 0...m_i-1 for each k in 0...p-1
and select w_i for w_{j_k} (possibly replacing the current value of w_{j_k}) if x_k < f_i.
This requires O(n + np) work.
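A minimal Python sketch of that single-pass approach (the function name is mine, not from the question):
import random

def sample_with_replacement_one_pass(pairs, p):
    # pairs: list of (word, frequency); returns p words chosen with
    # probability proportional to frequency (with replacement).
    result = [None] * p
    running_total = 0                      # this is m_i
    for word, freq in pairs:
        running_total += freq
        for k in range(p):
            # replace slot k with probability f_i / m_i
            if random.randrange(running_total) < freq:
                result[k] = word
    return result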
Compute m_i as in algorithm 2, and generate the following array of n word-frequency-partial-sum triples:
[ (w_0, f_0, m_0), (w_1, f_1, m_1), ..., (w_{n-1}, f_{n-1}, m_{n-1}) ]
and then, for each k in 0...p-1, use the pRNG to generate a number x_k in the range 0...m-1, then do binary search on the array of triples to find the i s.t. m_i - f_i ≤ x_k < m_i, and select w_i for w_{j_k}.
This requires O(n + p log n) work.
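A minimal Python sketch of algorithm 3, using the standard-library bisect module (names are mine):
import bisect
import random

def sample_with_replacement_binary_search(pairs, p):
    # pairs: list of (word, frequency)
    words = [w for w, f in pairs]
    cumulative = []                        # cumulative[i] == m_i
    total = 0
    for _, freq in pairs:
        total += freq
        cumulative.append(total)
    # draw x_k in 0...m-1 and find the first i with m_i > x_k,
    # i.e. the i with m_i - f_i <= x_k < m_i
    return [words[bisect.bisect_right(cumulative, random.randrange(total))]
            for _ in range(p)]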
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.
Look at Roulette Selection in Genetic Algorithms
You could create the target array, then loop through the words, determining the probability that each should be picked, and replace the words in the array according to a random number.
For the first word the probability would be f_0/m_0 (where m_n = f_0 + ... + f_n), i.e. 100%, so all positions in the target array would be filled with w_0.
For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words according to the frequency.
Example code in C#:
public class WordFrequency {
public string Word { get; private set; }
public int Frequency { get; private set; }
public WordFrequency(string word, int frequency) {
Word = word;
Frequency = frequency;
}
}
WordFrequency[] words = new WordFrequency[] {
new WordFrequency("Hero", 80),
new WordFrequency("Monkey", 4),
new WordFrequency("Shoe", 13),
new WordFrequency("Highway", 3),
};
int p = 7;
string[] result = new string[p];
int sum = 0;
Random rnd = new Random();
foreach (WordFrequency wf in words) {
sum += wf.Frequency;
for (int i = 0; i < p; i++) {
if (rnd.Next(sum) < wf.Frequency) {
result[i] = wf.Word;
}
}
}
Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:
There are n partitions, all of the same width r s.t. nr = m.
Each partition contains two words in some ratio (which is stored with the partition).
For each word w_i, f_i = ∑_{partitions t s.t. w_i ∈ t} r × ratio(t, w_i).
Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.
The reason that such a partitioning exists is that there exists a word w_i s.t. f_i < r, if and only if there exists a word w_{i'} s.t. f_{i'} > r, since r is the average of the frequencies.
Given such a pair w_i and w_{i'} we can replace them with a pseudo-word w'_i of frequency f'_i = r (that represents w_i with probability f_i/r and w_{i'} with probability 1 - f_i/r) and a new word w'_{i'} of adjusted frequency f'_{i'} = f_{i'} - (r - f_i) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words which are the desired partition.
To construct this partition in O(n) time,
go through the list of the words once, constructing two lists:
one of words with frequency ≤ r
one of words with frequency > r
then pull a word from the first list
if its frequency = r, then make it into a one element partition
otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.
This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'_i = n·f_i, which updates m' = m·n and sets r' = m when q = n.
In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.
In ruby:
def weighted_sample_with_replacement(input, p)
n = input.size
m = input.inject(0) { |sum,(word,freq)| sum + freq }
# find the words with frequency lesser and greater than average
lessers, greaters = input.map do |word,freq|
# pad the frequency so we can keep it integral
# when subdivided
[ word, freq*n ]
end.partition do |word,adj_freq|
adj_freq <= m
end
partitions = Array.new(n) do
word, adj_freq = lessers.shift
other_word = if adj_freq < m
# use part of another word's frequency to pad
# out the partition
other_word, other_adj_freq = greaters.shift
other_adj_freq -= (m - adj_freq)
(other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
other_word
end
[ word, other_word , adj_freq ]
end
(0...p).map do
# pick a partition at random
word, other_word, adj_freq = partitions[ rand(n) ]
# select the first word in the partition with appropriate
# probability
if rand(m) < adj_freq
word
else
other_word
end
end
end
There is a known Random(0,1) function; it is a uniform random function, meaning it returns 0 or 1, each with probability 50%. Implement Random(a, b) that only makes calls to Random(0,1).
What I have thought of so far is: put the range a..b in a 0-based array, so I have indices 0, 1, 2, ..., b-a,
then call RANDOM(0,1) b-a times, sum the results to use as the generated index, and return the element at that index.
However, since there is no answer in the book, I don't know if this way is correct or the best. How do I prove that the probability of returning each element is exactly the same, i.e. 1/(b-a+1)?
And what is the right/better way to do this?
If your RANDOM(0, 1) returns either 0 or 1, each with probability 0.5 then you can generate bits until you have enough to represent the number (b-a+1) in binary. This gives you a random number in a slightly too large range: you can test and repeat if it fails. Something like this (in Python).
import math

def rand_pow2(bit_count):
    """Return a random number with the given number of bits."""
    result = 0
    for i in range(bit_count):
        result = 2 * result + RANDOM(0, 1)
    return result

def random_range(a, b):
    """Return a random integer in the closed interval [a, b]."""
    bit_count = math.ceil(math.log2(b - a + 1))
    while True:
        r = rand_pow2(bit_count)
        if a + r <= b:
            return a + r
When you sum random numbers, the result is no longer evenly distributed - it looks like a Gaussian curve (look up the central limit theorem and the law of large numbers, or read any probability book/article). Just like flipping a coin 100 times is highly, highly unlikely to give 100 heads; it's likely to give close to 50 heads and 50 tails.
Your inclination to put the range from 0 to b-a first is correct. This question asks exactly how to do that, and the answer relies on the uniqueness of the base-2 expansion. Write m = b - a + 1 in base 2, keeping track of the largest needed exponent, say e (so that m ≤ 2^e). Then find the largest k such that k·m ≤ 2^e. Finally, generate e numbers with RANDOM(0,1), take them as the base-2 expansion of some number x; if x < k·m, return x modulo m, otherwise try again. The program looks something like this:
// returns an integer uniformly distributed in 0 .. m-1, using only RANDOM(0,1)
int random_mod_m(int m) {
    // find the smallest e such that m <= 2^e
    int e = 0;
    while (m > (1 << e)) {
        ++e;
    }
    // find the largest k such that k*m <= 2^e
    int k = 1;
    while (k * m <= (1 << e)) {
        ++k;
    }
    --k; // we went one too far
    while (1) {
        // generate a random e-bit number in base 2
        int x = 0;
        for (int i = 0; i < e; ++i) {
            x = x * 2 + RANDOM(0, 1);
        }
        // if x isn't too large, return x modulo m; otherwise try again
        if (x < k * m)
            return (x % m);
    }
}
Now you can simply add a to the result to get uniformly distributed numbers between a and b.
Divide and conquer could help us in generating a random number in range [a,b] using random(0,1). The idea is
if a is equal to b, then random number is a
Find mid of the range [a,b]
Generate random(0,1)
If above is 0, return a random number in range [a,mid] using recursion
else return a random number in range [mid+1, b] using recursion
The working 'C' code is as follows.
int random(int a, int b)
{
if(a == b)
return a;
int c = RANDOM(0,1); // Returns 0 or 1 with probability 0.5
int mid = a + (b-a)/2;
if(c == 0)
return random(a, mid);
else
return random(mid + 1, b);
}
If you have an RNG that returns {0, 1} with equal probability, you can easily create an RNG that returns the numbers {0, ..., 2^n - 1} with equal probability.
To do this you just use your original RNG n times and read the results as a binary number like 0010110111. Each of the numbers from 0 to 2^n - 1 is equally likely.
Now it is easy to get an RNG from a to b where b - a + 1 = 2^n: you just create the previous RNG and add a to its result.
Now the last question is what you should do if b - a + 1 is not a power of 2.
The good thing is that you have to do almost nothing: rely on the rejection sampling technique. It tells you that if you have a big set and an RNG over that set, and you need to select an element from a subset of this set, you can just keep selecting elements from the bigger set and discarding them until one lands in your subset.
So all you do is find the first n such that b - a + 1 <= 2^n, then use rejection sampling until you pick an element smaller than b - a + 1, and then add a.
I got asked this question in an interview with Google a couple of weeks ago. I didn't quite get the answer, and I was wondering if anyone here could help me out.
You have an array with n elements. The elements are either 0 or 1.
You want to split the array into k contiguous subarrays. The size of each subarray can vary between ceil(n/(2k)) and floor(3n/(2k)). You can assume that k << n.
After you split the array into k subarrays, one element of each subarray will be randomly selected.
Devise an algorithm for maximizing the sum of the randomly selected elements from the k subarrays.
Basically, this means we want to split the array in such a way that the sum of the expected values of the elements selected from each subarray is maximized.
You can assume that n is a power of 2.
Example:
Array: [0,0,1,1,0,0,1,1,0,1,1,0]
n = 12
k = 3
Size of subarrays can be: 2,3,4,5,6
Possible subarrays [0,0,1] [1,0,0,1] [1,0,1,1,0]
Expected Value of the sum of the elements randomly selected from the subarrays: 1/3 + 2/4 + 3/5 = 43/30 ~ 1.4333333
Optimal split: [0,0,1,1,0,0][1,1][0,1,1,0]
Expected value of optimal split: 1/3 + 1 + 1/2 = 11/6 ~ 1.83333333
I think we can solve this problem using dynamic programming.
Basically, we have:
f(i,j) is defined as the maximum sum of all expected values chosen from an array of size i and split into j subarrays. Therefore the solution should be f(n,k).
The recursive equation is:
f(i,j) = max over x of { f(i-x, j-1) + sum(i-x+1, i)/x }, where ceil(n/(2k)) <= x <= floor(3n/(2k))
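A minimal memoized Python sketch of that recurrence (my own naming; it assumes the valid subarray sizes are exactly the integers in [ceil(n/(2k)), floor(3n/(2k))]):
from functools import lru_cache
from math import ceil, floor

def max_expected_sum(arr, k):
    n = len(arr)
    lo, hi = ceil(n / (2 * k)), floor(3 * n / (2 * k))
    prefix = [0]
    for v in arr:
        prefix.append(prefix[-1] + v)

    @lru_cache(maxsize=None)
    def f(i, j):
        # best total expected value using the first i elements split into j subarrays
        if j == 0:
            return 0 if i == 0 else float('-inf')
        best = float('-inf')
        for x in range(lo, min(hi, i) + 1):        # x = size of the last subarray
            best = max(best, f(i - x, j - 1) + (prefix[i] - prefix[i - x]) / x)
        return best

    return f(n, k)

# e.g. max_expected_sum([0,0,1,1,0,0,1,1,0,1,1,0], 3) gives 11/6 ≈ 1.8333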
I don't know if this is still an open question or not, but it seems like the OP has managed to add enough clarifications that this should be straightforward to solve. At any rate, if I am understanding what you are saying this seems like a fair thing to ask in an interview environment for a software development position.
Here is the basic O(n^2 * k) solution, which should be adequate for small k (as the interviewer specified):
from math import ceil, floor   # the answer's original used numpy's ceil/floor

def best_val(arr, K):
    n = len(arr)
    psum = [ 0.0 ]                      # prefix sums, so box sums are O(1)
    for x in arr:
        psum.append(psum[-1] + x)
    tab = [ -100000 for i in range(n) ]
    tab.append(0)
    for k in range(K):
        for s in range(n - (k+1) * ceil(n/(2*K))):
            terms = range(s + ceil(n/(2*K)), min(s + floor((3*n)/(2*K)) + 1, n+1))
            tab[s] = max( [ (psum[t] - psum[s]) / (t - s) + tab[t] for t in terms ])
    return tab[0]
I used the numpy ceil/floor functions but you basically get the idea. The only 'tricks' in this version are that it does windowing to reduce the memory overhead to just O(n) instead of O(n * k), and that it precalculates the partial sums to make computing the expected value for a box a constant-time operation (thus saving a factor of O(n) from the inner loop).
I don't know if anyone is still interested in seeing the solution for this problem. I just stumbled upon this question half an hour ago and thought of posting my solution (Java). The complexity for this is O(n*K^log10). The proof is a little convoluted, so I would rather provide runtime numbers:
n k time(ms)
48 4 25
48 8 265
24 4 20
24 8 33
96 4 51
192 4 143
192 8 343919
The solution is the same old recursive one where given an array, choose the first partition of size ceil(n/2k) and find the best solution recursively for the rest with number of partitions = k -1, then take ceil(n/2k) + 1 and so on.
Code:
import java.util.Date;

public class PartitionOptimization {
public static void main(String[] args) {
PartitionOptimization p = new PartitionOptimization();
int[] input = { 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0};
int splitNum = 3;
int lowerLim = (int) Math.ceil(input.length / (2.0 * splitNum));
int upperLim = (int) Math.floor((3.0 * input.length) / (2.0 * splitNum));
System.out.println(input.length + " " + lowerLim + " " + upperLim + " " +
splitNum);
Date currDate = new Date();
System.out.println(currDate);
System.out.println(p.getMaxPartExpt(input, lowerLim, upperLim,
splitNum, 0));
System.out.println(new Date().getTime() - currDate.getTime());
}
public double getMaxPartExpt(int[] input, int lowerLim, int upperLim,
int splitNum, int startIndex) {
if (splitNum <= 1 && startIndex<=(input.length -lowerLim+1)){
double expt = findExpectation(input, startIndex, input.length-1);
return expt;
}
if (!((input.length - startIndex) / lowerLim >= splitNum))
return -1;
double maxExpt = 0;
double curMax = 0;
int bestI=0;
for (int i = startIndex + lowerLim - 1; i < Math.min(startIndex
+ upperLim, input.length); i++) {
double curExpect = findExpectation(input, startIndex, i);
double splitExpect = getMaxPartExpt(input, lowerLim, upperLim,
splitNum - 1, i + 1);
if (splitExpect>=0 && (curExpect + splitExpect > maxExpt)){
bestI = i;
curMax = curExpect;
maxExpt = curExpect + splitExpect;
}
}
return maxExpt;
}
public double findExpectation(int[] input, int startIndex, int endIndex) {
double expectation = 0;
for (int i = startIndex; i <= endIndex; i++) {
expectation = expectation + input[i];
}
expectation = (expectation / (endIndex - startIndex + 1));
return expectation;
}
}
I'm not sure I understand; the algorithm is to split the array into groups, right? The maximum value the sum can have is the number of ones. So split the array into "n" groups of 1 element each and the sum will be the maximum value possible. But it must be something else and I did not understand the problem, because that seems too silly.
I think this can be solved with dynamic programming. At each possible split location, get the maximum sum if you split at that location and if you don't split at that point. A recursive function and a table to store history might be useful.
sum_i = max{ NumOnesNewPart/NumZerosNewPart * sum(NewPart) + sum(A_i+1, A_end),
sum(A_0,A_i+1) + sum(A_i+1, A_end)
}
This might lead to something...
I think its a bad interview question, but it is also an easy problem to solve.
Every integer contributes to the expected value with weight 1/s where s is the size of the set where it has been placed. Therefore, if you guess the sizes of the sets in your partition, you just need to fill the sets with ones starting from the smallest set, and then fill the remaining largest set with zeroes.
You can easily see then that if you have a partition, filled as above, where the sizes of the sets are S_1, ..., S_k and you do a transformation where you remove one item from set S_i and move it to set S_i+1, you have the following cases:
Both S_i and S_i+1 were filled with ones; then the expected value does not change
Both them were filled with zeroes; then the expected value does not change
S_i contained both 1's and 0's and S_i+1 contains only zeroes; moving 0 to S_i+1 increases the expected value because the expected value of S_i increases
S_i contained 1's and S_i+1 contains both 1's and 0's; moving 1 to S_i+1 increases the expected value because the expected value of S_i+1 increases and S_i remains intact
In all these cases, you can shift an element from S_i to S_i+1, maintaining the filling rule of filling smallest sets with 1's, so that the expected value increases. This leads to the simple algorithm:
Create a partitioning where there is a maximal number of maximum-size arrays and maximal number of minimum-size arrays
Fill the arrays starting from smallest one with 1's
Fill the remaining slots with 0's
How about a recursive function:
int BestValue(Array A, int numSplits)
// Returns the best value that would be obtained by splitting
// into numSplits partitions.
This in turn uses a helper:
// The additional argument is an array of the valid split sizes which
// is the same for each call.
int BestValueHelper(Array A, int numSplits, Array splitSizes)
{
    int result = 0;
    for splitSize in splitSizes
    {
        int splitResult = ExpectedValue(A, 0, splitSize) +
            BestValueHelper(A + splitSize, numSplits - 1, splitSizes);
        if splitResult > result
            result = splitResult;
    }
    return result;
}
ExpectedValue(Array A, int l, int m) computes the expected value of a split of A that goes from l to m i.e. (A[l] + A[l+1] + ... A[m]) / (m-l+1).
BestValue calls BestValueHelper after computing the array of valid split sizes between ceil(n/2k) and floor(3n/2k).
I have omitted error handling and some end conditions but those should not be too difficult to add.
Let
a[] = given array of length n
from = inclusive index of array a
k = number of required splits
minSize = minimum size of a split
maxSize = maximum size of a split
d = maxSize - minSize
expectation(a, from, to) = average of all elements of array a from index "from" to "to"
Optimal(a[], from, k) = MAX over j in [minSize-1, maxSize-1] of { expectation(a, from, from+j) + Optimal(a, from+j+1, k-1) }
Runtime (assuming memoization or dp) = O(n*k*d)
I'm working on a project for fun and I need an algorithm to do as follows:
Generate a list of numbers of length n which add up to x
I would settle for list of integers, but ideally, I would like to be left with a set of floating point numbers.
I would be very surprised if this problem wasn't heavily studied, but I'm not sure what to look for.
I've tackled similar problems in the past, but this one is decidedly different in nature. Previously, I generated the different combinations of a list of numbers that add up to x. I'm sure that I could simply brute-force this problem, but that hardly seems like the ideal solution.
Anyone have any idea what this may be called, or how to approach it? Thanks all!
Edit: To clarify, I mean that the list should be length N while the numbers themselves can be of any size.
edit2: Sorry for my improper use of 'set', I was using it as a catch all term for a list or an array. I understand that it was causing confusion, my apologies.
This is how to do it in Python
import random
def random_values_with_prescribed_sum(n, total):
x = [random.random() for i in range(n)]
k = total / sum(x)
return [v * k for v in x]
Basically you pick n random numbers, compute their sum and compute a scale factor so that the sum will be what you want it to be.
Note that this approach will not produce "uniform" slices, i.e. the distribution you will get will tend to be more "egalitarian" than it should be if it was picked at random among all distribution with the given sum.
To see the reason you can just picture what the algorithm does in the case of two numbers with a prescribed sum (e.g. 1):
The point P is a generic point obtained by picking two random numbers, and it will be uniformly distributed inside the square [0,1]x[0,1]. The point Q is the point obtained by scaling P so that the sum is 1; geometrically, Q is the projection of P from the origin onto the segment from (0,1) to (1,0). Points close to the center of that segment have a higher probability: for example, the exact center (0.5, 0.5) is obtained by scaling any point on the diagonal (0,0)-(1,1), while the point (0,1) is obtained only by scaling points on the side (0,0)-(0,1)... and the diagonal's length is sqrt(2) = 1.4142..., while the square's side is only 1.0.
Actually, you need to generate a partition of x into n parts. This is usually done in the following way: a partition of x into n non-negative parts can be represented by reserving n + x - 1 free places in a row, putting borders at n - 1 arbitrary places, and stones at the rest. The groups of stones between consecutive borders add up to x, thus the number of possible partitions is the binomial coefficient C(n + x - 1, n - 1).
So your algorithm could be as follows: choose an arbitrary (n-1)-subset of an (n + x - 1)-element set; it uniquely determines a partition of x into n parts (see the sketch at the end of this answer).
In Knuth's TAOCP, chapter 3.4.2 discusses random sampling; see Algorithm S there.
Algorithm S: (choose n arbitrary records from total of N)
1. Set t = 0, m = 0.
2. Let u be random, uniformly distributed on (0, 1).
3. If (N - t)*u >= n - m, skip the t-th record and increase t by 1; otherwise include the t-th record in the sample, and increase both m and t by 1.
4. If m < n, return to step 2; otherwise the algorithm is finished.
The solution for non-integers is algorithmically trivial: you just select n arbitrary numbers that don't sum to 0, and normalize them by their sum.
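A minimal Python sketch of the stars-and-bars idea above (my naming; random.sample stands in for Algorithm S as the uniform subset chooser):
import random

def random_partition(x, n):
    # Uniformly random composition of the integer x into n non-negative parts:
    # choose n-1 border positions among the n+x-1 slots, the rest are stones.
    borders = sorted(random.sample(range(n + x - 1), n - 1))
    parts, prev = [], -1
    for b in borders:
        parts.append(b - prev - 1)       # stones between consecutive borders
        prev = b
    parts.append(n + x - 2 - prev)       # stones after the last border
    return parts

# e.g. random_partition(10, 3) might give [4, 0, 6]; the parts always sum to 10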
If you want to sample uniformly in the region of N-1-dimensional space defined by x1 + x2 + ... + xN = x, then you're looking at a special case of sampling from a Dirichlet distribution. The sampling procedure is a little more involved than generating uniform deviates for the xi. Here's one way to do it, in Python:
xs = [random.gammavariate(1,1) for a in range(N)]
xs = [x*v/sum(xs) for v in xs]
If you don't care too much about the sampling properties of your results, you can just generate uniform deviates and correct their sum afterwards.
Here is a version of the above algorithm in Javascript
function getRandomArbitrary(min, max) {
return Math.random() * (max - min) + min;
};
function getRandomArray(min, max, n) {
var arr = [];
for (var i = 0, l = n; i < l; i++) {
arr.push(getRandomArbitrary(min, max))
};
return arr;
};
function randomValuesPrescribedSum(min, max, n, total) {
var arr = getRandomArray(min, max, n);
var sum = arr.reduce(function(pv, cv) { return pv + cv; }, 0);
var k = total/sum;
var delays = arr.map(function(x) { return k*x; })
return delays;
};
You can call it with
var myarray = randomValuesPrescribedSum(0,1,3,3);
And then check it with
var sum = myarray.reduce(function(pv, cv) { return pv + cv;},0);
This code does a reasonable job. I think it produces a different distribution than 6502's answer, but I am not sure which is better or more natural. Certainly his code is clearer/nicer.
import random
def parts(total_sum, num_parts):
points = [random.random() for i in range(num_parts-1)]
points.append(0)
points.append(1)
points.sort()
ret = []
for i in range(1, len(points)):
ret.append((points[i] - points[i-1]) * total_sum)
return ret
def test(total_sum, num_parts):
ans = parts(total_sum, num_parts)
assert abs(sum(ans) - total_sum) < 1e-7
print ans
test(5.5, 3)
test(10, 1)
test(10, 5)
In python:
a: create a list of (random #'s 0 to 1) times total; append 0 and total to the list
b: sort the list, measure the distance between each element
c: round the list elements
import random
import time
TOTAL = 15
PARTS = 4
PLACES = 3
def random_sum_split(parts, total, places):
a = [0, total] + [random.random()*total for i in range(parts-1)]
a.sort()
b = [(a[i] - a[i-1]) for i in range(1, (parts+1))]
if places == None:
return b
else:
b.pop()
c = [round(x, places) for x in b]
c.append(round(total-sum(c), places))
return c
def tick():
if info.tick == 1:
start = time.time()
alpha = random_sum_split(PARTS, TOTAL, PLACES)
end = time.time()
log('alpha: %s' % alpha)
log('total: %.7f' % sum(alpha))
log('parts: %s' % PARTS)
log('places: %s' % PLACES)
log('elapsed: %.7f' % (end-start))
yields:
[2014-06-13 01:00:00] alpha: [0.154, 3.617, 6.075, 5.154]
[2014-06-13 01:00:00] total: 15.0000000
[2014-06-13 01:00:00] parts: 4
[2014-06-13 01:00:00] places: 3
[2014-06-13 01:00:00] elapsed: 0.0005839
To the best of my knowledge this distribution is uniform.
Selecting without any weights (equal probabilities) is beautifully described here.
I was wondering if there is a way to convert this approach to a weighted one.
I am also interested in other approaches as well.
Update: Sampling without replacement
If the sampling is with replacement, you can use this algorithm (implemented here in Python):
import random
items = [(10, "low"),
(100, "mid"),
(890, "large")]
def weighted_sample(items, n):
total = float(sum(w for w, v in items))
i = 0
w, v = items[0]
while n:
x = total * (1 - random.random() ** (1.0 / n))
total -= x
while x > w:
x -= w
i += 1
w, v = items[i]
w -= x
yield v
n -= 1
This is O(n + m) where m is the number of items.
Why does this work? It is based on the following algorithm:
def n_random_numbers_decreasing(v, n):
"""Like reversed(sorted(v * random() for i in range(n))),
but faster because we avoid sorting."""
while n:
v *= random.random() ** (1.0 / n)
yield v
n -= 1
The function weighted_sample is just this algorithm fused with a walk of the items list to pick out the items selected by those random numbers.
This in turn works because the probability that n random numbers 0..v will all happen to be less than z is P = (z/v)^n. Solve for z, and you get z = v·P^(1/n). Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.
If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in that subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.
(Note: "weight" here means that every time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. It does not mean that elements appear in the output with a likelihood proportional to their weights.)
Here's an implementation of that, plentifully commented:
import random
class Node:
# Each node in the heap has a weight, value, and total weight.
# The total weight, self.tw, is self.w plus the weight of any children.
__slots__ = ['w', 'v', 'tw']
def __init__(self, w, v, tw):
self.w, self.v, self.tw = w, v, tw
def rws_heap(items):
# h is the heap. It's like a binary tree that lives in an array.
# It has a Node for each pair in `items`. h[1] is the root. Each
# other Node h[i] has a parent at h[i>>1]. Each node has up to 2
# children, h[i<<1] and h[(i<<1)+1]. To get this nice simple
# arithmetic, we have to leave h[0] vacant.
h = [None] # leave h[0] vacant
for w, v in items:
h.append(Node(w, v, w))
for i in range(len(h) - 1, 1, -1): # total up the tws
h[i>>1].tw += h[i].tw # add h[i]'s total to its parent
return h
def rws_heap_pop(h):
gas = h[1].tw * random.random() # start with a random amount of gas
i = 1 # start driving at the root
while gas >= h[i].w: # while we have enough gas to get past node i:
gas -= h[i].w # drive past node i
i <<= 1 # move to first child
if gas >= h[i].tw: # if we have enough gas:
gas -= h[i].tw # drive past first child and descendants
i += 1 # move to second child
w = h[i].w # out of gas! h[i] is the selected node.
v = h[i].v
h[i].w = 0 # make sure this node isn't chosen again
while i: # fix up total weights
h[i].tw -= w
i >>= 1
return v
def random_weighted_sample_no_replacement(items, n):
heap = rws_heap(items) # just make a heap...
for i in range(n):
yield rws_heap_pop(heap) # and pop n items off it.
If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms):
sort the weights
compute the cumulative weights
pick a random number in [0,1]*totalWeight
find the interval into which this number falls
select the element corresponding to that interval
repeat k times
If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration, then re-normalizing the weights so that their sum is 1 (valid probability distribution function)
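A minimal Python sketch of those steps (names are mine; sorting the weights is optional and omitted, and the without-replacement variant simply deletes the chosen entry, which renormalizes implicitly):
import bisect
import random

def roulette_sample(items, weights, k, replacement=True):
    items, weights = list(items), list(weights)
    picks = []
    for _ in range(k):
        # cumulative weights define the wheel's intervals
        cumulative, total = [], 0.0
        for w in weights:
            total += w
            cumulative.append(total)
        # spin: pick a point in [0, totalWeight) and find its interval
        i = bisect.bisect_right(cumulative, random.random() * total)
        i = min(i, len(items) - 1)       # guard against float round-off
        picks.append(items[i])
        if not replacement:
            del items[i], weights[i]
    return picks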
I know this is a very old question, but I think there's a neat trick to do this in O(n) time if you apply a little math!
The exponential distribution has two very useful properties.
Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum is equal to its rate parameter divided by the sum of all rate parameters.
It is "memoryless". So if you already know the minimum, then the probability that any of the remaining elements is the 2nd-to-min is the same as the probability that if the true min were removed (and never generated), that element would have been the new min. This seems obvious, but I think because of some conditional probability issues, it might not be true of other distributions.
Using fact 1, we know that choosing a single element can be done by generating these exponential distribution samples with rate parameter equal to the weight, and then choosing the one with minimum value.
Using fact 2, we know that we don't have to re-generate the exponential samples. Instead, just generate one for each element, and take the k elements with lowest samples.
Finding the lowest k can be done in O(n). Use the Quickselect algorithm to find the k-th element, then simply take another pass through all elements and output all lower than the k-th.
A useful note: if you don't have immediate access to a library to generate exponential distribution samples, it can be easily done by: -ln(rand())/weight
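A minimal Python sketch of this trick (my naming; for brevity it uses heapq.nsmallest, which is O(n log k), instead of Quickselect):
import heapq
import random

def weighted_sample_without_replacement(items, weights, k):
    # one exponential draw per item, with rate equal to its weight;
    # the k smallest draws mark the k selected items (facts 1 and 2 above)
    keyed = [(random.expovariate(w), item) for item, w in zip(items, weights)]
    return [item for _, item in heapq.nsmallest(k, keyed, key=lambda kv: kv[0])]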
I've done this in Ruby
https://github.com/fl00r/pickup
require 'pickup'
pond = {
"selmon" => 1,
"carp" => 4,
"crucian" => 3,
"herring" => 6,
"sturgeon" => 8,
"gudgeon" => 10,
"minnow" => 20
}
pickup = Pickup.new(pond, uniq: true)
pickup.pick(3)
#=> [ "gudgeon", "herring", "minnow" ]
pickup.pick
#=> "herring"
pickup.pick
#=> "gudgeon"
pickup.pick
#=> "sturgeon"
If you want to generate large arrays of random integers with replacement, you can use piecewise linear interpolation. For example, using NumPy/SciPy:
import numpy
import scipy.interpolate

def weighted_randint(weights, size=None):
    """Given an n-element vector of weights, randomly sample
    integers up to n with probabilities proportional to weights"""
    n = weights.size
    # normalize so that the weights sum to unity
    weights = weights / numpy.linalg.norm(weights, 1)
    # cumulative sum of weights
    cumulative_weights = weights.cumsum()
    # piecewise-linear interpolating function whose domain is
    # the unit interval and whose range is the integers up to n
    f = scipy.interpolate.interp1d(
        numpy.hstack((0.0, cumulative_weights)),
        numpy.arange(n + 1), kind='linear')
    return f(numpy.random.random(size=size)).astype(int)
This is not effective if you want to sample without replacement.
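For example (assuming the function above), the empirical frequencies should roughly track the weights:
import numpy
weights = numpy.array([1.0, 2.0, 7.0])
draws = weighted_randint(weights, size=100000)
print(numpy.bincount(draws) / draws.size)   # roughly [0.1, 0.2, 0.7]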
Here's a Go implementation from geodns:
package foo
import (
"log"
"math/rand"
)
type server struct {
Weight int
data interface{}
}
func foo(servers []server) []server {
// servers list is already sorted by the Weight attribute
// number of items to pick
max := 4
result := make([]server, max)
sum := 0
for _, r := range servers {
sum += r.Weight
}
for si := 0; si < max; si++ {
n := rand.Intn(sum + 1)
s := 0
for i := range servers {
s += int(servers[i].Weight)
if s >= n {
log.Println("Picked record", i, servers[i])
sum -= servers[i].Weight
result[si] = servers[i]
// remove the server from the list
servers = append(servers[:i], servers[i+1:]...)
break
}
}
}
return result
}
If you want to pick x elements from a weighted set without replacement such that elements are chosen with a probability proportional to their weights:
import random
def weighted_choose_subset(weighted_set, count):
"""Return a random sample of count elements from a weighted set.
weighted_set should be a sequence of tuples of the form
(item, weight), for example: [('a', 1), ('b', 2), ('c', 3)]
Each element from weighted_set shows up at most once in the
result, and the relative likelihood of two particular elements
showing up is equal to the ratio of their weights.
This works as follows:
1.) Line up the items along the number line from [0, the sum
of all weights) such that each item occupies a segment of
length equal to its weight.
2.) Randomly pick a number "start" in the range [0, total
weight / count).
3.) Find all the points "start + n/count" (for all integers n
such that the point is within our segments) and yield the set
containing the items marked by those points.
Note that this implementation may not return each possible
subset. For example, with the input ([('a': 1), ('b': 1),
('c': 1), ('d': 1)], 2), it may only produce the sets ['a',
'c'] and ['b', 'd'], but it will do so such that the weights
are respected.
This implementation only works for nonnegative integral
weights. The highest weight in the input set must be less
than the total weight divided by the count; otherwise it would
be impossible to respect the weights while never returning
that element more than once per invocation.
"""
if count == 0:
return []
total_weight = 0
max_weight = 0
borders = []
for item, weight in weighted_set:
if weight < 0:
raise RuntimeError("All weights must be positive integers")
# Scale up weights so dividing total_weight / count doesn't truncate:
weight *= count
total_weight += weight
borders.append(total_weight)
max_weight = max(max_weight, weight)
step = int(total_weight / count)
if max_weight > step:
raise RuntimeError(
"Each weight must be less than total weight / count")
next_stop = random.randint(0, step - 1)
results = []
current = 0
for i in range(count):
while borders[current] <= next_stop:
current += 1
results.append(weighted_set[current][0])
next_stop += step
return results
In the question you linked to, Kyle's solution would work with a trivial generalization.
Scan the list and sum the total weights. Then the probability to choose an element should be:
1 - (1 - (#needed/(weight left)))/(weight at n). After visiting a node, subtract its weight from the total. Also, if you need n and have n left, you have to stop explicitly.
You can check that with everything having weight 1, this simplifies to Kyle's solution.
Edited: (had to rethink what twice as likely meant)
This one does exactly that with O(n) and no excess memory usage. I believe this is a clever and efficient solution easy to port to any language. The first two lines are just to populate sample data in Drupal.
function getNrandomGuysWithWeight($numitems){
$q = db_query('SELECT id, weight FROM theTableWithTheData');
$q = $q->fetchAll();
$accum = 0;
foreach($q as $r){
$accum += $r->weight;
$r->weight = $accum;
}
$out = array();
while(count($out) < $numitems && count($q)){
$n = rand(0,$accum);
$lessaccum = NULL;
$prevaccum = 0;
$idxrm = 0;
foreach($q as $i=>$r){
if(($lessaccum == NULL) && ($n <= $r->weight)){
$out[] = $r->id;
$lessaccum = $r->weight- $prevaccum;
$accum -= $lessaccum;
$idxrm = $i;
}else if($lessaccum){
$r->weight -= $lessaccum;
}
$prevaccum = $r->weight;
}
unset($q[$idxrm]);
}
return $out;
}
I'm putting here a simple solution for picking 1 item; you can easily expand it for k items (Java style):
double random = Math.random();
double sum = 0;
for (int i = 0; i < items.length; i++) {
val = items[i];
sum += val.getValue();
if (sum > random) {
selected = val;
break;
}
}
I have implemented an algorithm similar to Jason Orendorff's idea in Rust here. My version additionally supports bulk operations: insert and remove (when you want to remove a bunch of items given by their ids, not through the weighted selection path) from the data structure in O(m + log n) time where m is the number of items to remove and n the number of items in stored.
Sampling without replacement with recursion - an elegant and very short solution in C#
//how many ways we can choose 4 out of 60 students, so that every time we choose different 4
class Program
{
static void Main(string[] args)
{
int group = 60;
int studentsToChoose = 4;
Console.WriteLine(FindNumberOfStudents(studentsToChoose, group));
}
private static int FindNumberOfStudents(int studentsToChoose, int group)
{
if (studentsToChoose == group || studentsToChoose == 0)
return 1;
return FindNumberOfStudents(studentsToChoose, group - 1) + FindNumberOfStudents(studentsToChoose - 1, group - 1);
}
}
I just spent a few hours trying to understand the algorithms for sampling without replacement that are out there, and this topic is more complex than I initially thought. That's exciting! For the benefit of future readers (have a good day!) I document my insights here, including a ready-to-use function which respects the given inclusion probabilities further below. A nice and quick mathematical overview of the various methods can be found here: Tillé: Algorithms of sampling with equal or unequal probabilities. For example, Jason's method can be found on page 46. The caveat with his method is that the weights are not proportional to the inclusion probabilities, as also noted in the document. Actually, the i-th inclusion probability can be recursively computed as follows:
def inclusion_probability(i, weights, k):
"""
Computes the inclusion probability of the i-th element
in a randomly sampled k-tuple using Jason's algorithm
(see https://stackoverflow.com/a/2149533/7729124)
"""
if k <= 0: return 0
cum_p = 0
for j, weight in enumerate(weights):
# compute the probability of j being selected considering the weights
p = weight / sum(weights)
if i == j:
# if this is the target element, we don't have to go deeper,
# since we know that i is included
cum_p += p
else:
# if this is not the target element, than we compute the conditional
# inclusion probability of i under the constraint that j is included
cond_i = i if i < j else i-1
cond_weights = weights[:j] + weights[j+1:]
cond_p = inclusion_probability(cond_i, cond_weights, k-1)
cum_p += p * cond_p
return cum_p
And we can check the validity of the function above by comparing
In : for i in range(3): print(i, inclusion_probability(i, [1,2,3], 2))
0 0.41666666666666663
1 0.7333333333333333
2 0.85
to
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: random_weighted_sample_no_replacement([(1,'a'),(2,'b'),(3,'c')],2))
Out: Counter({'a': 4198, 'b': 7268, 'c': 8534})
One way - also suggested in the document above - to specify the inclusion probabilities is to compute the weights from them. The whole complexity of the question at hand stems from the fact that one cannot do that directly, since one basically has to invert the recursion formula; symbolically, I claim this is impossible. Numerically it can be done using all kinds of methods, e.g. Newton's method. However, the complexity of inverting the Jacobian using plain Python becomes unbearable quickly; I really recommend looking into numpy.random.choice in this case.
Luckily there is a method using plain Python which might or might not be sufficiently performant for your purposes; it works great if there aren't that many different weights. You can find the algorithm on pages 75 and 76. It works by splitting up the sampling process into parts with the same inclusion probabilities, i.e. we can use random.sample again! I am not going to explain the principle here since the basics are nicely presented on page 69. Here is the code with hopefully a sufficient amount of comments:
def sample_no_replacement_exact(items, k, best_effort=False, random_=None, ε=1e-9):
"""
Returns a random sample of k elements from items, where items is a list of
tuples (weight, element). The inclusion probability of an element in the
final sample is given by
k * weight / sum(weights).
Note that the function raises if a inclusion probability cannot be
satisfied, e.g the following call is obviously illegal:
sample_no_replacement_exact([(1,'a'),(2,'b')],2)
Since selecting two elements means selecting both all the time,
'b' cannot be selected twice as often as 'a'. In general it can be hard to
spot if the weights are illegal and the function does *not* always raise
an exception in that case. To remedy the situation you can pass
best_effort=True which redistributes the inclusion probability mass
if necessary. Note that the inclusion probabilities will change
if deemed necessary.
The algorithm is based on the splitting procedure on page 75/76 in:
http://www.eustat.eus/productosServicios/52.1_Unequal_prob_sampling.pdf
Additional information can be found here:
https://stackoverflow.com/questions/2140787/
:param items: list of tuples of type weight,element
:param k: length of resulting sample
:param best_effort: fix inclusion probabilities if necessary,
(optional, defaults to False)
:param random_: random module to use (optional, defaults to the
standard random module)
:param ε: fuzziness parameter when testing for zero in the context
of floating point arithmetic (optional, defaults to 1e-9)
:return: random sample set of size k
:exception: throws ValueError in case of bad parameters,
throws AssertionError in case of algorithmic impossibilities
"""
# random_ defaults to the random submodule
if not random_:
random_ = random
# special case empty return set
if k <= 0:
return set()
if k > len(items):
raise ValueError("resulting tuple length exceeds number of elements (k > n)")
# sort items by weight
items = sorted(items, key=lambda item: item[0])
# extract the weights and elements
weights, elements = list(zip(*items))
# compute the inclusion probabilities (short: π) of the elements
scaling_factor = k / sum(weights)
π = [scaling_factor * weight for weight in weights]
# in case of best_effort: if a inclusion probability exceeds 1,
# try to rebalance the probabilities such that:
# a) no probability exceeds 1,
# b) the probabilities still sum to k, and
# c) the probability masses flow from top to bottom:
# [0.2, 0.3, 1.5] -> [0.2, 0.8, 1]
# (remember that π is sorted)
if best_effort and π[-1] > 1 + ε:
# probability mass we still we have to distribute
debt = 0.
for i in reversed(range(len(π))):
if π[i] > 1.:
# an 'offender', take away excess
debt += π[i] - 1.
π[i] = 1.
else:
# case π[i] < 1, i.e. 'save' element
# maximum we can transfer from debt to π[i] and still not
# exceed 1 is computed by the minimum of:
# a) 1 - π[i], and
# b) debt
max_transfer = min(debt, 1. - π[i])
debt -= max_transfer
π[i] += max_transfer
assert debt < ε, "best effort rebalancing failed (impossible)"
# make sure we are talking about probabilities
if any(not (0 - ε <= π_i <= 1 + ε) for π_i in π):
raise ValueError("inclusion probabilities not satisfiable: {}" \
.format(list(zip(π, elements))))
# special case equal probabilities
# (up to fuzziness parameter, remember that π is sorted)
if π[-1] < π[0] + ε:
return set(random_.sample(elements, k))
# compute the two possible lambda values, see formula 7 on page 75
# (remember that π is sorted)
λ1 = π[0] * len(π) / k
λ2 = (1 - π[-1]) * len(π) / (len(π) - k)
λ = min(λ1, λ2)
# there are two cases now, see also page 69
# CASE 1
# with probability λ we are in the equal probability case
# where all elements have the same inclusion probability
if random_.random() < λ:
return set(random_.sample(elements, k))
# CASE 2:
# with probability 1-λ we are in the case of a new sample without
# replacement problem which is strictly simpler,
# it has the following new probabilities (see page 75, π^{(2)}):
new_π = [
(π_i - λ * k / len(π))
/
(1 - λ)
for π_i in π
]
new_items = list(zip(new_π, elements))
# the first few probabilities might be 0, remove them
# NOTE: we make sure that floating point issues do not arise
# by using the fuzziness parameter
while new_items and new_items[0][0] < ε:
new_items = new_items[1:]
# the last few probabilities might be 1, remove them and mark them as selected
# NOTE: we make sure that floating point issues do not arise
# by using the fuzziness parameter
selected_elements = set()
while new_items and new_items[-1][0] > 1 - ε:
selected_elements.add(new_items[-1][1])
new_items = new_items[:-1]
# the algorithm reduces the length of the sample problem,
# it is guaranteed that:
# if λ = λ1: the first item has probability 0
# if λ = λ2: the last item has probability 1
assert len(new_items) < len(items), "problem was not simplified (impossible)"
# recursive call with the simpler sample problem
# NOTE: we have to make sure that the selected elements are included
return sample_no_replacement_exact(
new_items,
k - len(selected_elements),
best_effort=best_effort,
random_=random_,
ε=ε
) | selected_elements
Example:
In : sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c')],2)
Out: {'b', 'c'}
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c'),(4,'d')],2))
Out: Counter({'a': 2048, 'b': 4051, 'c': 5979, 'd': 7922})
The weights sum up to 10, hence the inclusion probabilities compute to: a → 20%, b → 40%, c → 60%, d → 80%. (Sum: 200% = k.) It works!
Just one word of caution for the productive use of this function, it can be very hard to spot illegal inputs for the weights. An obvious illegal example is
In: sample_no_replacement_exact([(1,'a'),(2,'b')],2)
ValueError: inclusion probabilities not satisfiable: [(0.6666666666666666, 'a'), (1.3333333333333333, 'b')]
b cannot appear twice as often as a since both always have to be selected. There are more subtle examples. To avoid an exception in production just use best_effort=True, which rebalances the inclusion probability mass such that we always have a valid distribution. Obviously this might change the inclusion probabilities.
I used an associative map (weight, object). For example:
{
(10,"low"),
(100,"mid"),
(10000,"large")
}
total=10110
Pick a random number between 0 and 'total' and iterate over the keys until this number falls into a given range.
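In Python, that idea looks roughly like this (a sketch with my own naming, not the original poster's code):
import random

def pick(weighted_items):
    # weighted_items: list of (weight, obj) pairs, like the map above
    total = sum(w for w, _ in weighted_items)
    r = random.uniform(0, total)
    running = 0
    for weight, obj in weighted_items:
        running += weight
        if r <= running:
            return obj
    return obj   # guard against floating-point round-off

# e.g. pick([(10, "low"), (100, "mid"), (10000, "large")]) is almost always "large"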
I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value.
I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?
For example (these data are just a simplified example):
Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5
For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.
To do this, I could generate random numbers in the range 1-12, assign the first element of group A, then generate random numbers in the range 1-11 and assign the second element in group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but the sampling without replacement seems like a problem that might have been solved many times before.
Logic or pseudocode preferred.
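The shrinking-range sampling described above amounts to a full random permutation of the pool, so one simple option is a Fisher-Yates shuffle followed by slicing; a minimal Python sketch (group names taken from the example above):
import random

def reassign(values, group_sizes):
    # values: the pooled data values; group_sizes: e.g. {'A': 3, 'B': 3, 'C': 2, 'D': 4}
    pool = list(values)
    random.shuffle(pool)                 # Fisher-Yates shuffle, O(n)
    groups, start = {}, 0
    for name, size in group_sizes.items():
        groups[name] = pool[start:start + size]
        start += size
    return groups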
Here's some code for sampling without replacement based on Algorithm 3.4.2S of Knuth's book Seminumeric Algorithms.
void SampleWithoutReplacement
(
int populationSize, // size of set sampling from
int sampleSize, // size of each sample
vector<int> & samples // output, zero-offset indices to selected items
)
{
// Use Knuth's variable names
int& n = sampleSize;
int& N = populationSize;
int t = 0; // total input records dealt with
int m = 0; // number of items selected so far
double u;
while (m < n)
{
u = GetUniform(); // call a uniform(0,1) random number generator
if ( (N - t)*u >= n - m )
{
t++;
}
else
{
samples[m] = t;
t++; m++;
}
}
}
There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.
Here is working C++ code based on the answer by John D. Cook.
#include <random>
#include <vector>
// John D. Cook, https://stackoverflow.com/a/311716/15485
void SampleWithoutReplacement
(
int populationSize, // size of set sampling from
int sampleSize, // size of each sample
std::vector<int> & samples // output, zero-offset indices to selected items
)
{
// Use Knuth's variable names
int& n = sampleSize;
int& N = populationSize;
int t = 0; // total input records dealt with
int m = 0; // number of items selected so far
// seed the engine once so that repeated calls produce different samples
static std::default_random_engine re(std::random_device{}());
std::uniform_real_distribution<double> dist(0,1);
while (m < n)
{
double u = dist(re); // call a uniform(0,1) random number generator
if ( (N - t)*u >= n - m )
{
t++;
}
else
{
samples[m] = t;
t++; m++;
}
}
}
#include <iostream>
int main(int,char**)
{
const size_t sz = 10;
std::vector< int > samples(sz);
SampleWithoutReplacement(10*sz,sz,samples);
for (size_t i = 0; i < sz; i++ ) {
std::cout << samples[i] << "\t";
}
return 0;
}
See my answer to this question Unique (non-repeating) random numbers in O(1)?. The same logic should accomplish what you are looking to do.
Inspired by John D. Cook's answer, I wrote an implementation in Nim. At first I had difficulties understanding how it works, so I commented extensively, also including an example. Maybe it helps to understand the idea. Also, I have changed the variable names slightly.
iterator uniqueRandomValuesBelow*(N, M: int) =
## Returns a total of M unique random values i with 0 <= i < N
## These indices can be used to construct e.g. a random sample without replacement
assert(M <= N)
var t = 0 # total input records dealt with
var m = 0 # number of items selected so far
while (m < M):
let u = random(1.0) # call a uniform(0,1) random number generator
# meaning of the following terms:
# (N - t) is the total number of remaining draws left (initially just N)
# (M - m) is the number how many of these remaining draw must be positive (initially just M)
# => Probability for next draw = (M-m) / (N-t)
# i.e.: (required positive draws left) / (total draw left)
#
# This is implemented by the inequality expression below:
# - the larger (M-m), the larger the probability of a positive draw
# - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
# - for (N-t) >> (M-m), we must get a very small u
#
# example: (N-t) = 7, (M-m) = 5
# => we draw the next with prob 5/7
# lets assume the draw fails
# => t += 1 => (N-t) = 6
# => we draw the next with prob 5/6
# lets assume the draw succeeds
# => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
# => we draw the next with prob 4/5
# lets assume the draw fails
# => t += 1 => (N-t) = 4
# => we draw the next with prob 4/4, i.e.,
# we will draw with certainty from now on
# (in the next steps we get prob 3/3, 2/2, ...)
if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
# no draw -- happens mainly for (N-t) >> (M-m) and/or high u
t += 1
else:
# draw t -- happens when (M-m) gets large and/or low u
yield t # this is where we output an index, can be used to sample
t += 1
m += 1
# example use
for i in uniqueRandomValuesBelow(100, 5):
echo i
When the population size is much greater than the sample size, the above algorithms become inefficient, since they have complexity O(n), n being the population size.
When I was a student I wrote some algorithms for uniform sampling without replacement, which have average complexity O(s log s), where s is the sample size. Here is the code for the binary tree algorithm, with average complexity O(s log s), in R:
# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
s=as.integer(size)
if (s>n) {
stop("Sample size is greater than the number of items to choose from")
}
# upv=integer(s) #level up edge is pointing to
leftv=integer(s) #left edge is pointing to; must be filled with zeros
rightv=integer(s) #right edge is pointing to; must be filled with zeros
samp=integer(s) #the sample
ordn=integer(s) #relative ordinal number
ordn[1L]=1L #initial value for the root vertex
samp[1L]=sample(n,1L)
if (s > 1L) for (j in 2L:s) {
curn=sample(n-j+1L,1L) #current number sampled
curordn=0L #current ordinal number
v=1L #current vertex
from=1L #how we got here: 0 - by left edge, 1 - by right edge
repeat {
curordn=curordn+ordn[v]
if (curn+curordn>samp[v]) { #going down by the right edge
if (from == 0L) {
ordn[v]=ordn[v]-1L
}
if (rightv[v]!=0L) {
v=rightv[v]
from=1L
} else { #creating a new vertex
samp[j]=curn+curordn
ordn[j]=1L
# upv[j]=v
rightv[v]=j
break
}
} else { #going down by the left edge
if (from==1L) {
ordn[v]=ordn[v]+1L
}
if (leftv[v]!=0L) {
v=leftv[v]
from=0L
} else { #creating a new vertex
samp[j]=curn+curordn-1L
ordn[j]=-1L
# upv[j]=v
leftv[v]=j
break
}
}
}
}
return(samp)
}
The complexity of this algorithm is discussed in:
Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54.
http://dx.doi.org/10.1515/mcma.1999.5.1.39
If you find the algorithm useful, please make a reference.
See also:
P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209.
DOI: 10.1080/00207168408803438
Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. /IJCM/, 11(2): 127–140.
DOI: 10.1080/00207168208803304
In the last paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R):
https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html
I wrote a survey of algorithms for sampling without replacement. I may be biased but I recommend my own algorithm, implemented in C++ below, as providing the best performance for many k, n values and acceptable performance for others. randbelow(i) is assumed to return a fairly chosen random non-negative integer less than i.
void cardchoose(uint32_t n, uint32_t k, uint32_t* result) {
auto t = n - k + 1;
for (uint32_t i = 0; i < k; i++) {
uint32_t r = randbelow(t + i);
if (r < t) {
result[i] = r;
} else {
result[i] = result[r - t];
}
}
std::sort(result, result + k);
for (uint32_t i = 0; i < k; i++) {
result[i] += i;
}
}
Another algorithm for sampling without replacement is described here.
It is similar to the one described by John D. Cook in his answer and also the one from Knuth, but it has a different hypothesis: the population size is unknown, but the sample can fit in memory. This one is called "Knuth's algorithm S".
Quoting the rosettacode article:
1. Select the first n items as the sample as they become available.
2. For the i-th item where i > n, have a random chance of n/i of keeping it. If failing this chance, the sample remains the same. If not, have it randomly (1/n) replace one of the previously selected n items of the sample.
3. Repeat step 2 for any subsequent items.
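A minimal Python sketch of the procedure quoted above (my naming; this is the classic reservoir sampling scheme):
import random

def reservoir_sample(iterable, n):
    sample = []
    for i, item in enumerate(iterable, start=1):
        if i <= n:
            sample.append(item)                  # keep the first n items
        elif random.random() < n / i:
            sample[random.randrange(n)] = item   # replace a random slot with chance n/i
    return sample

# e.g. reservoir_sample(range(1000000), 5) picks 5 values, each at most once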