Choosing k out of n - algorithm

I want to choose k elements uniformly at random out of a possible n without choosing the same number twice. There are two trivial approaches to this:

1. Make a list of all n possibilities. Shuffle them (you don't need to shuffle all n numbers, just k of them, by performing the first k steps of Fisher-Yates). Choose the first k. This approach takes O(k) time (assuming allocating an array of size n takes O(1) time) and O(n) space. This is a problem if k is very small relative to n.

2. Store a set of seen elements. Choose a number at random from [0, n-1]; while the number is in the set, choose a new one. This approach takes O(k) space. The run-time is a little more complicated to analyze. If k = Θ(n) then the run-time is O(k*lg(k)) = O(n*lg(n)) because it is the coupon collector's problem. If k is small relative to n then it takes slightly more than O(k) because of the (albeit low) probability of choosing the same number twice. This is better than the first approach in terms of space but worse in terms of run-time. (A sketch of this approach follows the list.)
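For concreteness, here is a minimal Python sketch of the second approach (my own illustration, not part of the original question):

import random

def sample_distinct(k, n, rng=random):
    # O(k) space; the expected time grows beyond O(k) as k approaches n,
    # because collisions force retries (the coupon-collector effect above).
    seen = set()
    while len(seen) < k:
        x = rng.randrange(n)     # uniform in [0, n-1]
        if x not in seen:        # collision => loop again and redraw
            seen.add(x)
    return list(seen)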
My question:
is there an O(k) time, O(k) space algorithm for all k and n?

With an O(1) hash table, the partial Fisher-Yates method can be made to run in O(k) time and space. The trick is simply to store only the changed elements of the array in the hash table.
Here's a simple example in Java:
public static int[] getRandomSelection (int k, int n, Random rng) {
    if (k > n) throw new IllegalArgumentException(
        "Cannot choose " + k + " elements out of " + n + "."
    );

    HashMap<Integer, Integer> hash = new HashMap<Integer, Integer>(2*k);
    int[] output = new int[k];

    for (int i = 0; i < k; i++) {
        int j = i + rng.nextInt(n - i);
        output[i] = (hash.containsKey(j) ? hash.remove(j) : j);
        if (j > i) hash.put(j, (hash.containsKey(i) ? hash.remove(i) : i));
    }
    return output;
}
This code allocates a HashMap of 2×k buckets to store the modified elements (which should be enough to ensure that the hash table is never rehashed), and just runs a partial Fisher-Yates shuffle on it.
Here's a quick test on Ideone; it picks two elements out of three 30,000 times, and counts the number of times each pair of elements gets chosen. For an unbiased shuffle, each ordered pair should appear approximately 5,000 (±100 or so) times, except for the impossible cases where both elements would be equal.
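For reference, here is the same idea as a minimal Python sketch (my own port of the Java above, not part of the original answer):

import random

def get_random_selection(k, n, rng=random):
    # Partial Fisher-Yates over a virtual array 0..n-1: only displaced
    # entries are stored, so time and space are O(k) (expected, given
    # O(1) dictionary operations).
    if k > n:
        raise ValueError("Cannot choose %d elements out of %d." % (k, n))
    displaced = {}                 # index -> value currently sitting there
    output = []
    for i in range(k):
        j = rng.randrange(i, n)    # random index in [i, n-1]
        output.append(displaced.pop(j, j))
        if j > i:
            displaced[j] = displaced.pop(i, i)
    return output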

Your second approach does not take Θ(k log k) time on average; it takes about n/(n-k+1) + n/(n-k+2) + ... + n/n operations, which is less than k·n/(n-k) since each of the k terms is smaller than n/(n-k). For k <= n/2, that is under 2k operations on average. For k > n/2, you can choose a random subset of size n-k and take the complement. So this is already an O(k) average time and space algorithm.
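To make the bound concrete, here is a quick Python check of that sum (the function name is mine):

def expected_draws(n, k):
    # Expected number of random draws needed to collect k distinct values
    # out of n: n/n + n/(n-1) + ... + n/(n-k+1).
    return sum(n / (n - i) for i in range(k))

# For example, expected_draws(1000, 500) is roughly 693, comfortably
# below the 2*k = 1000 bound.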

What you could use is the following algorithm (using javascript instead of pseudocode):
var k = 3;
var n = [1, 2, 3, 4, 5, 6];

// O(k) iterations
for (var i = 0, tmp; i < k; ++i) {
    // Random index, O(1)
    var index = Math.floor(Math.random() * (n.length - i));
    // Output, O(1)
    console.log(n[index]);
    // Swap and lookup, O(1)
    tmp = n[index];
    n[index] = n[n.length - i - 1];
    n[n.length - i - 1] = tmp;
}
In short, you swap the selected value with the last item and in the next iteration sample from the reduced subset. This assumes your original set is wholly unique.
The storage is O(n); if you wish to retrieve the selected numbers as a set, just refer to the last k entries of n.

Related

Q: Count array pairs with bitwise AND > k ~ better than O(N^2) possible?

Given an array nums, count the number of pairs (two elements) whose bitwise AND is greater than k.
Brute force:
res = 0
n = len(a)
for i in range(0, n):
    for j in range(i + 1, n):
        if a[i] & a[j] > k:
            res += 1
Better version: preprocess to remove all elements <= k, and then brute force.
But I was wondering: what would be the limit in complexity here? Can we do better with a trie or hashmap approach, like two-sum?
(I did not find this problem on LeetCode, so I thought of asking here.)
Let size_of_input_array = N, and let the input array consist of B-bit numbers.
Here is an easy to understand and implement solution.
Eliminate all values <= k. (As an example, suppose we are left with five 10-bit numbers.)
Step 1: Adjacency graph
For each bit position, store the list of indices whose numbers have that bit set. In the example, the 7th bit is set for the numbers at indices 0, 1, 2, 3 of the input array.
Step 2: The challenge is to avoid counting the same pairs again.
To solve this we use a union-find data structure, as shown in the code below.
// unordered_map<int, vector<int>> adjacency_graph;
// adjacency_graph has been filled up in step 1

vector<int> parent;
for (int i = 0; i < input_array.size(); i++)
    parent.push_back(i);

int result = 0;
for (int b = 0; b < B; b++) {                  // loop 1: one pass per bit position (B = bits per value)
    auto v = adjacency_graph[b];               // indices whose b-th bit is set
    if (v.size() > 1) {
        int different_parents = 1;
        for (int j = 1; j < v.size(); j++) {   // loop 2
            int x = find(parent, v[j]);
            int y = find(parent, v[j - 1]);
            if (x != y) {
                different_parents++;
                union_sets(parent, x, y);      // "union" is a C++ keyword, hence the rename
            }
        }
        result += (different_parents * (different_parents - 1)) / 2;
    }
}
return result;
In the above code, find and union_sets are the usual find and union operations of a union-find (disjoint-set) data structure.
Time Complexity:
Step 1:
Build Adjacency Graph: O(BN)
Step 2:
Loop 1: O(B)
Loop 2: O(N * α(N)), where α is the inverse Ackermann function, an extremely slow-growing function
Overall Time Complexity
= O(BN)
Space Complexity
Overall space complexity = O(BN)
First, prune everything <= k. Also sort the value list.
Going from the most significant bit to the least significant, we keep track of the range of numbers we are currently working with (initially all of them: start = 0, end = n).
Let p be the first index in the current range whose number has a 1 at the current bit position.
If the bit in k is 0, then every pair that yields a 1 at this position is definitely good, and we still need to investigate the pairs that yield a 0. We can add (end - p) * (end - p - 1) / 2 pairs from within the current range, plus (end - p) * <number of 1s at this position among the numbers at indices >= end> combinations with the larger, previously-good numbers. To continue, we update end = p. We count the 1s among all the numbers above the range because those were previously only counted in pairs with each other, not with numbers this low in the set.
If the bit in k is 1, then we can't count any wins yet, but we need to eliminate everything below p, so we update start = p.
You can stop once you have gone through all the bits or once start == end.
Details:
Since at each step we eliminate either everything that has a 0 or everything that has a 1 at the current position, everything between start and end shares the same bit prefix. Since the values are sorted, we can do a binary search to find p.
For <number of 1s at this position among the numbers above the current range>: the values are already sorted, so we can precompute partial sums storing, for every position in the sorted list, the number of 1s at each bit position among all numbers above it.
Complexity:
We go bit by bit, so L iterations (L being the bit length of the numbers); each iteration does a binary search (log N) plus O(1) lookups and updates, so this part is O(L log N).
We have to sort: O(N log N).
We have to compute the partial bit-wise sums: O(L*N).
Total: O(L log N + N log N + L*N).
Since N >> L, L log N is subsumed by N log N. Since L >> log N (probably: you may have 32-bit numbers, but you don't have 4 billion of them), N log N is subsumed by L*N. So the time complexity is O(L*N). Since we also need to keep the partial sums around, the memory complexity is also O(L*N).

Finding Θ for an algorithm

I have the below pseudocode that takes a given unsorted array of length size and finds the range by finding the max and min values in the array. I'm just learning about the various time efficiency methods, but I think the below code is Θ(n), as a longer array adds a fixed number of actions (3).
For example, ignoring the actual assignments to max and min (as the unsorted array is arbitrary and these assignments are unknown in advance), an array of length 2 would only require 5 actions total (including the final range calculation). An array of length 4 only uses 9 actions total, again adding the final range calculation. An array of length 12 uses 25 actions.
This all points me to Θ(n), as it is a linear relationship. Is this correct?
Pseudocode:
// Traverse each element of the array, storing the max and min values
// Assuming int size exists that is size of array a[]
// Assuming array is a[]
min = a[0];
max = a[0];
for (i = 0; i < size; i++) {
    if (min > a[i]) {   // If current min is greater than val,
        min = a[i];     // replace min with val
    }
    if (max < a[i]) {   // If current max is smaller than val,
        max = a[i];     // replace max with val
    }
}
range = max - min;      // range is largest value minus smallest
You're right. It's O(n).
An easy way to tell in simple code (like the one above) is to see how many for() loops are nested, if any. For every "normal" loop (from i = 0 -> n), you add a factor of n.
[Edit2]: That is, if you have code like this:
array a[n]; // Array with n elements.

for (int i = 0; i < n; ++i) {     // Happens n times.
    for (int j = 0; j < n; ++j) { // Happens n*n times.
        // something              // Happens n*n times.
    }
}
// Overall complexity is O(n^2)

Whereas

array a[n]; // Array with n elements.

for (int i = 0; i < n; ++i) {     // Happens n times.
    // something                  // Happens n times.
}
for (int j = 0; j < n; ++j) {     // Happens n times.
    // something                  // Happens n times.
}
// Overall complexity is O(2n) = O(n)
This is pretty rudimentary, but useful if someone has not taken an Algorithm course.
The procedures within your for() loop are irrelevant in a complexity question.
[Edit]: This assumes that size actually means the size of array a.
Yes, this would be Θ(n). Your reasoning is a little skewed though.
You have to look at every item in your loop so you're bounded above by a linear function. Conversely, you are also bounded below by a linear function (the same one in fact), because you can't avoid looking at every element.
O(n) only requires that you bound above, Omega(n) requires that you bound below.
Θ(n) says you're bounded on both sides.
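For reference, these are the standard definitions being appealed to here, sketched in LaTeX:

\begin{align*}
f(n) \in O(g(n))      &\iff \exists\, c > 0,\ n_0 \text{ such that } f(n) \le c\,g(n) \text{ for all } n \ge n_0\\
f(n) \in \Omega(g(n)) &\iff \exists\, c > 0,\ n_0 \text{ such that } f(n) \ge c\,g(n) \text{ for all } n \ge n_0\\
f(n) \in \Theta(g(n)) &\iff f(n) \in O(g(n)) \text{ and } f(n) \in \Omega(g(n))
\end{align*}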
Let size be n; then it's clear to see that you always have 2n comparisons and, of course, the single assignment at the end. So you always have 2n + 1 operations in this algorithm.
In the worst case scenario, you have 2n assignments, thus 2n + 1 + 2n = 4n + 1 = O(n).
In the best case scenario, you have 0 assignments, thus 2n + 1 + 0 = 2n + 1 = Ω(n).
Therefore, we have that both the best and worst case perform in linear time. Hence, Θ(n).
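If you want to check those counts empirically, here is a small instrumented Python version of the pseudocode above (the names are mine, for illustration only):

def range_with_counts(a):
    comparisons = assignments = 0
    lo = hi = a[0]
    for x in a:
        comparisons += 1
        if lo > x:          # same test as the pseudocode's min check
            lo = x
            assignments += 1
        comparisons += 1
        if hi < x:          # same test as the pseudocode's max check
            hi = x
            assignments += 1
    return hi - lo, comparisons, assignments

# Any input of length n reports exactly 2n comparisons; a strictly increasing
# input additionally reports n - 1 max reassignments and no min reassignments.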
Yeah, this surely is an O(n) algorithm. I don't think you really need to drill down into the number of comparisons to reach a conclusion about the complexity of the algorithm. Just try to see how the number of comparisons changes with increasing input size. For O(n) the comparisons should increase linearly with the input. For O(n^2) they increase by some multiple of n, and so on.

Exactly how many comparisons does merge sort make?

I have read that quicksort is much faster than mergesort in practice, and the reason for this is the hidden constant.
Well, the solution for the expected number of comparisons made by randomized quicksort is 2n ln n ≈ 1.39 n lg n, which means that the constant in quicksort is 1.39.
But what about mergesort? What is the constant in mergesort?
Let's see if we can work this out!
In merge sort, at each level of the recursion, we do the following:
Split the array in half.
Recursively sort each half.
Use the merge algorithm to combine the two halves together.
So how many comparisons are done at each step? Well, the divide step doesn't make any comparisons; it just splits the array in half. Step 2 doesn't (directly) make any comparisons; all comparisons are done by recursive calls. In step 3, we have two arrays of size n/2 and need to merge them. This requires at most n comparisons, since each step of the merge algorithm does a comparison and then consumes some array element, so we can't do more than n comparisons.
Combining this together, we get the following recurrence:
C(1) = 0
C(n) = 2C(n / 2) + n
(As mentioned in the comments, the linear term is more precisely (n - 1), though this doesn’t change the overall conclusion. We’ll use the above recurrence as an upper bound.)
To simplify this, let's define n = 2^k and rewrite the recurrence in terms of k:
C'(0) = 0
C'(k) = 2C'(k - 1) + 2^k
The first few terms here are 0, 2, 8, 24, ... . This looks something like k * 2^k, and we can prove this by induction. As our base case, when k = 0, the first term is 0, and the value of k * 2^k is also 0. For the inductive step, assume the claim holds for some k and consider k + 1. Then the value is 2(k * 2^k) + 2^(k + 1) = k * 2^(k + 1) + 2^(k + 1) = (k + 1) * 2^(k + 1), so the claim holds for k + 1, completing the induction. Thus the value of C'(k) is k * 2^k. Since n = 2^k, this means that, assuming n is a perfect power of two, the number of comparisons made is
C(n) = n lg n
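As a quick sanity check on that closed form, here is a throwaway Python snippet (mine, not part of the original answer):

def C(n):
    # the recurrence above, taking the linear term as n (an upper bound)
    return 0 if n == 1 else 2 * C(n // 2) + n

# For n = 2^k the closed form C(n) = n lg n holds exactly:
for k in range(1, 11):
    n = 2 ** k
    assert C(n) == n * k   # n * lg(n)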
Impressively, this is better than quicksort! So why on earth is quicksort faster than merge sort? This has to do with other factors that have nothing to do with the number of comparisons made. Primarily, since quicksort works in place while merge sort works out of place, the locality of reference is not nearly as good in merge sort as it is in quicksort. This is such a huge factor that quicksort ends up being much, much better than merge sort in practice, since the cost of a cache miss is pretty huge. Additionally, the time required to sort an array doesn't just take the number of comparisons into account. Other factors like the number of times each array element is moved can also be important. For example, in merge sort we need to allocate space for the buffered elements, move the elements so that they can be merged, then merge back into the array. These moves aren't counted in our analysis, but they definitely add up. Compare this to quicksort's partitioning step, which moves each array element exactly once and stays within the original array. These extra factors, not the number of comparisons made, dominate the algorithm's runtime.
This analysis is a bit less precise than the optimal one, but Wikipedia confirms that the analysis is roughly n lg n and that this is indeed fewer comparisons than quicksort's average case.
Hope this helps!
In the worst case and assuming a straight-forward implementation, the number of comparisons to sort n elements is
n ⌈lg n⌉ − 2^⌈lg n⌉ + 1
where lg n indicates the base-2 logarithm of n.
This result can be found in the corresponding Wikipedia article or recent editions of The Art of Computer Programming by Donald Knuth, and I just wrote down a proof for this answer.
Merging two sorted arrays (or lists) of size k resp. m takes k+m-1 comparisons at most, min{k,m} at best. (After each comparison, we can write one value to the target, when one of the two is exhausted, no more comparisons are necessary.)
Let C(n) be the worst case number of comparisons for a mergesort of an array (a list) of n elements.
Then we have C(1) = 0, C(2) = 1, pretty obviously. Further, we have the recurrence
C(n) = C(floor(n/2)) + C(ceiling(n/2)) + (n-1)
An easy induction shows
C(n) <= n*log_2 n
On the other hand, it's easy to see that we can come arbitrarily close to the bound (for every ε > 0, we can construct cases needing more than (1-ε)*n*log_2 n comparisons), so the constant for mergesort is 1.
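That closed form can be checked against the worst-case recurrence with a few lines of Python (my own check; it uses (n - 1).bit_length() to compute ⌈lg n⌉ exactly):

from functools import lru_cache

@lru_cache(maxsize=None)
def W(n):
    # worst-case comparisons: W(1) = 0, W(n) = W(floor(n/2)) + W(ceil(n/2)) + (n - 1)
    return 0 if n == 1 else W(n // 2) + W(n - n // 2) + (n - 1)

for n in range(1, 2000):
    c = (n - 1).bit_length()               # ceil(lg n) for n >= 1
    assert W(n) == n * c - 2 ** c + 1      # n*ceil(lg n) - 2^ceil(lg n) + 1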
Merge sort is O(n log n) and at each step, in the "worst" case (for number of comparisons), performs a comparison.
Quicksort, on the other hand, is O(n^2) in the worst case.
C++ program to count the number of comparisons in merge sort.
First the program will sort the given array, then it will show the number of comparisons.
#include <iostream>
using namespace std;

int comparison_count = 0; /* to count the number of comparisons */

int merge(int arr[], int l, int m, int r)
{
    int i = l;      /* left subarray */
    int j = m + 1;  /* right subarray */
    int k = l;      /* temporary array */
    int temp[r + 1];

    while (i <= m && j <= r)
    {
        if (arr[i] <= arr[j])
        {
            temp[k] = arr[i];
            i++;
        }
        else
        {
            temp[k] = arr[j];
            j++;
        }
        k++;
        comparison_count++;
    }
    while (i <= m)
    {
        temp[k] = arr[i];
        i++;
        k++;
    }
    while (j <= r)
    {
        temp[k] = arr[j];
        j++;
        k++;
    }
    for (int p = l; p <= r; p++)
    {
        arr[p] = temp[p];
    }
    return comparison_count;
}

int mergesort(int arr[], int l, int r)
{
    int comparisons = 0;
    if (l < r)
    {
        int m = (l + r) / 2;
        mergesort(arr, l, m);
        mergesort(arr, m + 1, r);
        comparisons = merge(arr, l, m, r);
    }
    return comparisons;
}

int main()
{
    int size;
    cout << " Enter the size of an array " << endl;
    cin >> size;
    int myarr[size];
    cout << " Enter the elements of array " << endl;
    for (int i = 0; i < size; i++)
    {
        cin >> myarr[i];
    }
    cout << " Elements of array before sorting are " << endl;
    for (int i = 0; i < size; i++)
    {
        cout << myarr[i] << " ";
    }
    cout << endl;
    int c = mergesort(myarr, 0, size - 1);
    cout << " Elements of array after sorting are " << endl;
    for (int i = 0; i < size; i++)
    {
        cout << myarr[i] << " ";
    }
    cout << endl;
    cout << " Number of comparisons while sorting the given array: " << c << endl;
    return 0;
}
I am assuming the reader knows merge sort. Comparisons happen only when two sorted arrays are being merged. For simplicity, assume n is a power of 2. Merging two arrays of size n/2 takes at most n - 1 comparisons in the worst case; the -1 appears because the last element left over in a merge does not require any comparison. First count the total comparisons pretending each merge costs n, then correct for the -1 terms. The number of levels of merging is log2(n) (picture the recursion as a tree), and each level performs about n comparisons (minus the -1 terms), so the total is n*log2(n) minus a correction. The correction is one per merge, and there are 1 + 2 + 4 + ... + n/2 = n - 1 merges in total.
Number of total comparisons in merge sort = n*log2(n) - (n - 1). For example, n = 8 gives 8*3 - 7 = 17, matching the exact worst-case formula n⌈lg n⌉ − 2^⌈lg n⌉ + 1 above (the two agree whenever n is a power of two).
So, your constant is 1.

selection algorithm problem

Suppose you have an array A of n items, and you want to find the k items in A closest
to the median of A. For example, if A contains the 9 values {7, 14, 10, 12, 2, 11, 29, 3, 4}
and k = 5, then the answer would be the values {7, 14, 10, 12, 11}, since the median
is 10 and these are the five values in A closest to the value 10. Give an algorithm
to solve this problem in O(n) time.
I know that a selection algorithm (deep selection) is the appropriate algorithm for this problem, but I think that would run in O(n*logn) time instead of O(n). Any help would be greatly appreciated :)
You will first need to find the median, which can be done in O(n) (for example using Hoare's Quickselect algorithm).
Then you will need to implement a sorting algorithm which sorts the elements in the array according to their absolute distance to the median (smallest distances first).
If you were to sort the entire array this way, this would typically take somewhere from O(n * log n) to O(n^2), depending on the algorithm being used. However since you only need the first k values, the complexity can be reduced to O(k * log n) to O(k * n).
Since k is a constant and does not depend on the size of the array, the overall complexity in a worst case scenario will be: O(n) (for finding the median) + O(k * n) (sorting), which is O(n) overall.
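For concreteness, here is a minimal Python sketch along these lines (function names are mine). It is a variant of the plan above that avoids sorting: it selects the k-th smallest distance directly. It uses randomized quickselect, so the running time is expected O(n) rather than worst-case; a deterministic version would use median-of-medians for the two selection steps.

import random

def quickselect(xs, j):
    # j-th smallest element of xs (0-indexed); expected linear time.
    pivot = random.choice(xs)
    lows = [x for x in xs if x < pivot]
    highs = [x for x in xs if x > pivot]
    n_eq = len(xs) - len(lows) - len(highs)
    if j < len(lows):
        return quickselect(lows, j)
    if j < len(lows) + n_eq:
        return pivot
    return quickselect(highs, j - len(lows) - n_eq)

def k_closest_to_median(a, k):
    m = quickselect(a, (len(a) - 1) // 2)               # median, expected O(n)
    kth = quickselect([abs(x - m) for x in a], k - 1)   # k-th smallest distance, expected O(n)
    close = [x for x in a if abs(x - m) < kth]          # strictly closer than the cutoff
    ties = [x for x in a if abs(x - m) == kth]
    return close + ties[:k - len(close)]                # break ties arbitrarily

# k_closest_to_median([7, 14, 10, 12, 2, 11, 29, 3, 4], 5) returns the five
# values closest to the median 10, e.g. [7, 10, 12, 11, 14].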
I think you can do this using a variant on quicksort.
You start with a set S of n items and are looking for the "middle" k items. You can think of this as partitioning S into three parts of sizes (n - k)/2 (the "lower" items), k (the "middle" items), and (n - k)/2 (the "upper" items).
This gives us a strategy: first remove the lower (n - k)/2 items from S, leaving S'. Then remove the upper (n - k)/2 items from S', leaving S'', which is the middle k items of S.
You can easily partition a set this way using "half a quicksort": choose a pivot, partition the set into L and U (lower and upper elements w.r.t. the pivot), then you know the items to discard in the partition must be either all of L and some of U or vice versa: recurse accordingly.
[Thinking further, this may not be exactly what you want if you define "closest to the median" in some other way, but it's a start.]
Assumption: we care about the k values in A that are closest to the median. If we had an A={1,2,2,2,2,2,2,2,2,2,2,2,3}, and k=3, the answer is {2,2,2}. Similarly, if we have A={0,1,2,3,3,4,5,6}, and k=3, answers {2,3,3} and {3,3,4} are equally valid. Furthermore, we are not interested in the indices from which these values came, though I imagine some small tweaks to the algorithm would work.
As Grodrigues states, first find the median in O(n) time. While we're at it, keep track of the largest and smallest number.
Next, create an array K, k items long. This array will contain each item's distance from the median.
Copy the first k items from A into K.
For each item A[i], compare the distance of A[i] from the median to each item of K. If A[i] is closer to the median than the farthest item from the median in K, replace that item. As an optimization, we could also track K's closest and farthest items from the median, so we have a faster comparison to K, or we could keep K sorted, but neither optimization is necessary to operate in O(n) time.
Pseudocode, C++ ish:
/* n = length of array
 * array = A, given in the problem
 * result is a pre-allocated array where the result will be placed
 * k is the length of result
 *
 * returns
 *   0 for success
 *  -1 for invalid input
 *   1 for other errors
 *
 * Implementation note: optimizations are skipped.
 */
#define SUCCESS 0
#define INVALID_INPUT -1
#define ERROR 1
int find_k_closest(int n, int array[], int k, int result[])
{
    // if we're looking for more results than possible,
    // it's impossible to give a valid result.
    if (k > n) return INVALID_INPUT;

    // populate result with the first k elements of array.
    for (int i = 0; i < k; i++)
    {
        result[i] = array[i];
    }

    // if we're looking for n items of an n length array,
    // we don't need to do any comparisons.
    // Up to this point, the function is O(k). Worst case, k == n,
    // and we're O(n).
    if (k == n) return SUCCESS;

    // Assume an O(n) median function.
    // Note that we don't bother finding the median if there's an
    // error or if the output is the input.
    int med = median(array, n);

    // Convert the result array to hold signed distances from the
    // median instead of the actual numbers.
    for (int i = 0; i < k; i++)
    {
        result[i] = result[i] - med;
        // e.g. if result[i] = 1 and median = 3, result[i] becomes -2.
    }
    // Up to this point, the function is O(2k + n) = O(n).

    // Find the closest items.
    // Outer loop is O(n * order_inner_loop).
    // Inner loop is O(k).
    // Thus this part is O(k * n), i.e. O(n) for fixed k.
    // Note that we start at k, since the first k elements
    // of array are already in result.
    for (int i = k; i < n; i++)   // OUTER
    {
        int distance = array[i] - med;
        int abs_distance = abs(distance);

        // find the entry of result farthest from the median
        int idx = 0;
        for (int j = 1; j < k; j++)   // INNER
        {
            if (abs(result[j]) > abs(result[idx])) idx = j;
        }

        // If array[i] is closer to the median than the farthest element of
        // result, replace that farthest element with array[i].
        if (abs_distance < abs(result[idx])) { result[idx] = distance; }
    }

    // convert result back from distances to values
    for (int i = 0; i < k; i++)
    {
        result[i] = result[i] + med;
        // e.g. if result[i] = -2 and median = 3, result[i] becomes 1.
    }
    return SUCCESS;
}

Most efficient way of randomly choosing a set of distinct integers

I'm looking for the most efficient algorithm to randomly choose a set of n distinct integers, where all the integers are in some range [0..maxValue].
Constraints:
maxValue is larger than n, and possibly much larger
I don't care if the output list is sorted or not
all integers must be chosen with equal probability
My initial idea was to construct a list of the integers [0..maxValue] then extract n elements at random without replacement. But that seems quite inefficient, especially if maxValue is large.
Any better solutions?
Here is an optimal algorithm, assuming that we are allowed to use hashmaps. It runs in O(n) time and space (and not O(maxValue) time, which is too expensive).
It is based on Floyd's random sample algorithm. See my blog post about it for details.
The code is in Java:
private static Random rnd = new Random();

public static Set<Integer> randomSample(int max, int n) {
    HashSet<Integer> res = new HashSet<Integer>(n);
    int count = max + 1;
    for (int i = count - n; i < count; i++) {
        Integer item = rnd.nextInt(i + 1);
        if (res.contains(item))
            res.add(i);
        else
            res.add(item);
    }
    return res;
}
For small values of maxValue such that it is reasonable to generate an array of all the integers in memory then you can use a variation of the Fisher-Yates shuffle except only performing the first n steps.
If n is much smaller than maxValue and you don't wish to generate the entire array then you can use this algorithm:
Keep a sorted list l of the numbers picked so far, initially empty.
Pick a random number x between 0 and maxValue - (elements in l).
For each number in l, in increasing order: if it is smaller than or equal to x, add 1 to x.
Add the adjusted value of x to the sorted list and repeat.
If n is very close to maxValue then you can randomly pick the elements that aren't in the result and then find the complement of that set.
Here is another algorithm that is simpler but has potentially unbounded execution time:
Keep a set s of the elements picked so far, initially empty.
Pick a number at random between 0 and maxValue.
If the number is not in s, add it to s.
Go back to step 2 until s has n elements.
In practice if n is small and maxValue is large this will be good enough for most purposes.
One way to do it without generating the full array.
Say I want a randomly selected subset of m items from a set {x1, ..., xn} where m <= n.
Consider element x1. I add x1 to my subset with probability m/n.
If I do add x1 to my subset then I reduce my problem to selecting (m - 1) items from {x2, ..., xn}.
If I don't add x1 to my subset then I reduce my problem to selecting m items from {x2, ..., xn}.
Lather, rinse, and repeat until m = 0.
This algorithm is O(n) where n is the number of items I have to consider.
I rather imagine there is an O(m) algorithm where at each step you consider how many elements to remove from the "front" of the set of possibilities, but I haven't convinced myself of a good solution and I have to do some work now!
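A minimal Python sketch of this single pass (sometimes called selection sampling; the function name is mine):

import random

def select_sample(xs, m):
    # One pass: each element is kept with probability
    # (still needed) / (still remaining), which yields exactly m items,
    # every m-subset being equally likely. O(len(xs)) time, O(m) extra space.
    out = []
    n = len(xs)
    for i, x in enumerate(xs):
        if random.random() * (n - i) < m - len(out):
            out.append(x)
    return out

# e.g. select_sample(range(10**6), 10) returns 10 distinct values, in increasing order.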
If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).
If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've got so far (let's call that m items picked out of n that you've gone through), and then you take the next number with probability (M-m)/(N-n) and discard it otherwise. You then update m and n appropriately and continue. This is a O(N) algorithm with low constant cost.
If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to sort M so you can find them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take M/(1-M/N) selections to complete the process. Thus, your cost for this algorithm is approximately O(M*(N/(N-M))*log(M)).
These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.
(Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, then you can use the resampling strategy, but pick those numbers to not include; this can be a win, even if you have to push all almost-N numbers around, if your random number generation is expensive.)
My solution is the same as Mark Byers'. It takes O(n^2) time, hence it's useful when n is much smaller than maxValue. Here's the implementation in Python:

import bisect
import random

def pick(n, maxValue):
    chosen = []
    for i in range(n):
        r = random.randint(0, maxValue - i)
        for e in chosen:
            if e <= r:
                r += 1
            else:
                break
        bisect.insort(chosen, r)
    return chosen
The trick is to use a variation of shuffle or in other words a partial shuffle.
function random_pick( a, n )
{
    N = len(a);
    n = min(n, N);
    picked = array_fill(0, n, 0); backup = array_fill(0, n, 0);

    // partially shuffle the array, and generate unbiased selection simultaneously
    // this is a variation on the fisher-yates-knuth shuffle
    for (i=0; i<n; i++) // O(n) times
    {
        selected = rand( 0, --N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
        value = a[ selected ];
        a[ selected ] = a[ N ];
        a[ N ] = value;
        backup[ i ] = selected;
        picked[ i ] = value;
    }

    // restore partially shuffled input array from backup
    // optional step, if needed it can be ignored
    for (i=n-1; i>=0; i--) // O(n) times
    {
        selected = backup[ i ];
        value = a[ N ];
        a[ N ] = a[ selected ];
        a[ selected ] = value;
        N++;
    }

    return picked;
}
NOTE: the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffling) and does not need hashmaps (which may not be available and/or usually hide a complexity behind their implementation, e.g. fetch time is not O(1); it might even be O(n) in the worst case).
adapted from here
Linear congruential generator modulo maxValue+1. I'm sure I've written this answer before, but I can't find it...
UPDATE: I am wrong. The output of this is not uniformly distributed. Details on why are here.
I think this algorithm below is optimum. I.e. you cannot get better performance than this.
For choosing n numbers out of m numbers, the best offered algorithm so far is presented below. Its worst run time complexity is O(n), and needs only a single array to store the original numbers. It partially shuffles the first n elements from the original array, and then you pick those first n shuffled numbers as your solution.
This is also a fully working C program. What you find is:
Function getrand: This is just a PRNG that returns a number from 0 up to upto.
Function randselect: This is the function that randomly chooses n unique numbers out of m many numbers. This is what this question is about.
Function main: This is only to demonstrate a use for other functions, so that you could compile it into a program and have fun.
#include <stdio.h>
#include <stdlib.h>

int getrand(int upto) {
    long int r;
    do {
        r = rand();
    } while (r >= upto);   /* rejection sampling; >= keeps the result in [0, upto-1] */
    return r;
}

void randselect(int *all, int end, int select) {
    int upto = RAND_MAX - (RAND_MAX % end);
    int binwidth = upto / end;
    int c;
    for (c = 0; c < select; c++) {
        /* randomly choose some bin */
        int bin = getrand(upto) / binwidth;
        /* swap c with bin */
        int tmp = all[c];
        all[c] = all[bin];
        all[bin] = tmp;
    }
}

int main() {
    int end = 1000;
    int select = 5;

    /* initialize all numbers up to end */
    int *all = malloc(end * sizeof(int));
    int c;
    for (c = 0; c < end; c++) {
        all[c] = c;
    }

    /* select `select` unique numbers randomly */
    srand(0);
    randselect(all, end, select);
    for (c = 0; c < select; c++) printf("%d ", all[c]);
    putchar('\n');

    free(all);
    return 0;
}
As an empirical check, I generated a random 4-permutation out of a pool of 8 numbers 100,000,000 times, used those runs to compute the probability of each unique permutation occurring, and then sorted the permutations by that probability. The numbers come out fairly close to one another, which I take to mean the selection is uniformly distributed: the theoretical probability is 1/1680 = 0.000595238095238095, and the empirical frequencies are close to that.
