A Google search reveals plenty about generating all possible partitions of an integer n into m parts, but I haven't found anything about sampling a uniformly distributed random partition of n into m parts.
The title of this post is a bit misleading. A random integer partition is by default unrestricted, meaning it can have any number of parts of any size. The specific question asked is about partitions of n into m parts, which is a type of restricted integer partition.
For generating unrestricted integer partitions, a very fast and simple algorithm is due to Fristedt, in a paper called The Structure of Random Partitions of a Large Integer (1993). The algorithm is as follows:
1. Set x = exp(-pi / sqrt(6n)).
2. Generate independent random variables Z(1), Z(2), ..., Z(n), where Z(i) is geometrically distributed with parameter 1 - x^i.
3. If the sum of i*Z(i) over i = 1, 2, ..., n equals n, then STOP; otherwise, repeat step 2.
Once the algorithm stops, then Z(1) is the number of 1s, Z(2) is the number of 2s, etc., in a partition chosen uniformly at random. The probability of accepting a randomly chosen set of Z's is asymptotically 1/(94n^3)^(1/4), which means one would expect to run this algorithm O(n^(3/4)) times before accepting a single sample.
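Here is a minimal Python sketch of the rejection sampler just described (my own illustration, not code from the paper; the helper geom0 is an inverse-CDF geometric sampler I am assuming for convenience):

import math
import random

def geom0(q):
    # geometric on {0, 1, 2, ...} with P(Z = k) = (1 - q) * q**k,
    # i.e. "geometrically distributed with parameter 1 - q"
    if q <= 0.0:
        return 0
    u = 1.0 - random.random()   # u lies in (0, 1]
    return int(math.log(u) // math.log(q))

def fristedt_partition(n):
    # Z[i] is the multiplicity of part i in a uniformly random partition of n
    x = math.exp(-math.pi / math.sqrt(6.0 * n))
    while True:
        Z = [0] + [geom0(x ** i) for i in range(1, n + 1)]
        if sum(i * Z[i] for i in range(1, n + 1)) == n:
            return Z

The rejection loop is exactly steps 2-3 above, so the expected number of iterations grows like n^(3/4).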
The reason I took the time to explain this algorithm is because it applies directly to the problem of generating a partition of n into exactly m parts. First, observe that
The number of partitions of n into exactly m parts is equal to the number of partitions of n with largest part equal to m.
Then we may apply Fristedt's algorithm directly, but instead of generating Z(1), Z(2), ..., Z(n), we generate only Z(1), Z(2), ..., Z(m-1), and Z(m)+1 (the +1 here ensures that the largest part is exactly m, since 1+Z(m) is equal in distribution to Z(m) conditioned on Z(m) >= 1), and set Z(m+1), Z(m+2), ... all equal to 0. Once we obtain the target sum in step 3 we are also guaranteed to have an unbiased sample. To obtain a partition of n into exactly m parts, simply take the conjugate of the partition generated.
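Continuing the sketch above (and reusing the hypothetical geom0 helper), the restricted variant might look like the following; the conjugation at the end is what turns "largest part exactly m" into "exactly m parts":

def fristedt_m_parts(n, m):
    # assumes 1 <= m <= n; sample a partition of n with largest part exactly m,
    # then conjugate it to get a partition of n into exactly m parts
    x = math.exp(-math.pi / math.sqrt(6.0 * n))   # any x in (0, 1) keeps this unbiased
    while True:
        Z = [0] * (m + 1)
        for i in range(1, m):
            Z[i] = geom0(x ** i)
        Z[m] = 1 + geom0(x ** m)                  # forces at least one part of size m
        if sum(i * Z[i] for i in range(1, m + 1)) == n:
            break
    # expand the multiplicities into an explicit partition, largest part first
    parts = [i for i in range(m, 0, -1) for _ in range(Z[i])]
    # conjugate: entry j counts how many parts are >= j; there are exactly m entries
    return [sum(1 for p in parts if p >= j) for j in range(1, m + 1)]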
The advantage this has over the recursive method of Nijenhuis and Wilf is that there are no memory requirements other than storing the random variables Z(1), Z(2), etc. Also, the value of x can be anything strictly between 0 and 1 and this algorithm is still unbiased! Choosing a good value of x, however, can make the algorithm much faster, though the choice in step 1 is nearly optimal for unrestricted integer partitions.
If n is really huge and Fristedt's algorithm takes too long (and table methods are out of the question), then there are other options, but they are a little more complicated; see my thesis https://sites.google.com/site/stephendesalvo/home/papers for more info on probabilistic divide-and-conquer and its applications.
Here is some code that does it. This is O(n^2) the first time you call it, but it builds a cache so that subsequent calls are O(n).
import random

cache = {}

def count_partitions(n, limit):
    if n == 0:
        return 1
    if (n, limit) in cache:
        return cache[n, limit]
    x = cache[n, limit] = sum(count_partitions(n - k, k) for k in range(1, min(limit, n) + 1))
    return x

def random_partition(n):
    a = []
    limit = n
    total = count_partitions(n, limit)
    which = random.randrange(total)
    while n:
        for k in range(1, min(limit, n) + 1):
            count = count_partitions(n - k, k)
            if which < count:
                break
            which -= count
        a.append(k)
        limit = k
        n -= k
    return a
How this works: We can calculate how many partitions of an integer n there are in O(n^2) time. As a side effect, this produces a table of size O(n^2) which we can then use to generate the kth partition of n, for any integer k, in O(n) time.
So let total = the number of partitions. Pick a random number k from 0 to total - 1. Generate the kth partition.
Another algorithm from Combinatorial Algorithms page 52, "Random Generation of n into k parts"
Choose a_1, a_2, ..., a_(k-1), a random (k-1)-subset of {1, 2, ..., n+k-1} (see 1. and 2. below)
Set r_1 = a_1 - 1; r_j = a_j - a_(j-1) - 1 for j = 2, ..., k-1; and r_k = n + k - 1 - a_(k-1)
The r_j (j = 1, ..., k) constitute the random composition of n into k parts
This algorithm for random compositions is based on the "balls-in-cells" model. Briefly, we choose the positions of the cell boundaries at random, then by differencing we find out how many balls are in each cell.
For efficiently generating a random subset of a set, see the related answers at 1. and 2.
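A small Python sketch of the construction above (my own, using random.sample for the (k-1)-subset; parts of size zero are allowed, as the differencing formulas imply):

import random

def random_composition(n, k):
    # uniformly random composition of n into k nonnegative parts
    if k == 1:
        return [n]
    # choose a_1 < a_2 < ... < a_(k-1), a random (k-1)-subset of {1, ..., n+k-1}
    a = sorted(random.sample(range(1, n + k), k - 1))
    # difference the positions: r_1 = a_1 - 1, r_j = a_j - a_(j-1) - 1, r_k = n + k - 1 - a_(k-1)
    r = [a[0] - 1]
    r += [a[j] - a[j - 1] - 1 for j in range(1, k - 1)]
    r.append(n + k - 1 - a[-1])
    return r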
Update: Another approach using a single random number in [0,1] to uniformly generate a random composition is given in Ivan Stojmenovic, "On Random and Adaptive Parallel Generation of Combinatorial Objects" (sections 5 and 10).
Just one more version, in C#.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication6
{
    class Program
    {
        static Random random = new Random();

        static void Main(string[] args)
        {
            PrintPartition(GetUniformPartition(24, 5));
            PrintPartition(GetUniformPartition(24, 5));
            PrintPartition(GetUniformPartition(24, 5));
            PrintPartition(GetUniformPartition(24, 5));
            PrintPartition(GetUniformPartition(24, 5));
            Console.ReadKey();
        }

        static int[] GetUniformPartition(int input, int parts)
        {
            if (input <= 0 || parts <= 0)
                throw new ArgumentException("invalid input or parts");
            if (input < MinUniformPartition(parts))
                throw new ArgumentException("input is too small");

            int[] partition = new int[parts];
            int sum = 0;
            for (int i = 0; i < parts - 1; i++)
            {
                // leave at least MinUniformPartition(parts - i - 1) for the remaining parts
                int max = input - MinUniformPartition(parts - i - 1) - sum;
                partition[i] = random.Next(parts - i, max);
                sum += partition[i];
            }
            partition[parts - 1] = input - sum; // last
            return partition;
        }

        // reserve needed so that the remaining n parts can still be generated
        static int MinUniformPartition(int n)
        {
            return n * n - 1;
        }

        static void PrintPartition(int[] p)
        {
            for (int i = 0; i < p.Length; i++)
            {
                Console.Write("{0},", p[i]);
            }
            Console.WriteLine();
        }
    }
}
This code produces output like the following:
5,8,7,2,2,
6,6,7,2,3,
5,7,6,2,4,
6,4,3,2,9,
7,8,4,4,1,
I have an evenly distributed partition generator.
Where n := the integer to be partitioned, r:= the number of slices:
The algorithm is a patched version of the naive method of simply inserting partings at random. The problem with that method, as it appeared to me when I looked at its output, is that scenarios where partings land in the same spot are less likely to occur. There is only one way to get {1,1,1}, while there are 3! ways of getting {2,4,9}: any of {4,2,9}, {2,4,9}, {9,4,2}, ... leads to the same partition placement when sorted. This has been amended by providing additional explicit opportunities for repeats. For each parting insertion, there's a chance that the position of the parting won't be random but will instead be selected as a repeat of a formerly selected value. This balances out the uneven probability distribution of the naive method.
I have proved by exhaustion that each partitioning is perfectly equally likely for r = 3, n = 2. I can't be bothered to prove it for higher values, but half-hearted ventures to do so found only promising signs. I also tested it on random input, finding that it is at least roughly even for every value I tried (and probably perfectly even).
Here it is in C++11 (the output format is different from what you're expecting: it's the positions of the partings rather than the sizes of the spaces between them; the conversion is easy, though):
#include <vector>
#include <algorithm>
#include <random>
#include <cstdlib>
#include <cassert>

// nparts is the number of parts, i.e. one greater than the number of dividers
// listed in the output vector. bandw is the integer being partitioned.
template <typename Parting, typename Seed>
std::vector<Parting> partitionGen(unsigned nparts, unsigned bandw, Seed seed) {
    assert(nparts > 0);
    std::vector<Parting> out(nparts - 1);
    srand(seed);
    unsigned genRange = bandw;
    for (auto i = out.begin(); i < out.end(); ++i, ++genRange) {
        unsigned gen = rand() % genRange;
        // either a fresh random position, or a repeat of a previously chosen one
        *i = (gen < bandw) ? gen : *(i - (gen - bandw + 1));
    }
    std::sort(out.begin(), out.end(), std::less<Parting>());
    return out;
}
I don't like the fact that I have to sort it though. If Vlody's version has an even distribution, it appears that it'd be better.
After some googling I found an algorithm for this in the "Handbook of Applied Algorithms," which Google Books has indexed. The algorithm is given in section 1.12.2, on page 31.
I have implemented the above solution and found that it works very well if one wants to calculate integer partitions for n but not with respect to m. If working with large n, recursion limits and call stacks may need to be increased a lot.
However, you don't need the first function, because count_partitions(n, limit) actually equals the number of partitions of n+limit into exactly limit parts. Some mathematical software packages have very fast functions for finding the number of partitions of n into m parts.
I have recently derived a definitely unbiased, very simple, and very fast method (using memoization) to solve your exact question: An algorithm for randomly generating integer partitions of a particular length, in Python?
It's based on knowing something about lexically ordered partitions of n having m parts and uses a similar approach to well-accepted algorithms (e.g. Nijenhuis and Wilf 1978) that find random partitions of n, and is conceptually similar to the above.
In short, if there are x partitions of n with m parts, then we choose a random number between 1 and x. That random number will code for one and only one partition satisfying n and m. I hope this helps.
I need to generate binomial random numbers:
For example, consider binomial random numbers. A binomial random number is the number of heads in N tosses of a coin with probability p of a heads on any single toss. If you generate N uniform random numbers on the interval (0,1) and count the number less than p, then the count is a binomial random number with parameters N and p.
In my case, my N could range from 1*10^3 to 1*10^10. My p is around 1*10^(-7).
Often my n*p is around 1*10^(-3).
There is a trivial implementation that generates such a binomial random number with a loop:
public static int getBinomial(int n, double p) {
    int x = 0;
    for (int i = 0; i < n; i++) {
        if (Math.random() < p)
            x++;
    }
    return x;
}
This naive implementation is very slow. I tried the Acceptance Rejection/Inversion method [1] implemented in the Colt (http://acs.lbl.gov/software/colt/) lib. It is very fast, but the distribution of its generated numbers only agrees with the naive implementation when n*p is not very small. In my case, when n*p = 1*10^(-3), the naive implementation can still generate the number 1 after many runs, but the Acceptance Rejection/Inversion method never generates the number 1 (it always returns 0).
Does anyone know what the problem is here? Or can you suggest a better binomial random number generating algorithm that can solve my case?
[1] V. Kachitvichyanukul, B.W. Schmeiser (1988): Binomial random variate generation, Communications of the ACM 31, 216-222.
If n*p is a fixed small number t, and n is much larger than 1/t, then the binomial distribution is very close to the Poisson distribution, which returns k with probability e^{-t} t^k/k!.
Here is some pseudocode
r=e^t * RandomReal[0,1];
s=1;
k=0;
While[s<r,
(k++; s=s+t^k/k!;)]
Return k;
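A direct Python transcription of that pseudocode (my translation, with the running factorial term updated incrementally):

import math
import random

def poisson_small_mean(t):
    # inversion sampling for Poisson(t), used here as an approximation
    # to Binomial(n, p) when t = n*p is small
    r = math.exp(t) * random.random()
    s = 1.0        # running sum of t^k / k!, starting with the k = 0 term
    k = 0
    term = 1.0
    while s < r:
        k += 1
        term *= t / k
        s += term
    return k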
If t is really small, it will be pretty hard to tell the difference between this and a routine that just returns 0 with probability 1-t and 1 the rest of the time. For example, if t=0.001 and n is large, then the probabilities of the various values of k are:
k=0 0.9990005
k=1 0.0009990
k=2 0.0000005
k>2 1.7 * 10^{-10}
When n*p is very small, only the very smallest counts are at all likely. You could work out the probabilities of those and then use the alias method: http://en.wikipedia.org/wiki/Alias_method. If you are feeling extra scrupulous, you could work out the probability that some value above those you are prepared to deal with occurs, and divert to a special-case method with that probability, such as generating a second alias table to cope with the most likely values above those your first alias table covered.
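If you do go the alias-table route, here is a compact sketch (Vose's variant, my own code, not tied to any particular library): feed it the truncated probabilities, for example the three values tabulated above for t = 0.001.

import random

def build_alias(probs):
    # probs[i] = P(result == i); the values should sum to (approximately) 1
    n = len(probs)
    prob, alias = [0.0] * n, [0] * n
    scaled = [p * n for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    # one draw costs a single uniform index plus a single comparison
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]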
I want to put n balls into m buckets at random, with the constraints that:
ballCountMax - ballCountMin <= diff
ballCountMax - ballCountMin is as random as possible
Input:
ballCount: n
bucketCount: m
allowedDiff: diff
Output:
ballCount distribution for buckets
Is there a good algorithm?
To distribute the balls, simply go down the line of buckets, asking for a random number in [0, 1); if it's less than 1/(total buckets remaining), place a ball in the bucket and move on to the next bucket. If at the end of this you still have balls remaining, evaluate the differences between the buckets; if buckets are as far apart as allowed, ignore the buckets that are at the maximum for this pass (do this by finding the minimum and ignoring any bucket with more than minimum + difference - 1 balls). Repeat this process until you have distributed all your balls.
The complexity of this algorithm is dependent on the number of balls (n) and the number of buckets (m). It has a complexity of O(mn).
We can speed this up significantly by realizing that each bucket must contain a certain minimum number of balls; for example, with 5 buckets, 10 balls, and a difference of 2, each bucket must have at least 1 ball. Therefore, before even executing the main algorithm, we can save half the running time by "pre-placing" balls into each bucket.
To calculate the number of pre-placeable balls we simply divide the number of balls by the number of buckets, n/m, and take the floor and ceiling of this, so that a = ceiling(n/m) and b = floor(n/m).
Now b should be the minimum number of balls possible for each bucket iff a - b = diff. There are numerous ways to adjust a and b if the equation isn't initially true, such as
while (a - b < diff) {
    ++a;
    --b;
}
Note that in some cases this method will return incorrect results (a - b increases by 2 each iteration and can overshoot diff), therefore adding a check that a - b = diff is necessary.
We can therefore pre-place b balls.
The simplest approach would be a generate-and-test loop:
do {
    distribute_balls_at_random();
} while (constraint_not_satisfied());
There are probably other approaches that are much more efficient, but this will be the easiest to code.
Following is an O(n) algorithm for the case diff <= 1:
Shuffle the balls with the modern Fisher–Yates shuffle
Now use the simple division hashing h(k) = k mod m to distribute n balls to m buckets
The diff will be either 0, if n mod m == 0, or 1 otherwise.
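A Python sketch of that scheme (balls are labelled, so the shuffle decides which balls land where; the resulting counts differ by at most 1):

import random

def distribute_balls(n, m):
    balls = list(range(n))
    random.shuffle(balls)                    # Fisher-Yates shuffle
    buckets = [[] for _ in range(m)]
    for pos, ball in enumerate(balls):
        buckets[pos % m].append(ball)        # simple division hashing: pos mod m
    return buckets                           # bucket sizes differ by 0 or 1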
function do(int n, int m, int diff){
    buckets = array of size m with initial values = 0
    while(n-- > 0){
        int limit = 1000
        while(limit > 0){
            int idx = random number from 0 to m-1
            buckets[idx] += 1
            int min = index of min element of buckets
            int max = index of max element of buckets
            if(buckets[max] - buckets[min] <= diff) break
            buckets[idx] -= 1
            limit--
        }
        if(limit == 0){
            int min = index of min element of buckets
            buckets[min]++
            int max = index of max element of buckets
            if(buckets[max] - buckets[min] > diff)
                return false  // there is no valid distribution
        }
    }
    print buckets
    return true
}
limit is a parameter you can adjust as you want: greater values ensure more randomness, smaller values give better performance. You can try many test cases and settle on the value that suits you best.
Recently I was asked the following interview question:
You have two sets of numbers of the same length N, for example A = [3, 5, 9] and B = [7, 5, 1]. Next, for each position i in the range 0..N-1, you can pick either number A[i] or B[i], so at the end you will have another array C of length N which consists of elements from A and B. If the sum of all elements in C is less than or equal to K, then such an array is good. Please write an algorithm to figure out the total number of good arrays by given arrays A, B and number K.
The only solution I've come up with is a dynamic programming approach, where we have a matrix of size NxK and M[i][j] represents how many combinations we can have at position i if the current sum is equal to j. But it looks like they expected me to come up with a formula. Could you please help me with that? At least, what direction should I look in? I will appreciate any help. Thanks.
After some consideration, I believe this is an NP-complete problem. Consider:
A = [0, 0, 0, ..., 0]
B = [b1, b2, b3, ..., bn]
Note that every construction of the third set C = ( A[i] or B[i] for i = 0..n-1 ) is just the union of some subset of A and some subset of B. In this case, since every subset of A sums to 0, the sum of C is the same as the sum of some subset of B.
Now your question "How many ways can we construct C with a sum less than K?" can be restated as "How many subsets of B sum to less than K?". Solving this problem for K = 1 and K = 0 yields the solution to the subset sum problem for B (the difference between the two solutions is the number of subsets that sum to 0).
By similar argument, even in the general case where A contains nonzero elements, we can construct an array S = [b1-a1, b2-a2, b3-a3, ..., bn-an], and the question becomes "How many subsets of S sum to less than K - sum(A)?"
Since the subset sum problem is NP-complete, this problem must be also. So with that in mind, I would venture that the dynamic programming solution you proposed is the best you can do, and certainly no magic formula exists.
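For reference, a sketch of the dynamic programming approach the question already mentions (my own phrasing of it): dp maps each reachable partial sum to the number of ways of reaching it, and partial sums above K are pruned. Note it counts the two choices at a position separately even when A[i] == B[i].

from collections import defaultdict

def count_good_arrays(A, B, K):
    dp = defaultdict(int)
    dp[0] = 1                           # one way to have chosen nothing with sum 0
    for a, b in zip(A, B):
        nxt = defaultdict(int)
        for s, ways in dp.items():
            for v in (a, b):
                if s + v <= K:          # prune partial sums that already exceed K
                    nxt[s + v] += ways
        dp = nxt
    return sum(dp.values())

With the question's arrays and K = 15, count_good_arrays([3, 5, 9], [7, 5, 1], 15) returns 4 (the choices 3+5+1 and 7+5+1, each counted twice because of the repeated 5).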
"Please write an algorithm to figure out the total number of good arrays by given arrays A, B and number K."
Is it not the goal?
int A[];
int B[];
int N;
int K;
int Solutions = 0;

void FindSolutions(int Depth, int theSumSoFar) {
    if (theSumSoFar > K) return;
    if (Depth >= N) {
        Solutions++;
        return;
    }
    FindSolutions(Depth+1, theSumSoFar+A[Depth]);
    FindSolutions(Depth+1, theSumSoFar+B[Depth]);
}
Invoke FindSolutions with both arguments set to zero. On return, Solutions will be equal to the number of good arrays.
This is how I would try to solve the problem (sorry if it's naive). Think of the arrays

A = [3, 5, 9, 8, 2]
B = [7, 5, 1, 8, 2]

With elements indexed 0..N-1 the number of choices is 2^N. For every position where A[i] == B[i] the choice does not affect the sum, so count those positions and their forced contribution:

C1 = 0, C2 = 0
for all i with A[i] == B[i]
{
    C1++
    C2 += A[i]
}

Then create two new arrays from the remaining positions:

A1 = [3, 9]
B1 = [7, 1]

Here C1 = 3 and C2 = 5 + 8 + 2 = 15. The number of choices is now reduced to 2^(N-C1), and you calculate the good arrays for the reduced problem using K' = K - C2. Unfortunately, no matter what method you use, you still have to calculate up to 2^(N-C1) sums.
So there are 2^N choices, since at each point you either pick from A or from B. In the specific example you give, where N happens to be 3, there are 8. For discussion you can characterise each set of decisions as a bit pattern.
So a brute-force approach would try every single bit pattern.
But what should be obvious is that if the first few bits produce a number too large then every subsequent possible group of tail bits will also produce a number that is too large. So probably a better way to model it is a tree where you don't bother walking down the limbs that have already grown beyond your limit.
You can also compute the maximum totals that can be reached from each bit to the end of the table. If at any point your running total plus the maximum that you can obtain from here on down is less than K then every subtree from where you are is acceptable without any need for traversal. The case, as discussed in the comments, where every single combination is acceptable is a special case of this observation.
As pointed out by Serge below, a related observation is to use minimums and the converse logic to cancel whole subtrees without traversal.
A potential further optimisation rests behind the observation that, as long as we shuffle each in the same way, changing the order of A and B has no effect because addition is commutative. You can therefore make an effort to ensure either that the maximums grow as quickly as possible or the minimums grow as slowly as possible, to try to get the earliest possible exit from traversal. In practice you'd probably want to apply a heuristic comparing the absolute maximum and minimum (both of which you've computed anyway) to K.
That being the case, a recursive implementation is easiest, e.g. (in C)
/* assume A, B and N are known globals */
unsigned int numberOfGoodArraysFromBit(
unsigned int bit,
unsigned int runningTotal,
unsigned int limit)
{
// have we ended up in an unacceptable subtree?
if(runningTotal > limit) return 0;
// have we reached the leaf node without at any
// point finding this subtree to be unacceptable?
if(bit >= N) return 1;
// maybe every subtree is acceptable?
if(runningTotal + MAXV[bit] <= limit)
{
return 1 << (N - bit);
}
// maybe no subtrees are acceptable?
if(runningTotal + MINV[bit] > limit)
{
return 0;
}
// if we can't prima facie judge the subtreees,
// we'll need specifically to evaluate them
return
numberOfGoodArraysFromBit(bit+1, runningTotal+A[bit], limit) +
numberOfGoodArraysFromBit(bit+1, runningTotal+B[bit], limit);
}
// work out the minimum and maximum values at each position
for (int i = 0; i < N; i++)
{
    MAXV[i] = MAX(A[i], B[i]);
    MINV[i] = MIN(A[i], B[i]);
}

// hence work out the cumulative totals from right to left
for (int i = N-2; i >= 0; i--)
{
    MAXV[i] += MAXV[i+1];
    MINV[i] += MINV[i+1];
}

// to kick it off
printf("Total valid combinations is %u", numberOfGoodArraysFromBit(0, 0, K));
I'm just thinking extemporaneously; it's likely better solutions exist.
I'm looking for the most efficient algorithm to randomly choose a set of n distinct integers, where all the integers are in some range [0..maxValue].
Constraints:
maxValue is larger than n, and possibly much larger
I don't care if the output list is sorted or not
all integers must be chosen with equal probability
My initial idea was to construct a list of the integers [0..maxValue] then extract n elements at random without replacement. But that seems quite inefficient, especially if maxValue is large.
Any better solutions?
Here is an optimal algorithm, assuming that we are allowed to use hashmaps. It runs in O(n) time and space (and not O(maxValue) time, which is too expensive).
It is based on Floyd's random sample algorithm. See my blog post about it for details.
The code is in Java:
private static Random rnd = new Random();

public static Set<Integer> randomSample(int max, int n) {
    HashSet<Integer> res = new HashSet<Integer>(n);
    int count = max + 1;
    for (int i = count - n; i < count; i++) {
        Integer item = rnd.nextInt(i + 1);
        if (res.contains(item))
            res.add(i);
        else
            res.add(item);
    }
    return res;
}
For small values of maxValue such that it is reasonable to generate an array of all the integers in memory then you can use a variation of the Fisher-Yates shuffle except only performing the first n steps.
If n is much smaller than maxValue and you don't wish to generate the entire array then you can use this algorithm:
Keep a sorted list l of the numbers picked so far, initially empty.
Pick a random number x between 0 and maxValue - (elements in l).
For each number in l, if it is smaller than or equal to x, add 1 to x.
Add the adjusted value of x into the sorted list and repeat.
If n is very close to maxValue then you can randomly pick the elements that aren't in the result and then find the complement of that set.
Here is another algorithm that is simpler but has potentially unbounded execution time:
Keep a set s of the elements picked so far, initially empty.
Pick a number at random between 0 and maxValue.
If the number is not in s, add it to s.
Go back to step 2 until s has n elements.
In practice if n is small and maxValue is large this will be good enough for most purposes.
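In Python, that loop is just a few lines (a sketch; the expected number of extra draws stays small while n is well below maxValue):

import random

def sample_by_rejection(n, max_value):
    s = set()
    while len(s) < n:
        s.add(random.randrange(max_value + 1))   # duplicates are simply drawn again
    return s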
One way to do it without generating the full array.
Say I want a randomly selected subset of m items from a set {x1, ..., xn} where m <= n.
Consider element x1. I add x1 to my subset with probability m/n.
If I do add x1 to my subset then I reduce my problem to selecting (m - 1) items from {x2, ..., xn}.
If I don't add x1 to my subset then I reduce my problem to selecting m items from {x2, ..., xn}.
Lather, rinse, and repeat until m = 0.
This algorithm is O(n) where n is the number of items I have to consider.
I rather imagine there is an O(m) algorithm where at each step you consider how many elements to remove from the "front" of the set of possibilities, but I haven't convinced myself of a good solution and I have to do some work now!
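A Python sketch of this pass-through idea over the range [0, maxValue] (sometimes called selection sampling); each value is kept with probability (still needed) / (still remaining):

import random

def selection_sample(m, max_value):
    chosen = []
    needed = m
    remaining = max_value + 1
    for x in range(max_value + 1):
        if random.random() * remaining < needed:   # keep x with probability needed/remaining
            chosen.append(x)
            needed -= 1
            if needed == 0:
                break
        remaining -= 1
    return chosen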
If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).
If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've got so far (let's call that m items picked out of n that you've gone through), and then you take the next number with probability (M-m)/(N-n) and discard it otherwise. You then update m and n appropriately and continue. This is a O(N) algorithm with low constant cost.
If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to sort M so you can find them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take M/(1-M/N) selections to complete the process. Thus, your cost for this algorithm is approximately O(M*(N/(N-M))*log(M)).
These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.
(Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, then you can use the resampling strategy, but pick those numbers to not include; this can be a win, even if you have to push all almost-N numbers around, if your random number generation is expensive.)
My solution is the same as Mark Byers'. It takes O(n^2) time, hence it's useful when n is much smaller than maxValue. Here's the implementation in Python:
import bisect
import random

def pick(n, maxValue):
    chosen = []
    for i in range(n):
        r = random.randint(0, maxValue - i)
        for e in chosen:
            if e <= r:
                r += 1
            else:
                break
        bisect.insort(chosen, r)
    return chosen
The trick is to use a variation of shuffle or in other words a partial shuffle.
function random_pick( a, n )
{
    N = len(a);
    n = min(n, N);
    picked = array_fill(0, n, 0); backup = array_fill(0, n, 0);
    // partially shuffle the array, and generate unbiased selection simultaneously
    // this is a variation on fisher-yates-knuth shuffle
    for (i=0; i<n; i++) // O(n) times
    {
        selected = rand( 0, --N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
        value = a[ selected ];
        a[ selected ] = a[ N ];
        a[ N ] = value;
        backup[ i ] = selected;
        picked[ i ] = value;
    }
    // restore partially shuffled input array from backup
    // optional step, if needed it can be ignored
    for (i=n-1; i>=0; i--) // O(n) times
    {
        selected = backup[ i ];
        value = a[ N ];
        a[ N ] = a[ selected ];
        a[ selected ] = value;
        N++;
    }
    return picked;
}
NOTE: the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffle) and does not need hashmaps (which may not be available and/or usually hide complexity behind their implementation, e.g. fetch time is not O(1); it might even be O(n) in the worst case).
adapted from here
Linear congruential generator modulo maxValue+1. I'm sure I've written this answer before, but I can't find it...
UPDATE: I am wrong. The output of this is not uniformly distributed. Details on why are here.
I think the algorithm below is optimal, i.e. you cannot get better performance than this.
For choosing n numbers out of m numbers, the best algorithm offered so far is presented below. Its expected running time is O(n), and it needs only a single array to store the original numbers. It partially shuffles the first n elements of the array, and then you pick those first n shuffled numbers as your solution.
This is also a fully working C program. What you find is:
Function getrand: This is just a PRNG that returns a number from 0 up to (but not including) upto.
Function randselect: This is the function that randomly chooses n unique numbers out of m numbers. This is what this question is about.
Function main: This is only to demonstrate a use for other functions, so that you could compile it into a program and have fun.
#include <stdio.h>
#include <stdlib.h>

int getrand(int upto) {
    long int r;
    do {
        r = rand();
    } while (r >= upto);   /* reject values outside [0, upto) to avoid modulo bias */
    return r;
}

void randselect(int *all, int end, int select) {
    int upto = RAND_MAX - (RAND_MAX % end);
    int binwidth = upto / end;
    int c;

    for (c = 0; c < select; c++) {
        /* randomly choose some bin */
        int bin = getrand(upto) / binwidth;
        /* swap c with bin */
        int tmp = all[c];
        all[c] = all[bin];
        all[bin] = tmp;
    }
}

int main() {
    int end = 1000;
    int select = 5;

    /* initialize all numbers up to end */
    int *all = malloc(end * sizeof(int));
    int c;
    for (c = 0; c < end; c++) {
        all[c] = c;
    }

    /* select `select` unique numbers randomly */
    srand(0);
    randselect(all, end, select);
    for (c = 0; c < select; c++) printf("%d ", all[c]);
    putchar('\n');

    return 0;
}
I also ran an empirical test: I randomly drew 4-permutations out of a pool of 8 numbers 100,000,000 times, used them to compute the probability of each unique permutation occurring, and sorted the results by that probability. The numbers come out fairly close to one another, which I take to mean the selection is uniformly distributed. The theoretical probability should be 1/1680 = 0.000595238095238095, and the empirical values are close to it.
I have a problem resembling the one described here:
Algorithm to return all combinations of k elements from n
I am looking for something similar that covers all possible combinations of k from n. However, I need a subset to vary a lot from the one drawn previously. For example, if I were to draw a subset of 3 elements from a set of 8, the following algorithm wouldn't be useful to me since every subset is very similar to the one previously drawn:
11100000,
11010000,
10110000,
01110000,
...
I am looking for an algorithm thats picks the subsets in a more "random" looking fashion, ie. where the majority of elements in one subset is not reused in the next:
11100000,
00010011,
00101100,
...
Does anyone know of such an algorithm?
I hope my question made sense and that someone can help me out =)
Kind regards,
Christian
How about first generating all possible combinations of k from n, and then rearranging them with the help of a random function?
If you have the result in a vector, loop through the vector: for each element, let it change place with the element at a random position.
This of course becomes slow for large k and n.
This is not really random, but depending on your needs it might suit you.
Calculate the number of possible combinations. Let's name them N.
Calculate a large number which is coprime to N. Let's name it P.
Order the combinations and give them numbers from 1 to N. Let's name them C1 to CN
Iterate to produce the output combinations. The first one will be C(1*P mod N), the second one will be C(2*P mod N), the third one C(3*P mod N), etc. In essence, Output_i = C(i*P mod N). Mod is meant as the modulus operator.
If P is picked carefully, you will get seemingly random combinations. Values close to 1 or to N will produce values that differ little. Better pick values close to, say N/4 or N/5. You can also randomize the generation of P for every iteration that you need.
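A Python sketch of this scheme (my own; the stride here is just "roughly N/4, bumped up until it is coprime to N", one of the careful picks suggested above):

import math
from itertools import combinations

def scattered_combinations(items, k):
    combos = list(combinations(items, k))     # C_1 .. C_N in lexicographic order
    N = len(combos)
    if N <= 1:
        return combos
    P = max(2, N // 4)
    while math.gcd(P, N) != 1:                # make the stride coprime to N
        P += 1
    # the indices i*P mod N for i = 0..N-1 visit every combination exactly once
    return [combos[(i * P) % N] for i in range(N)]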
As a follow-up to my comment on this answer, here is some code that allows one to determine the composition of a subset from its "index", in colex order.
Shamelessly stolen from my own assignments.
//////////////////////////////////////
// NChooseK
//
// computes    n!
//          --------
//          k!(n-k)!
//
// using Pascal's identity
// i.e. (n,k) = (n-1,k-1) + (n-1,k)
//
// easily optimizable by memoization
long long NChooseK(int n, int k)
{
    if (k >= 0 && k <= n && n >= 1)
    {
        if (k > n / 2)
            k = n - k;
        if (k == 0 || n == 0)
            return 1;
        else
            return NChooseK(n-1, k-1) + NChooseK(n-1, k);
    }
    else
        return 0;
}

///////////////////////////////////////////////////////////////////////
// SubsetColexUnrank
// The unranking works by finding each element
// in turn, beginning with the biggest, leftmost one.
// We just have to find, for each element, how many subsets there are
// before the one beginning with the elements we have already found.
//
// It stores its results (indices of the elements present in the subset) into T, in ascending order.
void SubsetColexUnrank(long long r, int * T, int subsetSize)
{
    assert( subsetSize >= 1 );

    // For each element in the k-subset to be found
    for (int i = subsetSize; i >= 1; i--)
    {
        // T[i] cannot be less than i
        int x = i;

        // Find the smallest element such that, of all the k-subsets that contain it,
        // none has a rank that exceeds r.
        while (NChooseK(x, i) <= r)
            x++;

        // update T with the newly found element
        T[i] = x;

        // if the subset we have to find is not the first one containing this element
        if (r > 0)
        {
            // finding the next element of our k-subset
            // is like finding the first one of the same subset
            // divided by {T[i]}
            r -= NChooseK(x - 1, i);
        }
    }
}
Random-in, random-out.
The colex order is such that the unranking function does not need to know the size of the set from which the elements are picked; the total number of k-subsets is taken to be NChooseK(size of the set, size of the subset).
How about randomly choosing the k elements? I.e. choose the pth, where p is random between 1 and n, then reorder what's left and choose the qth, where q is between 1 and n-1, etc.
Or maybe I misunderstood. Do you still want all possibilities? In that case you can always generate them first and then choose random entries from your list.
By "random looking" I think you mean lexicographically distant.. does this apply to combination i vs. i-1, or i vs. all previous combinations?
If so, here are some suggestions:
Since most of the combinators yield ordered output, there are two options:
1. design or find a generator which somehow yields non-ordered output
2. enumerate and store enough/all combinations in a tie'd array file/db
If you decide to go with door #2, then you can just access randomly ordered combinations by random integers between 1 and the # of combinations.
Just as a final check, compare the current and previous combination using a measure of difference/distance between combinations, e.g. for an unsigned Bit::Vector in Perl:
$vec1->Lexicompare($vec2) >= $MIN_LEX_DIST
You might take another look behind door #1, since even for moderate values of n and k you can get a big array.
EDIT:
Just saw your comment to AnnK... maybe the lexicompare might still help you skip similar combinations?
Depending on what you are trying to do, you could do something like playing cards. Keep two lists: Source, your source (unused) list, and Used, the "already-picked" list. As you randomly pick k items from Source, you move them to the Used list.
If there are exactly k items left in Source when you need to pick again, you pick them all and swap the lists. If there are fewer than k items, you pick enough items at random from Used and add them to Source to make k items in Source, then pick them all and swap the lists.
This is kind of like picking k cards from a deck. You discard them to the used pile. Once you reach the end or need more cards, you shuffle the old ones back into play.
This is just to make sure each set is definitely different from the previous subsets.
Also, this will not really guarantee that all possible subsets are picked before old ones start being repeated.
The good part is that you don't need to worry about pre-calculating all the subsets, and your memory requirements are linear with your data (two n-sized lists).
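A generator-style Python sketch of this card-dealing idea (my own; it assumes the pool holds at least k items): successive subsets share no elements until the used pile has to be shuffled back in.

import random

def deal_subsets(items, k):
    source = list(items)
    random.shuffle(source)
    used = []
    while True:
        if len(source) < k:
            # shuffle the used pile and move just enough of it back into Source
            random.shuffle(used)
            need = k - len(source)
            source += used[:need]
            used = used[need:]
        picked, source = source[:k], source[k:]
        used += picked
        yield picked

For example, with gen = deal_subsets(range(8), 3), the first two calls to next(gen) return disjoint 3-subsets; the third triggers a reshuffle of the discards.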