I'm looking for an algorithm (in any programming language, or even pseudocode) that returns a random number where each outcome has a different probability.
For example:
A random generator that simulates a die where the chance of a '6'
is 50% and the chance of each of the other 5 numbers is 10%.
The algorithm should be scalable, because this is my exact problem:
I have an array (or database) of elements, from which I want to
select 1 random element. But each element should have a different
probability of being selected. So my idea is that every element gets a
number, and this number divided by the sum of all numbers gives the
chance for that element to be randomly selected.
Does anybody know a good programming language (or library) for this problem?
The best solution would be a good SQL query which delivers 1 random entry.
But I would also be happy with any hint or attempt in another programming language.
A simple algorithm to achieve it is:
Create an auxiliary array where sum[i] = p1 + p2 + ... + pi. This is done only once.
When you draw a number, draw r with uniform distribution over [0,sum[n]) and find the first i for which sum[i] is higher than r. A binary search does this efficiently.
It is easy to see that the probability for r to lie in the range [sum[i-1],sum[i]) is sum[i]-sum[i-1] = pi
(In the above, we regard sum[0]=0, for completeness)
For your dice example:
You have:
p1=p2=....=p5 = 0.1
p6 = 0.5
First, calculate sum array:
sum[1] = 0.1
sum[2] = 0.2
sum[3] = 0.3
sum[4] = 0.4
sum[5] = 0.5
sum[6] = 1
Then, each time you need to draw a number: Draw a random number r in [0,1), and choose the first element whose sum is higher than r, for example:
r1 = 0.45 -> element = 5
r2 = 0.8 -> element = 6
r3 = 0.1 -> element = 2
r4 = 0.09 -> element = 1
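The steps above can be sketched in Python roughly as follows. This is only a minimal illustration: the function names are made up for this example, and it uses integer weights (1 for faces 1 to 5, 5 for face 6) instead of probabilities, so r is drawn from [0, total) as described in the question.

```python
import bisect
import itertools
import random

def build_cumulative(weights):
    """Running totals: cumulative[i] = weights[0] + ... + weights[i]. Done once."""
    return list(itertools.accumulate(weights))

def pick_index(cumulative, r):
    """0-based index of the first running total strictly greater than r."""
    return bisect.bisect_right(cumulative, r)

def draw(cumulative, rng=random):
    """Draw one 0-based index with probability weight / total."""
    r = rng.random() * cumulative[-1]          # uniform over [0, total)
    # The clamp only guards against floating-point rounding at the top end.
    return min(pick_index(cumulative, r), len(cumulative) - 1)

# Dice example: faces 1..5 have weight 1 (10% each), face 6 has weight 5 (50%).
dice = build_cumulative([1, 1, 1, 1, 1, 5])
face = draw(dice) + 1                          # indices are 0-based, faces 1-based
```

Each draw costs one binary search, so the lookup is O(log n) after the one-time O(n) setup.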
An alternative answer. Your example was in percentages, so set up an array with 100 slots. A 6 is 50%, so put 6 in 50 of the slots. 1 to 5 are at 10% each, so put 1 in 10 slots, 2 in 10 slots etc. until you have filled all 100 slots in the array. Now pick one of the slots at random using a uniform distribution in [0, 99] or [1, 100] depending on the language you are using.
The contents of the selected array slot will give you the distribution you want.
ETA: On second thoughts, you don't actually need the array, just use cumulative probabilities to emulate the array:
r = rand(100) // In range 0 -> 99 inclusive.
if (r < 50) return 6; // Up to 50% returns a 6.
if (r < 60) return 1; // Between 50% and 60% returns a 1.
if (r < 70) return 2; // Between 60% and 70% returns a 2.
etc.
You already know what numbers are in what slots, so just use cumulative probabilities to pick a virtual slot: 50; 50 + 10; 50 + 10 + 10; ...
Be careful of edge cases and whether your RNG is 0 -> 99 or 1 -> 100.
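For illustration, the 100-slot array could look like this in Python. This is a sketch under the assumption that all percentages are whole numbers; build_slots is a name invented for this example.

```python
import random

def build_slots(percentages):
    """percentages maps value -> whole-number percent; the percents must sum to 100."""
    slots = []
    for value, percent in percentages.items():
        slots.extend([value] * percent)        # value occupies `percent` slots
    if len(slots) != 100:
        raise ValueError("percentages must sum to 100")
    return slots

slots = build_slots({6: 50, 1: 10, 2: 10, 3: 10, 4: 10, 5: 10})
roll = slots[random.randrange(100)]            # randrange(100) is uniform over 0..99
```

The table costs memory proportional to the resolution (100 slots here), but each draw is a single array lookup.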
I have two uint16 3D (GPU) arrays A and B in MATLAB, which have the same 2nd and 3rd dimension. For instance, size(A,1) = 300 000, size(B,1) = 2000, size(A,2) = size(B,2) = 20, and size(A,3) = size(B,3) = 100, to give an idea about the orders of magnitude. Actually, size(A,3) = size(B,3) is very big, say ~ 1 000 000, but the arrays are stored externally in small pieces cut along the 3rd dimension. The point is that there is a very long loop along the 3rd dimension (cf. MWE below), so the code inside of it needs to be optimized further (if possible). Furthermore, the values of A and B can be assumed to be bounded way below 65535, but there are still hundreds of different values.
For each i,j, and d, the rows A(i,:,d) and B(j,:,d) represent multisets of the same size, and I need to find the size of the largest common submultiset (multisubset?) of the two, i.e. the size of their intersection as multisets. Moreover, the rows of B can be assumed sorted.
For example, if [2 3 2 1 4 5 5 5 6 7] and [1 2 2 3 5 5 7 8 9 11] are two such multisets, respectively, then their multiset intersection is [1 2 2 3 5 5 7], which has the size 7 (7 elements as a multiset).
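As a reference point for what is being computed (not a fast or GPU solution), the size of a multiset intersection is the sum over all values of the smaller of the two counts. A minimal Python sketch of that definition, with a function name chosen for this example:

```python
from collections import Counter

def multiset_intersection_size(a, b):
    """Sum over all values of min(count in a, count in b)."""
    common = Counter(a) & Counter(b)   # & keeps the element-wise minimum of the counts
    return sum(common.values())

size = multiset_intersection_size(
    [2, 3, 2, 1, 4, 5, 5, 5, 6, 7],
    [1, 2, 2, 3, 5, 5, 7, 8, 9, 11],
)   # 7 for the two example multisets above
```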
I am currently using the following routine to do this:
s = 300000; % 1st dim. of A
n = 2000; % 1st dim. of B
c = 10; % 2nd dim. of A and B
depth = 10; % 3rd dim. of A and B (corresponds to a batch of size 10 of A and B along the 3rd dim.)
N = 100; % upper bound on the possible values of A and B
A = randi(N,s,c,depth,'uint16','gpuArray');
B = randi(N,n,c,depth,'uint16','gpuArray');
Sizes_of_multiset_intersections = zeros(s,n,depth,'uint8'); % too big to fit in GPU memory together with A and B
for d=1:depth
A_slice = A(:,:,d);
B_slice = B(:,:,d);
unique_B_values = permute(unique(B_slice),[3 2 1]); % B is smaller than A
% compute counts of the unique B-values for each multiset:
A_values_counts = permute(sum(uint8(A_slice==unique_B_values),2,'native'),[1 3 2]);
B_values_counts = permute(sum(uint8(B_slice==unique_B_values),2,'native'),[1 3 2]);
% compute the count of each unique B-value in the intersection:
Sizes_of_multiset_intersections_tmp = gpuArray.zeros(s,n,'uint8');
for i=1:n
Sizes_of_multiset_intersections_tmp(:,i) = sum(min(A_values_counts,B_values_counts(i,:)),2,'native');
end
Sizes_of_multiset_intersections(:,:,d) = gather(Sizes_of_multiset_intersections_tmp);
end
One can also easily adapt the above code to compute the result in batches along dimension 3 rather than d=1:depth (= batch of size 1), though at the expense of an even bigger unique_B_values vector.
Since the depth dimension is large (even when working in batches along it), I am interested in faster alternatives to the code inside the outer loop. So my question is this: is there a faster (e.g. better vectorized) way to compute sizes of intersections of multisets of equal size?
Disclaimer: This is not a GPU-based solution (I don't have a good GPU). I find the results interesting and want to share them, but I can delete this answer if you think I should.
Below is a vectorized version of your code that gets rid of the inner loop, at the cost of having to deal with a bigger array that might be too big to fit in memory.
The idea is to shape A_values_counts and B_values_counts as 3D arrays in such a way that calling min(A_values_counts,B_values_counts) calculates everything in one go thanks to implicit expansion. In the background it creates a big array of size s x n x length(unique_B_values), which is probably too big most of the time.
To get around the constraint on the size, the results are calculated in batches along the n dimension, i.e. the first dimension of B:
tic
nBatches_B = 2000;
sBatches_B = n/nBatches_B;
Sizes_of_multiset_intersections_new = zeros(s,n,depth,'uint8');
for d=1:depth
A_slice = A(:,:,d);
B_slice = B(:,:,d);
% compute counts of the unique B-values for each multiset:
unique_B_values = reshape(unique(B_slice),1,1,[]);
A_values_counts = sum(uint8(A_slice==unique_B_values),2,'native'); % s x 1 x length(uniqueB) array
B_values_counts = reshape(sum(uint8(B_slice==unique_B_values),2,'native'),1,n,[]); % 1 x n x length(uniqueB) array
% Not possible to do it all in one go, must split in batches along B
for ii = 1:nBatches_B
Sizes_of_multiset_intersections_new(:,((ii-1)*sBatches_B+1):ii*sBatches_B,d) = sum(min(A_values_counts,B_values_counts(:,((ii-1)*sBatches_B+1):ii*sBatches_B,:)),3,'native'); % Vectorized
end
end
toc
Here is a little benchmark with different values of the number of batches. You can see that a minimum is found around a number of 400 (batch size 50), with a decrease of around 10% in processing time (each point is an average over 3 runs). (EDIT: the x axis is the number of batches, not the batch size)
I'd be interested in knowing how it behaves for GPU arrays as well!
Since this is about remapping a uniform distribution to another with a different range, this is not a PHP question specifically, although I am using PHP.
I have a cryptographically secure random number generator that gives me evenly distributed integers (uniform discrete distribution) between 0 and PHP_INT_MAX.
How do I remap these results to fit into a different range in an efficient manner?
Currently I am using $mappedRandomNumber = $randomNumber % ($range + 1) + $min where $range = $max - $min, but that obviously doesn't work, since the first (PHP_INT_MAX + 1) % ($range + 1) integers of the range have a higher chance of being picked, breaking the uniformity of the distribution.
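To make the bias concrete, here is a tiny Python illustration with a deliberately small source range (0..7 instead of 0..PHP_INT_MAX) mapped onto 3 target values; the function name is made up for this example.

```python
from collections import Counter

def modulo_counts(source_size, target_size):
    """How often each target value occurs when 0..source_size-1 is reduced modulo target_size."""
    return Counter(x % target_size for x in range(source_size))

counts = modulo_counts(8, 3)
# 8 % 3 == 2, so the first 2 target values (0 and 1) occur 3 times, value 2 only twice.
```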
Well, having zero knowledge of PHP definitely qualifies me as an expert, so
mentally converting to float U[0,1)
f = r / PHP_INT_MAX
then doing
mapped = min + f*(max - min)
going back to integers
mapped = min + (r * max - r * min)/PHP_INT_MAX
If the computation is done in 64-bit math and PHP_INT_MAX is 2^31 - 1, it should work.
This is what I ended up doing. PRNG 101 (if it does not fit, ignore and generate again). Not very sophisticated, but simple:
public function rand($min = 0, $max = null){
    // number of distinct values in the target range {$min .. $max}
    $range = $max - $min + 1;
    // pow(2,$numBits-1)-1 calculated as (pow(2,$numBits-2)-1) + pow(2,$numBits-2)
    // to avoid overflow when $numBits is the number of bits of PHP_INT_MAX
    $maxSafe = (int) floor(
        ((pow(2,8*$this->intByteCount-2)-1) + pow(2,8*$this->intByteCount-2))
        /
        $range
    ) * $range;
    // discards anything at or above the last interval N * {0 .. max - min}
    // that fits in {0 .. 2^(intBitCount-1)-1}
    do {
        $chars = $this->getRandomBytesString($this->intByteCount);
        $n = 0;
        for ($i=0;$i<$this->intByteCount;$i++) {$n|=(ord($chars[$i])<<(8*($this->intByteCount-$i-1)));}
    } while (abs($n)>=$maxSafe);
    // $maxSafe is a multiple of $range, so every remainder is equally likely
    return (abs($n)%$range)+$min;
}
Any improvements are welcomed.
(Full code on https://github.com/elcodedocle/cryptosecureprng/blob/master/CryptoSecurePRNG.php)
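The "ignore and generate again" idea, stripped of the PHP details, can be sketched in Python like this. It is not a port of the linked class: SOURCE_BITS, uniform_int and next_raw are names invented for this example, and the source is assumed to yield uniform integers in [0, 2**63).

```python
import secrets

SOURCE_BITS = 63   # assumption: the raw source is uniform over [0, 2**63)

def uniform_int(lo, hi, next_raw=lambda: secrets.randbits(SOURCE_BITS)):
    """Uniform integer in [lo, hi] (inclusive) by rejection sampling."""
    span = hi - lo + 1                            # number of target values
    limit = (1 << SOURCE_BITS) // span * span     # largest multiple of span that fits
    while True:
        n = next_raw()
        if n < limit:                             # raw values >= limit would bias the result
            return lo + n % span

roll = uniform_int(1, 6)
```

Because limit is a multiple of span, every remainder is hit by the same number of raw values, and fewer than half of the raw values are ever rejected.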
Here is a sketch of how I would do it:
Suppose you have a uniform random integer distribution in the range [A, B); that's what your random number generator provides.
Let L = B - A.
Let P be the highest power of 2 such that P <= L.
Let X be a sample from this range.
First calculate Y = X - A.
If Y >= P, discard it and start with a new X until you get a Y that fits.
Now Y contains log2(P) uniformly random bits; zero-extend it to log2(P) bits.
Now we have a uniform random bit generator that can provide an arbitrary number of random bits as needed.
To generate a number in the target range, let [A_t, B_t) be the target range. Let L_t = B_t - A_t.
Let P_t be the smallest power of 2 such that P_t >= L_t.
Read log2(P_t) random bits and make an integer from it, let's call it X_t.
If X_t >= L_t, discard it and try again until you get a number that fits.
Your random number in the desired range will be X_t + A_t.
Implementation considerations: if your L_t and L are powers of 2, you never have to discard anything. If not, then even in the worst case you should get the right number in less than 2 trials on average.
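The second half of the sketch (building a number in [A_t, B_t) from raw random bits) might look like this in Python. It assumes a bit source is already available; here secrets.randbits stands in for it, and from_bits is a name made up for this example.

```python
import secrets

def from_bits(a_t, b_t, random_bits=secrets.randbits):
    """Uniform integer in [a_t, b_t) built from raw random bits."""
    l_t = b_t - a_t
    if l_t < 1:
        raise ValueError("target range must not be empty")
    k = (l_t - 1).bit_length()     # log2(P_t), P_t = smallest power of 2 >= L_t
    if k == 0:
        return a_t                 # only one possible value, no bits needed
    while True:
        x_t = random_bits(k)       # uniform over [0, P_t)
        if x_t < l_t:              # otherwise discard and try again
            return a_t + x_t

value = from_bits(10, 16)
```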
I'd like to read more about an algorithm that's used in R for unequal probability sampling, but after a few hours of searching I haven't been able to turn anything up on it. I thought it might have been an Art of Computer Programming algorithm, but I haven't been able to substantiate that either. The particular function in R's random.c is called ProbSampleNoReplace().
Given a vector of probabilities prob[] of length n, a desired sample size nans, and a vector of selected items ans[]
For each element j in prob[] assign an index perm[j]
Sort the list in order of probability value, largest first
totalmass = 1
For (h=0, n1= n-1, h<nans, h++,n1-- )
rt = totalmass * rand(in 0:1)
mass = 0
**sum the probabilities, largest first, until the sum is bigger than rt**
for(j=0;j<n1;j++)
mass += prob[j]
if rt <= mass then break
ans[h] = perm[j]
**reduce size of totalmass to reflect removed item**
totalmass -= prob[j]
**reset the indices to be sequential**
for(k=j, k<n1, k++)
prob[k] = prob[k+1]
perm[k] = perm[k+1]
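For readers who prefer something executable, here is my reading of the pseudocode above as a Python sketch. It is not R's actual implementation; the function name and the rng parameter are assumptions of this example, and it starts from sum(prob) rather than 1 so that unnormalized weights also work.

```python
import random

def sample_without_replacement(prob, n, rng=random):
    """Pick n distinct indices, each step proportional to the remaining probabilities."""
    if n > len(prob):
        raise ValueError("cannot draw more items than there are probabilities")
    # Sort the indices by probability, largest first.
    perm = sorted(range(len(prob)), key=lambda i: prob[i], reverse=True)
    p = [prob[i] for i in perm]
    total_mass = sum(p)
    ans = []
    for _ in range(n):
        rt = total_mass * rng.random()
        mass = 0.0
        last = len(p) - 1
        # Sum the probabilities, largest first, until the sum reaches rt.
        for j in range(last):
            mass += p[j]
            if rt <= mass:
                break
        else:
            j = last               # nothing matched earlier: take the last item
        ans.append(perm[j])
        total_mass -= p[j]         # remove the selected item's mass
        del p[j]                   # and close the gap in both lists
        del perm[j]
    return ans

picked = sample_without_replacement([0.1, 0.2, 0.3, 0.4], 2)
```

Sorting largest first does not change the distribution; it only shortens the average scan, since the most likely items are checked first.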
The sample function supports unequal probability arguments. Your code fragment is not clear as to its intent to those of us who do not read C.
> table( sample(1:4, 100, repl=TRUE, prob=4:1) )
1 2 3 4
46 23 24 7
There is another SO Q&A that may be useful (found by an SO search with arguments):
random.c ProbSampleNoReplace
Faster weighted sampling without replacement
Is there a good algorithm to split a randomly generated number into three buckets, each with constraints as to how much of the total they may contain.
For example, say my randomly generated number is 1,000 and I need to split it into buckets a, b, and c.
These ranges are only an example. See my edit for possible ranges.
Bucket a may only be between 10% - 70% of the number (100 - 700)
Bucket b may only be between 10% - 50% of the number (100 - 500)
Bucket c may only be between 5% - 25% of the number (50 - 250)
a + b + c must equal the randomly generated number
You want the amounts assigned to be completely random so there's just as equal a chance of bucket a hitting its max as bucket c in addition to as equal a chance of all three buckets being around their percentage mean.
EDIT: The following will most likely always be true: low end of a + b + c < 100%, high end of a + b + c > 100%. These percentages are only to indicate acceptable values of a, b, and c. In a case where a is 10% while b and c are their max (50% and 25% respectively) the numbers would have to be reassigned since the total would not equal 100%. This is the exact case I'm trying to avoid by finding a way to assign these numbers in one pass.
I'd like to find a way to pick these number randomly within their range in one pass.
The problem is equivalent to selecting a random point in an N-dimensional object (in your example N=3), the object being defined by the equations (in your example):
0.1 <= x <= 0.7
0.1 <= y <= 0.5
0.05 <= z <= 0.25
x + y + z = 1 (*)
Clearly because of the last equation (*) one of the coordinates is redundant, i.e. picking values for x and y dictates z.
Eliminating (*) and one of the other equations leaves us with an (N-1)-dimensional box, e.g.
0.1 <= x <= 0.7
0.1 <= y <= 0.5
that is cut by the inequality
0.05 <= (1 - x - y) <= 0.25 (**)
that derives from (*) and the equation for z. This is basically a diagonal stripe through the box.
In order for the results to be uniform, I would just repeatedly sample the (N-1)-dimensional box, and accept the first sampled point that fulfills (**). Single-pass solutions might end up having biased distributions.
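A minimal Python sketch of that accept/reject loop, using the example bounds from the question as fractions of the total (the function name is made up here):

```python
import random

def sample_buckets(rng=random):
    """Uniform point (x, y, z) with x + y + z = 1 inside the example bounds."""
    while True:
        x = rng.uniform(0.1, 0.7)      # sample the 2-dimensional box ...
        y = rng.uniform(0.1, 0.5)
        z = 1.0 - x - y                # ... and derive z from (*)
        if 0.05 <= z <= 0.25:          # accept only points inside the stripe (**)
            return x, y, z

a, b, c = (round(1000 * share) for share in sample_buckets())
```

Note that rounding the three shares independently, as in the last line, can make the sum differ from 1000 by one; assigning c = 1000 - a - b instead avoids that.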
Update: Yes, you're right, the result is not uniformly distributed.
Let's say your percent values are natural numbers (if this assumption is wrong, you don't have to read further :) In that case I don't have a solution).
Let's define an event e as a tuple of 3 values (the percentage of each bucket): e = (pa, pb, pc). Next, create all possible events en. What you have here is a tuple space consisting of a discrete number of events. All of the possible events should have the same probability of occurring.
Let's say we have a function f(n) => en. Then all we have to do is take a random number n and return en in a single pass.
Now, the problem remains to create such a function f :)
In pseudo code, a very slow method (just for illustration):
function f(n) {
int c = 0
for i in [10..70] {
for j in [10..50] {
for k in [5..25] {
if(i + j + k == 100) {
if(n == c) {
return (i, j, k) // found event!
} else {
c = c + 1
}
}
}
}
}
}
What you have now is a single-pass solution, but the problem has only been moved elsewhere: the function f is very slow. But you can do better: I think you can calculate everything a bit faster if you set your ranges correctly and calculate offsets instead of iterating through your ranges.
Is this clear enough?
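One simple way to speed this up, sketched in Python under the same whole-percent assumption, is to enumerate the valid events once and then index into the list. This is not the offset calculation hinted at above, just a precomputed table; EVENTS and all_events are names invented for this example.

```python
import random

def all_events():
    """All (a, b, c) in whole percents within the example ranges that sum to 100."""
    events = []
    for a in range(10, 71):
        for b in range(10, 51):
            c = 100 - a - b            # c is fixed once a and b are chosen
            if 5 <= c <= 25:
                events.append((a, b, c))
    return events

EVENTS = all_events()                  # built once, same order as the triple loop

def f(n):
    return EVENTS[n]

event = f(random.randrange(len(EVENTS)))   # every valid event is equally likely
```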
First of all, you probably have to adjust your ranges. 10% in bucket a is not possible: even with b and c at their maximums (50% + 25% = 75%), the condition a+b+c = number cannot hold unless a is at least 25%.
Concerning your question: (1) Pick a random number for bucket a inside your range, then (2) update the range for bucket b with its minimum and maximum percentage (you should only narrow the range). Then (3) pick a random number for bucket b. Finally, (4) c is calculated so that your condition holds.
Example:
n = 1000
(1) a = 40%
(2) range b [35,50], because 40+35+25 = 100%
(3) b = 45%
(4) c = 100-40-45 = 15%
Or:
n = 1000
(1) a = 70%
(2) range b [10,25], because 70+25+5 = 100%
(3) b = 20%
(4) c = 100-70-20 = 10%
It remains to be checked whether all the events are uniformly distributed. If that turns out to be a problem, you might want to randomize the range update in step 2.
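Steps (1) to (4) could be sketched in Python as below, working in whole percents. The helper names and the LIMITS table are assumptions of this example. Be aware that this sketch does not produce uniformly distributed events: a = 25 leaves only one valid b, yet it is picked as often as any other a.

```python
import random

# Percent ranges from the question: (minimum, maximum) per bucket.
LIMITS = {"a": (10, 70), "b": (10, 50), "c": (5, 25)}

def range_for_a():
    """Range of a narrowed so that some valid b and c still exist."""
    lo = max(LIMITS["a"][0], 100 - LIMITS["b"][1] - LIMITS["c"][1])
    hi = min(LIMITS["a"][1], 100 - LIMITS["b"][0] - LIMITS["c"][0])
    return lo, hi

def range_for_b(a):
    """Range of b narrowed so that c = 100 - a - b stays within its limits."""
    lo = max(LIMITS["b"][0], 100 - a - LIMITS["c"][1])
    hi = min(LIMITS["b"][1], 100 - a - LIMITS["c"][0])
    return lo, hi

def split(rng=random):
    a = rng.randint(*range_for_a())    # (1) pick a
    b = rng.randint(*range_for_b(a))   # (2) narrow the range, (3) pick b
    return a, b, 100 - a - b           # (4) c follows from the sum

a, b, c = split()
```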
I'm trying to make a randomizer that will use the Monte Carlo Hit or Miss Simulation.
I have a Key-Value pair that represents the ID and the probability value:
ID - Value
2 - 0.37
1 - 0.35
4 - 0.14
3 - 0.12
When you add all of those values, you will get a total of 1.0.
You can imagine those values as the total area of a "slice" on the "wheel" (EG: ID 2 occupies 37% of the wheel, while ID 3 only occupies 12% of the wheel). When converted to "range" it will look like this:
ID - Value - Range
2 - 0.37 - 0 to 37
1 - 0.35 - 37 to 72
4 - 0.14 - 72 to 86
3 - 0.12- 86 to 100
Now, I am using Random.NextDouble() to generate a random value that is between 0.0 and 1.0. That random value will be considered as the "spin" on the wheel. Say, the randomizer returns 0.35, then ID 2 will be selected.
What is the best way to implement this given that I have an array of doubles?
The simplest solutions are often the best. If your range is 0 - 100 by design (or another manageably small number), you can allocate an int[] and use the table of ranges you created to fill in the ID at the corresponding index. Your "throw" will then look like:
int randomID = rangesToIDs[random.nextInt(rangesToIDs.length)];
Btw, it is not necessary to sort the IDs by range size: as the randoms are assumed to be distributed uniformly, it does not matter where in the lookup table a range is placed. It only matters that the number of entries is proportional to the chance to throw an ID.
Let's assume your initial data is represented as array D[n], where D[i] = (id, p) and sum(D[i].p for i=0..n-1) == 1.
Build a second array P[n] such that P[i] = (q, id): P[i] = (sum(D[j].p for j in 0..i), D[i].id) -- i.e., convert the individual probability of each slice i into the cumulative probability of all slices up to and including i. Note that, by definition, this array P is ordered by field q (i.e. by cumulative probability).
Now you can use binary search to find the slice chosen by the random number r (0 <= r < 1):
find the lowest i such that P[i].q > r; then P[i].id is your slice.
It is possible to speed up the lookup further by hashing the probability range with a fixed grid. I can write more details on this if anybody is interested.
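A small Python sketch of the binary-search lookup, using the range boundaries from the question's table (in percent) rather than summing the individual probabilities; BOUNDARIES, IDS and spin are names chosen for this example.

```python
import bisect

# Upper boundary of each slice in percent, and the ID that owns the slice.
BOUNDARIES = [37, 72, 86, 100]
IDS = [2, 1, 4, 3]

def spin(r):
    """Map r in [0, 1) to the ID of the slice containing it."""
    i = bisect.bisect_right(BOUNDARIES, r * 100)   # lowest i with boundary > 100*r
    return IDS[min(i, len(IDS) - 1)]               # clamp in case r is exactly 1.0

winner = spin(0.35)   # 35 lies in the range 0 to 37, which belongs to ID 2
```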
As jk wrote, a sorted dictionary should be fine.
Let's say you have a dictionary like this:
0.37 2
0.72 1
0.86 4
1.00 3
You roll xx = 0.66..
Iterate through the dictionary starting from the lowest key (that's 0.37):
if xx < dict[i].key
return dict[i].value
Another solution that comes to mind is a list of custom objects, each containing a lower bound, an upper bound and a value. You then iterate through the list and check whether the rolled number lies between the lower and upper bounds.
A sorted map/dictionary with the 'Value' as the key and the 'ID' as the value would allow you to quickly find the upper bound of the range you are in and then look up the ID for that range.
Assuming your dictionary allows it, a binary search would be better for finding the upper bound than iterating through the entire dictionary.
boundaries = [37, 72, 86, 100]
num = 100 * random
for i in boundaries:
if num < i then return i