Issue with program in Scilab for calculating the cumulative distribution function of a geometric discrete random variable - probability

I wrote a program that plots the cumulative distribution function (CDF) and the PMF of a discrete random variable with a geometric distribution. But I've encountered a problem: the sum of all the PMF probabilities is not 1, only something very close to 1. When I saw this, I switched to Matlab and used the function geocdf. There I observed that the first value of the CDF, taking p = 0.6 and n = 10, is 0.84 and not 0.6 as expected. Can you please help me find out what's wrong with my program? Here's my script written in Scilab:
n = input('n = ');
tab = zeros(2, n);
p = input('p = ');//probability
q = 1 - p;
for k = 1:n
    tab(1,k) = k;
    tab(2,k) = (q^(k - 1))*p;
end
subplot(1,2,1);
plot(tab(1,:), tab(2,:), "-");
F = cumsum(tab(2, :));//cumulative distribution
subplot(1,2,2);
plot2d2(tab(1,:), F);
disp([tab' F']);
Mean = 1/p;
Variance = q/(p^2);
mprintf('\nMean = %g, Variance = %g', Mean, Variance);

The sum of all the probabilities of PMF is not 1
You cannot add up all the probabilities in a finite loop, because the geometric distribution assigns a nonzero probability to every positive integer. The sum of the first n terms is 1 - q^n, which is always less than 1: with p = 0.6 and n = 10 it is 0.9998951. If you run it up to n = 20, the cumulative value only appears as 1 because the output is rounded:
1. 0.6 0.6
2. 0.24 0.84
3. 0.096 0.936
4. 0.0384 0.9744
5. 0.01536 0.98976
6. 0.006144 0.995904
7. 0.0024576 0.9983616
8. 0.0009830 0.9993446
9. 0.0003932 0.9997379
10. 0.0001573 0.9998951
11. 0.0000629 0.9999581
12. 0.0000252 0.9999832
13. 0.0000101 0.9999933
14. 0.0000040 0.9999973
15. 0.0000016 0.9999989
16. 0.0000006 0.9999996
17. 0.0000003 0.9999998
18. 0.0000001 0.9999999
19. 4.123D-08 1.0000000
20. 1.649D-08 1.0000000
the first value of CDF, taking p = 0.6 and n = 10, is 0.84 and not 0.6 as expected
There are two versions of geometric distribution, as Wikipedia explains at the very beginning of the article. Your code follows the first convention, with PMF supported on the set 1,2,3,.... Matlab's geocdf uses the second convention, with PMF supported on 0,1,2,3... The output of geocdf(1,0.6) is 0.84, representing the sum of probabilities of 0 and 1. The output of geocdf(0,0.6) is 0.6.
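The two conventions can be compared side by side. The following is a minimal Python sketch, not Scilab or Matlab code; the function names are made up for this illustration, and it relies on the closed form 1 - q^k for the partial sum of the PMF.

```python
def geometric_cdf_trials(k, p):
    """P(X <= k) when X counts trials up to and including the first success (support 1, 2, 3, ...)."""
    return 1.0 - (1.0 - p) ** k if k >= 1 else 0.0


def geometric_cdf_failures(k, p):
    """P(Y <= k) when Y counts failures before the first success (support 0, 1, 2, ...)."""
    return 1.0 - (1.0 - p) ** (k + 1) if k >= 0 else 0.0


p = 0.6
print(geometric_cdf_trials(1, p))     # first CDF value under the Scilab script's convention
print(geometric_cdf_failures(0, p))   # the same event under the convention geocdf uses
print(geometric_cdf_failures(1, p))   # the value observed in the question
print(geometric_cdf_trials(10, p))    # partial sum of the first 10 PMF terms, still below 1
```

Shifting the argument by one converts between the two conventions, which is why the Matlab value at 1 matches the Scilab value at 2.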

Related

Calculate certainty of Monte Carlo simulation

Let's say that we use the Monte Carlo method to estimate the area of an object, in the exact same way you'd use it to estimate the value of π.
Now, let's say we want to calculate the certainty of our simulation result. We've cast n samples, m of which landed inside the object, so the area of the object is approximately m/n of the total sampled area. We would like to make a statement such as:
"We are 99% certain that the area of the object is between a1 and a2."
How can we calculate a1 and a2 above (given n, m, total area, and the desired certainty)?
Here is a program which attempts to estimate this bound numerically. Here the samples are points in [0,1), and the object is the segment [0.25,0.75). It prints a1 and a2 for 50%, 90%, and 99%, for a range of sample counts:
import std.algorithm;
import std.random;
import std.range;
import std.stdio;
void main()
{
    foreach (numSamples; iota(0, 1000+1, 100).filter!(n => n > 0))
    {
        auto samples = new double[numSamples];
        enum objectStart = 0.25;
        enum objectEnd = 0.75;
        enum numTotalSamples = 10_000_000;
        auto numSizes = numTotalSamples / numSamples;
        auto sizes = new double[numSizes];
        foreach (ref size; sizes)
        {
            size_t numHits;
            foreach (i; 0 .. numSamples)
            {
                auto sample = uniform01!double;
                if (sample >= objectStart && sample < objectEnd)
                    numHits++;
            }
            size = 1.0 / numSamples * numHits;
        }
        sizes.sort;
        writef("%d samples:", numSamples);
        foreach (certainty; [50, 90, 99])
        {
            auto centerDist = numSizes * certainty / 100 / 2;
            auto startPos = numSizes / 2 - centerDist;
            auto endPos = numSizes / 2 + centerDist;
            writef("\t%.5f..%.5f", sizes[startPos], sizes[endPos]);
        }
        writeln;
    }
}
It outputs:
// 50% 90% 99%
100 samples: 0.47000..0.53000 0.42000..0.58000 0.37000..0.63000
200 samples: 0.47500..0.52500 0.44500..0.56000 0.41000..0.59000
300 samples: 0.48000..0.52000 0.45333..0.54667 0.42667..0.57333
400 samples: 0.48250..0.51750 0.46000..0.54250 0.43500..0.56500
500 samples: 0.48600..0.51600 0.46400..0.53800 0.44200..0.55800
600 samples: 0.48667..0.51333 0.46667..0.53333 0.44833..0.55167
700 samples: 0.48714..0.51286 0.46857..0.53143 0.45000..0.54857
800 samples: 0.48750..0.51250 0.47125..0.53000 0.45375..0.54625
900 samples: 0.48889..0.51111 0.47222..0.52667 0.45778..0.54111
1000 samples: 0.48900..0.51000 0.47400..0.52500 0.45800..0.53900
Is it possible to precisely calculate these numbers instead?
(Context: I'd like to add something like "±X.Y GB with 99% certainty" to btdu)
Since the question is language agnostic, here is an illustration of how to do error estimation with Monte Carlo.
Suppose you want to compute the integral
$I = \int_0^1 f(x)\,dx$
where f(x) is a simple polynomial function
$f(x) = x^n$
To do that you have to compute not only the mean value of the samples, but their standard deviation as well.
Then, knowing that the Monte Carlo error goes down as the inverse square root of the number of samples, computing a confidence interval is simple.
Code, Python 3.7, Windows 10 x64
import numpy as np
rng = np.random.default_rng()
N = 100000
n = 2
def f(x):
    return np.power(x, n)
sample = f(rng.random(N)) # N samples of the function
m = np.mean(sample) # mean value of the sample, approaching integral value as N->∞
s = np.std(sample, ddof=1) # standard deviation with Bessel correction
e = s / np.sqrt(N) # Monte Carlo error decreases as inverse square root
t = 2.576 # For 99% confidence interval, we should take 2.58 sigma, per Gaussian distribution
#t = 3.00 # For 99.7% confidence interval, we should take 3 sigma, per Gaussian distribution
print(f'True integral value is {1.0/(1.0+n)}')
print(f'Computed integral value is in the range [{m-t*e}...{m+t*e}] with 99% confidence')
will print something like
True integral value is 0.3333333333333333
Computed integral value is in the range [0.33141772204489295...0.3362795491124624] with 99% confidence
You could use a Z-score table, along the lines of the one below, to print the table you want. You could vary N to see how the interval depends on N.
zscore = {'50%': 0.674, '80%': 1.282, '90%': 1.645, '95%': 1.960, '98%': 2.326, '99%': 2.576, '99.7%': 3.0}
for c, z in zscore.items():
    print(f'Computed integral value is in the range [{m-z*e}...{m+z*e}] with {c} confidence')
Based on Severin's answer, here is the code to calculate the values as stated in the question:
def calculate_error(n, m, z):
    p = m / n # Estimated hit fraction, i.e. the sample mean
    std_dev = (p * (1 - p)) ** 0.5 # Standard deviation of Bernoulli variable
    error = std_dev / n ** 0.5 # Monte Carlo error decreases as inverse square root
    return (p - z * error, p + z * error)
n = 1000
z = 2.576 # For 99% confidence interval, we should take 2.58 sigma, per Gaussian distribution
print(calculate_error(n, n * 0.5, z))
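To get a1 and a2 directly from n, m, the total area and the desired certainty, the z-score can be computed instead of looked up. This is a minimal sketch using the same normal approximation as the answers above; the function name `area_interval` and its parameters are made up for this illustration, and the approximation is expected to be poor when m is very close to 0 or to n.

```python
from statistics import NormalDist


def area_interval(n, m, total_area, confidence):
    """Approximate confidence interval for the object's area (normal approximation)."""
    p = m / n                                        # estimated hit fraction
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided z-score, about 2.576 for 0.99
    error = (p * (1 - p) / n) ** 0.5                 # standard error of the hit fraction
    return (total_area * (p - z * error), total_area * (p + z * error))


a1, a2 = area_interval(n=1000, m=500, total_area=1.0, confidence=0.99)
print(f"{a1:.5f}..{a2:.5f}")
```

For 1000 samples with half of them hitting, the interval comes out close to the 99% column that the simulation in the question produced for 1000 samples.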

Generating a random number with weighted probability - 'Distribution' gem

I would like to create a random number generator, that generates a random decimal number:
Greater than 0.0
Less than 15.0
Where the probability of that number being close to 2.0 is relatively high
The probability of it being near 15.0 or very close to zero is very low
I'm terrifically poor at mathematics but my research seems to tell me I want to pull a random number from a Cumulative Distribution Function resembling a Fisher–Snedecor (F) pattern, a bit like this one:
http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6303a2314437d8fcf2f72d9a56b1293a/f_distribution_probability.png
I am using a Ruby gem called Distribution (https://github.com/sciruby/distribution) to try and achieve this. It looks like the right tool, but I'm having a terrible time trying to understand how to use it to achieve the desired outcome :( Any help please.
I'll take it back, there is no rng call for F. So, if you want to use Distribution gem, what I would propose is to use Chi2 with 4 degrees of freedom.
The mode of Chi2 with k degrees of freedom is equal to k-2, so for 4 d.f. you'll get the mode at 2. My Ruby is rusty, bear with me
require 'distribution'
normal = Distribution::Normal.rng(0)
g1 = normal.call
g2 = normal.call
g3 = normal.call
g4 = normal.call
chi2 = g1*g1 + g2*g2 + g3*g3 + g4*g4
UPDATE
You have to truncate it at 15, so if the generated chi2 is greater than 15, just reject it and generate another one. Though I would say you won't see a lot of values above 15; check the graphs for the PDF/CDF.
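The generate-and-reject idea can be sketched in a few lines. This is an illustrative Python version rather than Ruby with the Distribution gem; the function name `truncated_chi2_4` is made up, and the fixed seed is there only to make the run repeatable.

```python
import random


def truncated_chi2_4(rng, upper=15.0):
    """Draw from chi-square with 4 degrees of freedom, rejecting values outside (0, upper)."""
    while True:
        # The sum of squares of four independent standard normals is chi-square with 4 d.f.
        value = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(4))
        if 0.0 < value < upper:
            return value


rng = random.Random(12345)  # fixed seed only so the sketch is repeatable
samples = [truncated_chi2_4(rng) for _ in range(10_000)]
print(min(samples), max(samples))
```

Rejection keeps the shape of the distribution below the cut-off intact, which a simple clamp to 15 would not: clamping would pile the rejected mass up exactly at 15.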
UPDATE II
And if you want to get samples from F, make a generic Chi2 generator for d degrees of freedom from the code above, and just sample the ratio of two chi2 variates, each divided by its degrees of freedom
chi2_d1 = DChi2(d1)
chi2_d2 = DChi2(d2)
f = (chi2_d1.call / d1) / (chi2_d2.call / d2)
UPDATE III
And, frankly, I don't see how you could get F distribution working for you. It is ok at 0, but mode is equal to (d1-2)/d1 * d2/(d2 + 2), and it is hard to see it equal to 2. Graph you provided has mode at about 1/3.
Here's a very crude, unscientific, non-mathy attempt at using the F-distribution with the parameters you gave in the F-function image (3 and 36).
First I calculate what F-value is needed for the CDF to be 0.975 (100% - 2.5% for the upper end of the range for your number 15):
To calculate that we can use the p_value method like so:
> F_15 = Distribution::F.p_value(0.975, 3, 36)
=> 3.5046846420861977
Next we simply use a multiplier so that when we calculate the CDF it will return the value 15 when the F-value is F_15.
> M = 15 / F_15
=> 4.27998565687528
And now we can generate random numbers with rand, which has a range of 0..1 like so:
[M * Distribution::F.p_value(rand, 3, 36), 15].min
The question is will this function be close to the number 2 with a 45% probability? Well..sort of. You need to pick the right parameters for the F-distribution to tweak the curve (or just adjust the multiplier M). But here's a sample with the parameters from your image:
0.step(0.99, 0.02).map { |n|
sprintf("%0.2f", M * Distribution::F.p_value(n, 3, 36))
}
Gives you:
["0.00", "0.26", "0.42", "0.57", "0.70", "0.83", "0.95", "1.07",
"1.20", "1.31", "1.43", "1.55", "1.67", "1.80", "1.92", "2.04",
"2.17", "2.30", "2.43", "2.56", "2.70", "2.84", "2.98", "3.13",
"3.28", "3.44", "3.60", "3.77", "3.95", "4.13", "4.32", "4.52",
"4.73", "4.95", "5.18", "5.43", "5.69", "5.97", "6.28", "6.61",
"6.97", "7.37", "7.81", "8.32", "8.90", "9.60", "10.45", "11.56",
"13.14", "15.90"]
Sometimes you know which distribution applies because of the nature of the data. If, for example, the random variable is the sum of independent, identical Bernoulli (two-state) random variables, you know the former has a binomial distribution, which can be approximated by a Normal distribution. When, as here, that does not apply, you can use a continuous distribution, shaped by its parameters, or simply use a discrete distribution. Others have made suggestions for using various continuous distributions, so I'll pass on some remarks about using a discrete distribution.
Suppose the discrete probability density function were the following:
pdf = [[0.5, 0.03], [1.0, 0.06], [1.5, 0.10], [ 2.0, 0.15], [2.5 , 0.15], [ 3.0, 0.10],
[4.0, 0.11], [6.0, 0.14], [9.0, 0.10], [12.0, 0.03], [14.0, 0.02], [15.0, 0.01]]
pdf.map(&:last).reduce(:+)
#=> 1.0
This could be interpreted as there being a probability of 0.03 that the random variable will be less than 0.5, a 0.06 probability of the random variable being greater than or equal to 0.5 and less than 1.0, and so on.
A discrete pdf might be constructed from historical data or by sampling, an advantage it has over using a continuous distribution. It can be made arbitrarily fine by increasing the numbers of intervals.
Next convert the pdf to a cumulative distribution function:
cum = 0.0
cdf = pdf.map { |k,v| [k, cum += v] }
#=> [[0.5, 0.03], [1.0, 0.09], [1.5, 0.19], [2.0, 0.34], [2.5, 0.49], [3.0, 0.59],
# [4.0, 0.7], [6.0, 0.84], [9.0, 0.94], [12.0, 0.97], [14.0, 0.99], [15.0, 1.0]]
Now use Kernel#rand to generate pseudo random variates between 0.0 and 1.0 and use Enumerable#find to associate the random variate with a cdf key:
def rnd(cdf)
  r = rand
  cdf.find { |k,v| r < v }.first
end
Note that cdf.find { |k,v| rand < v }.first would produce erroneous results, since rand is executed for each key-value pair of cdf.
Let's try it 100,000 times, recording the relative frequencies
n = 100_000
inc = 1.0/n
n.times.with_object(Hash.new(0.0)) { |_, h| h[rnd(cdf)] += inc }.
sort.
map { |k,v| [k, v.round(5)] }.to_h
#=> { 0.5=>0.03053, 1.0=>0.05992, 1.5=>0.10084, 2.0=>0.14959, 2.5=>0.15024,
# 3.0=>0.10085, 4.0=>0.10946, 6.0=>0.13923, 9.0=>0.09919, 12.0=>0.03073,
# 14.0=>0.01931, 15.0=>0.01011}

Number of ways to reach N from 0 using only 2 or 3?

I am solving a problem where we need to get from X=0 to X=N. We can only take a step of 2 or 3 at a time.
Each step of 2 has a probability of 0.2 and each step of 3 has a probability of 0.8. How can we find the total probability of reaching N?
e.g. for reaching 5,
2+3 with probability =0.2 * 0.8=0.16
3+2 with probability =0.8 * 0.2=0.16 total = 0.32.
My initial thoughts:
Number of ways can be found out by simple Fibonacci sequence.
f(n)=f(n-3)+f(n-2);
But how do we remember the numbers so that we can multiply them to find the probability?
This can be solved using dynamic programming.
Let's call the function F(N) = probability to reach 0 using only 2 and 3 when the starting number is N
F(N) = 0.2*F(N-2) + 0.8*F(N-3)
Base case:
F(0) = 1 and F(k) = 0 where k < 0
So the DP code would be something like this:
F[0] = 1;
for (int i = 1; i <= N; i++) {
    if (i >= 3)
        F[i] = 0.2*F[i-2] + 0.8*F[i-3];
    else if (i >= 2)
        F[i] = 0.2*F[i-2];
    else
        F[i] = 0;
}
return F[N];
This algorithm would run in O(N)
Some clarifications about this solution: I assume the only allowed operation for generating the number from 2s and 3s is addition (your definition would allow subtraction as well) and the input numbers are always valid (2 <= input). Definition: a unique row of numbers means that no other row with the same number of 3s and 2s in another order is in scope.
We can reduce the problem into multiple smaller problems:
Problem A: finding all sequences of numbers that can sum up to the given number. (Unique rows of numbers only)
Start by finding the minimum-number of 3s required to build the given number, which is simply input % 2. The maximum-number of 3s that can be used to build the input can be calculated this way:
int max_3 = (int) (input / 3);
if(input - max_3 * 3 == 1) // a remainder of 1 cannot be filled with 2s, so drop one 3
    --max_3;
Now all sequences of numbers that sum up to input must hold between input % 2 and max_3 3s. The 2s can be easily calculated from a given number of 3s.
Problem B: calculating the probability for a given list and its permutations to be the result
For each unique row of numbers, we can easily derive all permutations. Since these consist of the same numbers, they have the same likelihood of appearing and produce the same sum. The likelihood can be calculated easily from the row: 0.8 ^ number_of_3s * 0.2 ^ number_of_2s. The next step is to calculate the number of different permutations. The number of distinct arrangements with a specific number of 2s and 3s can be calculated by counting all possible placements of the 2s in the row: (number_of_2s + number_of_3s)! / (number_of_3s! * number_of_2s!). This is just the number of possible distinct permutations.
Now from theory to practice
Since the math is given, the rest is pretty straightforward:
define prob:
    input: int num
    output: double
    double result = 0.0
    int min_3s = (num % 2)
    int max_3s = (int) (num / 3)
    if(num - max_3s * 3 == 1)
        --max_3s
    // step by 2: the number of 3s must keep the parity of num
    for int c3s in [min_3s, max_3s] step 2
        int c2s = (num - (c3s * 3)) / 2
        double p = 0.8 ^ c3s * 0.2 ^ c2s
        p *= (c3s + c2s)! / (c3s! * c2s!)
        result += p
    return result
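The counting approach above can be written as a short runnable sketch. This is an illustrative Python version, not the answerer's code: the function name `reach_probability` is made up, `math.comb` stands in for the factorial quotient, and the loop steps by 2 because the number of 3s has to share the parity of the target for the remainder to split into 2s.

```python
from math import comb


def reach_probability(num, p2=0.2, p3=0.8):
    """Probability of landing exactly on num using steps of 2 (p2) and 3 (p3)."""
    total = 0.0
    # The count of 3s must share the parity of num so the remainder splits into 2s.
    for c3 in range(num % 2, num // 3 + 1, 2):
        c2 = (num - 3 * c3) // 2
        orderings = comb(c2 + c3, c3)          # distinct orders of c2 twos and c3 threes
        total += orderings * p3 ** c3 * p2 ** c2
    return total


for n in range(10):
    print(n, reach_probability(n))
```

For N = 5 this covers the single composition {2, 3} in its two orders, which is the 0.32 worked out by hand in the question.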
Instead of jumping into the programming, you can use math.
Let p(n) be the probability that you reach the location that is n steps away.
Base cases:
p(0)=1
p(1)=0
p(2)=0.2
Linear recurrence relation
p(n+3)=0.2 p(n+1) + 0.8 p(n)
You can solve this in closed form by finding the exponential solutions to the linear recurrent relation.
c^3 = 0.2 c + 0.8
c = 1, (-5 +- sqrt(55)i)/10
Although this was cubic, c=1 will always be a solution in this type of problem since there is a constant nonzero solution.
Because the roots are distinct, all solutions are of the form a1(1)^n + a2((-5+sqrt(55)i) / 10)^n + a3((-5-sqrt(55)i)/10)^n. You can solve for a1, a2, and a3 using the initial conditions:
a1=5/14
a2=(99+sqrt(55)i)/308
a3=(99-sqrt(55)i)/308
This gives you a nonrecursive formula for p(n):
p(n)=5/14+(99+sqrt(55)i)/308((-5+sqrt(55)i)/10)^n+(99-sqrt(55)i)/308((-5-sqrt(55)i)/10)^n
One nice property of the non-recursive formula is that you can read off the asymptotic value of 5/14, but that's also clear because the average value of a jump is 2(1/5)+ 3(4/5) = 14/5, and you almost surely hit a set with density 1/(14/5) of the integers. You can use the magnitudes of the other roots, 2/sqrt(5)~0.894, to see how rapidly the probabilities approach the asymptotics.
5/14 - (|a2|+|a3|) 0.894^n < p(n) < 5/14 + (|a2|+|a3|) 0.894^n
|5/14 - p(n)| < (|a2|+|a3|) 0.894^n
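A quick numerical check of the closed form against the recurrence can be sketched as follows. This is an illustrative Python snippet with made-up function names. It assumes that the coefficient paired with the root (-5 + sqrt(55)i)/10 is (99 + sqrt(55)i)/308, since that pairing reproduces the initial conditions p(1) = 0 and p(2) = 0.2, and it uses the fact that the two complex terms are conjugates, so their sum is twice a real part.

```python
from math import sqrt

r = complex(-5, sqrt(55)) / 10      # one complex root; the other is its conjugate
a1 = 5 / 14
a2 = complex(99, sqrt(55)) / 308    # coefficient paired with r (assumed pairing, see above)


def p_closed(n):
    """Closed-form probability; the conjugate terms add up to twice a real part."""
    return a1 + 2 * (a2 * r ** n).real


def p_recurrence(n):
    """Same probability from p(n+3) = 0.2 p(n+1) + 0.8 p(n)."""
    values = [1.0, 0.0, 0.2]
    while len(values) <= n:
        values.append(0.2 * values[-2] + 0.8 * values[-3])
    return values[n]


for n in range(10):
    print(n, round(p_closed(n), 6), round(p_recurrence(n), 6))
```

The two columns should agree up to floating-point rounding, and both drift towards 5/14 as n grows.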
f(n, p) = f(n-3, p*.8) + f(n -2, p*.2)
Start p at 1.
If n=0 return p, if n <0 return 0.
Instead of using the (terribly inefficient) recursive algorithm, start from the start and calculate in how many ways you can reach subsequent steps, i.e. using 'dynamic programming'. This way, you can easily calculate the probabilities and also have a complexity of only O(n) to calculate everything up to step n.
For each step, memorize the possible ways of reaching that step, if any (no matter how), and the probability of reaching that step. For the zeroth step (the start) this is (1, 1.0).
steps = [(1, 1.0)]
Now, for each consecutive step n, get the previously computed possible ways poss and probability prob to reach steps n-2 and n-3 (or (0, 0.0) in case of n < 2 or n < 3 respectively), add those to the combined possibilities and probability to reach that new step, and add them to the list.
for n in range(1, 10):
    poss2, prob2 = steps[n-2] if n >= 2 else (0, 0.0)
    poss3, prob3 = steps[n-3] if n >= 3 else (0, 0.0)
    steps.append( (poss2 + poss3, prob2 * 0.2 + prob3 * 0.8) )
Now you can just get the numbers from that list:
>>> for n, (poss, prob) in enumerate(steps):
... print "%s\t%s\t%s" % (n, poss, prob)
0 1 1.0
1 0 0.0
2 1 0.2
3 1 0.8
4 1 0.04
5 2 0.32 <-- 2 ways to get to 5 with combined prob. of 0.32
6 2 0.648
7 3 0.096
8 4 0.3856
9 5 0.5376
(Code is in Python)
Note that this will get you both the number of possible ways of reaching a certain step (e.g. "first 2, then 3" or "first 3, then 2" for 5), and the probability to reach that step in one go. Of course, if you need only the probability, you can just use single numbers instead of tuples.

Pseudo-Random Variable

I have a variable, between 0 and 1, which should dictate the likelihood that a second variable, a random number between 0 and 1, is greater than 0.5. In other words, if I were to generate the second variable 1000 times, the average should be approximately equal to the first variable's value. How do I make this code?
Oh, and the second variable should always be capable of producing either 0 or 1 in any condition, just more or less likely depending on the value of the first variable. Here is a link to a graph which models approximately how I would like the program to behave. Each equation represents a separate value for the first variable.
You have a variable p and you are looking for a mapping function f(x) that maps uniformly distributed random rolls x in [0, 1] to the same interval [0, 1] such that the expected value, i.e. the average of all rolls, is p.
You have chosen the function prototype
f(x) = pow(x, c)
where c must be chosen appropriately. If x is uniformly distributed in [0, 1], the average value is:
int(f(x) dx, [0, 1]) == p
With the integral:
int(pow(x, c) dx) == pow(x, c + 1) / (c + 1) + K
evaluated from 0 to 1 this gives 1 / (c + 1) == p, so one gets:
c = 1/p - 1
A different approach is to make p the median value of the distribution, such that half of the rolls fall below p, the other half above p. This yields a different distribution. (I am aware that you didn't ask for that.) Now, we have to satisfy the condition:
f(0.5) == pow(0.5, c) == p
which yields:
c = log(p) / log(0.5)
With the current function prototype, you cannot satisfy both requirements. Your function is also asymmetric (f(x, p) != f(1-x, 1-p)).
Python functions below:
import math
import random

def medianrand(p):
    """Random number between 0 and 1 whose median is p"""
    c = math.log(p) / math.log(0.5)
    return math.pow(random.random(), c)

def averagerand(p):
    """Random number between 0 and 1 whose expected value is p"""
    c = 1/p - 1
    return math.pow(random.random(), c)
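An empirical check of both mappings can be sketched like this. It is a minimal illustration, not part of the original answer: the function names, the seed and the choice p = 0.3 are arbitrary, and the generator is passed in explicitly only to make the run repeatable.

```python
import math
import random
import statistics


def average_rand(p, rng):
    """Uniform roll raised to c = 1/p - 1, so the expected value is p."""
    return math.pow(rng.random(), 1 / p - 1)


def median_rand(p, rng):
    """Uniform roll raised to c = log(p)/log(0.5), so the median is p."""
    return math.pow(rng.random(), math.log(p) / math.log(0.5))


rng = random.Random(2024)   # fixed seed only for repeatability
p = 0.3
mean_est = statistics.fmean(average_rand(p, rng) for _ in range(200_000))
median_est = statistics.median(median_rand(p, rng) for _ in range(200_000))
print(mean_est, median_est)
```

Both estimates should land near 0.3. Swapping the two statistics (taking the median of `average_rand` or the mean of `median_rand`) generally does not give p, which illustrates why one exponent cannot satisfy both requirements.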
You can do this by using a dummy. First set the first variable to a value between 0 and 1. Then create a random number in the dummy between 0 and 1. If this dummy is bigger than the first variable, you generate a random number between 0 and 0.5, and otherwise you generate a number between 0.5 and 1.
In pseudocode:
real a = 0.7
real total = 0.0
for i between 0 and 1000 begin
    real dummy = rand(0,1)
    real b
    if dummy > a then
        b = rand(0,0.5)
    else
        b = rand(0.5,1)
    end if
    total = total + b
end for
real avg = total / 1000
Please note that this algorithm will generate average values between 0.25 and 0.75. For a = 1 it will only generate random values between 0.5 and 1, which should average to 0.75. For a=0 it will generate only random numbers between 0 and 0.5, which should average to 0.25.
I've made a sort of pseudo-solution to this problem, which I think is acceptable.
Here is the algorithm I made;
a = 0.2 # variable one
b = 0 # variable two
b = random.random()
b = b^(1/(2^(4*a-1)))
It doesn't actually produce the average results that I wanted, but it's close enough for my purposes.
Edit: Here's a graph I made that consists of a large amount of datapoints I generated with a python script using this algorithm;
import random
mod = 6
div = 100
for z in range(div):
    s = 0
    for i in range(100000):
        a = (z+1)/float(div) # variable one
        b = random.random() # variable two
        c = b**(1/(2**((mod*a*2)-mod)))
        s += c
    print(str((z+1)/float(div)) + "\t" + str(round(s/100000.0, 3)))
Each point in the table is the result of 100000 randomly generated points from the algorithm; their x positions being the a value given, and their y positions being their average. Ideally they would fit to a straight line of y = x, but as you can see they fit closer to an arctan equation. I'm trying to mess around with the algorithm so that the averages fit the line, but I haven't had much luck as of yet.

generate random numbers within a range with different probabilities

How can I generate a random number between A = 1 and B = 10 where each number has a different probability?
Example: number / probability
1 - 20%
2 - 20%
3 - 10%
4 - 5%
5 - 5%
...and so on.
I'm aware of some hard-coded workarounds which unfortunately are of no use with larger ranges, for example A = 1000 and B = 100000.
Assume we have a
Rand()
method which returns a random number R, 0 < R < 1. Can anyone post a code sample with a proper way of doing this? Preferably in C# / Java / ActionScript.
Build an array of 100 integers and populate it with 20 1's, 20 2's, 10 3's, 5 4's, 5 5's, etc. Then just randomly pick an item from the array.
int[] numbers = new int[100];
// populate the first 20 with the value '1'
for (int i = 0; i < 20; ++i)
{
    numbers[i] = 1;
}
// populate the rest of the array as desired.
// To get an item:
// Since your Rand() function returns 0 < R < 1
int ix = (int)(Rand() * 100);
int num = numbers[ix];
This works well if the number of items is reasonably small and your precision isn't too strict. That is, if you wanted 4.375% 7's, then you'd need a much larger array.
There is an elegant algorithm attributed by Knuth to A. J. Walker (Electronics Letters 10, 8 (1974), 127-128; ACM Trans. Math Software 3 (1977), 253-256).
The idea is that if you have a total of k * n balls of n different colors, then it is possible to distribute the balls in n containers such that container no. i contains balls of color i and at most one other color. The proof is by induction on n. For the induction step pick the color with the least number of balls.
In your example n = 10. Multiply the probabilities with a suitable m such that they are all integers. So, maybe m = 100 and you have 20 balls of color 0, 20 balls of color 1, 10 balls of color 2, 5 balls of color 3, etc. So, k = 10.
Now generate a table of dimension n with each entry holding a probability (the ratio of balls of color i to the container size k) and the other color.
To generate a random ball, generate a random floating-point number r in the range [0, n). Let i be the integer part (floor of r) and x the excess (r – i).
if (x < table[i].probability) output i
else output table[i].other
The algorithm has the advantage that for each random ball you only make a single comparison.
Let me work out an example (same as Knuth).
Consider simulating throwing a pair of dice.
So P(2) = 1/36, P(3) = 2/36, P(4) = 3/36, P(5) = 4/36, P(6) = 5/36, P(7) = 6/36, P(8) = 5/36, P(9) = 4/36, P(10) = 3/36, P(11) = 2/36, P(12) = 1/36.
Multiply by 36 * 11 to get 396 balls, 11 of color 2, 22 of color 3, 33 of color 4, …, 11 of color 12.
We have k = 396 / 11 = 36.
Table[2] = (11/36, color 4)
Table[12] = (11/36, color 10)
Table[3] = (22/36, color 5)
Table[11] = (22/36, color 5)
Table[4] = (8/36, color 9)
Table[10] = (8/36, color 6)
Table[5] = (16/36, color 6)
Table[9] = (16/36, color 8)
Table[6] = (7/36, color 8)
Table[8] = (6/36, color 7)
Table[7] = (36/36, color 7)
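One way to build such a table programmatically is sketched below. Note the swap: this is the small/large worklist construction usually credited to Vose, not necessarily the exact procedure Knuth or Walker describe, so the resulting table can differ from the one above while encoding the same probabilities. The function names are made up for this illustration, and exact `Fraction` inputs are assumed so that no rounding creeps into the table.

```python
import random
from fractions import Fraction


def build_alias_table(probs):
    """Return (cutoff, alias) lists; slot i yields i with chance cutoff[i], else alias[i]."""
    n = len(probs)
    scaled = [Fraction(p) * n for p in probs]        # every container must end up holding mass 1
    cutoff, alias = [Fraction(1)] * n, list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1]
    large = [i for i, s in enumerate(scaled) if s >= 1]
    while small and large:
        s, l = small.pop(), large.pop()
        cutoff[s], alias[s] = scaled[s], l            # container s: own color, topped up with color l
        scaled[l] -= 1 - scaled[s]                    # color l gave away the top-up
        (small if scaled[l] < 1 else large).append(l)
    return cutoff, alias


def draw(cutoff, alias, rng):
    """Pick a container uniformly, then make the single comparison."""
    r = rng.random() * len(cutoff)
    i = int(r)
    return i if r - i < cutoff[i] else alias[i]


# Sum of two dice: index 0 stands for a total of 2, index 10 for a total of 12.
dice = [Fraction(6 - abs(7 - total), 36) for total in range(2, 13)]
cutoff, alias = build_alias_table(dice)
rng = random.Random(7)  # fixed seed only for repeatability
print([2 + draw(cutoff, alias, rng) for _ in range(10)])
```

Setting the table up takes time linear in n, after which every draw costs one random number and one comparison regardless of how many outcomes there are.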
Assuming that you have a function p(n) that gives you the desired probability for a random number:
r = rand() // a random number between 0 and 1
for i in A to B do
    if r < p(i)
        return i
    r = r - p(i)
done
A faster way is to create an array of (B - A) * 100 elements and populate it with the numbers from A to B such that the fraction of the array occupied by each number equals its probability. You can then generate a uniform random number to get an index into the array and read your random number directly from it.
Map your uniform random results to the required outputs according to the probabilities.
E.g., for your example:
If `0 <= Rand() <= 0.2`: result = 1.
If `0.2 < Rand() <= 0.4`: result = 2.
If `0.4 < Rand() <= 0.5`: result = 3.
If `0.5 < Rand() <= 0.55`: result = 4.
If `0.55 < Rand() <= 0.6`: result = 5.
...
Here's an implementation of Knuth's Algorithm. As discussed in some of the answers, it works by
1) creating a table of summed frequencies
2) generating a random integer
3) rounding it with the ceiling function
4) finding the "summed" range within which the random number falls and outputting the original array entity based on it
Inverse Transform
In probability speak, a cumulative distribution function F(x) returns the probability that any randomly drawn value, call it X, is <= some given value x. For instance, if I did F(4) in this case, I would get .55, because the running sum of probabilities in your example is {.2, .4, .5, .55, .6, .65, ....}. I.e. the probability of randomly getting a value less than or equal to 4 is .55. However, what I actually want to know is the inverse of the cumulative probability function, call it F_inv. I want to know what the x value is, given the cumulative probability. I want to pass in F_inv(.55) and get back 4. That is why this is called the inverse transform method.
So, in the inverse transform method, we are basically trying to find the interval in the cumulative distribution in which a random Uniform (0,1) number falls. This works out to the algorithm that perreal and icepack posted. Here is another way to state it in terms of the cumulative distribution function
Generate a random number U
for x in A .. B
if U <= F(x) then return x
Note that it might be more efficient to have the loop go from B to A and check if U >= F(x) if the smaller probabilities come at the beginning of the distribution
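For larger ranges, the linear scan can be replaced by a binary search over the precomputed running sums. The following is a minimal Python sketch of that idea; `make_sampler` is a made-up name, and the probabilities after the fifth entry are assumed values chosen only so that the list sums to 1, since the question elides them.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def make_sampler(values, probs, rng):
    """Return a function that draws from values with the given probabilities."""
    cdf = list(accumulate(probs))                 # running sums F(x)

    def sample():
        u = rng.random()
        # First index whose cumulative probability is >= u, i.e. the smallest x with U <= F(x).
        i = min(bisect_left(cdf, u), len(values) - 1)   # guard against rounding in the last sum
        return values[i]

    return sample


values = list(range(1, 11))
# Entries after the fifth are assumed for illustration; the question only lists the first five.
probs = [0.20, 0.20, 0.10, 0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.10]
sample = make_sampler(values, probs, random.Random(1))
print([sample() for _ in range(10)])
```

Each draw then costs O(log n) comparisons after an O(n) setup, which matters for ranges such as A = 1000 to B = 100000.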

Resources