This is a purely theoretical question.
We all know that most, if not all, random-number generators actually only generate pseudo-random numbers.
Let's say I want a random number from 10 to 20. I can do this as follows (myRandomNumber being an integer-type variable):
myRandomNumber = rand(10, 20);
However, suppose I execute this statement instead:
myRandomNumber = rand(5, 10) + rand(5, 10);
Is this method more random?
No.
The randomness is not cumulative. The rand() function uses a uniform distribution between your two defined endpoints.
Adding two uniformly distributed values does not produce a uniform distribution. The sum has a triangular (pyramid-shaped) distribution, with the most probability concentrated at the center, because there are more ways to reach a middle value than an extreme one. The distribution of the sum is the convolution of the two individual distributions.
I urge you to read this:
Uniform Distribution
and this:
Convolution
Pay special attention to what happens with the two uniform distributions on the top right of the screen.
You can prove this to yourself by writing all the sums to a file and then plotting them in Excel. Make sure you use a large enough sample size; 25000 should be sufficient.
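The experiment suggested above can also be sketched in Python instead of a spreadsheet. This is only an illustration: it assumes that rand(a, b) in the question returns an integer and includes both endpoints, which is what random.randint does, and the seed and variable names are arbitrary.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
SAMPLES = 25000

# randint(a, b) includes both endpoints (assumed to match rand(a, b) in the question)
single = Counter(rng.randint(10, 20) for _ in range(SAMPLES))
summed = Counter(rng.randint(5, 10) + rng.randint(5, 10) for _ in range(SAMPLES))

# Print both histograms side by side for the values 10..20
for value in range(10, 21):
    print(f"{value:2d}  single={single[value]:5d}  summed={summed[value]:5d}")
```

The single column should come out roughly flat, while the summed column should rise toward 15 and fall off again on both sides.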
The best way to understand this is by considering the popular fairground game "Lucky Seven".
If we roll a six sided die, we know that the probability of obtaining any of the six numbers is the same - 1/6.
What if we roll two dice and add the numbers that appear on them?
The sum can range from 2 (both dice show 'one') up to 12 (both dice show 'six').
The probabilities of obtaining different numbers from 2 to 12 are no longer uniform. The probability of obtaining a 'seven' is the highest. There can be a 1+6, a 6+1, a 2+5, a 5+2, a 3+4 and a 4+3. Six ways of obtaining a 'seven' out of 36 possibilities.
If we plot the distribution we get a pyramid. The probabilities would be 1,2,3,4,5,6,5,4,3,2,1 (of course each of these has to be divided by 36).
The pyramidal figure (and the probability distribution) of the sum can be obtained by 'convolution'.
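As a small sketch of what convolution means here, the following Python computes the number of ways to reach each sum of two dice. The helper name convolve is just illustrative, not taken from any library.

```python
def convolve(p, q):
    """Discrete convolution: out[k] is the sum of p[i] * q[j] over all i + j == k."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

die = [1] * 6              # one way each to roll 1..6 (index 0 is face 1)
ways = convolve(die, die)  # index 0 is the sum 2, index 10 is the sum 12

for total, count in enumerate(ways, start=2):
    print(f"{total:2d}: {count}/36")
```

Convolving the result with a third die would give the distribution for three dice, which already starts to look bell-shaped.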
If we know the 'expected value' and standard deviation ('sigma') for the two random numbers, and the two are independent, we can perform a quick and ready calculation of the expected value and sigma of their sum.
The expected value is simply the addition of the two individual expected values.
The sigma is obtained by applying Pythagoras' theorem to the two individual sigmas (the square root of the sum of the squares of each sigma).
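These two rules can be checked numerically for the two-dice case. The sketch below enumerates all 36 equally likely outcomes; the function and variable names are made up for the illustration.

```python
import math
from itertools import product

faces = range(1, 7)

def mean_and_sigma(values):
    """Mean and (population) standard deviation of equally likely values."""
    values = list(values)
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

mean_one, sigma_one = mean_and_sigma(faces)
# All 36 equally likely outcomes of two independent dice
mean_sum, sigma_sum = mean_and_sigma(a + b for a, b in product(faces, faces))

print(mean_one, sigma_one)
print(mean_sum, sigma_sum)
print(math.hypot(sigma_one, sigma_one))  # "Pythagoras" on the two sigmas
```

The last two printed sigmas should agree up to floating-point rounding.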
I'm looking for an efficient algorithm to generate or iteratively approximate a solution to the problem described below.
You are given an array of length N and a finite set of numbers Si for each index i of the array. Suppose we place a number from Si at each index i to fill the entire array, while ensuring that each number is unique across the entire array. Over all the possible arrays, what is the probability distribution over each number at each index?
Here I give an example:
Assuming we have the following array of length 3, with each column representing Si at the index of the column (the first column has no 2):
4 4 4
- 2 2
1 1 1
We will have the following possible arrays:
421
412
124
142
And the following probability distribution (rows are the numbers 4, 2, 1 as laid out above; columns are the indices):
4: 0.5  0.25 0.25
2: 0    0.5  0.5
1: 0.5  0.25 0.25
Brute forcing this problem is obviously doable but I have a gut feeling that there must be some more efficient algorithms for this.
The reason I think so is that one can derive the probability distribution from the set of all possibilities but not the other way around, so the distribution itself must contain less information than the set of all possibilities. Therefore, I believe that we do not need to generate all possibilities just to obtain the probability distribution.
Hence, I am wondering if there is any smart matrix operation we could use for this problem or even fixed-point iteration/density evolution to approximate the end probability distribution? Some other potentially more efficient approaches to this problem are also appreciated.
(P.S. The reason I am interested in this problem is that I want to generate a probability distribution over candidate numbers for the empty cells in a sudoku, and in other sudoku-like games without a unique answer, by applying only the standard rules.)
Sudoku is a combinatorial problem. It is easy to show that the probability of any independent cell is uniform (because you can relabel a configuration to put any number at a given position). The joint probabilities are more complicated.
If the game is partially filled you have constraints that will affect this distribution.
You must devise an algorithm to count the number of solutions from a given initial configuration. Then you compute the fraction of the total solutions that have a specific value at the position of interest.
counts = {}
for i in range(1, 10):
    # fix the cell of interest to i and count the completions
    board[cell] = i
    counts[i] = countSolutions(board)
# normalise the counts into probabilities
total = sum(counts.values())
prob = {i: counts[i] / total for i in range(1, 10)}
The same approach works for joint probabilities but in some cases the number of possibilities may be too high.
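To make the counting approach concrete, here is a minimal sketch applied to the three-cell example from the question rather than to a full sudoku. The names count_solutions and marginals are hypothetical, and the plain backtracking counter is only practical for small inputs.

```python
from fractions import Fraction

def count_solutions(candidates, assigned=()):
    """Count the ways to extend `assigned` to a full array with all values distinct."""
    i = len(assigned)
    if i == len(candidates):
        return 1
    return sum(
        count_solutions(candidates, assigned + (v,))
        for v in candidates[i]
        if v not in assigned
    )

def marginals(candidates):
    """For each index, the fraction of all solutions that place each value there."""
    total = count_solutions(candidates)
    result = []
    for idx, options in enumerate(candidates):
        dist = {}
        for v in sorted(options):
            # Fix index idx to v and count the remaining completions
            fixed = candidates[:idx] + [{v}] + candidates[idx + 1:]
            dist[v] = Fraction(count_solutions(fixed), total)
        result.append(dist)
    return result

S = [{1, 4}, {1, 2, 4}, {1, 2, 4}]  # the candidate sets from the question's example
dist = marginals(S)
for idx, d in enumerate(dist):
    print(idx, {v: str(p) for v, p in d.items()})
```

For a real sudoku the same structure applies, but the solution counter would need to be far more efficient than this exhaustive search.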
I am looking for a mathematical formula or algorithm that can generate uniform random numbers in ascending order in the range [0,1] without using the division operator. I am keen to avoid division because I am implementing this in hardware. Thank you.
Generating the numbers in ascending (or descending) order means generating them sequentially but with the right distribution. That, in turn, means we need to know the distribution of the minimum of a set of size N, and then at each stage we need to use conditioning to determine the next value based on what we've already seen. Mathematically these are both straightforward except for the issue of avoiding division.
You can generate the minimum of N uniform(0,1)'s from a single uniform(0,1) random number U using the algorithm min = 1 - U**(1/N), where ** denotes exponentiation. In other words, the complement of the Nth root of a uniform has the same distribution as the minimum of N uniforms over the range [0,1], which can then be scaled to any other interval length you like.
The conditioning aspect basically says that the k values already generated will have eaten up some portion of the original interval, and that what we now want is the minimum of N-k values, scaled to the remaining range.
Combining the two pieces yields the following logic. Generate the smallest of the N uniforms, scale it by the remaining interval length (1 the first time), and make that result the last value we have generated. Then generate the smallest of N-1 uniforms, scale it by the remaining interval length, and add it to the last one to give you your next value. Lather, rinse, repeat, until you have done them all. The following Ruby implementation gives distributionally correct results, assuming you have read in or specified N prior to this:
last_u = 0.0
N.downto(1) do |i|
  p last_u += (1.0 - last_u) * (1.0 - (rand ** (1.0/i)))
end
but we have that pesky ith root which uses division. However, if we know N ahead of time, we can pre-calculate the inverses of the integers from 1 to N offline and table them.
last_u = 0.0
N.downto(1) do |i|
  p last_u += (1.0 - last_u) * (1.0 - (rand ** inverse[i]))
end
I don't know of any way to get the correct distributional behavior sequentially without using exponentiation. If that's a show-stopper, you're going to have to give up on either the sequential nature of the process or the uniformity requirement.
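For readers who prefer Python, the table-based variant above might be sketched as follows. The function name and the way the inverse table is built are assumptions for the illustration; in hardware the table would be precomputed offline, so the division on the table-building line would not happen at run time.

```python
import random

def ascending_uniforms(n, rng):
    """Yield n values distributed like a sorted sample of n uniform(0, 1) draws."""
    # Table of 1/i for i = 1..n; index 0 is unused padding.
    inverse = [0.0] + [1.0 / i for i in range(1, n + 1)]
    last = 0.0
    for i in range(n, 0, -1):
        # Minimum of i uniforms, scaled to the remaining interval (last, 1)
        last += (1.0 - last) * (1.0 - rng.random() ** inverse[i])
        yield last

values = list(ascending_uniforms(10, random.Random(42)))
print(values)
```

Each value is at least as large as the one before it, so no sorting step is needed afterwards.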
You can try so-called "stratified sampling", which means you divide the range into bins and then sample randomly from bins. A sample thus generated is more uniform (less clumping) than a sample generated from the entire interval. For this reason, stratified sampling reduces the variance of Monte Carlo estimates (I don't suppose that's important to you, but that's why the method was invented, as a reduction of variance method).
It is an interesting problem to generate numbers in order, but my guess is that to get a uniform distribution over the entire interval, you will have to apply some formulas which require more computation. If you want to minimize computation time, I suspect you cannot do better than generating a sample and then sorting it.
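A minimal sketch of the stratified idea, assuming one draw per equal-width bin, is shown below. Note that the result is ascending by construction but is not distributed like a sorted sample of independent uniforms, which is the trade-off mentioned above. The function name is hypothetical.

```python
import random

def stratified_ascending(n, rng):
    """One uniform draw from each of n equal-width bins, in ascending order."""
    width = 1.0 / n  # a constant that could be precomputed, so no run-time division
    return [(k + rng.random()) * width for k in range(n)]

samples = stratified_ascending(8, random.Random(7))
print(samples)
```

If n is a power of two, multiplying by the bin width reduces to a bit shift in fixed-point hardware.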
I understand 3D hyperplanes can represent numbers generated by linear congruential generator. But I don't get how it determines the location for each number or point. Especially in a 3D cube? I mean, doesn't a point have to have X, Y, and Z values to be in there?! What if one of the numbers generated is "8"? It's just "8"... how would I know XYZ for that? (I hope you know what I'm talking about... couldn't post an image, sorry :/)
Suppose you generate batches of three pseudo-random numbers in a sequence from your linear congruential generator and use the first number in each batch as the x-dimension, the next as the y-dimension and the last as the z-dimension, you can then plot each batch of three pseudo-random numbers in a x-y-z cube. A similar argument goes for generating batches of n (n > 3) numbers, except you'll plot them in a hypercube.
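The batching described above can be sketched in a few lines of Python. The generator parameters are the deliberately poor ones used in the example further down, and the helper name lcg_stream is made up for this illustration.

```python
def lcg_stream(seed, a, c, m, count):
    """Return `count` successive outputs of X -> (a * X + c) % m."""
    values = []
    x = seed
    for _ in range(count):
        x = (a * x + c) % m
        values.append(x)
    return values

stream = lcg_stream(seed=0, a=43, c=5, m=256, count=12)
# Consecutive, non-overlapping batches of three become (x, y, z) points
points = [tuple(stream[i:i + 3]) for i in range(0, len(stream), 3)]
print(stream)
print(points)
```

So a single output such as 8 never has a position on its own; it only becomes a coordinate once it is grouped with its neighbours in the sequence.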
Assume that you are generating each of those pseudo-random numbers with b bits. There are then 2^(nb) possible points that would have to be generated to fill the (hyper)cube (which will be a very large number, for any typical value of b). However, if the generator has a period of less than 2^(nb) (which will almost always be the case for practical purposes), it won't fill all the available spaces in the cube (or hypercube, if n > 3). It will only fill some of the spaces.
What's more, the filled spaces may be located in planes (or hyperplanes, if n > 3) passing through the (hyper)cube, with spaces in-between the (hyper)planes that represent numbers that the generator will never produce because it repeats its cycle without ever producing such a number. This occurs because the pseudo-random numbers are serially correlated. You can see this behaviour at any dimensionality but the number of (hyper)planes on which the pseudo-random numbers are located reduces as the dimensionality n increases, so the behaviour becomes much more obvious as n gets larger.
This can be a particular problem when using the generated pseudo-random numbers as input to a simulation, because the simulation can then produce output that is more an artefact of the imperfections of the pseudo-random numbers than a consequence of the simulated model.
The Wikipedia article on Linear congruential generator is excellent.
(EDITED TO ADD AN EXAMPLE)
Here is a linear congruential generator (with very poor parameters selected deliberately) implemented in Python. Pseudo-random numbers with an even index are assigned to x values and those with an odd index are assigned to y values.
import matplotlib.pyplot as plt

def lcg(X, a, c, m):
    # one step of the recurrence X -> (a * X + c) mod m
    return (a * X + c) % m

x = []
y = []
X = 0
for i in range(1000):
    X = lcg(X, 43, 5, 256)
    if i % 2 == 0:
        x.append(X)
    else:
        y.append(X)
plt.scatter(x, y)
plt.show()
This script produces the following output:
You can see that the resulting (x,y) pairs are all found on a small number of straight lines and pairs that appear in-between the lines can never be produced by the generator. The same thing can be done in three or more dimensions to see how generators with better parameters than I've used here still produce outputs that sit on lines, planes or hyperplanes in 2, 3, or n-dimensional space.
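The sparseness can also be checked without a plot. The sketch below reuses the parameters from the script above and counts how many distinct (x, y) points the generator actually produces; the variable names are only illustrative.

```python
def lcg(x, a, c, m):
    """One step of the recurrence X -> (a * X + c) mod m."""
    return (a * x + c) % m

A, C, M = 43, 5, 256  # the deliberately poor parameters from the script above
x = 0
outputs = []
for _ in range(1000):
    x = lcg(x, A, C, M)
    outputs.append(x)

# Even-indexed outputs are x values, odd-indexed outputs are y values
pairs = set(zip(outputs[0::2], outputs[1::2]))
print(len(pairs), "distinct (x, y) points out of", M * M, "possible")
```

Because each pair is fully determined by its first element, there can never be more than M distinct points, however long the generator runs.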
How would you test whether a random number generator is generating actual random numbers?
My approach: first build a hash table of size M, where M is a prime number. Then take each number generated by the random number generator, reduce it mod M, and see whether the results fill the whole table or just some part of it.
That's my approach. Can we prove it with visualization?
I have very little knowledge about testing. Can you suggest a thorough approach to this question? Thanks in advance.
You should be aware that you cannot guarantee the random number generator is working properly. Note that even with a perfect uniform distribution in the range [1,10], there is a 10^-10 chance of getting ten 10s in a random sample of 10 numbers.
Is it likely? Of course not.
So - what can we do?
We can statistically prove that the combination (10,10,....,10) is unlikely if the random number generator is indeed uniformly distributed. This concept is called Hypothesis testing. With this approach we can say "with certainty level of x% - we can reject the hypothesis that the data is taken from a uniform distribution".
A common way to do it is using Pearson's Chi-Squared test. The idea is similar to yours: you fill in a table, then compare the observed (generated) count for each cell with the expected count for each cell under the null hypothesis (in your case, the expected count is k/M, where M is the range's size and k is the total number of numbers taken).
You then do some manipulation on the data (see the Wikipedia article for more info on what this manipulation is exactly) and get a number (the test statistic). You then check if this number is likely to have been drawn from a Chi-Square distribution. If it is, you cannot reject the null hypothesis; if it is not, you can say with x% certainty that the data is not taken from a uniform random generator.
EDIT: example:
You have a die, and you want to check if it is "fair" (uniformly distributed in [1,6]). Throw it 200 times (for example) and create the following table:
number:                 1     2     3     4     5     6
empirical occurrences:  37    41    30    27    32    33
expected occurrences:   33.3  33.3  33.3  33.3  33.3  33.3
Now, according to Pearson's test, the statistic is:
X = ((37-33.3)^2)/33.3 + ((41-33.3)^2)/33.3 + ... + ((33-33.3)^2)/33.3
X = (13.69 + 59.29 + 10.89 + 39.69 + 1.69 + 0.09) / 33.3
X = 3.76
For a random C~ChiSquare(5), the probability of being higher than 3.76 is ~0.58 (which is not improbable) (1).
So we cannot reject the null hypothesis: the data are consistent with a uniform distribution in [1,6].
(1) We usually reject the null hypothesis if this probability is smaller than 0.05, but this is very case dependent.
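The die example can be recomputed with a short script. This is a sketch using only the standard library: the tail probability is built from a known recurrence over degrees of freedom, and the function names are made up for the illustration. A statistics package would normally be used instead.

```python
import math

def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_tail(x, df):
    """P(C > x) for C ~ ChiSquare(df), built up two degrees of freedom at a time."""
    if df % 2:
        tail, k = math.erfc(math.sqrt(x / 2)), 1  # exact tail for df = 1
    else:
        tail, k = math.exp(-x / 2), 2             # exact tail for df = 2
    while k < df:
        tail += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return tail

observed = [37, 41, 30, 27, 32, 33]
expected = [sum(observed) / 6] * 6  # 200 throws spread evenly over 6 faces
statistic = chi_square_statistic(observed, expected)
p_value = chi_square_tail(statistic, df=len(observed) - 1)
print(round(statistic, 2), round(p_value, 2))
```

The degrees of freedom are the number of cells minus one, because the six counts are constrained to add up to the number of throws.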
My naive idea:
The generator follows a distribution. (At least it should.) Do a reasonable number of runs, then plot the values on a graph. Fit a regression curve to the points. If it correlates with the shape of the distribution, you're good. (This is also possible in 1D with projections and histograms, and fully automatable with the correct tool, e.g. MATLAB.)
You can also use the diehard tests as it was mentioned before, that is surely better but involves much less intuition, at least on your side.
Let's say you want to generate a uniform distribution on the interval [0, 1].
Then one possible test is
for i from 1 to sample-size
    when a < random-being-tested() < b
        counter = counter + 1
return counter/sample-size
And see if the result is close to b-a (b minus a).
Of course you should define a function taking a, b between 0 and 1 as inputs, and return the difference between the counter/sample-size and b-a. Loop through possible a, b, say of the multiples of 0.01, a < b. Print out a, b when the difference is larger than a preset epsilon, say 0.001.
Those are the a, b for which there are too many outliers.
If you let sample-size be 5000, random-being-tested will be called about 5000 * 5050 times in total, hopefully not too bad.
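A runnable sketch of this scan might look as follows. To keep it fast it uses a coarser grid (multiples of 0.1) and a smaller sample than suggested above, and a looser epsilon: the sampling noise of each estimate is roughly sqrt(p(1-p)/n), so the tolerance has to sit well above that or a perfectly good generator gets flagged. All names and default values are assumptions for the illustration.

```python
import random

def interval_error(generator, a, b, sample_size):
    """Observed fraction of draws in (a, b) minus the expected fraction b - a."""
    hits = sum(1 for _ in range(sample_size) if a < generator() < b)
    return hits / sample_size - (b - a)

def suspicious_intervals(generator, steps=10, sample_size=2000, epsilon=0.05):
    """Scan every grid pair a < b and report those whose error exceeds epsilon."""
    flagged = []
    for i in range(steps):
        for j in range(i + 1, steps + 1):
            a, b = i / steps, j / steps
            if abs(interval_error(generator, a, b, sample_size)) > epsilon:
                flagged.append((a, b))
    return flagged

good = random.Random(1)
skewed = random.Random(1)
print(suspicious_intervals(good.random))                         # uniform generator
print(len(suspicious_intervals(lambda: skewed.random() ** 2)))   # squashed toward 0
```

The second generator squares its output, which pushes values toward 0, so many intervals should be reported for it.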
I had the same problem.
When I finished writing my code (using an external RNG engine),
I looked at the results and found that all of them fail the Chi-Square test whenever I have too many results.
My code generated random numbers and kept buckets counting how many results fell in each range.
I don't know why the Chi-Square test fails when I have a lot of results.
During my research I saw that C#'s Random.Next() fails for any range, and that some numbers have better odds than others. Furthermore, I saw that the RNGCryptoServiceProvider random provider does not handle big numbers well.
When trying to get numbers in the range 0-1,000,000,000, the numbers in the lower range 0-300M have better odds of appearing...
As a result I'm using RNGCryptoServiceProvider, and if my range is larger than 100M I combine the number myself (RandomHigh*100M + RandomLow), where the ranges of both randoms are smaller than 100M, so it works well.
Good Luck!
This question arose to me while I was playing FIFA.
Presumably, they programmed a complex function that includes all the factors, like shooting skill, distance, shot power etc., to calculate the probability that the shot hits the target. How would they have programmed it so that a goal happens according to that probability?
In other words, suppose a function X() should return 1 with probability 89% and 0 with probability 11%. How would I program it so that it returns 1 (approximately) 89 times in 100 trials?
Generate a uniformly-distributed random number between 0 and 1, and return true if the number is less than the desired probability (0.89).
For example, in IPython:
In [13]: from random import random
In [14]: vals = [random() < 0.89 for i in range(10000)]
In [15]: sum(vals)
Out[15]: 8956
In this realisation, 8956 out of the 10000 boolean outcomes are true. If we repeat the experiment, the number will vary around 8900.
That is not how goals are determined in FIFA or other video games. They don't have a function that says, with some probability, the shot makes it or doesn't.
Rather, they simulate a ball actually being kicked into a goal.
The ball will have some speed (based on the "shot power") and some trajectory angle (based on where the player aimed, plus some variability based on the character's "shot skill"). Then they allow physics, and the AI of the goalie if there is one, to take over, and count it as a point only when the ball physically enters the goal.
There is of course still randomness involved, but there is no single variable that decides whether or not a shot will make it.
I'm not 100% sure, but here is one way I would achieve it:
Generate a random number between 0 and 100. If the number is less than 89, return 1; otherwise return 0.
If you have a random number generator, then you would do something like:
#include <stdbool.h>
#include <stdlib.h>

bool return_true_89_out_of_100(void) {
    /* scale rand() from [0, RAND_MAX] to a double in [0, 1) */
    double random_n = rand() / (RAND_MAX + 1.0);
    return random_n < 0.89;
}
You can generate a crudely random number by, for example, sampling lower bits of the CPU clock or some mathematical tricks.
Your question is tagged language-agnostic, but the answer depends on what random number function(s) are available to you. Furthermore, the accuracy may depend on how close to being truly random your generator is (generally they're not that close).
As to random number functions, there tend to be two kinds -- those which generate a number between 0 and 1, and those that generate a number between m and n. Each can be used to derive a percentage easily.
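As a sketch of how each kind of generator might be turned into an 89% event, here is a small Python illustration. The function names are hypothetical; random.random and random.randint stand in for the two kinds of generator described above.

```python
import random

def hit_from_unit(p, rng):
    """First kind of generator: a float in [0, 1)."""
    return rng.random() < p

def hit_from_range(percent, rng):
    """Second kind of generator: an integer from 1 to 100 inclusive."""
    return rng.randint(1, 100) <= percent

rng = random.Random(2024)  # fixed seed so the run is repeatable
trials = 100_000
rate_unit = sum(hit_from_unit(0.89, rng) for _ in range(trials)) / trials
rate_range = sum(hit_from_range(89, rng) for _ in range(trials)) / trials
print(rate_unit, rate_range)
```

Both rates should land close to 0.89, with the usual small fluctuation from run to run if the seed is changed.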