I have several interrelated questions. (If it matters, I use C#.)
First: I have a PRNG that generates random numbers in the UInt32 range, from 0 to UInt32.MaxValue inclusive, and I want to preserve uniformity as much as possible. What is the main idea for obtaining double ranges [a,b] and (a,b) (such as [0,1], [0,1), (0,1), [-2,4], (-10,10))?
I'm concerned about the following. I have 4 294 967 296 PRNG outcomes, which is fewer than the 2^53 values in the [0,1] double range. So I construct a two-digit number in base 4 294 967 296, which is random and uniform on [0, 4294967295 * 4294967296 + 4294967295]. This maximum is larger than 2^53, so if the result is too large I throw it away and recalculate, then take it mod 2^53 and get a uniform number in, for example, [0,1]. Here I have to represent the maximum value as a double (suppose there is no Int64 type): are there any drawbacks to that?
Now, if I want [0,1), I take the number of outcomes to be 2^53 - 1. Adding 1/2^53 to the last result produces a random double in (0,1]. To get (0,1), I take 2^53 - 2 outcomes and add 1/2^53 to the 0-based result. Is all that correct?
But how do I get double ranges that are close or equal to the whole double range? Even if I construct an n-ary number as above, it may exceed Double.MaxValue. Maybe some bitshift/bitmask approach is possible?
Second: given a double PRNG with outcomes in [0,1), is it possible to get the [Double.MinValue, Double.MaxValue] range? How many doubles are there in total? And if there were a full-double-range PRNG, what would be the best way to get the UInt32 range: map it "directly", or scale it to [0,1] first?
Third: I found this code (http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/CODES/mt19937ar.c):
/* generates a random number on [0,1) with 53-bit resolution */
double genrand_res53(void)
{
    unsigned long a = genrand_int32() >> 5, b = genrand_int32() >> 6;
    return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0);
}
Why are a and b shifted by 5 and 6, and why is a*67108864.0+b uniform after that?
Thank you.
Good random number generators produce random bits at all positions. Certain classes of poor ones produce poor randomness in the low-order bits. Thus, if you need 53 bits and generate 64, you want to throw away the 11 lowest-order bits -- in the example code you posted, 5 from one number and 6 from the other. That leaves a 27-bit number (a) and a 26-bit number (b); 2^26 is 67108864 and 2^53 is 9007199254740992, which should explain why those constants are used to scale a*67108864.0+b into [0,1). (It's a mixed-base number: the low digit b runs over base 67108864, and the high digit a takes 134217728 possible values, so a*67108864+b covers each integer in [0, 2^53 - 1] exactly once.)
(The reason 53 bits are often used is that it makes the numbers symmetric upon subtraction--otherwise, the values between 2^-53 and 2^-64 will disappear when you subtract them from 1.)
Also, you shouldn't resample when you have too many bits -- just throw away the surplus bits (resampling is only needed when you have too few).
Anyway, the obvious method gives you [0,1). If you want (0,1], that's 1 - [0,1). If you want (0,1), sample again if you get both a=0 and b=0. If you want [0,1], note that there is a 1 in (2^53+1) chance of getting exactly 1, and otherwise you have [0,1). You could approximate this by drawing a random number in [0,1) and picking 1 as the answer if it's zero, or drawing again from [0,1) if not. Your random number generator probably doesn't have a long enough period for anything more exact to matter anyway.
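To make those recipes concrete, here is a minimal Java sketch (the same arithmetic carries over directly to C#). The nextInt32() helper is a hypothetical stand-in for whatever 32-bit PRNG you actually use:

import java.util.Random;

public class UniformDoubles {
    private static final Random rng = new Random(); // stand-in for your PRNG
    private static final double TWO_POW_53 = 9007199254740992.0; // 2^53

    // Uniform 32-bit value, as an unsigned int stored in a long.
    static long nextInt32() {
        return rng.nextInt() & 0xFFFFFFFFL;
    }

    // Uniform 53-bit integer in [0, 2^53 - 1]: the top 27 bits of one
    // output and the top 26 bits of another, exactly as in genrand_res53.
    static long next53() {
        long a = nextInt32() >>> 5; // 27 bits
        long b = nextInt32() >>> 6; // 26 bits
        return a * 67108864L + b;   // a * 2^26 + b
    }

    static double closedOpen() { return next53() / TWO_POW_53; } // [0,1)
    static double openClosed() { return 1.0 - closedOpen(); }    // (0,1]

    static double open() { // (0,1): reject the single zero outcome
        long n;
        do { n = next53(); } while (n == 0);
        return n / TWO_POW_53;
    }
}

For an arbitrary [a,b), a + (b - a) * closedOpen() works, at the cost of some rounding at the edges; a true [0,1] would need 2^53 + 1 equally likely outcomes, which the approximation described above only imitates.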
Related
This is something that's been on my mind for years, but I never took the time to ask before.
Many (pseudo) random number generators generate a random number between 0.0 and 1.0. Mathematically there are infinitely many numbers in this range, but double is a floating-point type and therefore has finite precision.
So the questions are:
Just how many double numbers are there between 0.0 and 1.0?
Are there just as many numbers between 1 and 2? Between 100 and 101? Between 10^100 and 10^100+1?
Note: if it makes a difference, I'm interested in Java's definition of double in particular.
Java doubles are in IEEE-754 format, therefore they have a 52-bit fraction; between any two adjacent powers of two (inclusive of one and exclusive of the next one), there will therefore be 2^52 different doubles (i.e., 4503599627370496 of them). For example, that's the number of distinct doubles between 0.5 included and 1.0 excluded, and exactly that many also lie between 1.0 included and 2.0 excluded, and so forth.
Counting the doubles between 0.0 and 1.0 is harder than doing so between powers of two, because many powers of two are included in that range, and one also runs into the thorny issue of denormalized numbers. Exponent fields 0 through 1022 (1023 of the 2048 possible values) cover the range in question, so, including denormalized numbers and zero, you have 1023 times as many doubles as lie between two adjacent powers of two: 1023 * 2^52, just under 2^62 in total. Excluding the denormalized numbers and zero, the count would be 1022 * 2^52.
For an arbitrary range like "100 to 100.1" it's even harder, because the upper bound cannot be exactly represented as a double (not being an exact multiple of any power of two). As a handy approximation, since doubles are evenly spaced between adjacent powers of two, you could say that said range covers 0.1/64th of the span between the surrounding powers of two (64 and 128), so you'd expect about
(0.1 / 64) * 2**52
distinct doubles -- which comes to 7036874417766.4004... give or take one or two;-).
Every double value whose representation is between 0x0000000000000000 and 0x3ff0000000000000 lies in the interval [0.0, 1.0]. That's (2^62 - 2^52) distinct values (plus or minus a couple depending on whether you count the endpoints).
The interval [1.0, 2.0] corresponds to representations between 0x3ff0000000000000 and 0x4000000000000000; that's 2^52 distinct values.
The interval [100.0, 101.0] corresponds to representations between 0x4059000000000000 and 0x4059400000000000; that's 2^46 distinct values.
There are no doubles between 10^100 and 10^100 + 1. Neither one of those numbers is representable in double precision, and there are no doubles that fall between them. The closest two double precision numbers are:
99999999999999982163600188718701095...
and
10000000000000000159028911097599180...
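Since non-negative finite doubles are ordered the same way as their 64-bit patterns, the counts above can be checked directly rather than estimated. A minimal Java sketch (my own illustration; it counts [lo, hi) for finite bounds with 0 <= lo <= hi):

public class DoubleCounter {
    // Number of representable doubles in [lo, hi): for non-negative
    // doubles, the IEEE-754 bit pattern is monotonic in the value,
    // so the count is just the difference of the two patterns.
    static long doublesBetween(double lo, double hi) {
        return Double.doubleToLongBits(hi) - Double.doubleToLongBits(lo);
    }

    public static void main(String[] args) {
        System.out.println(doublesBetween(1.0, 2.0));     // 2^52 = 4503599627370496
        System.out.println(doublesBetween(100.0, 101.0)); // 2^46 = 70368744177664
        System.out.println(doublesBetween(0.0, 1.0));     // 2^62 - 2^52
        System.out.println(doublesBetween(100.0, 100.1)); // ~7036874417766, matching the estimate above
    }
}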
Others have already explained that there are around 2^62 doubles in the range [0.0, 1.0].
(Not really surprising: there are almost 2^64 distinct finite doubles; of those, half are positive, and roughly half of those are < 1.0.)
But you mention random number generators: note that a random number generator generating numbers between 0.0 and 1.0 cannot in general produce all these numbers; typically it'll only produce numbers of the form n/2^53 with n an integer (see e.g. the Java documentation for nextDouble). So there are usually only around 2^53 (+/-1, depending on which endpoints are included) possible values for the random() output. This means that most doubles in [0.0, 1.0] will never be generated.
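For reference, the JDK documents nextDouble() as essentially the following construction; this sketch subclasses java.util.Random only because next(int) is a protected method:

import java.util.Random;

// next(26) supplies the high 26 bits and next(27) the low 27 bits of a
// 53-bit integer n; the result is n / 2^53, so only numbers of the
// form n / 2^53 can ever be returned.
class NextDoubleSketch extends Random {
    @Override
    public double nextDouble() {
        return (((long) next(26) << 27) + next(27)) / (double) (1L << 53);
    }
}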
The article Java's new math, Part 2: Floating-point numbers from IBM offers the following code snippet to count them (written for floats, but I suspect it works for doubles as well):
public class FloatCounter {
    public static void main(String[] args) {
        float x = 1.0F;
        int numFloats = 0;
        while (x <= 2.0) {
            numFloats++;
            System.out.println(x);
            x = Math.nextUp(x); // step to the next representable float
        }
        System.out.println(numFloats); // 8388609, i.e. 2^23 + 1
    }
}
They have this comment about it:
It turns out there are exactly 8,388,609 floats between 1.0 and 2.0 inclusive; large but hardly the uncountable infinity of real numbers that exist in this range. Successive numbers are about 0.0000001 apart. This distance is called an ULP for unit of least precision or unit in the last place.
2^53 - the size of the significand/mantissa of a 64-bit floating-point number, including the hidden bit.
Roughly yes, as the significand is fixed but the exponent changes.
See the wikipedia article for more information.
The Java double is an IEEE 754 binary64 number.
This means that we need to consider:
The mantissa is 52 bits
The exponent is an 11-bit number with a bias of 1023 (i.e. 1023 is added to it)
If the exponent field is all zeros and the mantissa is non-zero, the number is denormalized (also called subnormal)
This means there is a total of 2^62 - 2^52 + 1 possible double representations that, according to the standard, lie between 0 and 1 inclusive: exponent fields 0 through 1022 with any of the 2^52 mantissas (1023 * 2^52 values, which includes zero and the denormalized numbers), plus one more for 1.0 itself.
Remember that when the unbiased exponent is negative, the number is positive but less than 1 :-)
For other ranges it is harder, because the endpoint integers may not be representable precisely in IEEE 754, and because the exponent takes up bits that would otherwise distinguish values; so the larger the numbers, the fewer distinct values lie between them.
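To see those fields concretely, here is a small Java sketch (my own illustration) that splits a double into its sign, biased exponent, and mantissa:

public class DoubleFields {
    static void dump(double d) {
        long bits = Double.doubleToLongBits(d);
        long sign     = bits >>> 63;              // 1 bit
        long exponent = (bits >>> 52) & 0x7FFL;   // 11 bits, bias 1023
        long mantissa = bits & 0xFFFFFFFFFFFFFL;  // 52 bits
        System.out.printf("%s: sign=%d exponent=%d (unbiased %d) mantissa=0x%013X%n",
                d, sign, exponent, exponent - 1023, mantissa);
    }

    public static void main(String[] args) {
        dump(0.5);              // exponent field 1022, i.e. unbiased -1
        dump(1.0);              // exponent field 1023, i.e. unbiased 0, mantissa 0
        dump(Double.MIN_VALUE); // exponent field 0: denormalized
    }
}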
Assuming I can generate random bytes of data, how can I use that to choose an element out of an array of n elements?
If I have 256 elements I can generate 1 byte of entropy (8 bits), and then use that to pick my element simply by converting it to an integer.
If I have 2 elements I can generate 1 byte, discard 7 bits and use the remaining bit to select my element.
But what if I have 3 elements? 1 bit is too few and 2 is too many. How would I randomly select 1 of the 3 elements with equal probability?
Here is a survey of algorithms to generate uniform random integers from random bits.
J. Lumbroso's Fast Dice Roller, from "Optimal Discrete Uniform Generation from Coin Flips, and Applications" (2013). See also the implementation at the end of this answer.
The Math Forum, 2004. See also "Bit Recycling for Scaling Random Number Generators".
D. Lemire, "A Fast Alternative to the Modulo Reduction".
M. O'Neill, "Efficiently Generating a Number in a Range".
Some of these algorithms are "constant-time", others are unbiased, and still others are "optimal" in terms of the number of random bits they use on average. In the rest of this answer we will assume we have a "true" random generator that can produce unbiased and independent random bits.
For further discussion, see the following answer of mine:
How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?
You can generate the proper distribution by simply truncating into the necessary range. If you have N elements, simply generate K = ceil(log2(N)) random bits. Doing so is inefficient, but still works as long as the K bits are generated randomly.
In your example where you have N=3, you need at least K=2 bits, you have the following outcomes [00, 01, 10, 11] of equal probability. To map this into the proper range, just ignore one of the outcomes, such as the last one. Think of this as creating a new joint probability distribution, p(x_1, x_2), over the two bits where p(x_1=1, x_2=1) = 0, while for each of the others it will be 1/3 due to renormalization (i.e., (1/4)/(3/4) = 1/3 ).
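A minimal Java sketch of that truncation scheme for the three-element case (java.util.Random stands in here for the source of random bits):

import java.util.Random;

public class PickOneOfThree {
    private static final Random bits = new Random(); // stand-in bit source

    // Draw K = 2 random bits; keep outcomes 00, 01, 10 and redraw on 11,
    // so after renormalization each element has probability exactly 1/3.
    static <T> T pick(T[] elements) { // assumes elements.length == 3
        int r;
        do {
            r = bits.nextInt(4);        // two random bits: 0..3
        } while (r >= elements.length); // ignore the outcome 11
        return elements[r];
    }

    public static void main(String[] args) {
        System.out.println(pick(new String[] {"a", "b", "c"}));
    }
}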
If I have a true random number generator (TRNG) which can give me either a 0 or a 1 each time I call it, then it is trivial to generate any number in a range whose length is a power of 2. For example, to generate a random number between 0 and 63, I would simply poll the TRNG six times, for a maximum value of 111111 and a minimum value of 000000. The problem is when I want a number in a range whose length is not a power of 2. Say I wanted to simulate the roll of a die: I would need a range from 1 to 6, with equal weighting. Clearly, three bits are enough to store the result, but polling the TRNG three times introduces two erroneous values. We could simply ignore them, but then that would give one side of the die much lower odds of being rolled.
My question is how one most effectively deals with this.
The easiest way to get a perfectly accurate result is by rejection sampling. For example, generate a random value from 1 to 8 (3 bits), rejecting and generating a new value (3 new bits) whenever you get a 7 or 8. Do this in a loop.
You can get arbitrarily close to accurate just by generating a large number of bits, doing the mod 6, and living with the bias. In cases like 32-bit values mod 6, the bias will be so small that it will be almost impossible to detect, even after simulating millions of rolls.
If you want a number in the range 0 .. R - 1, pick the least n such that R <= 2^n. Then generate a random number r in the range 0 .. 2^n - 1 using your method. If it is greater than or equal to R, discard it and generate again. The probability that any single attempt fails in this manner is at most 1/2, so you will get a number in your desired range within two attempts on average. This method is unbiased and does not impair the randomness of the result in any fashion.
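A hedged sketch of that method in Java, using Integer.numberOfLeadingZeros to find the least n with R <= 2^n (again with java.util.Random as a stand-in bit source):

import java.util.Random;

public class UniformRange {
    private static final Random bits = new Random(); // stand-in bit source

    // Uniform value in [0, r-1]: draw the least n with r <= 2^n random
    // bits and redraw whenever the result is >= r. Each attempt succeeds
    // with probability r / 2^n > 1/2, so fewer than two attempts are
    // needed on average.
    static int nextBelow(int r) {
        if (r <= 1) return 0;
        int n = 32 - Integer.numberOfLeadingZeros(r - 1); // least n with r <= 2^n
        while (true) {
            int candidate = bits.nextInt() >>> (32 - n);  // n random bits
            if (candidate < r) return candidate;
        }
    }

    public static void main(String[] args) {
        System.out.println(1 + nextBelow(6)); // a die roll: uniform in 1..6
    }
}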
As you've observed, you can repeatedly double the range of possible random values through powers of two by concatenating bits, but if you start with an integer number of bits (like zero) then you cannot obtain any range with prime factors other than two.
There are several ways out; none of which are ideal:
Simply produce the first reachable range which is larger than what you need, and discard the result and start again if the random value falls outside the desired range.
Produce a very large range, and distribute that as evenly as possible amongst your desired outputs, and overlook the small bias that you get.
Produce a very large range, distribute what you can evenly amongst your desired outputs, and if you hit upon one of the [proportionally] few values which fall outside of the set which distributes evenly, then discard the result and start again.
As with 3, but recycle the parts of the value that you did not convert into a result.
The first option isn't always a good idea. Numbers 2 and 3 are pretty common. If your random bits are cheap then 3 is normally the fastest solution with a fairly small chance of repeating often.
For the last one: suppose you have built a random value r in [0,31], and from it you need to produce a result x in [0,5]. Values of r in [0,29] can be mapped to the required output without any bias using mod 6, while values in [30,31] would have to be dropped on the floor to avoid bias.
In the former case, you produce a valid result x, but there's some more randomness left over: which of the blocks [0,5], [6,11], etc. the value r fell into, i.e. r / 6 (five possible values in this case). You can use this to start building your new r for the next random value you'll need to produce.
In the latter case, you don't get any x and will have to try again, but you don't have to throw away all of r. The specific value picked from the illegal range [30,31] is left over and free to be used as a starting value for your next r (two possible values).
The random range you have from that point on needn't be a power of two. That doesn't mean it'll magically reach the range you need at the time, but it does mean you can minimise what you throw away.
The larger you make r, the more bits you may need to throw away if it overflows, but the smaller the chances of that happening. Adding one bit halves your risk but increases the cost only linearly, so it's best to use the largest r you can handle.
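Here is a hedged Java sketch of that recycling idea, following the structure of Lumbroso's Fast Dice Roller cited in the survey above (an illustration, not the paper's exact code):

import java.util.Random;

public class FastDiceRoller {
    private static final Random rng = new Random(); // stand-in bit source

    static int flip() { return rng.nextInt(2); } // one random bit

    // Uniform value in [0, n-1]. The invariant is that c is uniform over
    // a range of size v; on a failed draw (c >= n) the leftover value
    // c - n, uniform over a range of size v - n, is recycled rather than
    // thrown away, so random bits are only consumed by flip().
    static int roll(int n) {
        int v = 1, c = 0;
        while (true) {
            v *= 2;
            c = 2 * c + flip();      // append one random bit
            if (v >= n) {
                if (c < n) return c; // success: c is uniform in [0, n-1]
                v -= n;              // failure: recycle the remainder
                c -= n;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(1 + roll(6)); // one die roll
    }
}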
This is a math/programming question that has arisen while I'm trying to use a set of random data as an entropy source, in a situation where I'm using something like Random.org's pregenerated random files. Raw data like this is random zeroes and ones, and can be bitten off as random bytes (0-255) or as larger power-of-two ranges. I'm trying to be as efficient as possible in using this random source, since it's finite in length, so I don't want to use a larger chunk than I need.
Taking random bytes is fair if you want a number from a range whose size evenly divides 256 (e.g. 100 to 355, or 0 to 15). However, what if I want a number from 1 to 100? That doesn't fit nicely into 256. I could assign 0-199 to the 1-100 range twice over, leaving 200-255 as extras that would have to be discarded if drawn, or else 56 numbers in the range would be unfairly weighted to come up more often.
Is throwing out the out-of-range numbers the only fair option? Or is there a mathematical way to fairly "blur" those extra 56 values over the 1-100 range?
The only other option I've come up with that guarantees I can use every number drawn is to absorb a larger number of bytes at once, so that the degree of bias is smaller (with 0-255, some numbers in 1-100 get two "draws" and some get three: 3:2 odds = 50% more likely; with a range of 0-2,550, it's 26:25 odds = 4% more likely; etc.). That uses up more data, but is more predictable.
Is there a term for what I'm trying to do (can't Google what I can't name)? Is it possible, or do I have to concede that I'll have to throw out data that doesn't fairly match the range I want?
If you use 7 bits per number, you get 0-127. Whenever you get a number outside 1-100 (zero, or anything greater than 100), you have to discard it. You lose the use of that data point, but it's still random. You lose 28 of every 128, or about 22% of the random information.
If you use 20 bits at a whack, you get a number between 0 and 1,048,575. This can be broken into 3 random values between 0 and 99 (or 1-100 if you add 1 to each). You have to use integer arithmetic, throwing away any fractional part when dividing.
if (number >= 1000000) discard it and draw again;  // keep only 0..999999 = 100^3 values
a = number % 100;
b = (number / 100) % 100;
c = (number / 10000) % 100;
You only waste 48,576 values out of 1,048,576, or about 5% of the random information.
You can think of this process this way. Take the number you get by converting 20 bits to a decimal integer. Break out the tens and ones digits, the thousands and hundreds digits, and the hundred-thousands and ten-thousands digits, and use those as three random numbers. They are truly random, since those digits could take any value at all in the original number. Further, we discarded any values that would bias particular values of the three.
So there's a way to make more efficient use of the random bits. But you have to do some computing.
Note: the next interesting bit count is 27 bits, which yields four values per draw and wastes about 25%; 14 bits, yielding two values per draw, would waste about 39%.
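A runnable Java version of the 20-bit scheme above (my own sketch; java.util.Random stands in for the entropy file):

import java.util.Random;

public class DigitExtractor {
    private static final Random bits = new Random(); // stand-in entropy source

    // Draw 20 random bits and split them into three uniform values in
    // 0..99, rejecting draws >= 10^6 (about 5% of them) so that every
    // remaining triple of "digit pairs" is equally likely.
    static int[] threeValues() {
        while (true) {
            int number = bits.nextInt() >>> 12; // 20 random bits: 0..1048575
            if (number >= 1000000) continue;    // discard and redraw
            return new int[] {
                number % 100,         // tens and ones digits
                (number / 100) % 100, // thousands and hundreds digits
                number / 10000        // hundred-thousands and ten-thousands digits
            };
        }
    }

    public static void main(String[] args) {
        for (int v : threeValues()) System.out.println(v + 1); // 1..100
    }
}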