I'm trying to generate random numbers from 0 to 1 in Racket, including the borders 0 and 1. So far I haven't found a solution. Is there a nice way?
As soegaard mentions in his answer, you should just use (random). It is true that it produces a number in the open range (0, 1), not the closed range [0, 1] that you want. However, I’m going to present an argument backed by some numbers that the difference is irrelevant.
The random procedure produces an IEEE 754 double-precision floating point number. The next-largest number representable by that format after 0 is approximately 4.94 × 10^-324. If we assume that (random) does, indeed, produce a number uniformly distributed over the range, then that would imply that the probability of ever actually producing 0.0 is roughly one in 2 × 10^323! To give you a little bit of context for that number, that means that even if you generated one billion random numbers every second, it would still take on average 6.41 × 10^306 years to generate 0.0 even once.
The situation is a little less grim at the other end of the range, since the difference between 1 and the next-smallest number representable by double flonums is significantly larger: approximately 1.11 × 10^-16. On this end of the range, if we once again generated one billion random numbers every second, then it would take on average “only” 104 days before generating 1.0 for the first time. However, this would still be completely insignificant compared to the enormous amount of data you had already generated (and indeed, it’s rather unlikely you are going to be generating a billion random numbers per second to begin with, since it takes well over a minute just to generate a billion random numbers on my machine).
Don’t worry about those missing ends of the range, since they really won’t matter. Just call (random) and be done with it.
Use (random) to generate a number between 0.0 and 1.0.
To include 0.0 and 1.0 you can use:
(define (r)
  (/ (random 4294967087)
     4294967086.0))
I have studied the working of the Sieve of Eratosthenes for generating all prime numbers up to a given number by iterating and striking off all the composite numbers. The algorithm only needs to iterate up to sqrt(n), where n is the upper bound up to which we need to find all the primes. We know that the count of primes up to n = 10^9 is very small compared to the count of composites, yet we use all that space just to mark the composites as not prime.
My question is: can we modify the algorithm to store only the prime numbers, given that we deal with a very large range (and the primes are comparatively few)?
Can we just store straight away the prime numbers?
Changing the structure from that of a set (sieve) - one bit per candidate - to storing the primes themselves (e.g. in a list, vector or tree structure) actually increases storage requirements.
Example: there are 203,280,221 primes below 2^32. An array of uint32_t of that size requires about 775 MiB, whereas the corresponding bitmap (a.k.a. set representation) occupies only 512 MiB (2^32 bits / 8 bits/byte = 2^29 bytes).
The most compact number-based representation with fixed cell size would be storing the halved distance between consecutive odd primes, since up to about 2^40 the halved distance fits into a byte. At 193 MiB for the primes up to 2^32 this is slightly smaller than an odds-only bitmap but it is only efficient for sequential processing. For sieving it is not suitable because, as Anatolijs has pointed out, algorithms like the Sieve of Eratosthenes effectively require a set representation.
The bitmap can be shrunk drastically by leaving out the multiples of small primes. Most famous is the odds-only representation that leaves out the number 2 and its multiples; this halves the space requirement to 256 MiB at virtually no cost in added code complexity. You just need to remember to pull the number 2 out of thin air when needed, since it isn't represented in the sieve.
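As an illustration of the odds-only scheme, here is a small Python sketch (the function name is mine; a production sieve would use the packed-bitmap tricks described above). Flag i stands for the odd number 2*i + 3, and the prime 2 is pulled out of thin air at the end, exactly as described:

```python
def odd_sieve(limit):
    """Odds-only Sieve of Eratosthenes for limit >= 2: flag i represents 2*i + 3."""
    size = (limit - 1) // 2            # count of odd candidates in [3, limit]
    flags = bytearray([1]) * size      # 1 = "still considered prime"
    i = 0
    while (n := 2 * i + 3) * n <= limit:
        if flags[i]:
            # crossing off starts at n*n; consecutive odd multiples are n indices apart
            for j in range((n * n - 3) // 2, size, n):
                flags[j] = 0
        i += 1
    return [2] + [2 * i + 3 for i in range(size) if flags[i]]
```

Only the conversion between a number n and its bit index (n - 3) / 2 is added complexity compared to the full bitmap.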
Even more space can be saved by leaving out multiples of more small primes; this generalisation of the odds-only trick is usually called 'wheeled storage' (see Wheel Factorization on Wikipedia). However, the gain from adding more small primes to the wheel gets smaller and smaller, whereas the wheel modulus ('circumference') increases explosively. Adding 3 removes 1/3rd of the remaining numbers, adding 5 removes a further 1/5th, adding 7 only gets you a further 1/7th, and so on.
Here's an overview of what adding another prime to the wheel can get you. 'ratio' is the size of the wheeled/reduced set relative to the full set that represents every number; 'delta' gives the shrinkage compared to the previous step. 'spokes' refers to the number of prime-bearing spokes which need to be represented/stored; the total number of spokes for a wheel is of course equal to its modulus (circumference). The ratios below are just the product of (p - 1)/p over the wheel primes, and the spoke count is φ(modulus):

prime   modulus   spokes   ratio    delta
2             2        1   50.00%   50.00%
3             6        2   33.33%   16.67%
5            30        8   26.67%    6.67%
7           210       48   22.86%    3.81%
11         2310      480   20.78%    2.08%
The mod 30 wheel (about 136 MiB for the primes up to 2^32) offers an excellent cost/benefit ratio because it has eight prime-bearing spokes, which means that there is a one-to-one correspondence between wheels and 8-bit bytes. This enables many efficient implementation tricks. However, its cost in added code complexity is considerable despite this fortuitous circumstance, and for many purposes the odds-only sieve ('mod 2 wheel') gives the most bang for buck by far.
There are two additional considerations worth keeping in mind. The first is that data sizes like these often exceed the capacity of memory caches by a wide margin, so that programs can often spend a lot of time waiting for the memory system to deliver the data. This is compounded by the typical access patterns of sieving - striding over the whole range, again and again and again. Speedups of several orders of magnitude are possible by working the data in small batches that fit into the level-1 data cache of the processor (typically 32 KiB); lesser speedups are still possible by keeping within the capacity of the L2 and L3 caches (a few hundred KiB and a few MiB, respectively). The keyword here is 'segmented sieving'.
The second consideration is that many sieving tasks - like the famous SPOJ PRIME1 and its updated version PRINT (with extended bounds and tightened time limit) - require only the small factor primes up to the square root of the upper limit to be permanently available for direct access. That's a comparatively small number: 4792 primes below sqrt(2^31) when sieving up to 2^31, as in the case of PRINT.
Since these primes have already been sieved, there's no need for a set representation anymore, and since they are few, there are no problems with storage space. This means they are most profitably kept as actual numbers in a vector or list for easy iteration, perhaps with additional auxiliary data like current working offset and phase. The actual sieving task is then easily accomplished via a technique called 'windowed sieving'. In the case of PRIME1 and PRINT this can be several orders of magnitude faster than sieving the whole range up to the upper limit, since both tasks only ask for a small number of subranges to be sieved.
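A Python sketch of that windowed approach (the interface is my own invention, for illustration): the small factor primes are kept as plain numbers and reused to sieve each requested window:

```python
def sieve_window(lo, hi, small_primes):
    """Sieve the window [lo, hi) given the primes up to sqrt(hi)."""
    flags = bytearray([1]) * (hi - lo)
    for p in small_primes:
        # first multiple of p inside the window, but never p itself (start at p*p)
        start = max(p * p, (lo + p - 1) // p * p)
        for m in range(start, hi, p):
            flags[m - lo] = 0
    # 0 and 1 are not primes, so filter them out if the window includes them
    return [lo + i for i in range(hi - lo) if flags[i] and lo + i > 1]
```

Each window needs only (hi - lo) flags regardless of how far up the number line it sits.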
You can do that (remove numbers detected to be composite from your array/linked list), but then the time complexity of the algorithm degrades to something like O(N^2/log(N)) instead of the original O(N log log N). This is because you can no longer jump straight to "the numbers 2X, 3X, 4X, ..." by indexing; you would have to search through your entire compressed list.
You could erase each composite number from the array/vector once you have shown it to be composite. Or, when you fill an array of numbers to put through the sieve, remove all even numbers (other than 2) and all numbers ending in 5 (other than 5).
If you have studied the sieve closely, you know we don't have the primes to begin with. We have an array whose size equals the range. If you want the range to be 10^9, that must be the size of the array. You have mentioned no language, but for each number you need at least one bit to represent whether it is prime or not.
Even that means you need 10^9 bits = 1.25 × 10^8 bytes, which is more than 100 MB of RAM.
Assuming you have all this, the most optimized sieve takes O(n log log n) time, which, for n = 10^9 on a machine that executes 10^8 instructions per second, will still take tens of seconds.
Now, assuming you have all this with you, still, the number of primes up to 10^9 is q = 50,847,534; saving these takes q × 4 bytes, which is about 200 MB (more RAM).
Even if you remove the indexes which are multiples of 2, 3 or 5 - this removes 22 numbers in every 30 - it is still not good enough, because in total you will need around 240 MB of space (about 33 MB for the remaining 8/30 of 10^9 bits, plus ~200 MB for storing the prime numbers).
So, since storing the primes requires a similar amount of memory (of the same order) as the calculation itself, your question, IMO, has no solution.
You can halve the size of the sieve by only 'storing' the odd numbers. This requires code to explicitly deal with the case of testing even numbers. For odd numbers, bit b of the sieve represents n = 2b + 3. Hence bit 0 represents 3, bit 1 represents 5 and so on. There is a small overhead in converting between the number n and the bit index b.
Whether this technique is any use to you depends on the memory/speed balance you require.
If I have a true random number generator (TRNG) which can give me either a 0 or a 1 each time I call it, then it is trivial to generate any number in a range whose length is a power of 2. For example, if I wanted to generate a random number between 0 and 63, I would simply poll the TRNG 6 times, for a maximum value of 111111 and a minimum value of 000000. The problem is when I want a number in a range whose length is not equal to 2^n. Say I wanted to simulate the roll of a die: I would need a range between 1 and 6, with equal weighting. Clearly, I would need three bits to store the result, but polling the TRNG 3 times introduces two extraneous values. We could simply discard them, but naively re-rolling wastes bits, and mapping them onto faces would give some sides of the die higher odds of being rolled.
My question is: how does one most effectively deal with this?
The easiest way to get a perfectly accurate result is by rejection sampling. For example, generate a random value from 1 to 8 (3 bits), rejecting and generating a new value (3 new bits) whenever you get a 7 or 8. Do this in a loop.
You can get arbitrarily close to accurate just by generating a large number of bits, doing the mod 6, and living with the bias. In cases like 32-bit values mod 6, the bias will be so small that it will be almost impossible to detect, even after simulating millions of rolls.
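To put a number on how small that bias is: 2^32 is not a multiple of 6, so a few residues occur one extra time. A quick check in Python:

```python
total = 2 ** 32          # equally likely 32-bit values
base = total // 6        # every face appears at least this many times
extra = total % 6        # this many faces appear exactly once more
print(extra, 1 / base)   # 4 faces are each favored by about 1.4e-9 (relative)
```

A bias of roughly one part in 700 million is indeed almost impossible to detect over millions of simulated rolls.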
If you want a number in the range 0 .. R - 1, pick the least n such that R ≤ 2^n. Then generate a random number r in the range 0 .. 2^n - 1 using your method. If it is greater than or equal to R, discard it and generate again. The probability that a generation fails in this manner is at most 1/2, so you will get a number in your desired range in fewer than two attempts on average. This method is unbiased and does not impair the randomness of the result in any fashion.
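A direct transcription of this method in Python, with random.getrandbits standing in for the TRNG (the function names are mine):

```python
import random

def rand_below(R):
    """Rejection sampling: draw the least n bits with R <= 2**n, retry until < R."""
    n = max(1, (R - 1).bit_length())
    while True:
        r = random.getrandbits(n)    # stand-in for n polls of the TRNG
        if r < R:
            return r

def die_roll():
    return rand_below(6) + 1         # fair value in 1..6
```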
As you've observed, you can repeatedly double the range of possible random values through powers of two by concatenating bits, but if you start with an integer number of bits (like zero) then you cannot obtain any range with prime factors other than two.
There are several ways out; none of which are ideal:
Simply produce the first reachable range which is larger than what you need, discarding results and starting again whenever the random value falls outside the desired range.
Produce a very large range, and distribute that as evenly as possible amongst your desired outputs, and overlook the small bias that you get.
Produce a very large range, distribute what you can evenly amongst your desired outputs, and if you hit upon one of the [proportionally] few values which fall outside of the set which distributes evenly, then discard the result and start again.
As with 3, but recycle the parts of the value that you did not convert into a result.
The first option isn't always a good idea. Numbers 2 and 3 are pretty common. If your random bits are cheap, then 3 is normally the fastest solution, with a fairly small chance of having to repeat.
For the last one: supposing that you have built a random value r in [0,31], and from that you need to produce a result x in [0,5]. Values of r in [0,29] can be mapped to the required output without any bias using mod 6, while values in [30,31] would have to be dropped on the floor to avoid bias.
In the former case, you produce a valid result x, but there's some more randomness left over -- the difference between the ranges [0,5], [6,11], etc., (five possible values in this case). You can use this to start building your new r for the next random value you'll need to produce.
In the latter case, you don't get any x and are going to have to try again, but you don't have to throw away all of r. The specific value picked from the illegal range [30,31] is left-over and free to be used as a starting value for your next r (two possible values).
The random range you have from that point on needn't be a power of two. That doesn't mean it'll magically reach the range you need at the time, but it does mean you can minimise what you throw away.
The larger you make r, the more bits you may need to throw away if it overflows, but the smaller the chances of that happening. Adding one bit halves your risk but increases the cost only linearly, so it's best to use the largest r you can handle.
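Here is one way the recycling idea might look in code - a Python sketch, with random_bit standing in for the bit source; the pair (r, m) carries the leftover uniform value and its range from one roll to the next:

```python
import random

def random_bit():
    """Stand-in for the TRNG described in the question."""
    return random.getrandbits(1)

def dice_stream(count):
    """Produce fair die rolls while recycling leftover randomness."""
    r, m = 0, 1                # invariant: r is uniform on [0, m)
    rolls = []
    while len(rolls) < count:
        while m < 6:           # append bits until the range can cover a die
            r = r * 2 + random_bit()
            m *= 2
        limit = m - m % 6      # largest multiple of 6 not exceeding m
        if r < limit:
            rolls.append(r % 6 + 1)
            r, m = r // 6, limit // 6    # the quotient is still uniform
        else:
            r, m = r - limit, m - limit  # recycle the rejected tail
    return rolls
```

In both branches the leftover (r, m) remains uniformly distributed, so no randomness is wasted beyond what the bias argument forces you to discard.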
Math/programming question that has arisen while I'm trying to use a set of random data as an entropy source - something like Random.org's pregenerated random files. Raw data like this is random zeroes and ones, and can be chopped into random bytes (0-255) or larger power-of-two ranges. I'm trying to be as efficient as possible in using this random source, since it's finite in length, so I don't want to use a larger chunk than I need.
Taking random bytes is fair if you want a number from a range whose size is 256 or evenly divides 256 (e.g. 100 to 355, or 0 to 15). However, what if I want a number from 1 to 100? That doesn't fit nicely into 256. I could assign 0-199 to the 1-100 range twice over, leaving 200-255 as extras that would have to be discarded if drawn - otherwise 56 numbers in the range would be unfairly weighted to come up more often.
Is throwing out the out-of-range numbers the only fair option? Or is there a mathematical way to fairly "blur" those 56 extra values over the 1-100 range?
The only other option I've come up with to guarantee I can use every draw is to absorb a larger range, so that the degree of bias is less (with 0-255, some numbers in 1-100 get two "draws" and some get three; 3:2 odds = 50% more likely. A larger range, say 0-2,550, would give 26:25 odds = 4% more likely. Etc.) That uses up more data, but is more predictable.
Is there a term for what I'm trying to do (can't Google what I can't name)? Is it possible, or do I have to concede that I'll have to throw out data that doesn't fairly match the range I want?
If you use 7 bits per number, you get 0-127. Keep 0-99 (adding 1 maps it to 1-100) and discard anything greater than 99. You lose the use of that data point, but what remains is still random. You lose 28 of every 128 values, or about 22% of the random information.
If you use 20 bits at a whack, you get a number between 0 and 1,048,575. This can be broken into 3 random values between 0 and 99 (or 1-100 if you add 1 to it). You have to use integer arithmetic or throw away any fractional part when dividing.
if (number >= 1000000) discard it;
a = number % 100;
b = (number / 100) % 100;
c = (number / 10000) % 100;
You only waste 48,576 values out of 1,048,576, or about 4.6% of the random information.
You can think of this process this way. Take the number you get by converting 20 bits to a decimal integer. Break out the 10's and 1's digits, the 1,000's and 100's digits, and the 100,000's and 10,000's digits, and use those as three random numbers. They are truly random, since those digits could be any value at all in the original number. Further, we discarded any values that would bias particular values of the three.
So there's a way to make more efficient use of the random bits. But you have to do some computing.
Note: The next interesting bit size is 27 bits, which yields four values from 0-99 and wastes about 25%. 14 bits (two values) would waste about 39%.
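A sketch of the 20-bit scheme in Python (the helper name is mine); the caller draws 20 fresh bits whenever the draw is rejected:

```python
def three_d100(bits20):
    """Split one 20-bit draw into three uniform values 0-99, or reject it."""
    if bits20 >= 1_000_000:           # 48,576 of 1,048,576 values would bias the split
        return None                   # caller should draw 20 fresh bits and retry
    return (bits20 % 100,             # the 10's and 1's digits
            bits20 // 100 % 100,      # the 1,000's and 100's digits
            bits20 // 10_000 % 100)   # the 100,000's and 10,000's digits
```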
How would you test whether a random number generator is generating actually random numbers?
My approach: first build a hash table of size M, where M is a prime number. Then take each number produced by the random number generator, reduce it mod M, and see whether the results fill the whole table or only part of it.
That's my approach. Can we demonstrate it with a visualization?
Since I have very little knowledge about testing, can you suggest a thorough approach to this question? Thanks in advance.
You should be aware that you cannot prove the random number generator is working properly. Note that even with a perfect uniform distribution on the range [1,10], there is a 10^-10 chance of getting ten 10s in a random sample of 10 numbers.
Is it likely? Of course not.
So - what can we do?
We can statistically argue that an outcome like (10,10,...,10) is unlikely if the random number generator is indeed uniformly distributed. This concept is called hypothesis testing. With this approach we can say: "with confidence level of x%, we can reject the hypothesis that the data is drawn from a uniform distribution".
A common way to do this is Pearson's chi-squared test. The idea is similar to yours: you fill in a table, recording the observed (generated) count of numbers in each cell and the expected count for each cell under the null hypothesis (in your case, the expected count is k/M, where M is the range's size and k is the total count of numbers drawn).
You then compute a number from the data - the test statistic (see the Wikipedia article for exactly how) - and check whether that number is plausible under a chi-square distribution. If it is, you cannot reject the null hypothesis; if it is not, you can say with x% confidence that the data is not drawn from a uniform random generator.
EDIT: example:
You have a die, and you want to check whether it is fair (uniformly distributed on [1,6]). Throw it 200 times (for example) and record the following table:

number:                1     2     3     4     5     6
observed occurrences:  37    41    30    27    32    33
expected occurrences:  33.3  33.3  33.3  33.3  33.3  33.3
Now, according to Pearson's test, the statistic is:
X = ((37-33.3)^2)/33.3 + ((41-33.3)^2)/33.3 + ... + ((33-33.3)^2)/33.3
X = (13.69 + 59.29 + 10.89 + 39.69 + 1.69 + 0.09) / 33.3
X ≈ 3.76
For a random C ~ ChiSquare(5), the probability of exceeding 3.76 is about 0.58 (which is not at all improbable)(1).
So we cannot reject the null hypothesis, and we conclude that the data is consistent with a uniform distribution on [1,6].
(1) We usually reject the null hypothesis if the p-value is smaller than 0.05, but this is very case dependent.
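The statistic is easy to recompute; here's a quick Python version using the exact expected count 200/6 rather than the rounded 33.3 (note that (37 - 33.3)^2 is 13.69, so the value comes out near 3.76):

```python
observed = [37, 41, 30, 27, 32, 33]          # die throws from the example
expected = sum(observed) / len(observed)     # 200 / 6, the same for every face

chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi2, 2))                        # 3.76
```

In practice you would feed this statistic into a chi-square survival function (e.g. a stats library) with 5 degrees of freedom to get the p-value.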
My naive idea:
The generator should follow some known distribution. Do a reasonable number of runs, then plot the values on a graph and fit a regression curve to the points. If it correlates with the shape of the intended distribution, you're good. (This is also possible in 1D with projections and histograms, and it is fully automatable with the right tool, e.g. MATLAB.)
You can also use the Diehard tests, as mentioned before; that is surely more rigorous, but it involves much less intuition on your side.
Let's say you want to generate a uniform distribution on the interval [0, 1].
Then one possible test is
for i from 1 to sample-size
    when a < random-being-tested() < b
        counter += 1
return counter / sample-size
Then see whether the result is close to b - a.
Of course you should define a function that takes a and b (both between 0 and 1) as inputs and returns the difference between counter/sample-size and b - a. Loop through possible a, b - say, the multiples of 0.01 with a < b - and print out a, b whenever the difference is larger than a preset epsilon, say 0.001.
Those are the a, b for which there are too many (or too few) hits.
If you let sample-size be 5000, your random-being-tested will be called about 5000 × 5050 times in total - hopefully not too bad.
I had the same problem. When I finished writing my code (which used an external RNG engine), I looked at the results and found that all of them failed the chi-squared test whenever I had too many results.
My code generated a random number and kept buckets counting each result range. I didn't know why the chi-squared test failed when I had a lot of results.
During my research I saw claims that C#'s Random.Next() is biased for some ranges - some numbers have better odds of appearing than others - and furthermore that the RNGCryptoServiceProvider provider does not behave well on big numbers. When trying to get numbers in the range 0-1,000,000,000, the numbers in the lower range 0-300M seemed to have better odds of appearing.
As a result I'm using RNGCryptoServiceProvider, and if my range is higher than 100M I combine the number myself (RandomHigh × 100M + RandomLow), keeping the ranges of both randoms smaller than 100M - that works well.
Good luck!
A recent homework assignment I have received asks us to take expressions which could create a loss of precision when performed in the computer, and alter them so that this loss is avoided.
Unfortunately, the directions for doing this haven't been made very clear. From watching various examples being performed, I know that there are certain methods of doing this: using Taylor series, using conjugates if square roots are involved, or finding a common denominator when two fractions are being subtracted.
However, I'm having some trouble noticing exactly when loss of precision is going to occur. So far the only thing I know for certain is that when you subtract two numbers that are nearly equal, precision is lost: the significant high-order digits cancel, and what remains is dominated by earlier round-off.
My question is what are some other common situations I should be looking for, and what are considered 'good' methods of approaching them?
For example, here is one problem:
f(x) = tan(x) − sin(x) when x ~ 0
What is the best and worst algorithm for evaluating this out of these three choices:
(a) (1/ cos(x) − 1) sin(x),
(b) (x^3)/2
(c) tan(x)*(sin(x)^2)/(cos(x) + 1).
I understand that when x is close to zero, tan(x) and sin(x) are nearly the same. I don't understand how or why any of these algorithms are better or worse for solving the problem.
Another rule of thumb that is often used: when adding a long series of numbers, start with the numbers closest to zero and end with the biggest ones.
Explaining why this is good is a bit tricky. When you add small numbers to a large number, there is a chance they will be completely discarded, because they are smaller than the lowest digit in the large number's mantissa. Take for instance this situation:
a = 1,000,000
do 100,000,000 times:
    a += 0.01
If 0.01 is smaller than the lowest mantissa digit of a, the loop does nothing and the end result is a == 1,000,000;
but if you do it like this:
a = 0
do 100,000,000 times:
    a += 0.01
a += 1,000,000
then the small number slowly grows, and you're more likely to end up with something close to a == 2,000,000, which is the right answer. This is of course an extreme example, but I hope you get the idea.
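The same absorption effect can be demonstrated with ordinary Python doubles; the threshold just moves (the gap between consecutive doubles near 10^16 is 2, so adding 1.0 is lost entirely):

```python
big = 1e16
print(big + 1.0 == big)      # True: 1.0 is below half a unit in the last place

# small-to-large keeps the contribution; large-first loses it completely
small_first = sum([1.0] * 1_000_000) + big
large_first = big
for _ in range(1_000_000):
    large_first += 1.0       # every single addition is absorbed
print(small_first - big, large_first - big)   # 1000000.0 vs 0.0
```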
I had to take a numerics class back when I was an undergrad, and it was thoroughly painful. Anyhow, IEEE 754 is the floating point standard typically implemented by modern CPUs. It's useful to understand the basics of it, as this gives you a lot of intuition about what not to do. The simplified explanation of it is that computers store floating point numbers in something like base-2 scientific notation with a fixed number of digits (bits) for the exponent and for the mantissa. This means that the larger the absolute value of a number, the less precisely it can be represented. For 32-bit floats in IEEE 754, half of the possible bit patterns represent between -1 and 1, even though numbers up to about 10^38 are representable with a 32-bit float. For values larger than 2^24 (approximately 16.7 million) a 32-bit float cannot represent all integers exactly.
What this means for you is that you generally want to avoid the following:
Having intermediate values be large when the final answer is expected to be small.
Adding/subtracting small numbers to/from large numbers. For example, if you wrote something like:
for(float index = 17000000; index < 17000001; index++) {}
This loop would never terminate because 17,000,000 + 1 is rounded back down to 17,000,000.
If you had something like:
float foo = 10000000.0f - 10000000.0001f;
the value of foo would be 0, not -0.0001, because 10000000.0001f already rounds to 10000000.0f in single precision.
My question is what are some other common situations I should be looking for, and what are considered 'good' methods of approaching them?
There are several ways you can have severe or even catastrophic loss of precision.
The most important reason is that floating-point numbers have a limited number of digits; e.g. doubles have 53 bits of mantissa. That means if you have "useless" digits which are not part of the solution but must be stored, you lose precision.
For example (We are using decimal types for demonstration):
2.598765000000000000000000000100 -
2.598765000000000000000000000099
The interesting part is the 100 - 99 = 1 at the end. Since 2.598765 is equal in both operands, it does not change the result, but it wastes 8 digits. Much worse, because the computer doesn't know those digits are useless, it is forced to store them and crams 21 zeroes after them, wasting 29 digits in all. Unfortunately there is no way to circumvent this for differences in general, but there are special cases, e.g. exp(x) - 1, a function that occurs very often in physics. The exp function near 0 is almost linear, but it forces a 1 as the leading digit. So with 12 significant digits:
exp(0.001)-1 = 1.00100050017 - 1 = 1.00050017e-3
If we instead define a function expm1() from the Taylor series:
1 + x + x^2/2 + x^3/6 + ... - 1 =
x + x^2/2 + x^3/6 + ... =: expm1(x)
expm1(0.001) = 1.00050016667e-3
Much better.
The second problem is functions with a very steep slope, like the tangent of x near pi/2. tan(11) has a slope of about 50,000, which means that any small deviation caused by earlier rounding errors will be amplified by a factor of 50,000! Or you can have singularities, e.g. if the result approaches 0/0, which means it can have any value.
In both cases you create a substitute function that simplifies the original one. It is of no use to list the different solution approaches here, because without training you will simply not "see" the problem in the first place.
A very good book to learn from and practice with: Forman S. Acton, Real Computing Made Real.
Another thing to avoid is subtracting numbers that are nearly equal, as this can also lead to increased sensitivity to roundoff error. For values near 0, cos(x) will be close to 1, so 1/cos(x) - 1 is one of those subtractions that you'd like to avoid if possible, so I would say that (a) should be avoided.
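A quick numerical check in Python (with a small x, where the Taylor term x^3/2 is essentially exact) supports that ranking:

```python
import math

x = 1e-7
a = (1 / math.cos(x) - 1) * math.sin(x)                 # subtraction near 1: cancellation
b = x ** 3 / 2                                          # leading Taylor term of tan x - sin x
c = math.tan(x) * math.sin(x) ** 2 / (math.cos(x) + 1)  # algebraically equal, no cancellation
print(a, b, c)
```

All three print values near 5e-22, but (a) agrees with the others to only a couple of digits, while (b) and (c) agree to nearly full double precision.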