Random number generation from 1 to 7 - algorithm

I was going through Google interview questions and found one that asks you to implement random number generation from 1 to 7.
I wrote some simple code. If I were asked this question in an interview and wrote the code below, would it be acceptable?
import time

def generate_rand():
    ret = str(time.time()) # time in second like, 12345.1234
    ret = int(ret[-1])
    if ret == 0 or ret == 1:
        return 1
    elif ret > 7:
        ret = ret - 7
        return ret
    return ret

while 1:
    print(generate_rand())
    time.sleep(1) # Just to see the output in the STDOUT

(Since the question seems to ask for analysis of issues in the code and not a solution, I am not providing one. )
The answer is unacceptable because:
1. You need to wait a second for each random number, while many applications need a few hundred at a time. (If the sleep is only there for convenience, note that even microsecond granularity will not yield truly random numbers: the last digit increases monotonically until 10 µs have elapsed. Several calls can complete within a span of 10 µs, and they would return a monotonically increasing run of pseudo-random numbers.)
2. Random numbers should be uniformly distributed: in theory each element has the same probability. Here the ten possible digits are mapped unevenly. The digits 0, 1 and 8 all map to 1 (three times the probability), 2 and 9 both map to 2 (twice the probability), and 3 to 7 each come from a single digit.
Typically, answers to this sort of question take a large range of numbers and distribute it evenly over 1-7. For example, the above method would have worked fine if you had wanted randomness from 1-5, as 10 is evenly divisible by 5. Note that this only solves (2) above.
For (1), there are other sources of randomness, such as /dev/random on a Linux OS.
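To make the skew concrete, here is a minimal Python sketch that pushes each of the ten possible last digits through the same branches as the question's generate_rand() and counts where they land. The helper name map_digit is hypothetical, and the tally assumes every digit is equally likely, which the clock does not guarantee.

```python
from collections import Counter

def map_digit(digit):
    """Same branching as the question's generate_rand(), applied to one digit."""
    if digit == 0 or digit == 1:
        return 1
    elif digit > 7:
        return digit - 7
    return digit

# Count how many of the ten digits land on each result.
tally = Counter(map_digit(d) for d in range(10))
for value in sorted(tally):
    print(value, tally[value])
```

A uniform mapping would need the same count for every result, which ten digits cannot provide for seven outcomes.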

You haven't really specified the constraints of the problem you're trying to solve, but if it's from a collection of interview questions it seems likely that it might be something like this.
In any case, the answer shown would not be acceptable for the following reasons:
The distribution of the results is not uniform, even if the samples you read from time.time() are uniform.
The results from time.time() will probably not be uniform. The result depends on the time at which you make the call, and if your calls are not uniformly distributed in time then the results will probably not be uniformly distributed either. In the worst case, if you're trying to randomise an array on a very fast processor then you might complete the entire operation before the time changes, so the whole array would be filled with the same value. Or at least large chunks of it would be.
The changes to the random value are highly predictable and can be inferred from the speed at which your program runs. In the very-fast-computer case you'll get a bunch of x followed by a bunch of x+1, but even if the computer is much slower or the clock is more precise, you're likely to get aliasing patterns which behave in a similarly predictable way.
Since you take the time value in decimal, it's likely that the least significant digit doesn't visit all possible values uniformly. It's most likely a conversion from binary to some arbitrary number of decimal digits, and the distribution of the least significant digit can be quite uneven when that happens.
The code should be much simpler. It's a complicated solution with many special cases, which reflects a piecemeal approach to the problem rather than an understanding of the relevant principles. An ideal solution would make the behaviour self-evident without having to consider each case individually.
The last one would probably end the interview, I'm afraid. Perhaps not if you could tell a good story about how you got there.
You need to understand the pigeonhole principle to begin to develop a solution. It looks like you're reducing the time to its least significant decimal digit for possible values 0 to 9. Legal results are 1 to 7. If you have seven pigeonholes and ten pigeons then you can start by putting your first seven pigeons into one hole each, but then you have three pigeons left. There's nowhere that you can put the remaining three pigeons (provided you only use whole pigeons) such that every hole has the same number of pigeons.
The problem is that if you pick a pigeon at random and ask what hole it's in, the answer is more likely to be a hole with two pigeons than a hole with one. This is what's called "non-uniform", and it causes all sorts of problems, depending on what you need your random numbers for.
You would either need to figure out how to ensure that all holes are filled equally, or you would have to come up with an explanation for why it doesn't matter.
Typically the "doesn't matter" answer is that each hole has either a million or a million and one pigeons in it, and for the scale of problem you're working with the bias would be undetectable.
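One common way to fill all holes equally is to discard the surplus pigeons, usually called rejection sampling. The sketch below is only an illustration under assumptions: it takes its randomness from os.urandom rather than the clock, and the function name rand7 is made up for this example.

```python
import os

def rand7():
    """Return an integer in 1..7 with equal probability for each value."""
    while True:
        byte = os.urandom(1)[0]   # one random byte: 256 equally likely values
        if byte < 252:            # 252 = 7 * 36, so each result owns 36 bytes
            return byte % 7 + 1   # the 4 surplus values (252..255) are redrawn

print([rand7() for _ in range(10)])
```

Only 4 of 256 draws are rejected, so the loop almost always finishes on the first pass.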

Using the same general architecture you've created, I would do something like this:
import time

def generate_rand():
    ret = str(time.time())  # time in seconds, like 12345.1234
    ret = int(ret[-1]) % 8  # last digit reduced to pseudorandom numbers 0-7
    if ret == 0:
        return 1  # or you could also return the result of another call to generate_rand()
    return ret

while 1:
    print(generate_rand())
    time.sleep(1)

Right way to permute [duplicate]

I've been using Random (java.util.Random) to shuffle a deck of 52 cards. There are 52! (8.0658175e+67) possibilities. Yet, I've found out that the seed for java.util.Random is a long, which is much smaller at 2^64 (1.8446744e+19).
From here, I'm suspicious whether java.util.Random is really that random; is it actually capable of generating all 52! possibilities?
If not, how can I reliably generate a better random sequence that can produce all 52! possibilities?
Selecting a random permutation requires simultaneously more and less randomness than what your question implies. Let me explain.
The bad news: need more randomness.
The fundamental flaw in your approach is that it's trying to choose between ~2^226 possibilities using 64 bits of entropy (the random seed). To fairly choose between ~2^226 possibilities you're going to have to find a way to generate 226 bits of entropy instead of 64.
There are several ways to generate random bits: dedicated hardware, CPU instructions, OS interfaces, online services. There is already an implicit assumption in your question that you can somehow generate 64 bits, so just do whatever you were going to do, only four times, and donate the excess bits to charity. :)
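The 226-bit figure can be checked with a few lines of arithmetic. This is just a sanity check in Python, not part of the original answer:

```python
import math

permutations = math.factorial(52)

# Smallest number of bits whose range covers every permutation index.
bits_needed = math.ceil(math.log2(permutations))
print(bits_needed)
print(permutations.bit_length())
```

Four 64-bit draws give 256 bits, which is why "only four times" is enough.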
The good news: need less randomness.
Once you have those 226 random bits, the rest can be done deterministically and so the properties of java.util.Random can be made irrelevant. Here is how.
Let's say we generate all 52! permutations (bear with me) and sort them lexicographically.
To choose one of the permutations all we need is a single random integer between 0 and 52!-1. That integer is our 226 bits of entropy. We'll use it as an index into our sorted list of permutations. If the random index is uniformly distributed, not only are you guaranteed that all permutations can be chosen, they will be chosen equiprobably (which is a stronger guarantee than what the question is asking).
Now, you don't actually need to generate all those permutations. You can produce one directly, given its randomly chosen position in our hypothetical sorted list. This can be done in O(n^2) time using the Lehmer[1] code (also see numbering permutations and the factoradic number system). The n here is the size of your deck, i.e. 52.
There is a C implementation in this StackOverflow answer. There are several integer variables there that would overflow for n=52, but luckily in Java you can use java.math.BigInteger. The rest of the computations can be transcribed almost as-is:
import java.math.BigInteger;
import java.util.Arrays;

public class LehmerShuffle { // class name is arbitrary
    public static int[] shuffle(int n, BigInteger random_index) {
        int[] perm = new int[n];
        BigInteger[] fact = new BigInteger[n];
        fact[0] = BigInteger.ONE;
        for (int k = 1; k < n; ++k) {
            fact[k] = fact[k - 1].multiply(BigInteger.valueOf(k));
        }
        // compute factorial code
        for (int k = 0; k < n; ++k) {
            BigInteger[] divmod = random_index.divideAndRemainder(fact[n - 1 - k]);
            perm[k] = divmod[0].intValue();
            random_index = divmod[1];
        }
        // readjust values to obtain the permutation
        // start from the end and check if preceding values are lower
        for (int k = n - 1; k > 0; --k) {
            for (int j = k - 1; j >= 0; --j) {
                if (perm[j] <= perm[k]) {
                    perm[k]++;
                }
            }
        }
        return perm;
    }

    public static void main (String[] args) {
        System.out.printf("%s\n", Arrays.toString(
            shuffle(52, new BigInteger(
                "7890123456789012345678901234567890123456789012345678901234567890"))));
    }
}
[1] Not to be confused with Lehrer. :)
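For comparison, the same index-to-permutation idea can be sketched in Python, where integers are arbitrary precision by default. This is an illustrative variant rather than a transcription of the code above: the function name permutation_from_index is hypothetical, and secrets.randbelow is assumed here as the source of the uniformly distributed index.

```python
import math
import secrets

def permutation_from_index(n, index):
    """Return the index-th permutation of range(n) in lexicographic order."""
    remaining = list(range(n))
    perm = []
    for k in range(n, 0, -1):
        block = math.factorial(k - 1)        # permutations sharing the next item
        digit, index = divmod(index, block)  # next factoradic digit
        perm.append(remaining.pop(digit))    # pop from a list keeps this O(n^2)
    return perm

# Draw a uniform index in 0 .. 52! - 1 from the OS entropy source.
random_index = secrets.randbelow(math.factorial(52))
deck = permutation_from_index(52, random_index)
print(deck)
```

Index 0 gives the sorted deck and index 52! - 1 gives the reversed deck, which is a quick way to check the ordering.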
Your analysis is correct: seeding a pseudo-random number generator with any specific seed must yield the same sequence after a shuffle, limiting the number of permutations that you could obtain to 2^64. This assertion is easy to verify experimentally by calling Collections.shuffle twice, passing a Random object initialized with the same seed, and observing that the two random shuffles are identical.
A solution to this, then, is to use a random number generator that allows for a larger seed. Java provides SecureRandom class that could be initialized with byte[] array of virtually unlimited size. You could then pass an instance of SecureRandom to Collections.shuffle to complete the task:
byte seed[] = new byte[...];
Random rnd = new SecureRandom(seed);
Collections.shuffle(deck, rnd);
In general, a pseudorandom number generator (PRNG) can't choose from among all permutations of a 52-item list if its maximum cycle length is less than 226 bits.
java.util.Random implements an algorithm with a modulus of 2^48 and a maximum cycle length of not more than that, so much less than 2^226 (corresponding to the 226 bits I referred to). You will need to use another PRNG with a bigger cycle length, specifically one with a maximum cycle length of 52 factorial or greater.
See also "Shuffling" in my article on random number generators.
This consideration is independent of the nature of the PRNG; it applies equally to cryptographic and noncryptographic PRNGs (of course, noncryptographic PRNGs are inappropriate whenever information security is involved).
Although java.security.SecureRandom allows seeds of unlimited length to be passed in, the SecureRandom implementation could use an underlying PRNG (e.g., "SHA1PRNG" or "DRBG"). And it depends on that PRNG's maximum cycle length whether it's capable of choosing from among 52 factorial permutations.
Let me apologize in advance, because this is a little tough to understand...
First of all, you already know that java.util.Random is not completely random at all. It generates sequences in a perfectly predictable way from the seed. You are completely correct that, since the seed is only 64 bits long, it can only generate 2^64 different sequences. If you were to somehow generate 64 real random bits and use them to select a seed, you could not use that seed to randomly choose between all of the 52! possible sequences with equal probability.
However, this fact is of no consequence as long as you're not actually going to generate more than 2^64 sequences, as long as there is nothing 'special' or 'noticeably special' about the 2^64 sequences that it can generate.
Let's say you had a much better PRNG that used 1000-bit seeds. Imagine you had two ways to initialize it -- one way would initialize it using the whole seed, and one way would hash the seed down to 64 bits before initializing it.
If you didn't know which initializer was which, could you write any kind of test to distinguish them? Unless you were (un)lucky enough to end up initializing the bad one with the same 64 bits twice, then the answer is no. You could not distinguish between the two initializers without some detailed knowledge of some weakness in the specific PRNG implementation.
Alternatively, imagine that the Random class had an array of 2^64 sequences that were selected completely at random at some time in the distant past, and that the seed was just an index into this array.
So the fact that Random uses only 64 bits for its seed is actually not necessarily a problem statistically, as long as there is no significant chance that you will use the same seed twice.
Of course, for cryptographic purposes, a 64 bit seed is just not enough, because getting a system to use the same seed twice is computationally feasible.
EDIT:
I should add that, even though all of the above is correct, the actual implementation of java.util.Random is not awesome. If you are writing a card game, maybe use the MessageDigest API to generate the SHA-256 hash of "MyGameName"+System.currentTimeMillis(), and use those bits to shuffle the deck. By the above argument, as long as your users are not really gambling, you don't have to worry that currentTimeMillis returns a long. If your users are really gambling, then use SecureRandom with no seed.
I'm going to take a bit of a different tack on this. You're right on your assumptions - your PRNG isn't going to be able to hit all 52! possibilities.
The question is: what's the scale of your card game?
If you're making a simple klondike-style game? Then you definitely don't need all 52! possibilities. Instead, look at it like this: a player will have 18 quintillion distinct games. Even accounting for the 'Birthday Problem', they'd have to play billions of hands before they'd run into the first duplicate game.
If you're making a monte-carlo simulation? Then you're probably okay. You might have to deal with artifacts due to the 'P' in PRNG, but you're probably not going to run into problems simply due to a low seed space (again, you're looking at quintillions of unique possibilities.) On the flip side, if you're working with large iteration count, then, yeah, your low seed space might be a deal-breaker.
If you're making a multiplayer card game, particularly if there's money on the line? Then you're going to need to do some googling on how the online poker sites handled the same problem you're asking about. Because while the low seed space issue isn't noticeable to the average player, it is exploitable if it's worth the time investment. (The poker sites all went through a phase where their PRNGs were 'hacked', letting someone see the hole cards of all the other players, simply by deducing the seed from exposed cards.) If this is the situation you're in, don't simply find a better PRNG - you'll need to treat it as seriously as a Crypto problem.
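The "billions of hands" figure for the first scenario can be estimated with the usual birthday-problem approximations. The sketch below assumes the 2^64 seed space quoted above and seeds chosen uniformly at random; the variable names are only illustrative.

```python
import math

seed_space = 2 ** 64

# Games needed for a 50% chance that some seed has repeated.
games_for_even_odds = math.sqrt(2 * seed_space * math.log(2))
# Expected number of games until the first repeated seed.
expected_games_until_repeat = math.sqrt(math.pi * seed_space / 2)

print(f"{games_for_even_odds:.2e}")
print(f"{expected_games_until_repeat:.2e}")
```

Both estimates come out at roughly five billion games.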
Short solution, which is essentially the same as dasblinkenlight's:
// Java 7
SecureRandom random = new SecureRandom();
// Java 8
SecureRandom random = SecureRandom.getInstanceStrong();
Collections.shuffle(deck, random);
You don't need to worry about the internal state. Long explanation why:
When you create a SecureRandom instance this way, it accesses an OS-specific true random number generator. This is either an entropy pool that collects random bits (e.g. for a nanosecond timer the nanosecond precision is essentially random) or an internal hardware number generator.
This input (!), which may still contain spurious traces, is fed into a cryptographically strong hash which removes those traces. That is the reason those CSPRNGs are used, not for creating those numbers themselves! The SecureRandom has a counter which tracks how many bits were used (nextBytes(), nextLong() etc.) and refills the SecureRandom with entropy bits when necessary.
In short: Simply forget objections and use SecureRandom as true random number generator.
If you consider the number as just an array of bits (or bytes) then maybe you could use the (Secure)Random.nextBytes solutions suggested in this Stack Overflow question, and then map the array into a new BigInteger(byte[]).
A very simple algorithm is to apply SHA-256 to a sequence of integers incrementing from 0 upwards. (A salt can be appended if desired to "get a different sequence".) If we assume that the output of SHA-256 is "as good as" uniformly distributed integers between 0 and 2^256 - 1 then we have enough entropy for the task.
To get a permutation from the output of SHA256 (when expressed as an integer) one simply needs to reduce it modulo 52, 51, 50... as in this pseudocode:
deck = [0..51]
shuffled = []
r = SHA256(i)
while deck.size > 0:
    pick = r % deck.size
    r = floor(r / deck.size)
    shuffled.append(deck[pick])
    delete deck[pick]
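A runnable Python version of the pseudocode might look like the sketch below. The function name sha256_shuffle and the way the counter is encoded (decimal ASCII, with the optional salt appended) are assumptions made for this example; the answer does not specify them.

```python
import hashlib

def sha256_shuffle(counter, salt=b""):
    """Derive one ordering of 52 cards from SHA-256 of a counter plus salt."""
    digest = hashlib.sha256(str(counter).encode("ascii") + salt).digest()
    r = int.from_bytes(digest, "big")   # treat the hash as a 256-bit integer
    deck = list(range(52))
    shuffled = []
    while deck:
        r, pick = divmod(r, len(deck))  # reduce modulo 52, 51, 50, ...
        shuffled.append(deck.pop(pick))
    return shuffled

print(sha256_shuffle(0))
print(sha256_shuffle(1))
```

Because 2^256 is not an exact multiple of 52!, some orderings are very slightly more likely than others, so this is close to uniform rather than exactly uniform.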
My empirical research suggests that java.util.Random is not truly random. If you call the Random class's nextGaussian() method and generate a big enough sample of numbers between -1 and 1, the graph shows the normally distributed shape known as the Gaussian model.
The Finnish government-owned gambling bookmaker runs a lottery game that is drawn once a day, every day of the year, and its table of winnings shows that the bookmaker pays out in a normally distributed way. My Java simulation with 5 million draws, using nextInt() to draw the numbers, shows winnings that are normally distributed in much the same way as the bookmaker's in each draw.
My best picks avoid numbers ending in 3 or 7, and it is true that they rarely appear in winning results. A couple of times I got five out of five picks right by avoiding numbers with 3 or 7 in the ones column, among integers between 1 and 70 (Keno).
The Finnish Lottery is drawn once a week on Saturday evenings. If you play a system with 12 numbers out of 39, you may get 5 or 6 right picks on your coupon by avoiding 3 and 7.
The Finnish Lottery has numbers 1-40 to choose from, and it takes 4 coupons to cover all the numbers with the 12-number system. The total cost is 240 euros, and in the long term that is too expensive for a regular gambler to play without going broke. Even if you share coupons with other customers, you still have to be quite lucky to make a profit.

Are pseudo random number generators less likely to repeat?

So they say if you flip a coin 50 times and get heads all 50 times, you're still 50/50 the next flip and 1/4 for the next two. Do you think/know if this same principle applies to computer pseudo-random number generators? I theorize they're less likely to repeat the same number for long stretches.
I ran this a few times and the results are believable, but I'm wondering how many times I'd have to run it to get an anomaly output.
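import random

def genString(iterations):
    mystring = ''
    for _ in range(iterations):
        mystring += str(random.randint(0, 9))
    return mystring

def repeatMax(mystring):
    # Largest number of consecutive repeats of one digit,
    # i.e. the longest run length minus one.
    tempchar = ''
    count = 0
    longest = 0
    for char in mystring:
        if char == tempchar:
            count += 1
            if count > longest:
                longest = count
        else:
            count = 0
            tempchar = char
    return longest

ITERATIONS = 1000000  # assumed value; the original call omitted the argument

for _ in range(10):
    stringer = genString(ITERATIONS)
    print(repeatMax(stringer))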
I got all 7's and a couple 6's. If I run this 1000 times, will it approximate a normal distribution or should I expect it to stay relatively predictable? I'm trying to understand the predictability of pseudo random number generation.
Failure to produce specific patterns is a typical weakness of PRNGs, but the probability of hitting a substantial run of repeated digits at random is so small it's hard to demonstrate that weakness.
It's perfectly reasonable for a PRNG to use only a 32-bit state, which (traditionally) means producing a sequence of four billion numbers and then repeating from the start again. In that case your sequence of 50 coin-flips coming out the same is probably never going to happen (four billion tries at something that has a one in a quadrillion chance is unlikely to succeed); but if it does, then it's going to appear way too often.
Superficially you're looking for k-dimensional equidistribution as a test for whether or not you can expect to find a prescribed pattern in the output without deeper analysis of the specific generator. If your generator claims at least 50-dimensional equidistribution then you're guaranteed to see the 50-heads state at least once.
However, if your generator emits 32-bit results but you only test whether each result maps to heads or tails, you have some chance at success even if the generator fails the k-dimension test, and that chance depends on the specifics of the generator and the mapping function.
If you adjust the implementation of your generator to return just one bit at a time, then you have an opportunity to try to squeeze 50 heads out of just 50 bits of state (or potentially as few as 18, but that generator would probably be faulty). Provided the generator visits all 2**50 possible states, one of those states will produce 50 heads in a row. You may get a few more heads when adjacent states start or end with more zeroes.
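The "four billion tries at a one in a quadrillion chance" estimate can be put into numbers. This rough Python sketch treats each of the 2^32 outputs as an independent try, which is an approximation and not a property of any particular generator:

```python
import math

period = 2 ** 32      # outputs before a 32-bit-state generator repeats
p_run = 2.0 ** -50    # chance that 50 fair flips all come up the same way

# Probability of at least one success in `period` independent tries,
# computed with log1p/expm1 to avoid rounding the tiny values to zero.
p_at_least_once = -math.expm1(period * math.log1p(-p_run))
print(p_at_least_once)
```

The result is close to 2^-18, or about four in a million.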

Finding seeds for a 5 byte PRNG

An old idea, but ever since I had it I haven't found a reasonably good way to solve the problem it raised. I "invented" (see below) a very compact and, in my opinion, reasonably well performing PRNG, but I can't figure out an algorithm to build suitable seed values for it at large bit depths. My current solution is simply brute force, and its running time is O(n^3).
The generator
My idea came from the XOR taps (essentially LFSRs) that some old 8-bit machines used for sound generation. I fiddled with XOR as a base on a C64, tried putting opcodes together, and experimented with the results. The final working solution looked like this:
asl
adc #num1
eor #num2
This is 5 bytes on the 6502. With a well chosen num1 and num2, in the accumulator it iterates over all 256 values in a seemingly random order, that is, it looks reasonably random when used to fill the screen (I wrote a little 256b demo back then on this). There are 40 suitable num1 & num2 pairs for this, all giving decent looking sequences.
The concept can be well generalized, if expressed in pure C, it may look like this (BITS being the bit depth of the sequence):
r = (((r >> (BITS-1)) & 1U) + (r << 1) + num1) ^ num2;
r = r & ((1U<<BITS)-1U);
This C code is longer since it is generalized, and even if one would use the full depth of an unsigned integer, C wouldn't have the necessary carry logic to transfer the high bit of the shift to the add operation.
For some performance analysis and comparisons, see below, after the question(s).
The problem / question(s)
The core problem with the generator is finding suitable num1 and num2 which would make it iterate over the whole possible sequence of a given bit depth. At the end of this section I attach my code which just brute-forces it. It will finish in reasonable time for up to 12 bits, you may wait for all 16 bits (there are 5736 possible pairs for that by the way, acquired with an overnight full search a while ago), and you may get a few 20 bits if you are patient. But O(n^3) is really nasty...
(Who will get to find the first full 32bit sequence?)
Other interesting questions which arise:
For both num1 and num2 only odd values are able to produce full sequences. Why? This may not be hard (simple logic, I guess), but I never reasonably proved it.
There is a mirroring property along num1 (the add value), that is, if 'a' with a given 'b' num2 gives a full sequence, then the two's complement of 'a' (in the given bit depth) with the same num2 also gives a full sequence. I only observed this happening reliably with all the full generations I calculated.
A third interesting property is that for all the num1 & num2 pairs the resulting sequences seem to form proper circles, that is, at least the number zero seems to be always part of a circle. Without this property my brute force search would die in an infinite loop.
Bonus: Was this PRNG already known before? (and I just re-invented it)?
And here is the brute force search's code (C):
#define BITS 16
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned int r;
    unsigned int c;
    unsigned int num1;
    unsigned int num2;
    unsigned int mc=0U;

    num1=1U; /* Only odd add values produce useful results */
    do{
        num2=1U; /* Only odd eor values produce useful results */
        do{
            r= 0U;
            c=~0U;
            do{
                r=(((r>>(BITS-1)) & 1U)+r+r+num1)^num2;
                r&=(1U<<(BITS-1)) | ((1U<<(BITS-1))-1U); /* 32bit safe */
                c++;
            }while (r);
            if (c>=mc){
                mc=c;
                printf("Count-1: %08X, Num1(adc): %08X, Num2(eor): %08X\n", c, num1, num2);
            }
            num2+=2U;
            num2&=(1U<<(BITS-1)) | ((1U<<(BITS-1))-1U);
        }while(num2!=1U);
        num1+=2U;
        num1&=((1U<<(BITS-1))-1U); /* Do not check complements */
    }while(num1!=1U);
    return 0;
}
To show that it is working, after each iteration this outputs the pair found if its sequence length is equal to or longer than the previous one. Modify the BITS constant for sequences of other depths.
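The third observation (zero always lies on a cycle) can be reasoned about directly: after masking, each step is a rotate-left by one bit, followed by adding a constant, followed by XOR with a constant. All three operations are invertible, so the step is a permutation of the 2^BITS values and every starting value, including zero, must return to itself. The Python sketch below restates the generator and checks this for small bit depths. The function names are hypothetical, and unlike the C program it tries every odd num1 instead of skipping complements.

```python
def step(r, num1, num2, bits):
    """One generator step: rotate left by one bit, add num1, XOR num2."""
    mask = (1 << bits) - 1
    carry = (r >> (bits - 1)) & 1   # the bit shifted out becomes the carry-in
    return ((carry + (r << 1) + num1) ^ num2) & mask

def cycle_length(num1, num2, bits):
    """Number of steps until the sequence started at 0 returns to 0."""
    r, count = step(0, num1, num2, bits), 1
    while r != 0:
        r = step(r, num1, num2, bits)
        count += 1
    return count

def full_period_pairs(bits):
    """Brute-force every odd (num1, num2) pair and keep the full-length ones."""
    size = 1 << bits
    return [(a, b) for a in range(1, size, 2) for b in range(1, size, 2)
            if cycle_length(a, b, bits) == size]

# Every value 0..255 appears exactly once as an output, so the step is a permutation.
print(sorted(step(r, 0x25, 0xDB, 8) for r in range(256)) == list(range(256)))
print(len(full_period_pairs(4)))
```

Calling len(full_period_pairs(8)) gives a count that can be compared with the 40 pairs reported for the 8-bit case.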
Seed hunting
I did some graphing relating to the seeds. Here is a nice image showing all the 9bit sequence lengths:
The white dots are the full-length sequences, the X axis is num1 (add), the Y axis is num2 (xor), and the brighter the dot, the longer the sequence. Other bit depths look very similar in pattern: they all seem to be broken up into sixteen major tiles with two patterns repeating with mirroring. The similarity of the tiles is not complete: for example, a diagonal from the upper-left corner to the bottom-right is clearly visible above while its opposite is absent, but for the full-length sequences this property seems to be reliable.
Relying on this it is possible to reduce the work even more than by the previous assumptions, but that's still O(n^3)...
Performance analysis
At present the longest sequences that can be generated are 24 bits: on my computer it takes about 5 hours to brute-force a full 24-bit sequence. This is still just so-so for real PRNG tests such as Diehard, so for now I have gone with my own approach.
First, it's important to understand the role of the generator. Because of its simplicity it is by no means a very good generator; its goal is rather to produce decent numbers blazingly fast. In this area, where no multiply / divide operations are needed, a Galois LFSR can deliver similar performance. So my generator is only of use if it can outperform that one.
The tests I performed all used 16-bit generators. I chose this depth since it gives a useful sequence length while the numbers can still be broken up into two 8-bit parts, making it possible to present various bit-exact graphs for visual analysis.
The core of the tests was looking for correlations between previous and currently generated numbers. For this I used X:Y plots where the previous generation was the Y and the current one the X, both broken up into low / high parts as mentioned above, giving two graphs. I created a program capable of plotting these step by step in real time, which also makes it possible to roughly examine how the numbers follow each other and how the graphs fill up. Here obviously only the end results are shown, after the generators ran through their full 2^16 or 2^16-1 (Galois) cycle.
The explanation of the fields:
The images consist of 8x2 256x256 graphs, making the total image size 2048x512 (check them at original size).
The top left graph just confirms that indeed a full sequence was plotted, it is simply an X = r % 256; Y = r / 256; plot.
The bottom left graph shows every second number only plotted the same way as the top, just confirming that the numbers occur reasonably randomly.
From the second graph the top row are the high byte correlation graphs. The first of them uses the previous generation, the next skips one number (so uses 2nd previous generation), and so on until the 7th previous generation.
From the second the bottom row are the low byte correlation graphs, organized the same way as above.
Galois generator, 0xB400 tap set
This is the generator found in the Wikipedia Galois example. Its performance is not the worst, but it is still definitely not really good.
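For reference, a 16-bit Galois LFSR with this tap set can be sketched in a few lines of Python. The start state 0xACE1 is just an arbitrary nonzero value chosen for the example, and the function name is hypothetical. The loop relies on the top bit of the tap mask being set, which makes each step invertible so the state is guaranteed to come back to its start.

```python
def galois_lfsr_period(taps=0xB400, start=0xACE1):
    """Count steps until a 16-bit Galois LFSR returns to its start state."""
    state, period = start, 0
    while True:
        lsb = state & 1    # output bit
        state >>= 1        # shift right by one
        if lsb:
            state ^= taps  # apply the tap mask when a 1 was shifted out
        period += 1
        if state == start:
            return period

print(galois_lfsr_period())
```

A period of 2^16 - 1 means every state except zero is visited, which matches the remark that the Galois generator lacks zero.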
Galois generator, 0xA55A tap set
One of the decent Galois "seeds" I found. Note that the low part of the 16-bit numbers seems to be a lot better than the above; however, I couldn't find any Galois "seed" which would fuzz up the high byte.
My generator, 0x7F25 (adc), 0x00DB (eor) seed
This is the best of my generators where the high byte of the EOR value is zero. Limiting the high byte is useful on 8bit machines since then this calculation can be omitted for smaller code and faster execution if the loss of randomness performance is affordable.
My generator, 0x778B (adc), 0x4A8B (eor) seed
This is one of the very good quality seeds by my measurements.
To find seeds with good correlation, I built a small program which would analyse them to some degree, the same way for Galois and mine. The "good quality" examples were pinpointed by that program, and then I tested several of them and selected one from those.
Some conclusions:
The Galois generator seems to be more rigid than mine. On all the correlation graphs definite geometrical patterns are observable (some seeds produce "checkerboard" patterns, not shown here) even if it is not composed of lines. My generator also shows patterns, but with more generations they grow less defined.
A portion of the Galois generator's results, the part which includes the bits in the high byte, seems to be inherently rigid, a property which seems to be absent from my generator. This is a weak assumption that probably needs some more research (to see if this is always so with the Galois generator and not with mine on other bit combinations).
The Galois generator lacks zero (maximal period being 2^16-1).
As of now it is impossible to generate a good set of seeds for my generator above 20 bits.
Later I might go deeper into this subject and try to test the generator with Diehard, but for now the inability to generate large enough seeds for it makes that impossible.
This is some form of a non-linear feedback shift register. I don't know if it has been used as such, but it resembles linear feedback shift registers somewhat. Read this Wikipedia page as an introduction to LFSRs. They are used frequently in pseudo random number generation.
However, your pseudo random number generator is inherently bad in that there is a linear correlation between the highest order bit of a previously generated number and the lowest order bit of the number generated next. You shift the highest bit B out, and then the lowest order bit of the new number will be the XOR of B, the lowest order bit of the additive constant num1 and the lowest order bit of the XORed constant num2, because binary addition is equivalent to exclusive or at the lowest order bit. Most likely your PRNG has other similar deficiencies. Creating good PRNGs is hard.
However, I must admit that the C64 code is pleasingly compact!
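The correlation described above is easy to confirm exhaustively for the 16-bit case. The sketch below restates the generator step in Python and uses the 0x778B / 0x4A8B pair from the question; the name step16 is hypothetical.

```python
def step16(r, num1, num2):
    """One 16-bit generator step: rotate left by one, add num1, XOR num2."""
    carry = (r >> 15) & 1
    return ((carry + (r << 1) + num1) ^ num2) & 0xFFFF

NUM1, NUM2 = 0x778B, 0x4A8B  # pair quoted in the question

# Predicted low bit: old high bit XOR low bit of num1 XOR low bit of num2.
predicted_ok = all(
    (step16(r, NUM1, NUM2) & 1) == (((r >> 15) ^ NUM1 ^ NUM2) & 1)
    for r in range(1 << 16)
)
print(predicted_ok)
```

Since both constants are odd, their low bits cancel, and the new low bit is simply a copy of the previous high bit.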

Range extremes don't seem to get drawn by random()

For several valid reasons I have to use BSD's random() to generate awfully large amounts of random numbers, and since its cycle is quite short (~2^69, if I'm not mistaken) the quality of such numbers degrades pretty quickly for my use case. I could use the rng board I have access to but it's painfully slow so I thought I could do this trick: take one number from the board, use it to seed random(), use random() to draw numbers and reseed it when the board says a new number is available. The board generates about 100 numbers per second so my guess is that random() hardly gets to cycle over and the generation rate easily keeps up with my requirements of several millions numbers per second.
Anyway, the problem is that random() claims to uniformly draw numbers between 0 and (2^31)-1, but I've been drawing an uncountable amount of numbers and I've never ever seen a 0 nor a (2^31)-1 so far. Maybe some 1 and (2^31)-2, but I've never seen the extremes. Now, I know the problem with random numbers is that you can never be sure (see Dilbert, Debian), but this seems extremely odd nonetheless. Moreover I tried analysing the generated datasets with Octave using the histc() function, and the lowest and the highest bins contain between half and three quarters the amount of numbers of the middle bins (which in turn are uniformly filled, so I guess in some sense the distribution is "uniform").
Can anybody explain this?
EDIT Some code
The board outputs this structure with the three components, and then I do some mumbo-jumbo combining them to produce the seed. I have no specs for this board: it's an ancient piece of hardware thrown together by a previous student some years ago, there's little documentation, and the formula I'm using is one of those suggested in the docs. The STEP parameter tells me how many numbers I can draw using one seed, so I can optimise performance and throttle down CPU usage at the same time.
float n = fabsf(fmod(sqrt(a.s1*a.s1 + a.s2*a.s2 + a.s3*a.s3), 1.0));
unsigned int seed = n * UINT32_MAX;
srandom(seed);
for(int i = 0; i < STEP; i++) {
    long r = random();
    n = (float)r / (UINT32_MAX >> 1);
    [_numbers addObject:[NSNumber numberWithFloat:n]];
}
Are you certain that
#include <stdlib.h>
int main(void) {
    while (random() != 0L);
    return 0;
}
hangs indefinitely? On my Linux machine (the GNU C library uses the same linear feedback shift register as BSD, albeit with a different seeding procedure) it doesn't.
According to this reference the algorithm produces 'runs' of consecutive zeroes or ones up to length n-1 where n is the size of the shift register. When this has a size of 31 integers (the default case) we can even be certain that, eventually, random() will return 0 a whopping 30 (but never 31) times in a row! Of course, we may have to wait a few centuries to see it happening...
To extend the cycle length, one method is to run two RNGs, with different periods, and XOR their output. See L'Ecuyer 1988 for some examples.
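A minimal sketch of the XOR combination described above, assuming two small linear congruential generators stand in for the real ones. The `Lcg` class, the function name and the parameter choices are illustrative assumptions (commonly cited textbook constants), not the generators from L'Ecuyer's paper and not BSD's random().

```python
from itertools import islice


class Lcg:
    """Minimal linear congruential generator: x <- (a*x + c) mod m."""

    def __init__(self, a: int, c: int, m: int, seed: int):
        self.a, self.c, self.m = a, c, m
        self.state = seed % m

    def next(self) -> int:
        self.state = (self.a * self.state + self.c) % self.m
        return self.state


def combined_stream(seed1: int, seed2: int):
    """Yield the XOR of two generators that have different moduli."""
    g1 = Lcg(1664525, 1013904223, 2**32, seed1)
    # Multiplicative generator: seed2 must not be a multiple of the
    # modulus, otherwise its state stays at zero forever.
    g2 = Lcg(48271, 0, 2**31 - 1, seed2)
    while True:
        yield g1.next() ^ g2.next()


first_five = list(islice(combined_stream(1, 1), 5))
print(first_five)
```

The combined sequence only repeats when both component generators return to their starting states at the same time, which is why mismatched periods help.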

Algorithm to find a common multiplier to convert decimal numbers to whole numbers

I have an array of numbers that potentially have up to 8 decimal places and I need to find the smallest common number I can multiply them by so that they are all whole numbers. I need this so the original numbers can all be multiplied out to the same scale and be processed by a sealed system that will only deal with whole numbers; then I can retrieve the results and divide them by the common multiplier to get my relative results.
Currently we do a few checks on the numbers and multiply by 100 or 1,000,000, but the processing done by the *sealed system can get quite expensive when dealing with large numbers, so multiplying everything by a million just for the sake of it isn’t really a great option. As an approximation, let's say that the sealed algorithm gets 10 times more expensive every time you multiply by a factor of 10.
What is the most efficient algorithm, that will also give the best possible result, to accomplish what I need, and is there a mathematical name and/or formula for what I need?
*The sealed system isn’t really sealed. I own/maintain the source code for it, but it's 100,000-odd lines of proprietary magic and it has been thoroughly bug and performance tested; altering it to deal with floats is not an option for many reasons. It is a system that creates a grid of X by Y cells, then rects that are X by Y are dropped into the grid, “proprietary magic” occurs and results are spat out. Obviously this is an extremely simplified version of reality, but it’s a good enough approximation.
So far there are quite a few good answers and I wondered how I should go about choosing the ‘correct’ one. To begin with I figured the only fair way was to create each solution and performance test it, but I later realised that pure speed wasn’t the only relevant factor: a more accurate solution is also very relevant. I wrote the performance tests anyway, but currently I’m choosing the correct answer based on speed as well as accuracy using a ‘gut feel’ formula.
My performance tests process 1000 different sets of 100 randomly generated numbers.
Each algorithm is tested using the same set of random numbers.
Algorithms are written in .Net 3.5 (although thus far would be 2.0 compatible)
I tried pretty hard to make the tests as fair as possible.
Greg – Multiply by large number and then divide by GCD – 63 milliseconds
Andy – String Parsing – 199 milliseconds
Eric – Decimal.GetBits – 160 milliseconds
Eric – Binary search – 32 milliseconds
Ima – sorry I couldn’t figure out how to implement your solution easily in .Net (I didn’t want to spend too long on it)
Bill – I figure your answer was pretty close to Greg’s so didn’t implement it. I’m sure it’d be a smidge faster but potentially less accurate.
So Greg’s “Multiply by large number and then divide by GCD” solution was the second fastest algorithm and it gave the most accurate results, so for now I’m calling it correct.
I really wanted the Decimal.GetBits solution to be the fastest, but it was very slow. I’m unsure if this is due to the conversion of a Double to a Decimal or the bit masking and shifting. There should be a similar usable solution for a straight Double using BitConverter.GetBytes and some knowledge contained here: http://blogs.msdn.com/bclteam/archive/2007/05/29/bcl-refresher-floating-point-types-the-good-the-bad-and-the-ugly-inbar-gazit-matthew-greig.aspx but my eyes just kept glazing over every time I read that article and I eventually ran out of time to try to implement a solution.
I’m always open to other solutions if anyone can think of something better.
I'd multiply by something sufficiently large (100,000,000 for 8 decimal places), then divide by the GCD of the resulting numbers. You'll end up with a pile of smallest integers that you can feed to the other algorithm. After getting the result, reverse the process to recover your original range.
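The multiply-then-divide-by-GCD idea can be sketched as follows. This is an illustration rather than the code that was benchmarked: the function name and the `places` parameter are made up here, and it assumes every input really has at most `places` decimal digits so that rounding after scaling is safe.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def scale_to_integers(values, places=8):
    """Scale by 10**places, then divide out the GCD of the scaled integers.

    Returns the reduced integers and the effective multiplier
    (10**places / gcd) as an exact Fraction.
    """
    scale = 10 ** places
    scaled = [round(v * scale) for v in values]
    # Folding gcd over the list handles any number of values;
    # starting from 0 works because gcd(0, x) == abs(x).
    common = reduce(gcd, scaled, 0) or 1  # "or 1" covers an all-zero input
    return [s // common for s in scaled], Fraction(scale, common)


ints, multiplier = scale_to_integers([2.1, 0.7, 1.4])
print(ints, multiplier)
```

Note that the effective multiplier is generally a rational number rather than a whole one; results from the integer-only system are mapped back by dividing by that multiplier.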
Multiply all the numbers by 10 until you have integers.
Divide by 2, 3, 5, 7 while you still have all integers.
I think that covers all cases.
2.1 * 10/7 -> 3
0.008 * 10^3/2^3 -> 1
That's assuming your multiplier can be a rational fraction.
If you want to find some integer N so that N*x is an exact integer for every float x in a given set, then you have a basically unsolvable problem. Suppose x is the smallest positive float your type can represent, say 10^-30. If you multiply all your numbers by 10^30 and then try to represent them in binary (otherwise, why are you even trying so hard to make them ints?), then you'll lose basically all the information in the other numbers due to overflow.
So here are two suggestions:
If you have control over all the related code, find another approach. For example, if you have some function that takes only ints, but you have floats, and you want to stuff your floats into the function, just rewrite or overload this function to accept floats as well.
If you don't have control over the part of your system that requires ints, then choose a precision you care about, accept that you will simply have to lose some information sometimes (but it will always be "small" in some sense), and then just multiply all your floats by that constant and round to the nearest integer.
By the way, if you're dealing with fractions, rather than floats, then it's a different game. If you have a bunch of fractions a/b, c/d, e/f (each in lowest terms) and you want the least common multiplier N such that N*(each fraction) is an integer, then N = lcm(b, d, f), where lcm(b, d, f) = lcm(b, lcm(d, f)) and lcm(x, y) = x*y / gcd(x, y). You can use Euclid's algorithm to find the gcd of any two numbers.
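For exact fractions, the smallest whole-number multiplier is the least common multiple of the reduced denominators. A small sketch using Python's `fractions` module, with a hypothetical helper name; building each `Fraction` from a decimal string keeps the value exact, which a binary float would not.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def lcm(x: int, y: int) -> int:
    """Least common multiple of two positive integers via Euclid's gcd."""
    return x // gcd(x, y) * y


def least_integer_multiplier(fractions_):
    """Smallest positive N such that N * f is an integer for every f.

    Fraction always stores values in lowest terms, so folding lcm
    over the denominators is enough.
    """
    return reduce(lcm, (f.denominator for f in fractions_), 1)


values = [Fraction("2.1"), Fraction("0.008")]
n = least_integer_multiplier(values)
print(n, [v * n for v in values])
```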
Greg: Nice solution, but won't calculating a GCD that's common to an array of 100+ numbers get a bit expensive? And how would you go about that? It's easy to do GCD for two numbers, but for 100 it becomes more complex (I think).
Evil Andy: I'm programming in .Net and the solution you pose is pretty much a match for what we do now. I didn't want to include it in my original question because I was hoping for some outside-the-box (or my box anyway) thinking and I didn't want to taint people's answers with a potential solution. While I don't have any solid performance statistics (because I haven't had any other method to compare it against), I know the string parsing would be relatively expensive and I figured a purely mathematical solution could potentially be more efficient.
To be fair, the current string parsing solution is in production and there have been no complaints about its performance yet (it's even in production in a separate system in a VB6 format and no complaints there either). It's just that it doesn't feel right; I guess it offends my programming sensibilities, but it may well be the best solution.
That said I'm still open to any other solutions, purely mathematical or otherwise.
What language are you programming in? Something like
myNumber.ToString().Substring(myNumber.ToString().IndexOf(".")+1).Length
would give you the number of decimal places for a double in C#. You could run each number through that and find the largest number of decimal places(x), then multiply each number by 10 to the power of x.
Edit: Out of curiosity, what is this sealed system which you can pass only integers to?
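The string-based approach above (count the digits after the decimal point, take the maximum, multiply by that power of ten) might look like this in Python. The helper names are invented for this sketch, and it relies on `repr()` giving the shortest decimal string that round-trips the float, which is an assumption about how the inputs were produced.

```python
from decimal import Decimal


def decimal_places(x: float) -> int:
    """Digits after the decimal point in the shortest repr of x."""
    # normalize() strips trailing zeros; Decimal also parses
    # scientific notation such as '1e-08'.
    exponent = Decimal(repr(x)).normalize().as_tuple().exponent
    return max(0, -exponent)


def common_power_of_ten(values) -> int:
    """Smallest power of ten that turns every value into a whole number."""
    return 10 ** max((decimal_places(v) for v in values), default=0)


print(common_power_of_ten([2.1, 0.008, 5.0]))
```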
In a loop, get the mantissa and exponent of each number as integers. You can use frexp for the exponent, but I think a bit mask will be required for the mantissa. Find the minimal exponent. Find the most significant digits in the mantissa (loop through the bits looking for the last "1"), or simply use a predefined number of significant digits.
Your multiple is then something like 2^(numberOfDigits-minExponent). "Something like" because I don't remember biases/offsets/ranges, but I think the idea is clear enough.
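A rough illustration of the mantissa/exponent idea, swapping in Python's `float.as_integer_ratio()` for the manual bit masking: it returns the exact value of a float as a reduced fraction whose denominator is a power of two, so the largest denominator in the set is a power-of-two multiplier that makes every value an exact integer. The function name is hypothetical.

```python
def power_of_two_multiplier(values) -> int:
    """Smallest power of two that makes every float an exact integer."""
    # Each denominator is a power of two, so the largest one is
    # divisible by all the others.
    return max((float(v).as_integer_ratio()[1] for v in values), default=1)


print(power_of_two_multiplier([0.5, 0.75, 3.0]))
# Decimal fractions such as 0.1 are not exact in binary, so the
# multiplier becomes very large.
print(power_of_two_multiplier([0.1]))
```

This shows the trade-off of working on the binary representation: it is exact, but for values that originated as decimals the multiplier is far larger than the power of ten a decimal-based method would choose.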
So basically you want to determine the number of digits after the decimal point for each number.
This would be rather easier if you had the binary representation of the number. Are the numbers being converted from rationals or scientific notation earlier in your program? If so, you could skip the earlier conversion and have a much easier time. Otherwise you might want to pass each number to a function in an external DLL written in C, where you could work with the floating point representation directly. Or you could cast the numbers to decimal and do some work with Decimal.GetBits.
The fastest approach I can think of in-place and following your conditions would be to find the smallest necessary power-of-ten (or 2, or whatever) as suggested before. But instead of doing it in a loop, save some computation by doing binary search on the possible powers. Assuming a maximum of 8, something like:
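int NumDecimals( double d )
{
    // make d positive for clarity; it won't change the result
    if( d<0 ) d=-d;
    // now do binary search on the possible numbers of post-decimal digits to
    // determine the actual number as quickly as possible
    // (1e4 is 10^4; writing 10e4 would mean 10^5):
    if( NeedsMore( d, 1e4 ) )
    {
        // more than 4 decimals
        if( NeedsMore( d, 1e6 ) )
        {
            // > 6 decimal places
            if( NeedsMore( d, 1e7 ) ) return 8;
            return 7;
        }
        else
        {
            // <= 6 decimal places
            if( NeedsMore( d, 1e5 ) ) return 6;
            return 5;
        }
    }
    else
    {
        // <= 4 decimal places: same pattern on the lower half
        if( NeedsMore( d, 1e2 ) )
        {
            if( NeedsMore( d, 1e3 ) ) return 4;
            return 3;
        }
        else
        {
            if( NeedsMore( d, 1e1 ) ) return 2;
            if( NeedsMore( d, 1e0 ) ) return 1;
            return 0;
        }
    }
    // the multiplier for d is then Math.Pow( 10, NumDecimals( d ) )
}
bool NeedsMore( double d, double e )
{
    // check whether the representation of d has more decimal places than the
    // power of 10 represented in e, i.e. whether d*e still has a fractional
    // part. Floating-point rounding can leave a tiny nonzero remainder, so a
    // small tolerance may be needed in practice.
    return (d*e - Math.Floor( d*e )) > 0;
}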
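The same binary search can be written as a loop. The sketch below is a Python variation, not the C# that was timed: the function name is invented, and it swaps the "multiply and look for a fractional part" test for a round-trip check (`round(d, k) == d`) to sidestep the floating-point remainder issue.

```python
def num_decimals(d: float, max_places: int = 8) -> int:
    """Binary-search the smallest k with round(d, k) == d, capped at max_places."""
    lo, hi = 0, max_places
    while lo < hi:
        mid = (lo + hi) // 2
        if round(d, mid) == d:
            hi = mid        # mid places are enough; try fewer
        else:
            lo = mid + 1    # d needs more than mid places
    return lo


values = [2.1, 0.008, 0.07]
multiplier = 10 ** max(num_decimals(v) for v in values)
print(multiplier)
```

With a maximum of 8 places the search needs at most four comparisons per number, against up to eight for a linear scan.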
PS: you wouldn't be passing security prices to an option pricing engine would you? It has exactly the flavor...

Resources