I have a homework problem on running a simulation where I generate 100 random numbers and perform a calculation on each outcome. The next question asks me to repeat the previous question but with a different stream of pseudorandom numbers. The side note tells me to perform both computations within one call to the program, because changing the seed/state arbitrarily can lead to overlapping streams.
Can someone explain to me what this means? Why do I have to do it in one loop?
Why can't I just call the same code twice using a different seed each time?
Pseudo-random number generators (PRNGs) work by iterating through a deterministic set of calculations on some internal information known as the generator's state, and then handing you back a value which is based on the state. There's a finite amount of state information that determines what the next state, and thus the next outcome, will be. Since it's finite, eventually the generator will revisit a state that it used before, and from that point forward all values will be exact duplicates of the sequence you've already seen. The PRNG is said to have cycled. "Seeding" a random number generator sets the starting point for the state, so it effectively corresponds to choosing an entry point to the cycle.
If a human intervenes by changing the seed arbitrarily, there's a chance that they will prematurely put the state back to a point where some portion of the output sequence gets repeated. This is referred to as overlapping streams. The solution is to seed your PRNG once, and then don't mess with it, so it can achieve its full cycle.
In your case it means that the values and ordering of your first set of 100 numbers will be distinct from the values and ordering of your second set of 100.
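A minimal sketch of the idea in Python (assuming Python's random module as a stand-in for whatever PRNG your course uses): seed exactly once, then draw both sets of 100 from the same stream.

```python
import random

rng = random.Random(12345)  # seed exactly once

# First run: 100 pseudorandom outcomes from the stream.
first_set = [rng.random() for _ in range(100)]

# Second run: the NEXT 100 values from the same stream,
# so the two runs occupy disjoint stretches of the generator's cycle.
second_set = [rng.random() for _ in range(100)]
```

Because the generator is never re-seeded between the two runs, the second set picks up exactly where the first left off, which is what the side note is asking for.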
Related
For a PRNG like the Mersenne Twister which has a period of 2^19937-1, are there precisely that many states for the PRNG? i.e. the states start repeating at that point because there are no more states for the PRNG to be in?
As a follow up, what is the distribution over the next state given you exist in some current state of your PRNG? I am running long simulations and interested in finding out when it is likely that one simulation using a random seed will run into a state that exists in another simulation using a different random seed.
Pseudo-Random Number Generators (PRNGs) work by applying a deterministic function to the current state in order to determine the next state, then projecting that state onto the desired number of output bits. Asking "what is the distribution over the next state given you exist in some current state" is essentially meaningless. Since the transformation is deterministic, from any given state there is one and only one state that will be the next one. Eventually you will inevitably come back to some previously observed state, and from that point on the states and their corresponding output projections will repeat in an identical order, and we say that your PRNG has cycled. The cycle length is determined by how many unique states are reachable from your starting point (seed state), and is bounded by but not necessarily equal to the size of the state space. For instance, there are functions that will only produce even numbers if seeded by an even number, or odd numbers if seeded by an odd number. Neither of these cases would produce all possible integers before repeating.
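To make the even/odd example concrete, here's a toy Python sketch (my own construction, not any real PRNG): the transition state' = (5*state + 2) mod 16 has 16 possible states, but they split into two cycles of length 8, one all-even and one all-odd, so the cycle length you get depends on the parity of your seed.

```python
def next_state(s):
    # Deterministic transition: each state has exactly one successor.
    return (5 * s + 2) % 16

def orbit(seed):
    # Follow states until we revisit one we've seen (the PRNG has cycled).
    seen, s = [], seed
    while s not in seen:
        seen.append(s)
        s = next_state(s)
    return seen

# orbit(0) visits the 8 even states; orbit(1) visits the 8 odd states.
```

Both cycles have length 8, even though the state space has 16 states, which illustrates why the cycle length is bounded by, but not necessarily equal to, the size of the state space.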
Mersenne Twister achieves a cycle length of 2^19937-1 by having a state space with 19937 bits. (If I recall correctly, the -1 is because a state of all 0's is unreachable from any of the other non-zero states.) As for the likelihood of overlapping states in two runs, forget about it. To give you an idea of how big 2^19937-1 is, consider the following: Physicists estimate that there are about 10^80 ≈ 2^266 subatomic particles in the known universe. If you've seeded your runs independently using full state initialization (for instance by reading 2.5KB from /dev/random into a buffer of bytes), even if you're using 10^10 random numbers you're asking the equivalent of how likely you are to just happen to pick the same subatomic particle twice out of 2^(19937-300) universes. It just ain't gonna happen.
I'm working on a project with multiple functions. I call a pseudorandom number several times, each in different functions, and then do some math on it. For example:
f(i,j)*random(i,j)
I assume that in the different functions the pseudorandom number isn't equal to the pseudorandom number in another function at a given i and j. Is that a correct assumption? If so, how is it possible to change that?
If it matters, the language I'm using is Xojo, which is similar to VB6.
I'm not really sure what the question is, but hopefully giving some basics of pseudo-random number generators (PRNGs) will answer it:
This is more of a language feature, but usually the behavior of a function like random is independent of where you call it from (there may be other determining factors).
random(i,j) may or may not return the same number twice in a row or after some time. It's (pseudo-)random, we just don't know whether it will.
If you want random(i,j) to always return the same value, you can consider writing your own function that maps some value of i and j to another value using some formula, or you can store all previous generated numbers in a map, and simply return this value if it exists.
If you want random(i,j) to never return the same value, consider generating numbers from i to j and shuffling them and simply returning the next value in the list repeatedly.
You can usually set the seed of a PRNG. If you set the seed to some value and observe some sequence, then setting the seed to the same value at some later time will produce the same sequence again. This doesn't serve much of a practical purpose (that I can think of) beyond giving you the ability to reproduce previous results exactly.
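For the "never return the same value" case, a minimal Python sketch (the names here are my own, not Xojo's): pre-generate the range, shuffle it, and hand values out one at a time.

```python
import random

def make_unique_random(i, j, seed=None):
    """Return a function that yields each integer in [i, j] exactly once,
    in shuffled order, raising IndexError once the range is exhausted."""
    values = list(range(i, j + 1))
    random.Random(seed).shuffle(values)
    return values.pop  # each call removes and returns the next value

draw = make_unique_random(1, 10, seed=42)
samples = [draw() for _ in range(10)]  # a permutation of 1..10
```

The seed parameter makes the sequence reproducible, per the point above about seeding.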
If you want random(i,j) to return the same random number each time it's called with the same i,j you could simply save the state.
One approach would be to store the state in an n x n matrix R (where n is the range of i,j). On the first call to random(i,j) set R(i,j) = rand(). On subsequent calls retrieve the existing value.
If the range of i,j is very large and the values sparse use a hash table for R instead of a matrix.
The awk manual says srand "sets the seed (starting point) for rand()". I used srand(5) with the following code:
awk 'BEGIN {srand(5);while(1)print rand()}'> /var/tmp/rnd
It generates numbers like:
0.177399
0.340855
0.0256178
0.838417
0.0195347
0.29598
Can you explain how srand(5) generates the "starting point" with the above output?
The starting point is called the seed. It is fed into the first iteration of the rand function; after that, rand uses the previous value it produced to generate the next number. Using a prime number for the seed is a good idea.
PRNGs (pseudo-random number generators) produce random values by keeping some kind of internal state which can be advanced through a series of values whose repeating period is very large, and whose successive values have very few apparent statistical correlations as long as we use far fewer of them. But nonetheless, its values are a deterministic sequence.
"Seeding" a PRNG is basically selecting what point in the deterministic sequence to start at. The algorithm will take the number passed as the seed and compute (in some algorithm-specific way) where to start in the sequence. The actual value of the seed is irrelevant--the algorithm should not depend on it in any way.
But, although the seed value itself does not directly participate in the PRNG algorithm, it does uniquely identify the starting point in the sequence, so if you give a particular seed and then generate a sequence of values, seeding again with the same value should cause the PRNG to generate the same sequence of values.
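In Python, for instance, the same idea as awk's srand looks like this (re-seeding with the same value replays the identical sequence):

```python
import random

random.seed(5)
first = [random.random() for _ in range(3)]

random.seed(5)  # re-seed with the same value...
second = [random.random() for _ in range(3)]
# ...and the generator replays the identical sequence: first == second
```

This is exactly why your awk one-liner prints the same numbers every time you run it with srand(5).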
I would like to ask: is it possible to run a GA with different seeds for generating the initial solution, and then analyze the results? At the beginning of applying a GA, you have to produce a population of solutions.
For example, you run the genetic algorithm using seed "12345" to generate the initial solution, populate a list of random solutions from it, and continue applying the GA steps to solve the problem.
Then you run the GA with another seed, for example "5678", to generate the initial solution, populate a list of random solutions from it, and continue applying the GA steps.
That means the populated list in the first run may contain the initial solution that was generated in the second run.
My question is: is there any way I can use a GA with different seeds to make comparisons and analysis? If not, how can I compare and make an analysis; should I only use a different instance file for the problem?
To compare stochastic algorithms, you typically run each of them multiple times with different random seeds; the output you obtain is then a sample of a random variable. You can assess whether one algorithm is better than another by performing a statistical hypothesis test (ANOVA or Kruskal-Wallis for multiple comparisons, t-test or Mann-Whitney U test for pairwise comparisons) on the resulting samples. If the p-value obtained in these tests is below your desired threshold (typically 0.05, though for more rigorous claims you would set it lower, e.g. 0.01), you reject the null hypothesis H0 that the results are equal, e.g. with respect to their means. You then conclude that the results differ and, further, that the one with the better average performance is the better algorithm to choose (if you're interested in average performance; "better" usually has many dimensions).
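As a self-contained sketch of the workflow (using a permutation test, a nonparametric relative of the tests named above, so no external statistics library is needed; the fitness numbers are made up for illustration):

```python
import random
import statistics

def permutation_test(a, b, n_perms=10000, seed=0):
    """Two-sided p-value for H0: samples a and b come from the same
    distribution, using |difference of means| as the test statistic."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # random relabeling under H0
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_perms

# e.g. best fitness from 10 GA runs per configuration, one seed per run
alg1 = [12.1, 11.8, 12.5, 11.9, 12.3, 12.0, 12.2, 11.7, 12.4, 12.1]
alg2 = [13.0, 12.8, 13.3, 12.9, 13.1, 13.2, 12.7, 13.4, 12.6, 13.0]
p = permutation_test(alg1, alg2)  # small p => reject H0
```

If p falls below your chosen threshold, you treat the performance difference between the two configurations as statistically significant.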
One thing made me wonder in your comments:
If I run the GA algorithm multiple times with the same seed for initial solution, the result will be completely different
I think you have made some error in your code. You need to use the same random object throughout any random decision made inside your algorithm in order to obtain exactly the same result. Somewhere in your code you probably use a new Random() instead of the one that you obtained initially with the given seed. Another reason could be that you use parallel processing of parts that draw random numbers. You can never guarantee that your threads are always executed in the same order, so one time thread 1 gets the first random number and thread 2 the second one, another time thread 2 executes first and gets the first number.
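A Python illustration of that difference (the same principle applies to Java's Random or Xojo's generator; the GA function names here are hypothetical stand-ins):

```python
import random

def mutate(rng):
    # stand-in for any random decision made inside the GA
    return rng.randint(0, 1000)

def ga_run_correct(seed):
    rng = random.Random(seed)  # ONE generator used for the whole run
    return [mutate(rng) for _ in range(5)]

def ga_run_buggy(seed):
    random.Random(seed)  # seeded generator created and thrown away...
    return [mutate(random.Random()) for _ in range(5)]  # ...fresh unseeded ones used instead

# Two correct runs with the same seed agree; the buggy version
# (almost surely) produces different results on every invocation.
```

Grepping your code for every place a generator is constructed is usually the fastest way to find this kind of bug.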
I've got a very large range/set of numbers, (1..1236401668096), that I would basically like to 'shuffle', i.e. randomly traverse without revisiting the same number. I will be running a Web service, and each time a request comes in it will increment a counter and pull the next 'shuffled' number from the range. The algorithm will have to accommodate for the server going offline, being able to restart traversal using the persisted value of the counter (something like how you can seed a pseudo-random number generator, and get the same pseudo-random number given the seed and which iteration you are on).
I'm wondering if such an algorithm exists or is feasible. I've seen the Fisher-Yates Shuffle, but the 1st step is to "Write down the numbers from 1 to N", which would take terabytes of storage for my entire range. Generating a pseudo-random number for each request might work for a while, but as the database/tree gets full, collisions will become more common and could degrade performance (already a 0.08% chance of collision after 1 billion hits according to my calculation). Is there a more ideal solution for my scenario, or is this just a pipe dream?
The reason for the shuffling is that being able to correctly guess the next number in the sequence could lead to a minor DOS vulnerability in my app, but also because the presentation layer will look much nicer with a wider number distribution (I'd rather not go into details about exactly what the app does). At this point I'm considering just using a PRNG and dealing with collisions or shuffling range slices (starting with (1..10000000).to_a.shuffle, then, (10000001, 20000000).to_a.shuffle, etc. as each range's numbers start to run out).
Any mathemagicians out there have any better ideas/suggestions?
Concatenate a PRNG or LFSR sequence with /dev/random bits
There are several algorithms that can generate pseudo-random numbers with arbitrarily large and known periods. The two obvious candidates are the LCPRNG (LCG) and the LFSR, but there are more algorithms such as the Mersenne Twister.
The period of these generators can be easily constructed to fit your requirements and then you simply won't have collisions.
You could deal with the predictable behavior of PRNGs and LFSRs by appending 10, 20, or 30 bits of cryptographically hashed entropy from an interface like /dev/random. Because the deterministic part of your number is known to be unique, it makes no difference if you ever repeat the truly random part of it.
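A sketch of the LCG approach in Python: pick a power-of-two modulus at least as large as the range, choose constants satisfying the Hull-Dobell conditions (c odd, a ≡ 1 mod 4) so the period is the full modulus, and skip past outputs that fall outside the range ("cycle walking"). Only the current state needs to be persisted to resume after a restart. Shown with a small modulus so the full period is checkable; for the real range you'd use m = 2^41 and harder-to-guess constants.

```python
def make_lcg_sequencer(n, m=256, a=5, c=3, state=0):
    """Visit every integer in [1, n] exactly once, in a scrambled order.
    m must be a power of two >= n, with c odd and a % 4 == 1 (Hull-Dobell),
    which gives the LCG a full period of m. The current `state` is all
    that needs persisting in order to resume after a server restart."""
    assert m >= n and c % 2 == 1 and a % 4 == 1

    def next_value():
        nonlocal state
        while True:
            state = (a * state + c) % m
            if 1 <= state <= n:  # cycle-walk past out-of-range states
                return state

    return next_value

nxt = make_lcg_sequencer(200)
seq = [nxt() for _ in range(200)]  # every value in 1..200, exactly once
```

Note the DOS concern still applies to a bare LCG, which is predictable; that is exactly where the concatenated /dev/random bits from the answer above come in.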
Divide and conquer? Break the range down into manageable chunks and shuffle them. You could divide the number range e.g. by their value modulo n; each group is easy to construct and quite small, depending on n. Once a group is exhausted, you can move on to the next one.
For example if you choose an n of 1000, you create 1000 different groups. Pick a random number between 1 and 1000 (let's call this x) and shuffle the numbers whose value modulo 1000 equals x. Once you have exhausted that range, you can choose a new random number between 1 and 1000 (without x obviously) to get the next subset to shuffle. It shouldn't exactly be challenging to keep track of which numbers of the 1..1000 range have already been used, so you'd just need a repeatable shuffle algorithm for the numbers in the subset (e.g. Fisher-Yates on their "indices").
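A Python sketch of the modulo-grouping idea (shown with n = 10 groups over a range of 1..100 so it stays readable; scale N and n up for the real range):

```python
import random

N, n = 100, 10            # full range 1..N, split into n residue groups
rng = random.Random(2024)

group_order = list(range(n))  # which residue class to exhaust next
rng.shuffle(group_order)

def numbers_in_group(x):
    # All numbers in 1..N whose value modulo n equals x.
    return [v for v in range(1, N + 1) if v % n == x]

sequence = []
for x in group_order:     # exhaust one shuffled group before starting the next
    group = numbers_in_group(x)
    rng.shuffle(group)    # Fisher-Yates shuffle under the hood
    sequence.extend(group)
```

Only the current group (N/n numbers), the group order, and a position counter need to be persisted, rather than the whole range.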
I guess the best option is to use a GUID/UUID. They are made for this type of thing, and it shouldn't be hard to find an existing implementation to suit your needs.
While collisions are theoretically possible, they are extremely unlikely. To quote Wikipedia:
The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs