1024-bit pseudo-random generator in Verilog for FPGA

I want to generate random vectors of length 1024 in Verilog. I have looked at certain implementations like Tausworthe generators and Mersenne Twisters.
Most Mersenne Twisters have 32-bit or 64-bit outputs. I want to simulate an error pattern of 1024 bits where each bit is set with some probability p. So I generate a 32-bit random number (uniformly distributed) using a Mersenne Twister. Since I have 32-bit random numbers, each number is in the range 0 to 2^32 - 1. I then set the corresponding bit of my 1024-bit vector to 1 if the generated number is less than p*(2^32 - 1), and to 0 otherwise. Basically, each 32-bit number is used to generate one bit of the 1024-bit vector according to the probability distribution.
The above method implies that I need 1024 clock cycles to generate each 1024-bit vector. Is there any other way which allows me to do this more quickly? I understand that I could use several instances of the Mersenne Twister in parallel using different seed values, but I was afraid that those numbers would not be sufficiently random and that there would be collisions. Is there something that I am doing wrong or something that I am missing? I would really appreciate your help.
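For concreteness, the per-bit mapping described above looks something like this in Verilog (a sketch only: mt32 stands in for whatever 32-bit PRNG core is used, and its port names are assumptions):

    // Sketch: one error bit per clock from a 32-bit uniform word.
    // THRESH should be set to round(p * (2^32 - 1)); the value below
    // corresponds to p ~ 0.05.
    module error_pattern #(
        parameter [31:0] THRESH = 32'h0CCC_CCCD
    ) (
        input  wire          clk,
        input  wire          rst,
        output reg  [1023:0] pattern,
        output reg           done
    );
        wire [31:0] rand32;
        reg  [10:0] count;

        mt32 u_rng (.clk(clk), .rst(rst), .rand_out(rand32)); // hypothetical core

        always @(posedge clk) begin
            if (rst) begin
                count   <= 11'd0;
                done    <= 1'b0;
                pattern <= 1024'd0;
            end else if (!done) begin
                // Each bit is 1 with probability ~THRESH / 2^32.
                pattern <= {pattern[1022:0], rand32 < THRESH};
                count   <= count + 11'd1;
                if (count == 11'd1023)
                    done <= 1'b1;
            end
        end
    endmodule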

Okay,
So I read a bit about Mersenne Twisters in general on Wikipedia. I admit I didn't outright get all of it, but I got this: given a seed value (used to initialise the state array), the module generates 32-bit random numbers.
Now, from your description above, it takes one cycle to compute one random number.
So your problem basically boils down to its mathematics rather than being about Verilog as such.
I'll try to explain the math of it as best I can.
You have a 32-bit uniformly distributed random number. So the probability of any one bit being high or low is exactly (well, close to, because pseudo-random) 0.5.
Let's forget that this is a pseudo-random generator, because that is the best you are going to get (so let's consider this our ideal).
Even if we generate 5 numbers one after the other, each one is still uniformly distributed over its range. So if we concatenate these five numbers, we get a 160-bit completely random number.
If it's still not clear, consider it this way.
I'm gonna break the problem down. Let's say we have a 4-bit random number generator (RNG), and we require 16 bit random numbers.
Each output of the RNG would be a hex digit with a uniform probability distribution. So the probability of getting some particular digit (say... A) is 1/16. Now I want to make a 4 digit Hex number (say... 0xA019).
Probability of getting A as the Most Significant digit = 1/16
Probability of getting 0 as digit number 2 = 1/16
Probability of getting 1 as digit number 3 = 1/16
Probability of getting 9 as the Least Significant digit = 1/16
So the probability of getting 0xA019 = 1/(2^16). In fact, the probability of getting any four-digit hex number would be exactly the same. Now, extend the same logic to base-2^32 (where each "digit" is one 32-bit output of the twister) with 32-digit numbers as the required output, and you have your solution.
So, we see, we could do with just 32 repetitions of the Mersenne Twister to get the 1024-bit output (that would take 32 cycles, still kinda slow). What you could also do is synthesise 32 twisters in parallel (that would give you the output in one stroke, but would be very heavy on the FPGA in terms of area and power).
The best way to go about this would be to try for some middle ground (maybe 4 parallel twisters running for 8 cycles), as sketched below. This would really be a question of the end application of the module and the power and timing constraints that you need for that application.
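Something like this sketch: four PRNG instances fill a 1024-bit register 128 bits per cycle, so a full vector takes 8 cycles (mt32 and its SEED parameter are stand-ins for your twister core, so the interface is an assumption; for the biased error pattern, each 32-bit word would instead be reduced to one bit by the threshold compare from the question):

    module rng1024 (
        input  wire          clk,
        input  wire          rst,
        output reg  [1023:0] vec,
        output reg           done
    );
        wire [31:0] w [0:3];
        reg  [2:0]  cycle;

        genvar i;
        generate
            for (i = 0; i < 4; i = i + 1) begin : g_rng
                mt32 #(.SEED(32'hDEAD_0000 + i))   // hypothetical seed parameter
                    u_rng (.clk(clk), .rst(rst), .rand_out(w[i]));
            end
        endgenerate

        always @(posedge clk) begin
            if (rst) begin
                cycle <= 3'd0;
                done  <= 1'b0;
                vec   <= 1024'd0;
            end else if (!done) begin
                // Shift in 4 x 32 = 128 fresh bits per clock.
                vec   <= {vec[895:0], w[3], w[2], w[1], w[0]};
                cycle <= cycle + 3'd1;
                if (cycle == 3'd7)
                    done <= 1'b1;
            end
        end
    endmodule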
As for giving different seed values: most PRNGs have a provision for input seeds just so you can vary the generated sequence, and from what I read on the Mersenne Twister, it is the same case.
Hope that answers your question.

Related

Shuffle sequential numbers without a buffer

I am looking for a shuffle algorithm to shuffle a set of sequential numbers without buffering. Another way to state this is that I’m looking for a random sequence of unique numbers that have a given period.
Your typical Fisher–Yates shuffle needs to hold all of the elements it is going to shuffle, so that isn't going to work.
A Linear-Feedback Shift Register (LFSR) does what I want, but only works for periods that are powers-of-two less two. Here is an example of using a 4-bit LFSR to shuffle the numbers 1-14:
Input:   1  2  3  4  5  6  7  8  9 10 11 12 13 14
Output:  8 12 14  7  4 10  5 11  6  3  2  1  9 13
The first row is the input, and the second row the output. What's nice is that the state is very small: just the current index. You can start at any index and get a different set of numbers (starting at 1 yields: 8, 12, 14; starting at 9: 6, 3, 2), although the sequence is always the same (5 is always followed by 11). If I want a different sequence, I can pick a different generator polynomial.
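A single LFSR step is also tiny in hardware, which is part of the appeal. A minimal Verilog sketch of one step of a maximal-length 4-bit Fibonacci LFSR (the tap choice x^4 + x^3 + 1 here is mine; it visits all 15 non-zero states in a fixed order, and is not necessarily the polynomial behind the table above):

    module lfsr4_step (
        input  wire [3:0] state,
        output wire [3:0] next
    );
        // Shift left, feed back the XOR of the top two bits.
        assign next = {state[2:0], state[3] ^ state[2]};
    endmodule

Picking a different generator polynomial just means picking a different feedback expression, which changes the visiting order.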
The limitations of the LFSR are that the periods are always a power of two less two (the min and max are always the same, thus unshuffled) and there are not enough generator polynomials to allow every possible random sequence.
A block cipher algorithm would work. Every key produces a uniquely shuffled set of numbers. However, all block ciphers (that I know about) have power-of-two block sizes, and usually a fixed or limited set of block sizes. A block cipher with an arbitrary non-binary block size would be perfect, if such a thing exists.
There are a couple of projects I have that could benefit from such an algorithm. One is for small embedded micros that need to produce a shuffled sequence of numbers with a period larger than the memory they have available (think Arduino Uno needing to shuffle 1 to 100,000).
Does such an algorithm exist? If not, what things might I search for to help me develop such an algorithm? Or is this simply not possible?
Edit 2022-01-30
I have received a lot of good feedback and I need to better explain what I am searching for.
In addition to the Arduino example, where memory is an issue, there is also the shuffle of a large number of records (billions to trillions). The desire is to have a shuffle applied to these records without needing a buffer to hold the shuffle order array, or the time needed to build that array.
I do not need an algorithm that could produce every possible permutation, but a large number of permutations. Something like a typical block cipher in counter mode where each key produces a unique sequence of values.
A Linear Congruential Generator with coefficients chosen to produce the desired sequence period will only produce a single sequence. This is the same problem as with a Linear-Feedback Shift Register.
Format-Preserving Encryption (FPE), such as AES FFX, shows promise and is where I am currently focusing my attention. Additional feedback welcome.
It is certainly not possible to produce an algorithm which could potentially generate every possible sequence of length N with less than N(log2(N) - 1.45) bits of state, because there are N! possible sequences and each state can generate exactly one sequence. If your hypothetical Arduino application could produce every possible sequence of 100,000 numbers, it would require at least 1,516,705 bits of state, a bit more than 185 KiB, which is probably more memory than you want to devote to the problem [Note 1].
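For the curious, that bound is just Stirling's approximation taken in base 2:

    log2(N!) ~ N*log2(N) - N*log2(e) = N*(log2(N) - 1.4427)

For N = 100,000 this gives 100,000 * (16.6096 - 1.4427) ~ 1,516,690 bits, and the lower-order Stirling terms bring that up to about the 1,516,705 quoted above.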
That's also a lot more memory than you would need for the shuffle buffer; that's because the PRNG driving the shuffle algorithm also doesn't have enough state to come close to being able to generate every possible sequence. It can't generate more different sequences than the number of different possible states that it has.
So you have to make some compromise :-)
One simple algorithm is to start with some parametrisable generator which can produce non-repeating sequences for a large variety of block sizes. Then you choose a block size which is at least as large as your target range but not "too much larger"; say, less than twice as large. Then you select a subrange of the block size and start generating numbers. If the generated number is inside the subrange, you return its offset; if not, you throw it away and generate another number. If the generator's range is less than twice the desired range, then you will throw away less than half of the generated values, and producing the next element in the sequence will be amortised O(1). In theory it might take a long time to generate an individual value, but that's not very likely, and if you use a not-very-good PRNG like a linear congruential generator, you can make it very unlikely indeed by restricting the possible generator parameters.
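A minimal Verilog sketch of the rejection wrapper, assuming a hypothetical block_gen module that steps through a non-repeating sequence of period BLOCK once per clock (the parameter values below are just the Arduino-style example):

    module subrange_shuffle #(
        parameter BLOCK = 131071,  // generator period, 2^17 - 1 here
        parameter BASE  = 1000,    // start of the chosen subrange
        parameter RANGE = 100000   // target sequence length
    ) (
        input  wire        clk,
        input  wire        rst,
        output wire [16:0] value,  // offset within the target range
        output wire        valid   // high when `value` is an accepted element
    );
        wire [16:0] raw;

        block_gen #(.PERIOD(BLOCK)) u_gen          // hypothetical generator
            (.clk(clk), .rst(rst), .out(raw));

        // Accept only values inside [BASE, BASE+RANGE); BLOCK < 2*RANGE
        // keeps the rejection rate under 50%.
        assign valid = (raw >= BASE) && (raw < BASE + RANGE);
        assign value = raw - BASE;
    endmodule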
For LCGs you have a couple of possibilities. You could use a power-of-two modulus, with an odd offset and a multiplier which is 5 mod 8 (and not too far from the square root of the block size), or you could use a prime modulus with almost arbitrary offset and multiplier. Using a prime modulus is computationally more expensive but the deficiencies of LCG are less apparent. Since you don't need to handle arbitrary primes, you can preselect a geometrically-spaced sample and compute the efficient division-by-multiplication algorithm for each one.
Since you're free to use any subrange of the generator's range, you have an additional potential parameter: the offset of the start of the subrange. (Or even offsets, since the subrange doesn't need to be contiguous.) You can also increase the apparent randomness by doing any bijective transformation (XOR/rotates are good, if you're using a power-of-two block size.)
Depending on your application, there are known algorithms to produce block ciphers for subword bit lengths [Note 2], which gives you another possible way to increase randomness and/or add some more bits to the generator state.
Notes
The approximation for the minimum number of bits of state comes directly from Stirling's approximation for N!, but I computed the exact number of bits using the commonly available lgamma function.
With about 30 seconds of googling, I found this paper on researchgate.net; I'm far from knowledgeable enough in crypto to offer an opinion, but it looks credible; also, there are references to other algorithms in its footnotes.

How to create a uniform distribution over non-power-of-2 elements from n bits?

Assuming I can generate random bytes of data, how can I use that to choose an element out of an array of n elements?
If I have 256 elements I can generate 1 byte of entropy (8 bits), and then use that to pick my element simply by converting it to an integer.
If I have 2 elements I can generate 1 byte, discard 7 bits and use the remaining bit to select my element.
But what if I have 3 elements? 1 bit is too few and 2 is too many. How would I randomly select 1 of the 3 elements with equal probability?
Here is a survey of algorithms to generate uniform random integers from random bits.
J. Lumbroso's Fast Dice Roller in "Optimal Discrete Uniform Generation from Coin Flips, and Applications", 2013. See also the implementation at the end of this answer.
The Math Forum, 2004. See also "Bit Recycling for Scaling Random Number Generators".
D. Lemire, "A Fast Alternative to the Modulo Reduction".
M. O'Neill, "Efficiently Generating a Number in a Range".
Some of these algorithms are "constant-time", others are unbiased, and still others are "optimal" in terms of the number of random bits they use on average. In the rest of this answer we will assume we have a "true" random generator that can produce unbiased and independent random bits.
For further discussion, see the following answer of mine:
How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?
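A Verilog rendering of Lumbroso's Fast Dice Roller mentioned above (a sketch only; it assumes one fresh unbiased bit arrives on rand_bit every clock, and that N < 2^WIDTH):

    module fast_dice_roller #(
        parameter WIDTH = 8,
        parameter [WIDTH-1:0] N = 6      // e.g. a six-sided die
    ) (
        input  wire             clk,
        input  wire             rst,
        input  wire             rand_bit,
        output reg  [WIDTH-1:0] value,   // uniform in [0, N) when valid
        output reg              valid    // pulses for one clock per draw
    );
        reg  [WIDTH-1:0] x, y;               // invariants: x < N, y < x
        wire [WIDTH:0]   x2 = {x, 1'b0};     // x2 = 2*x
        wire [WIDTH:0]   y2 = {y, rand_bit}; // y2 = 2*y + bit

        always @(posedge clk) begin
            if (rst) begin
                x <= 1;
                y <= 0;
                valid <= 1'b0;
            end else begin
                valid <= 1'b0;
                if (x2 >= {1'b0, N}) begin
                    if (y2 < {1'b0, N}) begin
                        value <= y2[WIDTH-1:0];  // accept the draw
                        valid <= 1'b1;
                        x <= 1;                  // restart for the next draw
                        y <= 0;
                    end else begin
                        x <= x2 - {1'b0, N};     // reject, fold back into range
                        y <= y2 - {1'b0, N};
                    end
                end else begin
                    x <= x2[WIDTH-1:0];
                    y <= y2[WIDTH-1:0];
                end
            end
        end
    endmodule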
You can generate the proper distribution by simply truncating to the necessary range. If you have N elements, then generate K = ceiling(log2(N)) random bits. Doing so is inefficient, but it still works as long as the K bits are generated randomly.
In your example where you have N=3, you need at least K=2 bits, which gives the four outcomes [00, 01, 10, 11] with equal probability. To map this into the proper range, just ignore one of the outcomes (such as the last one) and redraw whenever it comes up. Think of this as creating a new joint probability distribution, p(x_1, x_2), over the two bits where p(x_1=1, x_2=1) = 0, while each of the others becomes 1/3 due to renormalization (i.e., (1/4)/(3/4) = 1/3).
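In hardware terms the N=3 case is tiny. A sketch, assuming rand2 delivers two fresh unbiased bits every clock:

    module pick3 (
        input  wire       clk,
        input  wire [1:0] rand2,
        output reg  [1:0] sel,    // uniform over 0, 1, 2 when valid
        output reg        valid
    );
        always @(posedge clk) begin
            // Keep 00/01/10; reject 11 and implicitly redraw next clock.
            valid <= (rand2 != 2'b11);
            sel   <= rand2;
        end
    endmodule

Each accepted outcome lands with probability (1/4)/(3/4) = 1/3, exactly the renormalization described above.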

Multiple independent pseudo random number generation in hardware (Verilog or VHDL)

I need pseudo random numbers generated for hardware (either in VHDL or Verilog) that meet the following criteria.
- Each number is 1-bit (doesn't have to be, but that would complicate things more)
- The N pseudo random numbers cannot be correlated with each other.
- The N pseudo random numbers need to be generated at the same time (every clock edge).
I understand that the following will not work :
- Using N different seeds for a given polynomial - they will simply be shifted versions of each other
- Using N different polynomials for a given length LFSR - not practical, since N can be as large as 64, and I don't know what length LFSR would give 64 different tap combinations; too huge, if possible at all.
If using LFSR, the lengths do not need to be identical. For a small N, say 4, I thought about using 4 different prime number lengths (to minimize repeatability), e.g., 15, 17, 19, 23, but again, for a large N, it gets very messy. Let's say, something on the order of 2^16 gives sufficient length for an LFSR.
Is there an elegant way of handling this problem? By elegant, I mean not having to code N different unique modules (15, 17, 19, 23 above as an example). Using N different instances of Mersenne Twister, with different seeds? I do not have unlimited amount of hardware resources (FF, LUT, BRAM), but for the sake of this discussion it's probably best to ignore resource issues.
Thank you in advance.
One option is to use a cryptographic hash; these are typically wide (64-256 bits), and good hashes have the property that a single-bit input change propagates to all output bits in an unpredictable fashion. Run an incrementing counter into the hash, and start the counter at a random value.
The GHASH used in AES-GCM is hardware-friendly and can generate new output values every clock.
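A structural sketch of the counter-into-hash idea (ghash_core is a stand-in for whatever hash or cipher core is chosen; its name, ports, and single-cycle behaviour are all assumptions):

    module hash_prng #(
        parameter N = 64              // uncorrelated bits needed per clock
    ) (
        input  wire         clk,
        input  wire         rst,
        input  wire [127:0] seed,     // random starting counter value
        output wire [N-1:0] rand_bits
    );
        reg  [127:0] counter;
        wire [127:0] digest;

        always @(posedge clk) begin
            if (rst) counter <= seed;
            else     counter <= counter + 128'd1;
        end

        ghash_core u_hash (.clk(clk), .data_in(counter), .digest(digest));

        assign rand_bits = digest[N-1:0];
    endmodule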

About Mersenne Twister generator's period

I have read that the Mersenne Twister generator has a period of 2¹⁹⁹³⁷ - 1, but I'm confused about how that can be possible. I see this implementation of the Mersenne Twister algorithm, and in the first comment it clearly says that it produces values in the range 0 to 2³² - 1. Therefore, after it has produced 2³² - 1 different random numbers, it would necessarily come back to the starting point (the seed), so the period can be at most 2³² - 1.
Also (and tell me if I'm wrong, please), a computer can't hold the number (2¹⁹⁹³⁷ - 1) ~ 4.3×10⁶⁰⁰¹, at least not in a single block of memory. What am I missing here?
Your confusion stems from thinking that the output number and the internal state of a PRNG have to be the same thing.
Some very old PRNGs used to do this, such as Linear Congruential Generators. In those generators, the current output was fed back into the generator for the next step.
However, most PRNGs, including the Mersenne Twister, work from a much larger state, which the generator updates and uses to produce each 32-bit number (it doesn't really matter which order this is done in for the purposes of this answer).
In fact, the Mersenne Twister does indeed store 624 32-bit values, and that is 19,968 bits, enough state to support the very long period that you are wondering about. The values are handled separately (as unsigned 32-bit integers), not treated as one giant number in a single-step calculation. The 32-bit random number you get from the output is derived from this state, but does not by itself determine the next number.
You are wrong at:
"Therefore, after it has produced 2³² - 1 different random numbers, it will necessarily come back to the starting point (the seed)..."
It's true that the next number can be the same as one of the numbers already generated, but the internal state of the random number generator will not be the same. (No one said that every number in the range 0 to 2³² - 1 will be generated within the first 2³² - 1 steps.) So there's no bijection between the random number generated and the internal state of the generator. The random number can be calculated from the state, but you don't even have to: you can step the internal state without producing the random number at all.
And of course, the computer doesn't store the whole number sequence. It calculates each random number from the internal state. Consider a number sequence like 1, -1, 1, -1, ...: you can generate the Nth number without storing all N elements.
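A toy sketch makes the state-versus-output distinction concrete: below, the state is 8 bits and the output is a single bit per clock, yet the output stream has period 255, far more than the 2 values a one-bit output can take. The Mersenne Twister is the same idea scaled up to 19,968 bits of state and 32 bits of output. (The taps here, x^8 + x^6 + x^5 + x^4 + 1, are a common maximal-length choice.)

    module lfsr8 (
        input  wire clk,
        input  wire rst,
        output wire out_bit
    );
        reg [7:0] state;
        always @(posedge clk) begin
            if (rst) state <= 8'h01;  // any non-zero seed
            else     state <= {state[6:0],
                               state[7] ^ state[5] ^ state[4] ^ state[3]};
        end
        assign out_bit = state[0];    // 1-bit output from 8 bits of state
    endmodule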

What division algorithm should be used for dividing small integers in hardware?

I need to multiply an integer ranging from 0-1023 by 1023 and divide the result by a number ranging from 1-1023 in hardware (Verilog/FPGA implementation). The multiplication is straightforward, since I can probably get away with just shifting by 10 bits (and subtracting the input once, since 1024x - x = 1023x). The division is a little more interesting, though. Area/power aren't really critical to me (I'm in an FPGA, so the resources are already there). Latency (within reason) isn't a big deal so long as I can pipeline the design. There are obviously several choices with different trade-offs, but I'm wondering if there's an "obvious" or "no-brainer" algorithm for a situation like this. Given the limited range of operands and the abundance of resources that I have (BRAM etc.), I'm wondering if there isn't something obvious to do.
If you can work with fixed-point precision rather than integers, it may be possible to change:
divide the result by a number ranging from 1-1023
to multiplication by a number ranging from 1 down to 1/1023, i.e., pre-compute the divide and store the result as the coefficient for the multiply.
If you can pre-compute everything, and you've got a spare 20x20 multiplier, and some way to store your pre-computed number, then go for Morgan's suggestion. You need to precompute a 20-bit multiplicand (10b quotient, 10b remainder), and multiply by your first 10b number, and take the bottom 30b of the 40b result.
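A sketch of a variant of that idea, using a single 2^20 scale factor instead of the quotient/remainder packing described above (the table contents and scaling are my assumptions; check the rounding error against your accuracy requirement before trusting it):

    module mul_div1023 (
        input  wire        clk,
        input  wire [9:0]  a,   // 0..1023, the value to be scaled by 1023
        input  wire [9:0]  d,   // divisor, 1..1023
        output reg  [19:0] q    // approximately (a * 1023) / d
    );
        reg [29:0] recip [0:1023];   // recip[d] = round((1023 << 20) / d)
        reg [29:0] r;
        reg [9:0]  a_q;

        integer i;
        initial begin                // constant init; maps to BRAM contents
            recip[0] = 0;            // d = 0 never occurs
            for (i = 1; i < 1024; i = i + 1)
                recip[i] = (1023 * 1048576 + i / 2) / i;
        end

        wire [39:0] prod = a_q * r;  // full 40-bit product

        always @(posedge clk) begin
            r   <= recip[d];         // stage 1: reciprocal lookup
            a_q <= a;
            q   <= prod[39:20];      // stage 2: multiply, drop the scale
        end
    endmodule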
Otherwise, the no-brainer is non-restoring division, since you say that latency isn't important (there's lots of stuff on the web, most of it incomprehensible). You have a 20-bit numerator (the result of your ×1023 multiplication) and a 10-bit denominator. This gives a 20b quotient and a 10b remainder (i.e., 20 bits for the integer part of the answer and 10 bits for the fractional part, giving a 30b answer).
The actual hardware is pretty trivial: an 11b adder/subtractor, a 31b shift register, and a 10b or 11b register to store the divisor. You also need a small FSM to control it (2b). You have to do a compare, add or subtract, and shift in every clock cycle, and you get the answer out in 21 cycles. I think. :)
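For reference, a minimal sketch of the closely related restoring (shift-subtract) divider; it's a little easier to follow than the non-restoring form, uses the same class of resources, and likewise produces one quotient bit per clock (about 21 cycles for a 20-bit numerator, matching the estimate above):

    module serial_div (
        input  wire        clk,
        input  wire        rst,
        input  wire        start,   // pulse with num/den valid
        input  wire [19:0] num,
        input  wire [9:0]  den,     // 1..1023; zero is never used
        output wire [19:0] quot,
        output wire [9:0]  rem,
        output reg         done     // pulses when quot/rem are valid
    );
        reg [10:0] acc;    // partial remainder, one bit wider than den
        reg [19:0] q;      // numerator shifts out, quotient shifts in
        reg [4:0]  count;
        reg        busy;

        assign quot = q;
        assign rem  = acc[9:0];

        always @(posedge clk) begin
            if (rst) begin
                busy <= 1'b0;
                done <= 1'b0;
            end else begin
                done <= 1'b0;
                if (start) begin
                    acc   <= 11'd0;
                    q     <= num;
                    count <= 5'd20;
                    busy  <= 1'b1;
                end else if (busy) begin
                    // Classic long division: shift in the next numerator
                    // bit, subtract the divisor if it fits, and record
                    // the quotient bit.
                    if ({acc[9:0], q[19]} >= {1'b0, den}) begin
                        acc <= {acc[9:0], q[19]} - {1'b0, den};
                        q   <= {q[18:0], 1'b1};
                    end else begin
                        acc <= {acc[9:0], q[19]};
                        q   <= {q[18:0], 1'b0};
                    end
                    count <= count - 5'd1;
                    if (count == 5'd1) begin
                        busy <= 1'b0;
                        done <= 1'b1;
                    end
                end
            end
        end
    endmodule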
