Simple truly random number generator - random

There is a lot of research going on related to generating "truly" random numbers.
There is a very simple method, invented a long time ago.
The method is attributed to von Neumann [1].
In its simplest form it can be thought of as generating unbiased random bits from a biased source of 0s and 1s. Given that the probability of the sequence 01 is the same as that of 10, one can use 01 to represent a truly random "0" bit and 10 to represent a truly random "1" bit (the 00 and 11 combinations are simply discarded).
Pretty straightforward. Can anyone point out why such a method doesn't generate a random sequence (and thus solve the problem of generating random numbers on computers)?

Let me explain what "random" and "truly random" mean. "Random" simply means the numbers are identically distributed and chosen independently of everything else (that is, the numbers are i.i.d.). And "truly random" simply means i.i.d. and uniform (see also Frauchiger et al., 2013).
If the input bits are i.i.d. but biased, then the von Neumann method will remove this bias: the numbers remain i.i.d. but are now unbiased, namely, each number will be 0 or 1 with equal probability. In general, if the source numbers weren't i.i.d. (more specifically, exchangeable [Peres 1992]) to begin with, the von Neumann method won't make those numbers "truly random"; even von Neumann (1951) assumes "independence of successive tosses [of the coin]". The von Neumann method is one of a host of randomness extractors available (and I discuss some of them), and this discussion applies to those other extractors just as well as to the von Neumann method.
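As a minimal sketch (assuming Python, with a simulated biased-bit source standing in for whatever real i.i.d.-but-biased source you have):

    import random

    def biased_bit(p_one=0.7):
        # Stand-in for an i.i.d. but biased source: yields 1 with probability 0.7.
        return 1 if random.random() < p_one else 0

    def von_neumann_bits(n, source=biased_bit):
        # Take bits in pairs; 01 -> output 0, 10 -> output 1, discard 00 and 11.
        out = []
        while len(out) < n:
            a, b = source(), source()
            if a != b:
                out.append(a)   # a is 0 for the pair 01 and 1 for the pair 10
        return out

    print(von_neumann_bits(20))   # unbiased bits, provided the source really is i.i.d.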
In any case, the distinction between "pseudorandom" and "truly random" numbers is not what applications care about (and you didn't really specify what kind of application you have in mind). Instead, in general:
Security applications care whether the numbers are hard to guess; in this case, only a cryptographic RNG can achieve this requirement (even one that relies on a pseudorandom number generator). A Python example is the secrets module or random.SystemRandom.
Scientific simulations care whether the numbers behave like independent uniform random numbers, and often care whether the numbers are reproducible at a later time. A Python example is numpy.random.Generator.
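For illustration (a sketch, not a recommendation for any particular application), the two cases look like this in Python:

    # Security: hard-to-guess values from the OS cryptographic RNG.
    import secrets
    token = secrets.token_bytes(16)     # 16 unpredictable bytes
    pick = secrets.randbelow(100)       # unpredictable integer in [0, 100)

    # Simulation: statistically uniform and reproducible numbers.
    import numpy as np
    rng = np.random.default_rng(12345)  # same seed gives the same stream later
    samples = rng.random(5)             # five floats in [0, 1)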
REFERENCES:
Frauchiger, D., Renner, R., Troyer, M., "True randomness from realistic quantum devices", 2013.
von Neumann, J., "Various techniques used in connection with random digits", 1951.
Peres, Y., "Iterating von Neumann's procedure for extracting random bits", Annals of Statistics, 20(1), 1992, pp. 590-597.

Related

Is it acceptable to use each byte of a PRNG-generated number separately?

Say you have a non-cryptographically secure PRNG that generates 64-bit output.
Assuming that bytes are 8 bits, is it acceptable to use each byte of the 64-bit output as separate 8-bit random numbers or would that possibly break the randomness guarantees of a good PRNG? Or does it depend on the PRNG?
Because the PRNG is not cryptographically secure, the "randomness guarantee" I am worried about is not security, but whether the byte stream has the same guarantee of randomness, using the same definition of "randomness" that PRNG authors use, that the PRNG has with respect to its 64-bit output.
This should be quite safe with a CSPRNG. For comparison, it's like reading /dev/random byte by byte. With a good CSPRNG it is also perfectly acceptable to simply generate a 64-bit sample 8 times and pick 8 bits per sample (throwing away the other 56 bits).
With PRNGs that are not CSPRNGs you will have 'security' concerns with the raw output of the PRNG that outweigh whether or not you chop the output into byte-sized chunks.
In all cases it is vital to make sure the PRNG is seeded and periodically re-seeded correctly (so as to flush any possibly compromised internal state regularly). Security depends on the unpredictability of your internal state, which is ultimately driven by the quality of your seed input. One thing good CSPRNG implementations will do for you is to pessimistically estimate the amount of captured 'entropy' to safeguard the output from predictable internal state.
Note however that with 8 bits you only have 256 possible outputs in any case, so it becomes more a question of how you use this. For instance, if you do something like XOR-based encryption against the output of a PRNG (i.e. treating it as a one-time pad based on some pre-shared secret seed), then a known-plaintext attack may relatively easily reveal the contents of the internal state of the PRNG. That is another type of attack which good CSPRNG implementations are supposed to guard against by their design (using e.g. a computationally secure hash function).
EDIT to add: if you don't care about 'security' but only need the output to look random, then this should be quite safe -- in theory a good PRNG is just as likely to yield a 0 as a 1, and that should not vary between octets. So you expect a uniform distribution of possible output values. One thing you can do to verify whether chopping skews the distribution is to run a Monte Carlo simulation of some reasonably large size (e.g. 1M samples) and compare histograms with 256 bins for both the raw 64-bit output and the 8 * 8-bit output. You expect a roughly flat histogram in both cases if uniformity is preserved.
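A rough sketch of that Monte Carlo check, assuming Python and using random.getrandbits(64) as a stand-in for the 64-bit generator (bin the raw output by one byte, say the low byte, and compare against the histogram over all eight bytes):

    import random
    from collections import Counter

    N = 1_000_000
    raw_low_byte = Counter()   # 256-bin histogram of the low byte of each raw sample
    chopped = Counter()        # 256-bin histogram over all 8 bytes of each sample

    for _ in range(N):
        x = random.getrandbits(64)
        raw_low_byte[x & 0xFF] += 1
        for i in range(8):
            chopped[(x >> (8 * i)) & 0xFF] += 1

    # Both ratios should be close to 1 if chopping preserves uniformity.
    print(max(raw_low_byte.values()) / min(raw_low_byte.values()))
    print(max(chopped.values()) / min(chopped.values()))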
It depends on the generator and its parameterization. Quoting from the Wikipedia page for Linear Congruential Generators: "The low-order bits of LCGs when m is a power of 2 should never be relied on for any degree of randomness whatsoever. [...] any full-cycle LCG when m is a power of 2 will produce alternately odd and even results."
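You can see this directly with a small experiment; the parameters below are the Numerical Recipes LCG constants, used here only as an assumed example of an LCG with a power-of-two modulus:

    # LCG with modulus 2**32 and odd a and c: the lowest bit simply alternates.
    a, c, m = 1664525, 1013904223, 2**32
    x = 12345
    low_bits = []
    for _ in range(16):
        x = (a * x + c) % m
        low_bits.append(x & 1)
    print(low_bits)   # 0, 1, 0, 1, ... -- the low bits are far from random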

understanding seed of a ByteTensor in PyTorch

I understand that a seed is a number used to initialize a pseudo-random number generator. In PyTorch, the torch.get_rng_state documentation states: "Returns the random number generator state as a torch.ByteTensor." When I print it I get a 1-d tensor of size 5048 whose values are as shown below
tensor([ 80, 78, 248, ..., 0, 0, 0], dtype=torch.uint8)
Why does a seed have 5048 values, and how is this different from the usual seed we can get using torch.initial_seed?
It sounds like you're thinking of the seed and the state as equivalent. For older pseudo-random number generators (PRNGs) that was true, but more modern PRNGs tend to work as described here. (The answer in the link was written with respect to the Mersenne Twister, but the concepts apply equally to other generators.)
Why is it a good idea not to have a 32- or 64-bit state space and report the state as the generator's output? Because if you do that, as soon as you see any value repeat, the entire sequence will repeat. PRNGs were designed to be "full cycle," i.e., to iterate through the maximum number of values possible before repeating. This paper showed that the birthday problem could quickly (in O(sqrt(cycle length)) observations) identify such PRNGs as non-random. This meant, for instance, that with 32-bit integers you shouldn't use more than ~50000 values before a statistician could call you out with a better than 99% level of confidence. The solution, used by many modern PRNGs, is to have a larger state space and collapse it down to output a 32- or 64-bit result. Since multiple states can produce the same output, duplicates will occur in the output stream without the entire stream being replicated. It looks like that's what PyTorch is doing.
Given the larger state space, why allow seeding with a single integer? Convenience. For instance, the Mersenne Twister has a 19,937-bit state space, but most people don't want to enter that much info to kick-start it. You can if you want to, but most people use the front-end which populates the full state space from a single integer input.
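You can see the seed/state distinction directly in PyTorch (a quick sketch; the exact state size depends on the generator PyTorch uses):

    import torch

    torch.manual_seed(42)            # convenience front-end: one integer
    print(torch.initial_seed())      # 42 -- the integer you supplied
    state = torch.get_rng_state()    # the full internal state, a ByteTensor
    print(state.shape, state.dtype)  # thousands of bytes, torch.uint8

    # Restoring the saved state reproduces the stream exactly:
    a = torch.rand(3)
    torch.set_rng_state(state)
    b = torch.rand(3)
    print(torch.equal(a, b))         # True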

Compress Random 32-bit Integers: How close can we get to Shannon Entropy?

I've developed a lossless compression algorithm that compresses 32-bit integers (of unknown frequency/probability) to 31.95824 bits per integer (it works a lot better for smaller values, just as most compression algorithms do). Obviously it isn't possible to compress uniformly-distributed random data to become smaller than its uncompressed size.
Therefore my question is, which lossless compression algorithms get closest to the Shannon Entropy of 32 bits per integer for pseudorandom data, assuming 32-bit integers?
Essentially, I'm looking for a table which includes compression algorithms and their respective bits-per-integer value for positive, compressed, 32-bit integers.
When you say "it works a lot better for smaller values", I presume that you have a transformation from the 32-bit integer to a variable-bit-length representation that is optimized for some non-uniform expected distribution of values. Then that same transformation applied to a uniform distribution of 32-bit values will necessarily take more than 32 bits on average. How much more depends on how non-uniform a distribution you started with.
So the answer is, of course you can get to 32 bits exactly by doing nothing at all to the number. But then you are not optimized for the application implied by the non-uniform distribution you designed to.
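To make that concrete, here is a sketch under an assumed transformation: a LEB128-style varint spends fewer bytes on small values, so applied to uniformly random 32-bit integers it averages well over 32 bits:

    import random

    def varint_bits(n):
        # LEB128-style varint: 7 payload bits plus 1 continuation bit per byte.
        nbytes = 1
        while n >= 1 << (7 * nbytes):
            nbytes += 1
        return 8 * nbytes

    N = 100_000
    total = sum(varint_bits(random.getrandbits(32)) for _ in range(N))
    print(total / N)   # roughly 39.5 bits per integer, noticeably more than 32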
The identity function requires precisely 32 bits per 32 bit integer, which is pretty hard to beat. (There are many other length-preserving bijections, if you insist on changing the data stream.)
It's not obvious to me what other criteria you might be employing to recommend an algorithm which does worse than that. Perhaps you believe that the input stream is not truly a uniform sample; rather, it is restricted to (or significantly biased towards) a subset of the universe, but you do not know a priori what the subset is. In that case, the entropy of the stream is less than 32 bits per integer (if there is an upper bound on the size of the subset which is reasonably less than the size of the universe) and you might be able to actually compress the input stream.
It's worth noting that unless messages are fixed-length, the length of the message needs to be taken into account in the computation of entropy, both in the numerator and the denominator. For very long messages, that can mostly be ignored but if messages are short, the cost of message delimiters (or explicit length indicators) can be significant. (Otherwise, "compressing" to 103% of original size is a somewhat humptydumptyesque definition of "to compress".)
This is exactly what Quantile Compression (https://github.com/mwlon/quantile-compression/) was built to do: lossless compression of numbers drawn from a numerical distribution. I'm not aware of any other algorithms that do this. You can see its results vs. the theoretical optimum in the readme. It also works on floats and timestamps! I'm not sure what your distribution is, but real-world distributions often only take a few bits per number.
It works by encoding each number in the sequence as a Huffman code for a coarse numeric range and then an offset for the exact position within that range.
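A toy sketch of that range-plus-offset idea (this is not the real Quantile Compression implementation; it uses fixed-length range codes instead of Huffman codes and ignores the cost of transmitting the bin boundaries, just to show where the savings come from):

    import math, random

    def avg_encoded_bits(data, n_ranges=16):
        # Split the value range into n_ranges equal-count "quantile" bins,
        # then charge each number a fixed-length range code plus an offset.
        data_sorted = sorted(data)
        bounds = [data_sorted[i * len(data) // n_ranges] for i in range(n_ranges)]
        bounds.append(data_sorted[-1] + 1)
        range_code_bits = math.ceil(math.log2(n_ranges))
        total = 0
        for x in data:
            for i in range(n_ranges):
                if bounds[i] <= x < bounds[i + 1]:
                    width = bounds[i + 1] - bounds[i]
                    total += range_code_bits + math.ceil(math.log2(max(width, 2)))
                    break
        return total / len(data)

    # Skewed data compresses well; uniform 32-bit data stays at or just above 32 bits.
    skewed = [int(random.expovariate(1 / 1000)) for _ in range(10_000)]
    uniform = [random.getrandbits(32) for _ in range(10_000)]
    print(avg_encoded_bits(skewed), avg_encoded_bits(uniform))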

A PRNG that doesn't have a period of maximum length

Are there any 32-bit pseudorandom number generators (PRNG) that either have a greater or smaller period than (2^32 - 1)?
I would also appreciate a reliable source (book, research paper, etc.) that uses it, thanks!
There are many PRNGs with different periods. Period is bounded by the size of the state space the PRNG maintains, not by the number of bits per integer. Java, for instance, uses a linear congruential generator with 48 bits of state. The late great George Marsaglia created "The mother of all random number generators", which pooled the results of smaller generators with relatively prime cycle lengths to get a cycle length of about 2^250. As @Michael pointed out in a comment, Mersenne Twister has a whopping big period of 2^19937 - 1. Pierre L'Ecuyer is considered to be a world leader on this topic, and you can read what many consider to be a pretty definitive chapter on the topic of PRNGs from the Elsevier publication "Handbooks in Operations Research and Management Science: Simulation". (Many of his other publications may also be of interest to you, both directly and as evidence of his expertise in this area.)
Linear congruential generators typically have a period of (2^n) exactly. This is, obviously, greater than (2^n-1), and uses only n bits of state (n=32 in your case) -- but it's the absolute upper limit on a PRNG with a 32-bit state.
Multiply-with-carry generators can be arranged to have a period of any safe prime, and configurations can be chosen either to use 32-bit arithmetic or to use a 32-bit state buffer, whichever you require. For a 32-bit state configuration all possible periods will be less than (2^32-1).
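To check the full-period claim on a small scale, here is a sketch with an 8-bit state instead of 32 (a % 4 == 1 and c odd satisfy the Hull-Dobell conditions for a power-of-two modulus):

    # Tiny LCG: full period equal to the modulus, never larger than the state space.
    a, c, m = 5, 3, 2**8
    x0 = x = 1
    period = 0
    while True:
        x = (a * x + c) % m
        period += 1
        if x == x0:
            break
    print(period)   # 256 == 2**8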

How exactly does PC/Mac generates random numbers for either 0 or 1?

This question is NOT about how to use any language to generate a random number between any interval. It is about generating either 0 or 1.
I understand that many random generator algorithms manipulate the very basic random (0 or 1) function, take a seed from users, and use an algorithm to generate various random numbers as needed.
The question is: how does the CPU generate either 0 or 1? If I throw a coin, I can generate heads or tails. That's because I physically throw a coin and let nature decide. But how does the CPU do it? There must be an action that the CPU does (like throwing a coin) to get either 0 or 1 randomly, right?
Could anyone tell me about it?
Thanks
(This has several facets and thus several algorithms. Keep in mind that there are many different forms of randomness used for different purposes, but I understand your question in the way that you are interested in actual randomness used for cryptography.)
The fundamental problem here is that computers are (mostly) deterministic machines. Given the same input in the same state they always yield the same result. However, there are a few ways of actually gathering entropy:
User input. Since users bring outside input into the system, you can derive some bits from it. Similar to how you could use radioactive decay or line noise.
Network activity. Again, an outside source of stuff.
Generally interrupts (which kinda include the first two).
As alluded to in the first item, noise from peripherals, such as audio input or a webcam, can be used.
There is dedicated hardware that can generate a few hundred MiB of randomness per second. Usually they give you random numbers directly instead of their internal entropy, though.
How exactly you derive bits from that is up to you but you could use time between events, or actual content from the events, etc. – generally eliminating bias from entropy sources isn't easy or trivial and a lot of thought and algorithmic work goes into that (in the case of the aforementioned special hardware this is all done in hardware and the code using it doesn't need to care about it).
Once you have a pool of actually random bits you can just use them as random numbers (/dev/random on Linux does that). But this has downsides, since there is usually little actual entropy and possibly a higher demand for random numbers. So you can invent algorithms to “stretch” that initial randomness in a manner that makes it still impossible or at least very difficult to predict anything about following numbers (/dev/urandom on Linux or both /dev/random and /dev/urandom on FreeBSD do that). Fortuna and Yarrow are so-called cryptographically secure pseudo-random number generators and designed with that in mind. You still have a very good guarantee about the quality of random numbers you generate, but have many more before your entropy pool runs out.
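In practice you usually consume that stretched output through the operating system rather than touching the entropy pool yourself; for example, in Python:

    import os, secrets

    print(os.urandom(8))        # 8 bytes from the OS CSPRNG (backed by the kernel pool)
    print(secrets.randbits(1))  # a single hard-to-guess bit built on the same source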
In any case, the CPU itself cannot give you a random 0 or 1. There's a lot more involved and this usually includes the complete computer system or special hardware built for that purpose.
There is also a second class of computational randomness: Plain vanilla pseudo-random number generators (PRNGs). What I said earlier about determinism – this is the embodiment of it. Given the same so-called seed a PRNG will yield the exact same sequence of numbers every time¹. While this sounds idiotic it has practical benefits.
Suppose you run a simulation involving lots of random numbers, maybe to simulate interaction between molecules or atoms that involve certain probabilities and unpredictable behaviour. In science you want results anyone can independently verify, given the same setup and procedure (or, with computing, the same algorithms). If you used actual randomness the only option you have would be to save every single random number used to make sure others can replicate the results independently.
But with a PRNG all you need to save is the seed and remember what algorithm you used. Others can then get the exact same sequence of pseudo-random numbers independently. Very nice property to have :-)
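For example (a minimal sketch in Python), two generators built from the same seed produce identical sequences, which is exactly the reproducibility property described above:

    import random

    run1 = random.Random(2024)
    run2 = random.Random(2024)   # same seed, possibly on another machine, years later
    print([run1.random() for _ in range(3)])
    print([run2.random() for _ in range(3)])   # identical to the first list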
Footnotes
¹ This even includes the CSPRNGs mentioned above, but they are designed to be used in a special way that includes regular re-seeding with entropy to overcome that problem.
A CPU can only generate a uniform random number, U(0,1), which ranges from 0 to 1. Mathematically, it is a random variable U on the interval [0,1]. Examples of random draws of a U(0,1) random number would be 0.28100002, 0.34522, 0.7921, etc. All values between 0 and 1 are equally likely, i.e., they are equiprobable.
You can generate binary random variates that are either 0 or 1 by mapping a random draw of U(0,1) to 0 if U <= 0.5 and to 1 if U > 0.5, since draws below 0.5 and above 0.5 are equally likely.
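A short version of that thresholding in Python (a sketch, assuming random.random() as the U(0,1) source):

    import random

    def random_bit():
        # Map a U(0,1) draw to 0 or 1 by thresholding at 0.5.
        return 0 if random.random() <= 0.5 else 1

    print([random_bit() for _ in range(20)])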
