avoiding modulo bias with only single random value - random

I want to create a random value between 0 and n. As source I receive a random 256-bit value from an oracle.
Just doing
source % (n + 1)
introduces modulo bias. Solutions to modulo bias I've seen are doing a while loop until the value lies within the largest multiple of n that fits the source range - but in my case I can draw a random source value only once.
my n is much smaller though, let's say smaller than 2^32. Is there any way I can generate a random value that has no modulo bias in this scenario?

Under the assumption that your random number generator is uniform, there is no way to map it to an unbiased generator on 0 to n inclusive unless n+1 is a divisor of 2^256. This is because it is impossible to partition a set of size 2^256 into n+1 equally probable subsets unless n+1 divides 2^256. Having said that, if n is much smaller than 2^256 the departure from uniformity in doing something like int(source * (n+1)/(2^256 - 1) will be negligible.

Related

Hashing with the Division Method - Choosing number of slots?

So, in CLRS, there's this quote
A prime not too close to an exact power of 2 is often a good choice for m.
Several Questions...
I understand how a power of 2 will just be the lower order bits of your key...however, say you have keys from a universe of 1 to 1 million, with each key having an equal probability of being any number from universe (which I'm guessing is a common assumption about your universe if given no other data?) then wouldn't taking say the 4 lower order bits result in (2^4) lower order bit patterns that were pretty much equally likely for the keys from 1 to 1 million? How am I thinking about this incorrectly?
Why a prime number? So, if power of 2's aren't a good idea, why is a prime number a better choice as opposed to a composite number close to a power of 2 (Also why should it be close to a power of 2...lol)?
You are trying to find a hash table that works well for typical input data, and typical input data does things that you wouldn't expect from good random number generators. Very often you get formatted or semi-formatted strings which, when converted to numbers, end up as K, K+A, K+2A, K+3A,.... for some integers K and A. If K+xA and K+yA hash to the same number mod m, then (x-y)A must be 0 mod m. If m is prime, this can only happen if A = 0 mod m or if x = y mod m, so one time in m. But if m=pq and A happens to be divisible by p, then you get a collision every time x-y is divisible by q, which is more often since q < m.
I guess close to a power of 2 because it might be convenient for the memory management system to have blocks of memory of the resulting size - I really don't know. If you really care, and if you have the time, you could try different primes with some representative data and see which of them are best in practice.

Most efficient way of storing exact set membership?

I have N slots. There are M slots occupied.
I want to be able to tell exactly whether each slot is occupied. (No Bloom filter answers, please.)
What is the absolute most storage-space efficient way of storing this information for M << N?
Guess 0: A bitmap of N bits.
Guess 1: An array of the positions of the occupied slots. Reasonably good for small M.
Guess 2: p0 + (M-1)p1 + (M-1)(M-2)p2 + ... where pX is the position of an occupied slot, among the remaining unoccupied slots. This is slightly more efficient than guess 1, as the choice of unoccupied slot narrows as slots are filled.
Guess 2 still has a lot of waste; it includes the order in which the slots were filled, which is information that is not required.
What method is more efficient than Guess 2?
If M and N are known, then one way of achieving the best compression is to store the index of the combination.
There are t= N!/((N-M)!*M!) ways of choosing the M slots to be filled, so you will always need at least log2(t) bits to represent this information.
Storing the index of the combination allows you to use exactly ceil(log2(t))bits.
Assuming that there is no other information about the distribution of the values (i.e., every possible sample of M values is equally probable) then the optimal compression technique is that given by #PeterdeRivaz, which is to simply using the ordinal of the enumeration of the sample out of the set of possible samples.
However, that is not trivial to compute, since the enumeration requires arithmetic on very large numbers.
Instead, it is possible to use a variant on Golomb compression, with only a small impact on the compression ratio.
Assume the numbers in the sample are in order. We start by computing the successive differences. Because we will never have two equal numbers, the sequence of differences will never include a 0; to gain a tiny additional compression, we start with one more than the first element of the sample, rather than the first element itself -- which means that the sequence never contains a 0 -- and then subtract one from each value. We now select some convenient number of bits k, and encode each value δ in the sequence as follows:
While δ > 2k, send a 1 bit and subtract 2k from δ
Now δ can be written in k bits. Send a 0 followed by the k-bit value of δ
We can choose k as ⌊log2(N/M)⌋, which means that N < 2k+1M. (Taking the ceiling would be another possibility.) Consequently, the number of 1 bits sent in all of the iterations of step 1 in the above algorithm is less than 2M (because each 1 accounts for 2k of the cumulative sum of the series, which is less than N). Each step 2 sends exactly k + 1 bits, and there are exactly M executions of step 2, one for each value in the sample. Thus, the total number of bits sent is somewhere between M × (k + 1) and M × (k + 2). But since we know k < log2(N/M) + 1, the total size of the transmission is certainly less than M log2 N - M log2 M + 3M). (We also have to send the parameters k and M, so there is a bit of overhead.)
Now, let's consider the optimal transmission size. There are N choose M possible samples, so the size of the enumeration index in bits will be log2(N choose M). If N &gg; M, we can approximate N choose M as NM/M!, and then using Stirling's approximation we get:
log(N choose M) ≈ M log N − M log M + M
(That's actually a slight overestimate, but it is asymptotically correct.)
Thus, the difference between the compressed sequence and the information-theoretic limit is less than 2 bits per value. (In practice, it is generally around one bit per value, because step 1 executes far less than the maximum number of times.)

generating sorted random numbers without exponentiation involved?

I am looking for a math equation or algorithm which can generate uniform random numbers in ascending order in the range [0,1] without the help of division operator. i am keen in skipping the division operation because i am implementing it in hardware. Thank you.
Generating the numbers in ascending (or descending) order means generating them sequentially but with the right distribution. That, in turn, means we need to know the distribution of the minimum of a set of size N, and then at each stage we need to use conditioning to determine the next value based on what we've already seen. Mathematically these are both straightforward except for the issue of avoiding division.
You can generate the minimum of N uniform(0,1)'s from a single uniform(0,1) random number U using the algorithm min = 1 - U**(1/N), where ** denotes exponentiation. In other words, the complement of the Nth root of a uniform has the same distribution as the minimum of N uniforms over the range [0,1], which can then be scaled to any other interval length you like.
The conditioning aspect basically says that the k values already generated will have eaten up some portion of the original interval, and that what we now want is the minimum of N-k values, scaled to the remaining range.
Combining the two pieces yields the following logic. Generate the smallest of the N uniforms, scale it by the remaining interval length (1 the first time), and make that result the last value we have generated. Then generate the smallest of N-1 uniforms, scale it by the remaining interval length, and add it to the last one to give you your next value. Lather, rinse, repeat, until you have done them all. The following Ruby implementation gives distributionally correct results, assuming you have read in or specified N prior to this:
last_u = 0.0
N.downto(1) do |i|
p last_u += (1.0 - last_u) * (1.0 - (rand ** (1.0/i)))
end
but we have that pesky ith root which uses division. However, if we know N ahead of time, we can pre-calculate the inverses of the integers from 1 to N offline and table them.
last_u = 0.0
N.downto(1) do |i|
p last_u += (1.0 - last_u) * (1.0 - (rand ** inverse[i]))
end
I don't know of any way get the correct distributional behavior sequentially without using exponentiation. If that's a show-stopper, you're going to have to give up on either the sequential nature of the process or the uniformity requirement.
You can try so-called "stratified sampling", which means you divide the range into bins and then sample randomly from bins. A sample thus generated is more uniform (less clumping) than a sample generated from the entire interval. For this reason, stratified sampling reduces the variance of Monte Carlo estimates (I don't suppose that's important to you, but that's why the method was invented, as a reduction of variance method).
It is an interesting problem to generate numbers in order, but my guess is that to get a uniform distribution over the entire interval, you will have to apply some formulas which require more computation. If you want to minimize computation time, I suspect you cannot do better than generating a sample and then sorting it.

Create a random permutation of 1..N in constant space

I am looking to enumerate a random permutation of the numbers 1..N in fixed space. This means that I cannot store all numbers in a list. The reason for that is that N can be very large, more than available memory. I still want to be able to walk through such a permutation of numbers one at a time, visiting each number exactly once.
I know this can be done for certain N: Many random number generators cycle through their whole state space randomly, but entirely. A good random number generator with state size of 32 bit will emit a permutation of the numbers 0..(2^32)-1. Every number exactly once.
I want to get to pick N to be any number at all and not be constrained to powers of 2 for example. Is there an algorithm for this?
The easiest way is probably to just create a full-range PRNG for a larger range than you care about, and when it generates a number larger than you want, just throw it away and get the next one.
Another possibility that's pretty much a variation of the same would be to use a linear feedback shift register (LFSR) to generate the numbers in the first place. This has a couple of advantages: first of all, an LFSR is probably a bit faster than most PRNGs. Second, it is (I believe) a bit easier to engineer an LFSR that produces numbers close to the range you want, and still be sure it cycles through the numbers in its range in (pseudo)random order, without any repetitions.
Without spending a lot of time on the details, the math behind LFSRs has been studied quite thoroughly. Producing one that runs through all the numbers in its range without repetition simply requires choosing a set of "taps" that correspond to an irreducible polynomial. If you don't want to search for that yourself, it's pretty easy to find tables of known ones for almost any reasonable size (e.g., doing a quick look, the wikipedia article lists them for size up to 19 bits).
If memory serves, there's at least one irreducible polynomial of ever possible bit size. That translates to the fact that in the worst case you can create a generator that has roughly twice the range you need, so on average you're throwing away (roughly) every other number you generate. Given the speed an LFSR, I'd guess you can do that and still maintain quite acceptable speed.
One way to do it would be
Find a prime p larger than N, preferably not much larger.
Find a primitive root of unity g modulo p, that is, a number 1 < g < p such that g^k ≡ 1 (mod p) if and only if k is a multiple of p-1.
Go through g^k (mod p) for k = 1, 2, ..., ignoring the values that are larger than N.
For every prime p, there are φ(p-1) primitive roots of unity, so it works. However, it may take a while to find one. Finding a suitable prime is much easier in general.
For finding a primitive root, I know nothing substantially better than trial and error, but one can increase the probability of a fast find by choosing the prime p appropriately.
Since the number of primitive roots is φ(p-1), if one randomly chooses r in the range from 1 to p-1, the expected number of tries until one finds a primitive root is (p-1)/φ(p-1), hence one should choose p so that φ(p-1) is relatively large, that means that p-1 must have few distinct prime divisors (and preferably only large ones, except for the factor 2).
Instead of randomly choosing, one can also try in sequence whether 2, 3, 5, 6, 7, 10, ... is a primitive root, of course skipping perfect powers (or not, they are in general quickly eliminated), that should not affect the number of tries needed greatly.
So it boils down to checking whether a number x is a primitive root modulo p. If p-1 = q^a * r^b * s^c * ... with distinct primes q, r, s, ..., x is a primitive root if and only if
x^((p-1)/q) % p != 1
x^((p-1)/r) % p != 1
x^((p-1)/s) % p != 1
...
thus one needs a decent modular exponentiation (exponentiation by repeated squaring lends itself well for that, reducing by the modulus on each step). And a good method to find the prime factor decomposition of p-1. Note, however, that even naive trial division would be only O(√p), while the generation of the permutation is Θ(p), so it's not paramount that the factorisation is optimal.
Another way to do this is with a block cipher; see this blog post for details.
The blog posts links to the paper Ciphers with Arbitrary Finite Domains which contains a bunch of solutions.
Consider the prime 3. To fully express all possible outputs, think of it this way...
bias + step mod prime
The bias is just an offset bias. step is an accumulator (if it's 1 for example, it would just be 0, 1, 2 in sequence, while 2 would result in 0, 2, 4) and prime is the prime number we want to generate the permutations against.
For example. A simple sequence of 0, 1, 2 would be...
0 + 0 mod 3 = 0
0 + 1 mod 3 = 1
0 + 2 mod 3 = 2
Modifying a couple of those variables for a second, we'll take bias of 1 and step of 2 (just for illustration)...
1 + 2 mod 3 = 0
1 + 4 mod 3 = 2
1 + 6 mod 3 = 1
You'll note that we produced an entirely different sequence. No number within the set repeats itself and all numbers are represented (it's bijective). Each unique combination of offset and bias will result in one of prime! possible permutations of the set. In the case of a prime of 3 you'll see that there are 6 different possible permuations:
0,1,2
0,2,1
1,0,2
1,2,0
2,0,1
2,1,0
If you do the math on the variables above you'll not that it results in the same information requirements...
1/3! = 1/6 = 1.66..
... vs...
1/3 (bias) * 1/2 (step) => 1/6 = 1.66..
Restrictions are simple, bias must be within 0..P-1 and step must be within 1..P-1 (I have been functionally just been using 0..P-2 and adding 1 on arithmetic in my own work). Other than that, it works with all prime numbers no matter how large and will permutate all possible unique sets of them without the need for memory beyond a couple of integers (each technically requiring slightly less bits than the prime itself).
Note carefully that this generator is not meant to be used to generate sets that are not prime in number. It's entirely possible to do so, but not recommended for security sensitive purposes as it would introduce a timing attack.
That said, if you would like to use this method to generate a set sequence that is not a prime, you have two choices.
First (and the simplest/cheapest), pick the prime number just larger than the set size you're looking for and have your generator simply discard anything that doesn't belong. Once more, danger, this is a very bad idea if this is a security sensitive application.
Second (by far the most complicated and costly), you can recognize that all numbers are composed of prime numbers and create multiple generators that then produce a product for each element in the set. In other words, an n of 6 would involve all possible prime generators that could match 6 (in this case, 2 and 3), multiplied in sequence. This is both expensive (although mathematically more elegant) as well as also introducing a timing attack so it's even less recommended.
Lastly, if you need a generator for bias and or step... why don't you use another of the same family :). Suddenly you're extremely close to creating true simple-random-samples (which is not easy usually).
The fundamental weakness of LCGs (x=(x*m+c)%b style generators) is useful here.
If the generator is properly formed then x%f is also a repeating sequence of all values lower than f (provided f if a factor of b).
Since bis usually a power of 2 this means that you can take a 32-bit generator and reduce it to an n-bit generator by masking off the top bits and it will have the same full-range property.
This means that you can reduce the number of discard values to be fewer than N by choosing an appropriate mask.
Unfortunately LCG Is a poor generator for exactly the same reason as given above.
Also, this has exactly the same weakness as I noted in a comment on #JerryCoffin's answer. It will always produce the same sequence and the only thing the seed controls is where to start in that sequence.
Here's some SageMath code that should generate a random permutation the way Daniel Fischer suggested:
def random_safe_prime(lbound):
while True:
q = random_prime(lbound, lbound=lbound // 2)
p = 2 * q + 1
if is_prime(p):
return p, q
def random_permutation(n):
p, q = random_safe_prime(n + 2)
while True:
r = randint(2, p - 1)
if pow(r, 2, p) != 1 and pow(r, q, p) != 1:
i = 1
while True:
x = pow(r, i, p)
if x == 1:
return
if 0 <= x - 2 < n:
yield x - 2
i += 1

Optimal algorithm for generating a random number R not in a set of numbers N

I am curious to know what the best way to generate a random integer R that is not in a provided set of integers (R∉N). I can think of several ways of doing this but I'm wondering what you all think.
Let N be the size of the overall set, and let K be the size of the excluded set.
I depends on the size of the set you are sampling from. If the excluded set is much smaller than the overall range, just choose a random number, and if it is in the excluded set, choose again. If we keep the excluded set in a hash table each try can be done in O(1) time.
If the excluded set is large, choose a random number R in a set of size (N - K) and output the choice as the member of the non excluded elements. If we store just the holes in a hash table keyed with the value of the random number we can generate this in one sample in time O(1).
The cutoff point will depend on the size of (N - K)/N, but I suspect that unless this is greater than .5 or so, or you sets are very small, just sampling until you get a hit will be faster in practice.
Given your limited description? Find the maximum value of the elements in N. Generate only random numbers greater than that.
Generate a random number R in the entire domain (subtract the size of N from the max value) that you want to use. Then loop through all N less than R and for each add 1 to R. This will give a random number in the domain that is not in N.

Resources