I want a simple (non-cryptographic) random number generation algorithm where I can freely choose the period.
One candidate would be a special instance of LCG:
X(n+1) = (a*X(n) + c) mod m, where c and m are relatively prime, (a - 1) is divisible by every prime factor of m, and (a - 1) is divisible by 4 if m is.
This has period m and does not restrict possible values of m.
I intend to use this RNG to create a permutation of an array by generating indices into it. I tried the LCG and it might be OK. However, it may not be "random enough", in that the distances between adjacent outputs take very few possible values (i.e., plotting X(n) vs. n gives a wrapped line). The arrays I want to index into have some structure related to this distance, and I want to avoid potential issues with it.
Of course, I could use any good PRNG to shuffle (using e.g. Fisher–Yates) an array [1,..., m]. But I don't want to have to store this array of indices. Is there some way to capture the permuted indices directly in an algorithm?
I don't really mind the method ending up biased w.r.t. the choice of RNG seed. Only the period matters, plus the permuted sequence (for a given seed) being reasonably random.
Encryption is a one-to-one operation. If you encrypt a range of numbers, you will get the same count of apparently random numbers back. In this case the period will be the size of the chosen range. So for a period of 20, encrypt the numbers 0..19.
If you want the output numbers to be in a specific range, then pick a block cipher with an appropriately sized block, and use Format Preserving Encryption if needed, as @David Eisenstat suggests.
It is not difficult to set up a cipher with almost any reasonable block size, so long as it is an even number of bits, using the Feistel structure. If you don't require cryptographic security then four or six Feistel rounds should give you enough randomness.
Changing the encryption key will give you a different ordering of the numbers.
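To make the Feistel suggestion concrete, here is a minimal Python sketch (not a vetted cipher; the SHA-256-based round function and the key handling are illustrative assumptions) of a four-round balanced Feistel permutation plus cycle-walking to hit an arbitrary period such as 20:
    import hashlib

    def feistel_permute(x, key, bits=8, rounds=4):
        # Permutes [0, 2**bits); bits must be even (balanced Feistel).
        half = bits // 2
        mask = (1 << half) - 1
        left, right = x >> half, x & mask
        for r in range(rounds):
            # Round function: keyed hash of the right half, truncated.
            f = int.from_bytes(hashlib.sha256(
                f"{key}:{r}:{right}".encode()).digest()[:4], "big") & mask
            left, right = right, left ^ f
        return (left << half) | right

    def encrypt_in_range(x, key, n=20):
        # Cycle-walking: re-encrypt until the value falls back into [0, n).
        # This remains a permutation of [0, n) and always terminates.
        y = feistel_permute(x, key)
        while y >= n:
            y = feistel_permute(y, key)
        return y

    # Encrypting 0..19 yields a permutation of 0..19; the key picks which one.
    print([encrypt_in_range(i, "key-1") for i in range(20)])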
The problem
We have a set of symbol sequences, which should be mapped to a pre-defined number of bucket-indexes.
Prerequisites
The symbol sequences are restricted in length (64 characters/bytes), and the hash algorithm used is the Delphi implementation of the Bob Jenkins hash, producing a 32-bit hash value.
To further distribute these hash values over a certain number of buckets, we use the formula:
bucket_number := (hashvalue mod (num_buckets - 2)) + 2;
(We don't want {0,1} to be in the result set)
The question
A colleague suggested that we need to choose a prime number for num_buckets to achieve an optimal¹ distribution when mapping the symbol sequences to bucket numbers.
The majority of the team believes that's an unproven assumption, though our teammate simply claimed it's mathematically intrinsic (without a more in-depth explanation).
I can imagine that certain symbol sequence patterns we use (just a very limited subset of what's actually allowed) may favor certain hash values, but generally I don't believe that's really significant for a large number of symbol sequences.
The hash algorithm should already distribute the hash values optimally, and I doubt that a prime mod divisor would really make a significant difference (I couldn't measure one empirically either), especially since the Bob Jenkins hash computation doesn't involve any prime numbers either, as far as I can see.
[TL;DR]
Does a prime number mod divisor matter for this case, or not?
¹) Optimal simply means a stable average number of sequences per bucket, which doesn't change (much) with the total number of sequences.
Your colleague is simply wrong.
If a hash works well, all hash values should be equally likely, with a relationship that is not obvious from the input data.
When you take the hash mod some value, you are then mapping equally likely hash inputs to a reduced number of output buckets. The result is not evenly distributed to the extent that outputs can be produced by different numbers of inputs. As long as the number of buckets is small relative to the range of hash values, this discrepancy is small: it is on the order of (# of buckets) / (# of hash values). The number of buckets is typically under 10^6, while even a 32-bit hash like the one here has over 4*10^9 values, so the discrepancy is very small indeed. And if the number of buckets exactly divides the range of hash values, there is no discrepancy at all.
Primality doesn't enter into it, except insofar as you get the best distribution when the number of buckets divides the range of the hash function. Since the range of the hash function is usually a power of 2, a prime number of buckets is unlikely to do anything for you.
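If you want to check this empirically, a quick simulation along these lines (a sketch, assuming the hash behaves like a uniform 32-bit value, which is what a good hash should approximate) shows that a prime bucket count buys nothing:
    import random
    from collections import Counter

    def bucket_spread(num_buckets, trials=1_000_000):
        # Feed ideal uniform 32-bit "hash values" through the mod and
        # report the max/min bucket occupancy ratio (1.0 = perfectly even).
        counts = Counter(random.getrandbits(32) % num_buckets
                         for _ in range(trials))
        return max(counts.values()) / min(counts.values())

    # A prime bucket count (997) and a composite one (1000) come out
    # essentially identical when the hash itself is uniform.
    print(bucket_spread(997), bucket_spread(1000))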
I am looking for a shuffle algorithm to shuffle a set of sequential numbers without buffering. Another way to state this is that I’m looking for a random sequence of unique numbers that have a given period.
Your typical Fisher–Yates shuffle needs to hold all of the elements it is going to shuffle, so that isn’t going to work.
A Linear-Feedback Shift Register (LFSR) does what I want, but only works for periods that are powers-of-two less two. Here is an example of using a 4-bit LFSR to shuffle the numbers 1-14:
Input:   1  2  3  4  5  6  7  8  9 10 11 12 13 14
Output:  8 12 14  7  4 10  5 11  6  3  2  1  9 13
The first row is the input, and the second row the output. What’s nice is that the state is very small: just the current index. You can start at any index and get a different set of numbers (starting at 1 yields: 8, 12, 14; starting at 9: 6, 3, 2), although the sequence is always the same (5 is always followed by 11). If I want a different sequence, I can pick a different generator polynomial.
The limitations of the LFSR are that the periods are always a power of two less two (the min and max are always the same, thus unshuffled), and there are not enough generator polynomials to allow every possible random sequence.
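For reference, here is a minimal Python sketch of such an LFSR; the taps below encode the primitive polynomial x^4 + x^3 + 1, which is just one illustrative choice, and its visiting order differs from the table above:
    def lfsr_states(seed=1, bits=4, taps=(3, 2)):
        # Fibonacci LFSR: the feedback bit is the XOR of the tapped bits
        # (bit positions 3 and 2 here, i.e. x^4 + x^3 + 1).
        # Visits all 2**bits - 1 nonzero states, then repeats.
        state = seed
        for _ in range((1 << bits) - 1):
            yield state
            fb = 0
            for t in taps:
                fb ^= (state >> t) & 1
            state = ((state << 1) | fb) & ((1 << bits) - 1)

    print(list(lfsr_states()))
    # -> [1, 2, 4, 9, 3, 6, 13, 10, 5, 11, 7, 15, 14, 12, 8]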
A block cipher algorithm would work. Every key produces a uniquely shuffled set of numbers. However, all block ciphers (that I know about) have power-of-two block sizes, and usually a fixed or limited set of block sizes. A block cipher with an arbitrary non-binary block size would be perfect, if such a thing exists.
There are a couple of projects I have that could benefit from such an algorithm. One is for small embedded micros that need to produce a shuffled sequence of numbers with a period larger than the memory they have available (think Arduino Uno needing to shuffle 1 to 100,000).
Does such an algorithm exist? If not, what things might I search for to help me develop such an algorithm? Or is this simply not possible?
Edit 2022-01-30
I have received a lot of good feedback and I need to better explain what I am searching for.
In addition to the Arduino example, where memory is an issue, there is also the shuffle of a large number of records (billions to trillions). The desire is to have a shuffle applied to these records without needing a buffer to hold the shuffle order array, or the time needed to build that array.
I do not need an algorithm that could produce every possible permutation, but a large number of permutations. Something like a typical block cipher in counter mode where each key produces a unique sequence of values.
A Linear Congruential Generator with coefficients chosen to produce the desired sequence period will only produce a single sequence. A Linear Feedback Shift Register has the same problem.
Format-Preserving Encryption (FPE), such as AES FFX, shows promise and is where I am currently focusing my attention. Additional feedback welcome.
It is certainly not possible to produce an algorithm which could potentially generate every possible sequence of length N with less than N*(log2 N - 1.45) bits of state, because there are N! possible sequences and each state can generate exactly one sequence. If your hypothetical Arduino application could produce every possible sequence of 100,000 numbers, it would require at least 1,516,705 bits of state, a bit more than 185 KiB, which is probably more memory than you want to devote to the problem [Note 1].
That's also a lot more memory than you would need for the shuffle buffer; that's because the PRNG driving the shuffle algorithm also doesn't have enough state to come close to being able to generate every possible sequence. It can't generate more different sequences than the number of different possible states that it has.
So you have to make some compromise :-)
One simple approach is to start with some parametrisable generator which can produce non-repeating sequences for a large variety of block sizes. Choose a block size which is at least as large as your target range but not "too much larger"; say, less than twice as large. Then select a subrange of that block and start generating numbers: if the generated number is inside the subrange, return its offset; if not, throw it away and generate another number. If the generator's range is less than twice the desired range, then you will throw away less than half of the generated values, and producing the next element in the sequence is amortised O(1). In theory, it might take a long time to generate an individual value, but that's not very likely, and if you use a not-very-good PRNG like a linear congruential generator, you can make it very unlikely indeed by restricting the possible generator parameters.
For LCGs you have a couple of possibilities. You could use a power-of-two modulus, with an odd offset and a multiplier which is 5 mod 8 (and not too far from the square root of the block size), or you could use a prime modulus with an almost arbitrary offset and multiplier. Using a prime modulus is computationally more expensive, but the deficiencies of LCGs are less apparent. Since you don't need to handle arbitrary primes, you can preselect a geometrically-spaced sample and precompute the efficient division-by-multiplication constants for each one.
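Here is a sketch of that recipe for the 100,000-number example, using a power-of-two modulus LCG with illustrative parameters; the only state carried between outputs is x:
    def permuted_range(n=100_000, seed=12345):
        m = 1 << 17                # smallest power of two >= n (and < 2n)
        a = 365                    # 365 % 8 == 5, near sqrt(m) ~ 362
        c = 49_297                 # any odd offset
        x = seed % m
        for _ in range(m):         # full period: each x in [0, m) once
            x = (a * x + c) % m
            if x < n:              # rejection: skip out-of-range values
                yield x

    seq = list(permuted_range())
    assert sorted(seq) == list(range(100_000))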
Since you're free to use any subrange of the generator's range, you have an additional potential parameter: the offset of the start of the subrange. (Or even offsets, since the subrange doesn't need to be contiguous.) You can also increase the apparent randomness by doing any bijective transformation (XOR/rotates are good, if you're using a power-of-two block size.)
Depending on your application, there are known algorithms to produce block ciphers for subword bit lengths [Note 2], which gives you another possible way to increase randomness and/or add some more bits to the generator state.
Notes
The approximation for the minimum number of states comes directly from Stirling's approximation for N!, but I computed the number of bits using the commonly available lgamma function (rendered in code after these notes).
With about 30 seconds of googling, I found this paper on researchgate.net; I'm far from knowledgeable enough in crypto to offer an opinion, but it looks credible; also, there are references to other algorithms in its footnotes.
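For completeness, the Note 1 arithmetic rendered in Python:
    from math import ceil, lgamma, log

    # Minimum state bits to reach every permutation of N items: log2(N!).
    N = 100_000
    bits = lgamma(N + 1) / log(2)   # lgamma(N + 1) == ln(N!)
    print(ceil(bits))               # 1516705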
I have a set of 64-bit unsigned integers with length >= 2. I pick two random integers, a and b, from that set. I apply a deterministic operation to combine a and b into different 64-bit unsigned integers c_1, c_2, c_3, etc. I add those c_n to the set. I repeat the process.
What procedure can I use to guarantee that the c values will practically never collide with an existing bitstring in the set, even after millions of steps?
Since you're generating multiple 64-bit values from a pair of 64-bit numbers, I would suggest that you select two numbers at random, and use them to initialize a 64-bit xorshift random number generator with 128 bits of state. See https://en.wikipedia.org/wiki/Xorshift#xorshift.2B for an example.
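For reference, a Python sketch of xorshift128+ as described on the linked Wikipedia page (the masking emulates 64-bit wraparound):
    MASK = (1 << 64) - 1

    def xorshift128plus(s0, s1):
        # State: two 64-bit words; they must not both be zero.
        while True:
            x, y = s0, s1
            s0 = y
            x = (x ^ (x << 23)) & MASK
            s1 = x ^ y ^ (x >> 17) ^ (y >> 26)
            yield (s1 + y) & MASK

    gen = xorshift128plus(2, 3)       # seeded with a = 2, b = 3 from the set
    c_1, c_2, c_3 = next(gen), next(gen), next(gen)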
However, it's rather difficult to predict the collision probability when you're using multiple random number generators. With a single PRNG, the rule of thumb is that you'll have a 50% chance of a collision after generating the square root of the range. For example, if you were generating 32-bit random numbers, your collision probability reaches 50% after about 70,000 numbers generated. Square root of 2^32 is 65,536.
With a single 64-bit PRNG, you could generate more than a billion random numbers without too much worry about collisions. In your case, you're picking two numbers from a potentially small pool, then initializing a PRNG and generating a relatively small number of values that you add back to the pool. I don't know how to calculate the collision probability in that case.
Note, however, that whatever the probability of collision, the possibility of collision always exists. That "one in a billion" chance does in fact occur: on average once every billion times you run the program. You're much better off saving your output numbers in a hash set or other data structure that won't allow you to store duplicates.
I think the best you can do, without any other given constraints, is to use a pseudo-random function that maps two 64-bit integers to a 64-bit integer. Depending on whether the order of a and b matters for your problem (i.e. whether (3, 5) should map to something different than (5, 3)), you should either sort the pair first or leave it as-is.
The natural choice for a pseudo-random function that maps a larger input to a smaller output is a hash function. You can select any hash function that produces an output of at least 64 bits and truncate it. (My favorite in this case would be SipHash with an arbitrary fixed key; it is fast and has public-domain implementations in many languages, but you might just use whatever is available.)
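A sketch of that approach; since SipHash isn't in the Python standard library, keyed BLAKE2b truncated to 64 bits stands in here (any keyed pseudo-random function with a fixed key plays the same role):
    import hashlib

    KEY = b"arbitrary-fixed-key"

    def combine(a: int, b: int) -> int:
        a, b = min(a, b), max(a, b)   # sort iff (3,5) and (5,3) should match
        data = a.to_bytes(8, "big") + b.to_bytes(8, "big")
        digest = hashlib.blake2b(data, key=KEY, digest_size=8).digest()
        return int.from_bytes(digest, "big")

    assert combine(3, 5) == combine(5, 3)   # order-independent variant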
The expected amount of numbers you can generate before you get a collision is determined by the birthday bound, as you are essentially selecting values at random. The linked article contains a table for the probabilities for 64-bit values. As an example, if you generate about 6 million entries, you have a collision probability of one in a million.
I don't think it is possible to beat this approach in the general case, as you could encode an arbitrary amount of information in the sequence of elements you combine while the amount of information in the output value is fixed to 64-bit. Thus you have to consider collisions, and a random function spreads out the probability evenly among all possible sequences.
I need to issue a series {1, 2, 3, 4 …} of tickets that are (at least seemingly) random numbers {10,934, 3,453,867, 122, 4,386,564 …}. When presented back, I must be able to compute their original index (e.g. 122 → 3.)
In other words, I need a seemingly random permutation p on the interval [1 ... N] that has an inverse permutation p^-1. N is about 10^7.
The reasons for that are:
- It is a cipher: when receiving a ticket, it should not be easy to guess the tickets that were issued before.
- The tickets should be short alphanumeric strings that can be noted down.
- I want to avoid recording every ticket issued.
I would use some well-known cipher (e.g., DES) in counter mode.
DES is generally considered fairly broken for normal purposes, but it seems to fit your needs reasonably well, and it has a smaller block size than most newer algorithms. For you, that means it produces a smaller result (64 bits, if memory serves). Once you've converted that to readable characters (e.g., base64), you end up with something like 10 characters or so.
To retrieve the original number, you simply decrypt with your secret key.
Results look quite random: essentially the only known way to sort them back into order would be to break DES, which can be done (and has been done), but the resources to do so are quite non-trivial.
If you really do need a lot better security than that, you can use something like AES instead of DES (at the expense of producing a longer "key" value).
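A sketch of the scheme, using pycryptodome's DES as an assumed stand-in for whatever 64-bit block cipher you pick: encrypt the index to issue a ticket, decrypt to recover the index.
    import base64
    from Crypto.Cipher import DES   # pip install pycryptodome

    cipher = DES.new(b"secret!!", DES.MODE_ECB)   # 8-byte secret key

    def issue(index: int) -> str:
        block = cipher.encrypt(index.to_bytes(8, "big"))
        return base64.urlsafe_b64encode(block).decode()   # ~11 chars + padding

    def redeem(ticket: str) -> int:
        block = cipher.decrypt(base64.urlsafe_b64decode(ticket))
        return int.from_bytes(block, "big")

    assert redeem(issue(3)) == 3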
1. To generate a pseudo-random shuffle, you could use the Fisher–Yates algorithm:
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
    var rng = new Random();
    for (int i = tickets.Length - 1; i > 0; i--)
    {
        int n = rng.Next(i + 1);   // uniform in [0, i]
        (tickets[i], tickets[n]) = (tickets[n], tickets[i]);
    }
Beware of using the "wrong" variant of the algorithm (it has bias); see "What distribution do you get from this broken random shuffle?" for details. You will get the permutation, and from it the inverse permutation.
2. The problem comes with the randomness of the shuffle. As there are 10,000,000! permutations, you would need a very large seed to be able to reach more than a tiny fraction of them. The problem is then in the random generator: standard ones have about 32 bits of state, perhaps a little more, but far from log2(10,000,000!). You should look at something like Fortuna:
https://en.wikipedia.org/wiki/Fortuna_%28PRNG%29
You can generate such a sequence using a Linear congruential generator.
X0 is the seed (or the index of the permutation, if you wish). m should be equal to N+1. Select c and a to ensure full period length (as described in the 'period length' section of the link above). This will give you a one-to-one mapping of size N.
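A sketch with small illustrative numbers (m = 100): c = 13 is coprime to m, and a - 1 = 20 is divisible by m's prime factors 2 and 5, and by 4 since 4 divides m, so the period is the full m:
    def lcg(seed, m=100, a=21, c=13):
        # Full-period LCG: every value in [0, m) appears exactly once.
        x = seed % m
        for _ in range(m):
            yield x
            x = (a * x + c) % m

    assert sorted(lcg(42)) == list(range(100))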
To restore the index, one could crack the LCG given a small number of consecutive pseudo-random numbers from the series, which is not too hard. Of course, you can simply keep m, a and c and save yourself the trouble.
For more secure methods, look at David Eisenstat's comment. You'll need only the secret key to restore the index. On the downside, if you use a standard FPE, N would have to be 2^x-1 (e.g. 2^128-1).
I need to generate around 9-100 million non-repeating random numbers, ranging from zero to the amount of numbers generated, and I need them to be generated very quickly. Several answers to similar questions proposed simply shuffling an array in order to get the random numbers, and others proposed using a bloom filter. The question is, which one is more efficient, and in case of it being the bloom filter, how do I use it?
You don't want random numbers at all. You want exactly the numbers 0 to N-1, in random order.
Simply filling the array and shuffling should be very quick. A proper Fisher-Yates shuffle is O(n), so an array of 100 million should take well under a second in C or even Java, slightly slower in a higher-level language like Python.
You only have to generate N-1 random numbers to do the shuffle (maybe up to 1.3N if you use rejection sampling to get perfect uniformity), so the speed will depend largely on how fast your RNG is.
You'll never need to look up whether a number has already been generated; that will be deadly slow no matter which algorithm you use, especially toward the end of the run.
If you need slightly fewer than N total numbers, fill the array from 0 to N-1, then just abort the shuffle early and take the partial result. Only if the amount of numbers you need is very small compared to their range should you consider the generate-and-check-for-dups approach. In that case Bob Floyd's algorithm might be good.
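For that last case, a sketch of Bob Floyd's algorithm, which draws k distinct values from a huge range using O(k) memory and exactly k random draws, no retries:
    import random

    def floyd_sample(n, k):
        # k distinct values from range(n); each iteration adds one value.
        chosen = set()
        for j in range(n - k, n):
            t = random.randrange(j + 1)          # uniform in [0, j]
            chosen.add(j if t in chosen else t)
        return chosen

    sample = floyd_sample(10**12, 100)           # tiny k, huge range
    assert len(sample) == 100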
As an alternative you could use an appropriately sized block cypher. Use the block cypher to encrypt the numbers 0, 1, 2, ... and you will get a series of non-repeating random numbers out. Exactly what series will depend on the key you use. They are guaranteed not to repeat, because a block cypher is a reversible permutation.
For 64-bit numbers use DES; for 32-bit use Hasty Pudding (which allows a large range of block sizes), or write your own simple Feistel cypher. Assuming that security is not a big issue here, writing your own is possible.
For sure it's better to create an algorithm that shuffles the numbers. If you use a seed, for example the server microtime or a timestamp, you can get a different random sequence for each millisecond.
Start by creating an array using a range function, with as many numbers as you like.
Then use a seed to make the pseudo-randomness reproducible.
So, instead of rand, use shuffle: set up the range (say 1 to 90), set the seed, then shuffle the array. You now have all the numbers in a random order (corresponding to the seed).
Change the seed to get another result.
The order of the numbers is the result: element 0 of the array is ball 1, and so on (e.g. ball 1: 42, ball 2: 10, ball 3: 50).
You can also use a slice function inside a for/each loop, incrementing the slice length, so that you draw the balls one at a time:
slice the array (0, 1): that's ball 1;
slice the array (0, 2): ball 2;
slice the array (0, 3): ball 3...
That's the logic; I hope it helps.
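Rendered as a short Python sketch (the answer above seems to have PHP's range/shuffle in mind; the seed value is just an example):
    import random

    random.seed(1_655_000_000)        # e.g. a server timestamp
    balls = list(range(1, 91))        # the numbers 1..90
    random.shuffle(balls)             # same seed -> same order

    print(balls[0])   # ball 1
    print(balls[1])   # ball 2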