Pseudorandom hash of two integers

I need an NxN matrix with 16-bit or 32-bit pseudorandom, uniformly distributed numbers over the whole range of values. N is unfortunately very large (at least 1e6), so the matrix cannot be pregenerated (that would take about a TB of memory). The only viable option I can think of is using a hash of my indices i and j as matrix elements.
There are plenty of integer hash functions available, but I am not sure which ones are suitable for two reasons.
-Only 32-bit unsigned integer operations are available. Since N is at least 2^20, I cannot naively concatenate the two indices into one 32-bit key without creating unnecessary collisions.
-Pseudorandomness is important here; I am not building a hashtable. Most integer hashes I found are designed for hashtables and don't have very strong randomness requirements.
A possible solution would be taking a cryptographic hash like SHA-2, but performance is important and that is just too expensive.
Any suggestion on how to combine two 32-bit uints into one while avoiding collision patterns would already help a great deal, since I could then pick from the whole range of 32-bit to 32-bit hashes.
Some insight on which 32-bit to 32-bit hashes have good randomness would also be much appreciated.
Pregenerating one or two arrays of N random numbers is no problem if it helps.
In case it matters, the target is GPUs, and I am writing in recent versions of GLSL.

What about using an LCG? It is a well-known fact that in the form
x_{n+1} = (a*x_n + c) mod 2^32, where c is odd and a mod 4 is 1 (a mod 8 = 5 is a common choice), the resulting congruential sequence will have period 2^32.
Numerical Recipes suggests a = 1664525, c = 1013904223, but there are tons of them.
Form a unique x from i and j, and compute x_{n+1} from it.
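For illustration, here is a minimal sketch of that suggestion in Python (Python rather than GLSL just for readability). The constants are the Numerical Recipes values above; the particular way i and j are packed into x is my own illustrative choice and is only collision-free while both indices fit in 16 bits.

    A = 1664525
    C = 1013904223
    MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wrap-around

    def lcg_step(x):
        """One step of x_{n+1} = (a*x_n + c) mod 2^32."""
        return (A * x + C) & MASK32

    def hash_ij(i, j):
        """Pseudorandom 32-bit value for matrix element (i, j)."""
        x = ((i << 16) ^ j) & MASK32  # pack the two indices into one 32-bit seed
        x = lcg_step(x)               # a couple of LCG steps to scramble it
        x = lcg_step(x)
        return x

    print(hash_ij(12345, 67890))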

I found a suitable algorithm. Block ciphers in counter mode are obviously suitable. I initially rejected the idea because of the performance implications of most block ciphers. However, I found a paper that introduces a related algorithm (basically a block cipher with fewer rounds) called Philox (Parallel Random Numbers: As Easy as 1, 2, 3 by Salmon et al.).
Link: http://www.thesalmons.org/john/random123/papers/random123sc11.pdf
The only problem left is how to combine the two indices into one 32-bit number, but I guess XOR should be good enough if combined with a rotation to avoid commutativity.
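A minimal Python sketch of that combination step; the rotation amount of 16 bits is an arbitrary illustrative choice, not something prescribed above.

    def rotl32(x, r):
        """Rotate a 32-bit value left by r bits."""
        x &= 0xFFFFFFFF
        return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

    def combine(i, j):
        """Combine the two indices into one 32-bit word, non-commutatively."""
        return rotl32(i, 16) ^ (j & 0xFFFFFFFF)

    assert combine(3, 5) != combine(5, 3)  # the rotation breaks the symmetry of plain XOR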

Related

Random number from many other random numbers, is it more random?

We want to generate a uniform random number from the interval [0, 1].
Let's first generate k random booleans (for example by rand() < 0.5) and decide, according to these, which subinterval [m*2^{-k}, (m+1)*2^{-k}] the number will fall into. Then we use one rand() to get the final output as m*2^{-k} + rand()*2^{-k}.
Let's assume we have arbitrary precision.
Will a random number generated this way be 'more random' than the usual rand()?
PS. I guess the subinterval picking amounts to just choosing the binary representation of the output 0.b_1 b_2 b_3 ... one digit b_i at a time, and the final step is appending the representation of rand() to the end of the output.
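For concreteness, a minimal Python sketch of the construction described above; random.random() stands in for rand().

    import random

    def layered_uniform(k):
        """Pick a subinterval using k random bits, then place rand() inside it."""
        m = 0
        for _ in range(k):
            m = (m << 1) | (random.random() < 0.5)  # one random bit per level
        return (m + random.random()) / 2**k          # m*2^{-k} + rand()*2^{-k}

    print(layered_uniform(8))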
It depends on the definition of "more random". If you use more random generators, it means more random state, and it means that the cycle length will be greater. But cycle length is just one property of random generators. A cycle length of 2^64 is usually OK for almost any purpose (the only exception I know of is when you need a lot of different, long sequences, as in some kinds of simulation).
However, if you combine two bad random generators, they don't necessarily become better; you have to analyze it. But there are generators which do work this way: KISS, for example, combines three not-too-good generators, and the result is a good generator.
For card shuffling, you'll need a cryptographic RNG. Even a very good, but not cryptographic, RNG is inadequate for this purpose. For example, the Mersenne Twister, which is a good RNG, is not suitable for secure card shuffling! This is because, by observing output numbers, it is possible to figure out its internal state, so the shuffle result can be predicted.
This can help, but only if you use a different pseudorandom generator for the first and last bits. (It doesn't have to be a different pseudorandom algorithm, just a different seed.)
If you use the same generator, then you will still only be able to construct 2^n different shuffles, where n is the number of bits in the random generator's state.
If you have two generators, each with n bits of state, then you can produce up to a total of 2^(2n) different shuffles.
Tinkering with a random number generator, as you are doing by using only one bit of random space and then calling iteratively, usually weakens its random properties. All RNGs fail some statistical tests for randomness, but you are more likely to find that a noticeable cycle crops up if you start making many calls and combining them.

Obtaining a k-wise independent hash function

I need to use a hash function which belongs to a family of k-wise independent hash functions. Any pointers to a library or toolkit in C, C++ or Python which can generate a set of k-wise independent hash functions from which I can pick a function would be appreciated.
Background: I am trying to implement this algorithm here: http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/knw10b.pdf for the Distinct Elements problem.
I have looked at this thread: Generating k pairwise independent hash functions, which mentions using Murmur hash to generate a pairwise independent hash function. I was wondering if there is anything similar for k-wise independent hash functions. If there is none available, would it be possible for me to construct such a set of k-wise independent hash functions?
Thanks in advance.
The simplest k-wise independent hash function (mapping a positive integer x < p to one of m buckets) is just
h(x) = ((a_{k-1}*x^{k-1} + ... + a_1*x + a_0) mod p) mod m
where p is some big random prime (2^61 - 1 will work)
and the a_i are random positive integers less than p, with a_0 > 0.
A 2-wise independent hash:
h(x) = (a*x + b) % p % m
Again, p is prime, a > 0, and a, b < p (i.e. a can't be zero but b can be; both are random choices).
These formulas define families of hash functions. They work (in theory) if you select a hash function randomly from the corresponding family (i.e. if you generate random a's and b) each time you run your algorithm.
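As a concrete illustration, here is a minimal Python sketch of drawing one member from the polynomial family above (for k = 2 this is essentially the (a*x + b) form). The prime 2^61 - 1 and the use of random.randrange are illustrative choices.

    import random

    P = 2**61 - 1  # big prime, as suggested above

    def make_kwise_hash(k, m):
        """Draw one member of the family h(x) = ((a_{k-1}*x^{k-1} + ... + a_0) mod p) mod m."""
        coeffs = [random.randrange(1, P)] + [random.randrange(P) for _ in range(k - 1)]  # a_0 > 0
        def h(x):
            acc = 0
            for a in reversed(coeffs):  # Horner's rule, highest-degree coefficient first
                acc = (acc * x + a) % P
            return acc % m
        return h

    h = make_kwise_hash(4, 1024)  # one 4-wise independent function into 1024 buckets
    print(h(42), h(43))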
There is no such thing as "a k-wise independent hash function". However, there are k-wise independent families of functions.
As a reminder, a family of functions is k-wise independent when, if h is picked randomly from the family, x_1 .. x_k are distinct, and y_1 .. y_k are picked arbitrarily, the probability that "for all i, h(x_i) = y_i" is Y^{-k}, where Y is the size of the co-domain from which the y_i were selected.
There are a few families of functions that are known to be k-wise independent for small k like 2, 3, 4, and 5. For arbitrary k, you will likely need to use polynomial hashing. Note that there are two variants of this, one of which is not even 2-independent, so be careful when implementing it.
The polynomial hash family can hash from a field F to itself using k constants a_0 through a_{k-1} and is defined by the sum of a_i x^i, where x is the key you are hashing. Field arithmetic can be implemented on your computer by letting F be the integers modulo a prime p. That's probably not convenient, as it is often better to have the domain and range be uint32_t or the like. In that case you can use the field F_{2^32}, with polynomial multiplication over Z_2 followed by division by an irreducible polynomial in that field. Otherwise, you can operate in Z_p where p is larger than 2^32 (or 2^64) and take the result of the polynomial mod 2^32, I think. That will only be almost k-wise independent, but sometimes that's sufficient for the analysis to go through. It will not be easy to re-analyze the KNW algorithm to change its hash families.
To generate a member of a k-wise independent family, use your favorite random number generator to pick the function randomly. In the case of polynomial hashing, that means picking the a's referenced above. /dev/random should suffice.
The paper you point to, "An Optimal Algorithm for the Distinct Elements Problem", is a nice one and has been cited many times. However, it is not easy to implement, and it may be slower or even take more space than HyperLogLog, due to hidden constants in the big-O notation. A number of papers have noted the complexity of this algorithm and even called it infeasible compared to HyperLogLog. If you want to implement an estimator for the number of distinct elements, you might start with an earlier algorithm. There is plenty of complexity there if your goal is education. If your goal is practicality, you also want to stay away from KNW, because it could be a lot of work just to make something less practical than HyperLogLog.
As another piece of advice, you should probably ignore the suggestions to "just use Murmur hash" or "pick k values from xxhash" if you want to learn about and understand this algorithm or other randomized algorithms that use hashing. Murmur/xx might be fine in practice, but they are not k-wise independent families, and some of the advice on this page is not even semantically well-formed. For instance, "if you need k different hash values, just re-use the same algorithm k times, with k different seeds" isn't relevant to k-wise independent families. For the algorithm you want to implement, you'll end up applying the hash functions an arbitrary number of times. You don't need k different hash functions; you need n different hash values, generated by first picking randomly from a k-independent hash family and then applying the chosen function to the streaming keys that are the input to algorithms like this.
This is one of many solutions, but you could use for example the following open-source hash algorithm:
https://github.com/Cyan4973/xxHash
Then, to generate different hashes, you just have to provide different seeds.
Considering the main function declaration:
unsigned int XXH32 (const void* input, int len, unsigned int seed);
So if you need k different hash values, just re-use the same algorithm k times, with k different seeds.
Just use a good non-cryptographic hash function. This advice will perhaps make me unpopular with my colleagues in theoretical computer science, but consider your adversary.
Nature. Yeah, maybe it'll hit the minuscule fraction of inputs that cause your hash function to behave badly, but there are plenty of other ways for things to go wrong that a k-wise independent hash family won't fix (e.g., the random number generator that chose the hash function didn't do a good job, bugs, etc.), so you need to test end-to-end anyway.
Oblivious adversary. This is what the theory assumes. Oblivious adversaries cannot look at your random bits. If only they were so nice in real life!
Non-oblivious adversary. Randomness is pointless. Use a binary tree.
I'm not 100% sure what you mean by "k-wise independent hash functions", but you can get k distinct hash functions by coming up with two hash functions, and then using linear combinations of them.
I have an example in my bloom filter module: http://stromberg.dnsalias.org/svn/bloom-filter/trunk/bloom_filter_mod.py Ignore the get_bitno_seed_rnd function; look at hash1, hash2 and get_bitno_lin_comb.
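For reference, here is a minimal Python sketch of that "linear combinations of two hashes" idea (the hash1 + i*hash2 pattern the answer points at); the two base hashes below are illustrative stand-ins built from hashlib, not the ones in the linked module.

    import hashlib

    def hash1(key):
        return int.from_bytes(hashlib.md5(key).digest()[:8], "big")

    def hash2(key):
        return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

    def derived_hashes(key, k, num_buckets):
        """k bucket indices built as hash1 + i*hash2, for i = 0..k-1."""
        h1, h2 = hash1(key), hash2(key)
        return [(h1 + i * h2) % num_buckets for i in range(k)]

    print(derived_hashes(b"example", 5, 1024))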

Generate code from id using a 1-1 function

Is there any good invertible 1-1 function that maps an integer to another integer?
For example, given the range 0-5, I want to find one that maps:
0->3
1->2
2->4
3->5
4->1
5->0
Also, the mapping should look random.
You can fill an array in ascending order and shuffle it. This will usually perform reasonably well, though it is not the most memory-efficient option.
You can also rely on a closed discrete transformation, such as modular exponentiation. If you have two numbers, P and K, then (I think) as long as K is prime and P is a primitive root modulo K, P^n mod K will produce a non-repeating, pseudorandom sequence of values of length K - 1, ranging from 1 to K - 1. This particular manifestation of discrete math is one of the premises of cryptography. Going backwards from the sequence to the exponent is known as the discrete logarithm problem and is the reason schemes built on it, like Diffie-Hellman, are secure.
You asked for a reversible algorithm. If you keep track of the exponent, you can go from P^n mod K to P^(n-1) mod K without much difficulty. You can take a few shortcuts to go backwards from power to exponent that don't work in cryptography, because certain parameters of the algorithm are intentionally discarded there to make it harder.
That said, if you happen to break Diffie-Hellman by solving the discrete log problem while you're working on this, be sure to let me know.
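To make the idea concrete, a tiny Python sketch; K = 7 (prime) and P = 3 (a primitive root mod 7) are illustrative choices, not part of the original answer.

    K = 7
    P = 3

    # P^n mod K for n = 1..K-1 visits every value in 1..K-1 exactly once.
    sequence = [pow(P, n, K) for n in range(1, K)]
    print(sequence)  # [3, 2, 6, 4, 5, 1] -- a permutation of 1..6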
How about permutation polynomials? See section 3 in this article: http://webstaff.itn.liu.se/~stegu/jgt2012/article.pdf The technique is used for noise there, but it looks exactly like what you want.
It suggests constructing functions of the form (A*x^2 + B*x) mod M. Only a small subset of those functions are invertible / produce permutations, but it shouldn't be hard to find the actual inverse if it exists.
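A minimal Python sketch that checks whether a given (A, B, M) actually yields a permutation; A = 34, B = 1, M = 289 are, to the best of my recollection, the constants used by the permutation polynomial in the linked noise article, so verify them against the paper before relying on them.

    def perm_poly(x, A=34, B=1, M=289):
        return (A * x * x + B * x) % M

    M = 289
    values = {perm_poly(x) for x in range(M)}
    print(len(values) == M)  # True if the polynomial permutes 0..M-1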
Something similar to this was discussed in Non-repetitive random seek in a range Algorithm. I was intrigued enough to put some ideas down at http://www.mcdowella.demon.co.uk/PermutationFromHash.html
You can generate such a permutation using a block cipher, without having to hold the entire thing in memory (as you would if you were to shuffle the list). I wrote a blog post about it some time ago, which you can find here.

Computing entropy/disorder

Given an ordered sequence of around a few thousand 32 bit integers, I would like to know how measures of their disorder or entropy are calculated.
What I would like is to be able to calculate a single value of the entropy for each of two such sequences and be able to compare their entropy values to determine which is more (dis)ordered.
I am asking here, as I think I may not be the first with this problem and would like to know of prior work.
Thanks in advance.
UPDATE #1
I have just found this answer, which looks great, but it would give the same entropy even if the integers were sorted. It only gives a measure of the entropy of the individual ints in the list and disregards their (dis)order.
Entropy calculation generally:
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
Furthermore, you have to sort your integers, then iterate over the sorted list to find the frequency of each value. Afterwards, you can use the formula.
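A minimal Python sketch of that frequency-based calculation (a Counter replaces the sort-and-scan step but yields the same frequencies):

    import math
    from collections import Counter

    def shannon_entropy(values):
        """H = -sum(p_i * log2(p_i)) over the frequencies of the distinct values."""
        counts = Counter(values)
        total = len(values)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(shannon_entropy([1, 1, 2, 3, 3, 3]))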
I think I'll have to code a Shannon entropy in 2D. Arrange the list of 32-bit ints as a series of 8-bit bytes and do a Shannon entropy calculation on that; then, to cover how ordered they may be, take the bytes eight at a time and form a new list of bytes composed of bit 0 of each of the eight, followed by bit 1 of each of the eight ... bit 7 of the eight; then the next 8 original bytes ..., ...
I'll see how it goes/codes...
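In case it helps, a minimal Python sketch of the bit rearrangement described above: split the 32-bit ints into bytes, then, for each group of eight bytes, build new bytes from bit 0 of each, bit 1 of each, and so on. Details such as byte order and dropping a final partial group are my own illustrative choices.

    def transpose_bits(values):
        data = b"".join(v.to_bytes(4, "big") for v in values)  # 32-bit ints as bytes
        out = []
        for g in range(0, len(data) - len(data) % 8, 8):        # full groups of 8 bytes
            group = data[g:g + 8]
            for bit in range(8):
                b = 0
                for byte in group:
                    b = (b << 1) | ((byte >> bit) & 1)          # collect this bit from each byte
                out.append(b)
        return bytes(out)

    print(transpose_bits([0xDEADBEEF, 0x01234567, 3, 4]).hex())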
Entropy is a function on probabilities, not data (arrays of ints, or files). Entropy is a measure of disorder, but when the function is modified to take data as input it loses this meaning.
The only true way one can generate a measure of disorder for data is to use Kolmogorov complexity. This has problems too; in particular, it's uncomputable and not strictly well defined, as one must arbitrarily pick a base language. This lack of well-definedness can be resolved if the disorder one is measuring is relative to something that is going to process the data. So when considering compression on a particular computer, the base language would be the assembly for that computer.
So you could define the disorder of an array of integers as follows:
The length of the shortest program written in Assembly that outputs the array.

Shuffling a huge range of numbers using minimal storage

I've got a very large range/set of numbers, (1..1236401668096), that I would basically like to 'shuffle', i.e. randomly traverse without revisiting the same number. I will be running a web service, and each time a request comes in it will increment a counter and pull the next 'shuffled' number from the range. The algorithm will have to accommodate the server going offline, being able to restart traversal using the persisted value of the counter (something like how you can seed a pseudo-random number generator and get the same pseudo-random number given the seed and which iteration you are on).
I'm wondering if such an algorithm exists or is feasible. I've seen the Fisher-Yates shuffle, but the first step is to "write down the numbers from 1 to N", which would take terabytes of storage for my entire range. Generating a pseudo-random number for each request might work for a while, but as the database/tree gets full, collisions will become more common and could degrade performance (already a 0.08% chance of collision after 1 billion hits, according to my calculation). Is there a more ideal solution for my scenario, or is this just a pipe dream?
The reason for the shuffling is that being able to correctly guess the next number in the sequence could lead to a minor DoS vulnerability in my app, and also that the presentation layer will look much nicer with a wider number distribution (I'd rather not go into details about exactly what the app does). At this point I'm considering just using a PRNG and dealing with collisions, or shuffling range slices (starting with (1..10000000).to_a.shuffle, then (10000001..20000000).to_a.shuffle, etc. as each range's numbers start to run out).
Any mathemagicians out there have any better ideas/suggestions?
Concatenate a PRNG or LFSR sequence with /dev/random bits
There are several algorithms that can generate pseudo-random numbers with arbitrarily large and known periods. The two obvious candidates are the LCPRNG (LCG) and the LFSR, but there are more algorithms such as the Mersenne Twister.
The period of these generators can be easily constructed to fit your requirements and then you simply won't have collisions.
You could deal with the predictable behavior of PRNGs and LFSRs by appending 10, 20, or 30 bits of cryptographically hashed entropy from an interface like /dev/random. Because the deterministic part of your number is known to be unique, it makes no difference if you ever repeat the actually-random part of it.
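A minimal Python sketch of that scheme: a full-period LCG supplies the unique, never-repeating part, and 20 bits from a secure source are appended so the result is hard to guess. The modulus 2^41 (which covers, and slightly exceeds, 1..1236401668096), the LCG constants, and the 20-bit width are illustrative choices.

    import secrets

    M = 2**41                        # power-of-two modulus covering the whole range
    A = 6364136223846793005 % M      # multiplier with A % 4 == 1, so the period is the full 2^41
    C = 1442695040888963407 % M | 1  # odd increment

    def next_unique(counter):
        """Deterministic, non-repeating part: the persisted counter run through the LCG."""
        return (A * counter + C) % M

    def next_ticket(counter):
        """Unique LCG value with 20 unpredictable bits appended."""
        return (next_unique(counter) << 20) | secrets.randbits(20)

    print(next_ticket(0), next_ticket(1))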
Divide and conquer? Break the range down into manageable chunks and shuffle those. You could divide the number range, e.g., by value modulo n. The list of groups is constructive and quite small, depending on n. Once a group is exhausted, you can use the next one.
For example, if you choose an n of 1000, you create 1000 different groups. Pick a random number between 1 and 1000 (let's call this x) and shuffle the numbers whose value modulo 1000 equals x. Once you have exhausted that range, you can choose a new random number between 1 and 1000 (excluding x, obviously) to get the next subset to shuffle. It shouldn't exactly be challenging to keep track of which numbers of the 1..1000 range have already been used, so you'd just need a repeatable shuffle algorithm for the numbers in the subset (e.g. Fisher-Yates on their "indices").
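A minimal Python sketch of that approach, with a deliberately tiny range so the lists stay small; the group count n and the seeded (hence repeatable) shuffle are illustrative choices. For the real 1..1236401668096 range, each modulo-n group would still need enough memory to hold roughly range/n numbers, so n has to be chosen accordingly.

    import random

    def shuffled_group(x, n, total, seed):
        """All numbers in 1..total congruent to x mod n, in a reproducible shuffled order."""
        start = x if x >= 1 else n              # smallest member of the group that is >= 1
        group = list(range(start, total + 1, n))
        random.Random(seed).shuffle(group)      # seeded, so it can be replayed after a restart
        return group

    print(shuffled_group(x=3, n=10, total=100, seed=42))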
I guess the best option is to use a GUID/UUID. They are made for this type of thing, and it shouldn't be hard to find an existing implementation to suit your needs.
While collisions are theoretically possible, they are extremely unlikely. To quote Wikipedia:
The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs

Resources