One way of testing the strength of a PRNG is to design tests that distinguish PRNG outputs from truly random strings: we are given a box which outputs either PRNG(u) for some u or a uniformly random string.
We have to determine whether the output comes from the PRNG.
Assume that a hash function H fails this test.
Does that mean it is not second-preimage resistant?
Assume a 256-bit cryptographic hash function h with all the properties expected of one.
Construct the function which, for any input string s, returns the first 255 bits of h(s) followed by the bit 0.
This function is easy to distinguish from random with good probability given enough samples: read as integers, its outputs are always even. But it is still hard to compute any sort of collision or preimage for it: it is 255-bit strong.
Resistance to collision or preimage is not an all-or-nothing question. There are gradations.
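A minimal sketch of this construction in Python, with SHA-256 standing in for h (the helper name and the choice of SHA-256 are mine, for illustration):

import hashlib

def h_even(s: bytes) -> bytes:
    # First 255 bits of SHA-256(s), followed by the bit 0.
    digest = int.from_bytes(hashlib.sha256(s).digest(), "big")
    return ((digest >> 1) << 1).to_bytes(32, "big")  # clear the last bit

# Every output is even when read as an integer, so a distinguisher that
# checks the last bit wins easily, yet collisions and preimages remain
# about 255-bit hard.
print(h_even(b"hello")[-1] % 2)  # always 0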
Related
We want to generate a uniform random number from the interval [0, 1].
Let's first generate k random booleans (for example by rand() < 0.5) and decide, according to these, which subinterval [m*2^{-k}, (m+1)*2^{-k}] the number falls in. Then we use one more rand() to get the final output as m*2^{-k} + rand()*2^{-k}.
Let's assume we have arbitrary precision.
Will a random number generated this way be 'more random' than the usual rand()?
PS. I guess the subinterval picking amounts to choosing the binary representation of the output, 0.b_1 b_2 b_3 ..., one digit b_i at a time, and the final step appends the representation of rand() to the end of the output.
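A small Python sketch of this construction (random.random() stands in for rand(); the function name is mine):

import random

def refined_rand(k: int) -> float:
    # Choose the subinterval [m*2^-k, (m+1)*2^-k] one random bit at a time.
    m = 0
    for _ in range(k):
        m = 2 * m + (1 if random.random() < 0.5 else 0)
    # Place the final value inside the chosen subinterval with one rand().
    return m * 2.0**-k + random.random() * 2.0**-k

print(refined_rand(16))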
It depends on the definition of "more random". If you use more random generators, you have more random state, which means the cycle length will be greater. But cycle length is just one property of a random generator. A cycle length of 2^64 is usually OK for almost any purpose (the only exception I know of is when you need a lot of different, long sequences, as in some kinds of simulation).
However, if you combine two bad random generators, they don't necessarily become better; you have to analyze the combination. But there are generators which do work this way. KISS is an example: it combines three not-too-good generators, and the result is a good generator.
For card shuffling, you'll need a cryptographic RNG. Even a very good but non-cryptographic RNG is inadequate for this purpose. For example, the Mersenne Twister, which is a good RNG, is not suitable for secure card shuffling: by observing output numbers it is possible to recover its internal state, so the shuffle result can be predicted.
This can help, but only if you use a different pseudorandom generator for the first and last bits. (It doesn't have to be a different pseudorandom algorithm, just a different seed.)
If you use the same generator, then you will still only be able to construct 2^n different shuffles, where n is the number of bits in the random generator's state.
If you have two generators, each with n bits of state, then you can produce up to a total of 2^(2n) different shuffles.
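As a concrete illustration of this bound (a quick back-of-the-envelope check, not part of the answer above): a standard 52-card deck has 52! orderings, so a generator needs roughly 226 bits of state before every shuffle is even reachable.

import math

# log2(52!) is the number of state bits needed to reach all shuffles
# of a 52-card deck.
print(math.log2(math.factorial(52)))  # ~225.58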
Tinkering with a random number generator, as you are doing by using only one bit of random space and then calling iteratively, usually weakens its randomness properties. All RNGs fail some statistical tests for randomness, but you are more likely to find that a noticeable cycle crops up if you start making many calls and combining them.
I don't mean a function that generates random numbers, but an algorithm to generate a random function.
"High dimension" means the function is multi-variable, e.g. a 100-dimensional function has 100 different variables.
Let's say the domain is [0,1]; we need to generate a function f: [0,1]^n -> [0,1]. This function is chosen from a certain class of functions, such that the probability of choosing any function in the class is the same.
(This class can consist of all continuous functions, or of functions with continuous K-th derivatives, whichever is convenient for the algorithm.)
Since the functions on a closed-interval domain are uncountably infinite, we only require the algorithm to be pseudo-random.
Is there a polynomial time algorithm to solve this problem?
I just want to add a possible algorithm to the question (not feasible, though, due to its exponential time complexity). The algorithm was proposed by the friend who actually brought up this question in the first place:
The algorithm can be described simply as follows, taking dimension d = 1 as an example. Consider smooth functions on the interval I = [a, b]. First, we split the domain [a, b] into N small intervals. For each interval I_i, we generate a random number f_i drawn from some specified distribution (Gaussian or uniform). Finally, we interpolate the series (a_i, f_i), where a_i is a characteristic point of I_i (e.g., we can choose a_i as the midpoint of I_i). After interpolation, we obtain a smooth curve, which can be regarded as a one-dimensional random function living in the function space C^m[a, b] (where m depends on the interpolation algorithm we choose).
This is just to say that the algorithm does not need to be that formal and rigorous, but simply to provide something that works.
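A one-dimensional Python sketch of this algorithm, assuming NumPy and SciPy are available (the cubic spline yields a C^2 curve; function and parameter names are mine):

import numpy as np
from scipy.interpolate import CubicSpline

def random_function_1d(a: float, b: float, n: int = 20):
    # Split [a, b] into n subintervals and take their midpoints as the a_i.
    edges = np.linspace(a, b, n + 1)
    midpoints = (edges[:-1] + edges[1:]) / 2
    # Draw a random f_i for each subinterval, here uniformly on [0, 1].
    values = np.random.uniform(0.0, 1.0, n)
    # Interpolate the series (a_i, f_i) into a smooth curve.
    return CubicSpline(midpoints, values)

f = random_function_1d(0.0, 1.0)
print(f(0.5))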
So if I get it right, you need a function returning a scalar from a vector.
The easiest way I see is to use a dot product.
For example, let n be the dimensionality you need.
Create a random vector a[n] containing random coefficients in the range [0,1]
such that the sum of all coefficients is 1:
create float a[n]
fill it with positive random numbers (no zeros)
compute the sum of all a[i]
divide each a[i] by this sum
Now the function y = f(x[n]) is simply
y = dot(a[n],x[n]) = a[0]*x[0] + a[1]*x[1] + ... + a[n-1]*x[n-1]
If I didn't miss something, the target range should be [0,1]:
if x == (0,0,0,...,0) then y = 0
if x == (1,1,1,...,1) then y = 1
If you need something more complex, use a higher-order polynomial,
something like y = dot(a0[n],x[n]) * dot(a1[n],x[n]^2) * dot(a2[n],x[n]^3) * ...
where x[n]^2 means (x[0]*x[0], x[1]*x[1], ...).
Both approaches result in a function with the same "direction":
if any x[i] rises then y rises too.
If you want to change that, then you have to allow negative values in a[] as well,
but to make that work you need to add some offset to y to shift it away from negative values,
and the a[] normalization process will be a bit more complex,
because you need to find the min and max values.
An easier option is to add a random flag vector m[n] to the process:
m[i] flags whether 1-x[i] should be used instead of x[i].
This way everything above stays as is.
You can create more types of mapping to make it even more varied.
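A Python sketch of the whole recipe, including the normalization and the flip mask (my transcription of the steps above, not the answerer's code):

import random

def make_random_function(n: int):
    # Random positive coefficients (no zeros), normalized to sum to 1.
    a = [random.uniform(1e-9, 1.0) for _ in range(n)]
    s = sum(a)
    a = [ai / s for ai in a]
    # Random flag vector: m[i] == 1 means use 1 - x[i] instead of x[i].
    m = [random.randint(0, 1) for _ in range(n)]
    def f(x):
        return sum(ai * (1.0 - xi if mi else xi)
                   for ai, xi, mi in zip(a, x, m))
    return f

f = make_random_function(3)
print(f([0.2, 0.7, 0.5]))  # always lands in [0, 1]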
This might not only be hard, but impossible if you actually want to be able to generate every continuous function.
For the one-dimensional case you might be able to create a useful approximation by looking into the Faber-Schauder system (also see wiki). This gives you a Schauder basis for the continuous functions on an interval. This kind of basis only covers the whole vector space if you include infinite linear combinations of basis vectors. Thus you can create some random functions by building random finite linear combinations from this basis, but in general you won't be able to create functions that are actually represented by an infinite number of basis vectors this way.
Edit in response to your update:
It seems like choosing a random polynomial function of order K (for the class of K-times differentiable functions) might be sufficient for you, since any of those functions can be approximated (around a given point) by such a polynomial (see Taylor's theorem). Choosing a random polynomial function is easy: just pick K+1 random real numbers as the coefficients of your polynomial. (Note that this will, for example, not return functions similar to abs(x).)
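For instance, a random polynomial of degree K in Python (the coefficient range is an arbitrary choice of mine):

import random

def random_polynomial(k: int, coeff_range: float = 1.0):
    # K+1 random real coefficients c_0 .. c_K.
    coeffs = [random.uniform(-coeff_range, coeff_range) for _ in range(k + 1)]
    def p(x):
        # Evaluate with Horner's scheme.
        result = 0.0
        for c in reversed(coeffs):
            result = result * x + c
        return result
    return p

p = random_polynomial(3)
print(p(0.5))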
The awk manual says srand "sets the seed (starting point) for rand()". I used srand(5) with the following code:
awk 'BEGIN {srand(5); while(1) print rand()}' > /var/tmp/rnd
It generates numbers like:
0.177399
0.340855
0.0256178
0.838417
0.0195347
0.29598
Can you explain how srand(5) generates the "starting point" with the above output?
The starting point is called the seed. It is given to the first iteration of the rand function. After that, rand uses the value it produced last time to generate the next number. Using a prime number for the seed is a good idea.
PRNGs (pseudo-random number generators) produce random-looking values by keeping some kind of internal state which can be advanced through a series of values whose repeating period is very large, and whose successive values show very few apparent statistical correlations as long as we use far fewer of them than the period. Nonetheless, the values form a deterministic sequence.
"Seeding" a PRNG is basically selecting what point in the deterministic sequence to start at. The algorithm will take the number passed as the seed and compute (in some algorithm-specific way) where to start in the sequence. The actual value of the seed is irrelevant--the algorithm should not depend on it in any way.
But although the seed value itself does not directly participate in the PRNG algorithm, it does uniquely identify the starting point in the sequence. So if you give a particular seed and then generate a sequence of values, seeding again with the same value will cause the PRNG to generate the same sequence of values.
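A toy linear congruential generator in Python makes this concrete (the constants are the well-known Numerical Recipes ones; this illustrates seeding in general, not awk's actual generator):

def lcg(seed: int):
    # Minimal linear congruential generator; the seed is just the
    # initial state, which fixes the whole deterministic sequence.
    state = seed
    while True:
        state = (1664525 * state + 1013904223) % 2**32
        yield state / 2**32  # scale to [0, 1)

g1, g2 = lcg(5), lcg(5)
print([next(g1) for _ in range(3)])
print([next(g2) for _ in range(3)])  # identical: same seed, same sequence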
I need to use a hash function which belongs to a family of k-wise independent hash functions. Any pointers to a library or toolkit in C, C++ or Python which can generate a set of k-wise independent hash functions from which I can pick one?
Background: I am trying to implement this algorithm here: http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/knw10b.pdf for the Distinct Elements problem.
I have looked at this thread: Generating k pairwise independent hash functions, which mentions using Murmur hash to generate a pairwise independent hash function. I was wondering if there is anything similar for k-wise independent hash functions. If there is none available, would it be possible for me to construct such a set of k-wise independent hash functions?
Thanks in advance.
The simplest k-wise independent hash function (mapping a positive integer x < p to one of m buckets) is just

h(x) = (a_{k-1}*x^{k-1} + ... + a_1*x + a_0) % p % m

where p is some big random prime (2^61 - 1 will work)
and the a_i are random positive integers less than p, with a_0 > 0.
A 2-wise independent hash:

h(x) = (a*x + b) % p % m

where again p is prime, a > 0, and a, b < p (i.e., a can't be zero, but b can be when the random draw happens to give zero).
These formulas define families of hash functions. They work (in theory) if you select a hash function at random from the corresponding family (i.e., if you generate random a's and b) each time you run your algorithm.
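A direct Python transcription of these formulas (my code, following the definitions above):

import random

P = 2**61 - 1  # a big prime, as suggested above

def make_k_wise_hash(k: int, m: int):
    # One member of the family: random coefficients a_0 .. a_{k-1},
    # with a_0 > 0, giving h(x) = (a_{k-1}*x^{k-1} + ... + a_0) % P % m.
    a = [random.randrange(1, P)] + [random.randrange(P) for _ in range(k - 1)]
    def h(x):
        acc = 0
        for c in reversed(a):  # Horner's scheme mod P
            acc = (acc * x + c) % P
        return acc % m
    return h

h = make_k_wise_hash(4, 1024)
print(h(12345))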
There is no such thing as "a k-wise independent hash function". However, there are k-wise independent families of functions.
As a reminder, a family of functions is k-wise independent when, if h is picked randomly from the family, x_1 .. x_k are distinct, and y_1 .. y_k are picked arbitrarily, the probability that "h(x_i) = y_i for all i" is Y^{-k}, where Y is the size of the codomain from which the y_i were selected.
There are a few families of functions that are known to be k-wise independent for small k like 2, 3, 4, and 5. For arbitrary k, you will likely need to use polynomial hashing. Note that there are two variants of this, one of which is not even 2-independent, so be careful when implementing it.
The polynomial hash family can hash from a field F to itself using k constants a_0 through a_{k-1}, and is defined by the sum of a_i * x^i, where x is the key you are hashing. Field arithmetic can be implemented on your computer by letting F be the integers modulo a prime p. That's probably not convenient, as it is often better to have the domain and range be uint32_t or the like. In that case you can use the field F_{2^32}: multiplication there is polynomial multiplication over Z_2 followed by reduction modulo an irreducible polynomial of degree 32. Otherwise, you can operate in Z_p where p is larger than 2^32 (or 2^64) and take the result of the polynomial mod 2^32, I think. That will only be almost k-wise independent, but sometimes that's sufficient for the analysis to go through. It will not be easy to re-analyze the KNW algorithm to change its hash families.
To generate a member of a k-wise independent family, use your favorite random number generator to pick the function randomly. In the case of polynomial hashing, that means picking the a_i's referenced above. /dev/random should suffice.
The paper you point to, "An Optimal Algorithm for the Distinct Elements Problem", is a nice one and has been cited many times. However, it is not easy to implement, and it may be slower or even take more space than HyperLogLog, due to hidden constants in the big-O notation. A number of papers have noted the complexity of this algorithm and even called it infeasible compared to HyperLogLog. If you want to implement an estimator for the number of distinct elements, you might start with an earlier algorithm. There is plenty of complexity there if your goal is education. If your goal is practicality, you also want to stay away from KNW, because it could be a lot of work just to make something less practical than HyperLogLog.
As another piece of advice, you should probably ignore the suggestions to "just use Murmur hash" or "pick k values from xxhash" if you want to learn about and understand this algorithm or other randomized algorithms that use hashing. Murmur/xx might be fine in practice, but they are not k-wise independent families, and some of the advice on this page is not even semantically well-formed. For instance, "if you need k different hash, just re-use the same algorithm k times, with k different seeds" isn't relevant to k-wise independent families. For the algorithm you want to implement, you'll end up applying the hash functions an arbitrary number of times. You don't need "k different hashes"; you need n different hash values, generated by first picking randomly from a k-independent hash family and then applying the chosen function to the streaming keys that are the input to algorithms like this.
This is one of many solutions, but you could use for example the following open-source hash algorithm:
https://github.com/Cyan4973/xxHash
Then, to generate different hashes, you just have to provide different seeds.
Consider the main function declaration:
unsigned int XXH32 (const void* input, int len, unsigned int seed);
So if you need k different hash values, just re-use the same algorithm k times, with k different seeds.
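For instance, with the xxhash Python bindings (assuming that package is installed; it exposes the same seed parameter as the C declaration above):

import xxhash  # pip install xxhash

data = b"some key"
# k different seeds give k different hash values for the same input.
print([xxhash.xxh32(data, seed=s).intdigest() for s in range(5)])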
Just use a good non-cryptographic hash function. This advice perhaps will make me unpopular with my colleagues in theoretical computer science, but consider your adversary.
Nature. Yeah, maybe it'll hit the minuscule fraction of inputs that cause your hash function to behave badly, but there are plenty of other ways for things to go wrong that a k-wise independent hash family won't fix (e.g., the random number generator that chose the hash function didn't do a good job, bugs, etc.), so you need to test end-to-end anyway.
Oblivious adversary. This is what the theory assumes. Oblivious adversaries cannot look at your random bits. If only they were so nice in real life!
Non-oblivious adversary. Randomness is pointless. Use a binary tree.
I'm not 100% sure what you mean by "k-wise independent hash functions", but you can get k distinct hash functions by coming up with two hash functions, and then using linear combinations of them.
I have an example in my bloom filter module: http://stromberg.dnsalias.org/svn/bloom-filter/trunk/bloom_filter_mod.py Ignore the get_bitno_seed_rnd function, look at hash1, hash2 and get_bitno_lin_comb
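The linear-combination trick looks roughly like this in Python; here the two base hashes are built from SHA-256 with different salts, which is my own simplification rather than the module's exact code:

import hashlib

def _base_hash(data: bytes, salt: bytes) -> int:
    return int.from_bytes(hashlib.sha256(salt + data).digest(), "big")

def make_hashes(k: int, m: int):
    # Derive k hash functions g_i(x) = (h1(x) + i*h2(x)) mod m
    # from just two base hashes h1 and h2.
    def hashes(data: bytes):
        h1 = _base_hash(data, b"\x01")
        h2 = _base_hash(data, b"\x02")
        return [(h1 + i * h2) % m for i in range(k)]
    return hashes

hs = make_hashes(5, 1 << 20)
print(hs(b"example"))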
Many randomized algorithms and data structures (such as the Count-Min Sketch) require hash functions with the pairwise independence property. Intuitively, this means that the probability of a hash collision with a specific element is small, even if the output of the hash function for that element is known.
I have found many descriptions of pairwise independent hash functions for fixed-length bitvectors based on random linear functions. However, I have not yet seen any examples of pairwise independent hash functions for strings.
Are there any families of pairwise independent hash functions for strings?
I'm pretty sure they exist, but there's a bit of measure-theoretic subtlety to your question. You might be better off asking on mathoverflow. I'm very rusty with this stuff, but I think I can show that, even if they do exist, you don't actually want one.
To begin with, you need a probability measure on the strings, and any such measure will necessarily look very different from any notion of "uniform." (It's a countable set and all the sigma-algebras over countable sets just clump together sets of elements and assign a probability to each of those sets. You'll want all of the clumps to be singletons.)
Now, if you only give finitely many strings positive probability, you're back in the finite case. So let's ignore that for now and assume that, for any epsilon > 0, you can find a string whose probability is strictly between 0 and epsilon.
Suppose we restrict to the case where the hash functions map strings to {0,1}.
Your family of hash functions will need to be infinite as well and you'll want to talk about it as a probability space of hash functions. If you have a set H of hash functions that has positive probability, then every string is mapped to both 0 and 1 by (different) elements of H. In particular, no single element of H has positive probability. So H has to be uncountable and you've suddenly run into difficult representability issues.
I'd be very happy if someone who hasn't forgotten measure theory would chime in here.
Not with a seed of bounded length and an output of nonzero bounded length.
A fairly crude argument to this effect: for a finite family of hash functions H, consider the map f taking a string x to the tuple of h(x) for every h in H. The codomain of each h, and hence of f, is finite, while there are infinitely many strings, so by the pigeonhole principle there exist two distinct strings mapped the same way by every h in H. Given that there are at least two possible hash values, this contradicts pairwise independence.