I am keen to try and implement minhashing to find near duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write up, but there is the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.
The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.
Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?
To quote from the blog
It is tough to get the error bar on our similarity estimate much
smaller than [7%] because of the way error bars on statistically
sampled values scale — to cut the error bar in half we would need four
times as many samples.
Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?
One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over. Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that) it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits. This problem does not occur with signed integers because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beat out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
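As a rough illustration of the XOR approach described above, here is a minimal Python sketch. The function names are hypothetical, and BLAKE2b truncated to 32 bits is only my stand-in for "a good hash algorithm"; any decent 32-bit hash should do. Python integers are non-negative here, so the comparison behaves like the unsigned case.

```python
import hashlib
import random

MASK_BITS = 32  # assumed width of the "good" hash


def base_hash(shingle: str) -> int:
    """One good 32-bit hash per shingle (BLAKE2b truncated to 4 bytes, an assumption)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def make_masks(count: int, seed: int = 0) -> list:
    """Mask 0 keeps the good hash itself; the other count-1 masks are random 32-bit values."""
    rng = random.Random(seed)
    return [0] + [rng.getrandbits(MASK_BITS) for _ in range(count - 1)]


def minhash_signature(shingles, masks) -> list:
    """For each mask, take the minimum of (good hash XOR mask) over all shingles."""
    hashes = [base_hash(s) for s in shingles]  # the expensive hash runs once per shingle
    return [min(h ^ m for h in hashes) for m in masks]


def estimate_similarity(sig_a, sig_b) -> float:
    """Fraction of signature positions on which the two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Usage would be something like `masks = make_masks(200)`, then one `minhash_signature` call per document with the same `masks`, and `estimate_similarity` on the two signatures.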
Pretty much.. but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.
That means that a reported measurement of 78% could easily come from only 50% similarity..
Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.
Mathematically, if you're reporting two digits the second should be meaningful.
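To put numbers on the scaling, here is a small sketch. It assumes (my assumption, not something the blog states) that each hash comparison behaves like an independent coin flip with success probability equal to the true similarity, and that the quoted error bar is about two standard deviations at 50% similarity. Under those assumptions 200 hashes give roughly 7% and 12 hashes give roughly 29%.

```python
import math


def error_bar(similarity: float, num_hashes: int, sigmas: float = 2.0) -> float:
    """Approximate error bar of a minhash estimate under a binomial model (an assumption)."""
    return sigmas * math.sqrt(similarity * (1.0 - similarity) / num_hashes)


# Quartering the number of hashes doubles the error bar.
for k in (12, 50, 200, 800):
    print(k, round(error_bar(0.5, k), 3))
```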
Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is, calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations, to emphasise certain factors/ bring certain bits to the front.
I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combine a rotation by some number of bits with an XOR by a randomly generated integer.
This both "spreads" the selectivity of the min() function around the bits, and as to what value min() ends up selecting for.
The rationale for rotation, is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all top bits are the same, do lower bits have any effect in the comparison.. so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.
The rationale for XOR is that, on its own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together -- thus an excessive tendency for them to end up selecting the same shingle, not independent shingles.
There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers -- where it would be obvious for unsigned. XOR must still be used in these circumstances, anyway.
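A minimal sketch of the rotate-then-XOR idea, written in Python with explicit 32-bit masking (Python has no fixed-width integers, so the masking is my addition; the names are hypothetical):

```python
import random

WORD = 32
FULL = (1 << WORD) - 1


def rotr32(value: int, shift: int) -> int:
    """Rotate a 32-bit value right by `shift` bits."""
    shift %= WORD
    return ((value >> shift) | (value << (WORD - shift))) & FULL


def make_transforms(count: int, seed: int = 0) -> list:
    """Each cheap 'hash function' is a (rotation amount, random XOR mask) pair."""
    rng = random.Random(seed)
    return [(rng.randrange(WORD), rng.getrandbits(WORD)) for _ in range(count)]


def apply_transform(base_hash: int, transform) -> int:
    """Rotate the good hash, then XOR it with the mask."""
    shift, mask = transform
    return rotr32(base_hash, shift) ^ mask
```

The good hash code is computed once per shingle; each of the 200 transforms is then just one rotate and one XOR.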
Java has 32-bit hashcodes builtin. And if you use Google Guava libraries, there are 64-bit hashcodes available.
Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.
What you want can be easily obtained from universal hashing. Popular textbooks like Cormen et al. have very readable information in section 11.3.3, pp. 265-268. In short, you can generate a family of hash functions using the following simple equation:
h(x,a,b) = ((ax+b) mod p) mod m
x is key you want to hash
a is any number you can choose between 1 and p-1 inclusive (for this mod-p form it does not have to be odd).
b is any number you can choose between 0 to p-1 inclusive.
p is a prime number that is greater than max possible value of x
m is a max possible value you want for hash code + 1
By selecting different values of a and b you can generate many hash codes that are independent of each other.
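A minimal Python sketch of this family, assuming 32-bit keys so that the Mersenne prime 2^61 - 1 is safely larger than any key (the choice of prime and the function names are mine):

```python
import random

P = (1 << 61) - 1  # a prime larger than any 32-bit key (assumed key size)


def make_hash_family(count: int, seed: int = 0) -> list:
    """Draw `count` random (a, b) pairs with 1 <= a <= P-1 and 0 <= b <= P-1."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(count)]


def universal_hash(x: int, a: int, b: int, m: int) -> int:
    """h(x, a, b) = ((a*x + b) mod p) mod m"""
    return ((a * x + b) % P) % m
```

For example, `[universal_hash(key, a, b, 1 << 16) for a, b in make_hash_family(200)]` gives 200 hash codes in the range 0..65535 for one key.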
An optimized version of this formula can be implemented as follows in C/C++/C#/Java:
(unsigned) (a*x+b) >> (w-M)
Here,
- w is size of machine word (typically 32)
- M is size of hash code you want in bits
- a is any odd integer that fits in to machine word
- b is any integer less than 2^(w-M)
Above works for hashing a number. To hash a string, get the hash code that you can get using built-in functions like GetHashCode and then use that value in above formula.
For example, let's say you need 200 16-bit hash code for string s, then following code can be written as implementation:
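public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    var machineWordSize = sizeof(int) * 8;                  // word size in bits (sizeof returns bytes)
    var hashCodeSize = machineWordSize / 2;                 // 16-bit hash codes
    var hashCodeSizeDiff = machineWordSize - hashCodeSize;  // w - M
    var hstart = (uint)s.GetHashCode();                     // unsigned, so >> is a logical shift
    var bmax = 1 << hashCodeSizeDiff;
    var rnd = new Random(seed);
    for (var i = 0; i < count; i++)
    {
        var a = (uint)(i * 2 + 1);                          // odd multiplier
        var b = (uint)rnd.Next(0, bmax);                    // 0 <= b < 2^(w-M)
        hashCodes[i] = (int)(unchecked(hstart * a + b) >> hashCodeSizeDiff);
    }
    return hashCodes;
}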
Notes:
I'm using a hash code word size of half the machine word size, which in most cases would be 16-bit. This is not ideal and has far more chance of collision. This can be improved by upgrading all arithmetic to 64-bit.
Normally you want to select a and b both randomly within above said ranges.
Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)
Check out this article for the state of the art practical and theoretical bounds. It has this nice graph (below), explaining why you probably want to use just one 2-independent hash function and save the k smallest values.
When estimating set sizes the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k) where f is the jaccard similarity and k is the number of values kept. So if you want error ε, you need k=1/(fε^2) or if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
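Here is a rough sketch of the "one hash, keep the k smallest values" idea in Python. BLAKE2b truncated to 64 bits is only my stand-in for the hash function (the article's analysis is for 2-independent hashing), and the names are hypothetical. Both sketches are assumed to be non-empty.

```python
import hashlib
import heapq


def hash64(item: str) -> int:
    """Single 64-bit hash per item (BLAKE2b truncated, an assumption)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k(items, k: int) -> set:
    """Keep only the k smallest hash values of the set."""
    return set(heapq.nsmallest(k, {hash64(x) for x in items}))


def estimate_jaccard(sketch_a: set, sketch_b: set, k: int) -> float:
    """Among the k smallest values of the union, count those present in both sketches."""
    union_smallest = heapq.nsmallest(k, sketch_a | sketch_b)
    shared = sum(1 for h in union_smallest if h in sketch_a and h in sketch_b)
    return shared / len(union_smallest)


def values_to_keep(similarity: float, relative_error: float) -> int:
    """k = 1 / (f * eps^2), rounded to the nearest integer."""
    return round(1.0 / (similarity * relative_error ** 2))
```

With similarity around 1/3 and a 10% relative error, `values_to_keep(1 / 3, 0.1)` gives the 300 mentioned above.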
It seems like another way to get N number of good hashed values would be to salt the same hash with N different salt values.
In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.
Likely not as cheap as XOR against N values, but seems like there's possibility for better quality results, at a minimal extra cost, especially if the data being hashed is much larger than the salt value.
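In Python, `hashlib` hash objects have a `copy()` method that clones the internal state, so the idea could be sketched like this (SHA-256 and the 64-bit truncation are arbitrary choices of mine):

```python
import hashlib


def salted_hashes(data: bytes, salts) -> list:
    """Hash `data` once, then finish a cloned hasher with each salt."""
    base = hashlib.sha256(data)       # the (possibly large) data is processed only once
    results = []
    for salt in salts:
        hasher = base.copy()          # clone the state after the data
        hasher.update(salt)           # the salt is applied second
        results.append(int.from_bytes(hasher.digest()[:8], "big"))
    return results
```

Each result equals the hash of `data + salt`, but only the short salt is processed per value.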
Suppose you are given a range and a few numbers in the range (exceptions). Now you need to generate a random number in the range except the given exceptions.
For example, if range = [1..5] and exceptions = {1, 3, 5} you should generate either 2 or 4 with equal probability.
What logic should I use to solve this problem?
If you have no constraints at all, I guess this is the easiest way: create an array containing the valid values, a[0]...a[m]. Return a[rand(0,...,m)].
If you don't want to create an auxiliary array, but you can count the number of exceptions e and of elements n in the original range, you can simply generate a random number r=rand(0 ... n-e), and then find the valid element with a counter that doesn't tick on exceptions, and stops when it's equal to r.
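The counter variant might look like this in Python (a sketch; the function name is hypothetical and `r` is taken as a 0-based index among the valid values):

```python
import random


def pick_with_counter(low: int, high: int, exceptions, rng=random) -> int:
    """Pick uniformly from [low, high] minus exceptions without building an array of valid values."""
    excluded = {e for e in exceptions if low <= e <= high}
    valid_count = (high - low + 1) - len(excluded)
    if valid_count <= 0:
        raise ValueError("no valid values in range")
    r = rng.randrange(valid_count)    # index of the valid value to return
    for value in range(low, high + 1):
        if value in excluded:
            continue                  # the counter does not tick on exceptions
        if r == 0:
            return value
        r -= 1
```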
Depends on the specifics of the case. For your specific example, I'd return a 2 if a Uniform(0,1) was below 1/2, 4 otherwise. Similarly, if I saw a pattern such as "the exceptions are odd numbers", I'd generate values for half the range and double. In general, though, I'd generate numbers in the range, check if they're in the exception set, and reject and re-try if they were - a technique known as acceptance/rejection for obvious reasons. There are a variety of techniques to make the exception-list check efficient, depending on how big it is and what patterns it may have.
Let's assume, to keep things simple, that arrays are indexed starting at 1, and your range runs from 1 to k. Of course, you can always shift the result by a constant if this is not the case. We'll call the array of exceptions ex_array, and let's say we have c exceptions. These need to be sorted, which shall turn out to be pretty important in a while.
Now, you only have k-c useful numbers to work with, so it'll be meaningful to find a random number in the range 1 to k-c. Say we end up with the number r. Now, we just need to find the r-th valid number in your array. Simple? Not so much. Remember, you can never simply walk over any of your arrays in a linear fashion, because that can really slow down your implementation when you have a lot of numbers. You have to do some sort of binary search, say, to come up with a fast enough algorithm.
So let's try something better. The r-th number would nominally have lied at index r in your original array had you had no exceptions. The number at index r is r, of course, since your range and your array indices start from 1. But, you have a bunch of invalid numbers between 1 and r, and you want to somehow get to the r-th valid number. So, lets do a binary search on the array of exceptions, ex_array, to find how many invalid numbers are equal to or less than r, because we have these many invalid numbers lying between 1 and r. If this number is 0, we're all done, but if it isn't, we have a bit more work to do.
Assume you found there were n invalid numbers between 1 and r after the binary search. Let's advance n indices in your array to the index r+n, and find the number of invalid numbers lying between 1 and r+n, using a binary search to find how many elements in ex_array are less than or equal to r+n. If this number is exactly n, no more invalid numbers were encountered, and you've hit upon your r-th valid number. Otherwise, repeat again, this time for the index r+n', where n' is the number of invalid numbers that lay between 1 and r+n.
Repeat till you get to a stage where no excess exceptions are found. The important thing here is that you never once have to walk over any of the arrays in a linear fashion. You should optimize the binary searches so they don't always start at index 0. Say you know there are n invalid numbers between 1 and r. Instead of starting your next binary search from 1, you could start it from one index after the index corresponding to n in ex_array.
In the worst case, you'll be doing binary searches for each element in ex_array, which means you'll do c binary searches, the first starting from index 1, the next from index 2, and so on, which gives you a time complexity of O(log(c!)). Now, Stirling's approximation tells us that O(ln(x!)) = O(xln(x)), so using the algorithm above only makes sense if c is small enough that O(cln(c)) < O(k), since you can achieve O(k) complexity using the trivial method of extracting valid elements from your array first.
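A sketch of this in Python using the standard `bisect` module. The exception list is 0-indexed here even though the range itself still starts at 1, and the function names are hypothetical:

```python
import random
from bisect import bisect_right


def rth_valid(r: int, ex_sorted: list) -> int:
    """Return the r-th (1-based) number of 1..k that is not in the sorted exception list."""
    n = 0  # number of exceptions known to be <= the current candidate
    while True:
        # Count exceptions <= r + n; the search can start at index n,
        # because the first n exceptions are already known to qualify.
        n_next = bisect_right(ex_sorted, r + n, lo=n)
        if n_next == n:
            return r + n              # no new exceptions were found, so this is the answer
        n = n_next


def pick_excluding(k: int, exceptions, rng=random) -> int:
    """Pick uniformly from 1..k minus the exceptions."""
    ex_sorted = sorted({e for e in exceptions if 1 <= e <= k})
    r = rng.randint(1, k - len(ex_sorted))
    return rth_valid(r, ex_sorted)
```

For the range 1..10 with exceptions 2, 5, 8 and 9, the r-th valid numbers for r = 1..6 come out as 1, 3, 4, 6, 7 and 10.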
In Python the solution is very simple (given your example):
import random
rng = set(range(1, 6))
ex = {1, 3, 5}
random.choice(list(rng-ex))
To optimize the solution, one needs to know how long is the range and how many exceptions there are. If the number of exceptions is very low, it's possible to generate a number from the range and just check if it's not an exception. If the number of exceptions is dominant, it probably makes sense to gather the remaining numbers into an array and generate random index for fetching non-exception.
In this answer I assume that it is known how to get an integer random number from a range.
Here's another approach...just keep on generating random numbers until you get one that isn't excluded.
Suppose your desired range was [0,100) excluding 25,50, and 75.
Put the excluded values in a hashtable or bitarray for fast lookup.
int randNum = rand(0,100);
while( excludedValues.contains(randNum) )
{
randNum = rand(0,100);
}
The complexity analysis is more difficult, since potentially rand(0,100) could return 25, 50, or 75 every time. However that is quite unlikely (assuming a random number generator), even if half of the range is excluded.
In the above case, we re-generate a random value for only 3/100 of the original values.
So 3% of the time you regenerate once. Of those 3%, only 3% will need to be regenerated, etc.
Suppose the initial range is [1,n] and the exclusion set's size is x. First generate a map from [1, n-x] to the numbers [1,n] excluding the numbers in the exclusion set. This mapping is 1-1 since there are equal numbers on both sides. In the example given in the question the mapping will be as follows - {1->2,2->4}.
Another example: suppose the list is [1,10] and the exclusion list is [2,5,8,9]; then the mapping is {1->1, 2->3, 3->4, 4->6, 5->7, 6->10}. This map can be created in a worst case time complexity of O(nlogn).
Now generate a random number between [1, n-x] and map it to the corresponding number using the mapping. Map lookups can be done in O(logn).
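A short Python sketch of the mapping. Note that it uses a `dict` (a hash map, so average O(1) lookups and O(n) construction) rather than the tree-based map the complexities above seem to assume; the names are hypothetical:

```python
import random


def build_mapping(n: int, excluded) -> dict:
    """Map 1..n-x onto the numbers of 1..n that are not excluded, in increasing order."""
    skip = set(excluded)
    valid = [v for v in range(1, n + 1) if v not in skip]
    return {index: value for index, value in enumerate(valid, start=1)}


def pick_from_mapping(mapping: dict, rng=random) -> int:
    """Draw a key uniformly from 1..n-x and return the number it maps to."""
    return mapping[rng.randint(1, len(mapping))]
```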
You can do it in a versatile way if you have enumerators or set operations. For example using Linq:
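// LINQPad-style snippet; needs System, System.Collections.Generic and System.Linq
void Main()
{
    var exceptions = new[] { 1, 3, 5 };
    // Filter an endless random stream and print the first 10 valid values.
    var valid = RandomSequence(1, 5).Where(n => !exceptions.Contains(n)).Take(10);
    foreach (var value in valid)
        Console.WriteLine(value);
}
static Random r = new Random();
IEnumerable<int> RandomSequence(int min, int max)
{
    while (true)                            // endless; the caller limits it with Take
        yield return r.Next(min, max + 1);  // upper bound of Next is exclusive
}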
I would like to acknowledge some comments that are now deleted:
It's possible that this program never ends (only theoretically) because there could be a sequence that never contains valid values. Fair point. I think this is something that could be explained to the interviewer, however I believe my example is good enough for the context.
The distribution is fair because each of the elements has the same chance of coming up.
The advantage of answering this way is that you show understanding of modern "functional-style" programming, which may be interesting to the interviewer.
The other answers are also correct. This is a different take on the problem.
Given a bit array of fixed length and the number of 0s and 1s it contains, how can I arrange all possible combinations such that returning the i-th combination takes the least possible time?
It is not important the order in which they are returned.
Here is an example:
array length = 6
number of 0s = 4
number of 1s = 2
possible combinations (6! / 4! / 2!)
000011 000101 000110 001001 001010
001100 010001 010010 010100 011000
100001 100010 100100 101000 110000
problem
1st combination = 000011
5th combination = 001010
9th combination = 010100
With a different arrangement such as
100001 100010 100100 101000 110000
001100 010001 010010 010100 011000
000011 000101 000110 001001 001010
it shall return
1st combination = 100001
5th combination = 110000
9th combination = 010100
Currently I am using an O(n) algorithm which tests for each bit whether it is a 1 or 0. The problem is I need to handle lots of very long arrays (in the order of 10000 bits), and so it is still very slow (and caching is out of the question). I would like to know if you think a faster algorithm may exist.
Thank you
I'm not sure I understand the problem, but if you only want the i-th combination without generating the others, here is a possible algorithm:
There are C(M,N)=M!/(N!(M-N)!) combinations of N bits set to 1 whose highest set bit is at position M or lower.
You want the i-th: you iteratively increment M until C(M,N)>=i
while( C(M,N) < i ) M = M + 1
That will tell you the highest bit that is set.
Of course, you compute the combination iteratively with
C(M+1,N) = C(M,N)*(M+1)/(M+1-N)
Once found, you have a problem of finding (i-C(M-1,N))th combination of N-1 bits, so you can apply a recursion in N...
Here is a possible variant with D=C(M+1,N)-C(M,N), and i=i-1 to make it start at zero
SOL=0
i=i-1
while(N>0)
    M=N
    C=1
    D=1
    while(i>=D)
        i=i-D
        M=M+1
        D=N*C/(M-N)
        C=C+D
    SOL=SOL+(1<<(M-1))
    N=N-1
RETURN SOL
This will require large integer arithmetic if you have that many bits...
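Since Python integers are arbitrary precision, the variant above can be transcribed almost directly. This is just a sketch (the function name is mine); it returns the combination as an integer whose binary digits are the bit array, and the caller has to make sure that i does not exceed the number of combinations for the array length in use.

```python
def ith_combination(i: int, ones: int) -> int:
    """Return the i-th (1-based) integer with exactly `ones` bits set, in increasing order."""
    i -= 1                  # make the index start at zero
    result = 0
    n = ones
    while n > 0:
        m = n               # candidate position (1-based) of the highest remaining set bit
        c = 1               # C(m, n)
        d = 1               # number of combinations whose highest bit is at position m
        while i >= d:
            i -= d
            m += 1
            d = n * c // (m - n)   # C(m, n) - C(m-1, n); the division is always exact
            c += d
        result |= 1 << (m - 1)
        n -= 1
    return result


# With 2 ones this reproduces the first ordering in the question
# (1st = 000011, 5th = 001010, 9th = 010100).
print(format(ith_combination(5, 2), "06b"))
```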
If the ordering doesn't matter (it just needs to remain consistent), I think the fastest thing to do would be to have combination(i) return anything you want that has the desired density the first time combination() is called with argument i. Then store that value in a member variable (say, a hashmap that has the value i as key and the combination you returned as its value). The second time combination(i) is called, you just look up i in the hashmap, figure out what you returned before and return it again.
Of course, when you're returning the combination for argument(i), you'll need to make sure it's not something you have returned before for some other argument.
If the number you will ever be asked to return is significantly smaller than the total number of combinations, an easy implementation for the first call to combination(i) would be to make a value of the right length with all 0s, randomly set num_ones of the bits to 1, and then make sure it's not one you've already returned for a different value of i.
Your problem appears to be constrained by the binomial coefficient. In the example you give, the problem can be translated as follows:
there are 6 items that can be chosen 2 at a time. By using the binomial coefficient, the total number of unique combinations can be calculated as N! / (K! (N - K)!), which for the case of K = 2 simplifies to N(N-1)/2. Plugging 6 in for N, we get 15, which is the same number of combinations that you calculated with 6! / 4! / 2! - which appears to be another way to calculate the binomial coefficient that I have never seen before. I have tried other combinations as well and both formulas generate the same number of combinations. So, it looks like your problem can be translated to a binomial coefficient problem.
Given this, it looks like you might be able to take advantage of a class that I wrote to handle common functions for working with the binomial coefficient:
- Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
- Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
- Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
- Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to the language of your choice.
There may be some limitations since you are using a very large N that could end up creating larger numbers than the program can handle. This is especially true if K can be large as well. Right now, the class is limited to the size of an int. But, it should not be hard to update it to use longs.
I am looking to enumerate a random permutation of the numbers 1..N in fixed space. This means that I cannot store all numbers in a list. The reason for that is that N can be very large, more than available memory. I still want to be able to walk through such a permutation of numbers one at a time, visiting each number exactly once.
I know this can be done for certain N: Many random number generators cycle through their whole state space randomly, but entirely. A good random number generator with state size of 32 bit will emit a permutation of the numbers 0..(2^32)-1. Every number exactly once.
I want to get to pick N to be any number at all and not be constrained to powers of 2 for example. Is there an algorithm for this?
The easiest way is probably to just create a full-range PRNG for a larger range than you care about, and when it generates a number larger than you want, just throw it away and get the next one.
Another possibility that's pretty much a variation of the same would be to use a linear feedback shift register (LFSR) to generate the numbers in the first place. This has a couple of advantages: first of all, an LFSR is probably a bit faster than most PRNGs. Second, it is (I believe) a bit easier to engineer an LFSR that produces numbers close to the range you want, and still be sure it cycles through the numbers in its range in (pseudo)random order, without any repetitions.
Without spending a lot of time on the details, the math behind LFSRs has been studied quite thoroughly. Producing one that runs through all the (nonzero) numbers in its range without repetition simply requires choosing a set of "taps" that correspond to a primitive polynomial. If you don't want to search for that yourself, it's pretty easy to find tables of known ones for almost any reasonable size (e.g., doing a quick look, the wikipedia article lists them for size up to 19 bits).
If memory serves, there's at least one primitive polynomial of every possible bit size. That translates to the fact that in the worst case you can create a generator that has roughly twice the range you need, so on average you're throwing away (roughly) every other number you generate. Given the speed of an LFSR, I'd guess you can do that and still maintain quite acceptable speed.
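As a concrete sketch, here is a 16-bit Galois LFSR in Python using the well-known feedback mask 0xB400 (taps 16, 14, 13, 11), which cycles through all 65535 nonzero 16-bit values. It is fixed at 16 bits purely to keep the example short, so for small N it discards far more than half of the values; a real implementation would pick a register width just large enough for N. The function names are hypothetical.

```python
def lfsr16_states(start: int = 0xACE1):
    """Yield every nonzero 16-bit value exactly once (Galois LFSR, mask 0xB400)."""
    if not 1 <= start <= 0xFFFF:
        raise ValueError("start must be a nonzero 16-bit value")
    state = start
    while True:
        yield state
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB400       # apply the feedback taps
        if state == start:        # full cycle completed
            return


def permutation_1_to_n(n: int, start: int = 0xACE1):
    """Walk a pseudorandom permutation of 1..n (n <= 65535) in constant space."""
    if not 1 <= n <= 0xFFFF:
        raise ValueError("this sketch only supports 1 <= n <= 65535")
    for value in lfsr16_states(start):
        if value <= n:            # throw away values outside the wanted range
            yield value
```

The start value only changes where in the cycle the walk begins, not the order itself.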
One way to do it would be
Find a prime p larger than N, preferably not much larger.
Find a primitive root of unity g modulo p, that is, a number 1 < g < p such that g^k ≡ 1 (mod p) if and only if k is a multiple of p-1.
Go through g^k (mod p) for k = 1, 2, ..., ignoring the values that are larger than N.
For every prime p, there are φ(p-1) primitive roots of unity, so it works. However, it may take a while to find one. Finding a suitable prime is much easier in general.
For finding a primitive root, I know nothing substantially better than trial and error, but one can increase the probability of a fast find by choosing the prime p appropriately.
Since the number of primitive roots is φ(p-1), if one randomly chooses r in the range from 1 to p-1, the expected number of tries until one finds a primitive root is (p-1)/φ(p-1), hence one should choose p so that φ(p-1) is relatively large, that means that p-1 must have few distinct prime divisors (and preferably only large ones, except for the factor 2).
Instead of randomly choosing, one can also try in sequence whether 2, 3, 5, 6, 7, 10, ... is a primitive root, of course skipping perfect powers (or not, they are in general quickly eliminated), that should not affect the number of tries needed greatly.
So it boils down to checking whether a number x is a primitive root modulo p. If p-1 = q^a * r^b * s^c * ... with distinct primes q, r, s, ..., x is a primitive root if and only if
x^((p-1)/q) % p != 1
x^((p-1)/r) % p != 1
x^((p-1)/s) % p != 1
...
thus one needs a decent modular exponentiation (exponentiation by repeated squaring lends itself well for that, reducing by the modulus on each step). And a good method to find the prime factor decomposition of p-1. Note, however, that even naive trial division would be only O(√p), while the generation of the permutation is Θ(p), so it's not paramount that the factorisation is optimal.
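A minimal Python sketch of the whole procedure, using naive trial division for both the primality test and the factorisation (fine for illustration, as argued above). It simply takes the smallest primitive root instead of a random one, which is my simplification, so the order is always the same for a given N. The function names are hypothetical.

```python
def prime_factors(n: int) -> set:
    """Distinct prime factors of n by trial division."""
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def is_prime(n: int) -> bool:
    """Primality test by trial division."""
    return n >= 2 and prime_factors(n) == {n}


def is_primitive_root(x: int, p: int, factors: set) -> bool:
    """x is a primitive root mod p iff x^((p-1)/q) != 1 for every prime q dividing p-1."""
    return all(pow(x, (p - 1) // q, p) != 1 for q in factors)


def permutation_via_primitive_root(n: int):
    """Yield 1..n in the order g^1, g^2, ... mod p, skipping values larger than n."""
    p = max(n + 1, 3)
    while not is_prime(p):
        p += 1
    factors = prime_factors(p - 1)
    g = next(x for x in range(2, p) if is_primitive_root(x, p, factors))
    value = 1
    for _ in range(p - 1):
        value = value * g % p     # g^k mod p, computed incrementally
        if value <= n:
            yield value
```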
Another way to do this is with a block cipher; see this blog post for details.
The blog post links to the paper Ciphers with Arbitrary Finite Domains which contains a bunch of solutions.
Consider the prime 3. To fully express all possible outputs, think of it this way...
bias + step mod prime
The bias is just an offset bias. step is an accumulator (if it's 1 for example, it would just be 0, 1, 2 in sequence, while 2 would result in 0, 2, 4) and prime is the prime number we want to generate the permutations against.
For example. A simple sequence of 0, 1, 2 would be...
0 + 0 mod 3 = 0
0 + 1 mod 3 = 1
0 + 2 mod 3 = 2
Modifying a couple of those variables for a second, we'll take bias of 1 and step of 2 (just for illustration)...
1 + 2 mod 3 = 0
1 + 4 mod 3 = 2
1 + 6 mod 3 = 1
You'll note that we produced an entirely different sequence. No number within the set repeats itself and all numbers are represented (it's bijective). Each unique combination of bias and step will result in one of the prime! possible permutations of the set. In the case of a prime of 3 you'll see that there are 6 different possible permutations:
0,1,2
0,2,1
1,0,2
1,2,0
2,0,1
2,1,0
If you do the math on the variables above you'll note that it results in the same information requirements...
1/3! = 1/6 = 0.166..
... vs...
1/3 (bias) * 1/2 (step) => 1/6 = 0.166..
Restrictions are simple: bias must be within 0..P-1 and step must be within 1..P-1 (in my own work I have just been using 0..P-2 and adding 1 during the arithmetic). Other than that, it works with all prime numbers no matter how large and will permute all possible unique sets of them without the need for memory beyond a couple of integers (each technically requiring slightly fewer bits than the prime itself).
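As a sketch in Python (the function name is hypothetical, and the multiplier k is taken to run from 0 to prime-1, which reproduces the first example above):

```python
def affine_permutation(prime: int, bias: int, step: int):
    """Yield (bias + step * k) mod prime for k = 0 .. prime-1."""
    if not (0 <= bias < prime and 1 <= step < prime):
        raise ValueError("need 0 <= bias < prime and 1 <= step < prime")
    for k in range(prime):
        yield (bias + step * k) % prime


print(list(affine_permutation(3, 0, 1)))   # the simple sequence 0, 1, 2
```

Only the two integers `bias` and `step` (plus the counter) have to be kept in memory, regardless of how large the prime is.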
Note carefully that this generator is not meant to be used to generate sets that are not prime in number. It's entirely possible to do so, but not recommended for security sensitive purposes as it would introduce a timing attack.
That said, if you would like to use this method to generate a set sequence that is not a prime, you have two choices.
First (and the simplest/cheapest), pick the prime number just larger than the set size you're looking for and have your generator simply discard anything that doesn't belong. Once more, danger, this is a very bad idea if this is a security sensitive application.
Second (by far the most complicated and costly), you can recognize that all numbers are composed of prime numbers and create multiple generators that then produce a product for each element in the set. In other words, an n of 6 would involve all possible prime generators that could match 6 (in this case, 2 and 3), multiplied in sequence. This is both expensive (although mathematically more elegant) as well as also introducing a timing attack so it's even less recommended.
Lastly, if you need a generator for bias and or step... why don't you use another of the same family :). Suddenly you're extremely close to creating true simple-random-samples (which is not easy usually).
The fundamental weakness of LCGs (x=(x*m+c)%b style generators) is useful here.
If the generator is properly formed then x%f is also a repeating sequence of all values lower than f (provided f is a factor of b).
Since b is usually a power of 2 this means that you can take a 32-bit generator and reduce it to an n-bit generator by masking off the top bits and it will have the same full-range property.
This means that you can reduce the number of discard values to be fewer than N by choosing an appropriate mask.
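A sketch of this in Python. The multiplier and increment are the widely published "Numerical Recipes" constants for a modulus of 2^32 (my choice, any full-period pair should work); the multiplier is 1 mod 4 and the increment is odd, so the low n bits also have full period.

```python
def lcg_permutation(n: int, seed: int = 0):
    """Yield 0..n-1 exactly once each, using the low bits of a full-period LCG."""
    bits = max(2, (n - 1).bit_length())   # smallest mask that covers 0..n-1
    mask = (1 << bits) - 1
    a, c = 1664525, 1013904223            # assumed constants (a % 4 == 1, c odd)
    x = seed & mask
    for _ in range(1 << bits):            # one full period of the masked generator
        if x < n:                         # discard values outside the range
            yield x
        x = (a * x + c) & mask            # same as masking off the top bits of the 32-bit LCG
```

As noted, the seed only selects the starting point within one fixed cycle.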
Unfortunately an LCG is a poor generator for exactly the same reason as given above.
Also, this has exactly the same weakness as I noted in a comment on @JerryCoffin's answer. It will always produce the same sequence and the only thing the seed controls is where to start in that sequence.
Here's some SageMath code that should generate a random permutation the way Daniel Fischer suggested:
def random_safe_prime(lbound):
    while True:
        q = random_prime(lbound, lbound=lbound // 2)
        p = 2 * q + 1
        if is_prime(p):
            return p, q

def random_permutation(n):
    p, q = random_safe_prime(n + 2)
    while True:
        r = randint(2, p - 1)
        if pow(r, 2, p) != 1 and pow(r, q, p) != 1:
            i = 1
            while True:
                x = pow(r, i, p)
                if x == 1:
                    return
                if 0 <= x - 2 < n:
                    yield x - 2
                i += 1
Given a pseudorandom number generator int64 rand64(), I would like to build a set of pseudo random numbers. This set should have the property that the XOR combinations of each subset should not result in the value 0.
I'm thinking of following algorithm:
count = 0
set = {}
while (count < desiredSetSize)
    set[count] = rand64()
    if propertyIsNotFullfilled(set[0] to set[count])
        continue
    count = count + 1
The question is: How can propertyIsNotFullfilled be implemented?
Notes: The reason why I would like to generate such a set is the following: I have a hash table where the hash values are generated via Zobrist hashing. Instead of keeping a boolean value for each hash table entry indicating if the entry is filled, I thought the hash value – which is stored with each entry – is sufficient for this information (0 ... empty, != 0 ... set). There is another reason to carry this information as a sentinel value inside the hash-key-table. I'm trying to switch from an AoS (Array of Structure) to a SoA (Structure of Array) memory layout. I'm trying this to avoid padding and to test if there are fewer cache misses. I hope in most cases the access to the hash-key-table is enough (implied that the hash value provides the information if the entry is empty or not).
I also thought about reserving the most significant bit of the hash values for this information but this would reduce the area of possible hash values more than is necessary. Theoretically the area would be reduced from 2^64 (minus the sentinel 0-value) to 2^63.
One can read the question in the other way: Given a set of 84 pseudorandom numbers, is there any number which can't be generated by XORing any subset of this set, and how to get it? This number can be used as sentinel value.
Now, for what I need it: I have developed a connect four game engine. There are 6 x 7 moves possible for player A and also for player B. Thus there are 84 possible moves (therefore 84 random values needed). The hash value of a board-state is generated by the precalculated random values in the following manner: hash(board) = randomset[move1] XOR randomset[move2] XOR randomset[move3] ...
This set should have the property that the XOR combinations of each subset should not result in the value 0.
IMHO this would restrict the maximum size of the set to 64 (Pigeonhole principle); for a set of more than 64 values, there will always be a (non-empty) subset that XORs to zero. For smaller sets, the property can be fulfilled.
To further illustrate my point: consider a system of 64 equations over 64 unknown variables. Then, add one extra equation. The fact that the equations and variables are booleans does not make the problem different.
--EDIT/UPDATE--: Since the application appears to be the game "connect-four", you could instead enumerate all possible configurations. Not being able to code the impossible board configurations will save enough coding space to fit any valid board position in 64 bits:
Encoding the colored stones as {A,B}, and irrelevant as {X}, the configuration of a (height=6) column can be one of:
                   X
                X  X
             X  X  X
          X  X  X  X
       X  X  X  X  X
 _  A  A  A  A  A  A   <<-- possible configurations for one pile
--+--+--+--+--+--+--+
 1  1  2  4  8 16 32   <<-- number of combinations of the Xs
               -2 -5   <<-- number of impossible Xs
(and similar for B instead of A). The numbers below the piles are the number of possibilities for the Xs on top, the negative numbers the number of forbidden/impossible configurations. For the column with one A and 4 Xs, every value for the Xs is valid, *except 3*A (the game would already have ended). The same for the rightmost pile: the bottom 3Xs cannot be all A, and X cannot be B for all the Xs.
This leads to a total of 1 + 2 * (63-7) := 113.
(1 is for the empty board, 2 is the number of colors). So: 113 is the number of configurations for one column, fitting well within 7 bit. For 7 columns we'll need 7*7:=49 bits. (we might save one bit for the L/R mirror symmetry, maybe even one for the color symmetry, but that would only complicate things, IMHO).
There will still be a lot of coding space wasted (the columns are not independent, the number of As on the board is equal to the number of Bs, or one more, etc), but I don't think it would be easy to avoid that. Fortunately, it will not be necessary.
To amplify wildplasser: every hash function that can be used to distinguish every n-bit string from every other n-bit string cannot have output shorter than n bits. Shorter hash functions are usable because we only have to avoid collisions in the strings that actually arrive, but we cannot hope to make an intelligent choice offline. Just use a cryptographically-secure RNG and one of two things will happen: (i) your code will work as though the RNG were truly random or (ii, unlikely) your code will break and (if it's not bugged) it will act as a distinguisher between the crypto RNG and true randomness, bringing you fame and notoriety.
Amplifying the answer by wildplasser a little bit more, here is an idea how to implement propertyIsNotFullfilled.
Represent the set of pseudo-random numbers as a {0,1}-matrix. Perform Gaussian elimination (use XOR instead of the usual multiply/subtract operations). If you get a matrix where the last row is zero, return true, otherwise false.
Definitely, this function will return true very frequently when size of the set is close to 64. So algorithm in OP is efficient only for relatively small sizes.
To optimize this algorithm, you can keep the result of last Gaussian elimination.
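One possible way to write this in Python is to keep a reduced XOR basis indexed by the highest set bit, which amounts to Gaussian elimination over GF(2) done one row at a time. The snake_case name is my rendering of propertyIsNotFullfilled, and the basis dictionary plays the role of the kept elimination result.

```python
def property_is_not_fulfilled(values) -> bool:
    """Return True if some non-empty subset of `values` XORs to 0."""
    basis = {}                        # highest set bit -> reduced row with that leading bit
    for v in values:
        while v:
            high = v.bit_length() - 1
            if high not in basis:
                basis[high] = v       # v is independent of the rows so far: keep it
                break
            v ^= basis[high]          # eliminate the leading bit
        if v == 0:
            return True               # v reduced to a zero row, so the set is dependent
    return False
```

In the loop from the question, `basis` could be kept between calls so that each new candidate from rand64() costs at most 64 XORs; with 64-bit values the function necessarily returns True once more than 64 values are present.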