Given a pseudorandom number generator int64 rand64(), I would like to build a set of pseudorandom numbers. This set should have the property that XORing any non-empty subset of it never results in the value 0.
I'm thinking of following algorithm:
count = 0
set = {}
while (count < desiredSetSize)
    set[count] = rand64()
    if propertyIsNotFullfilled(set[0] to set[count])
        continue
    count = count + 1
The question is: How can propertyIsNotFullfilled be implemented?
Notes: The reason I want to generate such a set is the following: I have a hash table where the hash values are generated via Zobrist hashing. Instead of keeping a boolean value with each hash table entry indicating whether the entry is filled, I thought the hash value, which is stored with each entry, would be sufficient for this information (0 ... empty, != 0 ... set). There is another reason to carry this information as a sentinel value inside the hash-key table: I'm trying to switch from an AoS (Array of Structures) to a SoA (Structure of Arrays) memory layout, to avoid padding and to test whether there are fewer cache misses. I hope that in most cases access to the hash-key table is enough (provided that the hash value tells whether the entry is empty or not).
I also thought about reserving the most significant bit of the hash values for this information, but that would reduce the range of possible hash values more than is necessary. Theoretically, the range would shrink from 2^64 (minus the sentinel 0-value) to 2^63.
One can read the question another way: given a set of 84 pseudorandom numbers, is there any number which can't be generated by XORing any subset of this set, and how do we find it? This number could be used as the sentinel value.

Now, for what I need it: I have developed a connect four game engine. There are 6 x 7 moves possible for player A and likewise for player B. Thus there are 84 possible moves (therefore 84 random values are needed). The hash value of a board state is generated from the precalculated random values in the following manner: hash(board) = randomset[move1] XOR randomset[move2] XOR randomset[move3] ...
This set should have the property that the XOR combinations of each subset should not result in the value 0.
IMHO this restricts the maximum size of the set to 64 (pigeonhole principle): for more than 64 numbers, there will always be a (non-empty) subset that XORs to zero. For smaller sets, the property can be fulfilled.
To further illustrate my point: consider a system of 64 equations over 64 unknown variables. Then, add one extra equation. The fact that the equations and variables are booleans does not make the problem different.
--EDIT/UPDATE--: Since the application appears to be the game "connect-four", you could instead enumerate all possible configurations. Not being able to code the impossible board configurations will save enough coding space to fit any valid board position in 64 bits:
Encoding the colored stones as {A,B} and the irrelevant cells as {X}, the configuration of a (height=6) column can be one of:
X
X X
X X X
X X X X
X X X X X
_ A A A A A A <<-- possible configurations for one pile
--+--+--+--+--+--+--+
1 1 2 4 8 16 32 <<-- number of combinations of the Xs
-2 -5 <<-- number of impossible Xs
(and similarly for B instead of A). The numbers below the piles are the number of possibilities for the Xs on top; the negative numbers are the number of forbidden/impossible configurations. For the column with one A and 4 Xs, every value for the Xs is valid, except 3*A (the game would already have ended). The same for the rightmost pile: the bottom 3 Xs cannot all be A, and the Xs cannot all be B.
This leads to a total of 1 + 2 * (63-7) := 113.
(1 is for the empty board; 2 is the number of colors.) So: 113 is the number of configurations for one column, fitting well within 7 bits. For 7 columns we'll need 7*7 := 49 bits. (We might save one bit using the L/R mirror symmetry, maybe even one for the color symmetry, but that would only complicate things, IMHO.)

There will still be a lot of coding space wasted (the columns are not independent, the number of As on the board is equal to the number of Bs or one more, etc.), but I don't think it would be easy to avoid that. Fortunately, it will not be necessary.
To amplify wildplasser: any hash function that can be used to distinguish every n-bit string from every other n-bit string cannot have output shorter than n bits. Shorter hash functions are usable because we only have to avoid collisions among the strings that actually arrive, but we cannot hope to make an intelligent choice offline. Just use a cryptographically secure RNG and one of two things will happen: (i) your code will work as though the RNG were truly random, or (ii, unlikely) your code will break and (if it's not bugged) it will act as a distinguisher between the crypto RNG and true randomness, bringing you fame and notoriety.
Amplifying the answer by wildplasser a little bit more, here is an idea how to implement propertyIsNotFullfilled.
Represent the set of pseudo-random numbers as a {0,1}-matrix. Perform Gaussian elimination (use XOR instead of usual multiply/subtract operations). If you get matrix where the last row is zero, return true, otherwise false.
Definitely, this function will return true very frequently when the size of the set is close to 64, so the algorithm in the OP is efficient only for relatively small sizes.

To optimize this algorithm, you can keep the result of the last Gaussian elimination between calls.
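A minimal Python sketch of this idea (the function name `has_zero_xor_subset` and the incremental basis representation keyed by highest set bit are my own; the answer itself only describes the matrix formulation):

```python
def has_zero_xor_subset(numbers):
    """True if some non-empty subset of `numbers` XORs to 0, i.e. the
    numbers are linearly dependent as vectors over GF(2)."""
    basis = {}  # highest set bit -> basis vector, kept in echelon form
    for v in numbers:
        while v:
            high = v.bit_length() - 1
            if high not in basis:
                basis[high] = v  # v is independent of everything seen so far
                break
            v ^= basis[high]     # eliminate v's highest bit and keep reducing
        else:
            return True          # v reduced to 0: a zero-XOR subset exists
    return False
```

Keeping `basis` alive between candidate numbers is exactly the optimization mentioned: each new candidate costs at most 64 reduction steps.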
Related
Given a list of input binary numbers A and a list of output binary numbers B, seek one value X that satisfies the bitwise AND for every pair, i.e. A AND X = B, where each number in A has 6 set bits, each number in B has zero to 6 set bits, and X has 12 set bits. All numbers in A, B and X are 128 bits long.
Similar to this question: Most efficient method of generating a random number with a fixed number of bits set but I also need that random number to produce known binary results when bitwise-and-ed with a known set of binary numbers.
A naive idea: Generate a random 128 bit number X with 12 bits set, then test it against all the numbers in A to see if it generates B. If it doesn't, shuffle the bits in X and try again.
I know there must be a better algorithm.
Clarification: B is the same as A, except that some or all of the bits set (1s) in A have now been randomly set to 0 in B.
Update on my naive idea:

I just figured out that whenever a bit in A is 1, the corresponding bit in X will be equal to the corresponding bit in B, since 1 AND xBit = bBit. When the bit in A is 0, the xBit is unknown (0 or 1). So for each number in A, I can obtain a string X made up of 1s, 0s and unknowns y, e.g. yy00011y10010y10... Then I could compare all these strings X to one another and find a "fit" for a unique X by replacing the ys with 1s and 0s. I'm not sure if this is the best way, but it is probably better than guessing and checking. Any thoughts on this, or on how to find such a fit? Thanks
You can encode the constraint in a Binary Decision Diagram, apply Algorithm C from The Art of Computer Programming volume 4A (which counts the number of solutions for each BDD node), and then generate random solutions from that.
For each node, you can go either "high" or "low" depending on whether you choose that variable to be 1 or 0. Both ways have some number of solutions, let's call them #H and #L. In order to randomly choose such that each solution is equally likely, choose high (ie set this variable to 1) with probability #H / (#H + #L) and low with 1 minus that probability. Keep doing this until you're in a sink node. This will work in general, not just this problem.
The BDD can be constructed by intersecting (algorithm for this is given in TAOCP but of course exists in every BDD library) a BDD that encodes the A & X = B constraint (trivial - just a linear BDD that forces some bits constant and omits all unconstrained bits) with a BDD that forces the popcnt of X to be 12 (this looks like a 12 x 128 "grid" of nodes with some extra chains attached where every additional set bit goes to the Bottom sink). I suppose the total BDD could be constructed immediately too, instead of in two steps, but that sounds a bit annoying to work out.
A non-random solution can also be generated more directly, first put in the up to 6 mandatory bits:
X = B // assuming A & B = B
Then put the remaining 1's anywhere "not harmful",
R = (1 << (12 - popcnt(X))) - 1 // right number of 1's, wrong place
X |= _pdep_u64(R, ~A);
Anywhere that A does not constrain X is fine. The bits fill up bottom to top, but if we don't care about randomness then that's fine. Also note that I used _pdep_u64, not something like _pdep_u128, but that's OK because the low 64 bits always have room for the up to 12 bits: A has only 6 bits set there, so there are at least 58 places for the leftover bits to go.
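A Python rendering of the same construction, with a software stand-in for the `_pdep_u64` intrinsic (the helper names and the `width` parameter are my own):

```python
def pdep(value, mask):
    """Software model of the x86 PDEP instruction: deposit the low bits
    of `value` into the set-bit positions of `mask`, lowest first."""
    result, bit = 0, 0
    while mask:
        low = mask & -mask            # lowest set bit still available in mask
        if value >> bit & 1:
            result |= low
        mask ^= low
        bit += 1
    return result

def solve(A, B, ones=12, width=128):
    X = B                                         # mandatory bits (assuming A & B == B)
    R = (1 << (ones - bin(X).count("1"))) - 1     # right number of 1s, wrong place
    X |= pdep(R, ~A & ((1 << width) - 1))         # put them where A doesn't constrain X
    return X
```

The leftover 1s land in the lowest unconstrained positions, matching the "fill up bottom to top" behavior described above.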
I just found a solution, following on from my update to the naive idea above.

Set all unknown y values in the X strings to 0, then bitwise-OR all the X results together to get a final value of X that satisfies them all.
Since I am checking only 6 bits in A, it works, but this probably wouldn't scale well in another scenario.
It seems simplistic, and I have tested it on a number of sets. It works well, but I am not sure if it is the best way.
I am keen to try to implement minhashing to find near-duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write-up, but it leaves open the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.
The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.
Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?
To quote from the blog:

It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale -- to cut the error bar in half we would need four times as many samples.
Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7% * 2 * 2)?
One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
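A sketch of that scheme in Python (the function names and the fixed seed are my own choices; mask 0 is included so the good hash itself survives as the first value):

```python
import random

def make_xor_masks(count, bits=32, seed=0xC0FFEE):
    """One zero mask plus count-1 fixed random masks, generated once."""
    rng = random.Random(seed)  # fixed seed: the same masks on every run
    return [0] + [rng.getrandbits(bits) for _ in range(count - 1)]

MASKS = make_xor_masks(200)

def cheap_hashes(good_hash):
    """Turn one good 32-bit hash into 200 hash values by XOR."""
    return [good_hash ^ m for m in MASKS]
```

Each of the 200 derived hash functions is then min-reduced over a document's shingles independently.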
Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over.

Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that), it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.

This problem does not occur with signed integers, because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with the minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beaten out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
Pretty much.. but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.
That means that a reported measurement of 78% could easily come from only 50% similarity..
Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.
Mathematically, if you're reporting two digits the second should be meaningful.
Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is, calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations, to emphasise certain factors/ bring certain bits to the front.
I recommend combining bitwise rotation (or shuffling) with an XOR operation. Each hash function can combine rotation by some number of bits with XORing by a randomly generated integer.

This both "spreads" the selectivity of the min() function around the bits and varies which value min() ends up selecting.

The rationale for rotation is that "min(Int)" will, 255 times out of 256, select only within the 8 most significant bits. Only if all the top bits are the same do the lower bits have any effect on the comparison, so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.

The rationale for XOR is that, on its own, bitwise rotation (ROTR) can, 50% of the time (when 0 bits are shifted in from the left), converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together -- and thus an excessive tendency for them to end up selecting the same shingle rather than independent shingles.

There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers, where it would be obvious for unsigned. XOR must still be used in these circumstances anyway.
Java has 32-bit hashcodes builtin. And if you use Google Guava libraries, there are 64-bit hashcodes available.
Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.
What you want can be easily obtained from universal hashing. Popular textbooks like Cormen et al. have very readable coverage in section 11.3.3, pp. 265-268. In short, you can generate a family of hash functions using the following simple equation:
h(x,a,b) = ((ax+b) mod p) mod m
- x is the key you want to hash
- a is any number you can choose between 1 and p-1 inclusive
- b is any number you can choose between 0 and p-1 inclusive
- p is a prime number greater than the max possible value of x
- m is the max possible value you want for the hash code, plus 1
By selecting different values of a and b you can generate many hash codes that are independent of each other.
An optimized version of this formula can be implemented as follows in C/C++/C#/Java:
(unsigned) (a*x+b) >> (w-M)
Here,
- w is size of machine word (typically 32)
- M is size of hash code you want in bits
- a is any odd integer that fits in to machine word
- b is any integer less than 2^(w-M)
The above works for hashing a number. To hash a string, first get a hash code using built-in functions like GetHashCode, then use that value in the above formula.

For example, let's say you need 200 16-bit hash codes for a string s. Then the following code can be written as an implementation:
public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    var machineWordSize = 8 * sizeof(int);      // 32 bits (sizeof returns bytes)
    var hashCodeSize = machineWordSize / 2;     // 16-bit hash codes
    var hashCodeSizeDiff = machineWordSize - hashCodeSize;
    var hstart = s.GetHashCode();
    var bmax = 1 << hashCodeSizeDiff;
    var rnd = new Random(seed);
    for (var i = 0; i < count; i++)
    {
        // a = i*2 + 1 is odd; b = rnd.Next(0, bmax) is less than 2^(w-M);
        // the cast to uint makes the shift unsigned, matching the formula
        hashCodes[i] = (int)(((uint)(hstart * (i * 2 + 1)) + (uint)rnd.Next(0, bmax)) >> hashCodeSizeDiff);
    }
    return hashCodes;
}
Notes:
I'm using a hash code size of half the machine word size, which in most cases would be 16 bits. This is not ideal and has a far greater chance of collision. This can be improved by upgrading all arithmetic to 64-bit.

Normally you would select a and b both randomly within the above-stated ranges.
Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)
Check out this article for the state of the art practical and theoretical bounds. It has this nice graph (below), explaining why you probably want to use just one 2-independent hash function and save the k smallest values.
When estimating set sizes, the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k), where f is the Jaccard similarity and k is the number of values kept. So if you want error ε, you need k = 1/(f ε^2). For example, if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
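A small Python sketch of the bottom-k scheme with a single hash function (using truncated SHA-1 as a stand-in for the 2-independent hash; all names are my own):

```python
import hashlib

def h(item):
    """Stand-in hash: first 8 bytes of SHA-1, as an integer."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

def bottom_k(items, k):
    """Sketch of a set: its k smallest distinct hash values."""
    return sorted({h(x) for x in items})[:k]

def jaccard_estimate(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    union_k = set(sorted(set(sketch_a) | set(sketch_b))[:k])  # bottom-k of the union
    both = union_k & set(sketch_a) & set(sketch_b)            # values present in both sets
    return len(both) / len(union_k)
```

The key point is that only one hash function is ever evaluated per shingle, yet the estimator still averages over k samples.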
It seems like another way to get N number of good hashed values would be to salt the same hash with N different salt values.
In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.
Likely not as cheap as XOR against N values, but seems like there's possibility for better quality results, at a minimal extra cost, especially if the data being hashed is much larger than the salt value.
If I have a true random number generator (TRNG) which can give me either a 0 or a 1 each time I call it, then it is trivial to generate any number in a range whose length is a power of 2. For example, if I wanted to generate a random number between 0 and 63, I would simply poll the TRNG 6 times, for a maximum value of 111111 and a minimum value of 000000. The problem is when I want a number in a range not equal to 2^n. Say I wanted to simulate the roll of a die. I would need a range between 1 and 6, with equal weighting. Clearly, I would need three bits to store the result, but polling the TRNG 3 times would introduce two erroneous values. We could simply ignore them, but then that would give one side of the die much lower odds of being rolled.

My question is how one most effectively deals with this.
The easiest way to get a perfectly accurate result is by rejection sampling. For example, generate a random value from 1 to 8 (3 bits), rejecting and generating a new value (3 new bits) whenever you get a 7 or 8. Do this in a loop.
You can get arbitrarily close to accurate just by generating a large number of bits, doing the mod 6, and living with the bias. In cases like 32-bit values mod 6, the bias will be so small that it will be almost impossible to detect, even after simulating millions of rolls.
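The rejection loop from the first paragraph can be sketched in Python, with `random.getrandbits(1)` standing in for the TRNG:

```python
import random

def trng_bit():
    # stand-in for the true random bit source
    return random.getrandbits(1)

def roll_die():
    """Fair d6 by rejection sampling: draw 3 bits for a value in 0..7,
    reject 6 and 7 (i.e. the raw values 7 and 8), repeat."""
    while True:
        value = (trng_bit() << 2) | (trng_bit() << 1) | trng_bit()
        if value < 6:
            return value + 1  # map 0..5 to faces 1..6
```

On average 8/6 ≈ 1.33 three-bit draws are needed per roll, and every face has exactly probability 1/6.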
If you want a number in the range 0 .. R-1, pick the least n such that R <= 2^n. Then generate a random number r in the range 0 .. 2^n - 1 using your method. If it is greater than or equal to R, discard it and generate again. The probability that a single attempt fails in this manner is at most 1/2, so you will get a number in your desired range in fewer than two attempts on average. This method is balanced and does not impair the randomness of the result in any fashion.
As you've observed, you can repeatedly double the range of possible random values by concatenating bits, but if you start with an integer number of bits (like zero) then you cannot obtain any range with prime factors other than two.
There are several ways out; none of which are ideal:
Simply produce the first reachable range which is larger than what you need, and discard results and start again if the random value falls outside the desired range.
Produce a very large range, and distribute that as evenly as possible amongst your desired outputs, and overlook the small bias that you get.
Produce a very large range, distribute what you can evenly amongst your desired outputs, and if you hit upon one of the [proportionally] few values which fall outside of the set which distributes evenly, then discard the result and start again.
As with 3, but recycle the parts of the value that you did not convert into a result.
The first option isn't always a good idea. Numbers 2 and 3 are pretty common. If your random bits are cheap then 3 is normally the fastest solution with a fairly small chance of repeating often.
For the last one: suppose you have built a random value r in [0,31], and from that you need to produce a result x in [0,5]. Values of r in [0,29] can be mapped to the required output without any bias using mod 6, while values in [30,31] would have to be dropped on the floor to avoid bias.
In the former case, you produce a valid result x, but there's some more randomness left over -- the difference between the ranges [0,5], [6,11], etc., (five possible values in this case). You can use this to start building your new r for the next random value you'll need to produce.
In the latter case, you don't get any x and are going to have to try again, but you don't have to throw away all of r. The specific value picked from the illegal range [30,31] is left-over and free to be used as a starting value for your next r (two possible values).
The random range you have from that point on needn't be a power of two. That doesn't mean it'll magically reach the range you need at the time, but it does mean you can minimise what you throw away.
The larger you make r, the more bits you may need to throw away if it overflows, but the smaller the chances of that happening. Adding one bit halves your risk but increases the cost only linearly, so it's best to use the largest r you can handle.
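Option 4 might look like the following Python sketch (my own rendering; the invariant throughout is that `value` is uniform in `[0, range)`):

```python
import random

class BitRecycler:
    """Draw uniform values in arbitrary ranges from a bit source,
    recycling leftover randomness instead of discarding it."""

    def __init__(self):
        self.value = 0   # invariant: uniform in [0, self.range)
        self.range = 1

    def next(self, n):
        """Return a uniform value in [0, n)."""
        while True:
            while self.range < n:                    # stretch with fresh bits
                self.value = (self.value << 1) | random.getrandbits(1)
                self.range <<= 1
            usable = self.range - self.range % n     # largest multiple of n
            if self.value < usable:
                result = self.value % n
                self.value //= n                     # leftover block index...
                self.range = usable // n             # ...is uniform again
                return result
            self.value -= usable                     # rejected tail is still
            self.range -= usable                     # uniform, in a smaller range
```

Both branches preserve the uniformity invariant, so no entropy is wasted beyond what rejection fundamentally requires.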
Given a bit array of fixed length and the number of 0s and 1s it contains, how can I arrange all possible combinations such that returning the i-th combinations takes the least possible time?
It is not important the order in which they are returned.
Here is an example:
array length = 6
number of 0s = 4
number of 1s = 2
possible combinations (6! / 4! / 2!)
000011 000101 000110 001001 001010
001100 010001 010010 010100 011000
100001 100010 100100 101000 110000
problem
1st combination = 000011
5th combination = 001010
9th combination = 010100
With a different arrangement such as
100001 100010 100100 101000 110000
001100 010001 010010 010100 011000
000011 000101 000110 001001 001010
it shall return
1st combination = 100001
5th combination = 110000
9th combination = 010100
Currently I am using an O(n) algorithm which tests each bit for whether it is a 1 or a 0. The problem is that I need to handle lots of very long arrays (on the order of 10000 bits), so it is still very slow (and caching is out of the question). I would like to know if you think a faster algorithm exists.
Thank you
I'm not sure I understand the problem, but if you only want the i-th combination without generating the others, here is a possible algorithm:
There are C(M,N)=M!/(N!(M-N)!) combinations of N bits set to 1 having at most highest bit at position M.
You want the i-th: you iteratively increment M until C(M,N)>=i
while( C(M,N) < i ) M = M + 1
That will tell you the highest bit that is set.
Of course, you compute the combination iteratively with
C(M+1,N) = C(M,N)*(M+1)/(M+1-N)
Once found, you have a problem of finding (i-C(M-1,N))th combination of N-1 bits, so you can apply a recursion in N...
Here is a possible variant with D = C(M+1,N) - C(M,N), and I = I - 1 to make it start at zero:

SOL = 0
I = I - 1
while (N > 0)
    M = N
    C = 1
    D = 1
    while (I >= D)
        I = I - D
        M = M + 1
        D = N * C / (M - N)
        C = C + D
    SOL = SOL + (1 << (M - 1))
    N = N - 1
RETURN SOL
This will require large integer arithmetic if you have that many bits...
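A compact Python rendering of the same unranking idea (my own code; note that for a fixed number of set bits, this colexicographic order coincides with plain numeric order, matching the listing in the question):

```python
from math import comb

def ith_combination(i, n_ones):
    """Unrank: the i-th (0-based) integer with n_ones set bits, in
    colexicographic (= numeric) order."""
    sol = 0
    n = n_ones
    while n > 0:
        c = n - 1                      # comb(n-1, n) == 0 <= i always holds
        while comb(c + 1, n) <= i:     # advance to the highest set bit position
            c += 1
        sol |= 1 << c
        i -= comb(c, n)                # recurse on the remaining n-1 bits
        n -= 1
    return sol
```

Python's arbitrary-precision integers make the 10000-bit case a non-issue; `math.comb` requires Python 3.8+.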
If the ordering doesn't matter (it just needs to remain consistent), I think the fastest thing to do would be to have combination(i) return anything you want that has the desired density the first time combination() is called with argument i. Then store that value in a member variable (say, a hashmap that has the value i as key and the combination you returned as its value). The second time combination(i) is called, you just look up i in the hashmap, figure out what you returned before and return it again.
Of course, when you're returning the combination for argument i, you'll need to make sure it's not something you have already returned for some other argument.
If the number you will ever be asked to return is significantly smaller than the total number of combinations, an easy implementation for the first call to combination(i) would be to make a value of the right length with all 0s, randomly set num_ones of the bits to 1, and then make sure it's not one you've already returned for a different value of i.
Your problem appears to be constrained by the binomial coefficient. In the example you give, the problem can be translated as follows:
there are 6 items that can be chosen 2 at a time. By using the binomial coefficient, the total number of unique combinations can be calculated as N! / (K! (N - K)!), which for the case of K = 2 simplifies to N(N-1)/2. Plugging 6 in for N, we get 15, which is the same number of combinations that you calculated with 6! / 4! / 2! -- which appears to be another way to calculate the binomial coefficient that I have never seen before. I have tried other combinations as well and both formulas generate the same number of combinations. So, it looks like your problem can be translated into a binomial coefficient problem.
Given this, it looks like you might be able to take advantage of a class that I wrote to handle common functions for working with the binomial coefficient:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to the language of your choice.
There may be some limitations since you are using a very large N that could end up creating larger numbers than the program can handle. This is especially true if K can be large as well. Right now, the class is limited to the size of an int. But, it should not be hard to update it to use longs.
I need some good pseudo random number generator that can be computed like a pure function from its previous output without any state hiding. Under "good" I mean:
I must be able to parametrize generator in such way that running it for 2^n iterations with any parameters (or with some large subset of them) should cover all or almost all values between 0 and 2^n - 1, where n is the number of bits in output value.
Combined generator output of n + p bits must cover all or almost all values between 0 and 2^(n + p) - 1 if I run it for 2^n iterations for every possible combination of its parameters, where p is the number of bits in parameters.
For example, an LCG can be computed as a pure function and can meet the first condition, but it cannot meet the second one. Say we have a 32-bit LCG with m = 2^32 held constant; our p = 64 (two 32-bit parameters a and c), so n + p = 96, and we must read the output three 32-bit ints at a time to meet the second condition. Unfortunately, the condition cannot be met because of the strictly alternating sequence of odd and even ints in the output. To overcome this, hidden state must be introduced, but that makes the function not pure and breaks the first condition (long hidden period).
EDIT: Strictly speaking, I want family of functions parametrized by p bits and with full state of n bits, each generating all possible binary strings of p + n bits in unique "randomish" way, not just continuously incrementing (p + n)-bit int. Parametrization required to select that unique way.
Am I wanting too much?
You can use any block cipher, with a fixed key. To generate the next number, decrypt the current one, increment it, and re-encrypt it. Because block ciphers are 1:1, they'll necessarily iterate through every number in the output domain before repeating.
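As a toy illustration in Python, a 4-round Feistel network can stand in for the fixed-key block cipher (a real implementation should use a proper cipher such as AES; the round function and keys here are arbitrary choices of mine):

```python
MASK32 = 0xFFFFFFFF
KEYS = [0x9E3779B9, 0x7F4A7C15, 0x85EBCA6B, 0xC2B2AE35]  # arbitrary fixed round keys

def _round(half, key):
    # the round function need not be invertible; the Feistel structure handles that
    return ((half * 0x9E3779B1) ^ key) & MASK32

def encrypt(x):
    left, right = x >> 32, x & MASK32
    for k in KEYS:
        left, right = right, left ^ _round(right, k)
    return (left << 32) | right

def decrypt(x):
    left, right = x >> 32, x & MASK32
    for k in reversed(KEYS):          # undo the rounds in reverse order
        left, right = right ^ _round(left, k), left
    return (left << 32) | right

def successor(current):
    """Pure successor: decrypt, increment mod 2^64, re-encrypt.
    Because the cipher is a bijection, iterating this visits every
    64-bit value exactly once before repeating."""
    return encrypt((decrypt(current) + 1) & ((1 << 64) - 1))
```

The state is just the current output, so this meets the "no hidden state" requirement; the key material plays the role of the parameters.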
Try an LFSR.

All you need is a list of primitive polynomials.

Generating the finite field this way gives a period of 2^n - 1. You can generalise the procedure to generate anything with period k^n - 1.

I have not seen this implemented, but all you have to implement is shifting numbers by a small number s > n where gcd(s, 2^n - 1) == 1 (gcd stands for greatest common divisor).
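For illustration, here is a 16-bit Galois LFSR in Python using the primitive polynomial x^16 + x^14 + x^13 + x^11 + 1 (tap mask 0xB400); the update is a pure function of the previous output, and the period over the nonzero states is 2^16 - 1:

```python
def lfsr_next(state, taps=0xB400):
    """One Galois-LFSR step: a pure function of the previous state.
    With a primitive polynomial, the 2^16 - 1 nonzero states form a
    single cycle; state 0 is a fixed point and must be avoided."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps   # feed the output bit back through the taps
    return state
```

Different primitive polynomials give different (parametrized) sequences, which is the role the "list of primitive polynomials" plays above.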