How many hash functions are required in a minhash algorithm

I am keen to try and implement minhashing to find near duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write up, but there is the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.
The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.
Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?
To quote from the blog
It is tough to get the error bar on our similarity estimate much
smaller than [7%] because of the way error bars on statistically
sampled values scale — to cut the error bar in half we would need four
times as many samples.
Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?

One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
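A rough Python sketch of the XOR scheme described above. `zlib.crc32` is only a stand-in for whatever good 32-bit hash you use, and the helper names (`make_masks`, `minhash_signature`, `estimate_similarity`) are made up for illustration. It assumes each document has at least one shingle.

```python
import random
import zlib

def make_masks(count, seed=0):
    """Return `count` pseudo-random 32-bit masks (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(count)]

def minhash_signature(shingles, masks):
    """One good hash per shingle, then one cheap XOR per derived function."""
    base = [zlib.crc32(s.encode("utf-8")) for s in shingles]
    # Slot 0 uses the good hash itself; every other slot XORs it with a mask.
    signature = [min(base)]
    for mask in masks:
        signature.append(min(h ^ mask for h in base))
    return signature

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature slots on which two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

With `make_masks(199)` this yields 200 values per document: the good hash plus 199 XOR-derived ones.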
Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over.
Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that) it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.
This problem does not occur with signed integers because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beat out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
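The unsigned-rotation problem can be checked with a small simulation. This is only a sketch: the hash values are random 32-bit integers rather than real shingle hashes, and the function names are invented for the example.

```python
import random

def rotr32(x, r=1):
    """Rotate a 32-bit unsigned value right by r bits."""
    return ((x >> r) | (x << (32 - r))) & 0xFFFFFFFF

def same_pick_rate(trials=2000, shingles=50, seed=1):
    """Fraction of trials where rotating by one bit picks the same shingle."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        hashes = [rng.getrandbits(32) for _ in range(shingles)]
        before = min(range(shingles), key=lambda i: hashes[i])
        after = min(range(shingles), key=lambda i: rotr32(hashes[i]))
        same += before == after
    return same / trials
```

Two independent hash functions would agree on the winning shingle only about 1 time in 50 here, whereas the rotated one should agree about half the time.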

Pretty much.. but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.
That means that a reported measurement of 78% could easily come from only 50% similarity..
Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.
Mathematically, if you're reporting two digits the second should be meaningful.
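To put numbers on that, here is a small sketch that assumes, as the quoted blog post does, that the error bar scales with 1/sqrt(number of hashes). The function name is made up.

```python
import math

def scaled_error(base_error, base_hashes, new_hashes):
    """Scale an error bar assuming it is proportional to 1/sqrt(hash count)."""
    return base_error * math.sqrt(base_hashes / new_hashes)
```

Starting from 7% at 200 hashes, `scaled_error(7.0, 200, 50)` gives 14.0, and 12 hashes come out a little above 28% because 200 / 16 is 12.5 rather than 12.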
Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is, calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations, to emphasise certain factors/ bring certain bits to the front.
I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combine rotation by some number of bits with XORing by a randomly generated integer.
This "spreads" the selectivity of the min() function across the bits, and also varies which value min() ends up selecting.
The rationale for rotation is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all top bits are the same do lower bits have any effect in the comparison, so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.
The rationale for XOR is that, on its own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together -- thus an excessive tendency for them to end up selecting the same shingle, not independent shingles.
There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers -- where it would be obvious for unsigned. XOR must still be used in these circumstances, anyway.
Java has 32-bit hashcodes builtin. And if you use Google Guava libraries, there are 64-bit hashcodes available.
Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.

What you want can be easily obtained from universal hashing. Popular textbooks like Cormen et al. have very readable information in section 11.3.3, pp 265-268. In short, you can generate a family of hash functions using the following simple equation:
h(x,a,b) = ((ax+b) mod p) mod m
x is key you want to hash
a is any number you can choose between 1 and p-1 inclusive.
b is any number you can choose between 0 to p-1 inclusive.
p is a prime number that is greater than max possible value of x
m is a max possible value you want for hash code + 1
By selecting different values of a and b you can generate many hash codes that are independent of each other.
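A minimal Python sketch of that family, assuming integer keys smaller than the prime 2^31 - 1. The names `P`, `make_hash` and `hash_functions` are made up for the example, a is drawn from 1 to p-1 (the textbook family only needs a to be nonzero), and Python's `random` is just a convenient source for a and b.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2^31 - 1; keys must be smaller than this

def make_hash(m, rng):
    """Draw one function h(x) = ((a*x + b) mod P) mod m from the family."""
    a = rng.randrange(1, P)  # 1 .. P-1
    b = rng.randrange(0, P)  # 0 .. P-1
    return lambda x: ((a * x + b) % P) % m

rng = random.Random(42)  # fixed seed so the functions are reproducible
hash_functions = [make_hash(1 << 16, rng) for _ in range(200)]
```

Each entry of `hash_functions` maps a key to a 16-bit value, so a minhash slot would be `min(h(x) for x in shingle_hashes)` for one `h`.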
An optimized version of this formula can be implemented as follows in C/C++/C#/Java:
(unsigned) (a*x+b) >> (w-M)
Here,
- w is size of machine word (typically 32)
- M is size of hash code you want in bits
- a is any odd integer that fits in to machine word
- b is any integer less than 2^(w-M)
Above works for hashing a number. To hash a string, get the hash code that you can get using built-in functions like GetHashCode and then use that value in above formula.
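The same multiply-shift formula can be sketched in Python, where integers do not wrap, so the 32-bit overflow has to be emulated with a mask. `W`, `M` and `multiply_shift` are names chosen for this example only.

```python
W = 32  # machine word size in bits
M = 16  # desired hash size in bits

def multiply_shift(x, a, b):
    """Emulate (unsigned)(a*x + b) >> (W - M) with explicit 32-bit wraparound."""
    assert a % 2 == 1 and 0 <= b < (1 << (W - M))
    return ((a * x + b) & ((1 << W) - 1)) >> (W - M)
```

The result always fits in M bits because only the top M bits of the wrapped product are kept.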
For example, let's say you need 200 16-bit hash codes for string s, then following code can be written as implementation:
public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    var machineWordSize = sizeof(int) * 8;          // sizeof gives bytes; we need bits
    var hashCodeSize = machineWordSize / 2;         // 16-bit hash codes
    var hashCodeSizeDiff = machineWordSize - hashCodeSize;
    var hstart = unchecked((uint)s.GetHashCode());  // unsigned, so >> is a logical shift
    var bmax = 1 << hashCodeSizeDiff;
    var rnd = new Random(seed);
    for (var i = 0; i < count; i++)
    {
        // a = 2i + 1 is always odd; b is random in [0, 2^(w-M)).
        var a = (uint)(i * 2 + 1);
        var b = (uint)rnd.Next(0, bmax);
        hashCodes[i] = (int)(unchecked(hstart * a + b) >> hashCodeSizeDiff);
    }
    return hashCodes;
}
Notes:
I'm using a hash code size of half the machine word size, which in most cases would be 16 bits. This is not ideal and has a far higher chance of collision. This can be fixed by upgrading all arithmetic to 64-bit.
Normally you want to select a and b both randomly within above said ranges.

Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)
Check out this article for the state of the art practical and theoretical bounds. It has a nice graph explaining why you probably want to use just one 2-independent hash function and save the k smallest values.
When estimating set sizes the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k), where f is the Jaccard similarity and k is the number of values kept. So if you want error ε, you need k = 1/(f ε^2); for example, if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
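A sketch of the one-hash, keep-the-k-smallest idea in Python. BLAKE2b is used only as a convenient stand-in for the 2-independent hash function mentioned above, all function names are invented for the example, and the sets are assumed to be non-empty.

```python
import hashlib

def h64(item):
    """64-bit hash of a string (BLAKE2b here is an arbitrary stand-in)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def bottom_k(items, k):
    """Keep the k smallest hash values of a set."""
    return sorted({h64(x) for x in items})[:k]

def jaccard_estimate(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    a, b = set(sketch_a), set(sketch_b)
    union_smallest = sorted(a | b)[:k]  # k smallest values of the union
    shared = sum(1 for v in union_smallest if v in a and v in b)
    return shared / len(union_smallest)

def values_to_keep(f, eps):
    """k = 1 / (f * eps^2), the sizing rule quoted above."""
    return 1.0 / (f * eps * eps)
```

For similarity 1/3 and a 10% relative error, `values_to_keep(1/3, 0.1)` comes out at about 300, matching the figure above.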

It seems like another way to get N number of good hashed values would be to salt the same hash with N different salt values.
In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.
Likely not as cheap as XOR against N values, but seems like there's possibility for better quality results, at a minimal extra cost, especially if the data being hashed is much larger than the salt value.
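In Python the "clone the hasher state" idea can be sketched with `hashlib`, whose hash objects have a `copy()` method. SHA-256 and the function name `salted_hashes` are arbitrary choices for this example, and truncating to 64 bits is just to get integer-sized values.

```python
import hashlib

def salted_hashes(data, salts):
    """Hash `data` once, then finish a cloned hasher state with each salt."""
    base = hashlib.sha256(data)
    values = []
    for salt in salts:
        clone = base.copy()  # clone the state after absorbing `data`
        clone.update(salt)   # apply the salt second
        values.append(int.from_bytes(clone.digest()[:8], "big"))
    return values
```

Each value equals the hash of `data + salt`, but the (possibly large) data is only fed through the hasher once.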

Related

Reversibly shuffle a set of a million numbers

I need to issue a series {1, 2, 3, 4 …} of tickets that are (at least seemingly) random numbers {10,934, 3,453,867, 122, 4,386,564 …}. When presented back, I must be able to compute their original index (e.g. 122 → 3.)
In other words, I need a seemingly random permutation p on the interval [1 … N] that has an inverse permutation p^-1. N is about 10^7.
The reasons for that are:
- It is a cipher: when receiving a ticket, it should not be easy to guess the tickets that were issued before.
- The tickets should be short alphanumeric strings that can be noted down.
- I want to avoid recording every ticket issued.
I would use some well-known cipher (e.g., DES) in counter mode.
DES is generally considered fairly broken for normal purposes, but it seems to fit your needs reasonably well, and has a smaller block size than most newer algorithms. For you, that means it produces a smaller result (64 bits, if memory serves). Once you've converted that to readable characters (e.g., base 64) you end up with something like 10 characters or so.
To retrieve the original number, you simply decrypt with your secret key.
Results look quite random--essentially the only known way to sort them back into order would be to break DES, which can be done (has been done) but the resources to do so are quite non-trivial.
If you really do need a lot better security than that, you can use something like AES instead of DES (at the expense of producing a longer "key" value).
1. To generate a pseudo-random shuffle, you could use the Fisher-Yates algorithm:
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
See also: What distribution do you get from this broken random shuffle?
for (int i = tickets.Length - 1; i > 0; i--)
{
    int n = random(i + 1);
    Swap(tickets[i], tickets[n]);
}
Beware of using the "wrong" algorithm (it is biased).
You will get the permutation, then the inverse permutation.
2. The problem comes with the randomness of the shuffle.
As there are 10000000! permutations, you would need a very large seed.
Then the problem is in the random generator: standard ones are about 32 bits, perhaps a little more, but that is far from enough to cover 10000000! permutations.
You should look at something like Fortuna:
https://en.wikipedia.org/wiki/Fortuna_%28PRNG%29
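A small Python sketch of the shuffle-then-invert idea. Python's `random.Random` is a Mersenne Twister, so it illustrates the mechanics only and is not a secure generator such as Fortuna. The function names are invented for the example.

```python
import random

def shuffled_permutation(n, seed):
    """Fisher-Yates shuffle of 0..n-1 driven by a seeded generator."""
    rng = random.Random(seed)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i + 1)  # 0 <= j <= i, the unbiased choice
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def inverse_permutation(perm):
    """Build the table with inverse[perm[i]] == i for every index i."""
    inverse = [0] * len(perm)
    for index, ticket in enumerate(perm):
        inverse[ticket] = index
    return inverse
```

Both tables can be rebuilt from the seed alone, so only the seed needs to be stored, at the cost of O(N) memory while they are in use.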
You can generate such a sequence using a Linear congruential generator.
X0 is the seed (or the index of the permutation if you wish). m should be equal to N+1. Select c and a to assure full period length (as described in the section 'period length' in the link above). This will give you a one-to-one mapping with size N.
To restore the index, you can crack the LCG using a small number of consecutive pseudo-random numbers from the series, which is not too hard. Of course you can keep m, a and c and save the trouble.
For more secure methods look at David Eisenstat's comment. You'll need only the secret key to restore the index. On the downside, if you use a standard FPE, N would have to be 2^x-1 (e.g. 2^128-1).
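A toy Python check of the full-period property for a linear congruential generator. The parameters m = 16, a = 5, c = 3 are illustrative values only, and the brute-force test is meant for small m, not for N near 10^7.

```python
def lcg_sequence(m, a, c, seed):
    """Return m successive values of X -> (a*X + c) mod m, starting after seed."""
    values, x = [], seed
    for _ in range(m):
        x = (a * x + c) % m
        values.append(x)
    return values

def has_full_period(m, a, c):
    """Brute-force check that the generator visits every residue exactly once."""
    return sorted(lcg_sequence(m, a, c, 0)) == list(range(m))
```

With m = 16, a = 5, c = 3 every value from 0 to 15 appears once per cycle, which is what makes the mapping one-to-one. A poor choice such as a = 4 gets stuck and repeats values.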

Hashing google interview

Why can't powers of 2, powers of 10, or prime numbers be good hashing functions? If we want to store overflow records in a hash function, why aren't those good for selection of hashing functions?
Suppose your hash function returns a 32-bit unsigned result. Suppose you choose a modulus of 4096. What you do is, effectively: index = hash & 0xFFF -- so, you throw away the top 20 bits of your hash value. Now, if your hash is really good, and the bottom 12 bits are just as good as the rest, then that's not a problem. However, if your hash is pretty good over all 32 bits, but the bottom 12 bits are suspect (they might, for example, be more strongly influenced by the last characters of a string)... then you may regret discarding the top 20. In this case, if you choose any odd modulus, then with index = hash % modulus the result depends on all 32 bits of the hash.
So, more generally, if your hash is calculated modulo M, and your index is taken as hash % N, then what you want is for your M and N to be co-prime.
If M is 2^m (as it usually is), then N=10^n is a poor choice, because the bottom n bits of the resulting index are a straight copy of the bottom n bits of the hash.
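These three claims are easy to check in Python. The sample hash values and the odd modulus 4093 are arbitrary picks for the illustration.

```python
def low_bits(value, n):
    """The bottom n bits of value."""
    return value & ((1 << n) - 1)

hashes = [0x9E3779B9, 0xDEADBEEF, 0x12345678, 0xCAFEBABE]  # arbitrary samples

# A power-of-two modulus is just a mask: only the bottom 12 bits survive.
mask_equals_mod = all(h % 4096 == h & 0xFFF for h in hashes)

# With N = 10^3, the bottom 3 bits of the index are copied from the hash.
copied_low_bits = all(low_bits(h % 10**3, 3) == low_bits(h, 3) for h in hashes)

# With an odd modulus, changing only the top bit changes the index.
high_bit_matters = (0x12345678 % 4093) != ((0x12345678 ^ 0x80000000) % 4093)
```

All three flags come out true: the mask and the power-of-two modulus agree, 10^n leaks the low n bits unchanged, and the odd modulus reacts to a change in bit 31.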

How to quickly determine if two sets of checksums are equal, with the same "strength" as the individual checksums

Say you have two unordered sets of checksums, one of size N and one of size M. Depending on the algorithm to compare them, you may not even know the sizes but can compare N != M for a quick abort if you do.
The hashing function used for a checksum has some chance of collision, which as a layman I'm foolishly referring to as "strength". Is there a way to take two sets of checksums, all made from the same hashing function, and quickly compare them (so comparing element to element is right out) with the same basic chance of collision between two sets as there is between two individual checksums?
For instance, one method would be to compute a "set checksum" by XORing all of the checksums in the set. This new single hash is used for comparing with other sets' hashes, meaning storage of size is no longer necessary. Especially since it can be modified for the addition/removal of an element checksum by XORing with the set's checksum without having to recompute the whole thing. But does that reduce the "strength" of the set's checksum compared to a brute force comparison of all the original ones? Is there a way to conglomerate the checksums of a set that doesn't reduce the "strength" (as much?) but still is less complex than a straight comparison of the set elements' checksums?
After my initial comment, I got to thinking about the math behind it. Here's what I came up with. I'm no expert so feel free to jump in with corrections. Note: This all assumes your hash function is uniformly distributed, as it should be.
Basically, the more bits in your checksum, the lower the chance of collision. The more files, the higher.
First, let's find the odds of a collision with a single pair of files XOR'd together. We'll work with small numbers at first, so let's assume our checksum is 4 bits (0-15), and we'll call the number of bits n.
With two sums, the total number of bits is 2n (8), so there are 2^(2n) (256) possibilities total. However, we're only interested in the collisions. To collide an XOR, you need to flip the same bits in both sums. There are only 2^n (16) ways to do that, since we're using n bits.
So, the overall probability of a collision is 16/256, which is (2^n) / (2^(2n)), or simply 1/(2^n). That means the probability of a non-collision is 1 - (1/(2^n)). So, for our sample n, that means that it's only 15/16 secure, or 93.75%. Of course, for bigger checksums, it's better. Even for a puny n=16, you get 99.998%
That's for a single comparison, of course. Since you're rolling them all together, you're doing f-1 comparisons, where f is the number of files. To get the total odds of a collision that way, you take the f-1 power of the odds we got in the first step.
So, for ten files with a 4-bit checksum, we get pretty terrible results:
(15/16) ^ 9 = 55.94% chance of non-collision
This rapidly gets better as we add bits, even when we increase the number of files.
For 10 files with a 8-bit checksum:
(255/256) ^ 9 = 96.54%
For 100/1000 files with 16 bits:
(65535/65536) ^ 99 = 99.85%
(65535/65536) ^ 999 = 98.49%
As you can see, we're still working with small checksums. If you're using anything >= 32 bits, my calculator gets off into floating-point rounding errors when I try to do the math on it.
TL;DR:
Where n is the number of checksum bits and f is the number of files in each set:
nonCollisionChance = ( ((2^n)-1) / (2^n) ) ^ (f-1)
collisionChance = 1 - ( ((2^n)-1) / (2^n) ) ^ (f-1)
Your method of XOR'ing a bunch of checksums together is probably just fine.
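The rounding trouble with 32-bit and larger checksums can be sidestepped by evaluating the same formula through logarithms. This is only a sketch of the formula above, not an independent check of the reasoning behind it, and the function names are made up.

```python
import math

def non_collision_chance(n_bits, files):
    """((2^n - 1) / 2^n) ^ (f - 1), evaluated via log1p to stay accurate."""
    return math.exp((files - 1) * math.log1p(-(2.0 ** -n_bits)))

def collision_chance(n_bits, files):
    """1 minus the value above, using expm1 so tiny results do not round to 0."""
    return -math.expm1((files - 1) * math.log1p(-(2.0 ** -n_bits)))
```

For example, `collision_chance(64, 1000)` returns a small positive number instead of rounding to zero.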

Fast generation of random numbers that appear random

I am looking for an efficient way to generate numbers that a human would perceive as being random. Basically, I think of this as avoiding long sequences of 0 or 1 bits. I expect humans to be viewing the bit pattern, and a very low powered cpu should be able to calculate near a thousand of these per second.
There are two different concepts that I can think of to do this, but I am lost finding an efficient way of accomplishing them.
Generate a random number with a fixed number of one bits. For a 32-bit random number, this requires up to 31 random numbers, using the Knuth selection algorithm. Is there a more efficient way to generate a random number with some number of bits set? Unfortunately, 0000FFFF doesn't look very random.
Some form of "part-wise" density seems like it'd look better, but I can't come up with a clear way of doing so. I'd imagine going through each chunk, calculating how far it is from the ideal density, and trying to increase the bit density of the next chunk. This sounds complex.
Hopefully there's another algorithm that I haven't thought about for this. Thanks in advance for your help.
[EDIT]
I should be clearer with what I ask -
(a) Is there an efficient way to generate random numbers without "long" runs of a single bit, where "long" is a tunable parameter?
(b) Other suggestions on what would make a number appear to be less-random?
A linear feedback shift register probably does what you want.
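For reference, a 16-bit Fibonacci LFSR can be sketched in a few lines of Python. The tap positions (16, 14, 13, 11) are one commonly cited maximal-length choice, the function names are invented, and nothing here limits run lengths: an LFSR like this can still emit runs of up to 16 equal bits.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one bit."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_bits(seed, count):
    """Yield `count` output bits; the seed must be a nonzero 16-bit value."""
    state = seed & 0xFFFF
    for _ in range(count):
        yield state & 1
        state = lfsr16_step(state)

def lfsr16_period(seed):
    """Count steps until the register returns to its starting state."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state, steps = lfsr16_step(state), steps + 1
    return steps
```

Each step is only a few shifts and XORs, which is why this suits a very low powered CPU.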
Edit in light of an updated question: You should look at a shuffle bag, although I'm not sure how fast this could run. See also this question.
I don't really know what you mean by bit patterns that "look" random. Is there some algorithm for defining what that is? One way might be to formulate an array consisting of only those numbers which are random enough for your purpose, then, randomly select elements from that array and push them onto the stream. The thing you seem to be trying to do seems bizarre to me and may be doomed to failure though. What happens if you have two 32 bit numbers which taken individually would meet your criteria for apparent randomicity, but when placed side by side make a sufficiently long stream of 0's or 1's to look made up?
Finally, I couldn't resist this.
You need to decide by exactly what rules you decide if something "looks random". Then you take a random number generator that produces enough "real randomness" for your purpose, and every time it generates a number that doesn't look random enough, you throw that number away and generate a new one.
Or you directly produce a sequence of "random" bits and every time the random generator outputs the "wrong" next bit (that would make it look not-random), you just flip that bit.
Here's what I'd do. I'd use a number like 00101011100101100110100101100101 and rotate it by some random amount each time.
But are you sure that a typical pseudo random generator wouldn't do? Have you tried it? You won't get very many long strings of 0s and 1s anyhow.
If you're going to use a library random number and you're worried about too many or too few bits being set, there are cheap ways of counting bits.
Random numbers often have long sequences of 1s and 0s, so I'm not sure I fully understand why you can't use a simple linear congruential generator and shift in or out how ever many bits you need. They're blazing fast, look extremely random to the naked eye, and you can choose coefficients that will yield random integers in whatever positive range you need. If you need 32 "random looking" bits, just generate four random numbers and take the low 8 bits from each.
You don't really need to implement your own at all though, since in most languages the random library already implements one.
If you're determined that you want a particular density of 1s, though, you could always start with a number that has the required number of 1s set
int a = 0x00FF;
then use a bit twiddling hack to implement a bit-level shuffle of the bits in that number.
If you are looking to avoid long runs, how about something simple like:
#include <cmath>
#include <cstdlib>
class generator {
public:
    generator() : last_num(false), run_count(1) { }
    bool next_bit() {
        // Flip with probability of about 1 - 2^-run_count, so long runs get ever less likely.
        const bool flip = rand() > RAND_MAX / std::pow(2, run_count);
        // RAND_MAX >> run_count ?
        if (flip) {
            run_count = 1;
            last_num = !last_num;
        } else {
            ++run_count;
        }
        return last_num;
    }
private:
    bool last_num;
    int run_count;
};
Runs become less likely the longer they go on. You could also use RAND_MAX / (1 + run_count) if you wanted longer runs.
Since you care most about run length, you could generate random run lengths instead of random bits, so as to give them the exact distribution you want.
The mean run length in random binary data is of course 2 (sum of n/(2^n)), and the mode is 1. Here are some random bits (I swear this is a single run, I didn't pick a value to make my point):
0111111011111110110001000101111001100000000111001010101101001000
See there's a run length of 8 in there. This is not especially surprising, since run length 8 should occur roughly every 256 bits and I've generated 64 bits.
If this doesn't "look random" to you because of excessive run lengths, then generate run lengths with whatever distribution you want. In pseudocode:
loop
get a random number
output that many 1 bits
get a random number
output that many 0 bits
endloop
You'd probably want to discard some initial data from the stream, or randomise the first bit, to avoid the problem that as it stands, the first bit is always 1. The probability of the Nth bit being 1 depends on how you "get a random number", but for anything that achieves "shortish but not too short" run lengths it will soon be as close to 50% as makes no difference.
For instance "get a random number" might do this:
get a uniformly-distributed random number n from 1 to 81
if n is between 1 and 54, return 1
if n is between 55 and 72, return 2
if n is between 73 and 78, return 3
if n is between 79 and 80, return 4
return 5
The idea is that the probability of a run of length N is one third the probability of a run of length N-1, instead of one half. This will give much shorter average run lengths, and a longest run of 5, and would therefore "look more random" to you. Of course it would not "look random" to anyone used to dealing with sequences of coin tosses, because they'd think the runs were too short. You'd also be able to tell very easily with statistical tests that the value of digit N is correlated with the value of digit N-1.
This code uses at least log2(81) = 6.34 "random bits" to generate on average 1.49 bits of output, so is slower than just generating uniformly-distributed bits. But it shouldn't be much more than about 7/1.49, or roughly 5, times slower, and an LFSR is pretty fast to start with.
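The run-length idea can be sketched in Python as follows. The 54/18/6/2/1 split out of 81 follows the pseudocode above, the first bit is randomised as suggested, and the function names are made up for the example.

```python
import random

def run_length(rng):
    """Run length in 1..5, following the 54/18/6/2/1 split out of 81."""
    n = rng.randint(1, 81)
    if n <= 54:
        return 1
    if n <= 72:
        return 2
    if n <= 78:
        return 3
    if n <= 80:
        return 4
    return 5

def run_bits(count, seed=0):
    """Return `count` bits built from alternating runs of 1s and 0s."""
    rng = random.Random(seed)
    bits, value = [], rng.getrandbits(1)  # randomise the first bit
    while len(bits) < count:
        bits.extend([value] * run_length(rng))
        value ^= 1
    return bits[:count]

def longest_run(bits):
    """Length of the longest run of equal consecutive bits."""
    best = current = 1
    for prev, cur in zip(bits, bits[1:]):
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best
```

By construction `longest_run(run_bits(n))` never exceeds 5, which is exactly the property a statistical test would pick up on.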
This is how I would examine the number:
const int max_repeated_bits = 4; /* or any other number that you prefer */
int examine_1(unsigned int x) {
for (int i=0; i<max_repeated_bits; ++i) x &= (x << 1);
return x == 0;
}
int examine(unsigned int x) {
return examine_1(x) && examine_1(~x);
}
Then, just generate a number x; if examine(x) returns 0, reject it and try again. The probability of getting a 32-bit number with more than 4 identical bits in a row is about 2/3, so you would need about 3 random generator calls per number. However, if you allow more than 4 bits, it gets better. Say, the probability of getting more than 6 bits in a row is only about 20%, so you would need only 1.25 calls per number.
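The same check and the reject-and-retry loop, sketched in Python. Python integers are unbounded, so the 32-bit width is enforced with an explicit mask. `random_looking` is a hypothetical name for the retry loop.

```python
import random

MAX_REPEATED_BITS = 4
MASK32 = 0xFFFFFFFF

def has_no_long_run_of_ones(x):
    """True if x (32-bit) has no run of more than MAX_REPEATED_BITS one bits."""
    for _ in range(MAX_REPEATED_BITS):
        x &= (x << 1) & MASK32
    return x == 0

def examine(x):
    """True if neither 1s nor 0s repeat more than MAX_REPEATED_BITS times."""
    return has_no_long_run_of_ones(x) and has_no_long_run_of_ones(~x & MASK32)

def random_looking(rng):
    """Rejection sampling: draw 32-bit values until one passes examine()."""
    while True:
        x = rng.getrandbits(32)
        if examine(x):
            return x
```

Note that this only constrains runs inside one 32-bit value; two accepted values placed side by side can still join up into a longer run.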
There are various variants of linear feedback shift registers, such as shrinking and self-shrinking which modify the output of one LFSR based on the output of another.
The design of these attempts to create random numbers, where the probability of getting two bits the same in a row is 0.5, of getting three in a row is 0.25 and so on.
It should be possible to chain two LFSRs to inhibit or invert the output when a sequence of similar bits occurs: the first LFSR uses a conventional primitive polynomial and feeds its output into the second. The second shift register is shorter and doesn't have a primitive polynomial. Instead it is tapped to invert the output if all its bits are the same, so no run can exceed the size of the second shift register.
Obviously this destroys the randomness of the output - if you have N bits in a row, the next bit is completely predictable. Messing around with using the output of another random source to determine whether or not to invert the output would defeat the second shift register - you wouldn't be able to detect the difference between that and just one random source.
Check out the GSL. I believe it has some functions that do just what you want. They at least are guaranteed to be random bit strings. I'm not sure if they would LOOK random, since that's more of a psychological question.
Can't believe nobody mentioned this:
If you want a longest run (period) of 2N repeats:
PeopleRandom()
{
while(1)
{
Number = randomN_bitNumber();
if(Number && Number != MaxN_BitNumber)
return Number;
}
}
this gives much better results in terms of amount of tosses than using a 32-bit, etc rand
pros:
you only toss values 2/2^N of the time.
larger N give better results.
Since the number of values that do not split the value with a 1 in the middle bit is exactly half, you can go with a larger N than you otherwise would have if you can tolerate a larger largest run less than half the time.
One simple approach would be to generate one bit at a time, with a tuning parameter to control the probability that each new bit matches the previous one. By setting the probability below 0.5, you can generate sequences that are less likely to contain long runs of repeating bits (and you can tune that likelihood). Setting p = 0 gives a repeating 1010101010101010 sequence; setting p = 1 gives a sequence of all 0s or all 1s.
Here is some C# to demonstrate:
double p = 0.3; // 0 <= p <= 1, probability of duplicating a bit
var r = new Random();
int bit = r.Next(2);
for (int i = 0; i < 100; i++)
{
if (r.NextDouble() > p)
{
bit = (bit + 1) % 2;
}
Console.Write(bit);
}
This might well be too slow for your needs, since you need to generate a random double in order to obtain each new random bit. You could, instead, generate a random byte and use each pair of bits to generate the new bit (i.e. if both are zero then keep the same bit, otherwise flip it, if you're happy with the equivalent of a fixed p = 0.25).
Furthermore, it's still possible to get long sequences of repeated bits, you've just lowered the probability of doing so.
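The byte-at-a-time variant could look something like this in Python. It is a sketch with a fixed p = 0.25, and the function name is invented for the example.

```python
import random

def bits_from_byte_pairs(count, seed=0):
    """Each 2-bit pair keeps the previous bit only when both bits are 0."""
    rng = random.Random(seed)
    bit = rng.getrandbits(1)
    out = []
    while len(out) < count:
        byte = rng.getrandbits(8)
        for shift in (0, 2, 4, 6):  # four 2-bit pairs per random byte
            if (byte >> shift) & 0b11:
                bit ^= 1            # any nonzero pair flips the bit
            out.append(bit)
    return out[:count]
```

One random byte now yields four output bits, and no floating-point draw is needed.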

Hash Functions and Tables of size of the form 2^p

While calculating the hash table bucket index from the hash code of a key, why do we avoid use of remainder after division (modulo) when the size of the array of buckets is a power of 2?
When calculating the hash, you want as much information as you can cheaply munge things into with good distribution across the entire range of bits: e.g. 32-bit unsigned integers are usually good, unless you have a lot (>3 billion) of items to store in the hash table.
It's converting the hash code into a bucket index that you're really interested in. When the number of buckets n is a power of two, all you need to do is do an AND operation between hash code h and (n-1), and the result is equal to h mod n.
A reason this may be bad is that the AND operation is simply discarding bits - the high-level bits - from the hash code. This may be good or bad, depending on other things. On one hand, it will be very fast, since AND is a lot faster than division (and is the usual reason why you would choose to use a power of 2 number of buckets), but on the other hand, poor hash functions may have poor entropy in the lower bits: that is, the lower bits don't change much when the data being hashed changes.
Let us say that the table size is m = 2^p.
Let k be a key.
Then, whenever we do k mod m, we will only get the last p bits of the binary representation of k. Thus, if I put in several keys that have the same last p bits, the hash function will perform VERY VERY badly as all keys will be hashed to the same slot in the table. Thus, avoid powers of 2
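A tiny Python illustration with p = 4. The keys are arbitrary values chosen so that they share their last four bits.

```python
p = 4
m = 1 << p  # table size 2^p = 16

def bucket(k):
    """k mod m for m = 2^p, computed with a mask."""
    return k & (m - 1)

# Keys that differ only above the last p bits all land in the same slot.
keys = [0x13, 0x23, 0x453, 0xFFF3]
slots = [bucket(k) for k in keys]
```

Every key above ends up in slot 3, even though the keys differ widely in their upper bits.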
