Hashing google interview - memory-management

Why can't powers of 2, powers of 10, or prime numbers be good hashing functions? If we want to store overflow records in a hash table, why aren't those good choices when selecting a hashing function?

Suppose your hash function returns a 32-bit unsigned result, and suppose you choose a modulus of 4096. What you do is, effectively, index = hash & 0xFFF -- so you throw away the top 20 bits of your hash value. Now, if your hash is really good, and the bottom 12 bits are just as good as the rest, then that's not a problem. However, if your hash is pretty good over all 32 bits, but the bottom 12 bits are suspect (they might, for example, be more strongly influenced by the last characters of a string), then you may regret discarding the top 20. In that case, if you choose any odd modulus, then with index = hash % modulus the result depends on all 32 bits of the hash.
So, more generally, if your hash is calculated modulo M, and your index is taken as hash % N, then what you want is for your M and N to be co-prime.
If M is 2^m (as it usually is), then N=10^n is a poor choice, because the bottom n bits of the resulting index are a straight copy of the bottom n bits of the hash.
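As a minimal sketch of the difference (C#, with an arbitrary example hash value):

uint hash = 0xDEADBEEF;            // pretend this came from a 32-bit hash function

// Power-of-two modulus: only the bottom 12 bits of the hash survive.
uint indexPow2 = hash & 0xFFFu;    // identical to hash % 4096

// Odd (here prime) modulus: every bit of the hash influences the bucket index.
uint indexOdd = hash % 4093u;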

Related

Hashing with the Division Method - Choosing number of slots?

So, in CLRS, there's this quote
A prime not too close to an exact power of 2 is often a good choice for m.
Several Questions...
I understand how a power of 2 will just be the lower-order bits of your key... however, say you have keys from a universe of 1 to 1 million, with each key having an equal probability of being any number from the universe (which I'm guessing is a common assumption about your universe if given no other data?). Then wouldn't taking, say, the 4 lower-order bits result in 2^4 lower-order bit patterns that were pretty much equally likely for the keys from 1 to 1 million? How am I thinking about this incorrectly?
Why a prime number? So, if powers of 2 aren't a good idea, why is a prime number a better choice as opposed to a composite number close to a power of 2? (Also, why should it be close to a power of 2?)
You are trying to find a hash table that works well for typical input data, and typical input data does things that you wouldn't expect from good random number generators. Very often you get formatted or semi-formatted strings which, when converted to numbers, end up as K, K+A, K+2A, K+3A,.... for some integers K and A. If K+xA and K+yA hash to the same number mod m, then (x-y)A must be 0 mod m. If m is prime, this can only happen if A = 0 mod m or if x = y mod m, so one time in m. But if m=pq and A happens to be divisible by p, then you get a collision every time x-y is divisible by q, which is more often since q < m.
I guess close to a power of 2 because it might be convenient for the memory management system to have blocks of memory of the resulting size - I really don't know. If you really care, and if you have the time, you could try different primes with some representative data and see which of them are best in practice.
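As a rough illustration of that arithmetic (C#; the key pattern and table sizes are made-up example values): keys of the form K + xA collide heavily when the table size shares a factor with A, but spread over every slot when the size is prime.

using System;
using System.Collections.Generic;

// Keys K, K + A, K + 2A, ... with A = 6; table sizes 12 (shares a factor with 6) and 13 (prime).
int K = 7, A = 6;
var slotsComposite = new HashSet<int>();
var slotsPrime = new HashSet<int>();
for (int x = 0; x < 100; x++)
{
    int key = K + x * A;
    slotsComposite.Add(key % 12);
    slotsPrime.Add(key % 13);
}
Console.WriteLine(slotsComposite.Count);   // 2  -- only two slots ever get used
Console.WriteLine(slotsPrime.Count);       // 13 -- every slot gets used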

Radix sort in linear time vs. converting input to proper base

I was thinking about the linear-time sorting problem which appears in quite a few sources and prompts you to sort an array of numbers in the range from 0 to n^3 - 1 in linear time.
So one way to do this is to use radix sort, which normally runs in O(wn) where w is the max word size, by observing that we can obtain word size 3 for any number in that range by using base n.
And herein lies my question - while it looks ok on paper, in practice converting all the numbers to base n the naive way is going to take quite a lot of time, quite possibly even more than the later sorting itself. Is there any way to convert to base n faster than naively or to somehow trick one's way out of this limitation or do you just have to live with it?
One useful observation is that the runtime of this algorithm is the same if you choose as your base not the number n, but the smallest power of two greater than or equal to n. Let's imagine that that number is 2^k. Now, to read off the base-2^k digits of a number, you can just inspect blocks of bits of size k in the number, which is doable quite quickly using some bit shifts and logical ANDs. This will likely be fast even if your numbers are stored as variable-length integers, assuming that the variable-length integer uses some nice sort of binary encoding.
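A minimal sketch of that digit extraction (C#; the choice of k = 11 and 32-bit keys are illustrative assumptions):

static class RadixDigits
{
    const int k = 11;                     // each digit is a k-bit block, i.e. base 2^k = 2048
    const uint Mask = (1u << k) - 1;      // selects the low k bits

    // Digit d (0 = least significant) of value v in base 2^k, read with one shift and one AND.
    public static uint Digit(uint v, int d) => (v >> (d * k)) & Mask;
}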
There's an ideal bit-field size r for using 2^r as the base for radix sort, depending on the size of the array, but it doesn't make a lot of difference: less than 10% between r = 8 (1 read pass + 4 radix sort passes) and r = 16 (1 read pass + 2 radix sort passes), because r = 8 is more cache friendly. On my system (Intel 2600K 3.4 GHz), for array size = 2^20, r = 8 is slightly fastest. For array size = 2^24, r = 10.67 (using 10, 11, 11 bit fields) is slightly fastest. For array size = 2^26, r = 16 is slightly fastest.
For signed integers, the sign bit can be toggled during the radix sort.
In your case, the max value of an integer is given, so this would help in choosing the bit field sizes.

How does the Flajolet-Martin sketch work?

I am trying to understand this sketch but am not able to.
Correct me if I am wrong, but basically: let's say I have text data (words), and I have a hash function which takes a word and creates an integer hash, and then I convert that hash to a binary bit vector. Right?
Then I keep track of the first 1 I see from the left, and the position where that 1 is (say, k)... and the cardinality of this set is 2^k?
http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html
But say I have just one word, and the hash function generates a hash of 2^5 for it; then I am guessing there are 5 trailing 0's, so it will predict a cardinality of 2^5?
That doesn't sound right. What am I missing?
For a single word the distribution of R is a geometric distribution with p = 1/2, and its standard deviation is sqrt(2) ≈ 1.41.
So for a word with hash ending in 100000b the algorithm will, indeed, yield 2^5 / 0.77351 ≈ 41.37. But the probability of that is only 1/64, which is consistent with the statement that the standard deviation of R is close to 1.
http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html
Say we had a good, random hash function that acted on strings and generated integers. What can we say about the generated integers? Since they are random themselves, we would expect:
1/2 of them to have their binary representation end in 0 (i.e. divisible by 2),
1/4 of them to have their binary representation end in 00 (i.e. divisible by 4)
1/8 of them to have their binary representation end in 000 (i.e. divisible by 8)
Turning the problem around, if the hash function generated an integer ending in 0^m (m zero bits), then intuitively the number of unique strings is around 2^m.
What is really important to remember is that the Flajolet-Martin algorithm is meant to count distinct elements (let's say M distinct elements) from a set of N elements, when M is expected to be very, very large.
There is no point in using the algorithm if N or M are small enough for us to store all distinct elements in memory.
In the case where N and M are really large, the probability of the estimate being close to 2^k is actually very reasonable.
There is an explanation of this at : http://infolab.stanford.edu/~ullman/mmds/ch4.pdf (page 143)
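A bare-bones sketch of the estimator described above (C#; the correction constant 0.77351 is the one quoted earlier, the hash function itself is left as a parameter, and everything else is illustrative):

using System;
using System.Collections.Generic;

static class FlajoletMartin
{
    // Number of trailing zero bits of h (i.e. the largest m with h ending in 0^m).
    static int TrailingZeros(uint h)
    {
        if (h == 0) return 32;
        int r = 0;
        while ((h & 1u) == 0) { h >>= 1; r++; }
        return r;
    }

    // Estimate of the number of distinct words: 2^R / 0.77351,
    // where R is the maximum trailing-zero count seen over all hashed words.
    public static double EstimateDistinct(IEnumerable<string> words, Func<string, uint> hash)
    {
        int maxR = 0;
        foreach (var w in words)
            maxR = Math.Max(maxR, TrailingZeros(hash(w)));
        return Math.Pow(2, maxR) / 0.77351;
    }
}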

How many hash functions are required in a minhash algorithm

I am keen to try and implement minhashing to find near-duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write-up, but it leaves open the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.
The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.
Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?
To quote from the blog
It is tough to get the error bar on our similarity estimate much
smaller than [7%] because of the way error bars on statistically
sampled values scale — to cut the error bar in half we would need four
times as many samples.
Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?
One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over.
Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that), it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.
This problem does not occur with signed integers, because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with the minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beaten out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
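A minimal sketch of the XOR construction from the answer above (C#; the seed and the way the 32-bit masks are filled are illustrative assumptions):

using System;

// 199 fixed random-looking 32-bit masks, generated once up front.
// Next() never sets the top bit, so two draws are combined to cover all 32 bits.
var rnd = new Random(12345);
var masks = new uint[199];
for (int i = 0; i < masks.Length; i++)
    masks[i] = (uint)rnd.Next() ^ ((uint)rnd.Next() << 16);

// 200 hash values per shingle: the good hash itself plus 199 cheap XOR variants.
uint[] AllHashes(uint goodHash)
{
    var hashes = new uint[200];
    hashes[0] = goodHash;
    for (int i = 0; i < masks.Length; i++)
        hashes[i + 1] = goodHash ^ masks[i];
    return hashes;
}

var h = AllHashes(0x1234ABCDu);
Console.WriteLine(h.Length);   // 200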
Pretty much... but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.
That means a reported measurement of 78% could easily come from only 50% similarity.
Or that 50% similarity could easily be reported as 22%. That doesn't sound accurate enough for business expectations, to me.
Mathematically, if you're reporting two digits the second should be meaningful.
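As a quick check on that arithmetic, assuming the error scales as 1/sqrt(k): dropping from 200 hashes at roughly 7% error to 12 hashes multiplies the error by sqrt(200/12) ≈ 4.1, giving about 7% * 4.1 ≈ 29%, essentially the 28% back-of-the-envelope figure from the question.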
Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is, calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations, to emphasise certain factors/ bring certain bits to the front.
I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combine rotation by some number of bits with XORing by a randomly generated integer.
This both "spreads" the selectivity of the min() function around the bits, and varies which value min() ends up selecting.
The rationale for rotation is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all the top bits are the same do lower bits have any effect in the comparison... so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.
The rationale for XOR is that, on its own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together -- and thus an excessive tendency for them to end up selecting the same shingle, not independent shingles.
There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers -- where it would be obvious for unsigned. XOR must still be used in these circumstances, anyway.
Java has 32-bit hashcodes builtin. And if you use Google Guava libraries, there are 64-bit hashcodes available.
Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.
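A small sketch of that rotate-then-XOR family (C#; the rotation amount i & 31 and the per-function masks are illustrative assumptions):

static class MinHashFamily
{
    // Rotate a 32-bit value right by the given number of bits (0..31).
    static uint RotateRight(uint v, int bits) => (v >> bits) | (v << (32 - bits));

    // i-th cheap hash: rotate the good hash by i bits, then XOR with the i-th random mask.
    public static uint DerivedHash(uint goodHash, int i, uint[] masks)
        => RotateRight(goodHash, i & 31) ^ masks[i];
}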
What you want can be easily obtained from universal hashing. Popular textbooks like Cormen et al. have very readable coverage in section 11.3.3, pp. 265-268. In short, you can generate a family of hash functions using the following simple equation:
h(x,a,b) = ((ax+b) mod p) mod m
x is the key you want to hash
a is any number you can choose between 1 and p-1 inclusive.
b is any number you can choose between 0 and p-1 inclusive.
p is a prime number that is greater than the max possible value of x
m is the max possible value you want for the hash code, plus 1
By selecting different values of a and b you can generate many hash codes that are independent of each other.
An optimized version of this formula can be implemented as follows in C/C++/C#/Java:
(unsigned) (a*x+b) >> (w-M)
Here,
- w is the size of the machine word (typically 32)
- M is the size of the hash code you want, in bits
- a is any odd integer that fits into a machine word
- b is any integer less than 2^(w-M)
The above works for hashing a number. To hash a string, first get a hash code using built-in functions like GetHashCode and then use that value in the above formula.
For example, let's say you need 200 16-bit hash codes for string s; then the following code can be written as an implementation:
public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    var machineWordSize = 8 * sizeof(int);                   // w = 32 bits
    var hashCodeSize = machineWordSize / 2;                   // M = 16-bit hash codes
    var hashCodeSizeDiff = machineWordSize - hashCodeSize;    // w - M
    var hstart = s.GetHashCode();
    var bmax = 1 << hashCodeSizeDiff;                         // b must be less than 2^(w-M)
    var rnd = new Random(seed);
    for (var i = 0; i < count; i++)
    {
        var a = i * 2 + 1;                                    // any odd multiplier
        var b = rnd.Next(0, bmax);
        // (unsigned)(a*x + b) >> (w - M), using a cast to uint to get a logical shift
        hashCodes[i] = (int)((uint)(a * hstart + b) >> hashCodeSizeDiff);
    }
    return hashCodes;
}
Notes:
I'm using a hash code size of half the machine word size, which in most cases would be 16 bits. This is not ideal and has a far greater chance of collision. It can be improved by upgrading all arithmetic to 64-bit.
Normally you want to select a and b both randomly within the ranges given above.
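As a quick usage sketch (the shingle string and the count of 200 are just example inputs):

// 200 derived 16-bit hash codes for a single shingle.
var codes = GetHashCodes("the quick brown fox", 200);
Console.WriteLine(codes.Length);   // 200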
Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)
Check out this article for the state-of-the-art practical and theoretical bounds. It has a nice graph explaining why you probably want to use just one 2-independent hash function and save the k smallest values.
When estimating set sizes, the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k), where f is the Jaccard similarity and k is the number of values kept. So if you want error ε, you need k = 1/(f ε^2); for example, if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
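A rough sketch of the "keep the k smallest hash values" approach (C#; the use of a SortedSet and the estimator details are assumptions for illustration, not the paper's exact construction):

using System.Collections.Generic;
using System.Linq;

static class BottomKSketch
{
    // Keep the k smallest distinct hash values seen in the stream.
    public static SortedSet<uint> KSmallest(IEnumerable<uint> hashes, int k)
    {
        var smallest = new SortedSet<uint>();
        foreach (var h in hashes)
        {
            smallest.Add(h);
            if (smallest.Count > k)
                smallest.Remove(smallest.Max);   // drop the current largest
        }
        return smallest;
    }

    // Estimated Jaccard similarity: among the k smallest values of the union of the
    // two sketches, the fraction that appears in both sketches.
    public static double EstimateJaccard(SortedSet<uint> a, SortedSet<uint> b, int k)
    {
        var union = new SortedSet<uint>(a);
        union.UnionWith(b);
        var kSmallestOfUnion = union.Take(k).ToList();
        int shared = kSmallestOfUnion.Count(h => a.Contains(h) && b.Contains(h));
        return (double)shared / kSmallestOfUnion.Count;
    }
}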
It seems like another way to get N good hashed values would be to salt the same hash with N different salt values.
In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.
Likely not as cheap as XOR against N values, but seems like there's possibility for better quality results, at a minimal extra cost, especially if the data being hashed is much larger than the salt value.

Hash Functions and Tables of size of the form 2^p

While calculating the hash table bucket index from the hash code of a key, why do we avoid use of remainder after division (modulo) when the size of the array of buckets is a power of 2?
When calculating the hash, you want to cheaply munge as much information as you can into a value with a good distribution across the entire range of bits: e.g. 32-bit unsigned integers are usually good, unless you have a lot (> 3 billion) of items to store in the hash table.
It's converting the hash code into a bucket index that you're really interested in. When the number of buckets n is a power of two, all you need to do is do an AND operation between hash code h and (n-1), and the result is equal to h mod n.
A reason this may be bad is that the AND operation simply discards bits - the high-order bits - from the hash code. This may be good or bad, depending on other things. On one hand, it will be very fast, since AND is a lot faster than division (and this is the usual reason why you would choose a power-of-2 number of buckets), but on the other hand, poor hash functions may have poor entropy in the lower bits: that is, the lower bits don't change much when the data being hashed changes.
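A tiny sketch of that equivalence (C#, with arbitrary example values):

int n = 1 << 4;                // 16 buckets, a power of two
int h = 0x5A3C7;               // some non-negative hash code
int viaMod = h % n;            // 7
int viaAnd = h & (n - 1);      // 7 -- same result, using only the low 4 bits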
Let us say that the table size is m = 2^p.
Let k be a key.
Then, whenever we do k mod m, we will only get the last p bits of the binary representation of k. Thus, if I put in several keys that have the same last p bits, the hash function will perform very, very badly, as all keys will be hashed to the same slot in the table. Thus, avoid powers of 2.

Resources