Hash Table Double Hashing - algorithm

I am given two hash functions that I should use for insertion into and deletion from the table:
int hash1(int key)
{
    return (key % TABLESIZE);
}
int hash2(int key)
{
    return (key % PRIME) + 1;
}
I'm confused on how I utilize them.
Do I:
Use hash1 first, and if the slot is taken at the return value, use hash2?
Do I use hash1 then add the result to hash2's output?
Do I use hash1's output as hash2's input?

TLDR: bucket_selected = hash1(hash2(key))
hash1() is an identity hash from key to bucket (it folds the keys into the table size). If the table size happens to be a power of two, it effectively keeps some number of low-order bits and discards the high bits; for example, with table size 256, it's effectively returning key & 255: the least significant 8 bits. This will be somewhat collision prone if the keys aren't either:
mostly contiguous numbers - possibly with a few small gaps, such that they cleanly map onto successive buckets most of the time, or
pretty random in the low bits used, so they scatter across the buckets
If table size is not a power of two, and ideally is a prime, the high order bits help spread the keys around the buckets. Collisions are just as likely for e.g. random numbers, but on computer systems sometimes different bits in a key are more or less likely to vary (for example, doubles in memory consist of a sign, many mantissa, and many exponent bits: if your numbers are of similar magnitude, the exponents won't vary much between them), or there are patterns based on power-of-two boundaries. For example, if you have 4 ASCII characters packed into a uint32_t key, then a % table_size hash with table_size 256 extracts just one of the characters as the hash value. If table size was instead 257, then varying any of the characters would change the bucket selected.
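To make the packed-ASCII example concrete, here is a small sketch (pack4 is a made-up helper, not from the question): with 256 buckets only the last packed character influences the bucket, while a prime bucket count such as 257 is influenced by every character.

#include <stdint.h>
#include <stdio.h>

/* Pack 4 ASCII characters into a uint32_t key (illustrative helper). */
uint32_t pack4(const char *s)
{
    return ((uint32_t)s[0] << 24) | ((uint32_t)s[1] << 16) |
           ((uint32_t)s[2] << 8)  |  (uint32_t)s[3];
}

int main(void)
{
    const char *keys[] = { "ABCA", "XYZA", "QRSA" };   /* same last character */
    for (int i = 0; i < 3; i++) {
        uint32_t k = pack4(keys[i]);
        printf("%s -> %%256: %u   %%257: %u\n",
               keys[i], (unsigned)(k % 256u), (unsigned)(k % 257u));
    }
    return 0;   /* %256 gives 65 ('A') every time; %257 differs for each key */
}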
(key % PRIME) + 1 comes close to doing what hash1() would do if the table size was prime, but why add 1? Some languages do index their arrays from 1, which is the only good reason I can think of, but if dealing with such a language, you'd probably want hash1() to add 1 too. To explain the potential use for hash2(), let's take a step back first...
Real general-purpose hash table implementations need to be able to create tables of different sizes - whatever suits the program - and indeed, often applications want the table to "resize" or grow dynamically if more elements are inserted than it can handle well. Because of that, a hash function such as hash1 would be dependent on the hash table implementation or calling code to tell it the current table size. It's normally more convenient if the hash functions can be written independently of any given hash table implementation, only needing the key as input. What many hash functions do is hash the key to a number of a certain size, e.g. a uint32_t or uint64_t. Clearly that means there may be more hash values than there are buckets in the hash table, so a % operation (or faster bitwise-& operation if the # buckets is a power of two) is then used to "fold" the hash value back onto the buckets. So, a good hash table implementation usually accepts a hash function generating e.g. uint32_t or uint64_t output and internally does the % or &.
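As a sketch of that separation of concerns (the function names are mine, not from any particular library): the hash function produces a full-width value without knowing the table size, and the table folds it onto its buckets.

#include <stdint.h>
#include <stddef.h>

/* A table-size-independent hash function: key in, full-width uint32_t out.
   (Identity here just to keep the sketch short; a real one would mix bits.) */
uint32_t my_hash(uint32_t key) { return key; }

/* The hash table implementation folds the wide hash value onto its buckets. */
size_t bucket_for(uint32_t key, size_t n_buckets)
{
    uint32_t h = my_hash(key);
    /* If n_buckets is a power of two, h & (n_buckets - 1) is a faster equivalent. */
    return (size_t)(h % n_buckets);
}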
In the case of hash1 - it can be used:
as an identity hash folding the key to a bucket, or
to fold a hash value from another hash function to a bucket.
In the second usage, that second hash function could be hash2. This might make sense if the keys given to hash2 were typically much larger than the PRIME used, and yet the PRIME was in turn much larger than the number of buckets. To explain why this is desirable, let's take another step back...
Say you have 8 buckets and a hash function that produces a number in the range [0..10] with uniform probability: if you % the hash values into the table size, hash values 0..7 will map to buckets 0..7, and hash values 8..10 will map to buckets 0..2: buckets 0..2 can be expected to have about twice as many keys collide there as the other buckets. When the range of hash values is vastly larger than the number of buckets, the significance of some buckets receiving one extra hash value from the % mapping is correspondingly tiny. Alternatively, if you have say a hash function outputting 32-bit numbers (so the number of distinct hash values is a power of two), then % by a smaller power of two will map exactly the same number of hash values to each bucket.
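A tiny sketch of that bias, enumerating every hash value in [0..10] exactly once against 8 buckets:

#include <stdio.h>

int main(void)
{
    int hits[8] = {0};
    for (int h = 0; h <= 10; h++)      /* each hash value in [0..10] exactly once */
        hits[h % 8]++;
    for (int b = 0; b < 8; b++)
        printf("bucket %d receives %d hash value(s)\n", b, hits[b]);
    return 0;   /* buckets 0..2 receive 2 values each; buckets 3..7 receive 1 each */
}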
So, let's return to my earlier assertion: hash2()'s potential utility is actually to use it like this:
bucket_selected = hash1(hash2(key))
In the above formula, hash1 distributes across the buckets while preventing out-of-bounds bucket access; for this to work well, hash2 should output a range of numbers much larger than the number of buckets, but it won't do anything at all unless the keys span a range larger than PRIME, and ideally they'd span a range vastly larger than PRIME, increasing the odds of hash values from hash2(key) forming a near-uniform distribution between 1 and PRIME.
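A minimal sketch of this combined use, with concrete TABLESIZE and PRIME values assumed purely for illustration (and non-negative keys assumed, as in the question's code):

#define TABLESIZE 97        /* number of buckets - an assumed value for illustration */
#define PRIME     1000003   /* assumed: much larger than TABLESIZE, much smaller than the key range */

int hash1(int key) { return key % TABLESIZE; }
int hash2(int key) { return (key % PRIME) + 1; }

int bucket_selected(int key)
{
    /* hash2 spreads the key over [1..PRIME]; hash1 folds that value onto the buckets. */
    return hash1(hash2(key));
}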

Related

What is the connection between collisions and the complexity of CRUD operations in a Hash Table?

In the book by Aditya Bhargava, "Grokking Algorithms: An Illustrated Guide for Programmers and Other Curious People", I read that worst case complexity can be avoided if we avoid collisions.
As I understand it, a collision is when the hash function returns the same value for different keys.
How does this affect Hash Table complexity in CRUD operations?
Thanks
I read that worst case complexity can be avoided if we avoid collisions.
That's correct - worst case complexity happens when all the hash values for elements stored in a hash table map onto, and collide at, the same bucket.
As I understand it, a collision is when the hash function returns the same value for different keys.
Ultimately a value is mapped using a hash function to a bucket in the hash table. That said, it's common for that overall conceptual hash function to be implemented as a hash function producing a value in a huge numerical range (e.g. a 32-bit hash between 0 and 2^32-1, or a 64-bit hash between 0 and 2^64-1), then have that value mapped on to a specific bucket based on the current hash table bucket count using the % operator. So, say your hash table has 137 buckets: you might generate a hash value of 139, and since 139 % 137 == 2 you use the third bucket ([2] in an array of buckets). This two-step approach makes it easy to use the same hash function (producing 32-bit or 64-bit hashes) regardless of the size of the table. If you instead created a hash function that produced numbers between 0 and 136 directly, it wouldn't work at all well for slightly smaller or larger bucket counts.
Returning to your question...
As I understand it, a collision is when the hash function returns the same value for different keys.
...for the "32- or 64-bit hash function followed by %" approach I've described above, there are two distinct types of collisions: the 32- or 64-bit hash function itself may produce exactly the same 32- or 64-bit value for distinct values being hashed, or it might produce different values that - after the % operation - nevertheless map to the same bucket in the hash table.
How does this affect Hash Table complexity in CRUD operations?
Hash tables work by probabilistically spreading the values across the buckets. When many values collide at the same bucket, a secondary search mechanism has to be employed to process all the colliding values (and possibly other intermingled values, if you're using Open Addressing to try a sequence of buckets in the hash table, rather than hanging a linked list or binary tree of colliding elements off every bucket). So basically, the worse the collision rate, the further from idealised O(1) complexity you get, though you really only start to affect big-O complexity significantly if you have a particularly bad hash function, in light of the set of values being stored.
In a hash table implementation that has a good hashing function, and the load factor (number of entries divided by total capacity) is 70% or less, the number of collisions is fairly low and hash lookup is O(1).
If you have a poor hashing function or your load factor starts to increase, then the number of collisions increases. If you have a poor hashing function, then some hash codes will have many collisions and others will have very few. Your average lookup rate might still be close to O(1), but some lookups will take much longer because collision resolution takes a long time. For example, if hash code value 11792 has 10 keys mapped to it, then you potentially have to check 10 different keys before you can return the matching key.
If the hash table is overloaded, with each hash code having approximately the same number of keys mapped to it, then your average lookup rate will be O(k), where k is the average number of collisions per hash code.

Does hashtable size depend upon length of key?

What I know:
1. Hashtable size depends on the load factor.
2. It must be a large prime number, and that prime number is used as the modulo value in the hash function.
3. The prime number must not be too close to a power of 2 or a power of 10.
Doubt I am having:
Does the size of a hashtable depend on the length of the key?
Following paragraph from the book Introduction to Algorithms by Cormen.
Does n = 2000 mean the length of the strings or the number of elements that will be stored in the hash table?
Good values for m are primes not too close to exact powers of 2. For example, suppose we wish to allocate a hash table, with collisions resolved by chaining, to hold roughly n = 2000 character strings, where a character has 8 bits. We don't mind examining an average of 3 elements in an unsuccessful search, so we allocate a hash table of size m = 701. The number 701 is chosen because it is a prime near 2000/3 but not near any power of 2. Treating each key k as an integer, our hash function would be h(k) = k mod 701.
Can somebody explain it?
Here's a general overview of the tradeoff with hash tables.
Suppose you have a hash table with m buckets with chains storing a total of n objects.
If you store only references to objects, the total memory consumed is O (m + n).
Now, suppose that, for an average object, its size is s, it takes O (s) time to compute its hash once, and O (s) to compare two such objects.
Consider an operation checking whether an object is present in the hash table.
The bucket will have n / m elements on average, so the operation will take O (s n / m) time.
So, the tradeoff is this: when you increase the number of buckets m, you increase memory consumption but decrease average time for a single operation.
For the original question - Does size of hashtable depends on length of key? - No, it should not, at least not directly.
The paragraph you cite only mentions the strings as an example of an object to store in a hash table.
One mentioned property is that they are 8-bit character strings.
The other is that "We don't mind examining an average of 3 elements in an unsuccessful search".
And that wraps the properties of the stored object into the form: how many elements on average do we want to place in a single bucket?
The length of strings themselves is not mentioned anywhere.
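To make the book's "treating each key k as an integer" concrete, here is a sketch that interprets the string as a radix-256 integer and reduces it modulo m as it is read (the radix-256 reading is my assumption about what the text intends):

/* Interpret the string as a radix-256 integer and reduce it modulo m on the fly,
   so the "integer" never has to fit in a machine word. */
unsigned hash_string(const char *s, unsigned m)
{
    unsigned h = 0;
    for (; *s != '\0'; s++)
        h = (h * 256u + (unsigned char)*s) % m;
    return h;   /* a bucket index in [0, m-1] */
}

/* Usage for the book's numbers: hash_string("some key", 701) */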
(2) and (3) are false. It is common for a hash table to have 2^n buckets (ref), as long as you use the right hash function. On (1), the memory a hash table takes equals the number of buckets times the length of a key. Note that for string keys, we usually keep pointers to strings, not the strings themselves, so the length of a key is the length of a pointer, which is 8 bytes on 64-bit machines.
Algorithmic-wise, No!
The length of the key is irrelevant here.
Moreover, the key itself is not important, what's important is the number of different keys you predict you'll have.
Implementation-wise, yes! Since you must save the key itself in your hashtable, the key's length is reflected in the table's size.
For your second question, 'n' means the number of different keys to hold.

How many hash functions are required in a minhash algorithm

I am keen to try and implement minhashing to find near duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write up, but it leaves open the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.
The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.
Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?
To quote from the blog
It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale — to cut the error bar in half we would need four times as many samples.
Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?
One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
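Here is a sketch of that scheme in C; good_hash32 is a stand-in for whatever good hash you use on a shingle, and the masks array holds the 199 random-looking bit patterns (plus a zero mask, so the good hash itself is included):

#include <stdint.h>
#include <stddef.h>

#define NUM_HASHES 200

/* Stand-in for a decent 32-bit hash of a shingle; not a recommendation. */
static uint32_t good_hash32(uint32_t shingle)
{
    uint32_t h = shingle * 2654435761u;
    return h ^ (h >> 16);
}

/* masks[0] should be 0 (the good hash itself); masks[1..199] are random 32-bit patterns. */
static void minhash_signature(const uint32_t *shingles, size_t n,
                              const uint32_t masks[NUM_HASHES],
                              uint32_t sig[NUM_HASHES])
{
    for (int j = 0; j < NUM_HASHES; j++)
        sig[j] = UINT32_MAX;
    for (size_t i = 0; i < n; i++) {
        uint32_t h = good_hash32(shingles[i]);      /* one "good" hash per shingle */
        for (int j = 0; j < NUM_HASHES; j++) {
            uint32_t v = h ^ masks[j];              /* the cheap j-th hash function */
            if (v < sig[j])
                sig[j] = v;                         /* keep the minimum per function */
        }
    }
}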
Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over. Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location.

Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that) it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.

This problem does not occur with signed integers because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beat out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
Pretty much, but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.
That means that a reported measurement of 78% could easily come from only 50% similarity.
Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.
Mathematically, if you're reporting two digits the second should be meaningful.
Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is: calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations, to emphasise certain factors / bring certain bits to the front.
I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combine rotation by some number of bits with XORing by a randomly generated integer.
This both "spreads" the selectivity of the min() function around the bits, and varies which value min() ends up selecting.
The rationale for rotation, is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all top bits are the same, do lower bits have any effect in the comparison.. so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.
The rationale for XOR is that, on its own, bitwise rotation (ROTR) can, 50% of the time (when 0 bits are shifted in from the left), converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together -- thus an excessive tendency for them to end up selecting the same shingle, not independent shingles.
There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers -- where it would be obvious for unsigned. XOR must still be used in these circumstances, anyway.
Java has 32-bit hashcodes builtin. And if you use Google Guava libraries, there are 64-bit hashcodes available.
Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.
What you want can be easily obtained from universal hashing. Popular textbooks like Cormen et al. have very readable information in section 11.3.3, pp. 265-268. In short, you can generate a family of hash functions using the following simple equation:
h(x,a,b) = ((ax+b) mod p) mod m
x is the key you want to hash
a is any odd number you choose between 1 and p-1, inclusive.
b is any number you choose between 0 and p-1, inclusive.
p is a prime number that is greater than the max possible value of x
m is the max possible value you want for the hash code, plus 1
By selecting different values of a and b you can generate many hash codes that are independent of each other.
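A sketch of that formula in C, with p fixed to the Mersenne prime 2^31 - 1 purely for illustration (so keys, a and b are assumed to be below 2^31, which keeps the intermediate product within 64 bits):

#include <stdint.h>

/* h(x, a, b) = ((a*x + b) mod p) mod m */
uint32_t universal_hash(uint32_t x, uint32_t a, uint32_t b, uint32_t m)
{
    const uint64_t p = 2147483647u;   /* 2^31 - 1, assumed prime > any key */
    return (uint32_t)((((uint64_t)a * x + b) % p) % m);
}

/* Pick a and b as described above; each (a, b) pair is one member of the family. */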
An optimized version of this formula can be implemented as follows in C/C++/C#/Java:
(unsigned) (a*x+b) >> (w-M)
Here,
- w is the size of the machine word (typically 32)
- M is the size of the hash code you want, in bits
- a is any odd integer that fits into a machine word
- b is any integer less than 2^(w-M)
The above works for hashing a number. To hash a string, first get a hash code using built-in functions like GetHashCode, and then use that value in the above formula.
For example, let's say you need 200 16-bit hash codes for string s; then the following code can be written as an implementation:
public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    var machineWordSize = sizeof(int) * 8;                  // w = 32 bits
    var hashCodeSize = machineWordSize / 2;                 // M = 16-bit hash codes
    var hashCodeSizeDiff = machineWordSize - hashCodeSize;  // w - M
    var hstart = s.GetHashCode();                           // one "good" hash of the string
    var bmax = 1 << hashCodeSizeDiff;                       // b must be less than 2^(w-M)
    var rnd = new Random(seed);
    for (var i = 0; i < count; i++)
    {
        // (a*x + b) >> (w - M), with a odd (i*2 + 1) and the shift done on an unsigned value
        hashCodes[i] = unchecked((int)((uint)(hstart * (i * 2 + 1) + rnd.Next(0, bmax)) >> hashCodeSizeDiff));
    }
    return hashCodes;
}
Notes:
I'm using a hash code size of half the machine word size, which in most cases would be 16 bits. This is not ideal and has a far higher chance of collision. This can be improved by upgrading all the arithmetic to 64-bit.
Normally you want to select a and b both randomly within the ranges stated above.
Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)
Check out this article for the state of the art practical and theoretical bounds. It has a nice graph explaining why you probably want to use just one 2-independent hash function and save the k smallest values.
When estimating set sizes the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k) where f is the jaccard similarity and k is the number of values kept. So if you want error ε, you need k=1/(fε^2) or if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
It seems like another way to get N good hash values would be to salt the same hash with N different salt values.
In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.
Likely not as cheap as XOR against N values, but seems like there's possibility for better quality results, at a minimal extra cost, especially if the data being hashed is much larger than the salt value.

Best data structure to store lots one bit data

I want to store lots of data so that:
it can be accessed by an index, and
each datum is just yes or no (so probably one bit is enough for each).
I am looking for the data structure which has the highest performance and occupies the least space.
Probably storing the data in flat memory, one bit per datum, is not a good choice; on the other hand, using different kinds of tree structures still uses lots of memory (e.g. pointers in each node are required to build the tree, even though each node holds just one bit of data).
Does anyone have any ideas?
What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory efficient) ?
Well in Java the BitSet might be a good choice http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
If I understand your question correctly you should store them in an unsigned integer where you assign each value to a bit of the integer (flag).
Say you represent 3 values and they can be on or off. Then you assign the first to 1, the second to 2 and the third to 4. Your unsigned int can then be 0,1,2,3,4,5,6 or 7 depending on which values are on or off and you check the values using bitwise comparison.
Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language will need to be able to overload the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants), then return the result of arr[n / n_bits] & (1 << (n % n_bits))
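A sketch of that packed-bit indexing in C (the bitset_t type and helper names are mine, chosen just for illustration):

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *bytes;   /* one bit per stored yes/no value */
    size_t   n_bits;
} bitset_t;

bitset_t bitset_make(size_t n_bits)
{
    bitset_t b = { calloc((n_bits + 7) / 8, 1), n_bits };
    return b;
}

void bitset_set(bitset_t *b, size_t i, int value)
{
    if (value) b->bytes[i / 8] |= (uint8_t)(1u << (i % 8));
    else       b->bytes[i / 8] &= (uint8_t)~(1u << (i % 8));
}

int bitset_get(const bitset_t *b, size_t i)
{
    return (b->bytes[i / 8] >> (i % 8)) & 1;
}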
Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
It performs very well and is space-efficient. But make sure you read the fine print below ;-): Quote from the above wiki page.
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.

Removing an element from this simple Bloom filter is impossible. The element maps to k bits, and although setting any one of these k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. Such removal would introduce a possibility for false negatives, which are not allowed.

One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. However, it is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
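For reference, a minimal sketch of such a Bloom filter in C, deriving the k indices from two base hashes with the double-hashing arithmetic the quote mentions (the two base hash functions and the m/k values here are arbitrary placeholders, not recommendations):

#include <stdint.h>

#define M_BITS   8192   /* m: size of the bit array (assumed for the sketch) */
#define K_HASHES 4      /* k: number of derived hash functions (assumed) */

static uint8_t filter[M_BITS / 8];

/* Two arbitrary base hashes standing in for real, independent hash functions. */
static uint32_t h1(uint32_t x) { x ^= x >> 16; return x * 2654435761u; }
static uint32_t h2(uint32_t x) { x ^= x >> 13; return x * 40503u + 1u; }

static void bloom_add(uint32_t key)
{
    uint32_t a = h1(key), b = h2(key);
    for (uint32_t i = 0; i < K_HASHES; i++) {
        uint32_t pos = (a + i * b) % M_BITS;        /* double-hashing index derivation */
        filter[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }
}

static int bloom_maybe_contains(uint32_t key)
{
    uint32_t a = h1(key), b = h2(key);
    for (uint32_t i = 0; i < K_HASHES; i++) {
        uint32_t pos = (a + i * b) % M_BITS;
        if (!((filter[pos / 8] >> (pos % 8)) & 1))
            return 0;   /* definitely not in the set */
    }
    return 1;           /* possibly in the set; false positives are possible */
}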

Hash Functions and Tables of size of the form 2^p

While calculating the hash table bucket index from the hash code of a key, why do we avoid use of remainder after division (modulo) when the size of the array of buckets is a power of 2?
When calculating the hash, you want to cheaply munge as much of the key's information as you can into a value with good distribution across the entire range of bits: e.g. 32-bit unsigned integers are usually good, unless you have a lot (>3 billion) of items to store in the hash table.
It's converting the hash code into a bucket index that you're really interested in. When the number of buckets n is a power of two, all you need to do is do an AND operation between hash code h and (n-1), and the result is equal to h mod n.
A reason this may be bad is that the AND operation simply discards bits - the high-order bits - from the hash code. This may be good or bad, depending on other things. On one hand, it will be very fast, since AND is a lot faster than division (and is the usual reason why you would choose a power-of-2 number of buckets), but on the other hand, poor hash functions may have poor entropy in the lower bits: that is, the lower bits don't change much when the data being hashed changes.
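A quick sketch confirming the equivalence, and showing what the AND throws away, for a power-of-two bucket count:

#include <stdint.h>
#include <stdio.h>
#include <assert.h>

int main(void)
{
    uint32_t n = 1u << 4;                      /* 16 buckets */
    for (uint32_t h = 0; h < 100000; h++)
        assert((h & (n - 1)) == (h % n));      /* identical results for power-of-two n */
    /* Only the low 4 bits survive: hashes differing only in higher bits collide. */
    printf("%u %u\n", (unsigned)(0x12345678u & (n - 1)),
                      (unsigned)(0xABCDE678u & (n - 1)));   /* both print 8 */
    return 0;
}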
Let us say that the table size is m = 2^p.
Let k be a key.
Then, whenever we do k mod m, we will only get the last p bits of the binary representation of k. Thus, if I put in several keys that have the same last p bits, the hash function will perform very, very badly, as all keys will be hashed to the same slot in the table. Thus, avoid powers of 2.
