What is the main implementation idea behind sparse hash table? - data-structures

Why does the Google sparsehash open-source library have two implementations: a dense hashtable and a sparse one?

The dense hashtable is your ordinary textbook hashtable implementation.
The sparse hashtable stores only the elements that have actually been set, divided over a number of arrays. To quote from the comments in the implementation of sparse tables:
// The idea is that a table with (logically) t buckets is divided
// into t/M *groups* of M buckets each. (M is a constant set in
// GROUP_SIZE for efficiency.) Each group is stored sparsely.
// Thus, inserting into the table causes some array to grow, which is
// slow but still constant time. Lookup involves doing a
// logical-position-to-sparse-position lookup, which is also slow but
// constant time. The larger M is, the slower these operations are
// but the less overhead (slightly).
To know which elements of the arrays are set, a sparse table includes a bitmap:
// To store the sparse array, we store a bitmap B, where B[i] = 1 iff
// bucket i is non-empty. Then to look up bucket i we really look up
// array[# of 1s before i in B]. This is constant time for fixed M.
so that each element incurs an overhead of only 1 bit (in the limit).
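As an illustrative sketch (not the actual sparsehash code), one group of M buckets could look roughly like this in Java, assuming M is at most 64 so the occupancy bitmap fits in a single long:

// Illustrative sketch of one sparse group; field and method names are my own.
class SparseGroup<T> {
    private long bitmap = 0L;               // bit i is set iff logical bucket i is occupied
    private Object[] items = new Object[0]; // only the occupied buckets, kept in order

    private int rank(int i) {
        // number of 1-bits strictly before position i in the bitmap
        return Long.bitCount(bitmap & ((1L << i) - 1));
    }

    @SuppressWarnings("unchecked")
    T get(int i) {
        if ((bitmap & (1L << i)) == 0) return null; // logically empty bucket
        return (T) items[rank(i)];                  // "array[# of 1s before i in B]"
    }

    void set(int i, T value) {
        int pos = rank(i);
        if ((bitmap & (1L << i)) != 0) {
            items[pos] = value;                     // bucket already occupied: overwrite
            return;
        }
        // grow the compact array by one slot - slow, but constant time for fixed M
        Object[] bigger = new Object[items.length + 1];
        System.arraycopy(items, 0, bigger, 0, pos);
        bigger[pos] = value;
        System.arraycopy(items, pos, bigger, pos + 1, items.length - pos);
        items = bigger;
        bitmap |= (1L << i);
    }
}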

sparsehash is a memory-efficient way of mapping keys to values (roughly 1-2 bits of overhead per key). Bloom filters can give you even fewer bits per key, but they don't attach values to keys; all they tell you is definitely-outside/probably-inside, which is slightly less than one bit of information.

Related

What is the connection between collisions and the complexity of CRUD operations in a Hash Table?

In Aditya Bhargava's book "Grokking Algorithms: An Illustrated Guide for Programmers and Other Curious People", I read that worst-case complexity can be avoided if we avoid collisions.
As I understand it, a collision is when the hash function returns the same value for different keys.
How does this affect hash table complexity in CRUD operations?
Thanks
I read that worst-case complexity can be avoided if we avoid collisions.
That's correct: worst-case complexity happens when all the elements stored in the hash table hash to (and collide at) the same bucket.
As I understand it, a collision is when the hash function returns the same value for different keys.
Ultimately a value is mapped using a hash function to a bucket in the hash table. That said, it's common for that overall conceptual hash function to be implemented as a hash function producing a value in a huge numerical range (e.g. a 32-bit hash between 0 and 2^32-1, or a 64-bit hash between 0 and 2^64-1), then have that value mapped on to a specific bucket based on the current hash table bucket count using the % operator. So, say your hash table has 137 buckets: you might generate a hash value of 139, then compute 139 % 137 == 2 and use the third bucket ([2] in an array of buckets). This two-step approach makes it easy to use the same hash function (producing 32-bit or 64-bit hashes) regardless of the size of the table. If you instead created a hash function that produced numbers between 0 and 136 directly, it wouldn't work at all well for slightly smaller or larger bucket counts.
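In code, the two-step mapping looks roughly like this (a fragment using the hypothetical numbers from the example above):

int hash = 139;                  // pretend this came from a 32-bit hash function
int bucketCount = 137;           // current number of buckets in the table
int bucket = hash % bucketCount; // 139 % 137 == 2, i.e. the third bucket ([2])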
Returning to your question...
As I understand it, a collision is when the hash function returns the same value for different keys.
...for the "32- or 64-bit hash function followed by %" approach I've described above, there are two distinct types of collision: the 32- or 64-bit hash function itself may produce exactly the same 32- or 64-bit value for distinct values being hashed, or it may produce different values that - after the % operation - nevertheless map to the same bucket in the hash table.
How does this affect hash table complexity in CRUD operations?
Hash tables work by probabilistically spreading the values across the buckets. When many values collide at the same bucket, a secondary search mechanism has to be employed to process all the colliding values (and possibly other intermingled values, if you're using Open Addressing to try a sequence of buckets in the hash table, rather than hanging a linked list or binary tree of colliding elements off every bucket). So basically, the worse the collision rate, the further from idealised O(1) complexity you get, though you really only start to affect big-O complexity significantly if you have a particularly bad hash function, in light of the set of values being stored.
In a hash table implementation that has a good hashing function and a load factor (number of entries divided by total capacity) of 70% or less, the number of collisions is fairly low and hash lookup is O(1).
If you have a poor hashing function or your load factor starts to increase, then the number of collisions increases. If you have a poor hashing function, then some hash codes will have many collisions and others will have very few. Your average lookup rate might still be close to O(1), but some lookups will take much longer because collision resolution takes a long time. For example, if hash code value 11792 has 10 keys mapped to it, then you potentially have to check 10 different keys before you can return the matching key.
If the hash table is overloaded, with each hash code having approximately the same number of keys mapped to it, then your average lookup rate will be O(k), where k is the average number of collisions per hash code.
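To make the cost concrete, here is a minimal separate-chaining sketch (illustrative only, not any particular library's implementation); the scan inside contains() is the secondary search whose length is the number of keys that collided into that bucket:

import java.util.LinkedList;
import java.util.List;

class ChainedSet<K> {
    private final List<K>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedSet(int bucketCount) {
        buckets = new List[bucketCount];
        for (int i = 0; i < bucketCount; i++) buckets[i] = new LinkedList<>();
    }

    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.length); // hash, then reduce to a bucket
    }

    void add(K key) {
        List<K> chain = buckets[indexFor(key)];
        if (!chain.contains(key)) chain.add(key);
    }

    boolean contains(K key) {
        // This scan is the collision-resolution work: on average it visits k entries,
        // so the worse the collision rate, the further from O(1) the lookup gets.
        for (K candidate : buckets[indexFor(key)]) {
            if (candidate.equals(key)) return true;
        }
        return false;
    }
}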

Does hashtable size depend upon the length of the key?

What I know:
(1) Hashtable size depends on the load factor.
(2) It should be a prime number, and that prime is used as the modulo value in the hash function.
(3) The prime should not be too close to a power of 2 or a power of 10.
My doubt:
Does the size of the hash table depend on the length of the key?
The following paragraph is from the book Introduction to Algorithms by Cormen.
Does n = 2000 mean the length of a string or the number of elements that will be stored in the hash table?
Good values for m are primes not too close to exact powers of 2. For
example, suppose we wish to allocate a hash table, with collisions
resolved by chaining, to hold roughly n = 2000 character strings,
where a character has 8 bits. We don't mind examining an average of 3
elements in an unsuccessful search, so we allocate a hash table of
size m = 701. The number 701 is chosen because it is a prime near
2000/3 but not near any power of 2. Treating each key k as an integer,
our hash function would be
h(k) = k mod 701 .
Can somebody explain it?
Here's a general overview of the tradeoff with hash tables.
Suppose you have a hash table with m buckets with chains storing a total of n objects.
If you store only references to objects, the total memory consumed is O(m + n).
Now suppose that, for an average object, its size is s, it takes O(s) time to compute its hash once, and O(s) time to compare two such objects.
Consider an operation checking whether an object is present in the hash table.
The bucket will have n/m elements on average, so the operation will take O(s n/m) time.
So, the tradeoff is this: when you increase the number of buckets m, you increase memory consumption but decrease average time for a single operation.
For the original question - does the size of the hash table depend on the length of the key? - no, it should not, at least not directly.
The paragraph you cite only mentions the strings as an example of an object to store in a hash table.
One mentioned property is that they are 8-bit character strings.
The other is that "We don't mind examining an average of 3 elements in an unsuccessful search".
And that wraps the properties of the stored object into the form: how many elements on average do we want to place in a single bucket?
The length of strings themselves is not mentioned anywhere.
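Plugging in the numbers from the quoted paragraph (n is the number of stored strings, m the bucket count), a fragment to check the "average of 3 elements" claim:

int n = 2000;  // number of keys to store - not the length of any string
int m = 701;   // bucket count: a prime near 2000/3, not near any power of 2
double averageChainLength = (double) n / m;
System.out.println(averageChainLength); // about 2.85, i.e. roughly 3 probes in an unsuccessful search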
(2) and (3) are false. It is common for a hash table to have 2^n buckets, as long as you use the right hash function. On (1), the memory a hash table takes is roughly the number of buckets times the size of a key slot. Note that for string keys, we usually store pointers to the strings, not the strings themselves, so the key slot is the size of a pointer, which is 8 bytes on a 64-bit machine.
Algorithmically, no!
The length of the key is irrelevant here.
Moreover, the key itself is not important; what's important is the number of different keys you predict you'll have.
Implementation-wise, yes! Since you must store the key itself in your hashtable, the key's length is reflected in the table's size.
For your second question, 'n' means the number of different keys to hold.

How do hashtable indexes work?

I know about creating hashcodes, collisions, the relationship between .GetHashCode and .Equals, etc.
What I don't quite understand is how a 32-bit hash number is used to get the ~O(1) lookup. If you had an array big enough to hold every possible 32-bit value then you would get ~O(1), but that would be a waste of memory.
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehashes the 32-bit number to a 3-digit number and uses that for the lookup. When the number of elements reaches a certain threshold (say 75%), it would expand the array to something like 10K items and recompute the internal hash numbers as 4-digit numbers, based on the 32-bit hash of course.
btw, here I'm using ~O(1) to account for possible collisions and their resolutions.
Do I have the gist of it correct or am I completely off the mark?
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehashes the 32-bit number to a 3-digit number and uses that for the lookup.
That's exactly what happens, except that the capacity (number of bins) of the table is more commonly set to a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to insert an item. When the capacity is a power of two, the modulus operation becomes a simple bitmasking op.
When the number of elements reaches a certain threshold (say 75%)
If you're referring to the Java Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.
it would expand the array to something like 10K items
In most implementations, the capacity will not be increased ten-fold but rather doubled (for power-of-two-sized hash tables) or multiplied by roughly 1.5 + the distance to the next prime number.
The hashtable has a number of bins that contain items. The number of bins is quite small to start with. Given a hash code, it simply uses hashcode modulo bincount to find the bin in which the item should reside. That gives the fast lookup (find the bin for an item: take the modulo of the hash code, done).
Or in (pseudo) code:
int hash = obj.GetHashCode();
int binIndex = (hash & 0x7FFFFFFF) % binCount; // mask off the sign bit so a negative hash code can't give a negative index
// The item is in bin #binIndex. Go get the items there and find the one that matches (via Equals).
Obviously, as you figured out yourself, at some point the table will need to grow. When it does this, a new array of bins is created, and the items in the table are redistributed to the new bins. This also means that growing a hashtable can be slow. (So, approx. O(1) in most cases, unless the insert triggers an internal resize. Lookups should always be ~O(1).)
In general, there are a number of variations in how hash tables handle overflow.
Many (including Java's, if memory serves) resize when the load factor (percentage of bins in use) exceeds some particular percentage. The downside of this is that the speed is undependable -- most insertions will be O(1), but a few will be O(N).
To ameliorate that problem, some resize gradually instead: when the load factor exceeds the magic number, they:
Create a second (larger) hash table.
Insert the new item into the new hash table.
Move some items from the existing hash table to the new one.
Then, each subsequent insertion moves another chunk from the old hash table to the new one. This retains the O(1) average complexity, and can be written so the complexity for every insertion is essentially constant: when the hash table gets "full" (i.e., load factor exceeds your trigger point) you double the size of the table. Then, each insertion you insert the new item and move one item from the old table to the new one. The old table will empty exactly as the new one fills up, so every insertion will involve exactly two operations: inserting one new item and moving one old one, so insertion speed remains essentially constant.
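A rough sketch of that incremental scheme (illustrative only, using java.util.HashMap instances as stand-ins for the two bucket arrays; a real implementation would migrate buckets of its own internal arrays):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class IncrementalResizer<K, V> {
    private Map<K, V> oldTable = new HashMap<>();
    private Map<K, V> newTable = new HashMap<>();
    private final Deque<K> pendingMoves = new ArrayDeque<>();

    void startResize(Map<K, V> full) {
        oldTable = full;
        newTable = new HashMap<>(full.size() * 2);
        pendingMoves.addAll(full.keySet());
    }

    void put(K key, V value) {
        oldTable.remove(key);         // a fresh insert supersedes any old entry
        newTable.put(key, value);     // 1) insert the new item into the new table
        K move = pendingMoves.poll(); // 2) move one old entry across, if any remain
        if (move != null && oldTable.containsKey(move)) {
            newTable.put(move, oldTable.remove(move));
        }
    }

    V get(K key) {
        V v = newTable.get(key);
        return (v != null) ? v : oldTable.get(key); // look in both tables while migrating
    }
}

Each put does a bounded, constant amount of extra work, which is how the O(N) rehash spike gets spread across many insertions.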
There are also other strategies. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely. As the hash table fills up, you just end up with more items in each tree. In theory, this means the complexity is O(log N), but for any practical size it's proportional to log(N/M), where M is the number of buckets. For practical size ranges (e.g., up to several billion items) that's essentially constant (log N grows very slowly), it's often a little faster for the largest table you can fit in memory, and a lot faster for smaller sizes. The shortcoming is that it's only really practical when the objects you're storing are fairly large -- if you stored (for example) one character per node, the overhead from two pointers (plus, usually, balance information) per node would be extremely high.
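And a compact sketch of the "table of balanced trees" idea (illustrative; java.util.TreeMap plays the role of the balanced tree hanging off each bucket, so keys must be Comparable):

import java.util.TreeMap;

class TreeBucketMap<K extends Comparable<K>, V> {
    private final TreeMap<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    TreeBucketMap(int bucketCount) {
        buckets = new TreeMap[bucketCount];
        for (int i = 0; i < bucketCount; i++) buckets[i] = new TreeMap<>();
    }

    private TreeMap<K, V> bucketFor(K key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    // Overflow is simply absorbed: each tree holds about N/M entries,
    // so a lookup walks a tree of that size instead of a long chain.
    void put(K key, V value) { bucketFor(key).put(key, value); }

    V get(K key) { return bucketFor(key).get(key); }
}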

Best data structure to store lots of one-bit data

I want to store lots of data so that:
they can be accessed by an index,
each datum is just yes or no (so one bit is probably enough for each).
I am looking for the data structure that has the highest performance and occupies the least space.
Storing the data in flat memory, one bit per datum, is probably not a good choice; on the other hand, using various tree structures still uses lots of memory (e.g. pointers are required in each node to build the tree, even though each node holds just one bit of data).
Does anyone have any ideas?
What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory efficient)?
Well, in Java the BitSet might be a good choice: http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
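For example (a tiny illustrative use of java.util.BitSet, which packs the flags into an array of longs internally):

import java.util.BitSet;

public class BitSetDemo {
    public static void main(String[] args) {
        BitSet flags = new BitSet(1_000_000); // room for a million yes/no values
        flags.set(42);                        // mark index 42 as "yes"
        System.out.println(flags.get(42));    // true
        flags.clear(42);                      // back to "no"
        System.out.println(flags.get(42));    // false
    }
}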
If I understand your question correctly, you should store them in an unsigned integer where you assign each value to a bit of the integer (a flag).
Say you represent 3 values and each can be on or off. Then you assign the first to 1, the second to 2 and the third to 4. Your unsigned int can then be 0, 1, 2, 3, 4, 5, 6 or 7 depending on which values are on or off, and you check the values using bitwise operations.
Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language will need to be able to overload the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants), then returning the result of arr[n / n_bits] & (1 << (n % n_bits)).
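Here is that packed layout as a small illustrative Java class (names are my own; the backing array holds longs, so n_bits is 64):

class PackedBits {
    private final long[] words;

    PackedBits(int capacity) {
        words = new long[(capacity + 63) / 64]; // 64 yes/no values per long
    }

    boolean get(int n) {
        return (words[n / 64] & (1L << (n % 64))) != 0; // arr[n / n_bits] & (1 << (n % n_bits))
    }

    void set(int n, boolean value) {
        if (value) words[n / 64] |=  (1L << (n % 64));
        else       words[n / 64] &= ~(1L << (n % 64));
    }
}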
Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
It performs very well and is space-efficient. But make sure you read the fine print below ;-). Quote from the above wiki page:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.
Removing an element from this simple Bloom filter is impossible. The element maps to k bits, and although setting any one of these k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. Such removal would introduce a possibility for false negatives, which are not allowed.
One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. However, it is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
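Putting the quoted description into a minimal sketch (illustrative only; the k indices are derived from two base hashes with the double-hashing arithmetic mentioned above, and the way the second hash is derived here is an arbitrary choice of mine):

import java.util.BitSet;

class BloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the filter
    private final int k; // number of probe positions per element

    BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    private int index(Object element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.reverse(h1) * 0x9E3779B1; // second hash derived from the first (illustrative)
        return Math.floorMod(h1 + i * h2, m);      // double hashing: h1 + i*h2 mod m
    }

    void add(Object element) {
        for (int i = 0; i < k; i++) bits.set(index(element, i)); // set all k positions to 1
    }

    boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) return false; // definitely not in the set
        }
        return true; // possibly in the set - false positives are allowed, removal is not supported
    }
}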

Hash Functions and Tables of size of the form 2^p

While calculating the hash table bucket index from the hash code of a key, why do we avoid using the remainder after division (modulo) when the size of the array of buckets is a power of 2?
When calculating the hash, you want as much information as you can cheaply munge into it, with good distribution across the entire range of bits: e.g. 32-bit unsigned integers are usually good, unless you have a lot (>3 billion) of items to store in the hash table.
It's converting the hash code into a bucket index that you're really interested in. When the number of buckets n is a power of two, all you need to do is an AND operation between the hash code h and (n-1), and the result is equal to h mod n.
A reason this may be bad is that the AND operation simply discards bits - the high-order bits - from the hash code. This may be good or bad, depending on other things. On one hand, it will be very fast, since AND is a lot faster than division (and is the usual reason why you would choose a power-of-2 number of buckets), but on the other hand, poor hash functions may have poor entropy in the lower bits: that is, the lower bits don't change much when the data being hashed changes.
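A quick illustration of the equivalence (a fragment; it assumes the hash code is non-negative):

int n = 1 << 4;           // 16 buckets, a power of two
int h = 0x5A3C;           // some non-negative hash code
int viaMod = h % n;       // 12
int viaAnd = h & (n - 1); // keeps only the low 4 bits, discarding the rest: also 12
System.out.println(viaMod == viaAnd); // true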
Let us say that the table size is m = 2^p.
Let k be a key.
Then, whenever we do k mod m, we will only get the last p bits of the binary representation of k. Thus, if I put in several keys that have the same last p bits, the hash function will perform very badly, as all those keys will be hashed to the same slot in the table. Thus, avoid powers of 2.
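For example (illustrative numbers): with m = 2^3 = 8, any keys that share the same last 3 bits land in the same slot, no matter how different their high bits are:

int m = 8;                        // 2^3, so k mod m keeps only the last 3 bits of k
int[] keys = { 5, 13, 21, 1029 }; // binary ...101: all share the same last 3 bits
for (int k : keys) {
    System.out.println(k + " mod " + m + " = " + (k % m)); // every key lands in slot 5
}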

Resources