Suppose we have a hash table with 2^16 keys and values. Each key can be represented as a bit string (e.g., 0000, 0000, 0000, 0000). Now we want to construct a new hash table. The key of the new hash table is still a bit string (e.g., 0000, ****, ****, ****). The corresponding value would be the average of all values in the old hash table as each * ranges over 0 and 1. For instance, the value for 0000, ****, ****, **** will be the average of the 2^12 values in the old hash table from 0000, 0000, 0000, 0000 to 0000, 1111, 1111, 1111. Intuitively, we would need about C(16, 4) * 2^16 operations to construct the new hash table. What's the most efficient way to construct it?
The hash table here is not helping you at all, although it isn't much of a hindrance either.
Hash tables cannot, by their nature, cluster keys by the key prefix. In order to provide good hash distribution, keys need to be distributed as close to uniformly as possible between hash values.
If you will later need to process keys in some specific order, you might consider an ordered associative mapping, such as a balanced binary tree or some variant of a trie. On the other hand, the advantage of processing keys in order needs to be demonstrated in order to justify the additional overhead of an ordered mapping.
In this case, every key needs to be visited, which means the ordered mapping and the hash mapping will both be O(n), assuming linear-time traversal and constant-time processing, both reasonable assumptions. However, during the processing each result value needs two accumulated intermediaries: basically a running total and a count. (There is an algorithm for "on-line" computation of the mean of a series, but it also requires two intermediate values, a running mean and a count. So although it has advantages, reducing storage requirements isn't one of them.)
You can use the output hash table to store one of the intermediate values for each output value, but you need somewhere to put the other one. That might be another hash table of the same size, or something similar; in any case, there is an additional storage cost.
If you could traverse the original hash table in prefix order, you could reduce this storage cost to a constant, since the two temporary values can be recycled every time you reach a new prefix. So that's a savings, but I doubt whether it's sufficient to justify the overhead of an ordered associative mapping, which also includes increased storage requirements.
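To make that concrete, here is a minimal Python sketch (my own, not the asker's actual setup) of the single-pass approach for one wildcard pattern: 16-bit integer keys, keeping the top nibble and wildcarding the low 12 bits, i.e. the 0000, ****, ****, **** family. Any other mask works the same way, and the totals and counts dictionaries are exactly the two intermediaries discussed above.

def average_by_prefix(old_table, keep_mask=0xF000):
    # old_table: dict mapping 16-bit integer keys to values
    totals = {}   # output key -> running total (first intermediary)
    counts = {}   # output key -> count          (second intermediary)
    for key, value in old_table.items():
        out_key = key & keep_mask                 # collapse the wildcarded bits
        totals[out_key] = totals.get(out_key, 0) + value
        counts[out_key] = counts.get(out_key, 0) + 1
    return {k: totals[k] / counts[k] for k in totals}

# new_table = average_by_prefix(old_table)   # 2^4 entries, one per value of the kept nibble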
I am given two hash functions that I should use for insertion and deletion into the table:
int hash1(int key)
{
    return (key % TABLESIZE);
}

int hash2(int key)
{
    return (key % PRIME) + 1;
}
I'm confused on how I utilize them.
Do I:
Use hash1 first, and if the slot is taken at the return value, use hash2?
Do I use hash1 then add the result to hash2's output?
Do I use hash1's output as hash2's input?
TLDR: bucket_selected = hash1(hash2(key))
hash1() is an identity hash from key to bucket (it folds the keys into the table size). If the table size happens to be a power of two, it effectively keeps some number of low bits, discarding the high bits; for example, with table size 256, it effectively returns key & 255: the least significant 8 bits. This will be somewhat collision prone if the keys aren't either:
mostly contiguous numbers - possibly with a few small gaps, such that they cleanly map onto successive buckets most of the time, or
pretty random in the low bits used, so they scatter across the buckets
If table size is not a power of two, and ideally is a prime, the high order bits help spread the keys around the buckets. Collisions are just as likely for e.g. random numbers, but on computer systems sometimes different bits in a key are more or less likely to vary (for example, doubles in memory consist of a sign, many mantissa, and many exponent bits: if your numbers are of similar magnitude, the exponents won't vary much between them), or there are patterns based on power-of-two boundaries. For example, if you have 4 ASCII characters packed into a uint32_t key, then a % table_size hash with table_size 256 extracts just one of the characters as the hash value. If table size was instead 257, then varying any of the characters would change the bucket selected.
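A quick Python illustration of that packed-characters example (the packing order and the sample words are my own assumptions): with table size 256, the % only ever sees the low byte, while a prime size like 257 depends on every byte.

def pack(chars):
    # pack 4 ASCII characters into a 32-bit integer, first character in the low byte
    return sum(ord(c) << (8 * i) for i, c in enumerate(chars))

for word in ("abcd", "abce", "abcz"):
    key = pack(word)
    print(word, key % 256, key % 257)
# key % 256 is 97 ('a') for all three words; key % 257 differs for each of them.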
(key % PRIME) + 1 comes close to doing what hash1() would do if the table size was prime, but why add 1? Some languages do index their arrays from 1, which is the only good reason I can think of, but if dealing with such a language, you'd probably want hash1() to add 1 too. To explain the potential use for hash2(), let's take a step back first...
Real general-purpose hash table implementations need to be able to create tables of different sizes - whatever suits the program - and indeed, often applications want the table to "resize" or grow dynamically if more elements are inserted than it can handle well. Because of that, a hash function such as hash1 would be dependent on the hash table implementation or calling code to tell it the current table size. It's normally more convenient if the hash functions can be written independently of any given hash table implementation, only needing the key as input. What many hash functions do is hash the key to a number of a certain size, e.g. a uint32_t or uint64_t. Clearly that means there may be more hash values than there are buckets in the hash table, so a % operation (or faster bitwise-& operation if the # buckets is a power of two) is then used to "fold" the hash value back onto the buckets. So, a good hash table implementation usually accepts a hash function generating e.g. uint32_t or uint64_t output and internally does the % or &.
In the case of hash1 - it can be used:
as an identity hash folding the key to a bucket, or
to fold a hash value from another hash function to a bucket.
In the second usage, that second hash function could be hash2. This might make sense if the keys given to hash2 were typically much larger than the PRIME used, and yet the PRIME was in turn much larger than the number of buckets. To explain why this is desirable, let's take another step back...
Say you have 8 buckets and a hash function that produces a number in the range [0..10] with uniform probability: if you % the hash values into the table size, hash values 0..7 will map to buckets 0..7, and hash values 8..10 will map to buckets 0..2, so buckets 0..2 can be expected to have about twice as many keys collide there as the other buckets. When the range of hash values is vastly larger than the number of buckets, the significance of some buckets receiving one more hash value than others is correspondingly tiny. Alternatively, if you have, say, a hash function outputting 32-bit numbers (so the number of distinct hash values is a power of two), then % by a smaller power of two will map exactly the same number of hash values to each bucket.
So, let's return to my earlier assertion: hash2()'s potential utility is actually to use it like this:
bucket_selected = hash1(hash2(key))
In the above formula, hash1 distributes across the buckets while preventing out-of-bounds bucket access; for this to work reasonably, hash2 should output a range of numbers much larger than the number of buckets, but it won't do anything at all unless the keys span a range larger than PRIME, and ideally they'd span a range vastly larger than PRIME, increasing the odds of hash values from hash2(key) forming a near-uniform distribution between 1 and PRIME.
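Here is a small Python sketch of that composition; the TABLESIZE and PRIME values are illustrative choices of mine, not anything given in the question.

TABLESIZE = 64          # number of buckets
PRIME = 1000003         # much larger than the number of buckets

def hash1(key):
    return key % TABLESIZE

def hash2(key):
    return (key % PRIME) + 1

def bucket_for(key):
    # hash2 folds a huge key into [1, PRIME]; hash1 folds that onto a bucket
    return hash1(hash2(key))

for key in (12, 10**12 + 7, 98765432101234):    # keys spanning a range far larger than PRIME
    print(key, "->", bucket_for(key))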
In Aditya Bhargava's book "Grokking Algorithms: An Illustrated Guide for Programmers and Other Curious People" I read that worst-case complexity can be avoided if we avoid collisions.
As I understand it, a collision is when the hash function returns the same value for different keys.
How does this affect hash table complexity in CRUD operations?
Thanks
I read that worst-case complexity can be avoided if we avoid collisions.
That's correct - worst-case complexity happens when the hash values for all the elements stored in a hash table map onto, and collide at, the same bucket.
As I understand it, a collision is when the hash function returns the same value for different keys.
Ultimately a value is mapped using a hash function to a bucket in the hash table. That said, it's common for that overall conceptual hash function to be implemented as a hash function producing a value in a huge numerical range (e.g. a 32-bit hash between 0 and 2^32-1, or a 64-bit hash between 0 and 2^64-1), and then have that value mapped onto a specific bucket based on the current hash table bucket count using the % operator. So, say your hash table has 137 buckets: you might generate a hash value of 139, then compute 139 % 137 == 2 and use the third bucket ([2] in an array of buckets). This two-step approach makes it easy to use the same hash function (producing 32-bit or 64-bit hashes) regardless of the size of the table. If you instead created a hash function that produced numbers between 0 and 136 directly, it wouldn't work at all well for slightly smaller or larger bucket counts.
Returning to your question...
As I understand it, a collision is when the hash function returns the same value for different keys.
...for the "32- or 64-bit hash function followed by %" approach I've described above, there are two distinct types of collisions: the 32- or 64-bit hash function itself may produce exactly the same 32- or 64-bit value for distinct values being hashed, or it might produce different values that - after the % operation - nevertheless map to the same bucket in the hash table.
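A toy Python illustration of those two collision types, with made-up hash values and an 8-bucket table:

BUCKETS = 8

def bucket(hash_value):
    return hash_value % BUCKETS

# 1) Identical full hash values for different keys: they inevitably share a bucket.
h_cat, h_dog = 1000003, 1000003               # pretend hash("cat") == hash("dog")
print(bucket(h_cat) == bucket(h_dog))         # True

# 2) Different hash values that still land in the same bucket after the %.
h_a, h_b = 17, 25                             # 17 % 8 == 25 % 8 == 1
print(h_a != h_b, bucket(h_a) == bucket(h_b)) # True True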
How does this affect hash table complexity in CRUD operations?
Hash tables work by probabilistically spreading the values across the buckets. When many values collide at the same bucket, a secondary search mechanism has to be employed to process all the colliding values (and possibly other intermingled values, if you're using Open Addressing to try a sequence of buckets in the hash table, rather than hanging a linked list or binary tree of colliding elements off every bucket). So basically, the worse the collision rate, the further from idealised O(1) complexity you get, though you really only start to affect big-O complexity significantly if you have a particularly bad hash function, in light of the set of values being stored.
In a hash table implementation that has a good hashing function, and the load factor (number of entries divided by total capacity) is 70% or less, the number of collisions is fairly low and hash lookup is O(1).
If you have a poor hashing function or your load factor starts to increase, then the number of collisions increases. If you have a poor hashing function, then some hash codes will have many collisions and others will have very few. Your average lookup rate might still be close to O(1), but some lookups will take much longer because collision resolution takes a long time. For example, if hash code value 11792 has 10 keys mapped to it, then you potentially have to check 10 different keys before you can return the matching key.
If the hash table is overloaded, with each hash code having approximately the same number of keys mapped to it, then your average lookup rate will be O(k), where k is the average number of collisions per hash code.
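As a rough illustration (my own experiment, not from the book), here is a small Python simulation that scatters randomly hashed entries over chained buckets and reports the average chain length at a few load factors:

import random

def average_chain_length(num_entries, num_buckets, seed=0):
    rng = random.Random(seed)
    buckets = [0] * num_buckets
    for _ in range(num_entries):
        buckets[rng.getrandbits(32) % num_buckets] += 1   # pretend hash values are uniform
    occupied = [c for c in buckets if c]
    return sum(occupied) / len(occupied)                  # entries per non-empty bucket

for load_factor in (0.5, 0.75, 2, 10):
    num_buckets = 10000
    print(load_factor, round(average_chain_length(int(load_factor * num_buckets), num_buckets), 2))
# Chains stay close to length 1 at low load factors and grow roughly in step with
# the load factor once the table is overloaded - which is the k discussed above.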
I have an array with, for example, 1,000,000,000,000 elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? Elements must be unique in the whole array, not just within the list of N (3 in my example) picked elements.
I read about reservoir sampling, but it only provides a way to pick random elements, which may not be unique.
If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.
import random

counts = [0] * (MAXINT - MININT + 1)      # one counter per possible integer value
for value in Elements:
    counts[value - MININT] += 1
uniques = [MININT + i for i, c in enumerate(counts) if c == 1]   # values occurring exactly once
result = random.sample(uniques, 3)
I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look for a unique element (one that occurs exactly once) among its X lowest-hashing values. If it has already been picked for another hash algorithm, find another. (This should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.
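Here is a rough Python sketch of the idea, assuming integer elements: pool_size stands in for X, tuple hashing with a random salt stands in for the three salted hash algorithms, and the linear max() scan marks where a real implementation would use the priority queue from Step 2.

import random
from collections import defaultdict

def one_pass_unique_sample(elements, k=3, pool_size=1000, seed=0):
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]       # Step 1: k salted hash functions
    pools = [dict() for _ in range(k)]                    # value -> salted hash (the X lowest seen)
    counts = [defaultdict(int) for _ in range(k)]         # value -> occurrences, pool members only

    for x in elements:                                    # Step 2: a single scan of the array
        for i in range(k):
            h = hash((salts[i], x))
            pool, cnt = pools[i], counts[i]
            if x in pool:
                cnt[x] += 1
            elif len(pool) < pool_size:
                pool[x] = h
                cnt[x] += 1
            elif h < max(pool.values()):                  # a priority queue would replace this scan
                worst = max(pool, key=pool.get)           # evict the member with the largest hash
                del pool[worst]
                del cnt[worst]
                pool[x] = h
                cnt[x] += 1

    picked = []                                           # Step 3: one unique element per hash function
    for i in range(k):
        for x in sorted(pools[i], key=pools[i].get):      # smallest salted hash first
            if counts[i][x] == 1 and x not in picked:
                picked.append(x)
                break
    return picked

# e.g. one_pass_unique_sample([random.randrange(10**6) for _ in range(10**5)])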
I read that inside a hash table we have a bucket array, but I don't understand what that bucket array contains.
Does it contain the hash index? The entry (key/value pair)? Both?
The diagram I found (of a bucket array) doesn't make this clear to me.
So, which is a bucket array?
The array index is mostly equivalent to the hash value (well, the hash value mod the size of the array), so there's no need to store that in the array at all.
As to what the actual array contains, there are a few options:
If we use separate chaining:
A reference to a linked-list of all the elements that have that hash value. So:
LinkedList<E>[]
A linked-list node (i.e. the head of the linked-list) - similar to the first option, but we instead just start off with the linked-list straight away without wasting space by having a separate reference to it. So:
LinkedListNode<E>[]
If we use open addressing, we're simply storing the actual element. If there's another element with the same hash value, we use some reproducible technique to find a place for it (e.g. we just try the next position). So:
E[]
There may be a few other options, but the above are the best-known, with separate chaining being the most popular (to my knowledge).
* I'm assuming some familiarity with generics and Java/C#/C++ syntax - E here is simply the type of the element we're storing, LinkedList<E> means a LinkedList storing elements of type E. X[] is an array containing elements of type X.
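If it helps, here is a tiny Python stand-in for the separate-chaining case (plain lists playing the role of the LinkedList<E> chains above), just to show what the bucket array actually holds:

class ChainedHashSet:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]   # the bucket array: one chain per bucket

    def add(self, element):
        chain = self.buckets[hash(element) % len(self.buckets)]
        if element not in chain:                          # walk the chain to resolve collisions
            chain.append(element)

    def contains(self, element):
        return element in self.buckets[hash(element) % len(self.buckets)]

s = ChainedHashSet()
for word in ("apple", "banana", "cherry"):
    s.add(word)
print(s.contains("banana"), s.contains("durian"))         # True False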
What goes into the bucket array depends a lot on what is stored in the hash table, and also on the collision resolution strategy.
When you use linear probing or another open addressing technique, your bucket table stores keys or key-value pairs, depending on the use of your hash table *.
When you use a separate chaining technique, then your bucket array stores pairs of keys and the headers of your chaining structure (e.g. linked lists).
The important thing to remember about the bucket array is that it establishes a mapping between a hash code and a group of zero or more keys. In other words, given a hash code and a bucket array, you can find out, in constant time, what are the possible keys associated with this hash code (enumerating the candidate keys may be linear, but finding the first one needs to be constant time in order to meet hash tables' performance guarantee of amortized constant time insertions and constant-time searches on average).
* If your hash table is used for checking membership (i.e. it represents a set of keys) then the bucket array stores keys; otherwise, it stores key-value pairs.
In practice, it's a linked list of the entries whose keys have been computed (by hashing) to go into that bucket.
In a hash table there are, most of the time, collisions: that is, different elements have the same hash value. Elements with the same hash value are stored in one bucket, so for each hash value you have a bucket containing all elements that have that hash value.
A bucket is a linked list of key-value pairs. The hash index tells you "which bucket", and the "key" in the key-value pair tells you "which entry in that bucket".
Also check out hashing in Java -- structure & access time, where I've given more details.
I know about creating hashcodes, collisions, the relationship between .GetHashCode and .Equals, etc.
What I don't quite understand is how a 32-bit hash number is used to get the ~O(1) lookup. If you had an array big enough to hold all the possibilities of a 32-bit number then you would get the ~O(1), but that would be a waste of memory.
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehashes the 32-bit number to a 3-digit number and uses that as the lookup index. When the number of elements reaches a certain threshold (say 75%) it would expand the array to something like 10K items and recompute the internal hash numbers to 4-digit numbers, based on the 32-bit hash of course.
btw, here I'm using ~O(1) to account for possible collisions and their resolutions.
Do I have the gist of it correct or am I completely off the mark?
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehashes the 32-bit number to a 3-digit number and uses that as the lookup index.
That's exactly what happens, except that the capacity (number of bins) of the table is more commonly set to a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to insert an item. When the capacity is a power of two, the modulus operation becomes a simple bitmasking op.
When the number of elements reaches a certain threshold (say 75%)
If you're referring to the Java Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.
it would expand the array to something like 10K items
In most implementations, the capacity will not be increased ten-fold but rather doubled (for power-of-two-sized hash tables) or multiplied by roughly 1.5 and then rounded up to a nearby prime (for prime-sized tables).
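To illustrate the bitmasking point mentioned above, a quick Python check (my own) that, for a power-of-two capacity, taking the hash modulo the capacity is the same as masking off its low bits:

capacity = 1024                 # a power of two
mask = capacity - 1             # 0b1111111111
for h in (0, 12345, 987654321, 2**31 - 1):
    assert h % capacity == h & mask
print("h % capacity == h & (capacity - 1) for power-of-two capacities")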
The hashtable has a number of bins that contain items. The number of bins is quite small to start with. Given a hashcode, it simply uses hashcode modulo bincount to find the bin in which the item should reside. That gives the fast lookup (find the bin for an item: take the modulo of the hashcode, done).
Or in (pseudo) code:
int hash = obj.GetHashCode();
int binIndex = (hash & 0x7FFFFFFF) % binCount;  // clear the sign bit: GetHashCode() may return a negative value
// The item is in bin #binIndex. Go get the items there and find the one that matches.
Obviously, as you figured out yourself, at some point the table will need to grow. When it does this, a new array of bins is created, and the items in the table are redistributed to the new bins. This also means that growing a hashtable can be slow. (So, approx. O(1) in most cases, unless the insert triggers an internal resize. Lookups should always be ~O(1).)
In general, there are a number of variations in how hash tables handle overflow.
Many (including Java's, if memory serves) resize when the load factor (the number of entries divided by the number of bins) exceeds some particular threshold. The downside of this is that the speed is undependable -- most insertions will be O(1), but a few will be O(N).
To ameliorate that problem, some resize gradually instead: when the load factor exceeds the magic number, they:
Create a second (larger) hash table.
Insert the new item into the new hash table.
Move some items from the existing hash table to the new one.
Then, each subsequent insertion moves another chunk from the old hash table to the new one. This retains the O(1) average complexity, and can be written so the complexity for every insertion is essentially constant: when the hash table gets "full" (i.e., the load factor exceeds your trigger point) you double the size of the table. Then, on each insertion you insert the new item and move one item from the old table to the new one. The old table will empty exactly as the new one fills up, so every insertion involves exactly two operations: inserting one new item and moving one old one, and insertion speed remains essentially constant.
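Here is a rough Python sketch (not any particular library's implementation) of that gradual scheme: when the trigger point is hit, a table of twice the size is allocated, and every later insert also migrates one old bucket, so no single insert has to touch the whole table.

class GradualHashTable:
    def __init__(self, capacity=8, max_load=0.75):
        self.buckets = [[] for _ in range(capacity)]   # the current (new) table
        self.old_buckets = None                        # the table being drained, if resizing
        self.migrate_index = 0                         # next old bucket to move
        self.size = 0
        self.max_load = max_load

    def _chain(self, key, buckets):
        return buckets[hash(key) % len(buckets)]

    def insert(self, key, value):
        if self.old_buckets is None and self.size > self.max_load * len(self.buckets):
            self.old_buckets = self.buckets            # trigger point reached: double the table
            self.buckets = [[] for _ in range(2 * len(self.old_buckets))]
            self.migrate_index = 0
        # (a full implementation would also handle a key that still lives in the old table)
        self._chain(key, self.buckets).append((key, value))
        self.size += 1
        self._migrate_one()

    def _migrate_one(self):
        if self.old_buckets is None:
            return
        while self.migrate_index < len(self.old_buckets):
            chain = self.old_buckets[self.migrate_index]
            self.migrate_index += 1
            for k, v in chain:                         # rehash one old bucket into the new table
                self._chain(k, self.buckets).append((k, v))
            if chain:
                break                                  # moved one non-empty bucket: done for this insert
        if self.migrate_index >= len(self.old_buckets):
            self.old_buckets = None                    # old table fully drained

    def lookup(self, key):
        tables = [self.buckets] if self.old_buckets is None else [self.buckets, self.old_buckets]
        for table in tables:                           # during a resize, a key may be in either table
            for k, v in self._chain(key, table):
                if k == key:
                    return v
        return None

t = GradualHashTable()
for i in range(100):
    t.insert(i, i * i)
print(t.lookup(37), t.lookup(1000))                    # 1369 None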
There are also other strategies. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely. As the hash table fills up, you just end up with more items in each tree. In theory, this means the complexity is O(log N), but for any practical size it's proportional to log(N/M), where M is the number of buckets. For practical size ranges (e.g., up to several billion items) that's essentially constant (log N grows very slowly), and it's often a little faster for the largest table you can fit in memory, and a lot faster for smaller sizes. The shortcoming is that it's only really practical when the objects you're storing are fairly large -- if you stored (for example) one character per node, the overhead from two pointers (plus, usually, balance information) per node would be extremely high.