Time performance of inserting into a hash table using external chaining

Suppose I am going to insert a new element into a hash table that uses external chaining. If the table resizes, I know the insert operation takes Θ(1) time.
However, I don't understand why the performance is different if the bucket is of fixed size. Shouldn't it just be an insertion into a linked list, which is also Θ(1)?
This is from the slide of CS61B #UCB.

The "fixed size" vs "resizing" refers to the number of buckets, rather than the size of each individual bucket.
The idea is that if we have a fixed number of buckets, say k buckets, and we insert n elements into the hash table, then with a hash function with perfect spread, each bucket will hold n/k elements.
It takes O(n/k) time to look through all of the items in a bucket, which an insert typically has to do to check whether the key is already present. Since k is a constant because it is fixed, the time is O(n).
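A minimal sketch of this effect, assuming integer keys and a hypothetical helper `chain_lengths` that hashes with `key % k` (which spreads consecutive integers perfectly). With k fixed, the longest chain grows in direct proportion to n:

```python
def chain_lengths(n, k):
    """Insert keys 0..n-1 into k fixed buckets and return each chain's length."""
    buckets = [[] for _ in range(k)]
    for key in range(n):
        buckets[key % k].append(key)  # perfect spread for consecutive integers
    return [len(chain) for chain in buckets]

for n in (10, 100, 1000):
    # With k = 5 buckets the longest chain is n / 5: 2, then 20, then 200.
    print(n, max(chain_lengths(n, k=5)))
```

Every tenfold increase in n makes each chain ten times longer, which is the linear growth described above.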


What is the connection between collisions and the complexity of CRUD operations in a hash table?

In Aditya Bhargava's book "Grokking Algorithms: An illustrated guide for programmers and other curious people" I read that the worst-case complexity can be avoided if we avoid collisions.
As I understand it, a collision is when the hash function returns the same value for different keys.
How does this affect hash table complexity in CRUD operations?
Thanks
"I read that the worst-case complexity can be avoided if we avoid collisions."
That's correct: the worst-case complexity occurs when the hash values of all the elements stored in a hash table map to, and collide at, the same bucket.
"As I understand it, a collision is when the hash function returns the same value for different keys."
Ultimately a value is mapped using a hash function to a bucket in the hash table. That said, it's common for that overall conceptual hash function to be implemented as a hash function producing a value in a huge numerical range (e.g. a 32-bit hash between 0 and 2^32-1, or a 64-bit hash between 0 and 2^64-1), and then to have that value mapped on to a specific bucket based on the current hash table bucket count using the % operator. So, say your hash table has 137 buckets: you might generate a hash value of 139, then compute 139 % 137 == 2 and use the third bucket ([2] in an array of buckets). This two-step approach makes it easy to use the same hash function (producing 32-bit or 64-bit hashes) regardless of the size of the table. If you instead created a hash function that produced numbers between 0 and 136 directly, it wouldn't work at all well for slightly smaller or larger bucket counts.
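The two-step approach can be sketched as follows. This is only an illustration: `bucket_for` is a hypothetical helper, and `zlib.crc32` is used here merely as a convenient stand-in for "some 32-bit hash function", not as a recommendation.

```python
import zlib

def bucket_for(key, bucket_count):
    """Map a string key to a bucket index in two steps."""
    wide_hash = zlib.crc32(key.encode("utf-8"))  # step 1: 32-bit value in [0, 2^32 - 1]
    return wide_hash % bucket_count              # step 2: fold onto the current bucket count

print(139 % 137)  # 2, i.e. the third bucket, as in the example above
# The same wide hash works unchanged for any table size:
print(bucket_for("apple", 137), bucket_for("apple", 64))
```

Only the second step depends on the table size, so the table can change its bucket count without changing the hash function.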
Returning to your question...
"As I understand it, a collision is when the hash function returns the same value for different keys."
...for the "32- or 64-bit hash function followed by %" approach I've described above, there are two distinct types of collisions: the 32- or 64-bit hash function itself may produce exactly the same 32- or 64-bit value for distinct values being hashed, or it might produce different values that, after the % operation, nevertheless map to the same bucket in the hash table.
"How does this affect hash table complexity in CRUD operations?"
Hash tables work by probabilistically spreading the values across the buckets. When many values collide at the same bucket, a secondary search mechanism has to be employed to process all the colliding values (and possibly other intermingled values, if you're using Open Addressing to try a sequence of buckets in the hash table, rather than hanging a linked list or binary tree of colliding elements off every bucket). So basically, the worse the collision rate, the further from idealised O(1) complexity you get, though you really only start to affect big-O complexity significantly if you have a particularly bad hash function, in light of the set of values being stored.
In a hash table implementation with a good hashing function and a load factor (number of entries divided by total capacity) of 70% or less, the number of collisions is fairly low and hash lookup is O(1).
If you have a poor hashing function or your load factor starts to increase, then the number of collisions increases. If you have a poor hashing function, then some hash codes will have many collisions and others will have very few. Your average lookup rate might still be close to O(1), but some lookups will take much longer because collision resolution takes a long time. For example, if hash code value 11792 has 10 keys mapped to it, then you potentially have to check 10 different keys before you can return the matching key.
If the hash table is overloaded, with each hash code having approximately the same number of keys mapped to it, then your average lookup rate will be O(k), where k is the average number of collisions per hash code.

Hash Table sequence always get inserted

I have a problem related to hash tables.
Consider a hash table with 2^n slots that uses an open addressing scheme with the linear probe function
h(k,i) = (k^n + 2*i) mod (2^n). Show that the sequence
{1, 2, ..., 2^n} can always be inserted into the hash table.
I tried to identify a pattern in the way the numbers get inserted into the table and then apply induction to prove the claim. Every problem our teacher gives us seems to be like this one, and I can't figure out how to approach this kind of problem.
"h(k,i) = (k^n + 2*i) mod (2^n). Show that the sequence {1, 2, ..., 2^n} can always be inserted into the hash table."
Two observations about the hash function:
- k^n, for n >= 1, will be odd when k is odd, and even when k is even
- 2*i will probe every second bucket (wrapping around from last to first)
So, as you hash {1,2,...2^n} we know you'll alternate between finding an unused odd-indexed bucket, and an even-indexed bucket.
Just to emphasise the point, the k^n bit restricts the odd keys to odd-indexed buckets and the even keys to even-indexed buckets, while 2*i ensures all such buckets are considered until a free one's found. It's necessary that exactly half the keys will be odd and half even for the table to become full without h(k,i) failing to find an unused bucket as i is incremented.
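The argument can be checked by brute force for small n. The sketch below assumes k^n means "k to the power n", as the observations above do; `insert_all` is a hypothetical helper that returns the filled table, or None if some key runs out of probes.

```python
def insert_all(n):
    """Insert 1..2^n using h(k, i) = (k^n + 2*i) mod 2^n; return the table or None."""
    m = 2 ** n
    table = [None] * m
    for k in range(1, m + 1):
        # m/2 probes visit every slot with the same parity as k^n exactly once.
        for i in range(m // 2):
            slot = (pow(k, n, m) + 2 * i) % m
            if table[slot] is None:
                table[slot] = k
                break
        else:
            return None  # no free slot found for key k
    return table

print(insert_all(2))  # [2, 1, 4, 3]
print(all(insert_all(n) is not None for n in range(1, 9)))  # True
```

In the n = 2 run, the odd keys 1 and 3 end up in the odd slots 1 and 3, and the even keys 2 and 4 in the even slots 0 and 2, matching the parity argument. A check like this is not a proof, but it is a quick way to spot the pattern before writing one.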
You have a lot of terminology problems here.
Your hash table does not have dimensions (actually it has, but it is one dimension, and not 2^n); it has a number of slots/buckets.
Most probably the question you asked is not the question your book/teacher wants you to solve. You wrote:
"Show that the sequence {1,2,...2^n} always can be inserted into the hash table"
and the problem is that in your case any natural number can be inserted into your hash table. This is obvious, because your hash function maps any number to a natural number in the range [0, 2^n), and because your hash table has 2^n slots, any number will fit in your table.
So clarify what your teacher wants, find out what k and i are in your hash function, and ask another, better-prepared question.

hash table about the load factor

I'm studying hash tables for an algorithms class and I've become confused about the load factor.
Why is the load factor, n/m, significant with 'n' being the number of elements and 'm' being the number of table slots?
Also, why does this load factor equal the expected length of n(j), the linked list at slot j in the hash table when all of the elements are stored in a single slot?
The crucial property of a hash table is the expected constant time it takes to look up an element.*
To achieve this, the implementer of the hash table has to make sure that every query to the hash table completes within some fixed number of steps.
If you have a hash table with m buckets and you keep adding elements indefinitely (i.e. n >> m), then the lists will also grow, and you can no longer guarantee expected constant time for lookups; you will get linear time instead (since the time needed to traverse the ever-growing linked lists outweighs the time to find the bucket).
So, how can we keep the lists from growing? We have to make sure that the length of each list is bounded by some fixed constant. How do we do that? We have to add additional buckets.
If the hash table is well implemented, then the hash function being used to map the elements to buckets, should distribute the elements evenly across the buckets. If the hash function does this, then the length of the lists will be roughly the same.
How long is one of the lists if the elements are distributed evenly? Clearly we'll have total number of elements divided by the number of buckets, i.e. the load factor n/m (number of elements per bucket = expected/average length of each list).
Hence, to ensure constant time look up, what we have to do is keep track of the load factor (again: expected length of the lists) such that, when it goes above the fixed constant we can add additional buckets.
Of course, there are more problems which come in, such as how to redistribute the elements you already stored or how many buckets should you add.
The important message to take away, is that the load factor is needed to decide when to add additional buckets to the hash table - that's why it is not only 'important' but crucial.
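A minimal sketch of this idea, under stated assumptions: `ChainedTable` is a hypothetical class, the threshold of 0.75 and the doubling policy are illustrative choices, and Python lists stand in for the linked lists.

```python
class ChainedTable:
    """Separate-chaining hash set that doubles its bucket count
    whenever the load factor n/m exceeds max_load (an assumed threshold)."""

    def __init__(self, buckets=4, max_load=0.75):
        self.buckets = [[] for _ in range(buckets)]
        self.count = 0
        self.max_load = max_load

    def load_factor(self):
        return self.count / len(self.buckets)  # n / m

    def add(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        if key in chain:
            return
        chain.append(key)
        self.count += 1
        if self.load_factor() > self.max_load:
            self._grow()

    def _grow(self):
        # Redistribute every stored key over twice as many buckets.
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for chain in old:
            for key in chain:
                self.buckets[hash(key) % len(self.buckets)].append(key)

    def __contains__(self, key):
        return key in self.buckets[hash(key) % len(self.buckets)]

table = ChainedTable()
for key in range(1000):
    table.add(key)
print(len(table.buckets), table.load_factor())  # 2048 0.48828125
```

However many keys are added, the load factor stays at or below the threshold after each insert, so the expected chain length stays bounded by a constant.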
Of course, if you map all the elements to the same bucket, then the average length of each list won't be worth much. All this stuff only makes sense, if you distribute evenly across the buckets.
*Note the word "expected"; I can't emphasize this enough. It's typical to hear "hash tables have constant lookup time". They do not! The worst case is always O(n), and you can't make that go away.
Adding to the existing answers, let me just put in a quick derivation.
Consider an arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the ith element is inserted into this bucket and 0 otherwise.
We want to find E[X_1 + X_2 + ... + X_n].
By linearity of expectation, this equals E[X_1] + E[X_2] + ... + E[X_n].
Now we need to find the value of E[X_i]. Assuming each element is equally likely to land in any of the m buckets, this is simply (1/m) * 1 + (1 - 1/m) * 0 = 1/m by the definition of expected value. Summing over all i, we get 1/m + 1/m + ... + 1/m (n times), which equals n/m. We have just found the expected number of elements inserted into a random bucket, and this is the load factor.
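The derivation can be sanity-checked with a small simulation. This sketch assumes that each element picks one of the m buckets uniformly at random (modeled with a seeded `random.Random`); `mean_bucket_size` is a hypothetical helper that averages the size of bucket 0 over many trials.

```python
import random

def mean_bucket_size(n, m, trials=5000, seed=0):
    """Average number of the n elements that land in bucket 0."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Each term is the indicator X_i: 1 when element i lands in bucket 0.
        total += sum(1 for _ in range(n) if rng.randrange(m) == 0)
    return total / trials

print(mean_bucket_size(n=50, m=10))  # close to n/m = 5
```

The printed average should sit near n/m, while individual trials vary around it, which is why n/m is only an expected length.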

Hashing analysis in hashtable

The search time for a hash value is O(1+alpha), where
alpha = number of elements / size of table
I don't understand why the 1 is added.
The expected number of elements examined is
(1/n) * sum for i = 1 to n of (1 + (i-1)/m)
I don't understand this either. How is it derived?
(I know how to evaluate the above expression, but I want to understand how one arrives at it.)
EDIT : n is number of elements present and m is the number of slots or the size of the table
"I don't understand why the 1 is added."
The O(1) is there to say that even if there is no element in the bucket, or in the hash table at all, you still have to compute the key's hash value, so the search is not instantaneous.
The second part of your question needs to be more precise. See my comments.
EDIT:
Your second expression comes from an "amortized"-style analysis that averages over the stored elements: consider n insertions into an initially empty hash table. Looking up the ith inserted element takes O(1) for hashing plus O((i-1)/m) for searching the bucket content, assuming the i-1 earlier elements are spread evenly over the m buckets. Averaging over all n elements and evaluating the sum gives the O(1+alpha) time.
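To see how the sum resolves: the average (1/n) * sum for i = 1 to n of (1 + (i-1)/m) works out to 1 + (n-1)/(2m), which is 1 + alpha/2 - alpha/(2n) and therefore O(1+alpha). A short exact check, where `expected_examined` and `closed_form` are hypothetical helper names:

```python
from fractions import Fraction

def expected_examined(n, m):
    """(1/n) * sum_{i=1..n} (1 + (i-1)/m), computed exactly."""
    return sum(1 + Fraction(i - 1, m) for i in range(1, n + 1)) / n

def closed_form(n, m):
    """1 + (n-1)/(2m), i.e. 1 + alpha/2 - alpha/(2n) with alpha = n/m."""
    return 1 + Fraction(n - 1, 2 * m)

print(expected_examined(10, 5), closed_form(10, 5))  # 19/10 19/10
```

The leading 1 accounts for the element that is found, and the remaining term is about half the load factor because, on average, about half of a chain precedes the element being searched for.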

How do hashtable indexes work?

I know about creating hashcodes, collisions, the relationship between .GetHashCode and .Equals, etc.
What I don't quite understand is how a 32-bit hash number is used to get the ~O(1) lookup. If you have an array big enough to hold every possible 32-bit number, then you do get ~O(1), but that would be a waste of memory.
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehashes the 32-bit number to a 3-digit number and uses that for the lookup. When the number of elements reaches a certain threshold (say 75%), it would expand the array to something like 10K items and recompute the internal hash numbers as 4-digit numbers, based on the 32-bit hash of course.
btw, here I'm using ~O(1) to account for possible collisions and their resolutions.
Do I have the gist of it correct or am I completely off the mark?
"My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehashes the 32-bit number to a 3-digit number and uses that for the lookup."
That's exactly what happens, except that the capacity (number of bins) of the table is more commonly set to a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to insert an item. When the capacity is a power of two, the modulus operation becomes a simple bitmasking op.
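A small sketch of the bitmasking remark, assuming non-negative hash codes; `bin_index_mod` and `bin_index_mask` are hypothetical helper names:

```python
def bin_index_mod(hash_code, capacity):
    return hash_code % capacity

def bin_index_mask(hash_code, capacity):
    # Only valid when capacity is a power of two: capacity - 1 is then all
    # ones in binary, so the AND keeps exactly the low-order bits.
    assert capacity > 0 and capacity & (capacity - 1) == 0
    return hash_code & (capacity - 1)

h = 0x9E3779B9  # an arbitrary 32-bit value
print(bin_index_mod(h, 1024) == bin_index_mask(h, 1024))  # True
```

The mask avoids a division, which is one reason power-of-two capacities are popular; the trade-off is that only the low bits of the hash code influence the bin.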
"When the number of elements reaches a certain threshold (say 75%)"
If you're referring to the Java Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.
"it would expand the array to something like 10K items"
In most implementations, the capacity will not be increased ten-fold but rather doubled (for power-of-two-sized hash tables) or multiplied by roughly 1.5 + the distance to the next prime number.
The hashtable has a number of bins that contain items. The number of bins is quite small to start with. Given a hashcode, it simply uses hashcode modulo bincount to find the bin in which the item should reside. That gives the fast lookup (find the bin for an item: take the hashcode modulo the bin count, done).
Or in (pseudo) code:
int hash = obj.GetHashCode();
// GetHashCode() may return a negative value; clear the sign bit so the index is never negative.
int binIndex = (hash & 0x7FFFFFFF) % binCount;
// The item is in bin #binIndex. Go get the items there and find the one that matches.
Obviously, as you figured out yourself, at some point the table will need to grow. When it does, a new array of bins is created, and the items in the table are redistributed to the new bins. This also means that growing a hashtable can be slow. (So, inserts are approx. O(1) in most cases, unless the insert triggers an internal resize. Lookups should always be ~O(1).)
In general, there are a number of variations in how hash tables handle overflow.
Many (including Java's, if memory serves) resize when the load factor (percentage of bins in use) exceeds some particular percentage. The downside of this is that the speed is undependable -- most insertions will be O(1), but a few will be O(N).
To ameliorate that problem, some resize gradually instead: when the load factor exceeds the magic number, they:
1. Create a second (larger) hash table.
2. Insert the new item into the new hash table.
3. Move some items from the existing hash table to the new one.
Then, each subsequent insertion moves another chunk from the old hash table to the new one. This retains the O(1) average complexity, and can be written so the complexity for every insertion is essentially constant: when the hash table gets "full" (i.e., load factor exceeds your trigger point) you double the size of the table. Then, each insertion you insert the new item and move one item from the old table to the new one. The old table will empty exactly as the new one fills up, so every insertion will involve exactly two operations: inserting one new item and moving one old one, so insertion speed remains essentially constant.
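The gradual scheme can be sketched roughly as follows. This is an illustration under stated assumptions, not any particular library's implementation: `IncrementalTable` is a hypothetical class, chains are Python lists, and the trigger point (`max_load` of 1.0) is arbitrary.

```python
class IncrementalTable:
    """Chained hash set that moves one old entry per insertion while resizing."""

    def __init__(self, buckets=4, max_load=1.0):
        self.new = [[] for _ in range(buckets)]
        self.old = None      # table being drained, if a resize is in progress
        self.old_pos = 0     # next old bucket to drain
        self.count = 0
        self.max_load = max_load

    @staticmethod
    def _chain(table, key):
        return table[hash(key) % len(table)]

    def __contains__(self, key):
        # During a resize an item may live in either table, so check both.
        if key in self._chain(self.new, key):
            return True
        return self.old is not None and key in self._chain(self.old, key)

    def _move_one(self):
        """Move a single entry from the old table into the new one."""
        while self.old is not None:
            if self.old_pos == len(self.old):
                self.old = None                  # old table fully drained
            elif self.old[self.old_pos]:
                key = self.old[self.old_pos].pop()
                self._chain(self.new, key).append(key)
                return
            else:
                self.old_pos += 1                # skip an already empty bucket

    def add(self, key):
        if key in self:
            return
        if self.old is None and self.count >= self.max_load * len(self.new):
            # Start a gradual resize: the full table becomes the "old" one.
            self.old, self.old_pos = self.new, 0
            self.new = [[] for _ in range(2 * len(self.old))]
        self._chain(self.new, key).append(key)   # insert the new item
        self.count += 1
        self._move_one()                         # and move one old item

t = IncrementalTable()
for k in range(1000):
    t.add(k)
print(all(k in t for k in range(1000)))  # True
```

The price of the smoother insert cost is that lookups must consult two tables while a resize is in progress. Skipping empty old buckets in `_move_one` also means a single call is not strictly constant time in this simplified version.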
There are also other strategies. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely. As the hash table fills up, you just end up with more items in each tree. In theory, this means the complexity is O(log N), but for any practical size it's proportional to log(N/M), where M = number of buckets. For practical size ranges (e.g., up to several billion items) that's essentially constant (log N grows very slowly), and it's often a little faster for the largest table you can fit in memory, and a lot faster for smaller sizes. The shortcoming is that it's only really practical when the objects you're storing are fairly large: if you stored (for example) one character per node, the overhead from two pointers (plus, usually, balance information) per node would be extremely high.
