What is the difference between Load factor and Space utilization in a Hashtable? Please, someone explain!
Load factor
Definition:
The load factor of a Hashtable is the ratio of elements to buckets. Smaller load factors cause faster average lookup times at the cost of
increased memory consumption. The default load factor of 1.0 generally
provides the best balance between speed and size.
In other words, a smaller load factor leads to faster access to the elements of the HashTable (when finding a given element, iterating, ...) but requires more memory.
On the contrary, a high load factor will be slower (on average), but use less memory.
A bucket holds a certain number of items.
Sometimes each location in the table is a bucket which will hold a fixed number of items, all of which hash to that same location. This speeds up lookups because there is probably no need to go look at another location.
Linear probing as well as double hashing: the load factor is defined as n/prime, where n is the number of items in the table and prime is the (prime) size of the table. Thus a load factor of 1 means that the table is full.
Here is an example benchmark (average number of probes per lookup, under the assumption of a large prime table size):
load       --- successful lookup ---     --- unsuccessful lookup ---
factor        linear       double            linear       double
---------------------------------------------------------------------
0.50           1.50         1.39              2.50         2.00
0.75           2.50         1.85              8.50         4.00
0.90           5.50         2.56             50.50        10.00
0.95          10.50         3.15            200.50        20.00
Table source.
Some hash tables use other collision-resolution schemes: for example, in separate chaining, where items that hash to the same location are stored in a linked list, lookup time is measured by the number of list nodes that have to be examined. For a successful search, this number is 1 + lf/2, where lf is the load factor. Because each table location holds a linked list, which can contain a number of items, the load factor can be greater than 1, whereas 1 is the maximum possible in an ordinary hash table.
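As a rough illustration of separate chaining, here is a minimal Java sketch (the names ChainedTable and Entry are invented; this is not any particular library's implementation): each bucket holds a linked list, the load factor can exceed 1, and a lookup walks the chain at the computed bucket.

import java.util.LinkedList;

// Minimal separate-chaining sketch: each bucket is a linked list of entries,
// so the load factor n/m may exceed 1 and a successful lookup examines
// roughly 1 + loadFactor/2 nodes on average.
class ChainedTable {
    static class Entry {
        final int key;
        final String value;
        Entry(int key, String value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    ChainedTable(int bucketCount) {
        buckets = new LinkedList[bucketCount];
        for (int i = 0; i < bucketCount; i++) buckets[i] = new LinkedList<>();
    }

    private int bucketFor(int key) {
        return Math.floorMod(key, buckets.length);   // hash(key) = key mod m
    }

    void put(int key, String value) {
        buckets[bucketFor(key)].add(new Entry(key, value));
        size++;
    }

    String get(int key) {
        for (Entry e : buckets[bucketFor(key)]) {    // work grows with the chain length
            if (e.key == key) return e.value;
        }
        return null;
    }

    double loadFactor() {
        return (double) size / buckets.length;       // can be greater than 1 with chaining
    }
}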
Space utilization
The idea is that we store records of data in the hash table. Each record has a key field and an associated data field. The record is stored in a location that is based on its key. The function that produces this location for each given key is called a hash function.
Let's suppose that each key field contains an integer and the data field a string (array of characters type of string). One possible hash function is hash(key) = key % prime.
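As a minimal sketch of that record/hash-function idea (the class names and the table size 101 are made up for illustration, and collisions are ignored; the point is only the key-to-location mapping):

// A record with an integer key and a string data field, stored at hash(key) = key % prime.
class Record {
    int key;          // key field
    String data;      // associated data field
}

class RecordTable {
    static final int PRIME = 101;             // table size, chosen prime
    Record[] slots = new Record[PRIME];

    static int hash(int key) {
        return Math.floorMod(key, PRIME);     // hash(key) = key % prime
    }

    void store(Record r) {
        slots[hash(r.key)] = r;               // place the record at its hashed location
    }
}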
Definition:
The space utilization would be the ratio of the number of actually occupied buckets (which depends on the load factor) to the total number of buckets reserved in the hash table.
For technical reasons, a prime number of buckets works better, which (modulo the number of actually occupied buckets) can result in some wasted memory.
Conclusion: Rather than having to proceed through a linear search or a binary search, a hash table will usually complete a lookup after just one comparison! Sometimes, however, two comparisons (or even more) are needed. A hash table thus delivers (almost) the ideal lookup time. The trade-off is that, to get this great lookup time, some memory space is wasted.
As you can see, I am no expert, and I'm getting information while writing this, so any comment is welcome to make this more accurate or less... well... wrong...
I switched it to Community Wiki mode (feel free to improve).
Load factor is a measure of how full the hash table is with respect to its total number of buckets. Let's say you have 1000 buckets and you want to store at most 70% of that number. If the load factor exceeds this maximum ratio (more than 700 elements are stored), the hash table size can be increased to effectively hold more elements.
Space utilization is the ratio of the number of filled buckets to the total number of buckets in a hash table.
Usually, when the load factor increases, space utilization increases, and in an ideal hash table, load factor and space utilization should be linearly related to each other. However, in most cases, space utilization is a sublinear function of the load factor because some buckets end up holding more than one element at high load factors.
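A small Java sketch of the two ratios for a table whose buckets are lists (the names are illustrative): the load factor counts elements per bucket, while space utilization only counts buckets that hold at least one element, so collisions make it lag behind the load factor.

import java.util.List;

class TableStats {
    // load factor: elements divided by buckets (can exceed 1 with chaining)
    static double loadFactor(int elementCount, int bucketCount) {
        return (double) elementCount / bucketCount;
    }

    // space utilization: fraction of buckets that actually hold something
    static double spaceUtilization(List<?>[] buckets) {
        int filled = 0;
        for (List<?> bucket : buckets) {
            if (bucket != null && !bucket.isEmpty()) filled++;
        }
        return (double) filled / buckets.length;
    }
}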
In order to obtain hashing performance close to the ideal case, you may need a perfect hashing function.
A perfect hashing function maps a key into a unique address. If the
range of potential addresses is the same as the number of keys, the
function is a minimal (in space) perfect hashing function
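A tiny illustration of that definition for one fixed key set (the keys {3, 4, 8} are chosen only so the example works): hash(k) = k % 3 maps them to 0, 1 and 2 with no collisions, and the address range (3 slots) equals the number of keys, so it is a minimal perfect hash for that set.

class MinimalPerfectHashDemo {
    static int hash(int key) {
        return key % 3;            // perfect (and minimal) only for this particular key set
    }

    public static void main(String[] args) {
        int[] keys = {3, 4, 8};
        for (int k : keys) {
            System.out.println("key " + k + " -> slot " + hash(k));   // 0, 1, 2: no collisions
        }
    }
}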
Related
In Aditya Bhargava's book "Grokking Algorithms: An illustrated guide for programmers and other curious people" I read that worst-case complexity can be avoided if we avoid collisions.
As I understand it, a collision is when the hash function returns the same value for different keys.
How does this affect hash table complexity in CRUD operations?
Thanks
I read that worst-case complexity can be avoided if we avoid collisions.
That's correct - worst-case complexity happens when the hash values for all the elements stored in a hash table map onto, and collide at, the same bucket.
As I understand it, a collision is when the hash function returns the same value for different keys.
Ultimately a value is mapped using a hash function to a bucket in the hash table. That said, it's common for that overall conceptual hash function to be implemented as a hash function producing a value in a huge numerical range (e.g. a 32-bit hash between 0 and 2^32-1, or a 64-bit hash between 0 and 2^64-1), which is then mapped onto a specific bucket based on the current hash table bucket count, using the % operator. So, say your hash table has 137 buckets: you might generate a hash value of 139, compute 139 % 137 == 2, and use the third bucket ([2] in an array of buckets). This two-step approach makes it easy to use the same hash function (producing 32-bit or 64-bit hashes) regardless of the size of the table. If you instead created a hash function that produced numbers between 0 and 136 directly, it wouldn't work at all well for slightly smaller or larger bucket counts.
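In Java-like terms, the two-step mapping looks roughly like this (the 137-bucket / hash-139 numbers come from the example above):

class TwoStepHashing {
    public static void main(String[] args) {
        int bucketCount = 137;                             // current number of buckets
        int hashValue = 139;                               // pretend this came from a 32-bit hash function
        int bucket = Math.floorMod(hashValue, bucketCount);
        System.out.println(bucket);                        // prints 2: the third bucket, [2] in the array
    }
}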
Returning to your question...
As I understand it, a collision is when the hash function returns the same value for different keys.
...for the "32- or 64-bit hash function followed by %" approach I've described above, there are two distinct types of collision: the 32- or 64-bit hash function itself may produce exactly the same 32- or 64-bit value for distinct values being hashed, or it may produce different values that - after the % operation - nevertheless map to the same bucket in the hash table.
How does this affect hash table complexity in CRUD operations?
Hash tables work by probabilistically spreading the values across the buckets. When many values collide at the same bucket, a secondary search mechanism has to be employed to process all the colliding values (and possibly other intermingled values, if you're using Open Addressing to try a sequence of buckets in the hash table, rather than hanging a linked list or binary tree of colliding elements off every bucket). So basically, the worse the collision rate, the further from the idealised O(1) complexity you get, though you really only start to affect big-O complexity significantly if you have a particularly bad hash function in light of the set of values being stored.
In a hash table implementation that has a good hashing function, and the load factor (number of entries divided by total capacity) is 70% or less, the number of collisions is fairly low and hash lookup is O(1).
If you have a poor hashing function or your load factor starts to increase, then the number of collisions increases. If you have a poor hashing function, then some hash codes will have many collisions and others will have very few. Your average lookup rate might still be close to O(1), but some lookups will take much longer because collision resolution takes a long time. For example, if hash code value 11792 has 10 keys mapped to it, then you potentially have to check 10 different keys before you can return the matching key.
If the hash table is overloaded, with each hash code having approximately the same number of keys mapped to it, then your average lookup rate will be O(k), where k is the average number of collisions per hash code.
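To make the "many keys per hash code" case concrete, here is a small Java sketch using a deliberately bad key type (the class BadKey and the constant 11792 are just for illustration): every key reports the same hash code, so all entries collide and each lookup has to compare against the colliding keys one by one.

import java.util.HashMap;
import java.util.Map;

class BadKey {
    private final int id;
    BadKey(int id) { this.id = id; }

    @Override public int hashCode() { return 11792; }    // constant hash code: worst possible distribution
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < 10; i++) map.put(new BadKey(i), "value" + i);
        System.out.println(map.get(new BadKey(7)));      // must wade through the colliding keys via equals()
    }
}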
I'm reading the article "Rationale for Adding Hash Tables to the C++ Standard Template Library", and I don't understand this seemingly simple statement:
With hash tables, the amount of extra memory required depends on the
organization of the table and on the load factor (whose definition also
depends on the organization). The simplest case is the organization
called open addressing, in which all entries are stored in a single
random-access table. [...] In this case the amount of memory used per entry is M/α.
*M is the number of bytes required for the key and associated value, α is the load factor.
Why is it M/α? Why isn't it simply M+(amount of memory for each bucket * total buckets)?
In open addressing, you have a fixed-sized array of slots into which the elements are distributed. This is just a plain array with space for elements and (optionally) some control bits thrown in to mark which slots are full and which are empty.
Let's say that we have a table with s slots and that we want to distribute n elements into the table. This means that α = n / s, the number of elements divided by the number of slots. The space usage of the entire table is then sM, because there are s slots and each slot uses M bytes. Therefore, if we want to compute the memory used per element, we want to compute sM / n = M / (n / s) = M / α, which is where the formula comes from. Intuitively, this makes sense. If you have a single element in the table, the load factor is 1 / s and the total memory (Ms) divided by the number of elements (1) is therefore Ms. On the other hand, if the table is fully-loaded (n = s), then α = 1 and the total memory (Ms) divided by the number of elements (s) is equal to M.
You're on the right track in your calculation by looking at the amount of memory per bucket and multiplying that by the number of buckets. If you treat M as the size per element and s as the number of slots, you end up with a total space usage of Ms. (There's no need to add the M term in, and doing so actually gives you the wrong units: M has units "bytes per element" and Ms has units "bytes," so they shouldn't be added together).
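A quick numeric sanity check of the formula (the numbers M = 16 bytes, s = 1000 slots and n = 700 elements are arbitrary):

class MemoryPerEntry {
    public static void main(String[] args) {
        double M = 16.0;                       // bytes per key + associated value
        int s = 1000;                          // slots in the open-addressing table
        int n = 700;                           // elements actually stored
        double alpha = (double) n / s;         // load factor = 0.7
        System.out.println((s * M) / n);       // total memory / elements: about 22.86 bytes
        System.out.println(M / alpha);         // M / alpha: the same, about 22.86 bytes
    }
}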
I'm studying about hash table for algorithm class and I became confused with the load factor.
Why is the load factor, n/m, significant with 'n' being the number of elements and 'm' being the number of table slots?
Also, why does this load factor equal the expected length of n(j), the linked list at slot j in the hash table when all of the elements are stored in a single slot?
The crucial property of a hash table is the expected constant time it takes to look up an element.*
In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table returns below some fixed amount of steps.
If you have a hash table with m buckets and you add elements indefinitely (i.e. n >> m), then the size of the lists will also grow and you can't guarantee that expected constant time for lookups; instead you will get linear time, since the running time needed to traverse the ever-increasing linked lists will outweigh the lookup of the bucket.
So, how can we ensure that the lists don't grow? You have to make sure that the length of the lists is bounded by some fixed constant - how do we do that? We have to add additional buckets.
If the hash table is well implemented, then the hash function being used to map the elements to buckets, should distribute the elements evenly across the buckets. If the hash function does this, then the length of the lists will be roughly the same.
How long is one of the lists if the elements are distributed evenly? Clearly, it's the total number of elements divided by the number of buckets, i.e. the load factor n/m (number of elements per bucket = expected/average length of each list).
Hence, to ensure constant-time lookups, what we have to do is keep track of the load factor (again: the expected length of the lists) so that, when it goes above the fixed constant, we can add additional buckets.
Of course, there are more problems which come in, such as how to redistribute the elements you already stored or how many buckets should you add.
The important message to take away, is that the load factor is needed to decide when to add additional buckets to the hash table - that's why it is not only 'important' but crucial.
Of course, if you map all the elements to the same bucket, then the average length of each list won't tell you much. All this stuff only makes sense if you distribute the elements evenly across the buckets.
*Note the expected - I can't emphasize this enough. It's typical to hear "hash tables have constant lookup time". They do not! The worst case is always O(n) and you can't make that go away.
Adding to the existing answers, let me just put in a quick derivation.
Consider an arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the i-th element is inserted into this bucket and 0 otherwise.
We want to find E[X_1 + X_2 + ... + X_n].
By linearity of expectation, this equals E[X_1] + E[X_2] + ... E[X_n]
Now we need to find the value of E[X_i]. This is simply (1/m)·1 + (1 - 1/m)·0 = 1/m, by the definition of expected value. Summing up these values for all i, we get 1/m + 1/m + ... + 1/m (n times), which equals n/m. We have just found the expected number of elements inserted into an arbitrarily chosen bucket, and this is the load factor.
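If you want to see the result empirically, here is a small simulation (the parameters m = 97, n = 1000 and the seed are arbitrary): keys are hashed uniformly at random, and the average number landing in one fixed bucket comes out close to the load factor n/m.

import java.util.Random;

class ExpectedBucketLoad {
    public static void main(String[] args) {
        int m = 97, n = 1_000, trials = 20_000;
        Random rng = new Random(42);
        long hitsInBucketZero = 0;
        for (int t = 0; t < trials; t++) {
            for (int i = 0; i < n; i++) {
                if (rng.nextInt(m) == 0) hitsInBucketZero++;   // element i fell into bucket 0
            }
        }
        System.out.println("observed average: " + (double) hitsInBucketZero / trials);
        System.out.println("load factor n/m : " + (double) n / m);   // about 10.31
    }
}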
I'm learning about hash tables and quadratic probing in particular. I've read that if the load factor is <= 0.5 and the table's size is prime, quadratic probing will always find an empty slot and no key will be accessed multiple times. It then goes on to say that, in order to ensure efficient insertions, I should always maintain a load factor <= 0.5. What does this mean? Surely if we keep adding items, the load factor will increase until it equals 1 whether we want it to or not. So what is implied when my textbook says I should maintain a small load factor?
The implication is that at some point (when you would exceed a load factor of 0.5 in this case), you'll have to allocate a new table (which is bigger by some factor, maybe 1.5 or 2, and then rounded up to the nearest prime number) and copy all the elements from the old table into it (that's not a straight copy, the new position of an item will usually be different than the old position).
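A minimal sketch of that growth policy (names and the doubling factor are illustrative; the re-insertion of existing elements is only hinted at in the comments): before an insert would push the load factor past 0.5, grow to roughly double the capacity and round up to the next prime.

class GrowthPolicy {
    // true if inserting one more element would exceed a load factor of 0.5
    static boolean needsGrow(int elementCount, int capacity) {
        return (elementCount + 1) > capacity / 2;
    }

    // new capacity: double the old one, then round up to the next prime;
    // after growing, every existing element must be re-inserted (rehashed).
    static int nextCapacity(int oldCapacity) {
        return nextPrime(2 * oldCapacity);
    }

    static int nextPrime(int n) {
        while (!isPrime(n)) n++;
        return n;
    }

    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(nextCapacity(11));   // 23: 11 doubled to 22, rounded up to the next prime
    }
}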
I know about creating hashcodes, collisions, the relationship between .GetHashCode and .Equals, etc.
What I don't quite understand is how a 32-bit hash number is used to get the ~O(1) lookup. If you had an array big enough to hold every possible 32-bit value then you would get the ~O(1), but that would be a waste of memory.
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup. When the number of elements reaches a certain threshold (say 75%) it would expand the array to something like 10K items and recompute the internal hash numbers to 4 digit numbers, based on the 32bit hash of course.
btw, here I'm using ~O(1) to account for possible collisions and their resolutions.
Do I have the gist of it correct or am I completely off the mark?
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup.
That's exactly what happens, except that the capacity (number of bins) of the table is more commonly set to a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to insert an item. When the capacity is a power of two, the modulus operation becomes a simple bitmasking op.
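For example (illustrative numbers), with a power-of-two capacity the modulo and the bitmask pick the same bin for a non-negative hash:

class BitmaskBinIndex {
    public static void main(String[] args) {
        int capacity = 1024;                          // power of two
        int hash = 1_234_567;                         // pretend 32-bit hash code (non-negative here)
        System.out.println(hash % capacity);          // 647
        System.out.println(hash & (capacity - 1));    // 647 as well, via bitmasking
    }
}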
When the number of elements reaches a certain threshold (say 75%)
If you're referring to the Java Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.
it would expand the array to something like 10K items
In most implementations, the capacity will not be increased ten-fold but rather doubled (for power-of-two-sized hash tables) or multiplied by roughly 1.5 + the distance to the next prime number.
The hashtable has a number of bins that contain items. The number of bins are quite small to start with. Given a hashcode, it simply uses hashcode modulo bincount to find the bin in which the item should reside. That gives the fast lookup (Find the bin for an item: Take modulo of the hashcode, done).
Or in (pseudo) code:
int hash = obj.GetHashCode();
int binIndex = (hash & 0x7FFFFFFF) % binCount;   // mask off the sign bit so a negative hash code still yields a valid bin
// The item is in bin #binIndex. Go get the items there and find the one that matches.
Obviously, as you figured out yourself, at some point the table will need to grow. When it does this, a new array of bins is created, and the items in the table are redistributed to the new bins. This also means that growing a hashtable can be slow. (So, approx. O(1) in most cases, unless the insert triggers an internal resize. Lookups should always be ~O(1).)
In general, there are a number of variations in how hash tables handle overflow.
Many (including Java's, if memory serves) resize when the load factor (percentage of bins in use) exceeds some particular percentage. The downside of this is that the speed is undependable -- most insertions will be O(1), but a few will be O(N).
To ameliorate that problem, some resize gradually instead: when the load factor exceeds the magic number, they:
Create a second (larger) hash table.
Insert the new item into the new hash table.
Move some items from the existing hash table to the new one.
Then, each subsequent insertion moves another chunk from the old hash table to the new one. This retains the O(1) average complexity, and can be written so the complexity for every insertion is essentially constant: when the hash table gets "full" (i.e., load factor exceeds your trigger point) you double the size of the table. Then, on each insertion, you insert the new item and also move one item from the old table to the new one. The old table will empty exactly as the new one fills up, so every insertion involves exactly two operations: inserting one new item and moving one old one, so insertion speed remains essentially constant.
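Here is a rough Java sketch of that gradual scheme (the class name is invented, and two java.util.HashMaps stand in for the internal tables purely to show the bookkeeping; the load-factor check that triggers startResize() is left out): every put() stores into the new table and migrates one entry from the old one, so no single insertion pays for the whole rehash.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class IncrementalResizeMap {
    private Map<String, String> oldTable = null;              // non-null only while a resize is in progress
    private Map<String, String> newTable = new HashMap<>();

    void put(String key, String value) {
        if (oldTable != null) oldTable.remove(key);            // avoid a stale copy being migrated later
        newTable.put(key, value);
        migrateOne();                                          // amortize the rehash across insertions
    }

    String get(String key) {
        String v = newTable.get(key);
        if (v == null && oldTable != null) v = oldTable.get(key);   // entry may not have migrated yet
        return v;
    }

    void startResize() {                                       // call when the load-factor trigger fires
        oldTable = newTable;
        newTable = new HashMap<>(oldTable.size() * 2);
    }

    private void migrateOne() {
        if (oldTable == null) return;
        Iterator<Map.Entry<String, String>> it = oldTable.entrySet().iterator();
        if (it.hasNext()) {
            Map.Entry<String, String> e = it.next();
            newTable.put(e.getKey(), e.getValue());
            it.remove();
        }
        if (oldTable.isEmpty()) oldTable = null;               // old table drained, resize complete
    }
}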
There are also other strategies. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely. As the hash table fills up, you just end up with more items in each tree. In theory, this means the complexity is O(log N), but for any practical size it's proportional to log N/M, where M = number of buckets. For practical size ranges (e.g., up to several billion items) that's essentially constant (log N grows very slowly), and it's often a little faster for the largest table you can fit in memory, and a lot faster for smaller sizes. The shortcoming is that it's only really practical when the objects you're storing are fairly large - if you stored (for example) one character per node, the overhead from two pointers (plus, usually, balance information) per node would be extremely high.
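For what it's worth, a bare-bones Java sketch of that "table of balanced trees" idea (illustrative only; java.util.TreeMap plays the role of the balanced tree in each bucket):

import java.util.TreeMap;

class TreeBucketTable {
    private final TreeMap<Integer, String>[] buckets;

    @SuppressWarnings("unchecked")
    TreeBucketTable(int bucketCount) {
        buckets = new TreeMap[bucketCount];
        for (int i = 0; i < bucketCount; i++) buckets[i] = new TreeMap<>();
    }

    void put(int key, String value) {
        buckets[Math.floorMod(key, buckets.length)].put(key, value);   // overflow just deepens the tree
    }

    String get(int key) {
        return buckets[Math.floorMod(key, buckets.length)].get(key);   // O(log(items in this bucket)) comparisons
    }
}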