How do I keep load factor small in my hash table? - data-structures

I'm learning about hash tables and quadratic probing in particular. I've read that if the load factor is <= 0.5 and the table's size is prime, quadratic probing will always find an empty slot and no key will be accessed multiple times. It then goes on to say that, in order to ensure efficient insertions, I should always maintain a load factor <= 0.5. What does this mean? Surely if we keep adding items, the load factor will increase until it equals 1 whether we want it to or not. So what is implied when my textbook says I should maintain a small load factor?

The implication is that at some point (when you would exceed a load factor of 0.5, in this case) you'll have to allocate a new table, which is bigger by some factor (maybe 1.5 or 2, and then rounded up to the nearest prime number), and copy all the elements from the old table into it. That's not a straight copy: each item is hashed again, so its position in the new table will usually differ from its old position.
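To make that concrete, here is a rough Java sketch of a quadratic-probing table that grows whenever an insertion would push the load factor above 0.5. The class layout and the nextPrime helper are my own illustrative assumptions, not something from the textbook:

class ProbingTable {
    Integer[] table = new Integer[11];   // capacity kept prime
    int size = 0;

    void insert(int key) {
        if (2 * (size + 1) > table.length) resize();   // would exceed load factor 0.5
        insertInto(table, key);
        size++;
    }

    void resize() {
        Integer[] old = table;
        table = new Integer[nextPrime(2 * old.length)];   // bigger, still prime
        for (Integer key : old)
            if (key != null) insertInto(table, key);      // positions usually change
    }

    static void insertInto(Integer[] t, int key) {
        int home = Math.floorMod(key, t.length);
        for (int i = 0; ; i++) {                          // quadratic probing: home + i^2
            int idx = (home + i * i) % t.length;
            if (t[idx] == null) { t[idx] = key; return; }
        }
    }

    static int nextPrime(int n) {
        while (!isPrime(n)) n++;
        return n;
    }

    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) if (n % d == 0) return false;
        return true;
    }
}

With the load factor kept at or below 0.5 and a prime table size, the probe loop in insertInto is guaranteed to find an empty slot, which is exactly the property the textbook relies on.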

Related

hash table about the load factor

I'm studying about hash table for algorithm class and I became confused with the load factor.
Why is the load factor, n/m, significant with 'n' being the number of elements and 'm' being the number of table slots?
Also, why does this load factor equal the expected length of n(j), the linked list at slot j in the hash table, when collisions are resolved by chaining?
The crucial property of a hash table is the expected constant time it takes to look up an element.*
In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table completes within some fixed number of steps.
If you have a hash table with m buckets and you keep adding elements indefinitely (i.e. n >> m), then the lists in the buckets keep growing and you can no longer guarantee expected constant-time lookups; you get linear time instead, since the time spent traversing the ever-growing linked lists outweighs the time spent finding the right bucket.
So, how do we keep the lists from growing? We have to make sure that the length of each list is bounded by some fixed constant. How do we do that? By adding additional buckets.
If the hash table is well implemented, then the hash function used to map the elements to buckets should distribute the elements evenly across the buckets. If it does, then the lists will all have roughly the same length.
How long is one of the lists if the elements are distributed evenly? Clearly it is the total number of elements divided by the number of buckets, i.e. the load factor n/m (elements per bucket = expected/average length of each list).
Hence, to ensure constant-time lookups, what we have to do is keep track of the load factor (again: the expected length of the lists) so that, when it goes above the fixed constant, we can add additional buckets.
Of course, there are more problems that come in, such as how to redistribute the elements you have already stored, or how many buckets you should add.
The important message to take away, is that the load factor is needed to decide when to add additional buckets to the hash table - that's why it is not only 'important' but crucial.
Of course, if you map all the elements to the same bucket, then the average length of each list doesn't tell you much. All of this only makes sense if the elements are distributed evenly across the buckets.
*Note the expected - I can't emphasize this enough. It's typical to hear "hash tables have constant lookup time". They do not! The worst case is always O(n) and you can't make that go away.
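To make the "watch the load factor and add buckets" idea concrete, here is a minimal separate-chaining sketch in Java; the 0.75 threshold, the doubling of the bucket count, and all of the names are my own assumptions rather than anything from the question:

import java.util.LinkedList;

class ChainedSet {
    static final double MAX_LOAD = 0.75;              // assumed threshold
    LinkedList<Integer>[] buckets = newBuckets(8);
    int n = 0;                                        // number of stored elements

    boolean contains(int key) {                       // expected O(1): scans one short list
        return buckets[index(key, buckets.length)].contains(key);
    }

    void add(int key) {
        if (contains(key)) return;
        if ((double) (n + 1) / buckets.length > MAX_LOAD)   // load factor would exceed threshold
            rehash(2 * buckets.length);
        buckets[index(key, buckets.length)].add(key);
        n++;
    }

    void rehash(int m) {                              // add buckets and redistribute
        LinkedList<Integer>[] old = buckets;
        buckets = newBuckets(m);
        for (LinkedList<Integer> list : old)
            for (int key : list)
                buckets[index(key, m)].add(key);
    }

    static int index(int key, int m) { return Math.floorMod(key, m); }

    @SuppressWarnings("unchecked")
    static LinkedList<Integer>[] newBuckets(int m) {
        LinkedList<Integer>[] b = new LinkedList[m];
        for (int i = 0; i < m; i++) b[i] = new LinkedList<>();
        return b;
    }
}

The rehash step is exactly the point made above: by adding buckets whenever n/m crosses the threshold, the expected list length stays bounded by a constant.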
Adding to the existing answers, let me just put in a quick derivation.
Consider an arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the ith element is inserted into this bucket and 0 otherwise.
We want to find E[X_1 + X_2 + ... + X_n].
By linearity of expectation, this equals E[X_1] + E[X_2] + ... + E[X_n].
Now we need to find the value of E[X_i]. By the definition of expected value, this is simply (1/m) * 1 + (1 - 1/m) * 0 = 1/m. Summing these values over all i, we get 1/m added n times, which equals n/m. We have just found the expected number of elements inserted into an arbitrarily chosen bucket, and this is precisely the load factor.
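In symbols, under the same assumption that every element lands in any given bucket with probability 1/m:

E[X_i] = \Pr[X_i = 1] \cdot 1 + \Pr[X_i = 0] \cdot 0 = \frac{1}{m},
\qquad
E\!\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = n \cdot \frac{1}{m} = \frac{n}{m}.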

Difference between Space utilization and Load factor in hashtable

What is the difference between Load factor and Space utilization in a Hashtable? Please, someone explain!
Load factor
Definition:
The load factor of a Hashtable is the ratio of elements to buckets. Smaller load factors cause faster average lookup times at the cost of increased memory consumption. The default load factor of 1.0 generally provides the best balance between speed and size.
In other words, a smaller load factor leads to faster access to the elements of the HashTable (when finding a given element, iterating, ...) but requires more memory.
Conversely, a higher load factor will be slower (on average), with less memory usage.
A bucket holds a certain number of items.
Sometimes each location in the table is a bucket which will hold a fixed number of items, all of which hashed to this same location. This speeds up lookups because there is probably no need to go look at another location.
Linear probing as well as double hashing: the load factor is defined as n/prime, where n is the number of items in the table and prime is the size of the table. Thus a load factor of 1 means that the table is full.
Here is an example benchmark (average number of probes per lookup, assuming a large prime table size):
load       --- successful lookup ---   --- unsuccessful lookup ---
factor       linear        double        linear        double
---------------------------------------------------------------------
0.50           1.50          1.39          2.50          2.00
0.75           2.50          1.85          8.50          4.00
0.90           5.50          2.56         50.50         10.00
0.95          10.50          3.15        200.50         20.00
Some hash tables use other collision-resolution schemes: for example, in separate chaining, where items that hash to the same location are stored in a linked list, lookup time is measured by the number of list nodes that have to be examined. For a successful search, this number is 1 + lf/2, where lf is the load factor. Because each table location holds a linked list, which can contain a number of items, the load factor can be greater than 1, whereas 1 is the maximum possible in an ordinary hash table.
Space utilization
The idea is that we store records of data in the hash table. Each record has a key field and an associated data field. The record is stored in a location that is based on its key. The function that produces this location for each given key is called a hash function.
Let's suppose that each key field contains an integer and each data field a string (an array of characters). One possible hash function is hash(key) = key % prime.
Definition:
The space utilization would be the ratio of the number of fully used buckets to the total number of buckets reserved in the hash table.
For technical reasons, a prime number of buckets works better for the hash function, which means the table usually reserves more buckets than it actually fills, and that difference is wasted memory.
Conclusion: rather than having to proceed through a linear search or a binary search, a hash table will usually complete a lookup after just one comparison! Sometimes, however, two comparisons (or even more) are needed. A hash table thus delivers (almost) the ideal lookup time. The trade-off is that, to get this great lookup time, memory space is wasted.
As you can see, I am no expert, and I'm getting information while writing this, so any comment is welcome to make this more accurate or less... well... wrong...
Load factor is a measure of how full the hash table is relative to its total number of buckets. Let's say you have 1000 buckets and you only want to store at most 70% of that number. If the load factor exceeds that maximum ratio (more than 700 elements are stored), the hash table size can be increased so it can effectively hold more elements.
Space utilization is the ratio of the number of filled buckets to the total number of the buckets in a hash table.
Usually, when the load factor increases, space utilization increases too, and in an ideal hash table the two would be linearly related. In most cases, however, space utilization is a sublinear function of the load factor, because at high load factors some buckets end up holding more than one element.
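As a small illustrative sketch (the array-of-lists table representation here is assumed, not something from the answers above), the two ratios could be computed like this in Java:

import java.util.List;

class TableMetrics {
    // load factor: elements per bucket, n / m
    static double loadFactor(List<Integer>[] buckets) {
        int elements = 0;
        for (List<Integer> b : buckets) elements += b.size();
        return (double) elements / buckets.length;
    }

    // space utilization: non-empty buckets over all buckets
    static double spaceUtilization(List<Integer>[] buckets) {
        int used = 0;
        for (List<Integer> b : buckets) if (!b.isEmpty()) used++;
        return (double) used / buckets.length;
    }
}

With a perfectly even distribution and at most one element per bucket the two numbers coincide; once buckets start holding several elements, space utilization falls behind the load factor, which is the sublinear behaviour described above.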
In order to obtain a hashing performance close to the ideal case you may need a perfect hashing function.
A perfect hashing function maps a key into a unique address. If the range of potential addresses is the same as the number of keys, the function is a minimal (in space) perfect hashing function.

How do hashtable indexes work?

I know about creating hashcodes, collisions, the relationship between .GetHashCode and .Equals, etc.
What I don't quite understand is how a 32-bit hash number is used to get the ~O(1) lookup. If you had an array big enough to hold every possible 32-bit value then you would get ~O(1), but that would be a waste of memory.
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup. When the number of elements reaches a certain threshold (say 75%) it would expand the array to something like 10K items and recompute the internal hash numbers to 4 digit numbers, based on the 32bit hash of course.
btw, here I'm using ~O(1) to account for possible collisions and their resolutions.
Do I have the gist of it correct or am I completely off the mark?
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup.
That's exactly what happens, except that the capacity (number of bins) of the table is more commonly set to a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to insert an item. When the capacity is a power of two, the modulus operation becomes a simple bitmasking op.
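For a power-of-two capacity, that bitmasking shortcut looks like this (a sketch; key and capacity are illustrative names, not from any particular library):

int capacity = 16;                 // must be a power of two for the mask to work
int hash = key.hashCode();
int bin = hash & (capacity - 1);   // same result as a non-negative modulo, even for negative hash codes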
When the number of elements reaches a certain threshold (say 75%)
If you're referring to the Java Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.
it would expand the array to something like 10K items
In most implementations, the capacity will not be increased ten-fold but rather doubled (for power-of-two-sized hash tables) or multiplied by roughly 1.5 and then adjusted up to the next prime number.
The hashtable has a number of bins that contain items. The number of bins is quite small to start with. Given a hashcode, it simply uses hashcode modulo bincount to find the bin in which the item should reside. That gives the fast lookup (to find the bin for an item: take the modulo of the hashcode, done).
Or in (pseudo) code:
int hash = obj.GetHashCode();
int binIndex = hash % binCount;   // real implementations first force the hash to be non-negative (e.g. by masking off the sign bit) so this index can't be negative
// The item is in bin #binIndex. Go get the items there and find the one that matches.
Obviously, as you figured out yourself, at some point the table will need to grow. When it does, a new array of bins is created and the items in the table are redistributed to the new bins. This also means that growing a hashtable can be slow. (So, approx. O(1) in most cases, unless the insert triggers an internal resize. Lookups should always be ~O(1).)
In general, there are a number of variations in how hash tables handle overflow.
Many (including Java's, if memory serves) resize when the load factor (percentage of bins in use) exceeds some particular percentage. The downside of this is that the speed is undependable -- most insertions will be O(1), but a few will be O(N).
To ameliorate that problem, some resize gradually instead: when the load factor exceeds the magic number, they:
Create a second (larger) hash table.
Insert the new item into the new hash table.
Move some items from the existing hash table to the new one.
Then, each subsequent insertion moves another chunk from the old hash table to the new one. This retains the O(1) average complexity, and can be written so the complexity for every insertion is essentially constant: when the hash table gets "full" (i.e., load factor exceeds your trigger point) you double the size of the table. Then, each insertion you insert the new item and move one item from the old table to the new one. The old table will empty exactly as the new one fills up, so every insertion will involve exactly two operations: inserting one new item and moving one old one, so insertion speed remains essentially constant.
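Here is a rough Java sketch of that gradual scheme; it only shows the per-insert migration step, uses library HashMaps as stand-ins for the two bucket arrays, and leaves out the load-factor trigger that would start a migration, so treat it as an illustration rather than a real implementation:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class IncrementalMap<K, V> {
    private final Map<K, V> oldTable = new HashMap<>();   // entries still waiting to be moved
    private final Map<K, V> newTable = new HashMap<>();   // the larger replacement table

    void put(K key, V value) {
        newTable.put(key, value);
        oldTable.remove(key);               // make sure a stale copy can't be migrated back later
        migrateOne();                       // amortize the rehash: move one old entry per insert
    }

    V get(K key) {                          // during migration, an entry can be in either table
        if (newTable.containsKey(key)) return newTable.get(key);
        return oldTable.get(key);
    }

    private void migrateOne() {
        Iterator<Map.Entry<K, V>> it = oldTable.entrySet().iterator();
        if (it.hasNext()) {
            Map.Entry<K, V> e = it.next();
            newTable.put(e.getKey(), e.getValue());
            it.remove();
        }
    }
}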
There are also other strategies. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely. As the hash table fills up, you just end up with more items in each tree. In theory, this means the complexity is O(log N), but for any practical size it's proportional to log(N/M), where M is the number of buckets. For practical size ranges (e.g., up to several billion items) that's essentially constant (log N grows very slowly), and it's often a little faster for the largest table you can fit in memory, and a lot faster for smaller sizes. The shortcoming is that it's only really practical when the objects you're storing are fairly large -- if you stored (for example) one character per node, the overhead from two pointers (plus, usually, balance information) per node would be extremely high.
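A sketch of that table-of-trees idea, using java.util.TreeMap as the per-bucket balanced tree (the fixed bucket count and the int keys are just illustrative choices):

import java.util.TreeMap;

class TreeBucketMap<V> {
    private final TreeMap<Integer, V>[] buckets;   // one balanced tree per bucket, never resized

    @SuppressWarnings("unchecked")
    TreeBucketMap(int m) {
        buckets = new TreeMap[m];
        for (int i = 0; i < m; i++) buckets[i] = new TreeMap<>();
    }

    void put(int key, V value) { bucket(key).put(key, value); }

    V get(int key) { return bucket(key).get(key); }   // about log(N/M) comparisons inside the bucket

    private TreeMap<Integer, V> bucket(int key) {
        return buckets[Math.floorMod(key, buckets.length)];
    }
}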

Incremental median computation with max memory efficiency

I have a process that generates values and that I observe. When the process terminates, I want to compute the median of those values.
If I had to compute the mean, I could just store the sum and the number of generated values and thus have O(1) memory requirement. How about the median? Is there a way to save on the obvious O(n) coming from storing all the values?
Edit: Interested in 2 cases: 1) the stream length is known, 2) it's not.
You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could be the median. It is probably simplest to just store the points and find the median. If saving ceil(n/2) points is of value, then read in the first n/2 points into a sorted list (a binary tree is probably best), then as new points are added throw out the low or high points and keep track of the number of points on either end thrown out.
Edit:
If the stream length is unknown, then obviously, as Stephen observed in the comments, we have no choice but to remember everything. If duplicate items are likely, we could possibly save a bit of memory using Dolphin's idea of storing values and counts.
I had the same problem and got a way that has not been posted here. Hopefully my answer can help someone in the future.
If you know your value range and don't care much about the precision of the median, you can incrementally build a histogram of quantized values using constant memory. Then it is easy to find the median (or any other rank), up to your quantization error.
For example, suppose your data stream consists of image pixel values and you know these values are integers all falling within 0~255. To build the image histogram incrementally, just create 256 counters (bins) initialized to zero and increment the bin corresponding to each pixel value while scanning through the input. Once the histogram is built, find the first value whose cumulative count exceeds half of the data size to get the median.
For data that are real numbers, you can still compute a histogram with each bin holding quantized values (e.g. bins of 10's, 1's, 0.1's, etc.), depending on your expected data value range and the precision you want.
If you don't know the value range of the entire data sample, you can still estimate the possible value range of the median and compute the histogram within this range. This drops outliers by nature, but that is exactly what we want when computing the median.
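A minimal sketch of the quantized-histogram idea for values known to lie in 0~255, as in the pixel example above (for even-sized streams it returns the lower of the two middle values):

class HistogramMedian {
    private final long[] counts = new long[256];   // one bin per possible value: constant memory
    private long n = 0;

    void add(int value) {      // value assumed to be in 0..255
        counts[value]++;
        n++;
    }

    int median() {             // first value whose cumulative count reaches half of the data
        long half = (n + 1) / 2;
        long cumulative = 0;
        for (int v = 0; v < 256; v++) {
            cumulative += counts[v];
            if (cumulative >= half) return v;
        }
        throw new IllegalStateException("no values added");
    }
}

For real-valued data you would replace the 256 bins with bins of whatever width matches the precision you can accept.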
You can
Use statistics, if that's acceptable - for example, you could use sampling.
Use knowledge about your number stream:
using a counting-sort-like approach (k distinct values means storing O(k) memory),
or toss out known outliers and keep a (high, low) counter.
If you know you have no duplicates, you could use a bitmap... but that's just a smaller constant for O(n).
If you have discrete values and lots of repetition you could store the values and counts, which would save a bit of space.
Possibly at stages through the computation you could discard the top 'n' and bottom 'n' values, as long as you are sure that the median is not in that top or bottom range.
e.g. Let's say you are expecting 100,000 values. Every time your stored number gets to (say) 12,000 you could discard the highest 1000 and lowest 1000, dropping storage back to 10,000.
If the distribution of values is fairly consistent, this would work well. However, if there is a possibility that you will receive a large number of very high or very low values near the end, that might distort your computation. Basically, if you discard a "high" value that is less than the (eventual) median, or a "low" value that is equal to or greater than the (eventual) median, then your calculation is off.
Update
Bit of an example
Let's say that the data set is the numbers 1,2,3,4,5,6,7,8,9.
By inspection the median is 5.
Let's say that the first 5 numbers you get are 1,3,5,7,9.
To save space we discard the highest and lowest, leaving 3,5,7
Now get two more, 2,6 so our storage is 2,3,5,6,7
Discard the highest and lowest, leaving 3,5,6
Get the last two 4,8 and we have 3,4,5,6,8
Median is still 5 and the world is a good place.
However, let's say that the first five numbers we get are 1,2,3,4,5
Discard top and bottom leaving 2,3,4
Get two more 6,7 and we have 2,3,4,6,7
Discard top and bottom leaving 3,4,6
Get last two 8,9 and we have 3,4,6,8,9
With a median of 6 which is incorrect.
If our numbers are well distributed, we can keep trimming the extremities. If they might be bunched in lots of large or lots of small numbers, then discarding is risky.

Why does dynamic array always double by a factor of 2?

I was wondering, how does one decide the resizing factor by which a dynamic array resizes?
On Wikipedia and elsewhere I have always seen the number of elements being increased by a factor of 2. Why 2? Why not 3? How does one decide this factor? If it is language dependent, I would like to know this for Java.
Actually in Java's ArrayList the formula to calculate the new capacity after a resize is:
newCapacity = (oldCapacity * 3)/2 + 1;
This means roughly a 1.5 factor.
About the reason for this number I don't know, but I hope someone has done a statistical analysis and found it is a good compromise between space and computational overhead.
Quoting from Wikipedia:
As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion ensures that inserting n elements takes O(n) time overall, meaning that each insertion takes amortized constant time. The value of this proportion a leads to a time-space tradeoff: the average time per insertion operation is about a/(a-1), while the number of wasted cells is bounded above by (a-1)n. The choice of a depends on the library or application: a = 3/2 and a = 2 are commonly used.
It seems to be a good compromise between CPU time and wasted memory. I guess the "best" value depends on what your application does.
Would you waste more space than you actually use?
If not, the factor should be less than or equal to 2.
If you want it to be an integer so it is simple to work with, there is only one choice.
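For illustration, here is a bare-bones growable int array where the growth factor is the only tunable; the structure is assumed for the sake of the example, not taken from any particular library:

import java.util.Arrays;

class IntVector {
    private int[] data = new int[4];
    private int size = 0;
    private final double growthFactor;

    IntVector(double growthFactor) { this.growthFactor = growthFactor; }

    void add(int value) {
        if (size == data.length) {
            int newCapacity = Math.max(size + 1, (int) (data.length * growthFactor));
            data = Arrays.copyOf(data, newCapacity);   // the occasional O(n) copy, amortized over many adds
        }
        data[size++] = value;
    }

    int get(int index) { return data[index]; }

    int size() { return size; }
}

new IntVector(2.0) doubles on every resize, while new IntVector(1.5) grows roughly the way the ArrayList formula quoted above does.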
There is another difference between a growth rate of 2X and a growth rate of 1.5X that nobody here has discussed yet.
Each time we allocate a new buffer to increase our dynamic array capacity, we are building up a region of unused memory preceding the array. If the growth rate is too high, then this region cannot ever be reused in the array.
To visualize, let "X" represent memory cells used by our array, and "O" represent memory cells that we can no longer use. A growth rate of 2X looks like so:
[X] -> [OXX] -> [OOOXXXX] -> [OOOOOOOXXXXXXXX]
... notice that the preceding O's keep growing! In fact, with a 2X growth rate, we can never use that memory again in our array.
But, with a 1.5X growth multiplier (rounded down, but at least 1), the usage looks like:
[X] -> [OXX] -> [OOOXXX] -> [OOOOOOXXXX] -> [XXXXXX]
Wait a sec, we were able to reclaim the old space! That's because the size of the unused space caught up with the size of the array.
If you work out the math, the limit growth factor is Phi (or about 1.618). Anything larger than Phi, and you cannot reclaim the old space.
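A sketch of the arithmetic behind that limit, under the simplifying assumptions that block k has size a^k and that the next allocation must fit entirely inside the space freed by blocks 0 through k-1:

\sum_{i=0}^{k-1} a^{i} = \frac{a^{k}-1}{a-1} \;\ge\; a^{k+1}
\quad\xrightarrow{\;k\to\infty\;}\quad
\frac{1}{a-1} \ge a
\quad\Longleftrightarrow\quad
a^{2} - a - 1 \le 0
\quad\Longleftrightarrow\quad
a \le \varphi \approx 1.618

So any constant growth factor at or below the golden ratio eventually lets the freed blocks be reused, which is what the 1.5X picture above shows.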

Resources