hash table about the load factor - data-structures

I'm studying about hash table for algorithm class and I became confused with the load factor.
Why is the load factor, n/m, significant with 'n' being the number of elements and 'm' being the number of table slots?
Also, why does this load factor equal the expected length of n(j), the linked list at slot j in the hash table when all of the elements are stored in a single slot?

The crucial property of a hash table is the expected constant time it takes to look up an element.*
In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table returns below some fixed amount of steps.
If you have a hash table with m buckets and you add elements indefinitely (i.e. n>>m), then also the size of the lists will grow and you can't guarantee that expected constant time for look ups, but you will rather get linear time (since the running time you need to traverse the ever increasing linked lists will outweigh the lookup for the bucket).
So, how can we achieve that the lists don't grow? Well, you have to make sure that the length of the list is bounded by some fixed constant - how we do that? Well, we have to add additional buckets.
If the hash table is well implemented, then the hash function being used to map the elements to buckets, should distribute the elements evenly across the buckets. If the hash function does this, then the length of the lists will be roughly the same.
How long is one of the lists if the elements are distributed evenly? Clearly we'll have total number of elements divided by the number of buckets, i.e. the load factor n/m (number of elements per bucket = expected/average length of each list).
Hence, to ensure constant time look up, what we have to do is keep track of the load factor (again: expected length of the lists) such that, when it goes above the fixed constant we can add additional buckets.
Of course, there are more problems which come in, such as how to redistribute the elements you already stored or how many buckets should you add.
The important message to take away, is that the load factor is needed to decide when to add additional buckets to the hash table - that's why it is not only 'important' but crucial.
Of course, if you map all the elements to the same bucket, then the average length of each list won't be worth much. All this stuff only makes sense, if you distribute evenly across the buckets.
*Note the expected - I can't emphasize this enough. Its typical to hear "hash table have constant look up time". They do not! Worst case is always O(n) and you can't make that go away.

Adding to the existing answers, let me just put in a quick derivation.
Consider a arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the ith element is inserted into this element and 0 otherwise.
We want to find E[X_1 + X_2 + ... + X_n].
By linearity of expectation, this equals E[X_1] + E[X_2] + ... E[X_n]
Now we need to find the value of E[X_i]. This is simply (1/m) 1 + (1 - (1/m) 0) = 1/m by the definition of expected values. So summing up the values for all i's, we get 1/m + 1/m + 1/m n times. This equals n/m. We have just found out the expected number of elements inserted into a random bucket and this is the load factor.

Related

Hash Table sequence always get inserted

I have a problem related to the hash tables.
Let's consider an hash table of dimension 2^n in a open linear schema.
h(k,i) = (k^n + 2*i)mod(2^n). Show that the sequence
{1,2,...2^n} always can be inserted into the hash table.
I tried to identify a pattern in the way the numbers get inserted into the table and then apply an induction to see if I can prove the question.Any problem which our teacher gave us seems to be like this one, and I can't figure out a way of doing these kind of problems.
h(k,i) = (k^n + 2*i)mod(2^n). Show that the sequence {1,2,...2^n} always can be inserted into the hash table.
Two observations about the hash function:
k^n, for n >= 1, will be odd when k is odd, and even when k is even
2*i will probe every second bucket (wrapping around from last to first)
So, as you hash {1,2,...2^n} we know you'll alternate between finding an unused odd-indexed bucket, and an even-indexed bucket.
Just to emphasise the point, the k^n bit restricts the odd keys to odd-indexed buckets and the even keys to even-indexed buckets, while 2*i ensures all such buckets are considered until a free one's found. It's necessary that exactly half the keys will be odd and half even for the table to become full without h(k,i) failing to find an unused bucket as i is incremented.
You have a lot of terminology problems here.
You hash table does not have dimensions (actually it has, but it is one dimension, and not 2^n), but it has number of slots/buckets.
Most probably the question you asked is not the question your book/teacher wants you to solve. You tell:
Show that the sequence {1,2,...2^n} always can be inserted into the
hash table
and the problem is that in your case any natural number can be inserted in your hash table. This is obvious, because your hash function maps any number to a natural number in a region from [0 to 2^n) and because your hash function has 2^n slots, any number will fit in your hash.
So clarify what your teacher wants, explain find out what k and i is in your hash function and ask another, better prepared question.

Best algorithm to find N unique random numbers in VERY large array

I have an array with, for example, 1000000000000 of elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? Elements must be unique in whole array, not in list of N (3 in my example) elements.
I read about Reservoir sampling, but it provides only method to pick random numbers, which can be non-unique.
If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.
counts = [0] * (MAXINT-MININT+1)
for value in Elements:
counts[value] += 1
uniques = [c for c in counts where c==1]
result = random.pick_3_from(uniques)
I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look for a unique element in that bucket. If it is already picked in another bucket, find another. (Should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.

Updating & Querying all elements in array >= X where X is variable fast

Formally we are given an array with some initial values. Then we have 3 types of Queries :-
Point updates : Increment by 1 at a given position
Range Queries : To count number of elements>=x where x is taken as input
Range Updates : To decrement by 1 all elements>=x, where x is given as input.
N=105 , Q=105 (number of elements in array, number of Queries resp.)
I tried doing this with segment Tree but operations 2,3 can be worse than O(n) even as we don't know which 'range' is to be updated exactly so we may end up traversing whole of segment tree.
NOTE : I wish to clear that if we need to do all 3 operations in logarithmic Worst case ,ie O(log n) ,cause only then we can do this fast , linear approach doesn't works as Q=10^5 n N=10^5 , so worst case could be O(n^2) ,ie 10^10 operation which is clearly not feasible.
Given that you're talking about 105 items, and don't mention needing to add or remove items, it seems to me that the obvious data structure would be a simple sorted vector.
Operation complexities:
point update: O(1) + O(m) (where m is the number of subsequent elements equal to the value before the update).
Range query: O(log n) + O(m) (where n is start of range, m is elements in range).
Range update (same as range query).
It's a little difficult to be sure what "fast" means to you, but the fastest theoretically possible for 1 is O(1), so we're already within some constant factor of optimal.
For 2 and 3, even if we could do the find with constant complexity, we're pretty much stuck with O(m) for the update. Since Log2100000 = ~16.6, most of the time the O(m) term is going to dominate (i.e., the update part will involve as many operations as the search unless the given x is one of the last 17 items in the collection.
I doubt there's any point for this small of a collection, but if you might have to deal with a substantially larger collection and the items in the collection are reasonably predictably distributed, it might be worth considering doing an interpolating search instead of a binary search. With predictable distribution this reduces the expected number of comparisons to approximately O(log log n). In this case, that would be roughly 4 (but normally with a higher constant factor). This might be a win for 105 items, but then again it might not. If you might have to deal with a collection of (say) 108 items or more, it would be much more likely to be a substantial win.
The following may not be optimal, but is the best I could think of tonight.
Let's start by trying to turn the problem sideways. Instead of a map from indices to values, let's consider a map from values to sets of indices. A point update now involves removing an index from one set and adding it to another. A range update involves either simply moving an index set from one value to another or taking the union of two index sets. A range query involves folding over the sets corresponding to the values in range. A quick peek at Wikipedia suggests a traditional disjoint-set data structure is really great for set unions. Unfortunately, it's no good at all for removing an element from a set.
Fortunately, there is a newer data structure supporting union-find with constant time deletion! That takes care of both point updates and range updates quite naturally. Range queries, unfortunately, will require checking all array elements, even if very few elements are in range.

Hashing analysis in hashtable

The search time for a hash value is O(1+alpha) , where
alpha = number of elements/size of table
I don't understand why the 1 is added?
The expected number elements examined is
(1/n summation of i=1 to n (1+(i-1/m)))
I don't understand this too.How it is derived?
(I know how to solve the above expression , but I want to understand how it has been lead to this expression..)
EDIT : n is number of elements present and m is the number of slots or the size of the table
I don't understand why the 1 is added?
The O(1) is there to tell that even if there is no element in a bucket or the hash table at all, you'll have to compute the key hash value and thus it won't be instantaneous.
Your second part needs precisions. See my comments.
EDIT:
Your second portion is there for "amortized analysis", the idea is to consider each insertion in fact in a set of n insertions in an initially empty hash table, each lookup would take O(1) hashing plus O(i-1/m) searching the bucket content considering each bucket is evenly filled with respect to previous elements. The resolution of the sum actually gives the O(1+alpha) amortized time.

How do hashtable indexes work?

I know about creating hashcodes, collisions, the relationship between .GetHashCode and .Equals, etc.
What I don't quite understand is how a 32 bit hash number is used to get the ~O(1) lookup. If you have an array big enough to allocate all the possibilities in a 32bit number then you do get the ~O(1) but that would be waste of memory.
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup. When the number of elements reaches a certain threshold (say 75%) it would expand the array to something like 10K items and recompute the internal hash numbers to 4 digit numbers, based on the 32bit hash of course.
btw, here I'm using ~O(1) to account for possible collisions and their resolutions.
Do I have the gist of it correct or am I completely off the mark?
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup.
That's exactly what happens, except that the capacity (number of bins) of the table is more commonly set to a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to insert an item. When the capacity is a power of two, the modulus operation becomes a simple bitmasking op.
When the number of elements reaches a certain threshold (say 75%)
If you're referring to the Java Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.
it would expand the array to something like 10K items
In most implementations, the capacity will not be increased ten-fold but rather doubled (for power-of-two-sized hash tables) or multiplied by roughly 1.5 + the distance to the next prime number.
The hashtable has a number of bins that contain items. The number of bins are quite small to start with. Given a hashcode, it simply uses hashcode modulo bincount to find the bin in which the item should reside. That gives the fast lookup (Find the bin for an item: Take modulo of the hashcode, done).
Or in (pseudo) code:
int hash = obj.GetHashCode();
int binIndex = hash % binCount;
// The item is in bin #binIndex. Go get the items there and find the one that matches.
Obviously, as you figured out yourself, at some point the table will need to grow. When it does this, a new array of bins are created, and the items in the table are redistributed to the new bins. This is also means that growing a hashtable can be slow. (So, approx. O(1) in most cases, unless the insert triggers an internal resize. Lookups should always be ~O(1)).
In general, there are a number of variations in how hash tables handle overflow.
Many (including Java's, if memory serves) resize when the load factor (percentage of bins in use) exceeds some particular percentage. The downside of this is that the speed is undependable -- most insertions will be O(1), but a few will be O(N).
To ameliorate that problem, some resize gradually instead: when the load factor exceeds the magic number, they:
Create a second (larger) hash table.
Insert the new item into the new hash table.
Move some items from the existing hash table to the new one.
Then, each subsequent insertion moves another chunk from the old hash table to the new one. This retains the O(1) average complexity, and can be written so the complexity for every insertion is essentially constant: when the hash table gets "full" (i.e., load factor exceeds your trigger point) you double the size of the table. Then, each insertion you insert the new item and move one item from the old table to the new one. The old table will empty exactly as the new one fills up, so every insertion will involve exactly two operations: inserting one new item and moving one old one, so insertion speed remains essentially constant.
There are also other strategies. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely. As the hash table fills up, you just end up with more items in each tree. In theory, this means the complexity is O(log N), but for any practical size it's proportional to log N/M, where M=number of buckets. For practical size ranges (e.g., up to several billion items) that's essentially constant (log N grows very slowly) and and it's often a little faster for the largest table you can fit in memory, and a lost faster for smaller sizes. The shortcoming is that it's only really practical when the objects you're storing are fairly large -- if you stored (for example) one character per node, the overhead from two pointers (plus, usually, balance information) per node would be extremely high.

Resources