Does hashtable size depend upon length of key? - algorithm

What I know:
The hash table size depends on the load factor.
It should be a prime number, and that prime number is used as the modulo value in the hash function.
The prime number should not be too close to a power of 2 or a power of 10.
The doubt I have:
Does the size of the hash table depend on the length of the key?
The following paragraph is from the book Introduction to Algorithms by Cormen.
Does n = 2000 mean the length of a string or the number of elements that will be stored in the hash table?
Good values for m are primes not too close to exact powers of 2. For
example, suppose we wish to allocate a hash table, with collisions
resolved by chaining, to hold roughly n = 2000 character strings,
where a character has 8 bits. We don't mind examining an average of 3
elements in an unsuccessful search, so we allocate a hash table of
size m = 701. The number 701 is chosen because it is a prime near
2000/3 but not near any power of 2. Treating each key k as an integer,
our hash function would be
h(k) = k mod 701.
Can somebody explain it?

Here's a general overview of the tradeoff with hash tables.
Suppose you have a hash table with m buckets with chains storing a total of n objects.
If you store only references to objects, the total memory consumed is O(m + n).
Now, suppose that, for an average object, its size is s, it takes O(s) time to compute its hash once, and O(s) time to compare two such objects.
Consider an operation checking whether an object is present in the hash table.
The bucket will have n/m elements on average, so the operation will take O(s * n/m) time.
So, the tradeoff is this: when you increase the number of buckets m, you increase memory consumption but decrease average time for a single operation.
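To make the tradeoff concrete, here is a tiny sketch of my own (the n = 2000 figure is just borrowed from the quoted paragraph; the candidate bucket counts are arbitrary):

#include <stdio.h>

int main(void) {
    /* For n stored objects and m buckets with chaining, the expected chain
       length is n/m and the memory is proportional to m + n. */
    long n = 2000;                       /* number of stored objects (assumed) */
    long sizes[] = {256, 701, 4096};     /* candidate bucket counts (assumed) */
    for (int i = 0; i < 3; i++) {
        long m = sizes[i];
        printf("m = %4ld: average chain length = %.2f, memory ~ m + n = %ld\n",
               m, (double)n / m, m + n);
    }
    return 0;
}

Growing m from 256 to 4096 cuts the average chain length from about 7.8 to about 0.5, at the cost of many more (mostly empty) buckets.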
For the original question - does the size of a hash table depend on the length of the key? - No, it should not, at least not directly.
The paragraph you cite only mentions the strings as an example of an object to store in a hash table.
One mentioned property is that they are 8-bit character strings.
The other is that "We don't mind examining an average of 3 elements in an unsuccessful search".
And that wraps the relevant property of the stored objects into a single question: how many elements, on average, do we want to place in a single bucket?
The length of strings themselves is not mentioned anywhere.

(2) and (3) are false. It is common for a hash table to have 2^n buckets, as long as you use the right hash function. On (1), the memory a hash table takes is roughly the number of buckets times the size of a key. Note that for string keys, we usually keep pointers to the strings, not the strings themselves, so the size of a key is the size of a pointer, which is 8 bytes on 64-bit machines.
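A trivial check of that pointer-size point (my own snippet, not specific to any hash table library):

#include <stdio.h>

int main(void) {
    /* A bucket that stores a pointer to the key costs sizeof(char *) bytes,
       regardless of how long the string it points to actually is. */
    printf("sizeof(char *) = %zu bytes\n", sizeof(char *));   /* 8 on typical 64-bit machines */
    return 0;
}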

Algorithm-wise, no!
The length of the key is irrelevant here.
Moreover, the key itself is not important; what's important is the number of different keys you predict you'll have.
Implementation-wise, yes! Since you must store the key itself in your hash table, it affects the table's size.
For your second question, 'n' means the number of different keys to hold.

Related

Hash Table Double Hashing

I am given two hash functions that I should use for insertion into and deletion from the table:
int hash1(int key)
{
    return (key % TABLESIZE);   /* fold the key directly onto the table size */
}
int hash2(int key)
{
    return (key % PRIME) + 1;   /* PRIME is some prime constant; result is in [1..PRIME] */
}
I'm confused on how I utilize them.
Do I:
Use hash1 first, and if the slot is taken at the return value, use hash2?
Do I use hash1 then add the result to hash2's output?
Do I use hash1's output as hash2's input?
TLDR: bucket_selected = hash1(hash2(key))
hash1() is an identity hash from key to bucket (it just folds the key onto the table size). If the table size happens to be a power of two, it effectively keeps some number of low bits and discards the high bits; for example, with table size 256, it's effectively returning key & 255: the least significant 8 bits. This will be somewhat collision prone if the keys aren't either:
mostly contiguous numbers - possibly with a few small gaps, such that they cleanly map onto successive buckets most of the time, or
pretty random in the low bits used, so they scatter across the buckets
If table size is not a power of two, and ideally is a prime, the high order bits help spread the keys around the buckets. Collisions are just as likely for e.g. random numbers, but on computer systems sometimes different bits in a key are more or less likely to vary (for example, doubles in memory consist of a sign, many mantissa, and many exponent bits: if your numbers are of similar magnitude, the exponents won't vary much between them), or there are patterns based on power-of-two boundaries. For example, if you have 4 ASCII characters packed into a uint32_t key, then a % table_size hash with table_size 256 extracts just one of the characters as the hash value. If table size was instead 257, then varying any of the characters would change the bucket selected.
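To make that last example concrete, here is a small sketch of my own, packing four ASCII characters into a uint32_t and folding it with % 256 versus % 257:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Two keys that differ only in their first three characters. */
    uint32_t k1 = ('a' << 24) | ('b' << 16) | ('c' << 8) | 'd';
    uint32_t k2 = ('x' << 24) | ('y' << 16) | ('z' << 8) | 'd';
    /* With 256 buckets only the last character matters, so both collide. */
    printf("mod 256: %u vs %u\n", (unsigned)(k1 % 256), (unsigned)(k2 % 256));
    /* With 257 buckets every character influences the bucket, so they differ. */
    printf("mod 257: %u vs %u\n", (unsigned)(k1 % 257), (unsigned)(k2 % 257));
    return 0;
}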
(key % PRIME) + 1 comes close to doing what hash1() would do if the table size was prime, but why add 1? Some languages do index their arrays from 1, which is the only good reason I can think of, but if dealing with such a language, you'd probably want hash1() to add 1 too. To explain the potential use for hash2(), let's take a step back first...
Real general-purpose hash table implementations need to be able to create tables of different sizes - whatever suits the program - and indeed, often applications want the table to "resize" or grow dynamically if more elements are inserted than it can handle well. Because of that, a hash function such as hash1 would be dependent on the hash table implementation or calling code to tell it the current table size. It's normally more convenient if the hash functions can be written independently of any given hash table implementation, only needing the key as input. What many hash functions do is hash the key to a number of a certain size, e.g. a uint32_t or uint64_t. Clearly that means there may be more hash values than there are buckets in the hash table, so a % operation (or faster bitwise-& operation if the # buckets is a power of two) is then used to "fold" the hash value back onto the buckets. So, a good hash table implementation usually accepts a hash function generating e.g. uint32_t or uint64_t output and internally does the % or &.
In the case of hash1 - it can be used:
as an identity hash folding the key to a bucket, or
to fold a hash value from another hash function to a bucket.
In the second usage, that second hash function could be hash2. This might make sense if the keys given to hash2 were typically much larger than the PRIME used, and yet the PRIME was in turn much larger than the number of buckets. To explain why this is desirable, let's take another step back...
Say you have 8 buckets and a hash function that produces a number in the range [0..10] with uniform probability: if you % the hash values into the table size, hash values 0..7 will map to buckets 0..7, and hash values 8..10 will map to buckets 0..2: buckets 0..2 can be expected to have about twice as many keys collide there as the other buckets. When the range of hash values is vastly larger than the number of buckets, the significance of some buckets receiving one extra hash value from the % is correspondingly tiny. Alternatively, if you have, say, a hash function outputting 32-bit numbers (so the number of distinct hash values is a power of two), then % by a smaller power-of-two will map exactly the same number of hash values to each bucket.
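As a quick demonstration of that bias (my own sketch, using the same 8 buckets and hash range [0..10]):

#include <stdio.h>

int main(void) {
    int counts[8] = {0};
    /* Fold each possible hash value 0..10 into 8 buckets with %. */
    for (int h = 0; h <= 10; h++)
        counts[h % 8]++;
    for (int b = 0; b < 8; b++)
        printf("bucket %d gets %d hash value(s)\n", b, counts[b]);   /* buckets 0..2 get 2, the rest get 1 */
    return 0;
}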
So, let's return to my earlier assertion: hash2()'s potential utility is actually to use it like this:
bucket_selected = hash1(hash2(key))
In the above formula, hash1 folds the value onto the buckets and prevents out-of-bounds bucket access; for this to work reasonably, hash2 should output a range of numbers much larger than the number of buckets, but it won't do anything useful at all unless the keys span a range larger than PRIME - and ideally they'd span a range vastly larger than PRIME, increasing the odds of the hash values from hash2(key) forming a near-uniform distribution between 1 and PRIME.
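Putting the pieces together, a minimal sketch of the composed usage described above (the TABLESIZE and PRIME values are my own assumptions, chosen so that the key range is much larger than PRIME, which is much larger than TABLESIZE):

#include <stdio.h>

#define TABLESIZE 701      /* number of buckets (assumed) */
#define PRIME     999983   /* a large prime, much bigger than TABLESIZE (assumed) */

int hash1(int key) { return key % TABLESIZE; }
int hash2(int key) { return (key % PRIME) + 1; }

/* hash2 spreads the key over [1..PRIME]; hash1 folds that onto the buckets. */
int bucket_for(int key) {
    return hash1(hash2(key));
}

int main(void) {
    int keys[] = {42, 12345678, 2047483647};
    for (int i = 0; i < 3; i++)
        printf("key %d -> bucket %d\n", keys[i], bucket_for(keys[i]));
    return 0;
}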

What is the run-time of inserting the words in a string into a hash table?

More info:
n is the number of characters in the string
the hash table should keep track of each word's frequency; i.e., the hash table should store key-value pairs, where the key is a word in the input string, and the value is the number of times that word occurs in the input string
We've had some heated debates about this question at work, and I'd like to see what you guys think the answer is.
An important thing to consider when implementing the insert function is how we handle collisions, i.e. the collision resolution technique. This has a big influence on both put() and get() operations.
Collision resolution is implemented differently in different libraries. The core idea is to keep all colliding keys in the same bucket, and during retrieval to traverse all the colliding keys and apply an equality check to find the requested key. The important thing to note is that we need to keep both the keys and the values in the bucket to facilitate that equality check.
So the keys (the words) are also stored in the hash table along with the counts.
Another thing to consider: during an insertion, a hash code is generated for the given key. We can consider this to be constant, O(1), for every key.
Now, answering the question.
Given a string of length 'n', inserting all the words and their frequencies involves the following steps.
1. Split the given string into words, with the given delimiter - O(n)
2. For word in words - O(n)
   # Considering a copy of a word of length k as constant and very small compared to 'n',
   # and collision resolution amortized across all inserts:
   if MAP.exists(word)                   - O(1)
       MAP.set(word, MAP.get(word) + 1)  - amortized O(1)
   else
       MAP.set(word, 1)                  - O(1)
Overall, that is an O(n) run time for inserting the words of a string into a hash table, because the for loop runs about 'n/k' times and 'k' is a constant that is small compared to n.
If H is your hashtable mapping words to counts, then H[s] and H[s] = <new value> are both O(len(s)). That's because computing the hashcode for s requires you to read every character of s, and once you've found the relevant slot in the hashtable, you also need to compare s to whatever is stored there. Of course, the usual hashtable caveats apply -- on average only O(1) of these comparisons is performed.
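For illustration, here is a common multiply-and-add string hash (a djb2-style function; my own sketch, not tied to any particular hash table): the loop touches every character, which is exactly why hashing s costs O(len(s)).

#include <stdio.h>
#include <stddef.h>

unsigned long hash_string(const char *s) {
    unsigned long h = 5381;
    /* Visit every character once: O(len(s)). */
    for (size_t i = 0; s[i] != '\0'; i++)
        h = h * 33 + (unsigned char)s[i];
    return h;
}

int main(void) {
    printf("%lu\n", hash_string("hello"));
    return 0;
}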
With respect to your original problem, you can break your string of length n into words in O(n) time. Then for each word, you need an O(len(word)) operation to update the hashtable. For all the strings, O(len(word1) + len(word2) + ... + len(word_n)) = O(n) overall, since the sum of the length of the words is always less than n, the length of the original string.

hash table about the load factor

I'm studying hash tables for an algorithms class and I became confused about the load factor.
Why is the load factor, n/m, significant with 'n' being the number of elements and 'm' being the number of table slots?
Also, why does this load factor equal the expected length of n(j), the linked list at slot j in the hash table when all of the elements are stored in a single slot?
The crucial property of a hash table is the expected constant time it takes to look up an element.*
In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table completes within some fixed number of steps.
If you have a hash table with m buckets and you add elements indefinitely (i.e. n>>m), then also the size of the lists will grow and you can't guarantee that expected constant time for look ups, but you will rather get linear time (since the running time you need to traverse the ever increasing linked lists will outweigh the lookup for the bucket).
So, how can we ensure that the lists don't grow? Well, you have to make sure that the length of each list is bounded by some fixed constant - how do we do that? Well, we have to add additional buckets.
If the hash table is well implemented, then the hash function being used to map the elements to buckets, should distribute the elements evenly across the buckets. If the hash function does this, then the length of the lists will be roughly the same.
How long is one of the lists if the elements are distributed evenly? Clearly we'll have total number of elements divided by the number of buckets, i.e. the load factor n/m (number of elements per bucket = expected/average length of each list).
Hence, to ensure constant time look up, what we have to do is keep track of the load factor (again: expected length of the lists) such that, when it goes above the fixed constant we can add additional buckets.
Of course, there are more problems which come in, such as how to redistribute the elements you already stored or how many buckets should you add.
The important message to take away, is that the load factor is needed to decide when to add additional buckets to the hash table - that's why it is not only 'important' but crucial.
Of course, if you map all the elements to the same bucket, then the average length of each list won't be worth much. All this stuff only makes sense, if you distribute evenly across the buckets.
*Note the expected - I can't emphasize this enough. It's typical to hear "hash tables have constant lookup time". They do not! The worst case is always O(n) and you can't make that go away.
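As a sketch of the bookkeeping this implies (the names and the 0.75 threshold are my own assumptions, not from any particular implementation):

#include <stddef.h>

#define MAX_LOAD_FACTOR 0.75   /* assumed threshold */

struct hashtable {
    size_t n;   /* number of stored elements */
    size_t m;   /* number of buckets */
    /* ... bucket array and chains omitted ... */
};

/* After an insert, check the load factor n/m; if it exceeds the threshold,
   the table should allocate more buckets and rehash its elements. */
int needs_grow(const struct hashtable *t) {
    return (double)t->n / (double)t->m > MAX_LOAD_FACTOR;
}

int main(void) {
    struct hashtable t = { 900, 1024 };   /* 900 elements in 1024 buckets (made-up numbers) */
    return needs_grow(&t);                /* 900/1024 ~ 0.88 > 0.75, so this returns 1 */
}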
Adding to the existing answers, let me just put in a quick derivation.
Consider an arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the i-th element is inserted into this bucket and 0 otherwise.
We want to find E[X_1 + X_2 + ... + X_n].
By linearity of expectation, this equals E[X_1] + E[X_2] + ... E[X_n]
Now we need to find the value of E[X_i]. By the definition of expected value, this is simply (1/m) * 1 + (1 - 1/m) * 0 = 1/m, assuming each element is equally likely to land in any of the m buckets. Summing this over all i, we get 1/m added n times, which equals n/m. We have just found the expected number of elements inserted into an arbitrary bucket, and this is the load factor.
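A rough empirical check of this derivation (my own sketch; the n, m, and trial counts are arbitrary): insert n keys uniformly at random, count how many land in one fixed bucket, and average over many trials - the average should approach n/m.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 2000, m = 701, trials = 2000;
    long total_in_bucket0 = 0;
    srand(42);
    for (int t = 0; t < trials; t++)
        for (int i = 0; i < n; i++)
            if (rand() % m == 0)          /* element i landed in bucket 0 */
                total_in_bucket0++;
    printf("average elements in bucket 0 = %.3f, load factor n/m = %.3f\n",
           (double)total_in_bucket0 / trials, (double)n / (double)m);
    return 0;
}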

Hashing analysis in hashtable

The search time for a hash value is O(1 + alpha), where
alpha = (number of elements) / (size of the table)
I don't understand why the 1 is added?
The expected number of elements examined is
(1/n) * sum_{i=1}^{n} (1 + (i-1)/m)
I don't understand this either. How is it derived?
(I know how to simplify the above expression, but I want to understand how one arrives at it.)
EDIT: n is the number of elements present and m is the number of slots, i.e. the size of the table.
I don't understand why the 1 is added?
The O(1) part is there to say that even if a bucket, or the hash table itself, contains no elements at all, you still have to compute the key's hash value, so the lookup won't be instantaneous.
Your second part needs more precision; see the EDIT below.
EDIT:
Your second expression comes from an amortized analysis: the idea is to consider each insertion as part of a set of n insertions into an initially empty hash table. The lookup for the i-th element takes O(1) for hashing plus O((i-1)/m) for searching the bucket's contents, assuming each bucket is evenly filled with respect to the previously inserted elements. Resolving the sum then gives the O(1 + alpha) amortized time.
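For reference, here is how the sum resolves (a standard calculation, with n elements and m slots as defined in the question):

(1/n) * sum_{i=1}^{n} (1 + (i-1)/m)
    = 1 + (1/(n*m)) * sum_{i=1}^{n} (i-1)
    = 1 + (1/(n*m)) * n(n-1)/2
    = 1 + (n-1)/(2m)
    <= 1 + alpha/2,   where alpha = n/m,

which is O(1 + alpha).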

Hash Functions and Tables of size of the form 2^p

While calculating the hash table bucket index from the hash code of a key, why do we avoid use of remainder after division (modulo) when the size of the array of buckets is a power of 2?
When calculating the hash, you want to use as much information as you can cheaply munge into it, with good distribution across the entire range of bits: e.g. 32-bit unsigned integers are usually good, unless you have a lot (> 3 billion) of items to store in the hash table.
It's converting the hash code into a bucket index that you're really interested in. When the number of buckets n is a power of two, all you need to do is do an AND operation between hash code h and (n-1), and the result is equal to h mod n.
A reason this may be bad is that the AND operation simply discards bits - the high-order bits - from the hash code. This may be good or bad, depending on other things. On one hand, it will be very fast, since AND is a lot faster than division (and this is the usual reason why you would choose a power-of-2 number of buckets), but on the other hand, poor hash functions may have poor entropy in the lower bits: that is, the lower bits don't change much when the data being hashed changes.
Let us say that the table size is m = 2^p.
Let k be a key.
Then, whenever we do k mod m, we will only get the last p bits of the binary representation of k. Thus, if I put in several keys that have the same last p bits, the hash function will perform very, very badly, as all such keys will be hashed to the same slot in the table. Thus, avoid powers of 2.
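A small sketch of my own showing both points - that & (m - 1) equals % m when m is a power of two, and that keys sharing their low p bits all collide:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint32_t m = 256;   /* 2^8 buckets (assumed table size) */
    /* Three keys that differ only above their low 8 bits. */
    uint32_t keys[] = {0x1234ABu, 0x5678ABu, 0x9ABCABu};
    for (int i = 0; i < 3; i++)
        printf("key 0x%06X -> %% m: %u, & (m-1): %u\n",
               (unsigned)keys[i], (unsigned)(keys[i] % m), (unsigned)(keys[i] & (m - 1)));
    /* All three land in bucket 0xAB (171): the high bits never mattered. */
    return 0;
}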
