Hashing analysis in hashtable - data-structures

The search time for a hash value is O(1 + alpha), where
alpha = (number of elements) / (size of the table)
I don't understand why the 1 is added.
The expected number of elements examined is
(1/n) * Σ_{i=1}^{n} (1 + (i-1)/m)
I don't understand this either. How is it derived?
(I know how to evaluate the above expression, but I want to understand how the analysis leads to it.)
EDIT: n is the number of elements present and m is the number of slots (the size of the table).

I don't understand why the 1 is added.
The 1 (the O(1) term) accounts for computing the key's hash value: even if the bucket, or the whole hash table, is empty, you still have to hash the key, so a lookup is never instantaneous.
Your second part needs more precision; see my comments.
EDIT:
Your second expression comes from amortized ("aggregate") analysis: consider a sequence of n insertions into an initially empty hash table. The i-th lookup takes O(1) for hashing plus O((i-1)/m) for searching the bucket's contents, assuming the previous i-1 elements are spread evenly over the m buckets. Averaging over all n insertions gives exactly the sum above, and resolving the sum yields the O(1 + alpha) amortized time.
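Evaluating the sum makes the bound explicit, using Σ_{i=1}^{n} (i-1) = n(n-1)/2:

(1/n) * Σ_{i=1}^{n} (1 + (i-1)/m)
    = 1 + (1/(n*m)) * n(n-1)/2
    = 1 + (n-1)/(2m)
    = 1 + alpha/2 - 1/(2m)
    = O(1 + alpha)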

What is the run-time of inserting the words in a string into a hash table?

More info:
n is the number of characters in the string
the hash table should keep track of each word's frequency; i.e., the hash table should store key-value pairs, where the key is a word in the input string, and the value is the number of times that word occurs in the input string
We've had some heated debates about this question at work, and I'd like to see what you guys think the answer is.
An important thing to consider when implementing the insert function is how we handle collisions, because the resolution technique has a great influence on both the put() and get() operations.
Collision resolution is implemented differently in each library. The core idea is to keep all colliding keys in the same bucket; during retrieval we traverse all the colliding keys and apply an equality check to find the requested key. Note that we need to store both the keys and the values in the bucket to facilitate this equality check.
So the keys (words) are stored in the hash table along with the counts.
Another thing to consider: during an insert, a hash code is generated for the given key. We can consider this to be constant, O(1), for every key.
Now, answering the question.
Given a string of length 'n'
Inserting all the words and their frequencies involves the following steps:
1. Split the given string into words, using the given delimiter  -- O(n)
2. For each word in words  -- O(n) over the whole loop
       # treating the copy of a word of length k as constant, with k very small compared to 'n'
       # collision-resolution costs amortized across all inserts
       if MAP.exists(word)                     -- O(1)
           MAP.set(word, MAP.get(word) + 1)    -- amortized O(1)
       else
           MAP.set(word, 1)                    -- O(1)
Overall, inserting the words of a string into a hash table runs in O(n), because the for loop runs about n/k times and k is a small constant compared to n.
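Here is a minimal Java sketch of the loop above (a plain HashMap version; the class name and the whitespace delimiter are illustrative assumptions, not from the answer):

import java.util.HashMap;
import java.util.Map;

public class WordFrequency {
    // Counts word frequencies in expected O(n) time for an n-character input.
    static Map<String, Integer> countWords(String input) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : input.split("\\s+")) {    // splitting is O(n)
            if (word.isEmpty()) continue;            // guard against a leading delimiter
            counts.merge(word, 1, Integer::sum);     // hash + update: amortized O(len(word))
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("the quick brown fox jumps over the lazy fox"));
        // e.g. {brown=1, quick=1, over=1, the=2, lazy=1, fox=2, jumps=1} (order unspecified)
    }
}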
If H is your hash table mapping words to counts, then H[s] and H[s] = <new value> are both O(len(s)). That's because computing the hash code for s requires you to read every character of s, and also, once you've found the relevant slot in the hash table, you need to compare s to whatever is stored there. Of course, the usual hash table complexities apply too: in expectation there are O(1) of these comparisons.
With respect to your original problem, you can break your string of length n into words in O(n) time. Then for each word, you need an O(len(word)) operation to update the hash table. Over all the words, O(len(word_1) + len(word_2) + ... + len(word_k)) = O(n) overall, since the sum of the lengths of the words is at most n, the length of the original string.

Hash Table sequence always gets inserted

I have a problem related to hash tables.
Consider a hash table with 2^n slots, using an open-addressing (linear probing) scheme with the probe sequence
h(k,i) = (k^n + 2*i) mod 2^n. Show that the sequence
{1, 2, ..., 2^n} can always be inserted into the hash table.
I tried to identify a pattern in the way the numbers get inserted into the table and then apply induction to see if I can prove the claim. Every problem our teacher gives us seems to be like this one, and I can't figure out a way of approaching these kinds of problems.
h(k,i) = (k^n + 2*i) mod 2^n. Show that the sequence {1, 2, ..., 2^n} can always be inserted into the hash table.
Two observations about the hash function:
k^n, for n >= 1, is odd when k is odd and even when k is even
2*i probes every second bucket (wrapping around from last to first)
So, as you hash {1, 2, ..., 2^n}, you'll alternate between finding an unused odd-indexed bucket and an unused even-indexed bucket.
Just to emphasise the point: the k^n part restricts the odd keys to odd-indexed buckets and the even keys to even-indexed buckets, while 2*i ensures all such buckets are considered until a free one is found. It is essential that exactly half the keys are odd and half even, so the table becomes full without h(k,i) ever failing to find an unused bucket as i is incremented.
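To see this concretely, here is a small Java simulation (my own illustration; n = 4 is an arbitrary choice) that inserts the keys 1..2^n with the probe sequence h(k,i) = (k^n + 2*i) mod 2^n and shows that every key finds a free slot:

import java.math.BigInteger;

public class ProbeDemo {
    public static void main(String[] args) {
        int n = 4;                     // table size 2^n = 16
        int m = 1 << n;
        Integer[] table = new Integer[m];
        for (int k = 1; k <= m; k++) {
            // k^n can overflow int, so compute it with BigInteger and reduce mod 2^n.
            int base = BigInteger.valueOf(k).pow(n).mod(BigInteger.valueOf(m)).intValue();
            for (int i = 0; i < m; i++) {            // probe h(k,i) = (k^n + 2i) mod 2^n
                int slot = (base + 2 * i) % m;
                if (table[slot] == null) {
                    table[slot] = k;                 // first free slot on the probe path
                    break;
                }
            }
        }
        for (int s = 0; s < m; s++) {
            System.out.println("slot " + s + " -> key " + table[s]);
        }
    }
}

Running it prints a completely filled table: every slot holds exactly one key, so no insertion failed.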
You have a lot of terminology problems here.
Your hash table does not have "dimensions" (or rather, it has one dimension, and it is not 2^n); it has a number of slots/buckets.
Most probably the question you asked is not the question your book/teacher wants you to solve. You write:
Show that the sequence {1, 2, ..., 2^n} can always be inserted into the
hash table
and the problem is that, as stated, any natural number can be inserted into your hash table. This is obvious: your hash function maps any number into the range [0, 2^n), and because your hash table has 2^n slots, any number will fit.
So clarify what your teacher wants, find out what k and i are in your hash function, and ask another, better-prepared question.

hash table about the load factor

I'm studying hash tables for my algorithms class and I became confused about the load factor.
Why is the load factor, n/m, significant, with 'n' being the number of elements and 'm' being the number of table slots?
Also, why does this load factor equal the expected length of n(j), the linked list at slot j of the hash table, when each element is equally likely to be stored in any slot?
The crucial property of a hash table is the expected constant time it takes to look up an element.*
In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table completes within some fixed number of steps.
If you have a hash table with m buckets and you add elements indefinitely (i.e. n >> m), then the lists grow too, and you can't guarantee that expected constant lookup time; instead you get linear time, since the time needed to traverse the ever-growing linked lists outweighs the lookup of the bucket.
So, how can we keep the lists from growing? We have to make sure the length of each list is bounded by some fixed constant. How do we do that? We add additional buckets.
If the hash table is well implemented, then the hash function used to map elements to buckets should distribute the elements evenly across the buckets. If the hash function does this, then the lists will all be of roughly the same length.
How long is one of the lists if the elements are distributed evenly? Clearly it is the total number of elements divided by the number of buckets, i.e. the load factor n/m (number of elements per bucket = expected/average length of each list).
Hence, to ensure constant-time lookup, what we have to do is keep track of the load factor (again: the expected length of the lists) so that, when it goes above the fixed constant, we can add additional buckets.
Of course, further problems then arise, such as how to redistribute the elements you have already stored, or how many buckets to add.
The important message to take away is that the load factor is what tells us when to add additional buckets to the hash table; that's why it is not only 'important' but crucial.
Of course, if you map all the elements to the same bucket, then the average length of each list isn't worth much. All of this only makes sense if you distribute the elements evenly across the buckets.
*Note the expected: I can't emphasize this enough. It's typical to hear "hash tables have constant lookup time". They do not! The worst case is always O(n), and you can't make that go away.
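To make the bookkeeping concrete, here is a minimal Java sketch of a chaining hash table that doubles its bucket count once n/m crosses a threshold (the class layout and the 0.75 threshold are illustrative choices, not taken from the answer above):

import java.util.LinkedList;

public class ChainedHashTable<K, V> {
    private static final double MAX_LOAD_FACTOR = 0.75; // illustrative threshold

    private static class Entry<K, V> {
        final K key; V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private LinkedList<Entry<K, V>>[] buckets;
    private int size = 0; // n: number of stored elements

    @SuppressWarnings("unchecked")
    public ChainedHashTable() { buckets = new LinkedList[8]; }

    private int bucketIndex(K key, int m) {
        return (key.hashCode() & 0x7fffffff) % m;
    }

    public void put(K key, V value) {
        int idx = bucketIndex(key, buckets.length);
        if (buckets[idx] == null) buckets[idx] = new LinkedList<>();
        for (Entry<K, V> e : buckets[idx]) {
            if (e.key.equals(key)) { e.value = value; return; } // update in place
        }
        buckets[idx].add(new Entry<>(key, value));
        size++;
        // When the load factor n/m exceeds the threshold, add buckets (here: double them).
        if ((double) size / buckets.length > MAX_LOAD_FACTOR) resize();
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        LinkedList<Entry<K, V>>[] old = buckets;
        buckets = new LinkedList[old.length * 2];
        for (LinkedList<Entry<K, V>> bucket : old) {
            if (bucket == null) continue;
            for (Entry<K, V> e : bucket) { // redistribute every stored element
                int idx = bucketIndex(e.key, buckets.length);
                if (buckets[idx] == null) buckets[idx] = new LinkedList<>();
                buckets[idx].add(e);
            }
        }
    }
}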
Adding to the existing answers, let me just put in a quick derivation.
Consider an arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the i-th element is inserted into this bucket and 0 otherwise.
We want to find E[X_1 + X_2 + ... + X_n].
By linearity of expectation, this equals E[X_1] + E[X_2] + ... + E[X_n].
Now we need the value of E[X_i]. Since the i-th element lands in our bucket with probability 1/m, this is simply (1/m)*1 + (1 - 1/m)*0 = 1/m, by the definition of expected value. Summing over all i, we get 1/m + 1/m + ... + 1/m, n times, which equals n/m. We have just found the expected number of elements inserted into an arbitrary bucket, and this is exactly the load factor.
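Written out in one line, the derivation is:

E[length of list at slot j] = E[X_1 + X_2 + ... + X_n]
                            = E[X_1] + E[X_2] + ... + E[X_n]    (linearity of expectation)
                            = n * (1/m)
                            = n/m = alpha, the load factor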

Find the N-th most frequent number in the array

Find the N-th most frequent number in an array.
(There is no limit on the range of the numbers)
I think we can
(i) store the occurrence count of every element using a map in C++
(ii) build a max-heap of the occurrences (frequencies) in linear time, and then extract up to the N-th element;
each extraction takes O(log n) time to heapify
(iii) this gives us the frequency of the N-th most frequent number
(iv) then we can linearly search through the hash to find the element having this frequency
Time - O(N log N)
Space - O(N)
Is there any better method?
It can be done in linear time and space. Let T be the total number of elements in the input array from which we have to find the N-th most frequent number:
Count and store the frequency of every number in a map. Let M be the total number of distinct elements in the array, so the size of the map is M. -- O(T)
Find the N-th largest frequency in the map using a selection algorithm (e.g. quickselect). -- O(M)
Total time = O(T) + O(M) = O(T)
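Here is a minimal Java sketch of that recipe, using a simple randomized quickselect for step 2 (all class and helper names are my own, not from the answer):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class NthMostFrequent {
    // Returns an element whose frequency is the N-th largest. Expected O(T).
    static int nthMostFrequent(int[] a, int n) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (int x : a) freq.merge(x, 1, Integer::sum);        // O(T) counting

        int[] counts = new int[freq.size()];
        int j = 0;
        for (int c : freq.values()) counts[j++] = c;
        int target = quickselect(counts, counts.length - n);   // N-th largest = (M-n)-th smallest, 0-based

        for (Map.Entry<Integer, Integer> e : freq.entrySet())  // O(M) final scan
            if (e.getValue() == target) return e.getKey();
        throw new IllegalStateException("unreachable");
    }

    // Randomized quickselect: k-th smallest (0-based), expected O(M).
    static int quickselect(int[] a, int k) {
        Random rnd = new Random();
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi, lo + rnd.nextInt(hi - lo + 1));
            if (k == p) return a[k];
            else if (k < p) hi = p - 1;
            else lo = p + 1;
        }
        return a[lo];
    }

    static int partition(int[] a, int lo, int hi, int pivotIdx) {
        int pivot = a[pivotIdx];
        swap(a, pivotIdx, hi);                   // move pivot out of the way
        int store = lo;
        for (int i = lo; i < hi; i++)
            if (a[i] < pivot) swap(a, i, store++);
        swap(a, store, hi);                      // pivot into its final place
        return store;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        System.out.println(nthMostFrequent(new int[]{5, 5, 5, 2, 2, 7}, 2)); // prints 2
    }
}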
Your method is basically right. You could avoid the final hash search by marking each node of the constructed heap with the number it represents. Moreover, it is possible to constantly keep watch on the N-th element of the heap as you are building it, because at some point the outcome can no longer change and the rest of the computation can be dropped. But this would probably not make the algorithm faster in the general case, and maybe not even in special cases. So you answered your own question correctly.
It depends on whether you want the most efficient method or the easiest one to write.
1) If you know that all numbers are from 0 to 1000, you just make an array of 1000 zeros (occurrence counts), loop through your array and increment the right occurrence position. Then you sort these occurrence counts and select the N-th value.
2) You keep a "bag" of unique items: you loop through your numbers and check whether each number is in the bag; if not, you add it, and if it is, you just increment its number of occurrences. Then you pick the N-th smallest from it.
The bag can be a linear array, a BST, or a dictionary (hash table).
The question asks for the "N-th most frequent", so I think you cannot avoid sorting (or a clever data structure), so the best complexity cannot be better than O(n*log(n)).
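For completeness, a quick Java sketch of option 1 (assuming all values lie in [0, 1000]; note that, as described, it yields the N-th largest frequency rather than the element itself):

import java.util.Arrays;

public class SmallRangeFrequency {
    static int nthLargestFrequency(int[] a, int n) {
        int[] occurrences = new int[1001];           // one counter per possible value
        for (int x : a) occurrences[x]++;            // O(n) counting pass
        Arrays.sort(occurrences);                    // fixed-size array: constant w.r.t. n
        return occurrences[occurrences.length - n];  // N-th largest count
    }
}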
Just wrote a method in Java 8 (this is not an efficient solution):
Create a frequency map of the elements
Sort the map contents by value in reverse order
Skip the first (N-1) entries, then take the next one
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

private static Integer findMostNthFrequentElement(int[] inputs, int frequency) {
    return Arrays.stream(inputs).boxed()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting())) // element -> count
            .entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))              // by count, descending
            .skip(frequency - 1).findFirst().get().getKey();
}
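For example, findMostNthFrequentElement(new int[]{1, 1, 2, 2, 2, 3}, 2) returns 1, since 2 occurs three times and 1 occurs twice. Note that .get() throws NoSuchElementException if N exceeds the number of distinct elements.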

Finding the repeated element

In an array of integers between 1 and 1,000,000 (or some other very large value), a single value occurs twice. How do you determine which one?
I think we can use a bitmap to mark the elements and then traverse it all over again to find the repeated element, but I think that is a process with high complexity. Is there any better way?
This sounds like homework or an interview question ... so rather than giving away the answer, here's a hint.
What calculations can you do on a range of integers whose answer you can determine ahead of time?
Once you realize the answer to this, you should be able to figure it out .... if you still can't figure it out ... (and it's not homework) I'll post the solution :)
EDIT: OK, so here's the elegant solution, if the list contains ALL of the integers within the range.
We know that all of the values between 1 and N must exist in the list. Using Gauss's formula we can quickly compute the expected sum of a range of integers:
Sum(1..N) = 1/2 * (1 + N) * Count(1..N)
Since we know the expected sum, all we have to do is loop through all the values and add them up. The difference between this sum and the expected sum is the duplicate value.
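A small Java sketch of this trick (assuming the list holds all of 1..N plus one duplicate, so its length is N+1; long arithmetic avoids overflow for large N):

public class FindDuplicateBySum {
    // a contains every integer 1..N exactly once, plus one value a second time.
    static int duplicate(int[] a) {
        long n = a.length - 1;              // N
        long expected = n * (n + 1) / 2;    // Gauss: Sum(1..N)
        long actual = 0;
        for (int x : a) actual += x;        // single O(n) pass
        return (int) (actual - expected);   // the surplus is the duplicated value
    }
}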
EDIT: As others have commented, the question doesn't state that the range contains all of the integers. In this case, you have to decide whether you want to optimize for memory or time.
If you want to perform the operation using O(1) storage, you can sort the list in place. As you sort, check adjacent elements; once you see a duplicate, you know you can stop. Optimal comparison sorting is an O(n log n) operation on average, which establishes an upper bound for finding the duplicate in this manner.
If you want to optimize for speed, you can use an additional O(n) of storage. Using a HashSet (or similar structure), insert values from your list until you find yourself inserting a duplicate. Inserting n items into a HashSet is an O(n) operation on average, which establishes that as an upper bound for this method.
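A minimal Java sketch of the HashSet variant (names are illustrative):

import java.util.HashSet;
import java.util.Set;

public class FindDuplicateHashed {
    // Expected O(n) time, O(n) extra space; works for any value range.
    static int duplicate(int[] a) {
        Set<Integer> seen = new HashSet<>();
        for (int x : a) {
            if (!seen.add(x)) return x;  // add() returns false if x was already present
        }
        throw new IllegalArgumentException("no duplicate found");
    }
}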
You may try to use a bit array as a hash map:
a 1 at position k means that the number k occurred before
a 0 at position k means that the number k did not occur before
pseudocode:
0. assume that your array is A
1. initialize a bit array (there is a nice class for this in C#) of length 1,000,000, filled with zeros
2. for each num in A:
       if bitarray[num]:
           return num
       else:
           bitarray[num] = 1
   end
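The same idea in Java, with java.util.BitSet standing in for the C# BitArray (class and method names here are my own illustration):

import java.util.BitSet;

public class FindDuplicateBitmap {
    // O(n) time; about 1,000,000 bits (~125 KB) of extra space.
    static int duplicate(int[] a) {
        BitSet seen = new BitSet(1_000_001);   // positions 1..1,000,000
        for (int num : a) {
            if (seen.get(num)) return num;     // already marked: num is the repeated element
            seen.set(num);
        }
        throw new IllegalArgumentException("no duplicate found");
    }
}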
The time complexity of the bitmap solution is O(n), and it doesn't seem like you could do better than that. However, it will take up a lot of memory for a generic list of numbers. Sorting the numbers is an obvious way to detect duplicates and doesn't require extra space if you don't mind the current order changing.
Assuming the array is of length n < N (i.e. not ALL integers are present; in that case LBushkin's trick is the answer to this homework problem), there is no way to solve this problem using less than O(n) memory with an algorithm that just takes a single pass through the array. This is by reduction to the set disjointness problem.
Suppose I made the problem easier, and I promised you that the duplicate elements were placed in the array such that the first one was among the first n/2 elements and the second one was among the last n/2 elements. Now we can think of playing a game in which two people each hold a string of n/2 elements and want to know how many messages they have to send to be sure that none of their elements are the same. Since the first player could simulate the run of any algorithm that takes a pass through the array, and send the contents of its memory to the second player, a lower bound on the number of messages they need to send implies a lower bound on the memory requirements of any such algorithm.
But it's easy to see in this simple game that they need to send n/2 messages to be sure that they don't hold any of the same elements, which yields the lower bound.
Edit: This generalizes to show that for algorithms that make k passes through the array and use memory m, we must have m*k = Omega(n). And it is easy to see that you can in fact trade off memory for time in this way.
Of course, if you are willing to use algorithms that don't simply take passes through the array, you can do better, as suggested already: sort the array, then take one pass through it. This takes time O(n log n) and space O(1). But note, curiously, that this proves that any sorting algorithm that just makes passes through the array must take time Omega(n^2)! Sorting algorithms that break the n^2 bound must make random accesses.
