perfect hash table sub-table construction success probability - data-structures

Question: How many times must you attempt to construct one of a perfect hash table's sub-tables so that you have one with probability greater than 99.9999%? That is a failure rate of 1 out of a million.
My answer is 4n times, but I know I am wrong.
I am a novice, so I need help.
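In the standard FKS-style analysis, a randomly chosen hash function builds a collision-free sub-table with probability at least 1/2, so each attempt fails with probability at most 1/2, and the number of attempts needed is a constant rather than a multiple of n. A minimal sketch of the arithmetic under that assumption (variable names are mine):

import math

# Assumption (standard FKS-style analysis): each construction attempt
# fails with probability at most 1/2, independently of the others.
fail_per_attempt = 0.5
target_failure = 1e-6          # "1 out of a million"

# Smallest t with fail_per_attempt ** t <= target_failure
t = math.ceil(math.log(target_failure) / math.log(fail_per_attempt))
print(t)                       # 20

So under that assumption, about 20 attempts suffice, independent of n.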

Related

Calculating cumulative probabilities

I'm sorry if this is the wrong place for this query. If it is, perhaps someone could direct me to the right place.
I have a program that has a bunch of objects (say n) to process, and a process that iteratively processes one object at a time.
At each iteration I have one less object, and I want to check whether I need more objects.
If there are 100 objects or more, I have plenty. When there are fewer than 100 objects, I would like to get some more objects with a probability (P) that is roughly zero at 100 objects and 1 at 0 objects:
P(n) = 1 - (n/100)
If I just make a random decision based on this probability at each iteration, then over time I get a cumulative probability that is the product of the series of probabilities, which is not the same as the formula above.
If the probabilities added each time, I would get an integral of P(n), but since it is an accumulating product, what is the new function and how do I calculate it?
So I would like the total probability up till now to equal that formula. How do I work out the probability I need at the current iteration?
I realised after some thought that the answer is simple, because the probability at each step is not independent: if I get more objects, the probability resets; if I don't get more objects, the cumulative probability is the product over all the times before that I didn't get more objects.
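To make that concrete, here is a small sketch. If you require the cumulative "no restock yet" probability after the count has fallen to n to equal n/100, that forces (1 - p(n)) = n/(n+1), i.e. a per-iteration probability of p(n) = 1/(n+1) (this derivation is an assumption on my part, not stated above); the product of the "no" probabilities then reproduces the target curve:

# Per-step probability p(n) = 1/(n + 1), so that the cumulative probability
# of having restocked by the time n objects remain equals 1 - n/100.
# Derivation: survival S(n) = prod_{k=n}^{99} (1 - p(k)) should equal n/100,
# so (1 - p(n)) = S(n) / S(n+1) = n / (n + 1).

def p(n):
    return 1.0 / (n + 1)

survival = 1.0
for n in range(99, 49, -1):    # walk the count down from 99 to 50
    survival *= 1.0 - p(n)

print(survival)                # ~0.50, i.e. restock probability ~ 1 - 50/100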

Searching for a key that doesn't exist and has never existed: O(n)?

Let's take linear probing as an example because it's simple.
You have a (fictional) hash table whose keys look like this:
1 2 3 4 5 6 7
[23| | 44|67|89| |22]
You want to check for the key 99, which doesn't exist. It gives the hash value 5.
Surely the algorithm goes like this:
Check 5: X
Check 6: X
Check 7: X
Check 1: X
Check 2: X
Check 3: X
Check 4: X
Reached 5 again: Key not found
Surely there is no way that the algorithm can tell if the key is present or not unless it checks the whole table.
However, while searching for an answer to this, I stumbled upon this page: https://msdn.microsoft.com/en-us/library/system.collections.hashtable.containskey(v=vs.110).aspx which states that it is O(1). Of course, if the key exists it can be O(1), but it won't be on average, will it? And the worst-case scenario (which is every time the key is not present?) would be O(n).
Am I correct in thinking this?
EDIT:
I just realised that it would stop when it hit an empty space... So this means that it would only reach O(n) if the table is full? Which must be why you don't want clustering?
You are right. Bear in mind that every decent hash table implementation that uses open addressing as a collision resolution technique (linear probing belongs to open addressing) tracks a special number called the load factor. The load factor is the ratio between the number of items in the hash table and the total number of available slots. When the load factor grows beyond a certain value, the hash table gets expanded; that is how the number of probes is kept small enough to ensure good performance.
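To make the "stops at an empty slot" behaviour concrete, here is a minimal linear-probing lookup sketch (a 0-indexed toy version of the table above, not the actual C# Hashtable code; the hash function is a stand-in):

def lookup(table, key, hash_fn):
    # Linear probing: scan from the home slot; an empty slot proves absence.
    m = len(table)
    start = hash_fn(key) % m
    for i in range(m):
        slot = table[(start + i) % m]
        if slot is None:           # empty slot: the key was never inserted
            return False
        if slot == key:            # found it
            return True
    return False                   # table completely full and key absent

table = [23, None, 44, 67, 89, None, 22]
print(lookup(table, 99, lambda k: 4))   # probes 89, then an empty slot -> False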
Since you searched for a C# implementation, I took the time and found documentation describing the hash table implementation in C# 2.0. It states:
As aforementioned, Microsoft has tuned the Hashtable to use a default load factor of 0.72. Therefore, you can expect on average 3.5 probes per collision. Because this estimate does not vary based on the number of items in the Hashtable, the asymptotic access time for a Hashtable is O(1), which beats the pants off of the O(n) search time for an array.
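That 3.5-probe figure matches the standard uniform-hashing estimate of about 1/(1 - alpha) expected probes for an unsuccessful search at load factor alpha; a one-line check:

# Expected probes for an unsuccessful search under uniform hashing: 1 / (1 - alpha)
load_factor = 0.72
print(1.0 / (1.0 - load_factor))   # ~3.57, matching the quoted "3.5 probes"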

Hash table sequence always gets inserted

I have a problem related to the hash tables.
Let's consider a hash table of dimension 2^n in an open linear scheme.
h(k,i) = (k^n + 2*i) mod (2^n). Show that the sequence {1,2,...,2^n} can always be inserted into the hash table.
I tried to identify a pattern in the way the numbers get inserted into the table and then apply induction to see if I can prove the claim. Every problem our teacher gives us seems to be like this one, and I can't figure out a way of approaching these kinds of problems.
h(k,i) = (k^n + 2*i) mod (2^n). Show that the sequence {1,2,...,2^n} can always be inserted into the hash table.
Two observations about the hash function:
k^n, for n >= 1, will be odd when k is odd, and even when k is even
2*i will probe every second bucket (wrapping around from last to first)
So, as you hash {1,2,...,2^n}, you'll alternate between finding an unused odd-indexed bucket and an unused even-indexed bucket.
Just to emphasise the point, the k^n term restricts the odd keys to odd-indexed buckets and the even keys to even-indexed buckets, while 2*i ensures all such buckets are probed until a free one is found. It matters that exactly half the keys are odd and half even: that is what lets the table fill up without h(k,i) ever failing to find an unused bucket as i is incremented.
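A quick brute-force check (a sketch; I'm assuming the probe index i starts at 0) confirms that inserting 1, 2, ..., 2^n with this hash function always finds a free bucket:

def insert_all(n):
    m = 2 ** n
    table = [None] * m
    for k in range(1, m + 1):          # keys 1, 2, ..., 2^n
        for i in range(m):             # probe i = 0, 1, 2, ...
            slot = (k ** n + 2 * i) % m
            if table[slot] is None:
                table[slot] = k
                break
        else:
            return False               # no free bucket found: insertion failed
    return True

print(all(insert_all(n) for n in range(1, 6)))   # True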
You have a lot of terminology problems here.
Your hash table does not have dimensions (actually it has one, but it is one dimension, not 2^n); it has a number of slots/buckets.
Most probably the question you asked is not the question your book/teacher wants you to solve. You wrote:
Show that the sequence {1,2,...,2^n} can always be inserted into the hash table
and the problem is that in your case any natural number can be inserted into your hash table. This is obvious: your hash function maps any number into the range [0, 2^n), and because your hash table has 2^n slots, any number will fit.
So clarify what your teacher wants, find out what k and i are in your hash function, and ask another, better-prepared question.

MIT Lecture WRONG? Analysis of open addressing in hashing

In the following MIT lecture:
https://www.youtube.com/watch?v=JZHBa-rLrBA at 1:07:00, the professor teaches how to calculate the number of probes in an unsuccessful search.
But my method of calculating doesn't match his.
My answer is:
Number of probes = [formula shown in the image]
where m = number of slots in the hash table and n = number of elements (keys).
Explanation:
1. The hash function can hit an empty slot with probability (m-n)/m.
2. Or it can hit an occupied slot with probability n/m.
3. Now in case 2, we will have to call the hash function again, and there are two chances:
(i) we get an empty slot with probability (m-n)/(m-1);
(ii) we get an occupied slot with probability (n-1)/(m-1).
4. Now repeat case 3 but with different probabilities, as shown in the image.
Why am I getting a different answer? What's wrong with it?
The problem asks us to find the expected number of probes that need to be done in a hash table.
You must do one no matter what, so you have 1 to start with. Then, there is an n / m chance that you have a collision. You got this right in your explanation.
If you have a collision, you must do another probe (and maybe even more). And so on, so the answer is the one the professor gets:
1 + (n / m)(1 + ((n - 1) / (m - 1))(1 + ...))
You don't multiply by the probability that you get an empty slot. You multiply the probability of not getting an empty slot by the number of operations you have to do if you don't get an empty slot (1, because you have to do at least one more probe in that case).
It is meaningless to multiply the probability of getting an open slot by the probability of not getting one, as you're doing. Remember that we want to find the expected number of probes we need to do. So you multiply the number of probes at each step by the probability that you don't get what you'd ideally like to get (an empty slot), because if this event happens we'll have to do more probes; otherwise we're done.
This is explained very well in the lecture you linked to if you watch carefully until the end.
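As a numeric check, the professor's nested expression can be evaluated from the inside out; a small sketch (m slots and n keys, as above):

def expected_probes(n, m):
    # Evaluate 1 + (n/m)(1 + ((n-1)/(m-1))(1 + ...)) from the inside out.
    e = 1.0
    for t in range(n - 1, -1, -1):     # ratios (n-t)/(m-t), innermost first
        e = 1.0 + ((n - t) / (m - t)) * e
    return e

print(expected_probes(72, 100))        # ~3.48 probes at load factor 0.72
print(1 / (1 - 0.72))                  # ~3.57, the 1/(1 - alpha) upper bound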

Upper bound on 4 digit sequences in pi

If this is not the right SE site for this question, please let me know.
A friend shared this interview question he received over the phone, which I have tried to solve myself. I will paraphrase:
The value of pi up to n digits as a string is given.
How can I find all duplicate 4 digit sequences in this string?
This part seems fairly straightforward. Add 4-character sequences to a hash table, advancing one character at a time. Before inserting, check whether the current 4-character sequence already exists in the hash table; if so, you have found a duplicate. Record it somewhere and repeat the process. I was told this was more or less correct.
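A sketch of that approach (the digit string below is just a toy example, not real digits of pi):

def duplicate_sequences(digits):
    # Slide a 4-character window; report sequences seen more than once.
    seen = set()
    duplicates = set()
    for i in range(len(digits) - 3):
        seq = digits[i:i + 4]
        if seq in seen:
            duplicates.add(seq)
        else:
            seen.add(seq)
    return duplicates

print(duplicate_sequences("14159265358979314159"))   # {'1415', '4159'}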
The issue I have is on the second question:
What is the upper bound?
n = 10,000,000 was an example.
My algorithms background is admittedly very rusty. My first thought was that the upper bound must be related to n somehow, but I was told it is not.
How do I calculate this?
EDIT:
I would also be open to a solution that disregards the constraint that the upper bound is not related to n. Either is acceptable.
There are only 10,000 possible sequences of four digits (0000 to 9999), so at some point you will have found that every sequence has been duplicated, and there's no need to process further digits.
If you assume that pi is a perfectly uniform random number generator, then each new digit that's processed results in a new sequence, and after about 20,000 digits you would, at best, have found duplicates for all 10,000 sequences. Given that pi's digits won't pack that perfectly, you may need significantly more digits before you duplicate all sequences, but 100,000 would be a reasonable guess at the upper bound.
Also, since there are only 10,000 possibilities, you don't really need a hash table. You can simply use an array of 10,000 counters (int count[10000]) and increment the count for each sequence you find.
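That counter-array variant, as a sketch on the same toy input:

def duplicates_by_counting(digits):
    # Index counters by the 4-digit value itself; no hash table needed.
    count = [0] * 10000
    for i in range(len(digits) - 3):
        count[int(digits[i:i + 4])] += 1
    return {"%04d" % v for v, c in enumerate(count) if c > 1}

print(duplicates_by_counting("14159265358979314159"))   # {'1415', '4159'}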
The upper bound of your solution is the size of the hash table that you can fit into memory.
An alternate technique is to generate all the sequences and sort them. Then the duplicates will be adjacent and easy to detect. You can generally fit more into a linear data structure than you can a hash table, and if you still exhaust memory you can sort to/from disk.
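A sketch of that sort-based variant, again on a toy input:

def duplicates_by_sorting(digits):
    # Sort all 4-character windows so duplicates become adjacent.
    windows = sorted(digits[i:i + 4] for i in range(len(digits) - 3))
    return {a for a, b in zip(windows, windows[1:]) if a == b}

print(duplicates_by_sorting("14159265358979314159"))    # {'1415', '4159'}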
Edit: unless "upper bound" means the O(n) of the algorithm, which should be easy to figure out.
