Probability of a chain of 3 in an m-sized hash table?

Consider a hash table with m slots that uses chaining for collision
resolution. The table is initially empty. What is the probability that, after three
keys are inserted, there is a chain of size 3? Assume simple uniform hashing. Would it be m/m^3?
My guess was that it would be m/m (the first key can land in any of the available slots) multiplied by 1/m (the second key lands in the same slot as the first) and 1/m again (the third does too), thus creating a chain of size 3.
m/m * 1/m * 1/m = m/m^3
But I just wasn't sure if this logic was correct.

The probability is (1/m)^2. The first key goes anywhere with probability 1. Then there's a 1/m chance the second one lands in the same slot, and likewise for the third. So your logic is correct.
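To sanity-check this, here is a small Monte Carlo sketch (my own illustration in Python, not part of the question or answer) that estimates the probability of all three keys hashing to the same slot and compares it with 1/m^2:

    import random

    def chain_of_three_probability(m, trials=1_000_000):
        """Estimate P(all three inserted keys land in the same slot)."""
        hits = 0
        for _ in range(trials):
            a, b, c = (random.randrange(m) for _ in range(3))
            if a == b == c:
                hits += 1
        return hits / trials

    m = 10
    print(chain_of_three_probability(m))  # empirically close to 0.01
    print(1 / m ** 2)                     # exact value: 0.01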

Elaboration on the uniform randomness of selection of a key

Consider this question:
Efficiently picking a random element from a chained hash table?
And consider its first answer. It suggests a method for selecting a key uniformly at random. That method, however, is unclear to me. The first step occurs with probability 1/m (i.e. randomly selecting one bucket out of the m buckets).
And the second step can be divided into two steps:
1) If k<=p, then p is returned.
2) If k > p, then the loop runs again.
This is done until p is returned.
So the probability of a key being selected is:
(1/m) * [ (k1/L) + ((L - k1)/L) * (1/m) * [ (k2/L) + ((L - k2)/L) * (1/m) * [ (k3/L) + ((L - k3)/L) * [ ... and so on.
Now how can this be equal to 1/n?
This is a form of rejection-sampling.
Remark:
it looks like you split the two steps and somehow loop over the second step only (my interpretation of your formula)
redoing all steps every time is the most important aspect of the algorithm!
it's the basic idea of rejection sampling: we sample from some surrounding density and need to sample again if the selected sample falls outside our target range (that's very informal; read the link above)
Why this approach:
Imagine there are 2 buckets, where b0 has 2 elements and b1 has 4 elements.
Step 1 is selecting one bucket uniformly
But because b0 has a different number of elements than b1, the actual sampling in step 2 needs to be adapted to the number of elements in the chosen bucket (or we will lose uniformity).
We don't have this full information; we only have the upper bound L on the length of all chains.
Meaning: we use the rejection idea to sample an index from the maximum range L, and accept it only if the index is compatible with the chosen bucket. So if the selected bucket has half as many elements as the biggest one, the attempt is aborted (restart with step 1) 50% of the time. It's like padding every bucket with fake elements so that all buckets have the same size: sample, check whether a real or a fake element was selected, and start over if a fake one was hit.
It's easy to see that b0 gets chosen 50% of the time, the same as b1.
When sampling within b0, the process gets aborted 50% of the time, because k = 2 and L = 4 (L comes from the number of elements in b1).
When sampling within b1, the process never gets aborted (k = L).
If there were no chance of aborting, an element of b0 would be selected twice as often (L / size-of-b0 = L/2) as an element of b1, because the bucket is selected uniformly but the numbers of elements to sample from differ.
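To make this concrete, here is a short Python sketch of the rejection scheme described above (the function name sample_uniform_key and the list-of-lists representation of the chains are my own assumptions, not code from the linked answer):

    import random

    def sample_uniform_key(buckets):
        """buckets: list of chains (lists). Returns one key, uniform over all keys."""
        L = max(len(chain) for chain in buckets)   # upper bound on chain length
        while True:
            chain = random.choice(buckets)         # step 1: uniform bucket, probability 1/m
            k = random.randrange(L)                # step 2: uniform index in [0, L)
            if k < len(chain):                     # accept only if it hits a real element
                return chain[k]
            # otherwise reject and redo BOTH steps

    # The b0/b1 example above: b0 has 2 elements, b1 has 4.
    buckets = [["a", "b"], ["c", "d", "e", "f"]]
    counts = {}
    for _ in range(60_000):
        key = sample_uniform_key(buckets)
        counts[key] = counts.get(key, 0) + 1
    print(counts)  # each of the 6 keys should appear roughly 10,000 times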

MIT Lecture WRONG? Analysis of open addressing in hashing

In the following MIT lecture:
https://www.youtube.com/watch?v=JZHBa-rLrBA at 1:07:00, the professor shows how to calculate the number of probes in an unsuccessful search.
But my method of calculating it doesn't match his.
My answer is:
Number of probes = (the expression shown in the image)
m = number of slots in the hash table
n = number of elements (keys)
Explanation:
1. The hash function can hit an empty slot with probability (m - n)/m.
2. Or it can hit an already occupied slot with probability n/m.
3. Now in case 2, we will have to call the hash function again, and there are two possibilities:
(i) We get a slot with no key with probability (m - n)/(m - 1).
(ii) We get a slot with a key with probability (n - 1)/(m - 1).
4. Now repeat case 3, but with different probabilities, as shown in the image.
Why am I getting a different answer? What's wrong with it?
The problem asks us to find the expected number of probes that need to be done in a hash table.
You must do one no matter what, so you have 1 to start with. Then, there is an n / m chance that you have a collision. You got this right in your explanation.
If you have a collision, you must do another probe (and maybe even more). And so on, so the answer is the one the professor gets:
1 + (n / m)(1 + ((n - 1) / (m - 1))(1 + ...))
You don't multiply by the probability that you get an empty slot. You multiply the probability of not getting an empty slot by the number of operations you have to do if you don't get an empty slot (1, because you have to do at least one more probe in that case).
It is meaningless to multiply the probability of getting an open slot with the probability of not getting one, like you're doing. Remember that we want to find the expected number of probes that we need to do. So you multiply the number of operations (probes) at each step with the probability that you don't get what you'd ideally like to get (an empty slot), because if this event happens, then we'll have to do more operations (probes), otherwise we're done.
This is explained very well in the lecture you linked to if you watch carefully until the end.
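To see the numbers, here is a small Python sketch (my own, not from the lecture) that evaluates the professor's nested expression exactly with fractions and compares it against the 1/(1 - n/m) bound usually quoted for uniform hashing:

    from fractions import Fraction

    def expected_probes_unsuccessful(n, m):
        """Evaluate 1 + (n/m)(1 + ((n-1)/(m-1))(1 + ...)) for n keys in m slots (n < m)."""
        e = Fraction(1)                       # the innermost "1": the probe you always make
        for k in range(1, n + 1):             # wrap the factors k/(m - n + k) from the inside out
            e = 1 + Fraction(k, m - n + k) * e
        return e

    m, n = 10, 7
    e = expected_probes_unsuccessful(n, m)
    print(e, float(e))                        # 11/4 = 2.75 expected probes
    print(1 / (1 - n / m))                    # ~3.33, the 1/(1 - alpha) upper bound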

Reservoir sampling: why is it selected uniformly at random

I understand how the algorithm works. However, I don't understand why it is correct. Assume we need to select only one element. Here's the proof that I've found:
at every step N, keep the next element in the stream with probability 1/N. This means that we have an (N-1)/N probability of keeping the element we are currently holding on to, which means that we keep it with probability (1/(N-1)) * (N-1)/N = 1/N.
I understand everything except for the last part. Why do we multiply the probabilities?
Because Pr[A AND B] == Pr[A] * Pr[B], assuming that A and B are independent (as they are here). The probability of choosing the element AND not replacing it later is the product of those two events' probabilities.
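To see the 1/N argument in action, here is a minimal Python sketch (my own illustration) of single-element reservoir sampling, plus a quick count showing that every element is returned about equally often:

    import random

    def sample_one(stream):
        """Keep the n-th element with probability 1/n; return whatever survives."""
        chosen = None
        for n, item in enumerate(stream, start=1):
            if random.randrange(n) == 0:      # true with probability 1/n
                chosen = item
        return chosen

    counts = {}
    for _ in range(50_000):
        picked = sample_one(range(5))
        counts[picked] = counts.get(picked, 0) + 1
    print(counts)  # each of 0..4 appears roughly 10,000 times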

Probability of collision of SecureRandom.urlsafe_base64(8) in Ruby?

I am using SecureRandom.urlsafe_base64(8) in order to create unique ids in my system that are URL safe.
I would like to know how to calculate the probability of a collision. I am inserting about 10,000 of those ids into an array, and I want to avoid checking whether each new id is already in the array, but I also want to make sure they are not repeated. What are the chances?
There is a good approximation of this probability (which relates to the birthday problem). If there are k potential values and n are sampled, the probability of collision is:
k! / (k^n * (k - n)!)
The base64 method returns a base 64 string built from the inputted number of random bytes, not that number of random digits. Eight random bytes gives us k = 256^8, about 1.8446744e+19. You are generating 10,000 of these strings, so n = 10,000, which gives us a probability of 2.710498492319857e-12, which is very low.
You do not make something certain by calculating a probability; you only know how likely it is to happen.
To protect yourself, just add a unique index to the database column. That ensures that you cannot store duplicate entries in your database. With such a unique index, an insertion will raise an ActiveRecord::StatementInvalid error in case this very unlikely event (see Andrew's answer) ever happens.
A slight adjustment to Andrew's answer: I believe the equation for the probability of a collision is:
1 - (k! / (k^n * (k - n)!))
Given that k is the number of potential values and n the number of samples.
The equation:
k! / (k^n * (k - n)!)
gives the probability that there is NOT a collision -- according to the birthday problem wiki.
You can sanity check this by trying a few different n values. More samples should naturally give a higher probability of collision.
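For reference, here is a short sketch (in Python rather than Ruby, purely as an illustration) that evaluates the collision probability in log space, so the huge factorials never have to be computed, and compares it with the common n(n - 1)/(2k) approximation for the numbers in the question:

    import math

    def collision_probability(k, n):
        """1 - k!/(k^n * (k - n)!), i.e. 1 - product of (1 - i/k) for i in 0..n-1."""
        log_no_collision = sum(math.log1p(-i / k) for i in range(n))
        return 1 - math.exp(log_no_collision)

    k = 256 ** 8   # possible values of SecureRandom.urlsafe_base64(8)
    n = 10_000     # ids generated
    print(collision_probability(k, n))   # ~2.7e-12
    print(n * (n - 1) / (2 * k))         # birthday approximation, also ~2.7e-12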

Randomly sample a data set

I came across a question that was asked in an interview.
Q - Imagine you are given a really large stream of data elements (queries on google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?
I am looking for -
What does random sampling of a data set mean?
(I mean I could simply do a coin toss and select a string from the input if the outcome is 1, and do this until I have 1,000 samples.)
What do I need to consider while doing so? For example, is taking contiguous strings better than taking non-contiguous ones? To rephrase: is it better to pick a contiguous block of 1,000 strings at a random position, or to pick one string at a time, like a coin toss?
This may be a vague question. I tried to google "randomly sample data set" but did not find any relevant results.
A binary sample/don't-sample decision may not be the right answer either: suppose you want to sample 1,000 strings and you do it via coin toss. That would mean you are done after visiting roughly 2,000 strings. What about the rest of the strings?
I read this post - http://gregable.com/2007/10/reservoir-sampling.html
which answers this question quite clearly.
Let me put the summary here:
SIMPLE SOLUTION
Assign a random number to every element as you see them in the stream, and then always keep the top 1,000 numbered elements at all times.
RESERVOIR SAMPLING
Make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream.
Start with i = 1,001. With what probability after the 1001'th step should element 1,001 (or any element for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001 you should take element 1,001.
If you choose to add it, then replace any element (say element #2) in the reservoir chosen randomly. The element #2 is definitely in the reservoir at step 1,000 and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 getting randomly chosen as the replacement candidate. That probability is 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 - that or 1,000/1,001.
This can be extended for the i'th round - keep the i'th element with probability 1,000/i and if you choose to keep it, replace a random element from the reservoir. The probability any element before this step being in the reservoir is 1,000/(i-1). The probability that they are removed is 1,000/i * 1/1,000 = 1/i. The probability that each element sticks around given that they are already in the reservoir is (i-1)/i and thus the elements' overall probability of being in the reservoir after i rounds is 1,000/(i-1) * (i-1)/i = 1,000/i.
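Here is a compact Python sketch of the reservoir procedure just summarized (illustrative code, not taken from the linked post):

    import random

    def reservoir_sample(stream, size=1000):
        """Keep the first `size` elements, then keep element i with probability size/i,
        evicting a uniformly chosen entry of the reservoir."""
        reservoir = []
        for i, item in enumerate(stream, start=1):
            if i <= size:
                reservoir.append(item)
            elif random.randrange(i) < size:            # true with probability size/i
                reservoir[random.randrange(size)] = item
            # else: discard the item and move on
        return reservoir

    sample = reservoir_sample(range(1_000_000), size=1000)
    print(len(sample))   # 1000 elements; each stream element lands here with probability 1000/1,000,000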
I think you have used the word infinite a bit loosely. The very premise of sampling is that every element has an equal chance to be in the sample, and that is only possible if you go through every element at least once. So I would translate infinite to mean a large number, indicating that you need a single-pass solution rather than multiple passes.
Reservoir sampling is the way to go, though the analysis from #abipc seems to be in the right direction but is not completely correct.
It is easier if we are first clear on what we want. Imagine you have N elements (N unknown) and you need to pick 1,000 of them. This means we need to devise a sampling scheme where the probability of any element being in the sample is exactly 1000/N, so each element has the same probability of being in the sample (no preference to any element based on its position in the original list). The scheme mentioned by #abipc works fine; the probability calculation goes like this:
After the first step you have 1,001 elements, so we need to pick each element with probability 1000/1001. We pick the 1,001st element with exactly that probability, so that is fine. Now we also need to show that every other element has the same probability of being in the sample.
p(any other element remaining in the sample)
= 1 - p(that element is removed from the sample)
= 1 - p(the 1,001st element is selected) * p(that element is picked as the replacement)
= 1 - (1000/1001) * (1/1000) = 1000/1001
Great, so now we have proven that every element has probability 1000/1001 of being in the sample. The same argument extends to the ith step by induction.
As far as I know, this class of algorithms is called reservoir sampling algorithms.
I know one of them from data mining, but I don't know its name:
Collect the first S elements in your storage, whose maximum size is S.
Suppose the next element of the stream has number N.
With probability S/N keep the new element, else discard it.
If you kept element N, replace one of the elements in the sample S, picked uniformly at random.
Set N = N + 1, get the next element, and repeat from the S/N step.
It can be proved theoretically that at any step of such stream processing, your storage of size S contains each element seen so far with equal probability S/N_you_have_seen (a quick empirical check is sketched below).
So, for example, S = 10 and N_you_have_seen = 10^6.
S is a finite number;
N_you_have_seen can be arbitrarily large.
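As a quick empirical check of that S/N claim (my own sketch, with small numbers so the counts converge quickly):

    import random

    def reservoir(stream, S):
        """The scheme above: keep element n with probability S/n, evicting a random slot."""
        store = []
        for n, x in enumerate(stream, start=1):
            if n <= S:
                store.append(x)
            elif random.randrange(n) < S:        # true with probability S/n
                store[random.randrange(S)] = x
        return store

    S, N, runs = 3, 10, 100_000
    hits = [0] * N
    for _ in range(runs):
        for x in reservoir(range(N), S):
            hits[x] += 1
    print([round(h / runs, 3) for h in hits])    # each entry should be close to S/N = 0.3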
