Efficiently picking a random element from a chained hash table? - algorithm

Just for practice (and not as a homework assignment) I have been trying to solve this problem (CLRS, 3rd edition, exercise 11.2-6):
Suppose we have stored n keys in a hash table of size m, with
collisions resolved by chaining, and that we know the length of each
chain, including the length L of the longest chain. Describe a
procedure that selects a key uniformly at random from among the keys
in the hash table and returns it in expected time O(L * (1 + m/n)).
What I have thought of so far: each key should be returned with probability 1/n. If we pick a random value x between 1 and n and look for the xth key in sequence (ordered first by bucket, then along the chain within a bucket), it takes O(m) time to find the right bucket by scanning the buckets one by one, plus O(L) time to reach the right key in the chain.

Repeat the following steps until an element is returned:
Randomly select a bucket. Let k be the number of elements in the bucket.
Select p uniformly at random from 1 ... L. If p <= k then return the pth element in the bucket.
It should be clear that the procedure returns an element uniformly at random. We are sort of applying rejection sampling to the elements placed in a 2D array.
The expected number of elements per bucket is n / m. The probability that the sampling attempt succeeds is (n / m) / L. The expected number of attempts needed to find a bucket is therefore L * m / n. Together with the O(L) cost of retrieving the element from the bucket, the expected running time is O(L * (1 + m / n)).
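The rejection-sampling procedure above can be sketched in Java; the table layout (an array of chains) and all names here are my own assumptions, not from the original answer:

```java
import java.util.*;

class UniformPick {
    // Repeat: pick a bucket uniformly, then a slot in 0..L-1; accept only if
    // the slot is occupied. Every key is returned with probability 1/(m*L)
    // per attempt, hence uniformly overall.
    static int pick(List<List<Integer>> table, int maxChainLen, Random rnd) {
        while (true) {
            List<Integer> bucket = table.get(rnd.nextInt(table.size()));
            int p = rnd.nextInt(maxChainLen);            // 0-based slot
            if (p < bucket.size()) return bucket.get(p); // accept
            // otherwise reject and try again
        }
    }
}
```

Here `maxChainLen` is the known length L of the longest chain.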

Related

How to find 2 special elements in the array in O(n)

Let a1,...,an be a sequence of real numbers. Let m be the minimum of the sequence, and let M be the maximum of the sequence.
I proved that there exist two elements of the sequence, x and y, such that |x-y| <= (M-m)/n.
Now, is there an algorithm that finds two such elements in O(n) time?
I thought about sorting the sequence, but since I don't know anything about M I cannot use radix/bucket sort or any other linear-time algorithm that I'm familiar with.
I'd appreciate any idea.
Thanks in advance.
First find out n, M, m. If not already given they can be determined in O(n).
Then create a memory storage of n+1 elements; we will use the storage for n+1 buckets with width w=(M-m)/n.
The buckets cover the range of values equally: Bucket 1 goes from [m; m+w[, Bucket 2 from [m+w; m+2*w[, Bucket n from [m+(n-1)*w; m+n*w[ = [M-w; M[, and the (n+1)th bucket from [M; M+w[.
Now we go through all the values once and sort them into the buckets according to the assigned intervals. There should be at most one element per bucket. If a bucket is already filled, it means that two elements are closer together than the width of the half-open interval, i.e. we have found elements x, y with |x-y| < w = (M-m)/n.
If no two such elements are found, then afterwards n of the n+1 buckets are filled with one element each, and all those elements appear in sorted order across the buckets.
We once more go through all the buckets and compare the distance of the content of neighbouring buckets only, whether there are two elements, which fulfil the condition.
Due to the width of the buckets, the condition cannot be true for buckets, which are not adjoining: For those the distance is always |x-y| > w.
(The fulfilment of the last inequality in 4. is also the reason why the intervals are half-open rather than closed, and why we need n+1 buckets instead of n. An alternative would be to use n buckets and make what is now the last bucket a special case with [M; M+w]. But O(n+1) = O(n), and using n+1 buckets is preferable to special-casing the last one.)
The running time is O(n) for step 1, 0 for step 2 - we actually do not do anything there, O(n) for step 3 and O(n) for step 4, as there is only 1 element per bucket. Altogether O(n).
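The four steps above might look like this in Java. This is a sketch under my own naming; it returns null when no qualifying pair exists (the case discussed further down, where the stronger condition fails):

```java
import java.util.*;

class ClosePair {
    // Steps 1-4: returns indices {i, j} with |a[i] - a[j]| <= (M - m)/n,
    // or null if the (stronger) condition does not hold for this sequence.
    static int[] find(double[] a) {
        int n = a.length;
        double m = a[0], M = a[0];
        for (double v : a) { m = Math.min(m, v); M = Math.max(M, v); }
        double w = (M - m) / n;                      // bucket width
        if (w == 0) return new int[]{0, 1};          // all values equal
        int[] buckets = new int[n + 1];              // n+1 half-open buckets
        Arrays.fill(buckets, -1);
        for (int i = 0; i < n; i++) {
            int b = (int) ((a[i] - m) / w);          // M lands in bucket n
            if (b > n) b = n;
            if (buckets[b] != -1) return new int[]{buckets[b], i}; // two in one bucket
            buckets[b] = i;
        }
        // one bucket left empty: only neighbouring buckets can hold a close pair
        int prev = -1;
        for (int b = 0; b <= n; b++) {
            if (buckets[b] == -1) { prev = -1; continue; }
            if (prev != -1 && Math.abs(a[buckets[b]] - a[prev]) <= w)
                return new int[]{prev, buckets[b]};
            prev = buckets[b];
        }
        return null;
    }
}
```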
This task shows that sorting elements which are not close together, or coarse sorting that ignores fine distances, can be done in O(n) instead of O(n*log(n)). It has useful applications: numbers on computers are discrete and have finite precision. I have successfully used this sorting method for signal processing / fast sorting in real-time production code.
About @Damien's remark: the real threshold of (M-m)/(n-1) is provably true for every such sequence. In the answer so far I assumed that the sequence we are looking at is of a special kind for which the stronger condition holds, or at least that, for any sequence where the stronger condition holds, we would find such elements in O(n).
If instead this was a small mistake by the OP (who said they had proven the stronger condition) and we should find two elements x, y with |x-y| <= (M-m)/(n-1), we can simplify:
1. to 3.: We would do steps 1 to 3 as above, but with n buckets and the bucket width set to w = (M-m)/(n-1). Bucket n now goes from [M; M+w[.
For step 4 we would do the following alternative:
4. (alternative): All n buckets are filled with one element each. The element in bucket n has to be M, which sits at the left boundary of that bucket's interval. For every possible element x in the (n-1)th bucket, the distance to y = M is |M-x| <= w = (M-m)/(n-1), so we have found x and y which fulfil the condition, q.e.d.
First note that the real threshold should be (M-m)/(n-1).
The first step is to calculate the min m and max M elements, in O(N).
You calculate the value mid = (m + M)/2.
You move the values less than mid to the beginning of the array, and the values greater than mid to the end.
You keep the part with the larger number of elements and iterate until only a few numbers remain.
If both parts have the same number of elements, you can select either of them. If the remaining part has many more than n/2 elements, then in order to maintain O(n) complexity you can keep only n/2 + 1 of them, as the goal is not to find the smallest difference, but only one difference that is small enough.
As indicated in a comment by @btilly, this solution can fail in some cases, for example with the input [0, 2.1, 2.9, 5]. To handle that, one must also compute the maximum value of the left part and the minimum value of the right part, and test whether the answer is right_min - left_max. This doesn't change the O(n) complexity, even if the solution becomes less elegant.
Complexity of the search procedure: O(n) + O(n/2) + O(n/4) + ... + O(2) = O(2n) = O(n).
Damien is correct in his comment that the correct results is that there must be x, y such that |x-y| <= (M-m)/(n-1). If you have the sequence [0, 1, 2, 3, 4] you have 5 elements, but no two elements are closer than (M-m)/n = (4-0)/5 = 4/5.
With the right threshold, the solution is easy - find M and m by scanning through the input once, and then bucket the input into (n-1) buckets of size (M-m)/(n-1), putting values that are on the boundaries of a pair of buckets into both buckets. At least one bucket must have two values in it by the pigeon-hole principle.
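This boundary-duplicating bucket scheme can be sketched as follows. The names are mine, and the exact-equality boundary test is a simplification that ignores floating-point rounding:

```java
import java.util.*;

class PigeonholePair {
    // Bucket width w = (M-m)/(n-1); a value on a bucket boundary is placed in
    // both adjacent buckets, so some bucket receives two values by pigeonhole.
    static double[] find(double[] a) {
        int n = a.length;
        double m = a[0], M = a[0];
        for (double v : a) { m = Math.min(m, v); M = Math.max(M, v); }
        double w = (M - m) / (n - 1);
        if (w == 0) return new double[]{a[0], a[1]};  // all values equal
        Map<Integer, Double> buckets = new HashMap<>(); // bucket number -> value
        for (double v : a) {
            int b = (int) ((v - m) / w);
            // exact boundary value: it belongs to both bucket b-1 and bucket b
            boolean onBoundary = (v - m == b * w) && b > 0;
            for (int nb : onBoundary ? new int[]{b - 1, b} : new int[]{b}) {
                Double other = buckets.get(nb);
                if (other != null && Math.abs(other - v) <= w)
                    return new double[]{other, v};    // two values in one bucket
                buckets.put(nb, v);
            }
        }
        return null; // unreachable by the pigeonhole argument
    }
}
```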

Hashing in data structures

Consider a hash table with n buckets, where external (overflow) chaining is used to resolve collisions. The hash function is such that the probability that a key value is hashed to a particular bucket is 1/n. The hash table is initially empty and K distinct values are inserted in the table.
What is the probability that bucket number 1 is empty after the Kth insertion?
What is the probability that no collision has occurred in any of the K insertions?
What is the probability that the first collision occurs at the Kth insertion?
The probability that bucket 1 is empty after ONE insertion is (n−1)/n. That's the probability that the first item didn't hash to bucket 1. The event that it's empty after TWO insertions is "first item missed bucket 1" AND "second item missed bucket 1", which has probability ((n − 1)/n)² = (n − 1)²/n². With this, I hope you can compute the probability that the bucket is empty after K insertions.
For K = 1, it's 1. For K = 2, the second item must miss the bucket of the first item, so it has n − 1 places it can safely go; the probability of success is therefore (n − 1)/n. What about the third item? It has only n − 2 safe places. So the probability for K = 3 is (n − 1)(n − 2)/n². You can generalize. Be careful of the case K > n.
Once you work out the details of the first two parts, you can probably make progress on this one as well.
Hint: the first collision occurs on the kth insertion if (i) the first k−1 insertions didn't collide (see 2) and (ii) the kth insertion DOES cause a collision (see the complement of 2).
Let me know if you can figure out all the three answers. Otherwise, I will put more details.
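For concreteness, the three quantities described above can be computed numerically; this sketch and its names are my own:

```java
class HashProb {
    // P(bucket 1 empty after K insertions) = ((n-1)/n)^K
    static double emptyBucket(int n, int K) {
        return Math.pow((n - 1.0) / n, K);
    }

    // P(no collision in K insertions) = (n-1)/n * (n-2)/n * ... * (n-K+1)/n
    static double noCollision(int n, int K) {
        if (K > n) return 0;                 // pigeonhole: more keys than buckets
        double p = 1;
        for (int i = 1; i < K; i++) p *= (n - i) / (double) n;
        return p;
    }

    // P(first collision at the Kth insertion)
    //   = P(no collision in K-1 insertions) - P(no collision in K insertions)
    static double firstCollisionAt(int n, int K) {
        return noCollision(n, K - 1) - noCollision(n, K);
    }
}
```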

Finding the kth smallest element in a sequence where duplicates are compressed?

I've been asked to write a program to find the kth order statistic of a data set consisting of character and their occurrences. For example, I have a data set consisting of
B,A,C,A,B,C,A,D
Here I have A with 3 occurrences, B with 2 occurrences, C with 2 occurrences, and D with one occurrence. They can be grouped into pairs (character, number of occurrences), so, for example, we could represent the above sequence as
(A,3), (B,2), (C,2) and (D,1).
Assuming that k is the number of these pairs, I am asked to find the kth order statistic of the data set in O(n), where n is the number of pairs.
I thought I could sort the elements based on their number of occurrences and then find the kth smallest element, but that won't meet the time bound. Can I please have some help with an algorithm for this problem?
Assuming that you have access to a linear-time selection algorithm, here's a simple divide-and-conquer algorithm for solving the problem. I'm going to let k denote the total number of pairs and m be the index you're looking for.
If there's just one pair, return the key in that pair.
Otherwise:
Using a linear-time selection algorithm, find the median element. Let medFreq be its frequency.
Sum up the frequencies of the elements less than the median. Call this less. Note that the number of elements less than or equal to the median is less + medFreq.
If less < m ≤ less + medFreq, return the key in the median element.
Otherwise, if m ≤ less, recursively search for the mth element in the first half of the array.
Otherwise (m > less + medFreq), recursively search for the (m - less - medFreq)th element in the second half of the array.
The key insight here is that each iteration of this algorithm tosses out half of the pairs, so each recursive call is on an array half as large as the original array. This gives us the following recurrence relation:
T(k) = T(k / 2) + O(k)
Using the Master Theorem, this solves to O(k).
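A sketch of the divide-and-conquer above; for brevity it finds the median by sorting, which costs O(k log k) per level, and substituting a true linear-time selection algorithm would recover the O(k) bound:

```java
import java.util.*;

class KthStat {
    // pairs[i] = {key, frequency}; m is the 1-based rank in the expanded multiset.
    static int select(int[][] pairs, long m) {
        if (pairs.length == 1) return pairs[0][0];
        int[][] a = pairs.clone();
        Arrays.sort(a, (p, q) -> Integer.compare(p[0], q[0]));
        int mid = a.length / 2;                    // median pair by key
        long less = 0;                             // total frequency below the median
        for (int i = 0; i < mid; i++) less += a[i][1];
        long medFreq = a[mid][1];
        if (m > less && m <= less + medFreq) return a[mid][0];
        if (m <= less) return select(Arrays.copyOfRange(a, 0, mid), m);
        return select(Arrays.copyOfRange(a, mid + 1, a.length), m - less - medFreq);
    }
}
```

With the example pairs (A,3), (B,2), (C,2), (D,1), the expanded sorted sequence is A,A,A,B,B,C,C,D, so rank 4 yields B.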

Two pairs of numbers with same sum

Given a list [a_1 a_2 ... a_n] of (not necessarily distinct) integers, determine whether there exist pairwise distinct indices w,x,y,z such that a_w + a_x = a_y + a_z.
I know that one way is to use 4 levels of for loops, each one iterating over one of the indices. When we get equal sums, check whether all the indices are pairwise distinct. If they are, return true. If we've exhausted all the possibilities, return false. This has running time O(n^4).
Can we do better?
Compute all possible values for a_w + a_x, insert them to hash table. Insert (a_w + a_x, w) and (a_w + a_x, x) to second hash table.
Prior to inserting a value to first hash table, check if it is already in the table. If so, check second table. If either of (a_w + a_x, w) or (a_w + a_x, x) is there, don't insert anything (we've got a duplicate element). If neither of these pairs is in the second table, we've got positive answer.
If, after processing all (w, x) pairs, we've got no positive answer, this means there is no such pairwise distinct indices.
Time complexity is O(n²). Space requirements are also O(n²).
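A sketch of the O(n²) hash-table approach, simplified to store only the first index pair seen for each sum (one can check this still reports success whenever two disjoint pairs share a sum); the names are mine:

```java
import java.util.*;

class EqualSums {
    // True if there exist pairwise distinct w, x, y, z with a[w]+a[x] == a[y]+a[z].
    static boolean exists(int[] a) {
        Map<Long, int[]> seen = new HashMap<>(); // sum -> first (w, x) producing it
        for (int w = 0; w < a.length; w++) {
            for (int x = w + 1; x < a.length; x++) {
                long s = (long) a[w] + a[x];
                int[] prev = seen.get(s);
                if (prev == null) { seen.put(s, new int[]{w, x}); continue; }
                // same sum seen before: success if the index pairs are disjoint
                if (prev[0] != w && prev[0] != x && prev[1] != w && prev[1] != x)
                    return true;
            }
        }
        return false;
    }
}
```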
It is possible to do the same in O(n) space but O(n² log n) time with a slightly modified algorithm from this answer: Sum-subset with a fixed subset size:
Sort the list.
Use a priority queue for elements, containing a_w + a_x as a key and w, x as values. Pre-fill this queue with n-1 elements, where x = 0 and w = 1 .. n-1.
Repeatedly pop the minimal element (sum, w, x) from this queue and push the element (a_w + a_(x+1), w, x+1) onto the queue (but don't push when x+1 >= w). Stop when two consecutive elements removed from the queue have the same sum.
To handle duplicates, it is possible to compare w, x of two consecutive elements, having equal sum. But it's easier to use krjampani's idea of pre-processing. If sorted list contains two pairs of duplicates or a single element is duplicated 4 times, success. Otherwise no more than a single value is duplicated; leave only single instance of it in the list and add its doubled value into priority queue along with a "special" pair of indexes: (2a, -1, -1).
Evgeny's solution can be some what simplified by preprocessing the original array as follows.
We first use a hash table to count the frequency of each element in the original array. If at least 2 elements have duplicates (their frequency is at least 2) or if an element occurs with frequency at least 4 the answer is true. Otherwise, if an element a occurs with frequency 2 or 3, we add 2a to a second hash table, and replace all copies of a with a single copy in the original array.
Then in the modified array, for each pair of indices i, j with i < j, we add a_i + a_j to the second hash table and return true if we find a duplicate entry in this hash table.
If you have 8.5 GB of memory (more for unsigned ints, less if either the sums or indices don't span the whole int range), create three arrays. The first uses 1 bit per possible sum; it is a bitmap of results. The second uses 32 bits per possible sum; it records the index j. The third uses 1 bit per possible sum; it is a bitfield recording whether that sum has been encountered in the current iteration i (zero it at the start of each iteration). Iterate i=0...n and j=i+1...n. For each sum, see if it is set in the first bitmap (i.e. it was encountered before). If it is, see if the index recorded in the second array matches either i or j (i.e. the old j matches either the new i or the new j). If it does not, check whether the bit in the third array is set (i.e. it was set in the current iteration, so the old i matches the new i). If it is not set, you have a match! (The old i will never match the old j or the new j, and the new i will never match the new j.) Exit. Otherwise, record the sum in all three arrays and continue.
Although it uses $40 worth of memory (I love the present :), this is probably much faster than using hash maps and boxing, and it may even use less memory for large n. One downside is that the data will almost never be in the L2 cache, but try to set the JVM to use huge pages so that at least a TLB miss won't also go to main memory. It is O(n^2) for processing and O(1) for memory.

Algorithm for finding 2 items with given difference in an array

I am given an array of real numbers, A. It has n+1 elements.
It is known that there are at least 2 elements of the array, x and y, such that:
abs(x-y) <= (max(A)-min(A))/n
I need to create an algorithm for finding the 2 items (if there are more, any couple is good) in O(n) time.
I've been trying for a few hours and I'm stuck, any clues/hints?
woo I got it! The trick is in the Pigeonhole Principle.
Okay.. think of the numbers as being points on a line. Then min(A) and max(A) define the start and end points of the line respectively. Now divide that line into n equal intervals of length (max(A)-min(A))/n. Since there are n+1 points, two of them must fall into one of the intervals.
Note that we don't need to rely on the question telling us that there are two points that satisfy the criterion. There are always two points that satisfy it.
The algorithm itself: You can use a simplified form of bucket sort here, since you only need one item per bucket (hit two and you're done). First loop once through the array to get min(A) and max(A) and create an integer array buckets[n] initialized to some default value, say -1. Then go for a second pass:
for (int i = 0; i < len; i++) {
int bucket_num = find_bucket(array[i]);
if (buckets[bucket_num] == -1)
buckets[bucket_num] = i;
else
break;  // found pair at (i, buckets[bucket_num])
}
Where find_bucket(x) returns the rounded-down integer result of (x - min(A)) / ((max(A)-min(A))/n).
Let's re-word the problem: we're to find two elements, such that abs(x-y) <= c, where c is a constant, that we can find in O(n) time. (Indeed, we can compute both max(A) and min(A) in linear time and just assign c=(max-min)/n).
Let's imagine we have a set of buckets, so that elements with 0 <= x < c are placed in the first bucket, elements with c <= x < 2c in the second, and so on. For each element, we can determine its bucket in O(1) time. Note that the number of occupied buckets will be no more than the number of elements in the array.
Let's iterate the array and place each element into its bucket. If the bucket we're about to place it in already contains another element, then we've just found a proper pair x and y!
If we've iterated the whole array and every element has fallen into its own bucket, no worries! Now iterate the buckets (there are no more than n occupied buckets, as noted above); for each bucket's element x, if the element y in the next bucket satisfies abs(x-y) <= c, we've found the solution.
If we iterated all the buckets and found no proper elements, then there is no solution. OMG, I really missed that pigeonhole stuff (see the other answer).
Buckets may be implemented as a hash map, where each bucket holds one array index (placing element in bucket will look like this: buckets[ a[i] / c] = i). We compute c in O(n) time, assign items to buckets in O(n)*O(1) time (O(1) is access to hash map), traverse buckets in O(n) time. Therefore, the whole algorithm is linear.
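As a sketch, here is the hash-map bucket scheme with the neighbour check folded into the single pass (all names are mine):

```java
import java.util.*;

class HashBuckets {
    // Find indices i, j with |a[i] - a[j]| <= c, or null if no such pair exists.
    static int[] find(double[] a, double c) {
        Map<Long, Integer> buckets = new HashMap<>(); // bucket number -> array index
        for (int i = 0; i < a.length; i++) {
            long b = (long) Math.floor(a[i] / c);
            Integer same = buckets.get(b);
            if (same != null) return new int[]{same, i}; // same bucket: gap < c
            for (long nb : new long[]{b - 1, b + 1}) {   // check occupied neighbours
                Integer j = buckets.get(nb);
                if (j != null && Math.abs(a[i] - a[j]) <= c) return new int[]{j, i};
            }
            buckets.put(b, i);
        }
        return null;
    }
}
```

Checking the two neighbouring buckets as we go replaces the separate second pass over the buckets; the overall cost is still O(n) expected.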