Reservoir sampling: why is the element selected uniformly at random?

I understand how the algorithm works. However, I don't understand why it is correct. Assume we need to select only one element. Here's the proof that I've found:
At every step N, keep the next element in the stream with probability 1/N. This means that we have an (N-1)/N probability of keeping the element we are currently holding on to, which means that we keep it with probability (1/(N-1)) * ((N-1)/N) = 1/N.
I understand everything except for the last part. Why do we multiply the probabilities?

Because Pr[A AND B] = Pr[A] * Pr[B], assuming that A and B are independent (as they are here). The probability of choosing the element AND not replacing it later is the product of those two events' probabilities.
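To make that concrete, here is a minimal sketch of the single-element case in Python (the names are mine; it assumes the stream is any iterable):

    import random

    def sample_one(stream):
        # Each item seen at step n ends up as the final pick with probability
        # (1/n) * product over m in (n, N] of (m-1)/m = (1/n) * (n/N) = 1/N.
        pick = None
        for n, item in enumerate(stream, start=1):
            # Replace the current pick with probability 1/n.
            if random.randrange(n) == 0:
                pick = item
        return pick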

Related

Algorithmic Analysis and interpretation of partitions

So I've been working on this problem and have been stuck for a long time.
I don't feel like I've made any progress, so any help would be great.
For i., I thought it might be the probability of being on the left of the partition multiplied by the probability of being on the right of the partition, so something like (q/n) * ((n-q)/n). However, if I were to do this, I would get the exact same thing for iii., which doesn't seem correct.
Am I going about this correctly?
I am also unsure how to find the expected number of elements for the other parts. What does that even mean? What equation would I set up to solve it?
How are we supposed to know the expected value if it could really be anything?
I know that the qth position is sorted, so how would I use that to solve it?
First, assume each permutation is equally likely, and that all the elements in the array are different.
Any particular element that's originally to the left of q is equally likely to end up in any of the remaining places in the array. That means it has probability 1 - q/n = (n - q)/n of moving to the right. Similarly, an element originally to the right of q has probability (q - 1)/n of moving to the left. Because of the way the partition has been implemented, an element never moves from the left to a different place on the left, or from the right to a different place on the right.
Expectation is linear, and the expected number of moves of a single element is simply its probability of moving. Therefore the expected number of moves from right to left is the probability of moving times the number of elements originally on the right, or (n - q)(q - 1)/n. Similarly, the expected number of moves from left to right is (q - 1)(n - q)/n. So given a particular q, the expectation is the sum of these two, which is 2(n - q)(q - 1)/n.
Now, to get the overall expected number of moves excluding the movement of the partition element, we have to sum up over q, each q being equally likely:
(1/n) * Σ_{q=1}^{n} 2(n - q)(q - 1)/n.
Simplifying, this equals (n - 1)(n - 2)/(3n) = (n^2 - 3n + 2)/(3n).
Finally, we have to add the expected number of moves of the partition element, which is (n - 1)/n (since it moves except when it happens to be the smallest element in the array).
Simplifying again, this gives (n^2 - 3n + 2)/(3n) + (n - 1)/n = (n^2 - 1)/(3n).
(Assuming I didn't make an error in calculation along the way!)
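A quick numeric check of that closed form (a sketch; the names are my own choosing):

    def expected_moves(n):
        # Direct sum over pivot positions q = 1..n, plus the pivot's own move.
        crossings = sum(2 * (n - q) * (q - 1) / n for q in range(1, n + 1)) / n
        return crossings + (n - 1) / n

    for n in [2, 3, 10, 100]:
        closed_form = (n * n - 1) / (3 * n)
        assert abs(expected_moves(n) - closed_form) < 1e-9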
Like @Anonymous, I'll assume the elements are all different and the permutations are equally likely.
Let's address the first question, and let's solve the inverse problem first. That is:
i. What is the probability that no element moved from the left of q to the right?
Before finding this out, I'd like to point out that if no element moved from the left to the right, then no element moved from the right to the left either. This implies that the answers to questions i. and iii. are the same, and so are the answers to questions ii. and iv. In fact, this is easy to establish by symmetry.
So, let's calculate that probability. Let's split the possible cases by the value of q and sum up. For any given q (q starts from 1), you have q-1 numbers that are smaller than the pivot. You also have q-1 slots on the left and n-q slots on the right. The total number of possible permutations of the n-1 numbers (excluding the pivot, which is at position q) is (n-1)!. The number of cases where the q-1 smaller numbers are on the left and the n-q larger numbers are on the right is (q-1)!(n-q)!. The probability of this happening is therefore:
(q-1)! (n-q)! / (n-1)!
To get the total probability, we sum over all possible cases, each multiplied by 1/n. (Note: what we calculated above is a conditional probability, conditioned on the pivot's final position q. The set of possibilities can be partitioned by the value of q into disjoint subsets whose union covers the whole set, so the total probability is the sum, over subsets, of the probability within each subset multiplied by the probability of that subset. The probability of the pivot ending up at location q is 1/n.)
p = (1/n) * Σ_{q=1}^{n} (q-1)! (n-q)! / (n-1)!
This was the chance that no element moves from the left to the right. The chance of having at least one element move is thus 1-p. This could also be rewritten as:
p_1 = (1/n) * Σ_{q=1}^{n} (1 - (q-1)! (n-q)! / (n-1)!)
where p_1 is the probability for question i. Note that the term inside the sum is indeed the chance, for a given q, that at least one element moved from left to right. Let's call this p_{1q}: the probability for question i. given a fixed q.
To answer the second question, note that the expected value is the sum, over q, of the expected value given that q, multiplied by the probability of that q. The probability of q taking any particular value is 1/n, as stated before. Now, what is the expected number of elements on the left that need to move to the right, given some q? This is itself a sum of the possible counts multiplied by their probabilities.
Let's say that, given a q, there are k values on the left that need to move to the right. In other words, there are k values larger than the pivot in the q-1 positions on the left, and hence q-1-k values smaller than the pivot. What are the chances of that? You need to calculate this again with combinatorics, counting the arrangements in which exactly k of the q-1 positions on the left hold values larger than the pivot. Unfortunately, this answer is taking longer than I expected and I really need to get back to work (plus, typing equations without LaTeX is painful). I hope this helps you continue solving the problem yourself.
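As a small aid, here is a sketch (names mine) that evaluates p and p_1 exactly for modest n, using rational arithmetic:

    from fractions import Fraction
    from math import factorial

    def p_none_cross(n):
        # Probability that no element crosses the pivot, averaged over q = 1..n.
        return sum(Fraction(factorial(q - 1) * factorial(n - q), factorial(n - 1))
                   for q in range(1, n + 1)) / n

    for n in range(2, 7):
        p = p_none_cross(n)
        print(n, p, 1 - p)   # 1 - p is the answer to question i.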

stack of piles visible from top view

This is an interview question.
Here the notes are arranged as depicted in the image.
We are given the starting and ending point of each note,
e.g. [2, 5], [3, 9], [7, 100] on a scale from 0 to 10^9.
In this example all three notes will be visible.
We need to find out how many notes are visible when the stack is viewed from the top.
I tried to solve it in O(n^2), where n is the number of notes, by checking every note's visibility against every other note, but with this approach I could not determine whether two notes are in different stacks.
I ultimately did not reach a solution.
An O(n) solution is preferred, as that is what the interviewer demanded.
If the order of the notes in the input is "the former is on top", then it's easy:
Keep track of min_x and max_x, initialized to the first note's x values. Iterate over the notes: each note that has an x value either greater than max_x or less than min_x updates the respective bound to its own x value and is considered visible; otherwise it is not. Finish iterating and return the list of visible notes. Collect the cash.
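A sketch of that sweep (names mine; it assumes the notes arrive topmost first as (left, right) pairs, and, as this answer implicitly does, that a lower note is visible only when it sticks out past the bounding range covered so far):

    def visible_notes(notes):
        # notes: list of (left, right) pairs, topmost note first.
        min_x, max_x = notes[0]
        visible = [notes[0]]
        for left, right in notes[1:]:
            sticks_out = False
            if left < min_x:
                min_x, sticks_out = left, True
            if right > max_x:
                max_x, sticks_out = right, True
            if sticks_out:
                visible.append((left, right))
        return visible

    print(len(visible_notes([(2, 5), (3, 9), (7, 100)])))   # 3, as in the example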
If O(n log n) is sufficient: first, remap all numbers in the input to between 0..(2*n+1) (that is, if a number x_i is the j-th smallest number among all numbers in the input, replace all x_i with j). You can then use Painter's algorithm on a segment tree.
Details:
Consider an array of size (2 * n + 1). Initialize all these cells with -1.
Painter's algorithm: iterate over the bank notes from the last one given (at the bottom) to the topmost one. For each note covering a_i to b_i, set the value of every cell whose index is between a_i and b_i to i. At the end, the set of distinct values remaining in the array identifies exactly the visible notes. However, done naively this takes O(N^2).
Segment tree: so, instead of a plain array, we use a segment tree with range assignment. Each of the updates above can then be done in O(log N).
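Here is a sketch of the naive array version with coordinate compression (names mine); giving each gap between consecutive endpoints its own cell, via the even/odd trick below, is one way to realize the 0..(2*n+1) remapping this answer mentions. A segment tree would replace the inner loop with an O(log n) range assignment:

    def visible_count(notes):
        # notes: list of (a, b) pairs, topmost note first.
        # Even cells are compressed endpoints; odd cells are the gaps between them.
        xs = sorted({x for note in notes for x in note})
        rank = {x: 2 * j for j, x in enumerate(xs)}
        cells = [-1] * (2 * len(xs) - 1)
        # Paint from the bottom note up, so higher notes overwrite lower ones.
        for i in range(len(notes) - 1, -1, -1):
            a, b = notes[i]
            for c in range(rank[a], rank[b] + 1):
                cells[c] = i
        return len({v for v in cells if v != -1})

    print(visible_count([(2, 5), (3, 9), (7, 100)]))   # 3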

Randomly sample a data set

I came across a question that was asked in one of the interviews:
Q - Imagine you are given a really large stream of data elements (queries on google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?
I am looking for:
What does random sampling of a data set mean?
(I mean, I could simply do a coin toss and select a string from the input if the outcome is 1, and do this until I have 1,000 samples.)
What are the things I need to consider while doing so? For example, is it better to take contiguous strings rather than non-contiguous ones? To rephrase: is it better if I pick 1,000 contiguous strings at random, or pick one string at a time, as with a coin toss?
This may be a vague question. I tried to google "randomly sample data set" but did not find any relevant results.
A binary sample/don't-sample decision may not be the right answer: suppose you want to sample 1,000 strings and you do it via coin toss. Then after visiting roughly 2,000 strings you would be done. What about the rest of the strings?
I read this post - http://gregable.com/2007/10/reservoir-sampling.html - which answers this question quite clearly.
Let me put the summary here:
SIMPLE SOLUTION
Assign a random number to every element as you see them in the stream, and then always keep the top 1,000 numbered elements at all times.
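A sketch of this simple solution using a bounded min-heap (names mine): tag every element with a uniform random number and keep only the k largest tags seen so far.

    import heapq
    import random

    def sample_by_random_tags(stream, k=1000):
        heap = []   # min-heap of (tag, element); heap[0] holds the smallest tag
        for item in stream:
            tag = random.random()
            if len(heap) < k:
                heapq.heappush(heap, (tag, item))
            elif tag > heap[0][0]:
                heapq.heapreplace(heap, (tag, item))   # evict the smallest tag
        return [item for tag, item in heap]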
RESERVOIR SAMPLING
Make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream.
Start with i = 1,001. With what probability after the 1,001st step should element 1,001 (or any element, for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001, take element 1,001.
If you choose to add it, replace an element of the reservoir (say element #2) chosen uniformly at random. Element #2 is definitely in the reservoir at step 1,000, and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 being chosen as the replacement candidate: 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 - 1/1,001 = 1,000/1,001.
This extends to the ith round: keep the ith element with probability 1,000/i, and if you choose to keep it, replace a uniformly chosen element of the reservoir. The probability of any earlier element being in the reservoir before this step is 1,000/(i-1). The probability that it is removed at step i is (1,000/i) * (1/1,000) = 1/i, so the probability that it sticks around, given that it was already in the reservoir, is (i-1)/i. Its overall probability of being in the reservoir after i rounds is therefore 1,000/(i-1) * (i-1)/i = 1,000/i.
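A compact sketch of that scheme for a general reservoir size k (names mine; the stream can be any Python iterable):

    import random

    def reservoir_sample(stream, k=1000):
        reservoir = []
        for i, item in enumerate(stream, start=1):
            if i <= k:
                reservoir.append(item)                 # fill the first k slots
            elif random.random() < k / i:              # keep item i with probability k/i
                reservoir[random.randrange(k)] = item  # evict a uniform victim
        return reservoir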
I think you have used the word "infinite" a bit loosely. The very premise of sampling is that every element has an equal chance to be in the sample, and that is only possible if you at least go through every element. So I would translate "infinite" to mean a number large enough that you need a single-pass solution rather than multiple passes.
Reservoir sampling is the way to go. The analysis from @abipc is in the right direction but is not completely correct.
It is easier if we are first clear on what we want. Imagine you have N elements (N unknown) and you need to pick 1,000 of them. This means we need to devise a sampling scheme where the probability of any element being in the sample is exactly 1000/N, so that each element has the same probability of being in the sample (no preference to any element based on its position in the original list). The scheme mentioned by @abipc works fine; the probability calculation goes like this:
After the first step you have 1001 elements, so we need to keep each element with probability 1000/1001. We pick the 1001st element with exactly that probability, so that is fine. Now we also need to show that every other element has the same probability of being in the sample.
p(any other element remains in the sample)
  = 1 - p(that element is removed from the sample)
  = 1 - p(1001st element is selected) * p(that element is picked to be removed)
  = 1 - (1000/1001) * (1/1000) = 1000/1001
Great, so now we have proven that every element has probability 1000/1001 of being in the sample. This precise argument can be extended to the ith step using induction.
As far as I know, this class of algorithms is called reservoir sampling algorithms. I know one of them from data mining, but not its name:
1. Collect the first S elements into your storage, whose maximum size is S.
2. Suppose the next element of the stream has number N. With probability S/N, keep the new element; otherwise discard it.
3. If you kept element N, replace one of the elements in the sample of size S, picked uniformly at random.
4. Set N = N + 1, fetch the next element, and go to step 2.
It can be proved that at any step of such stream processing, your storage of size S contains each element seen so far with equal probability S/N_you_have_seen.
So, for example, S = 10 and N_you_have_seen = 10^6. S is a fixed, finite number; N_you_have_seen can grow without bound.

Revisit: 2D Array Sorted Along X and Y Axis

So, this is a common interview question. There's already a topic on it, which I have read, but it's dead and no answer was ever accepted. On top of that, my interest lies in a slightly more constrained form of the question, with a couple of practical applications.
Given a two dimensional array such that:
Elements are unique.
Elements are sorted along the x-axis and the y-axis.
Neither sort predominates, so neither sort is a secondary sorting parameter.
As a result, the diagonal is also sorted.
All of the sorts can be thought of as moving in the same direction. That is to say that they are all ascending, or that they are all descending.
Technically, I think as long as you have a >/=/< comparator, any total ordering should work.
Elements are numeric types, with a single-cycle comparator.
Thus, memory operations are the dominating factor in a big-O analysis.
How do you find an element? Only worst case analysis matters.
Solutions I am aware of:
A variety of approaches that are:
O(n log n), where you approach each row separately.
O(n log n) with strong best and average performance.
One that is O(n+m):
Start in a non-extreme corner, which we will assume is the bottom right.
Let the target be J, and let M be the value at the current position.
If M is greater than J, move left.
If M is less than J, move up.
If you can do neither, you are done, and J is not present.
If M is equal to J, you are done.
Originally found elsewhere, most recently stolen from here.
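Here is a sketch of that walk in Python (names mine; with rows stored top to bottom and both sorts ascending, the equivalent non-extreme start corner is the top right):

    def saddleback_find(matrix, target):
        # O(rows + cols): every step discards one whole row or column.
        if not matrix or not matrix[0]:
            return None
        i, j = 0, len(matrix[0]) - 1          # top-right corner
        while i < len(matrix) and j >= 0:
            if matrix[i][j] > target:
                j -= 1                        # the rest of column j is larger still
            elif matrix[i][j] < target:
                i += 1                        # the rest of row i is smaller still
            else:
                return (i, j)
        return None                           # ran off the edge: target absent

    print(saddleback_find([[1, 2, 4], [2, 3, 5], [4, 6, 8]], 6))   # (2, 1)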
And I believe I've seen one with a worst case of O(n+m) but a best case of nearly O(log n).
What I am curious about:
Right now, I have proved to my own satisfaction that the naive partitioning attack always devolves to O(n log n). Partitioning attacks in general appear to have an optimal worst case of O(n+m), and most do not terminate early when the element is absent. I was also wondering, as a result, whether an interpolation probe might be better than a binary probe, and it occurred to me that one might think of this as a set-intersection problem with a weak interaction between the sets. My mind cast immediately towards Baeza-Yates intersection, but I haven't had time to draft an adaptation of that approach. However, given my suspicion that the optimality of an O(n+m) worst case is provable, I thought I'd go ahead and ask here, to see if anyone could bash together a counter-argument or pull together a recurrence relation for interpolation search.
Here's a proof that it has to be Omega(min(n,m)). Let n >= m. Then consider the matrix which has all 0s at positions (i,j) where i+j < m, all 2s where i+j >= m, except for a single (i,j) with i+j = m which has a 1. This is a valid input matrix, and there are m possible placements for the 1. No query into the array (other than the actual location of the 1) can distinguish among those m possible placements. So you'll have to check all m locations in the worst case, and at least m/2 expected locations for any randomized algorithm.
One of your assumptions was that matrix elements have to be unique, and I didn't do that. It is easy to fix, however, because you just pick a big number X=n*m, replace all 0s with unique numbers less than X, all 2s with unique numbers greater than X, and 1 with X.
And because it is also Omega(lg n) (counting argument), it is Omega(m + lg n) where n>=m.
An optimal O(m+n) solution is to start at the top-left corner, which has the minimal value. Move diagonally down and to the right until you hit an element whose value >= the value of the given element. If that element's value equals the given element's value, return found as true.
Otherwise, from here we can proceed in two ways.
Strategy 1:
Move up in the column and search for the given element until we reach the end. If found, return found as true
Move left in the row and search for the given element until we reach the end. If found, return found as true
return found as false
Strategy 2:
Let i denote the row index and j denote the column index of the diagonal element we have stopped at. (Here, we have i = j, BTW). Let k = 1.
Repeat the steps below while i-k >= 0:
Search if a[i-k][j] is equal to the given element. if yes, return found as true.
Search if a[i][j-k] is equal to the given element. if yes, return found as true.
Increment k
For example, consider the matrix:
1 2 4 5 6
2 3 5 7 8
4 6 8 9 10
5 8 9 10 11

Efficiently selecting a set of random elements from a linked list

Say I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.
How can I most efficiently write a function that will return k completely random numbers from the list?
There's a very nice and efficient algorithm for this using a method called reservoir sampling.
Let me start by giving you its history:
Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.
McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.
Vitter 1985 (2) reviews Algorithm R and then presents three additional algorithms which provide the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithm predetermines the number of incoming elements to skip. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each incoming number.
In pseudocode the algorithm is:
Let R be the result array of size s
Let I be an input queue

> Fill the reservoir array
for j in the range [1, s]:
    R[j] = I.pop()

elements_seen = s
while I is not empty:
    elements_seen += 1
    j = random(1, elements_seen)   > This is inclusive on both ends
    if j <= s:
        R[j] = I.pop()
    else:
        I.pop()   > Discard the element
Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.
Why does this work?
McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be a bit difficult to reconstruct it here. Therefore, I've generated an alternative proof which is easier to explain.
We proceed via proof by induction.
Say we want to generate a set of s elements and that we have already seen n>s elements.
Let's assume that our current s elements have already each been chosen with probability s/n.
By the definition of the algorithm, we choose element n+1 with probability s/(n+1).
Each element already part of our result set has a probability 1/s of being replaced.
The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).
Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced (probability (s/n)*n/(n+1) = s/(n+1)), or if the element was newly chosen (probability s/(n+1)).
The definition of the algorithm tells us that the first s elements are automatically included as the first n=s members of the result set. Therefore, the n-seen result set includes each element with s/n (=1) probability giving us the necessary base case for the induction.
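A quick empirical check of this claim (a sketch, with my own names): run the algorithm many times over a small stream and confirm that every element lands in the reservoir with frequency close to s/n.

    import random
    from collections import Counter

    def algorithm_r(stream, s):
        r = []
        for n, x in enumerate(stream, start=1):
            if n <= s:
                r.append(x)
            else:
                j = random.randint(1, n)   # inclusive, as in the pseudocode
                if j <= s:
                    r[j - 1] = x
        return r

    counts, trials = Counter(), 100000
    for _ in range(trials):
        counts.update(algorithm_r(range(20), 5))
    for x in range(20):
        print(x, counts[x] / trials)   # each should be near 5/20 = 0.25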
References
McLeod, A. Ian, and David R. Bellhouse. "A convenient algorithm for drawing a simple random sample." Journal of the Royal Statistical Society, Series C (Applied Statistics) 32.2 (1983): 182-184.
Vitter, Jeffrey S. "Random sampling with a reservoir." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.
This is called a Reservoir Sampling problem. The simple solution is to assign a random number to each element of the list as you see it, then keep the top (or bottom) k elements as ordered by the random number.
I would suggest: First find your k random numbers. Sort them. Then traverse both the linked list and your random numbers once.
If you somehow don't know the length of your linked list (how?), then you could grab the first k into an array, and then, for node r, generate a random number in [0, r); if that number is less than k, use it as the index of the array item to replace. (Not entirely convinced that doesn't bias...)
Other than that: "If I were you, I wouldn't be starting from here." Are you sure a linked list is right for your problem? Is there not a better data structure, such as a good old flat array list?
If you don't know the length of the list, then you will have to traverse it completely to ensure random picks. The method I've used in this case is the one described by Tom Hawtin. While traversing the list you keep k elements that form your random selection to that point. (Initially you just add the first k elements you encounter.) Then, with probability k/i, you replace a random element from your selection with the ith element of the list (i.e., the element you are at, at that moment).
It's easy to show that this gives a random selection. After seeing m elements (m > k), each of the first m elements of the list is part of your random selection with probability k/m. That this holds initially is trivial. Then, for element m+1, you put it in your selection (replacing a random element) with probability k/(m+1). You now need to show that all other elements also have probability k/(m+1) of being selected. We have that the probability is k/m * (k/(m+1) * (1 - 1/k) + (1 - k/(m+1))) (i.e., the probability that the element was already in the selection, times the probability that it is still there). The inner factor simplifies to (k-1)/(m+1) + (m+1-k)/(m+1) = m/(m+1), so the whole expression is k/m * m/(m+1) = k/(m+1), as required.
Well, you do need to know what N is at runtime, at least, even if this involves doing an extra pass over the list to count the elements. The simplest algorithm is then to pick a random number in [0, N) and remove that item, repeated k times. Or, if it is permissible to return repeated numbers, don't remove the item.
Unless you have a VERY large N and very stringent performance requirements, this algorithm runs with O(N*k) complexity, which should be acceptable.
Edit: Never mind, Tom Hawtin's method is way better. Select the random numbers first, then traverse the list once. Same theoretical complexity, I think, but much better expected runtime.
Why can't you just do something like this?

    List<T> GetKRandomFromList<T>(List<T> input, int k)
    {
        var rand = new Random();
        var ret = new List<T>();
        for (int i = 0; i < k; i++)
            ret.Add(input[rand.Next(input.Count)]);   // note: samples with replacement
        return ret;
    }
I'm sure you don't mean something that simple, so can you specify further?
