Randomly sample a data set

I came across a question that was asked in an interview.
Q: Imagine you are given a really large stream of data elements (queries on Google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?
I am looking for answers to the following:
What does random sampling of a data set mean?
(I mean, I could simply toss a coin for each input string, select the string if the outcome is 1, and repeat until I have 1,000 samples.)
What do I need to consider while doing so? For example, is it better to pick 1,000 contiguous strings at random, or to pick one string at a time, as with the coin toss?
This may be a vague question. I tried to google "randomly sample data set" but did not find any relevant results.

A binary sample/don't sample decision may not be the right answer. Suppose you want to sample 1,000 strings and you do it via a fair coin toss. You would be done after visiting approximately 2,000 strings. What about the rest of the strings?
I read this post: http://gregable.com/2007/10/reservoir-sampling.html
It answers this question quite clearly.
Let me put the summary here:
SIMPLE SOLUTION
Assign a random number to every element as you see them in the stream, and then always keep the top 1,000 numbered elements at all times.
RESERVOIR SAMPLING
Make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream.
Start with i = 1,001. With what probability should element 1,001 (or any element, for that matter) be in the set of 1,000 elements after the 1,001st step? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001 you should take element 1,001.
If you choose to add it, then it replaces an element of the reservoir chosen at random (say element #2). Element #2 is definitely in the reservoir at step 1,000, and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 getting randomly chosen as the replacement candidate. That probability is 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 - 1/1,001 = 1,000/1,001.
This can be extended to the i'th round: keep the i'th element with probability 1,000/i and, if you choose to keep it, replace a random element of the reservoir with it. The probability of any earlier element being in the reservoir before this step is 1,000/(i-1). The probability that it is removed is 1,000/i * 1/1,000 = 1/i. The probability that an element sticks around, given that it is already in the reservoir, is (i-1)/i, and thus each element's overall probability of being in the reservoir after i rounds is 1,000/(i-1) * (i-1)/i = 1,000/i.
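The rounds described above can be sketched in Python roughly as follows. This is a minimal illustration, not the linked post's code: the function name reservoir_sample, the parameter k (1,000 in the question) and the optional rng argument are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return k items from stream, each kept with probability k/n overall."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k elements fill the reservoir unconditionally.
            reservoir.append(item)
        elif rng.random() < k / i:
            # Keep the i-th element with probability k/i and let it
            # overwrite a uniformly chosen slot of the reservoir.
            reservoir[rng.randrange(k)] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), 1000)
```

The stream is consumed exactly once and only k items are held in memory, so the length of the stream never needs to be known in advance.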

I think you have used the word infinite a bit loosely. The very premise of sampling is that every element has an equal chance to be in the sample, and that is only possible if you go through every element at least once. So I would translate infinite to mean a large number, indicating that you need a single-pass solution rather than multiple passes.
Reservoir sampling is the way to go. The analysis from @abipc is in the right direction but is not completely correct.
It is easier if we are first clear on what we want. Imagine you have N elements (N unknown) and you need to pick 1000 of them. This means we need to devise a sampling scheme where the probability of any element being in the sample is exactly 1000/N, so each element has the same probability of being in the sample (no preference for any element based on its position in the original list). The scheme mentioned by @abipc works fine; the probability calculation goes like this:
After the first step you have seen 1001 elements, so each element must end up in the sample with probability 1000/1001. We pick the 1001st element with exactly that probability, so that is fine. Now we also need to show that every other element has the same probability of being in the sample.
p(any other element remaining in the sample)
= 1 - p(that element is removed from the sample)
= 1 - p(1001st element is selected) * p(the element is picked to be removed)
= 1 - (1000/1001) * (1/1000)
= 1000/1001
Great, so now we have proven that every element has probability 1000/1001 of being in the sample. This precise argument can be extended to the ith step using induction.

As far as I know, this class of algorithms is called reservoir sampling.
I know one of them from data mining, but I don't know its name:
1. Collect the first S elements in your storage, whose maximum size is S.
2. Suppose the next element of the stream has number N.
3. With probability S/N, catch the new element; otherwise discard it.
4. If you caught element N, use it to replace one of the elements in the sample, picked uniformly at random.
5. Set N = N + 1, get the next element, and go to step 2.
It can be proved that at any step of such stream processing, your storage of size S contains each element seen so far with equal probability S/N_you_have_seen.
So, for example, S = 10 and N_you_have_seen = 10^6.
S is a finite number;
N_you_have_seen can grow without bound.

Related

How to make a uniform random distribution but where result is revealed in steps?

For example, let's say there is an array of items, each equally likely to be chosen, and the output of this random function tells which item is chosen. I want the function to be split into multiple steps so that at each step the list of potential items is narrowed down, giving better insight into the result probabilities.
Here's a step by step example of how it might work:
Step 1: Every item is 1/1000 chance.
Step 2: Random subset of half the original set is removed, so each remaining item is 1/500 now.
Step 3: Repeat step 2 until narrowed down to a single item.
The requirements I'd like for the algorithm are better than O(n) time complexity and a distribution that is still uniformly random at each step.
Initially I thought of an algorithm which:
1. Starts with variables min and max describing the current range of values left.
2. Shrinks the range by generating a random float in [-1, 1], which is applied to the range to shrink it proportionally. If the random number is negative, lower the max; otherwise raise the min. So 50% of the time it shifts the min up and 50% of the time it shifts the max down, and the range shrinks by a factor in [0, 1].
3. Repeats step 2 until the range converges on a single number.
But I noticed this doesn't have a uniform distribution; instead, the chosen result is more often close to the starting min and max values. To fix this, I think one could add a preliminary step where the starting range is offset by another random value. But this would only make the starting distribution uniformly random, and it still doesn't meet my requirement of being uniformly random at every step.
The naive solution is to generate random numbers and remove the corresponding items from the list at each step, but that is an O(n) solution, so I hope there is something better.

You just have to apply Bayes' Theorem.
If you randomly remove a portion p of the remaining possibilities, the remaining items have their probabilities multiplied by 1/(1-p). So in your step 2, the probabilities change by an amount corresponding to how much the range changed. And not by a fixed factor.

This problem has some very simple answers, so maybe that is why people seemed confused.
One solution is to generate a random number in [0, n], where n is the number of items in the current set, and instead of removing just that item, you remove a range of items around that point.
Solution two is a bit more complicated but has the property of preserving set order and location, such that the resulting set is just a spliced section of the original set, whereas the first solution's resulting set could be made up of multiple sections of the original set. The method here is as described initially in my post, but you also apply the random offset during each turn, not just once at the beginning.

algorithm for random extraction of elements

I have some trouble understanding why the following procedure for random sampling from a set of objects works correctly:
Suppose I have a population of 100 males and I want to extract 30 of them. One proposed procedure is the following:
Assign to the first element of the list a probability of 30/100 and generate a random number n between 0 and 1. If n < 30/100, the element gets selected; otherwise it does not.
If it gets selected, move to the second element and assign it a probability of 29/99; otherwise move to the second element without selecting the first one and assign it a probability of 30/99.
Eventually, proceeding in this way, we should reach the desired result of 30 random elements extracted from 100, but I do not understand conceptually why this leads to the correct solution.

I will give a hand-waving explanation and leave out the mathematics for brevity.
There are two things involved:
The probability of an element getting selected
The oracle that selects an element
Since each element must have an equal probability of being selected, after each decision we adjust the probability according to the number of elements left and the number of samples still to be chosen.
Initially, the probability of each element being selected is 30/100. The oracle tosses a biased coin (a random number between 0 and 1). There are two situations now:
We select the first element if the oracle's random number is < 30/100. Now we have 99 elements left and we need to choose 29 from them. Since each element has an equal probability of selection, we adjust the probability of each element to 29/99 and proceed with the oracle again.
If the oracle does not select the current element, we are left with 99 elements and we still need to choose 30 from them. Since each element has an equal probability of selection, we adjust the probability of each element to 30/99 and proceed with the oracle again.
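As a quick check of the adjustment, the second element ends up selected with probability (30/100)(29/99) + (70/100)(30/99) = 2970/9900 = 30/100, the same as the first. The procedure is usually known as selection sampling, and it could be sketched in Python as below. The name selection_sample and the rng parameter are assumptions made for this illustration.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def selection_sample(population: Sequence[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Pick exactly k items in one pass, adjusting the odds after each item."""
    rng = rng or random.Random()
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and len(population)")
    chosen: List[T] = []
    needed = k  # samples still to be chosen
    for seen, item in enumerate(population):
        remaining = n - seen  # elements not yet decided, including this one
        # 30/100 for the first element, then 29/99 or 30/99, and so on.
        if rng.random() < needed / remaining:
            chosen.append(item)
            needed -= 1
            if needed == 0:
                break
    return chosen


picked = selection_sample(list(range(100)), 30)
```

Once the number still needed equals the number of elements remaining, the ratio becomes 1 and every remaining element is taken, which is why the result always has exactly k elements.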

Randomly choosing from a list with weighted probabilities

I have an array of N elements (representing the N letters of a given alphabet), and each cell of the array holds an integer value: the number of occurrences of that letter in a given text. Now I want to randomly choose a letter from the alphabet, based on its number of appearances, with the following constraints:
If the letter has a positive (nonzero) value, then it can be always chosen by the algorithm (with a bigger or smaller probability, of course).
If a letter A has a higher value than a letter B, then it has to be more likely to be chosen by the algorithm.
Now, taking that into account, I've come up with a simple algorithm that might do the job, but I was wondering if there is something better. This seems to be quite fundamental, and I think there might be cleverer ways to accomplish it more efficiently. This is the algorithm I thought of:
Add up all the frequencies in the array. Store the result in SUM.
Choose a random value from 0 to SUM. Store it in RAN.
While RAN > 0, visit each cell of the array in order, starting from the first, and subtract the value of that cell from RAN.
The last visited cell is the chosen one.
So, is there a better thing to do than this? Am I missing something?
I'm aware most modern computers can compute this so fast I won't even notice if my algorithm is inefficient, so this is more of a theoretical question rather than a practical one.
I prefer an explained algorithm rather than just code for an answer, but If you're more comfortable providing your answer in code, I have no problem with that.

The idea:
Iterate through all the elements and set the value of each element as the cumulative frequency thus far.
Generate a random number between 1 and the sum of all frequencies
Do a binary search on the values for this number (finding the first value greater than or equal to the number).
Example:
Element      A   B   C   D
Frequency    1   4   3   2
Cumulative   1   5   8   10
Generate a random number in the range 1-10 (1+4+3+2 = 10, the same as the last value in the cumulative list), do a binary search, which will return values as follows:
Number   Element returned
1        A
2        B
3        B
4        B
5        B
6        C
7        C
8        C
9        D
10       D
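The cumulative-sum idea above maps fairly directly onto Python's standard library. The following is a sketch under that assumption; the helper names build_cumulative, lookup and weighted_choice are made up for the example.

```python
import bisect
import itertools
import random
from typing import List, Optional, Sequence


def build_cumulative(frequencies: Sequence[int]) -> List[int]:
    """Running totals, e.g. [1, 4, 3, 2] -> [1, 5, 8, 10]."""
    return list(itertools.accumulate(frequencies))


def lookup(cumulative: Sequence[int], number: int) -> int:
    """Index of the first cumulative value >= number (binary search)."""
    return bisect.bisect_left(cumulative, number)


def weighted_choice(cumulative: Sequence[int],
                    rng: Optional[random.Random] = None) -> int:
    """Draw a number in 1..total and map it to an element index."""
    rng = rng or random.Random()
    return lookup(cumulative, rng.randint(1, cumulative[-1]))


elements = "ABCD"
cumulative = build_cumulative([1, 4, 3, 2])
letter = elements[weighted_choice(cumulative)]
```

Building the cumulative list is O(n) and is done once; each draw afterwards costs O(log n). An element with frequency 0 is never returned, because the search stops at the first cumulative value that reaches the drawn number.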

The Alias Method has amortized O(1) time per value generated, but requires two uniforms per lookup. Basically, you create a table where each column contains one of the values to be generated, a second value called an alias, and a conditional probability of choosing between the value and its alias. Use your first uniform to pick any of the columns with equal likelihood. Then choose between the primary value and the alias based on your second uniform. It takes a O(n log n) work to initially set up a valid table for n values, but after the table's built generating values is constant time. You can download this Ruby gem to see an actual implementation.
Two other very fast methods by Marsaglia et al. are described here. They have provided C implementations.
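For illustration, here is a minimal Python sketch of an alias table. Note that it uses the linear-time table construction usually attributed to Vose rather than an O(n log n) setup, so it is a substitute for, not a copy of, the implementations referred to above. The function names build_alias and alias_draw are assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def build_alias(weights: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Build the per-column keep-probabilities and aliases."""
    n = len(weights)
    total = sum(weights)
    # Scale so that an "average" column has probability exactly 1.
    prob = [w * n / total for w in weights]
    alias = list(range(n))
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                   # column s is topped up by l
        prob[l] -= 1.0 - prob[s]       # l donated that much mass
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:            # leftovers are full columns
        prob[i] = 1.0                  # (also absorbs rounding error)
    return prob, alias


def alias_draw(prob: Sequence[float], alias: Sequence[int],
               rng: Optional[random.Random] = None) -> int:
    """Two uniforms per draw: one picks a column, one picks value or alias."""
    rng = rng or random.Random()
    column = rng.randrange(len(prob))
    return column if rng.random() < prob[column] else alias[column]


prob, alias = build_alias([1, 4, 3, 2])
index = alias_draw(prob, alias)
```

After the table is built, each draw does a fixed amount of work regardless of how many values there are.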

Limited Sort/Filter Algorithm

I have a rather large list of elements (100s of thousands).
I have a filter that can either accept or not accept elements.
I want the top 100 elements that satisfy the filter.
So far, I have sorted the results first and then taken the top 100 that satisfy the filter. The rationale behind this is that the filter is not entirely fast.
But right now, the sorting step is taking way longer than the filtering step, so I would like to combine them in some way.
Is there an algorithm to combine the concerns of sorting/filtering to get the top 100 results satisfying the filter without incurring the cost of sorting all of the elements?

My instinct is to select the top 100 elements from the list (much cheaper than a sort, use your favorite variant of QuickSelect). Run those through the filter, yielding n successes and 100-n failures. If n < 100 then repeat by selecting 100-n elements from the top of the remainder of the list:
k = 100
while (k > 0):
    select top k from list and remove them
    filter them, yielding n successes
    k = k - n
All being well this runs in time proportional to the length of the list, since each selection step runs in that time, and the number of selection steps required depends on the success rate of the filter, but not directly on the size of the list.
I expect this has some bad cases, though. If almost all elements fail the filter then it's considerably slower than just sorting everything, since you'll end up selecting thousands of times. So you might want some criteria to bail out if it's looking bad, and fall back to sorting the whole list.
It also has the problem that it will likely do a largeish number of small selects towards the end, since we expect k to decay exponentially if the filter criteria are unrelated to the sort criteria. So you could probably improve it by selecting somewhat more than k elements at each step: say, k divided by the expected success rate of the filter, plus a small constant. The expected rate can be based on past performance if there's no domain knowledge you can use to predict it, and the small constant can be chosen experimentally to avoid an annoyingly large number of steps to find the last few elements. If you end up at any step with more items that have passed the filter than the number you're still looking for (i.e., n > k), then select the top k from the current batch of successes and you're done.
Since QuickSelect gives you the top k without sorting those k, you'll need to do a final sort of 100 elements if you need the top 100 in order.

I've solved this exact problem by using a binary tree for sorting and by keeping count of the elements to the left of the current node during insertion. See http://pub.uni-bielefeld.de/publication/2305936 (Figure 4.4 et al) for details.

If I understand right, you have two choices:
Selecting 100 elements: N operations of the filter check, then 100(lg 100) for the sort.
Sorting, then selecting 100 elements: at least N(lg N) for the sort, then the select.
The first sounds shorter than sorting and then selecting.

I'd probably filter first, then insert the result of that into a priority queue. Keep track of the number of items in the PQ, and after you do the insert, if it's larger than the number you want to keep (100 in your case), pop off the smallest item and discard it.
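A rough Python sketch of this filter-then-priority-queue idea, using the standard heapq module as the priority queue. The names top_k_passing, accept and key are assumptions introduced for the example.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def top_k_passing(items: Iterable[T], accept: Callable[[T], bool], k: int,
                  key: Callable[[T], float] = lambda x: x) -> List[T]:
    """Return the k largest accepted items, largest first."""
    heap: list = []  # min-heap of (key, arrival order, item)
    for order, item in enumerate(items):
        if not accept(item):
            continue
        # The arrival order breaks ties so items are never compared directly.
        heapq.heappush(heap, (key(item), order, item))
        if len(heap) > k:
            heapq.heappop(heap)  # discard the smallest of the k+1 kept
    return [item for _, _, item in sorted(heap, reverse=True)]


best = top_k_passing(range(1000), lambda x: x % 2 == 0, 5)
```

Every element goes through the filter once and the heap never holds more than k+1 entries, so the cost is roughly N filter calls plus O(N log k) heap work, with no full sort. If the filter is the expensive part, this does not reduce the number of filter calls.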

Steve's suggestion to use Quicksort is a good one.
1 Read in the first 1000 or so elements.
2 Sort them and pick the 100th largest element.
3 Run one pass of Quicksort on the whole file with the element from step 2 as the pivot.
4 Select the upper half of the result of the Quicksort pass for further processing.
You are guaranteed at least 100 elements in the upper half of the single pass of Quicksort. Assuming the first 1000 are reasonably representative of the whole file then you should end up with about one tenth of the original elements at step 4.

Efficiently selecting a set of random elements from a linked list

Say I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.
How can I most efficiently write a function that will return k completely random numbers from the list?

There's a very nice and efficient algorithm for this using a method called reservoir sampling.
Let me start by giving you its history:
Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.
McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.
Vitter 1985 (2) reviews Algorithm R and then presents three additional algorithms which provide the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithm predetermines the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each incoming number.
In pseudocode the algorithm is:
Let R be the result array of size s
Let I be an input queue

> Fill the reservoir array
for j in the range [1,s]:
    R[j]=I.pop()

elements_seen=s
while I is not empty:
    elements_seen+=1
    j=random(1,elements_seen)       > This is inclusive
    if j<=s:
        R[j]=I.pop()
    else:
        I.pop()
Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.
Why does this work?
McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be a bit difficult to reconstruct it here. Therefore, I've generated an alternative proof which is easier to explain.
We proceed via proof by induction.
Say we want to generate a set of s elements and that we have already seen n>=s elements.
Let's assume that our current s elements have already each been chosen with probability s/n.
By the definition of the algorithm, we choose element n+1 with probability s/(n+1).
Each element already part of our result set has a probability 1/s of being replaced.
The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).
Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced---this probability is (s/n)*n/(n+1)=s/(n+1)---or if the element was chosen---with probability s/(n+1).
The definition of the algorithm tells us that the first s elements are automatically included as the first n=s members of the result set. Therefore, the n-seen result set includes each element with s/n (=1) probability giving us the necessary base case for the induction.
References
(1) McLeod, A. Ian, and David R. Bellhouse. "A convenient algorithm for drawing a simple random sample." Journal of the Royal Statistical Society. Series C (Applied Statistics) 32.2 (1983): 182-184.
(2) Vitter, Jeffrey S. "Random sampling with a reservoir." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.

This is called a Reservoir Sampling problem. The simple solution is to assign a random number to each element of the list as you see it, then keep the top (or bottom) k elements as ordered by the random number.
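One way to keep the top k by random number without storing the whole list is a size-k min-heap. The sketch below assumes that approach; the function name sample_by_random_keys is made up for the example.

```python
import heapq
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_by_random_keys(stream: Iterable[T], k: int,
                          rng: Optional[random.Random] = None) -> List[T]:
    """Tag each element with a random key and keep the k largest keys."""
    rng = rng or random.Random()
    heap: list = []  # min-heap of (random key, arrival order, item)
    for order, item in enumerate(stream):
        entry = (rng.random(), order, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new key beats the smallest kept key, so swap them.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


sample = sample_by_random_keys(range(100_000), 1000)
```

Because the keys are independent of the elements and of each other, every k-subset is equally likely to hold the largest keys. The price compared with the replace-in-place reservoir is an O(log k) heap update whenever an element is kept.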

I would suggest: First find your k random numbers. Sort them. Then traverse both the linked list and your random numbers once.
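When the length is known, that suggestion might look like the following Python sketch. The Node class is a hypothetical minimal singly linked list, and sample_known_length and its parameters are names invented for the illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical minimal singly linked list node."""
    value: int
    next: Optional["Node"] = None


def sample_known_length(head: Node, n: int, k: int,
                        rng: Optional[random.Random] = None) -> List[int]:
    """Pick k distinct positions up front, then walk the list once."""
    rng = rng or random.Random()
    targets = sorted(rng.sample(range(n), k))  # k distinct sorted positions
    picked: List[int] = []
    node, position = head, 0
    for target in targets:
        # Advance to the next chosen position; the walk never goes backwards.
        while position < target:
            node = node.next
            position += 1
        picked.append(node.value)
    return picked


# Build the list 0 -> 1 -> ... -> 99 and draw 10 values from it.
head = None
for value in reversed(range(100)):
    head = Node(value, head)
values = sample_known_length(head, 100, 10)
```

The sort costs O(k log k) and the traversal stops at the largest chosen position, so at most one pass over the list is needed.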
If you somehow don't know the length of your linked list (how?), then you could grab the first k nodes into an array, then for each later node r (counting nodes from 1), generate a random integer in [0, r), and if that is less than k, replace the array item at that index with node r. (Not entirely convinced that doesn't bias...)
Other than that: "If I were you, I wouldn't be starting from here." Are you sure linked list is right for your problem? Is there not a better data structure, such as a good old flat array list.

If you don't know the length of the list, then you will have to traverse it completely to ensure random picks. The method I've used in this case is the one described by Tom Hawtin (54070). While traversing the list you keep k elements that form your random selection up to that point. (Initially you just add the first k elements you encounter.) Then, with probability k/i, you replace a random element of your selection with the ith element of the list (i.e. the element you are at, at that moment).
It's easy to show that this gives a random selection. After seeing m elements (m >= k), each of the first m elements of the list is part of your random selection with probability k/m. That this initially holds is trivial. Then element m+1 is put in your selection (replacing a random element) with probability k/(m+1). You now need to show that all other elements also have probability k/(m+1) of being selected. That probability is k/m * (k/(m+1)*(1-1/k) + (1-k/(m+1))) (i.e. the probability that the element was in the selection times the probability that it is still there). With a little algebra you can show that this is equal to k/(m+1).
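For completeness, the simplification of that expression can be written out step by step, using the same k and m as above:

```latex
\begin{align*}
\frac{k}{m}\left[\frac{k}{m+1}\left(1-\frac{1}{k}\right) + \left(1-\frac{k}{m+1}\right)\right]
  &= \frac{k}{m}\left[\frac{k-1}{m+1} + \frac{m+1-k}{m+1}\right] \\
  &= \frac{k}{m}\cdot\frac{m}{m+1} \\
  &= \frac{k}{m+1}.
\end{align*}
```

The first term in the brackets covers the case where element m+1 is kept but overwrites a different slot, and the second covers the case where element m+1 is discarded.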

Well, you do need to know what N is at runtime at least, even if this involves doing an extra pass over the list to count the elements. The simplest algorithm is to just pick a random index in [0, N) and remove that item, repeated k times. Or, if it is permissible to return repeated numbers, don't remove the item.
Unless you have a VERY large N, and very stringent performance requirements, this algorithm runs with O(N*k) complexity, which should be acceptable.
Edit: Nevermind, Tom Hawtin's method is way better. Select the random numbers first, then traverse the list once. Same theoretical complexity, I think, but much better expected runtime.

Why can't you just do something like
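using System;
using System.Collections.Generic;
List<T> GetKRandomFromList<T>(List<T> input, int k)
{
    var rng = new Random();
    var ret = new List<T>();
    for (int i = 0; i < k; i++)
        ret.Add(input[rng.Next(input.Count)]);  // may pick the same element more than once
    return ret;
}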
I'm sure that you don't mean something that simple so can you specify further?
