algorithm for random extraction of elements

I have some issues understanding why the following procedure for performing random sampling from a set of objects works correctly:
suppose I have a population of 100 males and I want to extract 30 of them. One procedure proposed is the following:
assign to the first element of the list a probability of 30/100 and generate a random number n between 0 and 1. If n < 30/100, the element gets selected, otherwise it does not.
If it gets selected, move to the second element and assign it a probability of 29/99; otherwise move to the second record without selecting the first one and assign it a probability of 30/99.
Eventually, proceeding in this way we should reach the desired result of 30 random elements extracted from 100, but I do not understand conceptually why this leads to the correct solution.

I will give a hand-wavy explanation and leave out the mathematics for brevity.
There are two probabilities involved:
The probability of an element getting selected
The oracle that selects an element
Since each element must have an equal probability of being selected, after each decision we adjust the probability according to the number of elements left and the number of elements still to be chosen.
Initially, the probability of each element being selected is 30/100. The oracle will toss a multi-headed coin, producing a number between 0 and 1. There are two situations now:
We select the first element if the coin's outcome is < 30/100. Now we have 99 elements left and we need to choose 29 of them. Since each element has an equal probability of selection, we adjust the probability of each remaining element to 29/99 and proceed with the oracle again.
If the oracle does not select the current element, we are left with 99 elements and we still need to choose 30 of them. Since each element has an equal probability of selection, we adjust the probability of each remaining element to 30/99 and proceed with the oracle again.
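For concreteness, here is a minimal Python sketch of that procedure (the function name is just illustrative); at every element it selects with probability (still needed) / (still remaining), exactly as described above:

    import random

    def select_sample(population, k):
        # Walk the list once; select each element with probability
        # (still needed) / (still remaining), as in the procedure above.
        needed, remaining = k, len(population)
        sample = []
        for element in population:
            if random.random() < needed / remaining:
                sample.append(element)
                needed -= 1
            remaining -= 1
        return sample

    # e.g. extract 30 out of a population of 100
    print(select_sample(list(range(100)), 30))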

Related

How to make a uniform random distribution but where result is revealed in steps?

For example, let's say there is an array of items, each equally likely to be chosen, and the output of this random function will tell which item is chosen, but I want the function to be split into multiple steps so that along each step the list of potential items is narrowed down, giving better insight into the result probabilities.
Here's a step by step example of how it might work:
Step 1: Every item has a 1/1000 chance of being chosen.
Step 2: Random subset of half the original set is removed, so each remaining item is 1/500 now.
Step 3: Repeat step 2 until narrowed down to a single item.
The requirements I'd like for the algorithm are < O(n) time complexity and that at each step the distribution is still uniformly random.
Initially I thought of an algorithm which would:
Start with variables min and max describing the current range of values left.
Shrink the range by generating a random float in [-1, 1], which is applied to the range to shrink it proportionally. If the random number is negative then lower the max, otherwise raise the min. So 50% of the time it shifts the min up, 50% of the time it shifts the max down, and the range shrinks by a factor between [0, 1].
Repeat the previous step until the range converges on a single number.
But I noticed this doesn't have a uniform distribution; instead it is more common for the chosen result to be closer to the starting min and max values. So to fix this I think one could add a preliminary step where the starting range is offset by another random value. But this would only make the starting distribution uniformly random, and it still doesn't fit my requirement of it being uniformly random at every step.
The naive solution is to generate a random number and remove that item from the list at each step, but that is an O(n) solution, so I hope there is something better.
You just have to apply Bayes' Theorem.
If you randomly remove a portion p of the remaining possibilities, the remaining items have their probabilities multiplied by 1/(1-p). So in your step 2, the probabilities change by an amount corresponding to how much the range changed, not by a fixed factor.
This problem has some very simple answers so maybe that is why people seemed confused.
One solution is to generate a random index in [0, n), where n is the number of items in the current set, and instead of removing just that item, remove a range of items around that point.
Solution two is a bit more complicated but has the property of preserving set order and location, such that the resulting set is just a contiguous slice of the original set, whereas the first solution's resulting set could be made up of multiple sections of the original set. The method here is as described initially in my post, but you also apply the random offset during each turn, not just once at the beginning.
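Here is a minimal sketch of the first solution, under one extra assumption of mine: the removed block is taken from a circular view of the current set, so every item is equally likely to be removed and the survivors therefore stay uniformly random at every step. Only one offset per step is stored, so the whole thing takes O(log n) steps when half the items are removed each time (the function name is illustrative):

    import random
    from collections import Counter

    def narrow_to_one(n):
        # Repeatedly remove a contiguous block of half the remaining items,
        # treating the current set as circular so each item is equally
        # likely to be removed.  Only (survivor_start, old_size) is stored
        # per step, so each step is O(1).
        steps = []
        size = n
        while size > 1:
            k = size // 2                         # remove half, as in the example
            s = random.randrange(size)            # uniform start of the removed block
            steps.append(((s + k) % size, size))  # survivors begin right after it
            size -= k
        # Map position 0 of the final one-item "circle" back to an original index.
        idx = 0
        for start, old_size in reversed(steps):
            idx = (start + idx) % old_size
        return idx

    # Sanity check: the chosen index should be uniform over range(8).
    print(Counter(narrow_to_one(8) for _ in range(80000)))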

Looking for an algorithm to a unique problem

I have six arrays that are each given a (not necessarily unique) value from one to fifty. I am also given a number of items to split between them. The value of each item is defined by the array it is in. Arrays can hold any number of items, including zero, but the sum of items across all arrays must equal the original number of items given.
I want to find the best configuration of items in arrays where the sum of item values in each individual array are as close as possible to each other.
For instance, let's say that I have three arrays with a value of 10 and three arrays with a value of 20. For nine items, one would go in each of the '20' arrays and two would go into each of the '10' arrays so that the sum of each array is 20 and the total number of items is nine.
I can't add a fractional number of items to an array, and the numbers are hardly ever perfectly divisible like that example, but there always exists a solution where the difference between the sums is minimal.
I'm currently using brute force to solve this problem, but performance suffers with larger numbers of items. I feel like there is a mathematical answer to this problem, but I wouldn't even know where to begin.
It is easy to write a greedy algorithm that comes up with an approximate solution. Just always add the next item to the array with the lowest sum of values.
The greedy count for the array with the highest value should be within 1 item of the optimal count.
So for each plausible count of items in the highest-value array (the greedy count, one more, or one less), you can repeat the exercise, getting the array with the second highest value to within 1.
Continue through all of them, and with 6 arrays you'll wind up with 3^5 = 243 possible arrangements of items (note that the number of items in the last array is entirely determined by the first 5). Pick the best of these and your combinatorial explosion is contained.
(This approach should work if you're trying to minimize the value difference between the largest and smallest array, and have a fixed number of arrays. )
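A minimal sketch of the greedy step described above (the function name is mine); a heap keeps the lowest-sum array at hand, and the refinement over the 243 candidate arrangements can be layered on top by re-running it with some counts pinned:

    import heapq

    def greedy_fill(values, num_items):
        # Repeatedly place the next item into the array whose current sum
        # of item values is lowest.  values = the per-array item value,
        # e.g. [10, 10, 10, 20, 20, 20].
        heap = [(0, i) for i in range(len(values))]   # (current_sum, array_index)
        heapq.heapify(heap)
        counts = [0] * len(values)
        for _ in range(num_items):
            total, i = heapq.heappop(heap)
            counts[i] += 1
            heapq.heappush(heap, (total + values[i], i))
        return counts

    print(greedy_fill([10, 10, 10, 20, 20, 20], 9))   # -> [2, 2, 2, 1, 1, 1]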

Reservoir sampling: why is it selected uniformly at random

I understand how the algorithm works. However, I don't understand why it is correct. Assume we need to select only one element. Here's the proof that I've found:
at every step N, keep the next element in the stream with probability 1/N. This means that we have an (N-1)/N probability of keeping the element we are currently holding on to, which means that we keep it with probability (1/(N-1)) * (N-1)/N = 1/N.
I understand everything except for the last part. Why do we multiply the probabilities?
Because Pr[A AND B] == Pr[A] * Pr[B], assuming that A and B are independent (as they are here). The probability of choosing the element AND not replacing it later is the product of those two events' probabilities.
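For concreteness, a minimal sketch of the k = 1 case (the function name is mine); the comment spells out the product of probabilities from the quoted proof:

    import random

    def sample_one(stream):
        # Replace the held element by the i-th element with probability 1/i.
        # Element i then survives steps i+1..N with probability
        # i/(i+1) * ... * (N-1)/N, so overall it is held with probability
        # (1/i) * i/(i+1) * ... * (N-1)/N = 1/N.
        held = None
        for i, item in enumerate(stream, start=1):
            if random.randrange(i) == 0:     # true with probability 1/i
                held = item
        return held

    print(sample_one(range(10)))             # each of 0..9 with probability 1/10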

Randomly sample a data set

I came across a question that was asked in an interview.
Q - Imagine you are given a really large stream of data elements (queries on google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?
I am looking for -
What does random sampling of a data set mean?
(I mean I can simply do a coin toss and select a string from the input if the outcome is 1, and do this until I have 1,000 samples.)
What are the things I need to consider while doing so? For example, is taking contiguous strings better than taking non-contiguous strings? To rephrase: is it better if I pick 1,000 contiguous strings at a random position, or is it better to pick one string at a time, like a coin toss?
This may be a vague question. I tried to google "randomly sample data set" but did not find any relevant results.
A binary sample/don't-sample decision may not be the right answer: suppose you want to sample 1,000 strings and you do it via a coin toss. This would mean that after visiting approximately 2,000 strings you would be done. What about the rest of the strings?
I read this post - http://gregable.com/2007/10/reservoir-sampling.html
which answers this question quite clearly.
Let me put the summary here -
SIMPLE SOLUTION
Assign a random number to every element as you see them in the stream, and then always keep the top 1,000 numbered elements at all times.
RESERVOIR SAMPLING
Make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream.
Start with i = 1,001. With what probability after the 1001'th step should element 1,001 (or any element for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001 you should take element 1,001.
If you choose to add it, then replace any element (say element #2) in the reservoir, chosen uniformly at random. Element #2 is definitely in the reservoir after step 1,000, and the probability of it being removed is the probability of element 1,001 being selected multiplied by the probability of #2 being randomly chosen as the replacement candidate: 1,000/1,001 * 1/1,000 = 1/1,001. So the probability that #2 survives this round is 1 - 1/1,001 = 1,000/1,001.
This can be extended to the i-th round: keep the i-th element with probability 1,000/i and, if you choose to keep it, replace a random element from the reservoir. The probability of any earlier element being in the reservoir before this step is 1,000/(i-1). The probability that it is removed in this round is 1,000/i * 1/1,000 = 1/i, so the probability that it sticks around, given that it is already in the reservoir, is (i-1)/i. Thus its overall probability of being in the reservoir after i rounds is 1,000/(i-1) * (i-1)/i = 1,000/i.
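A single-pass sketch of that scheme (the function name is mine), with the reservoir size left as a parameter:

    import random

    def reservoir_sample(stream, k=1000):
        reservoir = []
        for i, item in enumerate(stream, start=1):
            if i <= k:
                reservoir.append(item)                 # fill the reservoir first
            elif random.random() < k / i:              # keep the i-th item w.p. k/i
                reservoir[random.randrange(k)] = item  # evict a uniformly chosen victim
        return reservoir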
I think you have used the word infinite a bit loosely. The very premise of sampling here is that every element has an equal chance to be in the sample, and that is only possible if you go through every element at least once. So I would translate infinite to mean a large number, indicating that you need a single-pass solution rather than multiple passes.
Reservoir sampling is the way to go, though the analysis from #abipc seems to be in the right direction but is not completely correct.
It is easier if we are first clear on what we want. Imagine you have N elements (N unknown) and you need to pick 1,000 of them. This means we need to devise a sampling scheme where the probability of any element being in the sample is exactly 1000/N, so each element has the same probability of being in the sample (no preference for any element based on its position in the original list). The scheme mentioned by #abipc works fine; the probability calculation goes like this:
After first step you have 1001 elements so we need to pick each element with probability 1000/1001. We pick the 1001st element with exactly that probability so that is fine. Now we also need to show that every other element also has the same probability of being in the sample.
p(any other element remaining in the sample)
= 1 - p(that element is removed from the sample)
= 1 - p(1001st element is selected) * p(that element is picked as the replacement)
= 1 - (1000/1001) * (1/1000)
= 1000/1001
Great, so now we have proven that every element has probability 1000/1001 of being in the sample after this step. This precise argument can be extended to the i-th step using induction.
As far as I know, this class of algorithms is called reservoir sampling algorithms.
I know one of them from data mining, but don't know its name:
Collect the first S elements in your storage, whose maximum size is S.
Suppose the next element of the stream has number N.
With probability S/N keep the new element, otherwise discard it.
If you kept element N, replace one of the elements already in the sample of size S, picked uniformly at random.
Set N = N + 1, take the next element from the stream, and repeat from the probability step.
It can be theoretically proved that at any step of such stream processing, your storage of size S contains each element seen so far with equal probability S/N_you_have_seen.
So, for example, S = 10 and N_you_have_seen = 10^6: S is a fixed, finite number, while N_you_have_seen can grow without bound.

Randomly choosing from a list with weighted probabilities

I have an array of N elements (representing the N letters of a given alphabet), and each cell of the array holds an integer value, that value being the number of occurrences of that letter in a given text. Now I want to randomly choose a letter from all of the letters in the alphabet, based on its number of appearances, with the given constraints:
If a letter has a positive (nonzero) value, then it can always be chosen by the algorithm (with a bigger or smaller probability, of course).
If a letter A has a higher value than a letter B, then it has to be more likely to be chosen by the algorithm.
Now, taking that into account, I've come up with a simple algorithm that might do the job, but I was just wondering if there was a better thing to do. This seems to be quite a fundamental problem, and I think there might be more clever things to do in order to accomplish it more efficiently. This is the algorithm I thought of:
Add up all the frequencies in the array. Store the result in SUM.
Choose a random value from 0 to SUM. Store it in RAN.
While RAN > 0: starting from the first, visit each cell in the array (in order) and subtract the value of that cell from RAN.
The last visited cell is the chosen one
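For reference, a minimal sketch of that linear scan (the names are mine); it is O(n) per draw:

    import random

    def weighted_pick(freqs):
        # Draw a number in [0, SUM) and walk the array, subtracting each
        # frequency until the draw falls inside the current cell.
        total = sum(freqs)
        r = random.randrange(total)
        for index, f in enumerate(freqs):
            if r < f:
                return index             # index of the chosen letter
            r -= f

    print(weighted_pick([1, 4, 3, 2]))   # 0..3, chosen with weights 1:4:3:2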
So, is there a better thing to do than this? Am I missing something?
I'm aware most modern computers can compute this so fast I won't even notice if my algorithm is inefficient, so this is more of a theoretical question rather than a practical one.
I prefer an explained algorithm rather than just code for an answer, but if you're more comfortable providing your answer in code, I have no problem with that.
The idea:
Iterate through all the elements and set the value of each element as the cumulative frequency thus far.
Generate a random number between 1 and the sum of all frequencies
Do a binary search on the values for this number (finding the first value greater than or equal to the number).
Example:
Element A B C D
Frequency 1 4 3 2
Cumulative 1 5 8 10
Generate a random number in the range 1-10 (1+4+3+2 = 10, the same as the last value in the cumulative list), do a binary search, which will return values as follows:
Number Element returned
1 A
2 B
3 B
4 B
5 B
6 C
7 C
8 C
9 D
10 D
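A minimal sketch of that idea (the names are mine), using a running cumulative sum and a binary search:

    import bisect
    import random
    from itertools import accumulate

    def weighted_pick_bisect(freqs):
        # Running sums take O(n) to build; the binary search itself is O(log n).
        cumulative = list(accumulate(freqs))        # e.g. [1, 5, 8, 10]
        r = random.randint(1, cumulative[-1])       # 1 .. total, as in the example
        return bisect.bisect_left(cumulative, r)    # first cumulative value >= r

    print(weighted_pick_bisect([1, 4, 3, 2]))       # 0 = A, 1 = B, 2 = C, 3 = D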
The Alias Method has amortized O(1) time per value generated, but requires two uniforms per lookup. Basically, you create a table where each column contains one of the values to be generated, a second value called an alias, and a conditional probability of choosing between the value and its alias. Use your first uniform to pick any of the columns with equal likelihood. Then choose between the primary value and the alias based on your second uniform. It takes O(n log n) work to initially set up a valid table for n values, but after the table is built, generating values is constant time. You can download this Ruby gem to see an actual implementation.
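For reference, a hedged sketch of one common way to set up and query such a table (Vose's variant of the alias method); the function names are mine, not from the gem mentioned above:

    import random

    def build_alias_table(weights):
        # Scale weights so they average to 1, then pair each "small" column
        # with a "large" alias that tops it up to exactly 1.
        n = len(weights)
        total = sum(weights)
        scaled = [w * n / total for w in weights]
        prob, alias = [0.0] * n, [0] * n
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l
            scaled[l] += scaled[s] - 1.0
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:          # leftovers due to floating-point slack
            prob[i] = 1.0
        return prob, alias

    def alias_draw(prob, alias):
        # Two uniforms per lookup: pick a column, then its value or its alias.
        i = random.randrange(len(prob))
        return i if random.random() < prob[i] else alias[i]

    prob, alias = build_alias_table([1, 4, 3, 2])
    print(alias_draw(prob, alias))       # 0..3, chosen with weights 1:4:3:2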
Two other very fast methods by Marsaglia et al. are described here. They have provided C implementations.
