choose k from n with a probability - algorithm

I have a list of n elements (e_i). Each element e_i has a probability p_i of being selected.
I want to write an algorithm that picks k elements from these n while respecting each element's probability, but I have no idea how to do that and don't know of any algorithm that does it.
Can you point me in the right direction?

Let's say you have 3 possible values, A, B, and C, with:
P(A) = 0.2, P(B) = 0.3, P(C) = 0.5. Put the cumulative probabilities in an array p = [0.2, 0.5, 1]. For each pick, generate a random number in the range [0, 1] (using your language's built-in library). Then return the class (A, B, or C) whose cumulative probability is the smallest value greater than or equal to the generated number.
Hint: that class can be found in O(log N) time if the optimal approach (a binary search over the cumulative array) is used.
Here is an example:
if you generate 0.4, you return B, because 0.5 is the smallest cumulative value >= 0.4. If you generate 0.01, you return A.
That's the idea; I'll let you try to implement it. If you need more help, I can write some (pseudo)code too.
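For concreteness, here is a minimal Python sketch of a single pick along those lines (the cumulative array plus a binary search via bisect; the function name and the final clamp are my own additions):

import bisect
import random

def weighted_pick(values, probs):
    # build the cumulative distribution, e.g. [0.2, 0.3, 0.5] -> [0.2, 0.5, 1.0]
    cumulative = []
    total = 0.0
    for p in probs:
        total += p
        cumulative.append(total)
    r = random.random()
    # index of the smallest cumulative value >= r, found in O(log n)
    i = bisect.bisect_left(cumulative, r)
    return values[min(i, len(values) - 1)]  # clamp guards against round-off at the top end

print(weighted_pick(['A', 'B', 'C'], [0.2, 0.3, 0.5]))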

Assuming that you want k distinct elements, you could do the following: keep track of the total remaining probability of the non-selected elements. Repeatedly (k times) pick a random number, r, in the range [0, remaining]. Scan over the probabilities, accumulating them until the running sum exceeds r, and pick the corresponding element. Then reduce remaining by that element's probability and zero it out so that the element won't be picked again.
Here is a Python implementation:
from random import random

def choose(probs, k):
    choices = []
    remaining = 1
    p = probs[:]  # create a local copy
    for _ in range(k):
        r = remaining * random()
        i = 0
        s = p[i]
        while s < r:
            i += 1
            s += p[i]
        choices.append(i)
        remaining -= p[i]
        p[i] = 0  # so it won't be chosen again
    return choices

# test:
dist = [0.2, 0.4, 0.1, 0.1, 0.1, 0.05, 0.05]
for _ in range(10):
    print(choose(dist, 4))
Typical output:
[2, 5, 1, 3]
[1, 0, 6, 4]
[0, 4, 1, 6]
[1, 2, 3, 0]
[1, 5, 2, 4]
[3, 1, 0, 2]
[1, 2, 0, 4]
[1, 2, 0, 4]
[2, 5, 1, 4]
[1, 2, 0, 3]
Note how 0 and 1 are frequently chosen but 5 and 6 are comparatively rare.
As an implementation detail: the algorithm above should always work in principle, but round-off error combined with a value of r extremely close to remaining could produce an index-out-of-range error. For most use cases this is rare enough not to worry about, but you could add error handling, e.g. fall back to the last element with a non-zero probability in the case that the sum of all non-zero probabilities rounds to just below remaining and r happens to fall in that narrow gap.

So element i can be expressed as the pair (e_i, p_i), as those are its two components. You apparently already know what values to fill in for these. I'm going to make up an example, though, so I can show you how to do this without doing it for you:
(A, 1) (B, 2) (C, 3)
What you need to do is assign each value to a range. I'll do it an easy way and just go left to right, starting at zero.
So, we need 1 slot for A, 2 for B, 3 for C. Our possible indices will be 0, 1, 2, 3, 4, and 5.
0->A
1->B
2->B
3->C
4->C
5->C
This is a basic example, and your weights might be floating point, but it should give you a start.
Edit: Floating point example
(D, 2) (E, .5123) (F, 1)
0 <= D < 2
2 <= E < 2.5123
2.5123 <= F < 3.5123
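A minimal Python sketch of this range idea (the weights don't need to sum to 1; the function name is mine, and the final return is just a guard against floating-point round-off):

import random

def pick_by_range(items):
    # items: list of (value, weight) pairs, e.g. [('D', 2), ('E', 0.5123), ('F', 1)]
    total = sum(weight for _, weight in items)
    r = random.uniform(0, total)   # a point somewhere on the combined range
    upper = 0.0
    for value, weight in items:
        upper += weight            # upper bound of this value's range
        if r < upper:
            return value
    return items[-1][0]            # round-off guard

print(pick_by_range([('D', 2), ('E', 0.5123), ('F', 1)]))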

Necessary assumption
By linearity of expectation, it is easy to show that if you pick elements among the n elements 0, 1, 2, ..., n-1 such that each element i has probability p_i of being selected, then the expected number of picked elements is exactly sum p_i. This holds no matter which algorithm is used to pick the elements.
You are looking for such an algorithm, but with the added constraint that the number of picked elements is always k. It follows that a necessary assumption is:
sum p_i = k
Fortunately, it turns out that this assumption is also sufficient.
Algorithm
Assume now that sum p_i = k. The following algorithm will select exactly k elements, such that each element i in 0,1,...,n-1 has probability p_i of being chosen.
Compute the cumulative sums:
c_0 = 0
c_1 = p_0
...
c_i = p_0 + p_1 + ... + p_(i-1)
...
c_n = k
Pick a number x uniformly at random in [0, 1)
For every number y in the list x, 1+x, 2+x, 3+x, ..., k-1+x:
Choose element i such that c_i <= y < c_(i+1)
It is easy to verify that exactly k elements are chosen, and that every element i has probability p_i of being chosen.
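A minimal Python sketch of this procedure (assuming the p_i sum to k and each p_i <= 1, so that no element is hit twice; the names are my own, and the bound in the while loop is only a guard against floating-point round-off):

import random

def systematic_sample(probs, k):
    chosen = []
    x = random.random()          # one uniform offset in [0, 1)
    c = 0.0                      # running cumulative sum c_i
    i = 0
    for step in range(k):
        y = step + x             # the points x, 1+x, ..., k-1+x
        # advance i until c_i <= y < c_(i+1)
        while i < len(probs) - 1 and c + probs[i] <= y:
            c += probs[i]
            i += 1
        chosen.append(i)
    return chosen

# example: probabilities summing to k = 2
print(systematic_sample([0.4, 0.8, 0.3, 0.5], 2))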
Reference
The previous algorithm is the subject of a research paper from the 80s or 90s, which I can't put my hands on at this exact moment; I will edit this post with a reference if I can find it again.


Use FFT to find all possible fixed-size subset sums

I need to solve the following problem: given an integer sequence x of size N, and a subset size k, find all the possible subset sums. A subset sum is the sum of elements in the subset.
If elements in x are allowed to appear multiple times (up to k, of course) in a subset (a sub-multiset), this problem has a pseudo-polynomial time solution via FFT. Here is an example:
x = [0, 1, 2, 3, 6]
k = 4
xFrequency = [1, 1, 1, 1, 0, 0, 1] # On the support of [0, 1, 2, 3, 4, 5, 6]
sumFrequency = selfConvolve(xFrequency, times = 4) # A fast approach is to simply raise the power of the Fourier series.
sumFrequency > 0 # Gives a boolean vector indicating all possible size-k subset sums.
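For reference, selfConvolve can be realized with numpy roughly like this (only a sketch of the multiset case; self_convolve is my own name):

import numpy as np

def self_convolve(freq, times):
    # raising the Fourier transform to a power is the same as convolving
    # the frequency vector with itself that many times
    n = (len(freq) - 1) * times + 1          # length of the linear-convolution result
    size = 1
    while size < n:
        size *= 2                            # pad so no wrap-around occurs
    f = np.fft.rfft(freq, size)
    result = np.fft.irfft(f ** times, size)[:n]
    return np.round(result).astype(int)      # counts are integers; round off FFT noise

xFrequency = [1, 1, 1, 1, 0, 0, 1]           # support [0..6] for x = [0, 1, 2, 3, 6]
sumFrequency = self_convolve(xFrequency, times=4)
print(np.nonzero(sumFrequency)[0])           # all achievable size-4 multiset sums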
But what can be done if an element cannot show up multiple times in a subset?
I came up with the following method but am unsure of its correctness. The idea is to first find the frequencies of sums that are produced by adding at least 2 identical elements:
y = [0, 2, 4, 6, 12] # = [0, 1, 2, 3, 6] + [0, 1, 2, 3, 6]
yFrequency = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]
sumFrequencyWithRedundancy = convolve(yFrequency, xFrequency, xFrequency)
My reasoning is that since y represents all possible sums of 2 identical elements, then every sum in y + x + x is guaranteed to have been produced by adding at least 2 identical elements. Finally
sumFrequencyNoRedundancy = sumFrequency - sumFrequencyWithRedundancy
sumFrequencyNoRedundancy > 0
Any mistake or any other established method for solving the problem?
Thanks!
Edits:
After some tests, it does not work. It turns out there are many more combinations that should be excluded from sumFrequency besides sumFrequencyWithRedundancy, and the combinatorial analysis escalates rapidly with k, eventually making this less efficient than brute-force summation.
My motivation was to find all possible sample sums given sampling without replacement and a fixed sample size. Then I came across the idea of solving the standard subset sum problem via FFT --- with the subset size unconstrained and the qualifying subsets themselves not needed. The reference materials are easy to find online; it's basically a divide-and-conquer approach:
Divide the superset into 2 sets, left and right.
Compute all possible subset sums in the left and right sets. The sums are represented by 2 boolean vectors.
Convolve the 2 boolean vectors.
Find if the target sum is indicated in the final boolean vector.
You can see why the algorithm works for the standard subset sum problem.
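As an illustration of that divide-and-conquer scheme, here is a small sketch (using numpy's plain convolve rather than an explicit FFT; the function name is mine):

import numpy as np

def possible_subset_sums(x):
    # boolean vector b with b[s] = 1 iff some subset of x sums to s
    if len(x) == 1:
        v = np.zeros(x[0] + 1)
        v[0] = v[x[0]] = 1
        return v
    mid = len(x) // 2
    left = possible_subset_sums(x[:mid])      # possible sums of the left half
    right = possible_subset_sums(x[mid:])     # possible sums of the right half
    return (np.convolve(left, right) > 0).astype(float)

print(np.nonzero(possible_subset_sums([0, 1, 2, 3, 6]))[0])  # every sum 0..12 is reachable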
If anyone can let me know some work on how to find all possible size-k subset sums, I would really appreciate it!
Given k and the n-element array x, it suffices to evaluate the degree-k coefficient in z of the polynomial
prod_{i = 1..n} (1 + y^(x[i]) z).
This coefficient is a polynomial in y where the exponents with nonzero coefficients indicate the sums that can be formed using exactly k distinct terms.
One strategy is to split x with reasonably balanced sums, evaluate each half mod z^(k+1), and then multiply using the school algorithm for the outer multiplications and FFT (or whatever) for the inner. This should end up costing roughly O(k^2 S log^2 S).
The idea for evaluating elementary symmetric polynomials efficiently is due to Ben-Or.
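The FFT-accelerated evaluation takes some care to implement; as a plain baseline that computes the same information, here is a sketch that carries the product truncated at degree k in z directly, representing each coefficient (a polynomial in y) simply by the set of exponents that occur (names are mine):

def size_k_subset_sums(x, k):
    # dp[j] = set of sums achievable using exactly j distinct terms so far,
    # i.e. the exponents of y appearing in the coefficient of z^j
    dp = [set() for _ in range(k + 1)]
    dp[0].add(0)
    for v in x:
        for j in range(k - 1, -1, -1):   # descending, so each term is used at most once
            if dp[j]:
                dp[j + 1].update(s + v for s in dp[j])
    return sorted(dp[k])

print(size_k_subset_sums([0, 1, 2, 3, 6], 4))  # -> [6, 9, 10, 11, 12]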

Maximum Sum for Subarray with fixed cutoff

I have a list of integers, and I need to find a way to get the maximum sum of a subset of them, adding elements to the total until the sum is equal to (or greater than) a fixed cutoff. I know this seems similar to the knapsack problem, but I'm unsure whether it's equivalent.
Sorting the array and greedily adding the largest remaining element while the sum is below the cutoff does not work. Observe the following list:
list = [6, 5, 4, 4, 4, 3, 2, 2, 1]
cutoff = 15
For this list, doing it the naive way results in a sum of 15, which is very sub-optimal. As far as I can see, the maximum you could arrive at using this list is 20, by adding 4 + 4 + 4 + 2 + 6. If this is just a different version of knapsack, I can just implement a knapsack solution, as I probably have small enough lists to get away with this, but I'd prefer to do something more efficient.
First of all, in any sum you won't do worse by adding the largest element last. So there is no harm in assuming, as a first step, that the elements are sorted from smallest to largest.
And now you use a dynamic programming approach similar to the usual subset sum.
def best_cutoff_sum(cutoff, elements):
    elements = sorted(elements)
    sums = {0: None}                 # maps each reachable sum to the path that produced it
    for e in elements:
        next_sums = {}
        for v, path in sums.items():
            next_sums[v] = path      # option 1: skip e
            if v < cutoff:           # option 2: add e, allowed only while still below the cutoff
                next_sums[v + e] = [e, path]
        sums = next_sums
    best = max(sums.keys())
    return (best, sums[best])

print(best_cutoff_sum(15, [6, 5, 4, 4, 4, 3, 2, 2, 1]))
With a little work you can turn the path from the nested list it currently is into whatever format you want.
If your list of non-negative elements has n elements, your cutoff is c and your maximum value is v, then this algorithm will take time O(n * (c + v)).

How more effectively find the minimal composition from n sets that satisfies the given condition?

We have N sets of triples,like
1. { (4; 0,1), (5 ; 0.3), (7; 0,6) }
2. { (7; 0.2), (8 ; 0.4), (1 ; 0.4) }
...
N. { (6; 0.3), (1; 0.2), (9 ; 0.5) }
and we need to choose exactly one pair from each triple so that the sum of the chosen pairs' first members is minimal, subject to the condition that the sum of their second members is not less than a given number P.
We could solve this by enumerating all possible combinations (3^N of them), sorting them by the sum of their first members, and choosing the first one in that order that also satisfies the second condition.
Could you please suggest a better, non-trivial solution for this problem?
If there are no constraints on the values inside your triples, then we are facing a pretty general version of an integer programming problem, more specifically a 0-1 linear programming problem, as it can be represented as a system of equations with every coefficient being 0 or 1. You can find the possible approaches on the wiki page, but there is no fast-and-easy solution for this problem in general.
Alternatively, if the second numbers of each pair (the ones that need to sum up to >= P) come from a small enough range, we could view this as a dynamic programming problem similar to the knapsack problem. "Small enough" is a bit hard to define here because the original data has non-integer numbers. If they were integers, the complexity of the solution I will describe would be O(P * N). Non-integer numbers first need to be converted to integers by multiplying them all, as well as P, by a large enough factor. In your example, each number has one digit after the decimal point, so multiplying by 10 is enough. Hence, the actual complexity is O(M * P * N), where M is the factor everything was multiplied by to obtain integers.
After this, we are essentially solving a modified Knapsack problem: instead of constraining the weight from above, we are constraining it from below, and on each step we are choosing a pair from a triplet, as opposed to deciding whether to put an item into the knapsack or not.
Let's define a function minimum_sum[i][s] which at values i, s represents the minimum possible sum (of first numbers in each pair we took) we can achieve if the sum of the second numbers in pairs taken so far is equal to s and we already considered the first i triplets. One exception to this definition is that minimum_sum[i][P] has the minimum for all sums exceeding P as well. If we can compute all values of this function, then minimum_sum[N][P] is the answer. The function values can be computed with something like this:
minimum_sum[0][0] = 0, all other values are set to infinity
for i = 0..N-1:
    for s = 0..P:
        for j = 0..2:
            minimum_sum[i+1][min(P, s+B[i][j])] = min(minimum_sum[i+1][min(P, s+B[i][j])], minimum_sum[i][s] + A[i][j])
A[i][j] here denotes the first number in the i-th triplet's j-th pair, and B[i][j] denotes the second number of the same pair.
This solution is viable if N is large, but P is small and precision on Bs isn't too high. For instance, if N=50, there is little hope to compute 3^N possibilities, but with M*P=1000000 this approach would work extremely fast.
Python implementation of the idea above:
def compute(A, B, P):
    n = len(A)
    # note that I use 1,000,000 as "infinity" here, which might need to be
    # increased depending on the input data
    best = [[1000000 for i in range(P + 1)] for j in range(n + 1)]
    best[0][0] = 0
    for i in range(n):
        for s in range(P + 1):
            for j in range(3):
                best[i+1][min(P, s+B[i][j])] = min(best[i+1][min(P, s+B[i][j])], best[i][s] + A[i][j])
    return best[n][P]
Testing:
A=[[4, 5, 7], [7, 8, 1], [6, 1, 9]]
# second numbers in each pair after scaling them up to be integers
B=[[1, 3, 6], [2, 4, 4], [3, 2, 5]]
In [7]: compute(A, B, 0)
Out[7]: 6
In [14]: compute(A, B, 7)
Out[14]: 6
In [15]: compute(A, B, 8)
Out[15]: 7
In [20]: compute(A, B, 13)
Out[20]: 14

Segments with most points algorithm analysis

We define x_1, x_2, ..., x_n to be a sequence of points (numbers) and [s_i, t_i], for 1 ≤ i ≤ n, to be a set of n segments. Point x_j is inside segment i if s_i ≤ x_j ≤ t_i. I want to find the segment containing the most points.
Now to solve this, I am thinking we can sort x and the intervals based on s. Keep a separate array, T, such that T[i] = number of points in segment i. Initialize all the values in this array to 0. Then, for each point, check all the intervals that contain it and increment the corresponding T[i].
In the worst case this can take O(n^2). But I feel like I have a lot of redundancy here. How do I make this more efficient?
Just to clarify: your problem is one-dimensional, the points in X (x_1 to x_n) are numbers, and the segments are intervals.
You can easily solve this by sorting X and using the resulting indices. You can efficiently calculate the number of points within a segment [s, t] by finding the two corresponding indices i and j. Find (using binary search or whatever is most efficient) i such that x_i < s <= x_(i+1), and j such that x_j <= t < x_(j+1). Note the inequalities (in case s or t is itself in X). The number of points within [s, t] is then j - i.
If it is possible that s < x_1 or t > x_n, simply append a point to both ends of X (a minimum and a maximum).
This has complexity O(n log n), limited by the sorting algorithm. If you can use something like counting sort that uses the values as indices into an array (or keys into a multiset), then you can improve on that by doing some more work.
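A minimal Python sketch of that approach (bisect does the index search described above; the function name is my own):

from bisect import bisect_left, bisect_right

def best_segment(points, segments):
    xs = sorted(points)
    # for [s, t], the number of points inside is the count of xs in [s, t],
    # obtained from two binary searches in O(log n)
    counts = [bisect_right(xs, t) - bisect_left(xs, s) for s, t in segments]
    best = max(range(len(segments)), key=lambda i: counts[i])
    return best, counts[best]

print(best_segment([1, 3, 4, 7, 9], [(0, 2), (3, 7), (5, 10)]))  # -> (1, 3)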
Let S be the set of points containing every s and every t for all the segments [s, t]. The idea is to build an indexing array for X (kind of like for a counting sort).
First, build the array A such that A[x in X] = 1 and A[x not in X] = 0. Then, go through it again to build the array A_less such that A_less[i] equals the sum of all A[j] with j < i.
For example, if A = [1, 0, 0, 1, 0, 1, 0], then A_less = [0, 1, 1, 1, 2, 2, 3]. You can build this array using a simple counter.
You can now use this array directly to get the number of points whose values are strictly less than a given value. In the previous example, there are clearly three points in X, with values 0, 3, and 5. Referring to A_less, you can see that there are A_less[4] = 2 points with values less than 4.
Similarly, build A_less_equal such that A_less_equal[i] equals the sum of all A[j] with j <= i. Using the same example, A_less_equal = [1, 1, 1, 2, 2, 3, 3].
Now, for any segment [s, t], you can get the number of points it contains by computing A_less_equal[t] - A_less[s]. All of that has complexity O(n), plus the size of the value range for building the arrays.
If your points are not integers (or at least not easily usable as indices), then you can still use the same idea, replacing the arrays with sorted sets whose keys are every value in X or S (you need to add the values in S to be able to look them up at the end).
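A sketch of that counting idea in Python, slightly compressed: a single prefix array less[] plays the roles of both A_less and A_less_equal (less[i] counts points with value < i, so less[t + 1] counts points with value <= t). It assumes small non-negative integer coordinates, and the names are mine:

def best_segment_counting(points, segments):
    m = max(points + [v for seg in segments for v in seg]) + 1
    a = [0] * m
    for p in points:
        a[p] += 1                       # a[x] = number of points with value x
    less = [0] * (m + 1)                # less[i] = number of points with value < i
    for i in range(m):
        less[i + 1] = less[i] + a[i]
    # points in [s, t] = (points with value <= t) - (points with value < s)
    counts = [less[t + 1] - less[s] for s, t in segments]
    best = max(range(len(segments)), key=lambda i: counts[i])
    return best, counts[best]

print(best_segment_counting([0, 3, 5], [(0, 2), (2, 5), (4, 6)]))  # -> (1, 2)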

Algorithm to find combination of n numbers with largest sum

Problem is simple -
Suppose I have an array of following numbers -
4,1,4,5,7,4,3,1,5
I have to find the number of sets of k elements that can be created from the above numbers and have the largest possible sum. Two sets are considered different if they differ in at least one element.
e.g.
if k = 2, then there are two such sets: {7, 5} and {7, 5}. Note: 5 appears twice in the above array.
I think I can start with something like:
1. Sort the array.
2. Create two arrays: one for the distinct numbers and a parallel one for each number's occurrence count.
But I am stuck now. Any suggestions?
The algorithm is as follows:
1) Sort elements in descending order.
2) Look at this array. It may look something like this:
a ... a b ... b c ... c d ...
| <- k -> |
Now obviously all elements a and b will be in the sets with the largest sum. You can't replace any of them with a smaller element, because then the sum wouldn't be the largest possible. So you have no choice here, you have to choose all a and b for any of the sets.
On the other hand, only some of the elements c will be in those sets. So the answer is just the number of ways to choose c's to fill the positions left in the sets after you have taken all larger elements. That is the binomial coefficient:
count of c's choose (k - (count of elements larger than c))
For example for an array (already sorted here)
[9, 8, 7, 7, 5, 5, 5, 5, 4, 4, 2, 2, 1, 1, 1]
and k = 6, you must choose 9, 8 and both 7's for every set with the largest sum (which is 41). And then you can choose any two out of the four 5's. So the result will be 4 choose 2 = 6.
With the same array and k = 4, the result would be x choose 0 = 1 (that unique set is {9, 8, 7, 7}); with k = 7 the result would be 4 choose 3 = 4; and with k = 9: 2 choose 1 = 2 (choosing either of the two 4's for the set with the largest sum).
EDIT: I edited the answer, because we figured it out that OP needs to count multisets.
First, find the largest k numbers in the array. This is of course easy, and if k is very small, you can do it with k linear scans in O(n * k). If k is not so small, you can use a binary heap or a priority queue, or just sort the array, which is respectively O(n * log(k)) or O(n * log(n)) with sorting.
Let's assume that you have computed the k largest numbers. Of course all sets of size k with the largest sum have to contain exactly these k largest numbers and no others; any other set doesn't have the largest sum.
Let count[i] be the number of occurrences of number i in the input sequence.
Let occ[i] be the number of occurrences of number i in the largest k numbers.
We can compute both tables in several ways, for example using a hash table, or, if the input numbers are small, an array indexed by those numbers.
Let B be the array of distinct numbers from the largest k numbers.
Let m be the size of B.
Now let's compute the answer. We will do it in m steps. After the i-th step we will have computed the number of different multisets consisting of the first i numbers from B. At the beginning the result is 1, since there is only one empty multiset. In the i-th step, we multiply the current result by the number of ways to choose occ[B[i]] elements from count[B[i]] elements, which is equal to binomial(count[B[i]], occ[B[i]]).
For example, let's consider your instance with one more 7 added at the end and k set to 3:
k = 3
A = [4, 1, 4, 5, 7, 4, 3, 1, 5, 7]
The largest three numbers in A are 7, 7, 5
At the beginning we have:
count[7] = 2
count[5] = 2
occ[7] = 2
occ[5] = 1
result = 1
B = [7, 5]
We start with the first element in B which is 7. Its count is 2 and its occ is also 2, so we do:
// binomial(2, 2) is 1
result = result * binomial(2, 2)
Next element in B is 5, its count is 2 and its occ is 1, so we do:
// binomial(2, 1) is 2
result = result * binomial(2, 1)
And the final result is 2, since there are two different ways to form the multiset [7, 7, 5] (the two 5's in the input are distinct elements).
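A compact Python sketch of this counting (finding the k largest by sorting; Counter and math.comb do the bookkeeping, and the function name is mine):

from collections import Counter
from math import comb

def count_largest_sum_sets(a, k):
    count = Counter(a)                          # count[v]: occurrences of v in the input
    occ = Counter(sorted(a, reverse=True)[:k])  # occ[v]: occurrences of v among the k largest
    result = 1
    for v, o in occ.items():
        result *= comb(count[v], o)             # binomial(count[v], occ[v])
    return result

print(count_largest_sum_sets([4, 1, 4, 5, 7, 4, 3, 1, 5, 7], 3))  # -> 2
print(count_largest_sum_sets([4, 1, 4, 5, 7, 4, 3, 1, 5], 2))     # -> 2 (the two {7, 5} sets)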
I'd create a sorted dictionary of the frequencies of occurrence of the numbers in the input. Then take the two largest numbers and multiply the number of times they occur.
In C++, it could look something like this:
std::vector<int> inputs { 4, 1, 4, 5, 7, 4, 3, 1, 5 };
std::map<int, int> counts;
for (auto i : inputs)
    ++counts[i];
auto last = counts.rbegin();              // a map iterates in key order, so rbegin() is the largest number
int largest_count = last->second;         // occurrences of the largest number
int second_count = (++last)->second;      // occurrences of the second largest number
int set_count = largest_count * second_count;
You can do the following:
1) Sort the elements in descending order;
2) define variable answer=1;
3) Start from the beginning of the array and, for each new value you see, count the number of its occurrences (let's call this variable count). Each time the value changes, do: answer = answer * count. The pseudo-code should look like this:
find_count(Array A, K)
{
    sort(A, descending);
    int answer = 1;
    int count = 1;
    for (int i = 1, j = 1; i < K && j < A.length; j++)
    {
        if (A[j] != A[j-1])
        {
            answer = answer * count;
            i++;
            count = 1;
        }
        else
            count++;
    }
    return answer;
}
