Given a probability distribution (a mapping of objects to their probabilities), I want an algorithm that selects random objects from the map without replacement (the probability distribution is updated after each selection). The algorithm must have O(1) space complexity and produce high-quality randomness. I searched for implementations, but none of them seemed to have both of these properties.
EDIT:
Probability without replacement:
You have a bag of objects, and each object has a probability of being selected. Once you select an object, you remove it from the bag. All remaining objects now have a different probability of being selected.
With O(1) space complexity, we are not storing a list with objects repeated according to their probability of being selected. Instead, we are only storing a probability distribution and iterating over a permutation (but not storing that permutation).
I would try a variation of the Fisher-Yates-Knuth shuffle (in the Durstenfeld implementation it needs only O(1) extra space).
Original:
for i from 0 to n − 1 do
    j ← random integer such that 0 ≤ j ≤ i
    if j ≠ i
        a[i] ← a[j]
    a[j] ← source[i]
Modified to fulfill requirements:
for i from 0 to n − 1 do
    p ← probabilities(n-i)
    j ← random integer via probabilities(n-i) such that 0 ≤ j ≤ i
    if j ≠ i
        a[i] ← a[j]
    a[j] ← source[i]
So at each step you update the probabilities and use them to sample the index. After that it's just the FYK shuffle.
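A minimal Python sketch of this idea, under a few assumptions of mine: the distribution is given as two parallel lists (items and non-negative weights), the shuffle is done in place in the forward Durstenfeld style rather than the inside-out form shown above, and each draw does a linear scan over the remaining weights. The name weighted_shuffle_in_place is hypothetical.

```python
import random


def weighted_shuffle_in_place(items, weights, rng=random):
    """Reorder items so that position i holds the i-th draw without replacement.

    Each draw picks one of the not-yet-drawn items with probability
    proportional to its weight. Only O(1) extra space is used; each draw
    costs O(n) time because of the linear scan.
    """
    n = len(items)
    total = sum(weights)  # total weight of the items not yet drawn
    for i in range(n - 1):
        target = rng.random() * total
        j = i
        acc = weights[i]
        # Walk the cumulative weights until they exceed the target.
        while acc <= target and j < n - 1:
            j += 1
            acc += weights[j]
        # Move the drawn item (and its weight) into the "already drawn" prefix.
        items[i], items[j] = items[j], items[i]
        weights[i], weights[j] = weights[j], weights[i]
        total -= weights[i]  # renormalise implicitly for the next draw
```

Reading the first k entries of items afterwards gives k draws without replacement; the loop can also be stopped early after k iterations.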
Given an array arr of n integers, what is the highest score that a player can reach, playing the following game?
1. Choose an index 0 < i < n-1 in the array
2. Add arr[i-1] * arr[i+1] points to the score (initially the score is 0)
3. Shrink the array by removing element i (for all j >= i: arr[j] = arr[j+1]; then n = n - 1)
4. Repeat steps 1-3 until n == 2.
Do the above until there are only 2 elements (which are the first and the last element because you can't remove them).
What is the highest score you can get?
Example
arr = [1 2 3 4]
Choose i=2, get: 2*4 = 8 points, remove 3
Remaining: arr = [1 2 4]
Choose i=1, get 1*4 = 4 points, remove 2
Remaining: arr = [1 4].
The sum of points is 8 + 4 = 12, which is the highest possible score on this example.
I think it is related to Dynamic programming but I'm not sure how to solve it.
This problem has a dynamic programming solution similar to the matrix-chain multiplication problem. You can find further explanation in the book "Introduction to Algorithms", 3rd Edition (Cormen, page 370).
Let's find the optimal substructure property and then use it to construct an optimal solution to the problem from optimal solutions to subproblems.
Notation: Ci..j, where i ≤ j, stands for the elements C[i], C[i+1], ..., C[j].
Definition: A removal sequence for Ci..j is a permutation of i+1,i+2,...,j-1.
A removal sequence for Ci..j is optimal if the score achieved by removing the elements of Ci..j in that order is maximum among all possible removal sequences for Ci..j.
1. Characterize the structure of an optimal solution
If the problem is nontrivial, i.e. i + 1 < j, then any solution has a last removed element, whose index k lies in the range i < k < j. Such a k splits the problem into Ci..k and Ck..j. That is, for some value k, we first remove the non-extremal elements of Ci..k and Ck..j, and then we remove element k. Removing the non-extremal elements of Ci..k doesn't affect the score obtained by removing the non-extremal elements of Ck..j, and vice versa, so the two subproblems are independent. Then, for a given removal sequence in which the k-th element is last, the score of Ci..j is equal to the sum of the scores of Ci..k and Ck..j, plus the score of removing the k-th element (C[i] * C[j], since by then its only remaining neighbours are C[i] and C[j]).
The optimal substructure of this problem is as follows. Suppose there is an optimal removal sequence O for Ci..j that ends at the k-th element; then the ordering of the elements removed from Ci..k must be optimal too. We can prove it by contradiction: if there were a removal sequence for Ci..k that scored higher than the removal subsequence extracted from O for Ci..k, then we could produce another removal sequence for Ci..j with a higher score than the optimal removal sequence (contradiction). A similar observation holds for the ordering of the elements removed from Ck..j in the optimal removal sequence for Ci..j: it must be optimal too.
We can build an optimal solution for nontrivial instances of the problem by splitting the problem into two subproblems, finding optimal solutions to the subproblem instances, and then combining these optimal subproblem solutions.
2. Recursively define the value of an optimal solution.
For this problem our subproblems are the maximum score obtained in Ci..j for 1 ≤ i ≤ j ≤ N. Let S[i, j] be the maximum score obtained in Ci..j; for the full problem, the highest score when evaluating the given rules is S[1, N].
We can define S[i, j] recursively as follows:
If j ≤ i + 1 then S[i, j] = 0
If i + 1 < j then S[i, j] = max_{i < k < j} { S[i, k] + S[k, j] + C[i] * C[j] }
We ensure that we search for the correct place to split because we consider all possible places, so that we are sure of having examined the optimal one.
3. Compute the value of an optimal solution
You can use your favorite method to compute S:
top-down approach (recursive)
bottom-up approach (iterative)
I would use bottom-up for computing the solution since it would be < 5 lines long in almost any programming language.
Example in C++11:
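// S is a zero-initialised (N+1) x (N+1) table, C is 1-indexed; std::max needs <algorithm>
for (int l = 2; l < N; ++l)                       // increasing interval lengths
    for (int i = 1, j = i + l; j <= N; ++i, ++j)
        for (int k = i + 1; k < j; ++k)           // k is the last element removed
            S[i][j] = std::max(S[i][j], S[i][k] + S[k][j] + C[i] * C[j]);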
4. Time Complexity and Space Complexity
There are nC2 + n = Θ(n^2) subproblems, and each subproblem performs an operation whose running time is Θ(l), where l is the length of the subproblem, so the math yields a running time of Θ(n^3) for the algorithm (it's easy to spot the O(n^3) part :-)). Also, the algorithm requires Θ(n^2) space to store the S table.
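For anyone who wants to try the recurrence directly, here is a small bottom-up sketch in Python. It uses 0-based indices instead of the 1-based ones above, and the function name max_removal_score is just my own choice.

```python
def max_removal_score(c):
    """Maximum score of the removal game, via S[i][j] for every interval."""
    n = len(c)
    if n < 3:
        return 0  # nothing can be removed
    # s[i][j] = best score obtainable by removing everything strictly
    # between positions i and j; intervals of length < 2 stay at 0.
    s = [[0] * n for _ in range(n)]
    for length in range(2, n):            # j - i, in increasing order
        for i in range(n - length):
            j = i + length
            # k is the element removed last; its neighbours are then c[i], c[j].
            best_split = max(s[i][k] + s[k][j] for k in range(i + 1, j))
            s[i][j] = best_split + c[i] * c[j]
    return s[0][n - 1]
```

On the example from the question, max_removal_score([1, 2, 3, 4]) gives 12.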
I'm trying to solve the extension to a problem I described in my question: Efficient divide-and-conquer algorithm
For this extension, it is known that representatives of 3 parties are at the event, and that more members of one party are attending than of any other. A formal description of the problem can be found below.
You are given an integer n. There is a hidden array A of size n, which contains elements that can take 1 of 3 values. There is a value, let this be m, that appears more often in the array than the other 2 values.
You are allowed queries of the form introduce(i, j), where i≠j, and 1 <= i, j <= n, and you will get a boolean value in return: You will get back 1, if A[i] = A[j], and 0 otherwise.
Output: B ⊆ [1, 2, ..., n] where the A-value of every element in B is m.
A brute-force solution to this could calculate B in O(n^2) by calling introduce(i, j) on n(n-1) combinations of elements and creating 3 lists containing the A-indexes of elements for which a 1 was returned when introduce was called on them, then returning the largest list.
I understand the Boyer–Moore majority vote algorithm but can't find a way to modify it for this problem or find an efficient algorithm to solve it.
Scan for all i with A[i] = A[0], and make a list I[] of all i for which A[i] != A[0]. Then scan for all A[I[j]] = A[I[0]], and so on. This requires one O(n) scan for each possible value in A[].
[I assume if introduce(i, j) = 1 and introduce(j, k) = 1, then introduce(i, k) = 1 -- so you don't need to check all combinations of elements.]
Of course, this doesn't tell you what 'm' is; it just makes one list per distinct value, where each list holds all the 'i' for which A[i] is the same. The largest list is then the set B the question asks for.
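A rough Python sketch of the scan described above, assuming 0-based indices and an introduce(i, j) callback that returns a truthy value when A[i] = A[j]. The names group_by_equality and most_common_group are made up for illustration. With 3 possible values it makes fewer than 3n calls to introduce.

```python
def group_by_equality(n, introduce):
    """Split indices 0..n-1 into lists of indices that share the same value."""
    remaining = list(range(n))
    groups = []
    while remaining:
        representative = remaining[0]
        same = [representative]
        different = []
        # One linear scan: compare everything left against the representative.
        for idx in remaining[1:]:
            if introduce(representative, idx):
                same.append(idx)
            else:
                different.append(idx)
        groups.append(same)
        remaining = different  # the next scan only looks at the leftovers
    return groups


def most_common_group(n, introduce):
    """Indices of the value that occurs most often (the set B)."""
    return max(group_by_equality(n, introduce), key=len)
```

For testing, introduce can be simulated with a lambda over a known list, e.g. lambda i, j: hidden[i] == hidden[j].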
I just want to know how to get the average number of swaps in the two-color Dutch national flag problem, sorting positive and negative numbers instead of colors. I'm assuming that there are as many negative numbers as positive ones and that the array's numbers are in a random configuration; I'm not sure if my assumption is correct.
Algorithm(A[0…n-1]):
    i ← 0
    j ← n - 1
    while i ≤ j:
        if A[i] < 0:
            i ← i + 1
        else:
            swap(A[i], A[j])
            j ← j - 1
Thank you.
If the distribution of the positives and negatives is uniform, the first element is positive with probability 1/2. After the first iteration, the array is shortened by one element and the distribution of the subarray is still uniform (moving an element is a neutral operation).
There are exactly n iterations before the subarray is empty, thus the average number of swaps is n/2. More precisely, the number of swaps follows a binomial law with parameters 1/2, n (this is a Bernoulli scheme).
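To check this empirically, here is a small Python version of the question's algorithm that also counts swaps (the function name is mine). Every iteration inspects an element that has not been inspected before and swaps exactly when it is non-negative, so the count it returns should equal the number of non-negative elements, including the swap of an element with itself when i = j.

```python
def partition_count_swaps(a):
    """Move negatives to the front in place and return the number of swaps."""
    i, j = 0, len(a) - 1
    swaps = 0
    while i <= j:
        if a[i] < 0:
            i += 1                      # already on the correct side
        else:
            a[i], a[j] = a[j], a[i]     # send the non-negative value to the back
            swaps += 1
            j -= 1
    return swaps
```

For an array with exactly n/2 non-negative entries this gives n/2 swaps whatever the arrangement; with independent random signs the count is binomial as described.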
My goal is to sample k integers from 0, ... n-1 without duplication. The order of the sampled integers doesn't matter. At each call (which occurs very often), n and k will vary slightly, but not much (n is about 250,000 and k is about 2,000). I've come up with the following amortized O(k) algorithm:
1. Prepare an array A with items 0, 1, 2, ... , n-1. This takes O(n), but since n is relatively stable, the cost can be made amortized constant.
2. Sample a random number r from [0:i] where i = n - 1. Here the cost is in fact related to n, but as n is not VERY BIG, this dependency is not critical.
3. Swap the rth item and the ith item in the array A.
4. Decrease i by 1.
5. Repeat steps 2~4 k times; now we have a random permutation of length k at the tail of A. Copy this.
6. We should roll back A to its initial state (0, ... , n-1) to keep the cost of step 1 constant. This can be done by pushing r onto a stack of length k at each pass of step 2. Preparation of the stack requires amortized constant cost.
I think uniform sampling of permutation/combination should be an exhaustively studied problem, so either (1) there is a much better solution, or at least (2) my solution is a (minor modification of) a well-known solution. Thus,
In case (1), I want to know that better solution.
In case (2), I want to find a reference.
Please help me. Thanks.
If k is much less than n -- say, less than half of n -- then the most efficient solution is to keep the numbers generated in a hash table (actually, a hash set, since there is no value associated with a key). If the random number happens to already be in the hash table, reject it and generate another one in its place. With the actual values of k and n suggested (k ∼ 2000; n ∼ 250,000) the expected number of rejections to generate k unique samples is less than 10, so it will hardly be noticeable. The size of the hash table is O(k), and it can simply be deleted at the end of the sample generation.
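A minimal Python sketch of this first alternative; the function name and the rng parameter are my own additions.

```python
import random


def sample_by_rejection(n, k, rng=random):
    """Return a set of k distinct integers drawn uniformly from range(n)."""
    if k > n:
        raise ValueError("cannot draw more than n distinct values")
    chosen = set()
    while len(chosen) < k:
        # Adding a value that is already present is a no-op, which is
        # exactly the "reject and draw again" step.
        chosen.add(rng.randrange(n))
    return chosen
```

The set is simply dropped when the caller is done with it, so nothing has to be restored between calls.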
It is also possible to simulate the FYK shuffle algorithm using a hash table instead of a vector of n values, thereby avoiding having to reject generated random numbers. If you were using a vector A, you would start by initializing A[i] to i, for every 0 ≤ i < n. With the hash table H, you start with an empty hash table, and use the convention that H[i] is considered to be i if the key i is not in the hash table. Step 3 in your algorithm -- "swap A[r] with A[i]" -- becomes "add H[r] as the next element of the sample and set H[r] to H[i]". Note that it is unnecessary to set H[i] because that element will never be referred to again: all subsequent random numbers r are generated from a range which does not include i.
Because the hash table in this case contains both keys and values, it is larger than the hash set used in alternative 1, above, and the increased size (and consequent increase in memory cache misses) is likely to cause more overhead than is saved by eliminating rejections. However, it has the advantage of working even if k is occasionally close to n.
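The hash-table simulation of the shuffle could look roughly like this in Python, using a dict for H. The function name and the rng parameter are assumptions for the sketch.

```python
import random


def sample_sparse_shuffle(n, k, rng=random):
    """Return k distinct integers from range(n) via a simulated FYK shuffle."""
    if k > n:
        raise ValueError("cannot draw more than n distinct values")
    H = {}          # H.get(x, x) plays the role of A[x]
    sample = []
    for i in range(n - 1, n - 1 - k, -1):
        r = rng.randint(0, i)           # inclusive bounds, as in step 2
        sample.append(H.get(r, r))      # the value that would be swapped to A[i]
        H[r] = H.get(i, i)              # A[r] receives the old A[i]
        # H[i] is never written: later values of r are always below i.
    return sample
```

The dict holds at most k entries, so memory stays O(k) even when k is close to n.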
Finally, in your proposed algorithm, it is actually quite easy to restore A in O(k) time. A value A[j] will have been modified by the algorithm only if:
a. n − k ≤ j < n, or
b. there is some i such that n − k ≤ i < n and A[i] = j.
Consequently, you can restore the vector A by looking at each A[i] for n − k ≤ i < n: first, if A[i] < n−k, set A[A[i]] to A[i]; then, unconditionally set A[i] to i.
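Put together with the sampling loop from the question, the O(k) restore might be sketched like this in Python. It assumes A starts out as [0, 1, ..., n-1]; the function name is hypothetical.

```python
import random


def sample_and_restore(A, k, rng=random):
    """Draw k distinct values using A as scratch space, then undo all changes."""
    n = len(A)
    for i in range(n - 1, n - 1 - k, -1):
        r = rng.randint(0, i)
        A[r], A[i] = A[i], A[r]
    sample = A[n - k:]                  # copy of the shuffled tail
    # Restore: only the tail and the positions named by tail values changed.
    for i in range(n - k, n):
        v = A[i]
        if v < n - k:
            A[v] = v                    # case b: a low position was overwritten
        A[i] = i                        # case a: reset the tail slot itself
    return sample
```

No stack of the r values is needed, because the tail itself records which low positions were touched.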
Given an array A with N elements, I need to find the pair (i,j), with i not equal to j, whose sum A[i]+A[j] comes at the kth position when the sums for all pairs (i,j) are written out in sorted order.
Example: let N=4 and A=[1 2 3 4]. If K=3 then the answer is 5, since the sum array becomes [3,4,5,5,6,7].
I can't enumerate all pairs of i and j, as N can go up to 100000. Please help me solve this problem.
I mean something like this :
int len=N*(N-1)/2; //number of pairs with i<j
int sum[len];
int count=0;
for(int i=0;i<N;i++){
    for(int j=i+1;j<N;j++){
        sum[count]=A[i]+A[j];
        count++;
    }
}
//Then just find kth element.
We can't go with this approach
A solution that is based on the fact that K <= 50: let's take the first K + 1 elements of the array in sorted order. Now we can just try all their combinations. Proof of correctness: let's assume that a pair (i, j) is the answer, where j > K + 1. But there are K pairs with the same or smaller sum: (1, 2), (1, 3), ..., (1, K + 1). Thus, it cannot be the K-th pair.
It is possible to achieve an O(N + K ^ 2) time complexity by choosing the K + 1 smallest numbers using a quickselect algorithm (it is possible to do even better, but it is not required). You can also just sort the array and get an O(N * log N + K ^ 2 * log K) complexity.
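As a sketch of this small-K approach in Python: heapq.nsmallest stands in for the quickselect step (so the selection costs O(N log K) rather than O(N)), K is 1-based as in the question, and the function name is my own.

```python
import heapq
from itertools import combinations


def kth_pair_sum_small_k(values, k):
    """K-th smallest pairwise sum (1-based), intended for small k."""
    # Only the k + 1 smallest elements can take part in the k-th smallest sum.
    smallest = heapq.nsmallest(k + 1, values)
    sums = sorted(x + y for x, y in combinations(smallest, 2))
    return sums[k - 1]  # raises IndexError if k exceeds the number of pairs
```

With the question's example, kth_pair_sum_small_k([1, 2, 3, 4], 3) returns 5.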
I assume that you got this question from http://www.careercup.com/question?id=7457663.
If k is close to 0 then the accepted answer to How to find kth largest number in pairwise sums like setA + setB? can be adapted quite easily to this problem and be quite efficient. You need O(n log(n)) to sort the array, O(n) to set up a priority queue, and then O(k log(k)) to iterate through the elements. The reversed solution is also efficient if k is near n*n - n.
If k is close to n*n/2 then that won't be very good. But you can adapt the pivot approach of http://en.wikipedia.org/wiki/Quickselect to this problem. First, in time O(n log(n)), you can sort the array. In time O(n) you can set up a data structure representing the various contiguous ranges of columns. Then you'll need to select pivots O(log(n)) times. (Remember, log(n*n) = O(log(n)).) For each pivot, you can do a binary search of each column to figure out where the pivot splits it, in time O(log(n)) per column, for a total cost of O(n log(n)) for all columns.
The resulting algorithm will be O(n log(n) log(n)).
Update: I do not have time to do the finger exercise of supplying code. But I can outline some of the classes you might have in an implementation.
The implementation will be a bit verbose, but that is sometimes the cost of a good general-purpose algorithm.
ArrayRangeWithAddend. This represents a range of an array, summed with one value. It has an array (reference or pointer so the underlying data can be shared between objects), a start and an end to the range, and a shiftValue for the value to add to every element in the range.
It should have a constructor. A method to give the size. A method to partition(n) it into a range less than n, the count equal to n, and a range greater than n. And value(i) to give the i'th value.
ArrayRangeCollection. This is a collection of ArrayRangeWithAddend objects. It should have methods to give its size, pick a random element, and a method to partition(n) it into an ArrayRangeCollection that is below n, count of those equal to n, and an ArrayRangeCollection that is larger than n. In the partition method it will be good to not include ArrayRangeWithAddend objects that have size 0.
Now your main program can sort the array, and create an ArrayRangeCollection covering all pairs of sums that you are interested in. Then the random and partition method can be used to implement the standard quickselect algorithm that you will find in the link I provided.
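This is not the quickselect design outlined above. As a simpler stand-in that also avoids enumerating all pairs, here is a Python sketch that binary-searches on the value of the sum and counts pairs with a two-pointer sweep over the sorted array. It assumes integer inputs, at least two elements, and a 1-based k over pairs with i < j; both function names are mine.

```python
def count_pairs_at_most(a, x):
    """Number of pairs i < j in the sorted list a with a[i] + a[j] <= x."""
    count = 0
    lo, hi = 0, len(a) - 1
    while lo < hi:
        if a[lo] + a[hi] <= x:
            count += hi - lo     # a[lo] pairs with everything in (lo, hi]
            lo += 1
        else:
            hi -= 1
    return count


def kth_pair_sum(values, k):
    """k-th smallest pairwise sum (1-based) by binary search on the answer."""
    a = sorted(values)
    low, high = a[0] + a[1], a[-2] + a[-1]
    while low < high:
        mid = (low + high) // 2
        if count_pairs_at_most(a, mid) >= k:
            high = mid           # the answer is at most mid
        else:
            low = mid + 1        # fewer than k sums are <= mid
    return low
```

The cost is O(n log(n)) for the sort plus O(n) per binary-search step, so it depends on the range of the sums rather than on k.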
Here is how to do it (in pseudo-code). I have now confirmed that it works correctly.
//A is the original array, such as A=[1,2,3,4]
//k (an integer) is the element in the 'sum' array to find
N = A.length
//first we find i
i = -1
nl = N
k2 = k
while (k2 >= 0) {
    i++
    nl--
    k2 -= nl
}
//then we find j
j = k2 + nl + i + 1
//now compute the sum at index position k
kSum = A[i] + A[j]
EDIT:
I have now tested this works. I had to fix some parts... basically the k input argument should use 0-based indexing. (The OP seems to use 1-based indexing.)
EDIT 2:
I'll try to explain my theory then. I began with the concept that the sum array should be visualised as a 2D jagged array (diminishing in width as the height increases), with the coordinates (as mentioned in the OP) being i and j. So for an array such as [1,2,3,4,5] the sum array would be conceived as this:
3,4,5,6,
5,6,7,
7,8,
9.
The top row contains all values where i equals 0. The second row is where i equals 1. To find the value of 'j' we do the same but in the column direction.
... Sorry I cannot explain this any better!
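The index arithmetic in the pseudo-code above can be checked with a direct Python translation (0-based k; the function name and the range check are my additions). Note that it locates the k-th entry in the order the nested loops fill the sum array, row by row, which is not necessarily sorted order: for [1,2,3,4,5] the rows flatten to 3,4,5,6,5,6,7,...

```python
def pair_from_flat_index(n, k):
    """Map a 0-based position k in the row-by-row sum array to its (i, j)."""
    if not 0 <= k < n * (n - 1) // 2:
        raise IndexError("k is outside the sum array")
    i = -1
    row_len = n
    remaining = k
    # Skip whole rows: row i has n - 1 - i entries.
    while remaining >= 0:
        i += 1
        row_len -= 1
        remaining -= row_len
    # remaining + row_len is now the offset inside row i.
    j = remaining + row_len + i + 1
    return i, j
```

The sum at that position is then A[i] + A[j].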