Maximize minimum distance between arrays - algorithm

Lets say that you are given n sorted arrays of numbers and you need to pick one number from each array such that the minimum distance between the n chosen elements is maximized.
Example:
arrays:
[0, 500]
[100, 350]
[200]
2<=n<=10 and every array could have ~10^3-10^4 elements.
In this example the optimal solution to maximize minimum distance is pick numbers: 500, 350, 200 or 0, 200, 350 where min distance is 150 and is the maximum possible of every combination.
I am looking for an algorithm to solve this. I know that I could binary search the max min distance but I can't see how to decide is there is a solution with max min distance of at least d, in order for the binary search to work. I am thinking maybe dynamic programming could help but haven't managed to find a solution with dp.
Of course generating all combination with n elements is not efficient. I have already tried backtracking but it is slow since it tries every combination.

n ≤ 10 suggests that we can take an exponential dependence on n. Here's
an O(2n m n)-time algorithm where m is the total size of the
arrays.
The dynamic programming approach I have in mind is, for each subset of
arrays, calculate all of the pairs (maximum number, minimum distance) on
the efficient frontier, where we have to choose one number from each of
the arrays in the subset. By efficient frontier I mean that if we have
two pairs (a, b) ≠ (c, d) with a ≤ c and b ≥ d, then (c, d) is not on
the efficient frontier. We'll want to keep these frontiers sorted for
fast merges.
The base case with the empty subset is easy: there's one pair, (minimum
distance = ∞, maximum number = −∞).
For every nonempty subset of arrays in some order that extends the
inclusion order, we compute a frontier for each array in the subset,
representing the subset of solutions where that array contributes the
maximum number. Then we merge these frontiers. (Naively this costs us
another factor of log n, which maybe isn't worth the hassle to avoid
given that n ≤ 10, but we can avoid it by merging the arrays once at the
beginning to enable future merges to use bucketing.)
To construct a new frontier from a subset of arrays and another array
also involves a merge. We initialize an iterator at the start of the
frontier (i.e., least maximum number) and an iterator at the start of
the array (i.e., least number). While neither iterator is past the end,
Emit a candidate pair (min(minimum distance, array number − maximum
number), array number).
If the min was less than or equal to minimum distance, increment the
frontier iterator. If the min was less than or equal to array number
− maximum number, increment the array iterator.
Cull the candidate pairs to leave only the efficient frontier. There is
an elegant way to do this in code that is more trouble to explain.

I am going to give an algorithm that for a given distance d, will output whether it is possible to make a selection where the distance between any pair of chosen numbers is at least d. Then, you can binary-search the maximum d for which the algorithm outputs "YES", in order to find the answer to your problem.
Assume the minimum distance d be given. Here is the algorithm:
for every permutation p of size n do:
last := -infinity
ok := true
for p_i in p do:
x := take the smallest element greater than or equal to last+d in the p_i^th array (can be done efficiently with binary search).
if no such x was found; then
ok = false
break
end
last = x
done
if ok; then
return "YES"
end
done
return "NO"
So, we brute-force the order of arrays. Then, for every possible order, we use a greedy method to choose elements from each array, following the order. For example, take the example you gave:
arrays:
[0, 500]
[100, 350]
[200]
and assume d = 150. For the permutation 1 3 2, we first take 0 from the 1st array, then we find the smallest element in the 3rd array that is greater than or equal to 0+150 (it is 200), then we find the smallest element in the 2nd array which is greater than or equal to 200+150 (it is 350). Since we could find an element from every array, the algorithm outputs "YES". But for d = 200 for instance, the algorithm would output "NO" because none of the possible orderings would result in a successful selection.
The complexity for the above algorithm is O(n! * n * log(m)) where m is the maximum number of elements in an array. I believe it would be sufficient, since n is very small. (For m = 10^4, 10! * 10 * 13 ~ 5*10^8. It can be computed under a second on a modern CPU.)

Lets look at an example with optimal choices, x (horizontal arrays A, B, C, D):
A x
B b x b
C x c
D d x
Our recurrence based on range could be: let f(low, excluded) represent the maximum closest distance between two chosen elements (from arrays 1 to n) of the subset without elements in excluded, where low is the lowest chosen element. Then:
(1)
f(low, excluded) when |excluded| = n-1:
max(low)
for low in the only permitted array
(2)
f(low, excluded):
max(
min(
a - low,
f(a, excluded')
)
)
for a ≥ low, a not in excluded'
where excluded' = excluded ∪ {low's array}
We can limit a. For one thing the maximum we can achieve is
(3)
m = (highest - low) / (n - |excluded| - 1)
which means a need not go higher than low + m.
Secondly, we can store results for all f(a, excluded'), keyed by excluded' (we have 2^10 possible keys), each in a decorated binary tree ordered by a. The decoration will be the highest result achievable in the right subtree, meaning we can find the max for all f(v, excluded'), v ≥ a in logarithmic time.
The latter establishes a dominance relationship and clearly we are intetested in both a larger a and a larger f(a, excluded') so as to maximise the min function in (2). Picking an a in the middle, we can use a binary search. If we have:
a - low < max(v, excluded'), v ≥ a
where max(v, excluded') is the lookup
for a in the decorated tree
then we look to the right since max(v, excluded) indicates there's a better answer on the right, where a - low is also larger.
And if we have:
a - low ≥ max(v, excluded), v ≥ a
then we record this candidate and look to the left since to the right, the answer is fixed at max(v, excluded), given that a - low could not decrease.
In order to conduct the binary search on the range, [low, low + m] (see (3)), rather than merge and label all the arrays at the outset, we can keep them separate and compare the closest candidates to mid out of each array we are currently permitted to choose a from. (The trees have the mixed results, keyed by subset.) (The flow of this part is not completely clear to me.)
Worst case with this method, given that n = C is constant seems to be
O(C * array_length * 2^C * C * log(array_length) * log(C * array_length))
C * array_length is the iteration on low
Each low can be paired with 2^C inclusions
C * log(array_length) is the separated binary-search
And log(C * array_length) is the tree lookup
Simplifying:
= O(array_length * log^2(array_length))
although in practice, there could be many dead-end branches that exit early where a full selection wouldn't be possible.
In case, it wasn't clear, the iteration is on a fixed lowest element in the selection. In other words, we want the best f(low, excluded) for all different lows (and excludeds). For bottom-up, we would iterate from the highest value down so our results for a get stored as we iterate.

Related

Integer partition weighted minimum of maximum

Given a non-negative integer n and a positive real weight vector w with dimension m, partition n into a length-m non-negative integer vector that sums to n (call it v) such that max w_iv_i is the smallest, that is, we want to find the vector v such that the maximum of element-wise product between w and v is the smallest. There maybe several partitions, and we only want the smallest value of max w_iv_i among all possible v.
Seems like this problem can use a greedy algorithm to solve. From a target vector v for n-1, we add 1 to each entry, and find the minimum among those m vectors. but I don't think it's correct. The intuition is that it might add "over" the minimum. That is, there exists another partition not yielded by the add 1 procedure that falls in between the "minimum" of n-1 produced by this greedy algorithm and that of n produced by this greedy algorithm. Can anyone prove if this is correct or incorrect?
If you already know the maximum element-wise product P, then you can just set vi = floor(P/wi) until you run out of n.
Use binary search to find the smallest possible value of P.
The largest guess you need to try is n * min(w), so that means testing log(n) + log(min(w)) candidates, spending O(m) time for each test, or O(m*(log n + log(min(w))) all together.

How to find 2 special elements in the array in O(n)

Let a1,...,an be a sequence of real numbers. Let m be the minimum of the sequence, and let M be the maximum of the sequence.
I proved that there exists 2 elements in the sequence, x,y, such that |x-y|<=(M-m)/n.
Now, is there a way to find an algorithm that finds such 2 elements in time complexity of O(n)?
I thought about sorting the sequence, but since I dont know anything about M I cannot use radix/bucket or any other linear time algorithm that I'm familier with.
I'd appreciate any idea.
Thanks in advance.
First find out n, M, m. If not already given they can be determined in O(n).
Then create a memory storage of n+1 elements; we will use the storage for n+1 buckets with width w=(M-m)/n.
The buckets cover the range of values equally: Bucket 1 goes from [m; m+w[, Bucket 2 from [m+w; m+2*w[, Bucket n from [m+(n-1)*w; m+n*w[ = [M-w; M[, and the (n+1)th bucket from [M; M+w[.
Now we go once through all the values and sort them into the buckets according to the assigned intervals. There should be at a maximum 1 element per bucket. If the bucket is already filled, it means that the elements are closer together than the boundaries of the half-open interval, e.g. we found elements x, y with |x-y| < w = (M-m)/n.
If no such two elements are found, afterwards n buckets of n+1 total buckets are filled with one element. And all those elements are sorted.
We once more go through all the buckets and compare the distance of the content of neighbouring buckets only, whether there are two elements, which fulfil the condition.
Due to the width of the buckets, the condition cannot be true for buckets, which are not adjoining: For those the distance is always |x-y| > w.
(The fulfilment of the last inequality in 4. is also the reason, why the interval is half-open and cannot be closed, and why we need n+1 buckets instead of n. An alternative would be, to use n buckets and make the now last bucket a special case with [M; M+w]. But O(n+1)=O(n) and using n+1 steps is preferable to special casing the last bucket.)
The running time is O(n) for step 1, 0 for step 2 - we actually do not do anything there, O(n) for step 3 and O(n) for step 4, as there is only 1 element per bucket. Altogether O(n).
This task shows, that either sorting of elements, which are not close together or coarse sorting without considering fine distances can be done in O(n) instead of O(n*log(n)). It has useful applications. Numbers on computers are discrete, they have a finite precision. I have sucessfuly used this sorting method for signal-processing / fast sorting in real-time production code.
About #Damien 's remark: The real threshold of (M-m)/(n-1) is provably true for every such sequence. I assumed in the answer so far the sequence we are looking at is a special kind, where the stronger condition is true, or at least, for all sequences, if the stronger condition was true, we would find such elements in O(n).
If this was a small mistake of the OP instead (who said to have proven the stronger condition) and we should find two elements x, y with |x-y| <= (M-m)/(n-1) instead, we can simplify:
-- 3. We would do steps 1 to 3 like above, but with n buckets and the bucket width set to w = (M-m)/(n-1). The bucket n now goes from [M; M+w[.
For step 4 we would do the following alternative:
4./alternative: n buckets are filled with one element each. The element at bucket n has to be M and is at the left boundary of the bucket interval. The distance of this element y = M to the element x in the n-1th bucket for every such possible element x in the n-1thbucket is: |M-x| <= w = (M-m)/(n-1), so we found x and y, which fulfil the condition, q.e.d.
First note that the real threshold should be (M-m)/(n-1).
The first step is to calculate the min m and max M elements, in O(N).
You calculate the mid = (m + M)/2value.
You concentrate the value less than mid at the beginning, and more than mid at the end of he array.
You select the part with the largest number of elements and you iterate until very few numbers are kept.
If both parts have the same number of elements, you can select any of them. If the remaining part has much more elements than n/2, then in order to maintain a O(n) complexity, you can keep onlyn/2 + 1 of them, as the goal is not to find the smallest difference, but one difference small enough only.
As indicated in a comment by #btilly, this solution could fail in some cases, for example with an input [0, 2.1, 2.9, 5]. For that, it is needed to calculate the max value of the left hand, and the min value of the right hand, and to test if the answer is not right_min - left_max. This doesn't change the O(n) complexity, even if the solution becomes less elegant.
Complexity of the search procedure: O(n) + O(n/2) + O(n/4) + ... + O(2) = O(2n) = O(n).
Damien is correct in his comment that the correct results is that there must be x, y such that |x-y| <= (M-m)/(n-1). If you have the sequence [0, 1, 2, 3, 4] you have 5 elements, but no two elements are closer than (M-m)/n = (4-0)/5 = 4/5.
With the right threshold, the solution is easy - find M and m by scanning through the input once, and then bucket the input into (n-1) buckets of size (M-m)/(n-1), putting values that are on the boundaries of a pair of buckets into both buckets. At least one bucket must have two values in it by the pigeon-hole principle.

Finding longest overlapping interval pair

Say I have a list of n integral intervals [a,b] each representing set S = {a, a+1, ...b}. An overlap is defined as |S_1 \cap S_2|. Example: [3,6] and [5,9] overlap on [5,6] so the length of that is 2. The task is to find two intervals with the longest overlap in Little-O(n^2) using just recursion and not dynamic programming.
Naive approach is obviously brute force, which does not hold with time complexity condition. I was also unsuccessful trying sweep line algo and/or Longest common subsequence algorithm.
I just cannot find a way of dividing it into subproblems. Any ideas would be appreciated.
Also found this, which in my opinion does not work at all:
Finding “maximum” overlapping interval pair in O(nlog(n))
Here is an approach that takes N log(N) time.
Breakdown every interval [a,b] [c,d] into an array of pair like this:
pair<a,-1>
pair<b,a>
pair<c,-1>
pair<d,c>
sort these pairs in increasing order. Since interval starts are marked as -1, in case of ties interval they should come ahead of interval ends.
for i = 0 to end of the pair array
if current pair represents interval start
put it in a multiset
else
remove the interval start corresponding to this interval end from the multiset.
if the multiset is not empty
update the maxOverlap with (current_interval_end - max(minimum_value_in_multiset,start_value_of_current_interval)+1)
This approach should update the maxOverlap to the highest possible value.
Keep info about the two largest overlapping intervals max1 and max2 (empty in the beginning).
Sort the input list [x1, y1] .. [xn, yn] = I1..In by the value x, discarding the shorter of two intervals if equality is encountered. While throwing intervals out, keep max1 and max2 updated.
For each interval, add an attribute max in linear time, showing the largest y value of all preceding intervals (in sorted list):
rollmax = −∞
for j = 1..n do
Ij.max = rollmax
rollmax = max(rollmax, Ij.y)
On sorted, filtered, and expanded input list perform the following query. It uses an ever expanding sublist of intervals smaller then currently searched for interval Ii as input into recursive function SearchOverlap.
for i = 2..n do
SearchOverlap(Ii, 1, i − 1)
return {max1, max2}
Function SearchOverlap uses divide and conquer approach to traverse the sorted list Il, .. Ir. It imagines such list as a complete binary tree, with interval Ic as its local root. The test Ic.max < I.max is used to always decide to traverse the binary tree (go left/right) in direction of interval with largest overlap with I. Note, that I is the queried for interval, which is compared to log(n) other intervals. Also note, that the largest possible overlapping interval might be passed in such traversal, hence the check for largest overlap in the beginning of function SearchOverlap.
SearchOverlap(I , l, r)
c = ceil(Avg(l, r)) // Central element of queried list
if Overlap(max1, max2) < Overlap(I , Ic) then
max1 = I
max2 = Ic
if l ≥ r then
return
if Ic.max < I.max then
SearchOverlap(I , c + 1, r)
else
SearchOverlap(I , l, c − 1)
return
Largest overlapping intervals (if not empty) are returned at the end. Total complexity is O(n log(n)).

How to find pair with kth largest sum?

Given two sorted arrays of numbers, we want to find the pair with the kth largest possible sum. (A pair is one element from the first array and one element from the second array). For example, with arrays
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
The pairs with largest sums are
13 + 16 = 29
13 + 12 = 25
8 + 16 = 24
13 + 8 = 21
8 + 12 = 20
So the pair with the 4th largest sum is (13, 8). How to find the pair with the kth largest possible sum?
Also, what is the fastest algorithm? The arrays are already sorted and sizes M and N.
I am already aware of the O(Klogk) solution , using Max-Heap given here .
It also is one of the favorite Google interview question , and they demand a O(k) solution .
I've also read somewhere that there exists a O(k) solution, which i am unable to figure out .
Can someone explain the correct solution with a pseudocode .
P.S.
Please DON'T post this link as answer/comment.It DOESN'T contain the answer.
I start with a simple but not quite linear-time algorithm. We choose some value between array1[0]+array2[0] and array1[N-1]+array2[N-1]. Then we determine how many pair sums are greater than this value and how many of them are less. This may be done by iterating the arrays with two pointers: pointer to the first array incremented when sum is too large and pointer to the second array decremented when sum is too small. Repeating this procedure for different values and using binary search (or one-sided binary search) we could find Kth largest sum in O(N log R) time, where N is size of the largest array and R is number of possible values between array1[N-1]+array2[N-1] and array1[0]+array2[0]. This algorithm has linear time complexity only when the array elements are integers bounded by small constant.
Previous algorithm may be improved if we stop binary search as soon as number of pair sums in binary search range decreases from O(N2) to O(N). Then we fill auxiliary array with these pair sums (this may be done with slightly modified two-pointers algorithm). And then we use quickselect algorithm to find Kth largest sum in this auxiliary array. All this does not improve worst-case complexity because we still need O(log R) binary search steps. What if we keep the quickselect part of this algorithm but (to get proper value range) we use something better than binary search?
We could estimate value range with the following trick: get every second element from each array and try to find the pair sum with rank k/4 for these half-arrays (using the same algorithm recursively). Obviously this should give some approximation for needed value range. And in fact slightly improved variant of this trick gives range containing only O(N) elements. This is proven in following paper: "Selection in X + Y and matrices with sorted rows and columns" by A. Mirzaian and E. Arjomandi. This paper contains detailed explanation of the algorithm, proof, complexity analysis, and pseudo-code for all parts of the algorithm except Quickselect. If linear worst-case complexity is required, Quickselect may be augmented with Median of medians algorithm.
This algorithm has complexity O(N). If one of the arrays is shorter than other array (M < N) we could assume that this shorter array is extended to size N with some very small elements so that all calculations in the algorithm use size of the largest array. We don't actually need to extract pairs with these "added" elements and feed them to quickselect, which makes algorithm a little bit faster but does not improve asymptotic complexity.
If k < N we could ignore all the array elements with index greater than k. In this case complexity is equal to O(k). If N < k < N(N-1) we just have better complexity than requested in OP. If k > N(N-1), we'd better solve the opposite problem: k'th smallest sum.
I uploaded simple C++11 implementation to ideone. Code is not optimized and not thoroughly tested. I tried to make it as close as possible to pseudo-code in linked paper. This implementation uses std::nth_element, which allows linear complexity only on average (not worst-case).
A completely different approach to find K'th sum in linear time is based on priority queue (PQ). One variation is to insert largest pair to PQ, then repeatedly remove top of PQ and instead insert up to two pairs (one with decremented index in one array, other with decremented index in other array). And take some measures to prevent inserting duplicate pairs. Other variation is to insert all possible pairs containing largest element of first array, then repeatedly remove top of PQ and instead insert pair with decremented index in first array and same index in second array. In this case there is no need to bother about duplicates.
OP mentions O(K log K) solution where PQ is implemented as max-heap. But in some cases (when array elements are evenly distributed integers with limited range and linear complexity is needed only on average, not worst-case) we could use O(1) time priority queue, for example, as described in this paper: "A Complexity O(1) Priority Queue for Event Driven Molecular Dynamics Simulations" by Gerald Paul. This allows O(K) expected time complexity.
Advantage of this approach is a possibility to provide first K elements in sorted order. Disadvantages are limited choice of array element type, more complex and slower algorithm, worse asymptotic complexity: O(K) > O(N).
EDIT: This does not work. I leave the answer, since apparently I am not the only one who could have this kind of idea; see the discussion below.
A counter-example is x = (2, 3, 6), y = (1, 4, 5) and k=3, where the algorithm gives 7 (3+4) instead of 8 (3+5).
Let x and y be the two arrays, sorted in decreasing order; we want to construct the K-th largest sum.
The variables are: i the index in the first array (element x[i]), j the index in the second array (element y[j]), and k the "order" of the sum (k in 1..K), in the sense that S(k)=x[i]+y[j] will be the k-th greater sum satisfying your conditions (this is the loop invariant).
Start from (i, j) equal to (0, 0): clearly, S(1) = x[0]+y[0].
for k from 1 to K-1, do:
if x[i+1]+ y[j] > x[i] + y[j+1], then i := i+1 (and j does not change) ; else j:=j+1
To see that it works, consider you have S(k) = x[i] + y[j]. Then, S(k+1) is the greatest sum which is lower (or equal) to S(k), and such as at least one element (i or j) changes. It is not difficult to see that exactly one of i or j should change.
If i changes, the greater sum you can construct which is lower than S(k) is by setting i=i+1, because x is decreasing and all the x[i'] + y[j] with i' < i are greater than S(k). The same holds for j, showing that S(k+1) is either x[i+1] + y[j] or x[i] + y[j+1].
Therefore, at the end of the loop you found the K-th greater sum.
tl;dr: If you look ahead and look behind at each iteration, you can start with the end (which is highest) and work back in O(K) time.
Although the insight underlying this approach is, I believe, sound, the code below is not quite correct at present (see comments).
Let's see: first of all, the arrays are sorted. So, if the arrays are a and b with lengths M and N, and as you have arranged them, the largest items are in slots M and N respectively, the largest pair will always be a[M]+b[N].
Now, what's the second largest pair? It's going to have perhaps one of {a[M],b[N]} (it can't have both, because that's just the largest pair again), and at least one of {a[M-1],b[N-1]}. BUT, we also know that if we choose a[M-1]+b[N-1], we can make one of the operands larger by choosing the higher number from the same list, so it will have exactly one number from the last column, and one from the penultimate column.
Consider the following two arrays: a = [1, 2, 53]; b = [66, 67, 68]. Our highest pair is 53+68. If we lose the smaller of those two, our pair is 68+2; if we lose the larger, it's 53+67. So, we have to look ahead to decide what our next pair will be. The simplest lookahead strategy is simply to calculate the sum of both possible pairs. That will always cost two additions, and two comparisons for each transition (three because we need to deal with the case where the sums are equal);let's call that cost Q).
At first, I was tempted to repeat that K-1 times. BUT there's a hitch: the next largest pair might actually be the other pair we can validly make from {{a[M],b[N]}, {a[M-1],b[N-1]}. So, we also need to look behind.
So, let's code (python, should be 2/3 compatible):
def kth(a,b,k):
M = len(a)
N = len(b)
if k > M*N:
raise ValueError("There are only %s possible pairs; you asked for the %sth largest, which is impossible" % M*N,k)
(ia,ib) = M-1,N-1 #0 based arrays
# we need this for lookback
nottakenindices = (0,0) # could be any value
nottakensum = float('-inf')
for i in range(k-1):
optionone = a[ia]+b[ib-1]
optiontwo = a[ia-1]+b[ib]
biggest = max((optionone,optiontwo))
#first deal with look behind
if nottakensum > biggest:
if optionone == biggest:
newnottakenindices = (ia,ib-1)
else: newnottakenindices = (ia-1,ib)
ia,ib = nottakenindices
nottakensum = biggest
nottakenindices = newnottakenindices
#deal with case where indices hit 0
elif ia <= 0 and ib <= 0:
ia = ib = 0
elif ia <= 0:
ib-=1
ia = 0
nottakensum = float('-inf')
elif ib <= 0:
ia-=1
ib = 0
nottakensum = float('-inf')
#lookahead cases
elif optionone > optiontwo:
#then choose the first option as our next pair
nottakensum,nottakenindices = optiontwo,(ia-1,ib)
ib-=1
elif optionone < optiontwo: # choose the second
nottakensum,nottakenindices = optionone,(ia,ib-1)
ia-=1
#next two cases apply if options are equal
elif a[ia] > b[ib]:# drop the smallest
nottakensum,nottakenindices = optiontwo,(ia-1,ib)
ib-=1
else: # might be equal or not - we can choose arbitrarily if equal
nottakensum,nottakenindices = optionone,(ia,ib-1)
ia-=1
#+2 - one for zero-based, one for skipping the 1st largest
data = (i+2,a[ia],b[ib],a[ia]+b[ib],ia,ib)
narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
print (narrative) #this will work in both versions of python
if ia <= 0 and ib <= 0:
raise ValueError("Both arrays exhausted before Kth (%sth) pair reached"%data[0])
return data, narrative
For those without python, here's an ideone: http://ideone.com/tfm2MA
At worst, we have 5 comparisons in each iteration, and K-1 iterations, which means that this is an O(K) algorithm.
Now, it might be possible to exploit information about differences between values to optimise this a little bit, but this accomplishes the goal.
Here's a reference implementation (not O(K), but will always work, unless there's a corner case with cases where pairs have equal sums):
import itertools
def refkth(a,b,k):
(rightia,righta),(rightib,rightb) = sorted(itertools.product(enumerate(a),enumerate(b)), key=lamba((ia,ea),(ib,eb):ea+eb)[k-1]
data = k,righta,rightb,righta+rightb,rightia,rightib
narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
print (narrative) #this will work in both versions of python
return data, narrative
This calculates the cartesian product of the two arrays (i.e. all possible pairs), sorts them by sum, and takes the kth element. The enumerate function decorates each item with its index.
The max-heap algorithm in the other question is simple, fast and correct. Don't knock it. It's really well explained too. https://stackoverflow.com/a/5212618/284795
Might be there isn't any O(k) algorithm. That's okay, O(k log k) is almost as fast.
If the last two solutions were at (a1, b1), (a2, b2), then it seems to me there are only four candidate solutions (a1-1, b1) (a1, b1-1) (a2-1, b2) (a2, b2-1). This intuition could be wrong. Surely there are at most four candidates for each coordinate, and the next highest is among the 16 pairs (a in {a1,a2,a1-1,a2-1}, b in {b1,b2,b1-1,b2-1}). That's O(k).
(No it's not, still not sure whether that's possible.)
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
Merge the 2 arrays and note down the indexes in the sorted array. Here is the index array looks like (starting from 1 not 0)
[1, 2, 4, 6, 8]
[3, 5, 7, 9]
Now start from end and make tuples. sum the elements in the tuple and pick the kth largest sum.
public static List<List<Integer>> optimization(int[] nums1, int[] nums2, int k) {
// 2 * O(n log(n))
Arrays.sort(nums1);
Arrays.sort(nums2);
List<List<Integer>> results = new ArrayList<>(k);
int endIndex = 0;
// Find the number whose square is the first one bigger than k
for (int i = 1; i <= k; i++) {
if (i * i >= k) {
endIndex = i;
break;
}
}
// The following Iteration provides at most endIndex^2 elements, and both arrays are in ascending order,
// so k smallest pairs must can be found in this iteration. To flatten the nested loop, refer
// 'https://stackoverflow.com/questions/7457879/algorithm-to-optimize-nested-loops'
for (int i = 0; i < endIndex * endIndex; i++) {
int m = i / endIndex;
int n = i % endIndex;
List<Integer> item = new ArrayList<>(2);
item.add(nums1[m]);
item.add(nums2[n]);
results.add(item);
}
results.sort(Comparator.comparing(pair->pair.get(0) + pair.get(1)));
return results.stream().limit(k).collect(Collectors.toList());
}
Key to eliminate O(n^2):
Avoid cartesian product(or 'cross join' like operation) of both arrays, which means flattening the nested loop.
Downsize iteration over the 2 arrays.
So:
Sort both arrays (Arrays.sort offers O(n log(n)) performance according to Java doc)
Limit the iteration range to the size which is just big enough to support k smallest pairs searching.

What shuffling algorithms exist besides Fisher-Yates and finding the "next permutation?"

Specifically in the domain of one-dimensional sets of items of the same type, such as a vector of integers.
Say, for example, you had a vector of size 32,768 containing the sorted integers 0 through 32,767.
What I mean by "next permutation" is performing the next permutation in a lexical ordering system.
Wikipedia lists two, and I'm wondering if there are any more (besides something bogo :P)
O(N) implementation
This is based on Eyal Schneider's mapping Zn! -> P(n)
def get_permutation(k, lst):
N = len(lst)
while N:
next_item = k/f(N-1)
lst[N-1], lst[next_item] = lst[next_item], lst[N-1]
k = k - next_item*f(N-1)
N = N-1
return lst
It reduces his O(N^2) algorithm by integrating the conversion step with finding the permutation. It essentially has the same form as Fisher-Yates but replaces a call to random with the next step of the mapping. If the mapping is in fact a bijection (which I'm working to prove) then this is a better algorithm than Fisher-Yates because it only calls out to pseudo random number generator once and so will be more efficient. Note also that this returns the action of permutation (N! - k) rather than permutation k but that's of little consequence because if k is uniform on [0, N!], then so is N! - k.
old answer
This is slightly related to the idea of "next" permutation. If the items can be well ordered, then one can construct lexicographical ordering on the permutations. This allows you to construct a map from the integers into the space of permutations.
Then finding a random permutation is equivalent to choosing a random integer between 0 and N! and constructing the corresponding permutation. This algorithm will be as efficient as (and as difficult to implement) as calculating the n'th permutation of the set in question. This trivially gives a uniform choice of permutation if our choice of n is uniform.
A little more detail about ordering the permutations. given a set S = {a b c d}, mathematicians view the set of permutations of S as a group with the operation of composition. if p is one permutation, lets say (b a c d), then p operates on S by taking b to a, a to c, c to d and d to b. if q is another permutation, lets say (d b c a) then pq is obtained by first applying q and then p which gives (d a b)(c). for example, q takes d to b and p takes b to a so that pq takes d to a. You'll see that pq has two cycles because it takes b to d and fixes c. It's customary to omit 1-cycles but I left it in for clarity.
We're going to use some facts from group theory.
disjoint cycles commute. (a b)(c d) is the same as (c d)(a b)
we can arrange elements in a cycle in any cyclic order. that is (a b c) = (b c a) = (c a b)
So given a permutation, order the cycles so that the largest cycles come first. When two cycles are the same length, arrange their items so that the largest (we can always order a denumerable set, even if arbitrarily so) item comes first. Then we just have a lexicographical ordering first on the length of the cycles, then on their contents. This is well ordered because two permutations that consist of the same cycles must be the same permutation so if p > q and q > p then p = q.
This algorithm can be trivially executed in O(N!logN! + N!) time. just construct all the permutations (EDIT: Just to be clear, I had my mathematician hat on when I proposed this and it was tongue in cheek anyway) , quicksort them and find the n'th. It is a different algorithm than the two you mention though.
Here is an idea on how to improve aaronasterling's answer. It avoids generating all N! permutations and sorting them according to their lexicographic order, and therefore has a much better time complexity.
Internally it uses an unusual permutation representation, that simulates a selection & removal process from a shrinking array. For example, the sequence <0,1,0> represents a permutation resulting from removing item #0 from [0,1,2], then removing item #1 from [1,2], and then removing item #0 from [1]. The resulting permutation is <0,2,1>. With this representation, the first permutation will always be <0,0,...0>, and the last one will always be <N-1,N-2,...0>. I will call this special representation the "array representation".
Clearly, an array representation of size N can be converted to a standard permutation representation in O(N^2) time, by using an array and shrinking it when necessary.
The following function can be used to return the Kth permutation on {0,1,2...,N-1}, in the array representation:
getPermutation(k, N) {
while(N > 0) {
nextItem = floor(k / (N-1)!)
output nextItem
k = k - nextItem * (N-1)!
N = N - 1
}
}
This algorithm works in O(N^2) time (due to the representation conversion), instead of O(N! log N) time.
--Example--
getPermutation(4,3) returns <2,0,0>. This array representation corresponds to <C,A,B>, which is really the permutation at index 4 in the ordered list of permutations on {A,B,C}:
ABC
ACB
BAC
BCA
CAB
CBA
You can adapt merge sort such that it will shuffle the input randomly instead of sorting it.
In particular, when merging two lists, you choose the new head element at random instead of choosing it to be the smallest head element. The probability of choosing the element from the first list must be n/(n+m) where n is the length of the first and m the length of the second list for this to work.
I've written a detailed explanation here: Random Permutations and Sorting.
Another possibility is to build an LFSR or PRNG with a period equal to the number of items you want.
Start with a sorted array. Pick 2 random indexes, switch the elements at those indexes. Repeat O(n lg n) times.
You need to repeat O(n lg n) times to ensure that the distribution approaches uniform. (You need to make sure that each index is picked at least once, which is a balls-in-bins problem.)

Resources