How to group numbers by size - algorithm

I have n different numbers and I want to sort them into k groups, such that any number in group 1 is smaller than any number in group 2, any number in group 2 is smaller than any number in group 3, and so on up to group k (the numbers do not have to be sorted inside each group). I'm asked to design an algorithm that runs in O(n log k), but I can only come up with O(n^2) ones.
How can I do this?

You could achieve this by modifying the bucket sort algorithm; below I have included a JavaScript implementation (see GitHub for further details on the source code). This implementation uses 16 buckets; you will have to modify it to allow for k buckets, and you can omit the sorting within each bucket. One approach would be to use 2^p buckets, where p is the smallest integer that satisfies 2^p >= k. This algorithm will then run in O(n log k).
// Copyright 2011, Tom Switzer
// Under terms of ISC License: http://www.isc.org/software/license
/**
* Sorts an array of integers in linear time using bucket sort.
* This gives a good speed up vs. built-in sort in new JS engines
* (eg. V8). If a key function is given, then the result of
* key(a[i]) is used as the integer value to sort on instead a[i].
*
* @param a A JavaScript array.
* @param key A function that maps values of a to integers.
* @return The array a.
*/
function bsort(a, key) {
    key = key || function(x) {
        return x
    };
    var len = a.length,
        buckets = [],
        i, j, b, d = 0;
    for (; d < 32; d += 4) {
        for (i = 16; i--;)
            buckets[i] = [];
        for (i = len; i--;)
            buckets[(key(a[i]) >> d) & 15].push(a[i]);
        // This implementation uses 16 buckets, you will need to modify this
        for (b = 0; b < 16; b++)
            // The next two lines write each bucket back into the array
            for (j = buckets[b].length; j--;)
                a[++i] = buckets[b][j];
    }
    return a;
}
var array = [2, 4, 1, 5, 3];
$('#result').text(bsort(array, function(x) {
    return x
}));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="result"></div>

Note that the problem statement is to separate n different numbers into k groups. This would get more complicated if there were duplicates as noted in the wiki links below.
Any process that can determine the kth smallest element with less than O(n log(k)) complexity could be used k-1 times to produce an array of the elements corresponding to the boundaries between k groups. Then a single pass could be made on the array, doing a binary search of the boundary array to split up the array into k groups with O(n log(k)) complexity. However, it seems that at least one algorithm to find the kth smallest element also partitions the array, so that alone could be used to create the k groups.
An unordered partial sort using a selection algorithm with worst-case time O(n) is possible. Wiki links:
http://en.wikipedia.org/wiki/Selection_algorithm
http://en.wikipedia.org/wiki/Selection_algorithm#Unordered_partial_sorting
http://en.wikipedia.org/wiki/Quickselect
http://en.wikipedia.org/wiki/Median_of_medians
http://en.wikipedia.org/wiki/Soft_heap#Applications
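To make the boundary idea concrete, here is a small Python sketch (illustrative code, not from the links; the quickselect helper is a simple expected-linear version rather than the worst-case-linear algorithms referenced above, and the function names are mine):
import bisect
import random

def quickselect(nums, rank):
    # Return the element of 0-based rank `rank`; expected O(n) per call
    nums = list(nums)
    while True:
        if len(nums) == 1:
            return nums[0]
        pivot = random.choice(nums)
        lower = [x for x in nums if x < pivot]
        equal = [x for x in nums if x == pivot]
        if rank < len(lower):
            nums = lower
        elif rank < len(lower) + len(equal):
            return pivot
        else:
            rank -= len(lower) + len(equal)
            nums = [x for x in nums if x > pivot]

def group_by_boundaries(nums, k):
    # k-1 boundary values, picked by rank, then n binary searches (O(n log k))
    n = len(nums)
    boundaries = [quickselect(nums, t * n // k) for t in range(1, k)]
    groups = [[] for _ in range(k)]
    for x in nums:
        groups[bisect.bisect_left(boundaries, x)].append(x)
    return groups

print(group_by_boundaries([9, 1, 7, 3, 8, 2, 6, 5, 4], 3))
# [[1, 3, 2, 4], [7, 6, 5], [9, 8]]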

Use a K-selection algorithm with the partition function from QuickSort - QuickSelect.
Let's assume K is a power of 2 for simplicity.
At the first stage we make a partition of the N elements; it takes O(N) ~ p*N time, where p is some constant.
At the second stage we recursively make 2 partitions of N/2 elements each; it takes 2*p*N/2 = p*N time.
At the third stage we make 4 partitions of N/4 elements; it takes 4*p*N/4 = p*N time.
...
At the last stage we make K partitions of N/K elements; it takes K*p*N/K = p*N time.
Note there are log(K) stages, so the overall time is log(K)*p*N = O(N*log(K)).
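A small Python sketch of this recursive partitioning (illustrative code; the select helper below just uses sorted() as a stand-in for a linear-time selection, so the recursion shape is what matters here, not the constant factors):
def select(nums, rank):
    # Stand-in for a linear-time selection (QuickSelect / median of medians);
    # sorted() keeps the sketch short but is O(n log n) on its own.
    return sorted(nums)[rank]

def split_into_groups(nums, k):
    # Recursively split distinct numbers into k consecutive-by-value groups
    # (k a power of 2 here): log(K) levels, O(N) work per level.
    if k == 1:
        return [nums]
    pivot = select(nums, len(nums) // 2)
    lower = [x for x in nums if x < pivot]
    upper = [x for x in nums if x >= pivot]
    return split_into_groups(lower, k // 2) + split_into_groups(upper, k // 2)

print(split_into_groups([9, 1, 12, 3, 8, 2, 16, 5, 4, 10, 7, 11, 6, 14, 13, 15], 4))
# [[1, 3, 2, 4], [8, 5, 7, 6], [9, 12, 10, 11], [16, 14, 13, 15]]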

Thank you for all your help. Basically a quickselect (or any selection algorithm that finds the k-th order statistic in linear time) is enough: after running it k-1 times to get the group boundaries, we make a binary search over the original array to split the elements into groups, getting O(n log k).
Also, if you don't want to do a binary search, you can separate the elements during the quickselect itself and find the statistic within each subset! @rcgldr, @MBo, thank you for your ideas!

Time complexity of deterministic select

Suppose I make k groups. As the size of the input array is not constant, the number of elements in a group will be n/k. Sorting one group takes (n/k)log(n/k), so for k groups it is n log(n/k), which is O(n log n). Then how come the algorithm is O(n)?
Edit: From Gfg https://www.geeksforgeeks.org/kth-smallestlargest-element-unsorted-array-set-3-worst-case-linear-time/amp/
kthSmallest(arr[0..n-1], k)
1) Divide arr[] into ⌈n/5⌉ groups where size of each group
is 5 except possibly the last group which may have less
than 5 elements.
2) Sort the above created ⌈n/5⌉ groups and find median
of all groups. Create an auxiliary array ‘median[]’ and
store medians of all ⌈n/5⌉ groups in this median array.
// Recursively call this method to find median of
median[0..⌈n/5⌉-1]
3) medOfMed = kthSmallest(median[0..⌈n/5⌉-1], ⌈n/10⌉)
4) Partition arr[] around medOfMed and obtain its
position.
pos = partition(arr, n, medOfMed)
5) If pos == k return medOfMed
6) If pos > k return kthSmallest(arr[l..pos-1], k)
7) If pos < k return kthSmallest(arr[pos+1..r], k-pos+l-1)
You are not sorting the elements in each group, you are only finding their medians, which takes O(n/k) per group (this is done on groups of small, constant size n/k, so it is effectively O(1)). So you run an O(n/k) step k times, which yields O(n).
Note that even if you do sort each group, since the algorithm uses a constant group size, the size of each group is n/k = C (for some constant C), and you get O(n log(n/k)) = O(n log(C)) = O(n).
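To make the counting argument concrete, here is a small Python sketch of just the group-medians step (illustrative code, not from the linked article):
def medians_of_fives(arr):
    # Median of each group of 5: each tiny sort touches at most 5 elements
    # (constant work), so the whole pass is O(n).
    medians = []
    for i in range(0, len(arr), 5):
        group = sorted(arr[i:i + 5])
        medians.append(group[len(group) // 2])
    return medians

print(medians_of_fives([9, 1, 7, 3, 8, 2, 6, 5, 4, 10, 12]))  # [7, 5, 12]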

Divide and conquer algorithm

I had a job interview a few weeks ago and I was asked to design a divide and conquer algorithm. I could not solve the problem, but they just called me for a second interview! Here is the question:
We are given as input two n-element arrays A[0..n − 1] and B[0..n − 1] (which
are not necessarily sorted) of integers, and an integer value. Give an O(n log n) divide and conquer algorithm that determines if there exist distinct indices i, j (that is, i != j) such that A[i] + B[j] = value. Your algorithm should return True if such i, j exist, and return False otherwise. You may assume that the elements in A are distinct, and the elements in B are distinct.
can anybody solve the problem? Thanks
My approach is:
Sort either of the arrays. Here we sort array A, using merge sort, which is a divide and conquer algorithm.
Then for each element of B, search for (value − B[j]) in array A by binary search. Again this is a divide and conquer algorithm.
If you find the element (value − B[j]) in array A, then those two elements make a pair such that A[i] + B[j] = value.
As for time complexity: A has N elements, so merge sort takes O(N log N), and we do a binary search for each element of B (N elements in total), which also takes O(N log N). So the total time complexity is O(N log N).
As you mentioned, you need to check that i != j when A[i] + B[j] = value. For that you can use a 2D array of size N × 2: each element is paired with its original index in the second column, and sorting is done according to the data stored in the first column. Then, when you find the element, you can compare the original indexes of both elements and return the result accordingly.
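A minimal Python sketch of this sort-plus-binary-search approach (the function name pair_with_sum_exists is mine; it keeps the original indices so the i != j requirement can be checked, relying on the problem's guarantee that A's elements are distinct):
import bisect

def pair_with_sum_exists(A, B, value):
    # Sort A together with original indices: O(n log n)
    indexed_a = sorted((x, i) for i, x in enumerate(A))
    keys = [x for x, _ in indexed_a]
    for j, b in enumerate(B):                 # n binary searches: O(n log n)
        pos = bisect.bisect_left(keys, value - b)
        # A's elements are distinct, so at most one candidate needs checking
        if pos < len(keys) and keys[pos] == value - b and indexed_a[pos][1] != j:
            return True
    return False

print(pair_with_sum_exists([7, 9, 5, 3, 47, 89, 1], [5, 7, 3, 4, 21, 59, 0], 106))  # True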
The following algorithm does not use Divide and Conquer but it is one of the solutions.
You need to sort both arrays, maintaining the indexes of the elements, perhaps by sorting an array of pairs (elem, index). This takes O(n log n) time.
Then you can apply the merge algorithm to check if there are two elements such that A[i]+B[j] = value. This would be O(n).
Overall time complexity will be O(n log n).
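A small sketch of that merge-style check under the same assumptions (the two-pointer scan below is illustrative code of mine, not the answerer's): one pointer walks up the sorted A, the other walks down the sorted B, and every step moves exactly one pointer, so the scan after sorting is O(n).
def pair_with_sum_merge(A, B, value):
    a = sorted((x, i) for i, x in enumerate(A))   # keep original indices
    b = sorted((x, j) for j, x in enumerate(B))
    i, j = 0, len(b) - 1
    while i < len(a) and j >= 0:
        s = a[i][0] + b[j][0]
        if s == value:
            if a[i][1] != b[j][1]:
                return True
            i += 1        # same original index: skip this pair, keep scanning
        elif s < value:
            i += 1
        else:
            j -= 1
    return False

print(pair_with_sum_merge([7, 9, 5, 3, 47, 89, 1], [5, 7, 3, 4, 21, 59, 0], 106))  # True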
I suggest using hashing. Even if it's not the way you are supposed to solve the problem, it's worth mentioning, since hashing has a better time complexity, O(n) vs. O(n log n), and is therefore more efficient.
Turn A into a hashset (or dictionary if we want the index i) - O(n)
Scan B and check if value - B[j] is in the hashset (dictionary) - O(n)
So you have an O(n) + O(n) = O(n) algorithm (which is better than the required O(n log n); however, the solution is NOT Divide and Conquer):
Sample C# implementation
int[] A = new int[] { 7, 9, 5, 3, 47, 89, 1 };
int[] B = new int[] { 5, 7, 3, 4, 21, 59, 0 };
int value = 106; // 47 + 59 = A[4] + B[5]
// Turn A into a dictionary: key = item's value; value = item's index
var dict = A
    .Select((val, index) => new {
        v = val,
        i = index, })
    .ToDictionary(item => item.v, item => item.i);
int i = -1;
int j = -1;
// Scan B array
for (int k = 0; k < B.Length; ++k) {
    if (dict.TryGetValue(value - B[k], out i)) {
        // Solution found: {i, j}
        j = k;
        // if you want any solution then break
        break;
        // scan further (comment out "break") if you want all pairs
    }
}
Console.Write(j >= 0 ? $"{i} {j}" : "No solution");
Seems hard to achieve without sorting.
If you leave the arrays unsorted, checking for existence of A[i]+B[j] = Value takes time Ω(n) for fixed i, then checking for all i takes Θ(n²), unless you find a trick to put some order in B.
Balanced Divide & Conquer on the unsorted arrays doesn't seem any better: if you divide A and B in two halves, the solution can lie in one of Al/Bl, Al/Br, Ar/Bl, Ar/Br and this yields a recurrence T(n) = 4 T(n/2), which has a quadratic solution.
If sorting is allowed, the solution by Sanket Makani is a possibility, but you can do better in terms of time complexity for the search phase.
Indeed, assume A and B now sorted and consider the 2D function A[i]+B[j], which is monotonic in both directions i and j. Then the domain A[i]+B[j] ≤ Value is limited by a monotonic curve j = f(i), or equivalently i = g(j). But strict equality A[i]+B[j] = Value must be checked exhaustively for all points of the curve, and one cannot avoid evaluating f everywhere in the worst case.
Starting from i = 0, you obtain f(i) by dichotomic search. Then you can follow the border curve incrementally. You will perform n steps in the i direction, and at most n steps in the j direction, so the complexity remains bounded by O(n), which is optimal.
Below, an example showing the areas with a sum below and above the target value (there are two matches).
This optimal solution has little to do with Divide & Conquer. It is maybe possible to design a variant based on the evaluation of the sum at a central point, which allows to discard a whole quadrant, but that would be pretty artificial.

Choosing k out of n

I want to choose k elements uniformly at random out of a possible n without choosing the same number twice. There are two trivial approaches to this.
Make a list of all n possibilities. Shuffle them (you don't need to shuffle all n numbers, just k of them, by performing the first k steps of Fisher-Yates). Choose the first k. This approach takes O(k) time (assuming allocating an array of size n takes O(1) time) and O(n) space. This is a problem if k is very small relative to n.
Store a set of seen elements. Choose a number at random from [0, n-1]. While the element is in the set, choose a new number. This approach takes O(k) space. The run-time is a little more complicated to analyze. If k = Theta(n) then the run-time is O(k*lg(k)) = O(n*lg(n)) because it is the coupon collector's problem. If k is small relative to n then it takes slightly more than O(k) because of the probability (albeit low) of choosing the same number twice. This is better than the above solution in terms of space but worse in terms of run-time.
My question:
is there an O(k) time, O(k) space algorithm for all k and n?
With an O(1) hash table, the partial Fisher-Yates method can be made to run in O(k) time and space. The trick is simply to store only the changed elements of the array in the hash table.
Here's a simple example in Java:
public static int[] getRandomSelection (int k, int n, Random rng) {
    if (k > n) throw new IllegalArgumentException(
        "Cannot choose " + k + " elements out of " + n + "."
    );
    HashMap<Integer, Integer> hash = new HashMap<Integer, Integer>(2*k);
    int[] output = new int[k];
    for (int i = 0; i < k; i++) {
        int j = i + rng.nextInt(n - i);
        output[i] = (hash.containsKey(j) ? hash.remove(j) : j);
        if (j > i) hash.put(j, (hash.containsKey(i) ? hash.remove(i) : i));
    }
    return output;
}
This code allocates a HashMap of 2×k buckets to store the modified elements (which should be enough to ensure that the hash table is never rehashed), and just runs a partial Fisher-Yates shuffle on it.
Here's a quick test on Ideone; it picks two elements out of three 30,000 times, and counts the number of times each pair of elements gets chosen. For an unbiased shuffle, each ordered pair should appear approximately 5,000 (±100 or so) times, except for the impossible cases where both elements would be equal.
Your second approach does not take Theta(k log k) time on average, it takes about n/(n-k+1) + n/(n-k+2) + ... + n/n operations, which is less than k(n/(n-k)) since you have k terms which are each smaller than n/(n-k). For k <= n/2, it takes under 2*k operations on average. For k>n/2, you can choose a random subset of size n-k, and take the complement. So, this is already an O(k) average time and space algorithm.
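As a rough illustration of that analysis and the complement trick for k > n/2, here is a small Python sketch (sample_k_of_n is an illustrative name of mine; note that building the complement set at the end is itself O(n) as written, so this only demonstrates the expected number of random draws):
import random

def sample_k_of_n(k, n):
    # When k > n/2, sample the complement of size n-k instead,
    # so the expected number of draws stays O(k).
    m = n - k if k > n // 2 else k
    chosen = set()
    while len(chosen) < m:
        chosen.add(random.randrange(n))      # retry silently on collisions
    if m != k:
        chosen = set(range(n)) - chosen      # take the complement
    return sorted(chosen)

print(sample_k_of_n(3, 10))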
What you could use is the following algorithm (using javascript instead of pseudocode):
var k = 3;
var n = [1,2,3,4,5,6];
// O(k) iterations
for(var i = 0, tmp; i < k; ++i) {
    // Random index O(1)
    var index = Math.floor(Math.random() * (n.length - i));
    // Output O(1)
    console.log(n[index]);
    // Swap and lookup O(1)
    tmp = n[index];
    n[index] = n[n.length - i - 1];
    n[n.length - i - 1] = tmp;
}
In short, you swap the selected value with the last item and in the next iteration sample from the reduced subset. This assumes your original set is wholly unique.
The storage is O(n); if you wish to retrieve the numbers as a set, just refer to the last k entries of n.

Find kth number in sum array

Given an array A with N elements, I need to find a pair (i, j) such that i is not equal to j and, if we write down the sums A[i]+A[j] for all pairs (i, j), the sum A[i]+A[j] comes at the kth position.
Example: Let N=4 and A=[1 2 3 4]. If K=3 then the answer is 5, as we can see clearly that the sum array becomes: [3,4,5,5,6,7]
I can't go over all pairs of i and j, as N can go up to 100000. Please help me solve this problem.
I mean something like this :
int len = N*(N-1)/2;   // number of pairs (i, j) with i < j
int sum[len];
int count = 0;
for (int i = 0; i < N; i++) {
    for (int j = i+1; j < N; j++) {
        sum[count] = A[i] + A[j];
        count++;
    }
}
// Then just find the kth element.
We can't go with this approach
A solution that is based on a fact that K <= 50: Let's take the first K + 1 elements of the array in a sorted order. Now we can just try all their combinations. Proof of correctness: let's assume that a pair (i, j) is the answer, where j > K + 1. But there are K pairs with the same or smaller sum: (1, 2), (1, 3), ..., (1, K + 1). Thus, it cannot be the K-th pair.
It is possible to achieve an O(N + K^2) time complexity by choosing the K + 1 smallest numbers using a quickselect algorithm (it is possible to do even better, but it is not required). You can also just sort the array and get an O(N log N + K^2 log K) complexity.
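A small Python sketch of this idea (kth_pair_sum is an illustrative name of mine; heapq.nsmallest stands in for the quickselect mentioned above, and k is 1-based as in the question):
import heapq

def kth_pair_sum(A, k):
    # Only the k+1 smallest elements can participate in the k smallest pair sums.
    m = min(k + 1, len(A))
    smallest = heapq.nsmallest(m, A)          # O(N log k); quickselect would be O(N)
    sums = sorted(smallest[i] + smallest[j]
                  for i in range(m) for j in range(i + 1, m))   # O(k^2 log k)
    return sums[k - 1]

print(kth_pair_sum([1, 2, 3, 4], 3))   # 5, matching the example in the question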
I assume that you got this question from http://www.careercup.com/question?id=7457663.
If k is close to 0 then the accepted answer to How to find kth largest number in pairwise sums like setA + setB? can be adapted quite easily to this problem and be quite efficient. You need O(n log(n)) to sort the array, O(n) to set up a priority queue, and then O(k log(k)) to iterate through the elements. The reversed solution is also efficient if k is near n*n - n.
If k is close to n*n/2 then that won't be very good. But you can adapt the pivot approach of http://en.wikipedia.org/wiki/Quickselect to this problem. First in time O(n log(n)) you can sort the array. In time O(n) you can set up a data structure representing the various contiguous ranges of columns. Then you'll need to select pivots O(log(n)) times. (Remember, log(n*n) = O(log(n)).) For each pivot, you can do a binary search of each column to figure out where it split it in time O(log(n)) per column, and total cost of O(n log(n)) for all columns.
The resulting algorithm will be O(n log(n) log(n)).
Update: I do not have time to do the finger exercise of supplying code. But I can outline some of the classes you might have in an implementation.
The implementation will be a bit verbose, but that is sometimes the cost of a good general-purpose algorithm.
ArrayRangeWithAddend. This represents a range of an array, summed with one value. It has an array (a reference or pointer, so the underlying data can be shared between objects), a start and an end of the range, and a shiftValue for the value to add to every element in the range.
It should have a constructor. A method to give the size. A method to partition(n) it into a range less than n, the count equal to n, and a range greater than n. And value(i) to give the i'th value.
ArrayRangeCollection. This is a collection of ArrayRangeWithAddend objects. It should have methods to give its size, pick a random element, and a method to partition(n) it into an ArrayRangeCollection that is below n, count of those equal to n, and an ArrayRangeCollection that is larger than n. In the partition method it will be good to not include ArrayRangeWithAddend objects that have size 0.
Now your main program can sort the array, and create an ArrayRangeCollection covering all pairs of sums that you are interested in. Then the random and partition method can be used to implement the standard quickselect algorithm that you will find in the link I provided.
Here is how to do it (in pseudo-code). I have now confirmed that it works correctly.
//A is the original array, such as A=[1,2,3,4]
//k (an integer) is the element in the 'sum' array to find
N = A.length
//first we find i
i = -1
nl = N
k2 = k
while (k2 >= 0) {
    i++
    nl--
    k2 -= nl
}
//then we find j
j = k2 + nl + i + 1
//now compute the sum at index position k
kSum = A[i] + A[j]
EDIT:
I have now tested this works. I had to fix some parts... basically the k input argument should use 0-based indexing. (The OP seems to use 1-based indexing.)
EDIT 2:
I'll try to explain my theory then. I began with the concept that the sum array should be visualised as a 2D jagged array (diminishing in width as the height increases), with the coordinates (as mentioned in the OP) being i and j. So for an array such as [1,2,3,4,5] the sum array would be conceived as this:
3,4,5,6,
5,6,7,
7,8,
9.
The top row are all values where i would equal 0. The second row is where i equals 1. To find the value of 'j' we do the same but in the column direction.
... Sorry I cannot explain this any better!
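For reference, here is a runnable Python version of the pseudo-code above (the function name is mine; k is a 0-based index into the row-major listing of the pair sums A[i]+A[j] with i < j):
def kth_index_sum(A, k):
    N = len(A)
    i, nl, k2 = -1, N, k
    # first we find i: walk down the rows of the jagged sum array
    while k2 >= 0:
        i += 1
        nl -= 1
        k2 -= nl
    # then we find j within row i
    j = k2 + nl + i + 1
    return A[i] + A[j]

print(kth_index_sum([1, 2, 3, 4], 2))   # 5 (the third pair sum, with 0-based k = 2)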

How to find pair with kth largest sum?

Given two sorted arrays of numbers, we want to find the pair with the kth largest possible sum. (A pair is one element from the first array and one element from the second array). For example, with arrays
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
The pairs with largest sums are
13 + 16 = 29
13 + 12 = 25
8 + 16 = 24
13 + 8 = 21
8 + 12 = 20
So the pair with the 4th largest sum is (13, 8). How to find the pair with the kth largest possible sum?
Also, what is the fastest algorithm? The arrays are already sorted and sizes M and N.
I am already aware of the O(K log K) solution, using a max-heap, given here.
It is also one of the favorite Google interview questions, and they demand an O(K) solution.
I've also read somewhere that there exists an O(K) solution, which I am unable to figure out.
Can someone explain the correct solution with pseudocode?
P.S.
Please DON'T post this link as an answer/comment. It DOESN'T contain the answer.
I start with a simple but not quite linear-time algorithm. We choose some value between array1[0]+array2[0] and array1[N-1]+array2[N-1]. Then we determine how many pair sums are greater than this value and how many of them are less. This may be done by iterating the arrays with two pointers: pointer to the first array incremented when sum is too large and pointer to the second array decremented when sum is too small. Repeating this procedure for different values and using binary search (or one-sided binary search) we could find Kth largest sum in O(N log R) time, where N is size of the largest array and R is number of possible values between array1[N-1]+array2[N-1] and array1[0]+array2[0]. This algorithm has linear time complexity only when the array elements are integers bounded by small constant.
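A small sketch of that counting step (the function name and details are mine); the binary search over candidate values then uses this count to home in on the Kth sum:
def count_pairs_at_least(a, b, x):
    # For two ascending sorted arrays, count pairs with a[i] + b[j] >= x.
    # The pointer into b only ever moves left, so the count is O(len(a) + len(b)).
    count = 0
    j = len(b) - 1
    for i in range(len(a)):
        while j >= 0 and a[i] + b[j] >= x:
            j -= 1
        count += len(b) - 1 - j       # b[j+1:] all reach at least x with a[i]
    return count

print(count_pairs_at_least([2, 3, 5, 8, 13], [4, 8, 12, 16], 21))   # 5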
The previous algorithm may be improved if we stop the binary search as soon as the number of pair sums in the binary-search range decreases from O(N^2) to O(N). Then we fill an auxiliary array with these pair sums (this may be done with a slightly modified two-pointer algorithm). And then we use the quickselect algorithm to find the Kth largest sum in this auxiliary array. All this does not improve worst-case complexity because we still need O(log R) binary search steps. What if we keep the quickselect part of this algorithm but (to get a proper value range) we use something better than binary search?
We could estimate value range with the following trick: get every second element from each array and try to find the pair sum with rank k/4 for these half-arrays (using the same algorithm recursively). Obviously this should give some approximation for needed value range. And in fact slightly improved variant of this trick gives range containing only O(N) elements. This is proven in following paper: "Selection in X + Y and matrices with sorted rows and columns" by A. Mirzaian and E. Arjomandi. This paper contains detailed explanation of the algorithm, proof, complexity analysis, and pseudo-code for all parts of the algorithm except Quickselect. If linear worst-case complexity is required, Quickselect may be augmented with Median of medians algorithm.
This algorithm has complexity O(N). If one of the arrays is shorter than the other (M < N), we could assume that this shorter array is extended to size N with some very small elements, so that all calculations in the algorithm use the size of the larger array. We don't actually need to extract pairs with these "added" elements and feed them to quickselect, which makes the algorithm a little bit faster but does not improve asymptotic complexity.
If k < N we could ignore all the array elements with index greater than k. In this case complexity is equal to O(k). If N < k < N(N-1) we just have better complexity than requested in OP. If k > N(N-1), we'd better solve the opposite problem: k'th smallest sum.
I uploaded simple C++11 implementation to ideone. Code is not optimized and not thoroughly tested. I tried to make it as close as possible to pseudo-code in linked paper. This implementation uses std::nth_element, which allows linear complexity only on average (not worst-case).
A completely different approach to find the K'th sum in linear time is based on a priority queue (PQ). One variation is to insert the largest pair into the PQ, then repeatedly remove the top of the PQ and instead insert up to two pairs (one with a decremented index in one array, the other with a decremented index in the other array), taking some measures to prevent inserting duplicate pairs. The other variation is to insert all possible pairs containing the largest element of the first array, then repeatedly remove the top of the PQ and instead insert the pair with a decremented index in the first array and the same index in the second array. In this case there is no need to bother about duplicates.
The OP mentions the O(K log K) solution where the PQ is implemented as a max-heap. But in some cases (when array elements are evenly distributed integers with limited range and linear complexity is needed only on average, not worst-case) we could use an O(1)-time priority queue, for example, as described in this paper: "A Complexity O(1) Priority Queue for Event Driven Molecular Dynamics Simulations" by Gerald Paul. This allows O(K) expected time complexity.
An advantage of this approach is the possibility of producing the first K elements in sorted order. Disadvantages are the limited choice of array element type, a more complex and slower algorithm, and worse asymptotic complexity: O(K) > O(N).
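A sketch of the second PQ variation using an ordinary binary heap (so roughly O((N + K) log N) rather than the O(K) possible with the constant-time queue from the paper; the function name is mine, k is 1-based, and both arrays are assumed sorted ascending):
import heapq

def kth_largest_pair_sum(a, b, k):
    # Seed the heap with every pair that uses the largest element of the first array;
    # on each pop, re-insert the pair with the first-array index decremented.
    # heapq is a min-heap, so sums are negated.
    heap = [(-(a[-1] + y), len(a) - 1, j) for j, y in enumerate(b)]
    heapq.heapify(heap)
    for _ in range(k - 1):
        _, i, j = heapq.heappop(heap)
        if i > 0:
            heapq.heappush(heap, (-(a[i - 1] + b[j]), i - 1, j))
    return -heap[0][0]

print(kth_largest_pair_sum([2, 3, 5, 8, 13], [4, 8, 12, 16], 4))   # 21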
EDIT: This does not work. I leave the answer, since apparently I am not the only one who could have this kind of idea; see the discussion below.
A counter-example is x = (2, 3, 6), y = (1, 4, 5) and k=3, where the algorithm gives 7 (3+4) instead of 8 (3+5).
Let x and y be the two arrays, sorted in decreasing order; we want to construct the K-th largest sum.
The variables are: i the index in the first array (element x[i]), j the index in the second array (element y[j]), and k the "order" of the sum (k in 1..K), in the sense that S(k) = x[i]+y[j] will be the k-th greatest sum satisfying your conditions (this is the loop invariant).
Start from (i, j) equal to (0, 0): clearly, S(1) = x[0]+y[0].
for k from 1 to K-1, do:
if x[i+1]+ y[j] > x[i] + y[j+1], then i := i+1 (and j does not change) ; else j:=j+1
To see that it works, consider you have S(k) = x[i] + y[j]. Then, S(k+1) is the greatest sum which is lower than (or equal to) S(k), and such that at least one of the indices (i or j) changes. It is not difficult to see that exactly one of i or j should change.
If i changes, the greatest sum you can construct which is lower than S(k) is obtained by setting i = i+1, because x is decreasing and all the x[i'] + y[j] with i' < i are greater than S(k). The same holds for j, showing that S(k+1) is either x[i+1] + y[j] or x[i] + y[j+1].
Therefore, at the end of the loop you have found the K-th greatest sum.
tl;dr: If you look ahead and look behind at each iteration, you can start with the end (which is highest) and work back in O(K) time.
Although the insight underlying this approach is, I believe, sound, the code below is not quite correct at present (see comments).
Let's see: first of all, the arrays are sorted. So, if the arrays are a and b with lengths M and N, and as you have arranged them the largest items are in slots M and N respectively, then the largest pair will always be a[M]+b[N].
Now, what's the second largest pair? It's going to have perhaps one of {a[M],b[N]} (it can't have both, because that's just the largest pair again), and at least one of {a[M-1],b[N-1]}. BUT, we also know that if we choose a[M-1]+b[N-1], we can make one of the operands larger by choosing the higher number from the same list, so it will have exactly one number from the last column, and one from the penultimate column.
Consider the following two arrays: a = [1, 2, 53]; b = [66, 67, 68]. Our highest pair is 53+68. If we lose the smaller of those two, our pair is 68+2; if we lose the larger, it's 53+67. So, we have to look ahead to decide what our next pair will be. The simplest lookahead strategy is simply to calculate the sum of both possible pairs. That will always cost two additions, and two comparisons for each transition (three, because we need to deal with the case where the sums are equal); let's call that cost Q.
At first, I was tempted to repeat that K-1 times. BUT there's a hitch: the next largest pair might actually be the other pair we can validly make from {{a[M],b[N]}, {a[M-1],b[N-1]}. So, we also need to look behind.
So, let's code (python, should be 2/3 compatible):
def kth(a,b,k):
    M = len(a)
    N = len(b)
    if k > M*N:
        raise ValueError("There are only %s possible pairs; you asked for the %sth largest, which is impossible" % (M*N, k))
    (ia, ib) = M-1, N-1  # 0 based arrays
    # we need this for lookback
    nottakenindices = (0, 0)  # could be any value
    nottakensum = float('-inf')
    for i in range(k-1):
        optionone = a[ia] + b[ib-1]
        optiontwo = a[ia-1] + b[ib]
        biggest = max((optionone, optiontwo))
        # first deal with look behind
        if nottakensum > biggest:
            if optionone == biggest:
                newnottakenindices = (ia, ib-1)
            else:
                newnottakenindices = (ia-1, ib)
            ia, ib = nottakenindices
            nottakensum = biggest
            nottakenindices = newnottakenindices
        # deal with case where indices hit 0
        elif ia <= 0 and ib <= 0:
            ia = ib = 0
        elif ia <= 0:
            ib -= 1
            ia = 0
            nottakensum = float('-inf')
        elif ib <= 0:
            ia -= 1
            ib = 0
            nottakensum = float('-inf')
        # lookahead cases
        elif optionone > optiontwo:
            # then choose the first option as our next pair
            nottakensum, nottakenindices = optiontwo, (ia-1, ib)
            ib -= 1
        elif optionone < optiontwo:  # choose the second
            nottakensum, nottakenindices = optionone, (ia, ib-1)
            ia -= 1
        # next two cases apply if options are equal
        elif a[ia] > b[ib]:  # drop the smallest
            nottakensum, nottakenindices = optiontwo, (ia-1, ib)
            ib -= 1
        else:  # might be equal or not - we can choose arbitrarily if equal
            nottakensum, nottakenindices = optionone, (ia, ib-1)
            ia -= 1
        # +2 - one for zero-based, one for skipping the 1st largest
        data = (i+2, a[ia], b[ib], a[ia]+b[ib], ia, ib)
        narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
        print(narrative)  # this will work in both versions of python
        if ia <= 0 and ib <= 0:
            raise ValueError("Both arrays exhausted before Kth (%sth) pair reached" % data[0])
    return data, narrative
For those without python, here's an ideone: http://ideone.com/tfm2MA
At worst, we have 5 comparisons in each iteration, and K-1 iterations, which means that this is an O(K) algorithm.
Now, it might be possible to exploit information about differences between values to optimise this a little bit, but this accomplishes the goal.
Here's a reference implementation (not O(K), but it will always work, except possibly for corner cases where pairs have equal sums):
import itertools
def refkth(a, b, k):
    # sort all pairs by decreasing sum and take the (k-1)-th (0-based) entry
    (rightia, righta), (rightib, rightb) = sorted(
        itertools.product(enumerate(a), enumerate(b)),
        key=lambda pair: pair[0][1] + pair[1][1],
        reverse=True)[k-1]
    data = k, righta, rightb, righta + rightb, rightia, rightib
    narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
    print(narrative)  # this will work in both versions of python
    return data, narrative
This calculates the cartesian product of the two arrays (i.e. all possible pairs), sorts them by decreasing sum, and takes the kth element. The enumerate function decorates each item with its index.
The max-heap algorithm in the other question is simple, fast and correct. Don't knock it. It's really well explained too. https://stackoverflow.com/a/5212618/284795
Might be there isn't any O(k) algorithm. That's okay, O(k log k) is almost as fast.
If the last two solutions were at (a1, b1), (a2, b2), then it seems to me there are only four candidate solutions (a1-1, b1) (a1, b1-1) (a2-1, b2) (a2, b2-1). This intuition could be wrong. Surely there are at most four candidates for each coordinate, and the next highest is among the 16 pairs (a in {a1,a2,a1-1,a2-1}, b in {b1,b2,b1-1,b2-1}). That's O(k).
(No it's not, still not sure whether that's possible.)
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
Merge the 2 arrays and note down each element's index in the merged sorted order. Here is what the index arrays look like (starting from 1, not 0):
[1, 2, 4, 6, 8]
[3, 5, 7, 9]
Now start from the end and make tuples; sum the elements in each tuple and pick the kth largest sum.
public static List<List<Integer>> optimization(int[] nums1, int[] nums2, int k) {
    // 2 * O(n log(n))
    Arrays.sort(nums1);
    Arrays.sort(nums2);
    List<List<Integer>> results = new ArrayList<>(k);
    int endIndex = 0;
    // Find the number whose square is the first one bigger than k
    for (int i = 1; i <= k; i++) {
        if (i * i >= k) {
            endIndex = i;
            break;
        }
    }
    // The following iteration provides at most endIndex^2 elements, and both arrays are in ascending order,
    // so the k smallest pairs must be found in this iteration. To flatten the nested loop, refer to
    // 'https://stackoverflow.com/questions/7457879/algorithm-to-optimize-nested-loops'
    for (int i = 0; i < endIndex * endIndex; i++) {
        int m = i / endIndex;
        int n = i % endIndex;
        List<Integer> item = new ArrayList<>(2);
        item.add(nums1[m]);
        item.add(nums2[n]);
        results.add(item);
    }
    results.sort(Comparator.comparing(pair -> pair.get(0) + pair.get(1)));
    return results.stream().limit(k).collect(Collectors.toList());
}
Key to eliminating O(n^2):
Avoid the cartesian product (or 'cross join'-like operation) of both arrays, which means flattening the nested loop.
Downsize the iteration over the 2 arrays.
So:
Sort both arrays (Arrays.sort offers O(n log(n)) performance according to the Java doc).
Limit the iteration range to a size which is just big enough to support searching for the k smallest pairs.
