Sort name & time complexity - algorithm

I "invented" a "new" sorting algorithm. Well, I understand that I can't invent something good, so I tried to search for it on Wikipedia, but none of the sorting algorithms there seem to match mine. So I have three questions:
What is the name of this algorithm?
Why does it suck? (best, average and worst time complexity)
Can I make it better while still using this idea?
So, the idea of my algorithm: if we have an array, we can count the number of sorted elements, and if this number is less than half of the length we can reverse the array to make it more sorted. After that we can sort the first half and the second half of the array. In the best case we need only O(n): if the array is totally sorted in the good or the bad direction. I have some problems with evaluating the average and worst time complexity.
Code in C#:
public static void Reverse(int[] array, int begin, int end) {
int length = end - begin;
for (int i = 0; i < length / 2; i++)
Algorithms.Swap(ref array[begin+i], ref array[begin + length - i - 1]);
}
public static bool ReverseIf(int[] array, int begin, int end) {
int countSorted = 1;
for (int i = begin + 1; i < end; i++)
if (array[i - 1] <= array[i])
countSorted++;
int length = end - begin;
if (countSorted <= length/2)
Reverse(array, begin, end);
if (countSorted == 1 || countSorted == (end - begin))
return true;
else
return false;
}
public static void ReverseSort(int[] array, int begin, int end) {
if (begin == end || begin + 1 == end)
return;
// with an if statement instead of a while loop, the array {2,3,1} becomes {2,1,3} and the algorithm stops
while (!ReverseIf(array, begin, end)) {
int pivot = begin + (end - begin) / 2;
ReverseSort(array, begin, pivot + 1);
ReverseSort(array, pivot, end);
}
}
public static void ReverseSort(int[] array) {
ReverseSort(array, 0, array.Length);
}
P.S.: Sorry for my English.

The best case is Theta(n), for, e.g., a sorted array. The worst case is Theta(n^2 log n).
Upper bound
Secondary subproblems have a sorted array preceded or succeeded by an arbitrary element. These are O(n log n). If preceded, we do O(n) work, solve a secondary subproblem on the first half and then on the second half, and then do O(n) more work – O(n log n). If succeeded, do O(n) work, sort the already sorted first half (O(n)), solve a secondary subproblem on the second half, do O(n) work, solve a secondary subproblem on the first half, sort the already sorted second half (O(n)), do O(n) work – O(n log n).
Now, in the general case, we solve two primary subproblems on the two halves and then slowly exchange elements over the pivot using secondary invocations. There are O(n) exchanges necessary, so a straightforward application of the Master Theorem yields a bound of O(n^2 log n).
Lower bound
For k >= 3, we construct an array A(k) of size 2^k recursively using the above analysis as a guide. The bad cases are the arrays [2^k + 1] + A(k).
Let A(3) = [1, ..., 8]. This sorted base case keeps Reverse from being called.
For k > 3, let A(k) = [2^(k-1) + A(k-1)[1], ..., 2^(k-1) + A(k-1)[2^(k-1)]] + A(k-1). Note that the primary subproblems of [2^k + 1] + A(k) are equivalent to [2^(k-1) + 1] + A(k-1).
After the primary recursive invocations, the array is [2^(k-1) + 1, ..., 2^k, 1, ..., 2^(k-1), 2^k + 1]. There are Omega(2^k) elements that have to move Omega(2^k) positions, and each of the secondary invocations that moves an element so far has O(1) sorted subproblems and thus is Omega(n log n).
Clearly more coffee is required – the primary subproblems don't matter. This makes it not too bad to analyze the average case, which is Theta(n^2 log n) as well.
With constant probability, the first half of the array contains at least half of the least quartile and at least half of the greatest quartile. In this case, regardless of whether Reverse happens, there are Omega(n) elements that have to move Omega(n) positions via secondary invocations.
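To get a feel for these bounds empirically, here is a minimal Python transliteration of the C# code from the question. The `comparisons` counter and the `count_comparisons` helper are additions for illustration only, and the base case is written as "length 0 or 1". Treat it as a sketch for experimenting, not as a benchmark.

```python
comparisons = 0  # global counter, added for illustration only


def reverse_if(a, begin, end):
    """Count ascending adjacent pairs in a[begin:end]; reverse if at most half.

    Returns True when the slice is known to be sorted afterwards.
    """
    global comparisons
    count_sorted = 1
    for i in range(begin + 1, end):
        comparisons += 1
        if a[i - 1] <= a[i]:
            count_sorted += 1
    length = end - begin
    if count_sorted <= length // 2:
        a[begin:end] = a[begin:end][::-1]
    return count_sorted == 1 or count_sorted == length


def reverse_sort(a, begin=0, end=None):
    """Sort a[begin:end] in place, following the question's ReverseSort."""
    if end is None:
        end = len(a)
    if end - begin <= 1:
        return
    while not reverse_if(a, begin, end):
        pivot = begin + (end - begin) // 2
        reverse_sort(a, begin, pivot + 1)  # halves overlap at pivot, as in the C# code
        reverse_sort(a, pivot, end)


def count_comparisons(data):
    """Sort a copy of data and return (sorted list, number of comparisons)."""
    global comparisons
    comparisons = 0
    a = list(data)
    reverse_sort(a)
    return a, comparisons
```

Calling `count_comparisons` on inputs of increasing size gives a rough picture of the growth. An already sorted or strictly descending input of length n >= 1 costs exactly n - 1 comparisons, which matches the linear best case.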

It seems this algorithm, even if it performs horribly with "random" data (as demonstrated by Per in their answer), is quite efficient for "fixing up" arrays which are "nearly-sorted". Thus if you chose to develop this idea further (I personally wouldn't, but if you wanted to think about it as an exercise), you would do well to focus on this strength.
A reference in the Wikipedia article on inversions alludes to the issue very well. Mahmoud's book is quite insightful, noting that there are various ways to measure "sortedness". For example, if we use the number of inversions to characterize a "nearly-sorted array", then we can use insertion sort to sort it extremely quickly. However, if your arrays are "nearly-sorted" in slightly different ways (e.g. a deck of cards which is cut or loosely shuffled), then insertion sort will not be the best sort to "fix up" the list.
Input: an array of size N that was previously sorted and now has roughly N/k inversions.
I might do something like this for an algorithm:
Calculate the number of inversions. (O(N lg(lg(N))), or assume it is small and skip this step.)
If the number of inversions is < [threshold], sort the array using insertion sort (it will be fast).
Otherwise the array is not close to being sorted; resort to using your favorite comparison (or better) sorting algorithm.
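The steps above can be sketched roughly as follows, using the "assume it is small and skip the counting step" variant. Insertion sort performs exactly one shift per inversion, so the sketch counts shifts as it goes and bails out to the built-in sort once a budget is exceeded. The name `adaptive_sort` and the `threshold` parameter are hypothetical, chosen only for this illustration.

```python
def adaptive_sort(a, threshold):
    """Sort list a in place and return which strategy finished the job.

    Runs insertion sort while counting element shifts (one shift per
    inversion removed). If the count exceeds `threshold`, the input is
    treated as not nearly sorted and the built-in sort takes over.
    """
    shifts = 0
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]  # shift the larger element one slot right
            j -= 1
            shifts += 1
        a[j + 1] = x
        if shifts > threshold:
            a.sort()  # fallback: general-purpose comparison sort
            return "fallback"
    return "insertion"
```

With a threshold proportional to N, the insertion-sort path stays linear on nearly-sorted input, while the fallback caps the cost on everything else.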
There are better ways to do this though; one can "fix up" such an array in at least O(log(N)*(# new elements)) time if you preprocess your array enough or use the right data-structure, like an array with linked-list properties or similar which supports binary search.
You can generalize this idea even further. Whether "fixing up" an array will work depends on the kind of fixing-up that is required. Thus if you update these statistics whenever you add an element to the list or modify it, you can dispatch onto a good "fix-it-up" algorithm.
But unfortunately this would all be a pain to code. You might just be able to get away with a priority queue.

Related

Find an algorithm for sorting integers with time complexity O(n + k*log(k))

Design an algorithm that sorts n integers where there are duplicates. The total number of different numbers is k. Your algorithm should have time complexity O(n + k*log(k)). The expected time is enough. For which values of k does the algorithm become linear?
I am not able to come up with a sorting algorithm for integers that runs in O(n + k*log(k)). I am not a very advanced programmer. In the problem before this one, I was supposed to come up with an algorithm for a list of numbers xi, 0 ≤ xi ≤ m, that runs in O(n+m), where n is the number of elements in the list and m is the value of the biggest integer in the list. I solved that problem easily by using counting sort, but I struggle with this one. What makes it most difficult for me is the term k*log(k) in the big-O notation; if it were n*log(n) instead, I would be able to use merge sort, right? But that's not possible here, so any ideas would be very helpful.
Thanks in advance!
Here is a possible solution:
Using a hash table, count the number of unique values and the number of duplicates of each value. This should have a complexity of O(n).
Enumerate the hashtable, storing the unique values into a temporary array. Complexity is O(k).
Sort this array with a standard algorithm such as mergesort: complexity is O(k.log(k)).
Create the resulting array by replicating each element of the sorted array of unique values the number of times stored in the hash table. Complexity is O(n) + O(k).
Combined complexity is O(n + k.log(k)).
For example, if k is a small constant, sorting an array of n values converges toward linear time as n becomes larger and larger.
If during the first phase, where k is computed incrementally, it appears that k is not significantly smaller than n, drop the hash table and just sort the original array with a standard algorithm.
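A minimal Python sketch of this approach, including the fallback to a plain sort when k turns out not to be much smaller than n. The function name and the `max_ratio` cutoff are assumptions made for illustration; the answer does not specify a particular threshold.

```python
def sort_with_duplicates(values, max_ratio=0.5):
    """Sort integers in expected O(n + k log k), where k = number of distinct values.

    `max_ratio` is a hypothetical cutoff: once the number of distinct values
    seen exceeds max_ratio * n, the hash table is dropped and the input is
    sorted directly.
    """
    n = len(values)
    counts = {}
    for v in values:  # phase 1: count occurrences, O(n) expected
        counts[v] = counts.get(v, 0) + 1
        if len(counts) > max_ratio * n:
            return sorted(values)  # k is close to n: plain sort
    result = []
    for v in sorted(counts):  # phase 2: sort the k distinct values
        result.extend([v] * counts[v])  # phase 3: replicate, O(n) in total
    return result
```

For example, `sort_with_duplicates([3, 1, 3, 2, 1, 3])` takes the hash-table path because only 3 of the 6 values are distinct.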
The runtime of O(n + k*log(k)) indicates (as addition in runtimes often does) that you have two subroutines, one which runs in O(n) and another that runs in O(k*log(k)).
You can first count the frequency of the elements in O(n) (for example in a HashMap; look it up if you're not familiar with it, it's very useful).
Then you just sort the unique elements, of which there are k. This sorting runs in O(k*log(k)); use any sorting algorithm you want.
At the end, replace each unique element with as many copies as it actually appeared, by looking this up in the map you created in step 1.
A possible Java solution can be like this:
// requires: import java.util.*;
public List<Integer> sortArrayWithDuplicates(List<Integer> arr) {
    // Collect the distinct values and count occurrences: O(n) expected
    Set<Integer> set = new HashSet<>(arr);
    Map<Integer, Integer> freqMap = new HashMap<>();
    for (Integer i : arr) {
        freqMap.put(i, freqMap.getOrDefault(i, 0) + 1);
    }
    List<Integer> withoutDups = new ArrayList<>(set);
    // Sorting => O(k*log(k)), as there are k different elements.
    // Collections.sort is used because Arrays.sort does not accept a List.
    Collections.sort(withoutDups);
    // Emit each distinct value as many times as it occurred: O(n)
    List<Integer> result = new ArrayList<>();
    for (Integer i : withoutDups) {
        int c = freqMap.get(i);
        for (int j = 0; j < c; j++) {
            result.add(i);
        }
    }
    return result;
}
The time complexity of the above code is O(n + k*log(k)), and the solution is along the same lines as the answers above.

Divide and conquer algorithm

I had a job interview a few weeks ago and I was asked to design a divide and conquer algorithm. I could not solve the problem, but they just called me for a second interview! Here is the question:
We are given as input two n-element arrays A[0..n − 1] and B[0..n − 1] of integers (which are not necessarily sorted), and an integer value. Give an O(n log n) divide and conquer algorithm that determines if there exist distinct values i, j (that is, i != j) such that A[i] + B[j] = value. Your algorithm should return True if i, j exist, and return False otherwise. You may assume that the elements in A are distinct, and the elements in B are distinct.
Can anybody solve the problem? Thanks
My approach is:
Sort either of the arrays; here we sort array A. Sort it with merge sort, which is a divide and conquer algorithm.
Then, for each element of B, search for (required value - element of B) in array A by binary search. Again, this is a divide and conquer algorithm.
If you find (required value - element of B) in array A, then the two elements form a pair such that element of A + element of B = required value.
As for time complexity: A has N elements, so merge sort takes O(N log N), and we do a binary search for each element of B (N elements in total), which takes O(N log N). So the total time complexity is O(N log N).
Since you mentioned that you need to check i != j when A[i] + B[j] = value, you can use a 2D array of size N * 2, where each row pairs an element with its original index. Sort by the element stored in the first column. Then, when you find the element, you can compare the original indexes of both elements and return the result accordingly.
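A short Python sketch of this sort-then-binary-search idea. Python's built-in `sorted` stands in for merge sort here (an assumption of the sketch, both are O(n log n)), and the standard `bisect` module does the binary search. The function name is made up for illustration.

```python
from bisect import bisect_left


def has_pair_sorted(A, B, value):
    """Return True if A[i] + B[j] == value for some i != j.

    Assumes, as the question does, that the elements of A are distinct.
    """
    # Pair each element of A with its original index, then sort by element.
    pairs = sorted((a, i) for i, a in enumerate(A))  # O(n log n)
    keys = [a for a, _ in pairs]
    for j, b in enumerate(B):  # n binary searches, O(n log n) in total
        target = value - b
        pos = bisect_left(keys, target)
        if pos < len(keys) and keys[pos] == target and pairs[pos][1] != j:
            return True
    return False
```

Because the elements of A are distinct, each target has at most one match, so a single comparison of original indexes is enough to enforce i != j.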
The following algorithm does not use divide and conquer, but it is one of the solutions.
You need to sort both arrays while keeping track of the original indexes of the elements, for example by sorting arrays of pairs (elem, index). This takes O(n log n) time.
Then you can apply a merge-style scan to check whether there are two elements such that A[i]+B[j] = value. This takes O(n).
The overall time complexity is O(n log n).
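One possible reading of the merge-style scan, sketched in Python: walk A upward and B downward at the same time. The function name is hypothetical, and the handling of a same-index match relies on the question's guarantee that elements are distinct within each array.

```python
def has_pair_two_pointers(A, B, value):
    """Return True if A[i] + B[j] == value for some i != j."""
    a = sorted((x, i) for i, x in enumerate(A))  # (elem, index) pairs
    b = sorted((y, j) for j, y in enumerate(B))
    lo, hi = 0, len(b) - 1
    while lo < len(a) and hi >= 0:  # linear scan after sorting
        s = a[lo][0] + b[hi][0]
        if s == value:
            if a[lo][1] != b[hi][1]:
                return True
            # Same original index: with distinct elements, neither value
            # can take part in another matching pair, so skip both.
            lo += 1
            hi -= 1
        elif s < value:
            lo += 1  # sum too small: take a larger element of A
        else:
            hi -= 1  # sum too large: take a smaller element of B
    return False
```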
I suggest using hashing. Even if it's not the way you are supposed to solve the problem, it's worth mentioning since hashing has a better time complexity, O(n) vs. O(n*log(n)), and is therefore more efficient.
Turn A into a hashset (or a dictionary if we want the index i) - O(n)
Scan B and check if there's value - B[j] in the hashset (dictionary) - O(n)
So you have an O(n) + O(n) = O(n) algorithm, which is better than the required O(n * log(n)); however, the solution is NOT Divide and Conquer.
Sample C# implementation
int[] A = new int[] { 7, 9, 5, 3, 47, 89, 1 };
int[] B = new int[] { 5, 7, 3, 4, 21, 59, 0 };
int value = 106; // 47 + 59 = A[4] + B[5]
// Turn A into a dictionary: key = item's value; value = item's index
var dict = A
.Select((val, index) => new {
v = val,
i = index, })
.ToDictionary(item => item.v, item => item.i);
int i = -1;
int j = -1;
// Scan B array
for (int k = 0; k < B.Length; ++k) {
if (dict.TryGetValue(value - B[k], out i) && i != k) { // the question requires i != j
// Solution found: {i, j}
j = k;
// if you want any solution then break
break;
// scan further (comment out "break") if you want all pairs
}
}
Console.Write(j >= 0 ? $"{i} {j}" : "No solution");
Seems hard to achieve without sorting.
If you leave the arrays unsorted, checking for existence of A[i]+B[j] = Value takes time Ω(n) for fixed i, then checking for all i takes Θ(n²), unless you find a trick to put some order in B.
Balanced Divide & Conquer on the unsorted arrays doesn't seem any better: if you divide A and B in two halves, the solution can lie in one of Al/Bl, Al/Br, Ar/Bl, Ar/Br and this yields a recurrence T(n) = 4 T(n/2), which has a quadratic solution.
If sorting is allowed, the solution by Sanket Makani is a possibility, but you can do better in terms of time complexity for the search phase.
Indeed, assume A and B are now sorted and consider the 2D function A[i]+B[j], which is monotonic in both directions i and j. Then the domain A[i]+B[j] ≤ Value is bounded by a monotonic curve j = f(i) or equivalently i = g(j). But strict equality A[i]+B[j] = Value must be checked exhaustively for all points of the curve, and one cannot avoid evaluating f everywhere in the worst case.
Starting from i = 0, you obtain f(i) by dichotomic search. Then you can follow the border curve incrementally. You will perform n steps in the i direction and at most n steps in the j direction, so that the complexity remains bounded by O(n), which is optimal.
[Figure: an example showing the areas with a sum below and above the target value; there are two matches.]
This optimal solution has little to do with Divide & Conquer. It may be possible to design a variant based on the evaluation of the sum at a central point, which allows a whole quadrant to be discarded, but that would be pretty artificial.

Big O - is n always the size of the input?

I made up my own interview-style problem, and have a question on the big O of my solution. I will state the problem and my solution below, but first let me say that the obvious solution involves a nested loop and is O(n^2). I believe I found an O(n) solution, but then I realized it depends not only on the size of the input, but also on the largest value of the input. It seems like my running time of O(n) is only a technicality, and that it could easily run in O(n^2) time or worse in real life.
The problem is:
For each item in a given array of positive integers, print all the other items in the array that are multiples of the current item.
Example Input:
[2 9 6 8 3]
Example Output:
2: 6 8
9:
6:
8:
3: 9 6
My solution (in C#):
private static void PrintAllDivisibleBy(int[] arr)
{
Dictionary<int, bool> dic = new Dictionary<int, bool>();
if (arr == null || arr.Length < 2)
return;
int max = arr[0];
for(int i=0; i<arr.Length; i++)
{
if (arr[i] > max)
max = arr[i];
dic[arr[i]] = true;
}
for(int i=0; i<arr.Length; i++)
{
Console.Write("{0}: ", arr[i]);
int multiplier = 2;
while(true)
{
int product = multiplier * arr[i];
if (dic.ContainsKey(product))
Console.Write("{0} ", product);
if (product >= max)
break;
multiplier++;
}
Console.WriteLine();
}
}
So, if 2 of the array items are 1 and n, where n is the array length, the inner while loop will run n times, making this equivalent to O(n^2). But, since the performance is dependent on the size of the input values, not the length of the list, that makes it O(n), right?
Would you consider this a true O(n) solution? Is it only O(n) due to technicalities, but slower in real life?
Good question! The answer is that, no, n is not always the size of the input: You can't really talk about O(n) without defining what the n means, but often people use imprecise language and imply that n is "the most obvious thing that scales here". Technically we should usually say things like "This sort algorithm performs a number of comparisons that is O(n) in the number of elements in the list": being specific about both what n is, and what quantity we are measuring (comparisons).
If you have an algorithm that depends on the product of two different things (here, the length of the list and the largest element in it), the proper way to express that is in the form O(m*n), and then define what m and n are for your context. So, we could say that your algorithm performs O(m*n) multiplications, where m is the length of the list and n is the largest item in the list.
An algorithm is O(n) when you have to iterate over n elements and perform some constant time operation in each iteration. The inner while loop of your algorithm is not constant time as it depends on the hugeness of the biggest number in your array.
Your algorithm's best case run-time is O(n). This is the case when all the n numbers are same.
Your algorithm's worst case run-time is O(k*n), where k = the max value of int possible on your machine if you really insist to put an upper bound on k's value. For 32 bit int the max value is 2,147,483,647. You can argue that this k is a constant, but this constant is clearly
not fixed for every case of input array; and,
not negligible.
Would you consider this a true O(n) solution?
The runtime actually is O(nm), where m is the maximum element of arr. If the elements in your array are bounded by a constant, you can consider the algorithm to be O(n).
Can you improve the runtime? Here's what else you can do. First notice that you can ensure that the elements are distinct (you compress the array into a hash map which stores how many times each element is found in the array). Then your runtime would be max/a[0]+max/a[1]+max/a[2]+... <= max+max/2+...+max/max = O(max log(max)) (assuming your array arr is sorted). If you combine this with the obvious O(n^2) algorithm, you get an O(min(n^2, max*log(max))) algorithm.
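The deduplication idea can be sketched in Python as below. The function name and the dictionary return value are choices made for this illustration. Like the C# code in the question, it starts at twice the item, so equal items are not reported as multiples of each other, and it assumes positive integers.

```python
from collections import Counter


def multiples_by_item(arr):
    """Map each distinct item to the list of other items that are its multiples."""
    counts = Counter(arr)  # compress duplicates: value -> occurrences
    if not counts:
        return {}
    top = max(counts)
    result = {}
    for x in counts:
        found = []
        # Visit 2x, 3x, ... up to the maximum: about top / x steps for this x.
        for m in range(2 * x, top + 1, x):
            if m in counts:
                found.extend([m] * counts[m])
        result[x] = found
    return result
```

Summed over distinct values, the inner loop runs at most max/1 + max/2 + ... + max/max times, which is the harmonic sum behind the O(max log(max)) bound above.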

How to find pair with kth largest sum?

Given two sorted arrays of numbers, we want to find the pair with the kth largest possible sum. (A pair is one element from the first array and one element from the second array). For example, with arrays
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
The pairs with largest sums are
13 + 16 = 29
13 + 12 = 25
8 + 16 = 24
13 + 8 = 21
8 + 12 = 20
So the pair with the 4th largest sum is (13, 8). How to find the pair with the kth largest possible sum?
Also, what is the fastest algorithm? The arrays are already sorted and sizes M and N.
I am already aware of the O(K log K) solution using a max-heap, described in a linked answer.
It is also one of Google's favorite interview questions, and they demand an O(k) solution.
I've also read somewhere that there exists an O(k) solution, which I am unable to figure out.
Can someone explain the correct solution with pseudocode?
P.S.
Please DON'T post this link as an answer/comment. It DOESN'T contain the answer.
I start with a simple but not quite linear-time algorithm. We choose some value between array1[0]+array2[0] and array1[N-1]+array2[N-1]. Then we determine how many pair sums are greater than this value and how many of them are less. This may be done by iterating the arrays with two pointers: pointer to the first array incremented when sum is too large and pointer to the second array decremented when sum is too small. Repeating this procedure for different values and using binary search (or one-sided binary search) we could find Kth largest sum in O(N log R) time, where N is size of the largest array and R is number of possible values between array1[N-1]+array2[N-1] and array1[0]+array2[0]. This algorithm has linear time complexity only when the array elements are integers bounded by small constant.
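A minimal Python sketch of this first algorithm for integer arrays. The counting pass is a two-pointer scan, and the outer loop is a binary search over candidate sums. The function names are made up here, both arrays are assumed to be sorted in ascending order, and k is assumed to be between 1 and M*N.

```python
def count_pairs_at_least(a, b, t):
    """Count pairs (x in a, y in b) with x + y >= t, in O(M + N) time."""
    count = 0
    j = len(b)  # b[j:] are the elements that reach t together with the current x
    for x in a:  # x only grows, so the boundary j only moves left
        while j > 0 and x + b[j - 1] >= t:
            j -= 1
        count += len(b) - j
    return count


def kth_largest_sum(a, b, k):
    """Return the k-th largest pair sum by binary search on the value range."""
    lo, hi = a[0] + b[0], a[-1] + b[-1]
    # Find the largest t such that at least k pairs have a sum >= t.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if count_pairs_at_least(a, b, mid) >= k:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Each counting pass is linear and there are O(log R) of them, which gives the O(N log R) bound mentioned above.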
Previous algorithm may be improved if we stop binary search as soon as number of pair sums in binary search range decreases from O(N^2) to O(N). Then we fill auxiliary array with these pair sums (this may be done with slightly modified two-pointers algorithm). And then we use quickselect algorithm to find Kth largest sum in this auxiliary array. All this does not improve worst-case complexity because we still need O(log R) binary search steps. What if we keep the quickselect part of this algorithm but (to get proper value range) we use something better than binary search?
We could estimate value range with the following trick: get every second element from each array and try to find the pair sum with rank k/4 for these half-arrays (using the same algorithm recursively). Obviously this should give some approximation for needed value range. And in fact slightly improved variant of this trick gives range containing only O(N) elements. This is proven in following paper: "Selection in X + Y and matrices with sorted rows and columns" by A. Mirzaian and E. Arjomandi. This paper contains detailed explanation of the algorithm, proof, complexity analysis, and pseudo-code for all parts of the algorithm except Quickselect. If linear worst-case complexity is required, Quickselect may be augmented with Median of medians algorithm.
This algorithm has complexity O(N). If one of the arrays is shorter than other array (M < N) we could assume that this shorter array is extended to size N with some very small elements so that all calculations in the algorithm use size of the largest array. We don't actually need to extract pairs with these "added" elements and feed them to quickselect, which makes algorithm a little bit faster but does not improve asymptotic complexity.
If k < N we could ignore all the array elements with index greater than k. In this case complexity is equal to O(k). If N < k < N(N-1) we just have better complexity than requested in OP. If k > N(N-1), we'd better solve the opposite problem: k'th smallest sum.
I uploaded simple C++11 implementation to ideone. Code is not optimized and not thoroughly tested. I tried to make it as close as possible to pseudo-code in linked paper. This implementation uses std::nth_element, which allows linear complexity only on average (not worst-case).
A completely different approach to find K'th sum in linear time is based on priority queue (PQ). One variation is to insert largest pair to PQ, then repeatedly remove top of PQ and instead insert up to two pairs (one with decremented index in one array, other with decremented index in other array). And take some measures to prevent inserting duplicate pairs. Other variation is to insert all possible pairs containing largest element of first array, then repeatedly remove top of PQ and instead insert pair with decremented index in first array and same index in second array. In this case there is no need to bother about duplicates.
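The second variation can be sketched with Python's `heapq` module. `heapq` is a binary min-heap, so sums are negated; this is the ordinary max-heap version, not the O(1) priority queue discussed next. The function name is hypothetical, both arrays are assumed ascending, and k is assumed to be between 1 and M*N.

```python
import heapq


def kth_largest_pair(a, b, k):
    """Return (sum, x, y) for a pair with the k-th largest sum."""
    last = len(a) - 1
    # Seed the queue with every pair that uses the largest element of a.
    heap = [(-(a[last] + b[j]), last, j) for j in range(len(b))]
    heapq.heapify(heap)
    for _ in range(k - 1):
        _, i, j = heapq.heappop(heap)
        if i > 0:
            # Replace the removed pair with the next smaller element of a,
            # keeping the same index in b, so no duplicates can arise.
            heapq.heappush(heap, (-(a[i - 1] + b[j]), i - 1, j))
    neg_sum, i, j = heap[0]
    return -neg_sum, a[i], b[j]
```

Seeding all N columns costs O(N) up front and each of the k - 1 steps costs O(log N). When several pairs tie on the sum, which pair is returned depends on the heap's tie-breaking.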
OP mentions O(K log K) solution where PQ is implemented as max-heap. But in some cases (when array elements are evenly distributed integers with limited range and linear complexity is needed only on average, not worst-case) we could use O(1) time priority queue, for example, as described in this paper: "A Complexity O(1) Priority Queue for Event Driven Molecular Dynamics Simulations" by Gerald Paul. This allows O(K) expected time complexity.
Advantage of this approach is a possibility to provide first K elements in sorted order. Disadvantages are limited choice of array element type, more complex and slower algorithm, worse asymptotic complexity: O(K) > O(N).
EDIT: This does not work. I leave the answer, since apparently I am not the only one who could have this kind of idea; see the discussion below.
A counter-example is x = (2, 3, 6), y = (1, 4, 5) and k=3, where the algorithm gives 7 (3+4) instead of 8 (3+5).
Let x and y be the two arrays, sorted in decreasing order; we want to construct the K-th largest sum.
The variables are: i the index in the first array (element x[i]), j the index in the second array (element y[j]), and k the "order" of the sum (k in 1..K), in the sense that S(k)=x[i]+y[j] will be the k-th greater sum satisfying your conditions (this is the loop invariant).
Start from (i, j) equal to (0, 0): clearly, S(1) = x[0]+y[0].
for k from 1 to K-1, do:
if x[i+1]+ y[j] > x[i] + y[j+1], then i := i+1 (and j does not change) ; else j:=j+1
To see that it works, consider you have S(k) = x[i] + y[j]. Then, S(k+1) is the greatest sum which is lower (or equal) to S(k), and such as at least one element (i or j) changes. It is not difficult to see that exactly one of i or j should change.
If i changes, the greater sum you can construct which is lower than S(k) is by setting i=i+1, because x is decreasing and all the x[i'] + y[j] with i' < i are greater than S(k). The same holds for j, showing that S(k+1) is either x[i+1] + y[j] or x[i] + y[j+1].
Therefore, at the end of the loop you found the K-th greater sum.
tl;dr: If you look ahead and look behind at each iteration, you can start with the end (which is highest) and work back in O(K) time.
Although the insight underlying this approach is, I believe, sound, the code below is not quite correct at present (see comments).
Let's see: first of all, the arrays are sorted. So, if the arrays are a and b with lengths M and N, and as you have arranged them, the largest items are in slots M and N respectively, the largest pair will always be a[M]+b[N].
Now, what's the second largest pair? It's going to have perhaps one of {a[M],b[N]} (it can't have both, because that's just the largest pair again), and at least one of {a[M-1],b[N-1]}. BUT, we also know that if we choose a[M-1]+b[N-1], we can make one of the operands larger by choosing the higher number from the same list, so it will have exactly one number from the last column, and one from the penultimate column.
Consider the following two arrays: a = [1, 2, 53]; b = [66, 67, 68]. Our highest pair is 53+68. If we lose the smaller of those two, our pair is 68+2; if we lose the larger, it's 53+67. So, we have to look ahead to decide what our next pair will be. The simplest lookahead strategy is simply to calculate the sum of both possible pairs. That will always cost two additions and two comparisons for each transition (three because we need to deal with the case where the sums are equal); let's call that cost Q.
At first, I was tempted to repeat that K-1 times. BUT there's a hitch: the next largest pair might actually be the other pair we can validly make from {{a[M],b[N]}, {a[M-1],b[N-1]}}. So, we also need to look behind.
So, let's code (python, should be 2/3 compatible):
def kth(a,b,k):
    M = len(a)
    N = len(b)
    if k > M*N:
        raise ValueError("There are only %s possible pairs; you asked for the %sth largest, which is impossible" % (M*N,k))
    (ia,ib) = M-1,N-1 #0 based arrays
    # we need this for lookback
    nottakenindices = (0,0) # could be any value
    nottakensum = float('-inf')
    for i in range(k-1):
        optionone = a[ia]+b[ib-1]
        optiontwo = a[ia-1]+b[ib]
        biggest = max((optionone,optiontwo))
        #first deal with look behind
        if nottakensum > biggest:
            if optionone == biggest:
                newnottakenindices = (ia,ib-1)
            else: newnottakenindices = (ia-1,ib)
            ia,ib = nottakenindices
            nottakensum = biggest
            nottakenindices = newnottakenindices
        #deal with case where indices hit 0
        elif ia <= 0 and ib <= 0:
            ia = ib = 0
        elif ia <= 0:
            ib-=1
            ia = 0
            nottakensum = float('-inf')
        elif ib <= 0:
            ia-=1
            ib = 0
            nottakensum = float('-inf')
        #lookahead cases
        elif optionone > optiontwo:
            #then choose the first option as our next pair
            nottakensum,nottakenindices = optiontwo,(ia-1,ib)
            ib-=1
        elif optionone < optiontwo: # choose the second
            nottakensum,nottakenindices = optionone,(ia,ib-1)
            ia-=1
        #next two cases apply if options are equal
        elif a[ia] > b[ib]:# drop the smallest
            nottakensum,nottakenindices = optiontwo,(ia-1,ib)
            ib-=1
        else: # might be equal or not - we can choose arbitrarily if equal
            nottakensum,nottakenindices = optionone,(ia,ib-1)
            ia-=1
        #+2 - one for zero-based, one for skipping the 1st largest
        data = (i+2,a[ia],b[ib],a[ia]+b[ib],ia,ib)
        narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
        print (narrative) #this will work in both versions of python
        if ia <= 0 and ib <= 0:
            raise ValueError("Both arrays exhausted before Kth (%sth) pair reached"%data[0])
    return data, narrative
For those without python, here's an ideone: http://ideone.com/tfm2MA
At worst, we have 5 comparisons in each iteration, and K-1 iterations, which means that this is an O(K) algorithm.
Now, it might be possible to exploit information about differences between values to optimise this a little bit, but this accomplishes the goal.
Here's a reference implementation (not O(K), but will always work, unless there's a corner case with cases where pairs have equal sums):
import itertools
def refkth(a,b,k):
    pairs = itertools.product(enumerate(a),enumerate(b))
    # sort all (index, element) pairs by sum, largest first, and take the k-th
    (rightia,righta),(rightib,rightb) = sorted(pairs, key=lambda p: p[0][1]+p[1][1], reverse=True)[k-1]
    data = k,righta,rightb,righta+rightb,rightia,rightib
    narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
    print (narrative) #this will work in both versions of python
    return data, narrative
This calculates the cartesian product of the two arrays (i.e. all possible pairs), sorts them by sum, and takes the kth element. The enumerate function decorates each item with its index.
The max-heap algorithm in the other question is simple, fast and correct. Don't knock it. It's really well explained too. https://stackoverflow.com/a/5212618/284795
Might be there isn't any O(k) algorithm. That's okay, O(k log k) is almost as fast.
If the last two solutions were at (a1, b1), (a2, b2), then it seems to me there are only four candidate solutions (a1-1, b1) (a1, b1-1) (a2-1, b2) (a2, b2-1). This intuition could be wrong. Surely there are at most four candidates for each coordinate, and the next highest is among the 16 pairs (a in {a1,a2,a1-1,a2-1}, b in {b1,b2,b1-1,b2-1}). That's O(k).
(No it's not, still not sure whether that's possible.)
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
Merge the 2 arrays and note down the indexes in the sorted array. Here is what the index array looks like (starting from 1, not 0):
[1, 2, 4, 6, 8]
[3, 5, 7, 9]
Now start from end and make tuples. sum the elements in the tuple and pick the kth largest sum.
public static List<List<Integer>> optimization(int[] nums1, int[] nums2, int k) {
// 2 * O(n log(n))
Arrays.sort(nums1);
Arrays.sort(nums2);
List<List<Integer>> results = new ArrayList<>(k);
int endIndex = 0;
// Find the number whose square is the first one bigger than k
for (int i = 1; i <= k; i++) {
if (i * i >= k) {
endIndex = i;
break;
}
}
// The following iteration provides at most endIndex^2 elements, and both arrays are in ascending order,
// so the k smallest pairs should be found in this iteration. To flatten the nested loop, refer to
// 'https://stackoverflow.com/questions/7457879/algorithm-to-optimize-nested-loops'
for (int i = 0; i < endIndex * endIndex; i++) {
int m = i / endIndex;
int n = i % endIndex;
List<Integer> item = new ArrayList<>(2);
item.add(nums1[m]);
item.add(nums2[n]);
results.add(item);
}
results.sort(Comparator.comparing(pair->pair.get(0) + pair.get(1)));
return results.stream().limit(k).collect(Collectors.toList());
}
Key to eliminate O(n^2):
Avoid cartesian product(or 'cross join' like operation) of both arrays, which means flattening the nested loop.
Downsize iteration over the 2 arrays.
So:
Sort both arrays (Arrays.sort offers O(n log(n)) performance according to Java doc)
Limit the iteration range to the size which is just big enough to support k smallest pairs searching.

Median Algorithm in O(log n)

How can we remove the median of a set with time complexity O(log n)? Any ideas?
If the set is sorted, finding the median requires O(1) item retrievals. If the items are in arbitrary sequence, it will not be possible to identify the median with certainty without examining the majority of the items. If one has examined most, but not all, of the items, that will allow one to guarantee that the median will be within some range [if the list contains duplicates, the upper and lower bounds may match], but examining the majority of the items in a list implies O(n) item retrievals.
If one has the information in a collection which is not fully ordered, but where certain ordering relationships are known, then finding the median may require anywhere between O(1) and O(n) item retrievals, depending upon the nature of the known ordering relation.
For unsorted lists, repeatedly do O(n) partial sort until the element located at the median position is known. This is at least O(n), though.
Is there any information about the elements being sorted?
For a general, unsorted set, it is impossible to reliably find the median in better than O(n) time. You can find the median of a sorted set in O(1), or you can trivially sort the set yourself in O(n log n) time and then find the median in O(1), giving an O(n log n) algorithm. Or, finally, there are more clever median selection algorithms that work by partitioning instead of sorting and yield O(n) performance.
But if the set has no special properties and you are not allowed any pre-processing step, you will never get below O(n) by the simple fact that you will need to examine all of the elements at least once to ensure that your median is correct.
Here's a solution in Java, based on TreeSet:
public class SetWithMedian {
private SortedSet<Integer> s = new TreeSet<Integer>();
private Integer m = null;
public boolean contains(int e) {
return s.contains(e);
}
public Integer getMedian() {
return m;
}
public void add(int e) {
s.add(e);
updateMedian();
}
public void remove(int e) {
s.remove(e);
updateMedian();
}
private void updateMedian() {
if (s.size() == 0) {
m = null;
} else if (s.size() == 1) {
m = s.first();
} else {
SortedSet<Integer> h = s.headSet(m);
SortedSet<Integer> t = s.tailSet(m + 1);
int x = 1 - s.size() % 2;
if (h.size() < t.size() + x)
m = t.first();
else if (h.size() > t.size() + x)
m = h.last();
}
}
}
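Removing the median (i.e. "set.remove(set.getMedian())" on a SetWithMedian instance "set") is intended to take O(log n) time. Note, however, that in java.util.TreeSet the size() of a headSet/tailSet view is computed by iterating over the view, so updateMedian as written costs O(n) per call; reaching O(log n) requires tracking the sizes of the two halves some other way.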
Edit: To help understand the code, here's the invariant condition of the class attributes:
private boolean isGood() {
if (s.isEmpty()) {
return m == null;
} else {
return s.contains(m) && s.headSet(m).size() + s.size() % 2 == s.tailSet(m).size();
}
}
In human-readable form:
If the set "s" is empty, then "m" must be null.
If the set "s" is not empty, then it must contain "m".
Let x be the number of elements strictly less than "m", and let y be the number of elements greater than or equal to "m". Then, if the total number of elements is even, x must be equal to y; otherwise, x+1 must be equal to y.
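As a possible variation on the class above, the same invariant can be kept with two separate TreeSets instead of views on one set. This is only a sketch: the class name TwoSetMedian and the method removeMedian are hypothetical, not part of the original answer. It relies on size() of a whole TreeSet being constant time and on add, remove, pollFirst and pollLast being O(log n), so every operation, including removing the median, stays O(log n).

```java
import java.util.TreeSet;

class TwoSetMedian {
    // Elements strictly below the median.
    private final TreeSet<Integer> low = new TreeSet<>();
    // The median and everything above it; high.size() is low.size() or low.size() + 1.
    private final TreeSet<Integer> high = new TreeSet<>();

    // Returns the median (the upper one for an even number of elements), or null if empty.
    public Integer getMedian() {
        return high.isEmpty() ? null : high.first();
    }

    public boolean contains(int e) {
        return low.contains(e) || high.contains(e);
    }

    public boolean add(int e) {
        if (contains(e)) {
            return false; // set semantics: no duplicates
        }
        if (high.isEmpty() || e >= high.first()) {
            high.add(e);
        } else {
            low.add(e);
        }
        rebalance();
        return true;
    }

    public boolean remove(int e) {
        boolean removed = low.remove(e) || high.remove(e);
        if (removed) {
            rebalance();
        }
        return removed;
    }

    // Removes and returns the median, or returns null if the set is empty.
    public Integer removeMedian() {
        Integer m = high.pollFirst();
        if (m != null) {
            rebalance();
        }
        return m;
    }

    // Each operation changes one size by one, so moving a single element restores the invariant.
    private void rebalance() {
        if (high.size() > low.size() + 1) {
            low.add(high.pollFirst());
        } else if (high.size() < low.size()) {
            high.add(low.pollLast());
        }
    }

    public static void main(String[] args) {
        TwoSetMedian set = new TwoSetMedian();
        for (int v : new int[] {1, 2, 3, 4}) {
            set.add(v);
        }
        System.out.println(set.removeMedian());
        System.out.println(set.getMedian());
    }
}
```

The trade-off is that membership now has to be checked in two sets, but no view sizes are ever computed.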
Try a red-black tree. It should work quite well, and with a binary search you get your O(log n). Insertion and removal also take O(log n), and rebalancing is done in O(log n) as well.
As mentioned in previous answers, there is no way to find the median without touching every element of the data structure. If the algorithm you look for must be executed sequentially, then the best you can do is O(n). The deterministic selection algorithm (median-of-medians) or BFPRT algorithm will solve the problem with a worst case of O(n). You can find more about that here: http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
However, the median of medians algorithm can be made to run faster than O(n) by making it parallel. Due to its divide and conquer nature, the algorithm can be "easily" made parallel. For instance, when dividing the input array into groups of 5 elements, you could potentially launch a thread for each sub-array, sort it and find the median within that thread. When this step has finished, the threads are joined and the algorithm is run again with the newly formed array of medians.
Note that such a design would only be beneficial for really large data sets. The additional overhead of spawning threads and merging them makes it unfeasible for smaller sets. This has a bit of insight: http://www.umiacs.umd.edu/research/EXPAR/papers/3494/node18.html
Note that you can find asymptotically faster algorithms out there, however they are not practical enough for daily use. Your best bet is the already mentioned sequential median-of-medians algorithm.
Master Yoda's randomized algorithm has, of course, a minimum complexity of n like any other, an expected complexity of n (not log n) and a maximum complexity of n squared like Quicksort. It's still very good.
In practice, the "random" pivot choice might sometimes be a fixed location (without involving a RNG) because the initial array elements are known to be random enough (e.g. a random permutation of distinct values, or independent and identically distributed) or deduced from an approximate or exactly known distribution of input values.
I know one randomized algorithm with time complexity of O(n) in expectation.
Here is the algorithm:
Input: array of n numbers A[1...n] [without loss of generality we can assume n is even and the elements are distinct]
Output: the (n/2)th element in the sorted array.
Algorithm (A[1..n], k = n/2):
Pick a pivot index p uniformly at random from 1...n
Divide the array into 2 parts:
L - the elements < A[p]
R - the elements > A[p]
if (k == |L| + 1) A[p] is the answer; stop
if (k <= |L|) recurse on (L, k)
else recurse on (R, k - (|L| + 1))
Complexity:
O(n) in expectation
The proof is all mathematical. One page long. If you are interested, ping me.
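The steps above can be sketched in Java roughly as follows. This is a minimal, iterative quickselect under a few assumptions of my own: the class and method names (QuickSelect, select) are illustrative, k is 1-based, the input is copied so the caller's array is not reordered, and a Lomuto-style partition is used in place of building separate L and R arrays.

```java
import java.util.Random;

class QuickSelect {
    private static final Random RNG = new Random();

    // Returns the k-th smallest element of a (k is 1-based); expected O(n), worst case O(n^2).
    public static int select(int[] a, int k) {
        if (k < 1 || k > a.length) {
            throw new IllegalArgumentException("k out of range");
        }
        int[] work = a.clone();
        int lo = 0;
        int hi = work.length - 1;
        int target = k - 1; // 0-based rank we are looking for
        while (lo < hi) {
            int pivotIndex = lo + RNG.nextInt(hi - lo + 1);
            int p = partition(work, lo, hi, pivotIndex);
            if (target == p) {
                return work[p];
            } else if (target < p) {
                hi = p - 1; // answer lies in the "L" part
            } else {
                lo = p + 1; // answer lies in the "R" part
            }
        }
        return work[lo];
    }

    // Moves elements smaller than the pivot to the left; returns the pivot's final index.
    private static int partition(int[] a, int lo, int hi, int pivotIndex) {
        int pivot = a[pivotIndex];
        swap(a, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        int[] data = {9, 8, 7, 6, 5, 4, 3, 2, 1};
        System.out.println(select(data, (data.length + 1) / 2));
    }
}
```

Unlike the pseudocode, this version also tolerates duplicate values, although many equal elements push it toward the quadratic worst case.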
To expand on rwong's answer, here is some example code:
// partial_sort example
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;
int main () {
int myints[] = {9,8,7,6,5,4,3,2,1};
vector<int> myvector (myints, myints+9);
vector<int>::iterator it;
partial_sort (myvector.begin(), myvector.begin()+5, myvector.end());
// print out content:
cout << "myvector contains:";
for (it=myvector.begin(); it!=myvector.end(); ++it)
cout << " " << *it;
cout << endl;
return 0;
}
Output:
myvector contains: 1 2 3 4 5 9 8 7 6
The element in the middle would be the median.

Resources