Find the k largest elements in order - algorithm

What is the fastest way to find the k largest elements in an array in order (i.e. starting from the largest element to the kth largest element)?

One option would be the following:
Using a linear-time selection algorithm like median-of-medians or introselect, find the kth largest element and rearrange the array so that the k largest elements sit together in one block (for example, the last k positions).
Sort just those k elements using a fast sorting algorithm like heapsort or quicksort.
Step (1) takes time O(n), and step (2) takes time O(k log k). Overall, the algorithm runs in time O(n + k log k), which is asymptotically optimal for a comparison-based algorithm.
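The two steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: it uses a randomized quickselect (expected O(n), not the worst-case O(n) of median-of-medians), gathers the k largest items at the front of a copy, and the name k_largest_sorted is made up for this example.

```python
import random

def k_largest_sorted(values, k):
    """Return the k largest items of `values`, largest first."""
    items = list(values)                  # work on a copy; the input is untouched
    k = max(0, min(k, len(items)))
    if k == 0:
        return []
    lo, hi = 0, len(items) - 1
    # Step 1: quickselect until the k largest items occupy items[:k].
    while lo < hi:
        pivot = items[random.randint(lo, hi)]
        i, j = lo, hi
        # Hoare-style partition in descending order.
        while i <= j:
            while items[i] > pivot:
                i += 1
            while items[j] < pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        # Now items[lo..j] >= pivot >= items[i..hi]; anything between equals pivot.
        if k - 1 <= j:
            hi = j
        elif k - 1 >= i:
            lo = i
        else:
            break
    # Step 2: sort only the k selected items, O(k log k).
    return sorted(items[:k], reverse=True)
```

For example, k_largest_sorted([4, 3, 7, 12, 23, 1, 8, 5, 9, 2], 3) returns [23, 12, 9].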
Hope this helps!

C++ also provides the partial_sort algorithm, which solves the problem of selecting the smallest k elements (sorted), with a time complexity of O(n log k). No algorithm is provided for selecting the greatest k elements since this should be done by inverting the ordering predicate.
For Perl, the module Sort::Key::Top, available from CPAN, provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures. Furthermore, the Statistics::CaseResampling module provides a function to calculate quantiles using quickselect.
Python's standard library (since 2.4) includes heapq.nsmallest() and heapq.nlargest(), both returning sorted lists. Current CPython implementations keep a heap of at most k elements while scanning the input, so both run in O(n log k) time.
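As a quick illustration of the Python functions mentioned above, here is a short usage sketch. The sample data and variable names are made up for the example.

```python
import heapq

data = [4, 3, 7, 12, 23, 1, 8, 5, 9, 2]

# Both functions return a new sorted list and leave `data` unchanged.
top3 = heapq.nlargest(3, data)        # largest first
bottom3 = heapq.nsmallest(3, data)    # smallest first

# An optional key function selects what to compare, as with sorted().
longest = heapq.nlargest(2, ["pear", "fig", "banana"], key=len)
```

Here top3 is [23, 12, 9], bottom3 is [1, 2, 3] and longest is ["banana", "pear"].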

Radix sort solution:
Sort the array in descending order, using radix sort;
Print first K elements.
Time complexity: O(N*L), where L = length of the largest element, can assume L = O(1).
Space used: O(N) for radix sort.
However, I think radix sort has costly overhead, making its linear time complexity less attractive.
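A rough Python sketch of the radix sort idea, assuming the input consists of non-negative integers only (negative numbers would need extra handling). The function names and the default base of 256 are choices made for this example.

```python
def radix_sort_desc(nums, base=256):
    """LSD radix sort of non-negative integers, largest first."""
    out = list(nums)
    if not out:
        return out
    largest = max(out)
    place = 1
    # One stable bucketing pass per digit, least significant digit first.
    while largest // place > 0:
        buckets = [[] for _ in range(base)]
        for x in out:
            buckets[(x // place) % base].append(x)
        # Read buckets from the highest digit down; order inside a bucket is kept.
        out = [x for digit in range(base - 1, -1, -1) for x in buckets[digit]]
        place *= base
    return out

def k_largest_radix(nums, k):
    """Sort everything in descending order, then take the first k."""
    return radix_sort_desc(nums)[:k]
```

The number of passes is the number of base-256 digits of the largest element, which plays the role of L in the complexity estimate above.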

1) Build a Max Heap tree in O(n)
2) Use Extract Max k times to get k maximum elements from the Max Heap O(klogn)
Time complexity: O(n + klogn)
A C++ implementation using STL is given below:
#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;
int main() {
    // Build the vector directly; no intermediate array is needed
    vector<int> vec = {4, 3, 7, 12, 23, 1, 8, 5, 9, 2};
    // Let's extract the 3 maximum elements
    int k = 3;
    // Build a max-heap in O(n)
    make_heap(vec.begin(), vec.end());
    // Extract the maximum k times, O(log n) each
    for (int i = 0; i < k && !vec.empty(); i++) {
        cout << vec.front() << " ";
        pop_heap(vec.begin(), vec.end());  // moves the maximum to the back
        vec.pop_back();
    }
    return 0;
}

@templatetypedef's solution is probably the fastest one, assuming you can modify or copy the input.
Alternatively, you can use heap or BST (set in C++) to store k largest elements at given moment, then read array's elements one by one. While this is O(n lg k), it doesn't modify input and only uses O(k) additional memory. It also works on streams (when you don't know all the data from the beginning).
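The streaming variant described above could look like this in Python. It is a minimal sketch: the function name top_k_stream is made up, and a min-heap of size k stands in for the "heap or BST" mentioned in the text.

```python
import heapq

def top_k_stream(stream, k):
    """Return the k largest items of an iterable, largest first, using O(k) memory."""
    if k <= 0:
        return []
    heap = []  # min-heap holding the k largest items seen so far
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest of the current top k, so it replaces it.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)
```

Because the input is only read once and never stored, top_k_stream also accepts a generator or a file iterator.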

Here's a solution with O(N + k lg k) complexity.
int[] kLargest_Dremio(int[] A, int k) {
int[] result = new int[k];
shouldGetIndex = true;
int q = AreIndicesValid(0, A.Length - 1) ? RandomizedSelet(0, A.Length-1,
A.Length-k+1) : -1;
Array.Copy(A, q, result, 0, k);
Array.Sort(result, (a, b) => b.CompareTo(a)); // a Comparison<int> must return an int; this sorts in descending order
return result;
}
AreIndicesValid and RandomizedSelet are defined in this github source file.

There was a question on performance & restricted resources.
Make a value class for the top 3 values. Use such an accumulator for reduction in a parallel stream. Limit the parallelism according to the context (memory, power).
class BronzeSilverGold {
int[] values = new int[] {Integer.MIN_VALUE, Integer.MIN_VALUE, Integer.MIN_VALUE};
// For reduction
void add(int x) {
...
}
// For combining two results of two threads.
void merge(BronzeSilverGold other) {
...
}
}
The parallelism must be restricted in your constellation, hence specify an N_THREADS in:
try {
    ForkJoinPool threadPool = new ForkJoinPool(N_THREADS);
    threadPool.submit(() -> {
        BronzeSilverGold result = IntStream.of(...).parallel().collect(
            BronzeSilverGold::new,
            BronzeSilverGold::add,
            BronzeSilverGold::merge);
        ...
    }).get(); // get() waits for the task and is what can throw the exceptions caught below
} catch (InterruptedException | ExecutionException e) {
    prrtl();
}
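The bodies of add and merge are elided in the answer above. Purely as an illustration of what such an accumulator might do, here is a hypothetical Python analogue: the class TopThree, the helper top_three and the idea of passing pre-split chunks are assumptions made for this sketch, and it runs sequentially rather than on a thread pool.

```python
from functools import reduce

class TopThree:
    """Hypothetical analogue of BronzeSilverGold: keeps the 3 largest values seen."""

    def __init__(self):
        self.values = []  # kept in descending order, at most 3 entries

    def add(self, x):
        # Reduction step: fold one more value into this accumulator.
        self.values.append(x)
        self.values.sort(reverse=True)
        del self.values[3:]
        return self

    def merge(self, other):
        # Combination step: fold another worker's partial result into this one.
        self.values = sorted(self.values + other.values, reverse=True)[:3]
        return self

def top_three(chunks):
    """Reduce each chunk separately (as one worker would), then merge the partials."""
    partials = [reduce(TopThree.add, chunk, TopThree()) for chunk in chunks]
    return reduce(TopThree.merge, partials, TopThree()).values
```

Each accumulator holds at most three values, so the memory per worker stays constant no matter how large its chunk is.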

Related

Find an algorithm for sorting integers with time complexity O(n + k*log(k))

Design an algorithm that sorts n integers where there are duplicates. The total number of different numbers is k. Your algorithm should have time complexity O(n + k*log(k)). The expected time is enough. For which values of k does the algorithm become linear?
I am not able to come up with a sorting algorithm for integers that is O(n + k*log(k)). I am not a very advanced programmer. In the previous problem I was supposed to come up with an algorithm for a list of numbers xi with 0 ≤ xi ≤ m that runs in O(n+m), where n is the number of elements in the list and m is the value of the biggest integer in the list. I solved that problem easily by using counting sort, but I struggle with this one. What makes it most difficult for me is the k*log(k) term in the big-O bound; if it were n*log(n) instead I would be able to use merge sort, right? That's not possible here, so any ideas would be very helpful.
Thanks in advance!
Here is a possible solution:
Using a hash table, count the number of unique values and the number of duplicates of each value. This should have a complexity of O(n).
Enumerate the hashtable, storing the unique values into a temporary array. Complexity is O(k).
Sort this array with a standard algorithm such as mergesort: complexity is O(k.log(k)).
Create the resulting array by replicating the elements of the sorted array of unique values each the number of times stored in the hash table. complexity is O(n) + O(k).
Combined complexity is O(n + k.log(k)).
For example, if k is a small constant, sorting an array of n values converges toward linear time as n becomes larger and larger.
If during the first phase, where k is computed incrementally, it appears that k is not significantly smaller than n, drop the hash table and just sort the original array with a standard algorithm.
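On the question of which values of k make the algorithm linear, a short derivation may help. It assumes k ≤ n (there cannot be more distinct values than elements) and uses c for an arbitrary positive constant:

```latex
n + k\log k = O(n) \quad\text{whenever}\quad k \le \frac{c\,n}{\log n},
\qquad\text{because then}\qquad
k\log k \;\le\; \frac{c\,n}{\log n}\cdot\log n \;=\; c\,n .
```

So the running time is linear for every k = O(n / log n), which includes constant k as a special case.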
The runtime of O(n + k*log(k)) indicates (as addition in runtimes often does) that you have 2 subroutines, one which runs in O(n) and another which runs in O(k*log(k)).
You can first count the frequency of the elements in O(n) (for example in a HashMap; look this up if you're not familiar with it, it's very useful).
Then you just sort the unique elements, of which there are k. This sorting runs in O(k*log(k)); use any sorting algorithm you want.
At the end, replace each unique element by as many copies as it originally had, by looking the count up in the map you created in step 1.
A possible Java solution can look like this:
public List<Integer> sortArrayWithDuplicates(List<Integer> arr) {
// O(n)
Set<Integer> set = new HashSet<>(arr);
Map<Integer, Integer> freqMap = new HashMap<>();
for(Integer i: arr) {
freqMap.put(i, freqMap.getOrDefault(i, 0) + 1);
}
List<Integer> withoutDups = new ArrayList<>(set);
// Sorting => O(k(log(k)))
// as there are k different elements
Collections.sort(withoutDups); // Arrays.sort does not accept a List
List<Integer> result = new ArrayList<>();
for(Integer i : withoutDups) {
int c = freqMap.get(i);
for(int j = 0; j < c; j++) {
result.add(i);
}
}
// return the result
return result;
}
The time complexity of the above code is O(n + k*log(k)) and solution is in the same line as answered above.

Can O(1) or O(log n) be used for the count of elements and how would that work

I have a task that sounds like this:
Let A be a sorted list with n elements. We would like to add some elements
into A so that the entire list is sorted as well.
(i) Give an O(n)-algorithm that adds O(1) elements into A and returns a sorted list.
(ii) Give an O(n)-algorithm that adds O(log n) elements into A and returns a sorted list.
I understand how big O notation can be used to describe time and space complexity (and in this task I assume both parts require the time complexity to be O(n)), but here it seems to describe the number of elements too. It is really difficult to understand. Could anyone explain how to interpret the "O(1)" and the "O(log n)" parts?
EDIT: Do you have any suggestions what type of algorithm should I use to complete the tasks?
(i) Adding O(1) elements to A means adding a constant number of elements, i.e. a number that does not grow with n (a single element is the simplest case). You could do that by first creating a linked list from the list of elements, which is O(n). Inserting a node at the right position by iterating over the linked list is again O(n), and a constant number of such insertions is still O(n). Hence the overall time complexity is O(n).
For Reference:
https://www.geeksforgeeks.org/given-a-linked-list-which-is-sorted-how-will-you-insert-in-sorted-way/
(ii) We can first sort the O(log n) elements we want to insert, this can be done, by an algorithm like merge sort or heap sort, in O(log(n) log (log(n))). if log(n) = k, then that would be O(k log(k)). Next we can make a merge of the two lists (or any kind of datastructure, as long as we can iterate over it in ascending order). This merge can be done in O(k+n), since we concurrently can iterate over the two lists, and each time "emit" the smallest of the two and advance the corresponding cursor.
So the total time complexity would be O(n).
For example for two arrays that are sorted, we can merge these with:
public static int[] merge_sorted(int[] a, int[] b) {
int k = a.length;
int n = b.length;
int[] c = new int[k+n];
int ai = 0;
int bi = 0;
int ci = 0;
while(ai < k && bi < n) {
if(a[ai] <= b[bi]) {
c[ci++] = a[ai++];
} else {
c[ci++] = b[bi++];
}
}
while(ai < k) {
c[ci++] = a[ai++];
}
while(bi < n) {
c[ci++] = b[bi++];
}
return c;
}
Those describe the size of the set of values to add to the list relative to its current size N.

Divide and conquer algorithm

I had a job interview a few weeks ago and I was asked to design a divide and conquer algorithm. I could not solve the problem, but they just called me for a second interview! Here is the question:
We are given as input two n-element arrays A[0..n − 1] and B[0..n − 1] (which
are not necessarily sorted) of integers, and an integer value. Give an O(n log n) divide and conquer algorithm that determines if there exist distinct values i, j (that is, i != j) such that A[i] + B[j] = value. Your algorithm should return True if such i, j exist, and return False otherwise. You may assume that the elements in A are distinct, and the elements in B are distinct.
Can anybody solve the problem? Thanks
My approach is..
Sort one of the arrays; here we sort A. Use merge sort, which is a divide and conquer algorithm.
Then, for each element of B, search A for (required value - element of B) using binary search. This is again a divide and conquer algorithm.
If the search finds (required value - element of B) in A, the two elements form a pair such that element of A + element of B = required value.
As for time complexity: A has N elements, so merge sort takes O(N log N), and one binary search for each of the N elements of B takes O(N log N) in total. So the total time complexity is O(N log N).
Since you also require i != j whenever A[i] + B[j] = value, store A as a 2D array of size N * 2 in which each row pairs an element with its original index. Sort by the first column. Then, when a search finds the element, compare the two original indexes and accept the pair only if they differ.
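The approach above can be sketched in Python roughly as follows. Python's built-in sorted() (not a hand-written merge sort) and the bisect module stand in for the two divide and conquer steps, the name has_pair_sum is made up, and the elements of A are assumed to be distinct, as the question states.

```python
from bisect import bisect_left

def has_pair_sum(A, B, value):
    """True if A[i] + B[j] == value for some i != j (elements of A distinct)."""
    # Pair every element of A with its original index, then sort by element.
    indexed = sorted((a, i) for i, a in enumerate(A))
    keys = [a for a, _ in indexed]
    for j, b in enumerate(B):
        target = value - b
        pos = bisect_left(keys, target)          # binary search, O(log n)
        if pos < len(keys) and keys[pos] == target:
            # A is distinct, so this is the only candidate; check i != j.
            if indexed[pos][1] != j:
                return True
    return False
```

With A = [7, 9, 5, 3, 47, 89, 1], B = [5, 7, 3, 4, 21, 59, 0] and value = 106, the function finds 47 + 59 at indexes 4 and 5 and returns True.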
The following algorithm does not use Divide and Conquer but it is one of the solutions.
You need to sort both arrays while keeping track of the original indexes, for example by sorting arrays of (elem, index) pairs. This takes O(n log n) time.
Then you can apply a merge-style two-pointer scan to check whether there are two elements such that A[i]+B[j] = value. This takes O(n).
The overall time complexity is O(n log n).
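A possible Python sketch of this sort-then-scan idea, assuming (as in the question) that the elements within each array are distinct. The function name is made up; A is walked in ascending order while B is walked in descending order.

```python
def has_pair_sum_sorted(A, B, value):
    """True if A[i] + B[j] == value for some i != j (elements distinct per array)."""
    a = sorted((x, i) for i, x in enumerate(A))                   # ascending
    b = sorted(((y, j) for j, y in enumerate(B)), reverse=True)   # descending
    p = q = 0
    while p < len(a) and q < len(b):
        s = a[p][0] + b[q][0]
        if s == value:
            if a[p][1] != b[q][1]:
                return True
            # Same original index: with distinct elements neither value can
            # match anything else, so both cursors move on.
            p += 1
            q += 1
        elif s < value:
            p += 1   # every remaining b is smaller, so a[p] cannot reach value
        else:
            q += 1   # every remaining a is larger, so b[q] overshoots
    return False
```

Each cursor only moves forward, so the scan after sorting takes at most 2n steps.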
I suggest using hashing. Even if it's not the way you are supposed to solve the problem, it's worth mentioning, since hashing has a better expected time complexity, O(n) vs. O(n*log(n)), and is therefore more efficient.
Turn A into a hash set (or a dictionary if we want the index i) - O(n)
Scan B and check whether value - B[j] is in the hash set (dictionary) - O(n)
So you have an O(n) + O(n) = O(n) algorithm, which is better than the required O(n*log(n)); however, the solution is NOT divide and conquer:
Sample C# implementation
int[] A = new int[] { 7, 9, 5, 3, 47, 89, 1 };
int[] B = new int[] { 5, 7, 3, 4, 21, 59, 0 };
int value = 106; // 47 + 59 = A[4] + B[5]
// Turn A into a dictionary: key = item's value; value = item's index
var dict = A
.Select((val, index) => new {
v = val,
i = index, })
.ToDictionary(item => item.v, item => item.i);
int i = -1;
int j = -1;
// Scan B array
for (int k = 0; k < B.Length; ++k) {
if (dict.TryGetValue(value - B[k], out i) && i != k) { // the question requires i != j
// Solution found: {i, j}
j = k;
// if you want any solution then break
break;
// scan further (comment out "break") if you want all pairs
}
}
Console.Write(j >= 0 ? $"{i} {j}" : "No solution");
Seems hard to achieve without sorting.
If you leave the arrays unsorted, checking for existence of A[i]+B[j] = Value takes time Ω(n) for fixed i, then checking for all i takes Θ(n²), unless you find a trick to put some order in B.
Balanced Divide & Conquer on the unsorted arrays doesn't seem any better: if you divide A and B in two halves, the solution can lie in one of Al/Bl, Al/Br, Ar/Bl, Ar/Br and this yields a recurrence T(n) = 4 T(n/2), which has a quadratic solution.
If sorting is allowed, the solution by Sanket Makani is a possibility, but you can do better in terms of time complexity for the search phase.
Indeed, assume A and B are now sorted and consider the 2D function A[i]+B[j], which is monotonic in both directions i and j. Then the domain A[i]+B[j] ≤ Value is bounded by a monotonic curve j = f(i) or, equivalently, i = g(j). But strict equality A[i]+B[j] = Value must be checked exhaustively for all points of the curve, and one cannot avoid evaluating f everywhere in the worst case.
Starting from i = 0, you obtain f(i) by binary (dichotomic) search. Then you can follow the border curve incrementally. You will perform n steps in the i direction and at most n steps in the j direction, so the complexity of the search remains bounded by O(n), which is optimal.
[Figure: an example showing the areas with a sum below and above the target value; there are two matches.]
This optimal solution has little to do with Divide & Conquer. It may be possible to design a variant based on evaluating the sum at a central point, which allows discarding a whole quadrant, but that would be pretty artificial.

Algorithm for sum-up to 0 from 4 set

I have 4 arrays A, B, C, D of size n. n is at most 4000. The elements of each array are 30 bit (positive/negative) numbers. I want to know the number of ways, A[i]+B[j]+C[k]+D[l] = 0 can be formed where 0 <= i,j,k,l < n.
The best algorithm I derived is O(n^2 lg n), is there a faster algorithm?
OK, here is my O(n^2 lg(n^2)) algorithm:
Suppose there are four arrays A[], B[], C[], D[]. We want to find the number of ways A[i]+B[j]+C[k]+D[l] = 0 can be formed, where 0 <= i,j,k,l < n.
First, sum every combination of an element of A[] with an element of B[] and place the sums in another array E[], which contains n*n elements.
int k=0;
for(i=0;i<n;i++)
{
for(j=0;j<n;j++)
{
E[k++]=A[i]+B[j];
}
}
The complexity of above code is O(n^2).
Do the same thing for C[] and D[].
int l=0;
for(i=0;i<n;i++)
{
for(j=0;j<n;j++)
{
AUX[l++]=C[i]+D[j];
}
}
The complexity of above code is O(n^2).
Now sort AUX[] so that you can easily count the occurrences of each unique element in AUX[].
The sorting complexity of AUX[] is O(n^2 lg(n^2)).
Now declare a structure:
struct myHash
{
int value;
int valueOccuredNumberOfTimes;
}F[];
Now place the unique elements of AUX[] into the structure array F[], together with the number of times each appeared.
Its complexity is O(n^2).
possibleQuardtupple=0;
Now, for each item of E[], do the following:
for(i=0;i<k;i++)
{
x=E[i];
find -x in structure F[] using binary search.
if(found in j'th position)
{
possibleQuardtupple+=number of occurrences of -x in F[j];
}
}
The loop over i performs n^2 iterations in total, and each iteration's binary search does lg(n^2) comparisons. So the overall complexity is O(n^2 lg(n^2)).
The number of ways 0 can be reached is possibleQuardtupple.
You can use either an STL map or binary search, but the STL map is slower in practice, so it's better to use binary search.
Hope my explanation is clear enough to understand.
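For reference, the algorithm described above can be written compactly in Python. This is a sketch under one simplification: instead of building the F[] structure of unique values and counts, it counts the occurrences of -x directly in the sorted array of C+D sums with two binary searches, which has the same O(n^2 lg(n^2)) cost. The function name is made up.

```python
from bisect import bisect_left, bisect_right

def count_zero_quadruples(A, B, C, D):
    """Count index tuples (i, j, k, l) with A[i] + B[j] + C[k] + D[l] == 0."""
    # All n^2 pairwise sums of C and D, sorted once: O(n^2 lg(n^2)).
    cd_sums = sorted(c + d for c in C for d in D)
    total = 0
    # For each of the n^2 sums of A and B, count matching C+D sums.
    for a in A:
        for b in B:
            target = -(a + b)
            # The two binary searches bracket the run of entries equal to target.
            total += bisect_right(cd_sums, target) - bisect_left(cd_sums, target)
    return total
```

For A = [1, 2], B = [-2, -1], C = [-1, 2], D = [0, 2] the result is 2.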
I disagree that your solution is in fact as efficient as you say. In your solution populating E[] and AUX[] is O(N^2) each, so 2.N^2. These will each have N^2 elements.
Generating x = O(N)
Sorting AUX = O((2N)*log((2N)))
The binary search for E[i] in AUX[] is based on N^2 elements to be found in N^2 elements.
Thus you are still doing N^4 work, plus extra work generating the intermediate arrays and sorting the N^2 elements in AUX[].
I have a solution (work in progress) but I find it very difficult to calculate how much work it is. I deleted my previous answer. I will post something when I am more sure of myself.
I need to find a way to compare O(X)+O(Z)+O(X^3)+O(X^2)+O(Z^3)+O(Z^2)+X.log(X)+Z.log(Z) to O(N^4) where X+Z = N.
It is clearly less than O(N^4) ... but by how much???? My math is failing me here....
The judgement is wrong. The supplied solution generates arrays with size N^2. It then operates on these arrays (sorting, etc).
Therefore the order of work, which would normally be O(n^2.log(n)), should have n substituted with n^2. The result is therefore O((n^2)^2.log(n^2))

Median Algorithm in O(log n)

How can we remove the median of a set with time complexity O(log n)? Any ideas?
If the set is sorted, finding the median requires O(1) item retrievals. If the items are in arbitrary sequence, it will not be possible to identify the median with certainty without examining the majority of the items. If one has examined most, but not all, of the items, that will allow one to guarantee that the median will be within some range [if the list contains duplicates, the upper and lower bounds may match], but examining the majority of the items in a list implies O(n) item retrievals.
If one has the information in a collection which is not fully ordered, but where certain ordering relationships are known, then the time required may require anywhere between O(1) and O(n) item retrievals, depending upon the nature of the known ordering relation.
For unsorted lists, repeatedly do O(n) partial sort until the element located at the median position is known. This is at least O(n), though.
Is there any information about the elements being sorted?
For a general, unsorted set, it is impossible to reliably find the median in better than O(n) time. You can find the median of a sorted set in O(1), or you can trivially sort the set yourself in O(n log n) time and then find the median in O(1), giving an O(n log n) algorithm. Or, finally, there are more clever median selection algorithms that work by partitioning instead of sorting and yield O(n) performance.
But if the set has no special properties and you are not allowed any pre-processing step, you will never get below O(n) by the simple fact that you will need to examine all of the elements at least once to ensure that your median is correct.
Here's a solution in Java, based on TreeSet:
public class SetWithMedian {
private SortedSet<Integer> s = new TreeSet<Integer>();
private Integer m = null;
public boolean contains(int e) {
return s.contains(e);
}
public Integer getMedian() {
return m;
}
public void add(int e) {
s.add(e);
updateMedian();
}
public void remove(int e) {
s.remove(e);
updateMedian();
}
private void updateMedian() {
if (s.size() == 0) {
m = null;
} else if (s.size() == 1) {
m = s.first();
} else {
SortedSet<Integer> h = s.headSet(m);
SortedSet<Integer> t = s.tailSet(m + 1);
int x = 1 - s.size() % 2;
if (h.size() < t.size() + x)
m = t.first();
else if (h.size() > t.size() + x)
m = h.last();
}
}
}
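Removing the median (i.e. calling remove(getMedian()) on a SetWithMedian) takes O(log n) tree operations. Note, however, that java.util.TreeSet computes the size() of a headSet/tailSet view by iterating over it, so updateMedian() as written is only O(log n) if the sizes of the two halves are tracked separately.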
Edit: To help understand the code, here's the invariant condition of the class attributes:
private boolean isGood() {
if (s.isEmpty()) {
return m == null;
} else {
return s.contains(m) && s.headSet(m).size() + s.size() % 2 == s.tailSet(m).size();
}
}
In human-readable form:
If the set "s" is empty, then "m" must be null.
If the set "s" is not empty, then it must contain "m".
Let x be the number of elements strictly less than "m", and let y be the number of elements greater than or equal to "m". Then, if the total number of elements is even, x must be equal to y; otherwise, x+1 must be equal to y.
Try a red-black tree. It should work quite well, and with a binary search you get your log(n). It also has remove and insert times of log(n), and rebalancing is done in log(n) as well.
As mentioned in previous answers, there is no way to find the median without touching every element of the data structure. If the algorithm you look for must be executed sequentially, then the best you can do is O(n). The deterministic selection algorithm (median-of-medians) or BFPRT algorithm will solve the problem with a worst case of O(n). You can find more about that here: http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
However, the median-of-medians algorithm can be made to run faster than O(n) in wall-clock terms by parallelizing it. Due to its divide and conquer nature, the algorithm can be "easily" made parallel. For instance, when dividing the input array into groups of 5 elements, you could potentially launch a thread for each sub-array, sort it and find the median within that thread. When this step has finished, the threads are joined and the algorithm is run again on the newly formed array of medians.
Note that such design would only be beneficial in really large data sets. The additional overhead that spawning threads has and merging them makes it unfeasible for smaller sets. This has a bit of insight: http://www.umiacs.umd.edu/research/EXPAR/papers/3494/node18.html
Note that you can find asymptotically faster algorithms out there, however they are not practical enough for daily use. Your best bet is the already mentioned sequential median-of-medians algorithm.
Master Yoda's randomized algorithm has, of course, a minimum complexity of n like any other, an expected complexity of n (not log n) and a maximum complexity of n squared like Quicksort. It's still very good.
In practice, the "random" pivot choice might sometimes be a fixed location (without involving a RNG) because the initial array elements are known to be random enough (e.g. a random permutation of distinct values, or independent and identically distributed) or deduced from an approximate or exactly known distribution of input values.
I know one randomized algorithm with time complexity O(n) in expectation.
Here is the algorithm:
Input: array of n numbers A[1...n], assumed distinct here [without loss of generality we can assume n is even]
Output: the n/2-th element of the sorted array.
Algorithm (A[1..n], k = n/2):
Pick a pivot index p uniformly at random from 1...n
Divide the array into 2 parts:
L - the elements < A[p]
R - the elements > A[p]
if (k == |L| + 1) A[p] is the answer; stop
if (k <= |L|) recurse on (L, k)
else recurse on (R, k - (|L| + 1))
Complexity:
O(n) in expectation
The proof is purely mathematical and about one page long. If you are interested, ping me.
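A runnable Python sketch of this randomized selection. It differs from the pseudocode in one respect, which is an addition for this example: elements equal to the pivot are grouped separately, so duplicates are handled too. The names quickselect and lower_median are made up, and the loop replaces the recursion.

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (k is 1-based) of a non-empty sequence."""
    items = list(items)
    while True:
        pivot = random.choice(items)             # pivot chosen uniformly at random
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        larger = [x for x in items if x > pivot]
        if k <= len(smaller):
            items = smaller                      # answer lies among the smaller ones
        elif k <= len(smaller) + len(equal):
            return pivot                         # the pivot itself is the answer
        else:
            k -= len(smaller) + len(equal)       # skip everything <= pivot
            items = larger

def lower_median(items):
    """Median for odd n; the n/2-th smallest element for even n."""
    return quickselect(items, (len(items) + 1) // 2)
```

For example, lower_median([9, 8, 7, 6, 5, 4, 3, 2, 1]) returns 5.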
To expand on rwong's answer: Here is an example code
// partial_sort example
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;
int main () {
int myints[] = {9,8,7,6,5,4,3,2,1};
vector<int> myvector (myints, myints+9);
vector<int>::iterator it;
partial_sort (myvector.begin(), myvector.begin()+5, myvector.end());
// print out content:
cout << "myvector contains:";
for (it=myvector.begin(); it!=myvector.end(); ++it)
cout << " " << *it;
cout << endl;
return 0;
}
Output:
myvector contains: 1 2 3 4 5 9 8 7 6
The element in the middle would be the median.

Resources