"Stable" k-largest elements algorithm - algorithm

Related: priority queue with limited space: looking for a good algorithm
I am looking for an algorithm which returns the k-largest elements from a list, but does not change the order of the k-largest elements, e.g. for k=4 and given 5,9,1,3,7,2,8,4,6, the algorithm should return 9,7,8,6.
More background, my input data are approximately 200 pairs (distance,importance) which are ordered w.r.t distance, and I need to select the 32 most important of them. Performance is crucial here, since I have to run this selection algorithm a few thousand times.
Up to now I have the following two ideas, but neither of them seems optimal:
Remove the minimum element iteratively until 32 elements are left (i.e. do a partial selection sort).
Use quickselect or median-of-medians to find the 32nd largest element. Afterwards, sort the selected 32 elements again w.r.t. distance.
I need to implement this in C++, so if anybody wants to write some code and does not know which language to use, C++ would be an option.

Inspired by @trincot's solution, I have come up with a slightly different variation with a working implementation.
Algorithm
Use Floyd's algorithm to build a max heap, which is equivalent to building a priority_queue in C++ using the constructor that takes the entire array/vector at once, instead of adding elements individually. The max heap is built in O(N) time.
Now, pop items from the max heap K-1 times, so that its top is the Kth max importance item. Store its value in the variable Kth_Max_Importance_Item.
Scan all the items from the original input whose importance value is greater than the importance of Kth_Max_Importance_Item, and push them into the output vector.
Calculate how many required items with importance equal to that of Kth_Max_Importance_Item are still missing, by subtracting the current size of the output vector from K. Store it in the variable left_Over_Count.
Scan the original input again and push the first left_Over_Count items whose importance equals that of Kth_Max_Importance_Item into the output vector.
NOTE: The case of non-unique importance values is taken care of by the last two steps of the algorithm.
Time Complexity: O(N + K*log(N)). Assuming K << N, this is ~ O(N).
Implementation:
#include <iostream>
#include <vector>
#include <queue>
#include <cmath>

struct Item{
    int distance;
    double importance;
};

struct itemsCompare{
    bool operator() (const Item& item1, const Item& item2){
        return item1.importance < item2.importance;
    }
};

//Compare doubles with a tolerance instead of operator==
bool compareDouble(const double& a, const double& b){
    return std::fabs(a-b) < 0.000001;
}

int main(){
    //Original input
    std::vector<Item> items{{10, 2.1}, {9, 2.3}, {8, 2.2}, {7, 2.2}, {6, 1.5}};
    int k = 4;

    //Max heap, built in O(N) from the whole vector at once
    std::priority_queue<Item, std::vector<Item>, itemsCompare> maxHeap (items.begin(), items.end());

    //Checking if the order of original input is intact
    /*for(size_t i=0;i<items.size();i++){
        std::cout<<items[i].distance<<" "<<items[i].importance<<std::endl;
    }*/

    //Popping k-1 items so that the top becomes the Kth max importance item
    int count = 0;
    while(!maxHeap.empty()){
        if(count == k-1){
            break;
        }
        maxHeap.pop();
        count++;
    }
    Item Kth_Max_Importance_Item = maxHeap.top();
    //std::cout<<Kth_Max_Importance_Item.importance<<std::endl;

    //Scanning all the items from the original input whose importance is greater
    //than that of Kth_Max_Importance_Item; the original order is preserved
    std::vector<Item> output;
    for(size_t i=0;i<items.size();i++){
        if(items[i].importance > Kth_Max_Importance_Item.importance){
            output.push_back(items[i]);
        }
    }

    int left_Over_Count = k - (int)output.size();
    //std::cout<<left_Over_Count<<std::endl;

    //Adding left_Over_Count items whose importance equals that of Kth_Max_Importance_Item
    for(size_t i=0;i<items.size();i++){
        if(compareDouble(items[i].importance, Kth_Max_Importance_Item.importance)){
            output.push_back(items[i]);
            left_Over_Count--;
        }
        if(!left_Over_Count){
            break;
        }
    }

    //Printing the output:
    for(size_t i=0;i<output.size();i++){
        std::cout<<output[i].distance<<" "<<output[i].importance<<std::endl;
    }
    return 0;
}
Output:
9 2.3
8 2.2
7 2.2
10 2.1

Use the heap-based algorithm for finding the k largest values, i.e. use a min heap (not a max heap) that never exceeds a size of k. Once it exceeds that size, keep pulling the root from it to restore it to a size of k.
At the end the heap's root will be the kth largest value. Let's call it m.
You could then scan the original input again to collect all values that are at least equal to m. This way you'll have them in their original order.
When that m is not unique, you could have collected too many values. So check the size of the result and determine how much longer it is than k. Go backwards through that list and mark the ones that have value m as deleted until you have reached the right size. Finally collect the non-deleted items.
All these scans are O(n). The most expensive step is the first one: O(n log k).
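
As a minimal sketch of this approach (the function name and the reuse of the Item struct from the answer above are mine, and it assumes 1 <= k <= items.size()):

#include <vector>
#include <queue>

struct Item { int distance; double importance; };

std::vector<Item> stableKLargest(const std::vector<Item>& items, std::size_t k) {
    auto byImportance = [](const Item& a, const Item& b) {
        return a.importance > b.importance; // min heap on importance
    };
    std::priority_queue<Item, std::vector<Item>, decltype(byImportance)>
        minHeap(byImportance);

    for (const Item& it : items) {       // O(n log k): heap never exceeds k
        minHeap.push(it);
        if (minHeap.size() > k) minHeap.pop();
    }
    double m = minHeap.top().importance; // the kth largest importance

    std::vector<Item> result;
    for (const Item& it : items)         // O(n): preserves the original order
        if (it.importance >= m) result.push_back(it);

    // If m is not unique we may have collected too many items;
    // drop surplus occurrences of m, scanning from the back.
    for (std::size_t i = result.size(); result.size() > k && i > 0; ) {
        --i;
        if (result[i].importance == m)
            result.erase(result.begin() + i);
    }
    return result;
}

Since the heap never grows past k+1 entries, the first phase is O(n log k), and the two rescans are plain O(n) passes in input order.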

Related

Algorithm: Filtering out a sequence of numbers in ascending order with least number of removed elements

The actual problem context is very different, but the generalized problem is to figure out an algorithm that filters a sequence of numbers so that it is in ascending order while removing the fewest elements.
Applying the algorithm will look like the following:
wrong_sequence = [1, 2, 8, 88, 99, 1, 18, 77, 78, 100, 103]  # need to remove 88, 99, 1
correct_sequence = [1, 2, 8, 18, 77, 78, 100, 103]  # should NOT be [1, 2, 8, 88, 99]
It is not as easy as just looping through the list and checking whether the current number is greater than the previous one, because there can be cases where the current number is ascending and yet should be filtered out.
For example, while 88 and 99 are ascending numbers, they should be removed from the list. In other words, [1, 2, 8, 88, 99] should NOT be the answer, because keeping it means removing 1, 18, 77, 78, 100, 103, which is more elements than removing just 88, 99, 1.
Any kind guidance would be greatly appreciated!
I think what you are looking for is the classic LIS problem: the longest increasing subsequence. By finding the longest increasing subsequence you also get the answer to your problem, which is n - l, where n is the length of the array and l is the length of the longest increasing subsequence.
Here is a more detailed post on this problem.
And also here is my take on this problem:
#include <iostream>
#include <algorithm>
#define NMAX int(1e5) //maximum size of array
#define INF int(1e9)  //some value larger than any array element
using namespace std;

int dp[NMAX], pos[NMAX], parent[NMAX], a[NMAX], result[NMAX];
int i, n, max1, ans;
//max1 is the length of the longest increasing subsequence
//dp[len] is the minimum value such that an increasing subsequence of length len ends in it
//pos[len] is the index in a[] of that minimum ending value
//parent[j] is the index of the element preceding a[j] in the best subsequence ending at a[j]
//because the values of dp[1],dp[2],dp[3],... are non-decreasing we can do a binary search
//to find our answer, thus getting a good time complexity -> O(n log2 n)
int main()
{
    cin>>n;
    for(i=1;i<=n;i++)
        cin>>a[i];
    dp[0]=0;
    for(i=1;i<=n;i++)
    {
        dp[i]=INF;
        //find the largest length poz whose smallest tail value is < a[i]
        //(strict <, so the subsequence is strictly increasing)
        int st=0,dr=i-1,poz=0;
        while(st<=dr)
        {
            int m=(st+dr)/2;
            if(dp[m]<a[i])
            {
                poz=m;
                st=m+1;
            }
            else dr=m-1;
        }
        //a[i] extends a subsequence of length poz to one of length poz+1
        if(a[i]<dp[poz+1])
        {
            dp[poz+1]=a[i];
            pos[poz+1]=i;
            parent[i]=pos[poz];
        }
        max1=max(max1,poz+1);
    }
    ans=n-max1;
    cout<<ans<<" elements removed."<<endl;
    //printing dp[1..max1] directly would give a sequence of the right length but
    //not necessarily an actual subsequence of the input, so we walk parent links
    int idx=pos[max1];
    for(int len=max1;len>=1;len--)
    {
        result[len]=a[idx];
        idx=parent[idx];
    }
    cout<<"The resulting array is: "<<endl;
    for(i=1;i<=max1;i++)
        cout<<result[i]<<' ';
    return 0;
}

Find the nth largest element in an unsorted read-only data structure with no extra space

I need to find the nth largest element in a range which is not a random-access range, with O(1) extra space. The heap method takes too much space. I found a solution (How to find the Kth smallest integer in an unsorted read only array?) but it does not work for doubles. So is there a similar solution for doubles?
The key part is O(1) and possibly duplicate items. One possibility is:
Find the largest element smaller than the current maximum.
Find the number of elements equal to this.
Decrease until done.
Or in C-code something like:
double findKthLargest(double arr[], int nElements, int k) {
double currentMax, nextMax;
int currentK=0, nDuplicates;
for(;;) {
nDuplicates=0;
for(int j=0;j<nElements;++j) {
if (currentK==0 || arr[j]<currentMax) {
// Possible max
if (j==0 || arr[j]>nextMax) nextMax=arr[j];
}
}
for(int j=0;j<nElements;++j) if (arr[j]==nextMax) nDuplicates++;
if (currentK+nDuplicates>=k) return nextMax;
currentMax=nextMax;
currentK=currentK+nDuplicates;
}
Another is to order the duplicates by keeping track of their index.
If time does not matter:
Iterate the List n times and in each pass look for the largest element smaller than the one you found in the last pass.
If you want/need to handle duplicates, you also need a counter for how often the formerly found largest elements occurred (all together, so only one counter is needed).
What size is your n, such that keeping a heap of n elements is not a feasible option?

How to apply the Step-Count method to my binary search implementation

int binarySearch(int arr[], int left, int right, int x)
{
    while (left <= right)
    {
        int mid = left + (right - left) / 2; // avoids overflow of (left+right)
        if (arr[mid] == x)
        {
            return mid;
        }
        else if (arr[mid] > x)
        {
            right = mid - 1;
        }
        else
        {
            left = mid + 1;
        }
    }
    return -1;
}
When I went through this myself I got 5n + 4 = O(n), but it is supposed to be O(log n), and I don't understand why that's the case.
int mean(int a[], size_t n)
{
    int sum = 0;                    // 1 step
    for (size_t i = 0; i < n; i++)  // 1 step * (n+1)
        sum += a[i];                // 1 step * n
    return sum;                     // 1 step
}
I understand that the above code reduces to 2N+3 but this is a very basic example and doesn't take much thought to understand. Will someone please walk me through the binary search version as all the sources I have encountered don't make much sense to me.
Here is a link to one of the many other resources that I have used, but the explanation as to how each statement is separated into steps is what I prefer if possible.
how to calculate binary search complexity
In binary search you always reduce the problem size by half. Let's take an example: the searching element is 19 and the array is a sorted array of 8 elements, [1,4,7,8,11,16,19,22]. Then the following will be the sequence of steps that a binary search performs:
Get the middle element index, i.e. halve the problem size.
Check if the element at that index is greater than, less than or equal to your searching element.
a. If equal, you are done; return the index.
b. If the searching element is greater, then keep looking in the right half of the array.
c. If the searching element is less, then look in the left half of the array.
You continue steps 1 and 2 until you are left with one element or you have found the element.
In our example problem will look as follows:
Iteration 1: [1,4,7,8,11,16,19,22]
Iteration 2: [11,16,19,22]
Iteration 3: [19,22] -> 19 found
Order of complexity: O(log2(n))
i.e. log2(8) = 3, which means we required 3 steps to find our desired element. Even if the element is not there (i.e. in the worst case) the time complexity of this algorithm remains log2(n).
It is important to note that the base of the log in binary search is 2, as we are reducing the problem size by half; if some other algorithm reduced the problem size by a third it would be log3(n), but asymptotically we call all of them logarithmic algorithms irrespective of the base.
Note: Binary search can only be done on sorted data.
Suppose you have an array of 10 elements. Binary search will split the array into two halves, in this case 5 (call it left, because these are the left 5 elements) and 5 (call it right, because these are the right 5 elements).
Suppose the element you are trying to find is greater than the middle element, in this case x > array[5]; then you just ignore the first 5 elements and go to the last five.
Now you have an array of five elements (indices 5 to 9). Again you split the array into two halves; if x > array[mid] then you ignore the whole left half, and if it is smaller then you ignore the whole right half.
In mathematical notation you get a sequence of problem sizes: {n, n/2, n/2^2, ..., n/2^m}.
Now if you try to solve this: the last (smallest) term is n/2^m, so setting n/2^m = 1 gives m = log2(n).
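
If it helps to see the step counts concretely, here is a small instrumented version (my own illustration, not from the original posts): it counts how many times the loop body runs, and the count grows like log2(n) rather than n.

#include <iostream>
#include <vector>

// Same logic as the binarySearch above, but counting loop iterations.
int binarySearchCounting(const std::vector<int>& arr, int x, int& steps)
{
    int left = 0, right = (int)arr.size() - 1;
    steps = 0;
    while (left <= right)
    {
        ++steps;
        int mid = left + (right - left) / 2;
        if (arr[mid] == x) return mid;
        else if (arr[mid] > x) right = mid - 1;
        else left = mid + 1;
    }
    return -1;
}

int main()
{
    for (int n : {8, 64, 1024})
    {
        std::vector<int> arr(n);
        for (int i = 0; i < n; ++i) arr[i] = i;
        int steps;
        binarySearchCounting(arr, -1, steps); // worst case: x is absent
        std::cout << "n=" << n << " steps=" << steps << "\n"; // prints 3, 6, 10
    }
    return 0;
}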

Find the farthest sum of two elements from zero in an array

Given an array, what is the most time- and space-efficient algorithm to find the sum of two elements farthest from zero in that array?
Edit
For example, [1, -1, 3, 6, -10] has the farthest sum equal to -11 which is equal to (-1)+(-10).
Using a tournament comparison method to find the largest and second largest elements uses the fewest comparisons, in total n+log(n)-2. Do this twice, once to find the largest and second largest elements, say Z and Y, and again to find the smallest and second smallest elements, say A and B. Then the answer is either Z+Y or -A-B, so one more comparison solves the problem. Overall, this takes 2n+2log(n)-3 comparisons. This is still O(n), but in practice is faster than scanning the entire list 4 times to find A,B,Y,Z (in total uses 4n-5 comparisons).
The tournament method is nicely explained with pictures and sample code in these two tutorials: one and two
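
For reference, here is a compact sketch of the tournament idea (my own illustration with made-up names; the linked tutorials give the full treatment). Every winner remembers whom it beat, and the runner-up must be one of the elements that lost directly to the champion:

#include <vector>
#include <utility>
#include <algorithm>

// Returns {largest, second largest}; assumes v has at least two elements.
std::pair<int, int> topTwoTournament(const std::vector<int>& v)
{
    std::vector<std::vector<int>> beaten(v.size()); // values each index has beaten
    std::vector<int> survivors(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) survivors[i] = (int)i;

    while (survivors.size() > 1)
    {
        std::vector<int> next;
        for (std::size_t i = 0; i + 1 < survivors.size(); i += 2)
        {
            int a = survivors[i], b = survivors[i + 1];
            int winner = (v[a] >= v[b]) ? a : b;
            int loser = (winner == a) ? b : a;
            beaten[winner].push_back(v[loser]);
            next.push_back(winner);
        }
        if (survivors.size() % 2 == 1)        // odd one out advances unplayed
            next.push_back(survivors.back());
        survivors = next;
    }

    int champion = survivors[0];
    // The champion played about log2(n) matches; the runner-up is the best of those losers.
    int second = *std::max_element(beaten[champion].begin(), beaten[champion].end());
    return {v[champion], second};
}

The n-1 matches find the maximum, and scanning the champion's roughly log2(n) victims accounts for the extra log(n)-1 comparisons mentioned above.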
If you mean the sum whose absolute value is maximum, it is either the largest sum or the smallest sum. The largest sum is the sum of the two maximal elements. The smallest sum is the sum of the two minimal elements.
So you need to find the four values: Maximal, second maximal, minimal, second minimal. You can do it in a single pass in O(n) time and O(1) memory. I suspect that this question might be about minimizing the constant in O(n) - you can do it by taking elements in fives, sorting each five (it can be done in 7 comparisons) and comparing the two top elements with current-max elements (3 comparisons at worst) and the two bottom elements with current-min elements (ditto.) This gives 2.6 comparisons per element which is a small improvement over the 3 comparisons per element of the obvious algorithm.
Then just sum the two max elements, sum the two min elements and take whichever value has the larger abs().
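
A minimal sketch of that single pass (the function name is mine; it assumes the array has at least two elements):

#include <vector>
#include <cstdlib>
#include <limits>

long long farthestSum(const std::vector<int>& a)
{
    // Track the two largest and the two smallest values in one pass.
    long long max1 = std::numeric_limits<long long>::min(), max2 = max1;
    long long min1 = std::numeric_limits<long long>::max(), min2 = min1;
    for (int x : a)
    {
        if (x > max1) { max2 = max1; max1 = x; }
        else if (x > max2) { max2 = x; }
        if (x < min1) { min2 = min1; min1 = x; }
        else if (x < min2) { min2 = x; }
    }
    long long hi = max1 + max2, lo = min1 + min2;
    return std::llabs(hi) >= std::llabs(lo) ? hi : lo; // larger absolute value wins
}

For the example above, max1+max2 = 9 and min1+min2 = -11, so the function returns -11.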
Let's look at the problem from a general perspective:
Find the largest sum of k integers in your array.
Begin by tracking the FIRST k integers - keep them sorted as you go.
Iterate over the array, testing each integer against the min value of the saved integers thus far.
If it is larger than the min value of the saved integers, replace it with the smallest value, and bubble it up to its proper sorted position.
When you've finished the array, you have your largest k integers.
Now you can easily apply this to k=2.
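
Here is a brief sketch of that procedure (names are mine; it assumes the array has at least k elements):

#include <vector>
#include <algorithm>

std::vector<int> largestK(const std::vector<int>& a, std::size_t k)
{
    // Seed the buffer with the first k values, kept sorted ascending,
    // so best[0] is always the smallest saved value.
    std::vector<int> best(a.begin(), a.begin() + k);
    std::sort(best.begin(), best.end());

    for (std::size_t i = k; i < a.size(); ++i)
    {
        if (a[i] > best[0])
        {
            best[0] = a[i]; // replace the smallest saved value...
            // ...and bubble it up to its sorted position
            for (std::size_t j = 1; j < k && best[j] < best[j - 1]; ++j)
                std::swap(best[j], best[j - 1]);
        }
    }
    return best; // for the sum problem, take k = 2 and return best[0] + best[1]
}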
Just iterate over the array keeping track of the smallest and the largest elements encountered so far. This is time O(n), space O(1) and obviously you can't do better than that.
int GetAnswer(int[] arr){
    int min = arr[0];
    int max = arr[0];
    int maxDistSum = 0;
    for (int i = 1; i < arr.Length; ++i)
    {
        int x = arr[i];
        if (Math.Abs(maxDistSum) < Math.Abs(max + x)) maxDistSum = max + x;
        if (Math.Abs(maxDistSum) < Math.Abs(min + x)) maxDistSum = min + x;
        if (x < min) min = x;
        if (x > max) max = x;
    }
    return maxDistSum;
}
The key observation is that the farthest sum is either the sum of the two smallest elements or the sum of the two largest.

Find median value from a growing set

I came across an interesting algorithm question in an interview. I gave my answer but not sure whether there is any better idea. So I welcome everyone to write something about his/her ideas.
You have an empty set. Now elements are put into the set one by one. We assume all the elements are integers and they are distinct (according to the definition of set, we don't consider two elements with the same value).
Every time a new element is added to the set, the set's median value is asked. The median value is defined the same as in math: the middle element in a sorted list. Here, specially, when the size of set is even, assuming size of set = 2*x, the median element is the x-th element of the set.
An example:
Start with an empty set,
when 12 is added, the median is 12,
when 7 is added, the median is 7,
when 8 is added, the median is 8,
when 11 is added, the median is 8,
when 5 is added, the median is 8,
when 16 is added, the median is 8,
...
Notice that, first, elements are added to set one by one and second, we don't know the elements going to be added.
My answer.
Since it is a question about finding the median, sorting is needed. The easiest solution is to use a normal array and keep it sorted. When a new element comes, use binary search to find the position for the element (log n) and insert it there. Since it is a normal array, shifting the rest of the array is needed, which takes O(n) time. Once the element is inserted, we can read the median in constant time.
The WORST time complexity is: log n + n + 1.
Another solution is to use a linked list. The reason for using a linked list is to remove the need for shifting the array. But finding the location of the new element requires a linear search. Inserting the element takes constant time, and then we need to find the median by walking through half of the list, which always takes n/2 time.
The WORST time complexity is: n + 1 + n/2.
The third solution is to use a binary search tree. Using a tree, we avoid shifting the array. But using a plain binary search tree to find the median is not very attractive, so I change the binary search tree so that the left and right subtrees are always balanced: at any time, either both subtrees have the same number of nodes, or the right subtree has one node more than the left. In other words, it is ensured that at any time the root element is the median. Of course this requires changes in the way the tree is built; the technical detail is similar to rotating a red-black tree.
If the tree is maintained properly, it is ensured that the WORST time complexity is O(n).
So the three algorithms are all linear in the size of the set. If no sub-linear algorithm exists, the three algorithms can be considered optimal. Since they don't differ from each other much, the best is the easiest to implement, which is the second one, using a linked list.
So what I really wonder is, will there be a sub-linear algorithm for this problem and if so what will it be like. Any ideas guys?
Steve.
Your complexity analysis is confusing. Let's say that n items total are added; we want to output the stream of n medians (where the ith in the stream is the median of the first i items) efficiently.
I believe this can be done in O(n*lg n) time using two priority queues (e.g. binary or Fibonacci heaps); one queue for the items below the current median (so the largest element is at the top), and the other for items above it (in this heap, the smallest is at the top). Note that in Fibonacci (and other) heaps, insertion is O(1) amortized; it's only popping an element that's O(lg n).
This would be called an "online median selection" algorithm, although Wikipedia only talks about online min/max selection. Here's an approximate algorithm, and a lower bound on deterministic and approximate online median selection (a lower bound means no faster algorithm is possible!)
If there are a small number of possible values compared to n, you can probably break the comparison-based lower bound just like you can for sorting.
I received the same interview question and came up with the two-heap solution in wrang-wrang's post. As he says, the time per operation is O(log n) worst-case. The expected time is also O(log n) because you have to "pop an element" 1/4 of the time assuming random inputs.
I subsequently thought about it further and figured out how to get constant expected time; indeed, the expected number of comparisons per element becomes 2+o(1). You can see my writeup at http://denenberg.com/omf.pdf .
BTW, the solutions discussed here all require space O(n), since you must save all the elements. A completely different approach, requiring only O(log n) space, gives you an approximation to the median (not the exact median). Sorry I can't post a link (I'm limited to one link per post) but my paper has pointers.
Although wrang-wrang already answered, I wish to describe a modification of your binary search tree method that is sub-linear.
We use a binary search tree that is balanced (AVL/red-black/etc.), but not super-balanced like you described. So adding an item is O(log n).
One modification to the tree: for every node we also store the number of nodes in its subtree. This doesn't change the complexity. (For a leaf this count would be 1, for a node with two leaf children this would be 3, etc)
We can now access the Kth smallest element in O(log n) using these counts:
def get_kth_item(subtree, k):
    left_size = 0 if subtree.left is None else subtree.left.size
    if k < left_size:
        return get_kth_item(subtree.left, k)
    elif k == left_size:
        return subtree.value
    else:  # k > left_size
        return get_kth_item(subtree.right, k - 1 - left_size)
A median is a special case of the Kth smallest element (given that you know the size of the set); with the question's definition and 0-indexed k as in the code above, the median of a set of size s is get_kth_item(root, (s - 1) // 2).
So all in all this is another O(log n) solution.
We can define a min heap and a max heap to store the numbers. Additionally, we define a class DynamicArray for the number set, with two functions: Insert and GetMedian. The time to insert a new number is O(log n), while the time to get the median is O(1).
This solution is implemented in C++ as the following:
#include <vector>
#include <algorithm>  // push_heap, pop_heap
#include <functional> // less, greater
#include <stdexcept>
using namespace std;

template<typename T> class DynamicArray
{
public:
    void Insert(T num)
    {
        if(((minHeap.size() + maxHeap.size()) & 1) == 0)
        {
            // even total size: the new element goes to minHeap (upper half),
            // routed through maxHeap if it belongs to the lower half
            if(maxHeap.size() > 0 && num < maxHeap[0])
            {
                maxHeap.push_back(num);
                push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
                num = maxHeap[0];
                pop_heap(maxHeap.begin(), maxHeap.end(), less<T>());
                maxHeap.pop_back();
            }
            minHeap.push_back(num);
            push_heap(minHeap.begin(), minHeap.end(), greater<T>());
        }
        else
        {
            // odd total size: the new element goes to maxHeap (lower half),
            // routed through minHeap if it belongs to the upper half
            if(minHeap.size() > 0 && minHeap[0] < num)
            {
                minHeap.push_back(num);
                push_heap(minHeap.begin(), minHeap.end(), greater<T>());
                num = minHeap[0];
                pop_heap(minHeap.begin(), minHeap.end(), greater<T>());
                minHeap.pop_back();
            }
            maxHeap.push_back(num);
            push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
        }
    }

    T GetMedian()
    {
        size_t size = minHeap.size() + maxHeap.size();
        if(size == 0)
            throw runtime_error("No numbers are available");
        T median = 0;
        if((size & 1) == 1)
            median = minHeap[0];
        else
            median = (minHeap[0] + maxHeap[0]) / 2;
        return median;
    }

private:
    vector<T> minHeap; // upper half, smallest on top
    vector<T> maxHeap; // lower half, largest on top
};
For more detailed analysis, please refer to my blog: http://codercareer.blogspot.com/2012/01/no-30-median-in-stream.html.
1) As with the previous suggestions, keep two heaps and cache their respective sizes. The left heap keeps values below the median, the right heap keeps values above the median. If you simply negate the values in the right heap the smallest value will be at the root so there is no need to create a special data structure.
2) When you add a new number, you determine the new median from the size of your two heaps, the current median, and the two roots of the L&R heaps, which just takes constant time.
3) Call a private threaded method to perform the actual insert-and-update work, but return immediately with the new median value. You only need to block until the heap roots are updated. Then the thread doing the insert just needs to maintain a lock on the grandparent node it is traversing as it walks down the tree; this will ensure that you can insert and rebalance without blocking other inserting threads working on other sub-branches.
Getting the median becomes a constant-time procedure; of course, you may now have to wait on synchronization from further adds.
Rob
A balanced tree (e.g. a red-black tree) with an augmented size field can find the median in O(lg n) time in the worst case. I think it is covered in Chapter 14 of the classic algorithms textbook.
To keep the explanation brief, you can efficiently augment a BST to select a key of a specified rank in O(h) by having each node store the number of nodes in its left subtree. If you can guarantee that the tree is balanced, you can reduce this to O(log(n)). Consider using an AVL which is height-balanced (or red-black tree which is roughly balanced), then you can select any key in O(log(n)). When you insert or delete a node into the AVL you can increment or decrement a variable that keeps track of the total number of nodes in the tree to determine the rank of the median which you can then select in O(log(n)).
In order to find the median in linear time per insertion you can try this (it just came to my mind). You need to update some counters every time you add a number to your set, and you won't need sorting. Here it goes:
typedef struct
{
    int number;
    int lesser;  /* count of set elements smaller than this one */
    int greater; /* count of set elements greater than this one */
} record;

/* Adds n as the (count+1)-th element of numbers[] and returns the new median.
   Each call is O(count); no sorting needed. */
int median(record numbers[], int count, int n)
{
    int i;
    int m = VERY_BIG_NUMBER;
    int a = 0, b = 0;
    numbers[count].number = n;
    numbers[count].lesser = 0;
    numbers[count].greater = 0;
    for (i = 0; i <= count; i++)
    {
        if (i < count)
        {
            /* update the counters of the old element and of the new one */
            if (n < numbers[i].number)
            {
                numbers[i].lesser++;
                numbers[count].greater++;
            }
            else
            {
                numbers[i].greater++;
                numbers[count].lesser++;
            }
        }
        if (numbers[i].greater - numbers[i].lesser == 0)
            m = numbers[i].number; /* exact middle: odd-sized set */
    }
    if (m == VERY_BIG_NUMBER)
    {
        /* even-sized set: find the two middle elements and average them */
        for (i = 0; i <= count; i++)
        {
            if (numbers[i].greater - numbers[i].lesser == -1)
                a = numbers[i].number; /* upper middle */
            if (numbers[i].greater - numbers[i].lesser == 1)
                b = numbers[i].number; /* lower middle */
        }
        m = (a + b) / 2;
    }
    return m;
}
What this does is: each time you add a number to the set, you must know how many numbers are lesser than it and how many are greater. So, if a number has equal "lesser than" and "greater than" counts, it is in the very middle of the set, without having to sort anything. In the case that you have an even amount of numbers, you have two candidates for the median, so you just return the mean of those two. BTW, this is C code, I hope this helps.
