An unsorted array is given and we need to find the top 5 elements efficiently; we are not allowed to sort the list.
My solution :
Find the max element in the array. O(n)
Delete this max element after processing/using it.
Repeat steps 1 & 2, k times (5 times in this case).
Time complexity: O(kn), which is O(n) for constant k = 5. Space complexity: O(1).
I think we can find the max element in O(log N), so it could be improved to O(k log N). Please correct me if I am wrong.
Can we do better than this? Using a max-heap would be inefficient, I guess?
PS - This is not any homework.
If you can use an auxiliary heap (a min-heap, or equivalently a max-heap over negated elements) you can do this in O(n log m), where n is the list length and m is the number of max elements to keep track of.
Since the aux heap has a fixed maximum size (5), operations on that structure can be considered O(1). In that case the complexity is O(n).
Pseudo code:
foreach element in list:
    if aux_heap.size() < 5
        aux_heap.add(element)
    else if element > aux_heap.top()
        aux_heap.remove_top()
        aux_heap.add(element)
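A minimal C++ sketch of this idea, assuming std::priority_queue with std::greater as the size-5 min-heap (top5 is an illustrative name, not part of the question):

#include <queue>
#include <vector>
#include <functional>

std::vector<int> top5(const std::vector<int>& a) {
    // min-heap of the best 5 seen so far; the smallest of them sits on top
    std::priority_queue<int, std::vector<int>, std::greater<int>> aux;
    for (int x : a) {
        if (aux.size() < 5)
            aux.push(x);
        else if (x > aux.top()) {         // x beats the weakest of the current top 5
            aux.pop();
            aux.push(x);
        }
    }
    std::vector<int> result;
    while (!aux.empty()) {                // results come out in ascending order
        result.push_back(aux.top());
        aux.pop();
    }
    return result;
}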
Using a partial quicksort we can achieve expected O(n), and this doesn't require any auxiliary space. Using a bounded heap, as in the other answer, requires O(n log k) time.
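A hedged C++ sketch of the partial-quicksort idea, assuming std::nth_element as the partitioning step (expected linear time, O(1) extra space; the 5 largest values come back in no particular order):

#include <algorithm>
#include <vector>

std::vector<int> top5_partition(std::vector<int> a) {
    if (a.size() <= 5) return a;
    // after this call the 5 largest values occupy the last 5 slots, unordered
    std::nth_element(a.begin(), a.end() - 5, a.end());
    return std::vector<int>(a.end() - 5, a.end());
}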
Related
As the title suggests, I am wondering what the proof for the lower bound of merging k sorted arrays of size n is. I know that the bound is O(kn log k), but how is it derived? I tried comparing with sorting an array of p elements using a decision tree, but I don't see how to carry the proof through.
This is pretty easy to prove; try to think about it in a merge-sort way. Merge-sorting an array of size K*N takes O(KN log(KN)).
But we don't have to go all the way down to leaves of size 1, because once a sub-array has size N we know it is already sorted. For simplicity, assume K is a power of 2.
How many times do we have to divide by 2 to reach leaves of size N?
log(K) times, which leaves us with K leaves of size N each.
So you have log(K) merge levels, and merging one level costs O(KN). Hence the time complexity is O(NK log K).
Proof: Let's assume it is not a lower bound and we could do better. Then for any unknown array of size N*K we could split it into K sub-arrays of size N, merge-sort each of them in O(N log N) time, for O(K*N*log N) in total.
After having the K sorted arrays of size N, we merge them into a bigger array of size N*K, paying less than O(NK log K) since we assumed it is not a lower bound.
In the end we would have sorted an unknown array of size N*K in less than N*K*log(N*K) time, which is impossible in the comparison model.
Hence, you can't do better than O(NK log K) while merging K sorted arrays of size N.
Possible implementation.
Let's create a heap data structure that stores pairs (element, arrayIndex), ordered by element. Then:
Add the first element of each array with the corresponding array index to this heap.
On each step, remove the top (lowest) pair p from the heap, append p.element to the result, and insert the pair (next, p.arrayIndex), where next is the next unread element of the array with index p.arrayIndex (if that array is not exhausted).
For tracking the 'next' element you need an array of k indices/pointers/iterators, each pointing to the next unread element of the corresponding array.
There will be at most k elements in the heap at any time, thus the insert/remove operations of the heap will have O(log(k)) complexity. Every element will be inserted and removed once from the heap. The number of elements is n*k. Overall complexity is O(n*k*log(k)).
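A hedged C++ sketch of this k-way merge (merge_k is an illustrative name; the heap stores (value, arrayIndex, position) tuples so the next element of the source array can be pushed after each pop):

#include <queue>
#include <tuple>
#include <vector>
#include <functional>
#include <cstddef>

std::vector<int> merge_k(const std::vector<std::vector<int>>& arrays) {
    // the heap orders tuples lexicographically, so the smallest value is on top
    using Item = std::tuple<int, std::size_t, std::size_t>;   // (value, arrayIndex, position)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (std::size_t i = 0; i < arrays.size(); ++i)
        if (!arrays[i].empty())
            heap.push({arrays[i][0], i, 0});                  // first element of each array
    std::vector<int> result;
    while (!heap.empty()) {
        auto [value, ai, pos] = heap.top();
        heap.pop();
        result.push_back(value);
        if (pos + 1 < arrays[ai].size())                      // push the next element, if any
            heap.push({arrays[ai][pos + 1], ai, pos + 1});
    }
    return result;
}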
Create a min heap of size k which stores the next item from each of the k arrays. Each node also stores which array it came from. Create your sorted array by adding the min from the heap to final_sorted_array, then adding the next element from the array that value came from to the heap.
Removing the min element of the heap is O(log k). You have a total of NK elements, so you do this NK times. Final result: O(NK log k).
There are n unsorted weights and I need to find the least number of weights whose total is at least W.
How do I find them in O(n)?
This problem has many solution methods:
Method 1 - Sorting - O(nlogn)
I guess the most trivial one would be to sort in descending order and then take the first k elements whose sum reaches at least W. The time complexity, though, will be O(n log n).
Method 2 - Max Heap - O(n + klogn)
Another method would be to use a max heap.
Creating the heap takes O(n), and then we extract elements until we reach a total sum of at least W. Each extraction takes O(log n), so the total time complexity is O(n + k log n), where k is the number of elements we had to extract from the heap.
Method 3 - Using Min Heap - O(nlogk)
Adding this method that JimMischel suggested in the comments below.
Create a min-heap from the first elements of the list, adding elements until their sum reaches at least W. Then iterate over the remaining elements, and whenever an element is greater than the minimum (the heap top), swap it in.
At this point we might be holding more weight than we actually need to reach W, so we extract minimums for as long as we can stay at or above W:
find_min_set(A, W)
    currentW = 0
    heap H                        // create an empty min-heap
    for each Elem in A
        if (currentW < W)
            H.add(Elem)
            currentW += Elem
        else if (Elem > H.top())
            currentW += (Elem - H.top())
            H.pop()
            H.add(Elem)
    // drop minimums that are no longer needed to stay at or above W
    while (currentW - H.top() >= W)
        currentW -= H.top()
        H.pop()
    return H.size()
This method might be even faster in practice, depending on the relation between k and n. See when theory meets practice.
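A hedged C++ sketch of Method 3, assuming positive weights (find_min_set_heap is an illustrative name; it returns -1 if W is unreachable):

#include <queue>
#include <vector>
#include <functional>

int find_min_set_heap(const std::vector<long long>& a, long long w) {
    // min-heap of the currently chosen weights; its top is the cheapest one to give up
    std::priority_queue<long long, std::vector<long long>, std::greater<long long>> h;
    long long currentW = 0;
    for (long long x : a) {
        if (currentW < w) {               // still short of W: take the element
            h.push(x);
            currentW += x;
        } else if (x > h.top()) {         // swap out the smallest chosen weight
            currentW += x - h.top();
            h.pop();
            h.push(x);
        }
    }
    // drop minimums that are no longer needed to stay at or above W
    while (!h.empty() && currentW - h.top() >= w) {
        currentW -= h.top();
        h.pop();
    }
    return currentW >= w ? static_cast<int>(h.size()) : -1;   // -1: W is unreachable
}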
Method 4 - O(n)
The best method I could think of will be using some kind of quickselect while keeping track of the total weight and always partitioning with the median as a pivot.
First, let's define a few things:
sum(A) - The total sum of all elements in array A.
num(A) - The number of elements in array A.
med(A) - The median of the array A.
find_min_set(A,W,T)
    if (num(A) == 1)              // base case: the last remaining element must be taken
        return T + 1
    // partition A:
    //   L contains all the elements of A that are less than med(A)
    //   R contains all the elements of A that are greater than or equal to med(A)
    L, R = partition(A, med(A))
    if (sum(R) == W)
        return T + num(R)
    if (sum(R) > W)
        return find_min_set(R, W, T)
    if (sum(R) < W)
        return find_min_set(L, W - sum(R), num(R) + T)
Call this method as find_min_set(A, W, 0).
Runtime Complexity:
Finding median is O(n).
Partitioning is O(n).
Each recursive call is taking half of the size of the array.
Summing it all up we get the recurrence T(n) = T(n/2) + O(n), which is the same as the average case of quickselect: O(n).
Note: When all values are unique, both the worst-case and average complexity are indeed O(n). With possible duplicate values, the average complexity is still O(n), but the worst case is O(n log n) even when using the median-of-medians method for selecting the pivot.
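A hedged C++ sketch of Method 4, written iteratively rather than recursively, assuming positive weights and sum(A) >= W, and using std::nth_element as the median/partition step (expected O(n)):

#include <algorithm>
#include <numeric>
#include <vector>
#include <cstddef>

int find_min_set(std::vector<long long> a, long long w) {
    int taken = 0;
    while (true) {
        if (a.size() == 1)
            return taken + 1;                          // base case: the last element must be taken
        // partition around the median: after nth_element, a[mid..] holds the larger half (R)
        std::size_t mid = a.size() / 2;
        std::nth_element(a.begin(), a.begin() + mid, a.end());
        long long sumR = std::accumulate(a.begin() + mid, a.end(), 0LL);
        std::size_t numR = a.size() - mid;
        if (sumR == w)
            return taken + static_cast<int>(numR);     // exactly W: all of R is needed
        if (sumR > w) {
            a.erase(a.begin(), a.begin() + mid);       // the answer lies inside R
        } else {
            taken += static_cast<int>(numR);           // take all of R, find the rest in L
            w -= sumR;
            a.resize(mid);                             // keep only L
        }
    }
}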
A way of finding the median of a given set of n numbers is to distribute them between two heaps: a max-heap containing the lower ceil(n/2) numbers and a min-heap containing the rest. Maintained in this way, the median is the max of the first heap (together with the min of the second heap if n is even). Here's my C++ code that does this:
#include <iostream>
#include <queue>
#include <vector>
#include <functional>
using namespace std;

int main() {
    priority_queue<int, vector<int>> left;                    // max-heap: lower half
    priority_queue<int, vector<int>, greater<int>> right;     // min-heap: upper half
    int n, a;
    cin >> n;                                                  // n = number of items
    for (int i = 0; i < n; i++) {
        cin >> a;
        if (left.empty())
            left.push(a);
        else if (left.size() <= right.size()) {
            if (a <= right.top())
                left.push(a);
            else {
                left.push(right.top());
                right.pop();
                right.push(a);
            }
        } else {
            if (a >= left.top())
                right.push(a);
            else {
                right.push(left.top());
                left.pop();
                left.push(a);
            }
        }
    }
}
We know that the heapify operation has linear complexity. Does this mean that if we insert numbers one by one into the two heaps, as in the above code, we are finding the median in linear time?
Linear time heapify is for the cost of building a heap from an unsorted array as a batch operation, not for building a heap by inserting values one at a time.
Consider a min-heap where you are inserting a stream of values in increasing order. The value at the top of the heap is the smallest, so each value trickles all the way down to the bottom of the heap. Consider just the last half of the values inserted. At that point the heap has very nearly its full height, which is log(n), so each value trickles down log(n) slots, and the cost of inserting n/2 values is O(n log(n)).
If I present a stream of values in increasing order to your median-finding algorithm, one of the things it has to do is build a min-heap from a stream of values in increasing order, so the cost of the median finding is O(n log(n)). In fact, the max-heap is going to be doing a lot of deletes as well as insertions, but this is just a constant factor on top, so I think the overall complexity is still O(n log(n)).
When there is one element, the complexity of the step is Log 1 because of a single element being in a single heap.
When there are two elements, the complexity of the step is Log 1 as we have one element in each heap.
When there are four elements, the complexity of the step is Log 2 as we have two elements in each heap.
So, when there are n elements, the complexity of a step is Log n, since we have n/2 elements in each heap, and both adding an element and moving an element from one heap to the other take O(Log n/2) = O(Log n) time.
So keeping track of the median of n elements essentially requires performing
2 * ( Log 1 + Log 2 + Log 3 + ... + Log n/2 ) steps.
The factor of 2 comes from performing the same step on 2 heaps.
The above summation can be handled in two ways. One way gives a tighter bound but it is encountered less frequently in general. Here it goes:
Log a + Log b = Log (a*b) (by the product property of logarithms)
So the summation is actually Log ((n/2)!) = O(Log n!).
The second way is:
Each of the values Log 1, Log 2, ..., Log n/2 is less than or equal to Log (n/2).
As there are n/2 terms in total, the summation is less than (n/2) * Log (n/2).
This means the sum is upper-bounded by (n/2) * Log (n/2), i.e. the complexity is O(n * Log n).
The second bound is looser but better known.
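For completeness, here is a short check, using Stirling's approximation (not part of the original answer), that the tighter bound Log((n/2)!) is itself Theta(n Log n), so the two bounds agree up to constant factors:

$$\log\big((n/2)!\big) = \tfrac{n}{2}\log\tfrac{n}{2} - \tfrac{n}{2}\log e + O(\log n) = \Theta(n \log n)$$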
This is a great question, especially since you can find the median of a list of numbers in O(N) time using Quickselect.
But the dual priority-queue approach gives you O(N log N) unfortunately.
Riffing on the binary heap wiki article here, heapify is a bottom-up operation. You have all the data in hand, and this allows you to be cunning and reduce the number of swaps/comparisons to O(N). You can build an optimal structure from the get-go.
Adding elements from the top, one at a time, as you are doing here, requires reorganizing every time. That's expensive, so the whole operation ends up being O(N log N).
I'm trying to come up with something to solve the following:
Given a max-heap represented as an array, return the kth largest element without modifying the heap. I was asked to do it in linear time, but was told it can be done in log time.
I thought of a solution:
Use a second max-heap and fill it with k or k+1 values into it (breadth first traversal into the original one) then pop k elements and get the desired one. I suppose this should be O(N+logN) = O(N)
Is there a better solution, perhaps in O(logN) time?
A max-heap can take many shapes: in the best case it is a completely sorted array, while at the other extreme it can have a totally asymmetric structure.
In the first case, the kth largest element is simply at the kth position of the array representation of the heap, so you can find it in O(1).
But in general you will need to examine somewhere between k and 2k elements and keep them partially sorted (for example, with another heap). As far as I know, that is O(k·log(k)).
And the algorithm:
Input:
    Integer kth <- 8
    Heap heap <- {19,18,10,17,14,9,4,16,15,13,12}
BEGIN
    Heap positionHeap <- Heap of positions, compared by ((n0,n1) -> compare(heap[n1], heap[n0]))
    Integer childPosition
    Integer candidatePosition <- 0
    Integer count <- 0
    positionHeap.push(candidatePosition)
    WHILE (count < kth) DO
        candidatePosition <- positionHeap.pop()
        childPosition <- candidatePosition * 2 + 1
        IF (childPosition < size(heap)) THEN
            positionHeap.push(childPosition)
            childPosition <- childPosition + 1
            IF (childPosition < size(heap)) THEN
                positionHeap.push(childPosition)
            END-IF
        END-IF
        count <- count + 1
    END-WHILE
    print heap[candidatePosition]
END-BEGIN
EDITED
I found "Optimal Algorithm of Selection in a min-heap" by Frederickson here:
ftp://paranoidbits.com/ebooks/An%20Optimal%20Algorithm%20for%20Selection%20in%20a%20Min-Heap.pdf
No, there's no O(log n)-time algorithm, by a simple cell-probe lower bound. Suppose that k is a power of two (without loss of generality) and that the heap looks like this (a min-heap, since it's easier to label, but there's no real difference):
1
2 3
4 5 6 7
.............
permutation of [k, 2k).
In the worst case, we have to read the entire permutation, because there are no order relations imposed by the heap, and as long as k is not found, it could be in any location not yet examined. This takes time Omega(k), matching the (complicated!) algorithm posted by templatetypedef.
To the best of my knowledge, there's no easy algorithm for solving this problem. The best algorithm I know of is due to Frederickson and it isn't easy. You can check out the paper here, but it might be behind a paywall. It runs in time O(k) and this is claimed to be the best possible time, so I suspect that a log-time solution doesn't exist.
If I find a better algorithm than this, I'll be sure to let you know.
Hope this helps!
Max-heap in an array: element at i is larger than elements at 2*i+1 and 2*i+2 (i is 0-based)
You'll need another max heap (insert, pop, empty) with element pairs (value, index) sorted by value. Pseudocode (without boundary checks):
input: k
1. insert (at(0), 0)
2. (v, i) <- pop and k <- k - 1
3. if k == 0 return v
4. insert (at(2*i+1), 2*i+1) and insert (at(2*i+2), 2*i+2)
5. goto 2
Runtime evaluation
array access at(i): O(1)
insertion into heap: O(log n)
insert at 4. takes at most log(k) since the size of heap of pairs is at most k + 1
statement 3. is reached at most k times
total runtime: O(k log k)
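A hedged C++ sketch of this candidate-heap idea (kth_largest is an illustrative name; it never modifies the original heap array):

#include <queue>
#include <utility>
#include <vector>
#include <cstddef>

// assumes 1 <= k <= heap.size() and a valid 0-based max-heap layout
int kth_largest(const std::vector<int>& heap, int k) {
    using Node = std::pair<int, std::size_t>;          // (value, index in the heap array)
    std::priority_queue<Node> candidates;              // max-heap ordered by value
    candidates.push({heap[0], 0});
    while (true) {
        auto [value, i] = candidates.top();            // largest not-yet-reported candidate
        candidates.pop();
        if (--k == 0)
            return value;
        std::size_t left = 2 * i + 1, right = 2 * i + 2;
        if (left < heap.size())  candidates.push({heap[left], left});
        if (right < heap.size()) candidates.push({heap[right], right});
    }
}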
Given an array of integers A[1...N], where N is the length of A, construct an array B such that B[i] = min(A[i], A[i+1], ..., A[i+K-1]), where K is given. Array B will have N-K+1 elements.
We can solve the problem using a min-heap: construct a min-heap of the first K elements in O(K). For every next element, delete the element that leaves the window, insert the new element, and heapify.
Hence worst-case time: O((N-K+1)*K) + O(K), space: O(K).
Can we do it better?
We can do better if, in the algorithm from the OP, we change the expensive "heapify" procedure to a much cheaper "up-heap" or "down-heap". This gives O(n log k) time complexity.
Or, if we iterate through the input array and put each element into a min-queue of size k, we can do it in O(n) time. A min-queue is a queue that can perform find-min in O(1) time; it may be implemented as a pair of min-stacks. See this answer for details.
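A hedged C++ sketch of the min-queue approach (MinQueue and sliding_min are illustrative names; each stack entry stores the value together with a running minimum, so find-min is O(1) and every element moves between the stacks at most once, giving O(n) overall):

#include <algorithm>
#include <stack>
#include <utility>
#include <vector>
#include <cstddef>

// a queue supporting find-min in O(1), built from two stacks that carry running minima
class MinQueue {
    std::stack<std::pair<int,int>> in, out;            // (value, min of everything below it)
    static void push(std::stack<std::pair<int,int>>& s, int v) {
        int m = s.empty() ? v : std::min(v, s.top().second);
        s.push({v, m});
    }
public:
    void push(int v) { push(in, v); }
    void pop() {
        if (out.empty())                                // move elements so 'out' holds the front
            while (!in.empty()) { push(out, in.top().first); in.pop(); }
        out.pop();
    }
    int min() const {
        if (in.empty())  return out.top().second;
        if (out.empty()) return in.top().second;
        return std::min(in.top().second, out.top().second);
    }
};

// B[i] = min(A[i], ..., A[i+K-1]) for every window of size K, in O(n) total
std::vector<int> sliding_min(const std::vector<int>& a, int k) {
    MinQueue q;
    std::vector<int> b;
    for (std::size_t i = 0; i < a.size(); ++i) {
        q.push(a[i]);
        if (i + 1 >= static_cast<std::size_t>(k)) {
            b.push_back(q.min());
            q.pop();                                    // drop the element leaving the window
        }
    }
    return b;
}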