Get the first x elements of a Heapsort

I'm preparing for a Google developer interview and working on algorithm questions. I need to figure out how to get the x smallest elements of an array of size n using the Heapsort algorithm. What part of the algorithm needs to be modified to get just the first x smallest elements?
This is the Heapsort algorithm from Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein (page 155):
HEAPSORT(A)
{
    BUILD-MAX-HEAP(A)
    for i = A.length down to 2
        exchange A[1] with A[i]
        A.heap-size = A.heap-size - 1
        MAX-HEAPIFY(A, 1)
}
These are the component algorithms:
BUILD-MAX-HEAP(A)
    A.heap-size = A.length
    for i = floor(A.length / 2) down to 1
        MAX-HEAPIFY(A, i)
MAX-HEAPIFY(A, i)
    l = LEFT(i)
    r = RIGHT(i)
    if l <= A.heap-size and A[l] > A[i]
        largest = l
    else largest = i
    if r <= A.heap-size and A[r] > A[largest]
        largest = r
    if largest != i
        exchange A[i] with A[largest]
        MAX-HEAPIFY(A, largest)
I can't figure out which part to modify to get the x smallest elements of the sorted array. I also need to find the time complexity of the modified algorithm.

By flipping the comparisons in MAX-HEAPIFY, we can change it into MIN-HEAPIFY; thus we can easily obtain a min-heap.
Then the first element of this heap is the smallest element. We can remove it, move the last element of the heap to the first position, and call MIN-HEAPIFY again to restore the heap property. Repeating this process x times yields the x smallest elements.
Time complexity: building the min-heap is O(n); the extractions cost log(n) + log(n - 1) + ... + log(n - x + 1) ~ O(x log n), so O(n + x log n) overall.
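For illustration, here is a minimal sketch of that approach in Python, using the standard heapq module (a binary min-heap) in place of a hand-written MIN-HEAPIFY; the function name first_x_smallest is mine, not from the question:

import heapq

def first_x_smallest(a, x):
    heap = list(a)
    heapq.heapify(heap)              # O(n): build the min-heap in place
    return [heapq.heappop(heap)      # each extraction is O(log n)
            for _ in range(x)]       # total: O(n + x log n)

# first_x_smallest([16, 14, 10, 8, 7, 9, 3, 2, 4, 1], 3) -> [1, 2, 3]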

Related

Heap-sort 'Heapify' iterative procedure

I was checking the iterative approach for the max-heapify algorithm and the following is what is given in CLRS solutions.
while i < A.heap-size do
    l = LEFT(i)
    r = RIGHT(i)
    largest = i
    if l ≤ A.heap-size and A[l] > A[i] then
        largest = l
    end if
    if r ≤ A.heap-size and A[r] > A[largest] then
        largest = r
    end if
    if largest not equal i then
        exchange A[i] and A[largest]
        i = largest
    else return A
    end if
end while
return A
My question is: why is the loop condition given as i < A.heap-size? Since the left and right children must lie within the heap, the parent must satisfy i <= A.heap-size/2, so why can't we use i <= A.heap-size/2 as the condition instead?
Yes, you are correct: it is sufficient to check only up to heap-size/2. Nodes beyond that index don't even have children.
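For illustration, here is a small Python sketch of the iterative procedure with the tighter bound discussed above, assuming 0-based indices (so nodes with index >= heap_size // 2 are leaves); the function name is illustrative:

def max_heapify_iterative(a, i, heap_size):
    while i < heap_size // 2:            # only these nodes have children
        l, r = 2 * i + 1, 2 * i + 2
        largest = i
        if l < heap_size and a[l] > a[largest]:
            largest = l
        if r < heap_size and a[r] > a[largest]:
            largest = r
        if largest == i:                 # heap property already holds here
            break
        a[i], a[largest] = a[largest], a[i]
        i = largest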

Finding xth smallest element in unsorted array

I've been trying some coding-algorithm exercises and one topic in particular has stood out to me. I've been trying to find a good answer to this but I've been stuck in analysis paralysis. Let's say I have an array of unsorted integers and I want to determine the xth smallest element in this array.
I know of two options to go about this:
Option 1: Run a sort algorithm, sorting elements least to greatest, and look up the xth element. To my understanding, this is O(n*log(n)) time and O(1) space.
Option 2: Heapify the array, turning it into a min heap. Then pop() the top of the heap x times. To my understanding, this is O(n) + O(x*log(n)).
I can't tell which is the optimal answer, and maybe I have a fundamental misunderstanding of priority queues and when to use them. I've tried to measure runtime and I feel like I'm getting conflicting results; maybe, with option 2, it depends on how big x is. And maybe there is a better algorithm altogether. If someone could help, I'd appreciate it!
Worst-case time complexity of approach 2 is O(n + n*log(n)), since the maximum value of x is n.
For the average case, the time complexity is O(n + ((1 + 2 + 3 + ... + n)/n) * log(n)) = O(n + ((n + 1)/2) * log(n)) = O(n log n).
Therefore approach 1 is, asymptotically, no worse than approach 2, but still not optimal.
PS: have a look at the quickselect algorithm, which runs in O(n) on average.
This algorithm's complexity revolves around two parameters:
the value of x, and
the value of n.
Space complexity
Both approaches use O(1) auxiliary space.
Time complexity
Approach 1
Best case: O(n log(n)) for the sort, plus an O(1) lookup when x == 1.
Average case: O(n log(n)) if we consider all elements unique, and O(x + n log(n)) if there are duplicates.
Worst case: O(n + n log(n)), for the case x == n.
Approach 2
Best case: O(n), since only the heapify is required when x == 1.
Average case: O(n + x log(n)).
Worst case: O(n + n log(n)), for the case x == n.
Now, coming to the point of analyzing these algorithms at runtime. In general, the guidelines below should be followed:
1. Always test with large values of n.
2. Use a good spread of the values being tested (here, x).
3. Do multiple iterations of the analysis in a clean environment (the array created afresh before each experiment, etc.) and take the average of all results.
4. Check the code complexity of any predefined functions for their exact implementation; in this case, the sort (which can be 2n log(n) etc.) and the various heap operations.
With all of the above considered and given ideal values, Method 2 should perform better than Method 1.
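As a rough sketch of such an experiment, assuming Python, random integer inputs, and the two approaches written as plain functions (all names here are mine):

import heapq
import random
import timeit

def xth_by_sort(a, x):              # approach 1: sort, then index
    a.sort()
    return a[x - 1]

def xth_by_heap(a, x):              # approach 2: heapify, then pop x times
    heapq.heapify(a)
    for _ in range(x - 1):
        heapq.heappop(a)
    return a[0]

def bench(fn, n, x, trials=5):
    total = 0.0
    for _ in range(trials):         # multiple iterations, averaged
        data = [random.randint(0, n) for _ in range(n)]  # fresh array per run
        total += timeit.timeit(lambda: fn(list(data), x), number=1)
    return total / trials

for x in (1, 100, 10000, 100000):   # a spread of x values, with large n
    print(x, bench(xth_by_sort, 100000, x), bench(xth_by_heap, 100000, x))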
Although approach 1 has the lower time complexity, both of these algorithms typically use some auxiliary space (std::sort, for instance, usually needs O(log n) stack space for its recursion). Another way of doing this in constant space is via binary search on the value range: you can binary-search for the xth element. Let l be the smallest element of the array and r the largest; the time complexity is then O(n log(r-l)).
// assumes: array a[0..n-1], target rank x,
// l = smallest element of a, r = largest element of a
int ans = l - 1;
while (l <= r) {
    int mid = (l + r) / 2;
    int cnt = 0;
    for (int i = 0; i < n; i++) {   // count elements <= mid
        if (a[i] <= mid)
            cnt++;
    }
    if (cnt < x) {                  // mid is still below the xth element
        ans = mid;
        l = mid + 1;
    }
    else
        r = mid - 1;
}
Now look for the smallest element larger than ans that is present in the array; that is the xth smallest.
Time complexity: O(n log(r-l)) + O(n) for the last step.
Space complexity: O(1).
You can find the xth element in O(n); there are also two simple heap algorithms that improve on your option 2 complexity. I'll start with the latter.
Simple heap algorithm №1: O(x + (n-x) log x) worst-case complexity
Create a max heap out of the first x elements; for each of the remaining elements, if it is smaller than the current maximum, pop the max and push that element instead:
import heapq
from typing import List

def findKthSmallest(nums: List[int], k: int) -> int:
    heap = [-n for n in nums[:k]]    # max heap of the first k elements (negated)
    heapq.heapify(heap)
    for num in nums[k:]:
        if -num > heap[0]:           # num is smaller than the current kth smallest
            heapq.heapreplace(heap, -num)
    return -heap[0]
Simple heap algorithm №2: O(n + x log x)
Turn the whole array into a min heap, and insert its root into an auxiliary min heap.
k-1 times pop an element from the second heap, and push back its children from the first heap.
Return the root of the second heap.
import heapq
from typing import List

def findKthSmallest(nums: List[int], k: int) -> int:
    x = nums.copy()
    heapq.heapify(x)                 # O(n): min heap over the whole array
    s = [(x[0], 0)]                  # auxiliary heap of (value, index) pairs
    for _ in range(k - 1):
        ind = heapq.heappop(s)[1]
        if 2 * ind + 1 < len(x):     # push left child from the first heap
            heapq.heappush(s, (x[2 * ind + 1], 2 * ind + 1))
        if 2 * ind + 2 < len(x):     # push right child from the first heap
            heapq.heappush(s, (x[2 * ind + 2], 2 * ind + 2))
    return s[0][0]
Which of these is faster? It depends on values of x and n.
A more complicated algorithm by Frederickson would allow you to find the xth smallest element in a heap in O(x), but that would be overkill, since the xth smallest element of an unsorted array can be found in O(n) worst-case time.
Median-of-medians algorithm: O(n) worst-case time
Described in [1].
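Since the text above only points to [1] for median-of-medians, here is a hedged Python sketch of the idea; this out-of-place version is for clarity and is not the book's in-place pseudocode:

def select(A, k):
    """Return the k-th smallest (0-based) element of A in O(n) worst case."""
    if len(A) <= 5:
        return sorted(A)[k]
    # median of each group of 5, then recurse to pick a good pivot
    medians = [sorted(A[i:i + 5])[len(A[i:i + 5]) // 2]
               for i in range(0, len(A), 5)]
    pivot = select(medians, len(medians) // 2)
    lows = [x for x in A if x < pivot]
    pivots = [x for x in A if x == pivot]
    highs = [x for x in A if x > pivot]
    if k < len(lows):
        return select(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return select(highs, k - len(lows) - len(pivots))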
Quickselect algorithm: O(n) average time, O(n^2) worst-case time
def partition(A, lo, hi):
    """rearrange A[lo:hi+1] and return j such that
    A[lo:j] <= pivot
    A[j] == pivot
    A[j+1:hi+1] >= pivot
    """
    pivot = A[lo]
    if A[hi] > pivot:
        A[lo], A[hi] = A[hi], A[lo]
    #now A[hi] <= A[lo], and A[hi] and A[lo] need to be exchanged
    i = lo
    j = hi
    while i < j:
        A[i], A[j] = A[j], A[i]
        i += 1
        while A[i] < pivot:
            i += 1
        j -= 1
        while A[j] > pivot:
            j -= 1
    #now put pivot in the j-th place
    if A[lo] == pivot:
        A[lo], A[j] = A[j], A[lo]
    else:
        #then A[hi] == pivot
        j += 1
        A[j], A[hi] = A[hi], A[j]
    return j
def quickselect(A, left, right, k):
    pivotIndex = partition(A, left, right)
    if k == pivotIndex:
        return A[k]
    elif k < pivotIndex:
        return quickselect(A, left, pivotIndex - 1, k)
    else:
        return quickselect(A, pivotIndex + 1, right, k)
Introselect: O(n) worst-case time
Basically, do quickselect, but if recursion gets too deep, switch to median-of-medians.
import numpy as np
from typing import List

def findKthSmallest(nums: List[int], k: int) -> int:
    return np.partition(nums, k, kind='introselect')[k]
Rivest-Floyd algorithm: O(n) average time, O(n^2) worst-case time
Another way to speed up quickselect:
import math

C1 = 600
C2 = 0.5
C3 = 0.5

def rivest_floyd(A, left, right, k):
    assert k < len(A)
    while right > left:
        if right - left > C1:
            #select a random sample from A
            N = right - left + 1
            I = k - left + 1
            Z = math.log(N)
            S = C2 * math.exp(2/3 * Z)  #sample size
            SD = C3 * math.sqrt(Z * S * (N - S) / N) * math.copysign(1, I - N/2)
            #select subsample such that kth element lies between newleft and newright most of the time
            newleft = max(left, k - int(I * S / N + SD))
            newright = min(right, k + int((N - I) * S / N + SD))
            rivest_floyd(A, newleft, newright, k)
        A[left], A[k] = A[k], A[left]
        #partition2 partitions A[left..right] around the pivot A[left];
        #e.g. the partition() defined above fits
        j = partition2(A, left, right)
        if j <= k:
            left = j + 1
        if k <= j:
            right = j - 1
    return A[k]
[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4, pp. 220-223.

Finding median in merged array of two sorted arrays

Assume we have two sorted arrays of integers, of sizes n and m. What is the best way to find the median of all m + n numbers?
It's easy to do this in log(n) * log(m) time, but I want to solve this problem in log(n) + log(m) time. Is there any suggestion for solving it that way?
Explanation
The key point of this problem is to recursively discard half of A or B at each step, by comparing the medians of the remaining parts of A and B:
if (aMid < bMid) Keep [aMid +1 ... n] and [bLeft ... m]
else Keep [bMid + 1 ... m] and [aLeft ... n]
// where n and m are the length of array A and B
With this, the time complexity is O(log(m + n)):
public double findMedianSortedArrays(int[] A, int[] B) {
    int m = A.length, n = B.length;
    int l = (m + n + 1) / 2;
    int r = (m + n + 2) / 2;
    return (getkth(A, 0, B, 0, l) + getkth(A, 0, B, 0, r)) / 2.0;
}

public double getkth(int[] A, int aStart, int[] B, int bStart, int k) {
    if (aStart > A.length - 1) return B[bStart + k - 1];
    if (bStart > B.length - 1) return A[aStart + k - 1];
    if (k == 1) return Math.min(A[aStart], B[bStart]);

    int aMid = Integer.MAX_VALUE, bMid = Integer.MAX_VALUE;
    if (aStart + k/2 - 1 < A.length) aMid = A[aStart + k/2 - 1];
    if (bStart + k/2 - 1 < B.length) bMid = B[bStart + k/2 - 1];

    if (aMid < bMid)
        return getkth(A, aStart + k/2, B, bStart, k - k/2); // Check: aRight + bLeft
    else
        return getkth(A, aStart, B, bStart + k/2, k - k/2); // Check: bRight + aLeft
}
Hope it helps! Let me know if you need more explanation on any part.
Here's a very good Java solution I found on Stack Overflow. It's a method of finding the Kth and (K+1)th smallest items in the two arrays, where K is the center of the merged array.
If you have a function for finding the Kth item of two arrays then finding the median of the two is easy;
Calculate the weighted average of the Kth and Kth+1 items of X and Y
But then you'll need a way to find the Kth item of two lists (remember, K is one-indexed from here on):
If X contains zero items, then the Kth smallest item of X and Y is the Kth smallest item of Y
Otherwise if K == 1, then the smallest item of X and Y is the smaller of the two heads, min(X[0], Y[0])
Otherwise:
i. Let A be min(length(X), K / 2)
ii. Let B be min(length(Y), K / 2)
iii. If X[A] > Y[B], then recurse from step 1 with X, Y' (all elements of Y from B to the end of Y), and K' = K - B; otherwise recurse with X' (all elements of X from A to the end of X), Y, and K' = K - A
If I find the time tomorrow, I will verify that this algorithm works in Python as stated and provide example source code; as-is, it may have some off-by-one errors.
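In that spirit, here is a hedged Python sketch of the steps above (K is one-indexed while the arrays are zero-indexed; the slicing keeps it readable at the cost of extra copies, and the function names are mine):

def kth_smallest(X, Y, K):
    if not X:
        return Y[K - 1]                  # step 1: X is empty
    if not Y:
        return X[K - 1]
    if K == 1:
        return min(X[0], Y[0])           # step 2: base case
    a = min(len(X), K // 2)              # step 3.i
    b = min(len(Y), K // 2)              # step 3.ii
    if X[a - 1] > Y[b - 1]:              # step 3.iii: discard Y[:b]
        return kth_smallest(X, Y[b:], K - b)
    return kth_smallest(X[a:], Y, K - a) # otherwise discard X[:a]

def median(X, Y):
    total = len(X) + len(Y)
    if total % 2:
        return kth_smallest(X, Y, total // 2 + 1)
    return (kth_smallest(X, Y, total // 2) +
            kth_smallest(X, Y, total // 2 + 1)) / 2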
Take the median element of list A and call it a. Compare a to the center elements of list B; let's call them b1 and b2 (if B has odd length, then exactly where you split B depends on your definition of the median of an even-length list, but the procedure is almost identical regardless). If b1 ≤ a ≤ b2, then a is the median of the merged array. This can be checked in constant time since it requires exactly two comparisons.
If a is greater than b2, then we add the top half of A to the top of B and repeat. B will no longer be sorted, but it doesn't matter. If a is less than b1, then we add the bottom half of A to the bottom of B and repeat. This iterates at most log(n) times (if the median is found sooner, stop, of course).
It is possible that this will not find the median; in that case the median is in B, so perform the same algorithm with A and B reversed. This requires log(m) iterations. In total you will have performed at most 2*(log(n) + log(m)) iterations of a constant-time operation, so you have solved the problem in O(log(n) + log(m)) time.
This is essentially the same answer as was given by iehrlich, but written out more explicitly.
Yes, this can be done. Given two arrays A and B, in the worst case you first perform a binary search in A and then, if it fails, a binary search in B looking for the median. At each step of a binary search, you check whether the current element is actually a median of the merged A+B array. Such a check takes constant time.
Let's see why the check is constant. For simplicity, assume that |A| + |B| is odd and that all numbers in both arrays are distinct. You can remove these restrictions later by applying the usual median-definition approach (i.e., how to calculate the median of an array containing duplicates, or of an array of even length). Given that, we know for sure that in the merged array there will be (|A| + |B| - 1) / 2 elements to the right and to the left of the actual median. During the binary search in A, we know the index of the current element x in array A (call it i). Now, if x satisfies the condition B[j] < x < B[j+1], where i + j == (|A| + |B| - 1) / 2, then x is your median.
The overall complexity is O(log(max(|A|, |B|))) time and O(1) memory.
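As a minimal sketch of this check-based binary search, assuming (as the answer does) an odd total length and distinct elements; the helper names are mine:

def median_of_merged(A, B):
    half = (len(A) + len(B)) // 2      # elements strictly below the median

    def search(X, Y):
        lo, hi = 0, len(X) - 1
        while lo <= hi:
            i = (lo + hi) // 2
            j = half - i               # required count of Y-elements below X[i]
            if j < 0 or (j < len(Y) and Y[j] < X[i]):
                hi = i - 1             # X[i] is too large to be the median
            elif j > len(Y) or (j > 0 and Y[j - 1] > X[i]):
                lo = i + 1             # X[i] is too small to be the median
            else:
                return X[i]            # constant-time check succeeded
        return None                    # median is not in X

    m = search(A, B)
    return m if m is not None else search(B, A)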

Count number of swaps to sort first k-smallest element using a bubble sort like algorithm

Given an array a and an integer k, someone uses the following algorithm to get the first k smallest elements:
cnt = 0
for i in [1, k]:
    for j in [i + 1, n]:
        if a[i] > a[j]:
            swap(a[i], a[j])
            cnt = cnt + 1
The problem is: how do we calculate the value of cnt (once we have the final k-sorted array), i.e. the number of swaps, in O(n log n) or better?
Or simply put: calculate the number of swaps needed to get the first k smallest numbers sorted using the above algorithm, in O(n log n) time or better.
I am thinking about a binary search tree, but I'm getting confused (How will the array change as i increases? How do I calculate the number of swaps for a fixed i? ...).
This is a very good question: it involves inverse pairs, a stack, and some proof techniques.
Note 1: All index used below are 1-based, instead of traditional 0-based.
Note 2: If you want to see the algorithm directly, please start reading from the bottom.
First we define Inverse Pairs as:
For a[i] and a[j], in which i < j holds, if we have a[i] > a[j], then a[i] and a[j] are called an Inverse Pair.
For example, In the following array:
3 2 1 5 4
a[1] and a[2] form one inverse pair, and a[2] and a[3] form another.
Before we start the analysis, let's agree on a common language: in the rest of the post, "inverse pairs starting from i" means the number of inverse pairs (i, j) with j > i, i.e. those in which a[i] is the left element.
For example, for a = {3, 1, 2}, the number of inverse pairs starting from 1 is 2, and starting from 2 it is 0.
Now let's look at some facts:
If we have i < j < k, and a[i] > a[k], a[j] > a[k], then swapping a[i] and a[j] (if they are an inverse pair) won't affect the total number of inverse pairs starting from j;
The total number of inverse pairs starting from i may change after a swap (e.g. suppose we have a = {5, 3, 4}; before a[1] is swapped with a[2], the number of inverse pairs starting from 1 is 2, but after the swap the array becomes a = {3, 5, 4} and that number becomes 1);
Given an array A and two numbers, a and b, as candidate head elements of A, if we can form more inverse pairs with a than with b, we have a > b;
Let ip[i] denote the total number of inverse pairs starting from i. Then: if k is the smallest number satisfying ip[i] > ip[i + k], then a[i] > a[i + k] while a[i] <= a[i + 1 .. i + k - 1] must be true. In words, if ip[i + k] is the first number smaller than ip[i], then a[i + k] is also the first number smaller than a[i];
Proof of point 1:
By the definition of an inverse pair, for every a[k], k > j, that forms an inverse pair with a[j], a[k] < a[j] must hold. Since a[i] and a[j] are an inverse pair with i < j, we have a[i] > a[j]. Therefore a[i] > a[j] > a[k], which shows that the inverse-pair relationships are not broken.
Proof of point 3:
Omitted, since it is quite obvious.
Proof of point 4:
First, it's easy to see that when i < j and a[i] > a[j], we have ip[i] >= ip[j] + 1 > ip[j]. Its contrapositive is then also true: when i < j and ip[i] <= ip[j], we have a[i] <= a[j].
Now back to the point. Since k is the smallest number satisfying ip[i] > ip[i + k], we have ip[i] <= ip[i + 1 .. i + k - 1], which by the lemma just proved indicates a[i] <= a[i + 1 .. i + k - 1]; this also means a[i] forms no inverse pairs within the region [i + 1, i + k - 1]. Therefore ip[i] equals the number of inverse pairs a[i] forms in the region [i + k, n]. Given ip[i + k] < ip[i], we know a[i + k] forms fewer inverse pairs than a[i] in the region [i + k + 1, n], which indicates a[i + k] < a[i] (by Point 3).
You can write down some sequences and try out the 4 facts mentioned above and convince yourself or disprove them :P
Now it's about the algorithm.
A naive implementation will take O(nk) to compute the result, and the worst case will be O(n^2) when k = n.
But how about we make use of the facts above:
First we compute ip[i] using a Fenwick tree (see Note 1 below), which takes O(n log n) to construct and O(n log n) to get all the ip[i] calculated.
Next, we make use of the facts above. Since a swap of two numbers only affects the inverse-pair count at the current position, not the values after it (points 1 and 2), we don't need to worry about values changing. Also, since the nearest smaller number to the right shares the same index in ip and in a (point 4), we only need to find the first ip[j] smaller than ip[i] in [i + 1, n]. If we denote the number of swaps needed to get the first i elements sorted as f[i], we have f[i] = f[j] + 1.
But how do we find this "first smaller number" fast? Use a stack! Here is a post which asks a highly similar problem: Given an array A, compute B s.t. B[i] stores the nearest element to the left of A[i] which is smaller than A[i]
In short, we are able to do this in O(n).
But wait, the post says "to the left", while in our case it's "to the right". The solution is simple: we work backward, and then everything is the same :D
Therefore, in summary, the total time complexity of the algorithm is O(n log n) + O(n) = O(n log n).
Finally, let's walk through an example (a simplified version of #make_lover's example in the comments):
a = {2, 5, 3, 4, 1, 6}, k = 2
First, let's get the inverse pairs:
ip = {1, 3, 1, 1, 0, 0}
To calculate f[i], we work backward (since we need the stack technique):
f[6] = 0, since it's the last one
f[5] = 0, since there is no number to the right smaller than ip[5] = 0
f[4] = f[5] + 1 = 1, since ip[5] is the first smaller number to the right
f[3] = f[5] + 1 = 1, since ip[5] is the first smaller number to the right
f[2] = f[3] + 1 = 2, since ip[3] is the first smaller number to the right
f[1] = f[5] + 1 = 1, since ip[5] is the first smaller number to the right
Therefore, ans = f[1] + f[2] = 3
Note 1: Using a Fenwick tree (Binary Indexed Tree), the inverse-pair counts can be computed in O(N log N); here is a post on this topic, please have a look :)
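Putting the pieces together, here is a hedged Python sketch of the whole algorithm on 0-based lists; the helper names are mine, not from the post:

def count_swaps(a, k):
    n = len(a)

    # Step 1: ip[i] = number of j > i with a[j] < a[i], via a Fenwick tree
    # over rank-compressed values, scanning from the right.
    ranks = {v: r for r, v in enumerate(sorted(set(a)), start=1)}
    tree = [0] * (len(ranks) + 1)

    def update(pos):
        while pos < len(tree):
            tree[pos] += 1
            pos += pos & (-pos)

    def query(pos):                    # how many seen values have rank <= pos
        s = 0
        while pos > 0:
            s += tree[pos]
            pos -= pos & (-pos)
        return s

    ip = [0] * n
    for i in range(n - 1, -1, -1):
        ip[i] = query(ranks[a[i]] - 1) # strictly smaller values to the right
        update(ranks[a[i]])

    # Step 2: f[i] = f[j] + 1, where j is the nearest index to the right
    # with ip[j] < ip[i]; found with a monotonic stack, scanning backward.
    f = [0] * n
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and ip[stack[-1]] >= ip[i]:
            stack.pop()
        f[i] = f[stack[-1]] + 1 if stack else 0
        stack.append(i)

    return sum(f[:k])

# Matches the worked example: count_swaps([2, 5, 3, 4, 1, 6], 2) == 3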
Update
Aug 20, 2014: There was a critical error in my previous post (thanks to #make_lover); here is the latest update.

What is wrong with my Heapsort algorithm for building a min heap?

I'm prepping for software developer interviews and have been working on algorithm problems. My book shows a Heapsort algorithm that sorts an unordered array in increasing order. I'm trying to modify it to sort using a min-heap, but when I trace through the code it doesn't sort my array correctly. What is wrong with my code (in pseudocode)?
The array to be sorted: 16, 14, 10, 8, 7, 9, 3, 2, 4, 1
The book's Heapsort algorithm using max-heapify:
HEAPSORT(A)
    BUILD-MAX-HEAP(A)
    for i = A.length down to 2
        swap A[1] with A[i]
        A.heapsize = A.heapsize - 1
        MAX-HEAPIFY(A, 1)

MAX-HEAPIFY(A, i)
    l = Left(i)
    r = Right(i)
    if l <= A.heapsize and A[l] > A[i]
        largest = l
    else
        largest = i
    if r <= A.heapsize and A[r] > A[largest]
        largest = r
    if largest != i
        swap A[i] with A[largest]
        MAX-HEAPIFY(A, largest)
My modified code using min-heapify:
HEAPSORT(A) // where A is an array
    BUILD-MIN-HEAP(A)
    for i = A.length down to 2
        swap A[1] with A[i]
        A.heapsize = A.heapsize + 1
        MIN-HEAPIFY(A, 1)

MIN-HEAPIFY(A, i)
    l = Left(i)
    r = Right(i)
    if l <= heapsize.A and A[l] < A[i]
        smallest = l
    else
        smallest = i
    if r <= heapsize.A and A[r] < A[smallest]
        smallest = r
    if smallest != i
        swap A[i] with A[smallest]
        MIN-HEAPIFY(A, smallest)
Heap sort runs in two phases: (1) transform the unsorted array into a heap, (2) transform the heap into a sorted array.
For building up the heap, the for-loop should run from 2 up to A.length; also, the heap size should become larger, not smaller.
The code snippet shown seems to be meant for phase 2, building the sorted array out of the heap.
Phase 1 would build up the heap at the beginning of the array by letting new nodes "bubble up" in the heap: as long as the new node is larger (or smaller, if you are building a min-heap) than its parent, exchange it with its parent.
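To make this concrete, here is a hedged Python sketch (not the book's code) of a corrected version of the modified algorithm. Note that heapsort built on MIN-HEAPIFY sorts in decreasing order, since each pass moves the current minimum to the end, and the heap size must shrink, exactly as in the max-heap original:

def min_heapify(a, i, heap_size):
    smallest = i                        # sift a[i] down to restore the min-heap
    l, r = 2 * i + 1, 2 * i + 2
    if l < heap_size and a[l] < a[smallest]:
        smallest = l
    if r < heap_size and a[r] < a[smallest]:
        smallest = r
    if smallest != i:
        a[i], a[smallest] = a[smallest], a[i]
        min_heapify(a, smallest, heap_size)

def heapsort_descending(a):
    heap_size = len(a)
    for i in range(len(a) // 2 - 1, -1, -1):   # BUILD-MIN-HEAP
        min_heapify(a, i, heap_size)
    for i in range(len(a) - 1, 0, -1):
        a[0], a[i] = a[i], a[0]          # move the current minimum to the end
        heap_size -= 1                   # shrink the heap, don't grow it
        min_heapify(a, 0, heap_size)
    return a

# heapsort_descending([16, 14, 10, 8, 7, 9, 3, 2, 4, 1])
# -> [16, 14, 10, 9, 8, 7, 4, 3, 2, 1]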
