The traditional heapsort algorithm swaps the root of the current heap with its last element after every heapification, then repeats the process. However, it seems to me that part of this work is unnecessary.
After heapifying a sub-array, the root contains the highest value (for a max-heap), and the next two elements in the array must follow the root in the sorted array, either in their current order or exchanged if they are in reverse order. So instead of swapping only the root with the last element, wouldn't it be better to swap the first three elements (the root plus the second and third elements, exchanged first if necessary) with the last three elements, so that the two subsequent heapifications (for the second and third elements) can be skipped?
Is there any disadvantage to this method (apart from the occasional exchange of the second and third elements, which should be trivial)? If not, and it is indeed better, how much of a performance boost will it give? Here is the pseudo-code:
function heapify(a,i) {
#assumes node i has two children, both heaps. However, the subtree at node i might not be a heap, i.e. one of its children may be greater than its value.
#if so, find the greater of its two children, then swap the parent with that value.
#this might render that child no longer a heap, so recurse
}
function create_heap(a) {
#all elements following index |_n/2_| are leaf nodes, thus heapify() should be applied to all elements within index 1 to |_n/2_|
}
function heapsort(a) {
create_heap(a); #a is now a max-heap
#root of the heap, that is a[1] is the maximum, so swap it with a[n].
#The root now contains an element smaller than its children, both of which are themselves heaps.
#so apply heapify(a,1). Note: heap length is now n-1, as a[n] is the largest element already found
#now again the array is a heap. The highest value is in a[1]. Swap it with a[n-1].
#continue
}
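The comment-only pseudo-code above could be fleshed out roughly as follows. This is a minimal sketch, not the asker's actual code: it assumes a zero-based Python list (so the children of i are 2*i+1 and 2*i+2 rather than 2*i and 2*i+1), and the extra `size` parameter is an addition so that heapify can ignore the already-sorted tail.

```python
def heapify(a, i, size):
    """Sift a[i] down within a[:size]; both subtrees of i must already be max-heaps."""
    left, right = 2 * i + 1, 2 * i + 2
    largest = i
    if left < size and a[left] > a[largest]:
        largest = left
    if right < size and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        heapify(a, largest, size)  # the swapped child may no longer be a heap


def create_heap(a):
    # Indices len(a)//2 and above are leaves, so only the earlier ones need fixing.
    for i in range(len(a) // 2 - 1, -1, -1):
        heapify(a, i, len(a))


def heapsort(a):
    create_heap(a)  # a is now a max-heap
    for size in range(len(a), 1, -1):
        a[0], a[size - 1] = a[size - 1], a[0]  # move the current maximum to its final slot
        heapify(a, 0, size - 1)                # restore the heap on the shrunken prefix
    return a
```

The recursion in heapify mirrors the "so recurse" comment; an iterative loop would do the same work without the call overhead.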
Suppose that, after the heap has been built, the array is [16,14,10,8,7,9,3,2,4]. The first iteration of heapsort swaps 16 and 4, giving [4,14,10,8,7,9,3,2,16]. The root of the new heap [4,14,10,8,7,9,3,2] now violates the heap property (14 and 10 are both greater than 4), so another heapify produces [14,8,10,4,7,9,3,2]. With 14 at the root, swap it with 2 to yield [2,8,10,4,7,9,3,14], so the whole array is now [2,8,10,4,7,9,3,14,16]. Again 2 violates the heap property, so another heapify turns the heap into [10,8,9,4,7,2,3]. Then 10 is swapped with 3, making the array [3,8,9,4,7,2,10,14,16]. My point is that instead of doing the 2nd and 3rd heapifications to place 10 and 14 before 16, we can tell from the first heapification that, because 14 and 10 follow 16, they are the 2nd and 3rd largest elements (in one order or the other). So after a comparison between them (here 14 already comes before 10), I swap all three (16,14,10) with (3,2,4), making the array [3,2,4,8,7,9,16,14,10]. This leaves us in a situation similar to the one after the two further heapifications: [3,8,9,4,7,2,10,14,16] originally, compared with [3,2,4,8,7,9,16,14,10] now. Both need further heapification, but the second method reaches this point directly with just one comparison between two elements (14 and 10).
The second largest element of the heap is present in the second or third position, but the third largest can be present further down, at depth 2. (See the figure in http://en.wikipedia.org/wiki/Heap_(data_structure) ). Furthermore, after swapping the first three elements with the last three, the heapify method would first heapify the first subtree of the root, followed by the second subtree of the root, followed by the whole tree. Thus the total cost of this operation is close to three times the cost of swapping the top element with the last and calling heapify. So you won't gain anything by doing this.
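A small check makes the first point concrete. The heap below is a made-up example (not taken from the question), and both helper names are hypothetical; the sketch assumes a zero-based array with distinct values.

```python
def is_max_heap(a):
    """True if every node is at least as large as its children (zero-based layout)."""
    return all(a[(i - 1) // 2] >= a[i] for i in range(1, len(a)))


def index_of_kth_largest(a, k):
    """Index of the k-th largest value (k=1 is the maximum); assumes distinct values."""
    return a.index(sorted(a, reverse=True)[k - 1])


heap = [10, 9, 5, 8, 7]  # hypothetical example: 8 and 7 are children of 9
assert is_max_heap(heap)
print(index_of_kth_largest(heap, 2))  # 1: a child of the root
print(index_of_kth_largest(heap, 3))  # 3: a grandchild of the root, at depth 2
```

In the question's heap [16,14,10,8,7,9,3,2,4] the third largest value happens to sit at index 2, which is why the shortcut appears to work on that particular input.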
I am a bit confused. If I have an array, I have to build a tree. To compare the children, I have to know how large my array is; in this case N = 6, so I divide it by 2 and get 3. That means I have to start from index 3 and compare that node with its children. If a child is greater than the parent node, I have to swap them; otherwise I don't. Then I go to index 2 and compare it with its children, swapping if a child is greater than the parent node. Then at index 1 I compare with the children and swap if needed. So I have created a max-heap. But now what I don't get is why I have to exchange A[1] with A[6], then A[1] with A[5]. In the end I don't get a max-heap, I get a min-heap? What does heapify mean?
Thanks a lot, I appreciate every answer!
One of my exercises is: illustrate the steps of heapsort by filling in the arrays and the tree representations.
There are many implementations of a heap data structure, but here one is talking about a specific one, the implicit binary heap. Heapsort is done in place, so it uses this design. Binary heaps require a complete binary tree, so the heap can be represented as an implicit structure built out of the array: for every A[n] in a zero-based array,
A[0] is the root; if n != 0, A[floor((n-1)/2)] is the parent;
if 2n+1 is in the range of the array, then A[2n+1] is the left child, otherwise A[n] is a leaf node;
if 2n+2 is in the range of the array, then A[2n+2] is the right child.
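The rules above translate directly into index arithmetic. A minimal sketch follows; the function names are illustrative and not from any particular library.

```python
def parent(n):
    """Index of the parent of node n; only meaningful for n != 0."""
    return (n - 1) // 2


def left(n):
    return 2 * n + 1


def right(n):
    return 2 * n + 2


def children(a, n):
    """Values of the children of A[n] that actually exist in the array."""
    return [a[c] for c in (left(n), right(n)) if c < len(a)]


a = [10, 14, 19, 21, 23, 31]
print(children(a, 0))  # [14, 19]
print(children(a, 2))  # [31]: only a left child
print(children(a, 3))  # []: a leaf node
```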
Say one's array is [10,14,19,21,23,31]; it is represented implicitly by the homomorphism, using the above rules, as,
This is not following the max-heap invariants, so one must heapify, probably using Floyd's heap construction which uses sift down and runs in O(n). Now you have a heap and a sorted array of no length, ([31,23,19,21,14,10],[]), (this is all implicit, since the heap takes no extra memory, it's just an array in memory.) The visualisation of the heap at this stage,
We pop off the maximum element of the heap and use sift down to restore the heap property. Now the heap is one smaller, and we've taken the maximum element and prepended it to our sorted array, ([23,21,19,10,14],[31]),
repeat, ([21,14,19,10],[23,31]),
([19,14,10],[21,23,31]),
([14,10],[19,21,23,31]),
([10],[14,19,21,23,31]),
The heap size is one, so one's final sorted array is [10,14,19,21,23,31]. If one used a min-heap and the same algorithm, then the array would be sorted the other way.
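The whole walk-through can be reproduced with a short trace. This is a sketch under the same assumptions as above (zero-based array, max-heap, Floyd's bottom-up construction, sift down after each pop); `heapsort_trace` is a made-up name that records each (heap, sorted) pair instead of just sorting.

```python
def sift_down(a, i, size):
    """Move a[i] down within a[:size] until neither child is larger."""
    while True:
        child = 2 * i + 1
        if child >= size:
            return
        if child + 1 < size and a[child + 1] > a[child]:
            child += 1  # pick the larger child
        if a[child] <= a[i]:
            return
        a[i], a[child] = a[child], a[i]
        i = child


def build_max_heap(a):
    """Floyd's construction: sift down every non-leaf node, last one first."""
    for i in range(len(a) // 2 - 1, -1, -1):
        sift_down(a, i, len(a))


def heapsort_trace(values):
    """Return the list of (heap, sorted) pairs produced while sorting."""
    a = list(values)
    build_max_heap(a)
    states = [(a[:], [])]
    for size in range(len(a), 1, -1):
        a[0], a[size - 1] = a[size - 1], a[0]  # the maximum joins the sorted part
        sift_down(a, 0, size - 1)
        states.append((a[:size - 1], a[size - 1:]))
    return states


for heap, done in heapsort_trace([10, 14, 19, 21, 23, 31]):
    print(heap, done)
```

Both lists in each pair are slices of the same underlying array, which is what "implicit" means here: nothing beyond the array itself is stored.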
Heap sort is a two-phase process. In the first phase, you turn the array into a heap with the maximum value at the top, A[1]. This is the first transition circled in red. After this phase, the heap is in the array from index 1 to 6, and the biggest value is at index 1 in A[1].
In the second phase we sort the values. This is a multistep process where we extract the biggest value from the heap and put it in place in the sorted array.
The heap is on the left side of the array and will shrink toward the left. The sorted array is on the right of the array and grows to the left.
At each step we swap the top of the heap, A[1], which contains the biggest value of the heap, with the last value of the heap. The sorted array has then grown one position to the left. Since the value that has been put in A[1] is not the biggest, we have to restore the heap. This operation is called max-heapify. After this process, A[1] contains the biggest value in the heap, whose size has been reduced by one element.
By repeatedly extracting the biggest value left in the heap, we can sort the values in the array.
The drawing of the binary tree is very confusing. Its size should shrink at each step because the size of the heap shrinks.
I need to prove that the median of a binary heap (it doesn't matter whether it is a min-heap or a max-heap) can be in the lowest level of the heap (in a leaf). I am not sure how to prove it. I thought about using the fact that a heap is a complete binary tree, but I am not sure about it. How can I prove it?
As @Evg mentioned in the comments, if all elements are the same, this is trivially true. Assume that all elements need to be different, and let us focus on the case with an odd number of nodes 2H+1 and a min-heap (the max-heap case is similar). To create the min-heap where the median is at the bottom, first insert the smallest H elements.
There are two cases. Case 1: after doing this, the binary tree formed by these H elements is completely filled (every layer is full). Then you can just insert the remaining H+1 elements on the next layer, which becomes the last layer (you can do this since the maximum capacity of the last layer equals (#total_nodes+1)/2, which is precisely H+1).
Case 2: the last layer still has some unfilled spaces. In this case, take the smallest remaining nodes from the largest H elements until this layer is filled (note that there will be no upward movement in your heap, since these elements are already larger than whatever is in the tree). Then start the next layer by inserting the median. Finally insert the remaining nodes, which won't be moved upwards either, since by construction they are larger than whatever is in the layer above. By the same argument about the capacity of the last layer, you will not need to start a new layer during this process.
In the case where there is an even number of nodes 2H, you can argue similarly, but you would have to define the median as the (H+1)-th smallest node (otherwise the statement you want to prove is false, as you can see by noticing that the only possible min-heap for the set {1,2} is the tree with root 1 and leaf 2).
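For the odd case, the construction can be written out as a level-order array. This is only a sketch of the argument above, assuming distinct values and an odd count; `heap_with_median_at_bottom` is a hypothetical name, and the array is laid out directly rather than built by repeated insertion (which gives the same result here, since no element ever moves up).

```python
def heap_with_median_at_bottom(values):
    """Level-order min-heap of an odd number of distinct values, median on the last layer."""
    s = sorted(values)
    h = len(s) // 2
    small, median, large = s[:h], s[h], s[h + 1:]
    # Size of the smallest perfect tree that can hold the h smallest values.
    capacity = 1
    while capacity - 1 < h:
        capacity *= 2
    fill = capacity - 1 - h  # slots left in the partly filled layer (0 in case 1)
    return small + large[:fill] + [median] + large[fill:]


print(heap_with_median_at_bottom(range(1, 8)))   # [1, 2, 3, 4, 5, 6, 7]
print(heap_with_median_at_bottom(range(1, 10)))  # [1, 2, 3, 4, 6, 7, 8, 5, 9]
```

In the second output the median 5 is at index 7, the first slot of the last layer, and its parent is 4.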
Easiest way to prove it is just to make one:
   1
 2   3
4 5 6 7
Any complete heap with nodes in level order will have the median at the left-most leaf, but you don't have to prove that.
This is a coding interview question. We are given an array, say random_arr, and we need to sort it using only the swap function.
Also, the number of swaps for each element in random_arr is limited. For this you are given an array parent_arr, which determines the number of swaps allowed for each element of random_arr.
Constraints:
You should use swap function.
Every element may repeat minimum 5 times and maximum 26 times.
You cannot make elements of given array to 0.
You should not write helper functions.
Now I will explain how parent_arr is declared. If parent_arr is like:
parent_arr[] = {a,b,c,d,...,z} then
a can be swapped at most one time.
b can be swapped at most two times.
if parent_arr[] = {c,b,a,....,z} then
c can be swapped at most one time.
b can be swapped at most two times.
a can be swapped at most three times
My solution:
For each element in random_arr[], store how many elements are below it in sorted order. Now select the element with the minimum swap count from parent_arr[] and check whether it exists in random_arr[]. If it does and it occurs more than once, there will be more than one location where it can be placed. Choose the position (or, more precisely, the element at that position) with the maximum swap count and swap with it. Then decrease the swap count for that element, re-sort parent_arr[], and repeat the process.
But it is quite inefficient and its correctness can't be proved. Any ideas?
First, let's simplify your algorithm; then let's informally prove its correctness.
Modified algorithm
Observe that once you computed the number of elements below each number in the sorted sequence, you have enough information to determine for each group of equal elements x their places in the sorted array. For example, if c is repeated 7 times and has 21 elements ahead of it, then cs will occupy the range [21..27] (all indexes are zero-based; the range is inclusive of its ends).
Go through the parent_arr in the order of increasing number of swaps. For each element x, find the beginning of its target range rb; also note the end of its target range re. Now go through the elements of random_arr outside of the [rb..re] range. If you see x, swap it into the range. After swapping, increment rb. If you see that random_arr[rb] is equal to x, continue incrementing: these xs are already in the right spot, you wouldn't need to swap them.
Informal proof of correctness
Now let's prove the correctness of the above. Observe that once an element is swapped into its place, it is never moved again. When you reach an element x in the parent_arr, all elements with a lower number of swaps have already been processed. By construction of the algorithm this means that these elements are already in place. Suppose that x has k allowed swaps. When you swap it into its place, you move another element out.
This replaced element cannot be x, because the algorithm skips xs when looking for a destination in the target range [rb..re]. Moreover, the replaced element cannot be one of elements below x in the parent_arr, because all elements below x are in their places already, and therefore cannot move. This means that the swap count of the replaced element is necessarily k+1 or more. Since by the time that we finish processing x we have exhausted at most k swaps on any element (which is easy to prove by induction), any element that we swap out to make room for x will have at least one remaining swap that would allow us to swap it in place when we get to it in the order dictated by the parent_arr.
I found a variant of Heapsort using multiple heaps at http://students.ceid.upatras.gr/~lebenteas/Heapsort-using-Multiple-Heaps-final.pdf. The solution proposes that instead of the traditional Heapsort algorithm, where after each swap, we do another siftdown to bring the highest value in the current heap to the root, we can do some other things. However, what exactly do they mean by 'other things', I cannot understand.
For example, at one point they say "We 'forget', for the time being, the existence of the root." That surely means we are currently stalling the swapping of the highest element with the last element of the heap. However, just a few lines later, they say "So far, two elements have been transferred in the sorted part of the heap.", which runs counter to the proposition that the swapping hasn't been done yet. Also, in the figure on page 97, the node with value 1 is missing, and I don't know why.
Can anybody give me an idea of what exactly the authors are trying to convey, and how worthwhile it can be?
(The line you asked about is in section 2.3, so I will explain the variation of heapsort which is proposed in section 2.3:)
When the author says we "forget" the existence of the root, this does not mean that they are stalling the swapping of the highest element. The swap is done, but they temporarily delay rebuilding the heap. After swapping the highest element into the root position, they compare the roots of the 2 subheaps, and swap one or the other with the next-highest element. Then, after doing 2 swaps (rather than 1), they rebuild the heap.
Then they take this idea a step further in sections 3 and 4, and propose another variant of heapsort, which uses more than one heap.
How do you keep more than one heap in an array? (To make it concrete, let's talk about 2 heaps.) Well, how do you keep a single heap? The root goes at index 0, its children are at 1 and 2, then the children of the left subheap are at 3 and 4, etc., right?
To put 2 heaps together in an array, keep the 2 roots at 0 and 1. The children of the first root go at 2 and 3, then the children of the 2nd root at 4 and 5... with such an arrangement, it is still possible to navigate up and down the tree by doing simple arithmetic operations on indexes.
The standard heapsort repeats 2 steps: swap the root with the last element in the "heap" area, then siftDown to rebuild the heap. This heapsort repeats the following 3 steps: compare the 2 roots to see which one is bigger, swap that one with the last element in the "heap" area, then call siftDown on the appropriate heap.
This requires an extra compare at each step, but the siftDown operations work on slightly shallower heaps, which saves more than a single compare.
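To make the two-heap idea concrete, here is a minimal sketch. The index formulas are an assumption derived from the layout described above (roots at 0 and 1, children of 0 at 2 and 3, children of 1 at 4 and 5), which gives children of node i at 2*i+2 and 2*i+3; the paper's actual implementation may differ, and `two_heap_sort` is a made-up name.

```python
def sift_down(a, i, end):
    """Sift a[i] down within a[:end], with children of i at 2*i+2 and 2*i+3."""
    while True:
        child = 2 * i + 2
        if child >= end:
            return
        if child + 1 < end and a[child + 1] > a[child]:
            child += 1
        if a[child] <= a[i]:
            return
        a[i], a[child] = a[child], a[i]
        i = child


def two_heap_sort(a):
    n = len(a)
    # Build both max-heaps at once, bottom-up.
    for i in range(n - 1, -1, -1):
        sift_down(a, i, n)
    for end in range(n, 1, -1):
        root = 0 if a[0] >= a[1] else 1            # the extra compare
        a[root], a[end - 1] = a[end - 1], a[root]  # largest value joins the sorted part
        sift_down(a, root, end - 1)                # repair only the heap that lost its root
    return a


print(two_heap_sort([5, 3, 8, 1, 9, 2, 7]))
```

Each of the two heaps holds roughly half the elements, so it is about one level shallower than a single heap over the same array.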
I have a binary max heap (largest element at the top), and I need to keep it at a constant size (say 20 elements) by getting rid of the smallest element each time I get to 20 elements. The binary heap is stored in an array, with the children of node i at 2*i+1 and 2*i+2 (i is zero based). At any point, the heap has 'n_elements' elements, between 0 and 20. For example, the array [16,14,10,8,7,9,3,2,4] would be a valid max binary heap, with 16 having children 14 and 10, 14 having children 8 and 7 ...
To find the smallest element, it seems that in general I have to traverse the array from n_elements/2 to n_elements: the smallest element is not necessarily the last one in the array.
So, with only that array, it seems any attempt at finding/removing the smallest element is at least O(n). Is that correct?
For any given valid max-heap, the minimum will be at one of the leaf nodes. The next question is how to find the leaf nodes of the heap in the array. If we look carefully, the last node of the array is the last leaf node. Get the parent of that leaf node by the formula (for a one-based array)
parent node index = (leaf node index)/2
Then do a linear search from index (parent node index + 1) to the last leaf node index and take the minimum value in that range.
FindMinInMaxHeap(Heap heap)
    # indices are one-based: the leaves occupy lastIndex/2 + 1 .. lastIndex
    startIndex = heap->lastIndex/2
    if startIndex == 0
        return heap->Array[heap->lastIndex]
    Minimum = heap->Array[startIndex + 1]
    for count from startIndex+2 to heap->lastIndex
        if(heap->Array[count] < Minimum)
            Minimum := heap->Array[count]
    print Minimum
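For the zero-based array used in the question, the leaves are the indices from n/2 (rounded down) to n-1. A minimal sketch of both the search and the removal follows; the function names are made up. Removal fills the hole with the last element and sifts it up, which is enough because the hole is always a leaf.

```python
def min_of_max_heap(heap):
    """Smallest value of a non-empty zero-based max-heap: scan the leaves only."""
    return min(heap[len(heap) // 2:])


def remove_min(heap):
    """Remove and return the smallest value, keeping the max-heap property."""
    n = len(heap)
    i = min(range(n // 2, n), key=heap.__getitem__)  # index of the smallest leaf
    smallest = heap[i]
    last = heap.pop()
    if i < len(heap):
        heap[i] = last
        # The hole is still a leaf, so the moved value can only be too large for its parent.
        while i > 0 and heap[(i - 1) // 2] < heap[i]:
            p = (i - 1) // 2
            heap[i], heap[p] = heap[p], heap[i]
            i = p
    return smallest


h = [16, 14, 10, 8, 7, 9, 3, 2, 4]
print(min_of_max_heap(h))  # 2
print(remove_min(h), h)    # 2 [16, 14, 10, 8, 7, 9, 3, 4]
```

The scan still touches about half the array, so this is O(n), as the question suspects.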
There isn't any way I can think of to get better than O(n) performance for finding and removing the smallest element from a max heap by using the heap alone. One approach that you can take is:
If you are creating this heap data structure yourself, you can keep a separate pointer to the location of the smallest element in the array. So whenever a new element is added to the heap, check if the new element is smaller. If yes, update the pointer etc. Then finding the smallest element would be O(1).
MBo raises a good point in the comment about how to get the next smallest element after each removal. You'll still need to do the O(n) thing to find the next smallest element after each removal. So removal would still be O(n). But finding the smallest element would be O(1)
If you need faster removal as well, you'll need to also maintain a min-heap of all the elements. In that case, removal would be O(log(n)). Insertion will take 2x time because you have to insert into two heaps and it will also take 2x space.
By the way, if you have only 20 elements at any point of time, this is not really going to matter much (unless it is a homework problem or you are just doing it for fun). It would really matter only if you plan to scale it to thousands of values.
There is the min-max heap data structure: http://en.wikipedia.org/wiki/Min-max_heap . Of course, its code is rather complex, but with two separate heaps we have to use a lot of additional space (for the second heap, and for maintaining a one-to-one mapping) and do the job twice.