What is the purpose of having heap sort?

When we build a heap, the elements in the array get arranged in a particular order (ascending or descending) depending on whether it is max heap or min heap. Then what is the use of heap sort when building a heap itself arranges the elements in sorted order with less time complexity?
void build_heap(int Arr[], int N)
{
    // Sift every internal node down, starting from the last one.
    for (int i = N/2 - 1; i >= 0; i--)
    {
        down_heapify(Arr, i, N);
    }
}
void heap_sort(int Arr[], int N)
{
    build_heap(Arr, N);
    for (int i = N-1; i >= 1; i--)
    {
        swap(Arr[i], Arr[0]);    // move the current max to its final position
        down_heapify(Arr, 0, i); // restore the heap on the remaining i elements
    }
}
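For reference, down_heapify isn't shown in the question; a minimal sketch consistent with the calls above (assuming the third argument is the current heap size) could be:

#include <utility> // std::swap

// Sift the element at index i down within Arr[0..N-1]
// until the max-heap property holds again.
void down_heapify(int Arr[], int i, int N)
{
    while (true) {
        int largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < N && Arr[l] > Arr[largest]) largest = l;
        if (r < N && Arr[r] > Arr[largest]) largest = r;
        if (largest == i) return;     // heap property restored
        std::swap(Arr[i], Arr[largest]);
        i = largest;                  // continue sifting down
    }
}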

Heap sort summed up
Heap sort is an algorithm which can be summed up in two steps:
Convert the input array into a heap;
Convert the heap into a sorted array.
The heap itself is not a sorted array.
Let's look at an example:
[9, 7, 3, 5, 4, 2, 0, 6, 8, 1] # unsorted array
convert into heap
[9, 8, 3, 7, 4, 2, 0, 6, 5, 1] # array representing a max-heap
sort
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # sorted array
If you look closely, you'll notice the second array in my example, which represents a heap, isn't quite sorted. The order of the elements looks less random than in the original unsorted array, and they look almost sorted in decreasing order, but they aren't completely sorted: 3 comes before 7, and 0 comes before 6 in the array.
So what is a heap?
What is a heap?
Note that in the previous section, I made a distinction between "a heap" and "an array representing a heap". Let's talk about what a heap is first, and then about what an array representing a heap is.
A max-heap is a binary tree with values on the nodes, which satisfies the two following properties:
the value on a child node is always lower than the value on its parent node;
the tree is almost-complete, in the sense that all branches of the tree have almost the same length, with a difference of at most 1 between the longest and the shortest branches; in addition, the longest branches must be on the left and the shortest branches must be on the right.
In the example I gave, the heap constructed is this one:
             9
          /     \
        8         3
      /   \      / \
    7       4   2   0
   / \     /
  6   5   1
You can check that this binary tree satisfies the two properties of a heap: each child has a lower value than its parent, and all branches have almost the same length, with 4 or 3 values per branch, the longest branches being on the left and the shortest on the right.
What is an array representing a heap?
Storing binary trees into arrays is usually pretty inconvenient, and binary trees are most generally implemented using pointers, kinda like a linked list. However, the heap is a very special binary tree, and its "almost-complete" property is super useful to implement it as an array.
All we have to do is read the values row per row, left to right. In the heap above, we have four rows:
9
8 3
7 4 2 0
6 5 1
We simply store these values in that order in an array:
[9, 8, 3, 7, 4, 2, 0, 6, 5, 1]
Notice that this is exactly the array after the first step of heap sort at the beginning of my post.
In this array representation, we can use positions to determine which node is a child of which node: the node at position i has two children, which are at positions 2*i+1 and 2*i+2.
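As a quick sanity check, here is a minimal sketch (the helper names are mine) that uses this position formula to verify the max-heap property of the example array:

#include <cstdio>

// Node i has children at 2*i+1 and 2*i+2, hence a parent at (i-1)/2.
int parent(int i) { return (i - 1) / 2; }

// Check the max-heap property for every non-root node.
bool is_max_heap(const int arr[], int n)
{
    for (int i = 1; i < n; i++)
        if (arr[parent(i)] < arr[i])
            return false;
    return true;
}

int main()
{
    int heap[] = {9, 8, 3, 7, 4, 2, 0, 6, 5, 1}; // the example heap above
    printf("%s\n", is_max_heap(heap, 10) ? "max-heap" : "not a heap"); // prints "max-heap"
}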
This array is not a sorted array. But it represents a heap and we can easily use it to produce a sorted array, in n log(n) operations, by extracting the maximum element repeatedly.
If heap-sort was implemented with an external binary tree, then we could use either a max-heap or a min-heap, and sort the array by repeatedly selecting the maximum element or the minimum element. However, if you try to implement heap-sort in-place, storing the heap as an array inside the array which is being sorted, you'll notice that it's much more convenient to use a max-heap than a min-heap, in order to sort the elements in increasing order by repeatedly selecting the max element and moving it to the end of the array.

"Then what is the use heap sort when building a heap itself arranges the elements in sorted order"
It seems that you confuse the purposes of Heap Sort algorithm and heap data structure. Let us clarify this way:
heap is a data structure that allows us to find minimum or maximum of the list repeatedly in a changing collection. We can use sink() based approach for creating a heap from scratch in O(n). Then each operation takes O(logn) complexity. However, heap doesn't provide you with a sorted array, it just gives maximum or minimum depending on your implementation.
On the other hand, Heap Sort algorithm provides you with a sorted array/collection using heap data structure. First it builds a heap in O(n) time complexity. Then it starts from the bottom and returns max/min one-by-one to the actual array. In each iteration, you have to heapify your heap to get next max/min properly, which in total gives O(n*logn) time complexity.
void heap_sort(int Arr[], int N)
{
    build_heap(Arr, N);             // O(n) time complexity
    for (int i = N-1; i >= 1; i--)  // n iterations
    {
        swap(Arr[i], Arr[0]);
        down_heapify(Arr, 0, i);    // O(log n) time complexity
    }
    // in total O(n) + O(n*log n) = O(n*log n)
}
In conclusion, building a heap itself doesn't provide you with a sorted array.

Related

Time complexity with insertion sort for 2^N array?

Consider an array of integers which has a size of 2^N, where the element at index X (0 ≤ X < 2^N) is X xor 3 (that is, the two least significant bits of X are flipped). What is the running time of insertion sort on this array?
Examine the structure of what the lists looks like:
For n = 2:
{3, 2, 1, 0}
For n = 3 :
{3, 2, 1, 0, 7, 6, 5, 4}
For insertion sort, you're maintaining the invariant that the list up to your current index is sorted, so your task at each step is to place the current element into its correct place among the sorted elements before it. In the worst case, you have to traverse all previous indices before you can insert the current value (think of the case where the list is in reverse sorted order). But it's clear from the structure above that for a list with the property that each value equals its index xor 3, the furthest back in the list that you would have to go from any given index is 3. So you've reduced the O(n) work at each insertion step to a constant factor, though you still have to do O(n) work to examine each element of the list. So, for this particular case, the running time of insertion sort is linear in the size of the input, whereas in the worst case it is quadratic.
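If you want to convince yourself, here is a small sketch (my own, with a hypothetical N = 16) that runs insertion sort on such an array and records the largest shift any element needs:

#include <cstdio>
#include <vector>

int main()
{
    const int N = 1 << 4; // 2^4 = 16 elements, chosen arbitrarily
    std::vector<int> a(N);
    for (int x = 0; x < N; x++) a[x] = x ^ 3; // the array from the question

    int max_shift = 0;
    for (int i = 1; i < N; i++) {             // plain insertion sort
        int key = a[i], j = i - 1, shift = 0;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; shift++; }
        a[j + 1] = key;
        if (shift > max_shift) max_shift = shift;
    }
    printf("max shift = %d\n", max_shift);    // prints 3, so total work is O(N)
}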

Find the number of elements greater than x in a given range

Given an array with n elements, how to find the number of elements greater than or equal to a given value (x) in the given range index i to index j in O(log n) complexity?
The queries are of the form (i, j, x) which means find number of elements greater than x from the ith till jth element in the array
The array is not sorted. i, j & x are different for different queries. Elements of the array are static.
Edit: i, j, x all can be different for different queries!
If we know all queries beforehand, we can solve this problem by making use of a Fenwick tree.
First, we need to sort all elements of the array and all query values together, based on their values.
So, assuming that we have the array [5, 4, 2, 1, 3] and the queries (0, 1, 6) and (2, 5, 2), we will have the following result after sorting: [1, 2, 2, 3, 4, 5, 6]
Now, we process each value in descending order:
If we encounter a value which is an element of the array, we update its index in the Fenwick tree, which takes O(log n).
If we encounter a query, we check how many elements inside the query's range have already been added to the tree, which takes O(log n).
For above example, the process will be:
1st value is the query with x = 6; as the Fenwick tree is still empty -> the result is 0.
2nd value is element 5 -> add index 0 into the Fenwick tree.
3rd value is element 4 -> add index 1 into the tree.
4th value is element 3 -> add index 4 into the tree.
5th value is element 2 -> add index 2 into the tree.
6th value is the query for range (2, 5); we query the tree and get the answer 2.
7th value is element 1 -> add index 3 into the tree.
Finish.
So, in total, the time complexity of our solution is O((m + n) log(m + n)), where m and n are the number of queries and the number of elements in the input array, respectively.
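Here is a minimal sketch of this offline approach (the Fenwick struct and event names are mine; I use the query (2, 4, 2) instead of (2, 5, 2) so that the range stays inside the 5-element example array):

#include <algorithm>
#include <cstdio>
#include <vector>

// Standard Fenwick tree over array indices: add() marks an index as
// inserted, query(l, r) counts inserted indices inside [l, r].
struct Fenwick {
    std::vector<int> t;
    Fenwick(int n) : t(n + 1, 0) {}
    void add(int i) { for (i++; i < (int)t.size(); i += i & -i) t[i]++; }
    int prefix(int i) const { int s = 0; for (i++; i > 0; i -= i & -i) s += t[i]; return s; }
    int query(int l, int r) const { return prefix(r) - (l ? prefix(l - 1) : 0); }
};

int main()
{
    std::vector<int> a = {5, 4, 2, 1, 3};
    struct Q { int i, j, x; } queries[] = {{0, 1, 6}, {2, 4, 2}}; // (i, j, x)

    // Sort all values descending; on ties, elements come before queries
    // so that "greater or equal" elements are already in the tree.
    struct Ev { int val, idx, qid; }; // qid == -1 means "array element"
    std::vector<Ev> ev;
    for (int i = 0; i < (int)a.size(); i++) ev.push_back({a[i], i, -1});
    for (int q = 0; q < 2; q++) ev.push_back({queries[q].x, 0, q});
    std::sort(ev.begin(), ev.end(), [](const Ev &l, const Ev &r) {
        return l.val != r.val ? l.val > r.val : l.qid < r.qid;
    });

    Fenwick bit((int)a.size());
    for (const Ev &e : ev) {
        if (e.qid < 0) bit.add(e.idx);
        else printf("query %d -> %d\n", e.qid,
                    bit.query(queries[e.qid].i, queries[e.qid].j)); // prints 0, then 2
    }
}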
That is possible only if you have the array sorted. In that case, binary search the smallest value passing your condition and compute the count simply by sub-dividing your index range at the found position into two intervals; then just compute the length of the interval passing your condition.
If the array is not sorted and you need to preserve its order, you can use index sort. When put together:
definitions
Let <i0,i1> be your used index range and x be your value.
index sort array part <i0,i1>
so create an array of size m = i1-i0+1 and index sort it. This task is O(m.log(m)), where m <= n.
binary search x position in index array
This task is O(log(m)); you want the smallest index j = <0,m) for which a[ix[j]] >= x holds
compute count
Simply count how many indexes are after j up to m
count = m-j;
As you can see, if the array is sorted you get O(log(m)) complexity, but if it is not, then you need to sort first, which is O(m.log(m)) and worse than the naive approach O(m); so the naive approach should be used only if the array is changing often and can't be kept sorted directly.
[Edit1] What I mean by Index sort
By index sort I mean this: let's have an array a
a[] = { 4,6,2,9,6,3,5,1 }
An index sort creates a new array ix of indexes in sorted order, so for example an ascending index sort means:
a[ix[i]]<=a[ix[i+1]]
In our example, an index bubble sort goes like this:
// init indexes
a[ix[i]]= { 4,6,2,9,6,3,5,1 }
ix[] = { 0,1,2,3,4,5,6,7 }
// bubble sort 1st iteration
a[ix[i]]= { 4,2,6,6,3,5,1,9 }
ix[] = { 0,2,1,4,5,6,7,3 }
// bubble sort 2nd iteration
a[ix[i]]= { 2,4,6,3,5,1,6,9 }
ix[] = { 2,0,1,5,6,7,4,3 }
// bubble sort 3rd iteration
a[ix[i]]= { 2,4,3,5,1,6,6,9 }
ix[] = { 2,0,5,6,7,1,4,3 }
// bubble sort 4th iteration
a[ix[i]]= { 2,3,4,1,5,6,6,9 }
ix[] = { 2,5,0,7,6,1,4,3 }
// bubble sort 5th iteration
a[ix[i]]= { 2,3,1,4,5,6,6,9 }
ix[] = { 2,5,7,0,6,1,4,3 }
// bubble sort 6th iteration
a[ix[i]]= { 2,1,3,4,5,6,6,9 }
ix[] = { 2,7,5,0,6,1,4,3 }
// bubble sort 7th iteration
a[ix[i]]= { 1,2,3,4,5,6,6,9 }
ix[] = { 7,2,5,0,6,1,4,3 }
So the result of ascending index sort is this:
// ix: 0 1 2 3 4 5 6 7
a[] = { 4,6,2,9,6,3,5,1 }
ix[] = { 7,2,5,0,6,1,4,3 }
The original array stays unchanged; only the index array is changed. The items a[ix[i]], where i = 0,1,2,3,..., are sorted ascending.
So now, if x = 4, on this interval you need to find (by binary search) the i whose value a[ix[i]] is the smallest value still >= x, so:
// ix: 0 1 2 3 4 5 6 7
a[] = { 4,6,2,9,6,3,5,1 }
ix[] = { 7,2,5,0,6,1,4,3 }
a[ix[i]]= { 1,2,3,4,5,6,6,9 }
// *
i = 3; m=8; count = m-i = 8-3 = 5;
So the answer is 5 items are >=4
[Edit2] Just to be sure you know what binary search means for this
i=0; // init value marked by `*`
j=4; // max power of 2 < m , i+j is marked by `^`
// ix: 0 1 2 3 4 5 6 7 i j i+j a[ix[i+j]]
a[ix[i]]= { 1,2,3,4,5,6,6,9 } 0 4 4 5>=4 j>>=1;
* ^
a[ix[i]]= { 1,2,3,4,5,6,6,9 } 0 2 2 3< 4 -> i+=j; j>>=1;
* ^
a[ix[i]]= { 1,2,3,4,5,6,6,9 } 2 1 3 4>=4 j>>=1;
* ^
a[ix[i]]= { 1,2,3,4,5,6,6,9 } 2 0 -> stop
*
a[ix[i]] < x -> a[ix[i+1]] >= x -> i = 2+1 = 3 in O(log(m))
so you need an index i and a binary bit mask j (powers of 2). At first, set i to zero and j to the biggest power of 2 still smaller than m. For example something like this:
i=0; for (j=1;j<m;j<<=1); j>>=1;
Now in each iteration, test whether a[ix[i+j]] satisfies the search condition or not. If yes, then update i+=j, else leave i as is. After that, go to the next bit, so j>>=1, and if j==0 stop, else iterate again. At the end, the found value is a[ix[i]] and its index is i, obtained in log2(m) iterations, which is also the number of bits needed to represent m-1.
In the example above I use the condition a[ix[i]] < 4, so the found value was the biggest number still < 4 in the array. As we needed to also include 4, I just increment the index once at the end (I could have used <= 4 instead, but was too lazy to rewrite the whole thing again).
The count of such items is then just the number of elements in the array (or interval) minus i.
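Put together as compilable code, the whole count step could look like this (the bounds check i+j < m is my addition; otherwise it follows the walkthrough above):

#include <cstdio>

int main()
{
    // a[] and its ascending index sort ix[], exactly as in the example.
    int a[]  = {4, 6, 2, 9, 6, 3, 5, 1};
    int ix[] = {7, 2, 5, 0, 6, 1, 4, 3};
    int m = 8, x = 4;

    int i = 0, j;
    for (j = 1; j < m; j <<= 1);           // biggest power of 2 < m
    for (j >>= 1; j; j >>= 1)              // bit-mask binary search
        if (i + j < m && a[ix[i + j]] < x) i += j;
    if (a[ix[i]] < x) i++;                 // step onto the first value >= x

    printf("count = %d\n", m - i);         // prints 5: five items are >= 4
}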
The previous answer describes an offline solution using a Fenwick tree, but this problem can also be solved online (and even with updates to the array) with slightly worse complexity. I'll describe such a solution using a segment tree and an AVL tree (any self-balancing BST would do the trick).
First let's see how to solve this problem using a segment tree. We'll do this by keeping, in every node, the actual elements of the range that the node covers. So for the array A = [9, 4, 5, 6, 1, 3, 2, 8] we'll have:
[9 4 5 6 1 3 2 8] Node 1
[9 4 5 6] [1 3 2 8] Node 2-3
[9 4] [5 6] [1 3] [2 8] Node 4-7
[9] [4] [5] [6] [1] [3] [2] [8] Node 8-15
Since the height of our segment tree is log(n) and at every level we keep n elements in total, the total amount of memory used is O(n log(n)).
Next step is to sort these arrays which looks like this:
[1 2 3 4 5 6 8 9] Node 1
[4 5 6 9] [1 2 3 8] Node 2-3
[4 9] [5 6] [1 3] [2 8] Node 4-7
[9] [4] [5] [6] [1] [3] [2] [8] Node 8-15
NOTE: You first need to build the tree and then sort it to keep the order of elements in original array.
Now we can start our range queries, and that works basically the same way as in a regular segment tree, except that when we find a completely overlapping interval, we additionally check for the number of elements greater than X. This can be done with binary search in log(n) time, by finding the index of the first element greater than X and subtracting it from the number of elements in that interval.
Let's say our query was (0, 5, 4), so we do a segment search on the interval [0, 5] and end up with the arrays [4, 5, 6, 9] and [1, 3]. We then do a binary search on these arrays to see the number of elements greater than 4 and get 3 (from the first array) and 0 (from the second), which brings the total to 3 - our query answer.
An interval search in a segment tree can touch up to log(n) nodes, which means log(n) arrays, and since we're doing a binary search on each of them, that brings the complexity to log^2(n) per query.
Now if we want to update the array: since we are using segment trees, it's impossible to add/remove elements efficiently, but we can replace them. By using AVL trees (or other binary trees that allow replacement and lookup in log(n) time) as nodes that store the arrays, we can support this operation with the same overall complexity (replacement in log(n) time per node).
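A minimal sketch of the static (no updates) version described above, with plain sorted arrays in the nodes (the struct and its names are mine):

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <vector>

// tree[node] holds the sorted copy of the elements in the node's range.
struct MergeSortTree {
    int n;
    std::vector<std::vector<int>> tree;

    MergeSortTree(const std::vector<int> &a) : n((int)a.size()), tree(4 * a.size()) {
        build(1, 0, n - 1, a);
    }
    void build(int node, int lo, int hi, const std::vector<int> &a) {
        if (lo == hi) { tree[node] = {a[lo]}; return; }
        int mid = (lo + hi) / 2;
        build(2 * node, lo, mid, a);
        build(2 * node + 1, mid + 1, hi, a);
        std::merge(tree[2 * node].begin(), tree[2 * node].end(),      // sort the node
                   tree[2 * node + 1].begin(), tree[2 * node + 1].end(),
                   std::back_inserter(tree[node]));                   // by merging children
    }
    // Count elements > x in a[l..r]: O(log^2 n) per query.
    int count(int node, int lo, int hi, int l, int r, int x) const {
        if (r < lo || hi < l) return 0;
        if (l <= lo && hi <= r)              // fully covered: binary search
            return (int)(tree[node].end() -
                         std::upper_bound(tree[node].begin(), tree[node].end(), x));
        int mid = (lo + hi) / 2;
        return count(2 * node, lo, mid, l, r, x) +
               count(2 * node + 1, mid + 1, hi, l, r, x);
    }
};

int main()
{
    MergeSortTree t({9, 4, 5, 6, 1, 3, 2, 8});  // the example array
    printf("%d\n", t.count(1, 0, 7, 0, 5, 4));  // query (0, 5, 4) -> prints 3
}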
This is a special variant of orthogonal range counting queries in 2D.
Each element el[i] is transformed into the point (i, el[i]) on the plane,
and the query (i, j, x) is transformed into counting all points in the rectangle [i, j] x [x, +infty].
You can use 2D Range Trees (for example: http://www.cs.uu.nl/docs/vakken/ga/slides5b.pdf) for such type of the queries.
The simple idea is to have a tree that stores the points in its leaves (each leaf contains a single point), ordered by the X-axis. Each internal node of the tree contains an additional tree that stores all points from its subtree, ordered by the Y-axis.
The space used is O(n log n).
A simple version can do the counting in O(log^2 n) time, but using fractional cascading this can be reduced to O(log n).
There is a better solution by Chazelle from 1988 (https://www.cs.princeton.edu/~chazelle/pubs/FunctionalDataStructures.pdf) with O(n) preprocessing and O(log n) query time.
You can find some solutions with better query time, but they are way more complicated.
I would try to give you a simple approach.
You must have studied merge sort.
In merge sort we keep dividing the array into subarrays and then build it back up, but normally we don't keep the sorted subarrays; in this approach we store them as the nodes of a binary tree.
This takes O(n log n) space and O(n log n) time to build.
Now, for each query you just have to find the right subarrays; this is done in O(log n) on average and O(log^2 n) in the worst case.
Such a tree is commonly known as a merge sort tree (not to be confused with a Fenwick tree).
If you want simple code, I can provide you with that.

Given an unsorted integer array A, return an array B, where B[i] is the number of A[j] such that A[i] > A[j] while i < j

Example: A = [5, 3, 8, 7, 2, 1, 4].
Then we'd get B = [4, 2, 4, 3, 1, 0, 0].
Is there a way to do this in O(n log n)?
a) Work from right to left, inserting the values into an annotated balanced tree as you go. If you keep annotations in the tree that tell you the number of items below each node, then as you insert each value you can work out how many already-inserted values (i.e., values to its right) are less than it. This is a balanced tree, so each insert costs you at most log n, for a total cost of n log n.
b) Divide and conquer, splitting the array in half at each point, and returning the values in sorted order together with the B array calculated for each half. When you merge, you need to do a sort, and for each element of the left half of the array you need to add on the number of values in the right half that are less than it. You can do this as part of the merge (see the sketch below), and it still takes linear time, so the cost is the usual n log n of a merge sort.
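Here is a sketch of approach b) (my own code; idx[] carries the original positions through the recursion so the counts land in the right slots of B):

#include <cstdio>
#include <vector>

// Merge sort that also counts, for each element, how many smaller
// elements lie to its right in the original array.
void sortCount(std::vector<int> &vals, std::vector<int> &idx, std::vector<int> &B)
{
    int n = (int)vals.size();
    if (n <= 1) return;
    int mid = n / 2;
    std::vector<int> lv(vals.begin(), vals.begin() + mid), li(idx.begin(), idx.begin() + mid);
    std::vector<int> rv(vals.begin() + mid, vals.end()),  ri(idx.begin() + mid, idx.end());
    sortCount(lv, li, B);
    sortCount(rv, ri, B);
    // Merge: when a left element is placed, every right-half element
    // already emitted was smaller and originally to its right.
    int i = 0, j = 0, k = 0;
    while (i < (int)lv.size() || j < (int)rv.size()) {
        if (j == (int)rv.size() || (i < (int)lv.size() && lv[i] <= rv[j])) {
            B[li[i]] += j;                       // j smaller elements on the right
            vals[k] = lv[i]; idx[k] = li[i]; i++; k++;
        } else {
            vals[k] = rv[j]; idx[k] = ri[j]; j++; k++;
        }
    }
}

int main()
{
    std::vector<int> a = {5, 3, 8, 7, 2, 1, 4}, idx = {0, 1, 2, 3, 4, 5, 6};
    std::vector<int> B(a.size(), 0);
    sortCount(a, idx, B);
    for (int b : B) printf("%d ", b); // prints 4 2 4 3 1 0 0
    printf("\n");
}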
You can use a segment tree structure (each node stores the sum of its children), with one leaf per possible value. All nodes of the segment tree are 0 initially. Go from right to left: for each i, query the sum over the values smaller than A[i]; this is B[i]. Then, once i is processed, update the leaf for the value A[i] to 1.

find median with minimum time in an array

I have an array, let's say a = { 1,4,5,6,2,23,4,2 };
Now I have to find the median of the array positions 2 to 6 (an odd number of terms), so what I have done is: I copied a[1] to a[5] into arr[0] to arr[4], then I sorted it and took arr[2] as the median.
But here, every time, I put the values from one array into another so that the values of my initial array remain the same. Secondly, I sorted, so this procedure is taking pretty much time.
So I want to know if there is any way I can do this differently to reduce my computation time.
Any websites, material to understand what to do, and how to do it?
Use std::nth_element from <algorithm>, which is O(N) on average:
nth_element(a, a + size / 2, a + size);
median = a[size/2];
It is possible to find the median without sorting in O(n) time; algorithms that do this are called selection algorithms.
If you are doing multiple queries on the same array then you could use a Segment Tree. They are generally used to do range minimum/maximum and range sum queries but you can change it to do range median.
A segment tree for a set with n intervals uses O(n log n) storage and can be built in O(n log n) time. A range query can be done in O(log n).
Example of median in range segment tree:
You build the segment tree from the bottom up (update from the top down):
[5]
[3] [7]
[1,2] [4] [6] [8]
1 2 3 4 5 6 7 8
Indices covered by node:
[4]
[2] [6]
[0,1] [3] [5] [7]
0 1 2 3 4 5 6 7
A query for median for range indices of 4-6 would go down this path of values:
[4]
[5]
0 1 2 3 4 5 6 7
Doing a search for the median, you know the number of total elements in the query (3) and the median in that range would be the 2nd element (index 5). So you are essentially doing a search for the first node which contains that index which is node with values [1,2] (indices 0,1).
Doing a search of the median of the range 3-6 is a bit more complicated because you have to search for two indices (4,5) which happen to lie in the same node.
[4]
[6]
[5]
0 1 2 3 4 5 6 7
Segment tree
Range minimum query on Segment Tree
To find the median of an array of fewer than 9 elements, I think the most efficient way is to use a sorting algorithm like insertion sort. The asymptotic complexity is bad, but for such a small array the constant factors hidden in better algorithms like quicksort dominate, so insertion sort is very efficient. Do your own benchmark, but I can tell you that you will get better results with insertion sort than with shell sort or quicksort; see the sketch below.
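For example, a tiny sketch along those lines (the helper name is mine, and it assumes the subrange has at most 16 elements), which also leaves the original array untouched:

#include <cstdio>

// Median of a[lo..hi] (inclusive) via insertion sort on a small copy.
double medianSmall(const int a[], int lo, int hi)
{
    int buf[16], m = 0;
    for (int i = lo; i <= hi; i++) {     // insert a[i] into the sorted buffer
        int v = a[i], j = m++;
        while (j > 0 && buf[j - 1] > v) { buf[j] = buf[j - 1]; j--; }
        buf[j] = v;
    }
    return (m % 2) ? buf[m / 2] : (buf[m / 2 - 1] + buf[m / 2]) / 2.0;
}

int main()
{
    int a[] = {1, 4, 5, 6, 2, 23, 4, 2};  // array from the question
    printf("%g\n", medianSmall(a, 1, 5)); // median of {4,5,6,2,23} -> prints 5
}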
I think the best way is to use the median of medians algorithm for finding the k-th largest element of an array. You can find the overall idea of the algorithm here: Median of Medians in Java, on Wikipedia: http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm, or just by browsing the internet. Some general improvements can be made during implementation (avoid sorting when choosing the medians of the small subarrays). However, note that for an array of fewer than 50 elements it's more efficient to use insertion sort than the median of medians algorithm.
All existing answers have some downsides in certain situations:
Sorting the entire subrange is not very efficient, because one does not need to sort the entire array to get the median, and one needs an additional array if multiple subrange medians are to be found.
Using std::nth_element is more efficient, but it still mutates the subrange, so one still needs an additional array.
Using a segment tree gets you an efficient solution, but you need to either implement the structure yourself or use a third-party library.
For this reason, I am posting my approach, which uses std::map and is inspired by the selection sort algorithm:
First, collect the frequencies of the elements of the first subrange into an object of type std::map<int, int>.
With this object, we can efficiently find the median of a subrange whose length is subrangeLength:
#include <map>
#include <optional>

double median(const std::map<int, int> &histogram, int subrangeLength)
{
    const int middle{subrangeLength / 2};
    int count{0};
    /* We use the fact that keys in std::map are sorted, so by simply iterating
       and adding up the frequencies, we can find the median. */
    if (subrangeLength % 2 == 1) {
        for (const auto &freq : histogram) {
            count += freq.second;
            /* In the case where subrangeLength is odd, "middle" is the lower integer
               bound of subrangeLength / 2, so as soon as we cross it, we have found
               the median. */
            if (count > middle) {
                return freq.first;
            }
        }
    } else {
        std::optional<double> medLeft;
        for (const auto &freq : histogram) {
            count += freq.second;
            /* In the case where subrangeLength is even, we need to pay attention to
               the case when the elements at positions middle and middle + 1 differ. */
            if (count == middle) {
                medLeft = freq.first;
            } else if (count > middle) {
                if (!medLeft) {
                    medLeft = freq.first;
                }
                return (*medLeft + freq.first) / 2.0;
            }
        }
    }
    return -1;
}
Now when we want to get the median of the next subrange, we simply update the histogram by decreasing the frequency of the element that is to be removed (erasing it when it drops to zero) and adding/increasing it for the new element (with std::map, each of these updates takes logarithmic time). Then we compute the median again and continue like this until we have handled all subranges; a possible driver is sketched below.
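A possible driver for this (my own sketch; it assumes the median() function above is in scope) that computes the medians of all subranges of length w:

#include <cstdio>
#include <map>
#include <vector>

double median(const std::map<int, int> &histogram, int subrangeLength); // defined above

std::vector<double> slidingMedians(const std::vector<int> &a, int w)
{
    std::map<int, int> histogram;
    std::vector<double> medians;
    for (int i = 0; i < (int)a.size(); i++) {
        histogram[a[i]]++;                       // element entering the window
        if (i >= w) {
            auto it = histogram.find(a[i - w]);  // element leaving the window
            if (--it->second == 0) histogram.erase(it);
        }
        if (i >= w - 1) medians.push_back(median(histogram, w));
    }
    return medians;
}

int main()
{
    // Array from the question; the second window is a[1..5] = {4,5,6,2,23}.
    std::vector<double> m = slidingMedians({1, 4, 5, 6, 2, 23, 4, 2}, 5);
    printf("%g\n", m[1]); // prints 5
}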

Sorting Algorithm For Array with Integers of at most n spots away

Given an array with integers, with each integer being at most n positions away from its final position, what would be the best sorting algorithm?
I've been thinking about this for a while, and I can't seem to come up with a good strategy for dealing with this problem. Can someone please guide me?
I'd split the list (of size N) into 2n sublists (using zero-based indexing):
list 0: elements 0, 2n, 4n, ...
list 1: elements 1, 2n+1, 4n+1, ...
...
list 2n-1: elements 2n-1, 4n-1, ...
Each of these lists is obviously sorted: two elements whose indices differ by 2n can never be out of order, since each of them is at most n positions away from its final position.
Now merge these lists (repeatedly merging 2 lists at a time, or using a min heap with one element of each of these lists).
That's all. Time complexity is O(N log(n)).
This is easy in Python:
>>> import heapq
>>> a = [1, 0, 5, 4, 3, 2, 6, 8, 9, 7, 12, 13, 10, 11]
>>> n = max(abs(i - x) for i, x in enumerate(a))
>>> n
3
>>> print(*heapq.merge(*(a[i::2 * n] for i in range(2 * n))))
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Heap Sort is very fast for an initially random array/collection of elements. In pseudocode, this sort would be implemented as follows:
# heapify
for i = n/2:1, sink(a,i,n)
→ invariant: a[1,n] in heap order

# sortdown
for i = 1:n,
    swap a[1,n-i+1]
    sink(a,1,n-i)
    → invariant: a[n-i+1,n] in final position
end

# sink from i in a[1..n]
function sink(a,i,n):
    # {lc,rc,mc} = {left,right,max} child index
    lc = 2*i
    if lc > n, return        # no children
    rc = lc + 1
    mc = (rc > n) ? lc : (a[lc] > a[rc]) ? lc : rc
    if a[i] >= a[mc], return # heap ordered
    swap a[i,mc]
    sink(a,mc,n)
For different cases like "Nearly Sorted" or "Few Unique", other algorithms can work differently and be more efficient. For a complete list of the algorithms, with animations of the various cases, see this brilliant site.
I hope this helps.
PS: For nearly sorted sets (as commented above), insertion sort is your winner.
I'd recommend using a comb sort, just starting it with a gap size equal to the maximum distance away (or about there); a sketch follows below. It's expected O(n log n) (or in your case O(n log d), where d is the maximum displacement), easy to understand, easy to implement, and will work even when the elements are displaced more than you expect. If you need guaranteed execution time you can use something like heap sort, but in the past I've found that the overhead in space or computation time usually isn't worth it, and I end up implementing nearly anything else.
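A minimal comb sort sketch along those lines (function and parameter names are mine; the gap simply starts at the maximum displacement instead of the array length):

#include <algorithm>
#include <cstdio>
#include <vector>

void combSort(std::vector<int> &a, int startGap)
{
    int gap = startGap;
    bool swapped = true;
    while (gap > 1 || swapped) {
        swapped = false;
        for (size_t i = 0; i + gap < a.size(); i++)
            if (a[i] > a[i + gap]) { std::swap(a[i], a[i + gap]); swapped = true; }
        if (gap > 1) gap = std::max(1, (gap * 10) / 13); // shrink by ~1.3 per pass
    }
}

int main()
{
    std::vector<int> a = {1, 0, 5, 4, 3, 2, 6, 8, 9, 7}; // each at most 3 away
    combSort(a, 3);
    for (int v : a) printf("%d ", v); // prints 0 1 2 3 4 5 6 7 8 9
    printf("\n");
}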
Since each integer is at most n positions away from its final position:
1) for the smallest integer (aka the 0th integer in the final sorted array), its current position must be in A[0...n], because it can be at most n positions away from position 0;
2) for the second smallest integer (aka the 1st integer in the final sorted array, zero-based), its current position must be in A[0...n+1];
3) for the ith smallest integer, its current position must be in A[i-n...i+n].
We can therefore use an (n+1)-size min-heap, containing a rolling window over the array, to get the array sorted; a sketch follows after the link. You can find more details here:
http://www.geeksforgeeks.org/nearly-sorted-algorithm/
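A minimal sketch of that rolling-window idea (my own code; the linked page describes a similar approach), using std::priority_queue as the min-heap:

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Keep a min-heap of n+1 elements: its minimum is always the next
// element of the sorted output, so the total cost is O(N log n).
void sortNearlySorted(std::vector<int> &a, int n)
{
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    int out = 0;
    for (int v : a) {
        heap.push(v);
        if ((int)heap.size() > n + 1) { a[out++] = heap.top(); heap.pop(); }
    }
    while (!heap.empty()) { a[out++] = heap.top(); heap.pop(); }
}

int main()
{
    std::vector<int> a = {1, 0, 5, 4, 3, 2, 6, 8, 9, 7}; // each at most 3 away
    sortNearlySorted(a, 3);
    for (int v : a) printf("%d ", v); // prints 0 1 2 3 4 5 6 7 8 9
    printf("\n");
}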
