Sort List Recursively - sorting

At school we learned about recursion, and it's a fairly easy concept to grasp. But it's fairly confusing to understand why it should be used in certain situations. It makes a lot of sense to browse directories recursively, or to calculate factorials and things like that, but my teacher mentioned that recursion is good for sorting lists.
How's that? Can anybody explain to me why (and how, if possible) you would do that?
Thanks in advance.

An excellent example of recursion for sorting a list is the quicksort algorithm, which runs in O(n log n) time on average and O(n²) in the worst case.
Here is a nice example from https://www.geeksforgeeks.org/quick-sort/
/* low --> Starting index, high --> Ending index */
quickSort(arr[], low, high)
{
    if (low < high)
    {
        /* pi is partitioning index, arr[pi] is now
           at right place */
        pi = partition(arr, low, high);

        quickSort(arr, low, pi - 1);  // Before pi
        quickSort(arr, pi + 1, high); // After pi
    }
}
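
The pseudocode above leaves partition undefined. As a runnable companion, here is a short Python sketch of the same idea; the Lomuto partition scheme (last element as pivot) is my own choice, not something the linked page mandates:

def partition(arr, low, high):
    # Lomuto partition: use the last element as the pivot and move everything
    # smaller than it to the left of the index that is returned.
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] < pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1

def quick_sort(arr, low, high):
    if low < high:
        pi = partition(arr, low, high)   # arr[pi] is now in its final place
        quick_sort(arr, low, pi - 1)     # recurse on the part before the pivot
        quick_sort(arr, pi + 1, high)    # recurse on the part after the pivot

data = [10, 7, 8, 9, 1, 5]
quick_sort(data, 0, len(data) - 1)
print(data)  # [1, 5, 7, 8, 9, 10]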

One simple reason is that when sorting a list recursively, one doesn't need to know the size of the list or the number of times the function will "recur" in advance. The recursion simply runs until a base condition is met (for example, the sublist being sorted has only one element left, so it is already in order).

Assuming that "list" is the equivalent of an array as opposed to a linked list, then some sort algorithms like quicksort are recursive. On the other hand, iterative bottom up merge sort is generally a bit faster than recursive top down merge sort, and most actual libraries with a variation of merge sort use a bottom up iterative variation of merge sort.
Using recursion to calculate factorial is slow compared to iteration, since values are pushed onto the stack as opposed to register based calculations assuming an optimizing compiler.
One of the worst case scenarios for recursion is calculating a Fibonacci number using recursion, with a time complexity of O(2^n), versus iteration with time complexity of O(n).
In some cases, a type of recursion called "tail recursion" may be converted into a loop by a compiler.
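
To make the Fibonacci point concrete, here is a small Python sketch comparing the two approaches (my own illustration):

def fib_recursive(n):
    # Naive recursion: each call spawns two more calls, so the call tree has
    # roughly 2^n nodes, giving exponential running time.
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_iterative(n):
    # Iteration keeps only the last two values: O(n) time, O(1) space.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib_recursive(20), fib_iterative(20))  # 6765 6765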

Related

Partial selection sort vs Mergesort to find "k largest in array"

I was wondering if my line of thinking is correct.
I'm preparing for interviews (as a college student) and one of the questions I came across was to find the K largest numbers in an array.
My first thought was to just use a partial selection sort (e.g. scan the array from the first element, keep two variables for the lowest element seen and its index, swap with that index at the end of the array, and continue doing so until we've swapped K elements, then return a copy of the first K elements in that array).
However, this takes O(K*n) time. If I simply sorted the array using an efficient sorting method like Mergesort, it would only take O(n*log(n)) time to sort the entire array and return the K largest numbers.
Is it good enough to discuss these two methods during an interview (comparing log(n) and K of the input and going with the smaller of the two to compute the K largest), or would it be safe to assume that I'm expected to give an O(n) solution for this problem?
There exists an O(n) algorithm for finding the k-th smallest element, and once you have that element, you can simply scan through the list and collect the appropriate elements. It's based on quicksort, but the reasoning behind why it works is rather hairy... There's also a simpler variation that probably will run in O(n). My answer to another question contains a brief discussion of this.
Here's a general discussion of this particular interview question found from googling:
http://www.geeksforgeeks.org/k-largestor-smallest-elements-in-an-array/
As for your question about interviews in general, it probably greatly depends on the interviewer. They usually like to see how you think about things. So, as long as you can come up with some sort of initial solution, your interviewer would likely ask questions from there depending on what they were looking for exactly.
IMHO, the interviewer wouldn't be satisfied with either of the methods if he says the dataset is huge (say a billion elements). In this case, if the K to be returned is also huge (nearing a billion), your partial selection would be nearly O(n^2). I think it entirely depends on the intricacies of the question posed.
EDIT: Aasmund Eldhuset's answer shows you how to achieve the O(n) time complexity.
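
As a concrete baseline (my own addition, not something any answer here prescribes), Python's standard heapq module solves the "K largest" problem in O(n log K) time with a bounded min-heap:

import heapq

def k_largest(nums, k):
    # heapq.nlargest keeps an internal heap of at most k items:
    # O(n log k) time, O(k) extra space.
    return heapq.nlargest(k, nums)

print(k_largest([3, 1, 4, 1, 5, 9, 2, 6], 3))  # [9, 6, 5]

This keeps only K items in memory, which matters when n is huge and K is small.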
If you want to find the K largest elements (so for K = 5 you'll get five results, the five highest numbers), then the best you can get is O(n + k log n): you can build a priority queue in O(n) and then invoke pq.Dequeue() k times. If you are looking only for the K-th biggest number, you can get it with an O(n) quicksort modification known as the k-th order statistic. Pseudocode looks like this (it's a randomized algorithm; average time is approximately O(n), but the worst case is O(n^2)):
QuickSortSelection(numbers, currentLength, k) {
    if (currentLength == 1)
        return numbers[0];

    int pivot = random element from the numbers array;
    // see the quicksort algorithm for details: smaller elements go to the
    // left of the pivot, bigger elements go to the right
    int newPivotIndex = partitionAroundPivot(numbers, pivot);

    if ( k == newPivotIndex )
        return pivot;
    else if ( k < newPivotIndex )
        return QuickSortSelection(numbers[0..newPivotIndex-1], newPivotIndex, k);
    else
        return QuickSortSelection(numbers[newPivotIndex+1..end], currentLength-newPivotIndex-1, k-newPivotIndex-1);
}
As I said, this algorithm is O(n^2) in the worst case because the pivot is chosen at random (though the probability of a running time of ~n^2 is something like 1/2^n). You can convert it into a deterministic algorithm with the same worst-case running time by using, for instance, the median of medians as the pivot, but that is slower in practice (due to the constant factor).
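
Here is roughly the same idea as a runnable Python sketch. It is my own simplified version: it copies sublists instead of partitioning in place (trading memory for clarity) and uses a 0-based k:

import random

def quickselect(nums, k):
    # Returns the k-th smallest element (0-based) in expected O(n) time.
    if len(nums) == 1:
        return nums[0]
    pivot = random.choice(nums)
    left = [x for x in nums if x < pivot]    # elements smaller than the pivot
    equal = [x for x in nums if x == pivot]  # the pivot and its duplicates
    right = [x for x in nums if x > pivot]   # elements larger than the pivot
    if k < len(left):
        return quickselect(left, k)
    elif k < len(left) + len(equal):
        return pivot
    else:
        return quickselect(right, k - len(left) - len(equal))

data = [7, 2, 9, 4, 1, 8]
print(quickselect(data, 2))  # 4, the third smallest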

using insertion sort once in quicksort

According to here
Use insertion sort... for invocations on small arrays (i.e. where the length is less than a threshold k determined experimentally). This can be implemented by simply stopping the recursion when less than k elements are left, leaving the entire array k-sorted: each element will be at most k positions away from its final position. Then, a single insertion sort pass finishes the sort in O(k×n) time.
I'm not sure I'm understanding correctly. One way to do it that involves calling insertion sort multiple times is
quicksort(A, i, k):
    if i + threshold < k:
        p := partition(A, i, k)
        quicksort(A, i, p - 1)
        quicksort(A, p + 1, k)
    else:
        insertionsort(A, i, k)
but this would call insertionsort() once for each small subarray. It sounds like insertion sort could be called only once, but I don't quite understand that, because no matter how many times insertion sort is called, it's still generally slower than quicksort.
Is the idea like this?
sort(A):
    quicksort(A, 0, A.length - 1)
    insertionsort(A, 0, A.length - 1)
So basically call insertion sort once at the very end? How do you know it would only take one pass and not run in O(n^2)?
Yes, your second pseudocode is correct. The usual analysis of insertion sort is that the outer loop inserts each element in turn (O(n) iterations), and the inner loop moves that element to its correct place (O(n) iterations), for a total of O(n^2). However, since your second quicksort leaves an array that can be sorted by permuting elements within blocks of size at most threshold, each element moves at most threshold positions, and the new analysis is O(n*threshold), which is equivalent to running insertion sort on each block separately.
Described by Bentley in the 1999 edition of Programming Pearls, this idea (per Wikipedia) avoids the overhead of starting and stopping the insertion sort loop many times (in essence, we have a natural sentinel value for the insertion loop already in the array). IMHO, it's a cute idea but not clearly still a good one given how different the performance characteristics of commodity hardware are now (specifically, the final insertion sort requires another pass over the data, which has gotten relatively more expensive, and the cost of starting the loop (moving some values in registers) has gotten relatively less expensive).
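
Putting that together, here is a rough Python sketch of the "one insertion sort pass at the end" scheme; the Lomuto partition and the threshold value of 16 are my own choices, not prescribed by the quoted text:

def partition(A, lo, hi):
    # Lomuto partition around the last element.
    pivot = A[hi]
    i = lo - 1
    for j in range(lo, hi):
        if A[j] < pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[hi] = A[hi], A[i + 1]
    return i + 1

def quicksort_partial(A, lo, hi, threshold=16):
    # Stop recursing once a range has `threshold` or fewer elements, leaving
    # every element at most `threshold` positions from its final place.
    if hi - lo + 1 > threshold:
        p = partition(A, lo, hi)
        quicksort_partial(A, lo, p - 1, threshold)
        quicksort_partial(A, p + 1, hi, threshold)

def insertion_sort(A):
    # One cleanup pass: since no element is far from home, the inner loop
    # moves each element at most `threshold` positions, so O(n * threshold).
    for i in range(1, len(A)):
        key, j = A[i], i - 1
        while j >= 0 and A[j] > key:
            A[j + 1] = A[j]
            j -= 1
        A[j + 1] = key

def sort(A):
    quicksort_partial(A, 0, len(A) - 1)
    insertion_sort(A)

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
sort(data)
print(data)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]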

selection algorithm for median

I was trying to understand the selection algorithm for finding the median. I have pasted the pseudocode below.
SELECT(A[1 .. n], k):
    if n <= 25
        use brute force
    else
        m = ceiling(n/5)
        for i = 1 to m
            B[i] = SELECT(A[5i-4 .. 5i], 3)
        mom = SELECT(B[1 .. m], floor(m/2))
        r = PARTITION(A[1 .. n], mom)
        if k < r
            return SELECT(A[1 .. r-1], k)
        else if k > r
            return SELECT(A[r+1 .. n], k - r)
        else
            return mom
I have a very trivial doubt. I was wondering what the author means by "brute force" above for n <= 25. Does he mean comparing each element with every other element and seeing whether it's the k-th largest, or something else?
The code must come from here.
A brute force algorithm can be any simple and stupid algorithm. In your example, you can sort the 25 elements and find the middle one. This is simple and stupid compared to the selection algorithm, since sorting takes O(n lg n) while selection takes only linear time.
A brute force algorithm is often good enough when n is small. Besides, it is easier to implement. Read more about brute force here.
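
In code, the base case could be as simple as the following sketch (the function name is mine; k is 1-based as in the pseudocode above):

def select_brute_force(A, k):
    # "Brute force" for n <= 25: just sort the small group and index into it.
    return sorted(A)[k - 1]

print(select_brute_force([9, 3, 7, 1, 5], 3))  # 5, the median of five elements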
Common wisdom is that Quicksort is slower than insertion sort for small inputs. Therefore many implementations switch to insertion sort at some threshold.
There is a reference to this practice in the Wikipedia page on Quicksort.
Here's an example of commercial mergesort code that switches to insertion sort for small inputs. Here the threshold is 7.
The "brute force" almost certainly refers to the fact that the code here is using the same practice: insertion sort followed by picking the middle element(s) for the median.
However, I've found in practice that the common wisdom is not generally true. When I've run benchmarks, the switch has had either very little positive effect or a negative one. That was for Quicksort. In the partition-based selection algorithm it's more likely to be negative, because one side of the partition is thrown away at each step, so less time is spent on small inputs. This is verified in @Dennis's response to this SO question.

have I invented a new sorting algorithm? or is this the same as quicksort

I made an algorithm for sorting, but then I thought perhaps I had just reinvented quicksort.
However, I heard quicksort is O(N^2) worst case; I think my algorithm should be only O(N log N) worst case.
Is this the same as quicksort?
The algorithm works by swapping values so that all values smaller than the median are moved to the left of the array. It then works recursively on each side.
The algorithm starts with i=0, j = n-1
i and j move towards each other with list[i] and list[j] being swapped if necessary.
Here is some code for the first iteration before the recursion:
_list = [1,-4,2,-5,3,-6]

def in_place(_list, i, j, median):
    while i < j:
        a, b = _list[i], _list[j]
        if (a < median and b >= median):
            i += 1
            j -= 1
        elif (a >= median and b < median):
            _list[i], _list[j] = b, a
            i += 1
            j -= 1
        elif a < median:
            i += 1
        else:
            j -= 1
    print "changed to ", _list

def get_median(_list):
    # approximate median in O(N) with O(1) space
    return -4

median = get_median(_list)
in_place(_list, 0, len(_list)-1, median)
"""
changed to  [-6, -5, 2, -4, 3, 1]
"""
http://en.wikipedia.org/wiki/Quicksort#Selection-based_pivoting
Conversely, once we know a worst-case O(n) selection algorithm is available, we can use it to find the ideal pivot (the median) at every step of quicksort, producing a variant with worst-case O(n log n) running time. In practical implementations, however, this variant is considerably slower on average.
Another variant is to choose the Median of Medians as the pivot element instead of the median itself for partitioning the elements. While maintaining the asymptotically optimal run time complexity of O(n log n) (by preventing worst case partitions), it is also considerably faster than the variant that chooses the median as pivot.
For starters, I assume there is other code not shown, as I'm pretty sure that the code you've shown on its own would not work.
I'm sorry to steal your fire, but I'm afraid the code you do show seems to be Quicksort, and not only that, but it seems to possibly suffer from some bugs.
Consider the case of sorting a list of identical elements. Your in_place method, which seems to be what is traditionally called partition in Quicksort, would not move any elements (correctly so), but at the end i and j would indicate that the list consists of a single partition containing the whole list, in which case you would recurse on the whole list forever. My guess is that, as mentioned, there is code missing: you don't return anything from the method, nor do you seem to actually fully sort anywhere, so I am left guessing how this would be used.
I'm afraid using the real median for Quicksort is not only a possibly fairly slow strategy in the average case, it also doesn't avoid the O(n^2) worst case; again, a list of identical elements would provide such a worst case. However, I think a three-way partition Quicksort with such a median selection algorithm would guarantee O(n*log n) time. Nonetheless, this is a known option for pivot choice and not a new algorithm.
In short, this appears to be an incomplete and possibly buggy Quicksort, and without three-way partitioning, using the median would not guarantee you O(n*log n). However, I do feel that it is a good thing and worth congratulating that you thought of the idea of using the median yourself, even if it has been thought of by others before.
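
For reference, the three-way partition mentioned above could look like this in Python (a sketch of the Dutch national flag scheme, not the asker's code):

def partition3(A, lo, hi, pivot):
    # Afterwards A[lo:lt] < pivot, A[lt:gt+1] == pivot, A[gt+1:hi+1] > pivot.
    # Elements equal to the pivot are never recursed on again, which is what
    # rescues an all-identical input from O(n^2).
    lt, i, gt = lo, lo, hi
    while i <= gt:
        if A[i] < pivot:
            A[lt], A[i] = A[i], A[lt]
            lt += 1
            i += 1
        elif A[i] > pivot:
            A[i], A[gt] = A[gt], A[i]
            gt -= 1
        else:
            i += 1
    return lt, gt

A = [2, 2, 2, 1, 3, 2]
print(partition3(A, 0, len(A) - 1, 2), A)  # (1, 4) [1, 2, 2, 2, 2, 3]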

Quicksort vs heapsort

Both quicksort and heapsort do in-place sorting. Which is better? What are the applications and cases in which either is preferred?
Heapsort is O(N log N) guaranteed, which is much better than the worst case of Quicksort. Heapsort also doesn't need more memory for another array to put the ordered data into, as Mergesort does. So why do commercial applications stick with Quicksort? What does Quicksort have that is so special over other implementations?
I've tested the algorithms myself and I've seen that Quicksort has something special indeed. It runs fast, much faster than Heap and Merge algorithms.
The secret of Quicksort is that it almost never does unnecessary element swaps, and swaps are time consuming.
With Heapsort, even if all of your data is already ordered, you are going to swap 100% of the elements to order the array.
With Mergesort, it's even worse. You are going to copy 100% of the elements into another array and write them back into the original one, even if the data is already ordered.
With Quicksort you don't swap what is already ordered. If your data is completely ordered, you swap almost nothing! Although there is a lot of fussing about the worst case, a small improvement in the choice of pivot, anything other than picking the first or last element of the array, can avoid it. Taking as the pivot the intermediate value among the first, middle and last elements is sufficient to avoid the worst case.
What is superior in Quicksort is not the worst case, but the best case! In the best case you do the same number of comparisons, OK, but you swap almost nothing. In the average case you swap part of the elements, but not all of them, as in Heapsort and Mergesort. That is what gives Quicksort the best time: less swapping, more speed.
The implementation below in C# on my computer, running in release mode, beats Array.Sort by 3 seconds with a middle pivot and by 2 seconds with an improved pivot (yes, there is an overhead to getting a good pivot).
static void Main(string[] args)
{
    int[] arrToSort = new int[100000000];
    var r = new Random();
    for (int i = 0; i < arrToSort.Length; i++) arrToSort[i] = r.Next(1, arrToSort.Length);

    Console.WriteLine("Press q to quick sort, s to Array.Sort");
    while (true)
    {
        var k = Console.ReadKey(true);
        if (k.KeyChar == 'q')
        {
            // quick sort
            Console.WriteLine("Beg quick sort at " + DateTime.Now.ToString("HH:mm:ss.ffffff"));
            QuickSort(arrToSort, 0, arrToSort.Length - 1);
            Console.WriteLine("End quick sort at " + DateTime.Now.ToString("HH:mm:ss.ffffff"));
            for (int i = 0; i < arrToSort.Length; i++) arrToSort[i] = r.Next(1, arrToSort.Length);
        }
        else if (k.KeyChar == 's')
        {
            Console.WriteLine("Beg Array.Sort at " + DateTime.Now.ToString("HH:mm:ss.ffffff"));
            Array.Sort(arrToSort);
            Console.WriteLine("End Array.Sort at " + DateTime.Now.ToString("HH:mm:ss.ffffff"));
            for (int i = 0; i < arrToSort.Length; i++) arrToSort[i] = r.Next(1, arrToSort.Length);
        }
    }
}

static public void QuickSort(int[] arr, int left, int right)
{
    int begin = left
        , end = right
        , pivot
        // get middle element pivot
        //= arr[(left + right) / 2]
        ;

    // improved pivot
    int middle = (left + right) / 2;
    int LM = arr[left].CompareTo(arr[middle])
        , MR = arr[middle].CompareTo(arr[right])
        , LR = arr[left].CompareTo(arr[right])
        ;
    if (-1 * LM == LR)
        pivot = arr[left];
    else if (MR == -1 * LR)
        pivot = arr[right];
    else
        pivot = arr[middle];

    do
    {
        while (arr[left] < pivot) left++;
        while (arr[right] > pivot) right--;

        if (left <= right)
        {
            int temp = arr[right];
            arr[right] = arr[left];
            arr[left] = temp;
            left++;
            right--;
        }
    } while (left <= right);

    if (left < end) QuickSort(arr, left, end);
    if (begin < right) QuickSort(arr, begin, right);
}
This paper has some analysis.
Also, from Wikipedia:
The most direct competitor of quicksort is heapsort. Heapsort is typically somewhat slower than quicksort, but the worst-case running time is always Θ(n log n). Quicksort is usually faster, though there remains the chance of worst case performance except in the introsort variant, which switches to heapsort when a bad case is detected. If it is known in advance that heapsort is going to be necessary, using it directly will be faster than waiting for introsort to switch to it.
For most situations, having quick vs. a little quicker is irrelevant... you simply never want it to occasionally get waayyy slow. Although you can tweak QuickSort to avoid the way slow situations, you lose the elegance of the basic QuickSort. So, for most things, I actually prefer HeapSort... you can implement it in its full simple elegance, and never get a slow sort.
For situations where you DO want max speed in most cases, QuickSort may be preferred over HeapSort, but neither may be the right answer. For speed-critical situations, it is worth examining closely the details of the situation. For example, in some of my speed-critical code, it is very common that the data is already sorted or near-sorted (it is indexing multiple related fields that often either move up and down together OR move up and down opposite each other, so once you sort by one, the others are either sorted or reverse-sorted or close... either of which can kill QuickSort). For that case, I implemented neither... instead, I implemented Dijkstra's SmoothSort... a HeapSort variant that is O(N) when already sorted or near-sorted... it is not so elegant, not too easy to understand, but fast... read http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796a.PDF if you want something a bit more challenging to code.
Quicksort-Heapsort in-place hybrids are really interesting, too, since most of them only need n log n comparisons in the worst case (they are optimal with respect to the first term of the asymptotics, so they avoid the worst-case scenarios of Quicksort), use only O(log n) extra space, and preserve at least "half" of Quicksort's good behaviour on already-ordered data. An extremely interesting algorithm is presented by Diekert and Weiss in http://arxiv.org/pdf/1209.4214v1.pdf:
Select a pivot p as the median of a random sample of sqrt(n) elements (this can be done in at most 24 sqrt(n) comparisons through the algorithm of Tarjan&co, or 5 sqrt(n) comparisons through the much more convoluted spider-factory algorithm of Schonhage);
Partition your array in two parts as in the first step of Quicksort;
Heapify the smallest part and use O(log n) extra bits to encode a heap in which every left child has a value greater than its sibling;
Recursively extract the root of the heap, sift down the hole left by the root until it reaches a leaf of the heap, then fill the hole with an appropriate element taken from the other part of the array;
Recur over the remaining non-ordered part of the array (if p is chosen as the exact median, there is no recursion at all).
Comparing quicksort and heapsort: both are in-place sorts, but there is a difference in worst-case running time. The worst-case running time of quicksort is O(n^2), while for heapsort it is still O(n*log(n)), so for an average amount of data quicksort will be more useful. Since it is a randomized algorithm, the probability of getting the answer quickly depends on the position of the pivot element you choose. Each pivot choice is either a:
Good call: the sizes of L and G are each less than 3s/4
Bad call: one of L and G has size greater than 3s/4
For a small amount of data we can go for insertion sort, and for a very large amount of data go for heapsort.
Heapsort has the benefit of a worst-case running time of O(n*log(n)), so in cases where quicksort is likely to perform poorly (mostly sorted data sets, generally) heapsort is much preferred.
To me there is a very fundamental difference between heapsort and quicksort: the latter uses recursion. In recursive algorithms the stack grows with the depth of recursion. This does not matter if n is small, but right now I am sorting two matrices with n = 10^9! The program takes almost 10 GB of RAM, and any extra memory will make my computer start swapping to virtual disk memory. My disk is a RAM disk, but swapping to it still makes a huge difference in speed. So in a statpack coded in C++ that includes adjustable-dimension matrices, with size unknown in advance to the programmer, and nonparametric statistical kinds of sorting, I prefer heapsort to avoid delays for users with very big data matrices.
Well, if you go down to the architecture level: the cache works on contiguous blocks of memory, so whatever is already in the cache gets processed quickly. Quicksort has no issue with this, since it scans and partitions contiguous ranges of the array of any length; but in heapsort (using an array) it may happen that a parent is not present in the part of the array currently in cache, and then it has to be brought into cache memory, which is time consuming.
That's why quicksort is best! 😀
Heapsort builds a heap and then repeatedly extracts the maximum item. Its worst case is O(n log n).
But if you look at the worst case of quicksort, which is O(n^2), you will realize that quicksort would be a not-so-good choice for large data.
So this makes sorting an interesting thing; I believe the reason so many sorting algorithms exist today is that each of them is 'best' in its own best place. For instance, bubble sort can outperform quicksort if the data is already sorted. Or, if we know something about the items to be sorted, then we can probably do better.
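
To make that description concrete, here is a minimal in-place heapsort sketch in Python (an illustration only, not taken from any answer above):

def sift_down(A, start, end):
    # Push A[start] down until the subtree rooted there is a max-heap again.
    root = start
    while 2 * root + 1 <= end:
        child = 2 * root + 1
        # Pick the larger of the two children.
        if child + 1 <= end and A[child] < A[child + 1]:
            child += 1
        if A[root] < A[child]:
            A[root], A[child] = A[child], A[root]
            root = child
        else:
            return

def heapsort(A):
    n = len(A)
    # Build a max-heap in O(n).
    for start in range(n // 2 - 1, -1, -1):
        sift_down(A, start, n - 1)
    # Repeatedly move the current maximum to the end and shrink the heap.
    for end in range(n - 1, 0, -1):
        A[0], A[end] = A[end], A[0]
        sift_down(A, 0, end - 1)

data = [4, 10, 3, 5, 1]
heapsort(data)
print(data)  # [1, 3, 4, 5, 10]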
This may not answer your question directly, but I thought I'd add my two cents.
Heap Sort is a safe bet when dealing with very large inputs. Asymptotic analysis reveals that Heapsort's order of growth in the worst case is O(n log n), which is better than Quicksort's O(n^2) worst case. However, Heapsort is somewhat slower in practice on most machines than a well-implemented quick sort. Heapsort is also not a stable sorting algorithm.
The reason heapsort is slower in practice than quicksort is the better locality of reference (https://en.wikipedia.org/wiki/Locality_of_reference) in quicksort, where data elements are accessed from relatively close storage locations. Systems that exhibit strong locality of reference are great candidates for performance optimization. Heap sort, however, makes larger jumps through the array. This makes quicksort more favorable for smaller inputs.
In simple terms: HeapSort has a guaranteed worst-case running time of O(n log n), as opposed to QuickSort's average running time of O(n log n). QuickSort is usually used in practice because it is typically faster, but HeapSort is used for external sorting when you need to sort huge files that don't fit into the memory of your computer.
To answer the original question and address some of the other comments here:
I just compared implementations of selection, quick, merge, and heap sort to see how they'd stack up against each other. The answer is that they all have their downsides.
TL;DR:
Quick is the best general purpose sort (reasonably fast, stable, and mostly in-place)
Personally I prefer heap sort though unless I need a stable sort.
Selection - N^2 - It's really only good for fewer than 20 elements or so; beyond that it's outperformed. Unless your data is already sorted, or very, very nearly so. N^2 gets really slow really fast.
Quick, in my experience, is not actually that quick all the time. Bonuses for using quick sort as a general sort, though, are that it's reasonably fast and it's stable. It's also an in-place algorithm, but as it's generally implemented recursively, it will take up additional stack space. It also falls somewhere between O(n log n) and O(n^2). Timing on some sorts seems to confirm this, especially when the values fall within a tight range. It's way faster than selection sort on 10,000,000 items, but slower than merge or heap.
Merge sort is guaranteed O(n log n) since its sort is not data dependent. It just does what it does, regardless of what values you've given it. It's also stable, but very large sorts can blow out your stack if you're not careful about implementation. There are some complex in-place merge sort implementations, but generally you need another array in each level to merge your values into. If those arrays live on the stack you can run into issues.
Heap sort is max O(n log n), but in many cases is quicker, depending on how far you have to move your values up the log n deep heap. The heap can easily be implemented in-place in the original array, so it needs no additional memory, and it's iterative, so no worries about stack overflow while recursing. The huge downside to heap sort is that it is not a stable sort, which means it's right out if you need that.
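
If you want to reproduce this kind of comparison yourself, a rough harness like the one below (my own sketch, timing a heapq-based heapsort against Python's built-in sort) is enough to see the broad differences; absolute numbers will of course vary by machine and by implementation language.

import heapq
import random
import time

def heapsort(nums):
    # Build a heap in O(n), then pop n times: O(n log n) overall.
    heap = list(nums)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

data = [random.randint(0, 10**9) for _ in range(1_000_000)]

t0 = time.perf_counter()
heapsort(data)
t1 = time.perf_counter()
sorted(data)
t2 = time.perf_counter()

print(f"heapsort: {t1 - t0:.2f}s, built-in sort: {t2 - t1:.2f}s")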
