What's the difference between quicksort and mergesort? - sorting

Am I right in saying that in both algorithms, all you're doing is taking your structure, recursively splitting it into two, and then building up your structure in the right order?
So, what is the difference?
Edit: I have found the following algorithm for implementing the partition in quicksort, but I don't really understand how it works, specifically the swop line that uses (hi + lo) >>> 1 as an argument! Can anyone make sense of this?
private static int partition( int[] items, int lo, int hi )
{
    int destination = lo;
    swop( items, (hi + lo) >>> 1, hi );
    // The pivot is now stored in items[ hi ].
    for (int index = lo; index != hi; index++)
    {
        if (items[ hi ] >= items[ index ])
        {
            // Move current item to start.
            swop( items, destination, index );
            destination++;
        }
        // items[ i ] <= items[ hi ] if lo <= i < destination.
        // items[ i ] > items[ hi ] if destination <= i <= index.
    }
    // items[ i ] <= items[ hi ] if lo <= i < destination.
    // items[ i ] > items[ hi ] if destination <= i < hi.
    swop( items, destination, hi );
    // items[ i ] <= items[ destination ] if lo <= i <= destination.
    // items[ i ] > items[ destination ] if destination < i <= hi.
    return destination;
}

Am I right in saying that in both algorithms, all you're doing is taking your structure, recursively splitting it into two, and then building up your structure in the right order?
Yes. The difference, however, is in when the structure is built in the right order. In Quicksort, the actual sorting step is done during the splitting (move elements to the left or right half, depending on the comparison with the pivot element) and there is no merging step to get back up the tree (as observed from the data point of view; your implementation may of course have stack unwinding), while in Mergesort, the sorting is done on the way up – the splitting step does not move elements at all, but on the way back up, you need to merge two sorted lists.
As for the performance comparisons: it is certainly true that the worst-case behavior of Quicksort is worse than that of Mergesort, but the constant factor for the average case (which is what you observe almost exclusively in practice) is smaller, which makes Quicksort usually the winner for generic, unsorted input. Not that many people need to implement generic sorting routines themselves …
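To make the contrast concrete, here is a minimal top-down mergesort sketch (just an illustration of the point above, not code from the question): the recursive calls only split, and every element move happens in merge, on the way back up.

// Minimal top-down mergesort sketch over the half-open range [lo, hi).
// The recursive calls only split; all element movement happens in merge(),
// i.e. on the way back up the recursion tree.
private static void mergeSort(int[] a, int lo, int hi, int[] aux) {
    if (hi - lo < 2) return;                 // 0 or 1 elements: nothing to do
    int mid = (lo + hi) >>> 1;
    mergeSort(a, lo, mid, aux);
    mergeSort(a, mid, hi, aux);
    merge(a, lo, mid, hi, aux);              // the actual sorting step
}

private static void merge(int[] a, int lo, int mid, int hi, int[] aux) {
    System.arraycopy(a, lo, aux, lo, hi - lo);   // copy the range, then merge back
    int left = lo, right = mid;
    for (int k = lo; k < hi; k++) {
        if (left == mid)                 a[k] = aux[right++];
        else if (right == hi)            a[k] = aux[left++];
        else if (aux[right] < aux[left]) a[k] = aux[right++];
        else                             a[k] = aux[left++];
    }
}

Called as mergeSort(a, 0, a.length, new int[a.length]); note the O(n) auxiliary array, which is exactly the extra storage mentioned further down. In the partition from the question, by contrast, the comparisons against the pivot happen before the recursive calls and no merge is needed afterwards.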

In the worst case quicksort is O(n^2), whereas mergesort is O(n log n).
Quicksort picks a pivot and partitions the array into two parts around it, with the risk that the pivot turns out to be the maximum or minimum of the array. If you keep choosing bad pivots, you end up with n^2 complexity (n^2 comparisons).
Mergesort, as the name suggests, is based on recursively dividing the array into halves of the same size and merging them back together. Wikipedia has pretty nice explanations, for example; the picture with the tree-like breakdown explains it particularly well.

Quicksort has a bad worst case, while Mergesort is always O(n log n) guaranteed, but typical Quicksort implementations are faster than Mergesort in practice.
Also, Mergesort requires additional storage, which is a problem in many cases (e.g. library routines). This is why Quicksort is almost always used by library routines.
Edit: I have found the following algorithm for implementing the partition in quicksort, but I don't really understand how it works, specifically the swop line that uses (hi + lo) >>> 1 as an argument!
This is taking the midpoint of hi and lo; when there is no overflow it is equivalent to (hi + lo) / 2.
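The unsigned shift also quietly avoids an overflow bug: for very large arrays, hi + lo can exceed Integer.MAX_VALUE, and (hi + lo) / 2 then yields a negative result, whereas (hi + lo) >>> 1 still recovers the correct midpoint. A small illustrative snippet (the values are mine, not from the question):

int lo = 1_500_000_000, hi = 2_000_000_000;   // plausible indices in a very large array
int naive = (lo + hi) / 2;     // lo + hi overflows int, giving a negative "midpoint"
int safe  = (lo + hi) >>> 1;   // same bits shifted as unsigned: the correct midpoint
System.out.println(naive + " vs " + safe);    // prints -397483648 vs 1750000000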

The implementation details of the underlying data structures are also a factor: quicksort needs efficient random access, and the mutability of the data structure affects the memory requirements (particularly for mergesort).

Yes, both Quicksort and Mergesort are divide-and-conquer algorithms.
Wikipedia's Quicksort page has a brief comparison:
Quicksort also competes with mergesort, another recursive sort algorithm but with the benefit of worst-case O(n log n) running time. Mergesort is a stable sort, unlike quicksort and heapsort, and can be easily adapted to operate on linked lists and very large lists stored on slow-to-access media such as disk storage or network attached storage. Although quicksort can be written to operate on linked lists, it will often suffer from poor pivot choices without random access. The main disadvantage of mergesort is that, when operating on arrays, it requires O(n) auxiliary space in the best case, whereas the variant of quicksort with in-place partitioning and tail recursion uses only O(log n) space. (Note that when operating on linked lists, mergesort only requires a small, constant amount of auxiliary storage.)
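To see why mergesort needs only constant auxiliary storage on linked lists, note that the merge step can simply relink nodes instead of copying elements. A minimal sketch (the Node class and method are mine, purely for illustration):

class Node {
    int value;
    Node next;
    Node(int value) { this.value = value; }

    // Merge two already-sorted lists by relinking nodes: no per-element copies,
    // and only a couple of extra pointers regardless of list length.
    static Node merge(Node a, Node b) {
        Node dummy = new Node(0);
        Node tail = dummy;
        while (a != null && b != null) {
            if (a.value <= b.value) { tail.next = a; a = a.next; }
            else                    { tail.next = b; b = b.next; }
            tail = tail.next;
        }
        tail.next = (a != null) ? a : b;   // append whatever remains of the other list
        return dummy.next;
    }
}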

http://en.wikipedia.org/wiki/Sorting_algorithm gives a quick overview of common sorting algorithms. Quicksort and Mergesort are the first two ;)
For more information, read through the details given for each of these algorithms.
As Jan said, Mergesort always has a complexity of O(n log n), while Quicksort is up to O(n^2) in the worst case.

Related

Search a Sorted Array for First Occurrence of K

I'm trying to solve question 11.1 in Elements of Programming Interviews (EPI) in Java: Search a Sorted Array for First Occurrence of K.
The problem description from the book:
Write a method that takes a sorted array and a key and returns the index of the first occurrence of that key in the array.
The solution they provide in the book is a modified binary search algorithm that runs in O(log n) time. I wrote my own algorithm, also based on a modified binary search, with a slight difference: it uses recursion. The problem is that I don't know how to determine the time complexity of my algorithm. My best guess is that it will run in O(log n) time, because each time the function is called it halves the range of candidate values. I've tested my algorithm against the 314 EPI test cases provided by the EPI Judge, so I know it works; I just don't know the time complexity. Here is the code:
public static int searchFirstOfKUtility(List<Integer> A, int k, int Lower, int Upper, Integer Index)
{
    while (Lower <= Upper) {
        int M = Lower + (Upper - Lower) / 2;
        if (A.get(M) < k)
            Lower = M + 1;
        else if (A.get(M) == k) {
            Index = M;
            if (Lower != Upper)
                Index = searchFirstOfKUtility(A, k, Lower, M - 1, Index);
            return Index;
        }
        else
            Upper = M - 1;
    }
    return Index;
}
Here is the code that the test cases call to exercise my function:
public static int searchFirstOfK(List<Integer> A, int k) {
    Integer foundKey = -1;
    return searchFirstOfKUtility(A, k, 0, A.size() - 1, foundKey);
}
So, can anyone tell me what the time complexity of my algorithm would be?
Assuming that passing arguments is O(1) instead of O(n), performance is O(log(n)).
The usual theoretical approach for analyzing recursion is to apply the Master Theorem, which says that if the performance of a recursive algorithm follows a relation:
T(n) = a T(n/b) + f(n)
then there are 3 cases. In plain English they correspond to:
Performance is dominated by all the calls at the bottom of the recursion, so is proportional to how many of those there are.
Performance is equal between each level of recursion, and so is proportional to how many levels of recursion there are, times the cost of any layer of recursion.
Performance is dominated by the work done in the very first call, and so is proportional to f(n).
You are in case 2. Each recursive call costs the same, and so performance is dominated by the fact that there are O(log(n)) levels of recursion times the cost of each level. Assuming that passing a fixed number of arguments is O(1), that will indeed be O(log(n)).
Note that this assumption is true for Java because you don't make a complete copy of the list before passing it. But it is important to be aware that it is not true in all languages. For example, I recently did a bunch of work in PL/pgSQL, where arrays are passed by value, meaning that your algorithm would have been O(n log(n)).
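For what it's worth, the same first-occurrence search can be written iteratively, which sidesteps the argument-passing question entirely. A sketch in the same style as the question's code (assuming java.util.List), not the book's solution:

// Iterative first-occurrence binary search: O(log n) time, O(1) extra space,
// and no per-call argument passing to worry about.
public static int searchFirstOfK(List<Integer> A, int k) {
    int lower = 0, upper = A.size() - 1, result = -1;
    while (lower <= upper) {
        int m = lower + (upper - lower) / 2;
        if (A.get(m) < k) {
            lower = m + 1;
        } else if (A.get(m) == k) {
            result = m;        // remember this hit, then keep looking to the left
            upper = m - 1;
        } else {
            upper = m - 1;
        }
    }
    return result;
}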

Regarding sorted data in Fast 3 Way Partition / in place quicksort via Sedgewick

I am interested in the 3 way partition in quickSort at http://algs4.cs.princeton.edu/23quicksort/Quick3way.java.html
because it uses that partition to overcome the Dutch National Flag problem (equal data) in an in-place quicksort.
Since the author is Sedgewick I would assume that there is no error in that code, yet the pivot selected is prone to worst case n^2 time complexity for sorted data.
According to wikipedia:
In the very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays, which is a rather common use-case. The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by Sedgewick).[17]
The code for the quick sort:
// quicksort the subarray a[lo .. hi] using 3-way partitioning
private static void sort(Comparable[] a, int lo, int hi) {
    if (hi <= lo) return;
    int lt = lo, gt = hi;
    Comparable v = a[lo];
    int i = lo;
    while (i <= gt) {
        int cmp = a[i].compareTo(v);
        if      (cmp < 0) exch(a, lt++, i++);
        else if (cmp > 0) exch(a, i, gt--);
        else              i++;
    }
    // a[lo..lt-1] < v = a[lt..gt] < a[gt+1..hi].
    sort(a, lo, lt-1);
    sort(a, gt+1, hi);
    assert isSorted(a, lo, hi);
}
Am I correct to use the mid or ninther for the pivot or have I missed something? I realize it is instructional but why not at least use the mid?
EDIT
Is shuffling considered a rigorous way to prevent the worst case, compared with simply choosing a better pivot? Why not just change the pivot? Shuffling a large array with significant randomness would take some overhead, would it not? Since a shuffle algorithm takes extra time, why not choose the pivot? Shuffling data that is all equivalent is a complete waste, for instance. Would it not be better to run isSorted on the array as a heuristic, with a needed modification for equivalent data?
Not one to argue with Hoare, but would it not be better to check isSorted, with a modification for equivalent data, so it can short-circuit rather than run the data through the sort unnecessarily? It would take the same time as a shuffle.
The sort method you quoted is a private helper method. The real public method sort is like this:
public static void sort(Comparable[] a) {
    StdRandom.shuffle(a);
    sort(a, 0, a.length - 1);
    assert isSorted(a);
}
By calling StdRandom.shuffle, the array is randomly shuffled before doing quicksort. This is the way to protect against the worst case.
It's not only used for this 3-way partition quicksort, it's also used in the normal quicksort.
Quoting from the Algorithms book by Sedgewick, §2.3 QUICKSORT
Q. Randomly shuffling the array seems to take a significant fraction of the total time for the sort. Is doing so really worthwhile?
A. Yes. It protects against the worst case and makes the running time predictable. Hoare proposed this approach when he presented the algorithm in 1960—it is a prototypical (and among the first) randomized algorithm.
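If you did want to choose a better pivot instead of (or in addition to) shuffling, as the question suggests, a median-of-three selection could be bolted onto the quoted sort along these lines. This is only a sketch, not Sedgewick's code; it reuses the exch helper from the quoted class and assumes the pivot is still read from a[lo]:

// Sketch: put the median of a[lo], a[mid], a[hi] into a[lo] before partitioning,
// so the "Comparable v = a[lo]" line in the quoted sort picks it up as the pivot.
private static void medianOfThreeToFront(Comparable[] a, int lo, int hi) {
    int mid = lo + (hi - lo) / 2;
    if (a[mid].compareTo(a[lo]) < 0) exch(a, mid, lo);
    if (a[hi].compareTo(a[lo])  < 0) exch(a, hi,  lo);
    if (a[hi].compareTo(a[mid]) < 0) exch(a, hi,  mid);
    // now a[lo] <= a[mid] <= a[hi]; move the median into the pivot position
    exch(a, lo, mid);
}

Note that a fixed rule like median-of-three is still deterministic, so carefully constructed "median-of-three killer" inputs can still force quadratic behaviour; randomizing once up front, as StdRandom.shuffle does, gives a probabilistic guarantee that holds for every input.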

What is the n in big-O notation?

The question is rather simple, but I just can't find a good enough answer. On the most upvoted SO question regarding the big-O notation, it says that:
For example, sorting algorithms are typically compared based on comparison operations (comparing two nodes to determine their relative ordering).
Now let's consider the simple bubble sort algorithm:
for (int i = arr.length - 1; i > 0; i--) {
    for (int j = 0; j < i; j++) {
        if (arr[j] > arr[j+1]) {
            switchPlaces(...)
        }
    }
}
I know that the worst case is O(n²) and the best case is O(n), but what is n exactly? If we attempt to sort an already sorted array (best case), we would end up doing nothing, so why is it still O(n)? We are still looping through two for-loops, so if anything it should be O(n²). n can't be the number of comparison operations, because we still compare all the elements, right?
When analyzing the Big-O performance of sorting algorithms, n typically represents the number of elements that you're sorting.
So, for example, if you're sorting n items with Bubble Sort, the runtime performance in the worst case will be on the order of O(n^2) operations. This is why Bubble Sort is considered to be an extremely poor sorting algorithm, because it doesn't scale well with increasing numbers of elements to sort. As the number of elements to sort increases linearly, the worst case runtime increases quadratically.
Here is an example graph demonstrating how various algorithms scale in terms of worst-case runtime as the problem size N increases. The dark-blue line represents an algorithm that scales linearly, while the magenta/purple line represents a quadratic algorithm.
Notice that for sufficiently large N, the quadratic algorithm eventually takes longer than the linear algorithm to solve the problem.
Graph taken from http://science.slc.edu/~jmarshall/courses/2002/spring/cs50/BigO/.
See Also
The formal definition of Big-O.
I think two things are getting confused here, n and the function of n that is being bounded by the Big-O analysis.
By convention, for any algorithm complexity analysis, n is the size of the input if nothing different is specified. For any given algorithm, there are several interesting functions of the input size for which one might calculate asymptotic bounds such as Big-O.
The commonest such function for a sorting algorithm is the worst case number of comparisons. If someone says a sorting algorithm is O(n^2), without specifying anything else, I would assume they mean the worst case comparison count is O(n^2), where n is the input size.
Another interesting function is the amount of work space, of space in addition to the array being sorted. Bubble sort's work space is O(1), constant space, because it only uses a few variables regardless of the array size.
Bubble sort can be coded to do only n-1 array element comparisons in the best case, by finishing after any pass that does no exchanges. See this pseudo code implementation, which uses swapped to remember whether there were any exchanges. If the array is already sorted the first pass does no exchanges, so the sort finishes after one pass.
n is usually the size of the input. For array, that would be the number of elements.
To see the different cases, you would need to change the algorithm:
for (int i = arr.length - 1; i > 0; i--) {
    boolean swapped = false;
    for (int j = 0; j < i; j++) {
        if (arr[j] > arr[j+1]) {
            switchPlaces(...);
            swapped = true;
        }
    }
    if (!swapped) {
        break;
    }
}
Your algorithm's best/worst cases are both O(n^2), but with the possibility of returning early, the best-case is now O(n).
n is the array length. You want to find T(n), the algorithm's complexity.
Accessing memory is much more expensive than evaluating an if-condition, so you can define T(n) as the number of memory accesses.
In the given algorithm, both the best case and the worst case use O(n^2) memory accesses, because you check the if-condition O(n^2) times.
To make the complexity better: hold a flag, and if you don't do any swaps in the main loop, it means your array is sorted and you can break.
Now, in the best case the array is sorted and you access every element once, so O(n).
In the worst case it is still O(n^2).

Sort name & time complexity

I "invented" "new" sort algorithm. Well, I understand that I can't invent something good, so I tried to search it on wikipedia, but all sort algorithms seems like not my. So I have three questions:
What is name of this algorithm?
Why it sucks? (best, average and worst time complexity)
Can I make it more better still using this idea?
So, idea of my algorithm: if we have an array, we can count number of sorted elements and if this number is less that half of length we can reverse array to make it more sorted. And after that we can sort first half and second half of array. In best case, we need only O(n) - if array is totally sorted in good or bad direction. I have some problems with evaluation of average and worst time complexity.
Code on C#:
public static void Reverse(int[] array, int begin, int end) {
    int length = end - begin;
    for (int i = 0; i < length / 2; i++)
        Algorithms.Swap(ref array[begin + i], ref array[begin + length - i - 1]);
}

public static bool ReverseIf(int[] array, int begin, int end) {
    int countSorted = 1;
    for (int i = begin + 1; i < end; i++)
        if (array[i - 1] <= array[i])
            countSorted++;
    int length = end - begin;
    if (countSorted <= length / 2)
        Reverse(array, begin, end);
    if (countSorted == 1 || countSorted == (end - begin))
        return true;
    else
        return false;
}

public static void ReverseSort(int[] array, int begin, int end) {
    if (begin == end || begin == end + 1)
        return;
    // if we use an if-statement (not while), then the array {2,3,1} transforms into {2,1,3} and the algorithm stops
    while (!ReverseIf(array, begin, end)) {
        int pivot = begin + (end - begin) / 2;
        ReverseSort(array, begin, pivot + 1);
        ReverseSort(array, pivot, end);
    }
}

public static void ReverseSort(int[] array) {
    ReverseSort(array, 0, array.Length);
}
P.S.: Sorry for my English.
The best case is Theta(n), for, e.g., a sorted array. The worst case is Theta(n^2 log n).
Upper bound
Secondary subproblems have a sorted array preceded or succeeded by an arbitrary element. These are O(n log n). If preceded, we do O(n) work, solve a secondary subproblem on the first half and then on the second half, and then do O(n) more work – O(n log n). If succeeded, do O(n) work, sort the already sorted first half (O(n)), solve a secondary subproblem on the second half, do O(n) work, solve a secondary subproblem on the first half, sort the already sorted second half (O(n)), do O(n) work – O(n log n).
Now, in the general case, we solve two primary subproblems on the two halves and then slowly exchange elements over the pivot using secondary invocations. There are O(n) exchanges necessary, so a straightforward application of the Master Theorem yields a bound of O(n^2 log n).
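Reading that argument as a recurrence (my paraphrase of the bound, not a tight derivation): the two primary subproblems contribute 2T(n/2), and the O(n) exchanges over the pivot each go through secondary invocations costing O(n log n), so

$$T(n) = 2\,T(n/2) + O(n)\cdot O(n \log n) = 2\,T(n/2) + O(n^{2}\log n) \;\Longrightarrow\; T(n) = O(n^{2}\log n)$$

by the "root dominates" case of the Master Theorem.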
Lower bound
For k >= 3, we construct an array A(k) of size 2^k recursively using the above analysis as a guide. The bad cases are the arrays [2^k + 1] + A(k).
Let A(3) = [1, ..., 8]. This sorted base case keeps Reverse from being called.
For k > 3, let A(k) = [2^(k-1) + A(k-1)[1], ..., 2^(k-1) + A(k-1)[2^(k-1)]] + A(k-1). Note that the primary subproblems of [2^k + 1] + A(k) are equivalent to [2^(k-1) + 1] + A(k-1).
After the primary recursive invocations, the array is [2^(k-1) + 1, ..., 2^k, 1, ..., 2^(k-1), 2^k + 1]. There are Omega(2^k) elements that have to move Omega(2^k) positions, and each of the secondary invocations that moves an element so far has O(1) sorted subproblems and thus is Omega(n log n).
Clearly more coffee is required – the primary subproblems don't matter. This makes it not too bad to analyze the average case, which is Theta(n^2 log n) as well.
With constant probability, the first half of the array contains at least half of the least quartile and at least half of the greatest quartile. In this case, regardless of whether Reverse happens, there are Omega(n) elements that have to move Omega(n) positions via secondary invocations.
It seems this algorithm, even if it performs horribly with "random" data (as demonstrated by Per in their answer), is quite efficient for "fixing up" arrays which are "nearly-sorted". Thus if you chose to develop this idea further (I personally wouldn't, but if you wanted to think about it as an exercise), you would do well to focus on this strength.
This reference on Wikipedia, in the Inversion article, alludes to the issue very well. Mahmoud's book is quite insightful, noting that there are various ways to measure "sortedness". For example, if we use the number of inversions to characterize a "nearly-sorted array", then we can use insertion sort to sort it extremely quickly. However, if your arrays are "nearly-sorted" in slightly different ways (e.g. a deck of cards which is cut or loosely shuffled), then insertion sort will not be the best sort to "fix up" the list.
Input: an array of size N that has already been sorted, with roughly N/k inversions.
I might do something like this for an algorithm:
Calculate number of inversions. (O(N lg(lg(N))), or can assume is small and skip step)
If number of inversions is < [threshold], sort array using insertion sort (it will be fast).
Otherwise the array is not close to being sorted; resort to using your favorite comparison (or better) sorting algorithm
There are better ways to do this though; one can "fix up" such an array in at least O(log(N)*(# new elements)) time if you preprocess your array enough or use the right data-structure, like an array with linked-list properties or similar which supports binary search.
You can generalize this idea even further. Whether "fixing up" an array will work depends on the kind of fixing-up that is required. Thus if you update these statistics whenever you add an element to the list or modify it, you can dispatch onto a good "fix-it-up" algorithm.
But unfortunately this would all be a pain to code. You might just be able to get away with a priority queue.
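To make the "insertion sort is fast on nearly-sorted input" point concrete, here is a plain insertion sort sketch (mine, written in Java like most of this page rather than the question's C#). The inner loop does one shift per inversion, so the total cost is O(n + I), where I is the number of inversions: close to O(n) on a nearly-sorted array, O(n^2) in the worst case.

// Plain insertion sort. Each element is shifted left once per inversion it
// participates in, so total work is O(n + number of inversions).
static void insertionSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];   // shift larger elements one slot to the right
            j--;
        }
        a[j + 1] = key;
    }
}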

How to find out the largest element number(array size), let insertion sort beat Merge sort?

from wiki page of insertion sort:
Some divide-and-conquer algorithms such as quicksort and mergesort sort by recursively dividing the list into smaller sublists which are then sorted. A useful optimization in practice for these algorithms is to use insertion sort for sorting small sublists, where insertion sort outperforms these more complex algorithms. The size of list for which insertion sort has the advantage varies by environment and implementation, but is typically between eight and twenty elements.
The quote from the wiki gives one reason: the small lists produced by merge sort are not the worst case for insertion sort.
I want to just ignore this reason.
I knew that if the array size is small, insertion sort, O(n^2), has a chance to beat merge sort, O(n log n).
I think (I'm not sure) this is related to the constants in T(n):
Insertion sort: T(n) = c1*n^2 + c2*n + c3
Merge sort: T(n) = c4*n*log n + c5*n
Now my question is: on the same machine, in the same case (worst case), how do I find the largest number of elements for which insertion sort beats merge sort?
It's simple:
Take a set of sample arrays to sort, and iterate over a value k where k is the cutoff point for when you switch from merge to insertion.
Then go:
for (int k = 1; k < MAX_TEST_VALUE; k++) {
    System.out.println("Results for k = " + k);
    for (int[] array : arraysToTest) {
        long then = System.currentTimeMillis();
        mergeSort(array, k); // pass in k to your merge sort so it uses that
        long now = System.currentTimeMillis();
        System.out.println(now - then);
    }
}
For what it's worth, the java.util.Arrays class has this to say on the matter in its internal documentation:
/**
 * Tuning parameter: list size at or below which insertion sort will be
 * used in preference to mergesort or quicksort.
 */
private static final int INSERTIONSORT_THRESHOLD = 7;

/**
 * Src is the source array that starts at index 0
 * Dest is the (possibly larger) array destination with a possible offset
 * low is the index in dest to start sorting
 * high is the end index in dest to end sorting
 * off is the offset to generate corresponding low, high in src
 */
private static void mergeSort(Object[] src,
                              Object[] dest,
                              int low,
                              int high,
                              int off) {
    int length = high - low;

    // Insertion sort on smallest arrays
    if (length < INSERTIONSORT_THRESHOLD) {
        for (int i = low; i < high; i++)
            for (int j = i; j > low &&
                     ((Comparable) dest[j-1]).compareTo(dest[j]) > 0; j--)
                swap(dest, j, j-1);
        return;
    }
For the sorts of primitive types, it also uses 7, although it doesn't use the named constant.
Insertion sort usually beats merge sort for sorted (or almost sorted) lists of any size.
So the question "How to find out the largest element number(array size), let insertion sort beat Merge sort? " is not really correct.
Edit:
Just to get the downvoters off my back, the question could be rephrased in two ways:
"How do I determine the largest array size for which, on average, insertion sort beats merge sort?" This is usually measured empirically, by generating a sample of small arrays and running implementations of both algorithms on them; glowcoder does that in his answer.
"What is the largest array size for which insertion sort in the worst case performs better than merge sort?" This can be answered approximately by a simple calculation: in the worst case, insertion sort has to do n insertions and about n*(n-1) element movements, while merge sort always does about n*log n cell copies from one array to another. Since the resulting number is relatively small, it doesn't even make sense to consider it.
Typically, that's done by testing with arrays of varying size. When n == 10, insertion sort is almost certainly faster. When n == 100, probably not. Test, test, test, until your results converge.
I suppose it's possible to determine the number strictly through analysis, but to do so you'd have to know exactly what code is generated by the compiler, including instruction timings, and take into account things like the cost of cache misses, etc. All things considered, the easiest way is to derive it empirically.
Okay, so we are talking about the largest array length where insertion sort beats merge sort. Yes, of course for small inputs insertion sort beats merge sort, because of merge sort's auxiliary-space overhead. Talking about exact numbers is difficult because it requires experiments, and it also varies from language to language: for example, merge sort in Python beats insertion sort written in C once n crosses roughly 4000 (for reference, watch https://youtu.be/Kg4bqzAqRBM and forward to 43:00). We can reason about that length asymptotically, but giving exact numbers is difficult.
P.S.: Watch the video and most of your doubts will get cleared for sure! (https://youtu.be/Kg4bqzAqRBM)
Also read about using insertion sort within merge sort when the sub-arrays become sufficiently small (see Introduction to Algorithms by Cormen et al., Chapter 2, Problem 2-1). You can easily find the PDF on Google.
