Average Case of Quick Sort - algorithm

I'm working on the program and just need help with the following to understand it better.
What is the average-case running time for quicksort, and what may cause this average-case performance? How can we modify the quicksort program to mitigate this problem?
I know that it has an average case of O(n log(n)) and I know it occurs when the pivot is the median element. My question is how I can modify the program to mitigate this problem.

The average case of quicksort is not when the pivot is the median element - that's the best case. Analyzing the average case is a bit trickier. We'll assume that the array is in a random order, so that each element is equally likely to be selected as the pivot. Alternatively, we can just select the pivot randomly so that the original array order doesn't matter; either way leads to the same conclusion.
If the numbers in the array are [1, 2, 3, 4, 5], for example, then each number has a 1/5 probability of being selected as the pivot.
If 1 is selected as the pivot, then the recursive calls are on arrays of size 0 and 4.
If 2 is the pivot, then the recursive calls are on arrays of size 1 and 3.
If 3 is the pivot, then we will make recursive calls on arrays of size 2 and 2.
If 4 is the pivot, then the recursive calls are on arrays of size 3 and 1.
If 5 is selected as the pivot, then the recursive calls are on arrays of size 4 and 0.
So the recurrence is that T(5) is 1/5 of T(4) + T(0), T(3) + T(1), T(2) + T(2), T(1) + T(3) and T(0) + T(4), plus an O(n) term for the cost of partitioning. The general form of this recurrence relation is a sum over every possible pivot, divided by the number of possible pivots:
T(n) = (1/n) * Σ_{i=0}^{n-1} (T(i) + T(n-1-i)) + O(n)
The solution to this recurrence relation happens to be that T(n) is in O(n log n).
The fact that the quicksort algorithm runs in O(n log n) time in the average case is not a problem; in fact, this is asymptotically optimal for any comparison-based sorting algorithm. No comparison-based sorting algorithm can have a better asymptotic running time in the average case.
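As a sketch of the randomized-pivot fix mentioned above (the usual way to make the average case apply regardless of input order), here is a minimal quicksort in Python; the function name and the Lomuto-style partition are my own illustrative choices, not taken from the original question:

    import random

    def quicksort(a, lo=0, hi=None):
        # Illustrative sketch, not from the original post.
        # Sort a[lo..hi] in place, choosing the pivot uniformly at random
        # so that no fixed input ordering can reliably trigger the O(n^2) case.
        if hi is None:
            hi = len(a) - 1
        if lo >= hi:
            return
        p = random.randint(lo, hi)        # random pivot index
        a[p], a[hi] = a[hi], a[p]         # move pivot to the end (Lomuto scheme)
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]         # place pivot in its final position
        quicksort(a, lo, i - 1)
        quicksort(a, i + 1, hi)

With a random pivot the expected running time is O(n log n) for every input; the O(n^2) behaviour can still occur, but only with vanishingly small probability.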

Related

Find and sort in O(n) the log2(n) smallest values and the log2(n) largest values in an array of n values

Let A be an array of n different numbers (positive & negative).
We are interested in the ⌊log_2(n)⌋ smallest values,
and in the ⌊log_2(n)⌋ largest values.
Find an algorithm which computes these 2⌊log_2(n)⌋ values,
and presents them in a sorted array (size = 2⌊log_2(n)⌋).
1. The running time of the algorithm must be Θ(n),
2. Prove that the running time is Θ(n).
I thought maybe heap sort can be useful, but I'm really not sure.
I don't need code, just the idea... I would appreciate any help.
Thanks :) and sorry if I have English mistakes :(
My general approach would be to create 2 heap data structures, one for the max and one for the min, and heapify the array into both of them. Heapifying is an operation of linear time complexity if done right.
Then I would extract ⌊log_2(n)⌋ items from both heaps where each extraction is of complexity O(log n). So, this would give us the following rough estimation of calculations:
2 * n + 2 * (log(n))^2
2 * n for two heapifying operations.
log(n) * log(n) for extracting log(n) elements from one of the heaps.
2 * (log(n))^2 for extracting log(n) elements from both heaps.
In big-O terms, the dominant term is n, since log(n), even raised to the power of two, is asymptotically smaller. So the whole expression above renders to a sweet O(n).
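A rough Python sketch of this approach, using the standard heapq module; the function name extreme_values and the exact return layout are my own assumptions, not part of the original question:

    import heapq

    def extreme_values(a):
        # Illustrative sketch, not from the original post.
        # Return the floor(log2(n)) smallest and floor(log2(n)) largest values
        # of a, together in one sorted array of length 2*floor(log2(n)).
        n = len(a)
        k = n.bit_length() - 1            # floor(log_2(n))
        min_heap = list(a)
        heapq.heapify(min_heap)           # O(n)
        max_heap = [-x for x in a]        # negate values to simulate a max-heap
        heapq.heapify(max_heap)           # O(n)
        smallest = [heapq.heappop(min_heap) for _ in range(k)]   # k pops, O(k log n)
        largest = [-heapq.heappop(max_heap) for _ in range(k)]   # k pops, O(k log n)
        return smallest + largest[::-1]   # ascending: k smallest, then k largest

The total work is O(n) + O((log n)^2), which is Θ(n), matching the estimate above.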

What would be the running time of an algorithm that combines mergeSort and heapsort?

I have been given this problem that asks to compute the worst case running time of an algorithm that's exactly like mergeSort, but one of the two recursive calls is substituted by Heapsort.
So, I know that dividing in mergesort takes constant time and that merging is O(n). Heapsort takes O(n log n).
This is what I came up with: T(n) = 2T(n/2) + O((n/2) log n) + O(n).
I have some doubts about the O((n/2) log n) part. Is it n or n/2? I wrote n/2 because I'm doing heapsort only on half of the array, but I'm not sure that's correct.
The question asks about running time, but should it be asking about time complexity?
Since recursion is mentioned, this is a question about top down merge sort (as opposed to bottom up merge sort).
With the code written as described, since heap sort is not recursive, recursion only occurs on one of each of the split sub-arrays. Heap sort will be called to sort sub-arrays of size n/2, n/4, n/8, n/16, ... , and no merging takes place until two sub-arrays of size 1 are the result of the recursive splitting. In the simple case where array size is a power of 2, then "merge sort" is only used for a single element, the rest of the sub-arrays of size {1, 2, 4, 8, ..., n/8, n/4, n/2} are sorted by heap sort and then merged.
Since heap sort is slower than merge sort, then running time will be longer, but time complexity remains at O(n log(n)) since constant or lower term factors are ignored for time complexity.
Let’s work out what the recurrence relation should be in this case. Here, we’re
splitting the array in half,
recursively sorting one half (T(n / 2)),
heapsorting one half (O(n log n)), and then
merging the two halves together (O(n)).
That gives us this recurrence relation:
T(n) = T(n / 2) + O(n log n).
Why is this O(n log n) and not, say, O((n / 2) log (n / 2))? The reason is that big-O notation munches up constant factors, so O(n log n) expresses the same asymptotic growth rate as O((n / 2) log (n / 2)). And why isn’t there a coefficient of 2 on the T(n / 2)? It’s because we’re only making one recursive call; remember that the other call was replaced by heapsort.
All that’s left to do now is to solve this recurrence. It does indeed work out to O(n log n), and I’ll leave it to you to decide how you want to show this. The iteration method is a great option here.
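For concreteness, here is a small Python sketch of the hybrid being analyzed; I use heapq as a stand-in heapsort (same O(n log n) bound), and the names hybrid_sort, heap_sort and merge are illustrative, not from the original assignment:

    import heapq

    def heap_sort(a):
        # O(n log n): build a heap, then pop every element in sorted order.
        h = list(a)
        heapq.heapify(h)
        return [heapq.heappop(h) for _ in range(len(h))]

    def merge(left, right):
        # Standard O(n) merge of two sorted lists.
        out = []
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i])
                i += 1
            else:
                out.append(right[j])
                j += 1
        return out + left[i:] + right[j:]

    def hybrid_sort(a):
        # Merge sort in which one of the two recursive calls is replaced by
        # heapsort, giving T(n) = T(n/2) + O(n log n) + O(n).
        if len(a) <= 1:
            return list(a)
        mid = len(a) // 2
        left = hybrid_sort(a[:mid])   # the one surviving recursive call
        right = heap_sort(a[mid:])    # the call replaced by heapsort
        return merge(left, right)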

What would be the recurrence relationship for this algorithm?

I have been given this algorithm that computes the median of an array and partitions the other items around it.
It puts all the elements smaller than the median in a set A1, all those equal to it in A2, and all those bigger in A3. If A1 has more than one element it recurses into it, and the same happens for A3. It terminates after copying the concatenation of A1, A2 and A3 back into A.
I know it’s very similar to Quickselect, but I don’t know how to proceed in order to figure out the time complexity in the worst case.
What I know is that in quicksort the time complexity is T(n) = n - 1 + T(a) + T(n-a-1), where n - 1 is for the partition, T(a) is the recursive call on the first part and T(n-a-1) is the recursive call on the last part. In that case the worst scenario happened when the pivot was always the biggest or the smallest item in the array.
But now, since we have the median as the pivot, what could the worst case be?
You can use the "Big 5" median-of-medians algorithm, which gives you an approximate median. If you use this as your pivot in quicksort, the worst-case complexity would be O(n log n) instead of O(n^2), since we are making roughly equal divisions each time instead of the worst case where we divide very unequally, with one bucket having one element and the other having n - 1 elements.
This worst case is very unlikely, on the other hand. There is a decent amount of overhead attached to finding the pivot with the Big 5 median algorithm, so in practice it is outperformed by choosing random pivots. But if you wanted to find the median every time, the worst case would be O(n log n).
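To spell out the recurrence the question asks for (a sketch, assuming the exact median is found in O(n), e.g. with median of medians): the pivot is the true median, so A1 and A3 each contain at most n/2 elements, and

T(n) <= 2T(n/2) + O(n)
T(n) = O(n log n)

which is the worst case claimed above, rather than the T(n) = T(n-1) + O(n) = O(n^2) behaviour you get when the pivot is always the smallest or largest element.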

Quicksort time complexity when it always selects the 2nd smallest element as pivot in a sublist

Time complexity of Quicksort when the pivot is always the 2nd smallest element in a sublist.
Is it still O(N log N)?
If I solve the recurrence equation
F(N) = F(N-2) + N
= F(N-2(2)) + 2N - 2
= F(N-3(2)) + 3N - (2+1)(2)
= F(N-4(2)) + 4N - (3+2+1)(2)
this works out to O(N^2), but I doubt my answer somehow. Can someone help me with the clarification please?
To start with, the quicksort algorithm has an average time complexity of O(N log N), but its worst-case time complexity is actually O(N^2).
The complexity analysis of quicksort depends not just on devising the recurrence relation, but also on the value of the variable K in the F(N-K) term of your recurrence relation. Depending on whether you're calculating the best, average, or worst case complexity, that value is estimated from the probability distribution of having the best, average, or worst element as the pivot, respectively.
If, for instance, you want to compute the best case, then you may assume that your pivot always divides the array in two (i.e. K = N/2). If computing the worst case, you may assume that your pivot is either the largest or the smallest element (i.e. K = 1). For the average case, based on the probability distribution of the indices of the elements, K = N/4 is used. Basically, for the average case, your recurrence relation becomes F(N) = F(N/4) + F(3N/4) + N, which yields O(N log N).
Now, the value you assumed for K, namely 2, is just one away from the worst-case scenario (K = 1). That is why you cannot observe the average-case performance of O(N log N) here, and instead get O(N^2).
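To spell out why, carrying your own expansion of F(N) = F(N-2) + N all the way down (assuming a constant-cost base case) gives roughly

F(N) = N + (N-2) + (N-4) + ... + O(1)
     ≈ (N/2) terms of average size about N/2
     = Θ(N^2)

so the quadratic result is correct: removing only a constant number of elements per partition is asymptotically as bad as the classic worst case.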

Median of medians algorithm: why divide the array into blocks of size 5

In the median-of-medians algorithm, we need to divide the array into chunks of size 5. I am wondering how the inventors of the algorithm came up with the magic number '5' and not, maybe, 7, or 9, or something else?
The number has to be larger than 3 (and an odd number, obviously) for the algorithm. 5 is the smallest odd number larger than 3. So 5 was chosen.
I think it will help if you check the "Proof of O(n) running time" section of the Wikipedia page for the median-of-medians algorithm:
The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians is 20% of the size of the list, while the other recursive call recurses on at most 70% of the list, making the running time
T(n) <= T(n/5) + T(7n/10) + cn.
The O(n) term cn is for the partitioning work (we visit each element a constant number of times, in order to form them into n/5 groups and take each median in O(1) time).
From this, using induction, one can easily show that T(n) <= 10cn, which is in O(n).
That should help you understand why.
You can also use blocks of size 3 or 4, as shown in the paper Select with groups of 3 or 4 by K. Chen and A. Dumitrescu (2015). The idea is to use the "median of medians" algorithm twice and partition only after that. This lowers the quality of the pivot but is faster.
So instead of:
T(n) <= T(n/3) + T(2n/3) + O(n)
T(n) = O(nlogn)
one gets:
T(n) <= T(n/9) + T(7n/9) + O(n)
T(n) = Theta(n)
See this explanation on Brilliant.org. Basically, five is the smallest possible sublist size we can use to maintain linear time. It is also easy to sort an n=5 sized array in what is effectively constant time. Apologies for the LaTeX:
Why 5?
The median-of-medians divides a list into sublists of length five to
get an optimal running time. Remember, finding the median of small
lists by brute force (sorting) takes a small amount of time, so the
length of the sublists must be fairly small. However, adjusting the
sublist size to three, for example, does change the running time for
the worse.
If the algorithm divided the list into sublists of length three, the pivot p would be greater than approximately n/3 of the elements and it would be smaller than approximately n/3 of the elements. This would cause a worst case of 2n/3 recursions, yielding the recurrence
T(n) = T(n/3) + T(2n/3) + O(n),
which by the master theorem is O(n log n), which is slower than linear time.
In fact, for any recurrence of the form T(n) <= T(an) + T(bn) + cn, if a + b < 1, the recurrence will solve to O(n), and if a + b > 1, the recurrence is usually equal to Ω(n log n). [3]
The median-of-medians algorithm could use a sublist size greater than
5—for example, 7—and maintain a linear running time. However, we need
to keep the sublist size as small as we can so that sorting the
sublists can be done in what is effectively constant time.
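For comparison with the groups-of-3 recurrence above, the standard groups-of-5 analysis (as in the Wikipedia proof quoted earlier) gives coefficients that sum to less than 1, which is exactly the condition for a linear solution:

T(n) <= T(n/5) + T(7n/10) + cn
a + b = 1/5 + 7/10 = 9/10 < 1  =>  T(n) = O(n)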
