Average Runtime of Quickselect - algorithm

Wikipedia states that the average runtime of quickselect algorithm (Link) is O(n). However, I could not clearly understand how this is so. Could anyone explain to me (via recurrence relation + master method usage) as to how the average runtime is O(n)?

Because
we already know which partition our desired element lies in.
We do not need to sort (by doing partition on) all the elements, but only do operation on the partition we need.
As in quick sort, we have to do partition in halves *, and then in halves of a half, but this time, we only need to do the next round partition in one single partition (half) of the two where the element is expected to lie in.
It is like (not very accurate)
n + 1/2 n + 1/4 n + 1/8 n + ..... < 2 n
So it is O(n).
Half is used for convenience, the actual partition is not exact 50%.

To do an average case analysis of quickselect one has to consider how likely it is that two elements are compared during the algorithm for every pair of elements and assuming a random pivoting. From this we can derive the average number of comparisons. Unfortunately the analysis I will show requires some longer calculations but it is a clean average case analysis as opposed to the current answers.
Let's assume the field we want to select the k-th smallest element from is a random permutation of [1,...,n]. The pivot elements we choose during the course of the algorithm can also be seen as a given random permutation. During the algorithm we then always pick the next feasible pivot from this permutation therefore they are chosen uniform at random as every element has the same probability of occurring as the next feasible element in the random permutation.
There is one simple, yet very important, observation: We only compare two elements i and j (with i<j) if and only if one of them is chosen as first pivot element from the range [min(k,i), max(k,j)]. If another element from this range is chosen first then they will never be compared because we continue searching in a sub-field where at least one of the elements i,j is not contained in.
Because of the above observation and the fact that the pivots are chosen uniform at random the probability of a comparison between i and j is:
2/(max(k,j) - min(k,i) + 1)
(Two events out of max(k,j) - min(k,i) + 1 possibilities.)
We split the analysis in three parts:
max(k,j) = k, therefore i < j <= k
min(k,i) = k, therefore k <= i < j
min(k,i) = i and max(k,j) = j, therefore i < k < j
In the third case the less-equal signs are omitted because we already consider those cases in the first two cases.
Now let's get our hands a little dirty on calculations. We just sum up all the probabilities as this gives the expected number of comparisons.
Case 1
Case 2
Similar to case 1 so this remains as an exercise. ;)
Case 3
We use H_r for the r-th harmonic number which grows approximately like ln(r).
Conclusion
All three cases need a linear number of expected comparisons. This shows that quickselect indeed has an expected runtime in O(n). Note that - as already mentioned - the worst case is in O(n^2).
Note: The idea of this proof is not mine. I think that's roughly the standard average case analysis of quickselect.
If there are any errors please let me know.

In quickselect, as specified, we apply recursion on only one half of the partition.
Average Case Analysis:
First Step: T(n) = cn + T(n/2)
where, cn = time to perform partition, where c is any constant(doesn't matter). T(n/2) = applying recursion on one half of the partition.Since it's an average case we assume that the partition was the median.
As we keep on doing recursion, we get the following set of equation:
T(n/2) = cn/2 + T(n/4) T(n/4) = cn/2 + T(n/8) .. . T(2) = c.2 + T(1) T(1) = c.1 + ...
Summing the equations and cross-cancelling like values produces a linear result.
c(n + n/2 + n/4 + ... + 2 + 1) = c(2n) //sum of a GP
Hence, it's O(n)

I also felt very conflicted at first when I read that the average time complexity of quickselect is O(n) while we break the list in half each time (like binary search or quicksort). It turns out that breaking the search space in half each time doesn't guarantee an O(log n) or O(n log n) runtime. What makes quicksort O(n log n) and quickselect is O(n) is that we always need to explore all branches of the recursive tree for quicksort and only a single branch for quickselect. Let's compare the time complexity recurrence relations of quicksort and quickselect to prove my point.
Quicksort:
T(n) = n + 2T(n/2)
= n + 2(n/2 + 2T(n/4))
= n + 2(n/2) + 4T(n/4)
= n + 2(n/2) + 4(n/4) + ... + n(n/n)
= 2^0(n/2^0) + 2^1(n/2^1) + ... + 2^log2(n)(n/2^log2(n))
= n (log2(n) + 1) (since we are adding n to itself log2 + 1 times)
Quickselect:
T(n) = n + T(n/2)
= n + n/2 + T(n/4)
= n + n/2 + n/4 + ... n/n
= n(1 + 1/2 + 1/4 + ... + 1/2^log2(n))
= n (1/(1-(1/2))) = 2n (by geometric series)
I hope this convinces you why the average runtime of quickselect is O(n).

Related

Recurrence relations, analyze algorithm

Alright I have some difficulty completely understanding recurrence relations.
So for example if I'm analyzing quick sort in the worst case using Θ-notation, giving the array an input of unsorted positive numbers;
When base case n <= 3 I sort the small array using insertionsort.
Time: O(1) or O(n^2)?
I use linear search to set the pivot as the median of all elements.
Time: Θ(n)
Partitioning left and right of pivot and performing recursion.
Time: 2 * T(n/2) I think
would the recurrence be:
T(n) = O(1) + Θ(n) + 2 * T(n/2) then?
Something tells me the base case would instead take O(n^2) time because if the input is small enough that would be the worst case.
Would that then give me the recurrence:
T(n) = O(n^2) + Θ(n) + 2 * T(n/2)?
When base case n <= 3 I sort the small array using insertionsort.
always O(1)
I use linear search to set the pivot as the median of all elements.
You may want to clarify this more, to find the median as pivot is not a simple task of doing linear search. The few ways are i) using
Quick Select Algorithm (average O(N)) or ii) sorting the subpartition
O(NlogN) iii) median-of-median algorithm O(N).
Partitioning left and right of pivot and performing recursion.
2F(N/2) + N
Putting it all together (Assuming that the pivot is always the median & that you take O(N) to find the pivot each time):
Best_Case = Worst_Case (since the pivot is always median)
F(3) = F(2) = F(1) = 1
F(n) = 2F(N/2) + 2N
= 2(2F(N/2^2) + 2(N/2)) + 2N
= 2(2)F(N/2^2) + 2N + 2N
= 2(3)F(N/2^3) + 3(2)N
= 2(LogN)F(N/N) + (2N)LogN
= O(NlogN)

recurrence relation on a Merge Sort algorithm

The question is :
UNBALANCED MERGE SORT
is a sorting algorithm, which is a modified version of
the standard MERGE SORT
algorithm. The only difference is that instead of dividing
the input into 2 equal parts in each stage, we divide it into two unequal parts – the first
2/5 of the input, and the other 3/5.
a. Write the recurrence relation for the worst case time complexity of the
UNBALANCED MERGE SORT
algorithm.
b. What is the worst case time complexity of the UNBALANCEDMERGESORT
algorithm? Solve the recurrence relation from the previous section.
So i'm thinkin the recurrence relation is : T(n) <= T(2n/5) + T(3n/5) + dn.
Not sure how to solve it.
Thanks in advance.
I like to look at it as "runs", where the ith "run" is ALL the recursive steps with depth exactly i.
In each such run, at most n elements are being processed (we will prove it soon), so the total complexity is bounded by O(n*MAX_DEPTH), now, MAX_DEPTH is logarithmic, as in each step the bigger array is size 3n/5, so at step i, the biggest array is of size 3^i/5^i * n.
Sovle the equation:
3^i/5^i * n = 1
and you will find out that i = log_a(n) - for some base a
So, let's be more formal:
Claim:
Each element is being processed by at most one recursive call at depth
i, for all values of i.
Proof:
By induction, at depth 0, all elements are processed exactly once by the first call.
Let there be some element x, and let's have a look on it at step i+1. We know (induction hypothesis) that x was processed at most once in depth i, by some recursive call. This call later invoked (or not, we claim at most once) the recursive call of depth i+1, and sent the element x to left OR to right, never to both. So at depth i+1, the element x is proccessed at most once.
Conclusion:
Since at each depth i of the recursion, each element is processed at most once, and the maximal depth of the recursion is logarithmic, we get an upper bound of O(nlogn).
We can similarly prove a lower bound of Omega(nlogn), but that is not needed, since sorting is already an Omega(nlogn) problem - so we can conclude the modified algorithm is still Theta(nlogn).
If you want to prove it with "basic arithmetics", it can also be done, by induction.
Claim: T(n) = T(3n/5) + T(2n/5) + n <= 5nlog(n) + n
It will be similar when replacing +n with +dn, I simplified it, but follow the same idea of proof with T(n) <= 5dnlogn + dn
Proof:
Base: T(1) = 1 <= 1log(1) + 1 = 1
T(n) = T(3n/5) + T(2n/5) + n
<= 5* (3n/5) log(3n/5) +3n/5 + 5*(2n/5)log(2n/5) +2n/5 + n
< 5* (3n/5) log(3n/5) + 5*(2n/5)log(3n/5) + 2n
= 5*nlog(3n/5) + 2n
= 5*nlog(n) + 5*nlog(3/5) + 2n
(**)< 5*nlog(n) - n + 2n
= 5nlog(n) + n
(**) is because log(3/5)~=-0.22, so 5nlog(3/5) < -n, and 5nlog(3/5) + 2n < n

Time complexity of the following algorithm?

I'm learning Big-O notation right now and stumbled across this small algorithm in another thread:
i = n
while (i >= 1)
{
for j = 1 to i // NOTE: i instead of n here!
{
x = x + 1
}
i = i/2
}
According to the author of the post, the complexity is Θ(n), but I can't figure out how. I think the while loop's complexity is Θ(log(n)). The for loop's complexity from what I was thinking would also be Θ(log(n)) because the number of iterations would be halved each time.
So, wouldn't the complexity of the whole thing be Θ(log(n) * log(n)), or am I doing something wrong?
Edit: the segment is in the best answer of this question: https://stackoverflow.com/questions/9556782/find-theta-notation-of-the-following-while-loop#=
Imagine for simplicity that n = 2^k. How many times x gets incremented? It easily follows this is Geometric series
2^k + 2^(k - 1) + 2^(k - 2) + ... + 1 = 2^(k + 1) - 1 = 2 * n - 1
So this part is Θ(n). Also i get's halved k = log n times and it has no asymptotic effect to Θ(n).
The value of i for each iteration of the while loop, which is also how many iterations the for loop has, are n, n/2, n/4, ..., and the overall complexity is the sum of those. That puts it at roughly 2n, which gets you your Theta(n).

Worst case time complexity for this stupid sort?

The code looks like:
for (int i = 1; i < N; i++) {
if (a[i] < a[i-1]) {
swap(i, i-1);
i = 0;
}
}
After trying out a few things i figure the worst case is when the input array is in descending order. Then looks like the compares will be maximum and hence we will consider only compares. Then it seems it would be a sum of sums, i.e sum of ... {1+2+3+...+(n-1)}+{1+2+3+...+(n-2)}+{1+2+3+...+(n-3)}+ .... + 1 if so what would be O(n) ?
If I am not on the right path can someone point out what O(n) would be and how can it be derived? cheers!
For starters, the summation
(1 + 2 + 3 + ... + n) + (1 + 2 + 3 + ... + n - 1) + ... + 1
is not actually O(n). Instead, it's O(n3). You can see this because the sum 1 + 2 + ... + n = O(n2, and there are n copies of each of them. You can more properly show that this summation is Θ(n3) by looking at the first n / 2 of these terms. Each of those terms is at least 1 + 2 + 3 + ... + n / 2 = Θ(n2), so there are n / 2 copies of something that's Θ(n2), giving a tight bound of Θ(n3).
We can upper-bound the total runtime of this algorithm at O(n3) by noting that every swap decreases the number of inversions in the array by one (an inversion is a pair of elements out of place). There can be at most O(n2) inversions in an array and a sorted array has no inversions in it (do you see why?), so there are at most O(n2) passes over the array and each takes at most O(n) work. That collectively gives a bound of O(n3).
Therefore, the Θ(n3) worst-case runtime you've identified is asymptotically tight, so the algorithm runs in time O(n3) and has worst-case runtime Θ(n3).
Hope this helps!
It does one iteration of the list per swap. The maximum number of swaps necessary is O(n * n) for a reversed list. Doing each iteration is O(n).
Therefore the algorithm is O(n * n * n).
This is one half of the infamous Bubble Sort, which has a O(N^2). This partial sort has O(N) because the For loop goes from 1 to N. After one iteration, you will end up with the largest element at the end of the list and the rest of the list in some changed order. To be a proper Bubble Sort, it needs another loop inside this one to iterate j from 1 to N-i and do the same thing. The If goes inside the inner loop.
Now you have two loops, one inside the other, and they both go from 1 to N (sort of). You will have N * N or N^2 iterations. Thus O(N^2) for the Bubble Sort.
Now you have take your next step as a programmer: Finish writing the Bubble Sort and make it work correctly. Try it with different lengths of list a and see how long it takes. Then never use it again. ;-)

Worst Case Performance of Quicksort

I am trying to prove the following worst-case scenario for the Quicksort algorithm but am having some trouble. Initially, we have an array of size n, where n = ij. The idea is that at every partition step of Quicksort, you end up with two sub-arrays where one is of size i and the other is of size i(j-1). i in this case is an integer constant greater than 0. I have drawn out the recursive tree of some examples and understand why this is a worst-case scenario and that the running time will be theta(n^2). To prove this, I've used the iteration method to solve the recurrence equation:
T(n) = T(ij) = m if j = 1
T(n) = T(ij) = T(i) + T(i(j-1)) + cn if j > 1
T(i) = m
T(2i) = m + m + c*2i = 2m + 2ci
T(3i) = m + 2m + 2ci + 3ci = 3m + 5ci
So it looks like the recurrence is:
j
T(n) = jm + ci * sum k - 1
k=1
At this point, I'm a bit lost as to what to do. It looks the summation at the end will result in j^2 if expanded out, but I need to show that it somehow equals n^2. Any explanation on how to continue with this would be appreciated.
Pay attention, the quicksort algorithm worst case scenario is when you have two subproblems of size 0 and n-1. In this scenario, you have this recurrence equations for each level:
T(n) = T(n-1) + T(0) < -- at first level of tree
T(n-1) = T(n-2) + T(0) < -- at second level of tree
T(n-2) = T(n-3) + T(0) < -- at third level of tree
.
.
.
The sum of costs at each level is an arithmetic serie:
n n(n-1)
T(n) = sum k = ------ ~ n^2 (for n -> +inf)
k=1 2
It is O(n^2).
Its a problem of simple mathematics. The complexity as you have calculated correctly is
O(jm + ij^2)
what you have found out is a parameterized complextiy. The standard O(n^2) is contained in this as follows - assuming i=1 you have a standard base case so m=O(1) hence j=n therefore we get O(n^2). if you put ij=n you will get O(nm/i+n^2/i) . Now what you should remember is that m is a function of i depending upon what you will use as the base case algorithm hence m=f(i) thus you are left with O(nf(i)/i + n^2/i). Now again note that since there is no linear algorithm for general sorting hence f(i) = omega(ilogi) which will give you O(nlogi + n^2/i). So you have only one degree of freedom that is i. Check that for any value of i you cannot reduce it below nlogn which is the best bound for comparison based.
Now what I am confused is that you are doing some worst case analysis of quick sort. This is not the way its done. When you say worst case it implies you are using randomization in which case the worst case will always be when i=1 hence the worst case bound will be O(n^2). An elegant way to do this is explained in randomized algorithm book by R. Motwani and Raghavan alternatively if you are a programmer then you look at Cormen.

Resources