Median of median algorithm recurrence relation - sorting

I know that the linear select (median of medians algorithm) recurrence equation is as follows:
T(n) <= an + T(n/5) + T(7n/10)
But where do these terms come from? I've been trying to understand, but I'm extremely confused. Can anyone please shed some light?

Best attempt:
That equation only applies when you take medians of groups of 5; otherwise it will change. The an part of the equation is the time it takes for the algorithm to go through all the elements, group them into blocks of 5, and find the median of each block. The T(n/5) term is the recursive call that finds the median of those block medians; since there are n/5 groups of 5, that call is made on n/5 elements.
T(7n/10) will take more time...
When you do the median of medians, the elements are effectively broken up into four parts: 3/10 of the elements are guaranteed to be greater than the median of medians, and 3/10 are guaranteed to be less than it. The other 4/10 is split into two groups of 2/10; these are the elements for which you're not sure whether they are greater or less than the median of medians. Therefore, the maximum fraction of elements that could be greater than (or less than) the median of medians is 3/10 + 2/10 + 2/10 = 7/10. So the T(7n/10) term is the cost of continuing the recursion on the largest possible segment of numbers that is larger/smaller than the median of medians.
Hopefully that kind of makes sense.
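If it helps to see the recurrence "in action", here is a quick numeric check (my own sketch, not part of the answer above), taking the an term with a = 1 and T(0) = T(1) = 0:

N = 10**6
T = [0.0] * (N + 1)
for n in range(2, N + 1):
    T[n] = n + T[n // 5] + T[7 * n // 10]   # an + T(n/5) + T(7n/10), with a = 1
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, T[n] / n)                      # the ratio stays bounded (below 10), so T(n) = O(n)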

Related

Random Binary Tree, prove that the height is O(logn) with high probability

I have now been stuck for two days on this exercise from my professor:
"Consider the ordered binary tree over a set S ⊆ ℕ, built by repeated insertion, where the elements of S are inserted by a permutation picked uniformly at random. Prove that the height of the tree is O(log n) with high probability."
My work so far has been to study the probabilistic analysis of random algorithms. For example, the CLRS book has a chapter "12.4 Randomly built binary search trees" where it is proven that the expected height of a binary tree built by repeated insertion over a random permutation is O(log n). Many other books prove this bound. But this is not what we are looking for. We want to prove a much stronger bound: that the height is O(log n) with high probability. I've studied the classic paper "A Note on the Height of Binary Search Trees" (Luc Devroye, 1986), where he proves that the height is approximately 4.31107... log n with high probability, but the analysis is way out of my league. I couldn't understand the logic of key points in the paper.
Every book and article I've seen cites Devroye's paper and says "it can also be proven that with high probability the height is O(log n)".
How should I proceed further?
Thanks in advance.
I will outline my best idea based on well-known probability results. You will need to add details and make it rigorous.
First let's consider the process of descending through pivots to a random node in a binary tree. Suppose that your random node is known to be somewhere between i and i+m-1, a range of m values. At the next step, which adds to the length of the path, you pick a pivot j in that range. With probability (j-i)/m our random node is below j and is now in a range of length j-i. With probability 1/m it was j. And with probability (m-(j-i)-1)/m it was above j and is now in a range of length m-(j-i)-1. Within those ranges, the unknown node is evenly distributed.
The obvious next step is to replace this discrete process with a continuous approximation. We pick a random real number x between 0 and m to be the next pivot. With probability x/m we are now in a range of size x, and with probability (m-x)/m we are in a range of size m-x. Either way the range has shrunk by a random factor X, equal to x/m or (m-x)/m respectively. The distribution of X is known, and the successive factors we sample in the continuous approximation are independent.
One more note. log(X) has both an expected value E and a variance V that can be calculated. Since X is always between 0 and 1, its expected value is negative.
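For concreteness, here is a quick way to estimate E and V under the continuous model above (pivot fraction u uniform on [0, 1], shrink factor X = u with probability u and 1-u otherwise); this is my own sketch, and sample_log_X is just an illustrative helper:

import math, random

def sample_log_X():
    u = random.random()                          # pivot position as a fraction of the range
    x = u if random.random() < u else 1.0 - u    # shrink factor X in the continuous model
    return math.log(x)

samples = [sample_log_X() for _ in range(10**6)]
E = sum(samples) / len(samples)
V = sum((s - E) ** 2 for s in samples) / len(samples)
print(E, V)   # around -0.5 and 0.25; integrating log(x) against the density 2x gives exactly -1/2 and 1/4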
Now pick ε with 0 < ε. The outline of the proof is as follows.
Show that at each step, the expected error from the discrete to the continuous approximation increases by at most O(1).
Show that the probability that the sum of (ε - 1/E) log(n) samples of log(X) fails to be below -log(n) is O(1/n).
Show that the probability that a random node is at depth (2ε - 1/E) log(n) or more is O(1/n).
Show that the probability that a random permutation has ANY node at depth (3ε - 1/E) log(n) or more is O(1/log(n)).
Let's go.
1. Show that at each step, the error from the discrete to the continuous approximation increases by at most O(1).
Any error carried over from the previous step shrinks by the same random factor as the range, so it does not grow from one step to the next. The two new roundoffs introduced in a step are each at most 1. So the error increases by at most 2 per step.
2. Show that the probability that the sum of (ε - 1/E) log(n) samples of log(X) fails to be below -log(n) is O(1/n).
The expected value of the sum of (ε - 1/E) log(n) samples of log(X) is (εE - 1) log(n). Since E is negative, this is below our target of -log(n). Using the Bernstein inequalities, we can bound the probability of the sum being that far above its mean; the bound is exponentially small in the number of samples. Since we have Θ(log(n)) samples, this comes out proportional to 1/n.
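As a rough sanity check of this step (my own sketch, plugging in E = -1/2 from the continuous model and ε = 1, so (ε - 1/E) log(n) = 3 log(n) samples; sample_log_X is the same illustrative helper as above):

import math, random

def sample_log_X():
    u = random.random()
    x = u if random.random() < u else 1.0 - u
    return math.log(x)

n = 10**6
m = int(3 * math.log(n))      # (eps - 1/E) log(n) samples with eps = 1, E = -1/2
trials = 10**4
failures = sum(1 for _ in range(trials)
               if sum(sample_log_X() for _ in range(m)) > -math.log(n))
print(failures / trials)      # the observed failure fraction is small, and it shrinks if you increase n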
3. Show that the probability that a random node is at depth (2ε - 1/E) log(n) or more is O(1/n).
With probability 1 - O(1/n), within (ε - 1/E) log(n) steps the continuous approximation has shrunk its range to below 1. There were O(log(n)) of those steps, so by step 1 the error between the continuous and discrete processes is at most O(log(n)). So we just have to show that the odds of failing to go from O(log(n)) remaining possibilities down to 1 in another ε log(n) steps is at most O(1/n).
This would be implied if we could show that, for any given constant a, the odds of failing to go from a·k possibilities down to 1 in at most k steps is a negative exponential in k. (Here k is ε log(n).)
For that, record a 1 every time a step cuts the search space at least in half, and a 0 otherwise; each step does so with odds at least 1/2. In k steps there are 2^k possible sequences of 1s and 0s. For any given a, once k is large enough, k/4 halvings already suffice to reduce a·k possibilities to 1 (since log2(a·k) <= k/4), so failing means that fewer than k/4 of the k steps were halvings. A little playing around with the binomial formula and Stirling's approximation gives an upper bound on the likelihood of that of the form O(k c^k) for some constant c < 1, which is sufficient for our purposes.
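To get a feel for why that tail is a negative exponential, here is a tiny numeric check of the binomial tail in question (fair coins, the worst case allowed by the "odds at least 1/2" above; my own sketch):

import math

def tail(k):
    # probability of fewer than k/4 successes in k fair coin flips
    return sum(math.comb(k, i) for i in range(k // 4)) / 2 ** k

for k in (20, 40, 80, 160):
    print(k, tail(k))   # decays geometrically in k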
4. Show that the probability that a random permutation has ANY node at depth at least (3ε - 1/E) log(n) is O(1/log(n)).
The proportion of random nodes in random binary trees that are at depth at least (2ε - 1/E) log(n) is at most p/n for some constant p.
Any tree with a node at depth (3ε - 1/E) log(n) has n nodes, of which at least ε log(n) (the ancestors along that path) are at depth (2ε - 1/E) log(n) or more. So if the probability of a random permutation producing a node at depth (3ε - 1/E) log(n) exceeded p / (ε log(n)), those trees alone would already contribute more nodes at depth (2ε - 1/E) log(n) than the bound above allows. Therefore, by the pigeonhole principle, we have our upper bound on the likelihood.
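Not a proof, of course, but the claim is easy to check empirically: build binary search trees from random permutations and compare their heights with log(n). This is my own sketch, and random_bst_height is just an illustrative helper:

import math, random

def random_bst_height(n):
    # insert a random permutation of 0..n-1 into an unbalanced BST and return its height
    keys = list(range(n))
    random.shuffle(keys)
    left, right, depth = {}, {}, {}
    root = keys[0]
    depth[root] = 1
    height = 1
    for key in keys[1:]:
        node = root
        while True:
            children = left if key < node else right
            if node in children:
                node = children[node]          # keep descending
            else:
                children[node] = key           # attach the new key here
                depth[key] = depth[node] + 1
                height = max(height, depth[key])
                break
    return height

n = 50000
heights = [random_bst_height(n) for _ in range(10)]
print(min(heights), max(heights), math.log(n))   # heights land within a small constant multiple of log(n)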

Find and sort in O(n) the log2(n) smallest values and the log2(n) largest values in an array of n values

Let A be an array of n different numbers (positive & negative).
We are interested in the ⌊log_2(n)⌋ smallest values,
and in the ⌊log_2(n)⌋ largest values.
Find an algorithm which calculates these 2⌊log_2(n)⌋ values
and presents them in a sorted array (of size 2⌊log_2(n)⌋), such that:
1. The running time of the algorithm must be Θ(n).
2. Prove that the running time is Θ(n).
I thought maybe heapsort could be useful, but I'm really not sure.
I don't need code, just the idea... I would appreciate any help.
Thanks :) and sorry if I have English mistakes :(
My general approach would be to create two heap data structures, one for the max and one for the min, and heapify the array into both of them. Heapifying is an operation of linear time complexity if done right.
Then I would extract ⌊log_2(n)⌋ items from both heaps, where each extraction has complexity O(log n). So this gives us the following rough estimate of the number of operations:
2 * n + 2 * (log(n))^2
2 * n for two heapifying operations.
log(n) * log(n) for extracting log(n) elements from one of the heaps.
2 * (log(n))^2 for extracting log(n) elements from both heaps.
In big O terms, the ruling term is n, since log(n) even to the power of two is asymptotically smaller. So the whole expression above renders to a sweet O(n).
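A minimal Python sketch of that idea using heapq (my own illustration, with a hypothetical helper name smallest_and_largest; it assumes 2⌊log2(n)⌋ <= n, so the two halves of the output cannot overlap):

import heapq, math

def smallest_and_largest(a):
    n = len(a)
    k = int(math.log2(n))                               # floor(log2 n)
    mins = list(a)
    maxs = [-x for x in a]                              # max-heap simulated by negating values
    heapq.heapify(mins)                                 # O(n)
    heapq.heapify(maxs)                                 # O(n)
    low = [heapq.heappop(mins) for _ in range(k)]       # k smallest, ascending, O(k log n)
    high = [-heapq.heappop(maxs) for _ in range(k)]     # k largest, descending, O(k log n)
    return low + high[::-1]                             # sorted array of size 2*floor(log2 n)

print(smallest_and_largest(list(range(1, 33))))         # e.g. [1..5] followed by [28..32]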

Average Case of Quick Sort

I'm working on the program and just need help with the following to understand it better.
What is the average case running time for Quick sort and what may cause this average case performance? How can we modify quick sort program to mitigate this problem?
I know that it has average case O(n log(n)) and I know it occurs when the pivot is the median element. My question is how I can modify the program to mitigate this problem.
The average case of quicksort is not when the pivot is the median element - that's the best case. Analyzing the average case is a bit trickier. We'll assume that the array is in a random order, so that each element is equally likely to be selected as the pivot. Alternatively, we can just select the pivot randomly so that the original array order doesn't matter; either way leads to the same conclusion.
If the numbers in the array are [1, 2, 3, 4, 5], for example, then each number has a 1/5 probability of being selected as the pivot.
If 1 is selected as the pivot, then the recursive calls are on arrays of size 0 and 4.
If 2 is the pivot, then the recursive calls are on arrays of size 1 and 3.
If 3 is the pivot, then we will make recursive calls on arrays of size 2 and 2.
If 4 is the pivot, then the recursive calls are on arrays of size 3 and 1.
If 5 is selected as the pivot, then the recursive calls are on arrays of size 4 and 0.
So the recurrence is that T(5) is 1/5 of [T(4) + T(0)] + [T(3) + T(1)] + [T(2) + T(2)] + [T(1) + T(3)] + [T(0) + T(4)], plus an O(n) term for the cost of partitioning. The general form of this recurrence relation is a sum over every possible pivot, divided by the number of possible pivots:
T(n) = O(n) + (1/n) * sum over k = 0 to n-1 of [T(k) + T(n-1-k)]
The solution to this recurrence relation happens to be that T(n) is in O(n log n).
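If you want to see that concretely, here is a small numeric check (my own sketch, taking the partitioning cost to be n - 1 comparisons):

import math

N = 200000
prefix = 0.0      # running sum T(0) + T(1) + ... + T(n-1), starting with T(0) = T(1) = 0
t = 0.0
for n in range(2, N + 1):
    # T(n) = (n - 1) + (1/n) * sum over pivots k of [T(k) + T(n-1-k)]
    #      = (n - 1) + (2/n) * (T(0) + T(1) + ... + T(n-1))
    t = (n - 1) + 2.0 * prefix / n
    prefix += t
print(t / (N * math.log(N)))   # slowly approaches 2 as N grows, i.e. T(n) ~ 2 n ln(n) = Theta(n log n)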
The fact that the quicksort algorithm runs in O(n log n) time in the average case is not a problem; in fact, this is asymptotically optimal for any comparison-based sorting algorithm. No comparison-based sorting algorithm can have a better asymptotic running time in the average case.

The time complexity of quick select

I read that the time complexity of quick select is:
T(n) = T(n/5) + T(7n/10) + O(n)
I read the above thing as "time taken to quick select from n elements = (time taken to select from 7n/10 elements)+ (time taken to quickselect from n/5 elements) + (some const *n)"
So I understand that once we find a decent pivot, only 7n/10 elements are left, and doing one round of arranging the pivot takes time n.
But the n/5 part confuses me. I know it has got to do with median of medians, but I don't quite get it.
Median of medians, from what I understood, is recursively splitting into groups of 5 and finding the medians, till you get 1.
I found that the time taken to do that is about n.
So T_mom(n) = n.
How do you equate that with T_quickselect(n) = T_mom(n)/5?
In other words, this is what I think the equation should read:
T(n)= O(n)+n+T(7n/10)
where,
O(n) -> for finding median
n-> for getting the pivot into its position
T(7n/10) -> Doing the same thing for the other 7n/10 elements. (worst case)
Can someone tell me where I'm going wrong?
In this setup, T(n) refers to the number of steps required to compute MoM on an array of n elements. Let's go through the algorithm one step at a time and see what happens.
First, we break the input into blocks of size 5, sort each block, form a new array of the medians of those blocks, and recursively call MoM to get the median of that new array. Let's see how long each of those steps takes:
Break the input into blocks of size 5: this could be done in time O(1) by just implicitly partitioning the array into blocks without moving anything.
Sort each block: sorting an array of any constant size takes time O(1). There are O(n) such blocks (specifically, ⌈n / 5⌉), so this takes time O(n).
Get the median of each block and form a new array from those medians. The median element of each block can be found in time O(1) by just looking at the center element. There are O(n) blocks, so this step takes time O(n).
Recursively call MoM on that new array. This takes time T(⌈n/5⌉), since we're making a recursive call on the array of that size we formed in the previous step.
So this means that the logic to get the actual median of medians takes time O(n) + T(⌈n/5⌉).
So where does the T(7n/10) part come from? Well, the next step in the algorithm is to use the median of medians we found in step (4) as a partition element to split the elements into elements less than that pivot and elements greater than that pivot. From there, we can determine whether we've found the element we're looking for (if it's at the right spot in the array) or whether we need to recurse on the left or right regions of the array. The advantage of picking the median of the block medians as the splitting point is that it guarantees a worst-case 70/30 split in this step between the smaller and larger elements, so if we do have to recursively continue the algorithm, in the worst case we do so with roughly 7n/10 elements.
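To tie the pieces together, here is a compact (and deliberately unoptimized) Python sketch of quickselect with a median-of-medians pivot; it is my own illustration of the steps above, not code from the question, and mom_select is just an illustrative name:

def mom_select(a, k):
    # return the k-th smallest element of a (0-indexed), assuming 0 <= k < len(a)
    if len(a) <= 5:
        return sorted(a)[k]
    # steps 1-3: sort each block of 5 and collect the block medians: O(n)
    blocks = [a[i:i + 5] for i in range(0, len(a), 5)]
    medians = [sorted(b)[len(b) // 2] for b in blocks]
    # step 4: recursive call on ~n/5 medians: T(n/5)
    pivot = mom_select(medians, len(medians) // 2)
    # partition around the pivot: O(n)
    less = [x for x in a if x < pivot]
    greater = [x for x in a if x > pivot]
    equal = len(a) - len(less) - len(greater)
    if k < len(less):
        return mom_select(less, k)                        # recurse on at most ~7n/10 elements
    if k < len(less) + equal:
        return pivot
    return mom_select(greater, k - len(less) - equal)     # recurse on at most ~7n/10 elements

print(mom_select([9, 1, 7, 3, 8, 2, 6, 4, 5, 0], 4))      # prints 4, the 5th smallest element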
In the median-of-medians part, we do the following:
Take the median of sublists that each have at most 5 elements. Each of these sublists needs O(1) operations, and there are about n/5 of them, so in total it takes O(n) just to find the median of each sublist.
Take the median of those n/5 medians (the median of medians). This needs T(n/5), because it is a recursive call on only n/5 elements.
So the median-of-medians part is actually T(n/5) + O(n). By the way, the T(7n/10) part is not exactly what you said.

Median of medians algorithm: why divide the array into blocks of size 5

In the median-of-medians algorithm, we need to divide the array into chunks of size 5. I am wondering how the inventors of the algorithm came up with the magic number '5' and not, maybe, 7, or 9, or something else?
The number has to be larger than 3 (and an odd number, obviously) for the algorithm to work. 5 is the smallest odd number larger than 3, so 5 was chosen.
I think that if you check the "Proof of O(n) running time" section of the wiki page for the median-of-medians algorithm:
The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians is 20% of the size of the list, while the other recursive call recurses on at most 70% of the list, making the running time
T(n) <= T(n/5) + T(7n/10) + cn.
The O(n) term cn is for the partitioning work (we visit each element a constant number of times, in order to form them into n/5 groups and take each median in O(1) time).
From this, using induction, one can easily show that T(n) = O(n).
That should help you understand why.
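To spell out the induction step (this part is my own addition, not a quote from the wiki page): assume T(k) <= 10ck for every k < n. Then

T(n) <= T(n/5) + T(7n/10) + cn <= 10c(n/5) + 10c(7n/10) + cn = 2cn + 7cn + cn = 10cn,

so T(n) <= 10cn for all n, i.e. T(n) = O(n). The argument works precisely because 1/5 + 7/10 = 9/10 < 1.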
You can also use blocks of size 3 or 4, as shown in the paper Select with groups of 3 or 4 by K. Chen and A. Dumitrescu (2015). The idea is to use the "median of medians" algorithm twice and partition only after that. This lowers the quality of the pivot but is faster.
So instead of:
T(n) <= T(n/3) + T(2n/3) + O(n)
T(n) = O(n log n)
one gets:
T(n) <= T(n/9) + T(7n/9) + O(n)
T(n) = Theta(n)
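A quick way to see the difference between those two recurrences numerically (my own sketch, with the O(n) term taken as exactly n and ratio as an illustrative helper):

def ratio(splits, N):
    # iterate T(n) = n + sum of T(p*n // q) over the split fractions p/q, with T(0) = T(1) = 0
    T = [0.0] * (N + 1)
    for n in range(2, N + 1):
        T[n] = n + sum(T[p * n // q] for p, q in splits)
    return T[N] / N

for N in (10**3, 10**4, 10**5, 10**6):
    print(N, ratio([(1, 3), (2, 3)], N), ratio([(1, 9), (7, 9)], N))
    # the first ratio keeps growing, matching Theta(n log n);
    # the second stays bounded near 9, matching Theta(n)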
See this explanation on Brilliant.org. Basically, five is the smallest sublist size we can use while maintaining linear time, and it is also easy to sort an array of 5 elements quickly. Apologies for the LaTeX:
Why 5?
The median-of-medians divides a list into sublists of length five to
get an optimal running time. Remember, finding the median of small
lists by brute force (sorting) takes a small amount of time, so the
length of the sublists must be fairly small. However, adjusting the
sublist size to three, for example, does change the running time for
the worse.
If the algorithm divided the list into sublists of length three, p would be greater than approximately n/3 of the elements and it would be smaller than approximately n/3 of the elements. This would cause a worst case of 2n/3 recursions, yielding the recurrence
T(n) = T(n/3) + T(2n/3) + O(n),
which by the master theorem is O(n log n), which is slower than linear time.
In fact, for any recurrence of the form T(n) <= T(an) + T(bn) + cn, if a + b < 1, the recurrence will solve to O(n), and if a + b > 1, the recurrence is usually equal to Ω(n log n). [3]
The median-of-medians algorithm could use a sublist size greater than
5—for example, 7—and maintain a linear running time. However, we need
to keep the sublist size as small as we can so that sorting the
sublists can be done in what is effectively constant time.

Resources