Explanation of the Median of Medians algorithm

The median-of-medians approach is very popular in quicksort-type partitioning algorithms for producing a fairly good pivot, one that partitions the array reasonably evenly. Its logic is given on Wikipedia as:
The chosen pivot is both less than and greater than half of the elements in the list of medians, which is around n/10 elements (1/2 * (n/5)) for each half. Each of these elements is a median of 5, making it less than 2 other elements and greater than 2 other elements outside the block. Hence, the pivot is less than 3(n/10) elements outside the block, and greater than another 3(n/10) elements outside the block. Thus the chosen median splits the elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm.
Can somebody explain it a bit more lucidly for me? I am finding it difficult to understand the logic.

Think of the following set of numbers:
5 2 6 3 1
The median of these numbers is 3. Now take any number x: if x > 3, then it is bigger than at least half of the numbers above. If x < 3, then it is smaller than at least half of the numbers above.
So that is the idea. For each set of 5 numbers, you get their median. Now you have n / 5 medians. This is obvious.
Now if you get the median of those numbers (call it m), it is bigger than half of them and smaller than the other half (by definition of median!). In other words, m is bigger than n / 10 numbers (which themselves were medians of small 5-element groups) and smaller than another n / 10 numbers (which again were medians of small 5-element groups).
In the example above, we saw that if the median is k and you have m > k, then m is also bigger than 2 other numbers (that were themselves smaller than k). This means that for each of those small 5-element groups where m was bigger than its median, m is also bigger than two other numbers. So in each of those n / 10 small 5-element groups, there are at least 3 numbers (2 numbers + the median itself) that are smaller than m. Hence, m is bigger than at least 3n/10 numbers.
Similar logic shows the number of elements m is smaller than: m is smaller than at least another 3n/10 numbers.
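To make this concrete, here is a minimal Python sketch (my own illustration, not code from the original answer) of worst-case linear selection using the median-of-medians pivot; the function name select and the partitioning details are my choices:

def select(arr, k):
    """Return the k-th smallest element (0-based) of arr in worst-case O(n)."""
    if len(arr) <= 5:
        return sorted(arr)[k]
    # 1. Median of each group of 5 (constant work per group).
    medians = [sorted(group)[len(group) // 2]
               for group in (arr[i:i + 5] for i in range(0, len(arr), 5))]
    # 2. The median of those medians, found recursively; by the argument above
    #    it is bigger than ~3n/10 elements and smaller than ~3n/10 elements.
    pivot = select(medians, len(medians) // 2)
    # 3. Partition around the pivot and recurse into the side containing k.
    lows = [x for x in arr if x < pivot]
    highs = [x for x in arr if x > pivot]
    pivots = [x for x in arr if x == pivot]
    if k < len(lows):
        return select(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return select(highs, k - len(lows) - len(pivots))

For example, select([5, 2, 6, 3, 1, 9, 7], 3) returns 5, the 4th-smallest element.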

Finding largest difference between two elements in array using at most 3/2n comparisons [duplicate]

This question already has answers here: How to find max. and min. in array using minimum comparisons?
I am working on the following problem: given an unsorted array with integer elements,
A = {a_1, a_2, ..., a_n}
find the largest difference between two elements in the array (max |a_i - a_j|) using at most 3/2 n comparisons in the worst case. (Runtime does not matter, and we can't use operations such as max or min.)
I really doubt this is possible: to find the maximum difference of two elements, shouldn't we always need about 2n comparisons in the worst case, since we need about n comparisons to find the largest element of the array and another n comparisons to find the smallest? I don't see where I can cut the operations.
I have also considered divide and conquer. Suppose I divide the array into 2 subarrays of length n/2, but then I run into the same problem: finding the maximum and minimum in each subarray will take about n comparisons, so there will be 2n comparisons in total.
A hint on how to do this would be really appreciated.
cppreference proposes an example implementation of std::minmax_element with a complexity of 3/2 n comparisons. The basic idea is to process the elements 2 by 2:
If A[i+1] > A[i]
    A[i+1] is compared with Max
    A[i] is compared with Min
Else
    A[i] is compared with Max
    A[i+1] is compared with Min
2 elements considered, 3 comparisons -> Complexity O(3/2 n)
Note: if n is odd, the last element must be considered separately.
It is straightforward to show that finding the maximum difference is equivalent to finding the minimum and maximum of the array elements. In turn, you can find the minimum and the maximum of an array simultaneously with about 3n/2 comparisons (a standard technique, with reference implementations in C#, C++, Python, C, and Java):
If n is odd then initialize min and max as the first element.
If n is even then initialize min and max as minimum and maximum of the first two elements respectively.
For the rest of the elements, pick them in pairs and compare their
maximum and minimum with max and min respectively.
Total number of comparisons: Different for even and odd n, see below:
If n is odd: 3*(n-1)/2
If n is even: 1 initial comparison for initializing min and max,
and 3(n-2)/2 comparisons for the rest of the elements,
i.e. 1 + 3*(n-2)/2 = 3n/2 - 2
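As a concrete illustration of both answers (my own Python sketch, not code from the thread), here is the pairwise scan with a counter so the comparison counts above can be checked:

def min_max(a):
    """Find (min, max) using roughly 3n/2 element comparisons by processing pairs."""
    n = len(a)
    count = 0                       # number of element comparisons performed
    if n % 2:                       # n odd: both start at the first element
        lo = hi = a[0]
        i = 1
    else:                           # n even: one comparison orders the first pair
        count += 1
        lo, hi = (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])
        i = 2
    while i < n:                    # 3 comparisons per remaining pair
        count += 3
        small, big = (a[i], a[i + 1]) if a[i] < a[i + 1] else (a[i + 1], a[i])
        if small < lo:
            lo = small
        if big > hi:
            hi = big
        i += 2
    return lo, hi, count

For example, min_max([3, 1, 4, 1, 5, 9]) returns (1, 9, 7), matching 3*6/2 - 2 = 7, and the largest difference is then simply max - min.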

How can I find the minimum interval that contains half of n numbers?

If I have n numbers, how do I find the minimum interval [a, b] that contains half of those numbers?
Sort the numbers.
Set the left index to 0 and the right index to n/2 - 1, and take the difference A[right] - A[left].
Walk n/2 steps in a for-loop, incrementing both indexes and recomputing the difference each time; remember the smallest difference and the corresponding indexes.
Sort the numbers in increasing order and compute all the differences A[i + n/2 - 1] - A[i]. The solution is given by the index i that minimizes the difference.
Explanation:
There is no need to search among intervals that contain fewer than n/2 numbers (because they do not satisfy the condition) nor those that contain more (because if such an interval is suitable, it cannot be minimal: you could remove the extreme elements).
When the elements are sorted, any sequence in the array is bounded by its first and last elements. So it suffices to sort the numbers and slide a window of n/2 elements.
Now it is more challenging to tell if this O(n log n) approach is optimal.
How about the following?
Sort the given series of numbers in ascending order.
Start a loop with i from the 1st to the [n/2]th number.
Calculate the difference d between the (i + [n/2])th number and the ith number. Store the numbers i, i + [n/2], and d in an iterable collection arr.
End loop.
Find the minimum value of d in arr. The values of i and i + [n/2] corresponding to this d give your smallest range.
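A short Python sketch of the sort-and-slide idea from the answers above (my own code; I assume "half" means ceil(n/2) of the numbers, since the answers are loose about the exact off-by-one):

def smallest_half_interval(nums):
    """Smallest interval [a, b] covering at least half of the numbers: O(n log n)."""
    a = sorted(nums)
    n = len(a)
    k = (n + 1) // 2                    # how many numbers "half" is taken to mean
    # Every window of k consecutive sorted elements is a candidate interval,
    # and its width is simply the last element minus the first element.
    best = min(range(n - k + 1), key=lambda i: a[i + k - 1] - a[i])
    return a[best], a[best + k - 1]

For example, smallest_half_interval([1, 9, 10, 11, 40, 50]) returns (9, 11).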

Average number of swaps performed in Bubble Sort

I came across this problem right now:
http://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&category=24&page=show_problem&problem=3155
The problem asks us to calculate the average number of swaps performed by the Bubble Sort algorithm when the given data is a random shuffle of the first n natural numbers (1 to n listed randomly).
So, I thought that:
Max no. of swaps possible = n(n-1)/2 (when they are in descending order).
Min no. of swaps possible = 0 (when they are in ascending order).
So, the mode of this distribution is (0 + n(n-1)/2)/2 = n(n-1)/4.
But, this turned out to be the answer.
I don't understand why the mode coincided with the mean.
Since every arrangement of the inputs to be sorted occurs with equal probability, the distribution of the number of swaps is symmetrical.
It is a property of symmetrical distributions that their mean, median and mode coincide, which is why the mean and mode are the same.
Every swap reduces the number of inversions in the array by exactly 1.
The sorted array has no inversions, thus the number of swaps is equal to the number of inversions in the initial array. Thus we need to compute the average number of inversions in a shuffled array.
The pair of indices i, j, with i < j, is an inversion in exactly half of the shuffled arrays. There are n * (n-1) / 2 such pairs, thus we have n * (n-1) / 4 inversions on average.
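A quick empirical check of that claim (my own Python sketch, not from the thread): count bubble-sort swaps over many random permutations and compare the average with n(n-1)/4.

import random

def bubble_sort_swaps(a):
    """Bubble sort a copy of a and count the swaps (equal to the number of inversions)."""
    a = list(a)
    swaps = 0
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swaps += 1
    return swaps

n, trials = 8, 20000
avg = sum(bubble_sort_swaps(random.sample(range(1, n + 1), n))
          for _ in range(trials)) / trials
print(avg, n * (n - 1) / 4)    # the two values should be close (the second is 14.0)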

Finding two elements whose difference is greater than or equal to D

Let's say you have an array
2 6 4 2 9 4 2
You want to find two elements whose difference is greater than 6. In this case, one possible answer is (9, 2). How would you do this in less than O(N^2) time?
Idea 1:
1) Sort your numbers: O(n lg n).
2) If the difference between the last and first elements is at least your number (6), you found them (the first and last elements). If the difference is smaller, there are no such elements.
Idea 2:
Find the min and max elements. If the difference between them is less than your target number, there is no such pair of elements; otherwise the min and max are your answer. Time: O(n).
Just scan for the minimum and maximum values. O(n).
Just loop on the elements while storing two numbers, the minimum and maximum obtained so far.
If at any point you find a pair that satisfies the condition, that's your answer.
If you reach the end of the array, no such pair exists.
Worst case time: O(n)
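For completeness, a small Python sketch of the O(n) scan described above (my own code; I use >= d to match the question title, change it to > for a strictly greater difference):

def pair_with_difference(a, d):
    """Return (hi, lo) with hi - lo >= d if such a pair exists, else None. O(n)."""
    lo = hi = a[0]
    for x in a[1:]:
        if x < lo:
            lo = x
        if x > hi:
            hi = x
        if hi - lo >= d:           # early exit as soon as the condition is met
            return hi, lo
    return None

print(pair_with_difference([2, 6, 4, 2, 9, 4, 2], 6))   # (9, 2)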

Median of medians algorithm: why divide the array into blocks of size 5

In the median-of-medians algorithm, we need to divide the array into chunks of size 5. I am wondering how the inventors of the algorithm came up with the magic number '5' and not, maybe, 7 or 9 or something else?
The number has to be larger than 3 (and an odd number, obviously) for the algorithm. 5 is the smallest odd number larger than 3. So 5 was chosen.
I think that if you check the "Proof of O(n) running time" section of the wiki page for the median-of-medians algorithm:
The median-calculating recursive call does not exceed worst-case linear behavior, because the list of medians is 20% of the size of the list, while the other recursive call recurses on at most 70% of the list, making the running time
T(n) <= T(n/5) + T(7n/10) + c*n
The O(n) term c*n is for the partitioning work (we visit each element a constant number of times, in order to form them into n/5 groups and take each median in O(1) time).
From this, using induction, one can easily show that T(n) <= 10*c*n, which is O(n).
That should help you understand why.
You can also use blocks of size 3 or 4, as shown in the paper Select with groups of 3 or 4 by K. Chen and A. Dumitrescu (2015). The idea is to use the "median of medians" algorithm twice and partition only after that. This lowers the quality of the pivot but is faster.
So instead of:
T(n) <= T(n/3) + T(2n/3) + O(n)
T(n) = O(n log n)
one gets:
T(n) <= T(n/9) + T(7n/9) + O(n)
T(n) = Theta(n)
See this explanation on Brilliant.org. Basically, five is the smallest sublist size we can use and still maintain linear time. It is also easy to find the median of an n = 5 sublist by sorting it in what is effectively constant time. Apologies for the LaTeX:
Why 5?
The median-of-medians divides a list into sublists of length five to
get an optimal running time. Remember, finding the median of small
lists by brute force (sorting) takes a small amount of time, so the
length of the sublists must be fairly small. However, adjusting the
sublist size to three, for example, does change the running time for
the worse.
If the algorithm divided the list into sublists of length three, p
would be greater than approximately n/3 elements and it would be
smaller than approximately n/3 elements. This would cause a worst
case of 2n/3 recursions, yielding the recurrence
T(n) = T(n/3) + T(2n/3) + O(n),
which by the master theorem is O(n log n), which is slower than
linear time.
In fact, for any recurrence of the form T(n) <= T(an) + T(bn) + cn,
if a + b < 1, the recurrence will solve to O(n), and if a + b > 1,
the recurrence is usually equal to Omega(n log n). [3]
The median-of-medians algorithm could use a sublist size greater than
5—for example, 7—and maintain a linear running time. However, we need
to keep the sublist size as small as we can so that sorting the
sublists can be done in what is effectively constant time.
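A tiny numeric check of that a + b criterion (my own Python illustration, not part of the quoted explanation): iterate both recurrences and watch how T(n)/n behaves as n grows.

from functools import lru_cache

@lru_cache(maxsize=None)
def t(n, a, b):
    """Evaluate T(n) = T(a*n) + T(b*n) + n numerically (T(n) = 1 for n <= 1)."""
    if n <= 1:
        return 1
    return t(int(a * n), a, b) + t(int(b * n), a, b) + n

for n in (10_000, 100_000, 1_000_000):
    groups_of_5 = t(n, 1 / 5, 7 / 10)   # a + b = 0.9 < 1: T(n)/n settles near a constant
    groups_of_3 = t(n, 1 / 3, 2 / 3)    # a + b = 1.0:     T(n)/n keeps growing (n log n)
    print(n, round(groups_of_5 / n, 1), round(groups_of_3 / n, 1))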
