Calculating median with a Black Box algorithm in O(n) time

The problem is this:
Given an array A of size n and a black-box algorithm B with B(A, n) = b, where b is an element of A such that
n/10 <= |{1 <= i <= n : a_i > b}| <= 9n/10,
that is, at least a tenth of the elements of A are greater than b and at least a tenth are less than or equal to it.
The time complexity of B is O(n).
I need to find the median in O(n).
I tried solving this by applying B and then finding the group of elements that are smaller than b; let's call this group C.
The elements bigger than b form a second group, D.
We can build C and D by traversing the array A once, in O(n).
Now I can apply algorithm B again on whichever group must contain the median (the other group can be discarded, because the median is not there), and by repeating the same principle I eventually get the median element. I calculated the time complexity of this as O(n log n).
I can't seem to find a solution that works in O(n).
This is a homework question and I would appreciate any help or insight.

You are supposed to use function B() to choose a pivot element for the Quickselect algorithm: https://en.wikipedia.org/wiki/Quickselect
It looks like you are already thinking of exactly this procedure, so you already have the algorithm, and you're just calculating the complexity incorrectly.
In each iteration, you run a linear-time procedure on a list that is at most 9/10ths the size of the list in the previous iteration, so the worst-case complexity is
O(n + 0.9n + 0.9^2 n + 0.9^3 n + ...)
Geometric progressions like this converge to a constant multiplier:
Let T = 1 + 0.9^1 + 0.9^2 + ...
It's easy to see that
T - T*0.9 = 1, so
T*(0.1) = 1, and T=10
So the total number of elements processed through all iterations is less than 10n, and your algorithm therefore takes O(n) time.
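For concreteness, here is a minimal Python sketch of that procedure (my own illustration; the post only describes B, so select and the way B is passed in are assumptions). select(A, k, B) returns the k-th smallest element, so the median is select(A, (len(A) + 1) // 2, B) (the lower median for even n):

def select(A, k, B):
    # B(A, n) returns an element b of A with at least n/10 elements
    # above it and at least n/10 at or below it (the black-box guarantee).
    if len(A) == 1:
        return A[0]
    b = B(A, len(A))
    smaller = [x for x in A if x < b]      # group C
    bigger = [x for x in A if x > b]       # group D
    equal = len(A) - len(smaller) - len(bigger)
    if k <= len(smaller):
        return select(smaller, k, B)       # the k-th smallest is in C
    if k <= len(smaller) + equal:
        return b                           # b itself is the answer
    return select(bigger, k - len(smaller) - equal, B)   # it is in D

Each call does O(n) work and recurses on at most 9/10ths of the elements, which is exactly the geometric series above.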

Related

How to choose the least number of weights to get a total weight in O(n) time

There are n unsorted weights, and I need to find the least number of weights whose total is at least W.
How do I find them in O(n)?
This problem has many solution methods:
Method 1 - Sorting - O(nlogn)
I guess the most trivial one would be to sort in descending order and then take the first k elements that give a sum of at least W. The time complexity, though, will be O(nlogn).
Method 2 - Max Heap - O(n + klogn)
Another method would be to use a max heap.
Creating the heap takes O(n), and then we extract elements until we reach a total sum of at least W. Each extraction takes O(logn), so the total time complexity is O(n + klogn), where k is the number of elements we had to extract from the heap.
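A minimal Python sketch of this method (the function name is mine; heapq only provides a min heap, so the weights are negated to simulate a max heap, and sum(weights) >= W is assumed):

import heapq

def min_count_max_heap(weights, W):
    heap = [-w for w in weights]       # negate to turn heapq's min heap into a max heap
    heapq.heapify(heap)                # O(n)
    total, count = 0, 0
    while total < W:                   # k extractions, O(log n) each
        total += -heapq.heappop(heap)
        count += 1
    return count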
Method 3 - Using Min Heap - O(nlogk)
Adding this method, which JimMischel suggested in the comments.
Create a min heap with the first elements of the list whose sum is at least W. Then iterate over the remaining elements, and whenever an element is greater than the minimum (the heap top), replace the two.
At this point we might hold more elements than we actually need to reach W, so we just extract the minimums until we reach our limit.
find_min_set(A,W)
    currentW = 0
    heap H                            //create an empty min heap
    for each Elem in A
        if (currentW < W)             //still below W: just take the element
            H.add(Elem)
            currentW += Elem
        else if (Elem > H.top())      //keep only the largest elements seen so far
            currentW += (Elem - H.top())
            H.pop()
            H.add(Elem)
    while (currentW - H.top() >= W)   //drop minimums that are no longer needed
        currentW -= H.top()
        H.pop()
    return H.size()                   //the least number of weights
This method might be even faster in practice, depending on the relation between k and n. See when theory meets practice.
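Here is the same idea as a runnable Python sketch (my own translation of the pseudocode above, with my own function name; it assumes W > 0 and sum(weights) >= W):

import heapq

def min_weights_heap(weights, W):
    heap, current = [], 0
    for w in weights:
        if current < W:                     # still below W: take the element
            heapq.heappush(heap, w)
            current += w
        elif heap and w > heap[0]:          # replace the smallest kept weight
            current += w - heapq.heapreplace(heap, w)
    while current - heap[0] >= W:           # drop weights we no longer need
        current -= heapq.heappop(heap)
    return len(heap)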
Method 4 - O(n)
The best method I could think of would be to use some kind of quickselect while keeping track of the total weight, always partitioning with the median as the pivot.
First, let's define a few things:
sum(A) - The total sum of all elements in array A.
num(A) - The number of elements in array A.
med(A) - The median of the array A.
find_min_set(A,W,T)
    //partition A:
    //L contains all the elements of A that are less than med(A)
    //R contains all the elements of A that are greater than or equal to med(A)
    L, R = partition(A, med(A))
    if (sum(R) == W)
        return T + num(R)
    if (sum(R) > W)
        return find_min_set(R, W, T)
    if (sum(R) < W)
        return find_min_set(L, W - sum(R), num(R) + T)
Call this method as find_min_set(A, W, 0).
Runtime Complexity:
Finding the median is O(n).
Partitioning is O(n).
Each recursive call operates on half the size of the array.
Summing it all up, we get the recurrence T(n) = T(n/2) + O(n), which is the same as the average case of quickselect: O(n).
Note: when all values are unique, both the worst-case and average complexity are indeed O(n). With possible duplicate values the average complexity is still O(n), but the worst case is O(nlogn) when using the median-of-medians method for selecting the pivot.
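A runnable Python sketch of Method 4 (the names and the three-way partition are mine; elements equal to the pivot are handled separately, which also deals with duplicates, and statistics.median is only a stand-in pivot, so a genuine O(n) bound would still need a linear-time median such as median of medians). It assumes positive weights, W > 0 and sum(A) >= W:

import math
import statistics

def min_weights_select(A, W, taken=0):
    med = statistics.median(A)           # stand-in for a linear-time median
    R = [x for x in A if x > med]        # strictly above the pivot
    E = [x for x in A if x == med]       # equal to the pivot
    L = [x for x in A if x < med]        # strictly below the pivot
    sR = sum(R)
    if sR == W:
        return taken + len(R)            # taking all of R hits W exactly
    if sR > W:
        return min_weights_select(R, W, taken)
    need = W - sR                        # still missing after taking all of R
    if len(E) * med >= need:             # the pivot-valued elements suffice
        return taken + len(R) + math.ceil(need / med)
    return min_weights_select(L, need - len(E) * med,
                              taken + len(R) + len(E))

# e.g. min_weights_select([1, 2, 3, 4, 5], 7) == 2  (take 5 and 4)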

Merge Sort Complexity Confusion

Can someone explain to me in plain English how merge sort is O(n*logn)? I know that the 'n' comes from the fact that it takes n appends to merge two sorted lists of size n/2. What confuses me is the log. If we were to draw a tree of the function calls of running merge sort on a 32-element list, it would have 5 levels, and log2(32) = 5. That makes sense; however, why do we use the levels of the tree, rather than the actual function calls and merges, in the Big O definition?
In the recursion-tree diagram for an 8-element list there are 3 levels. In this context, Big O is trying to find how the number of operations behaves as the input increases, so my question is: how are the levels (of function calls) considered operations?
The levels of function calls are considered like this (in the book Introduction to Algorithms, https://mitpress.mit.edu/books/introduction-algorithms, Chapter 2.3.2):
We reason as follows to set up the recurrence for T(n), the worst-case running time of merge sort on n numbers. Merge sort on just one element takes constant time. When we have n > 1 elements, we break down the running time as follows.
Divide: The divide step just computes the middle of the subarray, which takes constant time. Thus, D(n) = Θ(1).
Conquer: We recursively solve two subproblems, each of size n/2, which contributes 2T(n/2) to the running time.
Combine: We have already noted that the MERGE procedure on an n-element subarray takes time Θ(n), and so C(n) = Θ(n).
When we add the functions D(n) and C(n) for the merge sort analysis, we are adding a function that is Θ(n) and a function that is Θ(1). This sum is a linear function of n, that is, Θ(n). Adding it to the 2T(n/2) term from the “conquer” step gives the recurrence for the worst-case running time T(n) of merge sort:
T(n) = Θ(1), if n = 1; T(n) = 2T(n/2) + Θ(n), if n > 1.
Then using the recursion tree or the master theorem, we can calculate:
T(n) = Θ(nlgn).
Simple analysis:
Say the length of the array to be sorted is n.
Every time, it is divided in half.
So it looks like this:
n
n/2 n/2
n/4 n/4 n/4 n/4
............................
1 1 1 ......................
As you can see, the height of the tree will be logn (since 2^k = n gives k = logn).
At every level the sum is n (n/2 + n/2 = n, n/4 + n/4 + n/4 + n/4 = n).
So there are logn levels and every level takes n work;
combining the two, we get nlogn.
Now, regarding your question of how levels are considered operations, consider the following:
take the array 9, 5, 7,
and suppose it is split into 9,5 and 7.
For 9,5 it will get converted to 5,9 (at this level one swap is required);
then at the upper level, 5,9 and 7 get merged into 5,7,9
(again, at this level one swap is required).
In the worst case the number of operations at any level can be O(N), and the number of levels is logn; hence nlogn.
For more clarity, try to code merge sort yourself; you will be able to visualise it.
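Taking up that suggestion, here is a short Python sketch of top-down merge sort (my own illustration):

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]      # append whatever remains

def merge_sort(a):
    if len(a) <= 1:                        # a single element is already sorted
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid])             # T(n/2)
    right = merge_sort(a[mid:])            # T(n/2)
    return merge(left, right)              # Θ(n) work at this level

The recursion is log2(n) levels deep, and the merges on each level touch n elements in total, which is exactly the nlogn being discussed.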
Let's take your 8-item array as an example. We start with [5,3,7,8,6,2,1,4].
As you noted, there are three passes. In the first pass, we merge 1-element subarrays. In this case, we'd compare 5 with 3, 7 with 8, 2 with 6, and 1 with 4. Typical merge sort behavior is to copy items to a secondary array. So every item is copied; we just change the order of adjacent items when necessary. After the first pass, the array is [3,5,7,8,2,6,1,4].
On the next pass, we merge two-element sequences. So [3,5] is merged with [7,8], and [2,6] is merged with [1,4]. The result is [3,5,7,8,1,2,4,6]. Again, every element was copied.
In the final pass the algorithm again copies every item.
There are log(n) passes, and at every pass all n items are copied. (There are also comparisons, of course, but the number is linear and no more than the number of items.) Anyway, if you're doing n operations log(n) times, then the algorithm is O(n log n).
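The pass-by-pass behaviour described here corresponds to a bottom-up (iterative) merge sort. A minimal Python sketch (my own; it leans on heapq.merge for the linear-time merge of two sorted runs):

from heapq import merge    # merges already-sorted sequences in linear time

def merge_sort_bottom_up(a):
    a = list(a)
    width = 1                                   # current run length: 1, 2, 4, ...
    while width < len(a):                       # about log2(n) passes
        out = []                                # secondary array for this pass
        for lo in range(0, len(a), 2 * width):
            left = a[lo:lo + width]
            right = a[lo + width:lo + 2 * width]
            out.extend(merge(left, right))      # every item is copied once per pass
        a = out
        width *= 2
    return a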

Selection i'th smallest number algorithm

I'm reading the Introduction to Algorithms book, second edition, the chapter about medians and order statistics, and I have a few questions about the randomized and non-randomized selection algorithms.
The problem:
Given an unordered array of integers, find the i-th smallest element in the array.
a. The Randomized_Select algorithm is simple, but I cannot understand the math that explains its running time. Is it possible to explain it without deep math, in a more intuitive way? As for me, I'd think that it should run in O(nlog n), and in the worst case it should be O(n^2), just like quicksort. On average, RandomizedPartition returns a pivot near the middle of the array, the array is divided in two on each call, and the next recursive call processes only half of the array. RandomizedPartition costs (p-r+1) <= n, so we have O(n*log n). In the worst case it would choose the maximum element of the array every time, dividing the array into parts of sizes (n-1) and 0 at each step. That's O(n^2).
The next one (the Select algorithm) is harder for me to understand than the previous one:
b. How does it differ from the previous one? Is it faster on average?
c. The algorithm consists of five steps. In the first one we divide the array into n/5 parts, each with 5 elements (besides, possibly, the last one). Then each part is sorted using insertion sort and we select the 3rd element (the median) of each. Because we have sorted these elements, we can be sure that the previous two are <= this pivot element and the last two are >= it. Then we need to select the median among these medians. The book states that we recursively call the Select algorithm on these medians. How can we do that? In the Select algorithm we use insertion sort, and if we swap two medians, do we need to swap all four (or even more, at deeper steps) elements that are "children" of each median? Or do we create a new array that contains only the previously selected medians and search for the median among them? If so, how can we put them back into the original array, given that we changed their order?
The other steps are pretty simple and look like in the randomized_partition algorithm.
Randomized select runs in O(n) expected time; look at this analysis.
Algorithm:
Randomly choose an element xj.
Split the set into the "lower than" set L and the "bigger than" set B.
If the size of the "lower than" set is i-1, we found it (xj is the i-th smallest).
If it is bigger, look up in L;
otherwise, look up in B.
The total cost is the sum of :
The cost of splitting the array of size n
The cost of lookup in L or the cost of looking up in B
Edited: I tried to restructure my post.
You can notice that:
We always continue in the set with the greater number of elements.
The number of elements in this set is n - rank(xj).
1 <= rank(xj) <= n, so 0 <= n - rank(xj) <= n - 1.
The randomness of the element xj directly affects the randomness of the number of elements which
are greater than xj (and of the number which are smaller than xj).
If xj is the element chosen, then you know that the cost is O(n) + cost(n - rank(xj)). Let's call rank(xj) = rj.
To give a good estimate we need to take the expected value of the total cost, which is
T(n) = E(cost) = sum {each possible xj} p(xj) * (O(n) + T(n - rank(xj)))
xj is random. After this it is pure math.
We obtain:
T(n) = 1/n * ( O(n) + sum {all possible values of rj when we continue} (O(n) + T(n - rj)) )
T(n) = 1/n * ( O(n) + sum {1 <= rj <= n, rj != i} (O(n) + T(n - rj)) )
Here you can change variables, vj = n - rj:
T(n) = 1/n * ( O(n) + sum {0 <= vj <= n-1, vj != n-i} (O(n) + T(vj)) )
We pull the O(n) terms out of the sum, gaining a factor of n:
T(n) = 1/n * ( O(n) + O(n^2) + sum {0 <= vj <= n-1, vj != n-i} T(vj) )
We move O(n) and O(n^2) out of the 1/n factor, which turns them into O(1) and O(n):
T(n) = O(1) + O(n) + 1/n * ( sum {0 <= vj <= n-1, vj != n-i} T(vj) )
Check the link on how this is computed.
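For concreteness, here is a minimal Python sketch of the randomized select being analysed (my own illustration; it returns the k-th smallest element):

import random

def randomized_select(A, k):
    pivot = random.choice(A)                 # the random xj
    lower = [x for x in A if x < pivot]      # the "lower than" set L
    upper = [x for x in A if x > pivot]      # the "bigger than" set B
    equal = len(A) - len(lower) - len(upper)
    if k <= len(lower):
        return randomized_select(lower, k)
    if k <= len(lower) + equal:
        return pivot                         # the pivot is the k-th smallest
    return randomized_select(upper, k - len(lower) - equal)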
For the non-randomized version :
You say yourself:
In avg randomizedPartition returns near middle of the array.
That is exactly why the randomized algorithm works, and that is exactly what is used to construct the deterministic algorithm. Ideally you would pick the pivot deterministically such that it produces a good split, but the best value for a good split is already the solution! So at each step we want a value which is good enough: at least 3/10 of the array below the pivot and at least 3/10 of the array above. To achieve this, the original array is split into groups of 5 at each step, and again that is a mathematical choice.
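To make the recursion on the medians concrete (and to answer question c: yes, you copy the group medians into their own, smaller array and recurse on that), here is a Python sketch of the deterministic Select (my own illustration; it returns the k-th smallest element):

def mom_select(A, k):
    if len(A) <= 5:
        return sorted(A)[k - 1]
    # split into groups of 5 and take each group's median
    medians = [sorted(A[i:i + 5])[len(A[i:i + 5]) // 2]
               for i in range(0, len(A), 5)]
    # recursively select the median of that new, smaller array as the pivot
    pivot = mom_select(medians, (len(medians) + 1) // 2)
    lower = [x for x in A if x < pivot]
    upper = [x for x in A if x > pivot]
    equal = len(A) - len(lower) - len(upper)
    if k <= len(lower):
        return mom_select(lower, k)
    if k <= len(lower) + equal:
        return pivot
    return mom_select(upper, k - len(lower) - equal)

The original array is never reordered here; the medians live in their own list, so the bookkeeping question about putting them back does not arise in this formulation.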
I once created an explanation for this (with diagram) on the Wikipedia page for it... http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm

Why is the runtime of the selection algorithm O(n)?

According to Wikipedia, partition-based selection algorithms such as quickselect have runtime of O(n), but I am not convinced by it. Can anyone explain why it is O(n)?
In the normal quick-sort, the runtime is O(n log n). Every time we partition the branch into two branches (greater than the pivot and lesser than the pivot), we need to continue the process in both branches, whereas quickselect only needs to process one branch. I totally understand these points.
However, if you think in the Binary Search algorithm, after we chose the middle element, we are also searching only one side of the branch. So does that make the algorithm O(1)? No, of course, the Binary Search Algorithm is still O(log N) instead of O(1). This is also the same thing as the search element in a Binary Search Tree. We only search for one side, but we still consider O(log n) instead of O(1).
Can someone explain why in quickselect, if we continue the search in one side of pivot, it is considered O(1) instead of O(log n)? I consider the algorithm to be O(n log n), O(N) for the partitioning, and O(log n) for the number of times to continue finding.
There are several different selection algorithms, from the much simpler quickselect (expected O(n), worst-case O(n^2)) to the more complex median-of-medians algorithm (Θ(n)). Both of these algorithms work by using a quicksort partitioning step (time O(n)) to rearrange the elements and position one element into its proper position. If that element is at the index in question, we're done and can just return that element. Otherwise, we determine which side to recurse on and recurse there.
Let's now make a very strong assumption - suppose that we're using quickselect (pick the pivot randomly) and on each iteration we manage to guess the exact middle of the array. In that case, our algorithm will work like this: we do a partition step, throw away half of the array, then recursively process one half of the array. This means that on each recursive call we end up doing work proportional to the length of the array at that level, but that length keeps decreasing by a factor of two on each iteration. If we work out the math (ignoring constant factors, etc.) we end up getting the following time:
Work at the first level: n
Work after one recursive call: n / 2
Work after two recursive calls: n / 4
Work after three recursive calls: n / 8
...
This means that the total work done is given by
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...)
Notice that this last term is n times the sum of 1, 1/2, 1/4, 1/8, etc. If you work out this infinite sum, despite the fact that there are infinitely many terms, the total sum is exactly 2. This means that the total work is
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...) = 2n
This may seem weird, but the idea is that if we do linear work on each level but keep cutting the array in half, we end up doing only roughly 2n work.
An important detail here is that there are indeed O(log n) different iterations here, but not all of them are doing an equal amount of work. Indeed, each iteration does half as much work as the previous iteration. If we ignore the fact that the work is decreasing, you can conclude that the work is O(n log n), which is correct but not a tight bound. This more precise analysis, which uses the fact that the work done keeps decreasing on each iteration, gives the O(n) runtime.
Of course, this is a very optimistic assumption - we almost never get a 50/50 split! - but using a more powerful version of this analysis, you can say that if you can guarantee any constant factor split, the total work done is only some constant multiple of n. If we pick a totally random element on each iteration (as we do in quickselect), then on expectation we only need to pick two elements before we end up picking some pivot element in the middle 50% of the array, which means that, on expectation, only two rounds of picking a pivot are required before we end up picking something that gives a 25/75 split. This is where the expected runtime of O(n) for quickselect comes from.
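To see that constant-multiple claim numerically, here is a tiny Python check (my own illustration, not part of the answer): even with a pessimistic 3/4 shrink factor, the total work stays around 4n.

def total_work(n, shrink):
    work, size = 0, n
    while size >= 1:          # one level of partitioning per iteration
        work += size
        size *= shrink
    return work

print(total_work(10**6, 0.50) / 10**6)   # roughly 2: the 50/50 case
print(total_work(10**6, 0.75) / 10**6)   # roughly 4: the 25/75 case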
A formal analysis of the median-of-medians algorithm is much harder because the recurrence is difficult and not easy to analyze. Intuitively, the algorithm works by doing a small amount of work to guarantee a good pivot is chosen. However, because there are two different recursive calls made, an analysis like the above won't work correctly. You can either use an advanced result called the Akra-Bazzi theorem, or use the formal definition of big-O to explicitly prove that the runtime is O(n). For a more detailed analysis, check out "Introduction to Algorithms, Third Edition" by Cormen, Leiserson, Rivest, and Stein.
Let me try to explain the difference between selection & binary search.
The binary search algorithm does O(1) work in each step. There are log(N) steps in total, and this makes it O(log(N)).
The selection algorithm performs O(n) work in each step, but this 'n' keeps reducing by half each time. There are log(N) steps in total.
This makes it N + N/2 + N/4 + ... + 1 (log(N) terms) ≈ 2N = O(N).
For binary search it is 1 + 1 + ... (log(N) times) = O(logN).
In Quicksort, the recursion tree is lg(N) levels deep and each of these levels requires O(N) amount of work. So the total running time is O(NlgN).
In Quickselect, the recursion tree is lg(N) levels deep and each level requires only half the work of the level above it. This produces the following:
N * (1/1 + 1/2 + 1/4 + 1/8 + ...)
or
N * Summation(1/2^i)
0 <= i <= lgN
The important thing to note here is that i goes from 0 to lgN, but not from 1 to N and also not to infinity.
The summation is bounded by 2. Hence Quickselect = O(2N) = O(N).
Quicksort does not have a big-O of nlogn - its worst-case runtime is n^2.
I assume you're asking about Hoare's selection algorithm (quickselect), not the naive selection algorithm that is O(kn). Like quicksort, quickselect has a worst-case runtime of O(n^2) (if bad pivots are chosen), not O(n). It runs in expected O(n) time because it only has to recurse into one side of the partition, as you point out.
Because for selection, you're not sorting, necessarily. You can simply count how many items there are which have any given value. So an O(n) median can be performed by counting how many times each value comes up, and picking the value that has 50% of items above and below it. It's 1 pass through the array, simply incrementing a counter for each element in the array, so it's O(n).
For example, if you have an array "a" of 8 bit numbers, you can do the following:
int histogram [ 256 ];   /* one bucket per possible 8-bit value; a[] and numItems are assumed given */
int i, sum;

for (i = 0; i < 256; i++)
{
    histogram [ i ] = 0;
}

/* first pass over "a": count how many times each value occurs */
for (i = 0; i < numItems; i++)
{
    histogram [ a [ i ] ]++;
}

/* walk up the buckets until half of the items have been counted */
i = 0;
sum = histogram [ 0 ];
while (2 * sum < numItems)
{
    i++;
    sum += histogram [ i ];
}
At the end, the variable "i" will contain the 8-bit value of the median. It was about 1.5 passes through the array "a". Once through the entire array to count the values, and half through it again to get the final value.

Resources