Geometric mean pivot - algorithm

I am a PhD student working on a project.
I want to know what the worst-case partitioning time complexity will be if I use the geometric mean as the pivot to partition an array into two approximately equal parts.
Results:
Vladimir Yaroslavskiy dual-pivot quickselect partition: 2307601193 nanoseconds
Geometric mean pivot quickselect partition: 8661916394 nanoseconds
We know that computing the geometric mean is very costly and makes the partitioning step much slower. There are many algorithms that find the median faster than quickselect, but in our project we are not going to use them directly.
Example of the geometric mean pivot:
Input: 789654123, 700, 10^20, 588412, 900, 5, 500
Geometric mean: (789654123 * 700 * 10^20 * 588412 * 900 * 5 * 500)^(1/7) ≈ 1846471
Pass 1: 500, 700, 5, 588412, 900 |<--->| 10^20, 789654123
Geometric mean: (500 * 700 * 5 * 588412 * 900)^(1/5) ≈ 984
Pass 2: 500, 700, 5, 900 |<--->| 588412, 10^20, 789654123
In this way we can divide the array into two approximately equal parts.
My question is: what will the worst-case (most unbalanced partitioning) time complexity be if I use the geometric mean as the pivot to partition the array into two approximately equal parts?
Note: there are no negative numbers in the data set.

The geometric mean is the exponential of the arithmetic mean of the logarithms, so we just need to find something where the arithmetic mean breaks down badly and take the exponential of it. One example would be factorials: if you have the list
1!, 2!, 3!, 4!, ..., n!
taking the arithmetic mean as the pivot will split off exactly the last element. Proof: the sum of this array is larger than its last element,
s_n > n!
Consequently the arithmetic mean is larger than the second-to-last element:
av_n = s_n/n > (n-1)!
At the same time av_n < n!, since all n elements are at most n! and they are not all equal, so only the last element ends up above the pivot. As a result quickselect requires n rounds and its performance will be O(n^2), in contrast to the average-case performance, which would be O(n). To get the same behavior with the geometric mean, consider the list obtained by exponentiating these values,
a^(1!), a^(2!), ..., a^(n!)
for any a>1 or 0<a<1. The resulting performance of a quick-select based on the geometric mean would be O(n^2).
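Here is a small sketch (my own illustration, not part of the original claim) that counts partitioning rounds when the arithmetic mean is used as the pivot on 1!, 2!, ..., n!; every round peels off only the largest remaining element:
import math

def rounds_with_mean_pivot(values, target_rank=0):
    # Quickselect-style loop using the arithmetic mean as the pivot,
    # recursing into whichever side contains the requested rank.
    rounds = 0
    vals = list(values)
    while len(vals) > 1:
        pivot = sum(vals) / len(vals)
        low = [v for v in vals if v <= pivot]
        high = [v for v in vals if v > pivot]
        rounds += 1
        if target_rank < len(low):
            vals = low
        else:
            target_rank -= len(low)
            vals = high
    return rounds

facts = [math.factorial(i) for i in range(1, 31)]
print(rounds_with_mean_pivot(facts))   # prints 29: one round per element for n = 30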

The numbers 2^(2^0), 2^(2^1), 2^(2^2), ..., 2^(2^(n-1)) have geometric mean
(2^(2^0) * 2^(2^1) * 2^(2^2) * ... * 2^(2^(n-1)))^(1/n)
= (2^(2^0 + 2^1 + ... + 2^(n-1)))^(1/n)
= (2^(2^n - 1))^(1/n)
= 2^((2^n - 1) / n)
≈ 2^(2^n / n)
= 2^(2^(n - log n))
Notice that this number is (approximately) 2^(2^(n - log n)). This means that your partition will only split approximately log n terms into the second group of the array, which is a very small number compared to the overall array size. Consequently, you'd expect that for data sets of this sort, you'd have closer to Θ(n^2) performance than to Θ(n log n) performance. However, I can't get an exact asymptotic value for this because I don't know exactly how many rounds there will be.
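A quick numerical check of that estimate (my own sketch; it works in the exponent domain, since log2 of the geometric mean is just the arithmetic mean of the log2 values, so the huge numbers never have to be materialized):
import math

n = 1024
exponents = [2 ** i for i in range(n)]    # log2 of each element 2^(2^i)
pivot_log = sum(exponents) // n           # log2 of the geometric mean (integer approximation)
above = sum(1 for e in exponents if e > pivot_log)
print(above, math.log2(n))                # prints "10 10.0": about log2(n) elements land above the pivot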
Hope this helps!

Related

What makes Bucket Sort good?

So I stumbled upon non-comparison-based sorting algorithms, bucket sort to be exact, and I couldn't quite get why it is good.
I have a thought, but I need somebody to confirm it.
Let's assume I want to sort a 1000-element array. If it were uniformly distributed and bucketed into 10 buckets, each bucket would have about 100 elements.
sorting 100 elements 10 times using an n log(n) algorithm = 10 * 100 log(100) = 1000 log(100) = 2000 (taking logs base 10)
while sorting 1000 elements using an n log(n) algorithm = 1000 log(1000) = 3000
So the algorithm exploits the fact that if n = m + l then (m + l)^2 > m^2 + l^2, and the same applies to n log(n) algorithms,
so the more uniformly bucketed the data is, the better the performance of the bucket sort.
Is this right?
And what would the optimum number of buckets be? (I feel it's a space-time trade-off, but one that also depends on how uniform the data being sorted is.)
But you have to take into account that the bucketing step has a complexity of 1000.
This gives you:
bucket sort: 1000 + 10 * 100 log(100) = 3000
comparison sort: 1000 * log(1000) = 3000
But you can apply the bucketing strategy again to sort the smaller arrays. This is radix sort: https://en.wikipedia.org/wiki/Radix_sort
The complexity advertised is O(n*w), where w is the number of bits needed to represent an element. Linear? Better than merge sort? Wait a minute, how big is w usually? Right, for the usual sets of distinct items you need about log(n) bits to represent an element, so we are back to n log(n).
As you said, this is a time/memory trade-off though, and radix sort is for when you have a fixed memory budget (but who doesn't?). If you can grow your memory linearly with the input size, take n buckets and you have an O(n) sort.
An example reference (there are many!): https://www.radford.edu/nokie/classes/360/Linear.Sorts.html .
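For concreteness, here is a minimal bucket sort sketch (my own, assuming values uniformly distributed in [0, 1)); the bucketing pass is the extra O(n) step mentioned above, and each bucket is then finished with an ordinary comparison sort:
import random

def bucket_sort(values, num_buckets=10):
    buckets = [[] for _ in range(num_buckets)]
    for v in values:                                          # the O(n) bucketing pass
        buckets[min(int(v * num_buckets), num_buckets - 1)].append(v)
    result = []
    for b in buckets:                                         # finish each (hopefully small) bucket
        result.extend(sorted(b))
    return result

data = [random.random() for _ in range(1000)]
assert bucket_sort(data) == sorted(data)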

Binary vs Linear searches for unsorted N elements

I am trying to understand a formula for when we should use quicksort. For instance, we have an array with N = 1,000,000 elements. If we will search only once, we should use a simple linear search, but if we will search, say, 10 times, we should sort the array (O(n log n)) first. How can I detect the threshold, i.e. for which number of searches and which input array size I should sort and then use binary search?
You want to solve an inequality that roughly might be described as
t * n > C * n * log(n) + t * log(n)
where t is the number of searches and C is some constant for the sort implementation (to be determined experimentally). Once you have estimated this constant, you can solve the inequality numerically (with some uncertainty, of course).
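For example, a brute-force numeric solve of that inequality might look like this (C = 5 is just an assumed value; measure it for your own sort implementation):
import math

def min_searches(n, C=5.0):
    # smallest t for which t linear searches cost more than one sort plus t binary searches
    t = 1
    while t * n <= C * n * math.log2(n) + t * math.log2(n):
        t += 1
    return t

print(min_searches(1_000_000))   # prints 100 with the assumed C = 5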
Like you already pointed out, it depends on the number of searches you want to do. A good threshold can come out of the following statement:
n * log_b(n) + x * log_2(n) <= x * n/2
where x is the number of searches, n the input size, and b the base of the logarithm for the sort, which depends on the partitioning you use.
When this statement evaluates to true, you should switch methods from linear search to sort and search.
Generally speaking, a linear search through an unordered array will take n/2 steps on average, though this average will only play a big role once x approaches n. If you want to stick with big Omicron or big Theta notation then you can omit the /2 in the above.
Assuming n elements and m searches, with crude approximations
the cost of the sort will be C0.n.log n,
the cost of the m binary searches C1.m.log n,
the cost of the m linear searches C2.m.n,
with C2 ~ C1 < C0.
Now you compare
C0.n.log n + C1.m.log n vs. C2.m.n
or
C0.n.log n / (C2.n - C1.log n) vs. m
For reasonably large n, the breakeven point is about C0.log n / C2.
For instance, taking C0 / C2 = 5, n = 1000000 gives m = 100.
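To make that concrete, here is the breakeven from the formula above evaluated numerically (the constants C0 = 5, C1 = C2 = 1 are assumed values, matching the example ratio C0/C2 = 5):
import math

def breakeven_searches(n, c0=5.0, c1=1.0, c2=1.0):
    # m at which C0*n*log n + C1*m*log n equals C2*m*n
    return c0 * n * math.log2(n) / (c2 * n - c1 * math.log2(n))

print(breakeven_searches(1_000_000))   # about 100, matching the estimate above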
You should plot the complexities of both operations.
Linear search: O(n)
Sort and binary search: O(n log n + log n)
In the plot, you will see for which values of n it makes sense to choose the one approach over the other.
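One way to make that plot (a sketch with constants omitted, comparing total cost against the number of searches k for a fixed n):
import numpy as np
import matplotlib.pyplot as plt

n = 1_000_000
k = np.arange(1, 201)                                  # number of searches
plt.plot(k, k * n, label="k linear searches: k*n")
plt.plot(k, n * np.log2(n) + k * np.log2(n), label="sort once + k binary searches")
plt.xlabel("number of searches k")
plt.ylabel("operations (constants omitted)")
plt.legend()
plt.show()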
This actually turned into an interesting question for me as I looked into the expected runtime of a quicksort-like algorithm when the expected split at each level is not 50/50.
The first question I wanted to answer was: for random data, what is the average split at each level? It surely must be greater than 50% (for the larger subdivision). Given an array of size N of random values, the smallest value gives a subdivision of (1, N-1), the second smallest value gives a subdivision of (2, N-2), and so on. I put this in a quick script:
split = 0
for x in range(10000):
    # fraction of the array on the larger side when the pivot lands at position x
    split += float(max(x, 10000 - x)) / 10000
split /= 10000
print(split)
And got exactly 0.75 as an answer. I'm sure I could show that this is always the exact answer, but I wanted to move on to the harder part.
Now, let's assume that even 25/75 split follows an nlogn progression for some unknown logarithm base. That means that num_comparisons(n) = n * log_b(n) and the question is to find b via statistical means (since I don't expect that model to be exact at every step). We can do this with a clever application of least-squares fitting after we use a logarithm identity to get:
C(n) = n * log(n) / log(b)
where now the logarithm can have any base, as long as log(n) and log(b) use the same base. This is a linear equation just waiting for some data! So I wrote another script to generate an array of xs and filled it with C(n) and ys and filled it with n*log(n) and used numpy to tell me the slope of that least squares fit, which I expect to equal 1 / log(b). I ran the script and got b inside of [2.16, 2.3] depending on how high I set n to (I varied n from 100 to 100'000'000). The fact that b seems to vary depending on n shows that my model isn't exact, but I think that's okay for this example.
To actually answer your question now, with these assumptions, we can solve for the cutoff point of when: N * n/2 = n*log_2.3(n) + N * log_2.3(n). I'm just assuming that the binary search will have the same logarithm base as the sorting method for a 25/75 split. Isolating N you get:
N = n*log_2.3(n) / (n/2 - log_2.3(n))
If your number of searches N exceeds the quantity on the RHS (where n is the size of the array in question) then it will be more efficient to sort once and use binary searches on that.
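Plugging numbers into that expression (the base 2.3 is the fitted value from the experiment above, so treat it as approximate):
import math

def search_cutoff(n, b=2.3):
    # N = n*log_b(n) / (n/2 - log_b(n))
    log_b_n = math.log(n, b)
    return n * log_b_n / (n / 2 - log_b_n)

print(search_cutoff(1_000_000))   # roughly 33 searches for a million elements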

time complexity to find k elements in unsorted array using quick partition [duplicate]

According to Wikipedia, partition-based selection algorithms such as quickselect have runtime of O(n), but I am not convinced by it. Can anyone explain why it is O(n)?
In the normal quick-sort, the runtime is O(n log n). Every time we partition the branch into two branches (greater than the pivot and lesser than the pivot), we need to continue the process in both branches, whereas quickselect only needs to process one branch. I totally understand these points.
However, think about the binary search algorithm: after we choose the middle element, we also search only one side. So does that make the algorithm O(1)? No, of course not; binary search is still O(log N), not O(1). It is the same when searching for an element in a binary search tree: we only search one side, but we still consider it O(log n), not O(1).
Can someone explain why in quickselect, if we continue the search on one side of the pivot, it is considered O(1) instead of O(log n)? I would consider the algorithm to be O(n log n): O(n) for the partitioning and O(log n) for the number of times we continue.
There are several different selection algorithms, from the much simpler quickselect (expected O(n), worst-case O(n^2)) to the more complex median-of-medians algorithm (Θ(n)). Both of these algorithms work by using a quicksort partitioning step (time O(n)) to rearrange the elements and position one element into its proper position. If that element is at the index in question, we're done and can just return that element. Otherwise, we determine which side to recurse on and recurse there.
Let's now make a very strong assumption - suppose that we're using quickselect (pick the pivot randomly) and on each iteration we manage to guess the exact middle of the array. In that case, our algorithm will work like this: we do a partition step, throw away half of the array, then recursively process one half of the array. This means that on each recursive call we end up doing work proportional to the length of the array at that level, but that length keeps decreasing by a factor of two on each iteration. If we work out the math (ignoring constant factors, etc.) we end up getting the following time:
Work at the first level: n
Work after one recursive call: n / 2
Work after two recursive calls: n / 4
Work after three recursive calls: n / 8
...
This means that the total work done is given by
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...)
Notice that this last term is n times the sum of 1, 1/2, 1/4, 1/8, etc. If you work out this infinite sum, despite the fact that there are infinitely many terms, the total sum is exactly 2. This means that the total work is
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...) = 2n
This may seem weird, but the idea is that if we do linear work on each level but keep cutting the array in half, we end up doing only roughly 2n work.
An important detail here is that there are indeed O(log n) different iterations here, but not all of them are doing an equal amount of work. Indeed, each iteration does half as much work as the previous iteration. If we ignore the fact that the work is decreasing, you can conclude that the work is O(n log n), which is correct but not a tight bound. This more precise analysis, which uses the fact that the work done keeps decreasing on each iteration, gives the O(n) runtime.
Of course, this is a very optimistic assumption - we almost never get a 50/50 split! - but using a more powerful version of this analysis, you can say that if you can guarantee any constant factor split, the total work done is only some constant multiple of n. If we pick a totally random element on each iteration (as we do in quickselect), then on expectation we only need to pick two elements before we end up picking some pivot element in the middle 50% of the array, which means that, on expectation, only two rounds of picking a pivot are required before we end up picking something that gives a 25/75 split. This is where the expected runtime of O(n) for quickselect comes from.
A formal analysis of the median-of-medians algorithm is much harder because the recurrence is difficult and not easy to analyze. Intuitively, the algorithm works by doing a small amount of work to guarantee a good pivot is chosen. However, because there are two different recursive calls made, an analysis like the above won't work correctly. You can either use an advanced result called the Akra-Bazzi theorem, or use the formal definition of big-O to explicitly prove that the runtime is O(n). For a more detailed analysis, check out "Introduction to Algorithms, Third Edition" by Cormen, Leiserson, Rivest, and Stein.
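If it helps to see the one-sided recursion in code, here is a minimal quickselect sketch (my own, with a random pivot and a list-based partition rather than an in-place one):
import random

def quickselect(arr, k):
    """Return the k-th smallest (0-based) element of arr."""
    pivot = random.choice(arr)
    low = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    high = [x for x in arr if x > pivot]
    if k < len(low):
        return quickselect(low, k)                     # recurse into one side only
    if k < len(low) + len(equal):
        return pivot                                   # the pivot sits at the requested rank
    return quickselect(high, k - len(low) - len(equal))

data = [random.randrange(1000) for _ in range(99)]
assert quickselect(data, 49) == sorted(data)[49]       # the median of 99 values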
Let me try to explain the difference between selection & binary search.
The binary search algorithm does O(1) work at each step. There are log(N) steps in total, which makes it O(log(N)).
The selection algorithm performs O(n) work at each step, but this n keeps halving each time, over a total of log(N) steps.
This makes it N + N/2 + N/4 + ... + 1 (log(N) times) = 2N = O(N)
For binary search it is 1 + 1 + ... (log(N) times) = O(logN)
In Quicksort, the recursion tree is lg(N) levels deep and each of these levels requires O(N) amount of work. So the total running time is O(NlgN).
In Quickselect, the recursion tree is lg(N) levels deep and each level requires only half the work of the level above it. This produces the following:
N * (1/1 + 1/2 + 1/4 + 1/8 + ...)
or
N * Summation(1/2^i)
0 <= i <= lgN
The important thing to note here is that i goes from 0 to lgN, but not from 0 to N and also not to infinity.
The summation evaluates to less than 2. Hence Quickselect = O(2N) = O(N).
Quicksort does not have a big-O of n log n - its worst-case runtime is n^2.
I assume you're asking about Hoare's selection algorithm (or quickselect), not the naive selection algorithm that is O(kn). Like quicksort, quickselect has a worst-case runtime of O(n^2) (if bad pivots are chosen), not O(n). It runs in expected O(n) time because it only recurses into one side, as you point out.
Because for selection, you're not sorting, necessarily. You can simply count how many items there are which have any given value. So an O(n) median can be performed by counting how many times each value comes up, and picking the value that has 50% of items above and below it. It's 1 pass through the array, simply incrementing a counter for each element in the array, so it's O(n).
For example, if you have an array "a" of 8-bit numbers, you can do the following:
int histogram[256];
int i, sum;

for (i = 0; i < 256; i++)
{
    histogram[i] = 0;
}
for (i = 0; i < numItems; i++)       /* one counting pass over the numItems values in a */
{
    histogram[a[i]]++;
}
i = 0;
sum = histogram[0];
while (sum < (numItems + 1) / 2)     /* walk the buckets until half the items are covered */
{
    i++;
    sum += histogram[i];
}
At the end, the variable "i" contains the 8-bit value of the median. That is one pass through the array "a" to count the values, plus a walk over (on average) half of the 256-entry histogram to reach the final value.

Merge sort complexity

We know that merge sort has time complexity O(nlogn) for the below algorithm:
void mergesort(n elements) {
    mergesort(left half);             ------------ (1)
    mergesort(right half);            ------------ (2)
    merge(left half, right half);
}
What will be the Time complexities for the following implementations?
(1)
void mergesort(n elements) {
    mergesort(first quarter);              ------------ (1)
    mergesort(remaining three quarters);   ------------ (2)
    merge(first quarter, remaining three quarters);
}
(2)
void mergesort(n elements) {
    mergesort(first quarter);      ------------ (1)
    mergesort(second quarter);     ------------ (2)
    mergesort(third quarter);      ------------ (3)
    mergesort(fourth quarter);     ------------ (4)
    merge(first quarter, second quarter, third quarter, fourth quarter);
}
Please elaborate how you find the complexities.
Still O(n log n), because log base 4 of n = log n / log 4, and 1 / log 4 is just a constant factor.
[EDIT]
The recurrence relation of the merge sort algorithm with a k-way split is as follows. I assume that merging k sorted arrays with a total of n elements costs n log2(k), where log2 denotes log base 2.
T(1) = 0
T(n) = n log2(k) + k T(n/k)
I resolved the recurrence relation to:
T(n) = n log2(n)
regardless of the value of k.
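A quick numerical check of that claim (a sketch; it evaluates the recurrence directly for n a power of k):
import math

def T(n, k):
    # T(1) = 0;  T(n) = n*log2(k) + k*T(n/k)
    if n <= 1:
        return 0
    return n * math.log2(k) + k * T(n // k, k)

for k in (2, 4, 8):
    n = k ** 6
    print(T(n, k), n * math.log2(n))   # the two values agree for every k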
Note that this is not an exact answer to your question but a hint.
First we need to understand how the time complexity of the default merge sort comes out to be n log(n).
If we have 8 elements and use the default mergesort approach, repeatedly dividing in half until we reach groups containing only one element, it takes us 3 levels of division.
So merging work is done across 3 levels over the 8 elements; that is why the time complexity is 3*8, i.e. (log N)*N.
If you change the default partition from halves to some other proportion, you have to count how many levels it takes to reach groups of 1 element.
Also note that this answer only aims to explain how the complexity is calculated. The big-O complexity of all the partition approaches is the same, and the other two partitionings, if implemented efficiently, will also have complexity N log(N).
All three of the algorithms you posted are O(n log n), just with slightly different constants.
The basic idea is that it takes log(n) passes, and in each pass you examine n items. It doesn't matter how large your partitions are, and in fact you can have varying sized partitions. It always works out to O(n log n).
The runtime difference will be in the merge method. Merging sorted lists is an O(n log k) operation, where n is the total number of items to be merged, and k is the number of lists. So merging two lists is n * log(2), which works out to n (because log2(2) == 1).
See my answer to How to sort K sorted arrays, with MERGE SORT for more information.
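As a small illustration, Python's heapq.merge does exactly this kind of heap-based k-way merge of already-sorted inputs:
import heapq

runs = [[1, 4, 7], [2, 5, 8], [0, 3, 6, 9]]   # k = 3 sorted lists, n = 10 items in total
merged = list(heapq.merge(*runs))             # heap-based k-way merge: O(n log k) comparisons
print(merged)                                 # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]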

Selection i'th smallest number algorithm

I'm reading the Introduction to Algorithms book, second edition, the chapter about medians and order statistics, and I have a few questions about the randomized and non-randomized selection algorithms.
The problem:
Given an unordered array of integers, find the i-th smallest element in the array.
a. The Randomized_Select algorithm is simple, but I cannot understand the math that explains its running time. Is it possible to explain it without deep math, in a more intuitive way? To me it seems it should be O(n log n), and in the worst case O(n^2), just like quicksort. On average randomizedPartition returns a pivot near the middle of the array, the array is divided in two on each call, and the next recursive call processes only half of the array. Each randomizedPartition costs (p-r+1) <= n, so we have O(n log n). In the worst case it would choose the maximum element of the array every time and divide the array into parts of size (n-1) and 0 at each step. That's O(n^2).
The next one (the Select algorithm) is even harder to understand than the previous one:
b. How does it differ from the previous one? Is it faster on average?
c. The algorithm consists of five steps. In the first one we divide the array into n/5 parts, each with 5 elements (except possibly the last one). Then each part is sorted using insertion sort and we select the 3rd element (the median) of each. Because these elements are sorted, we can be sure that the two before the median are <= it and the two after are >= it. Then we need to select the median among these medians. The book states that we recursively call the Select algorithm on these medians. How can we do that? In the Select algorithm we use insertion sort, and if we swap two medians, we would have to swap all four (or even more, at deeper levels) elements that are "children" of each median. Or do we create a new array that contains only the previously selected medians and search for the median among them? If yes, how do we put them back into the original array, given that we changed their order?
The other steps are pretty simple and look like those in the randomized_partition algorithm.
The randomized select runs in expected O(n); look at this analysis.
Algorithm:
Randomly choose an element xj
Split the set into a "lower than" set L and a "bigger than" set B
If the size of the "lower than" set is i - 1, we have found it (the chosen element is the answer)
If its size is bigger, look up in L
Otherwise, look up in B
The total cost is the sum of:
the cost of splitting the array of size n, and
the cost of the lookup in L or the cost of the lookup in B
Edited: I Tried to restructure my post
You can notice that:
We always continue in the set with the greater number of elements
The number of elements in this set is n - rank(xj)
1 <= rank(xj) <= n, so 0 <= n - rank(xj) <= n - 1
The randomness of the element xj directly determines the randomness of the number of elements that are greater than xj (and of the number that are smaller than xj)
If xj is the element chosen, then you know that the cost is O(n) + cost(n - rank(xj)). Let's write rank(xj) = rj.
To give a good estimate we need to take the expected value of the total cost, which is
T(n) = E(cost) = sum {each possible xj}p(xj)(O(n) + T(n - rank(xj)))
xj is random. After this it is pure math.
We obtain:
T(n) = 1/n * (O(n) + sum over all possible values of rj when we continue of (O(n) + T(n - rj)))
T(n) = 1/n * (O(n) + sum {1 <= rj <= n, rj != i} (O(n) + T(n - rj)))
Here you can change variables, vj = n - rj:
T(n) = 1/n * (O(n) + sum {0 <= vj <= n - 1, vj != n - i} (O(n) + T(vj)))
Pulling the O(n) terms out of the sum gains a factor:
T(n) = 1/n * (O(n) + O(n^2) + sum {0 <= vj <= n - 1, vj != n - i} T(vj))
Pulling the O(n) and O(n^2) outside the 1/n loses a factor:
T(n) = O(1) + O(n) + 1/n * sum {0 <= vj <= n - 1, vj != n - i} T(vj)
Check the link on how this is computed.
For the non-randomized version :
You say yourself:
On average randomizedPartition returns a pivot near the middle of the array.
That is exactly why the randomized algorithm works, and that is exactly what is used to construct the deterministic algorithm. Ideally you want to pick the pivot deterministically such that it produces a good split, but the best value for a good split is already the solution! So at each step they want a value which is good enough: "at least 3/10 of the array below the pivot and at least 3/10 of the array above". To achieve this they split the original array into groups of 5 at each step, and again it is a mathematical choice.
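To make the "recurse on the medians" step concrete, here is my own compact sketch (not the book's in-place pseudocode): the group medians are copied into a separate list, the algorithm recurses on that copy to get the pivot, and the original array's order never has to be restored, because the pivot is only used to partition:
def select(arr, i):
    """Return the i-th smallest (0-based) element of arr."""
    if len(arr) <= 5:
        return sorted(arr)[i]
    # Split into groups of 5 and take each group's median via a tiny sort.
    medians = [sorted(arr[j:j + 5])[len(arr[j:j + 5]) // 2]
               for j in range(0, len(arr), 5)]
    # Recurse on the separate list of medians to get the pivot; the original
    # array is left untouched, so nothing has to be "put back" afterwards.
    pivot = select(medians, len(medians) // 2)
    # Partition the original array around that pivot (three-way, for duplicates).
    low = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    high = [x for x in arr if x > pivot]
    # Recurse into the single part that contains rank i.
    if i < len(low):
        return select(low, i)
    if i < len(low) + len(equal):
        return pivot
    return select(high, i - len(low) - len(equal))

print(select([7, 1, 9, 3, 14, 8, 2, 5, 11], 4))   # prints 7, the median of the nine values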
I once created an explanation for this (with diagram) on the Wikipedia page for it... http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
