Average time complexity of finding top-k elements - algorithm

Consider the task of finding the top-k elements in a set of N independent and identically distributed floating-point values. Using a priority queue / heap, we can iterate once over all N elements and maintain a top-k set with the following operations:
If the element x is "worse" than the heap's head: discard x ⇒ complexity O(1).
If the element x is "better" than the heap's head: remove the head and insert x ⇒ complexity O(log k).
The worst-case time complexity of this approach is obviously O(N log k), but what about the average time complexity? Due to the i.i.d. assumption, the probability of the O(1) operation increases over time, and we rarely have to perform the costly O(log k) operation, especially for k << N.
Is this average time complexity documented in any citable reference? What is the average time complexity? If you have a citable reference for your answer, please include it.

Consider the i'th largest element and a particular permutation. It will be inserted into the k-sized heap if it appears before at most k-1 of the (i-1) larger elements in the permutation.
The probability of that heap insertion happening is 1 if i <= k, and k/i if i > k.
From this you can compute the expected number of heap adjustments, using linearity of expectation. It is sum(i=1 to k) 1 + sum(i=k+1 to n) k/i = k + k*(H(n) - H(k)) = k*(1 + H(n) - H(k)), where H(n) is the n'th harmonic number.
This is approximately k*log(n) for k << n. Since each heap adjustment costs O(log k) and every element is at least compared against the head once, the average total cost works out to O(n + k*log(n)*log(k)).
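For concreteness, here is a minimal Python sketch of the one-pass scan (the function name top_k is mine; heapq.heapreplace pops the head and pushes the new element in one O(log k) step). The replacement counter lets you check the k*(H(n) - H(k)) estimate for the number of costly operations empirically:

import heapq
import random

def top_k(xs, k):
    """One pass over xs; returns the k largest elements (unordered) and
    the number of O(log k) head replacements that were performed."""
    heap, replacements = [], 0
    for x in xs:
        if len(heap) < k:
            heapq.heappush(heap, x)       # fill phase: k pushes of O(log k)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)    # pop head, push x: O(log k)
            replacements += 1
        # else: x is "worse" than the heap's head, discarded in O(1)
    return heap, replacements

xs = [random.random() for _ in range(100_000)]
print(top_k(xs, 10)[1])   # typically near 10 * (H(100000) - H(10)) ≈ 92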

Related

Sorting algorithm proof and running-time

Hydrosort is a sorting algorithm. Below is the pseudocode.
/* A is the array to sort, i = start index, j = end index */
Hydrosort(A, i, j): // Let T(n) be the running time, where n = j - i + 1
n = j - i + 1 O(1)
if (n < 10) { O(1)
sort A[i…j] by insertion-sort O(n^2) //insertion sort = O(n^2) worst-case
return O(1)
}
m1 = i + 3 * n / 4 O(1)
m2 = i + n / 4 O(1)
Hydrosort(A, i, m1) T(n/2)
Hydrosort(A, m2, j) T(n/2)
Hydrosort(A, i, m1) T(n/2)
T(n) = O(n^2) + 3T(n/2), so T(n) is O(n^2). I used the 3rd case of the Master Theorem to solve this recurrence.
I have 2 questions:
Have I calculated the worst-case running time here correctly?
How would I prove that Hydrosort(A, 1, n) correctly sorts an array A of n elements?
Have I calculated the worst-case running time here correctly?
I am afraid not.
The complexity function is:
T(n) = 3T(3n/4) + CONST
This is because:
You have three recursive calls, each on a subproblem of size 3n/4.
The additive term is O(1), since all non-recursive operations are bounded by a constant (specifically, insertion sort on fewer than 10 elements is O(1)).
If you go on and solve the recurrence (Master Theorem, case 1, with a = 3 and b = 4/3), you get T(n) = Θ(n^(log_{4/3} 3)) ≈ Θ(n^3.82), which is worse than O(n^2).
How would I prove that Hydrosort(A, 1, n) correctly sorts an array A of n elements?
By strong induction. Assume the algorithm works for all problems of size smaller than n, and examine a problem of size n. For n < 10 insertion sort handles it, so that case is trivial and we ignore it.
After the first recursive call, you are guaranteed that the first 3n/4 of the array is sorted, and in particular that the first n/4 positions hold the smallest elements of that part. Each of those elements has at least n/2 larger elements within the sorted part alone, so none of them can be among the n/4 biggest overall. This means the n/4 biggest elements are somewhere between m2 and j.
After the second recursive call, since it is guaranteed to be invoked on a range containing all of the n/4 biggest elements, it places those elements, in order, at the end of the array. This means the part between m1 and j is now sorted properly, and the 3n/4 smallest elements are somewhere between i and m1.
The third recursive call sorts those 3n/4 elements properly, and since the n/4 biggest elements are already in place, the array is now sorted.
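For concreteness, here is a runnable Python transcription of the pseudocode (a sketch: sorted() on the small slice stands in for insertion sort, and integer division fixes the split points m1 and m2):

def hydrosort(a, i, j):
    """Sort a[i..j] (inclusive bounds) in place."""
    n = j - i + 1
    if n < 10:
        a[i:j + 1] = sorted(a[i:j + 1])  # stand-in for insertion sort
        return
    m1 = i + 3 * n // 4   # end of the first three-quarters
    m2 = i + n // 4       # start of the last three-quarters
    hydrosort(a, i, m1)   # sort the first ~3n/4 elements
    hydrosort(a, m2, j)   # moves the n/4 biggest to the end, sorted
    hydrosort(a, i, m1)   # sort the remaining 3n/4 smallest elements

import random
a = [random.randrange(1000) for _ in range(500)]
hydrosort(a, 0, len(a) - 1)
assert a == sorted(a)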

Why Time complexity of Fibonacci using for loop is O(n^2) and not O(n)?

Why is the time complexity calculated as O(n^2) instead of O(n) for the algorithm below?
FibList(n)
F[0..n] create array O(n)
F[0] <- 0 O(1)
F[1] <- 1 O(1)
for i from 2 to n O(n)
F[i] <- F[i-1] + F[i-2] O(n)
return F[n] O(1)
O(n) + O(1) + O(1) + O(n)*O(n) + O(1) = O(n^2)
If you assume the cost of adding an integer with k1 bits to one with k2 bits is proportional to max(k1, k2) (which is the so-called "bit" cost model, or "logarithmic" cost model), then the time complexity of the code you've produced is O(n^2).
That's because F(i) is (almost) proportional to phi^i, where phi is the Golden ratio. That means F(i) has ~i bits.
So the cost of:
for i from 2 to n
F[i] <- F[i-1] + F[i-2]
is proportional to (1 + 2 + 3 + ... + (n-1)), which is n(n-1)/2 and thus O(n^2).
If you assume that addition of arbitrary-sized integers is O(1), then the code is O(n).
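You can see the bit growth directly in Python, whose integers are arbitrary-precision (a small sketch; fib_list mirrors the pseudocode above):

def fib_list(n):
    """F[0..n] via the simple loop from the question."""
    f = [0] * (n + 1)
    if n >= 1:
        f[1] = 1
    for i in range(2, n + 1):
        # The operands here have roughly 0.694 * i bits, so under the bit
        # cost model this addition costs O(i), and the loop O(n^2) overall.
        f[i] = f[i - 1] + f[i - 2]
    return f

f = fib_list(1000)
for i in (10, 100, 1000):
    print(i, f[i].bit_length())   # bit length grows linearly in i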
For background on cost models, see this section on Wikipedia: https://en.wikipedia.org/wiki/Analysis_of_algorithms#Cost_models which says:
"One must be careful here; for instance, some analyses count an addition of two numbers as one step. This assumption may not be warranted in certain contexts. For example, if the numbers involved in a computation may be arbitrarily large, the time required by a single addition can no longer be assumed to be constant."
Incidentally, the method used in your question (writing the maximum complexity of each line and then multiplying nested ones) is not a valid way of computing tight-bound complexities in general, although it happens to work when all the complexities involved are polynomials, as they are here.

Proof of Ω(n log k) worst-case complexity in a comparison sort algorithm

I'm writing a comparison algorithm that takes n numbers and a number k as input.
It separates the n numbers into k groups so that all numbers in group 1 are smaller than all numbers in group 2, ..., which are smaller than all numbers in group k.
The numbers within the same group are not necessarily sorted.
I'm using selection(A[], left, right, k) to find the k'th element, which in my case is the (n/k)'th element, to divide the whole array into 2 pieces, and then I repeat for each piece until the initial array is divided into k parts of n/k numbers each.
It has a complexity of Θ(n log k), as it is a tree of log k levels (depth) that cost at most cn operations per level. This is linear time if log k is considered a constant.
I am asked to prove that all comparison algorithms that sort an Array[n] into k groups in this way cost Ω(n log k) in the worst case.
I've searched around here, on Google, and in my algorithms book (Kleinberg & Tardos), but I only find proofs for comparison algorithms that sort ALL the elements. Those proofs are not acceptable in my case, because they all rest on assumptions that do not match my problem, nor can they be adapted to it. (Also consider that regular quicksort with random pivot selection results in Θ(n log n), which is not linear the way Ω(n log k) is.)
You can find the proof for the general comparison-sort lower bound here:
https://www.cs.cmu.edu/~avrim/451f11/lectures/lect0913.pdf
where it is also clearly explained why my problem does not fall under the O(n log n) comparison-sort case.
Sorting requires lg(n!) = Ω(n log n) comparisons because there are n! different output permutations.
For this problem there are

    n! / ((n/k)!)^k

equivalence classes of output permutations, because the order within the k independent groups of n/k elements does not matter. We compute

    lg( n! / ((n/k)!)^k )
        = lg(n!) - k * lg((n/k)!)
        = n lg n - n - k * ((n/k) lg(n/k) - n/k) ± O(lg n + k lg(n/k))
          (write lg(...!) as a sum and bound it with two integrals;
           see https://en.wikipedia.org/wiki/Stirling's_approximation)
        = n * (lg n - lg(n/k)) ± O(lg n + k lg(n/k))
        = n lg k ± O(lg n + k lg(n/k))
        = Ω(n lg k),

since O(lg n + k lg(n/k)) = O(n), because k <= n.
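A quick numerical sanity check of the bound (a sketch; lg_classes is my name, and math.lgamma computes ln Γ(x), so lgamma(n + 1) = ln(n!)):

import math

def lg_classes(n, k):
    """lg( n! / ((n/k)!)^k ), assuming k divides n."""
    return (math.lgamma(n + 1) - k * math.lgamma(n // k + 1)) / math.log(2)

for n, k in [(1024, 2), (1024, 32), (1024, 1024)]:
    # the two printed columns agree up to the ±O(lg n + k lg(n/k)) error term
    print(n, k, round(lg_classes(n, k)), round(n * math.log2(k)))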
prove that all comparison algorithms that sort an Array[n] to k groups in this way cost Ω(n log k) in the worst case.
If the statement is read as "every such comparison algorithm runs in O(n log k)", then it is false: quickselect with a poor pivot choice (such as always using the first or last element) has a Θ(n^2) worst case.
Only some comparison algorithms achieve an O(n log k) worst case. Using median of medians (the n/5 version) for pivot selection fixes the pivot issue and makes quickselect worst-case linear. There are other algorithms that would also be O(n log k).
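For reference, a minimal sketch of the Θ(n log k) grouping itself (the names split_into_groups and quickselect_split are mine; it uses a randomized pivot, so the bound holds in expectation, and swapping in median of medians would make it worst-case):

import random

def quickselect_split(a, m):
    """Return a reordering of list a whose first m elements are the m smallest."""
    if m <= 0 or m >= len(a):
        return a
    pivot = random.choice(a)
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    if m <= len(less):
        return quickselect_split(less, m) + equal + greater
    if m <= len(less) + len(equal):
        return less + equal + greater   # the cut falls inside the equal block
    return less + equal + quickselect_split(greater, m - len(less) - len(equal))

def split_into_groups(a, k):
    """Split a into k equal groups with group 1 <= group 2 <= ... <= group k.
    Assumes k is a power of two and divides len(a); log k levels of O(n) work."""
    if k == 1:
        return [a]
    a = quickselect_split(a, len(a) // 2)
    mid = len(a) // 2
    return (split_into_groups(a[:mid], k // 2) +
            split_into_groups(a[mid:], k // 2))

groups = split_into_groups([random.random() for _ in range(64)], 8)
assert all(max(groups[t]) <= min(groups[t + 1]) for t in range(7))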

O(n) - the next permutation lexicographically

I'm just wondering: what is the efficiency (in big-O terms) of this algorithm?
Find the largest index k such that a[k] < a[k + 1]. If no such index exists, the permutation is the last permutation.
Find the largest index l such that a[k] < a[l]. Since k + 1 is such an index, l is well defined and satisfies k < l.
Swap a[k] with a[l].
Reverse the sequence from a[k + 1] up to and including the final element a[n].
As I understand it, the worst case is n steps (when k is the first element of the previous permutation), and the best case is 1 step (when k is the last element of the previous permutation).
Can I say that O(n) = n/2 ?
O(n) = n/2 makes no sense. Let f(n) = n be the running time of your algorithm. Then the right way to say it is that f(n) is in O(n). O(n) is a set of functions that are at most asymptotically linear in n.
Your averaged estimate makes the expected running time g(n) = n/2. g(n) is also in O(n); in fact O(n) = O(n/2), so your saving of half the time does not change the asymptotic complexity.
All steps in the algorithm take O(n) time asymptotically.
Your averaging is incorrect: just because the best case is O(1) and the worst case is O(n), you can't say the algorithm takes O(n) = n/2. Big-O notation is simply an upper bound on the algorithm's running time.
So the algorithm is still O(n), irrespective of the best-case scenario.
There is no such thing as O(n) = n/2.
When you derive a big-O bound you are only looking for the functional dependency on n; you don't care about constant coefficients. So there is no O(n) = n/2, just as there is no O(n) = 5n.
Asymptotically, O(n) is the same as O(n/2). In any case, if you use the algorithm to enumerate all permutations, it is performed for each of the n! of them, so the total work is much greater than your estimate (on the order of n!).
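For reference, a direct Python transcription of the four steps (a sketch; it mutates the list in place and returns False when the input was already the last permutation):

def next_permutation(a):
    """Advance list a to the next permutation in lexicographic order."""
    # Step 1: largest index k with a[k] < a[k + 1].
    k = len(a) - 2
    while k >= 0 and a[k] >= a[k + 1]:
        k -= 1
    if k < 0:
        return False          # a was the last permutation
    # Step 2: largest index l with a[k] < a[l] (l > k is guaranteed).
    l = len(a) - 1
    while a[l] <= a[k]:
        l -= 1
    # Step 3: swap a[k] and a[l].
    a[k], a[l] = a[l], a[k]
    # Step 4: reverse the suffix after position k.
    a[k + 1:] = reversed(a[k + 1:])
    return True

a = [1, 2, 3]
while True:
    print(a)                  # prints all 3! permutations in order
    if not next_permutation(a):
        break

Each step scans at most the suffix once, so a single call is O(n) in the worst case and O(1) when only the tail changes.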

Time complexity for generating binary heap from unsorted array

Can anyone explain why the time complexity of building a binary heap from an unsorted array using bottom-up heap construction is O(n)?
(Solution found so far: I found in Goodrich and Tamassia's book that the total sum of the sizes of the paths for the internal nodes while constructing the heap is 2n-1, but I still don't understand their explanation.)
Thanks.
Normal BUILD-HEAP Procedure for generating a binary heap from an unsorted array is implemented as below :
BUILD-HEAP(A)
heap-size[A] ← length[A]
for i ← length[A]/2 downto 1
do HEAPIFY(A, i)
Here the HEAPIFY procedure takes O(h) time, where h is the height of the tree, and there are O(n) such calls, making the running time O(n h). Taking h = lg n, we can say that the BUILD-HEAP procedure takes O(n lg n) time.
For tighter analysis, we can observe that heights of most nodes are small.
Actually, at any height h there can be at most CEIL(n / 2^(h+1)) nodes, which we can easily prove by induction.
So, the running time of BUILD-HEAP can be written as

    sum(h=0 to lg n) CEIL(n / 2^(h+1)) * O(h) = O(n * sum(h=0 to lg n) h/2^h).

Now, using the identity sum(k=0 to ∞) k*x^k = x/(1-x)^2 and putting x = 1/2:

    sum(h=0 to ∞) h/2^h = (1/2) / (1 - 1/2)^2 = 2.

Hence the running time becomes

    O(n * sum(h=0 to lg n) h/2^h) <= O(n * sum(h=0 to ∞) h/2^h) = O(n).

So, this gives a running time of O(n).
N.B. The analysis is taken from this.
Check out the "Building a heap" section on Wikipedia:
A heap could be built by successive insertions. This approach requires O(n log n) time because each insertion takes O(log n) time and there are n elements. However this is not the optimal method. The optimal method starts by arbitrarily putting the elements on a binary tree, respecting the shape property. Then starting from the lowest level and moving upwards, shift the root of each subtree downward as in the deletion algorithm until the heap property is restored.
http://en.wikipedia.org/wiki/Binary_heap
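For comparison, here is the bottom-up construction in Python (a sketch; this is essentially what the standard library's heapq.heapify does, for a min-heap stored in an array with the children of node i at 2i+1 and 2i+2):

def sift_down(a, i, n):
    """Restore the heap property for the subtree rooted at i; O(height)."""
    while True:
        smallest = i
        left, right = 2 * i + 1, 2 * i + 2
        if left < n and a[left] < a[smallest]:
            smallest = left
        if right < n and a[right] < a[smallest]:
            smallest = right
        if smallest == i:
            return
        a[i], a[smallest] = a[smallest], a[i]
        i = smallest

def build_heap(a):
    """Bottom-up heap construction: O(n) total."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):   # nodes n//2 .. n-1 are leaves
        sift_down(a, i, n)

a = [9, 4, 7, 1, 0, 8, 5, 2, 6, 3]
build_heap(a)
assert all(a[i] <= a[c] for i in range(len(a)) for c in (2*i+1, 2*i+2) if c < len(a))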
