Kth largest element in a max-heap - algorithm

I'm trying to come up with something to solve the following:
Given a max-heap represented as an array, return the kth largest element without modifying the heap. I was asked to do it in linear time, but was told it can be done in log time.
I thought of a solution:
Use a second max-heap and fill it with k or k+1 values (by breadth-first traversal of the original one), then pop k elements and get the desired one. I suppose this should be O(N + log N) = O(N)
Is there a better solution, perhaps in O(logN) time?

A max-heap can be shaped in many ways: the best case is a completely sorted array, while at the other extreme the heap can have a totally asymmetric structure.
You can see this here:
In the first case, the kth largest element is in the kth position, so you can find it in O(1) with an array representation of the heap.
But in general you'll need to examine between k and 2k elements and sort them (or partially sort them with another heap). As far as I know, that's O(k·log(k)).
And the algorithm:
Input:
    Integer kth <- 8
    Heap heap <- {19,18,10,17,14,9,4,16,15,13,12}
BEGIN
    Heap positionHeap <- Heap with comparator: ((n0,n1) -> compare(heap[n1], heap[n0]))
    Integer childPosition
    Integer candidatePosition <- 0
    Integer count <- 0
    positionHeap.push(candidatePosition)
    WHILE (count < kth) DO
        candidatePosition <- positionHeap.pop()
        childPosition <- candidatePosition * 2 + 1
        IF (childPosition < size(heap)) THEN
            positionHeap.push(childPosition)
            childPosition <- childPosition + 1
            IF (childPosition < size(heap)) THEN
                positionHeap.push(childPosition)
            END-IF
        END-IF
        count <- count + 1
    END-WHILE
    print heap[candidatePosition]
END-BEGIN
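For reference, here is a minimal runnable Python sketch of the same idea (the function name is mine), using heapq with negated values to simulate a max-heap of (value, index) pairs:

import heapq

def kth_largest(heap, k):
    # Return the kth largest element of a max-heap stored in an array,
    # without modifying it. O(k log k): k pops, each pop/push O(log k).
    # heapq is a min-heap, so values are negated.
    candidates = [(-heap[0], 0)]
    for _ in range(k - 1):
        _, i = heapq.heappop(candidates)
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(heap):
                heapq.heappush(candidates, (-heap[child], child))
    return -candidates[0][0]

print(kth_largest([19, 18, 10, 17, 14, 9, 4, 16, 15, 13, 12], 8))  # 12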
EDITED
I found "Optimal Algorithm of Selection in a min-heap" by Frederickson here:
ftp://paranoidbits.com/ebooks/An%20Optimal%20Algorithm%20for%20Selection%20in%20a%20Min-Heap.pdf

No, there's no O(log n)-time algorithm, by a simple cell probe lower bound. Suppose that k is a power of two (without loss of generality) and that the heap looks like (min-heap incoming because it's easier to label, but there's no real difference)
1
2 3
4 5 6 7
.............
permutation of [k, 2k).
In the worst case, we have to read the entire permutation, because there are no order relations imposed by the heap, and as long as k is not found, it could be in any location not yet examined. This takes time Omega(k), matching the (complicated!) algorithm posted by templatetypedef.

To the best of my knowledge, there's no easy algorithm for solving this problem. The best algorithm I know of is due to Frederickson and it isn't easy. You can check out the paper here, but it might be behind a paywall. It runs in time O(k) and this is claimed to be the best possible time, so I suspect that a log-time solution doesn't exist.
If I find a better algorithm than this, I'll be sure to let you know.
Hope this helps!

Max-heap in an array: element at i is larger than elements at 2*i+1 and 2*i+2 (i is 0-based)
You'll need another max heap (insert, pop, empty) with element pairs (value, index) sorted by value. Pseudocode (without boundary checks):
input: k
1. insert (at(0), 0)
2. (v, i) <- pop and k <- k - 1
3. if k == 0 return v
4. insert (at(2*i+1), 2*i+1) and insert (at(2*i+2), 2*i+2)
5. goto 2
Runtime evaluation
array access at(i): O(1)
insertion into a heap: O(log(heap size))
the inserts at step 4 take at most O(log k) each, since the heap of pairs holds at most k + 1 elements
statement 3 is reached at most k times
total runtime: O(k log k)

Related

Find an element that occurs at least k times in a sorted array in log(n) time

Given a sorted array of n elements and a number k, is it possible to find an element that occurs more than k times, in log(n) time? If there is more than one number that occurs more than k times, any of them are acceptable.
If yes, how?
Edit:
I'm able to solve the problem in linear time, and I'm happy to post that solution here, but it's fairly straightforward to solve it in O(n). I'm completely stumped when it comes to making it work in log(n), though, and that's what my question is about.
Here is an O(n/k · log k) solution:
i = 0
while i + k - 1 < n:                           # stay in bounds
    if arr[i] == arr[i+k-1]:
        produce arr[i] as dupe
        i = min { j | arr[j] > arr[i] }        # binary search
    else:
        c = min { j | arr[j] == arr[i+k-1] }   # binary search
        i = c
The idea is: you check the element at index i+k-1. If it matches the element at index i, good, it's a dupe. Otherwise, you don't need to check all the elements between i and i+k-1, only the one with the same value as arr[i+k-1].
You do need to look back for the earliest index of this element, but you are guaranteed to exceed the index i+k by the next iteration, making the total number of iterations O(n/k), each taking O(log k) time.
This is asymptotically better than a linear-time algorithm, especially for large values of k (the algorithm degenerates to O(log n) when k is in O(n), for example when looking for an element that repeats with frequency at least 0.1).
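A minimal Python sketch of this window-skipping idea (returning the first qualifying element, which the question allows; names are mine):

import bisect

def find_k_dupe(arr, k):
    # arr is sorted; return an element occurring at least k times, or None.
    n = len(arr)
    i = 0
    while i + k - 1 < n:
        if arr[i] == arr[i + k - 1]:
            return arr[i]
        # Jump to the first occurrence of arr[i+k-1] within the window;
        # the binary search over a window of size k costs O(log k).
        i = bisect.bisect_left(arr, arr[i + k - 1], i + 1, i + k)
    return None

print(find_k_dupe([1, 1, 2, 2, 2, 3, 4], 3))  # 2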
Not in general. For example, if k = 2, no algorithm can guarantee to find a duplicate without inspecting every element of the array in the worst case.

Find kth number in sum array

Given an array A with N elements, I need to find a pair (i, j) such that i is not equal to j and, if we write out the sums A[i]+A[j] for all pairs (i, j), the sum A[i]+A[j] is at the kth position.
Example: let N=4 and A=[1 2 3 4]. If K=3 the answer is 5, since the sum array is [3,4,5,5,6,7].
I can't enumerate all pairs (i, j), as N can go up to 100000. Please help me solve this problem.
I mean something like this :
int len = N*(N-1)/2;   // number of pairs with i < j
int sum[len];
int count = 0;
for (int i = 0; i < N; i++) {
    for (int j = i+1; j < N; j++) {
        sum[count] = A[i] + A[j];
        count++;
    }
}
// Then just find the kth element.
We can't go with this approach
A solution based on the fact that K <= 50: let's take the K + 1 smallest elements of the array, in sorted order. Now we can just try all their pairwise combinations. Proof of correctness: assume that a pair (i, j) with j > K + 1 is the answer. But there are K pairs with the same or a smaller sum: (1, 2), (1, 3), ..., (1, K + 1). Thus it cannot be the K-th pair.
It is possible to achieve O(N + K^2) time complexity by choosing the K + 1 smallest numbers with a quickselect algorithm (it is possible to do even better, but it is not required). You can also just sort the array and get O(N log N + K^2 log K) complexity.
I assume that you got this question from http://www.careercup.com/question?id=7457663.
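A short Python sketch of this idea (assuming 1-based k, as in the question's example):

import heapq
from itertools import combinations

def kth_pair_sum(A, k):
    # Only the k+1 smallest elements can appear in the first k pair sums.
    smallest = heapq.nsmallest(k + 1, A)   # O(N log k); quickselect would give O(N)
    sums = sorted(a + b for a, b in combinations(smallest, 2))
    return sums[k - 1]

print(kth_pair_sum([1, 2, 3, 4], 3))  # 5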
If k is close to 0 then the accepted answer to How to find kth largest number in pairwise sums like setA + setB? can be adapted quite easily to this problem and be quite efficient. You need O(n log(n)) to sort the array, O(n) to set up a priority queue, and then O(k log(k)) to iterate through the elements. The reversed solution is also efficient if k is near n*n - n.
If k is close to n*n/2 then that won't be very good. But you can adapt the pivot approach of http://en.wikipedia.org/wiki/Quickselect to this problem. First, in time O(n log(n)), you can sort the array. In time O(n) you can set up a data structure representing the various contiguous ranges of columns. Then you'll need to select pivots O(log(n)) times. (Remember, log(n*n) = O(log(n)).) For each pivot, you can do a binary search of each column to figure out where it splits that column, in time O(log(n)) per column, for a total cost of O(n log(n)) over all columns.
The resulting algorithm will be O(n log(n) log(n)).
Update: I do not have time to do the finger exercise of supplying code. But I can outline some of the classes you might have in an implementation.
The implementation will be a bit verbose, but that is sometimes the cost of a good general-purpose algorithm.
ArrayRangeWithAddend. This represents a range of an array with one value added to every element. It has an array (a reference or pointer, so the underlying data can be shared between objects), a start and an end for the range, and a shiftValue to add to every element in the range.
It should have a constructor, a method giving its size, a method partition(n) splitting it into the subrange less than n, the count of values equal to n, and the subrange greater than n, and a method value(i) giving the i'th value.
ArrayRangeCollection. This is a collection of ArrayRangeWithAddend objects. It should have methods giving its size and picking a random element, and a method partition(n) splitting it into an ArrayRangeCollection below n, the count of values equal to n, and an ArrayRangeCollection above n. In the partition method it is good not to include ArrayRangeWithAddend objects of size 0.
Now your main program can sort the array, and create an ArrayRangeCollection covering all pairs of sums that you are interested in. Then the random and partition method can be used to implement the standard quickselect algorithm that you will find in the link I provided.
Here is how to do it (in pseudo-code). I have now confirmed that it works correctly.
// A is the original array, such as A = [1,2,3,4]
// k (an integer) is the element in the 'sum' array to find
N = A.length

// first we find i
i = -1
nl = N
k2 = k
while (k2 >= 0) {
    i++
    nl--
    k2 -= nl
}

// then we find j
j = k2 + nl + i + 1

// now compute the sum at index position k
kSum = A[i] + A[j]
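For reference, a minimal Python transcription of the above (the function name is mine; k is 0-based, per the edit below):

def kth_sum_rowmajor(A, k):
    # Walk the jagged 'sum triangle' row by row to locate (i, j).
    N = len(A)
    i, nl, k2 = -1, N, k
    while k2 >= 0:          # find the row i that contains index k
        i += 1
        nl -= 1
        k2 -= nl
    j = k2 + nl + i + 1     # the offset within the row gives j
    return A[i] + A[j]

print(kth_sum_rowmajor([1, 2, 3, 4], 2))  # 5 (the OP's K=3, 0-based k=2)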
EDIT:
I have now tested that this works. I had to fix some parts... basically, the k input argument should use 0-based indexing. (The OP seems to use 1-based indexing.)
EDIT 2:
I'll try to explain my theory then. I began with the concept that the sum array should be visualised as a 2D jagged array (diminishing in width as the height increases), with the coordinates (as mentioned in the OP) being i and j. So for an array such as [1,2,3,4,5] the sum array would be conceived as this:
3,4,5,6,
5,6,7,
7,8,
9.
The top row holds all the sums where i equals 0; the second row is where i equals 1. To find the value of j we do the same, but in the column direction.
... Sorry I cannot explain this any better!

Find pairs with given difference

Given n, k, and n integers, how would you find the pairs of integers whose difference is k?
There is an O(n log n) solution, but I cannot figure it out.
You can do it like this:
Sort the array
For each item data[i], determine its two target values, i.e. data[i]+k and data[i]-k
Run a binary search on the sorted array for these two targets; if found, add both data[i] and data[targetPos] to the output.
Sorting is done in O(n log n). Each of the n search steps takes 2 * log n time to look for the targets, for an overall time of O(n log n).
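A minimal Python sketch of this approach (searching only data[i]+k to the right of i, which is sufficient, as the next answer points out):

import bisect

def pairs_with_diff(data, k):
    # Sort, then binary-search data[i] + k for each element: O(n log n).
    data = sorted(data)
    out = []
    for i, x in enumerate(data):
        j = bisect.bisect_left(data, x + k, i + 1)
        if j < len(data) and data[j] == x + k:
            out.append((x, x + k))
    return out

print(pairs_with_diff([1, 7, 5, 9, 2, 12, 3], 2))  # [(1, 3), (3, 5), (5, 7), (7, 9)]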
For this problem there is a linear solution! Just ask yourself one question: if you have a value a, what other number should be in the array? Of course a+k or a-k (the special case k = 0 requires an alternative solution). So, what now?
Create a hash set (for example, unordered_set in C++11) with all the values from the array. Insertion is O(1) on average per element, so this step is O(n).
Iterate through the array and, for each element x, check whether x+k or x-k is present in the set. Each lookup is O(1) on average, and you check each element once, so this step is also linear, O(n).
If you found an x whose pair (x+k or x-k) is present, that is what you are looking for.
So it's linear, O(n). If you really want O(n lg n), use a tree-based set with O(lg n) lookups, and you have an O(n lg n) algorithm.
Addendum: there is no need to check both x+k and x-k; x+k alone is sufficient, because if a and b are a good pair then:
if a < b then
    a + k == b
else
    b + k == a
Improvement: if you know the value range, you can guarantee worst-case linear complexity by using a boolean table (set_tab[i] == true when i is in the table).
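In Python the same hash-set variant is a few lines (a sketch assuming k > 0; duplicates collapse into the set):

def pairs_with_diff_linear(data, k):
    # O(n) on average: build a hash set, then test x + k for each value.
    values = set(data)
    return [(x, x + k) for x in values if x + k in values]

print(pairs_with_diff_linear([1, 7, 5, 9, 2, 12, 3], 2))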
A solution similar to the one above:
Sort the array
Set variables i = 0; j = 1
Check the difference between array[i] and array[j]
If the difference is too small, increase j
If the difference is too big, increase i
If the difference is the one you're looking for, add it to the results and increase j
Repeat from step 3 until the end of the array
Sorting is O(n lg n); the next step is, if I'm correct, O(n) (at most 2n comparisons), so the whole algorithm is O(n lg n).
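A Python sketch of that two-pointer scan (assuming k > 0):

def pairs_with_diff_two_pointer(data, k):
    # O(n log n) for the sort, then one linear pass with two indices.
    data = sorted(data)
    out, i, j = [], 0, 1
    while j < len(data):
        diff = data[j] - data[i]
        if diff == k:                  # found a pair; keep scanning
            out.append((data[i], data[j]))
            j += 1
        elif diff < k:                 # difference too small
            j += 1
        else:                          # difference too big
            i += 1
    return out

print(pairs_with_diff_two_pointer([1, 7, 5, 9, 2, 12, 3], 2))
# [(1, 3), (3, 5), (5, 7), (7, 9)]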

Why is merge sort worst case run time O (n log n)?

Can someone explain to me in simple English or an easy way to explain it?
Merge sort uses the divide-and-conquer approach to solve the sorting problem. First, it divides the input in half using recursion. After dividing, it sorts the halves and merges them into one sorted output. See the figure.
This means it is better to sort half of your problem first and then do a simple merge subroutine. So it is important to know the complexity of the merge subroutine and how many times it will be called in the recursion.
The pseudo-code for the merge sort is really simple.
# C = output [length = N]
# A = 1st sorted half [N/2]
# B = 2nd sorted half [N/2]
i = j = 1
for k = 1 to N
    if A[i] < B[j]
        C[k] = A[i]
        i++
    else
        C[k] = B[j]
        j++
It is easy to see that every loop iteration performs 4 operations: k++, i++ or j++, the if comparison, and the assignment C[k] = A[i] or C[k] = B[j]. So you have at most 4N + 2 operations, giving O(N) complexity. For the sake of the proof, 4N + 2 will be treated as 6N, since 4N + 2 <= 6N holds for N >= 1.
So assume you have an input of N elements, and assume N is a power of 2. At every level you have twice as many subproblems, each on an input with half the elements of the previous one. This means that at level j = 0, 1, 2, ..., lg N there are 2^j subproblems, each with an input of length N / 2^j. The number of operations at each level j is at most
2^j * 6(N / 2^j) = 6N
Observe that no matter the level, you always have at most 6N operations.
Since there are lg N + 1 levels, the complexity is
O(6N * (lg N + 1)) = O(6N lg N + 6N) = O(N lg N)
References:
Coursera course Algorithms: Design and Analysis, Part 1
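For completeness, a short runnable Python version of the same scheme, with the boundary checks the pseudo-code above omits:

def merge_sort(xs):
    # T(N) = 2 T(N/2) + O(N)  =>  O(N log N)
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    a, b = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):   # merge: linear in len(a) + len(b)
        if a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]         # append whichever half remains

print(merge_sort([5, 2, 4, 7, 1, 3, 2, 6]))  # [1, 2, 2, 3, 4, 5, 6, 7]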
On a "traditional" merge sort, each pass through the data doubles the size of the sorted subsections. After the first pass, the file will be sorted into sections of length two. After the second pass, length four. Then eight, sixteen, etc. up to the size of the file.
It's necessary to keep doubling the size of the sorted sections until there's one section comprising the whole file. It will take lg(N) doublings of the section size to reach the file size, and each pass of the data will take time proportional to the number of records.
After splitting the array down to the stage where you have single elements (call them sublists), at each stage we compare the elements of each sublist with its adjacent sublist. For example (reusing Davi's image):
At Stage-1 each element is compared with its adjacent one, so n/2 comparisons.
At Stage-2, each sublist is compared with its adjacent sublist. Since each sublist is sorted, the max number of comparisons made between two sublists is at most the length of a sublist, i.e. 2 at Stage-2, 4 at Stage-3, and 8 at Stage-4, since the sublists keep doubling in length. This means the max number of comparisons at each stage = (length of sublist) * (number of sublists / 2) ==> n/2.
As you've observed, the total number of stages is log(n) base 2.
So the total complexity = (max number of comparisons at each stage) * (number of stages) = O((n/2) * log(n)) ==> O(n log(n))
Algorithm merge-sort sorts a sequence S of size n in O(n log n)
time, assuming two elements of S can be compared in O(1) time.
This is because, whether in the worst case or the average case, merge sort divides the array into two halves at each stage, which gives the lg(n) component, while the other N component comes from the comparisons made at each stage. Combining them, it becomes O(n lg n). No matter if it is the average case or the worst case, the lg(n) factor is always present. The remaining N factor depends on the comparisons made in both cases. The worst case is the one in which N comparisons happen for an input of N at every stage, so it is O(n lg n).
Many of the other answers are great, but I didn't see any mention of height and depth related to the "merge-sort tree" examples. Here is another way of approaching the question with a lot of focus on the tree. Here's another image to help explain:
Just a recap: as other answers have pointed out we know that the work of merging two sorted slices of the sequence runs in linear time (the merge helper function that we call from the main sorting function).
Now looking at this tree, where we can think of each descendant of the root (other than the root) as a recursive call to the sorting function, let's try to assess how much time we spend on each node... Since the slicing of the sequence and merging (both together) take linear time, the running time of any node is linear with respect to the length of the sequence at that node.
Here's where tree depth comes in. If n is the total size of the original sequence, the size of the sequence at any node is n/2^i, where i is the depth. This is shown in the image above. Putting this together with the linear amount of work for each slice, we have a running time of O(n/2^i) for every node in the tree. Now we just have to sum that up over all the nodes. One way to do this is to recognize that there are 2^i nodes at each level of depth in the tree. So for any level, we have O(2^i * n/2^i), which is O(n) because the 2^i factors cancel out! If each depth is O(n), we just have to multiply that by the height of this binary tree, which is log n. Answer: O(n log n)
reference: Data Structures and Algorithms in Python
The recursive tree will have depth log(N), and at each level in that tree you will do a combined N work to merge two sorted arrays.
Merging sorted arrays
To merge two sorted arrays A[1,5] and B[3,4] you simply iterate both starting at the beginning, picking the lowest element between the two arrays and incrementing the pointer for that array. You're done when both pointers reach the end of their respective arrays.
[1,5] [3,4] --> []
 ^     ^
[1,5] [3,4] --> [1]
   ^   ^
[1,5] [3,4] --> [1,3]
   ^     ^
[1,5] [3,4] --> [1,3,4]
   ^      x
[1,5] [3,4] --> [1,3,4,5]
    x     x
Runtime = O(A + B)
Merge sort illustration
Your recursive call stack will look like this. The work starts at the bottom leaf nodes and bubbles up.
beginning with [1,5,3,4], N = 4, depth k = log(4) = 2
[1,5] [3,4] depth = k-1 (2^1 nodes) * (N/2^1 values to merge per node) == N
[1] [5] [3] [4] depth = k (2^2 nodes) * (N/2^2 values to merge per node) == N
Thus you do N work at each of k levels in the tree, where k = log(N)
N * k = N * log(N)
The MergeSort algorithm takes three steps:
The divide step computes the mid position of the sub-array and takes constant time, O(1).
The conquer step recursively sorts two sub-arrays of approximately n/2 elements each.
The combine step merges a total of n elements at each pass, requiring at most n comparisons, so it takes O(n).
The algorithm requires approximately log n passes to sort an array of n elements, so the total time complexity is O(n log n).
Let's take an example of 8 elements, {1,2,3,4,5,6,7,8}. First divide it in half, n/2 = 4, giving {1,2,3,4} {5,6,7,8}; these two sections take O(n/2) and O(n/2) time, so the first step takes O(n/2 + n/2) = O(n) time.
2. The next step divides into n/2^2 pieces, i.e. (({1,2} {3,4}) ({5,6} {7,8})), which take (O(n/4), O(n/4), O(n/4), O(n/4)) respectively, so this step takes O(n/4 + n/4 + n/4 + n/4) = O(n) time in total.
3. Similarly to the previous step, we divide further by 2, down to n/2^3: (({1},{2},{3},{4}) ({5},{6},{7},{8})), whose time is O(n/8 + n/8 + n/8 + n/8 + n/8 + n/8 + n/8 + n/8) = O(n).
So every step takes O(n) time. If there are a steps, merge sort takes O(a·n) time, and a must be log(n) because every step divides by 2. So finally the time complexity of merge sort is O(n log(n)).

Problem k-subvector using dynamic programming

Given a vector V of n integers and an integer k, k <= n, you want a subvector (a sequence of consecutive elements of the vector) of maximum length containing at most k distinct elements.
The technique I am using to solve the problem is dynamic programming.
The complexity of this algorithm must be O(n*k).
The main problem is how to count the distinct elements of the vector. How would you solve that?
How would you write the recurrence equation?
Thank you!
I don't know why you would insist on O(n*k); this can be solved in O(n) with a 'sliding window' approach.
Maintain a current 'window' [left..right]
At each step, if we can increase right by 1 (without violating the 'at most k distinct elements' requirement), do it
Otherwise, increase left by 1
Check whether the current window is the longest so far, and go back to #2
Checking whether we can increase right in #2 is a little tricky. We can use a hashtable storing, for each element inside the window, how many times it occurs there.
So the condition allowing a right increase would look like
hash.size < k || hash.contains(V[right + 1])
And each time left or right is increased, we need to update the hash (decrement or increment the occurrence count of the given element).
I'm pretty sure any DP solution here would be longer and more complicated.
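A compact Python sketch of the sliding window (names are mine):

from collections import defaultdict

def longest_k_distinct(v, k):
    # O(n): each index enters and leaves the window at most once.
    counts = defaultdict(int)   # occurrences of each value inside the window
    best_len, best_start, left = 0, 0, 0
    for right, x in enumerate(v):
        counts[x] += 1
        while len(counts) > k:           # too many distinct: shrink from the left
            counts[v[left]] -= 1
            if counts[v[left]] == 0:
                del counts[v[left]]
            left += 1
        if right - left + 1 > best_len:
            best_len, best_start = right - left + 1, left
    return v[best_start:best_start + best_len]

print(longest_k_distinct([1, 2, 1, 3, 3, 3, 2], 2))  # [1, 3, 3, 3]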
"The main problem is how to count the distinct elements of the vector. How would you solve that?"
If you are allowed to use hashing, you could do the following:
init Hashtable h
distinct_count := 0
for each element v of the vector V
    if h does not contain v (O(1) time on average)
        insert v into h (O(1) time on average)
        distinct_count := distinct_count + 1
return distinct_count
This is O(n) time on average.
If not, here is an O(n log n) solution, this time worst case:
sort V (O(n log n) comparisons)
Then it should be easy to determine the number of distinct elements in O(n) time ;-)
I could also tell you an algorithm to sort V in O(n*b), where b is the bit count of the integers, if that helps you.
Here is the algorithm:
sort(vector, begin_index, end_index, currentBit)
    reorder vector[begin_index..end_index] so that the elements that have
        a 1 at bit currentBit come after those that have a 0 there
        (O(end_index - begin_index) time)
    let c be the count of elements that have a 0 at bit currentBit
        (O(end_index - begin_index) time; it can be obtained from the step before)
    if (currentBit is not 0)
        call sort(vector, begin_index, begin_index + c - 1, currentBit - 1)
        call sort(vector, begin_index + c, end_index, currentBit - 1)
Call it with
vector = V
begin_index = 0
end_index = n - 1
currentBit = bit count of the integers (=: b) - 1
This even uses dynamic programming as requested.
As you can easily verify, this runs in O(n*b) time with a recursion depth of b.
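A runnable Python sketch of this MSD binary radix sort (names are mine):

def radix_sort_bits(v, lo, hi, bit):
    # Sort v[lo..hi] in place by bits from 'bit' down to 0: O(n*b) total.
    if lo >= hi or bit < 0:
        return
    i, j = lo, hi
    while i <= j:                      # partition by the current bit
        if (v[i] >> bit) & 1 == 0:
            i += 1
        else:
            v[i], v[j] = v[j], v[i]
            j -= 1
    radix_sort_bits(v, lo, j, bit - 1)   # the 0-bit half
    radix_sort_bits(v, i, hi, bit - 1)   # the 1-bit half

data = [5, 3, 7, 1, 4, 1, 6]
radix_sort_bits(data, 0, len(data) - 1, 2)   # b = 3 bits for values < 8
print(data)  # [1, 1, 3, 4, 5, 6, 7]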
