Time complexity to find 7th smallest element in a min heap? - algorithm

I am interested in finding the 7th smallest element in a min heap, assuming the min heap may contain duplicates.
I don't know how to approach this. Can anyone provide an idea?

As the seventh smallest element must lie in the top 7 levels of the min-heap, it is the 7th smallest of the at most 2^7 - 1 = 127 elements in those levels. Since this number is a fixed constant (independent of the size of the original heap), the complexity is O(1).

There's a simple O(k*log k) algorithm to select the k'th smallest element from a heap:
# h = input heap
q = new min-heap()
q.insert(h.root)
for i := 1 to k - 1
    top = q.delete-min()
    if top.left exists:  q.insert(top.left)
    if top.right exists: q.insert(top.right)
report q.top
Of course this is constant time for the case k = 7. If you want the k-th smallest distinct element, rather than the k-th smallest overall, you will need linear time: all elements in the heap could be equal except for the leaves, and then you would need to find the (k-1)st smallest leaf, which is not possible in o(n) when all inner nodes have the same value.
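For concreteness, here is a runnable Python sketch of the algorithm above over an array-based min-heap, using heapq for the auxiliary heap (the function name is my own):

import heapq

def kth_smallest(h, k):
    # h is a list storing a binary min-heap: children of index i sit at 2i+1 and 2i+2.
    q = [(h[0], 0)]                      # auxiliary min-heap of (value, index) pairs
    for _ in range(k - 1):
        _, i = heapq.heappop(q)
        for c in (2 * i + 1, 2 * i + 2):
            if c < len(h):               # guard: the child may not exist
                heapq.heappush(q, (h[c], c))
    return q[0][0]

# Example: kth_smallest([1, 3, 2, 7, 4, 5, 6], 7) returns 7.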

Related

How to calculate the maximum median in an array

This is an algorithm question:
Input is an array with non-duplicate positive integers. Find a continuous subarray (size > 1) which has the maximum median value.
Example: input: [100, 1, 99, 2, 1000], output should be the result of (1000 + 2) / 2 = 501
I can come up with the brute-force solution: try all lengths from 2 to the array size to find the maximum median. But it seems too slow. I also tried to use two pointers on this question, but I am not sure when to move the left and right pointers.
Anyone has a better idea to solve this question?
tl;dr - We can show that the answer must be of length 2 or 3, after which it's linear time to check all the possibilities.
Let's say the input is A and the smallest subarray with the biggest median is a. The biggest median is either a single element or the average of a pair of elements from a. Notice that every element in a that is bigger than the largest element of the median can only be next to elements less than the smallest element of the median (otherwise such a pair could itself be chosen as a subarray to form a bigger median).
If either end of a had a pair of elements that didn't include an element of the median, it could be eliminated from a without affecting the median, a contradiction.
If either end of a was smaller than the smallest element of the median, eliminating it would increase the median, a contradiction.
Thus each end of a is either an element of the median or larger than the largest element of the median (because it is larger than the smallest element of the median and not equal to the largest element of the median).
Thus each end of a is an element of the median, because otherwise we'd have an element larger than an element of the median adjacent to an element of the median, forming a larger median.
If a is odd then it must be of length three, since any larger odd length could have 2 removed from the end of a farthest from the median without changing the median.
If a is even then it must be of length 2: any larger even-length subarray bookended by the elements of the median, with interior elements alternating between smaller and larger than the median, must have one of the median elements adjacent to an element larger than the other median element, forming a larger median.
This proof outline could use some editing, but regardless, the conclusion is that the smallest array containing the largest median must be of length 2 or 3.
Given that, check every such subarray in linear time. O(n).
This is a Python implementation of an algorithm that solves the problem in O(n):
import random
import statistics

n = 50
numbers = random.sample(range(1, n + 1), n)  # distinct positive integers

max_m = 0
max_a = []
for i in range(2, 4):  # only lengths 2 and 3 need to be checked
    for j in range(0, n - i + 1):
        a = numbers[j:j + i]
        m = statistics.median(a)
        if m > max_m:
            max_m = m
            max_a = a

print(numbers)
print(max_m)
print(max_a)
This is a variation of the brute-force algorithm (O(n^3)) that searches only the sub-arrays of length 2 or 3. The reason is that for every array of size n > 3, there exists a shorter sub-array whose median is the same or larger. Applying this reasoning recursively, we can reduce the size of the sub-array to 2 or 3. Thus, by looking only at sub-arrays of size 2 or 3, we are guaranteed to obtain the sub-array with the maximum median.
The operation is the following: if, in a contiguous sub-array at the beginning or at the end, at least half of the elements are lower than the median (or lower than both values forming the median, if the median is an average of two values), remove that sub-array to improve or at least preserve the median.
If in every such prefix and suffix there is always at least one more element above or equal to the median than below it, there will come a point where the sub-array shrinks to the size of the median itself. In that case the complement has more elements below the median, so we can simply remove the complement and improve (or preserve) the median. Thus, we can always perform the operation. For n = 3 it can happen that the operation would need to remove 2 or 3 elements, which is not allowed; in this case the result is the list itself.

Efficient approach to find co-prime subarrays

Given an array, is it possible to find the number of co-prime subarrays of the array in better than O(N²) time? A co-prime subarray is defined as a contiguous subset of an array such that the GCD of all its elements is 1.
Consider adding one element to the end of the array. Now find the rightmost position, if any, such that the sub-array from that position to the element you have just added is co-prime. Since it is rightmost, no shorter array ending with the element added is co-prime. Since it is co-prime, every array that starts to its left and ends with the new element is co-prime. So you have worked out the number of co-prime sub-arrays that end with the new element. If you can find the rightmost position efficiently - say in O(log n) instead of O(n) - then you can count the number of co-prime sub-arrays in O(n log n) by extending the array one element at a time.
To make it possible to find rightmost positions, think of the full array as the leaves of a complete binary tree, padded out to make its length a power of two. At each node put the GCD of all of the elements below that node - you can do this from the bottom up in time O(n). Every contiguous interval within the array can be covered by a collection of O(log n) nodes such that the interval consists of the leaves underneath the nodes, so you can compute the GCD of the interval in time O(log n).
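As a sketch, such a GCD tree can be built bottom-up in O(n) with a flat array (this layout and the function name are my own; padding with zeros is harmless since gcd(g, 0) = g):

from math import gcd

def build_gcd_tree(a):
    # Complete binary tree stored in an array: root at index 1,
    # children of i at 2i and 2i+1, leaves starting at index 'size'.
    size = 1
    while size < len(a):
        size *= 2
    tree = [0] * (2 * size)
    tree[size:size + len(a)] = a          # zero padding leaves GCDs unchanged
    for i in range(size - 1, 0, -1):
        tree[i] = gcd(tree[2 * i], tree[2 * i + 1])
    return tree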
To find the rightmost position forming a co-prime subarray with your current element, start with the current element and check to see if it is 1. If it is, you are finished. If not, look at the element to its left, take a GCD with that, and push the result on a stack. If the result is 1, you are finished, if not, do the same, but look to see if there is a sub-tree of 2 elements you can use to add 2 elements at once. At each of the succeeding steps you double the size of the sub-tree you are trying to find. You won't always find a convenient sub-tree of the size you want, but because every interval can be covered by O(log n) subtrees you should get lucky often enough to go through this step in time O(log n).
Now you have either found that the whole array up to the current element is not co-prime, or you have found a section that is co-prime but may extend further to the left than it needs to. The value at the top of the stack was computed by taking the GCD of the value just below it on the stack and the GCD at the top of a sub-tree. Pop it off the stack and take the GCD of the value just below it and the right half of that sub-tree. If you are still co-prime then you didn't need the left half of the sub-tree. If not, then you needed it, but perhaps not all of it. In either case you can continue down to find the rightmost match in time O(log n).
So I think you can find the rightmost position forming a co-prime subarray with the current element in time O(log n) (admittedly with some very fiddly programming), so you can count the number of co-prime subarrays in time O(n log n).
Two examples:
List 1, 3, 5, 7. The next level is 1, 1 and the root is 1. If the current element is 13 then I check against 7 and find that gcd(7, 13) = 1. Therefore I immediately know that GCD(5, 7, 13) = GCD(3, 5, 7, 13) = GCD(1, 3, 5, 7, 13) = 1.
List 2, 4, 8, 16. The next level is 2, 8 and the root is 2. If the current number is 32 then I check against 16 and find that gcd(16, 32) = 16 != 1, so then I check against 8 and find that GCD(8, 32) = 8, and then I check against 2 and find that GCD(2, 32) = 2, so there is no interval in the extended array which has GCD = 1.
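Separately, if the goal is only to count co-prime subarrays, here is a hedged Python sketch of a simpler alternative technique: for each right endpoint, track the distinct GCDs of all subarrays ending there, of which there are only O(log max(arr)) since each extension can only shrink the GCD (the function name is my own):

from math import gcd

def count_coprime_subarrays(arr):
    total = 0
    prev = {}                               # gcd value -> count of subarrays ending at the previous index
    for x in arr:
        cur = {x: 1}                        # the one-element subarray [x]
        for g, cnt in prev.items():
            ng = gcd(g, x)
            cur[ng] = cur.get(ng, 0) + cnt  # extend those subarrays by x
        total += cur.get(1, 0)              # subarrays ending here with GCD 1
        prev = cur
    return total

# Example: count_coprime_subarrays([2, 3, 4]) returns 3 ([2,3], [3,4] and [2,3,4]).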

What is the complexity of this approach to finding K largest of N numbers

In this post on how to find the K largest of N elements the 2nd method proposed is:
1. Store the first k elements in a temporary array temp[0..k-1].
2. Find the smallest element in temp[]; let the smallest element be min.
3. For each element x in arr[k] to arr[n-1]: if x is greater than min, remove min from temp[] and insert x.
4. Print the final k elements of temp[].
While I understand the approach, I do not understand their computed
Time Complexity of O((n-k)*k).
From my perspective, you are making a linear traversal of n-k elements and doing a single comparison on each element, and then perhaps replacing one element of the temporary array of k elements.
More specifically, where does the *k factor of their computed time complexity of O((n-k)*k) come from? Why do they multiply (n-k) by k?
Let's consider the kth iteration:
arr[k] > min(temp[0..k-1])
Now you replace min(temp[0..k-1]) with arr[k].
You then need to recompute the min of temp[0..k-1], because it has changed; the new min can be any element of the updated temp[0..k-1].
So in the worst case you update the min every time, and each update costs O(k).
Thus, time complexity = O((n-k)*k)
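To make the O(k) factor concrete, here is a minimal Python sketch of the method (the function name is my own); the min scan and replacement inside the loop are the O(k) work per element:

def k_largest(arr, k):
    temp = arr[:k]                    # first k elements
    for x in arr[k:]:                 # (n - k) iterations
        m = min(temp)                 # O(k) scan for the current min
        if x > m:
            temp[temp.index(m)] = x   # replace the min with x
    return temp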

Algorithm to find 100 closest stars to the origin

First let me phrase the proper question:
Q: There is a file containing more than a million points (x,y), each of which represents a star. There is a planet earth at (a,b). Now, the task is to build an algorithm that would return the 100 closest stars to earth. What would be the time and space complexities of your algorithm?
This question has been asked many times in various interviews. I tried looking up the answers but could not find a satisfactory one.
One way to do it, which I thought of, might be using a max heap of size 100. Calculate the distance for each star and check whether it is less than the distance at the root of the max heap. If it is, replace the root with that star and call heapify.
Any other better/faster answers?
P.S: This is not a homework question.
You can actually do this in time O(n) and space O(k), where k is the number of closest points that you want, by using a very clever trick.
The selection problem is as follows: Given an array of elements and some index i, rearrange the elements of the array such that the ith element is in the right place, all elements smaller than the ith element are to the left, and all elements greater than the ith element are to the right. For example, given the array
40 10 00 30 20
If I tried to select based on index 2 (zero-indexed), one result might be
10 00 20 40 30
Since the element at index 2 (20) is in the right place, the elements to the left are smaller than 20, and the elements to the right are greater than 20.
It turns out that since this is a less strict requirement than actually sorting the array, it's possible to do this in time O(n), where n is the number of elements of the array. Doing so requires some complex algorithms like the median-of-medians algorithm, but is indeed O(n) time.
So how do you use this here? One option is to load all n elements from the file into an array, then use the selection algorithm to select the top k in O(n) time and O(n) space (here, k = 100).
But you can actually do better than this! For any constant k that you'd like, maintain a buffer of 2k elements. Load 2k elements from the file into the array, then use the selection algorithm to rearrange it so that the smallest k elements are in the left half of the array and the largest are in the right, then discard the largest k elements (they can't be any of the k closest points). Now, load k more elements from the file into the buffer and do this selection again, and repeat this until you've processed every line of the file. Each time you do a selection you discard the largest k elements in the buffer and retain the k closest points you've seen so far. Consequently, at the very end, you can select the top k elements one last time and find the top k.
What's the complexity of the new approach? Well, you're using O(k) memory for the buffer and the selection algorithm. You end up calling select on a buffer of size O(k) a total of O(n / k) times, since you call select after reading k new elements. Since select on a buffer of size O(k) takes time O(k), the total runtime here is O(n + k). If k = O(n) (a reasonable assumption), this takes time O(n), space O(k).
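Here is a hedged Python sketch of this buffer idea (names are my own; heapq.nsmallest stands in for a true O(k) selection, so the sketch runs in O(n log k) rather than O(n), but the structure is the same):

import heapq

def k_closest(points, k, earth=(0.0, 0.0)):
    ax, ay = earth
    dist2 = lambda p: (p[0] - ax) ** 2 + (p[1] - ay) ** 2  # squared distance; sqrt not needed for ranking
    buf = []
    for p in points:
        buf.append(p)
        if len(buf) >= 2 * k:
            buf = heapq.nsmallest(k, buf, key=dist2)  # discard the k farthest candidates
    return heapq.nsmallest(k, buf, key=dist2)         # final selection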
Hope this helps!
To elaborate on the max-heap solution: you would build a max-heap with the first k elements from the file (k = 100 in this case).
The key for the max-heap would be the star's distance from Earth (a,b). The distance between 2 points on a 2D plane can be calculated using:
dist((x1, y1), (x2, y2)) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
This would take O(k) time to construct. For each of the remaining (n - k) elements, compute its distance from earth and compare it with the top of the max-heap. If the new element is closer to earth than the top of the max-heap, replace the top of the max-heap and call heapify on the new root of the heap.
This would take O((n-k)logk) time to complete.
Finally we would be left with just the k closest elements in the max-heap. You can extract the maximum k times to return all these k elements. This is another O(k log k).
Overall time complexity would be O(k + (n-k)logk + klogk).
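A minimal Python version of this max-heap approach (the function name is my own; heapq is a min-heap, so distances are negated to simulate a max-heap, and len(stars) >= k is assumed):

import heapq
import math

def k_closest_heap(stars, k, earth=(0.0, 0.0)):
    ax, ay = earth
    dist = lambda s: math.hypot(s[0] - ax, s[1] - ay)
    heap = [(-dist(s), s) for s in stars[:k]]        # the first k stars, keyed by negated distance
    heapq.heapify(heap)                              # O(k)
    for s in stars[k:]:                              # (n - k) iterations
        if dist(s) < -heap[0][0]:                    # closer than the farthest kept star
            heapq.heapreplace(heap, (-dist(s), s))   # O(log k)
    return [s for _, s in heap]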
It's a famous question and there have been lots of solutions for it:
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
If you do not find that useful, there are some other resources, such as O'Rourke's computational geometry book.
Your algorithm is correct. Just remember that the time complexity of your program is O(n * log 100) = O(n), unless the number of closest points to find can vary.
import math

earth = [0, 0]

# getDistance returns the distance between two stars
def getDistance(star1, star2):
    return math.sqrt((star1[0] - star2[0]) ** 2 + (star1[1] - star2[1]) ** 2)

# dict_galaxy maps a sequence number assigned to each star to [distance, [x, y]]
dict_galaxy = {}
with open('./file_copd.out') as iFile:
    for count, line in enumerate(iFile):
        star = [float(v) for v in line.split(',')]  # star is a list [x, y]
        dict_galaxy[count] = [getDistance(earth, star), star]

# Sort the keys of the dictionary by distance and take the first 100.
list_sorted_key = sorted(dict_galaxy, key=lambda x: dict_galaxy[x][0])
print('is this what you want %s' % list_sorted_key[:100])

How to find k nearest neighbors to the median of n distinct numbers in O(n) time?

I can use the median-of-medians selection algorithm to find the median in O(n). Also, I know that after the algorithm is done, all the elements to the left of the median are less than the median and all the elements to the right are greater than the median. But how do I find the k nearest neighbors to the median in O(n) time?
If the median is n, the numbers to the left are less than n and the numbers to the right are greater than n.
However, the array is not sorted in the left or the right sides. The numbers are any set of distinct numbers given by the user.
The problem is from Introduction to Algorithms by Cormen, problem 9.3-7
No one seems to quite have this. Here's how to do it. First, find the median as described above; this is O(n). Now park the median at the end of the array, and replace every other element by its absolute difference from the median. Now find element k of the array (not including the last element), using the quickselect algorithm again. This not only finds element k (in order), it also leaves the array so that the lowest k numbers are at the beginning of the array. These are the k closest to the median, once you map them back to the original values (e.g. by carrying each value alongside its difference).
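Here is a runnable Python sketch of this two-select approach, using a basic randomized quickselect in place of median-of-medians (so expected rather than worst-case O(n)) and keeping each original value alongside its difference (all names are my own):

import random

def quickselect(a, k):
    # Rearranges a in place so that a[k] is the (k+1)-th smallest element
    # and everything in a[:k] is <= a[k]. Expected O(len(a)).
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                      # Hoare-style partition
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if k <= j:
            hi = j
        elif k >= i:
            lo = i
        else:
            break                          # a[k] equals the pivot
    return a[k]

def k_nearest_to_median(nums, k):
    a = list(nums)
    med = quickselect(a, len(a) // 2)                      # step 1: find the median
    diffs = [(abs(x - med), x) for x in nums if x != med]  # step 2: differences, excluding the median itself
    quickselect(diffs, k - 1)                              # step 3: the k smallest differences come first
    return [x for _, x in diffs[:k]]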
The median-of-medians probably doesn't help much in finding the nearest neighbours, at least for large n. True, you have each column of 5 partitioned around its median, but this isn't enough ordering information to solve the problem.
I'd just treat the median as an intermediate result, and treat the nearest neighbours as a priority queue problem...
Once you have the median from the median-of-medians, keep a note of its value.
Run the heapify algorithm on all your data - see Wikipedia - Binary Heap. In comparisons, base the result on the difference relative to that saved median value. The highest-priority items are those with the lowest abs(value - median). This takes O(n).
The first item in the array is now the median (or a duplicate of it), and the array has heap structure. Use the heap extract algorithm to pull out as many nearest-neighbours as you need. This is O(k log n) for k nearest neighbours.
So long as k is a constant, you get O(n) median of medians, O(n) heapify and O(log n) extracting, giving O(n) overall.
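A compact Python rendering of this heap idea (the function name is my own; the median is assumed to come from the prior median-of-medians step):

import heapq

def k_nearest_via_heap(nums, k, median):
    heap = [(abs(x - median), x) for x in nums]        # key on distance to the median
    heapq.heapify(heap)                                # O(n)
    heapq.heappop(heap)                                # discard the median itself (distance 0)
    return [heapq.heappop(heap)[1] for _ in range(k)]  # O(k log n)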
med = Select(A, 1, n, n/2)      // find the median
for i = 1 to n
    B[i] = abs(A[i] - med)
q = Select(B, 1, n, k)          // get the kth smallest absolute difference
j = 1
for i = 1 to n
    if abs(A[i] - med) <= q     // recompute the difference: Select may have rearranged B
        C[j] = A[i]             // assign the real value A[i], not the difference
        j = j + 1
return C
You can solve your problem like this:
Find the median in O(n), e.g. using the O(n) nth_element algorithm.
Loop through all elements, substituting each with a pair: (absolute difference to the median, element's value).
Once more you do nth_element with n = k. After applying this algorithm you are guaranteed to have the k elements smallest in absolute difference first in the new array. You take their indices and are DONE!
Four Steps:
Use Median of medians to locate the median of the array - O(n)
Determine the absolute difference between the median and each element in the array and store them in a new array - O(n)
Use Quickselect or Introselect to pick the k smallest elements out of the new array - O(n) expected (naively re-selecting the minimum k times would be O(k*n))
Retrieve the k nearest neighbours by indexing the original array - O(k)
The overall time complexity is therefore O(n).
1. Find the median in O(n).
2. Create a new array where each element is the absolute value of the original value minus the median.
3. Find the kth smallest number in the new array in O(n).
4. The desired values are the elements whose absolute difference with the median is less than or equal to the kth smallest number in the new array.
You could use a non-comparison sort, such as a radix sort, on the list of numbers L, then find the k closest neighbours by considering windows of k elements and examining the window endpoints. Another way of stating "find the window" is: find the i that minimizes abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i] - L[n/2]) (if k is odd), or abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i+1] - L[n/2]) (if k is even). Combining the cases: abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i+!(k&1)] - L[n/2]). A simple O(k) way of finding the minimum is to start with i = 0, then slide to the left or right; but you should be able to find the minimum in O(log k).
The expression you minimize comes from transforming L into another list, M, by taking the difference of each element from the median.
m=L[n/2]
M=abs(L-m)
i minimizes M[n/2-k/2+i] + M[n/2+k/2+i].
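For illustration, here is a sketch of the window search on the sorted list (sorted() stands in for the radix sort, and the shrink-from-the-farther-end loop is a simple O(n) way to find the minimizing window; names are my own):

def k_nearest_after_sort(nums, k):
    L = sorted(nums)               # stand-in for a non-comparison sort
    n = len(L)
    m = n // 2                     # index of the median
    lo, hi = 0, n - 1
    while hi - lo > k:             # shrink to k+1 elements: the median plus k neighbours
        if L[m] - L[lo] > L[hi] - L[m]:
            lo += 1                # the left end is farther from the median
        else:
            hi -= 1                # the right end is farther (or tied)
    return [L[i] for i in range(lo, hi + 1) if i != m]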
You already know how to find the median in O(n)
if the order does not matter, selection of k smallest can be done in O(n)
apply selection of the k smallest to the right-hand side of the median and of the k largest to the left-hand side of the median
From Wikipedia:
function findFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        if pivotNewIndex > k    // new condition
            findFirstK(list, left, pivotNewIndex - 1, k)
        if pivotNewIndex < k
            findFirstK(list, pivotNewIndex + 1, right, k)
Don't forget the special case where k == n: return the original list.
Actually, the answer is pretty simple. All we need to do is to select k elements with the smallest absolute differences from the median moving from m-1 to 0 and m+1 to n-1 when the median is at index m. We select the elements using the same idea we use in merging 2 sorted arrays.
If the array is sorted and you know the index of the median, which should just be ceil(array.length / 2), then it should just be a process of listing out n(x-k), n(x-k+1), ..., n(x), n(x+1), n(x+2), ..., n(x+k),
where n is the array, x is the index of the median, and k is the number of neighbours you need on each side (use k/2 per side if you want k in total, not k on each side).
First select the median in O(n) time, using a standard algorithm of that complexity.
Then run through the list again, selecting the elements that are nearest to the median (by storing the best known candidates and comparing new values against these candidates, just as one would search for a maximum element).
Each step of this additional run through the list needs O(k) work, and since k is constant this is O(1). So the total time needed for the additional run is O(n), as is the total runtime of the full algorithm.
Since all the elements are distinct, there can be at most 2 elements with the same difference from the median. I think it is easier to have 2 arrays, A[k] and B[k], with the index representing the absolute value of the difference from the median. Now the task is just to fill up the arrays and choose k elements by reading the first k non-empty values of the arrays, reading A[i] and B[i] before A[i+1] and B[i+1]. This can be done in O(n) time.
All the answers suggesting to subtract the median from the array would produce incorrect results. That method finds the elements closest in value, not closest in position.
For example, if the array is 1, 2, 3, 4, 5, 10, 20, 30, 40, then for k = 2 the values returned would be (3, 4), which is incorrect; the correct output should be (4, 10), as they are the nearest neighbours by position.
The correct way to find the result would be using the selection algorithm to find upper and lower bound elements. Then by direct comparison find the remaining elements from the list.
