Algorithm to find 100 closest stars to the origin - algorithm

First let me phrase the proper question:
Q: There is a file containing more than a million points (x, y), each of which represents a star. There is a planet Earth at (a, b). The task is to build an algorithm that returns the 100 stars closest to Earth. What would be the time and space complexities of your algorithm?
This question has been asked many times in various interviews. I tried looking up the answers but could not find a satisfactory one.
One way to do it, which I thought of, might be to use a max-heap of size 100. Calculate the distance for each star and check whether it is less than the distance at the root of the max-heap. If it is, replace the root with it and call heapify.
Any other better/faster answers?
P.S: This is not a homework question.

You can actually do this in time O(n) and space O(k), where k is the number of closest points that you want, by using a very clever trick.
The selection problem is as follows: Given an array of elements and some index i, rearrange the elements of the array such that the ith element is in the right place, all elements smaller than the ith element are to the left, and all elements greater than the ith element are to the right. For example, given the array
40 10 00 30 20
If I tried to select based on index 2 (zero-indexed), one result might be
10 00 20 40 30
Since the element at index 2 (20) is in the right place, the elements to the left are smaller than 20, and the elements to the right are greater than 20.
It turns out that since this is a less strict requirement than actually sorting the array, it's possible to do this in time O(n), where n is the number of elements of the array. Doing so requires some complex algorithms like the median-of-medians algorithm, but is indeed O(n) time.
So how do you use this here? One option is to load all n elements from the file into an array, then use the selection algorithm to select the top k in O(n) time and O(n) space (here, k = 100).
But you can actually do better than this! For any constant k that you'd like, maintain a buffer of 2k elements. Load 2k elements from the file into the array, then use the selection algorithm to rearrange it so that the smallest k elements are in the left half of the array and the largest are in the right, then discard the largest k elements (they can't be any of the k closest points). Now, load k more elements from the file into the buffer and do this selection again, and repeat this until you've processed every line of the file. Each time you do a selection you discard the largest k elements in the buffer and retain the k closest points you've seen so far. Consequently, at the very end, you can select the top k elements one last time and find the top k.
What's the complexity of the new approach? Well, you're using O(k) memory for the buffer and the selection algorithm. You end up calling select on a buffer of size O(k) a total of O(n / k) times, since you call select after reading k new elements. Since select on a buffer of size O(k) takes time O(k), the total runtime here is O(n + k). If k = O(n) (a reasonable assumption), this takes time O(n), space O(k).
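As a rough Python sketch of this buffered approach (assuming the points are available as an iterable of (x, y) tuples, e.g. parsed lazily from the file): the helper names are made up, and heapq.nsmallest stands in for a true linear-time selection (it costs O(m log k) on a buffer of size m), so the sketch shows the buffering structure rather than the exact O(n + k) bound:

import heapq
import math

def select_k_smallest(buf, k):
    # Move the k closest (distance, point) pairs to the front of buf.
    # A median-of-medians style selection would make this step O(len(buf)).
    buf[:k] = heapq.nsmallest(k, buf)

def k_closest(points, earth, k=100):
    ax, ay = earth
    buf = []                      # never holds more than 2k (distance, point) pairs
    for (x, y) in points:
        buf.append((math.hypot(x - ax, y - ay), (x, y)))
        if len(buf) == 2 * k:
            select_k_smallest(buf, k)
            del buf[k:]           # the k farthest in the buffer cannot be in the answer
    select_k_smallest(buf, min(k, len(buf)))
    return [p for _, p in buf[:k]]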
Hope this helps!

To elaborate on the max-heap solution: you would build a max-heap with the first k elements from the file (k = 100 in this case).
The key for the max-heap would be each star's distance from Earth at (a, b). The distance between two points on a 2D plane can be calculated as:
dist((x1, y1), (x2, y2)) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
This would take O(k) time to construct. For each of the remaining (n - k) elements, compute its distance from Earth and compare it with the top of the max-heap. If the new element is closer to Earth than the top of the max-heap, replace the top with it and sift the new root down (heapify).
This would take O((n-k) log k) time to complete.
Finally we would be left with just the k elements in the max-heap. You can extract the root k times to return all these k elements; this is another O(k log k).
Overall time complexity would be O(k + (n-k) log k + k log k) = O(n log k).
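As a rough illustration (not part of the original answer), here is how that heap bookkeeping might look in Python. heapq is a min-heap, so distances are negated to simulate a max-heap, and squared distances are compared since the square root preserves ordering:

import heapq

def closest_k_heap(points, earth, k=100):
    ax, ay = earth
    heap = []                                  # entries are (-squared_distance, point)
    for (x, y) in points:
        d = (x - ax) ** 2 + (y - ay) ** 2
        if len(heap) < k:
            heapq.heappush(heap, (-d, (x, y)))
        elif d < -heap[0][0]:                  # closer than the farthest point kept so far
            heapq.heapreplace(heap, (-d, (x, y)))
    heap.sort(reverse=True)                    # optional: nearest star first
    return [p for _, p in heap]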

This is a famous question and there are lots of solutions for it:
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
If you don't find that useful, there are other resources such as O'Rourke's computational geometry book.

Your algorithm is correct. Just remember that the time complexity of your program is O(n · log 100) = O(n), unless the number of closest points to find can vary.

from math import sqrt

earth = [0, 0]

## getDistance returns the distance between two stars
def getDistance(star1, star2):
    return sqrt((star1[0] - star2[0]) ** 2 + (star1[1] - star2[1]) ** 2)

## dict_galaxy maps the sequence number assigned to each star to a list
## [distance, [x, y]], e.g. {0: [distance0, [x0, y0]], 1: [distance1, [x1, y1]]}
dict_galaxy = {}

with open('./file_copd.out', 'r') as iFile:
    for count, line in enumerate(iFile):
        star = [float(c) for c in line.split(',')]   ## star is a list [x, y]
        dict_galaxy[count] = [getDistance(earth, star), star]

## Sort the dictionary keys by distance and keep the first 100.
list_sorted_key = sorted(dict_galaxy, key=lambda k: dict_galaxy[k][0])
print('is this what you want %s' % list_sorted_key[:100])

Related

How to find 2 special elements in the array in O(n)

Let a1,...,an be a sequence of real numbers. Let m be the minimum of the sequence, and let M be the maximum of the sequence.
I proved that there exist two elements x, y in the sequence such that |x-y| <= (M-m)/n.
Now, is there an algorithm that finds two such elements with time complexity O(n)?
I thought about sorting the sequence, but since I don't know anything about M I cannot use radix/bucket sort or any other linear-time algorithm that I'm familiar with.
I'd appreciate any idea.
Thanks in advance.
1. First find n, M and m. If they are not already given, they can be determined in O(n).
2. Create storage for n+1 elements; we will use it for n+1 buckets of width w = (M-m)/n. The buckets cover the range of values equally: bucket 1 covers [m, m+w), bucket 2 covers [m+w, m+2w), ..., bucket n covers [m+(n-1)w, m+nw) = [M-w, M), and bucket n+1 covers [M, M+w).
3. Now go once through all the values and sort them into the buckets according to these intervals. There should be at most one element per bucket. If a bucket is already filled, the two elements are closer together than the width of the half-open interval, i.e. we have found elements x, y with |x-y| < w = (M-m)/n.
4. If no two such elements were found, then afterwards n of the n+1 buckets are filled with one element each, and all those elements are sorted. We go once more through all the buckets and compare only the contents of neighbouring buckets, checking whether two elements fulfil the condition. Due to the bucket width, the condition cannot hold for buckets that are not adjoining: for those the distance is always |x-y| > w.
(The fulfilment of the last inequality in step 4 is also the reason why the intervals are half-open rather than closed, and why we need n+1 buckets instead of n. An alternative would be to use n buckets and make the last bucket a special case covering [M, M+w]. But O(n+1) = O(n), and using n+1 buckets is preferable to special-casing the last one.)
The running time is O(n) for step 1, essentially nothing for step 2 (we do not actually do anything there), O(n) for step 3 and O(n) for step 4, as there is at most one element per bucket. Altogether O(n).
This task shows that either sorting elements that are not close together, or coarse sorting that does not resolve fine distances, can be done in O(n) instead of O(n*log(n)). This has useful applications: numbers on computers are discrete and have finite precision, and I have successfully used this sorting method in signal-processing / fast-sorting real-time production code.
About @Damien's remark: the real threshold of (M-m)/(n-1) is provably true for every such sequence. In the answer so far I assumed that the sequence we are looking at is of a special kind for which the stronger condition holds, or at least that, for all sequences where the stronger condition is true, we would find such elements in O(n).
If this was instead a small mistake of the OP (who said to have proven the stronger condition) and we should find two elements x, y with |x-y| <= (M-m)/(n-1), we can simplify:
We would do steps 1 to 3 as above, but with n buckets and the bucket width set to w = (M-m)/(n-1). Bucket n now covers [M, M+w).
For step 4 we would do the following alternative:
Alternative step 4: all n buckets are filled with one element each. The element in bucket n has to be M, and it sits at the left boundary of that bucket's interval. For every possible element x in the (n-1)th bucket, the distance of that element to y = M is |M-x| <= w = (M-m)/(n-1), so we have found x and y which fulfil the condition, q.e.d.
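A minimal Python sketch of the simplified variant above (n-1 buckets of width w = (M-m)/(n-1), i.e. the corrected threshold discussed below); the function name is made up for illustration:

def close_pair(a):
    # Returns x, y from a with |x - y| <= (M - m)/(n - 1), in O(n), by the
    # pigeonhole principle: n values cannot fit into n-1 buckets of width w
    # without two of them landing in the same bucket.
    n = len(a)
    m, M = min(a), max(a)
    if m == M:
        return a[0], a[1]                 # all values are equal
    w = (M - m) / (n - 1)                 # bucket width
    buckets = [None] * (n - 1)
    for v in a:
        i = min(int((v - m) / w), n - 2)  # clamp so that v == M lands in the last bucket
        if buckets[i] is not None:
            return buckets[i], v          # two values share a bucket
        buckets[i] = v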
First note that the real threshold should be (M-m)/(n-1).
The first step is to calculate the min m and max M elements, in O(N).
You calculate the value mid = (m + M)/2.
You concentrate the values less than mid at the beginning of the array, and those greater than mid at the end.
You select the part with the larger number of elements and iterate until very few numbers remain.
If both parts have the same number of elements, you can select either of them. If the remaining part has many more elements than n/2, then in order to maintain O(n) complexity you can keep only n/2 + 1 of them, since the goal is not to find the smallest difference, but only a difference that is small enough.
As indicated in a comment by @btilly, this solution could fail in some cases, for example with the input [0, 2.1, 2.9, 5]. To handle that, you also need to calculate the max value of the left part and the min value of the right part, and test whether the answer is not right_min - left_max. This doesn't change the O(n) complexity, even if the solution becomes less elegant.
Complexity of the search procedure: O(n) + O(n/2) + O(n/4) + ... + O(2) = O(2n) = O(n).
Damien is correct in his comment that the correct result is that there must be x, y such that |x-y| <= (M-m)/(n-1). If you have the sequence [0, 1, 2, 3, 4] you have 5 elements, but no two elements are closer than (M-m)/n = (4-0)/5 = 4/5.
With the right threshold, the solution is easy - find M and m by scanning through the input once, and then bucket the input into (n-1) buckets of size (M-m)/(n-1), putting values that are on the boundaries of a pair of buckets into both buckets. At least one bucket must have two values in it by the pigeon-hole principle.

What is the complexity of this approach to finding K largest of N numbers

In this post on how to find the K largest of N elements the 2nd method proposed is:
Store the first k elements in a temporary array temp[0..k-1].
Find the smallest element in temp[], let the smallest element be min.
For each element x in arr[k] to arr[n-1]
If x is greater than the min then remove min from temp[] and insert x.
Print final k elements of temp[]
While I understand the approach, I do not understand their computed
Time Complexity of O((n-k)*k).
From my perspective, you make a linear traversal over the n-k remaining elements, doing a single comparison on each, and then perhaps replacing one element of the temporary array of k elements.
More specifically, where does the *k factor of their computed time complexity of O((n-k)*k) come from? Why do they multiply n-k by k?
Let's consider that at the kth iteration:
arr[k] > min(temp[0..k-1])
Now you will replace min(temp[0..k-1]) with arr[k].
And now you again need to compute the updated min of temp[0..k-1], because it may have changed; it can be any element of your updated temp[0..k-1].
So in the worst case you update the min every time, and hence the O(k) factor.
Thus, time complexity = O((n-k)*k)
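To make the extra k factor concrete, here is a small Python sketch of the method from the question (illustrative only); the inner scan of temp for its minimum is what contributes O(k) work per element:

def k_largest_naive(arr, k):
    temp = arr[:k]
    for x in arr[k:]:                                  # n - k iterations
        min_idx = min(range(k), key=temp.__getitem__)  # O(k) scan for the current minimum
        if x > temp[min_idx]:
            temp[min_idx] = x                          # replace the minimum with x
    return temp                                        # overall O((n - k) * k)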

Finding the kth smallest element in a sequence where duplicates are compressed?

I've been asked to write a program to find the kth order statistic of a data set consisting of characters and their occurrences. For example, I have a data set consisting of
B,A,C,A,B,C,A,D
Here I have A with 3 occurrences, B with 2 occurrences, C with 2 occurrences and D with one occurrence. They can be grouped in pairs (character, number of occurrences), so, for example, we could represent the above sequence as
(A,3), (B,2), (C,2) and (D,1).
Assuming that k is the number of these pairs, I am asked to find the kth order statistic of the data set in O(n), where n is the number of pairs.
I thought I could sort the elements based on their number of occurrences and then find the kth smallest element, but that won't work within the time bounds. Can I please have some help with an algorithm for this problem?
Assuming that you have access to a linear-time selection algorithm, here's a simple divide-and-conquer algorithm for solving the problem. I'm going to let k denote the total number of pairs and m be the index you're looking for.
If there's just one pair, return the key in that pair.
Otherwise:
Using a linear-time selection algorithm, find the median element. Let medFreq be its frequency.
Sum up the frequencies of the elements less than the median. Call this less. Note that the number of elements less than or equal to the median is less + medFreq.
If less < m ≤ less + medFreq, return the key in the median element.
Otherwise, if m ≤ less, recursively search for the mth element in the first half of the array.
Otherwise (m > less + medFreq), recursively search for the (m - less - medFreq)th element in the second half of the array.
The key insight here is that each iteration of this algorithm tosses out half of the pairs, so each recursive call is on an array half as large as the original array. This gives us the following recurrence relation:
T(k) = T(k / 2) + O(k)
Using the Master Theorem, this solves to O(k).
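Here is a rough Python sketch of that recursion, assuming the input is a list of (key, count) pairs and m is 1-indexed. For brevity the median key is found by sorting, so this sketch runs in O(k log k) per level instead of the O(k) a linear-time selection algorithm would give:

def weighted_kth(pairs, m):
    # Return the key of the m-th smallest element of the expanded multiset.
    if len(pairs) == 1:
        return pairs[0][0]
    keys = sorted(key for key, _ in pairs)
    med_key = keys[len(keys) // 2]        # median key (stand-in for a linear-time selection)
    less = [p for p in pairs if p[0] < med_key]
    equal = [p for p in pairs if p[0] == med_key]
    greater = [p for p in pairs if p[0] > med_key]
    less_count = sum(c for _, c in less)
    med_count = sum(c for _, c in equal)
    if m <= less_count:
        return weighted_kth(less, m)
    if m <= less_count + med_count:
        return med_key
    return weighted_kth(greater, m - less_count - med_count)

For example, weighted_kth([('A', 3), ('B', 2), ('C', 2), ('D', 1)], 4) returns 'B', the 4th element of the expanded sequence A,A,A,B,B,C,C,D.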

A divide-and-conquer algorithm for counting dominating points?

Let's say that a point at coordinate (x1,y1) dominates another point (x2,y2) if x1 ≤ x2 and y1 ≤ y2.
I have a set of points (x1,y1), ..., (xn,yn) and I want to find the total number of dominating pairs. I can do this using brute force by comparing all points against one another, but this takes time O(n^2). Instead, I'd like to use a divide-and-conquer approach to solve this in time O(n log n).
Right now, I have the following algorithm:
Draw a vertical line dividing the set of points into two equal subsets Pleft and Pright. As a base case, if there are just two points left, I can compare them directly.
Recursively count the number of dominating pairs in Pleft and Pright
Some conquer step?
The problem is that I can't see what the "conquer" step should be here. I want to count how many dominating pairs cross from Pleft into Pright, but I don't know how to do that without comparing all the points in both parts, which would take time O(n^2).
Can anyone give me a hint about how to do the conquer step?
(For example, the two halves of the y coordinates might be {1,3,4,5,5} and {5,8,9,10,12} after drawing the division line.)
Suppose you sort the points in both halves separately in ascending order by their y coordinates. Now, look at the lowest y-valued point in both halves. If the lowest point on the left has a lower y value than the lowest point on the right, then that point dominates every point on the right (its x is smaller because it's in the left half, and its y is no larger than any of theirs). Otherwise, the bottom point on the right isn't dominated by any point on the left.
In either case, you can remove one point from one of the two halves and repeat the process with the remaining sorted lists. This does O(1) work per point, so if there are n total points, this does O(n) work (after sorting) to count the number of dominating pairs across the two halves. If you've seen it before, this is similar to the algorithm for counting inversions in an array.
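Here is a small Python sketch of that merge-style count (names are made up for illustration). It assumes every left point already has a smaller x-coordinate than every right point, so a left point dominates a right point exactly when its y-coordinate is not larger; the walk over the two y-sorted lists is O(n):

def count_cross_pairs(left_ys, right_ys):
    # left_ys and right_ys are the y coordinates of the two halves, sorted ascending.
    count = 0
    i = 0
    for qy in right_ys:                              # each right point, by increasing y
        while i < len(left_ys) and left_ys[i] <= qy:
            i += 1                                   # left points with y <= qy dominate this right point
        count += i
    return count

For the halves above, count_cross_pairs([1, 3, 4, 5, 5], [5, 8, 9, 10, 12]) returns 25.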
Factoring in the time required to sort the points (O(n log n)), this conquer step takes O(n log n) time, giving the recurrence
T(n) = 2T(n / 2) + O(n log n)
This solves to O(n log^2 n) according to the Master Theorem.
However, you can speed this up. Suppose that before you start the divide-and-conquer step you presort the points by their y coordinates, doing one pass of O(n log n) work. Using tricks similar to the closest pair of points problem, you can then get the points in each half sorted in O(n) time for each subproblem of size n (see the discussion at the bottom of this page for details). That changes the recurrence to
T(n) = 2T(n / 2) + O(n)
Which solves to O(n log n), as required.
Hope this helps!
Well, in this way you have O(n^2) just for the division into subsets...
My approach would be different:
Sort the points by x ... O(n·log(n)).
Now check the y values, but only against points with a bigger x (if you sort ascending, then only against points with a larger index).
So now you have O(n·log(n) + n·n/2).
You can also speed things up further by doing separate x and y tests and combining the results afterwards, which leads to O(n + 3·n·log(n)):
Add an index attribute to your points,
where index = 0xYYYYXXXX is an unsigned integer type:
YYYY is the index of the point in the Y-sorted array,
XXXX is the index of the point in the X-sorted array.
If you have more than 2^16 points, use a bigger than 32-bit data type.
Sort the points by ascending X and set the XXXX part of their index: O1(n·log(n))
Sort the points by ascending Y and set the YYYY part of their index: O2(n·log(n))
Sort the points by ascending index: O3(n·log(n))
Now point i dominates any point j if (i < j).
But if you need to actually create all the pairs for every point,
that would take O4(n·n/2), so this approach would not save any time.
If you need just a single pair for any point, then a simple loop will suffice: O4(n-1),
so in this case O(n-1+3·n·log(n)) -> ~O(n+3·n·log(n)).
Hope it helped... of course, if you are stuck with that subdivision approach then I have no better solution for you.
PS. For this you do not need any additional recursion, just three sorts and only one uint per point, so the memory requirements are not that big, and in general it should even be faster than the recursive calls of the subdivision approach.
This algorithm runs in O(N*log(N)) where N is the size of the list of points and it uses O(1) extra space.
Perform the following steps:
1. Sort the list of points by y-coordinate (ascending order), breaking ties by x-coordinate (ascending order).
2. Go through the sorted list in reverse order to count the dominating points: if the current x-coordinate >= the max x-coordinate encountered so far, increment the result and update the max.
This works since you know for sure that if all points with a greater y-coordinate have a smaller x-coordinate than the current point, you have found a dominating point. The sorting step makes it really efficient.
Here's the Python code:
import functools

def my_cmp(p1, p2):
    # Compare by y-coordinate first, then break ties by x-coordinate (both ascending).
    delta_y = p1[1] - p2[1]
    if delta_y != 0:
        return delta_y
    return p1[0] - p2[0]

def count_dom_points(points):
    points.sort(key=functools.cmp_to_key(my_cmp))  # cmp= is Python 2 only
    maxi = float('-inf')
    count = 0
    for x, y in reversed(points):
        if x >= maxi:
            count += 1
            maxi = x
    return count

How to find k nearest neighbors to the median of n distinct numbers in O(n) time?

I can use the median of medians selection algorithm to find the median in O(n). Also, I know that after the algorithm is done, all the elements to the left of the median are less that the median and all the elements to the right are greater than the median. But how do I find the k nearest neighbors to the median in O(n) time?
If the median value is m, the numbers to the left are less than m and the numbers to the right are greater than m.
However, the array is not sorted in the left or the right sides. The numbers are any set of distinct numbers given by the user.
The problem is from Introduction to Algorithms by Cormen, problem 9.3-7
No one seems to quite have this. Here's how to do it. First, find the median as described above; this is O(n). Now park the median at the end of the array, and replace every other element by its absolute difference from the median. Now find element k of the array (not including the last element), using the quickselect algorithm again. This not only finds element k (in order), it also leaves the array so that the k smallest differences are at the beginning of the array. Those positions hold the k elements closest to the median, once you map the differences back to their original values.
The median-of-medians probably doesn't help much in finding the nearest neighbours, at least for large n. True, you have each column of 5 partitioned around its median, but this isn't enough ordering information to solve the problem.
I'd just treat the median as an intermediate result, and treat the nearest neighbours as a priority queue problem...
Once you have the median from the median-of-medians, keep a note of its value.
Run the heapify algorithm on all your data - see Wikipedia - Binary Heap. In comparisons, base the result on the difference relative to that saved median value. The highest priority items are those with the lowest ABS(value - median). This takes O(n).
The first item in the array is now the median (or a duplicate of it), and the array has heap structure. Use the heap extract algorithm to pull out as many nearest-neighbours as you need. This is O(k log n) for k nearest neighbours.
So long as k is a constant, you get O(n) median of medians, O(n) heapify and O(log n) extracting, giving O(n) overall.
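A short Python sketch of that heap variant, assuming the median has already been found (heapq compares tuples by their first element, which here plays the role of the |value - median| priority):

import heapq

def k_nearest_via_heap(a, k, median):
    heap = [(abs(x - median), x) for x in a]            # priority = absolute distance to the median
    heapq.heapify(heap)                                 # O(n)
    return [heapq.heappop(heap)[1] for _ in range(k)]   # O(k log n)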
med = Select(A, 1, n, n/2)      // finds the median
for i = 1 to n
    B[i] = abs(A[i] - med)      // absolute difference from the median
q = Select(B, 1, n, k)          // get the kth smallest difference
j = 0
for i = 1 to n
    if B[i] <= q
        C[j] = A[i]             // assign A[i], the real value, instead of B[i], which is only the difference between A[i] and the median
        j++
return C
You can solve your problem like this:
You can find the median in O(n), e.g. using the O(n) nth_element algorithm.
You loop through all elements, substituting each with a pair:
(the absolute difference to the median, the element's value).
Once more you do nth_element with n = k. After applying this algorithm you are guaranteed to have the k smallest elements by absolute difference first in the new array. You take their indices and you are DONE!
Four Steps:
Use Median of medians to locate the median of the array - O(n)
Determine the absolute difference between the median and each element in the array and store them in a new array - O(n)
Use Quickselect or Introselect to pick k smallest elements out of the new array - O(k*n)
Retrieve the k nearest neighbours by indexing the original array - O(k)
When k is small enough, the overall time complexity becomes O(n).
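A compact Python sketch of those four steps (illustrative only; the median and the selection are done with sorted() and heapq.nsmallest for brevity, so this sketch is O(n log n) rather than the O(n) a true linear-time selection would give):

import heapq

def k_nearest_to_median(a, k):
    med = sorted(a)[len(a) // 2]                        # step 1: the median (quickselect in the real version)
    diffs = [(abs(x - med), x) for x in a]              # step 2: absolute differences
    return [x for _, x in heapq.nsmallest(k, diffs)]    # steps 3-4: k smallest differences, mapped back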
1. Find the median in O(n).
2. Create a new array in which each element is the absolute value of the original value minus the median.
3. Find the kth smallest number of the new array in O(n).
4. The desired values are the elements whose absolute difference with the median is less than or equal to that kth smallest number.
You could use a non-comparison sort, such as a radix sort, on the list of numbers L, then find the k closest neighbors by considering windows of k elements and examining the window endpoints. Another way of stating "find the window" is find i that minimizes abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i] - L[n/2]) (if k is odd) or abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i+1] - L[n/2]) (if k is even). Combining the cases, abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i+!(k&1)] - L[n/2]). A simple, O(k) way of finding the minimum is to start with i=0, then slide to the left or right, but you should be able to find the minimum in O(log(k)).
The expression you minimize comes from transforming L into another list, M, by taking the absolute difference of each element from the median.
m=L[n/2]
M=abs(L-m)
i minimizes M[n/2-k/2+i] + M[n/2+k/2+i].
You already know how to find the median in O(n).
If the order does not matter, selecting the k smallest can also be done in O(n).
Apply that selection for the k smallest to the right-hand side of the median and for the k largest to the left-hand side of the median.
From Wikipedia:
function findFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        if pivotNewIndex > k // new condition
            findFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < k
            findFirstK(list, pivotNewIndex+1, right, k)
Don't forget the special case where k == n: return the original list.
Actually, the answer is pretty simple. All we need to do is to select k elements with the smallest absolute differences from the median moving from m-1 to 0 and m+1 to n-1 when the median is at index m. We select the elements using the same idea we use in merging 2 sorted arrays.
If you know the index of the median, which should just be ceil(array.length/2), then it should just be a matter of listing out n(x-k), n(x-k+1), ..., n(x), n(x+1), n(x+2), ..., n(x+k),
where n is the array, x is the index of the median, and k is the number of neighbours you need (maybe k/2 on each side, if you want k in total rather than k on each side).
First select the median in O(n) time, using a standard algorithm of that complexity.
Then run through the list again, selecting the elements that are nearest to the median (by storing the best known candidates and comparing new values against these candidates, just like one would search for a maximum element).
In each step of this additional run through the list O(k) steps are needed, and since k is constant this is O(1). So the total time needed for the additional run is O(n), as is the total runtime of the full algorithm.
Since all the elements are distinct, there can be at most 2 elements with the same difference from the median. I think it is easier to have 2 arrays A[k] and B[k], with the index representing the absolute value of the difference from the median. Now the task is just to fill up the arrays and choose k elements by reading the first k non-empty values of the arrays, reading A[i] and B[i] before A[i+1] and B[i+1]. This can be done in O(n) time.
All the answers suggesting subtracting the median from the array would produce incorrect results. This method finds the elements closest in value, not closest in position.
For example, if the array is 1,2,3,4,5,10,20,30,40 and k=2, the value returned would be (3,4), which is incorrect. The correct output should be (4,10), as they are the nearest neighbors.
The correct way to find the result would be to use the selection algorithm to find the upper and lower bound elements, and then find the remaining elements from the list by direct comparison.
