sort array of locations by nearest point - algorithm

So the question is like this:
Given a location X and an array of locations, I want to get an array
of locations sorted by their distance to X, nearest first.
The way I solved this is by iterating through the location array, calculating the distance between X and each location, storing that distance, and then sorting the locations by distance using a Comparator. Done! Is there a better way to do this? Assuming the sort is a merge sort, it should be O(n log n).
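For reference, here is a minimal sketch of that approach in Java (the Point record and its fields are just illustrative names, not from the original post, and require Java 16+); it sorts by squared distance so no square root is needed:

import java.util.Arrays;
import java.util.Comparator;

public class SortByDistance {
    // Hypothetical location type, just for this example.
    record Point(double x, double y) {}

    // Squared Euclidean distance; good enough for ordering, avoids sqrt().
    static double squaredDistance(Point a, Point b) {
        double dx = a.x() - b.x(), dy = a.y() - b.y();
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        Point x = new Point(0, 0);
        Point[] locations = { new Point(3, 4), new Point(1, 1), new Point(-2, 0.5) };

        // Sort ascending by distance to X: O(n log n) with the built-in merge-based sort.
        Arrays.sort(locations, Comparator.comparingDouble((Point p) -> squaredDistance(p, x)));

        System.out.println(Arrays.toString(locations));
    }
}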

If I understand this right, you can do this pretty quickly for multiple queries - as in, given several values of X, you wouldn't have to re-sort your solution array every time. Here's how you do it:
Sort the array initially (O(n log n)); call this pre-processing.
Now, on every query X, binary search the array for X (or the closest number smaller than X). Maintain two indices, i and j: one pointing at that location and one at the next. One of these two is clearly the closest number to X in the list. Pick the one at the smaller distance and put it in your solution array. If i was picked, decrement i; if j was picked, increment j. Repeat this process until all the numbers are in the solution array.
This takes O(n + log n) per query with O(n log n) preprocessing. Of course, if we were talking about just one X, this is no better at all.
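Here is a sketch of that query step in Java, treating locations as 1-D numbers as this answer does (method and variable names are illustrative). The array is assumed to be pre-sorted, and the result is built by expanding outward from the binary-search position:

import java.util.Arrays;

public class NearestFirst {
    // arr must already be sorted ascending (the O(n log n) pre-processing step).
    static double[] sortedByDistanceTo(double[] arr, double x) {
        // Arrays.binarySearch returns (-(insertionPoint) - 1) when x is absent.
        int pos = Arrays.binarySearch(arr, x);
        if (pos < 0) pos = -pos - 1;

        double[] result = new double[arr.length];
        int i = pos - 1;      // candidate on the left of x
        int j = pos;          // candidate at/right of x
        for (int k = 0; k < arr.length; k++) {
            if (j >= arr.length || (i >= 0 && x - arr[i] <= arr[j] - x)) {
                result[k] = arr[i--];   // left candidate is closer (or right side exhausted)
            } else {
                result[k] = arr[j++];   // right candidate is closer (or left side exhausted)
            }
        }
        return result;   // all n elements, ordered by distance to x: O(n) per query
    }

    public static void main(String[] args) {
        double[] arr = { 1, 4, 6, 9, 15 };
        System.out.println(Arrays.toString(sortedByDistanceTo(arr, 7)));   // 6, 9, 4, 1, 15
    }
}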

The problem you describe sounds like an m-nearest-neighbor search to me.
So, if I correctly understood your question, i.e. a location being a vector in a multidimensional metric space and the distance being a proper metric on that space, then it would be worthwhile to put the array of locations into a k-d tree.
You pay a one-time overhead to build the tree, but each search is then O(log n).
A benefit of this: assuming you are only interested in the m < n closest locations, you don't need to evaluate all n distances every time you search for a new X.
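As a rough illustration of the idea (not the answerer's code), a minimal 2-d tree in Java might look like the sketch below. It shows a single nearest-neighbor query for brevity; the m-nearest variant would keep a bounded max-heap of candidates instead of a single best point. The build here is O(n log^2 n) because each range is re-sorted:

import java.util.Arrays;
import java.util.Comparator;

public class KdTree {
    private static class Node {
        double[] pt;
        Node left, right;
        Node(double[] pt) { this.pt = pt; }
    }

    private final Node root;

    public KdTree(double[][] points) {
        root = build(points, 0, points.length, 0);   // note: reorders the input array
    }

    private Node build(double[][] pts, int lo, int hi, int axis) {
        if (lo >= hi) return null;
        Arrays.sort(pts, lo, hi, Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = (lo + hi) / 2;                      // median along the current axis
        Node node = new Node(pts[mid]);
        node.left = build(pts, lo, mid, 1 - axis);
        node.right = build(pts, mid + 1, hi, 1 - axis);
        return node;
    }

    public double[] nearest(double[] q) {
        return nearest(root, q, 0, null);
    }

    private double[] nearest(Node node, double[] q, int axis, double[] best) {
        if (node == null) return best;
        if (best == null || sq(node.pt, q) < sq(best, q)) best = node.pt;
        double diff = q[axis] - node.pt[axis];
        Node near = diff < 0 ? node.left : node.right;
        Node far  = diff < 0 ? node.right : node.left;
        best = nearest(near, q, 1 - axis, best);
        if (diff * diff < sq(best, q))   // only cross the split if it could hide a closer point
            best = nearest(far, q, 1 - axis, best);
        return best;
    }

    private static double sq(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        KdTree tree = new KdTree(new double[][] { {3, 4}, {1, 1}, {-2, 0.5}, {5, 5} });
        System.out.println(Arrays.toString(tree.nearest(new double[] {2, 2})));   // [1.0, 1.0]
    }
}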

You could use a min-heap data structure to implement this: store the locations in the heap, keyed by the distance between X and each location.
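In Java that could look like the following sketch, with 1-D locations for brevity; note that filling the heap and then draining it is still O(n log n) overall:

import java.util.PriorityQueue;

public class HeapByDistance {
    public static void main(String[] args) {
        double x = 7;
        double[] locations = { 1, 4, 6, 9, 15 };

        // Min-heap ordered by |location - x|.
        PriorityQueue<Double> heap =
                new PriorityQueue<>((a, b) -> Double.compare(Math.abs(a - x), Math.abs(b - x)));
        for (double loc : locations) heap.add(loc);

        // Polling yields locations from nearest to farthest.
        while (!heap.isEmpty()) System.out.println(heap.poll());
    }
}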

You can't do asymptotically better than O(n log n) if you are using a comparison-based sort. If you want to talk about micro-optimizations of the code, though, some ideas include:
Sort by squared distance; there is no reason to ever call sqrt(), and sqrt() is expensive.
Only compute the squared distance when necessary: if |dx1| <= |dx2| and |dy1| <= |dy2|, then pt1 is at least as close as pt2. Integer multiplication is fast, but avoiding it in many cases may still be somewhat faster.
A thinking-outside-the-box solution might be to use e.g. bucket sort, a linear-time sorting algorithm which might be applicable here.

Proof by contradiction that you can't do better:
It is well known that for arbitrary numbers (which may include irrational values), comparison-based sorting is the only option, and comparison sorts cannot beat Ω(n log n) time.
Suppose you could go through the list in O(n) time, select the smallest number, use that as X, and somehow produce the list of numbers sorted by distance to X in o(n log n) time. Since every number's distance to the minimum is just its value minus the minimum, that output is the list in ascending order, so you would have sorted n numbers in less than Θ(n log n) time, which is a contradiction.

Related

K nearest points. Time complexity O(n), not O(nLogn). How?

Given a list of a million coordinates in the form of longitude and latitude, just as in Google Maps, how would you print the k closest cities to a given location?
I was asked this question during an interview. The interviewer said this can be done in O(n) by using insertion sort up to k rather than sorting the whole list, which is O(N log N). I found other answers online, and most say O(N log N)... was he [the interviewer] correct?
I think that, while calculating the distances, you can maintain a list of K elements.
Every time you have a new distance, insert it into the list if it is smaller than the largest one, and remove the largest one.
This insertion can be O(K) if you are using a sorted array, or O(log K) if you are using a binary heap.
In the worst case, you will insert n times. In total, it will be O(NK) or O(N log K). If K is small enough, it is O(N).
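A sketch of the O(N log K) variant in Java, using a size-bounded max-heap (a PriorityQueue with a reversed comparator); distances are plain doubles here for illustration, where a real version would carry each city along with its distance:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KClosest {
    // Returns the k smallest distances in O(n log k); the result is unsorted.
    static List<Double> kSmallest(double[] distances, int k) {
        // Max-heap: the root is the largest of the k candidates kept so far.
        PriorityQueue<Double> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (double d : distances) {
            if (heap.size() < k) {
                heap.add(d);
            } else if (d < heap.peek()) {
                heap.poll();          // drop the current largest candidate
                heap.add(d);
            }
        }
        return new ArrayList<>(heap);
    }

    public static void main(String[] args) {
        double[] distances = { 9.5, 2.0, 7.1, 0.3, 4.4, 8.8 };
        System.out.println(kSmallest(distances, 3));   // some ordering of 0.3, 2.0, 4.4
    }
}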
This is essentially the quickselect algorithm (https://en.wikipedia.org/wiki/Quickselect).
Basically it's quicksort with a modification - whenever you have two halves you sort only one of them:
If a half contains k-th position - continue with subdividing and sorting it
If a half is completely after k-th position - no need to sort it, we are not interested in those elements
If a half is completely before k-th position - no need to sort it, we need all those elements and their order doesn't matter
When you finish, the closest k elements will be in the first k places of the array (though not necessarily sorted).
Since at every step you process only one half, time will be n+n/2+n/4+n/8+...=2n (ignoring constants).
For a guaranteed O(n) you can always select a good pivot with e.g. median of medians (https://en.wikipedia.org/wiki/Median_of_medians).
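Here is a sketch of that idea in Java with a random pivot (as noted above, median of medians could replace the pivot choice for a guaranteed O(n)). After the call, the k smallest distances occupy the first k slots in no particular order; a real version would move the cities along with their distances:

import java.util.Arrays;
import java.util.Random;

public class QuickSelectK {
    private static final Random RNG = new Random();

    // Rearranges a so that the k smallest values end up in a[0..k-1] (unsorted). Expected O(n).
    static void selectK(double[] a, int k) {
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == k) return;              // everything left of p is <= a[p]
            if (p < k) lo = p + 1;           // the k-boundary lies in the right half
            else hi = p - 1;                 // the k-boundary lies in the left half
        }
    }

    // Lomuto partition around a random pivot; returns the pivot's final index.
    private static int partition(double[] a, int lo, int hi) {
        swap(a, lo + RNG.nextInt(hi - lo + 1), hi);
        double pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(double[] a, int i, int j) {
        double t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        double[] distances = { 9.5, 2.0, 7.1, 0.3, 4.4, 8.8 };
        selectK(distances, 3);
        System.out.println(Arrays.toString(Arrays.copyOf(distances, 3)));   // 0.3, 2.0, 4.4 in some order
    }
}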
Working on the assumption that latitude and longitude have a given number of digits, we can actually use radix sort. It seems similar to Hanqiu's answer, but I'm not sure if it's the same one. The Wikipedia description:
In computer science, radix sort is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by the individual digits which share the same significant position and value. A positional notation is required, but because integers can represent strings of characters (e.g., names or dates) and specially formatted floating point numbers, radix sort is not limited to integers. Radix sort dates back as far as 1887 to the work of Herman Hollerith on tabulating machines.
The article says the following about efficiency:
The topic of the efficiency of radix sort compared to other sorting algorithms is somewhat tricky and subject to quite a lot of misunderstandings. Whether radix sort is equally efficient, less efficient or more efficient than the best comparison-based algorithms depends on the details of the assumptions made. Radix sort complexity is O(wn) for n keys which are integers of word size w. Sometimes w is presented as a constant, which would make radix sort better (for sufficiently large n) than the best comparison-based sorting algorithms, which all perform Θ(n log n) comparisons to sort n keys.
In your case, the w corresponds to the word size of your latitude and longitude, that is, the number of digits. In particular, this gets more efficient for lower precision (fewer digits) in your coordinates. Whether it's more efficient than n log n algorithms depends on your n and your implementation. Asymptotically, it's better than n log n.
Obviously, you'd still need to combine the two coordinates into an actual distance.
You could also use the following algorithm, with O(N) complexity, which exploits a "HashMap-like" array that automatically sorts the distances, up to a given resolution.
Here's the pseudo-code in Java:
City[] cities = /* your city list */;
Coordinate coor = /* the coordinate of interest */;
double resolution = 0.1, capacity = 1000;

// Java does not allow "new ArrayList<City>[...]", so create a raw array and cast.
@SuppressWarnings("unchecked")
ArrayList<City>[] cityDistances = (ArrayList<City>[]) new ArrayList[(int) (capacity / resolution)];
ArrayList<City> closestCities = new ArrayList<City>();

// Bucket each city by its discretized distance to the coordinate of interest.
for (City c : cities) {
    double distance = coor.getDistance(c);
    int hash = (int) (distance / resolution);
    if (cityDistances[hash] == null) cityDistances[hash] = new ArrayList<City>();
    cityDistances[hash].add(c);
}

// Walk the buckets from nearest to farthest until 10 cities have been collected.
for (int index = 0; index < cityDistances.length && closestCities.size() < 10; index++) {
    ArrayList<City> cList = cityDistances[index];
    if (cList == null) continue;
    closestCities.addAll(cList);
}
The idea is to loop through the list of cities, calculate each city's distance to the coordinate of interest, and then use that distance to determine where the city should be added in the "HashMap-like" array cityDistances. The smaller the distance, the closer the index is to 0.
The smaller the resolution, the more likely it is that the list closestCities will end up with exactly 10 cities after the last loop.

Grouping set of points to nearest pairs

I need an algorithm for the following problem:
I'm given a set of 2D points P = { (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) } on a plane. I need to group them in pairs in the following manner:
Find two closest points (x_a, y_a) and (x_b, y_b) in P.
Add the pair <(x_a, y_a), (x_b, y_b)> to the set of results R.
Remove <(x_a, y_a), (x_b, y_b)> from P.
If the set P is not empty, go to step one.
Return set of pairs R.
The naive algorithm is O(n^3); using a faster algorithm for nearest-neighbor search it can be improved to O(n^2 log n). Can it be made any better?
And what if the points are not in Euclidean space?
An example (resulting groups are circled by red loops in the original image, not reproduced here).
Put all of the points into an R-tree (http://en.wikipedia.org/wiki/R-tree) in O(n log(n)) time, then for each point calculate the distance to its nearest neighbor. Put the points and these initial distances into a priority queue. Initialize an empty set of removed points and an empty set of pairs. Then run the following pseudocode:
while priority_queue is not empty:
    (distance, point) = priority_queue.get()
    if point in removed_set:
        continue
    neighbor = rtree.find_nearest_neighbor(point)
    if distance < distance_between(point, neighbor):
        # The previous neighbor was removed; re-queue with the updated distance.
        priority_queue.add((distance_between(point, neighbor), point))
    else:
        # This is the closest pair.
        found_pairs.add((point, neighbor))
        removed_set.add(point)
        removed_set.add(neighbor)
        rtree.remove(point)
        rtree.remove(neighbor)
The slowest part of this is the nearest neighbor searches. An R-tree does not guarantee that those nearest neighbor searches will be O(log(n)). But they tend to be. Furthermore you are not guaranteed that you will do O(1) neighbor searches per point. But typically you will. So average performance should be O(n log(n)). (I might be missing a log factor.)
This problem calls for a dynamic Voronoi diagram I guess.
When the Voronoi diagram of a point set is known, the nearest neighbor pair can be found in linear time.
Then deleting these two points can be done in linear or sublinear time (I didn't find precise info on that).
So globally you can expect an O(N²) solution.
If your distances are arbitrary and you can't embed your points into Euclidean space (and/or the dimension of the space would be really high), then there's basically no way around at least a quadratic-time algorithm, because you don't know which pair is closest until you have checked all the pairs. It is easy to get very close to this bound: sort all pairs by distance, maintain a boolean lookup table marking which points have already been taken, and then walk the sorted pair list in order, adding a pair to your "nearest neighbors" result if neither of its points is marked as taken, and marking both points as taken whenever you add a pair. Complexity O(n^2 log n), with O(n^2) extra space.
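A sketch of that greedy pairing in Java, assuming the arbitrary distances are given as a symmetric matrix (names are illustrative); as stated above, this is O(n^2 log n) time and O(n^2) extra space:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GreedyPairing {
    // dist is a symmetric n x n distance matrix; n is assumed even so every point gets paired.
    static List<int[]> pairUp(double[][] dist) {
        int n = dist.length;
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                pairs.add(new int[] { i, j });

        // Sort all O(n^2) pairs by distance: O(n^2 log n).
        pairs.sort(Comparator.comparingDouble((int[] p) -> dist[p[0]][p[1]]));

        boolean[] taken = new boolean[n];
        List<int[]> result = new ArrayList<>();
        for (int[] p : pairs) {
            if (!taken[p[0]] && !taken[p[1]]) {   // both endpoints still available
                result.add(p);
                taken[p[0]] = true;
                taken[p[1]] = true;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] dist = {
            { 0, 1, 5, 6 },
            { 1, 0, 4, 7 },
            { 5, 4, 0, 2 },
            { 6, 7, 2, 0 },
        };
        for (int[] p : pairUp(dist)) System.out.println(p[0] + " - " + p[1]);   // 0 - 1, 2 - 3
    }
}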
You can find the closest pair with this divide-and-conquer algorithm, which runs in O(n log n) time; you could repeat it n times and get O(n^2 log n), which is no better than what you already have.
Nevertheless, you can exploit the recursive structure of the divide-and-conquer algorithm. Think about it this way: if the pair of points you removed was on the right side of the partition, then nothing changes on the left side, so you only have to redo the O(log n) merge steps bottom-up. The first new merge step merges 2 elements, the second merges 4, then 8, 16, ..., n/4, n/2, n, so the total work across these merge steps is O(n), and you get the next-closest pair in just O(n) time. Repeat this n/2 times, removing the previously found pair each time, for a total O(n^2) runtime with O(n log n) extra space to keep track of the recursive steps, which is a little better.
But you can do even better: there is a randomized data structure that lets you update your point set and answer closest-pair queries in expected O(log n) time per update and query. I'm not very familiar with that particular data structure, but you can find it in this paper. That would make your algorithm O(n log n) expected time. I'm not sure whether there is a deterministic version with similar runtimes, but those tend to be far more cumbersome.

Finding the m Largest Numbers

This is a problem from the Cormen text, but I'd like to see if there are any other solutions.
Given an array with n distinct numbers, you need to find the m largest ones in the array, and have
them in sorted order. Assume n and m are large, but grow differently. In particular, you need
to consider the situations where m = t*n, where t is a small number, say 0.1, and also the
possibility m = √n.
The solution given in the book offers 3 options:
Sort the array and return the top m-long segment
Convert the array to a max-heap and extract the m elements
Select the m-th largest number, partition the array about it, and sort the segment of larger entries.
These all make sense, and they all have their pros and cons, but I'm wondering, is there another way to do it? It doesn't have to be better or faster, I'm just curious to see if this is a common problem with more solutions, or if we are limited to those 3 choices.
The time complexities of the three approaches you have mentioned are as follows.
O(n log n)
O(n + m log n)
O(n + m log m)
So option (3) is definitely better than the others in terms of asymptotic complexity, since m <= n. When m is small, the difference between (2) and (3) is so small it would have little practical impact.
As for other ways to solve the problem, there are infinitely many, so the question is somewhat ill-defined in this regard. Another approach that is practically simple and performant is the following.
Extract the first m numbers from your list of n into an array, and sort it.
Repeatedly grab the next number from your list and insert it into the correct location in the array, shifting all the lesser numbers over by one and pushing one out.
I would only do this if m was very small though. Option (2) from your original list is also extremely easy to implement if you have a max-heap implementation and will work great.
A different approach.
Take the first m numbers and turn them into a min-heap. Run through the rest of the array; whenever a value exceeds the minimum of the current top m, extract that minimum and insert the new value. When you reach the end of the array, extract the elements into an array and reverse it.
The worst case performance of this version is O(n log(m)) placing it between the first and second methods for efficiency.
The average case is more interesting. On average only O(m log(n/m)) of the elements are going to pass the first comparison test, each time incurring O(log(m)) work so you get O(n + m log(n/m) log(m)) work, which puts it between the second and third methods. But if n is many orders of magnitude greater than m then the O(n) piece dominates, and the O(n) median select in the third approach has worse constants than the one comparison per element in this approach, so in this case this is actually the fastest!
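A sketch of that min-heap approach in Java (PriorityQueue is a min-heap by default), with the final extract-and-reverse step to produce descending order:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class MLargest {
    // Returns the m largest values in descending order; O(n log m) worst case.
    static List<Integer> mLargest(int[] a, int m) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();   // min-heap of the m best so far
        for (int v : a) {
            if (heap.size() < m) {
                heap.add(v);
            } else if (v > heap.peek()) {   // beats the smallest of the current top m
                heap.poll();
                heap.add(v);
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll());  // drains in ascending order
        Collections.reverse(result);                      // descending order
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mLargest(new int[] { 5, 1, 9, 3, 7, 6 }, 3));   // [9, 7, 6]
    }
}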

Finding pair of big-small points from a set of points in a 2D plane

The following is an interview question which I've tried hard to solve. The required bound is to be less than O(n^2). Here is the problem:
You are given with a set of points S = (x1,y1)....(xn,yn). The points
are co-ordinates on the XY plane. A point (xa,ya) is said to be
greater than point (xb,yb) if and only if xa > xb and ya > yb.
The objective is to find all pairs of points p1 = (xa,ya) and p2 = (xb,yb) from the set S such that p1 > p2.
Example:
Input S = (1,2),(2,1),(3,4)
Answer: {(3,4),(1,2)} , {(3,4),(2,1)}
I can only come up with an O(n^2) solution that involves checking each point against every other. If there is a better approach, please help me.
I am not sure you can do it.
Example Case: Let the points be (1,1), (2,2) ... (n,n).
There are Θ(n^2) such pairs, and just outputting them takes O(n^2) time.
I am assuming you actually want to count such pairs.
Sort in descending order by x in O(n log n). Now we have reduced the problem to a single dimension: for each position k, we need to count how many of the y-values before it are larger than the y-value at position k. This is equivalent to counting inversions, a problem that has been answered many times on this site, including by me, for example here.
The easiest way to get O(n log n) for that problem is by using the merge sort algorithm, if you want to think about it yourself before clicking that link. Other ways include using binary indexed trees (fenwick trees) or binary search trees. The fastest in practice is probably by using binary indexed trees, because they only involve bitwise operations.
If you want to print the pairs, you cannot do better than O(n^2) in the worst case. I would be interested in an output-sensitive O(num_pairs) algorithm too however.
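For the counting version, here is a merge-sort sketch in Java: sort the points by x descending (breaking ties by ascending y so pairs with equal x are never counted), then count inversions in the resulting sequence of y-values:

import java.util.Arrays;

public class DominatingPairs {
    // Counts pairs (p1, p2) with p1.x > p2.x and p1.y > p2.y in O(n log n).
    static long countDominatingPairs(int[][] pts) {
        // Sort by x descending; for equal x, sort y ascending so ties are never counted.
        Arrays.sort(pts, (a, b) -> a[0] != b[0] ? Integer.compare(b[0], a[0])
                                                : Integer.compare(a[1], b[1]));
        int n = pts.length;
        int[] y = new int[n];
        for (int i = 0; i < n; i++) y[i] = pts[i][1];
        return countInversions(y, 0, n, new int[n]);
    }

    // Standard merge sort that also counts pairs i < j with y[i] > y[j].
    private static long countInversions(int[] y, int lo, int hi, int[] tmp) {
        if (hi - lo <= 1) return 0;
        int mid = (lo + hi) / 2;
        long count = countInversions(y, lo, mid, tmp) + countInversions(y, mid, hi, tmp);
        int i = lo, j = mid, k = lo;
        while (i < mid || j < hi) {
            if (j >= hi || (i < mid && y[i] <= y[j])) {
                tmp[k++] = y[i++];
            } else {
                count += mid - i;     // y[j] is smaller than all remaining left elements
                tmp[k++] = y[j++];
            }
        }
        System.arraycopy(tmp, lo, y, lo, hi - lo);
        return count;
    }

    public static void main(String[] args) {
        int[][] pts = { {1, 2}, {2, 1}, {3, 4} };
        System.out.println(countDominatingPairs(pts));   // 2: (3,4)>(1,2) and (3,4)>(2,1)
    }
}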
Why don't you just sort the list of points by X, with Y as a secondary index? (O(n log n))
Then you can just give a "lazy" indicator that shows for each point that all the points on its right are bigger than it.
If you want to find them ALL, it will take O(n^2) anyway, because there's O(n^2) pairs.
Think of a sorted list, the first one is smallest, so there's n-1 bigger points, the second one has n-2 bigger points... which adds up to about (n^2)/2 == O(n^2)

Is it possible to find two numbers whose difference is minimum in O(n) time

Given an unsorted integer array, and without making any assumptions on
the numbers in the array:
Is it possible to find two numbers whose
difference is minimum in O(n) time?
Edit: Difference between two numbers a, b is defined as abs(a-b)
Find smallest and largest element in the list. The difference smallest-largest will be minimum.
If you're looking for the nonnegative difference, then this is of course at least as hard as checking whether the array has two equal elements. That is called the element uniqueness problem and, without any additional assumptions (like limiting the size of the integers or allowing operations other than comparison), it requires Ω(n log n) time. It is the 1-dimensional case of finding the closest pair of points.
I don't think you can do it in O(n). The best I can come up with off the top of my head is to sort them (which is O(n log n)) and then find the minimum difference between adjacent pairs in the sorted list (which adds another O(n)).
I think it is possible. The secret is that you don't actually have to sort the list, you just need to create a tally of which numbers exist. This may count as "making an assumption" from an algorithmic perspective, but not from a practical perspective. We know the ints are bounded by a min and a max.
So, create an array of 2-bit elements, one pair of bits for each int from INT_MIN to INT_MAX inclusive, and set all of them to 00.
Iterate through the entire list of numbers. For each number in the list, if the corresponding two bits are 00, set them to 01; if they're 01, set them to 10; otherwise ignore it. This is obviously O(n).
Next, if any pair of bits is set to 10, that is your answer: the minimum distance is 0, because the list contains a repeated number. If not, scan through the tally array and find the minimum distance. Many people have already pointed out there are simple O(n) algorithms for this.
So O(n) + O(n) = O(n).
Edit: responding to comments.
Interesting points. I think you could achieve the same results without making any assumptions by finding the min/max of the list first and using a sparse array ranging from min to max to hold the data. Takes care of the INT_MIN/MAX assumption, the space complexity and the O(m) time complexity of scanning the array.
The best I can think of is to counting sort the array (possibly combining equal values) and then do the sorted comparisons -- bin sort is O(n + M) (M being the number of distinct values). This has a heavy memory requirement, however. Some form of bucket or radix sort would be intermediate in time and more efficient in space.
Sort the list with radixsort (which is O(n) for integers), then iterate and keep track of the smallest distance so far.
(I assume your integer is a fixed-bit type. If they can hold arbitrarily large mathematical integers, radixsort will be O(n log n) as well.)
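A sketch of that approach in Java for 32-bit ints, using a byte-wise LSD radix sort (the sign bit is flipped so negative values order correctly), followed by one linear scan over adjacent elements; overflow of the subtraction is ignored here for extreme value ranges:

public class MinDiffRadix {
    // LSD radix sort over 4 byte-sized digits; sign bit flipped so negatives sort before positives.
    static void radixSort(int[] a) {
        int n = a.length;
        int[] buf = new int[n];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int v : a) count[digit(v, shift) + 1]++;
            for (int d = 0; d < 256; d++) count[d + 1] += count[d];   // prefix sums = start positions
            for (int v : a) buf[count[digit(v, shift)]++] = v;        // stable placement
            System.arraycopy(buf, 0, a, 0, n);
        }
    }

    private static int digit(int v, int shift) {
        return ((v ^ Integer.MIN_VALUE) >>> shift) & 0xFF;   // flip sign bit, take one byte
    }

    // O(n) overall, given the O(w n) radix sort above (w = 4 byte passes for 32-bit ints).
    static int minDifference(int[] a) {
        radixSort(a);
        int best = Integer.MAX_VALUE;
        for (int i = 1; i < a.length; i++) {
            best = Math.min(best, a[i] - a[i - 1]);   // array is sorted, so this is |a[i] - a[i-1]|
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(minDifference(new int[] { 30, -7, 12, 10, -3 }));   // 2 (12 - 10)
    }
}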
It seems to be possible to sort an unbounded set of integers in O(n*sqrt(log log n)) time. After sorting, it is of course trivial to find the minimal difference in linear time.
But I can't think of any algorithm to make it faster than this.
No, not without making assumptions about the numbers/ordering.
It would be possible given a sorted list though.
I think the answer is no, and the proof is similar to the proof that you cannot sort faster than n lg n: you have to compare all of the elements, i.e. build a comparison tree, which implies an Ω(n lg n) lower bound.
EDIT. OK, if you really want to argue, then the question does not say whether it should be a Turing machine or not. With quantum computers, you can do it in linear time :)
