Efficient repeated sorting - algorithm

I have a set of N points on a graph defined by (x,y) coordinates, as well as a table of their pairwise distances. I'm looking to generate a table with their relative "closeness ranking", e.g. if closeness[5][9] == 4, then node 9 is the fourth closest item relative to node 5.
The obvious way of doing this is to generate, for every i (1->n), a list of indices and sort it by comparing d[i][j] < d[i][k], then transform the table using the fact that sorted[5][4] == 9 implies closeness[5][9] == 4.
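In code, that obvious approach would look something like the following minimal sketch (assuming numpy is available; d is the n x n distance table and the variable names are just for illustration):

import numpy as np

def closeness_ranking(d):
    # closeness[i][j] = rank of node j among all nodes sorted by distance to node i
    # (0-based, and each node ranks itself first since d[i][i] == 0)
    n = len(d)
    closeness = np.empty((n, n), dtype=int)
    for i in range(n):
        order = np.argsort(d[i])            # order[r] = node with rank r w.r.t. node i
        closeness[i, order] = np.arange(n)  # invert the permutation
    return closeness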
This would require O(n² log n) time. I feel like there could be a more efficient way. Any ideas?

Okay, I'm going to try and take a stab at this.
For background: this problem is somewhat related to k-nearest neighbor. I'm not sure how you generated your pairwise distances, but a k-d tree is pretty good at solving this type of problem.
Now, even a k-d tree only helps a bit, and mainly because you query just what you need instead of sorting ALL the points: building the tree takes O(N log N) time, and then each of the K nearest points you query for a given point takes O(log N) time. In the end, you are looking at O(N log N) + O(NK log N).
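If the K you actually need is much smaller than N, the query pattern looks roughly like this (a sketch assuming scipy is available; points and K are placeholder names):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 2)   # N example points in the plane
K = 10                             # how many nearest neighbors you actually need

tree = cKDTree(points)                    # build: O(N log N)
# ask for K+1 because each point's nearest neighbor is itself at distance 0
dists, idx = tree.query(points, k=K + 1)  # roughly O(K log N) per point
neighbors = idx[:, 1:]                    # drop the self-match in column 0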
Okay, now, the actual heuristic part. This will depend on your data; you might want to see whether the points are close together or far apart. But you can try a divide and conquer approach where you divide the plane into bins. When you need to find the closest points, find out which bin the point you are working with belongs to, then work only with neighboring bins, exploring more neighboring bins as you need more points.
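A rough sketch of that binning idea (plain Python; it assumes the data set has at least k points, and expands ring by ring until no unexplored bin can possibly hold a closer point):

import math
from collections import defaultdict

def build_bins(points, cell):
    # hash each point into a square bin of side `cell`
    bins = defaultdict(list)
    for p in points:
        bins[(int(p[0] // cell), int(p[1] // cell))].append(p)
    return bins

def k_nearest_binned(q, bins, cell, k):
    cx, cy = int(q[0] // cell), int(q[1] // cell)
    best = []          # (distance, point) candidates found so far
    ring = 0
    while True:
        # visit the bins whose Chebyshev distance from (cx, cy) equals `ring`
        for bx in range(cx - ring, cx + ring + 1):
            for by in range(cy - ring, cy + ring + 1):
                if max(abs(bx - cx), abs(by - cy)) != ring:
                    continue
                for p in bins.get((bx, by), []):
                    best.append((math.dist(q, p), p))
        best.sort(key=lambda t: t[0])
        best = best[:k]
        # any point in ring `ring + 1` or beyond is more than ring * cell away,
        # so once the k-th best is within that radius we can stop
        if len(best) == k and best[-1][0] <= ring * cell:
            return [p for _, p in best]
        ring += 1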
Hopefully this helps, good luck.

Related

Grouping set of points to nearest pairs

I need an algorithm for the following problem:
I'm given a set of 2D points P = { (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) } on a plane. I need to group them in pairs in the following manner:
1. Find the two closest points (x_a, y_a) and (x_b, y_b) in P.
2. Add the pair <(x_a, y_a), (x_b, y_b)> to the set of results R.
3. Remove <(x_a, y_a), (x_b, y_b)> from P.
4. If P is not empty, go back to step 1.
5. Return the set of pairs R.
The naive algorithm is O(n^3); using a faster algorithm for nearest-neighbor search it can be improved to O(n^2 log n). Could it be made any better?
And what if the points are not in the euclidean space?
An example (resulting groups are circled by red loops):
Put all of the points into an R-tree (http://en.wikipedia.org/wiki/R-tree), which takes O(n log n) time, then for each point calculate the distance to its nearest neighbor. Put the points and their initial distances into a priority queue. Initialize an empty set of removed points and an empty set of pairs. Then run the following pseudocode:
while priority_queue is not empty:
    (distance, point) = priority_queue.get()
    if point in removed_set:
        continue
    neighbor = rtree.find_nearest_neighbor(point)
    if distance < distance_between(point, neighbor):
        # The previous neighbor was removed; re-queue with the updated distance.
        priority_queue.add((distance_between(point, neighbor), point))
    else:
        # This is the closest remaining pair.
        found_pairs.add((point, neighbor))
        removed_set.add(point)
        removed_set.add(neighbor)
        rtree.remove(point)
        rtree.remove(neighbor)
The slowest part of this is the nearest neighbor searches. An R-tree does not guarantee that those nearest neighbor searches will be O(log(n)). But they tend to be. Furthermore you are not guaranteed that you will do O(1) neighbor searches per point. But typically you will. So average performance should be O(n log(n)). (I might be missing a log factor.)
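For what it's worth, here is a runnable Python sketch of that loop. It substitutes a brute-force nearest-neighbor scan (O(n) per query) for the R-tree, so it only illustrates the control flow and the lazy-deletion trick, not the claimed O(n log n) running time:

import heapq
import math

def pair_up(points):
    alive = set(range(len(points)))            # indices not yet paired off

    def nearest(i):                            # brute-force stand-in for the R-tree
        return min((j for j in alive if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    pq = [(math.dist(points[i], points[nearest(i)]), i) for i in alive]
    heapq.heapify(pq)
    pairs = []
    while pq:
        d, i = heapq.heappop(pq)
        if i not in alive:                     # already paired, stale entry
            continue
        if len(alive) < 2:                     # odd leftover point
            break
        j = nearest(i)
        dij = math.dist(points[i], points[j])
        if d < dij:
            heapq.heappush(pq, (dij, i))       # old neighbor is gone, re-queue
        else:
            pairs.append((points[i], points[j]))
            alive.discard(i)
            alive.discard(j)
    return pairs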
This problem calls for a dynamic Voronoi diagram I guess.
When the Voronoi diagram of a point set is known, the nearest neighbor pair can be found in linear time.
Then deleting these two points can be done in linear or sublinear time (I didn't find precise info on that).
So globally you can expect an O(N²) solution.
If your distances are arbitrary and you can't embed your points into Euclidean space (and/or the dimension of the space would be really high), then there's basically no way around at least a quadratic-time algorithm, because you don't know what the closest pair is until you have checked all the pairs. It is easy to get very close to this bound: sort all pairs according to distance, maintain a boolean lookup table indicating which points have already been taken, then go through the sorted pairs in order and add a pair of points to your "nearest neighbors" result if neither point in the pair is marked as taken, marking both points afterwards. Complexity O(n^2 log n), with O(n^2) extra space.
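A sketch of that quadratic approach in Python (dist can be any symmetric distance function or a lookup into your distance table; the names here are just for illustration):

from itertools import combinations

def greedy_pairs(points, dist):
    # sort all O(n^2) pairs by distance: O(n^2 log n)
    all_pairs = sorted(combinations(range(len(points)), 2),
                       key=lambda ij: dist(points[ij[0]], points[ij[1]]))
    taken = [False] * len(points)          # the boolean lookup table
    result = []
    for i, j in all_pairs:
        if not taken[i] and not taken[j]:
            result.append((points[i], points[j]))
            taken[i] = taken[j] = True
    return result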
You can find the closest pair with this divide and conquer algorithm that runs in O(n log n) time; you could repeat it n times and you would get O(n^2 log n), which is no better than what you have.
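For reference, here is a compact sketch of that divide-and-conquer closest-pair routine (this variant re-sorts the strip by y at every level, so it is O(n log^2 n) rather than the tuned O(n log n) version, but the recursive structure is the same; it assumes at least two points):

import math

def closest_pair(pts):
    pts = sorted(pts)                            # sort once by x

    def rec(p):                                  # p is x-sorted, len(p) >= 2
        n = len(p)
        if n <= 3:                               # brute force the small cases
            return min(((math.dist(a, b), a, b)
                        for i, a in enumerate(p) for b in p[i + 1:]),
                       key=lambda t: t[0])
        mid = n // 2
        best = min(rec(p[:mid]), rec(p[mid:]), key=lambda t: t[0])
        d = best[0]
        midx = p[mid][0]
        strip = sorted((q for q in p if abs(q[0] - midx) < d),
                       key=lambda q: q[1])       # merge step: check the strip
        for i, a in enumerate(strip):
            for b in strip[i + 1:i + 8]:         # at most 7 y-neighbors matter
                if b[1] - a[1] >= d:
                    break
                cand = (math.dist(a, b), a, b)
                if cand[0] < best[0]:
                    best = cand
                    d = best[0]
        return best

    return rec(pts)                              # (distance, point, point)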
Nevertheless, you can exploit the recursive structure of the divide and conquer algorithm. Think about it: if the pair of points you removed was on the right side of the partition, then everything behaves the same on the left side, nothing changed there, so you just have to redo the O(log n) merge steps bottom up. But consider that the first new merge step merges 2 elements, the second merges 4 elements, then 8, 16, ..., n/4, n/2, n, so the total number of operations in these merge steps is O(n), and you get the second closest pair in just O(n) time. Repeat this n/2 times, removing the previously found pair each time, and you get a total O(n^2) runtime with O(n log n) extra space to keep track of the recursive steps, which is a little better.
But you can do even better: there is a randomized data structure that lets you do updates on your point set with expected O(log n) query and update time. I'm not very familiar with that particular data structure, but you can find it in this paper. That will make your algorithm O(n log n) expected time. I'm not sure whether there is a deterministic version with similar runtimes, but those tend to be far more cumbersome.

How to calculate the average time complexity of the nearest neighbor search using kd-tree?

We know the complexity of nearest neighbor search in a kd-tree is O(log n). But how do you calculate it? The main problem is the average time complexity of the backtracking. I have tried to read the paper "An Algorithm for Finding Best Matches in Logarithmic Expected Time", but it is too complicated for me. Does anyone know a simple way to calculate it?
The calculation in the paper is about as simple as possible for a rigorous analysis.
(NB This is the price of being a true computer scientist and software engineer. You must put the effort into learning the math. Knowing the math is what separates people who think they can write solid programs from those who actually can. Jon Bentley, the guy who invented kd-trees, did so when he was in high school. Take this as inspiration.)
If you want a rough intuitive idea that is not rigorous, here is one.
Assume we are working in 2d. The sizes of the geometric areas represented by the 2d-tree are the key.
In the average case, one point partitions the domain into 2 roughly equal-sized rectangles. 3 points into 4. 7 points into 8 parts. Etc. In general, N points lead to about N roughly equal-sized rectangles.
It is not hard to see that if the domain is 1x1, the length of a side of these parts is on average O(sqrt(1/N)).
When you search for a nearest neighbor, you descend the tree to the rectangle containing the search point. After doing this, you have used O(log N) effort to find a point within R = O(sqrt(1/N)) of the correct one. This is just a point contained in the leaf that you discovered.
But this rectangle is not the only one that must be searched. You must still look at all others containing a point no more than distance R away from the search point, refining R each time you find a closer point.
Fortunately, the O(sqrt(1/N)) limit on R provides a tight bound on the average number of other rectangles this can be. In the average case, it's about 8 because each equal-sized rectangle has no more than 8 neighbors.
So the total effort to search is O(8 log n) = O(log n).
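In symbols, the back-of-the-envelope version of that argument (unit square, N points, cell side s) looks roughly like this:

\[
s \approx \sqrt{1/N}, \qquad R = O\!\left(\sqrt{1/N}\right)
\]
\[
\#\{\text{cells a disk of radius } R \text{ can touch}\} = O\!\left((R/s + 1)^2\right) = O(1)
\]
\[
T(N) = \underbrace{O(\log N)}_{\text{initial descent}} + \underbrace{O(1)\cdot O(\log N)}_{\text{neighboring cells}} = O(\log N)
\]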
Again, I repeat this is not a rigorous analysis, but it ought to give you a feel for why the algorithm is O(log N) in the average case.

Finding a single cluster of points with low variance

Given a collection of points in the complex plane, I want to find a "typical value", something like mean or mode. However, I expect that there will be a lot of outliers, and that only a minority of the points will be close to the typical value. Here is the exact measure that I would like to use:
Find the mean of the largest set of points with variance less than some programmer-defined constant C
The closest thing I have found is the article Finding k points with minimum diameter and related problems, which gives an efficient algorithm for finding a set of k points with minimum variance, for some programmer-defined constant k. This is not useful to me because the number of points close to the typical value could vary a lot and there may be other small clusters. However, incorporating the article's result into a binary search algorithm shows that my problem can be solved in polynomial time. I'm asking here in the hope of finding a more efficient solution.
Here is a way to do it (from what I have understood of the problem); a rough sketch in code follows below:
1. Select a point k from the dataset and compute the list of points sorted in ascending order of their distance from k, in O(N log N).
2. Keeping k as the mean, add points from the sorted list into a set while the variance stays below C, then stop.
3. Do this for all points.
4. Keep track of the largest set found.
Time complexity: O(N^2 log N), where N is the size of the dataset.
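A rough Python version of that procedure (complex-valued points, and "variance" taken as the mean squared distance from k, which is my reading of the step above, so treat the details as assumptions):

import numpy as np

def largest_low_variance_set(points, C):
    points = np.asarray(points)                     # 1-D array of complex values
    best = points[:0]
    for k in points:
        order = np.argsort(np.abs(points - k))
        sq = np.abs(points[order] - k) ** 2         # squared distances, ascending
        # running "variance" of the first m points, with k held as the mean
        running = np.cumsum(sq) / np.arange(1, len(sq) + 1)
        m = int(np.searchsorted(running, C))        # largest prefix with variance < C
        if m > len(best):
            best = points[order[:m]]
    return best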
Mode-seeking algorithms such as Mean-Shift clustering may still be a good choice.
You could then just keep the mode with the largest set of points that has variance below the threshold C.
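For instance, with scikit-learn (assuming it is acceptable to flatten the complex points into 2-D vectors; C is the variance threshold from the question):

import numpy as np
from sklearn.cluster import MeanShift

def largest_mode_below_C(points, C):
    points = np.asarray(points)
    X = np.column_stack([points.real, points.imag])
    labels = MeanShift().fit_predict(X)
    best_mean, best_size = None, 0
    for lab in np.unique(labels):
        cluster = X[labels == lab]
        variance = cluster.var(axis=0).sum()        # var(real) + var(imag)
        if variance < C and len(cluster) > best_size:
            best_size = len(cluster)
            best_mean = complex(*cluster.mean(axis=0))
    return best_mean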
Another approach would be to run k-means with a fairly large k. Then remove all points that contribute too much to variance, decrease k and repeat. Even though k-means does not handle noise very well, it can be used (in particular with a large k) to identify such objects.
Or you might first run some simple outlier detection methods to remove these outliers, then identify the mode within the reduced set only. A good candidate method is 1NN outlier detection, which should run in O(n log n) if you have an R-tree for acceleration.

Finding pair of big-small points from a set of points in a 2D plane

The following is an interview question which I've tried hard to solve. The required bound is better than O(n^2). Here is the problem:
You are given with a set of points S = (x1,y1)....(xn,yn). The points
are co-ordinates on the XY plane. A point (xa,ya) is said to be
greater than point (xb,yb) if and only if xa > xb and ya > yb.
The objective is to find all pairs of points p1 = (xa,ya) and p2 = (xb,yb) from the set S such that p1 > p2.
Example:
Input S = (1,2),(2,1),(3,4)
Answer: {(3,4),(1,2)} , {(3,4),(2,1)}
I can only come up with an O(n^2) solution that involves checking each point against every other. If there is a better approach, please help me.
I am not sure you can do it.
Example Case: Let the points be (1,1), (2,2) ... (n,n).
There are O(n^2) such pairs, and outputting them itself takes O(n^2) time.
I am assuming you actually want to count such pairs.
Sort descendingly by x in O(n log n). Now we have reduced the problem to a single dimension: for each position k we need to count how many numbers before it are larger than the number at position k. This is equivalent to counting inversions, a problem that has been answered many times on this site, including by me, for example here.
The easiest way to get O(n log n) for that problem is by using the merge sort algorithm, if you want to think about it yourself before clicking that link. Other ways include using binary indexed trees (fenwick trees) or binary search trees. The fastest in practice is probably by using binary indexed trees, because they only involve bitwise operations.
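A sketch of the counting version (merge-sort-based inversion counting after sorting by x in descending order; the (-x, y) sort key handles ties in x, and ties in x or y are excluded as the strict inequalities require):

def count_dominating_pairs(points):
    # count pairs (p, q) with p.x > q.x and p.y > q.y
    ys = [y for x, y in sorted(points, key=lambda p: (-p[0], p[1]))]

    def sort_count(a):
        # returns (a sorted ascending, number of inversions a[i] > a[j] with i < j)
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, cl = sort_count(a[:mid])
        right, cr = sort_count(a[mid:])
        merged, inv = [], cl + cr
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
                inv += len(left) - i   # every remaining left element beats right[j]
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv

    return sort_count(ys)[1]

# count_dominating_pairs([(1, 2), (2, 1), (3, 4)]) -> 2, matching the example above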
If you want to print the pairs, you cannot do better than O(n^2) in the worst case. I would be interested in an output-sensitive O(num_pairs) algorithm too however.
Why don't you just sort the list of points by X, and Y as a secondary index? (O(nlogn))
Then you can just give a "lazy" indicator that shows for each point that all the points on its right are bigger than it.
If you want to find them ALL, it will take O(n^2) anyway in the worst case, because there can be O(n^2) such pairs.
Think of a sorted list: the first point is the smallest, so there are n-1 bigger points; the second has n-2 bigger points, and so on, which adds up to about n^2/2 == O(n^2).

sort array of locations by nearest point

So the question is like this:
Given a location X and an array of locations, I want to get an array
of locations that are closest to location X, in other words sorted by
closest distance.
The way I solved this is by iterating through the location array, calculating the distance between X and each location, storing that distance, and sorting the locations by distance using a Comparator. Done! Is there a better way to do this? Assuming that the sort is a merge sort, it should be O(n log n).
If I understand this right, you can do this pretty quickly for multiple queries - as in, given several values of X, you wouldn't have to re-sort your solution array every time. Here's how you do it:
Sort the array initially (O(n log n) - call this preprocessing).
Now, on every query X, binary search the array for X (or the closest number smaller than X). Maintain two indices, i and j, one pointing to the current location and one to the next. One of these is clearly the closest number to X in the list. Pick the one at the smaller distance and put it in your solution array. Now, if i was picked, decrement i; if j was picked, increment j. Repeat this process until all the numbers are in the solution array.
This takes O(n + log n) for each query, with O(n log n) preprocessing. Of course, if we were talking about just one X, this is not better at all.
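In Python-ish terms (treating locations as plain numbers, as this answer does):

from bisect import bisect_left

def by_distance(sorted_locs, X):
    # sorted_locs must already be sorted (the preprocessing step)
    j = bisect_left(sorted_locs, X)            # first element >= X
    i = j - 1                                  # last element < X
    out = []
    while i >= 0 and j < len(sorted_locs):     # merge outward from X
        if X - sorted_locs[i] <= sorted_locs[j] - X:
            out.append(sorted_locs[i])
            i -= 1
        else:
            out.append(sorted_locs[j])
            j += 1
    while i >= 0:                              # one side may run out first
        out.append(sorted_locs[i])
        i -= 1
    while j < len(sorted_locs):
        out.append(sorted_locs[j])
        j += 1
    return out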
The problem you describe sounds like an m-nearest-neighbor search to me.
So, if I correctly understood your question, i.e. the notion of a location being a vector in a multidimensional metric space, and the distance being a proper metric in this space, then it would be nice to put the array of locations in a k-d tree.
You pay the tree-building overhead once, but then each search is O(log n).
A benefit of this: assuming you are only interested in the m < n closest locations, you don't need to evaluate all n distances every time you search for a new X.
You could try using a min-heap data structure to implement this. Just keep storing the locations in the heap with key = the distance between X and that location.
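For example, with Python's heapq (treating locations as numbers here; for 2-D points the key would be the squared distance):

import heapq

def m_closest(locations, X, m):
    heap = [(abs(loc - X), loc) for loc in locations]   # key = distance to X
    heapq.heapify(heap)                                  # O(n)
    return [heapq.heappop(heap)[1]                       # O(m log n) for m pops
            for _ in range(min(m, len(heap)))]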
You can't do asymptotically better than O(n log n) if using a comparison-based sort. If you want to talk about micro-optimization of the code, though, some ideas include...
Sort by squared distance; there is no reason to ever use sqrt() - sqrt() is expensive (see the small sketch after this list).
Only compute the squared distance if necessary; if |dx1| <= |dx2| and |dy1| <= |dy2|, then pt1 is no farther than pt2 - integer multiplication is fast, but avoiding it in many cases may be somewhat faster.
A thinking-outside-the-box solution might be to use e.g. bucket sort, a linear-time sorting algorithm which might be applicable here.
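In Python terms, the first point looks like this (the same idea carries over to a Java Comparator):

def sort_by_closeness(points, X):
    # compare by squared distance: sqrt is monotonic, so the order is identical
    return sorted(points, key=lambda p: (p[0] - X[0])**2 + (p[1] - X[1])**2)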
Proof by contradiction that you can't do better:
It is well known that comparison-based sorting is the only way to sort arbitrary numbers (which may include irrational numbers), and that it cannot do better than O(n*log(n)) time.
If you could go through the list in O(n) time and select the smallest number, use that as X, and somehow come up with a list of numbers sorted by distance to X in better than O(n*log(n)) time, then you would have sorted n numbers in less than O(n*log(n)) time, because sorting by distance to the minimum is the same as sorting the numbers themselves.
