Grouping set of points to nearest pairs - algorithm

I need an algorithm for the following problem:
I'm given a set of 2D points P = { (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) } on a plane. I need to group them in pairs in the following manner:
1. Find the two closest points (x_a, y_a) and (x_b, y_b) in P.
2. Add the pair <(x_a, y_a), (x_b, y_b)> to the result set R.
3. Remove (x_a, y_a) and (x_b, y_b) from P.
4. If P is not empty, go back to step 1.
5. Return the set of pairs R.
That naive algorithm is O(n^3); using a faster nearest-neighbor search it can be improved to O(n^2 log n). Can it be made any better?
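For reference, here is a minimal brute-force sketch of the procedure described above, assuming points are 2D tuples and Euclidean distance, just to pin down the O(n^3) baseline (roughly n/2 iterations, each scanning all O(n^2) remaining pairs):

    import math

    def naive_pairing(points):
        P = list(points)
        R = []
        while len(P) > 1:
            # Scan all remaining pairs for the closest one.
            a, b = min(((i, j) for i in range(len(P)) for j in range(i + 1, len(P))),
                       key=lambda ij: math.dist(P[ij[0]], P[ij[1]]))
            R.append((P[a], P[b]))
            P.pop(b)   # b > a, so remove the larger index first
            P.pop(a)
        return R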
And what if the points are not in Euclidean space?
An example (resulting groups are circled by red loops):

Put all of the points into an R-tree (http://en.wikipedia.org/wiki/R-tree), which takes O(n log(n)) time, then for each point calculate the distance to its nearest neighbor. Put the points and their initial distances into a priority queue. Initialize an empty set of removed points and an empty set of pairs. Then run the following pseudocode:
while priority_queue is not empty:
    (distance, point) = priority_queue.get()
    if point in removed_set:
        continue
    neighbor = rtree.find_nearest_neighbor(point)
    if distance < distance_between(point, neighbor):
        # The previous neighbor was removed; re-queue with the new distance.
        priority_queue.add((distance_between(point, neighbor), point))
    else:
        # This is the closest remaining pair.
        found_pairs.add((point, neighbor))
        removed_set.add(point)
        removed_set.add(neighbor)
        rtree.remove(point)
        rtree.remove(neighbor)
The slowest part of this is the nearest neighbor searches. An R-tree does not guarantee that those searches take O(log(n)), but they tend to. Furthermore, you are not guaranteed to do only O(1) neighbor searches per point, but typically you will. So average performance should be O(n log(n)). (I might be missing a log factor.)
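Here is a runnable sketch of the same idea in Python. It leans on the rtree package (libspatialindex bindings) for the nearest-neighbor queries and deletions; that library choice and its calls are an assumption on my part, not part of the pseudocode above, and any spatial index supporting nearest-neighbor search and removal would do. It assumes an even number of distinct points.

    import heapq
    import math
    from rtree import index   # assumed library: pip install rtree

    def pair_points(points):
        """Greedily pair each point with its nearest unpaired neighbor."""
        def dist(a, b):
            return math.hypot(a[0] - b[0], a[1] - b[1])

        rt = index.Index()
        for i, (x, y) in enumerate(points):
            rt.insert(i, (x, y, x, y))

        def nearest(i):
            # Ask for 2 hits because point i itself is still in the index.
            return next(j for j in rt.nearest(points[i] + points[i], 2) if j != i)

        # Priority queue of (distance to current nearest neighbor, point id).
        pq = [(dist(points[i], points[nearest(i)]), i) for i in range(len(points))]
        heapq.heapify(pq)

        removed, pairs = set(), []
        while pq:
            d, i = heapq.heappop(pq)
            if i in removed:
                continue
            j = nearest(i)
            if d < dist(points[i], points[j]):
                # The stored neighbor was removed; re-queue with the new distance.
                heapq.heappush(pq, (dist(points[i], points[j]), i))
            else:
                pairs.append((points[i], points[j]))
                removed.update((i, j))
                rt.delete(i, points[i] + points[i])
                rt.delete(j, points[j] + points[j])
        return pairs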

This problem calls for a dynamic Voronoi diagram, I guess.
When the Voronoi diagram of a point set is known, the closest pair can be found in linear time.
Deleting those two points can then be done in linear or sublinear time (I didn't find precise information on that).
So overall you can expect an O(N²) solution.

If your distances are arbitrary and you can't embed your points into Euclidean space (and/or the dimension of the space would be really high), then there's basically no way around an at-least-quadratic-time algorithm, because you don't know what the closest pair is until you have checked all the pairs. It is easy to get very close to this bound: sort all pairs by distance, keep a boolean lookup table of which points have already been taken, then walk through the sorted pairs in order; whenever neither point of a pair is marked as taken, add the pair to your result and mark both points as taken. Complexity O(n^2 log n), with O(n^2) extra space.
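As a sketch (assuming dist is whatever arbitrary distance function you are given):

    from itertools import combinations

    def pair_by_sorted_distances(points, dist):
        # All O(n^2) pairs sorted by distance: O(n^2 log n) time, O(n^2) extra space.
        pairs = sorted(combinations(range(len(points)), 2),
                       key=lambda ij: dist(points[ij[0]], points[ij[1]]))
        taken = [False] * len(points)
        result = []
        for i, j in pairs:
            if not taken[i] and not taken[j]:
                result.append((points[i], points[j]))
                taken[i] = taken[j] = True
        return result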

You can find the closest pair with the classic divide-and-conquer algorithm that runs in O(n log n) time; repeating it n times gives O(n^2 log n), which is no better than what you already have.
Nevertheless, you can exploit the recursive structure of the divide-and-conquer algorithm. Think about it: if the pair of points you removed was on the right side of the partition, then nothing changes on the left side, so you only have to redo the O(log n) merge steps bottom up. The first new merge step merges 2 elements, the second merges 4, then 8, 16, ..., n/4, n/2, n, so the total work in these merge steps is O(n), and you get the next closest pair in just O(n) time. Repeating this n/2 times, removing the previously found pair each time, gives a total O(n^2) runtime with O(n log n) extra space to keep track of the recursion, which is a little better.
But you can do even better: there is a randomized data structure that lets you update your point set and answer closest-pair queries in expected O(log n) time per operation. I'm not very familiar with that particular data structure, but you can find it in this paper. That makes the algorithm O(n log n) expected time. I'm not sure whether there is a deterministic version with similar runtimes, but those tend to be far more cumbersome.
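For reference, here is a sketch of the plain O(n log n) closest-pair divide and conquer that this answer builds on; it recomputes everything from scratch and does not implement the incremental re-merging or the dynamic structure described above. It assumes at least two 2D points.

    import math

    def closest_pair(points):
        """Standard divide and conquer; returns the two closest points."""
        def brute(p):
            return min(((math.dist(a, b), a, b)
                        for i, a in enumerate(p) for b in p[i + 1:]),
                       key=lambda t: t[0])

        def solve(p):   # p is sorted by x
            if len(p) <= 3:
                return brute(p)
            mid = len(p) // 2
            x_mid = p[mid][0]
            best = min(solve(p[:mid]), solve(p[mid:]), key=lambda t: t[0])
            d = best[0]
            # Merge step: check the vertical strip of width 2*d around the split line.
            strip = sorted((q for q in p if abs(q[0] - x_mid) < d), key=lambda q: q[1])
            for i, a in enumerate(strip):
                for b in strip[i + 1:i + 8]:   # only a constant number of y-neighbors matter
                    if b[1] - a[1] >= d:
                        break
                    if math.dist(a, b) < d:
                        d = math.dist(a, b)
                        best = (d, a, b)
            return best

        _, a, b = solve(sorted(points))
        return a, b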

Related

Fast way to calculate k nearest points to point P

I cannot decide on the fastest way to pick the k nearest points to some point P from an n-point set. My guesses are below:
Compute all n distances, sort them, and pick the k smallest values;
Compute the distances one by one and maintain a k-sized stack of the closest points so far;
Any other approaches are welcome.
Getting the median is an O(n) operation, so the whole problem can be solved in O(n): compute all distances and find the kth smallest element, partitioning the whole set by that threshold.
One can also work in chunks of size K >> k.
The maximum of the first k distances works as a preliminary threshold: no point farther than that needs to be considered. Instead, place every point closer than that into an array, and once the array size gets close to K, use the linear-time kth-element algorithm to re-partition it.
Finding the smallest k elements is O(n), for any value of k; it's O(n log k) if you need those k elements sorted. This is the partition-based selection algorithm (quickselect).
You're best off reading the algorithm on Wikipedia. It's quicksort, but you only need to 'recurse' into one side, because the other side is guaranteed to be completely in or completely out. There are more expensive pivot-selection tricks (median of medians) that guarantee linear time in the worst case instead of that merely being the average.
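A short sketch of that approach, using NumPy's argpartition (an introselect-based partial partition) as the linear-time kth-element step; it assumes 0 < k < n:

    import numpy as np

    def k_nearest(points, p, k):
        pts = np.asarray(points, dtype=float)
        d2 = np.sum((pts - np.asarray(p, dtype=float)) ** 2, axis=1)   # squared distances
        idx = np.argpartition(d2, k)[:k]       # O(n) on average: the k smallest, unordered
        return pts[idx[np.argsort(d2[idx])]]   # optional O(k log k) sort of just those k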

Is there an algorithm to sort points in the plane in different orientations in linear time (with nonlinear preprocessing)?

I have a set of points in the plane that I want to sort based on when they encounter an arbitrary sweepline. An alternative definition is that I want to be able to sort them based on any linear combination of the x- and y-coordinates. I want to do the sorting in linear time, but am allowed to perform precomputation on the set of points in quadratic time (but preferably O(n log(n))). Is this possible? I would love a link to a paper that discusses this problem, but I could not find it myself.
For example, if I have the points (2,2), (3,0), and (0,3) I want to be able to sort them on the value of 3x+2y, and get [(0,3), (3,0), (2,2)].
Edit: In the comments below the question, a helpful commenter has shown me that the naive algorithm of enumerating all possible sweeplines gives an O(n^2 log(n)) preprocessing algorithm (thanks again!). Is it possible to have an O(n log(n)) preprocessing algorithm?
First note that enumerating all of the sweeplines takes O(n^2 log(n)), but then you have to store a sorted order of the points for each of the n^2 sweeplines. Doing that naively takes O(n^3 log(n)) time and O(n^3) space.
I think I can get average performance down to O(n) with O(n^2 log*(n)) time and O(n^2) space spent on preprocessing. (Here log* is the iterated logarithm and for all intents and purposes it is a constant.) But this is only average performance, not worst case.
The first thing to note is that there are n choose 2 = n*(n-1)/2 pairs of points. As the sweepline rotates through 360 degrees, the two points of each pair swap order exactly twice, giving at most O(n^2) different orderings and O(n^2) pair swaps between them. Also note that after a pair swaps, it does not swap again for 180 degrees; over any range of less than 180 degrees, a given pair either swaps once or doesn't.
Now the idea is that we'll store a random O(n) of those possible orderings along with the sweeplines they correspond to. Between any stored sweepline and the next, only O(n^2 / n) = O(n) pairs of points swap. Therefore both neighboring stored orders agree with the order we want to within O(1) displacement per point on average, and every inversion between the first stored order and the order we want is also an inversion between the first and second stored orders. We'll use this to find our final sort in O(n).
Let me fill in details backwards.
We have our O(n) sweeplines precalculated. In O(log(n)) time we find the two stored sweeplines nearest to the one we are asked for. Assume this gives us the following data structures:
pos1: Lookup from point to its position in sweepline 1.
points1: Lookup from position to the point there in sweepline 1.
points2: Lookup from position to the point there in sweepline 2.
We will now try to sort in time O(n).
We initialize the following data structures:
upcoming: Priority queue of points that could be next.
is_seen: Bitmap from position to whether we've added the point to upcoming.
answer: A vector/array/whatever your language calls it that will hold the answer at the end.
max_j: The farthest point in line 2 that we have added to upcoming. Starts at -1.
And now we do the following.
for i in range(n):
    while is_seen[i] == 0:
        # Find another possible point
        max_j++
        point = points2[max_j]
        upcoming.add(point with where it is encountered as priority)
        is_seen[pos1[point]] = 1
    # upcoming has points1[i] and every point that can come before it.
    answer.append(upcoming.pop())
Waving my hands vigorously, every point is put into upcoming once, and taken out once. On average, upcoming has O(1) points in it, so all operations average out to O(1). Since there are n points, the total time is O(n).
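A sketch of that inner loop in Python, using the names above; key(p) stands for the dot product of point p with the new sweepline direction, points1/points2/pos1 are as defined earlier, and points are assumed to be hashable tuples (the function name and signature are illustrative):

    import heapq

    def sort_between(points1, points2, pos1, key):
        n = len(points2)
        upcoming = []               # heap of (priority along new direction, point)
        is_seen = [False] * n       # indexed by position in sweepline 1
        answer = []
        max_j = -1
        for i in range(n):
            while not is_seen[i]:
                # Find another possible point by advancing along sweepline 2.
                max_j += 1
                p = points2[max_j]
                heapq.heappush(upcoming, (key(p), p))
                is_seen[pos1[p]] = True
            # upcoming now holds points1[i] and every point that can come before it.
            answer.append(heapq.heappop(upcoming)[1])
        return answer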
OK, how do we set up our sweeplines? Since we only care about average performance, we cheat. We randomly choose O(n) pairs of points. Each pair of points defines a sweepline. We sort those sweeplines in O(n log(n)).
Now we have to sort O(n) sweeplines. How do we do this?
Well, we can sort a fixed number of them by any method we want. Let's pick 4 evenly spaced sweeplines and do that. (We actually only need to do the calculation twice. We pick 2 pairs of points, take the sweepline where the first pair crosses and then the one where the second pair crosses; the other 2 sweeplines are at 180 degrees from those, and therefore are just the reversed orders.) After that, we can use the algorithm above to sort a sweepline lying between 2 already-sorted ones, and do that by bisection into smaller and smaller intervals.
Now, of course, the sweeplines will not be as close together as they were above. But note that if we expect the orders to agree to within an average of O(f(n)) places between sweeplines, then the heap will hold O(f(n)) elements, operations on it will take O(log(f(n))) time, and so we get the intermediate sweepline in O(n log(f(n))). How long is the whole calculation?
Well, we have a kind of tree of calculations to do. Let's divide the sweeplines by the level at which they are computed, then group them. The groups, from the top, will be:
1 .. n/log(n)
n/log(n) .. n/log(log(n))
n/log(log(n)) .. n/log(log(log(n)))
...and so on.
In each group we have O(n / log^k(n)) sweeplines to calculate. Each sweepline takes O(n log^k(n)) time to calculate. Therefore each level takes O(n^2). The number of levels is the iterated logarithm, log*(n). So total preprocessing time is O(n^2 log*(n)).

A query on a BST that takes neither maximum nor minimum time

No idea if this is better off on MathOverflow.
A dynamic binary search tree (I assume that it can keep itself strictly balanced, e.g. red-black; also the BST holds n keys and each key is queried exactly once) can take as little as O(n) total for a "good" access sequence (e.g. a sorted one) or as much as O(n log n) for a "bad" one (e.g. a bit-reversal sequence).
If this is easier for you, here is an almost equivalent geometric setting, courtesy of Demaine: draw n points in the plane (no equal x or y coordinates!). For each pair of points, draw the axis-parallel rectangle with these two points as corners. If no other point is inside this rectangle, you own its diagonal, but you can't own two diagonals if their rectangles overlap in such a way that at least one corner of one is strictly inside the other. (Sharing a corner is OK! The somewhat ugly condition is necessary, as otherwise with two parallel diagonals of points you could own O(n^2) diagonals.) Again, can you own something between O(n) diagonals (the points lie on a diagonal) and O(n log n) (the points lie chaotically)?
I have now been sitting on my master's thesis for almost two years, and I'd like to name-drop an example that needs an "in-between" value, say O(n loglog n). Do you know one? (A reference would be even better.)
Disclaimer: The work is almost done, and this is just for an insignificant subordinate clause, but it's driving me mad that every one of the umpteen wacky sequences I have come up with falls into either the maximum or the minimum case. Also, it could come in extremely handy for any follow-up work. If you can't come up with one off-hand, I'll drop the question altogether. After all, I have already written 150 pages... :-)
Let the tree hold the numbers 0..(n-1).
Let m be an integer close to log(n) and relatively prime to n.
Your data set is the multiples of m modulo n. That is, m, 2m, ..., 0.
You can easily verify that every number appears exactly once in the sequence. There are m ≈ log(n) times where you wrap around, each of which causes a query to take O(log(n)), for a total boundary overhead that is easily o(n). For the other n - m queries, all you have to do is go up about log(m) levels and then down log(m) levels to reach the value that is m away. These steps take O(log(m)) = O(log(log(n))) each, for an overall time of O(n log(log(n))).
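A tiny sketch of how that access sequence could be generated; the adjustment of m to be coprime to n is my addition, so that "each key queried exactly once" holds for every n:

    import math

    def inbetween_sequence(n):
        m = max(2, round(math.log2(n)))
        while math.gcd(m, n) != 1:     # keep m relatively prime to n
            m += 1
        return [(i * m) % n for i in range(1, n + 1)]   # m, 2m, ..., 0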
Good enough?

Efficient repeated sorting

I have a set of N points on a graph defined by (x,y) coordinates, as well as a table of their pairwise distances. I'm looking to generate a table with their relative "closeness ranking", e.g. if closeness[5][9] == 4, then node 9 is the fourth closest item relative to node 5.
The obvious way of doing this is to generate a list of indices for each i (1->n) and sort it by d[i][j], then transform the table using the fact that sorted[5][4] == 9 implies closeness[5][9] == 4.
This would require O(n² log n) time. I feel like there could be a more efficient way. Any ideas?
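In code, that straightforward approach looks roughly like this (0-based ranks, so closeness[i][i] == 0 since a node is closest to itself):

    def closeness_ranking(d):
        n = len(d)
        closeness = [[0] * n for _ in range(n)]
        for i in range(n):
            order = sorted(range(n), key=lambda j: d[i][j])   # this row of "sorted"
            for rank, j in enumerate(order):
                closeness[i][j] = rank   # node j is the rank-th closest to node i
        return closeness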
Okay, I'm going to try and take a stab at this.
For background: this problem is somewhat related to k-nearest-neighbor search. I'm not sure how you generated your pairwise distances, but a k-d tree is pretty good at solving this type of problem.
Now, even if you use a k-d tree, it only helps a bit (you query only what you need, instead of sorting ALL the points): building the tree takes O(N log N) time, and then each of the K nearest-neighbor queries per point takes O(log N) time. In the end, you are looking at O(N log N) + O(NK log N).
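A sketch of that, assuming you have the raw coordinates and want ranks only for each node's K nearest neighbors; the scipy.spatial.cKDTree usage is my assumption, and any k-d tree with batched k-nearest queries would work (assumes K < N):

    import numpy as np
    from scipy.spatial import cKDTree

    def k_closeness(points, K):
        pts = np.asarray(points, dtype=float)
        tree = cKDTree(pts)                 # build: O(N log N)
        # k = K + 1 because each point's nearest neighbor is itself.
        _, idx = tree.query(pts, k=K + 1)   # queries: O(N K log N)
        closeness = {}
        for i, row in enumerate(idx):
            for rank, j in enumerate(row[1:], start=1):
                closeness[(i, int(j))] = rank   # node j is the rank-th closest to node i
        return closeness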
Okay, now the actual heuristic part. This will depend on your data; you might want to see whether the points are close together or far apart. But you can try a divide-and-conquer approach where you divide the plane into bins. When you need to find the closest points, work out which bin the point you are working with belongs to, then only look at neighboring bins, exploring more neighboring bins as you need more points.
Hopefully this helps, good luck.

Finding pair of big-small points from a set of points in a 2D plane

The following is an interview question which I've tried hard to solve. The required bound is better than O(n^2). Here is the problem:
You are given a set of points S = (x1,y1)....(xn,yn). The points
are co-ordinates on the XY plane. A point (xa,ya) is said to be
greater than point (xb,yb) if and only if xa > xb and ya > yb.
The objective is to find all pairs of points p1 = (xa,ya) and p2 = (xb,yb) from the set S such that p1 > p2.
Example:
Input S = (1,2),(2,1),(3,4)
Answer: {(3,4),(1,2)} , {(3,4),(2,1)}
I can only come up with an O(n^2) solution that involves checking each point against every other. If there is a better approach, please help me.
I am not sure you can do it.
Example Case: Let the points be (1,1), (2,2) ... (n,n).
There are Θ(n^2) such pairs, and just outputting them takes O(n^2) time.
I am assuming you actually want to count such pairs.
Sort in descending order by x in O(n log n). Now we have reduced the problem to a single dimension: for each position k we need to count how many numbers before it are larger than the number at position k. This is equivalent to counting inversions, a problem that has been answered many times on this site, including by me, for example here.
The easiest way to get O(n log n) for that problem is with the merge sort algorithm, if you want to think about it yourself before clicking that link. Other ways include binary indexed trees (Fenwick trees) or binary search trees. The fastest in practice is probably binary indexed trees, because they only involve simple bitwise operations.
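A sketch of the counting approach with a binary indexed tree over y-ranks; this version processes points in ascending x order (equivalent to the descending-x formulation above) and groups equal x values so that ties never count, while strictly smaller y is enforced through the ranks:

    def count_dominating_pairs(points):
        ys = sorted({y for _, y in points})
        rank = {y: i + 1 for i, y in enumerate(ys)}     # 1-based ranks for the BIT
        bit = [0] * (len(ys) + 1)

        def add(i):
            while i < len(bit):
                bit[i] += 1
                i += i & -i

        def count_leq(i):   # how many already-inserted y's have rank <= i
            s = 0
            while i > 0:
                s += bit[i]
                i -= i & -i
            return s

        pts = sorted(points)   # ascending by x, then y
        total = 0
        i = 0
        while i < len(pts):
            j = i
            while j < len(pts) and pts[j][0] == pts[i][0]:
                j += 1
            # Every point inserted so far has strictly smaller x than this group.
            for _, y in pts[i:j]:
                total += count_leq(rank[y] - 1)   # strictly smaller y
            for _, y in pts[i:j]:
                add(rank[y])
            i = j
        return total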
If you want to print the pairs, you cannot do better than O(n^2) in the worst case. I would be interested in an output-sensitive O(num_pairs) algorithm too however.
Why don't you just sort the list of points by X, and Y as a secondary index? (O(nlogn))
Then you can just give a "lazy" indicator that shows for each point that all the points on its right are bigger than it.
If you want to find them ALL, it will take O(n^2) anyway, because there are O(n^2) pairs.
Think of a sorted list: the first point is the smallest, so there are n-1 bigger points; the second has n-2 bigger points... which adds up to about n^2/2 = O(n^2).
