Finding a single cluster of points with low variance - algorithm

Given a collection of points in the complex plane, I want to find a "typical value", something like mean or mode. However, I expect that there will be a lot of outliers, and that only a minority of the points will be close to the typical value. Here is the exact measure that I would like to use:
Find the mean of the largest set of points with variance less than some programmer-defined constant C
The closest thing I have found is the article Finding k points with minimum diameter and related problems, which gives an efficient algorithm for finding a set of k points with minimum variance, for some programmer-defined constant k. This is not useful to me because the number of points close to the typical value could vary a lot and there may be other small clusters. However, incorporating the article's result into a binary search algorithm shows that my problem can be solved in polynomial time. I'm asking here in the hope of finding a more efficient solution.

Here is one way to do it (based on my understanding of the problem):
Pick a point k from the dataset and compute a sorted list of the points in ascending order of their distance from k, in O(N log N).
Treating k as the mean, add points from the sorted list to the set until the variance reaches C, then stop.
Do this for every point.
Keep track of the largest set found.
Time complexity: O(N^2 log N), where N is the size of the dataset.
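A minimal Python sketch of this approach (my own illustration, assuming NumPy and measuring variance about the candidate point k, as described above):

import numpy as np

def largest_low_variance_cluster(points, C):
    # points: 1-D array of complex values; returns the mean of the largest set found.
    points = np.asarray(points, dtype=complex)
    N = len(points)
    best = np.array([], dtype=complex)
    for k in points:
        # Sort points by distance from the candidate centre k: O(N log N).
        ordered = points[np.argsort(np.abs(points - k))]
        # Running variance about k as points are added one at a time.
        var = np.cumsum(np.abs(ordered - k) ** 2) / np.arange(1, N + 1)
        size = int(np.searchsorted(var, C))  # var is non-decreasing, so this is the cutoff
        if size > len(best):
            best = ordered[:size]
    return best.mean() if len(best) else None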

Mode-seeking algorithms such as Mean-Shift clustering may still be a good choice.
You could then just keep the mode with the largest set of points that has variance below the threshold C.
Another approach would be to run k-means with a fairly large k. Then remove all points that contribute too much to variance, decrease k and repeat. Even though k-means does not handle noise very well, it can be used (in particular with a large k) to identify such outlier points.
Or you might first run some simple outlier detection methods to remove these outliers, then identify the mode within the reduced set only. A good candidate method is 1NN outlier detection, which should run in O(n log n) if you have an R-tree for acceleration.
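As a rough sketch of the mode-seeking suggestion (my own illustration, assuming scikit-learn's MeanShift; the bandwidth parameter and the way variance is computed below are choices of mine, not part of the original answer):

import numpy as np
from sklearn.cluster import MeanShift

def mode_of_largest_low_variance_cluster(points, C, bandwidth=None):
    # points: 1-D array of complex values; scikit-learn expects real coordinates.
    X = np.column_stack([points.real, points.imag])
    labels = MeanShift(bandwidth=bandwidth).fit(X).labels_
    best_size, best_mode = 0, None
    for lab in np.unique(labels):
        cluster = X[labels == lab]
        centre = cluster.mean(axis=0)
        var = np.mean(np.sum((cluster - centre) ** 2, axis=1))  # variance about the cluster mean
        if var < C and len(cluster) > best_size:
            best_size, best_mode = len(cluster), complex(centre[0], centre[1])
    return best_mode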

Related

Fixed radius nearest neighbours, with sets

I need to efficiently solve the following problem, a variant of the Fixed radius nearest neighbours problem:
Given a list of n sets S, where each set S[i] consists of (2-dimensional) input points, and a query point q: List the indices of all sets in S such that at least one point of the set is within distance 'r' of q.
Approaches involving range trees, k-d trees and similar data structures storing all the points solve this in running times similar to O(log(n) + k), where n is the total number of points, and k is the number of results (points) returned. My problem is that each set is quite large, and while I can deal with large values of n, large values of k make my algorithm run very slowly and consume prohibitive amounts of space, when I actually only need the indices of the valid sets rather than all of the individual points or nearest point in each set.
If I build a randomized k-d tree for each set and then query for q in each set, (correct me if I'm wrong) I can solve the problem in O(m*log(n/m)) amortized time, where m is the number of sets, which is a significant improvement over the first approach. But before implementing it, I wonder if there are better practical ways of solving the problem, especially as m and n could grow 10x or more beyond their current values, and I am also concerned about the space/memory used by this approach. Elements can also be added to the sets, which may make the k-d trees unbalanced and require frequent reconstructions.
Other approaches I've tried involve partitioning the 2-D space into grids and then using Bloom filters (and taking their union), but that takes a prohibitive amount of space, and I still need to query m sets. I also can't use a disjoint-set structure to compute unions, because the points in each partition are not disjoint and cannot be made disjoint.
Current values I am working with:
Total number of points: 250 million (could become 10x larger)
Number of sets: 50,000
The number of points in a set is thus, on average, ~5,000, but there are sets with 200,000+ points.
Values of k (number of matching points), for radii of interest: up to 40 million when there are 250 million points. The points are very densely clustered in some places. Even for such a large value of k, the number of matching sets is only 30,000 or so.
I'd welcome an approach which involves "once you've found any point in the set within the radius, don't bother about processing other points in the set." Any other approach that solves this problem efficiently is, of course, equally welcome.
I don't have to store the entire data structure in memory, I can store the structure in a database and retrieve parts that are needed.
On a side note, I'd also appreciate if someone could point me to a well-tested k-d tree implementation in Java, which at least works well for 2 dimensions, and serializes and deserializes properly.
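The question asks for Java, but to make the per-set-tree idea concrete, here is a small Python sketch (assuming SciPy's cKDTree): a single nearest-neighbour query per set answers "does this set have any point within r of q?", which is exactly the "stop after the first hit" behaviour wanted above.

import numpy as np
from scipy.spatial import cKDTree

class SetRadiusIndex:
    def __init__(self, sets):
        # sets: list of (n_i, 2) arrays of points, one array per set
        self.trees = [cKDTree(pts) for pts in sets]

    def query(self, q, r):
        # Return the indices of all sets with at least one point within distance r of q.
        matches = []
        for i, tree in enumerate(self.trees):
            d, _ = tree.query(q)   # distance from q to the nearest point of set i
            if d <= r:             # one hit is enough; no other points of the set are examined
                matches.append(i)
        return matches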

Most efficient implementation to get the closest k items

In the K-Nearest-Neighbor algorithm, we find the top k neighbors closest to a new point out of N observations and use those neighbors to classify the point. From my knowledge of data structures, I can think of two implementations of this process:
Approach 1
Calculate the distances to the new point from each of N observations
Sort the distances using quicksort and take the top k points
This would take O(N + N log N) = O(N log N) time.
Approach 2
Create a max-heap of size k
Calculate the distance from the new point for the first k points
For each following observation, if the distance is less than the max in the heap, pop that point from the heap and replace it with the current observation
Re-heapify (O(log k) work per observation)
Continue until there are no more observations, at which point we should only have the k closest distances in the heap.
This approach would take O(N + N log k) = O(N log k) operations.
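A small sketch of Approach 2 in Python (illustrative only; heapq is a min-heap, so distances are negated to simulate a max-heap of size k):

import heapq

def k_nearest(points, query, k, dist):
    heap = []  # (negated distance, index); the root holds the current k-th best
    for i, p in enumerate(points):
        d = dist(p, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:                 # closer than the current k-th best
            heapq.heapreplace(heap, (-d, i))  # O(log k) pop-and-push
    return [points[i] for _, i in sorted(heap, key=lambda t: -t[0])]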
Are my analyses correct? How would this process be optimized in a standard package like sklearn? Thank you!
Here's a good overview of the common methods used: https://en.wikipedia.org/wiki/Nearest_neighbor_search
What you describe is linear search (since you need to compute the distance to every point in the dataset).
The good thing is that this always works. The bad thing about it is that it is slow, especially if you query it a lot.
If you know a bit more about your data you can get better performance. If the data has low dimensionality (2D, 3D) and is uniformly distributed (this doesn't mean perfectly, just not in very dense and very tight clusters), then space partitioning works great because it quickly cuts down on the points that are too far anyway (complexity O(log N)). Space-partitioning structures also work for higher dimensionality or if there are some clusters, but the performance suffers a bit (still better overall than linear search).
Usually space partitioning or locality sensitive hashing are enough for common datasets.
The trade-off is that you use more memory and some set-up time to speed up future queries. If you have a lot of queries then it's worth it. If you only have a few, not so much.
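On the sklearn part of the question: scikit-learn's NearestNeighbors exposes this trade-off through its algorithm parameter ('brute', 'kd_tree', 'ball_tree', or 'auto', which picks one based on the data). A minimal usage example:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10000, 2)    # the N observations
query = np.random.rand(1, 2)    # the new point to classify

nn = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(X)
distances, indices = nn.kneighbors(query)   # the 5 closest observations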

Given n points in a 2-D plane we have to find k nearest neighbours of each point among themselves

I explored the method using a min-heap. For each point we can store a min-heap of size k, but it takes too much space for large n (I'm targeting n around 100 million). Surely there must be a better way of doing this that uses less space without affecting the time complexity much. Is there some other data structure?
This problem is a typical setup for a k-d tree. Such a solution would have linearithmic complexity but may be relatively complex to implement (if a ready implementation is not available).
An alternative approach could be to use bucketing to reduce the complexity of the naive algorithm. The idea is to separate the plane into "buckets", i.e. squares of some size, and place each point in the bucket it belongs to. The closest points will come from the closest buckets. For random data this can be quite a good improvement, but the worst case is still the same as the naive approach.
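A rough sketch of the bucketing idea (my own illustration; the cell size and the expanding-ring search are arbitrary choices, strictly one extra ring should be examined after k candidates are found, and the worst case remains quadratic as noted above):

from collections import defaultdict
import math

def build_grid(points, cell):
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // cell), int(y // cell))].append(i)
    return grid

def knn_in_grid(points, grid, cell, i, k):
    # Grow a square ring of cells around point i until at least k candidates appear
    # (assumes len(points) > k), then return the k closest candidates.
    x, y = points[i]
    cx, cy = int(x // cell), int(y // cell)
    ring, cands = 0, []
    while len(cands) < k:
        cands = [j
                 for dx in range(-ring, ring + 1)
                 for dy in range(-ring, ring + 1)
                 for j in grid.get((cx + dx, cy + dy), ())
                 if j != i]
        ring += 1
    cands.sort(key=lambda j: math.dist(points[i], points[j]))
    return cands[:k]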

Grouping set of points to nearest pairs

I need an algorithm for the following problem:
I'm given a set of 2D points P = { (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) } on a plane. I need to group them in pairs in the following manner:
Find two closest points (x_a, y_a) and (x_b, y_b) in P.
Add the pair <(x_a, y_a), (x_b, y_b)> to the set of results R.
Remove <(x_a, y_a), (x_b, y_b)> from P.
If initial set P is not empty, go to the step one.
Return set of pairs R.
This naive algorithm is O(n^3); using a faster algorithm for nearest-neighbor search it can be improved to O(n^2 log n). Could it be made any better?
And what if the points are not in the euclidean space?
An example (resulting groups are circled by red loops):
Put all of the points into an R-tree (http://en.wikipedia.org/wiki/R-tree), which takes O(n log n), then for each point calculate the distance to its nearest neighbor. Put the points and their initial distances into a priority queue. Initialize an empty set of removed points and an empty set of pairs. Then run the following pseudocode:
while priority_queue is not empty:
    (distance, point) = priority_queue.get()
    if point in removed_set:
        continue
    neighbor = rtree.find_nearest_neighbor(point)
    if distance < distance_between(point, neighbor):
        # The previous neighbor was removed, find the next.
        priority_queue.add((distance_between(point, neighbor), point))
    else:
        # This is the closest pair.
        found_pairs.add(point, neighbor)
        removed_set.add(point)
        removed_set.add(neighbor)
        rtree.remove(point)
        rtree.remove(neighbor)
The slowest part of this is the nearest neighbor searches. An R-tree does not guarantee that those nearest neighbor searches will be O(log(n)). But they tend to be. Furthermore you are not guaranteed that you will do O(1) neighbor searches per point. But typically you will. So average performance should be O(n log(n)). (I might be missing a log factor.)
This problem calls for a dynamic Voronoi diagram I guess.
When the Voronoi diagram of a point set is known, the nearest neighbor pair can be found in linear time.
Then deleting these two points can be done in linear or sublinear time (I didn't find precise info on that).
So globally you can expect an O(N²) solution.
If your distances are arbitrary and you can't embed your points into Euclidean space (and/or the dimension of the space would be really high), then there's basically no way around at least a quadratic time algorithm, because you don't know what the closest pair is until you check all the pairs. It is easy to get very close to this: sort all pairs according to distance and maintain a boolean lookup table indicating which points have already been taken. Then go through the sorted list of pairs in order; whenever neither point of a pair is in the lookup table, add the pair to your set of "nearest neighbors" and mark both points as taken. Complexity O(n^2 log n), with O(n^2) extra space.
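A straightforward rendering of that greedy procedure (dist can be any distance callable; O(n^2 log n) time and O(n^2) extra space, as stated):

def greedy_nearest_pairs(points, dist):
    n = len(points)
    # All pairs, sorted by distance.
    all_pairs = sorted(((dist(points[i], points[j]), i, j)
                        for i in range(n) for j in range(i + 1, n)))
    taken = [False] * n            # the boolean lookup table of used points
    result = []
    for _, i, j in all_pairs:
        if not taken[i] and not taken[j]:
            result.append((points[i], points[j]))
            taken[i] = taken[j] = True
    return result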
You can find the closest pair with the classic divide-and-conquer algorithm that runs in O(n log n) time; repeating it n times gives O(n^2 log n), which is no better than what you already have.
Nevertheless, you can exploit the recursive structure of the divide-and-conquer algorithm. Think about it: if the pair of points you removed was on the right side of the partition, then everything on the left side behaves the same (nothing changed there), so you only have to redo the O(log n) merge steps bottom up. The first new merge step merges 2 elements, the second 4, then 8, 16, ..., n/4, n/2, n, so the total work in these merge steps is O(n), and you get the next closest pair in just O(n) time. Repeating this n/2 times, removing the previously found pair each time, gives a total O(n^2) runtime with O(n log n) extra space to keep track of the recursive steps, which is a little better.
But you can do even better: there is a randomized data structure that lets you update your point set and answer closest-pair queries in expected O(log n) time per operation. I'm not very familiar with that particular data structure, but you can find it in this paper. That would make your algorithm O(n log n) expected time. I'm not sure whether there is a deterministic version with similar runtimes, but those tend to be much more cumbersome.
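For reference, a compact version of the classic divide-and-conquer closest-pair routine mentioned at the start of this answer (a sketch: this simple form re-sorts the strip at every level, so it runs in O(n log^2 n) rather than the refined O(n log n); points are (x, y) tuples):

import math

def closest_pair(points):
    pts = sorted(points)   # sort by x once

    def rec(p):
        n = len(p)
        if n <= 3:         # brute force the tiny cases
            return min(((math.dist(a, b), a, b)
                        for i, a in enumerate(p) for b in p[i + 1:]),
                       key=lambda t: t[0])
        mid_x = p[n // 2][0]
        best = min(rec(p[:n // 2]), rec(p[n // 2:]), key=lambda t: t[0])
        d = best[0]
        # Points within d of the dividing line, sorted by y.
        strip = sorted((q for q in p if abs(q[0] - mid_x) < d), key=lambda q: q[1])
        for i, a in enumerate(strip):
            for b in strip[i + 1:i + 8]:   # at most 7 candidates ahead in y-order matter
                if b[1] - a[1] >= d:
                    break
                if math.dist(a, b) < d:
                    best = (math.dist(a, b), a, b)
                    d = best[0]
        return best

    return rec(pts)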

How to calculate the average time complexity of the nearest neighbor search using kd-tree?

We know the complexity of nearest neighbor search in a kd-tree is O(log n). But how is that derived? The main difficulty is the average time complexity of the backtracking. I have tried to read the paper "An Algorithm for Finding Best Matches in Logarithmic Expected Time", but it is too complicated for me. Does anyone know a simpler way to derive it?
The calculation in the paper is about as simple as possible for a rigorous analysis.
(NB This is the price of being a true computer scientist and software engineer. You must put the effort into learning the math. Knowing the math is what separates people who think they can write solid programs from those who actually can. Jon Bentley, the guy who invented kd-trees, did so when he was in high school. Take this as inspiration.)
If you want a rough intuitive idea that is not rigorous, here is one.
Assume we are working in 2d. The sizes of the geometric areas represented by the 2d-tree are the key.
In the average case, one point partitions the domain into 2 roughly equal-sized rectangles. 3 points into 4. 7 points into 8 parts. Etc. In general N points lead to about N+1 roughly equal-sized rectangles.
It is not hard to see that if the domain is 1x1, the side length of these parts is on average O(sqrt(1/N)).
When you search for a nearest neighbor, you descend the tree to the rectangle containing the search point. After doing this, you have used O(log N) effort to find a point within R = O(sqrt(1/N)) of the correct one. This is just a point contained in the leaf that you discovered.
But this rectangle is not the only one that must be searched. You must still look at all others containing a point no more than distance R away from the search point, refining R each time you find a closer point.
Fortunately, the O(sqrt(1/N)) limit on R provides a tight bound on the average number of other rectangles this can be. In the average case, it's about 8 because each equal-sized rectangle has no more than 8 neighbors.
So the total effort to search is O(8 log n) = O(log n).
Again, I repeat this is not a rigorous analysis, but it ought to give you a feel for why the algorithm is O(log N) in the average case.
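Plugging numbers into that back-of-envelope argument (purely illustrative, not a rigorous bound):

import math

for N in (10**3, 10**6, 10**9):
    side = math.sqrt(1.0 / N)   # typical side length of a leaf rectangle, O(sqrt(1/N))
    depth = math.log2(N)        # levels descended to reach that leaf
    print(f"N={N:>10}: leaf side ~ {side:.0e}, descent ~ {depth:.0f} levels, ~8 neighbouring cells to check")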

Resources