Most efficient implementation to get the closest k items - algorithm

In the K-Nearest-Neighbor algorithm, we find the top k neighbors closest to a new point out of N observations and use those neighbors to classify the point. From my knowledge of data structures, I can think of two implementations of this process:
Approach 1
Calculate the distances to the new point from each of N observations
Sort the distances using quicksort and take the top k points
Which would take O(N + NlogN) = O(NlogN) time.
Approach 2
Create a max-heap of size k
Calculate the distance from the new point for the first k points
For each following observation, if the distance is less than the max in the heap, pop that point from the heap and replace it with the current observation
Re-heapify (O(log k) work per replacement, done for up to N points)
Continue until there are no more observations, at which point we should have only the k closest distances in the heap.
This approach would take O(N + NlogK) = O(NlogK) operations.
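To make the heap-based approach concrete, here is a minimal sketch in Python using the standard-library heapq module (the dist callback, the tuple points and the function name are just placeholders for illustration):

    import heapq

    def k_nearest(points, query, k, dist):
        # Approach 2: keep a size-k max-heap over the distances seen so far.
        # heapq only provides a min-heap, so distances are stored negated.
        heap = [(-dist(p, query), i) for i, p in enumerate(points[:k])]
        heapq.heapify(heap)                          # O(k)
        for i in range(k, len(points)):              # the remaining N - k observations
            d = dist(points[i], query)
            if d < -heap[0][0]:                      # closer than the current k-th nearest
                heapq.heapreplace(heap, (-d, i))     # pop the max and push, O(log k)
        # indices of the k nearest observations, closest first
        return [i for _, i in sorted(heap, key=lambda t: -t[0])]

    pts = [(1, 2), (3, 4), (0, 0), (5, 1), (2, 2)]
    euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    print(k_nearest(pts, (1, 1), k=3, dist=euclid))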
Are my analyses correct? How would this process be optimized in a standard package like sklearn? Thank you!

Here's a good overview of the common methods used: https://en.wikipedia.org/wiki/Nearest_neighbor_search
What you describe is linear search (since you need to compute the distance to every point in the dataset).
The good thing is that it always works. The bad thing is that it is slow, especially if you query it a lot.
If you know a bit more about your data you can get better performance. If the data has low dimensionality (2D, 3D) and is roughly uniformly distributed (this doesn't mean perfectly uniform, just not concentrated in very dense, very tight clusters), then space partitioning works great because it quickly discards the points that are too far away anyway (complexity O(logN)). Space-partitioning structures also work for higher dimensionality or when there are some clusters, but performance suffers a bit (still better overall than linear search).
Usually space partitioning or locality sensitive hashing are enough for common datasets.
The trade-off is that you use more memory and some set-up time to speed up future queries. If you have a lot of queries then it's worth it. If you only have a few, not so much.
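To connect this back to the sklearn part of the question: scikit-learn's NearestNeighbors estimator wraps brute-force search, kd-trees and ball trees behind one interface and picks between them heuristically. A minimal sketch (the random data and parameter values are just for illustration):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.random((10_000, 3))            # N observations in a low-dimensional space
    query = rng.random((1, 3))

    # algorithm='auto' lets sklearn choose between 'brute', 'kd_tree' and 'ball_tree'
    # based on the data; you can also force a specific structure.
    nn = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(X)
    distances, indices = nn.kneighbors(query)
    print(indices[0], distances[0])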

Related

Given n points in a 2-D plane, we have to find the k nearest neighbours of each point among themselves

I explored the method using a min-heap. For each point we can store a min-heap of size k, but this takes too much space for large n (I'm targeting n around 100 million). Surely there must be a better way of doing this that uses less space without affecting the time complexity too much. Is there some other data structure?
This problem is a typical setup for a KD-tree. Such a solution would have linearithmic complexity, but may be relatively complex to implement (if a ready-made implementation is not available).
An alternative approach could be to use bucketing to reduce the complexity of the naive algorithm. The idea is to separate the plane into "buckets", i.e. squares of some size, and place each point in the bucket it belongs to. The closest points will come from the closest buckets. For random data this can be quite a good improvement, but the worst case is still the same as the naive approach.
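A rough sketch of the bucketing idea in Python (the cell size is a tuning parameter, and the delicate part is knowing when the ring of cells searched so far is wide enough that the k-th candidate is guaranteed exact; this assumes the dataset has more than k points):

    from collections import defaultdict
    import math

    def build_grid(points, cell_size):
        # Put every point index into the square bucket it falls into.
        grid = defaultdict(list)
        for idx, (x, y) in enumerate(points):
            grid[(math.floor(x / cell_size), math.floor(y / cell_size))].append(idx)
        return grid

    def k_nearest_bucketed(points, i, k, grid, cell_size):
        x, y = points[i]
        cx, cy = math.floor(x / cell_size), math.floor(y / cell_size)
        ring = 1
        while True:
            # Gather candidates from the (2*ring+1) x (2*ring+1) block of cells around the point.
            cand = []
            for dx in range(-ring, ring + 1):
                for dy in range(-ring, ring + 1):
                    cand.extend(j for j in grid.get((cx + dx, cy + dy), ()) if j != i)
            if len(cand) >= k:
                cand.sort(key=lambda j: (points[j][0] - x) ** 2 + (points[j][1] - y) ** 2)
                kth = math.dist(points[cand[k - 1]], points[i])
                # Only exact once the k-th distance is no larger than the margin of the
                # searched block; otherwise a closer point could sit just outside it.
                if kth <= ring * cell_size:
                    return cand[:k]
            ring += 1        # widen the search and try again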

Are range trees widely used in spatial search problems?

I am looking for some data structures for range searching. I think range trees offer a good time complexity (but with some storage requirements).
However, it seems to me that other data structures, like KD-trees, are discussed and recommended more often than range trees. Is this true? If so, why?
I would expect that it is because kd-trees can straightforwardly be extended to contain objects other than points. This gives them many applications in e.g. virtual worlds, where we want fast queries on triangles. Similar extensions of range trees are not straightforward, and in fact I've never seen any.
To give a quick recap: a kd-tree can preprocess a set of n points in d-space in O(n log n) time into a structure using O(n) space, such that any d-dimensional range query can be answered in O(n^(1-1/d) + k) time, where k is the number of answers. A range tree takes O(n log^(d-1) n) time to preprocess, uses O(n log^(d-1) n) space, and can answer range queries in O(log^(d-1) n + k) time.
The query time for a range tree is obviously a lot better than that of a kd-tree if we're talking about 2- or 3-dimensional space. However, the kd-tree has several advantages. First of all, it always requires only linear storage. Secondly, it is always constructed in O(n log n) time. Third, if the dimensionality is very high, it will outperform a range tree unless your point sets are very large (although arguably, at that point a linear search will be almost as fast as a kd-tree).
I think another important point is that kd-trees are more well known by people than range trees. I'd never heard of a range tree before taking a course in computational geometry, but I'd heard of and worked with kd-trees before (albeit in a computer graphics setting).
EDIT: You ask what would be a better data structure for 2D or 3D fixed-radius search when you have millions of points. I really can't tell you! I'd be inclined to say a range tree will be faster if you perform many queries, but for 3D the construction will be slower by a factor of O(log n), and memory use may become an issue before speed does. I'd recommend grabbing good implementations of both structures and simply testing which does a better job for your particular requirements.
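As a practical baseline for the fixed-radius case, a kd-tree implementation is easy to benchmark before committing to anything fancier. A hedged sketch with SciPy's cKDTree (the point count and radius are arbitrary illustration values):

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(1)
    pts = rng.random((1_000_000, 3))        # 3D points
    tree = cKDTree(pts)                     # O(n log n) construction, O(n) storage

    # Fixed-radius query around a single point:
    idx = tree.query_ball_point(pts[0], r=0.01)

    # Fixed-radius search over all points at once (all pairs within r of each other):
    pairs = tree.query_pairs(r=0.01)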
For your needs (particle simulation in 2D or 3D), you want a data structure capable of all nearest neighbor queries. The cover tree is a data structure that is best suited for this task. I came across it while computing the nearest neighbors for kernel density estimation. This Wikipedia page explains the basic definition of the tree, and John Langford's page has a link to a C++ implementation.
The running time of a single query is O(c^12 log n), where c is the expansion constant of the dataset. This is an upper bound; in practice, the data structure performs faster than others. This paper shows that the running time of batch processing of all nearest neighbors (for all the data points), as needed for a particle simulation, is O(c^16 n), and this theoretical linear bound is also practical for your needs. Construction time is O(n log n) and storage is O(n).

How to calculate the average time complexity of the nearest neighbor search using kd-tree?

We know the complexity of nearest neighbor search in a kd-tree is O(log n). But how do we calculate it? The main problem is the average time complexity of the backtracking. I have tried to read the paper "An Algorithm for Finding Best Matches in Logarithmic Expected Time", but it is too complicated for me. Does anyone know a simple way to calculate it?
The calculation in the paper is about as simple as possible for a rigorous analysis.
(NB This is the price of being a true computer scientist and software engineer. You must put the effort into learning the math. Knowing the math is what separates people who think they can write solid programs from those who actually can. Jon Bentley, the guy who invented kd-trees, did so when he was in high school. Take this as inspiration.)
If you want a rough intuitive idea that is not rigorous, here is one.
Assume we are working in 2d. The sizes of the geometric areas represented by the 2d-tree are the key.
In the average case, one point partitions the domain into 2 roughly equal-sized rectangles, 3 points into 4, 7 points into 8, and so on. In general, N points lead to roughly N+1 equal-sized rectangles.
It is not hard to see that if the domain is 1x1, the average side length of these rectangles is O(sqrt(1/N)).
When you search for a nearest neighbor, you descend the tree to the rectangle containing the search point. After doing this, you have used O(log N) effort to find a point within R = O(sqrt(1/N)) of the correct one. This is just a point contained in the leaf that you discovered.
But this rectangle is not the only one that must be searched. You must still look at all others containing a point no more than distance R away from the search point, refining R each time you find a closer point.
Fortunately, the O(sqrt(1/N)) limit on R provides a tight bound on the average number of other rectangles this can be. In the average case, it's about 8 because each equal-sized rectangle has no more than 8 neighbors.
So the total effort to search is O(8 log n) = O(log n).
Again, I repeat this is not a rigorous analysis, but it ought to give you a feel for why the algorithm is O(log N) in the average case.
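If you prefer symbols, the same back-of-the-envelope argument for the 2D unit square can be written as follows (A_cell, R and T(N) are just labels for the quantities described above; this is no more rigorous than the prose):

    A_{\text{cell}} \approx \frac{1}{N}, \qquad R \approx \sqrt{A_{\text{cell}}} = O\left(N^{-1/2}\right)

    T(N) \approx \underbrace{O(\log N)}_{\text{descent to the leaf}} + \underbrace{O(1)}_{\approx 8\ \text{neighbouring cells}} \cdot O(\log N) = O(\log N)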

Finding a single cluster of points with low variance

Given a collection of points in the complex plane, I want to find a "typical value", something like mean or mode. However, I expect that there will be a lot of outliers, and that only a minority of the points will be close to the typical value. Here is the exact measure that I would like to use:
Find the mean of the largest set of points with variance less than some programmer-defined constant C
The closest thing I have found is the article Finding k points with minimum diameter and related problems, which gives an efficient algorithm for finding a set of k points with minimum variance, for some programmer-defined constant k. This is not useful to me because the number of points close to the typical value could vary a lot and there may be other small clusters. However, incorporating the article's result into a binary search algorithm shows that my problem can be solved in polynomial time. I'm asking here in the hope of finding a more efficient solution.
Here is a way to do it (from what I have understood of the problem):
Select a point k from the dataset and compute the sorted list of the remaining points in ascending order of their distance from k, in O(NlogN).
Treating k as the mean, add points from the sorted list to the set as long as the variance stays below C, then stop.
Do this for every point.
Keep track of the largest such set.
Time complexity: O(N^2 * logN), where N is the size of the dataset.
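A minimal sketch of this procedure for points in the complex plane (the function name is made up; note that, following the steps above, the variance is measured around the seed point rather than around the set's own mean, which is the simplification this answer makes):

    import numpy as np

    def largest_low_variance_set(points, C):
        # points: 1-D numpy array of complex numbers, C: the variance threshold.
        best = np.empty(0, dtype=points.dtype)
        for z in points:                                    # N outer iterations
            order = np.argsort(np.abs(points - z))          # O(N log N) sort by distance to z
            d2 = np.abs(points[order] - z) ** 2
            prefix_var = np.cumsum(d2) / np.arange(1, len(d2) + 1)
            m = int(np.searchsorted(prefix_var, C))         # longest prefix with variance < C
            if m > len(best):
                best = points[order[:m]]
        return best                                         # overall: O(N^2 logN)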
Mode-seeking algorithms such as Mean-Shift clustering may still be a good choice.
You could then just keep the mode with the largest set of points that has variance below the threshold C.
Another approach would be to run k-means with a fairly large k. Then remove all points that contribute too much to variance, decrease k and repeat. Even though k-means does not handle noise very well, it can be used (in particular with a large k) to identify such objects.
Or you might first run some simple outlier detection methods to remove these outliers, then identify the mode within the reduced set only. A good candidate method is 1NN outlier detection, which should run in O(n log n) if you have an R-tree for acceleration.
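As a sketch of the mode-seeking route, scikit-learn's MeanShift can be combined with the variance threshold C from the question; the function name, the complex-to-2D conversion and the bandwidth handling are my own illustration choices:

    import numpy as np
    from sklearn.cluster import MeanShift

    def typical_value(points, C, bandwidth=None):
        # points: 1-D array of complex numbers; C: the variance threshold.
        X = np.column_stack([points.real, points.imag])     # complex plane -> 2-D reals
        labels = MeanShift(bandwidth=bandwidth).fit(X).labels_
        best_mean, best_size = None, 0
        for lab in np.unique(labels):
            cluster = X[labels == lab]
            # variance = mean squared distance from the cluster's own mean
            var = np.mean(np.sum((cluster - cluster.mean(axis=0)) ** 2, axis=1))
            if var < C and len(cluster) > best_size:
                best_size, best_mean = len(cluster), complex(*cluster.mean(axis=0))
        return best_mean          # None if no cluster satisfies the threshold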

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2 if this makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation it's only slightly faster than exhaustive search.
My next idea would be to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
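A small sketch of that route with scikit-learn (the number of components and the data are just illustration values; keep in mind that neighbours found in the reduced space are only an approximation of the true 75-dimensional neighbours):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.random((16_000, 75))                   # the sizes from the question

    # Project to a lower-dimensional space where a kd-tree works well again.
    # n_components=15 is arbitrary; choose it from the explained variance ratio.
    X_low = PCA(n_components=15).fit_transform(X)

    nn = NearestNeighbors(n_neighbors=3, algorithm='kd_tree').fit(X_low)
    dist, idx = nn.kneighbors(X_low)               # first column is each point itself
    two_nn = idx[:, 1:]                            # its 2 nearest neighbours (in the reduced space)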
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact nearest neighbor, but rather a good approximation of it (for example, the 4th NN to your query when you are looking for the 1st NN).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You can read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the Nearest Neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
There is no reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation you need to pick out the top k nearest neighbors. That's n squared for the distance calculations and n log(n) for each sort, but you have to do the sort n times (a different one for every observation). Messy, but still polynomial time to get your answers.
A BK-Tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-Trees; however, I don't know whether they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation is to sort, for each data point, the array of distances to its candidate neighbours.
As sorting the entire array can be very expensive, you can instead use indirect partial selection, for example numpy.argpartition in the NumPy library, to pick out only the closest K values you are interested in. There is no need to sort the entire array.
The cost of @Grembo's approach above can be reduced significantly this way, since you only need the K nearest values, and there is no need to sort all the distances from each point.
If you just need the K neighbours, this method works very well and reduces your computational cost and running time.
If you need the K neighbours sorted, sort the selected K values afterwards.
See the documentation for argpartition.
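A minimal sketch of that idea for the dataset size in the question (the chunking, chunk size and float32 cast are my own choices to keep the distance matrix small; K and the random data are placeholders):

    import numpy as np
    from scipy.spatial.distance import cdist

    X = np.random.rand(16_000, 75).astype(np.float32)   # data set of the size in the question
    K = 2

    nearest = np.empty((len(X), K), dtype=np.intp)
    for start in range(0, len(X), 1000):                 # chunk so the distance matrix stays small
        block = X[start:start + 1000]
        d = cdist(block, X)                              # distances from this chunk to all points
        d[np.arange(len(block)), start + np.arange(len(block))] = np.inf  # mask self-distances
        # argpartition picks the K smallest per row without sorting the whole row
        part = np.argpartition(d, K, axis=1)[:, :K]
        rows = np.arange(len(block))[:, None]
        # optional: order those K indices by their actual distance
        nearest[start:start + 1000] = part[rows, np.argsort(d[rows, part], axis=1)]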
