Why does the k-nearest neighbor algorithm suffer from the curse of dimensionality?

As per the algorithm, we consider only the k nearest neighbors. How, then, does it depend on the other features?

The curse of dimensionality in the k-NN context essentially means that Euclidean distance becomes unhelpful in high dimensions, because all vectors are almost equidistant from the query vector (imagine many points lying more or less on a circle with the query point at the center: the distance from the query to every data point in the search space is nearly the same).
Quoted from Wikipedia!
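A minimal sketch of this effect, assuming uniform random data in the unit hypercube (the exact numbers vary by seed): as the dimension grows, the gap between the nearest and farthest neighbor shrinks relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# As the dimension d grows, the spread of distances from a query point
# to random data shrinks relative to the nearest distance: all points
# become almost equidistant from the query.
for d in (2, 10, 100, 1000):
    data = rng.random((1000, d))          # 1000 points in the unit hypercube
    query = rng.random(d)
    dists = np.linalg.norm(data - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast (max - min) / min = {contrast:.3f}")
```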

Related

Distance from ellipses to static set of polygons

I have a static set of simple polygons (they may be nonconvex, but they are not self-intersecting) and a large number of query ellipses. Assume that this is all being done in 2D. I need to find the distance between each ellipse and the closest polygon to that ellipse. Distance is defined as the shortest distance between any two points on the ellipse and the polygon, respectively. If the ellipse intersects a polygon, we can say the distance is 0 or assign some negative value.
A brute-force approach would simply compute the distance between each ellipse and each polygon and return the lowest, in O(mn) time, where m is the number of polygons and n is the average number of vertices per polygon. I would like to reduce the m term here, because I think I can cull the number of polygons being considered through some spatial analysis.
I've considered a few approaches, including Voronoi diagrams, R-trees, and k-d trees. However, most of these seem to involve points, and I'm not sure how to extend them to polygons. I think the most promising approach involves computing bounding boxes for each polygon and the ellipse and using an R-tree to find some set of nearby polygons. However, I'm not quite sure about the best way to find this close set of polygons. Or perhaps there's a better way that I'm overlooking.
Using bounding boxes or disks has the benefit of reducing the work of computing an ellipse/polygon distance to O(1), and it gives you both a lower and an upper bound on the true distance.
Assume you use disks, and also enclose each ellipse in a disk. You will need to perform a modified nearest-neighbor search that enumerates the disks whose lower-bound distance to the query disk is smaller than the best upper bound found so far.
This can be accelerated by means of a k-d tree (d = 2) built on the disk centers. You can augment every node of the tree with the radii of the largest and smallest disks in the subtree it roots; during the search, this information lets you evaluate the bounds without knowing the exact radii of the disks, and the deeper you go in the tree, the tighter the bounds become.
Perform one search to obtain the tightest upper bound on the distance, then a second search to enumerate all disks whose lower bound is smaller than that upper bound. This reduces the number of disks that must be considered.
You can also use bounding boxes, and store the min/max width/height of the boxes in the tree nodes.
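For illustration, a minimal sketch of the O(1) disk-distance bounds described above (the function name and the center/radius representation are assumptions):

```python
import math

def disk_distance_bounds(c1, r1, c2, r2):
    """Bounds on the distance between two shapes, each enclosed in a
    disk with center c and radius r:  lower <= true distance <= upper."""
    center_dist = math.dist(c1, c2)
    lower = max(0.0, center_dist - r1 - r2)  # the shapes can be no closer
    upper = center_dist + r1 + r2            # and no farther than this
    return lower, upper
```

During the search, any subtree whose lower bound already exceeds the best upper bound found so far can be pruned without computing any exact ellipse/polygon distances.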

K-nearest neighbors for a moving query point

I have been given the coordinates of n fixed points and m query points, and I have to find the k nearest neighbors of each of the m query points among the n fixed points. Computing the distances separately for each query point is very costly. Is there an efficient way of doing this?
There are fast indexing structures for such problems, like the k-d tree or ball tree. In particular, scikit-learn (sklearn) implements them in its k-NN routines: http://scikit-learn.org/stable/modules/neighbors.html
A real answer to your question depends on numerous factors. For example, if you are not using the Euclidean distance, then you can't use k-d trees. There are also scaling issues (how many points are enrolled? what is the dimensionality? how clustered is the data?), as well as how long you can wait for training, whether values need to be added to the set later, and so on.
A number of less commonly available, but still useful, algorithms for this are implemented in JSAT, including VP trees, RBC, and LSH. (Bias warning: I'm the author of JSAT.)
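To illustrate the scikit-learn route, a minimal sketch that builds the index once over the n fixed points and answers all m queries in a single batch (the sizes are illustrative):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(42)
fixed = rng.random((10_000, 3))        # the n fixed points
queries = rng.random((500, 3))         # the m query points

tree = KDTree(fixed)                   # build the index once
dist, ind = tree.query(queries, k=5)   # k nearest fixed points per query
print(ind[0], dist[0])                 # neighbors of the first query point
```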
If you are computing the square root of the sum of squares to get the distances, try dropping the square root, which is computationally expensive. Just find the points with the smallest squared distances; they are the same points.
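A small sketch of that trick; since the square root is monotonic, comparing squared distances yields the same nearest neighbor:

```python
import numpy as np

def nearest_by_squared_distance(points, query):
    """Index of the point nearest to `query`, with no sqrt calls."""
    diffs = points - query
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)  # squared Euclidean
    return int(np.argmin(sq_dists))   # same winner as with true distances
```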

Nearest neighbor search with periodic boundary conditions

In a cubic box I have a large collection of points in R^3, and I'd like to find the k nearest neighbors of each point. Normally I'd think to use something like a k-d tree, but in this case I have periodic boundary conditions. As I understand it, a k-d tree works by recursively partitioning the space with hyperplanes of one less dimension; in 3D we split the space with 2D planes. Any given point is either on a plane, above it, or below it. However, when you split the space with periodic boundary conditions, a point could be considered to be on either side!
What's the most efficient method of finding and maintaining a list of nearest neighbors with periodic boundary conditions in R^3?
Approximations are not sufficient, and the points will only be moved one at a time (think Monte Carlo, not N-body simulation).
Even in the Euclidean case, a point and its nearest neighbor may be on opposite sides of a hyperplane. The core of nearest-neighbor search in a k-d tree is a primitive that determines the distance between a point and a box; the only modification necessary for your case is to take the possibility of wraparound into account.
Alternatively, you could implement cover trees, which work on any metric.
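As a concrete example of a wraparound-aware implementation, SciPy's kd-tree accepts a boxsize argument that measures distances on the torus (a minimal sketch; the box edge length L is an assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

L = 1.0                                  # cubic box edge length (assumed)
rng = np.random.default_rng(0)
points = rng.random((5_000, 3)) * L      # coordinates must lie in [0, L)

# boxsize makes the tree measure distances on the torus, i.e. the
# point-to-box primitive accounts for wraparound automatically.
tree = cKDTree(points, boxsize=L)
dist, ind = tree.query(points, k=4)      # k=4: self plus 3 nearest neighbors
neighbors = ind[:, 1:]                   # drop the self-match in column 0
```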
(I'm posting this answer even though I'm not fully sure it works. Intuitively it seems right, but there might be an edge case I haven't considered)
If you're working with periodic boundary conditions, you can think of space as being cut into a series of fixed-size blocks that are all superimposed on top of one another. Suppose we're in R^2. One option would be to replicate that block nine times and arrange the copies into a 3x3 grid of duplicates of the block. Given this, if we find the nearest neighbor of any single node in the central square, then either
The nearest neighbor is inside the central square, in which case it is the true nearest neighbor, or
The nearest neighbor is in a square other than the central one. In that case, the point in the central square that the neighbor corresponds to is the nearest neighbor of the original test point under the periodic boundary conditions.
In other words, we just replicate the elements enough times so that the Euclidean distance between points lets us find the corresponding distance in the modulo space.
In n dimensions, you would need to make 3^n copies of all the points, which sounds like a lot, but for R^3 it is only a 27x increase over the original data size. This is certainly a large increase, but if it's within acceptable limits you should be able to use this trick to harness a standard k-d tree (or other spatial tree).
Hope this helps! (And hope this is correct!)
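A minimal sketch of the replication trick in R^3, assuming a unit box; it builds an ordinary kd-tree over the 27 shifted copies and maps each hit back to an original point index:

```python
import itertools
import numpy as np
from scipy.spatial import cKDTree

L = 1.0                                   # periodic box edge length (assumed)
rng = np.random.default_rng(1)
points = rng.random((1_000, 3)) * L

# Replicate the box 3^3 = 27 times: one copy per shift in {-L, 0, +L}^3.
shifts = np.array(list(itertools.product((-L, 0.0, L), repeat=3)))
tiled = (points[None, :, :] + shifts[:, None, :]).reshape(-1, 3)
owner = np.tile(np.arange(len(points)), len(shifts))  # copy -> original index

tree = cKDTree(tiled)                     # ordinary kd-tree, no wraparound
dist, ind = tree.query(points, k=4)       # query the original (central) points
neighbors = owner[ind[:, 1:]]             # map copies back, drop the self-match
```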

Spatial index for geo coordinates?

What kind of data structure could be used for an efficient nearest-neighbor search over a large set of geo coordinates? With "regular" spatial index structures like R-trees, which assume planar coordinates, I see two problems (are there others I have overlooked?):
Wraparound at the poles and the International Date Line
Distortion of distances near the poles
How can these factors be allowed for? I guess the second one could be compensated for by transforming the coordinates. Can an R-tree be modified to take wraparound into account? Or are there specialized geo-spatial index structures?
Could you use a locality-sensitive hashing (LSH) algorithm in 3 dimensions? That would quickly give you an approximate neighboring group, which you could then sanity-check by calculating great-circle distances.
Here's a paper describing an algorithm for efficient LSH on the surface of a unit d-dimensional hypersphere. Presumably it works for d=3.
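In a similar spirit, a hedged sketch that lifts coordinates onto the unit sphere in 3D and indexes them with an ordinary kd-tree; the chord length through the sphere is monotonic in the great-circle distance, so the nearest neighbor by chord is also the nearest by great-circle (the sample coordinates are made up):

```python
import numpy as np
from scipy.spatial import cKDTree

def to_unit_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to points on the unit sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

# Hypothetical sample data; the 3D embedding has no seams, so wraparound
# and pole distortion disappear.
lats = np.array([48.85, 35.68, -33.87])
lons = np.array([2.35, 139.69, 151.21])
tree = cKDTree(to_unit_xyz(lats, lons))

chord, idx = tree.query(to_unit_xyz([40.71], [-74.01]), k=1)
angle = 2 * np.arcsin(chord / 2)   # central angle in radians, if needed
```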
Take a look at Geohash.
Also, to compensate for wraparound, use not one but three orthogonal R-trees, so that there is no point on the Earth's surface at which all three trees wrap around. Two points are then close if they are close according to at least one of these trees.

Can I use arbitrary metrics to search KD-Trees?

I just finished implementing a k-d tree for fast nearest-neighbor searches, and I'm interested in experimenting with distance metrics other than the Euclidean distance. My understanding is that the fast k-d tree search is not guaranteed to return exact results with a non-Euclidean metric, which means I might need to implement a new data structure and search algorithm if I want to try out new metrics for my search.
I have two questions:
Does using a kd-tree permanently tie me to the Euclidean distance?
If so, what other sorts of algorithms should I try that work for arbitrary metrics? I don't have a ton of time to implement lots of different data structures, but other structures I'm thinking about include cover trees and vp-trees.
The nearest-neighbour search procedure described on the Wikipedia page you linked to can certainly be generalised to other distance metrics, provided you replace "hypersphere" with the equivalent geometrical object for the given metric, and test each hyperplane for crossings with this object.
Example: if you are using the Manhattan distance instead (i.e. the sum of the absolute values of the differences in each vector component), your hypersphere becomes a (multidimensional) diamond. (This is easiest to visualise in 2D: if your current nearest neighbour is at distance x from the query point p, then any closer neighbour behind a different hyperplane must intersect a diamond shape of width and height 2x centred on p.) This might make the hyperplane-crossing test more difficult to code or slower to run; however, the general principle still applies.
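This generalisation is already exposed by common implementations; for example, SciPy's kd-tree query takes a Minkowski order p, where p=1 gives the Manhattan distance (a minimal sketch):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(7)
data = rng.random((1_000, 4))
tree = cKDTree(data)

query = rng.random(4)
# p=1 is the Manhattan distance (sum of absolute differences);
# p=2 would be the ordinary Euclidean distance.
dist, ind = tree.query(query, k=3, p=1)
```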
I don't think you're tied to Euclidean distance. As j_random_hacker says, you can probably use the Manhattan distance, but I'm pretty sure you're tied to geometries that can be represented in Cartesian coordinates. So you couldn't use a k-d tree to index a metric space, for example.
