I just finished implementing a kd-tree for doing fast nearest neighbor searches. I'm interested in playing around with different distance metrics other than the Euclidean distance. My understanding of the kd-tree is that the speedy kd-tree search is not guaranteed to give exact searches if the metric is non-Euclidean, which means that I might need to implement a new data structure and search algorithm if I want to try out new metrics for my search.
I have two questions:
Does using a kd-tree permanently tie me to the Euclidean distance?
If so, what other sorts of algorithms should I try that work for arbitrary metrics? I don't have a ton of time to implement lots of different data structures, but other structures I'm thinking about include cover trees and vp-trees.
The nearest-neighbour search procedure described on the Wikipedia page you linked to can certainly be generalised to other distance metrics, provided you replace "hypersphere" with the equivalent geometrical object for the given metric, and test each hyperplane for crossings with this object.
Example: if you are using the Manhattan distance instead (i.e. the sum of the absolute values of all differences in vector components), your hypersphere would become a (multidimensional) diamond. (This is easiest to visualise in 2D: if your current nearest neighbour is at distance x from the query point p, then any closer neighbour behind a different hyperplane must lie inside a diamond shape that has width and height 2x and is centred on p.) This might make the hyperplane-crossing test more difficult to code or slower to run; however, the general principle still applies.
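A minimal sketch of this idea, assuming the usual axis-aligned splitting planes of a kd-tree. For such planes, the Manhattan distance from the query to the plane is just the absolute coordinate difference along the split axis, so the diamond-crossing test reduces to one comparison. The names `build`, `nearest`, and `manhattan` are illustrative and not taken from any library.

```python
def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def build(points, depth=0):
    """Build a kd-tree as nested dicts, splitting on the median point."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def nearest(node, query, dist=manhattan, best=None):
    """Return (distance, point) of the nearest neighbour under `dist`."""
    if node is None:
        return best
    d = dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, dist, best)
    # Crossing test: the diamond of radius best[0] around the query reaches
    # the splitting plane only if the per-axis offset is smaller than best[0].
    if abs(diff) < best[0]:
        best = nearest(far, query, dist, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 1)))  # (1, (8, 1))
```

The same pruning comparison is valid for any Minkowski p-norm, because no point on the far side of an axis-aligned plane can be closer than the offset along that axis.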
I don't think you're tied to Euclidean distance (as j_random_hacker says, you can probably use Manhattan distance), but I'm pretty sure you're tied to geometries that can be represented in Cartesian coordinates. So you couldn't use a kd-tree to index a general metric space, where only pairwise distances are available, for example.
I need to answer a lot of nearest-neighbour queries against a point set, where the query points are located far away from the set. All approaches I've found so far work badly in this case (for example, a k-d tree may take O(N) per query) or require a Voronoi diagram (I have ~10m points, so a Voronoi diagram is too expensive).
Is there any known algorithm designed for such a task?
The problem here is the distance. When a query is far from your dataset, the kd-tree has to check many points, which slows down the query.
The scenario you are facing is hard for nearest-neighbor structures in general (and it's not the usual case), but if I were you, I would give Balanced Box-Decomposition trees a shot; you can read more about their algorithm and data structure.
Some multidimensional indexes have kNN queries that could be easily adapted to your needs, especially with k==1.
kNN algorithms usually have to first estimate the approximate nearest neighbour distance, then they use this distance to perform a range query.
In R-Trees or quadtrees, this estimation can be done efficiently by finding the node that is closest to your search point. Then they take one point from the closest node, calculate the distance to the search point, and then perform a range query based on this distance, usually with some multiplier because k>1.
This should be reasonably efficient even if the search point is far away.
If you are searching for one point only (k=1) then you could adapt this algorithm to use a range query that is exactly based on the closest point you found, no extra extension to get k>1 points.
If you are using Java, you could use my open-source implementations here. There is also a PH-Tree (a kind of quadtree, but much more space efficient and faster to load), which uses the same kNN approach.
I would like to know if I am missing any acceleration structure that is designed for retrieving k-nearest spheres within a range.
The context of my question is molecular visualization, specifically, I need to retrieve k-nearest spheres to a point to produce a function that will be used to guide sphere tracing step length.
To simplify, the search can be limited in range to the point being tested.
All the articles I have seen handle the k nearest points to a point, but my case is different, since I want to work with the spheres closest to a point. It seems possible to adapt kd-trees by changing the point test to a sphere test, but I believe that it would affect the performance. So I wonder if there is a better structure or if I should use and adapt kd-trees.
Currently, I am using a Hybrid Bounding Volume Hierarchy, but I think that the search performance could be better with another structure, since I have a big overlap of bounding volumes due to the nature of the molecules.
PS: I don't care much about the construction time. I want good search performance and decent memory occupation.
You could use a 3-step approach:
Find the nearest neighbor using the center-points of the spheres.
For this nearest neighbor, subtract its radius from its center distance and add the maximum radius of any sphere. Then perform a spherical range query with this new distance. This returns the center points of all spheres that may be closest to your query point.
Then you manually calculate the actual distance for each sphere using its actual radius.
This should be reasonably efficient assuming that the radius of spheres is not massively bigger than their average distance.
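The three steps above can be sketched as follows. This is only an illustration: `nearest_center` and `centers_within` are hypothetical brute-force stand-ins for the 1-NN query and the spherical range query of whatever spatial index you use, and distances to a sphere are measured to its surface (negative when the query point is inside).

```python
import math

def nearest_center(centers, q):
    """Stand-in for an index's 1-NN query: returns (distance, index)."""
    return min((math.dist(c, q), i) for i, c in enumerate(centers))

def centers_within(centers, q, radius):
    """Stand-in for an index's spherical range query: returns indices."""
    return [i for i, c in enumerate(centers) if math.dist(c, q) <= radius]

def nearest_sphere(centers, radii, q):
    """Return (surface_distance, index) of the sphere closest to point q."""
    r_max = max(radii)
    # Step 1: nearest neighbour by center point only.
    d, i = nearest_center(centers, q)
    # Step 2: any sphere closer than (d - radii[i]) must have its center
    # within (d - radii[i] + r_max) of the query point.
    candidates = centers_within(centers, q, d - radii[i] + r_max)
    # Step 3: compute the true surface distance for each candidate.
    return min((math.dist(centers[j], q) - radii[j], j) for j in candidates)

centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
radii = [0.5, 8.0]
print(nearest_sphere(centers, radii, (3.0, 0.0, 0.0)))  # (-1.0, 1)
```

In the usage example the nearest center belongs to the small sphere, but the large sphere actually contains the query point, which is why step 2 widens the search by the maximum radius.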
I have a floor on which various sensors are placed at different locations. For every transmitting device, sensors may detect its readings. It is possible to have 6-7 sensors on a floor, and it is possible that a particular reading may not be detected by some sensors, but is detected by some other sensors.
For every reading I get, I would like to identify the location of that reading on the floor. We divide floor logically into TILEs (5x5 feet area) and find what ideally the reading at each TILE should be as detected by each sensor device (based on some transmission pathloss equation).
I am using the precomputed readings from 'N' sensor device at each TILE as a point in N-dimensional space. When I get a real life reading, I find the nearest neighbours of this reading, and assign this reading to that location.
I would like to know if there is a variant of K nearest neighbours, where a dimension could be REMOVED from consideration. This will especially be useful, when a particular sensor is not reporting any reading. I understand that putting weightage on a dimension will be impossible with algorithms like kd-tree or R trees. However, I would like to know if it would be possible to discard a dimension when computing nearest neighbours. Is there any such algorithm?
EDIT:
What I want to know is if the same R/kd tree could be used for k nearest search with different queries, where each query has different dimension weightage? I don't want to construct another kd-tree for every different weightage on dimensions.
EDIT 2:
Is there any library in python, which allows you to specify the custom distance function, and search for k nearest neighbours? Essentially, I would want to use different custom distance functions for different queries.
Both for R-trees and kd-trees, using weighted Minkowski norms is straightforward. Just put the weights into your distance equations!
Putting weights into the Euclidean point-to-rectangle minimum distance is trivial: just look at the regular formula and plug in the weights as desired.
Distances are not used at tree construction time, so you can vary the weights as desired at query time.
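As a rough sketch of what "plugging in the weights" could look like, assuming the weighted Euclidean distance is defined as sqrt(sum of w_i * (x_i - y_i)^2) with non-negative weights (that convention and the function names below are assumptions for illustration). A weight of 0 removes a dimension from consideration for that query.

```python
import math

def weighted_dist(p, q, w):
    """Weighted Euclidean distance between two points."""
    return math.sqrt(sum(wi * (a - b) ** 2 for a, b, wi in zip(p, q, w)))

def weighted_min_dist_to_rect(q, lo, hi, w):
    """Smallest weighted distance from q to any point of the box [lo, hi]."""
    total = 0.0
    for x, l, h, wi in zip(q, lo, hi, w):
        # Per-axis gap between q and the box; 0 if q is inside on this axis.
        gap = max(l - x, 0.0, x - h)
        total += wi * gap * gap
    return math.sqrt(total)

q = (0.0, 0.0)
lo, hi = (1.0, 1.0), (2.0, 3.0)
print(weighted_min_dist_to_rect(q, lo, hi, (1.0, 1.0)))  # unweighted bound
print(weighted_min_dist_to_rect(q, lo, hi, (1.0, 0.0)))  # second dimension ignored
```

Because the bound is computed from the stored rectangle and the query-time weights only, the same tree can serve queries with different weights.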
After going through a lot of questions on Stack Overflow, and finally going into the details of the scipy kd-tree source code, I realised that the answer by "celion" in the following link is correct:
KD-Trees and missing values (vector comparison)
Excerpt:
"I think the best solution involves getting your hands dirty in the code that you're working with. Presumably the nearest-neighbor search computes the distance between the point in the tree leaf and the query vector; you should be able to modify this to handle the case where the point and the query vector are different sizes. E.g. if the points in the tree are given in 3D, but your query vector is only length 2, then the "distance" between the point (p0, p1, p2) and the query vector (x0, x1) would be
sqrt( (p0-x0)^2 + (p1-x1)^2 )
I didn't dig into the java code that you linked to, but I can try to find exactly where the change would need to go if you need help.
-Chris
PS - you might not need the sqrt in the equation above, since distance squared is usually equivalent."
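To make the quoted formula concrete, here is a brute-force sketch (not a kd-tree modification) in which a missing sensor reading is marked with `None` and that dimension is skipped. The tile names, the readings, and the use of `None` as the missing-value marker are all made up for illustration.

```python
import heapq

def masked_sq_dist(point, query):
    """Squared distance over the dimensions where the query has a value."""
    return sum((p - x) ** 2 for p, x in zip(point, query) if x is not None)

def k_nearest_tiles(tiles, query, k):
    """Return the names of the k tiles whose expected readings are closest."""
    return heapq.nsmallest(k, tiles, key=lambda name: masked_sq_dist(tiles[name], query))

# Hypothetical expected readings per tile, one value per sensor.
tiles = {
    "A": (-40, -60, -70),
    "B": (-55, -50, -65),
    "C": (-70, -45, -50),
}
print(k_nearest_tiles(tiles, (-42, None, -69), 1))   # sensor 2 did not report
print(k_nearest_tiles(tiles, (None, -46, None), 2))  # only sensor 2 reported
```

Inside a tree search, the same skip would apply to the leaf distance, and a node that splits on a missing dimension would need both of its children visited, since that dimension cannot be used for pruning.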
I have been given coordinates of n fixed points and m query points. I have to find the k-nearest neighbors of each of the m query points from the n fixed points. Finding distances separately for each query point is very costly. Is there an efficient way of doing this?
There are fast indexing structures for such problems, like KD Tree or Ball Tree. In particular - scikit-learn (sklearn) implements them in their knn routines ( http://scikit-learn.org/stable/modules/neighbors.html )
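A short sketch of how this might look with scikit-learn's `NearestNeighbors` class, using random data in place of your n fixed points and m query points. The array sizes and the value of `k` are arbitrary.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
fixed = rng.random((1000, 3))    # n fixed points
queries = rng.random((5, 3))     # m query points
k = 4

# Build the index once, then answer all queries in one call.
index = NearestNeighbors(n_neighbors=k, algorithm="kd_tree").fit(fixed)
distances, indices = index.kneighbors(queries)

print(indices.shape)    # one row of k neighbour indices per query point
print(distances.shape)  # matching distances, sorted ascending in each row
```

The tree is built once from the fixed points, so its construction cost is shared across all m queries.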
A real answer to your question depends on numerous factors. For example, kd-trees rely on coordinate-wise bounds, so if your distance is not Euclidean (or a similar Minkowski-type metric), you generally can't use a KD Tree. There are also scaling issues (how many points are indexed, the dimensionality, how "clustered" the data is), how long you can wait for training, whether values need to be added to the set, and so on.
A number of less commonly available, but still useful, algorithms for this are available in JSAT. These include VP Trees, RBC, and LSH. (Bias warning: I'm the author of JSAT.)
If you are working out the square root of the sum of the squares to get the distances, try dropping the square root, which is computationally intensive. Just find the points with the smallest squared distances; they are the same points.
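A small illustration of that point: ranking by squared distance gives the same neighbours as ranking by true distance, because the square root is monotonic. The function names are illustrative only.

```python
import heapq

def squared_distance(p, q):
    """Sum of squared coordinate differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_nearest(points, query, k):
    """Return the k points closest to `query`, ranked by squared distance."""
    return heapq.nsmallest(k, points, key=lambda p: squared_distance(p, query))

points = [(0, 0), (3, 4), (1, 1), (10, 10)]
print(k_nearest(points, (0, 0), 2))  # [(0, 0), (1, 1)]
```

This only removes the square root from each comparison; it still examines every fixed point for every query, so it complements an index rather than replacing one.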
What kind of data structure could be used for an efficient nearest neighbor search in a large set of geo coordinates? With "regular" spatial index structures like R-Trees that assume planar coordinates, I see two problems (Are there others I have overlooked?):
Wraparound at the poles and the International Date Line
Distortion of distances near the poles
How can these factors be allowed for? I guess the second one could be compensated for by transforming the coordinates. Can an R-Tree be modified to take wraparound into account? Or are there specialized geo-spatial index structures?
Could you use a locality-sensitive hashing (LSH) algorithm in 3 dimensions? That would quickly give you an approximate neighboring group which you could then sanity-check by calculating great-circle distances.
Here's a paper describing an algorithm for efficient LSH on the surface of a unit d-dimensional hypersphere. Presumably it works for d=3.
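For the sanity-check step, a great-circle distance can be computed with the haversine formula. This is a minimal sketch assuming a spherical Earth with a mean radius of 6371.0088 km; the function name and the choice of kilometres are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

# Two points on the equator on either side of the International Date Line:
# they are one degree apart, not 359 degrees.
print(haversine_km(0.0, 179.5, 0.0, -179.5))
```

Because the formula works on angles rather than planar coordinates, it handles the date line and the poles without special cases.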
Take a look at Geohash.
Also, to compensate for wraparound, simply use not one but three orthogonal R-trees, so that there does not exist a point on the earth surface such that all three trees have a wraparound at that point. Then, two points are close if they are close according to at least one of these trees.