Nearest Neighbor Algorithm in R-Tree

I'm reading the paper by Guttman (link to paper/book), and I was wondering how nearest neighbor queries work with R-trees, or rather how they are actually implemented.
What I have thought of is that you traverse the tree starting at the root and check whether one of the entries includes the query point.
So the first question: if a rectangle includes the query point, that does not mean that all rectangles inside it are automatically the nearest to the query point, right? It is also possible that another rectangle has a smaller distance, even if the query point does not lie inside it?
Second, assume the query is actually a minimum bounding box, for example mbr = [left, bottom, right, top], and I want all rectangles that overlap this region, or better, all rectangles whose centroid lies inside the given region. Is this also possible?
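For what it's worth, I imagine both tests reduce to simple interval comparisons. A minimal sketch of what I mean, assuming mbr = [left, bottom, right, top] (function names are just for illustration):

```python
def overlaps(a, b):
    # True if rectangles a and b intersect; each is [left, bottom, right, top].
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def centroid_inside(rect, region):
    # True if the centroid of rect lies inside region.
    cx = (rect[0] + rect[2]) / 2.0
    cy = (rect[1] + rect[3]) / 2.0
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]
```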

EDIT
After numerous experiments, I found that the algorithm by
Hjaltason, Gísli R., and Hanan Samet. "Distance browsing in spatial databases." ACM Transactions on Database Systems (TODS) 24.2 (1999): 265-318.
(as posted in the answer by @Anony-Mousse) is clearly superior to the algorithms I describe here.
OLD ANSWER:
As far as I know, the best kNN search algorithm is the one by
Cheung, King Lum, and Ada Wai-Chee Fu. "Enhanced nearest neighbour search on the R-tree." ACM SIGMOD Record 27.3 (1998): 16-21. (copied from the answer by @Anony-Mousse) PDF download
The basic algorithm is also explained in this presentation.
If I remember correctly, it does the following things (sketched in code below):
- Traverse all nodes in the tree, except those that can be excluded based on the current maximal known distance.
- Order candidate subnodes before traversing them, so that the 'closest' subnodes are traversed first.
As a result, this algorithm finds the closest neighbours very quickly and traverses hardly any nodes that do not contain part of the final result.
Interestingly, the algorithm by Cheung et al. improves on previous algorithms by removing some checks that were meant to exclude even more subnodes before traversing them. They showed that these additional checks could not possibly exclude any additional subnodes.
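A rough Python sketch of that pruning-plus-ordering idea (not the authors' exact algorithm; the node/entry attributes is_leaf, entries, rect and child are assumptions made for illustration):

```python
import heapq

def mindist(p, rect):
    # Smallest possible distance from point p to rectangle
    # rect = [left, bottom, right, top]; 0 if p lies inside rect.
    dx = max(rect[0] - p[0], 0.0, p[0] - rect[2])
    dy = max(rect[1] - p[1], 0.0, p[1] - rect[3])
    return (dx * dx + dy * dy) ** 0.5

def knn(node, p, k, heap=None):
    # heap is a max-heap (via negated distances) of the best k candidates.
    if heap is None:
        heap = []
    if node.is_leaf:
        for entry in node.entries:
            d = mindist(p, entry.rect)
            if len(heap) < k:
                heapq.heappush(heap, (-d, id(entry), entry))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, id(entry), entry))
    else:
        # Visit subnodes closest-first; once the heap is full, any subnode
        # whose MINDIST exceeds the current k-th distance can be skipped,
        # and so can every subnode after it in the sorted order.
        for child in sorted(node.entries, key=lambda c: mindist(p, c.rect)):
            if len(heap) == k and mindist(p, child.rect) >= -heap[0][0]:
                break
            knn(child.child, p, k, heap)
    return sorted([(-nd, e) for nd, _, e in heap], key=lambda t: t[0])
```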

There are many papers on finding nearest neighbors in R-trees:
Roussopoulos, Nick, Stephen Kelley, and Frédéric Vincent. "Nearest neighbor queries." ACM SIGMOD Record 24.2 (1995).
Papadopoulos, Apostolos, and Yannis Manolopoulos. "Performance of nearest neighbor queries in R-trees." Database Theory—ICDT '97 (1997): 394-408.
Hjaltason, Gísli R., and Hanan Samet. "Distance browsing in spatial databases." ACM Transactions on Database Systems (TODS) 24.2 (1999): 265-318.
Cheung, King Lum, and Ada Wai-Chee Fu. "Enhanced nearest neighbour search on the R-tree." ACM SIGMOD Record 27.3 (1998): 16-21.
Berchtold, Stefan, Christian Böhm, Daniel A. Keim, and Hans-Peter Kriegel. "A cost model for nearest neighbor search in high-dimensional data space." Proceedings of PODS 1997: 78-86.

Related

Is there an algorithm to link points that minimises Manhattan length?

I'm trying to link points in the plane, i.e. draw a graph, but using only axis-aligned lines. I found the KDTree algorithm to be quite promising and close to what I need, but it does not make sure the segments are as short as possible. The result I'm looking for is closer to a tree of short, axis-aligned segments that connects all the points.
I have also read up on
https://en.wikipedia.org/wiki/Delaunay_triangulation
because initially I thought that would be it, but it turns out it's way off:
- based on circles and triangles
- traces a perimeter
- nodes have multiple connections (>=3)
Can you point me towards an algorithm that already exists, or can you help me draft a new one?
PS: only 1000-1100 points, so efficiency is not super important.
In terms of goals and costs, reaching all nodes is the goal and the total length of all segments is the cost.
Thanks to MBo, I now know that this is known as 'the Steiner tree problem'. It is the subject of a 1992 book (of the same name) demonstrating that it is an NP-hard problem. Mine is the rectilinear variant of it. There are a few approximate and heuristic algorithms known to (help) solve it.
( HAN, HAN4, LBH, HWAB, HWAD, BEA are listed inside
https://www.sciencedirect.com/science/article/pii/0166218X9390010L )
I haven't found anything yet that a "practitioner" might be able to actually use. Still looking.
It seems like a 'good' way forward is (the first three steps are sketched in code below):
1. Compute candidate edges using Delaunay triangulation.
2. Label each edge with its length or rectilinear distance.
3. Find the minimum spanning tree (with algorithms such as Borůvka's, Prim's, or Kruskal's).
4. Add Steiner points restricted to the Hanan grid.
5. Move edges to the grid.
Still unclear about that last step. How?
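Here is a minimal sketch of steps 1-3 with SciPy, assuming 2D points (the Hanan-grid steps 4 and 5 are the genuinely hard part and are not shown):

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import Delaunay

def rectilinear_mst(points):
    # Steps 1-3: Delaunay edges, Manhattan weights, minimum spanning tree.
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    graph = lil_matrix((n, n))
    for simplex in Delaunay(pts).simplices:
        for i in range(3):
            a, b = simplex[i], simplex[(i + 1) % 3]
            w = np.abs(pts[a] - pts[b]).sum()  # rectilinear edge length
            graph[a, b] = graph[b, a] = w
    mst = minimum_spanning_tree(graph.tocsr())
    rows, cols = mst.nonzero()
    return list(zip(rows.tolist(), cols.tolist()))  # index pairs of tree edges
```

Each returned pair still has to be drawn as an L-shaped (two-segment) connection or snapped to the Hanan grid, which is exactly the step that remains unclear.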

Manhattan, Euclidean and Chebyshev in an A* Algorithm

I am confused about the purpose of the Manhattan, Euclidean and Chebyshev metrics in an A* algorithm. Is it just the distance calculation, or does the A* algorithm find paths in different ways depending on the metric (vertical & horizontal, diagonal, or all three)? My impression was that these three metrics simply calculate distance in different ways, as shown on this website: https://lyfat.wordpress.com/2012/05/22/euclidean-vs-chebyshev-vs-manhattan-distance/
But some people tell me that the A* algorithm moves only vertically and horizontally if the Manhattan metric is used (and must be drawn that way), only diagonally for Euclidean, and in all three ways for Chebyshev.
So what I wanted to clarify is: does the A* algorithm move in different directions based on the metric (Manhattan, Chebyshev, Euclidean), or does it move in all directions but with different heuristic costs based on the metric? I am a student and have been confused by this, so any clarification is appreciated!
Actually, things are a little bit the other way around: we usually know the movement type we are interested in, and this movement type determines which metric (Manhattan, Chebyshev, Euclidean) is best to use in the heuristic.
Changing the heuristic will not change the connectivity of neighboring cells.
To make the A* algorithm find paths according to a particular movement type (i.e. only horizontal + vertical, or diagonal, etc.), the neighbor enumeration procedure must be set accordingly. (This enumeration of a node's neighbors is done somewhere inside the main loop of the algorithm, after a node is popped from the queue.)
In brief, it is not the heuristic but the way a node's neighbors are enumerated that determines which types of movement the A* algorithm allows.
Afterwards, once a movement type has been established and encoded into the algorithm as described above, it is also important to find a good heuristic. The heuristic needs to satisfy certain criteria to be valid (it must not over-estimate the distance to the target), so some heuristics are incompatible with certain movement types. Choosing an invalid heuristic removes the guarantee that A* will find the optimal solution. A good choice is precisely the metric that measures distance under the selected movement type (e.g. Manhattan for horizontal/vertical movement, and so on).
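A small sketch of that separation of concerns (the grid is represented here as a set of walkable cells, which is an assumption for illustration):

```python
# The movement type lives in the neighbor enumeration...
CARDINAL = [(1, 0), (-1, 0), (0, 1), (0, -1)]           # horizontal + vertical
OCTO = CARDINAL + [(1, 1), (1, -1), (-1, 1), (-1, -1)]  # plus diagonals

def neighbors(cell, moves, walkable):
    # Enumerate reachable neighbors; 'moves' is what fixes the movement type.
    x, y = cell
    for dx, dy in moves:
        if (x + dx, y + dy) in walkable:
            yield (x + dx, y + dy)

# ...while the heuristic only scores nodes, and should match the movement type.
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])  # valid for CARDINAL moves

def chebyshev(a, b):
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))  # valid for unit-cost OCTO moves
```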
It is also worth mentioning the octile distance, which is a very accurate estimate of the distance when travelling on a grid that allows diagonal moves. It basically estimates a direct path from A to B where diagonal moves cost sqrt(2) instead of the 1 of cardinal moves. In other words, it is a kind of Manhattan distance with diagonals.
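For unit-cost cardinal moves and sqrt(2)-cost diagonal moves, the octile distance can be written as:

```python
import math

def octile(a, b):
    # Equivalent to max(dx, dy) + (sqrt(2) - 1) * min(dx, dy):
    # min(dx, dy) diagonal steps, the rest straight steps.
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return (dx + dy) + (math.sqrt(2) - 2) * min(dx, dy)
```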
A very good resource on all of these grid heuristics can be found here:
http://theory.stanford.edu/~amitp/GameProgramming/Heuristics.html

3D Nearest neighbour for query points located far away from the point set

I need to answer a lot of queries that ask for the nearest neighbour in a point set when the query point is located far away from it. All approaches I've found so far perform badly in this case (for example, a k-d tree may take O(N) per query) or require a Voronoi diagram (I have ~10m points, so a Voronoi diagram is too expensive).
Is there any known algorithm designed for such a task?
The problem here is the distances: when a query point is far from your dataset, the k-d tree has to check many points, which slows down the query.
The scenario you are facing is hard for nearest-neighbour structures in general (and it is not the usual case), but if I were you I would give Balanced Box-Decomposition (BBD) trees a shot and read up on their algorithm and data structure.
Some multidimensional indexes have kNN queries that could easily be adapted to your needs, especially with k == 1.
kNN algorithms usually have to first estimate the approximate nearest-neighbour distance, and then use this distance to perform a range query.
In R-trees or quadtrees, this estimate can be obtained efficiently by finding the node closest to the search point: take one point from that closest node, calculate its distance to the search point, and then perform a range query based on this distance, usually with some multiplier because k > 1.
This should be reasonably efficient even if the search point is far away.
If you are searching for one point only (k = 1), you can adapt this algorithm to use a range query whose radius is exactly the distance to the closest point you found, with no extra extension to get k > 1 points.
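In pseudo-Python, the two-phase idea looks roughly like this (closest_leaf, any_point and range_query are hypothetical methods standing in for whatever your index provides):

```python
import math

def nearest_far_away(tree, q):
    # Phase 1: descend to the leaf with the smallest MINDIST to q and use
    # any point stored there as an upper bound on the true NN distance.
    leaf = tree.closest_leaf(q)              # hypothetical API
    bound = math.dist(q, leaf.any_point())   # guaranteed >= true NN distance
    # Phase 2: a range query with that radius must contain the true NN;
    # for k > 1 the radius would be inflated by some multiplier.
    candidates = tree.range_query(q, radius=bound)  # hypothetical API
    return min(candidates, key=lambda p: math.dist(q, p))
```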
If you are using Java, you could use my open-source implementations here. There is also a PH-Tree (a kind of quadtree, but much more space-efficient and faster to load) which uses the same kNN approach.

On-line Ball tree Algorithm

I need a method to construct a ball tree in an on-line manner for nearest-neighbour search. Until now I have been using scikit-learn's Ball Tree nearest-neighbour module for my scientific calculations, but it is not feasible when new data keeps arriving and the entire ball tree has to be reconstructed every time. I have not found much literature on on-line construction; the Wikipedia article here describes an off-line method, so I wanted to pose the question to the SO community.
You can try a space-filling curve: translate the coordinates to binary and interleave the bits, optionally treating the result as a base-4 number.
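A minimal sketch of the bit interleaving (a Morton/Z-order code, assuming non-negative integer coordinates); keeping points sorted by this key is cheap to maintain as new data arrives:

```python
def morton(x, y, bits=16):
    # Interleave the bits of x and y: points near each other on the
    # resulting curve tend to be near each other in space.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# An insertion-friendly ordering: new points are inserted by key, and
# candidate neighbours are found near a query's position in the order.
points = sorted([(5, 9), (3, 4), (8, 1)], key=lambda p: morton(*p))
```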

Can I use arbitrary metrics to search KD-Trees?

I just finished implementing a kd-tree for fast nearest-neighbour searches. I'm interested in playing around with distance metrics other than the Euclidean distance. My understanding of the kd-tree is that the speedy search is not guaranteed to be exact if the metric is non-Euclidean, which means that I might need to implement a new data structure and search algorithm if I want to try out new metrics.
I have two questions:
Does using a kd-tree permanently tie me to the Euclidean distance?
If so, what other sorts of algorithms should I try that work for arbitrary metrics? I don't have a ton of time to implement lots of different data structures, but other structures I'm thinking about include cover trees and vp-trees.
The nearest-neighbour search procedure described on the Wikipedia page you linked to can certainly be generalised to other distance metrics, provided you replace "hypersphere" with the equivalent geometrical object for the given metric, and test each hyperplane for crossings with this object.
Example: if you are using the Manhattan distance instead (i.e. the sum of the absolute values of all differences in vector components), your hypersphere would become a (multidimensional) diamond. (This is easiest to visualise in 2D -- if your current nearest neighbour is at distance x from the query point p, then any closer neighbour behind a different hyperplane must intersect a diamond shape that has width and height 2x and is centred on p). This might make the hyperplane-crossing test more difficult to code or slower to run, however the general principle still applies.
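One way to see why: the splitting planes in a kd-tree are axis-aligned, so the closest point of the far half-space differs from the query in a single coordinate. The crossing test can therefore be written generically (a sketch; metric is any L_p-style distance function):

```python
def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

def far_side_may_contain_closer(query, axis, split, best_dist, metric=manhattan):
    # Nearest point of the far half-space: the query projected onto the plane.
    on_plane = list(query)
    on_plane[axis] = split
    return metric(query, tuple(on_plane)) < best_dist
```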
I don't think you're tied to Euclidean distance; as j_random_hacker says, you can probably use Manhattan distance. But I'm pretty sure you're tied to geometries that can be represented in Cartesian coordinates, so you couldn't use a kd-tree to index a general metric space, for example.
