I need a method to construct a ball tree in an on-line manner for nearest-neighbour search. Until now I have been using scikit-learn's Ball Tree nearest-neighbour module for my scientific calculations, but this is not feasible when new data keeps arriving and the entire ball tree has to be rebuilt every time. I have not found much literature on on-line construction algorithms (the Wikipedia article here suggests an off-line method), so I wanted to pose the question to the SO community.
You can try a space-filling curve. Convert each coordinate to binary and interleave the bits; you can then treat the result as a base-4 number.
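The bit interleaving mentioned above can be sketched as follows for the 2D case. This is only an illustration: the function names, the 16-bit grid resolution and the [lo, hi] bounds are assumptions, not part of the suggestion itself.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return key


def morton_key(point, lo, hi, bits=16):
    """Quantise a 2D point with coordinates in [lo, hi] and return its Z-order key.

    Each pair of key bits (y_bit, x_bit) is one base-4 digit.
    """
    scale = (1 << bits) - 1
    cells = [int((c - lo) / (hi - lo) * scale) for c in point]
    return interleave_bits(cells[0], cells[1], bits)
```

Because the keys are plain integers, new points can be added on-line to any sorted container (for example with `bisect.insort`). Points that are close on the curve are usually, but not always, close in space, so candidates found this way still need a distance check.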
I am reading the paper by Guttman (link to paper/book).
I was wondering how nearest neighbor queries work with R-trees, and how they are actually implemented.
What I have thought of is that you traverse the tree starting at the root and check whether one of the entries includes the query point.
So the first question is: if a rectangle includes the query point, this does not mean that the rectangles inside it are automatically the nearest to the query point, does it? Is it also possible that another rectangle is at a smaller distance, even though the query point does not lie inside it?
Second, assume the query is actually a minimum bounding box, for example mbr = [left, bottom, right, top], and I want all rectangles that overlap this region or, better, all rectangles whose centroid lies inside the given region. Is this also possible?
EDIT
After doing numerous experiments, I have found that the algorithm by
Hjaltason, Gísli R., and Hanan Samet. "Distance browsing in spatial databases." ACM Transactions on Database Systems (TODS) 24.2 (1999): 265-318.
(as posted in the answer by #Anony-Mousse) is clearly superior to the algorithms I describe here.
OLD ANSWER:
As far as I know, the best kNN search algorithm is the one by
Cheung, King Lum, and Ada Wai-Chee Fu. "Enhanced nearest neighbour search on the R-tree." ACM SIGMOD Record 27.3 (1998): 16-21. (copied from answer by #Anony-Mousse) PDF download
The basic algorithm is also explained in this presentation.
If I remember correctly, it does the following things:
Traverse all nodes in the tree, except if they can be excluded based on the current maximal known distance.
Order candidate subnodes before traversing them such that the 'closest' subnodes are traversed first.
As a result, this algorithm finds the closest neighbours very quickly and traverses few, if any, nodes that do not contain part of the end result.
Interestingly, the algorithm by Cheung and Fu improves on previous algorithms by removing some checks that were meant to exclude even more subnodes before traversing them. They showed that these additional checks could not possibly exclude any nodes.
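The two ideas listed above (skip nodes that cannot beat the current k-th best distance, and visit the closest subnodes first) can be sketched roughly as below. This is not the exact algorithm from either paper: the tree is built by a toy median-split bulk loader rather than by Guttman's insertion, and all names (`Node`, `build`, `mindist`, `knn`) are made up for the illustration.

```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple                                      # lower corner of the bounding box
    hi: tuple                                      # upper corner of the bounding box
    children: list = field(default_factory=list)   # filled for inner nodes
    points: list = field(default_factory=list)     # filled for leaves


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build(points, leaf_size=4, depth=0):
    """Toy bulk loader (an assumption): split at the median, alternating axes."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    if len(points) <= leaf_size:
        return Node(lo, hi, points=list(points))
    pts = sorted(points, key=lambda p: p[depth % len(lo)])
    mid = len(pts) // 2
    return Node(lo, hi, children=[build(pts[:mid], leaf_size, depth + 1),
                                  build(pts[mid:], leaf_size, depth + 1)])


def mindist(q, node):
    """Smallest possible distance from q to anything inside the node's box."""
    return math.sqrt(sum(max(l - c, 0.0, c - h) ** 2
                         for c, l, h in zip(q, node.lo, node.hi)))


def knn(root, q, k):
    best = []   # max-heap of the k best so far, stored as (-distance, point)

    def visit(node):
        # Exclude the node if its box cannot beat the current k-th best distance.
        if len(best) == k and mindist(q, node) >= -best[0][0]:
            return
        for p in node.points:
            d = dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
        # Traverse the closest subnodes first so the bound tightens early.
        for child in sorted(node.children, key=lambda c: mindist(q, c)):
            visit(child)

    visit(root)
    return [p for _, p in sorted(best, key=lambda t: -t[0])]
```

Calling `knn(build(points), query, k)` returns the k nearest points ordered by increasing distance. Note that the query point does not have to lie inside any rectangle: only the box-to-point distance `mindist` matters.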
There are many papers on finding nearest neighbors in R-trees.
Roussopoulos, Nick, Stephen Kelley, and Frédéric Vincent. "Nearest neighbor queries." ACM SIGMOD Record. Vol. 24. No. 2. ACM, 1995.
Papadopoulos, Apostolos, and Yannis Manolopoulos. "Performance of nearest neighbor queries in R-trees." Database Theory—ICDT'97 (1997): 394-408.
Hjaltason, Gísli R., and Hanan Samet. "Distance browsing in spatial databases." ACM Transactions on Database Systems (TODS) 24.2 (1999): 265-318.
Cheung, King Lum, and Ada Wai-Chee Fu. "Enhanced nearest neighbour search on the R-tree." ACM SIGMOD Record 27.3 (1998): 16-21.
Berchtold, Stefan, Christian Böhm, Daniel A. Keim, and Hans-Peter Kriegel. "A cost model for nearest neighbor search in high-dimensional data space." Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 1997: 78-86.
I need to answer a lot of nearest-neighbour queries on a point set that is located far away from the query point. All approaches I've found so far perform badly in this case (for example, a k-d tree may take O(N) per query) or require a Voronoi diagram (I have ~10 million points, so a Voronoi diagram is too expensive).
Is there any known algorithm designed for such a task?
The problem here is the distances: when a query is far from your dataset, the k-d tree has to check many points, which slows down the query.
The scenario you are facing is hard for nearest-neighbor structures in general (and it's not the usual case), but if I were you, I would give Balanced Box-Decomposition (BBD) trees a shot; you can read more about their algorithm and data structure there.
Some multidimensional indexes have kNN queries that could easily be adapted to your needs, especially with k==1.
kNN algorithms usually have to first estimate the approximate nearest neighbour distance, then they use this distance to perform a range query.
In R-Trees or quadtrees, this estimation can be done efficiently by finding the node that is closest to your search point. Then they take one point from the closest node, calculate the distance to the search point, and then perform a range query based on this distance, usually with some multiplier because k>1.
This should be reasonably efficient even if the search point is far away.
If you are searching for one point only (k=1), you could adapt this algorithm to use a range query based exactly on the distance to the closest point you found, with no extra extension to collect k>1 points.
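For k=1, the estimate-then-range-query idea described above might look roughly like this. It is a sketch under assumptions: a toy box tree (tuples of `lo`, `hi`, `children`, `points`) stands in for an R-tree or quadtree, and none of the names come from the Java library mentioned below.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build(points, leaf_size=8, depth=0):
    """Toy box tree (an assumption): node = (lo, hi, children, points)."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    if len(points) <= leaf_size:
        return (lo, hi, [], list(points))
    pts = sorted(points, key=lambda p: p[depth % len(lo)])
    mid = len(pts) // 2
    return (lo, hi, [build(pts[:mid], leaf_size, depth + 1),
                     build(pts[mid:], leaf_size, depth + 1)], [])


def mindist(q, node):
    """Smallest possible distance from q to the node's bounding box."""
    return math.sqrt(sum(max(l - c, 0.0, c - h) ** 2
                         for c, l, h in zip(q, node[0], node[1])))


def estimate(q, node):
    """Step 1: descend to the closest leaf and measure one of its points."""
    while node[2]:
        node = min(node[2], key=lambda child: mindist(q, child))
    return dist(q, node[3][0])


def range_query(q, node, radius, out):
    """Step 2: collect all points within `radius`, skipping boxes that are too far."""
    if mindist(q, node) > radius:
        return out
    out.extend(p for p in node[3] if dist(q, p) <= radius)
    for child in node[2]:
        range_query(q, child, radius, out)
    return out


def nearest(q, root):
    radius = estimate(q, root)                 # upper bound on the 1-NN distance
    candidates = range_query(q, root, radius, [])
    return min(candidates, key=lambda p: dist(q, p))
```

The estimate is always the distance to a real data point, so the range query is guaranteed to contain at least that point, and the true nearest neighbour cannot lie outside the radius.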
If you are using Java, you could use my open-source implementations here. There is also a PH-Tree (a kind of quadtree, but much more space efficient and faster to load), which uses the same kNN approach.
Suppose there is a point cloud of 50 000 points in x-y-z 3D space. For every point in this cloud, what algorithms or data structures should be used to find the k neighbours of a given point that lie within a distance of [R,r]? The naive way is to go through each of the other 49 999 points for each of the 50 000 points and test the metric, but this approach takes a long time. Just as a k-d tree finds the nearest neighbour quickly, is there some real-time data structure or algorithm that pre-processes the point cloud to achieve this goal in the shortest time?
Your problem is part of the topic of Nearest Neighbor Search, or more precisely, k-Nearest Neighbor Search. The answer to your question depends on the data structure you are using to store the points. If you use R-trees or variants like R*-trees, and you are doing multiple searches on your database, you will likely find a substantial performance improvement in two or three-dimensional space compared with naive linear search. In higher dimensions, space partitioning schemes tend to underperform linear search.
As some answers already suggest, for NN search you could use a tree structure such as a k-d tree. Implementations are available for most programming languages.
If your description [R,r] suggests a hollow sphere you should compare one-time-testing (within interval) vs. two stages (test-for-outer and remove samples that pass test-for-inner).
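If [R,r] does mean a hollow sphere, the one-pass variant of the test could look like the sketch below. The names `r_inner` and `r_outer` are assumptions (the question does not say which of R and r is the larger), and the neighbour search is deliberately brute force just to show the test itself.

```python
def in_shell(p, q, r_inner, r_outer):
    """One-pass test: does q lie in the hollow sphere r_inner <= |p - q| <= r_outer?"""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))   # squared distance, no sqrt needed
    return r_inner ** 2 <= d2 <= r_outer ** 2


def shell_neighbours(points, i, r_inner, r_outer):
    """Indices of all points inside the shell around points[i] (brute force, O(n))."""
    p = points[i]
    return [j for j, q in enumerate(points)
            if j != i and in_shell(p, q, r_inner, r_outer)]
```

With a spatial index in place, the same `in_shell` test would be applied only to the candidates returned by a range query with radius `r_outer`.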
You also did not mention performance requirements (timing or frame rate?) and your intended application (feasible approach?).
If you are using an ordinary Euclidean metric, you could go through the list three times and extract those points that are within R in each dimension, essentially extracting the enclosing cube. Searching the resulting list would still be O(n^2), but on a much smaller n.
There are efficient algorithms (on average, for random data); see Nearest neighbor search.
Your approach is simple but not efficient.
Please read through, check your requirements and get back so we can help.
I'm writing an app that looks up points in two-dimensional space using a k-d tree. It would be nice, during development, to be able to "see" the nearest-neighbor zones surrounding each point.
In the attached image, the red points are points in the k-d tree, and the blue lines surrounding each point bound the zone where a nearest neighbor search will return the contained point.
The image was created thusly:
for each point in the space:
    da = distance to nearest neighbor
    db = distance to second-nearest neighbor
    if absolute_value(da - db) < 4:
        draw blue pixel
This algorithm has two problems:
(more important) It's slow on my (reasonably fast Core i7) computer.
(less important) It's sloppy, as you can see by the varying widths of the blue lines.
What is this "visualization" of a set of points called?
What are some good algorithms to create such a visualization?
This is called a Voronoi diagram, and there are many excellent algorithms for generating them efficiently. The one I've heard about most is Fortune's algorithm, which runs in O(n log n) time, though other algorithms exist for this problem.
Hope this helps!
Jacob,
hey, you found an interesting way of generating this Voronoi diagram, even though it is not so efficient.
The less important issue first: the varying-thickness boundaries that you get, those butterfly shapes, are in fact the area between the two branches of a hyperbola, precisely the hyperbola given by the equation |da - db| = 4. To get a line of constant thickness instead, replace this criterion with the distance to the perpendicular bisector of the two nearest neighbors, call them A and B; using vector calculus, | AP.AB/||AB|| - ||AB||/2 | < 4, where AP is the vector from A to the pixel P.
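As a small illustration of the bisector criterion, the test for one pixel could be written as below, with the projected vector taken from A to the pixel P. The function names and the brute-force choice of the two nearest sites are assumptions; in the app, the k-d tree would supply A and B.

```python
import math


def bisector_distance(p, a, b):
    """Distance from pixel p to the perpendicular bisector of segment AB."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    apx, apy = p[0] - a[0], p[1] - a[1]
    length = math.hypot(abx, aby)
    # Project AP onto the unit vector along AB, then measure from the midpoint of AB.
    return abs((apx * abx + apy * aby) / length - length / 2)


def on_boundary(p, sites, half_width=4):
    """True if p lies within half_width of the bisector of its two nearest sites."""
    a, b = sorted(sites, key=lambda s: math.dist(p, s))[:2]
    return bisector_distance(p, a, b) < half_width
```

Unlike |da - db| < 4, this test gives a band of constant width (twice `half_width`) along each edge.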
The more important issue: there are two well-known efficient solutions to the construction of the Voronoi diagram of a set of points: Fortune's sweep algorithm (as mentioned by templatetypedef) and Preparata & Shamos' divide-and-conquer solution. Both run in optimal time O(N log N) for N points, but aren't so easy to implement.
These algorithms will construct the Voronoi diagram as a set of line segments and half-lines. Check http://en.wikipedia.org/wiki/Voronoi_diagram.
The paper "Primitives for the manipulation of general subdivisions and the computation of Voronoi" describes both algorithms using a somewhat high-level framework and takes care of all implementation details; the article is difficult, but the algorithms are implementable.
You may also have a look at "A straightforward iterative algorithm for the planar Voronoi diagram", which I never tried.
A totally different approach is to directly build the distance map from the given points for example by means of Dijkstra's algorithm: starting from the given points, you grow the boundary of the area within a given distance from every point and you stop growing when two boundaries meet. [More explanations required.] See http://1.bp.blogspot.com/-O6rXggLa9fE/TnAwz4f9hXI/AAAAAAAAAPk/0vrqEKRPVIw/s1600/distmap-20-seed4-fin.jpg
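One possible reading of the region-growing idea is a multi-source Dijkstra on the pixel grid, sketched below. This is an interpretation, not the method behind the linked image: it uses 8-connected steps (cost 1 or √2), so the distances are chamfer approximations of the Euclidean distance and the region boundaries only approximate the true Voronoi edges. All names are made up.

```python
import heapq
import math


def grow_regions(width, height, seeds):
    """Grow all seeds at once; return (dist, owner) grids indexed as [y][x]."""
    dist = [[math.inf] * width for _ in range(height)]
    owner = [[-1] * width for _ in range(height)]
    heap = []
    for i, (x, y) in enumerate(seeds):
        dist[y][x] = 0.0
        owner[y][x] = i
        heap.append((0.0, x, y, i))
    heapq.heapify(heap)
    steps = [(dx, dy, math.hypot(dx, dy))
             for dx in (-1, 0, 1) for dy in (-1, 0, 1) if dx or dy]
    while heap:
        d, x, y, i = heapq.heappop(heap)
        if d > dist[y][x]:
            continue                      # stale queue entry, pixel already improved
        for dx, dy, cost in steps:
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and d + cost < dist[ny][nx]:
                dist[ny][nx] = d + cost   # the front of seed i gets here first
                owner[ny][nx] = i
                heapq.heappush(heap, (d + cost, nx, ny, i))
    return dist, owner


def boundary_pixels(owner):
    """Pixels whose right or lower neighbour belongs to a different seed."""
    h, w = len(owner), len(owner[0])
    return [(x, y) for y in range(h) for x in range(w)
            if (x + 1 < w and owner[y][x + 1] != owner[y][x])
            or (y + 1 < h and owner[y + 1][x] != owner[y][x])]
```

A front stops growing exactly where another front has already claimed a pixel at a smaller or equal distance, which is the "stop when two boundaries meet" condition.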
Another good starting point (for efficiently computing the distance map) can be "A general algorithm for computing distance transforms in linear time".
From personal experience: Fortune's algorithm is a pain to implement. The divide and conquer algorithm presented by Guibas and Stolfi isn't too bad; they give detailed pseudocode that's easy to transcribe into a procedural programming language. Both will blow up if you have nearly degenerate inputs and use floating point, but since the primitives are quadratic, if you can represent coordinates as 32-bit integers, then you can use 64 bits to carry out the determinant computations.
Once you get it working, you might consider replacing your kd-tree algorithms, which have a Theta(√n) worst case, with algorithms that work on planar subdivisions.
You can find a great implementation of it in the D3.js library: http://mbostock.github.com/d3/ex/voronoi.html
Problem: What is the smallest possible diameter of a circle which covers given N points on a 2D plane?
What is the most efficient algorithm to solve this problem and how does it work?
This is the smallest circle problem. See the references for the links to the suggested algorithms.
E. Welzl, Smallest Enclosing Disks (Balls and Ellipsoids), in H. Maurer (Ed.), New Results and New Trends in Computer Science, Lecture Notes in Computer Science, Vol. 555, Springer-Verlag, 359–37 (1991)
is the reference to the "fastest" algorithm.
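To give an idea of how it works: the points are processed in random order, and whenever a point falls outside the current circle, the circle is rebuilt with that point forced onto its boundary. Below is a minimal sketch of the common iterative, textbook-style formulation of this randomized incremental idea; it is not Welzl's original recursive presentation, it uses plain floating point with an assumed tolerance `eps`, and the helper names are made up.

```python
import math
import random


def circle_from_two(a, b):
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, math.dist(a, b) / 2)


def circle_from_three(a, b, c):
    """Circumcircle of a, b, c, or None if the points are (nearly) collinear."""
    bx, by = b[0] - a[0], b[1] - a[1]
    cx, cy = c[0] - a[0], c[1] - a[1]
    d = 2 * (bx * cy - by * cx)
    if abs(d) < 1e-12:
        return None
    ux = (cy * (bx * bx + by * by) - by * (cx * cx + cy * cy)) / d
    uy = (bx * (cx * cx + cy * cy) - cx * (bx * bx + by * by)) / d
    return (a[0] + ux, a[1] + uy, math.hypot(ux, uy))


def contains(circle, p, eps=1e-9):
    return math.dist((circle[0], circle[1]), p) <= circle[2] + eps


def _with_two(pts, p, q):
    """Smallest circle around pts with p and q on its boundary."""
    circle = circle_from_two(p, q)
    for r in pts:
        if not contains(circle, r):
            candidate = circle_from_three(p, q, r)
            if candidate is not None:
                circle = candidate
    return circle


def _with_one(pts, p):
    """Smallest circle around pts with p on its boundary."""
    circle = (p[0], p[1], 0.0)
    for i, q in enumerate(pts):
        if not contains(circle, q):
            circle = _with_two(pts[:i], p, q)
    return circle


def smallest_enclosing_circle(points):
    """Return (cx, cy, radius); expected linear time thanks to the random order."""
    pts = list(points)
    random.shuffle(pts)
    circle = None
    for i, p in enumerate(pts):
        if circle is None or not contains(circle, p):
            circle = _with_one(pts[:i], p)
    return circle
```

The diameter asked for in the question is twice the returned radius.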
There are several algorithms and implementations out there for the Smallest enclosing balls problem.
For 2D and 3D, Gärtner's implementation is probably the fastest.
For higher dimensions (up to 10,000, say), take a look at https://github.com/hbf/miniball, which is the implementation of an algorithm by Gärtner, Kutz, and Fischer (note: I am one of the co-authors).
For very, very high dimensions, core-set (approximation) algorithms will be faster.
Note: If you are looking for an algorithm to compute the smallest enclosing sphere of spheres, you will find a C++ implementation in the Computational Geometry Algorithms Library (CGAL). (You do not need to use all of CGAL; simply extract the required header and source files.)
The furthest-point Voronoi diagram approach
http://www.dma.fi.upm.es/mabellanas/tfcs/fvd/algorithm.html
turns out to work really well for the 2D problem. It is non-iterative and (I'm pretty sure) guaranteed to be exact. I suspect it doesn't extend so well to higher dimensions, which is why it gets little attention in the literature.
If there is interest, I'll describe it here; the above link is a bit hard to follow, I think.
Edit: another link: http://ojs.statsbiblioteket.dk/index.php/daimipb/article/view/6704