R-Tree construction algorithm using polygon MBR

I cannot seem to find any documentation on how to construct an R-Tree when I have all the known Minimum Bounding Rectangles (MBRs) of the polygons in my set. The R-Tree would be ideal for storing these spatial references to eliminate my current brute-force check for polygon intersection:
for p1 in polygons:                      # O(n)
    for p2 in polygons:                  # O(n)
        if p2 is not p1:                 # O(1)
            if p2.intersects(p1):        # O(1); computed using De Morgan's law on the vertices
                pass                     # do stuff
Does anybody have a reference that describes how to determine the partitioning of the rectangles that encompass the MBRs of the polygons in the set?
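For reference, this is roughly what I'm hoping the index will let me do instead. A minimal sketch, assuming the Python rtree package (libspatialindex bindings) and polygons that expose their MBR as .bounds and an .intersects() test, as Shapely does; any R-Tree with window queries would work the same way:
from rtree import index

idx = index.Index()
for i, p in enumerate(polygons):
    idx.insert(i, p.bounds)                 # insert each polygon's MBR, keyed by its list index

for i, p1 in enumerate(polygons):
    # only polygons whose MBRs overlap p1's MBR come back from the index
    for j in idx.intersection(p1.bounds):
        if j != i and polygons[j].intersects(p1):
            pass                            # do stuff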

There are many R-Tree partitioning algorithms; in my experience the best one is the R*-Tree (R-Star-Tree) by Beckmann et al. Just search for "The R*-tree: an efficient and robust access method for points and rectangles".
If you prefer reading code, there are also many open-source implementations, including my own in Java. Be warned, though: the R*-Tree algorithm is not trivial.
If you are looking for something simpler, try quadtrees or octrees. If insertion and update speed is the top priority, have a look at the PH-Tree (again, my own implementation), but it is also more complicated than a quadtree.
Another simpler option is the AABB-Tree; it's like an R-Tree but with only two bounding boxes per node. I think it's used a lot in computer graphics, but I don't know much about it beyond it being relatively simple as R-Tree variants go.
EDIT (Update to answer comment)
If you are looking for a bulk loading strategy such as STR, here is the original paper. You can have a look at my R-Tree implementation, as it also provides an implementation of an STR-Loader that can handle points and rectangles.
Searching Stack Overflow, I also found this answer, which apparently points to an alternative bulk loader specifically for storing rectangles.
I would like to point out that bulk loading (such as STR loading) is the fastest way to build an R-Tree. However, in my own experiments (see Figure 3 here), this is still 2-3 times slower than a good quadtree or a PH-Tree.
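In rough terms, STR sorts by one axis, cuts the data into tiles, then sorts each tile by the next axis and packs runs into nodes. A minimal sketch for 2-D rectangles (the node capacity M is an arbitrary placeholder; this builds one level at a time, so apply it again to the resulting nodes' MBRs to get the levels above):
import math

def str_pack(rects, M=16):
    # One level of Sort-Tile-Recursive packing.
    # rects: list of MBRs as (minx, miny, maxx, maxy) tuples.
    # Returns a list of nodes; each node is a list of up to M rectangles.
    if not rects:
        return []
    n = len(rects)
    slice_count = math.ceil(math.sqrt(math.ceil(n / M)))
    slice_size = slice_count * M

    # 1. Sort by the x-centre of each MBR and cut into vertical slices.
    by_x = sorted(rects, key=lambda r: (r[0] + r[2]) / 2)
    slices = [by_x[i:i + slice_size] for i in range(0, n, slice_size)]

    # 2. Within each slice, sort by the y-centre and pack runs of M into nodes.
    nodes = []
    for s in slices:
        by_y = sorted(s, key=lambda r: (r[1] + r[3]) / 2)
        nodes.extend(by_y[i:i + M] for i in range(0, len(by_y), M))
    return nodes

def node_mbr(rects):
    # MBR of a group of rectangles, used as the entry for the next level up.
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))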

Related

Fastest k nearest neighbor with arbitrary metric?

The gotcha with this question is "arbitrary metric". If you don't know what that is, it's just a way of measuring the distance between points. (In the "real" world, the 1-dimensional distance is just the absolute magnitude of the difference between two points.)
Enough of the preliminaries. I'm trying to find a fast k nearest neighbor algorithm with these properties:
works on an arbitrary metric
somewhat easy to implement
optimized for finding the distance of a set of points to another set of points
Wikipedia gives a list of algorithms and approaches but nothing on implementation.
UPDATE: the metric is the cosine similarity, which does not satisfy the triangle inequality. However, it seems that I can use the "angular similarity" (as per Wikipedia).
UPDATE: the use case is natural language processing. "Vectors" are the "context" of a given word, represented by binary properties (ex: the title of the document). So while there may be only a few properties (right now I'm just using 3), each vector has arbitrarily large dimension (in the title example, each title in the database would correspond to a dimension in the vector).
UPDATE: For the curious, I'm implementing this algorithm:
http://josquin.cs.depaul.edu/~mramezani/papers/IEEEIS.pdf
UPDATE: The algorithm will need to find nearest neighbors for about a dozen points out of a few hundred points. The average dimension will probably be very large, say 50 (I really don't know yet). And yes, I'm interested in an algorithm, not a library. And yes, estimates are probably good enough.
I would advise you to go for locality-sensitive hashing (LSH), which is popular right now. It reduces the dimensionality of high-dimensional data, but I am not sure whether your dimensionality will work well with that algorithm. See the Wikipedia page for more.
You can use your own metric, and in general you can do that in many algorithms. Hope this helps.
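Since the metric in question is cosine similarity, the usual LSH family is random-hyperplane hashing (the sign of random projections): vectors with a small angle between them agree on most hash bits. A minimal sketch, assuming NumPy; the number of hyperplanes and the stand-in data are arbitrary placeholders:
import numpy as np
from collections import defaultdict

def make_hasher(dim, n_planes=16, seed=0):
    # Each hash bit records which side of a random hyperplane a vector falls on;
    # the probability that two vectors agree on a bit is 1 - angle / pi.
    planes = np.random.default_rng(seed).standard_normal((n_planes, dim))
    return lambda v: tuple(planes @ v >= 0)

# Stand-in data so the sketch runs; replace with your own vectors.
corpus_vectors = np.random.default_rng(1).standard_normal((1000, 50))
query_vector = corpus_vectors[0]

h = make_hasher(dim=50)
buckets = defaultdict(list)
for i, v in enumerate(corpus_vectors):
    buckets[h(v)].append(i)

# Compare the query only against its own bucket, then rank those few
# candidates by exact cosine similarity.
candidates = buckets[h(query_vector)]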
You could also go for RKD trees (a forest of them), but maybe that is too much for now.

dynamic space partitioning tree data structures?

I have an application where I need to do nearest neighbor, rectangle/polygon overlap and other basic computational geometry operations against dynamically changing data (all 2D). I understand the basic data structures in the static case (quadtrees, 2-dimensional kd-trees, R-trees, BSP, etc.) but I want to understand the state of the art in the dynamic case. The difficulty seems to be knowing when/how to balance on insertion and deletion. For example, is there a dynamic data structure that answers k-nearest neighbors against n points in O(log n + k), where insertion and deletion take O(log n) (amortized, maybe)? Is there a standard reference that summarizes what's known about this problem?
To be honest, I haven't done too much with dynamic trees myself (mostly static). But I believe the Bkd-tree paper (early 2000s?) is a good source to start with. I believe it has been referenced many times since then, so you can use sources like ACM/CiteSeer to track newer papers that cite it. Side note: I think there is public code available for Bkd-trees, so you can play with it without investing too much time and see if it works for you.
Bkd-Tree: A Dynamic Scalable kd-Tree
Octavian Procopiuc, Pankaj K. Agarwal, Lars Arge, and Jeffrey Scott Vitter
You can try a monster curve (space-filling curve). A fast approach is to simply interleave the bits of the x and y coordinates, which gives the Z-order (Morton) key. It is also possible for 3 dimensions.
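A minimal sketch of the bit interleaving for 2-D non-negative integer coordinates:
def morton_2d(x, y, bits=16):
    # Interleave the bits of x and y (x in the even positions, y in the odd ones).
    # Nearby points mostly get nearby keys, so you can sort by the key and use a
    # plain B-tree or sorted array as the spatial index.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# For 3 dimensions, interleave three bit streams (x, y, z) the same way.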

k-NN search in HUGE dimensions (~100,000)

Are there any articles about the k-NN search problem for a really huge number of dimensions, like 10k-100k?
Most articles with tests on real-world data operate on 10-50 dimensions, and a few operate on 100-500.
In my case there are ~10^9 points in a ~100k-dimensional feature space, and there is no way to effectively reduce the number of dimensions.
UPD.:
At the moment we are trying to adapt and implement VP-trees, but it's clear enough that no tree structure will work well at this dimensionality.
The second approach is LSH, but accuracy may suffer badly depending on the data distribution.
Take a look at the FLANN library.
In this paper you will find a discussion of how data dimensionality is one of the factors with the greatest impact on nearest neighbor matching performance, along with the solutions adopted in FLANN.
Are you using a kd-tree for nearest neighbour search? A kd-tree deteriorates to almost exhaustive search in higher dimensions.
In higher dimensions, it is usually suggested to use approximate nearest neighbour search. Here is the link to the original paper: http://cvs.cs.umd.edu/~mount/Papers/dist.pdf, and if that is a bit too heavy, try this: dimacs.rutgers.edu/Workshops/MiningTutorial/pindyk-slides.ppt
Many factors affect the choice of approach for nearest neighbour search. Whether you need to hold the points entirely in primary memory or can fall back on secondary memory should also govern your decision.

kNN with dynamic insertions in high-dim space

I am looking for a method to do fast nearest neighbour search (hopefully O(log n)) for high-dimensional points (typically ~11-13 dimensions). I would like it to behave optimally during insertions after the structure has been initialized. A kd-tree came to mind, but if you do dynamic insertions instead of bulk loading, the kd-tree ceases to be balanced, and AFAIK rebalancing is an expensive operation.
So I wanted to know what data structures you would prefer for this kind of setting: high-dimensional points, with insertions and nearest neighbour queries.
Another data structure that comes to mind is the cover tree. Unlike kd-trees, which were originally developed to answer range queries, this data structure is optimal for nearest neighbor queries. It has been used in n-body problems that involve computing the k nearest neighbors of all the data points. Such problems also occur in density estimation schemes (Parzen windows).
I don't know enough about your specific problem, but I do know that there are online versions of this data structure. Check out Alexander Gray's page and this link.
The Curse of Dimensionality gets in the way here. You might consider applying Principal Component Analysis (PCA) to reduce the dimensionality, but as far as I know, nobody has a great answer for this.
I have dealt with this type of problem before (in audio and video fingerprinting), sometimes with up to 30 dimensions. Analysis usually revealed that some of the dimensions did not contain relevant information for searches (actually fuzzy searches, my main goal), so I omitted them from the index structures used to access the data, but included them in the logic to determine matches from a list of candidates found during the search. This effectively reduced the dimensionality to a tractable level.
I simplified things further by quantizing the remaining dimensions severely, such that the entire multidimensional space was mapped into a 32-bit integer. I used this as the key in an STL map (a red-black tree), though I could have used a hash table. I was able to add millions of records dynamically to such a structure (RAM-based, of course) in about a minute or two, and searches took about a millisecond on average, though the data was by no means evenly distributed. Searches required careful enumeration of values in the dimensions that were mapped into the 32-bit key, but were reliable enough to use in a commercial product. I believe it is used to this day in iTunes Match, if my sources are correct. :)
The bottom line is that I recommend you take a look at your data and do something custom that exploits features in it to make for fast indexing and searching. Find the dimensions that vary the most and are the most independent of each other. Quantize those and use them as the key in an index. Each bucket in the index contains all items that share that key (there will likely be more than one). To find nearest neighbors, look at "nearby" keys and within each bucket, look for nearby values. Good luck.
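To make that concrete, here is a stripped-down sketch of a quantize-and-pack index; the bit width, the number of kept dimensions and the neighbour enumeration are placeholders of my own, not the exact scheme from the answer above:
from collections import defaultdict
from itertools import product

BITS = 8                                   # placeholder: 4 kept dimensions x 8 bits = 32-bit key

def quantize(value, lo, hi, bits=BITS):
    # Map a float in [lo, hi] onto an integer cell in [0, 2**bits - 1].
    cell = int((value - lo) / (hi - lo) * (1 << bits))
    return min((1 << bits) - 1, max(0, cell))

def make_key(point, ranges):
    # Pack the quantized dimensions into a single integer key.
    key = 0
    for v, (lo, hi) in zip(point, ranges):
        key = (key << BITS) | quantize(v, lo, hi)
    return key

index = defaultdict(list)                  # key -> bucket of items sharing that key

def insert(item, point, ranges):
    index[make_key(point, ranges)].append(item)

def candidates(point, ranges, radius=1):
    # Enumerate "nearby" keys by perturbing each quantized cell by +/- radius;
    # rank whatever is in those buckets with the exact distance afterwards.
    cells = [quantize(v, lo, hi) for v, (lo, hi) in zip(point, ranges)]
    for combo in product(*[range(max(0, c - radius), min((1 << BITS) - 1, c + radius) + 1)
                           for c in cells]):
        key = 0
        for c in combo:
            key = (key << BITS) | c
        yield from index.get(key, ())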
p.s. I wrote a paper on my technique, available here. Sorry about the paywall. Perhaps you can find a free copy elsewhere. Let me know if you have any questions about it.
If you use a bucket kd-tree with a reasonably large bucket size, the tree gets a better idea of where to split when the leaves fill up. The guys in Robocode do this under extremely harsh time constraints, with random insertions happening on the fly and kNN with k > 80, d > 10 and n > 30k in under 1 ms. Check out this kD-Tree Tutorial, which explains a bunch of kD-Tree enhancements and how to implement them.
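A minimal sketch of the bucket idea (capacity and split rule chosen arbitrarily; real implementations such as the Robocode trees add many refinements): leaves just collect points, and only when a leaf overflows do you inspect its contents to choose the split dimension and value.
class BucketKDNode:
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.points = []                       # leaf storage; None once the node has split
        self.split_dim = self.split_val = None
        self.left = self.right = None

    def insert(self, p):
        if self.points is not None:            # still a leaf: just buffer the point
            self.points.append(p)
            if len(self.points) > self.capacity:
                self._split()
        elif p[self.split_dim] <= self.split_val:
            self.left.insert(p)
        else:
            self.right.insert(p)

    def _split(self):
        # Pick the dimension with the largest spread in this bucket and cut at its
        # median; this data-dependent choice is what deferring the split buys you.
        dims = range(len(self.points[0]))
        self.split_dim = max(dims, key=lambda d: max(p[d] for p in self.points)
                                               - min(p[d] for p in self.points))
        self.points.sort(key=lambda p: p[self.split_dim])
        mid = len(self.points) // 2
        self.split_val = self.points[mid][self.split_dim]
        self.left, self.right = BucketKDNode(self.capacity), BucketKDNode(self.capacity)
        self.left.points, self.right.points = self.points[:mid], self.points[mid:]
        self.points = None                     # this node is now internal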
In my experience, 11-13 dimensions is not too bad -- if you bulk-load. Both bulk-loaded R-trees (in contrast to k-d-trees, these remain balanced!) and k-d-trees should still work much better than linear scanning.
Once you go fully dynamic, my experiences are much worse. Roughly: with bulk loaded trees I'm seeing 20x speedups, with incrementally built R-trees just 7x. So it does pay off to frequently rebuild the tree. And depending on how you organize your data, it may be much faster than you think. The bulk load for the k-d-tree that I'm using is O(n log n), and I read that there is a O(n log log n) variant, too. With a low constant factor. For the R-tree, Sort-Tile-Recursive is the best bulk load I have seen so far, and also O(n log n) with a low constant factor.
So yes, at high dimensionality I would consider just rebuilding the tree from time to time.
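For completeness, the simplest balanced bulk load looks something like this sketch; note that re-sorting at every level makes it O(n log^2 n), while the O(n log n) variants presort once per dimension or use linear-time median selection:
def build_kdtree(points, depth=0):
    # Recursive median-split bulk load: the resulting tree is balanced, which is
    # exactly what incremental insertion loses over time.
    if not points:
        return None
    d = depth % len(points[0])                 # cycle through the dimensions
    points = sorted(points, key=lambda p: p[d])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "dim": d,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }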

Hilbert tree: Does anyone know where to find a code implementation of this?

I'm looking for the code or even a visual demo on how this tree works.
I have read this paper on Hilbert R-Trees and tried to implement the algorithms stated there.
I get lost when I have to adjust the tree with the sets, and I am not sure about most of the other steps either.
The language doesn't matter; if there is an implementation I will use it as a reference to build a C# version.
It's for moving points with boundaries, and it needs very fast insert and update calls.
Try these links for an R-Tree demo:
http://gis.umb.no/gis/applets/rtree2/jdk1.1/
http://gist.cs.berkeley.edu/libgist-2.0/amdb_demo.html
http://donar.umiacs.umd.edu/quadtree/points/rtrees.html
cheers
If you are looking for a Hilbert tree, this might help:
https://code.google.com/p/uzaygezen/
If you need a spatial index with fast delete/insert capabilities, have a look at the PH-tree. It is partly based on quadtrees but is faster and more space-efficient.
By the way, the Hilbert curve is a space-filling curve. The PH-Tree also uses a space-filling curve internally; however, it does not use the Hilbert curve but the z-curve (Morton order), which is much easier to compute.
