R-trees - basics of the algorithm

I am trying to understand the basics of the R-tree algorithm, and I am trying to figure out how it performs a search for e.g. all restaurants within 1 km. We would have all objects stored as rectangles in our database; we would then (probably) build a query rectangle based on our current position and find all rectangles that overlap with it. Would we then scan through the results to find the ones of interest, i.e. only the objects which are restaurants?

Yes, this is basically how range queries on R-trees work: if a rectangle overlaps with your query region, expand it (look at the contents, rectangles or points). Otherwise, ignore it. Overlap testing is simple for rectangle-to-rectangle, and for spherical queries you need to compute the minimum distance of the sphere center to the rectangle ("minDist").
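To make the expand-or-ignore rule concrete, here is a minimal sketch of a window query and a radius query over an already built tree. The `Rect` and `Node` classes and the function names are hypothetical, made up for this illustration; they are not taken from any particular R-tree library, and the tree is assumed to be built elsewhere.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def overlaps(self, other: "Rect") -> bool:
        # Two rectangles overlap unless one lies entirely to one side of the other.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)

    def min_dist(self, x: float, y: float) -> float:
        # Smallest possible distance from (x, y) to any location in the rectangle.
        dx = max(self.xmin - x, 0.0, x - self.xmax)
        dy = max(self.ymin - y, 0.0, y - self.ymax)
        return hypot(dx, dy)


@dataclass
class Node:
    mbr: Rect                                     # bounding rectangle of everything below
    children: list = field(default_factory=list)  # child Nodes (inner node)
    points: list = field(default_factory=list)    # (x, y, payload) tuples (leaf)


def window_query(node: Node, query: Rect, out: list) -> list:
    if not node.mbr.overlaps(query):
        return out                                # no overlap: ignore the whole subtree
    for x, y, payload in node.points:
        if query.xmin <= x <= query.xmax and query.ymin <= y <= query.ymax:
            out.append(payload)
    for child in node.children:
        window_query(child, query, out)           # overlap: expand
    return out


def radius_query(node: Node, cx: float, cy: float, r: float, out: list) -> list:
    if node.mbr.min_dist(cx, cy) > r:
        return out                                # rectangle is entirely outside the circle
    for x, y, payload in node.points:
        if hypot(x - cx, y - cy) <= r:
            out.append(payload)
    for child in node.children:
        radius_query(child, cx, cy, r, out)
    return out
```

Filtering for "is a restaurant" would then be a plain scan over the returned payloads.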
k nearest neighbor queries are a bit more tricky; here you need a priority queue. Always expand the best candidate (by "minDist"), until you have found k objects that are closer than the next rectangle's "minDist".
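A minimal sketch of that best-first search, assuming a tree given as nested dicts with the hypothetical keys `"mbr"`, `"children"` and `"points"` (this layout is invented for the example, not any library's format). Rectangles and points share one priority queue ordered by distance, so a point that reaches the front of the queue is known to be closer than every rectangle still waiting.

```python
import heapq
from itertools import count
from math import hypot


def min_dist(mbr, x, y):
    # Smallest possible distance from (x, y) to the rectangle (xmin, ymin, xmax, ymax).
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return hypot(dx, dy)


def knn(root, qx, qy, k):
    """Return up to k (distance, payload) pairs, nearest first."""
    tie = count()  # unique tie-breaker so the heap never has to compare nodes
    heap = [(min_dist(root["mbr"], qx, qy), next(tie), False, root)]
    result = []
    while heap and len(result) < k:
        dist, _, is_point, item = heapq.heappop(heap)
        if is_point:
            # Nothing left in the queue can be closer than this point.
            result.append((dist, item))
            continue
        # Expand the most promising rectangle.
        for child in item.get("children", []):
            heapq.heappush(heap, (min_dist(child["mbr"], qx, qy), next(tie), False, child))
        for x, y, payload in item.get("points", []):
            heapq.heappush(heap, (hypot(x - qx, y - qy), next(tie), True, payload))
    return result
```

Putting the points into the same queue as the rectangles is one design choice among several; keeping a separate bounded list of the k best candidates works too.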
Since you can't really index the "is a restaurant" property, you'll have to either build an r-tree containing restaurants only, or filter the results by the restaurant property.
(This also is how it is done e.g. in SQLite; the spatial part is indexed with an R-tree, while the restaurant property is e.g. obtained via a join or a bitmap index)
The tricky part of an R-tree is not the query, but how to build it. There are very simple but good methods for bulk loading point data (STR), but for an online database you need somewhat tricky methods. R*-trees outperform classic R-trees significantly in my experience; the reinsertions used by R*-trees are in particular tricky to implement in a real DBMS. An interesting tradeoff is to just use insert and split from R*, but not the reinsertions. On the query side, there is no difference between R and R* anyway.
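For illustration, a sketch of the leaf-level packing step of STR (sort-tile-recursive) for 2D points: sort by x, cut into vertical slices, sort each slice by y, and fill leaves in that order. The function name and the `capacity` parameter are assumptions for this example; a full bulk loader would repeat the same step on the leaf rectangles to build the upper levels.

```python
from math import ceil, sqrt


def str_pack_leaves(points, capacity):
    """Group (x, y) points into leaves of at most `capacity` entries, STR style."""
    n = len(points)
    if n == 0:
        return []
    leaf_count = ceil(n / capacity)          # number of leaves needed
    slice_count = ceil(sqrt(leaf_count))     # number of vertical slices
    slice_size = slice_count * capacity      # points per vertical slice

    by_x = sorted(points, key=lambda p: p[0])
    leaves = []
    for i in range(0, n, slice_size):
        # Within each vertical slice, sort by y and cut into leaves.
        vertical_slice = sorted(by_x[i:i + slice_size], key=lambda p: p[1])
        for j in range(0, len(vertical_slice), capacity):
            leaves.append(vertical_slice[j:j + capacity])
    return leaves
```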
kd-trees: They are related to r-trees, but have some key differences: first of all, they are not designed for disk storage, but in-memory operation only. Secondly, they are not meant to be updated (they are not balanced trees), but if you have changes you will have to rebuild them again every now and then to keep the performance good. So in some cases they will perform very well (and they are fairly simple to implement), but once you get to large data and on-disk they are much more painful. Furthermore, they do not allow for different loading strategies.

Related

Balanced Tree structure for storing and updating 3D points

I'm on an odyssey to find a good tree structure to store and update my application data.
The data are positions in 3 dimensions (x, y, z)
They need to be able to be updated and queried by range quickly (every 30 milliseconds). The queries would be, for example: "get all the points around (2,3,4) in a radius of 100cm"
The data is always in internal memory.
Could someone recommend a good type of tree that meets these requirements?
KD-Trees wouldn't work for me because they are not made to be updated at this speed. I would have to rebuild them entirely on every update.
BKD-Trees wouldn't work for me either because they are made to store data on disk (not in internal memory).
Apparently the R-Trees are also designed to store the data in the leaves.
If you need fast updates as well as range queries, in-memory, I can recommend either a grid index or the PH-tree.
A grid index is essentially a 2D/3D array of buckets. The grid is laid over your data space and you just store your data in the bucket (=grid cell) where your point is. For range queries you just check all entries in all buckets that overlap with your query range.
It takes a bit of trial and error to find the best grid size.
In my experience this is the best solution in 2D with 1000 points or less. I have no experience with 3D grid indexes.
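A minimal sketch of such a grid index for 3D points, using a dict of buckets instead of a dense array so the data space does not need to be bounded. The class and method names are hypothetical, and the cell size is exactly the parameter that needs the trial and error mentioned above.

```python
from collections import defaultdict
from math import dist, floor


class GridIndex3D:
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(dict)  # cell coordinates -> {point id: (x, y, z)}
        self.where = {}                 # point id -> cell it currently lives in

    def _cell(self, p):
        return tuple(floor(c / self.cell_size) for c in p)

    def remove(self, pid):
        cell = self.where.pop(pid, None)
        if cell is not None:
            del self.cells[cell][pid]
            if not self.cells[cell]:
                del self.cells[cell]    # drop empty buckets

    def upsert(self, pid, p):
        # An update is just a removal from the old bucket plus an insert into the new one.
        self.remove(pid)
        cell = self._cell(p)
        self.cells[cell][pid] = p
        self.where[pid] = cell

    def radius_query(self, center, r):
        lo = self._cell(tuple(c - r for c in center))
        hi = self._cell(tuple(c + r for c in center))
        hits = []
        # Visit every bucket that overlaps the bounding cube of the query sphere.
        for ix in range(lo[0], hi[0] + 1):
            for iy in range(lo[1], hi[1] + 1):
                for iz in range(lo[2], hi[2] + 1):
                    bucket = self.cells.get((ix, iy, iz))
                    if bucket:
                        hits.extend(pid for pid, p in bucket.items()
                                    if dist(p, center) <= r)
        return hits
```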
For larger datasets I recommend the PH-tree (disclaimer: self advertisement). Updates are much faster than with R-trees, and deletion is as fast as insertion. There is no rebalancing (as happens with R-trees or some kd-trees), so insertion/deletion times are quite predictable (rebalancing is neither needed nor possible; imbalance is inherently limited).
Range queries (= window queries) are a bit slower than R-trees, but the difference almost disappears for very small ranges (windows).
It is available in Java and C++.

Spatial partition data structure that is better suited for a placement system than a quadtree

I want to know if there is a spatial partition data structure that is better suited for a placement system than a quadtree. By better suited I mean that the data structure has O(log n) time complexity or better for search queries and uses less memory. I want to know what data structure can organize my data in such a way that querying it is faster than a quadtree. It's all 2D and it's all rectangles which should never overlap. I currently have a quadtree done and it works great and it's fast; I am just curious to know if there is a data structure that uses fewer resources and is faster than a quadtree for this case.
The fastest is probably brute forcing it on a GPU.
Also, it is really worth trying out different implementations, I found performance differences between implementations to be absolutely wild.
Another tip: measure performance with realistic data (potentially multiple scenarios), data and usage characteristics can have enormous influence on index performance.
Some of these characteristics are (you already mentioned "rectangle data" and "2D"):
How big is your dataset?
How much overlap do you have between rectangles?
Do you need to update data often?
Do you have a large variance between small and large rectangles?
Do you have dense clusters of rectangles?
How large is the area you cover?
Are your coordinates integers or floats?
Is it okay if the execution time of operations varies or should it be consistent?
Can you pre-load data? Do you need to update the index?
Quadtrees can be a good initial choice. However they have some problems, e.g.:
They can get very deep (and inefficient) with dense clusters
They don't work very well when there is a lot of overlap between rectangles
Update operations may take longer if nodes are merged or split.
Another popular choice are R-Trees (I found R-star-Trees to be the best). Some properties:
Balanced (good for predictable search time but bad because update times can be very unpredictable due to rebalancing)
Quite complex to implement.
R-Trees can also be preloaded (takes longer but allows queries to be faster), this is called STR-Tree (Sort-tile-recurse-R-Tree)
It may be worth looking at the PH-Tree (disclaimer: self advertisement):
Similar to a quadtree but depth is limited to the bit-width of the data (usually 32 or 64 (bits)).
No rebalancing. Merging or splitting is guaranteed to move only one entry (=cheap)
Prefers integer coordinates but works reasonably well with floating point data as well.
Implementations can be quite space efficient (they don't need to store all bits of the coordinates). However, not all implementations support that. Also, the effect varies and is strongest with integer coordinates.
I made some measurements here. The measurements include a 2D dataset where I store line segments from OpenStreetMap as boxes, the relevant diagrams are labeled with "OSM-R" (R for rectangles).
Fig. 3a shows timings for inserting a given amount of data into a tree
Fig. 9a shows memory usage
Fig. 15a shows query times for queries that return on average 1000 entries
Fig. 17a shows how query performance changes when varying the query window size (on an index with 1M entries)
Fig. 41a shows average times for updating an index with 1M entries
PH/PHM is the PH-Tree, PHM has coordinates converted to integer before storing them
RSZ/RSS are two different R-Tree implementations
STR is an STR-Tree
Q(T)Z is a quadtree
In case you are using Java, have a look at my spatial index collection.
Similar collections exist for other programming languages.

Algorithm for 2D nearest-neighbour queries with dynamic points

I am trying to find a fast algorithm for finding the (approximate, if need be) nearest neighbours of a given point in a two-dimensional space where points are frequently removed from the dataset and new points are added.
(Relatedly, there are two variants of this problem that interest me: one in which points can be thought of as being added and removed randomly and another in which all the points are in constant motion.)
Some thoughts:
kd-trees offer good performance, but are only suitable for static point sets
R*-trees seem to offer good performance for a variety of dimensions, but the generality of their design (arbitrary dimensions, general content geometries) suggests the possibility that a more specific algorithm might offer performance advantages
Algorithms with existing implementations are preferable (though this is not necessary)
What's a good choice here?
I agree with (almost) everything that @gsamaras said, just to add a few things:
In my experience (using large datasets with >= 500,000 points), kNN-performance of KD-Trees is worse than pretty much any other spatial index by a factor of 10 to 100. I tested them (2 KD-trees and various other indexes) on a large OpenStreetMap dataset. In the following diagram, the KD-Trees are called KDL and KDS, the 2D dataset is called OSM-P (left diagram): The diagram is taken from this document, see bullet points below for more information.
This research describes an indexing method for moving objects, in case you keep (re-)inserting the same points in slightly different positions.
Quadtrees are not too bad either, they can be very fast in 2D, with excellent kNN performance for datasets < 1,000,000 entries.
If you are looking for Java implementations, have a look at my index library. It has implementations of quadtrees, R-star-tree, ph-tree, and others, all with a common API that also supports kNN. The library was written for TinSpin, which is a framework for testing multidimensional indexes. Some results can be found here (it doesn't really describe the test data, but 'OSM-P' results are based on OpenStreetMap data with up to 50,000,000 2D points).
Depending on your scenario, you may also want to consider PH-Trees. They appear to be slower for kNN-queries than R-Trees in low dimensionality (though still faster than KD-Trees), but they are faster for removal and updates than RTrees. If you have a lot of removal/insertion, this may be a better choice (see the TinSpin results, Figures 2 and 46). C++ versions are available here and here.
Check the Bkd-Tree, which is:
an I/O-efficient dynamic data structure based on the kd-tree. [..] the Bkd-tree maintains its high space utilization and excellent query and update performance regardless of the number of updates performed on it.
However this data structure is multi dimensional, and not specialized to lower dimensions (like the kd-tree).
Play with it in bkdtree.
Dynamic Quadtrees can also be a candidate, with O(log n) query time and O(Q(n)) insertion/deletion time, where Q(n) is the time to perform a query in the data structure used. Note that this data structure is specialized for 2D. For 3D, however, we have octrees, and in a similar way the structure can be generalized for higher dimensions.
An implementation is QuadTree.
R*-tree is another choice, but I agree with you on the generality. An r-star-tree implementation exists too.
A Cover tree could be considered as well, but I am not sure if it fits your description. Read more here, and check the implementation on CoverTree.
Kd-tree should still be considered, since its performance is remarkable in 2 dimensions, and its insertion complexity is logarithmic in size.
nanoflann and CGAL are just two implementations of it, where the first requires no install and the second does, but may be more performant.
In any case, I would try more than one approach and benchmark (since all of them have implementations and these data structures are usually affected by the nature of your data).

Efficient data structure for quality threshold clustering algorithm

I'm trying to implement the quality threshold clustering algorithm. The outline of it (taken from here) is listed below:
Initialize the threshold distance allowed for clusters and the minimum cluster size
Build a candidate cluster for each data point by including the closest point, the next closest, and so on, until the distance of the cluster surpasses the threshold
Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration
Repeat with the reduced set of points until no more clusters can be formed having the minimum cluster size
I've been reading up on some nearest neighbor search algorithms and space partitioning data structures, as they seem to be the kind of thing I need, but I cannot determine which one to use or if I'm supposed to be looking at something else.
I want to implement the data structure myself for educational purposes, and I need one that can successively return the nearest points for some point. However, since I don't know the number of times I need to query (i.e. until the threshold is exceeded), I can't use k-nearest neighbor algorithms. I've been looking mostly at quadtrees and k-d trees.
Additionally, since the algorithm constantly builds new candidate clusters, it would be interesting to use a modified data structure that uses cached information to speed up subsequent queries (but also taking point removal into account).
This algorithm sounds like a predecessor of DBSCAN (Wikipedia), which is known to work very well with R*-Tree indexes (Wikipedia). But of course, kd-trees are also an option. The main difference between these two is that R*-trees are meant for database use - they support online insertions and deletions very well, and are block oriented - while kd-trees are more of an in-memory data structure based on binary splits. R*-trees perform rebalancing, while kd-trees will slowly become unbalanced and will need to be rebuilt.
I find nearest neighbor search in R*-trees much more understandable than in k-d-trees, because the bounding rectangles are very intuitive.
DBSCAN also "removes" points from further consideration, but simply by marking them as already assigned. That way you don't need to update the index, and it's sufficient to bulk-load it once in the beginning. You should be able to do this for QT, too. So unless I'm mistaken, you can get the QT clustering efficiently by running DBSCAN with epsilon set to the QT clustering threshold and minPts=2 (although one would prefer higher values in proper DBSCAN).
There are a number of DBSCAN implementations around. The one in Weka is exceptionally crappy, so stay away from it. The fpc implementation in R is okay, but could still be a lot faster. ELKI seems to be the only one with full index support, and the speed difference is massive. Their benchmark shows a 12x speed gain by using an index on this data set, allowing them to cluster in 50 seconds instead of 603 (without index). Weka took an incredible 37917 seconds, and R fpc took 4339 there. That aligns with my experience: Weka has the reputation of being quite slow, and R only kicks ass at vectorized operations; once the R interpreter has to work, it is significantly slower than anything native. But it is a good example of how differently the same algorithm can perform when it is implemented by different people. I would have expected this to be 2x-5x, but apparently the differences can easily be 50x from one programmer implementing the same algorithm to another.

How to subdivide a 2d game world for better collision detection

I'm developing a game which features a sizeable square 2d playing area. The gaming area is tileless with bounded sides (no wrapping around). I am trying to figure out how I can best divide up this world to increase the performance of collision detection. Rather than checking each entity for collision with all other entities I want to only check nearby entities for collision and obstacle avoidance.
I have a few special concerns for this game world...
I want to be able to use a large number of entities in the game world at once. However, a % of entities won't collide with entities of the same type. For example, projectiles won't collide with other projectiles.
I want to be able to use a large range of entity sizes. I want there to be a very large size difference between the smallest entities and the largest.
There are very few static or non-moving entities in the game world.
I'm interested in using something similar to what's described in the answer here: Quadtree vs Red-Black tree for a game in C++?
My concern is how well will a tree subdivision of the world be able to handle large size differences in entities? To divide the world up enough for the smaller entities the larger ones will need to occupy a large number of regions and I'm concerned about how that will affect the performance of the system.
My other major concern is how to properly keep the list of occupied areas up to date. Since there's a lot of moving entities, and some very large ones, it seems like dividing the world up will create a significant amount of overhead for keeping track of which entities occupy which regions.
I'm mostly looking for any good algorithms or ideas that will help reduce the number of collision detection and obstacle avoidance calculations.
If I were you I'd start off by implementing a simple BSP (binary space partition) tree. Since you are working in 2D, bounding box checks are really fast. You basically need three classes: CBspTree, CBspNode and CBspCut (not really needed)
CBspTree has one root node instance of class CBspNode
CBspNode has an instance of CBspCut
CBspCut symbolizes how you cut a set into two disjoint sets. This can neatly be solved by introducing polymorphism (e.g. CBspCutX or CBspCutY or some other cutting line). CBspCut also has two CBspNode children
The interface towards the divided world will be through the tree class, and it can be a really good idea to create one more layer on top of that, in case you would like to replace the BSP solution with e.g. a quad tree once you're getting the hang of it. But in my experience, a BSP will do just fine.
There are different strategies for how to store your items in the tree. What I mean by that is that you can choose to have e.g. some kind of container in each node that contains references to the objects occupying that area. This means though (as you are asking yourself) that large items will occupy many leaves, i.e. there will be many references to large objects, and very small items will show up at single leaves.
In my experience this doesn't have that large impact. Of course it matters, but you'd have to do some testing to check if it's really an issue or not. You would be able to get around this by simply leaving those items at branched nodes in the tree, i.e. you will not store them on "leaf level". This means you will find those objects quick while traversing down the tree.
As for your first question: if you are only going to use this subdivision for collision testing and nothing else, I suggest that things that can never collide are never inserted into the tree. A missile, for example, as you say, can't collide with another missile, which means that you don't even have to store the missile in the tree.
However, you might want to use the bsp for other things as well; you didn't specify that, but keep that in mind (for picking objects with e.g. the mouse). Otherwise I propose that you store everything in the bsp, and resolve the collision later on. Just ask the bsp for a list of objects in a certain area to get a limited set of possible collision candidates and perform the check after that (assuming objects know what they can collide with, or some other external mechanism).
If you want to speed things up, you also need to take care of merge and split, i.e. when things are removed from the tree, a lot of nodes will become empty or the number of items below some node level will decrease below some merge threshold. Then you want to merge two subtrees into one node containing all items. Splitting happens when you insert items into the world. So when the number of items exceeds some splitting threshold you introduce a new cut, which splits the world in two. These merge and split thresholds should be two constants that you can use to tune the efficiency of the tree.
Merge and split are mainly used to keep the tree balanced and to make sure that it works as efficiently as it can according to its specifications. This is really what you need to worry about. Moving things from one location and thus updating the tree is imo fast. But when it comes to merging and splitting it might become expensive if you do it too often.
This can be avoided by introducing some kind of lazy merge and split system, i.e. you have some kind of dirty flagging or modify count. Batch up all operations that can be batched, i.e. moving 10 objects and inserting 5 might be one batch. Once that batch of operations is finished, you check if the tree is dirty and then you do the needed merge and/or split operations.
Post some comments if you want me to explain further.
Cheers !
Edit
There are many things that can be optimized in the tree. But as you know, premature optimization is the root of all evil. So start off simple. For example, you might create some generic callback system that you can use while traversing the tree. This way you don't have to query the tree to get a list of objects that matched the bounding box "question"; instead you can just traverse down the tree and execute that callback each time you hit something. "If this bounding box I'm providing intersects you, then execute this callback with these parameters"
You most definitely want to check this list of collision detection resources from gamedev.net out. It's full of resources with game development conventions.
For other than collision detection only, check their entire list of articles and resources.
My concern is how well will a tree subdivision of the world be able to handle large size differences in entities? To divide the world up enough for the smaller entities the larger ones will need to occupy a large number of regions and I'm concerned about how that will affect the performance of the system.
Use a quad tree. For objects that exist in multiple areas you have a few options:
Store the object in both branches, all the way down. Everything ends up in leaf nodes but you may end up with a significant number of extra pointers. May be appropriate for static things.
Split the object on the zone border and insert each part in their respective locations. Creates a lot of pain and isn't well defined for a lot of objects.
Store the object at the lowest point in the tree you can. Sets of objects now exist in leaf and non-leaf nodes, but each object has one pointer to it in the tree. Probably best for objects that are going to move.
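A minimal sketch of the third option, storing each object once at the deepest node whose quadrant fully contains its bounding box. The `QuadNode` class, its fields, and the `max_depth` default are hypothetical choices for this illustration; boxes are `(xmin, ymin, xmax, ymax)` tuples.

```python
class QuadNode:
    def __init__(self, x, y, size, depth=0, max_depth=6):
        self.x, self.y, self.size = x, y, size
        self.depth, self.max_depth = depth, max_depth
        self.items = []       # objects that straddle a child boundary (or sit at max depth)
        self.children = None  # four child quadrants, created on demand

    def _child_for(self, box):
        """Index (0..3) of the child quadrant that fully contains box, or None."""
        xmin, ymin, xmax, ymax = box
        half = self.size / 2
        mx, my = self.x + half, self.y + half
        if xmax < mx:
            col = 0
        elif xmin >= mx:
            col = 1
        else:
            return None       # straddles the vertical split line
        if ymax < my:
            row = 0
        elif ymin >= my:
            row = 1
        else:
            return None       # straddles the horizontal split line
        return row * 2 + col

    def insert(self, obj, box):
        """Store obj in exactly one node and return that node."""
        idx = self._child_for(box) if self.depth < self.max_depth else None
        if idx is None:
            self.items.append((obj, box))
            return self
        if self.children is None:
            half = self.size / 2
            self.children = [
                QuadNode(self.x + cx * half, self.y + cy * half, half,
                         self.depth + 1, self.max_depth)
                for cy in (0, 1) for cx in (0, 1)
            ]
        return self.children[idx].insert(obj, box)

    def query(self, box, out):
        """Collect objects whose boxes overlap box, checking inner nodes on the way down."""
        for obj, b in self.items:
            if b[0] <= box[2] and box[0] <= b[2] and b[1] <= box[3] and box[1] <= b[3]:
                out.append(obj)
        for child in self.children or []:
            if (child.x <= box[2] and box[0] <= child.x + child.size and
                    child.y <= box[3] and box[1] <= child.y + child.size):
                child.query(box, out)
        return out
```

Because `insert` returns the node that holds the object, a moving entity can keep that reference and only needs re-insertion when it leaves the node's quadrant.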
By the way, the reason you're using a quad tree is because it's really really easy to work with. You don't have any heuristic based creation like you might with some BSP implementations. It's simple and it gets the job done.
My other major concern is how to properly keep the list of occupied areas up to date. Since there's a lot of moving entities, and some very large ones, it seems like dividing the world up will create a significant amount of overhead for keeping track of which entities occupy which regions.
There will be overhead to keeping your entities in the correct spots in the tree every time they move, yes, and it can be significant. But the whole point is that you're doing much much less work in your collision code. Even though you're adding some overhead with the tree traversal and update it should be much smaller than the overhead you just removed by using the tree at all.
Obviously depending on the number of objects, size of game world, etc etc the trade off might not be worth it. Usually it turns out to be a win, but it's hard to know without doing it.
There are lots of approaches. I'd recommend setting some specific goals (e.g., x collision tests per second with a ratio of y between smallest to largest entities), and doing some prototyping to find the simplest approach that achieves those goals. You might be surprised how little work you have to do to get what you need. (Or it might be a ton of work, depending on your particulars.)
Many acceleration structures (e.g., a good BSP) can take a while to set up and thus are generally inappropriate for rapid animation.
There's a lot of literature out there on this topic, so spend some time searching and researching to come up with a list of candidate approaches. Mock them up and profile.
I'd be tempted just to overlay a coarse grid over the play area to form a 2D hash. If each grid square is at least the size of the largest entity then you only ever have 9 grid squares to check for collisions, and it's a lot simpler than managing quad-trees or arbitrary BSP trees. The overhead of determining which coarse grid square you're in is typically just 2 arithmetic operations, and when a change is detected the grid just has to remove one reference/ID/pointer from one square's list and add the same to another square.
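A rough sketch of that coarse-grid idea as a broad phase: hash each entity's centre into a square and pair it only with entities in the same or the 8 neighbouring squares. The function names and the dict-of-centres input are assumptions for this example, and it relies on the cell size being at least as large as the largest entity.

```python
from collections import defaultdict


def cell_of(x, y, cell_size):
    # One division per axis picks the grid square.
    return (int(x // cell_size), int(y // cell_size))


def candidate_pairs(entities, cell_size):
    """entities maps a comparable id to an (x, y) centre.

    Returns the set of (id_a, id_b) pairs, id_a < id_b, whose centres fall in
    the same or adjacent grid squares; only these need an exact collision test.
    """
    grid = defaultdict(list)
    for eid, (x, y) in entities.items():
        grid[cell_of(x, y, cell_size)].append(eid)

    pairs = set()
    for (cx, cy), ids in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get() does not create missing squares in the defaultdict.
                for other in grid.get((cx + dx, cy + dy), ()):
                    for eid in ids:
                        if eid < other:
                            pairs.add((eid, other))
    return pairs
```

Rebuilding the grid every frame, as done here, is the simplest variant; moving a single reference between squares on change, as described above, avoids the rebuild.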
Further gains can be had from keeping the projectiles out of the grid/tree/etc lookup system - since you can quickly determine where the projectile would be in the grid, you know which grid squares to query for potential collidees. If you check collisions against the environment for each projectile in turn, there's no need for the other entities to then check for collisions against the projectiles in reverse.

Resources