Fast query of objects based on location and radius - algorithm

I am developing a simple ecosystem simulation and I am stuck on choosing an optimal data structure for the creatures.
What I basically have is a list of all animals in my current world, with information about each animal's type and location. This list can grow very large - probably into the millions.
So, in order to implement some simple AI, I need to make this operation as fast as possible:
Given a point on the map and a radius, give me a list of all animals within the radius sorted by the distance from center point.
The world would be 2d so we are limited to planar coordinates.
Some of the operations I also need to support:
Changing location of animals
Creating new animals
Removing animals
I read about kd-trees and their ability to find nearest neighbours quickly.
Questions:
Do you think it will work in my case? If not, what data structure should I use to meet the requirements?
EDIT:
Here are some more details as requested in comments.
The world wouldn't be too big - something that can fit a screen with animals being little circles. I should also support the scenario where the world could become fairly dense.
Finally, I expect no more than a few dozen results per query, but since I will run a great number of these queries (one per animal, although this could probably be cached somehow - let's ignore that for simplicity for now), I need this to be as fast and efficient as possible.

Given your constraints (a large number of moving objects and a small world of presumably fixed size), a simple grid may be a better fit than a kd-tree.
See: Trees or Grids? Indexing Moving Objects in Main Memory
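For a sense of what that looks like in practice, here is a minimal Python sketch of a uniform grid; the cell size, class layout, and brute-force distance check inside candidate cells are my own assumptions, not taken from the paper.

import math
from collections import defaultdict

class Grid:
    """Uniform grid over a small 2D world; cell size roughly the query radius."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)   # (cx, cy) -> set of animal ids
        self.positions = {}             # animal id -> (x, y)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, animal_id, x, y):
        self.positions[animal_id] = (x, y)
        self.cells[self._cell(x, y)].add(animal_id)

    def remove(self, animal_id):
        x, y = self.positions.pop(animal_id)
        self.cells[self._cell(x, y)].discard(animal_id)

    def move(self, animal_id, x, y):
        self.remove(animal_id)
        self.insert(animal_id, x, y)

    def query(self, cx, cy, radius):
        """All animals within `radius` of (cx, cy), sorted by distance."""
        r_cells = int(radius // self.cell_size) + 1
        centre_cell = self._cell(cx, cy)
        hits = []
        for dx in range(-r_cells, r_cells + 1):
            for dy in range(-r_cells, r_cells + 1):
                cell = (centre_cell[0] + dx, centre_cell[1] + dy)
                for animal_id in self.cells.get(cell, ()):
                    x, y = self.positions[animal_id]
                    d = math.hypot(x - cx, y - cy)
                    if d <= radius:
                        hits.append((d, animal_id))
        hits.sort(key=lambda t: t[0])
        return hits

With the cell size on the order of the query radius, each query touches only a handful of cells, and moving, creating, or removing an animal is a constant-time dictionary update.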

kd-trees should work fine if there are many queries before the positions change (e.g., one per animal). Incremental updates would probably be fine as well, but an immutable kd-tree that is completely rebuilt when necessary has the advantage that the tree structure can be represented succinctly with an array. To find all points within a search disk, traverse the tree, pruning subtrees whose half-plane does not intersect the search disk. You'll have to sort the results by distance at the end.
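As a rough illustration of that rebuild-and-prune approach (the node layout and recursion below are my own sketch, not a reference implementation):

import math

def build_kdtree(points, depth=0):
    """points: list of (x, y, payload); returns nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def query_disk(node, centre, radius, depth=0, out=None):
    """Collect all points within `radius` of `centre`, skipping subtrees whose
    side of the splitting line lies entirely outside the search disk."""
    if out is None:
        out = []
    if node is None:
        return out
    point, left, right = node
    axis = depth % 2
    d = math.hypot(point[0] - centre[0], point[1] - centre[1])
    if d <= radius:
        out.append((d, point))
    diff = centre[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    query_disk(near, centre, radius, depth + 1, out)
    if abs(diff) <= radius:   # the disk crosses the splitting line
        query_disk(far, centre, radius, depth + 1, out)
    return out

Sorting the collected (distance, point) pairs at the end, e.g. sorted(query_disk(tree, (qx, qy), r), key=lambda t: t[0]) with your own tree and query values, gives the nearest-first list the question asks for.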

Related

Bark at a Sphere Tree or Look Somewhere Else?

The scenario: a large number of players, playing a real-time game in 3D space, must be organized in a way that lets a server efficiently update other players and any other observer of a player's moves and actions. Which objects 'talk' to one another needs to be culled based on their in-simulation range from one another; this is to preserve network sanity, programmer sanity, and also to allow server-lets to handle smaller chunks of the overall world play-space.
However, if you have 3000 players, this runs into the issue that one must run 3000! calculations to find out the ranges between everything. (Google tells me that ends up as a number with over 9000 digits; that's insane and not worth considering for a near-real-time environment.)
Daybreak Games seems to have solved this problem in their massive online first-person shooter PlanetSide 2; it allowed 3000 players to play in a shared space with real-time responsiveness. They've apparently done it through a "Sphere Tree" data structure.
However, I'm not positive this is the solution they use, and I'm still questioning how to apply the concept of "Sphere Trees" to reduce the range calculations for culling to a reasonable amount.
If Sphere Trees are not the right tree to bark up, what else should I be directing my attention at to tackle this problem?
(I'm a C# programmer, mainly, but I'm looking for a logical answer, not a code one.)
References I've found about sphere trees:
http://isg.cs.tcd.ie/spheretree/#algorithms
https://books.google.com/books?id=1-NfBElV97IC&pg=PA385&lpg=PA385#v=onepage&q&f=false
Here are a few of my thoughts:
Let n denote the total number of players.
I think your estimate of 3000! is wrong. If you want to calculate all pairwise distances, you perform 3000 choose 2 = 3000 · 2999 / 2 ≈ 4.5 million distance computations, on the order of O(n^2 * t), where t is the cost of computing the distance between two players. If you build the graph underlying the players with edge weights being the Euclidean distance, you can reduce this to the all-pairs shortest paths problem, which is solvable via the Floyd-Warshall algorithm in O(n^3).
What you're describing sounds pretty similar to doing a range query: https://en.wikipedia.org/wiki/Range_searching. There are a lot of data structures that can help you, such as range trees and k-d trees.
If objects only need to interact with objects that are, e.g., <= 100m away, then you can divide the world up into 100m x 100m tiles (or voxels), and keep track of which objects are in each non-empty tile.
Then, when one object needs to 'talk', you only need to check the objects in at most 9 tiles (27 voxels in 3D) to see if they are close enough to hear it.
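A minimal sketch of that tile bookkeeping, assuming a 100 m interaction range and 2D tiles (all names are illustrative, and this is certainly not PlanetSide 2's actual scheme):

from collections import defaultdict

TILE = 100.0  # interaction range in metres (assumed)

def tile_of(x, y):
    return (int(x // TILE), int(y // TILE))

class InterestGrid:
    """Tracks which players occupy each non-empty 100m x 100m tile."""

    def __init__(self):
        self.tiles = defaultdict(set)   # tile -> player ids

    def place(self, player_id, x, y):
        self.tiles[tile_of(x, y)].add(player_id)

    def listeners(self, x, y):
        """Players in the 3x3 block of tiles around (x, y) -- the only
        candidates close enough to hear a 'talking' player there."""
        tx, ty = tile_of(x, y)
        nearby = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nearby |= self.tiles.get((tx + dx, ty + dy), set())
        return nearby

An exact range check against the returned candidates then reduces the per-'talk' cost from n comparisons to the handful of players in those tiles.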

Auto-balancing (or cheaply balanced) 3D data structure

I am working on a tool that requires a 3D "voxel-based" engine. By that I mean it will involve adding and removing cubes from a grid. In order to manage these cubes I need a data structure that allows for quick insertions and deletes. The problem I've seen with k-d trees and octrees is that it seems like they would frequently need to be recreated (or at least rebalanced) because of these operations.
Before I jumped in I wanted to get opinions on what the best way to go about this would be.
Some more details:
x,y,z position is in integer space
needs to be efficient enough for a real-time application
there is no hard limit on the number of cubes that would be used.
In all likelihood the number will most often be inconsequentially low (<100); however, I would like the tool to handle as many cubes as possible.
I guess the ultimate question is: what is the best way to manage what is essentially 3D point data in a way that can handle frequent insertions and deletes?
(No, I'm not making Minecraft.)
Octrees are easy to update dynamically. Typically the tree is refined based on a per leaf upper/lower population count:
When a new item is inserted, it is pushed onto the item list for the enclosing leaf node. If the upper population count is exceeded, the leaf is refined.
When an existing item is erased, it is removed from the item list for the enclosing leaf node. If the lower population count is reached, the leaf's siblings are scanned. If all siblings are leaf nodes and their cumulative item count is less than the upper population count, the set of siblings is deleted and their items are pushed onto the parent.
Both operations are local, traversing only the height of the tree, which is O(log(n)) for well distributed point sets.
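A rough Python sketch of that refine/coarsen behaviour; a single MAX_ITEMS threshold stands in for the separate upper/lower population counts, and the class layout is my own assumption:

MAX_ITEMS = 8    # upper population count per leaf (assumed)
MIN_HALF = 0.5   # stop refining once a leaf covers a single integer cell

class OctreeNode:
    """Cube centred at `centre` with half-width `half`: either a leaf holding
    a list of items or an internal node with 8 children."""

    def __init__(self, centre, half):
        self.centre = centre
        self.half = half
        self.items = []       # (x, y, z, payload) tuples while this is a leaf
        self.children = None  # list of 8 OctreeNode once refined

    def _child_index(self, p):
        cx, cy, cz = self.centre
        return (p[0] >= cx) | ((p[1] >= cy) << 1) | ((p[2] >= cz) << 2)

    def _refine(self):
        cx, cy, cz = self.centre
        h = self.half / 2.0
        self.children = [OctreeNode((cx + (h if i & 1 else -h),
                                     cy + (h if i & 2 else -h),
                                     cz + (h if i & 4 else -h)), h)
                         for i in range(8)]
        for item in self.items:
            self.children[self._child_index(item)].insert(item)
        self.items = []

    def insert(self, item):
        if self.children is None:
            self.items.append(item)
            if len(self.items) > MAX_ITEMS and self.half > MIN_HALF:
                self._refine()          # upper population count exceeded
        else:
            self.children[self._child_index(item)].insert(item)

    def count(self):
        if self.children is None:
            return len(self.items)
        return sum(c.count() for c in self.children)

    def erase(self, item):
        if self.children is None:
            self.items.remove(item)
            return
        self.children[self._child_index(item)].erase(item)
        # coarsen: if every child is a leaf and the cumulative count is small
        # enough, delete the children and pull their items up into this node
        if all(c.children is None for c in self.children) and \
                self.count() < MAX_ITEMS:
            self.items = [it for c in self.children for it in c.items]
            self.children = None

Both insert and erase only touch nodes along one root-to-leaf path, matching the O(log n) behaviour described above.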
KD-trees, on the other hand, are not easy to update dynamically, since their structure is based on the distribution of the full point set.
There are also a number of other spatial data structures that support dynamic updates - R-trees and Delaunay triangulations, to name a few - but it's not clear that they'd offer better performance than an octree. I'm not aware of any spatial structure that supports better than O(log n) dynamic queries.
Hope this helps.

Efficiently querying a B+ Tree holding multidimensional data

I have a collection of tuples (x,y) of 64-bit integers that make up my dataset. I have, say, trillions of these tuples; it is not feasible to keep the dataset in memory on any machine on earth. However, it is quite reasonable to store them on disk.
I have an on-disk store (a B+-tree) that allows for quick, concurrent querying of data in a single dimension. However, some of my queries rely on both dimensions.
Query examples:
Find the tuple whose x is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is less than or equal to some given value
Perform maintenance operations (insert some tuple, remove some tuple)
The best bet I have found is Z-order curves, but I cannot seem to figure out how to conduct the queries given my two-dimensional data set.
Solutions that are not acceptable include a sequential scan of the data; this would be far too slow.
I think the most appropriate data structures for your requirements are the R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree). An R-tree is similar to a B+-tree, but also allows multidimensional queries.
Another relevant data structure is the priority search tree. It is good for queries like your examples 1-3, but not very efficient if you need frequent updates or an on-disk store. For details see this paper or this book: "Handbook of Data Structures and Applications" (Chapter 18.5).
Are you saying you don't know how to query z-order curves? The Wikipedia page describes how you do range searches.
A z-curve divides your space into nested rectangles, where each additional bit in the key divides the space in half. To search for a point:
Start with the largest rectangle that might contain your point.
Recursively:
  Create a result set of rectangles.
  For each rectangle in your set:
    If the rectangle is a single point, you are done; it is what you are looking for.
    Otherwise, divide the rectangle in two (specify one additional bit of the z-curve):
      If both halves contain a point:
        If one half is better:
          Add that rectangle to your result set of rectangles.
        Otherwise:
          Add both rectangles to your result set of rectangles.
      Otherwise (only one half contains a point):
        Add that rectangle to your result set of rectangles.
  Search your result set of rectangles.
Worst case performance is bad, of course. You can adjust it by changing how you construct your z-order index.
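For reference, here is a small sketch of how the z-order key itself is built by bit interleaving, and of the coarse key interval it yields for a query rectangle; the tighter range-splitting described above is omitted, and the function names are mine.

def morton_key(x, y, bits=64):
    """Interleave the bits of x and y (both unsigned, `bits` wide) so that
    each additional bit of the key halves the remaining rectangle,
    alternating between the x and y axes."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def morton_range(x_lo, y_lo, x_hi, y_hi, bits=64):
    """Coarse search interval on the z-curve covering the query rectangle:
    keys outside [lo, hi] can be skipped, keys inside still need a final
    point-in-rectangle check (or further splitting, as described above)."""
    return morton_key(x_lo, y_lo, bits), morton_key(x_hi, y_hi, bits)

In practice you would store these keys in your existing B+-tree and scan only the returned interval, discarding the false positives or splitting the interval further as in the recursion above.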
I'm currently working on designing a data structure which is essentially a 'stacked' B+ tree (or a d+ tree where d is the number of dimensions) for multidimensional data. I believe it would suit your data perfectly and is being designed specifically for your use case.
The basic idea is this:
Each dimension is a B+ tree and is linked to the next dimension's B+ tree. Search through the first dimension normally; once a leaf is reached, it contains a pointer to the root of the next B+ tree, which belongs to the next dimension. Everything in the second B+ tree belongs to the same x value.
The original plan was to store only the unique values for each dimension along with its count. This employs a very simple compression algorithm (if you can even call it that) while still allowing the entire data set to be represented. This 'linked' dimension scheme could allow extra dimensions to be added later, as they are simply added to the stack of B+ trees.
Total insert/search/delete time for 2 dimensions would be something similar to this:
log_b(card(x)) + log_b(card(y))
where b is the base (branching factor) of each B+ tree and card(x) is the cardinality of the x dimension.
I hope that makes sense. I'm still working on an implementation, however feel free to use or even augment the idea.
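To make the shape of that concrete, here is a toy sketch in which sorted lists plus binary search stand in for the two B+ trees; it shows only the linked-dimension lookup pattern, not the on-disk node layout, all names are mine, and note that the second query may have to walk several x values in this naive form.

import bisect
from collections import defaultdict

class StackedIndex:
    """First 'tree': sorted unique x values. Second 'tree': for each x,
    a sorted list of the y values sharing that x."""

    def __init__(self, tuples):
        by_x = defaultdict(list)
        for x, y in tuples:
            by_x[x].append(y)
        self.xs = sorted(by_x)                              # dimension-1 keys
        self.ys = {x: sorted(v) for x, v in by_x.items()}   # dimension-2 trees

    def first_x_at_least(self, x_min):
        """Example query 1: a tuple whose x >= x_min."""
        i = bisect.bisect_left(self.xs, x_min)
        if i == len(self.xs):
            return None
        x = self.xs[i]
        return (x, self.ys[x][0])

    def smallest_x_with_y_at_least(self, y_min):
        """Example query 2: smallest x such that some y for that x is >= y_min."""
        for x in self.xs:                 # walk dimension 1 in order
            ys = self.ys[x]
            j = bisect.bisect_left(ys, y_min)
            if j < len(ys):               # dimension-2 lookup is O(log card(y))
                return (x, ys[j])
        return None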
http://fallabs.com/tokyocabinet/
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each of which is a key-value pair. Every key and value is a variable-length byte sequence; both binary data and character strings can be used as keys and values. There is no concept of data tables or data types. Records are organized in a hash table, B+ tree, or fixed-length array.
Tokyo Cabinet is written in the C language and provided as APIs for C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms that have an API conforming to C99 and POSIX. Tokyo Cabinet is free software licensed under the GNU Lesser General Public License.
Maybe it's easy for you to embed?

Collision detection complexity <O(n²): Simpler approach than grid, quadtrees, BSP?

I have a large number of objects (balls, for a start) that move stepwise in space, one at a time, and must not overlap. Currently, for every move I check for collision with every other object. Several other questions here deal with this; however, I thought of a seemingly simple solution that does not seem to come up in this context, and I wonder why.
Why not simply keep 2 collections (for 2D, or 3 in three dimensions) of all objects, sorted by the x and y (and z) coordinate, respectively, and at every move look up all other objects within a given distance (ball diameter here) in each dimension and do the actual collision check only on objects in both (or all 3) result sets?
I realize this only works for equally-sized objects, but alternatively one could use twice as many collections, sorted by the (1) highest and (2) lowest coordinate of each object for each dimension. Any reason why this would not work, or would give significantly less of an improvement than going from the O(n) per-move pairwise check to the grid method or quad/octrees? I see the update of these sorted collections as the costly operation here, but using e.g. a TreeSet (my implementation would be in Java) it should still be significantly less than O(n), right?
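To make the proposal concrete, the per-axis scheme described above might look roughly like this (sorted lists with bisect stand in for the Java TreeSet; numeric ball ids and the O(n) list insertion are shortcuts of this sketch, not of the proposal):

import bisect

class AxisSortedIndex:
    """Keeps (coordinate, ball_id) pairs sorted per axis; candidate balls for a
    collision check are those near the moving ball on *both* axes."""

    def __init__(self):
        self.by_x = []   # sorted list of (x, ball_id)
        self.by_y = []   # sorted list of (y, ball_id)

    def add(self, ball_id, x, y):
        bisect.insort(self.by_x, (x, ball_id))
        bisect.insort(self.by_y, (y, ball_id))

    def _near(self, axis, value, dist):
        lo = bisect.bisect_left(axis, (value - dist, float('-inf')))
        hi = bisect.bisect_right(axis, (value + dist, float('inf')))
        return {ball_id for _, ball_id in axis[lo:hi]}

    def candidates(self, x, y, diameter):
        """Balls within `diameter` of (x, y) on both axes -- the only ones that
        need the exact overlap test."""
        return self._near(self.by_x, x, diameter) & self._near(self.by_y, y, diameter)

The answers below explain why these per-axis strips can still contain far more candidates than a single grid cell or quadtree node would.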
The check for which objects are in both result sets involves looking at all objects in two strips of the plane. That is a much larger area, and therefore involves more objects, than the enclosing square that a quadtree lets you immediately narrow down to. More objects means that it is slower.
You could use a spatial index or space-filling curve instead of a quadtree. An SFC reduces the 2D problem to a 1D one, and differs from a quadtree in that it can only store one object per (x, y) pair; maybe this works for your problem. Search for Nick's Hilbert curve quadtree spatial index blog.

Possible to calculate closest locations via lat/long in better than O(n) time?

I'm wondering if there is an algorithm for calculating the nearest locations (represented by lat/long) in better than O(n) time.
I know I could use the Haversine formula to get the distance from the reference point to each location and sort ASC, but this is inefficient for large data sets.
How does the MySQL DISTANCE() function perform? I'm guessing O(n)?
If you use a kd-tree to store your points, you can do this in O(log n) time (expected) or O(sqrt(n)) worst case.
You mention MySQL, but there are some pretty sophisticated spatial features in SQL Server 2008, including a geography data type. There's some information out there about doing the types of things you are asking about. I don't know spatial well enough to talk about performance; I doubt there is a bounded-time algorithm to do what you're asking, but you might be able to do some fast set operations on locations.
If the data set being searched is static, e.g., the coordinates of all gas stations in the US, then a proper index (BSP) would allow for efficient searching. Postgres has had good support for two-dimensional indexed data since the mid-90s, so you can do just this sort of query.
Better than O(n)? Only if you go the way of radix sort or store the locations with hash keys that represent the general location they are in.
For instance, you could divide the globe with latitude and longitude to the minute, enumerate the resulting areas, and make the hash for a location its area. When the time comes to get the closest location, you only need to check at most 9 hash keys -- you can test beforehand whether an adjacent grid cell can possibly provide a closer location than the best found so far, thus decreasing the set of locations to compute the distance to. It's still O(n), but with a much smaller constant factor. Properly implemented you won't even notice it.
Or, if the data is in memory or otherwise randomly accessible, you could store it sorted both by latitude and by longitude. You then use binary search to find the closest latitude and longitude in the respective data sets. Next, you keep reading locations at increasing distance in latitude or longitude (i.e., the preceding and succeeding locations), until it becomes impossible to find a closer location.
You know you can't find a closer location when the latitude of the next location to either side in the latitude-sorted data wouldn't be closer than the best found so far, even if it had the same longitude as the point from which the distance is being calculated. A similar test applies for the longitude-sorted data.
This actually gives you better than O(n) -- closer to O(log n), I think -- but it does require random, instead of sequential, access to data, and duplication of all data (or at least the keys to the data).
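A minimal illustration of the latitude half of that scheme; longitude handling is analogous and omitted, the 111 km/degree figure is an approximate lower-bound constant, and all names are mine.

import bisect
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_by_latitude(points, qlat, qlon):
    """points: list of (lat, lon) sorted by latitude. Binary-search the query
    latitude, then expand outwards; stop once the latitude gap alone already
    exceeds the best distance found so far."""
    km_per_deg_lat = 111.0                      # roughly constant, conservative
    i = bisect.bisect_left(points, (qlat, qlon))
    best, best_d = None, float('inf')
    lo, hi = i - 1, i
    while lo >= 0 or hi < len(points):
        for j in (lo, hi):
            if 0 <= j < len(points):
                lat, lon = points[j]
                if abs(lat - qlat) * km_per_deg_lat >= best_d:
                    continue
                d = haversine_km(qlat, qlon, lat, lon)
                if d < best_d:
                    best, best_d = points[j], d
        # stop when both directions are already too far away in latitude alone
        if (lo < 0 or abs(points[lo][0] - qlat) * km_per_deg_lat >= best_d) and \
           (hi >= len(points) or abs(points[hi][0] - qlat) * km_per_deg_lat >= best_d):
            break
        lo -= 1
        hi += 1
    return best, best_d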
I wrote an article about finding the nearest line at DDJ a couple of years ago, using a grid (I call it quadrants). Using it to find the nearest point (instead of lines) would just be a reduction of it.
Using quadrants reduces the time drastically, although the complexity cannot be determined mathematically (all points could theoretically lie in a single quadrant). A precondition of using quadrants/grids is that you have a maximum distance for the point searched. If you just look for the nearest point, without giving a maximum distance, you can't use quadrants.
In that case, have a look at A Template for the Nearest Neighbor Problem (Larry Andrews at DDJ), which has a retrieval complexity of O(log n). I did not compare the runtime of the two algorithms. Probably, if you have a reasonable maximum width, quadrants are better. The better general-purpose algorithm is the one from Larry Andrews.
If you are looking for the single closest location, there's no need to sort. Simply iterate through your list, calculating the distance to each point and keeping track of the closest one. By the time you get through the list, you'll have your answer.
Even better would be to introduce the concept of grids. You would assign each point to a grid cell. Then, for your search, first determine the cell you are in and perform your calculations on the points in that cell. You'll need to be a little careful, though: if the test location is close to the boundary of a cell, you'll need to search the neighbouring cell(s) as well. Still, this is likely to be highly performant.
I haven't looked at it myself, but Postgres does have a module dedicated to the management of GIS data.
In an application I worked on in a previous life, we took all of the data, computed its key for a quadtree (for 2D space) or an octree (for 3D space), and stored that in the database. It was then a simple matter of loading the values from the database (to prevent you having to recompute the quadtree) and following the standard quadtree search algorithm.
This does of course mean you will touch all of the data at least once to get it into the data structure. But persisting this data-structure means you can get better lookup speeds from then on. I would imagine you will do a lot of nearest-neighbour checks for each data-set.
(For kd-trees, Wikipedia has a good explanation: http://en.wikipedia.org/wiki/Kd-tree)
You need a spatial index. Fortunately, MySQL provides just such an index, in its Spatial Extensions. They use an R-Tree index internally - though it shouldn't really matter what they use. The manual page referenced above has lots of details.
I guess you could do it theoretically if you had a large enough table to do this... Secondly, perhaps caching correctly could get you a very good average case?
An R-Tree index can be used to speed spatial searches like this. Once created, it allows such searches to be better than O(n).

Resources