how to sort geographical data for quick search - algorithm

I have some objects that are geo-localized (I have for each object the latitude + longitude).
My App needs to display the objects that are 3 kilometers around the GPS position of the mobile device.
I have several thousands of objects and they are localized in large area (for example, several US state, several small country), meaning in my list of objects I can have one located in NYC and another one in Miami but I can have also objects that are very close (few meters).
Currently, my App performs an iterative search. For each objects I compute the distance with the GPS position and if distance is <= 3KM then I keep the object else I ignore it. This algorithm is not very efficient and I'm looking for an Algorithm that will give better performance.
I suppose there's a way to sort my objects using the geo coord and next to find more quickly the objects that are located around the GPS position.
My current idea is just to compute rectangle with the "extreme points", North / South / East / West (from 3km of the GPS position) to limit the search zone. Next I will compute the distance only for the objects inside this box.
I think something better could be done but I don't have the idea...
Any proposal will be appreciated ;-)
Thanks,
Séb.

Sounds like a nearest neighbor search, but not with a maximum number of neighbors (as in kNN), but with a maximum distance threshold.
A common approach is to put the objects into a special data structure to allow ruling out large parts of the search space fast.
However, these are usually made with euclidean spaces in mind, and not for the spherical (lat/lon-)plane (wrap-around issues).
Therefore, you'd probably need to convert your coordinates to 3d coordinates in a cartesian system relative to the center of the sphere before you can apply one of the following data structures to search efficiently for your objects:
Octree
kd-tree

The other answers mentioning spatial indexes are correct, but not necessarily the easiest solution for you.
I would consider something simpler:
Group the items by country, then by state, region, city, and finally - by a few landmarks in dense cities (where you have a lot o objects).
Then you would only need to perform a few queries (check which country am I in, which state, region, etc.) to limit yourself to a very small set of objects, without implementing advanced data structures in your mobile app.

One way of doing this without a specialized datastructure would appear to be sorting two copies of your data - once by longitude, once by latitude. Anything that binary searches down to close on both lat and long, is close.
Similarly, you could use your usual treap (fast) or red-black tree (low variability).
But there are probably advantages to using an r-tree or kd-tree. What I've described is probably only for avoiding taking new dependencies or avoiding coding a new datastructure from scratch.

Related

Using a spatial index to find points within range of each other

I'm trying to find a spatial index structure suitable for a particular problem : using a union-find data structure, I want to connect\associate points that are within a certain range of each other.
I have a lot of points and I'm trying to optimize an existing solution by using a better spatial index.
Right now, I'm using a simple 2D grid indexing each square of width [threshold distance] of my point map, and I look for potential unions by searching for points in adjacent squares in the grid.
Then I compute the squared Euclidean distance to the adjacent cells combinations, which I compare to my squared threshold, and I use the union-find structure (optimized using path compression and etc.) to build groups of points.
Here is some illustration of the method. The single black points actually represent the set of points that belong to a cell of the grid, and the outgoing colored arrows represent the actual distance comparisons with the outside points.
(I'm also checking for potential connected points that belong to the same cells).
By using this pattern I make sure I'm not doing any distance comparison twice by using a proper "neighbor cell" pattern that doesn't overlap with already tested stuff when I iterate over the grid cells.
Issue is : this approach is not even close to being fast enough, and I'm trying to replace the "spatial grid index" method with something that could maybe be faster.
I've looked into quadtrees as a suitable spatial index for this problem, but I don't think it is suitable to solve it (I don't see any way of performing repeated "neighbours" checks for a particular cell more effectively using a quadtree), but maybe I'm wrong on that.
Therefore, I'm looking for a better algorithm\data structure to effectively index my points and query them for proximity.
Thanks in advance.
I have some comments:
1) I think your problem is equivalent to a "spatial join". A spatial join takes two sets of geometries, for example a set R of rectangles and a set P of points and finds for every rectangle all points in that rectangle. In Your case, R would be the rectangles (edge length = 2 * max distance) around each point and P the set of your points. Searching for spatial join may give you some useful references.
2) You may want to have a look at space filling curves. Space filling curves create a linear order for a set of spatial entities (points) with the property that points that a close in the linear ordering are usually also close in space (and vice versa). This may be useful when developing an algorithm.
3) Have look at OpenVDB. OpenVDB has a spatial index structure that is highly optimized to traverse 'voxel'-cells and their neighbors.
4) Have a look at the PH-Tree (disclaimer: this is my own project). The PH-Tree is a somewhat like a quadtree but uses low level bit operations to optimize navigation. It is also Z-ordered/Morten-ordered (see space filling curves above). You can create a window-query for each point which returns all points within that rectangle. To my knowledge, the PH-Tree is the fastest index structure for this kind of operation, especially if you typically have only 9 points in a rectangle. If you are interested in the code, the V13 implementation is probably the fastest, however the V16 should be much easier to understand and modify.
I tried on my rather old desktop machine, using about 1,000,000 points I can do about 200,000 window queries per second, so it should take about 5 second to find all neighbors for every point.
If you are using Java, my spatial index collection may also be useful.
A standard approach to this is the "sweep and prune" algorithm. Sort all the points by X coordinate, then iterate through them. As you do, maintain the lowest index of the point which is within the threshold distance (in X) of the current point. The points within that range are candidates for merging. You then do the same thing sorting by Y. Then you only need to check the Euclidean distance for those pairs which showed up in both the X and Y scans.
Note that with your current union-find approach, you can end up unioning points which are quite far from each other, if there are a bunch of nearby points "bridging" them. So your basic approach -- of unioning groups of points based on proximity -- can induce an arbitrary amount of distance error, not just the threshold distance.

A data structure to handle moving points contact and containment within bounding boxes?

I have many points in space moving over time. They move in space full of AABB bounding boxes (including nested ones, there are less BBs than points) I wonder if there is a data structure that would help with organization of points getting into bounding boxes detection.
Currently I thought of a kd-tree based on boxes centers to perform ANN on points movement, with boxes intersection /nesting hierarchy (who is inside /beside whom) for box detection.
Yet this is slow for so many points so I wonder if there is some specialized algorithm/data structure for such case? A way to make such query for many points at the same time?
I would suggest using some kind if quadtree.
Basic quadtrees are already quite good with fast deletion/insertion, but there are special variants, that are better:
The Quadtree proposed by Samet et al uses overlapping regions, which allow moving objects to stay in the same node for longer, before requiring reinsertion in a different node.
The PH-Tree as technically a 'bit-level trie'. It basically looks like a quadtree, but has limited depth (64 for 64bit values), no reinsertion/rebalancing ever and is guaranteed to modify at most two nodes for every insertion or removal. At the same time, node size is limited by dimensionality, so maximum 8 points per node for 3D points or 64 entries when storing points and rectangles (boxes). Important for your case: Very good update performance (better than R-trees or KD-trees) and very good window-query performance. Window-queries would represent your AABBs when you look for points that overlap with a AABB. Moreover, you could also store th AABBs in the tree, then any window-query would return all points and other AABBs that it overlaps with (if that is useful for you).
Unfortunately, I'm not aware of any free implementation of Samet's quadtree. For the PH-Tree, you can find a Java version on the link given above and a C++ version here. You can also look at my index collection for Java implementations of various multidimensional indexes.

Different Searching Methods In Spatial Data Structures

I am trying to write a spatial data structure (such as a K-D tree or a QuadTree) which, given a point, will find the x closest points to it.
The issue with the data structures I mentioned above is that they support mostly a radial/region search. So they will obtain the points that are within a radius of y of a given point/node.
Altering those structures search for what I want would be inefficient. I am assuming I will need to repeat the radial search several times, starting from a short radial distance, and keep increasing it until I have the wanted x amount of points close to the given point. Of course, this defeats the whole purpose behind the data structure.
Almost all spatial data structures operate on radial search. What are other efficient search methods I could apply to a QuadTree, or any other spatial data structures I need to consider to achieve what I mean? Any suggestions?
I'm not sure that you are right in your assumptions. The Wikipedia article on kd-trees indicates how the structure can be used to support finding the x nearest neighbours to a search point. Yes, it is essentially a repetition of finding the nearest neighbour x times, but I'm not sure that you have a right to expect a more efficient performance from an algorithm over a kd-tree.
If that is not good enough for you perhaps you need to store your points in a different data structure. If x is small and bounded you could store your points in a weighted graph where the edge weights are, of course, the distances between points.
If x is neither small nor bounded you might employ a simple subdivision of space into k*m uniform cells (2D here, inflate to 3+D if necessary). For each search point go straight to the cell which contains it, find the other points in the same cell. If x of them are closer to the search point than the boundary of the cell, those are what you are looking for. If not, search in the cells on the other side of the near boundaries too.
If you find yourself needing to support both radial/region searches and x-nearest neighbour searches it's not the end of the world if you have to maintain 2 data structures, one to support each type of query. For many search problems the first step to an efficient solution is to put the data into the right structure for efficient searching. Making this decision depends on numbers you simply haven't provided us.
If you do call the search method several times over on a quadtree (which is what I've done a few times), then if you double the search radius on each call until you have correct number of points, the search is not that inefficient.
Assuming a 2d space, if the correct minimum radius to contain the X points is R1, and you keep on doubling until you find a radius R2 which contains them, then (a) R2 must be less than 2xR1 and (b) the area searched becomes 4 times bigger on each search, which (I think) gives you a worst case scenario of only half the area you've searched through actually being unnecessary (or thereabouts).

Sparse (Pseudo) Infinite Grid Data Structure for Web Game

I'm considering trying to make a game that takes place on an essentially infinite grid.
The grid is very sparse. Certain small regions of relatively high density. Relatively few isolated nonempty cells.
The amount of the grid in use is too large to implement naively but probably smallish by "big data" standards (I'm not trying to map the Internet or anything like that)
This needs to be easy to persist.
Here are the operations I may want to perform (reasonably efficiently) on this grid:
Ask for some small rectangular region of cells and all their contents (a player's current neighborhood)
Set individual cells or blit small regions (the player is making a move)
Ask for the rough shape or outline/silhouette of some larger rectangular regions (a world map or region preview)
Find some regions with approximately a given density (player spawning location)
Approximate shortest path through gaps of at most some small constant empty spaces per hop (it's OK to be a bad approximation often, but not OK to keep heading the wrong direction searching)
Approximate convex hull for a region
Here's the catch: I want to do this in a web app. That is, I would prefer to use existing data storage (perhaps in the form of a relational database) and relatively little external dependency (preferably avoiding the need for a persistent process).
Guys, what advice can you give me on actually implementing this? How would you do this if the web-app restrictions weren't in place? How would you modify that if they were?
Thanks a lot, everyone!
I think you can do everything using quadtrees, as others have suggested, and maybe a few additional data structures. Here's a bit more detail:
Asking for cell contents, setting cell contents: these are the basic quadtree operations.
Rough shape/outline: Given a rectangle, go down sufficiently many steps within the quadtree that most cells are empty, and make the nonempty subcells at that level black, the others white.
Region with approximately given density: if the density you're looking for is high, then I would maintain a separate index of all objects in your map. Take a random object and check the density around that object in the quadtree. Most objects will be near high density areas, simply because high-density areas have many objects. If the density near the object you picked is not the one you were looking for, pick another one.
If you're looking for low-density, then just pick random locations on the map - given that it's a sparse map, that should typically give you low density spots. Again, if it doesn't work right try again.
Approximate shortest path: if this is a not-too-frequent operation, then create a rough graph of the area "between" the starting point A and end point B, for some suitable definition of between (maybe the square containing the circle with the midpoint of AB as center and 1.5*AB as diameter, except if that diameter is less than a certain minimum, in which case... experiment). Make the same type of grid that you would use for the rough shape / outline, then create (say) a Delaunay triangulation of the black points. Do a shortest path on this graph, then overlay that on the actual map and refine the path to one that makes sense given the actual map. You may have to redo this at a few different levels of refinement - start with a very rough graph, then "zoom in" taking two points that you got from the higher level as start and end point, and iterate.
If you need to do this very frequently, you'll want to maintain this type of graph for the entire map instead of reconstructing it every time. This could be expensive, though.
Approx convex hull: again start from something like the rough shape, then take the convex hull of the black points in that.
I'm not sure if this would be easy to put into a relational database; a file-based storage could work but it would be impractical to have a write operation be concurrent with anything else, which you would probably want if you want to allow this to grow to a reasonable number of players (per world / map, if there are multiple worlds / maps). I think in that case you are probably best off keeping a separate process alive... and even then making this properly respect multithreading is going to be a headache.
A kd tree or a quadtree is a good data structure to solve your problem. Especially the latter it's a clever way to address the grid and to reduce the 2d complexity to a 1d complexity. Quadtrees is also used in many maps application like bing and google maps. Here is a good start: Nick quadtree spatial index hilbert curve blog.

Comparison of Lat, Long Coordinates

I have a list of more than 15 thousand latitude and longitude coordinates. Given any X,Y coordinates, what is the fastest way to find the closest coordinates on the list?
I did this once for a web site. I.e. find the dealer within 50 miles of your zip code. I used the great circle calculation to find the coordinates that were 50 miles north, 50 miles east, 50 miles south, and 50 miles west. That gave me a min and max lat and a min and max long. From there then I did a database query:
select *
from dealers
where latitude >= minlat
and latitude <= maxlat
and longitude >= minlong
and longitude <= maxlong
Since some of those results will still be more than 50 miles away, then I used the great circle formula once more on that small list of coordinates. Then I printed out the list along with the distance from the target.
Of course, if you wanted to search for points near the international date line or the poles, than this won't work. But it works great for searches inside North America!
You will want to use a geometric construction called a Voronoi diagram. This divides up the plane into a number of areas, one for each point, that encompass all the points that are closest to each of your given points.
The code for the exact algorithms to create the Voronoi diagram and arrange the data structure lookups are too large to fit in this little edit box. :)
#Linor: That's essentially what you would do after creating a Voronoi diagram. But instead of making a rectangular grid, you can choose dividing lines that closely match the lines of the Voronoi diagram (this way you will get fewer areas that cross dividing lines). If you recursively divide your Voronoi diagram in half along the best dividing line for each subdiagram, you can then do a tree search for each point you want to look up. This requires a bit of work up front but saves time later. Each lookup would be on the order of log N where N is the number of points. 16 comparisons is a lot better than 15,000!
The general concept you're describing is nearest-neighbour search, and there are a whole raft of techniques which deal with solving these types of queries, either exactly or approximately. The basic idea is to use a spatial partitioning technique to reduce the complexity from O(n) per query to (approximately) O( log n ) per query.
KD-Trees, and variants of KD-Trees seem to work very well, but quad-trees will also work. The quality of these searches depends on whether your set of 15,000 data points are static (you're not adding a-lot of data points to the reference set). Mount and Arya's work on the Approximate Nearest Neighbour library is both easy to use and understand, even without a good grounding in the math. It also gives you some flexibility in the types and tolerances of your queries.
It rather depends how many times you want to do it, and what resources are available - if you're doing the test once, then the O(log N) techniques are good. If you're doing it a thousand times on a server, constructing a bitmap lookup table would be faster, either giving the result directly or as a first stage of. 2GB of bitmap can map the whole world lat-lon to a 32bit value at 0.011 degree pixels (1.2km at equator), and should fit into memory. If you're only doing single country, or can exclude the poles, you can have a smaller map or higher resolution. For 15,000 points you probably have a much smaller map - I first sized it up as a first step to doing lat-lon to postcode searches, which needs higher resolution. Depending on requirements, you use the mapped value to point at the result directly, or to short list of the candidates (which would allow a smaller map, but requires greater subsequent processing - you're not in O(1) lookup territory any more).
You didn't specify what you meant by fastest. If you want to get the answer quickly without writing any code, I would give the gpsbabel radius filter a go.
Based on your clarifications, I would use a geometrical data structure such as a KD-tree or an R-tree. MySQL has a SPATIAL data type which does this. Other languages/frameworks/databases have libraries to support this. Basically, such a data structure embeds the points in a tree of rectangles, and searches the tree using a radius. This should be fast enough, and I believe is simpler than building a Voronoi diagram. I guess there is some threshold above which you would prefer the added performance of a Voronoi diagram so you will be ready to pay the added complexity.
This can be solved in several ways. I would first approach this problem by generating a Delaunay network connecting closest points to each other. This can be accomplished with the v.delaunay command in the open source GIS application GRASS. You could complete the problem in GRASS using one of the many network analysis modules in GRASS. Alternatively, you could use the free spatial RDBMS PostGIS to do the distance queries. The PostGIS spatial queries are considerably more powerful than those in MySQL, as they are not constrained to BBOX operations. For example:
SELECT network_id, ST_Length(geometry) from spatial_table where ST_Length(geometry) < 10;
Since you are using Longitude and Latitude, you probably want to use the Spheroid-Distance functions. With a spatial index, PostGIS scales very well for large datasets.
Even if you create a voronoi diagram, that still means you need to compare your x, y coordinates to all 15 thousand created areas. To make that easier, the first thing that popped into my mind though was to create some sort of grid over the possible values, so that you can easily place and x/y coordinate into one of the boxes in a grid, if the same is done for the list of areas you should quickly shrink the possible candidates for comparison (because the grid would be more rectangular, it's possible for an area to be in multiple grid positions).
Premature optimization is the root of all evil.
15K coordinates aren't that much. Why not iterate over the 15K coordinates and see if that's really a performance problem? You could save a lot of work and maybe it never gets too slow to even notice.
How large an area are these coordinates spread out over? What latitude are they at? How much accuracy do you require? If they're fairly close together, you can probably ignore the fact that the earth is round and just treat this as a Cartesian plane rather than messing about with spherical geometry and great circle distances. Of course, as you get further from the equator, degrees of longitute get smaller compared to degrees of latitude, so some sort of scaling factor may be appropriate.
Start with a fairly simple distance formula and a brute force search and see how long that's going to take and if the results are accurate enough before you get fancy.
Thanks everyone for the answers.
#Tom, #Chris Upchurch: The coordinates are fairly close to each others, and they are in a relatively small area of about 800 sq km. I guess I can assume the surface to be flat. I need to process the requests over and over again, and the response should be faster enough for more web experience.
A grid is very simple, and very fast. It's basically just a 2D array of lists. Each array entry represents the points that fall inside a grid cell. Very easy to set the grid up:
for each point p
get cell that contains p
add point to that cell's list
and it's very easy to look things up:
given a query point p
get cell that contains p
check points in that cell (and its 8 neighbors), against query point p
Alejo
Just to be contrairian, do you mean close in distance or (driving) time? In an urban area I'd gladly drive 5 miles (5min) on the highway than 4 miles (20min stop and go) in another direction.
Thus if it's a 'closest' metric you need, I'd look into GIS databases with travel time metrics.

Resources