Geospatial lookup - algorithm

I'm developing an algorithm and data structures to handle lookup by Euclidean distance on large quantities of 2-dimensional points.
I've tried researching this on Google Scholar but found nothing yet (probably because I don't know what this problem is usually called in the literature).
These are the two approaches I've considered:
Approach 1:
Create a two-dimensional grid of buckets. Insert points into buckets, keeping a reference from each point to its bucket.
On lookup of point P with distance D, get P's bucket B, plus every bucket whose grid square has a corner within distance D of P.
Finally, enumerate the points in all those buckets and calculate their distance to P.
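A minimal sketch of Approach 1 in Python (the cell size, the dict-of-lists bucketing, and all names are illustrative choices, not a fixed design):

import math
from collections import defaultdict

class GridIndex:
    def __init__(self, cell_size):
        self.cell = cell_size
        self.buckets = defaultdict(list)  # (ix, iy) -> points in that cell

    def _key(self, x, y):
        return (math.floor(x / self.cell), math.floor(y / self.cell))

    def insert(self, x, y):
        self.buckets[self._key(x, y)].append((x, y))

    def query(self, px, py, d):
        # Visit every bucket whose cell can intersect the disk of radius d.
        r = math.ceil(d / self.cell)
        kx, ky = self._key(px, py)
        hits = []
        for ix in range(kx - r, kx + r + 1):
            for iy in range(ky - r, ky + r + 1):
                for (x, y) in self.buckets.get((ix, iy), ()):
                    if (x - px) ** 2 + (y - py) ** 2 <= d * d:
                        hits.append((x, y))
        return hits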
Approach 2:
Create two lists, each containing all the points, ordered by one of the coordinates (x, y). On lookup of point P with distance D, binary-search each list to find the bounds of the rectangular region in which points have Chebyshev distance to P < D.
Finally, calculate the Euclidean distance to P of all the points in that region.
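And a sketch of Approach 2 with the standard bisect module (intersecting the two slices via a set is just one of several possible choices):

import bisect

class TwoListIndex:
    def __init__(self, points):
        self.by_x = sorted(points)                       # sorted by x
        self.by_y = sorted(points, key=lambda p: p[1])   # sorted by y
        self.xs = [p[0] for p in self.by_x]
        self.ys = [p[1] for p in self.by_y]

    def query(self, px, py, d):
        # Chebyshev box: |x - px| <= d and |y - py| <= d.
        in_x = set(self.by_x[bisect.bisect_left(self.xs, px - d):
                             bisect.bisect_right(self.xs, px + d)])
        in_y = self.by_y[bisect.bisect_left(self.ys, py - d):
                         bisect.bisect_right(self.ys, py + d)]
        return [(x, y) for (x, y) in in_y
                if (x, y) in in_x
                and (x - px) ** 2 + (y - py) ** 2 <= d * d]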
I'm guessing that state-of-the-art algorithms will be vastly superior to these, though? Any ideas on this are appreciated.

Some tips to help you:
Take a look at the KDTree: it is a k-dimensional tree (2-d in your case), which is one of the best ways to look for nearest neighbors.
Perhaps you could benefit from a Spatial Database, specifically developed to deal with Geospatial Data;
You could use either of the above with your desired distance function. Depending on your application, you may want map distance, great-circle distance, constant-slope distance, constant-bearing distance, etc. You should know your distance function. I usually apply great-circle (haversine) distance when dealing with google-maps-like maps and tracks.
In case you want a Python implementation, there is scipy.spatial (docs). From this module, the KDTree method query_ball_point((px, py), radius) seems to be what you're looking for.
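For example (random data, made-up query coordinates and radius):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100000, 2)   # your 2-D point set
tree = cKDTree(points)               # built once in O(n log n)

idx = tree.query_ball_point([0.5, 0.5], r=0.01)   # indices within radius
neighbors = points[idx]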
Hope this helps!

Related

How to find to which area a point belongs using hash maps?

Say we have a 2D space divided into clusters, say via a Voronoi tessellation:
We have the cluster outlines and mid points. Given a point (x, y coordinates) in such a space, how do we hash it in a way that lets us determine which cluster it belongs to?
I know that we can make our clusters form a binary tree by adding layers, and figure out where a point belongs using tree search. I also know that we can map every point of the space (assuming it is discrete) to a cluster and get O(1) lookups by storing loads of data. But what I'm after is a way to get the cluster for a given point in the style of an unordered hash map. How can such a thing be done (algorithmically)?
Build a kd-tree over the Voronoi sites (the cell "centers"); it can be built in O(N log N) time, after which each closest-site search takes O(log N) on average.
Note that the query point and its closest site belong to the same cell, by the defining property of the Voronoi diagram.
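A minimal sketch of that idea with scipy (the sites here are random stand-ins):

import numpy as np
from scipy.spatial import cKDTree

sites = np.random.rand(50, 2)    # the Voronoi sites / cell "centers"
tree = cKDTree(sites)            # built once over the sites

# The index of the nearest site is the cell the query point falls in.
dist, cell = tree.query([0.3, 0.7])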
Another option is to build a trapezoidal decomposition of the cells (perhaps a more complex approach).
In a Voronoi tessellation, each "cluster" is the locus of the points whose nearest neighbour is the corresponding mid point.
Thus, what you are interested in is actually finding the nearest neighbour (among all mid points) of a new point.
If you intend to use hashing for this purpose, maybe you should read about Locality-sensitive hashing.

algorithm to create bounding rectangles for 2D points

The input is a series of point coordinates (x0,y0),(x1,y1) .... (xn,yn) (n is not very large, say ~1000). We need to create some rectangles as bounding boxes of these points. There's no need to find the globally optimal solution. The only requirement is that if the Euclidean distance between two points is less than R, they should be in the same bounding rectangle. I've searched for some time and it seems to be a clustering problem, and the K-means method might be a useful one.
However, the input point coordinates follow no specific pattern from one run to the next, so it may not be possible to set a specific K in K-means. I am wondering if there is any algorithm or method that could solve this problem?
The only requirement is if the euclidean distance between two point is less than R, they should be in the same bounding rectangle
This is the definition of single-linkage hierarchical clustering cut at a height of R.
Note that this may yield overlapping rectangles.
For much faster and highly efficient methods, have a look at bulk loading strategies for R*-trees, such as sort-tile-recursive. It won't satisfy your "only" requirement above, but it will yield well balanced, non-overlapping rectangles.
K-means is obviously not appropriate for your requirements.
With only 1000 points I would do the following:
1) Work out the distance between every pair of points. If the distance of a pair is less than R, the two points need to go in the same bounding rectangle, so use a disjoint-set data structure (http://en.wikipedia.org/wiki/Disjoint-set_data_structure) to record this.
2) For each subset that comes out of your disjoint-set data structure, work out the min and max coordinates of the points in it and use these to create a bounding box for the points in that subset.
If you have more points or are worried about efficiency, you will want to make stage (1) more efficient. One easy way is to sweep through the points in order of x coordinate, keeping only points at most R to the left of the most recent point seen, and using a balanced tree structure to find, among those, the points at most R above or below the most recent point, before calculating their distances to it. One step up from this would be to create a spatial data structure to find pairs within distance R of each other even more efficiently.
Note that for some inputs you will get just one huge bounding box because you have long chains of points, and for some other inputs you will get bounding boxes inside bounding boxes, for instance if your points are in concentric circles.
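A brute-force sketch of steps (1) and (2) in Python (the path-halving union-find is one standard implementation; all names are illustrative):

from itertools import combinations

def bounding_rectangles(points, R):
    # 1) Union-find over all pairs closer than R (O(n^2), fine for ~1000 points).
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 < R * R:
            parent[find(i)] = find(j)

    # 2) One box (min_x, min_y, max_x, max_y) per connected subset.
    boxes = {}
    for i, (x, y) in enumerate(points):
        b = boxes.setdefault(find(i), [x, y, x, y])
        b[0], b[1] = min(b[0], x), min(b[1], y)
        b[2], b[3] = max(b[2], x), max(b[3], y)
    return list(boxes.values())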

K nearest neighbour search with weights on dimensions

I have a floor on which various sensors are placed at different locations. For every transmitting device, some sensors may detect its readings. It is possible to have 6-7 sensors on a floor, and a particular reading may be missed by some sensors yet detected by others.
For every reading I get, I would like to identify its location on the floor. We divide the floor logically into TILEs (5x5-foot areas) and work out what the reading at each TILE should ideally be, as detected by each sensor device (based on a transmission path-loss equation).
I am using the precomputed readings from the N sensor devices at each TILE as a point in N-dimensional space. When I get a real-life reading, I find its nearest neighbours among those points and assign the reading to that location.
I would like to know if there is a variant of K nearest neighbours in which a dimension can be REMOVED from consideration. This would be especially useful when a particular sensor is not reporting any reading. My understanding is that putting a weight on a dimension is impossible with algorithms like kd-trees or R-trees. However, I would like to know if it is possible to discard a dimension when computing nearest neighbours. Is there any such algorithm?
EDIT:
What I want to know is whether the same R-tree/kd-tree can be used for k-nearest searches across different queries, where each query weights the dimensions differently. I don't want to construct a new kd-tree for every different weighting of the dimensions.
EDIT 2:
Is there any library in Python which lets you specify a custom distance function and search for the k nearest neighbours? Essentially, I want to use a different custom distance function for each query.
Both for R-trees and kd-trees, using weighted Minkowski norms is straightforward. Just put the weights into your distance equations!
Putting weights into the Euclidean point-to-rectangle minimum distance is trivial: just take the regular formula and plug in the weights as desired.
Distances are not used at tree construction time, so you can vary the weights as desired at query time.
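As a concrete baseline, here is a brute-force NumPy sketch (not a tree) showing that the weights can change per query, and that a weight of 0 drops a dimension entirely:

import numpy as np

def knn_weighted(data, query, k, w):
    # Weighted squared Euclidean distance; w is chosen per query.
    d2 = ((data - query) ** 2 * np.asarray(w, dtype=float)).sum(axis=1)
    return np.argsort(d2)[:k]    # indices of the k nearest rows

data = np.random.rand(1000, 7)   # e.g. 1000 tiles x 7 sensors
knn_weighted(data, data[0], k=5, w=[1, 1, 1, 1, 1, 1, 1])
knn_weighted(data, data[0], k=5, w=[1, 1, 0, 1, 1, 1, 1])  # sensor 3 ignored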
After going through a lot of questions on Stack Overflow, and finally digging into the scipy kd-tree source code, I realised that the answer by "celion" at the following link is correct:
KD-Trees and missing values (vector comparison)
Excerpt:
"I think the best solution involves getting your hands dirty in the code that you're working with. Presumably the nearest-neighbor search computes the distance between the point in the tree leaf and the query vector; you should be able to modify this to handle the case where the point and the query vector are different sizes. E.g. if the points in the tree are given in 3D, but your query vector is only length 2, then the "distance" between the point (p0, p1, p2) and the query vector (x0, x1) would be
sqrt( (p0-x0)^2 + (p1-x1)^2 )
I didn't dig into the java code that you linked to, but I can try to find exactly where the change would need to go if you need help.
-Chris
PS - you might not need the sqrt in the equation above, since distance squared is usually equivalent."
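In Python, the same trick looks like this (using None as a made-up convention for a missing reading):

import math

def partial_distance(p, x):
    # Distance over only the dimensions present in both vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, x)
                         if a is not None and b is not None))

partial_distance((1.0, 2.0, 3.0), (0.5, None, 2.0))  # dimension 2 is skipped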

which data structure is appropriate to query "all points within distance d from point p"

I have a 3D point cloud and I'd like to efficiently query all points within distance d of an arbitrary point p (which is not necessarily part of the stored point cloud)
The query would look something like
Pointcloud getAllPoints(Point p, float d);
What acceleration structure would be appropriate for this? A range tree seems to be suited only to querying rectangular volumes, not spherical ones (of course, I could query the bounding box of the sphere and then discard all points whose distance exceeds d, but maybe there is a better way to do this?)
thanks!
Following Novelocrat's suggestion, I'll try to define the desired operations of the structure:
SearchStructure Create(Set<Point> cloud)
Set<Point> Query(SearchStructure S, Point p, float maxDistance)
SearchStructure Remove(Point p)
SearchStructure Insert(Point p)
SearchStructure Displace(Set<Point> displacement) //where each value describes an offsetVector to the currently present points
Usually, after n queries, the points get displaced and a few (not many!) insertions and deletions are made. The offset vectors are very small compared to the bounding box of all the points.
What you want is a structure that decomposes space so that particular regions can be found efficiently. A properly decomposed octree or kD-tree should allow you to do this well, as you would only 'open' the section of the tree containing your point p to look for points nearby. This should let you put a fairly low asymptotic bound on how many extra points you need to compare distance to (knowing that below some level of decomposition, all points are close enough). Unfortunately, I don't know the literature in this area well enough to give more detailed pointers. My encounter with these things is from the Barnes-Hut n-Body simulation algorithm.
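In Python, scipy's kd-tree already exposes this query directly (a sketch with random stand-in data):

import numpy as np
from scipy.spatial import cKDTree

cloud = np.random.rand(100000, 3)
tree = cKDTree(cloud)

def get_all_points(p, d):
    # Only tree nodes whose regions intersect the query sphere are opened.
    return cloud[tree.query_ball_point(p, d)]

near = get_all_points([0.5, 0.5, 0.5], 0.05)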
Here's another question closely related to this one.
And another.
And a third, mentioning a data structure (Hilbert R-Trees) that I hadn't previously heard of.
VTK can help:
void vtkAbstractPointLocator::FindPointsWithinRadius (
double R,
double x,
double y,
double z,
vtkIdList * result
)
Subclasses of vtkAbstractPointLocator contain different data structures for search acceleration: regular buckets, kd-trees, and octrees.
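From Python, a call might look roughly like this (assuming an existing vtkPoints instance named points; the radius and coordinates are made up):

import vtk

polydata = vtk.vtkPolyData()
polydata.SetPoints(points)        # wrap the cloud in a dataset

locator = vtk.vtkPointLocator()   # the bucket-based subclass
locator.SetDataSet(polydata)
locator.BuildLocator()

result = vtk.vtkIdList()
locator.FindPointsWithinRadius(1.5, 0.0, 0.0, 0.0, result)
ids = [result.GetId(i) for i in range(result.GetNumberOfIds())]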
I don't understand your API. You can round up all the points in a point cloud that lie inside an arbitrary sphere, but you also say that the point clouds are stored? In that case, shouldn't you get back a list of the point clouds inside the given sphere; otherwise, what is the point (excuse the pun) of having the point clouds stored?
Instead of trying to define the API in advance, define it when you need it. There is no need to implement something that will never be used, let alone optimize a function that will never be called (unless it's for fun, of course :)).
I think you should implement the bounding-box culling, followed by the more detailed sphere search, as a first implementation. Perhaps it's not as much of a bottleneck as you think, and perhaps you will have far more serious bottlenecks to consider. It's always possible to optimize later, once you actually see everything working together as you have planned.
Have a look at A Template for the Nearest Neighbor Problem (Larry Andrews in DDJ). It's only 2D, with a retrieval complexity of O(log n), but it might be adapted for 3D as well.
A map keyed on distance, with the Point itself as the value, would allow you to query for all Points closer than a given distance or within a given range.
Well, it depends on what other uses you need for the data structure.
You can have a list of distances from point p to other points, ordered by distance, and map these lists to the points with a hashmap.
map:
p1 -> [{p2, d12}, {p4, d14}, {p3, d13}]
p2 -> ...
...
You can look up the point in the map, and iterate the list until the distance is higher than required.
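A tiny sketch of that table (note the O(n^2) precomputation, and that p must be one of the stored points):

import math

def build_table(points):
    return {p: sorted((math.dist(p, q), q) for q in points if q != p)
            for p in points}

def within(table, p, d):
    out = []
    for dist_pq, q in table[p]:
        if dist_pq > d:
            break                 # the list is sorted, so we can stop early
        out.append(q)
    return out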

Can I use arbitrary metrics to search KD-Trees?

I just finished implementing a kd-tree for doing fast nearest neighbour searches. I'm interested in playing around with distance metrics other than the Euclidean distance. My understanding of the kd-tree is that the speedy kd-tree search is not guaranteed to give exact results if the metric is non-Euclidean, which means that I might need to implement a new data structure and search algorithm to try out new metrics for my search.
I have two questions:
Does using a kd-tree permanently tie me to the Euclidean distance?
If so, what other sorts of algorithms should I try that work for arbitrary metrics? I don't have a ton of time to implement lots of different data structures, but other structures I'm thinking about include cover trees and vp-trees.
The nearest-neighbour search procedure described on the Wikipedia page you linked to can certainly be generalised to other distance metrics, provided you replace "hypersphere" with the equivalent geometrical object for the given metric, and test each hyperplane for crossings with this object.
Example: if you are using the Manhattan distance instead (i.e. the sum of the absolute values of all differences in vector components), your hypersphere would become a (multidimensional) diamond. (This is easiest to visualise in 2D -- if your current nearest neighbour is at distance x from the query point p, then any closer neighbour behind a different hyperplane must intersect a diamond shape that has width and height 2x and is centred on p). This might make the hyperplane-crossing test more difficult to code or slower to run, however the general principle still applies.
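For what it's worth, scikit-learn's KDTree accepts several such axis-aligned metrics out of the box (a quick sketch with random data):

import numpy as np
from sklearn.neighbors import KDTree

X = np.random.rand(10000, 2)
tree = KDTree(X, metric="manhattan")   # same structure, L1 metric
dist, ind = tree.query(X[:1], k=3)     # exact 3-NN under Manhattan distance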
I don't think you're tied to Euclidean distance - as j_random_hacker says, you can probably use Manhattan distance - but I'm pretty sure you're tied to geometries that can be represented in Cartesian coordinates. So you couldn't use a kd-tree to index a metric space, for example.
