Say we have a 2D space divided into clusters, say via a Voronoi tessellation:
We have the cluster outlines and midpoints. Given a point (x, y coordinates) in such a space, how can we hash it so that the hash lets us determine which cluster it belongs to?
I know that we can make our clusters form a binary tree by adding layers and find where a point belongs using tree search. I also know that we can map each space point (assuming the space is discrete) to a cluster and get O(1) lookups at the cost of storing loads of data. But I'm looking for a way to use an unordered-hash-map style lookup to get a cluster from a given point. How can such a thing be done (algorithmically)?
Build a kd-tree over the Voronoi sites (the cell "centers"); construction takes O(N log N) time, and each nearest-site query then runs in O(log N) expected time.
Note that the query point and its closest site belong to the same cell, by the defining property of the Voronoi diagram.
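A minimal sketch of this, assuming the sites are available as a NumPy array and using scipy's cKDTree (my choice of library, not something from the answer):

    import numpy as np
    from scipy.spatial import cKDTree

    sites = np.random.rand(50, 2)        # hypothetical Voronoi sites
    tree = cKDTree(sites)                # O(N log N) construction
    dist, cell = tree.query((0.3, 0.7))  # nearest site, ~O(log N) expected
    print(f"the point lies in the cell of site {cell}")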
Another option is to build a trapezoidal decomposition of the cells and do point location in it (probably a more complex route).
In a Voronoi tessellation, each "cluster" is the locus of the points whose nearest neighbor is the corresponding midpoint.
Thus, what you are interested in is actually finding the nearest neighbor (among all midpoints) of a new point.
If you intend to use hashing for this purpose, maybe you should read about Locality-sensitive hashing.
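A minimal sketch of the LSH idea in 2-D Euclidean space (the "p-stable" scheme; this illustration and all its names are mine, not from the question): hash a point by quantizing a few random projections, so nearby points usually land in the same bucket of a plain dict, then verify the candidates exactly. A real LSH setup would use several hash tables to drive the miss probability down.

    import random, math

    random.seed(0)
    W = 0.25                                  # bucket width (tuning parameter)
    PROJ = [(random.gauss(0, 1), random.gauss(0, 1), random.uniform(0, W))
            for _ in range(4)]                # 4 random projections

    def lsh_key(x, y):
        return tuple(int((ax * x + ay * y + b) // W) for ax, ay, b in PROJ)

    sites = [(random.random(), random.random()) for _ in range(50)]
    buckets = {}                              # unordered hash map: key -> site ids
    for i, (sx, sy) in enumerate(sites):
        buckets.setdefault(lsh_key(sx, sy), []).append(i)

    # Query: look up the bucket, then verify candidates by exact distance.
    qx, qy = 0.4, 0.6
    cands = buckets.get(lsh_key(qx, qy), range(len(sites)))  # fall back to all
    nearest = min(cands, key=lambda i: math.hypot(sites[i][0] - qx, sites[i][1] - qy))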
Related
I'm trying to find a spatial index structure suitable for a particular problem: using a union-find data structure, I want to connect/associate points that are within a certain range of each other.
I have a lot of points and I'm trying to optimize an existing solution by using a better spatial index.
Right now, I'm using a simple 2D grid over my point map, where each square has width [threshold distance], and I look for potential unions by searching for points in adjacent squares of the grid.
Then I compute the squared Euclidean distance for the adjacent-cell combinations, compare it to my squared threshold, and use the union-find structure (optimized with path compression etc.) to build groups of points.
To illustrate the method: the single black points represent the set of points that belong to one cell of the grid, and the outgoing colored arrows represent the actual distance comparisons with the outside points.
(I also check for potential connected points that belong to the same cell.)
This visiting pattern ensures I never do a distance comparison twice: the "neighbor cell" pattern doesn't overlap with anything already tested as I iterate over the grid cells.
The issue is that this approach is not even close to fast enough, and I'm trying to replace the spatial-grid-index method with something faster.
I've looked into quadtrees as a spatial index for this problem, but I don't think they solve it (I don't see how a quadtree makes the repeated "neighbour" checks for a particular cell any more effective), though maybe I'm wrong about that.
Therefore, I'm looking for a better algorithm/data structure to effectively index my points and query them for proximity.
Thanks in advance.
I have some comments:
1) I think your problem is equivalent to a "spatial join". A spatial join takes two sets of geometries, for example a set R of rectangles and a set P of points, and finds for every rectangle all points in that rectangle. In your case, R would be the rectangles (edge length = 2 * max distance) around each point and P the set of your points. Searching for "spatial join" may give you some useful references.
2) You may want to have a look at space filling curves. Space filling curves create a linear order for a set of spatial entities (points) with the property that points that are close in the linear ordering are usually also close in space (and vice versa). This may be useful when developing an algorithm; see the Morton-code sketch at the end of this answer.
3) Have a look at OpenVDB. OpenVDB has a spatial index structure that is highly optimized to traverse 'voxel' cells and their neighbors.
4) Have a look at the PH-Tree (disclaimer: this is my own project). The PH-Tree is somewhat like a quadtree but uses low-level bit operations to optimize navigation. It is also Z-ordered/Morton-ordered (see space filling curves above). You can create a window query for each point which returns all points within that rectangle. To my knowledge, the PH-Tree is the fastest index structure for this kind of operation, especially if you typically have only 9 points in a rectangle. If you are interested in the code, the V13 implementation is probably the fastest; however, the V16 should be much easier to understand and modify.
I tried it on my rather old desktop machine: using about 1,000,000 points I can do about 200,000 window queries per second, so it should take about 5 seconds to find all neighbors for every point.
If you are using Java, my spatial index collection may also be useful.
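A minimal sketch of the 2-D Morton (Z-order) interleaving mentioned in points 2 and 4 above, assuming coordinates have been quantized to 16-bit non-negative integers (the bit-spreading magic numbers are the standard trick, not anything taken from the PH-Tree code):

    def _spread_bits(v: int) -> int:
        """Spread the lower 16 bits of v, inserting a zero between each bit."""
        v &= 0xFFFF
        v = (v | (v << 8)) & 0x00FF00FF
        v = (v | (v << 4)) & 0x0F0F0F0F
        v = (v | (v << 2)) & 0x33333333
        v = (v | (v << 1)) & 0x55555555
        return v

    def morton2d(x: int, y: int) -> int:
        """Interleave the bits of x and y into a single Z-order key."""
        return _spread_bits(x) | (_spread_bits(y) << 1)

    # Sorting quantized points by their Morton key gives a linear order in
    # which spatially close points tend to end up near each other.
    points = [(3, 5), (10, 2), (4, 4), (11, 3)]
    points.sort(key=lambda p: morton2d(*p))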
A standard approach to this is the "sweep and prune" algorithm. Sort all the points by X coordinate, then iterate through them. As you do, maintain the lowest index of the point which is within the threshold distance (in X) of the current point. The points within that range are candidates for merging. You then do the same thing sorting by Y. Then you only need to check the Euclidean distance for those pairs which showed up in both the X and Y scans.
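A minimal sketch of that idea, assuming points is a list of (x, y) tuples and d is the threshold distance (names are mine); pairs that survive both the X and Y sweeps are then checked against the true Euclidean distance:

    def axis_pairs(points, d, axis):
        # Indices of point pairs whose coordinates on `axis` differ by at most d.
        order = sorted(range(len(points)), key=lambda i: points[i][axis])
        pairs, lo = set(), 0
        for pos, i in enumerate(order):
            while points[i][axis] - points[order[lo]][axis] > d:
                lo += 1  # advance the lower edge of the sweep window
            for j in order[lo:pos]:
                pairs.add((min(i, j), max(i, j)))
        return pairs

    def close_pairs(points, d):
        # Candidates close in both X and Y, refined with the Euclidean test.
        candidates = axis_pairs(points, d, 0) & axis_pairs(points, d, 1)
        return [(i, j) for i, j in candidates
                if (points[i][0] - points[j][0]) ** 2
                 + (points[i][1] - points[j][1]) ** 2 <= d * d]

The surviving pairs are exactly what you would feed to the union-find structure.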
Note that with your current union-find approach, you can end up unioning points which are quite far from each other, if there are a bunch of nearby points "bridging" them. So your basic approach -- of unioning groups of points based on proximity -- can induce an arbitrary amount of distance error, not just the threshold distance.
This was asked during the interview process at a company. Suppose there is an interface to look up the nearest delivery center to your area. All you have to enter is your zipcode/pincode and it returns the nearest delivery center. What data structure and algorithm would do this? For example, you have broken your phone and want to go to a service center: you go to the company website and enter your zipcode to find the nearest repair center. How does it do that?
I suggested a graph + hashmap solution, where I would return the neighbouring nodes of a given node and store addresses in a hashmap keyed by zipcode, but that wasn't good enough. The interviewer kept pressing on using the geographical properties: you are not given the distance between two centers, so how do you know which is nearest? And what if you are asked for the top 3 nearest centers? I was not able to come up with a solution then. He also asked me again and again what data I would need to solve this. It would be really helpful to know an approach, as this has been bugging me for days. Thanks
Most algorithms deal with single points - just taking the centre point of a zip code area should suffice.
For a single nearest neighbour, a Voronoi diagram seems like the way to go.
It separates the space into regions such that, given any query point, we know which point is closest.
(Illustration omitted: a Voronoi diagram, taken from Wikipedia.)
A kd-tree is also an option:
The k-d tree is a binary tree in which every node is a k-dimensional point. Every non-leaf node can be thought of as implicitly generating a splitting hyperplane that divides the space into two parts, known as half-spaces. Points to the left of this hyperplane are represented by the left subtree of that node and points right of the hyperplane are represented by the right subtree. The hyperplane direction is chosen in the following way: every node in the tree is associated with one of the k-dimensions, with the hyperplane perpendicular to that dimension's axis. So, for example, if for a particular split the "x" axis is chosen, all points in the subtree with a smaller "x" value than the node will appear in the left subtree and all points with larger "x" value will be in the right subtree. In such a case, the hyperplane would be set by the x-value of the point, and its normal would be the unit x-axis.
Finding the k nearest neighbours is significantly more difficult. There is a k nearest neighbours algorithm, but this is a classification algorithm, so I'm not sure it helps here.
One option is to create a grid of the region. Then, given a point, we know which cell it's in, and we can simply query that cell and its neighbours until we've found the desired number of neighbours.
One just has to be careful here, as the next nearest point can actually be in another cell, e.g.:
  ----------------
  |             B|
A |      X       |
  |              |
  |              |
  ----------------
Given the query point X, the closest point is A, but B would be returned if we only looked in X's own cell. So even after we've found k points, we still need to examine the neighbouring cells; a sketch of this expanding search follows.
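A minimal sketch of that expanding grid search, assuming points have already been bucketed into cells, a dict mapping (cx, cy) -> list of (x, y) points, with cell size s (all names are mine). The search widens ring by ring and only stops once no unsearched cell could hold a closer point:

    import heapq, math

    def knn_grid(cells, s, query, k):
        # Assumes the grid contains at least k points in total.
        qx, qy = int(query[0] // s), int(query[1] // s)
        found, ring = [], 0
        while True:
            # Visit the cells whose Chebyshev distance from the query's cell is `ring`.
            for cx in range(qx - ring, qx + ring + 1):
                for cy in range(qy - ring, qy + ring + 1):
                    if max(abs(cx - qx), abs(cy - qy)) != ring:
                        continue
                    for p in cells.get((cx, cy), []):
                        heapq.heappush(found, (math.dist(p, query), p))
            # Any point in an unsearched cell is at distance >= ring * s, so we
            # can stop once the k-th best distance found is no larger than that.
            if len(found) >= k and sorted(found)[k - 1][0] <= ring * s:
                return [p for _, p in sorted(found)[:k]]
            ring += 1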
You need the whole road network, which is a sparse matrix containing the distances between all of the nodes. You also need the list of nodes containing the service centers. Given this, I think the A* algorithm should do the job of determining the distance between a given location and each service center, after which you pick the three smallest distances. I am certain there are more efficient algorithms, but I believe the interviewer should concentrate on the way you think through a problem rather than asking for implementation details such as data structures. If I had to solve such a problem in real life, I would do a literature search first.
I am not sure what strategy is best when facing such an interviewer, or whether he would have accepted such a response. Being assertive and providing an overview of the solution before diving into the details might have been better.
Do not have regrets though. Benefit from the experience and move on. You do not know what bounties God has in store for you.
I have a floor on which various sensors are placed at different locations. For every transmitting device, sensors may detect its readings. There may be 6-7 sensors on a floor, and a particular reading may not be detected by some sensors but is detected by others.
For every reading I get, I would like to identify the location of that reading on the floor. We divide the floor logically into TILEs (5x5-foot areas) and compute what the reading at each TILE should ideally be, as detected by each sensor device (based on a transmission path-loss equation).
I am using the precomputed readings from the 'N' sensor devices at each TILE as a point in N-dimensional space. When I get a real-life reading, I find its nearest neighbours and assign the reading to that location.
I would like to know if there is a variant of k nearest neighbours in which a dimension can be REMOVED from consideration. This would be especially useful when a particular sensor is not reporting any reading. My understanding is that putting a weight on a dimension is impossible with algorithms like kd-trees or R-trees; however, I would like to know whether it is possible to discard a dimension when computing nearest neighbours. Is there any such algorithm?
EDIT:
What I want to know is whether the same R-tree/kd-tree could be used for k-nearest searches across different queries, where each query has a different per-dimension weighting. I don't want to construct a new kd-tree for every different weighting of the dimensions.
EDIT 2:
Is there any library in Python which allows you to specify a custom distance function and search for the k nearest neighbours? Essentially, I want to use a different custom distance function for different queries.
Both for R-trees and kd-trees, using weighted Minkowski norms is straightforward. Just put the weights into your distance equations!
Putting weights into the Euclidean point-to-rectangle minimum distance is trivial: just look at the regular formula and plug in the weights as desired.
Distances are not used at tree construction time, so you can vary the weights as desired at query time.
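A minimal brute-force sketch of what "plugging the weights into the distance" looks like, assuming NumPy arrays (my illustration; in an R-tree or kd-tree the same weighted formula would be applied to the point-to-rectangle minimum distances during traversal):

    import numpy as np

    def weighted_knn(points, query, weights, k):
        # k nearest rows of `points` under sqrt(sum_i w_i * (p_i - q_i)^2).
        # A weight of 0 removes that dimension from consideration entirely.
        d2 = ((points - query) ** 2 * weights).sum(axis=1)
        return np.argsort(d2)[:k]

    points = np.random.rand(1000, 3)              # e.g. precomputed TILE signatures
    idx = weighted_knn(points, np.array([0.5, 0.5, 0.5]),
                       weights=np.array([1.0, 1.0, 0.0]),  # third sensor silent
                       k=5)

Since the weights appear only in the query-time distance computation, the same index serves every weighting.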
After going through a lot of questions on Stack Overflow, and finally going into the details of the scipy kd-tree source code, I realised the answer by "celion" in the following link is correct:
KD-Trees and missing values (vector comparison)
Excerpt:
"I think the best solution involves getting your hands dirty in the code that you're working with. Presumably the nearest-neighbor search computes the distance between the point in the tree leaf and the query vector; you should be able to modify this to handle the case where the point and the query vector are different sizes. E.g. if the points in the tree are given in 3D, but your query vector is only length 2, then the "distance" between the point (p0, p1, p2) and the query vector (x0, x1) would be
sqrt( (p0-x0)^2 + (p1-x1)^2 )
I didn't dig into the java code that you linked to, but I can try to find exactly where the change would need to go if you need help.
-Chris
PS - you might not need the sqrt in the equation above, since distance squared is usually equivalent."
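Regarding EDIT 2 above: scipy's cdist accepts a callable metric, so a brute-force search with a per-query custom distance is straightforward (a hedged sketch with names of my choosing; for large datasets, a tree-based search modified as celion describes would be faster):

    import numpy as np
    from scipy.spatial.distance import cdist

    def knn_custom(points, query, metric, k):
        # Brute-force k-NN under an arbitrary callable distance function.
        d = cdist(points, query[None, :], metric=metric).ravel()
        return np.argsort(d)[:k]

    # Distance that ignores dimension 2 (e.g. a sensor that reported nothing):
    ignore_dim2 = lambda u, v: np.sqrt((u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2)

    points = np.random.rand(1000, 3)
    idx = knn_custom(points, np.array([0.2, 0.8, 0.0]), ignore_dim2, k=5)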
I'm developing an algorithm and data structures to handle lookups by Euclidean distance on a large quantity of 2-dimensional points.
I've tried researching this on google scholar but found nothing yet (probably because I don't know what this problem is usually called in the literature).
These are the two approaches I've considered:
Approach 1:
Create a two-dimensional grid of buckets. Insert points into the buckets, keeping a reference from each point to its bucket.
On lookup of point P with distance D, get its bucket B and all buckets any of whose grid-square corners lie within distance D of B.
Finally, enumerate the points in all those buckets and calculate their distance to P.
Approach 2:
Create two lists, each containing all the points, ordered by one of the coordinates (x, y). On lookup of point P with distance D, perform binary searches in each list to find the rectangular region containing the points whose Chebyshev distance to P is < D.
Finally, calculate the Euclidean distance of all those points to P (a sketch of this follows).
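A minimal sketch of Approach 2, assuming pts is a list of (x, y) tuples (function and variable names are mine):

    import bisect, math

    def build(pts):
        by_x = sorted(range(len(pts)), key=lambda i: pts[i][0])
        by_y = sorted(range(len(pts)), key=lambda i: pts[i][1])
        xs = [pts[i][0] for i in by_x]
        ys = [pts[i][1] for i in by_y]
        return by_x, xs, by_y, ys

    def query(pts, index, p, d):
        by_x, xs, by_y, ys = index
        px, py = p
        # Binary-search the slices whose x (resp. y) lies in [px - d, px + d].
        in_x = set(by_x[bisect.bisect_left(xs, px - d):bisect.bisect_right(xs, px + d)])
        in_y = set(by_y[bisect.bisect_left(ys, py - d):bisect.bisect_right(ys, py + d)])
        # The intersection is the Chebyshev box; refine with the Euclidean test.
        return [i for i in in_x & in_y if math.dist(pts[i], p) <= d]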
I'm guessing that state-of-the-art algorithms will be vastly superior to these, though? Any ideas on this are appreciated.
Some tips to help you:
Take a look at the KDTree: a k-dimensional tree (2D in your case), which is one of the best structures for nearest-neighbour lookups.
Perhaps you could benefit from a Spatial Database, specifically developed to deal with Geospatial Data;
You could use any of the above with your desired distance function. Depending on your application, you may want map distance, great-circle distance, constant-slope distance, constant-bearing distance, etc. Your distance function should be known to you. I used to apply great-circle (haversine) distance to deal with google-maps-like maps and tracks.
In case you want a Python implementation, there is scipy.spatial (docs). From this module, the KDTree method query_ball_point((px, py), r) seems to be what you're looking for; a short usage sketch follows.
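A minimal usage sketch (the array contents are made up):

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(100000, 2)               # hypothetical 2-D point set
    tree = cKDTree(points)
    idx = tree.query_ball_point((0.5, 0.5), r=0.01)  # indices of points within radius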
Hope this helps!
I just finished implementing a kd-tree for doing fast nearest neighbor searches. I'm interested in playing around with different distance metrics other than the Euclidean distance. My understanding of the kd-tree is that the speedy kd-tree search is not guaranteed to give exact searches if the metric is non-Euclidean, which means that I might need to implement a new data structure and search algorithm if I want to try out new metrics for my search.
I have two questions:
Does using a kd-tree permanently tie me to the Euclidean distance?
If so, what other sorts of algorithms should I try that work for arbitrary metrics? I don't have a ton of time to implement lots of different data structures, but other structures I'm thinking about include cover trees and vp-trees.
The nearest-neighbour search procedure described on the Wikipedia page you linked to can certainly be generalised to other distance metrics, provided you replace "hypersphere" with the equivalent geometrical object for the given metric, and test each hyperplane for crossings with this object.
Example: if you are using the Manhattan distance instead (i.e. the sum of the absolute values of all differences in vector components), your hypersphere would become a (multidimensional) diamond. (This is easiest to visualise in 2D -- if your current nearest neighbour is at distance x from the query point p, then any closer neighbour behind a different hyperplane must intersect a diamond shape that has width and height 2x and is centred on p). This might make the hyperplane-crossing test more difficult to code or slower to run, however the general principle still applies.
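A minimal sketch of that generalised search, assuming a toy kd-tree built over tuples (entirely my own illustration, not code from the answer). The pruning test compares the per-axis distance to the splitting hyperplane with the current best distance; since the per-axis distance lower-bounds any Minkowski metric (Euclidean, Manhattan, ...), the search stays exact:

    def build(points, depth=0):
        if not points:
            return None
        axis = depth % len(points[0])
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return {"point": points[mid], "axis": axis,
                "left": build(points[:mid], depth + 1),
                "right": build(points[mid + 1:], depth + 1)}

    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    def nearest(node, query, metric=manhattan, best=None):
        if node is None:
            return best
        point, axis = node["point"], node["axis"]
        if best is None or metric(query, point) < metric(query, best):
            best = point
        # Descend first into the side of the hyperplane containing the query.
        near, far = ((node["left"], node["right"]) if query[axis] < point[axis]
                     else (node["right"], node["left"]))
        best = nearest(near, query, metric, best)
        # Prune: only search the far side if the metric "ball" around the query
        # (radius = current best distance) can cross the splitting hyperplane.
        if abs(query[axis] - point[axis]) < metric(query, best):
            best = nearest(far, query, metric, best)
        return best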
I don't think you're tied to Euclidean distance - as j_random_hacker says, you can probably use Manhattan distance - but I'm pretty sure you're tied to geometries that can be represented in Cartesian coordinates. So you couldn't use a kd-tree to index a general metric space, for example.