I'm trying to find a spatial index structure suitable for a particular problem : using a union-find data structure, I want to connect\associate points that are within a certain range of each other.
I have a lot of points and I'm trying to optimize an existing solution by using a better spatial index.
Right now, I'm using a simple 2D grid indexing each square of width [threshold distance] of my point map, and I look for potential unions by searching for points in adjacent squares in the grid.
Then I compute the squared Euclidean distance to the adjacent cells combinations, which I compare to my squared threshold, and I use the union-find structure (optimized using path compression and etc.) to build groups of points.
Here is some illustration of the method. The single black points actually represent the set of points that belong to a cell of the grid, and the outgoing colored arrows represent the actual distance comparisons with the outside points.
(I'm also checking for potential connected points that belong to the same cells).
By using this pattern I make sure I'm not doing any distance comparison twice by using a proper "neighbor cell" pattern that doesn't overlap with already tested stuff when I iterate over the grid cells.
Issue is : this approach is not even close to being fast enough, and I'm trying to replace the "spatial grid index" method with something that could maybe be faster.
I've looked into quadtrees as a suitable spatial index for this problem, but I don't think it is suitable to solve it (I don't see any way of performing repeated "neighbours" checks for a particular cell more effectively using a quadtree), but maybe I'm wrong on that.
Therefore, I'm looking for a better algorithm\data structure to effectively index my points and query them for proximity.
Thanks in advance.
I have some comments:
1) I think your problem is equivalent to a "spatial join". A spatial join takes two sets of geometries, for example a set R of rectangles and a set P of points and finds for every rectangle all points in that rectangle. In Your case, R would be the rectangles (edge length = 2 * max distance) around each point and P the set of your points. Searching for spatial join may give you some useful references.
2) You may want to have a look at space filling curves. Space filling curves create a linear order for a set of spatial entities (points) with the property that points that a close in the linear ordering are usually also close in space (and vice versa). This may be useful when developing an algorithm.
3) Have look at OpenVDB. OpenVDB has a spatial index structure that is highly optimized to traverse 'voxel'-cells and their neighbors.
4) Have a look at the PH-Tree (disclaimer: this is my own project). The PH-Tree is a somewhat like a quadtree but uses low level bit operations to optimize navigation. It is also Z-ordered/Morten-ordered (see space filling curves above). You can create a window-query for each point which returns all points within that rectangle. To my knowledge, the PH-Tree is the fastest index structure for this kind of operation, especially if you typically have only 9 points in a rectangle. If you are interested in the code, the V13 implementation is probably the fastest, however the V16 should be much easier to understand and modify.
I tried on my rather old desktop machine, using about 1,000,000 points I can do about 200,000 window queries per second, so it should take about 5 second to find all neighbors for every point.
If you are using Java, my spatial index collection may also be useful.
A standard approach to this is the "sweep and prune" algorithm. Sort all the points by X coordinate, then iterate through them. As you do, maintain the lowest index of the point which is within the threshold distance (in X) of the current point. The points within that range are candidates for merging. You then do the same thing sorting by Y. Then you only need to check the Euclidean distance for those pairs which showed up in both the X and Y scans.
Note that with your current union-find approach, you can end up unioning points which are quite far from each other, if there are a bunch of nearby points "bridging" them. So your basic approach -- of unioning groups of points based on proximity -- can induce an arbitrary amount of distance error, not just the threshold distance.
Related
I have possibly billions of multipoint (sometimes more than 20) polygonal regions (varying sizes , intersecting, nested) distributed throughout the span of a country. Though, the polygonal regions or 'fences' will not usually be too large and most of them will be negligible in size compared to the span . The points of the polygons are described in terms of it's GPS co-ordinates. I need a fast algorithm to check whether the current GPS location of the device falls within any of the enclosed 'fences'.
I have thought of using RAY CASTING ALGORITHM (point in polygon test) after sufficiently narrowing down on the candidates using some search algorithm. The question now being what method would serve as an efficient 2D SEARCH ALGORITHM keeping the following two considerations-
Speed of processing
Data structure should be compact as the storage space is limited (flash storage)
The regions are static and constant. Hence , the 2 dimensional data can be pre sorted to any requirement before being stored.
The unit is to be put on a vehicle , and hence the search need not be repeated for regions sufficiently far from the vehicle's proximity. It's acceptable to have an initial boot up time provided the successive real-time look-ups are extremely fast.
I initially thought of using KD trees to perform a boundary box query after reducing each 'fence' to a point ( approx. by using a point within it) followed by ray casting algorithm.
I have also been looking at hilbert curves and geo hash for the same.
Another method I'm considering is to use RECTANGLES that just enclose the fences. We choose these rectangles for each fence such that it is aligned to the GRID (sides aligned to the latitude and longitude).
That is , find out the max and min value of the latitudes and longitudes for a fence individually , let them be LAT 1, LAT 2, LONG 1 , LONG 2.
Now the points (LAT1, LONG1) , (LAT1, LONG2) , (LAT2, LONG1) , (LAT2, LONG2) are the co-ordinates of the rectangle (aligned to the GRID) that the fence must necessarily be contained in. (I do understand that it will not be a rectangle in the geometrical sense, nonetheless. )
Now use R tree to search for the RECTANGLES that the current gps location falls within. This will narrow the search down to very few results , each of which can be individually tested using RAY CASTING ALGORITHM.
I could also use a hybrid method by using K-D tree for the initial chunk and then applying R tree search.
I am more than open to any other method as well.
What do you think is the best way to approach this problem statement?
EDIT - Removed the size restriction on the polygonal fences.
It sounds like your set of polygons is static. If so, one of the fastest ways is likely to be binary space partitioning. Very briefly, building a BSP tree means picking a single edge from among your billions of polygons, splitting any polygons that cross this (infinite) line into two halves, partitioning the polygons into those on each side of the line, and recursively building a BSP tree on each side. At the bottommost level of the tree are convex polygons that are wholly inside some set of original polygons -- so e.g. if two original polygons A and B intersect but one doesn't fully contain the other, then this will produce at least 3 "basic" polygons at the lowest level, one for the region A but not B, one for B but not A, and one for A and B. (More may be needed if any of these are not convex.) Each of these can be tagged with the list of original polygons that it is part of.
In the ideal case where no polygons are split, finding the complete set of polygons that you are inside is O(log(total number of edges)) once the BSP tree has been built, since deciding which child to visit next in the tree is just an O(1) test of which side of a line the query point is on.
Building the BSP tree takes quite a bit of work up front, especially if you want to make sure it does as little polygon subdivision as possible. It also requires that your polygons are convex, but if that's not the case, you can easily make them so by triangulating them first.
If your set of polygons grows slowly over time, you can of course complement the above technique with regular point-in-polygon testing for the new polygons, occasionally rebuilding the BSP tree when some threshold is reached.
You can try a hierarchical point-in-polygon test for example kirkpatrick data structure but its very complicated. Personaly I would try rectangles, kd-tree, hilbert curves, quadtrees.
In my problem, there are N points in the domain and they are somehow randomly distributed. For each point I need to find all neighbor points with distance less than a given double precision floating number, DIST.
Is there an efficient way to do this in Thrust?
In serial, I would use a neighborhood table and hope to achieve approximately O(n) instead of naive algorithm of O(n^2).
I have found a thrust example for 2D bucket sort, which is a perfect fit for the first part of my problem. But that is not enough, because for each bucket I need to find all points in the neighbor buckets, and then compute their distances and see if any of them is less than DIST. Finding neighbors and compute distance should be relatively easy, but adding those eligible points to a result array seems really difficult for me to implement in Thrust.
A way to rephrase this particular problem is this -- I have two 2D arrays A1 and A2, the column number represent the index of the 2D bucket and each column have different number of elements that are indices of my points. Each element in column(i) of A1 will form a potential pair with each element in colunm(i) of A2, and all eligible pairs should be recorded to a result array.
I could use a CUDA kernel and allocating tons of potentially unused memory as a workaround, but that would be the last thing I would want to do.
Thanks in advance.
The full solution is out of the scope of a single Stack Overflow answer, but there's a discussion on how to use Thrust to build a 2D spatial index in this repository:
https://github.com/jaredhoberock/thrust-workshop
Another possibility, simpler than creating a quad-tree, is using a neighborhood matrix.
First place all your points into a 2D square matrix (or 3D cubic grid, if you are dealing with three dimensions). Then you can run a full or partial spatial sort, so points will became ordered inside the matrix.
Points with small Y could move to the top rows of the matrix, and likewise, points with large Y would go to the bottom rows. The same will happen with points with small X coordinates, that should move to the columns on the left. And symmetrically, points with large X value will go to the right columns.
After you did the spatial sort (there are many ways to achieve this, both by serial or parallel algorithms) you can lookup the nearest points of a given point P by just visiting the adjacent cells where point P is actually stored in the neighborhood matrix.
If this matrix is placed into texture memory, you can use all the spatial caching from CUDA to have very fast accesses to all neighbors!
You can read more details for this idea in the following paper (you will find PDF copies of it online): Supermassive Crowd Simulation on GPU based on Emergent Behavior.
The sorting step gives you interesting choices. You can use just the even-odd transposition sort described in the paper, which is very simple to implement (even in CUDA). If you run just one pass of this, it will give you a partial sort, which can be already useful if your matrix is near-sorted. That is, if your points move slowly, it will save you a lot of computation.
If you need a full sort, you can run such even-odd transposition pass several times (as described in the following Wikipedia page):
http://en.wikipedia.org/wiki/Odd%E2%80%93even_sort
There is a second paper from the same authors, describing an extension to 3D and using three passes of the bitonic sort (which is highly parallel, but it is not a spatial sort). they claim it is both more precise than a single even-odd transposition pass and more efficient than a full sort. The paper is A Neighborhood Grid Data Structure for Massive 3D Crowd Simulation on GPU.
I am trying to write a spatial data structure (such as a K-D tree or a QuadTree) which, given a point, will find the x closest points to it.
The issue with the data structures I mentioned above is that they support mostly a radial/region search. So they will obtain the points that are within a radius of y of a given point/node.
Altering those structures search for what I want would be inefficient. I am assuming I will need to repeat the radial search several times, starting from a short radial distance, and keep increasing it until I have the wanted x amount of points close to the given point. Of course, this defeats the whole purpose behind the data structure.
Almost all spatial data structures operate on radial search. What are other efficient search methods I could apply to a QuadTree, or any other spatial data structures I need to consider to achieve what I mean? Any suggestions?
I'm not sure that you are right in your assumptions. The Wikipedia article on kd-trees indicates how the structure can be used to support finding the x nearest neighbours to a search point. Yes, it is essentially a repetition of finding the nearest neighbour x times, but I'm not sure that you have a right to expect a more efficient performance from an algorithm over a kd-tree.
If that is not good enough for you perhaps you need to store your points in a different data structure. If x is small and bounded you could store your points in a weighted graph where the edge weights are, of course, the distances between points.
If x is neither small nor bounded you might employ a simple subdivision of space into k*m uniform cells (2D here, inflate to 3+D if necessary). For each search point go straight to the cell which contains it, find the other points in the same cell. If x of them are closer to the search point than the boundary of the cell, those are what you are looking for. If not, search in the cells on the other side of the near boundaries too.
If you find yourself needing to support both radial/region searches and x-nearest neighbour searches it's not the end of the world if you have to maintain 2 data structures, one to support each type of query. For many search problems the first step to an efficient solution is to put the data into the right structure for efficient searching. Making this decision depends on numbers you simply haven't provided us.
If you do call the search method several times over on a quadtree (which is what I've done a few times), then if you double the search radius on each call until you have correct number of points, the search is not that inefficient.
Assuming a 2d space, if the correct minimum radius to contain the X points is R1, and you keep on doubling until you find a radius R2 which contains them, then (a) R2 must be less than 2xR1 and (b) the area searched becomes 4 times bigger on each search, which (I think) gives you a worst case scenario of only half the area you've searched through actually being unnecessary (or thereabouts).
Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points which has largest number of neighbors within given distance R. Another condition is that the distance between any two points of those top N points must be greater than R. The distribution of those points are not uniform. It is very common that certain regions of the space contain a lot of points.
Goal:
To find an algorithm that can scale well to many processors and has a small memory requirement.
Thoughts:
Normal spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. irregular spatial decomposition that evenly divide the number of points may help us the problem. I will really appreciate that if someone can shed some lights on how to solve this problem.
Use an Octree. For 3D data with a limited value domain that scales very well to huge data sets.
Many of the aforementioned methods such as locality sensitive hashing are approximate versions designed for much higher dimensionality where you can't split sensibly anymore.
Splitting at each level into 8 bins (2^d for d=3) works very well. And since you can stop when there are too few points in a cell, and build a deeper tree where there are a lot of points that should fit your requirements quite well.
For more details, see Wikipedia:
https://en.wikipedia.org/wiki/Octree
Alternatively, you could try to build an R-tree. But the R-tree tries to balance, making it harder to find the most dense areas. For your particular task, this drawback of the Octree is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found at approximately the same time. However, you are only interested in the dense areas, which will be found on the longest paths in the Octree without even having to look at the actual points yet!
I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.
I think it's worth investigating locality-sensitive hashing. I think dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that for a given set of points divided into buckets, the points satisfying your criteria are likely to exist in the fullest buckets.
Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).
If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.
Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.
Here are some possible parts of a solution.
There are various choices at each stage,
which will depend on Ncluster, on how fast the data changes,
and on what you want to do with the means.
3 steps: quantize, box, K-means.
1) quantize: reduce the input XYZ coordinates to say 8 bits each,
by taking 2^8 percentiles of X,Y,Z separately.
This will speed up the whole flow without much loss of detail.
You could sort all 1G points, or just a random 1M,
to get 8-bit x0 < x1 < ... x256, y0 < y1 < ... y256, z0 < z1 < ... z256
with 2^(30-8) points in each range.
To map float X -> 8 bit x, unrolled binary search is fast —
see Bentley, Pearls p. 95.
Added: Kd trees
split any point cloud into different-sized boxes, each with ~ Leafsize points —
much better than splitting X Y Z as above.
But afaik you'd have to roll your own Kd tree code
to split only the first say 16M boxes, and keep counts only, not the points.
2) box: count the number of points in each 3d box,
[xj .. xj+1, yj .. yj+1, zj .. zj+1].
The average box will have 2^(30-3*8) points;
the distribution will depend on how clumpy the data is.
If some boxes are too big or get too many points, you could
a) split them into 8,
b) track the centre of the points in each box,
otherwide just take box midpoints.
3)
K-means clustering
on the 2^(3*8) box centres.
(Google parallel "k means" -> 121k hits.)
This depends strongly on K aka Ncluster, also on your radius R.
A rough approach would be to grow a
heap
of the say 27*Ncluster boxes with the most points,
then take the biggest ones subject to your Radius constraint.
(I like to start with a
Minimum spanning tree,
then remove the K-1 longest links to get K clusters.)
See also
Color quantization .
I'd make Nbit, here 8, a parameter from the beginning.
What is your Ncluster ?
Added: if your points are moving in time, see
collision-detection-of-huge-number-of-circles on SO.
I would also suggest to use an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (aka 3D box).
After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, the OctoMap is very easy to adopt.
Here you can see how it was extended to model the point distribution using a planar model.
Just an idea. Create a graph with given points and edges between points when distance < R.
Creation of this kind of graph is similar to spatial decomposition. Your questions can be answered with local search in graph. First are vertices with max degree, second is finding of maximal unconnected set of max degree vertices.
I think creation of graph and search can be made parallel. This approach can have large memory requirement. Splitting domain and working with graphs for smaller volumes can reduce memory need.
I have a collection of 2D coordinate sets (on the scale of a 100K-500K points in each set) and I am looking for the most efficient way to measure the similarity of 1 set to the other. I know of the usuals: Cosine, Jaccard/Tanimoto, etc. However I am hoping for some suggestions on any fast/efficient ones to measure similarity, especially ones that can cluster by similarity.
Edit 1: The image shows what I need to do. I need to cluster all the reds, blues and greens by their shape/orientatoin, etc.
alt text http://img402.imageshack.us/img402/8121/curves.png
It seems that the first step of any solution is going to be to find the centroid, or other reference point, of each shape, so that they can be compared regardless of absolute position.
One algorithm that comes to mind would be to start at the point nearest the centroid and walk to its nearest neighbors. Compare the offsets of those neighbors (from the centroid) between the sets being compared. Keep walking to the next-nearest neighbors of the centroid, or the nearest not-already-compared neighbors of the ones previously compared, and keep track of the aggregate difference (perhaps RMS?) between the two shapes. Also, at each step of this process calculate the rotational offset that would bring the two shapes into closest alignment [and whether mirroring affects it as well?]. When you are finished you will have three values for every pair of sets, including their direct similarity, their relative rotational offset (mostly only useful if they are close matches after rotation), and their similarity after rotation.
Try K-means algorithm. It dynamically calculated the centroid of each cluster and calculates distance to all the pointers and associates them to the nearest cluster.
Since your clustering is based on a nearness-to-shape metric, perhaps you need some form of connected component labeling. UNION-FIND can give you a fast basic set primitive.
For union-only, start every point in a different set, and merge them if they meet some criterion of nearness, influenced by local colinearity since that seems important to you. Then keep merging until you pass some over-threshold condition for how difficult your merge is. If you treat it like line-growing (only join things at their ends) then some data structures become simpler. Are all your clusters open lines and curves? No closed curves, like circles?
The crossing lines are trickier to get right, you either have to find some way merge then split, or you set your merge criteria to extremely favor colinearity and you luck out on the crossing lines.