N.B: there's a major edit at the bottom of the question - check it out
Question
Say I have a set of points:
I want to find the point with the most points surrounding it, within radius (ie a circle) or within (ie a square) of the point for 2 dimensions. I'll refer to it as the densest point function.
For the diagrams in this question, I'll represent the surrounding region as circles. In the image above, the middle point's surrounding region is shown in green. This middle point has the most surrounding points of all the points within radius and would be returned by the densest point function.
What I've tried
A viable way to solve this problem would be to use a range searching solution; this answer explains further and that it has " worst-case time". Using this, I could get the number of points surrounding each point and choose the point with largest surrounding point count.
However, if the points were extremely densely packed (in the order of a million), as such:
then each of these million points () would need to have a range search performed. The worst-case time , where is the number of points returned in the range, is true for the following point tree types:
kd-trees of two dimensions (which are actually slightly worse, at ),
2d-range trees,
Quadtrees, which have a worst-case time of
So, for a group of points within radius of all points within the group, it gives complexity of for each point. This yields over a trillion operations!
Any ideas on a more efficient, precise way of achieving this, so that I could find the point with the most surrounding points for a group of points, and in a reasonable time (preferably or less)?
EDIT
Turns out that the method above is correct! I just need help implementing it.
(Semi-)Solution
If I use a 2d-range tree:
A range reporting query costs , for returned points,
For a range tree with fractional cascading (also known as layered range trees) the complexity is ,
For 2 dimensions, that is ,
Furthermore, if I perform a range counting query (i.e., I do not report each point), then it costs .
I'd perform this on every point - yielding the complexity I desired!
Problem
However, I cannot figure out how to write the code for a counting query for a 2d layered range tree.
I've found a great resource (from page 113 onwards) about range trees, including 2d-range tree psuedocode. But I can't figure out how to introduce fractional cascading, nor how to correctly implement the counting query so that it is of O(log n) complexity.
I've also found two range tree implementations here and here in Java, and one in C++ here, although I'm not sure this uses fractional cascading as it states above the countInRange method that
It returns the number of such points in worst case
* O(log(n)^d) time. It can also return the points that are in the rectangle in worst case
* O(log(n)^d + k) time where k is the number of points that lie in the rectangle.
which suggests to me it does not apply fractional cascading.
Refined question
To answer the question above therefore, all I need to know is if there are any libraries with 2d-range trees with fractional cascading that have a range counting query of complexity so I don't go reinventing any wheels, or can you help me to write/modify the resources above to perform a query of that complexity?
Also not complaining if you can provide me with any other methods to achieve a range counting query of 2d points in in any other way!
I suggest using plane sweep algorithm. This allows one-dimensional range queries instead of 2-d queries. (Which is more efficient, simpler, and in case of square neighborhood does not require fractional cascading):
Sort points by Y-coordinate to array S.
Advance 3 pointers to array S: one (C) for currently inspected (center) point; other one, A (a little bit ahead) for nearest point at distance > R below C; and the last one, B (a little bit behind) for farthest point at distance < R above it.
Insert points pointed by A to Order statistic tree (ordered by coordinate X) and remove points pointed by B from this tree. Use this tree to find points at distance R to the left/right from C and use difference of these points' positions in the tree to get number of points in square area around C.
Use results of previous step to select "most surrounded" point.
This algorithm could be optimized if you rotate points (or just exchange X-Y coordinates) so that width of the occupied area is not larger than its height. Also you could cut points into vertical slices (with R-sized overlap) and process slices separately - if there are too many elements in the tree so that it does not fit in CPU cache (which is unlikely for only 1 million points). This algorithm (optimized or not) has time complexity O(n log n).
For circular neighborhood (if R is not too large and points are evenly distributed) you could approximate circle with several rectangles:
In this case step 2 of the algorithm should use more pointers to allow insertion/removal to/from several trees. And on step 3 you should do a linear search near points at proper distance (<=R) to distinguish points inside the circle from the points outside it.
Other way to deal with circular neighborhood is to approximate circle with rectangles of equal height (but here circle should be split into more pieces). This results in much simpler algorithm (where sorted arrays are used instead of order statistic trees):
Cut area occupied by points into horizontal slices, sort slices by Y, then sort points inside slices by X.
For each point in each slice, assume it to be a "center" point and do step 3.
For each nearby slice use binary search to find points with Euclidean distance close to R, then use linear search to tell "inside" points from "outside" ones. Stop linear search where the slice is completely inside the circle, and count remaining points by difference of positions in the array.
Use results of previous step to select "most surrounded" point.
This algorithm allows optimizations mentioned earlier as well as fractional cascading.
I would start by creating something like a https://en.wikipedia.org/wiki/K-d_tree, where you have a tree with points at the leaves and each node information about its descendants. At each node I would keep a count of the number of descendants, and a bounding box enclosing those descendants.
Now for each point I would recursively search the tree. At each node I visit, either all of the bounding box is within R of the current point, all of the bounding box is more than R away from the current point, or some of it is inside R and some outside R. In the first case I can use the count of the number of descendants of the current node to increase the count of points within R of the current point and return up one level of the recursion. In the second case I can simply return up one level of the recursion without incrementing anything. It is only in the intermediate case that I need to continue recursing down the tree.
So I can work out for each point the number of neighbours within R without checking every other point, and pick the point with the highest count.
If the points are spread out evenly then I think you will end up constructing a k-d tree where the lower levels are close to a regular grid, and I think if the grid is of size A x A then in the worst case R is large enough so that its boundary is a circle that intersects O(A) low level cells, so I think that if you have O(n) points you could expect this to cost about O(n * sqrt(n)).
You can speed up whatever algorithm you use by preprocessing your data in O(n) time to estimate the number of neighbouring points.
For a circle of radius R, create a grid whose cells have dimension R in both the x- and y-directions. For each point, determine to which cell it belongs. For a given cell c this test is easy:
c.x<=p.x && p.x<=c.x+R && c.y<=p.y && p.y<=c.y+R
(You may want to think deeply about whether a closed or half-open interval is correct.)
If you have relatively dense/homogeneous coverage, then you can use an array to store the values. If coverage is sparse/heterogeneous, you may wish to use a hashmap.
Now, consider a point on the grid. The extremal locations of a point within a cell are as indicated:
Points at the corners of the cell can only be neighbours with points in four cells. Points along an edge can be neighbours with points in six cells. Points not on an edge are neighbours with points in 7-9 cells. Since it's rare for a point to fall exactly on a corner or edge, we assume that any point in the focal cell is neighbours with the points in all 8 surrounding cells.
So, if a point p is in a cell (x,y), N[p] identifies the number of neighbours of p within radius R, and Np[y][x] denotes the number of points in cell (x,y), then N[p] is given by:
N[p] = Np[y][x]+
Np[y][x-1]+
Np[y-1][x-1]+
Np[y-1][x]+
Np[y-1][x+1]+
Np[y][x+1]+
Np[y+1][x+1]+
Np[y+1][x]+
Np[y+1][x-1]
Once we have the number of neighbours estimated for each point, we can heapify that data structure into a maxheap in O(n) time (with, e.g. make_heap). The structure is now a priority-queue and we can pull points off in O(log n) time per query ordered by their estimated number of neighbours.
Do this for the first point and use a O(log n + k) circle search (or some more clever algorithm) to determine the actual number of neighbours the point has. Make a note of this point in a variable best_found and update its N[p] value.
Peek at the top of the heap. If the estimated number of neighbours is less than N[best_found] then we are done. Otherwise, repeat the above operation.
To improve estimates you could use a finer grid, like so:
along with some clever sliding window techniques to reduce the amount of processing required (see, for instance, this answer for rectangular cases - for circular windows you should probably use a collection of FIFO queues). To increase security you can randomize the origin of the grid.
Considering again the example you posed:
It's clear that this heuristic has the potential to save considerable time: with the above grid, only a single expensive check would need to be performed in order to prove that the middle point has the most neighbours. Again, a higher-resolution grid will improve the estimates and decrease the number of expensive checks which need to be made.
You could, and should, use a similar bounding technique in conjunction with mcdowella's answers; however, his answer does not provide a good place to start looking, so it is possible to spend a lot of time exploring low-value points.
Related
I have non empty Set of points scattered on plane, they are given by their coordinates.
Problem is to quickly reply such queries:
Give me the point from your set which is nearest to the point A(x, y)
My current solution pseudocode
query( given_point )
{
nearest_point = any point from Set
for each point in Set
if dist(point, query_point) < dist(nearest_point, given_point)
nearest_point = point
return nearest_point
}
But this algorithm is very slow with complexity is O(N).
The question is, is there any data structure or tricky algorithms with precalculations which will dramatically reduce time complexity? I need at least O(log N)
Update
By distance I mean Euclidean distance
You can get O(log N) time using a kd-tree. This is like a binary search tree, except that it splits points first on the x-dimension, then the y-dimension, then the x-dimension again, and so on.
If your points are homogeneously distributed, you can achieve O(1) look-up by binning the points into evenly-sized boxes and then searching the box in which the query point falls and its eight neighbouring boxes.
It would be difficult to make an efficient solution from Voronoi diagrams since this requires that you solve the problem of figuring out which Voronoi cell the query point falls in. Much of the time this involves building an R*-tree to query the bounding boxes of the Voronoi cells (in O(log N) time) and then performing point-in-polygon checks (O(p) in the number of points in the polygon's perimeter).
You can divide your grid in subsections:
Depending on the number of points and grid size, you choose a useful division. Let's assume a screen of 1000x1000 pixels, filled with random points, evenly distributed over the surface.
You may divide the screen into 10x10 sections and make a map (roughX, roughY)->(List ((x, y), ...). For a certain point, you may lookup all points in the same cell and - since the point may be closer to points of the neighbor cell than to an extreme point in the same cell, the surrounding cells, maybe even 2 cells away. This would reduce the searching scope to 16 cells.
If you don't find a point in the same cell/layer, expand the search to next layer.
If you happen to find the next neighbor in one of the next layers, you have to expand the searching scope to an additional layer for each layer. If there are too many points, choose a finer grid. If there are to few points, choose a bigger grid. Note, that the two green circles, connected to the red with a line, have the same distance to the red one, but one is in layer 0 (same cell) but the other layer 2 (next of next cell).
Without preprocessing you definitely need to spend O(N), as you must look at every point before return the closest.
You can look here Nearest neighbor search for how to approach this problem.
Given a small number of points and circles (say under 100), how do I tell which point lies in which circles? The circles can intersect, so one point can lie in multiple circles.
If it's of any relevance, both points and circle centers are aligned on a hexagonal grid, and the radii of the circles are also aligned to the grid.
With a bit of thought, it seems the worse case scenario would always be quadratic (when each point lies in all circles) ... but there might be some way to make this faster for the average case when there aren't that many intersections?
I'm doing this for an AI simulation and the circle/point locations change all the time, so I can't really pre-compute anything ahead of time.
If the number of points and circles is that small, you probably will get away with brute-forcing it. Circle-point intersections are pretty cheap, and 100 * 100 checks a frame shouldn't harm performance at all.
If you are completely sure that this routine is the bottleneck and needs to be optimized, read on.
You can try using a variation of Bounding Volume Hierarchies.
A bounding volume hierarchy is a tree in which each node covers the entire volume of both (or more if you decide to use a tree with higher degree) of its children. The volumes/objects that have to be tested for intersections are always the leaf nodes of the tree.
Insertion, removal and intersection queries have an amortized average run-time of O(log n). You will however have to update the tree, as your objects are dynamic, which is done by removing and reinserting invalid nodes (nodes which do not contain their leaf nodes fully any more). Updating the full tree takes a worst case time of O(n log n).
Care should be taken that while insertion, a node should be inserted into that sub-tree that increases the sub-tree's volume by the least amount.
Here is a good blog post by Randy Gaul which explains dynamic bounding hierarchies well.
You'll have to use circles as the bounding volumes, unless you can find a way to use AABBs in all nodes except leaf nodes, and circles as leaf nodes. AABBs are more accurate and should give you a slightly better constructed tree.
You can build a kd-tree of the points. And then for each circle center you retrieve all the points of the kd-tree with distance bounded by the circle radius. Given M points and N circles the complexity should be M log M + N log M = max(M,N) log M (if points and circles are "well distributed").
Whether you can gain anything compared to a brute-force pair-wise check depends on the geometric structure of your points and circles. If, for instance, the radii of the circles are big in relation to the distances of the points or the distances of the cirlce centers then there is not much to expect, I think.
Rather than going to a full 2D-tree, there is an intermediate possibility based on sorting.
Sort the P points on the abscissas. With a good sorting algorithm (say Heapsort), the cost can be modeled as S.P.Lg(P) (S is the cost of comparisons/moves).
Then, for every circle (C of them), locate its leftmost point (Xc-R) in the sorted list by dichotomy, with a cost D.Lg(P) (D is the cost of a dichotomy step). Then step to the rightmost point (Xc+R) and perform the point/circle test every time.
Doing this, you will spare the comparisons with the points to the left and to the right of the circle. Let F denote the average fraction of the points which fall in the range [Xc-R, Xc+R] for all circles.
Denoting K the cost of a point/circle comparison, the total can be estimated as
S.P.Lg(P) + D.Lg(P).C + F.K.P.C
to be compared to K.P.C.
The ratio is
S/K.Lg(P)/C + D/K.Lg(P)/P + F.
With the unfavorable hypothesis that S=D=K, for P=C=100 we get 6.6% + 6.6% + F. These three terms respectively correspond to the preprocessing time, an acceleration overhead and the reduced workload.
Assuming resonably small circles, let F = 10%, and you can hope a speedup x4.
If you are using a bounding box test before the exact point/circle comparison (which is not necessarily an improvement), you can simplify the bounding box test to two Y comparisons, as the X overlap is implicit.
Is there an algorithm that for a given 2d position finds the closest point on a 2d polyline consisting of n - 1 line segments (n line vertices) in constant time? The naive solution is to traverse all segments, test the minimum distance of each segment to the given position and then for the closest segment, calculate the exact closest point to the given position, which has a complexity of O(n). Unfortunately, hardware constraints prevent me from using any type of loop or pointers, meaning also no optimizations like quadtrees for a hierarchical lookup of the closest segment in O(log n).
I have theoretically unlimited time to pre-calculate any datastructure that can be used for a lookup and this pre-calculation can be arbitrarily complex, only the lookup at runtime itself needs to be in O(1). However, the second constraint of the hardware is that I only have very limited memory, meaning that it is not feasible to find the closest point on the line for each numerically possible position of the domain and storing this in a huge array. In other words, the memory consumption should be in O(n^x).
So it comes down to the question how to find the closest segment of a polyline or its index given a 2d position without any loops. Is this possible?
Edit: About the given position … it can be quite arbitrary, but it is reasonable to consider only positions in the closer neighborhood of a line, given by a constant maximum distance.
Create a single axis-aligned box that contains all of your line segments with some padding. Discretize it into a WxH grid of integer indexes. For each grid cell, compute the nearest line segment, and store its index in that grid cell.
To query a point, in O(1) time compute which grid cell it falls in. Lookup the index of the nearest line segment. Do the standard O(1) algorithm to compute exactly the nearest point on the line.
This is an O(1) almost-exact algorithm that will take O(WH) space, where WH is the number of cells in the grid.
For example, here is the subdivision of the space imposed by some line segments:
Here is a 9x7 tiling of the space, where each color corresponds to an edge index: red (0), green (1), blue (2), purple (3). Notice how the discretizing of the space introduces some error. You would of course use a much finer subdivision of the space to reduce that error to as much as you want, at the cost of having to store a larger grid. This coarse tiling is meant for illustration only.
You can keep your algorithm O(1) and make it even more almost-exact by taking your query point, identifying what cell it lies in, and then looking at the 8 neighboring cells in addition to that cell. Determine the set of edges that those 9 cells identify. (The set contains at most 9 edges.) Then for each edge find the closest point. Then keep the closest among those (at most 9) closest points.
In any case, this approach will always fail for some pathological case, so you'll have to factor that into deciding whether you want to use this.
You can find the closest geometric point on a line in O(1) time, but that won't tell you which of the given vertices is closest to it. The best you can do for that is a binary search, which is O(log n), but of course requires a loop or recursion.
If you're designing VLSI or FPGA, you can evaluate all the vertices in parallel. Then, you can compare neighbors, and do a big wired-or to encode the index of the segment that straddles the closest geometric point. You'll technically get some sort of O(log n) delay based on the number of elements in the wired-or, but that kind of thing is usually treated as near-constant.
You can optimize this type of search using an R-Tree which is a general purpose spatial data structure support fast searches. It's not a constant time algorithm; it's average case is O(log n).
You said that you can pre-calculate the data structure, but you could not use any loops. However is there some limitation that prevents any loops? Arbitrary searches are not likely to hit an existing datapoint so it must at least look left and right in a tree.
This SO answer contains some links to libraries:
Java commercial-friendly R-tree implementation?
I am making a simple game and stumbled upon this problem. Assume several points in 2D space. What I want is to make points close to each other interact in some way.
Let me throw a picture here for better understanding of the problem:
Now, the problem isn't about computing the distance. I know how to do that.
At first I had around 10 points and I could simply check every combination, but as you can already assume, this is extremely inefficient with increasing number of points. What if I had a million of points in total, but all of them would be very distant to each other?
I'm trying to find a suitable data structure or a way to look at this problem, so every point can only mind their surrounding and not whole space. Are there any known algorithms for this? I don't exactly know how to name this problem so I can google exactly what I want.
If you don't know of such known algorighm, all ideas are very welcome.
This is a range searching problem. More specifically - the 2-d circular range reporting problem.
Quoting from "Solving Query-Retrieval Problems by Compacting Voronoi Diagrams" [Aggarwal, Hansen, Leighton, 1990]:
Input: A set P of n points in the Euclidean plane E²
Query: Find all points of P contained in a disk in E² with radius r centered at q.
The best results were obtained in "Optimal Halfspace Range Reporting in Three Dimensions" [Afshani, Chan, 2009]. Their method requires O(n) space data structure that supports queries in O(log n + k) worst-case time. The structure can be preprocessed by a randomized algorithm that runs in O(n log n) expected time. (n is the number of input points, and k in the number of output points).
The CGAL library supports circular range search queries. See here.
You're still going to have to iterate through every point, but there are two optimizations you can perform:
1) You can eliminate obvious points by checking if x1 < radius and if y1 < radius (like Brent already mentioned in another answer).
2) Instead of calculating the distance, you can calculate the square of the distance and compare it to the square of the allowed radius. This saves you from performing expensive square root calculations.
This is probably the best performance you're gonna get.
This looks like a nearest neighbor problem. You should be using the kd tree for storing the points.
https://en.wikipedia.org/wiki/K-d_tree
Space partitioning is what you want.. https://en.wikipedia.org/wiki/Quadtree
If you could get those points to be sorted by x and y values, then you could quickly pick out those points (binary search?) which are within a box of the central point: x +- r, y +- r. Once you have that subset of points, then you can use the distance formula to see if they are within the radius.
I assume you have a minimum and maximum X and Y coordinate? If so how about this.
Call our radius R, Xmax-Xmin X, and Ymax-Ymin Y.
Have a 2D matrix of [X/R, Y/R] of double-linked lists. Put each dot structure on the correct linked list.
To find dots you need to interact with, you only need check your cell plus your 8 neighbors.
Example: if X and Y are 100 each, and R is 1, then put a dot at 43.2, 77.1 in cell [43,77]. You'll check cells [42,76] [43,76] [44,76] [42,77] [43,77] [44,77] [42,78] [43,78] [44,78] for matches. Note that not all cells in your own box will match (for instance 43.9,77.9 is in the same list but more than 1 unit distant), and you'll always need to check all 8 neighbors.
As dots move (it sounds like they'd move?) you'd simply unlink them (fast and easy with a double-link list) and relink in their new location. Moving any dot is O(1). Moving them all is O(n).
If that array size gives too many cells, you can make bigger cells with the same algo and probably same code; just be prepared for fewer candidate dots to actually be close enough. For instance if R=1 and the map is a million times R by a million times R, you wouldn't be able to make a 2D array that big. Better perhaps to have each cell be 1000 units wide? As long as density was low, the same code as before would probably work: check each dot only against other dots in this cell plus the neighboring 8 cells. Just be prepared for more candidates failing to be within R.
If some cells will have a lot of dots, each cell having a linked list, perhaps the cell should have an red-black tree indexed by X coordinate? Even in the same cell the vast majority of other cell members will be too far away so just traverse the tree from X-R to X+R. Rather than loop over all dots, and go diving into each one's tree, perhaps you could instead iterate through the tree looking for X coords within R and if/when you find them calculate the distance. As you traverse one cell's tree from low to high X, you need only check the neighboring cell to the left's tree while in the first R entries.
You could also go to cells smaller than R. You'd have fewer candidates that fail to be close enough. For instance with R/2, you'd check 25 link lists instead of 9, but have on average (if randomly distributed) 25/36ths as many dots to check. That might be a minor gain.
I have a dataset of approximately 100,000 (X, Y) pairs representing points in 2D space. For each point, I want to find its k-nearest neighbors.
So, my question is - what data-structure / algorithm would be a suitable choice, assuming I want to absolutely minimise the overall running time?
I'm not looking for code - just a pointer towards a suitable approach. I'm a bit daunted by the range of choices that seem relevent - quad-trees, R-trees, kd-trees, etc.
I'm thinking the best approach is to build a data structure, then run some kind of k-Nearest Neighbor search for each point. However, since (a) I know the points in advance, and (b) I know I must run the search for every point exactly once, perhaps there is a better approach?
Some extra details:
Since I want to minimise the entire running time, I don't care if the majority of time is spent on structure vs search.
The (X, Y) pairs are fairly well spread out, so we can assume an almost uniform distribution.
If k is relatively small (<20 or so) and you have an approximately uniform distribution, create a grid that overlays the range where the points fall, chosen so that the average number of points per grid is comfortably higher than k (so that a centrally-located point will usually get its k neighbors in that one grid point). Then create a set of other grids set half-off from the first (overlapping) along each axis. Now for each point, compute which grid element it falls into (since the grids are regular, no searching is required) and pick the one of four (or howevermany overlapping grids you have) that has that point closest to its center.
Within each grid element, the points should be sorted in one coordinate (let's say x). Starting at the element you chose (find it using bisection), walk outwards along the sorted list until you have found k items (again, if k is small, the fastest way to maintain a list of the k best hits is with binary insertion sort, letting the worst match fall off the end when you insert; insertion sort generally beats everything else up to about 30 items on modern hardware). Keep going until your most distant nearest neighbor is closer to you than the next points away from you in x (i.e. not counting their y-offset, so there could be no new point that could be closer than the kth-closest found so far).
If you do not have k points yet, or you have k points but one or more walls of the grid element are closer to your point of interest than the farthest of the k points, add the relevant adjacent grid elements into the search.
This should give you performance of something like O(N*k^2), with a relatively low constant factor. If k is large, then this strategy is too simplistic and you should choose an algorithm that is linear or log-linear in k, like kd-trees can be.