Finding Closest Point to a Set of Points (in Lat/Long) - algorithm

I am given a random set of points (not told how many) with a latitude and longitude and need to sort them.
I will then be given another random point (lat/long) and need to find the point closest to it (from the set above).
When I implement a linear search (put the set in a list and then calculate all the distances), it takes approximately 7 times longer than permitted.
So, I need a much more efficient sorting algorithm, but I am unsure how I could do this, especially since I'm given points that don't exist on a flat plane.

If your points are moderately well distributed then geometric hashing is a simple method to speed up nearest neighbor searches in practice. The idea is simply to register your objects in grid cells and do your search cell-wise so you can restrict your search to a local neighborhood.
This little python demo applied the idea to circles in the plane:
So in your case you can choose some fixed N and split the longitude coordinates in [0, 2pi] into N equal parts and the latitude coordinates in [0, pi] into N parts. This gives you N^2 cells on the sphere. You register all your initial points at these cells.
When you are given the query point p then you start searching in the cell that is hit by p and in a large enough neighborhood such that you cannot miss the closest point.
If you initially register n points then you could choose N to something like sqrt(n)/4 or so.

Related

Most efficient way to select point with the most surrounding points

N.B: there's a major edit at the bottom of the question - check it out
Question
Say I have a set of points:
I want to find the point with the most points surrounding it, within radius (ie a circle) or within (ie a square) of the point for 2 dimensions. I'll refer to it as the densest point function.
For the diagrams in this question, I'll represent the surrounding region as circles. In the image above, the middle point's surrounding region is shown in green. This middle point has the most surrounding points of all the points within radius and would be returned by the densest point function.
What I've tried
A viable way to solve this problem would be to use a range searching solution; this answer explains further and that it has " worst-case time". Using this, I could get the number of points surrounding each point and choose the point with largest surrounding point count.
However, if the points were extremely densely packed (in the order of a million), as such:
then each of these million points () would need to have a range search performed. The worst-case time , where is the number of points returned in the range, is true for the following point tree types:
kd-trees of two dimensions (which are actually slightly worse, at ),
2d-range trees,
Quadtrees, which have a worst-case time of
So, for a group of points within radius of all points within the group, it gives complexity of for each point. This yields over a trillion operations!
Any ideas on a more efficient, precise way of achieving this, so that I could find the point with the most surrounding points for a group of points, and in a reasonable time (preferably or less)?
EDIT
Turns out that the method above is correct! I just need help implementing it.
(Semi-)Solution
If I use a 2d-range tree:
A range reporting query costs , for returned points,
For a range tree with fractional cascading (also known as layered range trees) the complexity is ,
For 2 dimensions, that is ,
Furthermore, if I perform a range counting query (i.e., I do not report each point), then it costs .
I'd perform this on every point - yielding the complexity I desired!
Problem
However, I cannot figure out how to write the code for a counting query for a 2d layered range tree.
I've found a great resource (from page 113 onwards) about range trees, including 2d-range tree psuedocode. But I can't figure out how to introduce fractional cascading, nor how to correctly implement the counting query so that it is of O(log n) complexity.
I've also found two range tree implementations here and here in Java, and one in C++ here, although I'm not sure this uses fractional cascading as it states above the countInRange method that
It returns the number of such points in worst case
* O(log(n)^d) time. It can also return the points that are in the rectangle in worst case
* O(log(n)^d + k) time where k is the number of points that lie in the rectangle.
which suggests to me it does not apply fractional cascading.
Refined question
To answer the question above therefore, all I need to know is if there are any libraries with 2d-range trees with fractional cascading that have a range counting query of complexity so I don't go reinventing any wheels, or can you help me to write/modify the resources above to perform a query of that complexity?
Also not complaining if you can provide me with any other methods to achieve a range counting query of 2d points in in any other way!
I suggest using plane sweep algorithm. This allows one-dimensional range queries instead of 2-d queries. (Which is more efficient, simpler, and in case of square neighborhood does not require fractional cascading):
Sort points by Y-coordinate to array S.
Advance 3 pointers to array S: one (C) for currently inspected (center) point; other one, A (a little bit ahead) for nearest point at distance > R below C; and the last one, B (a little bit behind) for farthest point at distance < R above it.
Insert points pointed by A to Order statistic tree (ordered by coordinate X) and remove points pointed by B from this tree. Use this tree to find points at distance R to the left/right from C and use difference of these points' positions in the tree to get number of points in square area around C.
Use results of previous step to select "most surrounded" point.
This algorithm could be optimized if you rotate points (or just exchange X-Y coordinates) so that width of the occupied area is not larger than its height. Also you could cut points into vertical slices (with R-sized overlap) and process slices separately - if there are too many elements in the tree so that it does not fit in CPU cache (which is unlikely for only 1 million points). This algorithm (optimized or not) has time complexity O(n log n).
For circular neighborhood (if R is not too large and points are evenly distributed) you could approximate circle with several rectangles:
In this case step 2 of the algorithm should use more pointers to allow insertion/removal to/from several trees. And on step 3 you should do a linear search near points at proper distance (<=R) to distinguish points inside the circle from the points outside it.
Other way to deal with circular neighborhood is to approximate circle with rectangles of equal height (but here circle should be split into more pieces). This results in much simpler algorithm (where sorted arrays are used instead of order statistic trees):
Cut area occupied by points into horizontal slices, sort slices by Y, then sort points inside slices by X.
For each point in each slice, assume it to be a "center" point and do step 3.
For each nearby slice use binary search to find points with Euclidean distance close to R, then use linear search to tell "inside" points from "outside" ones. Stop linear search where the slice is completely inside the circle, and count remaining points by difference of positions in the array.
Use results of previous step to select "most surrounded" point.
This algorithm allows optimizations mentioned earlier as well as fractional cascading.
I would start by creating something like a https://en.wikipedia.org/wiki/K-d_tree, where you have a tree with points at the leaves and each node information about its descendants. At each node I would keep a count of the number of descendants, and a bounding box enclosing those descendants.
Now for each point I would recursively search the tree. At each node I visit, either all of the bounding box is within R of the current point, all of the bounding box is more than R away from the current point, or some of it is inside R and some outside R. In the first case I can use the count of the number of descendants of the current node to increase the count of points within R of the current point and return up one level of the recursion. In the second case I can simply return up one level of the recursion without incrementing anything. It is only in the intermediate case that I need to continue recursing down the tree.
So I can work out for each point the number of neighbours within R without checking every other point, and pick the point with the highest count.
If the points are spread out evenly then I think you will end up constructing a k-d tree where the lower levels are close to a regular grid, and I think if the grid is of size A x A then in the worst case R is large enough so that its boundary is a circle that intersects O(A) low level cells, so I think that if you have O(n) points you could expect this to cost about O(n * sqrt(n)).
You can speed up whatever algorithm you use by preprocessing your data in O(n) time to estimate the number of neighbouring points.
For a circle of radius R, create a grid whose cells have dimension R in both the x- and y-directions. For each point, determine to which cell it belongs. For a given cell c this test is easy:
c.x<=p.x && p.x<=c.x+R && c.y<=p.y && p.y<=c.y+R
(You may want to think deeply about whether a closed or half-open interval is correct.)
If you have relatively dense/homogeneous coverage, then you can use an array to store the values. If coverage is sparse/heterogeneous, you may wish to use a hashmap.
Now, consider a point on the grid. The extremal locations of a point within a cell are as indicated:
Points at the corners of the cell can only be neighbours with points in four cells. Points along an edge can be neighbours with points in six cells. Points not on an edge are neighbours with points in 7-9 cells. Since it's rare for a point to fall exactly on a corner or edge, we assume that any point in the focal cell is neighbours with the points in all 8 surrounding cells.
So, if a point p is in a cell (x,y), N[p] identifies the number of neighbours of p within radius R, and Np[y][x] denotes the number of points in cell (x,y), then N[p] is given by:
N[p] = Np[y][x]+
Np[y][x-1]+
Np[y-1][x-1]+
Np[y-1][x]+
Np[y-1][x+1]+
Np[y][x+1]+
Np[y+1][x+1]+
Np[y+1][x]+
Np[y+1][x-1]
Once we have the number of neighbours estimated for each point, we can heapify that data structure into a maxheap in O(n) time (with, e.g. make_heap). The structure is now a priority-queue and we can pull points off in O(log n) time per query ordered by their estimated number of neighbours.
Do this for the first point and use a O(log n + k) circle search (or some more clever algorithm) to determine the actual number of neighbours the point has. Make a note of this point in a variable best_found and update its N[p] value.
Peek at the top of the heap. If the estimated number of neighbours is less than N[best_found] then we are done. Otherwise, repeat the above operation.
To improve estimates you could use a finer grid, like so:
along with some clever sliding window techniques to reduce the amount of processing required (see, for instance, this answer for rectangular cases - for circular windows you should probably use a collection of FIFO queues). To increase security you can randomize the origin of the grid.
Considering again the example you posed:
It's clear that this heuristic has the potential to save considerable time: with the above grid, only a single expensive check would need to be performed in order to prove that the middle point has the most neighbours. Again, a higher-resolution grid will improve the estimates and decrease the number of expensive checks which need to be made.
You could, and should, use a similar bounding technique in conjunction with mcdowella's answers; however, his answer does not provide a good place to start looking, so it is possible to spend a lot of time exploring low-value points.

Finding all points in certain radius of another point

I am making a simple game and stumbled upon this problem. Assume several points in 2D space. What I want is to make points close to each other interact in some way.
Let me throw a picture here for better understanding of the problem:
Now, the problem isn't about computing the distance. I know how to do that.
At first I had around 10 points and I could simply check every combination, but as you can already assume, this is extremely inefficient with increasing number of points. What if I had a million of points in total, but all of them would be very distant to each other?
I'm trying to find a suitable data structure or a way to look at this problem, so every point can only mind their surrounding and not whole space. Are there any known algorithms for this? I don't exactly know how to name this problem so I can google exactly what I want.
If you don't know of such known algorighm, all ideas are very welcome.
This is a range searching problem. More specifically - the 2-d circular range reporting problem.
Quoting from "Solving Query-Retrieval Problems by Compacting Voronoi Diagrams" [Aggarwal, Hansen, Leighton, 1990]:
Input: A set P of n points in the Euclidean plane E²
Query: Find all points of P contained in a disk in E² with radius r centered at q.
The best results were obtained in "Optimal Halfspace Range Reporting in Three Dimensions" [Afshani, Chan, 2009]. Their method requires O(n) space data structure that supports queries in O(log n + k) worst-case time. The structure can be preprocessed by a randomized algorithm that runs in O(n log n) expected time. (n is the number of input points, and k in the number of output points).
The CGAL library supports circular range search queries. See here.
You're still going to have to iterate through every point, but there are two optimizations you can perform:
1) You can eliminate obvious points by checking if x1 < radius and if y1 < radius (like Brent already mentioned in another answer).
2) Instead of calculating the distance, you can calculate the square of the distance and compare it to the square of the allowed radius. This saves you from performing expensive square root calculations.
This is probably the best performance you're gonna get.
This looks like a nearest neighbor problem. You should be using the kd tree for storing the points.
https://en.wikipedia.org/wiki/K-d_tree
Space partitioning is what you want.. https://en.wikipedia.org/wiki/Quadtree
If you could get those points to be sorted by x and y values, then you could quickly pick out those points (binary search?) which are within a box of the central point: x +- r, y +- r. Once you have that subset of points, then you can use the distance formula to see if they are within the radius.
I assume you have a minimum and maximum X and Y coordinate? If so how about this.
Call our radius R, Xmax-Xmin X, and Ymax-Ymin Y.
Have a 2D matrix of [X/R, Y/R] of double-linked lists. Put each dot structure on the correct linked list.
To find dots you need to interact with, you only need check your cell plus your 8 neighbors.
Example: if X and Y are 100 each, and R is 1, then put a dot at 43.2, 77.1 in cell [43,77]. You'll check cells [42,76] [43,76] [44,76] [42,77] [43,77] [44,77] [42,78] [43,78] [44,78] for matches. Note that not all cells in your own box will match (for instance 43.9,77.9 is in the same list but more than 1 unit distant), and you'll always need to check all 8 neighbors.
As dots move (it sounds like they'd move?) you'd simply unlink them (fast and easy with a double-link list) and relink in their new location. Moving any dot is O(1). Moving them all is O(n).
If that array size gives too many cells, you can make bigger cells with the same algo and probably same code; just be prepared for fewer candidate dots to actually be close enough. For instance if R=1 and the map is a million times R by a million times R, you wouldn't be able to make a 2D array that big. Better perhaps to have each cell be 1000 units wide? As long as density was low, the same code as before would probably work: check each dot only against other dots in this cell plus the neighboring 8 cells. Just be prepared for more candidates failing to be within R.
If some cells will have a lot of dots, each cell having a linked list, perhaps the cell should have an red-black tree indexed by X coordinate? Even in the same cell the vast majority of other cell members will be too far away so just traverse the tree from X-R to X+R. Rather than loop over all dots, and go diving into each one's tree, perhaps you could instead iterate through the tree looking for X coords within R and if/when you find them calculate the distance. As you traverse one cell's tree from low to high X, you need only check the neighboring cell to the left's tree while in the first R entries.
You could also go to cells smaller than R. You'd have fewer candidates that fail to be close enough. For instance with R/2, you'd check 25 link lists instead of 9, but have on average (if randomly distributed) 25/36ths as many dots to check. That might be a minor gain.

Partition of graph to locate a point

I have an area which was already partitioned into tens of sub-areas (think like a country divided into states).
Now I have a point coordinate, what is the best algorithm to tell me which state the point in?
Of course I can match sub-area by sub-area but that's stupid because I have to search through half of them in average right?
Is there an algorithm to determine how to group several adjacent sub-area together to facilitate search, so as to optimize the number of search?
I would start by eliminating all areas that cannot have the point inside of them.
Let's assume that you have a 2D Cartesian coordinate system, you have a point as a 2D-vector and the areas are described as a collection of their boundary points.
Then you can sort the areas by their smallest and largest x and y coordinate (in total 4 ways of sorting). You can eliminate all areas which have their smallest x coordinate bigger than the x coordinate of your point etc.
After that, you can check the remaining polygons with a simple ray-casting algorithm and you should be good.
This is very efficient if you have a structure which keeps the areas sorted in all the different directions because you can eliminate the areas in logarithmic time.

Generating random points with defined minimum and maximum distance

I need algorithm ideas for generating points in 2D space with defined minimum and maximum possible distances between points.
Bassicaly, i want to find a good way to insert a point in a 2D space filled with points, in such manner that the point has random location, but is also more than MINIMUM_DISTANCE_NUM and less than MAXIMUM_DISTANCE_NUM away from nearest points.
I need it for a game, so it should be fast, and not depending on random probability.
Store the set of points in a Kd tree. Generate a new point at random, then examine its nearest neighbors which can be looked up quickly in the Kd tree. If the point is accepted (i.e. MIN_DIST < nearest neighbor < MAX_DIST), then add it to the tree.
I should note that this will work best under conditions where the points are not too tightly packed, i.e. MIN * N << L where N is the number of points and L is the dimension of the box you are putting them in. If this is not true, then most new points will be rejected. But think about it, in this limit, you're packing marbles into a box, and the arrangement can't be very "random" at all above a certain density.
You could use a 2D regular grid of Points (P0,P1,P2,P3,...,P(m*n), when m is width and n is height of this grid )
Each point is associated with 1) a boolean saying wether this grid point was used or not and 2) a 'shift' from this grid position to avoid too much regularity. (or you can put the point+shift coordinates in your grid allready)
Then when you need a new point, just pick a random point of your grid which was not used, state this point as 'used' and use the Point+shift in your game.
Depending on n, m, width/height of your 2D space, and the number of points you're gonna use, this could be just fine.
A good option for this is to use Poisson-Disc sampling. The algorithm is efficient (O(n)) and "produces points that are tightly-packed, but no closer to each other than a specified minimum distance, resulting in a more natural pattern".
https://www.jasondavies.com/poisson-disc/
How many points are you talking about? If you have an upper limit to the number of points, you can generate (pre-compute) an array of points and store them in the array. Take that array and store them in a file.
You'll have all the hard computational processing done before the map loads (so that way you can use any random point generating algorithm) and then you have a nice fast way to get your points.
That way you can generate a ton of different maps and then randomly select one of the maps to generate your points.
This only works if you have an upper limit and you can precomputer the points before the game loads

Suitable choice of data structure and algorithm for fast k-Nearest Neighbor search in 2D

I have a dataset of approximately 100,000 (X, Y) pairs representing points in 2D space. For each point, I want to find its k-nearest neighbors.
So, my question is - what data-structure / algorithm would be a suitable choice, assuming I want to absolutely minimise the overall running time?
I'm not looking for code - just a pointer towards a suitable approach. I'm a bit daunted by the range of choices that seem relevent - quad-trees, R-trees, kd-trees, etc.
I'm thinking the best approach is to build a data structure, then run some kind of k-Nearest Neighbor search for each point. However, since (a) I know the points in advance, and (b) I know I must run the search for every point exactly once, perhaps there is a better approach?
Some extra details:
Since I want to minimise the entire running time, I don't care if the majority of time is spent on structure vs search.
The (X, Y) pairs are fairly well spread out, so we can assume an almost uniform distribution.
If k is relatively small (<20 or so) and you have an approximately uniform distribution, create a grid that overlays the range where the points fall, chosen so that the average number of points per grid is comfortably higher than k (so that a centrally-located point will usually get its k neighbors in that one grid point). Then create a set of other grids set half-off from the first (overlapping) along each axis. Now for each point, compute which grid element it falls into (since the grids are regular, no searching is required) and pick the one of four (or howevermany overlapping grids you have) that has that point closest to its center.
Within each grid element, the points should be sorted in one coordinate (let's say x). Starting at the element you chose (find it using bisection), walk outwards along the sorted list until you have found k items (again, if k is small, the fastest way to maintain a list of the k best hits is with binary insertion sort, letting the worst match fall off the end when you insert; insertion sort generally beats everything else up to about 30 items on modern hardware). Keep going until your most distant nearest neighbor is closer to you than the next points away from you in x (i.e. not counting their y-offset, so there could be no new point that could be closer than the kth-closest found so far).
If you do not have k points yet, or you have k points but one or more walls of the grid element are closer to your point of interest than the farthest of the k points, add the relevant adjacent grid elements into the search.
This should give you performance of something like O(N*k^2), with a relatively low constant factor. If k is large, then this strategy is too simplistic and you should choose an algorithm that is linear or log-linear in k, like kd-trees can be.

Resources