Finding point with largest weight within a region - algorithm

My problem is:
We have a set of N points in a 2D space, each of which has a weight. Given any rectangle region R, how to efficiently return the point with the largest weight inside R?
Note that all query regions R have the same shape, i.e. same lengths and widths. And point and rectangle coordinates are float numbers.
My initial idea is to use an R-tree to store the points. For a region R, extract all points in R, and then find the point with the maximum weight. The time complexity is O(log N + V), where V is the number of points in R. Can we do better?
I searched for a solution, but without success. Any suggestions?
Thanks,

This sounds like a range max query problem in 2D. You have some very good algorithms here.
Simpler would be to just use a 2D segment tree.
Basically, each node of your segment tree will store 2d regions instead of 1d intervals. So each node will have 4 children, reducing to a quad tree, which you can then operate on as you would a classic segment tree. This is described in detail here.
This will be O(log n) per query, where n is the total number of points. And it also allows you to do a lot more operations, such as update a point's weight, update a region's weights etc.

How about adding an additional attribute to each tree node which contains the maximum weight of all the points contained by any of its children.
This will be easy to update while adding points. It'll be a little more work to maintain when you remove a point that changes the maximum. You'll have to traverse the tree backwards and update the max value for all parent nodes.
With this attribute, if you want to retrieve the maximum-weight point then when you query the tree with your query region you only inspect the child node with the maximum weight as you traverse the tree. Note, you may have more than one point with the same maximum weight so you may have more than one child node to inspect.
Only inspecting the child nodes with a maximum weight attribute will improve your query throughput at the expense of more memory and slower time building/modifying the tree.
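As a rough illustration of the max-weight attribute, here is a minimal Python sketch. It assumes a simple static k-d tree rather than an R-tree, and the names Node, build and query_max are made up for this example. Each node stores the bounding box and the heaviest point of its subtree, so a query can skip any subtree that lies outside the rectangle or cannot beat the best point found so far.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]          # (x, y, weight)
Rect = Tuple[float, float, float, float]    # (xmin, ymin, xmax, ymax), closed


@dataclass
class Node:
    point: Point                 # point stored at this node
    bbox: Rect                   # bounding box of the whole subtree
    best: Point                  # heaviest point in the whole subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a static k-d tree, alternating the split axis by depth."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    node = Node(points[mid], (min(xs), min(ys), max(xs), max(ys)),
                max(points, key=lambda p: p[2]))
    node.left = build(points[:mid], depth + 1)
    node.right = build(points[mid + 1:], depth + 1)
    return node


def query_max(node: Optional[Node], rect: Rect,
              best: Optional[Point] = None) -> Optional[Point]:
    """Return the heaviest point inside rect, or None if rect is empty."""
    if node is None:
        return best
    if best is not None and node.best[2] <= best[2]:
        return best                          # subtree cannot improve the answer
    bx0, by0, bx1, by1 = node.bbox
    rx0, ry0, rx1, ry1 = rect
    if bx1 < rx0 or bx0 > rx1 or by1 < ry0 or by0 > ry1:
        return best                          # subtree is outside the rectangle
    if rx0 <= bx0 and bx1 <= rx1 and ry0 <= by0 and by1 <= ry1:
        return node.best                     # subtree is entirely inside
    x, y, w = node.point
    if rx0 <= x <= rx1 and ry0 <= y <= ry1 and (best is None or w > best[2]):
        best = node.point
    # Visit the child with the larger stored maximum first to prune sooner.
    children = [c for c in (node.left, node.right) if c is not None]
    children.sort(key=lambda c: c.best[2], reverse=True)
    for child in children:
        best = query_max(child, rect, best)
    return best
```

The pruning helps most when heavy points are found early; in the worst case the query still has to walk every node whose bounding box straddles the rectangle boundary.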

Look up so-called range trees, which in your case you would want to implement in 2 dimensions. This would be a 2-layer "tree of trees": you first split the set of points based on x-coordinate, and then, for the set of points at each node of the resulting tree, you build a second tree based on y-coordinate. You can look up how to adapt a 2-d range tree to return the number of points in the query rectangle in O((log n)^2) time, independent of the number of points in the rectangle. Similarly, instead of storing the count of points for subrectangles in the range tree, you can store the maximum objective value (weight) of the points within that subrectangle. This gives you guaranteed O(n log n) storage and construction time, and O((log n)^2) query time, regardless of the number of points in the query rectangle.
An adaptation of so-called "fractional cascading" for range-tree "find all points in query rectangle" might even be able to get your query time down to O(log n), but I'm not sure since you are taking max of value of points within the query rectangle.

Hint:
Every point has a "zone of influence", which is the locus of the positions of the (top-left corner of the) rectangle such that this point dominates. The set of the zones of influence defines a partition of the plane. Every edge of the partition occurs at the abscissa [ordinate] of a given point or its abscissa [ordinate] minus the width [height] of the query region.
If you map the coordinate values to their rank (by sorting on both axis), you can represent the partition as a digital image of size 4N². To precompute this image, initialize it with minus infinity, and for every point you fill its zone of influence with its weight, taking the maximum. If the query window size is R² pixels on average, the cost of constructing the image is NR².
A query is made by finding the row and column of the relevant pixel and returning the pixel value. This takes two dichotomic searches, in time Lg(N).
This approach is only realistic for moderate values of N (say up to 1000). Better insight can be gained on this problem by studying the geometry of the partition map.

You can try a weighted Voronoi diagram in which the positive weight is subtracted from the Euclidean distance. Sites with a big weight tend to have big cells when nearby sites have small weights. Then sort the cells by the number of sites and compute a minimum bounding box for each cell. Match it with the rectangular search box.

Related

Most efficient way to select point with the most surrounding points

N.B: there's a major edit at the bottom of the question - check it out
Question
Say I have a set of points:
I want to find the point with the most points surrounding it, within radius R (i.e. a circle) or within ±R (i.e. a square) of the point, for 2 dimensions. I'll refer to it as the densest point function.
For the diagrams in this question, I'll represent the surrounding region as circles. In the image above, the middle point's surrounding region is shown in green. This middle point has the most surrounding points of all the points within radius R and would be returned by the densest point function.
What I've tried
A viable way to solve this problem would be to use a range searching solution; this answer explains further and states that it has "O(log n + k) worst-case time". Using this, I could get the number of points surrounding each point and choose the point with the largest surrounding point count.
However, if the points were extremely densely packed (in the order of a million), as such:
then each of these million points (n = 10^6) would need to have a range search performed. The worst-case time O(log n + k), where k is the number of points returned in the range, is true for the following point tree types:
kd-trees of two dimensions (which are actually slightly worse, at O(sqrt(n) + k)),
2d-range trees,
Quadtrees, which have a worst-case time of O(n).
So, for a group of n points within radius R of all points within the group, it gives complexity of O(n) for each point. This yields over a trillion operations!
Any ideas on a more efficient, precise way of achieving this, so that I could find the point with the most surrounding points for a group of points, and in a reasonable time (preferably O(n log n) or less)?
EDIT
Turns out that the method above is correct! I just need help implementing it.
(Semi-)Solution
If I use a 2d-range tree:
A range reporting query costs O(log^d n + k), for k returned points,
For a range tree with fractional cascading (also known as layered range trees) the complexity is O(log^(d-1) n + k),
For 2 dimensions, that is O(log n + k),
Furthermore, if I perform a range counting query (i.e., I do not report each point), then it costs O(log n).
I'd perform this on every point - yielding the O(n log n) complexity I desired!
Problem
However, I cannot figure out how to write the code for a counting query for a 2d layered range tree.
I've found a great resource (from page 113 onwards) about range trees, including 2d-range tree pseudocode. But I can't figure out how to introduce fractional cascading, nor how to correctly implement the counting query so that it is of O(log n) complexity.
I've also found two range tree implementations here and here in Java, and one in C++ here, although I'm not sure this uses fractional cascading as it states above the countInRange method that
It returns the number of such points in worst case O(log(n)^d) time. It can also return the points that are in the rectangle in worst case O(log(n)^d + k) time where k is the number of points that lie in the rectangle.
which suggests to me it does not apply fractional cascading.
Refined question
To answer the question above, therefore, all I need to know is whether there are any libraries with 2d-range trees with fractional cascading that have a range counting query of complexity O(log n), so I don't go reinventing any wheels, or can you help me to write/modify the resources above to perform a query of that complexity?
I'm also not complaining if you can provide me with any other method to achieve a range counting query of 2d points in O(log n)!
I suggest using plane sweep algorithm. This allows one-dimensional range queries instead of 2-d queries. (Which is more efficient, simpler, and in case of square neighborhood does not require fractional cascading):
1. Sort points by Y-coordinate to array S.
2. Advance 3 pointers to array S: one (C) for currently inspected (center) point; other one, A (a little bit ahead) for nearest point at distance > R below C; and the last one, B (a little bit behind) for farthest point at distance < R above it.
3. Insert points pointed by A to Order statistic tree (ordered by coordinate X) and remove points pointed by B from this tree. Use this tree to find points at distance R to the left/right from C and use difference of these points' positions in the tree to get number of points in square area around C.
4. Use results of previous step to select "most surrounded" point.
This algorithm could be optimized if you rotate points (or just exchange X-Y coordinates) so that width of the occupied area is not larger than its height. Also you could cut points into vertical slices (with R-sized overlap) and process slices separately - if there are too many elements in the tree so that it does not fit in CPU cache (which is unlikely for only 1 million points). This algorithm (optimized or not) has time complexity O(n log n).
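The sweep for the square neighbourhood can be sketched in Python as below. This is only an illustration under a few assumptions: a Fenwick (binary indexed) tree over the sorted distinct X values stands in for the order statistic tree, the square is treated as closed, and the names Fenwick and densest_point_square are invented here.

```python
from bisect import bisect_left, bisect_right


class Fenwick:
    """Binary indexed tree holding counts per X rank."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        """Add delta at 0-based position i."""
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def prefix(self, i):
        """Sum of positions [0, i)."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total


def densest_point_square(points, r):
    """Return (index, neighbour count) of the point with the most other
    points inside the closed square of half-width r centred on it."""
    order = sorted(range(len(points)), key=lambda i: points[i][1])  # by Y
    xs = sorted({p[0] for p in points})
    rank = {x: i for i, x in enumerate(xs)}
    active = Fenwick(len(xs))
    lead = trail = 0
    best_idx, best_count = None, -1
    for c in order:
        cx, cy = points[c]
        # Pointer ahead of the centre: add points with y <= cy + r.
        while lead < len(order) and points[order[lead]][1] <= cy + r:
            active.add(rank[points[order[lead]][0]], 1)
            lead += 1
        # Pointer behind the centre: remove points with y < cy - r.
        while points[order[trail]][1] < cy - r:
            active.add(rank[points[order[trail]][0]], -1)
            trail += 1
        lo = bisect_left(xs, cx - r)
        hi = bisect_right(xs, cx + r)
        count = active.prefix(hi) - active.prefix(lo) - 1  # exclude the centre
        if count > best_count:
            best_idx, best_count = c, count
    return best_idx, best_count
```

Each point is added and removed once and each step costs O(log n), which matches the O(n log n) bound stated above.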
For circular neighborhood (if R is not too large and points are evenly distributed) you could approximate circle with several rectangles:
In this case step 2 of the algorithm should use more pointers to allow insertion/removal to/from several trees. And on step 3 you should do a linear search near points at proper distance (<=R) to distinguish points inside the circle from the points outside it.
Other way to deal with circular neighborhood is to approximate circle with rectangles of equal height (but here circle should be split into more pieces). This results in much simpler algorithm (where sorted arrays are used instead of order statistic trees):
1. Cut area occupied by points into horizontal slices, sort slices by Y, then sort points inside slices by X.
2. For each point in each slice, assume it to be a "center" point and do step 3.
3. For each nearby slice use binary search to find points with Euclidean distance close to R, then use linear search to tell "inside" points from "outside" ones. Stop linear search where the slice is completely inside the circle, and count remaining points by difference of positions in the array.
4. Use results of previous step to select "most surrounded" point.
This algorithm allows optimizations mentioned earlier as well as fractional cascading.
I would start by creating something like a https://en.wikipedia.org/wiki/K-d_tree, where you have a tree with points at the leaves and each node holds information about its descendants. At each node I would keep a count of the number of descendants, and a bounding box enclosing those descendants.
Now for each point I would recursively search the tree. At each node I visit, either all of the bounding box is within R of the current point, all of the bounding box is more than R away from the current point, or some of it is inside R and some outside R. In the first case I can use the count of the number of descendants of the current node to increase the count of points within R of the current point and return up one level of the recursion. In the second case I can simply return up one level of the recursion without incrementing anything. It is only in the intermediate case that I need to continue recursing down the tree.
So I can work out for each point the number of neighbours within R without checking every other point, and pick the point with the highest count.
If the points are spread out evenly then I think you will end up constructing a k-d tree where the lower levels are close to a regular grid, and I think if the grid is of size A x A then in the worst case R is large enough so that its boundary is a circle that intersects O(A) low level cells, so I think that if you have O(n) points you could expect this to cost about O(n * sqrt(n)).
You can speed up whatever algorithm you use by preprocessing your data in O(n) time to estimate the number of neighbouring points.
For a circle of radius R, create a grid whose cells have dimension R in both the x- and y-directions. For each point, determine to which cell it belongs. For a given cell c this test is easy:
c.x<=p.x && p.x<=c.x+R && c.y<=p.y && p.y<=c.y+R
(You may want to think deeply about whether a closed or half-open interval is correct.)
If you have relatively dense/homogeneous coverage, then you can use an array to store the values. If coverage is sparse/heterogeneous, you may wish to use a hashmap.
Now, consider a point on the grid. The extremal locations of a point within a cell are as indicated:
Points at the corners of the cell can only be neighbours with points in four cells. Points along an edge can be neighbours with points in six cells. Points not on an edge are neighbours with points in 7-9 cells. Since it's rare for a point to fall exactly on a corner or edge, we assume that any point in the focal cell is neighbours with the points in all 8 surrounding cells.
So, if a point p is in a cell (x,y), N[p] identifies the number of neighbours of p within radius R, and Np[y][x] denotes the number of points in cell (x,y), then N[p] is given by:
N[p] = Np[y][x]+
Np[y][x-1]+
Np[y-1][x-1]+
Np[y-1][x]+
Np[y-1][x+1]+
Np[y][x+1]+
Np[y+1][x+1]+
Np[y+1][x]+
Np[y+1][x-1]
Once we have the number of neighbours estimated for each point, we can heapify that data structure into a maxheap in O(n) time (with, e.g. make_heap). The structure is now a priority-queue and we can pull points off in O(log n) time per query ordered by their estimated number of neighbours.
Do this for the first point and use a O(log n + k) circle search (or some more clever algorithm) to determine the actual number of neighbours the point has. Make a note of this point in a variable best_found and update its N[p] value.
Peek at the top of the heap. If the estimated number of neighbours is less than N[best_found] then we are done. Otherwise, repeat the above operation.
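The estimate-then-verify loop could look roughly like this in Python. It is a sketch under assumptions: the grid is a hash map keyed by cell index, the exact check simply scans the 3x3 block of cells instead of using an O(log n + k) circle search, and the function name densest_point_circle is made up.

```python
import heapq
from collections import defaultdict
from math import floor


def densest_point_circle(points, r):
    """Return (index, neighbour count) of the point with the most other
    points within Euclidean distance r."""
    # Bucket point indices into square cells of side r.
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cells[(floor(x / r), floor(y / r))].append(i)

    def block(i):
        # The 3x3 block of cells around point i; every point within
        # distance r of point i lies somewhere in this block.
        cx, cy = floor(points[i][0] / r), floor(points[i][1] / r)
        return [cells.get((cx + dx, cy + dy), [])
                for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

    # Upper bound per point: everything in its 3x3 block except itself.
    # Negated because heapq is a min-heap.
    heap = [(-(sum(len(c) for c in block(i)) - 1), i)
            for i in range(len(points))]
    heapq.heapify(heap)

    best_idx, best_count = None, -1
    # Stop once the largest remaining estimate cannot beat the best exact count.
    while heap and -heap[0][0] > best_count:
        _, i = heapq.heappop(heap)
        px, py = points[i]
        exact = sum(1 for cell in block(i) for j in cell
                    if j != i
                    and (points[j][0] - px) ** 2 + (points[j][1] - py) ** 2 <= r * r)
        if exact > best_count:
            best_idx, best_count = i, exact
    return best_idx, best_count
```

How many exact checks are needed depends on how tight the grid estimate is; with many points sharing the same estimate the loop may still verify a large share of them.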
To improve estimates you could use a finer grid, like so:
along with some clever sliding window techniques to reduce the amount of processing required (see, for instance, this answer for rectangular cases - for circular windows you should probably use a collection of FIFO queues). To increase security you can randomize the origin of the grid.
Considering again the example you posed:
It's clear that this heuristic has the potential to save considerable time: with the above grid, only a single expensive check would need to be performed in order to prove that the middle point has the most neighbours. Again, a higher-resolution grid will improve the estimates and decrease the number of expensive checks which need to be made.
You could, and should, use a similar bounding technique in conjunction with mcdowella's answers; however, his answer does not provide a good place to start looking, so it is possible to spend a lot of time exploring low-value points.

Given a few points and circles, how can I tell which point lies in which circles?

Given a small number of points and circles (say under 100), how do I tell which point lies in which circles? The circles can intersect, so one point can lie in multiple circles.
If it's of any relevance, both points and circle centers are aligned on a hexagonal grid, and the radii of the circles are also aligned to the grid.
With a bit of thought, it seems the worst-case scenario would always be quadratic (when each point lies in all circles) ... but there might be some way to make this faster for the average case when there aren't that many intersections?
I'm doing this for an AI simulation and the circle/point locations change all the time, so I can't really pre-compute anything ahead of time.
If the number of points and circles is that small, you probably will get away with brute-forcing it. Circle-point intersections are pretty cheap, and 100 * 100 checks a frame shouldn't harm performance at all.
If you are completely sure that this routine is the bottleneck and needs to be optimized, read on.
You can try using a variation of Bounding Volume Hierarchies.
A bounding volume hierarchy is a tree in which each node covers the entire volume of both (or more if you decide to use a tree with higher degree) of its children. The volumes/objects that have to be tested for intersections are always the leaf nodes of the tree.
Insertion, removal and intersection queries have an amortized average run-time of O(log n). You will however have to update the tree, as your objects are dynamic, which is done by removing and reinserting invalid nodes (nodes which do not contain their leaf nodes fully any more). Updating the full tree takes a worst case time of O(n log n).
Care should be taken during insertion: a node should be inserted into the sub-tree whose volume it increases by the least amount.
Here is a good blog post by Randy Gaul which explains dynamic bounding hierarchies well.
You'll have to use circles as the bounding volumes, unless you can find a way to use AABBs in all nodes except leaf nodes, and circles as leaf nodes. AABBs are more accurate and should give you a slightly better constructed tree.
You can build a kd-tree of the points. And then for each circle center you retrieve all the points of the kd-tree with distance bounded by the circle radius. Given M points and N circles the complexity should be M log M + N log M = max(M,N) log M (if points and circles are "well distributed").
Whether you can gain anything compared to a brute-force pair-wise check depends on the geometric structure of your points and circles. If, for instance, the radii of the circles are big in relation to the distances of the points or the distances of the circle centers then there is not much to expect, I think.
Rather than going to a full 2D-tree, there is an intermediate possibility based on sorting.
Sort the P points on the abscissas. With a good sorting algorithm (say Heapsort), the cost can be modeled as S.P.Lg(P) (S is the cost of comparisons/moves).
Then, for every circle (C of them), locate its leftmost point (Xc-R) in the sorted list by dichotomy, with a cost D.Lg(P) (D is the cost of a dichotomy step). Then step to the rightmost point (Xc+R) and perform the point/circle test every time.
Doing this, you will spare the comparisons with the points to the left and to the right of the circle. Let F denote the average fraction of the points which fall in the range [Xc-R, Xc+R] for all circles.
Denoting K the cost of a point/circle comparison, the total can be estimated as
S.P.Lg(P) + D.Lg(P).C + F.K.P.C
to be compared to K.P.C.
The ratio is
S/K.Lg(P)/C + D/K.Lg(P)/P + F.
With the unfavorable hypothesis that S=D=K, for P=C=100 we get 6.6% + 6.6% + F. These three terms respectively correspond to the preprocessing time, an acceleration overhead and the reduced workload.
Assuming reasonably small circles, let F = 10%, and you can hope for a speedup of about x4.
If you are using a bounding box test before the exact point/circle comparison (which is not necessarily an improvement), you can simplify the bounding box test to two Y comparisons, as the X overlap is implicit.
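A minimal Python sketch of the sort-then-scan idea, assuming points are (x, y) tuples and circles are (cx, cy, r) tuples; the function name points_in_circles is hypothetical. The bisect module performs the dichotomic search.

```python
from bisect import bisect_left, bisect_right


def points_in_circles(points, circles):
    """Return a list of (point_index, circle_index) pairs for every point
    that lies inside or on a circle."""
    # Sort point indices by abscissa once.
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    xs = [points[i][0] for i in order]
    hits = []
    for c, (cx, cy, r) in enumerate(circles):
        # Only points with cx - r <= x <= cx + r can be inside the circle.
        lo = bisect_left(xs, cx - r)
        hi = bisect_right(xs, cx + r)
        for k in range(lo, hi):
            i = order[k]
            px, py = points[i]
            if (px - cx) ** 2 + (py - cy) ** 2 <= r * r:
                hits.append((i, c))
    return hits
```

Since the positions change every frame, the sort would have to be redone per frame, which is the S.P.Lg(P) term in the estimate above.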

Find closest 2d point on polyline in constant time

Is there an algorithm that for a given 2d position finds the closest point on a 2d polyline consisting of n - 1 line segments (n line vertices) in constant time? The naive solution is to traverse all segments, test the minimum distance of each segment to the given position and then for the closest segment, calculate the exact closest point to the given position, which has a complexity of O(n). Unfortunately, hardware constraints prevent me from using any type of loop or pointers, meaning also no optimizations like quadtrees for a hierarchical lookup of the closest segment in O(log n).
I have theoretically unlimited time to pre-calculate any datastructure that can be used for a lookup and this pre-calculation can be arbitrarily complex, only the lookup at runtime itself needs to be in O(1). However, the second constraint of the hardware is that I only have very limited memory, meaning that it is not feasible to find the closest point on the line for each numerically possible position of the domain and storing this in a huge array. In other words, the memory consumption should be in O(n^x).
So it comes down to the question how to find the closest segment of a polyline or its index given a 2d position without any loops. Is this possible?
Edit: About the given position … it can be quite arbitrary, but it is reasonable to consider only positions in the closer neighborhood of a line, given by a constant maximum distance.
Create a single axis-aligned box that contains all of your line segments with some padding. Discretize it into a WxH grid of integer indexes. For each grid cell, compute the nearest line segment, and store its index in that grid cell.
To query a point, in O(1) time compute which grid cell it falls in. Lookup the index of the nearest line segment. Do the standard O(1) algorithm to compute exactly the nearest point on the line.
This is an O(1) almost-exact algorithm that will take O(WH) space, where WH is the number of cells in the grid.
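To make the precompute/lookup split concrete, here is a hedged Python sketch. The function names, the choice of sampling each cell at its centre, and the clamping of out-of-box queries to the border cell are all assumptions of this example. Only build_grid loops; query does a fixed amount of arithmetic and one table lookup.

```python
from math import hypot


def closest_on_segment(p, a, b):
    """Closest point to p on the segment a-b (standard clamped projection)."""
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    denom = dx * dx + dy * dy
    if denom == 0:
        return (ax, ay)                      # degenerate segment
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / denom
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def build_grid(vertices, x0, y0, x1, y1, w, h):
    """Precompute, for the centre of each of the w x h cells covering the
    box (x0, y0)-(x1, y1), the index of the nearest polyline segment."""
    def dist(p, k):
        q = closest_on_segment(p, vertices[k], vertices[k + 1])
        return hypot(p[0] - q[0], p[1] - q[1])

    grid = []
    for j in range(h):
        row = []
        for i in range(w):
            centre = (x0 + (i + 0.5) * (x1 - x0) / w,
                      y0 + (j + 0.5) * (y1 - y0) / h)
            row.append(min(range(len(vertices) - 1),
                           key=lambda k: dist(centre, k)))
        grid.append(row)
    return grid


def query(p, vertices, grid, x0, y0, x1, y1):
    """Loop-free lookup: find the cell, read the segment index, project."""
    h, w = len(grid), len(grid[0])
    i = min(w - 1, max(0, int((p[0] - x0) * w / (x1 - x0))))
    j = min(h - 1, max(0, int((p[1] - y0) * h / (y1 - y0))))
    k = grid[j][i]
    return closest_on_segment(p, vertices[k], vertices[k + 1])
```

For example, with vertices [(0, 0), (10, 0), (10, 10)] and a 24x24 grid over the box (-1, -1)-(11, 11), a query at (5, 1) lands in a cell owned by the first segment and returns (5.0, 0.0). Queries near the boundary between two segments' regions can pick the wrong segment, which is the discretization error described in this answer.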
For example, here is the subdivision of the space imposed by some line segments:
Here is a 9x7 tiling of the space, where each color corresponds to an edge index: red (0), green (1), blue (2), purple (3). Notice how the discretizing of the space introduces some error. You would of course use a much finer subdivision of the space to reduce that error to as much as you want, at the cost of having to store a larger grid. This coarse tiling is meant for illustration only.
You can keep your algorithm O(1) and make it even more almost-exact by taking your query point, identifying what cell it lies in, and then looking at the 8 neighboring cells in addition to that cell. Determine the set of edges that those 9 cells identify. (The set contains at most 9 edges.) Then for each edge find the closest point. Then keep the closest among those (at most 9) closest points.
In any case, this approach will always fail for some pathological case, so you'll have to factor that into deciding whether you want to use this.
You can find the closest geometric point on a line in O(1) time, but that won't tell you which of the given vertices is closest to it. The best you can do for that is a binary search, which is O(log n), but of course requires a loop or recursion.
If you're designing VLSI or FPGA, you can evaluate all the vertices in parallel. Then, you can compare neighbors, and do a big wired-or to encode the index of the segment that straddles the closest geometric point. You'll technically get some sort of O(log n) delay based on the number of elements in the wired-or, but that kind of thing is usually treated as near-constant.
You can optimize this type of search using an R-tree, which is a general-purpose spatial data structure supporting fast searches. It's not a constant-time algorithm; its average case is O(log n).
You said that you can pre-calculate the data structure, but you could not use any loops. However is there some limitation that prevents any loops? Arbitrary searches are not likely to hit an existing datapoint so it must at least look left and right in a tree.
This SO answer contains some links to libraries:
Java commercial-friendly R-tree implementation?

2D grid data structure for nearest free cell

Consider a 2000 x 2000 2D bool array. 100,000 elements are set to true, the rest to false.
Given a cell (x1,y1) we need to find the nearest cell (x2,y2) (by manhattan distance: abs(x1-x2) + abs(y1-y2)) that is false.
One way to do that would be to:
for (int dist = 0; true; dist++)
    for ((x2,y2) in all cells dist away from (x1,y1))
        if (!array[x2,y2])
            return (x2,y2);
In the worst case we would have to iterate through 100,000 cells before finding the free one.
Is there a data structure we could use rather than a 2D array that would allow us to perform this search quicker?
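For reference, the ring-by-ring search from the pseudocode can be written out as runnable Python. This is a sketch of the baseline only, not a faster data structure; the function name nearest_false and the list-of-lists grid are assumptions, and ties at the same distance are broken by iteration order.

```python
def nearest_false(grid, x1, y1):
    """Return the nearest (x2, y2) by Manhattan distance with grid[x2][y2]
    false, or None if every cell is true."""
    n, m = len(grid), len(grid[0])
    # The largest possible Manhattan distance inside the grid is n + m - 2.
    for dist in range(n + m - 1):
        # Enumerate the diamond of cells exactly dist away from (x1, y1).
        for dx in range(-dist, dist + 1):
            dy = dist - abs(dx)
            for sy in ((dy, -dy) if dy else (0,)):
                x2, y2 = x1 + dx, y1 + sy
                if 0 <= x2 < n and 0 <= y2 < m and not grid[x2][y2]:
                    return (x2, y2)
    return None
```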
If the data is constant and you have many queries on it:
You might want to use a k-d tree, and look for the nearest neighbor. Insert (i,j) for each element such that arr[i][j] = false. The standard k-d tree uses Euclidean distance, but I think one can modify it to use Manhattan distance instead.
If the data is used for one query:
You will need at least Omega(n*m) operations to read the data and insert it into any data structure, so there is no point in doing that: your suggested solution will outperform the build-up of any data structure alone.
You might be interested in looking into a region quadtree. Here, initially the entire image is modeled as the root, since the image contains all 0s (assumption). Then, when a particular pixel is set, the image is first divided into 4 quadrants, and the 3 quadrants that do not include the pixel are left as leaves. The remaining quadrant is subdivided again, and so on. This continues until we have 4 single-pixel leaves, one of which is set.
This representation will help to rule-out entire regions during the search and the search time can be optimized to O(log n)

Finding the farthest point in one set from another set

My goal is a more efficient implementation of the algorithm posed in this question.
Consider two sets of points (in N-space: 3-space for the example case of RGB colorspace, while a solution for 1-space or 2-space differs only in the distance calculation). How do you find the point in the first set that is the farthest from its nearest neighbor in the second set?
In a 1-space example, given the sets A:{2,4,6,8} and B:{1,3,5}, the answer would be 8, as 8 is 3 units away from 5 (its nearest neighbor in B) while all other members of A are just 1 unit away from their nearest neighbor in B. edit: 1-space is overly simplified, as sorting is related to distance in a way that it is not in higher dimensions.
The solution in the source question involves a brute force comparison of every point in one set (all R,G,B where 512>=R+G+B>=256 and R%4=0 and G%4=0 and B%4=0) to every point in the other set (colorTable). Ignore, for the sake of this question, that the first set is elaborated programmatically instead of iterated over as a stored list like the second set.
First you need to find every element's nearest neighbor in the other set.
To do this efficiently you need a nearest neighbor algorithm. Personally I would implement a kd-tree just because I've done it in the past in my algorithm class and it was fairly straightforward. Another viable alternative is an R-tree.
Do this once for each element in the smallest set. (Add one element from the smallest to larger one and run the algorithm to find its nearest neighbor.)
From this you should be able to get a list of nearest neighbors for each element.
While finding the pairs of nearest neighbors, keep them in a sorted data structure which has a fast addition method and a fast getMax method, such as a heap, sorted by Euclidean distance.
Then, once you're done simply ask the heap for the max.
The run time for this breaks down as follows:
N = size of smaller set
M = size of the larger set
N * O(log M + 1) for all the kd-tree nearest neighbor checks.
N * O(1) for calculating the Euclidean distance before adding it to the heap.
N * O(log N) for adding the pairs into the heap.
O(1) to get the final answer :D
So in the end the whole algorithm is O(N*log M).
If you don't care about the order of each pair you can save a bit of time and space by only keeping the max found so far.
*Disclaimer: This all assumes you won't be using an enormously high number of dimensions and that your elements follow a mostly random distribution.
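A compact Python sketch of the k-d tree variant that only keeps the maximum found so far. It assumes points are equal-length tuples, builds the tree on the second set and queries it with each point of the first set, and uses squared Euclidean distance throughout; build, nearest_sq and farthest_from_nearest are illustrative names, not a library API.

```python
def build(points, depth=0):
    """Build a k-d tree as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))


def nearest_sq(node, q, depth=0, best=float("inf")):
    """Squared distance from q to its nearest neighbour in the tree."""
    if node is None:
        return best
    point, left, right = node
    best = min(best, sum((a - b) ** 2 for a, b in zip(point, q)))
    axis = depth % len(q)
    diff = q[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest_sq(near, q, depth + 1, best)
    # Only cross the splitting plane if a closer point could lie beyond it.
    if diff * diff < best:
        best = nearest_sq(far, q, depth + 1, best)
    return best


def farthest_from_nearest(a, b):
    """Point of a whose nearest neighbour in b is farthest away."""
    tree = build(list(b))
    return max(a, key=lambda p: nearest_sq(tree, p))
```

With the 1-space example from the question written as 1-tuples, farthest_from_nearest([(2,), (4,), (6,), (8,)], [(1,), (3,), (5,)]) picks (8,).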
The most obvious approach seems to me to be to build a tree structure on one set to allow you to search it relatively quickly. A kd-tree or similar would probably be appropriate for that.
Having done that, you walk over all the points in the other set and use the tree to find their nearest neighbour in the first set, keeping track of the maximum as you go.
It's nlog(n) to build the tree, and log(n) for one search so the whole thing should run in nlog(n).
To make things more efficient, consider using a Pigeonhole algorithm - group the points in your reference set (your colorTable) by their location in n-space. This allows you to efficiently find the nearest neighbour without having to iterate all the points.
For example, if you were working in 2-space, divide your plane into a 5 x 5 grid, giving 25 squares, with 25 groups of points.
In 3 space, divide your cube into a 5 x 5 x 5 grid, giving 125 cubes, each with a set of points.
Then, to test point n, find the square/cube/group that contains n and test distance to those points. You only need to test points from neighbouring groups if point n is closer to the edge than to the nearest neighbour in the group.
For each point in set B, find the distance to its nearest neighbor in set A.
To find the distance to each nearest neighbor, you can use a kd-tree as long as the number of dimensions is reasonable, there aren't too many points, and you will be doing many queries - otherwise it will be too expensive to build the tree to be worthwhile.
Maybe I'm misunderstanding the question, but wouldn't it be easiest to just reverse the sign on all the coordinates in one data set (i.e. multiply one set of coordinates by -1), then find the first nearest neighbour (which would be the farthest neighbour)? You can use your favourite knn algorithm with k=1.
EDIT: I meant nlog(n) where n is the sum of the sizes of both sets.
In the 1-space case you could do something like this (pseudocode)
Use a structure like this
Struct Item {
    int value
    int setid
}
(1) Max Distance = 0
(2) Read all the sets into Item structures
(3) Create an Array of pointers to all the Items
(4) Sort the array of pointers by Item->value field of the structure
(5) Walk the array from beginning to end, checking if the Item->setid is different from the previous Item->setid
    if (SetIDs are different)
        check if this distance is greater than Max Distance; if so, set Max Distance to this distance
Return the max distance.
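A runnable Python sketch of the 1-space case, assuming plain numbers and a non-empty second set; farthest_1d is a made-up name. Note that it swaps in a binary search per point instead of the adjacent-pair walk: the walk only measures gaps between neighbouring items from different sets, so it would miss a point such as 8 in the question's example, whose neighbour in the sorted array (6) comes from its own set.

```python
from bisect import bisect_left


def farthest_1d(a, b):
    """Element of a whose nearest neighbour in b is farthest away."""
    b = sorted(b)

    def nearest_distance(v):
        k = bisect_left(b, v)
        # The nearest neighbour is one of the (at most two) values of b
        # on either side of the insertion point.
        candidates = b[max(0, k - 1):k + 1]
        return min(abs(v - c) for c in candidates)

    return max(a, key=nearest_distance)
```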

Resources