Average search complexity in a randomly populated QuadTree - complexity-theory

I have a Quad Tree filled with N random 2D points (x,y).
I want to estimate the average search complexity in these two cases:
1) a range search (all points within distance R from a position x,y)
2) the single nearest point to a position (x,y)
My reasoning so far is as follows. If we make an analogy with the binary search tree, whose average search complexity is O(log2(N)) - https://en.wikipedia.org/wiki/Binary_search_tree
then for a single search the average complexity of a QuadTree should be something like O(log4(N)) (because there are 4 children), and for the range search something like O(log4(N) + k),
where k depends on the average number of points that have to be checked for being inside the range (but I am not sure how to estimate it; even for a single point, a range search inside a circle still has to be done). In the worst case of a large R we have to check all points and the complexity deteriorates to O(N).
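To make the counting concrete, here is a minimal point-quadtree sketch (my own illustration; the bucket capacity and all names are assumptions, and points are assumed to lie inside the root cell). The cost of range_search is roughly the number of nodes whose cell intersects the circle plus the k points reported, which is where an estimate like O(log4(N) + k) comes from.

class QuadNode:
    CAPACITY = 8

    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.points = []          # leaf bucket
        self.children = None      # 4 children once split

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.CAPACITY:
                self._split()
            return
        self._child_for(p).insert(p)

    def _split(self):
        xm, ym = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        self.children = [QuadNode(self.x0, self.y0, xm, ym),
                         QuadNode(xm, self.y0, self.x1, ym),
                         QuadNode(self.x0, ym, xm, self.y1),
                         QuadNode(xm, ym, self.x1, self.y1)]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        xm, ym = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        index = (1 if p[0] >= xm else 0) + (2 if p[1] >= ym else 0)
        return self.children[index]

    def range_search(self, cx, cy, r, out):
        # Prune whole subtrees whose cell cannot intersect the circle.
        dx = max(self.x0 - cx, 0, cx - self.x1)
        dy = max(self.y0 - cy, 0, cy - self.y1)
        if dx * dx + dy * dy > r * r:
            return
        for (px, py) in self.points:
            if (px - cx) ** 2 + (py - cy) ** 2 <= r * r:
                out.append((px, py))
        if self.children:
            for child in self.children:
                child.range_search(cx, cy, r, out)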
Thanks.

Related

find best k-means from a list of candidates

I have an array of n points called A, and a candidate array of size O(k) > k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point among the chosen k is minimized. One way to do it would be to check the cost of every possible choice of k points from S and take the minimum, but that would take O(k^k * n) time; is there a more efficient way to do it?
I need either an optimal solution or a constant approximation.
The reason I need this is that I'm trying to find a constant approximation for k-means as fast as possible and later use it for a coreset construction (coreset = data minimization while still keeping the cost of any query approximately the same). I was able to show that if we assume that in the optimal clustering each cluster has Omega(n/k) points, we can quite quickly create a list of O(k) candidates that contains a 3-approximation for the k-means, so I was wondering if we can find those k points, or a constant approximation of their cost, in time faster than exhaustive search.
Example for k=2
In this example S is the green dots and A is the red dots. The algorithm should return the 2 circled points from S since they minimize the sum of squared distances from the points of A to their closest point of the 2.
I have an array of n points called A, and a candidate array of size O(k) > k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point among the chosen k is minimized.
It sounds like this could be solved simply by checking the N points against the K points to find the k points in N with the smallest squared distance.
Therefore, I'm now fairly sure this is actually finding the k-nearest neighbors (K-NN as a computational geometry problem, not the pattern recognition definition) in the N points for each point in the K points and not actually k-means.
For higher dimensionality, it is often useful to also account for the dimensionality D in the algorithm.
The algorithm mentioned is then indeed O(N*D*k^2) when considering K-NN instead. That can be improved to O(N*D*k) by using the Quickselect algorithm on the distances. This allows checking the list of N points against each of the K points in O(N) to find the nearest k points.
https://en.wikipedia.org/wiki/Quickselect
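As a minimal sketch of that selection step (my own illustration, assuming the point sets are NumPy arrays; numpy.argpartition performs an introselect-style partial selection, so each query point costs roughly O(N*D) instead of O(N*D*log N)):

import numpy as np

def k_nearest_in_S(F, S, k):
    # For each query point y in F, return the k points of S closest to y.
    # F: (m, d) array of query points, S: (n, d) array of candidates, k < len(S).
    results = {}
    for i, y in enumerate(F):
        d2 = np.sum((S - y) ** 2, axis=1)      # squared distances to y, O(n*d)
        nearest = np.argpartition(d2, k)[:k]   # indices of the k smallest, ~O(n)
        results[i] = S[nearest]
    return results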
Edit:
Seems there is some confusion about quickselect and whether it can be used. Here is an O(D*k*N*log N) solution that uses a standard O(N log N) sort instead of O(N) quickselect. It might even be faster in practice, and as you can see it's pretty easy to implement in most languages.
def nearest_k(F, S, k):
    results = {}
    for y in F:
        def distance_squared(x):
            # squared Euclidean distance from candidate x to the current y
            return (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
        # first k points of S sorted by squared distance to y
        results[y] = sorted(S, key=distance_squared)[:k]
    return results
Update for new visual
# Build up the distance sums, O(A*N*D)
results = {}
for y in F:
    def distance_squared(x):
        # squared Euclidean distance from x to the current y
        return (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
    # sum of squared distances from y to all points in S
    results[y] = sum(map(distance_squared, S))
def results_key_value(key):
    return results[key]
# first k candidates sorted by their distance sum, O(D*A log A)
best = sorted(results, key=results_key_value)[:k]
You could approximate by only considering Z random points chosen from the S points. Alternatively, you could merge points in S that are close enough together. This could reduce S to a much smaller size; as long as S remains about F^2 or larger in size, it shouldn't affect which points in F are chosen too much. You would also need to adjust the weights of the points to handle that properly, i.e. the squared distance of a merged point that represents 10 points is multiplied by 10 to account for it acting as 10 points instead of just 1.
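A minimal sketch of that merge step (my own illustration; the cell size is a tuning parameter, not something from the question): points of S falling in the same grid cell are collapsed into one weighted representative.

from collections import defaultdict

def merge_nearby(S, cell):
    # Merge points of S that fall into the same cell x cell grid square
    # into one representative weighted by the number of merged points.
    buckets = defaultdict(list)
    for (x, y) in S:
        buckets[(int(x // cell), int(y // cell))].append((x, y))
    merged = []
    for pts in buckets.values():
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        merged.append((cx, cy, len(pts)))   # (x, y, weight)
    return merged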

Most efficient way to select point with the most surrounding points

N.B: there's a major edit at the bottom of the question - check it out
Question
Say I have a set of points:
I want to find the point with the most points surrounding it, within radius R of it (i.e. a circle) or within distance R of it along both axes (i.e. a square), in 2 dimensions. I'll refer to this as the densest point function.
For the diagrams in this question, I'll represent the surrounding region as circles. In the image above, the middle point's surrounding region is shown in green. This middle point has the most surrounding points of all the points within radius R and would be returned by the densest point function.
What I've tried
A viable way to solve this problem would be to use a range searching solution; this answer explains further and that it has O(log n + k) worst-case time. Using this, I could get the number of points surrounding each point and choose the point with the largest surrounding-point count.
However, if the points were extremely densely packed (in the order of a million), as such:
then each of these million points (n = 10^6) would need to have a range search performed. The worst-case time O(log n + k), where k is the number of points returned in the range, is true for the following point tree types:
kd-trees of two dimensions (which are actually slightly worse, at O(sqrt(n) + k)),
2d-range trees,
Quadtrees (whose worst case also depends on the point distribution).
So, for a group of points all within radius R of one another, k is about n, and the search costs roughly O(n) for each point. This yields over a trillion operations!
Any ideas on a more efficient, precise way of achieving this, so that I could find the point with the most surrounding points for a group of points, and in a reasonable time (preferably O(n log n) or less)?
EDIT
Turns out that the method above is correct! I just need help implementing it.
(Semi-)Solution
If I use a 2d-range tree:
A range reporting query costs O(log(n)^d + k), for k returned points,
For a range tree with fractional cascading (also known as layered range trees) the complexity is O(log(n)^(d-1) + k),
For 2 dimensions, that is O(log n + k),
Furthermore, if I perform a range counting query (i.e., I do not report each point), then it costs O(log n).
I'd perform this on every point - yielding the O(n log n) complexity I desired!
Problem
However, I cannot figure out how to write the code for a counting query for a 2d layered range tree.
I've found a great resource (from page 113 onwards) about range trees, including 2d-range tree pseudocode. But I can't figure out how to introduce fractional cascading, nor how to correctly implement the counting query so that it is of O(log n) complexity.
I've also found two range tree implementations here and here in Java, and one in C++ here, although I'm not sure this uses fractional cascading as it states above the countInRange method that
It returns the number of such points in worst case
* O(log(n)^d) time. It can also return the points that are in the rectangle in worst case
* O(log(n)^d + k) time where k is the number of points that lie in the rectangle.
which suggests to me it does not apply fractional cascading.
Refined question
To answer the question above, therefore, all I need to know is whether there are any libraries with 2d-range trees with fractional cascading that have a range counting query of O(log n) complexity, so I don't go reinventing any wheels - or can you help me write/modify the resources above to perform a query of that complexity?
Also not complaining if you can provide me with any other method that achieves an O(log n) range counting query of 2d points!
I suggest using a plane sweep algorithm. This allows one-dimensional range queries instead of 2-d queries (which is more efficient, simpler, and in the case of a square neighborhood does not require fractional cascading):
Sort points by Y-coordinate to array S.
Advance 3 pointers into array S: one (C) for the currently inspected (center) point; one, A (a little ahead of C), for the nearest point whose Y is more than R above C; and one, B (a little behind C), for the farthest point whose Y is within R below C.
Insert the points passed by A into an order statistic tree (ordered by coordinate X) and remove the points passed by B from this tree. Use this tree to find the points at distance R to the left/right of C and use the difference of these points' positions in the tree to get the number of points in the square area around C.
Use results of previous step to select "most surrounded" point.
This algorithm could be optimized if you rotate points (or just exchange X-Y coordinates) so that width of the occupied area is not larger than its height. Also you could cut points into vertical slices (with R-sized overlap) and process slices separately - if there are too many elements in the tree so that it does not fit in CPU cache (which is unlikely for only 1 million points). This algorithm (optimized or not) has time complexity O(n log n).
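Here is a minimal sketch of the square-neighborhood version of this sweep (my own illustration; it uses sortedcontainers.SortedList as the order statistic structure, which is an assumption - any balanced BST with rank queries would do):

from sortedcontainers import SortedList   # assumed available; any rank-query BST works

def most_surrounded(points, R):
    # Returns the point with the most neighbours in the axis-aligned
    # 2R x 2R square centred on it, together with that neighbour count.
    pts = sorted(points, key=lambda p: p[1])   # sort by Y
    xs = SortedList()                          # X coords of points with |y - yc| <= R
    ahead = behind = 0
    best, best_count = None, -1
    for cx, cy in pts:
        while ahead < len(pts) and pts[ahead][1] <= cy + R:
            xs.add(pts[ahead][0])              # pointer A: new points entering the strip
            ahead += 1
        while pts[behind][1] < cy - R:
            xs.remove(pts[behind][0])          # pointer B: points falling out below
            behind += 1
        # rank difference = number of strip points with x in [cx - R, cx + R]
        count = xs.bisect_right(cx + R) - xs.bisect_left(cx - R) - 1  # exclude the centre
        if count > best_count:
            best, best_count = (cx, cy), count
    return best, best_count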
For circular neighborhood (if R is not too large and points are evenly distributed) you could approximate circle with several rectangles:
In this case step 2 of the algorithm should use more pointers to allow insertion/removal to/from several trees. And on step 3 you should do a linear search near points at proper distance (<=R) to distinguish points inside the circle from the points outside it.
Another way to deal with a circular neighborhood is to approximate the circle with rectangles of equal height (but here the circle should be split into more pieces). This results in a much simpler algorithm (where sorted arrays are used instead of order statistic trees):
Cut area occupied by points into horizontal slices, sort slices by Y, then sort points inside slices by X.
For each point in each slice, assume it to be a "center" point and do step 3.
For each nearby slice use binary search to find points with Euclidean distance close to R, then use linear search to tell "inside" points from "outside" ones. Stop linear search where the slice is completely inside the circle, and count remaining points by difference of positions in the array.
Use results of previous step to select "most surrounded" point.
This algorithm allows optimizations mentioned earlier as well as fractional cascading.
I would start by creating something like a https://en.wikipedia.org/wiki/K-d_tree, where you have a tree with points at the leaves and each node holds information about its descendants. At each node I would keep a count of the number of descendants and a bounding box enclosing those descendants.
Now for each point I would recursively search the tree. At each node I visit, either all of the bounding box is within R of the current point, all of the bounding box is more than R away from the current point, or some of it is inside R and some outside R. In the first case I can use the count of the number of descendants of the current node to increase the count of points within R of the current point and return up one level of the recursion. In the second case I can simply return up one level of the recursion without incrementing anything. It is only in the intermediate case that I need to continue recursing down the tree.
So I can work out for each point the number of neighbours within R without checking every other point, and pick the point with the highest count.
If the points are spread out evenly then I think you will end up constructing a k-d tree whose lower levels are close to a regular grid. If the grid is of size A x A, then in the worst case R is large enough that its boundary is a circle intersecting O(A) low-level cells, so with O(n) points I would expect this to cost about O(n * sqrt(n)).
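A minimal sketch of this idea (my own illustration, with a small leaf bucket; build the tree once, call count_within for every point, and keep the point with the largest count):

import math

class Node:
    # k-d-style node storing its descendant count and bounding box.
    def __init__(self, points, depth=0):
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        self.bbox = (min(xs), min(ys), max(xs), max(ys))
        self.count = len(points)
        if len(points) <= 4:                   # small leaf bucket
            self.points, self.left, self.right = points, None, None
            return
        axis = depth % 2
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        self.points = None
        self.left = Node(points[:mid], depth + 1)
        self.right = Node(points[mid:], depth + 1)

    def count_within(self, cx, cy, r):
        x0, y0, x1, y1 = self.bbox
        near = math.hypot(max(x0 - cx, 0, cx - x1), max(y0 - cy, 0, cy - y1))
        far = math.hypot(max(abs(x0 - cx), abs(x1 - cx)), max(abs(y0 - cy), abs(y1 - cy)))
        if near > r:
            return 0                           # box entirely outside the circle
        if far <= r:
            return self.count                  # box entirely inside: use the stored count
        if self.points is not None:            # leaf: test individual points
            return sum(1 for (px, py) in self.points
                       if (px - cx) ** 2 + (py - cy) ** 2 <= r * r)
        return (self.left.count_within(cx, cy, r)
                + self.right.count_within(cx, cy, r))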
You can speed up whatever algorithm you use by preprocessing your data in O(n) time to estimate the number of neighbouring points.
For a circle of radius R, create a grid whose cells have dimension R in both the x- and y-directions. For each point, determine to which cell it belongs. For a given cell c this test is easy:
c.x<=p.x && p.x<=c.x+R && c.y<=p.y && p.y<=c.y+R
(You may want to think deeply about whether a closed or half-open interval is correct.)
If you have relatively dense/homogeneous coverage, then you can use an array to store the values. If coverage is sparse/heterogeneous, you may wish to use a hashmap.
Now, consider a point on the grid. The extremal locations of a point within a cell are as indicated:
Points at the corners of the cell can only be neighbours with points in four cells. Points along an edge can be neighbours with points in six cells. Points not on an edge are neighbours with points in 7-9 cells. Since it's rare for a point to fall exactly on a corner or edge, we assume that any point in the focal cell is neighbours with the points in all 8 surrounding cells.
So, if a point p is in a cell (x,y), N[p] identifies the number of neighbours of p within radius R, and Np[y][x] denotes the number of points in cell (x,y), then N[p] is given by:
N[p] = Np[y][x]+
Np[y][x-1]+
Np[y-1][x-1]+
Np[y-1][x]+
Np[y-1][x+1]+
Np[y][x+1]+
Np[y+1][x+1]+
Np[y+1][x]+
Np[y+1][x-1]
Once we have the number of neighbours estimated for each point, we can heapify that data structure into a maxheap in O(n) time (with, e.g. make_heap). The structure is now a priority-queue and we can pull points off in O(log n) time per query ordered by their estimated number of neighbours.
Do this for the first point and use an O(log n + k) circle search (or some more clever algorithm) to determine the actual number of neighbours the point has. Make a note of this point in a variable best_found and update its N[p] value.
Peek at the top of the heap. If the estimated number of neighbours is less than N[best_found] then we are done. Otherwise, repeat the above operation.
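A minimal sketch of this estimate-then-refine loop (my own illustration; exact_count(p, R) is a hypothetical helper for the expensive exact circle search, e.g. on a k-d tree):

import heapq
from collections import defaultdict

def densest_point(points, R, exact_count):
    # Count points per R x R grid cell.
    cells = defaultdict(int)
    for (x, y) in points:
        cells[(int(x // R), int(y // R))] += 1
    def estimate(p):
        # Upper bound on the number of neighbours within R: the 3x3 block of cells.
        cx, cy = int(p[0] // R), int(p[1] // R)
        return sum(cells[(cx + dx, cy + dy)]
                   for dx in (-1, 0, 1) for dy in (-1, 0, 1))
    # Max-heap of (-estimate, point); heapify is O(n).
    heap = [(-estimate(p), p) for p in points]
    heapq.heapify(heap)
    best, best_n = None, -1
    while heap:
        neg_est, p = heapq.heappop(heap)
        if -neg_est <= best_n:       # no remaining estimate can beat the best exact count
            break
        n = exact_count(p, R)        # the expensive exact circle search
        if n > best_n:
            best, best_n = p, n
    return best, best_n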
To improve estimates you could use a finer grid, like so:
along with some clever sliding window techniques to reduce the amount of processing required (see, for instance, this answer for rectangular cases - for circular windows you should probably use a collection of FIFO queues). To make the estimates more robust against unlucky alignments you can randomize the origin of the grid.
Considering again the example you posed:
It's clear that this heuristic has the potential to save considerable time: with the above grid, only a single expensive check would need to be performed in order to prove that the middle point has the most neighbours. Again, a higher-resolution grid will improve the estimates and decrease the number of expensive checks which need to be made.
You could, and should, use a similar bounding technique in conjunction with mcdowella's answer; however, that answer does not provide a good place to start looking, so it is possible to spend a lot of time exploring low-value points.

Given a few points and circles, how can I tell which point lies in which circles?

Given a small number of points and circles (say under 100), how do I tell which point lies in which circles? The circles can intersect, so one point can lie in multiple circles.
If it's of any relevance, both points and circle centers are aligned on a hexagonal grid, and the radii of the circles are also aligned to the grid.
With a bit of thought, it seems the worst-case scenario would always be quadratic (when each point lies in all circles) ... but there might be some way to make this faster for the average case, when there aren't that many intersections?
I'm doing this for an AI simulation and the circle/point locations change all the time, so I can't really pre-compute anything ahead of time.
If the number of points and circles is that small, you probably will get away with brute-forcing it. Circle-point intersections are pretty cheap, and 100 * 100 checks a frame shouldn't harm performance at all.
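For reference, the brute-force version is only a few lines (a sketch; the names are illustrative):

def points_in_circles(points, circles):
    # points: list of (x, y); circles: list of (cx, cy, r).
    # Returns a mapping from circle index to the indices of the points inside it.
    hits = {i: [] for i in range(len(circles))}
    for i, (cx, cy, r) in enumerate(circles):
        for j, (px, py) in enumerate(points):
            if (px - cx) ** 2 + (py - cy) ** 2 <= r * r:   # cheap squared-distance test
                hits[i].append(j)
    return hits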
If you are completely sure that this routine is the bottleneck and needs to be optimized, read on.
You can try using a variation of Bounding Volume Hierarchies.
A bounding volume hierarchy is a tree in which each node covers the entire volume of both (or more if you decide to use a tree with higher degree) of its children. The volumes/objects that have to be tested for intersections are always the leaf nodes of the tree.
Insertion, removal and intersection queries have an amortized average run-time of O(log n). You will however have to update the tree, as your objects are dynamic, which is done by removing and reinserting invalid nodes (nodes which do not contain their leaf nodes fully any more). Updating the full tree takes a worst case time of O(n log n).
Care should be taken during insertion: a node should be inserted into the sub-tree whose volume it increases by the least amount.
Here is a good blog post by Randy Gaul which explains dynamic bounding hierarchies well.
You'll have to use circles as the bounding volumes, unless you can find a way to use AABBs in all nodes except leaf nodes, and circles as leaf nodes. AABBs are more accurate and should give you a slightly better constructed tree.
You can build a kd-tree of the points. And then for each circle center you retrieve all the points of the kd-tree with distance bounded by the circle radius. Given M points and N circles the complexity should be M log M + N log M = max(M,N) log M (if points and circles are "well distributed").
Whether you can gain anything compared to a brute-force pairwise check depends on the geometric structure of your points and circles. If, for instance, the radii of the circles are large relative to the distances between the points or between the circle centers, then there is not much to expect, I think.
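A minimal sketch of the kd-tree approach above, assuming SciPy is available (scipy.spatial.cKDTree.query_ball_point performs exactly this kind of fixed-radius query):

import numpy as np
from scipy.spatial import cKDTree

def points_in_circles_kdtree(points, circles):
    # points: (M, 2) array-like; circles: list of (cx, cy, r).
    # Returns, for each circle, the list of indices of the points inside it.
    tree = cKDTree(np.asarray(points))
    return [tree.query_ball_point((cx, cy), r) for (cx, cy, r) in circles]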
Rather than going to a full 2D-tree, there is an intermediate possibility based on sorting.
Sort the P points on the abscissas. With a good sorting algorithm (say Heapsort), the cost can be modeled as S.P.Lg(P) (S is the cost of comparisons/moves).
Then, for every circle (C of them), locate its leftmost point (Xc-R) in the sorted list by dichotomy, with a cost D.Lg(P) (D is the cost of a dichotomy step). Then step to the rightmost point (Xc+R) and perform the point/circle test every time.
Doing this, you will spare the comparisons with the points to the left and to the right of the circle. Let F denote the average fraction of the points which fall in the range [Xc-R, Xc+R] for all circles.
Denoting K the cost of a point/circle comparison, the total can be estimated as
S.P.Lg(P) + D.Lg(P).C + F.K.P.C
to be compared to K.P.C.
The ratio is
S/K.Lg(P)/C + D/K.Lg(P)/P + F.
With the unfavorable hypothesis that S=D=K, for P=C=100 we get 6.6% + 6.6% + F. These three terms respectively correspond to the preprocessing time, an acceleration overhead and the reduced workload.
Assuming reasonably small circles, let F = 10%, and you can hope for a speedup of about x4.
If you are using a bounding box test before the exact point/circle comparison (which is not necessarily an improvement), you can simplify the bounding box test to two Y comparisons, as the X overlap is implicit.
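A minimal sketch of this sort-plus-dichotomy approach (my own illustration, using the bisect module for the dichotomic searches):

import bisect

def points_in_circles_sorted(points, circles):
    # points: list of (x, y); circles: list of (cx, cy, r).
    pts = sorted(points)                        # sort by abscissa (x)
    xs = [p[0] for p in pts]
    result = []
    for (cx, cy, r) in circles:
        lo = bisect.bisect_left(xs, cx - r)     # leftmost candidate: x >= cx - r
        hi = bisect.bisect_right(xs, cx + r)    # one past the rightmost candidate
        inside = [pts[i] for i in range(lo, hi)
                  if (pts[i][0] - cx) ** 2 + (pts[i][1] - cy) ** 2 <= r * r]
        result.append(inside)
    return result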

Find a region with maximum sum of top-K points

My problem is: we have N points in a 2D space, each point with a positive weight. Given a query consisting of two real numbers a, b and one integer k, find the position of an a x b rectangle, with edges parallel to the axes, such that the sum of the weights of the top-k points it covers, i.e. the k covered points with the highest weights, is maximized.
Any suggestion is appreciated.
P.S.:
There are two related problems, which are already well-studied:
Maximum region sum: find the rectangle with the highest total weight sum. Complexity: O(N log N).
top-K query for orthogonal ranges: find top-k points in a given rectangle. Complexity: O(log(N)^2+k).
You can reduce this problem to finding two points in the rectangle: the rightmost and the topmost. So effectively you can select every pair of points and calculate the top-k weight (which according to you is O(log(N)^2 + k)). Complexity: O(N^2 * (log(N)^2 + k)).
Now, given two points, they might not form a valid pair: they might be too far or one point may be right and top of the other point. So, in reality, this will be much faster.
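A minimal sketch of this pair enumeration (my own illustration; for clarity it replaces the O(log(N)^2 + k) orthogonal top-k query with a plain linear scan, so it takes roughly cubic time rather than the bound above):

import heapq

def best_rectangle(points, a, b, k):
    # points: list of (x, y, w).  Tries every pair (right edge x, top edge y)
    # taken from the points themselves and keeps the best top-k weight sum.
    best_sum, best_rect = 0, None
    for (rx, _, _) in points:                    # candidate right edge at x = rx
        for (_, ty, _) in points:                # candidate top edge at y = ty
            inside = [w for (x, y, w) in points
                      if rx - a <= x <= rx and ty - b <= y <= ty]
            if not inside:
                continue
            s = sum(heapq.nlargest(k, inside))   # weights of the top-k covered points
            if s > best_sum:
                best_sum, best_rect = s, (rx, ty)
    return best_sum, best_rect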
My guess is the optimal solution will be a variation of maximum region sum problem. Could you point to a link describing that algorithm?
A non-optimal answer is the following:
Generate all the possible k-plets of points (there are N x (N-1) x ... x (N-k+1) of them, so this is O(N^k) and can be done via recursion).
Filter this list down by eliminating all k-plets which are not enclosed in an a x b rectangle: this is O(k N^k) at worst.
Find the k-plet which has the maximum weight: this is O(k N^(k-1)) at worst.
Thus, this algorithm is O(k N^k).
Improving the algorithm
Step 2 can be integrated into step 1 by stopping the branch recursion when a set of points is already too large. This does not change the need to scan the elements at least once, but it can reduce the number significantly: think of cases where there are no solutions because all points are separated by more than the size of the rectangle, which can be found in O(N^2).
Also, the permutation generator in step 1 can be made to return the points in order of x or y coordinate, by pre-sorting the point array correspondingly. This is useful because it lets us discard a bunch more possibilities up front. Suppose the array is sorted by y coordinate, so the k-plets returned will be ordered by y coordinate. Now, supposing we are discarding a branch because it contains a point whose y coordinate is outside the max rectangle, we can also discard all the following sibling branches, because their y coordinate will be greater than or equal to the current one, which is already out of bounds.
This adds O(n log n) for the sort, but the improvement can be quite significant in many cases -- again, when there are many outliers. The coordinate should be chosen corresponding to the minimum rectangle side, divided by the corresponding side of the 2D field -- by which I mean the maximum coordinate minus the minimum coordinate of all points.
Finally, if all the points lie within an a x b rectangle, then the algorithm performs as O(k N^k) anyway. If this is a concrete possibility, it should be checked with an easy O(N) loop, and if so then it's enough to return the k points with the top weights, which can also be done in O(N).

Finding point with largest weight within a region

My problem is:
We have a set of N points in a 2D space, each of which has a weight. Given any rectangle region R, how to efficiently return the point with the largest weight inside R?
Note that all query regions R have the same shape, i.e. same lengths and widths. And point and rectangle coordinates are float numbers.
My initial idea is to use an R-tree to store the points. For a region R, extract all points in R, and then find the point with the maximum weight. The time complexity is O(log N + V), where V is the number of points in R. Can we do better?
I tried to search for a solution, but without success. Any suggestions?
Thanks,
This sounds like a range max query problem in 2D. You have some very good algorithms here.
Simpler would be to just use a 2D segment tree.
Basically, each node of your segment tree will store 2d regions instead of 1d intervals. So each node will have 4 children, reducing to a quad tree, which you can then operate on as you would a classic segment tree. This is described in detail here.
This will be O(log n) per query, where n is the total number of points. And it also allows you to do a lot more operations, such as update a point's weight, update a region's weights etc.
How about adding an additional attribute to each tree node which contains the maximum weight of all the points contained by any of its children?
This will be easy to update while adding points. It'll be a little more work to maintain when you remove a point that changes the maximum. You'll have to traverse the tree backwards and update the max value for all parent nodes.
With this attribute, if you want to retrieve the maximum-weight point then when you query the tree with your query region you only inspect the child node with the maximum weight as you traverse the tree. Note, you may have more than one point with the same maximum weight so you may have more than one child node to inspect.
Only inspecting the child nodes with a maximum weight attribute will improve your query throughput at the expense of more memory and slower time building/modifying the tree.
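A minimal sketch of the max-weight-attribute idea (my own illustration on a k-d-style tree rather than the segment tree / R-tree discussed above; the pruning logic is the same - skip any subtree whose box misses the query rectangle or whose stored maximum cannot beat the best point found so far):

class MaxNode:
    # Tree node storing the maximum weight among its descendants.
    def __init__(self, points, depth=0):
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        self.box = (min(xs), min(ys), max(xs), max(ys))
        self.max_w = max(p[2] for p in points)
        if len(points) <= 4:                   # small leaf bucket
            self.points, self.children = points, []
            return
        axis = depth % 2
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        self.points = None
        self.children = [MaxNode(points[:mid], depth + 1),
                         MaxNode(points[mid:], depth + 1)]

    def max_in_rect(self, rx0, ry0, rx1, ry1, best=None):
        x0, y0, x1, y1 = self.box
        if x1 < rx0 or rx1 < x0 or y1 < ry0 or ry1 < y0:
            return best                        # bounding box misses the query rectangle
        if best is not None and self.max_w <= best[2]:
            return best                        # cannot improve on the best found so far
        if self.points is not None:            # leaf: test individual points
            for (x, y, w) in self.points:
                if rx0 <= x <= rx1 and ry0 <= y <= ry1 and (best is None or w > best[2]):
                    best = (x, y, w)
            return best
        # visit the child with the larger stored maximum first
        for child in sorted(self.children, key=lambda c: -c.max_w):
            best = child.max_in_rect(rx0, ry0, rx1, ry1, best)
        return best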
Look up so called range trees, which in your case you would want to implement in 2-dimensions. This would be a 2-layer "tree of trees", where you first split the set of points based on x-coordinate and then for each set of x-points at one of the nodes in the resulting tree, you build a tree based on y-coordinate for those points at that node in the original tree. You can look up how to adapt a 2-d range tree to return the number of points in the query rectangle in O((log n)^2) time, independent of the number of points. Similarly, instead of storing the count of points for subrectangles in the range tree, you can store the maximum objective value of points within that rectangle. This will give you O(n log n) time guaranteed storage and construction time, and O((log n)^2) query time, regardless of the number of points in the query rectangle.
An adaptation of so-called "fractional cascading" for range-tree "find all points in query rectangle" might even be able to get your query time down to O(log n), but I'm not sure since you are taking max of value of points within the query rectangle.
Hint:
Every point has a "zone of influence", which is the locus of the positions of the (top-left corner of the) rectangle such that this point dominates. The set of the zones of influence defines a partition of the plane. Every edge of the partition occurs at the abscissa [ordinate] of a given point or its abscissa [ordinate] minus the width [height] of the query region.
If you map the coordinate values to their ranks (by sorting on both axes), you can represent the partition as a digital image of size 4N². To precompute this image, initialize it with minus infinity, and for every point fill its zone of influence with its weight, taking the maximum. If the query window size is R² pixels on average, the cost of constructing the image is NR².
A query is made by finding the row and column of the relevant pixel and returning the pixel value. This takes two dichotomic searches, in time Lg(N).
This approach is only realistic for moderate values of N (say up to 1000). Better insight can be gained on this problem by studying the geometry of the partition map.
You can try a weighted Voronoi diagram where the positive weight is subtracted from the Euclidean distance. Sites with a big weight tend to have big cells, with nearby sites having small weights. Then sort the cells by the number of sites and compute a minimum bounding box for each cell. Match it against the rectangular search box.
