Finding Nearest Node for Each Node - algorithm

I have 2 sets of nodes - Set A and Set B. Each set is of size 25,000.
I am given a percentage (let's say 20%). I need to find the minimum distance such that 20% of the nodes in Set A are within that distance of some node in Set B.
Solution:
Find the 20% of Set A whose nearest-neighbour distances to Set B are smallest. The answer is the distance for the node in that 20% which is farthest from its nearest neighbour in Set B.
Brute Force Solution:
foreach (Node a in setA)
{
    a.ShortestDistance = infinity;
    foreach (Node b in setB)
    {
        // cache the distance so it is computed once per pair
        double d = a.DistanceTo(b);
        if (d < a.ShortestDistance)
        {
            a.ShortestDistance = d;
        }
    }
}
setA.SortByShortestDistance();          // ascending
return setA[(int)(setA.Size * 0.2)];    // the index must be an integer
This works, but the time it would take is insane: O(n^2) for the double loop plus O(n log n) for the sort, I think.
How can I speed this up? I would like to hit O(n) if possible.

Following is an algorithm which might improve speed:
Convert your (lat, long) pairs to (x, y, z) in Cartesian coordinates with the centre of the earth as origin.
The chord distance between (x, y, z) points in Cartesian coordinates is a lower bound on (and monotone in) the actual great-circle distance, so nearest neighbours are preserved.
Construct a 3-d tree (k-d tree) for setB.
For each node a in setA, search for its nearest neighbour in the 3-d tree of setB, which in the average case is O(log N).
The shortest distance for a is then its distance to that nearest neighbour.
Then sort setA as you have done.
Time complexity:
In the average case: O(n log n)
In the worst case: O(n^2)
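A minimal sketch of this approach in Python with scipy's k-d tree; the array layout, the helper names, and the spherical earth radius are illustrative assumptions, not part of the answer above:

import numpy as np
from scipy.spatial import cKDTree

EARTH_R = 6371.0  # km, illustrative spherical radius

def to_cartesian(lat_lon_deg):
    # Convert (lat, lon) pairs in degrees to (x, y, z) with the earth's centre as origin.
    lat, lon = np.radians(np.asarray(lat_lon_deg)).T
    return np.column_stack((EARTH_R * np.cos(lat) * np.cos(lon),
                            EARTH_R * np.cos(lat) * np.sin(lon),
                            EARTH_R * np.sin(lat)))

def threshold_distance(set_a, set_b, fraction=0.2):
    # Chord distance is monotone in great-circle distance, so nearest
    # neighbours found in Cartesian space are also nearest on the sphere.
    tree = cKDTree(to_cartesian(set_b))
    chord, _ = tree.query(to_cartesian(set_a))       # one NN search per node of A
    # Convert chord length back to great-circle (arc) distance.
    arc = 2 * EARTH_R * np.arcsin(np.clip(chord / (2 * EARTH_R), 0.0, 1.0))
    k = int(len(arc) * fraction)
    return np.partition(arc, k)[k]   # k-th smallest NN distance; no full sort needed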

You could pick the smaller of the two sets and build a structure from it for answering nearest-neighbour queries - http://en.wikipedia.org/wiki/Cover_tree doesn't make many assumptions about the underlying metric, so it should work with haversine/great-circle distance.
After doing this, the simplest thing to do would be to take every member of the larger set, find its nearest neighbour in the smaller set, and then sort or http://en.wikipedia.org/wiki/Quickselect the distances. If you modified the find operation to return early, without a result, when the nearest object must be further away than a threshold distance, and you had a rough idea of that distance, you might save some time.
You could get a rough idea by performing the same operation on a random sample from the two sets beforehand. If your guess is a bit too high, you just have a few more nearest neighbour distances to sort. If your guess is a bit too low, you only need to repeat the find operations for those points where the nearest neighbour operation returned early without finding anything.
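A rough sketch of that sampling idea, again with scipy (whose k-d tree query supports exactly this kind of early exit through its distance_upper_bound parameter); the arrays a_xyz and b_xyz, the sample size, and the 10% padding are assumptions:

import numpy as np
from scipy.spatial import cKDTree

# a_xyz, b_xyz: hypothetical (n, 3) point arrays, e.g. from to_cartesian above
tree = cKDTree(b_xyz)

# Estimate the 20th-percentile NN distance on a random sample, padded by 10%.
rng = np.random.default_rng(0)
sample = rng.choice(len(a_xyz), size=1000, replace=False)   # assumes len(a_xyz) >= 1000
k = int(0.2 * len(sample))
guess = 1.1 * np.partition(tree.query(a_xyz[sample])[0], k)[k]

# Query with the guess as an upper bound; misses come back as infinity.
dist, _ = tree.query(a_xyz, distance_upper_bound=guess)
missed = np.isinf(dist)
dist[missed] = tree.query(a_xyz[missed])[0]   # repeat only the failed finds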


Distance between two polylines

I want to calculate the distance d between two polylines:
Obviously I could check the distance for all pairs of line segments and choose the smallest, but then the algorithm would have a runtime of O(n²). Is there any better approach?
Divide and conquer:
Define a data structure representing a pair of polylines and the minimum distance between their axis-aligned minimum bounding boxes (AAMBB): pair = (poly_a, poly_b, d_ab)
Create an empty priority queue for pair data structures, using the distance d_ab as the key.
Create a pair data structure with the initial polylines and push it into the queue.
We will keep a variable with the minimum distance between the polylines found so far (min_d). Set it to infinity.
Repeat:
Pop from the queue the element with minimum distance d_ab.
If d_ab is greater than min_d, we are done.
If either of the polylines poly_a or poly_b consists of a single segment:
Use brute force to find the minimal distance between them and update min_d accordingly.
Otherwise:
Divide both polylines poly_a and poly_b in half, for instance:
(1-7) --> { (1-4), (4-7) }
(8-12) --> { (8-10), (10-12) }
Take the Cartesian product of the two sets of halves, create 4 new pair data structures, and push them into the queue.
On the average case, complexity is O(N log N); the worst case may be O(N²).
Update: The algorithm implemented in Perl.
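A compact Python sketch of the scheme above, assuming shapely is available for the single-segment brute-force case; polylines are plain lists of (x, y) tuples, and all names are illustrative:

import heapq
from shapely.geometry import LineString

def aabb_dist(a, b):
    # Distance between the axis-aligned bounding boxes of two polylines.
    ax, ay = zip(*a)
    bx, by = zip(*b)
    dx = max(min(bx) - max(ax), min(ax) - max(bx), 0.0)
    dy = max(min(by) - max(ay), min(ay) - max(by), 0.0)
    return (dx * dx + dy * dy) ** 0.5

def polyline_dist(poly_a, poly_b):
    min_d = float("inf")
    heap = [(aabb_dist(poly_a, poly_b), poly_a, poly_b)]
    while heap:
        d_ab, a, b = heapq.heappop(heap)
        if d_ab >= min_d:                      # no remaining pair can improve
            return min_d
        if len(a) == 2 or len(b) == 2:         # a single segment: brute force
            min_d = min(min_d, LineString(a).distance(LineString(b)))
        else:
            # Split each polyline in half, sharing the midpoint, as in the
            # (1-7) --> { (1-4), (4-7) } example, and push the 4 combinations.
            for half_a in (a[:len(a) // 2 + 1], a[len(a) // 2:]):
                for half_b in (b[:len(b) // 2 + 1], b[len(b) // 2:]):
                    heapq.heappush(heap, (aabb_dist(half_a, half_b), half_a, half_b))
    return min_d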
The "standard" way for such problems is to construct the Voronoi diagram of the geometric entities. This can be done in time O(N Log N).
But the construction of such diagrams for line segments is difficult and you should resort to ready made solutions such as in CGAL.

Find furthest point in O(1) time

Consider a set S of n points in the plane such that the farthest pair is at distance at most 1. I would like to find the farthest point in S from a given query point q (not in S) in O(1) time. How do I pre-process the points in S to achieve the desired query time bound?
Is this possible?
It is not possible stricto sensu. This is a point location problem in a planar straight-line graph, which is known to require O(log N) query time.
Anyway, it can be addressed approximately by gridding.
Overlay a square grid over the furthest-point Voronoi diagram, and for every cell note the regions it covers. Make sure that the number of covered regions is bounded; this can be approximately achieved by taking a grid pitch smaller than the distance between the two closest vertices of the diagram.
For a query point, finding the containing cell is done in constant time. Then finding the region among a bounded number of candidates takes constant time as well.
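As a very rough illustration of the gridding idea, here is a sketch that approximates further by storing only the farthest point from each cell centre, rather than every Voronoi region overlapping the cell; the resolution and all names are arbitrary assumptions:

import numpy as np

def build_table(points, res=256):
    # Approximation: farthest point of S from the centre of each grid cell.
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(0) - 1.0, pts.max(0) + 1.0   # pad: queries near S still land on the grid
    xs = lo[0] + (np.arange(res) + 0.5) * (hi[0] - lo[0]) / res
    ys = lo[1] + (np.arange(res) + 0.5) * (hi[1] - lo[1]) / res
    centres = np.stack(np.meshgrid(xs, ys, indexing="ij"), -1).reshape(-1, 2)
    d2 = ((centres[:, None, :] - pts[None, :, :]) ** 2).sum(-1)   # O(res^2 * n) precompute
    return lo, hi, d2.argmax(1).reshape(res, res)

def farthest(q, lo, hi, table, points):
    res = table.shape[0]
    i = min(max(int((q[0] - lo[0]) / (hi[0] - lo[0]) * res), 0), res - 1)
    j = min(max(int((q[1] - lo[1]) / (hi[1] - lo[1]) * res), 0), res - 1)
    return points[table[i, j]]    # O(1) per query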
Assuming there is no relation between the points, there is no single operation that will give you the furthest point. So, the only way to do it is to compute it in advance, so you need a simple mapping between each point and the point furthest from it.

Finding point with largest weight within a region

My problem is:
We have a set of N points in a 2D space, each of which has a weight. Given any rectangle region R, how to efficiently return the point with the largest weight inside R?
Note that all query regions R have the same shape, i.e. same lengths and widths. And point and rectangle coordinates are float numbers.
My initial idea is to use an R-tree to store the points. For a region R, extract all points in R and then find the point with the maximum weight. The time complexity is O(log N + V), where V is the number of points in R. Can we do better?
I tried to search for a solution, but without success. Any suggestions?
This sounds like a range max query problem in 2D. You have some very good algorithms here.
Simpler would be to just use a 2D segment tree.
Basically, each node of your segment tree will store 2d regions instead of 1d intervals. So each node will have 4 children, reducing to a quad tree, which you can then operate on as you would a classic segment tree. This is described in detail here.
This will be O(log n) per query, where n is the total number of points. And it also allows you to do a lot more operations, such as update a point's weight, update a region's weights etc.
How about adding an additional attribute to each tree node that contains the maximum weight of all the points contained by any of its children?
This will be easy to update while adding points. It will be a little more work to maintain when you remove a point that changes the maximum: you will have to traverse the tree backwards and update the max value for all parent nodes.
With this attribute, if you want to retrieve the maximum-weight point then when you query the tree with your query region you only inspect the child node with the maximum weight as you traverse the tree. Note, you may have more than one point with the same maximum weight so you may have more than one child node to inspect.
Only inspecting the child nodes with a maximum weight attribute will improve your query throughput at the expense of more memory and slower time building/modifying the tree.
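A hypothetical sketch combining both suggestions: a point quadtree whose nodes carry the maximum weight beneath them, so a rectangle query can visit higher-max children first and prune the rest. The leaf capacity and all names are illustrative:

class QuadTree:
    # Point quadtree; every node carries the max weight in its subtree.
    def __init__(self, x0, y0, x1, y1, cap=8):
        self.b = (x0, y0, x1, y1)
        self.cap = cap                 # assumes points are not all coincident
        self.pts = []                  # (x, y, w) while this node is a leaf
        self.kids = None
        self.max_w = float("-inf")

    def insert(self, x, y, w):
        self.max_w = max(self.max_w, w)
        if self.kids is None:
            self.pts.append((x, y, w))
            if len(self.pts) > self.cap:
                self._split()
        else:
            self._child(x, y).insert(x, y, w)

    def _split(self):
        x0, y0, x1, y1 = self.b
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.kids = [QuadTree(x0, y0, mx, my, self.cap),
                     QuadTree(mx, y0, x1, my, self.cap),
                     QuadTree(x0, my, mx, y1, self.cap),
                     QuadTree(mx, my, x1, y1, self.cap)]
        pts, self.pts = self.pts, []
        for x, y, w in pts:
            self._child(x, y).insert(x, y, w)

    def _child(self, x, y):
        x0, y0, x1, y1 = self.b
        return self.kids[(x >= (x0 + x1) / 2) + 2 * (y >= (y0 + y1) / 2)]

    def query_max(self, qx0, qy0, qx1, qy1, best=float("-inf")):
        # Max weight of any point inside [qx0, qx1] x [qy0, qy1].
        x0, y0, x1, y1 = self.b
        if self.max_w <= best or qx1 < x0 or x1 < qx0 or qy1 < y0 or y1 < qy0:
            return best                # prune: disjoint, or cannot improve
        if self.kids is None:
            for x, y, w in self.pts:
                if qx0 <= x <= qx1 and qy0 <= y <= qy1:
                    best = max(best, w)
            return best
        for kid in sorted(self.kids, key=lambda k: -k.max_w):
            best = kid.query_max(qx0, qy0, qx1, qy1, best)
        return best

Extending query_max to return the point itself rather than just the weight is a matter of tracking the argmax alongside best.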
Look up so-called range trees, which in your case you would want to implement in 2 dimensions. This is a 2-layer "tree of trees": you first split the set of points based on x-coordinate, and then for the set of points at each node of the resulting tree you build a tree based on y-coordinate. You can look up how to adapt a 2-d range tree to return the number of points in the query rectangle in O((log n)^2) time, independent of the number of points. Similarly, instead of storing the count of points for sub-rectangles in the range tree, you can store the maximum weight of the points within that rectangle. This gives you O(n log n) storage and construction time, and O((log n)^2) query time, regardless of the number of points in the query rectangle.
An adaptation of so-called "fractional cascading" for the range-tree "find all points in query rectangle" operation might even get the query time down to O(log n), but I'm not sure, since you are taking the max of the weights of the points within the query rectangle.
Hint:
Every point has a "zone of influence": the locus of positions of the (top-left corner of the) rectangle such that this point dominates. The set of zones of influence defines a partition of the plane. Every edge of the partition occurs at the abscissa [ordinate] of a given point, or at its abscissa [ordinate] minus the width [height] of the query region.
If you map the coordinate values to their ranks (by sorting on both axes), you can represent the partition as a digital image of size 4N². To precompute this image, initialize it to minus infinity, and for every point fill its zone of influence with its weight, taking the maximum. If the query window covers R² pixels on average, the cost of constructing the image is O(N·R²).
A query is made by finding the row and column of the relevant pixel and returning the pixel value. This takes two dichotomic (binary) searches, in time O(log N).
This approach is only realistic for moderate values of N (say up to 1000). Better insight can be gained on this problem by studying the geometry of the partition map.
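A small numpy sketch of that precomputation, assuming the query point is taken as the top-left corner of a w x h window; the half-open cell convention and the function names are my own choices:

import numpy as np

def build_image(points, weights, w, h):
    pts = np.asarray(points, dtype=float)
    # Partition edges: each point's abscissa/ordinate and the same minus w/h.
    xs = np.unique(np.concatenate([pts[:, 0], pts[:, 0] - w]))
    ys = np.unique(np.concatenate([pts[:, 1], pts[:, 1] - h]))
    img = np.full((len(xs), len(ys)), -np.inf)
    for (px, py), wt in zip(pts, weights):
        # Anchors (qx, qy) whose window [qx, qx+w] x [qy, qy+h] contains the
        # point; with half-open cells the slice covers [px-w, px) x [py-h, py).
        i0, i1 = np.searchsorted(xs, px - w), np.searchsorted(xs, px)
        j0, j1 = np.searchsorted(ys, py - h), np.searchsorted(ys, py)
        img[i0:i1, j0:j1] = np.maximum(img[i0:i1, j0:j1], wt)
    return xs, ys, img

def query(qx, qy, xs, ys, img):
    # Two binary searches locate the cell containing the anchor: O(log N).
    # Assumes the query falls inside the gridded range.
    i = np.searchsorted(xs, qx, side="right") - 1
    j = np.searchsorted(ys, qy, side="right") - 1
    return img[i, j]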
You can try a weighted voronoi-diagram when the positive weight is substracted from the euklidian distance. Sites with big weight tends to have big cells with near-by sites with small weights. Then sort the cells by the number of sites and compute a minimum bounding box for each cell. Match it with the rectangular search box.

KNN classifier algorithm not working for all cases

To efficiently find the n nearest neighbors of a point in d-dimensional space, I selected the dimension with the greatest scatter (i.e. the coordinate in which the differences between points are largest). The whole range from the minimal to the maximal value in this dimension was split into k bins. Each bin contains the points whose coordinates (in this dimension) are within the range of that bin. It was ensured that there are at least 2n points in each bin.
The algorithm for finding the n nearest neighbors of point x is the following:
Identify the bin kx in which point x lies (its projection, to be precise).
Compute the distances between x and all the points in bin kx.
Sort the computed distances in ascending order.
Select the first n distances. The points to which these distances were measured are returned as the n nearest neighbors of x.
This algorithm does not work in all cases. When can it fail to compute the nearest neighbors?
Can anyone propose a modification of the algorithm to ensure proper operation in all cases?
Where KNN fails:
If the data is a jumble of all different classes, then knn will fail, because it will try to find the k nearest neighbours but all the points are random.
Outlier points: say you have two clusters of different classes. If you take an outlier point as the query, knn will assign it one of the classes even though the query point is far away from both clusters.
This is failing because (any of) the k nearest neighbors of x could be in a different bin than x.
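A tiny 1-d illustration of this failure, with made-up data and bin edges at 0, 1, 2: x lies just inside bin [1, 2), but its true nearest neighbour lies just inside bin [0, 1), so a search restricted to x's bin returns the wrong point.

points = [0.99, 1.2, 1.8]
x = 1.05
in_bin = [p for p in points if 1.0 <= p < 2.0]
print(min(in_bin, key=lambda p: abs(p - x)))   # 1.2   (restricted to x's bin)
print(min(points, key=lambda p: abs(p - x)))   # 0.99  (true nearest neighbour)
# Fix: also search any bin whose boundary is closer to x than the
# current n-th best distance.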
What do you mean by "not working"? You do understand that what you are doing is only an approximate method.
Try normalising the data before choosing the dimension; otherwise the scatter comparison makes no sense.
The best vector for discrimination or for clustering may not be one of the original dimensions, but any combination of dimensions.
Use PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis), to identify a discriminative dimension.

Finding the farthest point in one set from another set

My goal is a more efficient implementation of the algorithm posed in this question.
Consider two sets of points (in N-space; 3-space for the example case of RGB colorspace, while a solution for 1-space or 2-space differs only in the distance calculation). How do you find the point in the first set that is farthest from its nearest neighbor in the second set?
In a 1-space example, given the sets A:{2,4,6,8} and B:{1,3,5}, the answer would be
8, as 8 is 3 units away from 5 (its nearest neighbor in B) while all other members of A are just 1 unit away from their nearest neighbor in B. edit: 1-space is overly simplified, as sorting is related to distance in a way that it is not in higher dimensions.
The solution in the source question involves a brute force comparison of every point in one set (all R,G,B where 512>=R+G+B>=256 and R%4=0 and G%4=0 and B%4=0) to every point in the other set (colorTable). Ignore, for the sake of this question, that the first set is elaborated programmatically instead of iterated over as a stored list like the second set.
First you need to find every element's nearest neighbor in the other set.
To do this efficiently you need a nearest neighbor algorithm. Personally I would implement a kd-tree just because I've done it in the past in my algorithm class and it was fairly straightforward. Another viable alternative is an R-tree.
Do this once for each element in the smaller set. (That is, query the structure built from the larger set with each element of the smaller set.)
From this you should be able to get a list of nearest neighbors for each element.
While finding the pairs of nearest neighbors, keep them in a sorted data structure which has a fast addition method and a fast getMax method, such as a heap, sorted by Euclidean distance.
Then, once you're done simply ask the heap for the max.
The run time for this breaks down as follows:
N = size of the smaller set
M = size of the larger set
N * O(log M) for all the kd-tree nearest neighbor checks.
N * O(1) for calculating the Euclidean distance before adding it to the heap.
N * O(log N) for adding the pairs into the heap.
O(1) to get the final answer :D
So in the end the whole algorithm is O(N * log M).
If you don't care about the order of each pair you can save a bit of time and space by only keeping the max found so far.
*Disclaimer: This all assumes you won't be using an enormously high number of dimensions and that your elements follow a mostly random distribution.
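A compact version of this answer using scipy's k-d tree, following the keep-only-the-max simplification from the last paragraph; the set names are illustrative and the sets are assumed to be numeric arrays:

import numpy as np
from scipy.spatial import cKDTree

def farthest_from_nearest(small_set, large_set):
    tree = cKDTree(large_set)            # build once on the larger set
    dists, idx = tree.query(small_set)   # nearest neighbour of every point
    i = int(np.argmax(dists))            # keep only the max; no heap needed
    return small_set[i], large_set[idx[i]], dists[i]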
The most obvious approach seems to me to be to build a tree structure on one set to allow you to search it relatively quickly. A kd-tree or similar would probably be appropriate for that.
Having done that, you walk over all the points in the other set and use the tree to find their nearest neighbour in the first set, keeping track of the maximum as you go.
It's O(n log n) to build the tree and O(log n) for one search, so the whole thing should run in O(n log n).
To make things more efficient, consider using a Pigeonhole algorithm - group the points in your reference set (your colorTable) by their location in n-space. This allows you to efficiently find the nearest neighbour without having to iterate all the points.
For example, if you were working in 2-space, divide your plane into a 5 x 5 grid, giving 25 squares, with 25 groups of points.
In 3 space, divide your cube into a 5 x 5 x 5 grid, giving 125 cubes, each with a set of points.
Then, to test point n, find the square/cube/group that contains n and test the distance to those points. You only need to test points from neighbouring groups if point n is closer to the edge of its group than to the nearest neighbour found in the group; see the sketch below.
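A rough 2-d sketch of this pigeonhole scheme over the unit square; the grid size, the names, and the ring-by-ring stopping rule are my own assumptions:

from collections import defaultdict
from math import dist

GRID = 5                                   # 5 x 5 grid over the unit square
buckets = defaultdict(list)

def cell(p):
    return (min(int(p[0] * GRID), GRID - 1), min(int(p[1] * GRID), GRID - 1))

def add(p):
    buckets[cell(p)].append(p)

def nearest(q):
    qi, qj = cell(q)
    best, best_d = None, float("inf")
    for ring in range(GRID):               # expand outwards, ring by ring
        for i in range(qi - ring, qi + ring + 1):
            for j in range(qj - ring, qj + ring + 1):
                if max(abs(i - qi), abs(j - qj)) == ring:
                    for p in buckets.get((i, j), []):
                        d = dist(p, q)
                        if d < best_d:
                            best, best_d = p, d
        # Cells beyond this ring are at least `ring` cell-widths away,
        # so a match at most that far cannot be beaten.
        if best is not None and best_d <= ring / GRID:
            return best
    return best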
For each point in set B, find the distance to its nearest neighbor in set A.
To find the distance to each nearest neighbor, you can use a kd-tree as long as the number of dimensions is reasonable, there aren't too many points, and you will be doing many queries - otherwise it will be too expensive to build the tree to be worthwhile.
Maybe I'm misunderstanding the question, but wouldn't it be easiest to just reverse the sign on all the coordinates in one data set (i.e. multiply one set of coordinates by -1), then find the first nearest neighbour (which would be the farthest neighbour)? You can use your favourite knn algorithm with k=1.
EDIT: I meant nlog(n) where n is the sum of the sizes of both sets.
In the 1-space case you could do something like this (pseudocode):
Use a structure like this:
struct Item {
    int value
    int setid
}
(1) Set MaxDistance = 0
(2) Read all the sets into Item structures
(3) Create an array of pointers to all the Items
(4) Sort the array of pointers by the Item->value field of the structure
(5) Walk the array from beginning to end. For each Item from set A, its nearest neighbour in set B is the closer of the previous and the next set-B values in the sorted order; if that nearest-neighbour distance is greater than MaxDistance, set MaxDistance to it.
Return MaxDistance.
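A Python version of the corrected sweep (names are illustrative); it reproduces the question's 1-space example, where the answer for A:{2,4,6,8} and B:{1,3,5} is the point 8 at distance 3:

def farthest_nearest_1d(set_a, set_b):
    inf = float("inf")
    items = sorted([(v, 0) for v in set_a] + [(v, 1) for v in set_b])
    n = len(items)
    left = [None] * n                     # nearest B value at or before position i
    right = [None] * n                    # nearest B value at or after position i
    last = None
    for i in range(n):
        if items[i][1] == 1:
            last = items[i][0]
        left[i] = last
    last = None
    for i in range(n - 1, -1, -1):
        if items[i][1] == 1:
            last = items[i][0]
        right[i] = last
    best_v, best_d = None, -inf
    for i, (v, s) in enumerate(items):
        if s == 0:                        # a set-A value
            d = min(v - left[i] if left[i] is not None else inf,
                    right[i] - v if right[i] is not None else inf)
            if d > best_d:
                best_v, best_d = v, d
    return best_v, best_d

# farthest_nearest_1d([2, 4, 6, 8], [1, 3, 5])  ->  (8, 3)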
