Nearest Neighbor algorithm that is fast for non-matches - knn

I have a dataset with millions of points in 100-dimensional space. I need to quickly check whether a query point is not within a certain distance to any of the points in the set.
Are there fast algorithms for it?
The algorithm may have false positives (it may claim the query is close to some point when it actually isn't), but must not have false negatives (it must never claim the query is far from every point when it actually has a close neighbor).
The difficulty here is:
The number of dimensions brings the curse of dimensionality: the space is vast, yet everything is relatively close together.
Most KNN algorithms focus on finding the best candidate and rejecting all points farther than that candidate. In my case almost every query is a "miss", so the best candidate is consistently too far away to reject even half of the points.
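For reference, here is a minimal brute-force sketch of the predicate I need (function name and array shapes are just illustrative); any accelerated index would have to agree with it whenever it answers "far":

```python
import numpy as np

def is_far_from_all(points: np.ndarray, query: np.ndarray, d: float) -> bool:
    """Exact reference check: True iff `query` is farther than `d` from
    every row of `points` (shape (n, 100); `query` has shape (100,)).

    An approximate index may answer "not far" spuriously (false positive),
    but it must never answer "far" when this function returns False.
    """
    diff = points - query                         # broadcasts over all rows
    sq_dists = np.einsum('ij,ij->i', diff, diff)  # squared distances, no sqrt
    return bool(np.all(sq_dists > d * d))
```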

Related

Trouble finding shortest path across a 2D mesh surface

I asked this question three days ago and I got burned by contributors because I didn't include enough information. I am sorry about that.
I have a 2D matrix where each array position holds the depth of water in a channel. I was hoping to apply Dijkstra's or a similar "least cost path" algorithm to find the least amount of concrete needed to build a bridge across the water.
It took some time to format the data into a clean version, so I've learned some rudimentary Matlab skills doing that. I have removed most of the land so that the shoreline is now standardised to a certain value. My plan is to loop over each "pixel" on the "west" shore, run a least-cost algorithm from it to the nearest "east" shore, and move through the entire mesh, ultimately finding the cheapest crossing.
This is my problem: fitting the data to any of the algorithms. Unfortunately I get overwhelmed by the options and different formats, because the available examples are all for other use cases.
My other consideration is that the calculated least-cost path will be a jagged line, which would not be suitable for a bridge, so I need to constrain the bend radius of the path if at all possible, and I don't know how to go about doing that.
A picture of the channel:
Any advice on an approach would be great. I just need to know whether someone has a method that should work; then I will spend the time learning how to fit the data.
You can apply Dijkstra to your problem in this way:
the two "dry" regions you want to connect correspond to matrix entries with value 0; the other cells have a positive value designating the depth (or the cost of filling this place with concrete)
your edges are the connections of neighbouring cells in your matrix. (It can be a 4- or 8-neighbourhood.) The weight of the edge is the arithmetic mean of the values of the connected cells.
then you apply the Dijkstra algorithm with a starting point in one "dry" region and an end point in the other "dry" region.
The cheapest path will connect two cells of value 0, and its weight will correspond to the sum of the costs of the visited cells. (One half of each cell's weight comes from the edge entering the cell, the other half from the edge leaving it.)
This way you will get a possibly rather crooked path leading over the water, which may be a helpful hint for where to build a cheap bridge.
You can speed up the calculation by using the A* algorithm. For this, one needs, for each cell, a lower bound on the remaining cost of reaching the other side. Such a lower bound can be calculated by examining the "concentric rings" around a point, as long as the rings do not contain a 0-cell of the other side. The sum of the minimal cell values over these rings is then a lower bound on the remaining cost.
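A minimal sketch of this construction in Python (assuming a NumPy depth matrix where 0 marks dry land, a 4-neighbourhood, and caller-supplied "west"/"east" shore cells; the function name is illustrative, not a drop-in Matlab solution):

```python
import heapq
import numpy as np

def cheapest_crossing(depth, sources, targets):
    """Dijkstra over a depth grid.

    depth   : 2D NumPy array; 0 = dry land, >0 = cost of filling that cell.
    sources : iterable of (row, col) cells on the "west" shore (value 0).
    targets : set of (row, col) cells on the "east" shore (value 0).
    Edge weight between neighbouring cells = mean of their two values.
    Returns (total_cost, path) of the cheapest crossing found.
    """
    sources = list(sources)
    rows, cols = depth.shape
    dist = {c: 0.0 for c in sources}
    prev = {}
    heap = [(0.0, c) for c in sources]
    heapq.heapify(heap)
    done = set()
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell in done:
            continue
        done.add(cell)
        if cell in targets:                      # first target popped is cheapest
            path = [cell]
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return cost, path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = cost + 0.5 * (depth[r, c] + depth[nr, nc])
                if nd < dist.get((nr, nc), float('inf')):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    return float('inf'), []
```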
An alternative approach, which emphasizes the constraint that you require a non-jagged shape for your bridge, would be to use Monte Carlo, simulated annealing or a genetic algorithm, where the initial "bridge" consists of a simple spline curve between two randomly chosen end points (one on each side of the chasm), plus a small number of randomly chosen intermediate points in the chasm. You would end up with a physically 'realistic' bridge and a reasonably optimized cost of concrete.

Index for fast point-(> 100.000.000)-in-polygon-(> 10.000)-test

Problem
I'm working with OpenStreetMap data and want to test, for point features, which polygon they lie in. In total there are tens of thousands of polygons and around 100,000,000 points. I can hold all this data in memory. The polygons usually have thousands of vertices, so point-in-polygon tests are very expensive.
Idea
I could index all polygons with an R-Tree, allowing me to only check the polygons whose bounding-box is hit.
Probable new problem
As the polygons touch each other (think of administrative boundaries), many points lie in the bounding boxes of more than one polygon, forcing many point-in-polygon tests.
Question
Do you have any better suggestion than using an R-Tree?
Quad-trees will likely work worse than rasterization - they are essentially a repeated rasterization to 2x2 images... But definitely exploit rasterization for all the easy cases, because testing the raster should be as fast as it gets. And if you can solve 90% of your points the easy way, you have more time left for the remaining ones.
Also make sure to remove duplicate points first. Indexes often suffer from duplicates, and duplicates are obviously redundant to test.
R*-trees are probably a good thing to try, but you need to implement them really carefully.
The operation you are looking for is a containment spatial join. I don't think there is any implementation around that you could use, but given your performance requirements I would carefully implement it myself anyway. Also make sure to tune the parameters and profile your code!
The basic idea of the join is to build two trees - one for the points, one for the polygons.
You then start with the root nodes of each tree and repeat the following recursively until you reach the leaf level:
If the bounding boxes of the two nodes do not overlap: return.
If at least one of the two nodes is a directory (non-leaf) node:
Decide by a heuristic (you'll need to figure this part out; "larger extent" may do for a start) which directory node to expand.
Recurse into each child of the expanded node, paired with the other (non-expanded) node.
If both nodes are leaves:
fast test: point vs. bounding box of the polygon
slow test: point in polygon
You can further accelerate this if you have a fast interior test for the polygon, in particular a rectangle-in-polygon test. It may be good enough if it is approximate, as long as it is fast.
For more detailed information, search for r-tree spatial join.
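A rough sketch of that recursive join (the node layout, the "larger extent" heuristic and the ray-casting test are assumptions; a real R*-tree implementation would supply its own node structure and exact geometry tests):

```python
def spatial_join(point_node, poly_node, emit):
    """Containment join over two bounding-box hierarchies.

    Nodes are duck-typed: directory nodes have a non-empty .children list,
    leaf nodes have .children == [] and an .entries list (points as (x, y)
    tuples on one side, polygon objects with .bbox and .vertices on the
    other).  emit(pt, poly) is called for every containment pair found.
    """
    if not overlaps(point_node.bbox, poly_node.bbox):
        return
    if point_node.children or poly_node.children:
        # Expand the directory node with the larger extent (simple heuristic).
        if poly_node.children and (not point_node.children
                                   or area(poly_node.bbox) >= area(point_node.bbox)):
            for child in poly_node.children:
                spatial_join(point_node, child, emit)
        else:
            for child in point_node.children:
                spatial_join(child, poly_node, emit)
        return
    # Both nodes are leaves: cheap bbox filter first, exact test second.
    for pt in point_node.entries:
        for poly in poly_node.entries:
            if bbox_contains(poly.bbox, pt) and point_in_polygon(pt, poly.vertices):
                emit(pt, poly)

def overlaps(a, b):
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = a, b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def area(bbox):
    x1, y1, x2, y2 = bbox
    return (x2 - x1) * (y2 - y1)

def bbox_contains(bbox, pt):
    x1, y1, x2, y2 = bbox
    return x1 <= pt[0] <= x2 and y1 <= pt[1] <= y2

def point_in_polygon(pt, vertices):
    # Standard ray casting; `vertices` is a list of (x, y) tuples.
    x, y = pt
    inside = False
    j = len(vertices) - 1
    for i in range(len(vertices)):
        xi, yi = vertices[i]
        xj, yj = vertices[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside
```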
Try using quad trees.
Basically you can recursively partition space into 4 parts, and for each part you should know:
a) the polygons which are a superset of the given part (i.e. fully contain it)
b) the polygons which intersect the given part
This adds an O(log n) overhead factor which you might not be happy with.
The other option is to just partition space using a grid. You should keep the same information for each grid cell as in the case above; this has only constant overhead. (A sketch of this grid variant follows below.)
Both of these options assume that the distribution of polygons is more or less uniform.
There is another option if you can process the points offline (in other words, if you can pick the processing order of the points). Then you can use sweep-line techniques: sort the points by one coordinate, iterate over them in this sorted order, and maintain only the currently relevant set of polygons during the iteration.
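As a concrete illustration of the grid option, a hedged sketch using Shapely (the cell size and the two-level cell classification are assumptions; a real OSM pipeline would also need to handle multipolygons, holes and boundary ties):

```python
from collections import defaultdict
from shapely.geometry import Point, box   # pip install shapely

def build_grid_index(polygons, cell_size):
    """Classify each grid cell: either fully covered by one polygon (case a)
    or merely intersected by some polygons (case b, needs exact tests)."""
    covers = {}                       # cell -> polygon fully containing it
    candidates = defaultdict(list)    # cell -> polygons to test exactly
    for poly in polygons:
        minx, miny, maxx, maxy = poly.bounds
        for i in range(int(minx // cell_size), int(maxx // cell_size) + 1):
            for j in range(int(miny // cell_size), int(maxy // cell_size) + 1):
                cell = box(i * cell_size, j * cell_size,
                           (i + 1) * cell_size, (j + 1) * cell_size)
                if poly.contains(cell):
                    covers[(i, j)] = poly             # case (a)
                elif poly.intersects(cell):
                    candidates[(i, j)].append(poly)   # case (b)
    return covers, candidates

def locate(x, y, covers, candidates, cell_size):
    """Return the polygon containing (x, y), or None."""
    cell = (int(x // cell_size), int(y // cell_size))
    if cell in covers:                         # cheap path: no exact test
        return covers[cell]
    for poly in candidates.get(cell, []):      # expensive path: few candidates
        if poly.contains(Point(x, y)):
            return poly
    return None
```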

Nearest vertex search

I'm looking for an effective algorithm to find the vertex nearest to a point P(x, y, z). The set of vertices is fixed; each request comes with a new point P. I tried k-d trees and other known methods and hit the same problem everywhere: if P is close to the data, all is fine and the search visits only a few tree nodes; however, if P is far enough away, more and more nodes have to be scanned and the speed finally becomes unacceptably slow. In my task I have no way to specify a small search radius. What are the solutions for such a case?
Thanks
Igor
One possible way to speed up your search would be to discretize space into a large number of rectangular prisms spaced apart at regular intervals. For example, you could split space up into lots of 1 × 1 × 1 unit cubes. You then distribute the points in space into these volumes. This gives you a sort of "hash function" for points that distributes points into the volume that contains them.
Once you have done this, do a quick precomputation step and find, for each of these volumes, the closest nonempty volumes. You could do this by checking all volumes one step away from the volume, then two steps away, etc.
Now, to do a nearest neighbor search, you can do the following. Start off by hashing your point in space to the volume that contains it. If that volume contains any points, iterate over all of them to find which one is closest. Then, for each of the volumes that you found in the first step of this process, iterate over those points to see if any of them are closer. The resulting closest point is the nearest neighbor to your test point.
If your volumes end up containing too many points, you can refine this approach by subdividing those volumes into even smaller volumes and repeating this same process. You could alternatively create a bunch of smaller k-d trees, one for each volume, to do the nearest-neighbor search. In this way, each k-d tree holds a much smaller number of points than your original k-d tree, and the points within each volume are all reasonable candidates for a nearest neighbor. Therefore, the search should be much, much faster.
This setup is similar in spirit to an octree, except that you divide space into a bunch of smaller regions rather than just eight.
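A compact sketch of this scheme (the cell size is arbitrary; the precomputed "closest non-empty volumes" table is replaced here by a lazy shell scan at query time, which is the same idea done on demand):

```python
from collections import defaultdict
import itertools
import math

class CubeHashNN:
    """Hash 3-d points into axis-aligned cubes of side `cell`, then answer
    nearest-neighbour queries by scanning concentric shells of cubes."""

    def __init__(self, points, cell=1.0):
        self.cell = cell
        self.cubes = defaultdict(list)
        for p in points:
            self.cubes[self._key(p)].append(p)

    def _key(self, p):
        return tuple(int(math.floor(c / self.cell)) for c in p)

    def nearest(self, q):
        kx, ky, kz = self._key(q)
        best, best_d = None, float('inf')
        shell = 0
        # Any point in a cube of Chebyshev shell k is at least (k - 1) * cell
        # away from q, so we can stop once that lower bound exceeds best_d.
        while best is None or (shell - 1) * self.cell <= best_d:
            for dx, dy, dz in itertools.product(range(-shell, shell + 1), repeat=3):
                if max(abs(dx), abs(dy), abs(dz)) != shell:
                    continue          # visit only the surface of this shell
                for p in self.cubes.get((kx + dx, ky + dy, kz + dz), ()):
                    d = math.dist(p, q)
                    if d < best_d:
                        best, best_d = p, d
            shell += 1
            if best is None and shell > 1000:
                break                 # safety stop for a (nearly) empty index
        return best
```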
Hope this helps!
Well, this is not an issue of the index structures used, but of your query:
the nearest neighbor simply becomes much fuzzier the further away you are from your data set.
So I doubt that any other index will help you much.
However, you may be able to plug in a threshold in your search. I.e. "find nearest neighbor, but only when within a maximum distance x".
For static, in-memory, 3-d point double vector data, with euclidean distance, the k-d-tree is hard to beat, actually. It just splits the data very very fast. An octree may sometimes be faster, but mostly for window queries I guess.
Now if you really have very few objects but millions of queries, you could try a hybrid approach. Roughly something like this: compute all points on the convex hull of your data set, and compute its center and radius. Whenever a query point is x times further away than that (you need to do the 3-d math yourself to figure out the correct x), its nearest neighbor must be one of the convex-hull points. Then again use a k-d tree, but one containing only the hull points.
Or even simpler: find the min/max point in each dimension, which gives you 6 candidates for 3-d data. Maybe add some additional extremes (in x+y, x-y, x+z, x-z, y+z, y-z, etc.), so you end up with a small set of, say, 6-26 candidate points. Precompute the center and the distances of these candidates from it, and let m be the maximum of those distances. For a query, compute its distance to the center. If this is larger than m, first compute the distance to the closest of the candidates, then query the k-d tree but bound the search to this distance. This costs you one extra distance computation for close queries and a handful for far queries, and it may significantly speed up the search by supplying a good candidate early. For further speedups, organize these 6-26 candidates in a k-d tree too, to find the best bound quickly.
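A sketch of this simpler variant on top of SciPy's cKDTree (the choice of extreme points and the tiny slack factor are assumptions; the point is just to hand the tree a tight distance_upper_bound for far-away queries):

```python
import numpy as np
from scipy.spatial import cKDTree

class BoundedNN:
    """k-d tree query that derives an upper bound from a few precomputed
    extreme points and passes it to the tree as distance_upper_bound."""

    def __init__(self, pts):
        pts = np.asarray(pts, dtype=float)
        self.tree = cKDTree(pts)
        # Candidates: the min/max point along each axis (6 points in 3-d).
        idx = np.unique(np.concatenate([pts.argmin(axis=0), pts.argmax(axis=0)]))
        self.candidates = pts[idx]
        self.center = pts.mean(axis=0)
        self.m = np.linalg.norm(self.candidates - self.center, axis=1).max()

    def query(self, q):
        q = np.asarray(q, dtype=float)
        if np.linalg.norm(q - self.center) > self.m:
            # Far query: the nearest extreme point gives a decent upper bound,
            # so the k-d tree can prune aggressively.
            bound = np.linalg.norm(self.candidates - q, axis=1).min()
            return self.tree.query(q, distance_upper_bound=bound * (1 + 1e-9))
        return self.tree.query(q)
```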

Dijkstra-based algorithm optimization for caching

I need to find the optimal path connecting two planar points. I'm given a function that determines the maximal advance velocity, which depends on both the location and the time.
My solution is based on the Dijkstra algorithm. At first I cover the plane with a 2D lattice, considering only discrete points. Points are connected to their neighbors up to a specified order, to get sufficient direction resolution. Then I find the best path with a (sort of) Dijkstra algorithm. Next I improve the resolution/quality of the found path: I increase the lattice density and the neighbor connectivity order, while restricting the search to points close enough to the already-found path. This may be repeated until the needed resolution is achieved.
This works generally well, but nevertheless I'd like to improve the overall algorithm performance. I've implemented several tricks, such as variable lattice density and neighbor connectivity order based on the "smoothness" of the price function. However, I believe there's potential to improve the Dijkstra algorithm itself (as applied to my specific graph) which I couldn't fully realize yet.
First let's agree on the terminology. I split all the lattice points into 3 categories:
cold - points that have not been reached by the algorithm.
warm - points that are reached, but not fully processed yet (i.e. have potential for improvement)
stable - points that are fully processed.
At each step the Dijkstra algorithm picks the "cheapest" warm lattice point and then tries to improve the price of its neighbors. Because of the nature of my graph, I get a kind of cloud of stable points surrounded by a thin layer of warm points. At each step a warm point on the cloud perimeter is processed, then it's added to the stable cloud, and the warm perimeter is (potentially) expanded.
The problem is that warm points processed consecutively by the algorithm are usually spatially (hence topologically) unrelated. A typical warm perimeter consists of hundreds of thousands of points. At each step the next warm point to process is spatially pseudo-random, so there's virtually no chance that two related points are processed one after another.
This indeed creates a problem with CPU cache utilization: at each step the CPU deals with a pseudo-random memory location. Since there's a large number of warm points, all the relevant data may not fit in the CPU cache (it's on the order of tens to hundreds of MB).
Well, this is indeed an implication of the Dijkstra algorithm: the whole idea is explicitly to pick the cheapest warm point, regardless of other properties.
However, it's intuitively obvious that points on one side of a big cloud perimeter have no effect on points on the other side (in our specific case), so there's no problem with swapping their processing order.
Hence I thought about ways of "adjusting" the processing order of the warm points without compromising the algorithm in general. I considered several ideas, such as dividing the plane into blocks and solving them partially and independently until some criterion indicates that their solutions may interfere; or, alternatively, ignoring the interference and potentially allowing "re-solving" (i.e. a transition from stable back to warm).
However, so far I could not find a rigorous method.
Are there any ideas on how to do this? Perhaps it's a known problem with existing research and (hopefully) solutions?
Thanks in advance. And sorry for the long question.
What you're describing is the motivation behind the A* search algorithm, a modification of Dijkstra's algorithm that can dramatically improve the runtime by guiding the search in a direction that is likely to pick points that keep getting closer and closer to the destination. A* never does any more work than a naive Dijkstra's implementation, and typically tends to expand out nodes that are clustered on the frontier of the warm nodes that are closest to the destination node.
Internally, A* works by augmenting Dijkstra's algorithm with a heuristic function that estimates the remaining distance to the target node. This means that if you can get a rough approximation of how far away a given node is from the destination, you can end up ignoring nodes that don't need to be processed in favor of nodes that are likely to be better.
A* was not designed as a cache-optimal algorithm, but I believe that
expanding fewer nodes (so fewer nodes to keep in the cache), and
expanding nodes closer to the goal (which were processed more recently and are thus more likely still to be in the cache)
will give you a large speedup and better cache behavior.
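For concreteness, a generic A* sketch (the neighbours/cost callables and the v_max bound are assumptions about your lattice; the only change relative to plain Dijkstra is the heuristic term added to the priority key):

```python
import heapq
import math

def a_star(start, goal, neighbours, cost, v_max):
    """Generic A* over lattice points given as tuples.

    neighbours(p) yields reachable lattice points, cost(p, q) is the travel
    time of the edge p -> q.  Since straight-line distance / v_max never
    overestimates the remaining travel time, it is an admissible heuristic.
    """
    def h(p):
        return math.dist(p, goal) / v_max

    open_heap = [(h(start), 0.0, start)]
    g = {start: 0.0}
    prev = {}
    closed = set()
    while open_heap:
        _, g_p, p = heapq.heappop(open_heap)
        if p in closed:
            continue
        closed.add(p)
        if p == goal:                       # reconstruct the path
            path = [p]
            while p in prev:
                p = prev[p]
                path.append(p)
            return g_p, path[::-1]
        for q in neighbours(p):
            g_q = g_p + cost(p, q)
            if g_q < g.get(q, float('inf')):
                g[q] = g_q
                prev[q] = p
                heapq.heappush(open_heap, (g_q + h(q), g_q, q))
    return float('inf'), []
```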
Hope this helps!

How do I test if a given BSP tree is optimal?

I have a polygon soup of triangles that I would like to construct a BSP tree for. My current program simply constructs a BSP tree by inserting a random triangle from the model one at a time until all the triangles are consumed, then it checks the depth and breadth of the tree and remembers the best score it achieved (lowest depth, lowest breadth).
By definition, the best depth would be log2(n) (or less if co-planar triangles are grouped?) where n is the number of triangles in my model, and the best breadth would be n (meaning no splitting has occurred). But, there are certain configurations of triangles for which this pinnacle would never be reached.
Is there an efficient test for checking the quality of my BSP tree? Specifically, I'm trying to find a way for my program to know it should stop looking for a more optimal construction.
Construction of an optimal tree is an NP-complete problem. Determining if a given tree is optimal is essentially the same problem.
From this BSP FAQ:
The problem is one of splitting versus tree balancing. These are mutually exclusive requirements. You should choose your strategy for building a good tree based on how you intend to use the tree.
Randomly building BSP trees until you chance upon a good one will be really, really inefficient.
Instead of choosing a tri at random to use as a split-plane, you want to try out several (maybe all of them, or maybe a random sampling) and pick one according to some heuristic. The heuristic is typically based on (a) how balanced the resulting child nodes would be, and (b) how many tris it would split.
You can trade off performance and quality by considering a smaller or larger sampling of tris as candidate split-planes.
But in the end, you can't hope to get a totally optimal tree for any real-world data so you might have to settle for 'good enough'.
Try to pick, as splitting planes, polygons that would (potentially) be split by the most other planes; a splitting plane itself can't be split.
Try to pick a plane that has close to the same number of planes in front as in back.
Try to pick a plane that doesn't cause too many splits.
Try to pick a plane that is coplanar with a lot of other surfaces
You'll have to evaluate these criteria and come up with a scoring system to decide which candidate is most likely to be a good splitting plane. For example, the further off balance it is, the more score it loses; if it causes 20 splits, the penalty is -5 * 20 (for example). Choose the one that scores best. You don't have to sample every polygon, just search for a pretty good one.
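A sketch of such a scoring pass (the triangle's .plane attribute, the classify callback and the penalty weights are placeholders to be tuned for your data):

```python
import random

def score_split_plane(candidate, triangles, classify,
                      balance_penalty=1.0, split_penalty=5.0):
    """Score one triangle's supporting plane as a potential splitter.

    classify(tri, plane) must return 'front', 'back', 'coplanar' or
    'spanning' (a spanning triangle would have to be split).
    Higher score = better candidate.
    """
    front = back = coplanar = spanning = 0
    for tri in triangles:
        side = classify(tri, candidate.plane)
        if side == 'front':
            front += 1
        elif side == 'back':
            back += 1
        elif side == 'coplanar':
            coplanar += 1
        else:
            spanning += 1
    return (coplanar                               # reward coplanar grouping
            - balance_penalty * abs(front - back)  # penalise imbalance
            - split_penalty * spanning)            # penalise forced splits

def pick_split_plane(triangles, classify, sample_size=16):
    """Evaluate a random sample of candidate triangles, keep the best plane."""
    sample = random.sample(triangles, min(sample_size, len(triangles)))
    return max(sample, key=lambda t: score_split_plane(t, triangles, classify))
```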
