Faster ways to search points close to a line - algorithm

I have a set of points and a line in 2D space. I need to find all points that lie within a distance D from the line. Is there a way for me to do this without having to actually compute distances di of all points from the line? Is there a solution better than linear search?
Edit: I need to search through the same point set for different lines multiple times. The points are always constant but the line would be different during each search. Typically the point set is of the order of tens of thousands (~50k).

As for queries:
If you build a kd-tree over the points and sample a few equidistant points along the line (spaced roughly d apart), you should be able to use a modified nearest-neighbor query to find all points that are within d of the line in roughly O(k + log(N)), where k is the number of points reported. The kd-tree requires O(N log N) preprocessing though, so it's only better if you use the same point set (perhaps with slight differences, as you can add or remove a point from a kd-tree in O(log N)) and different lines. The only issue is that a kd-tree isn't really meant for use with lines. I'm sure there is something like it for lines that would work better, but I'm not familiar with it.
Note: False positives and negatives are possible depending on how things are arranged, as you are really querying the distance to sample points on the line instead of the distance to the line itself. How problematic this is largely depends on the ratio between the length of the line and d. Thus you are going to get a fair number of either false positives or false negatives unless the majority of the points are nowhere near the line. In general, this probably won't be too much of an issue though, as even with the false positives k should be fairly small compared to N unless d is relatively large.
After a bit of review, I noticed that the query is against a line, not a line segment. It can however be converted into one by bounding the line by the min/max x/y. I imagine there is still probably a more efficient way to use a kd-tree for this.
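A minimal Python sketch of the sampling idea above, assuming the line has already been clipped to a segment from a to b. The helper names (build_kdtree, radius_query, points_near_segment) are made up for illustration. Sample points are spaced at most d apart and each one is queried with radius sqrt(d^2 + (spacing/2)^2), so the circles cover the whole band and no point is missed; an exact point-to-segment distance check then drops the false positives.

```python
import math

def build_kdtree(points, depth=0):
    """Build a 2D kd-tree as nested tuples (point, left, right)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def radius_query(node, center, r, out, depth=0):
    """Add every stored point within distance r of center to the set out."""
    if node is None:
        return
    point, left, right = node
    axis = depth % 2
    if math.dist(point, center) <= r:
        out.add(point)
    diff = center[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    radius_query(near, center, r, out, depth + 1)
    if abs(diff) <= r:  # the query circle crosses the splitting line
        radius_query(far, center, r, out, depth + 1)

def point_segment_distance(p, a, b):
    """Exact Euclidean distance from point p to segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.dist(p, a)
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))

def points_near_segment(tree, a, b, d):
    """Return all indexed points within distance d of segment a-b."""
    length = math.dist(a, b)
    steps = max(1, math.ceil(length / d))  # sample spacing is at most d
    spacing = length / steps
    r = math.hypot(d, spacing / 2)         # circles overlap enough to cover the band
    candidates = set()
    for i in range(steps + 1):
        t = i / steps
        sample = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        radius_query(tree, sample, r, candidates)
    # Exact filter removes the false positives picked up by the circles.
    return {p for p in candidates if point_segment_distance(p, a, b) <= d}
```

Points are expected as tuples so they can be collected in a set. The tree is built once and reused for every new line.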

This search can't be completed faster than a linear search, as the input data and the result population have the same complexity.

Related

How to find the nearest line segment to a specific point more efficiently?

This is a problem I come across frequently and I'm searching for a more effective way to solve it. Take a look at this picture:
Let's say you want to find the shortest distance from the red point to a line segment an. Assume you only know the start/end point (x,y) of the segments and the point. Now this can be done in O(n), where n is the number of line segments, by checking every distance from the point to a line segment. This is IMO not effective, because in the worst case there have to be n-1 distance checks till the right one is found.
This can be a real performance issue for n = 1000, for example (which is a likely number), especially if the distance calculation isn't just done in Euclidean space by the Pythagorean theorem but for example by a geodesic method like the haversine formula or Vincenty's.
This is a general problem in different situations:
Is the point inside a radius of the vertices?
Which set of vertices is nearest to the point?
Is the point surrounded by line segments?
To answer these questions, the only approach I know is O(n). Now I would like to know if there is a data structure or a different strategy to solve these problems more efficiently?
To make it short: I'm searching for a way in which the line segments / vertices could be "filtered" somehow to get a set of potential candidates before I start my distance calculations. Something to reduce the complexity to O(m) where m < n.
Probably not an acceptable answer, but too long for a comment: The most appropriate answer here depends on details that you did not state in the question.
If you only want to perform this test once, then there will be no way to avoid a linear search. However, if you have a fixed set of lines (or a set of lines that does not change too significantly over time), then you may employ various techniques for accelerating the queries. These are sometimes referred to as Spatial Indices, like a Quadtree.
You'll have to expect a trade-off between several factors, like the query time and the memory consumption, or the query time and the time that is required for updating the data structure when the given set of lines changes. The latter also depends on whether it is a structural change (lines being added or removed), or whether only the positions of the existing lines change.
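As a rough illustration of such a spatial index, here is a Python sketch that uses a uniform grid rather than a quadtree (a simpler structure swapped in for brevity). The class name SegmentGrid and the choice of cell size are assumptions, not something from the question. Each segment is registered in every cell its bounding box touches, and a query only runs the exact distance test on segments found in cells near the point.

```python
import math
from collections import defaultdict

def point_segment_distance(p, a, b):
    """Exact Euclidean distance from point p to segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.dist(p, a)
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))

class SegmentGrid:
    """Uniform grid over segments, keyed by integer cell coordinates."""

    def __init__(self, segments, cell_size):
        self.segments = segments
        self.cell_size = cell_size
        self.cells = defaultdict(list)
        for index, (a, b) in enumerate(segments):
            x0, x1 = sorted((a[0], b[0]))
            y0, y1 = sorted((a[1], b[1]))
            # Register the segment in every cell its bounding box overlaps.
            for cx in range(self._key(x0), self._key(x1) + 1):
                for cy in range(self._key(y0), self._key(y1) + 1):
                    self.cells[(cx, cy)].append(index)

    def _key(self, value):
        return math.floor(value / self.cell_size)

    def within(self, p, r):
        """Return sorted indices of all segments within distance r of p."""
        candidates = set()
        for cx in range(self._key(p[0] - r), self._key(p[0] + r) + 1):
            for cy in range(self._key(p[1] - r), self._key(p[1] + r) + 1):
                candidates.update(self.cells.get((cx, cy), ()))
        # Only the candidates get the (possibly expensive) distance test.
        return sorted(i for i in candidates
                      if point_segment_distance(p, *self.segments[i]) <= r)
```

The distance function here is Euclidean; with a geodesic distance the same filtering idea applies, but the cell lookup would need a conservative bound in the grid's coordinates.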

Approximated closest pair algorithm

I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D), because it is guaranteed that these distances are greater than the current known minimum distance.
Is it possible to use this kind of information to reduce the O(n²) to something like O(nlogn)?
Is it possible to reduce the cost to something close to O(nlogn) if I accept a kind of approximated solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution when the number of reinforcements go to infinite, but provides a great approximation for small n.
Processing time (measured by the big O notation) is not the only issue. To keep a very large amount of previous calculated distances can also be an issue.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or something related. I have been just thinking about this problem.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF ) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result to be, you may be able to just search the HashMap with the largest hashMap.size()/radius (because this HashMap contains the highest density of points, thus it is a good search candidate).
Good Luck
If you only have sample distances, not original point locations in a plane you can operate on, then I suspect you are bounded at O(E).
Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule out it having something interesting to say, meanwhile, inspecting every edge and taking the smallest solves the problem.
Planar versions bypass O(V^2), by using planar distances to deduce limitations on sets of edges, allowing us to avoid needing to look at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set in two parts: points that are closer to the first point and points that are closer to the second point. That is the same as splitting the points by a line passing between the two chosen points.
That produces (binary) space partitioning, on which standard nearest neighbour search algorithms can be used.
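The pivot-based splitting just described, combined with the "closest pair per leaf" idea from the M-Tree answer, might look like the following Python sketch. It only ever calls a distance function (math.dist here, as a stand-in for whatever metric applies), and the function names and the leaf_size parameter are hypothetical. The result is approximate: a pair that straddles two leaves is never examined.

```python
import itertools
import math
import random

def closest_in_leaf(points):
    """Brute-force closest pair inside one small leaf (None if < 2 points)."""
    return min(itertools.combinations(points, 2),
               key=lambda pair: math.dist(*pair), default=None)

def approx_closest_pair(points, leaf_size=16, rng=None):
    """Approximate closest pair; leaf_size should be at least 2."""
    rng = rng or random.Random(0)
    if len(points) <= leaf_size:
        return closest_in_leaf(points)
    # Choose two pivots and assign every point to the nearer one.
    a, b = rng.sample(points, 2)
    near_a, near_b = [], []
    for p in points:
        (near_a if math.dist(p, a) <= math.dist(p, b) else near_b).append(p)
    if not near_b:
        # Degenerate pivots (for example duplicates): split down the middle.
        mid = len(points) // 2
        near_a, near_b = points[:mid], points[mid:]
    results = [approx_closest_pair(part, leaf_size, rng)
               for part in (near_a, near_b)]
    return min((pair for pair in results if pair is not None),
               key=lambda pair: math.dist(*pair), default=None)
```

A larger leaf_size makes the answer more likely to be exact at the cost of more distance evaluations; with leaf_size equal to the number of points it degenerates to the full quadratic search.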

Constant time search

Suppose I have a rod which I cut to pieces. Given a point on the original rod, is there a way to find out which piece it belongs to, in constant time?
For example:
|------------------|---------|---------------|
0.0                4.5       7.8532          9.123
Given a position:
                               ^
                               |
                             8.005
I would like to get 3rd piece.
It is possible to easily get such answer in O(log n) time with binary search but is it possible to do it in O(1)? If I pre-process the "cut" positions somehow?
If you assume the point you want to query is chosen uniformly at random along the rod, then you can have an EXPECTED constant-time solution, without crazy memory explosion, as follows. Break the rod into N equally spaced pieces, where N is the number of original irregularly spaced segments in your rod, and record for each of the N equal-sized pieces which of the original irregular segment(s) it overlaps. To do a query, first round the query point off to find out which equally spaced piece it lies in, then use that index to look up which of your original segments intersect that piece, and then check each intersecting original segment to see if it contains your point (you can use binary search here if you want to make sure the worst-case performance is still logarithmic). The expected running time for this approach is constant if you assume that the query point is randomly chosen along your rod, and the amount of memory is O(N) if your rod was originally cut into N irregular pieces, so no crazy memory requirements.
PROOF OF EXPECTED O(1) RUNNING TIME:
When you count the total number of intersection pairs between your original N irregular segments and the N equally-spaced pieces I propose constructing, the total number is no more than 2*(N+1) (because if you sort all the end-points of all the regular and irregular segments, a new intersection pair can always be charged to one of the end-points defining either a regular or irregular segment). So you have a multi-set of at most 2(N+1) of your irregular segments, distributed out in some fashion among the N regular segments that they intersect. The actual distribution of intersections among the regular segments doesn't matter. When you have a uniform query point and compute the expected number of irregular segments that intersect the regular segment that contains the query point, each regular segment has probability 1/N of being chosen by the query point, so the expected number of intersected irregular segments that need to be checked is 2*(N+1)/N = O(1).
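A small Python sketch of this bucketing scheme, under the assumption that the cut positions are given as a sorted list that includes both ends of the rod. The class name RodIndex is made up. Instead of storing the full list of overlapping segments per bucket, it stores only the first one and scans forward, which visits the same segments.

```python
class RodIndex:
    """Expected O(1) piece lookup for uniformly distributed query points."""

    def __init__(self, cuts):
        self.cuts = cuts                  # sorted, includes both rod ends
        self.n = len(cuts) - 1            # number of irregular pieces
        self.start = cuts[0]
        self.width = (cuts[-1] - cuts[0]) / self.n
        self.first_piece = []             # first piece overlapping each bucket
        piece = 0
        for bucket in range(self.n):
            left = self.start + bucket * self.width
            while piece < self.n - 1 and cuts[piece + 1] <= left:
                piece += 1
            self.first_piece.append(piece)

    def find(self, x):
        """Return the 0-based index of the piece containing position x."""
        bucket = int((x - self.start) / self.width)
        bucket = min(self.n - 1, max(0, bucket))
        piece = self.first_piece[bucket]
        # Guard against floating-point rounding at bucket boundaries.
        while piece > 0 and self.cuts[piece] > x:
            piece -= 1
        # Scan the few pieces that overlap this bucket.
        while piece < self.n - 1 and self.cuts[piece + 1] <= x:
            piece += 1
        return piece
```

For the rod in the question, RodIndex([0.0, 4.5, 7.8532, 9.123]).find(8.005) returns 2, i.e. the third piece when counting from zero.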
For arbitrary cuts and precisions, not really, you have to compare the position with the various start or end points.
But, if you're only talking a small number of cuts, performance shouldn't really be an issue.
For example, even with ten segments, you only have nine comparisons, not a huge amount of computation.
Of course, you can always turn the situation into a polynomial formula (such as ax^4 + bx^3 + cx^2 + dx + e), generated using simultaneous equations, which will give you a segment, but the highest power tends to rise with the segment count so it's not necessarily as efficient as simple checks.
You're not going to do better than lg n with a comparison-based algorithm. Reinterpreting the 31 non-sign bits of a positive IEEE float as a 31-bit integer is an order-preserving transformation, so tries and van Emde Boas trees both are options. I would steer you first toward a three-level trie.
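To illustrate the order-preserving reinterpretation mentioned above, here is a short Python sketch using the standard struct module. The function name float_key is hypothetical, and note that Python floats are doubles, so packing with the "f" format first rounds the value to single precision.

```python
import math
import struct

def float_key(x):
    """Reinterpret a non-negative float32 bit pattern as an unsigned integer."""
    if math.copysign(1.0, x) < 0:
        raise ValueError("only non-negative values keep their order")
    # Pack as a little-endian IEEE 754 single, then read the same 4 bytes
    # back as an unsigned 32-bit integer.
    return struct.unpack("<I", struct.pack("<f", x))[0]
```

Integer keys like these can then be fed to a trie or van Emde Boas tree: comparing two keys as integers gives the same result as comparing the original non-negative floats.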
You could assign an integral number to every position and then use that as index into a lookup table, which would give you constant-time lookup. This is pretty easy if your stick is short and you don't cut it into pieces that are fractions of a millimeter long. If you can get by with such an approximation, that would be my way to go.
There is one enhanced way which generalizes this even further. In each element of a lookup table, you store the middle position and the segment ID to the left and right. This makes one lookup (O(1)) plus one comparison (O(1)). The downside is that the lookup table has to be so large that you never have more than two different segments in the same table element's range. Again, it depends on your requirements and input data whether this works or not.

Find all points in sphere of radius r around arbitrary coordinate

I'm looking for an efficient algorithm that for a space with known height, width and length, given a fixed radius R, and a list of points N, with 3-dimensional coordinates in that space, will find all the points within a fixed radius R of an arbitrary point on the grid. This query will be done many times with different points, so an expensive pre-processing/sorting step, in exchange for quick queries may be worth it. This is a bit of a bottleneck step of an application I'm working on, so any time I can cut off of it is useful
Things I have tried so far:
-The naive algorithm, iterate over all points and calculate distance
-Divide the space into a grid with cubes of length R, and put the points into these. That way, for each point, I only have to ever query the immediate neighboring buckets. This has a significant speedup
-I've tried using the Manhattan distance as a heuristic. That is, within the buckets, before calculating a distance to any point, use the Manhattan distance to filter out those that can't possibly be within radius R (that is, those with a Manhattan distance greater than sqrt(3)*R). I thought this would offer a speedup, as it only needs addition instead of multiplication, but it actually slowed the program down by a little bit
EDIT: To compare the distances, I use the squared distance to eliminate having to use a sqrt function.
Obviously, there will be some limit on how much I can speed this up, but I could use any suggestions on things to try now.
Not that it probably matters on the algorithmic level, but I'm working in C.
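For reference, the bucket-grid approach described in the list above can be sketched as follows. This is Python for brevity (the question itself targets C), and the class name BucketGrid is made up. With cubes of side R, any point within R of the query lies in the query's cell or one of its 26 neighbors, and distances are compared in squared form.

```python
import math
from collections import defaultdict
from itertools import product

class BucketGrid:
    """3D grid with cubes of side `radius`, built once and queried many times."""

    def __init__(self, points, radius):
        self.radius = radius
        self.buckets = defaultdict(list)
        for p in points:
            self.buckets[self._cell(p)].append(p)

    def _cell(self, p):
        return tuple(math.floor(c / self.radius) for c in p)

    def query(self, center):
        """Return all stored points within `radius` of center."""
        r2 = self.radius ** 2
        cx, cy, cz = self._cell(center)
        found = []
        # Visit the cell containing center and its 26 neighbors.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for p in self.buckets.get((cx + dx, cy + dy, cz + dz), ()):
                if sum((a - b) ** 2 for a, b in zip(p, center)) <= r2:
                    found.append(p)
        return found
```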
You may get a speed benefit from storing your points in a k-d tree with three dimensions. That will give you searches in roughly O(log n) time on average for a balanced tree, plus the time needed to report the matching points.
Don't compare on the radius, compare on the square of the radius. The reason is that if the distance between two points is less than R, then the square of the distance is less than R^2.
This way, when you're using the distance formula, you don't need to compute the square root, which is a very expensive operation.
I would recommend using either K-D tree or z-curve:
http://en.wikipedia.org/wiki/Z-order_%28curve%29
How about a Binary Indexed Tree? (Topcoder tutorials referred.) It can be extended to n dimensions, and is simpler to code.
Nicolas Brodu's NEIGHAND library does exactly what you want, improving on the bin-lattice algorithm.
More details can be found in his article: Query Sphere Indexing for Neighborhood Requests
[I might be misunderstanding the question. I'm finding the problem statement difficult to parse.]
In the old days, it was often good to design this type of algorithm with "early outs" that do tests to try to avoid a more expensive calculation. In modern processors, a branch-prediction failure is often very expensive, and those early-out tests can actually be more expensive than the full calculation. (The only way to know for sure is to measure.)
In this case, the calculation is pretty simple, so it may be best to avoid building a data structure or doing any clever early-out checks and instead try to optimize, vectorize, and parallelize to get the throughput you need.
For a point P(x, y, z) and a sphere S(x_s, y_s, z_s, radius), the membership test is:
(x - x_s)^2 + (y - y_s)^2 + (z - z_s)^2 < radius^2
where radius^2 can be pre-calculated once for all the points in the query (avoiding any square root calculations). These calculations are all independent, so you can compute them for several points in parallel. With something like SSE, you could probably do four at a time. And if you have many points to test, you could split the list and further parallelize the work across multiple cores.
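A plain, unvectorized Python sketch of that membership test, with radius^2 computed once outside the loop. The function name points_in_sphere is illustrative only; the SSE and multi-core versions mentioned above would apply the same arithmetic to several points at a time.

```python
def points_in_sphere(points, center, radius):
    """Return the points strictly inside the sphere, without any square roots."""
    r2 = radius * radius          # computed once for the whole query
    cx, cy, cz = center
    return [(x, y, z) for x, y, z in points
            if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 < r2]
```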

Is there an efficient algorithm to generate random points in general position in the plane?

I need to generate n random points in general position in the plane, i.e. no three points can lie on a same line. Points should have coordinates that are integers and lie inside a fixed square m x m. What would be the best algorithm to solve such a problem?
Update: square is aligned with the axes.
Since they're integers within a square, treat them as points in a bitmap. When you add a point after the first, use Bresenham's algorithm to paint all pixels on each of the lines going through the new point and one of the old ones. When you need to add a new point, get a random location and check if it's clear; otherwise, try again. Since each pair of pixels gives a new line, and thus excludes up to m-2 other pixels, as the number of points grows you will have several random choices rejected before you find a good one. The advantage of the approach I'm suggesting is that you only pay the cost of going through all lines when you have a good choice, while rejecting a bad one is a very quick test.
(if you want to use a different definition of line, just replace Bresenham's with the appropriate algorithm)
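A rough Python sketch of this bitmap idea, with two substitutions that are my own assumptions: a set of blocked cells stands in for the bitmap, and the exact lattice points on each line are enumerated with a gcd-reduced step instead of Bresenham's algorithm, so only truly collinear integer points are excluded. The function name and the max_tries guard are hypothetical.

```python
import math
import random

def general_position_points(n, m, rng=None, max_tries=100000):
    """Pick n integer points in [0, m) x [0, m) with no three collinear."""
    rng = rng or random.Random()
    blocked = set()   # cells that would be collinear with two chosen points
    chosen = []
    tries = 0
    while len(chosen) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("gave up: too many rejected candidates")
        p = (rng.randrange(m), rng.randrange(m))
        if p in blocked:
            continue  # rejecting a bad candidate is a single set lookup
        for q in chosen:
            dx, dy = p[0] - q[0], p[1] - q[1]
            g = math.gcd(dx, dy)
            dx, dy = dx // g, dy // g  # smallest integer step along the line
            for sign in (1, -1):
                x, y = p
                while 0 <= x < m and 0 <= y < m:
                    blocked.add((x, y))
                    x, y = x + sign * dx, y + sign * dy
        blocked.add(p)
        chosen.append(p)
    return chosen
```

The expensive line-marking work only happens when a candidate is accepted, which matches the cost argument in the answer above.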
Can't see any way around checking each point as you add it, either by (a) running through all of the possible lines it could be on, or (b) eliminating conflicting points as you go along to reduce the possible locations for the next point. Of the two, (b) seems like it could give you better performance.
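Option (a) can be sketched in Python with an integer cross-product test, which is exact for integer coordinates. The helper names collinear and can_add are made up for illustration.

```python
from itertools import combinations

def collinear(p, q, r):
    """True if the three integer points lie on one line (zero cross product)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]) == 0

def can_add(candidate, chosen):
    """Check the candidate against every line through two chosen points."""
    if candidate in chosen:
        return False
    return not any(collinear(p, q, candidate)
                   for p, q in combinations(chosen, 2))
```

Each check costs time quadratic in the number of points already chosen, which is the price option (b) tries to avoid by eliminating conflicting locations up front.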
Similar to #LaC's answer. If memory is not a problem, you could do it like this:
Add all points on the plane to a list (L).
Shuffle the list.
For each point (P) in the list,
For each point (Q) previously picked,
Remove every point from L which are linear to P-Q.
Add P to the picked list.
You could continue the outer loop until you have enough points, or run out of them.
This might just work (though might be a little constrained on being random). Find the largest circle you can draw within the square (this seems very doable). Pick any n points on the circle, no three will ever be collinear :-).
This should be an easy enough task in code. Say the circle is centered at origin (so something of the form x^2 + y^2 = r^2). Assuming r is fixed and x randomly generated, you can solve to find y coordinates. This gives you two points on the circle for every x which are diametrically opposite. Hope this helps.
Edit: Oh, integer points, just noticed that. That's a pity. I'm going to keep this solution up though, since I like the idea.
Both #LaC's and #MizardX's solution are very interesting, but you can combine them to get even better solution.
The problem with #LaC's solution is that you get random choices rejected. The more points you have already generated, the harder it gets to generate new ones. If there is only one available position left, you have only a slight chance of randomly choosing it (1/(n*m)).
In the #MizardX's solution you never get rejected choices, however if you directly implement the "Remove every point from L which are linear to P-Q." step you'll get worse complexity (O(n^5)).
Instead it would be better to use a bitmap to find which points from L are to be removed. The bitmap would contain a value indicating whether a point is free to use and what is its location on the L list or a value indicating that this point is already crossed out. This way you get worst-case complexity of O(n^4) which is probably optimal.
EDIT:
I've just found that question: Generate Non-Degenerate Point Set in 2D - C++
It's very similar to this one. It would be good to use the solution from this answer: Generate Non-Degenerate Point Set in 2D - C++. Modifying it a bit to use radix or bucket sort, adding all the n^2 possible points to the P set initially and shuffling it, one can also get a worst-case complexity of O(n^4) with much simpler code. Moreover, if space is a problem and #LaC's solution is not feasible due to space requirements, then this algorithm will just fit in without modifications and offer a decent complexity.
Here is a paper that can maybe solve your problem:
"POINT-SETS IN GENERAL POSITION WITH MANY
SIMILAR COPIES OF A PATTERN"
by BERNARDO M. ABREGO AND SILVIA FERNANDEZ-MERCHANT
um, you don't specify which plane.. but just generate 3 random numbers and assign to x,y, and z
if 'the plane' is arbitrary, then set z=0 every time or something...
do a check on x and y to see if they are in your m boundary,
compare the third x,y pair to see if it is on the same line as the first two... if it is, then regenerate the random values.

Resources