extract points which satisfy certain conditions - algorithm

I have an array of points in one plane. They form some shape. I need to extract points from this array which only form straight lines of this shape.
At the moment I have an algorithm, but it does not work very well. I take the first two points, make a straight line, and then check whether the following points lie on it within some tolerance. But there is a problem: the points which form the straight line are not exactly on it, they have some deviation, and this deviation is quite large. If I make the tolerance in my algorithm large enough to capture the points from the straight part, then other points which lie on a slightly bent part but deviate less than the tolerance are also extracted.
I am looking for some idea on how to perform such task.
Here is the picture:
The circled parts are the ones I want to extract. Red points are the parts I could extract with my approach. If I increase the tolerance, then I miss the straight pieces too.

First, suppose you already have some candidate subset of points and want to check whether they lie on a straight line: use a form of linear regression to identify the best-fitting line, then check how well it fits, and accept or reject the hypothesis that this particular segment is linear based on that.
One of the most standard ways of doing that is the least squares method.
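For illustration, a minimal numpy sketch of that check (the function name and the residual-scatter criterion are my own choices, not something from the question):

```python
import numpy as np

def is_straight(x, y, tol):
    # Fit a line y = a*x + b by least squares and call the subset "straight"
    # if the scatter of the residuals stays below the tolerance.
    a, b = np.polyfit(x, y, 1)
    residuals = y - (a * x + b)
    return residuals.std() < tol, (a, b)
```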
Identifying the subset is a different problem, and the best solution will depend strongly on the kind of data you have and on the objective. If the amount of data is not extremely large, I suggest that enumerating all candidate segments is a good starting point; that should be doable in no more than cubic time, I gather.
There are certainly some approximations one can apply, e.g. choosing a point in the sequence and building a subset by iteratively adding points on either side as long as the segment remains linear within the tolerance threshold, then accepting or rejecting it if the segment is long enough (a rough sketch follows below).
I assume here that the curve is parameterizable by one of the coordinates. If this is not the case, e.g. if the curve is closed, additional steps may be required to separate the curve into parameterizable segments.
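A rough numpy sketch of that grow-a-segment approximation (my own code, assuming the points are ordered along the curve and parameterized by x):

```python
import numpy as np

def grow_segments(x, y, tol, min_len=10):
    # Walk along points ordered by x, extending the current window while a
    # least-squares line still fits it within tol; keep segments of >= min_len points.
    segments, start, end = [], 0, 2
    while end <= len(x):
        a, b = np.polyfit(x[start:end], y[start:end], 1)
        if np.std(y[start:end] - (a * x[start:end] + b)) < tol:
            end += 1                                  # still straight: extend the window
        else:
            if end - 1 - start >= min_len:
                segments.append((start, end - 1))     # half-open index range
            start, end = end - 1, end + 1             # restart just before the failure
    if len(x) - start >= min_len:
        segments.append((start, len(x)))
    return segments
```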
EDIT: how to check whether a segment is straight
There's a number of options.
First, I would expect that for a straight line the average deviation stays roughly the same as you add new points; you can then simply find a reasonable threshold on that, given the data.
The second option is to further split the subset into a fixed number of parts (e.g. 2), find the best-fitting line for each one, and then compare these. For a straight line roughly the same line should be predicted for each part, but for a curve the predictions will differ.
The third option is to perform nonlinear curve fitting, e.g. fit a quadratic curve and check the coefficient of the quadratic term: if the line is straight, it should be close to zero.
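A tiny numpy sketch of that quadratic-term check (the threshold value is just a placeholder to tune for your data):

```python
import numpy as np

def looks_straight(x, y, max_curvature=1e-3):
    # Fit y = c2*x^2 + c1*x + c0; a straight segment should have |c2| near zero.
    c2, c1, c0 = np.polyfit(x, y, 2)
    return abs(c2) < max_curvature
```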
In each case, of course, there is a tradeoff between the segment size and the deviation of the points from that segment. In the extreme cases, there would either be one huge linear segment with huge deviation or a whole bunch of 2-point segments with zero deviation. The actual threshold on the deviation, the difference between the fitted lines, or the magnitude of the quadratic term (depending on the option you prefer) has to be selected for the given dataset to suit your needs. Looking at the plot, I would say that the threshold should be picked so as to allow for segments of length 10 or so.

Related

Matrix of pixels to coordinates

I have to convert a given matrix of pixels (coefficients in the range 0 to 255, since the matrix corresponds to a black and white image) into two lists, one containing the abscissas of the points and the other the ordinates; both may themselves be composed of sublists.
As you can see in the included picture, the first case corresponds to a single curve, whereas the other two involve multiple curves crossing each other. The algorithm should be able to distinguish the two or three curves (in the last two examples), so that in the two main lists a given sublist corresponds to a given curve.
I have absolutely no idea where to start...
One last thing: I'm seeking ideas on how to program this algorithm, which is why I didn't specify any programming language (if code helps an explanation, feel free to use any language).
Thanks in advance >^.^<
Check out the Hough transform. It is a simple voting algorithm that allows finding simple geometric shapes in images. One complication could be that your lines are not strictly straight, but it would give you equations for the lines it does find. Since your case is a little nonstandard, I'd try to understand the algorithm itself and write my own implementation.
In my first implementation (centering a circle on a square in a long-focal-depth image I took) I started with a very simple Python example I found online, rewrote it for my purposes, and later moved to C# for speed, since I needed more parameters (a higher-dimensional search space) than you need for this simple case.
In your case I would start with the simple assumption of a straight line. Then the Hough transform will give 1, 2 and 3 maxima respectively for your three cases.
The idea of the Hough transform is well described on Wikipedia.
Here just the gist of the idea:
For a straight line, think of giving each black pixel the right to vote for the 180 possible lines that could go through it (one for each angle in single-degree steps), then plotting the votes as a histogram over a 2D space, where one dimension is the angle of the line, another is the distance from the origin (using the Hesse normal form of the line for practical reasons rather than the common y = mx + b), and the z-dimension is the number of votes. The actual line formed by the black pixels will get more votes than any other possible line, so you are simply looking for the maximum vote location in the transformation space (in Python/numpy it would be argmax).
If there are two lines, you will find two clear maxima, the higher one corresponding to the longer or thicker line (more votes). You can then start playing with grayscale in your image, giving non-integer votes to pixels. You can also play with the resolution of the angle, depending on the content of your problem.
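For illustration, a bare-bones numpy sketch of that voting scheme (my own simplified implementation, assuming a 2D 0/1 image; a real one would add peak detection for multiple maxima, vote smoothing, and so on):

```python
import numpy as np

def hough_lines(binary_image, n_angles=180):
    # Every black (nonzero) pixel votes for n_angles candidate lines in Hesse
    # normal form: rho = x*cos(theta) + y*sin(theta).
    h, w = binary_image.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_angles))
    accumulator = np.zeros((n_angles, 2 * diag + 1), dtype=np.int64)
    ys, xs = np.nonzero(binary_image)                 # coordinates of black pixels
    for t, theta in enumerate(thetas):
        rho = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int) + diag
        np.add.at(accumulator[t], rho, 1)             # one vote per pixel per angle
    # the strongest line is the argmax of the vote histogram
    t_best, r_best = np.unravel_index(np.argmax(accumulator), accumulator.shape)
    return np.rad2deg(thetas[t_best]), r_best - diag, accumulator
```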

Efficiently calculating a segmented regression on a large dataset

I currently have a large data set for which I need to calculate a segmented regression (or fit a piecewise linear function in some similar way). I have both a large data set and a very large number of pieces.
Currently I have the following approach:
Let s_i be the end of segment i, and let (x_i, y_i) denote the i-th data point. Assume the data point x_k lies within segment j; then I can create a vector from x_k as
(s_1, s_2 - s_1, s_3 - s_2, ..., x_k - s_{j-1}, 0, 0, ...)
To do a segmented regression on the data, I can then do a normal linear regression on these vectors.
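For what it's worth, here is a small numpy sketch of that construction (my own code, assuming the first segment starts at the smallest x, matching the s_0 = 0 convention above). Each column holds how far a point has advanced through one segment, clipped to that segment's width, so an ordinary least-squares fit on these columns yields one slope per segment and a curve that is automatically continuous at the boundaries:

```python
import numpy as np

def piecewise_design_matrix(x, knots):
    # knots: increasing interior boundaries s_1 < s_2 < ...; segment 1 is assumed
    # to start at x.min().
    s = np.concatenate(([x.min()], knots, [x.max()]))
    cols = [np.clip(x - s[i], 0.0, s[i + 1] - s[i]) for i in range(len(s) - 1)]
    return np.column_stack([np.ones_like(x)] + cols)   # intercept + one slope per segment

# X = piecewise_design_matrix(x, knots)
# beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # beta[0]: intercept, beta[1:]: segment slopes
```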
However, my current estimates show that, if I define the problem that way, I will get about 600,000 vectors with about 2,000 components each. I haven't benchmarked yet, but I don't think my computer will be able to solve such a large regression problem in any acceptable time.
Is there a better way to solve this kind of regression problem? One idea was to use some kind of hierarchical approach, i.e. solve one regression problem combining multiple segments so that I can determine start and end points for this set, then calculate an individual segmented regression for this set of segments. However, I cannot figure out how to calculate the regression for this set of segments so that the endpoints match (I can only match the start or the end point by fixing the intercept, but not both).
Another idea I had was to calculate an individual regression for each of the segments and then use only the slope for that segment. However, with that approach errors might start to accumulate, and I have no way to control this kind of error accumulation.
Yet another idea is to do an individual regression for each segment but fix the intercept to the endpoint of the previous segment. However, I am still not sure whether I would get some kind of error accumulation this way.
Clarification
Not sure if this was clear from the rest of the question: I know where the segments start and end. The most important part is that each line segment has to intersect the next one at the segment boundary.
EDIT
Maybe another fact that could help: all points have distinct x values.
I would group points into rectangular grid areas based on their position, so you process this task on smaller datasets and then merge the results together when all are done.
I would process each group like this:
compute a histogram of angles
take only the most frequently occurring angles; their count determines the number of line segments present in the group
do the regression/line fit for these angles (see this Answer, which does something very similar, just for a single line)
compute the intersection points between line segments to get the endpoints of your piecewise polyline, and also the connectivity info (join the closest endpoints)
[edit1] after OP edit
You know the edge x coordinates of all segments (x0, x1, ...), so just compute the average y coordinate of the points near each segment edge (gray area, green points) and you get the segment line endpoints (blue points). Of course this is no fit or regression, because it discards all the other points, so it leads to bigger errors (unless the segment x coordinates correspond to the regressed lines), but there is no way around it with the constraints of the solution you have (at least I do not see any).
Because if you use regression on the segment data, then you cannot connect it to the other segments, and if you try to merge them anyway you get almost the same result as this:
The size of the gray area determines the output, so play with it a bit. A rough sketch of this boundary-averaging step is below.
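A rough numpy sketch of that boundary-averaging step (variable names are mine; `gray` is the half-width of the gray area):

```python
import numpy as np

def boundary_vertices(x, y, edges, gray):
    # For every segment edge x-coordinate, average the y of points falling
    # inside a band of half-width `gray` around it; each band yields one
    # polyline vertex.
    verts = []
    for e in edges:                       # edges = [x0, x1, x2, ...]
        mask = np.abs(x - e) <= gray
        if mask.any():
            verts.append((e, y[mask].mean()))
    return verts                          # connect consecutive vertices to get the polyline
```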

Sorting Geographical non-contiguous line segments along an implied curve

Given:
A Set (for the sake of discussion we will call it S), which is an unordered collection of line segments. Each line segment is defined as two Longitude-Latitude end-points. While all of the line segments follow an implied curve, there are "gaps" between each of the segments, of various sizes. We refer to this curve as "implied" because it is not explicitly defined anywhere. The only information that we have available are the line segments contained within S.
Desired Result:
A sequence (for the sake of discussion we will call it R), which is an ordered collection of line segments. Each line segment is defined just as before, following the same implied curve as before but are now sorted by their position along the implied curve.
Context (i.e. "Why in the heck do I need this?"):
Basically I have incomplete geographical data that needs to be normalized and "completed" by doing some very simple interpolation to form a complete curve with no gaps. You might ask "why not just fit a curve to all the line segment end-points and be done with it?" Well, that's not quite what I am after. The line segments are precisely where they should be located, and there is no need for the final curve to be "smooth". In fact, I intend to connect each of the segments with a straight line (the crudest form of interpolation imaginable). But connecting the segments is easy; the hard part is sorting them.
So, in summary: what would be a performant algorithm for going from S to R?
You can use a k-d tree or a cover tree to find nearby points quickly.
If you need one continuous curve, I would suggest that a short traveling salesman path that incorporates the given edges would be a reasonable reconstruction. You could use 2-opt together with a k-d tree the way Bentley described (paywalled, sorry; I think there's also a description in this chapter on TSP local search by Johnson and McGeoch). The one modification needed would be to ensure that the initial path includes the given edges and that 2-opt moves do not remove those edges.
I guess the implied curve has two properties. One: it is continuous, which means there are no breaks. Two: its first derivative is continuous, which means there are no corners.
From the second property we can say that if the angles of two lines are close to each other, they are more likely related. But I guess that alone is not enough. You can define a cost function which depends on both the angle between the lines and the distance between them:
C = A*angle + B*distance (where A,B should be tested and tuned)
From this function you can find how strongly each line is related to another one, and then simply connect the lines with the strongest relations. Though I guess a greedy algorithm does not guarantee you will always get the optimal solution. A sketch of this greedy chaining is below.
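A rough Python sketch of that greedy chaining (my own simplification: it ignores which endpoint of each segment should be joined and just chains segment to segment by the cost above, so A and B still need tuning):

```python
import math

def angle_of(seg):
    # seg = ((x1, y1), (x2, y2))
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1)

def cost(seg_a, seg_b, A=1.0, B=1.0):
    # angle difference between the two segments, folded into [0, pi/2]
    da = abs(angle_of(seg_a) - angle_of(seg_b)) % math.pi
    da = min(da, math.pi - da)
    # distance from the end of seg_a to the start of seg_b
    (ax, ay), (bx, by) = seg_a[1], seg_b[0]
    dist = math.hypot(bx - ax, by - ay)
    return A * da + B * dist

def chain(segments):
    # start from an arbitrary segment and repeatedly append the cheapest neighbor
    remaining = list(segments)
    path = [remaining.pop(0)]
    while remaining:
        best = min(remaining, key=lambda s: cost(path[-1], s))
        remaining.remove(best)
        path.append(best)
    return path
```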

Space partitioning algorithm

I have a set of points which are contained within a rectangle. I'd like to split the rectangle into subrectangles based on point density (giving a number of subrectangles or a desired density, whichever is easier).
The partitioning doesn't have to be exact (almost any approximation better than a regular grid will do), but the algorithm has to cope with a large number of points, approx. 200 million. The desired number of subrectangles, however, is substantially lower (around 1000).
Does anyone know any algorithm which may help me with this particular task?
Just to understand the problem:
The following is crude and performs badly, but I want to know if the result is what you want.
Assumption: the number of rectangles is even.
Assumption: the point distribution is markedly 2D (no big accumulation on one line).
Procedure:
Bisect n/2 times along either axis, looping from one end of each previously determined rectangle to the other, counting the "passed" points and storing the number of passed points at each iteration. Once counted, bisect the rectangle using the point counts gathered in each loop.
Is that what you want to achieve?
I think I'd start with the following, which is close to what @belisarius already proposed. If you have any additional requirements, such as preferring 'nearly square' rectangles to 'long and thin' ones, you'll need to modify this naive approach. For simplicity, I'll assume the points are approximately randomly distributed.
Split your initial rectangle in 2 with a line parallel to the short side of the rectangle and running exactly through the mid-point.
Count the number of points in both half-rectangles. If they are equal (enough) then go to step 4. Otherwise, go to step 3.
Based on the distribution of points between the half-rectangles, move the line to even things up again. So if, perchance, the first cut split the points 1/3, 2/3, move the line half-way into the heavy half of the rectangle. Go to step 2. (Be careful not to get trapped here, moving the line in ever decreasing steps first in one direction, then the other.)
Now, pass each of the half-rectangles into a recursive call to this function, at step 1.
I hope that outlines the proposal well enough. It has limitations: it will produce a number of rectangles equal to some power of 2, so adjust it if that's not good enough. I've phrased it recursively, but it's ideal for parallelisation: each split creates two tasks, each of which splits a rectangle and creates two more tasks. A rough sketch follows.
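A rough Python/numpy sketch of the proposal (my own code; it replaces the iterative line-nudging of steps 2 and 3 with a median split, which balances the two halves in one step):

```python
import numpy as np

def split(points, rect, depth):
    # points: (N, 2) array; rect: (xmin, ymin, xmax, ymax).
    # Returns a list of rectangles, 2**depth of them for non-degenerate input.
    xmin, ymin, xmax, ymax = rect
    if depth == 0 or len(points) == 0:
        return [rect]
    # cut parallel to the short side, i.e. split along the longer dimension
    axis = 0 if (xmax - xmin) >= (ymax - ymin) else 1
    cut = np.median(points[:, axis])            # balances the two halves directly
    left = points[points[:, axis] <= cut]
    right = points[points[:, axis] > cut]
    if axis == 0:
        rects = [(xmin, ymin, cut, ymax), (cut, ymin, xmax, ymax)]
    else:
        rects = [(xmin, ymin, xmax, cut), (xmin, cut, xmax, ymax)]
    return split(left, rects[0], depth - 1) + split(right, rects[1], depth - 1)

# Example: depth 10 -> 1024 rectangles, close to the ~1000 requested.
# boxes = split(pts, (pts[:,0].min(), pts[:,1].min(), pts[:,0].max(), pts[:,1].max()), 10)
```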
If you don't like that approach, perhaps you could start with a regular grid with some multiple (10 - 100 perhaps) of the number of rectangles you want. Count the number of points in each of these tiny rectangles. Then start gluing the tiny rectangles together until the less-tiny rectangle contains (approximately) the right number of points. Or, if it satisfies your requirements well enough, you could use this as a discretisation method and integrate it with my first approach, but only place the cutting lines along the boundaries of the tiny rectangles. This would probably be much quicker as you'd only have to count the points in each tiny rectangle once.
I haven't really thought about the running time of either of these; I have a preference for the former approach 'cos I do a fair amount of parallel programming and have oodles of processors.
You're after a standard Kd-tree or binary space partitioning tree, I think. (You can look it up on Wikipedia.)
Since you have very many points, you may wish to only approximately partition the first few levels. In this case, you should take a random sample of your 200M points (maybe 200k of them) and split the full data set at the midpoint of the subsample (along whichever axis is longer). If you actually choose the points at random, the probability that you'll miss a huge cluster of points that needs to be subdivided is approximately zero.
Now you have two problems of about 100M points each. Divide each along its longer axis. Repeat until the subsets are small enough that you stop taking subsamples and split on the whole data set. After ten breadth-first iterations you'll be done.
If you have a different problem (you must provide tick marks along the X and Y axes and fill in a grid along those as best you can, rather than having the irregular decomposition of a kd-tree), take your subsample of points and find the 0/32, 1/32, ..., 32/32 percentiles along each axis. Draw your grid lines there, then fill the resulting 1024-element grid with your points.
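A small numpy sketch of that percentile-grid variant (my own code, assuming `pts` is an (N, 2) float array of the full data set):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = pts[rng.choice(len(pts), size=200_000, replace=False)]   # the 200k subsample

# 33 tick marks (0/32 ... 32/32 percentiles) per axis -> a 32x32 = 1024-cell grid
qs = np.linspace(0.0, 1.0, 33)
x_ticks = np.quantile(sample[:, 0], qs)
y_ticks = np.quantile(sample[:, 1], qs)

# bin every point into the grid (searchsorted on the interior ticks) and count
ix = np.searchsorted(x_ticks[1:-1], pts[:, 0])
iy = np.searchsorted(y_ticks[1:-1], pts[:, 1])
counts = np.zeros((32, 32), dtype=np.int64)
np.add.at(counts, (ix, iy), 1)
```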
R-tree
Good question.
I think the area you need to investigate is "computational geometry" and the "k-partitioning" problem. There's a link here that might help you get started.
You might find that the problem itself is NP-hard, which means a good approximation algorithm is the best you're going to get.
Would K-means clustering or a Voronoi diagram be a good fit for the problem you are trying to solve?
That looks like cluster analysis.
Would a QuadTree work?
A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are most often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions. The regions may be square or rectangular, or may have arbitrary shapes. This data structure was named a quadtree by Raphael Finkel and J. L. Bentley in 1974. A similar partitioning is also known as a Q-tree. All forms of quadtrees share some common features:
They decompose space into adaptable cells
Each cell (or bucket) has a maximum capacity. When maximum capacity is reached, the bucket splits
The tree directory follows the spatial decomposition of the Quadtree
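For illustration, a toy Python sketch of that capacity-based splitting (my own code, not any particular library's API):

```python
class QuadTree:
    def __init__(self, xmin, ymin, xmax, ymax, capacity=64):
        self.bounds = (xmin, ymin, xmax, ymax)
        self.capacity = capacity          # bucket size before the cell splits
        self.points = []
        self.children = None

    def insert(self, p):
        x, y = p
        xmin, ymin, xmax, ymax = self.bounds
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            return False                  # point lies outside this cell
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.capacity:
                self._split()             # bucket full: subdivide into 4 quadrants
            return True
        return any(c.insert(p) for c in self.children)

    def _split(self):
        xmin, ymin, xmax, ymax = self.bounds
        xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
        self.children = [QuadTree(xmin, ymin, xm, ym, self.capacity),
                         QuadTree(xm, ymin, xmax, ym, self.capacity),
                         QuadTree(xmin, ym, xm, ymax, self.capacity),
                         QuadTree(xm, ym, xmax, ymax, self.capacity)]
        for p in self.points:             # push stored points down into the children
            any(c.insert(p) for c in self.children)
        self.points = []
```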

Compare three-dimensional structures

I need to evaluate whether two sets of 3D points are the same (ignoring translations and rotations) by finding and comparing a proper geometric hash. I did some paper research on geometric hashing techniques and found a couple of algorithms, which however tend to be complicated by "vision requirements" (e.g. 2D to 3D, occlusions, shadows, etc.).
Moreover, I would like that, if the two geometries are only slightly different, the hashes are also not very different.
Does anybody know some algorithm that fits my need, and can provide some link for further study?
Thanks
Your first thought may be to find the rotation that maps one object to the other, but that is a very complex topic... and it is not actually necessary! You're not asking how to best match the two, you're just asking whether they are the same or not.
Characterize your model by the list of all interpoint distances, sorted by distance. Now compare the lists for the two objects. They should be identical, since interpoint distances are not affected by translation or rotation.
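A small sketch of that comparison using numpy/scipy (names and the tolerance handling are mine):

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_signature(points):
    # points: (N, 3) array -> sorted list of all pairwise distances
    return np.sort(pdist(points))

def same_shape(a, b, tol=1e-6):
    if len(a) != len(b):
        return False
    # identical interpoint distances are necessary for congruence
    # (not quite sufficient: see the mirror-image caveat in issue 3 below)
    return np.allclose(distance_signature(a), distance_signature(b), atol=tol)
```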
Three issues:
1) What if the number of points is large? That's a large list of pairs (N*(N-1)/2). In this case you may elect to keep only the longest ones, or even better, keep the 1 or 2 longest ones for each vertex, so that every part of your model contributes something. Dropping information like this, however, changes the problem from deterministic to probabilistic.
2) This only uses vertices to define the shape, not edges. This may be fine (and in practice it will be), but it matters if you expect figures with identical vertices but different connecting edges. If so, test for vertex similarity first. If that passes, then assign a unique labeling to each vertex using the sorted distances. The longest edge has two vertices; for each of those vertices, find the vertex with the longest remaining edge. Label the first vertex 0 and the next vertex 1. Repeat for the other vertices in order, and you'll have assigned tags which are shift- and rotation-independent. Now you can compare edge topologies exactly (check that for every edge in object 1 between two vertices, there's a corresponding edge between the same two vertices in object 2). Note: this starts getting really complex if you have multiple identical interpoint distances, because then you need tiebreaker comparisons to make the assignments stable and unique.
3) There's a possibility that two figures have identical edge-length populations but aren't identical: this happens when one object is the mirror image of the other. This is quite annoying to detect! One way to do it is to take four non-coplanar points (perhaps the ones labeled 0 to 3 from the previous step) and compare the "handedness" of the coordinate system they define. If the handedness doesn't match, the objects are mirror images. A small sketch of that check is below.
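A tiny numpy sketch of that handedness check (my own code): take the four labeled non-coplanar points and compare the sign of the signed volume they span.

```python
import numpy as np

def handedness(p0, p1, p2, p3):
    # sign of the scalar triple product: +1 and -1 correspond to opposite handedness
    return np.sign(np.dot(np.cross(p1 - p0, p2 - p0), p3 - p0))

# If the signs differ between the two objects, one is a mirror image of the other.
```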
Note the list-of-distances gives you easy rejection of non-identical objects. It also allows you to add "fuzzy" acceptance by allowing a certain amount of error in the orderings. Perhaps taking the root-mean-squared difference between the two lists as a "similarity measure" would work well.
Edit: It looks like your problem is a point cloud with no edges. Then the annoying problem of edge correspondence (issue 2) doesn't even apply and can be ignored! You still have to be careful of the mirror-image problem (issue 3), though.
There are a bunch of SIGGRAPH publications which may prove helpful to you,
e.g. "Global Non-Rigid Alignment of 3-D Scans" by Brown and Rusinkiewicz:
http://portal.acm.org/citation.cfm?id=1276404
A general search that can get you started:
http://scholar.google.com/scholar?q=siggraph+point+cloud+registration
Spin images are one way to go about it.
This seems like a numerical optimisation problem to me. You want to find the parameters of the transform which maps one set of points as close as possible to the other. Define some sort of residual or "energy" which is minimised when the points are coincident, and chuck it at some least-squares optimiser or similar. If it manages to optimise the score to zero (or as near as can be expected given floating-point error), then the points are the same.
Googling
least squares rotation translation
turns up quite a few papers building on this technique (e.g "Least-Squares Estimation of Transformation Parameters Between Two Point Patterns").
Update following the comment below: if a one-to-one correspondence between the points isn't known (as assumed by the paper above), then you just need to make sure the score being minimised is independent of point ordering. For example, if you treat the points as small masses (finite-radius spheres to avoid a zero-distance blowup) and set out to minimise the total gravitational energy of the system by optimising the translation and rotation parameters, that should work. A rough sketch of such an order-independent objective is below.
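A rough Python sketch of that idea using scipy (my own code; it uses the mean nearest-neighbour distance as an order-independent stand-in for the gravitational-energy score):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation

def score(params, src, dst_tree):
    # params: 3 rotation-vector components + 3 translation components
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    moved = src @ R.T + params[3:]
    d, _ = dst_tree.query(moved)          # distance from each moved point to its nearest neighbour
    return np.mean(d)                     # order-independent residual, 0 for a perfect match

def align(src, dst):
    tree = cKDTree(dst)
    res = minimize(score, np.zeros(6), args=(src, tree), method='Nelder-Mead')
    return res.x, res.fun                 # transform parameters and final residual
```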
If you want to estimate the rigid transform between two similar point clouds, you can use the well-established Iterative Closest Point method. It starts with a rough estimate of the transformation and then iteratively refines it by computing nearest neighbors and minimizing an associated cost function. It can be implemented efficiently (even in real time) and there are implementations available for MATLAB, C++, and so on. The method has been extended and has several variants, including ones that estimate non-rigid deformations; if you are interested in extensions you should look at computer graphics papers on the scan registration problem, where your problem is a crucial step. For a starting point see the Wikipedia page on Iterative Closest Point, which has several good external links. Just a teaser image from a MATLAB implementation designed to match two point clouds:
(source: mathworks.com)
After aligning, you could use the final error measure to say how similar the two point clouds are, but this is very much an ad hoc solution; there should be a better one.
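For illustration, a minimal ICP sketch in Python/numpy/scipy (my own simplification, not one of the implementations mentioned above): iteratively match nearest neighbors and solve for the best rigid transform with an SVD (the Kabsch algorithm).

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(src, dst, iterations=50):
    # src, dst: (N, 3) and (M, 3) point clouds; returns an aligned copy of src
    src = src.copy()
    tree = cKDTree(dst)
    for _ in range(iterations):
        _, idx = tree.query(src)              # nearest neighbor in dst for each src point
        matched = dst[idx]
        mu_s, mu_d = src.mean(axis=0), matched.mean(axis=0)
        H = (src - mu_s).T @ (matched - mu_d) # cross-covariance of the centered clouds
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:              # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_d - R @ mu_s
        src = src @ R.T + t                   # apply the current best rigid transform
    return src
```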
Using shape descriptors, one can compute fingerprints of shapes which are often invariant under translations/rotations. In most cases they are defined for meshes rather than point clouds, but there is a multitude of shape descriptors, so depending on your input and requirements you might find something useful. For this, you would want to look into the field of shape analysis; this 2004 SIGGRAPH course presentation can give a feel for what people do to compute shape descriptors.
This is how I would do it:
Position the sets at the center of mass
Compute the inertia tensor. This gives you three coordinate axes. Rotate to them. [*]
Write down the list of points in a given order (for example, top to bottom, left to right) with your required precision.
Apply any algorithm you'd like to the resulting array.
To compare two sets, unless you need to store the hash results in advance, just apply your favorite comparison algorithm to the sets of points from step 3. This could be, for example, computing a distance between the two sets. A sketch of steps 1 to 3 is below.
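A small numpy sketch of steps 1 to 3 (my own code, assuming unit point masses; note the sign of each principal axis is ambiguous, so a real implementation needs an extra convention there, as mentioned in the footnote):

```python
import numpy as np

def canonicalize(points):
    p = points - points.mean(axis=0)              # step 1: center of mass at the origin
    # inertia tensor for unit point masses: I = sum(|r|^2 * Id - r r^T)
    outer = np.einsum('ni,nj->ij', p, p)
    I = np.trace(outer) * np.eye(3) - outer
    _, axes = np.linalg.eigh(I)                   # principal axes as columns
    q = p @ axes                                  # step 2: rotate into those axes
    # step 3: list the points in a reproducible order before hashing/comparing
    # (note: axis signs are ambiguous; a full solution fixes a sign convention)
    return q[np.lexsort(q.T)]
```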
I'm not sure I can recommend an algorithm for step 4, since it appears that your requirements are contradictory: anything called hashing usually has the property that a small change in the input results in a very different output. Anyway, now I've reduced the problem to an array of numbers, so you should be able to figure things out.
[*] If two or three of your axes coincide, select coordinates by some other means, e.g. along the longest distance. But this is extremely rare for random points.
Maybe you should also read up on the RANSAC algorithm. It's commonly used for stitching together panorama images, which seems to be a bit similar to your problem, only in 2 dimensions. Just google for RANSAC, panorama and/or stitching to get a starting point.
