I want to fit a line to line fragments, i.e. a small number (often less than 10) of line segments that approximately belong to the line. The line has a small slope. But there are outliers: segments (usually smaller) outside the line. The figure below shows a typical case. There is no horizontal overlap between the pieces.
I would prefer to avoid trying a fit on all subsets of segments and keeping the best. I also wouldn't rely on RANSAC as the sample is too small.
Any suggestions?
Update:
I now plan to recast the problem as that of fitting a line to points, namely the infinitely many points making up the individual line segments, assuming a constant linear density. By rewriting the least-squares equations in integral form, one sees that we can treat the segments as concentrated at their midpoints, with a weight equal to their length; there is also an extra term taking their slope into account. This gives a good grounding to the fitting on segments.
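For the record, here is a minimal sketch of that integral formulation (Python; the function name and the ((x0, y0), (x1, y1)) input convention are mine). It fits y = a + b*x by vertical residuals, which is reasonable since the slope is small; the dx*dx/12 and dx*dy/12 terms are the slope corrections mentioned above:

import math

def fit_line_to_segments(segments):
    # least squares over the continuum of points on the segments,
    # each segment carrying a uniform linear density
    Sw = Sx = Sy = Sxx = Sxy = 0.0
    for (x0, y0), (x1, y1) in segments:
        dx, dy = x1 - x0, y1 - y0
        L = math.hypot(dx, dy)               # weight = segment length
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        Sw += L
        Sx += L * mx
        Sy += L * my
        # second moments of a uniform segment: midpoint term + slope term
        Sxx += L * (mx * mx + dx * dx / 12)
        Sxy += L * (mx * my + dx * dy / 12)
    b = (Sxy - Sx * Sy / Sw) / (Sxx - Sx * Sx / Sw)  # slope
    a = (Sy - b * Sx) / Sw                           # intercept
    return a, b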
Now I still have to incorporate outlier detection. Inspired by RANSAC, I can pick the longest segments and use them in isolation or in pairs to get candidate lines. For each line, evaluate the total error, and keep the line giving the smallest value. From there, some criterion (yet to be found) should allow rejecting the outliers and performing the final least-squares fit on the inliers.
I'd guess the slope is going to be around a weighted average of the line fragment slopes, with each fragment weighted by its length (or the square of its length, depending on how the lengths of the outlying fragments compare). Then best-fit a line constrained to that slope.
So take the line fragments and convert the slopes to angles (atan2(y1-y0, x1-x0)). Multiply each angle by the fragment's length, add them all up, and divide by the total length of all fragments. Do the same thing for the position: sum each fragment's midpoint times its length, and divide by the total length of all fragments. Then make sure the line with that slope passes through the point with that value.
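A minimal sketch of that recipe (Python; the function name and the ((x0, y0), (x1, y1)) input convention are mine):

import math

def weighted_angle_and_midpoint(fragments):
    # fragments: list of ((x0, y0), (x1, y1)) pairs
    total = ang = px = py = 0.0
    for (x0, y0), (x1, y1) in fragments:
        length = math.hypot(x1 - x0, y1 - y0)
        ang += math.atan2(y1 - y0, x1 - x0) * length   # length-weighted angle
        px += (x0 + x1) / 2 * length                   # length-weighted midpoint
        py += (y0 + y1) / 2 * length
        total += length
    ang, px, py = ang / total, px / total, py / total
    return math.tan(ang), (px, py)  # slope, and a point the line must pass through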
Update:
If we are not going to make much use of the slopes, we should instead just best-fit the line positionally with regard to the impact of the various segments, which again we weight by their length.
Find the total length of the fragments. Iterate through the fragments until you are 1/3 of the way through that total length; that gives the x of your first point. Then pick some arbitrarily small step and iterate through the fragments again, sampling at that rate. The impact of each sample is its y value multiplied by the linear distance of its x from the 1/3-way x, all normalized by the sum of those linear distances across all the fragments. Do the same for the 2/3-way point, and draw a line between the two resulting points.
As you have asked, I have some suggestions; a complete and working answer would be a bit too much for me to arrive at. My suggestion contains two major parts. Taking them one by one:
Handling outliers:
One suggestion for getting rid of the outliers is to cluster the line segments, and then not worry about the lines that fall outside the cluster. But how do you cluster the lines? Divide the entire 2D plane into horizontal stripes: y = 0 to a, y = a to 2a, y = 2a to 3a, etc. The line segments that fall in the same y = i to j stripe are the ones you will use; the most populated stripe gives you the i and j values of the correct stripe.
There is however one issue: what if the line segments are not well divided horizontally? What if the majority of the lines are inclined at 38 degrees instead of being close to 0? In that case, you may do a Principal Component Analysis. Sorry to link you to such an open-ended idea - your question kinda demands it.
Realign your lines so that they are mostly parallel to the X-axis and then, as I mentioned above, find the stripe that contains most of the lines (a sketch follows).
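Here is a minimal sketch of the striping idea (Python; the function name, the stripe height a, and the choice to weight stripes by total segment length are all mine):

from collections import defaultdict

def densest_stripe(segments, a):
    # assign each segment to the stripe [k*a, (k+1)*a) containing its midpoint
    stripes = defaultdict(list)
    for seg in segments:
        (x0, y0), (x1, y1) = seg
        stripes[int(((y0 + y1) / 2) // a)].append(seg)
    # keep the stripe with the greatest total segment length
    def total_length(segs):
        return sum(((x1 - x0)**2 + (y1 - y0)**2) ** 0.5
                   for (x0, y0), (x1, y1) in segs)
    return max(stripes.values(), key=total_length)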
Approximating the best fitting line:
Now, after you have finalized the correct stripe, take all the line segments that fell into it and densify them. Densification is the step of approximating the line segments as a collection of points. Since all of these line segments lie between y = i and y = j, you may start with the line y = (i + j) / 2 as the best-fit line. Then (a sketch in code follows these steps):
Find the distance of all the points from this line, keeping the distance as negative when the point is above the line and the distance as positive when the point is below the line.
Sum all the distances. Let's call this summed value approximationError.
Your target is now to find that y value for which approximationError is 0.
Decrease y if the majority of points lie below the line; increase it if the majority lie above it.
You will finally arrive at a line like y = c.
Now, incline this line by the same angle by which you rotated all your input line segments during the Principal Component Analysis step.
To get the line segment, clip this line at the x-values of the two x-farthest points in the stripe.
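A minimal sketch of the densify-and-adjust steps above (Python; function names, the sampling step and the tolerance are mine; note that the zero of the signed-distance sum is simply the mean of the y-values):

def densify(segments, step=0.5):
    # approximate each segment by points spaced roughly `step` apart
    pts = []
    for (x0, y0), (x1, y1) in segments:
        n = max(1, round(((x1 - x0)**2 + (y1 - y0)**2) ** 0.5 / step))
        pts += [(x0 + t * (x1 - x0) / n, y0 + t * (y1 - y0) / n)
                for t in range(n + 1)]
    return pts

def fit_y(points, i, j, tol=1e-6):
    # bisect on c: signed distance is positive below the line, negative above
    lo, hi = i, j
    while hi - lo > tol:
        c = (lo + hi) / 2
        approximation_error = sum(c - y for _, y in points)
        if approximation_error > 0:   # majority of mass below the line
            hi = c                    # ...so decrease y
        else:
            lo = c                    # ...otherwise increase y
    return (lo + hi) / 2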
I realize that this all may not be easy to visualize. Here is a link to the wikipedia image for PCA. Here is a link to another answer demonstrating line densification.
Is there any algorithm that would allow approximating a path in the x-y plane (i.e. an ordered sequence of points defined by x and y) with a limited number of line segments and arcs of circles (constant curvature)? The resulting curve needs to be C1 (continuity of slope).
The maximum number of segments and arcs could be a parameter. An additional interesting constraint would be to prevent two consecutive arcs of circles without an intermediate line segment joining them.
I do not see any way to do this, and I do not think that there exists a method for it, but any hint towards this objective is welcome.
Example:
Sample file available here
Consider this path. It looks like a line, but is actually an ordered sequence of very close points. There is no noise and the order of the sequence of points is well known.
I would like to approximate this curve with a minimal succession of line segments and circular arcs (let's say 10 line segments and 10 circular arcs) and C1 continuity. The number of segments/arcs is not an objective in itself, but I need some parameter that would allow reducing/increasing this number to attain a certain simplicity of the parametrization, at the cost of accuracy loss.
Solution:
Here is my solution, based on Spektre's answer. Red curve is original data. Black lines are segments and blue curves are circle arcs. Green crosses are arc centers with radii shown and blue ones are points where segments potentially join.
Detect line segments, based on a maximum slope deviation and a minimal segment length as parameters. The slope of each new step is compared with the average slope of the existing segment (a sketch follows below). I would prefer an optimization-based method, but I do not think that one exists for disjoint segments with unknown number, position and length.
Join segments with tangent arcs. To close the system, the radius is chosen such that the segment extremities are moved as little as possible. A minimum-radius constraint has been added for my purposes. I believe that there will be some special cases to treat when the inflexion points are far away (e.g. when lines are nearly parallel) and interact with neighboring segments.
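Not my exact code, but a greedy sketch of the detection step under the stated parameters (Python; the function name, the running-average update, and the point-count threshold standing in for length are all my choices):

import math

def detect_segments(pts, max_dev_deg=5.0, min_pts=50):
    # greedy detection of straight chunks in an ordered polyline: a new
    # point is accepted while its step direction stays within max_dev_deg
    # of the running average direction of the current chunk
    segments = []
    start, avg = 0, None
    for i in range(1, len(pts)):
        step = math.atan2(pts[i][1] - pts[i-1][1], pts[i][0] - pts[i-1][0])
        if avg is None:
            avg = step
        elif abs(math.remainder(step - avg, 2 * math.pi)) > math.radians(max_dev_deg):
            if i - start >= min_pts:           # chunk long enough: keep it
                segments.append((start, i - 1))
            start, avg = i - 1, step           # restart a chunk at the break
        else:
            avg += (step - avg) / (i - start)  # update running average direction
    if len(pts) - start >= min_pts:
        segments.append((start, len(pts) - 1))
    return segments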
So you've got a point cloud... For such data, points close together are usually considered connected, so:
you need to add info about what points are close to which ones
Points close to exactly 2 neighbors signal the interior of a curve/line. Only one neighbor means an endpoint of a curve/line, and more than 2 means an intersection or nearly touching/parallel lines/curves. No neighbors means either noise or just a dot.
group path segments together
This is called connected component analysis. So you need to form polylines from your neighbor info table.
detect linear path chunks
these have the same slope among neighboring segments, so you can join them into a single line.
fit the rest with curves
Here are some related QAs:
Finding holes in 2d point sets?
Algorithms: Ellipse matching
How approximation search works (see the sublinks there; there are quite a few examples of fitting)
Trace a shape into a polygon of max n sides
[Edit1] simple line detection from #3 on your data
I used a 5.0 deg angle change as the threshold for lines, and also a minimal size of 50 samples for a detected line (too lazy to compute length, assuming constant point density). The result looks like this:
dots are detected line endpoints, green lines are the detected lines and white "lines" are the curves so I do not see any problem with this approach for now.
Now the problem is with the points left over (the curves). I think there should also be a geometric approach for this, as it is just circular arcs, so something like this:
Formula to draw arcs ending in straight lines, Y as a function of X, starting slope, ending slope, starting point and arc radius?
And this might help too:
Circular approximation of polygon (or its part)
The C1 requirement demands that you have alternating straights and arcs. Also realize that if you permit a sufficient number of segments, you can trivially fit every pair of points with a straight and use a tiny arc to satisfy slope continuity.
I'd suggest this algorithm:
1. Best-fit with a set of (a specified number N of) straight segments. (Surely there are well-developed algorithms for that.)
2. Consider the straight segments fixed, and at each joint place an arc. Treating each joint individually, I think you have a tractable problem of finding the optimum arc center/radius to satisfy continuity and improve the fit (a sketch of such a joint arc follows this list).
3. Now that you are pretty close, attempt to treat all arc centers and radii (the segments being defined by tangency) as a global optimization problem. This of course blows up if N is large.
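For step 2, the tangent ("fillet") arc at a single joint has a closed form. A sketch under my own conventions (Python/numpy; p is the joint where the two straights meet, or would meet if extended, u and v are unit direction vectors of the incoming and outgoing straights, r is the chosen radius; it degenerates when the lines are parallel):

import numpy as np

def fillet(p, u, v, r):
    # p, u, v: numpy arrays; u and v must be unit vectors
    theta = np.arccos(np.clip(np.dot(-u, v), -1.0, 1.0))  # angle between the rays
    t = r / np.tan(theta / 2)        # distance from p to each tangency point
    t1 = p - t * u                   # trim the incoming straight here
    t2 = p + t * v                   # start the outgoing straight here
    bis = -u + v
    bis /= np.linalg.norm(bis)       # interior bisector direction
    center = p + (r / np.sin(theta / 2)) * bis
    return center, t1, t2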
A typical constraint when approximating a given curve by some other curve is to bound the approximating curve to an epsilon-hose around the original curve (in terms of the Minkowski sum with a disk of fixed radius epsilon).
For G1- or C2-continuous approximation (which people from CNC/CAD like) with biarcs (and a straight-line segment could be seen as an arc with infinite radius) former colleagues of mine developed an algorithm that gives solutions like this [click to enlarge]:
The above picture is taken from the project website: https://www.cosy.sbg.ac.at/~held/projects/apx/apx.html
The algorithm is fast, that is, it runs in O(n log n) time and is based on the generalized Voronoi diagram. However, it does not give an approximation with the exact minimum number of elements. If you look for the theoretical optimum, I would refer to the paper by Drysdale et al., "Approximation of an Open Polygonal Curve with a Minimum Number of Circular Arcs and Biarcs", CGTA, 2008.
Suppose we have n points in a bounded region of the plane. The problem is to divide it into 4 regions (with a horizontal and a vertical line) such that the sum of a metric over the regions is minimized.
The metric can be, for example, the sum of the distances between the points in each region, or any other measure of the spread of the points. See the figure below.
I don't know if any clustering algorithm might help me tackle this problem, or if, for instance, it can be formulated as a simple optimization problem where the decision variables are the "axes".
I believe this can be formulated as a MIP (Mixed Integer Programming) problem.
Let's introduce 4 quadrants A, B, C, D. A is upper-right, B is lower-right, etc. Then define a binary variable
delta(i,k) = 1 if point i is in quadrant k
0 otherwise
and continuous variables
Lx, Ly : coordinates of the lines
Obviously we have:
sum(k, delta(i,k)) = 1
xlo <= Lx <= xup
ylo <= Ly <= yup
where xlo, xup are the minimum and maximum x-coordinates (and ylo, yup the same for y). Next we need to implement implications like:
delta(i,'A') = 1 ==> x(i)>=Lx and y(i)>=Ly
delta(i,'B') = 1 ==> x(i)>=Lx and y(i)<=Ly
delta(i,'C') = 1 ==> x(i)<=Lx and y(i)<=Ly
delta(i,'D') = 1 ==> x(i)<=Lx and y(i)>=Ly
These can be handled by so-called indicator constraints or written as linear inequalities, e.g.
x(i) <= Lx + (delta(i,'A')+delta(i,'B'))*(xup-xlo)
Similarly for the others. Finally, the objective is
min sum((i,j,k), delta(i,k)*delta(j,k)*d(i,j))
where d(i,j) is the distance between points i and j. This objective can be linearized as well.
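Not the MIP itself, but a brute-force reference implementation of the same objective, useful for sanity-checking the model on small instances (Python; the function name and the candidate lines halfway between consecutive coordinates are my choices; it assumes at least two distinct x- and two distinct y-values):

import itertools, math

def best_split(points):
    xs = sorted(set(x for x, _ in points))
    ys = sorted(set(y for _, y in points))
    # only splits between consecutive coordinates are essentially distinct
    cand_x = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    cand_y = [(a + b) / 2 for a, b in zip(ys, ys[1:])]
    def cost(lx, ly):
        quads = [[], [], [], []]
        for x, y in points:
            quads[(x > lx) + 2 * (y > ly)].append((x, y))
        # sum of pairwise distances within each quadrant
        return sum(math.dist(p, q)
                   for quad in quads
                   for p, q in itertools.combinations(quad, 2))
    return min(((lx, ly) for lx in cand_x for ly in cand_y),
               key=lambda t: cost(*t))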
After applying a few tricks, I could prove global optimality for 100 random points in about 40 seconds using Cplex. This approach is not really suited for large datasets (the computation time quickly increases when the number of points becomes large).
I suspect this cannot be shoe-horned into a convex problem. Also, I am not sure this objective is really what you want: it will try to make all clusters about the same size (adding a point to a large cluster introduces lots of distances to be added to the objective; adding a point to a small cluster is cheap). Maybe an average distance for each cluster is a better measure (but that makes the linearization more difficult).
Note: probably incorrect. I will try to add another answer.
The one dimensional version of minimising sums of squares of differences is convex. If you start with the line at the far left and move it to the right, each point crossed by the line stops accumulating differences with the points to its right and starts accumulating differences to the points to its left. As you follow this the differences to the left increase and the differences to the right decrease, so you get a monotonic decrease, possibly a single point that can be on either side of the line, and then a monotonic increase.
I believe that the one dimensional problem of clustering points on a line is convex, but I no longer believe that the problem of drawing a single vertical line in the best position is convex. I worry about sets of points that vary in y co-ordinate so that the left hand points are mostly high up, the right hand points are mostly low down, and the intermediate points alternate between high up and low down. If this is not convex, the part of the answer that tries to extend to two dimensions fails.
So for the one dimensional version of the problem you can pick any point and work out in time O(n) whether that point should be to the left or right of the best dividing line. So by binary chop you can find the best line in time O(n log n).
I don't know whether the two dimensional version is convex or not but you can try all possible positions for the horizontal line and, for each position, solve for the position of the vertical line using a similar approach as for the one dimensional problem (now you have the sum of two convex functions to worry about, but this is still convex, so that's OK). Therefore you solve at most O(n) one-dimensional problems, giving cost O(n^2 log n).
If the points aren't very strangely distributed, I would expect that you could save a lot of time by using the solution of the one dimensional problem at the previous iteration as a first estimate of the position of solution for the next iteration. Given a starting point x, you find out if this is to the left or right of the solution. If it is to the left of the solution, go 1, 2, 4, 8... steps away to find a point to the right of the solution and then run binary chop. Hopefully this two-stage chop is faster than starting a binary chop of the whole array from scratch.
Here's another attempt. Lay out a grid so that, except in the case of ties, each point is the only point in its column and the only point in its row. Assuming no ties in any direction, this grid has N rows, N columns, and N^2 cells. If there are ties the grid is smaller, which makes life easier.
Separating the cells with a horizontal and vertical line is pretty much picking out a cell of the grid and saying that cell is the cell just above and just to the right of where the lines cross, so there are roughly O(N^2) possible such divisions, and we can calculate the metric for each such division. I claim that when the metric is the sum of the squares of distances between points in a cluster the cost of this is pretty much a constant factor in an O(N^2) problem, so the whole cost of checking every possibility is O(N^2).
The metric within a rectangle formed by the dividing lines is SUM_i,j[ (X_i - X_j)^2 + (Y_i-Y_j)^2]. We can calculate the X contributions and the Y contributions separately. If you do some algebra (which is easier if you first subtract a constant so that everything sums to zero) you will find that the metric contribution from a co-ordinate is linear in the variance of that co-ordinate. So we want to calculate the variances of the X and Y co-ordinates within the rectangles formed by each division. https://en.wikipedia.org/wiki/Algebraic_formula_for_the_variance gives us an identity which tells us that we can work out the variance given SUM_i Xi and SUM_i Xi^2 for each rectangle (and the corresponding information for the y co-ordinate). This calculation can be inaccurate due to floating point rounding error, but I am going to ignore that here.
Given a value associated with each cell of a grid, we want to make it easy to work out the sum of those values within rectangles. We can create partial sums along each row, transforming 0 1 2 3 4 5 into 0 1 3 6 10 15, so that each cell in a row contains the sum of all the cells to its left and itself. If we take these values and do partial sums up each column, we have just worked out, for each cell, the sum of the rectangle whose top right corner lies in that cell and which extends to the bottom and left sides of the grid. These calculated values at the far right column give us the sum for all the cells on the same level as that cell and below it. If we subtract off the rectangles we know how to calculate we can find the value of a rectangle which lies at the right hand side of the grid and the bottom of the grid. Similar subtractions allow us to work out first the value of the rectangles to the left and right of any vertical line we choose, and then to complete our set of four rectangles formed by two lines crossing by any cell in the grid. The expensive part of this is working out the partial sums, but we only have to do that once, and it costs only O(N^2). The subtractions and lookups used to work out any particular metric have only a constant cost. We have to do one for each of O(N^2) cells, but that is still only O(N^2).
(So we can find the best clustering in O(N^2) time by working out the metrics associated with all possible clusterings in O(N^2) time and choosing the best).
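A sketch of that scheme (Python/numpy; names are mine; the dense N x N grid costs O(N^2) memory, and ties in x or y are assumed away as above):

import numpy as np

def best_split_prefix(points):
    pts = np.asarray(points, float)
    n = len(pts)
    cols = np.argsort(np.argsort(pts[:, 0]))    # each point's own column
    rows = np.argsort(np.argsort(pts[:, 1]))    # each point's own row
    # per-cell values: count, sum x, sum x^2, sum y, sum y^2
    g = np.zeros((n, n, 5))
    g[rows, cols] = np.c_[np.ones(n), pts[:, 0], pts[:, 0]**2,
                          pts[:, 1], pts[:, 1]**2]
    # partial sums along rows, then columns, padded with a zero border
    P = np.pad(g.cumsum(0).cumsum(1), ((1, 0), (1, 0), (0, 0)))
    def rect(r0, r1, c0, c1):                   # aggregates over a rectangle
        return P[r1, c1] - P[r0, c1] - P[r1, c0] + P[r0, c0]
    best = (np.inf, None)
    for r in range(n + 1):                      # horizontal line below row r
        for c in range(n + 1):                  # vertical line left of column c
            total = 0.0
            for q in (rect(0, r, 0, c), rect(0, r, c, n),
                      rect(r, n, 0, c), rect(r, n, c, n)):
                m, sx, sxx, sy, syy = q
                # sum of squared distances over unordered pairs in the quadrant
                total += m * (sxx + syy) - sx**2 - sy**2
            if total < best[0]:
                best = (total, (r, c))
    return best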
By "Group", I mean a set of pixels such that every pixel at least have one adjacent pixel in the same set, the drawing shows an example of a group.
I would like to find the pixel that has the greatest straight-line distance from a designated pixel (for example, the green pixel), where the straight line connecting the two pixels (the red line) must not leave the group.
My solution is to loop through the degrees, simulate the progress of a line starting from the green pixel at each degree, and see which line travels the farthest.
longestDist = 0
bestDegree = -1
farthestX = -1
farthestY = -1
FOR EACH degree from 0 to 360
    dx = longestDist * cos(degree)
    dy = longestDist * sin(degree)
    IF Point(x+dx, y+dy) does not belong to the group
        // it cannot beat the current longest line, so skip this degree
        CONTINUE with next degree
    END IF
    (farthestX, farthestY) = simulate(x, y, degree)
    d = findDistance(x, y, farthestX, farthestY)
    IF d > longestDist
        longestDist = d
        bestDegree = degree
    END IF
END FOR
It is obviously not the best algorithm. Thus I am asking for help here.
Thank you and sorry for my poor English.
I wouldn't work with angles. I'm pretty sure the largest distance will always be between two pixels at the edge of the set, so I'd trace the outline: from any pixel in the set, go in any direction until you reach the edge of the set. Then move (counter)clockwise along the edge. Do this with each pixel as a starting point and you'll be able to find the largest distance. It's still pretty greedy, but I thought it might give you an alternative starting point to improve upon.
Edit: What just came to my mind: say you have a start pixel s and an end pixel e. In the first iteration using s, the corresponding e will be adjacent (the next one along the edge in clockwise direction). As you iterate along the edge, the case might occur that there is no straight line through the set between s and e. In that case the line will hit another part of the set edge (a pixel p). You can continue the iteration of the edge at that pixel (e = p).
Edit2: And if you hit a p, you'll know that there can be no longer distance between s and e, so in the next iteration of s you can skip that whole part of the edge (between s and p) and start at p again.
Edit3: Use the above method to find the first p. Take that p as the next s and continue. Repeat until you reach your first p again. The max distance will be between two of those p, unless the edge of the set is convex, in which case you won't find a p.
Disclaimer: This is untested and are just ideas from the top of my head, no drawings have been made to substantiate my claims and everything might be wrong (i.e. think about it for yourself before you implement it ;D)
First, note that the angle discretization in your algorithm may depend on the size of the grid. If the step is too large, you can miss certain cells, if it is too small, you will end up visiting the same cell again and again.
I would suggest that you enumerate the cells in the region and test the condition for each one individually instead. The enumeration can be done using breadth-first or depth-first search (I think the latter would be preferable, since it will allow one to establish a lower bound quickly and do some pruning).
One can maintain the farthest point X found so far and, for each new point in the region, check whether (a) the point is farther away than the one found so far and (b) it is connected to the origin by a straight line passing through the cells of the region only. If both conditions are satisfied, update X; else go on with the search. If condition (a) is not satisfied, condition (b) doesn't have to be checked.
The complexity of this solution would be O(N*M), where N is the number of cells in the region and M is the larger dimension of the region (max(width,height)). If performance is of essence, more sophisticated heuristics can be applied, but for a reasonably sized grid this should work fine.
Search for pixel, not for slope. Pseudocode.
bestLength = 0
for each pixel in pixels
    currentLength = findDistance(x, y, pixel.x, pixel.y)
    if currentLength > bestLength
        if goodLine(x, y, pixel.x, pixel.y)
            bestLength = currentLength
            bestX = pixel.x
            bestY = pixel.y
        end
    end
end
You might want to sort pixels descending by |dx| + |dy| before that.
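A concrete version of that pseudocode (Python; good_line is my Bresenham-based stand-in for the in-group line test, and I sort by true squared distance rather than |dx| + |dy| so the first hit can be returned immediately):

def good_line(group, x0, y0, x1, y1):
    # True iff every pixel of the Bresenham line from (x0, y0) to (x1, y1)
    # belongs to `group`, a set of (x, y) tuples
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err, x, y = dx + dy, x0, y0
    while True:
        if (x, y) not in group:
            return False
        if (x, y) == (x1, y1):
            return True
        e2 = 2 * err
        if e2 >= dy:
            err += dy; x += sx
        if e2 <= dx:
            err += dx; y += sy

def farthest_visible(group, x, y):
    # scan pixels by decreasing distance; the first reachable one wins
    for px, py in sorted(group, reverse=True,
                         key=lambda p: (p[0] - x)**2 + (p[1] - y)**2):
        if good_line(group, x, y, px, py):
            return px, py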
Use a double data-structure:
One that contains the pixels sorted by angle.
The second one sorted by distance (for fast access, this should also contain "pointers" for the first data structure).
Walk through the angle-sorted one, and check for each pixel that the line is within the region. Some pixels will have the same angle, so you can walk from the origin along the line until you leave the region; you can then eliminate all the pixels that lie beyond that point. Also, whenever the maximum distance increases, remove all pixels that have a shorter distance.
Treat your region as a polygon instead of a collection of pixels. From this you can get a list of line segments (the edges of your polygon).
Draw a line from your start pixel to each pixel you are checking. The longest line that does not intersect any of the line segments of your polygon indicates your most distant pixel that is reachable by a straight line from your pixel.
There are various optimizations you can make to this and a few edge cases to check, but let me know if you understand the idea before I post those... In particular, do you understand what I mean by treating the region as a polygon instead of a collection of pixels?
To add, this approach will be significantly faster than any angle-based approach, or any approach that requires "walking", for all but the smallest collections of pixels. You can further optimize because your problem is equivalent to finding the most distant endpoint of a polygon edge that can be reached via an unintersected straight line from your start point. This can be done in O(N^2), where N is the number of edges. Note that N will be much, much smaller than the number of pixels, and many of the proposed algorithms that use angles and/or pixel iteration are instead going to depend on the number of pixels.
I have read about DDA. But I just came across the term symmetric DDA. What is it? How is it different from DDA?
The DDA (Digital Differential Analyzer) algorithm is used to find the interpolating points between any two given points, linearly (i.e. along a straight line). Since this is to be done on a digital computer, speed is an important factor.
The slope of a straight line is given by m = Δy/Δx ... eq(i), where Δx = x2 - x1 and Δy = y2 - y1. Using this equation we could compute successive points that lie on the line. But this is the discrete world of raster graphics, so we require integral coordinates.
In simple DDA, eq(i) is transformed to m = eΔy/eΔx, where e, call it the increment factor, is a positive real number. Multiplying numerator and denominator by the same number changes nothing, but if e is suitably chosen it can help us generate discrete points, reducing the overhead of having to round off the resultant points.
Basically what we need to do is: increment the coordinates by a fixed small amount, beginning from the starting point, and each time we have a new point progressing towards the end point.
In simple DDA, e is chosen as 1/max(|Δx|,|Δy|), so that one coordinate is integral and only the other has to be rounded, i.e. P(i+1) = P(i) + (1, Round(e*Δy)) (assuming |Δx| >= |Δy|): one coordinate is incremented by 1 and the other by e*Δy.
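A minimal runnable version of simple DDA, assuming integer endpoints (the function name is mine):

def dda(x0, y0, x1, y1):
    # step 1/max(|dx|, |dy|) of the way each iteration: one coordinate
    # advances by exactly +/-1 and only the other needs rounding
    dx, dy = x1 - x0, y1 - y0
    steps = max(abs(dx), abs(dy))
    if steps == 0:
        return [(x0, y0)]
    ex, ey = dx / steps, dy / steps
    return [(round(x0 + i * ex), round(y0 + i * ey))
            for i in range(steps + 1)]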
In symmetric DDA, e is chosen such that although both coordinates of the resultant points have to be rounded off, this can be done very efficiently, and thus quickly.
Specifically, e is chosen as 1/2^n, where 2^(n-1) <= max(|Δx|,|Δy|) < 2^n. In other words, the length of the line is taken to be 2^n-aligned. The increments for the two coordinates are e*Δx and e*Δy. With a suitably chosen initial fractional part of the starting coordinates, this causes the points to be generated as mixed fractions whose fractional parts form a cyclic series, i.e. they repeat over a small length. The resultant coordinates can thus easily be rounded off using two fixed-length look-up tables, one for each coordinate.
refer http://w3.msi.vxu.se/~gsu/DAB726-Ht06/Symm-DDA.pdf for an example.
Notice the cyclic repetition in the fractional part of the resultant coordinates.
For a right triangle specified by an equation aX + bY <= c on integers
I want to plot each pixel(*) in the triangle once and only once, in a pseudo-random order, and without storing a list of previously hit points.
I know how to do this with a line segment between 0 and x (sketched in code after this list):
pick a random point o along the line,
pick p that is relatively prime to x,
repeat for up to x times: o_next = (o_cur + p) mod x
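A sketch of that stride trick for the segment case (Python; the generator name is mine):

import math, random

def visit_segment(x):
    # visit 0..x-1 exactly once each, in pseudo-random order, with O(1) state
    o = random.randrange(x)              # random starting point o
    p = random.randrange(1, x + 1)       # candidate stride p
    while math.gcd(p, x) != 1:           # p must be relatively prime to x
        p = random.randrange(1, x + 1)
    for _ in range(x):
        yield o
        o = (o + p) % x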
To do this for a triangle, I would
1. Need to count the number of pixels in the triangle sans lists
2. Map an integer 0..points into an (x, y) pair that is a valid pixel inside the triangle
I hope any solution could be generalized to pyramids and higher dimensional shapes.
(*) I use the CG term pixel for the pair of integer points X,Y such that the equation is satisfied.
Since you want to guarantee visiting each pixel once and only once, it's probably better to think in terms of pixels rather than the real triangles.
You can slice the triangle horizontally and get a bunch of horizontal scan lines. Connect the scan lines together and you have converted your "triangle" into one long line. Apply your point-visiting algorithm to your long chain of scan lines.
By the way, this mapping only needs to happen on paper, all you need is a function that can return (x, y) given (t) along the virtual scan line.
Edit:
To convert two points to a line segment, you can look at Bresenham's scan conversion. Once you have the 3 line segments converted into series of points, you can put all the points into a bucket and group them by y. Within the same y-value, sort the points by x. The smallest x within a y-value is the begin point of that scan line and the largest x is its end point. This is called scan-converting a triangle. You can find more info if you Google it.
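A sketch of the mapping from a virtual scan-line index t to a pixel (Python; the span representation and names are mine):

import bisect

def make_mapper(spans):
    # spans: list of (y, x_begin, x_end) scan lines, x_end inclusive
    starts, total = [], 0
    for _, xb, xe in spans:
        starts.append(total)         # pixels that come before this span
        total += xe - xb + 1
    def f(t):                        # t-th pixel along the concatenated line
        i = bisect.bisect_right(starts, t) - 1
        y, xb, _ = spans[i]
        return xb + (t - starts[i]), y
    return total, f

Combined with the coprime-stride visitor: total, f = make_mapper(spans), then f((o + k*p) % total) for k in range(total) visits every pixel of the triangle exactly once.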
Here's a solution for Triangle Point Picking.
What you have to do is choose two vectors (sides) of your triangle, multiply each by a random number in [0,1] and add them up. This provides a uniform distribution in the parallelogram defined by the vectors. You'll have to check whether the result lies inside the original triangle; if it doesn't, either transform it back in or simply discard it and try again.
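A sketch of the fold-back variant (Python; uniform over the triangle with vertices a, b, c; names mine). Note this draws random points and so, unlike the scan-line approach, does not by itself guarantee each pixel is visited exactly once:

import random

def random_point_in_triangle(a, b, c):
    u, v = random.random(), random.random()
    if u + v > 1:            # landed in the far half of the parallelogram:
        u, v = 1 - u, 1 - v  # reflect back into the triangle
    return (a[0] + u * (b[0] - a[0]) + v * (c[0] - a[0]),
            a[1] + u * (b[1] - a[1]) + v * (c[1] - a[1]))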
One method is to put all of the pixels into an array and then shuffle the array (this is O(n)), then visit the pixels in the order in the shuffled array. This could require quite a lot of memory though.
Here's a method which wastes some CPU time but probably doesn't waste as much as a more complicated method would do.
Compute a rectangle that circumscribes the triangle. It will be easy to "linearize" that rectangle, each scan line followed by the next. Use the algorithm that you already know in order to traverse the pixels of the rectangle. When you hit each pixel, check if the pixel is in the triangle, and if not then skip it.
I would consider the lines of the triangle as a single line, cut into segments. The segments would be stored in an array, where each element stores the segment's length as well as its offset into the total length of the lines. Then, depending on the value of O, you can select which array element contains the pixel you want to draw at that moment and paint the pixel based on the values in that element.