I have a set of data points:
(x1, y1) (x2, y2) (x3, y3) ... (xn, yn)
The number of sample points can be in the thousands. I want to represent the same curve as accurately as possible with a minimal set of points (let's say 30). I want to capture as many inflection points as possible. However, I have a hard limit on the number of points allowed to represent the data.
What is the best algorithm to achieve the same? Is there any free software library that can help?
PS: I have tried to implement relative slope difference based point elimination, but this does not always result in the best possible data representation.
You are searching for an interpolation algorithm. If your set of points is a function in the mathematical sense (all x values are distinct from each other), you can go for polynomial interpolation; if the points are distributed over the 2D plane, you could use Bezier curves.
Late answer after years:
Have a look at the Douglas-Peucker algorithm:
function DouglasPeucker(PointList[], epsilon)
    // Find the point with the maximum distance
    dmax = 0
    index = 0
    end = length(PointList)
    for i = 2 to ( end - 1) {
        d = perpendicularDistance(PointList[i], Line(PointList[1], PointList[end]))
        if ( d > dmax ) {
            index = i
            dmax = d
        }
    }
    // If max distance is greater than epsilon, recursively simplify
    if ( dmax > epsilon ) {
        // Recursive call
        recResults1[] = DouglasPeucker(PointList[1...index], epsilon)
        recResults2[] = DouglasPeucker(PointList[index...end], epsilon)
        // Build the result list
        ResultList[] = {recResults1[1...length(recResults1)-1], recResults2[1...length(recResults2)]}
    } else {
        ResultList[] = {PointList[1], PointList[end]}
    }
    // Return the result
    return ResultList[]
end
It is frequently used to simplify GPS tracks by reducing the number of waypoints. As preparation, you may have to sort your points so that neighbouring points are adjacent in your list or array.
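Since the question has a hard limit on the number of points rather than a tolerance, here is a minimal Python sketch of a budgeted variant: instead of stopping at epsilon, it keeps splitting the segment with the largest deviation until the budget is used up. The names simplify_to_budget and max_points are made up for this illustration, and this greedy variant is an assumption on my part, not part of the classic algorithm above.

```python
import heapq
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    return abs(dy * (p[0] - a[0]) - dx * (p[1] - a[1])) / length

def farthest(points, lo, hi):
    """Index and distance of the point strictly between lo and hi farthest from the chord."""
    best_d, best_i = 0.0, None
    for i in range(lo + 1, hi):
        d = perpendicular_distance(points[i], points[lo], points[hi])
        if d > best_d:
            best_d, best_i = d, i
    return best_d, best_i

def simplify_to_budget(points, max_points):
    """Keep at most max_points (>= 2) points, always splitting the worst segment first."""
    if len(points) <= max_points or len(points) < 3:
        return list(points)
    keep = {0, len(points) - 1}
    heap = []  # entries: (-distance, split index, segment start, segment end)

    def push(lo, hi):
        d, i = farthest(points, lo, hi)
        if i is not None:
            heapq.heappush(heap, (-d, i, lo, hi))

    push(0, len(points) - 1)
    while heap and len(keep) < max_points:
        _, i, lo, hi = heapq.heappop(heap)
        keep.add(i)
        push(lo, i)
        push(i, hi)
    return [points[i] for i in sorted(keep)]
```

For example, simplify_to_budget(points, 30) would return at most 30 of the original samples, in their original order.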
It depends on whether your curve must pass through each point or whether an approximation is acceptable. Try:
Take points
Apply any interpolation (http://en.wikipedia.org/wiki/Polynomial_interpolation) to get equation of curve
Then take sample points with specific step.
The goal is to find coordinates in a figure with an unknown shape. What IS known is a list of coordinates of the boundary of that figure, for example:
boundary = [(0,0),(1,0),(2,0),(3,0),(3,1),(3,2),(3,3),(2,3),(2,2),(1,2),(1,3),(0,3),(0,2),(0,1)]
which would look something like this:
(Image: square with a gap)
This is a very basic example and I'd like to do it with very large lists of very different kinds of figures.
The question is how to get a random coordinate that lies within the figure WITHOUT hardcoding anything about the shape of the figure, because the shape will be unknown at the beginning. Is there a way to know for certain, or is making an estimate the best option? How would I implement an estimate like that?
Here is a tentative answer. You sample in two steps.
First, do some preparation work: split your figure into simple elementary objects. In your case you can split it into rectangles; often people triangulate and split the figure into triangles.
So you have N simple objects, each with area A_i, and the total area A = Sum(A_i).
First sampling step: select which rectangle you pick the point from.
In some pseudocode
r = randomU01(); // random value in [0...1) range
for(i in N) {
    r = r - A_i/A;
    if (r <= 0) {
        k = i;
        break; // stop at the first rectangle that uses up r
    }
}
So you have picked one rectangle with index k (A_k below denotes that rectangle), and then you just sample a point uniformly in that rectangle
x = A_k.dim.x * randomU01();
y = A_k.dim.y * randomU01();
return (x + A_k.lower_left_corner.x, y + A_k.lower_left_corner.y);
And that is it. A very similar technique works for a triangulated figure.
Rectangle selection could be optimized by doing a binary search over cumulative areas, or with the more complicated alias method.
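The two steps above, with the binary-search selection, can be sketched in Python as follows. The (x, y, width, height) tuple layout and the name sample_in_rectangles are assumptions made for this illustration; the decomposition into rectangles is taken as given.

```python
import bisect
import itertools
import random

def sample_in_rectangles(rects, rng=random):
    """Uniform point in a union of non-overlapping rectangles (x, y, width, height)."""
    areas = [w * h for (_, _, w, h) in rects]
    cumulative = list(itertools.accumulate(areas))
    # Step 1: pick a rectangle with probability proportional to its area.
    r = rng.random() * cumulative[-1]
    k = min(bisect.bisect_right(cumulative, r), len(rects) - 1)
    # Step 2: sample uniformly inside the chosen rectangle.
    x0, y0, w, h = rects[k]
    return (x0 + w * rng.random(), y0 + h * rng.random())

# The "square with a gap" from the question, split into three rectangles.
rects = [(0, 0, 3, 2), (0, 2, 1, 1), (2, 2, 1, 1)]
point = sample_in_rectangles(rects)
```

If many points are needed, the cumulative list would be built once and reused rather than recomputed per sample.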
If your boundary is generic, then the only good way to go is to triangulate your polygon using any good library out there (e.g. Triangle), then select one of the triangles based on area (step 1), then sample a point uniformly in the triangle with vertices A, B and C using two random U01 numbers r1 and r2,
P = (1 - sqrt(r1)) * A + (sqrt(r1)*(1 - r2)) * B + (r2*sqrt(r1)) * C
i.e., in pseudocode
r1 = randomU01();
s1 = sqrt(r1);
r2 = randomU01();
x = (1.0-s1)*A.x + s1*(1.0-r2)*B.x + r2*s1*C.x;
y = (1.0-s1)*A.y + s1*(1.0-r2)*B.y + r2*s1*C.y;
return (x,y);
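A minimal Python sketch of the triangulated case, assuming the triangulation is already available as a list of vertex triples (producing it is left to a library). The function names are made up for the example.

```python
import math
import random

def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def sample_in_triangle(a, b, c, rng=random):
    """Uniform point in a triangle using the square-root formula from the answer."""
    r1, r2 = rng.random(), rng.random()
    s1 = math.sqrt(r1)
    x = (1.0 - s1) * a[0] + s1 * (1.0 - r2) * b[0] + r2 * s1 * c[0]
    y = (1.0 - s1) * a[1] + s1 * (1.0 - r2) * b[1] + r2 * s1 * c[1]
    return (x, y)

def sample_in_triangles(triangles, rng=random):
    """Pick a triangle by area, then sample uniformly inside it."""
    areas = [triangle_area(*t) for t in triangles]
    a, b, c = rng.choices(triangles, weights=areas, k=1)[0]
    return sample_in_triangle(a, b, c, rng)

# Unit square split into two triangles.
triangles = [((0, 0), (1, 0), (1, 1)), ((0, 0), (1, 1), (0, 1))]
point = sample_in_triangles(triangles)
```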
Is there any algorithm / method to find the smallest regular hexagon around a set of points (x, y).
And by smallest I mean smallest area.
My current idea was to find the smallest circle enclosing the points, and then create a hexagon from there and check if all the points are inside, but that is starting to sound like a never ending problem.
First of all, let's define a hexagon as a quadruple [x0, y0, t0, s], where (x0, y0), t0 and s are its center, rotation and side length respectively.
Next, we need to find whether an arbitrary point is inside the hexagon. The following functions do this:
function getHexAlpha(t, hex)
    t = t - hex.t0;
    t = t - 2*pi * floor(t / (2*pi));
    return pi/2 - abs(rem(t, pi/3) - (pi/6));

function getHexRadious( P, hex )
    x = P.x - hex.x0;
    y = P.y - hex.y0;
    t = atan2(y, x);
    return hex.s * cos(pi/6) / sin(getHexAlpha(t, hex));

function isInHex(P, hex)
    r = getHexRadious(P, hex);
    d = sqrt((P.x - hex.x0)^2 + (P.y - hex.y0)^2);
    return r >= d;
Long story short, the getHexRadious function formulates the hexagon in polar form and returns the distance from the center of the hexagon to its boundary at each angle. Read this post for more details about the getHexAlpha and getHexRadious functions. This is how these work for a set of random points and an arbitrary hexagon:
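For reference, the same containment test as a small Python sketch. The Hexagon dataclass and the snake_case names are assumptions for this example; the formulas mirror the functions above, with t0 being the direction of one vertex.

```python
import math
from dataclasses import dataclass

@dataclass
class Hexagon:
    x0: float  # center x
    y0: float  # center y
    t0: float  # rotation: direction of one vertex, in radians
    s: float   # side length

def hex_alpha(t, hexagon):
    """Angle used by the polar form; ranges from pi/3 (vertex) to pi/2 (edge midpoint)."""
    t = (t - hexagon.t0) % (2 * math.pi)
    return math.pi / 2 - abs(t % (math.pi / 3) - math.pi / 6)

def hex_radius(p, hexagon):
    """Distance from the center to the boundary in the direction of point p."""
    t = math.atan2(p[1] - hexagon.y0, p[0] - hexagon.x0)
    return hexagon.s * math.cos(math.pi / 6) / math.sin(hex_alpha(t, hexagon))

def is_in_hex(p, hexagon):
    """True if p lies inside or on the hexagon."""
    d = math.hypot(p[0] - hexagon.x0, p[1] - hexagon.y0)
    return d <= hex_radius(p, hexagon)
```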
The Algorithm
I suggest a two-stepped algorithm:
1- Guess an initial hexagon that covers most of points :)
2- Tune s to cover all points
Chapter 1: (2) Following Tarantino in Kill Bill Vol.1
For now, let's assume that our arbitrary hexagon is a good guess. The following functions keep x0, y0, t0 and tune s to cover all points:
function getHexSide( P, hex )
    x = P.x - hex.x0;
    y = P.y - hex.y0;
    r = sqrt(x^2 + y^2);
    t = atan2(y, x);
    return r / (cos(pi/6) / sin(getHexAlpha(t, hex)));

function findMinSide( P[], hex )
    for all P[i] in P
        S[i] = getHexSide(P[i], hex);
    return max(S[]);
The getHexSide function is the inverse of getHexRadious. It returns the minimum side length required for a hexagon with x0, y0, t0 to cover point P. This is the outcome for the previous test case:
Chapter 2: (1)
As a guess, we can find the two points furthest away from each other and fit one of the hexagon's diameters on them:
function guessHex( P[] )
    D[,] = pairwiseDistance(P[]);
    [i, j] = indexOf(max(max(D[,])));
    [~, j] = max(D[i, :]);
    hex.x0 = (P[i].x + P[j].x) / 2;
    hex.y0 = (P[i].y + P[j].y) / 2;
    hex.s = D[i, j]/2;
    hex.t0 = atan2(P[i].y-hex.y0, P[i].x-hex.x0);
    return hex;
Although this method can find a relatively small hexagon, as a greedy approach it does not guarantee finding the optimum solution.
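Putting the guess and the side tuning together, a minimal Python sketch might look like this. It uses a brute-force search for the farthest pair (fine for small sets only) and returns a plain tuple; guess_hexagon and hex_side_for_point are names invented for this example.

```python
import math
from itertools import combinations

def hex_side_for_point(p, x0, y0, t0):
    """Smallest side length for a hexagon (x0, y0, t0) that still covers p."""
    dx, dy = p[0] - x0, p[1] - y0
    r = math.hypot(dx, dy)
    if r == 0.0:
        return 0.0
    t = (math.atan2(dy, dx) - t0) % (2 * math.pi)
    alpha = math.pi / 2 - abs(t % (math.pi / 3) - math.pi / 6)
    return r * math.sin(alpha) / math.cos(math.pi / 6)

def guess_hexagon(points):
    """Center on the farthest pair, point a vertex at one of them, then tune s."""
    a, b = max(combinations(points, 2), key=lambda pair: math.dist(pair[0], pair[1]))
    x0, y0 = (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0
    t0 = math.atan2(a[1] - y0, a[0] - x0)
    s = max(hex_side_for_point(p, x0, y0, t0) for p in points)
    return x0, y0, t0, s

x0, y0, t0, s = guess_hexagon([(0, 0), (4, 0), (2, 1), (2, -1.5)])
```

Because s is taken as the maximum over all points, the returned hexagon always covers the input, even when the initial guess of half the diameter would not.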
Chapter 3: A Better Guess
Well, this problem is definitely an optimization problem with its objective being to minimize area of hexagon (or s variable). I don't know if it has an analytical solution, and SO is not the right place to discuss it. But any optimization algorithm can be used to provide a better initial guess. I used GA to solve this with findMinSide as its cost function. In fact GA generates many guesses about x0, y0, and t0 and the best one will be selected. It finds better results but is more time consuming. Still no guarantee to find the optimum!
Optimization of Optimization
When it comes to optimization algorithms, performance is always an issue. Keep in mind that the hexagon only needs to enclose the convex hull of the points. If you are dealing with large sets of points, it's better to find the convex hull and get rid of the rest of the points.
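One common way to do that pruning is Andrew's monotone chain algorithm; the answer does not name a specific hull algorithm, so treat this Python sketch as one possible choice.

```python
def convex_hull(points):
    """Convex hull of 2D points (tuples) in counter-clockwise order, monotone chain."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]
```

Only the hull vertices then need to be passed to the cost function during optimization.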
Consider a discrete curve defined by the points (x1,y1), (x2,y2), (x3,y3), ... ,(xn,yn)
Define a constant SUM = y1+y2+y3+...+yn. Say we change the value of some k number of y points (increase or decrease) such that the total sum of these changed points is less than or equal to the constant SUM.
What would be the best possible manner to adjust the other y points given the following two conditions:
The total sum of the y points (y1'+y2'+...+yn') should remain constant ie, SUM.
The curve should retain as much of its original shape as possible.
A simple solution would be to define some delta as follows:
delta = (ym1' + ym2' + ym3' + ... + ymk') - (ym1 + ym2 + ym3 + ... + ymk)
and to distribute this delta equally over the rest of the points. Here ym1' is the value of the modified point after modification and ym1 is its value before modification, so delta is the total change introduced by the modification.
However, this would not ensure a smooth curve, as the area near the changed points would appear ragged. Does a better solution/algorithm exist for this problem?
I've used the following approach, though it is a bit OTT.
Consider adding d[i] to y[i], to get s[i], the smoothed value.
We seek to minimise
S = Sum{ 1<=i<N-1 | sqr( s[i+1]-2*s[i]+s[i-1]) } + f*Sum{ 0<=i<N | sqr( d[i])}
The first term is a sum of the squares of (an approximate) second derivative of the curve, and the second term penalises moving away from the original. f is a (positive) constant. A little algebra recasts this as
S = sqr( ||A*d - b||)
where the matrix A has a nice structure, and indeed A'*A is penta-diagonal, which means that the normal equations (ie d = Inv(A'*A)*A'*b) can be solved efficiently. Note that d is computed directly, there is no need to initialise it.
Given the solution d to this problem we can compute the solution d^ to the same problem but with the constraint One'*d = 0 (where One is the vector of all ones) like this
d^ = d - (One'*d/Q) * e
e = Inv(A'*A)*One
Q = One'*e
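The procedure can be sketched with NumPy as follows. This version uses a dense solve for brevity, whereas a real implementation for thousands of points would exploit the penta-diagonal structure with a banded solver; the name smooth_keep_sum is made up for the example.

```python
import numpy as np

def smooth_keep_sum(y, f):
    """Minimise roughness + f * |d|^2 subject to sum(d) == 0; returns s = y + d^."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # D holds the second-difference rows (1, -2, 1); these are the top rows of A.
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    M = D.T @ D + f * np.eye(n)      # A'*A, accumulated directly
    rhs = -D.T @ (D @ y)             # A'*b
    d = np.linalg.solve(M, rhs)      # unconstrained solution
    e = np.linalg.solve(M, np.ones(n))
    Q = e.sum()                      # One'*e
    d_hat = d - (d.sum() / Q) * e    # enforce One'*d^ = 0
    return y + d_hat

y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
s = smooth_keep_sum(y, f=0.1)
```

For this particular A the second-difference rows annihilate the all-ones vector, so e comes out constant and the correction amounts to subtracting the mean of d.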
What value to use for f? Well, a simple approach is to try out this procedure on sample curves for various fs and pick a value that looks good. Another approach is to pick a measure of smoothness, for example the rms of the second derivative, choose a target value for it, and then search for an f that attains that value. As a general rule, the bigger f is, the less smooth the smoothed curve will be.
Some motivation for all this. The aim is to find a 'smooth' curve 'close' to a given one. For this we need a measure of smoothness (the first term in S) and a measure of closeness (the second term). Why these measures? Well, each is easy to compute, and each is quadratic in the variables (the d[]); this means that the problem becomes an instance of linear least squares, for which there are efficient algorithms available. Moreover, each term in each sum depends only on nearby values of the variables, which in turn means that the 'inverse covariance' (A'*A) has a banded structure and so the least squares problem can be solved efficiently. Why introduce f? Well, if we didn't have f (or set it to 0) we could minimise S by setting d[i] = -y[i], getting a perfectly smooth curve s[] = 0, which has nothing to do with the y curve. On the other hand, if f is gigantic, then to minimise S we should concentrate on the second term and set d[i] = 0, and our 'smoothed' curve is just the original. So it's reasonable to suppose that as we vary f, the corresponding solutions will vary between being very smooth but far from y (small f) and being close to y but a bit rough (large f).
It's often said that the normal equations, whose use I advocate here, are a bad way to solve least squares problems, and this is generally true. However with 'nice' banded systems -- like the one here -- the loss of stability through using the normal equations is not so great, while the gain in speed is so great. I've used this approach to smooth curves with many thousands of points in a reasonable time.
To see what A is, consider the case where we had 4 points. Then our expression for S comes down to:
sqr( s[2] - 2*s[1] + s[0]) + sqr( s[3] - 2*s[2] + s[1]) + f*(d[0]*d[0] + .. + d[3]*d[3]).
If we substitute s[i] = y[i] + d[i] in this we get, for example,
s[2] - 2*s[1] + s[0] = d[2]-2*d[1]+d[0] + y[2]-2*y[1]+y[0]
and so we see that for this to be sqr( ||A*d-b||) we should take (writing g = sqrt(f), so that the last four rows contribute f*d[i]*d[i] each)
A = ( 1 -2 1 0)
( 0 1 -2 1)
( g 0 0 0)
( 0 g 0 0)
( 0 0 g 0)
( 0 0 0 g)
b = ( -(y[2]-2*y[1]+y[0]))
( -(y[3]-2*y[2]+y[1]))
( 0 )
( 0 )
( 0 )
( 0 )
In an implementation, though, you probably wouldn't want to form A and b, as they are only going to be used to form the normal equation terms, A'*A and A'*b. It would be simpler to accumulate these directly.
This is a constrained optimization problem. The functional to be minimized is the integrated difference of the original curve and the modified curve. The constraints are the area under the curve and the new locations of the modified points. It is not easy to write such codes on your own. It is better to use some open source optimization codes, like this one: ool.
What about keeping the same dynamic range?
compute the original min0,max0 y-values
smooth the y-values
compute the new min1,max1 y-values
linearly interpolate all values to match the original min,max y
That is it.
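The four steps can be sketched with NumPy like this. The 1/4, 1/2, 1/4 smoothing kernel and the number of passes are assumptions chosen for the example; any smoothing filter could be substituted in step 2.

```python
import numpy as np

def smooth_keep_range(y, passes=10):
    """Smooth y, then rescale linearly so min and max match the original."""
    y = np.asarray(y, dtype=float)
    lo, hi = y.min(), y.max()                      # 1. original min0, max0
    s = y.copy()
    for _ in range(passes):                        # 2. smooth (endpoints kept fixed)
        s[1:-1] = 0.25 * s[:-2] + 0.5 * s[1:-1] + 0.25 * s[2:]
    s_lo, s_hi = s.min(), s.max()                  # 3. new min1, max1
    if s_hi - s_lo == 0.0:
        return s                                   # flat signal: nothing to rescale
    return lo + (s - s_lo) * (hi - lo) / (s_hi - s_lo)  # 4. linear remap

x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + 0.1 * np.cos(37.0 * x)
s = smooth_keep_range(y)
```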
I am not sure about the area, but this should keep the shape much closer to the original one. I got this idea just now while reading your question; I face a similar problem, so I will try to code it right away. Anyway, +1 for giving me the idea :)
You can adapt this and combine it with the area.
Before applying steps #1..#4, compute the area, and after that compute the new area. Then multiply all values by the old_area/new_area ratio. If you also have negative values and are not computing the absolute area, then you have to handle positive and negative areas separately and find the multiplication ratio that best fits the original area for both at once.
[edit1] some results for constant dynamic range
As you can see, the shape is slightly shifting to the left. Each image is the result of applying a few hundred smoothing operations. I am thinking of subdividing into local min/max intervals to improve this ...
[edit2] I have finished the filter for my own purposes
#include <math.h>

void advanced_smooth(double *p,int n)
    {
    int i,j,i0,i1;
    double a0,a1,b0,b1,dp,w0,w1;
    double *p0,*p1,*w; int *q;
    if (n<3) return;
    p0=new double[n<<2]; if (p0==NULL) return;
    p1=p0+n;                                        // (assumed) missing from the listing, p1 is used below
    w =p1+n;
    q =(int*)((double*)(w+n));
    // compute original min,max
    for (a0=p[0],i=0;i<n;i++) if (a0>p[i]) a0=p[i];
    for (a1=p[0],i=0;i<n;i++) if (a1<p[i]) a1=p[i];
    for (i=0;i<n;i++) p0[i]=p[i];                   // store original values for range restoration
    // compute local min max positions to p1[]
    dp=0.01*(a1-a0);                                // min delta threshold
    // compute first derivation
    p1[0]=0.0; for (i=1;i<n;i++) p1[i]=p[i]-p[i-1];
    for (i=1;i<n-1;i++)                             // eliminate glitches
     if (p1[i]*p1[i-1]<0.0)
      if (p1[i]*p1[i+1]<0.0)
       if (fabs(p1[i])<=dp)
        p1[i]=0.5*(p1[i-1]+p1[i+1]);                // (assumed) body missing from the listing
    for (i0=1;i0;)                                  // remove zeros from derivation
     for (i0=0,i=0;i<n;i++)
      if (fabs(p1[i])<dp)
        {
             if ((i>  0)&&(fabs(p1[i-1])>=dp)) { i0=1; p1[i]=p1[i-1]; }
        else if ((i<n-1)&&(fabs(p1[i+1])>=dp)) { i0=1; p1[i]=p1[i+1]; }
        }
    // find local min,max to q[]
    q[n-2]=0; q[n-1]=0; for (i=1;i<n-1;i++) if (p1[i]*p1[i-1]<0.0) q[i-1]=1; else q[i-1]=0;
    for (i=0;i<n;i++)                               // set sign as +max,-min
     if ((q[i])&&(p1[i]<-dp)) q[i]=-q[i];           // this shifts smooth curve to the left !!!
    // compute weights
    for (i=0;i<n;i++) w[i]=1.0;                     // (assumed) default: no filtering on flat intervals
    for (i0=0,i1=1;i1<n;i0=i1,i1++)                 // loop through all local min,max intervals
        {
        for (;(!q[i1])&&(i1<n-1);i1++);             // <i0,i1>
        b0=0.5*(p[i0]+p[i1]);                       // (assumed) interval midpoint
        b1=fabs(p[i1]-p[i0]);                       // (assumed) interval range
        if (b1>=1e-6)
         for (b1=0.35/b1,i=i0;i<=i1;i++)            // compute weights bigger near local min max
          w[i]=0.8+(fabs(p[i]-b0)*b1);              // (assumed) gives weights in <0.8,0.8+0.35/2>
        }
    // smooth few times
    for (j=0;j<5;j++)
        {
        for (i=0;i<n  ;i++) p1[i]=p[i];             // store data to avoid shifting by using half filtered data
        for (i=1;i<n-1;i++)                         // FIR smooth filter
            {
            w0=w[i];                                // (assumed) weight 1.0 means no change
            w1=(1.0-w0)*0.5;
            p[i]=(w1*p1[i-1])+(w0*p1[i])+(w1*p1[i+1]);
            }
        for (i=1;i<n-1;i++)                         // avoid local min,max shifting too much
            {
            if (q[i]>0)                             // local max
                {
                if (p[i]<p[i-1]) p[i]=p[i-1];       // can not be lower than neighbours
                if (p[i]<p[i+1]) p[i]=p[i+1];
                }
            if (q[i]<0)                             // local min
                {
                if (p[i]>p[i-1]) p[i]=p[i-1];       // can not be higher than neighbours
                if (p[i]>p[i+1]) p[i]=p[i+1];
                }
            }
        }
    for (i0=0,i1=1;i1<n;i0=i1,i1++)                 // loop through all local min,max intervals
        {
        for (;(!q[i1])&&(i1<n-1);i1++);             // <i0,i1>
        // restore original local min,max
        a0=p0[i0]; b0=p[i0];
        a1=p0[i1]; b1=p[i1];
        if (a0>a1)
            {
            dp=a0; a0=a1; a1=dp;
            dp=b0; b0=b1; b1=dp;
            }
        b1-=b0;                                     // (assumed) smoothed range of the interval
        if (b1>=1e-6)
         for (dp=(a1-a0)/b1,i=i0;i<=i1;i++)
          p[i]=a0+((p[i]-b0)*dp);                   // (assumed) remap to the original range
        }
    delete[] p0;
    }
So p[n] is the input/output data. There are a few things that can be tweaked, such as:
weights computation (the constants 0.8 and 0.35 mean the weights are in <0.8,0.8+0.35/2>)
number of smoothing passes (now 5 in the for loop)
the bigger the weight, the less the filtering; 1.0 means no change
The main idea behind it is:
find local extremes
compute weights for smoothing
so that near local extremes the output is almost unchanged
repair the dynamic range for each interval between neighbouring local extremes
I also tried to restore the area, but that is incompatible with my task because it distorts the shape a lot. So if you really need the area, then focus on that and not on the shape. Smoothing mostly causes the signal to shrink, so after area restoration the shape rises in magnitude.
The current filter state has no noticeable sideways shifting of the shape (which was the main goal for me). Some images for a more bumpy signal (the original filter was extremely poor on this):
As you can see, there is no visible shifting of the signal shape. The local extremes have a tendency to create sharp spikes after very heavy smoothing, but that was expected.
Hope it helps ...
I created a 4-point bezier curve. I knew the total bezier curve length using this link. And I knew the length from start point.
I want to know how to get a time value from bezier curve and a point. I found a similar question and divided the bezier curve into 1000 pieces; but it isn't a good solution.
How can I get t value?
Note that for a cubic Bezier curve, there is no "one t value for each coordinate". A cubic Bezier curve can self-intersect, so you can find multiple t values for a single coordinate. There are two ways to do this: approximately or symbolically.
If you want an approximate answer (like what you're already doing for the length computation), simply construct a lookup table of coordinates-for-t:
buildLUT(a,b,c,d) {
  // iterate over an integer index so rounding in t cannot skip or repeat a slot
  for(i=0; i<=100; i++) {
    t = i/100;
    LUTx[i] = getCoordinate(t, a.x,b.x,c.x,d.x);
    LUTy[i] = getCoordinate(t, a.y,b.y,c.y,d.y);
  }
}
And write an extra function for reverse lookups, or to build the reverse LUTs:
findTforCoordinate(x, y) {
  found = []
  for(i=0, len=LUTx.length; i<len; i++) {
    _x = LUTx[i], _y = LUTy[i]
    // exact match only; compare distances instead to get the closest t
    if(x==_x && y==_y) { found.push(i/(len-1)); }
  }
  return found
}
where a,b,c,d are your curve's control points. Since this is approximate, you're not looking for "t value for coordinate" but "closest t value to coordinate". It won't be perfect.
What WILL be perfect is finding all possible t values for the x and y coordinate components separately (up to three each, so up to six in total), then finding the one or two t values that are the same between the x and y solutions. You can do this by using Cardano's approach, which is explained in another stackoverflow question here: Cubic Bezier reverse GetPoint equation: float for Vector <=> Vector for float
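For the approximate route, a minimal Python sketch of a "closest t" lookup could look like this: a coarse scan over sampled t values followed by a step-halving local refinement. The refinement step and the names closest_t, samples and refine are assumptions added for the example, and only the single nearest t is returned, so a self-intersecting curve would need extra handling.

```python
def bezier_point(t, a, b, c, d):
    """Point on the cubic Bezier curve with control points a, b, c, d at parameter t."""
    mt = 1.0 - t
    return tuple(
        mt**3 * a[k] + 3 * mt * mt * t * b[k] + 3 * mt * t * t * c[k] + t**3 * d[k]
        for k in range(2)
    )

def closest_t(point, a, b, c, d, samples=100, refine=20):
    """Approximate t whose curve point is nearest to `point`."""
    def dist2(t):
        x, y = bezier_point(t, a, b, c, d)
        return (x - point[0]) ** 2 + (y - point[1]) ** 2

    # Coarse pass: the lookup-table idea, scanning evenly spaced t values.
    best = min((i / samples for i in range(samples + 1)), key=dist2)
    # Fine pass: try half-size steps to either side of the current best.
    step = 1.0 / samples
    for _ in range(refine):
        for cand in (best - step, best + step):
            if 0.0 <= cand <= 1.0 and dist2(cand) < dist2(best):
                best = cand
        step /= 2.0
    return best

a, b, c, d = (0, 0), (1, 2), (3, 2), (4, 0)
t = closest_t(bezier_point(0.3731, a, b, c, d), a, b, c, d)
```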
Consider points Y given in increasing order from [0,T). We are to consider these points as lying on a circle of circumference T. Now consider points X also from [0,T) and also lying on a circle of circumference T.
We say the distance between X and Y is the sum of the absolute distances between each point in X and its closest point in Y, recalling that both are considered to be lying on a circle. Write this distance as Delta(X, Y).
I am trying to find a quick way of determining a rotation of X which makes this distance as small as possible.
My code for making some data to test with is
import random
import numpy as np
from bisect import bisect_left
def simul(rate, T):
    time = np.random.exponential(rate)
    times = [0]
    newtime = times[-1]+time
    while (newtime < T):
        times.append(newtime)
        newtime = newtime+np.random.exponential(rate)
    return times[1:]
For each point I use this function to find its closest neighbor.
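def takeClosest(myList, myNumber, T):
    """
    Assumes myList is sorted. Returns closest value to myNumber in a circle of circumference T.
    If two numbers are equally close, return the smallest number.
    """
    pos = bisect_left(myList, myNumber)
    before = myList[pos - 1]
    after = myList[pos%len(myList)]
    if pos == 0:
        before -= T  # wrap around: the predecessor is the last point, one lap back
    if pos == len(myList):
        after += T   # wrap around: the successor is the first point, one lap ahead
    if after - myNumber < myNumber - before:
        return after
    return before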
So the distance between two circles is:
def circle_dist(timesY, timesX):
    dist = 0
    for t in timesX:
        closest_number = takeClosest(timesY, t, T)
        dist += np.abs(closest_number - t)
    return dist
So to make some data we just do
#First make some data
T = 5000
timesX = simul(1, T)
timesY = simul(10, T)
Finally to rotate circle timesX by offset we can
timesX = [(t + offset)%T for t in timesX]
In practice my timesX and timesY will have about 20,000 points each.
Given timesX and timesY, how can I quickly find (approximately) which rotation of timesX gives
the smallest distance to timesY?
Distance along the circle between a single point and a set of points is a piecewise linear function of rotation. The critical points of this function are the points of the set itself (zero distance) and points midway between neighbouring points of the set (local maximums of distance). Linear coefficients of such function are ±1.
Sum of such functions is again piecewise linear, but now with a quadratic number of critical points. Actually all these functions are the same, except shifted along the argument axis. Linear coefficients of the sum are integers.
To find its minimum one would have to calculate its value in all critical points.
I don't see a way to significantly reduce the amount of work needed, but 1,600,000,000 points is not such a big deal anyway, especially if you can spread the work between several processors.
To calculate sum of two such functions, represent the summands as sequences of critical points and associated coefficients to the left and to the right of each critical point. Then just merge the two point sequences while adding the coefficients.
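As a brute-force illustration of evaluating the distance at critical points, here is a NumPy sketch for small inputs. It only tries the offsets where some point of X lands exactly on a point of Y, on the reasoning that the other critical points (the midpoints) are local maxima of the individual terms; that restriction and the names circle_dist and best_rotation are assumptions of this example. With |X||Y| candidates and a full evaluation for each, it is far too slow for 20,000 points and is meant only to make the idea concrete.

```python
import numpy as np

def circle_dist(ys, xs, T):
    """Sum over xs of the circular distance to the nearest point in ys."""
    ys = np.sort(np.asarray(ys, dtype=float) % T)
    xs = np.asarray(xs, dtype=float) % T
    # Pad with wrapped copies so every x has a neighbour on both sides.
    ext = np.concatenate(([ys[-1] - T], ys, [ys[0] + T]))
    pos = np.searchsorted(ext, xs)
    return float(np.minimum(xs - ext[pos - 1], ext[pos] - xs).sum())

def best_rotation(ys, xs, T):
    """Try every offset that aligns some x with some y; return (offset, distance)."""
    ys = np.asarray(ys, dtype=float)
    xs = np.asarray(xs, dtype=float)
    candidates = np.unique((ys[None, :] - xs[:, None]) % T)
    best = min(candidates, key=lambda off: circle_dist(ys, xs + off, T))
    return float(best), circle_dist(ys, xs + best, T)

T = 10.0
ys = np.array([1.0, 2.5, 4.0, 7.0])
xs = (ys - 1.3) % T          # ys rotated back by 1.3
offset, dist = best_rotation(ys, xs, T)
```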
You can solve your (original) problem with a sweep line algorithm. The trick is to use the right "discretization". Imagine cutting your circle up into two strips:
X: x....x....x..........x................x.........x...x
Y: .....x..........x.....x..x.x...........x.............
Now calculate the score = 5+0+5+1+1+9+6 = 27 (the distance from each point of X to its closest point in Y, wrapping around the ends).
The key observation is that if we rotate X very slightly (right say), some of the points will improve and some will get worse. We can call this the "differential". In the above example the differential would be 1 - 1 - 1 + 1 + 1 - 1 + 1 because the first point is matched to something on its right, the second point is matched to something under it or to its left etc.
Of course, as we move X more, the differential will change. However only as many times as the matchings change, which is never more than |X||Y| but probably much less.
The proposed algorithm is thus to calculate the initial score and the time (X position) of the next change in differential. Go to that next position and calculate the score again. Continue until you reach your starting position.
This is probably a good example for the iterative closest point (ICP) algorithm:
It repeatedly matches each point with its closest neighbor and moves all points such that the mean squared distance is minimized. (Note that this corresponds to minimizing the sum of squared distances.)
import pylab as pl
T = 10.0
X = pl.array([3, 5.5, 6])
Y = pl.array([1, 1.5, 2, 4])
pl.subplot(1, 2, 1, polar=True)
pl.plot(X / T * 2 * pl.pi, pl.ones(X.shape), 'r.', ms=10, mew=3)
pl.plot(Y / T * 2 * pl.pi, pl.ones(Y.shape), 'b+', ms=10, mew=3)
circDist = lambda X, Y: (Y - X + T / 2) % T - T / 2
while True:
    D = circDist(pl.reshape(X, (-1, 1)), pl.reshape(Y, (1, -1)))
    closestY = pl.argmin(D**2, axis = 1)
    distance = circDist(X, Y[closestY])
    shift = pl.mean(distance)
    if pl.absolute(shift) < 1e-3:
        break
    X = (X + shift) % T
pl.subplot(1, 2, 2, polar=True)
pl.plot(X / T * 2 * pl.pi, pl.ones(X.shape), 'r.', ms=10, mew=3)
pl.plot(Y / T * 2 * pl.pi, pl.ones(Y.shape), 'b+', ms=10, mew=3)
Important properties of the proposed solution are:
The ICP is an iterative algorithm. Thus it depends on an initial approximate solution. Furthermore, it won't always converge to the global optimum. This mainly depends on your data and the initial solution. If in doubt, try evaluating the ICP with different starting configurations and choose the most frequent result.
The current implementation performs a directed match: It looks for the closest point in Y relative to each point in X. It might yield different matches when swapping X and Y.
Computing all pair-wise distances between points in X and points in Y might be intractable for large point clouds (like 20,000 points, as you indicated). Therefore, the line D = circDist(...) might get replaced by a more efficient approach, e.g. not evaluating all possible pairs.
All points contribute to the final rotation. If there are any outliers, they might distort the shift significantly. This can be overcome with a robust average like the median or simply by excluding points with large distance.
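Regarding the cost of the pairwise matrix, one possible replacement is a sorted lookup with np.searchsorted, which finds each nearest neighbour in logarithmic time. The sketch below is a variant under that assumption, not the code above; icp_rotation, max_iter and tol are names chosen for the example, and it returns the accumulated rotation instead of plotting.

```python
import numpy as np

def icp_rotation(xs, ys, T, max_iter=100, tol=1e-6):
    """ICP on a circle of circumference T; returns the total shift applied to xs."""
    ys = np.sort(np.asarray(ys, dtype=float) % T)
    xs = np.asarray(xs, dtype=float) % T
    # Pad ys with wrapped copies so every x has a neighbour on both sides.
    ext = np.concatenate(([ys[-1] - T], ys, [ys[0] + T]))
    total = 0.0
    for _ in range(max_iter):
        pos = np.searchsorted(ext, xs)
        left, right = ext[pos - 1], ext[pos]
        nearest = np.where(xs - left <= right - xs, left, right)
        shift = float(np.mean(nearest - xs))   # mean signed offset to the matches
        if abs(shift) < tol:
            break
        xs = (xs + shift) % T
        total += shift
    return total % T

rotation = icp_rotation([3.0, 5.5, 6.0], [1.0, 1.5, 2.0, 4.0], T=10.0)
```

Swapping np.mean for np.median in the shift computation would give the robust average mentioned in the last point.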