What is the number of degrees needed for polynomial curve fitting? - curve-fitting

Assume we have m data points.
What polynomial degree is needed for curve fitting if we wish the adjusted R^2 value to be 1? (Theoretically, it will be 1, but realistically it's nearly 1 due to round-off errors.)
What is the reason for the chosen number?
8 points (2 0 0 3 8 5 3 3 ) example shown below, but you have to answer with m data points. If you use 8 data points your score will be reduced.

A polynomial of degree m-1 will exactly fit (R^2 = 1) m data points with different x values.
A polynomial of degree m-1 has m degrees of freedom a_i:
y(x) = a_1 + a_2 x^1 + a_3 x^2 + ... + a_m x^(m-1)
These m degrees of freedom allow a polynomial of degree m-1 to fit m data points uniquely.
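As a rough illustration of the exact fit, here is a minimal Python sketch that evaluates the interpolating polynomial in Lagrange form with exact rational arithmetic. The function name `lagrange_eval` is made up for this example, and the x values 1..8 paired with the question's sample values are an assumption (the question only lists the y values).

```python
from fractions import Fraction

def lagrange_eval(points, x):
    """Evaluate at x the polynomial of degree <= m-1 through m points
    with distinct x values (Lagrange form, exact arithmetic)."""
    x = Fraction(x)
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / Fraction(xi - xj)
        total += term
    return total

# Sample values from the question; x = 1..8 is an assumption.
ys = [2, 0, 0, 3, 8, 5, 3, 3]
points = list(enumerate(ys, start=1))

# With m points and degree m-1, every residual is exactly zero.
residuals = [lagrange_eval(points, x) - y for x, y in points]
print(residuals)
```

The same function works for any m, since the degree is implied by the number of points passed in.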

Related

Genetic Algorithm : Find curve that fits points

I am working on a genetic algorithm. Here is how it works :
Input : a list of 2D points
Input : the degree of the curve
Output : the equation of the curve that passes through points the best way (try to minimize the sum of vertical distances from point's Ys to the curve)
The algorithm finds good equations for simple straight lines and for 2-degree equations.
But for 4 points and degree-3 equations or higher, it gets more complicated. I cannot find the right combination of parameters: sometimes I have to wait 5 minutes and the curve found is still very bad. I tried modifying many parameters, from population size to number of parents selected...
Are there well-known parameter combinations or theorems in GA programming that can help me?
Thank you ! :)
Based on what is given, you would need polynomial interpolation, in which the degree of the equation is the number of points minus 1.
n = (Number of points) - 1
Now having said that, let's assume you have 5 points that need to be fitted and I am going to define them in a variable:
var points = [[0,0], [2,3], [4,-1], [5,7], [6,9]]
Note that the array of points is ordered by x value, which you need to do as well.
Then the equation would be:
f(x) = a1*x^4 + a2*x^3 + a3*x^2 + a4*x + a5
Now, based on the definition (https://en.wikipedia.org/wiki/Polynomial_interpolation#Constructing_the_interpolation_polynomial), the coefficients can be computed.
You need to use the referenced page to work out the coefficients.
It is not that complicated, for the polynomial interpolation of degree n you get the following equation:
p(x) = c0 + c1 * x + c2 * x^2 + ... + cn * x^n = y
This means we need n + 1 genes for the coefficients c0 to cn.
The fitness function is the sum of all squared distances from the points to the curve; the formula for the squared distance is given below. With this definition a smaller value is better; if you don't want that, you can take the inverse (1 / sum of squared distances):
d_squared(xi, yi) = (yi - p(xi))^2
I think for faster convergence you could limit the mutation: e.g. when mutating, with 20% probability choose a new value between min and max (e.g. -1000 and 1000), and with 80% probability multiply the old value by a random factor between 0.8 and 1.2.
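The fitness function and the limited mutation described above could be sketched as follows. This is only an illustration: the names `poly`, `fitness` and `mutate` are made up here, and mutating a single randomly chosen gene per call is an assumption (the answer does not say how many genes to mutate).

```python
import random

def poly(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x^n using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def fitness(coeffs, points):
    """Sum of squared vertical distances; smaller is better."""
    return sum((y - poly(coeffs, x)) ** 2 for x, y in points)

def mutate(coeffs, rng, lo=-1000.0, hi=1000.0, reset_prob=0.2):
    """Mutate one randomly chosen gene (assumption: one gene per call).
    With probability reset_prob pick a fresh value in [lo, hi],
    otherwise scale the old value by a factor in [0.8, 1.2]."""
    child = list(coeffs)
    i = rng.randrange(len(child))
    if rng.random() < reset_prob:
        child[i] = rng.uniform(lo, hi)
    else:
        child[i] *= rng.uniform(0.8, 1.2)
    return child

rng = random.Random(0)
points = [(0, 0), (2, 3), (4, -1), (5, 7), (6, 9)]
genes = [rng.uniform(-1000, 1000) for _ in range(len(points))]  # n + 1 genes
child = mutate(genes, rng)
print(fitness(genes, points), fitness(child, points))
```

Scaling by a factor near 1 keeps most mutations local, while the occasional full reset lets the search escape a poor region.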

Finding a polynomial of minimum degree given series

Given a series of numbers, how can we find a polynomial which generalizes the series? With this generalization one should then be able to find any term in the series.
While searching on the net I found out that one can use Lagrange's interpolation technique. How accurate is the method for generalizing the series?
Can we use some other method to find a polynomial?
There are several algorithms which will generate a polynomial matching a finite series; as "justhalf" identified, Lagrange's interpolation is one technique.
In general, if you are given n points with distinct x values, you can uniquely define a polynomial of degree n-1 (or sometimes less) which matches at every point.
Consider the series with only two terms, "2, 4". As this has only two terms (n=2), there is a polynomial of degree 1 which will generate the series. The general form is y = ax+b and we need to find a and b:
y = ax + b
So
2 = a⋅1 + b => 2 = a + b
4 = a⋅2 + b => 4 = 2a + b
Therefore a = 2 and b = 0.
y = 2x
You can see if you substitute x=1 and x=2 you get the values 2 and 4 respectively.
If the series was 2,4,8 then you would need a polynomial of degree 3-1 = 2, say y = ax^2 + bx + c (where these a and b are new values, not necessarily the same as the a and b for the previous case).
Then you would know that:
2 = a⋅1² + b⋅1 + c => 2 = a + b + c (i)
4 = a⋅2² + b⋅2 + c => 4 = 4a + 2b + c (ii)
8 = a⋅3² + b⋅3 + c => 8 = 9a + 3b + c (iii)
You can solve these equations to find a, b and c:
Subtract (i) from (ii):
2 = 3a + b (iv)
Subtract (ii) from (iii)
4 = 5a + b (v)
Subtract (iv) from (v)
2 = 2a => a = 1
So from (iv)
2 = 3⋅1 + b = 3 + b => b = -1
From (i)
2 = a + b + c = 1 + -1 + c = c => c = 2
So the polynomial y = ax² + bx + c = x² - x + 2 agrees at the three points
Verify:
1² - 1 + 2 = 2
2² - 2 + 2 = 4
3² - 3 + 2 = 8
As we wanted.
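The hand elimination above can also be done mechanically. The following is a minimal sketch, not a definitive implementation: `fit_exact` is a made-up name, and it assumes (as in the worked example) that the terms of the series sit at x = 1, 2, ..., n. It builds the same simultaneous equations and solves them by Gauss-Jordan elimination with exact fractions.

```python
from fractions import Fraction

def fit_exact(ys):
    """Coefficients (highest power first) of the degree n-1 polynomial
    with p(1) = ys[0], p(2) = ys[1], ..., p(n) = ys[n-1]."""
    n = len(ys)
    # Augmented matrix: the row for x is [x^(n-1), ..., x, 1 | y].
    rows = [[Fraction(x) ** p for p in range(n - 1, -1, -1)] + [Fraction(y)]
            for x, y in enumerate(ys, start=1)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so the pivot becomes 1.
        rows[col] = [v / rows[col][col] for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

print(fit_exact([2, 4]))     # coefficients of y = 2x
print(fit_exact([2, 4, 8]))  # coefficients of y = x^2 - x + 2
```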
But note that this polynomial y = x² - x + 2 also exactly generates the series with only the first 2 terms, "2, 4". So this series with only two terms is satisfied by two polynomials, y = 2x and y = x² - x + 2. Despite agreeing on the first two values 2,4 these are very different polynomials.
In general, if you have a series of n terms then there is a unique polynomial of degree n-1 which will generate the series. In general, there will be no polynomials of degree less than n-1 which will exactly generate it (you may get lucky, but it's not generally true). There are an infinite number of polynomials of degree greater than n-1 which will generate the data.
Usually in numerical analysis you try to generate a polynomial of degree less than n-1 which approximates the data (doesn't match exactly, but minimises error). Exact solutions of degree n-1 are unstable, in that tiny changes to the input series produce very different equations. This is not so true of polynomial approximations of degree less than n-1. As many physical measurements have inherent error, using lower degree polynomials minimises the impact of measurement errors.
Let's now consider the series 2, 4, 8, 16
You can produce a polynomial of degree 3 (y = ax³ + bx² + cx + d) which exactly matches these data points using exactly the same approach. This (again) is just solving a set of linear simultaneous equations. This is essentially how Lagrange's algorithm works; we have solved the equations by hand instead of using matrix notation (as Lagrange does).
But given 2,4,8,16 most people would think that the equation is y = 2^x. This is an exponential, not a polynomial, so it can't be expressed as a polynomial.
For the series 2,4,8 we derived the polynomial y = x² - x + 2. If we tried to extrapolate to find the next value, plugging in x=4 will give us y = 4² - 4 + 2 = 14. The term after that (x=5) would be y = 5² - 5 + 2 = 22. As x gets larger, y = x² - x + 2 becomes an increasingly bad approximation to y = 2^x. In fact no polynomial will grow as fast as y = 2^x.
So ...
If you have n points, you can always find a unique polynomial of degree n-1 (or sometimes less) which will generate exactly those n points for x=1,2,3..n. This is not often used for real life problems, because these solutions are unstable (small changes to input produce large changes to the polynomial).
If you have n points, there are an infinite number of polynomials of degree n or greater which will produce the series. These all have identical values for x = 1, 2, ... n but will disagree on the n+1, n+2 etc terms.
Typically a polynomial approximation of degree less than n-1 is used. It won't usually be an exact fit, but will often show the general shape of the curve. For 8 points you might try to find a polynomial of degree 4 (y = ax⁴ + bx³ + cx² + dx + e) which minimises the error. As a rule of thumb, a polynomial of degree of about n/2 is often used. This is more art than science; usually you have some idea of what the underlying (correct) formula is, and this helps select the degree of the approximating polynomial.
Polynomial approximations can work reasonably well for interpolation (finding a value between two data points) but are hopeless for extrapolation. As we have no knowledge at all of what the "next" value in a series might be (it could be anything), no formula can successfully predict it.
I hope this is useful. Producing a polynomial which exactly generates a finite series is not hard ... it's simply solving n linear simultaneous equations with n variables (the coefficients of x^(n-1), x^(n-2), ... x², x, and the constant term). This is what we have done above and how Lagrange works. However, in physical systems it may not be particularly meaningful. User beware.
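For the lower-degree approximation mentioned above, one common choice (an assumption here, since the answer does not name a method) is ordinary least squares via the normal equations. The sketch below uses made-up helper names (`solve`, `least_squares_poly`, `sse`) and exact fractions, and compares the error for a few degrees on 8 points.

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(b)]
            for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

def least_squares_poly(points, degree):
    """Coefficients c0..c_degree minimising the sum of squared errors,
    found from the normal equations (V^T V) c = V^T y."""
    powers = range(degree + 1)
    normal = [[sum(Fraction(x) ** (i + j) for x, _ in points) for j in powers]
              for i in powers]
    rhs = [sum(Fraction(x) ** i * y for x, y in points) for i in powers]
    return solve(normal, rhs)

def sse(coeffs, points):
    """Sum of squared errors of the fitted polynomial."""
    return sum((y - sum(c * Fraction(x) ** k for k, c in enumerate(coeffs))) ** 2
               for x, y in points)

# Eight terms of 2, 4, 8, ... placed at x = 1..8.
points = list(enumerate([2, 4, 8, 16, 32, 64, 128, 256], start=1))
for degree in (1, 4, 7):
    coeffs = least_squares_poly(points, degree)
    print(degree, float(sse(coeffs, points)))
```

Degree 7 (n-1) reproduces the 8 points exactly, while the lower degrees trade a nonzero error for a smoother, less sensitive curve.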

Dynamic programming approximation

I am trying to calculate a function F(x,y) using dynamic programming. Functionally:
F(X,Y) = a1 F(X-1,Y)+ a2 F(X-2,Y) ... + ak F(X-k,Y) + b1 F(X,Y-1)+ b2 F(X,Y-2) ... + bk F(X,Y-k)
where k is a small number (k=10).
The problem is, X=1,000,000 and Y=1,000,000. So it is infeasible to calculate F(x,y) for every value between x=1..1000000 and y=1..1000000. Is there an approximate version of DP where I can avoid calculating F(x,y) for a large number of inputs and still get an accurate estimate of F(X,Y)?
A similar example is string matching algorithms (Levenshtein's distance) for two very long and similar strings (eg. similar DNA sequences). In such cases only the diagonal scores are important and the far-from-diagonal entries do not contribute to the final distance. How do we avoid calculating off-the-diagonal entries?
PS: Ignore the border cases (i.e. when x < k and y < k).
I'm not sure precisely how to adapt the following technique to your problem, but if you were working in just one dimension there is an O(k^3 log n) algorithm for computing the nth term of the series. This is called a linear recurrence and can be solved using matrix math, of all things. The idea is to suppose that you have a recurrence defined as
F(1) = x_1
F(2) = x_2
...
F(k) = x_k
F(n + k) = c_1 F(n) + c_2 F(n + 1) + ... + c_k F(n + k - 1)
For example, the Fibonacci sequence is defined as
F(0) = 0
F(1) = 1
F(n + 2) = 1 x F(n) + 1 x F(n + 1)
There is a way to view this computation as working on a matrix. Specifically, suppose that we have the vector x = (x_1, x_2, ..., x_k)^T. We want to find a matrix A such that
Ax = (x_2, x_3, ..., x_k, x_{k + 1})^T
That is, we begin with a vector of terms 1 ... k of the sequence, and then after multiplying by matrix A end up with a vector of terms 2 ... k + 1 of the sequence. If we then multiply that vector by A, we'd like to get
A(x_2, x_3, ..., x_k, x_{k + 1})^T = (x_3, x_4, ..., x_k, x_{k + 1}, x_{k + 2})^T
In short, given k consecutive terms of the series, multiplying that vector by A gives us the next term of the series.
The trick uses the fact that we can group the multiplications by A. For example, in the above case, we multiplied our original x by A to get x' (terms 2 ... k + 1), then multiplied x' by A to get x'' (terms 3 ... k + 2). However, we could have instead just multiplied x by A^2 to get x'' as well, rather than doing two different matrix multiplications. More generally, if we want to get term n of the sequence, we can compute A^n x, then inspect the appropriate element of the vector.
Here, we can use the fact that matrix multiplication is associative to compute A^n efficiently. Specifically, we can use the method of repeated squaring to compute A^n in a total of O(log n) matrix multiplications. If the matrix is k x k, then each multiplication takes time O(k^3) for a total of O(k^3 log n) work to compute the nth term.
So all that remains is actually finding this matrix A. Well, we know that we want to map from (x_1, x_2, ..., x_k) to (x_2, x_3, ..., x_k, x_{k + 1}), and we know that x_{k + 1} = c_1 x_1 + c_2 x_2 + ... + c_k x_k, so we get this matrix:
A =
| 0    1    0    0    ...  0   |
| 0    0    1    0    ...  0   |
| 0    0    0    1    ...  0   |
| ...                          |
| c_1  c_2  c_3  c_4  ...  c_k |
For more detail on this, see the Wikipedia entry on solving linear recurrences with linear algebra, or my own code that implements the above algorithm.
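A minimal sketch of the one-dimensional technique might look like this. The function names are made up for this example, terms are numbered from 1, and the recurrence is assumed to be indexed so that x_{m+k} = c_1 x_m + ... + c_k x_{m+k-1}, i.e. each term depends on the k terms before it.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(a, e):
    """Compute a**e by repeated squaring, using O(log e) multiplications."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    while e > 0:
        if e & 1:
            result = mat_mul(result, a)
        a = mat_mul(a, a)
        e >>= 1
    return result

def nth_term(initial, coeffs, n):
    """Term n (1-based) of the recurrence with first terms `initial`
    (x_1..x_k) and x_{m+k} = c_1 x_m + ... + c_k x_{m+k-1}."""
    k = len(initial)
    if n <= k:
        return initial[n - 1]
    # Companion matrix: shifted identity on top, coefficients in the last row.
    a = [[int(j == i + 1) for j in range(k)] for i in range(k - 1)]
    a.append(list(coeffs))
    shifted = mat_pow(a, n - k)  # A^(n-k) maps terms 1..k to terms n-k+1..n
    return sum(shifted[-1][j] * initial[j] for j in range(k))

# Fibonacci with x_1 = 0, x_2 = 1 and c = (1, 1)
print([nth_term([0, 1], [1, 1], n) for n in range(1, 11)])
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Only the last row of the matrix power is needed to read off term n, which is why the sketch takes a single dot product at the end.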
The only question now is how you adapt this to when you're working in multiple dimensions. It's certainly possible to do so by treating the computation of each row as its own linear recurrence, then to go one row at a time. More specifically, you can compute the nth term of the first k rows each in O(k^3 log n) time, for a total of O(k^4 log n) time to compute the first k rows. From that point forward, you can compute each successive row in terms of the previous row by reusing the old values. If there are n rows to compute, this gives an O(k^4 n log n) algorithm for computing the final value that you care about. If this is small compared to the work you'd be doing before (O(n^2 k^2), I believe), then this may be an improvement. Since you're saying that n is on the order of one million and k is about ten, this does seem like it should be much faster than the naive approach.
That said, I wouldn't be surprised if there was a much faster way of solving this problem by not proceeding row by row and instead using a similar matrix trick in multiple dimensions.
Hope this helps!
Without knowing more about your specific problem, the general approach is to use a top-down dynamic programming algorithm and memoize the intermediate results. That way you will only calculate the values that will be actually used (while saving the result to avoid repeated calculations).

Complexity class of computing a hyperplane

I am concerned with the following algorithm:
As input, it takes n points in n-dimensional space in rectangular coordinates. These n points define an (n-1)-dimensional hyperplane (we can ignore the infinitesimal probability that they don't). As output, I would like the equation of this hyperplane.
Is there a known algorithm - or at least a known complexity class - for this problem?
Thanks in advance.
The equation you're looking for is
A_1 x_1 + A_2 x_2 + ... + A_n x_n + C = 0
for some coefficients A_1, ..., A_n and C, with the x_i being the rectangular coordinates of a point on the plane. Substitute in the input points and you've got a set of n simultaneous equations which you can solve (up to a scale factor).
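One way to solve those simultaneous equations is Gaussian elimination, which takes O(n^3) arithmetic operations. The sketch below is only an illustration under that choice: the function name `hyperplane` is made up, and exact fractions are used so that the result can be checked exactly. It treats the n equations as a homogeneous system in the n + 1 unknowns (A_1, ..., A_n, C) and reads off one nonzero solution.

```python
from fractions import Fraction

def hyperplane(points):
    """Return [A_1, ..., A_n, C] with A_1 x_1 + ... + A_n x_n + C = 0 for
    every input point; the answer is unique only up to a scale factor."""
    n = len(points)
    # One homogeneous equation per point: [x_1, ..., x_n, 1] . (A, C) = 0
    rows = [[Fraction(v) for v in p] + [Fraction(1)] for p in points]
    pivots = []  # pivot column of each reduced row
    r = 0
    for col in range(n + 1):
        pivot = next((i for i in range(r, n) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot here: this column is a free unknown
        rows[r], rows[pivot] = rows[pivot], rows[r]
        rows[r] = [v / rows[r][col] for v in rows[r]]
        for i in range(n):
            if i != r:
                f = rows[i][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
    free = next(c for c in range(n + 1) if c not in pivots)
    # Set the free unknown to 1 and read the pivot unknowns off the rows.
    solution = [Fraction(0)] * (n + 1)
    solution[free] = Fraction(1)
    for row, col in zip(rows, pivots):
        solution[col] = -row[free]
    return solution

print(hyperplane([(0, 1), (1, 3)]))                   # line 2x - y + 1 = 0
print(hyperplane([(1, 0, 0), (0, 1, 0), (0, 0, 1)]))  # plane x + y + z = 1, scaled
```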

What is the most efficient algorithm to find a straight line that goes through most points?

The problem:
N points are given on a 2-dimensional plane. What is the maximum number of points on the same straight line?
The problem has an O(N^2) solution: go through each point and find the number of points which have the same dx / dy with relation to the current point. Store dx / dy relations in a hash map for efficiency.
Is there a better solution to this problem than O(N^2)?
There is likely no solution to this problem that is significantly better than O(n^2) in a standard model of computation.
The problem of finding three collinear points reduces to the problem of finding the line that goes through the most points, and finding three collinear points is 3SUM-hard, meaning that solving it in less than O(n^2) time would be a major theoretical result.
See the previous question on finding three collinear points.
For your reference (using the known proof), suppose we want to answer a 3SUM problem such as finding x, y, z in list X such that x + y + z = 0. If we had a fast algorithm for the collinear point problem, we could use that algorithm to solve the 3SUM problem as follows.
For each x in X, create the point (x, x^3) (for now we assume the elements of X are distinct). Next, check whether there exist three collinear points among the created points.
To see that this works, note that if x + y + z = 0 then the slope of the line from x to y is
(y^3 - x^3) / (y - x) = y^2 + yx + x^2
and the slope of the line from x to z is
(z^3 - x^3) / (z - x) = z^2 + zx + x^2 = (-(x + y))^2 - (x + y)x + x^2
= x^2 + 2xy + y^2 - x^2 - xy + x^2 = y^2 + yx + x^2
Conversely, if the slope from x to y equals the slope from x to z then
y^2 + yx + x^2 = z^2 + zx + x^2,
which implies that
(y - z) (x + y + z) = 0,
so either y = z or z = -x - y, which suffices to prove that the reduction is valid.
If there are duplicates in X, you first check whether x + 2y = 0 for any x and duplicate element y (in linear time using hashing or O(n lg n) time using sorting), and then remove the duplicates before reducing to the collinear point-finding problem.
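The reduction can be sanity-checked on small inputs. In the sketch below both checks are deliberately brute force (cubic time), so it only illustrates that the two answers agree for distinct integers; it is not a fast algorithm, and the function names are made up for this example.

```python
from itertools import combinations

def has_zero_triple(values):
    """Brute-force 3SUM on distinct values: is there x + y + z = 0?"""
    return any(x + y + z == 0 for x, y, z in combinations(values, 3))

def has_collinear_triple(points):
    """Brute-force check for three collinear points (integer cross product)."""
    return any((bx - ax) * (cy - ay) == (by - ay) * (cx - ax)
               for (ax, ay), (bx, by), (cx, cy) in combinations(points, 3))

def lift(values):
    """Map each x to the point (x, x^3) on the cubic curve."""
    return [(x, x ** 3) for x in values]

for values in ([-5, 1, 4, 10], [1, 2, 4, 9], [-7, -3, 2, 5, 8]):
    print(values, has_zero_triple(values), has_collinear_triple(lift(values)))
```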
If you limit the problem to lines passing through the origin, you can convert the points to polar coordinates (angle, distance from origin) and sort them by angle. All points with the same angle (modulo 180 degrees) lie on the same line through the origin. This takes O(n log n) time.
I don't think there is a faster solution in the general case.
The Hough Transform can give you an approximate solution. It is approximate because the binning technique has a limited resolution in parameter space, so the maximum bin will give you some limited range of possible lines.
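A bare-bones version of the voting step could look like the following. It is a sketch under assumptions: the line parametrisation rho = x cos(theta) + y sin(theta) is the usual one for the Hough transform, while the function name and the bin sizes (`n_theta`, `rho_step`) are arbitrary choices made for this example.

```python
import math
from collections import Counter

def hough_best_line(points, n_theta=180, rho_step=1.0):
    """Vote in (theta, rho) space, where rho = x cos(theta) + y sin(theta).
    Returns (theta, rho, votes) for the fullest bin; the answer is only as
    precise as the bin sizes."""
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho / rho_step))] += 1
    (t, rho_bin), count = votes.most_common(1)[0]
    return math.pi * t / n_theta, rho_bin * rho_step, count

# Ten points on the vertical line x = 3 plus two stray points.
pts = [(3, y) for y in range(10)] + [(0, 0), (7, 1)]
print(hough_best_line(pts))
```

The cost is O(n * n_theta) votes, and nearby bins can tie or absorb points that are only approximately on the line, which is the limited resolution mentioned above.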
Again an O(n^2) solution, with pseudo code. The idea is to create a hash table with the line itself as the key. A line is defined by the slope between the two points and the point where the line cuts the x-axis (or, for a line parallel to the x-axis, the point where it cuts the y-axis).
The solution assumes a language like Java or C#, where the equals and hashcode methods of the object are used by the hashing function.
Create an object (call it SlopeObject) with 3 fields:
Slope // can be Infinity
Point of intercept with the x-axis -- poix // will be (Infinity, some y value) or (x value, 0)
Count
poix will be a point, i.e. an (x, y) pair. If the line crosses the x-axis, poix will be (some number, 0). If the line is parallel to the x-axis, then poix = (Infinity, some number), where the y value is where the line crosses the y-axis.
Override the equals method so that 2 objects are equal if their Slope and poix are equal.
Override hashcode with a function which computes the hash code from the combination of the values of Slope and poix. Some pseudo code is below:
Hashmap map;
foreach (point a in the array) {
    foreach (every other point b) {
        slope = calculateSlope(a, b);
        poix = calculateXInterception(a, b);
        SlopeObject so = new SlopeObject(slope, poix, 1); // slope, poix and initial count 1
        SlopeObject inMapSlopeObj = map.get(so);
        if (inMapSlopeObj == null) {
            map.put(so, so); // first time this line is seen
        } else {
            // count tracks (a, b) pairs on this line, not distinct points
            inMapSlopeObj.setCount(inMapSlopeObj.getCount() + 1);
        }
    }
}
SlopeObject maxCounted = getObjectWithMaxCount(map);
print("line is through " + maxCounted.poix + " with slope " + maxCounted.slope);
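The same "line as hash key" idea can be sketched in Python. This is a variation on the pseudo code above, not a translation of it: it assumes integer coordinates and distinct points, and instead of slope plus intercept it uses a canonical integer triple (a, b, c) for the line a*x + b*y = c, which avoids infinite slopes and floating-point comparison. The names `line_key` and `max_points_on_line` are made up for this example.

```python
from collections import defaultdict
from itertools import combinations
from math import gcd

def line_key(p, q):
    """Canonical integer (a, b, c) with a*x + b*y = c for the line through
    two distinct integer points; avoids infinite slopes and rounding."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    c = a * x1 + b * y1
    g = gcd(gcd(abs(a), abs(b)), abs(c))
    a, b, c = a // g, b // g, c // g
    if a < 0 or (a == 0 and b < 0):  # fix the overall sign
        a, b, c = -a, -b, -c
    return a, b, c

def max_points_on_line(points):
    """O(n^2) pairs, each hashed under its line; assumes distinct points."""
    if len(points) < 3:
        return len(points)
    on_line = defaultdict(set)
    for p, q in combinations(points, 2):
        on_line[line_key(p, q)].update((p, q))
    return max(len(members) for members in on_line.values())

print(max_points_on_line([(0, 0), (1, 1), (2, 2), (3, 3), (0, 1), (5, 2)]))  # 4
```

Storing the set of points per line gives the point count directly, at the cost of O(n^2) memory in the worst case.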
Move to the dual plane using the point-line duality transform: the point p = (a, b) maps to the line p*: y = a*x + b.
Now, using a line sweep algorithm, find all intersection points in N log N time.
(If you have points which are one above the other, just rotate the points by some small angle.)
The intersection points in the dual plane correspond to lines in the primal plane, so the dual point where the most lines meet gives the line through the most points.
To whoever said that, because 3SUM reduces to this problem, the complexity is O(n^2): please note that the complexity of 3SUM is less than that.
Please check https://en.wikipedia.org/wiki/3SUM and also read
https://tmc.web.engr.illinois.edu/reduce3sum_sosa.pdf
As already mentioned, there probably isn't a way to solve the general case of this problem better than O(n^2). However, if you assume a large number of points lie on the same line (say the probability that a random point in the set lies on the line with the maximum number of points is p) and you don't need an exact algorithm, a randomized algorithm is more efficient.
maxPoints = 0
Repeat for k iterations:
1. Pick 2 random, distinct points uniformly at random
2. maxPoints = max(maxPoints, number of points that lies on the
line defined by the 2 points chosen in step 1)
Note that in the first step, if you picked 2 points which lie on the line with the maximum number of points, you'll get the optimal solution. Assuming n is very large (i.e. we can treat picking the 2 points as sampling with replacement), the probability of this happening is p^2. Therefore the probability of finding a suboptimal solution after k iterations is (1 - p^2)^k.
Suppose you can tolerate a false negative rate of err. Then this algorithm runs in O(nk) = O(n * log(err) / log(1 - p^2)). If both n and p are large enough, this is significantly more efficient than O(n^2). (E.g. suppose n = 1,000,000 and you know there are at least 10,000 points that lie on the same line, so p = 0.01. Then n^2 would require on the order of 10^12 operations, while the randomized algorithm needs about 10^5 iterations, i.e. on the order of 10^11 operations, to get an error rate of less than 5*10^-5.)
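Here is a small sketch of the randomized procedure, with the iteration count k derived from p and err as in the formula above. The function names are made up for this example, integer coordinates are assumed so that the collinearity test is exact, and the test data (60 collinear points plus 40 points on a parabola) is invented purely for illustration.

```python
import math
import random

def iterations_needed(p, err):
    """Smallest k with (1 - p^2)^k <= err."""
    return math.ceil(math.log(err) / math.log(1 - p * p))

def count_on_line(points, a, b):
    """Number of integer points collinear with distinct points a and b."""
    (ax, ay), (bx, by) = a, b
    return sum((bx - ax) * (y - ay) == (by - ay) * (x - ax) for x, y in points)

def randomized_max_line(points, p, err, rng):
    """Monte Carlo estimate: may under-report with probability about err."""
    best = 0
    for _ in range(iterations_needed(p, err)):
        a, b = rng.sample(points, 2)  # two distinct points
        best = max(best, count_on_line(points, a, b))
    return best

rng = random.Random(1)
line = [(i, 2 * i + 1) for i in range(60)]   # 60 collinear points
noise = [(i, i * i + 3) for i in range(40)]  # 40 points off that line
points = line + noise
print(randomized_max_line(points, p=0.5, err=1e-6, rng=rng))
```

Each iteration costs O(n) for the counting pass, so the total is O(nk), and a conservative (too small) guess for p only increases k.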
It is unlikely that an o(n^2) algorithm exists, since the problem (even just checking whether any 3 of the points in R^2 are collinear) is 3SUM-hard (http://en.wikipedia.org/wiki/3SUM)
This is not a solution better than O(n^2), but you can do the following:
1. For each point, translate it to the (0,0) coordinate, and apply the same translation to all the other points, moving them by the same x,y distance you needed to move the chosen point.
2. Convert each point of this translated set to its angle with respect to the new (0,0).
3. Store the maximum number (MSN) of points that share the same angle.
4. Choose the largest stored number (MSN) over all points, and that will be the solution.

Resources