I have an unknown function f(x), and I am using MATLAB to compute 2000 points on its graph. I need a piecewise linear function g with 20 to 30 segments that fits the original function as well as possible. How could I do this in an acceptable way? The space of possible solutions is too large to traverse exhaustively, and I can't think of a good heuristic to shrink it effectively.
Here is the code from which the function is derived:
x = sym('x', 'real');
inventory = sym('inventory', 'real');
demand = sym('demand', 'real');
f1 = 1/(sqrt(2*pi))*(-x)*exp(-(x - (demand - inventory)).^2./2);
f2 = 20/(sqrt(2*pi))*(x)*exp(-(x - (demand - inventory)).^2./2);
expectation_expression = int(f1, x, -inf, 0) + int(f2, x, 0, inf);
Depending on what your idea of a good approximation is, there may be a dynamic programming solution for this.
For example, given 2000 points and corresponding values, we wish to find the piecewise linear approximation with 20 segments which minimizes the sum of squared deviations between the true value at each point and the result of the linear approximation.
Work along the 2000 points from left to right, and at each point calculate, for i = 1 to 20, the total error from the far left up to that point for the best piecewise linear approximation using i segments.
You can work out the values at position n+1 using the values already calculated for points 1..n to its left. For each value of i, consider every point j < n+1 to its left. Work out the error contribution of a linear segment running from point j to point n+1, and add to that the best possible error using i-1 segments at point j (or possibly point j-1, depending on exactly how you define your piecewise linear approximation). Taking the minimum of this quantity over all possible j gives the error of the best possible piecewise linear approximation using i segments for the first n+1 points.
When you have worked out the best value for the first 2000 points using 20 segments you have solved the problem, and you can work back along this table to find out where the segments are - or, if this is inconvenient, you can save extra information as you go along to make this easier.
I believe similar approaches will minimize the sum of absolute deviations, or minimize the maximum deviation at any point, provided you can solve the corresponding problem for a single line. I have implicitly assumed you can fit a straight line to minimize the sum of squared errors, which is of course the standard least-squares line fit. Minimizing the absolute deviations from a straight line is an exercise in convex optimization, which I would attempt by iteratively reweighted least squares. Minimizing the maximum absolute deviation is linear programming.
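If it helps, here is a rough Python sketch of this dynamic program (my own illustration, not from the question, which is in MATLAB; all names are made up). Each candidate segment is fitted by ordinary least squares, with prefix sums making every segment's error O(1) to evaluate, so the whole thing is O(segments x n^2). That is slow in pure Python for 2000 points, but straightforward to vectorize or port.

import numpy as np

def segment_sse(prefix, j, n):
    # SSE of the least-squares line through points j..n (inclusive)
    cnt, sx, sy, sxx, sxy, syy = prefix[n + 1] - prefix[j]
    vxx = sxx - sx * sx / cnt
    vxy = sxy - sx * sy / cnt
    vyy = syy - sy * sy / cnt
    if vxx <= 1e-12:                    # single point or repeated x-values
        return vyy
    return max(vyy - vxy * vxy / vxx, 0.0)

def best_piecewise_fit(x, y, n_segments):
    # err[i, n] = best total SSE for points 0..n using i independent segments
    m = len(x)
    cols = np.column_stack([np.ones(m), x, y, x * x, x * y, y * y])
    prefix = np.vstack([np.zeros(6), np.cumsum(cols, axis=0)])
    err = np.full((n_segments + 1, m), np.inf)
    cut = np.zeros((n_segments + 1, m), dtype=int)   # start index of the last segment
    for n in range(m):
        err[1, n] = segment_sse(prefix, 0, n)
    for i in range(2, n_segments + 1):
        for n in range(m):
            for j in range(1, n + 1):                # last segment covers points j..n
                cand = err[i - 1, j - 1] + segment_sse(prefix, j, n)
                if cand < err[i, n]:
                    err[i, n], cut[i, n] = cand, j
    # walk back along the table to recover where the segments start
    breaks, n, i = [], m - 1, n_segments
    while i > 1:
        breaks.append(cut[i, n])
        n, i = cut[i, n] - 1, i - 1
    return err[n_segments, m - 1], sorted(breaks)

As written this fits each segment independently, so the approximation need not be continuous at the breakpoints; that matches the looser of the two definitions mentioned above.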
Related
I have an array A of n points and a candidate array S of size O(k) > k. I want to find the k points in S that minimize the sum of squared distances from the points of A to their closest point among the chosen k. One way to do it would be to check the cost of every possible choice of k points from S and take the minimum, but that would take O(k^k * n) time. Is there a more efficient way to do it?
I need either an optimal solution or a constant approximation.
The reason I need this is that I'm trying to find a constant approximation for k-means as fast as possible and later use it for a coreset construction (coreset = data minimization while still keeping the cost of any query approximately the same). I was able to show that if we assume that each cluster in the optimal clustering has Omega(n/k) points, we can quite quickly create a list of O(k) candidates that contains a 3-approximation for the k-means, so I was wondering whether we can find those k points, or a constant approximation of their cost, faster than by exhaustive search.
Example for k=2
In this example S is the green dots and A is the red dots. The algorithm should return the 2 circled points from S since they minimize the sum of squared distances from the points of A to their closest point of the 2.
I have an array of size n of points called A, and a candidate array of size O(k)>k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point from the k points would be minimized.
It sounds like this could be solved simply by checking the N points against the K points to find, for each, the k points with the smallest squared distance.
Therefore, I'm now fairly sure this is actually finding the k nearest neighbours (k-NN as a computational-geometry problem, not the pattern-recognition definition) among the N points for each of the K points, and not actually k-means.
For higher dimensionality, it is often useful to also consider the dimensionality, D in the algorithm.
The algorithm mentioned is then indeed O(NDk^2) when treated as k-NN. That can be improved to O(NDk) by using the Quickselect algorithm on the distances, which allows checking the list of N points against each of the K points in O(N) to find its nearest k points.
https://en.wikipedia.org/wiki/Quickselect
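For concreteness, here is a small NumPy sketch of that selection step (my own, not from the original post; numpy's argpartition uses an introselect/quickselect-style partial sort, so after the O(N*D) distance computation the selection itself is O(N)).

import numpy as np

def k_nearest(points, query, k):
    # indices of the k rows of `points` closest to `query`
    d2 = np.sum((points - query) ** 2, axis=1)    # squared distances, O(N*D)
    nearest = np.argpartition(d2, k)[:k]          # quickselect-style, O(N), unordered
    return nearest[np.argsort(d2[nearest])]       # optionally order the k results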
Edit:
Seems there is some confusion about quickselect and whether it can be used. Here is an O(DkN log N) solution that uses a standard sort, O(N log N), instead of quickselect, O(N). It might even be faster in practice, and as you can see it is easy to implement in most languages.
def k_nearest_per_point(F, S, k):
    results = {}
    for y in F:
        def distanceSquared(x):
            return distance(x, y)   # custom (squared) distance for each y
        # first k points of S sorted by distanceSquared
        results[y] = sorted(S, key=distanceSquared)[:k]
    return results
Update for new visual
# Build up distance sums O(A*N*D)
def k_best_by_distance_sum(F, S, k):
    results = {}
    for y in F:
        def distanceSquared(x):
            return distance(x, y)   # custom (squared) distance for each y
        # sum of squared distances from y to all points in S
        results[y] = sum(map(distanceSquared, S))
    # first k entries sorted by their distance sum O(D*AlogA)
    return sorted(results, key=results.get)[:k]
You could approximate by only considering Z random points chosen from the S points. Alternatively, you could merge points in S that are close enough together. This can reduce S to a much smaller size; as long as S stays around F^2 or larger in size, it shouldn't affect which points in F are chosen too much. You would also need to adjust the weights of the merged points to handle that, i.e. the squared distance to a point that represents 10 original points is multiplied by 10, so it counts as 10 points instead of just 1.
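To make the objective concrete, here is a small Python sketch (mine, with made-up names; A and S are assumed to be NumPy arrays of shape (n, d) and (|S|, d)) that evaluates the sum of squared distances from A to the nearest chosen point, and applies the subsampling idea above by scoring every k-subset of a random sample Z drawn from S. This is only feasible when |Z| and k are small, but it shows the quantity being minimized.

import itertools
import numpy as np

def cost(A, chosen):
    # sum over A of the squared distance to the nearest chosen point
    d2 = ((A[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2)   # |A| x |chosen|
    return d2.min(axis=1).sum()

def approx_best_k(A, S, k, z, seed=0):
    # score every k-subset of a random sample Z of S instead of all of S
    rng = np.random.default_rng(seed)
    Z = S[rng.choice(len(S), size=min(z, len(S)), replace=False)]
    best = min(itertools.combinations(range(len(Z)), k),
               key=lambda idx: cost(A, Z[list(idx)]))
    return Z[list(best)]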
Suppose we have n points in a bounded region of the plane. The problem is to divide it into 4 regions (with a horizontal and a vertical line) such that the sum of a metric over each region is minimized.
The metric can be, for example, the sum of the distances between the points in each region, or any other measure of the spread of the points. See the figure below.
I don't know whether any clustering algorithm might help me tackle this problem, or whether, for instance, it can be formulated as a simple optimization problem in which the decision variables are the "axes".
I believe this can be formulated as a MIP (Mixed Integer Programming) problem.
Let's introduce 4 quadrants A, B, C, D: A is the upper-right quadrant, B the lower-right, and so on. Then define a binary variable
delta(i,k) = 1 if point i is in quadrant k
0 otherwise
and continuous variables
Lx, Ly : coordinates of the lines
Obviously we have:
sum(k, delta(i,k)) = 1
xlo <= Lx <= xup
ylo <= Ly <= yup
where xlo, xup are the minimum and maximum x-coordinates and ylo, yup the corresponding y-coordinates. Next we need to implement implications like:
delta(i,'A') = 1 ==> x(i)>=Lx and y(i)>=Ly
delta(i,'B') = 1 ==> x(i)>=Lx and y(i)<=Ly
delta(i,'C') = 1 ==> x(i)<=Lx and y(i)<=Ly
delta(i,'D') = 1 ==> x(i)<=Lx and y(i)>=Ly
These can be handled by so-called indicator constraints or written as linear inequalities, e.g.
x(i) <= Lx + (delta(i,'A')+delta(i,'B'))*(xup-xlo)
Similarly for the others. Finally, the objective is
min sum((i,j,k), delta(i,k)*delta(j,k)*d(i,j))
where d(i,j) is the distance between points i and j. This objective can be linearized as well.
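As a rough illustration only (not the actual model and solver setup I used; below I quote Cplex timings), here is a sketch of the formulation in Python with the PuLP modeling library, on toy data. The big-M inequalities mirror the one above, and z(i,j,k) is the standard linearization of the product delta(i,k)*delta(j,k).

import itertools
import pulp

pts = [(0.2, 0.7), (0.9, 0.1), (0.4, 0.4), (0.8, 0.9), (0.1, 0.2)]   # toy data
n = len(pts)
quads = ["A", "B", "C", "D"]                     # A = right/upper, B = right/lower, ...
xlo = min(p[0] for p in pts); xup = max(p[0] for p in pts)
ylo = min(p[1] for p in pts); yup = max(p[1] for p in pts)
Mx, My = xup - xlo, yup - ylo                    # big-M constants

d = {(i, j): (pts[i][0] - pts[j][0]) ** 2 + (pts[i][1] - pts[j][1]) ** 2
     for i, j in itertools.combinations(range(n), 2)}

prob = pulp.LpProblem("four_quadrants", pulp.LpMinimize)
delta = pulp.LpVariable.dicts("delta", (range(n), quads), cat="Binary")
Lx = pulp.LpVariable("Lx", lowBound=xlo, upBound=xup)
Ly = pulp.LpVariable("Ly", lowBound=ylo, upBound=yup)
z = {(i, j, k): pulp.LpVariable(f"z_{i}_{j}_{k}", cat="Binary")
     for (i, j) in d for k in quads}             # linearizes delta(i,k)*delta(j,k)

for i, (xi, yi) in enumerate(pts):
    prob += pulp.lpSum(delta[i][k] for k in quads) == 1
    # big-M forms of the implications, e.g. delta(i,'A')=1 ==> x(i) >= Lx
    prob += xi <= Lx + (delta[i]["A"] + delta[i]["B"]) * Mx
    prob += xi >= Lx - (delta[i]["C"] + delta[i]["D"]) * Mx
    prob += yi <= Ly + (delta[i]["A"] + delta[i]["D"]) * My
    prob += yi >= Ly - (delta[i]["B"] + delta[i]["C"]) * My

for (i, j) in d:
    for k in quads:
        prob += z[i, j, k] >= delta[i][k] + delta[j][k] - 1

prob += pulp.lpSum(d[i, j] * z[i, j, k] for (i, j) in d for k in quads)
prob.solve()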
After applying a few tricks, I could prove global optimality for 100 random points in about 40 seconds using Cplex. This approach is not really suited for large datasets (the computation time quickly increases when the number of points becomes large).
I suspect this cannot be shoe-horned into a convex problem. Also, I am not sure this objective is really what you want: it will try to make all clusters about the same size (adding a point to a large cluster introduces lots of distances to add to the objective; adding a point to a small cluster is cheap). Maybe an average distance per cluster is a better measure (but that makes the linearization more difficult).
Note - probably incorrect. I will try and add another answer
The one dimensional version of minimising sums of squares of differences is convex. If you start with the line at the far left and move it to the right, each point crossed by the line stops accumulating differences with the points to its right and starts accumulating differences with the points to its left. As the line moves, the differences to the left increase and the differences to the right decrease, so you get a monotonic decrease, possibly a single point that can be on either side of the line, and then a monotonic increase.
I believe that the one dimensional problem of clustering points on a line is convex, but I no longer believe that the problem of drawing a single vertical line in the best position is convex. I worry about sets of points that vary in y co-ordinate so that the left hand points are mostly high up, the right hand points are mostly low down, and the intermediate points alternate between high up and low down. If this is not convex, the part of the answer that tries to extend to two dimensions fails.
So for the one dimensional version of the problem you can pick any point and work out in time O(n) whether that point should be to the left or right of the best dividing line. So by binary chop you can find the best line in time O(n log n).
I don't know whether the two dimensional version is convex or not but you can try all possible positions for the horizontal line and, for each position, solve for the position of the vertical line using a similar approach as for the one dimensional problem (now you have the sum of two convex functions to worry about, but this is still convex, so that's OK). Therefore you solve at most O(n) one-dimensional problems, giving cost O(n^2 log n).
If the points aren't very strangely distributed, I would expect that you could save a lot of time by using the solution of the one dimensional problem at the previous iteration as a first estimate of the position of solution for the next iteration. Given a starting point x, you find out if this is to the left or right of the solution. If it is to the left of the solution, go 1, 2, 4, 8... steps away to find a point to the right of the solution and then run binary chop. Hopefully this two-stage chop is faster than starting a binary chop of the whole array from scratch.
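A small Python sketch of that two-stage chop (my own; it assumes you already have a predicate, here called is_left_of_best(p), that answers in O(n) whether position p lies to the left of the best dividing line):

def find_best_split(start, is_left_of_best, upper):
    # gallop out from the warm start, then binary chop on the bracketed range
    lo = hi = start
    step = 1
    if is_left_of_best(start):
        while hi < upper and is_left_of_best(hi):
            lo, hi = hi, min(hi + step, upper)
            step *= 2                       # 1, 2, 4, 8, ... steps away
    else:
        while lo > 0 and not is_left_of_best(lo):
            hi, lo = lo, max(lo - step, 0)
            step *= 2
    while lo + 1 < hi:                      # invariant: lo is left of the best line, hi is not
        mid = (lo + hi) // 2
        if is_left_of_best(mid):
            lo = mid
        else:
            hi = mid
    return hi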
Here's another attempt. Lay out a grid so that, except in the case of ties, each point is the only point in its column and the only point in its row. Assuming no ties in any direction, this grid has N rows, N columns, and N^2 cells. If there are ties the grid is smaller, which makes life easier.
Separating the cells with a horizontal and vertical line is pretty much picking out a cell of the grid and saying that cell is the cell just above and just to the right of where the lines cross, so there are roughly O(N^2) possible such divisions, and we can calculate the metric for each such division. I claim that when the metric is the sum of the squares of distances between points in a cluster the cost of this is pretty much a constant factor in an O(N^2) problem, so the whole cost of checking every possibility is O(N^2).
The metric within a rectangle formed by the dividing lines is SUM_i,j[ (X_i - X_j)^2 + (Y_i-Y_j)^2]. We can calculate the X contributions and the Y contributions separately. If you do some algebra (which is easier if you first subtract a constant so that everything sums to zero) you will find that the metric contribution from a co-ordinate is linear in the variance of that co-ordinate. So we want to calculate the variances of the X and Y co-ordinates within the rectangles formed by each division. https://en.wikipedia.org/wiki/Algebraic_formula_for_the_variance gives us an identity which tells us that we can work out the variance given SUM_i Xi and SUM_i Xi^2 for each rectangle (and the corresponding information for the y co-ordinate). This calculation can be inaccurate due to floating point rounding error, but I am going to ignore that here.
Given a value associated with each cell of a grid, we want to make it easy to work out the sum of those values within rectangles. We can create partial sums along each row, transforming 0 1 2 3 4 5 into 0 1 3 6 10 15, so that each cell in a row contains the sum of all the cells to its left and itself. If we take these values and do partial sums up each column, we have just worked out, for each cell, the sum of the rectangle whose top right corner lies in that cell and which extends to the bottom and left sides of the grid.
The calculated values in the far right column give us the sum for all the cells on the same level as that cell and below it. If we subtract off the rectangles we know how to calculate, we can find the value of a rectangle which lies at the right hand side of the grid and the bottom of the grid. Similar subtractions allow us to work out first the value of the rectangles to the left and right of any vertical line we choose, and then to complete our set of four rectangles formed by two lines crossing at any cell in the grid.
The expensive part of this is working out the partial sums, but we only have to do that once, and it costs only O(N^2). The subtractions and lookups used to work out any particular metric have only a constant cost. We have to do one for each of O(N^2) cells, but that is still only O(N^2).
(So we can find the best clustering in O(N^2) time by working out the metrics associated with all possible clusterings in O(N^2) time and choosing the best).
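Here is a short Python/NumPy sketch of that bookkeeping (my own reconstruction; names are illustrative). It builds 2-D prefix sums of the per-cell count, x, x^2, y and y^2, so the sum inside any rectangle takes four lookups, and every possible crossing of the two lines is scored in O(N^2) total using n*sum(v^2) - (sum v)^2, which equals the sum over pairs of squared coordinate differences.

import numpy as np

def prefix_2d(cells):
    # P[r, c] = sum of cells[:r, :c]; partial sums along rows, then up columns
    P = np.zeros((cells.shape[0] + 1, cells.shape[1] + 1))
    P[1:, 1:] = cells.cumsum(axis=0).cumsum(axis=1)
    return P

def rect_sum(P, r0, r1, c0, c1):
    # sum of cells in rows r0..r1-1 and columns c0..c1-1, O(1) per query
    return P[r1, c1] - P[r0, c1] - P[r1, c0] + P[r0, c0]

def best_two_line_split(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    col = np.empty(n, dtype=int); col[np.argsort(x)] = np.arange(n)   # one point per column
    row = np.empty(n, dtype=int); row[np.argsort(y)] = np.arange(n)   # one point per row
    grid = np.zeros((n, n, 5))                    # per cell: count, x, x^2, y, y^2
    grid[row, col] = np.column_stack([np.ones(n), x, x**2, y, y**2])
    P = [prefix_2d(grid[:, :, f]) for f in range(5)]
    best = (np.inf, None)
    for r in range(n + 1):                        # horizontal line position
        for c in range(n + 1):                    # vertical line position
            total = 0.0
            for r0, r1, c0, c1 in [(0, r, 0, c), (0, r, c, n), (r, n, 0, c), (r, n, c, n)]:
                cnt = rect_sum(P[0], r0, r1, c0, c1)
                for s, sq in [(1, 2), (3, 4)]:    # x then y contribution
                    total += cnt * rect_sum(P[sq], r0, r1, c0, c1) - rect_sum(P[s], r0, r1, c0, c1) ** 2
            if total < best[0]:
                best = (total, (r, c))
    return best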
I have identified nearest neighbours amongst a population. I wish to assign a vector of weights to the population such that the difference in weights between nearest neighbours is minimised via an optimisation. I have built:
fun = @(x) sum(nthroot(x - logicalmatrix*x), 2)
A=ones(1,height(Population));
b=1;
Aeq=A;
beq=1; % Solution should sum to 1
lb=zeros(height(Population),1); % Lower Bounds
ub=ones(height(Population),1); % UpperBounds
[opt_combinedfun,~,residualCombfun]=fmincon(fun,lb,A,b,Aeq,beq,lb,ub,[],options);
However, although sometimes it returns a solution within bounds it does not appear optimal. The 'logicalmatrix' is a n x n logical matrix identifying the nearest neighbours. The problem is that logicalmatrix is singular which causes the optimisation to return:
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate.
Is fmincon the wrong function to use? Is there a way to get around the singularity, or a more robust way to achieve this optimisation?
Given Multiple (N) lines in 3d space, find the point minimizing the distance to all lines.
Given that the shortest distance between a line [aX + b] and a point [P] lies along the perpendicular from [P] to the line, I can express the minimal squared distance as the sum of the squared line-to-point distances, e.g. d([aX+b]_1, [P])^2 + ... + d([aX+b]_n, [P])^2.
Since that segment is perpendicular to the line, I can use the dot product to express [P] in the line's terms.
I have considered using least squares to estimate the point minimizing the distance. The problem is that standard least squares fits the best line/curve to a given set of points, and what I need is the opposite: given a set of lines, estimate the best fitting point.
How should this be approached?
From Wikipedia, we read that the squared distance between the line a'x + b = 0 and a point p is (a'p + b)^2 / (a'a). We can therefore see that finding the point which minimizes the sum of squared distances is a weighted linear regression problem with one observation per line. The regression model has the following properties:
Sample data a for each line ax+b=0
Sample outcome -b for each line ax+b=0
Sample weight 1/(a'a) for each line ax+b=0
You should be able to solve this problem with any standard statistical software.
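As a sketch of what that looks like in practice (mine, not part of the original answer; names are made up), each row a_i of A is one observation with outcome -b_i and weight 1/(a_i'a_i), and the weighted normal equations give the point directly:

import numpy as np

def closest_point_weighted_ls(A, b):
    # rows of A and entries of b give the equations a_i'x + b_i = 0;
    # minimize sum_i (a_i'x + b_i)^2 / (a_i'a_i) by weighted least squares
    w = 1.0 / np.sum(A * A, axis=1)          # sample weights 1/(a'a)
    Aw = A * w[:, None]
    # weighted normal equations: (A' W A) x = A' W (-b)
    return np.linalg.solve(A.T @ Aw, Aw.T @ (-b))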
An approach:
form the equations giving the distance from the point to each line
these equations give you N distances
optimize the set of distances by the criterion you want (least squares, minimax, etc.)
This reduces into a simple optimization question once you have the N equations. Of course, the difficulty of the last step depends heavily on the criterion you choose (least squares is simple, minimax not that simple.)
One thing that might help you forward is to find the simplest form of equation giving the distance from a point to a line. Your thinking in your #1 is correct, but you will need to think a bit more (or else check "distance from a point to a line" with any search engine).
I have solved the same problem using hill climbing. Consider a single point and its 26 neighbours one step away from it (the points on a cube centered at the current point). If the current point's distance is better than the distance of every neighbour, divide step by 2; otherwise make the neighbour with the best distance the new current point. Continue until step is small enough.
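A rough Python sketch of that hill climb (my own reconstruction; each line is given here as a point a and a unit direction d, and the 26 neighbours are the offsets on the surrounding cube):

import itertools
import numpy as np

def sum_sq_dist(p, lines):
    # sum of squared distances from p to each line (a, d), d a unit vector
    total = 0.0
    for a, d in lines:
        v = p - a
        total += v @ v - (v @ d) ** 2        # |v|^2 minus the projection onto d
    return total

def hill_climb(lines, start, step=1.0, tol=1e-9):
    p = np.asarray(start, float)
    best = sum_sq_dist(p, lines)
    offsets = [np.array(o, float) for o in itertools.product((-1, 0, 1), repeat=3)
               if any(o)]                    # the 26 cube neighbours
    while step > tol:
        cand = [(sum_sq_dist(p + step * o, lines), p + step * o) for o in offsets]
        c_best, c_p = min(cand, key=lambda t: t[0])
        if c_best < best:
            best, p = c_best, c_p            # move to the best neighbour
        else:
            step /= 2                        # no improvement: halve the step
    return p, best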
Following is a solution using calculus:
F(x,y) = sum_i (y - m_i*x - c_i)^2 / (1 + m_i^2)
Using Partial differentiation :-
dF(x,y)/dx = sum_i -2*m_i*(y - m_i*x - c_i) / (1 + m_i^2)
dF(x,y)/dy = sum_i 2*(y - m_i*x - c_i) / (1 + m_i^2)
To Minimize F(x,y) :-
dF(x,y)/dy = dF(x,y)/dx = 0
Use gradient descent with a suitable learning rate, and random restarts, to find the minimum as well as possible.
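A minimal sketch of that descent in Python (my own; it assumes the 2-D case above, with lines given as slopes m_i and intercepts c_i, and the learning rate will need tuning):

import numpy as np

def gradient_descent(m, c, lr=0.01, iters=10000):
    # minimize F(x, y) = sum_i (y - m_i*x - c_i)^2 / (1 + m_i^2)
    m, c = np.asarray(m, float), np.asarray(c, float)
    w = 1.0 / (1.0 + m * m)
    x = y = 0.0                              # arbitrary starting point
    for _ in range(iters):
        r = y - m * x - c                    # signed residual for each line
        dx = np.sum(-2.0 * m * r * w)        # dF/dx from above
        dy = np.sum(2.0 * r * w)             # dF/dy from above
        x, y = x - lr * dx, y - lr * dy
    return x, y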
You can apply the following answer (which talks about finding the point that is closest to a set of planes) to this problem, since just as a plane can be defined by a point on the plane and a normal to the plane, a line can be defined by a point the line passes through and a "normal" vector orthogonal to the line:
https://math.stackexchange.com/a/3483313/365886
You can solve the resulting quadratic form by observing that the minimum of 1/2 x^T A x - b^T x + c is attained at x_min = A^{-1} b.
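For the line case, a short NumPy sketch of that quadratic form (my own; each line is given by a point a_i it passes through and a unit direction d_i, so A = sum_i (I - d_i d_i^T) and b = sum_i (I - d_i d_i^T) a_i):

import numpy as np

def closest_point_to_lines(points, dirs):
    # distance^2 to line i is (p - a_i)^T (I - d_i d_i^T) (p - a_i); summing over
    # all lines gives a quadratic form with the A and b assembled below
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for a, d in zip(points, dirs):
        M = np.eye(3) - np.outer(d, d)       # projector orthogonal to the line
        A += M
        b += M @ a
    return np.linalg.solve(A, b)             # fails only if all lines are parallel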
The parametric boundary of an object can be extracted in MATLAB using the bwtraceboundary function. It returns a Q-by-2 matrix B, where Q is the number of boundary pixels of the object and the first and second columns store the row and column coordinates of the boundary pixels, respectively.
What I want to do is to sample this boundary of Q elements with N points that divide the original boundary into segments of equal arc length.
A straightforward solution that I thought of consists in computing the length L of the boundary by summing the distances between all pairs of consecutive boundary pixels. Those distances are either 1 or sqrt(2). Then I divide L by N to find the desired arc length. Finally, I iterate over the boundary again, summing the distances between consecutive boundary pixels; whenever the running sum reaches or exceeds the desired arc length, the current boundary pixel is chosen as one of the N that will compose the sampled boundary.
Is that a good solution? Is there a more efficient/simple solution?
Over the years, I have seen this question asked a seemingly vast number of times, so I wrote a little tool that will do exactly that: sample a piecewise linear or even a curvilinear (spline) arc in a general number of dimensions so that the successive points lie at a uniform or specified distance along that arc.
In the case of merely piecewise linear arcs, this is rather easy. You sum up the total arc length of the curve, then interpolate in arc length; since the curve is known to be piecewise linear, that only requires linear interpolation of position as a function of cumulative arc length.
In the case of a curved arc, it is most easily done as the solution of a system of ordinary differential equations, watching for events along the way. ODE45 does this nicely.
You can use interparc, as found on the MATLAB Central File Exchange, to do this for you, or if you wish to learn to do it yourself for the simple piecewise linear case, read through the first part of the code where I do the piecewise linear arc-length interpolation. A nice thing is that the linear case is done in fully vectorized form, so no explicit loops are necessary.
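If you would rather roll your own for the piecewise linear case, here is a short Python/NumPy sketch of the idea described above (cumulative chord length followed by linear interpolation at N equally spaced arc lengths); a MATLAB equivalent follows the same pattern with cumsum and interp1:

import numpy as np

def resample_equal_arclength(boundary, N):
    # boundary: (Q, 2) array of row/column coordinates along the traced boundary;
    # returns N points spaced at equal arc length along the piecewise linear curve
    seg = np.diff(boundary, axis=0)
    cum = np.concatenate([[0.0], np.cumsum(np.hypot(seg[:, 0], seg[:, 1]))])
    targets = np.linspace(0.0, cum[-1], N)            # equally spaced arc lengths
    rows = np.interp(targets, cum, boundary[:, 0])    # interpolate each coordinate
    cols = np.interp(targets, cum, boundary[:, 1])    #   as a function of arc length
    return np.column_stack([rows, cols])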