Optimisation to minimise distance between nearest neighbours in MATLAB - algorithm

I have identified nearest neighbours amongst a population. I wish to assign a vector of weights to the population such that the difference in weights between nearest neighbours is minimised via an optimisation. I have built:
fun = @(x) sum(nthroot(x - logicalmatrix*x, 2)); % anonymous functions use @; nthroot needs the root degree as a second argument (an even root errors for negative differences)
A = ones(1, height(Population));
b = 1;
Aeq = A;
beq = 1; % Solution should sum to 1
lb = zeros(height(Population), 1); % Lower bounds
ub = ones(height(Population), 1); % Upper bounds
options = optimoptions('fmincon'); % 'options' must be defined before the call
[opt_combinedfun, ~, residualCombfun] = fmincon(fun, lb, A, b, Aeq, beq, lb, ub, [], options); % lb doubles as the starting point x0; note the third output is fmincon's exit flag
However, although it sometimes returns a solution within bounds, it does not appear optimal. The 'logicalmatrix' is an n x n logical matrix identifying the nearest neighbours. The problem is that logicalmatrix is singular, which causes the optimisation to return:
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate.
Is fmincon the wrong function to use? Or is there a way to get around the singularity? Is there a more robust way to achieve this optimisation?

Related

find best k-means from a list of candidates

I have an array of n points called A, and a candidate array S of size O(k) > k. I want to find k points in S such that the sum of squared distances from the points of A to their closest point among the k is minimized. One way to do it would be to check the cost of every possible choice of k points in S and take the minimum, but that would take O(k^k * n) time. Is there a more efficient way to do it?
I need either an optimal solution or a constant approximation.
The reason I need this is that I'm trying to find a constant approximation for k-means as fast as possible and later use it for a coreset construction (a coreset is a data reduction that keeps the cost of any query approximately the same). I was able to show that, assuming each cluster in the optimal clustering has Ω(n/k) points, we can quickly create a list of O(k) candidates that contains a 3-approximation for the k-means, so I was wondering whether we can find those k points, or a constant approximation of their cost, in time faster than exhaustive search.
Example for k=2
In this example S is the green dots and A is the red dots. The algorithm should return the 2 circled points from S since they minimize the sum of squared distances from the points of A to their closest point of the 2.
I have an array of size n of points called A, and a candidate array of size O(k)>k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point from the k points would be minimized.
It sounds like this could be solved simply by checking the N points against the K points to find the k points in N with the smallest squared distance.
Therefore, I'm now fairly sure this is actually finding the k nearest neighbors (K-NN in the computational-geometry sense, not the pattern-recognition one) in the N points for each point in the K points, and not actually k-means.
For higher-dimensional data, it is often useful to also account for the dimensionality D in the algorithm's cost.
The algorithm mentioned is indeed O(NDk^2) when viewed as K-NN. That can be improved to O(NDk) by using the Quickselect algorithm on the distances, which allows the list of N points to be checked against each of the K points in O(N) to find the nearest k points.
https://en.wikipedia.org/wiki/Quickselect
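For illustration, a minimal sketch of that selection step using NumPy's argpartition (introselect-based, so average O(N), in the spirit of Quickselect); the names points, query and k are assumptions for the example, not taken from the answer:

import numpy as np

def k_nearest(points, query, k):
    # points: (N, D) array, query: (D,) array; assumes k < len(points)
    d2 = np.sum((points - query) ** 2, axis=1)   # squared distances, O(N*D)
    idx = np.argpartition(d2, k)[:k]             # partial selection, average O(N)
    return idx[np.argsort(d2[idx])]              # optionally order the k results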
Edit:
There seems to be some confusion about Quickselect and whether it can be used. Here is an O(DkN log N) solution that uses a standard sort, O(N log N), instead of Quickselect, O(N). It might even be faster in practice, and as you can see it is pretty easy to implement in most languages.
def nearest_k(F, S, k):
    results = {}
    for y in F:
        def distanceSquared(x):
            return distance(x, y)          # custom distance for each y
        # First k points of S sorted by distanceSquared
        results[y] = sorted(S, key=distanceSquared)[:k]
    return results
Update for the new visual
# Build up distance sums: O(|F| * |S| * D)
results = {}
for y in F:
    def distanceSquared(x):
        return distance(x, y)              # custom distance for each y
    # Sum of squared distances from y to all points in S
    results[y] = sum(map(distanceSquared, S))

def results_key_value(key):
    return results[key]

# First k candidates in F sorted by their distance sum: O(|F| log |F|)
best_k = sorted(results.keys(), key=results_key_value)[:k]
You could approximate by only considering Z random points chosen from S. Alternatively, you could merge points in S if they are close enough together. This could reduce S to a much smaller size; as long as S remains roughly F^2 or larger in size, it shouldn't affect which points in F are chosen too much. You would also need to adjust the weights of the merged points: e.g. the squared distance to a point that represents 10 original points is multiplied by 10, to account for it acting as 10 points instead of just 1.
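A minimal sketch of that merging idea, assuming 2D points stored as tuples and a simple grid of side length cell to decide which points are "close enough"; the helper names, the grid strategy, and the parameter cell are illustrative only:

from collections import defaultdict

def merge_points(S, cell):
    # Snap each point to a grid cell and merge everything in the same cell.
    buckets = defaultdict(list)
    for p in S:
        key = (round(p[0] / cell), round(p[1] / cell))
        buckets[key].append(p)
    merged = []   # list of (representative point, weight)
    for pts in buckets.values():
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        merged.append(((cx, cy), len(pts)))       # weight = number of merged points
    return merged

def weighted_cost(y, merged):
    # Weighted sum of squared distances from candidate y to the merged points.
    return sum(w * ((y[0] - p[0]) ** 2 + (y[1] - p[1]) ** 2) for p, w in merged)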

Build a linear approximation for an unknown function

I have some unknown function f(x); I am using MATLAB to calculate 2000 points on the function's graph. I need a piecewise linear function g with 20 to 30 segments that fits the original function as well as possible. How could I do this in an acceptable way? The full solution space is impossible to traverse, and I can't think of a good heuristic to shrink it effectively.
Here is the code from which the function is derived:
x = sym('x', 'real');
inventory = sym('inventory', 'real');
demand = sym('demand', 'real');
f1 = 1/(sqrt(2*pi))*(-x)*exp(-(x - (demand - inventory)).^2./2);
f2 = 20/(sqrt(2*pi))*(x)*exp(-(x - (demand - inventory)).^2./2);
expectation_expression = int(f1, x, -inf, 0) + int(f2, x, 0, inf);
Depending on what your idea of a good approximation is, there may be a dynamic programming solution for this.
For example, given 2000 points and corresponding values, we wish to find the piecewise linear approximation with 20 segments which minimizes the sum of squared deviations between the true value at each point and the result of the linear approximation.
Work along the 2000 points from left to right, and at each point calculate for i=1 to 20 the total error from the far left to that point for the best piecewise linear approximation using i segments.
You can work out the values at position n+1 using the values calculated for the points to its left, points 1..n. For each value of i, consider all points to the left of n+1, say points j < n+1. Work out the error contribution of a linear segment running from point j to point n+1. Add to that the best possible error using i-1 segments at point j (or possibly point j-1, depending on exactly how you define your piecewise linear approximation). If you now take the minimum such value over all possible j, you have calculated the error of the best possible piecewise linear approximation using i segments for the first n+1 points.
When you have worked out the best value for the first 2000 points using 20 segments you have solved the problem, and you can work back along this table to find out where the segments are - or, if this is inconvenient, you can save extra information as you go along to make this easier.
I believe similar approaches will minimize the sum of absolute deviations, or minimize the maximum deviation at any point, subject to you being able to solve the corresponding problems for a single line. I have implicitly assumed you can fit a straight line to minimize the sum of squared errors, which is of course a standard least-squares line fit. Minimizing the absolute deviations from a straight line is an exercise in convex optimization, which I would attempt by iteratively reweighted least squares. Minimizing the maximum absolute deviation is linear programming.
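A minimal Python sketch of this dynamic programme under the sum-of-squared-deviations objective, with segments partitioning the points (the j-1 convention above) and each segment fit by ordinary least squares. The function names are illustrative, and the quadratic error table is the naive version; for 2000 points you would precompute running sums so each segment error costs O(1):

import numpy as np

def segment_sse(x, y, j, m):
    # Sum of squared errors of the least-squares line through points j..m (inclusive).
    xs, ys = x[j:m + 1], y[j:m + 1]
    M = np.vstack([xs, np.ones_like(xs)]).T
    coef, *_ = np.linalg.lstsq(M, ys, rcond=None)
    return float(np.sum((M @ coef - ys) ** 2))

def best_piecewise_fit(x, y, segments):
    n = len(x)
    err = [[0.0] * n for _ in range(n)]          # err[j][m]: SSE of one segment over points j..m
    for j in range(n):
        for m in range(j, n):
            err[j][m] = segment_sse(x, y, j, m)
    INF = float('inf')
    # best[i][m]: minimal total SSE for points 0..m using i segments
    best = [[INF] * n for _ in range(segments + 1)]
    back = [[0] * n for _ in range(segments + 1)]
    best[1] = err[0][:]                          # one segment covers everything so far
    for i in range(2, segments + 1):
        for m in range(n):
            for j in range(1, m + 1):            # segment i runs from point j to point m
                cand = best[i - 1][j - 1] + err[j][m]
                if cand < best[i][m]:
                    best[i][m], back[i][m] = cand, j
    # Walk back along the table to recover where the segments start.
    breaks, m = [], n - 1
    for i in range(segments, 1, -1):
        breaks.append(back[i][m])
        m = back[i][m] - 1
    return best[segments][n - 1], sorted(breaks)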

KNN classifier algorithm not working for all cases

To efficiently find the n nearest neighbors of a point in d-dimensional space, I selected the dimension with the greatest scatter (i.e. the dimension in which coordinate differences between points are largest). The whole range from the minimal to the maximal value in this dimension was split into k bins. Each bin contains the points whose coordinates (in this dimension) fall within the range of that bin. It was ensured that there are at least 2n points in each bin.
The algorithm for finding the n nearest neighbors of a point x is as follows (a short code sketch follows the list):
Identify the bin kx in which point x lies (its projection, to be precise).
Compute the distances between x and all the points in bin kx.
Sort the computed distances in ascending order.
Select the first n distances. The points to which these distances were measured are returned as the n nearest neighbors of x.
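For reference, a minimal sketch of the procedure as described, assuming NumPy arrays and using the max-min range of each coordinate as the "scatter"; the helper names are illustrative, not from the question:

import numpy as np

def build_bins(points, k):
    # Bin on the dimension with the greatest scatter (largest max - min range).
    dim = int(np.argmax(points.max(axis=0) - points.min(axis=0)))
    edges = np.linspace(points[:, dim].min(), points[:, dim].max(), k + 1)
    bin_of = np.clip(np.searchsorted(edges, points[:, dim]) - 1, 0, k - 1)
    return dim, edges, bin_of

def query(points, dim, edges, bin_of, x, n):
    # Steps 1-4: find x's bin, then brute-force the n nearest points inside it.
    kx = int(np.clip(np.searchsorted(edges, x[dim]) - 1, 0, len(edges) - 2))
    candidates = np.where(bin_of == kx)[0]
    d = np.linalg.norm(points[candidates] - x, axis=1)
    return candidates[np.argsort(d)[:n]]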
This algorithm does not work in all cases. When can the algorithm fail to compute the nearest neighbors?
Can anyone propose a modification of the algorithm to ensure proper operation in all cases?
Where KNN fails:
If the data is a jumble of points from all different classes, then KNN will fail, because it will try to find the k nearest neighbours but all the points are effectively random.
Outlier points:
Let's say you have two clusters of different classes. If you then use an outlier point as the query, KNN will still assign it one of the classes, even though the query point is far away from both clusters.
This fails because (any of) the n nearest neighbors of x could lie in a different bin than x; for example, when x's projection falls near a bin boundary.
What do you mean by "not working"? You do understand that what you are doing is only an approximate method?
Try normalising the data before choosing the dimension; otherwise the scatter comparison makes no sense.
The best vector for discrimination or for clustering may not be one of the original dimensions, but some combination of dimensions.
Use PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis) to identify a discriminative dimension.
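A minimal sketch of that suggestion, assuming scikit-learn is available: normalise the data, then bin on the projection onto the first principal component rather than on a single original axis:

import numpy as np
from sklearn.decomposition import PCA

def principal_axis_projection(points):
    # Normalise each dimension, then project onto the first principal component;
    # the small epsilon guards against constant dimensions.
    pts = (points - points.mean(axis=0)) / (points.std(axis=0) + 1e-12)
    return PCA(n_components=1).fit_transform(pts).ravel()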

finding saddle points in 3d heightmap

Given a 3d heightmap (from a laser scanner), how do I find the saddle points?
I.e. given something like this:
I am looking for all points where the curvature is positive in one direction and negative in the other.
(These directions need not be aligned with the X and Y axes.
I know how to check whether the curvature in the X direction has the opposite sign to the curvature in the Y direction, but that does not cover all cases. To make matters worse, the resolution in X is different from the resolution in Y.)
Ideally I am looking for an algorithm that can tolerate some amount of noise and only mark "significant" saddle points.
I've been exploring a similar problem for a computational topology class and have had some success with the method outlined below.
First you will need a comparison function that evaluates the height at two input points and returns < or > (never equal) for any input. One way to do this is, when the points have equal height, to use some position-based or random index to decide which is greater. You can think of this as adding an infinitesimal perturbation to the height.
Now, for each point, you will compare the height at all the surrounding neighbors (there will be 8 neighbors on a 2D rectangular grid). The lower link for a point will be the set of all neighbors for which the height is less than the point.
If all the neighboring values are in the lower link, you are at a local maximum. If none of the points are in the lower link you are at a local minimum. Otherwise, if the lower link is a single connected set, you are at a regular point on a slope. But if the lower link is two unconnected sets, you are at a saddle.
In 2D you can construct a list of the 8 neighboring points in cyclic order around the point you are checking. You assign a value of +/-1 to each neighbor depending on your comparison function. You can then step through that list (remember to compare the two end points) and count how many times the sign changes to determine the number of connected components in the lower link.
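A minimal sketch of that neighbor classification, assuming a 2D NumPy height array, interior grid points only, and the index-based tie-breaking described above; the names are illustrative:

import numpy as np

# Cyclic order of the 8 neighbours around a grid point.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def classify(height, i, j):
    # Returns 'min', 'max', 'regular' or 'saddle' for interior point (i, j).
    def lower(a, b):   # strict comparison with index-based tie-breaking
        return (height[a], a) < (height[b], b)
    signs = [-1 if lower((i + di, j + dj), (i, j)) else 1 for di, dj in NEIGHBOURS]
    if all(s == -1 for s in signs):
        return 'max'
    if all(s == 1 for s in signs):
        return 'min'
    # Count sign changes around the cycle; the lower link has changes/2 components.
    changes = sum(signs[t] != signs[(t + 1) % 8] for t in range(8))
    return 'saddle' if changes >= 4 else 'regular'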
Determining which saddles are "important" is a more difficult analysis. You may wish to look at this: http://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Gyulassy08.pdf for some guidance.
-Michael
(From a guess at the maths rather than practical experience)
Fit a quadratic to the surface in a small patch around each candidate point, e.g. with least squares. The size of the patch is one way of controlling noise, and you might gain by weighting points depending on their distance from the candidate point. In matrix notation, you can represent the quadratic as x'Ax + b'x + c, where A is symmetric.
The quadratic will have zero gradient at x = -(A^-1)b/2. If this is not within the patch, discard it.
If A has both +ve and -ve eigenvalues, you have a saddle point at x. Since A is only 2x2 it has at most two eigenvalues, and you can ignore the case where it has a zero eigenvalue, since then you could not have inverted it at the previous stage.
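A minimal sketch of that test for a single patch, assuming NumPy arrays of the patch's x, y and z samples; the function name and the degeneracy tolerance are illustrative:

import numpy as np

def saddle_in_patch(xs, ys, zs):
    # Fit z ≈ a*x^2 + b*x*y + c*y^2 + d*x + e*y + f by least squares.
    M = np.column_stack([xs**2, xs*ys, ys**2, xs, ys, np.ones_like(xs)])
    a, b, c, d, e, f = np.linalg.lstsq(M, zs, rcond=None)[0]
    A = np.array([[a, b / 2], [b / 2, c]])       # the x'Ax + b'x + c form
    bvec = np.array([d, e])
    if abs(np.linalg.det(A)) < 1e-12:
        return None                              # degenerate quadratic, skip
    crit = -0.5 * np.linalg.solve(A, bvec)       # zero-gradient point x = -(A^-1 b)/2
    in_patch = (xs.min() <= crit[0] <= xs.max()) and (ys.min() <= crit[1] <= ys.max())
    is_saddle = np.linalg.det(A) < 0             # eigenvalues of opposite sign
    return crit if (in_patch and is_saddle) else None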

Need Better Algorithm for Finding Mapping Between 2 Sets of Points with Minimum Distance

Problem: I have two overlapping 2D shapes, A and B, each shape having the same number of pixels, but differing in shape. Some portion of the shapes are overlapping, and there are some pieces of each that are not overlapping. My goal is to move all the non-overlapping pixels in shape A to the non-overlapping pixels in shape B. Since the number of pixels in each shape is the same, I should be able to find a 1-to-1 mapping of pixels. The restriction is that I want to find the mapping that minimizes the total distance traveled by all the pixels that moved.
Brute Force: The brute force approach to solving this problem is obviously out of the question, since I would have to compute the total distance of all possible mappings, of which I think there are n! (where n is the number of non-overlapping pixels in one shape), each requiring n distance calculations for the pairs in the mapping, giving a total of O(n * n!) or something similar.
Backtracking: The only "better" solution I could think of was to use backtracking, where I would keep track of the current minimum so far and at any point when I'm evaluating a certain mapping, if I reach or exceed that minimum, I move on to the next mapping. Even this won't do any better than O( n! ).
Is there any way to solve this problem with a reasonable complexity?
Also note that the "obvious" approach of simply mapping a point to its closest matching neighbour does not always yield the optimum solution.
Simpler Approach?: As a secondary question, if a feasible solution doesn't exist, one possibility might be to partition each non-overlapping section into small regions, and map these regions, greatly reducing the number of mappings. To calculate the distance between two regions I would use the center of mass (average of the pixel locations in the region). However, this presents the problem of how I should go about doing the partitioning in order to get a near-optimal answer.
Any ideas are appreciated!!
This is the Minimum Matching problem, and you are correct that it is a hard problem in general. However for the 2D Euclidean Bipartite Minimum Matching case it is solvable in close to O(n²) (see link).
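For moderate n, an exact answer can also be obtained from an off-the-shelf assignment solver; here is a minimal sketch using SciPy's linear_sum_assignment (a Hungarian-style O(n^3) method, not the specialised near-O(n^2) geometric algorithm mentioned above), assuming (n, 2) arrays of the non-overlapping pixel coordinates:

import numpy as np
from scipy.optimize import linear_sum_assignment

def min_cost_mapping(A_pixels, B_pixels):
    # Pairwise Euclidean distances between every A pixel and every B pixel.
    cost = np.linalg.norm(A_pixels[:, None, :] - B_pixels[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)     # minimum-total-distance 1-to-1 mapping
    return list(zip(rows, cols)), cost[rows, cols].sum()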
For fast approximations, FryGuy is on the right track with Simulated Annealing. This is one approach.
Also take a look at "Approximation algorithms for bipartite and non-bipartite matching in the plane" for an O((n/ε)^1.5 * log^5 n) randomized (1+ε)-approximation scheme.
You might consider simulated annealing for this. Start off by assigning A[x] -> B[y] for each pixel, randomly, and calculate the sum of squared distances. Then swap a pair of x<->y mappings, randomly. Then choose to accept this with probability Q, where Q is higher if the new mapping is better, and tends towards zero over time. See the wikipedia article for a better explanation.
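A minimal sketch of that annealing loop, assuming equal-length lists of A and B pixel coordinates; the iteration count and cooling schedule are illustrative values, not tuned:

import math
import random

def anneal_mapping(A_pixels, B_pixels, iters=100_000, t0=1.0, cooling=0.9999):
    # perm[i] = index of the B pixel that A pixel i is currently mapped to.
    perm = list(range(len(B_pixels)))
    random.shuffle(perm)
    def cost(i):   # squared distance of one mapped pair under the current perm
        ax, ay = A_pixels[i]
        bx, by = B_pixels[perm[i]]
        return (ax - bx) ** 2 + (ay - by) ** 2
    total = sum(cost(i) for i in range(len(perm)))
    t = t0
    for _ in range(iters):
        i, j = random.sample(range(len(perm)), 2)
        before = cost(i) + cost(j)
        perm[i], perm[j] = perm[j], perm[i]       # propose swapping two assignments
        delta = cost(i) + cost(j) - before
        if delta <= 0 or random.random() < math.exp(-delta / t):
            total += delta                        # accept the swap
        else:
            perm[i], perm[j] = perm[j], perm[i]   # reject: undo the swap
        t *= cooling                              # temperature tends towards zero over time
    return perm, total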
Sort pixels in shape A: in increasing order of 'x' and then 'y' ordinates
Sort pixels in shape B: in decreasing order of 'x' and then increasing 'y'
Map pixels at the same index: in the sorted lists, the first pixel in A will map to the first pixel in B. Is this not the mapping you are looking for?
