matlab: optimal number of points for a linear fit - algorithm

I want to make a linear fit to a few data points, as shown in the image. Since I know the intercept (in this case, say, 0.05), I want to fit only the points that lie in the linear region with this particular intercept. In this case that would be, let's say, points 5:22 (but not 22:30).
I'm looking for a simple algorithm to determine this optimal number of points, based on... hmm, that's the question... R^2? Any ideas how to do it?
I was thinking about probing R^2 for fits over all candidate ranges of points (1:2 up to 1:30, then 2:3 up to 2:30, and so on), but I don't really know how to wrap that into a clear and simple function. For fits with a fixed intercept I'm using polyfit0 (http://www.mathworks.com/matlabcentral/fileexchange/272-polyfit0-m). Thanks for any suggestions!
EDIT:
sample data:
intercept = 0.043;
x = 0.01:0.01:0.3;
y = [0.0530642513911393,0.0600786706929529,0.0673485248329648,0.0794662409166333,0.0895915873196170,0.103837395346484,0.107224784565365,0.120300492775786,0.126318699218730,0.141508831492330,0.147135757370947,0.161734674733680,0.170982455701681,0.191799936622712,0.192312642057298,0.204771365716483,0.222689541632988,0.242582251060963,0.252582727297656,0.267390860166283,0.282890010610515,0.292381165948577,0.307990544720676,0.314264952297699,0.332344368808024,0.355781519885611,0.373277721489254,0.387722683944356,0.413648156978284,0.446500064130389;];
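For reference, a rough sketch of the R^2-probing idea mentioned above (slope from a fixed-intercept least-squares fit; the 0.99 threshold is an arbitrary choice):
best = [1, numel(x)];
for i1 = 1:numel(x)-2
    for i2 = i1+2:numel(x)
        xi = x(i1:i2)';  yi = y(i1:i2)';
        s  = xi \ (yi - intercept);                 % fixed-intercept least-squares slope
        R2 = 1 - sum((yi - intercept - s*xi).^2) / sum((yi - mean(yi)).^2);
        if R2 > 0.99 && (i2 - i1) > (best(2) - best(1))
            best = [i1, i2];                        % keep the longest range with a good fit
        end
    end
end
best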

What you have here is a rather difficult problem to solve in general.
One approach would be to compute the slopes and intercepts between all consecutive pairs of points, and then do cluster analysis on the intercepts:
slopes = diff(y)./diff(x);
intercepts = y(1:end-1) - slopes.*x(1:end-1);
idx = kmeans(intercepts(:), 3);
x([idx; 3] == 2) % the points whose pairwise intercepts fall in the cluster closest to the linear one (cluster 2 here)
This requires the Statistics Toolbox (for kmeans). This is the best of all the methods I tried, although the range of points found this way may have a few small holes in it; e.g., when the slope between two points near the start or end happens to lie close to the slope of the line, those points will be detected as belonging to the line. This (and other factors) means the solution found this way needs a bit of post-processing.
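A hedged sketch of that post-processing, using the cluster centroids returned by kmeans to pick the cluster nearest the known intercept instead of hard-coding cluster 2:
[idx, centroids] = kmeans(intercepts(:), 3);
[~, linCluster] = min(abs(centroids - intercept));   % cluster whose centroid is nearest the known intercept
inLine = [idx == linCluster; false];                 % pad to the length of x
linearRange = find(inLine, 1, 'first') : find(inLine, 1, 'last')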
Another approach (which I failed to construct successfully) is to do a linear fit in a loop, each time growing the range of points outward from some point in the middle towards both endpoints, and check whether the sum of squared errors stays small. I gave this up quickly, because defining what "small" means is subjective and has to be done heuristically.
I tried a more systematic and robust approach of the above:
function test
    %% example data
    slope = 2;
    intercept = 1.5;
    x = linspace(0.1, 5, 100).';
    y = slope*x + intercept;
    y(1:12) = log(x(1:12)) + y(12)-log(x(12));
    y(74:100) = y(74:100) + (x(74:100)-x(74)).^8;
    y = y + 0.2*randn(size(y));
    %% simple algorithm
    [X,fn] = fminsearch(@(ii) P(ii, x,y,intercept), [0.5 0.5])
    [~,inds] = P(X, x,y,intercept)
end

function [C, inds] = P(ii, x,y,intercept)
    % ii represents the fraction of the range from the center to each end,
    % so both entries of ii lie between 0 and 1.
    N = numel(x);
    n = round(N/2);
    ii = round(ii*n);
    inds = min(max(1, n+(-ii(1):ii(2))), N);
    % Solve the linear system with fixed intercept
    A = x(inds);
    b = y(inds) - intercept;
    % and return the sum of squared errors, divided by
    % the number of points included in the set. This
    % last step is required to prevent fminsearch from
    % reducing the set to 1 point (= minimum possible
    % squared error).
    C = sum(((A\b)*A - b).^2)/numel(inds);
end
which only finds a rough approximation to the desired indices (12 and 74 in this example).
When fminsearch is run a few dozen times with random starting values (really just rand(1,2)), it becomes more reliable, but I still wouldn't bet my life on it.
If you have the statistics toolbox, use the kmeans option.

Depending on the number of data values, I would split the data into a relatively small number of overlapping segments, and for each segment calculate the linear fit, or rather the first-order coefficient (remember you know the intercept, which will be the same for all segments).
Then, for each coefficient, calculate the MSE between the corresponding line and the entire dataset, and choose the coefficient that yields the smallest MSE.
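A minimal sketch of that segment idea, using the sample data from the question (the segment length and the 50% overlap are arbitrary choices):
segLen = 8;                                    % points per segment
step   = 4;                                    % 50% overlap
starts = 1:step:numel(x)-segLen+1;
mse    = zeros(size(starts));
slopes = zeros(size(starts));
for s = 1:numel(starts)
    seg       = starts(s) : starts(s)+segLen-1;
    slopes(s) = x(seg)' \ (y(seg)' - intercept);          % fixed-intercept slope of this segment
    mse(s)    = mean((intercept + slopes(s)*x - y).^2);   % MSE of that line against the whole dataset
end
[~, best] = min(mse);
bestSlope = slopes(best)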

Related

MATLAB: Speeding up a discretization function using bsxfun

For a current project, I have to discretize quasi-continuous values into bins defined by some predefined binning resolution. For this purpose I wrote a function which I expected to be highly efficient, as it can process both scalar and vector inputs using bsxfun. However, after some profiling I found that almost all the processing time of my much larger project is spent in this function, and within the function it is mainly the bsxfun part that takes time, with the min query in second place. Long story short, I am looking for advice on how to solve this task MUCH faster in MATLAB. Side note: I usually pass vectors with some 50k elements.
Here's the code:
function sampleNo = value2sample(value,bins)
%Make sure both vectors have orientations fitting bsxfun
value = value(:);
bins = bins(:)';
%Recover bin resolution (avoids passing another parameter)
delta = median(diff(bins));
%Calculate distance matrix between all combinations
dist = abs(bsxfun(@minus,value,bins));
%What we really want to know is the minimum distance per row
[minval,ind] = min(dist,[],2);
%Make sure we don't accidentally further process NaNs as 1st bin
ind(isnan(minval))=NaN;
sampleNo = ind;
sampleNo(minval>delta) = NaN;
end
The reason your function is slow is that you are computing the distance between every element of values and bins and storing them all in an array - if there are N values and M bins then you need N*M elements to store all the distances, and this is probably a very big number (e.g. if each input has 50,000 elements then you need 2.5 billion elements in the output array).
Moreover, since your bins are sorted (you didn't state this, but it looks like you are assuming it in your code), you do not need to compute the distance from every value to every bin. You can be much smarter:
function ind = value2sample(value, bins)
% Find median bin distance
delta = median(diff(bins));
% Bucket into 'nearest' bin by using midpoints
bins = bins(:);
mids = [-Inf; 0.5 * (bins(1:end-1) + bins(2:end))];
[~, ind] = histc(value, mids);
% Ensure that NaN values and points that aren't near any bin are returned as NaN
ind(isnan(value)) = NaN;
ind(abs(value - bins(ind)) > delta) = NaN;
end
In my tests, with values = randn(10000, 1) and bins = -50:50 it takes around 4.5 milliseconds to run the original function, and 485 microseconds to run the code above, so you are getting around a 10x speedup (and the speedup will be even greater as you increase the size of the inputs).
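For reference, a rough sketch of how such a comparison can be made with timeit (value2sample_bsxfun and value2sample_histc are hypothetical copies of the two versions above saved under different names; absolute numbers will vary by machine):
values = randn(10000, 1);
bins = -50:50;
t_bsxfun = timeit(@() value2sample_bsxfun(values, bins))   % original bsxfun version
t_histc  = timeit(@() value2sample_histc(values, bins))    % histc-based version above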
Thanks to @Chris Taylor, I was able to solve the problem very efficiently. The code now runs almost 400 times faster than before. The only changes I had to make to his version are reflected in the code below. The main issue was to replace histc (whose use is no longer encouraged) with discretize.
function ind = value2sample(value, bins)
% Make sure both vectors are columns
value = value(:);
bins = bins(:);
% Bucket into 'nearest' bin by using midpoints
mids = [eps; 0.5 * (bins(1:end-1) + bins(2:end))];
ind = discretize(value, mids);
end
The only caveat is that, in this implementation, the bins must be non-negative. Other than that, this code does exactly what I want, including the fact that ind has the same size as value and contains NaNs whenever a value is NaN or outside the range of the bins.
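A quick usage check of that behaviour (a sketch, assuming the discretize-based value2sample above is on the path):
bins   = 0:0.5:10;
values = [0.7; NaN; 42];
ind = value2sample(values, bins)   % expected: 2 (nearest bin is 0.5), then NaN, then NaN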

Global minimum in a huge convex matrix by using small matrices

I have a function J(x,y,z) that gives me a value for those coordinates. This function is convex. What I need is to find the minimum value of this huge matrix of function values.
At first I tried to loop through all of them, calculate the values and then search with the min function, but that takes too long ...
so I decided to take advantage of the convexity.
Take a random (for now) set of coordinates; that will be the center of my small 3x3x3 matrix. Find the local minimum and make it the center of the next matrix. This continues until we reach the global minimum.
Another issue is that the function is not perfectly convex, so fake local minima can appear as well,
so I'm thinking of a control measure: when the search finds a fake minimum, increase the search range to make sure of it.
How would you advise me to go with it? Is this approach good? Or should I look into something else?
This is something I started myself but I am fairly new to Matlab and I am not sure how to continue.
clear all
clc
min = 100;    % note: this shadows the built-in min function
% the initial size of the search matrix is 2*level+1
level = 1;
i = input('Enter the starting coordinate for i (X) : ');
j = input('Enter the starting coordinate for j (Y) : ');
k = input('Enter the starting coordinate for k (Z) : ');
for m = i-level:i+level
    for n = j-level:j+level
        for p = k-level:k+level
            A(m,n,p) = J(m,n,p);
            if A(m,n,p) < min
                min = A(m,n,p);
            end
        end
    end
end
display(min, 'Minimum');
[r,c,d] = ind2sub(size(A), find(A == min));
display(r,'X');
display(c,'Y');
display(d,'Z');
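A rough sketch of how the iterative re-centering described above might continue (assuming J is callable with three scalar coordinates; bounds checking is omitted, and the iteration cap and give-up level are arbitrary):
center  = [i, j, k];
level   = 1;
bestVal = J(center(1), center(2), center(3));
for iter = 1:200                              % iteration cap
    [dm, dn, dp] = ndgrid(-level:level);
    cand = [center(1)+dm(:), center(2)+dn(:), center(3)+dp(:)];
    vals = arrayfun(@(r) J(cand(r,1), cand(r,2), cand(r,3)), (1:size(cand,1))');
    [v, w] = min(vals);
    if v < bestVal
        bestVal = v;
        center  = cand(w, :);
        level   = 1;                          % shrink back after a successful move
    else
        level = level + 1;                    % widen the search near a flat or fake minimum
        if level > 5, break; end              % give up once the patch gets large
    end
end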
Any guidance, improvement and constructive criticism are appreciated. Thanks in advance.
Try fminsearch because it is fairly general and easy to use. This is especially easy if you can specify your function anonymously. For example:
aFunc = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2
then using fminsearch:
[x,fval] = fminsearch( aFunc, [-1.2, 1]);
If your 3-dimensional function, J(x,y,z), can be described anonymously or as a regular function, then you can try fminsearch. The input takes a vector, so you would need to write your function as J(X), where X is a vector of length 3, so x=X(1), y=X(2), z=X(3).
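For example, a thin wrapper along these lines (J, and the starting guess x0, y0, z0, are placeholders for your own function and coordinates):
Jvec = @(X) J(X(1), X(2), X(3));
[Xmin, fval] = fminsearch(Jvec, [x0, y0, z0])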
fminsearch can fail, especially if the starting point is not near the solution. It is often better to refine the initial starting point. For example, the code below samples a patch around the starting vector and generally improves the chances of finding the global minimum.
% deltaR is used to refine the start vector with a scattered min search over
% a patch of [-deltaR+startVec(i) : dx : deltaR+startVec(i)] on
% each side.
% Determine dx using maxIter.
maxIter = 1e4;
dx = max( (2*deltaR+1)^2/maxIter, 1/8);
dim = length( startVec);
[x,y] = meshgrid( -deltaR:dx:deltaR );
xV = zeros( length(x(:)), dim);
% Alternate patches as sequential x-y grids.
for ii = 1:2:dim
    xV(:, ii) = startVec(ii) + x(:);
end
for ii = 2:2:dim
    xV(:, ii) = startVec(ii) + y(:);
end
% Find the scatter min index to update startVec.
for ii = 1:length( xV)
    nS(ii) = aFunc( xV(ii,:));
end
[fSmin, iw] = min( nS);
startVec = xV( iw,:);
fSmin = fSmin          % no semicolons: display the refined values
startVec = startVec
[x,fval] = fminsearch( aFunc, startVec);
You can run a 2 dimensional test case f(x,y)=z on AlgorithmHub. The app is running the above code in Octave. You can edit the in-line function (possibly even try your problem) from this web-site as well.

How to do efficient k-nearest neighbor calculation in Matlab

I'm doing data analysis using the k-nearest neighbor algorithm in Matlab. My data consists of an 11795 x 88 data matrix, where the rows are observations and the columns are variables.
My task is to find k-nearest neighbors for n selected test points. Currently I'm doing it with the following logic:
FOR all the test points
LOOP all the data and find the k-closest neighbors (by euclidean distance)
In other words, I loop over all n test points. For each test point I search the data (which excludes the test point itself) for the k-nearest neighbors by Euclidean distance. For each test point this takes approximately k x 11794 iterations, so the whole process takes about n x k x 11794 iterations. If n = 10000 and k = 7, this is approximately 825.6 million iterations.
Is there a more efficient way to calculate the k-nearest neighbors? Most of the computation is going to waste now, because my algorithm simply:
calculates the euclidean distance to all the other points, picks up the closest and excludes the closest point from further consideration --> calculates the euclidean distance to all the other points and picks up the closest --> etc. --> etc.
Is there a smart way to get rid of this 'waste calculation'?
Currently this process takes about 7 hours in my computer (3.2 GHz, 8 GB RAM, 64-bit Win 7)... :(
Here is some of the logic illustrated explicitly (this is not all my code, but this is the part that eats up performance):
for i = 1:size(testpoints, 1) % Loop all the test points
    neighborcandidates = all_data_excluding_testpoints; % Use the rest of the data excluding the test points in search of the k-nearest neighbors
    testpoint = testpoints(i, :); % This is the test point for which we find k-nearest neighbors
    kneighbors = []; % Store the k-nearest neighbors here.
    for j = 1:k % Find k-nearest neighbors
        bdist = Inf; % The distance of the closest neighbor
        bind = 0; % The index of the closest neighbor
        for n = 1:size(neighborcandidates, 1) % Loop all the candidates
            if pdist([testpoint; neighborcandidates(n, :)]) < bdist % Check the euclidean distance
                bdist = pdist([testpoint; neighborcandidates(n, :)]); % Update the best distance so far
                bind = n; % Save the best found index so far
            end
        end
        kneighbors = [kneighbors; neighborcandidates(bind, :)]; % Save the found neighbour
        neighborcandidates(bind, :) = []; % Remove the neighbor from further consideration
    end
end
Using pdist2:
A = rand(20,5); % This is your 11795 x 88 matrix
B = A([1, 12, 4, 8], :); % This is your n-by-88 subset, i.e. n = 4 in this case
n = size(B,1);
k = 3; % the number of neighbours you want
D = pdist2(A,B);
[~, ind] = sort(D);
kneighbours = ind(2:k+1, :);
Now you can use kneighbours to index a row in A. Note that the columns of kneighbours correspond to the rows of B
But since you're already dipping into the stats toolbox with pdist why not just use Matlab's knnsearch?
kneighbours_matlab = knnsearch(A,B,'K',k+1);
note that kneighbours is the same as kneighbours_matlab(:,2:end)'
I'm not familiar with specific matlab functions but you can remove k from your formula.
There is a well-known selection algorithm that
takes an array A (of size n) and a number k as input, and
gives a permutation of A such that the k-th biggest/smallest element is at the k-th place.
Smaller elements are to the left, bigger ones are to the right.
e.g.
A=2,4,6,8,10,1,3,5,7,9; k=5
output = 2,4,1,3,5,10,6,8,7,9
This is done in O(n) steps and doesn't depend on k.
EDIT1: You can also precompute all the distances, as it looks like that's where you spend most of the computation. The result will be roughly an 800M matrix, so that shouldn't be an issue on modern machines.
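In MATLAB specifically, a hedged sketch of the same selection idea: since R2017b, mink returns the k smallest elements without ranking the rest, so for one test point:
d = sum(bsxfun(@minus, neighborcandidates, testpoint).^2, 2);  % squared Euclidean distances
[~, nearestIdx] = mink(d, k);
kneighbors = neighborcandidates(nearestIdx, :);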
I am not sure whether it will speed up the code, but it removes the two inner loops:
for i = 1:size(testpoints, 1) % Loop over all the test points
    temp = repmat(testpoints(i,:), size(neighborcandidates, 1), 1);
    euclead_dist = sum((temp - neighborcandidates).^2, 2).^0.5;
    [sort_dist, ind] = sort(euclead_dist);
    lowest_k_ind = ind(1:k);
    kneighbors = neighborcandidates(lowest_k_ind, :);
    % sorting ranks all candidates at once, so nothing needs to be removed
    % from neighborcandidates between test points
end
Wouldn't this work?
adj_k = adj;
for i = 1:k-1
    adj_k = adj_k*adj;
end
kneigh = find(adj_k(n,:) > 0)
given a node n and an index k?
Maybe this is faster code in the context of Matlab. You can also try parallel functions, data indexing structures, and approximate nearest neighbor algorithms that are theoretically more efficient.
% a slightly faster way to find k nearest neighbors in matlab
% find neighbors for data Y from data X
m = size(X,1);
n = size(Y,1);
IDXs_out = zeros(n,k);
distM = (repmat(X(:,1),1,n) - repmat(Y(:,1)',m,1)).^2;
for d = 2:size(Y,2)
    distM = distM + (repmat(X(:,d),1,n) - repmat(Y(:,d)',m,1)).^2;
end
distM = sqrt(distM);
for i = 1:k
    [~,idx] = min(distM,[],1);
    id = sub2ind(size(distM), idx', (1:n)');
    distM(id) = inf;
    IDXs_out(:,i) = idx';
end

Get equidistant intervals on approximated bark scale

Wikipedia says we can approximate Bark scale with the equation:
b(f) = 13*atan(0.00076*f)+3.5*atan(power(f/7500,2))
How can I divide frequency spectrum into n intervals of the same length on Bark scale (interval division points will be equidistant on Bark scale)?
The best way would be to invert the function analytically (express x as a function of y). I tried doing it on paper but failed, and the WolframAlpha search bar couldn't do it either. I tried Octave's finverse function, but I got an error.
Octave says (for simpler example):
octave:2> x = sym('x');
octave:3> finverse(2*x)
error: `finverse' undefined near line 3 column 1
This is finverse description from Matlab: http://www.mathworks.com/help/symbolic/finverse.html
There could also be a numerical way to do it. I can imagine just dividing the y axis equally and searching for the ideal division by binary search. But maybe there are existing tools that do it?
You need to numerically solve this equation (there is no analytical inverse function). Set equally spaced values for b and solve the equation to find the corresponding values of f. Bisection is somewhat slow, but a very good alternative is Brent's method. See http://en.wikipedia.org/wiki/Brent%27s_method
This function can't be inverted analytically. You'll have to use some numerical procedure. Binary search would be fine, but there are more efficient ways to do these sorts of things: look into root-finding algorithms. You can apply your algorithm of choice to the equation b(f) = b_n for each of the equally spaced Bark values b_n, which gives you the frequency interval endpoints f_n.
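For instance, a minimal MATLAB sketch using fzero (the 20 Hz to 20 kHz range, n = 10 intervals, and the 1000 Hz starting guess are assumptions):
b = @(f) 13*atan(0.00076*f) + 3.5*atan((f/7500).^2);
n = 10;
barkEdges = linspace(b(20), b(20000), n + 1);                 % equidistant on the Bark axis
freqEdges = zeros(size(barkEdges));
for ii = 1:numel(barkEdges)
    freqEdges(ii) = fzero(@(f) b(f) - barkEdges(ii), 1000);   % 1000 Hz is just a crude start
end
freqEdges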
Just so you know, in (say) octave to implement rpsmi's or David Zaslavsky's answer, you'd do something like this:
global x0 = 0.

function res = b(f)
    global x0
    res = 13*atan(0.00076*f)+3.5*atan(power(f/7500,2)) - x0
end

function [intervals, barks] = barkintervals(left, right, n)
    global x0
    intervals = linspace(left, right, n);
    barks = intervals;
    for i = 1:n
        x0 = intervals(i);
        # 125*x0 is just a crude guess starting point given the values
        [barks(i), fval, info] = fsolve('b', 125*x0);
    endfor
end
and run it like so:
octave:1> barks
octave:2> [i,bx] = barkintervals(0, 10, 10)
[... lots of output from fsolve deleted...]
i =
Columns 1 through 8:
0.00000 1.11111 2.22222 3.33333 4.44444 5.55556 6.66667 7.77778
Columns 9 and 10:
8.88889 10.00000
bx =
Columns 1 through 6:
0.0000e+00 1.1266e+02 2.2681e+02 3.4418e+02 4.6668e+02 5.9653e+02
Columns 7 through 10:
7.3639e+02 8.8960e+02 1.0605e+03 1.2549e+03
I finally decided not to use the Bark value approximation but rather the ideal values for the critical band centres (defined for n = 1..24). I plotted them with gnuplot, and on the same graph I plotted arbitrarily chosen values for the points of greater density (for the required n > 24). I adjusted the point values in Hz until both curves were approximately the same.
Of course rpsmi and David Zaslavsky answers are more general and scalable.

Distributing points over a surface within boundaries

I'm interested in a way (algorithm) of distributing a predefined number of points over a 4 sided surface like a square.
The main issue is that each point has to have a minimum and a maximum proximity to every other point (random between two predefined values). Basically, the distance between any two points should not be closer than, let's say, 2, and not further than 3.
My code will be implemented in Ruby (the points are locations, the surface is a map), but any ideas or snippets are definitely welcome, as all my ideas include a fair amount of brute force.
Try this paper. It has a nice, intuitive algorithm that does what you need.
In our modelization, we adopted another model: we consider each center to be related to all its neighbours by a repulsive string. At the beginning of the simulation, the centers are randomly distributed, as well as the strengths of the strings. We choose randomly to move one center; then we calculate the resulting force caused by all neighbours of the given center, and we calculate the displacement which is proportional and oriented in the sense of the resulting force. After a certain number of iterations (which depends on the number of centers and the degree of initial randomness) the system becomes stable.
In case it is not clear from the figures, this approach generates uniformly distributed points. You may use instead a force that is zero inside your bounds (between 2 and 3, for example) and non-zero otherwise (repulsive if the points are too close, attractive if too far).
This is my Python implementation (sorry, I don't know Ruby). Just import this and call uniform() to get a list of points.
import numpy as np
from numpy.linalg import norm
import pylab as pl

# find the nearest neighbors (brute force)
def neighbors(x, X, n=10):
    dX = X - x
    d = dX[:,0]**2 + dX[:,1]**2
    idx = np.argsort(d)
    return X[idx[1:n+1]]

# repulsion force, normalized to 1 when d == rmin
def repulsion(neib, x, d, rmin):
    if d == 0:
        return np.array([1, -1])
    return 2*(x - neib)*rmin/(d*(d + rmin))

def attraction(neib, x, d, rmax):
    return rmax*(neib - x)/(d**2)

def uniform(n=25, rmin=0.1, rmax=0.15):
    # Generate randomly distributed points
    X = np.random.random_sample((n, 2))
    # Constants
    # step is how much each point is allowed to move
    # set to a lower value when you have more points
    step = 1./50.
    # maxk is the maximum number of iterations
    # if step is too low, then maxk will need to increase
    maxk = 100
    k = 0
    # Force applied to the points
    F = np.zeros(X.shape)
    # Repeat for maxk iterations or until all forces are zero
    maxf = 1.
    while maxf > 0 and k < maxk:
        maxf = 0
        for i in range(n):
            # Force calculation for the i-th point
            x = X[i]
            f = np.zeros(x.shape)
            # Interact with at most 10 neighbors
            Neib = neighbors(x, X, 10)
            # dmin is the distance to the nearest neighbor
            dmin = norm(Neib[0] - x)
            for neib in Neib:
                d = norm(neib - x)
                if d < rmin:
                    # feel repulsion from points that are too near
                    f += repulsion(neib, x, d, rmin)
                elif dmin > rmax:
                    # feel attraction if there are no neighbors closer than rmax
                    f += attraction(neib, x, d, rmax)
            # save all forces and the maximum force to normalize later
            F[i] = f
            if norm(f) != 0:
                maxf = max(maxf, norm(f))
        # update all positions using the forces
        if maxf > 0:
            X += (F/maxf)*step
        k += 1
    if k == maxk:
        print("warning: iteration limit reached")
    return X
I presume that one of your brute-force ideas is to repeatedly generate points at random and check whether the constraints happen to be satisfied.
Another way is to take a configuration that satisfies the constraints and repeatedly perturb a small, randomly chosen part of it - for instance, move a single point - to a randomly chosen nearby configuration. If you do this often enough, you should reach a random configuration that is almost independent of the starting point. This can be justified under http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm or http://en.wikipedia.org/wiki/Gibbs_sampling.
I might try just doing it at random, then going through and dropping points that are too close to other points. You can compare squared distances to save some math time.
Or create cells with borders and place a point in each one. Less random; it depends on whether this is a "just for looks" thing or not. But it could be very fast.
I made a compromise and ended up using the Poisson Disk Sampling method.
The result was fairly close to what I needed, especially with a lower number of tries (which also drastically reduces cost).
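For completeness, a hedged sketch of the simple dart-throwing variant of Poisson-disc sampling in MATLAB (all numeric values are assumptions; Bridson's grid-based algorithm is the faster standard choice):
side = 20; rmin = 2; nWanted = 50; maxTries = 1e4;
pts = zeros(0, 2);
tries = 0;
while size(pts, 1) < nWanted && tries < maxTries
    cand  = side * rand(1, 2);                % random candidate inside the square
    tries = tries + 1;
    if isempty(pts) || all(sum(bsxfun(@minus, pts, cand).^2, 2) >= rmin^2)
        pts(end+1, :) = cand;                 % accept only if it keeps the minimum distance
    end
end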

Resources