Finding parameters of exponentially decaying sinusoids (Matrix Pencil Method) - algorithm

The matrix pencil method is an algorithm for estimating the parameters (frequency, amplitude, decay factor and initial phase) of the individual exponentially decaying sinusoids in a signal that is a sum of several such components. I am trying to implement the algorithm, which is described in the paper available at either of these links:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=370583 OR
http://krein.unica.it/~cornelis/private/IEEE/IEEEAntennasPropagMag_37_48.pdf
In order to test the algorithm, I created a synthetic signal composed of four exponentially decaying sinusoids generated as follows:
fs=2205;
t=0:1/fs:249/fs;
f(1)=80;
f(2)=120;
f(3)=250;
f(4)=560;
a(1)=.4;
a(2)=1;
a(3)=0.89;
a(4)=.65;
d(1)=70;
d(2)=50;
d(3)=90;
d(4)=80;
for i=1:4
    x(i,:) = a(i)*exp(-d(i)*t).*cos(2*pi*f(i)*t);
end
y=x(1,:)+x(2,:)+x(3,:)+x(4,:);
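In other words, the test signal is \( y(t) = \sum_{i=1}^{4} a_i e^{-d_i t} \cos(2\pi f_i t) \), sampled at fs = 2205 Hz. In matrix-pencil terms, each component contributes a conjugate pair of poles \( z_i = e^{(-d_i \pm j 2\pi f_i)/f_s} \), and these poles are what the algorithm estimates.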
I then feed this signal to the algorithm described in the paper as follows:
function [f d] = mpencil(y)
%construct hankel matrix
N = size(y,2);
L1 = ceil(1/3 * N);
L2 = floor(2/3 * N);
L = ceil((L1 + L2) / 2);
fs=2205;
for i=1:(N-L)
    Y(i,:) = y(i:(i+L));
end
Y1=Y(:,1:L);
Y2=Y(:,2:(L+1));
[U,S,V] = svd(Y);
D=diag(S);
tol=1e-3;
m=0;
l=length(D);
for i=1:l
    if(abs(D(i)/D(1)) >= tol)
        m = m+1;
    end
end
Ss=S(:,1:m);
Vnew=V(:,1:m);
a=size(Vnew,1);
Vs1=Vnew(1:(a-1),:);
Vs2=Vnew(2:end,:);
Y1=U*Ss*(Vs1');
Y2=U*Ss*(Vs2');
D_fil=(pinv(Y1))*Y2;
z = eig(D_fil);
l=length(z);
for i=1:2:l
    f((i+1)/2) = (angle(z(i))*fs)/(2*pi);
    d((i+1)/2) = -real(z(i))*fs;
end
In the output from the above code I correctly get the four constituent frequency components, but not their decay factors. If anybody has prior experience with this algorithm, or some insight into why this discrepancy occurs, I would be very grateful for your help. I have rewritten the code from scratch multiple times, with the same results.
Any help would be highly appreciated.

I found the problem.
There are two small glitches in the code:
1. The SVD routine returns the conjugate transpose of the right singular matrix, i.e. Vh, and according to the IEEE paper it needs to be converted to V first. It is this V that is then truncated to reduce the dimension, and only after that are V1 and V2 taken from it. (In your case, you are using Vh directly to compute V1 and V2!) When calculating Y1 and Y2, the conjugate transposes of V1 and V2 are used.
2. You considered only the real part of the complex eigenvalues, not their absolute magnitude. With sampling period Ts = 1/fs, the damping coefficient is zeta = log(|z|)/Ts; for the model a*exp(-d*t) this yields -d, so the decay factor is d = -log(|z|)/Ts.
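For concreteness, a minimal sketch of the corrected extraction step in MATLAB, assuming z holds the eigenvalues of D_fil and fs is the sampling rate (the names f_est and d_est are mine):
Ts = 1/fs;                      % sampling period
f_est = angle(z)/(2*pi*Ts);     % frequency from the phase of each eigenvalue
d_est = -log(abs(z))/Ts;        % decay factor from the magnitude, not the real part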

Related

Adaptive Simpsons Quadrature Algorithm for Double Integrals?

I'm currently using Numerical Analysis, 10th edition, by Richard L. Burden as a reference for approximate integration techniques. It describes the adaptive Simpson's quadrature rule, which takes only the bounds and an error tolerance, and returns the approximate integral to within that tolerance. This is much more effective than the standard Simpson's rule, where you have to input the number of subintervals without knowing how close the result is to the actual solution. However, the book goes on to describe a double-integral method using Simpson's rule, but not an adaptive Simpson's quadrature algorithm for double integrals. Does anyone know a pseudo-algorithm for an adaptive Simpson's rule for double integrals?
For reference, this is the pseudo-algorithm for the composite Simpson's rule for single integrals; it takes the bounds (a, b) and n, the number of subintervals:
`NAME: compositeSimpsons(a, b, n):
    h = (b-a)/n
    first = f(a)
    last = f(b)
    sum = 0
    x = a + h
    for(i = 1 : n-1)       // n-1 interior points
        if(i % 2 == 1)     // odd index -> coefficient 4
            sum += 4*f(x)
        else               // even index -> coefficient 2
            sum += 2*f(x)
        x += h
    end for
    return (h/3) * (first + sum + last)`
And here is the pseudo-algorithm for adaptive Simpson's quadrature for single integrals; it takes the bounds (a, b) and a tolerance (tol):
`NAME: adaptiveQuadratureSimpsons(a, b, tol):
    myStack.push(a)
    myStack.push(b)
    I = 0
    while(myStack is not empty)
        bb = myStack.pop()
        aa = myStack.pop()
        I1 = compositeSimpsons(aa, bb, 2)
        m = (aa+bb)/2
        I2 = compositeSimpsons(aa, m, 2) + compositeSimpsons(m, bb, 2)
        if(|I2-I1|/15 < (bb-aa)*tol)
            I += I2
        else                       // refine both halves
            myStack.push(m)
            myStack.push(bb)
            myStack.push(aa)
            myStack.push(m)
    end while
    return I`
The Simpson's-rule algorithm for double integrals gets complex quickly, since the x variable is replaced at each iteration with a different subdivision, so I won't detail it here unless necessary. I know that algorithm isn't the problem, though: I've tried it many times and it works fine for many different double integrals. I tried to use the same logic as the adaptive Simpson's rule in my double-integral version by replacing compositeSimpsons() with my compositeSimpsonsDouble(), but it entered an infinite loop, as the difference between I2 and I1 was always less than the tolerance. Any help? I'm coding this in Java.
In the lingo of numerical quadrature, "double integrals" don't play as big a role as the domain you want to integrate your function over. In 1D it's always an interval; in 2D it can be a disk, a rectangle, a triangle, the plane with weight function exp(-r**2), etc. Perhaps your double integral is over one of these. Each of these domains has its own integration techniques. See https://github.com/nschloe/quadpy for some examples.
For adaptive quadrature in 2D, my first impulse would be to check whether the domain can be approximated well by a number of triangles. Like intervals in 1D, those can easily be split into smaller triangles if the error estimator recommends it.
Check https://github.com/nschloe/quadpy/wiki/Adaptive-quadrature for how to do this with quadpy.
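If the domain is simply a rectangle [a, b] x [c, d], one hedged sketch (plain iterated integration, not the book's method) is to treat the inner integral as an ordinary function of x and hand that function to the 1D adaptive routine from the question:
`NAME: adaptiveSimpsonsDouble(a, b, c, d, tol):
    // inner integral as a function of x; innerTol is a tuning choice
    define g(x) := adaptiveQuadratureSimpsons(c, d, innerTol) applied to y -> f(x, y)
    // outer adaptive pass over x, integrating g
    return adaptiveQuadratureSimpsons(a, b, tol) applied to x -> g(x)`
The inner tolerance innerTol should be noticeably tighter than tol; if the inner estimates are noisier than the outer refinement test, the |I2-I1| comparison can misbehave, which is one plausible cause of the infinite loop you saw.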

Weighted Sum Scheduling in Halide

I am implementing a radial basis function in Halide, and while I have it running successfully, it is quite slow. For each pixel I compute the distance to each control point, then take a weighted sum of these distances to produce the output. To loop over the weights I use an RDom (as seen below). In this implementation, every pixel computation requires reloading all of the many (3000+) weights, hence the slow speed.
My question is how to take advantage of Halide's scheduling functionality here. My desire is to load some of the weights, compute partial weighted sums for a subset of the pixels, load the next set of weights, and continue to completion. This keeps locality for each smaller group of weights, and that kind of thing is exactly what Halide is built for. Unfortunately I haven't found anything for this specific problem. The RDom seems to be at a lower level of abstraction than the scheduling primitives, so it's unclear how to schedule this.
Any alternative suggestions for implementing the weighted sum in Halide are welcome. There is no need to do this with an RDom; I'm just not aware of any other way.
Func rbf_ctrl_pts("rbf_ctrl_pts");
// Initialization with all zero
rbf_ctrl_pts(x, y, c) = cast<float>(0);
// Index to iterate with
RDom idx(0, num_ctrl_pts);
// Loop code
// Subtract the vectors
Expr red_sub   = (*in_func)(x, y, 0) - (*ctrl_pts_h)(0, idx);
Expr green_sub = (*in_func)(x, y, 1) - (*ctrl_pts_h)(1, idx);
Expr blue_sub  = (*in_func)(x, y, 2) - (*ctrl_pts_h)(2, idx);
// Take the L2 norm to get the distance
Expr dist = sqrt(red_sub*red_sub + green_sub*green_sub + blue_sub*blue_sub);
// Update persistent loop variables
rbf_ctrl_pts(x, y, c) = select(c == 0, rbf_ctrl_pts(x, y, c) + ((*weights_h)(0, idx) * dist),
                               c == 1, rbf_ctrl_pts(x, y, c) + ((*weights_h)(1, idx) * dist),
                                       rbf_ctrl_pts(x, y, c) + ((*weights_h)(2, idx) * dist));
You can use split (or tile) together with rfactor on the idx dimension of rbf_ctrl_pts to factor and schedule the reduction; getting locality on the weights should be doable via these mechanisms. I'm not 100% sure the associativity prover will handle the select, so it may be necessary to unroll by channels or to use a Tuple across the channels. (Although, in the code above, I'm not sure the select is doing anything compared to just passing c through.)
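For example, a minimal sketch of that suggestion, assuming the associativity prover accepts the update (the names idx_o, idx_i, u and the chunk size 64 are arbitrary choices, not part of the original code):
RVar idx_o("idx_o"), idx_i("idx_i");
Var u("u");
// Split the 3000+ weights into chunks so each chunk stays resident.
rbf_ctrl_pts.update().split(idx, idx_o, idx_i, 64);
// rfactor pulls the per-chunk partial sums out into a separate Func,
// which can then be scheduled (tiled, parallelised) on its own.
Func partials = rbf_ctrl_pts.update().rfactor(idx_o, u);
partials.compute_root();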

How to perform operation for all matrix elements in Scilab?

I'm trying to simulate the heat distribution on an infinite plate over time. For this purpose I've written a Scilab script. The crucial part of it is the calculation of the temperature at all plate points, which has to be done for every time instant I want to observe:
for j=2:S-1
    for i=2:S-1
        heat(i,j) = tcoeff*10000*(plate(i-1,j) + plate(i+1,j) - 4*plate(i,j) + plate(i,j-1) + plate(i,j+1)) + plate(i,j);
    end
end
The problem is that for a 100x100-point plate this means (for the inner part alone, without boundary conditions) looping 98x98 = 9604 times, at every turn calculating the heat at a given (i,j) point. If I'd like to observe that for, say, 100 seconds with a 1 s step, I have to repeat it 100 times, giving 960,400 iterations in total, which takes quite a long time, and I'd like to avoid that. Up to a 50x50 plate, it all happens in a reasonable 4-5 second time frame.
Now my question is: is it necessary to do all this using for loops? Is there any built-in aggregate function in Scilab that will let me do this for all elements of a matrix? The reason I haven't found a way yet is that the result for every point depends on the values of other matrix points, and that made me use nested loops. Any ideas on how to make it faster are appreciated.
It seems to me that you want to compute a 2D intercorrelation of your heat field with a certain diffusion pattern. This pattern can be thought of as a "filter" kernel, which is a common way of modifying images with a linear filter matrix. Your "filter" is:
F=[0,1,0;1,-4,1;0,1,0];
If you install the Image Processing Toolbox (IPD) you will have a MaskFilter function to do this 2D intercorrelation.
S=500;
plate=rand(S,S);
tcoeff=1;
//your solution with nested for loops
t0=getdate();
for j=2:S-1
    for i=2:S-1
        heat(i,j) = tcoeff*10000*(plate(i-1,j)+plate(i+1,j)-..
            4*plate(i,j)+plate(i,j-1)+plate(i,j+1))+plate(i,j);
    end
end
t1=getdate();
T0=etime(t1,t0);
mprintf("\nNested for loops: %f s (100 %%)",T0);
//optimised nested for loop
F=[0,1,0;1,-4,1;0,1,0]; //"filter" matrix
F=tcoeff*10000*F;
heat2=zeros(plate);
t0=getdate();
for j=2:S-1
    for i=2:S-1
        heat2(i,j) = sum(F.*plate(i-1:i+1,j-1:j+1));
    end
end
heat2=heat2+plate;
t1=getdate();
T2=etime(t1,t0);
mprintf("\nNested for loops optimised: %f s (%.2f %%)",T2,T2/T0*100);
//MaskFilter from IPD toolbox
t0=getdate();
heat3=MaskFilter(plate,F);
heat3=heat3+plate;
t1=getdate();
T3=etime(t1,t0);
mprintf("\nWith MaskFilter: %f s (%.2f %%)",T3,T3/T0*100);
disp(heat3(1:10,1:10)-heat(1:10,1:10),"Difference of the results (heat3-heat):");
Please note that MaskFilter pads the image (the original matrix) before applying the filter and, as far as I know, it uses a "mirror" array across the border. You should check whether this behaviour is appropriate for you.
The speed increase is about 320x (the execution time is 0.32% of that of your original code). Is that fast enough?
In theory it could be done with two 2D Fourier transforms (with Scilab's builtin mfft, maybe), but it might not be faster than this. See here: http://mailinglists.scilab.org/Image-processing-filter-td2618144.html#a2618168
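For completeness, a toolbox-free sketch of the same update: the five-point stencil can also be vectorized with plain array slicing (same S, plate and tcoeff as above; heat4 is my name for the result):
heat4 = plate; // boundary rows/columns keep their old values
heat4(2:S-1,2:S-1) = tcoeff*10000*(plate(1:S-2,2:S-1) + plate(3:S,2:S-1)..
    - 4*plate(2:S-1,2:S-1) + plate(2:S-1,1:S-2) + plate(2:S-1,3:S)) + plate(2:S-1,2:S-1);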
Please consider that there is a big difference between vectorizing an operation and parallel computation, as I have explained here. Although vectorizing might improve performance a little, that's not comparable to what you can achieve through GPU computing, for example (e.g. OpenCL). I will try to explain a vectorized form of your code without going too much into the details. Consider these as given:
S = ...;
tcoeff = ...;
function Plate = plate(i, j)
...;
endfunction
function Heat = heat(i, j)
...;
endfunction
Now you could define a meshgrid:
x = 2 : S - 1;
y = 2 : S - 1;
[M, N] = meshgrid(x,y);
Result = feval(M, N, heat);
feval is the key here: it broadcasts the heat function over the M and N matrices.
Your scheme is a finite-difference discretization of the Laplacian operator on a rectangular grid. If you choose a row-wise or column-wise numbering of your degrees of freedom (here the plate(i,j)) so as to treat them as one vector, then applying your "discrete" Laplacian amounts to multiplying by a sparse matrix on the left, which is very fast. This is particularly well explained in the following document:
https://www.math.uci.edu/~chenlong/226/FDMcode.pdf.
The implementation is described in MATLAB but is easily translated into Scilab.
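As a minimal sketch of that idea in MATLAB notation (using the question's variables; the Kronecker-product construction is one of several equivalent ways to build the matrix, and boundary contributions are omitted here):
n = S-2;                                    % interior points per direction
e = ones(n,1);
T = spdiags([e -2*e e], -1:1, n, n);        % 1D second-difference matrix
L = kron(speye(n), T) + kron(T, speye(n));  % 2D five-point Laplacian (interior only)
u = plate(2:S-1, 2:S-1);
v = tcoeff*10000*(L*u(:)) + u(:);           % one sparse mat-vec per time step
heat = reshape(v, n, n);                    % interior of the updated plate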

Parallelising gradient calculation in Julia

I was persuaded some time ago to drop my comfortable MATLAB programming and start programming in Julia. I have been working for a long time with neural networks, and I thought that with Julia I could get things done faster by parallelising the calculation of the gradient.
The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
Though the principle is simple, when I parallelise with Julia I get a performance degradation: one process is faster than two! I am obviously doing something wrong... I have consulted other questions asked in the forum, but I could still not piece together an answer. I think my problem is that there is a lot of unnecessary data movement going on, but I can't fix it properly.
In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.
The code-block below creates some data for a linear regression problem. The comments explain the constants; X is the matrix containing the data inputs. We randomly create a weight vector w which, when multiplied with X, creates some targets Y.
######################################
## CREATE LINEAR REGRESSION PROBLEM ##
######################################
# This code implements a simple linear regression problem
MAXITER = 100 # number of iterations for simple gradient descent
N = 10000 # number of data items
D = 50 # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue # generate targets
The next code-block defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
####################################
## DEFINE FUNCTIONS ##
####################################
@everywhere begin
#-------------------------------------------------------------------
function negative_loglikelihood(Y,X,W)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here log-likelihood
ll = 0
for nn=1:N
ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
end
return ll
end
#-------------------------------------------------------------------
function negative_loglikelihood_grad(Y,X,W, first_index,last_index)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here gradient contributions by each data item
grad = zeros(similar(W))
for nn=first_index:last_index
grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
end
return grad
end
end
Note that the above functions are on purpose not vectorised! I choose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).
Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:
####################################
## SOLVE LINEAR REGRESSION ##
####################################
# start from random initial solution
W = randn(D,1)
# learning rate, set here to some arbitrary small constant
eta = 0.000001
# the following for-loop implements simple gradient descent
for iter=1:MAXITER
# get gradient
ref_array = Array(RemoteRef, nworkers())
# let each worker process part of matrix X
for index=1:length(workers())
# first index of subset of X that worker should work on
first_index = (index-1)*int(ceil(N/nworkers())) + 1
# last index of subset of X that worker should work on
last_index = min((index)*(int(ceil(N/nworkers()))), N)
ref_array[index] = @spawn negative_loglikelihood_grad(Y,X,W, first_index,last_index)
end
# gather the gradients calculated on parts of matrix X
grad = zeros(similar(W))
for index=1:length(workers())
grad = grad + fetch(ref_array[index])
end
# now that we have the gradient we can update parameters W
W = W + eta*grad;
# report progress, monitor optimisation
@printf("Iter %d neg_loglikel=%.4f\n", iter, negative_loglikelihood(Y,X,W))
end
As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient in as many parts as available workers. Each worker is required to work only on part of matrix X, which part is specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:
worker 1 => first_index = 1, last_index = 2500
worker 2 => first_index = 2501, last_index = 5000
worker 3 => first_index = 5001, last_index = 7500
worker 4 => first_index = 7501, last_index = 10000
Unfortunately, this entire code works faster if I have only one worker. If I add more workers via addprocs(), the code runs slower. One can aggravate the issue by creating more data items, for instance N=20000: with more data items, the degradation is even more pronounced.
In my particular computing environment, with N=20000 and one core the code runs in ~9 s; with N=20000 and 4 cores it takes ~18 s!
I tried many different things inspired by the questions and answers in this forum, but unfortunately to no avail. I realise that the parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. The documentation is also a bit scarce on this issue (as is the nice book by Ivo Balbaert).
I would appreciate your help as I have been stuck for quite some while with this and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying-pasting you can get the code here.
Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!
I would say that GD is not a good candidate for parallelisation using any of the proposed methods: either SharedArray or DistributedArray, or one's own implementation of distributing chunks of data.
The problem does not lie in Julia, but in the GD algorithm.
Consider the code:
Main process:
for iter = 1:iterations #iterations: "the more the better"
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
The problem is the above for-loop, which is unavoidable. No matter how good _gradient_descent_shared is, the total number of sequential iterations kills the noble concept of parallelization.
After reading the question and the suggestion above, I started implementing GD using SharedArray. Please note that I'm not an expert on SharedArrays.
The main process parts (simple implementation without regularization):
run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
N = length(y)
for iter = 1:iterations
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
θ
end
_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
if size(X,1) <= length(procs(X))
return _gradient_descent_serial(X, y, θ)
else
rrefs = map(p -> (@spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
return mapreduce(r -> fetch(r), op, rrefs)
end
end
The code common to all workers:
#= Returns the range of indices of a chunk for every worker on which it can work.
The function splits data examples (N rows into chunks),
not the parts of the particular example (features dimensionality remains intact).=#
@everywhere function _worker_range(S::SharedArray)
idx = indexpids(S)
if idx == 0
return 1:size(S,1), 1:size(S,2)
end
nchunks = length(procs(S))
splits = [round(Int, s) for s in linspace(0,size(S,1),nchunks+1)]
splits[idx]+1:splits[idx+1], 1:size(S,2)
end
# Computations on a chunk of the whole data.
@everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
prange = _worker_range(X)
pX = sdata(X[prange[1], prange[2]])
py = sdata(y[prange[1],:])
tempδ = pX' * (pX * sdata(θ) .- py)
end
The data loading and training. Let me assume that we have:
features in X::Array of the size (N,D), where N - number of examples, D-dimensionality of the features
labels in y::Array of the size (N,1)
The main code might look like this:
X=[ones(size(X,1)) X] #adding the artificial coordinate
N, D = size(X)
MAXITER = 500
α = 0.01
initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
gc()
finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);
After implementing this and running it (on the 8 cores of my Intel Core i7), I got a very slight acceleration over serial GD (1 core) on my multiclass (19 classes) training data: 715 s for serial GD vs 665 s for shared GD.
If my implementation is correct (please check this out; I'm counting on that), then parallelization of the GD algorithm is not worth it. You might well get better acceleration using stochastic GD on a single core.
If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.
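A minimal sketch of that suggestion in the Julia 0.4-era syntax of the question, assuming X, Y and W have themselves been made SharedArrays visible to the workers; fill_grad! is a hypothetical worker function that accumulates the partial gradient for rows first_index:last_index into its own column:
# one column per worker, preallocated once rather than per iteration
partials = SharedArray(Float64, (D, nworkers()))
@sync for (k, p) in enumerate(workers())
    first_index = (k-1)*ceil(Int, N/nworkers()) + 1
    last_index  = min(k*ceil(Int, N/nworkers()), N)
    # fill_grad! (hypothetical) writes the partial gradient into column k
    @async remotecall_wait(p, fill_grad!, partials, k, first_index, last_index)
end
grad = sum(sdata(partials), 2)   # add up the partial gradients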

matlab: optimum amount of points for linear fit

I want to make a linear fit to a few data points, as shown in the image. Since I know the intercept (in this case, say, 0.05), I want to fit only the points which are in the linear region with this particular intercept. In this case that will be, let's say, points 5:22 (but not 22:30).
I'm looking for a simple algorithm to determine this optimal number of points, based on... hmm, that's the question... R^2? Any ideas how to do it?
I was thinking about probing R^2 for fits using points 1:30, then 2:30, and so on, but I don't really know how to wrap that into a clear and simple function. For fits with a fixed intercept I'm using polyfit0 (http://www.mathworks.com/matlabcentral/fileexchange/272-polyfit0-m). Thanks for any suggestions!
EDIT:
sample data:
intercept = 0.043;
x = 0.01:0.01:0.3;
y = [0.0530642513911393,0.0600786706929529,0.0673485248329648,0.0794662409166333,0.0895915873196170,0.103837395346484,0.107224784565365,0.120300492775786,0.126318699218730,0.141508831492330,0.147135757370947,0.161734674733680,0.170982455701681,0.191799936622712,0.192312642057298,0.204771365716483,0.222689541632988,0.242582251060963,0.252582727297656,0.267390860166283,0.282890010610515,0.292381165948577,0.307990544720676,0.314264952297699,0.332344368808024,0.355781519885611,0.373277721489254,0.387722683944356,0.413648156978284,0.446500064130389;];
What you have here is a rather difficult problem to find a general solution of.
One approach would be to compute the slopes/intercepts between all consecutive pairs of points, and then do a cluster analysis on the intercepts:
slopes = diff(y)./diff(x);
intercepts = y(1:end-1) - slopes.*x(1:end-1);
idx = kmeans(intercepts, 3);
x([idx; 3] == 2)   % [idx; 3] pads idx to the length of x; this selects the points whose intercepts are closest to the linear one
This requires the Statistics Toolbox (for kmeans). It was the best of all the methods I tried, although the range of points found this way might have a few small holes in it; e.g., when the slope between two points near the start or end of the range happens to lie close to the slope of the line, those points get detected as belonging to the line. This (and other factors) means the solution found this way needs a bit more post-processing.
Another approach (which I failed to make work) is to do a linear fit in a loop, each time growing the range of points from some point in the middle towards both endpoints, and checking whether the sum of squared errors stays small. I gave up on this very quickly, because defining what "small" means is very subjective and must be done heuristically.
I tried a more systematic and robust approach of the above:
function test
    %% example data
    slope = 2;
    intercept = 1.5;
    x = linspace(0.1, 5, 100).';
    y = slope*x + intercept;
    y(1:12) = log(x(1:12)) + y(12) - log(x(12));
    y(74:100) = y(74:100) + (x(74:100)-x(74)).^8;
    y = y + 0.2*randn(size(y));
    %% simple algorithm
    [X, fn] = fminsearch(@(ii) P(ii, x,y,intercept), [0.5 0.5])
    [~, inds] = P(X, x,y,intercept)
end
function [C, inds] = P(ii, x,y,intercept)
    % ii represents the fraction of the range from the center to each end,
    % so both entries of ii lie between 0 and 1.
    N = numel(x);
    n = round(N/2);
    ii = round(ii*n);
    inds = min(max(1, n+(-ii(1):ii(2))), N);
    % Solve the linear system with fixed intercept
    A = x(inds);
    b = y(inds) - intercept;
    % and return the sum of squared errors, divided by the number of
    % points included in the set. This last step is required to prevent
    % fminsearch from reducing the set to 1 point (= minimum possible
    % squared error).
    C = sum(((A\b)*A - b).^2)/numel(inds);
end
which only finds a rough approximation to the desired indices (12 and 74 in this example).
When fminsearch is run a few dozen times with random starting values (really just rand(1,2)), it gets more reliable, but I still wouldn't bet my life on it.
If you have the statistics toolbox, use the kmeans option.
Depending on the number of data values, I would split the data into a relatively small number of overlapping segments and, for each segment, calculate the linear fit, or rather the first-order coefficient (remember that you know the intercept, which will be the same for all segments).
Then, for each coefficient, calculate the MSE between this hypothetical line and the entire dataset, choosing the coefficient that yields the smallest MSE.
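A rough sketch of that idea, using the sample data and the known intercept from the question (the segment length of 10 and the variable names are arbitrary choices):
segLen = 10;                                 % arbitrary segment length
nSeg = numel(x) - segLen + 1;                % overlapping segments, shifted by one
slopes = zeros(nSeg,1); mse = zeros(nSeg,1);
for k = 1:nSeg
    xi = x(k:k+segLen-1).';                  % segment abscissae (column)
    yi = y(k:k+segLen-1).' - intercept;      % subtract the known intercept
    slopes(k) = xi \ yi;                     % least-squares slope with fixed intercept
    mse(k) = mean((slopes(k)*x + intercept - y).^2);  % error against the entire dataset
end
[~, kBest] = min(mse);
fprintf('best slope %.4f from segment %d:%d\n', slopes(kBest), kBest, kBest+segLen-1)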
