MATLAB vectorization: computing a neighborhood matrix - performance

Given two vectors X and Y of length n, representing points on the plane, and a neighborhood radius rad, is there a vectorized way to compute the neighborhood matrix of the points?
In other words, can the following (painfully slow for large n) loop be vectorized:
neighborhood_mat = zeros(n, n);
for i = 1 : n
for j = 1 : i - 1
dist = norm([X(j) - X(i), Y(j) - Y(i)]);
if (dist < radius)
neighborhood_mat(i, j) = 1;
neighborhood_mat(j, i) = 1;
end
end
end

Approach #1
bsxfun based approach -
out = bsxfun(@minus,X,X').^2 + bsxfun(@minus,Y,Y').^2 < radius^2
out(1:n+1:end)= 0
Approach #2
Distance matrix calculation using matrix-multiplication based approach (possibly faster) -
A = [X(:) Y(:)]
A_t = A.'; %//'
out = [-2*A A.^2 ones(n,2)]*[A_t ; ones(2,n) ; A_t.^2] < radius^2
out(1:n+1:end)= 0
Approach #3
With pdist and squareform -
A = [X(:) Y(:)]
out = squareform(pdist(A))<radius
out(1:n+1:end)= 0
Approach #4
You can use pdist as with the previous approach, but avoid squareform with some logical indexing to get the final output of neighbourhood matrix as shown below -
A = [X(:) Y(:)]
dists = pdist(A)< radius
mask_lower = bsxfun(@gt,[1:n]',1:n) %//'
%// OR tril(true(n),-1)
mask_upper = bsxfun(@lt,[1:n]',1:n) %//'
%// OR mask_upper = triu(true(n),1)
%// OR mask_upper = ~mask_lower; mask_upper(1:n+1:end) = false;
out = zeros(n)
out(mask_lower) = dists
out_t = out' %//'
out(mask_upper) = out_t(mask_upper)
Note: As one can see, all of the above approaches allocate an n-by-n output. A fast way to pre-allocate it (when out does not exist yet) is out(n,n) = 0, which is based upon this wonderful blog on undocumented MATLAB. This should really speed up those approaches!
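As a sanity check on the vectorized idea, here is a minimal sketch that compares it with the loop from the question. It assumes MATLAB R2016b or later, where implicit expansion makes the bsxfun calls unnecessary, and that X and Y are column vectors. The values of n and radius and the random data are illustrative only.

```matlab
% Small random test case; n, X, Y and radius are illustrative values.
rng(0);
n = 200;
X = rand(n, 1);
Y = rand(n, 1);
radius = 0.1;

% Reference: the double loop from the question.
ref = zeros(n, n);
for i = 1:n
    for j = 1:i-1
        if norm([X(j) - X(i), Y(j) - Y(i)]) < radius
            ref(i, j) = 1;
            ref(j, i) = 1;
        end
    end
end

% Implicit expansion (R2016b+): X - X.' is already n-by-n, no bsxfun needed.
out = (X - X.').^2 + (Y - Y.').^2 < radius^2;
out(1:n+1:end) = false;     % a point is not its own neighbor

agree = isequal(double(out), ref);
```

Comparing squared distances against radius^2 avoids n^2 square roots. The result is a logical matrix, so convert it with double(out) if a numeric matrix is needed.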

The following approach is great if the number of points in your neighborhoods is small or you run low on memory using the brute-force approach:
If you have the statistics toolbox installed, you can have a look at the rangesearch method.
(Free alternatives include the k-d tree implementations of a range search on the File Exchange.)
The usage of rangesearch is straightforward:
P = [X,Y];
[idx,D] = rangesearch(P, P, rad);
It returns a cell-array idx of the indices of nodes within reach and their distances D.
Depending on the size of your data, this could be beneficial in terms of speed and memory.
Instead of computing all pairwise distances and then filtering out those that are large, this algorithm builds a data structure called a k-d tree to more efficiently search close points.
You can then use this to build a sparse matrix:
I = cell2mat(idx.').';
J = runLengthDecode(cellfun(@numel,idx));
n = size(P,1);
S = sparse(I,J,1,n,n)-speye(n);
(This uses the runLengthDecode function from this answer.)
You can also have a look at the KDTreeSearcher class if your data points don't change and you want to query your data lots of times.
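If the runLengthDecode helper is not at hand, the column index vector can also be built with repelem (available from R2015a). The sketch below is an alternative, not the method of the original answer. To keep it runnable without the Statistics Toolbox, idx is a small hand-made stand-in for the cell array that rangesearch returns (one row vector of neighbor indices per point, including the point itself).

```matlab
% Hand-made stand-in for the idx output of rangesearch (assumption):
% point 1 is near {1,2}, point 2 is near {2,1,3}, point 3 is near {3,2}.
idx = {[1 2]; [2 1 3]; [3 2]};
n = numel(idx);

I = [idx{:}].';                               % neighbor indices, stacked into a column
J = repelem((1:n).', cellfun(@numel, idx));   % query index repeated once per neighbor
S = sparse(I, J, 1, n, n) - speye(n);         % drop the self-matches on the diagonal
```

For this toy input, full(S) is the adjacency matrix of a three-point chain.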

Related

Efficient way of computing multivariate gaussian varying the mean - Matlab

Is there an efficient way to compute a multivariate Gaussian (as below) that returns the matrix p, that is, making use of some sort of vectorization? I am aware that matrix p is symmetric, but still, for a matrix of size 40000x3, for example, this will take quite a long time.
Matlab code example:
DataMatrix = [3 1 4; 1 2 3; 1 5 7; 3 4 7; 5 5 1; 2 3 1; 4 4 4];
[rows, cols ] = size(DataMatrix);
I = eye(cols);
p = zeros(rows);
for k = 1:rows
p(k,:) = mvnpdf(DataMatrix(:,:),DataMatrix(k,:),I);
end
Stage 1: Hack into source code
Iteratively we are performing mvnpdf(DataMatrix(:,:),DataMatrix(k,:),I)
The syntax is : mvnpdf(X,Mu,Sigma).
Thus, the correspondence with our input becomes :
X = DataMatrix(:,:);
Mu = DataMatrix(k,:);
Sigma = I
For the sizes relevant to our situation, the source code mvnpdf.m reduces to -
%// Store size parameters of X
[n,d] = size(X);
%// Get vector mean, and use it to center data
X0 = bsxfun(@minus,X,Mu);
%// Make sure Sigma is a valid covariance matrix
[R,err] = cholcov(Sigma,0);
%// Create array of standardized data, and compute log(sqrt(det(Sigma)))
xRinv = X0 / R;
logSqrtDetSigma = sum(log(diag(R)));
%// Finally get the quadratic form and thus, the final output
quadform = sum(xRinv.^2, 2);
p_out = exp(-0.5*quadform - logSqrtDetSigma - d*log(2*pi)/2)
Now, if the Sigma is always an identity matrix, we would have R as an identity matrix too. Therefore, X0 / R would be same as X0, which is saved as xRinv. So, essentially quadform = sum(X0.^2, 2);
Thus, the original code -
for k = 1:rows
p(k,:) = mvnpdf(DataMatrix(:,:),DataMatrix(k,:),I);
end
reduces to -
[n,d] = size(DataMatrix);
[R,err] = cholcov(I,0);
p_out = zeros(n);
K = sum(log(diag(R))) + d*log(2*pi)/2;
for k = 1:n
X0 = bsxfun(@minus,DataMatrix,DataMatrix(k,:));
quadform = sum(X0.^2, 2);
p_out(k,:) = exp(-0.5*quadform - K);
end
Now, if the input matrix is of size 40000x3, you might want to stop here. But with system resources permitting, you can vectorize everything as discussed next.
Stage 2: Vectorize everything
Now that we see what's actually going on and that the computations look parallelizable, it's time to step up to bsxfun in 3D with its good friend permute for a vectorized solution, like so -
%// Get size params and R
[n,d] = size(DataMatrix);
[R,err] = cholcov(I,0);
%// Calculate constants : "logSqrtDetSigma" and "d*log(2*pi)/2"
K1 = sum(log(diag(R)));
K2 = d*log(2*pi)/2;
%// Major thing happening here as we calculate "X0" for all iterations
%// in one go with permute and bsxfun
diffs = bsxfun(@minus,DataMatrix,permute(DataMatrix,[3 2 1]));
%// "Sigma" is an identity matrix, so it plays no role in "/R" at "xRinv = X0 / R".
%// Perform elementwise squaring and summing rows to get vectorized "quadform"
quadform1 = squeeze(sum(diffs.^2,2))
%// Finally use "quadform1" and get vectorized output as a 2D array
p_out = exp(-0.5*quadform1 - K1 - K2)
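The 3D bsxfun step builds an n-by-d-by-n array, which may not fit in memory for n = 40000. One possible way around that, sketched below under the same assumption that Sigma is the identity, is to expand the squared distance as |xi|^2 + |xj|^2 - 2*xi.xj so that only n-by-n arrays are formed. This swaps in a matrix-multiplication identity and is not part of the answer above. The names sq, p_mm and p_ref are illustrative.

```matlab
DataMatrix = [3 1 4; 1 2 3; 1 5 7; 3 4 7; 5 5 1; 2 3 1; 4 4 4];
[n, d] = size(DataMatrix);

% Squared distances via |xi|^2 + |xj|^2 - 2*xi.xj (no n-by-d-by-n array).
sq = sum(DataMatrix.^2, 2);                          % n-by-1 squared norms
quadform = bsxfun(@plus, sq, sq.') - 2*(DataMatrix*DataMatrix.');
quadform = max(quadform, 0);                         % guard against rounding below zero

% Sigma = I, so log(sqrt(det(Sigma))) = 0 and only the 2*pi constant remains.
p_mm = exp(-0.5*quadform - d*log(2*pi)/2);

% Reference: the single-loop version from Stage 1.
p_ref = zeros(n);
for k = 1:n
    X0 = bsxfun(@minus, DataMatrix, DataMatrix(k,:));
    p_ref(k,:) = exp(-0.5*sum(X0.^2, 2) - d*log(2*pi)/2);
end
max_err = max(abs(p_mm(:) - p_ref(:)));
```

The n-by-n output itself still needs about 12.8 GB in double precision for n = 40000, so for data of that size the rows would probably have to be processed in blocks.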

Vectorizing distance calculation between vectors

I have a 3 X 1000 (and later 3 X 10 000) matrix cord given, which contains the three dimensional coordinates for my pixels.
My intention is to calculate the distance between all the pixels, and I do it with a for loop (see below), but I will have to calculate this for huge matrices soon, and am wondering if I could vectorize the code for making it faster...?
dist = zeros(size(cord,2),size(cord,2));
for i = 1:size(cord,2)
for j = 1:size(cord,2)
dist(i,j) = norm(cord(:,i)-cord(:,j));
dist(j,i) = dist(i,j);
end
end
pdist does exactly that. squareform is needed to get the result in the form of a square, symmetric matrix:
dist = squareform(pdist(cord.'));
Approach 1 (Vectorized approach with bsxfun) -
squeeze(sqrt(sum(bsxfun(@minus,cord,permute(cord,[1 3 2])).^2)))
Not sure if this will be faster though.
Approach 2 -
Inspired by this very smart approach and all credits to the poster. The code posted here is just slightly customized for your case and hopefully slightly better in terms of runtime. Here it is -
A = cord'; %//'
numA = size(cord,2);
helpA = ones(numA,9);
helpB = ones(numA,9);
for idx = 1:3
sqA_idx = A(:,idx).^2;
helpA(:,3*idx-1:3*idx) = [-2*A(:,idx), sqA_idx ];
helpB(:,3*idx-2:3*idx-1) = [sqA_idx , A(:,idx)];
end
dist1 = sqrt(helpA * helpB'); %// desired output
From your code, you have recognized that the dist matrix is symmetric
dist(i,j) = norm(cord(:,i)-cord(:,j));
dist(j,i) = dist(i,j);
You could change the inner loop to account for this and reduce by roughly one half the number of calculations needed
for j = i:size(cord,2)
Further, we can avoid the dist(j,i) = dist(i,j); at each iteration and just do that at the end by extracting the upper triangle part of dist and adding its transpose to the dist matrix to account for the symmetry
dist = zeros(size(cord,2),size(cord,2));
for i = 1:size(cord,2)
for j = i:size(cord,2)
dist(i,j) = norm(cord(:,i)-cord(:,j));
end
end
dist = dist + triu(dist)';
The above addition is fine since the main diagonal is all zeros.
It still performs poorly though and so we should take advantage of vectorization. We can do that as follows against the inner loop
dist = zeros(size(cord,2),size(cord,2));
for i = 1:size(cord,2)
dist(i,i+1:end) = sum((repmat(cord(:,i),1,size(cord,2)-i)-cord(:,i+1:end)).^2);
end
dist = dist + triu(dist)';
dist = sqrt(dist);
For every element in cord we need to calculate its distance with all other elements that follow it. We reproduce the element with repmat so that we can subtract it from every element that follows without the need for the loop. The differences are squared and summed and assigned to the dist matrix. We take care of the symmetry and then take the square root of the matrix to complete the norm operation.
With tic and toc, the original distance calculation with a random cord (cord = rand(3,num);) took ~93 seconds. This version took ~2.8.
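A minimal sketch of the same half-vectorized loop, checked against a plain double loop. It makes two small changes that are assumptions of this sketch and not part of the answer: bsxfun replaces repmat, and sum is given an explicit dimension so that it always sums over the coordinate rows. The size num is illustrative.

```matlab
rng(0);
num = 100;                       % illustrative size
cord = rand(3, num);

% Reference: plain double loop over the strict upper triangle.
ref = zeros(num);
for i = 1:num
    for j = i+1:num
        ref(i, j) = norm(cord(:, i) - cord(:, j));
    end
end
ref = ref + ref.';

% Half-vectorized loop: one row of squared distances per iteration.
dist = zeros(num);
for i = 1:num-1
    d = bsxfun(@minus, cord(:, i), cord(:, i+1:end));   % 3-by-(num-i) differences
    dist(i, i+1:end) = sum(d.^2, 1);                    % sum over the 3 coordinates
end
dist = sqrt(dist + dist.');      % mirror the upper triangle, then take the root

max_err = max(abs(dist(:) - ref(:)));
```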

Best practice when working with sparse matrices

My question is twofold:
In the below, A = full(S) where S is a sparse matrix.
What's the "correct" way to access an element in a sparse matrix?
That is, what would the sparse equivalent to var = A(row, col) be?
My view on this topic: You wouldn't do anything different. var = S(row, col) is as efficient as it gets.
What's the "correct" way to add elements to a sparse matrix?
That is, what would the sparse equivalent of A(row, col) = var be? (Assuming A(row, col) == 0 to begin with)
It is known that simply doing A(row, col) = var is slow for large sparse matrices. From the documentation:
If you wanted to change a value in this matrix, you might be tempted
to use the same indexing:
B(3,1) = 42; % This code does work, however, it is slow.
My view on this topic: When working with sparse matrices, you often start with the vectors and use them to create the matrix this way: S = sparse(i,j,s,m,n). Of course, you could also have created it like this: S = sparse(A) or sprand(m,n,density) or something similar.
If you started out the first way, you would simply do:
i = [i; new_i];
j = [j; new_j];
s = [s; new_s];
S = sparse(i,j,s,m,n);
If you started out not having the vectors, you would do the same thing, but use find first:
[i, j, s] = find(S);
i = [i; new_i];
j = [j; new_j];
s = [s; new_s];
S = sparse(i,j,s,m,n);
Now you would of course have the vectors, and can reuse them if you're doing this operation several times. It would however be better to add all new elements at once, and not do the above in a loop, because growing vectors are slow. In this case, new_i, new_j and new_s will be vectors corresponding to the new elements.
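One caveat with this pattern: sparse adds together the values of duplicate (i,j) pairs, so appending a triplet for a position that is already nonzero increases that entry and does not overwrite it. A minimal sketch with made-up values:

```matlab
m = 4; n = 4;
i = [1; 2];  j = [1; 3];  s = [10; 20];
S = sparse(i, j, s, m, n);

% New elements to add in one batch; note that (1,1) already exists in S.
new_i = [4; 1];  new_j = [2; 1];  new_s = [30; 5];

S2 = sparse([i; new_i], [j; new_j], [s; new_s], m, n);
% S2(1,1) is 10 + 5 = 15, because duplicate subscripts are summed.
```

If the intent is to replace existing entries, the old triplets for those positions have to be removed from i, j and s before rebuilding.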
MATLAB stores sparse matrices in compressed column format. This means that when you perform an operation like A(2,2) (to get the element at row 2, column 2), MATLAB first accesses the second column and then finds the element in row 2 (row indices in each column are stored in ascending order). You can think of it as:
A2 = A(:,2);
A2(2)
If you are only accessing a single element of sparse matrix doing var = S(r,c) is fine. But if you are looping over the elements of a sparse matrix, you probably want to access one column at a time, and then loop over the nonzero row indices via [i,~,x]=find(S(:,c)). Or use something like spfun.
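A minimal sketch of the column-at-a-time access pattern described above, on a small made-up matrix. The weighted sum is only there to give the inner loop something to do.

```matlab
% 4-by-3 sparse matrix with nonzeros at (1,1)=5, (3,1)=6, (2,3)=7, (4,3)=8.
S = sparse([1 3 2 4], [1 1 3 3], [5 6 7 8], 4, 3);

total = 0;
for c = 1:size(S, 2)
    % Pull out one column, then visit only its nonzero rows.
    [rows, ~, vals] = find(S(:, c));
    for k = 1:numel(rows)
        total = total + vals(k) * rows(k);   % value times its row index
    end
end
```

Column 2 is empty, so find returns empty vectors for it and the inner loop is skipped.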
You should avoid constructing a dense matrix A and then doing S = sparse(A), as this operation just squeezes out the zeros. Instead, as you note, it's much more efficient to build a sparse matrix from scratch using triplet form and a call to sparse(i,j,x,m,n). MATLAB has a nice page which describes how to efficiently construct sparse matrices.
The original paper describing the implementation of sparse matrices in MATLAB is quite a good read. It provides some more info on how the sparse matrix algorithms were originally implemented.
EDIT: Answer modified according to suggestions by Oleg (see comments).
Here is my benchmark for the second part of your question. For testing direct insertion, the matrices are initialized empty with a varying nzmax. For testing rebuilding from index vectors this is irrelevant as the matrix is built from scratch at every call. The two methods were tested for doing a single insertion operation (of a varying number of elements), or for doing incremental insertions, one value at a time (up to the same numbers of elements). Due to the computational strain I lowered the number of repetitions from 1000 to 100 for each test case. I believe this is still statistically viable.
Ssize = 10000;
NumIterations = 100;
NumInsertions = round(logspace(0, 4, 10));
NumInitialNZ = round(logspace(1, 4, 4));
NumTests = numel(NumInsertions) * numel(NumInitialNZ);
TimeDirect = zeros(numel(NumInsertions), numel(NumInitialNZ));
TimeIndices = zeros(numel(NumInsertions), 1);
%% Single insertion operation (non-incremental)
% Method A: Direct insertion
for iInitialNZ = 1:numel(NumInitialNZ)
disp(['Running with initial nzmax = ' num2str(NumInitialNZ(iInitialNZ))]);
for iInsertions = 1:numel(NumInsertions)
tSum = 0;
for jj = 1:NumIterations
S = spalloc(Ssize, Ssize, NumInitialNZ(iInitialNZ));
r = randi(Ssize, NumInsertions(iInsertions), 1);
c = randi(Ssize, NumInsertions(iInsertions), 1);
tic
S(r,c) = 1;
tSum = tSum + toc;
end
disp([num2str(NumInsertions(iInsertions)) ' direct insertions: ' num2str(tSum) ' seconds']);
TimeDirect(iInsertions, iInitialNZ) = tSum;
end
end
% Method B: Rebuilding from index vectors
for iInsertions = 1:numel(NumInsertions)
tSum = 0;
for jj = 1:NumIterations
i = []; j = []; s = [];
r = randi(Ssize, NumInsertions(iInsertions), 1);
c = randi(Ssize, NumInsertions(iInsertions), 1);
s_ones = ones(NumInsertions(iInsertions), 1);
tic
i_new = [i; r];
j_new = [j; c];
s_new = [s; s_ones];
S = sparse(i_new, j_new ,s_new , Ssize, Ssize);
tSum = tSum + toc;
end
disp([num2str(NumInsertions(iInsertions)) ' indexed insertions: ' num2str(tSum) ' seconds']);
TimeIndices(iInsertions) = tSum;
end
SingleOperation.TimeDirect = TimeDirect;
SingleOperation.TimeIndices = TimeIndices;
%% Incremental insertion
for iInitialNZ = 1:numel(NumInitialNZ)
disp(['Running with initial nzmax = ' num2str(NumInitialNZ(iInitialNZ))]);
% Method A: Direct insertion
for iInsertions = 1:numel(NumInsertions)
tSum = 0;
for jj = 1:NumIterations
S = spalloc(Ssize, Ssize, NumInitialNZ(iInitialNZ));
r = randi(Ssize, NumInsertions(iInsertions), 1);
c = randi(Ssize, NumInsertions(iInsertions), 1);
tic
for ii = 1:NumInsertions(iInsertions)
S(r(ii),c(ii)) = 1;
end
tSum = tSum + toc;
end
disp([num2str(NumInsertions(iInsertions)) ' direct insertions: ' num2str(tSum) ' seconds']);
TimeDirect(iInsertions, iInitialNZ) = tSum;
end
end
% Method B: Rebuilding from index vectors
for iInsertions = 1:numel(NumInsertions)
tSum = 0;
for jj = 1:NumIterations
i = []; j = []; s = [];
r = randi(Ssize, NumInsertions(iInsertions), 1);
c = randi(Ssize, NumInsertions(iInsertions), 1);
tic
for ii = 1:NumInsertions(iInsertions)
i = [i; r(ii)];
j = [j; c(ii)];
s = [s; 1];
S = sparse(i, j ,s , Ssize, Ssize);
end
tSum = tSum + toc;
end
disp([num2str(NumInsertions(iInsertions)) ' indexed insertions: ' num2str(tSum) ' seconds']);
TimeIndices(iInsertions) = tSum;
end
IncremenalInsertion.TimeDirect = TimeDirect;
IncremenalInsertion.TimeIndices = TimeIndices;
%% Plot results
% Single insertion
figure;
loglog(NumInsertions, SingleOperation.TimeIndices);
cellLegend = {'Using index vectors'};
hold all;
for iInitialNZ = 1:numel(NumInitialNZ)
loglog(NumInsertions, SingleOperation.TimeDirect(:, iInitialNZ));
cellLegend = [cellLegend; {['Direct insertion, initial nzmax = ' num2str(NumInitialNZ(iInitialNZ))]}];
end
hold off;
title('Benchmark for single insertion operation');
xlabel('Number of insertions'); ylabel('Runtime for 100 operations [sec]');
legend(cellLegend, 'Location', 'NorthWest');
grid on;
% Incremental insertions
figure;
loglog(NumInsertions, IncremenalInsertion.TimeIndices);
cellLegend = {'Using index vectors'};
hold all;
for iInitialNZ = 1:numel(NumInitialNZ)
loglog(NumInsertions, IncremenalInsertion.TimeDirect(:, iInitialNZ));
cellLegend = [cellLegend; {['Direct insertion, initial nzmax = ' num2str(NumInitialNZ(iInitialNZ))]}];
end
hold off;
title('Benchmark for incremental insertions');
xlabel('Number of insertions'); ylabel('Runtime for 100 operations [sec]');
legend(cellLegend, 'Location', 'NorthWest');
grid on;
I ran this in MATLAB R2012a. The results for a single insertion operation are summarized in this graph:
This shows that using direct insertion is much slower than using index vectors, if only a single operation is done. The growth in the case of using index vectors can be either because of growing the vectors themselves or from the lengthier sparse matrix construction, I'm not sure which. The initial nzmax used to construct the matrices seems to have no effect on their growth.
The results for doing incremental insertions are summarized in this graph:
Here we see the opposite trend: using index vectors is slower, because of the overhead of incrementally growing them and rebuilding the sparse matrix at every step. A way to understand this is to look at the first point in the previous graph: for insertion of a single element, it is more effective to use direct insertion rather than rebuilding using the index vectors. In the incremental case, this single insertion is done repeatedly, and so it becomes viable to use direct insertion rather than index vectors, against MATLAB's suggestion.
This understanding also suggests that were we to incrementally add, say, 100 elements at a time, the efficient choice would then be to use index vectors rather than direct insertion, as the first graph shows this method to be faster for insertions of this size. In between these two regimes is an area where you should probably experiment to see which method is more effective, though the results will probably show that the difference between the methods is negligible there.
Bottom line: which method should I use?
My conclusion is that this depends on the nature of your intended insertion operations.
If you intend to insert elements one at a time, use direct insertion.
If you intend to insert a large (>10) number of elements at a time, rebuild the matrix from index vectors.
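One detail worth keeping in mind when inserting several elements directly: with vector subscripts, S(r,c) = 1 assigns the whole numel(r)-by-numel(c) block, not just the pairs (r(k), c(k)), whereas sparse(r,c,...) treats the vectors as pairs. The single-operation direct insertion in the benchmark uses S(r,c) = 1, so the two methods there may not insert the same number of elements. A minimal sketch of the difference, with made-up subscripts:

```matlab
r = [1; 2];
c = [3; 4];

S_block = sparse(5, 5);
S_block(r, c) = 1;                           % sets the 2-by-2 block (1,3) (1,4) (2,3) (2,4)

S_pairs = sparse(5, 5);
S_pairs(sub2ind(size(S_pairs), r, c)) = 1;   % sets only (1,3) and (2,4)

S_built = sparse(r, c, 1, 5, 5);             % triplet construction, also pairwise
```

For a like-for-like comparison, the sub2ind form is probably the one to time against the index-vector rebuild.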

Improving performance of interpolation (Barycentric formula)

I have been given an assignment in which I am supposed to write an algorithm which performs polynomial interpolation by the barycentric formula. The formula states that:
$$p(x) = \frac{\sum_{j=0}^{n} \dfrac{w_j f_j}{x - x_j}}{\sum_{j=0}^{n} \dfrac{w_j}{x - x_j}}$$
I have written an algorithm which works just fine, and I get the polynomial output I desire. However, this requires the use of some quite long loops, and for a large number of grid points, lots of nasty loop operations will have to be done. Thus, I would greatly appreciate any hints as to how I may improve this, so that I can avoid all these loops.
In the algorithm, x and f stand for the given points we are supposed to interpolate. w stands for the barycentric weights, which have been calculated before running the algorithm. And grid is the linspace over which the interpolation should take place:
function p = barycentric_formula(x,f,w,grid)
%Assert x-vectors and f-vectors have same length.
if length(x) ~= length(f)
% error() stops the function cleanly; a bare return would leave p unassigned.
error('Not equal amounts of x- and y-values. Function is terminated.');
end
n = length(x);
m = length(grid);
p = zeros(1,m);
% Loops for finding polynomial values at grid points. All values are
% calculated by the barycentric formula.
for i = 1:m
var = 0;
sum1 = 0;
sum2 = 0;
for j = 1:n
if grid(i) == x(j)
p(i) = f(j);
var = 1;
else
sum1 = sum1 + (w(j)*f(j))/(grid(i) - x(j));
sum2 = sum2 + (w(j)/(grid(i) - x(j)));
end
end
if var == 0
p(i) = sum1/sum2;
end
end
This is a classic case for MATLAB 'vectorization'. I would say - just remove the loops. It is almost that simple. First, have a look at this code:
function p = bf2(x, f, w, grid)
m = length(grid);
p = zeros(1,m);
for i = 1:m
var = grid(i)==x;
if any(var)
p(i) = f(var);
else
sum1 = sum((w.*f)./(grid(i) - x));
sum2 = sum(w./(grid(i) - x));
p(i) = sum1/sum2;
end
end
end
I have removed the inner loop over j. All I did here was remove the (j) indexing and change the arithmetic operators from / to ./ and from * to .* - the same operators, but with a dot in front to signify that the operation is performed element by element. These are called array operators, in contrast to the ordinary matrix operators. Also note that the special case where a grid point falls onto x is treated much as in the original implementation, only using a logical vector var such that x(var)==grid(i).
Now, you can also remove the outermost loop. This is a bit more tricky, and there are two major approaches to doing that in MATLAB. I will do it the simpler way, which can be less efficient but is clearer to read - using repmat:
function p = bf3(x, f, w, grid)
% Find grid points that coincide with x.
% The below compares all grid values with all x values
% and returns a matrix of 0/1. 1 is in the (row,col)
% for which grid(row)==x(col)
var = bsxfun(@eq, grid', x);
% find the logical indexes of those x entries
varx = sum(var, 1)~=0;
% and of those grid entries
varp = sum(var, 2)~=0;
% Outer-most loop removal - use repmat to
% replicate the vectors into matrices.
% Thus, instead of having a loop over j
% you have matrices of values that would be
% referenced in the loop
ww = repmat(w, numel(grid), 1);
ff = repmat(f, numel(grid), 1);
xx = repmat(x, numel(grid), 1);
gg = repmat(grid', 1, numel(x));
% perform the calculations element-wise on the matrices
sum1 = sum((ww.*ff)./(gg - xx),2);
sum2 = sum(ww./(gg - xx),2);
p = sum1./sum2;
% fix the case where grid==x and return
p(varp) = f(varx);
end
The fully vectorized version can be implemented with bsxfun rather than repmat. This can potentially be a bit faster, since the matrices are not explicitly formed. However, the speed difference may not be large for small system sizes.
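A minimal sketch of what such a bsxfun version might look like, written as a script on a toy problem so that it can be run as is. It is one possible reading of the idea and is not code from the answer. The nodes x, values f = x.^2 and weights w = [1 -2 1] (proportional to the barycentric weights for three equispaced nodes) are illustrative. Coinciding points are located with find, so each matching grid point is paired with its own node regardless of ordering.

```matlab
x = [0 1 2];                 % interpolation nodes (row vector)
f = x.^2;                    % function values at the nodes
w = [1 -2 1];                % barycentric weights, up to a common factor
grid = [0 0.5 1 1.5 2];      % evaluation points (row vector)

% diffs(i,j) = grid(i) - x(j), formed without repmat.
diffs = bsxfun(@minus, grid(:), x);
% terms(i,j) = w(j) / (grid(i) - x(j)); Inf where a grid point hits a node.
terms = bsxfun(@rdivide, w, diffs);

% Numerator and denominator of the barycentric formula, one row per grid point.
p = (terms * f(:)) ./ sum(terms, 2);

% Fix the rows where grid(i) == x(j): the interpolant equals f(j) there.
[hitRow, hitCol] = find(diffs == 0);
p(hitRow) = f(hitCol);
p = p.';                     % return a row vector, like the loop version
```

Since the interpolant of x.^2 through three nodes is x.^2 itself, p should match grid.^2 here. Note that the variable name grid shadows MATLAB's grid function, as it does in the question.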
Also, the first solution with one loop is not too bad performance-wise. I suggest you test both and see which is better. Maybe it is not worth it to fully vectorize? The first code looks a bit more readable.

Summation without a for loop - MATLAB

I have 2 matrices: V which is square MxM, and K which is MxN. Calling the dimension across rows x and the dimension across columns t, I need to evaluate the integral (i.e sum) over both dimensions of K times a t-shifted version of V, the answer being a function of the shift (almost like a convolution, see below). The sum is defined by the following expression, where _{} denotes the summation indices, and a zero-padding of out-of-limits elements is assumed:
S(t) = sum_{x,tau}[V(x,t+tau) * K(x,tau)]
I manage to do it with a single loop, over the t dimension (vectorizing the x dimension):
% some toy matrices
V = rand(50,50);
K = rand(50,10);
[M N] = size(K);
S = zeros(1, M);
for t = 1 : N
S(1,1:end-t+1) = S(1,1:end-t+1) + sum(bsxfun(@times, V(:,t:end), K(:,t)),1);
end
I have similar expressions which I managed to evaluate without a for loop, using a combination of conv2 and/or mirroring (flipping) of a single dimension. However, I can't see how to avoid a for loop in this case (despite the apparent similarity to convolution).
Steps to vectorization
1] Perform sum(bsxfun(@times, V(:,t:end), K(:,t)),1) for all columns in V against all columns in K with matrix-multiplication -
sum_mults = V.'*K
This would give us a 2D array with each column representing sum(bsxfun(#times,.. operation at each iteration.
2] Step 1 gave us all possible summations, but the values to be summed are not aligned in the same row across iterations, so we need to do a bit more work before summing along rows. The rest of the work is about getting a shifted-up version. For that, you can use boolean indexing with an upper and a lower triangular boolean mask. Finally, we sum along each row for the final output. So, this part of the code would look like so -
valid_mask = tril(true(size(sum_mults)));
sum_mults_shifted = zeros(size(sum_mults));
sum_mults_shifted(flipud(valid_mask)) = sum_mults(valid_mask);
out = sum(sum_mults_shifted,2);
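Since the question mentions conv2, here is a sketch of one possible convolution-based alternative, checked against the loop. It is not part of the answer above. It assumes the loop's indexing (shift 1 means no shift) and zero-padding on the right: rotating K by 180 degrees turns the convolution into a correlation, and the 'valid' option keeps only the single row where V and K are aligned in x. The names Vpad and S_conv are illustrative.

```matlab
rng(0);
V = rand(50, 50);
K = rand(50, 10);
[M, N] = size(K);

% Reference: the single loop from the question.
S = zeros(1, M);
for t = 1:N
    S(1, 1:end-t+1) = S(1, 1:end-t+1) + sum(bsxfun(@times, V(:, t:end), K(:, t)), 1);
end

% Pad V with N-1 zero columns so every shift has a full M-by-N window.
Vpad = [V, zeros(M, N-1)];
% Correlate Vpad with K: conv2 with the kernel rotated by 180 degrees.
S_conv = conv2(Vpad, rot90(K, 2), 'valid');   % 1-by-M row vector

max_err = max(abs(S_conv - S));
```

Note that out from the masking approach is an M-by-1 column, while S and S_conv are 1-by-M rows, so transpose one of them before comparing.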
Runtime tests -
%// Inputs
V = rand(1000,1000);
K = rand(1000,200);
disp('--------------------- With original loopy approach')
tic
[M N] = size(K);
S = zeros(1, M);
for t = 1 : N
S(1,1:end-t+1) = S(1,1:end-t+1) + sum(bsxfun(@times, V(:,t:end), K(:,t)),1);
end
toc
disp('--------------------- With proposed vectorized approach')
tic
sum_mults = V.'*K; %//'
valid_mask = tril(true(size(sum_mults)));
sum_mults_shifted = zeros(size(sum_mults));
sum_mults_shifted(flipud(valid_mask)) = sum_mults(valid_mask);
out = sum(sum_mults_shifted,2);
toc
Output -
--------------------- With original loopy approach
Elapsed time is 2.696773 seconds.
--------------------- With proposed vectorized approach
Elapsed time is 0.044144 seconds.
This might be cheating (using arrayfun instead of a for loop) but I believe this expression gives you what you want:
S = arrayfun(@(t) sum(sum( V(:,(t+1):(t+N)) .* K )), 1:(M-N), 'UniformOutput', true)

Resources