Vectorizing three for loops - performance

I'm quite new to Matlab and I need help speeding up part of my code. I am writing a Matlab application that performs 3D matrix convolution, but unlike standard convolution the kernel is not constant: it needs to be calculated for each pixel of the image.
So far, I have ended up with a working code, but incredibly slow:
function result = calculateFilteredImages(images, T)
% images - matrix [480,360,10] of 10 grayscale images of height=480 and width=360
% represented as a value in the range [0..1]
% i.e. images(10,20,5) = 0.1231;
% T - some matrix [480,360,10, 3,3] of double values, calculated earlier
kerN = 5; %kernel size
mid=floor(kerN/2); %half the kernel size
offset=mid+1; %kernel offset
[h,w,n] = size(images);
%add padding so as not to get IndexOutOfBoundsEx during summation:
%[i.e. changes [1 2 3...10] to [0 0 1 2 ... 10 0 0]]
images = padarray(images,[mid, mid, mid]);
result(h,w,n)=0; %preallocate, faster than zeros(h,w,n)
kernel(kerN,kerN,kerN)=0; %preallocate
% the three parameters below are not important in this problem
% (are used to calculate sigma in x,y,z direction inside the loop)
sigMin=0.5;
sigMax=3;
d = 3;
for a = 1:n
tic;
for b = 1:w
for c = 1:h
M = squeeze(T(c,b,a,:,:)); % M is now a 3x3 matrix (squeeze drops the singleton dimensions)
[R, D] = eig(M); % get eigenvectors and eigenvalues - R and D are now 3x3 matrices
% eigenvalues
l1 = D(1,1);
l2 = D(2,2);
l3 = D(3,3);
sig1=sig( l1 , sigMin, sigMax, d);
sig2=sig( l2 , sigMin, sigMax, d);
sig3=sig( l3 , sigMin, sigMax, d);
% calculate kernel
for i=-mid:mid
for j=-mid:mid
for k=-mid:mid
x_new = [i,j,k] * R; %calculate new [i,j,k]
kernel(offset+i, offset+j, offset+k) = exp(- (((x_new(1))^2 )/(sig1^2) + ((x_new(2))^2)/(sig2^2) + ((x_new(3))^2)/(sig3^2)) /2);
end
end
end
% normalize
kernel=kernel/sum(kernel(:));
%perform summation
xm_sum=0;
for i=-mid:mid
for j=-mid:mid
for k=-mid:mid
xm_sum = xm_sum + kernel(offset+i, offset+j, offset+k) * images(c+mid+i, b+mid+j, a+mid+k);
end
end
end
result(c,b,a)=xm_sum;
end
end
toc;
end
end
I tried replacing the "calculating kernel" part with
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(@(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
but it turned out to be even slower than the loop. I went through several articles and tutorials on vectorization, but I'm quite stuck with this one.
Can it be vectorized, or sped up some other way?
I'm new to Matlab; maybe there are some built-in functions that could help in this case?
Update
The profiling result: [profiler screenshot not reproduced]
Sample data which was used during profiling:
T.mat
grayImages.mat

As Dennis noted, this is a lot of code; cutting it down to the minimum that the profiler flags as slow will help. I'm not sure if my code is equivalent to yours; can you try it and profile it? The 'trick' to Matlab vectorization is using .* and .^, which operate element-by-element instead of requiring loops. http://www.mathworks.com/help/matlab/ref/power.html
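For instance, a tiny illustration of the element-wise operators (an added example, not part of the original answer):
a = [1 2 3];
a .^ 2 % element-wise power: [1 4 9]
a .* [4 5 6] % element-wise multiply: [4 10 18]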
Take your rewritten part:
sigma=[sig1 sig2 sig3]
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
k2 = arrayfun(@(x, y, z) exp(-(norm([x,y,z]*R./sigma)^2)/2), x,y,z);
And just pick one sigma for now. Looping over 3 different sigmas isn't a performance problem if you can vectorize the underlying k2 formula.
EDIT: Changed the matrix_to_norm code to be x(:), and no commas. See Generate all possible combinations of the elements of some vectors (Cartesian product)
Then try:
% R & mid my test variables
R = [1 2 3; 4 5 6; 7 8 9];
mid = 5;
[x,y,z] = ndgrid(-mid:mid,-mid:mid,-mid:mid);
% meshgrid is also a possibility, check that you are getting the order you want
% Going to break the equation apart for now for clarity
% Matrix operation, should already be fast.
matrix_to_norm = [x(:) y(:) z(:)]*R/sig1
% Row-wise norms; note that plain norm() here would return the matrix
% 2-norm (largest singular value), not the per-point norms we want
matrix_normed = sqrt(sum(matrix_to_norm.^2, 2))
% Note the .^ - I believe you want element-by-element exponentiation, this will
% vectorize it.
k2 = exp(-0.5*(matrix_normed.^2))
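To fold this back into the original loop with all three sigmas at once, here is a minimal sketch (assuming R, mid, sig1, sig2 and sig3 are defined as in the question); each rotated coordinate is divided by its own sigma element-wise:
sigma = [sig1 sig2 sig3];
[x,y,z] = ndgrid(-mid:mid, -mid:mid, -mid:mid);
xyz = [x(:) y(:) z(:)] * R; % rotate all kernel coordinates in one matrix product
xyz = bsxfun(@rdivide, xyz, sigma); % scale each column by its own sigma
kernel = reshape(exp(-0.5 * sum(xyz.^2, 2)), size(x));
kernel = kernel / sum(kernel(:)); % normalize, as in the original loop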

Related

Efficient way of computing multivariate gaussian varying the mean - Matlab

Is there an efficient way to do the computation of a multivariate gaussian (as below) that returns matrix p, that is, making use of some sort of vectorization? I am aware that matrix p is symmetric, but still, for a matrix of size 40000x3, for example, this will take quite a long time.
Matlab code example:
DataMatrix = [3 1 4; 1 2 3; 1 5 7; 3 4 7; 5 5 1; 2 3 1; 4 4 4];
[rows, cols ] = size(DataMatrix);
I = eye(cols);
p = zeros(rows);
for k = 1:rows
p(k,:) = mvnpdf(DataMatrix(:,:),DataMatrix(k,:),I);
end
Stage 1: Hack into source code
Iteratively we are performing mvnpdf(DataMatrix(:,:),DataMatrix(k,:),I)
The syntax is : mvnpdf(X,Mu,Sigma).
Thus, the correspondence with our input becomes :
X = DataMatrix(:,:);
Mu = DataMatrix(k,:);
Sigma = I
For the sizes relevant to our situation, the source code mvnpdf.m reduces to -
%// Store size parameters of X
[n,d] = size(X);
%// Get vector mean, and use it to center data
X0 = bsxfun(@minus,X,Mu);
%// Make sure Sigma is a valid covariance matrix
[R,err] = cholcov(Sigma,0);
%// Create array of standardized data, and compute log(sqrt(det(Sigma)))
xRinv = X0 / R;
logSqrtDetSigma = sum(log(diag(R)));
%// Finally get the quadratic form and thus, the final output
quadform = sum(xRinv.^2, 2);
p_out = exp(-0.5*quadform - logSqrtDetSigma - d*log(2*pi)/2)
Now, if Sigma is always an identity matrix, we would have R as an identity matrix too. Therefore, X0 / R would be the same as X0, which is saved as xRinv. So, essentially quadform = sum(X0.^2, 2);
Thus, the original code -
for k = 1:rows
p(k,:) = mvnpdf(DataMatrix(:,:),DataMatrix(k,:),I);
end
reduces to -
[n,d] = size(DataMatrix);
[R,err] = cholcov(I,0);
p_out = zeros(rows);
K = sum(log(diag(R))) + d*log(2*pi)/2;
for k = 1:rows
X0 = bsxfun(@minus,DataMatrix,DataMatrix(k,:));
quadform = sum(X0.^2, 2);
p_out(k,:) = exp(-0.5*quadform - K);
end
Now, if the input matrix is of size 40000x3, you might want to stop here. But with system resources permitting, you can vectorize everything as discussed next.
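Before going fully vectorized, if memory is the constraint there is a middle ground (a sketch of mine, not part of either original approach; blockSize is a hypothetical tuning knob): vectorize over blocks of mean rows.
blockSize = 1000; %// hypothetical: trades memory for speed
p_out = zeros(rows);
K = sum(log(diag(R))) + d*log(2*pi)/2;
for k = 1:blockSize:rows
idx = k : min(k+blockSize-1, rows);
%// rows x d x numel(idx): every data row minus each mean row in the block
diffs = bsxfun(@minus, DataMatrix, permute(DataMatrix(idx,:), [3 2 1]));
quadform = squeeze(sum(diffs.^2, 2)); %// rows x numel(idx)
p_out(idx,:) = exp(-0.5*quadform - K).';
end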
Stage 2: Vectorize everything
Now that we see what's actually going on and that the computations look parallelizable, it's time to step up to bsxfun in 3D with its good friend permute for a vectorized solution, like so -
%// Get size params and R
[n,d] = size(DataMatrix);
[R,err] = cholcov(I,0);
%// Calculate constants : "logSqrtDetSigma" and "d*log(2*pi)/2"
K1 = sum(log(diag(R)));
K2 = d*log(2*pi)/2;
%// Major thing happening here as we calculate "X0" for all iterations
%// in one go with permute and bsxfun
diffs = bsxfun(@minus,DataMatrix,permute(DataMatrix,[3 2 1]));
%// "Sigma" is an identity matrix, so it plays no part in "/R" at "xRinv = X0 / R".
%// Perform elementwise squaring and summing rows to get vectorized "quadform"
quadform1 = squeeze(sum(diffs.^2,2))
%// Finally use "quadform1" and get vectorized output as a 2D array
p_out = exp(-0.5*quadform1 - K1 - K2)
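On MATLAB R2016b or newer, implicit expansion can replace the bsxfun call; the diffs line above becomes simply:
diffs = DataMatrix - permute(DataMatrix,[3 2 1]); %// implicit expansion, R2016b+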

Efficiency of diag() - MATLAB

Motivation:
In writing out a matrix operation that was to be performed over tens of thousands of vectors I kept coming across the warning:
Requested 200000x200000 (298.0GB) array exceeds maximum array size
preference. Creation of arrays greater than this limit may take a long
time and cause MATLAB to become unresponsive. See array size limit or
preference panel for more information.
The reason for this was my use of diag() to get the values down the diagonal of a matrix inner product. Because MATLAB is generally optimized for vector/matrix operations, when I first write code, I usually go for the vectorized form. In this case, however, MATLAB has to build the entire matrix in order to get the diagonal, which causes the memory and speed issues.
Experiment:
I decided to test the use of diag() vs a for loop to see if at any point it was more efficient to use diag():
num = 200000; % Matrix dimension
x = ones(num, 1);
y = 2 * ones(num, 1);
% z = diag(x*y'); % Expression to solve
% Loop approach
tic
z = zeros(num,1);
for i = 1 : num
z(i) = x(i)*y(i);
end
loopTime = toc % store the loop timing for the comparison plot below
% Dividing the too-large matrix into process-able chunks
fraction = [10, 20, 50, 100, 500, 1000, 5000, 10000, 20000];
chunkTime = zeros(size(fraction));
for k = 1 : length(fraction)
f = fraction(k);
% Operation to time
tic
z = zeros(num,1);
for i = 1 : f
first = (i-1) * (num / f);
last = first + (num / f);
z(first + 1 : last) = diag(x(first + 1: last) * y(first + 1 : last)');
end
chunkTime(k) = toc;
end
% Plot results
figure;
hold on
plot(log10(fraction), log10(chunkTime));
plot(log10(fraction), repmat(log10(loopTime), 1, length(fraction)));
plot(log10(fraction), log10(chunkTime), 'g*'); % Plot points along time
legend('Partitioned Running Time', 'Loop Running Time');
xlabel('Log_{10}(Fractional Size)'), ylabel('Log_{10}(Running Time)'), title('Running Time Comparison');
This is the result of the test: [plot of running time vs. fractional size not reproduced]
(NOTE: The red line represents the loop time as a threshold; it does not mean that the total loop time is constant regardless of the number of loops)
From the graph it is clear that the operations must be broken down into roughly 200x200 square matrices before using diag becomes faster than performing the same operation using loops.
Question:
Can someone explain why I'm seeing these results? Also, I would think that with MATLAB's ever-more optimized design, there would be built-in handling of these massive matrices within a diag() function call. For example, it could just perform the i = j indexed operations. Is there a particular reason why this might be prohibitive?
I also haven't really thought of memory implications for diag using the partition method, although it's clear that as the partition size decreases, memory requirements drop.
Test of speed of diag vs. a loop.
Initialization:
n = 10000;
M = randn(n, n); %create a random matrix.
Test speed of diag:
tic;
d = diag(M);
toc;
Test speed of loop:
tic;
d = zeros(n, 1);
for i=1:n
d(i) = M(i,i);
end;
toc;
This would test diag. Your code is not a clean test of diag...
Comment on where there might be confusion
diag only extracts the diagonal of a matrix. If x and y are vectors and you do d = diag(x * y'), MATLAB first constructs the n-by-n matrix x*y' and calls diag on that. This is why you get the "requested array exceeds maximum array size" error. The MATLAB interpreter does not optimize that aggressively; it will not realize that you only want the diagonal and construct just a vector rather than the full matrix x*y'. That does not happen.
Not sure if you're asking this, but the fastest way to calculate d = diag(x*y') where x and y are n by 1 vectors would simply be: d = x.*y
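A quick sanity check at a size small enough to build the full matrix (a throwaway sketch, values arbitrary):
n = 1000;
x = randn(n, 1);
y = randn(n, 1);
d1 = diag(x * y'); % builds the full n-by-n matrix first: O(n^2) memory
d2 = x .* y; % element-wise: O(n) memory and time
isequal(d1, d2) % true: both compute x(i)*y(i)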

Efficiently Calculate Frequency Averaged Periodogram Using GPU

In Matlab I am looking for a way to most efficiently calculate a frequency averaged periodogram on a GPU.
I understand that the most important thing is to minimise for loops and use the built-in GPU functions. However, my code still feels relatively unoptimised, and I was wondering what changes I can make to gain a better speed-up.
r = 5; % Dimension
n = 100; % Time points
m = 20; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
Pdgm = zeros(r, r, n/2 + 1);
for j = 1:n/2 + 1
Pdgm(:,:,j) = FT(:,j)*FT(:,j)';
end
% Finally smooth with our weights
SmPdgm = zeros(r, r, n/2 + 1);
% Take advantage of the GPU filter function
% Create new Periodogram WrapPdgm with m/2 values wrapped around in front and
% behind it (it seems like there is redundancy here)
WrapPdgm = zeros(r,r,n/2 + 1 + m);
WrapPdgm(:,:,m/2+1:n/2+m/2+1) = Pdgm;
WrapPdgm(:,:,1:m/2) = flip(Pdgm(:,:,2:m/2+1),3);
WrapPdgm(:,:,n/2+m/2+2:end) = flip(Pdgm(:,:,n/2-m/2+1:end-1),3);
% Perform filtering
for i = 1:r
for j = 1:r
temp = filter(w, [1], WrapPdgm(i,j,:));
SmPdgm(i,j,:) = temp(:,:,m+1:end);
end
end
In particular, I couldn't see a way to optimise out the for loop when calculating the initial Pdgm from the Fourier-transformed data, and the trick I play with WrapPdgm in order to take advantage of filter() on the GPU feels like it would be unnecessary if there were a smoothing function instead.
Solution Code
This seems to be pretty efficient, as the benchmark runtimes in the next section might convince us -
%// Select the portion of FT to be processed and
%// send copy to GPU for calculating everything
gFT = gpuArray(FT(:,1:n/2 + 1));
%// Perform non-smoothed Periodogram, thus removing the first loop
Pdgm1 = bsxfun(@times,permute(gFT,[1 3 2]),permute(conj(gFT),[3 1 2]));
%// Generate WrapPdgm right on GPU
WrapPdgm1 = zeros(r,r,n/2 + 1 + m,'gpuArray');
WrapPdgm1(:,:,m/2+1:n/2+m/2+1) = Pdgm1;
WrapPdgm1(:,:,1:m/2) = Pdgm1(:,:,m/2+1:-1:2);
WrapPdgm1(:,:,n/2+m/2+2:end) = Pdgm1(:,:,end-1:-1:n/2-m/2+1);
%// Perform filtering on GPU and get the final output, SmPdgm1
filt_data = filter(w,1,reshape(WrapPdgm1,r*r,[]),[],2);
SmPdgm1 = gather(reshape(filt_data(:,m+1:end),r,r,[]));
Benchmarking
Benchmarking Code
%// Input parameters
r = 50; % Dimension
n = 1000; % Time points
m = 200; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
tic, %// ... Code from original approach, toc
tic %// ... Code from proposed approach, toc
Runtime results thus obtained on GPU, GTX 750 Ti against CPU, I-7 4790K -
------------------------------ With Original Approach on CPU
Elapsed time is 0.279816 seconds.
------------------------------ With Proposed Approach on GPU
Elapsed time is 0.169969 seconds.
To get rid of the first loop you can do the following:
Pdgm_cell = cellfun(@(x) x * x', mat2cell(FT(:, 1:n/2+1), r, ones(n/2+1, 1)), 'UniformOutput', false);
Pdgm = reshape(cell2mat(Pdgm_cell), r, r, []);
Then in your filter you can do the following:
temp = filter(w, 1, WrapPdgm, [], 3);
SmPdgm = temp(:, :, m + 1 : end);
The 3 lets the filter know to operate along the 3rd dimension of your data.
You can use pagefun on the GPU for the first loop. (Note that the implementation of cellfun is basically a hidden loop, whereas pagefun runs natively on the GPU using a batched GEMM operation). Here's how:
n = 16;
r = 8;
X = gpuArray.rand(r, n);
R = gpuArray.zeros(r, r, n/2 + 1);
for jj = 1:(n/2+1)
R(:,:,jj) = X(:,jj) * X(:,jj)';
end
X2 = X(:,1:(n/2+1));
R2 = pagefun(@mtimes, reshape(X2, r, 1, []), reshape(X2, 1, r, []));
R - R2 % should be all zeros, up to floating-point round-off

bsxfun implementation in matrix multiplication

As always trying to learn more from you, I was hoping I could receive some help with the following code.
I need to accomplish the following:
1) I have a vector:
x = [1 2 3 4 5 6 7 8 9 10 11 12]
2) and a matrix:
A =[11 14 1
5 8 18
10 8 19
13 20 16]
I need to be able to multiply each value from x with every value of A; this means:
new_matrix = [1* A
2* A
3* A
...
12* A]
This will give me a new_matrix of size (12*m x n), assuming A is (m x n); in this case (12*4 x 3).
How can I do this using bsxfun from Matlab? And would this method be faster than a for-loop?
Regarding my for-loop, I need some help here as well... I am not able to store each "new_matrix" as the loop runs :(
for i=x
new_matrix = A.*x(i)
end
Thanks in advance!!
EDIT: After the solutions were given
First solution
clear all
clc
x=1:0.1:50;
A = rand(1000,1000);
tic
val = bsxfun(@times,A,permute(x,[3 1 2]));
out = reshape(permute(val,[1 3 2]),size(val,1)*size(val,3),[]);
toc
Output:
Elapsed time is 7.597939 seconds.
Second solution
clear all
clc
x=1:0.1:50;
A = rand(1000,1000);
tic
Ps = kron(x.',A);
toc
Output:
Elapsed time is 48.445417 seconds.
Send x to the third dimension, so that singleton expansion would come into effect when bsxfun is used for multiplication with A, extending the product result to the third dimension. Then, perform the bsxfun multiplication -
val = bsxfun(@times,A,permute(x,[3 1 2]))
Now, val is a 3D matrix and the desired output is expected to be a 2D matrix concatenated along the columns through the third dimension. This is achieved below -
out = reshape(permute(val,[1 3 2]),size(val,1)*size(val,3),[])
Hope that made sense! Spread the bsxfun word around! woo!! :)
The kron function does exactly that:
kron(x.',A)
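For the vector and matrix from the question, this one-liner produces the stacked result directly:
x = [1 2 3 4 5 6 7 8 9 10 11 12];
A = [11 14 1; 5 8 18; 10 8 19; 13 20 16];
new_matrix = kron(x.', A); % 48-by-3: rows 1:4 are 1*A, rows 5:8 are 2*A, ...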
Here is my benchmark of the methods mentioned so far, along with a few additions of my own:
function [t,v] = testMatMult()
% data
%{
x = [1 2 3 4 5 6 7 8 9 10 11 12];
A = [11 14 1; 5 8 18; 10 8 19; 13 20 16];
%}
x = 1:50;
A = randi(100, [1000,1000]);
% functions to test
fcns = {
@() func1_repmat(A,x)
@() func2_bsxfun_3rd_dim(A,x)
@() func2_forloop_3rd_dim(A,x)
@() func3_kron(A,x)
@() func4_forloop_matrix(A,x)
@() func5_forloop_cell(A,x)
@() func6_arrayfun(A,x)
};
% timeit
t = cellfun(@timeit, fcns, 'UniformOutput',true);
% check results
v = cellfun(@feval, fcns, 'UniformOutput',false);
isequal(v{:})
%for i=2:numel(v), assert(norm(v{1}-v{2}) < 1e-9), end
end
% Amro
function B = func1_repmat(A,x)
B = repmat(x, size(A,1), 1);
B = bsxfun(@times, B(:), repmat(A,numel(x),1));
end
% Divakar
function B = func2_bsxfun_3rd_dim(A,x)
B = bsxfun(@times, A, permute(x, [3 1 2]));
B = reshape(permute(B, [1 3 2]), [], size(A,2));
end
% Vissenbot
function B = func2_forloop_3rd_dim(A,x)
B = zeros([size(A) numel(x)], 'like',A);
for i=1:numel(x)
B(:,:,i) = x(i) .* A;
end
B = reshape(permute(B, [1 3 2]), [], size(A,2));
end
% Luis Mendo
function B = func3_kron(A,x)
B = kron(x(:), A);
end
% SergioHaram & TheMinion
function B = func4_forloop_matrix(A,x)
[m,n] = size(A);
p = numel(x);
B = zeros(m*p,n, 'like',A);
for i=1:numel(x)
B((i-1)*m+1:i*m,:) = x(i) .* A;
end
end
% Amro
function B = func5_forloop_cell(A,x)
B = cell(numel(x),1);
for i=1:numel(x)
B{i} = x(i) .* A;
end
B = cell2mat(B);
%B = vertcat(B{:});
end
% Amro
function B = func6_arrayfun(A,x)
B = cell2mat(arrayfun(@(xx) xx.*A, x(:), 'UniformOutput',false));
end
The results on my machine:
>> t
t =
0.1650 %# repmat (Amro)
0.2915 %# bsxfun in the 3rd dimension (Divakar)
0.4200 %# for-loop in the 3rd dim (Vissenbot)
0.1284 %# kron (Luis Mendo)
0.2997 %# for-loop with indexing (SergioHaram & TheMinion)
0.5160 %# for-loop with cell array (Amro)
0.4854 %# arrayfun (Amro)
(Those timings can slightly change between different runs, but this should give us an idea how the methods compare)
Note that some of these methods are going to cause out-of-memory errors for larger inputs (for example my solution based on repmat can easily run out of memory). Others will get significantly slower for larger sizes but won't error due to exhausted memory (the kron solution for instance).
I think that the bsxfun method func2_bsxfun_3rd_dim or the straightforward for-loop func4_forloop_matrix (thanks to MATLAB JIT) are the best solutions in this case.
Of course you can change the above benchmark parameters (size of x and A) and draw your own conclusions :)
Just to add an alternative, you can maybe use cellfun to achieve what you want. Here's an example (slightly modified from yours):
x = randi(2, 5, 3)-1;
a = randi(3,3);
%// bsxfun 3D (As implemented in the accepted solution)
val = bsxfun(@and, a, permute(x', [3 1 2]));
out = reshape(permute(val,[1 3 2]),size(val,1)*size(val,3),[]);
%// cellfun (My solution)
val2 = cellfun(@(z) bsxfun(@and, a, z), num2cell(x, 2), 'UniformOutput', false);
out2 = cell2mat(val2); % or use cat(3, val2{:}) to get a 3D matrix equivalent to val and then permute/reshape like for out
%// compare
disp(nnz(out ~= out2));
Both give the same exact result.
For more infos and tricks using cellfun, see: http://matlabgeeks.com/tips-tutorials/computation-using-cellfun/
And also this: https://stackoverflow.com/a/1746422/1121352
If your vector x has length 12 and your matrix is of size 4x3, I don't think that using one method or the other would change much in terms of time. If you are working with larger matrices and vectors, that might become an issue.
So first of all, we want to multiply a vector with a matrix. In the for-loop method, that would give something like this:
s = size(A);
new_matrix(s(1),s(2),numel(x)) = 0; % pre-allocate; with a big vector or matrix this helps a lot with time efficiency
for i = 1:numel(x)
new_matrix(:,:,i) = A.*x(i);
end
This will give you a 3D array, where each slice along the third dimension is one result of your multiplication. If this is not what you are looking for, I'll add another solution which might be more time-efficient with bigger matrices and vectors.

Speed up sparse matrix calculations

Is it possible to speed up large sparse matrix calculations by e.g. placing parentheses optimally?
What I'm asking is: can I speed up the following code by forcing Matlab to do the operations in a specified order (for instance "from right to left" or something similar)?
I have a sparse square symmetric matrix H, that previously has been factorized, and a sparse vector M with length equal to the dimension of H. What I want to do is the following:
EDIT: Some additional information: H is typically 4000x4000. The calculations of z and c are done around 4000 times, whereas the computation of dVa and dVaComp is done 10 times for every 4000 loops, thus 40000 in total. (dVa and dVaComp are solved iteratively, where P_mis is updated).
Here M*c*M' will become a sparse matrix with 4 non-zero elements. In Matlab:
[L U P] = lu(H); % H is sparse (thus also L, U and P)
% for i = 1:4000 % Just to illustrate
M = sparse([bf bt],1,[1 -1],n,1); % Sparse vector with two non-zero elements in bt and bf
z = -M'*(U \ (L \ (P * M))); % M^t*H^-1*M = a scalar
c = (1/dyp + z)^-1; % dyp is a scalar
% while (iterations < 10 && ~=converged)
dVa = - (U \ (L \ (P * P_mis)));
dVaComp = (U \ (L \ (P * M * c * M' * dVa)));
% Update P_mis etc.
% end while
% end for
And for the record: Even though I use the inverse of H many times, it is not faster to pre-compute it.
Thanks =)
There's a few things not entirely clear to me:
The command M = sparse([t f],[1 -1],1,n,1); can't be right; you're saying that at rows t,f and columns 1,-1 there should be a 1, and column -1 obviously can't be right.
The result dVaComp is a full matrix due to multiplication by P_mis, while you say it should be sparse.
Leaving these issues aside for now, there's a few small optimizations I see:
You use inv(H)*M twice, so you could pre-compute that.
Negation of dVa can be moved out of the loop.
If you don't need dVa explicitly, leave out the assignment to a variable as well.
Inversion of a scalar means dividing 1 by that scalar (computation of c).
Implementing changes, and trying to compare fairly (I used only 40 iterations to keep total time small):
%% initialize
clc
N = 4000;
% H is sparse, square, symmetric
H = tril(rand(N));
H(H<0.5) = 0; % roughly half is empty
H = sparse(H+H.');
% M is sparse vector with two non-zero elements.
M = sparse([1 N],[1 1],1, N,1);
% dyp is some scalar
dyp = rand;
% P_mis = full vector
P_mis = rand(N,1);
%% original method
[L, U, P] = lu(H);
tic
for ii = 1:40
z = -M'*(U \ (L \ (P*M)));
c = (1/dyp + z)^-1;
for jj = 1:10
dVa = -(U \ (L \ (P*P_mis)));
dVaComp = (U \ (L \ (P*M * c * M' * dVa)));
end
end
toc
%% new method I
[L,U,P,Q] = lu(H);
tic
for ii = 1:40
invH_M = Q*(U\(L\(P*M))); % with 4-output lu, P*H*Q = L*U, so Q must be part of the solve
z = -M.'*invH_M;
c = -1/(1/dyp + z);
for jj = 1:10
dVaComp = invH_M * ( c * (M.' * (Q*(U\(L\(P*P_mis))))) ); % innermost part is a scalar, so no large intermediate matrix is formed
end
end
toc
This gives the following results:
Elapsed time is 60.384734 seconds. % your original method
Elapsed time is 33.074448 seconds. % new method
You might want to try using the extended syntax for lu when factoring the (sparse) matrix H:
[L,U,P,Q] = lu(H);
The extra permutation matrix Q re-orders columns to increase the sparsity of the factors L,U (while the permutation matrix P only re-orders rows for partial pivoting).
Specific results depend on the sparsity pattern of H, but in many cases using a good column permutation significantly reduces the number of non-zeros in the factorisation, reducing memory use and increasing speed.
You can read more about the lu syntax here.
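For reference, the 4-output factorization satisfies P*H*Q = L*U, so a solve must route through Q as well. A minimal sketch, for some sparse matrix H and right-hand side b:
[L, U, P, Q] = lu(H); % factorization satisfies P*H*Q = L*U
xSol = Q * (U \ (L \ (P * b))); % equivalent to xSol = H \ b, reusing the factors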
