Matrix equation not properly updating in time - matrix

As a simple example to illustrate my point, I am trying to solve the equation f(t+1) = f(t) + f(t)*Tr(f^2), starting at t=0, where Tr is the trace of a matrix (the sum of its diagonal elements). Below I provide basic code; it compiles with no errors but does not update the solution the way I want. My expected result, which I calculated by hand, is also below (it's very easy to check by hand via matrix multiplication).
In my sample code below I have two variables that store the solution: g holds f(t=0), which I initialize, and f stores f(t+1).
complex,dimension(3,3) :: f,g
integer :: k,l,m,p,q
Assume g=f(t=0) is defined as below
do l=1,3 !matrix index loops
do k=1,3 !matrix index loops
if (k == l) then
g(k,l) = cmplx(0.2,0)
else if ( k /= l) then
g(k,l) = cmplx(0,0)
end if
end do
end do
I have checked this result is indeed what I want it to be, so I know f at t=0 is defined properly.
Now I try to use this matrix at t=0 to find the matrix for all time, governed by the equation f(t+1) = f(t) + f(t)*Tr(f^2), but this is where my implementation goes wrong.
do m=1,3 !loop for 3 time iterations
do p=1,3 !loops for dummy indices for matrix trace
do q=1,3
g(1,1) = g(1,1) + g(1,1)*g(p,q)*g(p,q) !compute trace here
f(1,1) = g(1,1)
!f(2,2) = g(2,2) + g(2,2)*g(p,q)*g(p,q)
!f(3,3) = g(3,3) + g(3,3)*g(p,q)*g(p,q)
!assume all other matrix elements are zero except diagonal
end do
end do
end do
Printing this result is done by
print*, "calculated f where m=", m
do k=1,3
print*, (f(k,l), l=1,3)
end do
This is where I realize my code is not doing what I want.
When I print f(k,l) I expect for t=1 a result of 0.224 times the identity matrix, and I do get this. However, for t=2 the output is not right. So my code updates correctly for the first time iteration, but not after that.
I am looking for how to properly implement the equation so that I obtain the result I expect.

I'll answer a couple of things you seem to be having trouble with. First, the trace. The trace of a 3x3 matrix A is A(1,1)+A(2,2)+A(3,3). The first and second indices are the same, so we use one loop variable. To compute the trace of an NxN matrix A:
trace = 0.
do i=1,N
trace = trace + A(i,i)
enddo
I think you're trying to loop over p and q to compute your trace, which is incorrect. In that sum you'll add in terms like A(2,3), which don't belong in a trace.
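That said, the quantity you actually need is Tr(f^2), and it can also be accumulated directly without forming f^2 first, since Tr(f^2) is the sum over p and q of f(p,q)*f(q,p). Note the swapped indices: this is not the same as the f(p,q)*f(p,q) sum in the question (the two agree only for symmetric matrices). A minimal sketch, assuming trace is declared complex and p, q are the integers already declared in the question:
trace = cmplx(0.0,0.0)
do p = 1,3
do q = 1,3
trace = trace + f(p,q)*f(q,p) !accumulate Tr(f^2) term by term
end do
end do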
Second, to compute the update, I recommend you compute the updated f into fNew, and then your code would look something like:
do m=1,3 ! time
! -- Compute f^2 (with loops not shown)
f2 = ...
! -- Compute trace of f2 (with loop not shown)
trace = ...
! -- Compute new f
do j=1,3
do i=1,3
fNew(i,j) = f(i,j) + trace*f(i,j)
enddo
enddo
! -- Now update f, perhaps recording fNew-f for some residual
! -- The LHS and RHS are both arrays of dimension (3,3),
! -- so fortran will automatically perform an array operation
f = fNew
enddo
This method has two advantages. First, your code actually looks like the math you're trying to do, and is easy to follow. This is very important for realistic problems, which are not so simple. Second, if fNew(i,j) depended on f(i+1,j), for example, this ensures you are not updating f to the next time level while the current time level values are still needed.
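For completeness, here is a minimal self-contained sketch that fills in the loops left out of the outline above (the names f2, trace, and fNew follow the outline; this is just one possible implementation, and matmul(f,f) could replace the explicit f^2 loops):
program iterate_f
implicit none
complex, dimension(3,3) :: f, fNew, f2
complex :: trace
integer :: i, j, k, m
! -- f(t=0) = 0.2 * identity
f = cmplx(0.0,0.0)
do i = 1,3
f(i,i) = cmplx(0.2,0.0)
end do
do m = 1,3 ! time
! -- Compute f^2 explicitly
f2 = cmplx(0.0,0.0)
do j = 1,3
do i = 1,3
do k = 1,3
f2(i,j) = f2(i,j) + f(i,k)*f(k,j)
end do
end do
end do
! -- Compute trace of f2
trace = cmplx(0.0,0.0)
do i = 1,3
trace = trace + f2(i,i)
end do
! -- Compute new f and copy back (whole-array operations)
fNew = f + trace*f
f = fNew
print*, "calculated f where m=", m
do i = 1,3
print*, (f(i,j), j=1,3)
end do
end do
end program iterate_f
The first iteration prints 0.224 on the diagonal, matching the hand calculation in the question.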

Related

Understanding a FastICA implementation

I'm trying to implement FastICA (independent component analysis) for blind signal separation of images, but first I thought I'd take a look at some examples from Github that produce good results. I'm trying to compare the main loop from the algorithm's steps on Wikipedia's FastICA and I'm having quite a bit of difficulty seeing how they're actually the same.
They look very similar, but there's a few differences that I don't understand. It looks like this implementation is similar to (or the same as) the "Multiple component extraction" version from Wiki.
Would someone please help me understand what's going on in the four or so lines having to do with the nonlinearity function with its first and second derivatives, and the first line of updating the weight vector? Any help is greatly appreciated!
Here's the implementation with the variables changed to mirror the Wiki more closely:
% X is sized (NxM, 3x50K) mixed image data matrix (one row for each mixed image)
C=3; % number of components to separate
W=zeros(numofIC,VariableNum); % weights matrix
for p=1:C
% initialize random weight vector of length N
wp = rand(C,1);
wp = wp / norm(wp);
% like do:
i = 1;
maxIterations = 100;
while i <= maxIterations+1
% until mat iterations
if i == maxIterations
fprintf('No convergence: ', p,maxIterations);
break;
end
wp_old = wp;
% this is the main part of the algorithm and where
% I'm confused about the particular implementation
u = 1;
t = X'*b;
g = t.^3;
dg = 3*t.^2;
wp = ((1-u)*t'*g*wp+u*X*g)/M-mean(dg)*wp;
% 2nd and 3rd wp update steps make sense to me
wp = wp-W*W'*wp;
wp = wp / norm(wp);
% or until w_p converges
if abs(abs(b'*bOld)-1)<1e-10
W(:,p)=b;
break;
end
i=i+1;
end
end
First, I don't understand why the term that is always zero remains in the code:
wp = ((1-u)*t'*g*wp+u*X*g)/M-mean(dg)*wp;
The above can be simplified into:
wp = X*g/M-mean(dg)*wp;
We can also remove u since it is always 1.
Second, I believe the following line is wrong:
t = X'*b;
The correct expression is:
t = X'*wp;
Now let's go through each variable here. Let's refer to
w = E{X g(w^T X)^T} - E{g'(w^T X)} w
as the iteration equation.
X is your input data, i.e. X in the iteration equation.
wp is the weight vector, i.e. w in the iteration equation. Its initial value is randomised.
g is the first derivative of a nonquadratic nonlinear function, i.e. g(w^T X) in the iteration equation
dg is the first derivative of g, i.e. g'(w^T X) in the iteration equation
M, although its definition is not shown in the code you provided, should be the number of samples, i.e. the number of columns of X (50K here).
Knowing the meaning of each variable, we can now try to understand the code.
t = X'*wp;
The above line (with b corrected to wp, as noted earlier) computes w^T X.
g = t.^3;
The above line computes g(w^T X) = (w^T X)^3. Note that g(u) can be any function as long as f(u), where g(u) = df(u)/du, is nonlinear and nonquadratic.
dg = 3*t.^2;
The above line computes the derivative of g.
wp = X*g/M-mean(dg)*wp;
X*g calculates X g(w^T X), and dividing by M takes the average, which is equivalent to E{X g(w^T X)^T}.
mean(dg) is E{g'(w^T X)}, and it multiplies wp, i.e. w in the equation.
Now you have what you need for the Newton-Raphson method.
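For context, here is where the iteration equation itself comes from (the standard FastICA derivation, assuming the data have been whitened so that E{X X^T} = I): maximizing E{f(w^T X)} under the constraint ||w|| = 1 gives the stationarity condition E{X g(w^T X)} - beta*w = 0, and an approximate Newton step that replaces the Jacobian E{g'(w^T X) X X^T} with E{g'(w^T X)}*I yields, after rescaling,
w = E{X g(w^T X)} - E{g'(w^T X)} w
followed by normalization, which is exactly the pattern of the wp update lines in the code.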

Optimize/ Vectorize Mahalanobis distance calculations in MATLAB

I have the following piece of Matlab code, which calculates Mahalanobis distances between a vector and a matrix with several iterations. I am trying to find a faster method to do this by vectorization but without success.
S.data=0+(20-0).*rand(15000,3);
S.a=0+(20-0).*rand(2500,3);
S.resultat=ones(length(S.data),length(S.a))*nan;
S.b=ones(length(S.a),3,length(S.a))*nan;
for i=1:length(S.data)
for j=1:length(S.a)
S.a2=S.a;
S.a2(j,:)=S.data(i,:);
S.b(:,:,j)=S.a2;
if j==length(S.a)
for k=1:length(S.a);
S.resultat(i,k)=mahal(S.a(k,:),S.b(:,:,k));
end
end
end
end
I have now modified the code and avoided one of the loops. But it is still very slow. If someone has an idea, I will be very grateful!
S.data=0+(20-0).*rand(15000,3);
S.a=0+(20-0).*rand(2500,3);
S.resultat=ones(length(S.data),length(S.a))*nan;
for i=1:length(S.data)
for j=1:length(S.a)
S.a2=S.a;
S.a2(j,:)=S.data(i,:);
S.resultat(i,j)=mahal(S.a(j,:),S.a2);
end
end
Introduction and solution code
You can replace the innermost loop that uses mahal with something a bit more vectorized: certain values are pre-calculated (with the help of bsxfun) and then used inside a loop-shortened, hacked version of mahal.
Basically you have a 2D array, let's call it A for easy reference, and a 3D array, let's call it B. Let the output be stored in a variable out. So, the innermost code snippet, rewritten with these assumed variable names, looks like this.
Original loopy code
for k=1:size(A,1)
out(k)=mahal(A(k,:),B(:,:,k));
end
So, what I did was to hack into mahal.m and look for portions that could be vectorized when the inputs are 2D and 3D. Now, mahal uses qr internally, which could not be vectorized across the third dimension. Thus, we end up with a partially hacked code.
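For reference, what mahal(Y,X) computes for each row y of Y is d^2 = (y - mu) * inv(S) * (y - mu)', with mu the mean and S the sample covariance of the rows of X. Writing the centered reference C = X - mu with its thin QR factorization C = Q*R gives S = C'*C/(N-1) = R'*R/(N-1), where N = size(X,1), so d^2 = (N-1) * norm(R'\(y - mu)')^2. That is exactly the sum((R'\...).^2)*(N-1) expression that appears in the hacked code below.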
Hacked code
%// Pre-calculate values that would otherwise be recomputed inside the loop
meanB = mean(B,1); %// mean of B along dim-1
B_meanB = bsxfun(@minus,B,meanB); %// B minus mean values of B
A_B_meanB = A' - reshape(meanB,size(B,2),[]); %// A minus mean values of B
%// QR calculations in a for-loop starts until the output is obtained
for k = 1:size(A,1)
[~,R] = qr(B_meanB(:,:,k),0);
out2(k) = sum((R'\A_B_meanB(:,k)).^2)*(size(A,1)-1);
end
Now, to extend this hacked solution to the problem code, one can introduce a few more tweaks to pre-calculate more of the values used in those nested loops.
Final solution code
A = S.a; %// Get data from S
[rx,cx] = size(A); %// Get size parameters
Atr = A'; %// Pre-calculate transpose of A
%// Pre-calculate replicated B and the indices to be modified at each iteration
B_rep = repmat(S.a,1,1,rx);
B_idx = bsxfun(@plus,[(0:cx-1)*rx + 1]',[0:rx-1]*(rx*cx+1)); %// linear indices of the row to overwrite in each 3D slice
out = zeros(size(S.data,1),rx); %// initialize output array
for i=1:length(S.data)
B = B_rep;
B(B_idx) = repmat(S.data(i,:)',1,rx); %// overwrite one row per slice with the current data row
meanB = mean(B,1); %// mean of B along dim-1
B_meanB = bsxfun(@minus,B,meanB); %// B minus mean values of B
A_B_meanB = Atr - reshape(meanB,3,[]); %// A minus mean values of B
for jj = 1:rx
[~,R] = qr(B_meanB(:,:,jj),0);
out(i,jj) = sum((R'\A_B_meanB(:,jj)).^2)*(rx-1);
end
end
S.resultat = out;
Benchmarking
Here's the benchmarking code to compare the proposed solution against the code listed in the problem -
%// Random inputs
S.data=0+(20-0).*rand(1500,3); %(size 10x reduced for a quicker runtime test)
S.a=0+(20-0).*rand(250,3);
S.resultat=ones(length(S.data),length(S.a))*nan;
disp('----------------------------- With original code')
tic
S.b=ones(length(S.a),3,length(S.a))*nan;
for i=1:length(S.data)
for j=1:length(S.a)
S.a2=S.a;
S.a2(j,:)=S.data(i,:);
S.b(:,:,j)=S.a2;
if j==length(S.a)
for k=1:length(S.a);
S.resultat(i,k)=mahal(S.a(k,:),S.b(:,:,k));
end
end
end
end
toc, clear i j k
S.resultat=ones(length(S.data),length(S.a))*nan;
disp('----------------------------- With proposed solution code')
tic
[ ... Proposed solution code ...]
toc
Runtimes -
----------------------------- With original code
Elapsed time is 17.734394 seconds.
----------------------------- With proposed solution code
Elapsed time is 6.602860 seconds.
Thus, we might get around 2.7x speedup with the proposed approach and some tweaks!

Looking for efficient way to perform a computation - Matlab

I have a scalar function f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2) which receives two 2-dimensional vectors as input (norm here implements the Euclidean norm). The values of x,i range in 1:w and the values y,j range in 1:h. I want to create a cell array X such that X{x,y} will contain a w x h matrix such that X{x,y}(i,j) = f([x,y],[i,j]). This can obviously be done using 4 nested loops like so:
for x=1:w;
for y=1:h;
X{x,y}=zeros(w,h);
for i=1:w
for j=1:h
X{x,y}(i,j)=f([x,y],[i,j])
end
end
end
end
This is however extremely inefficient. I would very much appreciate an efficient way to create X.
One way to do this is to remove the two innermost loops and replace them with a vectorised version. By the look of your f function this shouldn't be too bad.
First we need to construct two matrices that hold the row index i and the column index j at every position, like so:
wMat=repmat((1:w)',1,h);
hMat=repmat(1:h,w,1);
These represent the inner two loops: wMat(i,j) = i and hMat(i,j) = j, giving all combinations. Now we can vectorise the calculation (f([x,y],[i,j]) = exp(-norm([x,y]-[i,j])^2/sigma^2)):
for x=1:w
for y=1:h
temp1=(x-wMat).^2+(y-hMat).^2;
X{x,y}=exp(-temp1/sigma^2);
end
end
Here temp1 holds the squared Euclidean distance from [x,y] to every [i,j] at once, so the inner two loops are computed in one go.
Some discussion and code
The trick here is to perform the norm calculations with numeric arrays and convert the results into the cell array version as late as possible. For the norm calculations you can take the help of ndgrid and bsxfun, plus some permute + reshape to give the result the shape needed for the final cell array version. So, here's the vectorized approach to perform these tasks -
%// Create x-y/i-j values to be used for calculation of function values
[xi,yi] = ndgrid(1:w,1:h);
%// Get the norm values
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + ...
bsxfun(@minus,yi(:),yi(:).').^2);
%// Get the actual function values
vals = exp(-normvals.^2/sigma^2);
%// Get the values into blocks of a 4D array and then re-arrange to match
%// with the shape of numeric array version of X
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
%// Finally get the cell array version
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
Benchmarking and runtimes
After improving the original loopy code with pre-allocation for X and inlining of f, runtime benchmarks were performed on it against the proposed vectorized approach with datasizes w = h = 60, and the runtime results thus obtained were -
----------- With Improved loopy code
Elapsed time is 41.227797 seconds.
----------- With Vectorized code
Elapsed time is 2.116782 seconds.
This suggests a whopping, close to 20x speedup with the proposed solution!
For extremely huge datasizes
If you are dealing with huge datasizes, you may not have enough memory for bsxfun to work with; bsxfun is known to use up a lot of memory in exchange for a performance-efficient vectorized solution. So, for such huge-datasize cases, you can use the following loopy approach to replace the normvals calculation listed in the earlier bsxfun-based solution -
%// Get the norm values
nx = numel(xi);
normvals = zeros(nx,nx);
for ii = 1:nx
normvals(:,ii) = sqrt( (xi(:) - xi(ii)).^2 + (yi(:) - yi(ii)).^2 );
end
It seems to me that when you run through the cycle for x=w, y=h, you have already calculated all the values you need, because f depends only on |x-i| and |y-j|. So you don't need to recalculate them. Once you have this:
for i=1:w
for j=1:h
temp(i,j)=f([x,y],[i,j])
end
end
then every X{x,y} can be read off this table by index arithmetic: X{x,y}(i,j) = temp(w-|x-i|, h-|y-j|). If you can vectorise the calculation of f (norm here is just the Euclidean norm of that vector?) then it will get even simpler.

Matlab - if exists a faster way to assign values to big matrix?

I am a new student learning to use Matlab.
Could anyone please tell me if there is a faster way, possibly without loops, to assign, for each row, just two values (1 and -1) at different positions of a big sparse matrix?
My code builds a bimatrix or bibimatrix for the MILP constraint f^k_{ij} <= y_{ij}, for every arc (i,j) and all k ~= r, in a multi-commodity flow model.
Naive approach:
bimatrix=[];
% create each row and then add to bimatrix
newrow4= zeros(1,n*(n+1)^2);
for k=1:n
for i=0:n
for j=1: n
if j~=i
%change value of some positions to -1 and 1
newrow4(i*n^2+(j-1)*n+k)=1;
newrow4((n+1)*n^2+i*n+j)=-1;
% add to bimatrix
bimatrix=[bimatrix; newrow4];
% change newrow4 back to zeros row.
newrow4(i*n^2+(j-1)*n+k)=0;
newrow4((n+1)*n^2+i*n+j)=0;
end
end
end
end
OR:
% Generate the big sparse matrix first.
bibimatrix=zeros(n^3 ,n*(n+1)^2);
t=1;
for k=1:n
for i=0:n
for j=1: n
if j~=i
%Change 2 positions in each row to -1 and 1 in each row.
bibimatrix(t,i*n^2+(j-1)*n+k)=1;
bibimatrix(t,(n+1)*n^2+i*n+j)=-1;
t=t+1
end
end
end
end
With the above code in Matlab, the time to generate this matrix with n ~ 12 is more than 3 s. I need to generate a larger matrix in less time.
Thank you.
Suggestion: Use sparse matrices.
You should be able to create two vectors containing the column numbers where you want your +1 and -1 in each row. Let's call these vectors vec_1 and vec_2. You should be able to do this without loops (if not, I still think the procedure below will be faster).
Let the size of your matrix be (max_row X max_col). Then you can create your matrix like this:
bibimatrix = sparse(1:max_row,vec_1,1,max_row,max_col);
bibimatrix = bibimatrix + sparse(1:max_row, vec_2,-1,max_row,max_col)
If you want to see the entire matrix (which you don't, since it's huge) you can write: full(bibimatrix).
EDIT:
You may also do it this way:
col_vec = [vec_1, vec_2];
row_vec = [1:max_row, 1:max_row];
s = [ones(1,max_row), -1*ones(1,max_row)];
bibimatrix = sparse(row_vec, col_vec, s, max_row, max_col)
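Here sparse(row_vec, col_vec, s, max_row, max_col) places s(k) at row row_vec(k), column col_vec(k) (values at duplicate positions are summed), so the whole matrix is assembled in a single call.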
Disclaimer: I don't have MATLAB available, so it might not be error-free.

Most efficient way to weight and sum a number of matrices in Fortran

I am trying to write a function in Fortran that multiplies a number of matrices with different weights and then adds them together to form a single matrix. I have identified that this process is the bottleneck in my program (this weighting will be made many times for a single run of the program, with different weights). Right now I'm trying to make it run faster by switching from Matlab to Fortran. I am a newbie at Fortran so I appreciate all help.
In Matlab the fastest way I have found to make such a computation looks like this:
function B = weight_matrices()
n = 46;
m = 1800;
A = rand(n,m,m);
w = rand(n,1);
tic;
B = squeeze(sum(bsxfun(#times,w,A),1));
toc;
The line where B is assigned runs in about 0.9 seconds on my machine (Matlab R2012b, MacBook Pro 13" retina, 2.5 GHz Intel Core i5, 8 GB 1600 MHz DDR3). It should be noted that for my problem, the tensor A will be the same (constant) for the whole run of the program (after initialization), but w can take any values. Also, typical values of n and m are used here, meaning that the tensor A will have a size of about 1 GB in memory.
The clearest way I can think of writing this in Fortran is something like this:
pure function weight_matrices(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
double precision, dimension(n), intent(in) :: w
double precision, dimension(n,m,m), intent(in) :: A
double precision, dimension(m,m) :: B
integer :: i
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
end function weight_matrices
This function runs in about 1.4 seconds when compiled with gfortran 4.7.2, using -O3 (function call timed with "call cpu_time(t)"). If I manually unwrap the loop into
B = w(1)*A(1,:,:)+w(2)*A(2,:,:)+ ... + w(46)*A(46,:,:)
the function takes about 0.11 seconds to run instead. This is great and means that I get a speedup of about 8 times compared to the Matlab version. However, I still have some questions on readability and performance.
First, I wonder if there is an even faster way to perform this weighting and summing of matrices. I have looked through BLAS and LAPACK, but can't find any function that seems to fit. I have also tried to put the dimension in A that enumerates the matrices as the last dimension (i.e. switching from (i,j,k) to (k,i,j) for the elements), but this resulted in slower code.
Second, this fast version is not very flexible, and actually looks quite ugly, since it is so much text for such a simple computation. For the tests I am running I would like to try different numbers of weights, so that the length of w will vary, to see how it affects the rest of my algorithm. However, that means a quite tedious rewrite of the assignment of B every time. Is there any way to make this more flexible, while keeping the performance the same (or better)?
Third, the tensor A will, as mentioned before, be constant during the run of the program. I have set constant scalar values in my program using the "parameter" attribute in their own module, importing them with the "use" expression into the functions/subroutines that need them. What is the best way to do the equivalent thing for the tensor A? I want to tell the compiler that this tensor will be constant, after init., so that any corresponding optimizations can be done. Note that A is typically ~1 GB in size, so it is not practical to enter it directly in the source file.
Thank you in advance for any input! :)
Perhaps you could try something like
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
The square bracket is a newer form of (/ /), the 1-d array constructor. The term inside sum is an array of dimension (n), and sum adds all of its elements. This is precisely what your unwrapped code does (and is not exactly equal to the do loop you have).
I tried to refine Kyle Vanos' solution.
Therefore I decided to use sum and Fortran's vector capabilities.
I don't know if the results are correct, because I only looked at the timings!
Version 1: (for comparison)
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
Version 2: (from Kyle Vanos)
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
Version 3: (mixed-up indices, work on one row/column at a time)
do j = 1, m
B(:,j)=sum( [ ( (w(i)*A(:,i,j)), i=1,n) ], dim=1)
enddo
Version 4: (complete matrices)
B=sum( [ ( (w(i)*A(:,:,i)), i=1,n) ], dim=1)
Timing
As you can see, I had to mix up the indices to get faster execution times. The third solution is really strange, because the matrix number is the middle index, but this is necessary for memory-order reasons.
V1: 1.30s
V2: 0.16s
V3: 0.02s
V4: 0.03s
Concluding, I would say that you can get a massive speedup if you have the possibility to change the order of the matrix indices.
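Along the same lines, once A is stored with the matrix number last, i.e. A(m,m,n), the whole weighted sum collapses to a single matrix-vector product of an (m*m x n) matrix with w, which the intrinsics can express directly. A sketch, assuming contiguous column-major storage (note the reshape may create a large temporary with some compilers):
! B = sum over i of w(i)*A(:,:,i), done as one matrix-vector product
B = reshape( matmul( reshape(A, [m*m, n]), w ), [m, m] )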
I would not hide any looping, as this is usually slower. If you write it out explicitly, you'll see that the hidden inner loops run over the trailing indices of A while the first (contiguous) index stays fixed, which is inefficient. So, you should make sure your n dimension is the last one by storing A as A(m,m,n):
B = 0
do i = 1,n
w_tmp = w(i)
do j = 1,m
do k = 1,m
B(k,j) = B(k,j) + w_tmp*A(k,j,i)
end do
end do
end do
this should be much more efficient as you are now accessing consecutive elements in memory in the inner loop.
Another solution is to use the level 1 BLAS subroutines _AXPY (y = a*x + y):
B = 0
do i = 1,n
CALL DAXPY(m*m, w(i), A(1,1,i), 1, B(1,1), 1)
end do
With Intel MKL this should be more efficient, but again you should make sure the last index is the one which changes in the outer loop (in this case the loop you're writing). You can find the necessary arguments for this call here: MKL
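Viewed this way, the whole weighted sum is in fact a single level 2 BLAS operation: with A stored as A(m,m,n), B is the (m*m x n) matrix of flattened A's times w. A hedged sketch of the corresponding GEMV call, relying on the classic F77 sequence-association idiom of passing the 3D array A and the 2D array B where flat arrays are expected:
! B(:) = A_flat * w, with A_flat the (m*m x n) view of A(m,m,n)
CALL DGEMV('N', m*m, n, 1.0d0, A, m*m, w, 1, 0.0d0, B, 1)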
EDIT: you might also want to use some parallelization? (I don't know if Matlab takes advantage of that)
EDIT2: In the answer of Kyle, the inner loop is over different values of w, which is more efficient than reloading B n times, since w can be kept in cache (using A(n,m,m)):
B = 0
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
do k = 1,n
B(j,i) = B(j,i) + w(k)*A(k,j,i)
end do
end do
end do
This explicit looping performs about 10% better than Kyle's code, which uses whole-array operations. Bandwidth with ifort -O3 -xHost is ~6600 MB/s; with gfortran -O3 it's ~6000 MB/s, and the whole-array version with either compiler is also around 6000 MB/s.
I know this is an old post; however, I am glad to bring my contribution, as I played with most of the posted solutions.
Adding a local unroll for the weights loop (to Steabert's answer) gives me a small speed-up compared to the complete unroll version (from 10% to 80%, depending on the size of the matrices). The partial unrolling may help the compiler to vectorize the 4 operations in one SSE call.
pure function weight_matrices_partial_unroll_4(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
real(8), intent(in) :: w(n)
real(8), intent(in) :: A(n,m,m)
real(8) :: B(m,m)
real(8) :: Btemp(4)
integer :: i, j, k, l, ndiv, nmod, roll
!==================================================
roll = 4
ndiv = n / roll
nmod = mod( n, roll )
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
k = 1
do l = 1,ndiv
Btemp(1) = w(k )*A(k ,j,i)
Btemp(2) = w(k+1)*A(k+1,j,i)
Btemp(3) = w(k+2)*A(k+2,j,i)
Btemp(4) = w(k+3)*A(k+3,j,i)
k = k + roll
B(j,i) = B(j,i) + sum( Btemp )
end do
do l = 1,nmod !---- process the rest of the loop
B(j,i) = B(j,i) + w(k)*A(k,j,i)
k = k + 1
enddo
end do
end do
end function
