Code performance optimization : speeding up population of an array in Matlab - performance

Part of a longer algorithm, populating the column array b takes 80% of the total run time. The line (see code below) where S1 is calculated in the for loop is the bottleneck because it's called an enormously large amount of times due to the nested loops. I have tried vectorizing the loops and/or using the sum built-in function but I always end up with an even higher computation time.
For the sake of simplicity we can look at one single iteration, namely n = 1 and n + 1 = 2. So the 3D array containing the quantity R at each iteration has only two elements : R(:,:,1) and R(:,:,2).
Where:
I have posted here a minimum working sample that you can yourself run:
tic
Nz = 3636 ; % Number of rows
Nx = 910 ; % Number of columns
R = rand(Nz, Nx, 2) ;
alpha = 3 ;
for ii = Nx - 1 : -1 : 2
b = zeros(Nz, 1) ;
for jj = 2 : Nz - 1
S = 0 ;
S1 = 0 ;
for mm = Nx - 1 : -1 : ii % Summation
S1 = S1 + R(2, mm, 2) + R(2, mm, 1) - (R(1, mm, 2) + R(1, mm, 1)) ;
S = S + (R(jj + 1, mm, 2) + R(jj + 1, mm, 1)) - 2 * (R(jj, mm, 2) + R(jj, mm, 1)) + (R(jj - 1 , mm, 2) + R(jj - 1, mm, 1)) ;
end
b(1) = 2 *(R(1, ii, 1) + 2 * alpha * S1) ;
b(jj) = 2 *(R(jj, ii, 1) + alpha * S) ;
end
end
toc
The current elapsed time is about 28-30 [s] on my laptop (i7-9850H, 2.6 GHz, 16 GB RAM).
I am aware that b is reset to zero at each ii iteration, but to reduce the sample to bare minimum I omitted a line just before the last end: q = D\b. It's a linear system solved column by column starting from the right-most column of the grid (most external loop). That's why at every ii (column) iteration I compute a new b and thus a new solution q.
How can I speed up this computation?
EDIT:
I tried to vectorize the sum, by eliminating the inner for loop and using the function sum, but the time increases dramatically.
tic
Nz = 3636 ; % Number of rows
Nx = 910 ; % Number of columns
R = rand(Nz, Nx, 2) ;
alpha = 3 ;
for ii = Nx - 1 : -1 : 2
b = zeros(Nz, 1) ;
for jj = 2 : Nz - 1
S1 = sum([R(2, Nx - 1 : -1 : ii, 2), R(2, Nx - 1 : -1 : ii, 1), - R(1, Nx - 1 : -1 : ii, 2), R(1,Nx - 1 : -1 : ii, 1)]) ;
S = sum([R(jj + 1, Nx - 1 : -1 : ii, 2), R(jj + 1, Nx - 1 : -1 : ii, 1), - 2 * R(jj, Nx - 1 : -1 : ii, 2), 2 * R(jj, Nx - 1 : -1 : ii, 1), R(jj - 1 , Nx - 1 : -1 : ii, 2), R(jj - 1, Nx - 1 : -1 : ii, 1)]) ;
b(1) = 2 *(R(1, ii, 1) + 2 * alpha * S1) ;
b(jj) = 2 *(R(jj, ii, 1) + alpha * S) ;
end
end
toc

Most of the computations are repeated summations that can be converted to cumsum. The result b is a matrix that you simply can use D\b to compute q for all columns at once.
R = R(:, 2:end-1, :);
sm = sum(R, 3);
S1 = cumsum(sm(2, :) - sm(1, :), 2, 'reverse')
S = cumsum(conv2(sm, [1;-2;1], 'valid'), 2, 'reverse')
b = [2 .* (R(1, :, 1) + 2 * alpha * S1);
2 .* (R(2:end-1, :, 1) + alpha * S);
zeros(1, size(R, 2))];
q = D\b;

On my laptop (MATLAB R2020a, i7-5500U, 2.4 GHz, 8 GB RAM), your original code runs in about 45 s.
Elapsed time is 45.239229 seconds.
Your attempt using sum runs in about 191 s.
Elapsed time is 191.424418 seconds.
#Ander Biguri suggestion runs in about 174 s, which is faster, but not as fast as using for loops.
Elapsed time is 174.288989 seconds.
It appears that MATLAB built-in sum function runs slower than the for loop, as briefly discussed here. The array initialization b=zeros(Nz,1) is fast in newer versions of MATLAB (at least since R2019a), but you can initialize it differently if you are using an older version.
#Ander suggestions, however, are noteworthy and we can follow his suggestions, but use for loop for the summations:
tic
Nz = 3636 ; % Number of rows
Nx = 910 ; % Number of columns
R = rand(Nz, Nx, 2) ;
alpha = 3 ;
for ii = Nx - 1 : -1 : 2
b = zeros(Nz, 1) ;
S1 = 0 ;
for mm = Nx - 1 : -1 : ii % Summation
S = S + (R(jj + 1, mm, 2) + R(jj + 1, mm, 1)) - 2 * (R(jj, mm, 2) + R(jj, mm, 1)) + (R(jj - 1 , mm, 2) + R(jj - 1, mm, 1)) ;
end
for jj = 2 : Nz - 1
S = 0 ;
for mm = Nx - 1 : -1 : ii % Summation
S = S + (R(jj + 1, mm, 2) + R(jj + 1, mm, 1)) - 2 * (R(jj, mm, 2) + R(jj, mm, 1)) + (R(jj - 1 , mm, 2) + R(jj - 1, mm, 1)) ;
end
b(1) = 2 *(R(1, ii, 1) + 2 * alpha * S1) ;
b(jj) = 2 *(R(jj, ii, 1) + alpha * S) ;
end
end
toc
Elapsed time is 26.911663 seconds.
If computing time is a big concern, you can create some temporary variables before the for loop to avoid some summations/multiplications. This makes the code slightly worse to read, but it runs faster:
tic
Nz = 3636 ; % Number of rows
Nx = 910 ; % Number of columns
R = rand(Nz, Nx, 2) ;
R12 = R(:,:,1)+R(:,:,2);
alpha = 3 ;
Nx1 = Nx-1;
Nz1 = Nz-1;
two_alpha = 2*alpha;
for ii = Nx1:-1:2
b = zeros(Nz, 1) ;
S1 = 0 ;
for mm = ii:Nx1 % Summation
S1 = S1 + R12(2, mm) - R12(1, mm) ;
end
for jj = 2:Nz1
S = 0 ;
for mm = ii:Nx1 % Summation
S = S + R12(jj + 1, mm) -2*R12(jj, mm) + R12(jj - 1 , mm) ;
end
b(1) = 2 *(R(1, ii,1) + two_alpha*S1) ;
b(jj) = 2 *(R(jj, ii,1) + alpha*S) ;
end
end
toc
Elapsed time is 12.202103 seconds.
Since the limits of summation extend up to N-1, creation of the Nx1 and Nz1 variables avoids three subtractions on each ii-loop. The two_alpha variable also avoids one multiplication per iteration. Creation of the R12 also speeds up calculation, since in the original code this sum on the 3rd dimesion appears on every line of the summations.
I tried to also create a two_R12=2*R12 variable for the mm-loop, but it runs slower since the for loop must access elements on two different variables.
You also stated that the b variable is used to solve a liner system q = D\b. This will also slow down computations. If the D matrix is constant, consider decompose it, in order to also speed up calculations.

I'm just working with your optimized code. Mostly because your optimized code does not do the same as the non optimized one, there are more things that make no sense (b is replaced in all of the ii loops, so ii is superfluous). I give you thus some tips to improve code in general, but I can't fix a broken question.
Tips:
make sure your code is easy to read by you.
R(2, Nx - 1 : -1 : ii, 2), R(2, Nx - 1 : -1 : ii, 1) is just R(2, Nx - 1 : -1 : ii, 2:1), and in fact as you are adding them R(2, Nx - 1 : -1 : ii, :).
Indexing an array is much faster if you do it as expected, i.e., in order. You are adding array elements, but reading them in reverse. Just read them normally, i.e. Nx-1:-1:ii is just ii:Nx-1. Bonus that it's much more readable. In fact the above line is now suddenly R(2, ii:Nx-1, :). Much more readable.
Don't create temporary arrays if can be avoided. Particularly if in the end, these will be the size of R! You are just indexing it, and you don't care about the ordering because you will add it together. S can be just:
S = sum(R(jj-1:jj+1, ii : Nx-1, :).*cat(3,[1;2;1],[1; -2; 1]),'all') ;
% R(jj-1:jj+1, ii : Nx-1, :) % grabs the needed values
% cat(3,[1;2;1],[1; -2; 1]) % small trick to multiply the arrays (implicit broadcast)
% sum( ,'all') % add them regardless of shape.
This already speeds up the code by a lot.
4- Don't overcompute. S1 does not depend in jj, in fact, it's the same for each ii. So why compute it jj times for each ii ?
So, you can change your optimized code to the following for a x1.7 speedup:
tic
Nz = 1000 ; % Number of rows
Nx = 200 ; % Number of columns
R = rand(Nz, Nx, 2) ;
alpha = 3 ;
for ii = Nx - 1 : -1 : 2
b1 = zeros(Nz, 1) ;
% we can optimize this one too, but I don't think its the main culprit
% here.
S1 = sum([R(2, ii : Nx-1, 2), R(2, ii : Nx-1, 1),...
- R(1, ii : Nx-1, 2), R(1,ii : Nx-1, 1)]) ;
b1(1) = 2 *(R(1, ii, 1) + 2 * alpha * S1) ;
for jj = 2 : Nz - 1
S = sum(R(jj-1:jj+1, ii : Nx-1, :).*cat(3,[1;2;1],[1; -2; 1]),'all') ;
b1(jj) = 2 *(R(jj, ii, 1) + alpha * S) ;
end
end
toc

Related

Using conditions to find imaginary and real part

I have used Solve to find the solution of an equation in Mathematica (The reason I am posting here is that no one could answer my question in mathematica stack.)The solution is called s and it is a function of two variables called v and ro. I want to find imaginary and real part of s and I want to use the information that v and ro are real and they are in the below interval:
$ 0.02 < ro < 1 ,
40
The code I used is as below:
ClearAll["Global`*"]
d = 1; l = 100; k = 0.001; kk = 0.001;ke = 0.0014;dd = 0.5 ; dr = 0.06; dc = 1000; p = Sqrt[8 (ro l /2 - 1)]/l^2;
m = (4 dr + ke^2 (d + dd)/2) (-k^2 + kk^2) (1 - l ro/2) (d - dd)/4 -
I v p k l (4 dr + ke^2 (d + dd)/2)/4 - v^2 ke^2/4 + I v k dr l p/4;
xr = 0.06/n;
tr = d/n;
dp = (x (v I kk/2 (4 dr + ke^2 (d + dd)/2) - I v kk ke^2 (d - dd)/8 - dr l p k kk (d - dd)/4) + y ((xr I kk (ro - 1/l) (4 dr + ke^2 (d + dd)/2)) - I v kk tr ke^2 (1/l - ro/2) + I dr xr 4 kk (1/l - ro/2)))/m;
a = -I v k dp/4 - I xr y kk p/2 + l ke^2 dp p (d + dd)/8 + (-d + dd)/4 k kk x + dr l p dp;
aa = -v I kk dp/4 + xr I y k p/2 - tr y ke^2 (1/l - ro/2) - (d - dd) x kk^2/4 + ke^2 x (d - dd)/8;
ca = CoefficientArrays[{x (s + ke^2 (d + dd)/2) +
dp (v I kk - l (d - dd) k p kk/2) + y (tr ro ke^2) - (d -
dd) ((-kk^2 + k^2) aa - 2 k kk a)/(4 dr + ke^2 (d + dd)/2) == 0, y (s + dc ke^2) + n x == 0}, {x, y}];
mat = Normal[ca];
matt = Last#mat;
sha = Solve[Det[matt] == 0, s];
shaa = Assuming[v < 100 && v > 40 && ro < 1 && ro > 0.03,Simplify[%]];
reals = Re[shaa];
ims = Im[shaa];
Solve[reals == 0, ro]
but it gives no answer. Could anyone help? I really appreciate any solution to this problem.
I run your code down to this point
mat = Normal[ca]
and look at the result.
There are lots of very tiny floating point coefficients, so small that I suspect most of them are just floating point noise now. Mathematica thinks 0.1 is only known to 1 significant digit of precision and your mat result is perhaps nothing more than zero correct digits now.
I continue down to this point
sha = Solve[Det[matt] == 0, s]
If you look at the value of sha you will see it is s->stuff and I don't think that is at all what you think it is. Mathematica returns "rules" from Solve, not just expressions.
If I change that line to
sha = s/.Solve[Det[matt] == 0, s]
then I am guessing that is closer to what you are imagining you want.
I continue to
shaa = Assuming[40<v<100 && .03<ro<1, Simplify[sha]];
reals = Re[shaa]
And I instead use, because you are assuming v and ro to be Real and because ComplexExpand has often been very helpful in getting Re to provide desired results,
reals=Re[ComplexExpand[shaa]]
and I click on Show ALL to see the full expanded value of that. That is about 32 large screens full of your expression.
In that are hundreds of
Arg[-1. + 50. ro]
and if I understand your intention I believe all those simplify to 0. If that is correct then
reals=reals/.Arg[-1. + 50. ro]->0
reduces the size of reals down to about 20 large screen fulls.
But there are still hundreds of examples of Sqrt[(-1.+50. ro)^2] and ((-1.+50. ro)^2)^(1/4) making up your reals. Unfortunately I'm expecting your enormous expression is too large and will take too long for Simplify with assumptions to be able to be practically effective.
Perhaps additional replacements to coax it into dramatically simplifying your reals without making any mistakes about Real versus Complex, but you have to be extremely careful with such things because it is very common for users to make mistakes when dealing with complex numbers and roots and powers and functions and end up with an incorrect result, might get your problem down to the point where it might be feasible for
Solve[reals == 0, ro]
to give you a meaningful answer.
This should give you some ideas of what you need to think carefully about and work on.

Vectorized code slower than loops? MATLAB

In the problem Im working on there is such a part of code, as shown below. The definition part is just to show you the sizes of arrays. Below I pasted vectorized version - and it is >2x slower. Why it happens so? I know that i happens if vectorization requiers large temporary variables, but (it seems) it is not true here.
And generally, what (other than parfor, with I already use) can I do to speed up this code?
maxN = 100;
levels = maxN+1;
xElements = 101;
umn = complex(zeros(levels, levels));
umn2 = umn;
bessels = ones(xElements, xElements, levels); % 1.09 GB
posMcontainer = ones(xElements, xElements, maxN);
tic
for j = 1 : xElements
for i = 1 : xElements
for n = 1 : 2 : maxN
nn = n + 1;
mm = 1;
for m = 1 : 2 : n
umn(nn, mm) = bessels(i, j, nn) * posMcontainer(i, j, m);
mm = mm + 1;
end
end
end
end
toc % 0.520594 seconds
tic
for j = 1 : xElements
for i = 1 : xElements
for n = 1 : 2 : maxN
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
umn2(nn, 1:numOfEl) = bessels(i, j, nn) * posMcontainer(i, j, m);
end
end
end
toc % 1.275926 seconds
sum(sum(umn-umn2)) % veryfying, if all done right
Best regards,
Alex
From the profiler:
Edit:
In reply to #Jason answer, this alternative takes the same time:
for n = 1:2:maxN
nn(n) = n + 1;
numOfEl(n) = ceil(n/2);
end
for j = 1 : xElements
for i = 1 : xElements
for n = 1 : 2 : maxN
umn2(nn(n), 1:numOfEl(n)) = bessels(i, j, nn(n)) * posMcontainer(i, j, 1:2:n);
end
end
end
Edit2:
In reply to #EBH :
The point is to do the following:
parfor i = 1 : xElements
for j = 1 : xElements
umn = complex(zeros(levels, levels)); % cleaning
for n = 0:maxN
mm = 1;
for m = -n:2:n
nn = n + 1; % for indexing
if m < 0
umn(nn, mm) = bessels(i, j, nn) * negMcontainer(i, j, abs(m));
end
if m > 0
umn(nn, mm) = bessels(i, j, nn) * posMcontainer(i, j, m);
end
if m == 0
umn(nn, mm) = bessels(i, j, nn);
end
mm = mm + 1; % for indexing
end % m
end % n
beta1 = sum(sum(Aj1.*umn));
betaSumSq1(i, j) = abs(beta1).^2;
beta2 = sum(sum(Aj2.*umn));
betaSumSq2(i, j) = abs(beta2).^2;
end % j
end % i
I speeded it up as much, as I was able to. What you have written is taking only the last bessels and posMcontainer values, so it does not produce the same result. In the real code, those two containers are filled not with 1, but with some precalculated values.
After your edit, I can see that umn is just a temporary variable for another calculation. It still can be mostly vectorizable:
betaSumSq1 = zeros(xElements); % preallocating
betaSumSq2 = zeros(xElements); % preallocating
% an index matrix to fetch the right values from negMcontainer and
% posMcontainer:
indmat = tril(repmat([0 1;1 0],ceil((maxN+1)/2),floor(levels/2)));
indmat(end,:) = [];
% an index matrix to fetch the values in correct order for umn:
b_ind = repmat([1;0],ceil((maxN+1)/2),1);
b_ind(end) = [];
tempind = logical([fliplr(indmat) b_ind indmat+triu(ones(size(indmat)))]);
% permute the arrays to prevent squeeze:
PM = permute(posMcontainer,[3 1 2]);
NM = permute(negMcontainer,[3 1 2]);
B = permute(bessels,[3 1 2]);
for k = 1 : maxN+1 % third dim
for jj = 1 : xElements % columns
b = B(:,jj,k); % get one vector of B
% perform b*NM for every row of NM*indmat, than flip the result:
neg = fliplr(bsxfun(#times,bsxfun(#times,indmat,NM(:,jj,k).'),b));
% perform b*PM for every row of PM*indmat:
pos = bsxfun(#times,bsxfun(#times,indmat,PM(:,jj,k).'),b);
temp = [neg mod(1:levels,2).'.*b pos].'; % concat neg and pos
% assign them to the right place in umn:
umn = reshape(temp(tempind.'),[levels levels]).';
beta1 = Aj1.*umn;
betaSumSq1(jj,k) = abs(sum(beta1(:))).^2;
beta2 = Aj2.*umn;
betaSumSq2(jj,k) = abs(sum(beta2(:))).^2;
end
end
This reduce running time from ~95 seconds to less 3 seconds (both without parfor), so it improves in almost 97%.
I would suspect it is memory allocation. You are re-allocating the m array in a 3 deep loop.
try rearranging the code:
tic
for n = 1 : 2 : maxN
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
for j = 1 : xElements
for i = 1 : xElements
umn2(nn, 1:numOfEl) = bessels(i, j, nn) * posMcontainer(i, j, m);
end
end
end
toc % 1.275926 seconds
I was trying this in Igor pro, which a similar language, but with different optimizations. So the direct translations don't time the same way as Matlab (vectorized was slightly faster in Igor). But reordering the loops did speed up the vectorized form.
In your second part of the code, that is setting umn2, inside the loops, you have:
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
Those 3 lines don't require any input from the i and j loops, they only use the n loop. So reordering the loops such that i and j are inside the n loop will mean that those 3 lines are done xElements^2 (100^2) times less often. I suspect it is that m = 1:2:n line that takes time, since that is allocating an array.

Can anyone explain how this division algorithm works?

I saw this in an algorithm textbook. I am confused about the middle recursive function. If you can explain it with an example, such as 4/2, that would be great!
function divide(x, y)
Input: Two n-bit integers x and y, where y ≥ 1
Output: The quotient and remainder of x divided by y
if x = 0: return (q, r) = (0, 0)
(q, r) = divide(floor(x/2), y)
q = 2 · q, r = 2 · r
if x is odd: r = r + 1
if r ≥ y: r = r − y, q = q + 1
return (q, r)
You're seeing how many times it's divisible by 2. This is essentially performing bit shifts and operating on the binary digits. A more interesting case would be 13/3 (13 is 1101 in binary).
divide(13, 3) // initial binary value - 1101
divide(6, 3) // shift right - 110
divide(3, 3) // shift right - 11
divide(1, 3) // shift right - 1 (this is the most significant bit)
divide(0, 3) // shift right - 0 (no more significant bits)
return(0, 0) // roll it back up
return(0, 1) // since x is odd (1)
return(1, 0) // r = r * 2 = 2; x is odd (3) so r = 3 and the r > y condition is true
return(2, 0) // q = 2 * 1; r = 2 * 1 - so r >= y and q = 2 + 1
return(4, 1) // q = 2 * 2; x is odd to r = 0 + 1

vectorize/optimize this code in MATLAB?

I am building my first large-scale MATLAB program, and I've managed to write original vectorized code for everything so for until I came to trying to create an image representing vector density in stereographic projection. After a couple failed attempts I went to the Mathworks file exchange site and found an open source program which fits my needs courtesy of Malcolm Mclean. With a test matrix his function produces something like this:
And while this is almost exactly what I wanted, his code relies on a triply nested for-loop. On my workstation a test data matrix of size 25000x2 took 65 seconds in this section of code. This is unacceptable since I will be scaling up to a data matrices of size 500000x2 in my project.
So far I've been able to vectorize the innermost loop (which was the longest/worst loop), but I would like to continue and be rid of the loops entirely if possible. Here is Malcolm's original code that I need to vectorize:
dmap = zeros(height, width); % height, width: scalar with default value = 32
for ii = 0: height - 1 % 32 iterations of this loop
yi = limits(3) + ii * deltay + deltay/2; % limits(3) & deltay: scalars
for jj = 0 : width - 1 % 32 iterations of this loop
xi = limits(1) + jj * deltax + deltax/2; % limits(1) & deltax: scalars
dd = 0;
for kk = 1: length(x) % up to 500,000 iterations in this loop
dist2 = (x(kk) - xi)^2 + (y(kk) - yi)^2;
dd = dd + 1 / ( dist2 + fudge); % fudge is a scalar
end
dmap(ii+1,jj+1) = dd;
end
end
And here it is with the changes I've already made to the innermost loop (which was the biggest drain on efficiency). This cuts the time from 65 seconds down to 12 seconds on my machine for the same test matrix, which is better but still far slower than I would like.
dmap = zeros(height, width);
for ii = 0: height - 1
yi = limits(3) + ii * deltay + deltay/2;
for jj = 0 : width - 1
xi = limits(1) + jj * deltax + deltax/2;
dist2 = (x - xi) .^ 2 + (y - yi) .^ 2;
dmap(ii + 1, jj + 1) = sum(1 ./ (dist2 + fudge));
end
end
So my main question, are there any further changes I can make to optimize this code? Or even an alternative method to approach the problem? I've considered using C++ or F# instead of MATLAB for this section of the program, and I may do so if I cannot get to a reasonable efficiency level with the MATLAB code.
Please also note that at this point I don't have ANY additional toolboxes, if I did then I know this would be trivial (using hist3 from the statistics toolbox for example).
Mem consuming solution
yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width ) - .5 * deltax;
dx = bsxfun( #minus, x(:), xi ) .^ 2;
dy = bsxfun( #minus, y(:), yi ) .^ 2;
dist2 = bsxfun( #plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
dmap = sum( 1./(dist2 + fudge ) , 3 );
EDIT
handling extremely large x and y by breaking the operation into blocks:
blockSize = 50000; % process up to XX elements at once
dmap = 0;
yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width ) - .5 * deltax;
bi = 1;
while bi <= numel(x)
% take a block of x and y
bx = x( bi:min(end, bi + blockSize - 1) );
by = y( bi:min(end, bi + blockSize - 1) );
dx = bsxfun( #minus, bx(:), xi ) .^ 2;
dy = bsxfun( #minus, by(:), yi ) .^ 2;
dist2 = bsxfun( #plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
dmap = dmap + sum( 1./(dist2 + fudge ) , 3 );
bi = bi + blockSize;
end
This is a good example of why starting a loop from 1 matters. The only reason that ii and jj are initiated at 0 is to kill the ii * deltay and jj * deltax terms which however introduces sequentiality in the dmap indexing, preventing parallelization.
Now, by rewriting the loops you could use parfor() after opening a matlabpool:
dmap = zeros(height, width);
yi = limits(3) + deltay*(1:height) - .5*deltay;
matlabpool 8
parfor ii = 1: height
for jj = 1: width
xi = limits(1) + (jj-1) * deltax + deltax/2;
dist2 = (x - xi) .^ 2 + (y - yi(ii)) .^ 2;
dmap(ii, jj) = sum(1 ./ (dist2 + fudge));
end
end
matlabpool close
Keep in mind that opening and closing the pool has significant overhead (10 seconds on my Intel Core Duo T9300, vista 32 Matlab 2013a).
PS. I am not sure whether the inner loop instead of the outer one can be meaningfully parallelized. You can try to switch the parfor to the inner one and compare speeds (I would recommend going for the big matrix immediately since you are already running in 12 seconds and the overhead is almost as big).
Alternatively, this problem can be solved in using kernel density estimation techniques. This is part of the Statistics Toolbox, or there's this KDE implementation by Zdravko Botev (no toolboxes required).
For the example code below, I get 0.3 seconds for N = 500000, or 0.7 seconds for N = 1000000.
N = 500000;
data = [randn(N,2); rand(N,1)+3.5, randn(N,1);]; % 2 overlaid distrib
tic; [bandwidth,density,X,Y] = kde2d(data); toc;
imagesc(density);

Is there any easy way to do modulus of 2^32 - 1 operation?

I just heard about that x mod (2^32-1) and x / (2^32-1) would be easy, but how?
to calculate the formula:
xn = (xn-1 + xn-1 / b)mod b.
For b = 2^32, its easy, x%(2^32) == x & (2^32-1); and x / (2^32) == x >> 32. (the ^ here is not XOR). How to do that when b = 2^32 - 1.
In the page https://en.wikipedia.org/wiki/Multiply-with-carry. They say "arithmetic for modulus 2^32 − 1 requires only a simple adjustment from that for 2^32". So what is the "simple adjustment"?
(This answer only handles the mod case.)
I'll assume that the datatype of x is more than 32 bits (this answer will actually work with any positive integer) and that it is positive (the negative case is just -(-x mod 2^32-1)), since if it at most 32 bits, the question can be answered by
x mod (2^32-1) = 0 if x == 2^32-1, x otherwise
x / (2^32 - 1) = 1 if x == 2^32-1, 0 otherwise
We can write x in base 2^32, with digits x0, x1, ..., xn. So
x = x0 + 2^32 * x1 + (2^32)^2 * x2 + ... + (2^32)^n * xn
This makes the answer clearer when we do the modulus, since 2^32 == 1 mod 2^32-1. That is
x == x0 + 1 * x1 + 1^2 * x2 + ... + 1^n * xn (mod 2^32-1)
== x0 + x1 + ... + xn (mod 2^32-1)
x mod 2^32-1 is the same as the sum of the base 2^32 digits! (we can't drop the mod 2^32-1 yet). We have two cases now, either the sum is between 0 and 2^32-1 or it is greater. In the former, we are done; in the later, we can just recur until we get between 0 and 2^32-1. Getting the digits in base 2^32 is fast, since we can use bitwise operations. In Python (this doesn't handle negative numbers):
def mod_2to32sub1(x):
s = 0 # the sum
while x > 0: # get the digits
s += x & (2**32-1)
x >>= 32
if s > 2**32-1:
return mod_2to32sub1(s)
elif s == 2**32-1:
return 0
else:
return s
(This is extremely easy to generalise to x mod 2^n-1, in fact you just replace any occurance of 32 with n in this answer.)
(EDIT: added the elif clause to avoid an infinite loop on mod_2to32sub1(2**32-1). EDIT2: replaced ^ with **... oops.)
So you compute with the "rule" 232 = 1. In general, 232+x = 2x. You can simplify 2a by taking the exponent modulo 32. Example: 266 = 22.
You can express any number in binary, and then lower the exponents. Example: the number 240 + 238 + 220 + 2 + 1 can be simplified to 28 + 26 + 220 + 2 + 1.
In general, you can group the exponents every 32 powers of 2, and "downgrade" all exponents modulo 32.
For 64 bit words, the number can be expressed as
232 A + B
where 0 <= A,B <= 232-1. Getting A and B is easy with bitwise operations.
So you can simplify that to A + B, which is much smaller: at most 233. Then, check if this number is at least 232-1, and subtract 232 - 1 in that case.
This avoids expensive direct division.
The modulus has already been explained, nevertheless, let's recapitulate.
To find the remainder of k modulo 2^n-1, write
k = a + 2^n*b, 0 <= a < 2^n
Then
k = a + ((2^n-1) + 1) * b
= (a + b) + (2^n-1)*b
≡ (a + b) (mod 2^n-1)
If a + b >= 2^n, repeat until the remainder is less than 2^n, and if that leads you to a + b = 2^n-1, replace that with 0. Each "shift right by n and add to the last n bits" moves the first set bit right by n or n-1 places (unless k < 2^(2*n-1), when the first set bit after the shift-and-add may be the 2^n bit). So if the width of the type is large compared to n, this will need many shifts - consider a 128-bit type and n = 3, for large k you will need over 40 shifts. To reduce the number of shifts required, you can exploit the fact that
2^(m*n) - 1 = (2^n - 1) * (2^((m-1)*n) + 2^((m-2)*n) + ... + 2^(2*n) + 2^n + 1),
of which we will only use that 2^n - 1 divides 2^(m*n) - 1 for all m > 0. Then you shift by multiples of n that are roughly half the maximal bit-length the value can have at that step. For the above example of a 128-bit type and the remainder modulo 7 (2^3 - 1), the closest multiples of 3 to 128/2 are 63 and 66, first shift by 63 bits
r_1 = (k & (2^63 - 1)) + (k >> 63) // r_1 < 2^63 + 2^(128-63) < 2^66
to get a number with at most 66 bits, then shift by 66/2 = 33 bits
r_2 = (r_1 & (2^33 - 1)) + (r_1 >> 33) // r_2 < 2^33 + 2^(66-33) = 2^34
to reach at most 34 bits. Next shift by 18 bits, then 9, 6, 3
r_3 = (r_2 & (2^18 - 1)) + (r_2 >> 18) // r_3 < 2^18 + 2^(34-18) < 2^19
r_4 = (r_3 & (2^9 - 1)) + (r_3 >> 9) // r_4 < 2^9 + 2^(19-9) < 2^11
r_5 = (r_4 & (2^6 - 1)) + (r_4 >> 6) // r_5 < 2^6 + 2^(11-6) < 2^7
r_6 = (r_5 & (2^3 - 1)) + (r_5 >> 3) // r_6 < 2^3 + 2^(7-3) < 2^5
r_7 = (r_6 & (2^3 - 1)) + (r_6 >> 3) // r_7 < 2^3 + 2^(5-3) < 2^4
Now a single subtraction if r_7 >= 2^3 - 1 suffices. To calculate k % (2^n -1) in a b-bit type, O(log2 (b/n)) shifts are needed.
The quotient is obtained similarly, again we write
k = a + 2^n*b, 0 <= a < 2^n
= a + ((2^n-1) + 1)*b
= (2^n-1)*b + (a+b),
so k/(2^n-1) = b + (a+b)/(2^n-1), and we continue while a+b > 2^n-1. Here we unfortunately cannot reduce the work by shifting and masking about half the width, so the method is only efficient when n is not much smaller than the width of the type.
Code for the fast cases where n is not too small:
unsigned long long modulus_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL;
while(k > mask) {
k = (k & mask) + (k >> n);
}
return k == mask ? 0 : k;
}
unsigned long long quotient_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL, quotient = 0;
while(k > mask) {
quotient += k >> n;
k = (k & mask) + (k >> n);
}
return k == mask ? quotient + 1 : quotient;
}
For the special case where n is half the width of the type, the loop runs at most twice, so if branches are expensive, it may be better to unroll the loop and unconditionally execute the loop body twice.
It is not. What must you have heard is x mod 2^n and x/2^n being easier. x/2^n can be performed as x>>n, and x mod 2^n, do x&(1<<n-1)

Resources