vectorize/optimize this code in MATLAB? - performance

I am building my first large-scale MATLAB program, and I've managed to write original vectorized code for everything so far, until I came to creating an image representing vector density in stereographic projection. After a couple of failed attempts I went to the MathWorks File Exchange site and found an open-source program that fits my needs, courtesy of Malcolm Mclean. With a test matrix his function produces something like this:
And while this is almost exactly what I wanted, his code relies on a triply nested for-loop. On my workstation a test data matrix of size 25000x2 took 65 seconds in this section of code. This is unacceptable since I will be scaling up to data matrices of size 500000x2 in my project.
So far I've been able to vectorize the innermost loop (which was the longest/worst loop), but I would like to continue and be rid of the loops entirely if possible. Here is Malcolm's original code that I need to vectorize:
dmap = zeros(height, width); % height, width: scalar with default value = 32
for ii = 0: height - 1 % 32 iterations of this loop
    yi = limits(3) + ii * deltay + deltay/2; % limits(3) & deltay: scalars
    for jj = 0 : width - 1 % 32 iterations of this loop
        xi = limits(1) + jj * deltax + deltax/2; % limits(1) & deltax: scalars
        dd = 0;
        for kk = 1: length(x) % up to 500,000 iterations in this loop
            dist2 = (x(kk) - xi)^2 + (y(kk) - yi)^2;
            dd = dd + 1 / ( dist2 + fudge); % fudge is a scalar
        end
        dmap(ii+1,jj+1) = dd;
    end
end
And here it is with the changes I've already made to the innermost loop (which was the biggest drain on efficiency). This cuts the time from 65 seconds down to 12 seconds on my machine for the same test matrix, which is better but still far slower than I would like.
dmap = zeros(height, width);
for ii = 0: height - 1
    yi = limits(3) + ii * deltay + deltay/2;
    for jj = 0 : width - 1
        xi = limits(1) + jj * deltax + deltax/2;
        dist2 = (x - xi) .^ 2 + (y - yi) .^ 2;
        dmap(ii + 1, jj + 1) = sum(1 ./ (dist2 + fudge));
    end
end
So my main question, are there any further changes I can make to optimize this code? Or even an alternative method to approach the problem? I've considered using C++ or F# instead of MATLAB for this section of the program, and I may do so if I cannot get to a reasonable efficiency level with the MATLAB code.
Please also note that at this point I don't have ANY additional toolboxes, if I did then I know this would be trivial (using hist3 from the statistics toolbox for example).

Memory-consuming solution:
yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width ) - .5 * deltax;
dx = bsxfun( @minus, x(:), xi ) .^ 2;
dy = bsxfun( @minus, y(:), yi ) .^ 2;
dist2 = bsxfun( @plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
dmap = sum( 1./(dist2 + fudge ) , 3 );
Handling extremely large x and y by breaking the operation into blocks:
blockSize = 50000; % process up to XX elements at once
dmap = 0;
yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width ) - .5 * deltax;
bi = 1;
while bi <= numel(x)
    % take a block of x and y
    bx = x( bi:min(end, bi + blockSize - 1) );
    by = y( bi:min(end, bi + blockSize - 1) );
    dx = bsxfun( @minus, bx(:), xi ) .^ 2;
    dy = bsxfun( @minus, by(:), yi ) .^ 2;
    dist2 = bsxfun( @plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
    dmap = dmap + sum( 1./(dist2 + fudge ) , 3 );
    bi = bi + blockSize;
end
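As a sanity check, the blocked version can be compared against the plain double loop on a small random data set. This is only a sketch: the grid size, limits, fudge value and random test points below are made-up values for illustration, not taken from the question.

```matlab
% Hypothetical small test problem (all values are assumptions)
rng(0);                              % reproducible random points
N = 2000; height = 8; width = 6;
x = rand(N, 1); y = rand(N, 1);
limits = [0 1 0 1];                  % [xmin xmax ymin ymax]
deltax = (limits(2) - limits(1)) / width;
deltay = (limits(4) - limits(3)) / height;
fudge  = 1e-3;

% Reference: double loop with the vectorized inner sum from the question
ref = zeros(height, width);
for ii = 0:height-1
    yi = limits(3) + ii*deltay + deltay/2;
    for jj = 0:width-1
        xi = limits(1) + jj*deltax + deltax/2;
        ref(ii+1, jj+1) = sum(1 ./ ((x - xi).^2 + (y - yi).^2 + fudge));
    end
end

% Blocked bsxfun version
blockSize = 300;                     % deliberately not a divisor of N
yi = limits(3) + deltay*(1:height) - .5*deltay;
xi = limits(1) + deltax*(1:width)  - .5*deltax;
dmap = 0;
bi = 1;
while bi <= numel(x)
    idx = bi:min(numel(x), bi + blockSize - 1);
    dx = bsxfun(@minus, x(idx), xi).^2;          % block-by-width
    dy = bsxfun(@minus, y(idx), yi).^2;          % block-by-height
    dist2 = bsxfun(@plus, permute(dy, [2 3 1]), permute(dx, [3 2 1]));
    dmap = dmap + sum(1 ./ (dist2 + fudge), 3);
    bi = bi + blockSize;
end

maxRelErr = max(abs(dmap(:) - ref(:)) ./ ref(:));
fprintf('max relative difference: %g\n', maxRelErr);
```

The reason for blocking is memory: the unblocked version builds a height x width x N array, which for a 32 x 32 grid and 500000 points in double precision is 32*32*500000*8 bytes, roughly 4.1 GB.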

This is a good example of why starting a loop from 1 matters. The only reason that ii and jj are initialized at 0 is to kill the ii * deltay and jj * deltax terms in the first iteration; this, however, introduces sequentiality in the dmap indexing, which prevents parallelization.
Now, by rewriting the loops you could use parfor() after opening a matlabpool:
dmap = zeros(height, width);
yi = limits(3) + deltay*(1:height) - .5*deltay;
matlabpool 8            % R2013b and later: parpool(8)
parfor ii = 1: height
    for jj = 1: width
        xi = limits(1) + (jj-1) * deltax + deltax/2;
        dist2 = (x - xi) .^ 2 + (y - yi(ii)) .^ 2;
        dmap(ii, jj) = sum(1 ./ (dist2 + fudge));
    end
end
matlabpool close        % R2013b and later: delete(gcp('nocreate'))
Keep in mind that opening and closing the pool has significant overhead (10 seconds on my Intel Core Duo T9300, vista 32 Matlab 2013a).
PS. I am not sure whether the inner loop instead of the outer one can be meaningfully parallelized. You can try to switch the parfor to the inner one and compare speeds (I would recommend going for the big matrix immediately since you are already running in 12 seconds and the overhead is almost as big).

Alternatively, this problem can be solved using kernel density estimation techniques. This is part of the Statistics Toolbox, or there's this KDE implementation by Zdravko Botev (no toolboxes required).
For the example code below, I get 0.3 seconds for N = 500000, or 0.7 seconds for N = 1000000.
N = 500000;
data = [randn(N,2); rand(N,1)+3.5, randn(N,1);]; % 2 overlaid distrib
tic; [bandwidth,density,X,Y] = kde2d(data); toc;


Finite difference method for solving the Klein-Gordon equation in Matlab

I am trying to numerically solve the Klein-Gordon equation that can be found here. To make sure I solved it correctly, I am comparing it with an analytical solution that can be found on the same link. I am using the finite difference method and Matlab. The initial spatial conditions are known, not the initial time conditions.
I start off by initializing the constants and the space-time coordinate system:
close all
%% Constant parameters
A = 2;
B = 3;
lambda = 2;
mu = 3;
a = 4;
b = - (lambda^2 / a^2) + mu^2;
%% Coordinate system
number_of_discrete_time_steps = 300;
t = linspace(0, 2, number_of_discrete_time_steps);
dt = t(2) - t(1);
number_of_discrete_space_steps = 100;
x = transpose( linspace(0, 1, number_of_discrete_space_steps) );
dx = x(2) - x(1);
Next, I define and plot the analytical solution:
%% Analytical solution
Wa = cos(lambda * x) * ( A * cos(mu * t) + B * sin(mu * t) );
figure('Name', 'Analytical solution');
surface(t, x, Wa, 'edgecolor', 'none');
title('Wa(x, t) - analytical solution');
The plot of the analytical solution is shown here.
In the end, I define the initial spatial conditions, execute the finite difference method algorithm and plot the solution:
%% Numerical solution
Wn = zeros(number_of_discrete_space_steps, number_of_discrete_time_steps);
Wn(1, :) = Wa(1, :);
Wn(2, :) = Wa(2, :);
for j = 2 : (number_of_discrete_time_steps - 1)
    for i = 2 : (number_of_discrete_space_steps - 1)
        Wn(i + 1, j) = dx^2 / a^2 ...
            * ( ( Wn(i, j + 1) - 2 * Wn(i, j) + Wn(i, j - 1) ) / dt^2 + b * Wn(i - 1, j - 1) ) ...
            + 2 * Wn(i, j) - Wn(i - 1, j);
    end
end
figure('Name', 'Numerical solution');
surface(t, x, Wn, 'edgecolor', 'none');
title('Wn(x, t) - numerical solution');
The plot of the numerical solution is shown here.
The two plotted graphs are not the same, which is proof that I did something wrong in the algorithm. The problem is, I can't find the errors. Please help me find them.
To summarize, please help me change the code so that the two plotted graphs become approximately the same. Thank you for your time.
The finite difference discretization of w_tt = a^2 * w_xx - b*w is
( w(i,j+1) - 2*w(i,j) + w(i,j-1) ) / dt^2
= a^2 * ( w(i+1,j) - 2*w(i,j) + w(i-1,j) ) / dx^2 - b*w(i,j)
In your order this gives the recursion equation
w(i,j+1) = dt^2 * ( (a/dx)^2 * ( w(i+1,j) - 2*w(i,j) + w(i-1,j) ) - b*w(i,j) )
+2*w(i,j) - w(i,j-1)
Stability requires at least a*dt/dx < 1. The present parameters do not satisfy this: they give a ratio of about 2.6. Increasing the time discretization to 1000 points is sufficient.
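The ratio can be checked directly from the grid definitions in the question (100 space points on [0, 1], 300 or 1000 time points on [0, 2], a = 4). A minimal sketch:

```matlab
% Parameters taken from the question
a  = 4;
dx = 1 / (100 - 1);                 % 100 space points on [0, 1]
nt = [300 1000];                    % original and increased number of time points
dt = 2 ./ (nt - 1);                 % time step on [0, 2]
ratio = a * dt / dx;                % must stay below 1 for stability
fprintf('nt = %4d: a*dt/dx = %.2f\n', [nt; ratio]);
% prints 2.65 for nt = 300 and 0.79 for nt = 1000
```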
Next up are the boundary conditions. Besides the two leading columns for times 0 and dt, one also needs to set the values at the boundaries x=0 and x=1. Copy those from the exact solution as well.
Wn(:,1:2) = Wa(:,1:2);
Wn([1 end],:) = Wa([1 end],:);
Then also correct the definition (and use) of b to that in the source
b = - (lambda^2 * a^2) + mu^2;
and the resulting numerical image looks identical to the analytical image in the color plot. The difference plot confirms the closeness
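Putting the three corrections together (time-stepping recursion, 1000 time points, corrected b, boundary and initial values copied from the exact solution), the whole computation might look like the sketch below. The short variable names nt and nx and the vectorized interior update are my own choices, not part of the original code.

```matlab
% Parameters from the question, with b = mu^2 - lambda^2*a^2
A = 2; B = 3; lambda = 2; mu = 3; a = 4;
b = mu^2 - lambda^2 * a^2;

nt = 1000; nx = 100;                 % 1000 time points so that a*dt/dx < 1
t = linspace(0, 2, nt);   dt = t(2) - t(1);
x = linspace(0, 1, nx).'; dx = x(2) - x(1);

Wa = cos(lambda * x) * (A * cos(mu * t) + B * sin(mu * t));   % exact solution

Wn = zeros(nx, nt);
Wn(:, 1:2)    = Wa(:, 1:2);          % first two time levels
Wn([1 nx], :) = Wa([1 nx], :);       % boundary values at x = 0 and x = 1

i = 2:nx-1;                          % all interior points at once
for j = 2:nt-1
    Wn(i, j+1) = dt^2 * ( (a/dx)^2 * (Wn(i+1, j) - 2*Wn(i, j) + Wn(i-1, j)) ...
                          - b * Wn(i, j) ) ...
                 + 2*Wn(i, j) - Wn(i, j-1);
end

maxErr = max(abs(Wn(:) - Wa(:)));
fprintf('max |Wn - Wa| = %g\n', maxErr);
```

Plotting Wn - Wa with surface(t, x, Wn - Wa, 'edgecolor', 'none') gives the difference plot.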

Matlab visualization

I used this link to walk through FDTD code and write it for myself to practice.
This is the code that I wrote (it's not verbatim, but extremely similar). When I run the program, I am told that "Field amplitude is too small to visualize properly." I don't understand why. As far as I understand MATLAB (I'm very new to it), I am scaling the graph on my own. There is a function called draw1d that is in a protected file here:
How should I fix this?
close all;
clear all;
% Units
meters = 1;
seconds = 1;
%Fundamental constants
c0 = 3e8 * meters/seconds;
e0 = 8.85e-12 * 1/meters;
u0 = 1.26e-6 * 1/meters;
%Figure Window
figure('Color', 'b');
%Simple parameters
dz = 5 * meters;
Nz = 200;
dt = 1e-3 * seconds;
STEPS = 1000;
%Grid Device - let it be air
ER = ones(1, Nz);
UR = ones(1, Nz);
%% Initialize FDTD Parameters
%Initialize Vectors
mEy = (c0*dt)./ER; %unique update coefficient for each place on the grid
mHx = (c0*dt)./UR; %unique update coefficent again
%% Initialize Fields
%Initialize Fields
Ey = zeros(1, Nz);
Hx = zeros(1, Nz);
%Actually doing FDTD
for T = 1 : STEPS
    %Update H from E
    for nz = 1 : Nz - 1
        Hx(nz) = Hx(nz) + mHx(nz)*(Ey(nz+1) - Ey(nz))/dz;
    end
    Hx(Nz) = Hx(Nz) + mHx(Nz)*(0 - Ey(Nz))/dz;
    %Update E from H
    Ey(1) = Ey(1) + mEy(1) * ( Hx(1) - 0 )/dz;
    for nz = 2 : Nz
        Ey(nz) = Ey(nz) + mEy(nz) * ( Hx(nz) - Hx(nz-1) )/dz;
    end
    %Show Status
    if ~mod(T, 10)
        draw1d(ER, Ey, Hx, dz);
        xlim([dz Nz*dz]);
        title(['Field at step ' num2str(T) ' of ' num2str(STEPS)]);
    end
end

Find area of two overlapping circles using monte carlo method

I have two intersecting circles, as shown in the figure.
I want to find the area of each part separately using the Monte Carlo method in MATLAB.
The code doesn't draw the rectangle or the circles correctly, so
I guess my calculation of x and y is wrong. I am not very familiar with the geometry equations needed to solve it, so I need help with the equations.
This is my code so far:
%supposing that a rectangle will contain both circles so :
% the mid point of the distance between 2 circles will be (0,6)
% then by adding the radius of the left and right circles the total distance
% will be 27 , 11 from the left and 16 from the right
% width of rectangle = 24
for i=1:n
    if((x(i))^2+(y(i))^2<=25 && (x(i))^2+(y(i)-12)^2<=100)
        hold on
    elseif(~(x(i))^2+(y(i))^2<=25 &&(x(i))^2+(y(i)-12)^2<=100)
        hold on
    end
end
Here are the errors I found:
x = 27*rand(n,1)-5
y = 24*rand(n,1)-12
The rectangle extents were incorrect, and rand(n-1) gives you an (n-1)-by-(n-1) matrix rather than a vector.
first If:
(x(i))^2+(y(i))^2<=25 && (x(i)-12)^2+(y(i))^2<=100
the center of the large circle is at x=12 not y=12
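Second If (note the extra parentheses, so that ~ negates the whole comparison rather than only the first term):
~((x(i))^2+(y(i))^2<=25) && (x(i)-12)^2+(y(i))^2<=100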
This code can be improved by using logical indexing.
For example, using R, you could do (Matlab code is left as an exercise):
n = 10000
x = 27*runif(n)-5
y = 24*runif(n)-12
r = (x^2 + y^2)<=25 & ((x-12)^2 + y^2)<=100
g = (x^2 + y^2)<=25
b = ((x-12)^2 + y^2)<=100
which gives:
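For reference, a MATLAB counterpart of the R snippet could look like the sketch below. The mask names, the fixed random seed and the area estimates are my additions; the rectangle and circle parameters are the ones used above.

```matlab
rng(1);                                   % reproducible (seed is arbitrary)
n = 100000;
x = 27 * rand(n, 1) - 5;                  % x in [-5, 22]
y = 24 * rand(n, 1) - 12;                 % y in [-12, 12]

inSmall = x.^2 + y.^2 <= 25;              % circle of radius 5 centred at (0, 0)
inLarge = (x - 12).^2 + y.^2 <= 100;      % circle of radius 10 centred at (12, 0)

% Area estimate = rectangle area times the fraction of points in the region
rectArea = 27 * 24;
areaBoth      = rectArea * mean( inSmall &  inLarge);
areaSmallOnly = rectArea * mean( inSmall & ~inLarge);
areaLargeOnly = rectArea * mean(~inSmall &  inLarge);

fprintf('both: %.1f, small only: %.1f, large only: %.1f\n', ...
        areaBoth, areaSmallOnly, areaLargeOnly);
```

The same masks can be used for plotting, for example plot(x(inSmall & inLarge), y(inSmall & inLarge), 'r.').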
Here is my generic solution for any two circles (without any hardcoded value):
function [ P ] = circles_intersection_area( k1, k2, N )
% Adnan A.
x1 = k1(1);
y1 = k1(2);
r1 = k1(3);
x2 = k2(1);
y2 = k2(2);
r2 = k2(3);
if sqrt((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2)) >= (r1 + r2)
    % no intersection
    P = 0;
    return;
end
% Wrapper rectangle config
a_min = x1 - r1 - 2*r2;
a_max = x1 + r1 + 2*r2;
b_min = y1 - r1 - 2*r2;
b_max = y1 + r1 + 2*r2;
% Monte Carlo algorithm
n = 0;
for i = 1:N
    % unifrnd requires the Statistics Toolbox
    rand_x = unifrnd(a_min, a_max);
    rand_y = unifrnd(b_min, b_max);
    if sqrt((rand_x - x1)^2 + (rand_y - y1)^2) < r1 && sqrt((rand_x - x2)^2 + (rand_y - y2)^2) < r2
        % the point lies in both circles
        n = n + 1;
        plot(rand_x,rand_y, 'go-');
        hold on;
    else
        plot(rand_x,rand_y, 'ko-');
        hold on;
    end
end
P = (a_max - a_min) * (b_max - b_min) * n / N;
Call it like: circles_intersection_area([-0.4,0,1], [0.4,0,1], 10000) where the first param is the first circle (x,y,r) and the second param is the second circle.
Without using a for loop:
n = 100000;
data = rand(2,n);
data = data*2*30 - 30;
x = data(1,:);
y = data(2,:);
inside5 = find(x.^2 + y.^2 <=25);
hold on
plot (x(inside5),y(inside5),'bo');
hold on
inside12 = find(x.^2 + (y-12).^2<=144);
plot (x(inside12),y(inside12),'g');
hold on
insidefinal1 = find(x.^2 + y.^2 <=25 & x.^2 + (y-12).^2>=144);
insidefinal2 = find(x.^2 + y.^2 >=25 & x.^2 + (y-12).^2<=144);
% plot(x(insidefinal1),y(insidefinal1),'bo');
hold on
% plot(x(insidefinal2),y(insidefinal2),'ro');
insidefinal3 = find(x.^2 + y.^2 <=25 & x.^2 + (y-12).^2<=144);
% plot(x(insidefinal3),y(insidefinal3),'ro');
area2= (60^2)*(length(insidefinal3)/n);

Vectorizing nested loops in matlab using bsxfun and with GPU

For loops seem to be extremely slow, so I was wondering if the nested loops in the code shown next could be vectorized using bsxfun and maybe GPU could be introduced too.
%// Parameters
i = 1;
j = 3;
n1 = 1500;
n2 = 1500;
%// Pre-allocate for output
%// Nested Loops - I
for x = 1:n1
    for y = 1:n1
        num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
        LInc(x, y) = L1(x, y) + (num/denom);
        LInc(y, x) = LInc(x, y);
    end
end
%// Nested Loops - II
for x = 1:n1
    for y = 1:n2
        num = (n1 * n * L1(x,i)) + (n2 * n * L2(y,j)) - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1)));
        LInc(x, n1+y) = num/denom;
        LInc(n1+y, x) = LInc(x, n1+y);
    end
end
Edit 1: n and denom could be assumed as constants too.
Here are vectorized CPU and GPU codes and I am hoping that I am using at least good practices for the GPU code and the benchmarking later on.
CPU Code
%// Pre-allocate for output
LInc = zeros(n1+n2,n1+n2);
%// Calculate num/denom value for stage 1 and 2
nd1 = L1 + (((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - n2*n*bsxfun(@plus,L1(:,i),L1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*L1(:,i),n2*n*L2(:,j).') - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1))))./denom; %//'
%// Plug in the values in the output matrix
LInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
LInc(n1+1:end,1:n1) = nd2.'; %//'
LInc(1:n1,n1+1:end) = nd2;
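To check that the vectorized CPU code reproduces the nested loops, both can be run on small random inputs. This is just a sketch: the sizes, the values of n and denom, and the helper variable c are arbitrary choices for the test. Because the first loop nest overwrites LInc(x, y) with the value computed for (y, x), the final upper-left block is the lower triangle mirrored, which is what the tril expression builds.

```matlab
% Hypothetical small inputs, chosen only to exercise the code
rng(0);
i = 1; j = 3; n1 = 4; n2 = 5;
n = 7; denom = 11;                       % assumed constants
L1 = rand(n1, n1);
L2 = rand(n2, j);
c  = L1(i, i) + L2(j, j) + 1;            % term shared by both stages

% Reference: the nested loops from the question, with preallocation
ref = zeros(n1 + n2);
for x = 1:n1
    for y = 1:n1
        num = n2^2 * c - n2 * n * (L1(x, i) + L1(y, i));
        ref(x, y) = L1(x, y) + num / denom;
        ref(y, x) = ref(x, y);
    end
end
for x = 1:n1
    for y = 1:n2
        num = n1 * n * L1(x, i) + n2 * n * L2(y, j) - n1 * n2 * c;
        ref(x, n1 + y) = num / denom;
        ref(n1 + y, x) = ref(x, n1 + y);
    end
end

% Vectorized version
nd1 = L1 + (n2^2 * c - n2 * n * bsxfun(@plus, L1(:, i), L1(:, i).')) ./ denom;
nd2 = (bsxfun(@plus, n1 * n * L1(:, i), n2 * n * L2(:, j).') - n1 * n2 * c) ./ denom;
LInc = zeros(n1 + n2);
LInc(1:n1, 1:n1)     = tril(nd1) + tril(nd1, -1).';
LInc(n1+1:end, 1:n1) = nd2.';
LInc(1:n1, n1+1:end) = nd2;

maxDiff = max(abs(LInc(:) - ref(:)));
fprintf('max difference: %g\n', maxDiff);
```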
GPU Code
%// Pre-allocate for output
gLInc = zeros(n1+n2,n1+n2,'gpuArray');
%// Convert to gpu arrays
gL1 = gpuArray(L1);
gL2 = gpuArray(L2);
%// Calculate num/denom value for stage 1 and 2
nd1 = gL1 + (((n2 ^ 2) * (gL1(i, i) + gL2(j, j) + 1)) - n2*n*bsxfun(@plus,gL1(:,i),gL1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*gL1(:,i),n2*n*gL2(:,j).') - ((n1 * n2 * (gL1(i, i) + gL2(j, j) + 1))))./denom; %//'
%// Plug in the values in the output matrix
gLInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
gLInc(n1+1:end,1:n1) = nd2.'; %//'
gLInc(1:n1,n1+1:end) = nd2;
%// Gather data from GPU back to CPU
LInc = gather(gLInc);
GPU benchmarking tips were taken from Measure and Improve GPU Performance.
%// Warm up GPU call with insignificant small scalar inputs, just in case
%// gputimeit doesn't do the same
temp1 = modp2(1,1,1,1,1,1,1,1); %// This is vectorized GPU code
i = 1;
j = 3;
n = 1000; %// Assumed
denom = 1e6; %// Assumed
N_arr = [50 100 200 500 1000 1500]; %// array elements for N (datasize)
timeall = zeros(3,numel(N_arr));
for k1 = 1:numel(N_arr)
    N = N_arr(k1);
    n1 = N; %// n1, n2 are assumed identical for less-complicated benchmarking
    n2 = N;
    L1 = rand(n1,n1);
    L2 = rand(n2,j);
    f = @() modp0(i,j,n1,n2,L1,L2,n,denom);%// Original CPU w/ preallocation
    timeall(1,k1) = timeit(f);
    clear f
    f = @() modp1(i,j,n1,n2,L1,L2,n,denom);%// Vectorized CPU code
    timeall(2,k1) = timeit(f);
    clear f
    f = @() modp2(i,j,n1,n2,L1,L2,n,denom);%// Vectorized GPU(GTX 750Ti) code
    timeall(3,k1) = gputimeit(f);
    clear f
end
%// Display benchmark results
figure,hold on, grid on
plot(N_arr, timeall.', '-o') %// assumed plot call (missing from the post): one curve per row of timeall
legend('Original CPU','Vectorized CPU','Vectorized GPU (GTX 750 Ti)')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')
Results show that the vectorized GPU code performs really well with higher datasize and goes from slower than both the vectorized CPU and original code to being twice as fast as the vectorized CPU code.
If you have not done so, you should preallocate LInc.
LInc = zeros(n1,n2);
If you want to vectorize it, you don't need to use bsxfun to vectorize your code. I think you can do something like
x = 1:n1;
y = 1:n1;
num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
LInc(x, y) = L1(x, y) + (num/denom);
However, this code is confusing to me because, as it is, you are overwriting the value of LInc several times. Without knowing what your goal is, it's hard for me to help more. The above code probably will not return the same values as your function.

Calculating value for n when incrementing a value using a for loop

First of all, sorry for the bad title. I'm not really sure how to title this topic, so feel free to mod it where necessary.
I am drawing X rings inside my stage with given dimensions. To give this some sense of depth, each ring towards the screen boundaries is slightly wider:
The largest ring should be as wide as the largest dimension of the stage (note that in the picture I am drawing 3 extra rings which are drawn outside the stage boundaries). Also it should be twice as wide as the smallest ring. By ring I am referring to the space between 2 red circles.
After calculating an _innerRadius, which is the width of the smallest ring, I am drawing them using
const RINGS:Number = 10; //the amount of rings, note we will draw 3 extra rings to fill the screen
const DEPTH:Number = 7; //the amount of size difference between rings to create depth effect
var radius:Number = 0;
for (var i:uint = 0; i < RINGS + 3; i++) {
    radius += _innerRadius + _innerRadius * ((i*DEPTH) / (RINGS - 1));
    _graphics.lineStyle(1, 0xFF0000, 1);
    _graphics.drawCircle(0, 0, radius * .5);
}
One of the sliders at the bottom goes from 0-100 being a percentage of the radius for the green ring which goes from the smallest to the largest ring.
I tried lerping between the smallest radius and the largest radius, which works fine if the DEPTH value is 1. However I don't want the distance between the rings to be the same for the sake of the illusion of depth.
Now I've been trying to figure this out for hours but it seems I've run into a wall.. it seems like I need some kind of non-linear formula here.. How would I calculate the radius based on the slider percentage value? Effectively for anywhere in between or on the red circles going from the smallest to the largest red circle?
Here's my example calculation for _innerRadius
//lets calculate _innerRadius for 10 rings
//inner ring width = X + 0/9 * X;
//ring 1 width = X + 1/9 * X;
//ring 2 width = X + 2/9 * X
//ring 3 width = X + 3/9 * X
//ring 4 width = X + 4/9 * X
//ring 5 width = X + 5/9 * X
//ring 6 width = X + 6/9 * X
//ring 7 width = X + 7/9 * X
//ring 8 width = X + 8/9 * X
//ring 9 width = X + 9/9 * X
//extent = Math.max(stage.stageWidth, stage.stageHeight);
//now we should solve extent = (X + 0/9 * X) + (X + 1/9 * X) + (X + 2/9 * X) + (X + 3/9 * X) + (X + 4/9 * X) + (X + 5/9 * X) + (X + 6/9 * X) + (X + 7/9 * X) + (X + 8/9 * X) + (X + 9/9 * X);
//lets add all X's
//extent = 10 * X + 45/9 * X
//extent = 15 * X;
//now reverse to solve for _innerRadius
//_innerRadius = extent / 15;
The way your drawing algorithm works, your radii are:
r[0] = r0
r[i + 1] = r[i] + (1 + (i + 1)*a)*r0
where a is a constant that is depth / (rings - 1). This results in:
r1 = r0 + (1 + a)*r0 = (2 + a)*r0
r2 = r1 + (1 + 2*a)*r0 = (3 + 3*a)*r0
r3 = r2 + (1 + 3*a)*r0 = (4 + 6*a)*r0
rn = (1 + n + a * sum(1 ... n))*r0
= (1 + n + a * n*(n + 1) / 2)*r0
Because you want ring n - 1 to correspond with your outer radius (let's forget about the three extra rings for now), you get:
r[n - 1] = (n + a * n*(n - 1) / 2)*r0 = extent
a = 2 * (extent / r0 - n) / n / (n - 1)
Then you can draw the rings:
for (var i = 0; i < rings; i++) {
    r = (1 + i + 0.5 * a * i*(i + 1)) * r0;
    // draw circle with radius r
}
You must have at least two rings in order not to have a division by zero when calculating a. Also note that this does not yield good results for all combinations: if the ratio of outer and inner circles is smaller than the number of rings, you get a negative a and have the depth effect reversed.
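Taking the question's loop literally (radius starts at 0 and iteration i adds (1 + i*a)*r0), the radius after iteration i is (1 + i + a*i*(i + 1)/2)*r0. The sketch below, written in MATLAB for brevity, checks that closed form against the accumulated ring widths and then evaluates it at a fractional ring index as one possible way to map a slider percentage to a radius. The values of rings, extent and r0 and the slider mapping are assumptions for illustration.

```matlab
% Example values (assumptions, not from the question)
rings  = 10;
extent = 600;
r0     = 20;                              % innermost radius

% Depth factor chosen so that the last ring (index rings-1) has radius extent
a = 2 * (extent / r0 - rings) / (rings * (rings - 1));

% Closed form: radius after loop iteration i = 0 .. rings-1
i = 0:rings-1;
rClosed = (1 + i + 0.5 * a * i .* (i + 1)) * r0;

% Same radii by accumulating ring widths, as the question's loop does
rLoop = cumsum((1 + i * a) * r0);

fprintf('max difference: %g, outer radius: %g\n', ...
        max(abs(rClosed - rLoop)), rClosed(end));

% Slider value p in [0, 100] mapped to a fractional ring index s, so that
% p = 0 gives r0, p = 100 gives extent, and integer s hits the drawn circles
p = 50;
s = p / 100 * (rings - 1);
rSlider = (1 + s + 0.5 * a * s * (s + 1)) * r0;
```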
Another, maybe simpler, approach to create the depth effect is to make each circle's radius a constant multiple of the previous:
r[i + 1] = r[i] * c
r[i] = r0 * Math.pow(c, i)
and draw them like this:
c = Math.pow(extent / r0, 1 / (rings - 1))
r = r0
for (var i = 0; i < rings; i++) {
    // draw circle with radius r
    r *= c;
}
This will create a "positive" depth effect as long as the ratio of radii is greater than one, that is, as long as extent is larger than r0. (And as long as there is more than one ring, of course.)
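A quick numerical illustration of the geometric variant, again sketched in MATLAB with made-up values for rings, extent and r0. With this spacing a slider percentage p can be mapped to a radius by using p/100 as the exponent; that mapping is my suggestion rather than something stated above.

```matlab
% Example values (assumptions, not from the question)
rings = 10; extent = 600; r0 = 20;

c = (extent / r0)^(1 / (rings - 1));      % constant ratio between neighbours
r = r0 * c.^(0:rings-1);                  % r(1) = r0, r(end) = extent

% Slider value p in [0, 100]: p = 0 gives r0, p = 100 gives extent
p = 50;
rSlider = r0 * (extent / r0)^(p / 100);

fprintf('ratio c = %.4f, outer radius = %g, slider radius = %.1f\n', ...
        c, r(end), rSlider);
```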
