How to optimize MATLAB bitwise operations - performance

I have written my own SHA1 implementation in MATLAB, and it gives correct hashes. However, it's very slow (a string a 1000 a's takes 9.9 seconds on my Core i7-2760QM), and I think the slowness is a result of how MATLAB implements bitwise logical operations (bitand, bitor, bitxor, bitcmp) and bitwise shifts (bitshift, bitrol, bitror) of integers.
Especially I wonder the need to construct fixed-point numeric objects for bitrol and bitror using fi command, because anyway in Intel x86 assembly there's rol and ror both for registers and memory addresses of all sizes. However, bitshift is quite fast (it doesn't need any fixed-point numeric costructs, a regular uint64 variable works fine), which makes the situation stranger: why in MATLAB bitrol and bitror need fixed-point numeric objects constructed with fi, whereas bitshift does not, when in assembly level it all comes down to shl, shr, rol and ror?
So, before writing this function in C/C++ as a .mex file, I'd be happy to know if there is any way to improve the performance of this function. I know there are some specific optimizations for SHA1, but that's not the issue, if the very basic implementation of bitwise rotations is so slow.
Testing a little bit with tic and toc, it's evident that what makes it slow are the loops in with bitrol and fi. There are two such loops:
%# Define some variables.
FFFFFFFF = uint64(hex2dec('FFFFFFFF'));
%# constants: K(1), K(2), K(3), K(4).
K(1) = uint64(hex2dec('5A827999'));
K(2) = uint64(hex2dec('6ED9EBA1'));
K(3) = uint64(hex2dec('8F1BBCDC'));
K(4) = uint64(hex2dec('CA62C1D6'));
W = uint64(zeros(1, 80));
... some other code here ...
%# First slow loop begins here.
for index = 17:80
W(index) = uint64(bitrol(fi(bitxor(bitxor(bitxor(W(index-3), W(index-8)), W(index-14)), W(index-16)), 0, 32, 0), 1));
end
%# First slow loop ends here.
H = sha1_handle_block_struct.H;
A = H(1);
B = H(2);
C = H(3);
D = H(4);
E = H(5);
%# Second slow loop begins here.
for index = 1:80
rotatedA = uint64(bitrol(fi(A, 0, 32, 0), 5));
if (index <= 20)
% alternative #1.
xorPart = bitxor(D, (bitand(B, (bitxor(C, D)))));
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(1);
elseif ((index >= 21) && (index <= 40))
% FIPS.
xorPart = bitxor(bitxor(B, C), D);
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(2);
elseif ((index >= 41) && (index <= 60))
% alternative #2.
xorPart = bitor(bitand(B, C), bitand(D, bitxor(B, C)));
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(3);
elseif ((index >= 61) && (index <= 80))
% FIPS.
xorPart = bitxor(bitxor(B, C), D);
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(4);
else
error('error in the code of sha1_handle_block.m!');
end
temp = bitand(temp, FFFFFFFF);
E = D;
D = C;
C = uint64(bitrol(fi(B, 0, 32, 0), 30));
B = A;
A = temp;
end
%# Second slow loop ends here.
Measuring with tic and toc, the entire computation of SHA1 hash of message abc takes on my laptop around 0.63 seconds, of which around 0.23 seconds is passed in the first slow loop and around 0.38 seconds in the second slow loop. So is there some way to optimize those loops in MATLAB before writing a .mex file?

There's this DataHash from the MATLAB File Exchange that calculates SHA-1 hashes lightning fast.
I ran the following code:
x = 'The quick brown fox jumped over the lazy dog'; %# Just a short sentence
y = repmat('a', [1, 1e6]); %# A million a's
opt = struct('Method', 'SHA-1', 'Format', 'HEX', 'Input', 'bin');
tic, x_hashed = DataHash(uint8(x), opt), toc
tic, y_hashed = DataHash(uint8(y), opt), toc
and got the following results:
x_hashed = F6513640F3045E9768B239785625CAA6A2588842
Elapsed time is 0.029250 seconds.
y_hashed = 34AA973CD4C4DAA4F61EEB2BDBAD27316534016F
Elapsed time is 0.020595 seconds.
I verified the results with a random online SHA-1 tool, and the calculation was indeed correct. Also, the 106 a's were hashed ~1.5 times faster than the first sentence.
So how does DataHash do it so fast??? Using the java.security.MessageDigest library, no less!
If you're interested with a fast MATLAB-friendly SHA-1 function, this is the way to go.
However, if this is just an exercise for implementing fast bit-level operations, then MATLAB doesn't really handle them efficiently, and in most cases you'll have to resort to MEX.

why in MATLAB bitrol and bitror need fixed-point numeric objects constructed with fi, whereas bitshift does not
bitrol and bitror are not part of the set of bitwise logic functions that are applicable for uints. They are part of the fixed-point toolbox, which also contains variants of bitand, bitshift etc that apply to fixed-point inputs.
A bitrol could be expressed as two bitshifts, a bitand and a bitor if you want to try using only the uint-functions. That might be even slower though.

As most MATLAB functions, bitand, bitor, bitxor are vectorized. So you get a lot faster if you give these function vector input rather than calling them in a loop over each element
Example:
%# create two sets of 10k random numbers
num = 10000;
hex = '0123456789ABCDEF';
A = uint64(hex2dec( hex(randi(16, [num 16])) ));
B = uint64(hex2dec( hex(randi(16, [num 16])) ));
%# compare loop vs. vectorized call
tic
C1 = zeros(size(A), class(A));
for i=1:numel(A)
C1(i) = bitxor(A(i),B(i));
end
toc
tic
C2 = bitxor(A,B);
toc
assert(isequal(C1,C2))
The timing was:
Elapsed time is 0.139034 seconds.
Elapsed time is 0.000960 seconds.
That's an order of magnitude faster!
The problem is, and as far as I can tell, the SHA-1 computation cannot be well vectorized. So you might not be able to take advantage of such vectorization.
As an experiment, I implemented a pure MATLAB-based funciton to compute such bit operations:
function num = my_bitops(op,A,B)
%# operation to perform: not, and, or, xor
if ischar(op)
op = str2func(op);
end
%# integer class: uint8, uint16, uint32, uint64
clss = class(A);
depth = str2double(clss(5:end));
%# bit exponents
e = 2.^(depth-1:-1:0);
%# convert to binary
b1 = logical(dec2bin(A,depth)-'0');
if nargin == 3
b2 = logical(dec2bin(B,depth)-'0');
end
%# perform binary operation
if nargin < 3
num = op(b1);
else
num = op(b1,b2);
end
%# convert back to integer
num = sum(bsxfun(#times, cast(num,clss), cast(e,clss)), 2, 'native');
end
Unfortunately, this was even worse in terms of performance:
tic, C1 = bitxor(A,B); toc
tic, C2 = my_bitops('xor',A,B); toc
assert(isequal(C1,C2))
The timing was:
Elapsed time is 0.000984 seconds.
Elapsed time is 0.485692 seconds.
Conclusion: write a MEX function or search the File Exchange to see if someone already did :)

Related

Image Processing: Algorithm taking too long in MATLAB

I am working in MATLAB to process two 512x512 images, the domain image and the range image. What I am trying to accomplish is the following:
Divide both domain and range images into 8x8 pixel blocks
For each 8x8 block in the domain image, I have to apply a linear transformations to it and compare each of the 4096 transformed blocks with each of the 4096 range blocks.
Compute error in each case between the transformed block and the range image block and find the minimum error.
Finally I'll have for each 8x8 range block, the id of the 8x8 domain block for which the error was minimum (error between the range block and the transformed domain block)
To achieve this, I have written the following code:
RangeImagecolor = imread('input.png'); %input is 512x512
DomainImagecolor = imread('input.png'); %Range and Domain images are identical
RangeImagetemp = rgb2gray(RangeImagecolor);
DomainImagetemp = rgb2gray(DomainImagecolor);
RangeImage = im2double(RangeImagetemp);
DomainImage = im2double(DomainImagetemp);
%For the (k,l)th 8x8 range image block
for k = 1:64
for l = 1:64
minerror = 9999;
min_i = 0;
min_j = 0;
for i = 1:64
for j = 1:64
%here I compute for the (i,j)th domain block, the transformed domain block stored in D_trans
error = 0;
D_trans = zeros(8,8);
R = zeros(8,8); %Contains the pixel values of the (k,l)th range block
for m = 1:8
for n = 1:8
R(m,n) = RangeImage(8*k-8+m,8*l-8+n);
%ApplyTransformation can depend on (k,l) so I can't compute the transformation outside the k,l loop.
[m_dash,n_dash] = ApplyTransformation(8*i-8+m,8*j-8+n);
D_trans(m,n) = DomainImage(m_dash,n_dash);
error = error + (R(m,n)-D_trans(m,n))^2;
end
end
if(error < minerror)
minerror = error;
min_i = i;
min_j = j;
end
end
end
end
end
As an example ApplyTransformation, one can use the identity transformation:
function [x_dash,y_dash] = Iden(x,y)
x_dash = x;
y_dash = y;
end
Now the problem I am facing is the high computation time. The order of computation in the above code is 64^5, which is of the order 10^9. This computation should take at the worst minutes or an hour. It takes about 40 minutes to compute just 50 iterations. I don't know why the code is running so slow.
Thanks for reading my question.
You can use im2col* to convert the image to column format so each block forms a column of a [64 * 4096] matrix. Then apply transformation to each column and use bsxfun to vectorize computation of error.
DomainImage=rand(512);
RangeImage=rand(512);
DomainImage_col = im2col(DomainImage,[8 8],'distinct');
R = im2col(RangeImage,[8 8],'distinct');
[x y]=ndgrid(1:8);
function [x_dash, y_dash] = ApplyTransformation(x,y)
x_dash = x;
y_dash = y;
end
[x_dash, y_dash] = ApplyTransformation(x,y);
idx = sub2ind([8 8],x_dash, y_dash);
D_trans = DomainImage_col(idx,:); %transformation is reduced to matrix indexing
Error = 0;
for mn = 1:64
Error = Error + bsxfun(#minus,R(mn,:),D_trans(mn,:).').^2;
end
[minerror ,min_ij]= min(Error,[],2); % linear index of minimum of each block;
[min_i min_j]=ind2sub([64 64],min_ij); % convert linear index to subscript
Explanation:
Our goal is to reduce number of loops as much as possible. For it we should avoid matrix indexing and instead we should use vectorization. Nested loops should be converted to one loop. As the first step we can create a more optimized loop as here:
min_ij = zeros(4096,1);
for kl = 1:4096 %%% => 1:size(D_trans,2)
minerror = 9999;
min_ij(kl) = 0;
for ij = 1:4096 %%% => 1:size(R,2)
Error = 0;
for mn = 1:64
Error = Error + (R(mn,kl) - D_trans(mn,ij)).^2;
end
if(Error < minerror)
minerror = Error;
min_ij(kl) = ij;
end
end
end
We can re-arrange the loops and we can make the most inner loop as the outer loop and separate computation of the minimum from the computation of the error.
% Computation of the error
Error = zeros(4096,4096);
for mn = 1:64
for kl = 1:4096
for ij = 1:4096
Error(kl,ij) = Error(kl,ij) + (R(mn,kl) - D_trans(mn,ij)).^2;
end
end
end
% Computation of the min
min_ij = zeros(4096,1);
for kl = 1:4096
minerror = 9999;
min_ij(kl) = 0;
for ij = 1:4096
if(Error(kl,ij) < minerror)
minerror = Error(kl,ij);
min_ij(kl) = ij;
end
end
end
Now the code is arranged in a way that can best be vectorized:
Error = 0;
for mn = 1:64
Error = Error + bsxfun(#minus,R(mn,:),D_trans(mn,:).').^2;
end
[minerror ,min_ij] = min(Error, [], 2);
[min_i ,min_j] = ind2sub([64 64], min_ij);
*If you don't have the Image Processing Toolbox a more efficient implementation of im2col can be found here.
*The whole computation takes less than a minute.
First things first - your code doesn't do anything. But you likely do something with this minimum error stuff and only forgot to paste this here, or still need to code that bit. Never mind for now.
One big issue with your code is that you calculate transformation for 64x64 blocks of resulting image AND source image. 64^5 iterations of a complex operation are bound to be slow. Rather, you should calculate all transformations at once and save them.
allTransMats = cell(64);
for i = 1 : 64
for j = 1 : 64
allTransMats{i,j} = getTransformation(DomainImage, i, j)
end
end
function D_trans = getTransformation(DomainImage, i,j)
D_trans = zeros(8);
for m = 1 : 8
for n = 1 : 8
[m_dash,n_dash] = ApplyTransformation(8*i-8+m,8*j-8+n);
D_trans(m,n) = DomainImage(m_dash,n_dash);
end
end
end
This serves to get allTransMat and is OUTSIDE the k, l loop. Preferably as a simple function.
Now, you make your big k, l, i, j loop, where you compare all the elements as needed. Comparison could be also done block-wise instead of filling a small 8x8 matrix, yet doing it per element for some reason.
m = 1 : 8;
n = m;
for ...
R = RangeImage(...); % This will give 8x8 output as n and m are vectors.
D = allTransMats{i,j};
difference = sum(sum((R-D).^2));
if (difference < minDifference) ...
end
Even though this is a simple no transformations case, this speeds up code a lot.
Finally, are you sure you need to compare each block of transformed output with each block in the source? Typically you compare block1(a,b) with block2(a,b) - blocks (or pixels) on the same position.
EDIT: allTransMats requires k and l too. Ouch. There is NO WAY to make this fast for a single iteration, as you require 64^5 calls to ApplyTransformation (or a vectorization of that function, but even then it might not be fast - we would have to see the function to help here).
Therefore, I will re-iterate my advice to generate all transformations and then perform lookup: this upper part of the answer with allTransMats generation should be changed to have all 4 loops and generate allTransMats{i,j,k,l};. It WILL be slow, there is no way around that as I mentioned in the upper part of edit. But, it is a cost you pay once, as after saving the allTransMats, all further image analyses will be able to simply load it instead of generating it again.
But ... what do you even do? Transformation that depends on source and destination block indices plus pixel indices (= 6 values total) sounds like a mistake somewhere, or a prime candidate to optimize instead of all the rest.

Speeding up simulation of the Levy motion algorithm

Here is my little script for simulating Levy motion:
clear all;
clc; close all;
t = 0; T = 1000; I = T-t;
dT = T/I; t = 0:dT:T; tau = T/I;
alpha = 1.5;
sigma = dT^(1/alpha);
mu = 0; beta = 0;
N = 1000;
X = zeros(N, length(I));
for k=1:N
L = zeros(1,I);
for i = 1:I-1
L( (i + 1) * tau ) = L(i*tau) + stable2( alpha, beta, sigma, mu, 1);
end
X(k,1:length(L)) = L;
end
q = 0.1:0.1:0.9;
quant = qlines2(X, q, t(1:length(X)), tau);
hold all
for i = 1:length(quant)
plot( t, quant(i) * t.^(1/alpha), ':k' );
end
Where stable2 returns a stable random variable with given parameters (you may replace it with normrnd(mu, sigma) for this case, it's not crucial); qlines2 returns quantiles needed for plotting.
But I don't want to talk about math here. My problem is that this implementation is pretty slow, and I would like to speed it up. Unfortunately, computer science is not my main field - I heard something about methods like memoization, vectorization and that there is a lot of other techniques, but I don't know how to use them.
For example, I'm pretty sure I should replace this filthy double for-loop somehow, but I'm not sure what to do instead.
EDIT: Maybe I should use (and learn...) another language (Python, C, any functional one)? I always though that Matlab/OCTAVE is designed for numerical computation, but if change, then for which one?
The crucial bit is, as you said, the for loops, Matlab does not like those, so vectorization is indeed the keyword. (Together with preallocating the space.
I just altered you for loop section somewhat so that you do not have to reset L over and over again, instead we save all Ls in a bigger matrix (also I elimiated the length(L) command).
L = zeros(N,I);
for k=1:N
for i = 1:I-1
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
X(k,1:I) = L(k,1:I);
end
Now you can already see that X(k,1:I) = L(k,1:I); in the loop is obsolete and that also means that we can switch the order of the loops. This is crucial, because the i-steps are recursive (depend on the previous step) that means we cannot vectorize this loop, we can only vectorize the k-loop.
Now your original code needed 9.3 seconds on my machine, the new code still needs about the same time)
L = zeros(N,I);
for i = 1:I-1
for k=1:N
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
end
X = L;
But now we can apply the vectorization, instead of looping throu all rows (the loop over k) we can instead eliminate this loop, and doing all rows at "once".
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma); %<- this is not yet what you want, see comment below
end
X = L;
This code need only 0.045 seconds on my machine. I hope you still get the same output, because I have no idea what you are calculating, but I also hope you could see how you go about vectorizing code.
PS: I just noticed that we now use the same random number in the last example for the whole column, this is obviously not what you want. Instad you should generate a whole vector of random numbers, e.g:
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma,N,1);
end
X = L;
PPS: Great question!

How to speed up a double loop in matlab

This is a follow-up question of this question.
The following code takes an enormous amount of time to loop through. Do you have any recommendations for speeding up the process? The variable z has a size of 479x1672 and others will be around 479x12000.
z = HongKongPrices;
zmat = false(size(z));
r = size(z,1);
c = size(z,2);
for k = 1:c
for i = 5:r
if z(i,k) == z(i-4,k) && z(i,k) == z(i-3,k) && z(i,k) == z(end,k)
zmat(i-3:i,k) = 1
end
end
end
z(zmat) = NaN
I am currently running this with MatLab R2014b on an iMac with 3.2 Intel i5 and 16 GB DDR3.
You can use logical indexing here to your advantage to replace the IF-conditional statement and have a small-loop -
%// Get size parameters
[r,c] = size(z);
%// Get logical mask with ones for each column at places that satisfy the condition
%// mentioned as the IF conditional statement in the problem code
mask = z(1:r-4,:) == z(5:r,:) & z(2:r-3,:) == z(5:r,:) & ...
bsxfun(#eq,z(end,:),z(5:r,:));
%// Use logical indexing to map entire z array and set mask elements as NaNs
for k = 1:4
z([false(k,c) ; mask ; false(4-k,c)]) = NaN;
end
Benchmarking
%// Size parameters
nrows = 479;
ncols = 12000;
max_num = 10;
num_iter = 10; %// number of iterations to run each approach,
%// so that runtimes are over 1 sec mark
z_org = randi(max_num,nrows,ncols); %// random input data of specified size
disp('--------------------------------- With proposed approach')
tic
for iter = 1:num_iter
z = z_org;
[..... code from the proposed approach ...]
end
toc, clear z k mask r c
disp('--------------------------------- With original approach')
tic
for iter = 1:num_iter
z = z_org;
[..... code from the problem ...]
end
toc
Results
Case # 1: z as 479 x 1672 (num_iter = 50)
--------------------------------- With proposed approach
Elapsed time is 1.285337 seconds.
--------------------------------- With original approach
Elapsed time is 2.008256 seconds.
Case # 2: z as 479 x 12000 (num_iter = 10)
--------------------------------- With proposed approach
Elapsed time is 1.941858 seconds.
--------------------------------- With original approach
Elapsed time is 2.897006 seconds.

speeding up some for loops in matlab

Basically I am trying to solve a 2nd order differential equation with the forward euler method. I have some for loops inside my code, which take considerable time to solve and I would like to speed things up a bit. Does anyone have any suggestions how could I do this?
And also when looking at the time it takes, I notice that my end at line 14 takes 45 % of my total time. What is end actually doing and why is it taking so much time?
Here is my simplified code:
t = 0:0.01:100;
dt = t(2)-t(1);
B = 3.5 * t;
F0 = 2 * t;
BB=zeros(1,length(t)); % Preallocation
x = 2; % Initial value
u = 0; % Initial value
for ii = 1:length(t)
for kk = 1:ii
BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
end % This end takes 45% of the other time
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
Running the code it takes me 8.552 sec.
You can remove the inner loop, I think:
for ii = 1:length(t)
for kk = 1:ii
BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
end % This end takes 45% of the other time
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
So BB(ii) = BB(ii) (zero at initalisation) + sum for 1 to ii of BB(kk)* u(ii-kk+1).dt
but kk = 1:ii, so for a given ii, ii-kk+1 → ii-(1:ii) + 1 → ii:-1:1
So I think this is equivalent to:
for ii = 1:length(t)
BB(ii) = sum(B(1:ii).*u(ii:-1:1)*dt);
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
It doesn't take as long as 8 seconds for me using either method, but the version with only one loop is about 2x as fast (the output of BB appears to be the same).
Is the sum loop of B(kk) * u(ii-kk+1) just conv(B(1:ii),u(1:ii),'same')
The best way to speed up loops in matlab is to try to avoid them. Try if you are able to perform a matrix operation instead of the inner loop. For example try to break the calculation you do there in small parts, then decide, if there are parts you can perform in advance without knowing the results of the next iteration of the loop.
to your secound part of the question, my guess:: The end contains the check if the loop runs for another round and this check by it self is not that long but called 50.015.001 times!

Memory and excecution speed in Matlab

I am trying to create random lines and select some of them, which are really rare. My code is rather simple, but to get something that I can use I need to create very large vectors(i.e.: <100000000 x 1, tracks variable in my code). Is there any way to be able to creater larger vectors and to reduce the time needed for all those calculations?
My code is
%Initial line values
tracks=input('Give me the number of muon tracks: ');
width=1e-4;
height=2e-4;
Ystart=15.*ones(tracks,1);
Xstart=-40+80.*rand(tracks,1);
%Xend=-40+80.*rand(tracks,1);
Xend=laprnd(tracks,1,Xstart,15);
X=[Xstart';Xend'];
Y=[Ystart';zeros(1,tracks)];
b=(Ystart.*Xend)./(Xend-Xstart);
hot=0;
cold=0;
for i=1:tracks
if ((Xend(i,1)<width/2 && Xend(i,1)>-width/2)||(b(i,1)<height && b(i,1)>0))
plot(X(:, i),Y(:, i),'r');%the chosen ones!
hold all
hot=hot+1;
else
%plot(X(:, i),Y(:, i),'b');%the rest of them
%hold all
cold=cold+1;
end
end
I am also using and calling a Laplace distribution generator made my Elvis Chen which can be found here
function y = laprnd(m, n, mu, sigma)
%LAPRND generate i.i.d. laplacian random number drawn from laplacian distribution
% with mean mu and standard deviation sigma.
% mu : mean
% sigma : standard deviation
% [m, n] : the dimension of y.
% Default mu = 0, sigma = 1.
% For more information, refer to
% http://en.wikipedia.org./wiki/Laplace_distribution
% Author : Elvis Chen (bee33#sjtu.edu.cn)
% Date : 01/19/07
%Check inputs
if nargin < 2
error('At least two inputs are required');
end
if nargin == 2
mu = 0; sigma = 1;
end
if nargin == 3
sigma = 1;
end
% Generate Laplacian noise
u = rand(m, n)-0.5;
b = sigma / sqrt(2);
y = mu - b * sign(u).* log(1- 2* abs(u));
The result plot is
As you indicate, your problem is two-fold. On the one hand, you have memory issues because you need to do so many trials. On the other hand, you have performance issues, because you have to process all those trials.
Solutions to each issue often have a negative impact on the other issue. IMHO, the best approach would be to find a compromise.
More trials are only possible of you get rid of those gargantuan arrays that are required for vectorization, and use a different strategy to do the loop. I will give priority to the possibility of using more trials, possibly at the cost of optimal performance.
When I execute your code as-is in the Matlab profiler, it immediately shows that the initial memory allocation for all your variables takes a lot of time. It also shows that the plot and hold all commands are the most time-consuming lines of them all. Some more trial-and-error shows that there is a disappointingly low maximum value for the trials you can do before OUT OF MEMORY errors start appearing.
The loop can be accelerated tremendously if you know a few things about its limitations in Matlab. In older versions of Matlab, it used to be true that loops should be avoided completely in favor of 'vectorized' code. In recent versions (I believe R2008a and up), the Mathworks introduced a piece of technology called the JIT accelerator (Just-in-Time compiler) which translates M-code into machine language on the fly during execution. Simply put, the JIT accelerator allows your code to bypass Matlab's interpreter and talk much more directly with the underlying hardware, which can save a lot of time.
The advice you'll hear a lot that loops should be avoided in Matlab, is no longer generally true. While vectorization still has its value, any procedure of sizable complexity that is implemented using only vectorized code is often illegible, hard to understand, hard to change and hard to upkeep. An implementation of the same procedure that uses loops, often has none of these drawbacks, and moreover, it will quite often be faster and require less memory.
Unfortunately, the JIT accelerator has a few nasty (and IMHO, unnecessary) limitations that you'll have to learn about.
One such thing is plot; it's generally a better idea to let a loop do nothing other than collect and manipulate data, and delay any plotting commands etc. until after the loop.
Another such thing is hold; the hold function is not a Matlab built-in function, meaning, it is implemented in M-language. Matlab's JIT accelerator is not able to accelerate non-builtin functions when used in a loop, meaning, your entire loop will run at Matlab's interpretation speed, rather than machine-language speed! Therefore, also delay this command until after the loop :)
Now, in case you're wondering, this last step can make a HUGE difference -- I know of one case where copy-pasting a function body into the upper-level loop caused a 1200x performance improvement. Days of execution time had been reduced to minutes!).
There is actually another minor issue in your loop (which is really small, and rather inconvenient, I will immediately agree with) -- the name of the loop variable should not be i. The name i is the name of the imaginary unit in Matlab, and the name resolution will also unnecessarily consume time on each iteration. It's small, but non-negligible.
Now, considering all this, I've come to the following implementation:
function [hot, cold, h] = MuonTracks(tracks)
% NOTE: no variables larger than 1x1 are initialized
width = 1e-4;
height = 2e-4;
% constant used for Laplacian noise distribution
bL = 15 / sqrt(2);
% Loop through all tracks
X = [];
hot = 0;
ii = 0;
while ii <= tracks
ii = ii + 1;
% Note that I've inlined (== copy-pasted) the original laprnd()
% function call. This was necessary to work around limitations
% in loops in Matlab, and prevent the nececessity of those HUGE
% variables.
%
% Of course, you can still easily generalize all of this:
% the new data
u = rand-0.5;
Ystart = 15;
Xstart = 800*rand-400;
Xend = Xstart - bL*sign(u)*log(1-2*abs(u));
b = (Ystart*Xend)/(Xend-Xstart);
% the test
if ((b < height && b > 0)) ||...
(Xend < width/2 && Xend > -width/2)
hot = hot+1;
% growing an array is perfectly fine when the chances of it
% happening are so slim
X = [X [Xstart; Xend]]; %#ok
end
end
% This is trivial to do here, and prevents an 'else' in the loop
cold = tracks - hot;
% Now plot the chosen ones
h = figure;
hold all
Y = repmat([15;0], 1, size(X,2));
plot(X, Y, 'r');
end
With this implementation, I can do this:
>> tic, MuonTracks(1e8); toc
Elapsed time is 24.738725 seconds.
with a completely negligible memory footprint.
The profiler now also shows a nice and even distribution of effort along the code; no lines that really stand out because of their memory use or performance.
It's possibly not the fastest possible implementation (if anyone sees obvious improvements, please, feel free to edit them in). But, if you're willing to wait, you'll be able to do MuonTracks(1e23) (or higher :)
I've also done an implementation in C, which can be compiled into a Matlab MEX file:
/* DoMuonCounting.c */
#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>
#include <stdlib.h>
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout);
/* simple little helper functions */
double sign(double x) { return (x>0)-(x<0); }
double rand_double() { return (double)rand()/(double)RAND_MAX; }
/* the gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
int
dims[] = {1,1};
const mxArray
/* Output arguments */
*hot_out = plhs[0] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*cold_out = plhs[1] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*X_out = plhs[2] = mxCreateDoubleMatrix(2,10000, mxREAL);
const unsigned long long
tracks = (const unsigned long long)mxGetPr(prhs[0])[0];
unsigned long long
*hot = (unsigned long long*)mxGetPr(hot_out),
*cold = (unsigned long long*)mxGetPr(cold_out);
double
*Xout = mxGetPr(X_out);
/* call the actual function, and return */
CountMuons(tracks, hot,cold, Xout);
}
// The actual muon counting
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout)
{
const double
width = 1.0e-4,
height = 2.0e-4,
bL = 15.0/sqrt(2.0),
Ystart = 15.0;
double
Xstart,
Xend,
u,
b;
unsigned long long
i = 0ul;
*hot = 0ul;
*cold = tracks;
/* seed the RNG */
srand((unsigned)time(NULL));
/* aaaand start! */
while (i++ < tracks)
{
u = rand_double() - 0.5;
Xstart = 800.0*rand_double() - 400.0;
Xend = Xstart - bL*sign(u)*log(1.0-2.0*fabs(u));
b = (Ystart*Xend)/(Xend-Xstart);
if ((b < height && b > 0.0) || (Xend < width/2.0 && Xend > -width/2.0))
{
Xout[0 + *hot*2] = Xstart;
Xout[1 + *hot*2] = Xend;
++(*hot);
--(*cold);
}
}
}
compile in Matlab with
mex DoMuonCounting.c
(after having run mex setup :) and then use it in conjunction with a small M-wrapper like this:
function [hot,cold, h] = MuonTrack2(tracks)
% call the MEX function
[hot,cold, Xtmp] = DoMuonCounting(tracks);
% process outputs, and generate plots
hot = uint32(hot); % circumvents limitations in 32-bit matlab
X = Xtmp(:,1:hot);
clear Xtmp
h = NaN;
if ~isempty(X)
h = figure;
hold all
Y = repmat([15;0], 1, hot);
plot(X, Y, 'r');
end
end
which allows me to do
>> tic, MuonTrack2(1e8); toc
Elapsed time is 14.496355 seconds.
Note that the memory footprint of the MEX version is slightly larger, but I think that's nothing to worry about.
The only flaw I see is the fixed maximum number of Muon counts (hard-coded as 10000 as the initial array size of Xout; needed because there are no dynamically growing arrays in standard C)...if you're worried this limit could be broken, simply increase it, change it to be equal to a fraction of tracks, or do some smarter (but more painful) dynamic array-growing tricks.
In Matlab, it is sometimes faster to vectorize rather than use a for loop. For example, this expression:
(Xend(i,1) < width/2 && Xend(i,1) > -width/2) || (b(i,1) < height && b(i,1) > 0)
which is defined for each value of i, can be rewritten in a vectorised manner like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0)
Expessions like Xend(:,1) will give you a column vector, so Xend(:,1) < width/2 will give you a column vector of boolean values. Note then that I have used & rather than && - this is because & performs an element-wise logical AND, unlike && which only works on scalar values. In this way you can build the entire expression, such that the variable isChosen holds a column vector of boolean values, one for each row of your Xend/b vectors.
Getting counts is now as simple as this:
hot = sum(isChosen);
since true is represented by 1. And:
cold = sum(~isChosen);
Finally, you can get the data points by using the boolean vector to select rows:
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
hold all;
plot(X(:, ~isChosen),Y(:, ~isChosen),'b'); % Plot unchosen values
EDIT: The code should look like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0);
hot = sum(isChosen);
cold = sum(~isChosen);
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values

Resources