Purposefully Slow MATLAB Function? - debugging

I want to write a really, really, slow program for MATLAB. I'm talking like, O(2^n) or worse. It has to finish, and it has to be deterministically slow, so no "if rand() = 123,123, exit!" This sounds crazy, but it's actually for a distributed systems test. I need to create a .m file, compile it (with MCC), and then run it on my distributed system to perform some debugging operations.
The program must constantly be doing work, so sleep() is not a valid option.
I tried making a random large matrix and finding its inverse, but this was completing too quickly. Any ideas?

This naive implementation of the Discrete Fourier Transform takes ~9 seconds for a 2048-element input vector x on my 1.86 GHz single-core machine. Going to 4096 inputs extends the time to ~35 seconds, close to the 4x I would expect for O(N^2). I don't have the patience to try longer inputs :)
function y = SlowDFT(x)
t = cputime;
y = zeros(size(x));
for c1 = 1:length(x)
    for c2 = 1:length(x)
        y(c1) = y(c1) + x(c2)*(cos((c1-1)*(c2-1)*2*pi/length(x)) - ...
                               1j*sin((c1-1)*(c2-1)*2*pi/length(x)));
    end
end
disp(cputime-t);
EDIT: Or if you're looking to stress memory more than CPU:
function y = SlowDFT_MemLookup(x)
t = cputime;
y = zeros(size(x));
cosbuf = cos((0:1:(length(x)-1))*2*pi/length(x));
for c1 = 1:length(x)
    cosctr = 1;
    sinctr = round(3*length(x)/4)+1;
    for c2 = 1:length(x)
        y(c1) = y(c1) + x(c2)*(cosbuf(cosctr) ...
                               - 1j*cosbuf(sinctr));
        cosctr = cosctr + (c1-1);
        if cosctr > length(x), cosctr = cosctr - length(x); end
        sinctr = sinctr + (c1-1);
        if sinctr > length(x), sinctr = sinctr - length(x); end
    end
end
disp(cputime-t);
This is faster than calculating sin and cos on each iteration. A 2048-element input took ~3 seconds, and a 16384-element input took ~180 seconds.

Count to 2^n. Optionally, make a slow function call in each iteration, as in the sketch below.
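A minimal sketch of this idea (n is a knob; each increment of n doubles the runtime):

n = 30;                  % tune upward until it is slow enough
count = 0;
for k = 1:2^n
    count = count + 1;   % optionally call a slow function here instead
end
disp(count)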

If you want real work that's easy to set up and stresses CPU way over memory:
Large dense matrix inversion (not slow enough? make it bigger.)
Factor an RSA number

How about using inv? It has been reported to be quite slow.

Do some work in a loop. You can tune the time it takes to complete using the number of loop iterations.

I don't speak MATLAB but something equivalent to the following might work.
loops = 0
counter = 0
while (loops < MAX_INT) {
    counter = counter + 1;
    if (counter == MAX_INT) {
        loops = loops + 1;
        counter = 0;
    }
}
This will iterate MAX_INT*MAX_INT times. If that is still not enough, put something computationally heavy inside the loop. A rough MATLAB equivalent follows.
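A rough MATLAB translation of the pseudocode above (MAX_ITER is a placeholder bound; a true intmax-squared iteration count would effectively never finish, and MATLAB doubles lose integer precision near 2^53 anyway):

MAX_ITER = 1e6;              % raise this to slow the program down
loops = 0;
while loops < MAX_ITER
    counter = 0;
    while counter < MAX_ITER
        counter = counter + 1;
    end
    loops = loops + 1;
end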

Easy! Go back to your Turing machine roots and think of processes that are O(2^n) or worse.
Here's a fairly simple one (warning, untested but you get the point)
N = 12; radix = 10;
odometer = zeros(N, 1);
done = false;
while (~done)
    done = true;
    for i = 1:N
        odometer(i) = odometer(i) + 1;
        if (odometer(i) >= radix)
            odometer(i) = 0;
        else
            done = false;
            break;
        end
    end
end
Even better, how about calculating Fibonacci numbers recursively? Runtime is O(2^N): fib(N) makes two recursive calls, fib(N-1) and fib(N-2), but the stack depth is only O(N), since just one of those calls is active at a time.
function y = fib(n)
if (n <= 1)
    y = 1;
else
    y = fib(n-1) + fib(n-2);
end
end

You could ask it to factor(X) for a suitably large X; see the sketch below.
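For example, a hedged sketch (this assumes a MATLAB version whose factor accepts uint64 inputs; a large prime is the worst case for its trial division):

X = uint64(2)^61 - 1;   % a Mersenne prime, so trial division must run all the way to sqrt(X)
tic; factor(X); toc     % shrink or grow X to tune the runtime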

You could also test whether each number up to n is prime by dividing it by all smaller numbers; doing that for every number up to n gives you O(n^2).
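A minimal sketch of that idea: counting the primes up to n by trial division against every smaller number is O(n^2) overall.

n = 20000;
nPrimes = 0;
for m = 2:n
    isPrime = true;
    for d = 2:m-1
        if mod(m, d) == 0
            isPrime = false;
            break;   % drop this break to make it slower still
        end
    end
    nPrimes = nPrimes + isPrime;
end
disp(nPrimes)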

Try this one:
tic
isprime( primes(99999999) );
toc
EDIT:
For a more extensive set of tests, use these benchmarks (perhaps even for multiple repetitions):
disp(repmat('-',1,85))
disp(['MATLAB Version ' version])
disp(['Operating System: ' system_dependent('getos')])
disp(['Java VM Version: ' version('-java')]);
disp(['Date: ' date])
disp(repmat('-',1,85))
N = 3000; % matrix size
A = rand(N,N);
A = A*A;
tic; A*A; t=toc;
fprintf('A*A \t\t\t%f sec\n', t)
tic; [L,U,P] = lu(A); t=toc; clear L U P
fprintf('LU(A)\t\t\t%f sec\n', t)
tic; inv(A); t=toc;
fprintf('INV(A)\t\t\t%f sec\n', t)
tic; [U,S,V] = svd(A); t=toc; clear U S V
fprintf('SVD(A)\t\t\t%f sec\n', t)
tic; [Q,R,P] = qr(A); t=toc; clear Q R P
fprintf('QR(A)\t\t\t%f sec\n', t)
tic; [V,D] = eig(A); t=toc; clear V D
fprintf('EIG(A)\t\t\t%f sec\n', t)
tic; det(A); t=toc;
fprintf('DET(A)\t\t\t%f sec\n', t)
tic; rank(A); t=toc;
fprintf('RANK(A)\t\t\t%f sec\n', t)
tic; cond(A); t=toc;
fprintf('COND(A)\t\t\t%f sec\n', t)
tic; sqrtm(A); t=toc;
fprintf('SQRTM(A)\t\t%f sec\n', t)
tic; fft(A(:)); t=toc;
fprintf('FFT\t\t\t%f sec\n', t)
tic; isprime(primes(10^7)); t=toc;
fprintf('Primes\t\t\t%f sec\n', t)
The following are the results on my machine using N = 1000 for one iteration only (note that the primes call uses 10^7 as its upper bound, NOT 10^8, which takes far more time):
A*A 0.178329 sec
LU(A) 0.118864 sec
INV(A) 0.319275 sec
SVD(A) 15.236875 sec
QR(A) 0.841982 sec
EIG(A) 3.967812 sec
DET(A) 0.121882 sec
RANK(A) 1.813042 sec
COND(A) 1.809365 sec
SQRTM(A) 22.750331 sec
FFT 0.113233 sec
Primes 27.080918 sec

This will run the CPU at 100% for WANTED_TIME seconds:
WANTED_TIME = 2^n; % seconds
t0 = cputime;
t = cputime;
while (t - t0 < WANTED_TIME)
    t = cputime;
end

How to reduce the allocations in Julia?

I am starting to use Julia mainly because of its speed. Currently, I am solving a fixed point problem. Although the current version of my code runs fast, I would like to know some ways to improve its speed.
First of all, let me summarize the algorithm.
There is an initial seed called C0 that maps from the space (b,y) into an action space c, so we have C0(b,y).
There is a formula that generates a rule Ct from C0.
Then, using an additional restriction, I can obtain an update of b [let's call it bt]. Thus, it generates a rule Ct(bt,y).
I need to interpolate the previous rule to move from the grid bt to the original grid b. This gives me an update for C0 [let's call it C1].
I iterate until the distance between C1 and C0 is below a convergence threshold.
To implement it I created two structures:
struct Parm
    lC::Array{Float64, 2}  # Lower limit
    uC::Array{Float64, 2}  # Upper limit
    γ::Float64             # CRRA coefficient
    δ::Float64             # factor in the Euler equation
    γ1::Float64            #
    r1::Float64            # inverse of the gross interest rate
    yb1::Array{Float64, 2} # y - b(t+1)
    P::Array{Float64, 2}   # Transpose of transition matrix
end
mutable struct Upd1
    pol::Array{Float64,2}  # policy function
    b::Array{Float64, 1}   # exogenous grid for interpolation
    dif::Float64           # updating difference
end
The first one is a set of parameters while the second one stores the decision rule C1. I also define some functions:
using Interpolations # needed for LinearInterpolation below

function eulerm(x::Upd1, p::Parm)
    ct = p.δ*(x.pol.^(-p.γ)*p.P).^(-p.γ1); # Euler equation
    bt = p.r1.*(ct .+ p.yb1);              # endogenous grid for bonds
    return ct, bt
end
function interp0!(bt::Array{Float64}, ct::Array{Float64}, x::Upd1, p::Parm)
    polold = x.pol;
    polnew = similar(x.pol);
    @inbounds @simd for col in 1:size(bt,2)
        F1 = LinearInterpolation(bt[:,col], ct[:,col], extrapolation_bc=Line());
        polnew[:,col] = F1(x.b);
    end
    polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC];
    polnew[polnew .> p.uC] .= p.uC[polnew .> p.uC];
    dif = maximum(abs.(polnew - polold));
    return polnew, dif
end
function updating!(x::Upd1, p::Parm)
    ct, bt = eulerm(x, p); # endogenous grid
    x.pol, x.dif = interp0!(bt, ct, x, p);
end
function conver(x::Upd1, p::Parm)
    while x.dif > 1e-8
        updating!(x, p);
    end
end
The first function implements steps 2 and 3. The third one performs the update (last part of step 4), and the last one iterates until convergence (step 5).
The most important function is the second one, which does the interpolation. While profiling with @time and @btime I realized that the largest number of allocations occurs in the loop inside this function. I tried to reduce them by skipping polnew and writing directly to x.pol, but then the results are not correct, since it converges after only two iterations (I think Julia treats polold as exactly the same array as x.pol and updates both at the same time).
Any advice is well received.
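For reference, a minimal sketch (with made-up values) of the aliasing behavior I suspect:

x = [1.0, 2.0]
polold = x          # same array, NOT a copy
x[1] = 99.0
polold[1]           # 99.0: polold changed too, because it aliases x
polold = copy(x)    # an independent snapshot avoids this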
For anyone who wants to run it, here is the rest of the required code:
function rouwen(ρ::Float64, σ2::Float64, N::Int64)
    if (N % 2 != 1)
        return "N should be an odd number"
    end
    sigz = sqrt(σ2/(1-ρ^2));
    zn = sigz*sqrt(N-1);
    z = range(-zn, zn, N);
    p = (1+ρ)/2;
    q = p;
    Rho = [p 1-p; 1-q q];
    for i = 3:N
        zz = zeros(i-1,1);
        Rho = p*[Rho zz; zz' 0] + (1-p)*[zz Rho; 0 zz'] + (1-q)*[zz' 0; Rho zz] + q*[0 zz'; zz Rho];
        Rho[2:end-1,:] = Rho[2:end-1,:]/2;
    end
    return z, Rho;
end
#############################################################
# Parameters of the model
############################################################
lb = 0; ub = 1000; pivb = 0.25; nb = 500;
ρ = 0.988; σz = 0.0439; μz =-σz/2; nz = 7;
ϕ = 0.0; σe = 0.6376; μe =-σe/2; ne = 7;
β = 0.98; r = 1/400; γ = 1;
b = exp10.(range(start=log10(lb+pivb), stop=log10(ub+pivb), length=nb)) .- pivb;
#=========================================================
Algorithm
=========================================================#
(z,Pz) = rouwen(ρ,σz, nz);
μZ = μz/(1-ρ);
z = z .+ μZ;
(ee,Pe) = rouwen(ϕ,σe,ne);
ee = ee .+ μe;
y = exp.(vec((z .+ ee')'));
P = kron(Pz,Pe);
R = 1 + r;
r1 = R^(-1);
γ1 = 1/γ;
δ = (β*R)^(-γ1);
m = R*b .+ y';
lC = max.(m .- ub,0);
uC = m .- lb;
by1 = b .- y';
# initial guess for C0
c0 = 0.1*(m);
# Set of parameters
pp = Parm(lC,uC,γ,δ,γ1,r1,by1,P');
# Container of results
up1 = Upd1(c0,b,1);
# Fixed point problem
conver(up1,pp)
UPDATE: As recommended, I made the following changes to the interpolation function:
function interp0!(bt::Array{Float64}, ct::Array{Float64}, x::Upd1, p::Parm)
    polold = x.pol;
    polnew = similar(x.pol);
    @inbounds for col in 1:size(bt,2)
        F1 = LinearInterpolation(@view(bt[:,col]), @view(ct[:,col]), extrapolation_bc=Line());
        polnew[:,col] = F1(x.b);
    end
    for j in eachindex(polnew)
        polnew[j] < p.lC[j] ? polnew[j] = p.lC[j] : nothing
        polnew[j] > p.uC[j] ? polnew[j] = p.uC[j] : nothing
    end
    dif = maximum(abs.(polnew - polold));
    return polnew, dif
end
This leads to an improvement in speed (from ~1.5 to ~1.3 seconds) and a reduction in the number of allocations. Some things I noted:
Changing from polnew[:,col] = F1(x.b) to polnew[:,col] .= F1(x.b) reduces the total allocations, but the time is slower. Why is that?
How should I understand the difference between @time and @btime? For this case, I have:
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
1.338042 seconds (385.72 k allocations: 1.157 GiB, 3.37% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
Just to be precise: in both cases I ran it several times and chose representative numbers for each line.
Does this mean that all the time is due to allocations during compilation?
Start going through the "performance tips" as advised by @DNF, but below you will find the most important comments for your code.
Vectorize vector assignments - a small dot makes a big difference:
julia> a = rand(3,4);
julia> @btime $a[3,:] = $a[3,:] ./ 2;
40.726 ns (2 allocations: 192 bytes)
julia> @btime $a[3,:] .= $a[3,:] ./ 2;
20.562 ns (1 allocation: 96 bytes)
Use views when doing something with subarrays:
julia> @btime sum($a[3,:]);
18.719 ns (1 allocation: 96 bytes)
julia> @btime sum(@view($a[3,:]));
5.600 ns (0 allocations: 0 bytes)
Your code around the line polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC]; will make far fewer allocations if you do it with a for loop over each element of polnew (see the sketch after this list).
@simd will have no effect on conditionals (point 3), nor when the code is calling complex external functions.
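A minimal self-contained sketch of that loop-based clamping (made-up arrays, just to show the pattern; the boolean-mask version allocates temporaries, this one does not):

polnew = rand(5, 3)
lC = fill(0.2, 5, 3)   # lower bounds
uC = fill(0.8, 5, 3)   # upper bounds
for j in eachindex(polnew)
    polnew[j] < lC[j] && (polnew[j] = lC[j])   # clamp from below
    polnew[j] > uC[j] && (polnew[j] = uC[j])   # clamp from above
end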
I want to give an update on this problem. I made two main changes to my code: (i) I defined my own linear interpolation function, and (ii) I included the bounds check inside the interpolation.
With this, the new interp0! is:
function interp0!(bt::Array{Float64}, ct::Array{Float64}, x::Upd1, p::Parm)
    polold = x.pol;
    polnew = similar(x.pol);
    @inbounds @views for col in 1:size(bt,2)
        polnew[:,col] = myint(bt[:,col], ct[:,col], x.b[:], p.lC[:,col], p.uC[:,col]);
    end
    dif = maximum(abs.(polnew - polold));
    return polnew, dif
end
And the interpolation is now:
function myint(x0, y0, x1, ly, uy)
    y1 = similar(x1);
    n = size(x0,1);
    j = 1;
    @simd for i in eachindex(x1)
        while (j <= n) && (x1[i] > x0[j])
            j += 1;
        end
        if j == 1
            y1[i] = y0[1] + ((y0[2]-y0[1])/(x0[2]-x0[1]))*(x1[i]-x0[1]);
        elseif j == n+1
            y1[i] = y0[n] + ((y0[n]-y0[n-1])/(x0[n]-x0[n-1]))*(x1[i]-x0[n]);
        else
            y1[i] = y0[j-1] + ((x1[i]-x0[j-1])/(x0[j]-x0[j-1]))*(y0[j]-y0[j-1]);
        end
        y1[i] > uy[i] ? y1[i] = uy[i] : nothing;
        y1[i] < ly[i] ? y1[i] = ly[i] : nothing;
    end
    return y1;
end
As you can see, I take advantage of (and assume) the fact that both vectors used as the basis are sorted, while the last two lines in the loop body enforce the bounds imposed by lC and uC.
With that I get the following total time
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
0.734630 seconds (28.93 k allocations: 752.214 MiB, 3.82% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
which is almost twice as fast, with ~8% of the previous total allocations. The use of views in the loop of interp0! also helps a lot.

Memory allocation in a fixed point algorithm

I need to find the fixed point of a function f. The algorithm is very simple:
Given X, compute f(X)
If ||X-f(X)|| is lower than a certain tolerance, exit and return X,
otherwise set X equal to f(X) and go back to 1.
I'd like to be sure I'm not allocating memory for a new object at every iteration.
For now, the algorithm looks like this:
iter1 = function(x::Vector{Float64})
    for iter in 1:max_it
        oldx = copy(x)
        g1(x)
        delta = vnormdiff(x, oldx, 2)
        if delta < tolerance
            break
        end
    end
end
Here g1(x) is a function that sets x to f(x).
But it seems this loop allocates a new vector at every iteration (see below).
Another way to write the algorithm is the following:
iter2 = function(x::Vector{Float64})
    oldx = similar(x)
    for iter in 1:max_it
        (oldx, x) = (x, oldx)
        g2(x, oldx)
        delta = vnormdiff(oldx, x, 2)
        if delta < tolerance
            break
        end
    end
end
where g2(x1, x2) is a function that sets x1 to f(x2).
Is this the most efficient and natural way to write this kind of iteration problem?
Edit 1: timing shows that the second version is faster:
using NumericExtensions
max_it = 1000
tolerance = 1e-8
max_it = 100
g1 = function(x::Vector{Float64})
    for i in 1:length(x)
        x[i] = x[i]/2
    end
end
g2 = function(newx::Vector{Float64}, x::Vector{Float64})
    for i in 1:length(x)
        newx[i] = x[i]/2
    end
end
x = fill(1e7, int(1e7))
@time iter1(x)
# elapsed time: 4.688103075 seconds (4960117840 bytes allocated, 29.72% gc time)
x = fill(1e7, int(1e7))
@time iter2(x)
# elapsed time: 2.187916177 seconds (80199676 bytes allocated, 0.74% gc time)
Edit 2: using copy!
iter3 = function(x::Vector{Float64})
    oldx = similar(x)
    for iter in 1:max_it
        copy!(oldx, x)
        g1(x)
        delta = vnormdiff(x, oldx, 2)
        if delta < tolerance
            break
        end
    end
end
x = fill(1e7, int(1e7))
@time iter3(x)
# elapsed time: 2.745350176 seconds (80008088 bytes allocated, 1.11% gc time)
I think replacing the following lines in the first version

for iter = 1:max_it
    oldx = copy( x )
    ...

by

oldx = zeros( N )
for iter = 1:max_it
    oldx[:] = x  # or copy!( oldx, x )
    ...

will be more efficient, because no array is allocated inside the loop. The code can also be made more efficient by writing the for-loops explicitly. This can be seen, for example, from the following comparison:
function test()
    N = 1000000
    a = zeros( N )
    b = zeros( N )
    @time c = copy( a )
    @time b[:] = a
    @time copy!( b, a )
    @time for i = 1:length(a)
        b[i] = a[i]
    end
    @time for i in eachindex(a)
        b[i] = a[i]
    end
end
test()
The result obtained with Julia 0.4.0 on Linux (x86_64) is:
elapsed time: 0.003955609 seconds (7 MB allocated)
elapsed time: 0.001279142 seconds (0 bytes allocated)
elapsed time: 0.000836167 seconds (0 bytes allocated)
elapsed time: 1.19e-7 seconds (0 bytes allocated)
elapsed time: 1.28e-7 seconds (0 bytes allocated)
It seems that copy!() is faster than assigning into [:] on the left-hand side, though the difference becomes marginal in repeated calculations (there seems to be some overhead for the first [:] assignment). By the way, the last example using eachindex() is very convenient for looping over multi-dimensional arrays.
A similar comparison can be made for vnormdiff(): using norm( x - oldx ) and the like is slower than an explicit loop for the vector norm, because the former allocates a temporary array for x - oldx. A sketch follows.
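For example, a sketch of such an explicit-loop norm (normdiff2 is a hypothetical name, not the NumericExtensions API):

function normdiff2(x::Vector{Float64}, oldx::Vector{Float64})
    s = 0.0
    for i in 1:length(x)
        d = x[i] - oldx[i]   # no temporary x - oldx array
        s += d*d
    end
    return sqrt(s)
end

x = fill(1.0, 10); oldx = fill(0.5, 10)
normdiff2(x, oldx)   # same value as norm(x - oldx), without the allocation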

How can I vectorize these nested for-loops in Matlab?

I have a piece of code here I need to streamline as it is greatly increasing the runtime of my script:
size=300;
resultLength = (size+1)^3;
freqResult=zeros(1, resultLength);
inc=1;
for i=0:size,
    for j=0:size,
        for k=0:size,
            freqResult(inc) = (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
            inc = inc+1;
        end
    end
end
c, L, W, and H are all constants. As the size input gets over about 400, the runtime is too long to wait for, and I can watch my disk space draining by the gigabyte. Any advice?
Thanks!
What about this:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
for indx = 1:numel(iT)
    i = iT(indx) - 1;
    j = jT(indx) - 1;
    k = kT(indx) - 1;
    freqResult1(indx) = (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
end
On my PC, for size = 400, version with 3 loops takes 136s and this one takes 19s.
For a more "MATLABy" way, you could even do the following:
[kT, jT, iT] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
func = @(i, j, k) (c/2)*sqrt((i/L)^2+(j/W)^2+(k/H)^2);
freqResult2 = arrayfun(func, iT-1, jT-1, kT-1);
But for some reason, this is slower than the above version.
A faster solution can be (based on Marcin's answer):
[k, j, i] = ind2sub([size+1, size+1, size+1], [1:(size+1)^3]);
freqResult = (c/2)*sqrt(((i-1)/L).^2+((j-1)/W).^2+((k-1)/H).^2);
It takes about 5 seconds to run on my PC for size = 300
The following is even faster (but it doesn't look very good):
k = repmat(0:size,[1 (size+1)^2]);
j = repmat(kron(0:size, ones(1,size+1)),[1 (size+1)]);
i = kron(0:size, ones(1,(size+1)^2));
freqResult = (c/2)*sqrt((i/L).^2+(j/W).^2+(k/H).^2);
which takes ~3.5 s for size = 300.
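For comparison, a sketch of the same computation with ndgrid (not from the answers above; it assumes c, L, W and H are defined, and uses sz instead of size to avoid shadowing the built-in):

sz = 300;
[k, j, i] = ndgrid(0:sz, 0:sz, 0:sz);   % first output varies fastest, matching the loop order
freqResult = (c/2)*sqrt((i(:)'/L).^2 + (j(:)'/W).^2 + (k(:)'/H).^2);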

More on using i and j as variables in Matlab: speed

The Matlab documentation says that
For speed and improved robustness, you can replace complex i and j by 1i. For example, instead of using
a = i;
use
a = 1i;
The robustness part is clear, as there might be variables called i or j. However, as for speed, I made a simple test in MATLAB 2010b and obtained results that seem to contradict the claim:
>>clear all
>> a=0; tic, for n=1:1e8, a=i; end, toc
Elapsed time is 3.056741 seconds.
>> a=0; tic, for n=1:1e8, a=1i; end, toc
Elapsed time is 3.205472 seconds.
Any ideas? Could it be a version-related issue?
After comments by @TryHard and @horchler, I tried assigning other values to the variable a, with these results:
Increasing order of elapsed time:
"i" < "1i" < "1*i" (trend "A")
"2i" < "2*1i" < "2*i" (trend "B")
"1+1i" < "1+i" < "1+1*i" (trend "A")
"2+2i" < "2+2*1i" < "2+2*i" (trend "B")
I think you are looking at a pathological example. Try something more complex (results shown for R2012b on OSX):
(repeated addition)
>> clear all
>> a=0; tic, for n=1:1e8, a = a + i; end, toc
Elapsed time is 2.217482 seconds. % <-- slower
>> clear all
>> a=0; tic, for n=1:1e8, a = a + 1i; end, toc
Elapsed time is 1.962985 seconds. % <-- faster
(repeated multiplication)
>> clear all
>> a=0; tic, for n=1:1e8, a = a * i; end, toc
Elapsed time is 2.239134 seconds. % <-- slower
>> clear all
>> a=0; tic, for n=1:1e8, a = a * 1i; end, toc
Elapsed time is 1.998718 seconds. % <-- faster
One thing to remember is that optimizations are applied differently whether you are running from the command line or a saved M-function.
Here is a test of my own:
function testComplex()
    tic, test1(); toc
    tic, test2(); toc
    tic, test3(); toc
    tic, test4(); toc
    tic, test5(); toc
    tic, test6(); toc
end

function a = test1
    a = zeros(1e7,1);
    for n=1:1e7
        a(n) = 2 + 2i;
    end
end

function a = test2
    a = zeros(1e7,1);
    for n=1:1e7
        a(n) = 2 + 2j;
    end
end

function a = test3
    a = zeros(1e7,1);
    for n=1:1e7
        a(n) = 2 + 2*i;
    end
end

function a = test4
    a = zeros(1e7,1);
    for n=1:1e7
        a(n) = 2 + 2*j;
    end
end

function a = test5
    a = zeros(1e7,1);
    for n=1:1e7
        a(n) = complex(2,2);
    end
end

function a = test6
    a = zeros(1e7,1);
    for n=1:1e7
        a(n) = 2 + 2*sqrt(-1);
    end
end
The results on my Windows machine running R2013a:
>> testComplex
Elapsed time is 0.946414 seconds. %// 2 + 2i
Elapsed time is 0.947957 seconds. %// 2 + 2j
Elapsed time is 0.811044 seconds. %// 2 + 2*i
Elapsed time is 0.685793 seconds. %// 2 + 2*j
Elapsed time is 0.767683 seconds. %// complex(2,2)
Elapsed time is 8.193529 seconds. %// 2 + 2*sqrt(-1)
Note that the results fluctuate a little between runs when the order of the calls is shuffled, so take the timings with a grain of salt.
My conclusion: in terms of speed, it doesn't matter whether you use 1i or 1*i.
One interesting difference: if you have a variable in the function scope that you also use as the imaginary unit, MATLAB throws an error:
Error: File: testComplex.m Line: 38 Column: 5
"i" previously appeared to be used as a function or command, conflicting with its
use here as the name of a variable.
A possible cause of this error is that you forgot to initialize the variable, or you
have initialized it implicitly using load or eval.
To see the error, change the above test3 function into:
function a = test3
    a = zeros(1e7,1);
    for n=1:1e7
        a(n) = 2 + 2*i;
    end
    i = rand(); %// added this line!
end
i.e., the variable i was used as both a function and a variable in the same local scope.

How to optimize MATLAB bitwise operations

I have written my own SHA-1 implementation in MATLAB, and it gives correct hashes. However, it's very slow (a string of 1000 a's takes 9.9 seconds on my Core i7-2760QM), and I think the slowness is a result of how MATLAB implements bitwise logical operations (bitand, bitor, bitxor, bitcmp) and bitwise shifts (bitshift, bitrol, bitror) of integers.
In particular, I wonder about the need to construct fixed-point numeric objects for bitrol and bitror using the fi command, because Intel x86 assembly has rol and ror for registers and memory addresses of all sizes anyway. However, bitshift is quite fast (it doesn't need any fixed-point numeric constructs; a regular uint64 variable works fine), which makes the situation stranger: why do bitrol and bitror in MATLAB need fixed-point objects constructed with fi, whereas bitshift does not, when at the assembly level it all comes down to shl, shr, rol and ror?
So, before writing this function in C/C++ as a .mex file, I'd be happy to know if there is any way to improve the performance of this function. I know there are some specific optimizations for SHA1, but that's not the issue, if the very basic implementation of bitwise rotations is so slow.
Testing a little with tic and toc, it's evident that what makes it slow are the loops with bitrol and fi. There are two such loops:
%# Define some variables.
FFFFFFFF = uint64(hex2dec('FFFFFFFF'));
%# constants: K(1), K(2), K(3), K(4).
K(1) = uint64(hex2dec('5A827999'));
K(2) = uint64(hex2dec('6ED9EBA1'));
K(3) = uint64(hex2dec('8F1BBCDC'));
K(4) = uint64(hex2dec('CA62C1D6'));
W = uint64(zeros(1, 80));
... some other code here ...
%# First slow loop begins here.
for index = 17:80
    W(index) = uint64(bitrol(fi(bitxor(bitxor(bitxor(W(index-3), W(index-8)), W(index-14)), W(index-16)), 0, 32, 0), 1));
end
%# First slow loop ends here.
H = sha1_handle_block_struct.H;
A = H(1);
B = H(2);
C = H(3);
D = H(4);
E = H(5);
%# Second slow loop begins here.
for index = 1:80
    rotatedA = uint64(bitrol(fi(A, 0, 32, 0), 5));
    if (index <= 20)
        % alternative #1.
        xorPart = bitxor(D, (bitand(B, (bitxor(C, D)))));
        xorPart = bitand(xorPart, FFFFFFFF);
        temp = rotatedA + xorPart + E + W(index) + K(1);
    elseif ((index >= 21) && (index <= 40))
        % FIPS.
        xorPart = bitxor(bitxor(B, C), D);
        xorPart = bitand(xorPart, FFFFFFFF);
        temp = rotatedA + xorPart + E + W(index) + K(2);
    elseif ((index >= 41) && (index <= 60))
        % alternative #2.
        xorPart = bitor(bitand(B, C), bitand(D, bitxor(B, C)));
        xorPart = bitand(xorPart, FFFFFFFF);
        temp = rotatedA + xorPart + E + W(index) + K(3);
    elseif ((index >= 61) && (index <= 80))
        % FIPS.
        xorPart = bitxor(bitxor(B, C), D);
        xorPart = bitand(xorPart, FFFFFFFF);
        temp = rotatedA + xorPart + E + W(index) + K(4);
    else
        error('error in the code of sha1_handle_block.m!');
    end
    temp = bitand(temp, FFFFFFFF);
    E = D;
    D = C;
    C = uint64(bitrol(fi(B, 0, 32, 0), 30));
    B = A;
    A = temp;
end
%# Second slow loop ends here.
Measuring with tic and toc, the entire SHA-1 computation for the message abc takes around 0.63 seconds on my laptop, of which around 0.23 seconds is spent in the first slow loop and around 0.38 seconds in the second. So is there some way to optimize those loops in MATLAB before I resort to a .mex file?
There's this DataHash from the MATLAB File Exchange that calculates SHA-1 hashes lightning fast.
I ran the following code:
x = 'The quick brown fox jumped over the lazy dog'; %# Just a short sentence
y = repmat('a', [1, 1e6]); %# A million a's
opt = struct('Method', 'SHA-1', 'Format', 'HEX', 'Input', 'bin');
tic, x_hashed = DataHash(uint8(x), opt), toc
tic, y_hashed = DataHash(uint8(y), opt), toc
and got the following results:
x_hashed = F6513640F3045E9768B239785625CAA6A2588842
Elapsed time is 0.029250 seconds.
y_hashed = 34AA973CD4C4DAA4F61EEB2BDBAD27316534016F
Elapsed time is 0.020595 seconds.
I verified the results with an online SHA-1 tool, and the calculation was indeed correct. Also, the 10^6 a's were hashed ~1.5 times faster than the first sentence.
So how does DataHash do it so fast??? Using the java.security.MessageDigest library, no less!
If you're interested in a fast MATLAB-friendly SHA-1 function, this is the way to go.
However, if this is just an exercise for implementing fast bit-level operations, then MATLAB doesn't really handle them efficiently, and in most cases you'll have to resort to MEX.
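For reference, a minimal sketch of calling java.security.MessageDigest directly from MATLAB (the same engine DataHash uses; the hash shown is the standard test vector for 'abc'):

md = java.security.MessageDigest.getInstance('SHA-1');
md.update(uint8('abc'));                           % feed bytes into the digest
hash = sprintf('%02X', typecast(md.digest(), 'uint8'))
% hash = A9993E364706816ABA3E25717850C26C9CD0D89D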
why in MATLAB bitrol and bitror need fixed-point numeric objects constructed with fi, whereas bitshift does not
bitrol and bitror are not part of the set of bitwise logic functions that apply to uints. They are part of the fixed-point toolbox, which also contains variants of bitand, bitshift, etc. that apply to fixed-point inputs.
A bitrol could be expressed as two bitshifts, a bitand and a bitor if you want to try using only the uint functions. That might be even slower, though; a sketch follows.
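For instance, a hedged sketch of a 32-bit left-rotate built from uint functions only (rol32 is a hypothetical helper, not a builtin; note that bitshift with a negative count shifts right):

FFFFFFFF = uint64(hex2dec('FFFFFFFF'));
rol32 = @(x, n) bitor(bitand(bitshift(x, n), FFFFFFFF), bitshift(x, n - 32));
rol32(uint64(hex2dec('80000001')), 1)   % ans = 3, as expected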
Like most MATLAB functions, bitand, bitor and bitxor are vectorized. So you get a big speedup if you give these functions vector inputs rather than calling them in a loop over each element.
Example:
%# create two sets of 10k random numbers
num = 10000;
hex = '0123456789ABCDEF';
A = uint64(hex2dec( hex(randi(16, [num 16])) ));
B = uint64(hex2dec( hex(randi(16, [num 16])) ));

%# compare loop vs. vectorized call
tic
C1 = zeros(size(A), class(A));
for i=1:numel(A)
    C1(i) = bitxor(A(i),B(i));
end
toc

tic
C2 = bitxor(A,B);
toc
assert(isequal(C1,C2))
The timing was:
Elapsed time is 0.139034 seconds.
Elapsed time is 0.000960 seconds.
That's more than two orders of magnitude faster!
The problem is, as far as I can tell, that the SHA-1 computation cannot be vectorized well, so you might not be able to take advantage of such vectorization.
As an experiment, I implemented a pure MATLAB-based function to compute such bit operations:
function num = my_bitops(op,A,B)
    %# operation to perform: not, and, or, xor
    if ischar(op)
        op = str2func(op);
    end

    %# integer class: uint8, uint16, uint32, uint64
    clss = class(A);
    depth = str2double(clss(5:end));

    %# bit exponents
    e = 2.^(depth-1:-1:0);

    %# convert to binary
    b1 = logical(dec2bin(A,depth)-'0');
    if nargin == 3
        b2 = logical(dec2bin(B,depth)-'0');
    end

    %# perform binary operation
    if nargin < 3
        num = op(b1);
    else
        num = op(b1,b2);
    end

    %# convert back to integer
    num = sum(bsxfun(@times, cast(num,clss), cast(e,clss)), 2, 'native');
end
Unfortunately, this was even worse in terms of performance:
tic, C1 = bitxor(A,B); toc
tic, C2 = my_bitops('xor',A,B); toc
assert(isequal(C1,C2))
The timing was:
Elapsed time is 0.000984 seconds.
Elapsed time is 0.485692 seconds.
Conclusion: write a MEX function or search the File Exchange to see if someone already did :)
