I've been asked to make some MATLAB code run faster, and have run into something that seems strange to me.
In one of the functions there's a loop where we multiply a 1x3 vector (let's call it x) by a 3x3 matrix (let's call it A) and then by the transpose of x, yielding a scalar. The code spells out the whole set of element-by-element multiplications and additions, and is pretty cumbersome:
val = x(1)*A(1,1)*x(1) + x(1)*A(1,2)*x(2) + x(1)*A(1,3)*x(3) + ...
x(2)*A(2,1)*x(1) + x(2)*A(2,2)*x(2) + x(2)*A(2,3)*x(3) + ...
x(3)*A(3,1)*x(1) + x(3)*A(3,2)*x(2) + x(3)*A(3,3)*x(3);
I figured I'd just replace it all by:
val = x*A*x';
To my surprise, it ran significantly slower (as in 4-5 times slower). Is it just that the vector and matrix are so small that MATLAB's optimizations don't apply?
EDIT: I improved the tests to give more accurate times. I also optimized the unrolled version, which is now much better than what I initially had; still, matrix multiplication is way faster as you increase the size.
EDIT2: To make sure that the JIT compiler is working on the unrolled functions, I modified the code to write the generated functions as M-files. The comparison can also now be seen as fair, as both methods get evaluated by passing TIMEIT a function handle: timeit(@myfunc)
I am not convinced that your approach is faster than matrix multiplication for reasonable sizes, so let's compare the two methods.
I am using the Symbolic Math Toolbox to help me get the "unrolled" form of the expression x'*A*x (try multiplying a 20x20 matrix by a 20x1 vector by hand!):
function f = buildUnrolledFunction(N)
% avoid regenerating files, CCODE below can be really slow!
fname = sprintf('f%d',N);
if exist([fname '.m'], 'file')
f = str2func(fname);
return
end
% construct symbolic vector/matrix of the specified size
x = sym('x', [N 1]);
A = sym('A', [N N]);
% work out the expanded form of the matrix-multiplication
% and convert it to a string
s = ccode(expand(x.'*A*x)); % instead of char(.) to avoid x^2
% a bit of RegExp to fix the notation of the variable names
% also convert indexing into linear indices: A(3,3) into A(9)
s = regexprep(regexprep(s, '^.*=\s+', ''), ';$', '');
s = regexprep(regexprep(s, 'x(\d+)', 'x($1)'), 'A(\d+)_(\d+)', ...
'A(${ int2str(sub2ind([N N],str2num($1),str2num($2))) })');
% build an M-function from the string, and write it to file
fid = fopen([fname '.m'], 'wt');
fprintf(fid, 'function v = %s(A,x)\nv = %s;\nend\n', fname, s);
fclose(fid);
% rehash path and return a function handle
rehash
clear(fname)
f = str2func(fname);
end
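For a quick sanity check (assuming the function above is on the path and the current folder is writable):
f3 = buildUnrolledFunction(3); % writes f3.m and returns @f3
x = rand(3,1); A = rand(3,3);
err = abs(f3(A,x) - x.'*A*x) % should be on the order of eps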
I tried to optimize the generated function by avoiding exponentiation (we prefer x*x to x^2). I also converted the subscripts into linear indices (A(9) instead of A(3,3)). Therefore for N=3 we get the same equation you had:
>> s
s =
A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) +
A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) +
A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3)
Given the above method to construct M-functions, we now evaluate it for various sizes and compare it against the matrix-multiplication form (I put it in a separate function to account for function calling overhead). I am using the TIMEIT function instead of tic/toc to get more accurate timings. Also to have a fair comparison, each method is implemented as an M-file function that gets passed all of the needed variables as input arguments.
function results = testMatrixMultVsUnrolled()
% vector/matrix size
N_vec = 2:50;
results = zeros(numel(N_vec),3);
for ii = 1:numel(N_vec)
% some random data
N = N_vec(ii);
x = rand(N,1); A = rand(N,N);
% matrix multiplication
f = @matMult;
results(ii,1) = timeit(@() feval(f, A,x));
% unrolled equation
f = buildUnrolledFunction(N);
results(ii,2) = timeit(@() feval(f, A,x));
% check result
results(ii,3) = norm(matMult(A,x) - f(A,x));
end
% display results
fprintf('N = %2d: mtimes = %.6f ms, unroll = %.6f ms [error = %g]\n', ...
[N_vec(:) results(:,1:2)*1e3 results(:,3)]')
plot(N_vec, results(:,1:2)*1e3, 'LineWidth',2)
xlabel('size (N)'), ylabel('timing [msec]'), grid on
legend({'mtimes','unrolled'})
title('Matrix multiplication: $$x^\mathsf{T}Ax$$', ...
'Interpreter','latex', 'FontSize',14)
end
function v = matMult(A,x)
v = x.' * A * x;
end
The results:
N = 2: mtimes = 0.008816 ms, unroll = 0.006793 ms [error = 0]
N = 3: mtimes = 0.008957 ms, unroll = 0.007554 ms [error = 0]
N = 4: mtimes = 0.009025 ms, unroll = 0.008261 ms [error = 4.44089e-16]
N = 5: mtimes = 0.009075 ms, unroll = 0.008658 ms [error = 0]
N = 6: mtimes = 0.009003 ms, unroll = 0.008689 ms [error = 8.88178e-16]
N = 7: mtimes = 0.009234 ms, unroll = 0.009087 ms [error = 1.77636e-15]
N = 8: mtimes = 0.008575 ms, unroll = 0.009744 ms [error = 8.88178e-16]
N = 9: mtimes = 0.008601 ms, unroll = 0.011948 ms [error = 0]
N = 10: mtimes = 0.009077 ms, unroll = 0.014052 ms [error = 0]
N = 11: mtimes = 0.009339 ms, unroll = 0.015358 ms [error = 3.55271e-15]
N = 12: mtimes = 0.009271 ms, unroll = 0.018494 ms [error = 3.55271e-15]
N = 13: mtimes = 0.009166 ms, unroll = 0.020238 ms [error = 0]
N = 14: mtimes = 0.009204 ms, unroll = 0.023326 ms [error = 7.10543e-15]
N = 15: mtimes = 0.009396 ms, unroll = 0.024767 ms [error = 3.55271e-15]
N = 16: mtimes = 0.009193 ms, unroll = 0.027294 ms [error = 2.4869e-14]
N = 17: mtimes = 0.009182 ms, unroll = 0.029698 ms [error = 2.13163e-14]
N = 18: mtimes = 0.009330 ms, unroll = 0.033295 ms [error = 7.10543e-15]
N = 19: mtimes = 0.009411 ms, unroll = 0.152308 ms [error = 7.10543e-15]
N = 20: mtimes = 0.009366 ms, unroll = 0.167336 ms [error = 7.10543e-15]
N = 21: mtimes = 0.009335 ms, unroll = 0.183371 ms [error = 0]
N = 22: mtimes = 0.009349 ms, unroll = 0.200859 ms [error = 7.10543e-14]
N = 23: mtimes = 0.009411 ms, unroll = 0.218477 ms [error = 8.52651e-14]
N = 24: mtimes = 0.009307 ms, unroll = 0.235668 ms [error = 4.26326e-14]
N = 25: mtimes = 0.009425 ms, unroll = 0.256491 ms [error = 1.13687e-13]
N = 26: mtimes = 0.009392 ms, unroll = 0.274879 ms [error = 7.10543e-15]
N = 27: mtimes = 0.009515 ms, unroll = 0.296795 ms [error = 2.84217e-14]
N = 28: mtimes = 0.009567 ms, unroll = 0.319032 ms [error = 5.68434e-14]
N = 29: mtimes = 0.009548 ms, unroll = 0.339517 ms [error = 3.12639e-13]
N = 30: mtimes = 0.009617 ms, unroll = 0.361897 ms [error = 1.7053e-13]
N = 31: mtimes = 0.009672 ms, unroll = 0.387270 ms [error = 0]
N = 32: mtimes = 0.009629 ms, unroll = 0.410932 ms [error = 1.42109e-13]
N = 33: mtimes = 0.009605 ms, unroll = 0.434452 ms [error = 1.42109e-13]
N = 34: mtimes = 0.009534 ms, unroll = 0.462961 ms [error = 0]
N = 35: mtimes = 0.009696 ms, unroll = 0.489474 ms [error = 5.68434e-14]
N = 36: mtimes = 0.009691 ms, unroll = 0.512198 ms [error = 8.52651e-14]
N = 37: mtimes = 0.009671 ms, unroll = 0.544485 ms [error = 5.68434e-14]
N = 38: mtimes = 0.009710 ms, unroll = 0.573564 ms [error = 8.52651e-14]
N = 39: mtimes = 0.009946 ms, unroll = 0.604567 ms [error = 3.41061e-13]
N = 40: mtimes = 0.009735 ms, unroll = 0.636640 ms [error = 3.12639e-13]
N = 41: mtimes = 0.009858 ms, unroll = 0.665719 ms [error = 5.40012e-13]
N = 42: mtimes = 0.009876 ms, unroll = 0.697364 ms [error = 0]
N = 43: mtimes = 0.009956 ms, unroll = 0.730506 ms [error = 2.55795e-13]
N = 44: mtimes = 0.009897 ms, unroll = 0.765358 ms [error = 4.26326e-13]
N = 45: mtimes = 0.009991 ms, unroll = 0.800424 ms [error = 0]
N = 46: mtimes = 0.009956 ms, unroll = 0.829717 ms [error = 2.27374e-13]
N = 47: mtimes = 0.010210 ms, unroll = 0.865424 ms [error = 2.84217e-13]
N = 48: mtimes = 0.010022 ms, unroll = 0.907974 ms [error = 3.97904e-13]
N = 49: mtimes = 0.010098 ms, unroll = 0.944536 ms [error = 5.68434e-13]
N = 50: mtimes = 0.010153 ms, unroll = 0.984486 ms [error = 4.54747e-13]
At small sizes the two methods perform somewhat similarly; for N<7 the expanded version beats mtimes, but the difference is hardly significant. Once we move past tiny sizes, matrix multiplication is orders of magnitude faster.
This is not really surprising: at just N=20 the formula is frighteningly long and involves adding 400 terms. As the MATLAB language is interpreted, I doubt parsing and evaluating such an expression is very efficient.
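The quadratic growth is easy to see as a worked equation: x'*A*x = \sum_{i=1}^{N} \sum_{j=1}^{N} A_{ij} x_i x_j, so the expanded expression always contains exactly N^2 summands (400 for N=20), and the length of the generated line, and with it the parse/evaluation cost, grows accordingly.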
Now I agree that there is an overhead for calling an external function vs. directly embedding the code in-line, but how practical is such an approach? Even for a size as small as N=20, the generated line is over 7000 characters! I also noticed the MATLAB editor becoming sluggish on account of the long lines :)
Besides, the advantage quickly disappears after around N>10. I compared the embedded/explicitly-written code against matrix multiplication, similar to what @DennisJaheruddin suggested. The results:
N=3:
Elapsed time is 0.062295 seconds. % unroll
Elapsed time is 1.117962 seconds. % mtimes
N=12:
Elapsed time is 1.024837 seconds. % unroll
Elapsed time is 1.126147 seconds. % mtimes
N=19:
Elapsed time is 140.915138 seconds. % unroll
Elapsed time is 1.305382 seconds. % mtimes
... and it only gets worse for the unrolled version. Like I said before, MATLAB is interpreted, so the cost of parsing the code starts to show with such huge files.
The way I see it, after doing a million iterations we gained at most 1 second, which I think does not justify all the trouble and hacks over the much more readable and succinct v = x'*A*x. So perhaps there are other places in the code one can improve, rather than focusing on an already optimized operation such as matrix multiplication.
Matrix multiplication in MATLAB is seriously fast (this is what MATLAB is best at!). It really shines once you reach large enough data (as multithreading kicks in):
>> N=5000; x=rand(N,1); A=rand(N,N);
>> tic, for i=1e4, v=x.'*A*x; end, toc
Elapsed time is 0.021959 seconds.
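To see the multithreading effect directly, you can compare against a single-threaded run. A quick sketch (maxNumCompThreads is a documented function, though some releases warn that it is deprecated):
N = 5000; x = rand(N,1); A = rand(N,N);
nOld = maxNumCompThreads(1); % restrict MATLAB to one computational thread
tic, v = x.'*A*x; toc % single-threaded timing
maxNumCompThreads(nOld); % restore the previous thread count
tic, v = x.'*A*x; toc % multithreaded timing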
@Amro has given an extensive answer, and I agree that in general you should not bother writing out the explicit calculations and should simply use matrix multiplication everywhere in your code.
However, if your matrix is small enough and you really need to calculate something a few billion times, the written-out form can be significantly faster (less overhead). The trick, however, is not to put your code in a separate function, as the call overhead will be much larger than the calculation time itself.
Here is a small example:
x = 1:3;
A = rand(3);
v=0;
unroll = @(x) A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3);
regular = @(x) x*A*x';
%Written out, no function call
tic
for t = 1:1e6
v = A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3);
end
t1=toc;
%Matrix form, no function call
tic
for t = 1:1e6
v = x*A*x';
end
t2=toc;
%Written out, function call
tic
for t = 1:1e6
v = unroll(x);
end
t3=toc;
%Matrix form, function call
tic
for t = 1:1e6
v = regular(x);
end
t4=toc;
[t1;t2;t3;t4]
Which will give these results:
0.0767
1.6988
6.1975
7.9353
So if you call it via an (anonymous) function, the written-out form is not worthwhile; however, if you really want the best speed, using the written-out form directly in-line can get you a big speedup for tiny matrices.
I am starting to use Julia mainly because of its speed. Currently, I am solving a fixed-point problem. Although the current version of my code runs fast, I would like to know some methods to improve its speed.
First of all, let me summarize the algorithm.
There is an initial seed called C0 that maps from the space (b,y) into an action space c, so we have C0(b,y).
There is a formula that generates a rule Ct from C0.
Then, using an additional restriction, I can obtain an update of b [let's call it bt]. This generates a rule Ct(bt,y).
I need to interpolate the previous rule to move from the grid bt back to the original grid b. This gives me an update of C0 [let's call it C1].
I iterate until the distance between C1 and C0 is below a convergence threshold (in symbols below).
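In symbols, my reading of the loop is a plain fixed-point iteration, with the tolerance being the 1e-8 used in the code below:
C_{k+1} = T(C_k), \qquad \text{stop when} \quad \max_{b,y} \left| C_{k+1}(b,y) - C_k(b,y) \right| < 10^{-8}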
To implement it I created two structures:
struct Parm
lC::Array{Float64, 2} # Lower limit
uC::Array{Float64, 2} # Upper limit
γ::Float64 # CRRA coefficient
δ::Float64 # factor in the Euler equation
γ1::Float64 #
r1::Float64 # inverse of the gross interest rate
yb1::Array{Float64, 2} # b(t+1) - y (this is what by1 below stores)
P::Array{Float64, 2} # Transpose of transition matrix
end
mutable struct Upd1
pol::Array{Float64,2} # policy function
b::Array{Float64, 1} # exogenous grid for interpolation
dif::Float64 # updating difference
end
The first one is a set of parameters while the second one stores the decision rule C1. I also define some functions:
function eulerm(x::Upd1,p::Parm)
ct = p.δ *(x.pol.^(-p.γ)*p.P).^(-p.γ1); # Euler equation
bt = p.r1.*(ct .+ p.yb1); # endogenous grid for bonds
return ct,bt
end
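For reference, this is the math I believe eulerm implements (an endogenous-grid reading of the code; p.P holds the transposed transition matrix, so the matrix product computes the conditional expectation over next-period states):
c_t = \delta \left( C^{-\gamma} P^{\mathsf{T}} \right)^{-1/\gamma}, \qquad b_t = \frac{c_t + b_{t+1} - y}{R}, \qquad \delta = (\beta R)^{-1/\gamma}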
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
@inbounds @simd for col in 1:size(bt,2)
F1 = LinearInterpolation(bt[:,col], ct[:,col],extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC];
polnew[polnew .> p.uC] .= p.uC[polnew .> p.uC];
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
function updating!(x::Upd1,p::Parm)
ct, bt = eulerm(x,p); # endogeneous grid
x.pol, x.dif = interp0!(bt,ct,x,p);
end
function conver(x::Upd1,p::Parm)
while x.dif>1e-8
updating!(x,p);
end
end
The first function implements steps 2 and 3, the second one does the interpolation of step 4, the third one applies the update (last part of step 4), and the last one iterates until convergence (step 5).
The most important function is the second one, which does the interpolation. While profiling with @time and @btime I realized that the largest number of allocations is in the loop inside this function. I tried to reduce it by skipping polnew and writing directly into x.pol, but then the results are not correct, as it takes only two iterations to "converge" (the assignment polold = x.pol only binds another name to the same array rather than copying it, so updating x.pol also updates polold).
Any advice is well received.
For anyone who wants to run it themselves, here is the rest of the required code:
function rouwen(ρ::Float64, σ2::Float64, N::Int64)
if N % 2 != 1
error("N should be an odd number")
end
sigz = sqrt(σ2/(1-ρ^2));
zn = sigz*sqrt(N-1);
z = range(-zn,zn,N);
p = (1+ρ)/2;
q = p;
Rho = [p 1-p;1-q q];
for i = 3:N
zz = zeros(i-1,1);
Rho = p*[Rho zz; zz' 0] + (1-p)*[zz Rho; 0 zz'] + (1-q)*[zz' 0; Rho zz] + q *[0 zz'; zz Rho];
Rho[2:end-1,:] = Rho[2:end-1,:]/2;
end
return z,Rho;
end
#############################################################
# Parameters of the model
############################################################
lb = 0; ub = 1000; pivb = 0.25; nb = 500;
ρ = 0.988; σz = 0.0439; μz =-σz/2; nz = 7;
ϕ = 0.0; σe = 0.6376; μe =-σe/2; ne = 7;
β = 0.98; r = 1/400; γ = 1;
b = exp10.(range(start=log10(lb+pivb), stop=log10(ub+pivb), length=nb)) .- pivb;
#=========================================================
Algorithm
=========================================================#
(z,Pz) = rouwen(ρ,σz, nz);
μZ = μz/(1-ρ);
z = z .+ μZ;
(ee,Pe) = rouwen(ϕ,σe,ne);
ee = ee .+ μe;
y = exp.(vec((z .+ ee')'));
P = kron(Pz,Pe);
R = 1 + r;
r1 = R^(-1);
γ1 = 1/γ;
δ = (β*R)^(-γ1);
m = R*b .+ y';
lC = max.(m .- ub,0);
uC = m .- lb;
by1 = b .- y';
# initial guess for C0
c0 = 0.1*(m);
# Set of parameters
pp = Parm(lC,uC,γ,δ,γ1,r1,by1,P');
# Container of results
up1 = Upd1(c0,b,1);
# Fixed point problem
conver(up1,pp)
UPDATE: As recommended, I made the following changes to the interpolation function interp0!:
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
@inbounds for col in 1:size(bt,2)
F1 = LinearInterpolation(@view(bt[:,col]), @view(ct[:,col]),extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
for j in eachindex(polnew)
polnew[j] < p.lC[j] ? polnew[j] = p.lC[j] : nothing
polnew[j] > p.uC[j] ? polnew[j] = p.uC[j] : nothing
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
This leads to an improvement in speed (from ~1.5 to ~1.3 seconds) and a reduction in the number of allocations. Some things I noted:
Changing from polnew[:,col] = F1(x.b) to polnew[:,col] .= F1(x.b) can reduce the total allocations, but the time is slower. Why is that?
How should I understand the difference between @time and @btime? For this case, I have:
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
1.338042 seconds (385.72 k allocations: 1.157 GiB, 3.37% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
Just to be precise, in both cases, I run it several times and I choose representative numbers for each line.
Does this mean that all the time is due to allocations during compilation?
Start going through the "performance tips" as advised by @DNF, but below you will find the most important comments for your code.
Vectorize vector assignments - a small dot makes a big difference:
julia> a = rand(3,4);
julia> @btime $a[3,:] = $a[3,:] ./ 2;
40.726 ns (2 allocations: 192 bytes)
julia> @btime $a[3,:] .= $a[3,:] ./ 2;
20.562 ns (1 allocation: 96 bytes)
Use views when doing something with subarrays:
julia> @btime sum($a[3,:]);
18.719 ns (1 allocation: 96 bytes)
julia> @btime sum(@view($a[3,:]));
5.600 ns (0 allocations: 0 bytes)
Your code around the line polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC]; will make far fewer allocations if you do it with a for loop over each element of polnew.
@simd will have no effect on loops containing conditionals (such as the loop suggested in the previous point), nor when the code is calling complex external functions.
I want to give an update about this problem. I made two main changes to my code: (i) I define my own linear interpolation function, and (ii) I include the bounds check inside the interpolation.
With this, the new interp0! is:
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
@inbounds @views for col in 1:size(bt,2)
polnew[:,col] = myint(bt[:,col], ct[:,col],x.b[:],p.lC[:,col],p.uC[:,col]);
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
And the interpolation is now:
function myint(x0,y0,x1,ly,uy)
y1 = similar(x1);
n = size(x0,1);
j = 1;
@simd for i in eachindex(x1)
while (j <= n) && (x1[i] > x0[j])
j+=1;
end
if j == 1
y1[i] = y0[1] + ((y0[2]-y0[1])/(x0[2]-x0[1]))*(x1[i]-x0[1]) ;
elseif j == n+1
y1[i] = y0[n] + ((y0[n]-y0[n-1])/(x0[n]-x0[n-1]))*(x1[i]-x0[n]);
else
y1[i] = y0[j-1]+ ((x1[i]-x0[j-1])/(x0[j]-x0[j-1]))*(y0[j]-y0[j-1]);
end
y1[i] > uy[i] ? y1[i] = uy[i] : nothing;
y1[i] < ly[i] ? y1[i] = ly[i] : nothing;
end
return y1;
end
As you can see, I am taking advantage of (and assuming) the fact that the grids are sorted; in fact, since j never resets between iterations, the query grid x1 must be sorted as well, which makes the whole scan a single O(n+m) pass. The last two lines in the loop enforce the bounds imposed by lC and uC.
With that I get the following total time
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
0.734630 seconds (28.93 k allocations: 752.214 MiB, 3.82% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
which is almost twice as fast, with ~8% of the total allocations. The use of views in the loop of interp0! also helps a lot.
I have written a fairly large class for the calculation of measurement uncertainties, but it is painfully slow. Profiling the code shows that the slowest operation, by far, is inserting the computation results into a large sparse matrix: about 97% of all time is spent on it. The matrix keeps the uncertainties of all measurement data, and I cannot change the data structures without breaking a lot of other code, so my only option is to optimize the data insertion step. This is done about 5700 times in my benchmark, and the amount of data increases every time.
First solution, extremely slow:
% this automatically sums up duplicate yInd entries
[zInd_grid, yInd_grid] = ndgrid(1:numel(z), yInd(:));
Uzw = sparse(zInd_grid(:), yInd_grid(:), Uzy(:), numel(z), numel(obj.w));
% this automatically sums up duplicate yInd entries
dz_dw = sparse(zInd_grid(:), yInd_grid(:), dz_dy(:), numel(z), numel(obj.w));
obj.w = [obj.w; z(:)]; % insert new measurement results into the column vector obj.w
obj.Uww = [obj.Uww, transpose(Uzw); Uzw, Uzz]; % insert new uncertainties of the measurement results
obj.dw_dw = [obj.dw_dw, transpose(dz_dw); dz_dw, dz_dz]; % insert "dependencies" of new measurements on old results
The line obj.Uww = [obj.Uww, transpose(Uzw); Uzw, Uzz]; is the slowest by far. Perhaps it is slow because MATLAB needs to allocate a new, larger buffer for obj.Uww and copy everything over. Thus I changed the code to the following:
% Preallocation in the class constructor
obj.w = spalloc(nnz_w, 1, nnz_w);
obj.Uww = spalloc(nnz_w, nnz_w, nnz_Uww);
obj.dw_dw = spalloc(nnz_w, nnz_w, nnz_dw_dw);
obj.num_w = 0; % to manually keep track of the "real" size of obj.w, obj.Uww and obj.dw_dw
The class constructor is called with the sizes the three properties w, Uww and dw_dw will have at the end of the computation (nnz_w is approximately 0.1 million, nnz_Uww about 9 million, and nnz_dw_dw about 1.6 million). Thus, no new allocation of memory should be needed. This is the inserting step now:
% this automatically sums up duplicate yInd entries
[zInd_grid, yInd_grid] = ndgrid(1:numel(z), yInd(:));
Uzw = sparse(zInd_grid(:), yInd_grid(:), Uzy(:), numel(z), obj.num_w);
% this automatically sums up duplicate yInd entries
dz_dw = sparse(zInd_grid(:), yInd_grid(:), dz_dy(:), numel(z), obj.num_w);
wInd = 1:obj.num_w;
obj.w(zInd, 1) = z(:); % insert new measurement results
obj.num_w = zInd(end); % new "real" size of w, Uww and dw_dw
obj.Uww(zInd, wInd) = Uzw; % about 51% of all computation time
obj.Uww(wInd, zInd) = transpose(Uzw); % about 15% of all computation time
obj.Uww(zInd, zInd) = Uzz; % about 14.4% of all computation time
obj.dw_dw(zInd, wInd) = dz_dw; % about 13% of all computation time
obj.dw_dw(wInd, zInd) = transpose(dz_dw); % about 3.5% of all computation time
obj.dw_dw(zInd, zInd) = dz_dz; % less than 3.5% of all computation time
Still, these lines account for 97% of all computation time, and no speed improvement. Thus I tried version three:
obj.w = [obj.w; z(:)];
[zInd_zy, yInd_zy] = ndgrid(zInd, yInd(:));
[zzInd_i, zzInd_j] = ndgrid(zInd, zInd);
[Uww_i, Uww_j, Uww_v] = find(obj.Uww); % 14% of all computation time
Uww_new = sparse( ... % this statement takes 66% of all computation time
[Uww_i; zInd_zy(:); yInd_zy(:); zzInd_i(:)], ...
[Uww_j; yInd_zy(:); zInd_zy(:); zzInd_j(:)], ...
[Uww_v; Uzy(:); Uzy(:); Uzz(:)], ...
numel(obj.w), numel(obj.w));
[dw_dw_i, dw_dw_j, dw_dw_v] = find(obj.dw_dw);
dw_dw_new = sparse( ... % 14% of all computation time
[dw_dw_i; zInd_zy(:); yInd_zy(:); zzInd_i(:)], ...
[dw_dw_j; yInd_zy(:); zInd_zy(:); zzInd_j(:)], ...
[dw_dw_v; dz_dy(:); dz_dy(:); dz_dz(:)], ...
numel(obj.w), numel(obj.w));
obj.Uww = Uww_new;
obj.dw_dw = dw_dw_new;
which is even slower than the other two versions. Why is inserting into an already preallocated array so slow? And how can I speed it up?
(All the matrices are symmetric, but I did not try to exploit that yet.)
I don't understand the details of your update pattern, but keep in mind that MATLAB stores sparse matrices internally in compressed sparse column (CSC) format, so adding entries in column-by-column order is significantly faster than other orders. E.g., on my old version of MATLAB (R2006a), this:
n=10000;
nz=400000;
v=floor(n*rand(nz,3))+1;
fprintf('Random\n');
A=sparse(n, n);
tic
for k=1:nz
A(v(k,1), v(k,2))=v(k,3);
end
toc
fprintf('Row-wise\n');
v=sortrows(v);
A=sparse(n, n);
tic
for k=1:nz
A(v(k,1), v(k,2))=v(k,3);
end
toc
fprintf('Column-wise\n');
v=sortrows(v, [2 1]);
A=sparse(n, n);
tic
for k=1:nz
A(v(k,1), v(k,2))=v(k,3);
end
toc
gives this:
>> sparsetest
Random
Elapsed time is 19.276089 seconds.
Row-wise
Elapsed time is 20.714324 seconds.
Column-wise
Elapsed time is 1.498150 seconds.
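(This matches what the CSC layout predicts: all nonzeros live in one long column-major array, so inserting an entry that belongs before existing ones must shift everything after it, roughly O(nnz) work per insertion and O(nnz^2) for an element-by-element build, while appending at the end of the last column is cheap.)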
Likely best of all would be if you could somehow just collect the nonzeros in a form suitable for spconvert or sparse and then build the whole sparse matrix at the end, but I gather that you might not be able to do that.
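A minimal sketch of that "collect triplets, build once" pattern (computeBlock is a hypothetical stand-in for whatever produces your per-step nonzeros):
I = []; J = []; V = []; % growing triplet lists
for k = 1:nBlocks
[i_k, j_k, v_k] = computeBlock(k); % hypothetical per-step results
I = [I; i_k(:)]; J = [J; j_k(:)]; V = [V; v_k(:)];
end
Uww = sparse(I, J, V, n, n); % single build; duplicate entries are summed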
@bg2b pointed out how adding data column-wise is much faster.
It turns out that adding rows to a sparse matrix or sparse vector (more precisely, inserting into the lower-triangular part) is painfully slow. Thus, I now store only the upper-triangular part of the sparse matrix, because that is fast. When I need data from the matrix, I recreate it from the upper-triangular sparse matrix. See the end of the benchmarking script for this.
This is my benchmarking script. It nicely shows the rapid increase in computation time for adding data.
% Benchmark the extension of sparse matrices of the form
% Uww_new = [ Uww, Uwz; ...
% Uzw, Uzz];
% where Uzw = transpose(Uwz). Uww and Uzz are always square and symmetric.
close all;
clearvars;
rng(70101557, 'twister'); % the seed is the number of the stack overflow question
density = 0.25;
nZ = 10;
iterations = 5e2;
n = nZ * iterations;
nonzeros = n*n*density;
all_Uzz = cell(iterations, 1);
all_Uwz = cell(iterations, 1);
for k = 1:iterations
% Uzz must be symmetric!
Uzz_nonsymmetric = sprand(nZ, nZ, density);
all_Uzz{k} = (Uzz_nonsymmetric + transpose(Uzz_nonsymmetric))./2;
all_Uwz{k} = sprand((k-1)*nZ, nZ, density);
end
f = figure();
ax = axes(f);
hold(ax, 'on');
grid(ax, 'on');
xlabel(ax, 'Iteration');
ylabel(ax, 'Elapsed time in seconds.');
h = gobjects(1, 0);
name = 'Optimised.';
fprintf('%s\n', name);
Uww_optimised = spalloc(0, 0, nonzeros);
t1 = tic();
elapsedTimes = NaN(iterations, 1);
for k = 1:iterations
Uzz = all_Uzz{k};
Uwz = all_Uwz{k};
zInd = size(Uww_optimised, 1) + (1:size(Uzz, 1));
wInd = 1:size(Uww_optimised, 1);
Uzz_triu = triu(Uzz);
Uww_optimised(wInd, zInd) = Uwz; % add columns
Uww_optimised(zInd, zInd) = Uzz_triu; % add rows and columns
elapsedTimes(k, 1) = toc(t1);
end
toc(t1)
h = [h, plot(ax, 1:iterations, seconds(elapsedTimes), 'DisplayName', name)];
name = 'Only Uwz and Uzz.';
fprintf('%s\n', name);
Uww_wz_zz = spalloc(0, 0, nonzeros);
t1 = tic();
elapsedTimes = NaN(iterations, 1);
for k = 1:iterations
Uzz = all_Uzz{k};
Uwz = all_Uwz{k};
zInd = size(Uww_wz_zz, 1) + (1:size(Uzz, 1));
wInd = 1:size(Uww_wz_zz, 1);
Uzw = transpose(Uwz);
Uww_wz_zz(wInd, zInd) = Uwz; % add columns
% Uww_wz_zz(zInd, wInd) = Uzw; % add rows
Uww_wz_zz(zInd, zInd) = Uzz; % add rows and columns
elapsedTimes(k, 1) = toc(t1);
end
toc(t1)
h = [h, plot(ax, 1:iterations, seconds(elapsedTimes), 'DisplayName', name)];
name = 'Only Uzw and Uzz.';
fprintf('%s\n', name);
Uww_zw_zz = spalloc(0, 0, nonzeros);
t1 = tic();
elapsedTimes = NaN(iterations, 1);
for k = 1:iterations
Uzz = all_Uzz{k};
Uwz = all_Uwz{k};
zInd = size(Uww_zw_zz, 1) + (1:size(Uzz, 1));
wInd = 1:size(Uww_zw_zz, 1);
Uzw = transpose(Uwz);
% Uww_zw_zz(wInd, zInd) = Uwz;
Uww_zw_zz(zInd, wInd) = Uzw;
Uww_zw_zz(zInd, zInd) = Uzz;
elapsedTimes(k, 1) = toc(t1);
end
toc(t1)
h = [h, plot(ax, 1:iterations, seconds(elapsedTimes), 'DisplayName', name)];
name = 'Uzw, Uwz and Uzz.';
fprintf('%s\n', name);
Uww = spalloc(0, 0, nonzeros);
t1 = tic();
elapsedTimes = NaN(iterations, 1);
for k = 1:iterations
Uzz = all_Uzz{k};
Uwz = all_Uwz{k};
zInd = size(Uww, 1) + (1:size(Uzz, 1));
wInd = 1:size(Uww, 1);
Uzw = transpose(Uwz);
Uww(wInd, zInd) = Uwz;
Uww(zInd, wInd) = Uzw;
Uww(zInd, zInd) = Uzz;
elapsedTimes(k, 1) = toc(t1);
end
toc(t1)
h = [h, plot(ax, 1:iterations, seconds(elapsedTimes), 'DisplayName', name)];
leg = legend(ax, h, 'Location', 'northwest');
assert(issymmetric(Uww));
assert(istriu(Uww_optimised));
assert(isequal(Uww, Uww_optimised + transpose(triu(Uww_optimised, 1))));
%% Get Uyy from Uww_optimised. Uyy is a symmetric subset of Uww
yInd = randi(size(Uww_optimised, 1), 1, nZ); % indices to extract
[yIndRowInds, yIndColInds] = ndgrid(yInd, yInd);
indsToFlip = yIndRowInds > yIndColInds;
temp = yIndColInds(indsToFlip);
yIndColInds(indsToFlip) = yIndRowInds(indsToFlip);
yIndRowInds(indsToFlip) = temp;
linInd = sub2ind(size(Uww_optimised), yIndRowInds, yIndColInds);
assert(issymmetric(linInd));
Uyy = Uww_optimised(linInd);
assert(issymmetric(Uyy));
This is a follow-up to this question.
The following code takes an enormous amount of time to loop through. Do you have any recommendations for speeding up the process? The variable z has a size of 479x1672 and others will be around 479x12000.
z = HongKongPrices;
zmat = false(size(z));
r = size(z,1);
c = size(z,2);
for k = 1:c
for i = 5:r
if z(i,k) == z(i-4,k) && z(i,k) == z(i-3,k) && z(i,k) == z(end,k)
zmat(i-3:i,k) = true;
end
end
end
z(zmat) = NaN;
I am currently running this with MATLAB R2014b on an iMac with a 3.2 GHz Intel i5 and 16 GB DDR3.
You can use logical indexing here to your advantage to replace the IF-conditional statement and keep only a small loop -
%// Get size parameters
[r,c] = size(z);
%// Get logical mask with ones for each column at places that satisfy the condition
%// mentioned as the IF conditional statement in the problem code
mask = z(1:r-4,:) == z(5:r,:) & z(2:r-3,:) == z(5:r,:) & ...
bsxfun(@eq,z(end,:),z(5:r,:));
%// Use logical indexing to map entire z array and set mask elements as NaNs
for k = 1:4
z([false(k,c) ; mask ; false(4-k,c)]) = NaN;
end
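If you want to drop the k-loop as well, the same mask can be ORed in with shifted row ranges. An untested sketch of the equivalent expansion:
zmat = false(r, c);
zmat(5:r, :) = mask; % row i
zmat(4:r-1, :) = zmat(4:r-1, :) | mask; % row i-1
zmat(3:r-2, :) = zmat(3:r-2, :) | mask; % row i-2
zmat(2:r-3, :) = zmat(2:r-3, :) | mask; % row i-3
z(zmat) = NaN;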
Benchmarking
%// Size parameters
nrows = 479;
ncols = 12000;
max_num = 10;
num_iter = 10; %// number of iterations to run each approach,
%// so that runtimes are over 1 sec mark
z_org = randi(max_num,nrows,ncols); %// random input data of specified size
disp('--------------------------------- With proposed approach')
tic
for iter = 1:num_iter
z = z_org;
[..... code from the proposed approach ...]
end
toc, clear z k mask r c
disp('--------------------------------- With original approach')
tic
for iter = 1:num_iter
z = z_org;
[..... code from the problem ...]
end
toc
Results
Case # 1: z as 479 x 1672 (num_iter = 50)
--------------------------------- With proposed approach
Elapsed time is 1.285337 seconds.
--------------------------------- With original approach
Elapsed time is 2.008256 seconds.
Case # 2: z as 479 x 12000 (num_iter = 10)
--------------------------------- With proposed approach
Elapsed time is 1.941858 seconds.
--------------------------------- With original approach
Elapsed time is 2.897006 seconds.
When I run the code shown below, the tic/toc pair inside the function shows that it takes a very short time (<< 1 sec) to go through all the lines. However, it actually takes around 2.3 secs to get the outputs! I use the tic/toc pair to measure the time.
tic
rnn.v = 11;
rnn.h = 101;
rnn.o = 7;
rnn.h_init = randn(1,rnn.h,'gpuArray');
rnn.W_vh = randn(rnn.v,rnn.h,'gpuArray');
rnn.W_hh = randn(rnn.h,rnn.h,'gpuArray');
rnn.W_ho = randn(rnn.h,rnn.o,'gpuArray');
inData.V = randn(10000,11,100,'gpuArray');
inData.TimeSteps =100;
inData.BatchSize = 10000;
[H,OX] = forward_pass(rnn, inData)
toc
All the matrices in rnn and inData are gpuArrays, so all the calculations are carried out on the GPU. The outputs are also gpuArrays.
function [H,OX] = forward_pass(rnn, inData)
tic;
%initial hidden state values
H_init = gpuArray(repmat(rnn.h_init,[inData.BatchSize,1]));
%initialize state H
H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
%initialize OX (which is H * Who)
OX = zeros(inData.BatchSize, rnn.o, inData.TimeSteps,'gpuArray');
for t = 1 : inData.TimeSteps
if t == 1
HX_t = H_init * rnn.W_hh...
+ inData.V(:,:,t) * rnn.W_vh;
else
HX_t = H(:,:,(t-1)) * rnn.W_hh...
+ inData.V(:,:,t) * rnn.W_vh;
end
H(:,:,t) = tanh(HX_t);
OX(:,:,t) = H(:,:,t) * rnn.W_ho;
end
toc;
end
Normally, if you use the gather() function, it is slow. But I didn't use gather() to transfer the outputs to the workspace, so I don't know why it is still so slow. It looks like the last line "end" takes more than 2 secs.
Does anyone know how to accelerate the function call?
First off, for proper benchmarking you do need to use gather either inside the function call or afterwards. In the former case, you would have a non-GPU output from the function call, and in the latter case, a GPU-based datatype would be the output. Now, back to your problem: you are using very few TimeSteps, so any optimization you might try won't show up in a big way. Here's an optimized version that will show increased performance as you increase TimeSteps -
function [H,OX] = forward_pass(rnn, inData)
H = zeros(inData.BatchSize, rnn.h, inData.TimeSteps,'gpuArray');
T = reshape(permute(inData.V,[1 3 2]),[],size(inData.V,2))*rnn.W_vh;
H(:,:,1) = tanh(bsxfun(@plus,rnn.h_init * rnn.W_hh,T(1:size(inData.V,1),:)));
for t = 2 : inData.TimeSteps
H(:,:,t) = tanh( H(:,:,(t-1))*rnn.W_hh + ...
T((t-1)*size(inData.V,1)+1: t*size(inData.V,1),:));
end
A = reshape(permute(H,[1 3 2]),[],size(H,2))*rnn.W_ho;
OX = permute(reshape(A,size(H,1),size(A,1)/size(H,1),[]),[1 3 2]);
return;
Benchmarking
Test Case #1
Parameters
rnn.v = 11;
rnn.h = 5;
rnn.o = 7;
inData.TimeSteps = 10000;
inData.BatchSize = 10;
Results
---- Original Code :
Elapsed time is 5.678876 seconds.
---- Modified Code :
Elapsed time is 3.821059 seconds.
Test Case #2
Parameters
inData.TimeSteps = 50000; (rest are same as in Test Case #1)
Results
---- Original Code :
Elapsed time is 28.392290 seconds.
---- Modified Code :
Elapsed time is 19.031776 seconds.
Please note that these are tested on GTX 750 Ti.
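A side note on the timing confusion in the question: gpuArray operations are queued asynchronously, so a tic/toc inside the function can return long before the GPU work has actually finished. To time the call properly, force synchronization; a sketch using the documented wait/gputimeit mechanisms:
dev = gpuDevice;
tic
[H, OX] = forward_pass(rnn, inData);
wait(dev) % block until all queued GPU operations have completed
toc
% or simply:
t = gputimeit(@() forward_pass(rnn, inData), 2); % 2 = number of outputs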
I'm doing a project that involves comparing programming languages. I'm computing the Ackermann function. I tested Java, Python, and Ruby, and got responses between 10 and 30 milliseconds. But C++ seems to take 125 milliseconds. Is this normal, or is it a problem with gettimeofday()? gettimeofday() is declared in sys/time.h.
I'm testing on a (virtual) Ubuntu Natty Narwhal 32-bit. I'm not short on processing power (quad-core 2.13 GHz Intel Xeon).
My code is here:
#include <iostream>
#include <sys/time.h>
using namespace std;
int a(int m,int n) {
if (m == 0) {
return n + 1;
} else if (m > 0 and n == 0) {
return a(m-1,1);
} else { // m > 0 and n > 0; a plain else also avoids falling off the end of a non-void function
return a(m-1,a(m,n-1));
}
}
int main() {
timeval tim;
gettimeofday(&tim,NULL);
double t1 = tim.tv_usec;
int v = a(3,4);
gettimeofday(&tim,NULL);
double t2 = tim.tv_usec;
cout << v << endl << t2-t1;
return 0;
}
Assuming you're talking about the resolution of the data returned, the POSIX specification for gettimeofday states:
The resolution of the system clock is unspecified.
This is due to the fact that systems may have a widely varying capacity for tracking small time periods. Even the ISO standard clock() function includes caveats like this.
If you're talking about how long it takes to call it, the standard makes no guarantees about performance along those lines. An implementation is perfectly free to wait 125 minutes before giving you the time although I doubt such an implementation would have much market success :-)
As an example of the limited resolution, I typed in the following code to check it on my system:
#include <stdio.h>
#include <sys/time.h>
#define NUMBER 30
int main (void) {
struct timeval tv[NUMBER];
int count[NUMBER], i, diff;
gettimeofday (&tv[0], NULL);
for (i = 1; i < NUMBER; i++) {
gettimeofday (&tv[i], NULL);
count[i] = 1;
while ((tv[i].tv_sec == tv[i-1].tv_sec) &&
(tv[i].tv_usec == tv[i-1].tv_usec))
{
count[i]++;
gettimeofday (&tv[i], NULL);
}
}
printf ("%2d: secs = %d, usecs = %6d\n", 0, tv[0].tv_sec, tv[0].tv_usec);
for (i = 1; i < NUMBER; i++) {
diff = (tv[i].tv_sec - tv[i-1].tv_sec) * 1000000;
diff += tv[i].tv_usec - tv[i-1].tv_usec;
printf ("%2d: secs = %d, usecs = %6d, count = %5d, diff = %d\n",
i, tv[i].tv_sec, tv[i].tv_usec, count[i], diff);
}
return 0;
}
The code basically records the changes in the underlying time, keeping a count of how many calls to gettimeofday() it took for the time to actually change. This is on a reasonably powerful machine, so it's not short on processing power (the count indicates how often it was able to call gettimeofday() within each time quantum, around the 5,800 mark, ignoring the first since we don't know when in that quantum we started the measurements).
The output was:
0: secs = 1318554836, usecs = 990820
1: secs = 1318554836, usecs = 991820, count = 5129, diff = 1000
2: secs = 1318554836, usecs = 992820, count = 5807, diff = 1000
3: secs = 1318554836, usecs = 993820, count = 5901, diff = 1000
4: secs = 1318554836, usecs = 994820, count = 5916, diff = 1000
5: secs = 1318554836, usecs = 995820, count = 5925, diff = 1000
6: secs = 1318554836, usecs = 996820, count = 5814, diff = 1000
7: secs = 1318554836, usecs = 997820, count = 5814, diff = 1000
8: secs = 1318554836, usecs = 998820, count = 5819, diff = 1000
9: secs = 1318554836, usecs = 999820, count = 5901, diff = 1000
10: secs = 1318554837, usecs = 820, count = 5815, diff = 1000
11: secs = 1318554837, usecs = 1820, count = 5866, diff = 1000
12: secs = 1318554837, usecs = 2820, count = 5849, diff = 1000
13: secs = 1318554837, usecs = 3820, count = 5857, diff = 1000
14: secs = 1318554837, usecs = 4820, count = 5867, diff = 1000
15: secs = 1318554837, usecs = 5820, count = 5852, diff = 1000
16: secs = 1318554837, usecs = 6820, count = 5865, diff = 1000
17: secs = 1318554837, usecs = 7820, count = 5867, diff = 1000
18: secs = 1318554837, usecs = 8820, count = 5885, diff = 1000
19: secs = 1318554837, usecs = 9820, count = 5864, diff = 1000
20: secs = 1318554837, usecs = 10820, count = 5918, diff = 1000
21: secs = 1318554837, usecs = 11820, count = 5869, diff = 1000
22: secs = 1318554837, usecs = 12820, count = 5866, diff = 1000
23: secs = 1318554837, usecs = 13820, count = 5875, diff = 1000
24: secs = 1318554837, usecs = 14820, count = 5925, diff = 1000
25: secs = 1318554837, usecs = 15820, count = 5870, diff = 1000
26: secs = 1318554837, usecs = 16820, count = 5877, diff = 1000
27: secs = 1318554837, usecs = 17820, count = 5868, diff = 1000
28: secs = 1318554837, usecs = 18820, count = 5874, diff = 1000
29: secs = 1318554837, usecs = 19820, count = 5862, diff = 1000
showing that the resolution seems to be limited to no better than one thousand microseconds. Of course, your system may be different; the bottom line is that it depends on your implementation and/or environment.
One way to get around this type of limitation is to not do something once but to do it N times and then divide the elapsed time by N.
For example, let's say you call your function and the timer says it took 125 milliseconds, something that you suspect seems a little high. I would suggest then calling it a thousand times in a loop, measuring the time it took for the entire thousand.
If that turns out to be 125 seconds then, yes, it's probably slow. However, if it takes only 27 seconds, that would indicate your timer resolution is what's causing the seemingly large times, since that would equate to 27 milliseconds per iteration, on par with what you're seeing from the other results.
Modifying your code to take this into account would be along the lines of:
int main() {
const int count = 1000;
timeval tim;
gettimeofday(&tim, NULL);
double t1 = 1.0e6 * tim.tv_sec + tim.tv_usec;
int v;
for (int i = 0; i < count; ++i)
v = a(3, 4);
gettimeofday(&tim, NULL);
double t2 = 1.0e6 * tim.tv_sec + tim.tv_usec;
cout << v << '\n' << ((t2 - t1) / count) << '\n';
return 0;
}