How to reduce the allocations in Julia? - time
I am starting to use Julia mainly because of its speed. Currently, I am solving a fixed point problem. Although the current version of my code runs fast I would like to know some methods to improve its speed.
First of all, let me summarize the algorithm.
1. There is an initial seed called C0 that maps from the space (b,y) into an action space c, so we have C0(b,y).
2. There is a formula that generates a rule Ct from C0.
3. Then, using an additional restriction, I obtain an update of b (let's call it bt). This gives a rule Ct(bt,y).
4. I need to interpolate the previous rule to move from the grid bt back to the original grid b. This gives me an update of C0 (let's call it C1).
5. I iterate until the distance between C1 and C0 is below a convergence threshold.
To implement it I created two structures:
struct Parm
lC::Array{Float64, 2} # Lower limit
uC::Array{Float64, 2} # Upper limit
γ::Float64 # CRRA coefficient
δ::Float64 # factor in the euler
γ1::Float64 #
r1::Float64 # inverse of the gross interest rate
yb1::Array{Float64, 2} # y - b(t+1)
P::Array{Float64, 2} # Transpose of transition matrix
end
mutable struct Upd1
pol::Array{Float64,2} # policy function
b::Array{Float64, 1} # exogenous grid for interpolation
dif::Float64 # updating difference
end
The first one is a set of parameters while the second one stores the decision rule C1. I also define some functions:
function eulerm(x::Upd1,p::Parm)
ct = p.δ *(x.pol.^(-p.γ)*p.P).^(-p.γ1); #Euler equation
bt = p.r1.*(ct .+ p.yb1); # Endogenous grid for bonds
return ct,bt
end
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
@inbounds @simd for col in 1:size(bt,2)
F1 = LinearInterpolation(bt[:,col], ct[:,col],extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC];
polnew[polnew .> p.uC] .= p.uC[polnew .> p.uC];
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
function updating!(x::Upd1,p::Parm)
ct, bt = eulerm(x,p); # endogenous grid
x.pol, x.dif = interp0!(bt,ct,x,p);
end
function conver(x::Upd1,p::Parm)
while x.dif>1e-8
updating!(x,p);
end
end
The first function implements steps 2 and 3. The third one performs the update (last part of step 4), and the last one iterates until convergence (step 5).
The most important function is the second one, which performs the interpolation. While running the code with @time and @btime, I realized that most of the allocations happen in the loop inside this function. I tried to reduce them by not defining polnew and writing directly into x.pol, but then the results are not correct: the code needs only two iterations to converge (I think Julia treats polold as exactly the same array as x.pol and updates both at the same time).
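The suspicion about polold is consistent with how Julia handles arrays: `polold = x.pol` binds a second name to the same array and does not copy it. The following minimal sketch shows the effect. `Holder` is a hypothetical stand-in for Upd1 that keeps only the policy matrix; it is not part of the original code.

```julia
# Hypothetical stand-in for Upd1 with only the policy matrix.
mutable struct Holder
    pol::Matrix{Float64}
end

h = Holder([1.0 2.0; 3.0 4.0])

alias = h.pol            # second name for the SAME array, nothing is copied
snapshot = copy(h.pol)   # independent array holding the same values

h.pol[1, 1] = 10.0       # in-place write, also visible through `alias`

dist_alias = maximum(abs.(h.pol .- alias))        # compares the array with itself
dist_snapshot = maximum(abs.(h.pol .- snapshot))  # distance to the old values

println("distance via alias:    ", dist_alias)
println("distance via snapshot: ", dist_snapshot)
```

If the loop writes in place into x.pol, the distance computed against polold is zero, so the convergence test passes almost immediately. Keeping a real copy (or two preallocated buffers that swap roles) avoids this.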
Any advice is well received.
For anyone who wants to run it themselves, here is the rest of the required code (LinearInterpolation and Line come from the Interpolations.jl package):
using Interpolations
function rouwen(ρ::Float64, σ2::Float64, N::Int64)
if (N % 2 != 1)
return "N should be an odd number"
end
sigz = sqrt(σ2/(1-ρ^2));
zn = sigz*sqrt(N-1);
z = range(-zn,zn,N);
p = (1+ρ)/2;
q = p;
Rho = [p 1-p;1-q q];
for i = 3:N
zz = zeros(i-1,1);
Rho = p*[Rho zz; zz' 0] + (1-p)*[zz Rho; 0 zz'] + (1-q)*[zz' 0; Rho zz] + q *[0 zz'; zz Rho];
Rho[2:end-1,:] = Rho[2:end-1,:]/2;
end
return z,Rho;
end
#############################################################
# Parameters of the model
############################################################
lb = 0; ub = 1000; pivb = 0.25; nb = 500;
ρ = 0.988; σz = 0.0439; μz =-σz/2; nz = 7;
ϕ = 0.0; σe = 0.6376; μe =-σe/2; ne = 7;
β = 0.98; r = 1/400; γ = 1;
b = exp10.(range(start=log10(lb+pivb), stop=log10(ub+pivb), length=nb)) .- pivb;
#=========================================================
Algorithm
======================================================== =#
(z,Pz) = rouwen(ρ,σz, nz);
μZ = μz/(1-ρ);
z = z .+ μZ;
(ee,Pe) = rouwen(ϕ,σe,ne);
ee = ee .+ μe;
y = exp.(vec((z .+ ee')'));
P = kron(Pz,Pe);
R = 1 + r;
r1 = R^(-1);
γ1 = 1/γ;
δ = (β*R)^(-γ1);
m = R*b .+ y';
lC = max.(m .- ub,0);
uC = m .- lb;
by1 = b .- y';
# initial guess for C0
c0 = 0.1*(m);
# Set of parameters
pp = Parm(lC,uC,γ,δ,γ1,r1,by1,P');
# Container of results
up1 = Upd1(c0,b,1);
# Fixed point problem
conver(up1,pp)
UPDATE: As recommended, I made the following changes to the interpolation function (interp0!):
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
@inbounds for col in 1:size(bt,2)
F1 = LinearInterpolation(@view(bt[:,col]), @view(ct[:,col]),extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
for j in eachindex(polnew)
polnew[j] < p.lC[j] ? polnew[j] = p.lC[j] : nothing
polnew[j] > p.uC[j] ? polnew[j] = p.uC[j] : nothing
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
This leads to an improvement in speed (from ~1.5 to ~1.3 seconds) and a reduction in the number of allocations. Some things I noticed:
Changing from polnew[:,col] = F1(x.b) to polnew[:,col] .= F1(x.b) reduces the total allocations, but the run time is slower. Why is that?
How should I understand the difference between @time and @btime? For this case, I have:
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
1.338042 seconds (385.72 k allocations: 1.157 GiB, 3.37% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
Just to be precise, in both cases I ran it several times and chose representative numbers for each line.
Does this mean that all the time is due to allocations during compilation?
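A likely explanation for the 4.200 ns reading is not compilation. @btime evaluates the expression many times, and conver mutates up1. After the first evaluation up1.dif is already below 1e-8, so every later evaluation skips the while loop. The sketch below reproduces this with a hypothetical `State` type and `converge!` function that only mimic the structure of Upd1 and conver.

```julia
# Hypothetical stand-in for Upd1: only the convergence distance is kept.
mutable struct State
    dif::Float64
end

# Mimics conver: loops until dif is below the tolerance and mutates s.
function converge!(s::State)
    iters = 0
    while s.dif > 1e-8
        s.dif /= 2
        iters += 1
    end
    return iters
end

s = State(1.0)
first_call = converge!(s)    # does the real work
second_call = converge!(s)   # s.dif is already tiny, the loop body never runs

println("iterations on first call:  ", first_call)
println("iterations on second call: ", second_call)
```

To benchmark the real work, the state has to be rebuilt before every evaluation. With BenchmarkTools this can be done through the `setup` and `evals` options, for example `@btime conver(x, $pp) setup=(x = Upd1($c0, $b, 1.0)) evals=1`.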
Start by going through the "performance tips" as advised by @DNF; below you will find the most important comments on your code.
1. Vectorize vector assignments. A small dot makes a big difference:
julia> a = rand(3,4);
julia> @btime $a[3,:] = $a[3,:] ./ 2;
40.726 ns (2 allocations: 192 bytes)
julia> @btime $a[3,:] .= $a[3,:] ./ 2;
20.562 ns (1 allocation: 96 bytes)
2. Use views when doing something with subarrays:
julia> @btime sum($a[3,:]);
18.719 ns (1 allocation: 96 bytes)
julia> @btime sum(@view($a[3,:]));
5.600 ns (0 allocations: 0 bytes)
3. Lines such as polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC]; will make far fewer allocations if you write them as a for loop over each element of polnew.
4. @simd has no effect on loops with conditionals (point 3), nor when the loop body calls complex external functions.
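As an illustration of point 3, the sketch below compares the masked assignment from the question with a single loop. The function names `clamp_masked!` and `clamp_loop!` are made up for this example, and the loop version assumes the lower bound never exceeds the upper bound.

```julia
# Masked version, as in the question: every `pol .< lo` builds a temporary
# BitMatrix and every `lo[mask]` builds a temporary vector.
function clamp_masked!(pol, lo, hi)
    pol[pol .< lo] .= lo[pol .< lo]
    pol[pol .> hi] .= hi[pol .> hi]
    return pol
end

# Loop version: one pass over the data and no temporary arrays.
# Assumes lo[j] <= hi[j] for every j.
function clamp_loop!(pol, lo, hi)
    @inbounds for j in eachindex(pol, lo, hi)
        pol[j] = clamp(pol[j], lo[j], hi[j])
    end
    return pol
end

pol = [-1.0 0.5; 2.0 0.2]
lo = zeros(2, 2)
hi = ones(2, 2)

masked = clamp_masked!(copy(pol), lo, hi)
looped = clamp_loop!(copy(pol), lo, hi)
println(masked == looped)
```

Both versions give the same matrix; the difference is only in how many temporaries are created along the way.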
I want to give an update on this problem. I made two main changes to my code: (i) I defined my own linear interpolation function and (ii) I moved the bounds check into the interpolation.
With this, the new interpolation function (interp0!) is
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
@inbounds @views for col in 1:size(bt,2)
polnew[:,col] = myint(bt[:,col], ct[:,col],x.b[:],p.lC[:,col],p.uC[:,col]);
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
And the interpolation is now:
function myint(x0,y0,x1,ly,uy)
y1 = similar(x1);
n = size(x0,1);
j = 1;
@simd for i in eachindex(x1)
while (j <= n) && (x1[i] > x0[j])
j+=1;
end
if j == 1
y1[i] = y0[1] + ((y0[2]-y0[1])/(x0[2]-x0[1]))*(x1[i]-x0[1]) ;
elseif j == n+1
y1[i] = y0[n] + ((y0[n]-y0[n-1])/(x0[n]-x0[n-1]))*(x1[i]-x0[n]);
else
y1[i] = y0[j-1]+ ((x1[i]-x0[j-1])/(x0[j]-x0[j-1]))*(y0[j]-y0[j-1]);
end
y1[i] > uy[i] ? y1[i] = uy[i] : nothing;
y1[i] < ly[i] ? y1[i] = ly[i] : nothing;
end
return y1;
end
As you can see, I am taking advantage of (and assuming) the fact that both vectors used as the basis are sorted, while the last two lines of the outer loop enforce the bounds imposed by lC and uC.
With that, I get the following total time:
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
0.734630 seconds (28.93 k allocations: 752.214 MiB, 3.82% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
which is almost twice as fast, with ~8% of the original allocations. Using views in the loop of interp0! also helps a lot.
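Part of the remaining allocations probably comes from `y1 = similar(x1)` inside myint, which creates a new vector for every column. A possible next step, sketched here under the same assumptions as myint (x0 and x1 sorted in ascending order, at least two grid points, lower bound not above upper bound), is an in-place variant that writes into a preallocated output. The name `myint!` and its argument order are hypothetical.

```julia
# Hypothetical in-place variant of myint: writes into the preallocated y1.
function myint!(y1, x0, y0, x1, ly, uy)
    n = length(x0)
    j = 1
    @inbounds for i in eachindex(x1)
        # Advance the bracket; j carries over because x1 is sorted.
        while j <= n && x1[i] > x0[j]
            j += 1
        end
        # Restrict the bracket to 2:n so both ends extrapolate linearly.
        k = clamp(j, 2, n)
        slope = (y0[k] - y0[k-1]) / (x0[k] - x0[k-1])
        y = y0[k-1] + slope * (x1[i] - x0[k-1])
        # Enforce the bounds, as the last two lines of myint do.
        y1[i] = clamp(y, ly[i], uy[i])
    end
    return y1
end

x0 = [0.0, 1.0, 2.0]
y0 = [0.0, 2.0, 4.0]          # samples of y = 2x
x1 = [-0.5, 0.5, 1.5, 3.0]    # two interior points, one below, one above
ly = fill(-10.0, 4)
uy = fill(5.0, 4)
out = similar(x1)
myint!(out, x0, y0, x1, ly, uy)
println(out)
```

Inside interp0! the output could then be a column view, for example `myint!(view(polnew, :, col), view(bt, :, col), view(ct, :, col), x.b, view(p.lC, :, col), view(p.uC, :, col))`, so that no per-column vector is created.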
Related
Tricks to improve the performance of a custom function in Julia
I am replicating in Julia a sequence of steps originally written in Matlab. In Octave, this procedure takes 1.4582 seconds and in Julia (using Jupyter) it takes approximately 10 seconds. I'll try to be brief in the scripts. My goal is to match or improve on Octave's performance. First of all, I will describe my variables and some functions:
zgrid (double, size 1x7)
kgrid (double, size 500x1)
V0 (double, size 500x7)
P (double, size 7x7), a transition matrix
delta and beta are fixed parameters.
F(z,k) and u(c) are particular functions and are specified in the Julia script.
% Octave script
% V0 is given
[K, Z, K2] = meshgrid(kgrid, zgrid, kgrid);
K = permute(K, [2, 1, 3]);
Z = permute(Z, [2, 1, 3]);
K2 = permute(K2, [2, 1, 3]);
C = max(f(Z,K) + (1-delta)*K - K2,0);
U = u(C);
EV = V0*P'; % EV is a 500x7 matrix
EV = permute(repmat(EV, 1, 1, 500), [3, 2, 1]);
H = U + beta*EV;
[TV, index] = max(H, [], 3);
In Julia, I created a function that replicates this procedure. I used loops, but it takes about 9 times longer.
# Julia script
# V0 is the input of my T operator function
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
F = (z,k) -> exp(z)*(k^α);
u = (c) -> (c^(1-μ) - 1)/(1-μ)
# parameters
α = 1/3
β = 0.987
δ = 0.012;
μ = 2
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500);
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0 , 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
function T(V)
    E=V*P'
    T1 = zeros(Float64, 500, 7 )
    aux = zeros(Float64, 500)
    for i = 1:7
        for j = 1:500
            for l = 1:500
                c= maximum( (F(zgrid[i],kgrid[j]) +(1-δ)*kgrid[j] - kgrid[l],0))
                aux[l] = u(c) + β*E[l,i]
            end
            T1[j,i] = maximum(aux)
        end
    end
    return T1
end
I would very much like to improve my performance in Julia. I believe there is a way to improve it, but I am new to Julia programming.
This code runs for me in 5 ms. Note that I have made F and u into proper (not anonymous) functions, F_ and u_, but you could get a similar effect by making the anonymous functions const. Your main problem is that you have a lot of non-const global variables, and also that your main function is doing unnecessary work multiple times and creating an unnecessary array, aux. The performance tips section in the manual is essential reading: https://docs.julialang.org/en/v1/manual/performance-tips/
F_(z,k) = exp(z) * (k^(1/3)); # you can still use α, but it must be const
u_(c) = (c^(1-2) - 1)/(1-2)
function T_(V, P, kgrid, zgrid, β, δ)
    E = V * P'
    T1 = similar(V)
    for i in axes(T1, 2)
        for j in axes(T1, 1)
            temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
            aux = -Inf
            for l in eachindex(kgrid)
                c = max(0.0, temp - kgrid[l])
                aux = max(aux, u_(c) + β * E[l, i])
            end
            T1[j,i] = aux
        end
    end
    return T1
end
Benchmark:
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
zgrid = sort!(rand(1, 7); dims=2)
kgrid = sort!(rand(500, 1); dims=1)
P = rand(length(zgrid), length(zgrid))
@btime T_($V0, $P, $kgrid, $zgrid, $β, $δ);
# output: 5.126 ms (4 allocations: 54.91 KiB)
The following should perform much better. The most noticeable differences are that it calculates F 500 times less often and that the grids and parameters are passed as arguments instead of being read from global variables.
function T(V,kgrid,zgrid,β,δ)
    E=V*P'
    T1 = zeros(Float64, 500, 7)
    for j = 1:500
        for i = 1:7
            x = F(zgrid[i],kgrid[j]) +(1-δ)*kgrid[j]
            T1[j,i] = maximum(u(max(x - kgrid[l], 0)) + β*E[l,i] for l in 1:500)
        end
    end
    return T1
end
Julia: why doesn't shared memory multi-threading give me a speedup?
I want to use shared memory multi-threading in Julia. As done by the Threads.@threads macro, I can use ccall(:jl_threading_run ...) to do this. And whilst my code now runs in parallel, I don't get the speedup I expected. The following code is intended as a minimal example of the approach I'm taking and the performance problem I'm having: [EDIT: See later for even more minimal example]
nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
    work_idx = 1
    my_result = results[Threads.threadid()]
    while work_idx > 0
        my_result += objects[work_idx]
        work_idx += nthreads
        if work_idx > test_size
            break
        end
        counts[Threads.threadid()] += 1
    end
end
# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)
# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))
On an i7-7700, a typical single threaded result is:
STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)
COUNTS:
Per thread: [999999.0]
Sum: 999999.0
And with 4 threads:
STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)
COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0
Multi-threading slows things down! Why?
EDIT: A better minimal example can be created using the @threads macro itself.
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
    a[Threads.threadid()] += b[i]
    calls[Threads.threadid()] += 1
end
I falsely assumed that the @threads macro's inclusion in Julia would mean that there was a benefit to be had.
The problem you have is most probably false sharing. You can solve it by separating the areas you write to far enough apart, like this (here is a "quick and dirty" implementation to show the essence of the change):
julia> function f(spacing)
           test_size = 1000000
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end
f (generic function with 1 method)
julia> @btime f(1);
41.525 ms (35 allocations: 7.63 MiB)
julia> @btime f(8);
2.189 ms (35 allocations: 7.63 MiB)
or by doing per-thread accumulation on a local variable like this (this is the preferred approach, as it should be uniformly faster):
function getrange(n)
    tid = Threads.threadid()
    nt = Threads.nthreads()
    d , r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end
function f()
    test_size = 10^8
    a = zeros(Threads.nthreads())
    b = rand(test_size)
    calls = zeros(Threads.nthreads())
    Threads.@threads for k = 1 : Threads.nthreads()
        local_a = 0.0
        local_c = 0.0
        for i in getrange(test_size)
            for j in 1:10
                local_a += b[i]
                local_c += 1
            end
        end
        a[Threads.threadid()] = local_a
        calls[Threads.threadid()] = local_c
    end
    a, calls
end
Also note that you are probably using 4 threads on a machine with 2 physical cores (and only 4 virtual cores), so the gains from threading will not be linear.
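The per-thread accumulation idea can also be written with tasks, so that the code does not depend on Threads.threadid() at all. This is a sketch of the same technique rather than the answer's own code; `chunked_sum` and its `ntasks` keyword are names made up here, and the input is assumed to be non-empty.

```julia
using Base.Threads

# Sums b in parallel: each task reduces its own chunk into a local variable,
# and the partial results are combined once at the end.
# Assumes b is non-empty and ntasks >= 1.
function chunked_sum(b::Vector{Float64}; ntasks::Int = nthreads())
    chunk_len = cld(length(b), ntasks)
    chunks = Iterators.partition(eachindex(b), chunk_len)
    tasks = map(chunks) do idx
        @spawn begin
            acc = 0.0                 # task-local accumulator, not a shared slot
            for i in idx
                @inbounds acc += b[i]
            end
            acc
        end
    end
    return sum(fetch, tasks)
end

b = collect(1.0:1000.0)
total = chunked_sum(b)
println(total)
```

Because each task writes only to its own local variable, no two cores repeatedly write to the same cache line, which is the effect the spacing trick above works around.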
Filling a matrix using parallel processing in Julia
I'm trying to speed up the solution time for a dynamic programming problem in Julia (v. 0.5.0), via parallel processing. The problem involves choosing the optimal values for every element of a 1073 x 19 matrix at every iteration, until successive matrix differences fall within a tolerance. I thought that, within each iteration, filling in the values for each element of the matrix could be parallelized. However, I'm seeing a huge performance degradation using SharedArray, and I'm wondering if there's a better way to approach parallel processing for this problem. I construct the arguments for the function below:
est_params = [.788,.288,.0034,.1519,.1615,.0041,.0077,.2,0.005,.7196]
r = 0.015
tau = 0.35
rho =est_params[1]
sigma =est_params[2]
delta = 0.15
gamma =est_params[3]
a_capital =est_params[4]
lambda1 =est_params[5]
lambda2 =est_params[6]
s =est_params[7]
theta =est_params[8]
mu =est_params[9]
p_bar_k_ss =est_params[10]
beta = (1+r)^(-1)
sigma_range = 4
gz = 19
gp = 29
gk = 37
lnz=collect(linspace(-sigma_range*sigma,sigma_range*sigma,gz))
z=exp(lnz)
gk_m = fld(gk,2)
# Need to add mu somewhere to k_ss
k_ss = (theta*(1-tau)/(r+delta))^(1/(1-theta))
k=cat(1,map(i->k_ss*((1-delta)^i),collect(1:gk_m)),map(i->k_ss/((1-delta)^i),collect(1:gk_m)))
insert!(k,gk_m+1,k_ss)
sort!(k)
p_bar=p_bar_k_ss*k_ss
p = collect(linspace(-p_bar/2,p_bar,gp))
#Tauchen
N = length(z)
Z = zeros(N,1)
Zprob = zeros(Float32,N,N)
Z[N] = lnz[length(z)]
Z[1] = lnz[1]
zstep = (Z[N] - Z[1]) / (N - 1)
for i=2:(N-1)
    Z[i] = Z[1] + zstep * (i - 1)
end
for a = 1 : N
    for b = 1 : N
        if b == 1
            Zprob[a,b] = 0.5*erfc(-((Z[1] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2))
        elseif b == N
            Zprob[a,b] = 1 - 0.5*erfc(-((Z[N] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        else
            Zprob[a,b] = 0.5*erfc(-((Z[b] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2)) - 0.5*erfc(-((Z[b] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        end
    end
end
# Collecting tauchen results in a 2 element array of linspace and array; [2] gets array
# Zprob=collect(tauchen(gz, rho, sigma, mu, sigma_range))[2]
Zcumprob=zeros(Float32,gz,gz)
# 2 in cumsum! denotes the 2nd dimension, i.e. columns
cumsum!(Zcumprob, Zprob,2)
gm = gk * gp
control=zeros(gm,2)
for i=1:gk
    control[(1+gp*(i-1)):(gp*i),1]=fill(k[i],(gp,1))
    control[(1+gp*(i-1)):(gp*i),2]=p
end
endog=copy(control)
E=Array(Float32,gm,gm,gz)
for h=1:gm
    for m=1:gm
        for j=1:gz
            # set the nonzero net debt indicator
            if endog[h,2]<0
                p_ind=1
            else
                p_ind=0
            end
            # set the investment indicator
            if (control[m,1]-(1-delta)*endog[h,1])!=0
                i_ind=1
            else
                i_ind=0
            end
            E[m,h,j] = (1-tau)*z[j]*(endog[h,1]^theta) + control[m,2]-endog[h,2]*(1+r*(1-tau)) + delta*endog[h,1]*tau-(control[m,1]-(1-delta)*endog[h,1]) - (i_ind*gamma*endog[h,1]+endog[h,1]*(a_capital/2)*(((control[m,1]-(1-delta)*endog[h,1])/endog[h,1])^2)) + s*endog[h,2]*p_ind
            elem = E[m,h,j]
            if E[m,h,j]<0
                E[m,h,j]=elem+lambda1*elem-.5*lambda2*elem^2
            else
                E[m,h,j]=elem
            end
        end
    end
end
I then constructed the function with serial processing. The two for loops iterate through each element to find the largest value in a 1073-sized (= the gm scalar argument in the function) array:
function dynam_serial(E,gm,gz,beta,Zprob)
    v = Array(Float32,gm,gz )
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    Tv = Array(Float32,gm,gz)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1 # arbitrary initial value greater than convcrit
    while diff>convcrit
        exp_v=v*Zprob'
        for h=1:gm
            for j=1:gz
                Tv[h,j]=findmax(E[:,h,j] + beta*exp_v[:,j])[1]
            end
        end
        diff = maxabs(Tv - v)
        v=copy(Tv)
    end
end
Timing this, I get:
@time dynam_serial(E,gm,gz,beta,Zprob)
> 106.880008 seconds (91.70 M allocations: 203.233 GB, 15.22% gc time)
Now, I try using Shared Arrays to benefit from parallel processing. Note that I reconfigured the iteration so that I only have one for loop, rather than two.
I also use v=deepcopy(Tv); otherwise, v is copied as an Array object, rather than a SharedArray:
function dynam_parallel(E,gm,gz,beta,Zprob)
    v = SharedArray(Float32,(gm,gz),init = S -> S[Base.localindexes(S)] = myid() )
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1 # arbitrary initial value greater than convcrit
    while diff>convcrit
        exp_v=v*Zprob'
        Tv = SharedArray(Float32,gm,gz,init = S -> S[Base.localindexes(S)] = myid() )
        @sync @parallel for hj=1:(gm*gz)
            j=cld(hj,gm)
            h=mod(hj,gm)
            if h==0;h=gm;end;
            @async Tv[h,j]=findmax(E[:,h,j] + beta*exp_v[:,j])[1]
        end
        diff = maxabs(Tv - v)
        v=deepcopy(Tv)
    end
end
Timing the parallel version, using a 4-core 2.5 GHz i7 processor with 16 GB of memory, I get:
addprocs(3)
@time dynam_parallel(E,gm,gz,beta,Zprob)
> 164.237208 seconds (2.64 M allocations: 201.812 MB, 0.04% gc time)
Am I doing something incorrect here? Or is there a better way to approach parallel processing in Julia for this particular problem? I've considered using Distributed Arrays, but it's difficult for me to see how to apply them to the present problem.
UPDATE: Per @DanGetz and his helpful comments, I turned instead to trying to speed up the serial processing version. I was able to get performance down to 53.469780 seconds (67.36 M allocations: 103.419 GiB, 19.12% gc time) through:
1) Upgrading to 0.6.0 (saved about 25 seconds), which includes the helpful @views macro.
2) Preallocating the main array I'm trying to fill in (Tv), per the section on Preallocating Outputs in the Julia Performance Tips: https://docs.julialang.org/en/latest/manual/performance-tips/. (saved another 25 or so seconds)
The biggest remaining slow-down seems to be coming from the add_vecs function, which sums together subarrays of two larger matrices. I've tried devectorizing and using BLAS functions, but haven't been able to produce better performance.
In any event, the improved code for dynam_serial is below:
function add_vecs(r::Array{Float32},h::Int,j::Int,E::Array{Float32},exp_v::Array{Float32},beta::Float32)
    @views r=E[:,h,j] + beta*exp_v[:,j]
    return r
end
function dynam_serial(E::Array{Float32},gm::Int,gz::Int,beta::Float32,Zprob::Array{Float32})
    v = Array{Float32}(gm,gz)
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    Tv = Array{Float32}(gm,gz)
    r = Array{Float32}(gm)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1 # arbitrary initial value greater than convcrit
    while diff>convcrit
        exp_v=v*Zprob'
        for h=1:gm
            for j=1:gz
                @views Tv[h,j]=findmax(add_vecs(r,h,j,E,exp_v,beta))[1]
            end
        end
        diff = maximum(abs,Tv - v)
        v=copy(Tv)
    end
    return Tv
end
If add_vecs seems to be the critical function, writing an explicit for loop could offer more optimization. How does the following benchmark?
function add_vecs!(r::Array{Float32},h::Int,j::Int,E::Array{Float32}, exp_v::Array{Float32},beta::Float32)
    @inbounds for i=1:size(E,1)
        r[i]=E[i,h,j] + beta*exp_v[i,j]
    end
    return r
end
UPDATE: To continue optimizing dynam_serial I have tried to remove more allocations. The result is:
function add_vecs_and_max!(gm::Int,r::Array{Float64},h::Int,j::Int,E::Array{Float64}, exp_v::Array{Float64},beta::Float64)
    @inbounds for i=1:gm
        r[i] = E[i,h,j]+beta*exp_v[i,j]
    end
    return findmax(r)[1]
end
function dynam_serial(E::Array{Float64},gm::Int,gz::Int, beta::Float64,Zprob::Array{Float64})
    v = Array{Float64}(gm,gz)
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    r = Array{Float64}(gm)
    exp_v = Array{Float64}(gm,gz)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1.0 # arbitrary initial value greater than convcrit
    while diff>convcrit
        A_mul_Bt!(exp_v,v,Zprob)
        diff = -Inf
        for h=1:gm
            for j=1:gz
                oldv = v[h,j]
                newv = add_vecs_and_max!(gm,r,h,j,E,exp_v,beta)
                v[h,j]= newv
                diff = max(diff, oldv-newv, newv-oldv)
            end
        end
    end
    return v
end
Switching the functions to use Float64 should increase speed (as CPUs are inherently optimized for 64-bit word lengths). Also, using the mutating A_mul_Bt! directly saves another allocation, and switching the arrays v and Tv avoids the copy(...). How do these optimizations improve your running time?
2nd UPDATE: Updated the code in the UPDATE section to use findmax. Also, changed dynam_serial to use v without Tv, as there was no need to save the old version except for the diff calculation, which is now done inside the loop.
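For readers on Julia 0.7 or later: A_mul_Bt! no longer exists there, and the in-place product is written with mul! from the LinearAlgebra standard library. The toy sizes and values below are assumptions for illustration only; in the thread's code v is gm x gz and Zprob is gz x gz.

```julia
using LinearAlgebra

v = [1.0 2.0; 3.0 4.0; 5.0 6.0]    # toy value function, 3 x 2
Zprob = [0.9 0.1; 0.2 0.8]         # toy transition matrix, 2 x 2
exp_v = similar(v)                 # allocated once, outside the iteration

# In-place equivalent of exp_v = v * Zprob'; the result is written into
# exp_v and no new matrix is allocated for it.
mul!(exp_v, v, Zprob')

println(exp_v)
```

The adjoint `Zprob'` is a lazy wrapper, so passing it to mul! does not create a transposed copy either.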
Here's the code I copied-and-pasted, provided by Dan Getz above. I include the array and scalar definitions exactly as I ran them. Performance was: 39.507005 seconds (11 allocations: 486.891 KiB) when running @time dynam_serial(E,gm,gz,beta,Zprob).
using SpecialFunctions
est_params = [.788,.288,.0034,.1519,.1615,.0041,.0077,.2,0.005,.7196]
r = 0.015
tau = 0.35
rho =est_params[1]
sigma =est_params[2]
delta = 0.15
gamma =est_params[3]
a_capital =est_params[4]
lambda1 =est_params[5]
lambda2 =est_params[6]
s =est_params[7]
theta =est_params[8]
mu =est_params[9]
p_bar_k_ss =est_params[10]
beta = (1+r)^(-1)
sigma_range = 4
gz = 19 #15 #19
gp = 29 #19 #29
gk = 37 #25 #37
lnz=collect(linspace(-sigma_range*sigma,sigma_range*sigma,gz))
z=exp.(lnz)
gk_m = fld(gk,2)
# Need to add mu somewhere to k_ss
k_ss = (theta*(1-tau)/(r+delta))^(1/(1-theta))
k=cat(1,map(i->k_ss*((1-delta)^i),collect(1:gk_m)),map(i->k_ss/((1-delta)^i),collect(1:gk_m)))
insert!(k,gk_m+1,k_ss)
sort!(k)
p_bar=p_bar_k_ss*k_ss
p = collect(linspace(-p_bar/2,p_bar,gp))
#Tauchen
N = length(z)
Z = zeros(N,1)
Zprob = zeros(Float64,N,N)
Z[N] = lnz[length(z)]
Z[1] = lnz[1]
zstep = (Z[N] - Z[1]) / (N - 1)
for i=2:(N-1)
    Z[i] = Z[1] + zstep * (i - 1)
end
for a = 1 : N
    for b = 1 : N
        if b == 1
            Zprob[a,b] = 0.5*erfc(-((Z[1] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2))
        elseif b == N
            Zprob[a,b] = 1 - 0.5*erfc(-((Z[N] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        else
            Zprob[a,b] = 0.5*erfc(-((Z[b] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2)) - 0.5*erfc(-((Z[b] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        end
    end
end
# Collecting tauchen results in a 2 element array of linspace and array; [2] gets array
# Zprob=collect(tauchen(gz, rho, sigma, mu, sigma_range))[2]
Zcumprob=zeros(Float64,gz,gz)
# 2 in cumsum! denotes the 2nd dimension, i.e. columns
cumsum!(Zcumprob, Zprob,2)
gm = gk * gp
control=zeros(gm,2)
for i=1:gk
    control[(1+gp*(i-1)):(gp*i),1]=fill(k[i],(gp,1))
    control[(1+gp*(i-1)):(gp*i),2]=p
end
endog=copy(control)
E=Array(Float64,gm,gm,gz)
for h=1:gm
    for m=1:gm
        for j=1:gz
            # set the nonzero net debt indicator
            if endog[h,2]<0
                p_ind=1
            else
                p_ind=0
            end
            # set the investment indicator
            if (control[m,1]-(1-delta)*endog[h,1])!=0
                i_ind=1
            else
                i_ind=0
            end
            E[m,h,j] = (1-tau)*z[j]*(endog[h,1]^theta) + control[m,2]-endog[h,2]*(1+r*(1-tau)) + delta*endog[h,1]*tau-(control[m,1]-(1-delta)*endog[h,1]) - (i_ind*gamma*endog[h,1]+endog[h,1]*(a_capital/2)*(((control[m,1]-(1-delta)*endog[h,1])/endog[h,1])^2)) + s*endog[h,2]*p_ind
            elem = E[m,h,j]
            if E[m,h,j]<0
                E[m,h,j]=elem+lambda1*elem-.5*lambda2*elem^2
            else
                E[m,h,j]=elem
            end
        end
    end
end
function add_vecs_and_max!(gm::Int,r::Array{Float64},h::Int,j::Int,E::Array{Float64}, exp_v::Array{Float64},beta::Float64)
    maxr = -Inf
    @inbounds for i=1:gm
        r[i] = E[i,h,j]+beta*exp_v[i,j]
        maxr = max(r[i],maxr)
    end
    return maxr
end
function dynam_serial(E::Array{Float64},gm::Int,gz::Int, beta::Float64,Zprob::Array{Float64})
    v = Array{Float64}(gm,gz)
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    Tv = Array{Float64}(gm,gz)
    r = Array{Float64}(gm)
    exp_v = Array{Float64}(gm,gz)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1.0 # arbitrary initial value greater than convcrit
    while diff>convcrit
        A_mul_Bt!(exp_v,v,Zprob)
        diff = -Inf
        for h=1:gm
            for j=1:gz
                Tv[h,j]=add_vecs_and_max!(gm,r,h,j,E,exp_v,beta)
                diff = max(abs(Tv[h,j]-v[h,j]),diff)
            end
        end
        (v,Tv)=(Tv,v)
    end
    return v
end
Now, here's another version of the algorithm and inputs. The functions are similar to what Dan Getz suggested, except that I use findmax rather than an iterated max function to find the array maximum. In the input construction, I am using both Float32 and mixing different bit-types together.
However, I've consistently achieved better performance this way: 24.905569 seconds (1.81 k allocations: 46.829 MiB, 0.01% gc time). But it's not clear at all why.
using SpecialFunctions
est_params = [.788,.288,.0034,.1519,.1615,.0041,.0077,.2,0.005,.7196]
r = 0.015
tau = 0.35
rho =est_params[1]
sigma =est_params[2]
delta = 0.15
gamma =est_params[3]
a_capital =est_params[4]
lambda1 =est_params[5]
lambda2 =est_params[6]
s =est_params[7]
theta =est_params[8]
mu =est_params[9]
p_bar_k_ss =est_params[10]
beta = Float32((1+r)^(-1))
sigma_range = 4
gz = 19
gp = 29
gk = 37
lnz=collect(linspace(-sigma_range*sigma,sigma_range*sigma,gz))
z=exp(lnz)
gk_m = fld(gk,2)
# Need to add mu somewhere to k_ss
k_ss = (theta*(1-tau)/(r+delta))^(1/(1-theta))
k=cat(1,map(i->k_ss*((1-delta)^i),collect(1:gk_m)),map(i->k_ss/((1-delta)^i),collect(1:gk_m)))
insert!(k,gk_m+1,k_ss)
sort!(k)
p_bar=p_bar_k_ss*k_ss
p = collect(linspace(-p_bar/2,p_bar,gp))
#Tauchen
N = length(z)
Z = zeros(N,1)
Zprob = zeros(Float32,N,N)
Z[N] = lnz[length(z)]
Z[1] = lnz[1]
zstep = (Z[N] - Z[1]) / (N - 1)
for i=2:(N-1)
    Z[i] = Z[1] + zstep * (i - 1)
end
for a = 1 : N
    for b = 1 : N
        if b == 1
            Zprob[a,b] = 0.5*erfc(-((Z[1] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2))
        elseif b == N
            Zprob[a,b] = 1 - 0.5*erfc(-((Z[N] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        else
            Zprob[a,b] = 0.5*erfc(-((Z[b] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2)) - 0.5*erfc(-((Z[b] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
        end
    end
end
# Collecting tauchen results in a 2 element array of linspace and array; [2] gets array
# Zprob=collect(tauchen(gz, rho, sigma, mu, sigma_range))[2]
Zcumprob=zeros(Float32,gz,gz)
# 2 in cumsum! denotes the 2nd dimension, i.e. columns
cumsum!(Zcumprob, Zprob,2)
gm = gk * gp
control=zeros(gm,2)
for i=1:gk
    control[(1+gp*(i-1)):(gp*i),1]=fill(k[i],(gp,1))
    control[(1+gp*(i-1)):(gp*i),2]=p
end
endog=copy(control)
E=Array(Float32,gm,gm,gz)
for h=1:gm
    for m=1:gm
        for j=1:gz
            # set the nonzero net debt indicator
            if endog[h,2]<0
                p_ind=1
            else
                p_ind=0
            end
            # set the investment indicator
            if (control[m,1]-(1-delta)*endog[h,1])!=0
                i_ind=1
            else
                i_ind=0
            end
            E[m,h,j] = (1-tau)*z[j]*(endog[h,1]^theta) + control[m,2]-endog[h,2]*(1+r*(1-tau)) + delta*endog[h,1]*tau-(control[m,1]-(1-delta)*endog[h,1]) - (i_ind*gamma*endog[h,1]+endog[h,1]*(a_capital/2)*(((control[m,1]-(1-delta)*endog[h,1])/endog[h,1])^2)) + s*endog[h,2]*p_ind
            elem = E[m,h,j]
            if E[m,h,j]<0
                E[m,h,j]=elem+lambda1*elem-.5*lambda2*elem^2
            else
                E[m,h,j]=elem
            end
        end
    end
end
function add_vecs!(gm::Int,r::Array{Float32},h::Int,j::Int,E::Array{Float32}, exp_v::Array{Float32},beta::Float32)
    @inbounds @views for i=1:gm
        r[i]=E[i,h,j] + beta*exp_v[i,j]
    end
    return r
end
function dynam_serial(E::Array{Float32},gm::Int,gz::Int,beta::Float32,Zprob::Array{Float32})
    v = Array{Float32}(gm,gz)
    fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
    Tv = Array{Float32}(gm,gz)
    # Set parameters for the loop
    convcrit = 0.0001 # chosen convergence criterion
    diff = 1.00000 # arbitrary initial value greater than convcrit
    iter=0
    exp_v=Array{Float32}(gm,gz)
    r=Array{Float32}(gm)
    while diff>convcrit
        A_mul_Bt!(exp_v,v,Zprob)
        for h=1:gm
            for j=1:gz
                Tv[h,j]=findmax(add_vecs!(gm,r,h,j,E,exp_v,beta))[1]
            end
        end
        diff = maximum(abs,Tv - v)
        (v,Tv)=(Tv,v)
    end
    return v
end
Memory allocation in a fixed point algorithm
I need to find the fixed point of a function f. The algorithm is very simple:
1. Given X, compute f(X).
2. If ||X-f(X)|| is lower than a certain tolerance, exit and return X; otherwise set X equal to f(X) and go back to 1.
I'd like to be sure I'm not allocating memory for a new object at every iteration. For now, the algorithm looks like this:
iter1 = function(x::Vector{Float64})
    for iter in 1:max_it
        oldx = copy(x)
        g1(x)
        delta = vnormdiff(x, oldx, 2)
        if delta < tolerance
            break
        end
    end
end
Here g1(x) is a function that sets x to f(x). But it seems this loop allocates a new vector at every iteration (see below). Another way to write the algorithm is the following:
iter2 = function(x::Vector{Float64})
    oldx = similar(x)
    for iter in 1:max_it
        (oldx, x) = (x, oldx)
        g2(x, oldx)
        delta = vnormdiff(oldx, x, 2)
        if delta < tolerance
            break
        end
    end
end
where g2(x1, x2) is a function that sets x1 to f(x2). Is this the most efficient and natural way to write this kind of iteration problem?
Edit1: timing shows that the second code is faster:
using NumericExtensions
max_it = 1000
tolerance = 1e-8
max_it = 100
g1 = function(x::Vector{Float64})
    for i in 1:length(x)
        x[i] = x[i]/2
    end
end
g2 = function(newx::Vector{Float64}, x::Vector{Float64})
    for i in 1:length(x)
        newx[i] = x[i]/2
    end
end
x = fill(1e7, int(1e7))
@time iter1(x)
# elapsed time: 4.688103075 seconds (4960117840 bytes allocated, 29.72% gc time)
x = fill(1e7, int(1e7))
@time iter2(x)
# elapsed time: 2.187916177 seconds (80199676 bytes allocated, 0.74% gc time)
Edit2: using copy!
iter3 = function(x::Vector{Float64})
    oldx = similar(x)
    for iter in 1:max_it
        copy!(oldx, x)
        g1(x)
        delta = vnormdiff(x, oldx, 2)
        if delta < tolerance
            break
        end
    end
end
x = fill(1e7, int(1e7))
@time iter3(x)
# elapsed time: 2.745350176 seconds (80008088 bytes allocated, 1.11% gc time)
I think replacing the following lines in the first code
for iter = 1:max_it
    oldx = copy( x )
    ...
by
oldx = zeros( N )
for iter = 1:max_it
    oldx[:] = x # or copy!( oldx, x )
    ...
will be more efficient because no array is allocated inside the loop. Also, the code can be made more efficient by writing for-loops explicitly. This can be seen, for example, from the following comparison:
function test()
    N = 1000000
    a = zeros( N )
    b = zeros( N )
    @time c = copy( a )
    @time b[:] = a
    @time copy!( b, a )
    @time for i = 1:length(a)
        b[i] = a[i]
    end
    @time for i in eachindex(a)
        b[i] = a[i]
    end
end
test()
The result obtained with Julia 0.4.0 on Linux (x86_64) is
elapsed time: 0.003955609 seconds (7 MB allocated)
elapsed time: 0.001279142 seconds (0 bytes allocated)
elapsed time: 0.000836167 seconds (0 bytes allocated)
elapsed time: 1.19e-7 seconds (0 bytes allocated)
elapsed time: 1.28e-7 seconds (0 bytes allocated)
It seems that copy!() is faster than using [:] on the left-hand side, though the difference becomes marginal in repeated calculations (there seems to be some overhead for the first [:] calculation). Btw, the last example using eachindex() is very convenient for looping over multi-dimensional arrays. A similar comparison can be made for vnormdiff(), where using norm( x - oldx ) etc. is slower than an explicit loop for the vector norm, because the former allocates one temporary array for x - oldx.
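Putting the two ideas together (preallocated buffers and an explicit loop for the distance), a fixed point iteration can be sketched as follows. All names here (`maxabsdiff`, `fixedpoint!`, `halve!`) are hypothetical, and the sketch uses the maximum absolute difference instead of the 2-norm that vnormdiff computes in the question.

```julia
# Largest absolute difference between two vectors, with no temporary array.
function maxabsdiff(a, b)
    d = 0.0
    @inbounds for i in eachindex(a, b)
        d = max(d, abs(a[i] - b[i]))
    end
    return d
end

# Iterates x <- f(x) with two preallocated buffers that swap roles.
# f!(next, prev) is assumed to write f(prev) into next.
function fixedpoint!(f!, x::Vector{Float64}; tol = 1e-8, maxit = 1000)
    prev = copy(x)
    next = similar(x)
    for it in 1:maxit
        f!(next, prev)
        if maxabsdiff(next, prev) < tol
            copyto!(x, next)
            return x, it
        end
        prev, next = next, prev   # swap the bindings, no data is copied
    end
    copyto!(x, prev)
    return x, maxit
end

# Same toy map as g2 in the question: halve every element (fixed point 0).
halve!(next, prev) = (next .= prev ./ 2)

x = fill(1.0, 5)
x, iters = fixedpoint!(halve!, x)
println("converged after ", iters, " iterations")
```

Only two vectors are allocated, both before the loop starts, so the iteration itself does not allocate.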
How to speed up a double loop in matlab
This is a follow-up question of this question. The following code takes an enormous amount of time to loop through. Do you have any recommendations for speeding up the process? The variable z has a size of 479x1672 and others will be around 479x12000.
z = HongKongPrices;
zmat = false(size(z));
r = size(z,1);
c = size(z,2);
for k = 1:c
    for i = 5:r
        if z(i,k) == z(i-4,k) && z(i,k) == z(i-3,k) && z(i,k) == z(end,k)
            zmat(i-3:i,k) = 1
        end
    end
end
z(zmat) = NaN
I am currently running this with MatLab R2014b on an iMac with a 3.2 GHz Intel i5 and 16 GB DDR3.
You can use logical indexing here to your advantage to replace the IF-conditional statement and have a small loop:
%// Get size parameters
[r,c] = size(z);
%// Get logical mask with ones for each column at places that satisfy the condition
%// mentioned as the IF conditional statement in the problem code
mask = z(1:r-4,:) == z(5:r,:) & z(2:r-3,:) == z(5:r,:) & ...
    bsxfun(@eq,z(end,:),z(5:r,:));
%// Use logical indexing to map entire z array and set mask elements as NaNs
for k = 1:4
    z([false(k,c) ; mask ; false(4-k,c)]) = NaN;
end
Benchmarking
%// Size parameters
nrows = 479;
ncols = 12000;
max_num = 10;
num_iter = 10; %// number of iterations to run each approach,
               %// so that runtimes are over 1 sec mark
z_org = randi(max_num,nrows,ncols); %// random input data of specified size
disp('--------------------------------- With proposed approach')
tic
for iter = 1:num_iter
    z = z_org;
    [..... code from the proposed approach ...]
end
toc, clear z k mask r c
disp('--------------------------------- With original approach')
tic
for iter = 1:num_iter
    z = z_org;
    [..... code from the problem ...]
end
toc
Results
Case # 1: z as 479 x 1672 (num_iter = 50)
--------------------------------- With proposed approach
Elapsed time is 1.285337 seconds.
--------------------------------- With original approach
Elapsed time is 2.008256 seconds.
Case # 2: z as 479 x 12000 (num_iter = 10)
--------------------------------- With proposed approach
Elapsed time is 1.941858 seconds.
--------------------------------- With original approach
Elapsed time is 2.897006 seconds.