Tricks to improve the performance of a cunstom function in Julia

Tricks to improve the performance of a cunstom function in Julia - performance

I am replicating using Julia a sequence of steps originally made in Matlab. In Octave, this procedure takes 1.4582 seconds and in Julia (using Jupyter) it takes approximately 10 seconds. I'll try to be brief in the scripts. My goal is to achieve or improve Octave's performance. First of all, I will describe my variables and some function:
zgrid (double 1x7 size)
kgrid (double 500x1 size)
V0 (double 500x7 size)
P (double 7x7 size) a transition matrix
delta and beta are fixed parameters.
F(z,k) and u(c) are particular functions and are specified in the Julia script.
% Octave script
% V0 is given
[K, Z, K2] = meshgrid(kgrid, zgrid, kgrid);
K = permute(K, [2, 1, 3]);
Z = permute(Z, [2, 1, 3]);
K2 = permute(K2, [2, 1, 3]);
C = max(f(Z,K) + (1-delta)*K - K2,0);
U = u(C);
EV = V0*P';% EV is a 500x7 matrix size
EV = permute(repmat(EV, 1, 1, 500), [3, 2, 1]);
H = U + beta*EV;
[TV, index] = max(H, [], 3);
In Julia, I created a function that replicates this procedure. I used loops, but it has a performance 9 times longer.
% Julia script
% V0 is the input of my T operator function
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
F = (z,k) -> exp(z)*(k^α);
u = (c) -> (c^(1-μ) - 1)/(1-μ)
% parameters
α = 1/3
β = 0.987
δ = 0.012;
μ = 2
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500);
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0 , 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
function T(V)
E=V*P'
T1 = zeros(Float64, 500, 7 )
aux = zeros(Float64, 500)
for i = 1:7
for j = 1:500
for l = 1:500
c= maximum( (F(zrid[i],kgrid[j]) +(1-δ)*kgrid[j] - kgrid[l],0))
aux[l] = u(c) + β*E[l,i]
end
T1[j,i] = maximum(aux)
end
end
return T1
end
I would very much like to improve my performance in Julia. I believe there is a way to improve, but I am new in Julia programming.

This code runs for me in 5ms. Note that I have made F and u into proper (not anonymous) functions, F_ and u_, but you could get a similar effect by making the anonymous functions const.
Your main problem is that you have a lot of non-const global variables, and also that your main function is doing unnecessary work multiple times, and creating an unnecessary array, aux.
The performance tips section in the manual is essential reading: https://docs.julialang.org/en/v1/manual/performance-tips/
F_(z,k) = exp(z) * (k^(1/3)); # you can still use α, but it must be const
u_(c) = (c^(1-2) - 1)/(1-2)
function T_(V, P, kgrid, zgrid, β, δ)
E = V * P'
T1 = similar(V)
for i in axes(T1, 2)
for j in axes(T1, 1)
temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
aux = -Inf
for l in eachindex(kgrid)
c = max(0.0, temp - kgrid[l])
aux = max(aux, u_(c) + β * E[l, i])
end
T1[j,i] = aux
end
end
return T1
end
Benchmark:
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
zgrid = sort!(rand(1, 7); dims=2)
kgrid = sort!(rand(500, 1); dims=1)
P = rand(length(zgrid), length(zgrid))
#btime T_($V0, $P, $kgrid, $zgrid, $β, $δ);
# output: 5.126 ms (4 allocations: 54.91 KiB)

The following should perform much better. The most noticeable differences are that it calculates F 500x less, and doesn't rely on global variables.
function T(V,kgrid,zgrid,β,δ)
E=V*P'
T1 = zeros(Float64, 500, 7)
for j = 1:500
for i = 1:7
x = F(zrid[i],kgrid[j]) +(1-δ)*kgrid[j]
T1[j,i] = maximum(u(max(x - kgrid[l], 0)) + β*E[l,i] for l in 1:500)
end
end
return T1
end

Related

Monte Carlo program throws a method error in Julia

I am running this code but it shows a method error. Can someone please help me?
Code:
function lsmc_am_put(S, K, r, σ, t, N, P)
Δt = t / N
R = exp(r * Δt)
T = typeof(S * exp(-σ^2 * Δt / 2 + σ * √Δt * 0.1) / R)
X = Array{T}(N+1, P)
for p = 1:P
X[1, p] = x = S
for n = 1:N
x *= R * exp(-σ^2 * Δt / 2 + σ * √Δt * randn())
X[n+1, p] = x
end
end
V = [max(K - x, 0) / R for x in X[N+1, :]]
for n = N-1:-1:1
I = V .!= 0
A = [x^d for d = 0:3, x in X[n+1, :]]
β = A[:, I]' \ V[I]
cV = A' * β
for p = 1:P
ev = max(K - X[n+1, p], 0)
if I[p] && cV[p] < ev
V[p] = ev / R
else
V[p] /= R
end
end
end
return max(mean(V), K - S)
end
lsmc_am_put(100, 90, 0.05, 0.3, 180/365, 1000, 10000)
error:
MethodError: no method matching (Array{Float64})(::Int64, ::Int64)
Closest candidates are:
(Array{T})(::LinearAlgebra.UniformScaling, ::Integer, ::Integer) where T at /Volumes/Julia-1.8.3/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/LinearAlgebra/src/uniformscaling.jl:508
(Array{T})(::Nothing, ::Any...) where T at baseext.jl:45
(Array{T})(::UndefInitializer, ::Int64) where T at boot.jl:473
...
Stacktrace:
[1] lsmc_am_put(S::Int64, K::Int64, r::Float64, σ::Float64, t::Float64, N::Int64, P::Int64)
# Main ./REPL[39]:5
[2] top-level scope
# REPL[40]:1
I tried this code and I was expecting a numeric answer but this error came up. I tried to look it up on google but I found nothing that matches my situation.

The error occurs where you wrote X = Array{T}(N+1, P). Instead, use one of the following approaches if you need a Vector:
julia> Array{Float64, 1}([1,2,3])
3-element Vector{Float64}:
1.0
2.0
3.0
julia> Vector{Float64}([1, 2, 3])
3-element Vector{Float64}:
1.0
2.0
3.0
And in your case, you should write X = Array{T,1}([N+1, P]) or X = Vector{T}([N+1, P]). But since there's such a X[1, p] = x = S expression in your code, I guess you mean to initialize a 2D array and update its elements through the algorithm. For this, you can define X like the following:
X = zeros(Float64, N+1, P)
# Or
X = Array{Float64, 2}(undef, N+1, P)
So, I tried the following in your code:
# I just changed the definition of `X` in your code like the following
X = Array{T, 2}(undef, N+1, P)
#And the result of the code was:
julia> lsmc_am_put(100, 90, 0.05, 0.3, 180/365, 1000, 10000)
3.329213731484463

How to reduce the allocations in Julia?

I am starting to use Julia mainly because of its speed. Currently, I am solving a fixed point problem. Although the current version of my code runs fast I would like to know some methods to improve its speed.
First of all, let me summarize the algorithm.
There is an initial seed called C0 that maps from the space (b,y) into an action space c, then we have C0(b,y)
There is a formula that generates a rule Ct from C0.
Then, using an additional restriction, I can obtain an updating of b [let's called it bt]. Thus,it generates a rule Ct(bt,y)
I need to interpolate the previous rule to move from the grid bt into the original grid b. It gives me an update for C0 [let's called that C1]
I will iterate until the distance between C1 and C0 is below a convergence threshold.
To implement it I created two structures:
struct Parm
lC::Array{Float64, 2} # Lower limit
uC::Array{Float64, 2} # Upper limit
γ::Float64 # CRRA coefficient
δ::Float64 # factor in the euler
γ1::Float64 #
r1::Float64 # inverse of the gross interest rate
yb1::Array{Float64, 2} # y - b(t+1)
P::Array{Float64, 2} # Transpose of transition matrix
end
mutable struct Upd1
pol::Array{Float64,2} # policy function
b::Array{Float64, 1} # exogenous grid for interpolation
dif::Float64 # updating difference
end
The first one is a set of parameters while the second one stores the decision rule C1. I also define some functions:
function eulerm(x::Upd1,p::Parm)
ct = p.δ *(x.pol.^(-p.γ)*p.P).^(-p.γ1); #Euler equation
bt = p.r1.*(ct .+ p.yb1); #Endeogenous grid for bonds
return ct,bt
end
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
#inbounds #simd for col in 1:size(bt,2)
F1 = LinearInterpolation(bt[:,col], ct[:,col],extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC];
polnew[polnew .> p.uC] .= p.uC[polnew .> p.uC];
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
function updating!(x::Upd1,p::Parm)
ct, bt = eulerm(x,p); # endogeneous grid
x.pol, x.dif = interp0!(bt,ct,x,p);
end
function conver(x::Upd1,p::Parm)
while x.dif>1e-8
updating!(x,p);
end
end
The first formula implements steps 2 and 3. The third one makes the updating (last part of step 4), and the last one iterates until convergence (step 5).
The most important function is the second one. It makes the interpolation. While I was running the function #time and #btime I realized that the largest number of allocations are in the loop inside this function. I tried to reduce it by not defining polnew and goes directly to x.pol but in this case, the results are not correct since it only need two iterations to converge (I think that Julia is thinking that polold is exactly the same than x.pol and it is updating both at the same time).
Any advice is well received.
To anyone that wants to run it by themselves, I add the rest of the required code:
function rouwen(ρ::Float64, σ2::Float64, N::Int64)
if (N % 2 != 1)
return "N should be an odd number"
end
sigz = sqrt(σ2/(1-ρ^2));
zn = sigz*sqrt(N-1);
z = range(-zn,zn,N);
p = (1+ρ)/2;
q = p;
Rho = [p 1-p;1-q q];
for i = 3:N
zz = zeros(i-1,1);
Rho = p*[Rho zz; zz' 0] + (1-p)*[zz Rho; 0 zz'] + (1-q)*[zz' 0; Rho zz] + q *[0 zz'; zz Rho];
Rho[2:end-1,:] = Rho[2:end-1,:]/2;
end
return z,Rho;
end
#############################################################
# Parameters of the model
############################################################
lb = 0; ub = 1000; pivb = 0.25; nb = 500;
ρ = 0.988; σz = 0.0439; μz =-σz/2; nz = 7;
ϕ = 0.0; σe = 0.6376; μe =-σe/2; ne = 7;
β = 0.98; r = 1/400; γ = 1;
b = exp10.(range(start=log10(lb+pivb), stop=log10(ub+pivb), length=nb)) .- pivb;
#=========================================================
Algorithm
======================================================== =#
(z,Pz) = rouwen(ρ,σz, nz);
μZ = μz/(1-ρ);
z = z .+ μZ;
(ee,Pe) = rouwen(ϕ,σe,ne);
ee = ee .+ μe;
y = exp.(vec((z .+ ee')'));
P = kron(Pz,Pe);
R = 1 + r;
r1 = R^(-1);
γ1 = 1/γ;
δ = (β*R)^(-γ1);
m = R*b .+ y';
lC = max.(m .- ub,0);
uC = m .- lb;
by1 = b .- y';
# initial guess for C0
c0 = 0.1*(m);
# Set of parameters
pp = Parm(lC,uC,γ,δ,γ1,r1,by1,P');
# Container of results
up1 = Upd1(c0,b,1);
# Fixed point problem
conver(up1,pp)
UPDATE As it was reccomend, I made the following changes to the third function
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
#inbounds for col in 1:size(bt,2)
F1 = LinearInterpolation(#view(bt[:,col]), #view(ct[:,col]),extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
for j in eachindex(polnew)
polnew[j] < p.lC[j] ? polnew[j] = p.lC[j] : nothing
polnew[j] > p.uC[j] ? polnew[j] = p.uC[j] : nothing
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
This leads to an improvement in the speed (from ~1.5 to ~1.3 seconds). And a reduction in the number of allocations. Somethings that I noted were:
Changing from polnew[:,col] = F1(x.b) to polnew[:,col] .= F1(x.b) can reduce the total allocations but the time is slower, why is that?
How should I understand the difference between #time and #btime. For this case, I have:
up1 = Upd1(c0,b,1);
#time conver(up1,pp)
1.338042 seconds (385.72 k allocations: 1.157 GiB, 3.37% gc time)
up1 = Upd1(c0,b,1);
#btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
Just to be precise, in both cases, I run it several times and I choose representative numbers for each line.
Does it mean that all the time is due allocations during the compilation?

Start going through the "performance tips" as advised by #DNF but below you will find most important comments for your code.
Vectorize vector assignments - a small dot makes big difference
julia> julia> a = rand(3,4);
julia> #btime $a[3,:] = $a[3,:] ./ 2;
40.726 ns (2 allocations: 192 bytes)
julia> #btime $a[3,:] .= $a[3,:] ./ 2;
20.562 ns (1 allocation: 96 bytes)
Use views when doing something with subarrays:
julia> #btime sum($a[3,:]);
18.719 ns (1 allocation: 96 bytes)
julia> #btime sum(#view($a[3,:]));
5.600 ns (0 allocations: 0 bytes)
Your code around a lines polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC]; will make much less allocations when you do it with a for loop over each element of polnew
#simd will have no effect on conditionals (point 3) neither when code is calling complex external functions

I want to give an update about this problem. I made two main changes to my code: (i) I define my own linear interpolation function and (ii) I include the check of bounds in the interpolation.
With this the new function three is
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
#inbounds #views for col in 1:size(bt,2)
polnew[:,col] = myint(bt[:,col], ct[:,col],x.b[:],p.lC[:,col],p.uC[:,col]);
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
And the interpolation is now:
function myint(x0,y0,x1,ly,uy)
y1 = similar(x1);
n = size(x0,1);
j = 1;
#simd for i in eachindex(x1)
while (j <= n) && (x1[i] > x0[j])
j+=1;
end
if j == 1
y1[i] = y0[1] + ((y0[2]-y0[1])/(x0[2]-x0[1]))*(x1[i]-x0[1]) ;
elseif j == n+1
y1[i] = y0[n] + ((y0[n]-y0[n-1])/(x0[n]-x0[n-1]))*(x1[i]-x0[n]);
else
y1[i] = y0[j-1]+ ((x1[i]-x0[j-1])/(x0[j]-x0[j-1]))*(y0[j]-y0[j-1]);
end
y1[i] > uy[i] ? y1[i] = uy[i] : nothing;
y1[i] < ly[i] ? y1[i] = ly[i] : nothing;
end
return y1;
end
As you can see, I am taking advantage (and assuming) that both vectors that we use as basis are ordered while the two last lines in the outer loops checks the bounds imposed by lC and uC.
With that I get the following total time
up1 = Upd1(c0,b,1);
#time conver(up1,pp)
0.734630 seconds (28.93 k allocations: 752.214 MiB, 3.82% gc time)
up1 = Upd1(c0,b,1);
#btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
which is almost as twice faster with ~8% of the total allocations. the use of views in the loop of the function interp0! also helps a lot.

MATLAB Optimisation of Weighted Gram-Schmidt Orthogonalisation

I have a function in MATLAB which performs the Gram-Schmidt Orthogonalisation with a very important weighting applied to the inner-products (I don't think MATLAB's built in function supports this).
This function works well as far as I can tell, however, it is too slow on large matrices.
What would be the best way to improve this?
I have tried converting to a MEX file but I lose parallelisation with the compiler I'm using and so it is then slower.
I was thinking of running it on a GPU as the element-wise multiplications are highly parallelised. (But I'd prefer the implementation to be easily portable)
Can anyone vectorise this code or make it faster? I am not sure how to do it elegantly ...
I know the stackoverflow minds here are amazing, consider this a challenge :)
Function
function [Q, R] = Gram_Schmidt(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
v = zeros(n, 1);
for j = 1:n
v = A(:,j);
for i = 1:j-1
R(i,j) = sum( v .* conj( Q(:,i) ) .* w ) / ...
sum( Q(:,i) .* conj( Q(:,i) ) .* w );
v = v - R(i,j) * Q(:,i);
end
R(j,j) = norm(v);
Q(:,j) = v / R(j,j);
end
end
where A is an m x n matrix of complex numbers and w is an m x 1 vector of real numbers.
Bottle-neck
This is the expression for R(i,j) which is the slowest part of the function (not 100% sure if the notation is correct):
where w is a non-negative weight function.
The weighted inner-product is mentioned on several Wikipedia pages, this is one on the weight function and this is one on orthogonal functions.
Reproduction
You can produce results using the following script:
A = complex( rand(360000,100), rand(360000,100));
w = rand(360000, 1);
[Q, R] = Gram_Schmidt(A, w);
where A and w are the inputs.
Speed and Computation
If you use the above script you will get profiler results synonymous to the following:
Testing Result
You can test the results by comparing a function with the one above using the following script:
A = complex( rand( 100, 10), rand( 100, 10));
w = rand( 100, 1);
[Q , R ] = Gram_Schmidt( A, w);
[Q2, R2] = Gram_Schmidt2( A, w);
zeros1 = norm( Q - Q2 );
zeros2 = norm( R - R2 );
where Gram_Schmidt is the function described earlier and Gram_Schmidt2 is an alternative function. The results zeros1 and zeros2 should then be very close to zero.
Note:
I tried speeding up the calculation of R(i,j) with the following but to no avail ...
R(i,j) = ( w' * ( v .* conj( Q(:,i) ) ) ) / ...
( w' * ( Q(:,i) .* conj( Q(:,i) ) ) );

1)
My first attempt at vectorization:
function [Q, R] = Gram_Schmidt1(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
for j = 1:n
v = A(:,j);
QQ = Q(:,1:j-1);
QQ = bsxfun(#rdivide, bsxfun(#times, w, conj(QQ)), w.' * abs(QQ).^2);
for i = 1:j-1
R(i,j) = (v.' * QQ(:,i));
v = v - R(i,j) * Q(:,i);
end
R(j,j) = norm(v);
Q(:,j) = v / R(j,j);
end
end
Unfortunately, it turned out to be slower than the original function.
2)
Then I realized that the columns of this intermediate matrix QQ are built incrementally, and that previous ones are not modified. So here is my second attempt:
function [Q, R] = Gram_Schmidt2(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
QQ = complex(zeros(m, n-1));
for j = 1:n
if j>1
qj = Q(:,j-1);
QQ(:,j-1) = (conj(qj) .* w) ./ (w.' * (qj.*conj(qj)));
end
v = A(:,j);
for i = 1:j-1
R(i,j) = (v.' * QQ(:,i));
v = v - R(i,j) * Q(:,i);
end
R(j,j) = norm(v);
Q(:,j) = v / R(j,j);
end
end
Technically no major vectorization was done; I've only precomputed intermediate results, and moved the computation outside the inner loop.
Based on a quick benchmark, this new version is definitely faster:
% some random data
>> M = 10000; N = 100;
>> A = complex(rand(M,N), rand(M,N));
>> w = rand(M,1);
% time
>> timeit(#() Gram_Schmidt(A,w), 2) % original version
ans =
1.2444
>> timeit(#() Gram_Schmidt1(A,w), 2) % first attempt (vectorized)
ans =
2.0990
>> timeit(#() Gram_Schmidt2(A,w), 2) % final version
ans =
0.4698
% check results
>> [Q,R] = Gram_Schmidt(A,w);
>> [Q2,R2] = Gram_Schmidt2(A,w);
>> norm(Q-Q2)
ans =
4.2796e-14
>> norm(R-R2)
ans =
1.7782e-12
EDIT:
Following the comments, we can rewrite the second solution to get rid of the if-statmenet, by moving that part to the end of the outer loop (i.e immediately after computing the new column Q(:,j), we compute and store the corresponding QQ(:,j)).
The function is identical in output, and timing is not that different either; the code is just a bit shorter!
function [Q, R] = Gram_Schmidt3(A, w)
[m, n] = size(A);
Q = zeros(m, n, 'like',A);
R = zeros(n, n, 'like',A);
QQ = zeros(m, n, 'like',A);
for j = 1:n
v = A(:,j);
for i = 1:j-1
R(i,j) = (v.' * QQ(:,i));
v = v - R(i,j) * Q(:,i);
end
R(j,j) = norm(v);
Q(:,j) = v / R(j,j);
QQ(:,j) = (conj(Q(:,j)) .* w) ./ (w.' * (Q(:,j).*conj(Q(:,j))));
end
end
Note that I used the zeros(..., 'like',A) syntax (new in recent MATLAB versions). This allows us to run the function unmodified on the GPU (assuming you have the Parallel Computing Toolbox):
% CPU
[Q3,R3] = Gram_Schmidt3(A, w);
vs.
% GPU
AA = gpuArray(A);
[Q3,R3] = Gram_Schmidt3(AA, w);
Unfortunately in my case, it wasn't any faster. In fact it was many times slower to run on the GPU than on the CPU, but it was worth a shot :)

There is a long discussion here, but, to jump to the answer. You have weighted the numerator and denominator of the R calculation by a vector w. The weighting occurs on the inner loop, and consist of a triple dot product, A dot Q dot w in the numerator, and Q dot Q dot w in the denominator. If you make one change, I think the code will run significantly faster. Write num = (A dot sqrt(w)) dot (Q dot sqrt(w)) and write den = (Q dot sqrt(w)) dot (Q dot sqrt(w)). That moves the (A dot sqrt(w)) and (Q dot sqrt(w)) product calculations out of the inner loop.
I would like to give an description of the formulation to Gram Schmidt Orthogonalization, that, hopefully, in addition to giving an alternate computational solution, gives further insight into the advantage of GSO.
The "goals" of GSO are two fold. First, to enable the solution of an equation like Ax=y, where A has far more rows than columns. This situation occurs frequently when measuring data, in that it is easy to measure more data than the number of states. The approach to the first goal is to rewrite A as QR such that the columns of Q are orthogonal and normalized, and R is a triangular matrix. The algorithm you provided, I believe, achieves the first goal. The Q represents the basis space of the A matrix, and R represents the amplitude of each basis space required to generate each column of A.
The second goal of GSO is to rank the basis vectors in order of significance. This the step that you have not done. And, while including this step, may increase the solution time, the results will identify which elements of x are important, according the data contained in the measurements represented by A.
But, I think, with this implementation, the solution is faster than the approach you presented.
Aij = Qij Rij where Qj are orthonormal and Rij is upper triangular, Ri,j>i=0. Qj are the orthogonal basis vectors for A, and Rij is the participation of each Qj to create a column in A. So,
A_j1 = Q_j1 * R_1,1
A_j2 = Q_j1 * R_1,2 + Q_j2 * R_2,2
A_j3 = Q_j1 * R_1,3 + Q_j2 * R_2,3 + Q_j3 * R_3,3
By inspection, you can write
A_j1 = ( A_j1 / | A_j1 | ) * | A_j1 | = Q_j1 * R_1,1
Then you project Q_j1 onto from every other column A to get the R_1,j elements
R_1,2 = Q_j1 dot Aj2
R_1,3 = Q_j1 dot Aj3
...
R_1,j(j>1) = A_j dot Q_j1
Then you subtract the elements of project of Q_j1 from the columns of A (this would set the first column to zero, so you can ignore the first column
for j = 2,n
A_j = A_j - R_1,j * Q_j1
end
Now one column from A has been removed, the first orthonormal basis vector, Q,j1, was determined, and the contribution of the first basis vector to each column, R_1,j has been determined, and the contribution of the first basis vector has been subtracted from each column. Repeat this process for the remaining columns of A to obtain the remaining columns of Q and rows of R.
for i = 1,n
R_ii = |A_i| A_i is the ith column of A, |A_i| is magnitude of A_i
Q_i = A_i / R_ii Q_i is the ith column of Q
for j = i, n
R_ij = | A_j dot Q_i |
A_j = A_j - R_ij * Q_i
end
end
You are trying to weight the rows of A, with w. Here is one approach. I would normalize w, and incorporate the effect into R. You "removed" the effects of w by multiply and dividing by w. An alternative to "removing" the effect is to normalize the amplitude of w to one.
w = w / | w |
for i = 1,n
R_ii = |A_i inner product w| # A_i inner product w = A_i .* w
Q_i = A_i / R_ii
for j = i, n
R_ij = | (A_i inner product w) dot Q_i | # A dot B = A' * B
A_j = A_j - R_ij * Q_i
end
end
Another approach to implementing w is to normalize w and then premultiply every column of A by w. That cleanly weights the rows of A, and reduces the number of multiplications.
Using the following may help in speeding up your code
A inner product B = A .* B
A dot w = A' w
(A B)' = B'A'
A' conj(A) = |A|^2
The above can be vectorized easily in matlab, pretty much as written.
But, you are missing the second portion of ranking of A, which tells you which states (elements of x in A x = y) are significantly represented in the data
The ranking procedure is easy to describe, but I'll let you work out the programming details. The above procedure essentially assumes the columns of A are in order of significance, and the first column is subtracted off all the remaining columns, then the 2nd column is subtracted off the remaining columns, etc. The first row of R represents the contribution of the first column of Q to each column of A. If you sum the absolute value of the first row of R contributions, you get a measurement of the contribution of the first column of Q to the matrix A. So, you just evaluate each column of A as the first (or next) column of Q, and determine the ranking score of the contribution of that Q column to the remaining columns of A. Then select the A column that has the highest rank as the next Q column. Coding this essentially comes down to pre estimating the next row of R, for every remaining column in A, in order to determine which ranked R magnitude has the largest amplitude. Having a index vector that represents the original column order of A will be beneficial. By ranking the basis vectors, you end up with the "principal" basis vectors that represent A, which is typically much smaller in number than the number of columns in A.
Also, if you rank the columns, it is not necessary to calculate every column of R. When you know which columns of A don't contain any useful information, there's no real benefit to keeping those columns.
In structural dynamics, one approach to reducing the number of degrees of freedom is to calculate the eigenvalues, assuming you have representative values for the mass and stiffness matrix. If you think about it, the above approach can be used to "calculate" the M and K (and C) matrices from measured response, and also identify the "measurement response shapes" that are significantly represented in the data. These are diffenrent, and potentially more important, than the mode shapes. So, you can solve very difficult problems, i.e., estimation of state matrices and number of degrees of freedom represented, from measured response, by the above approach. If you read up on N4SID, he did something similar, except he used SVD instead of GSO. I don't like the technical description for N4SID, too much focus on vector projection notation, which is simply a dot product.
There may be one or two errors in the above information, I'm writing this off the top of my head, before rushing off to work. So, check the algorithm / equations as you implement... Good Luck
Coming back to your question, of how to optimize the algorithm when you weight with w. Here is a basic GSO algorithm, without the sorting, written compatible with your function.
Note, the code below is in octave, not matlab. There are some minor differences.
function [Q, R] = Gram_Schmidt_2(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
# Outer loop identifies the basis vectors
for j = 1:n
aCol = A(:,j);
# Subtract off the basis vector
for i = 1:(j-1)
R(i,j) = ctranspose(Q(:,j)) * aCol;
aCol = aCol - R(i,j) * Q(:,j);
end
amp_A_col = norm(aCol);
R(j,j) = amp_A_col;
Q(:,j) = aCol / amp_A_col;
end
end
To get your algorithm, only change one line. But, you lose a lot of speed because "ctranspose(Q(:,j)) * aCol" is a vector operation but "sum( aCol .* conj( Q(:,i) ) .* w )" is a row operation.
function [Q, R] = Gram_Schmidt_2(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
# Outer loop identifies the basis vectors
for j = 1:n
aCol = A(:,j);
# Subtract off the basis vector
for i = 1:(j-1)
# R(i,j) = ctranspose(Q(:,j)) * aCol;
R(i,j) = sum( aCol .* conj( Q(:,i) ) .* w ) / ...
sum( Q(:,i) .* conj( Q(:,i) ) .* w );
aCol = aCol - R(i,j) * Q(:,j);
end
amp_A_col = norm(aCol);
R(j,j) = amp_A_col;
Q(:,j) = aCol / amp_A_col;
end
end
You can change it back to a vector operation by weighting aCol and Q by the sqrt of w.
function [Q, R] = Gram_Schmidt_3(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
Q_sw = complex(zeros(m, n));
sw = w .^ 0.5;
for j = 1:n
aCol = A(:,j);
aCol_sw = aCol .* sw;
# Subtract off the basis vector
for i = 1:(j-1)
# R(i,j) = ctranspose(Q(:,i)) * aCol;
numTerm = ctranspose( Q_sw(:,i) ) * aCol_sw;
denTerm = ctranspose( Q_sw(:,i) ) * Q_sw(:,i);
R(i,j) = numTerm / denTerm;
aCol_sw = aCol_sw - R(i,j) * Q_sw(:,i);
end
aCol = aCol_sw ./ sw;
amp_A_col = norm(aCol);
R(j,j) = amp_A_col;
Q(:,j) = aCol / amp_A_col;
Q_sw(:,j) = Q(:,j) .* sw;
end
end
As pointed out by JacobD, the above function does not run faster. Possibly it takes time to create the additional arrays. Another grouping strategy for the triple product is to group w with conj(Q). Hope this is faster...
function [Q, R] = Gram_Schmidt_4(A, w)
[m, n] = size(A);
Q = complex(zeros(m, n));
R = complex(zeros(n, n));
for j = 1:n
aCol = A(:,j);
for i = 1:(j-1)
cqw = conj(Q(:,i)) .* w;
R(i,j) = ( transpose( aCol ) * cqw) ...
/ (transpose( Q(:,i) ) * cqw);
aCol = aCol - R(i,j) * Q(:,i);
end
amp_A_col = norm(aCol);
R(j,j) = amp_A_col;
Q(:,j) = aCol / amp_A_col;
end
end
Here's a driver function to time different versions.
function Gram_Schmidt_tester_2
nSamples = 360000;
nMeas = 100;
nMeas = 15;
A = complex( rand(nSamples,nMeas), rand(nSamples,nMeas));
w = rand(nSamples, 1);
profile on;
[Q1, R1] = Gram_Schmidt_basic(A);
profile off;
data1 = profile ("info");
tData1=data1.FunctionTable(1).TotalTime;
approx_zero1 = A - Q1 * R1;
max_value1 = max(max(abs(approx_zero1)));
profile on;
[Q2, R2] = Gram_Schmidt_w_Orig(A, w);
profile off;
data2 = profile ("info");
tData2=data2.FunctionTable(1).TotalTime;
approx_zero2 = A - Q2 * R2;
max_value2 = max(max(abs(approx_zero2)));
sw=w.^0.5;
profile on;
[Q3, R3] = Gram_Schmidt_sqrt_w(A, w);
profile off;
data3 = profile ("info");
tData3=data3.FunctionTable(1).TotalTime;
approx_zero3 = A - Q3 * R3;
max_value3 = max(max(abs(approx_zero3)));
profile on;
[Q4, R4] = Gram_Schmidt_4(A, w);
profile off;
data4 = profile ("info");
tData4=data4.FunctionTable(1).TotalTime;
approx_zero4 = A - Q4 * R4;
max_value4 = max(max(abs(approx_zero4)));
profile on;
[Q5, R5] = Gram_Schmidt_5(A, w);
profile off;
data5 = profile ("info");
tData5=data5.FunctionTable(1).TotalTime;
approx_zero5 = A - Q5 * R5;
max_value5 = max(max(abs(approx_zero5)));
profile on;
[Q2a, R2a] = Gram_Schmidt2a(A, w);
profile off;
data2a = profile ("info");
tData2a=data2a.FunctionTable(1).TotalTime;
approx_zero2a = A - Q2a * R2a;
max_value2a = max(max(abs(approx_zero2a)));
profshow (data1, 6);
profshow (data2, 6);
profshow (data3, 6);
profshow (data4, 6);
profshow (data5, 6);
profshow (data2a, 6);
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data1.FunctionTable(1).FunctionName,
data1.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value1)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data2.FunctionTable(1).FunctionName,
data2.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value2)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data3.FunctionTable(1).FunctionName,
data3.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value3)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data4.FunctionTable(1).FunctionName,
data4.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value4)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data5.FunctionTable(1).FunctionName,
data5.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value5)
sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
data2a.FunctionTable(1).FunctionName,
data2a.FunctionTable(1).TotalTime,
nSamples, nMeas, max_value2a)
end
On my old home laptop, in Octave, the results are
ans = Time for Gram_Schmidt_basic is 0.889 sec with 360000 samples and 15 meas, max value is 1.57009e-16
ans = Time for Gram_Schmidt_w_Orig is 0.952 sec with 360000 samples and 15 meas, max value is 6.36717e-16
ans = Time for Gram_Schmidt_sqrt_w is 0.390 sec with 360000 samples and 15 meas, max value is 6.47366e-16
ans = Time for Gram_Schmidt_4 is 0.452 sec with 360000 samples and 15 meas, max value is 6.47366e-16
ans = Time for Gram_Schmidt_5 is 2.636 sec with 360000 samples and 15 meas, max value is 6.47366e-16
ans = Time for Gram_Schmidt2a is 0.905 sec with 360000 samples and 15 meas, max value is 6.68443e-16
These results indicate the fastest algorithm is the sqrt_w algorithm above at 0.39 sec, followed by the grouping of conj(Q) with w (above) at 0.452 sec, then version 2 of Amro solution at 0.905 sec, then the original algorithm in the question at 0.952, then a version 5 which interchanges rows / columns to see if row storage presented (code not included) at 2.636 sec. These results indicate the sqrt(w) split between A and Q is the fastest solution. But these results are not consistent with JacobD's comment about sqrt(w) not being faster.

It is possible to vectorize this so only one loop is necessary. The important fundamental change from the original algorithm is that if you swap the inner and outer loops you can vectorize the projection of the reference vector to all remaining vectors. Working off #Amro's solution, I found that an inner loop is actually faster than the matrix subtraction. I do not understand why this would be. Timing this against #Amro's solution, it is about 45% faster.
function [Q, R] = Gram_Schmidt5(A, w)
Q = A;
n_dimensions = size(A, 2);
R = zeros(n_dimensions);
R(1, 1) = norm(Q(:, 1));
Q(:, 1) = Q(:, 1) ./ R(1, 1);
for i = 2 : n_dimensions
Qw = (Q(:, i - 1) .* w)' * Q(:, (i - 1) : end);
R(i - 1, i : end) = Qw(2:end) / Qw(1);
%% Surprisingly this loop beats the matrix multiply
for j = i : n_dimensions
Q(:, j) = Q(:, j) - Q(:, i - 1) * R(i - 1, j);
end
%% This multiply is slower than above
% Q(:, i : end) = ...
% Q(:, i : end) - ...
% Q(:, i - 1) * R(i - 1, i : end);
R(i, i) = norm(Q(:,i));
Q(:, i) = Q(:, i) ./ R(i, i);
end

Matlab - Speeding up Nested For-Loops

I'm working on a function with three nested for loops that is way too slow for its intended use. The bottleneck is clearly the looping part - almost 100 % of the execution time is spent in the innermost loop.
The function takes a 2d matrix called rM as input and returns a 3d matrix called ec:
rows = size(rM, 1);
cols = size(rM, 2);
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols,
for r = 2:rows+1,
ec(r, c, risk) = ec(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
Any help on speeding up the for loops would be appreciated...

The problem is, that the inner loop is slowest, while it is also near-impossible to vectorize. As every iteration directly depends on the previous one.
The outer two are possible:
clc;
rM = rand(50);
rows = size(rM, 1);
cols = size(rM, 2);
minRisk = 1;
stepRisk = 1;
maxRisk = 100;
numRiskLevels = maxRisk/stepRisk;
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
riskArray = (minRisk:stepRisk:maxRisk)';
tic
for r = 2:rows+1
tmp = riskArray * rM(r-1, :);
tmp = permute(tmp, [3 2 1]);
ec(r, :, :) = ec(r-1, :, :) .* (1 + tmp);
end
toc
%preallocate.
ec2 = zeros(rows+1, cols, numRiskLevels);
ec2(1, :, :) = 100;
tic
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols
for r = 2:rows+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
toc
all(all(all(ec == ec2)))
But to my surprise, the vectorized code is indeed slower. (But maybe someone can improve the code, so I figured I leave it her for you.)

I have just tried to vectorize the outer loop, and actually noticed a significant speed increase. Of course it is hard to judge the speed of a script without knowing (the size of) the inputs but I would say this is a good starting point:
% Here you can change the input parameters
riskVec = 1:3:120;
rM = rand(50);
%preallocate and calculate non vectorized solution
ec2 = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec2(1, :, :) = 100;
tic
for risk = riskVec
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
t1=toc;
%preallocate and calculate vectorized solution
ec = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec(1, :, :) = 100;
tic
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec(r, c, riskVec) = ec(r-1, c, riskVec) .* reshape(1 + riskVec * rM(r-1, c),[1 1 length(riskVec)]);
end
end
t2=toc;
% Check whether the vectorization is done correctly and show the timing results
if ec(:) == ec2(:)
t1
t2
end
The given output is:
t1 =
0.1288
t2 =
0.0408
So for this riskVec and rM it is about 3 times as fast as the non-vectorized solution.

Python performance: iteration and operations on nested lists

Problem Hey folks. I'm looking for some advice on python performance. Some background on my problem:
Given:
A (x,y) mesh of nodes each with a value (0...255) starting at 0
A list of N input coordinates each at a specified location within the range (0...x, 0...y)
A value Z that defines the "neighborhood" in count of nodes
Increment the value of the node at the input coordinate and the node's neighbors. Neighbors beyond the mesh edge are ignored. (No wrapping)
BASE CASE: A mesh of size 1024x1024 nodes, with 400 input coordinates and a range Z of 75 nodes.
Processing should be O(x*y*Z*N). I expect x, y and Z to remain roughly around the values in the base case, but the number of input coordinates N could increase up to 100,000. My goal is to minimize processing time.
Current results Between my start and the comments below, we've got several implementations.
Running speed on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:
f1: 2.819s
f2: 1.567s
f3: 1.593s
f: 1.579s
f3b: 1.526s
f4: 0.978s
f1 is the initial naive implementation: three nested for loops.
f2 is replaces the inner for loop with a list comprehension.
f3 is based on Andrei's suggestion in the comments and replaces the outer for with map()
f is Chris's suggestion in the answers below
f3b is kriss's take on f3
f4 is Alex's contribution.
Code is included below for your perusal.
Question How can I further reduce the processing time? I'd prefer sub-1.0s for the test parameters.
Please, keep the recommendations to native Python. I know I can move to a third-party package such as numpy, but I'm trying to avoid any third party packages. Also, I've generated random input coordinates, and simplified the definition of the node value updates to keep our discussion simple. The specifics have to change slightly and are outside the scope of my question.
thanks much!
**`f1` is the initial naive implementation: three nested `for` loops.**
def f1(x,y,n,z):
rows = [[0]*x for i in xrange(y)]
for i in range(n):
inputX, inputY = (int(x*random.random()), int(y*random.random()))
topleft = (inputX - z, inputY - z)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
for j in xrange(max(0, topleft[1]), min(topleft[1]+(z*2), y)):
if rows[i][j] <= 255: rows[i][j] += 1
f2 is replaces the inner for loop with a list comprehension.
def f2(x,y,n,z):
rows = [[0]*x for i in xrange(y)]
for i in range(n):
inputX, inputY = (int(x*random.random()), int(y*random.random()))
topleft = (inputX - z, inputY - z)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE: f3 is based on Andrei's suggestion in the comments and replaces the outer for with map(). My first hack at this requires several out-of-local-scope lookups, specifically recommended against by Guido: local variable lookups are much faster than global or built-in variable lookups I hardcoded all but the reference to the main data structure itself to minimize that overhead.
rows = [[0]*x for i in xrange(y)]
def f3(x,y,n,z):
inputs = [(int(x*random.random()), int(y*random.random())) for i in range(n)]
rows = map(g, inputs)
def g(input):
inputX, inputY = input
topleft = (inputX - 75, inputY - 75)
for i in xrange(max(0, topleft[0]), min(topleft[0]+(75*2), 1024)):
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE3: ChristopeD also pointed out a couple improvements.
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
UPDATE4: kriss added a few improvements to f3, replacing min/max with the new ternary operator syntax.
def f3b(x,y,n,z):
rn = random.random
rows = [g1(x, y, z) for x, y in [(int(x*rn()), int(y*rn())) for i in xrange(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE5: Alex weighed in with his substantive revision, adding a separate map() operation to cap the values at 255 and removing all non-local-scope lookups. The perf differences are non-trivial.
def f4(x,y,n,z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
Also, since we all seem to be hacking around with variations, here's my test harness to compare speeds: (improved by ChristopheD)
def timing(f,x,y,z,n):
fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, z, n)
ctx = "from __main__ import %s" % f.__name__
results = timeit.Timer(fn, ctx).timeit(10)
return "%4.4s: %.3f" % (f.__name__, results / 10.0)
if __name__ == "__main__":
print timing(f, 1024, 1024, 400, 75)
#add more here.

On my (slow-ish;-) first-day Macbook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:
$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop
So, my machine is slower than yours by a factor of a bit more than 1.9.
The fastest code I have for this task is:
def f3(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
inc = (1).__add__
sat = (0xff).__and__
for i in range(n):
inputX, inputY = rr(x), rr(y)
b = max(0, inputX - z)
t = min(inputX + z, x)
l = max(0, inputY - z)
r = min(inputY + z, y)
for i in range(b, t):
rows[i][l:r] = map(inc, rows[i][l:r])
for i in range(x):
rows[i] = map(sat, rows[i])
which times as:
$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop
so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.
With a simple C-coded extensions, exte.c...:
#include "Python.h"
static PyObject*
dopoint(PyObject* self, PyObject* args)
{
int x, y, z, px, py;
int b, t, l, r;
int i, j;
PyObject* rows;
if(!PyArg_ParseTuple(args, "iiiiiO",
&x, &y, &z, &px, &py, &rows
))
return 0;
b = px - z;
if (b < 0) b = 0;
t = px + z;
if (t > x) t = x;
l = py - z;
if (l < 0) l = 0;
r = py + z;
if (r > y) r = y;
for(i = b; i < t; ++i) {
PyObject* row = PyList_GetItem(rows, i);
for(j = l; j < r; ++j) {
PyObject* pyitem = PyList_GetItem(row, j);
long item = PyInt_AsLong(pyitem);
if (item < 255) {
PyObject* newitem = PyInt_FromLong(item + 1);
PyList_SetItem(row, j, newitem);
}
}
}
Py_RETURN_NONE;
}
static PyMethodDef exteMethods[] = {
{"dopoint", dopoint, METH_VARARGS, "process a point"},
{0}
};
void
initexte()
{
Py_InitModule("exte", exteMethods);
}
(note: I haven't checked it carefully -- I think it doesn't leak memory due to the correct interplay of reference stealing and borrowing, but it should be code inspected very carefully before being put in production;-), we could do
import exte
def f4(x=x,y=y,n=n,z=z):
rows = [[0]*y for i in range(x)]
rr = random.randrange
for i in range(n):
inputX, inputY = rr(x), rr(y)
exte.dopoint(x, y, z, inputX, inputY, rows)
and the timing
$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop
shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).

1. A (smaller) speedup could definitely be the initialization of your rows...
Replace
rows = []
for i in range(x):
rows.append([0 for i in xrange(y)])
with
rows = [[0] * y for i in xrange(x)]
2. You can also avoid some lookups by moving random.random out of the loops (saves a little).
3. EDIT: after corrections -- you could arrive at something like this:
def f(x,y,n,z):
rows = [[0] * y for i in xrange(x)]
rn = random.random
for i in xrange(n):
topleft = (int(x*rn()) - z, int(y*rn()) - z)
l = max(0, topleft[1])
r = min(topleft[1]+(z*2), y)
for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:
import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f3").timeit(10)
f1 21.1669280529
f2 12.9376120567
f 11.1249599457

in your f3 rewrite, g can be simplified. (Can also be applied to f4)
You have the following code inside a for loop.
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.

Based on your f3 version I played with the code. As l and r are constants you can avoid to compute them in g1 loop. Also using new ternary if instead of min and max seems to be consistently faster. Also simplified expression with topleft. On my system it appears to be about 20% faster using with the code below.
def f3b(x,y,n,z):
rows = [g1(x, y, z) for x, y in [(int(x*random.random()), int(y*random.random())) for i in range(n)]]
def g1(x, y, z):
l = y - z if y - z > 0 else 0
r = y + z if y + z < 1024 else 1024
for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]

You can create your own Python module in C, and control the performance as you want:
http://docs.python.org/extending/

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Tricks to improve the performance of a cunstom function in Julia - performance

Related

Monte Carlo program throws a method error in Julia

How to reduce the allocations in Julia?

MATLAB Optimisation of Weighted Gram-Schmidt Orthogonalisation

Matlab - Speeding up Nested For-Loops

Python performance: iteration and operations on nested lists

Categories

Resources