In Julia, I would like to randomly generate an array of arbitrary size, where all the elements of the array are complex numbers with absolute value one. Is there perhaps any way to do this within Julia?
I've got four options so far:
f1(n) = exp.((2*im*π).*rand(n))
f2(n) = map(x -> (z = x[1] + im*x[2]; z / abs(z)),
            eachcol(randn(2, n)))
f3(n) = [im*x[1]+x[2] for x in sincos.(2π*rand(n))]
f4(n) = cispi.(2 .*rand(n))
We have:
julia> using BenchmarkTools
julia> begin
           @btime f1(1_000);
           @btime f2(1_000);
           @btime f3(1_000);
           @btime f4(1_000);
       end;
29.390 μs (2 allocations: 23.69 KiB)
15.559 μs (2 allocations: 31.50 KiB)
25.733 μs (4 allocations: 47.38 KiB)
27.662 μs (2 allocations: 23.69 KiB)
Not a crucial difference.
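As a quick sanity check that any of these produce unit-modulus values (a sketch; the random values themselves differ run to run):
julia> z = f4(10);
julia> all(x -> abs(x) ≈ 1, z)
true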
One way is:
randcomplex() = (c = Complex(randn(2)...); c / abs(c))  # randn (not rand) so the angle covers the full circle
randcomplex(numwanted) = [randcomplex() for _ in 1:numwanted]
or
randcomplex(dims...) = (a = zeros(ComplexF64, dims...); for i in eachindex(a) a[i] = randcomplex() end; a)
If you are looking for something faster, here are two options. They return a perhaps slightly unfamiliar type, but it is equivalent to a regular Vector.
function f5(n)
    r = rand(2, n)
    for i in 1:n
        a = sqrt(r[1, i]^2 + r[2, i]^2)
        r[1, i] /= a
        r[2, i] /= a
    end
    return reinterpret(reshape, ComplexF64, r)
end
using LoopVectorization: @turbo
function f5t(n)
    r = rand(2, n)
    @turbo for i in 1:n
        a = sqrt(r[1, i]^2 + r[2, i]^2)
        r[1, i] /= a
        r[2, i] /= a
    end
    return reinterpret(reshape, ComplexF64, r)
end
julia> @btime f5(1000);
4.186 μs (1 allocation: 15.75 KiB)
julia> @btime f5t(1000);
2.900 μs (1 allocation: 15.75 KiB)
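If a plain Vector{ComplexF64} is ever required downstream, the reinterpreted result can be materialized with collect at the cost of one extra copy (a quick sketch):
julia> v = f5(4);
julia> v isa AbstractVector{ComplexF64}
true
julia> collect(v) isa Vector{ComplexF64}
true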
The matrix Y is defined as
Y = cumsum(cumsum(X,dims=1), dims=2)
For example,
julia> X = [1 4 2 3; 2 4 5 2; 4 3 4 1; 2 5 4 2];
julia> Y = cumsum(cumsum(X,dims=1), dims=2)
4×4 Matrix{Int64}:
 1   5   7  10
 3  11  18  23
 7  18  29  35
 9  25  40  48
I want to reproduce the matrix X from Y. It seems that the function diff is helpful. However, as you can see below, it cannot reproduce the first row and first column of X.
julia> diff(diff(Y, dims=1), dims=2)
3×3 Matrix{Int64}:
 4  5  2
 3  4  1
 5  4  2
So, I concatenate zeros. Then, it works.
julia> Y00 = vcat(zeros(Int, 5)', hcat(zeros(Int, 4), Y))
5×5 Matrix{Int64}:
 0  0   0   0   0
 0  1   5   7  10
 0  3  11  18  23
 0  7  18  29  35
 0  9  25  40  48
julia> diff(diff(Y00, dims=1), dims=2)
4×4 Matrix{Int64}:
 1   5   7  10
 3  11  18  23
 7  18  29  35
 9  25  40  48
But I think concatenating takes time and memory.
Is there a better way to reproduce X from Y?
Context
I want to extend the above matrices X and Y to arrays of any dimension. For example, I want to reconstruct a three-dimensional array X from a given three-dimensional array
Y = cumsum( cumsum( cumsum(X, dims=1), dims=2), dims=3)
When both speed and succinctness are required, it's hard to beat powerful Julia packages like Tullio.jl. Here is a one-liner that's about 4X faster than the fastest solution by @DanGetz.
using Tullio
cumdiff(Y) = @tullio X[i,j] = Y[i,j] - Y[i,j-1] - Y[i-1,j] + Y[i-1,j-1]
Benchmarking with a 100-by-100 matrix gives:
X = rand(0:100,100,100)
Y = cumsum(cumsum(X,dims=1), dims=2)
@btime cumdiff($Y)
@btime decumsum3($Y)
4.957 μs (17 allocations: 464 bytes)
21.300 μs (2 allocations: 78.17 KiB)
Fix: The code above was using the predefined X instead of creating a new one. This is fixed below, and the speedup is more like 3.5X and not 4X.
function cumdiff(Y)
    X = similar(Y)
    X[1] = Y[1]
    for i = 2:size(Y,1) X[i,1] = Y[i,1] - Y[i-1,1] end
    for j = 2:size(Y,2) X[1,j] = Y[1,j] - Y[1,j-1] end
    @tullio X[i,j] = Y[i,j] - Y[i,j-1] - Y[i-1,j] + Y[i-1,j-1]
end
@btime cumdiff($Y)
@btime decumsum3($Y)
6.000 μs (4 allocations: 78.23 KiB)
21.300 μs (2 allocations: 78.17 KiB)
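A quick correctness check of the fixed version against the X and Y defined above (a sketch):
julia> cumdiff(Y) == X
true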
See EDIT section below.
Some options so far:
decumsum1(X) = begin
    Z = copy(X)
    Z[2:end,:] .-= Z[1:end-1,:]
    Z[:,2:end] .-= Z[:,1:end-1]
    return Z
end
decumsum2(X) = begin  # from the question (zero-padding dimensions fixed for non-square X)
    r, c = size(X)
    Z = vcat(zeros(eltype(X), c+1)',
             hcat(zeros(eltype(X), r), X))
    return diff(diff(Z, dims=1), dims=2)
end
decumsum3(Y) = [Y[I] - (I[2]==1 ? 0 : Y[I[1],I[2]-1]) -
                (I[1]==1 ? 0 : Y[I[1]-1,I[2]]) +
                ((I[1]==1 || I[2]==1) ? 0 : Y[I[1]-1,I[2]-1])
                for I in CartesianIndices(Y)]
function decumsum5(Y)
    R = similar(Y)
    h, w = size(Y)
    R[1,1] = Y[1,1]
    @inbounds for i = 2:h R[i,1] = Y[i,1]-Y[i-1,1] end
    @inbounds for j = 2:w R[1,j] = Y[1,j]-Y[1,j-1] end
    @inbounds for i = 2:h, j = 2:w R[i,j] = Y[i,j]-Y[i-1,j]-Y[i,j-1]+Y[i-1,j-1] end
    return R
end
Giving the following benchmarks:
julia> using BenchmarkTools
julia> decumsum1(Y) == decumsum2(Y) == decumsum3(Y) == X
true
julia> @btime decumsum1($Y);
352.571 ns (5 allocations: 832 bytes)
julia> @btime decumsum2($Y);
475.438 ns (9 allocations: 1.14 KiB)
julia> @btime decumsum3($Y);
96.875 ns (1 allocation: 192 bytes)
julia> @btime decumsum5($Y);
60.805 ns (1 allocation: 192 bytes)
EDIT: Perhaps the prettiest solution is:
decumsum(Y; dims) = [Y[I] - (
        I[dims] == 1 ? 0 : Y[(ifelse(k == dims, I[k]-1, I[k])
                              for k in 1:ndims(Y))...]
    ) for I in CartesianIndices(Y)]
and with it, the cumsum can be walked back:
julia> decumsum(decumsum(Y, dims=1), dims=2)
4×4 Matrix{Int64}:
 1  4  2  3
 2  4  5  2
 4  3  4  1
 2  5  4  2
julia> decumsum(decumsum(Y, dims=1), dims=2) == X
true
julia> @btime decumsum(decumsum($Y, dims=1), dims=2);
165.656 ns (2 allocations: 384 bytes)
with nice performance, and it also generalizes to arrays of any dimension.
Update: another version, decumsum5, was added above. It is the fastest so far.
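In particular, the three-dimensional case from the question's Context section can be walked back the same way (a quick sketch using the decumsum just defined):
julia> X3 = rand(0:9, 3, 4, 5);
julia> Y3 = cumsum(cumsum(cumsum(X3, dims=1), dims=2), dims=3);
julia> decumsum(decumsum(decumsum(Y3, dims=1), dims=2), dims=3) == X3
true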
I am starting to use Julia mainly because of its speed. Currently, I am solving a fixed point problem. Although the current version of my code runs fast, I would like to know some methods to improve its speed.
First of all, let me summarize the algorithm.
There is an initial seed called C0 that maps from the space (b,y) into an action space c, so we have C0(b,y).
There is a formula that generates a rule Ct from C0.
Then, using an additional restriction, I can obtain an update of b [let's call it bt]. Thus, it generates a rule Ct(bt,y).
I need to interpolate the previous rule to move from the grid bt back onto the original grid b. This gives me an update for C0 [let's call it C1].
I iterate until the distance between C1 and C0 is below a convergence threshold.
To implement it I created two structures:
struct Parm
    lC::Array{Float64, 2}   # lower limit
    uC::Array{Float64, 2}   # upper limit
    γ::Float64              # CRRA coefficient
    δ::Float64              # factor in the Euler equation
    γ1::Float64             # 1/γ (see the setup code below)
    r1::Float64             # inverse of the gross interest rate
    yb1::Array{Float64, 2}  # y - b(t+1)
    P::Array{Float64, 2}    # transpose of the transition matrix
end
mutable struct Upd1
    pol::Array{Float64,2}  # policy function
    b::Array{Float64, 1}   # exogenous grid for interpolation
    dif::Float64           # updating difference
end
The first one is a set of parameters while the second one stores the decision rule C1. I also define some functions:
function eulerm(x::Upd1, p::Parm)
    ct = p.δ*(x.pol.^(-p.γ)*p.P).^(-p.γ1);  # Euler equation
    bt = p.r1.*(ct .+ p.yb1);               # endogenous grid for bonds
    return ct, bt
end
function interp0!(bt::Array{Float64}, ct::Array{Float64}, x::Upd1, p::Parm)
    polold = x.pol;
    polnew = similar(x.pol);
    @inbounds @simd for col in 1:size(bt,2)
        F1 = LinearInterpolation(bt[:,col], ct[:,col], extrapolation_bc=Line());
        polnew[:,col] = F1(x.b);
    end
    polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC];
    polnew[polnew .> p.uC] .= p.uC[polnew .> p.uC];
    dif = maximum(abs.(polnew - polold));
    return polnew, dif
end
function updating!(x::Upd1, p::Parm)
    ct, bt = eulerm(x, p);  # endogenous grid
    x.pol, x.dif = interp0!(bt, ct, x, p);
end
function conver(x::Upd1, p::Parm)
    while x.dif > 1e-8
        updating!(x, p);
    end
end
The first function implements steps 2 and 3. The third one performs the update (the last part of step 4), and the last one iterates until convergence (step 5).
The most important function is the second one, which does the interpolation. Running @time and @btime, I realized that the largest number of allocations occurs in the loop inside this function. I tried to reduce them by skipping polnew and writing directly to x.pol, but then the results are not correct: it converges after only two iterations (I think Julia treats polold as the very same array as x.pol, so both get updated at the same time).
Any advice is well received.
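To illustrate the aliasing issue suspected above (a minimal sketch with throwaway names, not the model code):
julia> x = [1.0, 2.0];
julia> old = x;        # no copy: old and x are the same array
julia> x[1] = 99.0;
julia> old[1]          # "old" sees the change
99.0
julia> old = copy(x);  # an explicit copy breaks the aliasing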
To anyone that wants to run it by themselves, I add the rest of the required code:
using Interpolations  # provides LinearInterpolation and Line used above
using LinearAlgebra   # provides kron used below

function rouwen(ρ::Float64, σ2::Float64, N::Int64)
    if (N % 2 != 1)
        error("N should be an odd number")
    end
    sigz = sqrt(σ2/(1-ρ^2));
    zn = sigz*sqrt(N-1);
    z = range(-zn, zn, N);
    p = (1+ρ)/2;
    q = p;
    Rho = [p 1-p; 1-q q];
    for i = 3:N
        zz = zeros(i-1, 1);
        Rho = p*[Rho zz; zz' 0] + (1-p)*[zz Rho; 0 zz'] + (1-q)*[zz' 0; Rho zz] + q*[0 zz'; zz Rho];
        Rho[2:end-1,:] = Rho[2:end-1,:]/2;
    end
    return z, Rho;
end
#############################################################
# Parameters of the model
############################################################
lb = 0; ub = 1000; pivb = 0.25; nb = 500;
ρ = 0.988; σz = 0.0439; μz =-σz/2; nz = 7;
ϕ = 0.0; σe = 0.6376; μe =-σe/2; ne = 7;
β = 0.98; r = 1/400; γ = 1;
b = exp10.(range(start=log10(lb+pivb), stop=log10(ub+pivb), length=nb)) .- pivb;
#=========================================================
Algorithm
=========================================================#
(z,Pz) = rouwen(ρ,σz, nz);
μZ = μz/(1-ρ);
z = z .+ μZ;
(ee,Pe) = rouwen(ϕ,σe,ne);
ee = ee .+ μe;
y = exp.(vec((z .+ ee')'));
P = kron(Pz,Pe);
R = 1 + r;
r1 = R^(-1);
γ1 = 1/γ;
δ = (β*R)^(-γ1);
m = R*b .+ y';
lC = max.(m .- ub,0);
uC = m .- lb;
by1 = b .- y';
# initial guess for C0
c0 = 0.1*(m);
# Set of parameters
pp = Parm(lC,uC,γ,δ,γ1,r1,by1,P');
# Container of results
up1 = Upd1(c0,b,1);
# Fixed point problem
conver(up1,pp)
UPDATE: As recommended, I made the following changes to the interpolation function interp0!:
function interp0!(bt::Array{Float64}, ct::Array{Float64}, x::Upd1, p::Parm)
    polold = x.pol;
    polnew = similar(x.pol);
    @inbounds for col in 1:size(bt,2)
        F1 = LinearInterpolation(@view(bt[:,col]), @view(ct[:,col]), extrapolation_bc=Line());
        polnew[:,col] = F1(x.b);
    end
    for j in eachindex(polnew)
        polnew[j] < p.lC[j] ? polnew[j] = p.lC[j] : nothing
        polnew[j] > p.uC[j] ? polnew[j] = p.uC[j] : nothing
    end
    dif = maximum(abs.(polnew - polold));
    return polnew, dif
end
This improves the speed (from ~1.5 to ~1.3 seconds) and reduces the number of allocations. Some things I noted:
Changing polnew[:,col] = F1(x.b) to polnew[:,col] .= F1(x.b) reduces the total allocations, but the time gets slower. Why is that?
How should I understand the difference between @time and @btime? For this case, I have:
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
1.338042 seconds (385.72 k allocations: 1.157 GiB, 3.37% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
To be precise, in both cases I ran it several times and chose representative numbers for each line.
Does this mean that all the time is due to allocations during compilation?
Start by going through the "performance tips" as advised by @DNF, but below you will find the most important comments for your code.
Vectorize vector assignments - a small dot makes a big difference:
julia> a = rand(3,4);
julia> @btime $a[3,:] = $a[3,:] ./ 2;
40.726 ns (2 allocations: 192 bytes)
julia> @btime $a[3,:] .= $a[3,:] ./ 2;
20.562 ns (1 allocation: 96 bytes)
Use views when doing something with subarrays:
julia> @btime sum($a[3,:]);
18.719 ns (1 allocation: 96 bytes)
julia> @btime sum(@view($a[3,:]));
5.600 ns (0 allocations: 0 bytes)
Your code around the line polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC]; will make far fewer allocations when you do it with a for loop over each element of polnew (see the sketch after this list).
@simd will have no effect on loops containing conditionals (point 3), nor when the code calls complex external functions.
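For illustration, a sketch of the loop form of point 3 (using Base's clamp, which combines both bound checks; polnew, p.lC and p.uC as in the question):
for j in eachindex(polnew)
    polnew[j] = clamp(polnew[j], p.lC[j], p.uC[j])  # enforce lC[j] ≤ polnew[j] ≤ uC[j]
end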
I want to give an update on this problem. I made two main changes to my code: (i) I defined my own linear interpolation function, and (ii) I moved the bounds check into the interpolation.
With this, the new interp0! is:
function interp0!(bt::Array{Float64}, ct::Array{Float64}, x::Upd1, p::Parm)
    polold = x.pol;
    polnew = similar(x.pol);
    @inbounds @views for col in 1:size(bt,2)
        polnew[:,col] = myint(bt[:,col], ct[:,col], x.b[:], p.lC[:,col], p.uC[:,col]);
    end
    dif = maximum(abs.(polnew - polold));
    return polnew, dif
end
And the interpolation is now:
function myint(x0, y0, x1, ly, uy)
    y1 = similar(x1);
    n = size(x0, 1);
    j = 1;  # x0 and x1 are both sorted, so j only ever moves forward
    @simd for i in eachindex(x1)
        while (j <= n) && (x1[i] > x0[j])  # advance to the first knot at or above x1[i]
            j += 1;
        end
        if j == 1        # below the grid: extrapolate from the first segment
            y1[i] = y0[1] + ((y0[2]-y0[1])/(x0[2]-x0[1]))*(x1[i]-x0[1]);
        elseif j == n+1  # above the grid: extrapolate from the last segment
            y1[i] = y0[n] + ((y0[n]-y0[n-1])/(x0[n]-x0[n-1]))*(x1[i]-x0[n]);
        else             # interior: standard linear interpolation
            y1[i] = y0[j-1] + ((x1[i]-x0[j-1])/(x0[j]-x0[j-1]))*(y0[j]-y0[j-1]);
        end
        y1[i] > uy[i] ? y1[i] = uy[i] : nothing;  # enforce the upper bound uC
        y1[i] < ly[i] ? y1[i] = ly[i] : nothing;  # enforce the lower bound lC
    end
    return y1;
end
As you can see, I take advantage of (and assume) the fact that both grids used as a basis are sorted, while the last two lines in the outer loop enforce the bounds imposed by lC and uC.
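A side note on the design (an alternative, not the author's code): because x1 is traversed in order, carrying j across iterations makes the whole scan linear in the two grid lengths. If x1 were not sorted, Base's searchsortedfirst would be the natural drop-in for the inner while loop:
j = searchsortedfirst(x0, x1[i])  # first index j with x0[j] >= x1[i]; returns n+1 if none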
With these changes, I get the following total time:
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
0.734630 seconds (28.93 k allocations: 752.214 MiB, 3.82% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
which is almost twice as fast, with ~8% of the original allocations. The use of views in the loop of interp0! also helps a lot.
I have something like this (simple example):
using BenchmarkTools
function assign()
    e = zeros(100, 90000)
    e2 = ones(100) * 0.16
    e[:, 100:end] .= e2[:]
end
@benchmark assign()
and need to do this for thousands of time steps. This gives:
BenchmarkTools.Trial:
memory estimate: 68.67 MiB
allocs estimate: 6
--------------
minimum time: 16.080 ms (0.00% GC)
median time: 27.811 ms (0.00% GC)
mean time: 31.822 ms (12.31% GC)
maximum time: 43.439 ms (27.66% GC)
--------------
samples: 158
evals/sample: 1
Is there a faster way of doing this?
First of all, I will assume that you meant
function assign1()
    e = zeros(100, 90000)
    e2 = ones(100) * 0.16
    e[:, 100:end] .= e2[:]
    return e  # <- important!
end
Since otherwise you will not return the first 99 columns of e(!):
julia> size(assign())
(100, 89901)
Secondly, don't do this:
e[:, 100:end] .= e2[:]
e2[:] makes a copy of e2 and assigns that, but why? Just assign e2 directly:
e[:, 100:end] .= e2
Ok, but let's try a few different versions. Notice that there is no need to make e2 a vector, just assign a scalar:
function assign2()
    e = zeros(100, 90000)
    e[:, 100:end] .= 0.16  # just broadcast a scalar!
    return e
end
function assign3()
    e = fill(0.16, 100, 90000)  # use fill instead of writing all those zeros you will throw away
    e[:, 1:99] .= 0
    return e
end
function assign4()
    # only write exactly the values you need!
    e = Matrix{Float64}(undef, 100, 90000)
    e[:, 1:99] .= 0
    e[:, 100:end] .= 0.16
    return e
end
Time to benchmark
julia> @btime assign1();
14.550 ms (5 allocations: 68.67 MiB)
julia> @btime assign2();
14.481 ms (2 allocations: 68.66 MiB)
julia> @btime assign3();
9.636 ms (2 allocations: 68.66 MiB)
julia> @btime assign4();
10.062 ms (2 allocations: 68.66 MiB)
Versions 1 and 2 are equally fast; version 2 makes 2 allocations instead of 5, but of course the big allocation dominates.
Versions 3 and 4 are faster, though not dramatically so, because they avoid duplicate work such as writing values into the matrix twice. Version 3 is the fastest here, but not by much, and this changes if the split is more balanced, in which case version 4 wins:
function assign3_()
    e = fill(0.16, 100, 90000)
    e[:, 1:44999] .= 0
    return e
end
function assign4_()
    e = Matrix{Float64}(undef, 100, 90000)
    e[:, 1:44999] .= 0
    e[:, 45000:end] .= 0.16
    return e
end
julia> @btime assign3_();
11.576 ms (2 allocations: 68.66 MiB)
julia> @btime assign4_();
8.658 ms (2 allocations: 68.66 MiB)
The lesson is to avoid doing unnecessary work.
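As a final sanity check that all the variants produce the same matrix (a sketch):
julia> assign1() == assign2() == assign3() == assign4()
true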
I want to use shared memory multi-threading in Julia. As done by the Threads.@threads macro, I can use ccall(:jl_threading_run ...) to do this. And whilst my code now runs in parallel, I don't get the speedup I expected.
The following code is intended as a minimal example of the approach I'm taking and the performance problem I'm having: [EDIT: See later for even more minimal example]
nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
    work_idx = 1
    my_result = results[Threads.threadid()]
    while work_idx > 0
        my_result += objects[work_idx]
        work_idx += nthreads
        if work_idx > test_size
            break
        end
        counts[Threads.threadid()] += 1
    end
end
# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)
# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))
On an i7-7700, a typical single threaded result is:
STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)
COUNTS:
Per thread: [999999.0]
Sum: 999999.0
And with 4 threads:
STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)
COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0
Multi-threading slows things down! Why?
EDIT: A better minimal example can be created with the @threads macro itself.
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
    a[Threads.threadid()] += b[i]
    calls[Threads.threadid()] += 1
end
I falsely assumed that the @threads macro's inclusion in Julia would mean that there was a benefit to be had.
The problem you have is most probably false sharing: the per-thread counters in a and calls are adjacent in memory, so they share CPU cache lines (typically 64 bytes), and every write by one thread invalidates that cache line for all the others.
You can solve it by separating the areas you write to far enough apart, like this (a "quick and dirty" implementation to show the essence of the change):
julia> function f(spacing)
           test_size = 1000000
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end
f (generic function with 1 method)
julia> @btime f(1);
41.525 ms (35 allocations: 7.63 MiB)
julia> @btime f(8);
2.189 ms (35 allocations: 7.63 MiB)
or by doing per-thread accumulation in a local variable, like this (the preferred approach, as it should be uniformly faster):
function getrange(n)
    tid = Threads.threadid()
    nt = Threads.nthreads()
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end
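# For example, with 4 threads and n = 10: divrem(10, 4) == (2, 2), and the
# four threads get the balanced contiguous chunks 1:3, 4:6, 7:8 and 9:10.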
function f()
    test_size = 10^8
    a = zeros(Threads.nthreads())
    b = rand(test_size)
    calls = zeros(Threads.nthreads())
    Threads.@threads for k = 1 : Threads.nthreads()
        local_a = 0.0
        local_c = 0.0
        for i in getrange(test_size)
            for j in 1:10  # repeat the work to make each element's cost non-trivial
                local_a += b[i]
                local_c += 1
            end
        end
        a[Threads.threadid()] = local_a
        calls[Threads.threadid()] = local_c
    end
    a, calls
end
Also note that if you run more threads than you have physical cores (e.g., by counting hyperthreaded logical cores), the gains from threading will not be linear.
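One last usage note: the thread count is fixed at Julia startup, so timings like the ones above assume the session was launched with multiple threads, e.g. with the JULIA_NUM_THREADS environment variable set or (on recent Julia versions) via:
julia --threads 4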