Why does the allocated memory increase with the number of iterations of calling this lambda in sum()? - performance

I'm just trying to experiment with Julia, so I wrote this code:
function f()
i::Int = 0
return () -> i += 1
end
#time (h = f(); sum(h() for _ in 1:1000000))
#time (h = f(); sum(h() for _ in 1:10000000))
When I ran it I obtain
0.178856 seconds (3.07 M allocations: 49.691 MiB, 4.24% gc time, 33.87% compilation time)
500000500000
1.128918 seconds (30.07 M allocations: 461.536 MiB, 4.79% gc time, 7.03% compilation time)
50000005000000
We see that it requires a lot of memory, that grows as the size of the range increases. But I don't understand why, I'd expect it to require a constant amount of memory, basically the memory to store the lambda with its captured variable (and the variables required to compute the sum).
So the question is: why does the required memory grow with the size of the range?
(I know that I can obtain the same result simply with sum(1:1000000) or with other approaches, but this is not what I am interested in, here I'm just trying to understand this specific example)

There are two issues in your function: boxing and binding in global scope. Here is an example how both can be resolved:
julia> function f()
i = Ref(0)
return () -> i[] += 1
end
f (generic function with 1 method)
julia> #time let h = f()
sum(h() for _ in 1:1000000)
end
0.027134 seconds (113.72 k allocations: 6.053 MiB, 98.97% compilation time)
500000500000
julia> #time let h = f()
sum(h() for _ in 1:10000000)
end
0.024634 seconds (109.76 k allocations: 5.821 MiB, 96.81% compilation time)
50000005000000
julia> #time let h = f()
sum(h() for _ in 1:100000000)
end
0.031428 seconds (109.76 k allocations: 5.826 MiB, 81.02% compilation time)
5000000050000000
julia> #time let h = f()
sum(h() for _ in 1:1000000000)
end
0.088952 seconds (109.77 k allocations: 5.824 MiB, 8.99% gc time, 36.63% compilation time)
500000000500000000

because the closure over i is boxed, and accessing a boxed variable requires allocation, and thus it increase linearly as the number of time you access it.
The memory allocation is showing the total amount of memory allocated, not the peak memory.
and you can see the type instability:
julia> #code_warntype h()
MethodInstance for (::var"#3#4")()
from (::var"#3#4")() # Main REPL[1]:3
Arguments
#self#::var"#3#4"
Locals
i::Union{}
Body::Int64
1 ─ %1 = Core.getfield(#self#, :i)::Core.Box
│ %2 = Core.isdefined(%1, :contents)::Bool
└── goto #3 if not %2
2 ─ goto #4
3 ─ Core.NewvarNode(:(i))
└── i
4 ┄ %7 = Core.getfield(%1, :contents)::Any
│ %8 = Core.typeassert(%7, Main.Int)::Int64
│ %9 = (%8 + 1)::Int64
│ %10 = Core.getfield(#self#, :i)::Core.Box
│ %11 = Base.convert(Main.Int, %9)::Int64
│ %12 = Core.typeassert(%11, Main.Int)::Int64
│ Core.setfield!(%10, :contents, %12)
└── return %9

Related

Performance issues with parallel Julia code

I am working on a scientific code that is experiencing issues with parallelization.
The parallel version is slower than the serial one and I am not sure if the right approaches are used for this application.
How can I improve the performance of the parallel calculation?
Is the right approach being used or should other packages / functions be considered for parallelization?
I have already tried a larger workload, however this makes no difference.
I suspect the problem is somehow due to data movement between workers, but I don't know how to check or improve this one.
Parallel programming with Julia is still relatively new for me, so I am very grateful for any help!
The simulation code is something of a benchmark for the Julia programming language, as our team is considering using Julia for all future projects if strong performance advantages to the current workflow can be demonstrated.
Because of this, I would like to maximize performance, also since calculations with very large models as well as possible use on a cluster are planned.
Minimum Working Example
The critical parts of the code can be broken down to the following example.
I start the process as follows:
using Distributed
addprocs();
#everywhere using SharedArrays, LinearAlgebra, Test
First I define the simulation model, containing all data used for the calculations.
Is it actually okay to store SharedArrays with other data in a struct or should a different approach be used?
#everywhere struct Model
idx::Vector{Tuple{Int,Int}} # indices
A::SharedMatrix{Float64} # results, will be constantly updated
B::Vector{Float64} # part of pre-processing, will only be read
end
See the non-parallel version of the function used for the update of the model below.
function update(m::Model, factor::Float64)
L::Float64 = 0.
k::Float64 = 0.
cnt::Int = 0
for (i,j) in m.idx
cnt+=1
L = norm(m.A[:,i]-m.A[:,j])
k = factor * m.B[cnt]
m.A[:,i] .+= k*L
m.A[:,j] .-= k*L
end
end
For parallelization, I simply tried the following. Is perhaps an approach with pmap better in this case?
#everywhere function parallel_update(m::Model, factor::Float64)
L::Float64 = 0.
k::Float64 = 0.
cnt::Int = 0
#sync #distributed for (i,j) in m.idx
cnt+=1
L = norm(m.A[:,i]-m.A[:,j])
k = factor * m.B[cnt]
m.A[:,i] .+= k*L
m.A[:,j] .-= k*L
end
end
To test the results I use the following function:
#everywhere function test_my_code()
# provide some data
n = 10000000
idx = [(rand(1:n),rand(1:n)) for k=1:n]
A = SharedArray(hcat(([rand(0.:1000.);rand(0.:1000.);rand(0.:1000.)] for k=1:n)...))
B = [rand(0.:1000.) for k=1:n]
# define models
model1 = Model(idx,A,B)
model2 = Model(idx,A,B)
# test and compare results
#time update(model1,2.)
#time parallel_update(model2,2.)
#test model1 == model2
end
julia> test_my_code() # first run
6.350694 seconds (50.00 M allocations: 5.215 GiB, 13.66% gc time)
11.422999 seconds (6.69 k allocations: 446.156 KiB)
Test Passed
julia> test_my_code() # second run
6.286828 seconds (50.00 M allocations: 5.215 GiB, 18.35% gc time)
6.297144 seconds (2.92 k allocations: 143.516 KiB)
Test Passed
Note: significant performance improvements for the serial code
I was already able to significantly improve the performance of the serial function and reduce the number of allocations to zero.
Since this seems to make no difference to the parallelization problem, I used the shorter, easier-to-read version for the previous example.
See the serial code below.
using LinearAlgebra, Test
struct Model
idx::Vector{Tuple{Int,Int}}
A::Matrix{Float64}
B::Vector{Float64}
end
function update(m::Model, factor::Float64)
L::Float64 = 0.
k::Float64 = 0.
cnt::Int = 0
for (i,j) in m.idx
cnt+=1
L = norm(m.A[:,i]-m.A[:,j])
k = factor * m.B[cnt]
m.A[:,i] .+= k*L
m.A[:,j] .-= k*L
end
end
function update_fast(m::Model, factor::Float64)
L::Float64 = 0.
k::Float64 = 0.
cnt::Int = 0
for (i,j) in m.idx
cnt+=1
L = sqrt((m.A[1,i]-m.A[1,j])^2 +
(m.A[2,i]-m.A[2,j])^2 +
(m.A[3,i]-m.A[3,j])^2)
k = factor * m.B[cnt]
m.A[1,i] += k*L
m.A[2,i] += k*L
m.A[3,i] += k*L
m.A[1,j] -= k*L
m.A[2,j] -= k*L
m.A[3,j] -= k*L
end
end
function test_serial_speedup()
n = 10000000
idx = [(rand(1:n),rand(1:n)) for k=1:n]
A = hcat(([rand(0.:1000.);rand(0.:1000.);rand(0.:1000.)] for k=1:n)...)
B = [rand(0.:1000.) for k=1:n]
model1 = Model(idx,A,B)
model2 = Model(idx,A,B)
#time update(model1,2.)
#time update_fast(model2,2.)
#test model1 == model2
end
julia> test_serial_speedup()
5.008049 seconds (50.00 M allocations: 5.215 GiB, 18.14% gc time)
0.464986 seconds
Test Passed

Fast random selection from short vectors in Julia

I have a simple function which appears at several places in my Julia code and is run millions of times inside a loop. The function essentially does rand([1,-1,im,-im]), that's, it picks one of four possible given values. I noticed that this function takes a substantial amount of time in my huge loop, so, I tried to write it in a slightly faster way like this:
function qpsk()
temp1 = ifelse(rand(Bool), 1+0im, -1+0im)
temp2 = ifelse(rand(Bool), 1+0im, 0+1im)
temp1*temp2
end
Then, it is typically called like this:
sig = complex(zeros(N))
for i = 1:N
sig[i] = qpsk()
end
Now, is there any way to further optimize this function, or use another faster method? Appreciate your help.
Comments on current answers:
The answer of #DanGetz (22 lines??) doesn't solve the problem, because at the moment, Julia is not as good at vectors as with explicit loops. Also,
my simple, 1 line qpsk2(s) below, is about 2X faster than those "cryptic" 22 lines of code in the original answer by Dan (a vector is created, though, which adds more time).
But the question remains, why they
didn't implement something like qpsk1 below? and why my original qpsk with branching is more than 3X faster than the straightforward qpsk4(s) below?
I added more versions below to guide the discussion if more experienced people like to jump in.
qpsk1(s) = s[1+(rand(Int8)&3)] # Blazingly fast
qpsk2(s) = s[1+rand(Bool)+2rand(Bool)] # Very fast
qpsk3(s) = s[rand(1:4,1)] # Compiler issue here?
qpsk4(s) = s[rand(1:4)] # Why slow?
qpsk5(s) = rand([s]) # Ridiculously slow!!
function test_orig(n) # Test qpsk(), very fast(branching!), why?
for i = 1:n
qpsk()
end
end
using StaticArrays
function test(func, n) # Test all qpsk1 --> qpsk5
s = SVector(1,-1,im,-im)
for i=1:n
func(s)
end
end
#time test(qpsk1,10^8) 0.554994 seconds (5 allocations: 176 bytes)
#time test(qpsk2,10^8) 0.755286 seconds (5 allocations: 176 bytes)
#time test(qpsk3,10^8) 13.431529 seconds (400 M allocations: 26.822 GiB, 20.68% gc time)
#time test(qpsk4,10^8) 2.520085 seconds (5 allocations: 176 bytes)
#time test(qpsk5,10^8) 10.881852 seconds (200 M allocations: 20.862 GiB, 19.76% gc time)
#time test_orig(10^8) 0.771778 seconds (5 allocations: 176 bytes)
#time nqpsk2(10^8); 1.402830 seconds (9 allocations: 1.490 GiB, 6.39% gc time)
Summary of answer
[(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(N)),rand!(BitVector(N)))]
generates a length N vector faster.
Answer
Calculating the random bits is the bulk of the work, so exploring Chris' idea from comments of using RandomNumbers.jl is worth a shot. Additionally, we can use #rickhg12hs's idea to extract more bits from each random number generated. Regardless, generating a block of values together is essential for better optimization.
For example, the following code (nqpsk1 uses qpsk from the question as the baseline. nqpsk2 is a suggested improvement):
function qpsk()
temp1 = ifelse(rand(Bool), 1+0im, -1+0im)
temp2 = ifelse(rand(Bool), 1+0im, 0+1im)
temp1*temp2
end
nqpsk1(n::Int) = [qpsk() for i=1:n]
nqpsk2(n::Int) = begin
res = zeros(Int,2*n)
blocks = n >>> 4 # use blocks of 16 values
btail = n & 0x000000000000000f # in case n is not a multiple of 16
pos = 1
#inbounds for i=1:blocks
bits = rand(UInt32) # get random bits for a whole block
for j=1:16
b1 = Bool(bits & 1)
bits >>>= 1
b2 = Bool(bits & 1)
bits >>>= 1
res[pos+b1] = (-1)^b2
pos += 2
end
end
#inbounds for i=1:btail
res[pos+rand(Bool)] = (-1)^rand(Bool)
pos += 2
end
return reinterpret(Complex{Int64},res)
end
achieved a >4x improvement on my setup (Julia 0.7):
julia> using BenchmarkTools
julia> #btime nqpsk1(320);
8.791 μs (323 allocations: 15.19 KiB)
julia> #btime nqpsk2(320);
1.056 μs (3 allocations: 5.20 KiB)
Update
With only a modest compromise in speed (and some allocation), but much better looking code:
function nqpsk3(n::Int)
res = zeros(Int,2n)
rv1 = rand!(BitVector(n))
rv2 = rand!(BitVector(n))
#inbounds for (b1,b2,i) in zip(rv1,rv2,1:2:2n)
res[i+b1] = (-1)^b2
end
return reinterpret(Complex{Int},res)
end
The benchmark:
julia> #btime nqpsk3(320);
1.780 μs (11 allocations: 5.83 KiB)
Addendum
And the one-(wrapped)-line version, does OK (2.48 μs) too:
nqpsk4(n) = [(1+0im,-1+0im,0+im,0-im)[2b1+b2+1] for
(b1,b2) in zip(rand!(BitVector(n)),rand!(BitVector(n)))]
Finally, the real one-line version (1.96 μs):
nqpsk5(n) = [(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(n)),rand!(BitVector(n)))]
Latest state of investigation
My current best solution is the following:
function g(pX::Array{Complex{Float64},1})
tab = [1.0,im,-1.0,-im]
bits = UInt128(0)
#inbounds for i = 1 : length(pX)
bits = (i % 64) == 1 ? rand(UInt128) : bits >>> 2
pX[i] = tab[(bits & 3)+1]
end
end
sig = complex(zeros(1280));
using BenchmarkTools
#btime g(sig)
3.838 μs (13 allocations: 464 bytes)
This is better than my optimized version of Dan Getz which runs with the same N, and i feel much more readable
4.236 μs (4 allocations: 20.16 KiB)
However, the performance is extremely fragil. Just have a look at subtle differences to this 36 times slower version:
function g(pX::Array{Complex{Float64},1})
tab = [1,im,-1,-im]
bits = 0
for i = 1 : length(pX)
bits = (i % 64) == 1 ? rand(UInt128) : bits >>> 2
pX[i] = tab[(bits & 3)+1]
end
end
138.320 μs (10209 allocations: 319.14 KiB)
Did you find the differences?
no conversion from Int64 to Float64
Type stability
disable range checking
to follow the convention g() should be renamed to g!()
In the following you find the evolution to the currently best timed solution
My first approach to answer was addressing general weaknesses
a) calling functions is expensive due to calling overhead.
b) complex calculations are more time consuming than lookup.
This end up with the proposal
cases = [1+0im,0+1im,-1+0im,0-1im]
g() = cases[rand(1:4)]
// to use just call g()
g()
What happend?
Why does a) do not succeed?
using BenchmarkTools
test(n) = [q() for i = 1:n]
g() = rand()
#btime test(800);
This results to
rand() => 5.784
rand(Float32) => 5.604
rand(Float64) => 5.821
rand(Bool) => 5.167
rand(Int8) => 5.126
rand(Int16) => 5.171
rand(Int32) => 5.631
rand(Int64) => 7.980
rand(Int128) => 10.549
rand(1:4) => 28.603
(rand(Int8) % 4) + 1 => 6.053
(rand(Int8) & 3) + 1 => 5.843
rand(0:255) => 28.568
rand(UInt8) => 5.104
rand([1,2,3,4]) => 58.437
l = [1,2,3,4]; g() = rand(l) => 47.399
rand(l, 1) => 70.052
m = (1,2,3,4); rand(m) => 124.311
0 => 0.872
0.0 => 0.887
Int8(0) => 0.113
return => 0.33
(running Julia 0.6 on Ubuntu)
How to judge the results
requesting float32 and float64 needs same time. This may be an indicator that float64 does NOT the full mantisse (of 56 bits) for random value
rand for Bool, Int8, Int16 needs nearly the same time. Probably same algorithmus just using fewer bits.
rand for Int32 slightly more time consuming. Int64 and Int128 takes under proportional more time.
rand(1:4) takes surprisingly much more time. It should be in the range of rand(Int8) since it is equivalent to (rand(Int8) % 4) + 1 and (rand(Int8) & 3) + 1.
Even if I hurt somebodies religious feelings, this is just poor code.
Same to rand(Uint) and rand(0:255)
The performance of rand with arrays and tupel is far from acceptable!
Why does b) do not succeed?
Julia seems not able to lookup efficiently from tupels or arrays.
But even when lookup were fast, the rand methods dominates.
Other approaches
Dan Getz approach uses all bits from a rand call. So in end it needs in his first algorithm 1/16 calls per value.
However, this approach could be improved by using UInt128 since now 1/64 calls per value is required.
On my machine Dan Getz original code takes 17.314 for 1280 values while the modified code takes 4.595. The improvement is proportional to the reduced number of calls to rand!
test2(n::Int) = begin
res = zeros(Int,2*n)
blocks = n >>> 7 # use blocks of 16 values
btail = n & 0x000000000000007f # in case n is not a multiple of 16
pos = 1
#inbounds for i=1:blocks
bits = rand(UInt128) # get random bits for a whole block
for j=1:16
b1 = Bool(bits & 1)
bits >>>= 1
b2 = Bool(bits & 1)
bits >>>= 1
res[pos+b1] = (-1)^b2
pos += 2
end
end
#inbounds for i=1:btail
res[pos+rand(Bool)] = (-1)^rand(Bool)
pos += 2
end
return reinterpret(Complex{Int64},res)
end
#btime test2(1280);
However, the use of reinterpret means to know the bit layout of the different structures. That not a real good idea.
high level view
In the end all what the questioneer has coded was a complicated complex build array of random numbers from 1 to 4 (or 0 to 3). I would try to optimize the next step in questioneers following task. However, no info were supplied.
In that following case julia performs much better, which sounds somewhat strange. More returns, less time??
#btime rand(0:3, 1280)
=> 24.377
PS:
Just for comparing the numbers with Dan Getz last approach, the following code takes 27.004
N=1280
#btime [(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(N)),rand!(BitVector(N)))]

while-loop faster than for when returning iterator

I'm trying to oversimplify this as much as possible.
functions f1and f2 implement a very simplified version of a roulette wheel selection over a Vector R. The only difference between them is that f1 uses a for and f2 a while. Both functions return the index of the array where the condition was met.
R=rand(100)
function f1(X::Vector)
l = length(X)
r = rand()*X[l]
for i = 1:l
if r <= X[i]
return i
end
end
end
function f2(X::Vector)
l = length(X)
r = rand()*X[l]
i = 1
while true
if r <= X[i]
return i
end
i += 1
end
end
now I created a couple of test functions...
M is the number of times we repeat the function execution.
Now this is critical... I want to store the values I get from the functions because I need them later... To oversimplify the code I just created a new variable r where I sum up the returns from the functions.
function test01(M,R)
cumR = cumsum(R)
r = 0
for i = 1:M
a = f1(cumR)
r += a
end
return r
end
function test02(M,R)
cumR = cumsum(R)
r = 0
for i = 1:M
a = f2(cumR)
r += a
end
return r
end
So, next I get:
#time test01(1e7,R)
elapsed time: 1.263974802 seconds (320000832 bytes allocated, 15.06% gc time)
#time test02(1e7,R)
elapsed time: 0.57086421 seconds (1088 bytes allocated)
So, for some reason I can't figure out f1 allocates a lot of memory and its even greater the larger M gets.
I said the line r += a was critical, because if I remove it from both test functions, I get the same result with both tests, so no problems! So I thought there was a problem with the type of a being returned by the functions (because f1 returns the iterator of the for loop, and f2 uses its own variable i "manually declared" inside the function).
But...
aa = f1(cumsum(R))
bb = f2(cumsum(R))
typeof(aa) == typeof(bb)
true
So... what that hell is going on???
I apologize if this is some sort of basic question but, I've been going over this for over 3 hours now and couldn't find an answer... Even though the functions are fixed by using a while loop I hate not knowing what's going on.
Thanks.
When you see lots of surprising allocations like that, a good first thing to check is type-stability. The #code_warntype macro is very helpful here:
julia> #code_warntype f1(R)
# … lots of annotated code, but the important part is this last line:
end::Union{Int64,Void}
Compare that to f2:
julia> #code_warntype f2(R)
# ...
end::Int64
So, why are the two different? Julia thinks that f1 might sometimes return nothing (which is of type Void)! Look again at your f1 function: what would happen if the last element of X is NaN? It'll just fall off the end of the function with no explicit return statement. In f2, however, you'll end up indexing beyond the bounds of X and get an error instead. Fix this type-instabillity by deciding what to do if the loop completes without finding the answer and you'll see more similar timings.
As I stated in the comment, your functions f1 and f2 both contain random numbers inside it, and you are using the random numbers as stopping criterion. Thus, there is no deterministic way to measure which of the functions is faster (doesn't depend in the implementation).
You can replace f1 and f2 functions to accept r as a parameter:
function f1(X::Vector, r)
for i = 1:length(X)
if r <= X[i]
return i
end
end
end
function f2(X::Vector, r)
i = 1
while i <= length(X)
if r <= X[i]
return i
end
i += 1
end
end
And then measure the time properly with the same R and r for both functions:
>>> R = cumsum(rand(100))
>>> r = rand(1_000_000) * R[end] # generate 1_000_000 random thresholds
>>> #time for i=1:length(r); f1(R, r[i]); end;
0.177048 seconds (4.00 M allocations: 76.278 MB, 2.70% gc time)
>>> #time for i=1:length(r); f2(R, r[i]); end;
0.173244 seconds (4.00 M allocations: 76.278 MB, 2.76% gc time)
As you can see, the timings are now nearly identical. Any difference will be caused for external factors (warming or processor busy with other tasks).

Julia allocates huge amount of memory for unknown reason

I'm trying to parallelize a little scientific code I wrote. But when I add #parallelize, similar code on just one processor suddenly takes 10 times as long to execute. It should take roughly the same amount of time. The first code makes one memory allocation, while the second makes 20. But zeros(Float64, num_bins) should not be a bottleneck. num_bins is 1800. So each call to zeros() should be allocating 8*1800 bytes. 20 calls to allocate 14,400 bytes should not be taking this long.
I can't figure out what I'm doing wrong, and the Julia documentation is vague and non-specific about how variables are accessed within #parallel. Both versions of the code below compute the correct value for the rdf vector. Can anyone tell by looking at it what is making it allocate so much memory and take so long?
atoms = readAtoms(file)
rdf = zeros(Float64, num_bins)
#time for k = 1:20
for i = 1:num_atoms
for j = 1:num_atoms
r = distance(k, atoms, i, atoms, j)
bin_number = floor(r / dr) + 1
rdf[bin_number] += 1
end
end
end
elapsed time: 8.1 seconds (0 bytes allocated)
atoms = readAtoms(file)
#time rdf = #parallel (+) for k = 1:20
rdf_part = zeros(Float64, num_bins)
for i = 1:num_atoms
for j = 1:num_atoms
r = distance(k, atoms, i, atoms, j)
bin_number = floor(r / dr) + 1
rdf_part[bin_number] += 1
end
end
rdf_part
end
elapsed time: 81.2 seconds (33472513332 bytes allocated, 17.40% gc time)

Memory usage blows up when iterating over array

I'm working on a Jacobi solver for the Poisson equation using Julia. The solver is called iteratively until err is sufficiently small (~1e-8), which takes around 25,000 loops through the function for my nx = ny = 80 test case. Profiling shows that most of the time is spent in the inner loop (as expected), but memory allocation seems to be running away--the #time macro gives 38 gigabytes allocated in order to reach convergence, which seems way too much since I don't think I'm creating new arrays for each loop.
function jacobi(P::Array{Float64,2}, maxiter::Int64)
P_old = copy(P)
for j = 2:ny-1
# Main body loop
for i = 2:nx-1
#inbounds P[i,j] = ((P_old[i+1,j] + P_old[i-1,j])*dx2
+ (P_old[i,j+1] + P_old[i,j-1])*dy2)/denom-Rmod[i,j]
end
end
err = vecnorm(P::Array{Float64,2}-P_old::Array{Float64,2})/sqrt(nx+ny)
return (P, err)
end
I've timed the function for 1000 loops, calling from a function wrapper (methodwrap) that sets initial conditions:
function methodwrap(solver, maxiter::Int64) # (solver fn name, max # of iterations)
P = copy(P0)
iter = 1
err = 1.0
maxerr = 1e-8
prog = Progress(maxiter,.2, "Solving using $solver method", 10) # Show progress bar
while (err > maxerr) && (iter < maxiter)
P, err = solver(P, maxiter)
next!(prog) # Iterates progress bar counter
iter += 1
end
println()
return (P, iter, err)
end
Contrary to my wishes, it looks like memory allocation scales with the number of loops, so I'm doing something wrong. It looks as if approximately 1.4 mb is allocated with each Jacobi pass:
julia> #time methodwrap(jacobi,1000)
Solving using jacobi method 98%|##########| ETA: 0:00:00
elapsed time: 4.001988593 seconds (1386549012 bytes allocated, 26.45% gc time)
I've tried reducing the inner loop arrays to vector subarrays and using #simd:
function jacobi2(P::Array{Float64,2}, maxiter::Int64)
P_old = copy(P)::Array{Float64,2}
for j = 2:ny-1
# Main body loop
Pojm = sub(P_old,:,j-1)
Poj = sub(P_old,:,j)
Pojp = sub(P_old,:,j+1)
Pj = sub(P,:,j)
Rmodj = sub(Rmod,:,j)
#simd for i = 2:nx-1
#inbounds Pj[i] = ((Poj[i+1] + Poj[i-1])*dx2
+ (Pojp[i] + Pojm[i])*dy2)/denom-Rmodj[i]
end
end
err = vecnorm(P::Array{Float64,2}-P_old::Array{Float64,2})/sqrt(nx+ny)
return (P, err)
end
However, this only seems to increase memory allocation and decrease speed, and I get a #simd warning:
julia> #time methodwrap(jacobi2,1000);
Warning: could not attach metadata for #simd loop.
Solving using jacobi2 method: 100%|##########| ETA: 0:00:00
elapsed time: 4.947097666 seconds (1455818184 bytes allocated, 29.85% gc time)
This is my first project in Julia, so I'm probably making a really obvious mistake, but I haven't found a solution yet. I've defined global vars as constants. I've gone through the performance tips several times, I've linted the file, I've used TypeCheck to make sure my types are consistent, and everything looks fairly kosher to my eyes. What am I doing wrong? I've posted my full code on Gist if you'd like to check that as well.
It turns out the problem was subtle. I made 3 changes (see below). I did use as #IainDunning suggested --track-allocation=user which pointed to the questionable line. Both of these problems come from using global variables.
After these changes
julia> #time methodwrap(jacobi,1000)
elapsed time: 0.481986712 seconds (116650236 bytes allocated)
change 1 add const to nx and ny
You had const everywhere except for these 2 variables but when left non const and global that cause the loop iterator i to allocate unnecessarily.
nx=80 # Number of mesh points in the x-direction
ny=80 # Number of mesh points in the y-direction
was changed to
const nx=80 # Number of mesh points in the x-direction
const ny=80 # Number of mesh points in the y-direction
change 2: avoid Rmod of type Array{Any,2}
const Rmod = dx2*dy2*R/(2*(dx2+dy2))
was changed to
const Rmod = convert(Array{Float64,2},dx2*dy2*R/(2*(dx2+dy2)))

Resources