Julia allocates a huge amount of memory for an unknown reason - debugging

I'm trying to parallelize a little scientific code I wrote. But when I add @parallel, similar code on just one processor suddenly takes 10 times as long to execute. It should take roughly the same amount of time. The first code makes one memory allocation, while the second makes 20. But zeros(Float64, num_bins) should not be a bottleneck. num_bins is 1800, so each call to zeros() should be allocating 8*1800 = 14,400 bytes. Twenty calls allocating 14,400 bytes each (288 KB in total) should not be taking this long.
I can't figure out what I'm doing wrong, and the Julia documentation is vague and non-specific about how variables are accessed within @parallel. Both versions of the code below compute the correct value for the rdf vector. Can anyone tell by looking at it what is making it allocate so much memory and take so long?
atoms = readAtoms(file)
rdf = zeros(Float64, num_bins)
@time for k = 1:20
    for i = 1:num_atoms
        for j = 1:num_atoms
            r = distance(k, atoms, i, atoms, j)
            bin_number = floor(r / dr) + 1
            rdf[bin_number] += 1
        end
    end
end
elapsed time: 8.1 seconds (0 bytes allocated)
atoms = readAtoms(file)
@time rdf = @parallel (+) for k = 1:20
    rdf_part = zeros(Float64, num_bins)
    for i = 1:num_atoms
        for j = 1:num_atoms
            r = distance(k, atoms, i, atoms, j)
            bin_number = floor(r / dr) + 1
            rdf_part[bin_number] += 1
        end
    end
    rdf_part
end
elapsed time: 81.2 seconds (33472513332 bytes allocated, 17.40% gc time)
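One way to sanity-check the arithmetic above (my addition; @allocated is a standard Julia macro):

num_bins = 1800
zeros(Float64, num_bins)              # warm up so compilation is not measured
@allocated zeros(Float64, num_bins)   # should report roughly 8*1800 = 14,400 bytes

If this reports ~14 KB per call, the tens of gigabytes in the parallel timing must come from how the loop body and the variables it references (atoms, num_atoms, dr) are captured and shipped around by @parallel, not from zeros() itself.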

Related

Fastest way to generate a kmer count vector from a nucleotide sequence (Julia)

Given a nucleotide sequence, I'm writing some Julia code to generate a sparse vector of (masked) kmer counts, and I would like it to run as fast as possible.
Here is my current implementation:
using Distributions
using SparseArrays

function kmer_profile(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]
    d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)
    kmer_dict = Dict{Int, Int32}(4^k=>0)
    for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        j = 1
        for i in 1:length(mask)
            if mask[i]
                kmer_hash += d[seq[n+i-1]] * basis[j]
                j += 1
            end
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
    end
    return sparsevec(kmer_dict)
end
seq = join(sample(['A','C','G','T'], 1000000))
mask_str = "111111011111001111111111111110"
mask = BitArray([parse(Bool, string(m)) for m in split(mask_str, "")])
k = sum(mask)
@time kmer_profile(seq, k, mask)
This code runs in about 0.3 seconds on my M1 MacBook Pro. Is there any way to make it run significantly faster?
The function kmer_profile uses a sliding window of size length(mask) to count the number of times each masked kmer appears in the nucleotide sequence. A mask is a binary sequence, and a masked kmer is a kmer with nucleotides dropped at positions at which the mask is zero. E.g. the kmer ACGT and mask 1001 will produce the masked kmer AT.
To produce the kmer hash, the function treats each kmer as a base 4 number and then converts it to a (base 10) 64-bit integer, for indexing into the kmer vector.
The value of k equals the number of ones in the mask string and is implicitly limited to 31, so that kmer hashes fit into a 64-bit integer type.
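To make the hashing concrete, here is a small worked example (my own illustration, matching the Dict and basis conventions of the code above):

d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)
kmer = "ACGT"                     # with mask 1001, the masked kmer is "AT"
basis = [4, 1]                    # k = 2 kept positions, so basis = [4^1, 4^0]
kmer_hash = 1 + d[kmer[1]]*basis[1] + d[kmer[4]]*basis[2]   # == 1 + 0*4 + 3*1 == 4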
There are several possible optimizations to make this code faster.
First of all, one can convert the Dict to an array, since array-based indexing is faster than dictionary-based indexing; this is possible here because the key is an ASCII character.
Moreover, the extraction of the sequence codes can be done once instead of length(mask) times, by pre-computing the codes and putting the result in a temporary array.
Additionally, the mask-based conditional and the loop-carried dependency make things slow. Indeed, the condition cannot be (easily) predicted by the processor, causing it to stall for several cycles. The loop-carried dependency makes things even worse, since the processor can hardly execute other instructions during this stall. This problem can be solved by pre-computing the factors from both mask and basis. The result is a faster branch-less loop.
Once the above optimizations are done, the biggest bottleneck is sparsevec. In fact, it also took nearly half the time of the initial implementation! Optimizing this step is difficult but not impossible. It is slow because of random accesses in the Julia implementation. One can speed it up by sorting the key-value pairs first. That is faster due to a more cache-friendly execution, and it can also help the processor's prediction unit. This is a complex topic; for more details about how this works, please read "Why is processing a sorted array faster than processing an unsorted array?".
Here is the final optimized code:
function kmer_profile_opt(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]
    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3
    seq_codes = [d[Int8(e)] for e in seq]
    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j += 1
        end
    end
    kmer_dict = Dict{Int, Int32}(4^k=>0)
    for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        for i in 1:length(mask)
            kmer_hash += seq_codes[n+i-1] * premult[i]
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
    end
    sorted_kmer_pairs = sort(collect(kmer_dict))
    sorted_kmer_keys = [e[1] for e in sorted_kmer_pairs]
    sorted_kmer_values = [e[2] for e in sorted_kmer_pairs]
    return sparsevec(sorted_kmer_keys, sorted_kmer_values)
end
This code is a bit more than twice as fast as the initial implementation on my machine. A significant fraction of the time is still spent in the sorting algorithm.
The code can still be optimized further. One way is to use a parallel sorting algorithm. Another way is to replace the premult[i] multiplication by a shift, which is faster provided premult[i] is modified to contain exponents instead of factors. I expect the resulting code to be about 4 times faster than the original. The main bottleneck should then be the creation of the big dictionary; improving the performance of that any further is very hard (though still possible).
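As a sketch of that shift idea (my own illustration, not part of the answer's code; offsets and shifts are hypothetical names): pre-compute the positions kept by the mask together with their exponents, then replace the multiplication with a left shift in the hot loop.

# computed once, outside the scan over the sequence
offsets = [i for i in 1:length(mask) if mask[i]]   # positions kept by the mask
shifts  = [2*(k - j) for j in 1:k]                 # 4^(k-j) == 1 << (2*(k-j))

# inner computation for window n, assuming seq_codes as in kmer_profile_opt
function masked_hash(seq_codes, n, offsets, shifts)
    kmer_hash = 1
    for (o, s) in zip(offsets, shifts)
        kmer_hash += Int(seq_codes[n+o-1]) << s    # shift instead of multiply
    end
    return kmer_hash
end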
Inspired by Jérôme's answer, and squeezing some more by avoiding Dicts altogether:
function kmer_profile_opt3a(seq, k, mask)
    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3
    seq_codes = [d[Int8(e)] for e in seq]
    basis = [4^i for i in (k-1):-1:0]
    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j += 1
        end
    end
    kmer_vec = Vector{Int}(undef, length(seq)-length(mask)+1)
    @inbounds for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        for i in 1:length(mask)
            kmer_hash += seq_codes[n+i-1] * premult[i]
        end
        kmer_vec[n] = kmer_hash
    end
    sort!(kmer_vec)
    return sparsevec(kmer_vec, ones(length(kmer_vec)), 4^k, +)
end
This achieved another 2x speedup over Jérôme's answer on my machine. The auto-combining feature of sparsevec makes the code a bit more compact.
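For readers unfamiliar with this sparsevec method: when given index and value vectors plus a combine function, repeated indices are merged with that function, which is what turns the sorted list of kmer hashes into counts. A tiny illustration (my own, assuming SparseArrays is loaded):

using SparseArrays
# index 1 appears twice, so its two entries are combined with +, giving v[1] == 2
v = sparsevec([1, 1, 3], [1, 1, 1], 4, +)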
Trying to slim the code further, and avoid unnecessary allocations in sparse vector creation, the following can be used:
using SparseArrays, LinearAlgebra

function specialsparsevec(nzs, n)
    vals = Vector{Int}(undef, length(nzs))
    j, k, count, last = (1, 1, 0, nzs[1])
    while k <= length(nzs)
        if nzs[k] == last
            count += 1
        else
            vals[j], nzs[j] = (count, last)
            count, last = (1, nzs[k])
            j += 1
        end
        k += 1
    end
    vals[j], nzs[j] = (count, last)
    resize!(nzs, j)
    resize!(vals, j)
    return SparseVector(n, nzs, vals)
end

function kmer_profile_opt3(seq, k, mask)
    d = zeros(Int8, 128)
    foreach(((i,c),) -> d[Int(c)]=i-1, enumerate(collect("ACGT")))
    seq_codes = getindex.(Ref(d), Int8.(collect(seq)))
    premult = foldr(
        (i,(p,j))->(mask[i] && (p[i]=j ; j<<=2) ; (p,j)),
        1:length(mask); init=(zeros(Int64,length(mask)),1)) |> first
    kmer_vec = sort(
        [ dot(@view(seq_codes[n:n+length(mask)-1]), premult) + 1 for
          n in 1:(length(seq)-length(mask)+1)
        ])
    return specialsparsevec(kmer_vec, 4^k)
end
This last version gets another 10% speedup (but is a little cryptic):
julia> @btime kmer_profile_opt($seq, $k, $mask);
367.584 ms (81 allocations: 134.71 MiB) # other answer
julia> @btime kmer_profile_opt3a($seq, $k, $mask);
140.882 ms (22 allocations: 54.36 MiB) # 1st this answer
julia> @btime kmer_profile_opt3($seq, $k, $mask);
127.016 ms (14 allocations: 27.66 MiB) # 2nd this answer

Fast random selection from short vectors in Julia

I have a simple function which appears at several places in my Julia code and is run millions of times inside a loop. The function essentially does rand([1,-1,im,-im]); that is, it picks one of four possible given values. I noticed that this function takes a substantial amount of time in my huge loop, so I tried to write it in a slightly faster way like this:
function qpsk()
    temp1 = ifelse(rand(Bool), 1+0im, -1+0im)
    temp2 = ifelse(rand(Bool), 1+0im, 0+1im)
    temp1*temp2
end
Then, it is typically called like this:
sig = complex(zeros(N))
for i = 1:N
    sig[i] = qpsk()
end
Now, is there any way to further optimize this function, or use another faster method? Appreciate your help.
Comments on current answers:
The answer of @DanGetz (22 lines??) doesn't solve the problem, because at the moment Julia is not as good with vectors as with explicit loops. Also, my simple one-line qpsk2(s) below is about 2X faster than those "cryptic" 22 lines of code in the original answer by Dan (a vector is created, though, which adds more time).
But the questions remain: why didn't they implement something like qpsk1 below? And why is my original qpsk with branching more than 3X faster than the straightforward qpsk4(s) below?
I added more versions below to guide the discussion, in case more experienced people would like to jump in.
qpsk1(s) = s[1+(rand(Int8)&3)] # Blazingly fast
qpsk2(s) = s[1+rand(Bool)+2rand(Bool)] # Very fast
qpsk3(s) = s[rand(1:4,1)] # Compiler issue here?
qpsk4(s) = s[rand(1:4)] # Why slow?
qpsk5(s) = rand([s]) # Ridiculously slow!!
function test_orig(n) # Test qpsk(): very fast (branching!), why?
    for i = 1:n
        qpsk()
    end
end

using StaticArrays

function test(func, n) # Test all of qpsk1 --> qpsk5
    s = SVector(1,-1,im,-im)
    for i=1:n
        func(s)
    end
end
@time test(qpsk1,10^8)  # 0.554994 seconds (5 allocations: 176 bytes)
@time test(qpsk2,10^8)  # 0.755286 seconds (5 allocations: 176 bytes)
@time test(qpsk3,10^8)  # 13.431529 seconds (400 M allocations: 26.822 GiB, 20.68% gc time)
@time test(qpsk4,10^8)  # 2.520085 seconds (5 allocations: 176 bytes)
@time test(qpsk5,10^8)  # 10.881852 seconds (200 M allocations: 20.862 GiB, 19.76% gc time)
@time test_orig(10^8)   # 0.771778 seconds (5 allocations: 176 bytes)
@time nqpsk2(10^8);     # 1.402830 seconds (9 allocations: 1.490 GiB, 6.39% gc time)
Summary of answer
[(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(N)),rand!(BitVector(N)))]
generates a length N vector faster.
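For completeness (my addition, not from the answer): rand! comes from the Random standard library, and on Julia 1.x the BitVector constructor needs undef, so a runnable version of the one-liner looks like:

using Random
N = 320
sig = [(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(undef, N)), rand!(BitVector(undef, N)))]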
Answer
Calculating the random bits is the bulk of the work, so exploring Chris' idea from the comments of using RandomNumbers.jl is worth a shot. Additionally, we can use @rickhg12hs's idea to extract more bits from each random number generated. Regardless, generating a block of values together is essential for better optimization.
For example, the following code (nqpsk1 uses qpsk from the question as the baseline. nqpsk2 is a suggested improvement):
function qpsk()
    temp1 = ifelse(rand(Bool), 1+0im, -1+0im)
    temp2 = ifelse(rand(Bool), 1+0im, 0+1im)
    temp1*temp2
end

nqpsk1(n::Int) = [qpsk() for i=1:n]

nqpsk2(n::Int) = begin
    res = zeros(Int,2*n)
    blocks = n >>> 4                 # use blocks of 16 values
    btail = n & 0x000000000000000f   # in case n is not a multiple of 16
    pos = 1
    @inbounds for i=1:blocks
        bits = rand(UInt32)          # get random bits for a whole block
        for j=1:16
            b1 = Bool(bits & 1)
            bits >>>= 1
            b2 = Bool(bits & 1)
            bits >>>= 1
            res[pos+b1] = (-1)^b2
            pos += 2
        end
    end
    @inbounds for i=1:btail
        res[pos+rand(Bool)] = (-1)^rand(Bool)
        pos += 2
    end
    return reinterpret(Complex{Int64},res)
end
achieved a >4x improvement on my setup (Julia 0.7):
julia> using BenchmarkTools
julia> @btime nqpsk1(320);
8.791 μs (323 allocations: 15.19 KiB)
julia> @btime nqpsk2(320);
1.056 μs (3 allocations: 5.20 KiB)
Update
With only a modest compromise in speed (and some allocation), but much better-looking code:
function nqpsk3(n::Int)
    res = zeros(Int,2n)
    rv1 = rand!(BitVector(n))
    rv2 = rand!(BitVector(n))
    @inbounds for (b1,b2,i) in zip(rv1,rv2,1:2:2n)
        res[i+b1] = (-1)^b2
    end
    return reinterpret(Complex{Int},res)
end
The benchmark:
julia> @btime nqpsk3(320);
1.780 μs (11 allocations: 5.83 KiB)
Addendum
And the one-(wrapped)-line version does OK (2.48 μs) too:
nqpsk4(n) = [(1+0im,-1+0im,0+im,0-im)[2b1+b2+1] for
(b1,b2) in zip(rand!(BitVector(n)),rand!(BitVector(n)))]
Finally, the real one-line version (1.96 μs):
nqpsk5(n) = [(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(n)),rand!(BitVector(n)))]
Latest state of investigation
My current best solution is the following:
function g(pX::Array{Complex{Float64},1})
    tab = [1.0,im,-1.0,-im]
    bits = UInt128(0)
    @inbounds for i = 1 : length(pX)
        bits = (i % 64) == 1 ? rand(UInt128) : bits >>> 2
        pX[i] = tab[(bits & 3)+1]
    end
end

sig = complex(zeros(1280));
using BenchmarkTools
@btime g(sig)
3.838 μs (13 allocations: 464 bytes)
This is better than my optimized version of Dan Getz's code, which runs with the same N in
4.236 μs (4 allocations: 20.16 KiB)
and I feel mine is much more readable.
However, the performance is extremely fragile. Just have a look at the subtle differences in this 36-times-slower version:
function g(pX::Array{Complex{Float64},1})
    tab = [1,im,-1,-im]
    bits = 0
    for i = 1 : length(pX)
        bits = (i % 64) == 1 ? rand(UInt128) : bits >>> 2
        pX[i] = tab[(bits & 3)+1]
    end
end
138.320 μs (10209 allocations: 319.14 KiB)
Did you find the differences?
no conversion from Int64 to Float64 (tab holds Float64 values in the fast version)
type stability (bits is initialized as UInt128(0), not as the Int literal 0)
range checking disabled (@inbounds)
Also, to follow convention, g() should be renamed g!(), since it mutates its argument.
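One quick way to see the type instability (my suggestion; @code_warntype is a standard Julia macro, available in the REPL):

sig = complex(zeros(1280))
# In the slow version, `bits` starts as the Int literal 0 and later holds a
# UInt128, so its inferred type is a Union; @code_warntype flags this.
@code_warntype g(sig)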
In the following you find the evolution toward the currently best-timed solution.
My first approach to an answer addressed two general weaknesses:
a) calling functions is expensive due to call overhead.
b) complex calculations are more time consuming than lookups.
This ended up with the proposal
cases = [1+0im,0+1im,-1+0im,0-1im]
g() = cases[rand(1:4)]
# to use, just call g()
g()
What happened? Why did a) not succeed?
using BenchmarkTools
test(n) = [g() for i = 1:n]
g() = rand()
@btime test(800);
With g() redefined to each of the expressions below, this results in (times in μs):
rand() => 5.784
rand(Float32) => 5.604
rand(Float64) => 5.821
rand(Bool) => 5.167
rand(Int8) => 5.126
rand(Int16) => 5.171
rand(Int32) => 5.631
rand(Int64) => 7.980
rand(Int128) => 10.549
rand(1:4) => 28.603
(rand(Int8) % 4) + 1 => 6.053
(rand(Int8) & 3) + 1 => 5.843
rand(0:255) => 28.568
rand(UInt8) => 5.104
rand([1,2,3,4]) => 58.437
l = [1,2,3,4]; g() = rand(l) => 47.399
rand(l, 1) => 70.052
m = (1,2,3,4); rand(m) => 124.311
0 => 0.872
0.0 => 0.887
Int8(0) => 0.113
return => 0.33
(running Julia 0.6 on Ubuntu)
How to judge the results
Requesting Float32 and Float64 takes the same time. This may be an indicator that Float64 does NOT use the full mantissa (of 52 bits) for the random value.
rand for Bool, Int8, and Int16 needs nearly the same time. Probably the same algorithm is used, just taking fewer bits.
rand for Int32 is slightly more time consuming. Int64 and Int128 take less than proportionally more time.
rand(1:4) takes surprisingly much more time. It should be in the range of rand(Int8), since it is equivalent to (rand(Int8) % 4) + 1 and (rand(Int8) & 3) + 1.
Even if I hurt somebody's religious feelings: this is just poor code.
The same holds for rand(0:255) compared with rand(UInt8).
The performance of rand with arrays and tuples is far from acceptable!
Why did b) not succeed?
Julia seems unable to look up efficiently in tuples or arrays.
But even if the lookup were fast, the rand methods would dominate.
Other approaches
Dan Getz's approach uses all the bits from a rand call, so in the end his first algorithm needs 1/16 of a rand call per value.
However, this approach can be improved by using UInt128, since then only 1/64 of a call per value is required.
On my machine, Dan Getz's original code takes 17.314 μs for 1280 values, while the modified code takes 4.595 μs. The improvement is proportional to the reduced number of calls to rand:
test2(n::Int) = begin
    res = zeros(Int,2*n)
    blocks = n >>> 6                 # use blocks of 64 values (128 random bits)
    btail = n & 0x000000000000003f   # in case n is not a multiple of 64
    pos = 1
    @inbounds for i=1:blocks
        bits = rand(UInt128)         # get random bits for a whole block
        for j=1:64
            b1 = Bool(bits & 1)
            bits >>>= 1
            b2 = Bool(bits & 1)
            bits >>>= 1
            res[pos+b1] = (-1)^b2
            pos += 2
        end
    end
    @inbounds for i=1:btail
        res[pos+rand(Bool)] = (-1)^rand(Bool)
        pos += 2
    end
    return reinterpret(Complex{Int64},res)
end

@btime test2(1280);
However, the use of reinterpret requires knowing the bit layout of the different structures. That is not a really good idea.
High-level view
In the end, all the questioner has coded is a complicated way to build an array of random numbers from 1 to 4 (or 0 to 3). I would try to optimize the next step in the questioner's task, but no info was supplied.
In the following case Julia performs much better, which seems somewhat strange. More returns, less time??
@btime rand(0:3, 1280)
=> 24.377
PS: Just to compare the numbers with Dan Getz's last approach, the following code takes 27.004 μs:
N = 1280
@btime [(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(N)),rand!(BitVector(N)))]

Why is Julia allocating so much memory?

I am trying to write a fast coordinate descent algorithm for solving ordinary least squares regression. The following Julia code works, but I don't understand why it's allocating so much memory
function OLS_cd{T<:Float64}(A::Array{T,2}, b::Array{T,1}, tolerance::T=1e-12)
    N,P = size(A)
    x = zeros(P)
    r = copy(b)
    d = ones(P)
    while sum(d.*d) > tolerance
        @inbounds for j = 1:P
            d[j] = sum(A[:,j].*r)
            x[j] += d[j]
            r -= d[j]*A[:,j]
        end
    end
    return(x)
end
On the data I generate with
n = 100
p = 75
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Using @benchmark OLS_cd(X, y) I get
BenchmarkTools.Trial:
memory estimate: 65.94 mb
allocs estimate: 151359
--------------
minimum time: 19.316 ms (16.49% GC)
median time: 20.545 ms (16.60% GC)
mean time: 22.164 ms (16.24% GC)
maximum time: 42.114 ms (10.82% GC)
--------------
samples: 226
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
The OLS problem gets harder as p gets bigger, and I've noticed that the bigger I make p (and the longer I need to run), the more memory Julia allocates.
Why would each pass through the while loop allocate more memory? To my eye, it seems like all of my operations are in place, and the types are clearly specified.
Nothing popped out at me while profiling, but I could post that output as well if it's useful.
Update:
As pointed out below, temporary arrays created by vectorized operations were the culprit. The following eliminates the extraneous allocations and runs pretty quickly:
function OLS_cd_unrolled{T<:Float64}(A::Array{T,2}, b::Array{T,1}, tolerance::T=1e-12)
    N,P = size(A)
    x = zeros(P)
    r = copy(b)
    d = ones(P)
    while norm(d,Inf) > tolerance
        @inbounds for j = 1:P
            d[j] = 0.0; @inbounds for i = 1:N d[j] += A[i,j]*r[i] end
            @inbounds for i = 1:N r[i] -= d[j]*A[i,j] end
            x[j] += d[j]
        end
    end
    return(x)
end
A[:,j] creates a copy, not a view. You want to use @view A[:,j] or view(A,:,j).
You can devectorize r -= d[j]*A[:,j] with r .= r .- d[j].*A[:,j] to get rid of some more temporaries. As @LutfullahTomak said, sum(A[:,j].*r) should devectorize as dot(view(A,:,j),r) to get rid of all of the temporaries in there. To use an infix operator, you can use \cdot, as in view(A,:,j)⋅r.
You should read up on copies vs. views and how vectorization causes temporary arrays. The gist of it is that when vectorized operations occur, they have to create a new vector as output. Instead, you want to write to an existing vector. r = ... for an array changes the reference: r = ex for some expression which makes an array will make a new array and then point r to that array, while r .= ex will replace the values of the array r with the values from the expression. The former allocates a temporary; the latter does not. Repeated application of this idea is where all of the temporaries come from.
Actually, sum(d.*d), sum(A[:,j].*r) and so on are not in-place and make temporary arrays. First, sum(d.*d) == dot(d,d), I think, and sum(A[:,j].*r) makes 2 temporary arrays. I'd do dot(view(A,:,j),r) for the latter. The current stable version of Julia (0.5) doesn't have a short version of r -= d[j]*A[:,j], so you need to devectorize it and write a loop.
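Putting those suggestions together, the inner loop might look like this (a sketch combining the comments above, not the answerer's exact code; dot lived in Base in the 0.5 era and in LinearAlgebra on 1.x):

@inbounds for j = 1:P
    Aj = view(A, :, j)           # a view, so no copy is made
    d[j] = dot(Aj, r)            # replaces sum(A[:,j].*r) with no temporaries
    x[j] += d[j]
    for i = 1:N                  # devectorized r -= d[j]*A[:,j]
        r[i] -= d[j]*Aj[i]
    end
end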

Joint Entropy Performance in Julia

I wrote a function to calculate the joint entropy of each column pair in a matrix, but I would like to improve its performance in both time and memory.
The function looks like this:
function jointentropy(aln)
    mat = Array(Float64, size(aln,2), size(aln,2))
    for i in combinations(1:size(aln,2), 2)
        a = i[1]
        b = i[2]
        mina, maxa = extrema(aln[:,a])
        minb, maxb = extrema(aln[:,b])
        h = Array(Float64, (maxa-mina+1, maxb-minb+1))
        h = hist2d([aln[:,a] aln[:,b]], mina-1:1:maxa, minb-1:1:maxb)[3]
        h = h/size(aln[:,1],1)
        I,J,V = findnz(h)
        l = sparse(I, J, log2(V), maxa-mina+1, maxb-minb+1)
        mat[b,a] = -sum(l.*h)
    end
    return mat
end
Matrices that go into this function look like this:
rand(45:122,rand(1:2000),rand(1:2000))
An example with a 500x500 matrix resulted in the following @time output:
elapsed time: 33.692081413 seconds (33938843192 bytes allocated, 36.42% gc time)
...which seems to be a whole lot of memory...
Any suggestions on how to speed up this function and reduce memory allocation?
Thanks in advance for any help!
Here are a few ideas to speed up your function.
If the range of all the columns is roughly the same, you can move the extrema computations outside the loop and reuse the same h array.
hist2d creates a new array: you can use hist2d! to reuse the previous one.
The assignment h = h/size(aln[:,1],1) creates a new array.
The division in h = h/size(aln[:,1],1) is done for all the elements of the array, including the zeroes.
You can use a loop instead of findnz and a sparse matrix (findnz already contains a loop).
function jointentropy2(aln)
    n1 = size(aln,1)
    n2 = size(aln,2)
    mat = Array(Float64, n2, n2)
    lower, upper = extrema(aln)
    m = upper-lower+1
    h = Array(Float64, (m,m))
    for a in 1:n2
        for b in (a+1):n2
            Base.hist2d!(h, [aln[:,a] aln[:,b]], lower-1:1:upper, lower-1:1:upper)
            s = 0
            for i in 1:m
                for j in 1:m
                    if h[i,j] != 0
                        p = h[i,j] / n1
                        s += p * log2(p)
                    end
                end
            end
            mat[b,a] = -s
        end
    end
    return mat
end
This is twice as fast as the initial function, and the memory allocations were divided by 4.
aln = rand(45:122,500,400)
@time x = jointentropy(aln)
# elapsed time: 26.946314168 seconds (21697858752 bytes allocated, 29.97% gc time)
@time y = jointentropy2(aln)
# elapsed time: 13.626282821 seconds (5087119968 bytes allocated, 16.21% gc time)
x - y # approximately zero (at least below the diagonal --
      # the matrix was not initialized above it)
The next candidate for optimization is hist2d (here, you could use a loop and a sparse matrix).
@profile jointentropy2(aln)
Profile.print()
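Following up on that suggestion, a minimal sketch of a loop-based replacement for the hist2d! call (my own illustration, assuming integer-valued columns and the same lower bound as in jointentropy2; hist2d_loop! is a hypothetical name):

function hist2d_loop!(h, x, y, lower)
    # h is an m-by-m count matrix; x and y are integer column vectors
    fill!(h, 0.0)
    @inbounds for i in eachindex(x)
        h[x[i]-lower+1, y[i]-lower+1] += 1
    end
    return h
end

This also avoids building the [aln[:,a] aln[:,b]] concatenation, which allocates a temporary matrix for every column pair.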

Memory usage blows up when iterating over array

I'm working on a Jacobi solver for the Poisson equation using Julia. The solver is called iteratively until err is sufficiently small (~1e-8), which takes around 25,000 passes through the function for my nx = ny = 80 test case. Profiling shows that most of the time is spent in the inner loop (as expected), but memory allocation seems to be running away: the @time macro reports 38 gigabytes allocated in order to reach convergence, which seems far too much, since I don't think I'm creating new arrays on each pass.
function jacobi(P::Array{Float64,2}, maxiter::Int64)
    P_old = copy(P)
    for j = 2:ny-1
        # Main body loop
        for i = 2:nx-1
            @inbounds P[i,j] = ((P_old[i+1,j] + P_old[i-1,j])*dx2
                                + (P_old[i,j+1] + P_old[i,j-1])*dy2)/denom - Rmod[i,j]
        end
    end
    err = vecnorm(P::Array{Float64,2} - P_old::Array{Float64,2})/sqrt(nx+ny)
    return (P, err)
end
I've timed the function for 1000 loops, calling from a function wrapper (methodwrap) that sets initial conditions:
function methodwrap(solver, maxiter::Int64) # (solver fn name, max # of iterations)
    P = copy(P0)
    iter = 1
    err = 1.0
    maxerr = 1e-8
    prog = Progress(maxiter, .2, "Solving using $solver method", 10) # Show progress bar
    while (err > maxerr) && (iter < maxiter)
        P, err = solver(P, maxiter)
        next!(prog) # Iterates progress bar counter
        iter += 1
    end
    println()
    return (P, iter, err)
end
Contrary to my wishes, it looks like memory allocation scales with the number of loops, so I'm doing something wrong. It looks as if approximately 1.4 MB is allocated with each Jacobi pass:
julia> @time methodwrap(jacobi,1000)
Solving using jacobi method 98%|##########| ETA: 0:00:00
elapsed time: 4.001988593 seconds (1386549012 bytes allocated, 26.45% gc time)
I've tried reducing the inner loop arrays to vector subarrays and using @simd:
function jacobi2(P::Array{Float64,2}, maxiter::Int64)
    P_old = copy(P)::Array{Float64,2}
    for j = 2:ny-1
        # Main body loop
        Pojm = sub(P_old,:,j-1)
        Poj = sub(P_old,:,j)
        Pojp = sub(P_old,:,j+1)
        Pj = sub(P,:,j)
        Rmodj = sub(Rmod,:,j)
        @simd for i = 2:nx-1
            @inbounds Pj[i] = ((Poj[i+1] + Poj[i-1])*dx2
                               + (Pojp[i] + Pojm[i])*dy2)/denom - Rmodj[i]
        end
    end
    err = vecnorm(P::Array{Float64,2} - P_old::Array{Float64,2})/sqrt(nx+ny)
    return (P, err)
end
However, this only seems to increase memory allocation and decrease speed, and I get a @simd warning:
julia> @time methodwrap(jacobi2,1000);
Warning: could not attach metadata for @simd loop.
Solving using jacobi2 method: 100%|##########| ETA: 0:00:00
elapsed time: 4.947097666 seconds (1455818184 bytes allocated, 29.85% gc time)
This is my first project in Julia, so I'm probably making a really obvious mistake, but I haven't found a solution yet. I've defined global vars as constants. I've gone through the performance tips several times, I've linted the file, I've used TypeCheck to make sure my types are consistent, and everything looks fairly kosher to my eyes. What am I doing wrong? I've posted my full code on Gist if you'd like to check that as well.
It turns out the problem was subtle. I made the changes below. As @IainDunning suggested, I used --track-allocation=user, which pointed to the questionable line. Both problems come from using global variables.
After these changes
julia> @time methodwrap(jacobi,1000)
elapsed time: 0.481986712 seconds (116650236 bytes allocated)
Change 1: add const to nx and ny
You had const everywhere except for these 2 variables, but when left non-const and global, they cause the loop iterator i to allocate unnecessarily.
nx=80 # Number of mesh points in the x-direction
ny=80 # Number of mesh points in the y-direction
was changed to
const nx=80 # Number of mesh points in the x-direction
const ny=80 # Number of mesh points in the y-direction
Change 2: avoid Rmod of type Array{Any,2}
const Rmod = dx2*dy2*R/(2*(dx2+dy2))
was changed to
const Rmod = convert(Array{Float64,2},dx2*dy2*R/(2*(dx2+dy2)))
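For reference, the --track-allocation workflow mentioned above looks roughly like this (a sketch of standard Julia tooling; the script name is hypothetical):

# launch with: julia --track-allocation=user
include("poisson.jl")          # hypothetical file containing the solver
methodwrap(jacobi, 10)         # warm up so compilation is not attributed
Profile.clear_malloc_data()    # reset the counters after warm-up
methodwrap(jacobi, 1000)       # on exit, per-line byte counts appear in *.mem files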
