My loops are slow. Is that because of if statements? - performance

I read this post and realized that loops are faster in Julia. Thus, I decided to change my vectorized code into loops. However, I had to use a few if statements in my loop but my loops slowed down after I added more such if statements.
Consider this excerpt, which I directly copied from the post:
function devectorized()
a = [1.0, 1.0]
b = [2.0, 2.0]
x = [NaN, NaN]
for i in 1:1000000
for index in 1:2
x[index] = a[index] + b[index]
end
end
return
end
function time(N)
timings = Array(Float64, N)
# Force compilation
devectorized()
for itr in 1:N
timings[itr] = #elapsed devectorized()
end
return timings
end
I then added a few if statements to test the speed:
function devectorized2()
a = [1.0, 1.0]
b = [2.0, 2.0]
x = [NaN, NaN]
for i in 1:1000000
for index in 1:2
####repeat this 6 times
if index * i < 20
x[index] = a[index] - b[index]
else
x[index] = a[index] + b[index]
end
####
end
end
return
end
I repeated this block six times:
if index * i < 20
x[index] = a[index] - b[index]
else
x[index] = a[index] + b[index]
end
For the sake of conciseness, I'm not repeating this block in my sample code. After repeating the if statements 6 times, devectorized2() took 3 times as long.
I have two questions:
Are there better ways to implement if statements?
Why are if statements so slow? I know that Julia is trying to do loops in a way that matches C. Is Julia providing better "translation" between Julia and C and these if statements just made the translation process more difficult?

Firstly, I don't think the performance here is very odd, since you're adding a lot of work to your function.
Secondly, you should actually return x here, otherwise the compiler might decide that you're not using x, and just skip the whole computation, which would thoroughly confuse the timings.
Thirdly, to answer your question 1: You can implement it like this:
x[index] = a[index] + ifelse(index * i < 20, -1, 1) * b[index]
This can be faster in some cases, but not necessarily in your case, where the branch is very easy to predict. Sometimes you can also get speedups by using Bools, for example like this:
x[index] = a[index] + (2*(index * i >= 20)-1) * b[index]
Again, in your example this doesn't help much, but there are cases when this approach can give you a decent speedup.
BTW: It isn't necessarily always true that loops are preferable to vectorized code any longer. The post you linked to is quite old. Take a look at this blog post, which shows how vectorized code can achieve similar performance to loopy code. In many cases, though, a loop is the clearest, easiest and fastest way to accomplish your goal.

Related

Optimizing matrix multiplication with varying sizes

Suppose I have the following data generating process
using Random
using StatsBase
m_1 = [1.0 2.0]
m_2 = [1.0 2.0; 3.0 4.0]
DD = []
y = zeros(2,200)
for i in 1:100
rand!(m_1)
rand!(m_2)
push!(DD, m_1)
push!(DD, m_2)
end
idxs = sample(1:200,10)
for i in idxs
DD[i] = DD[1]
end
and suppose given the data, I have the following function
function test(y, DD, n)
v_1 = [1 2]
v_2 = [3 4]
for j in 1:n
for i in 1:size(DD,1)
if size(DD[i],1) == 1
y[1:size(DD[i],1),i] .= (v_1 * DD[i]')[1]
else
y[1:size(DD[i],1),i] = (v_2 * DD[i]')'
end
end
end
end
I'm struggling to optimize the speed of test. In particular, memory allocation increases as I increase n. However, I'm not really allocating anything new.
The data generating process captures the fact that I don't know for sure the size of DD[i] beforehand. That is, the first time I call test, DD[1] could be a 2x2 matrix. The second time I call test, DD[1] could be a 1x2 matrix. I think this could be part of the issue with memory allocation: Julia doesn't know the sizes beforehand.
I'm completely stuck. I've tried #inbounds but that didn't help. Is there a way to improve this?
One important thing to check for performance is that Julia can understand the types. You can check this by running #code_warntype test(y, DD, 1), the output will make it clear that DD is of type Any[] (since you declared it that way). Working with Any can incur quite a performance penalty so declaring DD = Matrix{Float64}[] cuts the time to a third in my testing.
I'm not sure how close this example is to the actual code you want to write but in this particular case the size(DD[i],1) == 1 branch can be replaced by a call to LinearAlgebra.dot:
y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
this cuts the time by another 50% for me. Finally you can squeeze out just a tiny bit more by using mul! to perform the other multiplication in place:
mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2')
Full example:
using Random
using LinearAlgebra
DD = [rand(i,2) for _ in 1:100 for i in 1:2]
y = zeros(2,200)
shuffle!(DD)
function test(y, DD, n)
v_1 = [1 2]
v_2 = [3 4]'
for j in 1:n
for i in 1:size(DD,1)
if size(DD[i],1) == 1
y[1:size(DD[i],1),i] .= dot(v_1, DD[i])
else
mul!(view(y, 1:size(DD[i],1),i:i), DD[i], v_2)
end
end
end
end

Is there a way to refactor the julia code below in order to avoid the loop/malloc?

m,n =size(l.x)
for batch=1:m
l.ly = l.y[batch,:]
l.jacobian .= -l.ly .* l.ly'
l.jacobian[diagind(l.jacobian)] .= l.ly.*(1.0.-l.ly)
# # n x 1 = n x n * n x 1
l.dldx[batch,:] = l.jacobian * DLDY[batch,:]
end
return l.dldx
l.x is a m by n matrix. l.y is another matrix with the same size as l.x. My goal is to create another m by n matrix, l.dldx, in which each row is the result of the operation inside the for loop. Can any one spot further optimization for this block of code? The code above is part of https://github.com/stevenygd/NN.jl.
The following should implement the same calculation and is more efficient:
l.dldx = l.y .* (DLDY .- sum( l.y .* DLDY , 2))
There might be a slight improvement available by refactoring the sum into a loop.
As the question does not have runnable code, or a test case, it is hard to give definite benchmarks, so feedback would be welcome.
UPDATE
Here is the code above with explicit loops:
function calc_dldx(y,DLDY)
tmp = zeros(eltype(y),size(y,1))
dldx = similar(y)
#inbounds for j=1:size(y,2)
for i=1:size(y,1)
tmp[i] += y[i,j]*DLDY[i,j]
end
end
#inbounds for j=1:size(y,2)
for i=1:size(y,1)
dldx[i,j] = y[i,j]*(DLDY[i,j]-tmp[i])
end
end
return dldx
end
The long version should run even faster. A good way to measure the performance of code is using the BenchmarkTools package.

breaking out of a loop in Julia

I have a Vector of Vectors of different length W. These last vectors contain integers between 0 and 150,000 in steps of 5 but can also be empty. I am trying to compute the empirical cdf for each of those vectors. I could compute these cdf iterating over every vector and every integer like this
cdfdict = Dict{Tuple{Int,Int},Float64}()
for i in 1:length(W)
v = W[i]
len = length(v)
if len == 0
pcdf = 1.0
else
for j in 0:5:150_000
pcdf = length(v[v .<= j])/len
cdfdict[i, j] = pcdf
end
end
end
However, this approach is inefficient because the cdf will be equal to 1 for j >= maximum(v) and sometimes this maximum(v) will be much lower than 150,000.
My question is: how can I include a condition that breaks out of the j loop for j > maximum(v) but still assigns pcdf = 1.0 for the rest of js?
I tried including a break when j > maximum(v) but this, of course, stops the loop from continuing for the rest of js. Also, I can break the loop and then use get! to access/include 1.0 for keys not found in cdfdict later on, but that is not what I'm looking for.
To elaborate on my comment, this answer details an implementation which fills an Array instead of a Dict.
First to create a random test case:
W = [rand(0:mv,rand(0:10)) for mv in floor(Int,exp(log(150_000)*rand(10)))]
Next create an array of the right size filled with 1.0s:
cdfmat = ones(Float64,length(W),length(0:5:150_000));
Now to fill the beginning of the CDFs:
for i=1:length(W)
v = sort(W[i])
k = 1
thresh = 0
for j=1:length(v)
if (j>1 && v[j]==v[j-1])
continue
end
pcdf = (j-1)/length(v)
while thresh<v[j]
cdfmat[i,k]=pcdf
k += 1
thresh += 5
end
end
end
This implementation uses a sort which can be slow sometimes, but the other implementations basically compare the vector with various values which is even slower in most cases.
break only does one level. You can do what you want by wrapping the for loop function and using return (instead of where you would've put break), or using #goto.
Or where you would break, you could switch a boolean breakd=true and then break, and at the bottom of the larger loop do if breakd break end.
You can use another for loop to set all remaining elements to 1.0. The inner loop becomes
m = maximum(v)
for j in 0:5:150_000
if j > m
for k in j:5:150_000
cdfdict[i, k] = 1.0
end
break
end
pcdf = count(x -> x <= j, v)/len
cdfdict[i, j] = pcdf
end
However, this is rather hard to understand. It would be easier to use a branch. In fact, this should be just as fast because the branch is very predictable.
m = maximum(v)
for j in 0:5:150_000
if j > m
cdfdict[i, j] = 1.0
else
pcdf = count(x -> x <= j, v)/len
cdfdict[i, j] = pcdf
end
end
Another answer gave an implementation using an Array which calculated the CDF by sorting the samples and filling up the CDF bins with quantile values. Since the whole Array is thus filled, doing another pass on the array should not be overly costly (we tolerate a single pass already). The sorting bit and the allocation accompanying it can be avoided by calculating a histogram in the array and using cumsum to produce a CDF. Perhaps the code will explain this better:
Initialize sizes, lengths and widths:
n = 10; w = 5; rmax = 150_000; hl = length(0:w:rmax)
Produce a sample example:
W = [rand(0:mv,rand(0:10)) for mv in floor(Int,exp(log(rmax)*rand(n)))];
Calculate the CDFs:
cdfmat = zeros(Float64,n,hl); # empty histograms
for i=1:n # drop samples into histogram bins
for j=1:length(W[i])
cdfmat[i,1+(W[i][j]+w-1)รท5]+=one(Float64)
end
end
cumsum!(cdfmat,cdfmat,2) # calculate pre-CDF by cumsum
for i=1:n # normalize each CDF by total
if cdfmat[i,hl]==zero(Float64) # check if histogram empty?
for j=1:hl # CDF of 1.0 as default (might be changed)
cdfmat[i,j] = one(Float64)
end
else # the normalization factor calc-ed once
f = one(Float64)/cdfmat[i,hl]
for j=1:hl
cdfmat[i,j] *= f
end
end
end
(a) Note the use of one,zero to prepare for change of Real type - this is good practice. (b) Also adding various #inbounds and #simd should optimize further. (c) Putting this code in a function is recommended (this is not done in this answer). (d) If having a zero CDF for empty samples is OK (which means no samples means huge samples semantically), then the second for can be simplified.
See other answers for more options, and reminder: Premature optimization is the root of all evil (Knuth??)

More Efficient Alternatives to Max and Min

So, I'm trying to optimize a program I made, and two glaring inefficiencies I have found with the help of the profiler are these:
if (min(image_arr(j,i,:)) > 0.1)
image_arr(j,i,:) = image_arr(j,i,:) - min(image_arr(j,i,:));
end
%"Grounds" the data, making sure the points start close to 0
Called 4990464 times, takes 58.126s total, 21.8% of total compile time.
[max_mag , max_index] = max(image_arr(j, i, :));
%finds the maximum value and its index in the set
Called 4990464 times, takes 50.900s total, 19.1% of total compile time.
Is there any alternative to max and min that I can use here, that would be more efficient?
There is no way to reduce the number of times these lines are called.
Based on the call count, these are probably inside a loop. Both min and max are vectorized (they work on vectors of vectors).
Since you want to find extrema along the third dimension, you can use:
image_arr = bsxfun(#minus, image_arr, min(image_arr, [], 3));
and
[max_mag , max_index] = max(image_arr, [], 3);
It seems like:
if (min(image_arr(j,i,:)) > 0.1)
image_arr(j,i,:) = image_arr(j,i,:) - min(image_arr(j,i,:));
end
could be rewritten like this:
data = image_arr(j,i,:);
mn = min(data);
if (mn > 0.1)
image_arr(j,i,:) = data - mn;
end
which seems like the inner loop of something that could be written like:
minarr = min(image_arr)
[a,b] = find(minarr > 0.1);
image_arr(a,b,:) = image_arr(a,b,:) - minarr(a,b)
Rename your i and j.
Those names have meaning to MATLAB, and every time it sees them it has to check whether you have your own definition or they mean sqrt(-1).
The first part can be done without loops using bsxfun.
m = min(image_arr,[],3);
image_arr = bsxfun(#minus, image_arr, m.*(m>0.1));

Vectorization of nested loops and if statements in MATLAB

I am fairly new to the concept of vectorization in MATLAB so please excuse my naivety in this regard. I was trying to vectorize the following MATLAB code which includes if statements within nested for loops:
h = zeros(dimV);
for a = 1 : dimV
for b = 1 : dimV
if a ~= b && C(a,b) == 1
h(a,b) = C(a,b)*exp(-1i*A(a,b)*L(a,b))*(sin(k*L(a,b)))^-1;
else if a == b
for m = 1 : dimV
if m ~= a && C(a,m) == 1
h(a,b) = h(a,b) - C(a,m)*cot(k*L(a,m));
end
end
end
end
end
end
Here the variable dimV specifies the size of the matrix h and is fairly large, of the order of 100, and C is a symmetric square matrix (previously defined) of size dimV all of whose off-diagonal elements are either 0 or 1 and the diagonal elements are necesarrily 0. The elements of matrix L are also zero in the same positions as the zeros of the matrix C. Following the vectorization techniques that I found on this website and here, I was able to vectorize the code, albeit partially, and my MWE is as follows:
h = zeros(dimV);
idx = (C == 1);
h(idx) = C(idx).*exp(-1i*A(idx).*L(idx)).*(sin(k*L(idx))).^-1;
for a = 1:dimV;
m = 1 : dimV;
m = m(C(a,:) == 1);
h(a,a) = - sum (C(a,m).*cot(k*L(a,m)));
end
My main problem is in converting the for loop in the variable a to a vector as I need the individual values of a to address the diagonal elements of h. I compared the evaluation time of both the code blocks using the MATLAB profiler and the latter version is only marginally faster and the improvement in efficiency is really insignificant. In fact the profiler showed that the line allocating the values to h(a,a) takes up nearly 50% of the execution time in the second case. So I was wondering if there is a more elegant way to rewrite the above code using the appropriate vectorization schemes, which would help improve its efficiency. I am really in a bit of bother about this and any I would be greatly appreciative of any help in this regard. Thank you so much.
Vectorized Code -
diag_ind = 1:dimV+1:numel(C);
C_neq1 = C~=1;
parte1_2 = (sin(k.*L)).^-1;
parte1_2(C_neq1 & (L==0))=0;
parte1 = exp(-1i.*A.*L).*parte1_2;
parte1(C_neq1)=0;
h = parte1;
parte2 = cot(k*L);
parte2(C_neq1)=0;
parte2(diag_ind)=0;
h(diag_ind) = - sum(parte2,2);

Resources