Optimize Julia Code by Example - performance

I am currently writing a numerical solver in Julia. I don't think the math behind it matters too much; it all boils down to the fact that one specific operation is executed many times and accounts for a large percentage (~80%) of the running time.
I tried to reduce it as much as possible and present this piece of code, which can be saved as dummy.jl and then executed via include("dummy.jl") followed by dummy(10) (for compilation) and then dummy(1000).
function dummy(N::Int64)
    A = rand(N,N)
    @time timethis(A)
end

function timethis(A::Array{Float64,2})
    dummyvariable = 0.0
    for k=1:100 # just repeat a few times
        for i=2:size(A)[1]-1
            for j=2:size(A)[2]-1
                dummyvariable += slopefit(A[i-1,j],A[i,j],A[i+1,j],2.0)
                dummyvariable += slopefit(A[i,j-1],A[i,j],A[i,j+1],2.0)
            end
        end
    end
    println(dummyvariable)
end

@inline function minmod(x::Float64, y::Float64)
    return sign(x) * max(0.0, min(abs(x), y*sign(x)));
end

@inline function slopefit(left::Float64,center::Float64,right::Float64,theta::Float64)
    # arg=ccall((:minmod,"libminmod"),Float64,(Float64,Float64),0.5*(right-left),theta*(center-left));
    # result=ccall((:minmod,"libminmod"),Float64,(Float64,Float64),theta*(right-center),arg);
    # return result
    tmp = minmod(0.5*(right-left),theta*(center-left));
    return minmod(theta*(right-center),tmp);
    # return 1.0
end
Here, timethis imitates the part of the code where I spend a lot of time. I notice that slopefit is extremely expensive to execute.
For example, dummy(1000) takes roughly 4 seconds on my machine. If slopefit instead always returned 1 and did not compute anything, the running time would drop to about one tenth of the total.
Now, obviously there is no free lunch.
I am aware that this is simply a costly operation. But I would still like to optimize it as much as possible, given that a lot of time is spent in something that looks easy to optimize, since it is just a few lines of code.
So far, I have tried to implement minmod and slopefit as C functions and call them; however, that only increased computing time (maybe I did it wrong).
So my question is: what possibilities do I have to optimize the call of slopefit?
Note that in the actual code, the arguments of slopefit are not the ones shown here but depend on conditional statements, which makes everything hard to vectorize (and I am not sure whether that would bring any performance gain).

There are two levels of optimization I can think of.
First: the following implementation of minmod will be faster as it avoids branching (I understand this is the functionality you want):
@inline minmod(x::Float64, y::Float64) = ifelse(x<0, clamp(y, x, 0.0), clamp(y, 0.0, x))
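As a quick sanity check (a hypothetical snippet, not part of the original answer; the names minmod_orig and minmod_fast are made up for illustration), you can verify that the branchless version agrees with the original definition on random inputs:

# Hypothetical check that the two minmod definitions agree.
minmod_orig(x, y) = sign(x) * max(0.0, min(abs(x), y*sign(x)))
minmod_fast(x, y) = ifelse(x < 0, clamp(y, x, 0.0), clamp(y, 0.0, x))

xs = randn(10^5); ys = randn(10^5)
println(all([minmod_orig(xs[i], ys[i]) == minmod_fast(xs[i], ys[i]) for i = 1:length(xs)]))  # expect: true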
Second: you can use @inbounds to speed up the loop a bit:
@inbounds for i=2:size(A)[1]-1
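Putting the two suggestions together, here is a sketch of what the hot path could look like (it simply combines the branchless minmod above with @inbounds on the original timethis; skipping bounds checks is safe here because i and j stay strictly inside the array):

@inline minmod(x::Float64, y::Float64) = ifelse(x < 0, clamp(y, x, 0.0), clamp(y, 0.0, x))

@inline function slopefit(left::Float64, center::Float64, right::Float64, theta::Float64)
    tmp = minmod(0.5*(right-left), theta*(center-left))
    return minmod(theta*(right-center), tmp)
end

function timethis(A::Array{Float64,2})
    dummyvariable = 0.0
    for k = 1:100
        @inbounds for i = 2:size(A,1)-1, j = 2:size(A,2)-1
            # interior-only stencil, so the indices never leave the array
            dummyvariable += slopefit(A[i-1,j], A[i,j], A[i+1,j], 2.0)
            dummyvariable += slopefit(A[i,j-1], A[i,j], A[i,j+1], 2.0)
        end
    end
    println(dummyvariable)
end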

Related

Poor performance in matlab

So I had to write a program in Matlab to calculate the convolution of two functions manually. I wrote this simple piece of code, which I know is probably not well optimized:
syms recP(x);
recP(x) = rectangularPulse(-1,1,x);
syms triP(x);
triP(x) = triangularPulse(-1,1,x);

t = -10:0.1:10;
s1 = -10:0.1:10;
for i = 1:201
    s1(i) = 0;
    for j = t
        s1(i) = s1(i) + ( recP(j) * triP(t(i)-j) );
    end
end
plot(t,s1);
I have a Core i7-7700HQ coupled with 32 GB of RAM. Matlab is installed on my HDD and my Windows is on my SSD. The problem is that this simple piece of code takes, I think, at least 20 minutes to run. I have it in a section and I don't run the whole code. Matlab is only using 18% of my CPU and 3 GB of RAM for this task, which I think should probably be enough, but I don't think it should take that long.
Am I doing anything wrong? I've searched for how to increase the RAM limit of Matlab, and I found that it is not limited and takes as much as it needs. I don't know whether I can increase its CPU usage or not.
Is there any way to make things a little bit faster? I have like 6 or 7 of these for loops in my homework and it takes forever if I run the whole live script. Thanks in advance for your help.
(Also, Matlab highlights the piece of code that is currently running; it is the outer for loop that is highlighted.)
Like Ander said, use the symbolic toolbox in matlab as a last resort. Additionally, when trying to speed up matlab code, focus on taking advantage of matlab's vectorized operations. What I mean by this is matlab is very efficient at performing operations like this:
y = x.*z;
where x and z are each Nx1 vectors and the operator '.*' performs element-wise multiplication. This essentially tells Matlab to compute x(1)*z(1), x(2)*z(2), ..., x(n)*z(n) and assign each result to the corresponding element of the vector y. Additionally, many of the functions in Matlab can accept vectors as inputs, perform their operation on each element, and return an equally sized vector with the output for each element. You can check this for any given function by scrolling down in its documentation to the inputs and outputs section and checking what form of array the inputs and outputs can take. For example, rectangularPulse's documentation says it can accept vectors as inputs. Therefore, you can simplify your inner loop to this:
s1(i) = sum( rectangularPulse(-1,1,t) .* triP(t(i)-t) );
So to summarize:
Avoid the symbolic toolbox in matlab until you have a better handle of what you're doing or you absolutely have to use it.
Use matlab's ability to handle vectors and arrays very well.
Deconstruct any nested loops you write one at a time from the inside out. Usually this dramatically accelerates matlab code especially when you are new to writing it.
See if you can even further simplify the code and get rid of your outer loop as well.

Can I take advantage of parallelization to make this piece of code faster?

OK, a follow-up of this and this question. The code I want to modify is of course:
function fdtd1d_local(steps, ie = 200)
    ez = zeros(ie + 1);
    hy = zeros(ie);
    for n in 1:steps
        for i in 2:ie
            ez[i] += (hy[i] - hy[i-1])
        end
        ez[1] = sin(n/10)
        for i in 1:ie
            hy[i] += (ez[i+1] - ez[i])
        end
    end
    (ez, hy)
end

fdtd1d_local(1);
@time sol1 = fdtd1d_local(10);
elapsed time: 3.4292e-5 seconds (4148 bytes allocated)
And I've naively tried:
function fdtd1d_local_parallel(steps, ie = 200)
    ez = dzeros(ie + 1);
    hy = dzeros(ie);
    for n in 1:steps
        for i in 2:ie
            localpart(ez)[i] += (hy[i] - hy[i-1])
        end
        localpart(ez)[1] = sin(n/10)
        for i in 1:ie
            localpart(hy)[i] += (ez[i+1] - ez[i])
        end
    end
    (ez, hy)
end

fdtd1d_local_parallel(1);
@time sol2 = fdtd1d_local_parallel(10);
elapsed time: 0.0418593 seconds (3457828 bytes allocated)
sol2==sol1
true
The result is correct, but the performance is much worse. So why? Is it because parallelization isn't worthwhile on an old dual-core laptop, or am I wrong again?
Well, I admit that the only thing I know about parallelization is that it can speed up code, but not every piece of code can be parallelized. Is there any basic knowledge one should have before trying parallel programming?
Any help would be appreciated.
There are several things going on. First, notice the difference in memory consumed. That's a sign that something is wrong. You'll get greater clarity by separating allocation (your zeros and dzeros lines) from the core algorithm. However, it's unlikely that very much of that memory is being used by allocation; more likely, something in your loop is using memory. Notice that you're assigning to the localpart on the left-hand side, but indexing the raw DArray on the right-hand side. That may be triggering some IPC traffic. If you need to debug the memory consumption, see the ProfileView package.
Second, it's not obvious to me that you're really breaking the problem up among processes. You're looping over each element of the whole array; instead, you should have each worker loop over its own piece of the array. However, you're going to run into problems at the edges between localparts, because the updates require the neighboring values. You'd be much better off using a SharedArray.
Finally, launching parallel workers has overhead; for small problems, you're better off not parallelizing and just using simple algorithms. Only when the computation time gets to hundreds of milliseconds (or more) would I even think about going to the effort to parallelize.
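To make the SharedArray suggestion concrete, here is a rough sketch (not from the original answer; it assumes workers have already been added with addprocs(2) and uses the SharedArray / @parallel syntax of the Julia version used in this thread):

# Hedged sketch: the same FDTD update on a SharedArray, so every worker
# can read its neighbours' values directly instead of going through IPC.
function fdtd1d_shared(steps, ie = 200)
    ez = SharedArray(Float64, ie + 1); fill!(ez, 0.0)
    hy = SharedArray(Float64, ie);     fill!(hy, 0.0)
    for n in 1:steps
        @sync @parallel for i in 2:ie
            ez[i] += (hy[i] - hy[i-1])
        end
        ez[1] = sin(n/10)
        @sync @parallel for i in 1:ie
            hy[i] += (ez[i+1] - ez[i])
        end
    end
    (sdata(ez), sdata(hy))
end

As the answer notes, for a problem as small as ie = 200 the scheduling overhead of the parallel loops will dominate, so something like this only pays off for much larger arrays.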
N.B.: I'm a relative Julia, FDTD, Maxwell's Equations, and parallel processing noob.
@tholy provided a good answer presenting the important issues to be considered.
In addition, the Wikipedia Finite-difference time-domain method page presents some good info with references and links to software packages, some of which use some style of parallel processing.
It seems that many parallel processing approaches to FDTD partition the physical environment into smaller chunks and then calculate the chunks in parallel. One complication is that the boundary conditions must be passed between adjacent chunks.
Using your toy 1D problem, and my limited Julia skills, I implemented the toy to use two cores on my machine. It's not the most general, modular, extendable, effective, nor efficient, but it does demonstrate parallel processing. Hopefully a Julia wizard will improve it.
Here's the Julia code I used:
addprocs(2)

@everywhere function ez_front(n::Int, ez::DArray, hy::DArray)
    ez_local = localpart(ez)
    hy_local = localpart(hy)
    ez_local[1] = sin(n/10)
    @simd for i = 2:length(ez_local)
        @inbounds ez_local[i] += (hy_local[i] - hy_local[i-1])
    end
end

@everywhere function ez_back(ez::DArray, hy::DArray)
    ez_local = localpart(ez)
    hy_local = localpart(hy)
    index_boundary::Int = first(localindexes(hy)[1]) - 1
    ez_local[1] += (hy_local[1] - hy[index_boundary])
    @simd for i = 2:length(ez_local)
        @inbounds ez_local[i] += (hy_local[i] - hy_local[i-1])
    end
end

@everywhere function hy_front(ez::DArray, hy::DArray)
    ez_local = localpart(ez)
    hy_local = localpart(hy)
    index_boundary = last(localindexes(ez)[1]) + 1
    @simd for i = 1:(length(hy_local)-1)
        @inbounds hy_local[i] += (ez_local[i+1] - ez_local[i])
    end
    hy_local[end] += (ez[index_boundary] - ez_local[end])
end

@everywhere function hy_back(ez::DArray, hy::DArray)
    ez_local = localpart(ez)
    hy_local = localpart(hy)
    @simd for i = 2:(length(hy_local)-1)
        @inbounds hy_local[i] += (ez_local[i+1] - ez_local[i])
    end
    hy_local[end] -= ez_local[end]
end

function fdtd1d_parallel(steps::Int, ie::Int = 200)
    ez = dzeros((ie,), workers()[1:2], 2)
    hy = dzeros((ie,), workers()[1:2], 2)
    for n = 1:steps
        @sync begin
            @async begin
                remotecall(workers()[1], ez_front, n, ez, hy)
                remotecall(workers()[2], ez_back, ez, hy)
            end
        end
        @sync begin
            @async begin
                remotecall(workers()[1], hy_front, ez, hy)
                remotecall(workers()[2], hy_back, ez, hy)
            end
        end
    end
    (convert(Array{Float64}, ez), convert(Array{Float64}, hy))
end

fdtd1d_parallel(1);
@time sol2 = fdtd1d_parallel(10);
On my machine (an old 32-bit 2-core laptop), this parallel version wasn't faster than the local version until ie was set to somewhere around 5000000.
This is an interesting case for learning parallel processing in Julia, but if I needed to solve Maxwell's equations using FDTD, I'd first consider the many FDTD software libraries that are already available. Perhaps a Julia package could interface to one of those.

Unexpected slowdown of function that modifies array in-place

This bug is due to Matlab being too smart for its own good.
I have something like
for k=1:N
    stats = subfun(E,k,stats);
end
where stats is a 1xN array, N=5000 say, and subfun calculates stats(k) from E and fills it into stats:
function stats = subfun(E,k,stats)
    s = mean(E);
    stats(k) = s;
end
Of course, there is some overhead in passing a large array back and forth, only to fill in one of its elements. In my case, however, the overhead is negligible, and I prefer this code instead of
for k=1:N
    s = subfun(E,k);
    stats(k) = s;
end
My preference is because I actually have a lot more assignments than just stats.
Also some of the assignments are actually a good deal more complicated.
As mentioned, the overhead is negligible. But if I add something trivial, like this inconsequential if-statement,
for k=1:N
    i = k;
    if i>=1
        stats = subfun(E,i,stats);
    end
end
the assignments that take place inside subfun suddenly take "forever" (the time increases much faster than linearly with N). And it's the assignment, not the calculation, that takes forever. In fact, it is even worse than the following nonsensical subfun,
function stats = subfun(E,k,stats)
    s = calculation_on_E(E);
    clear stats
    stats(k) = s;
end
which requires re-allocation of stats every time.
Does anybody have the faintest idea why this happens?
This might be due to some obscure detail of Matlab's JIT. The JIT of recent versions of Matlab knows not to create a new array, but to do modifications in-place in some limited cases. One of the requirements is that the function is defined as
function x = modify_big_matrix(x, i, j)
    x(i, j) = 123;
and not as
function x_out = modify_big_matrix(x_in, i, j)
    x_out = x_in;
    x_out(i, j) = 123;
Your examples seem to follow this rule, so, as Praetorian mentioned, your if statement might prevent the JIT from recognizing that it is an in-place operation.
If you really need to speed up your algorithm, it is possible to modify arrays in-place using your own mex-functions. I have successfully used this trick to gain a factor of 4 speedup on some medium-sized arrays (order 100x100x100 IIRC). This is, however, not recommended: it could segfault Matlab if you are not careful and might stop working in future versions.
As discussed by others, the problem almost certainly lies with JIT and its relatively fragile ability to modify in place.
As mentioned, I really prefer the first form of the function call and assignments, although other workable solutions have been suggested. Without relying on JIT, the only way this can be efficient (as far as I can see) is some form of passing by reference.
Therefore I made a class Stats that inherits from handle, and which contains the data array for k=1:N. It is then passed by reference.
For future reference, this seems to work very well, with good performance, and I'm currently using it as my working solution.

Vectorization of matlab code

I'm kinda new to vectorization. I have tried myself but couldn't do it. Can somebody help me vectorize this code, as well as give a short explanation of how you do it, so that I can adapt the thinking process too? Thanks.
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
%This function calculates whether a point is allowed.
%First a quick test is done by calculating the distance from the point to
%each point of the polygon. If that distance is smaller than range "r",
%the point is not allowed. This will slow down the algorithm at some
%points, but will greatly speed it up in others because fewer calls to the
%circleTest routine are needed.

polySize = size(Polygon,1);
testCounter = 0;
for i = 1:polySize
    d = sqrt(sum((Polygon(i,:)-point).^2));
    if d < tol*r
        testCounter = 1;
        break
    end
end
if testCounter == 0
    circleTestResult = circleTest (point,Polygon,r,tol,stepSize);
    testCounter = circleTestResult;
end
result = testCounter;
Given the information that Polygon is 2 dimensional, point is a row vector and the other variables are scalars, here is the first version of your new function (scroll down to see that there are lots of ways to skin this cat):
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
result = 0;
linDiff = Polygon-repmat(point,size(Polygon,1),1);
testLogicals = sqrt( sum( ( linDiff ).^2 ,2 )) < tol*r;
if any(testLogicals); result = circleTest (point,Polygon,r,tol,stepSize); end
The thought process for vectorization in Matlab involves trying to operate on as much data as possible using a single command. Most of the basic builtin Matlab functions operate very efficiently on multi-dimensional data. Using a for loop is the reverse of this, as you are breaking your data down into smaller segments for processing, each of which must be interpreted individually. By resorting to data decomposition using for loops, you potentially lose some of the massive performance benefits associated with the highly optimised code behind the Matlab builtin functions.
The first thing to think about in your example is the conditional break in your main loop. You cannot break from a vectorized process. Instead, calculate all possibilities, make an array of the outcome for each row of your data, then use the any keyword to see if any of your rows have signalled that the circleTest function should be called.
NOTE: It is not easy to efficiently conditionally break out of a calculation in Matlab. However, as you are just computing a form of Euclidean distance in the loop, you'll probably see a performance boost by using the vectorized version and calculating all possibilities. If the computation in your loop were more expensive, the input data were large, and you wanted to break out as soon as you hit a certain condition, then a matlab extension made with a compiled language could potentially be much faster than a vectorized version where you might be performing needless calculation. However this is assuming that you know how to program code that matches the performance of the Matlab builtins in a language that compiles to native code.
Back on topic ...
The first thing to do is to take the linear difference (linDiff in the code example) between Polygon and your row vector point. To do this in a vectorized manner, the dimensions of the 2 variables must be identical. One way to achieve this is to use repmat to copy each row of point to make it the same size as Polygon. However, bsxfun is usually a superior alternative to repmat (as described in this recent SO question), making the code ...
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
result = 0;
linDiff = bsxfun(@minus, Polygon, point);
testLogicals = sqrt( sum( ( linDiff ).^2 ,2 )) < tol*r;
if any(testLogicals); result = circleTest (point,Polygon,r,tol,stepSize); end
I rolled your scalar d into a column vector of distances by summing across the 2nd dimension (note the removal of the array index from Polygon and the addition of ,2 in the sum command). I then went further and evaluated the logical array testLogicals inline with the calculation of the distance measure. You will quickly see that a downside of heavy vectorisation is that it can make the code less readable to those not familiar with Matlab, but the performance gains are worth it. Comments are pretty necessary.
Now, if you want to go completely crazy, you could argue that the test function is so simple now that it warrants use of an 'anonymous function' or 'lambda' rather than a complete function definition. The test for whether or not it is worth doing the circleTest does not require the stepSize argument either, which is another reason for perhaps using an anonymous function. You can roll your test into an anonymous function and then just use circleTest in your calling script, making the code self-documenting to some extent...
doCircleTest = @(point,Polygon,r,tol) any(sqrt( sum( bsxfun(@minus, Polygon, point).^2, 2 )) < tol*r);
if doCircleTest(point,Polygon,r,tol)
    result = circleTest (point,Polygon,r,tol,stepSize);
else
    result = 0;
end
Now everything is vectorised, the use of function handles gives me another idea . . .
If you plan on performing this at multiple points in the code, the repetition of the if statements would get a bit ugly. To stay DRY, it seems sensible to put the test with the conditional function into a single function, just as you did in your original post. However, the utility of that function would be very narrow - it would only test whether the circleTest function should be executed, and then execute it if need be.
Now imagine that after a while, you have some other conditional functions, just like circleTest, with their own equivalent of doCircleTest. It would be nice to reuse the conditional switching code maybe. For this, make a function like your original that takes a default value, the boolean result of the computationally cheap test function, and the function handle of the expensive conditional function with its associated arguments ...
function result = conditionalFun( default, cheapFunResult, expensiveFun, varargin )
    if cheapFunResult
        result = expensiveFun(varargin{:});
    else
        result = default;
    end
end %//of function
You could call this function from your main script with the following . . .
result = conditionalFun(0, doCircleTest(point,Polygon,r,tol), @circleTest, point,Polygon,r,tol,stepSize);
...and the beauty of it is you can use any test, default value, and expensive function. Perhaps a little overkill for this simple example, but it is where my mind wandered when I brought up the idea of using function handles.

Why does the Matlab Profiler say there is a bottleneck on the 'end' statement of a 'for' loop?

So, I've recently started using Matlab's built-in profiler on a regular basis, and I've noticed that while its usually great at showing which lines are taking up the most time, sometimes it'll tell me a large chunk of time is being used on the end statement of a for loop.
Now, seeing as such a line is just used for denoting the end of the loop, I can't imagine how it could use anything other than a trivial amount of processing.
I've seen a specific version of this question asked on matlab central, but a consensus didn't seem to be reached.
EDIT: Here's a minimal example of this problem:
for i = 1:1000
    x = 1;
    x = [x 1];
    % clear x;
end
Even if you uncomment the clear, the end line still takes up a lot of computation (about 20%), and the clear actually increases the absolute amount of computation performed by the end line.
When I've seen this in my code, it's been the deallocation of large temporaries created in the loop. Each new variable created in the loop is deallocated at the end.
