Unexpected slowdown of function that modifies array in-place - performance

This bug is due to Matlab being too smart for its own good.
I have something like
for k=1:N
stats = subfun(E,k,stats);
end
where stats is a 1xN array, with N=5000 say, and subfun calculates stats(k) from E and fills it into stats:
function stats = subfun(E,k,stats)
s = mean(E);
stats(k) = s;
end
Of course, there is some overhead in passing a large array back and forth, only to fill in one of its elements. In my case, however, the overhead is negligible, and I prefer this code instead of
for k=1:N
s = subfun(E,k);
stats(k) = s;
end
My preference is because I actually have many more assignments than just stats, and some of them are a good deal more complicated.
As mentioned, the overhead is negligible. But if I do something trivial, like adding this inconsequential if-statement
for k=1:N
i = k;
if i>=1
stats = subfun(E,i,stats);
end
end
the assignments that take place inside subfun suddenly take "forever" (the time increases much faster than linearly with N). And it is the assignment, not the calculation, that takes forever. In fact, it is even worse than the following nonsensical subfun
function stats = subfun(E,k,stats)
s = calculation_on_E(E);
clear stats
stats(k) = s;
end
which requires re-allocation of stats every time.
Does anybody have the faintest idea why this happens?

This might be due to some obscure detail of Matlab's JIT. In recent versions of Matlab, the JIT knows, in some limited cases, not to create a new array but to do the modification in place. One of the requirements is that the function is defined as
function x = modify_big_matrix(x, i, j)
x(i, j) = 123;
and not as
function x_out = modify_big_matrix(x_in, i, j)
x_out = x_in;
x_out(i, j) = 123;
Your examples seem to follow this rule, so, as Praetorian mentioned, your if statement might prevent the JIT from recognizing that it is an in-place operation.
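If the conditional call site is indeed what defeats the in-place optimization, a minimal sketch of one workaround (assuming your condition can be evaluated up front) is to precompute the indices to visit, so the call to subfun stays in the plain x = f(x, ...) form:
% Sketch: decide which iterations to run before the loop, then keep the
% call unconditional so the JIT can still recognize the in-place pattern.
idx = 1:N;
idx = idx(idx >= 1);   % replace with whatever condition you actually need
for k = idx
    stats = subfun(E, k, stats);
end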
If you really need to speed up your algorithm, it is possible to modify arrays in-place using your own mex-functions. I have successfully used this trick to gain a factor of 4 speedup on some medium-sized arrays (order 100x100x100 IIRC). This is, however, not recommended: it can segfault Matlab if you are not careful and might stop working in future versions.

As discussed by others, the problem almost certainly lies with JIT and its relatively fragile ability to modify in place.
As mentioned, I really prefer the first form of the function call and assignments, although other workable solutions have been suggested. Without relying on JIT, the only way this can be efficient (as far as I can see) is some form of passing by reference.
Therefore I made a class Stats that inherits from handle and contains the data array for k=1:N. It is then passed by reference.
For future reference, this seems to work very well, with good performance, and I'm currently using it as my working solution.
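A minimal sketch of that handle-class approach (the class and method names here are illustrative, not the exact code):
classdef Stats < handle
    % Handle class: passed by reference, so element assignments happen
    % in place and no large array is copied in or out of the function.
    properties
        data
    end
    methods
        function obj = Stats(N)
            obj.data = zeros(1, N);
        end
        function fill(obj, E, k)
            obj.data(k) = mean(E);   % fill in one element of the stored array
        end
    end
end
and in the loop:
stats = Stats(N);
for k = 1:N
    stats.fill(E, k);
end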

Related

Optimize Julia Code by Example

I am currently writing a numerical solver in Julia. I don't think the math behind it matters too much. It all boils down to the fact that a specific operation is executed several times and uses a large percentage (~80%) of the running time.
I tried to reduce it as much as possible and present you this piece of code, which can be saved as dummy.jl and then executed via include("dummy.jl") followed by dummy(10) (for compilation) and then dummy(1000).
function dummy(N::Int64)
A = rand(N,N)
@time timethis(A)
end
function timethis(A::Array{Float64,2})
dummyvariable = 0.0
for k=1:100 # just repeat a few times
for i=2:size(A)[1]-1
for j=2:size(A)[2]-1
dummyvariable += slopefit(A[i-1,j],A[i,j],A[i+1,j],2.0)
dummyvariable += slopefit(A[i,j-1],A[i,j],A[i,j+1],2.0)
end
end
end
println(dummyvariable)
end
@inline function minmod(x::Float64, y::Float64)
return sign(x) * max(0.0, min(abs(x),y*sign(x) ) );
end
@inline function slopefit(left::Float64,center::Float64,right::Float64,theta::Float64)
# arg=ccall((:minmod,"libminmod"),Float64,(Float64,Float64),0.5*(right-left),theta*(center-left));
# result=ccall((:minmod,"libminmod"),Float64,(Float64,Float64),theta*(right-center),arg);
# return result
tmp = minmod(0.5*(right-left),theta*(center-left));
return minmod(theta*(right-center),tmp);
#return 1.0
end
Here, timethis is meant to imitate the part of the code where I spend a lot of time. I notice that slopefit is extremely expensive to execute.
For example, dummy(1000) takes roughly 4 seconds on my machine. If slopefit instead always returned 1 and did not compute anything, the time would go down to one tenth of the overall time.
Now, obviously there is no free lunch.
I am aware that this is simply a costly operation. But I would still like to optimize it as much as possible, given that so much time is spent in something that is just a few lines of code and looks easy to optimize.
So far, I have tried implementing minmod and slopefit as C functions and calling them; however, that just increased the computing time (maybe I did it wrong).
So my question is, what possibilities do I have to optimize the call of slopefit?
Note that in the actual code the arguments of slopefit are not the ones mentioned here, but depend on conditional statements, which makes everything hard to vectorize (whether that would bring any performance gain, I am not sure).
There are two levels of optimization I can think of.
First: the following implementation of minmod will be faster as it avoids branching (I understand this is the functionality you want):
@inline minmod(x::Float64, y::Float64) = ifelse(x<0, clamp(y, x, 0.0), clamp(y, 0.0, x))
Second: you can use @inbounds to speed up the loop a bit:
@inbounds for i=2:size(A)[1]-1

How does JIT optimize branching while processing elements of collections? (in Scala)

This is a question about performance of code written in Scala.
Consider the following two code snippets, and assume that x is some collection containing ~50 million elements:
def process[T](x: Traversable[T]) = {
processFirst(x.head)
x reduce processPair
processLast(x.last)
}
Versus something like this (assume for now we have some way to determine if we're operating on the first element versus the last element):
def isFirstElement[T](x: T) = ???
def isLastElement[T](x: T) = ???
def process[T](x: Traversable[T]) = {
x reduce {
(left, right) =>
if (isFirstElement(left))
processFirst(left)
else if (isLastElement(right))
processLast(right)
processPair(left, right)
}
}
Which approach is faster? And for ~50 million elements, how much faster?
It seems to me that the first example would be faster because there are fewer conditional checks occurring for all but the first and last elements. However for the latter example there is some argument to suggest that the JIT might be clever enough to optimize away those additional head/last conditional checks that would otherwise occur for all but the first/last elements.
Is the JIT clever enough to perform such optimizations? The obvious advantage of the latter approach is that all the logic can be placed in the same function body, while in the former case the logic must be partitioned into three separate function bodies invoked separately.
EDIT:
Thanks for all the great responses. While I am leaving the second code snippet above to illustrate its incorrectness, I want to revise the first approach slightly to better reflect the problem I am attempting to solve:
// x is some iterator
def process[T](x: Iterator[T]) = {
if (x.hasNext)
{
var previous = x.next
var current: T = null.asInstanceOf[T]
processFirst(previous)
while(x.hasNext)
{
current = x.next
processPair(previous, current)
previous = current
}
processLast(previous)
}
}
While there are no additional checks occurring in the body, there is an additional reference assignment that appears to be unavoidable (previous = current). This is also a much more imperative approach that relies on nullable mutable variables. Implementing this in a functional yet high performance manner would be another exercise for another question.
How does this code snippet stack up against the second of the two examples above (the single-pass reduce containing all the branches)? The other thing I realize is that the second example is also broken on collections containing fewer than two elements.
If your underlying collection has an inexpensive head and last method (not true for a generic Traversable), and the reduction operations are relatively inexpensive, then the second way takes about 10% longer (maybe a little less) than the first on my machine. (You can use a var to get first, and you can keep updating a second var with the right argument to obtain last, and then do the final operation outside of the loop.)
If you have an expensive last (i.e. you have to traverse the whole collection), then the first operation takes about 10% longer (maybe a little more).
Mostly you shouldn't worry too much about it and instead worry more about correctness. For instance, in a 2-element list your second code has a bug (because there is an else instead of a separate test). In a 1-element list, the second code never calls reduce's lambda at all, so again fails to work.
This argues that you should do it the first way unless you're sure last is really expensive in your case.
Edit: if you switch to a manual reduce-like operation using an iterator, you might be able to shave off up to about 40% of your time compared to the expensive-last case (e.g. list). For an inexpensive last, probably not so much (up to ~20%). (I get these values when operating on lengths of strings, for example.)
First of all, note that, depending on the concrete implementation of Traversable, doing something like x.last may be really expensive. Like, more expensive than all the rest of what's going on here.
Second, I doubt the cost of conditionals themselves is going to be noticeable, even on a 50 million collection, but actually figuring out whether a given element is the first or the last, might again, depending on implementation, get pricey.
Third, JIT will not be able to optimize the conditionals away: if there was a way to do that, you would have been able to write your implementation without conditionals to begin with.
Finally, if you are at a point where it starts looking like an extra if statement might affect performance, you might consider switching to Java or even C. Don't get me wrong, I love Scala, it is a great language, with lots of power and useful features, but being super-fast just isn't one of them.

Fast check if element is in MATLAB matrix

I would like to verify whether an element is present in a MATLAB matrix.
At first, I implemented it as follows:
if ~isempty(find(matrix(:) == element))
which is obviously slow. Thus, I changed to:
if sum(matrix(:) == element) ~= 0
but this is again slow: I call the function that contains this instruction a lot of times, and I lose 14 seconds each time!
Is there a way to further optimize this instruction?
Thanks.
If you just need to know if a value exists in a matrix, using the second argument of find to specify that you just want one value will be slightly faster (25-50%) and even a bit faster than using sum, at least on my machine. An example:
matrix = randi(100,1e4,1e4);
element = 50;
~isempty(find(matrix(:)==element,1))
However, in recent versions of Matlab (I'm using R2014b), nnz is finally faster for this operation, so:
matrix = randi(100,1e4,1e4);
element = 50;
nnz(matrix==element)~=0
On my machine this is about 2.8 times faster than any other approach (including using any, strangely) for the example provided. To my mind, this solution also has the benefit of being the most readable.
In my opinion, there are several things you could try to improve performance:
Following your initial idea, I would go for the function any to test whether any of the equality tests succeeded:
if any(matrix(:) == element)
I tested this on a 1000 by 1000 matrix and it is faster than the solutions you have tested.
I do not think the unfolding matrix(:) is a penalty: it is equivalent to a reshape, and Matlab handles it smartly, without actually allocating and moving memory, since you are not modifying the temporary object matrix(:).
If your matrix does not change between calls to the function, or changes rarely, you could simply keep another vector containing all the elements of your matrix, but sorted. This way you can test for the presence of your element with a more efficient O(log(N)) search (see the sketch after this list).
I personally like the ismember function for this kind of problem. It might not be the fastest, but for non-critical parts of the code it greatly improves readability and maintenance (and I prefer to spend one hour coding something that will take a day to run than one day coding something that will run in one hour; this of course depends on how often you use the program, but it is something one should never forget).
If you can keep a sorted copy of the elements of your matrix, you could consider using the undocumented Matlab function ismembc, but remember that its inputs must be sorted, non-sparse, non-NaN values.
If performance really is critical, you might want to write your own mex file, and for this task you could even include some simple parallelization using OpenMP.
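To illustrate the sorted-copy idea from above, here is a minimal sketch (the helper name is made up): sort the elements once, whenever the matrix changes, and then answer each membership query with a binary search, which is O(log(N)) per call:
% Do this once, only when the matrix changes:
sortedVals = sort(matrix(:));
% Then each membership test is a binary search:
function tf = isPresentSorted(sortedVals, element)
    lo = 1;
    hi = numel(sortedVals);
    tf = false;
    while lo <= hi
        mid = floor((lo + hi)/2);
        if sortedVals(mid) == element
            tf = true;
            return
        elseif sortedVals(mid) < element
            lo = mid + 1;
        else
            hi = mid - 1;
        end
    end
end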
Hope this helps,
Adrien.

Vectorization of matlab code

I'm fairly new to vectorization. I have tried it myself but couldn't manage. Can somebody help me vectorize this code and give a short explanation of how you do it, so that I can adapt the thinking process too? Thanks.
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
%This function calculates whether a point is allowed.
%First, a quick test is done by calculating the distance from point to
%each point of the polygon. If that distance is smaller than range "r",
%the point is not allowed. This will slow down the algorithm at some
%points, but will greatly speed it up in others because less calls to the
%circleTest routine are needed.
polySize=size(Polygon,1);
testCounter=0;
for i=1:polySize
d = sqrt(sum((Polygon(i,:)-point).^2));
if d < tol*r
testCounter=1;
break
end
end
if testCounter == 0
circleTestResult = circleTest (point,Polygon,r,tol,stepSize);
testCounter = circleTestResult;
end
result = testCounter;
Given that Polygon is 2-dimensional, point is a row vector, and the other variables are scalars, here is the first version of your new function (scroll down to see that there are lots of ways to skin this cat):
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
result = 0;
linDiff = Polygon-repmat(point,size(Polygon,1),1);
testLogicals = sqrt( sum( ( linDiff ).^2 ,2 )) < tol*r;
if any(testLogicals); result = circleTest (point,Polygon,r,tol,stepSize); end
The thought process for vectorization in Matlab involves trying to operate on as much data as possible using a single command. Most of the basic builtin Matlab functions operate very efficiently on multi-dimensional data. Using a for loop is the reverse of this, as you are breaking your data down into smaller segments for processing, each of which must be interpreted individually. By resorting to data decomposition using for loops, you potentially lose some of the massive performance benefits associated with the highly optimised code behind the Matlab builtin functions.
The first thing to think about in your example is the conditional break in your main loop. You cannot break from a vectorized process. Instead, calculate all possibilities, make an array of the outcome for each row of your data, then use the any keyword to see if any of your rows have signalled that the circleTest function should be called.
NOTE: It is not easy to efficiently conditionally break out of a calculation in Matlab. However, as you are just computing a form of Euclidean distance in the loop, you'll probably see a performance boost by using the vectorized version and calculating all possibilities. If the computation in your loop were more expensive, the input data were large, and you wanted to break out as soon as you hit a certain condition, then a matlab extension made with a compiled language could potentially be much faster than a vectorized version where you might be performing needless calculation. However this is assuming that you know how to program code that matches the performance of the Matlab builtins in a language that compiles to native code.
Back on topic ...
The first thing to do is to take the linear difference (linDiff in the code example) between Polygon and your row vector point. To do this in a vectorized manner, the dimensions of the two variables must be identical. One way to achieve this is to use repmat to replicate the row vector point so that it becomes the same size as Polygon. However, bsxfun is usually a superior alternative to repmat (as described in this recent SO question), making the code ...
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
result = 0;
linDiff = bsxfun(@minus, Polygon, point);
testLogicals = sqrt( sum( ( linDiff ).^2 ,2 )) < tol*r;
if any(testLogicals); result = circleTest (point,Polygon,r,tol,stepSize); end
I rolled your d value into a column vector of distances by summing across the 2nd dimension (note the removal of the array index from Polygon and the addition of ,2 in the sum command). I then went further and evaluated the logical array testLogicals inline with the calculation of the distance measure. You will quickly see that a downside of heavy vectorisation is that it can make the code less readable to those not familiar with Matlab, but the performance gains are worth it. Comments are pretty much necessary.
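If you are on R2016b or newer, implicit expansion lets the subtraction expand automatically, so bsxfun is not even needed for that step; a minimal sketch of that variant:
% Implicit expansion (R2016b+): Polygon is MxN, point is 1xN, and the
% subtraction expands point across the rows of Polygon automatically.
linDiff = Polygon - point;
testLogicals = sqrt(sum(linDiff.^2, 2)) < tol*r;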
Now, if you want to go completely crazy, you could argue that the test function is so simple now that it warrants use of an 'anonymous function' or 'lambda' rather than a complete function definition. The test for whether or not it is worth doing the circleTest does not require the stepSize argument either, which is another reason for perhaps using an anonymous function. You can roll your test into an anonymous function and then just call circleTest in your calling script, making the code self-documenting to some extent . . .
doCircleTest = @(point,Polygon,r,tol) any(sqrt( sum( bsxfun(@minus, Polygon, point).^2, 2 )) < tol*r);
if doCircleTest(point,Polygon,r,tol)
result = circleTest (point,Polygon,r,tol,stepSize);
else
result = 0;
end
Now that everything is vectorised, the use of function handles gives me another idea . . .
If you plan on performing this at multiple points in the code, the repetition of the if statements would get a bit ugly. To stay DRY, it seems sensible to put the test together with the conditional function into a single function, just as you did in your original post. However, the utility of that function would be very narrow - it would only test whether the circleTest function should be executed, and then execute it if need be.
Now imagine that after a while you have some other conditional functions, just like circleTest, with their own equivalents of doCircleTest. It would be nice to reuse the conditional switching code. For this, make a function like your original that takes a default value, the boolean result of the computationally cheap test function, and the function handle of the expensive conditional function with its associated arguments ...
function result = conditionalFun( default, cheapFunResult, expensiveFun, varargin )
if cheapFunResult
result = expensiveFun(varargin{:});
else
result = default;
end
end %//of function
You could call this function from your main script with the following . . .
result = conditionalFun(0, doCircleTest(point,Polygon,r,tol), @circleTest, point,Polygon,r,tol,stepSize);
...and the beauty of it is you can use any test, default value, and expensive function. Perhaps a little overkill for this simple example, but it is where my mind wandered when I brought up the idea of using function handles.

Is it quicker to access a local variable than an attribute of an object?

Straightforward, language-agnostic question. I've always done this:
myVar = myObj.myAttribute
when I need to access myAttribute a lot.
I'm wondering if this is just a superstition I've acquired, or if it's generally faster?
Edit: I would also like to know if this
myVar = myObj.myAttribute/100
for (i=0; i<100; i++) {
print myVar*i;
}
is more efficient than putting (myObj.myAttribute/100) in the loop. Will modern compilers and interpreters detect that that part of the equation doesn't vary?
In this particular case what you did is more efficient, since it's one division vs 100.
I assign properties to variables only when I can optimize the operations done later, as in your case, or when I expect multiple calls to the same property and the lookup is likely to be expensive. Generally, using a local variable should be the more CPU-friendly way, since complex property lookups can be costly, and it also gives you better control of the value and the possibility to pre-validate it before looping. That said, it may be inefficient if the lookup only occurs once or twice per function call, as it then just adds overhead and makes the code harder to follow.
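To make the hoisting idea concrete, a minimal MATLAB sketch (the object and property names are made up for illustration):
% Hoist the property lookup and the division out of the loop;
% only a multiplication remains inside.
scale = myObj.myAttribute / 100;
for i = 0:99
    disp(scale * i);
end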
I suppose it might depend on the language and/or the compiler; but, generally speaking, the less your code has to do, the faster it'll be.
But the difference shouldn't be that important... and what matters most is that people are able to understand your code easily.
In JavaScript, for instance, it's said to be faster to use a local variable instead of re-resolving the object access several times.
i.e. this :
var a = obj.a.b.c;
a.a = 10;
a.b = 20;
a.c = 30;
is faster than that :
obj.a.b.c.a = 10;
obj.a.b.c.b = 20;
obj.a.b.c.c = 30;
As a rule, depending on the language, maybe.
You are unlikely to notice the difference however, unless you are running (for example) a tight loop.
Usually I would say the savings are not worth the extra cognitive load on the programmer.
However if you have a bit of code which you know has a slowness problem, this kind of optimisation is definitely worth considering.
