Fortran multidimensional sub-array performance

While manipulating and assigning sub-arrays within multidimensional arrays in Fortran 90, I stumbled across an interesting performance quirk.
Fortran 90 introduced the ability to manipulate sub-sections of arrays, and I have seen a few places that recommend performing array operations using this "slicing" method instead of loops. For instance, if I have to add two arrays, a and b, of size 10, it is better to write:
c(1:10) = a(1:10) + b(1:10)
or
c = a + b
Instead of
do i = 1, 10
  c(i) = a(i) + b(i)
end do
I tried this method for simple one-dimensional and two-dimensional arrays and found it to be faster with the "slicing" notation. However, things began to get interesting when assigning such results within multidimensional arrays.
First of all, I must apologize for my rather crude performance-measuring exercise. I am not even sure if the method I have adopted is the right way to time and test code, but I am fairly confident about the qualitative results of the test.
program main
  implicit none

  integer, parameter :: mSize = 10000
  integer :: i, j
  integer :: pCnt, nCnt, cntRt, cntMx
  integer, dimension(mSize, mSize) :: a, b
  integer, dimension(mSize, mSize, 3) :: c

  pCnt = 0
  call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
  print *, "First call: ", nCnt-pCnt
  pCnt = nCnt

  do j = 1, mSize
    do i = 1, mSize
      a(i, j) = i*j
      b(i, j) = i+j
    end do
  end do

  call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
  print *, "Created Matrices: ", nCnt-pCnt
  pCnt = nCnt

  ! OPERATIONS BY SLICING NOTATION
  !c(1:mSize, 1:mSize, 1) = a + b
  !c(1:mSize, 1:mSize, 2) = a - b
  !c(1:mSize, 1:mSize, 3) = a * b

  ! OPERATIONS WITH LOOP
  do j = 1, mSize
    do i = 1, mSize
      c(i, j, 1) = a(i, j) + b(i, j)
      c(i, j, 2) = a(i, j) - b(i, j)
      c(i, j, 3) = a(i, j) * b(i, j)
    end do
  end do

  call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
  print *, "Added Matrices: ", nCnt-pCnt
  pCnt = nCnt
end program main
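(A side note on the timing: SYSTEM_CLOCK returns raw counts, so the numbers printed above are clock ticks, not seconds. A minimal sketch of the conversion, using the count rate that SYSTEM_CLOCK also returns:)
call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
print *, "Elapsed: ", real(nCnt - pCnt) / real(cntRt), " s"  ! ticks / ticks-per-second = seconds
pCnt = nCnt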
As can be seen, I have two methods of operating on two large 2D arrays and assigning the results into a 3D array. I was heavily in favour of using the slicing notation, as it helped me write shorter and more elegant-looking code. But upon observing how severely sluggish my code was, I was forced to recheck the merits of slicing notation over calculating within loops.
I ran the above code with and without the -O3 flag using GNU Fortran 4.8.4 for Ubuntu 14.04.
Without -O3 flag
a. Slicing notation
5 Runs - 843, 842, 842, 841, 859
Average - 845.4
b. Looped calculation
5 Runs - 1713, 1713, 1723, 1711, 1713
Average - 1714.6
With -O3 flag
a. Slicing notation
5 Runs - 545, 545, 544, 544, 548
Average - 545.2
b. Looped calculation
5 Runs - 479, 477, 475, 472, 472
Average - 475
I found it very interesting that without the -O3 flag, the slicing notation continued to perform far better than loops. However, using the -O3 flag causes this advantage to vanish completely; on the contrary, the array slicing notation becomes detrimental in this case.
In fact, with my rather large 3D parallel computation code, this is turning out to be a significant bottleneck. I strongly suspect that the formation of array temporaries during the assignment of a lower-dimensional array to a higher-dimensional array is the culprit here. But why did the optimization flag fail to optimize the assignment in this case?
Moreover, I feel that blaming the -O3 flag is not a respectable thing to do. So are array temporaries really the culprit? Is there something else I may be missing? Any insight will be extremely helpful in speeding up my code. Thanks!

When doing any performance comparison, you have to compare apples with apples and oranges with oranges. What I mean is that you are not really comparing the same thing; the two versions are totally different even though they produce the same result.
What comes into play here is memory management; think of cache misses during the operation. If you turn the loop version into 3 different loops, as suggested by haraldkl, you will certainly get similar performance.
What happens is that when you combine the 3 assignments in the same loop, there is a lot of cache reuse on the right-hand side, since all 3 statements share the same variables there. Each element of a or b is loaded into the cache and into registers only once for the loop version, while for the array-operation version each element of a or b gets loaded 3 times. That is what makes the difference. The larger the array, the larger the difference, because you will get more cache misses and more reloading of elements into the registers.

I don't know what the compiler really does, so this is not really an answer, but it is too much text for a comment...
I have the suspicion that the compiler expands the array notation into something like this:
do j = 1, mSize
  do i = 1, mSize
    c(i, j, 1) = a(i, j) + b(i, j)
  end do
end do

do j = 1, mSize
  do i = 1, mSize
    c(i, j, 2) = a(i, j) - b(i, j)
  end do
end do

do j = 1, mSize
  do i = 1, mSize
    c(i, j, 3) = a(i, j) * b(i, j)
  end do
end do
Of course, the compiler might still collapse these loops if written like that, so you might need to confuse it a little more, for example by writing something of c to the screen between the loops, as sketched below.
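A minimal sketch of that idea (the print statements are just one guess at a side effect that might keep the compiler from fusing the loops again):
do j = 1, mSize
  do i = 1, mSize
    c(i, j, 1) = a(i, j) + b(i, j)
  end do
end do
print *, c(1, 1, 1)  ! observable side effect between the loops

do j = 1, mSize
  do i = 1, mSize
    c(i, j, 2) = a(i, j) - b(i, j)
  end do
end do
print *, c(1, 1, 2)  ! ditto

do j = 1, mSize
  do i = 1, mSize
    c(i, j, 3) = a(i, j) * b(i, j)
  end do
end do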

Related

Quickly generating the "triangle sequence": avoiding mispredictions

I'm interested in calculating the triangle sequence[1], which is the sequence of pairs (i, j): (0, 0), (1, 0), (1, 1), (2, 0), (2, 1) ...
which iterates through all pairs (i, j) with the restriction that i >= j. The same sequence but with the restriction i > j is also interesting.
These sequences represent, among other things, all the ways to choose 2 (possibly identical) elements from an n-element set (for the sequence up to (n, n)[2]), or the indices of the lower triangular elements of a matrix[3]. The sequence of values for i alone is A003056 in OEIS, while j alone is A002262. The sequence frequently arises in combinatorial algorithms, where its performance may be critical.
A simple but branchy way to generate the next value in the sequence is:
if (i == j) {
    j = 0;
    i++;
} else {
    j++;
}
However, this suffers from many mispredicts while calculating the initial elements of the sequence, when checking the condition (i == j): generally one mispredict each time i is incremented. As the sequence increases, the number of mispredicts becomes lower since i is incremented with reduced frequency, so the j++ branch dominates and is well predicted. Still, some types of combinatorial search repeatedly iterate over the small terms in the sequence, so I'm looking for a branch-free approach or some other approach that suffers fewer mispredicts.
For many uses, the order of the sequence isn't as important, so generating the values in a different order than above is allowable if it leads to a better solution. For example, j could count down rather than up: (0, 0), (1, 1), (1, 0), (2, 2), (2, 1), ....
[1] I'm also interested in knowing what the right name for this sequence is (perhaps so I can make a better title for this question). I just kind of made up "triangle sequence".
[2] Here, the i >= j version represents sub-multisets (repetition allowed), while the i > j variant represents normal subsets (no repetition).
[3] Here, the i >= j version includes the main diagonal, while the i > j variant excludes it.
Here are two branch-free approaches that do not use any expensive calculations. The first one uses a comparison and a logical AND:
const bool eq = i == j;
i += eq;
j = (j + 1) & (eq - 1);
The second one uses a comparison and a multiplication:
const bool eq = i == j;
i += eq;
j = (j + 1) * (1 - eq);
In theory the "multiplication" variant should be slower than the "logical" one, but measurements show very little difference.
Both approaches result in branchless code only on processors that allow branchless comparisons (for example, x86). They also assume a language where the result of a comparison converts cleanly to an integer (for example C/C++, where a "false" comparison converts to the integer 0 and a "true" one to 1).
The only problem with these approaches is performance. They could in theory outperform branchy code, but only when mispredicts are really frequent. A simple test with no other work besides generating the "triangle sequence" (see it on ideone) shows a miserably low mispredict rate, and therefore both branchless methods are about 3 times slower than the branchy one. The explanation is simple: there should not be many mispredicts for longer sequences, and as for shorter ones, modern processors have very good branch predictors that almost never fail on short branch patterns. So we do not get many mispredicts; the branchy code almost always executes only 2 instructions (compare, increment), while the branchless code executes both the active and inactive "branches" plus some instructions specific to the branchless approach.
In case you want to repeatedly iterate over the small terms in the sequence, a different approach would probably be preferable: calculate the sequence only once, then repeatedly read it from memory.
In Python we can express this as:
i, j = i + (i == j), (j + 1) * (i != j)
but it turns out, at around a million iterations or so on my machine, the following, more long-winded, lazily evaluated code is about 20% faster:
from itertools import count, repeat

def gen_i():
    """ A003056 """
    for x in count(0):  # infinitely counts up
        yield from repeat(x, x + 1)  # replication

def gen_j():
    """ A002262 """
    for x in count(0):  # infinitely counts up
        yield from range(x + 1)  # count up to (including) x

sequence = zip(gen_i(), gen_j())

for _ in range(1000000):
    i, j = next(sequence)
In the above code, gen_i(), gen_j(), count(), repeat(), and zip() all produce lazy iterators (and range() is a lazy sequence), so sequence continues to call into the code on demand as new (i, j) pairs are required. I assume both the implementation of range() and repeat() terminate with a misprediction.
Simple isn't necessarily also quick (e.g. consider all the unnecessary additions of zero and multiplications by one in the compact form).
So which is more important, quickly generating the sequence or avoiding mispredictions?
You can derive j from i:
...set val...
old_j = j;
j = (j + 1) % (i + 1);
if (i == old_j) {
    i++;
}
...loop if more...
And further derive the i increment from j and the current i, using integer division:
...set val...
old_j = j;
j = (j + 1) % (i + 1);
i = i + (old_j + 1) / (i + 1);  // quotient is 1 exactly when old_j == i (since old_j <= i), else 0
...loop if more...
(Can't test it at the moment... Please review.)

Why does the same function written in different ways have such different running times?

I've been playing with the Wolfram Language and noticed something: the same function written in different ways performs very differently in terms of time.
Consider these two functions:
NthFibonacci[num_] :=
  If[num == 0 || num == 1, Return[1],
    Return[NthFibonacci[num - 1] + NthFibonacci[num - 2]]
  ]

Fibn[num_] := {
  a = 1;
  b = 1;
  For[i = 0, i < num - 1, i++,
    c = a + b;
    a = b;
    b = c;
  ];
  Return[b];
}
NthFibonacci[30] takes around 5 seconds to evaluate.
Fibn[900000] also takes around 5 seconds to evaluate.
So does the built-in Fibonacci[50000000].
I simply can't see why there are such differences in speed between the three. In theory, recursion should be more or less equivalent to a for loop. What is causing this?
It's because the recursive version you present does lots and lots of repeated calculations. Build a tree of the function calls to see what I mean. Even for an argument as small as 4, look at how many function calls are generated to get to a base case down each chain of the logic.
              f(4)
             /    \
         f(3)      f(2)
        /    \     /   \
     f(2)   f(1) f(1)   f(0)
    /    \
  f(1)   f(0)
With your recursion, the number of function calls grows exponentially with the argument num.
By contrast, your looped version grows linearly in num. It doesn't take a very large value of n before n is a lot less work than 2^n.
There are many ways to implement recursion; the Fibonacci function is a lovely example. As pjs already pointed out, the classic, double-recursive definition grows exponentially. The base is
φ = (sqrt(5)+1) / 2 = 1.618+
Your NthFibonacci implementation works this way. It's order φ^n, meaning that for large n, calling f(n+1) takes φ times as long as f(n).
The gentler approach computes each functional value only once in the stream of execution. Instead of exponential time, it takes linear time, meaning that calling f(2n) takes twice as long as f(n).
There are other approaches. For instance, Dynamic Programming (DP) keeps a cache of previous results. In pjs's f(4) case, a DP implementation would compute f(2) only once; the second call would see that the result of the first was in cache, and return that result rather than making further calls to f(0) and f(1). This tends toward linear time.
There are also implementations that make checkpoints, such as caching f(k) and f(k+1) for k divisible by 1000. These save time by providing a starting point not too far below the desired value, giving an upper bound of 998 iterations to find the needed value.
Ultimately, the fastest implementations use the direct computation (at least for larger numbers) and work in constant time.
φ = (1+sqrt(5)) / 2 = 1.618...
ψ = (1-sqrt(5)) / 2 = -.618...
f(n) = (φ^n - ψ^n) / sqrt(5)
The issue noted by pjs can be addressed to a degree by having the recursive function remember prior values. (Eliminating the If helps too.)
Clear[NthFibonacci]
NthFibonacci[0] = 1
NthFibonacci[1] = 1
NthFibonacci[num_] :=
NthFibonacci[num] = NthFibonacci[num - 1] + NthFibonacci[num - 2]
NthFibonacci[300] // AbsoluteTiming
{0.00201479, 3.59×10^62}
Cleaning up your loop version as well (you should almost never use Return in Mathematica):
Fibn[num_] := Module[{a = 1, b = 1, c},
  Do[c = a + b; a = b; b = c, {num - 1}]; b]
Fibn[300] // AbsoluteTiming
{0.000522175, 3.59×10^62}
You see the recursive form is slower, but not horribly so. (Note that the recursive form hits a recursion-depth limit around 1000 as well.)

Most efficient way to weight and sum a number of matrices in Fortran

I am trying to write a function in Fortran that multiplies a number of matrices by different weights and then adds them together to form a single matrix. I have identified this process as the bottleneck in my program (this weighting will be done many times during a single run of the program, with different weights). Right now I'm trying to make it run faster by switching from Matlab to Fortran. I am a newbie at Fortran, so I appreciate all help.
In Matlab the fastest way I have found to make such a computation looks like this:
function B = weight_matrices()
    n = 46;
    m = 1800;
    A = rand(n,m,m);
    w = rand(n,1);
    tic;
    B = squeeze(sum(bsxfun(@times,w,A),1));
    toc;
The line where B is assigned runs in about 0.9 seconds on my machine (Matlab R2012b, MacBook Pro 13" retina, 2.5 GHz Intel Core i5, 8 GB 1600 MHz DDR3). It should be noted that for my problem, the tensor A will be the same (constant) for the whole run of the program (after initialization), but w can take any values. Also, typical values of n and m are used here, meaning that the tensor A will have a size of about 1 GB in memory.
The clearest way I can think of writing this in Fortran is something like this:
pure function weight_matrices(w,A) result(B)
  implicit none
  integer, parameter :: n = 46
  integer, parameter :: m = 1800
  double precision, dimension(n), intent(in) :: w
  double precision, dimension(n,m,m), intent(in) :: A
  double precision, dimension(m,m) :: B
  integer :: i
  B = 0
  do i = 1,n
    B = B + w(i)*A(i,:,:)
  end do
end function weight_matrices
This function runs in about 1.4 seconds when compiled with gfortran 4.7.2, using -O3 (function call timed with "call cpu_time(t)"). If I manually unwrap the loop into
B = w(1)*A(1,:,:)+w(2)*A(2,:,:)+ ... + w(46)*A(46,:,:)
the function takes about 0.11 seconds to run instead. This is great and means that I get a speedup of about 8 times compared to the Matlab version. However, I still have some questions on readability and performance.
First, I wonder if there is an even faster way to perform this weighting and summing of matrices. I have looked through BLAS and LAPACK, but can't find any function that seems to fit. I have also tried to put the dimension in A that enumerates the matrices as the last dimension (i.e. switching from (i,j,k) to (k,i,j) for the elements), but this resulted in slower code.
Second, this fast version is not very flexible, and actually looks quite ugly, since it is so much text for such a simple computation. For the tests I am running, I would like to try different numbers of weights, so that the length of w will vary, to see how it affects the rest of my algorithm. However, that means a quite tedious rewrite of the assignment of B every time. Is there any way to make this more flexible, while keeping the performance the same (or better)?
Third, the tensor A will, as mentioned before, be constant during the run of the program. I have set constant scalar values in my program using the "parameter" attribute in their own module, importing them with the "use" expression into the functions/subroutines that need them. What is the best way to do the equivalent thing for the tensor A? I want to tell the compiler that this tensor will be constant after initialization, so that any corresponding optimizations can be done. Note that A is typically ~1 GB in size, so it is not practical to enter it directly in the source file.
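(A minimal sketch of one option, assuming Fortran 2003: a module variable with the protected attribute can be set by procedures inside its module but cannot be modified by code that merely uses the module, which at least expresses the "constant after initialization" intent. The module and procedure names here are made up for illustration.)
module tensor_data
  implicit none
  integer, parameter :: n = 46, m = 1800
  ! Readable everywhere the module is used, writable only inside this module
  double precision, dimension(:,:,:), allocatable, protected :: A
contains
  subroutine init_tensor()
    allocate(A(n,m,m))
    ! ... fill A here, e.g. by reading it from a file ...
  end subroutine init_tensor
end module tensor_data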
Thank you in advance for any input! :)
Perhaps you could try something like
do k=1,m
  do j=1,m
    B(j,k) = sum( [ ( (w(i)*A(i,j,k)), i=1,n) ] )
  enddo
enddo
The square-bracket form is a newer way of writing (/ /), the array constructor. The argument to sum is a rank-1 array of n elements, and sum adds all of them. This is precisely what your unwrapped code does (and is not exactly equal to the do loop you have).
I tried to refine Kyle Vanos's solution. Therefore I decided to use sum and Fortran's vector capabilities.
I don't know if the results are correct, because I only looked at the timings!
Version 1: (for comparison)
B = 0
do i = 1,n
  B = B + w(i)*A(i,:,:)
end do
Version 2: (from Kyle Vanos)
do k=1,m
  do j=1,m
    B(j,k) = sum( [ ( (w(i)*A(i,j,k)), i=1,n) ] )
  enddo
enddo
Version 3: (mixed-up indices, work on one row/column at a time)
do j = 1, m
  B(:,j) = sum( [ ( (w(i)*A(:,i,j)), i=1,n) ], dim=1)
enddo
Version 4: (complete matrices)
B=sum( [ ( (w(i)*A(:,:,i)), i=1,n) ], dim=1)
Timing
As you can see, I had to mix up the indices to get faster execution times. The third solution is really strange, because the matrix number is the middle index, but this is necessary for memory-order reasons.
V1: 1.30s
V2: 0.16s
V3: 0.02s
V4: 0.03s
Concluding, I would say that you can get a massive speedup if you have the possibility to arrange the matrix indices in arbitrary order.
I would not hide any looping, as this is usually slower. If you write it out explicitly, you'll see that the summation index i is the first index of A, so the inner loops stride through memory, which is inefficient. You should make sure your n dimension is the last one by storing A as A(m,m,n):
B = 0
do i = 1,n
  w_tmp = w(i)
  do j = 1,m
    do k = 1,m
      B(k,j) = B(k,j) + w_tmp*A(k,j,i)
    end do
  end do
end do
This should be much more efficient, as you are now accessing consecutive elements in memory in the inner loop.
Another solution is to use the level 1 BLAS subroutines _AXPY (y = a*x + y):
B = 0
do i = 1,n
  CALL DAXPY(m*m, w(i), A(1,1,i), 1, B(1,1), 1)
end do
With Intel MKL this should be more efficient, but again you should make sure the last index is the one which changes in the outer loop (in this case the loop you're writing). You can find the necessary arguments for this call here: MKL
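(A further sketch, not from the original answer: with the A(m,m,n) storage order, the whole reduction is a single matrix-vector product if you view A as an (m*m) × n matrix, which is exactly what the level-2 BLAS routine DGEMV computes.)
! B, viewed as a vector of length m*m, becomes A_flat * w,
! where A_flat is the (m*m) x n matrix overlaying A(m,m,n) in memory.
CALL DGEMV('N', m*m, n, 1.0d0, A, m*m, w, 1, 0.0d0, B, 1)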
EDIT: you might also want to use some parallelization? (I don't know if Matlab takes advantage of that.)
EDIT2: In Kyle's answer, the inner loop is over the different values of w, which is more efficient than reloading B n times, since w can be kept in cache (using A(n,m,m)):
B = 0
do i = 1,m
  do j = 1,m
    B(j,i) = 0.0d0
    do k = 1,n
      B(j,i) = B(j,i) + w(k)*A(k,j,i)
    end do
  end do
end do
This explicit looping performs about 10% better than Kyle's code, which uses whole-array operations. Bandwidth with ifort -O3 -xHost is ~6600 MB/s; with gfortran -O3 it's ~6000 MB/s, and the whole-array version with either compiler is also around 6000 MB/s.
I know this is an old post, but I will gladly bring my contribution, as I have played with most of the posted solutions.
Adding a local unroll for the weights loop (to Steabert's answer) gives me a small speed-up compared to the complete-unroll version (from 10% to 80%, depending on the size of the matrices). The partial unrolling may help the compiler to vectorize the 4 operations in one SSE call.
pure function weight_matrices_partial_unroll_4(w,A) result(B)
  implicit none
  integer, parameter :: n = 46
  integer, parameter :: m = 1800
  real(8), intent(in) :: w(n)
  real(8), intent(in) :: A(n,m,m)
  real(8) :: B(m,m)
  real(8) :: Btemp(4)
  integer :: i, j, k, l, ndiv, nmod, roll
  !==================================================
  roll = 4
  ndiv = n / roll
  nmod = mod( n, roll )
  do i = 1,m
    do j = 1,m
      B(j,i) = 0.0d0
      k = 1
      do l = 1,ndiv
        Btemp(1) = w(k  )*A(k  ,j,i)
        Btemp(2) = w(k+1)*A(k+1,j,i)
        Btemp(3) = w(k+2)*A(k+2,j,i)
        Btemp(4) = w(k+3)*A(k+3,j,i)
        k = k + roll
        B(j,i) = B(j,i) + sum( Btemp )
      end do
      do l = 1,nmod  !---- process the rest of the loop
        B(j,i) = B(j,i) + w(k)*A(k,j,i)
        k = k + 1
      enddo
    end do
  end do
end function

Performance of Fibonacci

f[0] = 0;
f[1] = 1;
f[x_] := f[x-1] + f[x-2]
This function is running slow in Mathematica and I need to increase the speed. I have to use functional programming and recursion. I'm not sure why this is running so slow, and even the slightest idea how to improve this would be helpful.
A good way to write a faster recursive function is to have it memoize previous values. This does come at the cost of memory, of course, but it can help in cases like this. In order to calculate f[x], you calculate f[x-1] and f[x-2]; then to calculate f[x-1], you calculate f[x-2] again; you end up recalculating a lot of values many times. (Forgive my imprecision!)
To store things as you go, you can use this idiom:
f[x_] := ( f[x] = (* calculation of f[x] goes here *) )
Edit: I don't have Mathematica on this machine, but I'm pretty sure there's no way this computes the wrong value.
f[0] = 0;
f[1] = 1;
f[x_] := ( f[x] = f[x-1] + f[x-2] );
f[256]
Like I said in a comment below, if you have other definitions of f, you might want to wipe them out first with Clear[f].
Thanks to rcollyer: Be careful about $RecursionLimit! It defaults to 256. (Of course, this is with good reason. Really deep recursion is usually a bad idea.)
Jefromi is right. Look at Memoization on Wikipedia. They use the example of factorial and how to speed it up with memoization.
Memoization is a good way to write a faster recursive function. However, in this case there is a recursive alternative that runs tremendously faster than the original function, without requiring memoization.
The key observation is to see that the original definition performs a lot of redundant calculations. Consider what happens if we calculate fib[4]:
fib[4] = fib[3] + fib[2]
    fib[3] = fib[2] + fib[1]
        fib[2] = fib[1] + fib[0]
            fib[1] = 1
            fib[0] = 0
        ∴ fib[2] = 1 + 0 = 1
        fib[1] = 1
    ∴ fib[3] = 1 + 1 = 2
    fib[2] = fib[1] + fib[0]
        fib[1] = 1
        fib[0] = 0
    ∴ fib[2] = 1 + 0 = 1
∴ fib[4] = 2 + 1 = 3
In this process, fib[2] and fib[0] were computed twice each and fib[1] was computed thrice. For larger computations, the waste grows dramatically -- exponentially in fact.
If one were to calculate the same Fibonacci number by hand, one might proceed something like this:
0: given 0
1: given 1
2: 0 + 1 = 1
3: 1 + 1 = 2
4: 1 + 2 = 3
There are no redundant calculations. At any given point, one only needs to consider the previous two results. This latter approach can be expressed recursively thus:
fib2[0] = 0;
fib2[n_] :=
  Module[{f},
    f[n, p1_, _] := p1;
    f[x_, p1_, p2_] := f[x + 1, p1 + p2, p1];
    f[1, 1, 0]
  ]

Block[{$IterationLimit = Infinity}, fib2[100000]]
No doubt, this form is not as easy to read as the original. On the other hand, the original function took 35 seconds to compute fib[35] on my machine, whereas the revised function's runtime for the same was reported as zero. Furthermore, the revised function computes fib2[100000] in 0.281 seconds, without requiring any of the extra storage of memoization. fib[100000] is quite out of reach of the original function, and the memoized version crashed my Mathematica 7.01 kernel (too many memoized rules, perhaps?).
Note that Mathematica, by default, will iterate a function no more than 4096 times. To raise that limit, you must assign a higher value to $IterationLimit as illustrated in the example above.
Of course, in Mathematica there are plenty of non-recursive ways to calculate Fibonacci numbers, up to and including the built-in Fibonacci function. But that is not the point of this exercise.
Tail Call Optimization?
It is always desirable to express recursive functions using tail calls. This permits the recursion to be executed by simple iteration, without the overhead of retaining intermediate results on the stack. fib2 is tail recursive. Some languages, like Scheme, mandate tail call optimization. Other languages, like Java, could support it but don't (or won't, as in the case of Python).
In the case of Mathematica, it is not clear to what extent tail call optimization is performed. For further discussion of this point, see another SO question.

Performance of swapping two elements in MATLAB

Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The total time taken by the second example is 12 times less than that of the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885548 times. (The difference here is a factor of four, not twelve as I originally posted. Each of the functions has the additional function-wrapping overhead, while in my initial post I just summed up the individual lines.)
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
    function swap(self, i1, i2)
        swap1(self, i1, i2);
        swap2(self, i1, i2);
        swap3(self, i1, i2);
        self.SwapCount = self.SwapCount + 1;
    end
end

methods (Access = private)
    %
    % swap1: stores values in temporary doubles
    % This has the best performance
    %
    function swap1(self, i1, i2)
        e1 = self.Data(i1);
        e2 = self.Data(i2);
        self.Data(i1) = e2;
        self.Data(i2) = e1;
    end
    %
    % swap2: stores values in a temporary matrix
    % Marginally slower than swap1
    %
    function swap2(self, i1, i2)
        m = self.Data([i1, i2]);
        self.Data([i2, i1]) = m;
    end
    %
    % swap3: does not use variables for storage.
    % This has the worst performance
    %
    function swap3(self, i1, i2)
        self.Data([i1, i2]) = self.Data([i2, i1]);
    end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
You could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may depend on the version of MATLAB you are using. I tried both of your cases in version 7.1.0.246 (a couple of years old) and only saw a speed difference of about 2-2.5x.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];

% Example 1:
for i = 1:4
    vec(i) = vec(i)+1;
end

% Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
This post deserves an update, since the JIT compiler is now a thing (since R2015b) and so is timeit (since R2013b) for more reliable function timing.
Below is a short benchmarking function for element swapping within a large array. I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question, respectively.
The results are pretty staggering: the performance of directly swapping 2 elements is increasingly poor by comparison with using a temporary variable.
function benchie()
    % Variables for plotting, loop to increase size of the arrays
    M = 15; D = zeros(1,M); W = zeros(1,M);
    for n = 1:M
        N = 2^n;
        % Create some random array of length N, and random indices to swap
        v = rand(N,1);
        x = randi([1, N], N, 1);
        y = randi([1, N], N, 1);
        % Time the functions
        D(n) = timeit(@()direct);
        W(n) = timeit(@()withtemp);
    end
    % Plotting
    plot(2.^(1:M), D, 2.^(1:M), W);
    legend('direct', 'with temp')
    xlabel('number of elements'); ylabel('time (s)')
    function direct()
        % Direct swapping of two elements
        for k = 1:N
            v([x(k) y(k)]) = v([y(k) x(k)]);
        end
    end
    function withtemp()
        % Using an intermediate temporary variable
        for k = 1:N
            tmp = v(y(k));
            v(y(k)) = v(x(k));
            v(x(k)) = tmp;
        end
    end
end
