Strange observation about timing comparison between Julia and Matlab - performance

The goal of this experiment is to compare speed of Matlab and Julia with a small piece of code below.
First the Matlab code:
>> t = 5000; n = 10000; x = 1:t*n;
>> x = reshape(x, t, n);
>> tic(); y1 = sum(x(:) .* x(:)); toc()
Elapsed time is 0.229563 seconds.
>> y1
y1 =
4.1667e+22
>> tic(); y2 = trace(x * x'); toc()
Elapsed time is 15.332694 seconds.
>> y2
y2 =
4.1667e+22
Versus in Julia
julia> t = 5000; n = 10000; x = 1: t*n;
julia> x = reshape(x, t, n);
julia> tic(); y1 = sum(x[:].* x[:]); toc();
elapsed time: 1.235170533 seconds
julia> y1
-4526945843202100544
julia> tic();y2 = trace(x*x'); toc();
The second one did not finish the job in more than 1 minutes. So what is the matter here with Julia? This piece of code happen to run both slower and overflow in Julia? Is there any problem in my style? I think one reason to convert from Matlab to Julia is the speed, and I used to think that Julia handles big number arithmetic operations by default. Now, it looks like these are not correct. Can someone explain?

There are a couple of things going on here.
Firstly, unlike Matlab, the x is an array of machine integers, not floating point values. This appears to be the main difference in speed, as it is unable to use BLAS routines for linear algebra.
You need to do either
x = 1.0:t*n
or explicitly convert it via
x = float(x)
This is also the reason for a different answer: Julia uses machine arithmetic, which for integers will wrap around on overflow (hence the negative number). You won't have this problem with floating point values.
(Julia does have arbitrary-precision integers, but doesn't promote by default: instead you would need to promote them yourself via big(n))
This will give you some speed up, but you can do better. Julia does require a slightly different programming style than Matlab. See the links that Isaiah provided in the comments above.
In regards to your specific examples
sum(x[:].* x[:])
This is slow as it creates 3 intermediate vectors (2 copies of x[:], though hopefully this will change in future, and their product).
Similarly,
trace(x*x')
computes an intermediate matrix (most of which is not used in the trace).
My suggestion would be to use
dot(vec(x),vec(x))
vec(x) just reshapes x into a vector (so no copying), and dot is the usual sum-product. You could also "roll-your-own":
function test(x)
s = zero(eltype(x)) # this prevents type-instability
for xi in x
s += xi*xi
end
s
end
test(x)
(this needs to be in a function for the JIT compiler to work its magic). It should be reasonably fast, though probably still not as fast as dot, which uses BLAS calls underneath.

Related

Vectorization or alternative to speed up MATLAB loop

I am using MATLAB to run a for loop in which variable-length portions of a large vector are updated at each iteration with the content of another vector; something like:
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) +...
a(k)*vec2(idx_start2(k):idx_end2(k));
end
The selected portions of vec1 and vec2 are not so small and N can be quite large; moreover, if this can be useful, idx_end(k)<idx_start(k+1) does not necessarily hold (i.e. vec1's edited portions may be partially re-updated in subsequent iterations). As a consequence, the above is by far the slowest portion of code in my script and I would like to speed it up, if possible.
Is there any way to vectorize the above for loop in order to make it run faster? Or, are there any alternative approaches to improve its execution speed?
EDIT:
As requested in the comments, here are some example values: Using the profiler to check execution times, the loop above runs in about 3.3 s with N=5e4, length(vec1)=3e6, length(vec2)=1.7e3 and the portions indexed by idx_start/end are slightly shorter on average than the latter, although not significantly.
Of course, 3.3 s is not particularly worrying in itself, but I would like to be able to increase especially N and vec1 by 1 or 2 orders of magnitude and in such a loop it will take quite longer to run.
Sorry, I couldn't find a way to speed up your code. This is the code I created to try to speed it up:
N = 5e4;
vec1 = 1:3e6;
vec2 = 1:1.7e3;
rng(0)
a = randn(N, 1);
idx_start1 = randi([1, 2.9e6], N, 1);
idx_end1 = idx_start1 + 1000;
idx_start2 = randi([1, 0.6e3], N, 1);
idx_end2 = idx_start2 + 1000;
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + a(k) * vec2(idx_start2(k):idx_end2(k));
% use = idx_start1(k):idx_end1(k);
% vec1(use) = vec1(use) + a(k) * vec2(idx_start2(k):idx_end2(k));
end
The two commented-out lines of code in the for loop were my attempt to speed it up, but it actually made it slower, much to my surprise. Generally, I would create a variable for an array that is used more than once thinking that is faster, but it is not. The code that is not commented out runs in 0.24 s versus 0.67 seconds for the code that is commented out.

MATLAB: assignin caller to variables performance issues

I have variables outside a particular function, which I want to put in a workspace. I do this by using assignin('caller'....
As a test case, I did the same scenario, one as described above, and the other as simply defining a duplicate scenario within the main function (code below)
Quite simply there is a performance issue with this, here are my results:
normal
Elapsed time is 1.613414 seconds.
assignin('caller',...
Elapsed time is 1.849663 seconds.
The timing is not significant here, but it does matter as number of (single) variables increase. I see orders of magnitude performance decrease in my work. I checked that both versions give the exact same results at the end
note: For some strange reason in my code not present here, I have many matrices and single-variables each involved in operations and computations. If matrices are called with assignin('caller'..., there is literally no performance hit. With single variables, the problem in performance arises.
MATLAB versions tested in:
2013a Win7-64bit (me)
2010b Win7-64bit (Luis Mendo)
CODE
function sof
sig = 0.3;
max_iter = 100000;
% in functiond efined variables
y1 = 2.5;
y2 = 7.3;
y3 = 3.4;
y4 = 7.2;
y5 = 2.2;
y6 = 1.7;
y7 = 9.2;
k = zeros(1,max_iter);
% defined elswhere using assignin 'caller'
get_vars (max_iter);
% perform calculations with variables defined here
tic
for i=1:max_iter
k(i) = normrnd(y1,sig)/y2*y3*y4/y5*y6/y7/y1*y2 + y1*y3*y5/y2;
end
toc
% perform calculations with variables defined with assignin('caller',...;
tic
for i=1:max_iter
k_a(i) = normrnd(x1,sig)/x2*x3*x4/x5*x6/x7/x1*x2 + x1*x3*x5/x2;
end
toc
end
function get_vars (mrange)
assignin('caller','x1',2.5);
assignin('caller','x2',7.3);
assignin('caller','x3',3.4);
assignin('caller','x4',7.2);
assignin('caller','x5',2.2);
assignin('caller','x6',1.7);
assignin('caller','x7',9.2);
assignin('caller','k_a',zeros(1,mrange ));
end

Efficient replacement for ppval

I have a loop in which I use ppval to evaluate a set of values from a piecewise polynomial spline. The interpolation is easily the most time consuming part of the loop and I am looking for a way improve the function's efficiency.
More specifically, I'm using a finite difference scheme to calculate transient temperature distributions in friction welds. To do this I need to recalculate the material properties (as a function of temperature and position) at each time step. The rate limiting factor is the interpolation of these values. I could use an alternate finite difference scheme (less restrictive in the time domain) but would rather stick with what I have if at all possible.
I've included a MWE below:
x=0:.1:10;
y=sin(x);
pp=spline(x,y);
tic
for n=1:10000
x_int=10*rand(1000,1);
y_int=ppval(pp,x_int);
end
toc
plot(x,y,x_int,y_int,'*') % plot for sanity of data
Elapsed time is 1.265442 seconds.
Edit - I should probably mention that I would be more than happy with a simple linear interpolation between values but the interp1 function is slower than ppval
x=0:.1:10;
y=sin(x);
tic
for n=1:10000
x_int=10*rand(1000,1);
y_int=interp1(x,y,x_int,'linear');
end
toc
plot(x,y,x_int,y_int,'*') % plot for sanity of data
Elapsed time is 1.957256 seconds.
This is slow, because you're running into the single most annoying limitation of JIT. It's the cause of many many many oh so many questions in the MATLAB tag here on SO:
MATLAB's JIT accelerator cannot accelerate loops that call non-builtin functions.
Both ppval and interp1 are not built in (check with type ppval or edit interp1). Their implementation is not particularly slow, they just aren't fast when placed in a loop.
Now I have the impression it's getting better in more recent versions of MATLAB, but there are still quite massive differences between "inlined" and "non-inlined" loops. Why their JIT doesn't automate this task by simply recursing into non-builtins, I really have no idea.
Anyway, to fix this, you should copy-paste the essence of what happens in ppval into the loop body:
% Example data
x = 0:.1:10;
y = sin(x);
pp = spline(x,y);
% Your original version
tic
for n = 1:10000
x_int = 10*rand(1000,1);
y_int = ppval(pp, x_int);
end
toc
% "inlined" version
tic
br = pp.breaks.';
cf = pp.coefs;
for n = 1:10000
x_int = 10*rand(1000,1);
[~, inds] = histc(x_int, [-inf; br(2:end-1); +inf]);
x_shf = x_int - br(inds);
zero = ones(size(x_shf));
one = x_shf;
two = one .* x_shf;
three = two .* x_shf;
y_int = sum( [three two one zero] .* cf(inds,:), 2);
end
toc
Profiler:
Results on my crappy machine:
Elapsed time is 2.764317 seconds. % ppval
Elapsed time is 1.695324 seconds. % "inlined" version
The difference is actually less than what I expected, but I think that's mostly due to the sum() -- for this ppval case, I usually only need to evaluate a single site per iteration, which you can do without histc (but with simple vectorized code) and matrix/vector multiplication x*y (BLAS) instead of sum(x.*y) (fast, but not BLAS-fast).
Oh well, a ~60% reduction is not bad :)
It is a bit surprising that interp1 is slower than ppval, but having a quick look at its source code, it seems that it has to check for many special cases and has to loop over all the points since it it cannot be sure if the step-size is constant.
I didn't check the timing, but I guess you can speed up the linear interpolation by a lot if you can guarantee that steps in x of your table are constant, and that the values to be interpolated are stricktly within the given range, so that you do not have to do any checking. In that case, linear interpolation can be converted to a simple lookup problem like so:
%data to be interpolated, on grid with constant step
x = 0:0.5:10;
y = sin(x);
x_int = 0:0.1:9.9;
%make sure it is interpolation, not extrapolation
assert(all(x(1) <= x_int & x_int < x(end)));
% compute mapping, this can be precomputed for constant grid
slope = (length(x) - 1) / (x(end) - x(1));
offset = 1 - slope*x(1);
%map x_int to interval 1..lenght(i)
xmapped = offset + slope * x_int;
ind = floor(xmapped);
frac = xmapped - ind;
%interpolate by taking weighted sum of neighbouring points
y_int = y(ind) .* (1 - frac) + y(ind+1) .* frac;
% make plot to check correctness
plot(x, y, 'o-', x_int, y_int, '.')

Most efficient way to weight and sum a number of matrices in Fortran

I am trying to write a function in Fortran that multiplies a number of matrices with different weights and then adds them together to form a single matrix. I have identified that this process is the bottleneck in my program (this weighting will be made many times for a single run of the program, with different weights). Right now I'm trying to make it run faster by switching from Matlab to Fortran. I am a newbie at Fortran so I appreciate all help.
In Matlab the fastest way I have found to make such a computation looks like this:
function B = weight_matrices()
n = 46;
m = 1800;
A = rand(n,m,m);
w = rand(n,1);
tic;
B = squeeze(sum(bsxfun(#times,w,A),1));
toc;
The line where B is assigned runs in about 0.9 seconds on my machine (Matlab R2012b, MacBook Pro 13" retina, 2.5 GHz Intel Core i5, 8 GB 1600 MHz DDR3). It should be noted that for my problem, the tensor A will be the same (constant) for the whole run of the program (after initialization), but w can take any values. Also, typical values of n and m are used here, meaning that the tensor A will have a size of about 1 GB in memory.
The clearest way I can think of writing this in Fortran is something like this:
pure function weight_matrices(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
double precision, dimension(num_sizes), intent(in) :: w
double precision, dimension(num_sizes,msize,msize), intent(in) :: A
double precision, dimension(msize,msize) :: B
integer :: i
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
end function weight_matrices
This function runs in about 1.4 seconds when compiled with gfortran 4.7.2, using -O3 (function call timed with "call cpu_time(t)"). If I manually unwrap the loop into
B = w(1)*A(1,:,:)+w(2)*A(2,:,:)+ ... + w(46)*A(46,:,:)
the function takes about 0.11 seconds to run instead. This is great and means that I get a speedup of about 8 times compared to the Matlab version. However, I still have some questions on readability and performance.
First, I wonder if there is an even faster way to perform this weighting and summing of matrices. I have looked through BLAS and LAPACK, but can't find any function that seems to fit. I have also tried to put the dimension in A that enumerates the matrices as the last dimension (i.e. switching from (i,j,k) to (k,i,j) for the elements), but this resulted in slower code.
Second, this fast version is not very flexible, and actually looks quite ugly, since it is so much text for such a simple computation. For the tests I am running I would like to try to use different numbers of weights, so that the length of w will vary, to see how it affects the rest of my algorithm. However, that means I quite tedious rewrite of the assignment of B every time. Is there any way to make this more flexible, while keeping the performance the same (or better)?
Third, the tensor A will, as mentioned before, be constant during the run of the program. I have set constant scalar values in my program using the "parameter" attribute in their own module, importing them with the "use" expression into the functions/subroutines that need them. What is the best way to do the equivalent thing for the tensor A? I want to tell the compiler that this tensor will be constant, after init., so that any corresponding optimizations can be done. Note that A is typically ~1 GB in size, so it is not practical to enter it directly in the source file.
Thank you in advance for any input! :)
Perhaps you could try something like
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
The square brace is a newer form of (/ /), the 1d matrix (vector). The term in sum is a matrix of dimension (n) and sum sums all of those elements. This is precisely what your unwrapped code does (and is not exactly equal to the do loop you have).
I tried to refine Kyle Vanos' solution.
Therefor I decided to use sum and Fortran's vector-capabilities.
I don't know, if the results are correct, because I only looked for the timings!
Version 1: (for comparison)
B = 0
do i = 1,n
B = B + w(i)*A(i,:,:)
end do
Version 2: (from Kyle Vanos)
do k=1,m
do j=1,m
B(j,k)=sum( [ ( (w(i)*A(i,j,k)), i=1,n) ])
enddo
enddo
Version 3: (mixed-up indices, work on one row/column at a time)
do j = 1, m
B(:,j)=sum( [ ( (w(i)*A(:,i,j)), i=1,n) ], dim=1)
enddo
Version 4: (complete matrices)
B=sum( [ ( (w(i)*A(:,:,i)), i=1,n) ], dim=1)
Timing
As you can see, I had to mixup the indices to get faster execution times. The third solution is really strange because the number of the matrix is the middle index, but this is necessary for memory-order-reasons.
V1: 1.30s
V2: 0.16s
V3: 0.02s
V4: 0.03s
Concluding, I would say, that you can get a massive speedup, if you have the possibility to change order of the matrix indices in arbitrary order.
I would not hide any looping as this is usually slower. You can write it explicitely, then you'll see that the inner loop access is over the last index, making it inefficient. So, you should make sure your n dimension is the last one by storing A is A(m,m,n):
B = 0
do i = 1,n
w_tmp = w(i)
do j = 1,m
do k = 1,m
B(k,j) = B(k,j) + w_tmp*A(k,j,i)
end do
end do
end do
this should be much more efficient as you are now accessing consecutive elements in memory in the inner loop.
Another solution is to use the level 1 BLAS subroutines _AXPY (y = a*x + y):
B = 0
do i = 1,n
CALL DAXPY(m*m, w(i), A(1,1,i), 1, B(1,1), 1)
end do
With Intel MKL this should be more efficient, but again you should make sure the last index is the one which changes in the outer loop (in this case the loop you're writing). You can find the necessary arguments for this call here: MKL
EDIT: you might also want to use some parallellization? (I don't know if Matlab takes advantage of that)
EDIT2: In the answer of Kyle, the inner loop is over different values of w, which is more efficient than n times reloading B as w can be kept in cache (using A(n,m,m)):
B = 0
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
do k = 1,n
B(j,i) = B(j,i) + w(k)*A(k,j,i)
end do
end do
end do
This explicit looping performs about 10% better as the code of Kyle which uses whole-array operations. Bandwidth with ifort -O3 -xHost is ~6600 MB/s, with gfortran -O3 it's ~6000 MB/s, and the whole-array version with either compiler is also around 6000 MB/s.
I know this is an old post, however I will be glad to bring my contribution as I played with most of the posted solutions.
By adding a local unroll for the weights loop (from Steabert's answer ) gives me a little speed-up compared to the complete unroll version (from 10% to 80% with different size of the matrices). The partial unrolling may help the compiler to vectorize the 4 operations in one SSE call.
pure function weight_matrices_partial_unroll_4(w,A) result(B)
implicit none
integer, parameter :: n = 46
integer, parameter :: m = 1800
real(8), intent(in) :: w(n)
real(8), intent(in) :: A(n,m,m)
real(8) :: B(m,m)
real(8) :: Btemp(4)
integer :: i, j, k, l, ndiv, nmod, roll
!==================================================
roll = 4
ndiv = n / roll
nmod = mod( n, roll )
do i = 1,m
do j = 1,m
B(j,i)=0.0d0
k = 1
do l = 1,ndiv
Btemp(1) = w(k )*A(k ,j,i)
Btemp(2) = w(k+1)*A(k+1,j,i)
Btemp(3) = w(k+2)*A(k+2,j,i)
Btemp(4) = w(k+3)*A(k+3,j,i)
k = k + roll
B(j,i) = B(j,i) + sum( Btemp )
end do
do l = 1,nmod !---- process the rest of the loop
B(j,i) = B(j,i) + w(k)*A(k,j,i)
k = k + 1
enddo
end do
end do
end function

Performance of swapping two elements in MATLAB

Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The total length of time taken up by the second example is 12 times less than the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885548 times. (The difference here is times four, not times twelve as I originally posted. Each of the functions have the additional function-wrapping overhead, while in my initial post I just summed up the individual lines).
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
function swap(self, i1, i2)
swap1(self, i1, i2);
swap2(self, i1, i2);
swap3(self, i1, i2);
self.SwapCount = self.SwapCount + 1;
end
end
methods (Access = private)
%
% swap1: stores values in temporary doubles
% This has the best performance
%
function swap1(self, i1, i2)
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
end
%
% swap2: stores values in a temporary matrix
% Marginally slower than swap1
%
function swap2(self, i1, i2)
m = self.Data([i1, i2]);
self.Data([i2, i1]) = m;
end
%
% swap3: does not use variables for storage.
% This has the worst performance
%
function swap3(self, i1, i2)
self.Data([i1, i2]) = self.Data([i2, i1]);
end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
you could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may be a function of the version of MATLAB you are using. I tried both of your cases in version 7.1.0.246 (a couple years old) and only saw a speed difference of about 2-2.5.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];
%Example 1:
for i = 1:4,
vec(i) = vec(i)+1;
end;
%Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
This post deserves an update, since the JIT compiler is now a thing (since R2015b) and so is timeit (since R2013b) for more reliable function timing.
Below is a short benchmarking function for element swapping within a large array.
I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question respectively.
The results are pretty staggering, the performance of directly swapping 2 elements using is increasingly poor by comparison to using a temporary variable.
function benchie()
% Variables for plotting, loop to increase size of the arrays
M = 15; D = zeros(1,M); W = zeros(1,M);
for n = 1:M;
N = 2^n;
% Create some random array of length N, and random indices to swap
v = rand(N,1);
x = randi([1, N], N, 1);
y = randi([1, N], N, 1);
% Time the functions
D(n) = timeit(#()direct);
W(n) = timeit(#()withtemp);
end
% Plotting
plot(2.^(1:M), D, 2.^(1:M), W);
legend('direct', 'with temp')
xlabel('number of elements'); ylabel('time (s)')
function direct()
% Direct swapping of two elements
for k = 1:N
v([x(k) y(k)]) = v([y(k) x(k)]);
end
end
function withtemp()
% Using an intermediate temporary variable
for k = 1:N
tmp = v(y(k));
v(y(k)) = v(x(k));
v(x(k)) = tmp;
end
end
end

Resources