speed of individual function and entire program - performance

I have a large loop that contains abs() of a complex number.
The program took a long time to complete.
I tried other ways to compute abs(): when C is complex, for example, (real(C)^2+imag(C)^2).^.5.
This gives the same result as abs(C).
When I timed it with tic and toc, (real(C)^2+imag(C)^2).^.5 was slightly faster, so I substituted it and ran the program again.
However, the profiler shows that the version with abs() was much faster.
How can this happen, and how can I make abs(C) run faster?

I take it from your comment that you are using large loops. MATLAB is not that efficient with those. For example:
test = randn(10000000,2);
testC = complex(test(:,1),test(:,2));
%%vector
tic
foo = abs(testC);
toc
%%loop
bar = zeros(size(foo));
tic
for i=1:10000000
bar(i) = abs(testC(i));
end
toc
gives you something like
Elapsed time is 0.106635 seconds.
Elapsed time is 0.928885 seconds.
That's why I would recommend calculating abs() outside the loop. If replacing the whole loop is not an option, you can do so in parts. For example, you could run your loop until you have all your complex numbers, end the loop, compute abs(), and then start a new loop with those results. Also, if each iteration of your loop is independent of the results of other iterations, you might want to look into parfor as a replacement for for-loops.
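The "do it in parts" idea could look roughly like the sketch below. It is only an illustration: computeValue is a hypothetical stand-in for whatever the real loop does to produce each complex number, and N is an arbitrary size.

```matlab
% Hypothetical stand-in for the per-iteration work that yields a complex value.
computeValue = @(k) complex(cos(k), sin(2*k));

N = 1e5;                      % arbitrary problem size for this sketch
C = complex(zeros(N,1));      % preallocate complex storage

% First loop: only collect the complex values.
for k = 1:N
    C(k) = computeValue(k);
end

% One vectorized call replaces N scalar calls to abs().
magnitudes = abs(C);

% Second loop: continue the remaining work with the precomputed magnitudes.
total = 0;
for k = 1:N
    total = total + magnitudes(k);
end
```

Whether the split pays off depends on how much of each iteration is spent in abs(), so it is worth timing both variants on the real code.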

Related

Optimize Julia Code by Example

I am currently writing a numerical solver in Julia. I don't think the math behind it matters too much. It all boils down to the fact, that a specific operation is executed several times and uses a large percentage (~80%) of running time.
I tried to reduce it as much as possible and present you this piece of code, which can be saved as dummy.jl and then executed via include("dummy.jl") followed by dummy(10) (for compilation) and then dummy(1000).
function dummy(N::Int64)
A = rand(N,N)
@time timethis(A)
end
function timethis(A::Array{Float64,2})
dummyvariable = 0.0
for k=1:100 # just repeat a few times
for i=2:size(A)[1]-1
for j=2:size(A)[2]-1
dummyvariable += slopefit(A[i-1,j],A[i,j],A[i+1,j],2.0)
dummyvariable += slopefit(A[i,j-1],A[i,j],A[i,j+1],2.0)
end
end
end
println(dummyvariable)
end
@inline function minmod(x::Float64, y::Float64)
return sign(x) * max(0.0, min(abs(x),y*sign(x) ) );
end
@inline function slopefit(left::Float64,center::Float64,right::Float64,theta::Float64)
# arg=ccall((:minmod,"libminmod"),Float64,(Float64,Float64),0.5*(right-left),theta*(center-left));
# result=ccall((:minmod,"libminmod"),Float64,(Float64,Float64),theta*(right-center),arg);
# return result
tmp = minmod(0.5*(right-left),theta*(center-left));
return minmod(theta*(right-center),tmp);
#return 1.0
end
Here, timethis is meant to imitate the part of the code where I spend a lot of time. I notice that slopefit is extremely expensive to execute.
For example, dummy(1000) takes roughly 4 seconds on my machine. If instead, slopefit would just always return 1 and not compute anything, the time goes down to one tenth of the overall time.
Now, obviously there is no free lunch.
I am aware that this is simply a costly operation. But I would still like to optimize it as much as possible, given that a lot of time is spent in something that looks easy to optimize, as it is just a few lines of code.
So far, I tried to implement minmod and slopefit as C-functions and call them, however that just increased computing time (maybe I did it wrong).
So my question is, what possibilities do I have to optimize the call of slopefit?
Note, that in the actual code, the arguments of slopefit are not the ones mentioned here, but depend on conditional statements which makes everything hard to vectorize (if that would bring any performance gain I am not sure).
There are two levels of optimization I can think of.
First: the following implementation of minmod will be faster as it avoids branching (I understand this is the functionality you want):
@inline minmod(x::Float64, y::Float64) = ifelse(x<0, clamp(y, x, 0.0), clamp(y, 0.0, x))
Second: you can use @inbounds to speed up the loop a bit:
@inbounds for i=2:size(A)[1]-1

Trying to improve the efficiency of a triple for loop in MATLAB

I'm trying to vectorize or make this loop run faster (it's a minimal code):
n=1000;
L=100;
x=linspace(-L/2,L/2);
V1=rand(n);
for i=1:length(x)
for k=1:n
for j=1:n
V2(j,k)=V1(j,k)*log(2/L)*tan(pi/L*(x(i)+L/2)*j);
end
end
V3(i,:)=sum(V2);
end
I would appreciate your help.
An alternative to vectorization is to recognize the expensive operations in the code and reduce them. For instance, log(2/L) is called 100*1000*1000 times with input that does not depend on any of the three for loops. If we calculate this value outside of the for loops, then we can use it instead:
logResult = log(2/L);
and
V2(j,k)=V1(j,k)*log(2/L)*tan(pi/L*(x(i)+L/2)*j);
becomes
V2(j,k)=(V1(j,k)*logResult*tan(pi/L*(x(i)+L/2)*j));
Likewise, the code calls the tan function the same 100*1000*1000 times. Note that this calculation, tan(pi/L*(x(i)+L/2)*j), does not depend on k. So if we calculate these values outside of the for loops, we can reduce the number of tan evaluations by a factor of 1000:
lenx = length(x);
tanValues = zeros(lenx,n);
for i=1:lenx
for j=1:n
tanValues(i,j) = tan(pi/L*(x(i)+L/2)*j);
end
end
and the calculation for V2(j,k) becomes
V2(j,k)=V1(j,k)*logResult*tanValues(i,j);
Also, memory can be pre-allocated to the V2 and V3 matrices to avoid the internal resizing that occurs on each iteration. Just do the following outside the for loops
V2 = zeros(n,n);
V3 = zeros(lenx,n);
Measured with tic and toc, these changes reduce the original execution time from ~14 seconds to ~6 seconds on my workstation. This is still three times slower than natan's solution, which takes ~2 seconds for me.
Here's a vectorized solution using meshgrid, bsxfun and repmat:
% fast preallocation
jj(n,n)=0; B(n,n,L)=0; V3(L,n)=0;
lg=log(2/L);
% the vectorization part
jj=meshgrid(1:n);
B=bsxfun(@times,ones(n),permute(x,[3 1 2]));
V3=squeeze(sum(lg*repmat(V1,1,1,numel(x)).*tan(bsxfun(@times,jj',pi/L*(B+L/2))),1)).';
Running your code on my computer using tic/toc took ~25 seconds. The bsxfun code took ~4.5 seconds...
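As a further sketch (not taken from either answer above): because the tan factor depends only on i and j, the sum over j is a matrix product between the tan table and V1. The variable names follow the question. The result should agree with the loop version only up to floating-point rounding, since the summation order differs and tan takes very large values near its poles.

```matlab
n = 1000;
L = 100;
x = linspace(-L/2, L/2);      % 100 points by default
V1 = rand(n);

logResult = log(2/L);

% Outer product: a (numel(x) x 1) column times a (1 x n) row gives the
% numel(x) x n table with tanValues(i,j) = tan(pi/L*(x(i)+L/2)*j).
tanValues = tan(pi/L*(x(:) + L/2) * (1:n));

% V3(i,k) = logResult * sum over j of tanValues(i,j)*V1(j,k),
% which is a matrix product of tanValues and V1.
V3 = logResult * (tanValues * V1);
```

This avoids building the n-by-n-by-numel(x) intermediate array, so it also needs far less memory than the repmat approach.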

How can I precisely profile /benchmark algorithms in MATLAB?

The algorithm repeats the same thing again and again. I expected to get the same time in each trial, but I got very different times for four identical trials.
I expected the timing curves to be identical, but they behave totally differently. The reason is probably the precision of tic/toc.
What kind of profiling/timing tools should I use in MATLAB?
What am I doing wrong in the code below? How reliable is tic/toc profiling?
Is there any way to guarantee consistent results?
Algorithm
A=[];
for ii=1:25
tic;
timerval=tic;
AlgoCalculatesTheSameThing();
tElapsed=toc(timerval);
A=[A,tElapsed];
end
You should try timeit.
Have a look at this related question:
How to benchmark Matlab processes?
A snippet from Sam Roberts answer to the other question:
It handles many subtle issues related to benchmarking MATLAB code for you, such as:
ensuring that JIT compilation is used by wrapping the benchmarked code in a function
warming up the code
running the code several times and averaging
Have a look at this question for discussion regarding warm up:
Why does Matlab run faster after a script is "warmed up"?
Update:
Since timeit was first submitted at the fileexchange, the source code is available here and can be studied and analyzed (as opposed to most other MATLAB functions).
From the header of timeit.m:
% TIMEIT handles automatically the usual benchmarking procedures of "warming
% up" F, figuring out how many times to repeat F in a timing loop, etc.
% TIMEIT also compensates for the estimated time-measurement overhead
% associated with tic/toc and with calling function handles. TIMEIT returns
% the median of several repeated measurements.
You can go through the function step-by-step. The comments are very good and descriptive in my opinion. It is of course possible that Mathworks has changed parts of the code, but the overall functionality is there.
For instance, to account for the time it takes to run tic/toc:
function t = tictocTimeExperiment
% Call tic/toc 100 times and return the average time required.
It is later subtracted from the total time.
The following is said regarding number of computations:
function t = roughEstimate(f, num_f_outputs)
% Return rough estimate of time required for one execution of
% f(). Basic warmups are done, but no fancy looping, medians,
% etc.
This rough estimate is used to determine how many times the computations should run.
If you want to change the number of computation times, you can modify the timeit function yourself, as it is available. I would recommend you to save it as my_timeit, or something else, so that you avoid overwriting the built-in version.
Qualitatively, there are large differences between identical runs. I did the same four trials as in the question and tested them with the methods suggested so far. I also created my own version of timeit, called timeitH, because timeit has too large a standard deviation between trials. timeitH returns far more robust results than the other methods because it warms up the code in the same way as timeit and increases the number of outer loops in the original timeit from 11 to 50.
The below has the four trials done with the three different methods. The closer the curves are to each other, the better.
TimeitH: results pretty good!
Some observations.
timeit: results are smoothed, but with bumps
tic/toc: easy to adjust for larger cases to reduce the standard deviation of the computation times, but no warming up
timeitH: download the code and change the 60th line to num_outer_iterations = 50; to get smoother results
In summary
I think timeitH is the best candidate here, although I have only tested it on evaluating sparse polynomials. timeit and tic/toc averaged over 500 trials do not give robust results.
Timeit
500 trials and average with tic/toc
Algorithm for the 500 trials with tic/toc
for ii=1:25
numTrials = 500;
tic;
for jj=1:numTrials
AlgoCalculatesTheSameThing();
end
tTotal = toc;
tElapsed = tTotal/numTrials;
A=[A,tElapsed];
end
Is the time for AlgoCalculatesTheSameThing() relatively short (fractions of sec or a few sec) or long (multi-minutes or hours)? If the former I would suggest doing it more like this: move your timing functions outside your loop, then compute averages:
A=[];
numTrials = 25;
tic;
for ii=1:numTrials
AlgoCalculatesTheSameThing();
end
tTotal = toc;
tAvg = tTotal/numTrials;
If the event is short enough (fraction of sec) then you should also increase the value of numTrials to 100s or even 1000s.
You have to consider that with any timing function there will be error bars (as in any other measurement). If the event you're timing is short enough, the uncertainties in your measurement can be relatively big, keeping in mind that the resolution of tic and toc is also finite.
More discussion on the accuracy of tic and toc can be found here.
You need to work out these uncertainties for your specific application, so do experiments: average over a number of trials and then compute the standard deviation to get a sense of the "scatter" or uncertainty in your results.
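A minimal sketch of such an experiment follows. Here work is a hypothetical stand-in for AlgoCalculatesTheSameThing(), and the repeat counts are arbitrary choices, not recommendations from the answers above.

```matlab
% Hypothetical stand-in for AlgoCalculatesTheSameThing().
work = @() sum(sqrt(rand(1e5,1)));

numRepeats = 10;     % number of independent measurements
numTrials  = 100;    % calls averaged within each measurement
tAvg = zeros(numRepeats,1);

for r = 1:numRepeats
    tStart = tic;                 % explicit timer handle
    for t = 1:numTrials
        work();
    end
    tAvg(r) = toc(tStart)/numTrials;
end

% The standard deviation across measurements gives a sense of the scatter.
fprintf('mean %.3g s, std %.3g s\n', mean(tAvg), std(tAvg));
```

If the standard deviation is a large fraction of the mean, increasing numTrials is usually the first thing to try.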

Matlab GPU performance fft vs. simple addition

I am wondering about the big performance difference between an FFT and a simple addition on a GPU using MATLAB. I would expect the FFT to be slower on the GPU than a simple addition. But why is it the other way around? Any suggestions?
a=rand(2.^20,1);
a=gpuArray(a);
b=gpuArray(0);
c=gpuArray(1);
tic % should take a long time
for k=1:1000
fft(a);
end
toc % Elapsed time is 0.085893 seconds.
tic % should be fast, but isn't
for k=1:1000
b=b+c;
end
toc % Elapsed time is 1.430682 seconds.
It is also interesting to note that the computation time for the addition (second loop) decreases if I reduce the length of the vector a.
EDIT
If I change the order of the two loops, i.e. if the addition is done first, the addition takes 0.2 seconds instead of 1.4 seconds. The FFT time is still the same.
I'm guessing that Matlab isn't actually running the fft because the output is not used anywhere. Also, in your simple addition loop, each iteration depends on the previous one, so it has to run serially.
I don't know why the order of the loops matters. Maybe it has something to do with cleaning up the GPU memory after the first loop. You could try calling pause(1) between the loops to let your computer get back to an idle state before the second loop. That may make your timing more consistent.
I don't have a 2012b MATLAB with GPU to hand to check this, but I think that you are missing a wait() command. In 2012a, MATLAB introduced asynchronous GPU calculations. So, when you send something to the GPU, it doesn't wait until it's finished before moving on in the code. Try this:
mygpu=gpuDevice(1);
a=rand(2.^20,1);
a=gpuArray(a);
b=gpuArray(0);
c=gpuArray(1);
tic % should take a long time
for k=1:1000
fft(a);
end
wait(mygpu); %Wait until the GPU has finished calculating before moving on
toc
tic % should be fast
for k=1:1000
b=b+c;
end
wait(mygpu); %Wait until the GPU has finished calculating before moving on
toc
The computation time of the addition should no longer depend on when it's carried out. Would you mind checking and getting back to me, please?
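In releases where Parallel Computing Toolbox provides gputimeit, the synchronization can be left to that function instead of calling wait() by hand. The following is only a sketch under that assumption and needs a supported GPU to run. It times a single call of each operation rather than the 1000-iteration loops above.

```matlab
a = gpuArray(rand(2^20,1));
b = gpuArray(0);
c = gpuArray(1);

% gputimeit runs the handle several times and waits for the GPU to finish
% before stopping the clock.
tFft = gputimeit(@() fft(a));
tAdd = gputimeit(@() b + c);

fprintf('fft: %.3g s, addition: %.3g s\n', tFft, tAdd);
```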

How to benchmark Matlab processes?

While searching for a way to avoid using loops in my MATLAB code, I found the following comments under a question on SE:
The statement "for loops are slow in Matlab" is no longer generally true since Matlab...euhm, R2008a?
and
Have you tried to benchmark a for loop vs what you already have? sometimes it is faster than vectorized code...
So I would like to ask: is there a commonly used way to test the speed of a process in MATLAB? Can the user see somewhere how much time the process takes, or is the only way to stretch the processes to several minutes in order to compare their times with each other?
The best tool for testing the performance of MATLAB code is Steve Eddins' timeit function, available here from the MATLAB Central File Exchange.
It handles many subtle issues related to benchmarking MATLAB code for you, such as:
ensuring that JIT compilation is used by wrapping the benchmarked code in a function
warming up the code
running the code several times and averaging
Update: As of release R2013b, timeit is part of core MATLAB.
Update: As of release R2016a, MATLAB also includes a performance testing framework that handles the above issues for you in a similar way to timeit.
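For illustration, a minimal timeit sketch, assuming R2013b or later (or the File Exchange version on the path). The test data and the two magnitude expressions are illustrative choices, not part of the answer above.

```matlab
% Illustrative complex test data.
testC = complex(randn(1e6,1), randn(1e6,1));

% timeit expects a function handle that takes no inputs.
tAbs  = timeit(@() abs(testC));
tSqrt = timeit(@() sqrt(real(testC).^2 + imag(testC).^2));

fprintf('abs: %.3g s, sqrt of squares: %.3g s\n', tAbs, tSqrt);
```

Wrapping the code in a function handle is what lets timeit handle warm-up and repetition on its own.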
You can use the profiler to assess how much time your functions, and the blocks of code within them, are taking.
>> profile on; % Starts the profiler
>> myfunctiontorun( ); % This can be a function, script or block of code
>> profile viewer; % Opens the viewer showing you how much time everything took
Viewer also clears the current profile data for next time.
Bear in mind, profile does tend to slow execution a bit, but I believe it does so in a uniform way across everything.
Obviously if your function is very quick, you might find you don't get reliable results so if you can run it many times or extend the computation that would improve matters.
If it's really simple stuff you're testing, you can also just time it using tic and toc:
>> tic; % Start the timer
>> myfunctionname( );
>> toc; % End the timer and display elapsed time
Also if you want multiple timers, you can assign them to variables:
>> mytimer = tic;
>> myfunctionname( );
>> toc(mytimer);
Finally, if you want to store the elapsed time instead of display it:
>> myresult = toc;
I think that I am right to state that many of us time Matlab by wrapping the block of code we're interested in between tic and toc. Furthermore, we take care to ensure that the total time is of the order of 10s of seconds (rather than 1s of seconds or 100s of seconds) and repeat it 3 - 5 times and take some measure of central tendency (such as the mean) and draw our conclusions from that.
If the piece of code takes less than, say 10s, then repeat it as many times as necessary to bring it into the range, being careful to avoid any impact of one iteration on the next. And if the code naturally takes 100s of seconds or longer, either spend longer on the testing or try it with artificially small input data to run more quickly.
In my experience it's not necessary to run programs for minutes to get data on average run time with acceptably low variance. If I run a program 5 times and one (or two) of the results is wildly different from the mean I'll re-run it.
Of course, if the code has any features which make its run time non-deterministic then it's a different matter.

Resources