Timing R code execution using system.time()

I have some code that I ported from Matlab to R. I want to compare their performance.
However, I encountered a problem: when I use system.time() in R, I get different results for the same code. Is this supposed to happen? How should I compare the two?

You'll get different results if you time yourself running a 100 m sprint too! The computer has lots of things going on that will slightly vary the time it takes to run your code.
The solution is to run the code many times. The R package rbenchmark is what you're looking for.

As @Justin said, the times will always vary, especially the first couple of times, since the garbage collector hasn't adjusted itself to your specific use. It might be a good idea to restart R before measuring (and close other programs, make sure the system isn't scanning for viruses at the time, etc.)...
Note that if the measured time is small (fractions of a second), the relative error will be rather large, so try to adjust the problem so it takes at least a second.
The packages rbenchmark or microbenchmark can help.
...but I typically just do a for-loop around the problem and adjust it until it takes a second or so - and then I run it several times too.
Here's an example:
f <- function(x, y) {
  sum <- 1
  for (i in seq_along(x)) sum <- x[[i]] + y[[i]] * sum
  sum
}
n <- 10000
x <- 1:n + 0.5
y <- -1:-n + 0.5
system.time(f(x,y)) # 0.02-0.03 secs
system.time(for(i in 1:100) f(x,y)) # 1.56-1.59 secs
...so calling it 100 times reduced the relative error a lot.
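Rather than hand-rolling the loop, a package can do the repetition and averaging for you. A minimal sketch, assuming the microbenchmark package is installed (any of the packages mentioned above works along the same lines):

```r
# Minimal sketch: time f(x, y) with the microbenchmark package.
# Assumes install.packages("microbenchmark") has been run.
library(microbenchmark)

f <- function(x, y) {
  s <- 1
  for (i in seq_along(x)) s <- x[[i]] + y[[i]] * s
  s
}
n <- 10000
x <- 1:n + 0.5
y <- -1:-n + 0.5

# Evaluates f(x, y) 100 times and reports min/median/max timings,
# avoiding the manual for-loop-and-divide bookkeeping.
microbenchmark(f(x, y), times = 100)
```

Because it times each evaluation individually and reports summary statistics, this also addresses the small-measured-time problem described above.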


polyfit on GPUArray is extremely slow [duplicate]

function w = oja(X, varargin)
    % get the dimensionality
    [m, n] = size(X);
    % random initial weights
    w = randn(m, 1);
    options = struct( ...
        'rate', .00005, ...
        'niter', 5000, ...
        'delta', .0001);
    options = getopt(options, varargin);
    success = 0;
    % run through all input samples
    for iter = 1:options.niter
        y = w'*X;
        for ii = 1:n
            % y(ii) is a scalar, not a vector
            w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
        end
    end
    if (any(~isfinite(w)))
        warning('Lost convergence; lower learning rate?');
    end
end
size(X) is [400 153600].
This code implements Oja's rule and runs slowly. I am not able to vectorize it any further. To make it run faster I wanted to do the computations on the GPU, so I changed
X = gpuArray(X)
but the code instead ran slower. The computations used seem to be GPU-compatible. Please let me know my mistake.
Profile Code Output:
Complete details:
https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing
This is not a full answer on how to solve it, but more an explanation of why the GPU does not speed up, but actually enormously slows down, your code.
GPUs are fantastic at speeding up code that is parallel, meaning that they can do A LOT of things at the same time (e.g. my GPU can do 30070 things at the same time, while a modern CPU can't go over 16). However, individual GPU cores are very slow! Nowadays a decent CPU runs at around 2-3 GHz while a modern GPU core runs at around 700 MHz. This means that a single CPU core is much faster than a single GPU core, but as GPUs can do lots of things at the same time they can win overall.
I once saw it explained like this: what do you prefer, a million-dollar sports car or a scooter? A million-dollar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for this last one (unless you are a scooter fan and you answered scooters in all of them, but that's not the point). (source and good introduction to GPUs)
Back to your code: your code is incredibly sequential. Every inner iteration depends on the previous one, and the same goes for the outer iterations. You cannot run two of these in parallel, as you need the result of one iteration to run the next. This means that you will not get a pizza order until you have delivered the last one, so what you want is to deliver them one by one, as fast as you can (so the sports car is better!).
And actually, each of these one-line updates is incredibly fast! If I run 50 of them on my computer I get 13.034 seconds on that line, which is 1.69 microseconds per call (7,680,000 calls).
Thus your problem is not that your code is slow, it is that you call it A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for this kind of thing.
Thus, unfortunately, GPUs are bad at sequential code, and your code is very sequential, so you cannot use a GPU to speed it up. An HPC cluster won't help either, because every loop iteration depends on the previous one (no parfor :( ).
So, as far as I can tell, you will need to deal with it.

estimate performance gain based on application profiling (math)

This question is rather math-related, but it is certainly of interest to any software developer.
I have done some profiling of my application, and I have observed a huge performance difference that is environment-specific.
There is a "fast" environment and a "slow" environment.
The overall application performance on "fast" is 5 times faster than on "slow".
A particular function call on "fast" is 18 times faster than on "slow".
So let's assume I will be able to reduce the invocations of this particular function by 50 percent.
How do I calculate the estimated performance improvement on the "slow" environment?
Is there any approximate formula for calculating the expected performance gain?
Apologies: nowadays I'm no longer good at doing any math, or rather I never was! I have been thinking about where best to ask such a question and didn't come up with any more suitable place. I also wasn't able to come up with an optimal subject line or tags...
Let's make an assumption (questionable but we have nothing else to go on).
Let's assume all of the 5:1 reduction in time is due to function foo reducing by 18:1.
That means everything else in the program takes the same amount of time.
So suppose in the fast environment the total time is f + x, where f is the time that foo takes in the fast environment, and x is everything else.
In the slow environment, the time is 18f+x, which equals 5(f+x).
OK, solve for x.
18f+x = 5f+5x
13f = 4x
x = 13/4 f
OK, now on the slow environment you want to call foo half as much.
So then the time would be 9f+x, which is:
9f + 13/4 f = 49/4 f
The original time was 18f+x = (18+13/4)f = 85/4 f
So the time goes from 85/4 f to 49/4 f.
That's a speed ratio of 85/49 = 1.73
In other words, that's a speedup of 73%.
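The arithmetic can be checked mechanically. A small R sketch of the derivation above, with foo's fast-environment time f normalised to 1:

```r
# Check the speedup arithmetic, taking foo's fast-environment time f = 1.
f <- 1
x <- 13 / 4 * f          # "everything else", from 13f = 4x
t_before <- 18 * f + x   # slow environment today:      85/4 * f
t_after  <-  9 * f + x   # after halving calls to foo:  49/4 * f
speedup  <- t_before / t_after
speedup                  # 85/49, about 1.73
```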

GPGPU computation with MATLAB does not scale properly

I've been experimenting with the GPU support of Matlab (R2014a). The notebook I'm using to test my code has an NVIDIA 840M built in.
Since I'm new to GPU computing with Matlab, I started out with a few simple examples and observed strange scaling behavior. Instead of increasing the size of my problem, I simply put a for loop around my computation. I expected the computation time to scale with the number of iterations, since the problem size itself does not increase. This was true for smaller numbers of iterations; however, at a certain point the time does not scale as expected, and instead I observe a huge increase in computation time. Afterwards, the problem continues to scale as expected again.
The code example started from a random walk simulation, but I tried to produce an example that is simple and still shows the problem.
Here's what my code does. I initialize two matrices as sin(alpha) and cos(alpha). Then I loop over the number of iterations from 2^1 to 2^15. I then repeat the computation sin(alpha)^2 + cos(alpha)^2 (this was just to check the result), performing the calculation as often as the number of iterations suggests.
function gpu_scale
    close all
    NP = 10000;
    NT = 1000;
    ALPHA = rand(NP,NT,'single')*2*pi;
    SINALPHA = sin(ALPHA);
    COSALPHA = cos(ALPHA);
    GSINALPHA = gpuArray(SINALPHA); % move array to gpu
    GCOSALPHA = gpuArray(COSALPHA);
    PMAX = 15;
    for P = 1:PMAX
        for i = 1:2^P
            GX = GSINALPHA.^2;
            GY = GCOSALPHA.^2;
            GZ = GX + GY;
        end
    end
The following plot shows the computation time in a log-log plot for the case where I always double the number of iterations. The jump occurs when doubling from 1024 to 2048 iterations.
The initial bump for two iterations might be due to initialization and is not really relevant anyhow.
I see no reason for the jump between 2^10 and 2^11 iterations, since the computation time should only depend on the number of iterations.
My question: Can somebody explain this behavior to me? What is happening on the software/hardware side, that explains this jump?
Thanks in advance!
EDIT: As suggested by Divakar, I changed the way I time my code. I wasn't sure I was using gputimeit correctly; however, MathWorks suggests another possible way, namely
gd= gpuDevice();
tic
% the computation
wait(gd);
Time = toc;
Using this way to measure my performance, the time is significantly slower; however, I don't observe the jump from the previous plot. I added the CPU performance for comparison and kept both timings for the GPU (wait / no wait), which can be seen in the following plot.
It seems that the observed jump "corrects" the timing in the direction of the case where I used wait. If I understand the problem correctly, then the good performance in the no-wait case is due to the fact that we do not wait for the GPU to finish completely. However, I still don't see an explanation for the jump.
Any ideas?

How can I precisely profile /benchmark algorithms in MATLAB?

The algorithm repeats the same thing again and again. I expected to get the same time in each trial, but I got very unexpected times for the four identical trials,
in which I expected the curves to be identical, but they behave totally differently. The reason is probably the tic/toc precision.
What kind of profiling/timing tools should I use in Matlab?
What am I doing wrong in the below code? How reliable is the tic/toc profiling?
Anyway to guarantee consistent results?
Algorithm
A = [];
for ii = 1:25
    tic;
    timerval = tic;
    AlgoCalculatesTheSameThing();
    tElapsed = toc(timerval);
    A = [A, tElapsed];
end
You should try timeit.
Have a look at this related question:
How to benchmark Matlab processes?
A snippet from Sam Roberts' answer to the other question:
It handles many subtle issues related to benchmarking MATLAB code for you, such as:
ensuring that JIT compilation is used by wrapping the benchmarked code in a function
warming up the code
running the code several times and averaging
Have a look at this question for discussion regarding warm up:
Why does Matlab run faster after a script is "warmed up"?
Update:
Since timeit was originally submitted to the File Exchange, its source code is available here and can be studied and analyzed (as opposed to most other MATLAB functions).
From the header of timeit.m:
% TIMEIT handles automatically the usual benchmarking procedures of "warming
% up" F, figuring out how many times to repeat F in a timing loop, etc.
% TIMEIT also compensates for the estimated time-measurement overhead
% associated with tic/toc and with calling function handles. TIMEIT returns
% the median of several repeated measurements.
You can go through the function step by step. The comments are very good and descriptive, in my opinion. It is of course possible that MathWorks has changed parts of the code, but the overall functionality is still there.
For instance, to account for the time it takes to run tic/toc:
function t = tictocTimeExperiment
% Call tic/toc 100 times and return the average time required.
It is later subtracted from the total time.
The following is said regarding number of computations:
function t = roughEstimate(f, num_f_outputs)
% Return rough estimate of time required for one execution of
% f(). Basic warmups are done, but no fancy looping, medians,
% etc.
This rough estimate is used to determine how many times the computations should run.
If you want to change the number of computation runs, you can modify the timeit function yourself, as its source is available. I would recommend saving it as my_timeit, or something else, so that you avoid overwriting the built-in version.
Qualitatively, there are large differences between identical runs. I did the same four trials as in the question and tested them with the methods suggested so far, and I created my own version of timeit, called timeitH, because timeit had too large a standard deviation between different trials. timeitH returns far more robust results than the other methods because it warms up the code the same way timeit does, and it increases the number of outer loops in the original timeit from 11 to 50.
Below are the four trials done with the three different methods. The closer the curves are to each other, the better.
Some observations.
timeitH: results pretty good! Download the code and change line 60 to num_outer_iterations = 50; to get smoother results.
timeit: results smoothed but with bumps.
tic/toc: easy to adjust for larger cases to get a smaller standard deviation in computation times, but no warm-up.
In summary, I think timeitH is the best candidate here, though I have only tested it on evaluating sparse polynomials. timeit and tic/toc with around 500 repetitions do not produce robust results.
Timeit
500 trials and average with tic/toc
Algorithm for the 500 trials with tic/toc
A = [];
for ii = 1:25
    numTrials = 500;
    tic;
    for jj = 1:numTrials   % different loop variable than the outer loop
        AlgoCalculatesTheSameThing();
    end
    tTotal = toc;
    tElapsed = tTotal/numTrials;
    A = [A, tElapsed];
end
Is the time for AlgoCalculatesTheSameThing() relatively short (fractions of a second or a few seconds) or long (many minutes or hours)? If the former, I would suggest doing it more like this: move your timing functions outside your loop, then compute averages:
A = [];
numTrials = 25;
tic;
for ii = 1:numTrials
    AlgoCalculatesTheSameThing();
end
tTotal = toc;
tAvg = tTotal/numTrials;
If the event is short enough (a fraction of a second) then you should also increase the value of numTrials to hundreds or even thousands.
You have to consider that with any timing function there will be error bars (as in any other measurement). If the event you're timing is short enough, the uncertainties in your measurement can be relatively big, keeping in mind that the resolution of tic and toc also has some finite value.
More discussion on the accuracy of tic and toc can be found here.
You need to work out these uncertainties for your specific application, so do experiments: compute averages over a number of trials and then compute the standard deviation to get a sense of the "scatter" or uncertainty in your results.

replicate() versus a for loop?

Does anyone know how the replicate() function works in R and how efficient it is relative to using a for loop?
For example, is there any efficiency difference between...
means <- replicate(100000, mean(rnorm(50)))
And...
means <- c()
for (i in 1:100000) {
  means <- c(means, mean(rnorm(50)))
}
(I may have typed something slightly off above, but you get the idea.)
You can just benchmark the code and get your answer empirically. Note that I also added a second for loop flavor which circumvents the growing vector problem by preallocating the vector.
repl_function = function(no_rep) means <- replicate(no_rep, mean(rnorm(50)))

for_loop = function(no_rep) {
  means <- c()
  for (i in 1:no_rep) {
    means <- c(means, mean(rnorm(50)))
  }
  means
}

for_loop_prealloc = function(no_rep) {
  means <- vector(mode = "numeric", length = no_rep)
  for (i in 1:no_rep) {
    means[i] <- mean(rnorm(50))
  }
  means
}
no_loops = 50e3
benchmark(repl_function(no_loops),
          for_loop(no_loops),
          for_loop_prealloc(no_loops),
          replications = 3)
                         test replications elapsed relative user.self sys.self user.child sys.child
2          for_loop(no_loops)            3  18.886    6.274    17.803    0.894          0         0
3 for_loop_prealloc(no_loops)            3   3.209    1.066     3.189    0.000          0         0
1     repl_function(no_loops)            3   3.010    1.000     2.997    0.000          0         0
Looking at the relative column, the non-preallocated for loop is 6.2 times slower than replicate. However, the preallocated for loop is just about as fast as replicate.
replicate is a wrapper for sapply, which itself is a wrapper for lapply. lapply is ultimately an .Internal function that is written in C and performs the looping in an optimised way, rather than through the interpreter. Its main advantages are efficient memory management, especially compared to the highly inefficient vector-growing method you present above.
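The wrapper relationship is easy to see: a replicate call and the equivalent sapply call draw exactly the same numbers (a sketch; the seed reset makes the two runs comparable):

```r
set.seed(1)
a <- replicate(5, mean(rnorm(50)))

set.seed(1)
b <- sapply(seq_len(5), function(i) mean(rnorm(50)))

identical(a, b)  # TRUE: replicate is sapply over a dummy index
```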
I have a very different experience with replicate, which also confuses me. It often happens that R crashes and my laptop hangs when I use replicate compared to for, and this surprises me: for the reasons mentioned above, I also expected a C-written function to outperform the for loop. For example, if you execute the functions below, you'll see that the for loop is faster than replicate:
system.time(for (i in 1:10) runif(1e7))
# user system elapsed
# 3.340 0.218 3.558
system.time(replicate(10, runif(1e7)))
# user system elapsed
# 4.622 0.484 5.109
So with 10 replicates, the for loop is clearly faster. If you repeat it for 100 replicates you get similar results. So I wonder if anyone can come up with an example that shows its practical advantages compared to for.
PS: I also created a function for runif(1e7) and that made no difference in the comparison. Basically, I failed to come up with any example that shows the advantage of replicate.
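One caveat about the comparison above (my observation, not part of the original post): the for loop throws each runif(1e7) result away, while replicate allocates and returns all of them as a 1e7-by-10 matrix, so the two calls do different amounts of work. A sketch of a more like-for-like comparison, with n shrunk so it runs quickly:

```r
n <- 1e6  # smaller than the original 1e7 so the sketch runs quickly

# A for loop that, like replicate, keeps every draw.
system.time({
  out <- vector("list", 10)
  for (i in 1:10) out[[i]] <- runif(n)
})

# replicate keeps every draw by construction (an n-by-10 matrix).
system.time(m <- replicate(10, runif(n)))
```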
Vectorization is the key difference between them. I will try to explain this point. R is a high-level, interpreted computer language. It takes care of many basic computer tasks for you. When you write
x <- 2.0
you don’t have to tell your computer that:
“2.0” is a floating-point number;
“x” should store numeric-type data;
it has to find a place in memory to put “2.0”;
it has to register “x” as a pointer to a certain place in memory.
R figures these things out by itself.
But this convenience comes at a price: it is slower than low-level languages.
In C or FORTRAN, much of this "test if" work is accomplished during the compilation step, not during program execution. These languages are translated into binary machine code after they are written, but before they are run. This allows the compiler to organize the machine code in an optimal way for the computer to execute.
What does this have to do with vectorization in R? Well, many R functions are actually written in a compiled language, such as C, C++, or FORTRAN, with a small R “wrapper”. This is the difference between your approaches: for loops add further "test if" operations that the machine has to perform on the data, making them slower.
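The point is easy to demonstrate with a toy example: the vectorized primitive sum() does its looping in compiled code, while an explicit R loop pays the interpreter's per-iteration checks (a sketch):

```r
x <- runif(1e6)

# Interpreted loop: every iteration goes through the R interpreter.
loop_sum <- function(v) {
  s <- 0
  for (e in v) s <- s + e
  s
}

system.time(s1 <- loop_sum(x))  # slow: interpreted looping
system.time(s2 <- sum(x))       # fast: compiled, vectorized primitive

all.equal(s1, s2)  # TRUE: same result, very different cost
```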
