I know that loops are slow in R and that I should try to do things in a vectorised manner instead.
But, why? Why are loops slow and apply is fast? apply calls several sub-functions -- that doesn't seem fast.
Update: I'm sorry, the question was ill-posed. I was confusing vectorisation with apply. My question should have been,
"Why is vectorisation faster?"
It's not always the case that loops are slow and apply is fast. There's a nice discussion of this in the May, 2008, issue of R News:
Uwe Ligges and John Fox. R Help Desk: How can I avoid this loop or
make it faster? R News, 8(1):46-50, May 2008.
In the section "Loops!" (starting on pg 48), they say:
Many comments about R state that using loops is a particularly bad idea. This is not necessarily true. In certain cases, it is difficult to write vectorized code, or vectorized code may consume a huge amount of memory.
They further suggest:
Initialize new objects to full length before the loop, rather
than increasing their size within the loop. Do not do things in a
loop that can be done outside the loop. Do not avoid loops simply
for the sake of avoiding loops.
They have a simple example where a for loop takes 1.3 sec but apply runs out of memory.
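To illustrate their first point, compare growing a vector inside a loop with filling a preallocated one (a quick sketch of my own; exact timings will vary by machine):

n <- 1e4
# Growing inside the loop: each c() call copies the whole vector, so this is quadratic
system.time({
  x <- numeric(0)
  for (i in 1:n) x <- c(x, i)
})
# Preallocated to full length, then filled in place: linear
system.time({
  x <- numeric(n)
  for (i in 1:n) x[i] <- i
})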
Loops in R are slow for the same reason any interpreted language is slow: every
operation carries around a lot of extra baggage.
Look at R_execClosure in eval.c (this is the function called to call a
user-defined function). It's nearly 100 lines long and performs all sorts of
operations -- creating an environment for execution, assigning arguments into
the environment, etc.
Think how much less happens when you call a function in C (push args on to
stack, jump, pop args).
So that is why you get timings like these (as joran pointed out in the comment,
it's not actually apply that's being fast; it's the internal C loop in mean
that's being fast. apply is just regular old R code):
A = matrix(as.numeric(1:100000))
Using a loop: 0.342 seconds:
system.time({
Sum = 0
for (i in seq_along(A)) {
Sum = Sum + A[[i]]
}
Sum
})
Using sum: unmeasurably small:
sum(A)
It's a little disconcerting because, asymptotically, the loop is just as good
as sum: there's no algorithmic reason for it to be slow; it's just doing more
extra work each iteration.
So consider:
# 0.370 seconds
system.time({
I = 0
while (I < 100000) {
10
I = I + 1
}
})
# 0.743 seconds -- double the time just adding parentheses
system.time({
I = 0
while (I < 100000) {
((((((((((10))))))))))
I = I + 1
}
})
(That example was discovered by Radford Neal)
Because ( in R is an operator, and actually requires a name lookup every time you use it:
> `(` = function(x) 2
> (3)
[1] 2
Or, in general, interpreted operations (in any language) have more steps. Of course, those steps provide benefits as well: you couldn't do that ( trick in C.
The only answer to the question posed is: loops are not slow if what you need to do is iterate over a set of data, performing some function, and that function or the operation is not vectorised. A for() loop will, in general, be as quick as apply(), but possibly a little slower than an lapply() call. The last point is well covered on SO, for example in this answer, and applies if the code involved in setting up and operating the loop is a significant part of the overall computational burden of the loop.
The reason many people think for() loops are slow is that they, the users, are writing bad code. In general (though there are several exceptions), if you need to expand/grow an object, that will involve copying, so you incur the overhead of both copying and growing the object. This is not restricted to loops, but if you copy/grow at each iteration of a loop, of course the loop is going to be slow, because you are incurring many copy/grow operations.
The general idiom for using for() loops in R is that you allocate the storage you require before the loop starts, and then fill in the object thus allocated. If you follow that idiom, loops will not be slow. This is what apply() manages for you, but it is just hidden from view.
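Concretely, the idiom looks like this (my own sketch, with sqrt standing in for an arbitrary per-element function):

n <- 1e5
out <- numeric(n)            # allocate the full length before the loop
for (i in seq_len(n)) {
  out[i] <- sqrt(i)          # fill in place; no copying or growing
}
# sapply() does the same allocate-and-fill, just hidden from view
identical(out, sapply(seq_len(n), sqrt))  # TRUE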
Of course, if a vectorised function exists for the operation you are implementing with the for() loop, don't do that. Likewise, don't use apply() etc if a vectorised function exists (e.g. apply(foo, 2, mean) is better performed via colMeans(foo)).
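For example (a quick sketch; the exact timings are machine-dependent):

foo <- matrix(rnorm(1e6), nrow = 1000)
system.time(apply(foo, 2, mean))  # R-level loop over columns
system.time(colMeans(foo))        # one vectorised C routine; much faster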
Just as a comparison (don't read too much into it!): I ran a (very) simple for loop in R and in JavaScript in Chrome and IE 8.
Note that Chrome does compilation to native code, and R with the compiler package compiles to bytecode.
# In R 2.13.1, this took 500 ms
f <- function() { sum<-0.5; for(i in 1:1000000) sum<-sum+i; sum }
system.time( f() )
# And the compiled version took 130 ms
library(compiler)
g <- cmpfun(f)
system.time( g() )
# @Gavin Simpson: btw, it took 1162 ms in S-Plus...
And the "same" code as JavaScript:
// In IE8, this took 282 ms
// In Chrome 14.0, this took 4 ms
function f() {
  var sum = 0.5;
  for (var i = 1; i <= 1000000; ++i) sum = sum + i;
  return sum;
}
var start = new Date().getTime();
f();
var time = new Date().getTime() - start;
I am using MATLAB to run a for loop in which variable-length portions of a large vector are updated at each iteration with the content of another vector; something like:
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) +...
a(k)*vec2(idx_start2(k):idx_end2(k));
end
The selected portions of vec1 and vec2 are not so small and N can be quite large; moreover, if this is useful, idx_end1(k) < idx_start1(k+1) does not necessarily hold (i.e. vec1's edited portions may be partially re-updated in subsequent iterations). As a consequence, the above is by far the slowest portion of code in my script, and I would like to speed it up if possible.
Is there any way to vectorize the above for loop in order to make it run faster? Or, are there any alternative approaches to improve its execution speed?
EDIT:
As requested in the comments, here are some example values. Using the profiler to check execution times, the loop above runs in about 3.3 s with N=5e4, length(vec1)=3e6 and length(vec2)=1.7e3; the portions indexed by idx_start/end are, on average, slightly shorter than vec2, although not significantly so.
Of course, 3.3 s is not particularly worrying in itself, but I would like to increase N and length(vec1) by one or two orders of magnitude, and the loop would then take considerably longer to run.
Sorry, I couldn't find a way to speed up your code. This is the code I created to try to speed it up:
N = 5e4;
vec1 = 1:3e6;
vec2 = 1:1.7e3;
rng(0)
a = randn(N, 1);
idx_start1 = randi([1, 2.9e6], N, 1);
idx_end1 = idx_start1 + 1000;
idx_start2 = randi([1, 0.6e3], N, 1);
idx_end2 = idx_start2 + 1000;
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + a(k) * vec2(idx_start2(k):idx_end2(k));
% use = idx_start1(k):idx_end1(k);
% vec1(use) = vec1(use) + a(k) * vec2(idx_start2(k):idx_end2(k));
end
The two commented-out lines in the for loop were my attempt to speed it up, but, much to my surprise, they actually made it slower. I would generally create a variable for an array that is used more than once, thinking that is faster, but it is not: the uncommented version runs in 0.24 s, versus 0.67 s for the commented-out variant.
I've noticed a strange behavior of Julia during a matrix copy.
Consider the following three functions:
function priv_memcopyBtoA!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
A[1:n,1:n] = B[1:n,1:n]
return nothing
end
function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
ii = 1; jj = 1;
while ii <= n
jj = 1 #(*)
while jj <= n
A[jj,ii] = B[jj,ii]
jj += 1
end
ii += 1
end
return nothing
end
function priv_memcopyBtoA3!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
A[1:n,1:n] = view(B, 1:n, 1:n)
return nothing
end
Edit: 1) I tested whether the code would throw a BoundsError, so the line marked with jj = 1 #(*) was missing in the initial code. The testing results were already from the fixed version, so they remain unchanged. 2) I've added the view variant, thanks to @Colin T Bowers for addressing both issues.
It seems like all of these functions should lead to more or less the same code. Yet for
A = fill!(Matrix{Int}(2^12, 2^12), 2); B = Int.(eye(2^12));
I get the results
@timev priv_memcopyBtoA!(A,B, 2000)
0.178327 seconds (10 allocations: 15.259 MiB, 85.52% gc time)
elapsed time (ns): 178326537
gc time (ns): 152511699
bytes allocated: 16000304
pool allocs: 9
malloc() calls: 1
GC pauses: 1
and
@timev priv_memcopyBtoA2!(A,B, 2000)
0.015760 seconds (4 allocations: 160 bytes)
elapsed time (ns): 15759742
bytes allocated: 160
pool allocs: 4
and
@timev priv_memcopyBtoA3!(A,B, 2000)
0.043771 seconds (7 allocations: 224 bytes)
elapsed time (ns): 43770978
bytes allocated: 224
pool allocs: 7
That's a drastic difference, and a surprising one. I expected the first version to behave like memcpy, which is hard to beat for a large memory block.
The second version has overhead from the pointer arithmetic (getindex), the branch condition (<=) and the bounds check in each assignment. Yet each assignment takes just ~3 ns.
Also, the time the garbage collector consumes varies a lot for the first function. If no garbage collection is performed, the large difference shrinks, but it remains: it's still a factor of ~2.5 between versions 3 and 2.
So why is the "memcopy" version not as efficient as the "assignment" version?
Firstly, your code contains a bug. Run this:
A = [1 2 ; 3 4]
B = [5 6 ; 7 8]
priv_memcopyBtoA2!(A, B, 2)
then:
julia> A
2×2 Array{Int64,2}:
5 2
7 4
You need to reset jj to 1 at the start of each outer iteration, i.e.:
function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
ii = 1
while ii <= n
jj = 1
while jj <= n
A[jj,ii] = B[jj,ii]
jj += 1
end
ii += 1
end
return nothing
end
Even with the bug fix, you'll still note that the while loop solution is faster. This is because array slices in Julia create temporary arrays. So in this line:
A[1:n,1:n] = B[1:n,1:n]
the right-hand side operation creates a temporary n-by-n array, which is then assigned to the left-hand side.
If you wanted to avoid the temporary array allocation, you would instead write:
A[1:n,1:n] = view(B, 1:n, 1:n)
and you'll notice that the timings of the two methods are now pretty close, although the while loop is still slightly faster. As a general rule, loops in Julia are fast (as in C fast), and explicitly writing out the loop will usually get you the most optimized compiled code. I would still expect the explicit loop to be faster than the view method.
As for the garbage collection stuff, that is just a result of your method of timing. Much better to use @btime from the package BenchmarkTools, which uses various tricks to avoid traps like timing garbage collection etc.
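For example (a minimal sketch, assuming BenchmarkTools is installed and A, B are defined as in the question):

using BenchmarkTools
@btime priv_memcopyBtoA!($A, $B, 2000)   # $-interpolation keeps global-variable access out of the measurement
@btime priv_memcopyBtoA2!($A, $B, 2000)
@btime priv_memcopyBtoA3!($A, $B, 2000)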
Why is A[1:n,1:n] = view(B, 1:n, 1:n), or variants of it, slower than a set of while loops? Let's look at what A[1:n,1:n] = view(B, 1:n, 1:n) does.
view returns a SubArray, which holds a reference to the parent B plus information on how to compute the indices that should be copied. A[1:n,1:n] = ... is lowered to a call to _setindex!(...). After that, and a few calls down the call chain, the main work is done by:
# .\abstractarray.jl:883
# In general, we simply re-index the parent indices by the provided ones
function getindex(V::SlowSubArray{T,N}, I::Vararg{Int,N}) where {T,N}
    @_inline_meta
    @boundscheck checkbounds(V, I...)
    @inbounds r = V.parent[reindex(V, V.indexes, I)...]
    r
end
# .\multidimensional.jl:212
@inline function next(iter::CartesianRange{I}, state) where I<:CartesianIndex
    state, I(inc(state.I, iter.start.I, iter.stop.I))
end
@inline inc(::Tuple{}, ::Tuple{}, ::Tuple{}) = ()
@inline inc(state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (state[1]+1,)
@inline function inc(state, start, stop)
    if state[1] < stop[1]
        return (state[1]+1, tail(state)...)
    end
    newtail = inc(tail(state), tail(start), tail(stop))
    (start[1], newtail...)
end
getindex takes a view V and an index I. We get the view from B and the index I from A. In each step, reindex computes, from the view V and the index I, the indices of the corresponding element in B; that element, r, is returned and finally written to A.
After each copy, inc increments the index I to the next element of A and tests whether we are done. Note that the code is from v0.6.3, but in master it's more or less the same.
In principle the code could be reduced to a set of while loops, yet it is more general: it works for arbitrary views of B and arbitrary slices of the form a:b:c, and for an arbitrary number of matrix dimensions (in our case the dimension N is 2).
Since these functions are more complex, the compiler doesn't optimize them as well. For instance, there is a recommendation that the compiler should inline them, but it doesn't do so, which shows that the functions are non-trivial.
For a set of loops, the compiler reduces the innermost loop to three additions (one each for the pointers into A and B, and one for the loop index) and a single copy instruction.
tl;dr The internal call chain of A[1:n,1:n] = view(B, 1:n, 1:n), coupled with multiple dispatch, is non-trivial and handles the general case. This induces overhead. A set of while loops is already optimized to the special case.
Note that the performance depends on the compiler. If one looks at the one-dimensional case A[1:n] = view(B, 1:n), it's faster than a while loop because it vectorizes the code. Yet for higher dimensions N > 2 the difference grows.
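If you want to verify the one-dimensional claim yourself, here is a minimal sketch (my own; the function names are illustrative, and BenchmarkTools is assumed):

using BenchmarkTools

copy_view!(A, B, n) = (A[1:n] = view(B, 1:n); nothing)

function copy_loop!(A, B, n)
    i = 1
    while i <= n
        @inbounds A[i] = B[i]
        i += 1
    end
    return nothing
end

A = zeros(Int, 10^6); B = ones(Int, 10^6);
@btime copy_view!($A, $B, 10^6)   # 1-D copy; the compiler can vectorize this
@btime copy_loop!($A, $B, 10^6)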
I'm using GameMaker Studio and you can think of it as a giant loop.
I use a counter variable step to keep track of what frame it is.
I'd like to run some code only every Xth step for efficiency.
if (step mod 60 == 0) {
}
would run that block once every 60 steps (i.e. once per second at 60 fps).
My understanding, though, is that modulus is a heavy operation, and with thousands of steps I imagine the computation can get out of hand. Is there a more efficient way to do this? Perhaps one involving a bitwise operator?
I know this can work for every other frame:
// Declare
counter = 0
// Step
counter = (counter + 1) & 1
if counter {
}
Or is the performance impact of modulus negligible at 60FPS even with large numbers?
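(As a side note on the bitwise idea: modulus by a power of two can be replaced with a single AND, since x mod 2^k equals x & (2^k - 1). A sketch in the same GML style, assuming a 64-step interval:)

// Runs the block once every 64 steps; works only when the interval is a power of two
if ((step & 63) == 0) {
    // do the thing
}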
In essence:
i := 0
WHILE i < n/4
do rest of stuff × 4
do stuff that you want to do one time in four
Increment i
Do rest of stuff n mod 4 times
The variant of this that takes the modulus and switches on it is called Duff's Device. Which is faster will depend on your architecture: on some chips integer division is cheap, while on others it might not even be a native instruction.
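For reference, a minimal sketch of Duff's Device in C++ (the classic construct, not code from this answer): the switch jumps into the middle of an unrolled do-while, so the count % 8 leftover iterations are handled without a separate cleanup loop.

// Copies `count` ints from `from` to `to`, unrolled eight at a time.
void duff_copy(int* to, const int* from, int count)
{
    if (count <= 0) return;
    int n = (count + 7) / 8;        // number of passes through the unrolled body
    switch (count % 8) {            // jump into the body to handle the remainder
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}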
If you don’t have a loop counter per se, because it’s an event loop for example, you can always make one and reset it every four times in the if block where you execute your code:
i := 1
WHILE you loop
do other stuff
if i == 4
do stuff
i := 1
else
i := i + 1
Here’s an example of doing some stuff one time in two and stuff one time in three:
WHILE looping
do stuff
do stuff a second time
do stuff B
do stuff a third time
do stuff C
do stuff a fourth time
do stuff B
do stuff a fifth time
do stuff a sixth time
do stuff B
do stuff C
Note that the stuff you do can include calling an event loop once.
Since this can get unwieldy, you can use template metaprogramming to write these loops for you in C++, something like:
constexpr unsigned a = 5, b = 7, LCM_A_B = 35;
template<unsigned N>
inline void do_stuff(void)
{
    do_stuff_always();
    if (N % a == 0)
        do_stuff_a(); // Since N is a compile-time constant, the compiler does not have to check this at runtime!
    if (N % b == 0)
        do_stuff_b();
    do_stuff<N-1>();
}
template<>
inline void do_stuff<1U>(void)
{
do_stuff_always();
}
while (sentinel)
do_stuff<LCM_A_B>();
In general, though, if you want to know whether your optimizations are helping, profile.
The most important part of the answer: that test probably takes so little time, in context, that it isn't worth the ions moving around your brain to think about it.
If it only costs 1% it's almost certain there are bigger speedups you should be thinking about.
However, if the loop is fast, you could put in something like this:
if (--count < 0){
count = 59;
// do your thing
}
On some hardware, that test comes down to a single decrement-and-branch-if-negative instruction.
I recently read a post about for loops over a range of integers being slower than the corresponding while loops, which is true, but I wanted to see whether the same held for iterating over existing sequences, and was surprised to find the complete opposite, by a large margin.
First and foremost, I'm using the following function for timing:
def time[A](f: => A) = {
val s = System.nanoTime
val ret = f
println("time: " + (System.nanoTime - s) / 1e6 + "ms")
ret
}
and I'm using a simple sequence of Integers:
val seq = List.range(0, 10000)
(I also tried creating this sequence a few other ways in case the way this sequence was accessed affected the run time. Using the Range type certainly did. This should ensure that each item in the sequence is an independent object.)
I ran the following:
time {
for(item <- seq) {
println(item)
}
}
and
time {
var i = 0
while(i < seq.size) {
println(seq(i))
i += 1
}
}
I printed the results to ensure that we're actually accessing the values in both loops. The first code snippet runs in an average of 33 ms on my machine. The second takes an average of 305 ms.
I tried adding a mutable variable i to the for loop, but it only adds a few milliseconds. The map function gets similar performance to the for loop, as expected. For whatever reason, this doesn't seem to occur if I use an array (converting the seq defined above with seq.toArray): in that case the for loop takes 90 ms and the while loop takes 40 ms.
What is the reason for this major performance difference?
The reason is complexity: seq(i) is Θ(i) for List, so the indexed while loop does Θ(n²) work in total. The foreach method, by contrast, is linear: Θ(n).
If you compile with -optimize, the for loop version will likely be even faster, because List.foreach should be inlined, eliminating the cost of the lambda.
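If you still want an imperative while loop, here is a minimal sketch (my own, not from the original answer) that stays linear on List: advance the list's own iterator instead of indexing.

// The iterator moves one cons cell per step instead of
// re-walking the list from the head, as seq(i) does.
val it = seq.iterator
while (it.hasNext) {
  println(it.next())
}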
I am trying to evaluate the following integral:
∫ exp(a·x⁴ + b·x³ + c·x² + d·x + f) dx over (-∞, ∞)
which can be expanded as a triple series of gamma functions (evaluated in the loops below).
I can find the area for the following polynomial:
pn =
   -0.0250    0.0667    0.2500   -0.6000         0
First, using integration by Simpson's rule:
fn=@(x) exp(polyval(pn,x));
area=quad(fn,-10,10);
fprintf('area evaluated by Simpsons rule : %f \n',area)
and the result is area evaluated by Simpsons rule : 11.483072
Then with the following code, which evaluates the summation in the above formula with the gamma function:
a=pn(1);b=pn(2);c=pn(3);d=pn(4);f=pn(5);
area=0;
result=0;
for n=0:40;
for m=0:40;
for p=0:40;
if(rem(n+p,2)==0)
result=result+ (b^n * c^m * d^p) / ( factorial(n)*factorial(m)*factorial(p) ) *...
gamma( (3*n+2*m+p+1)/4 ) / (-a)^( (3*n+2*m+p+1)/4 );
end
end
end
end
result=result*1/2*exp(f)
and this returns 11.4831, more or less the same result as the quad function. Now my question is whether it is possible to get rid of this nested loop, since I want to construct the cumulative distribution function so that I can draw samples from this distribution using the inverse-CDF transform. (For constructing the CDF I will use gammainc, i.e. the incomplete gamma function, instead of gamma.)
I will need to sample from such densities with various polynomial coefficients, and speed is a concern. I can already sample from these densities using Monte Carlo methods, but I would like to see whether exact sampling from the density is possible in order to speed things up.
Thank you very much in advance.
There are several things one might do. The simplest is to avoid calling factorial. Instead one can use the relation
factorial(n) = gamma(n+1)
Since gamma is actually faster than a call to factorial, you can save a bit there. Compare:
>> timeit(@() factorial(40))
ans =
     4.28681157826087e-05
>> timeit(@() gamma(41))
ans =
     2.06671024634146e-05
>> timeit(@() gammaln(41))
ans =
     2.17632543333333e-05
Even better, one can do all 4 calls in a single call to gammaln. For example, think about what this does:
gammaln([(3*n+2*m+p+1)/4,n+1,m+1,p+1])*[1 -1 -1 -1]'
Note that this call has no problem with overflow either, in case your numbers get large enough. And since gammaln is vectorized, that one call is fast: it costs little more time to compute 4 values than it does to compute one.
>> timeit(@() gammaln([15 20 40 30]))
ans =
     2.73937416896552e-05
>> timeit(@() gammaln(40))
ans =
     2.46521943333333e-05
Admittedly, if you use gammaln, you will need a call to exp at the end to recover the final result. You could, however, also do it with a single call to gamma, perhaps like this:
g = gamma([(3*n+2*m+p+1)/4,n+1,m+1,p+1]);
g = g(1)/(g(2)*g(3)*g(4));
Next, you can be more creative in the inner loop on p. Rather than a full loop, coupled with a test to ignore the combinations you don't need, why not just do this?
for p=mod(n,2):2:40
That statement will select only those values of p that would have been used anyway, so now you can drop the if statement completely.
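A quick sanity check (my own sketch, for a fixed n) that the strided range visits exactly the p values the if statement kept:

n = 3;                                  % any fixed n
p_all  = 0:40;
p_kept = p_all(rem(n + p_all, 2) == 0); % values the original if-test kept
p_fast = mod(n,2):2:40;                 % strided replacement
isequal(p_kept, p_fast)                 % returns logical 1 (true)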
All of the above will give you what I'll guess is about a 5x speed increase in your loops. But it still has a set of nested loops. With some effort, you might be able to improve that too.
For example, rather than computing all of those factorials (or gamma functions) many times, do it ONCE. This should work:
a=pn(1);b=pn(2);c=pn(3);d=pn(4);f=pn(5);
area=0;
result=0;
nlim = 40;
facts = factorial(0:nlim);
gammas = gamma((0:(6*nlim+1))/4);
for n=0:nlim
for m=0:nlim
for p=mod(n,2):2:nlim
result = result + (b.^n * c.^m * d.^p) ...
.*gammas(3*n+2*m+p+1 + 1) ...
./ (facts(n+1).*facts(m+1).*facts(p+1)) ...
./ (-a)^( (3*n+2*m+p+1)/4 );
end
end
end
result=result*1/2*exp(f)
In my test on my machine, I find that your triply nested loops required 4.3 seconds to run. My version above produces the same result, yet required only 0.028418 seconds, a speedup of roughly 150 to 1, despite the triply nested loops.
Well, without even making changes to your code, you could install an excellent package from Tom Minka at Microsoft called lightspeed, which replaces some built-in MATLAB functions with much faster versions. I know there's a replacement for gammaln().
You'll get nontrivial speed improvements, though I'm not sure how much, and it's straightforward to install.