Why is it slower to prespecify type in a data.frame? - performance

I was preallocating a big data.frame to fill in later, which I normally do with NA's like this:
n <- 1e6
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
and I wondered if it would make things any faster later if I specified data types up front, so I tested
f1 <- function() {
  a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
  a$c2 <- 1:n
  a$c3 <- sample(LETTERS, size = n, replace = TRUE)
}
f2 <- function() {
  b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
  b$c2 <- 1:n
  b$c3 <- sample(LETTERS, size = n, replace = TRUE)
}
> system.time(f1())
user system elapsed
0.219 0.042 0.260
> system.time(f2())
user system elapsed
1.018 0.052 1.072
So it was actually much slower! I tried again with a factor column too, and the difference was closer to 2x than 4x, but I'm curious why this is slower, and whether it is ever appropriate to initialize with data types rather than NAs.
--
Edit: Flodel pointed out that 1:n is integer, not numeric. With that correction the runtimes are nearly identical; of course it hurts to incorrectly specify a data type and change it later!
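For reference, a corrected version of f2 might look like this (a sketch only, timings not re-run here; integer(n) matches what 1:n actually produces, and stringsAsFactors = FALSE keeps c3 a plain character column):
f2_fixed <- function() {
  b <- data.frame(c1 = 1:n,
                  c2 = integer(n),           # 1:n is integer, so prespecify integer
                  c3 = character(n),
                  stringsAsFactors = FALSE)  # keep c3 character, not factor
  b$c2 <- 1:n
  b$c3 <- sample(LETTERS, size = n, replace = TRUE)
}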

Assigning any data to a large data frame takes time. If you're going to assign your data all at once in a vector (as you should), it's much faster not to assign the c2 and c3 columns in the original definition at all. For example:
f3 <- function() {
  c <- data.frame(c1 = 1:n)
  c$c2 <- 1:n
  c$c3 <- sample(LETTERS, size = n, replace = TRUE)
}
print(system.time(f1()))
# user system elapsed
# 0.194 0.023 0.216
print(system.time(f2()))
# user system elapsed
# 0.336 0.037 0.374
print(system.time(f3()))
# user system elapsed
# 0.057 0.007 0.063
The reason for this is that when you preassign, a column of length n is created, e.g.
str(data.frame(x=1:2, y = character(2)))
## 'data.frame': 2 obs. of 2 variables:
## $ x: int 1 2
## $ y: Factor w/ 1 level "": 1 1
Note that the character column has been converted to a factor, which will be slower to work with than a column created with stringsAsFactors = FALSE.
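For comparison, here is the same check with stringsAsFactors = FALSE (a quick sketch), which keeps the column as plain character:
str(data.frame(x = 1:2, y = character(2), stringsAsFactors = FALSE))
## 'data.frame': 2 obs. of 2 variables:
## $ x: int 1 2
## $ y: chr "" ""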

@David Robinson's answer is correct, but I will add some profiling here to show how to investigate why some things are slower than you might expect.
The best thing to do here is some profiling to see what is being called; that can give a clue as to why some calls are slower than others.
library(profr)
profr(f1())
## Read 9 items
## f level time start end leaf source
## 8 f1 1 0.16 0.00 0.16 FALSE <NA>
## 9 data.frame 2 0.04 0.00 0.04 TRUE base
## 10 $<- 2 0.02 0.04 0.06 FALSE base
## 11 sample 2 0.04 0.06 0.10 TRUE base
## 12 $<- 2 0.06 0.10 0.16 FALSE base
## 13 $<-.data.frame 3 0.12 0.04 0.16 TRUE base
profr(f2())
## Read 15 items
## f level time start end leaf source
## 8 f2 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.frame 2 0.12 0.00 0.12 TRUE base
## 10 : 2 0.02 0.12 0.14 TRUE base
## 11 $<- 2 0.02 0.18 0.20 FALSE base
## 12 sample 2 0.02 0.20 0.22 TRUE base
## 13 $<- 2 0.06 0.22 0.28 FALSE base
## 14 as.data.frame 3 0.08 0.04 0.12 FALSE base
## 15 $<-.data.frame 3 0.10 0.18 0.28 TRUE base
## 16 as.data.frame.character 4 0.08 0.04 0.12 FALSE base
## 17 factor 5 0.08 0.04 0.12 FALSE base
## 18 unique 6 0.06 0.04 0.10 FALSE base
## 19 match 6 0.02 0.10 0.12 TRUE base
## 20 unique.default 7 0.06 0.04 0.10 TRUE base
profr(f3())
## Read 4 items
## f level time start end leaf source
## 8 f3 1 0.06 0.00 0.06 FALSE <NA>
## 9 $<- 2 0.02 0.00 0.02 FALSE base
## 10 sample 2 0.04 0.02 0.06 TRUE base
## 11 $<-.data.frame 3 0.02 0.00 0.02 TRUE base
Clearly f2() is slower than f1(), as there are a lot of character-to-factor conversions, recreating of levels, etc.
For efficient use of memory I would suggest the data.table package. This avoids (as much as possible) the internal copying of objects.
library(data.table)
f4 <- function() {
  f <- data.table(c1 = 1:n)
  f[, c2 := 1L:n]
  f[, c3 := sample(LETTERS, size = n, replace = TRUE)]
}
system.time(f1())
## user system elapsed
## 0.15 0.02 0.18
system.time(f2())
## user system elapsed
## 0.19 0.00 0.19
system.time(f3())
## user system elapsed
## 0.09 0.00 0.09
system.time(f4())
## user system elapsed
## 0.04 0.00 0.04
Note that using data.table you could add the two columns at once (and by reference):
# Thanks to @Thell for pointing this out.
f[, `:=`(c('c2', 'c3'), list(1L:n, sample(LETTERS, n, TRUE))), with = FALSE]
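In more recent data.table syntax the same multi-column assignment by reference can be written without with = FALSE; this is a sketch of the equivalent call, not part of the original answer:
f[, c("c2", "c3") := list(1L:n, sample(LETTERS, size = n, replace = TRUE))]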
EDIT -- functions that will return the required object (well picked up, @Dwin)
n <- 1e7
f1 <- function() {
  a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
  a$c2 <- 1:n
  a$c3 <- sample(LETTERS, size = n, replace = TRUE)
  a
}
f2 <- function() {
  b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
  b$c2 <- 1:n
  b$c3 <- sample(LETTERS, size = n, replace = TRUE)
  b
}
f3 <- function() {
  c <- data.frame(c1 = 1:n)
  c$c2 <- 1:n
  c$c3 <- sample(LETTERS, size = n, replace = TRUE)
  c
}
f4 <- function() {
  f <- data.table(c1 = 1:n)
  f[, `:=`(c2, 1L:n)]
  f[, `:=`(c3, sample(LETTERS, size = n, replace = TRUE))]
}
system.time(f1())
## user system elapsed
## 1.62 0.34 2.13
system.time(f2())
## user system elapsed
## 2.14 0.66 2.79
system.time(f3())
## user system elapsed
## 0.78 0.25 1.03
system.time(f4())
## user system elapsed
## 0.37 0.08 0.46
profr(f1())
## Read 105 items
## f level time start end leaf source
## 8 f1 1 2.08 0.00 2.08 FALSE <NA>
## 9 data.frame 2 0.66 0.00 0.66 FALSE base
## 10 : 2 0.02 0.66 0.68 TRUE base
## 11 $<- 2 0.32 0.84 1.16 FALSE base
## 12 sample 2 0.40 1.16 1.56 TRUE base
## 13 $<- 2 0.32 1.76 2.08 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.12 0.54 0.66 TRUE base
## 17 $<-.data.frame 3 1.24 0.84 2.08 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f2())
## Read 145 items
## f level time start end leaf source
## 8 f2 1 2.88 0.00 2.88 FALSE <NA>
## 9 data.frame 2 1.40 0.00 1.40 FALSE base
## 10 : 2 0.04 1.40 1.44 TRUE base
## 11 $<- 2 0.36 1.64 2.00 FALSE base
## 12 sample 2 0.40 2.00 2.40 TRUE base
## 13 $<- 2 0.36 2.52 2.88 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 numeric 3 0.06 0.02 0.08 TRUE base
## 16 character 3 0.04 0.08 0.12 TRUE base
## 17 as.data.frame 3 1.06 0.12 1.18 FALSE base
## 18 unlist 3 0.20 1.20 1.40 TRUE base
## 19 $<-.data.frame 3 1.24 1.64 2.88 TRUE base
## 20 as.data.frame.integer 4 0.04 0.12 0.16 TRUE base
## 21 as.data.frame.numeric 4 0.16 0.18 0.34 TRUE base
## 22 as.data.frame.character 4 0.78 0.40 1.18 FALSE base
## 23 factor 5 0.74 0.40 1.14 FALSE base
## 24 as.data.frame.vector 5 0.04 1.14 1.18 TRUE base
## 25 unique 6 0.38 0.40 0.78 FALSE base
## 26 match 6 0.32 0.78 1.10 TRUE base
## 27 unique.default 7 0.38 0.40 0.78 TRUE base
profr(f3())
## Read 37 items
## f level time start end leaf source
## 8 f3 1 0.72 0.00 0.72 FALSE <NA>
## 9 data.frame 2 0.10 0.00 0.10 FALSE base
## 10 : 2 0.02 0.10 0.12 TRUE base
## 11 $<- 2 0.08 0.14 0.22 FALSE base
## 12 sample 2 0.26 0.22 0.48 TRUE base
## 13 $<- 2 0.16 0.56 0.72 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.02 0.08 0.10 TRUE base
## 17 $<-.data.frame 3 0.58 0.14 0.72 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f4())
## Read 15 items
## f level time start end leaf source
## 8 f4 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.table 2 0.02 0.00 0.02 FALSE data.table
## 10 [ 2 0.26 0.02 0.28 FALSE base
## 11 : 3 0.02 0.00 0.02 TRUE base
## 12 [.data.table 3 0.26 0.02 0.28 FALSE <NA>
## 13 eval 4 0.26 0.02 0.28 FALSE base
## 14 eval 5 0.26 0.02 0.28 FALSE base
## 15 : 6 0.02 0.02 0.04 TRUE base
## 16 sample 6 0.24 0.04 0.28 TRUE base

Related

F1 score - Sklearn

What is the F1-score of the model in the following? I used the scikit-learn package.
print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
This article explains it pretty well. Basically it's
F1 = 2 * precision * recall / (precision + recall)
computed per class from that class's precision and recall (e.g. class 0: 2 * 0.50 * 1.00 / 1.50 = 0.67). The macro avg (0.49) and weighted avg (0.61) rows in the report summarize the per-class F1-scores for the model as a whole.

Julia pmap speed - parallel processing - dynamic programming

I am trying to speed up filling in a matrix for a dynamic programming problem in Julia (v0.6.0), and I can't seem to get much extra speed from using pmap. This is related to this question I posted almost a year ago: Filling a matrix using parallel processing in Julia. I was able to speed up serial processing with some great help then, and I'm now trying to get extra speed from parallel processing tools in Julia.
For the serial processing case, I was using a 3-dimensional matrix (essentially a set of equally-sized matrices, indexed by the 1st-dimension) and iterating over the 1st-dimension. I wanted to give pmap a try, though, to more efficiently iterate over the set of matrices.
Here is the code setup. To use pmap with the v_iter function below, I converted the three dimensional matrix into a dictionary object, with the dictionary keys equal to the index values in the 1st dimension (v_dict in the code below, with gcc equal to the 1st-dimension size). The v_iter function takes other dictionary objects (E_opt_dict and gridpoint_m_dict below) as additional inputs:
function v_iter(a,b,c)
    diff_v = 1
    while diff_v>convcrit
        diff_v = -Inf
        #These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B function
        exp_v = zeros(Float64,gkpc,1)
        A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
        for j=2:gz
            temp = Array{Float64}(gkpc,1)
            A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
            exp_v = hcat(exp_v,temp)
        end
        #This tries to find the optimal value of v
        for h=1:gm
            for j=1:gz
                oldv = a[h,j]
                newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
                a[h,j] = newv
                diff_v = max(diff_v, oldv-newv, newv-oldv)
            end
        end
    end
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp = gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
E_opt_dict=Dict(i => E_opt[i,:,:] for i=1:gcc)
gridpoint_m_dict=Dict(i => gridpoint_m[i,:,:] for i=1:gcc)
For parallel processing, I executed the following commands:
wp = CachingPool(workers())
addprocs(3)
pmap(wp,v_iter,values(v_dict),values(E_opt_dict),values(gridpoint_m_dict))
...which produced this performance:
135.626417 seconds (3.29 G allocations: 57.152 GiB, 3.74% gc time)
I then tried to serial process instead:
for i=1:gcc
v_iter(v_dict[i],E_opt_dict[i],gridpoint_m_dict[i])
end
...and received better performance.
128.263852 seconds (3.29 G allocations: 57.101 GiB, 4.53% gc time)
This also gives me about the same performance as running v_iter on the original 3-dimensional objects:
v=zeros(Float64,gcc,gm,gz)
for i=1:gcc
v_iter(v[i,:,:],E_opt[i,:,:],gridpoint_m[i,:,:])
end
I know that parallel processing involves setup time, but when I increase the value of gcc, I still get about equal processing time for serial and parallel. This seems like a good candidate for parallel processing, since there is no need for messaging between the workers! But I can't seem to make it work efficiently.
You create the CachingPool before adding the worker processes. Hence your caching pool passed to pmap tells it to use just a single worker.
You can check this by running wp.workers; you will see something like Set([1]).
Hence it should be:
addprocs(3)
wp = CachingPool(workers())
You could also consider starting Julia with the -p command line parameter, e.g. julia -p 3, and then you can skip the addprocs(3) command.
On top of that, your for and pmap loops are not equivalent. The Julia Dict object is a hash map and, as in other languages, does not guarantee element order. Hence in your for loop you are guaranteed to get matching i-th elements, while with values() the ordering does not need to match the original ordering (and each of the three collections can come out in a different order in the pmap call).
Since the keys for your Dicts are just numbers from 1 up to gcc, you should simply use arrays instead. You can use generators, very similar to Python's. For example, instead of
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
use
v_dict_a = [zeros(Float64,gm,gz) for i=1:gcc]
Hope that helps.
Based on @Przemyslaw Szufel's helpful advice, I've placed below the code that properly executes parallel processing. After running it once, I achieved a substantial improvement in running time:
77.728264 seconds (181.20 k allocations: 12.548 MiB)
In addition to reordering the wp command and using the generator Przemyslaw recommended, I also recast v_iter as an anonymous function, in order to avoid having to sprinkle @everywhere around the code to feed functions and data to the workers.
I also added return a to the v_iter function, and set v_a below equal to the output of pmap, since you cannot pass by reference to a remote object.
addprocs(3)
v_iter = function(a,b,c)
    diff_v = 1
    while diff_v>convcrit
        diff_v = -Inf
        #These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B function
        exp_v = zeros(Float64,gkpc,1)
        A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
        for j=2:gz
            temp = Array{Float64}(gkpc,1)
            A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
            exp_v = hcat(exp_v,temp)
        end
        #This tries to find the optimal value of v
        for h=1:gm
            for j=1:gz
                oldv = a[h,j]
                newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
                a[h,j] = newv
                diff_v = max(diff_v, oldv-newv, newv-oldv)
            end
        end
    end
    return a
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp = gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_a=[zeros(Float64,gm,gz) for i=1:gcc]
E_opt_a=[E_opt[i,:,:] for i=1:gcc]
gridpoint_m_a=[gridpoint_m[i,:,:] for i=1:gcc]
wp = CachingPool(workers())
v_a = pmap(wp,v_iter,v_a,E_opt_a,gridpoint_m_a)

Why is the Python implementation of Miller-Rabin so much faster than the Ruby one?

For one of my classes I recently came across both a Ruby and a Python implementation of using the Miller-Rabin algorithm to identify the number of primes between 20 and 29000. I am curious why, even though they are seemingly the same implementation, the Python code runs so much faster. I have read that Python is typically faster than Ruby, but is this much of a speed difference to be expected?
miller_rabin.rb
def miller_rabin(m,k)
  t = (m-1)/2;
  s = 1;
  while(t%2==0)
    t /= 2
    s += 1
  end
  for r in (0...k)
    b = 0
    b = rand(m) while b==0
    prime = false
    y = (b**t) % m
    if(y == 1)
      prime = true
    end
    for i in (0...s)
      if y == (m-1)
        prime = true
        break
      else
        y = (y*y) % m
      end
    end
    if not prime
      return false
    end
  end
  return true
end

count = 0
for j in (20..29000)
  if(j%2==1 and miller_rabin(j,2))
    count += 1
  end
end
puts count
miller_rabin.py:
import math
import random

def miller_rabin(m, k):
    s = 1
    t = (m-1)/2
    while t%2 == 0:
        t /= 2
        s += 1
    for r in range(0,k):
        rand_num = random.randint(1,m-1)
        y = pow(rand_num, t, m)
        prime = False
        if (y == 1):
            prime = True
        for i in range(0,s):
            if (y == m-1):
                prime = True
                break
            else:
                y = (y*y)%m
        if not prime:
            return False
    return True

count = 0
for j in range(20,29001):
    if j%2==1 and miller_rabin(j,2):
        count += 1
print count
When I measure the execution time of each using Measure-Command in Windows Powershell, I get the following:
Python 2.7:
Ticks: 4874403
Total Milliseconds: 487.4403
Ruby 1.9.3:
Ticks: 682232430
Total Milliseconds: 68223.243
I would appreciate any insight anyone can give me into why there is such a huge difference.
In Ruby you are using (a ** b) % c to calculate the modulo of the exponentiation. In Python, you are using the much more efficient three-argument pow call, whose docstring explicitly states:
With three arguments, equivalent to (x**y) % z, but may be more
efficient (e.g. for longs).
Whether you want to count the lack of such a built-in operator against Ruby is a matter of opinion. On the one hand, if Ruby doesn't provide one, you might say that it's that much slower. On the other hand, you're not really testing the same thing algorithmically, so some would say the comparison is not fair.
A quick googling reveals that there are implementations of modular exponentiation for Ruby.
I think these profile results should answer your question:
%self total self wait child calls name
96.81 43.05 43.05 0.00 0.00 17651 Fixnum#**
1.98 0.88 0.88 0.00 0.00 17584 Bignum#%
0.22 44.43 0.10 0.00 44.33 14490 Object#miller_rabin
0.11 0.05 0.05 0.00 0.00 32142 <Class::Range>#allocate
0.11 0.06 0.05 0.00 0.02 17658 Kernel#rand
0.08 44.47 0.04 0.00 44.43 32142 *Range#each
0.04 0.02 0.02 0.00 0.00 17658 Kernel#respond_to_missing?
0.00 44.47 0.00 0.00 44.47 1 Kernel#load
0.00 44.47 0.00 0.00 44.47 2 Global#[No method]
0.00 0.00 0.00 0.00 0.00 2 IO#write
0.00 0.00 0.00 0.00 0.00 1 Kernel#puts
0.00 0.00 0.00 0.00 0.00 1 IO#puts
0.00 0.00 0.00 0.00 0.00 2 IO#set_encoding
0.00 0.00 0.00 0.00 0.00 1 Fixnum#to_s
0.00 0.00 0.00 0.00 0.00 1 Module#method_added
Looks like Ruby's ** operator is slow as compared to Python.
It looks like (b**t) is often too big to fit in a Fixnum, so you are using Bignum (arbitrary-precision) arithmetic, which is much slower.

Speed up assembling matrix by interleaving vectors?

I have two vectors of arbitrary and equal length
a <- c(0.8,0.8,0.8)
b <- c(0.4,0.4,0.4)
n <- length(a)
From these I need to assemble a 2n by 2n matrix of the form:
x = [1-a1 b1 1-a2 b2 1-a3 b3
a1 1-b1 a2 1-b2 a3 1-b3
1-a1 b1 1-a2 b2 1-a3 b3
a1 1-b1 a2 1-b2 a3 1-b3
1-a1 b1 1-a2 b2 1-a3 b3
a1 1-b1 a2 1-b2 a3 1-b3]
I currently do this using
x <- matrix(rep(as.vector(rbind(c(1-a, a),
                                c(b, 1-b))),
                n),
            ncol = n*2, byrow = TRUE)
How can I speed up this operation? Profiling indicates that matrix is taking the most time:
Rprof("out.prof")
for (i in 1:100000) {
  x <- matrix(rep(as.vector(rbind(c(1-a, a),
                                  c(b, 1-b))),
                  n),
              ncol = n*2, byrow = TRUE)
}
Rprof(NULL)
summaryRprof("out.prof")
##$by.self
## self.time self.pct total.time total.pct
##"matrix" 1.02 63.75 1.60 100.00
##"rbind" 0.24 15.00 0.36 22.50
##"as.vector" 0.18 11.25 0.54 33.75
##"c" 0.10 6.25 0.10 6.25
##"*" 0.04 2.50 0.04 2.50
##"-" 0.02 1.25 0.02 1.25
##
##$by.total
## total.time total.pct self.time self.pct
##"matrix" 1.60 100.00 1.02 63.75
##"as.vector" 0.54 33.75 0.18 11.25
##"rbind" 0.36 22.50 0.24 15.00
##"c" 0.10 6.25 0.10 6.25
##"*" 0.04 2.50 0.04 2.50
##"-" 0.02 1.25 0.02 1.25
##
##$sample.interval
##[1] 0.02
##
##$sampling.time
##[1] 1.6
I don't think there is an alternative to matrix being the slowest part of your profile, but you can definitely save a little time by optimizing the rest. For example:
x <- matrix(rbind(c(1-a,a), c(b, 1-b)), 2*n, 2*n, byrow=TRUE)
Also, although I would not recommend it, you can save a little extra time by using the Internal matrix function:
x <- .Internal(matrix(rbind(c(1-a, a), c(b, 1-b)),
                      n*2, n*2, TRUE, NULL, FALSE, FALSE))
Here are some benchmarks:
library(rbenchmark)
benchmark(
  method0 = matrix(rep(as.vector(rbind(c(1-a, a), c(b, 1-b))), n),
                   ncol = n*2, byrow = TRUE),
  method1 = matrix(rbind(c(1-a, a), c(b, 1-b)), 2*n, 2*n, byrow = TRUE),
  method2 = .Internal(matrix(rbind(c(1-a, a), c(b, 1-b)),
                             n*2, n*2, TRUE, NULL, FALSE, FALSE)),
  replications = 100000,
  order = "relative")
# test replications elapsed relative user.self sys.self user.child sys.child
# 3 method2 100000 1.00 1.00 0.99 0 NA NA
# 2 method1 100000 1.13 1.13 1.12 0 NA NA
# 1 method0 100000 1.46 1.46 1.46 0 NA NA
I get a small speedup with the following:
f = function(a, b, n){
  z = rbind(
    c(rbind(1 - a, b)),
    c(rbind(a, 1 - b))
  )
  do.call(rbind, lapply(1:n, function(i) z))
}
I'll keep looking.
Edit: I'm stumped. If this isn't good enough, I'd recommend inlining some Rcpp.
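Short of Rcpp, one more base-R variant that may be worth timing (a sketch only, not benchmarked here) is to build the two distinct rows once and then replicate them by row indexing, which avoids the repeated rbind:
f_idx <- function(a, b, n) {
  z <- rbind(c(rbind(1 - a, b)),   # row pattern 1: 1-a1, b1, 1-a2, b2, ...
             c(rbind(a, 1 - b)))   # row pattern 2: a1, 1-b1, a2, 1-b2, ...
  z[rep(1:2, n), ]                 # stack the two rows n times -> 2n x 2n matrix
}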

Slower ddply when .parallel=TRUE on Mac OS X Version 10.6.7

I am trying to get ddply to run in parallel on my mac. The code I've used is as follows:
library(doMC)
library(plyr)    # for ddply
library(ggplot2) # for the purposes of getting the baseball data.frame
registerDoMC(2)
> system.time(ddply(baseball, .(year), numcolwise(mean)))
user system elapsed
0.959 0.106 1.522
> system.time(ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE))
user system elapsed
2.221 2.790 2.552
Why is ddply slower when I run .parallel=TRUE? I have searched online to no avail. I've also tried registerDoMC() and the results were the same.
The baseball data may be too small to see improvement by making the computations parallel; the overhead of passing the data to the different processes may be swamping any speedup by doing the calculations in parallel. Using the rbenchmark package:
library(rbenchmark)
baseball10 <- baseball[rep(seq(length = nrow(baseball)), 10), ]
benchmark(noparallel   = ddply(baseball,   .(year), numcolwise(mean)),
          parallel     = ddply(baseball,   .(year), numcolwise(mean), .parallel = TRUE),
          noparallel10 = ddply(baseball10, .(year), numcolwise(mean)),
          parallel10   = ddply(baseball10, .(year), numcolwise(mean), .parallel = TRUE),
          replications = 10)
gives results
test replications elapsed relative user.self sys.self user.child sys.child
1 noparallel 10 4.562 1.000000 4.145 0.408 0.000 0.000
3 noparallel10 10 14.134 3.098203 9.815 4.242 0.000 0.000
2 parallel 10 11.927 2.614423 2.394 1.107 4.836 6.891
4 parallel10 10 18.406 4.034634 4.045 2.580 10.210 9.769
With a 10-times bigger data set, the penalty for going parallel is smaller, and a more complicated per-group computation would tilt things further in parallel's favor, likely giving it an advantage.
This was run on a Mac OS X 10.5.8 Core 2 Duo machine.
Running in parallel will be slower than running sequentially when the communication costs between the nodes are greater than the calculation time of the function. In other words, it takes longer to send the data to/from the nodes than it does to perform the calculation.
For the same data set, the communication costs are approximately fixed, so parallel processing is going to be more useful as the time spent evaluating the function increases.
UPDATE:
The code below shows that 0.14 seconds (on my machine) is spent evaluating .fun. That means communication has to take less than 0.07 seconds, and that's not realistic for a data set the size of baseball.
Rprof()
system.time(ddply(baseball, .(year), numcolwise(mean)))
# user system elapsed
# 0.28 0.02 0.30
Rprof(NULL)
summaryRprof()$by.self
# self.time self.pct total.time total.pct
# [.data.frame 0.04 12.50 0.10 31.25
# unlist 0.04 12.50 0.10 31.25
# match 0.04 12.50 0.04 12.50
# .fun 0.02 6.25 0.14 43.75
# structure 0.02 6.25 0.12 37.50
# [[ 0.02 6.25 0.08 25.00
# FUN 0.02 6.25 0.06 18.75
# rbind.fill 0.02 6.25 0.06 18.75
# anyDuplicated 0.02 6.25 0.02 6.25
# gc 0.02 6.25 0.02 6.25
# is.array 0.02 6.25 0.02 6.25
# list 0.02 6.25 0.02 6.25
# mean.default 0.02 6.25 0.02 6.25
Here's the parallel version with snow:
library(doSNOW)
cl <- makeSOCKcluster(2)
registerDoSNOW(cl)
Rprof()
system.time(ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE))
# user system elapsed
# 0.46 0.01 0.73
Rprof(NULL)
summaryRprof()$by.self
# self.time self.pct total.time total.pct
# .Call 0.24 33.33 0.24 33.33
# socketSelect 0.16 22.22 0.16 22.22
# lazyLoadDBfetch 0.08 11.11 0.08 11.11
# accumulate.iforeach 0.04 5.56 0.06 8.33
# rbind.fill 0.04 5.56 0.06 8.33
# structure 0.04 5.56 0.04 5.56
# <Anonymous> 0.02 2.78 0.54 75.00
# lapply 0.02 2.78 0.04 5.56
# constantFoldEnv 0.02 2.78 0.02 2.78
# gc 0.02 2.78 0.02 2.78
# stopifnot 0.02 2.78 0.02 2.78
# summary.connection 0.02 2.78 0.02 2.78
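To see the flip side, make the per-group function deliberately expensive so that computation rather than communication dominates. This is a sketch only (slow_summary is a made-up helper, and exact timings will depend on your machine and registered backend):
slow_summary <- function(df) {
  num <- df[sapply(df, is.numeric)]                   # keep the numeric columns
  for (i in 1:200) m <- colMeans(num, na.rm = TRUE)   # repeat the work to inflate the per-group cost
  as.data.frame(t(m))
}
system.time(ddply(baseball, .(year), slow_summary))                    # sequential
system.time(ddply(baseball, .(year), slow_summary, .parallel = TRUE))  # parallel
With a heavy enough slow_summary, the .parallel = TRUE call should start to win, in line with the communication-versus-computation argument above.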
