gfortran "-march=haswell" slower than "-march=core2" - performance

I run gfortran 4.9.2 on a 64-bit Windows 7 machine with an Intel Core i5-4570 (Haswell). I compile and execute on this same machine.
Compiling my code (scientific simulation) with
gfortran -frecord-marker-4 -fno-automatic -O3 -fdefault-real-8 (...)
-Wline-truncation -Wsurprising -ffpe-trap=invalid,zero,overflow (...)
-march=core2 -mfpmath=sse -c
is about 30% FASTER than compiling with
gfortran -frecord-marker-4 -fno-automatic -O3 -fdefault-real-8 (...)
-Wline-truncation -Wsurprising -ffpe-trap=invalid,zero,overflow (...)
-march=haswell -mfpmath=sse -c
(-march=native gives the same result as with -march=haswell).
This seems strange to me, as I would expect that having additional instructions available would make the code faster, not slower.
First, this is a new machine, a replacement for my old one at work, so unfortunately:
I can't test with the previous processor anymore
It is difficult for me to test with another gfortran version than the one installed
Now, I did some profiling with gprof for different -march= settings (see the gcc online listing). On this test:
core2, nehalem, and westmere all lead to ~85 s
starting from sandybridge (which adds the AVX instruction set), execution time jumps to 122 s (128 s for haswell).
Here are the reported profiles, cut at functions > 1.0s self time.
Flat profile for -march=core2:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
8.92 6.18 6.18 __sinl_internal
8.50 12.07 5.89 __cosl_internal
7.26 17.10 5.03 _mcount_private
6.42 21.55 4.45 exp
6.41 25.99 4.44 exp2l
5.08 29.51 3.52 __fentry__
3.71 32.08 2.57 35922427 0.07 0.18 predn_
3.53 34.53 2.45 log2l
3.36 36.86 2.33 79418108 0.03 0.03 vxs_tvxs_
2.90 38.87 2.01 97875942 0.02 0.02 rk4m_
2.83 40.83 1.96 403671 4.86 77.44 radarx_
2.16 42.33 1.50 4063165 0.37 0.43 dchdd_
2.14 43.81 1.48 pow
2.11 45.27 1.46 8475809 0.17 0.27 aerosj_
2.09 46.72 1.45 23079874 0.06 0.06 snrm2_
1.86 48.01 1.29 cos
1.80 49.26 1.25 sin
1.75 50.47 1.21 15980084 0.08 0.08 sgemv_
1.66 51.62 1.15 61799016 0.02 0.05 x2acc_
1.64 52.76 1.14 43182542 0.03 0.03 atmostd_
1.56 53.84 1.08 24821235 0.04 0.04 axb_
1.53 54.90 1.06 138497449 0.01 0.01 axvc_
Flat profile for -march=haswell:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
6.49 6.71 6.71 __sinl_internal
6.05 12.96 6.25 __cosl_internal
5.55 18.70 5.74 _mcount_private
5.16 24.03 5.33 exp
5.14 29.34 5.31 cos
4.87 34.37 5.03 sin
4.67 39.20 4.83 exp2l
4.55 43.90 4.70 35922756 0.13 0.34 predn_
4.38 48.43 4.53 8475884 0.53 0.69 aerosj_
3.72 52.27 3.84 pow
3.43 55.82 3.55 __fentry__
2.79 58.70 2.88 403672 7.13 120.62 radarx_
2.64 61.43 2.73 79396558 0.03 0.03 vxs_tvxs_
2.36 63.87 2.44 log2l
1.95 65.89 2.02 97881202 0.02 0.02 rk4m_
1.80 67.75 1.86 12314052 0.15 0.15 axs_txs_
1.74 69.55 1.80 8475848 0.21 0.66 mvpd_
1.72 71.33 1.78 36345392 0.05 0.05 gauss_
1.53 72.91 1.58 25028687 0.06 0.06 aescudi_
1.52 74.48 1.57 43187368 0.04 0.04 atmostd_
1.44 75.97 1.49 23077428 0.06 0.06 snrm2_
1.43 77.45 1.48 17560212 0.08 0.08 txs_axs_
1.38 78.88 1.43 4062635 0.35 0.42 dchdd_
1.36 80.29 1.41 internal_modf
1.30 81.63 1.34 61800367 0.02 0.06 x2acc_
1.26 82.93 1.30 log
1.25 84.22 1.29 138497176 0.01 0.01 axvc_
1.24 85.50 1.28 15978523 0.08 0.08 sgemv_
1.10 86.64 1.14 10707022 0.11 0.11 ec_txs_
1.09 87.77 1.13 8475648 0.13 0.21 g_eval_
1.06 88.87 1.10 __logl_internal
0.98 89.88 1.01 17765874 0.06 0.07 solgeo_
0.98 90.89 1.01 15978523 0.06 0.06 sger_
You'll notice that basically everything seems slower with -march=haswell (even internal functions like sin/cos/exp!).
I can give an example of code: the function vxs_tvxs, which consumes 2.73 s vs 2.33 s:
SUBROUTINE VXS_TVXS(VXS,TVXS)
  REAL VXS(3),TVXS(3,3)
  VTOT=sqrt(sum(VXS**2))
  VH=sqrt(VXS(1)**2+VXS(2)**2)
  if (VTOT==0.) then
    print*,'PB VXS_TVXS : VTOT=',VTOT
    stop
  endif
  sg=-VXS(3)/VTOT
  cg=VH/VTOT
  if (VH==0.) then
    sc=0.
    cc=1.
  else
    sc=VXS(2)/VH
    cc=VXS(1)/VH
  endif
  TVXS(1,:)=(/ cg*cc, cg*sc, -sg/)
  TVXS(2,:)=(/ -sc, cc, 0./)
  TVXS(3,:)=(/ sg*cc, sg*sc, cg/)
  RETURN
END
Seems quite an innocuous function to me...
I have made a very simple test program:
PROGRAM PIPO
  REAL VXS0(3),VXS(3),TVXS(3,3)
  VXS0=(/50.,100.,200./)
  VXS=VXS0
  call cpu_time(start)
  do k=1,50000000
    call VXS_TVXS(VXS,TVXS)
    VXS=0.5*(VXS0+TVXS(1+mod(k,3),:))
    VXS=cos(VXS)
  enddo
  call cpu_time(finish)
  print*,finish-start,VXS
END
Unfortunately, in this test case, all -march settings end up with about the same run time.
So I really don't understand what is happening... and, as the profiles above show, the fact that even internal functions cost more is very puzzling.

Related

Julia pmap speed - parallel processing - dynamic programming

I am trying to speed up filling in a matrix for a dynamic programming problem in Julia (v0.6.0), and I can't seem to get much extra speed from using pmap. This is related to this question I posted almost a year ago: Filling a matrix using parallel processing in Julia. I was able to speed up serial processing with some great help then, and I'm now trying to get extra speed from parallel processing tools in Julia.
For the serial processing case, I was using a 3-dimensional matrix (essentially a set of equally-sized matrices, indexed by the 1st-dimension) and iterating over the 1st-dimension. I wanted to give pmap a try, though, to more efficiently iterate over the set of matrices.
Here is the code setup. To use pmap with the v_iter function below, I converted the three dimensional matrix into a dictionary object, with the dictionary keys equal to the index values in the 1st dimension (v_dict in the code below, with gcc equal to the 1st-dimension size). The v_iter function takes other dictionary objects (E_opt_dict and gridpoint_m_dict below) as additional inputs:
function v_iter(a,b,c)
    diff_v = 1
    while diff_v>convcrit
        diff_v = -Inf
        #These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B! function
        exp_v = zeros(Float64,gkpc,1)
        A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
        for j=2:gz
            temp=Array{Float64}(gkpc,1)
            A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
            exp_v=hcat(exp_v,temp)
        end
        #This tries to find the optimal value of v
        for h=1:gm
            for j=1:gz
                oldv = a[h,j]
                newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
                a[h,j] = newv
                diff_v = max(diff_v, oldv-newv, newv-oldv)
            end
        end
    end
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp = gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
E_opt_dict=Dict(i => E_opt[i,:,:] for i=1:gcc)
gridpoint_m_dict=Dict(i => gridpoint_m[i,:,:] for i=1:gcc)
For parallel processing, I executed the following commands:
wp = CachingPool(workers())
addprocs(3)
pmap(wp,v_iter,values(v_dict),values(E_opt_dict),values(gridpoint_m_dict))
...which produced this performance:
135.626417 seconds (3.29 G allocations: 57.152 GiB, 3.74% gc time)
I then tried to serial process instead:
for i=1:gcc
v_iter(v_dict[i],E_opt_dict[i],gridpoint_m_dict[i])
end
...and received better performance.
128.263852 seconds (3.29 G allocations: 57.101 GiB, 4.53% gc time)
This also gives me about the same performance as running v_iter on the original 3-dimensional objects:
v=zeros(Float64,gcc,gm,gz)
for i=1:gcc
v_iter(v[i,:,:],E_opt[i,:,:],gridpoint_m[i,:,:])
end
I know that parallel processing involves setup time, but when I increase the value of gcc, I still get about equal processing time for serial and parallel. This seems like a good candidate for parallel processing, since there is no need for messaging between the workers! But I can't seem to make it work efficiently.
You create the CachingPool before adding the worker processes. Hence the caching pool you pass to pmap tells it to use just a single worker.
You can check this by running wp.workers; you will see something like Set([1]).
Hence it should be:
addprocs(3)
wp = CachingPool(workers())
You could also consider running Julia with the -p command-line parameter, e.g. julia -p 3; then you can skip the addprocs(3) command.
On top of that, your for and pmap loops are not equivalent. The Julia Dict is a hash map and, as in other languages, does not guarantee element order. Hence in your for loop you are guaranteed to get the matching i-th element from each dictionary, while with values the ordering does not need to match the original ordering (and the order can even differ between the three collections passed to pmap).
Since the keys for your Dicts are just the numbers from 1 to gcc, you should simply use arrays instead. You can use comprehensions, very much like in Python. For example, instead of
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
use
v_dict_a = [zeros(Float64,gm,gz) for i=1:gcc]
Hope that helps.
Based on Przemyslaw Szufeul's helpful advice, I've placed below the code that properly executes parallel processing. After running it once, I achieved a substantial improvement in running time:
77.728264 seconds (181.20 k allocations: 12.548 MiB)
In addition to reordering the wp command and using the comprehension Przemyslaw recommended, I also recast v_iter as an anonymous function, in order to avoid having to sprinkle @everywhere around the code to feed functions and data to the workers.
I also added return a to the v_iter function, and set v_a below equal to the output of pmap, since you cannot pass by reference to a remote object.
addprocs(3)
v_iter = function(a,b,c)
    diff_v = 1
    while diff_v>convcrit
        diff_v = -Inf
        #These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B! function
        exp_v = zeros(Float64,gkpc,1)
        A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
        for j=2:gz
            temp=Array{Float64}(gkpc,1)
            A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
            exp_v=hcat(exp_v,temp)
        end
        #This tries to find the optimal value of v
        for h=1:gm
            for j=1:gz
                oldv = a[h,j]
                newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
                a[h,j] = newv
                diff_v = max(diff_v, oldv-newv, newv-oldv)
            end
        end
    end
    return a
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp =gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_a=[zeros(Float64,gm,gz) for i=1:gcc]
E_opt_a=[E_opt[i,:,:] for i=1:gcc]
gridpoint_m_a=[gridpoint_m[i,:,:] for i=1:gcc]
wp = CachingPool(workers())
v_a = pmap(wp,v_iter,v_a,E_opt_a,gridpoint_m_a)

Ruby-prof with graph printer and sorting by self puts out total percentages higher than 100%

If I run
ruby-prof -p graph -s self aggregate.rb > graph.txt
the first few lines of my graph.txt will look like:
Total Time: 40.092432
%total %self total self wait child calls Name
--------------------------------------------------------------------------------
5.16 5.16 0.00 0.00 98304/98304 Object#totalDurationFromFile
100.00% 100.00% 5.16 5.16 0.00 0.00 98304 IO#read
--------------------------------------------------------------------------------
4.91 4.91 0.00 0.00 98304/98304 <Class::IO>#new
95.17% 95.17% 4.91 4.91 0.00 0.00 98304 File#initialize
--------------------------------------------------------------------------------
0.37 0.19 0.00 0.17 32768/32769 Hash#each
28.89 4.67 0.00 24.22 1/32769 Object#readFiles
566.81% 94.24% 29.26 4.86 0.00 24.39 32769 Array#collect
14.71 1.98 0.00 12.73 98304/98304 Object#totalDurationFromFile
9.11 0.64 0.00 8.48 98304/131072 Class#new
0.39 0.39 0.00 0.00 98304/196609 <Class::File>#basename
0.00 0.17 0.00 0.00 98304/1202331 Object#main
--------------------------------------------------------------------------------
3.76 3.35 0.00 0.42 524288/524288 Module#class_eval
72.94% 64.85% 3.76 3.35 0.00 0.42 524288 Module#define_method
0.42 0.42 0.00 0.00 524288/524288 BasicObject#singleton_method_added
I don't think that this is specific to my script aggregate.rb. Therefore, I am leaving the source code out for the sake of brevity.
The question is: why are there percentages higher than 100% in the %total column? Is sorting by self not allowed with the graph printer? Is this a bug, or did I overlook something? Help greatly appreciated.
Thanks!
Have you checked if this change on Github resolves the issue? Apparently, the gem version is out of date and/or does not include that change (as it would also increase the number of decimal places to three).

storing multiway data from for loop

I have the following three-way data (I X J X K) for my polymerization system: Z (23x4x3)
Z(:,:,1) = [0 6.70 NaN NaN
0.14 5.79 27212.52 17735.36
0.26 5.04 26545.98 17279.95
0.35 4.43 26007.91 16902.22
0.43 3.92 25567.61 16586.18
0.49 3.50 25202.48 16319.65
0.54 3.15 24898.99 16094.87
0.59 2.85 24648.07 15906.19
0.63 2.60 24441.06 15748.28
0.66 2.38 24270.42 15616.51
0.68 2.20 24130.05 15506.90
0.71 2.05 24014.78 15415.87
0.73 1.92 23921.74 15341.59
0.74 1.80 23847.57 15281.63
0.76 1.70 23789.06 15233.54
0.77 1.61 23744.29 15195.99
0.78 1.54 23710.83 15167.01
0.79 1.47 23687.05 15145.38
0.80 1.41 23671.47 15129.72
0.81 1.36 23662.99 15119.14
0.81 1.31 23660.58 15112.77
0.82 1.27 23663.32 15109.86
0.82 1.23 23670.44 15109.74];
Z(:,:,2) = [0 6.70 NaN NaN
0.17 5.63 24826.03 16191.26
0.30 4.80 24198.87 15757.83
0.40 4.14 23720.27 15417.52
0.47 3.61 23347.38 15147.16
0.54 3.19 23058.01 14933.52
0.59 2.85 22836.18 14766.65
0.63 2.57 22667.24 14637.38
0.66 2.34 22539.27 14537.68
0.69 2.15 22445.60 14463.08
0.71 2.00 22379.90 14409.04
0.73 1.87 22336.70 14371.44
0.75 1.76 22311.74 14347.04
0.76 1.66 22301.57 14333.13
0.77 1.58 22303.32 14327.31
0.78 1.51 22314.83 14327.75
0.79 1.45 22334.27 14333.00
0.80 1.40 22360.11 14341.81
0.81 1.36 22391.09 14353.22
0.81 1.32 22426.11 14366.39
0.82 1.28 22464.22 14380.67
0.82 1.25 22504.61 14395.53
0.82 1.23 22546.61 14410.57];
Z(:,:,3) = [0 6.70 NaN NaN
0.19 5.45 22687.71 14805.97
0.34 4.53 22119.24 14408.55
0.44 3.84 21720.37 14120.95
0.52 3.31 21437.68 13912.54
0.58 2.90 21244.60 13766.39
0.63 2.59 21117.60 13667.05
0.66 2.34 21040.03 13602.91
0.69 2.14 21000.70 13565.85
0.72 1.98 20990.89 13549.24
0.73 1.85 21003.53 13547.54
0.75 1.74 21033.19 13556.41
0.76 1.65 21075.85 13572.54
0.77 1.58 21128.37 13593.46
0.78 1.52 21188.17 13617.25
0.79 1.47 21253.16 13642.44
0.80 1.42 21321.69 13668.02
0.80 1.39 21392.34 13693.18
0.81 1.36 21463.83 13717.38
0.81 1.33 21535.27 13740.33
0.81 1.31 21605.87 13761.81
0.82 1.29 21674.84 13781.70
0.82 1.27 21741.68 13799.97];
where I is time (y-axis), J is variables (x-axis) and K is batch (z-axis). However, since I want to use this data for PCA and PLS analysis, I must change the (time x variables x batch) arrangement to (batch x variables x time), which means the new Z is 3 x 4 x 23.
To do this, I can extract the first row from each of the K slabs and rearrange them into a new (K x J) slab using the following command:
T1 = squeeze(Z(1,:,:))'
Thus, I use a for loop to get the results for all 23 slabs. But I can't (don't know how to) store the results in the workspace, except for the last one. The command I used:
[I,J,K] = size(Z);
SLAB = zeros(K,J,I); % preallocating the matrix; where I=23, J=4, K=3
for t = 1 : I % here I = 23
    slab = squeeze(Z(t,:,:))'; % removing the semicolon here, I can see the wanted results in the command window
    SLAB = slab;
end
Hope anyone here can help me on this.
Thank you
I found the solution:
since I know the result will have size (K,J,I), I must use the same indexing inside the for loop:
[I,J,K] = size(Z);
SLAB = zeros(K,J,I); % preallocating the matrix; where I=23, J=4, K=3
for t = 1 : I % here I = 23
    SLAB(:,:,t) = squeeze(Z(t,:,:))'; % store slab t in the preallocated array
end

Why is running "unique" faster on a data frame than a matrix in R?

I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique on matrices and data frames: it seems to run faster on a data frame.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time({
u1 = unique(a)
})
user system elapsed
1.840 0.000 1.846
system.time({
u2 = unique(b)
})
user system elapsed
0.380 0.000 0.379
The timing results diverge even more substantially as the number of rows is increased. So, there are two parts to this question.
Why is this slower for a matrix? It seems faster to convert to a data frame, run unique, and then convert back.
Is there any reason not to just wrap unique in myUnique, which does the conversions in part #1?
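For concreteness, here is a minimal sketch of what I mean by myUnique (a hypothetical wrapper; it is just the round trip described in part 1):
# Hypothetical helper: convert the matrix to a data frame, run unique there,
# then convert the result back to a matrix.
myUnique <- function(m) {
  as.matrix(unique(as.data.frame(m)))
}
u <- myUnique(a)   # using the matrix a from above; should match unique(a), up to dimnames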
Note 1. Given that a matrix is atomic, it seems that unique should be faster for a matrix, rather than slower. Being able to iterate over fixed-size, contiguous blocks of memory should generally be faster than running over separate blocks of linked lists (I assume that's how data frames are implemented...).
Note 2. As demonstrated by the performance of data.table, running unique on a data frame or a matrix is a comparatively bad idea - see the answer by Matthew Dowle and the comments for relative timings. I've migrated a lot of objects to data tables, and this performance is another reason to do so. So although users would be well served to adopt data tables, for pedagogical / community reasons I'll leave the question open for now regarding why this takes longer on matrix objects. The answers below address where the time goes and how else we can get better performance (i.e. data tables). The answer to the "why" is close at hand - the code can be found via unique.data.frame and unique.matrix. :) An English explanation of what it's doing and why is all that is lacking.
In this implementation, unique.matrix is the same as unique.array
> identical(unique.array, unique.matrix)
[1] TRUE
unique.array has to handle multi-dimensional arrays, which requires additional processing to 'collapse' the extra dimensions (those extra calls to paste()) that is not needed in the 2-dimensional case. The key section of code is:
collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
temp <- if (collapse)
    apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
unique.data.frame is optimised for the 2D case, unique.matrix is not. It could be, as you suggest, it just isn't in the current implementation.
Note that in all cases (unique.{array,matrix,data.frame}) where there is more than one dimension it is the string representation that is compared for uniqueness. For floating point numbers this means 15 decimal digits, so
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))
is 1 while
NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))
and
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))
are both 2. Are you sure unique is what you want?
I'm not sure, but I guess that because a matrix is one contiguous vector, R first copies it into column vectors (like a data.frame) because paste needs a list of vectors. Note that both are slow because both use paste.
Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository because that has the fix to unique you raised in this question. data.table doesn't use paste to do unique.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time(u1<-unique(a))
user system elapsed
2.98 0.00 2.99
system.time(u2<-unique(b))
user system elapsed
0.99 0.00 0.99
c = as.data.table(b)
system.time(u3<-unique(c))
user system elapsed
0.03 0.02 0.05 # 60 times faster than u1, 20 times faster than u2
identical(as.data.table(u2),u3)
[1] TRUE
In attempting to answer my own question, especially part 1, we can see where the time is spent by looking at the results of Rprof. I ran this again, with 5M elements.
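For reference, profiles like these can be generated along the following lines (a sketch; the file names match the summaries below, but the rest is illustrative rather than the exact script used):
# Profile each unique() call into its own file, then summarise it.
a <- matrix(sample(2, 5 * 10^6, replace = TRUE), ncol = 10)   # 5M elements, as above
b <- as.data.frame(a)
Rprof("u1.txt")   # matrix version
u1 <- unique(a)
Rprof(NULL)
Rprof("u2.txt")   # data frame version
u2 <- unique(b)
Rprof(NULL)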
Here are the results for the first unique operation (for the matrix):
> summaryRprof("u1.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 5.70 52.58 5.96 54.98
"apply" 2.70 24.91 10.68 98.52
"FUN" 0.86 7.93 6.82 62.92
"lapply" 0.82 7.56 1.00 9.23
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"c" 0.10 0.92 0.10 0.92
"unlist" 0.08 0.74 1.08 9.96
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"duplicated.default" 0.02 0.18 0.02 0.18
$by.total
total.time total.pct self.time self.pct
"unique" 10.84 100.00 0.00 0.00
"unique.matrix" 10.84 100.00 0.00 0.00
"apply" 10.68 98.52 2.70 24.91
"FUN" 6.82 62.92 0.86 7.93
"paste" 5.96 54.98 5.70 52.58
"unlist" 1.08 9.96 0.08 0.74
"lapply" 1.00 9.23 0.82 7.56
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"do.call" 0.14 1.29 0.00 0.00
"c" 0.10 0.92 0.10 0.92
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"aperm" 0.06 0.55 0.00 0.00
"duplicated.default" 0.02 0.18 0.02 0.18
$sample.interval
[1] 0.02
$sampling.time
[1] 10.84
And for the data frame:
> summaryRprof("u2.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 1.72 94.51 1.72 94.51
"[.data.frame" 0.06 3.30 1.82 100.00
"duplicated.default" 0.04 2.20 0.04 2.20
$by.total
total.time total.pct self.time self.pct
"[.data.frame" 1.82 100.00 0.06 3.30
"[" 1.82 100.00 0.00 0.00
"unique" 1.82 100.00 0.00 0.00
"unique.data.frame" 1.82 100.00 0.00 0.00
"duplicated" 1.76 96.70 0.00 0.00
"duplicated.data.frame" 1.76 96.70 0.00 0.00
"paste" 1.72 94.51 1.72 94.51
"do.call" 1.72 94.51 0.00 0.00
"duplicated.default" 0.04 2.20 0.04 2.20
$sample.interval
[1] 0.02
$sampling.time
[1] 1.82
What we notice is that the matrix version spends a lot of time on apply, paste, and lapply. In contrast, the data frame version simply runs duplicated.data.frame, and most of the time is spent in paste, presumably collapsing the rows into strings for comparison.
Although this explains where the time is going, it doesn't explain why these have different implementations, nor the effects of simply changing from one object type to another.
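To make the paste mechanism concrete, here is a rough sketch (an approximation, not the actual base-R source) of the row collapsing that both methods rely on: each row is turned into a single string key, and rows with duplicated keys are dropped.
# Matrix-style key: apply() visits each row and paste() glues it into one string.
key_m <- apply(a, 1, paste, collapse = "\r")
# Data-frame-style key: one paste() call over the list of columns.
key_d <- do.call(paste, c(as.data.frame(a), sep = "\r"))
identical(key_m, key_d)                      # should be TRUE: same keys, built differently
u <- a[!duplicated(key_m), , drop = FALSE]   # unique rows via the string keys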

Slower ddply when .parallel=TRUE on Mac OS X Version 10.6.7

I am trying to get ddply to run in parallel on my mac. The code I've used is as follows:
library(doMC)
library(ggplot2) # for the purposes of getting the baseball data.frame
registerDoMC(2)
> system.time(ddply(baseball, .(year), numcolwise(mean)))
user system elapsed
0.959 0.106 1.522
> system.time(ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE))
user system elapsed
2.221 2.790 2.552
Why is ddply slower when I run .parallel=TRUE? I have searched online to no avail. I've also tried registerDoMC() and the results were the same.
The baseball data may be too small to see improvement by making the computations parallel; the overhead of passing the data to the different processes may be swamping any speedup by doing the calculations in parallel. Using the rbenchmark package:
baseball10 <- baseball[rep(seq(length=nrow(baseball)), 10),]
benchmark(noparallel = ddply(baseball, .(year), numcolwise(mean)),
parallel = ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE),
noparallel10 = ddply(baseball10, .(year), numcolwise(mean)),
parallel10 = ddply(baseball10, .(year), numcolwise(mean), .parallel=TRUE),
replications = 10)
gives results
test replications elapsed relative user.self sys.self user.child sys.child
1 noparallel 10 4.562 1.000000 4.145 0.408 0.000 0.000
3 noparallel10 10 14.134 3.098203 9.815 4.242 0.000 0.000
2 parallel 10 11.927 2.614423 2.394 1.107 4.836 6.891
4 parallel10 10 18.406 4.034634 4.045 2.580 10.210 9.769
With a 10 times bigger data set, the penalty for parallel is smaller. A more complicated computation would also tilt it even further in parallel's favor, likely giving it an advantage.
This was run on a Mac OS X 10.5.8 Core 2 Duo machine.
Running in parallel will be slower than running sequentially when the communication costs between the nodes are greater than the calculation time of the function. In other words, it takes longer to send the data to/from the nodes than it does to perform the calculation.
For the same data set, the communication costs are approximately fixed, so parallel processing is going to be more useful as the time spent evaluating the function increases.
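One way to see this is to make the per-group work artificially expensive (a sketch with a hypothetical slow_summary function; absolute timings will vary by machine):
library(plyr)
library(doSNOW)
cl <- makeSOCKcluster(2)
registerDoSNOW(cl)
# Hypothetical stand-in for a genuinely expensive per-group computation.
slow_summary <- function(df) {
  Sys.sleep(0.05)            # simulate real work dominating communication costs
  numcolwise(mean)(df)
}
system.time(ddply(baseball, .(year), slow_summary))                    # sequential
system.time(ddply(baseball, .(year), slow_summary, .parallel = TRUE))  # parallel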
UPDATE:
The code below shows that 0.14 seconds (on my machine) are spent evaluating .fun. That means communication has to take less than 0.07 seconds (roughly the time that could be saved by splitting the work across two workers) for parallel to pay off, and that's not realistic for a data set the size of baseball.
Rprof()
system.time(ddply(baseball, .(year), numcolwise(mean)))
# user system elapsed
# 0.28 0.02 0.30
Rprof(NULL)
summaryRprof()$by.self
# self.time self.pct total.time total.pct
# [.data.frame 0.04 12.50 0.10 31.25
# unlist 0.04 12.50 0.10 31.25
# match 0.04 12.50 0.04 12.50
# .fun 0.02 6.25 0.14 43.75
# structure 0.02 6.25 0.12 37.50
# [[ 0.02 6.25 0.08 25.00
# FUN 0.02 6.25 0.06 18.75
# rbind.fill 0.02 6.25 0.06 18.75
# anyDuplicated 0.02 6.25 0.02 6.25
# gc 0.02 6.25 0.02 6.25
# is.array 0.02 6.25 0.02 6.25
# list 0.02 6.25 0.02 6.25
# mean.default 0.02 6.25 0.02 6.25
Here's the parallel version with snow:
library(doSNOW)
cl <- makeSOCKcluster(2)
registerDoSNOW(cl)
Rprof()
system.time(ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE))
# user system elapsed
# 0.46 0.01 0.73
Rprof(NULL)
summaryRprof()$by.self
# self.time self.pct total.time total.pct
# .Call 0.24 33.33 0.24 33.33
# socketSelect 0.16 22.22 0.16 22.22
# lazyLoadDBfetch 0.08 11.11 0.08 11.11
# accumulate.iforeach 0.04 5.56 0.06 8.33
# rbind.fill 0.04 5.56 0.06 8.33
# structure 0.04 5.56 0.04 5.56
# <Anonymous> 0.02 2.78 0.54 75.00
# lapply 0.02 2.78 0.04 5.56
# constantFoldEnv 0.02 2.78 0.02 2.78
# gc 0.02 2.78 0.02 2.78
# stopifnot 0.02 2.78 0.02 2.78
# summary.connection 0.02 2.78 0.02 2.78
