Column-wise sort or top-n of matrix - performance

I need to sort a matrix so that all elements stay in their columns and each column is in ascending order. Is there a vectorized column-wise sort for a matrix or a data frame in R? (My matrix is all-positive and bounded by B, so I can add j*B to each cell in column j and do a regular one-dimensional sort:
> set.seed(100523); m <- matrix(round(runif(30),2), nrow=6); m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.47 0.32 0.29 0.54 0.38
[2,] 0.38 0.91 0.76 0.43 0.92
[3,] 0.71 0.32 0.48 0.16 0.85
[4,] 0.88 0.83 0.61 0.95 0.72
[5,] 0.16 0.57 0.70 0.82 0.05
[6,] 0.77 0.03 0.75 0.26 0.05
> offset <- rep(seq_len(5), rep(6, 5)); offset
[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
> m <- matrix(sort(m + offset), nrow=nrow(m)) - offset; m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.16 0.03 0.29 0.16 0.05
[2,] 0.38 0.32 0.48 0.26 0.05
[3,] 0.47 0.32 0.61 0.43 0.38
[4,] 0.71 0.57 0.70 0.54 0.72
[5,] 0.77 0.83 0.75 0.82 0.85
[6,] 0.88 0.91 0.76 0.95 0.92
But is there something more beautiful already included?) Otherwise, what would be the fastest way if my matrix has around 1M (10M, 100M) entries (roughly a square matrix)? I'm worried about the performance penalty of apply and friends.
Actually, I don't need "sort", just "top n", with n being around 30 or 100, say. I am thinking about using apply and the partial parameter of sort, but I wonder if this is cheaper than just doing a vectorized sort. So, before doing benchmarks on my own, I'd like to ask for advice by experienced users.

If you want to use sort, ?sort indicates that method = "quick" can be twice as fast as the default method with on the order of 1 million elements.
Start with apply(m, 2, sort, method = "quick") and see if that provides sufficient speed.
Do note the comments on this in ?sort though; ties are sorted in a non-stable manner.

I have put down a quick testing framework for the solutions proposed so far.
library(rbenchmark)
sort.q <- function(m) {
sort(m, method='quick')
}
sort.p <- function(m) {
mm <- sort(m, partial=TOP)[1:TOP]
sort(mm)
}
sort.all.g <- function(f) {
function(m) {
o <- matrix(rep(seq_len(SIZE), rep(SIZE, SIZE)), nrow=SIZE)
matrix(f(m+o), nrow=SIZE)[1:TOP,]-o[1:TOP,]
}
}
sort.all <- sort.all.g(sort)
sort.all.q <- sort.all.g(sort.q)
apply.sort.g <- function(f) {
function(m) {
apply(m, 2, f)[1:TOP,]
}
}
apply.sort <- apply.sort.g(sort)
apply.sort.p <- apply.sort.g(sort.p)
apply.sort.q <- apply.sort.g(sort.q)
bb <- NULL
SIZE_LIMITS <- 3:9
TOP_LIMITS <- 2:5
for (SIZE in floor(sqrt(10)^SIZE_LIMITS)) {
for (TOP in floor(sqrt(10)^TOP_LIMITS)) {
print(c(SIZE, TOP))
TOP <- min(TOP, SIZE)
m <- matrix(runif(SIZE*SIZE), floor(SIZE))
if (SIZE < 1000) {
mr <- apply.sort(m)
stopifnot(apply.sort.q(m) == mr)
stopifnot(apply.sort.p(m) == mr)
stopifnot(sort.all(m) == mr)
stopifnot(sort.all.q(m) == mr)
}
b <- benchmark(apply.sort(m),
apply.sort.q(m),
apply.sort.p(m),
sort.all(m),
sort.all.q(m),
columns= c("test", "elapsed", "relative",
"user.self", "sys.self"),
replications=1,
order=NULL)
b$SIZE <- SIZE
b$TOP <- TOP
b$test <- factor(x=b$test, levels=b$test)
bb <- rbind(bb, b)
}
}
ftable(xtabs(user.self ~ SIZE+test+TOP, bb))
The results so far indicate that for all but the biggest matrices, apply really hurts performance unless doing a "top n". For "small" matrices < 1e6, just sorting the whole thing without apply is competitive. For "huge" matrices, sorting the whole array becomes slower than apply. Using partial works best for "huge" matrices and is only a slight loss for "small" matrices.
Please feel free to add your own sorting routine :-)
TOP 10 31 100 316
SIZE test
31 apply.sort(m) 0.004 0.012 0.000 0.000
apply.sort.q(m) 0.008 0.016 0.000 0.000
apply.sort.p(m) 0.008 0.020 0.000 0.000
sort.all(m) 0.000 0.008 0.000 0.000
sort.all.q(m) 0.000 0.004 0.000 0.000
100 apply.sort(m) 0.012 0.016 0.028 0.000
apply.sort.q(m) 0.016 0.016 0.036 0.000
apply.sort.p(m) 0.020 0.020 0.040 0.000
sort.all(m) 0.000 0.004 0.008 0.000
sort.all.q(m) 0.004 0.004 0.004 0.000
316 apply.sort(m) 0.060 0.060 0.056 0.060
apply.sort.q(m) 0.064 0.060 0.060 0.072
apply.sort.p(m) 0.064 0.068 0.108 0.076
sort.all(m) 0.016 0.016 0.020 0.024
sort.all.q(m) 0.020 0.016 0.024 0.024
1000 apply.sort(m) 0.356 0.276 0.276 0.292
apply.sort.q(m) 0.348 0.316 0.288 0.296
apply.sort.p(m) 0.256 0.264 0.276 0.320
sort.all(m) 0.268 0.244 0.213 0.244
sort.all.q(m) 0.260 0.232 0.200 0.208
3162 apply.sort(m) 1.997 1.948 2.012 2.108
apply.sort.q(m) 1.916 1.880 1.892 1.901
apply.sort.p(m) 1.300 1.316 1.376 1.544
sort.all(m) 2.424 2.452 2.432 2.480
sort.all.q(m) 2.188 2.184 2.265 2.244
10000 apply.sort(m) 18.193 18.466 18.781 18.965
apply.sort.q(m) 15.837 15.861 15.977 16.313
apply.sort.p(m) 9.005 9.108 9.304 9.925
sort.all(m) 26.030 25.710 25.722 26.686
sort.all.q(m) 23.341 23.645 24.010 24.073
31622 apply.sort(m) 201.265 197.568 196.181 196.104
apply.sort.q(m) 163.190 160.810 158.757 160.050
apply.sort.p(m) 82.337 81.305 80.641 82.490
sort.all(m) 296.239 288.810 289.303 288.954
sort.all.q(m) 260.872 249.984 254.867 252.087

Does
apply(m, 2, sort)
do the job? :)
Or for top-10, say, use:
apply(m, 2 ,function(x) {sort(x,dec=TRUE)[1:10]})
Performance is strong - for 1e7 rows and 5 cols (5e7 numbers in total), my computer took around 9 or 10 seconds.

R is very fast at matrix calculations. A matrix with 1e7 elements in 1e4 columns gets sorted in under 3 seconds on my machine
set.seed(1)
m <- matrix(runif(1e7), ncol=1e4)
system.time(sm <- apply(m, 2, sort))
user system elapsed
2.62 0.14 2.79
The first 5 columns:
sm[1:15, 1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 2.607703e-05 0.0002085913 9.364448e-05 0.0001937598 1.157424e-05
[2,] 9.228056e-05 0.0003156713 4.948019e-04 0.0002542199 2.126186e-04
[3,] 1.607228e-04 0.0003988042 5.015987e-04 0.0004544661 5.855639e-04
[4,] 5.756689e-04 0.0004399747 5.762535e-04 0.0004621083 5.877446e-04
[5,] 6.932740e-04 0.0004676797 5.784736e-04 0.0004749235 6.470268e-04
[6,] 7.856274e-04 0.0005927107 8.244428e-04 0.0005443178 6.498618e-04
[7,] 8.489799e-04 0.0006210336 9.249109e-04 0.0005917936 6.548134e-04
[8,] 1.001975e-03 0.0006522120 9.424880e-04 0.0007702231 6.569310e-04
[9,] 1.042956e-03 0.0007237203 1.101990e-03 0.0009826915 6.810103e-04
[10,] 1.246256e-03 0.0007968422 1.117999e-03 0.0009873926 6.888523e-04
[11,] 1.337960e-03 0.0009294956 1.229132e-03 0.0009997757 8.671272e-04
[12,] 1.372295e-03 0.0012221676 1.329478e-03 0.0010375632 8.806398e-04
[13,] 1.583430e-03 0.0012781983 1.433513e-03 0.0010662393 8.886999e-04
[14,] 1.603961e-03 0.0013518191 1.458616e-03 0.0012068383 8.903167e-04
[15,] 1.673268e-03 0.0013697683 1.590524e-03 0.0013617468 1.024081e-03

They say there's a fine line between genius and madness... take a look at this and see what you think of the idea. As in the question, the goal is to find the top 30 elements of a vector vec that might be long (1e7, 1e8, or more elements).
topn = 30
sdmult = max(1,qnorm(1-(topn/length(vec))))
sdmin = 1e-5
acceptmult = 10
calcsd = max(sd(vec),sdmin)
calcmn = mean(vec)
thresh = calcmn + sdmult*calcsd
subs = which(vec > thresh)
while (length(subs) > topn * acceptmult) {
thresh = thresh + calcsd
subs = which(vec > thresh)
}
while (length(subs) < topn) {
thresh = thresh - calcsd
subs = which(vec > thresh)
}
topvals = sort(vec[subs],dec=TRUE)[1:topn]
The basic idea is that even if we don't know much about the distribution of vec, we'd certainly expect the highest values in vec to be several standard deviations above the mean. If vec were normally distributed, then the qnorm expression on line 2 gives a rough idea how many sd's above the mean we'd need to look to find the highest topn values (e.g. if vec contains 1e8 values, the top 30 values are likely to be located in the region starting 5 sd's above the mean.) Even if vec isn't normal, this assumption is unlikely to be massively far away from the truth.
Ok, so we compute the mean and sd of vec, and use these to propose a threshold to look above - a certain number of sd's above the mean. We're hoping to find in this upper tail a subset of slightly more than topn values. If we do, we can sort it and easily identify the highest topn values - which will be the highest topn values in vec overall.
Now the exact rules here can probably be tweaked a bit, but the idea is that we need to guard against the original threshold being "out" for some reason. We therefore exploit the fact that it's quick to check how many elements lie above a certain threshold. So, we first raise the threshold, in increments of calcsd, until there are fewer than 10 * topn elements above the threshold. Then, if needed. we reduce thresh (again in steps of calcsd) until we definitely have at least topn elements above the threshold. This bi-directional search should always lead to a "threshold set" whose size is fairly close to topn (hopefully within a factor of 10 or 100). As topn is relatively small (typical value 30), it will be really fast to sort this threshold set, which of course immediately gives us the highest topn elements in the original vector vec.
My claim is that the calculations involved in generating a decent threshold set are all quick in R, so if only the top 30 or so elements of a very large vector are required, this indirect approach will beat any approach that involves sorting the whole vector.
What do you think?! If you think it's an interesting idea, please like/vote up :) I'll look at doing some proper timings but my initial tests on randomly generated data were really promising - it'd be great to test it out on "real" data though...!
Cheers :)

Related

Julia pmap speed - parallel processing - dynamic programming

I am trying to speed up filling in a matrix for a dynamic programming problem in Julia (v0.6.0), and I can't seem to get much extra speed from using pmap. This is related to this question I posted almost a year ago: Filling a matrix using parallel processing in Julia. I was able to speed up serial processing with some great help then, and I'm now trying to get extra speed from parallel processing tools in Julia.
For the serial processing case, I was using a 3-dimensional matrix (essentially a set of equally-sized matrices, indexed by the 1st-dimension) and iterating over the 1st-dimension. I wanted to give pmap a try, though, to more efficiently iterate over the set of matrices.
Here is the code setup. To use pmap with the v_iter function below, I converted the three dimensional matrix into a dictionary object, with the dictionary keys equal to the index values in the 1st dimension (v_dict in the code below, with gcc equal to the 1st-dimension size). The v_iter function takes other dictionary objects (E_opt_dict and gridpoint_m_dict below) as additional inputs:
function v_iter(a,b,c)
diff_v = 1
while diff_v>convcrit
diff_v = -Inf
#These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B function
exp_v = zeros(Float64,gkpc,1)
A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
for j=2:gz
temp=Array{Float64}(gkpc,1)
A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
exp_v=hcat(exp_v,temp)
end
#This tries to find the optimal value of v
for h=1:gm
for j=1:gz
oldv = a[h,j]
newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
a[h,j] = newv
diff_v = max(diff_v, oldv-newv, newv-oldv)
end
end
end
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp = gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
E_opt_dict=Dict(i => E_opt[i,:,:] for i=1:gcc)
gridpoint_m_dict=Dict(i => gridpoint_m[i,:,:] for i=1:gcc)
For parallel processing, I executed the following two commands:
wp = CachingPool(workers())
addprocs(3)
pmap(wp,v_iter,values(v_dict),values(E_opt_dict),values(gridpoint_m_dict))
...which produced this performance:
135.626417 seconds (3.29 G allocations: 57.152 GiB, 3.74% gc time)
I then tried to serial process instead:
for i=1:gcc
v_iter(v_dict[i],E_opt_dict[i],gridpoint_m_dict[i])
end
...and received better performance.
128.263852 seconds (3.29 G allocations: 57.101 GiB, 4.53% gc time)
This also gives me about the same performance as running v_iter on the original 3-dimensional objects:
v=zeros(Float64,gcc,gm,gz)
for i=1:gcc
v_iter(v[i,:,:],E_opt[i,:,:],gridpoint_m[i,:,:])
end
I know that parallel processing involves setup time, but when I increase the value of gcc, I still get about equal processing time for serial and parallel. This seems like a good candidate for parallel processing, since there is no need for messaging between the workers! But I can't seem to make it work efficiently.
You create the CachingPool before adding the worker processes. Hence your caching pool passed to pmap tells it to use just a single worker.
You can simply check it by running wp.workers you will see something like Set([1]).
Hence it should be:
addprocs(3)
wp = CachingPool(workers())
You could also consider running Julia -p command line parameter e.g. julia -p 3 and then you can skip the addprocs(3) command.
On top of that your for and pmap loops are not equivalent. The Julia Dict object is a hashmap and similar to other languages does not offer anything like element order. Hence in your for loop you are guaranteed to get the same matching i-th element while with the values the ordering of values does not need to match the original ordering (and you can have different order for each of those three variables in the pmap loop).
Since the keys for your Dicts are just numbers from 1 up to gcc you should simply use arrays instead. You can use generators very similar to Python. For an example instead of
v_dict=Dict(i => zeros(Float64,gm,gz) for i=1:gcc)
use
v_dict_a = [zeros(Float64,gm,gz) for i=1:gcc]
Hope that helps.
Based on #Przemyslaw Szufeul's helpful advice, I've placed below the code that properly executes parallel processing. After running it once, I achieved substantial improvement in running time:
77.728264 seconds (181.20 k allocations: 12.548 MiB)
In addition to reordering the wp command and using the generator Przemyslaw recommended, I also recast v_iter as an anonymous function, in order to avoid having to sprinkle #everywhere around the code to feed functions and data to the workers.
I also added return a to the v_iter function, and set v_a below equal to the output of pmap, since you cannot pass by reference to a remote object.
addprocs(3)
v_iter = function(a,b,c)
diff_v = 1
while diff_v>convcrit
diff_v = -Inf
#These lines efficiently multiply the value function by the Markov transition matrix, using the A_mul_B function
exp_v = zeros(Float64,gkpc,1)
A_mul_B!(exp_v,a[1:gkpc,:],Zprob[1,:])
for j=2:gz
temp=Array{Float64}(gkpc,1)
A_mul_B!(temp,a[(j-1)*gkpc+1:(j-1)*gkpc+gkpc,:],Zprob[j,:])
exp_v=hcat(exp_v,temp)
end
#This tries to find the optimal value of v
for h=1:gm
for j=1:gz
oldv = a[h,j]
newv = (1-tau)*b[h,j]+beta*exp_v[c[h,j],j]
a[h,j] = newv
diff_v = max(diff_v, oldv-newv, newv-oldv)
end
end
end
return a
end
gz = 9
gp = 13
gk = 17
gcc = 5
gm = gk * gp * gcc * gz
gkpc = gk * gp * gcc
gkp =gk*gp
beta = ((1+0.015)^(-1))
tau = 0.35
Zprob = [0.43 0.38 0.15 0.03 0.00 0.00 0.00 0.00 0.00; 0.05 0.47 0.35 0.11 0.02 0.00 0.00 0.00 0.00; 0.01 0.10 0.50 0.30 0.08 0.01 0.00 0.00 0.00; 0.00 0.02 0.15 0.51 0.26 0.06 0.01 0.00 0.00; 0.00 0.00 0.03 0.21 0.52 0.21 0.03 0.00 0.00 ; 0.00 0.00 0.01 0.06 0.26 0.51 0.15 0.02 0.00 ; 0.00 0.00 0.00 0.01 0.08 0.30 0.50 0.10 0.01 ; 0.00 0.00 0.00 0.00 0.02 0.11 0.35 0.47 0.05; 0.00 0.00 0.00 0.00 0.00 0.03 0.15 0.38 0.43]
convcrit = 0.001 # chosen convergence criterion
E_opt = Array{Float64}(gcc,gm,gz)
fill!(E_opt,10.0)
gridpoint_m = Array{Int64}(gcc,gm,gz)
fill!(gridpoint_m,fld(gkp,2))
v_a=[zeros(Float64,gm,gz) for i=1:gcc]
E_opt_a=[E_opt[i,:,:] for i=1:gcc]
gridpoint_m_a=[gridpoint_m[i,:,:] for i=1:gcc]
wp = CachingPool(workers())
v_a = pmap(wp,v_iter,v_a,E_opt_a,gridpoint_m_a)

Algorithm for optimal expected amount in a profit/loss game

I came upon the following question recently,
"You have a box which has G green and B blue coins. Pick a random coin, G gives a profit of +1 and blue a loss of -1. If you play optimally what is the expected profit."
I was thinking of using a brute force algorithm where I consider all possibilities of combinations of green and blue coins but I'm sure there must be a better solution for this (range of B and G was from 0 to 5000). Also what does playing optimally mean? Does it mean that if i pick all blue coins then I would continue playing till all green coins are also picked? If so then this means I shouldn't consider all possibilities of green and blue coins?
The "obvious" answer is to play whenever there's more green coins than blue coins. In fact, this is wrong. For example, if there's 999 green coins and 1000 blue coins, here's a strategy that takes an expected profit:
Take 2 coins
If GG -- stop with a profit of 2
if BG or GB -- stop with a profit of 0
if BB -- take all the remaining coins for a profit of -1
Since the first and last possibilities both occur with near 25% probability, your overall expectation is approximately 0.25*2 - 0.25*1 = 0.25
This is just a simple strategy in one extreme example that shows that the problem is not as simple as it first seems.
In general, the expectations with g green coins and b blue coins is given by a recurrence relation:
E(g, 0) = g
E(0, b) = 0
E(g, b) = max(0, g(E(g-1, b) + 1)/(b+g) + b(E(g, b-1) - 1)/(b+g))
The max in the final row occurs because if it's -EV to play, then you're better stopping.
These recurrence relations can be solved using dynamic programming in O(gb) time.
from fractions import Fraction as F
def gb(G, B):
E = [[F(0, 1)] * (B+1) for _ in xrange(G+1)]
for g in xrange(G+1):
E[g][0] = F(g, 1)
for b in xrange(1, B+1):
for g in xrange(1, G+1):
E[g][b] = max(0, (g * (E[g-1][b]+1) + b * (E[g][b-1]-1)) * F(1, (b+g)))
for row in E:
for v in row:
print '%5.2f' % v,
print
print
return E[G][B]
print gb(8, 10)
Output:
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2.00 1.33 0.67 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3.00 2.25 1.50 0.85 0.34 0.00 0.00 0.00 0.00 0.00 0.00
4.00 3.20 2.40 1.66 1.00 0.44 0.07 0.00 0.00 0.00 0.00
5.00 4.17 3.33 2.54 1.79 1.12 0.55 0.15 0.00 0.00 0.00
6.00 5.14 4.29 3.45 2.66 1.91 1.23 0.66 0.23 0.00 0.00
7.00 6.12 5.25 4.39 3.56 2.76 2.01 1.34 0.75 0.30 0.00
8.00 7.11 6.22 5.35 4.49 3.66 2.86 2.11 1.43 0.84 0.36
7793/21879
From this you can see that the expectation is positive to play with 8 green and 10 blue coins (EV=7793/21879 ~= 0.36), and you even have positive expectation with 2 green and 3 blue coins (EV=0.2)
Simple and intuitive answer:
you should start off with an estimate for the total number of blue and green coins. After each pick you will update this estimate. If you estimate there are more blue coins than green coins at any point you should stop.
Example:
you start and you pick a coin. Its green so you estimate 100% of the coins are green. You pick a blue so you estimate 50% of coins are green. You pick another blue coin so you estimate 33% of the coins are green. At this point is isn't worth playing anymore, according to your estimate, so you stop.
This answer is wrong; see Paul Hankin's answer for counterexamples and a proper analysis. I leave this answer here as a learning example for all of us.
Assuming that your choice is only when to stop picking coins, you continue as long as G > B. That part is simple. If you start with G < B, then you never start drawing, and your gain is 0. For G = B, no strategy will get you a mathematical advantage; the gain there is also 0.
For the expected reward, take this in two steps:
(1) Expected value on any draw sequence. Do this recursively, figuring the chance of getting green or blue on the first draw, and then the expected values for the new state (G-1, B) or (G, B-1). You will quickly see that the expected value of any given draw number (such as all possibilities for the 3rd draw) is the same as the original.
Therefore, your expected value on any draw is e = (G-B) / (G+B). Your overall expected value is e * d, where d is the number of draws you choose.
(2) What is the expected number of draws? How many times do you expect to draw before G = B? I'll leave this as an exercise for the student, but note the previous idea of doing this recursively. You might find it easier to describe the state of the game as (extra, total), where extra = G-B and total = G+B.
Illustrative exercise: given G=4, B=2, what is the chance that you'll draw GG on the first two draws (and then stop the game)? What is the gain from that? How does that compare with the (4-2)/(4+2) advantage on each draw?

Heap sort running time

I posted a similar question last time when I compared the running times of two different implementations of insertion sort. I have a similar problem now. I know that a heap sort's complexity is O(nlogn), same as the quick sort in the average case. But here are my results for when I sort an array of randomly generated integers of size 10,000.
Quick sort: Time taken for execution: 0.005288
Heap sort: Time taken for execution: 0.234245
As you can see, my heap sort is taking much more time than it should.
Here is my code for the max_heapify, build_max_heap and heapsort functions:
void max_heapify(int *a,int n,int i)
{
int largest = 0;
if((2*i)<=n && a[2*i] > a[i]) largest = 2*i;
else largest = i;
if((2*i+1)<=n && a[2*i+1] > a[largest]) largest = 2*i+1;
if(largest != i)
{
swap(a[i],a[largest]);
max_heapify(a,n,largest);
}
}
void build_max_heap(int *a,int n)
{
for(int i=1;i<=n/2;i++)
{
max_heapify(a,n,i);
}
}
void heapsort(int *a,int n)
{
while(n>0)
{
build_max_heap(a,n);
swap(a[1],a[n--]);
}
}
Can anybody tell me the flaw in the implementation? The code works. I tested it personally for smaller sample sizes, but it's not efficient.
Any help is appreciated. Thanks!
I guess repetitive build_max_heap() is the flaw. Heapsort build a heap tree once, then delete a largest element in the root node repeatedly.
GIF Animation in wikipedia is a good sample to understand.
I have evaluated 3 algorithms with random strings. N=100,000
qsort(3) usec = 54011 call = 0 compare = 1536365 copy = 0
qsort_trad() usec = 67603 call = 99999 compare = 2368481 copy = 1344918
heap_sort() usec = 88814 call = 1624546 compare = 3019351 copy = 1963682
qsort(3) is merge sort of index sorting in GNU C library.
qsort_trad() is conventional median-of-3 quicksort.
heap_sort() is my implementation.
Synopsis is sames to qsort(3).
You are trying to make a determination based on a single random sample (regardless of how many elements are in the sample), which is unreliable in the extreme.
The difference may be due to code outside of the algorithms (possibly the first algorithm run took time to initialize things that did not need to be initialized again for the second). This can also be addressed by running multiple samples (say, by alternating between algorithms).
FWIW - I found sorting 30000 element produced several elements out of order.
void check(const int *a, int n)
{
int i=1;
for(i=1; i<n; i++)
{
if(a[i]< a[i-1])
{
printf("out of order a[%d]=%d a[%d]=%d\n", i,a[i], i-1, a[i-1]);
}
}
}
Produced this:
out of order a[1]=0 a[0]=1171034357
out of order a[29989]=2147184723 a[29988]=2147452680
out of order a[29991]=2146884982 a[29990]=2147187654
out of order a[29993]=2145044438 a[29992]=2147179272
out of order a[29995]=2145040183 a[29994]=2146044388
out of order a[29996]=2130746386 a[29995]=2145040183
out of order a[29997]=2062733752 a[29996]=2130746386
out of order a[29998]=2026139713 a[29997]=2062733752
out of order a[29999]=1957918169 a[29998]=2026139713
Profiling:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
42.57 1.06 1.06 _mcount_private
28.11 1.76 0.70 225000000 0.00 0.00 max_heapify
19.68 2.25 0.49 __fentry__
9.64 2.49 0.24 30000 0.01 0.03 build_max_heap
0.00 2.49 0.00 427498 0.00 0.00 swap
0.00 2.49 0.00 1 0.00 0.00 check
0.00 2.49 0.00 1 0.00 940.00 heapsort
Ignore _mcount_private it is there from profiling.
Consider:
1. Be sure your code works correctly.
2. Then worry about speed of execution.
3 #ScottHunter (+1) is correct - your analysis is flawed and so in the same sense is the above profile.
Second profile on different set of 30000 random numbers:
$ ./heapsort
out of order a[1]=0 a[0]=1451707585
out of order a[29989]=2147418839 a[29988]=2147447414
out of order a[29990]=2147336854 a[29989]=2147418839
out of order a[29991]=2147314086 a[29990]=2147336854
out of order a[29992]=2147202205 a[29991]=2147314086
out of order a[29993]=2147177981 a[29992]=2147202205
out of order a[29994]=2133236301 a[29993]=2147177981
out of order a[29996]=2100055600 a[29995]=2137349904
out of order a[29998]=2044161181 a[29997]=2108306471
out of order a[29999]=1580699248 a[29998]=2044161181
Owner#Owner-PC ~
$ gprof heapsort | more
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
45.20 1.13 1.13 _mcount_private
30.00 1.88 0.75 225000000 0.00 0.00 max_heapify
13.20 2.21 0.33 __fentry__
11.60 2.50 0.29 30000 0.00 0.00 build_max_heap
0.00 2.50 0.00 427775 0.00 0.00 swap
0.00 2.50 0.00 1 0.00 0.00 check
0.00 2.50 0.00 1 0.00 1.04 heapsort

Speed up assembling matrix by interleaving vectors?

I have two vectors of arbitrary and equal length
a <- c(0.8,0.8,0.8)
b <- c(0.4,0.4,0.4)
n <- length(a)
From these I need to assemble an 2n by 2n matrix of the form:
x = [1-a1 b1 1-a2 b2 1-a3 b3
a1 1-b1 a2 1-b2 a3 1-b3
1-a1 b1 1-a2 b2 1-a3 b3
a1 1-b1 a2 1-b2 a3 1-b3
1-a1 b1 1-a2 b2 1-a3 b3
a1 1-b1 a2 1-b2 a3 1-b3]
I currently do this using
x <- matrix(rep(as.vector(rbind(
c(1-a,a),
c(b, 1-b))),
n),
ncol=n*2, byrow=TRUE)
How can I speed up this operation? Profiling indicates that matrix is taking the most time:
Rprof("out.prof")
for (i in 1:100000) {
x <- matrix(rep(as.vector(rbind(
c(1-a,a),
c(b, 1-b))),
n),
ncol=n*2, byrow=TRUE)
}
Rprof(NULL)
summaryRprof("out.prof")
##$by.self
## self.time self.pct total.time total.pct
##"matrix" 1.02 63.75 1.60 100.00
##"rbind" 0.24 15.00 0.36 22.50
##"as.vector" 0.18 11.25 0.54 33.75
##"c" 0.10 6.25 0.10 6.25
##"*" 0.04 2.50 0.04 2.50
##"-" 0.02 1.25 0.02 1.25
##
##$by.total
## total.time total.pct self.time self.pct
##"matrix" 1.60 100.00 1.02 63.75
##"as.vector" 0.54 33.75 0.18 11.25
##"rbind" 0.36 22.50 0.24 15.00
##"c" 0.10 6.25 0.10 6.25
##"*" 0.04 2.50 0.04 2.50
##"-" 0.02 1.25 0.02 1.25
##
##$sample.interval
##[1] 0.02
##
##$sampling.time
##[1] 1.6
I don't think there is an alternative to matrix being the slowest part of your profile, but you can definitely save a little time by optimizing the rest. For example:
x <- matrix(rbind(c(1-a,a), c(b, 1-b)), 2*n, 2*n, byrow=TRUE)
Also, although I would not recommend it, you can save a little extra time by using the Internal matrix function:
x <- .Internal(matrix(rbind(c(1-a,a), c(b, 1-b)),
n*2, n*2, TRUE, NULL, FALSE, FALSE))
Here are some benchmarks:
benchmark(
method0 = matrix(rep(as.vector(rbind(c(1-a,a), c(b, 1-b))), n),
ncol=n*2, byrow=TRUE),
method1 = matrix(rbind(c(1-a,a), c(b, 1-b)), 2*n, 2*n, byrow=TRUE),
method2 = .Internal(matrix(rbind(c(1-a,a), c(b, 1-b)),
n*2, n*2, TRUE, NULL, FALSE, FALSE)),
replications = 100000,
order = "relative")
# test replications elapsed relative user.self sys.self user.child sys.child
# 3 method2 100000 1.00 1.00 0.99 0 NA NA
# 2 method1 100000 1.13 1.13 1.12 0 NA NA
# 1 method0 100000 1.46 1.46 1.46 0 NA NA
I get a small speedup with the following:
f = function(a, b, n){
z = rbind(
c(rbind(1 - a, b)),
c(rbind(a, 1 - b))
)
do.call(rbind, lapply(1:n, function(i) z))
}
I'll keep looking.
Edit I'm stumped. If this isn't good enough, I'd recommend inlining some rcpp.

Why is running "unique" faster on a data frame than a matrix in R?

I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique on matrices and data frames: it seems to run faster on a data frame.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time({
u1 = unique(a)
})
user system elapsed
1.840 0.000 1.846
system.time({
u2 = unique(b)
})
user system elapsed
0.380 0.000 0.379
The timing results diverge even more substantially as the number of rows is increased. So, there are two parts to this question.
Why is this slower for a matrix? It seems faster to convert to a data frame, run unique, and then convert back.
Is there any reason not to just wrap unique in myUnique, which does the conversions in part #1?
Note 1. Given that a matrix is atomic, it seems that unique should be faster for a matrix, rather than slower. Being able to iterate over fixed-size, contiguous blocks of memory should generally be faster than running over separate blocks of linked lists (I assume that's how data frames are implemented...).
Note 2. As demonstrated by the performance of data.table, running unique on a data frame or a matrix is a comparatively bad idea - see the answer by Matthew Dowle and the comments for relative timings. I've migrated a lot of objects to data tables, and this performance is another reason to do so. So although users should be well served to adopt data tables, for pedagogical / community reasons I'll leave the question open for now regarding the why does this take longer on the matrix objects. The answers below address where does the time go, and how else can we get better performance (i.e. data tables). The answer to why is close at hand - the code can be found via unique.data.frame and unique.matrix. :) An English explanation of what it's doing & why is all that is lacking.
In this implementation, unique.matrix is the same as unique.array
> identical(unique.array, unique.matrix)
[1] TRUE
unique.array has to handle multi-dimensional arrays which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()) which are not needed in the 2-dimensional case. The key section of code is:
collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
temp <- if (collapse)
apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
unique.data.frame is optimised for the 2D case, unique.matrix is not. It could be, as you suggest, it just isn't in the current implementation.
Note that in all cases (unique.{array,matrix,data.table}) where there is more than one dimension it is the string representation that is compared for uniqueness. For floating point numbers this means 15 decimal digits so
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))
is 1 while
NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))
and
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))
are both 2. Are you sure unique is what you want?
Not sure but I guess that because matrix is one contiguous vector, R copies it into column vectors first (like a data.frame) because paste needs a list of vectors. Note that both are slow because both use paste.
Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository because that has the fix to unique you raised in this question. data.table doesn't use paste to do unique.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time(u1<-unique(a))
user system elapsed
2.98 0.00 2.99
system.time(u2<-unique(b))
user system elapsed
0.99 0.00 0.99
c = as.data.table(b)
system.time(u3<-unique(c))
user system elapsed
0.03 0.02 0.05 # 60 times faster than u1, 20 times faster than u2
identical(as.data.table(u2),u3)
[1] TRUE
In attempting to answer my own question, especially part 1, we can see where the time is spent by looking at the results of Rprof. I ran this again, with 5M elements.
Here are the results for the first unique operation (for the matrix):
> summaryRprof("u1.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 5.70 52.58 5.96 54.98
"apply" 2.70 24.91 10.68 98.52
"FUN" 0.86 7.93 6.82 62.92
"lapply" 0.82 7.56 1.00 9.23
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"c" 0.10 0.92 0.10 0.92
"unlist" 0.08 0.74 1.08 9.96
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"duplicated.default" 0.02 0.18 0.02 0.18
$by.total
total.time total.pct self.time self.pct
"unique" 10.84 100.00 0.00 0.00
"unique.matrix" 10.84 100.00 0.00 0.00
"apply" 10.68 98.52 2.70 24.91
"FUN" 6.82 62.92 0.86 7.93
"paste" 5.96 54.98 5.70 52.58
"unlist" 1.08 9.96 0.08 0.74
"lapply" 1.00 9.23 0.82 7.56
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"do.call" 0.14 1.29 0.00 0.00
"c" 0.10 0.92 0.10 0.92
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"aperm" 0.06 0.55 0.00 0.00
"duplicated.default" 0.02 0.18 0.02 0.18
$sample.interval
[1] 0.02
$sampling.time
[1] 10.84
And for the data frame:
> summaryRprof("u2.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 1.72 94.51 1.72 94.51
"[.data.frame" 0.06 3.30 1.82 100.00
"duplicated.default" 0.04 2.20 0.04 2.20
$by.total
total.time total.pct self.time self.pct
"[.data.frame" 1.82 100.00 0.06 3.30
"[" 1.82 100.00 0.00 0.00
"unique" 1.82 100.00 0.00 0.00
"unique.data.frame" 1.82 100.00 0.00 0.00
"duplicated" 1.76 96.70 0.00 0.00
"duplicated.data.frame" 1.76 96.70 0.00 0.00
"paste" 1.72 94.51 1.72 94.51
"do.call" 1.72 94.51 0.00 0.00
"duplicated.default" 0.04 2.20 0.04 2.20
$sample.interval
[1] 0.02
$sampling.time
[1] 1.82
What we notice is that the matrix version spends a lot of time on apply, paste, and lapply. In contrast, the data frame version simple runs duplicated.data.frame and most of the time is spent in paste, presumably aggregating results.
Although this explains where the time is going, it doesn't explain why these have different implementations, nor the effects of simply changing from one object type to another.

Resources