a faster implementation of merge.data.frame() in R - performance

Let's say a and b are two data frames. The goal is to write a function
f(a,b) that produces a merged data frame, in the same way that
merge(a,b,all=TRUE) would, that is, filling missing variables in a or b with NAs. (The problem is that merge() appears to be very slow.)
This can be done as follows (pseudo-code):
for each variable `var` found in either `a` or `b`, do:
unlist(list(a.srcvar, b.srcvar), recursive=FALSE, use.names=FALSE)
where:
x.srcvar is x$var if x$var exists, or else
rep(NA, nrow(x)) if y$var is not a factor, or else
as.factor(rep(NA, nrow(x)))
and then wrap everything in a data frame.
Here's a "naive" implementation:
merge.datasets1 <- function(a, b) {
  a.fill <- rep(NA, nrow(a))
  b.fill <- rep(NA, nrow(b))
  a.fill.factor <- as.factor(a.fill)
  b.fill.factor <- as.factor(b.fill)
  out <- list()
  for (v in union(names(a), names(b))) {
    if (!v %in% names(a)) {
      b.srcvar <- b[[v]]
      if (is.factor(b.srcvar))
        a.srcvar <- a.fill.factor
      else
        a.srcvar <- a.fill
    } else {
      a.srcvar <- a[[v]]
      if (v %in% names(b))
        b.srcvar <- b[[v]]
      else if (is.factor(a.srcvar))
        b.srcvar <- b.fill.factor
      else
        b.srcvar <- b.fill
    }
    out[[v]] <- unlist(list(a.srcvar, b.srcvar),
                       recursive=FALSE, use.names=FALSE)
  }
  data.frame(out)
}
Here's a different implementation that uses "vectorized" functions:
merge.datasets2 <- function(a, b) {
  srcvar <- within(list(var=union(names(a), names(b))), {
    a.exists <- var %in% names(a)
    b.exists <- var %in% names(b)
    a.isfactor <- unlist(lapply(var, function(v) is.factor(a[[v]])))
    b.isfactor <- unlist(lapply(var, function(v) is.factor(b[[v]])))
    a <- ifelse(a.exists, var, ifelse(b.isfactor, 'fill.factor', 'fill'))
    b <- ifelse(b.exists, var, ifelse(a.isfactor, 'fill.factor', 'fill'))
  })
  a <- within(a, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  b <- within(b, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  out <- mapply(function(x,y) unlist(list(a[[x]], b[[y]]),
                                     recursive=FALSE, use.names=FALSE),
                srcvar$a, srcvar$b, SIMPLIFY=FALSE, USE.NAMES=FALSE)
  out <- data.frame(out)
  names(out) <- srcvar$var
  out
}
Now we can test:
sample.datasets <- lapply(1:50, function(i) iris[,sample(names(iris), 4)])
system.time(invisible(Reduce(merge.datasets1, sample.datasets)))
>> user system elapsed
>> 0.192 0.000 0.190
system.time(invisible(Reduce(merge.datasets2, sample.datasets)))
>> user system elapsed
>> 2.292 0.000 2.293
So, the naive version is roughly an order of magnitude faster than the "vectorized" one. How can
this be? I always thought that for loops are slow and that one should
use lapply and friends instead, steering clear of loops in R. I would welcome any ideas on how to improve my function in terms of speed.

In fact, you are not trying to replicate merge(a, b, all = TRUE) at all, as you are not merging on any of the columns. Instead, you are simply stacking the data, filling with NA where a column does not exist.
# note that this is not what you want
dim(merge(sample.datasets[[1]], sample.datasets[[2]], all = T))
[1] 314 5
The reason merge(a, b, all = TRUE) is slow is that it defaults to merging by the intersection of the names. If you convert to data.tables, the merge.data.table method is lightning fast, but with your test data it would create an exponentially growing dataset on each successive merge (not the 7500 by 5 result you want).
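A quick way to see this (a sketch, reusing sample.datasets from above): successive all = TRUE merges do not give the simple row-stack you are after, which would have 150 * k rows.
sapply(2:4, function(k)
  nrow(Reduce(function(x, y) merge(x, y, all = TRUE), sample.datasets[1:k])))
150 * (2:4)  # the row counts a plain stack of k datasets would have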
An easy solution is to use rbind.fill from the plyr package.
library(plyr)
system.time({.x <- Reduce(rbind.fill, sample.datasets)})
## user system elapsed
## 0.16 0.00 0.15
# which is almost identical to
system.time(.old <- Reduce(merge.datasets1, sample.datasets))
## user system elapsed
## 0.14 0.00 0.14
EDIT 2-11-2012
On further consideration, it is worth noting that you can pass a list of data.frames directly to rbind.fill, so
system.time(super_fast <- rbind.fill(sample.datasets))
## user system elapsed
## 0.02 0.00 0.02
identical(super_fast, .old)
[1] TRUE
The majority of the time is spent in the overhead of Reduce, which rbind.fill does not require.
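For completeness, data.table's rbindlist can do the same stacking in a single call. This is only a sketch, assuming a data.table version recent enough to have the fill argument; note that it returns a data.table rather than a plain data.frame, and factor columns may be handled slightly differently than by rbind.fill.
library(data.table)
system.time(dt_stacked <- rbindlist(sample.datasets, use.names = TRUE, fill = TRUE))
dim(dt_stacked)  # should be 7500 by 5, matching the rbind.fill result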

Related

JAGS - pow function does not work properly in mixture model with label switching

I am fitting a mixture model to estimate the average of a trait in each of 3 populations.
I have a label-switching issue, and to relabel the population clusters I am trying to compute the distance between the observed and expected numbers of individuals of each genotype in each population. Below is a reproducible example.
For some reason, JAGS does not seem to compute the squared distances properly. The corresponding line in the code below is: dist[k,j] <- pow(DistNumPerClust[k,j], 2)
Hence, the output matrix results$mean$dist differs from the matrix results$mean$DistNumPerClust^2 computed a posteriori.
Would anyone know a way to solve this?
library(R2jags)
library(runjags)
library(dirmult)
set.seed(123)
############################
## Simulation of the data ##
############################
npop=3
ngeno=2
freqbalance=1
nsamplesizeperpop <- 100
freqMLG <- t(rdirichlet(n=npop, alpha=rep(freqbalance, ngeno)))
samplesizegenoperpop <- sweep(freqMLG, 1, nsamplesizeperpop, "*")
## Compute membership (probability that a genotype comes from pop 1, 2 or 3)
## Genotype as rows and populations as columns
membership <- sweep(freqMLG, 1, rowSums(freqMLG), "/")
# Parameters for simulations
nind=90
N = npop*nind # nb of observations
clust <- rep(1:npop, each=N/npop)
geno <- c()
for (i in 1:N){
geno <- c(geno, sum(rmultinom(n=1, size=1, prob=freqMLG[, clust[i]])*1:ngeno))
}
numgeno <- as.numeric(table(geno))
## Multiply membership probabilities by sample size for each genotype
ExpNumPerClust <- sweep(membership, 1, numgeno, "*")
muOfClustsim <- c(1, 20, 50) # vector of population means
sigma <- 1.5 # residual sd
(tausim <- 1/(sigma*sigma)) # precision
# parameters are treated as data for the simulation step
data <- list(N=N, npop=npop, ngeno=ngeno, geno=geno, muOfClustsim=muOfClustsim, tausim=tausim, samplesizegenoperpop=samplesizegenoperpop)
## JAGS model
txtstring <- "
data{
# Likelihood:
for (i in 1:N){
ysim[i] ~ dnorm(eta[i], tausim) # tau is precision (1 / variance)
eta[i] <- muOfClustsim[clust[i]]
clust[i] ~ dcat( pClust[geno[i], 1:npop] )
}
for (k in 1:ngeno){
pClust[k, 1:npop] ~ ddirch( samplesizegenoperpop[k,] )
}
}
model{
fake <- 0
}
"
# Simulate with jags
out <- run.jags(txtstring, data = data, monitor=c("ysim"), sample=1, n.chains=1, summarise=FALSE)
# reformat the outputs
ysim <- coda::as.mcmc(out)[1:N]
## Estimation model
bayes.mod <- function(){
# Likelihood:
for (i in 1:N){
ysim[i] ~ dnorm(eta[i], tau) # tau is precision (1 / variance)
eta[i] <- beta[clust[i]]
clust[i] ~ dcat( pClust[geno[i], 1:npop] )
}
for (k in 1:ngeno){
## pClust membership estimates
pClust[k, 1:npop] ~ ddirch( samplesizegenoperpop[k,] )
}
for (k in 1:ngeno){
for (j in 1:npop){
# problem of label switching: try to compute the distance between ObsNumPerClust and ExpNumPerClust (i.e. between observed and expected number of individuals of each genotype in each population)
ObsNumPerClust[k,j] <- pClust[k, j] * numgeno[k]
DistNumPerClust[k,j] <- ObsNumPerClust[k,j] - ExpNumPerClust[k,j]
dist[k,j] <- pow(DistNumPerClust[k,j], 2)
}
}
# Priors
beta ~ dmnorm(mu, sigma.inv)
mu ~ dmnorm(m, V)
sigma.inv ~ dwish(R, K)
tau ~ dgamma(0.01, 0.01)
# parameters transformations
sig <- sqrt(1/ tau)
}
m = rep(1, npop)
V = diag(rep(0.01, npop))
R = diag(rep(0.1, npop))
K = npop
## Input variables
sim.dat.jags<-list("ysim","N","npop", "ngeno", "geno","m","V","R", "K", "samplesizegenoperpop","numgeno","ExpNumPerClust")
## Variables to monitor
bayes.mod.params <- c("beta","tau","sig","DistNumPerClust","dist")
## Starting values
init1 <- list(beta = c(0, 100, 1000), tau = 1)
bayes.mod.inits <- list(init1)
## Run model
bayes.mod.fit<-jags(data = sim.dat.jags, inits = bayes.mod.inits, parameters.to.save = bayes.mod.params, n.chains=1, n.iter=101000, n.burnin=1000, n.thin=200, model.file = bayes.mod)
results <- print(bayes.mod.fit)
results$mean$dist
results$mean$DistNumPerClust^2
It seems that you expect that the mean of a transformed set of values will give the same result as transforming the mean of the same set of values. But this is not the case - for example:
values <- c(1,2,3,6,8,20)
mean(values)^2
mean(values^2)
These are not the same thing.
The same thing is happening in your model: you calculate dist[k,j] as the square of DistNumPerClust[k,j] and then summarise it to a mean of dist, and you expect this to equal the square of the mean of DistNumPerClust[k,j]. Or, in a simpler example:
library('runjags')
X <- 1:100
Y <- rnorm(length(X), 2*X + 10, 1)
model <- "model {
for(i in 1 : N){
Y[i] ~ dnorm(true.y[i], precision);
true.y[i] <- (m * X[i]) + c
}
m ~ dunif(-1000,1000)
c ~ dunif(-1000,1000)
precision ~ dexp(1)
p2 <- precision^2
}"
data <- list(X=X, Y=Y, N=length(X))
results <- run.jags(model=model, monitor=c("m", "c", "precision", "p2"),
data=data, n.chains=2)
results
More specifically, these should not be expected to be the same:
summary(results)['p2','Mean']
summary(results)['precision','Mean']^2
If you want to calculate the same thing you can extract the full chain of values as an MCMC object and do your transformation on these:
p <- combine.mcmc(results,vars='precision')
p2 <- combine.mcmc(results,vars='p2')
mean(p^2)
mean(p2)
mean(p)
mean(sqrt(p2))
Now everything is equivalent.
Matt
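To apply the same idea back to the original model, something like the following should work. This is only a sketch: it assumes the R2jags fit object keeps its draws in bayes.mod.fit$BUGSoutput$sims.list, with the monitored DistNumPerClust available as an iterations x ngeno x npop array.
# sketch (see assumptions above): average the squared draws rather than
# squaring the averaged draws
d.draws <- bayes.mod.fit$BUGSoutput$sims.list$DistNumPerClust
apply(d.draws^2, c(2, 3), mean)   # should be close to results$mean$dist
results$mean$DistNumPerClust^2    # a different quantity, as explained above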

How can I improve the performance of my data cleaning code that currently uses ddply by using data.table?

I am trying to clean data using ddply but it is running very slowly on 1.3M rows.
Sample code:
#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
#Are values negative?
# Needed for subsetting in ddply rows with same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
I have written a function to remove rows where there is a consumption value in one row that is identical but negative to a consumption value in another row (for the same id).
#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one.
clean_negatives <- function(x3) {
  copies <- abs(sum(x3$Neg))
  sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
  x3 <- x3[0:copies, ]
  x3$Consumption <- sgn * x3$Consumption
  x3$Neg <- NULL
  x3
}
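For example, within a single (id, StartDate, EndDate, Consumption) group (made-up values below), two positive rows and one negative row of the same magnitude should leave one positive row behind:
grp <- data.frame(id = 1, Consumption = 5,
                  StartDate = as.Date("2011-01-31"),
                  EndDate = as.Date("2011-05-01"),
                  Neg = c(1, 1, -1))
clean_negatives(grp)
## one row remains: id = 1, Consumption = 5, with Neg dropped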
I then use ddply to apply that function to remove these erroneous rows in the data
ptm <- proc.time()
df_cleaned <- ddply(df, .(id,StartDate, EndDate, Consumption),
function(x){clean_negatives(x)})
proc.time() - ptm
I was hoping I could use data.table to make this go faster but I couldn't work out how to employ data.table to help.
With 1.3M rows, so far it is taking my desktop all day to compute and still hasn't finished.
Your question asks about a data.table implementation, so I've shown one here. Your function can also be drastically simplified: first get the sign by summing up Neg, then filter the table, then multiply Consumption by the sign (as shown below).
require(data.table)
# get the data.table in dt
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modifying for final output (line below commented out after Statquant's comment)
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
Benchmarking
The data with million rows:
#Create Sample Data Frame
num_rows <- 1e6
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
The data.table function:
FUN.DT <- function() {
  require(data.table)
  dt <- data.table(df, key=c("id", "StartDate", "EndDate", "Consumption"))
  dt <- dt[, sign := sign(sum(Neg)),
           by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
               by=c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil[, Consumption := (Consumption*sign)]
  dt.fil <- subset(dt.fil, select=-c(Neg, sign))
}
Your function with ddply
FUN.PLYR <- function() {
  require(plyr)
  clean_negatives <- function(x3) {
    copies <- abs(sum(x3$Neg))
    sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
    x3 <- x3[0:copies, ]
    x3$Consumption <- sgn * x3$Consumption
    x3$Neg <- NULL
    x3
  }
  df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
                      function(x) clean_negatives(x))
}
Benchmarking with rbenchmark (with 1 run only)
require(rbenchmark)
benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
1 FUN.DT() 1 6.137 1.000 5.926 0.211 0 0
2 FUN.PLYR() 1 242.268 39.477 152.855 82.881 0 0
My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine to your implementation because the functions are different).
Note: I loaded the packages within the functions in order to capture the complete time needed to obtain the result. For the same reason, I converted the data.frame to a keyed data.table inside the benchmarked function. This is therefore the minimum speed-up.
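For reference, here is a sketch of an alternative grouping idiom (assuming a reasonably recent data.table version, and reusing dt with its sign column from the first code block above): collect the row indices to keep with .I and subset once, instead of materialising .SD for every group.
# row indices to keep, per group (sketch only)
keep <- dt[, .I[seq_len(abs(sum(Neg)))],
           by = c("id", "StartDate", "EndDate", "Consumption")]$V1
dt.fil2 <- dt[keep]
dt.fil2[, Consumption := Consumption * sign]   # 'sign' column computed above
dt.fil2 <- subset(dt.fil2, select = -c(Neg, sign))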

Increase performance/speed

I need to take data from 1303 rasters (each raster has data for 1 month) and make a time series for each grid cell in the rasters. In the end I will join all the time series into one massive (zoo) file.
I have the code that can do it (I tried it on a small portion of the dataset and it worked), but it seems to be taking forever just to stack the rasters (more than 2 hours now and still counting), and that is not even the slowest part; that will be building the time series. So here is my code: if anyone knows a faster way to stack rasters and/or to create the time series (maybe without the double loop?), please help...
I don't know any other programming language, but is this just too much to ask of R?
files <- list.files(pattern=".asc")
pat <- "^.*pet_([0-9]{1,})_([0-9]{1,}).asc$"
ord_files <- as.Date(gsub(pat, sprintf("%s-%s-01", "\\1", "\\2"), files))
files<-files[order(ord_files)]
#using "raster" package to import data
s<- raster(files[1])
pet<-vector()
for (i in 2:length(files))
{
r<- raster(files[i])
s <- stack(s, r)
}
#creating a data vector
beginning = as.Date("1901-01-01")
full <- seq(beginning, by='1 month', length=length(files))
dat<-as.yearmon(full)
#building the time series
for (lat in 1:360)
for (long in 1:720)
{
pet<-as.vector(s[lat,long])
x <- xts(pet, dat)
write.zoo(x,file=paste("P:/WRSRL/Users1/ncgk/IBERIA/cru_pet/zoo/","lat",lat,"long",long,".csv", sep="") , sep=",")
}
The first bit could simply be:
s <- stack(files)
The reason why creating a stack is somewhat slow is that each file needs to be opened and checked to see if it has the same nrow, ncol etc. as the other files. If you are absolutely certain that is the case, you can use a shortcut like this (NOT generally recommended)
quickStack <- function(f) {
  r <- raster(f[1])
  ln <- extension(basename(f), '')
  s <- stack(r)
  s@layers <- sapply(1:length(f), function(x){ r@file@name = f[x]; r@layernames = ln[x]; r@data@haveminmax = FALSE; r })
  s@layernames <- ln
  s
}
quickStack(files)
You can probably also speed up the second part as in the below examples, depending on how much RAM you have.
Read row by row:
for (lat in 1:360) {
pet <- getValues(s, lat, 1)
for (long in 1:720) {
x <- xts(pet[long,], dat)
write.zoo(x,file=paste("P:/WRSRL/Users1/ncgk/IBERIA/cru_pet/zoo/","lat",lat,"long",long,".csv", sep="") , sep=",")
}
}
Or, more extreme, read all values in one go:
pet <- getValues(s)
for (lat in 1:360) {
for (long in 1:720) {
cell <- (lat-1) * 720 + long
x <- xts(pet[cell,], dat)
write.zoo(x,file=paste("P:/WRSRL/Users1/ncgk/IBERIA/cru_pet/zoo/","lat",lat,"long",long,".csv", sep="") , sep=",")
}
}
I will repost my comment here and give a better example:
The general idea: allocate the space for s before the 'raster'-loop is executed. If you concatenate s and r to a new object s inside the loop, R has to allocate new memory for s for each iteration. This is really slow, especially if s is large.
s <- c()
system.time(for(i in 1:1000){ s <- c(s, rnorm(100))})
# user system elapsed
# 0.584 0.244 0.885
s <- rep(NA, 1000*100)
system.time(for(i in seq(1,1000*100,100)){ s[i:(i+99)] <- rnorm(100) })
# user system elapsed
# 0.052 0.000 0.050
As you can see, pre-allocation is around 10 times faster.
Unfortunately, I am not familiar with raster and stack, so I cannot tell you how to apply this to your code.
Something like this should work (if you have enough memory):
#using "raster" package to import data
rlist <- lapply(files, raster)
s <- do.call(stack, rlist)
rlist <- NULL # to allow freeing of memory
It loads all raster objects into a big list and then calls stack once.
Here's an example of the speed gains: 1.25 sec vs 8 secs for 60 files - but your old code is quadratic in time so the gains are much higher for more files...
library(raster)
f <- system.file("external/test.grd", package="raster")
files <- rep(f, 60)
system.time({
rlist <- lapply(files, raster)
s <- do.call(stack, rlist)
rlist <- NULL # to allow freeing of memory
}) # 1.25 secs
system.time({
s<- raster(files[1])
for (i in 2:length(files)) {
r<- raster(files[i])
s <- stack(s, r)
}
}) # 8 secs
I tried another way of dealing with numerous files.
First I combined the time-series rasters into one file in NetCDF format, using writeRaster(x, format="CDF", ...),
and then just read one file for each year, this time using brick(netcdffile, varname=''); the reading step saves a lot of time.
However, I need to save each cell's values for all the years in a predefined format, for which I use write.fwf(x=v, ..., append=TRUE),
but it takes a long time for nearly 500,000 points.
Does anyone have similar experience and advice on how to speed up this process?
Here is my code for extracting all the values for each point:
weather4Point <- function(startyear,endyear)
{
for (year in startyear:endyear)
{
#get the combined netCDF file
tminfile <- paste("tmin","_",year,".nc",sep='')
b_tmin <- brick(tminfile,varname='tmin')
pptfile <- paste("ppt","_",year,".nc",sep='')
b_ppt <- brick(pptfile,varname='ppt')
tmaxfile <- paste("tmax","_",year,".nc",sep='')
b_tmax <- brick(tmaxfile,varname='tmax')
#Get the first year here!!!
print(paste("processing year :",year,sep=''))
for(l in 1:length(pl))
{
v <- NULL
#generate file with the name convention with t_n(latitude)w(longitude).txt, 5 digits after point should be work
filename <- paste("c:/PRISM/MD/N",round(coordinates(pl[l,])[2],5),"W",abs(round(coordinates(pl[l,])[1],5)),".wth",sep='')
print(paste("processing file :",filename,sep=''))
tmin <- as.numeric(round(extract(b_tmin,coordinates(pl[l,])),digits=1))
tmax <- as.numeric(round(extract(b_tmax,coordinates(pl[l,])),digits=1))
ppt <- as.numeric(round(extract(b_ppt,coordinates(pl[l,])),digits=2))
v <- cbind(tmax,tmin,ppt)
tablename <- c("tmin","tmax","ppt")
v <- data.frame(v)
colnames(v) <- tablename
v["default"] <- 0
v["year"] <- year
date <- seq(as.Date(paste(year,"/1/1",sep='')),as.Date(paste(year,"/12/31",sep='')),"days")
month <- as.numeric(substr(date,6,7))
day <- as.numeric(substr(date,9,10))
v["month"] <- month
v["day"] <- day
v <- v[c("year","month","day","default","tmin","tmax","ppt")]
#write into a file with format
write.fwf(x=v,filename,append=TRUE,na="NA",rownames=FALSE,colnames=FALSE,width=c(6,3,3,5,5,5,6))
}
}
}

Comparing Record Results and Double For Loop

I have a double loop that I not only dislike but that would take about 14 days to run on my computer, since it goes over 3200 records and 1090 variables at about 0.12 seconds per iteration.
Here is a smaller reproducible example. It simply counts how many values match in the same column between two records, not including NAs, and then attaches the results to the original data frame.
y <- data.frame(c(1,2,1,NA,NA),c(3,3,3,4,NA),c(5,4,5,7,7),c(7,8,7,9,10))
resultdf <- NULL
for(i in 1:nrow(y))
{
results <- NULL
for(j in 1:nrow(y))
{
results <- c(results,sum((y[i,]==y[j,]),na.rm=TRUE))
}
resultdf <- cbind(resultdf,results)
}
y <- cbind(y,resultdf)
There are repeated calculations that could possibly be avoided, which would leave about 7 days.
If I understand correctly, some of the apply functions are implemented in C and might be faster, but I haven't been able to get any of them to work. I'm also curious whether there is a package that would run faster. Can anyone help speed up the calculation?
Thank you!
I have created data to your specifications, and using @BenBolker's suggestion about using a matrix:
> y <- matrix(sample(c(1:9, NA), 3200 * 1090, replace = TRUE),
+ nrow = 3200, ncol = 1090)
and compared the computation times for three different implementations:
f1 was suggested by @Andrei:
> f1 <- function(y)apply(y, 1, function(r1)
+ apply(y, 1, function(r2)sum(r1==r2, na.rm=TRUE)))
> system.time(r1 <- f1(y))
user system elapsed
523.51 0.77 528.73
f2 was suggested by @VincentZoonekynd:
> f2 <- function(y) {
+ f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
+ d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
+ return(d)
+ }
> system.time(r2 <- f2(y))
user system elapsed
658.94 1.96 710.67
f3 is a double loop over the upper triangle as suggested by @BenBolker. It is also a bit more efficient than your original code in that it pre-allocates the output matrix:
> f3 <- function(y) {
+ result <- matrix(NA, nrow(y), nrow(y))
+ for (i in 1:nrow(y)) {
+ row1 <- y[i, ]
+ for (j in i:nrow(y)) {
+ row2 <- y[j, ]
+ num.matches <- sum(row1 == row2, na.rm = TRUE)
+ result[i, j] <- num.matches
+ result[j, i] <- num.matches
+ }
+ }
+ return(result)
+ }
> system.time(r3 <- f3(y))
user system elapsed
167.66 0.08 168.72
So the double loop is the fastest of all three, although not as elegant and compact as the other two answers.
Here is another solution, using outer.
f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
Indeed, you can use the apply function. Given the earlier hint that a matrix works faster, I would try:
ym <- as.matrix(y)
resultdf <- apply(ym, 1, function(r1) apply(ym, 1, function(r2) sum(r1==r2, na.rm=TRUE)))
You can get rid of the inner loop (using the y and f3 from @flodel's answer):
ty <- t(y)
ix <- rep(1:nrow(y),each = ncol(y))
f4 <- function(y){
result <- matrix(0L, nrow(y), nrow(y))
for(r in 1:nrow(y))
result[r,] <- rowsum(as.numeric(ty == y[r,]), ix, na.rm = T)
result
}
> system.time(out <- f4(y))
user system elapsed
52.616 21.061 74.000
> system.time(out <- f3(y))
user system elapsed
244.751 0.136 244.954
>
It actually does the extra work of computing every pair twice (the full matrix rather than just one triangle), but it is still 5 times faster. You can make it yet another 4 times faster by using the inner workings of rowsum. See this question for an example.

Make this loop faster in R

How can I speed up the following (noob) code:
#"mymatrix" is the matrix of word counts (docs X terms)
#"tfidfmatrix" is the transformed matrix
tfidfmatrix = Matrix(mymatrix, nrow=num_of_docs, ncol=num_of_words, sparse=T)
#Apply a transformation on each row of the matrix
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] = s/sqrt(sum(s^2))
}
return (tfidfmatrix)
The problem is that the matrices I am working with are fairly large (~40k x 100k), and this code is very slow.
The reason I am not using apply (instead of a for loop and sapply) is that apply would give me the transpose of the matrix I want: I want num_of_docs x num_of_words, but apply returns the transpose, so I would then have to spend more time transposing and re-allocating the result.
Any thoughts on making this faster?
Thanks much.
Edit: I have found that the suggestions below greatly speed up my code (besides making me feel stupid). Any suggestions on where I can learn to write "optimized" R code?
Edit 2: OK, something is not right. Once I do s.vec[!is.finite(s.vec)] <- 0, every element of s.vec is set to 0. Just to reiterate: my original matrix is a sparse matrix containing integers, and this is due to some quirk of the Matrix package I am using. When I do s.vec[which(s.vec==-Inf)] <- 0, things work as expected. Thoughts?
As per my comment,
#Slightly larger example data
mymatrix <- matrix(runif(10000),nrow=10)
mymatrix[sample(10000,100)] <- 0
tfmat <- matrix(nrow=10, ncol=1000)
ndocs <- 1
justin <- function(){
s.vec <- ifelse(mymatrix==0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}
joran <- function(){
s.vec <- (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
s.vec[!is.finite(s.vec)] <- 0
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}
require(rbenchmark)
benchmark(justin(),joran(),replications = 1000)
test replications elapsed relative user.self sys.self user.child sys.child
2 joran() 1000 0.940 1.00000 0.842 0.105 0 0
1 justin() 1000 2.786 2.96383 2.617 0.187 0 0
So it's around 3x faster.
Not sure what ndocs is, but ifelse is already vectorized, so you should be able to use the ifelse statement without walking through the matrix row by row and sapply-ing along each row. The same can be said for the final calculation.
However, you haven't given a complete example to replicate...
mymatrix <- matrix(runif(100),nrow=10)
tfmat <- matrix(nrow=10, ncol=10)
ndocs <- 1
s.vec <- ifelse(mymatrix==0, 0, 1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] <- s
}
all.equal(s.vec, tfmat)
So the only piece missing is the rowSums in your final calculation:
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] = s/sqrt(sum(s^2))
}
all.equal(tfmat, tfmat.vec)
