I need to take data from 1303 rasters (each raster has data for 1 month) and make a time series for each grid cell in the rasters. In the end I will join all the time series into one massive (zoo) file.
I have the code that can do it (I tried on a small portion of the dataset and it worked) but it seems to be taking for ever just to stack the raster (more than 2 hours now and still counting) and this is not the slower part, that will be doing the time series. So here is my code, if anyone knows a faster way to stack rasters and /or to create the time series (maybe without the double loop?) please help...
I don't know any other programming language but would this be just too much to ask from R?
files <- list.files(pattern=".asc")
pat <- "^.*pet_([0-9]{1,})_([0-9]{1,}).asc$"
ord_files <- as.Date(gsub(pat, sprintf("%s-%s-01", "\\1", "\\2"), files))
files<-files[order(ord_files)]
#using "raster" package to import data
s<- raster(files[1])
pet<-vector()
for (i in 2:length(files))
{
r<- raster(files[i])
s <- stack(s, r)
}
#creating a data vector
beginning = as.Date("1901-01-01")
full <- seq(beginning, by='1 month', length=length(files))
dat<-as.yearmon(full)
#building the time series
for (lat in 1:360)
for (long in 1:720)
{
pet<-as.vector(s[lat,long])
x <- xts(pet, dat)
write.zoo(x,file=paste("P:/WRSRL/Users1/ncgk/IBERIA/cru_pet/zoo/","lat",lat,"long",long,".csv", sep="") , sep=",")
}
The first bit could simply be:
s <- stack(files)
The reason why creating a stack is somewhat slow is that each file needs to be opened and checked to see if it has the same nrow, ncol etc. as the other files. If you are absolutely certain that is the case, you can use a shortcut like this (NOT generally recommended)
quickStack <- function(f) {
r <- raster(f[1])
ln <- extension(basename(f), '')
s <- stack(r)
s#layers <- sapply(1:length(f), function(x){ r#file#name = f[x]; r#layernames=ln[x]; r#data#haveminmax=FALSE ; r })
s#layernames <- ln
s
}
quickStack(files)
You can probably also speed up the second part as in the below examples, depending on how much RAM you have.
Read row by row:
for (lat in 1:360) {
pet <- getValues(s, lat, 1)
for (long in 1:720) {
x <- xts(pet[long,], dat)
write.zoo(x,file=paste("P:/WRSRL/Users1/ncgk/IBERIA/cru_pet/zoo/","lat",lat,"long",long,".csv", sep="") , sep=",")
}
}
more extreme, read all values in one go:
pet <- getValues(s)
for (lat in 1:360) {
for (long in 1:720) {
cell <- (lat-1) * 720 + long
x <- xts(pet[cell,], dat)
write.zoo(x,file=paste("P:/WRSRL/Users1/ncgk/IBERIA/cru_pet/zoo/","lat",lat,"long",long,".csv", sep="") , sep=",")
}
}
I will repost my comment here and give a better example:
The general idea: allocate the space for s before the 'raster'-loop is executed. If you concatenate s and r to a new object s inside the loop, R has to allocate new memory for s for each iteration. This is really slow, especially if s is large.
s <- c()
system.time(for(i in 1:1000){ s <- c(s, rnorm(100))})
# user system elapsed
# 0.584 0.244 0.885
s <- rep(NA, 1000*100)
system.time(for(i in seq(1,1000*100,100)){ s[i:(i+99)] <- rnorm(100) })
# user system elapsed
# 0.052 0.000 0.050
as you can see, pre-allocation is around 10 times faster.
Unfortunately I am not familiar with raster and stack so I can not tell you how to apply this to your code.
Something like this should work (if you have enough memory):
#using "raster" package to import data
rlist <- lapply(files, raster)
s <- do.call(stack, rlist)
rlist <- NULL # to allow freeing of memory
It loads all raster objects into a big list and then calls stack once.
Here's an example of the speed gains: 1.25 sec vs 8 secs for 60 files - but your old code is quadratic in time so the gains are much higher for more files...
library(raster)
f <- system.file("external/test.grd", package="raster")
files <- rep(f, 60)
system.time({
rlist <- lapply(files, raster)
s <- do.call(stack, rlist)
rlist <- NULL # to allow freeing of memory
}) # 1.25 secs
system.time({
s<- raster(files[1])
for (i in 2:length(files)) {
r<- raster(files[i])
s <- stack(s, r)
}
}) # 8 secs
I tried another way to dealing with numerous files.
First I combined the time series raster into one file in the NetCDF format,
Using write.Raster(x,format="CDF",..)
and then just read one file for each year, this time I used brick(netcdffile,varname='') it the reading saves a lot.
However, I need to save each cell's value for all the years according to some predefined format,in which I use write.fwf(x=v,...,append=TRUE)
but it takes a long time for nearly 500,000 points.
Is anyone has the same experiences and help on how to speed up this process?
Here is my code for extracting all the value for each point:
weather4Point <- function(startyear,endyear)
{
for (year in startyear:endyear)
{
#get the combined netCDF file
tminfile <- paste("tmin","_",year,".nc",sep='')
b_tmin <- brick(tminfile,varname='tmin')
pptfile <- paste("ppt","_",year,".nc",sep='')
b_ppt <- brick(pptfile,varname='ppt')
tmaxfile <- paste("tmax","_",year,".nc",sep='')
b_tmax <- brick(tmaxfile,varname='tmax')
#Get the first year here!!!
print(paste("processing year :",year,sep=''))
for(l in 1:length(pl))
{
v <- NULL
#generate file with the name convention with t_n(latitude)w(longitude).txt, 5 digits after point should be work
filename <- paste("c:/PRISM/MD/N",round(coordinates(pl[l,])[2],5),"W",abs(round(coordinates(pl[l,])[1],5)),".wth",sep='')
print(paste("processing file :",filename,sep=''))
tmin <- as.numeric(round(extract(b_tmin,coordinates(pl[l,])),digits=1))
tmax <- as.numeric(round(extract(b_tmax,coordinates(pl[l,])),digits=1))
ppt <- as.numeric(round(extract(b_ppt,coordinates(pl[l,])),digits=2))
v <- cbind(tmax,tmin,ppt)
tablename <- c("tmin","tmax","ppt")
v <- data.frame(v)
colnames(v) <- tablename
v["default"] <- 0
v["year"] <- year
date <- seq(as.Date(paste(year,"/1/1",sep='')),as.Date(paste(year,"/12/31",sep='')),"days")
month <- as.numeric(substr(date,6,7))
day <- as.numeric(substr(date,9,10))
v["month"] <- month
v["day"] <- day
v <- v[c("year","month","day","default","tmin","tmax","ppt")]
#write into a file with format
write.fwf(x=v,filename,append=TRUE,na="NA",rownames=FALSE,colnames=FALSE,width=c(6,3,3,5,5,5,6))
}
}
}
Related
I have a table of 5360*51200 size. Here, 5360 are the number of instances and 51200 are the number of features. I need to reduce the dimension of features. I was trying it by the help of stacked autoencoder in H2o, but it did not allow me to train to raise an error as:
Model is a large and large number of parameters
Here is the code:
library(h2o)
h2o.init(nthreads = -1)
check.deeplearning_stacked_autoencoder <- function() {
# this function builds a vector of autoencoder models, one per layer
#library(h2o)
#h2o.init()
get_stacked_ae_array <- function(training_data, layers, args) {
vector <- c()
index = 0
for (i in 1:length(layers)) {
index = index + 1
ae_model <- do.call(h2o.deeplearning,
modifyList(
list(
x = names(training_data),
training_frame = training_data,
autoencoder = T,
hidden = layers[i]
),
args
))
training_data = h2o.deepfeatures(ae_model, training_data, layer =
3)
names(training_data) <-
gsub("DF", paste0("L", index, sep = ""), names(training_data))
vector <- c(vector, ae_model)
}
cat(
length(vector))
}
# this function returns final encoded contents
apply_stacked_ae_array <- function(data, ae) {
index = 0
for (i in 1:length(ae)) {
index = index + 1
data = h2o.deepfeatures(ae[[i]], data, layer = 3)
names(data) <-
gsub("DF", paste0("L", index, sep = ""), names(data))
}
data
}
TRAIN <-
"E:/Chiranjibi file/Geometric features/Lu/Train/d_features.csv"
TEST <-
"E:/Chiranjibi file/Geometric features/Lu/Test/d_features.csv"
response <- 51201
# set to T for RUnit
# set to F for stand-alone demo
if (T) {
train_hex <- h2o.importFile((TRAIN))
test_hex <- h2o.importFile((TEST))
} else
{
library(h2o)
h2o.init()
homedir <-
paste0(path.expand("~"), "/h2o-dev/") #modify if needed
train_hex <-
h2o.importFile(path = paste0(homedir, TRAIN),
header = F,
sep = ',')
test_hex <-
h2o.importFile(path = paste0(homedir, TEST),
header = F,
sep = ',')
}
train <- train_hex[, -response]
test <- test_hex [, -response]
train_hex[, response] <- as.factor(train_hex[, response])
test_hex [, response] <- as.factor(test_hex [, response])
## Build reference model on full dataset and evaluate it on the test set
model_ref <-
h2o.deeplearning(
training_frame = train_hex,
x = 1:(ncol(train_hex) - 1),
y = response,
hidden = c(67),
epochs = 50
)
p_ref <- h2o.performance(model_ref, test_hex)
h2o.logloss(p_ref)
## Now build a stacked autoencoder model with three stacked layer AE models
## First AE model will compress the 717 non-const predictors into 200
## Second AE model will compress 200 into 100
## Third AE model will compress 100 into 50
layers <- c(50000,20000,10000,5000,2000, 1000, 500)
args <- list(activation = "Tanh",
epochs = 1,
l1 = 1e-5)
ae <- get_stacked_ae_array(train, layers, args)
## Now compress the training/testing data with this 3-stage set of AE models
train_compressed <- apply_stacked_ae_array(train, ae)
test_compressed <- apply_stacked_ae_array(test, ae)
## Build a simple model using these new features (compressed training data) and evaluate it on the compressed test set.
train_w_resp <- h2o.cbind(train_compressed, train_hex[, response])
test_w_resp <- h2o.cbind(test_compressed, test_hex[, response])
model_on_compressed_data <-
h2o.deeplearning(
training_frame = train_w_resp,
x = 1:(ncol(train_w_resp) - 1),
y = ncol(train_w_resp),
hidden = c(67),
epochs = 1
)
p <- h2o.performance(model_on_compressed_data, test_w_resp)
h2o.logloss(p)
}
#h2o.describe(train)
#doTest("Deep Learning Stacked Autoencoder", check.deeplearning_stacked_autoencoder)
As Tom says, your autoencoder first layer is too big.
51,200 is a lot of features. How much correlation is there between them? The more correlation you have, the smaller the first layer of your autoencoder can happily be.
Try h2o.prcomp() and seeing how many dimensions cover 99% of the variance, is often a good guide to how big your first layer can/should be.
Or, if you prefer a more experimental approach:
Start with, e.g. 200 neurons in one layer.
Look at the MSE it gets to, after enough epochs to stop improving.
Double the number of neurons in that layer.
See if the MSE gets any better. If not, stop there.
If it did, double again, and repeat.
You could then try moving to multiple layers. But not much point using a bigger first layer than the best you can get from trying a single layer.
Since your dataset has 51,200 features, and your layers array has 50,000 as the first value, 51200 * 50000 == 2.56e9 weights in that first set of network connections.
It’s too many, try smaller numbers.
I am trying to clean data using ddply but it is running very slowly on 1.3M rows.
Sample code:
#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
#Are values negative?
# Needed for subsetting in ddply rows with same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
I have written a function to remove rows where there is a consumption value in one row that is identical but negative to a consumption value in another row (for the same id).
#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one.
clean_negatives <- function(x3){
copies <- abs(sum(x3$Neg))
sgn <- ifelse(sum(x3$Neg) <0, -1, 1)
x3 <- x3[0:copies,]
x3$Consumption <- sgn*x3$Consumption
x3$Neg <- NULL
x3}
I then use ddply to apply that function to remove these erroneous rows in the data
ptm <- proc.time()
df_cleaned <- ddply(df, .(id,StartDate, EndDate, Consumption),
function(x){clean_negatives(x)})
proc.time() - ptm
I was hoping I could use data.table to make this go faster but I couldn't work out how to employ data.table to help.
With 1.3M rows, so far it is taking my desktop all day to compute and still hasn't finished.
Your question asks about data.table implementation. So, I've shown it here. Your function could be drastically simplified as well. You can first get the sign by summing up Neg and then filter the table and then multiply Consumption by sign (as shown below).
require(data.table)
# get the data.table in dt
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modifying for final output (line commented after Statquant's comment
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
Benchmarking
The data with million rows:
#Create Sample Data Frame
num_rows <- 1e6
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
The data.table function:
FUN.DT <- function() {
require(data.table)
dt <- data.table(df, key=c("id", "StartDate", "EndDate", "Consumption"))
dt <- dt[, sign := sign(sum(Neg)),
by = c("id", "StartDate", "EndDate", "Consumption")]
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
by=c("id", "StartDate", "EndDate", "Consumption")]
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
}
Your function with ddply
FUN.PLYR <- function() {
require(plyr)
clean_negatives <- function(x3) {
copies <- abs(sum(x3$Neg))
sgn <- ifelse(sum(x3$Neg) <0, -1, 1)
x3 <- x3[0:copies,]
x3$Consumption <- sgn*x3$Consumption
x3$Neg <- NULL
x3
}
df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
function(x) clean_negatives(x))
}
Benchmarking with rbenchmark (with 1 run only)
require(rbenchmark)
benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
1 FUN.DT() 1 6.137 1.000 5.926 0.211 0 0
2 FUN.PLYR() 1 242.268 39.477 152.855 82.881 0 0
My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine to your implementation because the functions are different).
Note: I loaded the packages within the function in order to obtain the complete time to obtain the result. Also, for the same reason I converted the data.frame to data.table with keys inside the benchmarking function. This is therefore the minimum speed-up.
Let's say a and b are two data frames. The goal is to write a function
f(a,b) that produces a merged data frame, in the same way as merge
merge(a,b,all=TRUE) would do, that is filling missing variables in a or b with NAs. (The problem is merge() appears to be very slow.)
This can be done as follows (pseudo-code):
for each variable `var` found in either `a` or `b`, do:
unlist(list(a.srcvar, b.srcvar), recursive=FALSE, use.names=FALSE)
where:
x.srcvar is x$var if x$var exists, or else
rep(NA, nrow(x)) if y$var !is.factor, or else
as.factor(rep(NA, nrow(x)))
and then wrap everything in a data frame.
Here's a "naive" implementation:
merge.datasets1 <- function(a, b) {
a.fill <- rep(NA, nrow(a))
b.fill <- rep(NA, nrow(b))
a.fill.factor <- as.factor(a.fill)
b.fill.factor <- as.factor(b.fill)
out <- list()
for (v in union(names(a), names(b))) {
if (!v %in% names(a)) {
b.srcvar <- b[[v]]
if (is.factor(b.srcvar))
a.srcvar <- a.fill.factor
else
a.srcvar <- a.fill
} else {
a.srcvar <- a[[v]]
if (v %in% names(b))
b.srcvar <- b[[v]]
else if (is.factor(a.srcvar))
b.srcvar <- b.fill.factor
else
b.srcvar <- b.fill
}
out[[v]] <- unlist(list(a.srcvar, b.srcvar),
recursive=FALSE, use.names=FALSE)
}
data.frame(out)
}
Here's a different implementation that uses "vectorized" functions:
merge.datasets2 <- function(a, b) {
srcvar <- within(list(var=union(names(a), names(b))), {
a.exists <- var %in% names(a)
b.exists <- var %in% names(b)
a.isfactor <- unlist(lapply(var, function(v) is.factor(a[[v]])))
b.isfactor <- unlist(lapply(var, function(v) is.factor(b[[v]])))
a <- ifelse(a.exists, var, ifelse(b.isfactor, 'fill.factor', 'fill'))
b <- ifelse(b.exists, var, ifelse(a.isfactor, 'fill.factor', 'fill'))
})
a <- within(a, {
fill <- NA
fill.factor <- factor(fill)
})
b <- within(b, {
fill <- NA
fill.factor <- factor(fill)
})
out <- mapply(function(x,y) unlist(list(a[[x]], b[[y]]),
recursive=FALSE, use.names=FALSE),
srcvar$a, srcvar$b, SIMPLIFY=FALSE, USE.NAMES=FALSE)
out <- data.frame(out)
names(out) <- srcvar$var
out
}
Now we can test:
sample.datasets <- lapply(1:50, function(i) iris[,sample(names(iris), 4)])
system.time(invisible(Reduce(merge.datasets1, sample.datasets)))
>> user system elapsed
>> 0.192 0.000 0.190
system.time(invisible(Reduce(merge.datasets2, sample.datasets)))
>> user system elapsed
>> 2.292 0.000 2.293
So, the naive version is orders of magnitude faster than the other. How can
this be? I always thought that for loops are slow, and that one should
rather use lapply and friends and steer clear of loops in R. I would welcome any idea on how to improve my function in terms of speed.
In fact, you are not doing trying to replicate merge(a,b, all = TRUE) at all, as you are not trying to merge on any of the columns. Instead you are simply stacking the data, filling with NA where a column does not exist.
# note that this is not what you want/
dim(merge(sample.datasets[[1]], sample.datasets[[2]], all = T))
[1] 314 5
The reason merge(a,b, all = TRUE) will be slow is that it defaults to merging by the intersection of the names. If you convert to data.tables then the merge.data.table method is lightning fast, but with your test data, it would be creating an exponentially increasing dataset on each sucessive merge (not 7500 by 5 as you want your results to be)
An easy solution is to use rbind.fill from the plyr package.
library(plyr)
system.time({.x <- Reduce(rbind.fill, sample.datasets)})
## user system elapsed
## 0.16 0.00 0.15
# which is almost identical to
system.time(.old <- Reduce(merge.datasets1, sample.datasets))
## user system elapsed
## 0.14 0.00 0.14
EDIT 2-11-2012
On further consideration it is really useful to note that you can pass a list of data.frames to rbind.fill so
system.time(super_fast <- rbind.fill(sample.datasets))
## user system elapsed
## 0.02 0.00 0.02
identical(super_fast, .old)
[1] TRUE
The majority of time spent in the overheads for Reduce, which rbind.fill does not require.
I have a double loop that I not only don't like, but would take 14 days to run on my computer since it is going over 3200 records and 1090 variables at about .12 per iteration.
A smaller reproducible bit. It simply checks how many numbers are in the same column between two records, not including NA's. Then it attaches the results to the original data frame.
y <- data.frame(c(1,2,1,NA,NA),c(3,3,3,4,NA),c(5,4,5,7,7),c(7,8,7,9,10))
resultdf <- NULL
for(i in 1:nrow(y))
{
results <- NULL
for(j in 1:nrow(y))
{
results <- c(results,sum((y[i,]==y[j,]),na.rm=TRUE))
}
resultdf <- cbind(resultdf,results)
}
y <- cbind(y,resultdf)
I have repeat calculations that could possibly be avoided leaving about 7 days.
If I understand correctly, a few apply functions are in C that might be faster. I haven't been able to get any to work though. I'm also curious if there is a package that would run faster. Can anyone help speed up the calculation?
Thank you!
I have created data to your specifications, and using #BenBolker's suggestion about using a matrix:
> y <- matrix(sample(c(1:9, NA), 3200 * 1090, replace = TRUE),
+ nrow = 3200, ncol = 1090)
and compared the computation times for three different implementations:
f1 was suggested by #Andrei:
> f1 <- function(y)apply(y, 1, function(r1)
+ apply(y, 1, function(r2)sum(r1==r2, na.rm=TRUE)))
> system.time(r1 <- f1(y))
user system elapsed
523.51 0.77 528.73
f2 was suggested by #VincentZoonekynd:
> f2 <- function(y) {
+ f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
+ d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
+ return(d)
+ }
> system.time(r2 <- f2(y))
user system elapsed
658.94 1.96 710.67
f3 is a double loop over the upper triangle as suggested by #BenBolker. It is also a bit more efficient than your OP in that it pre-allocates the output matrix:
> f3 <- function(y) {
+ result <- matrix(NA, nrow(y), nrow(y))
+ for (i in 1:nrow(y)) {
+ row1 <- y[i, ]
+ for (j in i:nrow(y)) {
+ row2 <- y[j, ]
+ num.matches <- sum(row1 == row2, na.rm = TRUE)
+ result[i, j] <- num.matches
+ result[j, i] <- num.matches
+ }
+ }
+ return(result)
+ }
> system.time(r3 <- f3(y))
user system elapsed
167.66 0.08 168.72
So the double loop is the fastest of all three, although not as elegant and compact as the other two answers.
Here is another solution, using outer.
f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
Indeed, you can use apply function. Given the earlier hint that a matrix works faster, I would try:
ym <- as.matrix(y)
resultdf <- apply(ym, 1, function(r1) apply(ym, 1, function(r2) sum(r1==r2, na.rm=TRUE)))
You can get rid of the inner loop (using the y and f3 from #flodel's answer):
ty <- t(y)
ix <- rep(1:nrow(y),each = ncol(y))
f4 <- function(y){
result <- matrix(0L, nrow(y), nrow(y))
for(r in 1:nrow(y))
result[r,] <- rowsum(as.numeric(ty == y[r,]), ix, na.rm = T)
result
}
> system.time(out <- f4(y))
user system elapsed
52.616 21.061 74.000
> system.time(out <- f3(y))
user system elapsed
244.751 0.136 244.954
>
It actually does an extra job of computing twice the same thing, but is still 5 times faster. You can make it yet another 4 times faster by using the inner workings of rowsum. See this question for an example.
How can I speed up the following (noob) code:
#"mymatrix" is the matrix of word counts (docs X terms)
#"tfidfmatrix" is the transformed matrix
tfidfmatrix = Matrix(mymatrix, nrow=num_of_docs, ncol=num_of_words, sparse=T)
#Apply a transformation on each row of the matrix
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] = s/sqrt(sum(s^2))
}
return (tfidfmatrix)
Problem is that the matrices I am working on are fairly large (~40kX100k), and this code is very slow.
The reason I am not using "apply" (instead of using a for loop and sapply) is that apply is going to give me the transpose of the matrix I want - I want num_of_docs X num_of_words, but apply will give me the transpose. I will then have to spend more time computing the transpose and re-allocating it.
Any thoughts on making this faster?
Thanks much.
Edit : I have found that the suggestions below greatly speed up my code (besides making me feel stupid). Any suggestions on where I can learn to write "optimized" R code from?
Edit 2: OK, so something is not right. Once I do s.vec[!is.finite(s.vec)] <- 0 every element of s.vec is being set to 0. Just to re-iterate my original matrix is a sparse matrix containing integers. This is due to some quirk of the Matrix package I am using. When I do s.vec[which(s.vec==-Inf)] <- 0 things work as expected. Thoughts?
As per my comment,
#Slightly larger example data
mymatrix <- matrix(runif(10000),nrow=10)
mymatrix[sample(10000,100)] <- 0
tfmat <- matrix(nrow=10, ncol=1000)
ndocs <- 1
justin <- function(){
s.vec <- ifelse(mymatrix==0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}
joran <- function(){
s.vec <- (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
s.vec[!is.finite(s.vec)] <- 0
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}
require(rbenchmark)
benchmark(justin(),joran(),replications = 1000)
test replications elapsed relative user.self sys.self user.child sys.child
2 joran() 1000 0.940 1.00000 0.842 0.105 0 0
1 justin() 1000 2.786 2.96383 2.617 0.187 0 0
So it's around 3x faster or so.
not sure what ndocs is, but ifelse is already vectorized, so you should be able to use the ifelse statement without walking through the matrix row by row and sapply along the row. The same can be said for the final calc.
However, you haven't given a complete example to replicate...
mymatrix <- matrix(runif(100),nrow=10)
tfmat <- matrix(nrow=10, ncol=10)
ndocs <- 1
s.vec <- ifelse(mymatrix==0, 0, 1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] <- s
}
all.equal(s.vec, tfmat)
so the only piece missing is the rowSums in your final calc.
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] = s/sqrt(sum(s^2))
}
all.equal(tfmat, tfmat.vec)