I have an operation I'd like to run for each row of a data frame, changing one column. I'm an apply/ddply/sqldf man, but I'll use loops when they make sense, and I think this is one of those times. This case is tricky because the column to change depends on information that changes by row: depending on information in one cell, I should make a change to only one of ten other cells in that row. With 75 columns and 20000 rows, the operation takes 10 minutes, when every other operation in my script takes 0-5 seconds, ten seconds max. I've stripped my problem down to the very simple test case below.
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time(
for (i in 1:nrow(t.df)) {
t.df[i,(t.df[i,1]%%10 + 1)] <- 99
}
)
This takes 70 seconds with ten columns, and 360 when ncol=50. That's crazy. Are loops the wrong approach? Is there a better, more efficient way to do this?
I already tried initializing the nested term (t.df[i,1]%%10 + 1) as a list outside the for loop. It saves about 30 seconds (out of 10 minutes) but makes the example code above more complicated. So it helps, but it's not the solution.
My current best idea came while preparing this test case. For me, only about 10 of the 75 columns are relevant; the rest are irrelevant. Since the run times depend so much on the number of columns, I can just run the above operation on a data frame that excludes the irrelevant columns, which gets me down to just over a minute; a sketch of that idea is below. But is "for loop with nested indices" even the best way to think about my problem?
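A rough sketch of that column-subsetting workaround (big.df and the column positions in relevant are hypothetical stand-ins for my real 75-column data):
relevant <- 1:10                      # assumed positions of the relevant columns
sub.df <- big.df[, relevant]          # work on just the relevant columns
for (i in 1:nrow(sub.df)) {
  sub.df[i, (sub.df[i,1] %% 10 + 1)] <- 99
}
big.df[, relevant] <- sub.df          # write the modified columns back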
It seems the real bottleneck is having the data in the form of a data.frame. I assume that in your real problem you have a compelling reason to use a data.frame, but is there any way to convert your data so that it can stay in a matrix?
By the way, great question and a very good example.
Here's an illustration of how much faster loops are on matrices than on data.frames:
> n <- 20000
> t.df <- (matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
0.084 0.001 0.084
>
> n <- 20000
> t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
31.543 57.664 89.224
Using row and col seems less complicated to me:
t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
I think Tommy's is still faster, but using row and col might be easier to understand.
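To see how row and col build that logical index, here is a small illustration of my own on a 4 x 5 data.frame (not part of the timings above):
d <- data.frame(matrix(1:20, nrow = 4))
row(d)   # matrix of row numbers, same shape as d
col(d)   # matrix of column numbers
col(d) == (row(d) %% 10) + 1   # TRUE exactly where the replacement lands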
@JD Long is right that if t.df can be represented as a matrix, things will be much faster.
...And then you can actually vectorize the whole thing so that it is lightning fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
m <- as.matrix(t.df)
m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
t2.df <- as.data.frame(m)
}) # 0.00 secs
Unfortunately, the matrix indexing I use here does not seem to work on a data.frame.
EDIT
A variant where I create a logical matrix to index works on data.frame, and is almost as fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
t2.df <- t.df
# Create a logical matrix with TRUE wherever the replacement should happen
m <- array(FALSE, dim=dim(t2.df))
m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE
t2.df[m] <- 99
}) # 0.01 secs
UPDATE: Added the matrix version of Tommy's solution to the benchmarking exercise.
You can vectorize it. Here is my solution and a comparison with the loop:
n <- 20000
t.df <- (matrix(1:5000, ncol=10, nrow=n))
f_ramnath <- function(x){
idx <- x[,1] %% 10 + 1
x[cbind(1:NROW(x), idx)] <- 99
return(x)
}
f_long <- function(t.df){
for (i in 1:nrow(t.df)) {
t.df[i,(t.df[i,1]%%10 + 1)] <- 99
}
return(t.df)
}
f_joran <- function(t.df){
t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
return(t.df)
}
f_tommy <- function(t.df){
t2.df <- t.df
# Create a logical matrix with TRUE wherever the replacement should happen
m <- array(FALSE, dim=dim(t2.df))
m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE
t2.df[m] <- 99
return(t2.df)
}
f_tommy_mat <- function(m){
m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
return(m)
}
To compare the performance of the different approaches, we can use rbenchmark.
library(rbenchmark)
benchmark(f_long(t.df), f_ramnath(t.df), f_joran(t.df), f_tommy(t.df),
f_tommy_mat(t.df), replications = 20, order = 'relative',
columns = c('test', 'elapsed', 'relative'))
test elapsed relative
5 f_tommy_mat(t.df) 0.135 1.000000
2 f_ramnath(t.df) 0.172 1.274074
4 f_tommy(t.df) 0.311 2.303704
3 f_joran(t.df) 0.705 5.222222
1 f_long(t.df) 2.411 17.859259
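As a quick sanity check of my own (not part of the original benchmark) that the approaches agree on this particular test data (f_joran relies on column 1 tracking the row number modulo 10, which holds for this matrix):
res <- f_long(t.df)
identical(res, f_ramnath(t.df))    # should be TRUE
identical(res, f_joran(t.df))      # should be TRUE for this test data
identical(res, f_tommy(t.df))      # should be TRUE
identical(res, f_tommy_mat(t.df))  # should be TRUE (t.df is a matrix here)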
Another option for when you do need mixed column types (and so can't use a matrix) is := in data.table. Example from ?":=":
require(data.table)
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
system.time(for (i in 1:1000) DF[i,1] <- i)
# 591 seconds
system.time(for (i in 1:1000) DT[i,V1:=i])
# 1.16 seconds ( 509 times faster )
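A sketch of how the same idea could be applied to the original row-wise replacement using data.table's set() function (my own example, not from the help page; it assumes the indicator sits in column 1, as in the question):
require(data.table)
t.dt = as.data.table(data.frame(matrix(1:5000, ncol=10, nrow=20000)))
system.time(
for (i in 1:nrow(t.dt)) {
  # set() updates by reference, avoiding the per-assignment data.frame overhead
  set(t.dt, i, as.integer(t.dt[[1]][i] %% 10 + 1), 99L)
}
)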
In SparkR I have a DataFrame data that contains id, amount_spent and amount_won.
For example, for id=1 we have
head(filter(data, data$id==1))
and output is
1 30 10
1 40 100
1 22 80
1 14 2
So far I want to know whether a fixed id has won more than it has lost; the exact amounts can be ignored.
In R I can make it run, but it takes time. Say we have 100 ids. In R I have done this:
w <- c()
for(j in 1:100){
# Make the data for a fixed id local
q <- collect(filter(data, data$id==j))
# Compare totals: 1 means the id won more than it spent, 0 means it lost
if( sum(as.numeric(q$amount_won)) - sum(as.numeric(q$amount_spent)) > 0 ){
w[j] <- 1
} else {
w[j] <- 0
}
}
Now w simply gives me 1s and 0s for all the ids. In SparkR I want to do this in a faster way.
I am not sure whether this is exactly what you want, so feel free to ask for adjustments.
df <- data.frame(id = c(1,1,1,1),
amount_spent = c(30,40,22,14),
amount_won = c(10,100,80,2))
DF <- createDataFrame(sqlContext, df)
DF <- withColumn(DF, "won", DF$amount_won > DF$amount_spent)
DF$won <- cast(DF$won, "integer")
grouped <- groupBy(DF, DF$id)
aggregated <- agg(grouped, total_won = sum(DF$won), total_games = n(DF$won))
result <- withColumn(aggregated, "percentage_won" , aggregated$total_won/aggregated$total_games)
collect(result)
I have added a column to DF indicating whether the id won more than it spent on that row. The result contains, for each id, the number of games played, the number of games won, and the percentage of games won.
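If you want the single 1/0 flag per id from the question (total amount won greater than total amount spent), a similar aggregation should work; a sketch using the same DF as above (the column names total_spent, total_won and net_positive are my own):
totals <- agg(groupBy(DF, DF$id),
              total_spent = sum(DF$amount_spent),
              total_won = sum(DF$amount_won))
# 1 if the id won more in total than it spent, else 0
totals <- withColumn(totals, "net_positive",
                     cast(totals$total_won > totals$total_spent, "integer"))
collect(totals)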
I have a vector of length 14 and I would like to check it in sliding windows of 5, in this manner:
compare = c(rep(1,4),rep(0,10)) # example 0/1 vector of length 14
g.test = matrix(0,5,10)
for (i in 5:14){
g.test[,i-4] = head(tail(compare,i),5) # each column is one 5-long window of 'compare'
}
if (sum(colSums(g.test) >= 3 & colSums(g.test) < 5 ) > 0){yield = TRUE}
I am running through the windows compare[10:14], compare[9:13], ..., compare[1:5] and checking whether any of them has a sum >= 3 and < 5.
BUT compare is just one such vector; I have 100,000 such vectors of different permutations of 1s and 0s, all of length 14. Running my code like that took my computer 100 seconds. Is there a better way to do this?
I'm actually running a simulation for Texas hold'em poker. This portion of the code checks for incomplete straight draws.
Try this:
# embed() gives one row per 5-long window of 'compare' (order within rows is reversed, which doesn't affect the sums)
g.sums <- rowSums(embed(compare, 5))
yield <- any(g.sums >= 3 & g.sums < 5)
100,000 iterations on my machine:
# user system elapsed
# 2.438 0.052 2.493
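If the 100,000 vectors are stored as rows of a 0/1 matrix, the same window sums can be computed for all of them at once; a rough sketch under that assumption (the matrix M below is simulated, not your data):
set.seed(1)
M <- matrix(rbinom(100000 * 14, 1, 0.3), nrow = 100000)   # stand-in for the 100,000 vectors
# one column per 5-wide window: windows start at positions 1..10
win.sums <- sapply(1:10, function(k) rowSums(M[, k:(k + 4)]))
yield <- rowSums(win.sums >= 3 & win.sums < 5) > 0   # TRUE where any window sums to 3 or 4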
I need to calculate the mean and variance of a subset of a vector. Let x be the vector and y be an indicator for whether the observation is in the subset. Which is more efficient:
sub.mean <- mean(x[y])
sub.var <- var(x[y])
or
sub <- x[y]
sub.mean <- mean(sub)
sub.var <- var(sub)
sub <- NULL
The first approach doesn't create a new object explicitly, but do the calls to mean and var create one implicitly, or do they work on the original vector as stored?
Is the second approach faster because it doesn't have to do the subsetting twice?
I'm concerned with speed and with memory management for large data sets.
Benchmarking on a vector of length 10M indicates that (on my machine) the latter approach is faster:
f1 = function(x, y) {
sub.mean <- mean(x[y])
sub.var <- var(x[y])
}
f2 = function(x, y) {
sub <- x[y]
sub.mean <- mean(sub)
sub.var <- var(sub)
sub <- NULL
}
x = rnorm(10000000)
y = rbinom(10000000, 1, .5) == 1  # logical indicator, as described in the question
print(system.time(f1(x, y)))
# user system elapsed
# 0.403 0.037 0.440
print(system.time(f2(x, y)))
# user system elapsed
# 0.233 0.002 0.235
This isn't surprising: mean(x[y]) does have to create a new object for the mean function to act on, even if it doesn't add it to the local environment. Thus, f1 is slower because it has to do the subsetting twice (as you surmised).
Overview
Given a large (nrow > 5,000,000) data frame A with string row names, and a list B of disjoint sets (n = 20,000+) where each set consists of row names from A, what is the best way to create a column that identifies each row's set in B with a unique value?
Illustration
Below is an example illustrating this problem:
# Input
A <- data.frame(d = rep("A", 5e6), row.names = as.character(sample(1:5e6)))
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"), ...) # Size 20,000+
The desired result would be:
# An index of NA represents that the row is not part of any set in B.
> A[,"index", drop = F]
d index
4655297 A 1
3328423 A 1
2911946 A 2
2829484 A 2
3871770 A NA
2702914 A NA
2581677 A NA
4106410 A NA
3755846 A NA
3177816 A 1
Naive Attempt
Something like this can be achieved using the following method.
n <- 0
A$index <- NA
lapply(B, function(x){
n <<- n + 1
A[x, "index"] <<- n
})
Problem
However, this is unreasonably slow (several hours) because it indexes A once per set in B (20,000+ times), and it is not very R-esque or elegant.
How can the desired result be generated in a quick and efficient manner?
Here is a suggestion using base that isn't too bad when compared to your current method.
Sample data:
A <- data.frame(d = rep("A", 5e6),
set = sample(c(NA, 1:20000), 5e6, replace = TRUE),
row.names = as.character(sample(1:5e6)))
B <- split(rownames(A), A$set)
Base method:
system.time({
A$index <- NA
A[unlist(B), "index"] <- rep(seq_along(B), times = lengths(B))
})
# user system elapsed
# 15.30 0.19 15.50
Check:
identical(A$set, A$index)
# TRUE
For anything faster, I suppose data.table will come handy.
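As a sketch of that data.table idea (my own, not benchmarked against the timings above): build a lookup table mapping each row name in B to its set number, then fill the index column with an update join by reference; rows in no set stay NA.
library(data.table)
lookup <- data.table(rn = unlist(B), index = rep(seq_along(B), lengths(B)))
DT <- as.data.table(A, keep.rownames = TRUE)   # row names become column 'rn'
DT[lookup, index := i.index, on = "rn"]        # update join by reference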
Below is a simplified version of a segment of code that I'm working on (a lot of additional calculations are left out to avoid confusion). It's just a modified form of the cumsum function. I don't want to re-invent the wheel, so does this function already exist? If not, what scheme would provide the best speed?
#Set up the data
set.seed(1)
junk <- rnorm(1000000)
junk1 <- rnorm(1000000)
cumval <- numeric(1000000)
#Initialize the accumulator
cumval[1] <- 1
#Perform the modified cumsum
system.time({
for (i in 2:1000000) cumval[i] <- junk[i] + (junk1[i] * cumval[i-1])
})
#Plot the result
plot(cumval, type="l")
This algorithm is something that fits the compiler package perfectly!
#Set up the data
set.seed(1)
junk <- rnorm(1000000)
junk1 <- rnorm(1000000)
# The original code
f <- function(junk, junk1) {
cumval <- numeric(1000000)
cumval[1] <- 1
for (i in 2:1000000) cumval[i] <- junk[i] + (junk1[i] * cumval[i-1])
cumval
}
system.time( f(junk, junk1) ) # 4.11 secs
# Now try compiling it...
library(compiler)
g <- cmpfun(f)
system.time( g(junk, junk1) ) # 0.98 secs
...so it would be interesting to know if this algorithm is in any way "typical"; in that case the compiler could perhaps be optimized even further for situations like this...
A plain cumsum-based rewrite (cumval.1 below) is faster but doesn't give correct results.
Run this
set.seed(1)
N <- 10
junk <- rnorm(N)
junk1 <- rnorm(N)
cumval <- numeric(N)
cumval.1 <- numeric(N)
cumval[1] <- 1
for( i in 2:N ) cumval[i] <- junk[i] + junk1[i]*cumval[i-1]
cumval
cumval.1 <- cumsum( junk[-1] + (junk1[-1] * cumval.1[-N]) )
cumval.1
and you'll see that cumval and cumval.1 are not even the same length.
One would need to rewrite the recurrence relation, and I don't see a way to convert it into a non-recursive formula.
Consider cumval[5]. Using j[] for junk and jk[] for junk1 and omitting * symbols, its expansion would be:
j[5] +jk[5]j[4] + jk[5]jk[4]j[3] + jk[5]jk[4]jk[3]j[2] + jk[5]jk[4]jk[3]jk[2]
The pattern gives this expression for the 5th term (the trailing 1 stands in for cumval[1], which is 1 rather than j[1]):
sum( c(j[5:2], 1) * c(1, Reduce("*", rev(jk[2:5]), accumulate=TRUE)) )
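A quick numerical check of that expansion against the loop, reusing the small-N setup from the comparison above (a sketch with N = 5):
set.seed(1)
N <- 5
j <- rnorm(N)    # junk
jk <- rnorm(N)   # junk1
cumval <- numeric(N)
cumval[1] <- 1
for (i in 2:N) cumval[i] <- j[i] + jk[i] * cumval[i-1]
term5 <- sum( c(j[5:2], 1) * c(1, Reduce("*", rev(jk[2:5]), accumulate = TRUE)) )
all.equal(cumval[5], term5)   # should be TRUE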