I have the following dataset, which I would like to reshape from wide to long format:
Name Code CURRENCY 01/01/1980 02/01/1980 03/01/1980 04/01/1980
Abengoa 4256 USD 1.53 1.54 1.51 1.52
Adidas 6783 USD 0.23 0.54 0.61 0.62
The data consists of stock prices for different firms on each day from 1980 to 2013. Therefore, I have 8,612 columns in my wide data (and a abou 3,000 rows). Now, I am using the following command to reshape the data into long format:
library(reshape)
data <- read.csv("data.csv")
data1 <- melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date")
However, for .csv files that are about 50MB big, it already takes about two hours. The computing time shouldn't be driven by weak hardware, since I am running this on a 2.7 GHz Intel Core i7 with 16GB of RAM. Is there any other more efficient way to do this?
Many thanks!
Benchmarks Summary:
Using Stack (as suggested by #AnandaMahto) is definitely
the way to go for smaller data sets (N < 3,000).
As the data sets gets larger, data.table begins to outperform stack
Here is an option using data.table
dtt <- data.table(data)
# non value columns, ie, the columns to keep post reshape
nvc <- c("Name","Code", "CURRENCY")
# name of columns being transformed
dateCols <- setdiff(names(data), nvc)
# use rbind list to combine subsets
dtt2 <- rbindlist(lapply(dateCols, function(d) {
dtt[, Date := d]
cols <- c(nvc, "Date", d)
setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))
}))
## Results:
dtt2
# Name Code CURRENCY Date value
# 1: Abengoa 4256 USD X_01_01_1980 1.53
# 2: Adidas 6783 USD X_01_01_1980 0.23
# 3: Abengoa 4256 USD X_02_01_1980 1.54
# 4: Adidas 6783 USD X_02_01_1980 0.54
# 5: ... <cropped>
Updated Benchmarks with larger sample data
As per the suggestion from #AnandaMahto, below are benchmarks using a large (larger) sample data.
Please feel free to improve any of the methods used below and/or add new methods.
Benchmarks
Resh <- quote(reshape::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
Resh2 <- quote(reshape2::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
DT <- quote({ nvc <- c("Name","Code", "CURRENCY"); dateCols <- setdiff(names(data), nvc); rbindlist(lapply(dateCols, function(d) { dtt[, Date := d]; cols <- c(nvc, "Date", d); setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))}))})
Stack <- quote(data.frame(data[1:3], stack(data[-c(1, 2, 3)])))
# SAMPLE SIZE: ROWS = 900; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(Resh=eval(Resh),Resh2=eval(Resh2),DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 Stack 0.813 0.623 0.192 5
# 2.530 DT 2.057 2.035 0.026 5
# 40.470 Resh 32.902 18.410 14.602 5
# 40.578 Resh2 32.990 18.419 14.728 5
# SAMPLE SIZE: ROWS = 3,500; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.00 DT 2.407 2.336 0.076 5
# 1.08 Stack 2.600 1.626 0.983 5
# SAMPLE SIZE: ROWS = 27,000; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 DT 10.450 7.418 3.058 5
# 2.232 Stack 23.329 14.180 9.266 5
Sample Data Creation
# rm(list=ls(all=TRUE))
set.seed(1)
LLLL <- apply(expand.grid(LETTERS, LETTERS[10:15], LETTERS[1:20], LETTERS[1:5], stringsAsFactors=FALSE), 1, paste0, collapse="")
size <- 900
dateSamples <- 380
startDate <- as.Date("1980-01-01")
Name <- apply(matrix(LLLL[1:(2*size)], ncol=2), 1, paste0, collapse="")
Code <- sample(1e3:max(1e4-1, size+1e3), length(Name))
CURRENCY <- sample(c("USD", "EUR", "YEN"), length(Name), TRUE)
Dates <- seq(startDate, length.out=dateSamples, by="mon")
Values <- sample(c(1:1e2, 1:5e2), size=size*dateSamples, TRUE) / 1e2
# Calling the sample dataframe `data` to keep consistency, but I dont like this practice
data <- data.frame(Name, Code, CURRENCY,
matrix(Values, ncol=length(Dates), dimnames=list(c(), as.character(Dates)))
)
data[1:6, 1:8]
# Name Code CURRENCY X1980.01.01 X1980.02.01 X1980.03.01 X1980.04.01 X1980.05.01
# 1 AJAAQNFA 3389 YEN 0.37 0.33 3.58 4.33 1.06
# 2 BJAARNFA 4348 YEN 1.14 2.69 2.57 0.27 3.02
# 3 CJAASNFA 6154 USD 2.47 3.72 3.32 0.36 4.85
# 4 DJAATNFA 9171 USD 2.22 2.48 0.71 0.79 2.85
# 5 EJAAUNFA 2814 USD 2.63 2.17 1.66 0.55 3.12
# 6 FJAAVNFA 9081 USD 1.92 1.47 3.51 3.23 3.68
From the question :
data <- read.csv("data.csv")
and
... for .csv files that are about 50MB big, it already takes about two
hours ...
So although stack/melt/reshape comes into play, I'm guessing (since this is your fist ever S.O. question) that the biggest factor here is read.csv. Assuming you're including that in your timing as well as melt (it isn't clear).
Default arguments to read.csv are well known to be slow. A few quick searches should reveal hint and tips (e.g. stringsAsFactors, colClasses) such as :
http://cran.r-project.org/doc/manuals/R-data.html
Quickly reading very large tables as dataframes
But I'd suggest fread (since data.table 1.8.7). To get a feel for fread its manual page in raw text form is here:
https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/fread
The examples section there, as it happens, has a 50MB example shown to be read in 3 seconds instead of up to 60. And benchmarks are starting to appear in other answers which is great to see.
Then the stack/reshape/melt answers are next order, if I guessed correctly.
While the testing is going on, I'll post my comment as an answer for you to consider. Try using stack as in:
data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))
In many cases, stack works really efficiently with these types of operations, and adding back in the first few columns also works quickly because of how vectors are recycled in R.
For that matter, this might also be worth considering:
data.frame(data[1:3],
vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))
I'm cautious to benchmark on such a small sample of data though, because I suspect the results won't be quite comparable to benchmarking on your actual dataset.
Update: Results of some more benchmarks
Using #RicardoSaporta's benchmarking procedure, I have benchmarked data.table against what I've called "Manual" data.frame creation. You can see the results of the benchmarks here, on datasets ranging from 1000 rows to 3000 rows, in 500 row increments, and all with 8003 columns (8000 data columns, plus the three initial columns).
The results can be seen here: http://rpubs.com/mrdwab/reduce-computing-time
Ricardo's correct--there seems to be something about 3000 rows that makes a huge difference with the base R approaches (and it would be interesting if anyone has any explanation about what that might be). But this "Manual" approach is actually even faster than stack, if performance really is the primary concern.
Here are the results for just the last three runs:
data <- makeSomeData(2000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 0.908 0.696 0.108 1
## 1 3.963 DT 3.598 3.564 0.012 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(2500, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 2.841 1.044 0.296 1
## 1 1.694 DT 4.813 4.661 0.080 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(3000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 1 1.00 DT 7.223 5.769 0.112 1
## 2 29.27 Manual 211.416 1.560 0.952 1
Ouch! data.table really turns the tables on that last run!
Related
I am working on a dataset to test the association between empirical antibiotics (variable emp, the antibiotics are cefuroxime or ceftriaxone compared with a reference antibiotic) and 30-day mortality (variable mort30). The data comes from patients admitted in 6 hospitals (variable site2) with a specific type of infection. Therefore, I would like to adjust for this clustering of patients on hospital level.
First I did this using the coxme() function for mixed models. However, based on visual inspection of the Schoenfeld residuals there were violations of the proportional hazards assumption and I tried adding a time transformation (tt) to the model. Unfortunately, the coxme() does not offer the possibility for time transformations.
Therfore, I tried other options to adjust for the clustering, including coxph() combined with frailty() and cluster. Surprisingly, the standard errors I get using the cluster() option are much smaller than using the coxme() or frailty().
**Does anyone know what is the explanation for this and which option would provide the most reliable estimates?
**
1) Using coxme:
> uni.mort <- coxme(Surv(FUdur30, mort30num) ~ emp + (1 | site2), data = total.pop)
> summary(uni.mort)
Cox mixed-effects model fit by maximum likelihood
Data: total.pop
events, n = 58, 253
Iterations= 24 147
NULL Integrated Fitted
Log-likelihood -313.8427 -307.6543 -305.8967
Chisq df p AIC BIC
Integrated loglik 12.38 3.00 0.0061976 6.38 0.20
Penalized loglik 15.89 3.56 0.0021127 8.77 1.43
Model: Surv(FUdur30, mort30num) ~ emp + (1 | site2)
Fixed coefficients
coef exp(coef) se(coef) z p
empCefuroxime 0.5879058 1.800214 0.6070631 0.97 0.33
empCeftriaxone 1.3422317 3.827576 0.5231278 2.57 0.01
Random effects
Group Variable Std Dev Variance
site2 Intercept 0.2194737 0.0481687
> confint(uni.mort)
2.5 % 97.5 %
empCefuroxime -0.6019160 1.777728
empCeftriaxone 0.3169202 2.367543
2) Using frailty()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp + frailty(site2), data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp + frailty(site2),
data = total.pop)
n= 253, number of events= 58
coef se(coef) se2 Chisq DF p
empCefuroxime 0.6302 0.6023 0.6010 1.09 1.0 0.3000
empCeftriaxone 1.3559 0.5221 0.5219 6.75 1.0 0.0094
frailty(site2) 0.40 0.3 0.2900
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.878 0.5325 0.5768 6.114
empCeftriaxone 3.880 0.2577 1.3947 10.796
Iterations: 7 outer, 27 Newton-Raphson
Variance of random effect= 0.006858179 I-likelihood = -307.8
Degrees of freedom for terms= 2.0 0.3
Concordance= 0.655 (se = 0.035 )
Likelihood ratio test= 12.87 on 2.29 df, p=0.002
3) Using cluster()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp, cluster = site2, data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp, data = total.pop,
cluster = site2)
n= 253, number of events= 58
coef exp(coef) se(coef) robust se z Pr(>|z|)
empCefuroxime 0.6405 1.8975 0.6009 0.3041 2.106 0.035209 *
empCeftriaxone 1.3594 3.8937 0.5218 0.3545 3.834 0.000126 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.897 0.5270 1.045 3.444
empCeftriaxone 3.894 0.2568 1.944 7.801
Concordance= 0.608 (se = 0.027 )
Likelihood ratio test= 12.08 on 2 df, p=0.002
Wald test = 15.38 on 2 df, p=5e-04
Score (logrank) test = 10.69 on 2 df, p=0.005, Robust = 5.99 p=0.05
(Note: the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests do not).
>
I want to create in R a column in my data set where I subtract row 2 from row1, row 4 from row 3 and so forth. Moreover, I want that the subtraction result is repeated for each row (e.g.if the result from the subtraction row2-row1 is -0.294803, I want this value to be present both in row1 and row2, hence repeated twice for both factors of the subtraction, and so forth for all subtractions).
Here my data set.
I tried with the function aggregate but I didn't succeed.
Any hint?
Another possible solution can be:
x <- read.table("mydata.csv",header=T,sep=";")
x$diff <- rep(x$log[seq(2,nrow(x),by=2)] - x$log[seq(1,nrow(x),by=2)], each=2)
By using the function seq(), you can generate the sequences of row positions:
1, 3, 5, ... 9
2, 4, 6, ... 10
Afterwards, the code subtracts the rows 2...10 to the rows 1...9. Each result is replicated by using the command rep() and it's assigned to the new column diff.
solution 1
One way to that is with one simple loop:
(download mydata.csv)
a = read.table("mydata.csv",header=T,sep=";")
a$delta= NA
for(i in seq(1, nrow(a), by=2 )){
a[i,"delta"] = a[i+1,"delta"] = a[i+1,"log"] - a[i,"log"]
}
What is going on here is that the for loop iterates on every odd number (that's what the seq(...,by=2) does. So for the first, third, fifth, etc. row we assign to that row AND the following one the computed difference.
which returns:
> a
su match log delta
1 1 match 5.80 0.30
2 1 mismatch 6.10 0.30
3 2 match 6.09 -0.04
4 2 mismatch 6.05 -0.04
5 3 match 6.42 -0.12
6 3 mismatch 6.30 -0.12
7 4 match 6.20 -0.20
8 4 mismatch 6.00 -0.20
9 5 match 5.90 0.19
10 5 mismatch 6.09 0.19
solution 2
If you have a lot of data this approach can be slow. And generally R works better with another form of iterative functions which are the apply family.
The same code of above can be optimized like this:
a$delta = rep(
sapply(seq(1, nrow(a), by=2 ),
function(i){ a[i+1,"log"] - a[i,"log"] }
),
each=2)
Which gives the very same result as the first solution, should be faster, but also somewhat less intuitive.
solution 3
Finally it looks to me that you're trying to use a convoluted approach by using the long dataframe format, given your kind of data.
I'd reshape it to wide, and then operate more logically with separate columns, without the need of duplicate data.
Like this:
a = read.table("mydata.csv",header=T,sep=";")
a = reshape(a, idvar = "su", timevar = "match", direction = "wide")
#now creating what you want became a very simple thing:
a$delta = a[[3]]-a[[2]]
Which returns:
>a
su log.match log.mismatch delta
1 1 5.80 6.10 0.30
3 2 6.09 6.05 -0.04
5 3 6.42 6.30 -0.12
7 4 6.20 6.00 -0.20
9 5 5.90 6.09 0.19
The delta column contains the values you need. If you really need the long format for further analysis you can always go back with:
a= reshape(a, idvar = "su", timevar = "match", direction = "long")
#sort to original order:
a = a[with(a, order(su)), ]
Suppose that you time a program as a function of N and produce
the following table.
N seconds
-------------------
19683 0.00
59049 0.00
177147 0.01
531441 0.08
1594323 0.44
4782969 2.46
14348907 13.58
43046721 74.99
129140163 414.20
387420489 2287.85
Estimate the order of growth of the running time as a function of N.
Assume that the running time obeys a power law T(N) ~ a N^b. For your
answer, enter the constant b. Your answer will be marked as correct
if it is within 1% of the target answer - we recommend using
two digits after the decimal separator, e.g., 2.34.
Can someone explain how to calculate this?
Well, it is a simple mathematical problem.
I : a*387420489^b = 2287.85 -> a = 387420489^b/2287.85
II: a*43046721^b = 74.99 -> a = 43046721^b/74.99
III: (I and II)-> 387420489^b/2287.85 = 43046721^b/74.99 ->
-> http://www.purplemath.com/modules/solvexpo2.htm
Use logarithms to solve.
1.You should calculate the ratio of the growth change from one row to the one next
N seconds
--------------------
14348907 13.58
43046721 74.99
129140163 414.2
387420489 2287.85
2.Calculate the change's ratio for N
43046721 / 14348907 = 3
129140163 / 43046721 = 3
therefore the rate of change for N is 3.
3.Calculate the change's ratio for seconds
74.99 / 13.58 = 5.52
Now let check the ratio between one more pare of rows to be sure
414.2 / 74.99 = 5.52
so the change's ratio for seconds is 5.52
4.Build the following equitation
3^b = 5.52
b = 1.55
Finally we get that the order of growth of the running time is 1.55.
I was wondering what the best way is to avoid row-wise processing in R, most of row-wise stuff is done in internal C routines. For example: I have a data frame a:
chromosome_name start_position end_position strand
1 15 35574797 35575181 1
2 15 35590448 35591641 -1
3 15 35688422 35688645 1
4 13 75402690 75404217 1
5 15 35692892 35693969 1
What I want is: based on whether strand is positive or negative, startOFgene as start_position or end_position. One way to avoid for loop will be to separate data.frame with +1 strand and -1 strand and perform selection. What can be other way for speed up? The method does not scale-up if one has certain other complicated processing per row.
Maybe this is fast enough...
transform(a, startOFgene = ifelse(strand == 1, start_position, end_position))
chromosome_name start_position end_position strand startOFgene
1 15 35574797 35575181 1 35574797
2 15 35590448 35591641 -1 35591641
3 15 35688422 35688645 1 35688422
4 13 75402690 75404217 1 75402690
5 15 35692892 35693969 1 35692892
First, since all your columns are integer/numeric, you could use a matrix instead of a data.frame. Many operations on a matrix are a lot faster than the same operation on a data.frame, even though they're not very different in this case. Then you can use logical subsetting to create the startOFgene column.
# Create some large-ish data
M <- do.call(rbind,replicate(1e3,as.matrix(a),simplify=FALSE))
M <- do.call(rbind,replicate(1e3,M,simplify=FALSE))
A <- as.data.frame(M)
# Create startOFgene column in a matrix
m <- function() {
M <- cbind(M, startOFgene=M[,"start_position"])
negStrand <- sign(M[,"strand"]) < 0
M[negStrand,"startOFgene"] <- M[negStrand,"end_position"]
}
# Create startOFgene column in a data.frame
d <- function() {
A$startOFgene <- A$start_position
negStrand <- sign(A$strand) < 0
A$startOFgene[negStrand] <- A$end_position[negStrand]
}
library(rbenchmark)
benchmark(m(), d(), replications=10)[,1:6]
# test replications elapsed relative user.self sys.self
# 2 d() 10 18.804 1.000 16.501 2.224
# 1 m() 10 19.713 1.048 16.457 3.152
I know there are many questions here in SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this questions is about understanding the inner workings of both methods and trying to figure out why I can't get either to work for concatenating a list of almost 1 million df's of the same structure, same field names, etc. into a single data.frame. Each data.frame is of one row and 21 columns.
The data started out as a JSON file, which I converted to lists using fromJSON, then ran another lapply to extract part of the list and converted to data.frame and ended up with a list of data.frames.
I've tried:
df <- do.call("rbind", list)
df <- ldply(list)
but I've had to kill the process after letting it run up to 3 hours and not getting anything back.
Is there a more efficient method of doing this? How can I troubleshoot what is happening and why is it taking so long?
FYI - I'm using RStudio server on a 72GB quad-core server with RHEL, so I don't think memory is the problem. sessionInfo below:
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] multicore_0.1-7 plyr_1.7.1 rjson_0.2.6
loaded via a namespace (and not attached):
[1] tools_2.14.1
>
Given that you are looking for performance, it appears that a data.table solution should be suggested.
There is a function rbindlist which is the same but much faster than do.call(rbind, list)
library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
## user system elapsed
## 0.00 0.01 0.02
It is also very fast for a list of data.frame
Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.frame <- rbindlist(Xdf))
## user system elapsed
## 0.03 0.00 0.03
For comparison
system.time(docall <- do.call(rbind, Xdf))
## user system elapsed
## 50.72 9.89 60.88
And some proper benchmarking
library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X),
rbindlist.data.frame = rbindlist(Xdf),
docall = do.call(rbind, Xdf),
replications = 5)
## test replications elapsed relative user.self sys.self
## 3 docall 5 276.61 3073.444445 264.08 11.4
## 2 rbindlist.data.frame 5 0.11 1.222222 0.11 0.0
## 1 rbindlist.data.table 5 0.09 1.000000 0.09 0.0
and against #JoshuaUlrich's solutions
benchmark(use.rbl.dt = rbl.dt(X),
use.rbl.ju = rbl.ju (Xdf),
use.rbindlist =rbindlist(X) ,
replications = 5)
## test replications elapsed relative user.self
## 3 use.rbindlist 5 0.10 1.0 0.09
## 1 use.rbl.dt 5 0.10 1.0 0.09
## 2 use.rbl.ju 5 0.33 3.3 0.31
I'm not sure you really need to use as.data.frame, because a data.table inherits class data.frame
rbind.data.frame does a lot of checking you don't need. This should be a pretty quick transformation if you only do exactly what you want.
# Use data from Josh O'Brien's post.
set.seed(21)
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time({
Names <- names(X[[1]]) # Get data.frame names from first list element.
# For each name, extract its values from each data.frame in the list.
# This provides a list with an element for each name.
Xb <- lapply(Names, function(x) unlist(lapply(X, `[[`, x)))
names(Xb) <- Names # Give Xb the correct names.
Xb.df <- as.data.frame(Xb) # Convert Xb to a data.frame.
})
# user system elapsed
# 3.356 0.024 3.388
system.time(X1 <- do.call(rbind, X))
# user system elapsed
# 169.627 6.680 179.675
identical(X1,Xb.df)
# [1] TRUE
Inspired by the data.table answer, I decided to try and make this even faster. Here's my updated solution, to try and keep the check mark. ;-)
# My "rbind list" function
rbl.ju <- function(x) {
u <- unlist(x, recursive=FALSE)
n <- names(u)
un <- unique(n)
l <- lapply(un, function(N) unlist(u[N==n], FALSE, FALSE))
names(l) <- un
d <- as.data.frame(l)
}
# simple wrapper to rbindlist that returns a data.frame
rbl.dt <- function(x) {
as.data.frame(rbindlist(x))
}
library(data.table)
if(packageVersion("data.table") >= '1.8.2') {
system.time(dt <- rbl.dt(X)) # rbindlist only exists in recent versions
}
# user system elapsed
# 0.02 0.00 0.02
system.time(ju <- rbl.ju(X))
# user system elapsed
# 0.05 0.00 0.05
identical(dt,ju)
# [1] TRUE
Your observation that the time taken increases exponentially with the number of data.frames suggests that breaking the rbinding into two stages could speed things up.
This simple experiment seems to confirm that that's a very fruitful path to take:
## Make a list of 50,000 data.frames
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
## First, rbind together all 50,000 data.frames in a single step
system.time({
X1 <- do.call(rbind, X)
})
# user system elapsed
# 137.08 57.98 200.08
## Doing it in two stages cuts the processing time by >95%
## - In Stage 1, 100 groups of 500 data.frames are rbind'ed together
## - In Stage 2, the resultant 100 data.frames are rbind'ed
system.time({
X2 <- lapply(1:100, function(i) do.call(rbind, X[((i*500)-499):(i*500)]))
X3 <- do.call(rbind, X2)
})
# user system elapsed
# 6.14 0.05 6.21
## Checking that the results are the same
identical(X1, X3)
# [1] TRUE
You have a list of data.frames that each have a single row. If it is possible to convert each of those to a vector, I think that would speed things up a lot.
However, assuming that they need to be data.frames, I'll create a function with code borrowed from Dominik's answer at Can rbind be parallelized in R?
do.call.rbind <- function (lst) {
while (length(lst) > 1) {
idxlst <- seq(from = 1, to = length(lst), by = 2)
lst <- lapply(idxlst, function(i) {
if (i == length(lst)) {
return(lst[[i]])
}
return(rbind(lst[[i]], lst[[i + 1]]))
})
}
lst[[1]]
}
I have been using this function for several months, and have found it to be faster and use less memory than do.call(rbind, ...) [the disclaimer is that I've pretty much only used it on xts objects]
The more rows that each data.frame has, and the more elements that the list has, the more beneficial this function will be.
If you have a list of 100,000 numeric vectors, do.call(rbind, ...) will be better. If you have list of length one billion, this will be better.
> df <- lapply(1:10000, function(x) data.frame(x = sample(21, 21)))
> library(rbenchmark)
> benchmark(a=do.call(rbind, df), b=do.call.rbind(df))
test replications elapsed relative user.self sys.self user.child sys.child
1 a 100 327.728 1.755965 248.620 79.099 0 0
2 b 100 186.637 1.000000 181.874 4.751 0 0
The relative speed up will be exponentially better as you increase the length of the list.