Performance of rbind.data.frame

I have a list of dataframes for which I am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.
The situation can be simulated like this:
#create one row
onerowdfr<-do.call(data.frame, c(list(), rnorm(100) , lapply(sample(letters[1:2], 100, replace=TRUE), function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr)<-c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
#reuse it in a list
someParts<-lapply(rbinom(200, 1, 14/200)*6+1, function(reps){onerowdfr[rep(1, reps),]})
I've set the parameters (of the randomization) so that they approximate my true situation.
Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:
system.time(
result<-do.call(rbind, someParts)
)
Now, on my system (which is not particularly slow), and with the settings above, this is the output of system.time:
user system elapsed
5.61 0.00 5.62
Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a form of multiple imputation), so I need this to be as fast as possible.

Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.
On my system, using data frames:
> system.time(result<-do.call(rbind, someParts))
user system elapsed
2.628 0.000 2.636
Building the list with all numeric matrices instead:
onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1,
function(reps){onerowdfr2[rep(1, reps),]})
results in a lot faster rbind.
> system.time(result2<-do.call(rbind, someParts2))
user system elapsed
0.001 0.000 0.001
EDIT: Here's another possibility; it just combines each column in turn.
> system.time({
+ n <- 1:ncol(someParts[[1]])
+ names(n) <- names(someParts[[1]])
+ result <- as.data.frame(lapply(n, function(i)
+ unlist(lapply(someParts, `[[`, i))))
+ })
user system elapsed
0.810 0.000 0.813
Still not nearly as fast as using matrices though.
EDIT 2:
If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric, so I force to integer first.
someParts2 <- lapply(someParts, function(x)
matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
  lev <- levels(a[[i]])
  result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}
The timing on my system is:
user system elapsed
0.090 0.00 0.091

Not a huge boost, but swapping rbind for rbind.fill from the plyr package knocks about 10% off the running time (with the sample dataset, on my machine).
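For reference, a minimal sketch of that swap on the example list (assuming plyr is installed; rbind.fill accepts a list of data frames directly):
library(plyr)
system.time(result <- rbind.fill(someParts))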

If you really want to manipulate your data.frames faster, I would suggest to use the package data.table and the function rbindlist(). I did not perform extensive tests but for my dataset (3000 dataframes, 1000 rows x 40 columns each) rbindlist() takes only 20 seconds.
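A minimal sketch on the example list from the question (assuming data.table is installed); rbindlist() returns a data.table, so wrap it if you need a plain data.frame:
library(data.table)
system.time(result <- as.data.frame(rbindlist(someParts)))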

This is ~25% faster, but there has to be a better way...
system.time({
  N <- do.call(sum, lapply(someParts, nrow))
  SP <- as.data.frame(lapply(someParts[[1]], function(x) rep(x, N)))
  k <- 0
  for(i in 1:length(someParts)) {
    j <- k + 1
    k <- k + nrow(someParts[[i]])
    SP[j:k, ] <- someParts[[i]]
  }
})

Make sure you're binding a data frame to a data frame. I ran into a huge performance degradation when binding a list to a data frame.
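A hypothetical toy illustration of the two shapes meant here (not from the question's data):
df <- data.frame(a = 1, b = 2)
rbind(df, df)                   # data.frame bound to data.frame: fast path
rbind(df, list(a = 3, b = 4))   # data.frame bound to a list: same result, slower coercion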

Related

Perform numpy.sum (or scipy.integrate.simps()) on large splitted array efficiently

Let's consider a very large numpy array a (M, N).
where M can typically be 1 or 100 and N 10-100,000,000
We have the array of indices that can split it into many (K = 1,000,000) along axis=1.
We want to efficiently perform an operation like integration along axis=1 (np.sum to take the simplest form) on each sub-array and return a (M, K) array.
An elegant and efficient solution was proposed by @Divakar in question 41920367 (how to split numpy array and perform certain actions on split arrays [Python]), but my understanding is that it only applies to cases where all sub-arrays have the same shape, which allows for reshaping.
But in our case the sub-arrays don't have the same shape, which, so far, has forced me to loop over the indices... please take me out of my misery...
Example
a = np.random.random((10, 100000000))
ind = np.sort(np.random.randint(10, 9000000, 1000000))
The size of the sub-arrays are not homogenous:
sizes = np.diff(ind)
print(sizes.min(), sizes.max())
2, 8732
So far, the best I found is:
output = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
Possible feature request for numpy and scipy:
If looping is really unavoidable, at least having it done in C inside the numpy and scipy.integrate.simps (or romb) functions would probably speed-up the output.
Something like
output = np.sum(a, axis=1, split_ind=ind)
output = scipy.integrate.simps(a, x=x, axis=1, split_ind=ind)
output = scipy.integrate.romb(a, x=x, axis=1, split_ind=ind)
would be very welcome !
(where x itself could be splitable, or not)
Side note:
While trying this example, I noticed that with these numbers there was almost always an element of sizes equal to 0 (the sizes.min() is almost always zero).
This looks peculiar to me: we are picking random integers between 10 and 9,000,000, and naively the odds that the same number comes up twice (such that diff = 0) seemed like they should be close to 0, yet it appears to be very close to 1.
Would that be due to the algorithm behind np.random.randint ?
What you want is np.add.reduceat
output = np.add.reduceat(a, ind, axis = 1)
output.shape
Out[]: (10, 1000000)
Universal functions (ufuncs) are a very powerful tool in numpy.
As for the repeated indices, that's simply the Birthday Problem cropping up: with this many draws from roughly nine million possible values, at least one repeat is all but guaranteed.
Great, thanks! On my CentOS 6.9 VM I have the following results:
In [71]: a = np.random.random((10, 10000000))
In [72]: ind = np.unique(np.random.randint(10, 9000000, 100000))
In [73]: ind2 = np.append([0], ind)
In [74]: out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
In [75]: out2 = np.add.reduceat(a, ind2, axis=1)
In [83]: np.allclose(out, out2)
Out[83]: True
In [84]: %timeit out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
2.7 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [85]: %timeit out2 = np.add.reduceat(a, ind2, axis=1)
179 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's a good 93% reduction in time (a factor of 15 faster) over the list concatenation :-)

How to use pmap on a single large Matrix

I have one very large matrix M (around 5 GB) and have to perform an operation f: Column -> Column on every column of M.
I suppose I should use pmap (correct me if I am wrong), but as I understand it, I should give it a list of matrices. How do I effectively process M in order to pass it to pmap?
The second question is if it is preferable that f can take multiple columns at once or not.
I think it might be a good idea to try SharedArray for this. Even better would be multithreading instead of Julia's current multiprocessing, but this isn't released yet.
f should take a reference to the matrix, and a list of columns, rather than the columns themselves, to avoid copying.
EDIT: Here is my attempt at a SharedArray example - I've never used it myself before, so it's probably written poorly.
addprocs(3)
@everywhere rows = 10000
@everywhere cols = 100
data = SharedArray(Float64, (rows,cols))
@everywhere function f(col, data)
    for row = 1:rows
        new_val = rand()*col
        for dowork = 1:10000
            new_val = sqrt(new_val)^2
        end
        data[row,col] = new_val
    end
end
tic()
pmap(g->f(g...), [(col,data) for col in 1:cols])
toc()
for i = 1:10:cols
    println(i, " ", mean(data[:,i]), " ", 0.5*i)
end
tic()
map(g->f(g...), [(col,data) for col in 1:cols])
toc()
with output
elapsed time: 24.454875168 seconds
1 0.49883655930753457 0.5
11 5.480063271913496 5.5
21 10.495998948926 10.5
31 15.480227440365235 15.5
41 20.70105670567518 20.5
51 25.300540822213783 25.5
61 30.427728439076436 30.5
71 35.5280001975307 35.5
81 41.06101008798742 40.5
91 45.72394376323945 45.5
elapsed time: 69.651211534 seconds
So we are getting approximately a 3x speedup, as hoped for. It'll approach the ideal more closely the longer the job runs, as there is probably some JIT warmup time.

Efficiently sample a data frame avoiding loops

I have a data frame which consists of a first column (experiment.id), and the rest of the columns are values associated with this experiment id. Each row is a unique experiment id. My data frame has on the order of 10⁴ to 10⁵ columns.
data.frame(experiment.id=1:100, v1=rnorm(100,1,2),v2=rnorm(100,-1,2) )
This data frame is the source of my sample space. What I would like to do is, for each unique experiment.id (row), randomly sample (with replacement) one of the values v1, v2, ..., v10000 associated with this id and construct a sample s1. In each sample s1 all experiment ids are represented.
Eventually I want to perform 10⁴ samples, s1, s2, ..., s10⁴, and calculate some statistic.
What would be the most computationally efficient way to perform this sampling process? I would like to avoid for loops as much as possible.
Update:
My question is not only about the sampling but also about storing the samples. I guess my real question is whether there is a quicker way to perform the above than
d<-data.frame(experiment.id=1:1000, replicate (10000,rnorm(1000,100,2)) )
results<-data.frame(d$experiment.id,replicate(n=10000,apply(d[,2:10001],1,function(x){sample(x,size=1,replace=T)})))
Here is an expression that chooses one of the columns (excluding the first). It does not copy the first column; you will need to supply that as a separate step.
For a data frame d:
d[matrix(c(seq(nrow(d)), sample(ncol(d)-1, nrow(d), replace=TRUE)+1), ncol=2)]
That's one sample. To get N samples, just multiply the selection (as in John's answer):
mm <- matrix(c(rep(seq(nrow(d)), N), sample(ncol(d)-1, nrow(d)*N, replace=TRUE)+1), ncol=2)
result <- matrix(d[mm], ncol=N)
But you're going to have memory issues.
The shortest and most readable IMHO is still to use apply, but making good use of the fact that sample is vectorized:
results <- data.frame(experiment.id = d$experiment.id,
t(apply(d[, -1], 1, sample, 10000, replace = TRUE)))
If the 3 seconds it takes are too slow for your needs then I would recommend you use matrix indexing.
It's possible to do without any looping whatsoever. If you convert your columns after the first one to a matrix, this gets easy because a matrix can be addressed either as [row, column] or sequentially via its underlying vector.
mat <- as.matrix(datf[,-1])
nr <- nrow(mat); nc <- ncol(mat)
sel <- sample( 1:nc, nr, replace = TRUE )
sel <- sel + ((1:nr)-1) * nc
x <- t(mat)[sel]
seldatf <- data.frame( datf[,1], x = x )
Now, to get lots of samples it's pretty easy: just multiply the same logic.
ns <- 10 # number of samples / row
sel <- sample(1:nc, nr * ns, replace = TRUE )
sel <- sel + rep(((1:nr)-1) * nc, each = ns)
x <- t(mat)[sel]
seldatf <- cbind( datf[,1], data.frame(matrix(x, ncol = ns, byrow = TRUE)) )
It's possible that this becomes a really big data frame if you set ns <- 1e5 and you have lots of rows, so watch out for running out of memory. I do a bit of unnecessary copying for readability; you can eliminate that for memory and speed, because once you are using large amounts of memory you'll be swapping out other running programs, and that is slow. You don't have to assign and save x, mat, or even sel; not doing so would give you about the fastest answer possible.

randomized SVD singular values

Randomized SVD decomposes a matrix by extracting the first k singular values/vectors using k+p random projections. This works surprisingly well for large matrices.
My question concerns the singular values that are output from the algorithm. Why aren't the values equal to the first k singular values of the full SVD?
Below I have a simple implementation in R. Any suggestions on improving the performance would be appreciated.
rsvd = function(A, k=10, p=5) {
  n = nrow(A)
  y = A %*% matrix(rnorm(n * (k+p)), nrow=n)
  q = qr.Q(qr(y))
  b = t(q) %*% A
  svd = svd(b)
  list(u=q %*% svd$u, d=svd$d, v=svd$v)
}
> set.seed(10)
> A <- matrix(rnorm(500*500),500,500)
> svd(A)$d[1:15]
[1] 44.94307 44.48235 43.78984 43.44626 43.27146 43.15066 42.79720 42.54440 42.27439 42.21873 41.79763 41.51349 41.48338 41.35024 41.18068
> rsvd(A,10,5)$d
[1] 34.83741 33.83411 33.09522 32.65761 32.34326 31.80868 31.38253 30.96395 30.79063 30.34387 30.04538 29.56061 29.24128 29.12612 27.61804
Calculation
I reckon that your algorithm is a modification of the algorithm of Martinsson et al. If I understood it correctly, it is especially meant for approximations of low-rank matrices. I might be wrong though.
The difference is easily explained by the huge difference between the actual rank of A (500) and the values of k (10) and p (5). Plus, Martinsson et al. mention that the value for p should actually be larger than the chosen value for k.
So if we apply your solution taking their considerations into account, using :
set.seed(10)
A <- matrix(rnorm(500*500),500,500) # rank 500
B <- matrix(rnorm(500*50),500,500) # rank 50
We find for the timings that the use of a larger p value still results in a huge speed-up compared to the original svd algorithm.
> system.time(t1 <- svd(A)$d[1:5])
user system elapsed
0.8 0.0 0.8
> system.time(t2 <- rsvd(A,10,5)$d[1:5])
user system elapsed
0.01 0.00 0.02
> system.time(t3 <- rsvd(A,10,30)$d[1:5])
user system elapsed
0.04 0.00 0.03
> system.time(t4 <- svd(B)$d[1:5] )
user system elapsed
0.55 0.00 0.55
> system.time(t5 <-rsvd(B,10,5)$d[1:5] )
user system elapsed
0.02 0.00 0.02
> system.time(t6 <-rsvd(B,10,30)$d[1:5] )
user system elapsed
0.05 0.00 0.05
> system.time(t7 <-rsvd(B,25,30)$d[1:5] )
user system elapsed
0.06 0.00 0.06
But we see that using a higher p for a lower-rank matrix indeed gives a better approximation. If we also let k approach the rank a bit more closely, the difference between the real solution and the approximation becomes approximately 0, while the speed gain is still substantial.
> round(mean(t2/t1),2)
[1] 0.77
> round(mean(t3/t1),2)
[1] 0.82
> round(mean(t5/t4),2)
[1] 0.92
> round(mean(t6/t4),2)
[1] 0.97
> round(mean(t7/t4),2)
[1] 1
So in general I believe that one could conclude that:
p should be chosen so that p > k (Martinsson calls it l, if I'm right)
k shouldn't differ too much from rank(A)
For low rank matrices the result is generally better.
Optimization:
As far as I'm concerned, it's a neat way of doing it. I couldn't really find a more optimal way actually. The only thing I could say is that the construct t(q) %*% A is advised against; one should use crossprod(q, A) for that, which is supposed to be a tiny bit faster. But in your example the difference was nonexistent.
The paper by Halko, Martinsson and Tropp also recommends to do a couple of power iterations before computing the QR. We do 3 power iterations by default in the implementation in scikit-learn and we found it to work very well in practice.
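A hedged sketch of what adding power iterations to the rsvd() above might look like (the helper name rsvd_pi and the iteration count of 3 are my assumptions, following the scikit-learn comment):
rsvd_pi <- function(A, k = 10, p = 5, iters = 3) {
  y <- A %*% matrix(rnorm(ncol(A) * (k + p)), nrow = ncol(A))
  q <- qr.Q(qr(y))
  for (i in seq_len(iters)) {      # power iterations sharpen the captured subspace
    y <- A %*% crossprod(A, q)     # (A t(A)) q without explicitly forming t(A)
    q <- qr.Q(qr(y))
  }
  b <- crossprod(q, A)             # t(q) %*% A, per the crossprod advice above
  s <- svd(b)
  list(u = q %*% s$u, d = s$d, v = s$v)
}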

Speed up the loop operation in R

I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates something (a simple operation). The data.frame has roughly 850K rows. My PC is still working (about 10 hours now) and I have no idea about the runtime.
dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    temp[i,10] <- i
    if (i > 1) {
      if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) {
        temp[i,10] <- temp[i,9] + temp[i-1,10]
      } else {
        temp[i,10] <- temp[i,9]
      }
    } else {
      temp[i,10] <- temp[i,9]
    }
  }
  names(temp)[names(temp) == "V10"] <- "Kumm."
  return(temp)
}
Any ideas how to speed up this operation?
The biggest problem and root of inefficiency is indexing the data.frame, I mean all these lines where you use temp[,].
Try to avoid this as much as possible. I took your function, changed the indexing, and here is version_A:
dayloop2_A <- function(temp){
  res <- numeric(nrow(temp))
  for (i in 1:nrow(temp)){
    res[i] <- i
    if (i > 1) {
      if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) {
        res[i] <- temp[i,9] + res[i-1]
      } else {
        res[i] <- temp[i,9]
      }
    } else {
      res[i] <- temp[i,9]
    }
  }
  temp$`Kumm.` <- res
  return(temp)
}
As you can see, I create a vector res which gathers the results. At the end I add it to the data.frame and I don't need to mess with names.
So how much better is it?
I ran each function for a data.frame with nrow from 1,000 to 10,000 by 1,000 and measured the time with system.time:
X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
system.time(dayloop2(X))
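The benchmark loop itself isn't shown; a sketch of what it might look like, using the functions defined above:
ns <- seq(1000, 10000, by = 1000)
times <- sapply(ns, function(n) {
  X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
  c(orig = system.time(dayloop2(X))["elapsed"],
    A    = system.time(dayloop2_A(X))["elapsed"])
})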
The result (shown as a plot in the original answer) is that your version's runtime grows exponentially with nrow(X). The modified version has a linear relation, and a simple lm model predicts that for 850,000 rows the computation takes 6 minutes and 10 seconds.
Power of vectorization
As Shane and Calimo state in their answers, vectorization is a key to better performance.
From your code you could move outside of the loop:
conditioning
initialization of the results (which are temp[i,9])
This leads to this code
dayloop2_B <- function(temp){
  cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
  res <- temp[,9]
  for (i in 1:nrow(temp)) {
    if (cond[i]) res[i] <- temp[i,9] + res[i-1]
  }
  temp$`Kumm.` <- res
  return(temp)
}
Compare the results for these functions, this time for nrow from 10,000 to 100,000 by 10,000 (again plotted in the original answer).
Tuning the tuned
Another tweak is to change the in-loop indexing from temp[i,9] to res[i] (which are exactly the same in the i-th loop iteration).
It's again the difference between indexing a vector and indexing a data.frame.
Second thing: when you look at the loop you can see that there is no need to loop over all i, but only over the ones that satisfy the condition.
So here we go
dayloop2_D <- function(temp){
  cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
  res <- temp[,9]
  for (i in (1:nrow(temp))[cond]) {
    res[i] <- res[i] + res[i-1]
  }
  temp$`Kumm.` <- res
  return(temp)
}
The performance you gain depends heavily on the data structure, precisely on the percentage of TRUE values in the condition.
For my simulated data, the computation for 850,000 rows takes below one second.
If you want to go further, I see at least two things which can be done:
write a C code to do conditional cumsum
if you know that the maximum run length in your data isn't large, then you can change the loop into a vectorized while, something like
n <- nrow(temp)   # n is not defined in the original snippet; it is the number of rows
while (any(cond)) {
  indx <- c(FALSE, cond[-1] & !cond[-n])
  res[indx] <- res[indx] + res[which(indx)-1]
  cond[indx] <- FALSE
}
Code used for simulations and figures is available on GitHub.
General strategies for speeding up R code
First, figure out where the slow part really is. There's no need to optimize code that isn't running slowly. For small amounts of code, simply thinking through it can work. If that fails, Rprof and similar profiling tools can be helpful.
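For example, a minimal Rprof run might look like this (the output file name is arbitrary):
Rprof("profile.out")                 # start the sampling profiler
result <- dayloop2(temp)             # run the slow code under the profiler
Rprof(NULL)                          # stop profiling
summaryRprof("profile.out")$by.self  # see where the time actually went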
Once you figure out the bottleneck, think about more efficient algorithms for doing what you want. Calculations should be only run once if possible, so:
Store the results and access them rather than repeatedly recalculating
Take non-loop-dependent calculations out of loops
Avoid calculations which aren't necessary (e.g. don't use regular expressions where fixed searches will do)
Using more efficient functions can produce moderate or large speed gains. For instance, paste0 produces a small efficiency gain but .colSums() and its relatives produce somewhat more pronounced gains. mean is particularly slow.
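As a small illustration of the difference (timings will vary by machine):
m <- matrix(rnorm(1e6), ncol = 100)
system.time(apply(m, 2, sum))               # generic loop over columns
system.time(colSums(m))                     # dedicated function, much faster
system.time(.colSums(m, nrow(m), ncol(m)))  # skips dispatch and checks entirely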
Then you can avoid some particularly common troubles:
cbind will slow you down really quickly.
Initialize your data structures, then fill them in, rather than expanding them each time.
Even with pre-allocation, you could switch to a pass-by-reference approach rather than a pass-by-value approach, but it may not be worth the hassle.
Take a look at the R Inferno for more pitfalls to avoid.
Try for better vectorization, which can often but not always help. In this regard, inherently vectorized commands like ifelse, diff, and the like will provide more improvement than the apply family of commands (which provide little to no speed boost over a well-written loop).
You can also try to provide more information to R functions. For instance, use vapply rather than sapply, and specify colClasses when reading in text-based data. Speed gains will be variable depending on how much guessing you eliminate.
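For instance (the CSV file name and column types below are placeholders):
x <- as.character(1:1e5)
system.time(sapply(x, nchar))                          # result type must be guessed
system.time(vapply(x, nchar, FUN.VALUE = integer(1)))  # type and length declared up front
# dat <- read.csv("big.csv", colClasses = c("integer", "numeric", "character"))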
Next, consider optimized packages: The data.table package can produce massive speed gains where its use is possible, in data manipulation and in reading large amounts of data (fread).
Next, try for speed gains through more efficient means of calling R:
Compile your R script (a byte-compilation sketch follows this list), or use the Ra and jit packages in concert for just-in-time compilation (Dirk has an example in this presentation).
Make sure you're using an optimized BLAS. These provide across-the-board speed gains. Honestly, it's a shame that R doesn't automatically use the most efficient library on install. Hopefully Revolution R will contribute the work that they've done here back to the overall community.
Radford Neal has done a bunch of optimizations, some of which were adopted into R Core, and many others which were forked off into pqR.
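A minimal byte-compilation sketch, assuming the loop-heavy function from the question:
library(compiler)                  # ships with base R
dayloop2_c <- cmpfun(dayloop2)     # byte-compile the function
system.time(dayloop2_c(temp))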
And lastly, if all of the above still doesn't get you quite as fast as you need, you may need to move to a faster language for the slow code snippet. The combination of Rcpp and inline here makes replacing only the slowest part of the algorithm with C++ code particularly easy. Here, for instance, is my first attempt at doing so, and it blows away even highly optimized R solutions.
If you're still left with troubles after all this, you just need more computing power. Look into parallelization (http://cran.r-project.org/web/views/HighPerformanceComputing.html) or even GPU-based solutions (gpu-tools).
Links to other guidance
http://www.noamross.net/blog/2013/4/25/faster-talk.html
If you are using for loops, you are most likely coding R as if it were C or Java or something else. R code that is properly vectorised is extremely fast.
Take for example these two simple bits of code to generate a vector of 100,000 integers in sequence:
The first code example shows how one would code a loop using a traditional coding paradigm. It takes 28 seconds to complete:
system.time({
a <- NULL
for(i in 1:1e5)a[i] <- i
})
user system elapsed
28.36 0.07 28.61
You can get an almost 100-times improvement by the simple action of pre-allocating memory:
system.time({
a <- rep(1, 1e5)
for(i in 1:1e5)a[i] <- i
})
user system elapsed
0.30 0.00 0.29
But with the base R vector operation using the colon operator :, this operation is virtually instantaneous:
system.time(a <- 1:1e5)
user system elapsed
0 0 0
This could be made much faster by skipping the loops by using indexes or nested ifelse() statements.
idx <- 1:nrow(temp)
temp[,10] <- idx
idx1 <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10]
temp[!idx1,10] <- temp[!idx1,9]
temp[1,10] <- temp[1,9]
names(temp)[names(temp) == "V10"] <- "Kumm."
As Ari mentioned at the end of his answer, the Rcpp and inline packages make it incredibly easy to make things fast. As an example, try this inline code (warning: not tested):
library(inline)   # provides cxxfunction() and getPlugin()
body <- 'Rcpp::NumericMatrix nm(temp);
int nrtemp = Rcpp::as<int>(nrt);
for (int i = 0; i < nrtemp; ++i) {
    nm(i, 9) = i;
    if (i > 0) {
        if (nm(i, 5) == nm(i - 1, 5) && nm(i, 2) == nm(i - 1, 2)) {
            nm(i, 9) = nm(i, 8) + nm(i - 1, 9);
        } else {
            nm(i, 9) = nm(i, 8);
        }
    } else {
        nm(i, 9) = nm(i, 8);
    }
}
return Rcpp::wrap(nm);
'
settings <- getPlugin("Rcpp")
# settings$env$PKG_CXXFLAGS <- paste("-I", getwd(), sep="") if you want to inc files in wd
dayloop <- cxxfunction(signature(nrt="numeric", temp="numeric"), body=body,
                       plugin="Rcpp", settings=settings, cppargs="-I/usr/include")
dayloop2 <- function(temp) {
    # extract a numeric matrix from temp (data.matrix turns factors into codes),
    # with an extra column for the result written by the C++ code
    m <- cbind(data.matrix(temp), 0)
    nm <- dayloop(nrow(temp), m)
    temp[, 10] <- nm[, 10]
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}
There's a similar procedure for #include-ing things, where you just pass a parameter
inc <- '#include <header.h>'
to cxxfunction, as include=inc. What's really cool about this is that it does all of the linking and compilation for you, so prototyping is really fast.
Disclaimer: I'm not totally sure that the class of tmp should be numeric and not numeric matrix or something else. But I'm mostly sure.
Edit: if you still need more speed after this, OpenMP is a parallelization facility good for C++. I haven't tried using it from inline, but it should work. The idea would be to, in the case of n cores, have loop iteration k be carried out by k % n. A suitable introduction is found in Matloff's The Art of R Programming, available here, in chapter 16, Resorting to C.
I dislike rewriting code... Also, of course, ifelse and lapply are better options, but sometimes it is difficult to make that fit.
Frequently I use data.frames as one would use lists, such as df$var[i].
Here is a made up example:
nrow=function(x){ ## required as I use nrow at times.
  if(class(x)=='list') {
    length(x[[names(x)[1]]])
  } else {
    base::nrow(x)
  }
}
system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
})
system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  d=as.list(d) # become a list
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  d=as.data.frame(d) # revert back to data.frame
})
data.frame version:
user system elapsed
0.53 0.00 0.53
list version:
user system elapsed
0.04 0.00 0.03
It is 17 times faster to use a list of vectors than a data.frame.
Any comments on why data.frames are internally so slow in this regard? One would think they operate like lists.
For even faster code, do class(d)='list' instead of d=as.list(d), and class(d)='data.frame' instead of converting back with as.data.frame():
system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  class(d)='list'
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  class(d)='data.frame'
})
head(d)
The answers here are great. One minor aspect not covered is that the question states "My PC is still working (about 10h now) and I have no idea about the runtime". I always put the following code into loops when developing, to get a feel for how changes seem to affect the speed and to monitor how long it will take to complete.
dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    cat(round(i/nrow(temp)*100,2),"% \r") # prints the percentage complete in realtime.
    # do stuff
  }
  return(blah)
}
Works with lapply as well.
dayloop2 <- function(temp){
  temp <- lapply(1:nrow(temp), function(i) {
    cat(round(i/nrow(temp)*100,2),"% \r")
    # do stuff
  })
  return(temp)
}
If the function within the loop is quite fast but the number of loop iterations is large, then consider just printing every so often, as printing to the console itself has an overhead. For example:
dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    if(i %% 100 == 0) cat(round(i/nrow(temp)*100,2),"% \r") # prints every 100 times through the loop
    # do stuff
  }
  return(temp)
}
In R, you can often speed up loop processing by using the apply family of functions (in your case, it would probably be replicate). Have a look at the plyr package, which provides progress bars.
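A small illustration of the progress-bar idea (the toy computation is just a placeholder):
library(plyr)
res <- laply(1:100, function(i) mean(rnorm(1e4)), .progress = "text")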
Another option is to avoid loops altogether and replace them with vectorized arithmetics. I'm not sure exactly what you are doing, but you can probably apply your function to all rows at once:
temp[2:nrow(temp), 10] <- temp[2:nrow(temp), 9] + temp[1:(nrow(temp)-1), 10]
This will be much much faster, and then you can filter the rows with your condition:
cond.i <- c(FALSE, (temp[-1, 6] == temp[-nrow(temp), 6]) & (temp[-1, 3] == temp[-nrow(temp), 3]))
temp[cond.i, 10] <- temp[cond.i, 9]
Vectorized arithmetic requires more time and thinking about the problem, but then you can sometimes save several orders of magnitude in execution time.
Take a look at the accumulate2() function from {purrr}:
library(dplyr)   # for %>%, as_tibble(), mutate(), lag(), select()
dayloop_accumulate <- function(temp) {
  temp %>%
    as_tibble() %>%
    mutate(cond = c(FALSE, (V6 == lag(V6) & V3 == lag(V3))[-1])) %>%
    mutate(V10 = V9 %>%
             purrr::accumulate2(.y = cond[-1], .f = function(.i_1, .i, .y) {
               if (.y) {
                 .i_1 + .i
               } else {
                 .i
               }
             }) %>% unlist()) %>%
    select(-cond)
}
Processing with data.table is a viable option:
n <- 1000000
df <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
colnames(df) <- paste("col", 1:9, sep = "")
library(data.table)
dayloop2.dt <- function(df) {
  dt <- data.table(df)
  dt[, Kumm. := {
    res <- .I;
    ifelse(res > 1,
           ifelse((col6 == shift(col6, fill = 0)) & (col3 == shift(col3, fill = 0)),
                  res <- col9 + shift(res)
                  , # else
                  res <- col9
           )
           , # else
           res <- col9
    )
  }
  , ]
  res <- data.frame(dt)
  return(res)
}
res <- dayloop2.dt(df)
library(microbenchmark)
m <- microbenchmark(dayloop2.dt(df), times = 100)
#Unit: milliseconds
# expr min lq mean median uq max neval
#dayloop2.dt(df) 436.4467 441.02076 578.7126 503.9874 575.9534 966.1042 10
If you ignore the possible gains from conditions filtering, it is very fast. Obviously, if you can do the calculation on the subset of data, it helps.
