Avoiding row-wise processing of data.frame in R - performance

I was wondering what the best way is to avoid row-wise processing in R, so that most of the row-wise work is done in internal C routines. For example, I have a data frame a:
chromosome_name start_position end_position strand
1 15 35574797 35575181 1
2 15 35590448 35591641 -1
3 15 35688422 35688645 1
4 13 75402690 75404217 1
5 15 35692892 35693969 1
What I want is a startOFgene column set to start_position or end_position, depending on whether strand is positive or negative. One way to avoid a for loop would be to split the data.frame into +1-strand and -1-strand subsets and perform the selection on each (a sketch of this is shown below). What other ways are there to speed this up? That method does not scale if there is other, more complicated processing to do per row.
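For reference, a minimal sketch of the split-and-recombine idea mentioned above (my illustration, not part of the original question), assuming a is the data frame shown:
pos <- a[a$strand == 1, ]
neg <- a[a$strand == -1, ]
pos$startOFgene <- pos$start_position
neg$startOFgene <- neg$end_position
a2 <- rbind(pos, neg)  # note: this changes the row order; reorder afterwards if that matters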

Maybe this is fast enough...
transform(a, startOFgene = ifelse(strand == 1, start_position, end_position))
chromosome_name start_position end_position strand startOFgene
1 15 35574797 35575181 1 35574797
2 15 35590448 35591641 -1 35591641
3 15 35688422 35688645 1 35688422
4 13 75402690 75404217 1 75402690
5 15 35692892 35693969 1 35692892

First, since all your columns are integer/numeric, you could use a matrix instead of a data.frame. Many operations on a matrix are a lot faster than the same operation on a data.frame, even though they're not very different in this case. Then you can use logical subsetting to create the startOFgene column.
# Create some large-ish data
M <- do.call(rbind,replicate(1e3,as.matrix(a),simplify=FALSE))
M <- do.call(rbind,replicate(1e3,M,simplify=FALSE))
A <- as.data.frame(M)
# Create startOFgene column in a matrix
m <- function() {
  M <- cbind(M, startOFgene=M[,"start_position"])
  negStrand <- sign(M[,"strand"]) < 0
  M[negStrand,"startOFgene"] <- M[negStrand,"end_position"]
}
# Create startOFgene column in a data.frame
d <- function() {
  A$startOFgene <- A$start_position
  negStrand <- sign(A$strand) < 0
  A$startOFgene[negStrand] <- A$end_position[negStrand]
}
library(rbenchmark)
benchmark(m(), d(), replications=10)[,1:6]
# test replications elapsed relative user.self sys.self
# 2 d() 10 18.804 1.000 16.501 2.224
# 1 m() 10 19.713 1.048 16.457 3.152

Related

Cumulative product of matrix?

I need to write a code to calculate cumulative product of a matrix.
For example, if
A = 1 2 3
    4 3 2
then
cum.prod(A) = 1  2   6
              4 24 144
Is there any good algorithm for doing this?
I'll use R, C, Matlab or Octave.
A <- matrix(c(1,2,3,4,3,2),byrow=TRUE,nrow=2)
I'm guessing you want the cumulative product of all (k,l) less than (i,j) ... ?
B <- A
nr <- nrow(B)
nc <- ncol(B)
for (i in 1:max(nr, nc)) {
  if (i <= nr) B[i, ] <- cumprod(B[i, ])  # cumulative product along row i
  if (i <= nc) B[, i] <- cumprod(B[, i])  # cumulative product down column i
}
This works for your example: you might have to be a little careful generalizing it to a case with more rows than columns ...
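A loop-free alternative (my sketch, not from the original answer), assuming the goal is the product over the leading submatrix ending at each position: take cumulative products down the columns, then across the rows. apply() returns its per-row results as columns, hence the final t().
cum.prod <- function(A) t(apply(apply(A, 2, cumprod), 1, cumprod))
A <- matrix(c(1, 2, 3, 4, 3, 2), byrow = TRUE, nrow = 2)
cum.prod(A)
#      [,1] [,2] [,3]
# [1,]    1    2    6
# [2,]    4   24  144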

R - Improving the performance of a simple loop

I'm a beginner with R, so I'm having trouble thinking of things the "R way"...
I have this function:
upOneRow <- function(table, column) {
  for (i in 1:(nrow(table) - 1)) {
    table[i, column] <- table[i + 1, column]
  }
  return(table)
}
It seems simple enough, and shouldn't take that long to run, but on a dataframe with ~300k rows, the time it takes to run is unreasonable. What is the right way to approach this?
Instead of the loop you could try something like this:
n <- nrow(table)
table[(1:(n-1)), column] <- table[(2:n), column];
Vectorizing is the key. A complete drop-in replacement for the function is sketched below.
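For completeness, here is that idea wrapped in the original function's interface (a minimal sketch; like the loop version, the last row keeps its original value):
upOneRow <- function(table, column) {
  n <- nrow(table)
  table[1:(n - 1), column] <- table[2:n, column]
  table
}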
Simple answer: Columns in a data.frame are also vectors which can be indexed with [,]
my.table <- data.frame(x = 1:5, y = 5:1)
> my.table
x y
1 1 5
2 2 4
3 3 3
4 4 2
5 5 1
my.table$y <-c(my.table[-1,"y"],NA) #move up one spot and pad with NA
> my.table
x y
1 1 4
2 2 3
3 3 2
4 4 1
5 5 NA
Note that your function repeats the last data point at the end. If that is really what you want, pad with tail(x, 1) instead of NA.
my.table$y <-c(my.table[-1,"y"],tail(my.table$y,1)) #pad with tail(x,1)
> my.table
x y
1 1 4
2 2 3
3 3 2
4 4 1
5 5 1
If I understand you correctly, you're trying to "move up" one column of a data frame, with the first element wrapping around to the bottom. That can be achieved as:
col <- table[, column]
table[, column] <- col[c(2:nrow(table), 1)]

Why is this so slow? (loop assigning into a data.frame column vs. a standalone vector)

I have a piece of code whose total elapsed time is around 30 seconds, of which the following accounts for about 27 seconds. I narrowed the offending code down to this:
d$dis300[i] <- h
So I changed to the other piece below, which now runs really fast (as expected).
My question is why the first version is so slow compared to the second. The data frame is around 7500 rows by 18 variables.
First: (27 sec elapsed)
d$dis300 <- 0
for (i in 1:netot) {
  h <- aaa[d$ent[i], d$dis[i]]
  if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
  d$dis300[i] <- h
}
Second: (0.2 sec elapsed)
d$dis300 <- 0
for (i in 1:netot) {
  h <- aaa[d$ent[i], d$dis[i]]
  if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
  foo[i] <- h
}
d$foo <- foo
You can see both are essentially the "same", but the slow one assigns into the data frame column instead of a standalone vector.
Any comments are really appreciated. I come from other kinds of languages and this drove me nuts for a while. At least I have a solution, but I would like to prevent this kind of issue in the future.
Thanks for your time.
The reason is that d$dis300[i] <- h calls $<-.data.frame.
It's a rather complex function, as you can see by printing it:
`$<-.data.frame`
You don't say what foo is, but if it is an atomic vector, then foo[i] <- h uses the [<- primitive, which is implemented in C for speed.
Still, I hope you declare foo as follows:
foo <- numeric(netot)
This will ensure you don't need to reallocate the vector for each assignment in the loop:
foo <- 0 # BAD!
system.time( for(i in 1:5e4) foo[i] <- 0 ) # 4.40 secs
foo <- numeric(5e4) # Pre-allocate
system.time( for(i in 1:5e4) foo[i] <- 0 ) # 0.09 secs
Using the *apply family instead you don't worry about that:
d$foo <- vapply(1:netot, function(i, aaa, ent, dis) {
  h <- aaa[ent[i], dis[i]]
  if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", ent[i], dis[i]))
  h
}, numeric(1), aaa=aaa, ent=d$ent, dis=d$dis)
...here I also extracted d$ent and d$dis outside the loop which should improve things a bit too. Can't run it myself though since you didn't give reproducible data. But here's a similar example:
d <- data.frame(x=1)
system.time( vapply(1:1e6, function(i) d$x, numeric(1)) ) # 3.20 secs
system.time( vapply(1:1e6, function(i, x) x, numeric(1), x=d$x) ) # 0.56 secs
... but finally it seems it can all be reduced to (barring your error detection code):
d$foo <- aaa[cbind(d$ent, d$dis)]
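For reference, a small illustration (my example, not from the original answer) of why that one-liner works: indexing a matrix with a two-column matrix of (row, column) pairs returns one element per pair.
m <- matrix(1:12, nrow = 3)
m[cbind(c(1, 3, 2), c(2, 4, 1))]  # c(m[1, 2], m[3, 4], m[2, 1]) = c(4, 12, 2)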
Tommy's is the best answer. This was too big for a comment, so I'm adding it as an answer...
This is how you can see the copies (of the whole DF, as joran commented) for yourself:
> DF = data.frame(a=1:3,b=4:6)
> tracemem(DF)
[1] "<0x0000000003104800"
> for (i in 1:3) {DF$b[i] <- i; .Internal(inspect(DF))}
tracemem[0000000003104800 -> 000000000396EAD8]:
tracemem[000000000396EAD8 -> 000000000396E4F0]: $<-.data.frame $<-
tracemem[000000000396E4F0 -> 000000000399CDC8]: $<-.data.frame $<-
#000000000399CDC8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
#000000000399CD90 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
#000000000399CCE8 13 INTSXP g0c2 [] (len=3, tl=0) 1,5,6
ATTRIB: # .. snip ..
tracemem[000000000399CDC8 -> 000000000399CC40]:
tracemem[000000000399CC40 -> 000000000399CAB8]: $<-.data.frame $<-
tracemem[000000000399CAB8 -> 000000000399C9A0]: $<-.data.frame $<-
#000000000399C9A0 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
#000000000399C968 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
#000000000399C888 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,6
ATTRIB: # .. snip ..
tracemem[000000000399C9A0 -> 000000000399C7E0]:
tracemem[000000000399C7E0 -> 000000000399C700]: $<-.data.frame $<-
tracemem[000000000399C700 -> 00000000039C78D8]: $<-.data.frame $<-
#00000000039C78D8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
#00000000039C78A0 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
#0000000003E07890 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
ATTRIB: # .. snip ..
> DF
a b
1 1 1
2 2 2
3 3 3
Each of those tracemem[] lines corresponds to a copy of the whole object. You can see the hex addresses of the a column vector changing each time, too, despite it not being modified by the assignment to b.
AFAIK, without dropping into C code yourself, the only way (currently) in R to modify an item of a data.frame with no copying at all is the := operator and the set() function, both in package data.table. There are 17 questions about assigning by reference using := here on Stack Overflow.
But in this case Tommy's one-liner is definitely best, as you don't even need a loop at all.
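For illustration, a minimal sketch (mine, not from the original answer) of doing the same column update by reference with data.table's set(), which avoids the copies traced above:
library(data.table)
DT <- data.table(a = 1:3, b = 4:6)
for (i in 1:3) set(DT, i = i, j = "b", value = i)  # modifies DT in place; no copy of DT is made
DT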

Speeding up reshaping person to period-format dataframe in R

I have a dataset with longitudinal data in a person-oriented format, as such:
pid varA_1 varB_1 varA_2 varB_2 varA_3 varB_3 ...
1 1 1 0 3 2 1
2 0 1 0 2 2 1
...
50k 1 0 1 3 1 0
This results in a large dataframe, with minimum 50k observations and 90 variables measured for up to 29 periods.
I would like to get a more period-oriented format, as such:
pid index start stop varA varB varC ...
1 1 ...
1 2
...
1 29
2 1
I have tried different approaches for reshaping the dataframe (*apply, plyr, reshape2, loops, appending vs. prefilling all numeric matrices, etc.), but do not seem to get a decent processing time (40+ minutes for subsets). I have picked up various hints along the way on what to avoid, but I'm still not sure whether I'm missing a bottleneck or a possible speedup.
Is there an optimal way to approach this kind of data processing, so that I can evaluate the best-case processing time achievable in pure R code? There have been similar questions on Stack Overflow, but they did not result in convincing answers...
First, let's build the data example (I am using 5e3 instead of 50e3 to avoid memory problems with my configuration):
nObs <- 5e3
nVar <- 90
nPeriods <- 29
dat <- matrix(rnorm(nObs*nVar*nPeriods), nrow=nObs, ncol=nVar*nPeriods)
df <- data.frame(id=seq_len(nObs), dat)
nmsV <- paste('Var', seq_len(nVar), sep='')
nmsPeriods <- paste('T', seq_len(nPeriods), sep='')
nms <- c(outer(nmsV, nmsPeriods, paste, sep='_'))
names(df)[-1] <- nms
And now with stats::reshape you change the format:
df2 <- reshape(df, dir = "long", varying = 2:((nVar*nPeriods)+1), sep = "_")
I am not sure if this is the fast solution you are looking for.
The well-aged stack() function can be very fast, if things fit into memory.
For large sets, using a (transparent) SQLite database as an intermediate is best. Try Gabor's package sqldf; there are many examples on Google Code:
http://code.google.com/p/sqldf/

What are some good ways to calculate a score for how difference or close 2 users choices are?

For example, suppose it is a choice among chocolate, ice cream, donuts, ..., ranked in order of preference.
If user 1 chooses
A B C D E F G H I J
and user 2 chooses
J A B C I G F E D H
what are some good ways to calculate a score from 0 to 100 that tells how close their choices are? The score has to make sense: if most answers are the same and only 1 or 2 differ, the score should not come out extremely low. Likewise, if most answers are merely shifted by one position, we should not count them as all different and give a score of 0 for those one-position differences.
Assign each letter item an integer value starting at 1
A=1, B=2, C=3, D=4, E=5, F=6 (stopping at F for simplicity)
Then consider the position each item is placed in and use this as a multiplier.
So if an item is in the first position, its multiplier is 1; if it is in the 6th position, the multiplier is 6.
Figure out the maximum score you could have (basically when everything is in consecutive order)
item a b c d e f
order 1 2 3 4 5 6
value 1 2 3 4 5 6
score 1 4 9 16 25 36 Sum = 91, Score = 100% (MAX)
item a b d c e f
order 1 2 3 4 5 6
value 1 2 4 3 5 6
score 1 4 12 12 25 36 Sum = 90 Score = 99%
=======================
order 1 2 3 4 5 6
item f d b c e a
value 6 4 2 3 5 1
score 6 8 6 12 25 6 Sum = 63 Score = 69%
order 1 2 3 4 5 6
item d f b c e a
value 4 6 2 3 5 1
score 4 12 6 12 25 6 Sum = 65 Score = 71%
Obviously this is a very crude scheme that I just came up with, and it may not work for everything. Examples 3 and 4 differ by a single adjacent swap, yet their scores differ by 2% (versus examples 1 and 2, which differ by 1%). It's just a thought; I'm no algorithm expert. You could probably take the final number and transform it further for a better numerical comparison. A small R sketch of this scheme follows.
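For concreteness, a minimal R sketch of the scheme above (the function name and its internals are my own, not from the original answer):
closeness_score <- function(reference, candidate) {
  value <- match(candidate, reference)        # item value = its position in the reference order
  raw <- sum(seq_along(candidate) * value)    # multiply each value by the position it was placed in
  max_raw <- sum(seq_along(reference)^2)      # maximum possible score: candidate == reference
  100 * raw / max_raw
}
closeness_score(letters[1:6], c("a", "b", "d", "c", "e", "f"))  # ~99, as in example 2
closeness_score(letters[1:6], c("f", "d", "b", "c", "e", "a"))  # ~69, as in example 3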
You could:
1. calculate the edit distance between the sequences;
2. subtract the edit distance from the sequence length;
3. divide that by the length of the sequence;
4. multiply it by one hundred.
Score = 100 * (SequenceLength - Levenshtein(Sequence1, Sequence2)) / SequenceLength
Edit distance is basically the number of operations required to transform sequence one into sequence two. One such algorithm is the Levenshtein distance algorithm.
Examples:
Weights
insert: 1
delete: 1
substitute: 1
Seq 1: ABCDEFGHIJ
Seq 2: JABCIGFEDH
Score = 100 * (10-7) / 10 = 30
Seq 1: ABCDEFGHIJ
Seq 2: ABDCFGHIEJ
Score = 100 * (10-3) / 10 = 70
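A minimal R sketch of this formula (my illustration, assuming each item can be encoded as a single character); base R's adist() computes the Levenshtein distance between two strings:
similarity <- function(s1, s2) {
  100 * (nchar(s1) - drop(adist(s1, s2))) / nchar(s1)  # score in [0, 100]
}
similarity("ABCDEFGHIJ", "JABCIGFEDH")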
The most straightforward way to calculate it is the Levenshtein distance, which is the number of changes that must be made to transform one string into another.
A disadvantage of the Levenshtein distance for your task is that it doesn't measure closeness between the products themselves, i.e. you will not know how close A and J are to each other. For example, user 1 may like donuts and user 2 may like buns, and you may know that most people who like the first also like the second. From this information you can infer that user 1's choices are close to user 2's, even though they don't share the same elements.
If this is your case, you will have to use one of two things: statistical methods to infer the correlation between choices, or a recommendation engine.
