How can I speed up the following (noob) code:
#"mymatrix" is the matrix of word counts (docs X terms)
#"tfidfmatrix" is the transformed matrix
tfidfmatrix = Matrix(mymatrix, nrow=num_of_docs, ncol=num_of_words, sparse=T)
#Apply a transformation on each row of the matrix
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfidfmatrix[i,] = s/sqrt(sum(s^2))
}
return (tfidfmatrix)
Problem is that the matrices I am working on are fairly large (~40kX100k), and this code is very slow.
The reason I am not using "apply" (instead of using a for loop and sapply) is that apply is going to give me the transpose of the matrix I want - I want num_of_docs X num_of_words, but apply will give me the transpose. I will then have to spend more time computing the transpose and re-allocating it.
Any thoughts on making this faster?
Thanks much.
Edit : I have found that the suggestions below greatly speed up my code (besides making me feel stupid). Any suggestions on where I can learn to write "optimized" R code from?
Edit 2: OK, so something is not right. Once I do s.vec[!is.finite(s.vec)] <- 0, every element of s.vec is set to 0. To reiterate, my original matrix is a sparse matrix containing integers, so this seems to be due to some quirk of the Matrix package I am using. When I do s.vec[which(s.vec==-Inf)] <- 0, things work as expected. Thoughts?
As per my comment,
#Slightly larger example data
mymatrix <- matrix(runif(10000),nrow=10)
mymatrix[sample(10000,100)] <- 0
tfmat <- matrix(nrow=10, ncol=1000)
ndocs <- 1
justin <- function(){
s.vec <- ifelse(mymatrix==0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}
joran <- function(){
s.vec <- (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
s.vec[!is.finite(s.vec)] <- 0
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}
require(rbenchmark)
benchmark(justin(),joran(),replications = 1000)
      test replications elapsed relative user.self sys.self user.child sys.child
2  joran()         1000   0.940  1.00000     0.842    0.105          0         0
1 justin()         1000   2.786  2.96383     2.617    0.187          0         0
So it's around 3x faster or so.
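Since the question mentions a sparse matrix from the Matrix package, here is a minimal sketch of the same transformation applied only to the stored non-zero entries. It assumes the counts are held in a dgCMatrix (so the non-zero values live in the @x slot) and that ndocs is the number of documents; the toy matrix and the name tfidf_sparse are made up for illustration.

```r
library(Matrix)  # recommended package shipped with standard R installations

# Toy docs x terms count matrix; the values are invented for illustration
counts <- Matrix(c(0, 2, 0, 1,
                   3, 0, 0, 4,
                   0, 0, 0, 0), nrow = 3, byrow = TRUE, sparse = TRUE)
ndocs <- nrow(counts)  # assumption: ndocs is the number of documents

tfidf_sparse <- function(m, ndocs) {
  # Transform only the stored non-zero counts; zeros stay implicit,
  # so log(0) is never evaluated
  m@x <- (1 + log(m@x)) * log((1 + ndocs) / (1 + m@x))
  # Euclidean norm of each row; all-zero rows get norm 1 to avoid 0/0
  norms <- sqrt(rowSums(m^2))
  norms[norms == 0] <- 1
  # Scale each row by multiplying with a diagonal matrix (result stays sparse)
  Diagonal(x = 1 / norms) %*% m
}

tfidf <- tfidf_sparse(counts, ndocs)
tfidf
```

Because only the non-zero entries are touched, no -Inf values are created in the first place, so the is.finite() clean-up step is not needed in this variant.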
I'm not sure what ndocs is, but ifelse is already vectorized, so you should be able to apply it to the whole matrix at once, without walking through the matrix row by row and calling sapply along each row. The same can be said for the final calculation.
However, you haven't given a complete example to replicate...
mymatrix <- matrix(runif(100),nrow=10)
tfmat <- matrix(nrow=10, ncol=10)
ndocs <- 1
s.vec <- ifelse(mymatrix==0, 0, 1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] <- s
}
all.equal(s.vec, tfmat)
so the only piece missing is the rowSums in your final calc.
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] = s/sqrt(sum(s^2))
}
all.equal(tfmat, tfmat.vec)
I am trying to write a Monte Carlo estimate of pi in R, but I have some problems in the code.
This is what I have so far:
ploscinaKvadrata <- 0
ploscinaKroga <- 0
n = 1000
for (i in i:n) {
x <- runif(1000, min= -1, max= 1)
y <- runif(1000, min= -1, max= 1)
if ((x^2 + y^2) <= 1) {
ploscinaKroga <- ploscinaKroga + 1
} else {
ploscinaKvadrata <- ploscinaKvadrata + 1
}
izracunPi = 4* ploscinaKroga/ploscinaKvadrata
}
izracunPi
This is not working, but I don't know how to fix it.
I would also like to write code to plot this (with the circle inside the square and with dots).
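Here is a vectorized version. There was also something wrong with your math: the estimate is 4 times the fraction of all points that land inside the circle, not the ratio of points inside to points outside.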
N <- 1000000
R <- 1
x <- runif(N, min= -R, max= R)
y <- runif(N, min= -R, max= R)
is.inside <- (x^2 + y^2) <= R^2
pi.estimate <- 4 * sum(is.inside) / N
pi.estimate
# [1] 3.141472
As far as plotting the points, you can do something like this:
plot.new()
plot.window(xlim = 1.1 * R * c(-1, 1), ylim = 1.1 * R * c(-1, 1))
points(x[ is.inside], y[ is.inside], pch = '.', col = "blue")
points(x[!is.inside], y[!is.inside], pch = '.', col = "red")
but I'd recommend you use a smaller N value, maybe 10000.
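If you also want the circle and the square drawn on top of the dots, a sketch along these lines should work. It uses only base graphics; the names theta, circle.x and circle.y are mine, and N is set to the smaller value suggested above.

```r
set.seed(1)
N <- 10000
R <- 1
x <- runif(N, min = -R, max = R)
y <- runif(N, min = -R, max = R)
is.inside <- (x^2 + y^2) <= R^2

# Points along the circle outline, one per degree
theta <- seq(0, 2 * pi, length.out = 361)
circle.x <- R * cos(theta)
circle.y <- R * sin(theta)

# asp = 1 keeps the circle round; dots are blue inside, red outside
plot(x, y, pch = ".", asp = 1, xlab = "x", ylab = "y",
     col = ifelse(is.inside, "blue", "red"))
rect(-R, -R, R, R, border = "black")      # the square
lines(circle.x, circle.y, col = "black")  # the circle

4 * mean(is.inside)  # the pi estimate for these points
```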
This is a fun game -- and there are a number of versions of it floating around the web. Here's one I hacked from the named source (tho' his code was somewhat naive).
from http://giventhedata.blogspot.com/2012/09/estimating-pi-with-r-via-mcs-dart-very.html
est.pi <- function(n){
# drawing in [0,1] x [0,1] covers one quarter of square and circle
# draw random numbers for the coordinates of the "dart-hits"
a <- runif(n,0,1)
b <- runif(n,0,1)
# use the pythagorean theorem
c <- sqrt((a^2) + (b^2) )
inside <- sum(c<1)
#outside <- n-inside
pi.est <- inside/n*4
return(pi.est)
}
I am trying to clean data using ddply but it is running very slowly on 1.3M rows.
Sample code:
#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
#Are values negative?
# Needed for subsetting in ddply rows with same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
I have written a function to remove rows where there is a consumption value in one row that is identical but negative to a consumption value in another row (for the same id).
#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one.
clean_negatives <- function(x3){
copies <- abs(sum(x3$Neg))
sgn <- ifelse(sum(x3$Neg) <0, -1, 1)
x3 <- x3[0:copies,]
x3$Consumption <- sgn*x3$Consumption
x3$Neg <- NULL
x3}
I then use ddply to apply that function to remove these erroneous rows in the data
ptm <- proc.time()
df_cleaned <- ddply(df, .(id,StartDate, EndDate, Consumption),
function(x){clean_negatives(x)})
proc.time() - ptm
I was hoping I could use data.table to make this go faster but I couldn't work out how to employ data.table to help.
With 1.3M rows, so far it is taking my desktop all day to compute and still hasn't finished.
Your question asks about data.table implementation. So, I've shown it here. Your function could be drastically simplified as well. You can first get the sign by summing up Neg and then filter the table and then multiply Consumption by sign (as shown below).
require(data.table)
# get the data.table in dt
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modifying for final output (line commented out after Statquant's comment)
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
Benchmarking
The data with million rows:
#Create Sample Data Frame
num_rows <- 1e6
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
The data.table function:
FUN.DT <- function() {
require(data.table)
dt <- data.table(df, key=c("id", "StartDate", "EndDate", "Consumption"))
dt <- dt[, sign := sign(sum(Neg)),
by = c("id", "StartDate", "EndDate", "Consumption")]
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
by=c("id", "StartDate", "EndDate", "Consumption")]
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
}
Your function with ddply
FUN.PLYR <- function() {
require(plyr)
clean_negatives <- function(x3) {
copies <- abs(sum(x3$Neg))
sgn <- ifelse(sum(x3$Neg) <0, -1, 1)
x3 <- x3[0:copies,]
x3$Consumption <- sgn*x3$Consumption
x3$Neg <- NULL
x3
}
df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
function(x) clean_negatives(x))
}
Benchmarking with rbenchmark (with 1 run only)
require(rbenchmark)
benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")
        test replications elapsed relative user.self sys.self user.child sys.child
1   FUN.DT()            1   6.137    1.000     5.926    0.211          0         0
2 FUN.PLYR()            1 242.268   39.477   152.855   82.881          0         0
My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine to your implementation because the functions are different).
Note: I loaded the packages within the function in order to obtain the complete time to obtain the result. Also, for the same reason I converted the data.frame to data.table with keys inside the benchmarking function. This is therefore the minimum speed-up.
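For comparison, the same idea (net sign per group, then keep abs(sum(Neg)) rows) can be sketched in base R without any per-group function calls. This is not the data.table code above but a swapped-in base R variant; the function name clean_negatives_base and the toy data are made up for illustration, and unlike ddply the output is ordered by first appearance of each group rather than sorted by the grouping columns.

```r
clean_negatives_base <- function(df) {
  # One key per (id, StartDate, EndDate, Consumption) group
  key <- paste(df$id, df$StartDate, df$EndDate, df$Consumption, sep = "|")
  # Net count of positive minus negative rows, in order of first appearance
  net <- as.vector(rowsum(df$Neg, key, reorder = FALSE))
  # One representative row per group, in the same order as `net`
  groups <- df[!duplicated(key), c("id", "StartDate", "EndDate", "Consumption")]
  # Repeat each group abs(net) times and restore the sign of the survivors
  keep <- rep(seq_len(nrow(groups)), times = abs(net))
  out <- groups[keep, ]
  out$Consumption <- out$Consumption * rep(sign(net), times = abs(net))
  rownames(out) <- NULL
  out
}

# Toy data: id 1 has two +5 and one -5, id 2 has a single -3, id 3 cancels out
toy <- data.frame(id = c(1, 1, 1, 2, 3, 3),
                  Consumption = c(5, 5, 5, 3, 2, 2),
                  StartDate = as.Date("2011-01-01"),
                  Neg = c(1, 1, -1, -1, 1, -1))
toy$EndDate <- toy$StartDate + 90
res <- clean_negatives_base(toy)
res
```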
If I have a function rand1() that returns 0 with 30% probability and 1 with 70% probability, how can I use it to write a function rand2() that returns 0 or 1 with equal probability?
Update:
I eventually found that this is a problem from the book Introduction to Algorithms (2nd edition; I bought the Chinese edition), Exercise 5.1-3. The original problem is:
5.1-3
Suppose that you want to output 0 with probability 1/2 and 1 with probability 1/2.
At your disposal is a procedure BIASED-RANDOM, that outputs either 0 or 1. It
outputs 1 with some probability p and 0 with probability 1− p, where 0 < p < 1,
but you do not know what p is. Give an algorithm that uses BIASED-RANDOM
as a subroutine, and returns an unbiased answer, returning 0 with probability 1/2
and 1 with probability 1/2. What is the expected running time of your algorithm
as a function of p?
the solution is :
(see: http://www.cnblogs.com/meteorgan/archive/2012/05/04/2482317.html)
To get an unbiased random bit, given only calls to BIASED-RANDOM, call
BIASED-RANDOM twice. Repeatedly do so until the two calls return different
values, and when this occurs, return the first of the two bits:
UNBIASED-RANDOM
while TRUE
do
x ← BIASED-RANDOM
y ← BIASED-RANDOM
if x != y
then return x
To see that UNBIASED-RANDOM returns 0 and 1 each with probability 1/2, observe
that the probability that a given iteration returns 0 is
Pr {x = 0 and y = 1} = (1 − p)p ,
and the probability that a given iteration returns 1 is
Pr {x = 1 and y = 0} = p(1 − p) .
(We rely on the bits returned by BIASED-RANDOM being independent.) Thus, the
probability that a given iteration returns 0 equals the probability that it returns 1.
Since there is no other way for UNBIASED-RANDOM to return a value, it returns 0
and 1 each with probability 1/2.
Generate two numbers, a and b.
If a is 0 and b is 1 (21% chance), generate a 0.
If a is 1 and b is 0 (21% chance), generate a 1.
For all other cases (58% chance), just generate a new a and b and try again.
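In R, the procedure described above could be sketched like this. The rand1 here is only a stand-in for your biased source (built on runif with the 70% figure from the question); rand2 never needs to know that number.

```r
# Stand-in for the biased source: returns 1 with probability 0.7, else 0
rand1 <- function() as.integer(runif(1) < 0.7)

# Draw pairs until the two bits differ, then return the first bit
rand2 <- function() {
  repeat {
    a <- rand1()
    b <- rand1()
    if (a != b) return(a)
  }
}

set.seed(123)
draws <- replicate(20000, rand2())
mean(draws)  # should be close to 0.5
```

Each round succeeds with probability 2 * 0.7 * 0.3 = 0.42, so on average it takes about 1/0.42, or roughly 2.4, rounds (about 4.8 calls to rand1) to produce one output bit.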
If you call rand1 twice, there is an equal chance of getting [1 0] and [0 1], so if you return the first of each non-matching pair (and discard matching pairs) you will get, on average, 0.5(1 - p^2 - (1-p)^2) output bits per input bit (where p is the probability of rand1 returning 1; 0.7 in your example) and independently of p, each output bit will be 1 with probability 0.5.
However, we can do better.
Rather than throw away the matching pairs, we can remember them in the hope that they are followed by opposite matching pairs - The sequences [0 0 1 1] and [1 1 0 0] are also equally likely, and again we can return the first bit whenever we see such a sequence (still with output probability 0.5.) We can keep combining them indefinitely, looking for sequences like [0 0 0 0 1 1 1 1] etc.
And we can go even further: as it stands, the input sequences [0 0 0 1] and [0 1 0 0] produce the same output ([0]), but these two sequences are also equally likely, so we can extract an extra bit of output from this, returning [0 0] for the first case and [0 1] for the second. This is where it gets more complicated though, as you would need to start buffering output bits.
Both techniques can be applied recursively, and taken to the limit it becomes lossless (i.e. if rand1 has a probability of 0.5, you get an average of one output bit per input bit.)
Full description (with math) here: http://www.eecs.harvard.edu/~michaelm/coinflipext.pdf
You will need to figure out how close you want to get to a 50%/50% split between 0 and 1.
Add up the results of repeated calls to rand1 and take the sum modulo 2: if the sum is even (0 or 2), return 0; if it is odd (1), return 1.
int val = rand1(); // prob 30% 0, and 70% 1
val=(val+rand1())%2; // prob 58% 0, and 42% 1 (#1 see math below)
val=(val+rand1())%2; // prob 46.8% 0, and 53.2% 1 (#2 see math below)
val=(val+rand1())%2; // prob 51.28% 0, and 48.72% 1
val=(val+rand1())%2; // prob 49.488% 0, and 50.512% 1
val=(val+rand1())%2; // prob 50.2048% 0, and 49.7952% 1
You get the idea. It is up to you to decide how close you want the probabilities to be. Every additional call gets you closer to 50%/50%, but it will never be exactly equal.
If you want the math for the probabilities:
#1
prob((val+rand1())%2 = 0) = (prob(val = 0)*prob(rand1() = 0)) + (prob(val = 1)*prob(rand1() = 1))
= (0.3*0.3)+(0.7*0.7)
= 0.09 + 0.49
= 0.58
= 58%
prob((val+rand1())%2 = 1) = (prob(val = 1)*prob(rand1() = 0)) + (prob(val = 0)*prob(rand1() = 1))
= (0.7*0.3)+(0.3*0.7)
= 0.21 + 0.21
= 0.42
= 42%
#2
prob((val+rand1())%2 = 0) = (prob(val = 0)*prob(rand1() = 0)) + (prob(val = 1)*prob(rand1() = 1))
= (0.58*0.3)+(0.42*0.7)
= 0.174 + 0.294
= 0.468
= 46.8%
prob((val+rand1())%2 = 1) = (prob(val = 1)*prob(rand1() = 0)) + (prob(val = 0)*prob(rand1() = 1))
= (0.42*0.3)+(0.58*0.7)
= 0.126 + 0.406
= 0.532
= 53.2%
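The same arithmetic can be checked numerically. The sketch below iterates the recurrence used in the steps above and compares it with the closed form (1 + (1 - 2q)^n)/2 for the probability that the sum of n calls is even, where q = 0.7 is the probability that rand1 returns 1; the variable names are mine.

```r
q <- 0.7          # probability that rand1() returns 1
n_calls <- 1:6    # number of rand1() calls added together

# Iterate: P(val = 0) after one more call to rand1()
p0 <- numeric(length(n_calls))
p0[1] <- 1 - q
for (k in 2:length(n_calls)) {
  p0[k] <- p0[k - 1] * (1 - q) + (1 - p0[k - 1]) * q
}

# Closed form: probability of an even number of ones in n calls
closed <- (1 + (1 - 2 * q)^n_calls) / 2

round(cbind(n_calls, p0, closed), 6)
```

The distance from 0.5 shrinks by a factor of |1 - 2q| = 0.4 with every extra call, which is why the sequence converges quickly but never reaches exactly 50%.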
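The rand2 function below returns zero and one equally often: within every block of LIMIT_TO_CALCULATE_PROBABILITY calls it returns exactly half zeros and half ones. Note that this balances the counts but makes successive outputs dependent on each other.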
#define LIMIT_TO_CALCULATE_PROBABILITY 10 //set any even numbers
int rand2()
{
static int one_occured = 0;
static int zero_occured = 0;
int rand_value = 0;
int limit = (LIMIT_TO_CALCULATE_PROBABILITY / 2);
if (LIMIT_TO_CALCULATE_PROBABILITY == (one_occured + zero_occured))
{
one_occured = 0;
zero_occured = 0;
}
rand_value = rand1();
if ((1 == rand_value) && (one_occured < limit))
{
one_occured++;
return rand_value;
}
else if ((0 == rand_value) && (zero_occured < limit))
{
zero_occured++;
return rand_value;
}
else if (1 == rand_value)
{
zero_occured++;
return 0;
}
else if (0 == rand_value)
{
one_occured++;
return 1;
}
}
I have a double loop that I not only don't like, but would take 14 days to run on my computer since it is going over 3200 records and 1090 variables at about .12 per iteration.
A smaller reproducible bit. It simply checks how many numbers are in the same column between two records, not including NA's. Then it attaches the results to the original data frame.
y <- data.frame(c(1,2,1,NA,NA),c(3,3,3,4,NA),c(5,4,5,7,7),c(7,8,7,9,10))
resultdf <- NULL
for(i in 1:nrow(y))
{
results <- NULL
for(j in 1:nrow(y))
{
results <- c(results,sum((y[i,]==y[j,]),na.rm=TRUE))
}
resultdf <- cbind(resultdf,results)
}
y <- cbind(y,resultdf)
Some of the calculations are repeated and could possibly be avoided, which would still leave about 7 days.
If I understand correctly, a few of the apply functions are implemented in C and might be faster, but I haven't been able to get any of them to work. I'm also curious whether there is a package that would run faster. Can anyone help speed up the calculation?
Thank you!
I have created data to your specifications, and using #BenBolker's suggestion about using a matrix:
> y <- matrix(sample(c(1:9, NA), 3200 * 1090, replace = TRUE),
+ nrow = 3200, ncol = 1090)
and compared the computation times for three different implementations:
f1 was suggested by #Andrei:
> f1 <- function(y)apply(y, 1, function(r1)
+ apply(y, 1, function(r2)sum(r1==r2, na.rm=TRUE)))
> system.time(r1 <- f1(y))
user system elapsed
523.51 0.77 528.73
f2 was suggested by #VincentZoonekynd:
> f2 <- function(y) {
+ f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
+ d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
+ return(d)
+ }
> system.time(r2 <- f2(y))
user system elapsed
658.94 1.96 710.67
f3 is a double loop over the upper triangle as suggested by #BenBolker. It is also a bit more efficient than your OP in that it pre-allocates the output matrix:
> f3 <- function(y) {
+ result <- matrix(NA, nrow(y), nrow(y))
+ for (i in 1:nrow(y)) {
+ row1 <- y[i, ]
+ for (j in i:nrow(y)) {
+ row2 <- y[j, ]
+ num.matches <- sum(row1 == row2, na.rm = TRUE)
+ result[i, j] <- num.matches
+ result[j, i] <- num.matches
+ }
+ }
+ return(result)
+ }
> system.time(r3 <- f3(y))
user system elapsed
167.66 0.08 168.72
So the double loop is the fastest of all three, although not as elegant and compact as the other two answers.
Here is another solution, using outer.
f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
Indeed, you can use the apply function. Given the earlier hint that a matrix works faster, I would try:
ym <- as.matrix(y)
resultdf <- apply(ym, 1, function(r1) apply(ym, 1, function(r2) sum(r1==r2, na.rm=TRUE)))
You can get rid of the inner loop (using the y and f3 from #flodel's answer):
ty <- t(y)
ix <- rep(1:nrow(y),each = ncol(y))
f4 <- function(y){
result <- matrix(0L, nrow(y), nrow(y))
for(r in 1:nrow(y))
result[r,] <- rowsum(as.numeric(ty == y[r,]), ix, na.rm = T)
result
}
> system.time(out <- f4(y))
user system elapsed
52.616 21.061 74.000
> system.time(out <- f3(y))
user system elapsed
244.751 0.136 244.954
It actually does the extra work of computing everything twice, but in the timings above it is still about 3 times faster in elapsed time (closer to 5 times in user time). You can make it yet another 4 times faster by using the inner workings of rowsum. See this question for an example.
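When the data contain only a handful of distinct values (as in the 1:9 test matrix above), another option is to drop both loops over rows and use matrix products instead. This is a different technique from f3 and f4, sketched here under that assumption: for each distinct value v, build a 0/1 indicator matrix and add its cross-product, which counts the columns where two rows both equal v. The function name count_matches is made up.

```r
count_matches <- function(y) {
  y <- as.matrix(y)
  result <- matrix(0, nrow(y), nrow(y))
  for (v in unique(y[!is.na(y)])) {
    # 1 where the cell equals v, 0 elsewhere (NA cells count as 0)
    ind <- (!is.na(y) & y == v) * 1
    # Entry [i, j] adds the number of columns where rows i and j both equal v
    result <- result + tcrossprod(ind)
  }
  result
}

# The small example from the question
y_small <- data.frame(a = c(1, 2, 1, NA, NA), b = c(3, 3, 3, 4, NA),
                      c = c(5, 4, 5, 7, 7),   d = c(7, 8, 7, 9, 10))
matches <- count_matches(y_small)
matches
```

The loop now runs once per distinct value rather than once per pair of rows, so the cost is dominated by a few large matrix multiplications. With many distinct values this advantage shrinks.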
I was wondering if anyone could kindly help me with this seemingly easy task. I'm using nlminb to conduct optimization and compute some statistics by index. Here's an example from nlminb help.
> x <- rnbinom(100, mu = 10, size = 10)
> hdev <- function(par) {
+ -sum(dnbinom(x, mu = par[1], size = par[2], log = TRUE))
+ }
> nlminb(c(9, 12), hdev)
$par
[1] 9.730000 5.954936
$objective
[1] 297.2074
$convergence
[1] 0
$message
[1] "relative convergence (4)"
$iterations
[1] 10
$evaluations
function gradient
12 27
Suppose I generate random variables x, y, and z where z acts as an index (from 1 to 3).
> x <- rnbinom(100, mu = 10, size = 10)
> y <- rnbinom(100, mu = 10, size = 10)
> z <- rep(1:3, length=100)
> A <- cbind(x,y,z)
> hdev <- function(par) {
+ -sum(dnbinom(x+y, mu = par[1], size = par[2], log = TRUE))}
How can I apply nlminb(c(9, 12), hdev) to the data set by index z? In other words, I would like to compute nlminb(c(9, 12), hdev) for z=1, z=2, and z=3 separately. I tried by(A, z, function(A) nlminb(c(9,12), hdev)) and sparseby(A, z, function(A) nlminb(c(9,12), hdev)), but they return exactly the same values for each value of z.
I would like to turn each output into a new data frame so that it will become a 3X2 matrix.
[1] Z1_ANSWER_1 Z1_ANSWER_2
[2] Z2_ANSWER_1 Z2_ANSWER_2
[3] Z3_ANSWER_1 Z3_ANSWER_2
Since nlminb returns the summary of statistics, I needed to use CASEZ1<-nlminb$par, CASEZ2<-nlminb$par, CASEZ3<-nlminb$par and then use cbind to combine them. However, I would like to automate this process as the real data I'm working on has a lot more categories than z presented here.
If I'm not making myself clear, please let me know. I'll see if I can replicate the actual data set and functions I'm working on (I just don't have them on this computer).
Thank you very much in advance.
Let me try an approach
x <- rnbinom(100, mu = 10, size = 10)
y <- rnbinom(100, mu = 10, size = 10)
z <- rep(1:3, length=100)
A <- as.data.frame(cbind(x,y,z))
First, load the plyr library
library(plyr)
The following code returns the results for each z
dlply(A, .(z), function(d) {
# negative log-likelihood of the data passed in as mydata
hdev <- function(par, mydata) {-sum(dnbinom(mydata, mu = par[1], size = par[2], log = TRUE))}
# fit on x + y using only the rows of this group
nlminb(c(9, 12), hdev, mydata = d$x + d$y)
}
)
Now, with this one you will get a 3x2 dataframe with the $par results
ddply(A, .(z), function(d) {
# negative log-likelihood of the data passed in as mydata
hdev <- function(par, mydata) {-sum(dnbinom(mydata, mu = par[1], size = par[2], log = TRUE))}
# fit on x + y using only the rows of this group
res <- nlminb(c(9, 12), hdev, mydata = d$x + d$y)
return(res$par)
}
)
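If you would rather not depend on plyr, the same thing can be sketched in base R with split and lapply. The by() attempt in the question returned identical values because hdev read x and y from the workspace and ignored the subset it was given; here the data are passed in explicitly. The argument name counts and the lower bounds are my own additions, the latter only to keep mu and size positive during the search.

```r
set.seed(42)
A <- data.frame(
  x = rnbinom(100, mu = 10, size = 10),
  y = rnbinom(100, mu = 10, size = 10),
  z = rep(1:3, length.out = 100)
)

# Negative log-likelihood; the data arrive through `counts` (hypothetical
# argument name) instead of being looked up in the workspace
hdev <- function(par, counts) {
  -sum(dnbinom(counts, mu = par[1], size = par[2], log = TRUE))
}

# One fit per level of z; `lower` is an assumption to keep parameters positive
fits <- lapply(split(A, A$z), function(d) {
  nlminb(c(9, 12), hdev, counts = d$x + d$y, lower = c(1e-6, 1e-6))
})

# Stack the $par vectors into a 3 x 2 matrix, one row per value of z
par_table <- do.call(rbind, lapply(fits, function(fit) fit$par))
colnames(par_table) <- c("mu", "size")
par_table
```

Because fits is a named list, this scales to any number of categories in z without writing out CASEZ1, CASEZ2, and so on by hand.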