Adapting quantile regression to the caret package - cross-validation

I want to adapt quantile regression to the caret package. I wrote the code below to do so, but I get the error shown further down:
library(quantreg)

quantregression <- list(type = 'Regression', library = 'quantreg', loop = NULL)

param <- data.frame(parameter = c("tau"), class = c("numeric"), label = c("tau"))
quantregression$parameters <- param

quantGrid <- function(x, y, len = NULL, search = "grid") {
  if (search == "grid") {
    out <- expand.grid(tau = c(.05, .25, .5, .75, .95))
  }
  out
}
quantregression$grid <- quantGrid

quantregFit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  dat <- if (is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  if (!is.null(wts)) {
    if (param$tau)
      out <- quantreg::rq(.outcome ~ ., data = dat, weights = wts, tau = tau, ...)
    else
      out <- quantreg::rq(.outcome ~ 0 + ., data = dat, weights = wts, tau = tau, ...)
  } else {
    if (param$tau)
      out <- quantreg::rq(.outcome ~ ., data = dat, tau = tau, ...)
    else
      out <- quantreg::rq(.outcome ~ 0 + ., data = dat, tau = tau, ...)
  }
  out
}
quantregression$fit <- quantregFit

quantPred <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
  if (!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
  quantreg::predict.rq(modelFit, newdata)
}
quantregression$predict <- quantPred

quantProb <- function() {
  return(NULL)
}
quantregression$prob <- quantProb
This is the error I get:
Something is wrong; all the RMSE metric values are missing:
     RMSE        Rsquared        MAE
 Min.   : NA   Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA   Max.   : NA
 NA's   :5     NA's   :5     NA's   :5
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
warnings()
Warning messages:
1: model fit failed for Fold01.Rep1: tau=0.05 Error in unique(tau) : object 'tau' not found
I want to adapt quantile regression from the "quantreg" package to caret so that I can obtain cross-validation results for it. I wrote the code above but cannot find the error in it.
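The warning gives the clue: "object 'tau' not found" comes from the calls quantreg::rq(..., tau = tau, ...). Inside quantregFit there is no object named tau; the current tuning value that caret passes in is param$tau. Below is a minimal sketch of a corrected fit function; it drops the if (param$tau) branching (which always takes the first branch for the taus in the grid) and is meant only to illustrate the fix, not as caret's official template:
quantregFit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  dat <- if (is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  if (!is.null(wts)) {
    # use the tuning value caret supplies via 'param'; a bare 'tau' is undefined here
    quantreg::rq(.outcome ~ ., data = dat, weights = wts, tau = param$tau, ...)
  } else {
    quantreg::rq(.outcome ~ ., data = dat, tau = param$tau, ...)
  }
}
quantregression$fit <- quantregFit
With that change the list can be passed to train(..., method = quantregression) in the usual way for custom models; if the metrics are still missing, the next thing to check is that quantPred returns a plain numeric vector, which predict on an rq fit with a single tau does.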

Related

data.table row value depend on previous value in R

I have a data.table x with two columns, a and b.
I want to calculate a column c.
library(data.table)
x = data.table(a = c(1:5), b = c(1,0,2,3,6), c = NA)
x$a[1] = NA
x$b[1] = NA
x
#>        a     b      c
#>    <int> <num> <lgcl>
#> 1:    NA    NA     NA
#> 2:     2     0     NA
#> 3:     3     2     NA
#> 4:     4     3     NA
#> 5:     5     6     NA
The algorithm is:
c[i] = ifelse(a[i] < b[i] & c[i-1] < b[i], a[i], b[i])
I don't want to use a for loop, because it's too slow.
I want to use data.table functions, or a fast approach like this:
x$c = fifelse(x$a < x$b & lag(x$c) < x$b, x$a, x$b)
But this doesn't work, because the expression tries to read x$c while that same column is still being computed (the existing x$c is all NA at that point).
Any solution for this?
Thanks for the help,
Janos
You could use Reduce with the accumulate = TRUE option:
library(data.table)
x = data.table(a = c(1:5), b = c(1, 0, 2, 3, 6), c = NA)
x$a[1] = NA
x$b[1] = NA
x[, c := Reduce(f = function(prev, val) ifelse(val$a < val$b & prev < val$b, val$a, val$b),
                x = split(.SD[-1], seq_len(.N - 1)),
                init = NA,
                accumulate = TRUE)][]
#>        a     b     c
#>    <int> <num> <num>
#> 1:    NA    NA    NA
#> 2:     2     0     0
#> 3:     3     2     2
#> 4:     4     3     3
#> 5:     5     6     5
Reduce passes the result of the previous row's calculation along to the calculation for the next row.
accumulate = TRUE returns all of the intermediate results instead of only the last one.
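For intuition, here is the same accumulate mechanism on a plain numeric vector (a toy illustration added here, unrelated to the original data): each call receives the previous result and the next element, and accumulate = TRUE keeps every intermediate value.
# running sum as a recurrence: 'prev' carries the result of the previous step
Reduce(function(prev, val) prev + val, 1:5, accumulate = TRUE)
#> [1]  1  3  6 10 15
In the answer above, each element is a one-row data.table instead of a number, and the function applies the recurrence for c instead of addition.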

Octave Mapping over Rows

I have a matrix that's m x 2:
.2 .8
.3 .7
.7 .3
...
and I would like to map the '>' comparison over each row, producing a column like this:
0
0
1
...
However, I can't seem to find a good way to do this other than pulling out a column and applying a custom function to it, something like
res = arrayfun(@mapfunction, matrix(:,1))
function cl = mapfunction(v)
  cl = v > .5
endfunction
Is there a simpler way?
You can simply calculate a Boolean / logical column that checks to see whether the first column is greater than the second column for every row in your matrix:
res = matrix(:,1) > matrix(:,2);
Given your example matrix:
matrix = [.2 .8; .3 .7; .7 .3];
This is what I get:
>> res = matrix(:,1) > matrix(:,2)
res =
0
0
1

How can I improve the performance of my data cleaning code that currently uses ddply by using data.table?

I am trying to clean data using ddply but it is running very slowly on 1.3M rows.
Sample code:
# Create sample data frame
num_rows <- 10000
df <- data.frame(id = sample(1:20, num_rows, replace = T),
                 Consumption = sample(-20:20, num_rows, replace = T),
                 StartDate = as.Date(sample(15000:15020, num_rows, replace = T),
                                     origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]

# Are values negative?
# Needed for subsetting in ddply rows with the same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
I have written a function to remove rows where one row's consumption value is equal in magnitude but opposite in sign to another row's value (for the same id).
# Remove rows from a data frame where there is an equal but opposite consumption value.
# Should ensure only one negative value is removed for each positive one.
clean_negatives <- function(x3) {
  copies <- abs(sum(x3$Neg))
  sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
  x3 <- x3[0:copies, ]
  x3$Consumption <- sgn * x3$Consumption
  x3$Neg <- NULL
  x3
}
I then use ddply to apply that function to remove these erroneous rows in the data
ptm <- proc.time()
df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
                    function(x) clean_negatives(x))
proc.time() - ptm
I was hoping I could use data.table to make this go faster, but I couldn't work out how to employ it.
With 1.3M rows, it has so far taken my desktop all day to compute and still hasn't finished.
Your question asks for a data.table implementation, so I've shown one here. Your function can be drastically simplified as well: first get the sign by summing Neg, then filter the table, and then multiply Consumption by that sign (as shown below).
require(data.table)
# get the data.table in dt
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modifying for final output (line commented after Statquant's comment)
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption * sign)]
dt.fil <- subset(dt.fil, select = -c(Neg, sign))
Benchmarking
The data with a million rows:
# Create sample data frame
num_rows <- 1e6
df <- data.frame(id = sample(1:20, num_rows, replace = T),
                 Consumption = sample(-20:20, num_rows, replace = T),
                 StartDate = as.Date(sample(15000:15020, num_rows, replace = T),
                                     origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
The data.table function:
FUN.DT <- function() {
  require(data.table)
  dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
  dt <- dt[, sign := sign(sum(Neg)),
           by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
               by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil[, Consumption := (Consumption * sign)]
  dt.fil <- subset(dt.fil, select = -c(Neg, sign))
}
Your function with ddply
FUN.PLYR <- function() {
  require(plyr)
  clean_negatives <- function(x3) {
    copies <- abs(sum(x3$Neg))
    sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
    x3 <- x3[0:copies, ]
    x3$Consumption <- sgn * x3$Consumption
    x3$Neg <- NULL
    x3
  }
  df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
                      function(x) clean_negatives(x))
}
Benchmarking with rbenchmark (with 1 run only)
require(rbenchmark)
benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")
        test replications elapsed relative user.self sys.self user.child sys.child
1   FUN.DT()            1   6.137    1.000     5.926    0.211          0         0
2 FUN.PLYR()            1 242.268   39.477   152.855   82.881          0         0
My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine to your implementation because the functions are different).
Note: I loaded the packages within the functions so that the timings include everything needed to obtain the result. For the same reason, I converted the data.frame to a keyed data.table inside the benchmarked function. This is therefore the minimum speed-up.

Comparing Record Results and Double For Loop

I have a double loop that I not only dislike but that would take about 14 days to run on my computer, since it goes over 3200 records and 1090 variables at about 0.12 seconds per iteration.
Here is a smaller reproducible example. It simply counts how many numbers match in the same column between two records, not counting NAs, and then attaches the results to the original data frame.
y <- data.frame(c(1, 2, 1, NA, NA), c(3, 3, 3, 4, NA), c(5, 4, 5, 7, 7), c(7, 8, 7, 9, 10))
resultdf <- NULL
for (i in 1:nrow(y)) {
  results <- NULL
  for (j in 1:nrow(y)) {
    results <- c(results, sum(y[i, ] == y[j, ], na.rm = TRUE))
  }
  resultdf <- cbind(resultdf, results)
}
y <- cbind(y, resultdf)
There are repeated calculations that could possibly be avoided, which would still leave about 7 days.
If I understand correctly, a few of the apply functions are implemented in C, which might be faster, but I haven't been able to get any of them to work. I'm also curious whether there is a package that would run faster. Can anyone help speed up the calculation?
Thank you!
I have created data to your specifications, and using @BenBolker's suggestion about using a matrix:
> y <- matrix(sample(c(1:9, NA), 3200 * 1090, replace = TRUE),
+             nrow = 3200, ncol = 1090)
and compared the computation times for three different implementations:
f1 was suggested by @Andrei:
> f1 <- function(y) apply(y, 1, function(r1)
+   apply(y, 1, function(r2) sum(r1 == r2, na.rm = TRUE)))
> system.time(r1 <- f1(y))
user system elapsed
523.51 0.77 528.73
f2 was suggested by @VincentZoonekynd:
> f2 <- function(y) {
+   f <- function(i, j) sum(y[i, ] == y[j, ], na.rm = TRUE)
+   d <- outer(1:nrow(y), 1:nrow(y), Vectorize(f))
+   return(d)
+ }
> system.time(r2 <- f2(y))
user system elapsed
658.94 1.96 710.67
f3 is a double loop over the upper triangle, as suggested by @BenBolker. It is also a bit more efficient than the version in your question in that it pre-allocates the output matrix:
> f3 <- function(y) {
+   result <- matrix(NA, nrow(y), nrow(y))
+   for (i in 1:nrow(y)) {
+     row1 <- y[i, ]
+     for (j in i:nrow(y)) {
+       row2 <- y[j, ]
+       num.matches <- sum(row1 == row2, na.rm = TRUE)
+       result[i, j] <- num.matches
+       result[j, i] <- num.matches
+     }
+   }
+   return(result)
+ }
> system.time(r3 <- f3(y))
user system elapsed
167.66 0.08 168.72
So the double loop is the fastest of all three, although not as elegant and compact as the other two answers.
Here is another solution, using outer.
f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
Indeed, you can use the apply function. Given the earlier hint that a matrix works faster, I would try:
ym <- as.matrix(y)
resultdf <- apply(ym, 1, function(r1) apply(ym, 1, function(r2) sum(r1==r2, na.rm=TRUE)))
You can get rid of the inner loop (using the y and f3 from @flodel's answer):
ty <- t(y)
ix <- rep(1:nrow(y), each = ncol(y))
f4 <- function(y) {
  result <- matrix(0L, nrow(y), nrow(y))
  for (r in 1:nrow(y))
    result[r, ] <- rowsum(as.numeric(ty == y[r, ]), ix, na.rm = TRUE)
  result
}
> system.time(out <- f4(y))
user system elapsed
52.616 21.061 74.000
> system.time(out <- f3(y))
user system elapsed
244.751 0.136 244.954
>
It actually does the extra work of computing each pairwise count twice (the full matrix instead of just the upper triangle), but it is still 5 times faster. You can make it yet another 4 times faster by using the inner workings of rowsum; see this question for an example.

Make this loop faster in R

How can I speed up the following (noob) code:
#"mymatrix" is the matrix of word counts (docs X terms)
#"tfidfmatrix" is the transformed matrix
tfidfmatrix = Matrix(mymatrix, nrow=num_of_docs, ncol=num_of_words, sparse=T)
#Apply a transformation on each row of the matrix
for(i in 1:dim(mymatrix)[[1]]){
r = mymatrix[i,]
s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
tfmat[i,] = s/sqrt(sum(s^2))
}
return (tfidfmatrix)
The problem is that the matrices I am working with are fairly large (~40k x 100k), and this code is very slow.
The reason I am not using apply (instead of a for loop plus sapply) is that apply would give me the transpose of the matrix I want: I want num_of_docs x num_of_words, and I would then have to spend more time transposing and re-allocating the result.
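(For illustration, a tiny made-up example of apply's orientation, separate from the real data: when the function returns one value per column, apply over rows stacks each row's result as a column, so the output comes back transposed.)
m <- matrix(1:6, nrow = 2)               # 2 x 3
res <- apply(m, 1, function(r) r * 2)    # each row's result becomes a column
dim(res)                                 # 3 2: transposed relative to m
all.equal(t(res), m * 2)                 # TRUE after transposing back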
Any thoughts on making this faster?
Thanks much.
Edit: I have found that the suggestions below greatly speed up my code (besides making me feel stupid). Any suggestions on where I can learn to write "optimized" R code?
Edit 2: OK, so something is not right. Once I do s.vec[!is.finite(s.vec)] <- 0, every element of s.vec is set to 0. To reiterate, my original matrix is a sparse matrix containing integers; this appears to be a quirk of the Matrix package I am using. When I do s.vec[which(s.vec == -Inf)] <- 0 instead, things work as expected. Thoughts?
As per my comment,
# Slightly larger example data
mymatrix <- matrix(runif(10000), nrow = 10)
mymatrix[sample(10000, 100)] <- 0
tfmat <- matrix(nrow = 10, ncol = 1000)
ndocs <- 1

justin <- function() {
  s.vec <- ifelse(mymatrix == 0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))
  tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}

joran <- function() {
  s.vec <- (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
  s.vec[!is.finite(s.vec)] <- 0
  tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}

require(rbenchmark)
benchmark(justin(), joran(), replications = 1000)
      test replications elapsed relative user.self sys.self user.child sys.child
2  joran()         1000   0.940  1.00000     0.842    0.105          0         0
1 justin()         1000   2.786  2.96383     2.617    0.187          0         0
So it's around 3x faster.
Not sure what ndocs is, but ifelse is already vectorized, so you should be able to apply the ifelse statement to the whole matrix without walking through it row by row and sapply-ing along each row. The same goes for the final calculation.
However, you haven't given a complete example to replicate...
mymatrix <- matrix(runif(100), nrow = 10)
tfmat <- matrix(nrow = 10, ncol = 10)
ndocs <- 1
s.vec <- ifelse(mymatrix == 0, 0, 1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
for (i in 1:dim(mymatrix)[[1]]) {
  r = mymatrix[i, ]
  s = sapply(r, function(x) ifelse(x == 0, 0, (1 + log(x)) * log((1 + ndocs)/(1 + x))))
  tfmat[i, ] <- s
}
all.equal(s.vec, tfmat)
So the only piece missing is the rowSums in your final calculation:
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
for (i in 1:dim(mymatrix)[[1]]) {
  r = mymatrix[i, ]
  s = sapply(r, function(x) ifelse(x == 0, 0, (1 + log(x)) * log((1 + ndocs)/(1 + x))))
  tfmat[i, ] = s/sqrt(sum(s^2))
}
all.equal(tfmat, tfmat.vec)
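Putting the two pieces together, the whole loop collapses to the following (a sketch reusing the toy objects above; it is the same idea that the justin()/joran() functions benchmark earlier in this thread):
s.vec <- ifelse(mymatrix == 0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))
tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))   # keeps the docs x terms orientation, no transpose needed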
