Slow nested loop in R - performance

I'm new to R and having trouble vectorizing a nested loop that is particularly slow. The loop goes through a list of centers (the rows of the matrix clusterCenters) and finds the distance between each center and each row of the matrix clusterMembers below. I know this needs to be vectorized for speed, but I cannot figure out the appropriate functions, or the right use of apply, to do so.
clusterCenters <- matrix(runif(10000),nrow=100)
clusterMembers <- matrix(runif(400000),nrow=4000)
features <- matrix(0,(dim(clusterMembers)[1]),(dim(clusterCenters)[1]))
for(c in 1:dim(clusterCenters)[1]){
center <- clusterCenters[c,]
for(v in 1:(dim(clusterMembers)[1])){
vector <- clusterMembers[v,]
features[v,c] <- sqrt(sum((center - vector)^2))
}
}
Thanks for any help.

You can take advantage of R's recycling rules to make this a bit faster. But you have to know and account for the fact that R stores matrices in column-major order. You do that by transposing clusterMembers and then the center vector will be recycled along the columns of t(clusterMembers).
set.seed(21)
clusterCenters <- matrix(runif(10000),nrow=100)
clusterMembers <- matrix(runif(400000),nrow=4000)
# your original code in function form
seven <- function() {
features <- matrix(0,(dim(clusterMembers)[1]),(dim(clusterCenters)[1]))
for(c in 1:dim(clusterCenters)[1]){
center <- clusterCenters[c,]
for(v in 1:(dim(clusterMembers)[1])){
vector <- clusterMembers[v,]
features[v,c] <- sqrt(sum((center - vector)^2))
}
}
features
}
# my fancy function
josh <- function() {
tcm <- t(clusterMembers)
Features <- matrix(0,ncol(tcm),nrow(clusterCenters))
for(i in 1:nrow(clusterCenters)) {
# clusterCenters[i,] returns a vector because drop=TRUE by default
Features[,i] <- colSums((clusterCenters[i,]-tcm)^2)
}
Features <- sqrt(Features) # outside the loop to avoid function calls
}
system.time(seven())
# user system elapsed
# 2.7 0.0 2.7
system.time(josh())
# user system elapsed
# 0.28 0.11 0.39
identical(seven(),josh())
# [1] TRUE
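The recycling behaviour this answer relies on can be seen on a tiny example. This is only an illustrative sketch: the values and the names d_recycled and d_explicit are made up for the demonstration and are not part of the original answer.

```r
center  <- c(1, 2, 3)
members <- rbind(c(1, 2, 3), c(4, 6, 8))   # 2 members (rows) x 3 features (columns)

# R stores matrices column by column, so a vector is recycled down the columns.
# After transposing, each column holds one member, and center lines up with it.
tcm <- t(members)
d_recycled <- sqrt(colSums((center - tcm)^2))

# Reference result: one explicit distance computation per row of members.
d_explicit <- apply(members, 1, function(v) sqrt(sum((center - v)^2)))

stopifnot(isTRUE(all.equal(d_recycled, d_explicit)))
d_recycled   # 0 and sqrt(50)
```

Without the transpose, center - members would pair the elements of center with the rows of the first column rather than with the features of one member.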

Related

How to set my inequality constraint into my R function?

I am working on a project that analyses different portfolio constructions in a universe of various assets. I work with 22 assets and I recalibrate my portfolio every 90 days. This is why a weight-penalty constraint (see code) is applied, as the allocation changes every period.
I am currently implementing a construction based on independent components. My objective is to minimize the modified value at risk based on its components. (See code below).
My function runs correctly and everything seems to be OK; my functions "MVaR.IC.port" and "MVaR.cm" work well. However, I can only implement this model in the case where short selling is allowed. I would now like to operate "long only", i.e. so that my weight vector w contains only elements >= 0. Concretely, I want the result of the expression "w <- t(w.IC)%*%a$A" in my code to be >= 0.
Do you know how to help me? Thank you in advance.
The results in w.out.MVaR.IC.22 are the weights that must be positive. I also impose the constraint that the weights sum to 1 (the investor allocates 100% of his wealth).
Thomas
PS: train and test represent my rolling windows. In fact, I calibrate my models on 'train' (in sample) and apply them on 'test' (out of sample) in order to analyse their performance.
########################################
######### MVar on IC with CM #########
########################################
lower = rep(-5,k)
upper = rep(5,k)
#Set up objective function and constraint
MVaR.IC.cm.port <- function(S, weights, alpha, MixingMatrix)
{
obj <- MVaR(S, weights, alpha)
w.ICA <- t(weights)%*%MixingMatrix
weight.penalty = abs(1000*(1-sum(w.ICA)))
down.weight.penalty = 1000*sum(w.ICA[w.ICA > 1])
up.weight.penalty = 1000*abs(sum(w.ICA[w.ICA < -1]))
return(obj + weight.penalty + down.weight.penalty + up.weight.penalty)
}
#Out of sample return portfolio computation
ret.out.MVaR.IC.cm.22 <- c()
w.out.MVaR.IC.cm.22 <- matrix(ncol = n, nrow = 10)
for (i in 0:9) {
train <- as.matrix(portfolioReturns.new[((1+i*90):(8*90+i*90)),])
test <- as.matrix(portfolioReturns.new[(1+8*90+i*90):(9*90+i*90),])
a <- myfastICA(train, k, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 2000,
tol = 0.0000000001, verbose = TRUE)
x <- DEoptim(MVaR.IC.cm.port,lower,upper,
control=list(NP=(10*k),F=0.8,CR=0.9, trace=50),
S=a$S, alpha = alpha, MixingMatrix = a$A)
w.IC <- matrix(x$optim$bestmem, ncol=1)
w <- t(w.IC)%*%a$A
for (j in 1:ncol(train)){
w.out.MVaR.IC.cm.22[(i+1),j] <- w[j]
}
ret.out.MVaR.IC.cm.22 <- rbind(ret.out.MVaR.IC.cm.22, test %*% t(w))
}
w.out.MVaR.IC.cm.22
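One possible way to push the optimiser towards long-only weights, in the same penalty style the objective function above already uses, is to add a term that grows with any negative asset weight. The following is a minimal sketch under stated assumptions: long.only.penalty is a hypothetical helper, the scale of 1000 simply mirrors the other penalties, and the mixing matrix A here is toy data rather than the output of myfastICA. A penalty only discourages negative weights; it does not guarantee them.

```r
# Hypothetical helper: zero when all asset weights are >= 0,
# otherwise scale times the total size of the negative weights.
long.only.penalty <- function(w, scale = 1000) {
  scale * sum(pmax(-w, 0))
}

# Toy mixing matrix (assumption: 2 components, 3 assets).
A <- rbind(c(0.5, 0.3, 0.2),
           c(0.1, -0.4, 0.6))

ic.weights <- c(1, 1)
w <- as.vector(t(ic.weights) %*% A)   # asset weights: 0.6, -0.1, 0.8
long.only.penalty(w)                  # penalises only the -0.1
```

Inside MVaR.IC.cm.port the term would be computed from w.ICA and added to the returned value alongside the existing penalties.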

How to debug this for loop that masks multiple rasters using a single shapefile?

I have a group of multiband rasters in a folder that I'd like to mask using a single shapefile. I'd like the masked outputs (rasters) to go into a separate folder. I received help on here a while back in making a for loop to accomplish this. It worked great, but when I scaled it up (increased the number of polygons in the shapefile and added more rasters) it no longer worked as well. To be specific, it will sometimes mask the rasters correctly, and sometimes not. I can't discern any pattern: I've run this code multiple times, and each time the set of rasters that don't get masked is different.
Code so far:
library(terra)
#Creating directory to store inputs
ras_dir <- "/Users/USERID/rasters"
if (!file.exists(ras_dir)) {
dir.create(ras_dir) # do not assign the result: dir.create() returns TRUE/FALSE, not the path
}
#Creating directory to store outputs
mask_dir <- "/Users/USERID/masks"
if (!file.exists(mask_dir)) {
dir.create(mask_dir) # same here: keep mask_dir as the path string
}
#Twelve polygons in a shapefile
v <- vect(system.file("ex/lux.shp", package="terra"))
v <- v[c(1,4,5,7,9,12)]
#10 rasters with 5 layers each. I'm not a good enough coder to programmatically write
#these rasters to a directory and have them be different.
f <- system.file("ex/elev.tif", package="terra")
r <- rast(f)
r <- rep(r, 5) * 1:5
names(r) <- paste0("band", 1:5)
writeRaster(r, "/Users/USERID/ras1.tif")
writeRaster(r, "/Users/USERID/ras2.tif")
writeRaster(r, "/Users/USERID/ras3.tif")
writeRaster(r, "/Users/USERID/ras4.tif")
writeRaster(r, "/Users/USERID/ras5.tif")
inf <- list.files("/Users/USERID/rasters", pattern="tif$", full.names=TRUE)
outf <- gsub("/Users/USERID/rasters", "/Users/USERID/masks", inf)
for (i in 1:length(inf)) {
r <- rast(inf[i])
c <- crop(r, v) #Here I crop first as it saves lots of time
m <- mask(c, v, filename = outf[i], overwrite = TRUE)
}
To be clear, I know the code above runs correctly. For some reason it doesn't run with my longer dataset and I was wondering if anyone could shed light on any potential pitfalls of this type of for loop.
This looks good to me. One change I would make is to actually use the variables that hold the paths; that is, hard-code each path only once, as I do below. It would seem that there is a mistake somewhere in your actual implementation. It can help to put statements like print(outf[i]) in the loop, and to try first with a few files and then with more, to see where the error first appears.
library(terra)
ras_dir <- "rasters"
mask_dir <- "masks"
dir.create(ras_dir, FALSE, FALSE)
dir.create(mask_dir, FALSE, FALSE)
#Twelve polygons in a shapefile
v <- vect(system.file("ex/lux.shp", package="terra"))
v <- v[c(1,4,5,7,9,12)]
#10 rasters with 5 layers each.
r <- rast(system.file("ex/elev.tif", package="terra"))
r <- rep(r, 5) * 1:5
names(r) <- paste0("band", 1:5)
for (i in 1:5) {
writeRaster(r * i, file.path(ras_dir, paste0("ras", i, ".tif")), overwrite=TRUE)
}
inf <- list.files(ras_dir, pattern="tif$", full.names=TRUE)
outf <- gsub(ras_dir, mask_dir, inf)
for (i in 1:length(inf)) {
r <- rast(inf[i])
c <- crop(r, v)
m <- mask(c, v, filename = outf[i], overwrite = TRUE)
}
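The debugging advice above (print inside the loop, start with a few files) can be sketched with base R only. This is an illustration under stated assumptions: run_all and process_one are hypothetical names, and process_one is a stand-in that merely copies the file, so the terra crop and mask calls would have to be swapped in. Output paths are built from basename() so that they do not depend on a text substitution in the input path.

```r
# Hypothetical stand-in for the crop + mask step; replace with the terra calls.
process_one <- function(infile, outfile) {
  file.copy(infile, outfile, overwrite = TRUE)
}

# Process every input, report progress, and record which outputs were written.
run_all <- function(inf, mask_dir, process = process_one) {
  outf <- file.path(mask_dir, basename(inf))
  ok <- logical(length(inf))
  for (i in seq_along(inf)) {          # seq_along() is safe when inf is empty
    message("processing ", inf[i], " -> ", outf[i])
    ok[i] <- tryCatch({
      process(inf[i], outf[i])
      file.exists(outf[i])
    }, error = function(e) {
      message("failed: ", conditionMessage(e))
      FALSE
    })
  }
  data.frame(input = inf, output = outf, ok = ok)
}

# Usage, e.g. on the first three files only:
# res <- run_all(head(inf, 3), mask_dir); res[!res$ok, ]
```

The returned data frame makes it easy to compare runs and see whether the same files fail each time.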

Efficient knn algorithm

I'm trying to implement a knn algorithm in R that operates on one-dimensional vectors, but one that differs slightly from the standard one in that it takes the smaller element in case of a tie (so the distance is just the absolute value of the difference between the attributes). More precisely, I'm trying to find the k numbers that are closest to a given number, and if there are ties I want the smaller number to be chosen.
Sounds simple, but my algorithm takes a couple of seconds to finish whilst the one in the class package (knn) outputs an answer immediately (though in case of a tie it takes all elements or random elements)... Mine is the following:
1. I sample a training sample and order it increasingly.
2. I take an element (a number),
2.5. and search for the first place in which it becomes less than some number in the training sample.
3. I take 2k+1 numbers from the training sample: k to the left of the number found in 2.5 and k to the right (if there are fewer than k such numbers, I take as many as I can).
4. Finally I calculate the distances of the chosen elements to the one I took in 2 and order them, along with the corresponding elements, increasingly (so that the elements and their distances are ordered increasingly).
5. Then I take the first k elements from the list created in 4 (so that no two have the same distance).
But boy, it takes 6 or 7 seconds to finish... Do you have any ideas for an improvement? (It's not an R-specific question; I just happen to be doing it in R.)
Edit. The code:
dec <- function(u, x, k) {
## u is the training sample sorted increasingly
## x is an object for classification
## k is a knn parameter
knn <- list()
i <- 1
div <- 0
for (j in u) {
if (x < j) {
div <- i # record the position of the first element greater than x
break
}
i <- i+1
}
if (div == 0) {
distances <- array(0,dim=c(2,k))
for (j in 1:k) {
# index by j so that each neighbour gets its own column
distances[1,j] <- u[10000-j]
distances[2,j] <- abs(u[10000-j]-x)
}
} else {
end1 <- div+k
end2 <- div-k
if (div<k) {
distances <- array(0,dim=c(2,(div+k)))
a <- 1
for (j in u[1:end1]) {
distances[1,a] <- j
distances[2,a] <- abs(j-x)
a <- a+1
}
} else if (10000-div<k) {
distances <- array(0,dim=c(2,(10000-div+k+1))) # u[end2:10000] has 10000-div+k+1 elements
a <- 1
for (j in u[end2:10000]) {
distances[1,a] <- j
distances[2,a] <- abs(j-x)
a <- a+1
}
} else {
a <- 1
distances <- array(0,dim=c(2,(2*k+1)))
for (j in u[end1:end2]) {
distances[1,a] <- j
distances[2,a] <- abs(j-x)
a <- a+1
}
}
distances <- t(distances)
distances <- distances[ order( distances[,2], distances[,1]), ]
distances <- t(distances)
}
for (i in 1:k) {
if (i>1 && distances[1,i-1] != distances[1,i])
knn[i] <- distances[1,i]
}
## and sth later...
}
kNN in 1D is straightforward.
Sort the values increasingly. To perform a query, locate the value in the sorted sequence by binary (dichotomic) search. Then find the k closest values by stepping k times to whichever neighbor is closer, the next smaller or the next larger value.
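The procedure described above could be sketched in R as follows. This is a minimal illustration, not the answerer's own code: knn_1d is a hypothetical name, base R's findInterval does the binary search, and ties are resolved towards the smaller value as the question asks.

```r
knn_1d <- function(u, x, k) {
  # u: numeric vector sorted increasingly; x: query value; k: number of neighbours
  n <- length(u)
  k <- min(k, n)
  lo <- findInterval(x, u)   # index of the last element <= x (0 if there is none)
  hi <- lo + 1L              # index of the first element > x
  out <- numeric(k)
  for (i in seq_len(k)) {
    # Step left if nothing remains on the right, or if the left neighbour
    # is at least as close (<= sends ties to the smaller value).
    take_left <- hi > n || (lo >= 1L && (x - u[lo]) <= (u[hi] - x))
    if (take_left) {
      out[i] <- u[lo]
      lo <- lo - 1L
    } else {
      out[i] <- u[hi]
      hi <- hi + 1L
    }
  }
  out
}

u <- c(1, 2, 4, 7, 9)
knn_1d(u, 3, 2)    # 2 4  (2 and 4 are equally close, so 2 is taken first)
knn_1d(u, 10, 3)   # 9 7 4
```

After the one-off sort, each query costs one binary search plus k steps, instead of a scan over the whole training sample.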

Most efficient way of subsetting vectors

I need to calculate the mean and variance of a subset of a vector. Let x be the vector and y be an indicator for whether the observation is in the subset. Which is more efficient:
sub.mean <- mean(x[y])
sub.var <- var(x[y])
or
sub <- x[y]
sub.mean <- mean(sub)
sub.var <- var(sub)
sub <- NULL
The first approach doesn't create a new object explicitly; but do the calls to mean and var do that implicitly? Or do they work on the original vector as stored?
Is the second faster because it doesn't have to do the subsetting twice?
I'm concerned with speed and with memory management for large data sets.
Benchmarking on a vector of length 10M indicates that (on my machine) the latter approach is faster:
f1 = function(x, y) {
sub.mean <- mean(x[y])
sub.var <- var(x[y])
}
f2 = function(x, y) {
sub <- x[y]
sub.mean <- mean(sub)
sub.var <- var(sub)
sub <- NULL
}
x = rnorm(10000000)
y = rbinom(10000000, 1, .5) == 1 # logical indicator; numeric 0/1 values would be treated as positions
print(system.time(f1(x, y)))
# user system elapsed
# 0.403 0.037 0.440
print(system.time(f2(x, y)))
# user system elapsed
# 0.233 0.002 0.235
This isn't surprising: mean(x[y]) does have to create a new object for the mean function to act on, even if it doesn't add it to the local namespace. Thus, f1 is slower because it has to do the subsetting twice (as you surmised).
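A small sketch of the subset-once pattern, assuming the indicator is a logical vector. The name sub.stats is a hypothetical helper, and the data are simulated for illustration. It also shows why a numeric 0/1 indicator should be converted first: as an index, zeros are dropped and every 1 selects the first element.

```r
set.seed(1)
x <- rnorm(1e5)
flag <- rbinom(1e5, 1, 0.5)   # numeric 0/1 values
y <- flag == 1                # logical indicator

# Numeric 0/1 indices do not pick a subset: each 1 returns x[1] again.
stopifnot(all(x[flag] == x[1]))

# Hypothetical helper: subset once and reuse the result for both statistics.
sub.stats <- function(x, y) {
  sub <- x[y]
  c(mean = mean(sub), var = var(sub))
}
s <- sub.stats(x, y)
```

Because sub is local to the function, it is released when the function returns, so no explicit sub <- NULL is needed in this form.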

r: for loop operation with nested indices runs super slow

I have an operation I'd like to run for each row of a data frame, changing one column. I'm an apply/ddply/sqldf man, but I'll use loops when they make sense, and I think this is one of those times. This case is tricky because the column to change depends on information that varies by row; depending on information in one cell, I should make a change to only one of ten other cells in that row. With 75 columns and 20000 rows, the operation takes 10 minutes, when every other operation in my script takes 0-5 seconds, ten seconds max. I've stripped my problem down to the very simple test case below.
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time(
for (i in 1:nrow(t.df)) {
t.df[i,(t.df[i,1]%%10 + 1)] <- 99
}
)
This takes 70 seconds with ten columns, and 360 when ncol=50. That's crazy. Are loops the wrong approach? Is there a better, more efficient way to do this?
I already tried initializing the nested term (t.df[i,1]%%10 + 1) as a list outside the for loop. It saves about 30 seconds (out of 10 minutes) but makes the example code above more complicated. So it helps, but it's not the solution.
My current best idea came while preparing this test case. For me, only 10 of the columns are relevant (and 75-11 columns are irrelevant). Since the run times depend so much on the number of columns, I can just run the above operation on a data frame that excludes irrelevant columns. That will get me down to just over a minute. But is "for loop with nested indices" even the best way to think about my problem?
It seems the real bottleneck is having the data in the form of a data.frame. I assume that in your real problem you have a compelling reason to use a data.frame. Is there any way to convert your data so that it can remain in a matrix?
By the way, great question and a very good example.
Here's an illustration of how much faster loops are on matrices than on data.frames:
> n <- 20000
> t.df <- (matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
0.084 0.001 0.084
>
> n <- 20000
> t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
31.543 57.664 89.224
Using row and col seems less complicated to me:
t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
I think Tommy's is still faster, but using row and col might be easier to understand.
@JD Long is right that if t.df can be represented as a matrix, things will be much faster.
...And then you can actually vectorize the whole thing so that it is lightning fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
m <- as.matrix(t.df)
m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
t2.df <- as.data.frame(m)
}) # 0.00 secs
Unfortunately, the matrix indexing I use here does not seem to work on a data.frame.
EDIT
A variant where I create a logical matrix to index works on data.frame, and is almost as fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
t2.df <- t.df
# Create a logical matrix with TRUE wherever the replacement should happen
m <- array(FALSE, dim=dim(t2.df))
m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE
t2.df[m] <- 99
}) # 0.01 secs
UPDATE: Added the matrix version of Tommy's solution to the benchmarking exercise.
You can vectorize it. Here is my solution and a comparison with the loop
n <- 20000
t.df <- (matrix(1:5000, ncol=10, nrow=n))
f_ramnath <- function(x){
idx <- x[,1] %% 10 + 1
x[cbind(1:NROW(x), idx)] <- 99
return(x)
}
f_long <- function(t.df){
for (i in 1:nrow(t.df)) {
t.df[i,(t.df[i,1]%%10 + 1)] <- 99
}
return(t.df)
}
f_joran <- function(t.df){
t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
return(t.df)
}
f_tommy <- function(t.df){
t2.df <- t.df
# Create a logical matrix with TRUE wherever the replacement should happen
m <- array(FALSE, dim=dim(t2.df))
m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE
t2.df[m] <- 99
return(t2.df)
}
f_tommy_mat <- function(m){
m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
return(m) # return the modified matrix, like the other functions
}
To compare the performance of the different approaches, we can use rbenchmark.
library(rbenchmark)
benchmark(f_long(t.df), f_ramnath(t.df), f_joran(t.df), f_tommy(t.df),
f_tommy_mat(t.df), replications = 20, order = 'relative',
columns = c('test', 'elapsed', 'relative'))
test elapsed relative
5 f_tommy_mat(t.df) 0.135 1.000000
2 f_ramnath(t.df) 0.172 1.274074
4 f_tommy(t.df) 0.311 2.303704
3 f_joran(t.df) 0.705 5.222222
1 f_long(t.df) 2.411 17.859259
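Before trusting timings like these, it can be worth checking that the vectorized version really produces the same result as the loop. A minimal sketch on a smaller matrix follows; loop_version and indexed_version are hypothetical names that restate the loop and the two-column matrix indexing shown above.

```r
n <- 2000
m <- matrix(1:5000, ncol = 10, nrow = n)

# Row-by-row loop, as in the question.
loop_version <- function(m) {
  for (i in seq_len(nrow(m))) {
    m[i, m[i, 1] %% 10 + 1] <- 99
  }
  m
}

# One assignment through a two-column (row, column) index matrix.
indexed_version <- function(m) {
  idx <- cbind(seq_len(nrow(m)), m[, 1] %% 10L + 1L)
  m[idx] <- 99
  m
}

stopifnot(identical(loop_version(m), indexed_version(m)))
```

Each row of idx names exactly one cell, so the single assignment touches the same cells the loop does.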
Another option for when you do need mixed column types (and so you can't use matrix) is := in data.table. Example from ?":=" :
require(data.table)
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
system.time(for (i in 1:1000) DF[i,1] <- i)
# 591 seconds
system.time(for (i in 1:1000) DT[i,V1:=i])
# 1.16 seconds ( 509 times faster )
