Simulations in R with apply and replicate - matrix

I have two matrices: one contains all the mean values and the other contains all the standard deviations. I want to simulate a random number for each of the three investors and see which investor gets the highest value.
For example, loan 1 has three investors. I take the highest of
rnorm(1,means[1,1],sd[1,1]),rnorm(1,means[1,2],sd[1,2]),rnorm(1,means[1,3],sd[1,3])
and store it. I want to simulate this 1000 times and store results as
follows.
Output
Can I use a combination of mapply, sapply and replicate to do it? If you can give me some pointers I would be very grateful.
means <- matrix(c(-0.086731728,-0.1556901,-0.744495,
-0.166453802, -0.1978284, -0.9021422,
-0.127376145, -0.1227214, -0.6926699
), ncol = 3)
means <- t(means)
colnames(means) <- c("inv1","inv2","inv3")
rownames(means) <- c("loan1","loan2","loan3")
sd <- matrix(c(0.4431459, 0.5252441, 0.5372112,
0.4431882, 0.5252268, 0.5374614,
0.4430836, 0.5248798, 0.536924
), ncol = 3)
sd <- t(sd)
colnames(sd) <- c("inv1","inv2","inv3")
rownames(sd) <- c("loan1","loan2","loan3")

Given this is just an element-wise operation, you can use an appropriate vectorised function to compute this:
# Create a function to perform the computation you want
# Get the highest value from 1000 simulations
f <- function(m,s,reps=1000) max(rnorm(reps,m,s))
# Convert this function to a vectorised binary function
`%f%` <- Vectorize(f)
# Generate results - this will be a vector
results <- means %f% sd
# Tidy up results
results <- matrix(results, nrow = nrow(means))
colnames(results) <- colnames(means)
rownames(results) <- rownames(means)
# Results
results
inv1 inv2 inv3
loan1 1.486830 1.317569 0.8679278
loan2 1.212262 1.762396 0.7514182
loan3 1.533593 1.461248 0.7539696
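Note that the function above returns, for each loan and investor, the maximum of 1000 draws. If the goal is the one described in the question (one draw per investor, keep the highest, repeat 1000 times), a minimal sketch could look like the following. The names sds, n_sims, draws, highest, winner and wins are made up for illustration, and set.seed is only there so the run can be repeated.

```r
# Sketch under assumptions: one normal draw per loan/investor in each simulation.
set.seed(1)
means <- matrix(c(-0.086731728, -0.1556901, -0.744495,
                  -0.166453802, -0.1978284, -0.9021422,
                  -0.127376145, -0.1227214, -0.6926699),
                ncol = 3, byrow = TRUE,
                dimnames = list(paste0("loan", 1:3), paste0("inv", 1:3)))
sds <- matrix(c(0.4431459, 0.5252441, 0.5372112,
                0.4431882, 0.5252268, 0.5374614,
                0.4430836, 0.5248798, 0.536924),
              ncol = 3, byrow = TRUE, dimnames = dimnames(means))

n_sims <- 1000
# rnorm recycles the 9 means/sds, so draws[i, j, k] uses means[i, j] and sds[i, j]
draws <- array(rnorm(length(means) * n_sims, mean = means, sd = sds),
               dim = c(dim(means), n_sims))

# loans x simulations: the highest draw and the investor who produced it
highest <- apply(draws, c(1, 3), max)
winner  <- apply(draws, c(1, 3), which.max)
rownames(highest) <- rownames(means)

# loans x investors: how often each investor had the highest draw
wins <- t(apply(winner, 1, tabulate, nbins = ncol(means)))
dimnames(wins) <- dimnames(means)
```

Here highest holds the stored maximum for every loan and simulation, and wins counts how often each investor came out on top.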

Related

R caret: fully reproducible results with parallel rfe on different machines

I have the following code using random forest as method which is fully reproducible if you run it in parallel mode on the same machine:
library(doParallel)
library(caret)
recursive_feature_elimination <- function(dat){
all_preds <- dat[,which(names(dat) %in% c("Time", "Chick", "Diet"))]
response <- dat[,which(names(dat) == "weight")]
sizes <- c(1:(ncol(all_preds)-1))
# set seeds manually
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
# seeds: a list with one element per resample plus one for the final model,
# i.e. number * repeats + 1 = 3 * 5 + 1 = 16 elements; each of the first 15
# elements is an integer vector of length(sizes) + 1
seeds <- vector(mode = "list", length = 16)
for(i in 1:15) seeds[[i]]<- sample.int(n=1000, size = length(sizes)+1)
# for the last model
seeds[[16]]<-sample.int(1000, 1)
seeds_list <- list(rfe_seeds = seeds,
train_seeds = NA)
# specify rfeControl
contr <- caret::rfeControl(functions=rfFuncs, method="repeatedcv", number=3, repeats=5,
saveDetails = TRUE, seeds = seeds, allowParallel = TRUE)
# recursive feature elimination caret
results <- caret::rfe(x = all_preds,
y = response,
sizes = sizes,
method ="rf",
ntree = 250,
metric= "RMSE",
rfeControl=contr )
return(results)
}
dat <- as.data.frame(ChickWeight)
cores <- detectCores()
cl <- makePSOCKcluster(cores, outfile="")
registerDoParallel(cl)
results <- recursive_feature_elimination(dat)
stopCluster(cl)
registerDoSEQ()
The outcome on my machine is:
Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
1 39.14 0.6978 24.60 2.755 0.02908 1.697
2 23.12 0.8998 13.90 2.675 0.02273 1.361 *
3 28.18 0.8997 20.32 2.243 0.01915 1.225
The top 2 variables (out of 2):
Time, Chick
I am using a Windows OS with one CPU and 4 cores. If the code is run on a UNIX OS using multiple CPUs with multiple cores, the outcome is different. I think this behaviour shows up because of the random number generation, which obviously differs between my system and the multi-CPU system. The same happens with train().
How can I get fully reproducible results independent of the OS and independent of how many CPUs and cores used for parallelization?
How can I ensure that the same random numbers are used for each internal process of rfe and randomForest, regardless of the order in which the processes run during parallel computing?
How are the random numbers generated for each parallel process?
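One general way to make random numbers independent of the number of workers is to give each task, rather than each worker, its own random number stream. The sketch below uses the L'Ecuyer-CMRG generator and nextRNGStream from the base parallel package. It only illustrates the idea and is not a description of how caret or randomForest handle seeds internally; task_streams and run_task are hypothetical names.

```r
library(parallel)

# Remember the current generator so it can be restored afterwards
old_kind <- RNGkind()[1]
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

# Derive one independent stream per task (not per worker)
n_tasks <- 4
task_streams <- vector("list", n_tasks)
s <- get(".Random.seed", envir = globalenv())
for (i in seq_len(n_tasks)) {
  task_streams[[i]] <- s
  s <- nextRNGStream(s)
}

# Hypothetical task: install its own stream, then draw random numbers
run_task <- function(stream) {
  assign(".Random.seed", stream, envir = globalenv())
  rnorm(2)
}

# The draws of a task do not depend on the order in which tasks are run
in_order <- lapply(task_streams, run_task)
reversed <- rev(lapply(rev(task_streams), run_task))
same <- identical(in_order, reversed)

RNGkind(old_kind)
```

Because every task starts from its own stored stream, the result of a task is the same whichever worker picks it up and in whatever order.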

generate clustered spatstat marks?

I was wondering if anyone knows how to assign marks in spatstat so that they tend to cluster spatially? I have a set of lat long coordinates that I want to categorize into 4 groups. I have figured out how to randomly assign marks/groups to these points using the following code:
as.ppp(data, window, marks=factor(sample(1:4, nrow(data), replace=TRUE)))
But I can't figure out how to assign the marks so that groups tend to occupy points closer to one another. As a further complication, I would also like the number of points within each group to be the same, specified number each time. Does anyone have any leads? Thanks in advance!
Typically in spatstat we define models which describe/generate points at random locations and possibly with random marks. If I understand you correctly you have a fixed set of locations and you simply want to assign random marks. How many points do you have? If you don't have too many points a simple suggestion could be to generate a multivariate normally distributed variable and then take the n_1 lowest values for the first mark, the n_2 next values for the second mark, and so on. A simple example with 4 equal sized groups of points:
library(spatstat)
library(mvtnorm)
set.seed(42) # Make reproducible
X <- redwood # Example data
n <- npoints(X)
Xdist <- pairdist(X) # n x n matrix of distances in X
decay_rate <- 1 # Parameter for covariance structure
sigma <- exp(-decay_rate * Xdist)
m <- rmvnorm(1, rep(0, n), sigma)
breaks <- quantile(m, probs = c(0, .25, .5, .75, 1)) # breaks to cut marks in four equal sized groups
marks(X) <- cut(m, breaks = breaks, include.lowest=TRUE, labels = 1:4)
plot(X)
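If each group must contain a specified number of points, the ranking idea described above (the n_1 lowest values get the first mark, the next n_2 the second, and so on) can be sketched with base R alone. The coordinates, group_sizes and decay_rate below are made-up illustration values, and the correlated normal field is drawn via a Cholesky factor here instead of mvtnorm::rmvnorm.

```r
# Sketch under assumptions: made-up coordinates and group sizes, base R only
set.seed(42)
n <- 100
xy <- cbind(x = runif(n), y = runif(n))   # stand-in for the real locations
group_sizes <- c(10, 20, 30, 40)          # must sum to n
decay_rate <- 5

# Exponential covariance: nearby points get similar values
d <- as.matrix(dist(xy))
sigma <- exp(-decay_rate * d)

# One draw from N(0, sigma): t(R) %*% e, where R is the upper Cholesky factor
z <- drop(crossprod(chol(sigma), rnorm(n)))

# The lowest group_sizes[1] values get mark 1, the next group_sizes[2] mark 2, ...
group_of_rank <- rep(seq_along(group_sizes), times = group_sizes)
grp <- factor(group_of_rank[rank(z, ties.method = "first")])
```

For a point pattern X with n points, the resulting factor could then be assigned with marks(X) <- grp.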

Efficiently sample a data frame avoiding loops

I have a data frame which consists of a first column (experiment.id) and the rest of the columns are values associated with this experiment id. Each row is a unique experiment id. My data frame has columns in the order of 10⁴ - 10⁵.
data.frame(experiment.id=1:100, v1=rnorm(100,1,2),v2=rnorm(100,-1,2) )
This data frame is the source of my sample space. What I would like to do is, for each unique experiment.id (row), randomly sample (with replacement) one of the values v1, v2, ..., v10000 associated with this id and construct a sample s1. In each sample s1 all experiment ids are represented.
Eventually I want to draw 10⁴ samples, s1, s2, ..., s10⁴, and calculate some statistic.
What would be the most computationally efficient way to perform this sampling process? I would like to avoid for loops as much as possible.
Update:
My question is not only about sampling but also about storing the samples. I guess my real question is whether there is a quicker way to perform the above than
d<-data.frame(experiment.id=1:1000, replicate (10000,rnorm(1000,100,2)) )
results<-data.frame(d$experiment.id,replicate(n=10000,apply(d[,2:10001],1,function(x){sample(x,size=1,replace=T)})))
Here is an expression that chooses one of the columns (excluding the first). It does not copy the first column, you will need to supply that as a separate step.
For a data frame d:
d[matrix(c(seq(nrow(d)), sample(ncol(d)-1, nrow(d), replace=TRUE)+1), ncol=2)]
That's one sample. To get N samples, just multiply the selection (as in John's answer):
mm <- matrix(c(rep(seq(nrow(d)), N), sample(ncol(d)-1, nrow(d)*N, replace=TRUE)+1), ncol=2)
result <- matrix(d[mm], ncol=N)
But you're going to have memory issues.
The shortest and most readable IMHO is still to use apply, but making good use of the fact that sample is vectorized:
results <- data.frame(experiment.id = d$experiment.id,
t(apply(d[, -1], 1, sample, 10000, replace = TRUE)))
If the 3 seconds it takes are too slow for your needs then I would recommend you use matrix indexing.
It's possible to do this without any looping whatsoever. If you convert the columns after the first one to a matrix this gets easy, because a matrix can be addressed either as [row, column] or sequentially as its underlying vector.
mat <- as.matrix(datf[,-1])
nr <- nrow(mat); nc <- ncol(mat)
sel <- sample( 1:nc, nr, replace = TRUE )
sel <- sel + ((1:nr)-1) * nc
x <- t(mat)[sel]
seldatf <- data.frame( datf[,1], x = x )
Now, getting lots of samples is pretty easy: just repeat the same logic.
ns <- 10 # number of samples / row
sel <- sample(1:nc, nr * ns, replace = TRUE )
sel <- sel + rep(((1:nr)-1) * nc, each = ns)
x <- t(mat)[sel]
seldatf <- cbind( datf[,1], data.frame(matrix(x, ncol = ns, byrow = TRUE)) )
This can become a really big data frame if you set ns <- 1e5 and have lots of rows, so watch out for running out of memory. I do a bit of unnecessary copying for readability. You can eliminate that to save memory and time, because once you use large amounts of memory you will be swapping out other running programs, which is slow. You don't have to assign and save x, mat, or even sel. Skipping those assignments should give you about the fastest answer possible.
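If only a statistic per sample is needed, one way around the memory problem is to draw the samples in chunks and keep just the statistic. The following is a minimal sketch under assumptions: the data are made up, colMeans stands in for "some statistic", and chunk_size is a hypothetical tuning parameter. It does use a short loop over chunks, but each chunk is still sampled with vectorised matrix indexing.

```r
# Sketch under assumptions: made-up data, mean as the statistic
set.seed(1)
nr <- 100                                  # experiments (rows)
nc <- 50                                   # values per experiment (columns)
mat <- matrix(rnorm(nr * nc, 100, 2), nr, nc)

n_samples  <- 1000
chunk_size <- 250
stat <- numeric(n_samples)                 # preallocated result

for (start in seq(1, n_samples, by = chunk_size)) {
  k <- min(chunk_size, n_samples - start + 1)
  # For each of the k samples, pick one random column per row
  idx <- cbind(rep(seq_len(nr), times = k),
               sample.int(nc, nr * k, replace = TRUE))
  s <- matrix(mat[idx], nrow = nr)         # nr x k: one column per sample
  stat[start:(start + k - 1)] <- colMeans(s)
}
```

Only stat and one chunk of samples are held in memory at any time, so chunk_size can be adjusted to the available RAM.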

preallocate list in R

It is inefficient in R to expand a data structure in a loop. How do I preallocate a list of a certain size? matrix makes this easy via the ncol and nrow arguments. How does one do this in lists? For example:
x <- list()
for (i in 1:10) {
x[[i]] <- i
}
I presume this is inefficient. What is a better way to do this?
vector can create an empty vector of the desired mode and length.
x <- vector(mode = "list", length = 10)
To expand on what @Jilber said, lapply is specially built for this type of operation.
Instead of the for loop, you could use:
x <- lapply(1:10, function(i) i)
You can extend this to more complicated examples. Often, the body of the for loop can be translated directly into a function that takes the loop variable as its argument.
Something like this:
x <- vector('list', 10)
But using lapply is the best choice
All 3 existing answers are great.
The reason the vector() function can create a list is explained in JennyBC's purrr tutorial:
A list is actually still a vector in R, but it’s not an atomic vector. We construct a list explicitly with list() but, like atomic vectors, most lists are created some other way in real life.
To preallocate a list
lst <- vector(mode = "list", length = 10)
To preallocate a vector
vec <- rep(NA, 10)
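Putting the pieces together, here is a minimal sketch of filling a preallocated list in a loop, next to the lapply equivalent. The variable names are only for illustration.

```r
n <- 10

# Preallocate a list of length n (every element starts as NULL)
x <- vector(mode = "list", length = n)
for (i in seq_along(x)) {
  x[[i]] <- i        # fill in place; the list is never grown
}

# The same result without an explicit loop
y <- lapply(seq_len(n), function(i) i)
```

One caveat: assigning NULL with x[[i]] <- NULL removes the element and shortens the list, so a preallocated slot that should stay empty has to be left untouched.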

Performance of rbind.data.frame

I have a list of dataframes for which I am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.
The situation can be simulated like this:
#create one row
onerowdfr<-do.call(data.frame, c(list(), rnorm(100) , lapply(sample(letters[1:2], 100, replace=TRUE), function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr)<-c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
#reuse it in a list
someParts<-lapply(rbinom(200, 1, 14/200)*6+1, function(reps){onerowdfr[rep(1, reps),]})
I've set the parameters (of the randomization) so that they approximate my true situation.
Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:
system.time(
result<-do.call(rbind, someParts)
)
Now, on my system (which is not particularly slow), and with the settings above, this is the output of system.time:
user system elapsed
5.61 0.00 5.62
Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a form of multiple imputation), so I need this to be as fast as possible.
Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.
On my system, using data frames:
> system.time(result<-do.call(rbind, someParts))
user system elapsed
2.628 0.000 2.636
Building the list with all numeric matrices instead:
onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1,
function(reps){onerowdfr2[rep(1, reps),]})
results in a lot faster rbind.
> system.time(result2<-do.call(rbind, someParts2))
user system elapsed
0.001 0.000 0.001
EDIT: Here's another possibility; it just combines each column in turn.
> system.time({
+ n <- 1:ncol(someParts[[1]])
+ names(n) <- names(someParts[[1]])
+ result <- as.data.frame(lapply(n, function(i)
+ unlist(lapply(someParts, `[[`, i))))
+ })
user system elapsed
0.810 0.000 0.813
Still not nearly as fast as using matrices though.
EDIT 2:
If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric so I force to integer first.
someParts2 <- lapply(someParts, function(x)
matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
lev <- levels(a[[i]])
result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}
The timing on my system is:
user system elapsed
0.090 0.00 0.091
Not a huge boost, but swapping rbind for rbind.fill from the plyr package knocks about 10% off the running time (with the sample dataset, on my machine).
If you really want to manipulate your data.frames faster, I would suggest to use the package data.table and the function rbindlist(). I did not perform extensive tests but for my dataset (3000 dataframes, 1000 rows x 40 columns each) rbindlist() takes only 20 seconds.
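A minimal sketch of the rbindlist() suggestion, on a small made-up list of data frames (parts and combined are illustration names). The requireNamespace guard is only there so the example also runs without data.table installed, in which case it falls back to do.call(rbind, ...).

```r
# Sketch under assumptions: small made-up parts with identical columns
set.seed(1)
parts <- lapply(1:5, function(i) {
  data.frame(id = i,
             x  = rnorm(3),
             g  = factor(sample(c("a", "b"), 3, replace = TRUE),
                         levels = c("a", "b")))
})

combined <- if (requireNamespace("data.table", quietly = TRUE)) {
  # rbindlist binds a list of data frames by row; convert back to a data.frame
  as.data.frame(data.table::rbindlist(parts))
} else {
  do.call(rbind, parts)
}
```

Timings will depend on the data, so it is worth checking with system.time on the real list before switching.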
This is ~25% faster, but there has to be a better way...
system.time({
N <- do.call(sum, lapply(someParts, nrow))
# preallocate exactly N rows, whatever the number of rows in the first part
SP <- as.data.frame(lapply(someParts[[1]], function(x) rep(x, length.out = N)))
k <- 0
for(i in seq_along(someParts)) {
j <- k+1
k <- k + nrow(someParts[[i]])
SP[j:k,] <- someParts[[i]]
}
})
Make sure you're binding a data frame to a data frame. I ran into a huge performance degradation when binding a list to a data frame.
