Efficiently sample a data frame avoiding loops - performance

I have a data frame whose first column is experiment.id and whose remaining columns are values associated with that experiment id. Each row is a unique experiment id. My data frame has on the order of 10⁴ to 10⁵ columns.
data.frame(experiment.id=1:100, v1=rnorm(100,1,2),v2=rnorm(100,-1,2) )
This data frame is the source of my sample space. What I would like to do is, for each unique experiment.id (row), randomly sample (with replacement) one of the values v1, v2, ..., v10000 associated with this id and construct a sample s1. In each sample s1 all experiment ids are represented.
Eventually I want to draw 10⁴ samples, s1, s2, ..., s10⁴, and calculate some statistic.
What would be the most efficient way (computationally) to perform this sampling process? I would like to avoid for loops as much as possible.
Update:
My question is not only about sampling but also about storing the samples. I guess my real question is whether there is a quicker way to perform the above other than
d <- data.frame(experiment.id=1:1000, replicate(10000, rnorm(1000, 100, 2)))
results <- data.frame(d$experiment.id,
                      replicate(n=10000,
                                apply(d[, 2:10001], 1,
                                      function(x) sample(x, size=1, replace=TRUE))))

Here is an expression that chooses one of the columns (excluding the first) for each row. It does not copy the first column; you will need to supply that as a separate step (a sketch of that is given after this answer).
For a data frame d:
d[matrix(c(seq(nrow(d)), sample(ncol(d)-1, nrow(d), replace=TRUE)+1), ncol=2)]
That's one sample. To get N samples, just multiply the selection (as in John's answer):
mm <- matrix(c(rep(seq(nrow(d)), N), sample(ncol(d)-1, nrow(d)*N, replace=TRUE)+1), ncol=2)
result <- matrix(d[mm], ncol=N)
But you're going to have memory issues.
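As a usage sketch (not from the original answer), here is one way to draw N samples with this matrix-indexing trick and re-attach the id column, assuming the d from the update above and a smaller N for illustration:
N <- 100   # illustration only; the question wants 10^4
mm <- matrix(c(rep(seq(nrow(d)), N),
               sample(ncol(d) - 1, nrow(d) * N, replace = TRUE) + 1), ncol = 2)
# each column of 'samples' after the id column is one sample s1..sN
samples <- data.frame(experiment.id = d$experiment.id, matrix(d[mm], ncol = N))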

The shortest and most readable IMHO is still to use apply, but making good use of the fact that sample is vectorized:
results <- data.frame(experiment.id = d$experiment.id,
                      t(apply(d[, -1], 1, sample, 10000, replace = TRUE)))
If the 3 seconds it takes are too slow for your needs then I would recommend you use matrix indexing.
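As a usage note (not part of the answer above): once the samples are arranged with one column per sample, as in results above, the statistic the question mentions is just a column-wise operation, for example:
# one value per sample s1..sN; the first column holds experiment.id
sample_means <- colMeans(results[, -1])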

It's possible to do this without any looping whatsoever. If you convert the columns after the first one to a matrix, this gets easy, because a matrix can be addressed either as [row, column] or sequentially via its underlying vector.
mat <- as.matrix(datf[,-1])
nr <- nrow(mat); nc <- ncol(mat)
sel <- sample( 1:nc, nr, replace = TRUE )
sel <- sel + ((1:nr)-1) * nc
x <- t(mat)[sel]
seldatf <- data.frame( datf[,1], x = x )
Now, to get lots of samples, it's pretty easy to just extend the same logic.
ns <- 10 # number of samples / row
sel <- sample(1:nc, nr * ns, replace = TRUE )
sel <- sel + rep(((1:nr)-1) * nc, each = ns)
x <- t(mat)[sel]
seldatf <- cbind( datf[,1], data.frame(matrix(x, ncol = ns, byrow = TRUE)) )
It's possible that this is going to be a really big data frame if you set ns <- 1e5 and you have lots of rows, so you may have to watch out for running out of memory. I do a bit of unnecessary copying for readability; you can eliminate that to save memory and time, because once you are using large amounts of memory you'll be swapping out other running programs, and that is slow. You don't have to assign and save x, mat, or even sel; skipping those assignments (see the sketch below) gives you about the fastest version possible.
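For illustration, here is a minimal sketch of that stripped-down version, assuming the same datf, nr, nc, and ns as above; it builds the result in one expression without keeping mat, sel, or x around:
seldatf <- cbind(datf[, 1],
                 data.frame(matrix(
                   t(as.matrix(datf[, -1]))[
                     sample(nc, nr * ns, replace = TRUE) +
                       rep((seq_len(nr) - 1) * nc, each = ns)],
                   ncol = ns, byrow = TRUE)))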

Related

Simulations in R with apply and replicate

I have two matrices: one that contains all the mean values and another that contains all the standard deviations. I want to simulate a random number for each of the three investors and see which investor gets the highest.
For example: Loan 1 has three investors. I take the highest of
rnorm(1,m[1,1],sd[1,1]),rnorm(1,m[1,2],sd[1,2]),rnorm(1,m[1,3],sd[1,3])
and store it. I want to simulate this 1000 times and store the results.
Can I use a combination of mapply, sapply, and replicate to do it? If you can give me some pointers I would be very grateful.
means <- matrix(c(-0.086731728,-0.1556901,-0.744495,
-0.166453802, -0.1978284, -0.9021422,
-0.127376145, -0.1227214, -0.6926699
), ncol = 3)
means <- t(means)
colnames(means) <- c("inv1","inv2","inv3")
rownames(means) <- c("loan1","loan2","loan3")
sd <- matrix(c(0.4431459, 0.5252441, 0.5372112,
0.4431882, 0.5252268, 0.5374614,
0.4430836, 0.5248798, 0.536924
), ncol = 3)
sd <- t(sd)
colnames(sd) <- c("inv1","inv2","inv3")
rownames(sd) <- c("loan1","loan2","loan3")
Given this is just an element-wise operation, you can use an appropriate vectorised function to compute this:
# Create a function to perform the computation you want
# Get the highest value from 1000 simulations
f <- function(m,s,reps=1000) max(rnorm(reps,m,s))
# Convert this function to a vectorised binary function
`%f%` <- Vectorize(f)
# Generate results - this will be a vector
results <- means %f% sd
# Tidy up results
results <- matrix(results,ncol(means))
colnames(results) <- colnames(means)
rownames(results) <- rownames(means)
# Results
results
inv1 inv2 inv3
loan1 1.486830 1.317569 0.8679278
loan2 1.212262 1.762396 0.7514182
loan3 1.533593 1.461248 0.7539696
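For reference, here is a hedged sketch of the mapply/replicate combination the question asks about, assuming the transposed means and sd matrices above; each replication draws one value per loan/investor cell and keeps the highest draw per loan:
sims <- replicate(1000, {
  draws <- matrix(mapply(rnorm, 1, means, sd),
                  nrow = nrow(means), dimnames = dimnames(means))
  apply(draws, 1, max)   # highest investor draw for each loan
})
# sims is a 3 x 1000 matrix: one row per loan, one column per simulation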

Filling the holes in time series data

So I am trying to build one-factor models with stocks and indices in R. I have 30 stocks and 16 indices in total. They are all time series from "2013-1-1" to "2014-12-31"; well, at least all my stocks are. All of my indices are missing some entries here and there. For example, all of my stocks' data have a length of 522, but one index has a length of 250, another 300, another 400, and so on. But they all start at "2013-1-1" and end at "2014-12-31". Because my index data has holes in it, I can't check correlations or build linear models with them; I can't do anything, basically. So I need to fill these holes. I am thinking about filling those holes with their mean, but I don't know how to do it. I am open to other ideas, of course. Can you help me? It is an important term project for me, so there is a lot on the line.
Edited based upon your comments (and to fix a mistake I made):
This is basic data management and I'm surprised that you're being required to work with time series data without knowing how to merge() and how to create data frames.
Create some fake date and value data with holes in the dates:
dFA <- data.frame(seq.Date(as.Date("2014-01-01"), as.Date("2014-02-28"), 3))
names(dFA) <- "date"
dFA$vals <- rnorm(nrow(dFA), 25, 5)
Create a dataframe of dates from the min value in dFA to the max value in dFA
dFB <- as.data.frame(seq.Date(as.Date(min(dFA$date, na.rm = T), format = "%Y-%m-%d"),
as.Date(max(dFA$date, na.rm = T), format = "%Y-%m-%d"),
1))
names(dFB) <- "date"
Merge the two dataframes together
tmp <- merge(dFB, dFA, by = "date", all = T)
Change NA values in tmp$vals to whatever you want
tmp$vals[is.na(tmp$vals)] <- mean(dFA$vals)
head(tmp)
date vals
1 2014-01-01 18.48131
2 2014-01-02 24.16256
3 2014-01-03 24.16256
4 2014-01-04 28.78855
5 2014-01-05 24.16256
6 2014-01-06 24.16256
Original comment below
The easiest way to fill in the holes is with merge().
Create a new data frame with one vector as a sequence of dates that span the range of your original dataframe and the other vector with whatever you're going to fill the holes (zeroes, means, whatever). Then just merge() the two together:
merge(dFB, dFA, by = "date", all = TRUE)   # "date" = the column holding the date values
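As an illustration (not part of the original answer), here is a sketch of applying the same merge-and-fill idea to each index series; it assumes each index is a data frame with columns date and vals and that they are collected in a hypothetical list called indices_list:
full_dates <- data.frame(date = seq.Date(as.Date("2013-01-01"),
                                         as.Date("2014-12-31"), by = 1))
fill_holes <- function(idx) {
  out <- merge(full_dates, idx, by = "date", all = TRUE)
  out$vals[is.na(out$vals)] <- mean(idx$vals, na.rm = TRUE)  # mean-fill the gaps
  out
}
# filled <- lapply(indices_list, fill_holes)   # indices_list is assumed to exist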

Looking for an efficient way to perform a computation - MATLAB

I have a scalar function f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2) which receives two 2-dimensional vectors as input (norm here implements the Euclidean norm). The values of x,i range in 1:w and the values y,j range in 1:h. I want to create a cell array X such that X{x,y} will contain a w x h matrix such that X{x,y}(i,j) = f([x,y],[i,j]). This can obviously be done using 4 nested loops like so:
for x=1:w
    for y=1:h
        X{x,y}=zeros(w,h);
        for i=1:w
            for j=1:h
                X{x,y}(i,j)=f([x,y],[i,j]);
            end
        end
    end
end
This is however extremely inefficient. I would very much appreciate an efficient way to create X.
One way to do this is to remove the two innermost loops and replace them with a vectorised version. By the look of your f function this shouldn't be too bad.
First we need to construct two w-by-h matrices, one holding 1 to w down every column and one holding 1 to h along every row, like so:
wMat=repmat(1:w,h,1)';   % w-by-h, wMat(i,j) = i
hMat=repmat(1:h,w,1);    % w-by-h, hMat(i,j) = j
This is going to represent the inner two loops, and the transpose will allow us to get all combinations. Now we can vectorise the calculation (f([x,y],[i,j])= exp(-norm([x,y]-[i,j])^2/sigma^2)):
for x=1:w
    for y=1:h
        temp1=(x-wMat).^2+(y-hMat).^2;   % squared Euclidean distances
        X{x,y}=exp(-temp1/sigma^2);
    end
end
Here we have computed the squared Euclidean distances for all inner-loop index pairs at once.
Some discussion and code
The trick here is to perform the norm-calculations with numeric arrays and save the results into a cell array version as late as possible. For performing the norm-calculations you can take help of ndgrid, bsxfun and some permute + reshape to give it the "shape" as needed for the final cell array version. So, here's the vectorized approach to perform these tasks -
%// Create x-y/i-j values to be used for calculation of function values
[xi,yi] = ndgrid(1:w,1:h);
%// Get the norm values
normvals = sqrt(bsxfun(@minus,xi(:),xi(:).').^2 + ...
    bsxfun(@minus,yi(:),yi(:).').^2);
%// Get the actual function values
vals = exp(-normvals.^2/sigma^2);
%// Get the values into blocks of a 4D array and then re-arrange to match
%// with the shape of numeric array version of X
blks = reshape(permute(reshape(vals, w*h, h, []), [2 1 3]), h, w, h, w);
arranged_blks = reshape(permute(blks,[2 3 1 4]),w,h,w,h);
%// Finally get the cell array version
X = squeeze(mat2cell(arranged_blks,w,h,ones(1,w),ones(1,h)));
Benchmarking and runtimes
After improving the original loopy code with pre-allocation for X and function-inlining of f, runtime benchmarks were performed against the proposed vectorized approach with datasizes w, h = 60, and the runtime results thus obtained were -
----------- With Improved loopy code
Elapsed time is 41.227797 seconds.
----------- With Vectorized code
Elapsed time is 2.116782 seconds.
This suggested a whopping speedup of close to 20x with the proposed solution!
For extremely huge datasizes
If you are dealing with huge datasizes, you may essentially not have enough memory for bsxfun to work with, as bsxfun is known to use a lot of memory to deliver a performance-efficient vectorized solution. So, for such huge-datasize cases, you can use the following loopy approach to replace the normvals calculation listed in the earlier bsxfun-based solution -
%// Get the norm values
nx = numel(xi);
normvals = zeros(nx,nx);
for ii = 1:nx
normvals(:,ii) = sqrt( (xi(:) - xi(ii)).^2 + (yi(:) - yi(ii)).^2 );
end
It seems to me that when you run through the cycle for x=w, y=h, you are calculating all the values you need at once, so you don't need to recalculate them. Once you have this:
for i=1:w
    for j=1:h
        temp(i,j)=f([x,y],[i,j]);
    end
end
Then, e.g. X{1,1} is just temp(1,1), X{2,2} is just temp(1:2,1:2), and so on. If you can vectorise the calculation of f (norm here is just the Euclidean norm of that vector?) then it will get even simpler.

Performance of rbind.data.frame

I have a list of dataframes for which I am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.
The situation can be simulated like this:
# create one row
onerowdfr <- do.call(data.frame, c(list(), rnorm(100),
    lapply(sample(letters[1:2], 100, replace=TRUE),
           function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr) <- c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
# reuse it in a list
someParts <- lapply(rbinom(200, 1, 14/200)*6+1,
                    function(reps){onerowdfr[rep(1, reps),]})
I've set the parameters (of the randomization) so that they approximate my true situation.
Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:
system.time(
result<-do.call(rbind, someParts)
)
Now, on my system (which is not particularly slow), and with the settings above, this is the output of system.time:
user system elapsed
5.61 0.00 5.62
Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a form of multiple imputation), so I need this to be as fast as possible.
Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.
On my system, using data frames:
> system.time(result<-do.call(rbind, someParts))
user system elapsed
2.628 0.000 2.636
Building the list with all numeric matrices instead:
onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1,
function(reps){onerowdfr2[rep(1, reps),]})
results in a lot faster rbind.
> system.time(result2<-do.call(rbind, someParts2))
user system elapsed
0.001 0.000 0.001
EDIT: Here's another possibility; it just combines each column in turn.
> system.time({
+ n <- 1:ncol(someParts[[1]])
+ names(n) <- names(someParts[[1]])
+ result <- as.data.frame(lapply(n, function(i)
+ unlist(lapply(someParts, `[[`, i))))
+ })
user system elapsed
0.810 0.000 0.813
Still not nearly as fast as using matrices though.
EDIT 2:
If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric so I force to integer first.
someParts2 <- lapply(someParts, function(x)
matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
lev <- levels(a[[i]])
result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}
The timing on my system is:
user system elapsed
0.090 0.00 0.091
Not a huge boost, but swapping rbind for rbind.fill from the plyr package knocks about 10% off the running time (with the sample dataset, on my machine).
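A minimal sketch of that swap, assuming the someParts list from the question (rbind.fill() accepts a list of data frames directly):
library(plyr)
result <- rbind.fill(someParts)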
If you really want to manipulate your data.frames faster, I would suggest to use the package data.table and the function rbindlist(). I did not perform extensive tests but for my dataset (3000 dataframes, 1000 rows x 40 columns each) rbindlist() takes only 20 seconds.
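A minimal sketch of the data.table route, again assuming someParts; rbindlist() returns a data.table, so wrap it if a plain data frame is required:
library(data.table)
result <- as.data.frame(rbindlist(someParts))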
This is ~25% faster, but there has to be a better way...
system.time({
  N <- do.call(sum, lapply(someParts, nrow))
  SP <- as.data.frame(lapply(someParts[[1]], function(x) rep(x, N)))
  k <- 0
  for(i in 1:length(someParts)) {
    j <- k + 1
    k <- k + nrow(someParts[[i]])
    SP[j:k, ] <- someParts[[i]]
  }
})
Make sure you're binding a data frame to a data frame; I ran into a huge performance degradation when binding a list to a data frame.
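A minimal sketch of that sanity check, assuming the someParts list from the question:
# confirm every element really is a data frame before binding
stopifnot(all(vapply(someParts, is.data.frame, logical(1))))
result <- do.call(rbind, someParts)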

Performance of swapping two elements in MATLAB

Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The second example takes about one twelfth as much total time as the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885,548 times. (The difference here is a factor of four, not a factor of twelve as I originally posted. Each of the functions has the additional function-wrapping overhead, while in my initial post I just summed up the individual lines.)
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
    function swap(self, i1, i2)
        swap1(self, i1, i2);
        swap2(self, i1, i2);
        swap3(self, i1, i2);
        self.SwapCount = self.SwapCount + 1;
    end
end
methods (Access = private)
    %
    % swap1: stores values in temporary doubles
    % This has the best performance
    %
    function swap1(self, i1, i2)
        e1 = self.Data(i1);
        e2 = self.Data(i2);
        self.Data(i1) = e2;
        self.Data(i2) = e1;
    end
    %
    % swap2: stores values in a temporary matrix
    % Marginally slower than swap1
    %
    function swap2(self, i1, i2)
        m = self.Data([i1, i2]);
        self.Data([i2, i1]) = m;
    end
    %
    % swap3: does not use variables for storage.
    % This has the worst performance
    %
    function swap3(self, i1, i2)
        self.Data([i1, i2]) = self.Data([i2, i1]);
    end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
You could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may be a function of the version of MATLAB you are using. I tried both of your cases in version 7.1.0.246 (a couple years old) and only saw a speed difference of about 2-2.5.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];
% Example 1:
for i = 1:4
    vec(i) = vec(i)+1;
end
% Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
This post deserves an update, since the JIT compiler is now a thing (since R2015b) and so is timeit (since R2013b) for more reliable function timing.
Below is a short benchmarking function for element swapping within a large array.
I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question respectively.
The results are pretty staggering: the performance of directly swapping two elements is increasingly poor compared with using a temporary variable.
function benchie()
    % Variables for plotting, loop to increase size of the arrays
    M = 15; D = zeros(1,M); W = zeros(1,M);
    for n = 1:M
        N = 2^n;
        % Create some random array of length N, and random indices to swap
        v = rand(N,1);
        x = randi([1, N], N, 1);
        y = randi([1, N], N, 1);
        % Time the functions
        D(n) = timeit(@()direct);
        W(n) = timeit(@()withtemp);
    end
    % Plotting
    plot(2.^(1:M), D, 2.^(1:M), W);
    legend('direct', 'with temp')
    xlabel('number of elements'); ylabel('time (s)')
    function direct()
        % Direct swapping of two elements
        for k = 1:N
            v([x(k) y(k)]) = v([y(k) x(k)]);
        end
    end
    function withtemp()
        % Using an intermediate temporary variable
        for k = 1:N
            tmp = v(y(k));
            v(y(k)) = v(x(k));
            v(x(k)) = tmp;
        end
    end
end
