Applying a function to a distance matrix in R - algorithm

This question came up today on the manipulatr mailing list.
http://groups.google.com/group/manipulatr/browse_thread/thread/fbab76945f7cba3f
I am rephrasing.
Given a distance matrix (calculated with dist), apply a function to the rows of the distance matrix.
Code:
library(plyr)
N <- 100
a <- data.frame(b=1:N,c=runif(N))
d <- dist(a,diag=T,upper=T)
sumd <- adply(as.matrix(d),1,sum)
The problem is that to apply the function by row you have to store the whole matrix (instead of just the lower triangular part), so it uses too much memory for large matrices. It fails on my computer for matrices of dimension ~10000.
Any ideas?

First of all, for anyone who hasn't seen this yet, I strongly recommend reading this article on the r-wiki about code optimization.
Here's another version without using ifelse (that's a relatively slow function):
noeq.2 <- function(i, j, N) {
i <- i-1
j <- j-1
x <- i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i
x2 <- j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j
idx <- i < j
x[!idx] <- x2[!idx]
x[i==j] <- 0
x
}
And timings on my laptop:
> N <- 1000
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
user system elapsed
51.31 0.10 52.06
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
user system elapsed
2.47 0.02 2.67
> system.time(sapply(1:N, function(j) noeq.2(1:N, j, N)))
user system elapsed
0.88 0.01 1.12
And lapply is faster than sapply:
> system.time(do.call("rbind",lapply(1:N, function(j) noeq.2(1:N, j, N))))
user system elapsed
0.67 0.00 0.67

This is a vectorized version of the function noeq (either argument i or j):
noeq.1 <- function(i, j, N) {
i <- i-1
j <- j-1
ifelse(i < j,
i*(N-1) - ((i-1)*i)/2 + j - i,
j*(N-1) - ((j-1)*j)/2 + i - j) * ifelse(i == j, 0, 1)
}
> N <- 4
> sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N)))
[,1] [,2] [,3] [,4]
[1,] 0 1 2 3
[2,] 1 0 4 5
[3,] 2 4 0 6
[4,] 3 5 6 0
> sapply(1:N, function(i) noeq.1(i, 1:N, N))
[,1] [,2] [,3] [,4]
[1,] 0 1 2 3
[2,] 1 0 4 5
[3,] 2 4 0 6
[4,] 3 5 6 0
Timings are done on a 2.4 GHz Intel Core 2 Duo (Mac OS 10.6.1):
> N <- 1000
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
user system elapsed
0.676 0.061 0.738
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
user system elapsed
14.359 0.032 14.410

My solution is to get the indexes of the distance vector, given a row and the size of the matrix. I got this from codeguru:
int Trag_noeq(int row, int col, int N)
{
//assert(row != col); //You can add this in if you like
if (row<col)
return row*(N-1) - (row-1)*((row-1) + 1)/2 + col - row - 1;
else if (col<row)
return col*(N-1) - (col-1)*((col-1) + 1)/2 + row - col - 1;
else
return -1;
}
After translating to R, assuming indexes start at 1, and assuming a lower-triangular instead of an upper-triangular matrix, I got the following.
EDIT: Using the vectorized version contributed by rcs
noeq.1 <- function(i, j, N) {
i <- i-1
j <- j-1
ix <- ifelse(i < j,
i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i,
j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j) * ifelse(i == j, 0, 1)
ix
}
## To get the indexes of the row, the following one liner works:
getrow <- function(z, N) noeq.1(z, 1:N, N)
## to get the row sums
getsum <- function(d, f=sum) {
N <- attr(d, "Size")
sapply(1:N, function(i) {
if (i%%100==0) print(i)
f(d[getrow(i,N)])
})
}
So, with the example:
sumd2 <- getsum(d)
This was much slower than as.matrix for small matrices before vectorizing, but only about 3x as slow after vectorizing. On an Intel Core 2 Duo at 2 GHz, applying the sum by row to the size-10000 matrix took just over 100 s, whereas the as.matrix method fails. Thanks rcs!
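As a sanity check on the index arithmetic, here is a minimal sketch that compares index-based row sums with the as.matrix route on a small example. The helper name dist_index and the toy data are assumptions made for illustration; the formula is the one given in ?dist for i < j, which is algebraically the same as the one in noeq.1.

```r
# Hypothetical helper: position of element (i, j), i != j, inside the
# vector stored by dist() (lower triangle, column by column, 1-based).
dist_index <- function(i, j, n) {
  lo <- pmin(i, j)
  hi <- pmax(i, j)
  n * (lo - 1) - lo * (lo - 1) / 2 + hi - lo
}

set.seed(1)
pts <- data.frame(b = 1:6, c = runif(6))
d <- dist(pts)
n <- attr(d, "Size")
v <- as.vector(d)  # plain numeric vector of length n * (n - 1) / 2

# Row sums without ever building the full n x n matrix
row_sums <- sapply(seq_len(n), function(i) {
  j <- setdiff(seq_len(n), i)  # skip the diagonal, which is always 0
  sum(v[dist_index(i, j, n)])
})

# Compare against the memory-hungry version
all.equal(row_sums, unname(rowSums(as.matrix(d))))
```

Only the length n(n-1)/2 vector is touched inside the loop, which is the point of the approach above.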

Related

Is there a function to generate a specific n Multichoose r combination, given the index number?

For example, 3 multichoose 2 has the following combinations:
i combo
0 = [0,0]
1 = [0,1]
2 = [0,2]
3 = [1,1]
4 = [1,2]
5 = [2,2]
Could a function be written whose arguments are n,r,i and returns the combination in question, without iterating through every combination before it?
Yes. We have to do a little counting to get at the heart of this problem. To better illustrate how this can be broken down into very simple smaller problems, we will look at a larger example. Consider all combinations of 5 chosen 3 at a time with no repeats (we will say from here on out 5 choose 3).
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
[3,] 1 2 5
[4,] 1 3 4
[5,] 1 3 5
[6,] 1 4 5
[7,] 2 3 4
[8,] 2 3 5
[9,] 2 4 5
[10,] 3 4 5
Notice the first 6 rows. If we remove the first column of these 6 rows and subtract 1 from every element, we obtain:
[,1] [,2] [,1] [,2]
[1,] 2 3 [1,] 1 2
[2,] 2 4 subtract 1 [2,] 1 3
[3,] 2 5 --->>>> [3,] 1 4
[4,] 3 4 [4,] 2 3
[5,] 3 5 [5,] 2 4
[6,] 4 5 [6,] 3 4
The matrix on the right is precisely all of the combinations of 4 choose 2. Continuing on, we see that the "second" group (i.e. rows 7 through 9 of the original matrix) has the same structure:
[,1] [,2] [,1] [,2]
[1,] 3 4 [1,] 1 2
[2,] 3 5 subtract 2 [2,] 1 3
[3,] 4 5 --->>>> [3,] 2 3
This is simply 3 choose 2. We are starting to see a pattern unfold. Namely, that all combinations of smaller n and r are contained in our parent combinations. This pattern continues as we move to the right. All that is left is to keep up with which combination we are after.
Below is the above algorithm written out in C++ (N.B. there isn't any data validation):
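#include <cmath>   // std::round
#include <numeric> // std::accumulate
#include <vector>  // std::vector
template <typename T>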
double nChooseK(T n, T k) {
// Returns number of k-combinations from n elements.
// Mathematically speaking, we have: n!/(k!*(n-k)!)
if (k == n || k == 0)
return 1;
else if (k > n || n < 0)
return 0;
double nCk;
double temp = 1;
for (int i = 1; i <= k; i++)
temp *= (double) (n - k + i) / i;
nCk = std::round(temp);
return nCk;
}
std::vector<int> nthCombination(int n, int r, double i) {
int j = 0, n1 = n - 1, r1 = r - 1;
double temp, index1 = i, index2 = i;
std::vector<int> res(r);
for (int k = 0; k < r; k++) {
temp = nChooseK(n1, r1);
while (temp <= index1) {
index2 -= nChooseK(n1, r1);
n1--;
j++;
temp += nChooseK(n1, r1);
}
res[k] = j;
n1--;
r1--;
j++;
index1 = index2;
}
return res;
}
Calling it on our example above with 5 choose 3 we obtain:
nthCombination(5, 3, 0) -->> 0 1 2
nthCombination(5, 3, 1) -->> 0 1 3
nthCombination(5, 3, 2) -->> 0 1 4
nthCombination(5, 3, 3) -->> 0 2 3
nthCombination(5, 3, 4) -->> 0 2 4
nthCombination(5, 3, 5) -->> 0 3 4
nthCombination(5, 3, 6) -->> 1 2 3
nthCombination(5, 3, 7) -->> 1 2 4
nthCombination(5, 3, 8) -->> 1 3 4
nthCombination(5, 3, 9) -->> 2 3 4
This approach is very efficient as well. Below, we get the billionth combination of 40 choose 20 (which generates more than 100 billion combinations) instantly:
// N.B. base zero so we need to subtract 1
nthCombination(40, 20, 1000000000 - 1) -->>
0 1 2 3 4 5 8 9 14 16 18 20 22 23 31 33 34 35 38 39
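For readers who want to experiment with the counting argument without compiling anything, here is a rough R sketch of the same idea. The function name nth_combination is an assumption for illustration, and it relies on base R's choose(); like the C++ version, ranks and values are zero-based.

```r
# Sketch: i-th (zero-based) combination of r values out of 0..(n-1),
# in lexicographic order, found by skipping whole blocks of combinations.
nth_combination <- function(n, r, i) {
  stopifnot(i >= 0, i < choose(n, r))
  res <- integer(r)
  j <- 0L  # smallest value still allowed at the current position
  for (k in seq_len(r)) {
    repeat {
      # combinations that have value j at position k:
      # the remaining r - k slots are chosen from the n - j - 1 larger values
      block <- choose(n - j - 1, r - k)
      if (i < block) break
      i <- i - block  # skip the whole block
      j <- j + 1L
    }
    res[k] <- j
    j <- j + 1L  # no repeats: next position must be strictly larger
  }
  res
}

# Compare with brute-force enumeration for 5 choose 3
all(sapply(0:9, function(i) {
  all(nth_combination(5, 3, i) == combn(5, 3)[, i + 1] - 1)
}))
```

The comparison against combn() is only feasible for small n; the function itself never enumerates anything.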
Edit
As the OP points out in the comments, they gave an example with repeats. The solution is very similar and it breaks down to counting. We first need a counting function similar to nChooseK but that considers repeats. The function below does just that:
double combsWithReps(int n, int r) {
// For combinations where repetition is allowed, this
// function returns the number of combinations for
// a given n and r. The resulting vector, "triangleVec"
// resembles triangle numbers. In fact, this vector
// is obtained in a very similar method as generating
// triangle numbers, albeit in a repeating fashion.
if (r == 0)
return 1;
int i, k;
std::vector<double> triangleVec(n);
std::vector<double> temp(n);
for (i = 0; i < n; i++)
triangleVec[i] = i+1;
for (i = 1; i < r; i++) {
for (k = 1; k <= n; k++)
temp[k-1] = std::accumulate(triangleVec.begin(), triangleVec.begin() + k, 0.0);
triangleVec = temp;
}
return triangleVec[n-1];
}
And here is the function that generates the ith combination with repeats.
std::vector<int> nthCombWithRep(int n, int r, double i) {
int j = 0, n1 = n, r1 = r - 1;
double temp, index1 = i, index2 = i;
std::vector<int> res(r);
for (int k = 0; k < r; k++) {
temp = combsWithReps(n1, r1);
while (temp <= index1) {
index2 -= combsWithReps(n1, r1);
n1--;
j++;
temp += combsWithReps(n1, r1);
}
res[k] = j;
r1--;
index1 = index2;
}
return res;
}
It is very similar to the first function above. You will notice that n1-- and j++ are removed from the end of the function and also that n1 is initialized to n instead of n - 1.
Here is the above example:
nthCombWithRep(40, 20, 1000000000 - 1) -->>
0 0 0 0 0 0 0 0 0 0 0 4 5 6 8 9 12 18 18 31
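The same block-skipping idea can be sketched in R for the case with repeats. This is not a translation of the C++ above: instead of the triangle-number routine it assumes the standard closed form choose(p + m - 1, m) for the number of multisets of size m drawn from p values. The name nth_multicombination is hypothetical.

```r
# Sketch: i-th (zero-based) combination with repetition of r values
# out of 0..(n-1), in lexicographic order.
nth_multicombination <- function(n, r, i) {
  stopifnot(i >= 0, i < choose(n + r - 1, r))
  res <- integer(r)
  j <- 0L  # smallest value still allowed at the current position
  for (k in seq_len(r)) {
    repeat {
      # multisets that have value j at position k: the remaining r - k
      # slots are filled, with repeats, from the n - j values j..(n-1)
      block <- choose(n - j + r - k - 1, r - k)
      if (i < block) break
      i <- i - block
      j <- j + 1L
    }
    res[k] <- j  # j is not advanced here, so values may repeat
  }
  res
}

# The 3 multichoose 2 table from the question, one row per index 0..5
t(sapply(0:5, function(i) nth_multicombination(3, 2, i)))
```

The only structural difference from the no-repeat case is that j is not incremented after a value is placed.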

Can anyone explain how this division algorithm works?

I saw this in an algorithm textbook. I am confused about the middle recursive function. If you can explain it with an example, such as 4/2, that would be great!
function divide(x, y)
Input: Two n-bit integers x and y, where y ≥ 1
Output: The quotient and remainder of x divided by y
if x = 0: return (q, r) = (0, 0)
(q, r) = divide(floor(x/2), y)
q = 2 · q, r = 2 · r
if x is odd: r = r + 1
if r ≥ y: r = r − y, q = q + 1
return (q, r)
The recursion halves x until it reaches 0, then rebuilds the quotient and remainder one bit at a time on the way back up. This is essentially performing bit shifts and operating on the binary digits. A more interesting case would be 13/3 (13 is 1101 in binary).
divide(13, 3) // initial binary value - 1101
divide(6, 3) // shift right - 110
divide(3, 3) // shift right - 11
divide(1, 3) // shift right - 1 (this is the most significant bit)
divide(0, 3) // shift right - 0 (no more significant bits)
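return(0, 0) // divide(0, 3): base case, now roll it back up
return(0, 1) // divide(1, 3): q = 2 * 0, r = 2 * 0; x is odd (1) so r = 1
return(1, 0) // divide(3, 3): q = 2 * 0, r = 2 * 1 = 2; x is odd (3) so r = 3; r >= y so r = 0, q = 1
return(2, 0) // divide(6, 3): q = 2 * 1, r = 2 * 0 = 0; x is even and r < y, so no adjustment
return(4, 1) // divide(13, 3): q = 2 * 2, r = 2 * 0; x is odd (13) so r = 0 + 1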
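Since the question asked about 4/2, here is a small R transcription of the pseudocode that can be run directly. It is only a sketch: it uses ordinary R numbers rather than n-bit integers, and the q/r names on the returned vector are a choice made here.

```r
# Direct transcription of the textbook divide(x, y), assuming y >= 1
divide <- function(x, y) {
  if (x == 0) return(c(q = 0, r = 0))
  qr <- divide(x %/% 2, y)   # recurse on floor(x / 2), i.e. shift right
  q <- 2 * qr[["q"]]         # shift the partial results back left
  r <- 2 * qr[["r"]]
  if (x %% 2 == 1) r <- r + 1            # bring down the dropped bit
  if (r >= y) { r <- r - y; q <- q + 1 } # remainder too big: subtract y once
  c(q = q, r = r)
}

divide(4, 2)   # quotient 2, remainder 0
divide(13, 3)  # quotient 4, remainder 1
```

For 4/2 the calls are divide(4, 2), divide(2, 2), divide(1, 2), divide(0, 2), and the returned pairs on the way back up are (0, 0), (0, 1), (1, 0), (2, 0).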

Efficient way to create a circulant matrix in R

I want to create a circulant matrix from a vector in R. A circulant matrix is a matrix with the following form.
1 2 3 4
4 1 2 3
3 4 1 2
2 3 4 1
The second row is the same as the first row except the last element is at the beginning, and so on.
Now I have the vector, say, (1, 2, 3, 4) and I want to find an efficient (fast) way to create this matrix. In practice, the numbers are not integers and can be any numbers.
Here is what I am doing now.
x <- 1:4
n <- length(x)
mat <- matrix(NA, n, n)
for (i in 1:n) {
mat[i, ] <- c(x[-(1:(n+1-i))], x[1:(n+1-i)])
}
Is there a faster way to do this? I need to generate this kind of matrix over and over, so a small improvement in this one step will make a big difference. Thank you.
This makes use of vector recycling (it throws a warning):
circ<-function(x) {
n<-length(x)
matrix(x[matrix(1:n,n+1,n+1,byrow=T)[c(1,n:2),1:n]],n,n)
}
circ(letters[1:4])
# [,1] [,2] [,3] [,4]
#[1,] "a" "b" "c" "d"
#[2,] "d" "a" "b" "c"
#[3,] "c" "d" "a" "b"
#[4,] "b" "c" "d" "a"
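To see why the recycling trick works, it may help to look at the intermediate index matrix for n = 4. The variable names idx and circ_idx below are only for illustration.

```r
n <- 4
# Filling an (n+1) x (n+1) matrix by row with 1:n recycles the sequence,
# so every row starts one position later than the row above it.
# The length mismatch (25 cells, 4 values) is what triggers the warning.
idx <- suppressWarnings(matrix(1:n, n + 1, n + 1, byrow = TRUE))
idx

# Keep the first n columns and take the rows in the order 1, n, n-1, ..., 2
circ_idx <- idx[c(1, n:2), 1:n]
circ_idx

# Indexing any length-n vector with circ_idx yields its circulant matrix
x <- c(0.5, 1.5, 2.5, 3.5)
matrix(x[circ_idx], n, n)
```

Because only integer indices are built this way, the same circ_idx can be reused for many vectors of the same length, which may matter when the matrix is generated over and over.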
Here are some benchmarks of suggested solutions.
ndoogan takes the lead!
Benchmark
x <- 1:100
microbenchmark(
OP.Circulant(x),
Josh.Circulant(x),
Dwin.Circulant(x) ,
Matt.Circulant(x),
Matt.Circulant2(x),
Ndoogan.Circulant(x),
times=100
)
# Unit: microseconds
# expr min lq median uq max
# 1 Dwin.Circulant(x) 1232.775 1288.1590 1358.999 1504.4490 2900.430
# 2 Josh.Circulant(x) 1081.080 1086.3470 1097.863 1125.8745 2526.237
# 3 Matt.Circulant(x) 61924.920 64579.3735 65948.152 129359.7895 137371.570
# 4 Matt.Circulant2(x) 12746.096 13499.0580 13832.939 14346.8570 16308.040
# 5 Ndoogan.Circulant(x) 469.502 487.2285 528.591 585.8275 1522.363
# 6 OP.Circulant(x) 1291.352 1363.8395 1421.509 1513.4950 2714.707
Code used for benchmark
OP.Circulant <- function(x) {
n <- length(x)
mat <- matrix(NA, n, n)
for (i in 1:n) {
mat[i, ] <- c(x[-(1:(n + 1 - i))], x[1:(n + 1 - i)])
}
return(mat)
}
rotn <- function(x, n) rep(x, 2)[n:(n + length(x) - 1)]
Dwin.Circulant <- function(x) {
n <- length(x)
return(t(sapply(x[c(1L, n:2)], rotn, x = x)))
}
Josh.Circulant <- function(x, nrow = length(x)) {
m <- length(x)
return(matrix(x[(1:m - rep(1:nrow, each = m))%%m + 1L],
ncol = m, byrow = TRUE))
}
Matt.Circulant <- function(x) {
n <- length(x)
mat <- matrix(, n, n)
for (i in seq(-n + 1, n - 1)) {
mat[row(mat) == col(mat) - i] = x[i%%n + 1]
}
return(mat)
}
Matt.Circulant2 <- function(x) {
n <- length(x)
return(rbind(x[], do.call(rbind, lapply(seq(n - 1),
function(i) c(tail(x, i), head(x, -i))))))
}
Ndoogan.Circulant <-function(x) {
n <- length(x)
suppressWarnings(
matrix(x[matrix(1:n,n+1,n+1,byrow=T)[c(1,n:2),1:n]],n,n))
}
# check for identical results (all TRUE)
check <- OP.Circulant(x)
identical(check, OP.Circulant(x))
identical(check, Dwin.Circulant(x))
identical(check, Josh.Circulant(x))
identical(check, Matt.Circulant(x))
identical(check, Matt.Circulant2(x))
identical(check, Ndoogan.Circulant(x))
circulant <- function(x, nrow = length(x)) {
n <- length(x)
matrix(x[(1:n - rep(1:nrow, each=n)) %% n + 1L], ncol=n, byrow=TRUE)
}
circulant(1:4)
# [,1] [,2] [,3] [,4]
# [1,] 1 2 3 4
# [2,] 4 1 2 3
# [3,] 3 4 1 2
# [4,] 2 3 4 1
circulant(7:9, nrow=5)
# [,1] [,2] [,3]
# [1,] 7 8 9
# [2,] 9 7 8
# [3,] 8 9 7
# [4,] 7 8 9
# [5,] 9 7 8
circulant(10:1, nrow=2)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 10 9 8 7 6 5 4 3 2 1
# [2,] 1 10 9 8 7 6 5 4 3 2
rotn <- function(x,n) rep(x,2)[n:(n+length(x)-1)]
sapply(c(1,4:2), rotn, x=1:4)
[,1] [,2] [,3] [,4]
[1,] 1 4 3 2
[2,] 2 1 4 3
[3,] 3 2 1 4
[4,] 4 3 2 1
Might be faster inside a function if you constructed the double-length vector outside the sapply loop.
Here is a solution using Rcpp:
library(Rcpp)
cppFunction("
IntegerMatrix myCirculant(const int n) {
IntegerMatrix res(n);
int val = 1;
int dval = 2;
for (int i = 0; i < n*n; i++) {
res[i] = val;
if (val > 1) {
if (val != dval) {
val--;
} else {
if (dval == n) {
dval = 1;
} else {
dval++;
}
}
} else {
val = n;
}
}
return res;
}")
myCirculant(100)
This works only for integers (it takes the size n and fills in the circulant of 1:n) and takes about 1/10 of the time that Ndoogan.Circulant(1:100) takes on my machine.

Monte Carlo pi method

I am trying to write a Monte Carlo estimate of pi in R, but I have some problems with the code.
This is what I have so far:
ploscinaKvadrata <- 0
ploscinaKroga <- 0
n = 1000
for (i in i:n) {
x <- runif(1000, min= -1, max= 1)
y <- runif(1000, min= -1, max= 1)
if ((x^2 + y^2) <= 1) {
ploscinaKroga <- ploscinaKroga + 1
} else {
ploscinaKvadrata <- ploscinaKvadrata + 1
}
izracunPi = 4* ploscinaKroga/ploscinaKvadrata
}
izracunPi
This is not working, and I don't know how to fix it.
I would also like to write code to plot this (with the circle inside the square, and with the dots).
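Here is a vectorized version. (There was also something wrong with your math: the estimate is 4 times the fraction of all points that fall inside the circle, not the number inside divided by the number outside.)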
N <- 1000000
R <- 1
x <- runif(N, min= -R, max= R)
y <- runif(N, min= -R, max= R)
is.inside <- (x^2 + y^2) <= R^2
pi.estimate <- 4 * sum(is.inside) / N
pi.estimate
# [1] 3.141472
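For comparison, here is a loop-based sketch that stays closer to the structure of the code in the question. It assumes three fixes: the loop starts at 1 rather than at the undefined i, one point is drawn per iteration instead of 1000, and the count inside the circle is divided by the total number of points. The names inside and pi_loop are made up for this example.

```r
set.seed(42)
n <- 10000
inside <- 0
for (i in 1:n) {                      # start at 1, not at i
  x <- runif(1, min = -1, max = 1)    # a single point per iteration
  y <- runif(1, min = -1, max = 1)
  if (x^2 + y^2 <= 1) inside <- inside + 1
}
# area(circle) / area(square) = pi / 4, so scale the fraction by 4
pi_loop <- 4 * inside / n
pi_loop
```

This is much slower than the vectorized version for large n, so it is mainly useful for seeing what went wrong in the original loop.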
As far as plotting the points, you can do something like this:
plot.new()
plot.window(xlim = 1.1 * R * c(-1, 1), ylim = 1.1 * R * c(-1, 1))
points(x[ is.inside], y[ is.inside], pch = '.', col = "blue")
points(x[!is.inside], y[!is.inside], pch = '.', col = "red")
but I'd recommend you use a smaller N value, maybe 10000.
This is a fun game -- and there are a number of versions of it floating around the web. Here's one I hacked from the named source (tho' his code was somewhat naive).
from http://giventhedata.blogspot.com/2012/09/estimating-pi-with-r-via-mcs-dart-very.html
est.pi <- function(n){
# drawing in [0,1] x [0,1] covers one quarter of square and circle
# draw random numbers for the coordinates of the "dart-hits"
a <- runif(n,0,1)
b <- runif(n,0,1)
# use the pythagorean theorem
c <- sqrt((a^2) + (b^2) )
inside <- sum(c<1)
#outside <- n-inside
pi.est <- inside/n*4
return(pi.est)
}

Why is my recursive function so slow in R?

The following takes about 30 seconds to run whereas I would expect it to be nearly instant. Is there a problem with my code?
x <- fibonacci(35);
fibonacci <- function(seq) {
if (seq == 1) return(1);
if (seq == 2) return(2);
return (fibonacci(seq - 1) + fibonacci(seq - 2));
}
Patrick Burns gives an example in R Inferno of one way to do memoization in R with local() and <<-. In fact, it's a fibonacci:
fibonacci <- local({
memo <- c(1, 1, rep(NA, 100))
f <- function(x) {
if(x == 0) return(0)
if(x < 0) return(NA)
if(x > length(memo))
stop("’x’ too big for implementation")
if(!is.na(memo[x])) return(memo[x])
ans <- f(x-2) + f(x-1)
memo[x] <<- ans
ans
}
})
That just provided a nice opportunity to plug Rcpp which allows us to add C++ functions easily to R.
So after fixing your code slightly, and using the packages inline (to easily compile, load and link short code snippets as dynamically loadable functions) as well as rbenchmark to time and compare functions, we end up with a stunning 700-fold increase in performance:
R> print(res)
test replications elapsed relative user.self sys.self
2 fibRcpp(N) 1 0.092 1.000 0.10 0
1 fibR(N) 1 65.693 714.054 65.66 0
R>
Here we see elapsed times of 92 milliseconds versus 65 seconds, for a relative ratio of 714. But by now everybody else has told you not to do this directly in R.... The code is below.
## inline to compile, load and link the C++ code
require(inline)
## we need a pure C/C++ function as the generated function
## will have a random identifier at the C++ level preventing
## us from direct recursive calls
incltxt <- '
int fibonacci(const int x) {
if (x == 0) return(0);
if (x == 1) return(1);
return (fibonacci(x - 1)) + fibonacci(x - 2);
}'
## now use the snippet above as well as one argument conversion
## in as well as out to provide Fibonacci numbers via C++
fibRcpp <- cxxfunction(signature(xs="int"),
plugin="Rcpp",
incl=incltxt,
body='
int x = Rcpp::as<int>(xs);
return Rcpp::wrap( fibonacci(x) );
')
## for comparison, the original (but repaired with 0/1 offsets)
fibR <- function(seq) {
if (seq == 0) return(0);
if (seq == 1) return(1);
return (fibR(seq - 1) + fibR(seq - 2));
}
## load rbenchmark to compare
library(rbenchmark)
N <- 35 ## same parameter as original post
res <- benchmark(fibR(N),
fibRcpp(N),
columns=c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"),
order="relative",
replications=1)
print(res) ## show result
And for completeness, the functions also produce the correct output:
R> sapply(1:10, fibR)
[1] 1 1 2 3 5 8 13 21 34 55
R> sapply(1:10, fibRcpp)
[1] 1 1 2 3 5 8 13 21 34 55
R>
Because you are using one of the worst algorithms in the world!
Its complexity is O(fibonacci(n)) = O((golden ratio)^n), and the golden ratio is 1.6180339887498948482…
:-) Because you use an exponential algorithm! For Fibonacci number N it has to call the function on the order of 2^N times, which here is 2^35, a heck of a number... :-)
Use linear algorithm:
fib = function (x)
{
if (x == 0)
return (0)
n1 = 0
n2 = 1
for (i in seq_len(x - 1)) {  # seq_len avoids the 1:0 trap when x == 1
sum = n1 + n2
n1 = n2
n2 = sum
}
n2
}
Sorry, edit: the complexity of the exponential recursive algorithm is not O(2^N) but O(fib(N)), as Martinho Fernandes greatly joked :-) Really a good note :-)
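To make the growth concrete, here is a small sketch that counts how many times the naive recursion calls itself. The wrapper count_calls and its counter are hypothetical helpers, and the 0/1 base cases are assumed as in fibR above.

```r
# Count the calls made by the naive recursive Fibonacci
count_calls <- function(n) {
  calls <- 0
  fib <- function(k) {
    calls <<- calls + 1          # one more invocation
    if (k < 2) return(k)         # fib(0) = 0, fib(1) = 1
    fib(k - 1) + fib(k - 2)
  }
  value <- fib(n)
  c(value = value, calls = calls)
}

count_calls(20)  # value 6765, calls 21891
```

With these base cases the number of calls is 2 * fib(n + 1) - 1, so it grows at the same rate as the Fibonacci numbers themselves, which is what the correction above says.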
Because the memoise package was already mentioned here is a reference implementation:
fib <- function(n) {
if (n < 2) return(1)
fib(n - 2) + fib(n - 1)
}
system.time(fib(35))
## user system elapsed
## 36.10 0.02 36.16
library(memoise)
fib2 <- memoise(function(n) {
if (n < 2) return(1)
fib2(n - 2) + fib2(n - 1)
})
system.time(fib2(35))
## user system elapsed
## 0 0 0
Source: Wickham, H.: Advanced R, p. 238.
In general memoization in computer science means that you save the results of a function so that when you call it again with the same arguments it returns the saved value.
A recursive implementation with linear cost:
fib3 <- function(n){
fib <- function(n, fibm1, fibm2){
if(n==1){return(fibm2)}
if(n==2){return(fibm1)}
if(n >2){
fib(n-1, fibm1+fibm2, fibm1)
}
}
fib(n, 1, 0)
}
Comparing with the recursive solution with exponential cost:
> system.time(fibonacci(35))
user system elapsed
14.629 0.017 14.644
> system.time(fib3(35))
user system elapsed
0.001 0.000 0.000
This solution can be vectorized with ifelse:
fib4 <- function(n){
fib <- function(n, fibm1, fibm2){
ifelse(n<=1, fibm2,
ifelse(n==2, fibm1,
Recall(n-1, fibm1+fibm2, fibm1)
))
}
fib(n, 1, 0)
}
fib4(1:30)
## [1] 0 1 1 2 3 5 8
## [8] 13 21 34 55 89 144 233
## [15] 377 610 987 1597 2584 4181 6765
## [22] 10946 17711 28657 46368 75025 121393 196418
## [29] 317811 514229
The only changes required are changing == to <= for the n==1 case, and changing each if block to the equivalent ifelse.
If you are truly looking to return Fibonacci numbers and aren't using this example to explore how recursion works then you can solve it non-recursively by using the following:
fib = function(n) {round((1.61803398875^n+0.61803398875^n)/sqrt(5))}
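A quick way to see where this closed form can be trusted is to compare it with a simple loop. The sketch below is only a check over a small range; fib_closed repeats the one-liner above under a different name, and fib_iter is a hypothetical reference implementation with fib(1) = fib(2) = 1.

```r
# The closed form from above, under a different name
fib_closed <- function(n) {
  round((1.61803398875^n + 0.61803398875^n) / sqrt(5))
}

# Reference: plain iteration, fib_iter(0) = 0, fib_iter(1) = 1
fib_iter <- function(n) {
  a <- 0
  b <- 1
  for (i in seq_len(n)) {
    nxt <- a + b
    a <- b
    b <- nxt
  }
  a
}

# Agreement for n = 1..30
all(fib_closed(1:30) == sapply(1:30, fib_iter))

# n = 0 is a special case: the closed form rounds 2 / sqrt(5) up to 1
fib_closed(0)
```

Because the constants are truncated decimals and the arithmetic is floating point, the formula should be expected to drift from the exact values for large enough n, so it is worth checking the range you actually need.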

Resources