MATLAB optimization: speed up computation on large matrices - performance

I am using the following function:
kernel = #(X,Y,sigma) exp((-pdist2(X,Y,'euclidean').^2)./(2*sigma^2));
to compute a series of kernels, in the following way:
K = [(1:size(featureVectors,1))', kernel(featureVectors,featureVectors, sigma)];
However, since featureVectors is a huge matrix (something like 10000x10000), it takes really a long time to compute the kernels (e.g., K).
Is it possible to somehow speed up the computation?
EDIT: Context
I am using a classifier via libsvm, with a gaussian kernel, as you may have noticed from the variable names and semantics.
I am using now (more or less) #terms~=10000 and #docs~=10000. This #terms resulted after stopwords removal and stemming. This course indicates that having 10000 features makes sense.
Unfortunately, libsvm does not implement automatically the Gaussian kernel. Thus, it is required to compute it by hand. I took the idea from here, but the kernel computation (as suggested by the referenced question) is really slow.

You are using pdist2 with two equal input arguments (X and Y are equal when you call kernel). You could save half the time by computing each pair only once. You do that using pdist and then squareform:
kernel = #(X,sigma) exp((-squareform(pdist(X,'euclidean')).^2)./(2*sigma^2));
K = [(1:size(featureVectors,1))', kernel(featureVectors, sigma)];

Your exponential function will go down very fast. For distances of several sigma your kernel function will essentially be zero. These cases we can sort out and become faster.
function z = kernel(X, Y, sigma)
d = pdist2(X,Y,'euclidean');
z = zeros(size(d)); % start with zeros
m = d < 3 * sigma;
z(m) = exp(-d(m).^2/(2*sigma^2));
end

Related

Why is x \ y so much slower than (x' * x) \ (x' * y)?

For an NxP matrix x and an Nx1 vector y with N > P, the two expressions
x \ y -- (1)
and
(x' * x) \ (x' * y) -- (2)
both compute the solution b to the matrix equation
x * b = y
in the least squares sense, i.e. so that the quantity
norm(y - x * b)
is minimized. Expression (2) does it using the classic algorithm for the solution of an ordinary least squares regression, where the left-hand argument to the \ operator is square. It is equivalent to writing
inv(x' * x) * (x' * y) -- (3)
but it uses an algorithm which is more numerically stable. It turns out that (3) is moderately faster than (2) even though (2) doesn't have to produce the inverse matrix as a byproduct, but I can accept that given the additional numerical stability.
However, some simple timings (with N=100,000 and P=30) show that expression (2) is more than 5 times faster than expression (1), even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of X, and in the case N>P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer.
What is happening in expression (1) that is causing it to take so much longer?
Edit: Here are my timings
x = randn(1e5, 30);
y = randn(1e5,1);
tic, for i = 1:100; x\y; end; t1=toc;
tic, for i = 1:100; (x'*x)\(x'*y); end; t2=toc;
assert( abs(norm(x\y) - norm((x'*x)\(x'*y))) < 1e-10 );
fprintf('Speedup: %.2f\n', t1/t2)
Speedup: 5.23
You are aware of the fact that in your test
size(x) == [1e5 30] but size(x'*x) == [30 30]
size(y) == [1e5 1] but size(x'*y) == [30 1]
That means that the matrices entering the mldivide function differ in size by 4 orders of magnitude! This would render any overhead of determining which algorithm to use rather large and significant (and perhaps also running the same algorithm on the two different problems).
In other words, you have a biased test. To make a fair test, use something like
x = randn(1e3);
y = randn(1e3,1);
I find (worst of 5 runs):
Speedup: 1.06 %// R2010a
Speedup: 1.16 %// R2010b
Speedup: 0.97 %// R2013a
...the difference has all but evaporated.
But, this does show very well that if you indeed have a regression problem with low dimensionality compared to the number of observations, it really pays off to do the multiplication first :)
mldivide is a catch-all, and really great at that. But often, having knowledge about the problem may make more specific solutions, like pre-multiplication, pre-conditioning, lu, qr, linsolve, etc. orders of magnitude faster.
even though (1) has greater flexibility to choose the algorithm used!
For example, any call to (1) could just dispatch on the size of X, and
in the case N>P it could reduce to (2), which would add a tiny amount
of overhead, but certainly wouldn't take 5 times longer.
This is not the case. It could take a lot of overhead to choose which algorithm to use, particularly when compared to the computation on relatively small inputs such as these. In this case, because MATLAB can see that you have x'*x, it knows that one of the arguments must be both square and symmetric (yes - that knowledge of linear algebra is built in to MATLAB even at a parser level), and can straight away call one of the appropriate code paths within \.
I can't say whether this fully explains the timing differences you're seeing. I would want to investigate further, at least by:
Making sure to put the code within a function, and warming the function up to ensure that the JIT is engaged - and then trying the same thing with feature('accel', 'off') to remove the effect of the JIT
Trying this on a much bigger range of input sizes to check what contribution an 'algorithm choice overhead' made compared to computation time.

Is it beneficial to transpose an array in order to use column-wise operations?

Assume that we are working with a language which stores arrays in column-major order. Assume also that we have a function which uses 2-D array as an argument, and returns it.
I'm wondering can you claim that it is (or isn't) in general beneficial to transpose this array when calling the function in order to work with column-wise operations instead of row-wise operations, or does the transposing negate the the benefits of column-wise operations?
As an example, in R I have a object of class ts named y which has dimension n x p, i.e I have p times series of length n.
I need to make some computations with y in Fortran, where I have two loops with following kind of structure:
do i = 1, n
do j= 1, p
!just an example, some row-wise operations on `y`
x(i,j) = a*y(i,j)
D = ddot(m,y(i,1:p),1,b,1)
! ...
end do
end do
As Fortran (as does R) uses column-wise storage, it would be better to make the computations with p x n array instead. So instead of
out<-.Fortran("something",y=array(y,dim(y)),x=array(0,dim(y)))
ynew<-out$out$y
x<-out$out$x
I could use
out<-.Fortran("something2",y=t(array(y,dim(y))),x=array(0,dim(y)[2:1]))
ynew<-t(out$out$y)
x<-t(out$out$x)
where Fortran subroutine something2 would be something like
do i = 1, n
do j= 1, p
!just an example, some column-wise operations on `y`
x(j,i) = a*y(j,i)
D = ddot(m,y(1:p,i),1,b,1)
! ...
end do
end do
Does the choice of approach always depend on the dimensions n and p or is it possible to say one approach is better in terms of computation speed and/or memory requirements? In my application n is usually much larger than p, which is 1 to 10 in most cases.
more of a comment, buy i wanted to put a bit of code: under old school f77 you would essentially be forced to use the second approach as
y(1:p,i)
is simply a pointer to y(1,i), with the following p values contiguous in memory.
the first construct
y(i,1:p)
is a list of values interspaced in memory, so it seems to require making a copy of the data to pass to the subroutine. I say it seems because i haven't the foggiest idea how a modern optimizing compiler deals with these things. I tend to think at best its a wash at worst this could really hurt. Imagine an array so large you need to page swap to access the whole vector.
In the end the only way to answer this is to test it yourself
----------edit
did a little testng and confirmed my hunch: passing rows y(i,1:p) does cost you vs passing columns y(1:p,i). I used a subroutine that does practically nothing to see the difference. My guess with any real subroutine the hit is negligable.
Btw (and maybe this helps understand what goes on) passing every other value in a column
y(1:p:2,i) takes longer (orders of magnitude) than passing the whole column, while passing every other value in a row cuts the time in half vs. passing a whole row.
(using gfortran 12..)

Improved Iterative Scaling Method when # feature is large

In log-linear model, we can find maximum entropy solution using IIS.
We update the parameters by finding a paramters which makes model expectation over a feature and empirical expectation matches.
However, there is a exp( sum of all features) in the equation.
My question is that when the number of feature is large (say 10000) then summation over all the features will blow easily. How can we solve this problem by numerical method ? To me it seems impossible since even compute exp(50) will blow.
Do computations in log-space, and use a logsumexp operation (borrowed from scikit-learn):
// Pseudocode for 1-d version of logsumexp:
// computes log(sum(exp(x) for x in a)) in a numerically stable way.
def logsumexp(a : array of float):
amax = maximum(a)
sum = 0.
for x in a:
sum += exp(x - amax)
return log(sum) + amax
This summation can be done once, before the start of the main loop, because the feature values don't change while you're optimizing.
Side remark: IIS is quite old-fashioned. Since some 10 years, almost everyone's been using L-BFGS-B, OWL-QN or (A)SGD to fit log-linear models.

Stochastic optimization algorithms

Say we have 2 stochastic optimization algorithms (Genetic Algorithms, Particle Swarm Optimization, Cuckoo Search, etc.), A and B, and we want to find the global maxima of a function. Then, if algorithm A performs better than algorithm B at optimizing function F on a 1-dimensional search space does it also perform better than B at optimizing function F on a N-dimensional search space?
I shall refer to function F in N dimensions by F_ND.
Note that F_1D and F_ND is the same function, except in a different number of dimensions; the "landscape" is exactly the same, only of different dimensionality.
Ex: for the DeJong function we have:
F_1D(x) = x[0]*x[0]
F_5D(x) = x[0]*x[0] + x[1]*x[1] + x[2]*x[2] + x[3]*x[3] + x[4]*x[4]
F_1D and F_5D have the same "type"/"aspect"
...put otherwise:
If general_performance(A,F_1D) > general_performance(B,F_1D) then does general_performance(A,F_ND) > general_performance(B,F_ND) (for a larger N, of course) hold also?
It is currently not known if such a property would hold. The No Free Lunch theorem (NFL) does not fully apply here since you're talking about a very restricted set of problems. The problem that you have drawn is still independent in higher dimensions (one can optimize every variable separately and reach the global optimum). In this case one could argue that you could divide that problem into 5 problems of 1 dimension and solve each dimension separately. This should be always cheaper than solving them combined (assuming that no dimensions are free).
I think it depends a lot on the type of problem, but in general I would not believe that such a property holds and that for some problem and some N you can find an algorithm B such that A is better than B <=> dimension < N and B is better than A <=> dimension >= N.

Is there a fast way to invert a matrix in Matlab?

I have lots of large (around 5000 x 5000) matrices that I need to invert in Matlab. I actually need the inverse, so I can't use mldivide instead, which is a lot faster for solving Ax=b for just one b.
My matrices are coming from a problem that means they have some nice properties. First off, their determinant is 1 so they're definitely invertible. They aren't diagonalizable, though, or I would try to diagonlize them, invert them, and then put them back. Their entries are all real numbers (actually rational).
I'm using Matlab for getting these matrices and for this stuff I need to do with their inverses, so I would prefer a way to speed Matlab up. But if there is another language I can use that'll be faster, then please let me know. I don't know a lot of other languages (a little but of C and a little but of Java), so if it's really complicated in some other language, then I might not be able to use it. Please go ahead and suggest it, though, in case.
I actually need the inverse, so I can't use mldivide instead,...
That's not true, because you can still use mldivide to get the inverse. Note that A-1 = A-1 * I. In MATLAB, this is equivalent to
invA = A\speye(size(A));
On my machine, this takes about 10.5 seconds for a 5000x5000 matrix. Note that MATLAB does have an inv function to compute the inverse of a matrix. Although this will take about the same amount of time, it is less efficient in terms of numerical accuracy (more info in the link).
First off, their determinant is 1 so they're definitely invertible
Rather than det(A)=1, it is the condition number of your matrix that dictates how accurate or stable the inverse will be. Note that det(A)=∏i=1:n λi. So just setting λ1=M, λn=1/M and λi≠1,n=1 will give you det(A)=1. However, as M → ∞, cond(A) = M2 → ∞ and λn → 0, meaning your matrix is approaching singularity and there will be large numerical errors in computing the inverse.
My matrices are coming from a problem that means they have some nice properties.
Of course, there are other more efficient algorithms that can be employed if your matrix is sparse or has other favorable properties. But without any additional info on your specific problem, there is nothing more that can be said.
I would prefer a way to speed Matlab up
MATLAB uses Gauss elimination to compute the inverse of a general matrix (full rank, non-sparse, without any special properties) using mldivide and this is Θ(n3), where n is the size of the matrix. So, in your case, n=5000 and there are 1.25 x 1011 floating point operations. So on a reasonable machine with about 10 Gflops of computational power, you're going to require at least 12.5 seconds to compute the inverse and there is no way out of this, unless you exploit the "special properties" (if they're exploitable)
Inverting an arbitrary 5000 x 5000 matrix is not computationally easy no matter what language you are using. I would recommend looking into approximations. If your matrices are low rank, you might want to try a low-rank approximation M = USV'
Here are some more ideas from math-overflow:
https://mathoverflow.net/search?q=matrix+inversion+approximation
First suppose the eigen values are all 1. Let A be the Jordan canonical form of your matrix. Then you can compute A^{-1} using only matrix multiplication and addition by
A^{-1} = I + (I-A) + (I-A)^2 + ... + (I-A)^k
where k < dim(A). Why does this work? Because generating functions are awesome. Recall the expansion
(1-x)^{-1} = 1/(1-x) = 1 + x + x^2 + ...
This means that we can invert (1-x) using an infinite sum. You want to invert a matrix A, so you want to take
A = I - X
Solving for X gives X = I-A. Therefore by substitution, we have
A^{-1} = (I - (I-A))^{-1} = 1 + (I-A) + (I-A)^2 + ...
Here I've just used the identity matrix I in place of the number 1. Now we have the problem of convergence to deal with, but this isn't actually a problem. By the assumption that A is in Jordan form and has all eigen values equal to 1, we know that A is upper triangular with all 1s on the diagonal. Therefore I-A is upper triangular with all 0s on the diagonal. Therefore all eigen values of I-A are 0, so its characteristic polynomial is x^dim(A) and its minimal polynomial is x^{k+1} for some k < dim(A). Since a matrix satisfies its minimal (and characteristic) polynomial, this means that (I-A)^{k+1} = 0. Therefore the above series is finite, with the largest nonzero term being (I-A)^k. So it converges.
Now, for the general case, put your matrix into Jordan form, so that you have a block triangular matrix, e.g.:
A 0 0
0 B 0
0 0 C
Where each block has a single value along the diagonal. If that value is a for A, then use the above trick to invert 1/a * A, and then multiply the a back through. Since the full matrix is block triangular the inverse will be
A^{-1} 0 0
0 B^{-1} 0
0 0 C^{-1}
There is nothing special about having three blocks, so this works no matter how many you have.
Note that this trick works whenever you have a matrix in Jordan form. The computation of the inverse in this case will be very fast in Matlab because it only involves matrix multiplication, and you can even use tricks to speed that up since you only need powers of a single matrix. This may not help you, though, if it's really costly to get the matrix into Jordan form.

Resources