Using Eigen LeastSquaresConjugateGradient in Parallel - openmp

I am trying to solve a linear system of equations of the form Ax=b with the LeastSquaresConjugateGradient solver of the Eigen library, where A is a sparse matrix.
LeastSquaresConjugateGradient < SparseMatrix < double > > solver;
solver.compute( A );
x = solver.solve( b );
In principle it works perfectly. But I want to use it for very large matrices, so it would be nice to run it in parallel.
The Eigen documentation only says to put the following lines at the top of the program:
omp_set_num_threads( 2 );
setNbThreads( 2 );
Moreover the documentation states that the LeastSquaresConjugateGradient routine should work in parallel.
I also set the number of threads in bash by
export OMP_NUM_THREADS=2
But I still get no performance gain at all. What am I doing wrong?
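The question does not say how the program was compiled, so one thing worth ruling out is a build without OpenMP. The following is a minimal sketch, assuming GCC or Clang (where OpenMP is switched on with -fopenmp); the helper names are made up for illustration and are not part of Eigen. The `_OPENMP` macro is only defined when the compiler's OpenMP support is active, so the code reports 0 threads if the flag was missing.

```cpp
#include <cassert>
#include <string>
#ifdef _OPENMP
#include <omp.h>  // only available when the compiler's OpenMP flag is on
#endif

// Hypothetical helper (not part of Eigen): number of OpenMP threads this
// binary can use, or 0 if it was compiled without OpenMP support.
int available_openmp_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 0;
#endif
}

// Hypothetical helper: turns the count into a one-line diagnostic.
std::string openmp_status() {
    const int n = available_openmp_threads();
    assert(n >= 0);
    if (n == 0) return "OpenMP disabled: rebuild with -fopenmp (GCC/Clang)";
    return "OpenMP enabled, max threads = " + std::to_string(n);
}
```

Printing `openmp_status()` at program start shows whether calls such as omp_set_num_threads( 2 ) can have any effect at all. In an Eigen program, `Eigen::nbThreads()` can be printed next to it to see the thread count Eigen itself will use.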

Related

Poor performance in matlab

I had to write a Matlab program that calculates the convolution of two functions manually. I wrote this simple piece of code, which I know is probably not well optimized:
syms recP(x);
recP(x) = rectangularPulse(-1,1,x);
syms triP(x);
triP(x) = triangularPulse(-1,1,x);
t = -10:0.1:10;
s1 = -10:0.1:10;
for i = 1:201
s1(i) = 0;
for j = t
s1(i) = s1(i) + ( recP(j) * triP(t(i)-j) );
end
end
plot(t,s1);
I have a Core i7-7700HQ with 32 GB of RAM. Matlab is installed on my HDD and Windows is on my SSD. The problem is that this simple code takes, I think, at least 20 minutes to run. I have it in a section, so I don't run the whole script. Matlab uses only 18% of my CPU and 3 GB of RAM for this task, which is probably enough, but I don't think it should take that long.
Am I doing anything wrong? I've searched for how to increase Matlab's RAM limit and found that it is not limited; it takes as much as it needs. I don't know whether I can increase its CPU usage.
Is there any way to make things a bit faster? I have 6 or 7 of these for loops in my homework, and it takes forever if I run the whole live script. Thanks in advance for your help.
(Also, it highlights the piece of code that is currently running. It is the for loop, the outer one is highlighted)
Like Ander said, use the symbolic toolbox in matlab as a last resort. Additionally, when trying to speed up matlab code, focus on taking advantage of matlab's vectorized operations. What I mean by this is matlab is very efficient at performing operations like this:
y = x.*z;
where x and z are Nx1 vectors and the operator '.*' is element-wise multiplication (often called 'dot multiplication'). This tells matlab to compute x(1)*z(1), x(2)*z(2), ..., x(N)*z(N) and assign each product to the corresponding element of the vector y. Additionally, many of the functions in matlab accept vectors as inputs, perform their operation on each element, and return a vector of the same size with the output for each element. You can check this for any given function by scrolling down in its documentation to the inputs and outputs section and checking what form of array the inputs and outputs can take. For example, rectangularPulse's documentation says it can accept vectors as inputs. Therefore, you can replace your whole inner loop with a single line that multiplies element-wise and sums the result, calling the numeric functions directly instead of the symbolic recP and triP:
s1(i) = sum( rectangularPulse(-1,1,t) .* triangularPulse(-1,1,t(i)-t) );
So to summarize:
Avoid the symbolic toolbox in matlab until you have a better handle on what you're doing or you absolutely have to use it.
Use matlab's ability to handle vectors and arrays very well.
Deconstruct any nested loops you write one at a time from the inside out. Usually this dramatically accelerates matlab code especially when you are new to writing it.
See if you can even further simplify the code and get rid of your outer loop as well.
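To make the first point of the summary concrete, here is the same double sum with the two pulses evaluated numerically instead of symbolically. It is only a sketch, written in C++ rather than Matlab, and the function names are invented for this example. It assumes the pulse definitions from the Matlab documentation (height 1/2 exactly on the edges of the rectangular pulse) and works in integer tenths so that the grid points at -1 and 1 compare exactly.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Numeric stand-ins for rectangularPulse(-1,1,x) and triangularPulse(-1,1,x).
// The argument is passed in integer tenths (x = tenths / 10.0), so the edge
// cases at x = -1 and x = 1 are detected exactly.
double rect_pulse(int tenths) {
    const int a = std::abs(tenths);
    if (a < 10) return 1.0;
    return a == 10 ? 0.5 : 0.0;  // half height exactly on the edges
}

double tri_pulse(int tenths) {
    const int a = std::abs(tenths);
    return a < 10 ? 1.0 - a / 10.0 : 0.0;
}

// Same double sum as in the question:
//   s1(i) = sum over j of recP(t_j) * triP(t_i - t_j)
// on the grid t = -10:0.1:10 (201 points). As in the original code, the sum
// is not scaled by the step size 0.1.
std::vector<double> convolve_pulses() {
    const int n = 201;
    std::vector<double> s(n, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            const int tj = j - 100;   // t_j in tenths
            const int diff = i - j;   // t_i - t_j in tenths
            s[i] += rect_pulse(tj) * tri_pulse(diff);
        }
    }
    assert(s.size() == 201);
    return s;
}
```

The whole computation is 201 x 201 = 40,401 multiply-adds on plain doubles. The value at t = 0 comes out as 10, which becomes 1 (the area of the triangle) once multiplied by the step 0.1.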

Sparse matrix to speed up octave

I have a loop where the value at "i" depends on the value at "i-1", so I cannot vectorize it.
I've read that I can use a sparse matrix in order to vectorize it and so speed up my code, but I don't understand how this works.
Any help?
Thanks
You are referring to this technique, as referenced from this (rather old) "how to speed up octave" article.
I'll rephrase the gist here in case the link dies in the future.
Suppose you have the following loop:
p1(1) = 0;
for i = 2 : N
t = t + dt;
p1(i) = p1(i - 1) + dt * 2 * t;
endfor
You note here that, purely from a mathematical point of view, the last step in the loop could be rephrased as:
-1 * p1(i - 1) + 1 * p1(i) = dt * 2 * t
This makes it possible to recast the problem as a sparse matrix solve, by thinking of p1 as the vector of unknowns, and each iteration of the loop as a row in a (sparse) system of equations. E.g.:
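Written out from the loop above, the system would look like the following. The notation is introduced here for illustration: $p_i$ stands for p1(i) and $t_i$ for the value of t in iteration i. The first row encodes the starting value p1(1) = 0, and every later row is one iteration of the loop.

```latex
\begin{bmatrix}
 1 &        &        &        &   \\
-1 &  1     &        &        &   \\
   & -1     &  1     &        &   \\
   &        & \ddots & \ddots &   \\
   &        &        & -1     & 1
\end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ \vdots \\ p_N \end{bmatrix}
=
\begin{bmatrix} 0 \\ 2\,dt\,t_2 \\ 2\,dt\,t_3 \\ \vdots \\ 2\,dt\,t_N \end{bmatrix}
```

Only the main diagonal and the first subdiagonal are non-zero, which is why a sparse matrix is the natural storage for it.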
Given that t is a known vector, this makes the above a straightforward problem that can be solved via a simple matrix division operation, which is guaranteed to be fast.
Having said that, presumably this 'trick' is only useful if you are able to recast the problem in this manner in the first place. Presumably this will only be the case for problems that are linear in the unknown. I don't think this can necessarily be used for more complicated loops.
Also, as Cris has mentioned in the comments, if this method does not work for you, there's a chance you can optimize your loop in other ways (or even that the loop solution may not necessarily be slow in the first place).
By the way, in theory, Octave provides jit-speedup like matlab does, though unlike matlab you need to enable it explicitly (in the sense that you need to compile your octave with jit options, which tends not to be the default), and my personal experience is that this is mostly experimental and may not do much except in the simplest of loops (see this post).
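For the specific loop above, the system built from -1 * p1(i - 1) + 1 * p1(i) = dt * 2 * t has a particularly simple solution: since each unknown differs from the previous one by the right-hand side, p1 is just the running sum of the right-hand side. The sketch below shows this in C++ with invented function names. It assumes t is available as a vector of the values t takes at each step, as the answer does when it calls t "a known vector".

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Loop form, as in the answer: p[i] = p[i-1] + dt * 2 * t[i], with p[0] = 0.
// t[i] is the value of t at step i; t[0] is not used.
std::vector<double> recurrence_loop(const std::vector<double>& t, double dt) {
    assert(!t.empty());
    std::vector<double> p(t.size(), 0.0);
    for (std::size_t i = 1; i < t.size(); ++i)
        p[i] = p[i - 1] + dt * 2 * t[i];
    return p;
}

// Solved form: build the right-hand side in one pass with no dependency
// between elements, then take its prefix (cumulative) sum.
std::vector<double> recurrence_prefix_sum(const std::vector<double>& t, double dt) {
    assert(!t.empty());
    std::vector<double> rhs(t.size(), 0.0);  // rhs[0] = 0 encodes p[0] = 0
    for (std::size_t i = 1; i < t.size(); ++i)
        rhs[i] = dt * 2 * t[i];
    std::vector<double> p(rhs.size());
    std::partial_sum(rhs.begin(), rhs.end(), p.begin());
    return p;
}
```

With dt = 0.5 and t = {0, 1, 2, 3, 4}, both functions return {0, 1, 3, 6, 10}. In Octave the second form corresponds to a cumulative sum over the right-hand side, which is vectorized.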

Save several large Matrix from Rcpp to R environment

I used Rcpp (especially RcppArmadillo) to implement a method that returns several large matrices as its result, for example of size 10000*10000. How can I save these matrices to use them in the R environment? Assume that my Rcpp code looks like:
List Output(20000);
for (int i = 0; i < 20000; ++i) {
...
...
// Suppose that the previous lines allow me to compute a matrix Gi of size 10000*10000
Output(i)=Gi;
}
return Output;
The way I programmed it is very costly and needs a lot of memory. But I need the 20000 matrices to compute an estimator in the R environment. How can I save the matrices? I do not know if the bigmatrix package can help me.
Best,
I finally found a solution. I noticed that I would need 15TB to save the matrices. That is impossible. What I finally did is save only some features of the matrices, such as the eigenvalues. See more details here
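The size estimate is easy to reproduce. A small sketch, with a function name made up for this example and assuming dense storage with 8-byte doubles:

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold `count` dense matrices of size rows x cols,
// assuming each element is a double (8 bytes on common platforms).
std::uint64_t dense_matrices_bytes(std::uint64_t count,
                                   std::uint64_t rows,
                                   std::uint64_t cols) {
    assert(count > 0 && rows > 0 && cols > 0);
    return count * rows * cols * sizeof(double);
}
```

For 20000 matrices of size 10000*10000 this gives 16,000,000,000,000 bytes, i.e. 16 TB in decimal units or roughly 14.6 TiB, which is in line with the 15TB figure above. A single matrix already takes 800 MB.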

Large Matrix handling in Fortran between multiple routines

I have a couple of matrices which are generated in a subroutine, and are used and altered in different parts of the program. Since the matrices are 6-dimensional and get quite large (100^6 elements is nothing unusual), generating them once and passing them through the routines is not an option.
I open the files for reading and writing with form='unformatted', access='direct'.
What I am doing at the moment is storing them like this:
do i=1,noct
do j=1,noct
read_in_some_vector()
do k=1,ngem**2
do l=1,nvir**2
mat(l) = mat(l) * some_vector(k) * some_mat(l,k)
end do
end do
ij=j+(i-1)*noct
write(unit=iunV,rec=ij) (mat(l),l=1,nvir**2)
end do
end do
To use the matrix, I read it recordwise from the files:
iunC=open_mat_file(mat)
do i = 1,noct
do j = 1,noct
ij=j+(i-1)*noct
read(unit=iunC,rec=ij) (mat(l),l=1,nvir**2)
ij = min(i,j) + intsum( max(i,j)-1)
read_some_vector(vec,rec=ij)
do_sth = do_sth + ddot(nvir**2,mat,1,vec,1)
end do
end do
At the moment, noct is a small number (compared to the others). But it will become much larger, so the size of the matrices will explode. The matrices are (and have to be) double precision, so 8 TB for one matrix is in the realm of possibilities.
The matrices are not sparse, and they are all strictly antisymmetric.
What I can think of is either generate the needed matrix-parts directly in the routines, or clutter the harddrives with huge files.
Both would use up a lot of time (calculating or reading/writing).
Can anybody think of a third way? Or a way to optimize this?
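One optimization that follows from the antisymmetry alone is to store and transfer only one triangle. The question does not say in which index pairs the 6-dimensional arrays are antisymmetric, so the following is only a sketch for the plain two-index case, written in C++ with an invented class name. It keeps the strictly lower triangle and derives everything else from a(i,j) = -a(j,i) and a(i,i) = 0, much like the packed index min(i,j) + intsum( max(i,j)-1) already used for the vector records.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for an n x n antisymmetric matrix:
// a(i,j) = -a(j,i) and a(i,i) = 0. Only the strictly lower triangle
// (i > j) is stored, i.e. n*(n-1)/2 values instead of n*n.
class AntisymmetricMatrix {
public:
    explicit AntisymmetricMatrix(std::size_t n)
        : n_(n), data_(n * (n - 1) / 2, 0.0) {
        assert(n > 0);
    }

    std::size_t stored_elements() const { return data_.size(); }

    double get(std::size_t i, std::size_t j) const {
        assert(i < n_ && j < n_);
        if (i == j) return 0.0;  // diagonal is zero by definition
        return i > j ? data_[index(i, j)] : -data_[index(j, i)];
    }

    void set(std::size_t i, std::size_t j, double value) {
        assert(i < n_ && j < n_ && i != j);
        if (i > j) data_[index(i, j)] = value;
        else       data_[index(j, i)] = -value;  // store the mirrored entry
    }

private:
    // Row-wise packing of the strictly lower triangle (0-based, i > j).
    static std::size_t index(std::size_t i, std::size_t j) {
        return i * (i - 1) / 2 + j;
    }

    std::size_t n_;
    std::vector<double> data_;
};
```

For large n this roughly halves both the file size and the amount of data read or written per record. It does not remove the basic trade-off between recomputing and storing.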

Eigen, get inverse failed, and inverse() is so slow

I want to get inverse matrix, so here is the code:
if (m.rows() == m.fullPivLu().rank())
{
res = m.inverse();
}
The dimensions of m and res are both 5000 by 5000. When I run the code on a high performance computing machine (Linux, Tianhe 2 supercomputer), the process is killed at res = m.inverse();, and no core file or dump information is generated. The console prints "killed" and the process exits.
But there is nothing wrong on my Ubuntu laptop.
Also, inverse() performs poorly and takes a lot of time.
So, why was inverse() killed on the high performance machine? Thank you!
Full-pivoting LU is known to be very slow, regardless of its implementation.
Better use PartialPivLU, which benefits from high performance matrix-matrix operations. Then to get the best of Eigen, use the 3.3-beta2 release and compile with both FMA (-mfma) and OpenMP (e.g., -fopenmp) support, and don't forget to enable compiler optimizations with -O3. This operation should not take more than a few seconds.
Finally, do you really need to explicitly compute the inverse? If you only apply it to some vectors or matrices (i.e., A^-1 * B or B * A^-1) then better apply the inverse in factorized form rather than explicitly computing it. With Eigen 3.3:
MatrixXd A = ...;
PartialPivLU<MatrixXd> lu(A);
x = lu.inverse() * b; // solve Ax=b, same as x = lu.solve(b);
x = b * lu.inverse(); // solve xA=b
In these expressions, the inverse is not explicitly computed!
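To illustrate what "applying the inverse in factorized form" means, here is a hand-written sketch in plain C++. It is not Eigen's implementation, and the function and type names are invented for this example. It solves Ax=b by Gaussian elimination with partial pivoting followed by back substitution, which is the same idea a partial-pivoting LU solve is built on, and the matrix A^-1 is never formed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Illustration only: solve A x = b by Gaussian elimination with partial
// pivoting. A (assumed square) and b are taken by value because both are
// overwritten during elimination.
std::vector<double> solve_partial_pivoting(Matrix a, std::vector<double> b) {
    const std::size_t n = a.size();
    assert(n > 0 && b.size() == n);
    for (std::size_t k = 0; k < n; ++k) {
        // Pick the entry with the largest magnitude in column k as pivot.
        std::size_t pivot = k;
        for (std::size_t r = k + 1; r < n; ++r)
            if (std::fabs(a[r][k]) > std::fabs(a[pivot][k])) pivot = r;
        assert(a[pivot][k] != 0.0);  // singular matrices are not handled
        std::swap(a[k], a[pivot]);
        std::swap(b[k], b[pivot]);
        // Eliminate column k below the pivot row.
        for (std::size_t r = k + 1; r < n; ++r) {
            const double factor = a[r][k] / a[k][k];
            for (std::size_t c = k; c < n; ++c) a[r][c] -= factor * a[k][c];
            b[r] -= factor * b[k];
        }
    }
    // Back substitution on the resulting upper-triangular system.
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double sum = b[i];
        for (std::size_t c = i + 1; c < n; ++c) sum -= a[i][c] * x[c];
        x[i] = sum / a[i][i];
    }
    return x;
}
```

For A = [[2, 1], [1, 3]] and b = [3, 5] this returns x = [0.8, 1.4] up to rounding. Compared with forming the inverse first, no second n by n matrix has to be allocated for the result, which matters at 5000 by 5000.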
