Do iterative solvers in Eigen allocate memory every iteration?

I am interested in using Eigen to solve sparse matrix equations. Iterative solvers require a number of "scratch" vectors that are updated with intermediate values during each iteration. As I understand it, when using an iterative solver such as the conjugate gradient method, these vectors are usually allocated once before beginning iteration and then reused at every iteration to avoid frequent reallocations. However, as far as I can tell from looking at the ConjugateGradient class, Eigen re-allocates memory at every iteration.

Could someone familiar with Eigen tell me whether my understanding is correct? It seemed possible that some clever memory-saving scheme in the allocation procedure means the memory is not actually reallocated each time through, but I dug down and could not find such a thing. Alternatively, if Eigen is indeed re-allocating memory on each pass through the loop, is that an insubstantial burden compared to the time required for the actual computations?

Where do you see reallocation? As you can see in the source code, the four helper vectors residual, p, z, and tmp are declared and allocated outside the while loop, that is, before the iterations take place. Moreover, recall that Eigen is an expression template library, so a line of code such as:
x += alpha * p;
does not create any temporary. In conclusion: no, Eigen's CG implementation does not perform any (re-)allocation within the iterations.
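
For illustration, here is a condensed sketch of that structure (paraphrased and heavily simplified from Eigen's ConjugateGradient kernel; the identity "preconditioner" and the omitted error handling are simplifications, not Eigen's actual code):

#include <Eigen/Dense>

// All vectors are sized once, before the while loop; inside the loop
// every expression writes into storage that already exists.
template <typename Mat, typename Vec>
void cg_sketch(const Mat& A, const Vec& b, Vec& x, int maxIters, double tol) {
    using Scalar = typename Vec::Scalar;
    const Eigen::Index n = A.cols();

    Vec residual = b - A * x;   // allocated once
    Vec p = residual;           // search direction
    Vec z(n), tmp(n);           // scratch vectors, reused every iteration

    Scalar absNew = residual.dot(p);
    const Scalar threshold = tol * tol * b.squaredNorm();
    int i = 0;
    while (residual.squaredNorm() > threshold && i < maxIters) {
        tmp.noalias() = A * p;  // noalias(): product written straight into tmp
        Scalar alpha = absNew / p.dot(tmp);
        x += alpha * p;         // expression templates: updated in place
        residual -= alpha * tmp;
        z = residual;           // with a real preconditioner: z = precond.solve(residual)
        Scalar absOld = absNew;
        absNew = residual.dot(z);
        p = z + (absNew / absOld) * p;  // evaluated element-wise into existing p
        ++i;
    }
}

The only allocations happen at the top, when the four helper vectors are constructed; the assignments inside the loop compile down to element-wise loops over existing storage.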

Related

Vectorized 2D array scipy BDF solver

I'm trying to solve the same ODE simultaneously at different points (each point n is an independent vector of shape m) using the scipy BDF solver. In other words, I have an n x m matrix and I want to solve for the n points (by solving, I mean advancing them in time with a while loop), knowing that the n points are independent of each other.
Obviously you can loop over the different points, but this method takes too much time. Is there any way to make this faster and use it as a vectorized function?
I also tried reshaping my matrix into a 1D vector, but it looks like the solver computes the Jacobian matrix of the complete vector, which takes too much time and is useless, as the points along n are independent.
Maybe there is a way to specify that the derivatives coupling different points are zero, to speed up the Jacobian computation?
Thanks in advance for the answer.
Edit:
Thanks for your answer @Lutz Lehmann. I was able to speed up the computation a little using jac_sparsity, which avoids computing a lot of unnecessary entries.
The other improvement I can imagine concerns the step size h_abs: each independent ODE should have its own h_abs. Using the 1D-vector method implies that all the ODEs advance with the same step size h_abs, i.e. the most restrictive one. I don't know if there is any way of doing this.
I am already using a vectorized atol, built as an n x m matrix and reshaped the same way as the complete set of ODEs, to make sure that the right tolerances are applied to each variable. I've never used numba so far, but I will definitely have a look.

Does Eigen optimize matrix operations involving multiplications with hard-coded 0 elements?

I know that Eigen actually creates the final matrix (on the left of the assignment) only after the entire matrix expression (on the right) has been condensed into a single operation. I also know that there are compilation flags that enable building Eigen with optimized instructions (which are often CPU vendor and architecture dependent). I would like to know whether Eigen is able to:
1. Detect when a 0 (zero) is a hard-coded element in a matrix;
2. Use that knowledge to optimize a matrix operation so that it assigns that element directly to 0 in the final matrix assignment;
3. Do this optimization across several operations in the matrix expression (such as multiple additions, multiplications, and parentheses).
In a dream world, the computer would recognize when those extra FPU operations are not necessary.
Can Eigen do this optimization?
How difficult would it be to program in this optimization if it is currently not implemented?
I am using Dense matrices.
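To make the question concrete, here is a hypothetical illustration of the kind of saving meant (hand-written, not something Eigen derives automatically; blockDiagSquare is an invented name). If M = [A 0; 0 B] has hard-coded zero off-diagonal blocks, squaring M naively costs eight n x n block products, while exploiting the zeros needs only two:

#include <Eigen/Dense>
using Eigen::MatrixXd;

// Squares the block-diagonal matrix [A 0; 0 B] while skipping the
// block products that are known to be zero.
MatrixXd blockDiagSquare(const MatrixXd& A, const MatrixXd& B) {
    const Eigen::Index n = A.rows();
    MatrixXd R = MatrixXd::Zero(2 * n, 2 * n);  // zeros assigned directly
    R.topLeftCorner(n, n).noalias() = A * A;    // only the nonzero blocks
    R.bottomRightCorner(n, n).noalias() = B * B;
    return R;
}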

What is the most efficient way to compute the inverse of a general matrix using cuSolver?

I have in mind to use getrf and getrs from the cuSolver package and to solve AX = B with B = I.
Is this the best way to solve this problem?
If so, what is the best way to create the column-major identity matrix B in device memory? It can be done trivially using a for loop, but this would (1) take up a lot of memory and (2) be quite slow. Is there a faster way?
Note that cuSolver unfortunately does not provide getri, so I must use getrs.
Until CUDA provides the LAPACK routine getri, I think getrf followed by getrs is the best choice for large matrix inversion.
The matrix B is the same size as A, so I don't think allocating B makes this task consume much more memory than its input/output data already does.
The complexity of getrf is O(n^3), and getrs costs O(n^2) per right-hand side (O(n^3) for all n columns of B), while setting B = I is only O(n^2) writes. Initializing B should not be a bottleneck of the whole procedure. You may share your implementation, so we can check where the problem could be.
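As for building the identity without a per-element host loop, one option (a sketch; any simple kernel writing 1.0 at the diagonal indices i*(n+1) would work just as well) is to zero the buffer with cudaMemset and then place the diagonal ones with a single strided cudaMemcpy2D:

#include <cuda_runtime.h>
#include <vector>

// Build an n x n column-major identity matrix in device memory.
double* deviceIdentity(int n) {
    double* dB = nullptr;
    cudaMalloc(&dB, sizeof(double) * n * n);
    cudaMemset(dB, 0, sizeof(double) * n * n);   // all zeros first

    std::vector<double> ones(n, 1.0);            // n ones on the host
    // Destination pitch of (n+1)*sizeof(double) makes each copied element
    // land one diagonal slot further: indices 0, n+1, 2(n+1), ...
    cudaMemcpy2D(dB, (n + 1) * sizeof(double),
                 ones.data(), sizeof(double),
                 sizeof(double), n, cudaMemcpyHostToDevice);
    return dB;
}

Either way, the setup cost is negligible next to the O(n^3) factorization.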

Parallelizing a sequence of matrix multiplications for speedup

In my function there are a lot of element-wise matrix multiplications which are independent. Is there a way to calculate them in parallel?
All of them are very simple operations, but 70% of my run time is spent in these parts of the code, because this function is invoked millions of times.
function [r1,r2,r3] = backward(A,B,C,D,E,F,r1,r2,r3)
    r1 = A.*B;
    r2 = C.*D;
    r3 = E*F;
end

for i = 1:300
    [r1,r2,r3] = backward(A,B,C,D,E,F,r1,r2,r3);
end
EDIT: After writing the answer, I noticed that you are not multiplying all the input matrices by means of matrix multiplication; some of them are element-wise multiplications. If that is what you intended, the following answer won't apply.
You are looking for an optimal algorithm for computing the product of multiple matrices. This problem was studied long ago, and the result is a dynamic programming algorithm that decides the optimal order.
For example, if A is of size 10000 x 1, B is of size 1 x 10000, and C is of size 10000 x 1, it is a lot more efficient to compute A*B*C as A*(B*C) instead of (A*B)*C. The correctness of this technique rests on the fact that matrix multiplication is associative. You can read more about this on Wikipedia.
If you want a good quality MATLAB implementation of this, you can find it here. It takes the matrices as input and gives out the product. It seems this implementation does a decent job of finding the optimal way of computing up to 10 matrices.
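For reference, the dynamic program behind such an implementation looks roughly like this (sketched in C++ for concreteness; matrixChainOrder is an illustrative name, and the sketch returns only the optimal cost, not the product itself):

#include <vector>
#include <limits>
#include <cstdio>

// Matrix-chain ordering: dims has length k+1 for k matrices, where matrix
// i is dims[i] x dims[i+1]. Returns the minimal number of scalar
// multiplications needed to evaluate the full product.
long long matrixChainOrder(const std::vector<long long>& dims) {
    const int k = (int)dims.size() - 1;
    std::vector<std::vector<long long>> cost(k, std::vector<long long>(k, 0));
    for (int len = 2; len <= k; ++len) {            // chain length
        for (int i = 0; i + len - 1 < k; ++i) {
            const int j = i + len - 1;
            cost[i][j] = std::numeric_limits<long long>::max();
            for (int s = i; s < j; ++s) {           // split point
                const long long c = cost[i][s] + cost[s + 1][j]
                                  + dims[i] * dims[s + 1] * dims[j + 1];
                if (c < cost[i][j]) cost[i][j] = c;
            }
        }
    }
    return cost[0][k - 1];
}

int main() {
    // The A (10000x1), B (1x10000), C (10000x1) example from above:
    // (A*B)*C costs 2e8 scalar multiplications, A*(B*C) only 2e4.
    std::printf("%lld\n", matrixChainOrder({10000, 1, 10000, 1}));  // prints 20000
}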
First thing to note: the last 3 variables that you provide as input are not being used. I don't think this will matter much, but it would be better to clean it up.
Now the actual answer:
MATLAB is all about matrix operations, and these have been highly optimized. Even by moving to C++ you should not expect a significant speedup (and be wary of a slowdown). As such, with the information provided in the question, the conclusion would be that you cannot do much to speed up independent matrix calculations.
That being said: If you could reduce the number of sequential function calls, there may be something to gain.
It is hard to say how to do this in general, but two ideas:
If you call the function in a for loop, use a parfor loop instead (assuming you have the Parallel Computing Toolbox; otherwise manually break up the loop and open four MATLAB instances to parallelize it, which can be automated if needed).
See whether you really need this many function calls on small matrix operations. If you could improve your algorithm, that could offer a huge improvement; otherwise you may still be able to combine multiple matrices (multiple versions of A with multiple versions of B, for instance) and do one big multiplication rather than a hundred tiny ones.

Which is more efficient, atan2 or sqrt?

There are some situations where there are multiple methods to calculate the same value.
Right now I am coming up with an algorithm to "expand" a 2D convex polygon. To do this I want to find which direction to perturb each vertex. In order to produce a result which expands the polygon with a "skin" of the same thickness all around, the amount to perturb in that direction also depends on the angle at the vertex. But right now I'm just worried about the direction.
One way is to use atan2: Let B be my vertex, A is the previous vertex, and C is the next vertex. My direction is the "angular average" of angle(B-A) and angle(B-C).
Another way involves sqrt: unit(B-A)+unit(B-C) where unit(X) is X/length(X) yields a vector with my direction.
I'm leaning towards method number 2 because averaging angle values requires a bit of work. But I am basically choosing between two calls to atan2 and two calls to sqrt. Which is generally faster? What about if I was doing this in a shader program?
I'm not trying to optimize my program per se, I'd like to know how these functions are generally implemented (e.g. in the standard c libraries) so I'll be able to know, in general, what is the better choice.
From what I know, both sqrt and trig functions require an iterative method to arrive at an answer. This is the reason why we try to avoid them when possible. People have come up with "approximate" functions which use lookup-tables and interpolation and such to try to produce faster results. I will of course never bother with these unless I find strong evidence of bottlenecking in my code due to just these routines or routines heavily involving them, but the differences between sqrt, trig funcs, and inverse trig funcs may be relevant for the sake of discussion.
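For concreteness, here is method 2 in code (a sketch; Vec2, unit, and expandDirection are illustrative names, not part of any particular library):

#include <cmath>

struct Vec2 { float x, y; };
Vec2 operator-(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
Vec2 operator+(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }

// Normalize: one sqrt per call.
Vec2 unit(Vec2 v) {
    float len = std::sqrt(v.x * v.x + v.y * v.y);
    return {v.x / len, v.y / len};
}

// Outward direction at vertex B, with neighbors A and C. Two sqrt calls
// total, no atan2. The result is not unit length; its length also encodes
// the angle at B.
Vec2 expandDirection(Vec2 A, Vec2 B, Vec2 C) {
    return unit(B - A) + unit(B - C);
}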
With typical libraries on common modern hardware, sqrt is faster than atan2. Cases where atan2 is faster may exist, but they are few and far between.
Recent x86 implementations actually have a fairly efficient sqrt instruction, and on that hardware the difference can be quite dramatic. The Intel Optimization Manual quotes a single-precision square root as 14 cycles on Sandybridge, and a double-precision square root as 22 cycles. With a good math library atan2 timings are commonly in the neighborhood of 100 cycles or more.
It sounds like you have all the information you need to profile and find out for yourself.
If you aren't looking for an exact result, and don't mind the additional logic required to make it work, you can use specialized operations such as RSQRTSS, RSQRTPS, which calculate 1/sqrt, to combine the two expensive operations.
Indeed, sqrt is better than atan2, and 1/sqrt is better than sqrt.
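To illustrate the reciprocal-square-root route (a sketch using the SSE intrinsics mentioned above; the hardware approximation is only good to roughly 12 bits, so one Newton-Raphson step is added to refine it):

#include <xmmintrin.h>  // SSE intrinsics

// Approximate 1/sqrt(x) via RSQRTSS, refined with one Newton-Raphson step.
// Handy for normalizing vectors without a full-precision sqrt and divide.
inline float fastInvSqrt(float x) {
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  // initial estimate
    return r * (1.5f - 0.5f * x * r * r);                  // one NR refinement
}

Multiplying a vector by fastInvSqrt of its squared length normalizes it with no division at all.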
For a non-built-in solution, you may be interested in the CORDIC approximations.
But in your case, you should write out the complete formulas and optimize them globally before drawing any conclusion, because the transcendental functions are just a fraction of the computation.
