I know that Eigen actually creates the final matrix (to the left, in assignment) after the entire matrix equation (on the right) has been condensed into a single matrix operation. I also know that there are compilation flags that enable using Eigen with optimized instructions (which are often cpu vendor and architecture dependent). I would like to know if Eigen is able to
Detect when a 0 (zero) is a hard-coded element in a matrix
Use that knowledge to optimize a matrix operation so that it assigns that element directly to 0 in the final matrix assignment
Do this optimization with several operations in the matrix equation (such as multiple additions and multiplications and parenthesis)
In a dream world, the computer would recognize when those extra FPU operations are not necessary.
Can Eigen do this optimization?
How difficult would it be to program in this optimization if it is currently not implemented?
I am using Dense matrices.
Related
Let A be an n x n sparse matrix, represented by a sequence of m tuples of the form (i,j,a) --- with indices i,j (between 0 and n-1) and a being a value a in the underlying field F.
What algorithms are used, in practice, to solve linear systems of equations of the form Ax = b? Please describe them, don't just link somewhere.
Notes:
I'm interested both in exact solutions for finite fields, and in exact and bounded-error solutions for reals or complex numbers using floating-point representation. I suppose exact or bounded-solutions for rational numbers are also interesting.
I'm particularly interested in parallelizable solutions.
A is not fixed, i.e. you don't just get different b's for the same A.
The main two algorithms that I have used and parallelised are the Wiedemann algorithm and the Lanczos algorithm (and their block variants for GF(2) computations), both of which are better than structured gaussian elimination.
The LaMacchia-Odlyzo paper (the one for the Lanczos algorithm) will tell you what you need to know. The algorithms involve repeatedly multiplying your sparse matrix by a sequence of vectors. To do this efficiently, you need to use the right data structure (linked list) to make the matrix-vector multiply time proportional to the number of non-zero values in the matrix (i.e. the sparsity).
Paralellisation of these algorithms is trivial, but optimisation will depend upon the architecture of your system. The parallelisation of the matrix-vector multiply is done by splitting the matrix into blocks of rows (each processor gets one block), each block of rows multiplies by the vector separately. Then you combine the results to get the new vector.
I've done these types of computations extensively. The original authors that broke the RSA-129 factorisation took 6 weeks using structured gaussian elimination on a 16,384 processor MasPar. On the same machine, I worked with Arjen Lenstra (one of the authors) to solve the matrix in 4 days with block Wiedemann and 1 day with block Lanczos. Unfortunately, I never published the result!
I need to compute the following matrices:
M = XSX^T
and
V = XSy
what I'd like to know is the more efficient implementation using blas, knowing that S is a symmetric and definite positive matrix of dimension n, X has m rows and n columns while y is a vector of length n.
My implementation is the following:
I compute A = XS using dsymm and then with dgemm is obtained M=AX^T while dgemv is used to obtain V=Ay.
I think that at least M can be computed in a more efficient way since I know that M is symmetric and definite positive.
Your code is the best BLAS can do for you. There is no BLAS operation, that can exploit the fact that M is symmetric.
You are right though you'd technically only need to compute the upper diagonal part of the gemm product and then copy the strictly upper diagonal part to the lower diagonal part. But there is no routine for that.
May I inquire about the sizes? And may I also inspire some other sources for performance gains: Own build of your BLAS implementation, comparison with MKL, ACML, OpenBLAS, ATLAS. You could obviously code your own version that would use AVX, FMA intrinsics. You should be able to do better that some generalised library. Also what is the precision of your floating point variable?
I seriously doubt that you might gain too much by coding it yourself anyway. But what I would definitely suggest is converting everything to floats and testing if float precision is not giving you the same result with significant speed up in compute time. Very seldom have I seen such cases, which were more in the ODE solving domain and numeric integration of nasty functions.
But you did not address my question regarding the BLAS implementation and machine type.
Again, the optimisation beyond this point is not possible without more skills :(. But seriously, don't be to worried about this. There is a reason, why BLAS does not the optimisation you ask for. It might not be worth the hassle. Go with your solution.
And don't forget to investgate the use of floats rather than double. On R convert everything to float. For the Lapack commands use only sgemX
Without knowing the detail of your problem, it can be useful to recognize the zeros in the matrices. Partitioning the matrices to achieve this can provide significant benefits. Is M the sum of many XSX' sub matrices ?
For V = XSy, where y is a vector and X and S are matrices, calculating S.y then X.(Sy) should be better, unless X.S is a necessary calculation for M.
I have an MxM matrix S whose entries are zero on the diagonal, and non-zero everywhere else. I need to make a larger, block matrix. The blocks will be size NxN, and there will be MxM of them.
The (i,j)th block will be S(i,j)I where I=eye(N) is the NxN identity. This matrix will certainly be sparse, S has M^2-M nonzero entries and my block matrix will have N(M^2-M) out of (NM)^2 or about 1/N% nonzero entries, but I'll be adding it to another NMxNM matrix that I do not expect to be sparse.
Since I will be adding my block matrix to a full matrix, would there be any speed gain by trying to write my code in a 'sparse' way? I keep going back and forth, but my thinking is settling on: even if my code to turn S into a sparse block matrix isn't very efficient, when I tell it to add a full and sparse matrix together, wouldn't MATLAB know that it only needs to iterate over the nonzero elements? I've been trained that for loops are slow in MATLAB and things like repmat and padding with zeros is faster, but my guess is that the fastest thing to do would be to not even build the block matrix at all, but write code that adds the entries of (the small matrix) S to my other (large, full) matrix in a sparse way. If I were to learn how to build the block matrix with sparse code (faster than building it in a full way and passing it to sparse), then that code should be able to do the addition for me in a sparse way without even needing to build the block matrix right?
If you can keep a full NMxNM matrix in memory, dont bother with sparse operations. In fact in most cases A+B, with A full and B sparse, will take longer than A+B, where A and B are both full.
From your description, using sparse is likely slower for your problem:
If you're adding a sparse matrix A to a full matrix B, the result is full and there's almost certainly no advantage to having A sparse.
For example:
n = 12000; A = rand(n, n); B1 = rand(n, n); B2 = spalloc(n, n, n*n);
B2 is as sparse as possible, that is, it's all zeros!
On my machine, A+B1 takes about .23 seconds while A + B2 takes about .7 seconds.
Basically, operations on full matrices use BLAS/LAPACK library calls that are insanely optimized. Overhead associated with sparse is going to make things worse unless you're in the special cases where sparse is super useful.
When is sparse super useful?
Sparse is super useful when the size of matrices suggest that some algorithm should be very slow, but because of sparsity (+ perhaps special matrix structure), the actual number of computations required is orders of magnitude less.
EXAMPLE: Solving linear system A*x=b where A is block diagonal matrix:
As = sparse(rand(5, 5)); for(i=1:999) As = blkdiag(As, sparse(rand(5,5))); end %generate a 5000x5000 sparse block diagonal matrix of 5x5 blocks
Af = full(As);
b = rand(5000, 1);
On my machine, solving the linear system on the full matrix (i.e. Af \ b) takes about 2.3 seconds while As \ b takes .0012 seconds.
Sparse can be awesome, but it's only helpful for large problems where you can cleverly exploit structure.
I have a piece of code in Fortran 90 in which I have to solve both a non-linear (for which I have to invert the Jacobian matrix) and a linear system of equations. When I say small I mean n unknowns for both operations, with n<=4. Unfortunately, n is not known a priori. What do you think is the best option?
I thought of writing explicit formulas for cases with n=1,2 and using other methods for n=3,4 (e.g. some functions of the Intel MKL libraries), for the sake of performance. Is this sensible or should I write explicit formulas for the inverse matrix also for n=3,4?
I am interested in using Eigen to solve sparse matrix equations. Iterative solvers require a number of "scratch" vectors that are updated with intermediate values during each iteration. As I understand it, when using an iterative solver such as the conjugate gradient method these vectors are usually allocated once before beginning iteration and then reused at every iteration to avoid frequent reallocations. As far as I can tell from looking at the ConjugateGradient class, Eigen re-allocates memory at every iteration. Could someone who is familiar with Eigen tell me whether my understanding is correct? It seemed possible that there was some sort of clever memory-saving scheme going on in the allocation procedure, with the result that the memory is not actually reallocated each time through, but I dug down and could not find such a thing. Alternatively, if Eigen is indeed re-allocating memory at each pass through the loop, is it an insubstantial burden compared to time required to do the actual computations?
Where do you see reallocation? As you can see in the source code, the four helper vectors residual, p, z, and tmp, are declared and allocated outside the while loop, that is, before the iterations take place. Moreover, recall that Eigen is an expression template library, so a line code as:
x += alpha * p;
does note create any temporary. In conclusion, no, Eigen's CG implementation does not perform any (re-)allocation within the iterations.