dgemm or dgemv for Matrix Multiplication? - performance

I know dgemv is for matrix-vector products, but which is more efficient: calling dgemm directly for the matrix-matrix multiplication, or doing the multiplication by applying dgemv to the matrix A and each individual column of matrix B?

If you make repeated calls to DGEMV, you will not benefit from cache tiling and re-use, which are the biggest advantages good DGEMM implementations have. DGEMM is vastly more efficient than multiple calls to DGEMV.
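
For concreteness, here is a sketch of the two approaches using the standard CBLAS interface, assuming column-major storage, an m x k matrix A, a k x n matrix B and an m x n result C:

    #include <cblas.h>

    // One DGEMM call: the library is free to tile A, B and C so that blocks
    // stay in cache and are re-used many times.
    void multiply_gemm(int m, int n, int k,
                       const double* A, const double* B, double* C) {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, m, B, k, 0.0, C, m);
    }

    // n DGEMV calls, one per column of B: A is streamed through memory n
    // times and nothing of A is re-used between calls.
    void multiply_gemv(int m, int n, int k,
                       const double* A, const double* B, double* C) {
        for (int j = 0; j < n; ++j)
            cblas_dgemv(CblasColMajor, CblasNoTrans, m, k,
                        1.0, A, m, B + (long)j * k, 1,
                        0.0, C + (long)j * m, 1);
    }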

Related

Precise matrix inversion in Q

Given an invertible matrix M over the rationals Q, the inverse matrix M^(-1) is again a matrix over Q. Are there (efficient) libraries to compute the inverse precisely?
I am aware of high-performance linear algebra libraries such as BLAS/LAPACK, but these libraries are based on floating point arithmetic and are thus not suitable for computing precise (analytical) solutions.
Motivation: I want to compute the absorption probabilities of a large absorbing Markov chain using its fundamental matrix. I would like to do so precisely.
Details: By large, I mean a 1000x1000 matrix in the best case, and a several million dimensional matrix in the worst case. The further I can scale things the better. (I realize that the worst case is likely far out of reach.)
You can use the Eigen matrix library, which with little effort works with arbitrary scalar types. There is an example in the documentation of how to use it with GMP's mpq_class: http://eigen.tuxfamily.org/dox/TopicCustomizing_CustomScalar.html
Of course, as @btilly noted, most of the time you should not compute the inverse explicitly, but instead compute a matrix decomposition and use it to solve systems of equations. For rational numbers you can use any LU decomposition, or, if the matrix is symmetric, the LDLt decomposition. See here for a catalogue of decompositions.
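
Roughly along the lines of the linked documentation page, a minimal sketch might look like this (the NumTraits cost values are illustrative placeholders; check the page for the exact specialization Eigen expects):

    #include <gmpxx.h>
    #include <Eigen/Dense>

    // Tell Eigen how to treat GMP's exact rational type as a scalar.
    namespace Eigen {
    template<> struct NumTraits<mpq_class> : GenericNumTraits<mpq_class> {
        typedef mpq_class Real;
        typedef mpq_class NonInteger;
        typedef mpq_class Nested;
        enum {
            IsInteger = 0, IsSigned = 1, IsComplex = 0,
            RequireInitialization = 1,
            ReadCost = 6, AddCost = 150, MulCost = 100  // rough guesses
        };
        static inline Real epsilon()         { return 0; }  // exact arithmetic
        static inline Real dummy_precision() { return 0; }
        static inline int  digits10()        { return 0; }
    };
    }

    int main() {
        typedef Eigen::Matrix<mpq_class, Eigen::Dynamic, Eigen::Dynamic> MatQ;
        typedef Eigen::Matrix<mpq_class, Eigen::Dynamic, 1> VecQ;

        MatQ M(2, 2);
        M << mpq_class(1, 3), mpq_class(1, 2),
             mpq_class(2, 5), mpq_class(3, 7);
        VecQ b(2);
        b << 1, 2;

        // Exact rational solve via an LU decomposition instead of M.inverse().
        VecQ x = M.fullPivLu().solve(b);
    }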

Cholesky Decomposition for a large number of small matrices using OpenCL

My question is similar to what was asked here: Performing many small matrix operations in parallel in OpenCL, except that I want to do Cholesky decomposition. Also, my matrices will range from 15 x 15 to 100 x 100, and I will have up to 100,000 of them. All of the matrices will have the same dimensions. The decomposed matrices will be used further within the GPU.
This paper, http://icl.cs.utk.edu/news_pub/submissions/haidar_iccp.pdf, discusses the algorithms at a high level. The authors use the term batched Cholesky for such a problem (a large number of small matrices). Their approach is to implement a batched version of every step involved in a Cholesky decomposition.
So I want to start with a batched matrix multiplication (since it is one of the steps in a Cholesky decomposition). For large matrices, matrix multiplication is done in a blocked manner on a GPU. My question: would that approach be suitable for the kind of problem I have? Any other suggestions would be helpful; I am a little unsure how to approach this.
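
As a point of reference for the per-matrix work, here is an unblocked, unoptimized Cholesky factorization of one small matrix; in a batched OpenCL version one would typically assign one such factorization to each work-group and let the work-items cooperate on the inner loops. This is only a sketch of what each matrix requires, not of the kernel itself:

    #include <cmath>
    #include <vector>

    // Factor one n x n symmetric positive-definite matrix A (row-major)
    // as L * L^T, overwriting the lower triangle of A with L.
    void cholesky_in_place(std::vector<double>& A, int n) {
        for (int j = 0; j < n; ++j) {
            double d = A[j * n + j];
            for (int k = 0; k < j; ++k)
                d -= A[j * n + k] * A[j * n + k];
            A[j * n + j] = std::sqrt(d);

            for (int i = j + 1; i < n; ++i) {
                double s = A[i * n + j];
                for (int k = 0; k < j; ++k)
                    s -= A[i * n + k] * A[j * n + k];
                A[i * n + j] = s / A[j * n + j];
            }
        }
        // The strictly upper triangle is left untouched; only the lower part is L.
    }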

Algorithm for fast vector-vector (a * a^H) multiplication?

The vector a containing complex numbers is of size N-by-1. The task is to find the matrix A (N-by-N) obtained by multiplication a * a^H, where H is the Hermitian operator (conjugate-transpose), so that matrix A is Hermitian.
Is there any algorithm to do this faster than O(N^2), other than exploiting the Hermitian symmetry so that only half of the matrix needs to be computed? Can a divide-and-conquer approach be applied here somehow?
Note that A has N^2 entries, so any algorithm that actually writes all of them out needs on the order of N^2 operations; the only way to do better is to avoid materializing A. You could create a class with a matrix interface which internally stores only the given vector a and performs one complex multiplication on demand whenever a matrix element is accessed.
Depending on your use case, this can be more efficient, because it uses much less memory.
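
A minimal sketch of such a wrapper (the class name and interface are only illustrative):

    #include <complex>
    #include <cstddef>
    #include <vector>

    // Lazy outer product: stores only the vector a and computes
    // A(i, j) = a[i] * conj(a[j]) on demand, so no N x N storage is needed.
    class LazyOuterProduct {
    public:
        explicit LazyOuterProduct(std::vector<std::complex<double>> a)
            : a_(std::move(a)) {}

        std::size_t rows() const { return a_.size(); }
        std::size_t cols() const { return a_.size(); }

        std::complex<double> operator()(std::size_t i, std::size_t j) const {
            return a_[i] * std::conj(a_[j]);   // one complex multiply per access
        }

    private:
        std::vector<std::complex<double>> a_;
    };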

Matrix-Vector Multiplication - Sparse vs. Dense matrices

I want to implement a matrix-vector multiplication in C. My matrix is 1000 * 1000^2 and highly sparse (less than 0.01% non-zero elements). The non-zero elements are dispersed among the rows (between 0 and 126 non-zero elements per row).
I have heard that, in general, parallelizing sparse matrix-vector multiplication is challenging and not as efficient as the dense case because the ratio of computation to memory access is low (here). But I cannot really understand what the main difference is between a sparse and a dense matrix with respect to parallel computation that makes sparse matrices less efficient. It seems the same problem is still present for dense matrices (please correct me if I am wrong).
I would appreciate it if someone could explain how dense matrices differ from sparse matrices in terms of parallel processing.
Thanks
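
For reference, a typical sparse matrix-vector product in CSR (compressed sparse row) format looks like the sketch below. The indirect, data-dependent access to x through the column indices, and the fact that each non-zero is loaded and used only once, keep the ratio of arithmetic to memory traffic low compared with a dense kernel; uneven row lengths (0 to 126 non-zeros here) also make load balancing across threads harder:

    #include <cstddef>
    #include <vector>

    // y = A * x for a CSR matrix: row_ptr has n_rows + 1 entries, and the
    // non-zeros of row i are val[row_ptr[i] .. row_ptr[i+1]) with matching
    // column indices in col.
    void spmv_csr(const std::vector<int>& row_ptr, const std::vector<int>& col,
                  const std::vector<double>& val, const std::vector<double>& x,
                  std::vector<double>& y) {
        const std::size_t n_rows = row_ptr.size() - 1;
        y.assign(n_rows, 0.0);
        for (std::size_t i = 0; i < n_rows; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * x[col[k]];   // indirect, irregular access to x
            y[i] = sum;
        }
    }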

Complexity of arbitrary matrix multiplications

I've got a simple question concerning the implementation of matrix multiplications. I know that there are algorithms for matrices of equal size (n x n) that have a complexity of O(n^2.xxx). But if I have two matrices A and B of different sizes (p x q, q x r), what would be the minimal complexity of the implementation to date? I would guess it is O(pqr), since I would implement the multiplication with 3 nested loops of p, q and r iterations. In particular, does anyone know how the Eigen library implements a multiplication?
A common technique is to pad the (p x q) and (q x r) matrices with zeros so that both become square (n x n) matrices. Then you can apply Strassen's algorithm.
You're correct about it being O(pqr) for exactly the reasons you stated.
I'm not sure how Eigen implements it, but there are many ways to optimize matrix multiplication algorithms, such as improving cache behaviour by tiling and being aware of whether the language you're using is row-major or column-major (you want the inner loop to access memory with as small a stride as possible to avoid cache misses). Some other optimization techniques are detailed here.
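
To illustrate the loop-order point, here is a plain O(pqr) triple loop (not Eigen's actual kernel) arranged so that, with row-major storage, the innermost loop walks B and C contiguously:

    #include <cstddef>
    #include <vector>

    // C = A * B with A (p x q), B (q x r), C (p x r), all row-major.
    // The i-k-j order keeps the inner loop's accesses to B and C contiguous,
    // which is friendlier to the cache than the textbook i-j-k order.
    void matmul(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C,
                std::size_t p, std::size_t q, std::size_t r) {
        C.assign(p * r, 0.0);
        for (std::size_t i = 0; i < p; ++i)
            for (std::size_t k = 0; k < q; ++k) {
                const double aik = A[i * q + k];
                for (std::size_t j = 0; j < r; ++j)
                    C[i * r + j] += aik * B[k * r + j];
            }
    }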
As Yu-Had Lyu mentioned, you can pad them with zeroes, but unless p, q and r are close the approach degenerates: the padded square matrices are much larger than the originals, and the padding itself takes time.
To answer your other question about how Eigen implements it:
The way numerics packages implement matrix multiplication is normally the typical O(pqr) algorithm, heavily optimised in "non-mathematical" ways: blocking for better cache locality, using special processor instructions (SIMD etc.).
Some packages (MATLAB, Octave, ublas) use two libraries called BLAS and LAPACK which provide linear algebra primitives (like matrix multiplication) heavily optimised in this way (sometimes using hardware-specific optimisations).
AFAIK, Eigen simply uses blocking and SIMD instructions.
Hardly any common numeric libraries use Strassen's algorithm, and Eigen is no exception. The reason for this is actually very interesting: while the asymptotic complexity is better (O(n^(log2 7))), the hidden constants behind the big O are very large due to all the additions performed; in other words, the algorithm only pays off in practice for very large matrices.
Note: There is an algorithm that is even more efficient in terms of asymptotic complexity than Strassen's: the Coppersmith–Winograd algorithm with O(n^2.3727), but its constants are so large that it is unlikely ever to be used in practice. It is in fact believed that there exists an algorithm that runs in O(n^2) (which is the trivial lower bound, since any algorithm needs to at least read the n^2 elements of the matrices).
