Multiply Eigen::Matrices without transposing first

Say you have a Matrix<int, 4, 2> and a Matrix<int, 3, 2> which you want to multiply in the natural way that consumes the last (-1) dimension, without first transposing.
Is this possible, or do we have to transpose first? That would be silly (bad for performance) from a cache perspective, because then the elements we are multiplying and summing aren't contiguous.
Here's a playground. https://godbolt.org/z/Gdj3sfzcb
PyTorch provides torch.inner and torch.tensordot, which do this.

Just like in NumPy, transpose() only creates a "view". It doesn't do any expensive memory operations (unless you assign it to a new matrix). Just call a * b.transpose() and let Eigen handle the details of the memory access. A properly optimized BLAS-style library like Eigen handles the transposition on small tiles in temporary memory for optimal performance.
Memory order still matters for fine-tuning, though. If you can, write your matrix multiplications in the form a.transpose() * b for column-major matrices (like Eigen and Matlab), or a * b.transpose() for row-major matrices like those in NumPy. That saves the BLAS library the trouble of doing that transposition.
Side note: You used auto for your result. Please read the Common Pitfalls chapter in the documentation. Your code didn't compute a matrix multiplication, it stored an expression of one.
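A minimal sketch of what this answer suggests, with the matrix sizes taken from the question and the concrete values invented purely for illustration. Note the explicit result type instead of auto, which forces the product expression to be evaluated:

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::Matrix<int, 4, 2> a;
    Eigen::Matrix<int, 3, 2> b;
    a << 1, 2, 3, 4, 5, 6, 7, 8;
    b << 1, 0, 0, 1, 1, 1;

    // b.transpose() is only a view; no data is copied before the product.
    // Assigning to a concrete Matrix (not auto) evaluates the product here.
    Eigen::Matrix<int, 4, 3> c = a * b.transpose();

    std::cout << c << std::endl;
}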

Related

Non-intuitive perf diff between `matrix * vector`, `matrix' * vector` and `copy(matrix') * vector`

Question from Julia Discourse
I’m using Julia 1.2. This is my test:
using BenchmarkTools
a = rand(1000, 1000)
b = adjoint(a)
c = copy(b)
@btime a * x setup=(x=rand(1000)) # 114.757 μs
@btime b * x setup=(x=rand(1000)) # 94.179 μs
@btime c * x setup=(x=rand(1000)) # 110.325 μs
I was expecting a and c to be, at the very least, not slower.
After inspecting stdlib/LinearAlgebra/src/matmul.jl, it turns out that Julia passes b.parent (i.e. a) to BLAS.gemv, not b, and instead switches BLAS's dgemv_ into a different and apparently faster mode.
Am I correct in assuming that the speedup comes from the fact that the memory is aligned in a more favorable way for whatever dgemv_ does, when it’s in a trans = T mode? If so, then I’m guessing this isn’t actionable, besides possibly mentioning the gotcha in the docs somehow. If my assumption is wrong though, is there something to be done about this?
Answer from @stevengj in the same Discourse thread:
Am I correct in assuming that the speedup comes from the fact that the memory is aligned in a more favorable way for whatever dgemv_ does, when it’s in a trans = T mode?
Close. It does have to do with memory, but it’s about locality, not alignment. The basic thing to understand is that it is more efficient to access consecutive (or at least nearby) data from memory than data that is separated, due to the existence of cache lines. (Consecutive access also has some advantages in utilizing SIMD instructions.)
Julia stores matrices in column-major order, so that the columns are contiguous in memory. When you multiply a transposed matrix (that has not been copied) by a vector, therefore, it can compute it as the dot product of the contiguous column (= transposed row) with the contiguous vector, which has good spatial locality and therefore utilizes cache lines efficiently.
For multiplying a non-transposed matrix by a vector, in contrast, you are taking the dot products of non-contiguous rows of the matrix with the vector, and it is harder to utilize cache lines efficiently. To improve spatial locality in this case, an optimized BLAS like OpenBLAS actually computes the dot products of several rows at a time (a “block”) with the vector, I believe; that's why it's only about 10% slower and not much worse. (In fact, even the transposed case may do some blocking to keep the vector in cache.)
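The two access patterns described above can be illustrated with a plain C++ sketch of the unblocked loops (column-major storage as in Julia and Fortran BLAS; this is only the loop structure, not the blocked code OpenBLAS actually uses):

#include <cstddef>
#include <vector>

// Column-major n x n matrix A stored flat: A(i, j) = A[i + j * n].

// y = A' * x : each output element is a dot product of a contiguous column
// of A with x, so A is read with stride 1 (good cache-line utilization).
void matvec_transposed(const std::vector<double>& A, const std::vector<double>& x,
                       std::vector<double>& y, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            acc += A[i + j * n] * x[i];   // stride-1 access into A
        y[j] = acc;
    }
}

// y = A * x : each output element is a dot product of a *row* of A with x,
// so consecutive reads of A are n elements apart (poor locality unless the
// BLAS processes several rows per pass, as the answer describes).
void matvec_plain(const std::vector<double>& A, const std::vector<double>& x,
                  std::vector<double>& y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        double acc = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            acc += A[i + j * n] * x[j];   // stride-n access into A
        y[i] = acc;
    }
}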

CUDA implementation for arbitrary-precision arithmetic

I have to multiply two very large (~2000 × 2000) dense matrices whose entries are floats with arbitrary precision (I am using GMP and the precision is currently set to 600). I was wondering if there is any CUDA library that supports arbitrary-precision arithmetic? The only library that I have found is called CAMPARY; however, it seems to be missing references to some of the functions it uses.
The other solution that I was thinking about was implementing a version of the Karatsuba algorithm for multiplying matrices with arbitrary precision entries. The end step of the algorithm would just be multiplying matrices of doubles, which could be done very efficiently using cuBLAS. Is there any similar implementation already out there?
Since nobody has suggested such a library so far, let's assume that one doesn't exist.
You could always start with the naive implementation (a kernel sketch follows the list below):
One grid thread for each pair of coordinates in the output matrix.
Each thread performs an inner product of a row and a column in the input matrices.
Individual element operations will use code taken from GMP (hopefully not much more than copy-and-paste).
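A sketch of that naive structure in CUDA C++. The arbitrary-precision type and its device-side operations (mp_float, mp_add, mp_mul, mp_zero) are hypothetical placeholders, not GMP functions; GMP itself does not run on the device, so those routines are exactly the ported code the answer refers to:

// Hypothetical fixed-layout arbitrary-precision type and device helpers;
// they stand in for whatever ported GMP routines you end up with.
struct mp_float { unsigned int limbs[26]; };   // ~801 bits, rounded up to 32-bit limbs
__device__ mp_float mp_zero();
__device__ mp_float mp_add(const mp_float& a, const mp_float& b);
__device__ mp_float mp_mul(const mp_float& a, const mp_float& b);

// One grid thread per output element C[row, col]; each thread computes the
// inner product of row `row` of A with column `col` of B (row-major storage).
__global__ void naive_mp_matmul(const mp_float* A, const mp_float* B,
                                mp_float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;

    mp_float acc = mp_zero();
    for (int k = 0; k < n; ++k)
        acc = mp_add(acc, mp_mul(A[row * n + k], B[k * n + col]));
    C[row * n + col] = acc;
}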
But you can also do better than this - just like you can do better for regular-float matrix multiplication. Here's my idea (likely not the best of course):
Consider the worked example of matrix multiplication using shared memory in the CUDA C Programming Guide. It suggests putting small submatrices in shared memory. You can still do this - but you need to be careful with shared memory sizes (they're small...):
A typical GPU today has 64 KB shared memory usable per grid block (or more)
They take a 16 × 16 submatrix.
Times 2 (for the two multiplicands).
Times ceil(801/8) = 101 bytes per element (assuming the GMP representation uses 600 bits for the mantissa, one bit for the sign and 200 bits for the exponent).
So 512 × 101 = 51,712 bytes < 64 KB!
That means you can probably just use the code in their worked example as-is, again replacing the float multiplication and addition with code from GMP.
You may then want to consider something like parallelizing the GMP code itself, i.e. using multiple threads to work together on single pairs of 600-bit-precision numbers. That would likely help your shared memory reading pattern. Alternatively, you could interleave the placement of 4-byte sequences from the representation of your elements, in shared memory, for the same effect.
I realize this is a bit hand-wavy, but I'm pretty certain I've waved my hands correctly and it would be a "simple matter of coding".

C++ - need suggestion on a 2-D data structure - size of 1-D is not fixed

I wish to have a 2-D data structure to store SAT formulas. What I want is similar to a 2-D array, but I wish to do dynamic memory allocation.
I was thinking in terms of an array of vectors of the following form:
typedef vector<int> intVector;
intVector *clause;
clause = new intVector [numClause + 1] ;
Thus clause[0] will be one vector, clause[1] will be another, and so on. And each vector may have a different size. But I am unsure if this is the right thing to do in terms of memory allocation.
Can an array of vectors be made, so that the vectors have different sizes? How bad is it for memory management?
Thanks.
Memory management will be fine using the STL (it handles the dynamic allocation for you); performance will go down because the storage is dynamic and you don't get direct access into one contiguous block.
If you're in complete despair you may use a vector of pairs, with the first member a static array and the second the count of used elements. If some array grows bigger, you may reallocate with more memory and increase the count. This will be a big mess, but it may work: you'll waste less memory and get direct access to the second level of arrays, but it is ugly and dirty. A sketch of this idea follows.
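A rough sketch of that fallback, assuming a fixed per-clause capacity MAX_LITERALS chosen purely for illustration (in practice the plain array-of-vectors from the question is usually the simpler choice):

#include <array>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t MAX_LITERALS = 16;   // assumed fixed capacity per clause

// Each clause is a fixed-size array plus a count of how many slots are used.
using FixedClause = std::pair<std::array<int, MAX_LITERALS>, std::size_t>;

int main() {
    std::vector<FixedClause> clauses(3);

    // Append a literal to clause 0.
    FixedClause& c = clauses[0];
    if (c.second < MAX_LITERALS)
        c.first[c.second++] = 42;
    // If a clause outgrows MAX_LITERALS you have to reallocate into something
    // bigger, which is exactly the "big mess" the answer warns about.
}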

How to use fixed matrix in Eigen?

I have a big matrix which is 1000 × 500. How can I use an Eigen fixed-size matrix to speed this up? The dynamic matrix is slow.
Using fixed size matrices for such large matrices is meaningless. Recall that the advantages of fixed-size matrices are 1) allocation on the stack if requested, and 2) explicit unrolling.
If you think the computation you are performing is too slow, then be specific about your computation. Moreover, make sure you benchmark with compiler optimisations ON. Because of the heavy use of templates, Eigen is particularly slow in debug mode.
Finally, for the record, here is how you can create a fixed size matrix of arbitrary size, e.g., a 6x8 matrix of doubles:
Matrix<double, 6, 8> mat;
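For completeness, a minimal compilable version of that line, contrasted with the dynamic-size type that a 1000 × 500 matrix should actually use (nothing here beyond those sizes is from the question):

#include <Eigen/Dense>

int main() {
    Eigen::Matrix<double, 6, 8> mat;   // fixed size: lives on the stack, no heap allocation
    mat.setRandom();

    Eigen::MatrixXd big(1000, 500);    // sizes this large belong in a dynamic matrix
    big.setRandom();
}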

Hiding communication in Matrix Vector Product with MPI

I have to solve a huge linear system for multiple right-hand sides (let's say 20 to 200). The matrix is stored in a sparse format and distributed over multiple MPI nodes (let's say 16 to 64). I run a CG solver on the rank 0 node. It's not possible to solve the linear system directly, because the system matrix would be dense (Sys = A^T * S * A).
The basic Matrix-Vector multiplication is implemented as:
broadcast x
y = A_part * x
reduce y
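Spelled out with MPI calls, that pattern looks roughly like the sketch below; local_spmv is a placeholder for the local sparse product, and the reduce sums the partial results onto rank 0, where the CG solver runs:

#include <mpi.h>
#include <vector>

// Placeholder for the application's local kernel: partial_y += A_part * x.
void local_spmv(const std::vector<double>& x, std::vector<double>& partial_y);

void distributed_matvec(std::vector<double>& x, std::vector<double>& y, int n) {
    // broadcast x: every rank needs the full input vector
    MPI_Bcast(x.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // y = A_part * x: each rank multiplies its own slice of A
    std::vector<double> partial_y(n, 0.0);
    local_spmv(x, partial_y);

    // reduce y: sum the partial results onto rank 0
    MPI_Reduce(partial_y.data(), y.data(), n, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
}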
While the collective operations are reasonably fast (OpenMPI seems to use a binary-tree-like communication pattern plus InfiniBand), they still account for quite a large part of the runtime. For performance reasons we already calculate 8 right-hand sides per iteration (basically SpM * DenseMatrix, just to be complete).
I'm trying to come up with a good scheme to hide the communication latency, but I haven't had a good idea yet. I also try to refrain from doing 1:n communication, although I have not yet measured whether scaling would be a problem.
Any suggestions are welcome!
If your matrix is already distributed, would it be possible to use a distributed sparse linear solver instead of running it only on rank 0 and then broadcasting the result (if I'm reading your description correctly)? There are plenty of libraries for that, e.g. SuperLU_DIST, MUMPS, PARDISO, Aztec(OO), etc.
The "multiple rhs" optimization is supported by at least SuperLU and MUMPS (haven't checked the others, but I'd be VERY surprised if they didn't support it!), since they solve AX=B where X and B are matrices with potentially > 1 column. That is, each "rhs" is stored as a column vector in B.
If you don't need the result of one right-hand side before starting the next run, you could try to use non-blocking communication (ISend, IRecv) and communicate the result while already calculating the next right-hand side.
But make sure you call MPI_Wait before reading the content of the communicated array, in order to be sure you're not reading "old" data.
If the matrices are big enough (i.e. it takes long enough to calculate the matrix product), you shouldn't have any communication delay at all with this approach. A sketch of the idea follows.
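A sketch of that overlap on the sending side. The solve routine, buffer sizes and destination rank are assumptions standing in for the application's real solver loop; the point is only the Isend / compute / MPI_Wait ordering:

#include <mpi.h>
#include <utility>
#include <vector>

// Placeholder for the application's real work: CG solve for right-hand side k.
void solve_rhs(int k, std::vector<double>& result);

void pipelined_solves(int num_rhs, int n, int dest_rank) {
    std::vector<double> current(n), in_flight(n);
    MPI_Request req = MPI_REQUEST_NULL;

    for (int k = 0; k < num_rhs; ++k) {
        solve_rhs(k, current);

        // Wait until the previous result has actually left its buffer before
        // reusing it (the MPI_Wait the answer insists on).
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        std::swap(current, in_flight);

        // Ship the new result non-blockingly and keep computing the next one.
        MPI_Isend(in_flight.data(), n, MPI_DOUBLE, dest_rank,
                  /*tag=*/k, MPI_COMM_WORLD, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);   // drain the last send
}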
