What is the best matrix multiplication algorithm? [closed] - algorithm

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
What is the best matrix multiplication algorithm? What does 'the best' mean for me? It means the fastest, and ready for today's machines.
Please give links to pseudocode if you can.

BLAS is the best ready-to-use efficient matrix multiplication library. There are many different implementations. Here is a benchmark I made for some implementations on a MacBook Pro with a dual-core Intel Core 2 Duo 2.66 GHz (a minimal example of calling a tuned GEMM follows the list):
gotoBLAS2 (open-source) : https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2
ATLAS (open-source) : http://math-atlas.sourceforge.net/
Accelerate.framework (Apple) : http://developer.apple.com/performance/accelerateframework.html
a non-optimized, but portable, implementation that I called 'vanilla' (from the GSL)
There are also other commercial implementations that I didn't test here:
MKL (Intel) : http://software.intel.com/en-us/articles/intel-mkl/
ACML (AMD) : http://developer.amd.com/cpu/Libraries/acml/Pages/default.aspx
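
For a concrete sense of what "use BLAS" looks like in code, here is a minimal sketch, assuming NumPy/SciPy are installed. NumPy's matmul already dispatches to whatever BLAS it was built against (OpenBLAS, MKL, Accelerate, ...), and scipy.linalg.blas.dgemm gives you the raw GEMM call if you want control over alpha/beta and transposes:

# Minimal sketch: call a tuned BLAS GEMM instead of writing the loops yourself.
# Assumes NumPy/SciPy; NumPy delegates float64 matmul to its underlying BLAS.
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 1024))
B = rng.standard_normal((1024, 1024))

# High-level: goes through BLAS dgemm under the hood.
C1 = A @ B

# Lower-level: direct dgemm call, computes alpha * A @ B.
C2 = blas.dgemm(alpha=1.0, a=A, b=B)

assert np.allclose(C1, C2)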

The best matrix multiplication algorithm is the one that someone with detailed architectural knowledge has already hand-tuned for your target platform.
There are lots of good libraries that supply tuned matrix-multiply implementations. Use one of them.

There are probably better ones, but these are the ones I've heard of (better than the standard cubic-complexity algorithm); a rough sketch of Strassen's recursion follows the list.
Strassen's - O(N^2.8)
Coppersmith Winograd - O(N^2.376)
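
Here is a rough sketch of Strassen's recursion, assuming square matrices and using NumPy for the block arithmetic. The cutoff value is hypothetical; practical implementations always fall back to the ordinary product below some size, since the bookkeeping overhead dominates for small blocks:

# Rough sketch of Strassen's algorithm for square matrices (size a power of
# two keeps the splitting trivial). Falls back to the ordinary product below
# a cutoff, as real implementations do.
import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # small enough: use the standard (BLAS-backed) product

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven recursive products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)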

Why pseudocode? Why implement it yourself? If speed is your concern, there are highly optimized implementations available that include optimizations for specific instruction sets (e.g. SIMD); implementing all of that yourself offers no real benefit (apart from maybe learning).
Take a look at different BLAS implementations, like:
http://www.netlib.org/blas/
http://math-atlas.sourceforge.net/

Here is MIT's algorithms course and the matrix multiplication lecture:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-19-shortest-paths-iii-all-pairs-shortest-paths-matrix-multiplication-floyd-warshall-johnson/
matrix multiplication - O(n^3)
Strassen’s algorithm - O(n^2.8) http://en.wikipedia.org/wiki/Strassen_algorithm
Coppersmith–Winograd - O(n^2.376) http://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm

Depends on the size of the matrix, and whether it's sparse or not.
For small-to-medium-sized dense matrices, I believe that some variation on the "naive" O(N^3) algorithm is a win, if you pay attention to cache-coherence and use the platform's vector instructions.
Data arrangement is important -- for cases where your standard matrix layout is cache-unfriendly (e.g., column-major * row-major), you should try binary decomposition of your matrix multiplication -- even if you don't use Strassen's or other "fast" algorithms, this order of operations can yield a "cache-oblivious" algorithm that automatically makes good use of every level of cache. If you have the luxury to rearrange your matrices, you might try combining this with a bit-interleaved (or "Z-order") ordering of data elements.
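To make the binary-decomposition idea concrete, here is a minimal sketch (NumPy, with a hypothetical cutoff value): recursively split the largest of the three dimensions in half until the sub-problem is small, so every cache level eventually works on blocks that fit in it:

# Minimal sketch of the "binary decomposition" / cache-oblivious order of
# operations: split the largest of the three dimensions in half and recurse.
# The cutoff is hypothetical; in practice you would tune it.
import numpy as np

def rec_matmul(A, B, C, cutoff=64):
    """Accumulate A @ B into C (C must be pre-zeroed)."""
    m, k = A.shape
    _, n = B.shape
    if max(m, n, k) <= cutoff:
        C += A @ B          # small enough: do it directly
        return
    if m >= n and m >= k:   # split the rows of A / C
        h = m // 2
        rec_matmul(A[:h], B, C[:h], cutoff)
        rec_matmul(A[h:], B, C[h:], cutoff)
    elif n >= k:            # split the columns of B / C
        h = n // 2
        rec_matmul(A, B[:, :h], C[:, :h], cutoff)
        rec_matmul(A, B[:, h:], C[:, h:], cutoff)
    else:                   # split the inner (shared) dimension
        h = k // 2
        rec_matmul(A[:, :h], B[:h], C, cutoff)
        rec_matmul(A[:, h:], B[h:], C, cutoff)

A, B = np.random.rand(300, 500), np.random.rand(500, 200)
C = np.zeros((300, 200))
rec_matmul(A, B, C)
assert np.allclose(C, A @ B)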
Finally, remember: premature optimization is the root of all evil. And when it's not premature any more, always profile & benchmark before, during, and after optimizing....

There is no "best algorithm" for all matrices on all modern CPUs.
You will need to do some research into the many methods available, and then find a best-fit solution to the particular problems you are calculating on the particular hardware you are dealing with.
For example, the "fastest" way on your hardware platform may be to use a "slow" algorithm but ask your GPU to apply it to 256 matrices in parallel. Or using a "fast" general-purpose (mxn) algorithm may produce much slower results than using an optimised 3x3 matrix multiply. If you really want it to be fast then you may want to consider getting down to the bare metal to make sure you make best use of specific CPU features like SIMD instructions, branch prediction and cache coherence, at the expense of portability.

There is an algorithm called Cannon's algorithm, a distributed matrix multiplication algorithm. More here
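As a rough illustration only (not a real distributed implementation), here is a toy single-process simulation of Cannon's block-shift schedule in NumPy; in the real algorithm each block lives on its own processor in a q x q grid and the shifts are nearest-neighbour communications:

# Toy single-process simulation of Cannon's algorithm: cut the matrices into a
# q x q grid of blocks, skew once, then do q rounds of local multiply-accumulate
# followed by circular shifts (A blocks move left, B blocks move up).
import numpy as np

def cannon(A, B, q):
    n = A.shape[0]
    s = n // q                      # block size (assumes q divides n)
    blk = lambda M, i, j: M[i*s:(i+1)*s, j*s:(j+1)*s]

    # Initial skew: row i of A shifted left by i, column j of B shifted up by j.
    Ab = [[blk(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bb = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]

    Cb = [[np.zeros((s, s)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]   # local multiply-accumulate
        # Circular shifts between rounds.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(cannon(A, B, q=4), A @ B)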

Related

SIMD-exploiting implementation of Peterson and Monico's Lanczos algorithm over the field with two elements

(This question is probably flirting with the "no software recommendations" rule; I understand why it might be closed).
In their paper F_2 Lanczos revisited, Peterson and Monico give a version of the Lanczos algorithm for finding a subspace of the kernel of a linear map over Z/2Z. If my cursory reading of their paper is correct (whether it is or not is clearly not a question for SO), the algorithm presented requires a number of iterations that scales inversely proportional to the word size of the machine used. The authors implemented their proof of concept algorithm with a 64 bit word size.
Does there exist a publicly available implementation of that algorithm utilizing wide SIMD words for (a potentially significant) speedup?
An existing implementation would be a software recommendation. A more interesting question is "Is it possible to use SIMD to make this algorithm run faster?" From my glance at the paper, it sounds like SIMD is exactly what they are describing ("We will partition a 64 bit machine word x into eight subwords...where each ... is an 8-bit word") so if the authors' implementation is publicly available somewhere, the answer is "yes" because they're already using it. If this algorithm were written in C/C++ or something like that, an optimizing compiler would likely do a pretty good job of vectorizing it with SIMD even without manually specifying how to split the registers (can be verified by looking at the assembly). It would arguably be preferable to implement in high level language without splitting registers manually, because then the compiler could optimize it for any target machine's word size.
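Not the Peterson and Monico algorithm itself, but as a toy illustration of the word-level parallelism being described, here is a GF(2) matrix-vector product where a single AND plus a popcount handles a whole machine word of entries (plain Python ints as bit vectors; int.bit_count() needs Python 3.10+; SIMD would simply widen the word from 64 to 128/256/512 bits):

# Toy illustration of word-parallel GF(2) arithmetic, not the Lanczos
# algorithm itself: one AND + popcount computes a whole word's worth of the
# dot product at once.
import random

def gf2_matvec(rows, x):
    """rows: list of ints, each one row of the matrix as a bit vector.
    x: int, the input vector as a bit vector. Returns the product bit vector."""
    y = 0
    for i, row in enumerate(rows):
        bit = (row & x).bit_count() & 1   # dot product over GF(2) = parity
        y |= bit << i
    return y

n = 128
random.seed(0)
rows = [random.getrandbits(n) for _ in range(n)]
x = random.getrandbits(n)
print(bin(gf2_matvec(rows, x)))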

Performing many small matrix operations in parallel in OpenCL

I have a problem that requires me to do eigendecomposition and matrix multiplication of many (~4k) small (~3x3) square Hermitian matrices. In particular, I need each work item to perform eigendecomposition of one such matrix, and then perform two matrix multiplications. Thus, the work that each thread has to do is rather minimal, and the full job should be highly parallelizable.
Unfortunately, it seems all the available OpenCL LAPACKs are for delegating operations on large matrices to the GPU rather than for doing smaller linear algebra operations inside an OpenCL kernel. As I'd rather not implement matrix multiplication and eigendecomposition for arbitrarily sized matrices in OpenCL myself, I was hoping someone here might know of a suitable library for the job?
I'm aware that OpenCL might be getting built-in matrix operations at some point since the matrix type is reserved, but that is not really of much use right now. There is a similar question here from 2011, but it pretty much just says to roll your own, so I'm hoping the situation has improved since then.
In general, my experience with libraries like LAPACK, fftw, cuFFT, etc. is that when you want to do many really small problems like this, you are better off writing your own for performance. Those libraries are usually written for generality, so you can often beat their performance for specific small problems, especially if you can use unique properties of your particular problem.
I realize you don't want to hear "roll your own" but for this type of problem it is really the best thing to do IMO. You might find a library to do this, but considering the code that you really want (for performance) will not generalize, I doubt it exists. You'll be looking specifically for code to find the eigenvalues of 3x3 matrices. That's less of a library and more of a random code snippet with a suitable license that you can manipulate to take advantage of your specific problem.
In this specific case, you can find the eigenvalues of a 3x3 matrix with the textbook method using the characteristic polynomial. Remember that there is a relatively simple closed form solution for cubic equations: http://en.wikipedia.org/wiki/Cubic_function#General_formula_for_roots.
While I think it is very likely that this approach would be much faster than iterative methods, it would be wise to verify that if performance is an issue.
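As a sketch of that closed-form route for the real symmetric 3x3 case (the trigonometric form of the cubic-root formula; a Hermitian matrix has real eigenvalues, so the same characteristic-polynomial idea carries over), written in Python only for readability; inside an OpenCL kernel it is just a handful of scalar operations per work item:

# Closed-form eigenvalues of a 3x3 real symmetric matrix via the
# characteristic polynomial (trigonometric form of the cubic solution).
import math
import numpy as np

def eig3_symmetric(A):
    p1 = A[0, 1]**2 + A[0, 2]**2 + A[1, 2]**2
    if p1 == 0.0:                      # already diagonal
        return sorted([A[0, 0], A[1, 1], A[2, 2]], reverse=True)
    q = np.trace(A) / 3.0
    p2 = (A[0, 0] - q)**2 + (A[1, 1] - q)**2 + (A[2, 2] - q)**2 + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    B = (A - q * np.eye(3)) / p
    r = np.linalg.det(B) / 2.0
    r = max(-1.0, min(1.0, r))         # clamp against rounding error
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)                          # largest
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)    # smallest
    e2 = 3.0 * q - e1 - e3             # the trace gives the third one
    return [e1, e2, e3]

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(eig3_symmetric(A), np.linalg.eigvalsh(A)[::-1])  # both give [4, 2, 1]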

What's the difference between LibSVM and LibLinear

libsvm and liblinear are both software libraries that implement Support Vector Machines. What's the difference? And how do the differences make liblinear faster than libsvm?
In practice the complexity of the SMO algorithm (that works both for kernel and linear SVM) as implemented in libsvm is O(n^2) or O(n^3) whereas liblinear is O(n) but does not support kernel SVMs. n is the number of samples in the training dataset.
Hence for medium to large scale forget about kernels and use liblinear (or maybe have a look at approximate kernel SVM solvers such as LaSVM).
Edit: in practice libsvm becomes painfully slow at 10k samples.
SVM stands for support vector machine, which is basically a linear classifier, but it uses kernel transforms to turn a non-linear problem into a linear one beforehand.
From the link above, it seems like liblinear is very much the same thing, without those kernel transforms. So, as they say, in cases where the kernel transforms are not needed (they mention document classification), it will be faster.
From : http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf
It supports L2-regularized logistic regression (LR), L2-loss and L1-loss linear support vector machines (SVMs) (Boser et al., 1992). It inherits many features of the popular SVM library LIBSVM
And you might also see some useful information here from one of the creators: http://agbs.kyb.tuebingen.mpg.de/km/bb/showthread.php?tid=710
The main idea, I would say, is that liblinear is optimized to deal with linear classification (i.e. no kernels necessary), whereas linear classification is only one of the many capabilities of libsvm, so logically it may not match up to liblinear in terms of classification accuracy. Obviously, I'm making some broad generalizations here, and the exact details on the differences are probably covered in the paper I linked above as well as with the corresponding user's guide to libsvm from the libsvm website.
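If you want to see the difference in practice, here is a quick (and admittedly crude) comparison using scikit-learn, whose SVC wraps libsvm and LinearSVC wraps liblinear; the dataset and sizes are arbitrary, and expect the SVC fit to be noticeably slower:

# Crude timing comparison, assuming scikit-learn is installed: SVC is backed
# by libsvm, LinearSVC by liblinear. The synthetic dataset is arbitrary.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=10000, n_features=100, random_state=0)

for name, clf in [("libsvm (SVC, linear kernel)", SVC(kernel="linear")),
                  ("liblinear (LinearSVC)", LinearSVC())]:
    t0 = time.perf_counter()
    clf.fit(X, y)
    print(f"{name}: {time.perf_counter() - t0:.2f}s, "
          f"train accuracy {clf.score(X, y):.3f}")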

fast matrix multiplication in Matlab

I need to perform a matrix/vector multiplication in Matlab of very large sizes: "A" is a 655360-by-5 real-valued matrix that is not necessarily sparse and "B" is a 655360-by-1 real-valued vector. My question is how to compute B'*A efficiently.
I have noticed a slight time improvement by computing A'*B instead, which gives a column vector. But it is still quite slow (I need to perform this operation several times in the program).
With a little bit of searching I found an interesting Matlab toolbox, MTIMESX by James Tursa, which I hoped would improve the above matrix multiplication performance. After several trials, I can only get very marginal gains over Matlab's native matrix multiplication.
Any suggestions about how should I rewrite A'*B so that the operation is more efficient? Thanks.
Matlab's raison d'etre is doing matrix computations. I would be fairly surprised if you could significantly outperform its built-in matrix multiplication with hand-crafted tools. First of all, you should make sure your multiplication can actually be performed significantly faster. You could do this by implementing a similar multiplication in C++ with Eigen.
I have had good results with matlab matrix multiplication using the GPU
In order to avoid the transpose operation, you could try:
sum(bsxfun(@times, A, B), 2)
But I would be astonished if it were faster than the direct version. See @thiton's answer.
Also look at http://www.mathworks.co.uk/company/newsletters/news_notes/june07/patterns.html to see why the column-vector-based version is faster than the row-vector-based version.
Matlab is built using fairly optimized libraries (BLAS, etc.), so you can't easily improve upon it from within Matlab. Where you can improve is to get a better BLAS, such as one optimized for your processor - this will enable better use of the caches by getting appropriately sized blocks of data from main memory. Take a look into creating your own compiled versions of ATLAS, ACML, MKL, and Goto BLAS.
I wouldn't try to solve this one particular multiplication unless it's really killing you. Changing up the BLAS is likely to lead to a happier solution, especially if you're not currently making use of multicore processors.
Your #1 option, if this is your bottleneck, is to re-examine your algorithm. See this question Optimizing MATLAB code for a great example of how choosing a different algorithm reduced runtime by three orders of magnitude.

CUDA - Simple matrix addition/sum operation

This should be very simple but I could not find an exhaustive answer:
I need to perform A+B = C with matrices, where A and B are two matrices of unknown size (they could be anywhere from 2x2 up to 20,000x20,000)
Should I use CUBLAS with Sgemm function to calculate?
I need the maximum speed achievable so I thought of CUBLAS library which should be well-optimized
For any sort of technical computing, you should always use optimized libraries when available. Existing libraries, used by hundreds of other people, are going to be better tested and better optimized than anything you do yourself, and the time you don't spend writing (and debugging, and optimizing) that function yourself can be better spent working on the actual high-level problem you want to solve instead of re-discovering things other people have already implemented. This is just basic specialization of labour stuff; focus on the compute problem you want to solve, and let people who spend their days professionally writing GPGPU matrix routines do that for you.
Only when you are sure that existing libraries don't do what you need -- maybe they solve too general a problem, or make certain assumptions that don't hold in your case -- should you roll your own.
I agree with the others that in this particular case, the operation is pretty straightforward and it's feasible to DIY; but if you're going to be doing anything else with those matrices once you're done adding them, you'd be best off using optimized BLAS routines for whatever platform you're on.
What you want to do would be trivial to implement in CUDA and will be bandwidth limited.
And since CUBLAS 5.0, cublas<t>geam (e.g. cublasSgeam) can be used for that. It computes the weighted sum of two optionally transposed matrices.
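
For completeness, a small sketch assuming CuPy is installed (just a convenient way to drive CUDA from Python); a hand-written CUDA kernel or a cublas<t>geam call would be doing the same bandwidth-bound work, since each element is read once and written once and memory bandwidth sets the runtime:

# Sketch assuming CuPy is available. Matrix addition is a single elementwise
# kernel; the operation is bandwidth-limited, not compute-limited.
import cupy as cp

n = 10000
A = cp.random.rand(n, n).astype(cp.float32)
B = cp.random.rand(n, n).astype(cp.float32)

C = A + B                        # launches one elementwise kernel on the GPU
cp.cuda.Device().synchronize()   # kernels are asynchronous; wait before timing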
