Strassen's algorithm for matrix multiplication gives only a marginal improvement over the conventional O(n^3) algorithm. It has higher constant factors and is much harder to implement. Given these shortcomings, is Strassen's algorithm actually useful, and is it implemented in any matrix multiplication library? Moreover, how do libraries implement matrix multiplication?
Generally, Strassen's method is not preferred for practical applications, for the following reasons:
The constants used in Strassen's method are high, and for a typical application the naive method works better.
For sparse matrices, there are better methods designed especially for them.
The submatrices in the recursion take extra space.
Because of the limited precision of computer arithmetic on non-integer values, larger errors accumulate in Strassen's algorithm than in the naive method.
So the idea of Strassen's algorithm is that it's faster (asymptotically speaking). This can make a big difference if you are dealing with either huge matrices or a very large number of matrix multiplications. However, just because it's asymptotically faster doesn't make it the most efficient algorithm in practice. There are all sorts of implementation considerations, such as caching and architecture-specific quirks, and there is also parallelism to consider.
I think your best bet would be to look at the common libraries and see what they are doing. Have a look at BLAS, for example. And I think that MATLAB uses MAGMA.
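As a quick illustration of the BLAS route (assuming a NumPy build linked against an optimized BLAS such as OpenBLAS or MKL; the sizes here are arbitrary), a plain matrix product in Python already goes through a tuned GEMM kernel rather than a textbook triple loop:

    import numpy as np

    # Reports which BLAS/LAPACK implementation this NumPy build is linked against.
    np.__config__.show()

    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)

    # For 2-D float arrays this dispatches to the BLAS GEMM routine
    # rather than a naive O(n^3) interpreted loop.
    c = a @ b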
If your contention is that you don't think O(n^2.8) is that much faster than O(n^3), a quick comparison of the two growth rates shows that n doesn't need to be very large before the difference becomes significant.
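To put rough numbers on that, here is a quick comparison of n^3 against n^(log2 7) ≈ n^2.807; it ignores constant factors entirely, so it only shows the trend, not real running times:

    from math import log2

    STRASSEN_EXP = log2(7)   # ~2.807, Strassen's exponent

    for n in (100, 1_000, 10_000, 100_000):
        # Ratio of the naive operation count to Strassen's asymptotic count.
        ratio = n ** 3 / n ** STRASSEN_EXP
        print(f"n = {n:>7}: n^3 / n^log2(7) = {ratio:.1f}x")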
It's very important to stop at the right moment.
With 1,000 x 1,000 matrices, you can multiply them by doing seven 500 x 500 products plus a few additions. That's probably useful. With 500 x 500, maybe. With 10 x 10 matrices, most likely not. You'd just have to do some experiments first to find the point at which to stop.
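A minimal sketch of that cutoff idea, assuming square matrices whose size is a power of two and using NumPy arrays purely for convenient slicing (the threshold of 128 is a placeholder you would tune by experiment):

    import numpy as np

    CUTOFF = 128  # placeholder; find the right value by benchmarking

    def strassen(A, B):
        """Strassen's algorithm with a fallback to the conventional product."""
        n = A.shape[0]
        if n <= CUTOFF:
            return A @ B                      # conventional multiplication
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # The seven half-size products.
        M1 = strassen(A11 + A22, B11 + B22)
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        # Recombine into the four quadrants of the result.
        C11 = M1 + M4 - M5 + M7
        C12 = M3 + M5
        C21 = M2 + M4
        C22 = M1 - M2 + M3 + M6
        return np.block([[C11, C12], [C21, C22]])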
But Strassen's algorithm only saves a factor of 2 (at best) when the number of rows grows by a factor of 32: the number of coefficients grows by 1,024, and the total work grows by a factor of 16,807 instead of 32,768. In practice, that's a "constant factor". I'd say you gain more by transposing the second matrix first so you can multiply rows by rows, then looking carefully at cache sizes, vectorising as much as possible, and distributing the work over multiple cores that don't step on each other's toes.
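For what it's worth, the transpose-first trick looks roughly like this in pure Python; it is only meant to show the access pattern, since a real implementation would do this in a compiled language with blocking and SIMD:

    def matmul_transposed(A, B):
        """Multiply row-major A (n x m) by B (m x p), transposing B first."""
        n, m, p = len(A), len(B), len(B[0])
        Bt = [[B[k][j] for k in range(m)] for j in range(p)]  # transpose of B
        C = [[0.0] * p for _ in range(n)]
        for i in range(n):
            Ai = A[i]
            for j in range(p):
                Btj = Bt[j]
                # Both Ai and Btj are scanned contiguously: rows times rows.
                C[i][j] = sum(Ai[k] * Btj[k] for k in range(m))
        return C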
Marginal improvement: True, but the improvement grows as the matrix sizes grow.
Higher constant factors: Practical implementations of Strassen's algorithm switch to the conventional O(n^3) product for blocks below a particular size, so this doesn't really matter.
Harder to implement: whatever.
As for what's used in practice: first, you have to understand that multiplying two ginormous dense matrices is unusual. Much more often, one or both of them is sparse, symmetric, upper triangular, or has some other structure, which means there are quite a few specialized routines that are essential to the large-matrix-multiplication toolbox. With that said, for giant dense matrices, Strassen's is The Solution.
I need to multiply N matrix pairs. If we multiply the pairs sequentially, the compiler (or library) can use all cores for the multiplication of each pair, provided the matrices are large enough. Let's say for simplicity that we do element-wise multiplication.
But the parallelization will still not be optimal.
On the other hand, we could multiply K matrix pairs in parallel, each using single-threaded multiplication, where K is the number of cores. I think that this way we will get a much higher cache-miss rate and it will be slower. Am I right?
Parallelization is usually faster than serialization, unless you have massive overhead for splitting your computation. So the question you're asking is "can we efficiently split this multiplication?"
Yes, we can, and speedups on the order of Θ(n^2) are achievable in practice. See here, especially the sections on cache behavior. Good luck!
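If you want to measure the two strategies from the question yourself, a rough sketch with NumPy and a process pool might look like this (the sizes, the pool size of 4, and the use of processes rather than threads are all arbitrary choices; whether one pair per core beats letting a threaded BLAS attack each product with all cores is exactly what you'd have to benchmark):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def multiply_pair(pair):
        A, B = pair
        return A @ B   # one product; a threaded BLAS may still use several cores

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        pairs = [(rng.random((512, 512)), rng.random((512, 512)))
                 for _ in range(16)]

        # Strategy 1: sequential over pairs, all cores available to each product.
        sequential = [A @ B for A, B in pairs]

        # Strategy 2: K pairs in flight at once, one worker process per pair.
        with ProcessPoolExecutor(max_workers=4) as pool:
            parallel = list(pool.map(multiply_pair, pairs))

Note that the process pool has to copy the arrays to the workers, which is overhead of its own, so timings on your actual matrix sizes are the only reliable answer.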
In my free time I'm preparing for interview questions like: implement multiplication of numbers represented as arrays of digits. Obviously I'm forced to write it from scratch in a language like Python or Java, so an answer like "use GMP" is not acceptable (as mentioned here: Understanding Schönhage-Strassen algorithm (huge integer multiplication)).
For exactly which range of sizes of those two numbers (i.e. number of digits) should I choose:
School-grade algorithm
Karatsuba algorithm
Toom-Cook
Schönhage–Strassen algorithm?
Is Schönhage–Strassen, at O(n log n log log n), always a good solution? Wikipedia mentions that Schönhage–Strassen is advisable for numbers beyond 2^2^15 to 2^2^17. What should I do when one number is ridiculously huge (e.g. 10,000 to 40,000 decimal digits) but the second consists of just a couple of digits?
Do all four of those algorithms parallelize easily?
You can browse the GNU Multiple Precision Arithmetic Library's source and see their thresholds for switching between algorithms.
More pragmatically, you should just profile your implementation of the algorithms. GMP puts a lot of effort into optimizing, so their algorithms will have different constant factors than yours. The difference could easily move the thresholds around by an order of magnitude. Find out where the times cross as input size increases for your code, and set the thresholds correspondingly.
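As a sketch of that kind of experiment, here is a delayed-carry schoolbook routine and a Karatsuba routine over digit arrays (least-significant digit first, base 10, and a cutoff of 32 digits; all of those choices are placeholders you would vary while profiling):

    from itertools import zip_longest

    BASE = 10
    CUTOFF = 32   # placeholder; move it around based on your own timings

    def school_mul(a, b):
        """Grade-school multiplication; carries are resolved later."""
        res = [0] * (len(a) + len(b))
        for i, da in enumerate(a):
            for j, db in enumerate(b):
                res[i + j] += da * db
        return res

    def karatsuba(a, b):
        """Karatsuba on digit arrays, falling back to school_mul below CUTOFF."""
        if min(len(a), len(b)) <= CUTOFF:
            return school_mul(a, b)
        m = min(len(a), len(b)) // 2
        a0, a1 = a[:m], a[m:]              # a = a0 + a1 * BASE^m
        b0, b1 = b[:m], b[m:]
        z0 = karatsuba(a0, b0)
        z2 = karatsuba(a1, b1)
        z1 = karatsuba([x + y for x, y in zip_longest(a0, a1, fillvalue=0)],
                       [x + y for x, y in zip_longest(b0, b1, fillvalue=0)])
        res = [0] * (len(a) + len(b))
        for i, v in enumerate(z0):         # + z0, and - z0 * BASE^m
            res[i] += v
            res[i + m] -= v
        for i, v in enumerate(z1):         # + z1 * BASE^m
            res[i + m] += v
        for i, v in enumerate(z2):         # - z2 * BASE^m, + z2 * BASE^(2m)
            res[i + m] -= v
            res[i + 2 * m] += v
        return res

    def normalize(coeffs):
        """Propagate carries so every digit ends up in [0, BASE)."""
        out, carry = [], 0
        for c in coeffs:
            carry, digit = divmod(c + carry, BASE)
            out.append(digit)
        while carry:
            carry, digit = divmod(carry, BASE)
            out.append(digit)
        while len(out) > 1 and out[-1] == 0:
            out.pop()
        return out

    # Sanity check against Python's built-in big integers.
    to_digits = lambda n: [int(c) for c in str(n)[::-1]]
    x, y = 3**500, 7**400
    assert normalize(karatsuba(to_digits(x), to_digits(y))) == to_digits(x * y)

Running both school_mul and karatsuba under timeit for increasing digit counts is then enough to locate the crossover for your own code.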
I think all of the algorithms are amenable to parallelization, since they're mostly made up of divide-and-conquer passes. But keep in mind that parallelizing is another thing that will move the thresholds around quite a lot.
I am not really trying to optimize anything, but I remember hearing this from programmers all the time, so I took it as truth. After all, they are supposed to know this stuff.
But I wonder why division is actually slower than multiplication. Isn't division just a glorified subtraction, and multiplication a glorified addition? Mathematically I don't see why going one way or the other should have such different computational costs.
Can anyone please clarify the reason for this, so that I actually know, instead of the answer I got from the other programmers I asked, which was simply: "because".
A CPU's ALU (Arithmetic Logic Unit) executes algorithms, even though they are implemented in hardware. Classic multiplication algorithms include the Wallace tree and the Dadda tree. More information is available here. More sophisticated techniques are available in newer processors. Generally, processors strive to operate on bit pairs in parallel in order to minimize the clock cycles required. Multiplication algorithms can be parallelized quite effectively (though more transistors are required).
Division algorithms can't be parallelized as efficiently. The most efficient division algorithms are quite complex (the Pentium FDIV bug demonstrates the level of complexity involved). Generally, they require more clock cycles per bit. If you're after more technical details, here is a nice explanation from Intel. Intel actually patented their division algorithm.
But I wonder why division is actually slower than multiplication. Isn't division just a glorified subtraction, and multiplication a glorified addition?
The big difference is that in a long multiplication you just need to add up a bunch of numbers after shifting and masking. In a long division you have to test for overflow after each subtraction.
Let's consider a long multiplication of two n-bit binary numbers:
shift (no time)
mask (constant time)
add (naively this looks like time proportional to n²)
But if we look closer it turns out we can optimise the addition by using two tricks (there are further optimisations but these are the most important).
We can add the numbers in groups rather than sequentially.
Until the final step we can add three numbers to produce two rather than adding two to produce one. While adding two numbers to produce one takes time proportional to n, adding three numbers to produce two can be done in constant time because we can eliminate the carry chain.
So now our algorithm looks like
shift (no time)
mask (constant time)
add numbers in groups of three to produce two until there are only two left (time proportional to log(n))
perform the final addition (time proportional to n)
In other words, we can build a multiplier for two n-bit numbers in time roughly proportional to n (and space roughly proportional to n²). As long as the CPU designer is willing to dedicate the logic, multiplication can be almost as fast as addition.
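The "three numbers in, two numbers out" step is the carry-save trick. Here is a bit-level illustration in Python, with integers standing in for bit vectors, showing that nothing has to ripple from one column to the next:

    def carry_save_add(a, b, c):
        """Reduce three addends to two addends with the same total.

        Every output bit depends only on the three input bits in its own
        column, so this step takes constant time regardless of width."""
        partial_sums = a ^ b ^ c                        # per-column sum bits
        carries = ((a & b) | (a & c) | (b & c)) << 1    # per-column carry bits
        return partial_sums, carries

    a, b, c = 0b1011, 0b1101, 0b0110
    s, t = carry_save_add(a, b, c)
    assert s + t == a + b + c   # only this final add needs carry propagation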
In long division we need to know whether each subtraction overflowed before we can decide what inputs to use for the next one, so we can't apply the same parallelising tricks as we can with long multiplication.
There are methods of division that are faster than basic long division, but they are still slower than multiplication.
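To see the sequential dependency concretely, here is a naive restoring (shift-and-subtract) division sketch; each quotient bit depends on a comparison that can only be made after the previous iteration has finished:

    def restoring_divide(dividend, divisor, nbits=32):
        """Naive shift-and-subtract division of non-negative integers."""
        quotient, remainder = 0, 0
        for i in range(nbits - 1, -1, -1):
            remainder = (remainder << 1) | ((dividend >> i) & 1)
            quotient <<= 1
            # The trial comparison/subtraction must finish before this
            # quotient bit is known, so iterations cannot be overlapped
            # the way the partial products of a multiplication can.
            if remainder >= divisor:
                remainder -= divisor
                quotient |= 1
        return quotient, remainder

    assert restoring_divide(1000, 7) == (142, 6)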
From Wikipedia:
"Anindya De, Chandan Saha, Piyush Kurur and Ramprasad Saptharishi[11] gave a similar algorithm using modular arithmetic in 2008 achieving the same running time. However, these latter algorithms are only faster than Schönhage–Strassen for impractically large inputs."
I would be very interested in the size of such impractically large integers.
Maybe someone did implement both algorithms in a certain way and could do some benchmarks?
Thanks
Fürer's algorithm and its modular equivalent (DSKS) are very deep research topics and, for now, remain of academic interest only. Nobody actually knows how big the cross-over point is. And in all likelihood it doesn't matter, because that cross-over point is likely to be well beyond 64-bit computing limits.
I've implemented Schönhage-Strassen before and I understand how Fürer's algorithm works. So I'm quite familiar with both of them. I can say it's very possible that the cross-over point between Schönhage-Strassen and Fürer's algorithm is so high that a computer capable of holding the parameters would be larger than the observable universe.
That's the problem when you have complexities that differ by less than a logarithm. It takes exponentially large input sizes to compensate even for small differences in the Big-O constant.
In this case, Fürer's algorithm is known to have a very very very large Big-O constant.
I've got a simple question concerning the implementation of matrix multiplication. I know that there are algorithms for matrices of equal size (n x n) that have a complexity of O(n^2.xxx). But if I have two matrices A and B of different sizes (p x q and q x r), what would be the minimal complexity of an implementation to date? I would guess it is O(pqr), since I would implement the multiplication with three nested loops of p, q and r iterations. In particular, does anyone know how the Eigen library implements multiplication?
A common technique is to pad the p x q and q x r matrices with zeros so that both become n x n; then you can apply Strassen's algorithm.
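For example (NumPy is used here only for the bookkeeping, and the ordinary product stands in for a Strassen routine so the sketch runs as-is):

    import numpy as np

    def pad_to_square(M, n):
        """Embed M in the top-left corner of an n x n zero matrix."""
        P = np.zeros((n, n), dtype=M.dtype)
        P[:M.shape[0], :M.shape[1]] = M
        return P

    def multiply_via_padding(A, B):
        p, q = A.shape
        q2, r = B.shape
        assert q == q2
        n = 1 << (max(p, q, r) - 1).bit_length()  # next power of two >= max(p, q, r)
        # A power-of-two Strassen routine would be dropped in here; the zero
        # blocks add extra work, which is why padding only pays off when
        # p, q and r are reasonably close to each other.
        C = pad_to_square(A, n) @ pad_to_square(B, n)
        return C[:p, :r]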
You're correct about it being O(pqr) for exactly the reasons you stated.
I'm not sure how Eigen implements it, but there are many ways to optimize matrix multiplication algorithms, such as improving cache performance by tiling and being aware of whether the language you're using is row-major or column-major (you want the inner loops to access memory in steps as small as possible to avoid cache misses). Some other optimization techniques are detailed here.
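A stripped-down illustration of the tiling idea in pure Python (the block size of 64 is a placeholder; in practice you'd pick it so that a few blocks fit in cache, and do this in a compiled language):

    def blocked_matmul(A, B, block=64):
        """Cache-blocked multiplication for row-major lists of lists."""
        n, m, p = len(A), len(B), len(B[0])
        C = [[0.0] * p for _ in range(n)]
        for ii in range(0, n, block):
            for kk in range(0, m, block):
                for jj in range(0, p, block):
                    for i in range(ii, min(ii + block, n)):
                        for k in range(kk, min(kk + block, m)):
                            aik = A[i][k]
                            # Innermost loop walks C[i] and B[k] contiguously,
                            # which is the row-major-friendly access pattern.
                            for j in range(jj, min(jj + block, p)):
                                C[i][j] += aik * B[k][j]
        return C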
As Yu-Had Lyu mentioned, you can pad them with zeroes, but unless p, q and r are close, the complexity degrades (the padded matrices are much larger than the originals).
To answer your other question about how Eigen implements it:
The way numerics packages implement matrix multiplication is normally by using the typical O(pqr) algorithm, but heavily optimised in "non-mathematical" ways: blocking for better cache locality, using special processor instructions (SIMD etc.).
Some packages (MATLAB, Octave, ublas) use two libraries called BLAS and LAPACK which provide linear algebra primitives (like matrix multiplication) heavily optimised in this way (sometimes using hardware-specific optimisations).
AFAIK, Eigen simply uses blocking and SIMD instructions.
Few common numeric libraries (Eigen included) use Strassen's algorithm. The reason for this is actually very interesting: while the complexity is better (O(n^(log2 7))), the hidden constants behind the big O are very large due to all the additions performed. In other words, the algorithm is only useful in practice for very large matrices.
Note: There is an even more efficient (in terms of asymptotic complexity) algorithm than Strassen's: the Coppersmith–Winograd algorithm, with O(n^(2.3727)), but its constants are so large that it is unlikely ever to be used in practice. It is in fact believed that there exists an algorithm that runs in O(n^2) (which is the trivial lower bound, since any algorithm needs to at least read the n^2 elements of the matrices).