I've got a simple question concerning the implementation of matrix multiplication. I know that there are algorithms for matrices of equal size (n x n) that have a complexity of O(n^2.xxx). But if I have two matrices A and B of different sizes (p x q, q x r), what would be the minimal complexity of the implementation to date? I would guess it is O(pqr), since I would implement the multiplication with 3 nested loops of p, q and r iterations. In particular, does anyone know how the Eigen library implements its multiplication?
A common technique is to pad the matrices of size p x q and q x r with zeros so that both become n x n, with n = max(p, q, r). Then you can apply Strassen's algorithm.
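To make the padding idea concrete, here is a minimal sketch (illustrative only; the helper name padded_product is made up and the square product could be replaced by a Strassen-style routine): the inputs are embedded in the top-left corners of zero n x n matrices, the square product is formed, and the p x r result is read back out.

```cpp
// Hedged sketch: pad p x q and q x r matrices to n x n (n = max(p, q, r)),
// multiply the padded squares, and read off the top-left p x r block.
#include <Eigen/Dense>
#include <algorithm>

Eigen::MatrixXd padded_product(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B) {
    const Eigen::Index n = std::max({A.rows(), A.cols(), B.cols()});
    Eigen::MatrixXd Ap = Eigen::MatrixXd::Zero(n, n);
    Eigen::MatrixXd Bp = Eigen::MatrixXd::Zero(n, n);
    Ap.topLeftCorner(A.rows(), A.cols()) = A;
    Bp.topLeftCorner(B.rows(), B.cols()) = B;
    Eigen::MatrixXd Cp = Ap * Bp;  // a Strassen-style routine could be substituted here
    return Cp.topLeftCorner(A.rows(), B.cols());
}
```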
You're correct about it being O(pqr) for exactly the reasons you stated.
I'm not sure how Eigen implements it, but there are many ways to optimize matrix multiplication algorithms, such as improving cache performance by tiling and being aware of whether the language you're using is row major or column major (you want the inner loop to step through memory with the smallest possible stride to avoid cache misses). Some other optimization techniques are detailed here.
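As an illustration of the loop-order point (generic code, not Eigen's implementation): for row-major C/C++ storage, ordering the loops (i, k, j) makes the innermost loop walk B and C contiguously, which already helps a lot before any explicit tiling.

```cpp
// Hedged sketch of the O(pqr) product with a cache-friendlier loop order
// (i, k, j): for row-major storage the innermost loop accesses B and C
// with stride 1. Illustrative only, not how Eigen does it.
#include <vector>

void matmul(const std::vector<double>& A,   // p x q, row-major
            const std::vector<double>& B,   // q x r, row-major
            std::vector<double>& C,         // p x r, row-major, zero-initialised
            int p, int q, int r) {
    for (int i = 0; i < p; ++i)
        for (int k = 0; k < q; ++k) {
            const double a = A[i * q + k];
            for (int j = 0; j < r; ++j)
                C[i * r + j] += a * B[k * r + j];
        }
}
```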
As Yu-Had Lyu mentioned, you can pad them with zeroes, but unless p, q and r are close to one another the approach degenerates: the padded matrices are much larger than the originals, and the time spent padding (and multiplying the zero blocks) outweighs any gain.
To answer your other question about how Eigen implements it:
The way numerics packages implement matrix multiplication is normally the typical O(pqr) algorithm, but heavily optimised in "non-mathematical" ways: blocking for better cache locality, using special processor instructions (SIMD etc.).
Some packages (MATLAB, Octave, ublas) use two libraries called BLAS and LAPACK which provide linear algebra primitives (like matrix multiplication) heavily optimised in this way (sometimes using hardware-specific optimisations).
AFAIK, Eigen simply uses blocking and SIMD instructions.
Hardly any of the common numeric libraries use Strassen's algorithm, and Eigen is no exception. The reason for this is actually very interesting: while the asymptotic complexity is better (O(n^(log2 7)), roughly O(n^2.807)), the hidden constants behind the big-O are very large due to all the additions performed - in other words, the algorithm is only useful in practice for very large matrices.
Note: There are even more efficient (in terms of asymptotic complexity) algorithms than Strassen's: the Coppersmith–Winograd algorithm and its later refinements, with roughly O(n^2.3727), but their constants are so large that it is unlikely they will ever be used in practice. It is in fact believed that the exponent can be brought arbitrarily close to 2, which is the trivial lower bound, since any algorithm needs to at least read the n^2 elements of the matrices.
Related
The Google microbenchmark library supports estimating the complexity of an algorithm, but everything is expressed by telling the framework what your N is. I'm curious what the best way is to represent O(M+N) algorithms in this framework. In my test I iterate over a cartesian product of M and N values.
Do I call SetComplexityN with M+N? (And for O(MN) algorithms I assume I'd similarly pass M*N to SetComplexityN.) If I want to hard-code the complexity of the algorithm (vs. doing a best fit), does benchmark::oN then map to M+N and benchmark::oNSquared to O(MN)?
It's not something we've yet considered in the library.
If you set the complexity to M+N and use oN, then the fitting curve used for the least-squares calculation will be linear in M+N.
However, if you set the complexity to M*N and use oNSquared, then we'll try to fit to pow(M*N, 2), which is likely not what you want, so I think using oN is still the appropriate choice.
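A minimal sketch of that approach (the workload BM_TwoParam is made up for illustration; ArgsProduct is available in recent versions of the library, otherwise repeated Args calls do the same job):

```cpp
#include <benchmark/benchmark.h>
#include <vector>

// Hypothetical O(M+N) workload: one linear pass over each of two inputs.
static void BM_TwoParam(benchmark::State& state) {
    const auto M = state.range(0);
    const auto N = state.range(1);
    std::vector<int> a(M, 1), b(N, 2);
    for (auto _ : state) {
        long long sum = 0;
        for (int x : a) sum += x;   // O(M)
        for (int x : b) sum += x;   // O(N)
        benchmark::DoNotOptimize(sum);
    }
    // Tell the framework that the "problem size" is M + N ...
    state.SetComplexityN(M + N);
}
// ... and ask for a linear fit in that size.
BENCHMARK(BM_TwoParam)
    ->ArgsProduct({{1 << 10, 1 << 12, 1 << 14}, {1 << 10, 1 << 12, 1 << 14}})
    ->Complexity(benchmark::oN);

BENCHMARK_MAIN();
```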
I am interested to learn about the time complexity for multiplying large integers modulo a (large) constant using multiple processors. What this boils down to is basically just integer multiplication, since division and remainder can also be implemented using multiplication (e.g. reciprocal multiplication or Barrett reduction).
I know that the runtime of the best currently known integer multiplication algorithms is roughly n * log n, which is also conjectured to be a lower bound. My research has failed to turn up whether this holds for a single-core or a multi-core machine. However, I am thinking this is for a single-core machine, as the algorithms seem to use a divide-and-conquer approach.
Now my question is: what is the currently known lower bound for the time complexity of a parallel integer multiplication algorithm implemented on m cores? Can a time complexity of O(n) or less be achieved, given enough cores (i.e. if m depends on n)? Here n denotes the input size of the integers at hand.
So far in my research I have read several papers claiming a speedup using parallel FFT multiplication. Unfortunately these only report empirical speedups (e.g. "a 56% speed improvement using 6 cores on such and such a computer") and fail to state the theoretical speedup in terms of time complexity bounds.
I am aware that the "fastest" integer multiplication algorithm has not yet been found; this is an unsolved problem in computer science. I am merely inquiring about the currently known bounds for such parallel algorithms.
Update #1: User #delnan linked to a wiki page about the NC complexity class. That wiki page mentions that integer multiplication is in NC, meaning there exists an O((log n)^c) algorithm on O(n^k) processors. This is helpful towards getting closer to an answer. The part that's left unanswered for now is what the c and k constants are for integer multiplication, and which parallel algorithm lends itself to this purpose.
Update #2: According to page 12 of 15 in this PDF file from a Computer Science course at Cornell University, integer multiplication in the NC complexity class takes O(log n) time on O(n^2) processors. It also explains an example algorithm on how to go about this. I'll write up a proper answer for this question shortly.
One last question to satisfy my curiosity: might anyone know something about the currently known time complexity for "just" O(n), O(sqrt(n)) or O(log n) processors?
The computational complexity of algorithms is not affected by parallelisation.
For sure, take a problem such as integer multiplication and you can find a variety of algorithms for solving it. These algorithms will exhibit a range of complexities. But given any algorithm, running it on p processors will, at theoretical best, give a speedup of a factor of p. In terms of computational complexity this is like multiplying the existing complexity, call it O(f(n)), by a constant, to get O((1/p)*f(n)). And as you know, multiplying a complexity by a constant doesn't affect the complexity classification.
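In symbols (T_1 and T_p denoting the serial and p-processor running times, assuming ideal speedup; this just restates the argument above):

```latex
T_p(n) \;\ge\; \frac{T_1(n)}{p}
\qquad\text{and}\qquad
O\!\left(\tfrac{1}{p}\,f(n)\right) \;=\; O\big(f(n)\big) \quad \text{for fixed } p,
```

so even the ideal p-fold speedup leaves the complexity class unchanged.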
To take another, perhaps more practical, line of argument, changing the number of processors doesn't change the number of basic operations that an algorithm performs for any given problem at any given size -- except for any additional operations necessitated by coordinating the operation of parallel components.
Again, bear in mind that computational complexity is a very theoretical measure of computational effort, generally measured as the number of some kind of basic operations required to compute for a given input size. One such basic operation might be the multiplication of two numbers (of limited size). Changing the definition of the basic operation to multiplication of two vectors of numbers (each of limited size) won't change the complexity of algorithms.
Of course, there is a lot of empirical evidence that parallel algorithms can operate more quickly than their serial counterparts, as you have found.
Strassen's algorithm for matrix multiplication gives only a marginal improvement over the conventional O(N^3) algorithm. It has higher constant factors and is much harder to implement. Given these shortcomings, is Strassen's algorithm actually useful, and is it implemented in any library for matrix multiplication? Moreover, how is matrix multiplication implemented in libraries?
Generally, Strassen's method is not preferred for practical applications, for the following reasons:
1. The constants used in Strassen's method are high, and for a typical application the naive method works better.
2. For sparse matrices, there are better methods especially designed for them.
3. The submatrices in the recursion take extra space.
4. Because of the limited precision of computer arithmetic on non-integer values, larger errors accumulate in Strassen's algorithm than in the naive method.
The idea of Strassen's algorithm is that it's faster (asymptotically speaking). This could make a big difference if you are dealing with either huge matrices or a very large number of matrix multiplications. However, just because it's faster asymptotically doesn't make it the most efficient algorithm in practice. There are all sorts of implementation considerations, such as caching and architecture-specific quirks. There is also parallelism to consider.
I think your best bet would be to look at the common libraries and see what they are doing. Have a look at BLAS for example. And I think that Matlab uses MAGMA.
If your contention is that you don't think O(n^2.8) is that much faster than O(n^3), plotting the two growth rates shows that n doesn't need to be very large before the difference becomes significant.
It's very important to stop at the right moment.
With 1,000 x 1,000 matrices, you can multiply them by doing seven 500 x 500 products plus a few additions. That's probably useful. With 500 x 500, maybe. With 10 x 10 matrices, most likely not. You'd just have to run some experiments first to find the point at which to stop.
But Strassen's algorithm only saves a factor of 2 (at best) when the number of rows grows by a factor of 32: the number of coefficients grows by 1,024, and the total time grows by a factor of 16,807 instead of 32,768. In practice, that's a "constant factor". I'd say you gain more by transposing the second matrix first so you can multiply rows by rows, then looking carefully at cache sizes, vectorising as much as possible, and distributing over multiple cores that don't step on each other's feet.
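To make the cutoff idea concrete, here is a hedged sketch (not how Eigen or an optimised BLAS actually multiplies matrices) of Strassen's recursion with a fallback to the plain product below a tunable block size. It assumes the dimension is a power of two; otherwise pad with zeros first.

```cpp
// Hedged sketch: Strassen's recursion with a cutoff, falling back to the
// ordinary product for small blocks. Assumes square n x n inputs with n a
// power of two. The cutoff value is illustrative and should be tuned.
#include <Eigen/Dense>

using Mat = Eigen::MatrixXd;

Mat strassen(const Mat& A, const Mat& B, int cutoff = 128) {
    const Eigen::Index n = A.rows();
    if (n <= cutoff) return A * B;   // plain product below the crossover point

    const Eigen::Index h = n / 2;
    Mat A11 = A.topLeftCorner(h, h),    A12 = A.topRightCorner(h, h),
        A21 = A.bottomLeftCorner(h, h), A22 = A.bottomRightCorner(h, h);
    Mat B11 = B.topLeftCorner(h, h),    B12 = B.topRightCorner(h, h),
        B21 = B.bottomLeftCorner(h, h), B22 = B.bottomRightCorner(h, h);

    // The seven Strassen products (the "seven 500 x 500 products" above).
    Mat M1 = strassen(A11 + A22, B11 + B22, cutoff);
    Mat M2 = strassen(A21 + A22, B11,       cutoff);
    Mat M3 = strassen(A11,       B12 - B22, cutoff);
    Mat M4 = strassen(A22,       B21 - B11, cutoff);
    Mat M5 = strassen(A11 + A12, B22,       cutoff);
    Mat M6 = strassen(A21 - A11, B11 + B12, cutoff);
    Mat M7 = strassen(A12 - A22, B21 + B22, cutoff);

    Mat C(n, n);
    C.topLeftCorner(h, h)     = M1 + M4 - M5 + M7;
    C.topRightCorner(h, h)    = M3 + M5;
    C.bottomLeftCorner(h, h)  = M2 + M4;
    C.bottomRightCorner(h, h) = M1 - M2 + M3 + M6;
    return C;
}
```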
Marginal improvement: True, but growing as the matrix sizes grow.
Higher constant factors: Practical implementations of Strassen's algorithm use conventional n^3 for blocks below a particular size, so this doesn't really matter.
Harder to implement: whatever.
As for what's used in practice: First, you have to understand that multiplying two ginormous dense matrices is unusual. Much more often, one or both of them is sparse, or symmetric, or upper triangular, or some other pattern, which means that there are quite a few specialized tools which are essential to the efficient large matrix multiplication toolbox. With that said, for giant dense matrices, Strassen's is The Solution.
Algorithm requirements
Input is an arbitrary square matrix M of size N×N, which just fits in memory.
The algorithm's output must be true if M[i,j] = M[j,i] for all j≠i, false otherwise.
Obvious solutions
Check if the transpose equals the matrix itself (M^T = M). Easiest to program in many environments, but (usually) consumes twice the memory and requires N² comparisons in the worst case. Therefore, this is O(N²) and has high peak memory.
Check if the lower triangular part equals the upper triangular part. Of course, the algorithm returns on the first inequality found. This makes the worst case (the worst case being that the matrix is indeed symmetric) require (N² - N)/2 comparisons, since the diagonal does not need to be checked. So although it is better than option 1, this is still O(N²).
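A minimal sketch of option 2 (generic code, nothing library-specific), with the early exit on the first mismatch:

```cpp
// Hedged sketch: compare the strictly lower triangle with the strictly upper
// triangle, returning false on the first mismatch. Row-major N x N storage.
#include <cstddef>
#include <vector>

bool is_symmetric(const std::vector<double>& M, std::size_t N) {
    for (std::size_t i = 1; i < N; ++i)
        for (std::size_t j = 0; j < i; ++j)
            if (M[i * N + j] != M[j * N + i])
                return false;     // early exit on the first asymmetric pair
    return true;                  // (N^2 - N)/2 comparisons in the worst case
}
```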
Question
Although it's hard to see how it would be possible (the N² elements will all have to be compared somehow), is there an algorithm doing this check that is better than O(N²)?
Or, provided there is a proof of non-existence of such an algorithm: how to implement this most efficiently for a multi-core CPU (Intel or AMD) taking into account things like cache-friendliness, optimal branch prediction, other compiler-specific specializations, etc.?
This question stems mostly from academic interest, although I imagine a practical use could be to determine what solver to use if the matrix describes a linear system AX=b...
Since you will have to examine all the elements except the diagonal, the complexity IMO can't be better than O(n^2).
For a dense matrix, the answer is a definite "no", because any uninspected (non-diagonal) elements could be different from their transposed counterparts.
For standard representations of a sparse matrix, the same reasoning indicates that you can't generally do better than the input size.
However, the same reasoning doesn't apply to arbitrary matrix representations. For example, you could store sparse representations of the symmetric and antisymmetric components of your matrix, which can easily be checked for symmetry in O(1) time by checking whether the antisymmetric component has any entries at all...
I think you can take a probabilistic approach here.
I think it's not chance or coincidence if x randomly picked lower-triangular elements all match their upper-triangular counterparts - in that case the chance is very high that the matrix is indeed symmetric.
So instead of going through all (n² - n)/2 off-diagonal pairs, you can check p random coordinates and tell that the matrix is symmetric with confidence:
p / ((n² - n)/2)
You can then decide on a threshold above which you believe that the matrix must be symmetric.
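A sketch of that sampling idea (illustrative only; the sample count and the interpretation of the "confidence" are up to you):

```cpp
// Hedged sketch: test `samples` random off-diagonal pairs; any mismatch proves
// the matrix is not symmetric, otherwise report "probably symmetric".
// Row-major N x N storage.
#include <cstddef>
#include <random>
#include <vector>

bool probably_symmetric(const std::vector<double>& M, std::size_t N,
                        std::size_t samples, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> dist(0, N - 1);
    for (std::size_t s = 0; s < samples; ++s) {
        std::size_t i = dist(rng), j = dist(rng);
        if (i == j) continue;                      // diagonal never disproves symmetry
        if (M[i * N + j] != M[j * N + i])
            return false;                          // certain: not symmetric
    }
    return true;                                   // no counterexample found
}
```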
I am looking for an efficient algorithm to find the largest eigenpair of a small, general (non-square, non-sparse, non-symmetric), complex matrix A of size m x n. By small I mean m and n are typically between 4 and 64, and usually around 16, but with m not equal to n.
This problem is straightforward to solve with the general LAPACK SVD algorithms, i.e. gesvd or gesdd. However, as I am solving millions of these problems and only require the largest eigenpair, I am looking for a more efficient algorithm. Additionally, in my application the eigenvectors will generally be similar for all cases. This led me to investigate Arnoldi-iteration-based methods, but I have neither found a good library nor an algorithm that applies to my small, general, complex matrix. Is there an appropriate algorithm and/or library?
Rayleigh quotient iteration has cubic convergence, but each step requires an LU or QR decomposition of your matrix, so you may also want to implement the plain power method and see how the two compare.
http://en.wikipedia.org/wiki/Rayleigh_quotient_iteration
Following #rchilton's comment, you can apply this to A* A (the conjugate transpose of A times A), whose dominant eigenvector is the right singular vector of A corresponding to its largest singular value.
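A minimal sketch of that suggestion (illustrative, with a made-up helper name; convergence control is left out and a fixed iteration count is used instead). Passing in the previous solution as a warm start fits the situation where the eigenvectors are similar across problems.

```cpp
// Hedged sketch: power iteration on A^H A to approximate the largest singular
// value / right singular vector of a small complex m x n matrix A.
#include <Eigen/Dense>

using CMat = Eigen::MatrixXcd;
using CVec = Eigen::VectorXcd;

// Returns an estimate of sigma_max; v is overwritten with the corresponding
// right singular vector (and can be passed in as a warm start).
double largest_singular(const CMat& A, CVec& v, int iters = 50) {
    if (v.size() != A.cols()) v = CVec::Random(A.cols());  // fall back to a random start
    v.normalize();
    for (int it = 0; it < iters; ++it) {
        CVec w = A.adjoint() * (A * v);   // one application of A^H A
        v = w.normalized();
    }
    return (A * v).norm();                // ||A v|| approximates sigma_max
}
```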
The idea of looking for the largest eigenpair is analogous to finding a large power of the matrix, as the lower-frequency modes get damped out during the iteration. The Lanczos algorithm is one of a few such algorithms that rely on the so-called Ritz eigenvectors during the decomposition. From Wikipedia:
The Lanczos algorithm is an iterative algorithm ... that is an adaptation of power methods to find eigenvalues and eigenvectors of a square matrix or the singular value decomposition of a rectangular matrix. It is particularly useful for finding decompositions of very large sparse matrices. In latent semantic indexing, for instance, matrices relating millions of documents to hundreds of thousands of terms must be reduced to singular-value form.
The technique works even if the system is not sparse, but if it is large and dense it has the advantage that it doesn't all have to be stored in memory at the same time.
How does it work?
The power method for finding the largest eigenvalue of a matrix A can be summarized by noting that if x_{0} is a random vector and x_{n+1}=A x_{n}, then in the large n limit, x_{n} / ||x_{n}|| approaches the normed eigenvector corresponding to the largest eigenvalue.
Non-square matrices?
Noting that your system is not a square matrix, I'm pretty sure that the SVD problem can be decomposed into separate linear algebra problems where the Lanczos algorithm would apply. A good place to ask such questions would be over at https://math.stackexchange.com/.