Why is my Strassen Matrix multiplier so fast?

As an experiment I implemented the Strassen matrix multiplication algorithm to see if it truly leads to faster code for large n.
https://github.com/wcochran/strassen_multiplier/blob/master/mm.c
To my surprise it was way faster for large n. For example, the n=1024 case
took 17.20 seconds using the conventional method whereas it only took 1.13 seconds
using the Strassen method (2x2.66 GHz Xeon). What -- a 15x speedup!? It should only be marginally faster. In fact, it seemed to be just as good even for small 32x32 matrices!?
The only way I can explain this much of a speed-up is that my algorithm is more cache-friendly -- i.e., it focuses on small pieces of the matrices and thus the data is more localized. Maybe I should be doing all my matrix arithmetic piecemeal when possible.
Any other theories on why this is so fast?

The recursive nature of Strassen has better memory locality,
so that may be a part of the picture. A recursive regular
matrix multiplication is perhaps a reasonable thing
to compare to.

First question is "are the results correct?". If so, it's likely that your "conventional" method is not a good implementation.
The conventional method is not to use 3 nested FOR loops to scan the inputs in the order you learned in math class. One simple improvement is to transpose the matrix on the right so that its columns become contiguous in memory rather than its rows. Modify the multiply loop to use this alternate layout and it will run much faster on a large matrix.
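For illustration, here is a minimal sketch of that transpose trick, assuming square n x n row-major matrices (my own example, not code from the linked repo; the function name is made up):
#include <cstddef>
#include <vector>

// Multiply two n x n row-major matrices, transposing the right-hand operand
// first so the innermost loop reads both inputs contiguously.
void multiply_transposed(int n, const double* a, const double* b, double* c)
{
    std::vector<double> bt((std::size_t)n * n);   // b transposed
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * bt[j * n + k];   // two sequential row walks
            c[i * n + j] = sum;
        }
}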
The standard matrix libraries implement much more cache friendly methods that consider the size of the data cache.
You might also implement a recursive version of the standard matrix product (subdivide into a 2x2 matrix of matrices that are half the size). This will give something closer to optimal cache performance, which Strassen gets from being recursive.
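A rough sketch of such a recursive product, assuming n is a power of two and the result buffer is zero-initialised by the caller (the name rec_mul and the base-case size 32 are my own choices):
// Cache-oblivious recursive multiply-accumulate, C += A * B, for n x n
// row-major blocks with row stride `stride`.
void rec_mul(int n, int stride, const double* a, const double* b, double* c)
{
    if (n <= 32) {   // small base case; k-before-j keeps the inner loop contiguous
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    c[i * stride + j] += a[i * stride + k] * b[k * stride + j];
        return;
    }
    const int h = n / 2;   // split each matrix into a 2x2 grid of half-size blocks
    const double *a11 = a, *a12 = a + h, *a21 = a + h * stride, *a22 = a21 + h;
    const double *b11 = b, *b12 = b + h, *b21 = b + h * stride, *b22 = b21 + h;
    double *c11 = c, *c12 = c + h, *c21 = c + h * stride, *c22 = c21 + h;
    rec_mul(h, stride, a11, b11, c11); rec_mul(h, stride, a12, b21, c11);   // C11
    rec_mul(h, stride, a11, b12, c12); rec_mul(h, stride, a12, b22, c12);   // C12
    rec_mul(h, stride, a21, b11, c21); rec_mul(h, stride, a22, b21, c21);   // C21
    rec_mul(h, stride, a21, b12, c22); rec_mul(h, stride, a22, b22, c22);   // C22
}
Called as rec_mul(n, n, a, b, c) with c zeroed, this performs the eight half-size products of the 2x2 block decomposition, without any of Strassen's seven-product trickery.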
So either you're doing it wrong, or your conventional code is not optimized.

What is the loop order in your conventional multiplication? If you have
for (int i = 0; i < new_height; ++i)
{
    for (int j = 0; j < new_width; ++j)
    {
        double sum = 0.0;
        for (int k = 0; k < common; ++k)
        {
            sum += lhs[i * common + k] * rhs[k * new_width + j];
        }
        product[i * new_width + j] = sum;
    }
}
then you're not being very nice to the cache, because you're accessing the right-hand-side matrix in a non-contiguous manner. After reordering to
for (int i = 0; i < new_height; ++i)
{
    for (int k = 0; k < common; ++k)
    {
        double const fixed = lhs[i * common + k];
        for (int j = 0; j < new_width; ++j)
        {
            product[i * new_width + j] += fixed * rhs[k * new_width + j];
        }
    }
}
the accesses to two of the matrices in the innermost loop are contiguous and the third operand is even a fixed scalar. A good compiler would probably hoist it automatically, but I chose to pull it out explicitly for demonstration.
You didn't specify the language, but as for C++, advanced compilers can even recognize the unfriendly loop order in some configurations and reorder the loops themselves.

What counts as an operation in algorithms?

So I've just started learning algorithms and data structures, and I've read about Big O and how it describes the complexity of algorithms based on how the number of operations required scales.
But what actually counts as an operation? In this bubble sort, does each iteration of the for loop count as an operation, or only when an if statement is triggered, or all of them?
And since there are so many different algorithms of all kinds, how do you immediately identify what would count as an "operation" happening in the algorithm's code?
function bubbleSort(array) {
    for (let i = 0; i < array.length; i++) {
        for (let j = 0; j < array.length; j++) {
            if (array[j + 1] < array[j]) {
                let tmp = array[j]
                array[j] = array[j + 1]
                array[j + 1] = tmp
            }
        }
    }
    return array
}
You can count anything as an operation that will execute within a constant amount of time, independent of input. In other words, operations that have a constant time complexity.
If we assume your input consists of fixed-size integers (like 32-bit or 64-bit), then all of the following can be considered such elementary operations:
i++
j < array.length
array[j + 1] < array[j]
let tmp = array[j]
...
But that also means you can take several such operations together and still consider them a single elementary operation. So this is also an elementary operation:
if (array[j + 1] < array[j]) {
    let tmp = array[j]
    array[j] = array[j + 1]
    array[j + 1] = tmp
}
So, don't concentrate on breaking down operations into smaller operations, and those again into even smaller operations, when you are already certain that the larger operation is O(1).
Usually, every such constant-time chunk of work is treated as a single operation. This is one of the reasons we don't actually count the exact number of operations, but instead use asymptotic notations (big O and big Theta).
However, sometimes you are interested in only one kind of operation. A common example is algorithms that use IO. Since IO is significantly more time-consuming than anything happening on the CPU, you often just "count" the number of IO operations instead. In these cases you often care about the exact number of times an IO occurs, and can't rely on asymptotic notations alone.
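For illustration, here is a small sketch (in C++ rather than the question's JavaScript) that counts only two chosen kinds of operations -- comparisons and swaps -- and ignores everything else; the function name and counters are my own:
#include <cstdio>
#include <utility>
#include <vector>

// Bubble sort instrumented to tally only the operations we chose to count:
// element comparisons and element swaps.
void bubbleSortCounted(std::vector<int>& a, long& comparisons, long& swaps)
{
    comparisons = swaps = 0;
    for (std::size_t i = 0; i + 1 < a.size(); ++i)
        for (std::size_t j = 0; j + 1 < a.size(); ++j) {
            ++comparisons;                    // the "operation" we care about
            if (a[j + 1] < a[j]) {
                std::swap(a[j], a[j + 1]);
                ++swaps;                      // a second kind, counted separately
            }
        }
}

int main()
{
    std::vector<int> v{5, 1, 4, 2, 8};
    long cmp = 0, swp = 0;
    bubbleSortCounted(v, cmp, swp);
    std::printf("comparisons=%ld swaps=%ld\n", cmp, swp);  // comparisons grow ~ n^2
}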

Is there an optimised way to calculate `x^T A x` for symmetric `A` in Eigen++?

Given a symmetric matrix A and a vector x, I often need to calculate x^T * A * x. I can do it in Eigen++ 3.x with x.transpose() * (A * x), but that doesn't exploit the information that x is the same on both sides and A is symmetric. Is there a more efficient way to calculate this?
How often do you calculate this? If very often, for different x, then it might give a bit of a speed-up to compute a Cholesky or LDLT decomposition of A and exploit the fact that the product of a triangular matrix with a vector only needs half the multiplications.
Or possibly even simpler, if you decompose A=L+D+L.T, where L is strictly lower triangular and D is diagonal, then
x.T*A*x = x.T*D*x + 2*x.T*(L*x)
where the first term is the sum of d[k]*x[k]**2. The second term, if one carefully exploits the triangular structure, uses half the operations of the original expression.
If the triangular matrix-vector product has to be implemented outside of Eigen procedures, this might destroy the efficiency/optimizations of BLAS-like block operations in the generic matrix-vector product. In the end, there might be no improvements from this reduction in the count of arithmetic operations.
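For what it's worth, here is a minimal Eigen sketch of that split (my own example, with a made-up function name; whether it actually beats x.transpose() * (A * x) is something you'd have to benchmark):
#include <Eigen/Dense>

// x^T A x = x^T D x + 2 x^T (L x), where D is the diagonal of A and
// L is its strictly lower triangular part (A assumed symmetric).
double xAx_via_split(const Eigen::MatrixXd& A, const Eigen::VectorXd& x)
{
    const double diag_part  = A.diagonal().cwiseProduct(x).dot(x);                  // sum of d[k]*x[k]^2
    const double lower_part = x.dot(A.triangularView<Eigen::StrictlyLower>() * x);  // x^T (L x)
    return diag_part + 2.0 * lower_part;
}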
For small matrices, writing the for loop myself seems to be faster than relying on Eigen's code. For large matrices I got good results using .selfadjointView:
double xAx_symmetric(const Eigen::MatrixXd& A, const Eigen::VectorXd& x)
{
    const auto dim = A.rows();
    if (dim < 15) {
        double sum = 0;
        for (Eigen::Index i = 0; i < dim; ++i) {
            const auto x_i = x[i];
            sum += A(i, i) * x_i * x_i;
            for (Eigen::Index j = 0; j < i; ++j) {
                sum += 2 * A(j, i) * x_i * x[j];
            }
        }
        return sum;
    }
    else {
        return x.transpose() * A.selfadjointView<Eigen::Upper>() * x;
    }
}

Amdahl's Law: matrix multiplication

I'm trying to calculate the fraction P of my code which can be parallelized, to apply Amdahl's Law and observe the theoretical maximum speedup.
My code spends most of its time on multiplying matrices (using the library Eigen). Should I consider this part entirely parallelizable?
If your matrices are large enough, let's say larger than about 60x60, then you can compile with OpenMP enabled (e.g., -fopenmp with gcc) and the products will be parallelized for you. However, it is often better to parallelize at the highest level possible, especially if the matrices are not very large. Then it depends on whether you can identify independent tasks in your algorithm.
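A minimal sketch of that setup, assuming Eigen 3 built with -fopenmp (the sizes and variable names are just for illustration):
#include <Eigen/Dense>
#include <iostream>

int main()
{
    // With OpenMP enabled at compile time, Eigen parallelises large products
    // internally; Eigen::nbThreads() reports how many threads it will use.
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(1000, 1000);
    std::cout << "Eigen threads: " << Eigen::nbThreads() << "\n";
    Eigen::MatrixXd C = A * B;       // this product runs multi-threaded for large sizes
    std::cout << C.sum() << "\n";    // use the result so the product isn't optimised away
}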
First, it would be appropriate to consider how the Eigen library is handling the matrix multiplication.
Then, a matrix(mxn)-vector(nx1) multiplication without Eigen could be written like this:
void mxv(int m, int n, double* a, double* b, double* c)
{   // a = b x c
    int i, j;

    for (i = 0; i < m; i++)
    {
        a[i] = 0.0;
        for (j = 0; j < n; j++)
            a[i] += b[i*n + j] * c[j];
    }
}
As you can see, since no two iterations write the same element of the result vector a[], and since the order in which the values a[i] for i = 0...m-1 are calculated does not affect the correctness of the answer, these computations can be carried out independently over the index i.
A loop like the previous one is therefore entirely parallelizable, and using OpenMP on such loops is relatively straightforward.
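For example, here is a sketch of the same kernel with an OpenMP directive (compile with -fopenmp; the function name is my own). Each iteration over i writes a distinct a[i], so the iterations can safely run on different threads:
// Same matrix-vector product, with the outer loop distributed across threads.
void mxv_omp(int m, int n, double* a, double* b, double* c)
{   // a = b x c
    #pragma omp parallel for
    for (int i = 0; i < m; i++)
    {
        double sum = 0.0;                 // private per-iteration accumulator
        for (int j = 0; j < n; j++)
            sum += b[i*n + j] * c[j];
        a[i] = sum;
    }
}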

What are common approaches for implementing common white paper math idioms in code

I am looking for a resource that explains common math operations found in white papers to coders with a minimal math background, in terms of coding idioms -- for loops etc.
I frequently see the same kinds of symbols in different equations, and they often turn out to describe easily comprehensible algorithms. An overview of what the symbols mean would go a long way toward making academic papers more comprehensible.
The only ones I can think of that are not obvious (beyond arithmetic, trig functions, etc.) and have a direct equivalent in code are sum, Σ, and product, Π.
So something like Σ a[i] is:
sum = 0;
for (i = 0; i < len(a); ++i) sum += a[i];
Some related details: a subscript (a small index written below the line) is often the same as an array index (so the i in Σ a[i] might be written small, below and to the right of the a). Similarly, the range of the i value (here 0 to the length of a) may be given as two small numbers just to the right of the Σ (the start value, 0, at the bottom; the finish value, n, at the top).
The equivalent product, Π a[i], is:
product = 1;
for (i = 0; i < len(a); ++i) product *= a[i];
Update: in the comments xan suggests covering matrices too. Those get complicated, but at the simplest you might see something like:
a[i] = M[i][j] b[j]
(where it's much more likely that the i and j are subscripts, as described above). That has implied loops:
for (i = 0; i < len(a); ++i) {
    a[i] = 0;
    for (j = 0; j < len(b); ++j) a[i] += M[i][j] * b[j]
}
Worse, that will often be written simply as a = M b and you're expected to fill everything in yourself....
Update 2: the first equation in the paper you reference below is w(s[i],0) = alpha[d] * Size(s[i]). As far as I can see, that's nothing more than:
double Size(struct s) { ... }
double w(struct s, int x) {
    if (x == 0) return alpha[d] * Size(s);
    ...
}
The other terms are similarly fancy-looking but not actually complicated: just function calls and multiplications. Note that |...| is abs(...) and the "dot" is multiplication (I think).
I use this site all the time for complex mathematical operations translated to code. I never graduated high school.
http://www.wolframalpha.com/
"Common math operations" depends on the kinds of problems you're used to solving. They can range all the way from simple arithmetic (+, -, *, /) to calculus (integrals, summations, derivatives, partial differential equations, matricies, etc.)
What does "common" mean to you and your development team?

Optimise Floyd-Warshall for symmetric adjacency matrix

Is there an optimisation that lowers the constant factor of the runtime of Floyd-Warshall, if you are guaranteed to have a symmetric adjacency matrix?
After some thought I came up with:
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j <= i; ++j)
            dist[j][i] = dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j]);
Now of course we both need to show it's correct and faster.
Correctness is harder to prove, since it relies on the proof of the Floyd-Warshall algorithm, which is non-trivial. A pretty good proof is given here: Floyd-Warshall proof
The input matrix is symmetric. The rest of the proof uses a modified version of the Floyd-Warshall proof to show that the order of the calculations in the two inner loops doesn't matter and that the graph stays symmetric after each step. If we show both of these conditions are true, then both algorithms do the same thing.
Let's define dist[i][j][k] as the distance from i to j using only vertices from the set {0, ..., k} as intermediate vertices on the path from i to j.
The base case, dist[i][j][-1], is defined as the weight of the edge from i to j. If there is no edge between them, this weight is taken to be infinity.
Now using the same logic as used in the proof linked above:
dist[i][j][k] = min(dist[i][j][k-1], dist[i][k][k-1] + dist[k][j][k-1])
Now in the calculation of dist[i][k][k] (and similarly for dist[k][i][k]):
dist[i][k][k] = min(dist[i][k][k-1], dist[i][k][k-1] + dist[k][k][k-1])
Now since dist[k][k][k-1] cannot be negative (or we'd have a negative cycle in the graph), this means that dist[i][k][k] = dist[i][k][k-1]: if dist[k][k][k-1] = 0 then both arguments of the min() are the same, otherwise the first argument is chosen.
So now, because dist[i][k][k] = dist[i][k][k-1], when calculating dist[i][j][k] it doesn't matter if dist[i][k] or dist[k][j] already allow k in their paths. Since dist[i][j][k-1] is only used for the calculation of dist[i][j][k], dist[i][j] will stay dist[i][j][k-1] in the matrix until dist[i][j][k] is calculated. If i or j equals k then the above case applies.
Therefore, the order of the calculations doesn't matter.
Now we need to show that dist[i][j] = dist[j][i] after all steps of the algorithm.
We start out with a symmetric matrix, thus dist[a][b] = dist[b][a] for all a and b.
dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])
= min(dist[j][i], dist[k][i] + dist[j][k])
= min(dist[j][i], dist[j][k] + dist[k][i])
= dist[j][i]
Therefore the assignment is correct and maintains the invariant that dist[a][b] = dist[b][a], so dist[i][j] = dist[j][i] after all steps of the algorithm.
Therefore both algorithms yield the same, correct, result.
Speed is easier to argue. The innermost statement executes just over half as many times as it normally would, so the function is roughly twice as fast. It is slowed down slightly because it still performs the same number of assignments, but this doesn't matter much, as the min() is what takes up most of the time.
If you see anything wrong with my proof, however technical, feel free to point it out and I will attempt to fix it.
EDIT:
You can keep the speed-up and also save half the memory by changing the loops as follows:
for (int k = 0; k < N; ++k) {
    for (int i = 0; i < k; ++i)
        for (int j = 0; j <= i; ++j)
            dist[i][j] = min(dist[i][j], dist[k][i] + dist[k][j]);
    for (int i = k; i < N; ++i) {
        for (int j = 0; j < k; ++j)
            dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j]);
        for (int j = k; j <= i; ++j)
            dist[i][j] = min(dist[i][j], dist[i][k] + dist[j][k]);
    }
}
This splits up the loops of the optimised algorithm and mirrors every read of row/column k into the lower triangle, so it's still correct, it'll likely run at about the same speed, and it only ever touches the lower half of the matrix.
Thanks to Chris Elion for the idea.
(Using the notation in the pseudo-code in the Wikipedia article) I believe (but haven't tested) that if the edgeCost matrix is symmetric, then the path matrix will also be symmetric after each iteration. Thus you only need to update half of the entries at each iteration.
At a lower level, you only need to store half of the matrix (since d(i,j) = d(j,i)), so you can reduce the amount of memory used, and hopefully reduce the number of cache misses since you'll access the same data multiple times.
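A sketch of such packed triangular storage (my own illustration; the struct name and index math are assumptions, using 0-based indices and keeping only entries with j <= i):
#include <cstddef>
#include <utility>
#include <vector>

// Packed lower-triangular storage for a symmetric distance matrix:
// only entries with j <= i are kept, at offset i*(i+1)/2 + j.
struct PackedDist {
    int N;
    std::vector<double> d;                       // N*(N+1)/2 entries
    PackedDist(int n, double init) : N(n), d((std::size_t)n * (n + 1) / 2, init) {}
    double& at(int i, int j) {                   // at(i, j) aliases at(j, i)
        if (i < j) std::swap(i, j);              // always address the lower half
        return d[(std::size_t)i * (i + 1) / 2 + j];
    }
};

// Usage inside the halved Floyd-Warshall loops:
//   dist.at(i, j) = std::min(dist.at(i, j), dist.at(i, k) + dist.at(k, j));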

Resources