Can someone please explain Strassen's algorithm for matrix multiplication in an intuitive way? I've gone through (well, tried to go through) the explanation in the book and on the wiki, but it's not clicking upstairs. Any links on the web that use a lot of English rather than formal notation etc. would be helpful, too. Are there any analogies which might help me build this algorithm from scratch without having to memorize it?
Consider multiplying two 2x2 matrices, as follows:
[ A B ]   [ E F ]   [ AE+BG  AF+BH ]
[ C D ] * [ G H ] = [ CE+DG  CF+DH ]
The obvious way to compute the right side is just to do the 8 multiplies and 4 additions. But imagine multiplies are a lot more expensive than additions, so we want to reduce the number of multiplications if at all possible. Strassen uses a trick to compute the right hand side with one less multiply and a lot more additions (and some subtractions).
Here are the 7 multiplies:
M1 = (A + D) * (E + H) = AE + AH + DE + DH
M2 = (A + B) * H = AH + BH
M3 = (C + D) * E = CE + DE
M4 = A * (F - H) = AF - AH
M5 = D * (G - E) = DG - DE
M6 = (C - A) * (E + F) = CE + CF - AE - AF
M7 = (B - D) * (G + H) = BG + BH - DG - DH
So to compute AE+BG, start with M1+M7 (which gets us the AE and BG terms), then add/subtract some of the other Ms until AE+BG is all we are left with. Miraculously, the M's are chosen so that M1+M7-M2+M5 works. Same with the other 3 results required.
Now just realize this works not just for 2x2 matrices, but for any (even) sized matrices where the A..H are submatrices.
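As a rough illustration of the seven products applied recursively to blocks, here is a minimal Python sketch. It assumes square matrices stored as lists of lists whose size is a power of two. The helper names (`add`, `sub`, `split`, `strassen`) are made up for this example, and the combinations for the other three result blocks (M2+M4, M3+M5, M1-M3+M4+M6) are worked out from the expansions listed above rather than quoted from the answer.

```python
def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def split(M):
    """Split a 2k x 2k matrix into its four k x k blocks."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(X, Y):
    """Multiply two n x n matrices (n a power of two) with 7 recursive products."""
    if len(X) == 1:
        return [[X[0][0] * Y[0][0]]]
    A, B, C, D = split(X)
    E, F, G, H = split(Y)
    M1 = strassen(add(A, D), add(E, H))
    M2 = strassen(add(A, B), H)
    M3 = strassen(add(C, D), E)
    M4 = strassen(A, sub(F, H))
    M5 = strassen(D, sub(G, E))
    M6 = strassen(sub(C, A), add(E, F))
    M7 = strassen(sub(B, D), add(G, H))
    top_left = add(sub(add(M1, M7), M2), M5)      # AE + BG
    top_right = add(M2, M4)                       # AF + BH
    bottom_left = add(M3, M5)                     # CE + DG
    bottom_right = add(sub(add(M1, M4), M3), M6)  # CF + DH
    top = [l + r for l, r in zip(top_left, top_right)]
    bottom = [l + r for l, r in zip(bottom_left, bottom_right)]
    return top + bottom

print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # → [[19, 22], [43, 50]]
```

The order of the factors in each product is kept exactly as written above, which matters once A..H are matrices rather than numbers.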
In my opinion there are 3 ideas that you need to get:
1) You can split a matrix into blocks and operate on the resulting matrix of blocks as you would on a matrix of numbers. In particular, you can multiply two such block matrices (as long as the number of block columns in the first matches the number of block rows in the second, and the corresponding block sizes are compatible) and get the same result as you would when multiplying the original matrices of numbers.
2) The blocks necessary to express the result of 2x2 block matrix multiplication have enough common factors to allow computing them in fewer multiplications than the original formula implies. This is the trick described in Tony's answer.
3) Recursion.
Strassen algorithm is just an application of the above. To understand the analysis of its complexity, you need to read "Concrete Mathematics" by Ronald Graham, Donald Knuth, and Oren Patashnik or a similar book.
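For orientation, the recurrence behind the usual complexity claim is short. Assuming n is a power of two and counting scalar operations, each level performs 7 half-size multiplications plus a fixed number of additions of (n/2) x (n/2) blocks:

```latex
T(n) = 7\,T(n/2) + \Theta(n^2), \qquad T(1) = \Theta(1)
\;\Longrightarrow\;
T(n) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.81}\right)
```

With 8 recursive multiplications instead of 7, the same recurrence gives the familiar \Theta(n^3).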
Took a quick look at the Wikipedia and it appears to me that this algorithm is a slight reduction in the number of multiplications required by rearranging the equations.
Here's an analogy. How many multiplications in x*x + 5*x + 6? Two, right? How many multiplications in (x+2)(x+3)? One, right? But they're the same expression!
Note, I do not expect this to provide a deep understanding of the algorithm, just an intuitive way in which you can understand how the algorithm can possibly lead to an improvement in calculation complexity.
and thank you for the attention you're paying to my question :)
My question is about finding an (efficient enough) algorithm for finding orthogonal polynomials of a given weight function f.
I've tried simply applying the Gram-Schmidt algorithm, but it is not efficient enough: it requires O(n^2) integrals. My goal is to use this algorithm to find Hankel determinants of a function f, and a "direct" computation, which consists of simply computing the matrix and taking its determinant, requires only 2*n - 1 integrals.
But I want to use the theorem stating that the Hankel determinant of order n of f is given by a product over the leading coefficients of the first n orthogonal polynomials of f. The reason is that when n gets larger (say about 20), the Hankel determinant gets really big and my goal is to divide it by another big constant (for n = 20, the constant is of order 10^103). My idea is then to "dilute" the computation of the constant in the product of the leading coefficients.
I hope there is a O(n) algorithm to compute the n first orthogonal polynomials :) I've done some digging and found nothing in that direction for general function f (f can be any smooth function, actually).
EDIT: I'll clarify here what the objects I'm talking about are.
1) A Hankel determinant of order n is the determinant of a square matrix which is constant on the skew diagonals. Thus for example
a b c
b c d
c d e
is a Hankel matrix of size 3 by 3.
2) If you have a function f : R -> R, you can associate to f its "kth moment" which is defined as (I'll write it in tex) f_k := \int_{\mathbb{R}} f(x) x^k dx
With this, you can create a Hankel matrix A_n(f) whose entries are (A_n(f))_{ij} = f_{i+j-2}, that is, something like
f_0 f_1 f_2
f_1 f_2 f_3
f_2 f_3 f_4
With this in mind, it is easy to define the Hankel determinant of f which is simply
H_n(f) := det(A_n(f)). (Of course, it is understood that f has sufficient decay at infinity, this means that all the moments are well defined. A typical choice for f could be the gaussian f(x) = exp(-x^2), or any continuous function on a compact set of R...)
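To make the definition concrete, here is a small Python sketch that builds A_n(f) from a list of moments and takes its determinant exactly with rational arithmetic. It assumes the moments f_0, f_1, ... are already available as a list (computing them is the integration step and is not shown). The function names `hankel_matrix` and `determinant` are made up for this illustration.

```python
from fractions import Fraction

def hankel_matrix(moments, n):
    """n x n Hankel matrix with entry (i, j) = moments[i + j] (0-based indices)."""
    return [[Fraction(moments[i + j]) for j in range(n)] for i in range(n)]

def determinant(M):
    """Determinant by Gaussian elimination with exact Fraction arithmetic."""
    M = [row[:] for row in M]
    n = len(M)
    det = Fraction(1)
    for col in range(n):
        # Find a row with a non-zero entry in this column to use as pivot.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det  # a row swap flips the sign
        det *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return det

# Example weight: f(x) = 1 on [-1, 1], whose k-th moment is 2/(k+1) for even k, else 0.
moments = [Fraction(2, k + 1) if k % 2 == 0 else Fraction(0) for k in range(5)]
print(determinant(hankel_matrix(moments, 3)))  # → 32/135
```

Only the 2n - 1 moments f_0 .. f_{2n-2} are read, which matches the count of integrals mentioned above.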
3) What I call orthogonal polynomials of f is a set of polynomials (p_n) such that
\int_{\mathbb{R}} f(x) p_j(x) p_k(x) dx is 1 if j = k and 0 otherwise.
(They are called that because they form an orthonormal basis of the vector space of polynomials with respect to the scalar product
(p|q) = \int_{\mathbb{R}} f(x) p(x) q(x) dx.)
4) Now, it is basic linear algebra that from any basis of a vector space equipped with a scalar product, you can build an orthonormal basis thanks to the Gram-Schmidt algorithm. This is where the n^2 integrations come from. You start from the basis 1, x, x^2, ..., x^n. Then you need n(n-1) integrals for the family to be orthogonal, and you need n more in order to normalize them.
5) There is a theorem saying that if f : R -> R is a function having sufficient decay at infinity, then we have that its Hankel determinant H_n(f) is equal to
H_n(f) = \prod_{j = 0}^{n-1} \kappa_j^{-2}
where \kappa_j is the leading coefficient of the j+1th orthogonal polynomial of f.
Thank you for your answer!
(PS: I tagged octave because I work in Octave so, with a bit of luck (but I doubt it), there is a built-in function or an existing package that handles this kind of thing.)
Orthogonal polynomials obey a recurrence relation, which we can write as
P[n+1] = (X-a[n])*P[n] - b[n-1]*P[n-1]
P[0] = 1
P[1] = X-a[0]
and we can compute the a, b coefficients by
a[n] = <X*P[n]|P[n]> / c[n]
b[n-1] = c[n]/c[n-1]
where
c[n] = <P[n]|P[n]>
(Here < | > is your inner product).
However I cannot vouch for the stability of this process at large n.
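Here is a hedged Python sketch of that recurrence for monic polynomials. It assumes the inner product is evaluated from a list of moments, <p|q> = sum_i sum_j p[i] q[j] f_{i+j}, with polynomials stored as coefficient lists (lowest degree first). It uses exact fractions so the stability question does not arise in this toy example. The subtraction coefficient is taken as c[k]/c[k-1], which is what orthogonality of P[k+1] against P[k-1] gives. The names `inner` and `monic_orthogonal` are made up for this illustration.

```python
from fractions import Fraction

def inner(p, q, moments):
    """<p|q> for coefficient lists p, q, using moments[k] = integral of f(x) x^k."""
    return sum(pi * qj * moments[i + j]
               for i, pi in enumerate(p) for j, qj in enumerate(q))

def monic_orthogonal(moments, n):
    """Return (polys, c): monic orthogonal P[0..n-1] and c[k] = <P[k]|P[k]>."""
    polys = [[Fraction(1)]]
    c = [inner(polys[0], polys[0], moments)]
    for k in range(n - 1):
        p = polys[k]
        xp = [Fraction(0)] + p                     # coefficients of X * P[k]
        a = inner(xp, p, moments) / c[k]           # a[k] = <X P[k]|P[k]> / c[k]
        nxt = [xp[i] - a * (p[i] if i < len(p) else 0) for i in range(len(xp))]
        if k > 0:
            b = c[k] / c[k - 1]                    # coefficient of P[k-1]
            for i, coef in enumerate(polys[k - 1]):
                nxt[i] -= b * coef
        polys.append(nxt)
        c.append(inner(nxt, nxt, moments))
    return polys, c

# Example weight: f(x) = 1 on [-1, 1]; moments are 2/(k+1) for even k, else 0.
moments = [Fraction(2, k + 1) if k % 2 == 0 else Fraction(0) for k in range(5)]
polys, c = monic_orthogonal(moments, 3)
print(polys[2])  # coefficients of x^2 - 1/3, lowest degree first
```

Since the orthonormal polynomial of degree j is P[j]/sqrt(c[j]), its leading coefficient satisfies \kappa_j^{-2} = c[j], so the Hankel determinant of order n is the product c[0] * ... * c[n-1]. For this example the product of the three values is 2 * 2/3 * 8/45 = 32/135. With floating-point moments this moment-based inner product can become badly conditioned as n grows.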
I'm studying pattern recognition and I found an interesting algorithm that I'd like to understand better: the Expectation-Maximization (EM) algorithm. I don't have a strong background in probability and statistics. I've read some articles on how the algorithm works for normal (Gaussian) distributions, but I would like to start with a simple example to understand it better. I hope that the following example is suitable.
Assume we have a jar with balls of 3 colors: red, green and blue. The corresponding probabilities of drawing each colored ball are pr, pg, pb. Now, let's assume that we have the following parametrized model for the probabilities of drawing the different colours:
pr = 1/4
pg = 1/4 + p/4
pb = 1/2 - p/4
with p unknown parameter. Now assume that the man who is doing the experiment is actually colourblind and cannot discern the red from the green balls. He draws N balls, but only sees
m1 = nR + nG red/green balls and m2 = nB blue balls.
The question is, can the man still estimate the parameter p and, with that in hand, calculate his best guess for the number of red and green balls (obviously, he knows the number of blue balls)? I think that obviously he can, but what about EM? What do I have to consider?
Well, the general outline of the EM algorithm is that if you know the values of some of the parameters, then computing the MLE for the other parameters is very simple. The commonly-given example is mixture density estimation. If you know the mixture weights, then estimating the parameters for the individual densities is easy (M step). Then you go back a step: if you know the individual densities then you can estimate the mixture weights (E step). There isn't necessarily an EM algorithm for every problem, and even if there is one, it's not necessarily the most efficient algorithm. It is, however, usually simpler and therefore more convenient.
In the problem you stated, you can pretend that you know the numbers of red and green balls and then you can carry out ML estimation for p (M step). Then with the value of p you go back and estimate the numbers of red and green balls (E step). Without thinking about it too much, my guess is that you could reverse the roles of the parameters and still work it as an EM algorithm: you could pretend that you know p and carry out ML estimation for the numbers of balls, then go back and estimate p.
If you are still following, we can work out formulas for all this stuff.
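One way those formulas could look, as a minimal Python sketch. The E step uses the conditional probability that a red-or-green ball is green, pg / (pr + pg) = (1 + p) / (2 + p). The M step maximizes the complete-data log-likelihood nG*log(1/4 + p/4) + nB*log(1/2 - p/4), which gives p = (2*nG - nB) / (nG + nB). The function name `em_estimate_p`, the starting value p = 0 and the stopping tolerance are assumptions made for this illustration, and at least one ball is assumed to have been drawn.

```python
def em_estimate_p(m1, m2, p=0.0, iterations=200, tol=1e-12):
    """EM for pr = 1/4, pg = 1/4 + p/4, pb = 1/2 - p/4 when red and green are merged.

    m1 = observed red-or-green count, m2 = observed blue count.
    Returns (p, expected_red, expected_green).
    """
    for _ in range(iterations):
        # E step: expected number of green balls among the m1 red-or-green draws.
        n_green = m1 * (1 + p) / (2 + p)
        # M step: complete-data MLE of p, clamped to the valid range [-1, 2].
        new_p = (2 * n_green - m2) / (n_green + m2)
        new_p = max(-1.0, min(2.0, new_p))
        converged = abs(new_p - p) < tol
        p = new_p
        if converged:
            break
    n_green = m1 * (1 + p) / (2 + p)
    return p, m1 - n_green, n_green

p, n_red, n_green = em_estimate_p(60, 40)
print(round(p, 6), round(n_red, 3), round(n_green, 3))  # → 0.4 25.0 35.0
```

For this particular model the fixed point can also be found in closed form by maximizing the observed-data likelihood directly, so EM is more of a teaching device here than a necessity.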
When "p" is not known, you can go for maximum likelihood estimation (MLE).
First, from your descriptions, "p" has to be in [-1, 2] or the probabilities will not make sense.
You have two certain observations: nG + nR = m and nB = N - m (m = m1, N = m1 + m2)
The probability of this happening is N! / (m! (N - m)!) (1 - pb)^m pb^(N - m).
Ignoring the constant of N choose m, we will maximize the second term:
p* = argmax over p of (1 - pb)^m pb^(N - m)
The easy solution is that p* should make pb = (N - m) / N = 1 - m / N.
So 0.5 - 0.25 p* = 1 - m / N ==> p* = max(-1, -2 + 4 * m / N)
I am working on an algorithm to perform linear regression for one or more independent variables.
that is: (if I have m real world values and in the case of two independent variables a and b)
C + D*a1 + E* b1 = y1
C + D*a2 + E* b2 = y2
...
C + D*am + E* bm = ym
I would like to use the least squares solution to find best fitting straight line.
I will be using matrix notation, so
X * Beta = y, with least-squares solution Beta = (X^T X)^-1 X^T y
where Beta is the vector [C, D, E] whose values define the best-fit line.
Question
What is the best way to solve this formula? Should I compute the inverse of X^T X
or should I use the LU factorization/decomposition of the matrix? What is the performance of each on a large amount of data (i.e. a big value of m, possibly of the order of 10^8)?
EDIT
If the answer was to use Cholesky decomposition or QR decomposition, are there any implementation hints/ simple libraries to use.
I am coding in C/ C++.
Two straightforward approaches spring to mind for solving a dense overdetermined system Ax=b:
Form the normal equations A^T A x = A^T b, then Cholesky-factorise A^T A = L L^T, then do a forward solve and a back solve. This usually gets you an answer precise to about sqrt(machine epsilon).
Compute the QR factorisation A = Q*R, where Q's columns are orthonormal and R is square and upper-triangular, using something like Householder reflections. Then solve Rx = Q^T b for x by back-substitution. This usually gets you an answer precise to about machine epsilon, i.e. roughly twice as many correct digits as the Cholesky method, but it takes about twice as long.
For sparse systems, I'd usually prefer the Cholesky method because it takes better advantage of sparsity.
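As an illustration of the first approach, here is a hedged Python sketch of the normal-equations route for the regression in the question (the same structure carries over to C or C++). It assumes each observation arrives as a pair (features, y) and that a constant column of ones is prepended for the intercept. The functions `normal_equations` and `cholesky_solve` are made up for this example and have no pivoting or error handling.

```python
import math

def normal_equations(observations, n_features):
    """Accumulate G = X^T X and h = X^T y one observation at a time."""
    n = n_features + 1                      # +1 for the intercept column
    G = [[0.0] * n for _ in range(n)]
    h = [0.0] * n
    for features, y in observations:
        x = [1.0, *features]
        for i in range(n):
            h[i] += x[i] * y
            for j in range(n):
                G[i][j] += x[i] * x[j]
    return G, h

def cholesky_solve(G, h):
    """Solve G beta = h for symmetric positive definite G via G = L L^T."""
    n = len(h)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    z = [0.0] * n                           # forward solve: L z = h
    for i in range(n):
        z[i] = (h[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    beta = [0.0] * n                        # back solve: L^T beta = z
    for i in reversed(range(n)):
        beta[i] = (z[i] - sum(L[k][i] * beta[k] for k in range(i + 1, n))) / L[i][i]
    return beta

# Synthetic data generated from y = 1 + 2a + 3b, so beta should be close to [1, 2, 3].
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
data = [((a, b), 1 + 2 * a + 3 * b) for a, b in points]
G, h = normal_equations(data, 2)
print(cholesky_solve(G, h))
```

Because G is only (number of variables + 1) square, the m rows can be streamed through once without storing X, which is the main attraction of this route when m is very large. The loss of accuracy mentioned above still applies.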
Your X^TX matrix should have a Cholesky decomposition. I'd look into this decomposition before LU. It is faster: http://en.wikipedia.org/wiki/Cholesky_decomposition
Imagine that I'm a bakery trying to maximize the number of pies I can produce with my limited quantities of ingredients.
Each of the following pie recipes A, B, C, and D produce exactly 1 pie:
A = i + j + k
B = t + z
C = 2z
D = 2j + 2k
*The recipes always have linear form, like above.
I have the following ingredients:
4 of i
5 of z
4 of j
2 of k
1 of t
I want an algorithm to maximize my pie production given my limited amount of ingredients.
The optimal solution of these example inputs would yield me the following quantities of pies:
2 x A
1 x B
2 x C
0 x D
= a total of 5 pies
I can solve this easily enough by taking the maximal producer of all combinations, but the number
of combos becomes prohibitive as the quantities of ingredients increases. I feel like there must
be generalizations of this type of optimization problem, I just don't know where to start.
While I can only bake whole pies, I would still be interested in seeing a method which may produce non-integer results.
You can define the linear programming problem. I'll show the usage on the example, but it can of course be generalized to any data.
Denote your pies as your variables (x1 = A, x2 = B, ...) and the LP problem will be as follows:
maximize x1 + x2 + x3 + x4
s.t. x1 <= 4 (needed i's)
x1 + 2x4 <= 4 (needed j's)
x1 + 2x4 <= 2 (needed k's)
x2 <= 1 (needed t's)
x2 + 2x3 <= 5 (needed z's)
and x1,x2,x3,x4 >= 0
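To show that this formulation really yields the expected answer, here is a small textbook tableau simplex in Python, written with exact fractions. It is only a sketch under stated assumptions: it handles problems of the form "maximize c.x subject to A x <= b, x >= 0" with all b >= 0, uses Bland's rule to pick pivots, and is not meant to compete with a real LP solver. The function name `simplex_max` is made up for this illustration.

```python
from fractions import Fraction

def simplex_max(c, A, b):
    """Maximize c.x subject to A x <= b and x >= 0, assuming every b[i] >= 0."""
    m, n = len(A), len(c)
    # Constraint rows: original columns, slack columns, right-hand side.
    T = [[Fraction(v) for v in A[i]]
         + [Fraction(int(i == j)) for j in range(m)]
         + [Fraction(b[i])] for i in range(m)]
    # Objective row holds negated costs; its last entry is the objective value.
    T.append([Fraction(-v) for v in c] + [Fraction(0)] * (m + 1))
    basis = [n + i for i in range(m)]
    while True:
        # Bland's rule: enter the lowest-index column with a negative reduced cost.
        col = next((j for j in range(n + m) if T[-1][j] < 0), None)
        if col is None:
            break
        # Ratio test; ties are broken by the lowest basic-variable index.
        candidates = [(T[i][-1] / T[i][col], basis[i], i)
                      for i in range(m) if T[i][col] > 0]
        if not candidates:
            raise ValueError("problem is unbounded")
        row = min(candidates)[2]
        pivot = T[row][col]
        T[row] = [v / pivot for v in T[row]]
        for i in range(m + 1):
            if i != row and T[i][col] != 0:
                factor = T[i][col]
                T[i] = [vi - factor * vr for vi, vr in zip(T[i], T[row])]
        basis[row] = col
    x = [Fraction(0)] * n
    for i, var in enumerate(basis):
        if var < n:
            x[var] = T[i][-1]
    return x, T[-1][-1]

# Pie problem: columns are pies A, B, C, D; rows are ingredients i, j, k, t, z.
A = [[1, 0, 0, 0], [1, 0, 0, 2], [1, 0, 0, 2], [0, 1, 0, 0], [0, 1, 2, 0]]
b = [4, 4, 2, 1, 5]
x, total = simplex_max([1, 1, 1, 1], A, b)
print([int(v) for v in x], int(total))  # → [2, 1, 2, 0] 5
```

For this instance the LP optimum happens to be integral, so it is also the best whole-pie plan. In general the LP value is only an upper bound on the integer optimum.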
The fractional (real-valued) version of this problem is solvable in polynomial time, but integer linear programming is NP-hard (its decision version is NP-complete).
The integer problem is indeed NP-hard, because given an integer linear programming problem, you can reduce it to "maximize the number of pies" using the same approach, where each constraint is an ingredient in the pie and the variables are the numbers of pies.
For the integer problem, there are many approximation techniques in the literature if you can accept a solution that is "close up to a certain bound" (for example, the local ratio technique or primal-dual methods are often used). If you need an exact solution, an exponential-time algorithm is probably your best shot (unless, of course, P = NP).
Since all your functions are linear, it sounds like you're looking for either linear programming (if continuous values are acceptable) or integer programming (if you require your variables to be integers).
Linear programming is a standard technique, and is efficiently solvable. A traditional algorithm for doing this is the simplex method.
Integer programming is intractable in general, because adding integral constraints allows it to describe intractable combinatorial problems. There seems to be a large number of approximation techniques (for example, you might try just using regular linear programming to see what that gets you), but of course they depend on the specific nature of your problem.
How to find the first perfect square from the function: f(n) = An² + Bn + C? B and C are given. A, B, C and n are always integers, and A is always 1. The problem is finding n.
Example: A=1, B=2182, C=3248
The answer for the first perfect square is n=16, because sqrt(f(16))=196.
My algorithm increments n and tests whether the square root is an integer.
This algorithm is very slow when B or C is large, because it takes n calculations to find the answer.
Is there a faster way to do this calculation? Is there a simple formula that can produce an answer?
What you are looking for are integer solutions to a special case of the general quadratic Diophantine equation [1]
Ax^2 + Bxy + Cy^2 + Dx + Ey + F = 0
where you have
ax^2 + bx + c = y^2
so that A = a, B = 0, C = -1, D = b, E = 0, F = c where a, b, c are known integers and you are looking for unknown x and y that satisfy this equation. Once you recognize this, solutions to this general problem are in abundance. Mathematica can do it (use Reduce[eqn && Element[x|y, Integers], x, y]) and you can even find one implementation here including source code and an explanation of the method of solution.
[1] You might recognize this as a conic section. It is, and people have been studying them for thousands of years. As such, our understanding of them is very deep and your problem is actually quite famous. The study of them is an immensely deep and still active area of mathematics.
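For the special case A = 1 from the question, there is an elementary shortcut that avoids a general Diophantine solver. This is a swapped-in technique, not the method the links above describe. Multiplying n² + Bn + C = y² by 4 and completing the square gives (2n + B - 2y)(2n + B + 2y) = B² - 4C, so every solution corresponds to a factor pair (u, v) of D = B² - 4C with n = ((u + v)/2 - B)/2. The Python sketch below assumes "first" means the smallest n >= 0. The function names are made up for this illustration, and a brute-force search using exact integer square roots is included for comparison.

```python
import math

def is_square(v):
    """True if v is a perfect square (exact integer arithmetic, no floats)."""
    return v >= 0 and math.isqrt(v) ** 2 == v

def first_square_brute(B, C, limit=10**6):
    """Smallest n in [0, limit) with n*n + B*n + C a perfect square, else None."""
    for n in range(limit):
        if is_square(n * n + B * n + C):
            return n
    return None

def first_square_factoring(B, C):
    """Smallest n >= 0 with n*n + B*n + C a perfect square, via factor pairs of D."""
    D = B * B - 4 * C
    if D == 0:
        return 0  # f(n) = (n + B/2)^2 is a perfect square for every n
    best = None
    for d in range(1, math.isqrt(abs(D)) + 1):
        if D % d:
            continue
        e = D // d  # carries the sign of D
        for u, v in ((d, e), (-d, -e)):
            s = u + v  # equals 2 * (2n + B) for a genuine solution
            if s % 2 == 0 and (s // 2 - B) % 2 == 0:
                n = (s // 2 - B) // 2
                if n >= 0 and is_square(n * n + B * n + C):
                    if best is None or n < best:
                        best = n
    return best

print(first_square_factoring(2182, 3248))  # → 16
```

Trial division over D costs about sqrt(|B² - 4C|) steps, so this only beats the incremental search when the first solution lies far out. A faster integer factorization routine would improve on that.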