I need to compute the following matrices:
M = XSX^T
and
V = XSy
What I'd like to know is the most efficient implementation using BLAS, given that S is a symmetric positive definite matrix of dimension n, X has m rows and n columns, and y is a vector of length n.
My implementation is the following:
I compute A = XS using dsymm, then obtain M = AX^T with dgemm and V = Ay with dgemv.
I think that at least M can be computed more efficiently, since I know that M is symmetric positive definite.
Your code is the best BLAS can do for you. There is no BLAS operation that can exploit the fact that M is symmetric.
You are right, though: technically you'd only need to compute the upper triangular part of the gemm product and then copy its strictly upper triangular part into the lower triangular part. But there is no routine for that.
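To make the "compute the upper part, then copy" idea concrete, here is a minimal pure-Python sketch. It is only an illustration of the loop structure, not a performance claim and not a BLAS call; the helper names `matmul` and `symmetric_product` and the small test matrices are made up for the example.

```python
def matmul(P, Q):
    """Plain dense product P * Q for matrices stored as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def symmetric_product(A, X):
    """Return M = A * X^T, assuming the caller knows M is symmetric."""
    m = len(A)
    M = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):  # upper triangle (including the diagonal) only
            M[i][j] = sum(a * x for a, x in zip(A[i], X[j]))
            M[j][i] = M[i][j]  # mirror into the lower triangle
    return M


# Hypothetical data: m = 3 rows, n = 2 columns
X = [[1, 2], [3, 4], [5, 6]]
S = [[2, 1], [1, 3]]         # symmetric positive definite
A = matmul(X, S)             # plays the role of the dsymm step
M = symmetric_product(A, X)  # roughly half the multiplications of a full product
```

The inner products are evaluated for m(m+1)/2 entries instead of m^2, which is where the potential saving would come from.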
May I ask about the sizes? Let me also suggest some other sources of performance gains: building your BLAS implementation yourself, and comparing MKL, ACML, OpenBLAS and ATLAS. You could obviously code your own version using AVX and FMA intrinsics; you should be able to do better than a generalised library. Also, what is the precision of your floating-point variables?
I seriously doubt that you would gain much by coding it yourself anyway. What I would definitely suggest is converting everything to floats and testing whether single precision gives you the same result with a significant speed-up in compute time. I have very seldom seen cases where it does not; those were mostly in ODE solving and numerical integration of nasty functions.
But you did not address my question regarding the BLAS implementation and machine type.
Again, optimisation beyond this point is not possible without more skills :(. But seriously, don't be too worried about this. There is a reason why BLAS does not do the optimisation you ask for: it might not be worth the hassle. Go with your solution.
And don't forget to investigate the use of floats rather than doubles. In R, convert everything to float. For the Lapack commands use only sgemX.
Without knowing the details of your problem: it can be useful to recognize the zeros in the matrices. Partitioning the matrices to achieve this can provide significant benefits. Is M the sum of many XSX' submatrices?
For V = XSy, where y is a vector and X and S are matrices, calculating S.y then X.(Sy) should be better, unless X.S is a necessary calculation for M.
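A small sketch of why the order of the products matters for V. The multiplication counts are the usual dense estimates, and the helper names and data are made up for illustration. In BLAS terms the second ordering would correspond to a symmetric matrix-vector product (dsymv) followed by dgemv, assuming XS is not needed anyway for M.

```python
def matvec(P, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def matmul(P, Q):
    """Dense matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def mults_matrix_first(m, n):
    """Multiplications for (X S) y: an m x n by n x n product, then a mat-vec."""
    return m * n * n + m * n


def mults_vector_first(m, n):
    """Multiplications for X (S y): two matrix-vector products."""
    return n * n + m * n


# Hypothetical data: m = 3, n = 2
X = [[1, 2], [3, 4], [5, 6]]
S = [[2, 1], [1, 3]]
y = [1, -1]

V_matrix_first = matvec(matmul(X, S), y)   # (X S) y
V_vector_first = matvec(X, matvec(S, y))   # X (S y)
```

Both orderings give the same V; the vector-first form only pays off when A = XS is not already being formed for M.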
I'm trying to solve the same ODE simultaneously at different points (each point is an independent vector of shape m) using the scipy BDF solver. In other words, I have an n x m matrix and I want to solve n points (by solving, I mean advancing them in time with a while loop), knowing that the n points are independent of each other.
Obviously you can loop on the different points, but this method takes too much time. Is there any way to make this faster and use it as a vectorized function?
I also tried to reshape my matrix into a 1D vector, but it looks like the solver computes the Jacobian matrix of the complete vector, which takes too much time and is useless since the points along n are independent.
Maybe there is a way to specify that the derivatives of points n-m are zeros to speed up the Jacobian computation?
Thanks in advance for the answer
Edit:
Thanks for your answer @Lutz Lehmann. I was able to speed up the computation a little using jac_sparsity, which avoids computing a lot of unnecessary entries.
The other improvement I can imagine concerns the step size h_abs: each independent ODE should have its own h_abs. Using the 1D vector method implies that all the ODEs advance with the same step size h_abs, i.e. the most restrictive one. I don't know if there is any way of doing this.
I am already using a vectorized atol built as an n x m matrix and reshaped the same way as the complete set of ODEs, to make sure that the right tolerances are applied to each variable. I've never used numba so far, but I will definitely have a look.
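For reference, a minimal sketch of what the Jacobian sparsity pattern looks like for independent points, assuming the n x m state is flattened row by row (so variable j of point i sits at index i*m + j). The function name is made up; the pattern is built with plain lists here so the example has no dependencies.

```python
def block_diagonal_sparsity(n_points, m_vars):
    """Return an (n*m) x (n*m) 0/1 pattern with dense m x m diagonal blocks.

    Entry (r, c) is 1 only when indices r and c belong to the same point,
    i.e. when r // m_vars == c // m_vars (row-major flattening assumed).
    """
    size = n_points * m_vars
    return [
        [1 if r // m_vars == c // m_vars else 0 for c in range(size)]
        for r in range(size)
    ]


pattern = block_diagonal_sparsity(2, 2)
nonzeros = sum(map(sum, pattern))  # n * m * m entries instead of (n * m) ** 2
```

A pattern like this is what scipy's solve_ivp accepts through its jac_sparsity argument for the BDF method; for large n a scipy.sparse matrix would presumably be a better container than nested lists.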
I am reading the book "Introduction to linear algebra" by Gilbert Strang. The section is called "Orthonormal Bases and Gram-Schmidt". The author emphasises several times that with an orthonormal basis it is very easy and fast to calculate the least squares solution, since Qᵀ*Q = I, where Q is a design matrix with orthonormal columns. So your equation becomes x̂ = Qᵀb.
And I got the impression that it's a good idea to always calculate the QR decomposition before applying least squares. But later I worked out the time complexity of the QR decomposition, and it turned out that calculating the QR decomposition and then applying least squares is more expensive than the regular x̂ = inv(AᵀA)Aᵀb.
Is it right that there is no point in using the QR decomposition to speed up least squares? Or maybe I got something wrong?
So the only purpose of QR decomposition regarding Least Squares is numerical stability?
There are many ways to do least squares; typically these vary in applicability, accuracy and speed.
Perhaps the Rolls-Royce method is to use SVD. This can be used to solve under-determined (fewer obs than states) and singular systems (where A'*A is not invertible) and is very accurate. It is also the slowest.
QR can only be used to solve non-singular systems (that is we must have A'*A invertible, ie A must be of full rank), and though perhaps not as accurate as SVD is also a good deal faster.
The normal equations ie
compute P = A'*A
solve P*x = A'*b
is the fastest (perhaps by a large margin if P can be computed efficiently, for example if A is sparse) but is also the least accurate. This too can only be used to solve non singular systems.
Inaccuracy should not be taken lightly nor dismissed as some academic fanciness. If you happen to know that the problems you will be solving are nicely behaved, then it might well be fine to use an inaccurate method. But otherwise the inaccurate routine might well fail (ie say there is no solution when there is, or worse come up with a totally bogus answer).
I'm a bit confused that you seem to be suggesting forming and solving the normal equations after performing the QR decomposition. The usual way to use QR in least squares is, if A is nObs x nStates:
decompose A as A = Q*[R; 0]
(here R is upper triangular, stacked on top of a block of zeros)
transform b into b~ = Q'*b
solve R * x = b# for x
(here b# is the first nStates entries of b~)
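The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses modified Gram-Schmidt to get a thin QR (so Q has nStates columns and Q'*b is already b#), whereas library routines typically use Householder reflections, and it assumes A has full column rank. The function names and the small data set are made up.

```python
def qr_gram_schmidt(A):
    """Thin QR by modified Gram-Schmidt; returns (columns of Q, R)."""
    n_obs, n_states = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n_obs)] for j in range(n_states)]
    R = [[0.0] * n_states for _ in range(n_states)]
    q_cols = []
    for j in range(n_states):
        v = cols[j][:]
        for k, q in enumerate(q_cols):
            R[k][j] = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - R[k][j] * qi for vi, qi in zip(v, q)]
        R[j][j] = sum(vi * vi for vi in v) ** 0.5  # nonzero if A has full rank
        q_cols.append([vi / R[j][j] for vi in v])
    return q_cols, R


def lstsq_qr(A, b):
    """Least squares via QR: form b# = Q'*b, then back-substitute R*x = b#."""
    q_cols, R = qr_gram_schmidt(A)
    b_hash = [sum(qi * bi for qi, bi in zip(q, b)) for q in q_cols]
    n = len(R)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b_hash[i] - s) / R[i][i]
    return x


# Fit a line c0 + c1*t through (0, 1), (1, 2), (2, 4)
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 4.0]
x = lstsq_qr(A, b)  # approximately [5/6, 3/2]
```

Note that A'*A is never formed, which is the point of the comparison with the normal equations.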
I stumbled upon something, which I consider very strange.
As an example consider the code
A = reshape(1:6, 3,2)
A/[1 1]
which gives
3×1 Array{Float64,2}:
2.5
3.5
4.5
As I understand, in general such division gives the weighted average of columns, where each weight is inversely proportional to the corresponding element of the vector.
So my question is, why is it defined in such a way?
What is the mathematical justification of this definition?
It's the minimum error solution to |A - v*[1 1]|₂ – which, being overconstrained, has no exact solution in general (i.e. value v such that the norm is precisely zero). The behavior of / and \ is heavily overloaded, solving both under and overconstrained systems by a variety of techniques and heuristics. Whether this kind of overloading is a good idea or not is debatable, but it's what people have come to expect from these operations in Matlab and Octave, and it's often quite convenient to have so much functionality available in a single operator.
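For a single-row right-hand side b, the minimum-error solution has a closed form, v = A*b' / (b*b'), which can be checked by hand. The sketch below is plain Python rather than Julia, and the function name is made up for illustration.

```python
def right_divide_by_row(A, b):
    """Least-squares v minimising the error of A - v*b for a row vector b.

    Each row of A is handled independently: v_i = (A_i . b) / (b . b).
    """
    bb = sum(x * x for x in b)
    return [sum(a * x for a, x in zip(row, b)) / bb for row in A]


# Same entries as reshape(1:6, 3, 2), which fills column by column
A = [[1, 4], [2, 5], [3, 6]]

v_ones = right_divide_by_row(A, [1, 1])   # [2.5, 3.5, 4.5], as in the question
v_other = right_divide_by_row(A, [1, 2])  # first entry is (1*1 + 4*2) / 5
```

With b = [1 1] this reduces to the plain average of the columns. For a general b the weights are b_j / sum(b.^2), so they grow with the entries of b rather than being inversely proportional to them.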
Let A be an NxN matrix and b be a Nx1 column vector. Then \ solves Ax=b, and / solves xA=b.
As Stefan mentions, this is extended to overdetermined cases as the least squares solution. This is done via the QR or SVD decompositions. See the details on these algorithms to see why this is the case. Hint: the OLS estimator is linear and can be computed directly from these matrix decompositions, so it's the same thing.
Now you might ask, how does it actually solve it? That's a complicated question. Essentially, it uses a matrix factorization. But which matrix factorization is used depends on the matrix type. The reason is that Gaussian elimination is O(n^3), so treating the problem generally is usually not good. But whenever you can specialize, you can get speedups. So essentially \ (and /, which transposes and calls \) check for a bunch of special types and pick a factorization or other algorithm (LU, QR, SVD, Cholesky, etc.) based on the matrix type. The flow chart from MATLAB explains this very well. There are a lot of details here, and there are even more when the matrix is sparse. Also IterativeSolvers.jl should be mentioned because it's another set of algorithms for solving Ax=b.
Most applied math problems reduce down to linear algebra, with solving Ax=b being one of the most important and difficult problems, which is why there is tons of research on the subject. In fact, you can probably say that the vast majority of the field of numerical linear algebra is devoted to finding fast methods for solving Ax=b on specific matrix types. \ essentially puts all of the direct (non-iterative) methods into one convenient operator.
I have a piece of code in Fortran 90 in which I have to solve both a non-linear system (for which I have to invert the Jacobian matrix) and a linear system of equations. Both systems are small: n unknowns for both operations, with n<=4. Unfortunately, n is not known a priori. What do you think is the best option?
I thought of writing explicit formulas for cases with n=1,2 and using other methods for n=3,4 (e.g. some functions of the Intel MKL libraries), for the sake of performance. Is this sensible or should I write explicit formulas for the inverse matrix also for n=3,4?
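A language-neutral sketch of that dispatch idea, written in Python rather than Fortran 90 for brevity. The explicit branches use the standard 1x1 and 2x2 closed forms, and the fallback is a plain Gaussian elimination with partial pivoting standing in for a library call. The function names are hypothetical and no singularity checks are done.

```python
def gauss_solve(A, b):
    """Generic fallback: Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def solve_small(A, b):
    """Solve A x = b, with explicit formulas for n = 1 and n = 2."""
    n = len(b)
    if n == 1:
        return [b[0] / A[0][0]]
    if n == 2:
        (a11, a12), (a21, a22) = A
        det = a11 * a22 - a12 * a21
        return [(a22 * b[0] - a12 * b[1]) / det,
                (a11 * b[1] - a21 * b[0]) / det]
    return gauss_solve(A, b)  # n = 3, 4: generic routine


x2 = solve_small([[2, 1], [1, 3]], [3, 5])                          # about [0.8, 1.4]
x3 = solve_small([[1, 1, 1], [0, 2, 5], [2, 5, -1]], [6, -4, 27])   # about [5, 3, -2]
```

In Fortran the same structure would naturally be a select case on n; whether explicit formulas for n=3,4 are worth it is best settled by timing both variants.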
I want to find a fast algorithm for computing 1/d, where d is a double (although it can be converted to an integer). Which of the many algorithms (SRT, Goldschmidt, Newton-Raphson, ...) is the best? I'm writing my program in C.
Thanks in advance.
The fastest program is: double result = 1 / d;
CPUs already use an iterative root-finding algorithm like the ones you describe to find the reciprocal 1/d. So you should find it difficult to beat it using a software implementation of the same algorithm.
If you have few/known denominators then try a lookup table. This is the usual approach for even slower functions such as trig functions.
Otherwise: just compute 1/d. It will be the fastest you can do. And there is an endless list of things you can do to speed up arithmetic if you have to:
Use 32-bit (single) instead of 64-bit (double) precision. FP division takes a number of cycles roughly proportional to the number of bits.
Vectorize the operations. For example, I believe you can compute four 32-bit float divisions in parallel with SSE2, or even more in parallel by doing it on the GPU.
I asked someone else about this and got the following answer:
So, you can't add a hardware divider to the FPGA then? Or fast reciprocal support?
Anyway it depends. Does it have fast multiplication? If not, well, that's a problem, you could only implement the slow methods then.
If you have fast multiplication and IEEE floats, you can use the weird trick I linked to in my previous post with a couple of refinement steps. That's really just Newton–Raphson division with a simpler calculation for the initial approximation (but afaik it still only takes 3 refinements for single-precision floats, just like the regular initial approximation). Fast reciprocal support works that way too: it gives a fast initial approximation (handling the exponent correctly and getting the significant bits from a lookup table; with 12 significant bits you need only one refinement step for single precision, and 13 bits are enough to get by with 2 steps for double precision) and optionally provides instructions that help implement the refinement step (like AMD's PFRCPIT1 and PFRCPIT2), for example to calculate Y = (1 - D*X) and then X + X * Y.
Even without those tricks Newton–Raphson division is still not bad: with the linear approximation it takes only 4 refinements for double-precision floats, but it also takes some annoying exponent adjustments to get into the right range first (in hardware that wouldn't be half as annoying).
Goldschmidt division is, afaik, roughly equivalent in performance and might have a slightly less complex implementation. It's really the same sort of deal - trickery with the exponent to get in the right range, the "2 - something" estimation trick (which is rearranged in Newton-Raphson division, but it's really the same thing), and doing the refinement step until all the bits are right. It just looks a little different.
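As an illustration of the Newton–Raphson variant described above, here is a minimal sketch in Python (the original question is about C, where frexp and ldexp from math.h play the same role). It assumes the commonly quoted linear initial approximation 48/17 - 32/17 * D for D in [0.5, 1) and uses the Y = 1 - D*X, X + X*Y refinement with 4 steps. The function name is made up, and zero, infinities, NaN and extreme exponents are not handled properly.

```python
import math


def reciprocal_newton(d, steps=4):
    """Approximate 1/d with Newton-Raphson refinement (no division on d)."""
    if d == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    # Exponent adjustment: abs(d) = mant * 2**exp with 0.5 <= mant < 1
    mant, exp = math.frexp(abs(d))
    # Linear initial approximation of 1/mant (constants are precomputable)
    x = 48.0 / 17.0 - (32.0 / 17.0) * mant
    for _ in range(steps):
        y = 1.0 - mant * x  # residual Y = 1 - D*X
        x = x + x * y       # refinement X + X*Y
    # Undo the scaling and restore the sign
    return math.copysign(math.ldexp(x, -exp), d)


r = reciprocal_newton(3.0)
```

Each step roughly squares the relative error of the estimate, which is why a handful of iterations is enough once the initial approximation is reasonable.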