Cholesky Decomposition for a large number of small matrices using OpenCL

My question is similar to what was asked here: Performing many small matrix operations in parallel in OpenCL, except that I want to do Cholesky decomposition. Also, my matrices will range from 15 x 15 to 100 x 100, and I will have up to 100,000 of them. All of the matrices have the same dimensions. The decomposed matrices will be used further within the GPU.
This paper http://icl.cs.utk.edu/news_pub/submissions/haidar_iccp.pdf discusses the algorithms at a high level. They use the term batched Cholesky for such a problem (a large number of small matrices). Their approach is to implement a batched version of every step involved in a Cholesky decomposition.
So I want to start with a batched matrix multiplication (since it is one of the steps in a blocked Cholesky decomposition). For large matrices, matrix multiplication is done in a blocked manner on a GPU. My question: would that approach be suitable for the kind of problem I have? Any other suggestions would be helpful; I am a little unsure how to approach this.
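For reference, the per-matrix work that such a batched kernel has to do is just the unblocked Cholesky-Banachiewicz recurrence. Below is a minimal C++ sketch of that single-matrix routine (the function name and the row-major in-place layout are illustrative assumptions, not anything from the paper); in an OpenCL version, one work-item or work-group would run this on its own matrix taken from a large contiguous batch buffer:

    #include <cmath>
    #include <cstddef>

    // Unblocked Cholesky-Banachiewicz factorization of one small SPD matrix.
    // A is n x n, row-major; the lower triangle is overwritten with L (A = L * L^T).
    // Returns false if a non-positive pivot is encountered (matrix not SPD).
    bool cholesky_inplace(double* A, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j <= i; ++j) {
                double sum = A[i * n + j];
                for (std::size_t k = 0; k < j; ++k)
                    sum -= A[i * n + k] * A[j * n + k];   // L[i][k] * L[j][k]
                if (i == j) {
                    if (sum <= 0.0) return false;         // not positive definite
                    A[i * n + i] = std::sqrt(sum);
                } else {
                    A[i * n + j] = sum / A[j * n + j];    // divide by diagonal of L
                }
            }
        }
        return true;
    }

For 15 x 15 matrices this unblocked form may already be enough; the blocked, GEMM-based formulation from the paper becomes interesting towards the 100 x 100 end, where a whole work-group can cooperate on the trailing-submatrix update.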

Related

Precise matrix inversion in Q

Given an invertible matrix M over the rationals Q, the inverse matrix M^(-1) is again a matrix over Q. Are there (efficient) libraries to compute the inverse precisely?
I am aware of high-performance linear algebra libraries such as BLAS/LAPACK, but these libraries are based on floating point arithmetic and are thus not suitable for computing precise (analytical) solutions.
Motivation: I want to compute the absorption probabilities of a large absorbing Markov chain using its fundamental matrix. I would like to do so precisely.
Details: By large, I mean a 1000x1000 matrix in the best case, and a several million dimensional matrix in the worst case. The further I can scale things the better. (I realize that the worst case is likely far out of reach.)
You can use the Eigen matrix library, which with little effort works on arbitrary scalar types. There is an example in the documentation of how to use it with GMP's mpq_class: http://eigen.tuxfamily.org/dox/TopicCustomizing_CustomScalar.html
Of course, as @btilly noted, most of the time you should not calculate the inverse but calculate a matrix decomposition and use that to solve equation systems. For rational numbers you can use any LU decomposition, or if the matrix is symmetric, the LDLt decomposition. See here for a catalogue of decompositions.
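A minimal sketch of that approach, assuming the NumTraits<mpq_class> specialization from the linked Eigen documentation page is already in place (the matrix entries below are arbitrary examples):

    #include <iostream>
    #include <gmpxx.h>
    #include <Eigen/Dense>

    // Assumes the Eigen::NumTraits<mpq_class> specialization from the linked
    // documentation page has been defined before this point.

    int main() {
        typedef Eigen::Matrix<mpq_class, Eigen::Dynamic, Eigen::Dynamic> MatQ;
        typedef Eigen::Matrix<mpq_class, Eigen::Dynamic, 1> VecQ;

        MatQ M(2, 2);
        M << mpq_class(1, 3), mpq_class(1, 2),
             mpq_class(2, 5), mpq_class(3, 7);
        VecQ b(2);
        b << mpq_class(1), mpq_class(2);

        // Exact rational solve of M x = b via LU with full pivoting:
        // no floating point is involved, only mpq_class arithmetic.
        VecQ x = M.fullPivLu().solve(b);
        std::cout << x << std::endl;

        // The exact inverse, if you really need it:
        MatQ Minv = M.fullPivLu().inverse();
        std::cout << Minv << std::endl;
        return 0;
    }

fullPivLu() only needs field operations and exact comparisons, so with rational scalars there is no rounding anywhere; the cost is that the entries can grow large, which is the real limit on how far this scales.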

Why cuSparse is much slower than cuBlas for sparse matrix multiplication

Recently, when I used cuSparse and cuBLAS in CUDA TOOLKIT 6.5 to do sparse matrix multiplication, I found that cuSPARSE is much slower than cuBLAS in all cases!
In all my experiments, I used cusparseScsrmm in cuSparse and cublasSgemm in cuBLAS. In the sparse matrices, half of the total elements are zero. The GPU I used is an NVIDIA Titan Black. All timings were obtained through the nvvp tool provided by NVIDIA. Below are some of the results:
Experiment A:
sparse matrix size: 192x2400
dense matrix size: 2400x256
cusparse time: 1.4ms
cublas time: 0.21ms
Experiment B:
sparse matrix size: 192x75
dense matrix size: 75x1024
cusparse time: 0.27ms
cublas time: 0.04ms
So, it is very odd to see the results listed above. Since cuSPARSE is designed specifically to handle sparse matrix manipulation, how could it be even slower than cuBLAS!? If that is the case, then there is no need to use cuSPARSE at all. Could you please offer any explanation for these results? Also, could you suggest any other ways to speed up sparse matrix multiplication?
I don't think you can classify a matrix with half zeros as "sparse": the timings you have found are reasonable (actually, the sparse algorithm is behaving pretty well!).
Sparse algorithms are efficient only for matrices where most of the elements are zeros (for example, matrices coming from finite element problems).
This holds true for CPUs, not only for GPUs: there is a significant overhead in treating the matrix as sparse, and it becomes convenient to use sparse algorithms only when... most of the elements are zeros (typically ten or fewer non-zeros per row, for matrices of rank in the thousands to hundreds of thousands, or even millions).
There are other matrix shapes that have efficient solution algorithms that you can try if they apply to your problem, e.g. band matrices. I don't know whether they have been ported to cuBLAS, though.
About the overheads
Dense linear algebra algorithms can perform optimally because processors have been designed to solve such systems as efficiently as possible. Consider the DGEMM operation (matrix-matrix multiply): it is an operation that lets you use the processor at more than 95% of its theoretical peak floating point performance for large matrices (i.e., matrices not fitting in any cache of the system). How? (A cache-blocked sketch follows the list below.)
prefetching
optimal cache usage
vectorization (SSE, AVX)
pipelining
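To make "optimal cache usage" concrete, here is a deliberately simplified sketch of the cache blocking a GEMM implementation builds on (the block size and names are illustrative; real BLAS kernels add packing, register blocking and vectorized micro-kernels on top of this):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Blocked C += A * B for row-major n x n matrices.
    // Each block triple is small enough to stay in cache while it is reused,
    // which is what lets dense GEMM approach peak FLOP rates.
    // The block size 64 is an illustrative guess; real libraries tune it per CPU.
    void gemm_blocked(const std::vector<double>& A,
                      const std::vector<double>& B,
                      std::vector<double>& C,
                      std::size_t n, std::size_t bs = 64) {
        for (std::size_t bi = 0; bi < n; bi += bs)
            for (std::size_t bk = 0; bk < n; bk += bs)
                for (std::size_t bj = 0; bj < n; bj += bs)
                    // Multiply one cached block of A by one cached block of B.
                    for (std::size_t i = bi; i < std::min(bi + bs, n); ++i)
                        for (std::size_t k = bk; k < std::min(bk + bs, n); ++k) {
                            const double aik = A[i * n + k];
                            for (std::size_t j = bj; j < std::min(bj + bs, n); ++j)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }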
In a sparse LA algorithm only the non-zero elements and their corresponding indices are stored in memory: memory accesses are in fact indirect. So a sparse algorithm cannot exploit the hardware at the same level of optimization: I don't know the specific figures in this context, but 10 to 20% of peak wouldn't be strange.
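For comparison, the core of a sparse kernel in the common CSR format looks like the following sketch (array names are the usual CSR conventions); every access to x goes through a column index that itself has to be loaded from memory, which is the indirection in question:

    #include <cstddef>

    // Sparse matrix-vector product y = A*x with A stored in CSR format:
    //   val[]     non-zero values, row by row
    //   col_idx[] column index of each stored value
    //   row_ptr[] start of each row in val/col_idx (length n_rows + 1)
    // The load x[col_idx[k]] is an indirect, data-dependent access, which is
    // what defeats the prefetching and vectorization that dense kernels rely on.
    void spmv_csr(const double* val, const int* col_idx, const int* row_ptr,
                  const double* x, double* y, std::size_t n_rows) {
        for (std::size_t i = 0; i < n_rows; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * x[col_idx[k]];   // indirect gather from x
            y[i] = sum;
        }
    }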
The gain, clearly, is that operations on zeros (on non-stored elements) are simply not performed, resulting in orders of magnitude fewer operations and much less storage.
There are further overheads from integer logic and conditionals, but modern CPUs are pretty good at overlapping integer and FP operations and at speculative execution. Unfortunately these too can prevent vectorization, and so they are further overheads with respect to the dense case.
What about GPUs?
Dense LA algorithms are an optimal fit for GPUs just as they are for CPUs: in this case you have optimal usage of:
coalescing
shared memory
memory access patterns
Again, the indirect access to matrix elements in sparse LA algorithms prevents exploiting the same level of optimization.
References
I can't remember which one I used when I encountered sparse problems... I think it was PSBLAS: http://people.uniroma2.it/salvatore.filippone/psblas/
But here you will find an overwhelming number of them:
http://www.netlib.org/utk/people/JackDongarra/la-sw.html

Where is Strassen's matrix multiplication useful?

Strassen's algorithm for matrix multiplication gives only a marginal improvement over the conventional O(N^3) algorithm. It has higher constant factors and is much harder to implement. Given these shortcomings, is Strassen's algorithm actually useful, and is it implemented in any library for matrix multiplication? Moreover, how is matrix multiplication implemented in libraries?
Generally, Strassen's method is not preferred for practical applications, for the following reasons:
The constants used in Strassen's method are high, and for a typical application the naive method works better.
For sparse matrices, there are better methods designed especially for them.
The submatrices in the recursion take extra space.
Because of the limited precision of computer arithmetic on non-integer values, larger errors accumulate in Strassen's algorithm than in the naive method.
So the idea of Strassen's algorithm is that it's faster (asymptotically speaking). This could make a big difference if you are dealing with either huge matrices or a very large number of matrix multiplications. However, just because it's faster asymptotically doesn't make it the most efficient algorithm practically. There are all sorts of implementation considerations such as caching and architecture-specific quirks. There is also parallelism to consider.
I think your best bet would be to look at the common libraries and see what they are doing. Have a look at BLAS for example. And I think that Matlab uses MAGMA.
If your contention is that you don't think O(n^2.8) is that much faster than O(n^3), this chart shows that n doesn't need to be very large before that difference becomes significant.
It's very important to stop at the right moment.
With 1,000 x 1,000 matrices, you can multiply them by doing seven 500 x 500 products plus a few additions. That's probably useful. With 500 x 500, maybe. With 10 x 10 matrices, most likely not. You'd just have to do some experiments first to find the point at which to stop.
But Strassen's algorithm only saves a factor of 2 (at best) when the number of rows grows by a factor of 32: the number of coefficients grows by 1,024, and the total time grows by a factor of 16,807 instead of 32,768 (seven instead of eight recursive products per halving, so 7^5 versus 8^5 over five levels). In practice, that's a "constant factor". I'd say you gain more by transposing the second matrix first so you can multiply rows by rows, then look carefully at cache sizes, vectorise as much as possible, and distribute over multiple cores that don't step on each other's feet.
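A sketch of that "stop at the right moment" idea: Strassen recursion for square matrices whose size is a power of two, falling back to the naive product below a cutoff. The cutoff value, the flat row-major layout and the helper names are illustrative assumptions; a serious implementation would avoid the temporaries and work on sub-views instead:

    #include <cstddef>
    #include <vector>

    using Mat = std::vector<double>;  // square, row-major

    // c = a + sign * b (element-wise), all n x n.
    static Mat add(const Mat& a, const Mat& b, std::size_t n, double sign = 1.0) {
        Mat c(n * n);
        for (std::size_t i = 0; i < n * n; ++i) c[i] = a[i] + sign * b[i];
        return c;
    }

    // Naive O(n^3) product, used below the cutoff where Strassen stops paying off.
    static Mat mul_naive(const Mat& a, const Mat& b, std::size_t n) {
        Mat c(n * n, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t j = 0; j < n; ++j)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
        return c;
    }

    // Extract quadrant (bi, bj) of m (size n) as an (n/2) x (n/2) matrix.
    static Mat get_block(const Mat& m, std::size_t n, std::size_t bi, std::size_t bj) {
        const std::size_t h = n / 2;
        Mat b(h * h);
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t j = 0; j < h; ++j)
                b[i * h + j] = m[(bi * h + i) * n + (bj * h + j)];
        return b;
    }

    // Write an (n/2) x (n/2) block back into quadrant (bi, bj) of m.
    static void set_block(Mat& m, std::size_t n, std::size_t bi, std::size_t bj, const Mat& b) {
        const std::size_t h = n / 2;
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t j = 0; j < h; ++j)
                m[(bi * h + i) * n + (bj * h + j)] = b[i * h + j];
    }

    // Strassen recursion for n a power of two, with a naive-multiply cutoff.
    // cutoff = 128 is only a placeholder; the right value has to be measured.
    Mat mul_strassen(const Mat& a, const Mat& b, std::size_t n, std::size_t cutoff = 128) {
        if (n <= cutoff) return mul_naive(a, b, n);
        const std::size_t h = n / 2;
        Mat a11 = get_block(a, n, 0, 0), a12 = get_block(a, n, 0, 1),
            a21 = get_block(a, n, 1, 0), a22 = get_block(a, n, 1, 1);
        Mat b11 = get_block(b, n, 0, 0), b12 = get_block(b, n, 0, 1),
            b21 = get_block(b, n, 1, 0), b22 = get_block(b, n, 1, 1);

        // The seven half-size products that replace eight naive ones.
        Mat m1 = mul_strassen(add(a11, a22, h), add(b11, b22, h), h, cutoff);
        Mat m2 = mul_strassen(add(a21, a22, h), b11, h, cutoff);
        Mat m3 = mul_strassen(a11, add(b12, b22, h, -1.0), h, cutoff);
        Mat m4 = mul_strassen(a22, add(b21, b11, h, -1.0), h, cutoff);
        Mat m5 = mul_strassen(add(a11, a12, h), b22, h, cutoff);
        Mat m6 = mul_strassen(add(a21, a11, h, -1.0), add(b11, b12, h), h, cutoff);
        Mat m7 = mul_strassen(add(a12, a22, h, -1.0), add(b21, b22, h), h, cutoff);

        Mat c(n * n);
        set_block(c, n, 0, 0, add(add(m1, m4, h), add(m7, m5, h, -1.0), h));   // C11 = M1+M4-M5+M7
        set_block(c, n, 0, 1, add(m3, m5, h));                                 // C12 = M3+M5
        set_block(c, n, 1, 0, add(m2, m4, h));                                 // C21 = M2+M4
        set_block(c, n, 1, 1, add(add(m1, m2, h, -1.0), add(m3, m6, h), h));   // C22 = M1-M2+M3+M6
        return c;
    }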
Marginal improvement: True, but growing as the matrix sizes grow.
Higher constant factors: Practical implementations of Strassen's algorithm use conventional n^3 for blocks below a particular size, so this doesn't really matter.
Harder to implement: whatever.
As for what's used in practice: First, you have to understand that multiplying two ginormous dense matrices is unusual. Much more often, one or both of them is sparse, or symmetric, or upper triangular, or some other pattern, which means that there are quite a few specialized tools which are essential to the efficient large matrix multiplication toolbox. With that said, for giant dense matrices, Strassen's is The Solution.

Matrix-Vector Multiplication - Sparse vs. Dense matrices

I want to implement a matrix-vector multiplication in C. My matrix is 1000 * 1000^2 and highly sparse (less than 0.01% non-zero elements). The non-zero elements are dispersed among the rows (between 0 to 126 non-zero elements per row).
I have heard that, generally, using parallel processing for sparse matrix-vector multiplication is challenging and not as efficient as for dense matrices, because the ratio of computation to memory access is low (here). But I cannot really understand what the main difference between a sparse and a dense matrix is with respect to parallel computation that makes sparse matrices less efficient. It seems the same problem is still around for dense matrices (please correct me if I am wrong).
I would appreciate it if you could let me know how dense matrices differ from sparse matrices in terms of parallel processing.
Thanks
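To make the computation-to-memory-access point concrete, here is a sketch of a row-parallel CSR matrix-vector product (the OpenMP pragma, chunk size and array names are illustrative assumptions), with the dense-versus-sparse comparison spelled out in the comments:

    #include <cstddef>

    // y = A*x, A in CSR format (val / col_idx / row_ptr as usual).
    // Dense case: each y[i] does n multiply-adds against n contiguous row values
    // plus the (reusable) vector x, so the hardware prefetcher sees regular
    // strides and the flops-per-byte ratio is reasonable.
    // Sparse case: each stored non-zero costs one value load, one index load and
    // one indirect load of x for a single multiply-add, so the kernel is memory
    // bound; with 0..126 non-zeros per row the threads also see uneven amounts
    // of work, so a static row partitioning can load-balance poorly.
    void spmv_csr_parallel(const double* val, const int* col_idx,
                           const int* row_ptr, const double* x, double* y,
                           std::size_t n_rows) {
        #pragma omp parallel for schedule(dynamic, 1024)
        for (long i = 0; i < static_cast<long>(n_rows); ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }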

Efficient algorithm for finding largest eigenpair of small general complex matrix

I am looking for an efficient algorithm to find the largest eigenpair of a small, general (non-square, non-sparse, non-symmetric), complex matrix A of size m x n. By small I mean that m and n are typically between 4 and 64, and usually around 16, but with m not equal to n.
This problem is straightforward to solve with the general LAPACK SVD algorithms, i.e. gesvd or gesdd. However, as I am solving millions of these problems and only require the largest eigenpair, I am looking for a more efficient algorithm. Additionally, in my application the eigenvectors will generally be similar for all cases. This led me to investigate Arnoldi-iteration-based methods, but I have found neither a good library nor an algorithm that applies to my small, general, complex matrix. Is there an appropriate algorithm and/or library?
Rayleigh quotient iteration has cubic convergence, but each step needs an LU or QR decomposition of your (shifted) matrix, so you may also want to implement the plain power method and see how the two compare.
http://en.wikipedia.org/wiki/Rayleigh_quotient_iteration
Following @rchilton's comment, you can apply this to A*A.
The idea of looking for the largest eigenpair is analogous to finding a large power of the matrix, as the lower-frequency modes get damped out during the iteration. The Lanczos algorithm is one of a few such algorithms that rely on the so-called Ritz eigenvectors during the decomposition. From Wikipedia:
The Lanczos algorithm is an iterative algorithm ... that is an adaptation of power methods to find eigenvalues and eigenvectors of a square matrix or the singular value decomposition of a rectangular matrix. It is particularly useful for finding decompositions of very large sparse matrices. In latent semantic indexing, for instance, matrices relating millions of documents to hundreds of thousands of terms must be reduced to singular-value form.
The technique works even if the system is not sparse, but if it is large and dense it has the advantage that it doesn't all have to be stored in memory at the same time.
How does it work?
The power method for finding the largest eigenvalue of a matrix A can be summarized by noting that if x_{0} is a random vector and x_{n+1}=A x_{n}, then in the large n limit, x_{n} / ||x_{n}|| approaches the normed eigenvector corresponding to the largest eigenvalue.
Non-square matrices?
Noting that your system is not a square matrix, I'm pretty sure that the SVD problem can be decomposed into separate linear algebra problems where the Lanczos algorithm would apply. A good place to ask such questions would be over at https://math.stackexchange.com/.
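Building on the A*A suggestion above, here is a sketch of power iteration on A^H A for the dominant singular pair, using Eigen for brevity (the tolerance, iteration cap and warm-start parameter are illustrative assumptions; since your matrices are similar from case to case, reusing the previous right vector as the starting guess should cut the iteration count considerably):

    #include <cmath>
    #include <Eigen/Dense>

    using CMat = Eigen::MatrixXcd;
    using CVec = Eigen::VectorXcd;

    // Dominant singular triple (sigma, u, v) of a small complex m x n matrix A,
    // via power iteration on the Hermitian n x n matrix A^H A.
    // v0 can be the converged vector from a previous, similar problem.
    void largest_singular_pair(const CMat& A, CVec v0, double& sigma,
                               CVec& u, CVec& v,
                               int max_iter = 200, double tol = 1e-10) {
        v = v0.normalized();
        double prev = 0.0;
        for (int it = 0; it < max_iter; ++it) {
            CVec w = A.adjoint() * (A * v);   // one power-iteration step on A^H A
            double norm = w.norm();           // approximates sigma^2
            v = w / norm;
            if (std::abs(norm - prev) < tol * norm) break;  // converged
            prev = norm;
        }
        u = A * v;            // left singular vector direction
        sigma = u.norm();     // largest singular value
        u /= sigma;
    }

If convergence is slow because the top two singular values are close, the Rayleigh quotient iteration suggested above, applied to A^H A, is the natural next step.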
