Matrix-Vector Multiplication - Sparse vs. Dense matrices

I want to implement a matrix-vector multiplication in C. My matrix is 1000 * 1000^2 and highly sparse (less than 0.01% non-zero elements). The non-zero elements are dispersed among the rows (between 0 and 126 non-zero elements per row).
I have heard that, in general, parallelizing sparse matrix-vector multiplication is challenging and less efficient than for dense matrices because the ratio of computation to memory access is low (Here). But I cannot really understand what the main difference between a sparse and a dense matrix is, with respect to parallel computation, that makes sparse matrices less efficient. It seems the same problem still exists for dense matrices (please correct me if I am wrong).
I would appreciate it if you could explain how dense matrices differ from sparse matrices in terms of parallel processing.


Why cuSparse is much slower than cuBlas for sparse matrix multiplication

Recently, when I used cuSPARSE and cuBLAS in CUDA Toolkit 6.5 to do sparse matrix multiplication, I found that cuSPARSE is much slower than cuBLAS in all cases!
In all my experiments, I used cusparseScsrmm in cuSparse and cublasSgemm in cuBLAS. In the sparse matrix, half of the total elements are zero. The GPU I used is NVIDIA Titan Black. Besides, all the time consumed is obtained through nvvp tool provided by NVIDIA. Below are some of the results:
Experiment A:
sparse matrix size: 192x2400
dense matrix size: 2400x256
cusparse time: 1.4ms
cublas time: 0.21ms
Experiment B:
sparse matrix size: 192x75
dense matrix size: 75x1024
cusparse time: 0.27ms
cublas time: 0.04ms
So, it's very odd to see the results listed above. Because cuSPARSE is designed particularly to handle sparse matrix manipulation, how could it be even slower than cuBLAS!? If so, then there is no need to use cuSPARSE at all. Could you please give me any explanation to the results? Also, could you suggest any other ways to speed up sparse matrix multiplication?
I don't think that you can classify a matrix with half zeros as "sparse": the timings you have found are reasonable (actually the sparse algorithm is behaving pretty well!).
Sparse algorithms are efficient only when considering matrices where most of the elements are zeros (for example, matrices coming out from finite elements problems).
This holds true for CPUs, not only for GPUs: there's an important overhead in treating the matrix as sparse, and it becomes worthwhile to use sparse algorithms only when... most of the elements are zeros (typically: ten or fewer non-zeros per row, in a matrix of order thousands - hundreds of thousands - (millions?) ).
There are other matrix shapes that have efficient solution algorithms, that you can try if it applies to your problem, e.g. band matrices. I don't know whether they have been ported to cuBlas though.
About the overheads
Dense linear algebra algorithms can perform optimally because processors have been designed to solve such systems efficiently. Consider the DGEMM operation (matrix-matrix multiply): it is an operation that lets you use the processor at >95% of its theoretical peak floating-point performance for large matrices (i.e., matrices that do not fit in any cache of the system). How?
optimal cache usage
vectorization (SSE, AVX)
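To make the two points above concrete, here is a minimal sketch of a cache-blocked (tiled) matrix multiply in C. It is not how an optimized DGEMM is actually written; the function name `matmul_blocked` and the tile size of 32 are assumptions chosen for illustration only. It shows the two ideas: each tile is reused while it is still in cache, and the innermost loop runs over contiguous memory, which is what a compiler needs to emit SSE/AVX instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Tile edge length; 32 is an illustrative guess, not a tuned value. */
#define TILE 32

/* C += A * B for row-major n x n matrices, processed tile by tile so that
   the data touched by the three inner loops is reused while it is cached. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp the tile at the matrix edge. */
                size_t i_end = ii + TILE < n ? ii + TILE : n;
                size_t k_end = kk + TILE < n ? kk + TILE : n;
                size_t j_end = jj + TILE < n ? jj + TILE : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        /* Unit-stride loop over j: a vectorization candidate. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

The caller is expected to zero C first if a plain product is wanted. Every address here is computed from the loop indices alone, which is exactly the property the sparse case loses.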
In a sparse LA algorithm only the non-zero elements and their corresponding indices are stored in memory, so memory accesses are indirect. The sparse algorithm therefore cannot exploit the hardware at the same level of optimization: I don't know specific figures in this context, but 10 to 20% of peak wouldn't be strange.
The gain is clearly that operations on zeros (on non-stored elements) are simply not performed, resulting in orders of magnitude fewer operations and much less storage.
There are further overheads in integer logic and conditionals, but modern CPUs are pretty good at overlapping integer and FP operations and at speculative execution. Unfortunately these too can prevent vectorization, and so are further overheads with respect to the dense case.
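The indirect access described above can be sketched with a sparse matrix-vector product in compressed sparse row (CSR) form, the layout that the `csr` in `cusparseScsrmm` refers to. This is a minimal sketch, not any library's implementation; the struct `csr_matrix` and the function `csr_spmv` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Compressed sparse row (CSR) storage: only the non-zeros are kept. */
typedef struct {
    size_t nrows;
    const size_t *row_ptr;  /* nrows + 1 offsets into col_idx and val */
    const size_t *col_idx;  /* column index of each stored non-zero */
    const double *val;      /* value of each stored non-zero */
} csr_matrix;

/* y = M * x */
void csr_spmv(const csr_matrix *M, const double *x, double *y)
{
    assert(M != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < M->nrows; i++) {
        double sum = 0.0;
        /* Row i owns the non-zeros in [row_ptr[i], row_ptr[i + 1]). */
        for (size_t p = M->row_ptr[i]; p < M->row_ptr[i + 1]; p++)
            sum += M->val[p] * x[M->col_idx[p]];  /* indirect load of x */
        y[i] = sum;
    }
}
```

Each multiply-add needs a value, an index, and a load of `x` at an address known only after the index has been read, so there are roughly two floating-point operations per three memory reads. Rows also have different lengths, which is a second obstacle when the rows are split across threads.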
What about GPUs?
Dense LA algorithms are an optimal fit for GPUs just as they are for CPUs: in this case you have optimal usage of:
shared memory
memory access patterns
Again, the indirect access to matrix elements in sparse LA algorithms prevents exploiting the same level of optimization.
I can't remember which one I used when encountered sparse problems... I think it was PSBLAS:
But here you will be overwhelmed by them:

Cholesky Decomposition for a large number of small matrices using OpenCL

My question is similar to what was asked here: [a link] Performing many small matrix operations in parallel in OpenCL, except that I want to do Cholesky decomposition. Also, my matrices will range from 15 x 15 to 100 x 100, and I will have up to 100,000 of them. All of the matrices will have the same dimensions. The decomposed matrices will be used further within the GPU.
This paper [a link] discusses the algorithms at a high level. They use the term batched Cholesky for such a problem (large number of small matrices). The way they have done it is to implement a batched version of all the steps involved in a Cholesky decomposition.
So I want to start with a batched matrix multiplication (since it is one of the steps in a Cholesky decomposition). For large matrices, matrix multiplication is done in a blocked manner on a GPU. My question: would that be suitable for the kind of problem I have? Any other suggestions would be helpful; I am a little unsure how to approach this.
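For reference, the batched multiplication being asked about can be written down in plain C as follows. This is only a sketch of the data layout and loop structure, not the paper's method and not OpenCL; the name `matmul_batched` and the back-to-back storage of the matrices are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* C[b] = A[b] * B[b] for `batch` independent row-major n x n matrices
   stored back to back (matrix b starts at offset b * n * n). */
void matmul_batched(size_t batch, size_t n,
                    const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    /* Iterations over b are independent of each other; in an OpenCL port
       this loop would typically become the work-group index. */
    for (size_t b = 0; b < batch; b++) {
        const float *a = A + b * n * n;
        const float *bm = B + b * n * n;
        float *c = C + b * n * n;
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                float sum = 0.0f;
                for (size_t k = 0; k < n; k++)
                    sum += a[i * n + k] * bm[k * n + j];
                c[i * n + j] = sum;
            }
    }
}
```

The parallelism in this formulation comes from the batch index rather than from tiles within one matrix. Whether additional blocking inside each matrix pays off would depend on whether a whole matrix fits in the device's fast on-chip memory, which is something to measure rather than assume.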

What is the best way to multiply a large dense matrix with its transpose?

I have a large matrix of order 1M x 300 (obtained after the SVD of a large item matrix). So the matrix is a dense one with float as its data type. I would like to compute the similarity matrix by multiplying the dimensionally reduced matrix with its transpose.
I implemented a matrix multiplication method, but it just doesn't finish.
What are the ways to perform matrix multiplication between the dense matrix (~1M rows x 300 columns) with its transpose?
Will using MapReduce help in speeding up the job?
I also saw Apache Hama being efficient for large matrix computations. Will that fit my problem?
Strassen's algorithm is also used for large matrices; how do I use it?
Any other solutions/suggestion for it?

LU Decomposition N^3

Suppose I have a square N x N symmetric real matrix A, and that I want to compute the LU decomposition of A. What is the complexity (e.g. O(N^2), O(N^3), etc.) of the best algorithm to do this:
If A is a dense matrix
If A is a sparse matrix?
Wikipedia claims the following:
If two matrices of order n can be multiplied in time M(n), where
M(n) ≥ n^a for some a > 2, then the LU decomposition can be computed in
time O(M(n)). This means, for example, that an O(n^2.376) algorithm
exists based on the Coppersmith–Winograd algorithm.
For a sparse matrix there is no single answer. It depends on the nature of the sparsity.
I would say it's the same order for sparse matrix multiplication as dense because (1) these order metrics only apply when the data is so large that the order effect dominates, and (2) sparsity at best reduces computation by a linear factor unrelated to the size N, therefore as N grows, but sparsity stays the same, the computation should again increase as O(N^3). As always, in the real world, your data size may not be large enough for this aspect of performance (the order) to dominate, and use of caches and optimized kernels will matter far more.
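For the dense case, the cubic cost can be read directly off the loop structure of textbook elimination. The sketch below is a Doolittle LU without pivoting that also counts its inner updates; the function name `lu_inplace` and the counter are additions for illustration, and a production code would pivot.

```c
#include <assert.h>
#include <stddef.h>

/* In-place Doolittle LU without pivoting on a row-major n x n matrix.
   On return the strict lower triangle holds L (unit diagonal implied) and
   the upper triangle holds U. Returns the number of inner updates done. */
size_t lu_inplace(size_t n, double *a)
{
    size_t updates = 0;
    for (size_t k = 0; k < n; k++) {
        /* No pivoting in this sketch, so a zero pivot is not supported. */
        assert(a[k * n + k] != 0.0);
        for (size_t i = k + 1; i < n; i++) {
            a[i * n + k] /= a[k * n + k];          /* multiplier l_ik */
            for (size_t j = k + 1; j < n; j++) {
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
                updates++;
            }
        }
    }
    return updates;
}
```

The counter ends at (n-1)n(2n-1)/6, the sum of (n-1-k)^2 over k, which grows as n^3/3. A sparse code skips the updates whose multiplier or pivot-row entry is zero, so its cost depends on how many zeros survive the elimination.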

Efficient algorithm for finding largest eigenpair of small general complex matrix

I am looking for an efficient algorithm to find the largest eigenpair of a small, general (non-square, non-sparse, non-symmetric), complex matrix, A, of size m x n. By small I mean m and n are typically between 4 and 64 and usually around 16, but with m not equal to n.
This problem is straightforward to solve with the general LAPACK SVD algorithms, i.e. gesvd or gesdd. However, as I am solving millions of these problems and only require the largest eigenpair, I am looking for a more efficient algorithm. Additionally, in my application the eigenvectors will generally be similar for all cases. This led me to investigate Arnoldi iteration based methods, but I have found neither a good library nor an algorithm that applies to my small general complex matrix. Is there an appropriate algorithm and/or library?
Rayleigh quotient iteration has cubic convergence. You may also want to implement the power method and see how they compare, since Rayleigh quotient iteration needs an LU or QR decomposition of your (shifted) matrix at each step.
Following @rchilton's comment, you can apply this to A* A.
The idea of looking for the largest eigenpair is analogous to finding a large power of the matrix, as the lower frequency modes get damped out during the iteration. The Lanczos algorithm is one of a few such algorithms that rely on the so-called Ritz eigenvectors during the decomposition. From Wikipedia:
The Lanczos algorithm is an iterative algorithm ... that is an adaptation of power methods to find eigenvalues and eigenvectors of a square matrix or the singular value decomposition of a rectangular matrix. It is particularly useful for finding decompositions of very large sparse matrices. In latent semantic indexing, for instance, matrices relating millions of documents to hundreds of thousands of terms must be reduced to singular-value form.
The technique works even if the system is not sparse, and if it is large and dense it has the advantage that the matrix doesn't all have to be stored in memory at the same time.
How does it work?
The power method for finding the largest eigenvalue of a matrix A can be summarized by noting that if x_{0} is a random vector and x_{n+1}=A x_{n}, then in the large n limit, x_{n} / ||x_{n}|| approaches the normed eigenvector corresponding to the largest eigenvalue.
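The recurrence x_{n+1} = A x_{n} can be sketched in C as follows. To keep the example short it assumes a real square matrix; for the complex non-square matrix in the question the same loop would be applied to A* A with complex arithmetic. The function name `power_iteration`, the fixed iteration count, and the choice to rescale by the largest-magnitude entry instead of the 2-norm are all assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Power iteration on a row-major n x n real matrix.
   x holds the start vector on entry and the eigenvector estimate on exit,
   rescaled so that its largest-magnitude entry is +1 or -1.
   tmp is caller-provided scratch space of length n.
   Returns the Rayleigh quotient (x^T A x) / (x^T x) as eigenvalue estimate. */
double power_iteration(size_t n, const double *A, double *x, double *tmp,
                       size_t iters)
{
    assert(n > 0 && iters > 0 && A != NULL && x != NULL && tmp != NULL);
    for (size_t it = 0; it < iters; it++) {
        double scale = 0.0;
        for (size_t i = 0; i < n; i++) {          /* tmp = A * x */
            double s = 0.0;
            for (size_t j = 0; j < n; j++)
                s += A[i * n + j] * x[j];
            tmp[i] = s;
            double mag = s < 0.0 ? -s : s;
            if (mag > scale)
                scale = mag;
        }
        assert(scale > 0.0);                      /* A * x must not vanish */
        for (size_t i = 0; i < n; i++)            /* x = tmp / max|tmp| */
            x[i] = tmp[i] / scale;
    }
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {              /* Rayleigh quotient */
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += A[i * n + j] * x[j];
        num += x[i] * s;
        den += x[i] * x[i];
    }
    return num / den;
}
```

Since the eigenvectors are expected to be similar from one problem to the next, passing the previous result as the start vector `x` should reduce the number of iterations needed. The convergence rate depends on the ratio between the two largest eigenvalue magnitudes, so a fixed `iters` is only a placeholder for a proper convergence test.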
Non-square matrices?
Noting that your system is not a square matrix, I'm pretty sure that the SVD problem can be decomposed into separate linear algebra problems where the Lanczos algorithm would apply. A good place to ask such questions would be over at
