Sparse matrix design in modern C++ - c++14

I want to implement a sparse matrix class using modern c++ ie 14 or 17. I know there must be some trade offs between storage and runtime efficiency. Right now I'd prefer to optimize more in terms of storage efficiency. If possible, I'd prefer more work to be done at compile time rather than runtime. For example, vector does have a lot of runtime checks so it may not be optimal. Can someone suggest a container for this? I plan to support the following operations:
Matrix Multiplication, Addition, Subtraction, Inversion and Transpose
Matrix iterators ie column row
Efficient constructors
etc
Thanks!

Related

Efficient algorithm for GEMM in memory limited scenarios

I am looking for an efficient algorithm to perform (dense) large matrix multiplications on GPUs. More specifically, for the case where the GPU does not have enough memory to hold all the matrices (e.g., m=n=k=100,000). I'm using cuBLAS to perform matrix multiplication in blocks, and I can think of many block-based approaches, but they are very inefficient because the A, B or C matrices have to be copied to/from the GPU multiple times.
I know that many efficient algorithms have been proposed (for example, here), but I was unable to find a concrete definition of the algorithm used. Is there an algorithm to perform this task without redundant copies (this is, copying A, B and C exactly once)? Any pointers to competitive approaches?
Such an algorithm is called an out-of-core algorithm and this problem is generally solved by using tiles. The idea is to first split A and B in relatively big tiles. Then, send 2 tiles on the GPU, perform the multiplication of the two, write the result in a preallocated tile (always the same), send it back to the CPU and accumulate the result in a tile of the C matrix. Actually, this algorithm is the same than the ones used to solve the matrix multiplication except that items are tiles and you need to care about sending/receiving data to/from the GPU. CUDA streams can be used to improve the execution time by overlapping communications with computations. Note that tiles needs to be copied multiple times because you do not have enough memory on the GPU. Lebesgue curves (aka Z-tiling or Z-order curves) can be used to reduce the number of copies/communications. Doing all of this is a bit complex. Some runtime systems and tools can help you to hide memory transfers more easily (eg. StarPu which is a research project).

What is the quickest method of matrix multiplication?

I've been working on a rather extensive program as of late, and i'm currently at a point where I have to utilize matrix multiplication. Thing is, for this particular program, speed is crucial. I'm familiar with a number of matrix setups, but I would like to know which method will run the fastest. I've done extensive research, but turned up very little results. Here is a list of the matrix multiplication algorithms I am familiar with:
Iterative algorithm
Divide and Conquer algorithm
Sub Cubic algorithms
Shared Memory Parallelism
If anyone needs clarification on the methods I listed, or on the question in general, feel free to ask.
The Strassen algorithm and the naive (O(n^3)) one are the most used in practice.
More complex algorithms with tighter asymptotic bounds are not used because they benefits would be apparent only for extremely large matrices, due to their complexity, e.g. Coppersmith algorithm.
As others pointed out you might want to use a library like ATLAS which will automatically tune the algorithm depending on the characteristcs of the platform where you are executing, e.g. L1/L2 cache sizes.
Quickest way might be using an existing library that's already optimized, you don't have to reinvent the wheel every time.

Floating point algorithms with potential for performance optimization

For a university lecture I am looking for floating point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spillages, maximizing instruction level parallelism and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.
The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.
Anyone have any ideas, or an algorithm/method that could use a boost?
I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.
Edit 1: I am taking the course, not teaching it. In the past years, there were quite a few projects that succeeded in surpassing the current best implementations in terms of performance.
Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates, it is however difficult to see which.
Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. Interesting for us is the fact that the algorithm is already implemented in the widely used Enfuse tool, which is now at version 4.1. So we will be able to validate and compare our results with the original and also potentially contribute to the development of the tool itself. I will update this post in the future with the results if I can.
The simplest possible example:
accumulation of a sum. unrolling using multiple accumulators and vectorization allow a speedup of (ADD latency)*(SIMD vector width) on typical pipelined architectures (if the data is in cache; because there's no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. Cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation.
A few classics from image/signal processing:
convolution with small kernels (especially small 2d convolves like a 3x3 or 5x5 kernel). In some sense this is cheating, because convolution is matrix multiplication, and is intimately related to the FFT, but in reality the nitty-gritty algorithmic techniques of high-performance small kernel convolutions are quite different from either.
erode and dilate.
what image people call a "gamma correction"; this is really evaluation of an exponential function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1] and sub-ulp accuracy is rarely needed to use much cheaper function approximations (low-order piecewise minimax polynomials are common).
Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:
Closest pair of points in moderately high dimension---say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension too much higher and dimensionality reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this.)
Variation: For each point, find the closest neighbour.
Variation: Red points and blue points; find the closest red point to each blue point.
Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.
Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.
There is a free benchmark called c-ray.
It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.
A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
They also show some opportunities for larger speedup, such as:
It does a linear search through all the spheres in the scene to try to find the nearest intersection. That represents a possible area for speedup, by doing a quick test to see if a sphere is farther away than the best seen so far, before doing all the 3-d geometry math.
It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
So if all you want to look at is chip-level performance, it could be a decent example.
However, it also shows how there can be much bigger opportunities.

Is there any built-in type of matrix in CUDA for matrix and matrix-vector operation?

I want to implement some matrix–vector math. There are vector types like: float2, int2, but I cannot find any built-in type matrix in CUDA.
Is there a library that suitable for such operations?
You're right to look for a library for matrix data types. I recommend taking a look at ArrayFire.
Here is the quick reference page with a listing of the supported types. Here are the functions you can run with is, which is organized into the categories of data analysis, linear algebra, image and signal processing, sparse matrices, and a bunch of common place algorithms for data indexing, sorting, reductions, visualization, and faster for loops.
Other libraries include CULA or MAGMA (focused on linear algebra), Thrust (targeted at 1D operations), and a host of niche academic libraries.
Disclaimer: I work on ArrayFire myself.

Calculate eigenvalues/eigenvectors of hundreds of small matrices using CUDA

I have a question on the eigen-decomposition of hundreds of small matrices using CUDA.
I need to calculate the eigenvalues and eigenvectors of hundreds (e.g. 500) of small (64-by-64) real symmetric matrices concurrently. I tried to implement it by the Jacobi method using chess tournament ordering (see this paper (PDF) for more information).
In this algorithm, 32 threads are defined in each block, while each block handles one small matrix, and the 32 threads work together to inflate 32 off-diagonal elements until convergence. However, I am not very satisfied with its performance.
I am wondering where there is any better algorithm for my question, i.e. the eigen-decomposition of many 64-by-64 real symmetric matrices. I guess the householder's method may be a better choice but not sure whether it can be efficiently implemented in CUDA. There are not a lot of useful information online, since most of other programmers are more interested in using CUDA/OpenCL to decompose one large matrix instead of a lot of small matrices.
At least for the Eigenvalues, a sample can be found in the Cuda SDK
http://www.nvidia.de/content/cudazone/cuda_sdk/Linear_Algebra.html
Images seem broken, but download of samples still works. I would suggest downloading the full SDK and having a look at that exsample. Also, this Paper could be helpfull:
http://docs.nvidia.com/cuda/samples/6_Advanced/eigenvalues/doc/eigenvalues.pdf

Resources