Efficient algorithm for GEMM in memory-limited scenarios

I am looking for an efficient algorithm to perform (dense) large matrix multiplications on GPUs. More specifically, for the case where the GPU does not have enough memory to hold all the matrices (e.g., m=n=k=100,000). I'm using cuBLAS to perform matrix multiplication in blocks, and I can think of many block-based approaches, but they are very inefficient because the A, B or C matrices have to be copied to/from the GPU multiple times.
I know that many efficient algorithms have been proposed (for example, here), but I was unable to find a concrete definition of the algorithm used. Is there an algorithm to perform this task without redundant copies (that is, copying A, B and C exactly once)? Any pointers to competitive approaches?

Such an algorithm is called an out-of-core algorithm, and this problem is generally solved using tiles. The idea is to first split A and B into relatively big tiles. Then send two tiles to the GPU, multiply them, write the result into a preallocated tile (always the same one), send it back to the CPU, and accumulate the result into a tile of the C matrix. This is essentially the same algorithm as blocked matrix multiplication, except that the items are tiles and you need to take care of sending/receiving data to/from the GPU. CUDA streams can be used to improve the execution time by overlapping communication with computation. Note that tiles need to be copied multiple times because you do not have enough memory on the GPU. Lebesgue curves (aka Z-order curves or Z-tiling) can be used to reduce the number of copies/communications. Doing all of this is a bit complex. Some runtime systems and tools can help you hide memory transfers more easily (e.g. StarPU, which is a research project).
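Here is a minimal sketch of the tiled loop with cuBLAS, under simplifying assumptions: column-major storage, square tiles of size T that evenly divide m, n and k, no error checking and no stream overlap. One deviation from the description above: the C tile stays resident on the GPU for the whole inner loop and is written back only once, instead of being shipped back and accumulated on the CPU after every product.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Out-of-core tiled SGEMM: C = A * B, all matrices column-major on the host.
// T is the tile size; for brevity it is assumed to divide m, n and k exactly.
void ooc_sgemm(int m, int n, int k, const float *A, const float *B, float *C, int T)
{
    cublasHandle_t h;
    cublasCreate(&h);

    float *dA, *dB, *dC;                             // one tile of each on the GPU
    cudaMalloc(&dA, sizeof(float) * T * T);
    cudaMalloc(&dB, sizeof(float) * T * T);
    cudaMalloc(&dC, sizeof(float) * T * T);

    const float one = 1.0f;

    for (int j = 0; j < n; j += T) {                 // tile column of C
        for (int i = 0; i < m; i += T) {             // tile row of C
            cudaMemset(dC, 0, sizeof(float) * T * T);
            for (int p = 0; p < k; p += T) {         // accumulate A(i,p) * B(p,j)
                cublasSetMatrix(T, T, sizeof(float), A + (size_t)p * m + i, m, dA, T);
                cublasSetMatrix(T, T, sizeof(float), B + (size_t)j * k + p, k, dB, T);
                cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, T, T, T,
                            &one, dA, T, dB, T, &one, dC, T);
            }
            // the finished C tile is written back exactly once
            cublasGetMatrix(T, T, sizeof(float), dC, T, C + (size_t)j * m + i, m);
        }
    }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(h);
}

With this loop order each A tile is still read n/T times and each B tile m/T times, which is exactly where the tile traversal order (e.g. a Z-order walk over the (i, j) tiles) and double-buffered streams come in.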

Related

CUDA Sorting Many Vectors / Arrays

I have many (200 000) vectors of integers (around 2000 elements in each vector) in GPU memory.
I am trying to parallelize an algorithm that needs to sort each vector and then calculate its average, standard deviation and skewness.
In the next step, the algorithm has to delete the maximal element and repeat the calculation of the statistical moments until a stopping criterion is reached, independently for each vector.
I would like to ask someone more experienced what the best approach to parallelizing this algorithm is.
Is it possible to sort more than one vector at once?
Or is it better not to parallelize the sorting, and instead run the whole algorithm for each vector in a single thread?
200 000 vectors of integers ... 2000 elements in each vector ... in GPU memory.
2,000 integers sounds like something a single GPU block could tackle handily. They would fit in its shared memory (or into its register file, but that would be less useful for various reasons), so you wouldn't need to sort them in global memory. 200,000 vectors = 200,000 blocks; but you can't have 2,000 threads per block - that would be excessive.
You might be able to use cub's block radix sort, as @talonmies suggests, but I'm not too sure that's the right thing to do. You might be able to do it with thrust, but there's also a good chance you'll have a lot of overhead and complex code (I may be wrong, though). Give serious consideration to adapting an existing (bitonic) sort kernel, or even writing your own - although that's more challenging to get right.
Anyway, if you write your own kernel, you can code your "next step" after sorting the data.
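If the cub route is taken, a one-block-per-vector sketch could look roughly like this (assumptions: keys only, vectors stored contiguously, 2,000 elements padded up to 2,048 per block, no element equal to INT_MAX; the 256-threads-by-8-items split is just one plausible choice):

#include <cub/cub.cuh>
#include <climits>

constexpr int THREADS = 256;
constexpr int ITEMS   = 8;                    // 256 * 8 = 2048 slots per vector

// One block sorts one vector in place; out-of-range slots are padded with INT_MAX,
// which sorts to the end and is never written back.
__global__ void sort_vectors(int *data, int n_per_vec)
{
    using BlockSort = cub::BlockRadixSort<int, THREADS, ITEMS>;
    __shared__ typename BlockSort::TempStorage temp;

    int *vec = data + (size_t)blockIdx.x * n_per_vec;

    int keys[ITEMS];
    for (int i = 0; i < ITEMS; ++i) {
        int idx = threadIdx.x * ITEMS + i;              // blocked arrangement
        keys[i] = (idx < n_per_vec) ? vec[idx] : INT_MAX;
    }

    BlockSort(temp).Sort(keys);                         // sorted across the block
    __syncthreads();

    for (int i = 0; i < ITEMS; ++i) {
        int idx = threadIdx.x * ITEMS + i;
        if (idx < n_per_vec) vec[idx] = keys[i];
    }
}

// launch: sort_vectors<<<200000, THREADS>>>(d_data, 2000);

The statistics and the delete-maximum/recompute loop can then be appended inside the same kernel, which keeps each vector in shared memory across the whole per-vector algorithm.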
Or is it better not to parallelize the sorting, and instead run the whole algorithm for each vector in a single thread?
This depends on how much time your application spends on these sorting efforts at the moment, relative to its entire running time. See also Amdahl's Law for a more formal statement of the above. Having said that - typically it should be worthwhile to parallelize the sorting when you already have data in GPU memory.

Parallelization of Multiple Delaunay Triangulations in Mathematica

I am trying to run a large number of "DelaunayTriangulation" routines from the Computational Geometry package in Mathematica. I have an array, "Lattice", which contains data for several thousand points in several thousand time-frames. (e.g. Lattice[[i]] indicates the ith time frame with ~10000 (x,y) coordinates).
I want to generate another large array, "Tri", with all the triangulation index data inside. For a serial calculation:
Tri=Table[DelaunayTriangulation[Lattice[[i]]],{i,imax}];
This calculation will take an exceptionally long time, so naturally, I wish to parallelize this computation:
Tri=Parallelize[Table[DelaunayTriangulation[Lattice[[i]]],{i,imax}]];
The problem lies here; usually, I would expect these individual triangulations to be divided between the 16 cores I have and run in parallel, but I don't see this. The parallelization doesn't affect anything and the computation runs as if it were on a single core.
I'm sure my use of "Parallelize" is correct, as it works with default Mathematica commands in other tables.
Is this an issue with the triangulation routine? Or perhaps memory (although the serial calculation uses only about 1 GB of the 32 GB of RAM I have)? Any insight into this would be useful.

Floating point algorithms with potential for performance optimization

For a university lecture I am looking for floating point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spillages, maximizing instruction level parallelism and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.
The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.
Anyone have any ideas, or an algorithm/method that could use a boost?
I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.
Edit 1: I am taking the course, not teaching it. In the past years, there were quite a few projects that succeeded in surpassing the current best implementations in terms of performance.
Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates, it is however difficult to see which.
Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. Interesting for us is the fact that the algorithm is already implemented in the widely used Enfuse tool, which is now at version 4.1. So we will be able to validate and compare our results with the original and also potentially contribute to the development of the tool itself. I will update this post in the future with the results if I can.
The simplest possible example:
Accumulation of a sum. Unrolling using multiple accumulators and vectorization allow a speedup of (ADD latency)*(SIMD vector width) on typical pipelined architectures (if the data is in cache; because there's no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. Cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation.
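A minimal scalar illustration of the multiple-accumulator idea (a vectorizing compiler or explicit SIMD then widens each accumulator further):

#include <cstddef>

// Four independent accumulators break the ADD-latency dependency chain, so the
// CPU can keep several additions in flight and the compiler can vectorize them.
float sum4(const float *x, std::size_t n)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i) s0 += x[i];      // leftover tail
    return (s0 + s1) + (s2 + s3);
}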
A few classics from image/signal processing:
convolution with small kernels (especially small 2D convolutions like a 3x3 or 5x5 kernel); a scalar baseline is sketched after this list. In some sense this is cheating, because convolution is matrix multiplication and is intimately related to the FFT, but in reality the nitty-gritty algorithmic techniques of high-performance small-kernel convolutions are quite different from either.
erode and dilate.
what image people call a "gamma correction"; this is really evaluation of an exponential function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1], and that sub-ulp accuracy is rarely needed, to use much cheaper function approximations (low-order piecewise minimax polynomials are common).
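As a starting point for the convolution item above, here is the scalar 3x3 baseline one would be optimizing against (single-channel float image; names are illustrative). The optimization work is then in unrolling the nine taps, keeping three input rows hot in cache/registers, and producing several adjacent outputs per SIMD vector.

// Scalar 3x3 convolution on a single-channel float image (row-major,
// width w, height h); border pixels are left untouched for brevity.
void conv3x3(const float *src, float *dst, int w, int h, const float k[9])
{
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            float acc = 0.f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    acc += k[(dy + 1) * 3 + (dx + 1)] * src[(y + dy) * w + (x + dx)];
            dst[y * w + x] = acc;
        }
    }
}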
Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:
Closest pair of points in moderately high dimension---say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension too much higher and dimensionality reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this.)
Variation: For each point, find the closest neighbour (a brute-force baseline is sketched after these geometry suggestions).
Variation: Red points and blue points; find the closest red point to each blue point.
Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.
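Picking up the nearest-neighbour variation above: a brute-force baseline, with the sizes from the suggestion (roughly 50,000 points in 16 dimensions) used only as an example. The d-dimensional inner loop is the part that unrolls and vectorizes well.

#include <cfloat>
#include <vector>

// For every point, find its nearest neighbour by brute force: O(n^2 * d)
// distance evaluations. pts holds n points of dimension d, row-major;
// with d ~ 16 the inner loop fully unrolls and maps straight onto SIMD.
void all_nearest(const float *pts, int n, int d, std::vector<int> &nn)
{
    nn.assign(n, -1);
    for (int i = 0; i < n; ++i) {
        float best = FLT_MAX;
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            float dist2 = 0.f;
            for (int k = 0; k < d; ++k) {
                float diff = pts[i * d + k] - pts[j * d + k];
                dist2 += diff * diff;
            }
            if (dist2 < best) { best = dist2; nn[i] = j; }
        }
    }
}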
Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.
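For reference, a textbook unblocked Cholesky kernel (lower triangle, column-major, SPD input assumed, no error handling); the interesting optimization work is in building a blocked, cache- and SIMD-friendly version on top of a kernel like this.

#include <cmath>

// In-place Cholesky factorization A = L * L^T of an n x n SPD matrix,
// column-major; only the lower triangle is read and overwritten with L.
void cholesky(float *A, int n)
{
    for (int j = 0; j < n; ++j) {
        for (int k = 0; k < j; ++k)            // A(j,j) -= A(j,k)^2
            A[j + j * n] -= A[j + k * n] * A[j + k * n];
        A[j + j * n] = std::sqrt(A[j + j * n]);
        for (int i = j + 1; i < n; ++i) {      // update column j below the diagonal
            for (int k = 0; k < j; ++k)
                A[i + j * n] -= A[i + k * n] * A[j + k * n];
            A[i + j * n] /= A[j + j * n];
        }
    }
}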
There is a free benchmark called c-ray.
It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.
A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
They also show some opportunities for larger speedup, such as:
It does a linear search through all the spheres in the scene to try to find the nearest intersection. That represents a possible area for speedup: doing a quick test to see if a sphere is farther away than the best seen so far, before doing all the 3-D geometry math (sketched below).
It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
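A sketch of the first opportunity, the early rejection mentioned above; the types and names here are illustrative and not taken from the c-ray source. It uses the fact that, for a unit-length ray direction, any hit on a sphere lies at least |origin - center| - radius along the ray.

#include <cfloat>
#include <cmath>

struct Vec3   { float x, y, z; };
struct Sphere { Vec3 c; float r; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

// Returns the parameter t of the nearest intersection, or FLT_MAX if none.
float nearest_hit(const Sphere *s, int n, Vec3 orig, Vec3 dir /* unit length */)
{
    float best = FLT_MAX;
    for (int i = 0; i < n; ++i) {
        Vec3  oc    = sub(orig, s[i].c);
        float lower = std::sqrt(dot(oc, oc)) - s[i].r;   // cheap lower bound on t
        if (lower >= best) continue;                     // cannot beat current best

        float b = dot(oc, dir);
        float c = dot(oc, oc) - s[i].r * s[i].r;
        float disc = b * b - c;
        if (disc < 0.f) continue;                        // ray misses the sphere
        float t = -b - std::sqrt(disc);                  // nearer root
        if (t > 0.f && t < best) best = t;
    }
    return best;
}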
So if all you want to look at is chip-level performance, it could be a decent example.
However, it also shows how there can be much bigger opportunities.

How to deal with a giant sparse matrices?

Can someone point me in the right direction? I'm looking to do some heavy-duty manipulation of some really large and often very sparse matrices, and I'm looking for the right tool for the job. These matrices will be much, much larger than the RAM of any single machine and will therefore likely be spread across several different machines. The matrices will often be sparse. I will want to perform all of the common matrix operations: multiplication, transpose, inverse, pseudo-inverse, SVD, eigenvalue decomposition, etc. Probably key among my concerns is that since the matrices will very likely be spread among several machines, I will want to minimize information sharing, because network latency is probably my biggest enemy. I'm concerned that map-reduce (a la Hadoop) is not the right option because its focus is on streaming large amounts of data between machines. (This book provides a great intro to map-reduce from an algorithmic perspective.) And lots of matrix operations are akin to giant JOIN operations, which are known to be slow on map-reduce.
So... where should I go?
This paper: Design of Hadoop-based Large-Scale Matrix Computations can help you with implementation guidelines. HBase is meant for storing sparse tables, so it might be a good storage option for the matrices.

Calculate eigenvalues/eigenvectors of hundreds of small matrices using CUDA

I have a question on the eigen-decomposition of hundreds of small matrices using CUDA.
I need to calculate the eigenvalues and eigenvectors of hundreds (e.g. 500) of small (64-by-64) real symmetric matrices concurrently. I tried to implement it with the Jacobi method using chess tournament ordering (see this paper (PDF) for more information).
In this algorithm, 32 threads are defined in each block, each block handles one small matrix, and the 32 threads work together to annihilate 32 off-diagonal elements until convergence. However, I am not very satisfied with its performance.
I am wondering whether there is any better algorithm for my problem, i.e. the eigen-decomposition of many 64-by-64 real symmetric matrices. I guess the Householder method may be a better choice, but I am not sure whether it can be implemented efficiently in CUDA. There is not a lot of useful information online, since most other programmers are more interested in using CUDA/OpenCL to decompose one large matrix rather than a lot of small matrices.
At least for the eigenvalues, a sample can be found in the CUDA SDK:
http://www.nvidia.de/content/cudazone/cuda_sdk/Linear_Algebra.html
The images seem broken, but the sample downloads still work. I would suggest downloading the full SDK and having a look at that example. Also, this paper could be helpful:
http://docs.nvidia.com/cuda/samples/6_Advanced/eigenvalues/doc/eigenvalues.pdf
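As an aside that postdates this thread: newer CUDA toolkits ship a batched Jacobi symmetric eigensolver in cuSOLVER (syevjBatched), which targets exactly this many-small-matrices case. A minimal sketch, assuming the 500 matrices are stored contiguously in device memory, column-major with lda = 64, and with all error checking omitted:

#include <cusolverDn.h>
#include <cuda_runtime.h>

// dA: batch * n * n floats (input matrices, overwritten with eigenvectors)
// dW: batch * n floats (output eigenvalues)
void batched_eig(float *dA, float *dW, int n /* = 64 */, int batch /* = 500 */)
{
    cusolverDnHandle_t h;
    syevjInfo_t params;
    cusolverDnCreate(&h);
    cusolverDnCreateSyevjInfo(&params);

    int lwork = 0;
    int *dInfo = nullptr;
    float *dWork = nullptr;
    cudaMalloc(&dInfo, sizeof(int) * batch);

    cusolverDnSsyevjBatched_bufferSize(h, CUSOLVER_EIG_MODE_VECTOR,
                                       CUBLAS_FILL_MODE_LOWER, n, dA, n,
                                       dW, &lwork, params, batch);
    cudaMalloc(&dWork, sizeof(float) * lwork);

    // Eigenvalues land in dW (n per matrix); eigenvectors overwrite dA.
    cusolverDnSsyevjBatched(h, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
                            n, dA, n, dW, dWork, lwork, dInfo, params, batch);

    cudaFree(dWork); cudaFree(dInfo);
    cusolverDnDestroySyevjInfo(params);
    cusolverDnDestroy(h);
}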
