small Matrix Inversion on CUDA

small Matrix Inversion on CUDA - matrix

I need a bit of advice from you, and I hope it won't take a lot of your time.
So here is my question:
I have a small square dense matrix, with possible sizes 4x4, 8x8, 16x16,
and I want to inverse it using CUDA.
The special part of the question is that I have 1024 idle cuda threads to perform this task.
So I have a suspicion that the most widespread inverse methods like Gauss Jordan won't properly work here, because they are slightly parallel and will use only about 4-16 threads from huge amount of 1024.
But how else can I inverse this matrices using all available threads?
Thank you for your attention!

There are at least two possible ready made options for this sort of problem:
Use the batched solvers shipping in recent versions of the CUBLAS library
Use the BSD licensed Gauss-Jordan elimination device code functions which NVIDIA distribute to registered developers. These were intended to invert small matrices using one thread per matrix
[This answer was assembled from comments and added as a community wiki entry to get the question off the unanswered queue]

Related

Multi-channel Lattice Recursive Least Squares

I'm trying to implement multi-channelt lattice RLS, i.e. the recursive least squares algorithm which performs noise cancellation with multiple inputs, but a single 'desired output'.
I have the basic RLS algorithm working with multiple components, but it's too inefficient and memory intensive for my purpose.
Wikipedia has an excellent example of lattice RLS, which works great.
https://en.wikipedia.org/wiki/Recursive_least_squares_filter
However, the sources it cites do not go into much detail on how to extend this to the multi-channel case, and re-doing the full derivation is a bit beyond me.
Does anyone know a good source which describes or implements this algorithm in the multi-channel case? Many thanks.

Use separate parallel adaptive filters...one for each noise reference and combine these outputs to subtract from your noisy signal. LMS usually works best but RLS is fine. Problems arise if any of the noise references are heavily correlated with the desired signal.

How to Scale SPICE Matrix so LU-decomposition doesn't Fail

I am implementing a SPICE solver. I have the following problem: say I put two diodes and a current source in serial (standard diodes). I use MNA and Boost LU-decomposition. The problem is that the nodal matrix becomes very quickly near-singular. I think I have to scale the values but I don't know how and I couldn't find anything on the Internet. Any ideas how to do this scaling?

In the perspective of numerical, there is a scale technique for this kind of near-singular matrices. Basically, this technique is to divide each row of A by the sum (or maximum) of the absolute values in that row. You can find KLU which is a linear solver for circuit simulations for more details.
In perspective of SPICE simulation, it uses so-call Gmin stepping technique to iteratively compute and approach a real answer. You can find this in the documents of a SPICE project QUCS (Quite Universal Circuit Simulator).

Scaling does not help when the matrix has both very large and very small entries.
It is necessary to use some or all of the many tricks that were developed for circuit solver applications. A good start is clipping the range of the exponential and log function arguments to reasonable values -- in most circuits a diode forward voltage is never more than 1V and the diode reverse current not less than 1pA.
Actually, look at all library functions and wrap them in code that makes their arguments and results suitable for circuit-solving purposes. Simple clipping is sometimes good enough, but it is way better to make sure the functions stay (twice) differentiable and continuous.

Floating point algorithms with potential for performance optimization

For a university lecture I am looking for floating point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spillages, maximizing instruction level parallelism and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.
The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.
Anyone have any ideas, or an algorithm/method that could use a boost?
I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.
Edit 1: I am taking the course, not teaching it. In the past years, there were quite a few projects that succeeded in surpassing the current best implementations in terms of performance.
Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates, it is however difficult to see which.
Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. Interesting for us is the fact that the algorithm is already implemented in the widely used Enfuse tool, which is now at version 4.1. So we will be able to validate and compare our results with the original and also potentially contribute to the development of the tool itself. I will update this post in the future with the results if I can.

The simplest possible example:
accumulation of a sum. unrolling using multiple accumulators and vectorization allow a speedup of (ADD latency)*(SIMD vector width) on typical pipelined architectures (if the data is in cache; because there's no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. Cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation.
A few classics from image/signal processing:
convolution with small kernels (especially small 2d convolves like a 3x3 or 5x5 kernel). In some sense this is cheating, because convolution is matrix multiplication, and is intimately related to the FFT, but in reality the nitty-gritty algorithmic techniques of high-performance small kernel convolutions are quite different from either.
erode and dilate.
what image people call a "gamma correction"; this is really evaluation of an exponential function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1] and sub-ulp accuracy is rarely needed to use much cheaper function approximations (low-order piecewise minimax polynomials are common).

Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:
Closest pair of points in moderately high dimension---say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension too much higher and dimensionality reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this.)
Variation: For each point, find the closest neighbour.
Variation: Red points and blue points; find the closest red point to each blue point.
Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.
Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.

There is a free benchmark called c-ray.
It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.
A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
They also show some opportunities for larger speedup, such as:
It does a linear search through all the spheres in the scene to try to find the nearest intersection. That represents a possible area for speedup, by doing a quick test to see if a sphere is farther away than the best seen so far, before doing all the 3-d geometry math.
It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
So if all you want to look at is chip-level performance, it could be a decent example.
However, it also shows how there can be much bigger opportunities.

Calculate eigenvalues/eigenvectors of hundreds of small matrices using CUDA

I have a question on the eigen-decomposition of hundreds of small matrices using CUDA.
I need to calculate the eigenvalues and eigenvectors of hundreds (e.g. 500) of small (64-by-64) real symmetric matrices concurrently. I tried to implement it by the Jacobi method using chess tournament ordering (see this paper (PDF) for more information).
In this algorithm, 32 threads are defined in each block, while each block handles one small matrix, and the 32 threads work together to inflate 32 off-diagonal elements until convergence. However, I am not very satisfied with its performance.
I am wondering where there is any better algorithm for my question, i.e. the eigen-decomposition of many 64-by-64 real symmetric matrices. I guess the householder's method may be a better choice but not sure whether it can be efficiently implemented in CUDA. There are not a lot of useful information online, since most of other programmers are more interested in using CUDA/OpenCL to decompose one large matrix instead of a lot of small matrices.

At least for the Eigenvalues, a sample can be found in the Cuda SDK
http://www.nvidia.de/content/cudazone/cuda_sdk/Linear_Algebra.html
Images seem broken, but download of samples still works. I would suggest downloading the full SDK and having a look at that exsample. Also, this Paper could be helpfull:
http://docs.nvidia.com/cuda/samples/6_Advanced/eigenvalues/doc/eigenvalues.pdf

What are canonical examples of parallel computation?

I am writing a paper to test a new application that will demonstrate the benefits of parallelized computation (compared to the traditional serialized version of this application). I want to use the canonical examples for parallel computation in my paper.
My first example is the parallel computation of pi. I would ideally like an example where each iteration is very time consuming (because of the additional overhead associated with parallelizing); my first thought is a Bayesian simulation with MCMC and Gibbs sampling.
What other problems are typically discussed in this context? What are good examples of large embarassingly parallel problems?

just a few more -
Multiplying matrices
Inverting matrices
FFT
String matching
Rendering 3d scenes (via scan line conversion or ray tracing)

One example I've used in the past of an embarrassingly parallel problem is visualizing the mandelbrot set. Each pixel can be computed independently.
Conway's Life is interesting as well, in that each value of the "next" board can be computed independently, but will depend on the relevant bits of the "current" board being done already.

I would suggest that canonical examples of parallel computation and embarassingly parallel problems are, if not completely, then nearly, disjoint sets. To put it another way, people working in parallel computation aren't terribly excited about embarassingly parallel problems; we call them that because we'd be embarassed to be working on them.
I'd be looking, if I were you, at these (a not entirely original list):
linear algebra on large dense matrices, both direct and iterative approaches;
linear algebra on huge sparse matrices
branch and bound approaches to linear programming (and related) problems;
sequence matching for bioinformatics (outside my field, I may have mis-expressed this);
continuos optimisation.
I expect there are many more.
EDIT: You may be interested in this list of problems which have been selected for benchmarking the next generation of European (academic) supercomputers. It will give you some idea of where that niche is heading.

Molecular dynamics simluations allow you to change the size of the problem until your computer resources are exhausted (i.e. 256 particles vs. 256,000,000 particles). Its truly a "canonical" example if you run the MD simulations under NVT conditions ;-)

My favorite example is monte carlo simulation.

Word counting seems to be the canonical example for MapReduce.
http://en.wikipedia.org/wiki/MapReduce#Example

Finding collisions in hash functions using Paul C. van Oorschot and Michael J. Weiner's method (PDF) comes up often in various cryptographic settings.

I used the Mandelbrot set demo to explain to my mom what parallel programming is about : http://www.ateji.com/px/demo.html
All the examples you mentions are mostly heavy data-parallel codes. You'll probably want to mention also task-oriented codes, such as servers responding to many requests in parallel, and data-flow or stream programming examples (MapReduce is a good representative of this class).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio