armadillo linear sparse system solver using LAPACK and SuperLU - c++11

I tried to solve the sparse linear system using armadillo library.
#include <iostream>
#include <armadillo>

using namespace std;
using namespace arma;

int main(int argc, char** argv) {
    int no_examples = 5000;
    sp_mat A = sprandu<sp_mat>(no_examples, no_examples, 0.7);
    vec b = randu<vec>(no_examples);

    wall_clock timer;
    timer.tic();
    vec x1 = spsolve(A, b, "superlu");
    double t = timer.toc();
    cout << "Elapsed time is: " << t << endl;
}
I compiled the program with g++ demo.cpp -O3 -I/usr/include/armadillo_bits -DARMA_DONT_USE_WRAPPER -lsuperlu -lopenblas -llapack. The run time with the superlu option is about 8.5 seconds. When the same system is solved with the LAPACK option of spsolve, the run time is 4.01 seconds. Can somebody explain why solving the same system of equations takes longer with SuperLU than with LAPACK?
My hunch is that they may be using different algorithms to solve the sparse linear system. Any other idea is welcome!
Edit: I'm running on Ubuntu 14.04 with export OPENBLAS_NUM_THREADS=4.

The density of the matrix is too high, making it almost dense.
LAPACK's dense algorithm can exploit many vectorization and caching optimizations. The sparse algorithm is more complex than a dense LU factorization: it performs extra initialization and tries to exploit the sparsity of the matrix. If the matrix is almost dense, the simpler, straightforward dense algorithm becomes faster.
I would expect SuperLU to perform better for density values below 0.3-0.4.
Also, storing a matrix with density 0.7 needs more memory in sparse format than in dense format, since both the values and their indices must be stored.
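For comparison, a minimal sketch (assuming Armadillo is built with SuperLU support) that times both spsolve backends on the same, genuinely sparse system. Whether SuperLU wins at a given density also depends on the sparsity pattern, since fill-in matters as much as the raw density:
#include <iostream>
#include <armadillo>

int main() {
    const int n = 5000;
    // density 0.05: genuinely sparse, unlike the 0.7 used above
    arma::sp_mat A = arma::sprandu<arma::sp_mat>(n, n, 0.05);
    arma::vec b = arma::randu<arma::vec>(n);

    arma::wall_clock timer;

    timer.tic();
    arma::vec x_lapack = arma::spsolve(A, b, "lapack");    // converts to dense, dense LU
    std::cout << "lapack:  " << timer.toc() << " s" << std::endl;

    timer.tic();
    arma::vec x_superlu = arma::spsolve(A, b, "superlu");  // sparse LU
    std::cout << "superlu: " << timer.toc() << " s" << std::endl;
}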

Related

Bottleneck in the Eigen partial_lu_inplace when factorizing a large number of small matrices

I need to factorize ~1e05 small matrices of variable size up to 20x20. Profiling the matrix factorization with HPCToolkit shows that the hotspot in the code is in Eigen::internal::partial_lu_inplace.
I checked the Eigen documentation on in-place matrix decompositions, and I understand that for large matrices an in-place decomposition can be important, to reuse the memory and get better cache efficiency.
I am currently computing the decomposition like this:
// Factorize the matrix.
matrixFactorization_ = A_.partialPivLu();
Profiling with HPCToolkit shows that the in-place factorization is the hotspot.
Is it possible to disable the in-place decomposition and test whether the code will be faster for the small matrices I am dealing with?
Note: the CPU time column in the profile is in seconds; I am not after microsecond optimizations here, the calculation takes ~4 seconds in total.
Edit: HPCToolkit statistically profiles the code in fully optimized mode (-O3), but with the information required to map the measurements back to the source code (-g3).
If the profiler gives you such detailed information, then you probably forgot to enable the compiler's optimizations (e.g., -O3 -march=native -DNDEBUG, or "Release" mode + /arch:AVX with Visual Studio). With Eigen, this makes a huge difference.
Then you can avoid dynamic memory allocation by using fixed-capacity types:
typedef Matrix<double,Dynamic,Dynamic,ColMajor,20,20> MatMax20;
MatMax20 A_;
PartialPivLU<MatMax20> matrixFactorization_;
The matrix A_ and all internals of PartialPivLU will thus be statically allocated.
To update an existing factorization, it is better to write:
matrixFactorization_.compute(A_);
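Putting it together, a minimal self-contained sketch of this approach (the size 12 and the residual check are illustrative, not from the original code):
#include <Eigen/Dense>
#include <iostream>

// Dynamic size at runtime, but capacity capped at 20x20: no heap allocation.
using MatMax20 = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
                               Eigen::ColMajor, 20, 20>;

int main() {
    MatMax20 A = MatMax20::Random(12, 12);   // any runtime size up to 20x20
    Eigen::PartialPivLU<MatMax20> lu;
    lu.compute(A);                           // re-uses the factorization's storage
    Eigen::VectorXd b = Eigen::VectorXd::Ones(12);
    Eigen::VectorXd x = lu.solve(b);
    std::cout << "residual: " << (A * x - b).norm() << std::endl;
}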

Can Random Number Generator of Fortran 90 be trusted for Monte Carlo Integration?

I have written a short Monte Carlo integration algorithm to calculate an integral in Fortran 90. I compared the result obtained by solving the integral with respect to some parameter using the intrinsic random number generator against the random number generator method ran1 presented in Numerical Recipes for Fortran 90, Volume 2.
Running the same algorithm twice, once calling the intrinsic random_seed() followed by calls to random_number(), and once calling the ran1() method provided in the Numerical Recipes book, I obtain in principle the same shape, but the intrinsic result is a continuous curve in contrast to the ran1 result. In both cases I call the function with random parameters 10,000 times for a parameter value q, sum the results, and then go on to the next q value and call the function 10,000 times again.
A comparative image of the results can be found here:
If I increase the number of calls, both curves converge. But I was wondering: why does the intrinsic random number generator produce this smoothness? Is it still generally advised to use it, or are there other, more advisable RNGs? I suppose the continuous result is a consequence of the "lesser" randomness of the intrinsic number generator.
(I left out the source code as I don't think there is a lot of input from it. If somebody cares I can hand it in later.)
There are NO guarantees about the quality of the pseudo-random number generator in standard Fortran. If you care about a particular quality of implementation for cryptography, or for science that is sensitive to random numbers (Monte Carlo), you should use a library you have control over.
You can study your compiler's manual to find out what it says about the random number generator, but every compiler can implement a completely different algorithm to generate random numbers.
Numerical Recipes is actually not well received by some people in the numerical mathematics community http://www.uwyo.edu/buerkle/misc/wnotnr.html
This site is not for software recommendations, but this article (link given by roygvib in a comment), https://arxiv.org/abs/1005.4117, is a good review with examples of bad and good algorithms, methods to test them, methods to generate arbitrary number distributions, and examples of calls to two example libraries in C (one of which can be called from Fortran as well).
Personally, I use this parallel PRNG: https://bitbucket.org/LadaF/elmm/src/master/src/rng_par_zig.f90. I didn't test its quality; I personally just need speed. But this is not a software recommendation site.
The particular random number generator used depends on the compiler. Perhaps documented, perhaps not. Perhaps subject to change. For high quality work, I would use library / source code from elsewhere. A webpage with Fortran implementations of the Mersenne Twister: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/FORTRAN/fortran.html. I have used http://theo.phys.sci.hiroshima-u.ac.jp/~ishikawa/PRNG/mt_stream_en.html and verified against the original implementation. This version is useful for multi-threaded programs.
I agree with Vladimir F. But...
To help you in your quest for better random numbers, consider the fairly recent addition to C++, the C++11 pseudo-random number generators. The Mersenne Twister and lots of others are there. It's a pretty well-thought-out system. I see two ways you could test those sequences:
make a system call from Fortran to a little C++ utility that generates a bunch of them for you or
bind the random number generators to Fortran through ISO_C_BINDING.
I prefer the second, and because the bindings are a little tricky, I have prepared an example. The files are:
cxx11_rand.h and cxx11_rand.cpp which define calls to the random number generator
c_rand.cpp which calls the C++ functions in C functions
f_rand_M.f90 which binds the C functions to Fortran
f_main.f90 which uses the module functions to generate 10 random numbers in [0,1).
cxx11_rand.h
#ifndef CXX11_RAND_H
#define CXX11_RAND_H
void cxx11_init();
double cxx11_rand();
#endif
cxx11_rand.cpp
#include <algorithm>
#include <cstdio>
#include <memory>
#include <random>
#include <set>
#include <vector>
#include <chrono>
#include <thread>
#include <iostream>
#include "cxx11_rand.h"
static std::unique_ptr<std::uniform_real_distribution<>> dist;
static std::unique_ptr<std::mt19937_64> eng;
void cxx11_init() {
    eng = std::unique_ptr<std::mt19937_64>( new std::mt19937_64(std::random_device{}()) );
    dist = std::unique_ptr< std::uniform_real_distribution<> >( new std::uniform_real_distribution<>(0.0, 1.0) );
}

double cxx11_rand() {
    return (*dist)( *eng );
}
c_rand.cpp
#include "cxx11_rand.h"
#ifdef __cplusplus
extern "C" {
#endif

void c_init() {
    cxx11_init();
}

double c_rand() {
    return cxx11_rand();
}

#ifdef __cplusplus
}
#endif
f_rand_M.f90
module f_rand_M
implicit none

!define Fortran interface bindings to C functions
interface
   subroutine fi_init() bind(C, name="c_init")
   end subroutine

   real(C_DOUBLE) function fi_rand() bind(C, name="c_rand")
      use ISO_C_BINDING, only: C_DOUBLE
   end function
end interface

contains

subroutine f_init()
   call fi_init()
end subroutine

real(C_DOUBLE) function f_rand()
   use ISO_C_BINDING, only: C_DOUBLE
   f_rand = fi_rand()
end function

end module
f_main.f90
program main
use f_rand_M
implicit none

integer :: i

call f_init()
do i = 1, 10
   write(*,*) f_rand()
end do

end program
You can compile/link with the following GNU commands
echo "compiling objects"
g++ -c --std=c++11 cxx11_rand.cpp c_rand.cpp
gfortran -c f_rand_M.f90
echo "building & executing fortran main"
gfortran f_main.f90 f_rand_M.o c_rand.o cxx11_rand.o -lstdc++ -o f_main.exe
./f_main.exe
Your output should look like this (with different random numbers, of course; the seed here was chosen from a "source of entropy", e.g. the wall time).
compiling objects
building & executing fortran main
0.47439556226575341
0.11177335018127127
0.10417488557661241
0.77378163596792404
0.20780793755332663
0.27951447624366532
0.66920698086955666
0.80676663600103105
0.98028384008440417
0.88893587108730432
I used GCC 4.9 on a Mac for testing.
A PRNG is a bad option when you are doing MC, and it does not matter which programming language you use. For MC simulations it is always a good idea to use a service like Random.org or a hardware random number generator. The book Effective Java, in Item 47, clearly shows an example of a problematic PRNG. With PRNGs you are never sure that their internal implementation is OK.

Library function capabilities of Mathematica

I am trying to use CUSP as an external linear solver for Mathematica to use the power of the GPU.
Here is the CUSP project webpage. I am asking for suggestions on how we can integrate CUSP with Mathematica. I am sure many of you here will be interested to discuss this. I think writing an input matrix to a file and then feeding it to a CUSP program is not the way to go. Using Mathematica's LibraryFunctionLoad would be a better way to pipeline the input matrix to the GPU-based solver on the fly. What would be the way to supply the matrix and the right-hand side directly from Mathematica?
Here is some CUSP code snippet.
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>

int main(void)
{
    // create an empty sparse matrix structure (HYB format)
    cusp::hyb_matrix<int, float, cusp::device_memory> A;

    // load a matrix stored in MatrixMarket format
    cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

    // allocate storage for solution (x) and right hand side (b)
    cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0);
    cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1);

    // solve the linear system A * x = b with the Conjugate Gradient method
    cusp::krylov::cg(A, x, b);

    return 0;
}
This question gives us the opportunity to discuss the compilation capabilities of Mathematica 8. It is also possible to bring up the MathLink interface of MMA. I hope people here find this problem worthy and interesting enough to ponder.
If you want to use LibraryLink (for which LibraryFunctionLoad is used to access a dynamic library function as a Mathematica downvalue), there is actually not much room for discussion: LibraryFunctions can receive Mathematica tensors of machine doubles or machine integers, and you're done.
The Mathematica MTensor format is a dense array, just as you'd naturally use in C, so if CUSP uses some other format you will need to write some glue code to translate between representations.
Refer to the LibraryLink tutorial for full details.
You will want to especially note the section "Memory Management of MTensors" in the Interaction with Mathematica page, and choose the "Shared" mode to just pass a Mathematica tensor by reference.
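To make the glue concrete, here is a minimal sketch of a LibraryLink function that receives a dense matrix and a right-hand side from Mathematica and hands their data to a GPU solver. Note that solve_on_gpu is a hypothetical wrapper around something like cusp::krylov::cg, not a real CUSP entry point:
#include "WolframLibrary.h"

/* hypothetical wrapper around the GPU solver; not a real CUSP API */
extern void solve_on_gpu(const double *A, const double *b, double *x, mint n);

DLLEXPORT mint WolframLibrary_getVersion() { return WolframLibraryVersion; }
DLLEXPORT int WolframLibrary_initialize(WolframLibraryData libData) { return LIBRARY_NO_ERROR; }

DLLEXPORT int solveSystem(WolframLibraryData libData, mint Argc,
                          MArgument *Args, MArgument Res)
{
    /* dense n x n matrix and length-n right-hand side, passed by Mathematica */
    MTensor tA = MArgument_getMTensor(Args[0]);
    MTensor tb = MArgument_getMTensor(Args[1]);
    mint n = libData->MTensor_getDimensions(tA)[0];

    /* allocate the rank-1, length-n result tensor */
    MTensor tx;
    mint dims[1] = {n};
    libData->MTensor_new(MType_Real, 1, dims, &tx);

    solve_on_gpu(libData->MTensor_getRealData(tA),
                 libData->MTensor_getRealData(tb),
                 libData->MTensor_getRealData(tx), n);

    MArgument_setMTensor(Res, tx);
    return LIBRARY_NO_ERROR;
}
On the Mathematica side you would then load it with something like (the library name is hypothetical):
solve = LibraryFunctionLoad["libgpusolve", "solveSystem", {{Real, 2}, {Real, 1}}, {Real, 1}]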

Why is MATLAB so fast in matrix multiplication?

I am making some benchmarks with CUDA, C++, C#, Java, and using MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x2048 and even bigger matrices are almost instantly multiplied.
1024x1024 2048x2048 4096x4096
--------- --------- ---------
CUDA C (ms) 43.11 391.05 3407.99
C++ (ms) 6137.10 64369.29 551390.93
C# (ms) 10509.00 300684.00 2527250.00
Java (ms) 9149.90 92562.28 838357.94
MATLAB (ms) 75.01 423.10 3133.90
Only CUDA is competitive, but I thought that at least C++ would be somewhat close and not 60 times slower. I also don't know what to think about the C# results. The algorithm is the same as in C++ and Java, but there's a giant jump from 1024 to 2048.
How is MATLAB performing matrix multiplication so fast?
C++ Code:
float temp = 0;
timer.start();
for(int j = 0; j < rozmer; j++)
{
for (int k = 0; k < rozmer; k++)
{
temp = 0;
for (int m = 0; m < rozmer; m++)
{
temp = temp + matice1[j][m] * matice2[m][k];
}
matice3[j][k] = temp;
}
}
timer.stop();
This kind of question is recurring and deserves, for once on Stack Overflow, a clearer answer than "MATLAB uses highly optimized libraries" or "MATLAB uses the MKL".
History:
Matrix multiplication (together with matrix-vector multiplication, vector-vector multiplication, and many of the matrix decompositions) is among the most important problems in linear algebra. Engineers have been solving these problems with computers since the early days.
I'm not an expert on the history, but apparently back then, everybody just rewrote his FORTRAN version with simple loops. Some standardization then came along, with the identification of "kernels" (basic routines) that most linear algebra problems needed in order to be solved. These basic operations were then standardized in a specification called: Basic Linear Algebra Subprograms (BLAS). Engineers could then call these standard, well-tested BLAS routines in their code, making their work much easier.
BLAS:
BLAS evolved from level 1 (the first version which defined scalar-vector and vector-vector operations) to level 2 (vector-matrix operations) to level 3 (matrix-matrix operations), and provided more and more "kernels" so standardized more and more of the fundamental linear algebra operations. The original FORTRAN 77 implementations are still available on Netlib's website.
Towards better performance:
So over the years (notably between the BLAS level 1 and level 2 releases: early 80s), hardware changed, with the advent of vector operations and cache hierarchies. These evolutions made it possible to increase the performance of the BLAS subroutines substantially. Different vendors then came along with their implementation of BLAS routines which were more and more efficient.
I don't know all the historical implementations (I was either not born yet or just a kid back then), but two of the most notable ones came out in the early 2000s: the Intel MKL and GotoBLAS. Your Matlab uses the Intel MKL, which is a very good, optimized BLAS, and that explains the great performance you see.
Technical details on Matrix multiplication:
So why is Matlab (the MKL) so fast at dgemm (double-precision general matrix-matrix multiplication)? In simple terms: because it uses vectorization and good caching of data. In more complex terms: see the article provided by Jonathan Moore.
Basically, when you perform your multiplication in the C++ code you provided, you are not at all cache-friendly. Since I suspect you created an array of pointers to row arrays, your accesses in your inner loop to the k-th column of matice2 (matice2[m][k]) are very slow. Indeed, when you access matice2[0][k], you must get the k-th element of array 0 of your matrix. Then in the next iteration, you must access matice2[1][k], which is the k-th element of another array (array 1). Then in the next iteration you access yet another array, and so on... Since the entire matrix matice2 can't fit in the highest cache levels (a 1024x1024 matrix of floats is already 4*1024*1024 bytes), the program must fetch the desired element from main memory, losing a lot of time.
If you just transposed the matrix, so that accesses would be in contiguous memory addresses, your code would already run much faster because now the compiler can load entire rows in the cache at the same time. Just try this modified version:
timer.start();
float temp = 0;
//transpose matice2
for (int p = 0; p < rozmer; p++)
{
    for (int q = 0; q < rozmer; q++)
    {
        tempmat[p][q] = matice2[q][p];
    }
}
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j][m] * tempmat[k][m];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();
So you can see how cache locality alone increased your code's performance quite substantially. Now, real dgemm implementations exploit that to a very extensive level: they perform the multiplication on blocks of the matrix sized to the TLB (translation lookaside buffer; long story short, what can effectively be cached), so that they stream to the processor exactly the amount of data it can process. The other aspect is vectorization: they use the processor's vectorized instructions for optimal instruction throughput, which you can't really do from cross-platform C++ code.
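To make the blocking idea concrete, here is a toy sketch (nowhere near a real dgemm; the block size 64 is an illustrative guess, and C must be zero-initialized before the call):
#include <vector>

// Toy cache-blocked multiplication: C += A * B for n x n row-major matrices,
// processed in blk-wide tiles so the working set stays cache-resident.
void blocked_matmul(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n, int blk = 64)
{
    for (int jj = 0; jj < n; jj += blk)
        for (int kk = 0; kk < n; kk += blk)
            for (int i = 0; i < n; ++i)
                for (int j = jj; j < jj + blk && j < n; ++j)
                {
                    float sum = C[i * n + j];
                    for (int k = kk; k < kk + blk && k < n; ++k)
                        sum += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = sum;
                }
}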
Finally, people claiming that it's because of Strassen's or the Coppersmith–Winograd algorithm are wrong: neither of these algorithms is practical to implement, because of the hardware considerations mentioned above.
Here's my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:
>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.
MATLAB uses highly optimized libraries for matrix multiplication which is why the plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.
Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:
>> A = rand(1024); gA = gpuArray(A);
>> timeit(@()A*A)
ans =
0.0324
>> gputimeit(@()gA*gA)
ans =
0.0022
Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:
>> timeit(@()A*A)
ans =
0.0229
>> gputimeit(@()gA*gA)
ans =
4.8019e-04
(NB: at some point (I forget when exactly) gpuArray switched from MAGMA to cuBLAS - MAGMA is still used for some gpuArray operations though)
Update using R2022a on a WIN64 machine with 32 physical cores and an A100 GPU:
>> timeit(@()A*A)
ans =
0.0076
>> gputimeit(@()gA*gA)
ans =
2.5344e-04
This is why: MATLAB doesn't perform a naive matrix multiplication by looping over every single element the way you did in your C++ code.
Of course I'm assuming that you just used C=A*B instead of writing a multiplication function yourself.
Matlab incorporated LAPACK some time ago, so I assume their matrix multiplication uses something at least that fast. LAPACK source code and documentation are readily available.
You might also look at Goto and Van De Geijn's paper "Anatomy of High-Performance Matrix Multiplication" at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.1785&rep=rep1&type=pdf
The answer is LAPACK and BLAS libraries make MATLAB blindingly fast at matrix operations, not any proprietary code by the folks at MATLAB.
Use the LAPACK and/or BLAS libraries in your C++ code for matrix operations and you should get similar performance as MATLAB. These libraries should be freely available on any modern system and parts were developed over decades in academia. Note that there are multiple implementations, including some closed source such as Intel MKL.
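As a concrete example, a minimal sketch calling dgemm through the CBLAS interface (linking against OpenBLAS is one possibility; the exact header and library names depend on your system):
#include <cblas.h>
#include <vector>

int main()
{
    int n = 1024;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, all matrices n x n, row-major
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

    return 0;
}
Build with, e.g., g++ -O3 gemm_demo.cpp -lopenblas.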
A discussion of how BLAS gets high performance is available here.
BTW, in my experience it's a serious pain to call LAPACK libraries directly from C (but worth it). You need to read the documentation VERY precisely.
When doing your matrix multiplication, you use the naive multiplication method, which takes time O(n^3).
There exist matrix multiplication algorithms which take O(n^2.4). This means that at n = 2000 your algorithm requires ~100 times as much computation as the best one.
You should really check the Wikipedia page on matrix multiplication for further information on efficient ways to implement it.
Depending on your version of Matlab, I believe it might be using your GPU already.
Another thing: Matlab keeps track of many properties of your matrix, whether it's diagonal, Hermitian, and so forth, and specializes its algorithms based on them. Maybe it is specializing based on the zero matrix you are passing it, or something like that? Maybe it is caching repeated function calls, which messes up your timings? Perhaps it optimizes away repeated unused matrix products?
To guard against such things happening, use a matrix of random numbers, and make sure you force execution by printing the result to screen or disk or some such.
The general answer to "Why is Matlab faster at doing xxx than other programs" is that Matlab has a lot of built-in, optimized functions.
The other programs that are used often do not have these functions, so people apply their own creative solutions, which are surprisingly slower than professionally optimized code.
This can be interpreted in two ways:
1) The common/theoretical way: Matlab is not significantly faster, you are just doing the benchmark wrong.
2) The realistic way: for this stuff Matlab is faster in practice because languages such as C++ are just too easily used in ineffective ways.
MATLAB uses a highly optimized implementation of LAPACK from Intel known as the Intel Math Kernel Library (Intel MKL), specifically the dgemm function. This library takes advantage of processor features including SIMD instructions and multi-core processors. They don't document which specific algorithm they use. If you were to call Intel MKL from C++ you should see similar performance.
I am not sure what library MATLAB uses for GPU multiplication but probably something like nVidia CUBLAS.
The sharp contrast is not only due to Matlab's amazing optimization (as discussed by many other answers already), but also to the way you formulated the matrix as an object.
It seems like you made your matrix a list of lists? A list of lists contains pointers to lists which then contain your matrix elements. The locations of the contained lists are assigned arbitrarily. As you are looping over your first index (the row number?), the memory access time is very significant. In comparison, why don't you try implementing the matrix as a single list/vector using the following method?
#include <vector>

struct matrix {
    matrix(int x, int y) : n_row(x), n_col(y), M(x * y) {}
    int n_row;
    int n_col;
    std::vector<double> M;
    double &operator()(int i, int j);
};
And
double &matrix::operator()(int i, int j) {
    return M[n_col * i + j];
}
The same multiplication algorithm should be used so that the number of flops is the same (n^3 for square matrices of size n).
I'm asking you to time it so that the result is comparable to what you had earlier (on the same machine). With that comparison, you will see exactly how significant memory access time can be!
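A hypothetical usage of the struct above, with the same triple loop but contiguous storage (sizes illustrative):
matrix A(1024, 1024), B(1024, 1024), C(1024, 1024);
// ... fill A and B ...
for (int i = 0; i < A.n_row; ++i)
    for (int j = 0; j < B.n_col; ++j) {
        double sum = 0;
        for (int k = 0; k < A.n_col; ++k)
            sum += A(i, k) * B(k, j);  // contiguous row of A, strided column of B
        C(i, j) = sum;
    }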
It's slow in C++ because you are not using multithreading. Essentially, if A = B C, where they are all matrices, the first row of A can be computed independently of the second row, etc. If A, B, and C are all n by n matrices, the entries of the product are all independent of each other, so the multiplication can in principle be parallelized across up to n^2 threads, since
a_{i,j} = sum_{k} b_{i,k} c_{k,j}
If you use, say, Eigen [ http://eigen.tuxfamily.org/dox/GettingStarted.html ], multithreading is built-in and the number of threads is adjustable.
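A minimal sketch of the Eigen route (assuming Eigen with OpenMP enabled, e.g. compiled with -fopenmp; the sizes and thread count are illustrative):
#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::setNbThreads(4);  // the number of threads is adjustable
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(2048, 2048);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(2048, 2048);
    Eigen::MatrixXf C = A * B;          // multithreaded GEMM under the hood
    std::cout << C(0, 0) << std::endl;  // force use of the result
}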
Because MATLAB is a programming language originally developed for numerical linear algebra (matrix manipulations), with libraries especially developed for matrix multiplication. And nowadays MATLAB can additionally use GPUs (graphics processing units) for this.
And if we look at your computation results:
1024x1024 2048x2048 4096x4096
--------- --------- ---------
CUDA C (ms) 43.11 391.05 3407.99
C++ (ms) 6137.10 64369.29 551390.93
C# (ms) 10509.00 300684.00 2527250.00
Java (ms) 9149.90 92562.28 838357.94
MATLAB (ms) 75.01 423.10 3133.90
then we can see that MATLAB is not the only one that is fast at matrix multiplication: CUDA C (NVIDIA's programming language) achieves even better results than MATLAB. CUDA C also has libraries especially developed for matrix multiplication, and it uses GPUs.
Short history of MATLAB
Cleve Moler, the chairman of the computer science department at the University of New Mexico, started developing MATLAB in the late 1970s. He designed it to give his students access to LINPACK (a software library for performing numerical linear algebra) and EISPACK (a software library for numerical computation of linear algebra) without them having to learn Fortran. It soon spread to other universities and found a strong audience within the applied mathematics community. Jack Little, an engineer, was exposed to it during a visit Moler made to Stanford University in 1983. Recognizing its commercial potential, he joined with Moler and Steve Bangert. They rewrote MATLAB in C and founded MathWorks in 1984 to continue its development. These rewritten libraries were known as JACKPAC. In 2000, MATLAB was rewritten to use a newer set of libraries for matrix manipulation, LAPACK (a standard software library for numerical linear algebra).
Source
What is CUDA C
CUDA C also uses libraries especially developed for matrix multiplication, like OpenGL (Open Graphics Library). It also uses the GPU and Direct3D (on MS Windows).
The CUDA platform is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming. Also, CUDA supports programming frameworks such as OpenACC and OpenCL.
Example of CUDA processing flow:
Copy data from main memory to GPU memory
CPU initiates the GPU compute kernel
GPU's CUDA cores execute the kernel in parallel
Copy the resulting data from GPU memory to main memory
Comparing CPU and GPU Execution Speeds
We ran a benchmark in which we measured the amount of time it took to execute 50 time steps for grid sizes of 64, 128, 512, 1024, and 2048 on an Intel Xeon Processor X5650 and then using an NVIDIA Tesla C2050 GPU.
For a grid size of 2048, the algorithm shows a 7.5x decrease in compute time from more than a minute on the CPU to less than 10 seconds on the GPU. The log scale plot shows that the CPU is actually faster for small grid sizes. As the technology evolves and matures, however, GPU solutions are increasingly able to handle smaller problems, a trend that we expect to continue.
Source
From introduction for CUDA C Programming Guide:
Driven by the insatiable market demand for realtime, high-definition 3D graphics, the programmable Graphic Processor Unit or GPU has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.
Figure 1. Floating-Point Operations per Second for the CPU and GPU
Figure 2. Memory Bandwidth for the CPU and GPU
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.
Figure 3. The GPU Devotes More Transistors to Data Processing
More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.
Source
Advanced reading
GPUs (Graphics processing unit)
MATLAB
CUDA C Programming Guide
Using GPUs in MATLAB
Basic Linear Algebra Subprograms (BLAS)
Anatomy of High-Performance Matrix Multiplication, from Kazushige Goto and Robert A. Van De Geijn
Some interesting facts
I've written a C++ matrix multiplication that is as fast as Matlab's, but it took some care. (This was before Matlab was using GPUs for this.)
Citation from this answer.

CUDA Add Rows of a Matrix

I'm trying to add the rows of a 4800x9600 matrix together, resulting in a matrix 1x9600.
What I've done is split the 4800x9600 matrix into 9,600 vectors of length 4800 each. I then perform a reduction on the 4800 elements.
The trouble is, this is really slow...
Anyone got any suggestions?
Basically, I'm trying to implement MATLAB's sum(...) function.
Here is the code which I've verified works fine, it's just it's really slow:
void reduceRows(Matrix Dresult, Matrix DA)
{
    //split DA into chunks
    Matrix Dchunk;
    Dchunk.h = 1; Dchunk.w = DA.h;
    cudaMalloc((void**)&Dchunk.data, Dchunk.h * Dchunk.w * sizeof(float));

    Matrix DcolSum;
    DcolSum.h = 1; DcolSum.w = 1;
    //cudaMalloc((void**)&DcolSum.data, DcolSum.h * DcolSum.w * sizeof(float));

    int i;
    for (i = 0; i < DA.w; i++)  //loop over each column
    {
        //printf("%d ", i);
        cudaMemcpy(Dchunk.data, &DA.data[i * DA.h], DA.h * sizeof(float), cudaMemcpyDeviceToDevice);
        DcolSum.data = &Dresult.data[i];
        reduceTotal(DcolSum, Dchunk);
    }

    cudaFree(Dchunk.data);
}
Matrix is defined as:
typedef struct {
    long w;
    long h;
    float* data;
} Matrix;
reduceTotal() just calls the standard NVIDIA reduction: it sums all the elements in Dchunk and puts the answer in DcolSum.
I'm about to do all this on the CPU if I can't find an answer... ;(
Many thanks in advance,
Instead of looping over each column, parallelize over the columns. Each of the 9600 threads sums the 4800 entries in its column and puts the sum in the appropriate place in the result vector.
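A minimal sketch of that idea as a single kernel, assuming the column-major layout implied by the cudaMemcpy in the question (the launch configuration is illustrative):
// One thread per column: thread col sums the h entries of its column.
__global__ void sumColumns(const float* data, float* result, int h, int w)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= w) return;

    float sum = 0.0f;
    for (int row = 0; row < h; ++row)
        sum += data[col * h + row];  // column-major: element (row, col)
    result[col] = sum;
}

// launch, e.g.: sumColumns<<<(9600 + 255) / 256, 256>>>(DA.data, Dresult.data, 4800, 9600);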
If you're looking for a library to make working with CUDA simpler, I highly recommend Thrust: http://code.google.com/p/thrust/
Using Thrust, I would create a functor to hold your matrix's pointer in device memory, and then map it over a sequence of column indices. The operator() of the functor would take an index, sum up everything in that column of the matrix, and return the sum. Then you would have your sums sitting in a thrust::device_vector without any memory copies (or even direct CUDA calls).
Your functor might look something like:
struct ColumnSumFunctor {
    const Matrix matrix;

    // Make a functor to sum the matrix
    ColumnSumFunctor(const Matrix& matrix);

    // Compute and return the sum of the specified column
    __device__
    int operator()(const int& column) const;
};
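A hypothetical usage of that sketch, assuming the constructor and operator() are implemented (needs <thrust/device_vector.h>, <thrust/transform.h>, and <thrust/iterator/counting_iterator.h>):
// sums[i] receives the sum of column i, computed entirely on the device
thrust::device_vector<float> sums(DA.w);
thrust::transform(thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>((int)DA.w),
                  sums.begin(),
                  ColumnSumFunctor(DA));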
Reduction is a very basic operation in GPGPU; it's supposed to be fast, and 9600 reductions shouldn't be slow either.
What graphics card are you using?
I suggest you split it into 9600 arrays, each time reducing an array of 4800 elements to one result. Instead of reduceTotal, I suggest you use CUDPP to perform the reduction; CUDPP is like the STL for CUDA, and it's implemented with performance in mind.
http://code.google.com/p/cudpp/
I think your problem is that you are launching 9600x2 kernels. This should be an easy algorithm to express as a single kernel.
The most naive way to implement it would not coalesce memory, but it could well be faster than the way you are doing it now.
Once you've got the naive way working, coalesce your memory reads: e.g. have every thread in a block read 16 consecutive floats into shared memory, syncthreads, then accumulate the relevant 16 floats into a register, syncthreads, then repeat.
The GPU Computing SDK has lots of examples of reduction techniques.

Resources