Matlab Convolution using gpu - performance

I tried MATLAB's convolution functions conv2 and convn with gpuArray, for example convn(gpuArray.rand(100,100,10,'single'),gpuArray.rand(5,'single')), and compared the result to the CPU version convn(rand(100,100,10),rand(5)). Unfortunately the GPU version is much slower than the CPU version, which is especially noticeable when I put the call into a loop (as I will need to do). Does anyone know a fast alternative for convolution in MATLAB on the GPU for relatively small filter kernels, from 5x5 to 14x14?

In your test case the GPU performance is limited by the small data size ([100x100x10] convolved with [5x5]).
The actual performance also depends on the GPU and CPU models. For your data size (test case 2 of the code below), I get a 2.75x speedup on a Tesla M2090 GPU compared to a Xeon E5-2609 CPU.
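For the following MATLAB test code:
m=1000;   % side length of each image
n=100;    % number of timed iterations
k=5;      % side length of the filter kernel
% GPU: one warm-up call, excluded from the timing
gc=convn(gpuArray.rand(m,m,10,'single'),gpuArray.rand(k,'single'));
tic;
for i=1:n
gc=convn(gpuArray.rand(m,m,10,'single'),gpuArray.rand(k,'single'));
end
toc
% CPU: same procedure; note that both timed loops include the random-number generation
c=convn(rand(m,m,10,'single'),rand(k,'single'));
tic;
for i=1:n
c=convn(rand(m,m,10,'single'),rand(k,'single'));
end
toc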
With m=1000; n=100; k=5; I got a very good performance improvement (11.6x) on the GPU. In each pair of timings below, the first line is the GPU loop and the second is the CPU loop.
Elapsed time is 2.367453 seconds.
Elapsed time is 27.502952 seconds.
But with m=100; n=1000; k=5; I got only 2.75x.
Elapsed time is 1.206053 seconds.
Elapsed time is 3.330559 seconds.
With m=100; n=1000; k=14; it gets better (4.84x).
Elapsed time is 2.804957 seconds.
Elapsed time is 13.585698 seconds.
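One caveat about the test code above: the timed loops also generate new random arrays on every iteration, and tic/toc on its own may not wait for queued GPU work to finish. Below is a minimal sketch that times only the convolution. It assumes a MATLAB release that provides timeit and gputimeit, plus the Parallel Computing Toolbox and a supported GPU. The variable names a, h, ag and hg are mine, not from the code above.

```matlab
% Sketch: time only the convolution, with the input data created once up front.
m = 100;                         % image side length (test case 2 above)
k = 5;                           % kernel side length

a  = rand(m, m, 10, 'single');   % host data
h  = rand(k, 'single');          % host kernel
ag = gpuArray(a);                % device copies of the same data
hg = gpuArray(h);

tCpu = timeit(@() convn(a, h));        % repeated CPU runs, returns a typical time
tGpu = gputimeit(@() convn(ag, hg));   % waits for the GPU to finish before stopping the clock

fprintf('CPU %.6f s, GPU %.6f s, ratio %.2fx\n', tCpu, tGpu, tCpu / tGpu);
```

With the data generation and transfers moved out of the timed region, the ratio reflects the convolution alone, so it may differ from the figures reported above.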

Related

How to get equivalent CPU time-difference of GPU rendering time

How to convert the time difference given by the GPU timer while rendering into the equivalent CPU timing?
Let's say,
glGetQueryObjectuiv(query, GL_QUERY_RESULT, &elapsed_time) - will return the elapsed time for that query and I presume this elapsed time will correspond to GPU frequency.
How to get the corresponding CPU time which is equivalent to the GPU elapsed time?
It's a timer query - it returns a time in nanoseconds. Time doesn't change with frequency ...

Matlab parallel computing running slower than expected

I am trying to use the MATLAB Parallel Computing Toolbox. My PC has 6 'workers' or cores, so I would expect my code to run roughly 6x as fast (i.e. a ~600% increase in speed). However, when I actually time the operations, I find I am only getting roughly a 40% increase in speed.
Is this normal, or am I doing something wrong?
Here is my code:
N=5000;
%%Parrelel
pp=parpool(6)
ts=tic;
parfor i=1:12
q=eye(N); q^-1;
end
disp(['Time for Parrelel Computation: ' num2str(toc(ts),3) 's']);
delete(pp);
%Serial
ts=tic;
for i=1:12
q=eye(N); q^-1;
end
disp(['Time for Serial Computation: ' num2str(toc(ts),3) 's']);
The readout is:
Time for Parrelel Computation: 24.6s
Time for Serial Computation: 35.9s
I would have expected the parallel computation to take roughly 35/6, or about 6s, not 24s.
Any advice?
Thanks
Roman
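
One thing that may be worth checking (an assumption on my part, the post does not confirm it): MATLAB's dense linear algebra is implicitly multithreaded, so the serial loop may already be using several cores, while parpool workers run single-threaded by default. Here is a rough sketch that compares one inversion at the default thread count against a single thread. The matrix size and the test matrix q are hypothetical choices to keep the run short.

```matlab
% Sketch: does a single inversion already use several cores?
N = 1000;                        % smaller than the N=5000 in the post, to keep the run short
q = rand(N) + N*eye(N);          % hypothetical test matrix, diagonally dominant and hence invertible

nDefault = maxNumCompThreads;    % number of computational threads MATLAB currently uses
tDefault = timeit(@() inv(q));   % inversion with the default thread count

maxNumCompThreads(1);            % restrict MATLAB to one computational thread
tSingle = timeit(@() inv(q));    % same inversion on a single thread
maxNumCompThreads(nDefault);     % restore the original setting

fprintf('%d threads: %.3f s, 1 thread: %.3f s\n', nDefault, tDefault, tSingle);
```

If the single-thread time is clearly larger, the serial loop was already running in parallel internally, and a 6x gain from parfor should not be expected.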

Measuring elapsed CPU time in Julia

Many scientific computing languages make a distinction between absolute time (wall clock) and CPU time (processor cycles). For example, in Matlab we have:
>> tic; pause(1); toc
Elapsed time is 1.009068 seconds.
>> start = cputime; pause(1); elapsed = cputime - start
elapsed =
0
and in Mathematica we have:
In[1]:= AbsoluteTiming[Pause[1]]
Out[1]= {1.0010572, Null}
In[2]:= Timing[Pause[1]]
Out[2]= {0., Null}
This distinction is useful when benchmarking code run on computation servers, where there may be high variance in the absolute timing results depending on what other processes are running concurrently.
The Julia standard library provides support for timing of expressions through tic(), toc(), @time and a few other functions/macros, all based on time_ns(), a function that measures absolute time.
julia> @time sleep(1)
elapsed time: 1.017056895 seconds (135788 bytes allocated)
My question: Is there a simple way to get the elapsed CPU time for an expression evaluation in Julia?
(Side note: doing some digging, it appears that Julia timing is based on the uv_hrtime() function from libuv. It seems to me that using uv_getrusage from the same library might give a way to access elapsed CPU time in Julia, but I'm no expert. Has anybody tried using anything like this?)
I couldn't find any existing solutions, so I've put together a package with some simple CPU timing functionality here: https://github.com/schmrlng/CPUTime.jl. The package is completely untested on parallel code and may have other bugs, but if anybody else would like to try it out, calling
Pkg.clone("https://github.com/schmrlng/CPUTime.jl.git")
from the julia> prompt should install the package.
Julia does have the commands tic() and toc() which work just like tic and toc in Matlab:
julia> tic(); 7^1000000000; toc()
elapsed time: 0.046563597 seconds
0.046563597

Matlab + CUDA slow in solving matrix-vector equation A*x=B

I am calculating an equation A*x=B, where A is a matrix and B is a vector, x is answer (unknown) vector.
Hardware specs:
Intel i7 3630QM (4 cores),
nVidia GeForce GT 640M (384 CUDA cores)
Here's an example:
>> A=rand(5000);
>> B=rand(5000,1);
>> Agpu=gpuArray(A);
>> Bgpu=gpuArray(B);
>> tic;A\B;toc;
Elapsed time is 1.382281 seconds.
>> tic;Agpu\Bgpu;toc;
Elapsed time is 4.775395 seconds.
Somehow the GPU is much slower... Why? It is also slower in FFT, INV and LU calculations, which should be related to matrix division.
However, GPU is much faster in matrix multiplication (the same data):
>> tic;A*B;toc;
Elapsed time is 0.014700 seconds.
>> tic;Agpu*Bgpu;toc;
Elapsed time is 0.000505 seconds.
The main question is: why is A\B (mldivide) so slow on the GPU compared to the CPU?
UPDATED
Here are some more results when A, B (on CPU), AA, BB (on GPU) are rand(5000):
>> tic;fft(A);toc;
Elapsed time is *0.117189* seconds.
>> tic;fft(AA);toc;
Elapsed time is 1.062969 seconds.
>> tic;fft(AA);toc;
Elapsed time is 0.542242 seconds.
>> tic;fft(AA);toc;
Elapsed time is *0.229773* seconds.
>> tic;fft(AA);toc;
Times marked with asterisks are the stable times. Even so, the GPU is almost twice as slow. By the way, why is the GPU even slower on the first two attempts? Is it compiled twice at first?
In addition:
>> tic;sin(A);toc;
Elapsed time is *0.121008* seconds.
>> tic;sin(AA);toc;
Elapsed time is 0.020448 seconds.
>> tic;sin(AA);toc;
Elapsed time is 0.157209 seconds.
>> tic;sin(AA);toc;
Elapsed time is *0.000419* seconds.
After two calculations the GPU is incredibly fast at sin calculations.
So, still, why is the GPU so slow at matrix division, fft and similar calculations, though it is so fast at matrix multiplication and trigonometry? The question should not even arise... The GPU should be faster in all these calculations, because MATLAB provides overloaded functions (mldivide, fft) for the GPU.
Could somebody help me solve these issues, please? :)
Please read up on how MATLAB computes the solution. It will help you understand why the GPU is slower.
I'll try to say it in a few words.
A*x=B becomes L*(U*x)=B with L*U=A; writing y=U*x gives two triangular systems, L*y=B and U*x=y.
So MATLAB first factors A into L*U. (As far as I know this process cannot be fully parallelized, although some steps can be, due to their nature.)
Then MATLAB solves L*y=B and finds y. (This cannot be done in parallel, as each step requires data from the previous one.)
Then MATLAB solves U*x=y and finds x. (This cannot be done in parallel either, as each step requires data from the previous one.)
So if the GPU clock is slower than the CPU clock, and since these processes cannot be done in parallel, the CPU is faster. And no, unless you come up with a better method (good luck!), the GPU will always be slower, except in some very specific cases.
Part 1 of the explanation is in the answer from user2230360, but your question is twofold, so I'll add a bit about the multiplication.
As noted already, the LU factorization is not very easily parallelized even if some steps can be. Matrix multiplication, however, is very much parallelizable. If you're working with these things you should be able to do matrix multiplication by hand, and then you will know that calculating the elements of the matrix C in A*B=C can be done in any order you want - hence the possibility for parallel computation. That is probably why you're seeing so lightning fast multiplication, but slow solving of linear systems. One can't be parallelized "as much as the other".
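The three steps from the first answer can be sketched in a few lines. This is only an illustration on a small hypothetical system (A, B and the loop variables are my choices), not how mldivide is implemented internally. It also includes the row permutation P that lu returns, which the answer above leaves out.

```matlab
% Sketch: solve A*x = B via LU factorization and two triangular solves.
n = 5;
A = rand(n) + n*eye(n);      % hypothetical, diagonally dominant (hence invertible) matrix
B = rand(n, 1);

[L, U, P] = lu(A);           % P*A = L*U, L is unit lower triangular
pb = P * B;                  % apply the same row permutation to the right-hand side

% Forward substitution for L*y = P*B: y(i) depends on y(1), ..., y(i-1).
y = zeros(n, 1);
for i = 1:n
    y(i) = pb(i) - L(i, 1:i-1) * y(1:i-1);
end

% Back substitution for U*x = y: x(i) depends on x(i+1), ..., x(n).
x = zeros(n, 1);
for i = n:-1:1
    x(i) = (y(i) - U(i, i+1:n) * x(i+1:n)) / U(i, i);
end

fprintf('residual %.2e, difference from A\\B %.2e\n', norm(A*x - B), norm(x - A\B));
```

Each y(i) needs all earlier y values and each x(i) needs all later x values, which is the sequential dependence both answers refer to. The entries of a matrix product, by contrast, can each be computed independently.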

does matlab cache solutions for eigs

I seem to be getting different performance results when using eigs. On the same matrix, calling
[c, v] = eigs(A, 2, 'sm');
sometimes takes 30 seconds and sometimes 2 seconds.
I need to know whether there is a speedup from some kind of caching on subsequent calls to eigs on the same matrix, since I need to report the times...
If so, this doesn't appear to be a generic feature. I ran this test from the command line
A = randn(10000);
B = randn(10000);
C = B;
tic; [c1,v1] = eigs(A,2,'sm'); toc;
tic; [c2,v2] = eigs(A,2,'sm'); toc;
tic; [c3,v3] = eigs(B,2,'sm'); toc;
tic; [c4,v4] = eigs(C,2,'sm'); toc
and got this result
Elapsed time is 32.373128 seconds.
Elapsed time is 28.412905 seconds.
Elapsed time is 32.752616 seconds.
Elapsed time is 29.024055 seconds.
I'm surprised, because usually MATLAB tries to outsmart you and will store results for reuse.
Under some circumstances, a large enough matrix might push things into virtual memory, or not, depending upon whether there is a large enough block of contiguous RAM available. Or, you may be doing something on the side.
You can verify what is happening by watching a process monitor as you do the test. Are there suddenly large amounts of disk accesses? If so, then virtual memory is being touched. Is there a different, unrelated process active that is hogging the CPU?
