MATLAB + CUDA slow in solving the matrix-vector equation A*x=B

I am solving the equation A*x=B, where A is a matrix, B is a vector, and x is the unknown vector.
Hardware specs:
Intel i7 3630QM (4 cores),
nVidia GeForce GT 640M (384 CUDA cores)
Here's an example:
>> A=rand(5000);
>> B=rand(5000,1);
>> Agpu=gpuArray(A);
>> Bgpu=gpuArray(B);
>> tic;A\B;toc;
Elapsed time is 1.382281 seconds.
>> tic;Agpu\Bgpu;toc;
Elapsed time is 4.775395 seconds.
Somehow the GPU is much slower... why? It is also slower for FFT, INV, and LU calculations, which should be related to matrix division.
However, the GPU is much faster at matrix multiplication (same data):
>> tic;A*B;toc;
Elapsed time is 0.014700 seconds.
>> tic;Agpu*Bgpu;toc;
Elapsed time is 0.000505 seconds.
The main question is: why is GPU A\B (mldivide) so slow compared to the CPU?
UPDATED
Here are some more results when A, B (on CPU), AA, BB (on GPU) are rand(5000):
>> tic;fft(A);toc;
Elapsed time is *0.117189* seconds.
>> tic;fft(AA);toc;
Elapsed time is 1.062969 seconds.
>> tic;fft(AA);toc;
Elapsed time is 0.542242 seconds.
>> tic;fft(AA);toc;
Elapsed time is *0.229773* seconds.
The starred times are the stable ones. Even so, the GPU is almost twice as slow. By the way, why is the GPU even slower on the first two attempts? Is it compiled twice at first?
In addition:
>> tic;sin(A);toc;
Elapsed time is *0.121008* seconds.
>> tic;sin(AA);toc;
Elapsed time is 0.020448 seconds.
>> tic;sin(AA);toc;
Elapsed time is 0.157209 seconds.
>> tic;sin(AA);toc;
Elapsed time is *0.000419* seconds.
After two warm-up runs, the GPU is incredibly fast at sin calculations.
So, still: why is the GPU so slow at matrix division, FFT, and similar calculations, even though it is so fast at matrix multiplication and trigonometry? Actually, the question should not even arise... the GPU should be faster in all of these calculations, because MATLAB provides GPU-overloaded versions of these functions (mldivide, fft).
Could somebody help me solve these issues, please? :)

Please read up on how MATLAB computes the solution; it will help you understand why the GPU is slower.
I'll try to say it in a few words.
A*x=B becomes L*U*x=B with L*U=A, so setting y=U*x gives L*y=B.
First MATLAB factors A into L*U (as far as I know this process cannot be done fully in parallel; only some of its steps can be parallelized, due to their nature).
Then MATLAB solves L*y=B and finds y (this process cannot be done in parallel, as each step requires data from the previous one).
Then MATLAB solves U*x=y and finds x (this process cannot be done in parallel, as each step requires data from the previous one).
The GPU's clock is slower than the CPU's, and since these processes cannot be done in parallel, the CPU is faster. And no, unless you come up with a better method (good luck!), the GPU will always be slower except in some very specific cases.
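In MATLAB terms the three steps look roughly like this (a sketch of the same pipeline; mldivide itself does more work, such as inspecting the matrix structure first):
[L, U, P] = lu(A);   % step 1: factor P*A = L*U (only partially parallelizable)
y = L \ (P*B);       % step 2: forward substitution, inherently sequential
x = U \ y;           % step 3: back substitution, inherently sequential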

Part 1 of the explanation is in the answer from user2230360, but your question is twofold, so I'll add a bit about the multiplication.
As noted already, the LU factorization is not easily parallelized, even if some of its steps can be. Matrix multiplication, however, is very much parallelizable. If you have ever done matrix multiplication by hand, you know that the elements of C in A*B=C can be calculated in any order you want, hence the possibility of parallel computation. That is probably why you are seeing lightning-fast multiplication but slow solving of linear systems: one cannot be parallelized as much as the other.
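To make that concrete, here is a naive sketch (for illustration only; this is not how MATLAB actually computes A*B): every entry of C depends on one row of A, one column of B, and no other entry of C, so all entries could be computed simultaneously.
% naive matrix multiplication: each C(i,j) is an independent dot product
C = zeros(size(A,1), size(B,2));
for i = 1:size(A,1)
    for j = 1:size(B,2)
        C(i,j) = A(i,:) * B(:,j);   % uses no other entry of C
    end
end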

How does one compute FLOPS from time elapsed for a computation?

One can see from this tutorial on the usage of Intel MKL DFTs that Dr. Andrey E. Vladimirov uses the time elapsed during a task, namely t1 - t0, to compute the number of GigaFLOPS as GF/s = HztoPerf/(t1 - t0), where HztoPerf = 5.0 * 1e-9 * double(fft_size) * log2(double(fft_size)) * double(num_fft).
Is this a general formula? If not, how do I deduce the average GF/s for my CPU (Intel Xeon E5-1660 at 3 GHz with 8 cores) if I know the time elapsed to run a computation (e.g. involving various FFTs)?
You have to know how many FP operations your problem requires. Then you divide that by time.
1e-9 accounts for the Giga = 10^9 metric prefix. Without that, you'd have FLOP/s not GFLOP/s if you divide FLoating point OPeration count by seconds.
5.0 * fft_size * log2(fft_size) appears to be the number of FP ops per FFT.
An efficient FFT is O(n log2(n)), and apparently this implementation has a constant factor of 5. (Or possibly that's including some work done using the result?)
num_fft is presumably the total number of FFTs of that size done, i.e. the repeat count. So the product of all those things is the number of FP ops actually done during computation of the FFT.
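Putting those pieces together (with t1 - t0 in seconds):
GF/s = 5.0 * fft_size * log2(fft_size) * num_fft * 1e-9 / (t1 - t0)
For example, with made-up numbers, 1000 FFTs of size 2^20 completing in 0.5 s would give 5.0 * 2^20 * 20 * 1000 * 1e-9 / 0.5 ≈ 209.7 GF/s.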
Hardware performance counters on Intel CPUs can record number of FLOPs (even counting FMAs as 2): there are events like fp_arith_inst_retired.256b_packed_double for various SIMD widths.
perf has a GFLOPs "metric group" you can use that enables the relevant events and calculates it for you:
perf stat --all-user -M GFLOPs ./my_program my args
Counting only in user-space is probably redundant: kernel code might use SIMD for software RAID5/6, but not in interrupt handlers and probably not in system calls, and in any case that is not FP math.
Example on my i7-6700k Skylake
$ perf stat --all-user -M GFLOPs awk 'BEGIN{for(i=0;i<100000000;i++){}}'
Performance counter stats for 'awk BEGIN{for(i=0;i<100000000;i++){}}':
0 fp_arith_inst_retired.256b_packed_single # 0.03 GFLOPs (66.58%)
99,934,901 fp_arith_inst_retired.scalar_double (66.68%)
0 fp_arith_inst_retired.128b_packed_single (66.71%)
0 fp_arith_inst_retired.scalar_single (66.71%)
0 fp_arith_inst_retired.256b_packed_double (66.71%)
0 fp_arith_inst_retired.128b_packed_double (66.62%)
3,352,766,500 ns duration_time
3.352766500 seconds time elapsed
3.347268000 seconds user
0.000000000 seconds sys
Unfortunately it had to multiplex between those events, since there were more than 4, and hyperthreading is enabled, so the total number of scalar double-precision FP operations (99,934,901) was measured a bit lower than the awk loop iteration count. With just -e task_clock,cycles,instructions,fp_arith_inst_retired.scalar_double, it came out at exactly 100,000,000 counts, since apparently gawk did no other FP operations.
Of course awk is not a high-FP-throughput program, and it only used scalar FP math. Numeric variables in awk are double-precision, like in JavaScript, but unlike JS it doesn't JIT, let alone take advantage of the ability to handle them as integers.

Parallel Processing Speed up S(n)

Assume that a query is decomposed into a serial part and a parallel part. The serial part occupies 20% of the entire elapsed time, whereas the rest can be done in parallel.
Given that the one-processor elapsed time is 1 hour, what is the speed up if 10 processors are used? (For simplicity, you may assume that during the parallel processing of the parallel part the task is equally divided among all participating processors).
Even with n processors, the serial part still takes the same absolute time, 0.2 of the one-processor elapsed time (OPELT = 1 hour). The remaining 80% can be done in parallel and is therefore divided by the number of processors available, giving an n-processor elapsed time of
0.2*OPELT + (0.8*OPELT)/n
Speed up S(n) is then the ratio between the one-processor elapsed time and the n-processor elapsed time.
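Plugging in n = 10 and OPELT = 1 hour:
T(10) = 0.2*1h + (0.8*1h)/10 = 0.28 h
S(10) = OPELT / T(10) = 1 / 0.28 ≈ 3.57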
More detail can be found here.

Parallel computing and Julia

I have some performance problems with parallel computing in Julia. I am new to both Julia and parallel calculations.
In order to learn, I parallelized a piece of code that should benefit from parallelization, but it does not.
The program estimates the mean of the means of arrays whose elements are chosen randomly from a uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
    iter = 100000*2
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
@everywhere function mean_estimate(N::Int)
    iter = 100000
    p = 5000
    vec_mean = zeros(iter)
    for i = 1:iter
        vec_mean[i] = mean( rand(p) )
    end
    return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in iter = 100000*2 in the serial code is there because I tried the code on a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me@pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me@pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, there is a large difference in the total execution times.
Questions
Why is the parallelized version slower? (What am I doing wrong?)
What is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured the times as @HighPerformanceMark suggested. The tic()/toc() times are the following. The iteration number is 2E6 in every case.
Array size    Single thread    Parallel    Ratio
5000              2.69            2.89      1.07
100000          488.77          346.00      0.71
1000000        4776.58         4438.09      0.93
I am puzzled as to why there is no clear trend with array size.
You should pay close attention to the suggestions in the comments.
As @ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, then make sure your functions are type-stable. Consider annotating the return type of the function that calls pmap and/or vcat, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use the pmap source code as a springboard (see the code here).
Furthermore, as @AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally the performance of a function f() is measured in Julia by running it twice and timing only the second run. On the first run, the JIT compiler compiles f() before running it, while the second run uses the already-compiled function. Compilation incurs an (unwanted) performance cost, so timing the second run avoids measuring the compilation.
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
    vec_mean = zeros(iter)
    for i in eachindex(vec_mean)
        vec_mean[i] = f(p)
    end
    return mean(vec_mean)
end

# run twice, time on second run to get compute time
g(200000, 5000)
@time g(200000, 5000)

### output on my machine
#  2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The @time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
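As one possible remedy, here is a sketch (g2 and f! are just illustrative names) that reuses a single preallocated buffer via rand! instead of allocating a fresh vector on every iteration:
# reuse one buffer instead of allocating rand(p) each time
f!(tmp) = mean(rand!(tmp))
function g2(iter::Int, p::Int)
    tmp = zeros(p)        # allocated once
    s = 0.0
    for i in 1:iter
        s += f!(tmp)      # rand! overwrites tmp in place
    end
    return s / iter
end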
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
@time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.
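For completeness, a minimal sketch (with illustrative names, using the same API as your code) of splitting the iterations across the workers instead of running a full copy of the loop on each:
# each worker averages its own share of the iterations
@everywhere worker_mean(iter::Int, p::Int) = mean([mean(rand(p)) for j in 1:iter])

iter_total = 200000
chunks = fill(div(iter_total, nworkers()), nworkers())   # assumes nworkers() divides iter_total
the_mean = mean(pmap(n -> worker_mean(n, 5000), chunks))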

Measuring accurate GPU computation time

I'm working on a code in which I have to perform a vector-matrix multiplication on a chunk of data, copy the results back to the CPU, and then start multiplying another chunk. I perform the vector-matrix multiplication using the cuBLAS library (code below).
clock_t a, b;
a = clock();
for (int i = 0; i < n; i++)
{
    cublasSgemv(handle, CUBLAS_OP_T, m, k, &alpha, dev_b1+((i+1)*m), m, dev_b1+(i*m), 1, &beta, out, 1);
    out += (n-(i+1));
    cudaMemcpy(b3, dev_b3, sizeof(float)*(cor_size), cudaMemcpyDeviceToHost);
}
b = clock();
cout << "Running time is: " << (double)(b-a)/CLOCKS_PER_SEC;
I have to measure the running time of this for loop. I have read about CUDA events, but in my case I want to measure the time of the whole loop, not a single kernel, so I used the clock function. Is this a correct way to measure the time for this chunk of code, or are there more accurate ways to do it?
I know that to measure elapsed time we have to run the code multiple times and average the elapsed times over all runs, so another question is: is there any trade-off in choosing how many times the code should be repeated?
Thanks
cudaMemcpy synchronizes host and device, so a CPU timer such as clock_t should give results that are identical with those produced by a CUDA timer, making the necessary allowances for the granularity/resolution of clock_t.
As far as the accuracy of the measurements is concerned, from what I have seen, the first iteration's timings could be disregarded in the calculations. Subsequent timing measurements should yield numbers that depend on factors such as load imbalance in the algorithm being run, which might determine whether we get the same numbers on every iteration. I would reckon that would not be an issue here, with Sgemv.
You can still use CUDA events to measure the entire loop runtime, by recording two events (one before starting the loop, one after the end, i.e. in the positions where you are currently using clock()), synchronizing on the second event and then getting the elapsed time using cudaEventElapsedTime(). This should have the advantage of being more accurate than clock().
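A minimal sketch of that approach, reusing the variables from your snippet (error checking omitted):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);            // mark the beginning of the timed region
for (int i = 0; i < n; i++)
{
    cublasSgemv(handle, CUBLAS_OP_T, m, k, &alpha, dev_b1+((i+1)*m), m, dev_b1+(i*m), 1, &beta, out, 1);
    out += (n-(i+1));
    cudaMemcpy(b3, dev_b3, sizeof(float)*(cor_size), cudaMemcpyDeviceToHost);
}
cudaEventRecord(stop, 0);             // mark the end of the timed region
cudaEventSynchronize(stop);           // wait for all work before `stop` to finish

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
cout << "Running time is: " << elapsed_ms / 1000.0 << " s";

cudaEventDestroy(start);
cudaEventDestroy(stop);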

Matlab Convolution using gpu

I tried MATLAB's convolution functions conv2 and convn with gpuArray.
For example, convn(gpuArray.rand(100,100,10,'single'),gpuArray.rand(5,'single')) compared to the CPU version convn(rand(100,100,10),rand(5)). Unfortunately the GPU version is much slower than the CPU version, especially noticeably when I put the function into a loop (which will be relevant for me). Does anyone know an alternative for fast convolution using MATLAB and the GPU for relatively small filter kernels, from 5x5 to 14x14?
GPU performance is limited by the small data array sizes ([100x100x10] and [5x5]) in your test case.
The actual performance also depends on the GPU and CPU models. For your data size (test case 2 of the code below), I get a 2.75x performance improvement with a Tesla M2090 GPU versus a Xeon E5-2609 CPU.
Here is the MATLAB test code:
m=1000;
n=100;
k=5;
gc=convn(gpuArray.rand(m,m,10,'single'),gpuArray.rand(k,'single'));  % warm-up
tic;
for i=1:n
    gc=convn(gpuArray.rand(m,m,10,'single'),gpuArray.rand(k,'single'));
end
toc

c=convn(rand(m,m,10,'single'),rand(k,'single'));
tic;
for i=1:n
    c=convn(rand(m,m,10,'single'),rand(k,'single'));
end
toc
When m=1000; n=100; k=5; I got a very good performance improvement (11.6x) on the GPU:
Elapsed time is 2.367453 seconds.
Elapsed time is 27.502952 seconds.
But when m=100; n=1000; k=5; I got only 2.75x:
Elapsed time is 1.206053 seconds.
Elapsed time is 3.330559 seconds.
When m=100; n=1000; k=14; it gets better (4.84x):
Elapsed time is 2.804957 seconds.
Elapsed time is 13.585698 seconds.
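If convn itself stays slow at your sizes, one alternative worth benchmarking is FFT-based convolution, which maps well to the GPU. A sketch (equivalent to the 'full' output of convn for real inputs; for kernels as small as 5x5 it is not guaranteed to win):
A = gpuArray.rand(100,100,10,'single');
K = gpuArray.rand(5,'single');           % treated as a 5x5x1 kernel
sz = size(A) + [size(K) 1] - 1;          % size of the full convolution result
C = real(ifftn(fftn(A, sz) .* fftn(K, sz)));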
