OpenCL very low GFLOPS, no data transfer bottleneck

OpenCL very low GFLOPS, no data transfer bottleneck - performance

I am trying to optimize an algorithm I am running on my GPU (AMD HD6850). I counted the number of floating point operations inside my kernel and measured its execution time. I found it to achieve ~20 SP GFLOPS, however according to the GPUs specs I should achieve ~1500 GFLOPS.
To find the bottleneck I created a very simple kernel:
kernel void test_gflops(const float d, global float* result)
{
int gid = get_global_id(0);
float cd;
for (int i=0; i<100000; i++)
{
cd = d*i;
}
if (cd == -1.0f)
{
result[gid] = cd;
}
}
Running this kernel I get ~5*10^5 work_items/sec. I count one floating point operation (not sure if that's right, what about incrementing i and comparing it to 100000?) per iteration of the loop.
==> 5*10^5 work_items/sec * 10^5 FLOPS = 50 GFLOPS.
Even if there are 3 or 4 operations going on in the loop, it's much slower than the what the card should be able to do. What am I doing wrong?
The global work size is big enough (no speed change for 10k vs 100k work items).

Here are a couple of tricks:
GPU doesn't like cycles at all. Use #pragma unroll to unwind them.
Your GPU is good at vector operations. Stick to it, that will allow you to process multiple operands at once.
Use vector load/store whether it's possible.
Measure the memory bandwidth - I'm almost sure that you are bandwidth-limited because of poor access pattern.
In my opinion, kernel should look like this:
typedef union floats{
float16 vector;
float array[16];
} floats;
kernel void test_gflops(const float d, global float* result)
{
int gid = get_global_id(0);
floats cd;
cd.vector = vload16(gid, result);
cd.vector *= d;
#pragma unroll
for (int i=0; i<16; i++)
{
if(cd.array[i] == -1.0f){
result[gid] = cd;
}
}
Make your NDRange bigger to compensate difference between 16 & 1000 in loop condition.

Related

Optimize Cuda Kernel time execution

I'm a learning Cuda student, and I would like to optimize the execution time of my kernel function. As a result, I realized a short program computing the difference between two pictures. So I compared the execution time between a classic CPU execution in C, and a GPU execution in Cuda C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;
switch(computing_type)
{
case GPU:
HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));
HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));
float time;
cudaEvent_t start, stop;
HANDLE_ERROR( cudaEventCreate(&start) );
HANDLE_ERROR( cudaEventCreate(&stop) );
HANDLE_ERROR( cudaEventRecord(start, 0) );
for(int m = 0; m < nb_loops ; m++)
{
diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
}
HANDLE_ERROR( cudaEventRecord(stop, 0) );
HANDLE_ERROR( cudaEventSynchronize(stop) );
HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));
printf("Time to generate: %4.4f ms \n", time/nb_loops);
break;
case CPU:
clock_t begin = clock(), diff;
for (int z=0; z<nb_loops; z++)
{
// Apply the difference between 2 images
for (int i = 0; i < height; i++)
{
tmp = i*imgresult_pitch;
for (int j = 0; j < width; j++)
{
imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
}
}
}
diff = clock() - begin;
float msec = diff*1000/CLOCKS_PER_SEC;
msec = msec/nb_loops;
printf("Time taken %4.4f milliseconds", msec);
break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1 ,unsigned char *data2, int *data_res)
{
int row = blockIdx.x;
int col = threadIdx.x;
int v = col + row*blockDim.x;
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
}
I obtained these execution time for each one
CPU: 1,3210ms
GPU: 0,3229ms
I wonder why GPU result is not as lower as it should be. I am a beginner in Cuda so please be comprehensive if there are some classic errors.
EDIT1:
Thank you for your feedback. I tried to delete the 'if' condition from the kernel but it didn't change deeply my program execution time.
However, after having install Cuda profiler, it told me that my threads weren't running concurrently. I don't understand why I have this kind of message, but it seems true because I only have a 5 or 6 times faster application with GPU than with CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently to all the other ones. If you have an idea of what I am doing wrong, it would be hepful...
Flow.

Here are two things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything already has a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time then that; this overhead is negligible" - but you would be ignoring the fact, that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
2. "Restrict" your pointers
__restrict is not part of C++, but it is supported in CUDA. It tells the compiler that accesses through different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via non-coherent cache. Indeed, this happens with your kernel although I haven't measured the effects.
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__  unsigned int __vsubss4 ( unsigned int a, unsigned int b )
Which subtracts each signed byte value in a from its corresponding one in b. If you can "live" with the result, rather than expecting a larger int variable, that could save you some of work - and go very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.

I don't think you are measuring times correctly, memory copy is a time consuming step in GPU that you should take into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_H as constants, you may consider doing so using cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps, so block and number of threads per block work better as multiples of 8, but not larger than 512 threads per block unless your hardware supports it. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free your allocated memory in GPU and destroying your time events created.
In your kernel function you can have a single variable int v = threadIdx.x + blockIdx.x * blockDim.x .
Have you tested, beside the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() while working with arrays due to padding.

Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are considered not optimal:
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel result in warp divergence. It means that if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, if evaluates to false only at borders. To avoid the divergence and needless computation, split your image in two parts:
Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area. if is unnecessary here.
Border areas that will use your diff kernel.
Obviously you'll have modify your code that calls the kernels.
And on a separate note:
GPU has throughput-oriented architecture, but not latency-oriented as CPU. It means CPU may be faster then CUDA when it comes to processing small amounts of data. Have you tried using large data sets?
CUDA Profiler is a very handy tool that will tell you're not optimal in the code.

Slow parallel memory access in OpenCL / nVidia, what did I miss?

I'm playing around with OpenCL, Geforce GTX550 and driver version 331.38 from Ubuntu 14.04. What stumps me is the speed of copying from global to local memory. As far as I know, the following code should do coalesced access to global memory:
void toLocal(__local float* target, const __global float* source, int count) {
const int iterations = (count + get_local_size(0) - 1) / get_local_size(0);
for (int i = 0; i < iterations; i++) {
int idx = i * get_local_size(0) + get_local_id(0);
if (idx < count)
target[idx] = source[idx];
}
}
In practice, the following code (which should use all threads to copy the same float over and over again) is measurably faster:
void toLocal(__local float* target, const __global float* source, int count) {
for (int i = 0; i < count; i++)
target[i] = source[i];
}
Both source and target point directly at the beginning of a buffer, so I would guess they are correctly aligned. Group size is 16 by 16, trying to use all threads makes the code more complex but doesn't affect speed. The optimal coalescing group size would be 128 bytes or 32 floats, but as far as I know, on compute model 2 cards (which GTX550 is) the penalty of using only a part or even permuting the elements shouldn't be that bad. Adding local memory fence to the first version makes it only slower. Is there anything else I missed?
EDIT: Changing group size to 32 by 32 made the parallel version roughly as fast as sequential 16 by 16 and made the sequential version slightly slower. Still not the speed improvement I was expecting.

Need help explaining some CUDA performance results

We have been experimenting with different histogramming algorithms on a CUDA GPU. Most of the results I can explain, but we noticed some really weird features of which I have no clue what is causing them.
Kernels
The weird stuff happens in a data-parallel implementation. This means that the data is distributed over the threads. Each thread looks at a subset (ideally just 1) of the data, and adds its contribution to a histogram in global memory, which requires atomic operations.
__global__ void histogram1(float *data, uint *hist, uint n, float xMin, float binWidth, uin\
t nBins)
{
uint const nThreads = blockDim.x * gridDim.x;
uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
uint idx = tid;
while (idx < n)
{
float x = data[idx];
uint bin = (x - xMin) / binWidth;
atomicAdd(hist + bin, 1);
idx += nThreads;
}
}
As a first optimization, each block first constructs a partial histogram in shared memory before doing a reduction of partial histograms to obtain the final result in global memory. The code is pretty straightforward, and I believe that it's very similar to that used in Cuda By Example.
__global__ void histogram2(float *data, uint *hist, uint n,
float xMin, float binWidth, uint nBins)
{
extern __shared__ uint partialHist[]; // size = nBins * sizeof(uint)
uint const nThreads = blockDim.x * gridDim.x;
uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
// initialize shared memory to 0
uint idx = threadIdx.x;
while (idx < nBins)
{
partialHist[idx] = 0;
idx += blockDim.x;
}
__syncthreads();
// Calculate partial histogram (in shared mem)
idx = tid;
while (idx < n)
{
float x = data[idx];
uint bin = (x - xMin) / binWidth;
atomicAdd(partialHist + bin, 1);
idx += nThreads;
}
__syncthreads();
// Compute resulting total (global) histogram
idx = threadIdx.x;
while (idx < nBins)
{
atomicAdd(hist + idx, partialHist[idx]);
idx += blockDim.x;
}
}
Results
Speedup vs n
We benchmarked these two kernels to see how they behave as a function of n, which is the number of datapoints. The data was uniform randomly distributed. In the figure below, HIST_DP_1 is the unoptimized trivial version, whereas HIST_DP_2 is the one using shared memory to speed things up:
The timings have been taken relative to the CPU performance, and the weird stuff happens for very large datasets. The optimizing function, instead of flattening out like the unoptimized version, starts to improve again (relatively). We'd expect that for large datasets, the occupancy of our card will be near 100%, which would mean that from that point on the performance would scale linearly, like the CPU (and indeed the unoptimized blue curve).
The behavior could be due to the fact that the chance of having two threads performing an atomic operation on the same bin in shared/global memory going to zero for large data-sets, but in that case we would expect the drop to be in different places for different nBins. This is not what we observe, the drop is in all three panels at around 10^7 bins. What is happening here? Some complicated caching effect? Or is it something obvious that we missed?
Speedup vs nBins
To have a closer look at the behavior as a function of the number of bins, we fixed our dataset at 10^4 (10^5 in one case), and ran the algorithms for many different bin-numbers.
As a reference we also generated some non-random data. The red graph shows the results for perfectly sorted data, whereas the light-blue line corresponds to a dataset in which every value was identical (maximal congestion in the atomic operations). The question is obvious: what is the discontinuity doing there?
System Setup
NVidia Tesla M2075, driver 319.37
Cuda 5.5
Intel(R) Xeon(R) CPU E5-2603 0 # 1.80GHz
Thanks for your help!
EDIT: Reproduction Case
As requested: a compiling, runnable reproduction case. The code is quite long, which is why I didn't include it in the first place. The snippet is available on snipplr. To make your life even more easy, I'll include a little shell-script to run it for the same settings I used, and an Octave script to produce the plots.
Shell script
#!/bin/bash
runs=100
# format: [n] [nBins] [t_cpu] [t_gpu1] [t_gpu2]
for nBins in 100 1000 10000
do
for n in 10 50 100 200 500 1000 2000 5000 10000 50000 100000 500000 1000000 10000000 100000000
do
echo -n "$n $nBins "
./repro $n $nBins $runs
done
done
Octave script
T = load('repro.txt');
bins = unique(T(:,2));
t = cell(1, numel(bins));
for i = 1:numel(bins)
t{i} = T(T(:,2) == bins(i), :);
subplot(2, numel(bins), i);
loglog(t{i}(:,1), t{i}(:,3:5))
title(sprintf("nBins = %d", bins(i)));
legend("cpu", "gpu1", "gpu2");
subplot(2, numel(bins), i + numel(bins));
loglog(t{i}(:,1), t{i}(:,4)./t{i}(:,3), ...
t{i}(:,1), t{i}(:,5)./t{i}(:,3));
title("relative");
legend("gpu1/cpu", "gpu2/cpu");
end
Absolute Timings
Absolute timings show that it's not the CPU slowing down. Instead, the GPU is speeding up relatively:

Regarding question 1:
This is not what we observe, the drop is in all three panels at around 10^7 bins. What is happening here? Some complicated caching effect? Or is it something obvious that we missed?
This drop is due to the limit you've set on the maximum number of blocks (1<<14 == 16384). At n=10^7 gpuBench2 the limit has kicked in, and each thread starts processing multiple elements. At n=10^8 each thread works on 12 (sometimes 11) elements. If you remove this cap you can see that your performance continues to flatline.
Why is this faster? Multiple elements per thread allows for latency of the load from data to be hidden much better, especially in the case with 10000 bins where you are only able to fit one block on to each SM due to the high shared memory usage. In this case, every element in the block will reach the global load at around the same time, and none will be able to continue until it has completed its load. By having multiple elements we can pipeline these loads, getting many elements per thread for the latency of one.
(You don't see this in gupBench1 as it is not latency bound, but bandwidth bound to L2. You can see this very quickly if you import the output of nvprof into the visual profiler)
Regarding question 2:
The question is obvious: what is the discontinuity doing there?
I don't have a Fermi to hand, and I can't reproduce this on my Kepler, so I'd assume it's something that is Fermi specific. That's the danger of answering questions with two parts, I suppose!

Fastest way to simulate recurrent equations in OpenCL

I am trying to simulate a recurrent equation of the type
si(t+1) = f[Σj Wijsj(t) + vi*input(t)]
in OpenCL, where f(.) is some non-linear function (in the code below it is just a step-function with threshold th) and s(t) is some external input. Naturally, I implemented one worker for every xi. In every time-step every worker calculates the result of the equation above and subsequently this result is shared with all other workers. Therefore, all workers have to be in the same workgroup.
My current OpenCL kernel looks like this
__kernel void part1(__global int* s, __global float* W, __global float* Th, __global float* V, __global float* input, int N, int T)
{
unsigned int i = get_global_id(0);
float value = 0;
float v = V[i];
float th = Th[i];
for(int t = 0; t < T; t++){
value = v*input[t];
for(int j = 0; j < N; j++){
value = value + W[i*N + j]*s[j];
}
barrier(CLK_GLOBAL_MEM_FENCE);
if (value >= th){
s[i] = 1;
} else {
s[i] = 0;
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
Unfortunately, this code is actually three times slower than an equivalent C-implementation. Also, I expected that a change in the number of workers should not make a huge difference (because new workers are sitting on new threads that run in parallel to the others), but actually the processing time increases linearly with the number of workers. The bottleneck seems to be the writing operation after the first barrier. Eliminating this operation (but leaving the barrier in place) cuts down the processing time by a factor of 25 and eliminates the linear dependence.
I am pretty new to OpenCL and I would appreciate any help to speed this code up!
Thanks a lot in advance!
Blue2script

As I already stated in my comment, accessing global memory is slow. Usually the hardware hides the latency by having several subgroups of threads running on the same compute unit. The subgroups I'm referring to are call warps in the NVIDIA lingo and wavefronts in the AMD one. Usually a workgroup is composed of several subgroups.
So meanwhile one subgroup waits to receive the data from the global memory another one that already has all the necessary resources can run. When the running one gets stalled because it needs to read/write data from/to the global memory another one can start running and so on.
However, in your case, because of the barriers, all the workers in all the subgroups will have to that the others wrote in the memory before being able to continue the computation (the barrier is at the workgroup level). Hence the latency hits you right in the face :).
Now, a way to improve your implementation would be to use local memory and this time a barrier at the local memory level (with the flag CLK_LOCAL_MEM_FENCE). The same principle I've just explained still applies but accessing local memory is much faster.
As far as I understand your code (quite likely I didn't get all the subtleties), your s array has N elements, and I guess that you also have N workers. So you create a local array of N elements and you do something like that:
kernel void part1(global int* s, global float* W, global float* Th, global float* V, global float* input, local int* local_s, int N, int T)
{
unsigned int i = get_global_id(0);
unsigned int local_i = get_local_id(0);
float value = 0;
float v = V[i];
float th = Th[i];
//fetch from global to local and sync before computing
local_s[local_i] = s[i];
barrier(CLK_LOCAL_MEM_FENCE);
for(int t = 0; t < T; t++){
value = v*input[t];
for(int j = 0; j < N; j++){
value = value + W[i*N + j]*local_s[j];
}
barrier(CLK_LOCAL_MEM_FENCE);
if (value >= th){
local_s[i] = 1;
} else {
local_s[i] = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
//If necessary write some stuff to global (maybe the last s computed?)
}
Now I have to warn you:
I might have completely misunderstood your needs :)
I've just edited your code while typing this answer, so there are most probably typos and such.
Even using local memory, having so many barriers might still make the opencl version slower than the C one.
Note that I removed the leading __ since they are not necessary and it is easier to read in my opinion.
EDIT: Regarding your comment about CLK_LOCAL_MEM_FENCE vs. CLK_GLOBAL_MEM_FENCE. A barrier is always applied at the workgroup level, so all workers within a workgroup have to hit that barrier. The flag given as parameter refers to memory access. When the flag is CLK_GLOBAL_MEM_FENCE it means that every read/write operation regarding the global memory has to be completed by every worker before any worker can continue to run the next statements. This is exactly the same with the CLK_LOCAL_MEM_FENCE flag but for the local memory.

Using both GPU device of CUDA and zero copy pinned memory

I am using the CUSP library for sparse matrix-multiplication on CUDA a machine. My current code is
#include <cusp/coo_matrix.h>
#include <cusp/multiply.h>
#include <cusp/print.h>
#include <cusp/transpose.h>
#include<stdio.h>
#define CATAGORY_PER_SCAN 1000
#define TOTAL_CATAGORY 100000
#define MAX_SIZE 1000000
#define ELEMENTS_PER_CATAGORY 10000
#define ELEMENTS_PER_TEST_CATAGORY 1000
#define INPUT_VECTOR 1000
#define TOTAL_ELEMENTS ELEMENTS_PER_CATAGORY * CATAGORY_PER_SCAN
#define TOTAL_TEST_ELEMENTS ELEMENTS_PER_TEST_CATAGORY * INPUT_VECTOR
int main(void)
{
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
cusp::coo_matrix<long long int, double, cusp::host_memory> A(CATAGORY_PER_SCAN,MAX_SIZE,TOTAL_ELEMENTS);
cusp::coo_matrix<long long int, double, cusp::host_memory> B(MAX_SIZE,INPUT_VECTOR,TOTAL_TEST_ELEMENTS);
for(int i=0; i< ELEMENTS_PER_TEST_CATAGORY;i++){
for(int j = 0;j< INPUT_VECTOR ; j++){
int index = i * INPUT_VECTOR + j ;
B.row_indices[index] = i; B.column_indices[ index ] = j; B.values[index ] = i;
}
}
for(int i = 0;i < CATAGORY_PER_SCAN; i++){
for(int j=0; j< ELEMENTS_PER_CATAGORY;j++){
int index = i * ELEMENTS_PER_CATAGORY + j ;
A.row_indices[index] = i; A.column_indices[ index ] = j; A.values[index ] = i;
}
}
/*cusp::print(A);
cusp::print(B); */
//test vector
cusp::coo_matrix<long int, double, cusp::device_memory> A_d = A;
cusp::coo_matrix<long int, double, cusp::device_memory> B_d = B;
// allocate output vector
cusp::coo_matrix<int, double, cusp::device_memory> y_d(CATAGORY_PER_SCAN, INPUT_VECTOR ,CATAGORY_PER_SCAN * INPUT_VECTOR);
cusp::multiply(A_d, B_d, y_d);
cusp::coo_matrix<int, double, cusp::host_memory> y=y_d;
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
printf("time elaplsed %f ms\n",elapsedTime);
return 0;
}
cusp::multiply function uses 1 GPU only (as of my understanding).
How can I use setDevice() to run same program on both the GPU(one cusp::multiply per GPU) .
Measure the total time accurately.
How can I use zero-copy pinned memory with this library as I can use malloc myself.

1 How can I use setDevice() to run same program on both the GPU
If you mean "How can I perform a single cusp::multiply operation using two GPUs", the answer is you can't.
EDIT:
For the case where you want to run two separate CUSP sparse matrix-matrix products on different GPUs, it is possible to simply wrap the operation in a loop and call cudaSetDevice before the transfers and the cusp::multiply call. You will probably not, however get any speed up by doing so. I think I am correct in saying that both the memory transfers and cusp::multiply operations are blocking calls, so the host CPU will stall until they are finished. Because of this, the calls for different GPUs cannot overlap and there will be no speed up over performing the same operation on a single GPU twice. If you were willing to use a multithreaded application and have a host CPU with multiple cores, you could probably still run them in parallel, but it won't be as straightforward host code as it seems you are hoping for.
2 Measure the total time accurately
The cuda_event approach you have now is the most accurate way of measuring the execution time of a single kernel. If you had a hypthetical multi-gpu scheme, then the sum of the events from each GPU context would be the total execution time of the kernels. If, by total time, you mean the "wallclock" time to complete the operation, then you would need to either use a host timer around the whole multigpu segment of your code. I vaguely recall that it might be possible in the latest versions of CUDA to synchronize between events in streams from different contexts in some circumstances, so a CUDA event based timer might still be usable in such a scenario.
3 How can I use zero-copy pinned memory with this library as I can use malloc myself.
To the best of my knowledge that isn't possible. The underlying thrust library CUSP uses can support containers using zero copy memory, but CUSP doesn't expose the necessary mechanisms in the standard matrix constructors to be able to use allocate a CUSP sparse matrix in zero copy memory.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio