Suppose a warp contains 32 threads and the GPU has 32 SIMD lanes. Each thread computes one iteration of the loop
for (j = 0; j < 32; j++) {
if (A[j] > 20) {
...
} else {
...
}
}
Now suppose that for every even j with 0 <= j < 32, A[j] > 20 and A[j+1] <= 20, so there is branch divergence. Why is this bad for SIMD utilisation (warp divergence)? Since every thread has its own SIMD lane, if one thread executes a different branch this shouldn't affect the others, because they run in parallel, should it? I'm very new to this topic, so I apologise in advance if this question is poorly formulated.
We define the SIMD utilization of a program that runs on a GPU as the fraction of SIMD lanes that are kept busy with active threads during the run of the program.
Since different SIMD lanes cannot execute different operations at the same time, the GPU compiler transforms the code so that both the if and the else case are computed by all lanes (by different instructions). The computation is masked so that, from the user's point of view, each lane appears to produce only the result of its own branch. However, this trick has a strong impact on the performance of the program, and that is why divergence is bad.
To make it clearer what is going on under the hood, here is an example of possible generated assembly code:
reg_0 <- load 32 int32_t from *A
mask_0 <- reg_0 > 20 (mask_0 is a SIMD register of 32 booleans)
mask_1 <- not mask_0
reg_1 <- operation based on reg_0 masked by mask_0 (if)
reg_2 <- operation based on reg_0 masked by mask_1 (else)
reg_3 <- reg_1 or reg_2 (merge/blend of the two results)
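To connect this back to the original loop: conceptually (this is just an illustrative sketch of mine, with made-up helper names computeThen/computeElse, not the actual generated code), every lane ends up doing the work of both branches and a mask selects which result survives:

bool mask       = (A[j] > 20);        // per-lane predicate
int  thenResult = computeThen(A[j]);  // every lane pays for the "if" path
int  elseResult = computeElse(A[j]);  // every lane also pays for the "else" path
int  result     = mask ? thenResult : elseResult;  // blend: only one value is kept

So with the alternating A[] pattern above, half the lanes are masked off while each branch is computed, and SIMD utilization drops to roughly 50% during the divergent region.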
This fabulous post taught me a lot, but I still have a question. For the following code:
double multiply(std::vector<double> const& a, std::vector<double> const& b) {
    double tmp(0);
    int active_levels = omp_get_active_level();
    #pragma omp parallel for reduction(+:tmp) if(active_levels < 1)
    for (unsigned int i = 0; i < a.size(); i++) {
        tmp += a[i] + b[i];
    }
    return tmp;
}
If multiply() is called from another parallel part:
#pragma omp parallel for
for (int i = 0; i < count; i++) {
multiply(a[i], b[i]);
}
The number of outer-loop iterations depends on the count variable. If count is a big number, this is reasonable. But if count is only 1 and our server is a many-core machine (e.g. it has 512 cores), then the multiply() function generates only 1 thread, so in this case the server is under-utilized. By the way, the answer also mentioned:
In any case, writing such code is a bad practice. You should simply leave the parallel regions as they are and allow the end user choose whether nested parallelism should be enabled or not.
So how to balance the thread number in nested case when using OpenMP?
Consider using OpenMP tasks (omp taskloop within one parallel region and an intermediate omp single), as sketched below. This allows you to flexibly use the threads in OpenMP on different nesting levels instead of manually defining numbers of threads for each level or oversubscribing OS threads.
However this comes at increased scheduling costs. At the end of the day, there is no perfect solution that will always do best. Instead you will have to keep measuring and analyzing your performance on practical inputs.
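For example, a taskloop-based variant of the code above might look roughly like this (a sketch, not a drop-in replacement: the reduction clause on taskloop needs OpenMP 5.0, and the grainsize value is just a placeholder you would have to tune):

#include <omp.h>
#include <vector>

double multiply(std::vector<double> const& a, std::vector<double> const& b) {
    double tmp = 0;
    // Generates tasks that any idle thread of the enclosing team can pick up,
    // no matter from which nesting level multiply() is called.
    #pragma omp taskloop reduction(+:tmp) grainsize(10000)
    for (unsigned int i = 0; i < a.size(); i++)
        tmp += a[i] + b[i];
    return tmp;
}

void driver(std::vector<std::vector<double>> const& a,
            std::vector<std::vector<double>> const& b,
            std::vector<double>& out, int count) {
    #pragma omp parallel   // one pool of threads for all nesting levels
    #pragma omp single     // one thread creates the outer tasks
    #pragma omp taskloop
    for (int i = 0; i < count; i++)
        out[i] = multiply(a[i], b[i]);
}

Whether count is 1 or 1000, the same pool of threads is used; the inner taskloop simply generates more tasks when the outer level does not provide enough parallelism.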
I've got a strange performance inversion on a filter kernel with and without branching. The kernel with branching runs ~1.5x faster than the kernel without branching.
Basically I need to sort a bunch of radiance rays and then apply interaction kernels. Since there is a lot of accompanying data, I can't use something like thrust::sort_by_key() many times.
Idea of the algorithm:
Run a loop over all possible interaction types (there are five)
On every iteration each warp thread votes for its interaction type
After loop completion every warp thread knows about the other threads with the same interaction type
Threads elect their leader (per interaction type)
The leader updates the interaction offsets table using atomicAdd
Each thread writes its data to the corresponding offset
I used techniques described in this Nvidia post https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
My first kernel contains a branch inside the loop and runs in ~5 ms:
int active;
int leader;
int warp_progress;
for (int i = 0; i != hit_interaction_count; ++i)
{
if (i == decision)
{
active = __ballot(1);
leader = __ffs(active) - 1;
warp_progress = __popc(active);
}
}
My second kernel uses a lookup table of two elements, has no branching, and runs in ~8 ms:
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
const int masks[2] = { 0, ~0 };
int mask = masks[i == decision];
active |= (mask & __ballot(mask));
}
int leader = __ffs(active) - 1;
int warp_progress = __popc(active);
Common part:
int warp_offset;
if (lane_id() == leader)
warp_offset = atomicAdd(&interactions_offsets[decision], warp_progress);
warp_offset = warp_broadcast(warp_offset, leader);
...copy data here...
How can that be? Is there any way to implement such a filter kernel so that it runs faster than the branching one?
UPD: The complete source code can be found in the filter_kernel function in cuda_equation/radiance_cuda.cu at https://bitbucket.org/radiosity/engine/src
I think this is CPU programmer brain deformation. On a CPU I expected a performance boost from eliminating the branch and the branch misprediction penalty.
But there is no branch prediction on a GPU and hence no penalty, so only the instruction count matters.
First I need to rewrite the code in a simplified form.
With branch:
int active;
for (int i = 0; i != hit_interaction_count; ++i)
if (i == decision)
active = __ballot(1);
Without branch:
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
int mask = 0 - (i == decision);
active |= (mask & __ballot(mask));
}
In the first version there are ~3 operations: the compare, the if and __ballot().
In the second version there are ~5 operations: the compare, building the mask, __ballot(), & and |=.
And there are ~15 ops in the common code.
Both loops run for 5 iterations, which gives roughly 35 ops in total for the first version and 45 for the second. That difference can explain the performance degradation.
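(A side note of mine, not part of the original measurement: __ballot() is deprecated since CUDA 9, so on current toolkits the branching version would be written with the sync variants, e.g.:)

int active = 0, leader = -1, warp_progress = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
    if (i == decision)
    {
        // __activemask() inside the branch names exactly the threads whose
        // decision equals i, i.e. the threads taking part in this ballot
        active        = __ballot_sync(__activemask(), 1);
        leader        = __ffs(active) - 1;
        warp_progress = __popc(active);
    }
}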
I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program computing the difference between two pictures, and I compared the execution time between a classic CPU execution in C and a GPU execution in CUDA C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;
switch(computing_type)
{
case GPU:
HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));
HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));
float time;
cudaEvent_t start, stop;
HANDLE_ERROR( cudaEventCreate(&start) );
HANDLE_ERROR( cudaEventCreate(&stop) );
HANDLE_ERROR( cudaEventRecord(start, 0) );
for(int m = 0; m < nb_loops ; m++)
{
diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
}
HANDLE_ERROR( cudaEventRecord(stop, 0) );
HANDLE_ERROR( cudaEventSynchronize(stop) );
HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));
printf("Time to generate: %4.4f ms \n", time/nb_loops);
break;
case CPU:
clock_t begin = clock(), diff;
for (int z=0; z<nb_loops; z++)
{
// Apply the difference between 2 images
for (int i = 0; i < height; i++)
{
tmp = i*imgresult_pitch;
for (int j = 0; j < width; j++)
{
imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
}
}
}
diff = clock() - begin;
float msec = diff*1000/CLOCKS_PER_SEC;
msec = msec/nb_loops;
printf("Time taken %4.4f milliseconds", msec);
break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1 ,unsigned char *data2, int *data_res)
{
int row = blockIdx.x;
int col = threadIdx.x;
int v = col + row*blockDim.x;
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
}
I obtained these execution times for each one:
CPU: 1.3210 ms
GPU: 0.3229 ms
I wonder why the GPU result is not as low as it should be. I am a beginner in CUDA, so please be understanding if I have made some classic errors.
EDIT1:
Thank you for your feedback. I tried to delete the if condition from the kernel, but it didn't significantly change my program's execution time.
However, after installing the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this kind of message, but it seems true, because my application is only 5 or 6 times faster with the GPU than with the CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the other ones. If you have an idea of what I am doing wrong, it would be helpful...
Here are two things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything at all already carries a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible" - but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
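A possible shape for that (my sketch, not code from the question; it reads 4 pixels at a time through uchar4 and assumes the image size is a multiple of 4, with the device pointers cast to uchar4*/int4* at the call site):

__global__ void diff4(const uchar4 *data1, const uchar4 *data2,
                      int4 *data_res, int quads)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per group of 4 pixels
    if (i < quads)
    {
        uchar4 a = data1[i];
        uchar4 b = data2[i];
        data_res[i] = make_int4(b.x - a.x, b.y - a.y, b.z - a.z, b.w - a.w);
    }
}
// launch, e.g.: diff4<<<(quads + 255) / 256, 256>>>(d_img1, d_img2, d_res, size / 4);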
2. "Restrict" your pointers
__restrict__ is not part of standard C++, but it is supported in CUDA. It tells the compiler that accesses through the different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via non-coherent cache. Indeed, this happens with your kernel although I haven't measured the effects.
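For the kernel in question, that would just be a change of signature (the body stays the same):

__global__ void diff(const unsigned char * __restrict__ data1,
                     const unsigned char * __restrict__ data2,
                     int * __restrict__ data_res)
{
    // ... same body as before ...
}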
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ unsigned int __vsubss4 ( unsigned int a, unsigned int b )
which performs a per-byte subtraction with signed saturation, i.e. computes a - b for each of the four signed bytes packed in a and b. If you can "live" with that result, rather than expecting a wider int per element, it could save you some of the work - and it goes very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.
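A hedged sketch of what that could look like (mine, and not a drop-in replacement: the output becomes 4 packed, saturated signed bytes per 32-bit word instead of 4 separate ints, and pixel values above 127 are reinterpreted as negative):

__global__ void diff_packed(const unsigned int * __restrict__ data1,
                            const unsigned int * __restrict__ data2,
                            unsigned int * __restrict__ data_res, int words)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per 4 packed pixels
    if (i < words)
        data_res[i] = __vsubss4(data2[i], data1[i]);  // per-byte saturated data2 - data1
}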
I don't think you are measuring times correctly: memory copies are a time-consuming step on the GPU that you should take into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_W as constants; you may consider setting them using cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps of 32 threads, so the number of threads per block works better as a multiple of 32, but not larger than 512 threads per block unless your hardware supports more. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free the memory you allocated on the GPU and to destroy the timing events you created.
In your kernel function you can use a single index variable, int v = threadIdx.x + blockIdx.x * blockDim.x.
Have you tested, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() when working with 2D arrays, due to padding.
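Putting the launch-configuration and indexing suggestions together, a sketch (mine; it assumes a flat, un-pitched layout and passes the pixel count explicitly):

__global__ void diff(unsigned char *data1, unsigned char *data2,
                     int *data_res, int size)
{
    int v = threadIdx.x + blockIdx.x * blockDim.x;  // single global index
    if (v < size)
        data_res[v] = (int) data2[v] - (int) data1[v];
}
// host side: 128 threads per block, enough blocks to cover every pixel
// diff<<<(size + 127) / 128, 128>>>(dev_data1, dev_data2, dev_data_res, size);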
Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are not optimal:
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel can result in warp divergence. It means that the if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, the if evaluates to false only at the borders. To avoid the divergence and the needless computation, split your image in two parts:
The central part, where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area; the if is unnecessary here.
Border areas that will use your diff kernel.
Obviously you'll have to modify the code that calls the kernels.
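For the central part, the kernel then reduces to something like this (my sketch; it assumes the launch covers exactly the interior pixels, so no bounds check is needed):

__global__ void diff_interior(const unsigned char *data1,
                              const unsigned char *data2, int *data_res)
{
    // every launched thread maps to a valid pixel, so the if can be dropped
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    data_res[v] = (int) data2[v] - (int) data1[v];
}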
And on a separate note:
The GPU has a throughput-oriented architecture, not a latency-oriented one like the CPU. It means the CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried using larger data sets?
The CUDA profiler is a very handy tool that will tell you where your code is not optimal.
I have two very simple pieces of code. I am trying to parallelize them as follows:
double sk = 0, ed = 0;
#pragma omp parallel shared(Z,Zo,U1,U2,U3) private(i) reduction(+: sk, ed)
{
#pragma omp for
for (i=0;i<imgDim;i++)
{
sk += (Z[i]-Zo[i])*(Z[i]-Zo[i]);
ed += U1[i]*U1[i] + U2[i]*U2[i] + U3[i]*U3[i];
}
}
//////////////////////////////////////////////////////////////////////////////////////
double rk = 0, epri = 0, ex = 0, ez = 0;
#pragma omp parallel shared(X,Z) private(i) reduction(+: rk, ex,ez)
{
#pragma omp for
for(i = 0; i<imgDim; i++)
{
rk += (X[0][i]-Z[i])*(X[0][i]-Z[i]) + (X[1][i]-Z[i])*(X[1][i]-Z[i]) + (X[2][i]-Z[i])*(X[2][i]-Z[i]);
ex += X[0][i]*X[0][i] + X[1][i]*X[1][i] + X[2][i]*X[2][i];
ez += Z[i]*Z[i];
}
}
Z, Zo, U1, U2, U3 and X are all big matrices, and imgDim is 4 million. The speed-up is not as expected: on a 16-core machine, the speed-up of these two small pieces of code is only about 2x. I do not understand why OpenMP behaves this way, because these codes only add things up; this should be exactly what OpenMP is good at.
Even stranger, MPI slows things down when I try to parallelize these codes using MPI as follows:
int startval = imgDim*pid/np;
int endval = imgDim*(pid+1)/np-1;
int ierr;
double p_sum_sk = 0;
double p_sum_ed = 0;
for (i=startval;i<=endval;i++)
{
sk += (Z[i]-Zo[i])*(Z[i]-Zo[i]);
ed += U1[i]*U1[i] + U2[i]*U2[i] + U3[i]*U3[i];
}
MPI_Reduce(&sk, &p_sum_sk, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Reduce(&ed, &p_sum_ed, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Bcast(&sk, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(&ed, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
/////////////////////////////////////////////////////////////////////////////////////
int startval = imgDim*pid/np;
int endval = imgDim*(pid+1)/np-1;
double p_sum_rk = 0.;
double p_sum_ex = 0.;
double p_sum_ez = 0.;
for(i = startval; i<=endval; i++)
{
rk = rk + (X[0][i]-Z[i])*(X[0][i]-Z[i]) + (X[1][i]-Z[i])*(X[1][i]-Z[i]) + (X[2][i]-Z[i])*(X[2][i]-Z[i]);
ex += X[0][i]*X[0][i] + X[1][i]*X[1][i] + X[2][i]*X[2][i];
ez += Z[i]*Z[i];
}
MPI_Reduce(&rk,&p_sum_rk,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
MPI_Reduce(&ex,&p_sum_ex,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
MPI_Reduce(&ez,&p_sum_ez,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
MPI_Bcast(&rk,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
MPI_Bcast(&rk,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
MPI_Bcast(&epri,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
np is the number of processes and pid is the rank of the current process. Even after I use 32 or 64 processes, there is no speed-up; it is even slower than the sequential code. I do not understand why. These codes just add things up; OpenMP and MPI should be good at it. Can anyone give me a hand?
Your code is memory bound - you load a huge amount of data on each iteration and make simple (i.e. fast) computations over it. If imgDim is 4 million, then even if each element of Z, Zo, U1, U2, U3 is as short as 4 bytes (e.g. they are float or int arrays), their total size would be 80 MiB and this would not fit in the last-level CPU cache even given a dual-socket system. Things would be worse if these arrays hold double values (as hinted by the fact that you reduce into double variables), as it would bump up the memory size twofold. Also, if you use a decent compiler, which is able to vectorise the code (e.g. icc does it by default, GCC requires -ftree-vectorize), even a single thread would be able to saturate the memory bandwidth of the CPU socket and then running with more than one thread would bring no benefit whatsoever.
I would say that the 2x OpenMP speed-up that you observe on a 16-core system comes from the fact that this system has two CPU sockets and is NUMA, i.e. it has a separate memory controller on each socket and hence when running with 16 threads you utilise twice the memory bandwidth of the single socket. This could be verified if you run the code with two threads only, but bind them in a different way: one thread per core on the same socket or one thread per core but on different sockets. In the first case there should be no speed-up while in the second case the speed-up should be about 2x. Binding threads to cores is (yet) implementation dependent - you could take a look at GOMP_CPU_AFFINITY for GCC and KMP_AFFINITY if you happen to use Intel compilers.
The same applies to the MPI case. Now you have processes instead of threads, but the memory bandwidth limitation stays. Things are even worse, as now there is also communication overhead being added and it could exceed the computation time if the problem size is too small (the ratio depends on the network interconnect - it is lower with faster, lower-latency interconnects like a QDR InfiniBand fabric). But with MPI you have access to more CPU sockets and hence to higher total memory bandwidth. You could launch your code with one MPI process per socket to get the best possible performance out of your system. Process binding (or pinning in Intel's terminology) is also important in that case.
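One more remark of mine (the bandwidth argument above does not depend on it): each MPI_Reduce + MPI_Bcast pair in your code can be collapsed into a single MPI_Allreduce, which leaves the global sum on every rank in one communication step. For the first loop, assuming the same variables as in your code:

double p_sum_sk = 0.0, p_sum_ed = 0.0;
for (i = startval; i <= endval; i++)
{
    p_sum_sk += (Z[i] - Zo[i]) * (Z[i] - Zo[i]);
    p_sum_ed += U1[i]*U1[i] + U2[i]*U2[i] + U3[i]*U3[i];
}
// every rank receives the fully reduced sums in sk and ed
MPI_Allreduce(&p_sum_sk, &sk, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
MPI_Allreduce(&p_sum_ed, &ed, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);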
Some questions about CUDA.
1) I noticed that, in every code sample, operations which are not parallel (i.e. the computation of a scalar), performed in global functions, are always done by a specific thread. For example, in this simple code for a dot product, thread 0 performs the summation:
__global__ void dot( int *a, int *b, int *c )
{
// Shared memory for results of multiplication
__shared__ int temp[N];
temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
// Thread 0 sums the pairwise products
if( 0 == threadIdx.x )
{
int sum = 0;
for( int i = 0; i < N; i++ )
sum += temp[i];
*c = sum;
}
}
This is fine with me; however, in some code that I wrote I did not specify the thread for the non-parallel operation, and it still works. Hence, is it compulsory to specify the thread? In particular, the non-parallel operation which I want to perform is the following:
if (epsilon == 1)
{
V[0] = B*(Exp - 1 - b);
}
else
{
V[0] = B*(Exp - 1 + a);
}
The various variables were passed as arguments of the global function. And here comes my second question.
2) I computed the value of V[0] with a program in CUDA and another, serial one on the CPU, obtaining different results. Obviously I thought that the problem in CUDA could be that I did not specify the thread but, even with this, the result does not change, and it is still (much) greater than the serial one: 6.71201e+22 vs -2908.05. Where could the problem be? The other calculations performed in the global function are the following:
int tid = threadIdx.x;
if ( tid != 0 && tid < N )
{
{Various stuff which does not involve V or the variables used to compute V[0]}
V[tid] = B*(1/(1+alpha[tid]*alpha[tid])*(One_G[tid]*Exp - Cos - alpha[tid]*Sin) + kappa[tid]*Sin);
}
As you can see, with this condition I avoid the case tid == 0.
3) Finally, a last question: I noticed that usually, in the sample codes, if you want to use on the CPU values allocated and computed in GPU memory, you should copy those values to the CPU (e.g. with cudaMemcpy, specifying cudaMemcpyDeviceToHost). But I manage to use those values directly in the main (CPU) code without any problem. Could this be a clue that there is something wrong with my GPU (or my CUDA installation), which also causes the previous odd results?
Thank you for your help.
== Added on the 5th January ==
Sorry for my late reply. Before invoking the kernel, there are all the memory allocations for the arrays to compute (which are quite a lot). In particular, the code for the array involved in my question is:
float * V;
cudaMalloc( (void**)&V, N * sizeof(float) );
At the end of the code I wrote:
float V_ [N];
cudaMemcpy( &V_, V, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaFree(V);
cout << V_[0] << endl;
Thank you again for your attention.
If you don't have any cudaMemcpy in your code, that's exactly the problem. ;-)
The GPU is accessing its own memory (the RAM on your graphics card), while the CPU is accessing the RAM on your mainboard.
You need to allocate and copy alpha, kappa, One_G and all the other arrays to your GPU first, using cudaMemcpy, then run your kernel, and after that copy your results back to the CPU.
Also, don't forget to allocate the memory on BOTH sides.
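A minimal sketch of that pattern for one input array and the result (the kernel name and the exact argument list are placeholders, not your actual code; V_ is the host array from your snippet):

float *dev_alpha, *dev_V;
cudaMalloc((void**)&dev_alpha, N * sizeof(float));
cudaMalloc((void**)&dev_V,     N * sizeof(float));
// host -> device: copy the input data before launching the kernel
cudaMemcpy(dev_alpha, alpha, N * sizeof(float), cudaMemcpyHostToDevice);
// ... same for kappa, One_G, ... then launch with the device pointers
myKernel<<<1, N>>>(dev_alpha, /* ...other device pointers... */ dev_V);
// device -> host: copy the results back before using them on the CPU
cudaMemcpy(V_, dev_V, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(dev_alpha);
cudaFree(dev_V);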
As for the non-parallel stuff: If the result is always the same, all threads will write the same thing, so the result is exactly the same, just quite a bit more inefficient, since all of them try to access the same resources.
Is that the exact code you're using?
Regarding question 1, you should have a __syncthreads() after the assignment to your shared memory array, temp.
Otherwise you'll get a race condition where thread 0 can start the summation prior to temp being fully populated.
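Concretely, the only change to the kernel from the question is the added barrier:

__global__ void dot( int *a, int *b, int *c )
{
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    __syncthreads();  // wait until every product has been written to shared memory

    if( 0 == threadIdx.x )
    {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];
        *c = sum;
    }
}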
As for your other question about specifying the thread, if you have
if (epsilon == 1)
{
V[0] = B*(Exp - 1 - b);
}
else
{
V[0] = B*(Exp - 1 + a);
}
Then every thread will execute that code; for example, if you have X threads executing and epsilon is 1 for all of them, then all X threads will evaluate the same line:
V[0] = B*(Exp - 1 - b);
and hence you'll have another race condition, as all X threads will be writing to V[0]. If all the threads have the same value for B*(Exp - 1 - b), then you might not notice a difference, while if they have different values then you're liable to get different results each time, depending on the order in which the threads arrive.
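In other words, if V[0] is meant to be written exactly once, the usual pattern is to let a single thread do it, for example:

if (threadIdx.x == 0)
{
    // only thread 0 writes the scalar entry, so there is no race on V[0]
    V[0] = (epsilon == 1) ? B * (Exp - 1 - b) : B * (Exp - 1 + a);
}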