max FLOPS for matrix multiplication Intel/AMD CPU - performance

My formula for estimating the maximum FLOPs/s of an Intel CPU is
Max SP FLOPs/s = frequency * 4 (SSE) or 8 (AVX) * 2 (MAC) * number of cores (not HW threads)
Max DP FLOPs/s = 0.5 * Max SP FLOPs/s
By MAC I mean the CPU can do one SSE(AVX) multiplication and addition at the same time. On the system I'm using the maximum frequency under load is 2.66 GHz. It only has SSE and the number of cores (not Hardware threads) is 4. That gives: Max SP FLOPs/s = 85.12 GFLOPs/s.
The number of FLOPs for matrix multiplication is approximately 2*n*m*k. For a square matrix with n=1000 that's 2*10^9 FLOPs (2 billion FLOPs). Once I know the time I can estimate the FLOPs/s.
However, the best I can get for my own code is about 40 SP GFLOPs/s, for example with n=1000. I get about the same result with Eigen. That's about 45% efficiency. Is my calculation for the maximum wrong? What's the best efficiency for an Intel CPU for large dense matrix multiplication? Does anyone have a paper describing this?
I know that on the GPU the efficiency can be more than 60%.
http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3
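The peak estimate and the efficiency figure above can be written down as a small sketch. The function names `peak_gflops` and `gemm_gflops` are hypothetical helpers introduced here for illustration, and the 0.05 s timing in the usage note is only an example value chosen so that the result matches the roughly 40 GFLOPs/s quoted in the question.

```cpp
#include <cassert>

// Theoretical peak in GFLOP/s, following the formula in the question:
// clock (GHz) * SIMD lanes * 2 (one multiply + one add per cycle) * physical cores.
double peak_gflops(double ghz, int simd_lanes, int cores) {
    assert(ghz > 0.0 && simd_lanes > 0 && cores > 0);
    return ghz * simd_lanes * 2.0 * cores;
}

// Achieved GFLOP/s for an (n x m) * (m x k) product that took `seconds`,
// using the 2*n*m*k operation count.
double gemm_gflops(double n, double m, double k, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * n * m * k / seconds * 1e-9;
}

// Fraction of the theoretical peak that was actually reached.
double efficiency(double achieved_gflops, double peak) {
    assert(peak > 0.0);
    return achieved_gflops / peak;
}
```

With 2.66 GHz, 4 SSE lanes and 4 cores, `peak_gflops` gives 85.12. A 1000x1000x1000 product timed at 0.05 s corresponds to 40 GFLOPs/s, which is roughly 47% of that peak.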
Edit:
I get similar results for n=500 which easily fits in the 12MB L3 cache of my system so the cache does not seem to be the limiting factor (though maybe I can use it more efficiently).
Edit2:
Eigen benchmarks show it is as good as MKL (for SSE). They use an Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz. So 2.66 * 2 (DP SSE) * 2 (MAC) * 4 cores = 42.56 DP GFLOPs/s. You can see on the plot that they are all getting less than 20. Something on the order of 45%, like me.
http://eigen.tuxfamily.org/index.php?title=Benchmark
http://ark.intel.com/products/35365/Intel-Core2-Quad-Processor-Q9400-6M-Cache-2_66-GHz-1333-MHz-FSB
Edit3:
Here is my code for anyone who cares. I can get slightly better results than this but not much better. I'm using Agner Fog's vectorclass for SSE/AVX, and setting Vec8f to float8 and Vec4d to double4.
// SGEMM with AVX: call GEMM_tile<float, float8>(nthreads, a, b, c, n, m, k);
template <typename ftype, typename floatn>
void GEMM_tile(const int nthreads, const ftype*A , const ftype* B, ftype* C, const int N, const int M, const int K) {
for(int i=0; i<N; i++) {
for(int j=0; j<K; j++) {
C[K*i + j] = 0;
}
}
const int nc = 32;
const int kc = 32;
const int mc = 32;
omp_set_num_threads(nthreads);
#pragma omp parallel for if(nthreads>1)
for(int ii=0; ii<N; ii+=nc) {
for(int jj=0; jj<K; jj+=kc)
for(int ll=0; ll<M; ll+=mc) {
const int nb = min(N-ii, nc);
const int kb = min(K-jj, kc);
const int mb = min(M-ll, mc);
// A is N x M in row-major order, so its row stride is M (identical to N only for square matrices)
MM_block<ftype, floatn>(nb, mb, kb, &A[M*ii+ll], M, &B[K*ll+jj], K, &C[K*ii+jj], K );
}
}
}
template <typename ftype, typename floatn>
void MM_block(int n, int m, int k, const ftype *a, const int stridea,
const ftype *b, const int strideb,
ftype *c, const int stridec ) {
const int vec_size = sizeof(floatn)/sizeof(ftype);
for(int i=0; i<n; i+=4) {
for(int j=0; j<k; j+=vec_size) {
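// rows of a are stridea apart (the original indexed them with strideb, which only works for square matrices)
Dot4x4_vec_block<ftype, floatn>(m, &a[stridea*i], &b[j], &c[stridec*i + j], stridea, strideb, stridec);
}
}
}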
template <typename ftype, typename floatn>
inline void Dot4x4_vec_block(const int n, const ftype *a, const ftype *b, ftype *c, const int stridea, const int strideb, const int stridec) {
floatn tmp0, tmp1, tmp2, tmp3;
load(tmp0, &c[stridec*0]);
load(tmp1, &c[stridec*1]);
load(tmp2, &c[stridec*2]);
load(tmp3, &c[stridec*3]);
ftype *a0_ptr = (ftype*)&a[stridea*0];
ftype *a1_ptr = (ftype*)&a[stridea*1];
ftype *a2_ptr = (ftype*)&a[stridea*2];
ftype *a3_ptr = (ftype*)&a[stridea*3];
for(int i=0; i<n; i++) {
floatn breg = floatn().load(&b[i*strideb + 0]);
floatn areg0 = *a0_ptr++;
floatn areg1 = *a1_ptr++;
floatn areg2 = *a2_ptr++;
floatn areg3 = *a3_ptr++;
tmp0 += areg0 * breg;
tmp1 += areg1 * breg;
tmp2 += areg2 * breg;
tmp3 += areg3 * breg;
}
tmp0.store(&c[stridec*0]);
tmp1.store(&c[stridec*1]);
tmp2.store(&c[stridec*2]);
tmp3.store(&c[stridec*3]);
}

Often, the limiting factor for processing throughput is memory bandwidth, especially in cases where your working set doesn't fit into the CPU cache (your 1000-by-1000 matrix of float will take up ~4 MB, whereas your CPU probably has a 2 MB L3 cache). This is a situation where the structure of your algorithm can make a big difference in how it performs, but you will usually hit a wall at some point where you just can't get any faster because you're waiting on values to come from some higher level in the memory hierarchy.
In addition, your theoretical numbers assume that you have sufficient instructions without data dependencies to keep all of the execution units tasked on every cycle. This can be very difficult to do in practice. I'm not sure what the optimum throughput for a general matrix multiply would be, but check out this previous question for information on what you can do to maximize the instruction throughput.
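The point about data dependencies can be illustrated with a plain dot product. This is a minimal sketch, not the kernel from the question: `dot_chained` and `dot_unrolled4` are hypothetical names, and whether the second version is actually faster depends on the compiler and CPU. The `tmp0`..`tmp3` accumulators in the question's `Dot4x4_vec_block` follow the same idea.

```cpp
#include <cassert>
#include <cstddef>

// One accumulator: each add must wait for the previous add to finish,
// so the loop is limited by the latency of the add, not its throughput.
float dot_chained(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Four independent accumulators: the four chains do not depend on each
// other, which gives the CPU more instructions it can issue per cycle.
float dot_unrolled4(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) s0 += a[i] * b[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two functions add the products in a different order, so for general floating-point input their results can differ in the last bits.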

Related

Why is local memory in this OpenCL algorithm so slow?

I am writing some OpenCL code. My kernel should create a special "accumulator" output based on an input image. I have tried two concepts and both are equally slow, although the second one uses local memory. Could you please help me identify why the local memory version is so slow? The target GPU for the kernels is an AMD Radeon Pro 450.
// version one
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
const unsigned int x = get_global_id(0);
const unsigned int y = get_global_id(1);
int ind;
for(int k = SOME_BEGINNING; k <= SOME_END; k++) {
// some pretty wild calculation
// ind is not linear and accesses different areas of the output
ind = ...
if(input[y * WIDTH + x] == 255) {
atomic_inc(&output[ind]);
}
}
}
// variant two
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
const unsigned int x = get_global_id(0);
const unsigned int y = get_global_id(1);
__local int buf[7072];
if(y < 221 && x < 32) {
buf[y * 32 + x] = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
int ind;
int k;
for(k = SOME_BEGINNING; k <= SOME_END; k++) {
// some pretty wild calculation
// ind is not linear and access different areas of the output
ind = ...
if(input[y * WIDTH + x] == 255) {
atomic_inc(&buf[ind]);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
if(get_local_id(0) == get_local_size(0) - 1) {
for(k = 0; k < 7072; k++)
output[k] = buf[k];
}
}
I would expect that the second variant is faster than the first one, but it isn't. Sometimes it is even slower.
The local buffer __local int buf[7072] (28288 bytes) is too big. I don't know how much local (shared) memory the AMD Radeon Pro 450 has, but it is likely 32 kB or 64 kB per compute unit.
With integer division, 32768/28288 = 1 and 65536/28288 = 2, so only 1 or at most 2 wavefronts (64 work items each) can run simultaneously. Occupancy of the compute unit is therefore very low, hence the poor performance.
Your aim should be to shrink the local buffer as much as possible so that more wavefronts can be processed simultaneously.
Use CodeXL to profile your kernel - there are tools to show you all of this.
Alternatively you can have a look at CUDA occupancy calculator excel spreadsheet if you don't want to run the profiler to get a better idea of what that is about.
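The occupancy arithmetic in this answer can be sketched as a one-line helper. `groups_per_cu` is a hypothetical name, the 32 kB and 64 kB figures are the answer's guesses rather than confirmed hardware values, and the sketch only considers local memory, ignoring registers and other limits.

```cpp
#include <cassert>
#include <cstddef>

// OpenCL defines int as 32 bits, so one buffer element takes 4 bytes.
constexpr std::size_t kOpenClIntBytes = 4;

// How many groups can be resident on one compute unit at the same time
// if local memory is the only limiting resource (integer division).
std::size_t groups_per_cu(std::size_t local_mem_bytes_per_cu,
                          std::size_t local_bytes_per_group) {
    assert(local_bytes_per_group > 0);
    return local_mem_bytes_per_cu / local_bytes_per_group;
}
```

For the 7072-element buffer this gives 1 with 32 kB and 2 with 64 kB. A buffer of 1024 elements would give 8 and 16 under the same assumptions.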

Floating point min/max in CUDA slower than CPU version. Why?

I wrote a kernel for computing the min and max values of an array of about 100,000 floats using reduction (see code below). I use thread blocks to reduce chunks of 1024 values to a single value (in shared memory), and then do the final reduction among the blocks on the CPU.
I then compared this with a serial calculation just on the CPU. The CUDA version takes 2.2ms, and the CPU version takes 0.21ms. Why is the CUDA version much slower? Is the array size not large enough to take advantage of the parallelism, or is my code not optimized somehow?
This is part of an exercise in the Udacity Parallel Programming class. I am running this through their web site, so I don't know what the exact hardware is, but they claim the code runs on actual GPUs.
Here is the CUDA code:
__global__ void min_max_kernel(const float* const d_logLuminance,
const size_t length,
float* d_min_logLum,
float* d_max_logLum) {
// Shared working memory
extern __shared__ float sh_logLuminance[];
int blockWidth = blockDim.x;
int x = blockDim.x * blockIdx.x + threadIdx.x;
float* min_logLuminance = sh_logLuminance;
float* max_logLuminance = sh_logLuminance + blockWidth;
// Copy this block's chunk of the data to shared memory
// We copy twice so we compute min and max at the same time
if (x < length) {
min_logLuminance[threadIdx.x] = d_logLuminance[x];
max_logLuminance[threadIdx.x] = min_logLuminance[threadIdx.x];
}
else {
// Pad if we're out of range
min_logLuminance[threadIdx.x] = FLT_MAX;
max_logLuminance[threadIdx.x] = -FLT_MAX;
}
__syncthreads();
// Reduce
for (int s = blockWidth/2; s > 0; s /= 2) {
if (threadIdx.x < s) {
if (min_logLuminance[threadIdx.x + s] < min_logLuminance[threadIdx.x]) {
min_logLuminance[threadIdx.x] = min_logLuminance[threadIdx.x + s];
}
if (max_logLuminance[threadIdx.x + s] > max_logLuminance[threadIdx.x]) {
max_logLuminance[threadIdx.x] = max_logLuminance[threadIdx.x + s];
}
}
__syncthreads();
}
// Write to global memory
if (threadIdx.x == 0) {
d_min_logLum[blockIdx.x] = min_logLuminance[0];
d_max_logLum[blockIdx.x] = max_logLuminance[0];
}
}
size_t get_num_blocks(size_t inputLength, size_t threadsPerBlock) {
return inputLength / threadsPerBlock +
((inputLength % threadsPerBlock == 0) ? 0 : 1);
}
/*
* Compute min, max over the data by first reducing on the device, then
* doing the final reduction on the host.
*/
void compute_min_max(const float* const d_logLuminance,
float& min_logLum,
float& max_logLum,
const size_t numRows,
const size_t numCols) {
// Compute min, max
printf("\n=== computing min/max ===\n");
const size_t blockWidth = 1024;
const size_t numPixels = numRows * numCols;
size_t numBlocks = get_num_blocks(numPixels, blockWidth);
printf("Num min/max blocks = %zu\n", numBlocks);
float* d_min_logLum;
float* d_max_logLum;
int alloc_size = sizeof(float) * numBlocks;
checkCudaErrors(cudaMalloc(&d_min_logLum, alloc_size));
checkCudaErrors(cudaMalloc(&d_max_logLum, alloc_size));
min_max_kernel<<<numBlocks, blockWidth, sizeof(float) * blockWidth * 2>>>
(d_logLuminance, numPixels, d_min_logLum, d_max_logLum);
float* h_min_logLum = (float*) malloc(alloc_size);
float* h_max_logLum = (float*) malloc(alloc_size);
checkCudaErrors(cudaMemcpy(h_min_logLum, d_min_logLum, alloc_size, cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_max_logLum, d_max_logLum, alloc_size, cudaMemcpyDeviceToHost));
min_logLum = FLT_MAX;
max_logLum = -FLT_MAX;
// Reduce over the block results
// (would be a bit faster to do it on the GPU, but it's just 96 numbers)
for (int i = 0; i < numBlocks; i++) {
if (h_min_logLum[i] < min_logLum) {
min_logLum = h_min_logLum[i];
}
if (h_max_logLum[i] > max_logLum) {
max_logLum = h_max_logLum[i];
}
}
printf("min_logLum = %.2f\nmax_logLum = %.2f\n", min_logLum, max_logLum);
checkCudaErrors(cudaFree(d_min_logLum));
checkCudaErrors(cudaFree(d_max_logLum));
free(h_min_logLum);
free(h_max_logLum);
}
And here is the host version:
void compute_min_max_on_host(const float* const d_logLuminance, size_t numPixels) {
int alloc_size = sizeof(float) * numPixels;
float* h_logLuminance = (float*) malloc(alloc_size);
checkCudaErrors(cudaMemcpy(h_logLuminance, d_logLuminance, alloc_size, cudaMemcpyDeviceToHost));
float host_min_logLum = FLT_MAX;
float host_max_logLum = -FLT_MAX;
printf("HOST ");
for (int i = 0; i < numPixels; i++) {
if (h_logLuminance[i] < host_min_logLum) {
host_min_logLum = h_logLuminance[i];
}
if (h_logLuminance[i] > host_max_logLum) {
host_max_logLum = h_logLuminance[i];
}
}
printf("host_min_logLum = %.2f\nhost_max_logLum = %.2f\n",
host_min_logLum, host_max_logLum);
free(h_logLuminance);
}
As @talonmies suggests, behavior may be different for larger sizes; 100,000 is really not that much: much of it fits within the combined overall L1 cache of the cores on a modern CPU, and half of it fits in a single core's L2 cache.
Transfer over PCI Express takes time, and in your case double the time it might have taken, since you don't use pinned memory.
You're not overlapping computation and PCI Express I/O (not that it would make much sense for only 100,000 elements).
Your kernel is rather slow, for more than one reason; not the least of which is the extensive use of shared memory, most of which is unnecessary.
More generally: Always profile your code using nvvp (or nvprof for getting textual information for further analysis).

Why is this code ten times slower on the GPU than CPU?

I have a problem that boils down to performing some arithmetic on each element of a set of matrices. I thought this sounded like the kind of computation that could benefit greatly from being shifted onto the GPU. However, I've only succeeded in slowing down the computation by a factor of 10!
Here are the specifics of my test system:
OS: Windows 10
CPU: Core i7-4700MQ @ 2.40 GHz
GPU: GeForce GT 750M (compute capability 3.0)
CUDA SDK: v7.5
The code below performs equivalent calcs to my production code, on the CPU and on the GPU. The latter is consistently ten times slower on my machine (CPU approx. 650ms; GPU approx. 7s).
I've tried changing the grid and block sizes; I've increased and decreased the size of the array passed to the GPU; I've run it through the visual profiler; I've tried integer data rather than doubles, but whatever I do, the GPU version is always significantly slower than the CPU equivalent.
So why is the GPU version so much slower and what changes, that I've not mentioned above, could I try to improve its performance?
Here's my command line: nvcc source.cu -o CPUSpeedTest.exe -arch=sm_30
And here's the contents of source.cu:
#include <iostream>
#include <windows.h>
#include <cuda_runtime_api.h>
void AdjustArrayOnCPU(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
for (size_t i = 0; i < arrayLength; i++)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[i] / denominator);
array[i] = array[i] * adjustmentFactor;
}
}
__global__ void CudaKernel(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < arrayLength)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[idx] / denominator);
array[idx] = array[idx] * adjustmentFactor;
}
}
void AdjustArrayOnGPU(double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, double curve[], int curveLength)
{
double *dev_row, *dev_curve;
cudaMalloc((void**)&dev_row, sizeof(double) * arrayLength);
cudaMalloc((void**)&dev_curve, sizeof(double) * curveLength);
cudaMemcpy(dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
cudaMemcpy(dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);
CudaKernel<<<100, 1000>>>(factor1, factor2, factor3, denominator, dev_row, arrayLength, dev_curve, curveLength);
cudaMemcpy(array, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
cudaFree(dev_curve);
cudaFree(dev_row);
}
void FillArray(int length, double row[])
{
for (size_t i = 0; i < length; i++) row[i] = 0.1 + i;
}
int main(void)
{
const int arrayLength = 10000;
double arrayForCPU[arrayLength], curve1[arrayLength], arrayForGPU[arrayLength], curve2[arrayLength];
FillArray(arrayLength, curve1);
FillArray(arrayLength, curve2);
///////////////////////////////////// CPU Version ////////////////////////////////////////
LARGE_INTEGER StartingTime, EndingTime, ElapsedMilliseconds, Frequency;
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForCPU);
AdjustArrayOnCPU(1.0, 1.0, 1.0, 1.0, arrayForCPU, 10000, curve1, 10000);
}
QueryPerformanceCounter(&EndingTime);
ElapsedMilliseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMilliseconds.QuadPart *= 1000;
ElapsedMilliseconds.QuadPart /= Frequency.QuadPart;
std::cout << "Elapsed Milliseconds: " << ElapsedMilliseconds.QuadPart << std::endl;
///////////////////////////////////// GPU Version ////////////////////////////////////////
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForGPU);
AdjustArrayOnGPU(arrayForGPU, 10000, 1.0, 1.0, 1.0, 1.0, curve2, 10000);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout << "CUDA Elapsed Milliseconds: " << elapsedTime << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
And here is an example of the output of CUDASpeedTest.exe
Elapsed Milliseconds: 565
CUDA Elapsed Milliseconds: 7156.76
What follows is likely to be embarrassingly obvious to most developers working with CUDA, but may be of value to others - like myself - who are new to the technology.
The GPU code is ten times slower than the CPU equivalent because the GPU code exhibits a perfect storm of performance-wrecking characteristics.
The GPU code spends most of its time allocating memory on the GPU, copying data to the device, performing a very, very simple calculation (that is supremely fast irrespective of the type of processor it's running on) and then copying data back from the device to the host.
As noted in the comments, if an upper bound exists on the size of the data structures being processed, then a buffer on the GPU can be allocated exactly once and reused. In the code above, this takes the GPU to CPU runtime down from 10:1 to 4:1.
The remaining performance disparity is down to the fact that the CPU is able to perform the required calculations, in serial, millions of times in a very short time span due to its simplicity. In the code above, the calculation involves reading a value from an array, some multiplication, and finally an assignment to an array element. Something this simple must be performed millions of times before the benefits of doing so in parallel outweigh the necessary time penalty of transferring the data to the GPU and back. On my test system, a million array elements is the break-even point, where GPU and CPU perform in (approximately) the same amount of time.
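The break-even argument can be expressed as a very rough linear cost model. This is only a sketch under strong assumptions: `CostModel` and its fields are hypothetical, the per-element costs are assumed constant, and real values would have to be measured on the target system.

```cpp
#include <cassert>

// Hypothetical linear cost model; all times in microseconds.
struct CostModel {
    double fixed_overhead_us;   // per-call GPU cost: allocation, launch, transfer setup
    double cpu_us_per_element;  // CPU cost per array element
    double gpu_us_per_element;  // GPU cost per element, including copy time
};

double cpu_time_us(const CostModel& m, double n) {
    return m.cpu_us_per_element * n;
}

double gpu_time_us(const CostModel& m, double n) {
    return m.fixed_overhead_us + m.gpu_us_per_element * n;
}

// Element count at which both versions take the same time.
// Returns -1 if the GPU is not faster per element and never catches up.
double break_even_elements(const CostModel& m) {
    const double gain = m.cpu_us_per_element - m.gpu_us_per_element;
    if (gain <= 0.0) return -1.0;
    return m.fixed_overhead_us / gain;
}
```

In this model, allocating the device buffers once and reusing them lowers `fixed_overhead_us`, which moves the break-even point to smaller arrays.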

OpenMP Code Not Scaling due to overheads and cache issues

struct xnode
{
float *mat;
};
void testScaling( )
{
int N = 1000000; ///total num matrices
int dim = 10;
//memory for matrices
std::vector<xnode> nodeArray(N);
for( int k = 0; k < N; ++k )
nodeArray[k].mat = new float [dim*dim];
//memory for Y
std::vector<float*> Y(N,0);
for( int k = 0; k < N; ++k )
Y[k] = new float [dim];
//shared X
float* X = new float [dim];
for(int i = 0; i < dim; ++i ) X[i] = 1.0;
//init mats
for( int k = 0; k < N; ++k )
{
for( int i=0; i<dim*dim; ++i )
nodeArray[k].mat[i] = 0.25+((float)i)/3;
}
int NTIMES = 500;
//gemv args
char trans = 'N';
int lda = dim;
int incx = 1;
float alpha =1 , beta = 0;
//threads
int thr[4];
thr[0] =1 ; thr[1] = 2; thr[2] = 4; thr[3] = 8;
for( int t = 0; t<4; ++t )//test for nthreads
{
int nthreads = thr[t];
double t_1 = omp_get_wtime();
for( int ii = 0; ii < NTIMES; ++ii )//do matvec NTIMES
{
#pragma omp parallel for num_threads(nthreads)
for( int k=0; k<N; ++k )
{
//compute Y[k] = mat[k] * X;
GEMV(&trans, &dim, &dim, &alpha, nodeArray[k].mat, &lda, X, &incx, &beta, Y[k], &incx);
//GEMV(&trans, &dim, &dim, &alpha, nodeArray[0].mat, &lda, X, &incx, &beta, Y[k], &incx);
}
}
double t_2 = omp_get_wtime();
std::cout << "Threads " << nthreads << " time " << (t_2-t_1)/NTIMES << std::endl;
}
//clear memory
for( int k = 0; k < N; ++k )
{
delete [] nodeArray[k].mat;
delete [] Y[k];
}
delete [] X;
}
The above code parallelizes the matrix-vector products of N matrices of size dim x dim and stores the results in N output vectors. The time averaged over 500 repetitions is taken as the time per pass over all N products. The matrix-vector products in the above example are all of equal size, so the threads should be perfectly balanced and we should achieve scaling close to the ideal 8x. The following are the observations (Machine: Intel Xeon 3.1 GHz, 2 processors, 8 cores each, HyperThreading enabled, Windows, VS2012, Intel MKL, Intel OMP library).
OBSERVATION 1:
dim=10 N=1000000
Threads 1 - time 0.138068s
Threads 2 - time 0.0729147s
Threads 4 - time 0.0360527s
Threads 8 - time 0.0224268s (6.1x on 8threads)
OBSERVATION 2 :
dim=20 N=1000000
Threads 1 time 0.326617
Threads 2 time 0.185706
Threads 4 time 0.0886508
Threads 8 time 0.0733666 (4.5x on 8 threads).
Note – I ran VTune on this case. It showed CPUTime 267.8sec, Overhead time 43 sec, Spin time – 8 sec. The overhead time is all spent in a libiomp function (intel library). 8Threads/1Thread scaling is poor for such cases.
Next - in the gemv for loop, we change nodeArray[k].mat to nodeArray[0].mat (see commented statement), so that only the first matrix is used for all the matrix-vector products.
OBSERVATION 3
dim=20 N=1000000
Threads 1 time 0.152298 (The serial time is halved)
Threads 2 time 0.0769173
Threads 4 time 0.0384086
Threads 8 time 0.019336 (7.87x on 8 threads)
Thus I get almost ideal scaling. Why does this happen? VTune says that a significant portion of CPU time is spent in synchronization and thread overhead. Here there seems to be no relation between load balancing and thread synchronization. As the matrix size increases, the granularity should increase and the thread overhead should become proportionately smaller. But going from size 10 to size 20, the scaling gets worse. When we use nodeArray[0].mat (only the first matrix) for all the matrix-vector products, the cache is updated only once (since the compiler knows this during optimization) and we get near-ideal scaling. Thus the synchronization overhead seems to be related to some cache issue. I have tried a number of other things, like setting KMP_AFFINITY and varying the load distribution, but that did not buy me anything.
My questions are:
1. I don't have a clear idea of how cache performance affects OpenMP thread synchronization. Can someone explain this?
2. Can anything be done about improving the scaling and reducing the overhead?
Thanks
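Two numbers that may help when reading the observations above are the amount of matrix data each pass has to read and the speedup implied by the timings. The sketch below only does that arithmetic; `matrix_bytes_per_pass` and `speedup` are hypothetical helper names, and the result is not a diagnosis of where the overhead comes from.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of matrix data read in one pass over all matrices
// (each matrix holds dim * dim floats; X and Y are ignored here).
std::size_t matrix_bytes_per_pass(std::size_t num_matrices, std::size_t dim) {
    return num_matrices * dim * dim * sizeof(float);
}

// Speedup of a parallel run relative to the single-thread run.
double speedup(double t_serial, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}
```

With N = 1000000 and 4-byte floats, a pass reads 400 MB of matrix data for dim = 10 and 1.6 GB for dim = 20, far more than fits in any cache level, whereas reusing nodeArray[0].mat reads a single 1600-byte matrix over and over.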

CUDA performance of atomic operation on different address in warp

To my knowledge, if atomic operations are performed on the same memory address within a warp, the performance of the warp could be 32 times slower.
But what if the atomic operations of the threads in a warp are on 32 different memory locations? Is there any performance penalty at all? Or will it be as fast as a normal operation?
My use case is that I have 32 different positions, and each thread in a warp needs one of these positions, but which position is data dependent. So each thread could use atomicCAS to check whether the desired location is empty or not. If it is not empty, it scans the next position.
If I am lucky, 32 threads could atomicCAS to 32 different memory locations. Is there any performance penalty in this case?
I assume the Kepler architecture is used.
In the code below, I'm adding a constant value to the elements of an array (dev_input). I'm comparing two kernels, one using atomicAdd and one using regular addition. This is an example taken to the extreme in which atomicAdd operates on completely different addresses, so there will be no need for serialization of the operations.
#include <stdio.h>
#define BLOCK_SIZE 1024
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void regular_addition(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) dev_input[i] = dev_input[i] + val;
}
__global__ void atomic_operations(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) atomicAdd(&dev_input[i],val);
}
int main(){
int N = 8192*32;
float* output = (float*)malloc(N*sizeof(float));
float* dev_input; gpuErrchk(cudaMalloc((void**)&dev_input, N*sizeof(float)));
gpuErrchk(cudaMemset(dev_input, 0, N*sizeof(float)));
int NumBlocks = iDivUp(N,BLOCK_SIZE);
float time, timing1 = 0.f, timing2 = 0.f;
cudaEvent_t start, stop;
int niter = 32;
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
atomic_operations<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing1 = timing1 + time;
}
printf("Time for atomic operations: %3.5f ms \n", timing1/(float)niter);
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
regular_addition<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing2 = timing2 + time;
}
printf("Time for regular addition: %3.5f ms \n", timing2/(float)niter);
}
Testing this code on my NVIDIA GeForce GT540M, CUDA 5.5, Windows 7, I obtain approximately the same results for the two kernels, i.e., about 0.7ms.
Now change the instruction
if (i < N) atomicAdd(&dev_input[i],val);
to
if (i < N) atomicAdd(&dev_input[i%32],val);
which is closer to the case of your interest, namely, each atomicAdd operates on different addresses within a warp. The result I obtain is that no performance penalty is observed.
Finally, change the above instruction to
if (i < N) atomicAdd(&dev_input[0],val);
This is the other extreme in which atomicAdd always operates on the same address. In this case, the execution time rises to 5.1ms.
The above tests have been performed on a Fermi architecture. You can try to run the above code on your Kepler card.

Resources