Reducing Shared Memory Bank Conflicts


nvprof reports about 200 million shared_ld_bank_conflict events and some shared_st_bank_conflict events in my sgemm kernel. I tried the padding trick __shared__ float smem[SIZE + OFFSET];, which reduced the store bank conflicts to 0, but the load bank conflicts are still there. I don't know how to improve it further.
__global__ void sgemm(
const float* __restrict__ A,
const float* __restrict__ B,
float* __restrict__ C,
int M, int N, int K
){
int tid = threadIdx.x;
int gStartx = blockIdx.x * 128;
int gStarty = blockIdx.y * 128;
int dx = tid % 8;
int dy = tid / 8;
int vx = tid % 16;
int vy = tid / 16;
__shared__ volatile float aSM[8][128+4];
__shared__ volatile float bSM[8][128+4];
float aBuffer1[4];
float bBuffer1[4];
float aBuffer2[4];
float bBuffer2[4];
float cCache[8][8];
#pragma unroll
for (int i=0; i<8; i++)
#pragma unroll
for (int j=0; j<8; j++)
cCache[i][j] = 0.f;
//load first two tiles
#pragma unroll
for (int i=0; i<4; i++){
aBuffer1[i] = A[(gStarty + dy + i*32)*K + (dx)];
bBuffer1[i] = B[(gStartx + dy + i*32)*K + (dx)];
}
int nIt = (K + 8 - 1) / 8;
#pragma unroll
for (int itr=0; itr<nIt; itr++){
int gStartk = itr * 8;
int is_odd = itr & 1;
if (is_odd == 0){
#pragma unroll
for (int i=0; i<4; i++){
if (itr != (nIt - 1)){
// prefetch next tiles
aBuffer2[i] = A[(gStarty + i*32 + dy)*K + (gStartk + 8 + dx)];
bBuffer2[i] = B[(gStartx + i*32 + dy)*K + (gStartk + 8 + dx)];
}
//move current tiles to SMEM
aSM[dx][dy+i*32] = aBuffer1[i];
bSM[dx][dy+i*32] = bBuffer1[i];
}
} else {
#pragma unroll
for (int i=0; i<4; i++){
if (itr != (nIt - 1)){
//prefetch next tiles to another buffer
aBuffer1[i] = A[(gStarty + i*32 + dy)*K + (gStartk + 8 + dx)];
bBuffer1[i] = B[(gStartx + i*32 + dy)*K + (gStartk + 8 + dx)];
}
aSM[dx][dy+i*32] = aBuffer2[i];
bSM[dx][dy+i*32] = bBuffer2[i];
}
}
__syncthreads();
float aCache[8][4];
#pragma unroll
for (int p=0; p<2; p++){
#pragma unroll
for (int ki=0; ki<8; ki++){
#pragma unroll
for (int mi=0; mi<4; mi++){
aCache[ki][mi] = aSM[ki][8*vy + 4*p +mi];
}
}
#pragma unroll
for (int ki=0; ki<8; ki++){
#pragma unroll
for (int ni=0; ni<8; ni++){
float b = bSM[ki][8*vx + ni];
#pragma unroll
for (int mi=0; mi<4; mi++){
float a = aCache[ki][mi];
cCache[mi + 4*p][ni] = fma(a, b, cCache[mi + 4*p][ni] );
}
}
}
}
__syncthreads();
}
#pragma unroll
for (int i=0; i<8; i++){
for (int j=0; j<8; j++){
C[(gStarty + vy*8 + i)*N + (gStartx + vx*8 + j)] = cCache[i][j];
}
}
}
A (2048x2048) is row major and B (2048x2048) is column major. Each block has 256 threads and computes a 128x128 portion of C, and each thread computes an 8x8 tile of C, accumulating over 8 values of k per iteration. The GPU is a Tesla P100.

OK, I found a solution: when storing to bSM, insert one padding word after every 32 words in the second dimension:
//bSM[dx][dy+i*32] = bBuffer1[i];
bSM[dx][dy+i*33] = bBuffer1[i]; // columns 32, 65, 98 and 131 are skipped (padding)
When reading logical column j of bSM, read bSM[i][j/32 + j] instead:
//float b = bSM[ki][8*vx + ni];
float b = bSM[ki][(8*vx) / 32 + 8*vx + ni];
// (8*vx+ni)/32 is the same as (8*vx)/32, since ni is always less than 8
Now it reaches 55% of the performance of cuBLAS gemm on a Tesla P4.
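To see why the extra word per 32 columns helps, the bank mapping of the bSM loads can be replayed on the host. The following is only a sketch under stated assumptions: 32 banks of 4-byte words, one warp made of tid 0..31, and the row stride of 128 + 4 floats from the kernel above. The function name max_bank_conflict_degree is made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Assumptions: 32 banks, 4-byte words, one warp = tid 0..31,
// row stride of 128 + 4 floats as in the bSM declaration.
constexpr int kBanks = 32;
constexpr int kRowStride = 128 + 4;

// Worst-case number of distinct words that one warp requests from a single
// bank when every thread reads bSM[ki][col(vx) + ni].
// 1 means conflict-free: threads that read the same word are served by a
// broadcast and are not counted as a conflict.
int max_bank_conflict_degree(bool padded, int ki, int ni) {
    std::vector<std::set<int>> words_per_bank(kBanks);
    for (int tid = 0; tid < 32; ++tid) {
        int vx = tid % 16;
        int col = 8 * vx + ni;               // logical column
        if (padded) col += (8 * vx) / 32;    // one padding word per 32 columns
        int word = ki * kRowStride + col;    // word offset inside bSM
        words_per_bank[word % kBanks].insert(word);
    }
    int worst = 0;
    for (const auto& words : words_per_bank)
        worst = std::max(worst, static_cast<int>(words.size()));
    return worst;
}
```

Under these assumptions the unpadded layout yields a degree of 4 (columns 0, 32, 64 and 96 all fall into the same bank), while the padded layout yields 1 for every ki and ni.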

Related

Access a matrix as its transpose in tiled matrix multiplication in CUDA

I'm currently experimenting with CUDA and I came across this kernel in an answer about matrix multiplication: https://stackoverflow.com/a/18856054/7867026
Instead of computing A*B, I want to compute A_Transpose*A without storing A_Transpose (the kernel takes only matrix A as input). I have to set the indices properly, but I'm confused by this matrix representation. Any help would be appreciated.

Most of what you need is here and here.
The first link identifies that A x A^T involves taking inner products of rows of matrix A; similarly, A^T x A involves taking inner products of columns of matrix A. Also note the symmetry statement. In the second link (scroll down from that point a bit in the programming guide) you will find a complete tiled matrix multiply. You just need to index into both tiles by column.
Here is a worked example, using the code from the SO answer you linked:
$ cat t1654.cu
#include <iostream>
#include <cstdio>
#include <cstdlib>
const int TILE_DIM = 32;
template <typename T>
__global__ void ATA(const T * __restrict__ A, T * __restrict__ C, int ARows, int ACols)
{
T CValue = 0;
int Row = blockIdx.y*TILE_DIM + threadIdx.y;
int Col = blockIdx.x*TILE_DIM + threadIdx.x;
__shared__ T As[TILE_DIM][TILE_DIM];
__shared__ T Bs[TILE_DIM][TILE_DIM];
for (int k = 0; k < (TILE_DIM + ARows - 1)/TILE_DIM; k++) {
if (k*TILE_DIM + threadIdx.y < ARows && blockIdx.y*blockDim.y+threadIdx.x < ACols)
As[threadIdx.y][threadIdx.x] = A[(k*TILE_DIM + threadIdx.y)*ACols + blockIdx.y*blockDim.y+threadIdx.x];
else
As[threadIdx.y][threadIdx.x] = 0.0;
if (k*TILE_DIM + threadIdx.y < ARows && Col < ACols)
Bs[threadIdx.y][threadIdx.x] = A[(k*TILE_DIM + threadIdx.y)*ACols + Col];
else
Bs[threadIdx.y][threadIdx.x] = 0.0;
__syncthreads();
for (int n = 0; n < TILE_DIM; ++n)
CValue += As[n][threadIdx.y] * Bs[n][threadIdx.x];
__syncthreads();
}
if (Row < ACols && Col < ACols)
C[((blockIdx.y * blockDim.y + threadIdx.y)*ACols) +
(blockIdx.x * blockDim.x)+ threadIdx.x] = CValue;
}
template <typename T>
__global__ void transpose_naive(const T * __restrict__ in, T * __restrict__ out, const int dim){
int col = threadIdx.x+blockDim.x*blockIdx.x;
int row = threadIdx.y+blockDim.y*blockIdx.y;
if ((col < dim) && (row < dim)) out[col*dim+row] = in[row*dim+col];
}
template <typename T>
__global__ void mm_naive(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, const int rowA, const int colA, const int colB){
int col = threadIdx.x+blockDim.x*blockIdx.x;
int row = threadIdx.y+blockDim.y*blockIdx.y;
if ((row < rowA) && (col < colB)){
T Cval = 0;
for (int i = 0; i < colA; i++) Cval += A[row*colA+i]*B[i*colB+col];
C[row*colB+col] = Cval;}
}
typedef float mt;
int main(){
mt *d_A, *d_B, *d_C, *h_A, *h_C, *h_C1;
int m = 64;
int n = 64;
h_A = new mt[m*n];
h_C = new mt[n*n];
h_C1 = new mt[n*n];
cudaMalloc(&d_A, m*n*sizeof(d_A[0]));
cudaMalloc(&d_B, m*n*sizeof(d_A[0]));
cudaMalloc(&d_C, n*n*sizeof(d_C[0]));
// test 1
for (int i = 0; i < m; i++)
for (int j = 0; j < n; j++)
h_A[i*n+j] = (i==j)?1.0f:0.0f;
cudaMemcpy(d_A, h_A, m*n*sizeof(d_A[0]), cudaMemcpyHostToDevice);
dim3 block(TILE_DIM, TILE_DIM);
dim3 grid((n+block.x-1)/block.x, (n+block.y-1)/block.y);
ATA<<<grid,block>>>(d_A, d_C, m, n);
cudaMemcpy(h_C, d_C, n*n*sizeof(d_C[0]), cudaMemcpyDeviceToHost);
#ifdef DEBUG
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++)
std::cout << h_C[i*n+j] << " ";
std::cout << std::endl;}
std::cout << std::endl;
#endif
// test 2
for (int i = 0; i < m; i++)
for (int j = 0; j < n; j++)
h_A[i*n+j] = rand()%10;
cudaMemcpy(d_A, h_A, m*n*sizeof(d_A[0]), cudaMemcpyHostToDevice);
ATA<<<grid,block>>>(d_A, d_C, m, n);
cudaMemcpy(h_C, d_C, n*n*sizeof(d_C[0]), cudaMemcpyDeviceToHost);
#ifdef DEBUG
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++)
std::cout << h_C[i*n+j] << " ";
std::cout << std::endl;}
std::cout << std::endl;
#endif
transpose_naive<<<grid,block>>>(d_A, d_B, n);
mm_naive<<<grid,block>>>(d_B, d_A, d_C, n, n, n);
cudaMemcpy(h_C1, d_C, n*n*sizeof(d_C[0]), cudaMemcpyDeviceToHost);
#ifdef DEBUG
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++)
std::cout << h_C1[i*n+j] << " ";
std::cout << std::endl;}
std::cout << std::endl;
#endif
for (int i = 0; i < n*n; i++) if (h_C[i] != h_C1[i]) {std::cout << "mismatch at: " << i << " was: " << h_C[i] << " should be: " << h_C1[i] << std::endl; return 0;}
}
$ nvcc -o t1654 t1654.cu
$ cuda-memcheck ./t1654
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
Note that loading the Bs tile is identical in both cases. The main changes are in loading the As tile and in the indexing used when computing CValue. These changes are necessary to index both tiles by column.
There may still be bugs. I have not tested the non-square case, nor have I tested the case where the matrix size is not a multiple of block size. Furthermore I've taken no advantage of the symmetry in the output. However this should help with the indexing.
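For checking a kernel like ATA on non-square inputs, a plain host reference can help. This is a minimal sketch, assuming A is stored row-major as ARows x ACols; the name ata_reference is hypothetical. It computes each output element as an inner product of two columns of A and uses the symmetry of the result, which the kernel above does not exploit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host reference for C = A^T * A, with A stored row-major as ARows x ACols.
// C is ACols x ACols; C[i][j] is the inner product of columns i and j of A.
std::vector<float> ata_reference(const std::vector<float>& A, int ARows, int ACols) {
    std::vector<float> C(static_cast<std::size_t>(ACols) * ACols, 0.0f);
    for (int i = 0; i < ACols; ++i) {
        for (int j = i; j < ACols; ++j) {        // upper triangle only
            float sum = 0.0f;
            for (int k = 0; k < ARows; ++k)
                sum += A[k * ACols + i] * A[k * ACols + j];
            C[i * ACols + j] = sum;
            C[j * ACols + i] = sum;              // mirror: C is symmetric
        }
    }
    return C;
}
```

For the 3 x 2 matrix {1, 2, 3, 4, 5, 6} this gives the 2 x 2 result {35, 44, 44, 56}.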

Number Of Pi in Parallel Programming Openmp

Hello everyone, I wanted to calculate pi with OpenMP, but something is wrong. Could you please tell me what I did wrong?
As you can see below, the run time is supposed to decrease as threads are added, but it doesn't.
#include <stdio.h>
#include <omp.h>
#define MAX_THREADS 4
static long num_steps = 100000000;
double step;
int main()
{
int i, j;
double pi, full_sum = 0.0;
double start_time, run_time;
double sum[MAX_THREADS];
step = 1.0 / (double)num_steps;
for (j = 1; j <= MAX_THREADS; j++){
omp_set_num_threads(j);
full_sum = 0.0;
start_time = omp_get_wtime();
#pragma omp parallel private(i)
{
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
double x;
double partial_sum = 0;
#pragma omp single
printf(" num_threads = %d", numthreads);
for (i = id; i< num_steps; i += numthreads){
x = (i + 0.5)*step;
partial_sum += +4.0 / (1.0 + x*x);
}
#pragma omp critical
full_sum += partial_sum;
}
pi = step * full_sum;
run_time = omp_get_wtime() - start_time;
printf("\n pi is %f in %f seconds %d threds \n ", pi, run_time, j);
}
}
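For comparison, the same midpoint-rule sum is often written with a reduction clause instead of a manual partial sum and a critical section. The sketch below is an illustration only, and the function name estimate_pi is made up; whether it changes the timings reported above is not established here.

```cpp
#include <cassert>
#include <cmath>

// Midpoint-rule estimate of pi as the integral of 4/(1+x^2) over [0,1].
// The pragma is ignored (serial run) if OpenMP is not enabled at compile time.
double estimate_pi(long num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < num_steps; ++i) {
        const double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

Each thread accumulates into its own private copy of sum, and the copies are combined once at the end of the loop.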

Square matrix transpose with CUDA

I'm trying to write a matrix transpose algorithm. I tested the program with a matrix size of 1024, and the result shows that not all elements are in the right places.
Why isn't my array transposing correctly? Can anyone help me or give me a hint? I would appreciate it. Thanks a lot!
Here is the whole code:
__global__ void transpose_naive (float *out, float *in, int w, int h )
{
unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;
if ( xIdx <=w && yIdx <=h ) {
unsigned int idx_in = xIdx + w * yIdx;
unsigned int idx_out = yIdx + h * xIdx;
out[idx_out] = in[idx_in];
}
}
int main()
{
int nx=1024;
int mem_size = nx*nx*sizeof(float);
int t=32;
dim3 dimGrid(((nx-1)/t) +1, ((nx-1)/t) +1);
dim3 dimBlock(t,t);
float *h_idata = (float*)malloc(mem_size);
float *h_cdata = (float*)malloc(mem_size);
float *d_idata, *d_cdata;
checkCuda(cudaMalloc(&d_idata, mem_size) );
checkCuda(cudaMalloc(&d_cdata, mem_size) );
// host
for (int j = 0; j < nx; j++)
for (int i = 0; i < nx; i++)
h_idata[j*nx + i] = j*nx + i;
// device
checkCuda(cudaMemcpy(d_idata, h_idata, mem_size, cudaMemcpyHostToDevice) );
// events for timing
cudaEvent_t startEvent, stopEvent;
checkCuda(cudaEventCreate(&startEvent) );
checkCuda(cudaEventCreate(&stopEvent) );
float ms;
checkCuda( cudaEventRecord(startEvent, 0) );
transpose_naive<<<dimGrid, dimBlock>>>(d_cdata, d_idata,nx,nx);
checkCuda(cudaEventRecord(stopEvent, 0) );
checkCuda(cudaEventSynchronize(stopEvent) );
checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent) );
checkCuda( cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost) );
printf("the time %5f ", ms);
printf("\n");
savetofile(h_idata,"i.txt",nx,nx);
savetofile(h_cdata,"t.txt",nx,nx);
error_exit:
// cleanup
checkCuda(cudaEventDestroy(startEvent) );
checkCuda(cudaEventDestroy(stopEvent) );
checkCuda( cudaFree(d_cdata) );
checkCuda( cudaFree(d_idata) );
free(h_idata);
free(h_cdata);
system("pause");
}

I think there is something wrong with the file output in "i.txt" and "t.txt"; otherwise the program looks correct. I made some minor changes to your code, adding error checking and printing to standard output. I print the last 4 x 4 block (rows and columns 1020 to 1023) of the input and of the output to cross-check the transpose. Run it on your system and verify whether the transpose is correct.
#include "cuda_runtime.h"
#include <stdio.h>
#include <stdlib.h>
#include "device_launch_parameters.h"
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code),file, line);
if (abort) exit(code);
}
}
__global__ void transpose_naive(float *out, float *in, int w, int h)
{
unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;
if (xIdx < w && yIdx < h) { // valid indices are 0..w-1 and 0..h-1
unsigned int idx_in = xIdx + w * yIdx;
unsigned int idx_out = yIdx + h * xIdx;
out[idx_out] = in[idx_in];
}
}
int main()
{
int nx = 1024;
int mem_size = nx*nx*sizeof(float);
int t = 32;
dim3 dimGrid(((nx - 1) / t) + 1, (((nx - 1) / t) + 1));
dim3 dimBlock(t, t);
float *h_idata = (float*)malloc(mem_size);
float *h_cdata = (float*)malloc(mem_size);
float *d_idata, *d_cdata;
gpuErrchk(cudaMalloc(&d_idata, mem_size));
gpuErrchk(cudaMalloc(&d_cdata, mem_size));
// host
for (int j = 0; j < nx; j++)
for (int i = 0; i < nx; i++)
h_idata[j*nx + i] = j*nx + i;
// device
gpuErrchk(cudaMemcpy(d_idata,h_idata,mem_size,cudaMemcpyHostToDevice));
// events for timing
cudaEvent_t startEvent, stopEvent;
gpuErrchk(cudaEventCreate(&startEvent));
gpuErrchk(cudaEventCreate(&stopEvent));
float ms;
gpuErrchk(cudaEventRecord(startEvent, 0));
transpose_naive<<<dimGrid, dimBlock>>>(d_cdata, d_idata, nx, nx);
gpuErrchk(cudaEventRecord(stopEvent, 0));
gpuErrchk(cudaEventSynchronize(stopEvent));
gpuErrchk(cudaEventElapsedTime(&ms, startEvent, stopEvent));
gpuErrchk(cudaMemcpy(h_cdata,d_cdata,mem_size,cudaMemcpyDeviceToHost));
printf("the time %5f ", ms);
printf("\n");
for (int i = 1020; i < 1024; i++) {
for (int j = 1020; j < 1024; j++) {
printf("%.2f ", h_idata[i*nx + j]);
}
printf("\n");
}
printf("\n");
for (int i = 1020; i < 1024; i++) {
for (int j = 1020; j < 1024; j++) {
printf("%.2f ", h_cdata[i*nx + j]);
}
printf("\n");
}
//savetofile(h_idata, "i.txt", nx, nx);
//savetofile(h_cdata, "t.txt", nx, nx);
//error_exit:
// cleanup
gpuErrchk(cudaEventDestroy(startEvent));
gpuErrchk(cudaEventDestroy(stopEvent));
gpuErrchk(cudaFree(d_cdata));
gpuErrchk(cudaFree(d_idata));
free(h_idata);
free(h_cdata);
//system("pause");
}

The only flaw in the code is the incorrect bounds check in the following line of the kernel:
if ( xIdx <=w && yIdx <=h ) {
Since the indices run from 0 to w-1 and from 0 to h-1 in the x and y dimensions respectively, the condition should be:
if ( xIdx <w && yIdx <h ) {
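The effect of the two conditions can be checked on the host by enumerating the thread indices that a launch would create. This is a sketch only, assuming square t x t blocks and the ((w-1)/t + 1) x ((h-1)/t + 1) grid from the question; count_stray_threads is a hypothetical helper. It counts threads that pass the check even though they lie outside the w x h matrix.

```cpp
#include <cassert>

// Counts simulated threads that pass the bounds check although they lie
// outside a w x h matrix, for t x t blocks and a grid of
// ((w-1)/t + 1) x ((h-1)/t + 1) blocks.
int count_stray_threads(int w, int h, int t, bool inclusive_check) {
    const int x_threads = ((w - 1) / t + 1) * t;
    const int y_threads = ((h - 1) / t + 1) * t;
    int stray = 0;
    for (int yIdx = 0; yIdx < y_threads; ++yIdx) {
        for (int xIdx = 0; xIdx < x_threads; ++xIdx) {
            const bool passes = inclusive_check ? (xIdx <= w && yIdx <= h)
                                                : (xIdx < w && yIdx < h);
            const bool inside = xIdx < w && yIdx < h;
            if (passes && !inside) ++stray;
        }
    }
    return stray;
}
```

With w = h = 1024 and t = 32 the grid covers the matrix exactly, so no thread ever reaches index 1024 and both conditions behave the same. As soon as the size is not a multiple of the block size, for example w = 5, h = 3, t = 4, the inclusive check lets 9 stray threads through, while the strict check lets none through.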

Parallelizing the Gaussian Blur Algorithm with OpenMP

I am trying to parallelize a Gaussian blur function with OpenMP, but I am new to OpenMP. When I parallelized the two for loops (I don't think any variables need to be private to each thread), the code ran even slower than before and the output was different. Did I do something wrong? What should I do to make it run faster?
void gaussian_blur(float **src, float **dst, int w, int h, float sigma)
{
int x, y, i;
int ksize = (int)(sigma * 2.f * 4.f + 1) | 1;
int halfk = ksize / 2;
float scale = -0.5f/(sigma*sigma);
float sum = 0.f;
float *kernel, *ringbuf;
int xmax = w - halfk;
int ymax = h - halfk;
// if sigma too small, just copy src to dst
if (ksize <= 1)
{
for (y = 0; y < h; y++)
for (x = 0; x < w; x++)
dst[y][x] = src[y][x];
return;
}
// create Gaussian kernel
kernel = malloc(ksize * sizeof(float));
ringbuf = malloc(ksize * sizeof(float));
#pragma omp parallel for reduction(+ : sum)
for (i = 0; i < ksize; i++)
{
float x = (float)(i - halfk);
float t = expf(scale * x * x);
kernel[i] = t;
sum += t;
}
scale = 1.f / sum;
#pragma omp parallel for
for (i = 0; i < ksize; i++)
kernel[i] *= scale;
// blur each row
#pragma omp parallel for // this is the for loop I parallelized but ended up with wrong output and running slower
for (y = 0; y < h; y++)
{
int x1;
int bufi0 = ksize-1;
float tmp = src[y][0];
for (x1 = 0; x1 < halfk ; x1++) ringbuf[x1] = tmp;
for (; x1 < ksize-1; x1++) ringbuf[x1] = src[y][x1-halfk];
for (x1 = 0; x1 < w; x1++)
{
if(x1 < xmax)
ringbuf[bufi0++] = src[y][x1+halfk];
else
ringbuf[bufi0++] = src[y][w-1];
if (bufi0 == ksize) bufi0 = 0;
dst[y][x1] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
// blur each column
#pragma omp parallel for // this is the for loop I parallelized but ended up with wrong output and running slower
for (x = 0; x < w; x++)
{
int y1;
int bufi0 = ksize-1;
float tmp = dst[0][x];
for (y1 = 0; y1 < halfk ; y1++) ringbuf[y1] = tmp;
for ( ; y1 < ksize-1; y1++) ringbuf[y1] = dst[y1-halfk][x];
for (y1 = 0; y1 < h; y1++)
{
if(y1 < ymax)
ringbuf[bufi0++] = dst[y1+halfk][x];
else
ringbuf[bufi0++] = dst[h-1][x];
if (bufi0 == ksize) bufi0 = 0;
dst[y1][x] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
// clean up
free(kernel);
free(ringbuf);
}

Besides the need to properly identify private and shared data, there are several things you could do to speed up your program.
As a first step, remove any unnecessary concurrency. For example, how big is ksize on average? If it is less than several hundred elements, it makes no sense to employ OpenMP for operations as simple as computing the kernel and then normalising it:
#pragma omp parallel for reduction(+ : sum)
for (i = 0; i < ksize; i++)
{
float x = (float)(i - halfk);
float t = expf(scale * x * x);
kernel[i] = t;
sum += t;
}
scale = 1.f / sum;
#pragma omp parallel for
for (i = 0; i < ksize; i++)
kernel[i] *= scale;
On a typical modern CPU it would take more cycles to bootstrap the parallel regions than to compute this on a single core. Also on modern CPUs these loops can be unrolled and vectorised and you can get up to 8x boost on a single core. If the kernel is too small, then besides OpenMP overhead you will also get slowdown from excessive false sharing. You have to make sure that each thread gets an exact multiple of 16 elements (64 bytes of cache line size / sizeof(float)) to work on in order to prevent false sharing.
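The "exact multiple of 16 elements" rule can be turned into a chunk size computation. The following is a minimal sketch, assuming a 64-byte cache line and 4-byte floats; cache_line_chunk is a hypothetical helper name.

```cpp
#include <cassert>

// Floats per cache line, assuming a 64-byte line (check your CPU).
constexpr int kFloatsPerLine = static_cast<int>(64 / sizeof(float));

// Smallest per-thread chunk that covers n elements with nthreads threads
// and is a whole number of cache lines.
int cache_line_chunk(int n, int nthreads) {
    const int per_thread = (n + nthreads - 1) / nthreads;  // ceil(n / nthreads)
    return (per_thread + kFloatsPerLine - 1) / kFloatsPerLine * kFloatsPerLine;
}
```

The result could be passed as the chunk size of schedule(static, chunk). This only avoids shared cache lines if the array itself starts on a cache line boundary, which is a further assumption.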
You also have to make sure that threads do not share cache lines in the column blur section.
// blur each column
#pragma omp parallel for
for (x = 0; x < w; x++)
{
...
for (y1 = 0; y1 < h; y1++)
{
...
dst[y1][x] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
Because of the access pattern here, you have to make sure that each thread gets a chunk of columns that is a multiple of 16 or else there will be a border overlap area of 16*y1 pixels shared by every two consecutive threads where excessive false sharing will occur. If you cannot guarantee that w is divisible by 16, then you can give each thread a starting offset in the y direction, e.g. the innermost loop becomes:
int tid = omp_get_thread_num();
for (y1 = 2*tid; y1 < h; y1++)
{
...
}
for (y1 = 0; y1 < 2*tid; y1++)
{
...
}
The multiplier 2 is arbitrary. The idea is to give the next thread several rows of advance in comparison to the current one so that both threads will not be processing the same line at once at any moment in time. You could also use addition and modulo arithmetic to compute y1, i.e.
for (y2 = 0; y2 < h; y2++)
{
y1 = (y2 + 2*tid) % h;
...
}
but this is generally slower than just separating the loop in two parts.
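The two variants visit the rows in the same order, which can be checked with a small host-side sketch. The function names and the stagger parameter (the multiplier 2 above) are made up for illustration, and the split version wraps the start row so that it stays below h, which the loops above leave implicit.

```cpp
#include <cassert>
#include <vector>

// Row order produced by splitting the loop in two parts.
std::vector<int> split_loop_order(int h, int tid, int stagger = 2) {
    const int start = (stagger * tid) % h;   // wrap added so that start < h
    std::vector<int> order;
    for (int y1 = start; y1 < h; ++y1) order.push_back(y1);
    for (int y1 = 0; y1 < start; ++y1) order.push_back(y1);
    return order;
}

// Row order produced by the addition-and-modulo variant.
std::vector<int> modulo_order(int h, int tid, int stagger = 2) {
    std::vector<int> order;
    for (int y2 = 0; y2 < h; ++y2) order.push_back((y2 + stagger * tid) % h);
    return order;
}
```

For h = 6 and tid = 1 both functions produce the order 2, 3, 4, 5, 0, 1, so thread 1 starts two rows ahead of thread 0.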
Also mind your data size. The last level cache (LLC) has very high but still limited bandwidth. If data cannot fit in the private cache of each core then compiler optimisations such as loop vectorisations can put very high pressure on the LLC. Things get more ugly if data doesn't fit in the LLC and therefore the main memory has to be accessed.
If you don't know what false sharing is, there is an article in Dr.Dobb's that kind of explains it here.

I may have fixed your code. You did not post your convolve function, so it's difficult to say for sure, but I'm not sure it matters. There are at least two bugs. First, there is a race condition on the ringbuf array. To fix this, I make the array nthreads times larger so that each thread gets its own slice.
ringbuf = (float*)malloc(nthreads*ksize * sizeof(float));
To access the array do something like this
int ithread = omp_get_thread_num();
ringbuf[ksize*ithread + x1]
Edit: I added some code which defines ringbuf inside the parallel block. That way you don't have to index ringbuf by thread number.
The second error involves the bufi0 variable. I defined a new index like this:
const int ibufi0_fix = (x1+ksize-1)%ksize;
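As a sanity check, the closed-form index can be compared with the incrementing counter from the question. This sketch only shows that, when a row is processed from x1 = 0 upward, both select the same ring buffer slot; the helper name is hypothetical and nothing here depends on the convolve function.

```cpp
#include <cassert>

// Returns true if the closed-form write index (x1 + ksize - 1) % ksize
// matches the incrementing counter for every x1 in [0, w).
bool closed_form_matches_counter(int ksize, int w) {
    int bufi0 = ksize - 1;                    // counter as in the question
    for (int x1 = 0; x1 < w; ++x1) {
        const int write_index = bufi0++;      // slot written by ringbuf[bufi0++]
        if (bufi0 == ksize) bufi0 = 0;        // wrap around
        if (write_index != (x1 + ksize - 1) % ksize) return false;
    }
    return true;
}
```

The closed form depends only on x1, so it carries no state from one loop iteration to the next.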
Below is the code I used to check it. Replace with your convolve function. Note, this may still be quite inefficient. There are probably cache issues such as cache misses and false sharing (particularly when you convolve vertically). Hopefully, though, the image will be correct now.
Edit: here is a paper by Intel that shows how to do this best with AVX. It's optimized to minimize the cache misses. I'm not sure it's optimized for threading though.
http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions
I'm writing my own function for this (it's actually the reason I started learning OpenMP), which uses SSE/AVX as well. Matrix multiplication and image filtering have a lot in common, so I learned how to optimize matrix multiplication first and will do Gaussian blur shortly...
#include "math.h"
#include "omp.h"
#include "stdio.h"
#include <nmmintrin.h>
float convolve(const float *kernel, const float *ringbuf, const int ksize, const int bufi0) {
float sum = 0.0f;
for(int i=0; i<ksize; i++) {
sum += kernel[i]*ringbuf[i];
}
return sum;
}
void gaussian_blur(float *src, float *dst, int w, int h, float sigma, int nthreads)
{
int x, y, i;
int ksize = (int)(sigma * 2.f * 4.f + 1) | 1;
int halfk = ksize / 2;
printf("ksize %d\n", ksize);
float scale = -0.5f/(sigma*sigma);
float sum = 0.f;
float *kernel, *ringbuf;
int xmax = w - halfk;
int ymax = h - halfk;
// if sigma too small, just copy src to dst
if (ksize <= 1)
{
for (y = 0; y < h; y++)
for (x = 0; x < w; x++)
dst[y*w + x] = src[y*w + x];
return;
}
// create Gaussian kernel
//kernel = malloc(ksize * sizeof(float));
kernel = (float*)_mm_malloc(ksize * sizeof(float),16);
//ringbuf = malloc(ksize * sizeof(float));
ringbuf = (float*)_mm_malloc(nthreads*ksize * sizeof(float),16);
#pragma omp parallel for reduction(+ : sum) if(nthreads>1)
for (i = 0; i < ksize; i++)
{
float x = (float)(i - halfk);
float t = expf(scale * x * x);
kernel[i] = t;
sum += t;
}
scale = 1.f / sum;
#pragma omp parallel for if(nthreads>1)
for (i = 0; i < ksize; i++)
kernel[i] *= scale;
// blur each row
#pragma omp parallel for if(nthreads>1)// this is the for loop I parallelized but ended up with wrong output and running slower
for (y = 0; y < h; y++)
{
int ithread = omp_get_thread_num();
//printf("nthread %d\n", nthread);
int x1;
int bufi0 = ksize-1;
float tmp = src[y*w + 0];
for (x1 = 0; x1 < halfk ; x1++) ringbuf[ksize*ithread + x1] = tmp;
for (; x1 < ksize-1; x1++) ringbuf[ksize*ithread + x1] = src[y*w + x1-halfk];
for (x1 = 0; x1 < w; x1++)
{
const int ibufi0_fix = (x1+ksize-1)%ksize;
if(x1 < xmax)
ringbuf[ksize*ithread + ibufi0_fix] = src[y*w + x1+halfk];
else
ringbuf[ksize*ithread + ibufi0_fix] = src[y*w + w-1];
if (bufi0 == ksize) bufi0 = 0;
dst[y*w + x1] = convolve(kernel, &ringbuf[ksize*ithread], ksize, bufi0);
}
}
// blur each column
#pragma omp parallel for if(nthreads>1)// this is the for loop I parallelized but ended up with wrong output and running slower
for (x = 0; x < w; x++)
{
int ithread = omp_get_thread_num();
int y1;
int bufi0 = ksize-1;
float tmp = dst[0*w + x];
for (y1 = 0; y1 < halfk ; y1++) ringbuf[ksize*ithread + y1] = tmp;
for ( ; y1 < ksize-1; y1++) ringbuf[ksize*ithread + y1] = dst[(y1-halfk)*w + x];
for (y1 = 0; y1 < h; y1++)
{
const int ibufi0_fix = (y1+ksize-1)%ksize;
if(y1 < ymax)
ringbuf[ksize*ithread + ibufi0_fix] = dst[(y1+halfk)*w + x]; // write into this thread's slice
else
ringbuf[ksize*ithread + ibufi0_fix] = dst[(h-1)*w + x];
if (bufi0 == ksize) bufi0 = 0;
dst[y1*w + x] = convolve(kernel, &ringbuf[ksize*ithread], ksize, bufi0);
}
}
// clean up
_mm_free(kernel);
_mm_free(ringbuf);
}
int compare(float *dst1, float *dst2, const int n) {
int error = 0;
for(int i=0; i<n; i++) {
if(dst1[i] != dst2[i]) error++; // compare element i, not only the first element
}
return error;
}
int main() {
const int w = 20;
const int h = 20;
float *src = (float*)_mm_malloc(w*h*sizeof(float),16);
float *dst1 = (float*)_mm_malloc(w*h*sizeof(float),16);
float *dst2 = (float*)_mm_malloc(w*h*sizeof(float),16);
for(int i=0; i<w*h; i++) {
src[i] = i;
}
gaussian_blur(src, dst1, w, h, 1.0f, 1);
gaussian_blur(src, dst2, w, h, 1.0f, 4);
int error = compare(dst1, dst2, w*h);
printf("error %d\n", error);
_mm_free(src);
_mm_free(dst1);
_mm_free(dst2);
}
Edit: here is code which defines ringbuf inside the parallel block based on the comment by Hristo. It should be equivalent.
#include "math.h"
#include "omp.h"
#include "stdio.h"
#include <nmmintrin.h>
float convolve(const float *kernel, const float *ringbuf, const int ksize, const int bufi0) {
float sum = 0.0f;
for(int i=0; i<ksize; i++) {
sum += kernel[i]*ringbuf[i];
}
return sum;
}
void gaussian_blur(float *src, float *dst, int w, int h, float sigma, int nthreads)
{
int x, y, i;
int ksize = (int)(sigma * 2.f * 4.f + 1) | 1;
int halfk = ksize / 2;
printf("ksize %d\n", ksize);
float scale = -0.5f/(sigma*sigma);
float sum = 0.f;
float *kernel;
int xmax = w - halfk;
int ymax = h - halfk;
// if sigma too small, just copy src to dst
if (ksize <= 1)
{
for (y = 0; y < h; y++)
for (x = 0; x < w; x++)
dst[y*w + x] = src[y*w + x];
return;
}
// create Gaussian kernel
//kernel = malloc(ksize * sizeof(float));
kernel = (float*)_mm_malloc(ksize * sizeof(float),16);
#pragma omp parallel for reduction(+ : sum) if(nthreads>1)
for (i = 0; i < ksize; i++)
{
float x = (float)(i - halfk);
float t = expf(scale * x * x);
kernel[i] = t;
sum += t;
}
scale = 1.f / sum;
#pragma omp parallel for if(nthreads>1)
for (i = 0; i < ksize; i++)
kernel[i] *= scale;
// blur each row
//#pragma omp parallel for if(nthreads>1)// this is the for loop I parallelized but ended up with wrong output and running slower
#pragma omp parallel if(nthreads>1)
{
float *ringbuf = (float*)_mm_malloc(ksize * sizeof(float),16);
#pragma omp for// this is the for loop I parallelized but ended up with wrong output and running slower
for (y = 0; y < h; y++)
{
//printf("nthread %d\n", nthread);
int x1;
int bufi0 = ksize-1;
float tmp = src[y*w + 0];
for (x1 = 0; x1 < halfk ; x1++) ringbuf[x1] = tmp;
for (; x1 < ksize-1; x1++) ringbuf[x1] = src[y*w + x1-halfk];
for (x1 = 0; x1 < w; x1++)
{
const int ibufi0_fix = (x1+ksize-1)%ksize;
if(x1 < xmax)
ringbuf[ibufi0_fix] = src[y*w + x1+halfk];
else
ringbuf[ibufi0_fix] = src[y*w + w-1];
if (bufi0 == ksize) bufi0 = 0;
dst[y*w + x1] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
_mm_free(ringbuf);
}
// blur each column
#pragma omp parallel if(nthreads>1)
{
float *ringbuf = (float*)_mm_malloc(ksize * sizeof(float),16);
#pragma omp for// this is the for loop I parallelized but ended up with wrong output and running slower
for (x = 0; x < w; x++)
{
int y1;
int bufi0 = ksize-1;
float tmp = dst[0*w + x];
for (y1 = 0; y1 < halfk ; y1++) ringbuf[y1] = tmp;
for ( ; y1 < ksize-1; y1++) ringbuf[y1] = dst[(y1-halfk)*w + x];
for (y1 = 0; y1 < h; y1++)
{
const int ibufi0_fix = (y1+ksize-1)%ksize;
if(y1 < ymax)
ringbuf[ibufi0_fix] = dst[(y1+halfk)*w + x];
else
ringbuf[ibufi0_fix] = dst[(h-1)*w + x];
if (bufi0 == ksize) bufi0 = 0;
dst[y1*w + x] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
_mm_free(ringbuf);
}
// clean up
_mm_free(kernel);
}
int compare(float *dst1, float *dst2, const int n) {
int error = 0;
for(int i=0; i<n; i++) {
if(dst1[i] != dst2[i]) error++; // compare element i, not only the first element
}
return error;
}
int main() {
const int w = 20;
const int h = 20;
float *src = (float*)_mm_malloc(w*h*sizeof(float),16);
float *dst1 = (float*)_mm_malloc(w*h*sizeof(float),16);
float *dst2 = (float*)_mm_malloc(w*h*sizeof(float),16);
for(int i=0; i<w*h; i++) {
src[i] = i;
}
gaussian_blur(src, dst1, w, h, 1.0f, 1);
gaussian_blur(src, dst2, w, h, 1.0f, 4);
int error = compare(dst1, dst2, w*h);
printf("error %d\n", error);
_mm_free(src);
_mm_free(dst1);
_mm_free(dst2);
}

Unordinary performance gap between OpenCL and CUDA

I have coded a simple tiled matrix multiplication in CUDA. It's like this:
__global__ void matrixMultiplyShared(float * A, float * B, float * C,
int numARows, int numAColumns,
int numBRows, int numBColumns,
int numCRows, int numCColumns) {
__shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
__shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
int bx = blockIdx.x; int by = blockIdx.y;
int tx = threadIdx.x; int ty = threadIdx.y;
int row = by * TILE_WIDTH + ty;
int col = bx * TILE_WIDTH + tx;
float Cvalue = 0.0;
// Loop over the M and N tiles required to compute the Pd element
for (int m = 0; m < (numAColumns-1)/TILE_WIDTH+1; ++m) {
if(row<numARows && m*TILE_WIDTH+tx < numAColumns){
ds_A[ty][tx] = A[row*numAColumns + m*TILE_WIDTH+tx];
} else {
ds_A[ty][tx] = 0;
}
if(m*TILE_WIDTH+ty < numBRows && col < numBColumns){
ds_B[ty][tx] = B[(m*TILE_WIDTH+ty)*numBColumns+col];
} else {
ds_B[ty][tx] = 0;
}
__syncthreads();
if(row < numCRows && col < numCColumns){
for (int k = 0; k < TILE_WIDTH; ++k)
Cvalue += ds_A[ty][k] * ds_B[k][tx];
}
__syncthreads();
}
if(row < numCRows && col < numCColumns)
C[row*numCColumns+col] = Cvalue;
}
After that, I used the same kernel (with some minor changes) in an OpenCL version to compare the performance of CUDA and OpenCL. The result was far from what I expected: OpenCL was 6-7 times faster than CUDA. Is that plausible?
The output of Nsight is as follows:
[Nsight timeline screenshots: CUDA, OpenCL]
You can see a large gap between the start of the application and the execution of the kernel. Why does that happen?
My GPU: GTX 580
Kernel execution time (CUDA): 3.78 s
Kernel execution time (OpenCL): 0.53 s
CUDA Code: http://pastebin.com/VQMp3Hba
OpenCL Host Code: http://pastebin.com/cjGYSLQf
OpenCL Kernel Code: http://pastebin.com/KKw3Ayz7

You can try inserting explicit timers in the code instead of trusting the output of the tool. It may be that the tool is wrong.
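An explicit timer can be as simple as a wall-clock wrapper around the call being measured. This is a host-side sketch with a hypothetical helper name, elapsed_ms. Because kernel launches are asynchronous, the timed callable would have to wait for the device before returning, for example with cudaDeviceSynchronize() in CUDA or clFinish() in OpenCL.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: wall-clock milliseconds taken by `work`.
// For GPU code, `work` must block until the kernel has finished,
// otherwise only the launch overhead is measured.
template <typename F>
double elapsed_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Timing the CUDA and OpenCL versions with the same wrapper, after one untimed warm-up launch each, gives numbers that can be compared independently of the profiler.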
