Some CUDA computations fail with larger block dimension (< 1024)

Some CUDA computations fail with larger block dimension (< 1024) - parallel-processing

I am learning CUDA with a GTX 960 4GB. I wrote a program which performs an element-wise matrix multiplication. When I increase the block dimensions for x and y to lets say (32 x 32) in combination with a large matrix (lets say 15000 x 15000 elements), some but not all multiplication results are wrong (value 0 instead of 6).
When I then decrease the block dimensions to e.g (8 x 8), all results are right again. When I decrease the Matrix size, the results are right again, too.
So in case of this example, there seems to be combinations of total threads and threads per block, which does not work.
I am surprised I can't find any threads regarding this topic. All I can find is about increasing performance and occupancy, but not about when some but not all calculations are aborted.
The grid dimensions are calculated as follows:
dim3 blocks(ceil<int>(COLS / threads.x), ceil<int>(ROWS / threads.y));
Why do some multiplications fail while others are successful?
Some Examples
Block dim : (8, 8)
Matrix shape : (15000, 15000)
Verification : 0 elements have failed, total length 225000000, shape: (15000, 15000)
Block dim : (16, 16)
Matrix shape : (15000, 15000)
Verification : 239936 elements have failed, total length 225000000, shape: (15000, 15000)
Block dim : (32, 32)
Matrix shape : (15000, 15000)
Verification : 719424 elements have failed, total length 225000000, shape: (15000, 15000).
Block dim : (32, 32)
Matrix shape : (10000, 10000)
Verification : 0 elements have failed, total length 100000000, shape: (10000, 10000).
Driver Version
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.82.00 Thu Oct 14 10:24:40 UTC 2021
Complete Code
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define ROWS 10000
#define COLS 10000
#define MAX_ERR 1e-6
typedef struct {
int width;
int height;
float* elements;
} Matrix;
size_t ij(int i, int j){
return j * ROWS + i;
}
__global__ void matrix_multi_elemwise(const Matrix OUT, const Matrix A, const Matrix B) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (col < A.width && row < A.height) {
int index = row * A.height + col; // linearisation of index
OUT.elements[index] = A.elements[index] * B.elements[index];
}
}
int main(){
Matrix A, B, OUT;
Matrix dev_A, dev_B, dev_OUT;
size_t SIZE = ROWS * COLS * sizeof(float);
// Allocate host memory
A.elements = (float*) malloc(SIZE);
B.elements = (float*) malloc(SIZE);
OUT.elements = (float*) malloc(SIZE);
// Initialize host matrices
A.height = ROWS; A.width = COLS;
B.height = ROWS; B.width = COLS;
OUT.height = ROWS; OUT.width = COLS;
for (int j = 0; j < ROWS; j++) {
for(int i = 0; i < COLS; i++){
A.elements[ij(i, j)] = 2.0f;
B.elements[ij(i, j)] = 3.0f;
}
}
// Allocate device memory
cudaMalloc((void**) &dev_A.elements, SIZE);
cudaMalloc((void**) &dev_B.elements, SIZE);
cudaMalloc((void**) &dev_OUT.elements, SIZE);
dev_A.height = A.height; dev_A.width = A.width;
dev_B.height = A.height; dev_B.width = B.width;
dev_OUT.height = A.height; dev_OUT.width = OUT.width;
// Transfer data from host to device memory
cudaMemcpy(dev_A.elements, A.elements, SIZE, cudaMemcpyHostToDevice);
cudaMemcpy(dev_B.elements, B.elements, SIZE, cudaMemcpyHostToDevice);
// Executing kernel
dim3 threads(16, 16);
dim3 blocks(ceil<int>(COLS / threads.x), ceil<int>(ROWS / threads.y));
matrix_multi_elemwise<<<blocks, threads>>>(dev_OUT, dev_A, dev_B);
cudaError_t err = cudaGetLastError();
if(err != cudaSuccess) {
printf("CUDA Runtime API Error reported : %s in file %s on line.\n", cudaGetErrorString(err), __FILE__);
}
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Transfer data back to host memory
cudaMemcpy(OUT.elements, dev_OUT.elements, SIZE, cudaMemcpyDeviceToHost);
// Verification
int count = 0, length = 0, i = 0, j = 0;
for (j = 0; j < ROWS; j++) {
for(i = 0; i < COLS; i++){
//assert(fabs(OUT.elements[ij(i, j)] / A.elements[ij(i, j)] - B.elements[ij(i, j)]) < MAX_ERR);
if (fabs(OUT.elements[ij(i, j)] / A.elements[ij(i, j)] - B.elements[ij(i, j)]) > MAX_ERR) {
count++;
}
length++;
}
}
printf("Verification: %i elements have failed, total length %i, shape: (%i, %i).\n", count, length, i, j);
// Deallocate device memory
cudaFree(dev_A.elements);
cudaFree(dev_B.elements);
cudaFree(dev_OUT.elements);
// Deallocate host memory
free(A.elements);
free(B.elements);
free(OUT.elements);
}

The number of blocks is wrong. Indeed, COLS and threads.x are both integers. Thus, the result is a truncated integer. ceil<int> cannot ceil the result as it has already been truncated. This cause some blocks not to be computed: 15000 is divisible by 8 but not by 16. You need to either cast COLS to a floating-point number or compute the ceil result manually (safer). Here is an example:
dim3 blocks((COLS + threads.x - 1) / threads.x, (ROWS + threads.y - 1) / threads.y);
As pointed out in the comment, note that row * A.height + col is wrong: it should be row * A.width + col instead. This causes issues for non square matrices.

Related

Sparse plus dense matrix operation using cuSPARSE

Is it possible to add a sparse matrix and a dense matrix using cuSPARSE? In cuBLAS, I'd just treat the matrices as vectors and use axpy. cuSPARSE does have axpy for sparse/dense vectors, but it cannot be used for matrices because sparse vectors and matrices have different memory structure.

cusparse has dense-to-sparse and sparse-to-dense conversion routines. You could:
convert the sparse matrix to dense (e.g. with cusparse<t>csr2dense), then add the two with cublas<t>geam, producing a dense matrix result
convert the dense matrix to sparse (e.g. with cusparse<t>dense2csr), then use cusparse<t>csrgeam to produce a sparse result
Note that using cusparse<t>geam is a little bit more involved than just a single function call, but the usage methodology is given in the documentation. Also, when using cusparse<t>dense2csr, you will likely want to use cusparse<t>nnz to help with the storage allocations needed.

Here is a fully worked example with a customized kernel summing up a sparse matrix A stored in CSR format with a dense matrix B providing a dense matrix C. The customized kernel explicitly deals with the mapping between the CSR and dense indices.
#include <stdio.h>
#include <assert.h>
#include <cusparse.h>
#define BLOCKSIZEX 16
#define BLOCKSIZEY 16
/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) { exit(code); }
}
}
void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }
/***************************/
/* CUSPARSE ERROR CHECKING */
/***************************/
static const char *_cusparseGetErrorEnum(cusparseStatus_t error)
{
switch (error)
{
case CUSPARSE_STATUS_SUCCESS:
return "CUSPARSE_STATUS_SUCCESS";
case CUSPARSE_STATUS_NOT_INITIALIZED:
return "CUSPARSE_STATUS_NOT_INITIALIZED";
case CUSPARSE_STATUS_ALLOC_FAILED:
return "CUSPARSE_STATUS_ALLOC_FAILED";
case CUSPARSE_STATUS_INVALID_VALUE:
return "CUSPARSE_STATUS_INVALID_VALUE";
case CUSPARSE_STATUS_ARCH_MISMATCH:
return "CUSPARSE_STATUS_ARCH_MISMATCH";
case CUSPARSE_STATUS_MAPPING_ERROR:
return "CUSPARSE_STATUS_MAPPING_ERROR";
case CUSPARSE_STATUS_EXECUTION_FAILED:
return "CUSPARSE_STATUS_EXECUTION_FAILED";
case CUSPARSE_STATUS_INTERNAL_ERROR:
return "CUSPARSE_STATUS_INTERNAL_ERROR";
case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
return "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
case CUSPARSE_STATUS_ZERO_PIVOT:
return "CUSPARSE_STATUS_ZERO_PIVOT";
}
return "<unknown>";
}
inline void __cusparseSafeCall(cusparseStatus_t err, const char *file, const int line)
{
if (CUSPARSE_STATUS_SUCCESS != err) {
fprintf(stderr, "CUSPARSE error in file '%s', line %d, error %s\nterminating!\n", __FILE__, __LINE__, \
_cusparseGetErrorEnum(err)); \
assert(0); \
}
}
extern "C" void cusparseSafeCall(cusparseStatus_t err) { __cusparseSafeCall(err, __FILE__, __LINE__); }
/*****************************/
/* SETUP DESCRIPTOR FUNCTION */
/*****************************/
void setUpDescriptor(cusparseMatDescr_t &descrA, cusparseMatrixType_t matrixType, cusparseIndexBase_t indexBase) {
cusparseSafeCall(cusparseCreateMatDescr(&descrA));
cusparseSafeCall(cusparseSetMatType(descrA, matrixType));
cusparseSafeCall(cusparseSetMatIndexBase(descrA, indexBase));
}
/********************************************************/
/* DENSE TO SPARSE CONVERSION FOR REAL DOUBLE PRECISION */
/********************************************************/
void dense2SparseD(const double * __restrict__ d_A_dense, int **d_nnzPerVector, double **d_A,
int **d_A_RowIndices, int **d_A_ColIndices, int &nnz, cusparseMatDescr_t descrA,
const cusparseHandle_t handle, const int M, const int N) {
const int lda = M; // --- Leading dimension of dense matrix
gpuErrchk(cudaMalloc(&d_nnzPerVector[0], M * sizeof(int)));
// --- Compute the number of nonzero elements per row and the total number of nonzero elements
// the dense d_A_dense
cusparseSafeCall(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, M, N, descrA, d_A_dense,
lda, d_nnzPerVector[0], &nnz));
// --- Device side sparse matrix
gpuErrchk(cudaMalloc(&d_A[0], nnz * sizeof(double)));
gpuErrchk(cudaMalloc(&d_A_RowIndices[0], (M + 1) * sizeof(int)));
gpuErrchk(cudaMalloc(&d_A_ColIndices[0], nnz * sizeof(int)));
cusparseSafeCall(cusparseDdense2csr(handle, M, N, descrA, d_A_dense, lda, d_nnzPerVector[0],
d_A[0], d_A_RowIndices[0], d_A_ColIndices[0]));
}
/********************************/
/* SPARSE + DENSE CUSTOM KERNEL */
/********************************/
__global__ void sparsePlusDense(const double * __restrict__ d_A, const int * __restrict__ d_A_RowIndices,
const int * __restrict__ d_A_ColIndices, const double * __restrict__ d_B,
double * __restrict__ d_C, const int M, const int N) {
const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
const int tidy = threadIdx.y + blockIdx.y * blockDim.y;
if ((tidx >= N) || (tidy >= M)) return;
const int row = tidy;
const int nnzRow = d_A_RowIndices[tidy + 1] - d_A_RowIndices[tidy];
if (tidx >= nnzRow) return;
const int col = d_A_ColIndices[d_A_RowIndices[tidy] + tidx];
d_C[row * N + col] = d_C[row * N + col] + d_A[d_A_RowIndices[tidy] + tidx];
}
/********/
/* MAIN */
/********/
int main() {
cusparseHandle_t handle;
// --- Initialize cuSPARSE
cusparseSafeCall(cusparseCreate(&handle));
// --- Initialize matrix descriptors
cusparseMatDescr_t descrA;
setUpDescriptor(descrA, CUSPARSE_MATRIX_TYPE_GENERAL, CUSPARSE_INDEX_BASE_ZERO);
/**************************/
/* SETTING UP THE PROBLEM */
/**************************/
const int M = 5; // --- Number of rows
const int N = 4; // --- Number of columns
// --- Host side dense matrix
double *h_A_dense = (double*)malloc(M * N * sizeof(*h_A_dense));
// --- Column-major storage
h_A_dense[0] = 0.4612; h_A_dense[5] = 0.0; h_A_dense[10] = 1.3; h_A_dense[15] = 0.0;
h_A_dense[1] = 0.0; h_A_dense[6] = 1.443; h_A_dense[11] = 0.0; h_A_dense[16] = 0.0;
h_A_dense[2] = -0.0006; h_A_dense[7] = 0.4640; h_A_dense[12] = 0.0723; h_A_dense[17] = 0.0;
h_A_dense[3] = 0.3566; h_A_dense[8] = 0.0; h_A_dense[13] = 0.7543; h_A_dense[18] = 0.0;
h_A_dense[4] = 0.; h_A_dense[9] = 0.0; h_A_dense[14] = 0.0; h_A_dense[19] = 0.1;
// --- Create device array and copy host array to it
double *d_A_dense; gpuErrchk(cudaMalloc(&d_A_dense, M * N * sizeof(double)));
gpuErrchk(cudaMemcpy(d_A_dense, h_A_dense, M * N * sizeof(*d_A_dense), cudaMemcpyHostToDevice));
/*******************************/
/* FROM DENSE TO SPARSE MATRIX */
/*******************************/
int nnz = 0; // --- Number of nonzero elements in dense matrix
int *d_nnzPerVector; // --- Device side number of nonzero elements per row
double *d_A; // --- Sparse matrix values - array of size nnz
int *d_A_RowIndices; // --- "Row indices"
int *d_A_ColIndices; // --- "Column indices"
dense2SparseD(d_A_dense, &d_nnzPerVector, &d_A, &d_A_RowIndices, &d_A_ColIndices, nnz, descrA,
handle, M, N);
/*************************/
/* DENSE MATRIX OPERANDS */
/*************************/
// --- Host side dense matrix
double *h_B_dense = (double*)malloc(M * N * sizeof(*h_B_dense));
// --- Column-major storage
h_B_dense[0] = 1.5; h_B_dense[5] = -0.2; h_B_dense[10] = -0.9; h_B_dense[15] = 1.1;
h_B_dense[1] = 2.1; h_B_dense[6] = 2.0; h_B_dense[11] = 1.1; h_B_dense[16] = -0.009;
h_B_dense[2] = -2; h_B_dense[7] = -0.82; h_B_dense[12] = 1.2; h_B_dense[17] = 1.21;
h_B_dense[3] = -0.001; h_B_dense[8] = -1.1; h_B_dense[13] = 0.887; h_B_dense[18] = 1.1143;
h_B_dense[4] = 1.1; h_B_dense[9] = 2.1; h_B_dense[14] = -1.1213; h_B_dense[19] = 5.4334;
// --- Create device array and copy host array to it
double *d_B_dense; gpuErrchk(cudaMalloc(&d_B_dense, M * N * sizeof(double)));
gpuErrchk(cudaMemcpy(d_B_dense, h_B_dense, M * N * sizeof(*d_B_dense), cudaMemcpyHostToDevice));
// --- Allocate space for the result e initialize it
double *d_C_dense; gpuErrchk(cudaMalloc(&d_C_dense, M * N * sizeof(double)));
gpuErrchk(cudaMemcpy(d_C_dense, d_B_dense, M * N * sizeof(double), cudaMemcpyDeviceToDevice));
/*********************************/
/* RUN THE SPARSE-DENSE ADDITION */
/*********************************/
dim3 GridDim(iDivUp(N, BLOCKSIZEX), iDivUp(M, BLOCKSIZEY));
dim3 BlockDim(BLOCKSIZEX, BLOCKSIZEY);
sparsePlusDense << <GridDim , BlockDim>> >(d_A, d_A_RowIndices, d_A_ColIndices, d_B_dense, d_C_dense, M, N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
/*******************************************************/
/* CHECKING THE RESULTS FOR SPARSE TO DENSE CONVERSION */
/*******************************************************/
double *h_C_dense = (double *)malloc(M * N * sizeof(double));
gpuErrchk(cudaMemcpy(h_C_dense, d_C_dense, M * N * sizeof(double), cudaMemcpyDeviceToHost));
printf("\nFirst dense operand matrix (column-major storage) \n");
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++)
printf("%f\t", h_A_dense[n * M + m]);
printf("\n");
}
printf("\nSecond dense operand matrix (row-major storage) \n");
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++)
printf("%f\t", h_B_dense[n + m * N]);
printf("\n");
}
printf("\nReference dense matrix (the first has column-major storage, the second row-major\n");
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++)
printf("%f\t", h_A_dense[n * M + m] + h_B_dense[n + m * N]);
printf("\n");
}
printf("\nSecond dense operand matrix (row-major storage) \n");
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++)
printf("%f\t", h_C_dense[n + m * N]);
printf("\n");
}
return 0;
}

Kernel crashes while trying to do a simple value assignment

I am learning CUDA and still at the very beginner level. I am trying a simple assignment but my code crashes when I run it and I am not sure why. Any help would be appreciated.
EDIT: Crashes on cudaMemcpy and in Image structure, the pixelVal is of type int**. Is that the cause?
Original C++ code:
void Image::reflectImage(bool flag, Image& oldImage)
/*Reflects the Image based on users input*/
{
int rows = oldImage.N;
int cols = oldImage.M;
Image tempImage(oldImage);
for(int i = 0; i < rows; i++)
{
for(int j = 0; j < cols; j++)
tempImage.pixelVal[rows - (i + 1)][j] = oldImage.pixelVal[i][j];
}
oldImage = tempImage;
}
My CUDA kernel & code:
#define NTPB 512
__global__ void fliph(int* a, int* b, int r, int c)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i >= r || j >= c)
return;
a[(r - i * c) + j] = b[i * c + j];
}
void Image::reflectImage(bool flag, Image& oldImage)
/*Reflects the Image based on users input*/
{
int rows = oldImage.N;
int cols = oldImage.M;
Image tempImage(oldImage);
if(flag == true) //horizontal reflection
{
//Allocate device memory
int* dpixels;
int* oldPixels;
int n = rows * cols;
cudaMalloc((void**)&dpixels, n * sizeof(int));
cudaMalloc((void**)&oldPixels, n * sizeof(int));
cudaMemcpy(dpixels, tempImage.pixelVal, n * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(oldPixels, oldImage.pixelVal, n * sizeof(int), cudaMemcpyHostToDevice);
int nblks = (n + NTPB - 1) / NTPB;
fliph<<<nblks, NTPB>>>(dpixels, oldPixels, rows, cols);
cudaMemcpy(tempImage.pixelVal, dpixels, n * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dpixels);
cudaFree(oldPixels);
}
oldImage = tempImage;
}

You have to create a 2D Grid in order to process the image using 2D indices i and j. In the current case, the kernel is processing only the first row of the image.
To create a 2D indexing mechanism, create a 2D block and 2D grid like this:
const int BLOCK_DIM = 16;
dim3 Block(BLOCK_DIM,BLOCK_DIM);
dim3 Grid;
Grid.x = (cols + Block.x - 1)/Block.x;
Grid.y = (rows + Block.y - 1)/Block.y;
fliph<<<Grid, Block>>>(dpixels, oldPixels, rows, cols);

FIR filter in CUDA (as a 1D convolution)

I'm trying to implement a FIR (Finite Impulse Response) filter in CUDA. My approach is quite simple and looks somewhat like this:
#include <cuda.h>
__global__ void filterData(const float *d_data,
const float *d_numerator,
float *d_filteredData,
const int numeratorLength,
const int filteredDataLength)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
float sum = 0.0f;
if (i < filteredDataLength)
{
for (int j = 0; j < numeratorLength; j++)
{
// The first (numeratorLength-1) elements contain the filter state
sum += d_numerator[j] * d_data[i + numeratorLength - j - 1];
}
}
d_filteredData[i] = sum;
}
int main(void)
{
// (Skipping error checks to make code more readable)
int dataLength = 18042;
int filteredDataLength = 16384;
int numeratorLength= 1659;
// Pointers to data, filtered data and filter coefficients
// (Skipping how these are read into the arrays)
float *h_data = new float[dataLength];
float *h_filteredData = new float[filteredDataLength];
float *h_filter = new float[numeratorLength];
// Create device pointers
float *d_data = nullptr;
cudaMalloc((void **)&d_data, dataLength * sizeof(float));
float *d_numerator = nullptr;
cudaMalloc((void **)&d_numerator, numeratorLength * sizeof(float));
float *d_filteredData = nullptr;
cudaMalloc((void **)&d_filteredData, filteredDataLength * sizeof(float));
// Copy data to device
cudaMemcpy(d_data, h_data, dataLength * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_numerator, h_numerator, numeratorLength * sizeof(float), cudaMemcpyHostToDevice);
// Launch the kernel
int threadsPerBlock = 256;
int blocksPerGrid = (filteredDataLength + threadsPerBlock - 1) / threadsPerBlock;
filterData<<<blocksPerGrid,threadsPerBlock>>>(d_data, d_numerator, d_filteredData, numeratorLength, filteredDataLength);
// Copy results to host
cudaMemcpy(h_filteredData, d_filteredData, filteredDataLength * sizeof(float), cudaMemcpyDeviceToHost);
// Clean up
cudaFree(d_data);
cudaFree(d_numerator);
cudaFree(d_filteredData);
// Do stuff with h_filteredData...
// Clean up some more
delete [] h_data;
delete [] h_filteredData;
delete [] h_filter;
}
The filter works, but as I'm new to CUDA programming and I'm not sure how to optimize it.
A slight problem that I see is that dataLength, filteredDataLength, and numeratorLength are not known before hand in the application I intend to use the filter in. Also, even though dataLength is a multiple of 32 in the above code, it is not guaranteed to be that in the final application.
When I compare my code above to ArrayFire, my code takes about three times longer to execute.
Does anyone have any ideas on how to speed things up?
EDIT: Have changed all filterLength to numeratorLength.

I can suggest the following to speed up your code:
Use the shared memory: it is a tiny cache-like memory but extremely
faster than the global card memory. You can find more about it by
looking for __shared__ keyword in CUDA documentation. For
example, you can pre-fetch the filter numerators and big chunks
of data in shared memory, this will significantly enhance your
performance. You need to pay extra attention to the data
alignment in this case as it really matters and it can slow down
your code.
Think about unrolling the for-loop of the numerator
sum. You can check the reduce-vector example in CUDA
documentation.
You can also think about parallelizing the
numerator loop itself by itself. This can be done by adding an extra dimension (say 'y') to your thread-block. You will need to make sum a shared vector as well that has the dimension of numeratorLength. You can also check the reduce vector example on how
to quickly take the sum of this vector at the end.

You are attempting at calculating the filter output by directly evaluating the 1D convolution through a CUDA kernel.
In the case when the filter impulse response duration is long, one thing you can do to evaluate the filtered input is performing the calculations directly in the conjugate domain using FFTs. Below I'm reporting a sample code using CUDA Thrust and the cuFFT library. It is a direct translation of the Matlab-based example reported at
Low-Pass Filtering by FFT Convolution
Let me disclaim that some optimizations are possible with this code, but I preferred to leave it as it is so that it could be more easily compared to its Matlab's counterpart.
#include <stdio.h>
#include <math.h>
#include <cufft.h>
#include <thrust\device_vector.h>
#include <thrust\sequence.h>
#define pi_f 3.14159265358979f // Greek pi in single precision
/****************/
/* SIN OPERATOR */
/****************/
class sin_op {
float fk_, Fs_;
public:
sin_op(float fk, float Fs) { fk_ = fk; Fs_ = Fs; }
__host__ __device__ float operator()(float x) const { return sin(2.f*pi_f*x*fk_/Fs_); }
};
/*****************/
/* SINC OPERATOR */
/*****************/
class sinc_op {
float fc_, Fs_;
public:
sinc_op(float fc, float Fs) { fc_ = fc; Fs_ = Fs; }
__host__ __device__ float operator()(float x) const
{
if (x==0) return (2.f*fc_/Fs_);
else return (2.f*fc_/Fs_)*sin(2.f*pi_f*fc_*x/Fs_)/(2.f*pi_f*fc_*x/Fs_);
}
};
/********************/
/* HAMMING OPERATOR */
/********************/
class hamming_op {
int L_;
public:
hamming_op(int L) { L_ = L; }
__host__ __device__ float operator()(int x) const
{
return 0.54-0.46*cos(2.f*pi_f*x/(L_-1));
}
};
/*********************************/
/* MULTIPLY CUFFTCOMPLEX NUMBERS */
/*********************************/
struct multiply_cufftComplex {
__device__ cufftComplex operator()(const cufftComplex& a, const cufftComplex& b) const {
cufftComplex r;
r.x = a.x * b.x - a.y * b.y;
r.y = a.x * b.y + a.y * b.x;
return r;
}
};
/********/
/* MAIN */
/********/
void main(){
// Signal parameters:
int M = 256; // signal length
const int N = 4;
float f[N] = { 440, 880, 1000, 2000 }; // frequencies
float Fs = 5000.; // sampling rate
// Generate a signal by adding up sinusoids:
thrust::device_vector<float> d_x(M,0.f); // pre-allocate 'accumulator'
thrust::device_vector<float> d_n(M); // discrete-time grid
thrust::sequence(d_n.begin(), d_n.end(), 0, 1);
thrust::device_vector<float> d_temp(M);
for (int i=0; i<N; i++) {
float fk = f[i];
thrust::transform(d_n.begin(), d_n.end(), d_temp.begin(), sin_op(fk,Fs));
thrust::transform(d_temp.begin(), d_temp.end(), d_x.begin(), d_x.begin(), thrust::plus<float>());
}
// Filter parameters:
int L = 257; // filter length
float fc = 600.f; // cutoff frequency
// Design the filter using the window method:
thrust::device_vector<float> d_hsupp(L);
thrust::sequence(d_hsupp.begin(), d_hsupp.end(), -(L-1)/2, 1);
thrust::device_vector<float> d_hideal(L);
thrust::transform(d_hsupp.begin(), d_hsupp.end(), d_hideal.begin(), sinc_op(fc,Fs));
thrust::device_vector<float> d_l(L);
thrust::sequence(d_l.begin(), d_l.end(), 0, 1);
thrust::device_vector<float> d_h(L);
thrust::transform(d_l.begin(), d_l.end(), d_h.begin(), hamming_op(L));
// h is our filter
thrust::transform(d_hideal.begin(), d_hideal.end(), d_h.begin(), d_h.begin(), thrust::multiplies<float>());
// --- Choose the next power of 2 greater than L+M-1
int Nfft = pow(2,(ceil(log2((float)(L+M-1))))); // or 2^nextpow2(L+M-1)
// Zero pad the signal and impulse response:
thrust::device_vector<float> d_xzp(Nfft,0.f);
thrust::device_vector<float> d_hzp(Nfft,0.f);
thrust::copy(d_x.begin(), d_x.end(), d_xzp.begin());
thrust::copy(d_h.begin(), d_h.end(), d_hzp.begin());
// Transform the signal and the filter:
cufftHandle plan;
cufftPlan1d(&plan, Nfft, CUFFT_R2C, 1);
thrust::device_vector<cufftComplex> d_X(Nfft/2+1);
thrust::device_vector<cufftComplex> d_H(Nfft/2+1);
cufftExecR2C(plan, (cufftReal*)thrust::raw_pointer_cast(d_xzp.data()), (cufftComplex*)thrust::raw_pointer_cast(d_X.data()));
cufftExecR2C(plan, (cufftReal*)thrust::raw_pointer_cast(d_hzp.data()), (cufftComplex*)thrust::raw_pointer_cast(d_H.data()));
thrust::device_vector<cufftComplex> d_Y(Nfft/2+1);
thrust::transform(d_X.begin(), d_X.end(), d_H.begin(), d_Y.begin(), multiply_cufftComplex());
cufftPlan1d(&plan, Nfft, CUFFT_C2R, 1);
thrust::device_vector<float> d_y(Nfft);
cufftExecC2R(plan, (cufftComplex*)thrust::raw_pointer_cast(d_Y.data()), (cufftReal*)thrust::raw_pointer_cast(d_y.data()));
getchar();
}

Besides my other answer which I expect will be more convenient for convolution kernels with long duration, below I'm reporting a different implementation, which is more compliant with the OP's initial attempt and I expect will be more convenient for convolution kernels with short duration. Such an implementation is based on a hand-written kernel exploiting caching in shared memory. More details can be found in the book by D.B. Kirk and W.-m. W. Hwu
Programming Massively Parallel Processors, Second Edition: A Hands-on Approach
#include <stdio.h>
#include <stdlib.h>
#include "TimingGPU.cuh"
#include "Utilities.cuh"
#define RG 10
#define BLOCKSIZE 8
/****************/
/* CPU FUNCTION */
/****************/
void h_convolution_1D(const float * __restrict__ h_Signal, const float * __restrict__ h_ConvKernel, float * __restrict__ h_Result_CPU,
const int N, const int K) {
for (int i = 0; i < N; i++) {
float temp = 0.f;
int N_start_point = i - (K / 2);
for (int j = 0; j < K; j++) if (N_start_point + j >= 0 && N_start_point + j < N) {
temp += h_Signal[N_start_point+ j] * h_ConvKernel[j];
}
h_Result_CPU[i] = temp;
}
}
/********************/
/* BASIC GPU KERNEL */
/********************/
__global__ void d_convolution_1D_basic(const float * __restrict__ d_Signal, const float * __restrict__ d_ConvKernel, float * __restrict__ d_Result_GPU,
const int N, const int K) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
float temp = 0.f;
int N_start_point = i - (K / 2);
for (int j = 0; j < K; j++) if (N_start_point + j >= 0 && N_start_point + j < N) {
temp += d_Signal[N_start_point+ j] * d_ConvKernel[j];
}
d_Result_GPU[i] = temp;
}
/***************************/
/* GPU KERNEL WITH CACHING */
/***************************/
__global__ void d_convolution_1D_caching(const float * __restrict__ d_Signal, const float * __restrict__ d_ConvKernel, float * __restrict__ d_Result_GPU,
const int N, const int K) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
__shared__ float d_Tile[BLOCKSIZE];
d_Tile[threadIdx.x] = d_Signal[i];
__syncthreads();
float temp = 0.f;
int N_start_point = i - (K / 2);
for (int j = 0; j < K; j++) if (N_start_point + j >= 0 && N_start_point + j < N) {
if ((N_start_point + j >= blockIdx.x * blockDim.x) && (N_start_point + j < (blockIdx.x + 1) * blockDim.x))
// --- The signal element is in the tile loaded in the shared memory
temp += d_Tile[threadIdx.x + j - (K / 2)] * d_ConvKernel[j];
else
// --- The signal element is not in the tile loaded in the shared memory
temp += d_Signal[N_start_point + j] * d_ConvKernel[j];
}
d_Result_GPU[i] = temp;
}
/********/
/* MAIN */
/********/
int main(){
const int N = 15; // --- Signal length
const int K = 5; // --- Convolution kernel length
float *h_Signal = (float *)malloc(N * sizeof(float));
float *h_Result_CPU = (float *)malloc(N * sizeof(float));
float *h_Result_GPU = (float *)malloc(N * sizeof(float));
float *h_ConvKernel = (float *)malloc(K * sizeof(float));
float *d_Signal; gpuErrchk(cudaMalloc(&d_Signal, N * sizeof(float)));
float *d_Result_GPU; gpuErrchk(cudaMalloc(&d_Result_GPU, N * sizeof(float)));
float *d_ConvKernel; gpuErrchk(cudaMalloc(&d_ConvKernel, K * sizeof(float)));
for (int i=0; i < N; i++) { h_Signal[i] = (float)(rand() % RG); }
for (int i=0; i < K; i++) { h_ConvKernel[i] = (float)(rand() % RG); }
gpuErrchk(cudaMemcpy(d_Signal, h_Signal, N * sizeof(float), cudaMemcpyHostToDevice));
gpuErrchk(cudaMemcpy(d_ConvKernel, h_ConvKernel, K * sizeof(float), cudaMemcpyHostToDevice));
h_convolution_1D(h_Signal, h_ConvKernel, h_Result_CPU, N, K);
d_convolution_1D_basic<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_Signal, d_ConvKernel, d_Result_GPU, N, K);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy(h_Result_GPU, d_Result_GPU, N * sizeof(float), cudaMemcpyDeviceToHost));
for (int i = 0; i < N; i++) if (h_Result_CPU[i] != h_Result_GPU[i]) {printf("mismatch2 at %d, cpu: %d, gpu %d\n", i, h_Result_CPU[i], h_Result_GPU[i]); return 1;}
printf("Test basic passed\n");
d_convolution_1D_caching<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_Signal, d_ConvKernel, d_Result_GPU, N, K);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy(h_Result_GPU, d_Result_GPU, N * sizeof(float), cudaMemcpyDeviceToHost));
for (int i = 0; i < N; i++) if (h_Result_CPU[i] != h_Result_GPU[i]) {printf("mismatch2 at %d, cpu: %d, gpu %d\n", i, h_Result_CPU[i], h_Result_GPU[i]); return 1;}
printf("Test caching passed\n");
return 0;
}

Accessing submatrices using cuBLAS

I have read the following post
Accessing submatrices using LAPACK
I would like to do something similar calling cuBLAS routines from Fortran.
Basically I have a large matrix partitioned into 3 x 3 blocks with the partitioning changing in each step of a loop. At the moment, I allocate/free pointers for each individual sub-block and copy the relevant parts of the matrix to and from the device at each step. That creates a lot of overhead which I am hoping to eliminate. Is that feasible?

You can do device pointer arithmetic in host code in just the same way as you would with host pointers. For example, if you had an MxN matrix stored on the GPU:
float *A_d;
cudaMalloc((void **)&A_d, size_t(M*N)*sizeof(float));
and you wanted to operate on a submatrix starting at (x1,y1), then you would pass A+x1+M*y1 to any CUBLAS function which expects a matrix as an argument.

talonmies has already satisfactorily answered this question. To support his answer and to be possibly useful to other users, I'm here providing a full example on how using cublas<t>gemm to perform multiplications between submatrices of full matrices A and B and how assigning the result to a submatrix of a full matrix C.
Although the question regards Fortran, the code below is given in C/C++ since I'm not using Fortran in connection with CUDA and since many users are using CUDA in connection with C/C++.
The code makes use of
pointer arithmetics to access submatrices;
the concept of the leading dimension and of submatrix dimensions.
The code below considers three matrices:
A - 10 x 9;
B - 15 x 13;
C - 10 x 12.
Matrix C is initialized to all 10s. The code performs the following submatrix multiplication in Matlab language:
C(1+x3:5+x3,1+y3:3+y3) = A(1+x1:5+x1,1+y1:4+y1) * B(1+x2:4+x2,1+y2:3+x2);
The Utilities.cu and Utilities.cuh files are mantained here and omitted here.
#include <thrust/device_vector.h>
#include <thrust/random.h>
#include <cublas_v2.h>
#include "Utilities.cuh"
/********/
/* MAIN */
/********/
int main()
{
/**************************/
/* SETTING UP THE PROBLEM */
/**************************/
//const int Nrows1 = 10; // --- Number of rows of matrix 1
//const int Ncols1 = 10; // --- Number of columns of matrix 1
//const int Nrows2 = 15; // --- Number of rows of matrix 2
//const int Ncols2 = 15; // --- Number of columns of matrix 2
//const int Nrows3 = 12; // --- Number of rows of matrix 3
//const int Ncols3 = 12; // --- Number of columns of matrix 3
const int Nrows1 = 10; // --- Number of rows of matrix 1
const int Ncols1 = 9; // --- Number of columns of matrix 1
const int Nrows2 = 15; // --- Number of rows of matrix 2
const int Ncols2 = 13; // --- Number of columns of matrix 2
const int Nrows3 = 10; // --- Number of rows of matrix 3
const int Ncols3 = 12; // --- Number of columns of matrix 3
const int Nrows = 5; // --- Number of rows of submatrix matrix 3 = Number of rows of submatrix 1
const int Ncols = 3; // --- Number of columns of submatrix matrix 3 = Number of columns of submatrix 2
const int Nrowscols = 4; // --- Number of columns of submatrix 1 and of rows of submatrix 2
const int x1 = 3; // --- Offset for submatrix multiplication along the rows
const int y1 = 2; // --- Offset for submatrix multiplication along the columns
const int x2 = 6; // --- Offset for submatrix multiplication along the rows
const int y2 = 4; // --- Offset for submatrix multiplication along the columns
const int x3 = 3; // --- Offset for submatrix multiplication along the rows
const int y3 = 5; // --- Offset for submatrix multiplication along the columns
// --- Random uniform integer distribution between 0 and 100
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(0, 20);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix1(Nrows1 * Ncols1);
thrust::device_vector<float> d_matrix2(Nrows2 * Ncols2);
for (size_t i = 0; i < d_matrix1.size(); i++) d_matrix1[i] = (float)dist(rng);
for (size_t i = 0; i < d_matrix2.size(); i++) d_matrix2[i] = (float)dist(rng);
printf("\n\nOriginal full size matrix A\n");
for(int i = 0; i < Nrows1; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols1; j++)
std::cout << d_matrix1[j * Nrows1 + i] << " ";
std::cout << "]\n";
}
printf("\n\nOriginal full size matrix B\n");
for(int i = 0; i < Nrows2; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols2; j++)
std::cout << d_matrix2[j * Nrows2 + i] << " ";
std::cout << "]\n";
}
/*************************/
/* MATRIX MULTIPLICATION */
/*************************/
cublasHandle_t handle;
cublasSafeCall(cublasCreate(&handle));
thrust::device_vector<float> d_matrix3(Nrows3 * Ncols3, 10.f);
float alpha = 1.f;
float beta = 0.f;
cublasSafeCall(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, Nrows, Ncols, Nrowscols, &alpha,
thrust::raw_pointer_cast(d_matrix1.data())+x1+Nrows1*y1, Nrows1, thrust::raw_pointer_cast(d_matrix2.data())+x2+Nrows2*y2, Nrows2,
&beta, thrust::raw_pointer_cast(d_matrix3.data())+x3+Nrows3*y3, Nrows3));
printf("\n\nResult full size matrix C\n");
for(int i = 0; i < Nrows3; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols3; j++)
std::cout << d_matrix3[j * Nrows3 + i] << " ";
std::cout << "]\n";
}
return 0;
}

CUDA Maximum Reduction Algorithm Not Working

A previous question asked how to find to find the maximum value of an array in CUDA efficiently: Finding max value in CUDA, the top response provided a link to a NVIDIA presentation on optimizing reduction kernels.
If you are using Visual Studio, simply remove the header reference, and everything between CPU EXECUTION.
I setup a variant which found the max, but it doesn't match what the CPU is finding:
// Returns the maximum value of
// an array of size n
float GetMax(float *maxes, int n)
{
int i = 0;
float max = -100000;
for(i = 0; i < n; i++)
{
if(maxes[i] > max)
max = maxes[i];
}
return max;
}
// Too obvious...
__device__ float MaxOf2(float a, float b)
{
if(a > b) return a;
else return b;
}
__global__ void MaxReduction(int n, float *g_idata, float *g_odata)
{
extern __shared__ float sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(BLOCKSIZE*2) + tid;
unsigned int gridSize = BLOCKSIZE*2*gridDim.x;
sdata[tid] = 0;
//MMX(index,i)
//MMX(index,i+blockSize)
// Final Optimized Kernel
while (i < n) {
sdata[tid] = MaxOf2(g_idata[i], g_idata[i+BLOCKSIZE]);
i += gridSize;
}
__syncthreads();
if (BLOCKSIZE >= 512) { if (tid < 256) { sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 256]); } __syncthreads(); }
if (BLOCKSIZE >= 256) { if (tid < 128) { sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 128]); } __syncthreads(); }
if (BLOCKSIZE >= 128) { if (tid < 64) { sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 64]); } __syncthreads(); }
if (tid < 32) {
if (BLOCKSIZE >= 64) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 32]);
if (BLOCKSIZE >= 32) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 16]);
if (BLOCKSIZE >= 16 ) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 8]);
if (BLOCKSIZE >= 8) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 4]);
if (BLOCKSIZE >= 4) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 2]);
if (BLOCKSIZE >= 2) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 1]);
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
I have a giant setup to test this algorithm:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <sys/time.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include "book.h"
#define ARRAYSIZE 16384
#define GRIDSIZE 60
#define BLOCKSIZE 32
#define SIZEFLOAT 4
using namespace std;
// Function definitions
float GetMax(float *maxes, int n);
__device__ float MaxOf2(float a, float b);
__global__ void MaxReduction(int n, float *g_idata, float *g_odata);
// Returns random floating point number
float RandomReal(float low, float high)
{
float d;
d = (float) rand() / ((float) RAND_MAX + 1);
return (low + d * (high - low));
}
int main()
{
/*****************VARIABLE SETUP*****************/
// Pointer to CPU numbers
float *numbers;
// Pointer to GPU numbers
float *dev_numbers;
// Counter
int i = 0;
// Randomize
srand(time(0));
// Timers
// Kernel timers
cudaEvent_t start_kernel, stop_kernel;
float elapsedTime_kernel;
HANDLE_ERROR(cudaEventCreate(&start_kernel));
HANDLE_ERROR(cudaEventCreate(&stop_kernel));
// cudaMalloc timers
cudaEvent_t start_malloc, stop_malloc;
float elapsedTime_malloc;
HANDLE_ERROR(cudaEventCreate(&start_malloc));
HANDLE_ERROR(cudaEventCreate(&stop_malloc));
// CPU timers
struct timeval start, stop;
float elapsedTime = 0;
/*****************VARIABLE SETUP*****************/
/*****************CPU ARRAY SETUP*****************/
// Setup CPU array
HANDLE_ERROR(cudaHostAlloc((void**)&numbers, ARRAYSIZE * sizeof(float), cudaHostAllocDefault));
for(i = 0; i < ARRAYSIZE; i++)
numbers[i] = RandomReal(0, 50000.0);
/*****************CPU ARRAY SETUP*****************/
/*****************GPU ARRAY SETUP*****************/
// Start recording cuda malloc time
HANDLE_ERROR(cudaEventRecord(start_malloc,0));
// Allocate memory to GPU
HANDLE_ERROR(cudaMalloc((void**)&dev_numbers, ARRAYSIZE * sizeof(float)));
// Transfer CPU array to GPU
HANDLE_ERROR(cudaMemcpy(dev_numbers, numbers, ARRAYSIZE*sizeof(float), cudaMemcpyHostToDevice));
// An array to temporarily store maximum values on the GPU
float *dev_max;
HANDLE_ERROR(cudaMalloc((void**)&dev_max, GRIDSIZE * sizeof(float)));
// An array to hold grab the GPU max
float maxes[GRIDSIZE];
/*****************GPU ARRAY SETUP*****************/
/*****************KERNEL EXECUTION*****************/
// Start recording kernel execution time
HANDLE_ERROR(cudaEventRecord(start_kernel,0));
// Run kernel
MaxReduction<<<GRIDSIZE, BLOCKSIZE, SIZEFLOAT*BLOCKSIZE>>> (ARRAYSIZE, dev_numbers, dev_max);
// Transfer maxes over
HANDLE_ERROR(cudaMemcpy(maxes, dev_max, GRIDSIZE * sizeof(float), cudaMemcpyDeviceToHost));
// Print out the max
cout << GetMax(maxes, GRIDSIZE) << endl;
// Stop recording kernel execution time
HANDLE_ERROR(cudaEventRecord(stop_kernel,0));
HANDLE_ERROR(cudaEventSynchronize(stop_kernel));
// Retrieve recording data
HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime_kernel, start_kernel, stop_kernel));
// Stop recording cuda malloc time
HANDLE_ERROR(cudaEventRecord(stop_malloc,0));
HANDLE_ERROR(cudaEventSynchronize(stop_malloc));
// Retrieve recording data
HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime_malloc, start_malloc, stop_malloc));
// Print results
printf("%5.3f\t%5.3f\n", elapsedTime_kernel, elapsedTime_malloc);
/*****************KERNEL EXECUTION*****************/
/*****************CPU EXECUTION*****************/
// Capture the start time
gettimeofday(&start, NULL);
// Call generic P7Viterbi function
cout << GetMax(numbers, ARRAYSIZE) << endl;
// Capture the stop time
gettimeofday(&stop, NULL);
// Retrieve time elapsed in milliseconds
long int elapsed_sec = stop.tv_sec - start.tv_sec;
long int elapsed_usec = stop.tv_usec - start.tv_usec;
elapsedTime = (float)(1000.0f * elapsed_sec) + (float)(0.001f * elapsed_usec);
// Print results
printf("%5.3f\n", elapsedTime);
/*****************CPU EXECUTION*****************/
// Free memory
cudaFreeHost(numbers);
cudaFree(dev_numbers);
cudaFree(dev_max);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaEventDestroy(start_malloc);
cudaEventDestroy(stop_malloc);
// Exit program
return 0;
}
I ran cuda-memcheck on this test program, with -g & -G switches on, and it reports 0 problems. Can anyone spot the issue?
NOTE: Be sure to have book.h from the CUDA by Example book in your current directory when you compile the program. Source link here: http://developer.nvidia.com/cuda-example-introduction-general-purpose-gpu-programming
Download the source code, and book.h will be under the common directory/folder.

Your kernel looks broken to me. The thread local search (before the shared memory reduction), should look something like this:
sdata[tid] = g_idata[i];
i += gridSize;
while (i < n) {
sdata[tid] = MaxOf2(sdata[tid], g_idata[i]);
i += gridSize;
}
shouldn't it?
Also note that if you run this on Fermi, the shared memory buffer should be declared volatile, and you will get a noticeable improvement in performance if the thread local search is done with a register variable, rather than in shared memory. There is about an 8 times difference in effective bandwidth between the two.
EDIT: Here is a simplified, working version of your reduction kernel. You should note a number of differences compared to your original:
__global__ void MaxReduction(int n, float *g_idata, float *g_odata)
{
extern __shared__ volatile float sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(BLOCKSIZE) + tid;
unsigned int gridSize = BLOCKSIZE*gridDim.x;
float val = g_idata[i];
i += gridSize;
while (i < n) {
val = MaxOf2(g_idata[i],val);
i += gridSize;
}
sdata[tid] = val;
__syncthreads();
// This versions uses a single warp for the shared memory
// reduction
# pragma unroll
for(int i=(tid+32); ((tid<32)&&(i<BLOCKSIZE)); i+=32)
sdata[tid] = MaxOf2(sdata[tid], sdata[i]);
if (tid < 16) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 16]);
if (tid < 8) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 8]);
if (tid < 4) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 4]);
if (tid < 2) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 2]);
if (tid == 0) g_odata[blockIdx.x] = MaxOf2(sdata[tid], sdata[tid + 1]);
}
This code should also be safe on Fermi. You should also familiarise yourself with the CUDA math library, because there is a fmax(x,y) intrinsic which you should use in place of your MaxOf2 function.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Some CUDA computations fail with larger block dimension (< 1024) - parallel-processing

Related

Sparse plus dense matrix operation using cuSPARSE

Kernel crashes while trying to do a simple value assignment

FIR filter in CUDA (as a 1D convolution)

Accessing submatrices using cuBLAS

CUDA Maximum Reduction Algorithm Not Working

Categories

Resources