I am trying to implement matrix multiplication using CUDA. I have two matrices of sizes M*w and w*N. I launch w*w threads in each block, with grid dimensions (M/w, N/w). I created a 32*32 matrix in shared memory, and I want to implement the multiplication using only that one shared-memory matrix. Here's my code:
#include<stdio.h>
#include<cuda.h>
#include<stdlib.h>
#include<unistd.h>
#include<math.h>
__global__ void add(int *a,int *b, int *c,int *p,int *q){
// __shared__ int aTile[*p][*p];
//const int a=*p;
__shared__ int aTile[32][32];
int row = blockIdx.x*blockDim.x+threadIdx.x;
int col = blockIdx.y*blockDim.y+threadIdx.y;
int sum=0;
aTile[threadIdx.x][threadIdx.y] = a[row*(*p)+threadIdx.y];
__syncthreads();
if(row< *q && col< *q)
{
for(int k=0;k<*p;k++)
{
sum+= aTile[threadIdx.x][k]*b[col+(*q)*k];
// __syncthreads();
}
c[col+(*q)*row]=sum;
//__syncthreads();
}
}
int main(){
printf("Enter the number of rows of matrix 1\n");
int row_1;
scanf("%d",&row_1);
printf("Enter the number of columns of matrix 1\n");
int col_1;
scanf("%d",&col_1);
/*printf("Enter the values of matrix 1 \n");
*/
int a[row_1][col_1];
for(int i=0;i<row_1;i++)
{
for(int j=0;j<col_1;j++)
{
//scanf("%d",&a[i][j]);
a[i][j]=1;
}
}
printf("Enter the number of rows of matrix 2\n");
int row_2;
scanf("%d",&row_2);
printf("Enter the number of columns of matrix 2\n");
int col_2;
scanf("%d",&col_2);
/* printf("Enter the values of matrix 2 \n");
*/
int b[row_2][col_2];
for(int i=0;i<row_2;i++)
{
for(int j=0;j<col_2;j++)
{
// scanf("%d",&b[i][j]);
b[i][j]=1;
}
}
int c[row_1][col_2];
//dim3 dimBlock(col_1, col_1);// in one block u have row_1*col_2 threads;
dim3 dimBlock(col_1,col_1);
//dim3 dimGrid((row_1/col_1)+1,(col_2/col_1)+1); // in one grid you have 1*1 blocks
dim3 dimGrid(ceil(row_1/col_1),ceil(col_2/col_1));
int *p;
int *q;
int *dev_a,*dev_b,*dev_c;
int size_a=row_1*col_1*sizeof(int);
int size_b=row_2*col_2*sizeof(int);
int size_c = row_1*col_2*sizeof(int);
cudaMalloc((void**)&dev_a,size_a);
cudaMalloc((void**)&dev_b,size_b);
cudaMalloc((void**)&dev_c,size_c);
cudaMalloc((void**)&p,sizeof(int));
cudaMalloc((void**)&q,sizeof(int));
cudaMemcpy(dev_a,a,size_a,cudaMemcpyHostToDevice);
cudaMemcpy(dev_b,b,size_b,cudaMemcpyHostToDevice);
cudaMemcpy(dev_c,c,size_c,cudaMemcpyHostToDevice);
cudaMemcpy(p,&col_1,sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(q,&col_2,sizeof(int),cudaMemcpyHostToDevice);
add<<<dimGrid,dimBlock>>>(dev_a,dev_b,dev_c,p,q);
cudaMemcpy(c,dev_c,size_c,cudaMemcpyDeviceToHost);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
printf("output matrix is : \n");
for(int i=0;i<10;i++)
{
for(int j=0;j<10;j++)
{
printf("%d ",c[i][j]);
}
printf("\n");
}
}
I get the correct output when I multiply matrices of size 32*32 and 32*32,
but when I multiply matrices of size 33*33 and 33*33 (and above), the resulting matrix contains all zeros. I have tried to increase the size of the matrix in shared memory, but I get the following error:
ptxas error : Entry function '_Z3addPiS_S_S_S_' uses too much shared data (0x10038 bytes, 0x4000 max)
I am pretty new to CUDA. Sorry if this is too basic a question.
This is a basic question and has been answered many times over.
First of all, use proper CUDA error checking any time you are having trouble with CUDA code. In this case, you would have received an instructive error.
CUDA kernels have a limit on the maximum number of threads per threadblock. That limit (under CUDA 7, 7.5RC, currently) is 1024 threads per block, on all supported devices. The number of threads per block is specified (in this case) by your dimBlock variable, and it is the product of the terms in each dimension:
dim3 dimBlock(col_1,col_1);
add<<<dimGrid,dimBlock>>>(dev_a,dev_b,dev_c,p,q);
Therefore, when col_1 is 32, you are requesting 32x32 threads (1024) which is the maximum. Any value above 32x32 will fail for this reason. (Your kernel will not launch. No kernel code will get executed when you specify 33x33 here.)
Rather than rewrite this code to fix all the issues, I suggest you study any of the dozens of questions already asked about matrix multiplication, here on the cuda tag. In fact, if you want to see a shared memory optimized code for naive matrix multiplication in CUDA, there is a full example in the programming guide (including both the non-shared version and the shared version for comparison).
And again, I suggest you implement proper cuda error checking before asking for help here. Even if you don't understand the error results, it will be useful information for those who are trying to help you.
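As a minimal sketch of the kind of check suggested above (the names `empty_kernel` and `launch_square_block` are hypothetical, not from the question), the helper below launches a single block of side x side threads and returns whatever error the runtime reports. Under the 1024-threads-per-block limit described above, a 32x32 block should launch and a 33x33 block should be rejected with an invalid-configuration error.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Hypothetical helper: launch one block of side x side threads and
// return the first error reported by the runtime (cudaSuccess if none).
cudaError_t launch_square_block(int side) {
    dim3 block(side, side);
    empty_kernel<<<1, block>>>();
    cudaError_t err = cudaGetLastError();   // reports an invalid launch configuration
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // reports errors raised while the kernel ran
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return err;
}
```

Calling `launch_square_block(33)` from `main` prints the error string instead of silently leaving the output untouched, which is the symptom described in the question.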
You have an out-of-bounds access in this line:
aTile[threadIdx.x][threadIdx.y] = a[row*(*p)+threadIdx.y];
given that aTile is declared as __shared__ int aTile[32][32];
If you want to do tiling, you'll have to loop over the number of tiles needed to cover your matrix.
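The loop over tiles mentioned above could look roughly like the sketch below. It is only one possible arrangement, under these assumptions: matrices are row-major, only A is staged through a single 32x32 shared tile (as the question asks), B is read straight from global memory, and the dimensions are passed by value. The names `matmul_one_tile` and `matmul` are hypothetical, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 32  // tile side; TILE*TILE = 1024 threads, the per-block limit

// C (M x N) = A (M x K) * B (K x N), all stored row-major.
// Only A goes through shared memory; B is read directly from global memory.
__global__ void matmul_one_tile(const int *A, const int *B, int *C,
                                int M, int K, int N) {
    __shared__ int aTile[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int sum = 0;
    int numTiles = (K + TILE - 1) / TILE;   // tiles needed to cover the K dimension
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * TILE + threadIdx.x;
        // Load one element of the current A tile, padding with zero outside the matrix.
        aTile[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0;
        __syncthreads();                    // tile fully loaded before anyone reads it
        if (col < N) {
            for (int k = 0; k < TILE && t * TILE + k < K; ++k)
                sum += aTile[threadIdx.y][k] * B[(t * TILE + k) * N + col];
        }
        __syncthreads();                    // everyone done reading before the next load
    }
    if (row < M && col < N) C[row * N + col] = sum;
}

// Hypothetical host wrapper (error checking omitted for brevity).
std::vector<int> matmul(const std::vector<int> &A, const std::vector<int> &B,
                        int M, int K, int N) {
    int *dA, *dB, *dC;
    cudaMalloc(&dA, M * K * sizeof(int));
    cudaMalloc(&dB, K * N * sizeof(int));
    cudaMalloc(&dC, M * N * sizeof(int));
    cudaMemcpy(dA, A.data(), M * K * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), K * N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);                 // fixed block size, independent of K
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    matmul_one_tile<<<grid, block>>>(dA, dB, dC, M, K, N);
    std::vector<int> C(M * N);
    cudaMemcpy(C.data(), dC, M * N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

The block size stays at 32x32 regardless of the matrix size, so the shared tile never needs to grow; larger matrices are covered by more blocks and more loop iterations.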
So I want to allocate 2D arrays and also copy them between the CPU and GPU in CUDA, but I am a total beginner and other online materials are very difficult for me to understand or are incomplete. It is important that I am able to access them as a 2D array in the kernel code as shown below.
Note that height != width for the arrays, that's something that further confuses me if it's possible as I always struggle choosing grid size.
I've considered flattening them, but I really want to get it working this way.
This is how far I've got by my own research.
__global__ void myKernel(int *firstArray, int *secondArray, int rows, int columns) {
int row = blockIdx.x * blockDim.x + threadIdx.x;
int column = blockIdx.y * blockDim.y + threadIdx.y;
if (row >= rows || column >= columns)
return;
// Do something with the arrays like you would on a CPU, like:
firstArray[row][column] = row * 2;
secondArray[row][column] = row * 3;
}
int main() {
int rows = 300, columns = 200;
int h_firstArray[rows][columns], h_secondArray[rows][columns];
int *d_firstArray[rows][columns], *d_secondArray[rows][columns];
// populate h_ arrays (Can do this bit myself)
// Allocate memory on device, no idea how to do for 2D arrays.
// Do memcopies to GPU, no idea how to do for 2D arrays.
dim3 block(rows,columns);
dim3 grid (1,1);
myKernel<<<grid,block>>>(d_firstArray, d_secondArray, rows, columns);
// Do memcopies back to host, no idea how to do for 2D arrays.
cudaFree(d_firstArray);
cudaFree(d_secondArray);
return 0;
}
EDIT: I was asked if the array width will be known at compile time in the problems I would try to solve. You can assume it is as I'm interested primarily in this particular situation as of now.
In the general case (array dimensions not known until runtime), handling doubly-subscripted access in CUDA device code requires an array of pointers, just as it does in host code. C and C++ handle each subscript as a pointer dereference, in order to reach the final location in the "2D array".
Double-pointer/doubly-subscripted access in device code in the general case is already covered in the canonical answer linked from the cuda tag info page. There are several drawbacks to this, which are covered in that answer so I won't repeat them here.
However, if the array width is known at compile time (array height can be dynamic - i.e. determined at runtime), then we can leverage the compiler and the language typing mechanisms to allow us to circumvent most of the drawbacks. Your code demonstrates several other incorrect patterns for CUDA and/or C/C++ usage:
Passing an item for doubly-subscripted access to a C or C++ function cannot be done with a simple single pointer type like int *firstArray.
Allocating large host arrays via stack-based mechanisms:
int h_firstArray[rows][columns], h_secondArray[rows][columns];
is often problematic in C and C++. These are stack based variables and will often run into stack limits if large enough.
CUDA threadblocks are limited to 1024 threads total. Therefore such a threadblock dimension:
dim3 block(rows,columns);
will not work except for very small sizes of rows and columns (the product must be less than or equal to 1024).
When declaring pointer variables for a device array in CUDA, it is almost never correct to create arrays of pointers:
int *d_firstArray[rows][columns], *d_secondArray[rows][columns];
nor do we allocate space on the host, then "reallocate" those pointers for device usage.
What follows is a worked example with the above items addressed and demonstrating the aforementioned method where the array width is known at compile time:
$ cat t50.cu
#include <stdio.h>
const int array_width = 200;
typedef int my_arr[array_width];
__global__ void myKernel(my_arr *firstArray, my_arr *secondArray, int rows, int columns) {
int column = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (row >= rows || column >= columns)
return;
// Do something with the arrays like you would on a CPU, like:
firstArray[row][column] = row * 2;
secondArray[row][column] = row * 3;
}
int main() {
int rows = 300, columns = array_width;
my_arr *h_firstArray, *h_secondArray;
my_arr *d_firstArray, *d_secondArray;
size_t dsize = rows*columns*sizeof(int);
h_firstArray = (my_arr *)malloc(dsize);
h_secondArray = (my_arr *)malloc(dsize);
// populate h_ arrays
memset(h_firstArray, 0, dsize);
memset(h_secondArray, 0, dsize);
// Allocate memory on device
cudaMalloc(&d_firstArray, dsize);
cudaMalloc(&d_secondArray, dsize);
// Do memcopies to GPU
cudaMemcpy(d_firstArray, h_firstArray, dsize, cudaMemcpyHostToDevice);
cudaMemcpy(d_secondArray, h_secondArray, dsize, cudaMemcpyHostToDevice);
dim3 block(32,32);
dim3 grid ((columns+block.x-1)/block.x,(rows+block.y-1)/block.y);
myKernel<<<grid,block>>>(d_firstArray, d_secondArray, rows, columns);
// Do memcopies back to host
cudaMemcpy(h_firstArray, d_firstArray, dsize, cudaMemcpyDeviceToHost);
cudaMemcpy(h_secondArray, d_secondArray, dsize, cudaMemcpyDeviceToHost);
// validate
if (cudaGetLastError() != cudaSuccess) {printf("cuda error\n"); return -1;}
for (int i = 0; i < rows; i++)
for (int j = 0; j < columns; j++){
if (h_firstArray[i][j] != i*2) {printf("first mismatch at %d,%d, was: %d, should be: %d\n", i,j,h_firstArray[i][j], i*2); return -1;}
if (h_secondArray[i][j] != i*3) {printf("second mismatch at %d,%d, was: %d, should be: %d\n", i,j,h_secondArray[i][j], i*3); return -1;}}
printf("success!\n");
cudaFree(d_firstArray);
cudaFree(d_secondArray);
return 0;
}
$ nvcc -arch=sm_61 -o t50 t50.cu
$ cuda-memcheck ./t50
========= CUDA-MEMCHECK
success!
========= ERROR SUMMARY: 0 errors
$
I've reversed the sense of your kernel indexing (x,y) to help with coalesced global memory access. We see that with this kind of type creation, we can leverage the compiler and the language features to end up with a code that allows for doubly-subscripted access in both host and device code, while otherwise allowing CUDA operations (e.g. cudaMemcpy) as if we are dealing with single-pointer (e.g. "flattened") arrays.
When I try this, I get the wrong result in 'output' even though I am copying the values of the 'cum' array to output.
But if I rename the 'cum' array mentioned earlier in the code, I get the correct values in the array. Therefore I am unable to reuse the result values.
The device has 8 cores with no shared memory.
Any and all comments/suggestions appreciated.
kernel void histogram(global unsigned int *input,
global unsigned int *output,
global unsigned int *frequency,
global unsigned int *cum,
unsigned int N)
{
int pid = get_global_id(0);
//cumulative sum
for(int i=0; i < 16; i++)
{
cum[(i*16)+(2*pid)+1] = frequency[(i*16)+(2*pid)] + frequency[(i*16)+(2*pid)+1];
}
barrier(CLK_GLOBAL_MEM_FENCE);
for(int i=0; i < 32; i++)
{
output[(i*8)+pid] = cum[(i*8)+pid];
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
Make sure you understand parallel prefix sums. In particular, I don't see a down-sweep step over the total sum or the partial sums; see:
Parallel Prefix Sum (Scan) with CUDA
I'd look in the TI Keystone II SDK that you mention using in your earlier question, "OpenCL device memory read/write issue", to see if it has any scan or parallel prefix sum implementations or built-in functions.
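For orientation, here is a minimal single-block sketch of the up-sweep/down-sweep scan, written in CUDA to match the linked article rather than in the OpenCL of the question. It assumes exactly 256 input elements (a power of two) handled by 128 threads in one block; the names `exclusive_scan_block` and `exclusive_scan` are hypothetical. It produces an exclusive scan, so an inclusive cumulative sum is obtained by adding each input element to its output.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define SCAN_N 256  // elements scanned by the single block; must be a power of two

// Exclusive scan of SCAN_N elements; launch with one block of SCAN_N/2 threads.
__global__ void exclusive_scan_block(const unsigned int *in, unsigned int *out) {
    __shared__ unsigned int temp[SCAN_N];
    int tid = threadIdx.x;
    temp[2 * tid]     = in[2 * tid];
    temp[2 * tid + 1] = in[2 * tid + 1];
    int offset = 1;
    // Up-sweep: build partial sums in place, up to the total in the last slot.
    for (int d = SCAN_N >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    if (tid == 0) temp[SCAN_N - 1] = 0;     // clear the total before the down-sweep
    // Down-sweep: push the partial sums back down the tree.
    for (int d = 1; d < SCAN_N; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            unsigned int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    out[2 * tid]     = temp[2 * tid];
    out[2 * tid + 1] = temp[2 * tid + 1];
}

// Hypothetical host wrapper; `in` is assumed to hold exactly SCAN_N values.
std::vector<unsigned int> exclusive_scan(const std::vector<unsigned int> &in) {
    size_t bytes = SCAN_N * sizeof(unsigned int);
    unsigned int *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    exclusive_scan_block<<<1, SCAN_N / 2>>>(dIn, dOut);
    std::vector<unsigned int> out(SCAN_N);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);
    return out;
}
```

Note that every step is separated by a barrier that all threads in the block reach; a barrier only synchronizes threads within one block (one work-group in OpenCL terms), which is why this sketch stays inside a single block.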
Is it possible to generate random numbers within a device function without preallocating all the states? I would like to generate and use them in "realtime". I need them for Monte Carlo simulations; which generators are the most suitable for this purpose? The numbers generated below are single precision; is it possible to have them in double precision?
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
__global__ void cudaRand(float *d_out, unsigned long seed)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
curandState state;
curand_init(seed, i, 0, &state);
d_out[i] = curand_uniform(&state);
}
int main(int argc, char** argv)
{
size_t N = 1 << 4;
float *v = new float[N];
float *d_out;
cudaMalloc((void**)&d_out, N * sizeof(float));
// generate random numbers
cudaRand << < 1, N >> > (d_out, time(NULL));
cudaMemcpy(v, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
for (size_t i = 0; i < N; i++)
{
printf("out: %f\n", v[i]);
}
cudaFree(d_out);
delete[] v;
return 0;
}
UPDATE
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <ctime>
__global__ void cudaRand(double *d_out)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
curandState state;
curand_init((unsigned long long)clock() + i, 0, 0, &state);
d_out[i] = curand_uniform_double(&state);
}
int main(int argc, char** argv)
{
size_t N = 1 << 4;
double *h_v = new double[N];
double *d_out;
cudaMalloc((void**)&d_out, N * sizeof(double));
// generate random numbers
cudaRand << < 1, N >> > (d_out);
cudaMemcpy(h_v, d_out, N * sizeof(double), cudaMemcpyDeviceToHost);
for (size_t i = 0; i < N; i++)
printf("out: %f\n", h_v[i]);
cudaFree(d_out);
delete[] h_v;
return 0;
}
This is how I dealt with a similar situation in the past, within a __device__/__global__ function:
int tId = threadIdx.x + (blockIdx.x * blockDim.x);
curandState state;
curand_init((unsigned long long)clock() + tId, 0, 0, &state);
double rand1 = curand_uniform_double(&state);
double rand2 = curand_uniform_double(&state);
So just use curand_uniform_double to generate random doubles. I also believe you don't want the same seed for all of the threads, which is what I am trying to achieve by using clock() + tId instead. This way the odds of getting the same rand1/rand2 in any two threads are close to nil.
EDIT:
However, based on the comments below, the proposed approach may lead to biased results:
JackOLantern pointed me to this part of the cuRAND documentation:
Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences.
Also there is a devtalk thread devoted to how to improve performance of curand_init in which the proposed solution to speed up the curand initialization is:
One thing you can do is use different seeds for each thread and a fixed subsequence of 0 and offset of 0.
But the same poster later states:
The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don't know of any problems like this, and even if they do exist they will most likely be rare.
So it is basically up to you whether you want better performance (as I did) or results with no risk of such bias. If the latter is what you want, then the solution proposed by JackOLantern is the correct one, i.e. initialize curand as:
curand_init((unsigned long long)clock(), tId, 0, &state)
Using non-zero values for the offset and subsequence parameters does, however, decrease performance. For more information on these parameters, you may review this SO thread and also the cuRAND documentation.
I see that JackOLantern stated in a comment that:
I would say it is not recommandable to call curand_init and curand_uniform_double from within the same kernel from two reasons ........ Second, curand_init initializes the pseudorandom number generator and sets all of its parameters, so I'm afraid your approach will be somewhat slow.
I dealt with this over several pages of my thesis. I tried various approaches to get different random numbers in each thread, and creating a curandState in each thread turned out to be the most viable solution for me. I needed to generate ~10 random numbers in each thread, and among other things I tried:
developing my own simple random number generator (a Linear Congruential Generator) whose initialization was basically free; however, performance suffered greatly when generating numbers, so in the end having a curandState in each thread turned out to be superior,
pre-allocating curandStates and reusing them; this was memory heavy, and when I decreased the number of preallocated states I had to use non-zero values for the offset/subsequence parameters of curand_init in order to get rid of bias, which led to decreased performance when generating numbers.
So after a thorough analysis I decided to indeed call curand_init and curand_uniform_double in each thread. The only problem was the number of registers that these states occupied, so I had to be careful with block sizes so as not to exceed the maximum number of registers available to each block.
That's what I have to say about the provided solution, which I was finally able to test; it works just fine on my machine/GPU. I ran the code from the UPDATE section of the above question, and 16 different random numbers were correctly displayed in the console. Therefore I advise you to perform proper error checking after executing the kernel to see what went wrong inside. This topic is very well covered in this SO thread.
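For comparison, here is a minimal sketch of the other variant discussed above: one seed shared by all threads, with the thread index passed as the subsequence argument of curand_init, and the launch checked for errors. The names `uniform_doubles` and `generate_uniform` are hypothetical, and passing the seed from the host is an assumption made here so that runs are reproducible.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

// Each thread creates its own state on the fly: same seed, subsequence = thread index.
__global__ void uniform_doubles(double *out, int n, unsigned long long seed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    curandState state;
    curand_init(seed, tid, 0, &state);          // (seed, subsequence, offset, state)
    out[tid] = curand_uniform_double(&state);   // uniform double in (0, 1]
}

// Hypothetical host wrapper: returns n samples, or an empty vector on a CUDA error.
std::vector<double> generate_uniform(int n, unsigned long long seed) {
    double *dOut = nullptr;
    std::vector<double> out(n);
    cudaMalloc(&dOut, n * sizeof(double));
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    uniform_doubles<<<blocks, threads>>>(dOut, n, seed);
    cudaError_t err = cudaGetLastError();       // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err == cudaSuccess)
        err = cudaMemcpy(out.data(), dOut, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) out.clear();
    return out;
}
```

Whether this or the clock() + tId seeding is preferable depends on the performance versus statistical-quality trade-off described above; the sketch takes no position on that.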
To my knowledge, if atomic operations are performed on same memory address location in a warp, the performance of the warp could be 32 times slower.
But what if atomic operations of threads in a warp are on 32 different memory locations? Is there any performance penalty at all? Or it will be as fast as normal operation?
My use case is that I have 32 different positions, each thread in a warp needs one of these position but which position is data dependent. So each thread could use atomicCAS to scan if the location desired is empty or not. If it is not empty, scan the next position.
If I am lucky, 32 threads could atomicCAS to 32 different memory locations. Is there any performance penalty in this case?
I assume the Kepler architecture is used.
In the code below, I'm adding a constant value to the elements of an array (dev_input). I'm comparing two kernels, one using atomicAdd and one using regular addition. This is an example taken to the extreme in which atomicAdd operates on completely different addresses, so there will be no need for serialization of the operations.
#include <stdio.h>
#define BLOCK_SIZE 1024
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void regular_addition(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) dev_input[i] = dev_input[i] + val;
}
__global__ void atomic_operations(float *dev_input, float val, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) atomicAdd(&dev_input[i],val);
}
int main(){
int N = 8192*32;
float* output = (float*)malloc(N*sizeof(float));
float* dev_input; gpuErrchk(cudaMalloc((void**)&dev_input, N*sizeof(float)));
gpuErrchk(cudaMemset(dev_input, 0, N*sizeof(float)));
int NumBlocks = iDivUp(N,BLOCK_SIZE);
float time, timing1 = 0.f, timing2 = 0.f;
cudaEvent_t start, stop;
int niter = 32;
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
atomic_operations<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing1 = timing1 + time;
}
printf("Time for atomic operations: %3.5f ms \n", timing1/(float)niter);
for (int i=0; i<niter; i++) {
gpuErrchk(cudaEventCreate(&start));
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaEventRecord(start,0));
regular_addition<<<NumBlocks,BLOCK_SIZE>>>(dev_input,3,N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop,0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&time, start, stop));
timing2 = timing2 + time;
}
printf("Time for regular addition: %3.5f ms \n", timing2/(float)niter);
}
Testing this code on my NVIDIA GeForce GT540M, CUDA 5.5, Windows 7, I obtain approximately the same results for the two kernels, i.e., about 0.7ms.
Now change the instruction
if (i < N) atomicAdd(&dev_input[i],val);
to
if (i < N) atomicAdd(&dev_input[i%32],val);
which is closer to the case of your interest, namely, each atomicAdd operates on different addresses within a warp. The result I obtain is that no performance penalty is observed.
Finally, change the above instruction to
if (i < N) atomicAdd(&dev_input[0],val);
This is the other extreme, in which atomicAdd always operates on the same address. In this case, the execution time rises to 5.1ms.
The above tests have been performed on a Fermi architecture. You can try to run the above code on your Kepler card.
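To try the access pattern from the question itself rather than atomicAdd, a minimal sketch of the atomicCAS scan might look like the following. It assumes one warp of 32 threads, 32 slots marked empty with -1, non-negative preferred positions, and linear probing to the next slot; the names `claim_slots` and `claim_all` are hypothetical. It only illustrates the pattern and makes no claim about its timing.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NUM_SLOTS 32
#define EMPTY (-1)

// Each thread tries to claim its preferred slot; if taken, it scans the next one.
__global__ void claim_slots(int *slots, const int *preferred, int *claimed) {
    int tid = threadIdx.x;
    int pos = preferred[tid] % NUM_SLOTS;
    claimed[tid] = EMPTY;
    for (int probe = 0; probe < NUM_SLOTS; ++probe) {
        // atomicCAS returns the old value; EMPTY means this thread won the slot.
        if (atomicCAS(&slots[pos], EMPTY, tid) == EMPTY) {
            claimed[tid] = pos;
            break;
        }
        pos = (pos + 1) % NUM_SLOTS;
    }
}

// Hypothetical host wrapper; `preferred` is assumed to hold NUM_SLOTS values >= 0.
// Returns, for each thread, the slot it ended up claiming.
std::vector<int> claim_all(const std::vector<int> &preferred) {
    size_t bytes = NUM_SLOTS * sizeof(int);
    int *dSlots, *dPref, *dClaimed;
    cudaMalloc(&dSlots, bytes);
    cudaMalloc(&dPref, bytes);
    cudaMalloc(&dClaimed, bytes);
    cudaMemset(dSlots, 0xFF, bytes);            // all bytes 0xFF, i.e. every int is -1
    cudaMemcpy(dPref, preferred.data(), bytes, cudaMemcpyHostToDevice);
    claim_slots<<<1, NUM_SLOTS>>>(dSlots, dPref, dClaimed);
    std::vector<int> claimed(NUM_SLOTS);
    cudaMemcpy(claimed.data(), dClaimed, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dSlots); cudaFree(dPref); cudaFree(dClaimed);
    return claimed;
}
```

Timing this kernel with distinct preferred positions against a run where every thread prefers slot 0 would give the Kepler-specific answer the question asks for.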
I've implemented various algorithms using Cuda, such as matrix multiplication, Cholesky decomposition and inversion (by forward substitution) of a lower triangular matrix.
For some of these algorithms I have a for loop in the kernel that repeats part of the kernel code many times. It all works well for matrices of floats (flattened, i.e. represented by 1D arrays) up to about 200x200, with the for loop running part of the kernel code 200 times. When I increase the matrix size to, say, 1000x1000 (with the for loop running 1000 times), the GPU takes about as much computing time as I would expect based on trials with smaller matrices. But no kernel code (including the parts outside the for loop) seems to have run: none of the elements of the output matrix have changed since initialization. If I increase the matrix size to around 500, I am sometimes able to get the kernel to run if I set the loop limit to some low value (such as 3).
Have I hit some hardware limit here or is there a trick I can use to make these for loops work for large matrices?
This is an example of complete code that you can copy into a .cu file. The kernel attempts to copy the contents of matrix A (W*H) to matrix B (W*H). The output shows the first element of both matrices. For W*H < 200x200 this works just fine; for W*H = 1000x1000 no copying seems to occur, because the elements of B remain zero, as if nothing happened since initialization. I'm compiling and running this code on a Linux-based server. For large matrices, error checking gives me "GPUassert: unspecified launch failure" at line 67, which is the cudaMemcpy line that copies matrix B from device to host.
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
#include <time.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void MatrixCopy(float *A, float *B, int W)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
int j = blockIdx.y*blockDim.y + threadIdx.y;
B[j*W + i]=A[j*W + i];
}
int main(void)
{
clock_t start1=clock();
int W=1000;
int H=1000;
float *A, *B;
float *devA, *devB;
A=(float*)malloc(W*H*sizeof(float));
B=(float*)malloc(W*H*sizeof(float));
for(int i=0; i<W*H; i++)
{
A[i]=rand() % 3;
A[i]=A[i]+1;
B[i]=0;
}
gpuErrchk( cudaMalloc( (void**)&devA, W*H*sizeof(float) ) );
gpuErrchk( cudaMalloc( (void**)&devB, W*H*sizeof(float) ) );
gpuErrchk( cudaMemcpy( devA, A, W*H*sizeof(float), cudaMemcpyHostToDevice ) );
gpuErrchk( cudaMemcpy( devB, B, W*H*sizeof(float), cudaMemcpyHostToDevice ) );
dim3 threads(32,32);
int bloW=(int)ceil((double)W/32);
int bloH=(int)ceil((double)H/32);
dim3 blocks(bloW, bloH);
clock_t finish1=clock();
clock_t start2=clock();
MatrixCopy<<<blocks,threads>>>(devA, devB, W);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaMemcpy( B, devB, W*H*sizeof(float), cudaMemcpyDeviceToHost ) );
clock_t finish2=clock();
printf("\nGPU calculation time (ms): %d\nInitialization time (ms): %d\n\n", (int)ceil(double(((finish2-start2)*1000/(CLOCKS_PER_SEC)))), (int)ceil(double(((finish1-start1)*1000/(CLOCKS_PER_SEC)))));
printf("\n%f\n", A[0]);
printf("\n%f\n\n", B[0]);
gpuErrchk( cudaFree(devA) );
gpuErrchk( cudaFree(devB) );
free(A);
free(B);
#ifdef _WIN32
system ("PAUSE");
#endif
return 0;
}
Your kernel has no thread checking.
You are deciding the grid size (in blocks) like this:
int bloW=(int)ceil((double)W/32);
int bloH=(int)ceil((double)H/32);
For values of H and W that are not exact multiples of the threads-per-block size (32), this creates extra threads and blocks outside the actual matrix you care about (1000x1000). There's nothing wrong with this; it is common practice.
However, we must make sure those extra threads don't actually do anything (i.e. don't generate invalid accesses to memory). Your kernel does not provide this checking.
If you modify your kernel to be something like this:
__global__ void MatrixCopy(float *A, float *B, int W, int H)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
int j = blockIdx.y*blockDim.y + threadIdx.y;
if ((i < W) && (j < H))
B[j*W + i]=A[j*W + i];
}
I think you'll have better results. Without this, some of your A and B references in the kernel generate out-of-bounds accesses, which you can see if you run your code with cuda-memcheck. You'll also have to modify the kernel invocation line to add the H parameter. I haven't really sorted out whether your i variable corresponds to H or W; I assume you can do that and make the change if needed. In this case, since the matrix is square, it doesn't really matter.
And you should do proper CUDA error checking any time you are having trouble with CUDA code. I would suggest doing this before you post here asking for help.
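Putting the pieces above together, a minimal sketch of the guarded copy with its launch and error check could look like this. It assumes row-major storage, so that i indexes columns (bounded by W) and j indexes rows (bounded by H), given the B[j*W + i] addressing in the question. The names `matrix_copy_guarded` and `copy_on_device` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void matrix_copy_guarded(const float *A, float *B, int W, int H) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column index, valid for 0..W-1
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index, valid for 0..H-1
    if (i < W && j < H)                             // extra threads do nothing
        B[j * W + i] = A[j * W + i];
}

// Hypothetical host wrapper: returns a device-made copy of A,
// or an empty vector if any CUDA call reports an error.
std::vector<float> copy_on_device(const std::vector<float> &A, int W, int H) {
    size_t bytes = (size_t)W * H * sizeof(float);
    float *devA = nullptr, *devB = nullptr;
    std::vector<float> B(A.size(), 0.0f);
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);
    cudaMemcpy(devA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(devB, 0, bytes);
    dim3 threads(32, 32);
    // Round up so partial tiles at the right and bottom edges are still covered.
    dim3 blocks((W + threads.x - 1) / threads.x, (H + threads.y - 1) / threads.y);
    matrix_copy_guarded<<<blocks, threads>>>(devA, devB, W, H);
    cudaError_t err = cudaGetLastError();                   // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err == cudaSuccess)
        err = cudaMemcpy(B.data(), devB, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devA);
    cudaFree(devB);
    if (err != cudaSuccess) B.clear();
    return B;
}
```

With W = H = 1000 the grid is 32x32 blocks of 32x32 threads, i.e. 1024 threads per dimension, so the last 24 threads in each dimension fall outside the matrix and are the ones the guard keeps from touching memory.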