How to avoid un-coalesced accesses in matrix multiplication CUDA kernel?

I am learning CUDA with the book 'Programming Massively Parallel Processors'. A practice problem from chapter 5 confuses me:
For tiled matrix multiplication, out of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely avoid un-coalesced accesses to global memory? (You only need to consider square blocks.)
In my understanding, BLOCK_SIZE has little to do with memory coalescing. As long as threads within a single warp access consecutive elements, we will have coalesced accesses. I could not figure out where the kernel has un-coalesced accesses to global memory. Any hints from you guys?
Here is the kernel's source code:
#define COMMON_WIDTH 512
#define ROW_LEFT 500
#define COL_RIGHT 250
#define K 1000
#define TILE_WIDTH 32
__device__ int D_ROW_LEFT = ROW_LEFT;
__device__ int D_COL_RIGHT = COL_RIGHT;
__device__ int D_K = K;
.....
__global__
void MatrixMatrixMultTiled(float *matrixLeft, float *matrixRight, float *output){
    __shared__ float sMatrixLeft[TILE_WIDTH][TILE_WIDTH];
    __shared__ float sMatrixRight[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int col = bx * TILE_WIDTH + tx;
    int row = by * TILE_WIDTH + ty;
    float value = 0;
    for (int i = 0; i < ceil(D_K/(float)TILE_WIDTH); ++i){
        if (row < D_ROW_LEFT && row * D_K + i * TILE_WIDTH + tx < D_K){
            sMatrixLeft[ty][tx] = matrixLeft[row * D_K + i * TILE_WIDTH + tx];
        }
        if (col < D_COL_RIGHT && (ty + i * TILE_WIDTH) * D_COL_RIGHT + col < D_K ){
            sMatrixRight[ty][tx] = matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
        }
        __syncthreads();
        for (int j = 0; j < TILE_WIDTH; j++){
            value += sMatrixLeft[ty][j] * sMatrixRight[j][tx];
        }
        __syncthreads();
    }
    if (row < D_ROW_LEFT && col < D_COL_RIGHT ){
        output[row * D_COL_RIGHT + col] = value;
    }
}

Your question is incomplete, since the code you have posted does not make any reference to BLOCK_SIZE, and that is certainly at least very relevant to the question posed in the book. More generally, questions that pose a kernel without the launch configuration are often incomplete, since the launch configuration is often relevant to both the correctness and the behavior of a kernel.
I've not re-read this portion of the book right at the moment. However, I'll assume the kernel launch configuration includes a block dimension something like the following (this information is absent from your question but, in my opinion, should have been included for a sensible question):
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(...,...);
And I will assume the kernel launch is given by something like:
MatrixMatrixMultTiled<<<dimGrid, dimBlock>>>(...);
Your statement, "As long as threads within a single warp access consecutive elements, we will have coalesced accesses," is a reasonable working definition. Let's show that it is violated for some choices of BLOCK_SIZE, given the above assumptions that cover the gaps in your incomplete question.
Coalesced access is a term that applies to global memory accesses only. We will therefore ignore accesses to shared memory. We will also, for this discussion, ignore accesses to the __device__ variables such as D_ROW_LEFT. (The access to those variables appears to be uniform. We can quibble about whether that constitutes coalesced access. My claim would be that it does constitute coalesced access, but we need not unpack that here.) Therefore we are left with just 3 "access" points:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
output[row * D_COL_RIGHT + col]
Now, to pick an example, let's suppose BLOCK_SIZE is 16. Will any of the above access points violate your statement "threads within single warp access consecutive elements"?
Let's start with the block (0,0). Therefore row is equal to threadIdx.y and col is equal to threadIdx.x. Let's consider the first warp in that block. Therefore the first 16 threads in that warp will have a threadIdx.y value of 0, and their threadIdx.x values will be increasing from 0..15. Likewise the second 16 threads in that warp will have a threadIdx.y value of 1, and their threadIdx.x values will be increasing from 0..15.
Now let's compute the actual index generated for the first access point above, across the warp. Let's assume we are on the first loop iteration, so i is zero. Therefore this:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
reduces to:
matrixLeft[threadIdx.y * D_K + threadIdx.x];
D_K here is just the device copy of the K variable, which is 1000. Now let's evaluate the reduced index expression above across our selected warp (0) in our selected block (0,0):
warp lane:    0    1    2    3    4    5    6  ...   15    16    17    18  ...   31
threadIdx.x:  0    1    2    3    4    5    6        15     0     1     2        15
threadIdx.y:  0    0    0    0    0    0    0         0     1     1     1         1
index:        0    1    2    3    4    5    6        15  1000  1001  1002      1015
Therefore the generated index pattern here shows a discontinuity between the 16th and 17th thread in the warp, and the access pattern does not fit your previously stated condition:
"threads within single warp access consecutive elements"
and we do not have coalesced access in this case (at least, for float quantities).
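To see the discontinuity for yourself, here is a small host-side sketch of mine (not from the book) that reproduces the index pattern generated by warp 0 of block (0,0) for matrixLeft on iteration i = 0, assuming dimBlock(BLOCK_SIZE, BLOCK_SIZE) and K = 1000 as above:
#include <cstdio>

int main() {
    const int K = 1000;
    const int block_sizes[] = {16, 32};
    for (int b = 0; b < 2; ++b) {
        const int BLOCK_SIZE = block_sizes[b];
        std::printf("BLOCK_SIZE = %d\n", BLOCK_SIZE);
        for (int lane = 0; lane < 32; ++lane) {
            int tx  = lane % BLOCK_SIZE;   // threadIdx.x of this warp lane
            int ty  = lane / BLOCK_SIZE;   // threadIdx.y of this warp lane
            int row = ty;                  // block (0,0): row == threadIdx.y
            std::printf("lane %2d -> index %4d\n", lane, row * K + tx);
        }
    }
    return 0;
}
For BLOCK_SIZE = 16 the printed indices jump from 15 to 1000 at lane 16, while for BLOCK_SIZE = 32 all 32 lanes produce consecutive indices.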

Related

Grid size in phase #4 of Harris' reduction optimization

I am learning about unrolling loops to optimize kernel computation.
This is a code snippet from the book Professional CUDA C Programming:
if (idx + 4 * blockDim.x <= n)
{
    int a1 = g_idata[idx];
    int a2 = g_idata[idx + blockDim.x];
    int a3 = g_idata[idx + 2 * blockDim.x];
    int a4 = g_idata[idx + 3 * blockDim.x];
    tmpSum = a1 + a2 + a3 + a4;
}
In my understanding, each thread works on 4 data blocks and processes a single element from each data block.
So, when we launch the kernel, compared with the kernel without unrolling, the grid configuration is changed to
reduceSmemUnroll<<<grid.x / 4, block>>> (i.e. grid.x is divided by 4).
Then I have a question about the code snippet from Mark Harris's presentation on parallel reduction on page 32:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();
My question is how to determine the grid size when launching the kernel. Should it be grid.x / 2 compared to the configuration without the multiple load?
Yes, it should be half the number of blocks; it says so on the slide with the first occurrence of the code snippet you quoted from Mark's presentation - already on slide 18:
Halve the number of blocks, and replace single load:
[code snippet]
with two loads and [the] first add of the reduction
Of course, you need to be careful about the sizes. The presentation assumes, for simplicity, that your overall length is a power of 2, so you can always safely divide by 2 while there are multiple elements left. In real life that is not the case, so you may need to allow for slack (e.g. "half the grid size plus one if it was odd").
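For completeness, one way to compute that launch configuration (my own naming, a sketch rather than the presentation's code), rounding up so a non-power-of-2 n is still covered:
#include <cuda_runtime.h>

// Each block consumes 2 * blockSize input elements on its first pass
// (one load plus the "first add"), so we need roughly half as many blocks
// as the single-load version; round up to cover a trailing partial tile.
unsigned int gridSizeForFirstAddDuringLoad(unsigned int n, unsigned int blockSize)
{
    return (n + 2 * blockSize - 1) / (2 * blockSize);
}

// Hypothetical launch, assuming a reduction kernel like the snippet above:
// dim3 grid(gridSizeForFirstAddDuringLoad(n, blockSize));
// reduce_with_first_add<<<grid, blockSize, blockSize * sizeof(int)>>>(g_idata, g_odata, n);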

Heat equation matrix in CUDA - illegal address error

Following this question, with reference to the shared memory example in the official guide, I'm trying to build the heat equation matrix, which looks just like this poorly drawn image I made.
Here's what I've done so far, minimal example
#define N 32
#define BLOCK_SIZE 16
#define NUM_BLOCKS ((N + BLOCK_SIZE - 1)/ BLOCK_SIZE)
__global__ void heat_matrix(int* A)
{
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ int temp_sm_A[N*N];
    int* temp_A = &temp_sm_A[0];
    memset(temp_A, 0, N*N*sizeof(int));
    if (tid < N) //(*)
    {
        #pragma unroll
        for (unsigned int m = 0; m < NUM_BLOCKS; ++m)
        {
            #pragma unroll
            for (unsigned int e = 0; e < BLOCK_SIZE; ++e)
            {
                if ( (tid == 0 && e == 0) || (tid == (N-1) && e == (BLOCK_SIZE-1)) )
                {
                    temp_A[tid + (e + BLOCK_SIZE * m) * N] = -2;
                    temp_A[tid + (e + BLOCK_SIZE * m) * N + ( tid==0 ? 1 : -1 )] = 1;
                }
                if ( tid == e )
                {
                    temp_A[tid + (e + BLOCK_SIZE * m) * N - 1] = 1;
                    //printf("temp_A[%d] = 1;\n", (tid + (e + BLOCK_SIZE * m) * N -1));
                    temp_A[tid + (e + BLOCK_SIZE * m) * N] = -2;
                    //printf("temp_A[%d] = -2;\n", (tid + (e + BLOCK_SIZE * m) * N));
                    temp_A[tid + (e + BLOCK_SIZE * m) * N + 1] = 1;
                    //printf("temp_A[%d] = 1;\n", (tid + (e + BLOCK_SIZE * m) * N +1));
                }
            }
        }
        __syncthreads(); //(**)
        memcpy(A, temp_A, N*N*sizeof(int));
    }
}
int main(){
    int* h_A = (int*)malloc(N*N*sizeof(int));
    memset(h_A, 0, N*N*sizeof(int));
    int* d_A;
    checkCudaErrors(cudaMalloc((void**)&d_A, N*N*sizeof(int)));
    checkCudaErrors(cudaMemcpy(d_A, h_A, N*N*sizeof(int), cudaMemcpyHostToDevice));
    dim3 dim_grid((N/2 + BLOCK_SIZE - 1) / BLOCK_SIZE);
    dim3 dim_block(BLOCK_SIZE);
    heat_matrix<<<dim_grid, dim_block>>>(d_A);
    checkCudaErrors(cudaMemcpy(h_A, d_A, N*N*sizeof(int), cudaMemcpyDeviceToHost));
    ...
}
The code is arranged to suit a large N (larger than 32). I took advantage of the block division. When executing, the program yields
CUDA error at matrix.cu:102 code=77(cudaErrorIllegalAddress) "cudaMemcpy(h_A, d_A, N*N*sizeof(int), cudaMemcpyDeviceToHost)"
And cuda-memcheck provides only one error (actually there is another, but it comes from cudasuccess=checkCudaErrors(cudaDeviceReset()); ...)
========= CUDA-MEMCHECK
========= Invalid __shared__ write of size 4
========= at 0x00000cd0 in heat_matrix(int*)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0xfffffffc is out of bounds
...
I can't see what I did wrong in the code. How can thread 0 in the first block provoke an illegal access? There's even a specific if case to deal with it, and the line of code in which the error occurred isn't reported.
Moreover, is there a more efficient way for my code than dealing with all those ifs? Sure there is, but I couldn't find a better parallel expression to split the cases in the second for loop.
On a side note, to me the (*) seems unnecessary; instead (**) is necessary if I want to follow with other GPU function calls. Am I right?
Check out this line:
temp_A[tid + (e + BLOCK_SIZE * m) * N - 1] = 1;
For the thread with tid equal to zero during the first iteration, tid + (e + BLOCK_SIZE * m) * N - 1 evaluates to an index of -1. This is exactly what the cuda-memcheck output is complaining about (with the address having wrapped around due to underflow).
A similar out-of-bounds access will occur later for the line
temp_A[tid + (e + BLOCK_SIZE * m) * N + 1] = 1;
when tid, e, and m all assume their maximum value.
You have multiple threads writing to the same memory location. Each thread should write to exactly one array element per inner loop iteration. There is no need to write out neighboring elements because they are already covered by their own threads.
You have a race condition between the initializing memset() and the stores inside the main loops. Put a syncthreads() after the memset().
The calls to memset() and memcpy() will lead to each thread doing a full set/copy, doing the operations N times instead of just once.
The common way of handling this is to write out the operation explicitly, dividing the work between the threads of the block.
However ...
there is no benefit from creating the matrix in shared memory first and then copying it to global memory later. Writing directly to A in global memory eliminates the need for memset(), memcpy() and syncthreads() altogether.
Using a block size of just 16 threads leaves half of the resources unused, as thread blocks are allocated in units of 32 threads (a warp).
You may want to re-read the section about the Thread Hierarchy in the CUDA C Programming Guide.
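For illustration, here is a minimal sketch of writing the matrix directly to global memory with one thread per element, as suggested above. This is my own code (not the answerer's, and the 2D 32x32 launch is an assumption), but it produces the same tridiagonal pattern as your kernel without any shared memory, memset(), memcpy() or __syncthreads():
__global__ void heat_matrix_direct(int* A, int n)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (row >= n || col >= n) return;

    int v = 0;
    if (col == row)                            v = -2;  // main diagonal
    else if (col == row - 1 || col == row + 1) v = 1;   // sub/super diagonal
    A[row * n + col] = v;                                // one write per thread
}

// Possible launch (assumed, not from the question):
// dim3 block(32, 32);
// dim3 grid((N + 31) / 32, (N + 31) / 32);
// heat_matrix_direct<<<grid, block>>>(d_A, N);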
In your kernel, temp_A is a local pointer to the beginning of your shared memory array. Considering:
N = 32;
BLOCK_SIZE = 16;
m in {0, 1};
e in [0, BLOCK_SIZE)
accesses like temp_A[tid + (e + BLOCK_SIZE * m) * N] can easily go out of bounds of the 1024-element array.

Best way to achieve CUDA Vector Diagonalization

What I want to do is feed in my m x n matrix, and in parallel, construct n square diagonal matrices for each column of the matrix, perform an operation on each square diagonal matrix, and then recombine the result. How do I do this?
So far, I start off with an m x n matrix; the result from a previous matrix computation where each element is calculated using the function y = f(g(x)).
This gives me a matrix with n column elements [f1, f2...fn] where each fn represents a column vector of height m.
From here, I want to differentiate each column of the matrix with respect to g(x). Differentiating fn(x) w.r.t. g(x) results in a square matrix with elements f'(x). Under constraint, this square matrix reduces to a Jacobian with the elements of each row along the diagonal of the square matrix, and equal to fn', all other elements equaling zero.
Hence it is necessary to construct the diagonal matrix for each of the column vectors fn.
To do this, I take a target vector defined as A(hA x 1), which was extracted from the larger A(m x n) matrix. I then prepare a zeroed matrix defined as C(hA x hA), which will be used to hold the diagonals.
The aim is to diagonalize the vector A into a square matrix with each element of A sitting on the diagonal of C, everything else being zero.
There are probably more efficient ways to accomplish this using some pre-built routine without building a whole new kernel, but please be aware that for these purposes, this method is necessary.
The kernel code (which works) to accomplish this is shown here:
_cudaDiagonalizeTest<<<5, 1>>>(d_A, matrix_size.uiWA, matrix_size.uiHA, d_C, matrix_size.uiWC, matrix_size.uiHC);

__global__ void _cudaDiagonalizeTest(float *A, int wA, int hA, float *C, int wC, int hC)
{
    int ix, iy, idx;
    ix = blockIdx.x * blockDim.x + threadIdx.x;
    iy = blockIdx.y * blockDim.y + threadIdx.y;
    idx = iy * wA + ix;
    C[idx * (wC + 1)] = A[idx];
}
I am a bit suspicious that this is a very naive approach to a solution and was wondering if someone could give an example of how I could do the same using
a) reduction
b) thrust
For vectors of large row size, I would like to be able to use the GPU's multithreading capabilities to chunk the task into small jobs, and combine each result at the end with __syncthreads().
The picture below shows what the desired result is.
I have read NVIDIA's article on reduction, but did not manage to achieve the desired results.
Any assistance or explanation would be very much welcomed.
Thanks.
Matrix A is the target with 4 columns. I want to take each column, and copy its elements into Matrix B as a diagonal, iterating through each column.
I created a simple example based on thrust. It uses column-major order to store the matrices in a thrust::device_vector. It should scale well with larger row/column counts.
Another approach could be based off the thrust strided_range example.
This example does what you want (fill the diagonals based on the input vector). However, depending on how you proceed with the resulting matrix to your "Differentiating" step, it might still be worth investigating if a sparse storage (without all the zero entries) is possible, since this will reduce memory consumption and ease iterating.
#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/sequence.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <iostream>

template<typename V>
void print_matrix(const V& mat, int rows, int cols)
{
    for(int i = 0; i < rows; ++i)
    {
        for(int j = 0; j < cols; ++j)
        {
            std::cout << mat[i + j*rows] << "\t";
        }
        std::cout << std::endl;
    }
}

struct diag_index : public thrust::unary_function<int,int>
{
    diag_index(int rows) : rows(rows){}

    __host__ __device__
    int operator()(const int index) const
    {
        return (index*rows + (index%rows));
    }

    const int rows;
};

int main()
{
    const int rows = 5;
    const int cols = 4;

    // allocate memory and fill with demo data
    // we use column-major order
    thrust::device_vector<int> A(rows*cols);
    thrust::sequence(A.begin(), A.end());
    thrust::device_vector<int> B(rows*rows*cols, 0);

    // fill diagonal matrix
    thrust::scatter(A.begin(), A.end(),
                    thrust::make_transform_iterator(thrust::make_counting_iterator(0), diag_index(rows)),
                    B.begin());

    print_matrix(A, rows, cols);
    std::cout << std::endl;
    print_matrix(B, rows, rows*cols);
    return 0;
}
This example will output:
0 5 10 15
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
0 0 0 0 0 5 0 0 0 0 10 0 0 0 0 15 0 0 0 0
0 1 0 0 0 0 6 0 0 0 0 11 0 0 0 0 16 0 0 0
0 0 2 0 0 0 0 7 0 0 0 0 12 0 0 0 0 17 0 0
0 0 0 3 0 0 0 0 8 0 0 0 0 13 0 0 0 0 18 0
0 0 0 0 4 0 0 0 0 9 0 0 0 0 14 0 0 0 0 19
An alternate answer that does not use thrust is as follows:
_cudaMatrixTest<<<5, 5>>>(d_A, matrix_size.uiWA, matrix_size.uiHA, d_C, matrix_size.uiWC, matrix_size.uiHC);

__global__ void _cudaMatrixTest(float *A, int wA, int hA, float *C, int wC, int hC)
{
    int ix, iy, idx;
    ix = blockIdx.x * blockDim.x + threadIdx.x;
    iy = blockIdx.y * blockDim.y + threadIdx.y;
    idx = iy * wA + ix;
    C[idx * wC + (idx % wC)] = A[threadIdx.x * wA + (ix / wC)];
}
where d_A is
0 5 10 15
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
Both answers are viable solutions. The question is, which is better/faster?
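One way to check which is faster on your own data is to time each variant with CUDA events. Here is a minimal sketch of mine (the helper name and the lambda-based launch are assumptions, not part of either answer):
#include <cstdio>
#include <cuda_runtime.h>

// Times one launch of a variant; 'launch' is any callable that enqueues the kernel,
// e.g. [&]{ _cudaMatrixTest<<<5, 5>>>(d_A, wA, hA, d_C, wC, hC); }
template <typename Launch>
float time_variant(Launch launch)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}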

Which way to order a shared 2D/3D array for parallel reduction over 1 dimension in CUDA/OpenCL?

Overall goal
I have several reductions to make on a bipartite graph, represented by two dense arrays for the vertices and a dense array specifying whether an edge is present between the two. Say the two arrays are a0[] and a1[], and all edges go like e[i0][i1] (that is, from elements in a0 to elements in a1).
There are ~100+100 vertices, and ~100*100 edges, so each thread is responsible for one edge.
Task 1 : max reduction
For each vertex in a0 I want to find the maximum of all vertices (in a1) connected to it, and then the same in reverse: having assigned the result to an array b0, for each vertex in a1, I want to find the maximum b0[i0] of the connected vertices.
To do this, I:
1) load into shared memory
#define DC_NUM_FROM_SHARED 16
#define DC_NUM_TO_SHARED 16
__global__ void max_reduce_down(
    Value* value1
    , Value* max_value_in_connected
    , int r0_size, int r1_size
    , bool** connected
    )
{
    int id_from, id_to;
    id_from = blockIdx.x * blockDim.x + threadIdx.x;
    id_to   = blockIdx.y * blockDim.y + threadIdx.y;
    bool within_bounds = (id_from < r0_size) && (id_to < r1_size);

    //load into shared memory
    __shared__ Value value[DC_NUM_TO_SHARED][DC_NUM_FROM_SHARED]; //FROM is the inner (consecutive) dimension
    if(within_bounds)
        value[threadIdx.y][threadIdx.x] = connected[id_to][id_from] ? value1[id_to] : 0;
    else
        value[threadIdx.y][threadIdx.x] = 0;
    __syncthreads();
    if(!within_bounds)
        return;
2) reduce
for(int stride = DC_NUM_TO_SHARED/2; threadIdx.y < stride; stride >>= 1)
{
    value[threadIdx.y][threadIdx.x] = max(value[threadIdx.y][threadIdx.x], value[threadIdx.y + stride][threadIdx.x]);
    __syncthreads();
}
3) write back
max_value_in_connected[id_from] = value[0][threadIdx.x];
Task 2 : best k
Similar problem, but the reduction is only over vertices in a0: for each of them, I need to find the k best candidates among those connected in a1 (k is ~5).
1) I initialize the shared array with zero elements except for the 1st place
int id_from, id_to;
id_from = blockIdx.x * blockDim.x + threadIdx.x;
id_to   = blockIdx.y * blockDim.y + threadIdx.y;

__shared__ Value values[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; //champion overlaps
__shared__ int   champs[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; // overlap champions

bool within_bounds = (id_from < r0_size) && (id_to < r1_size);
int i = threadIdx.y * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
if(within_bounds)
{
    values[i] = connected[id_to][id_from] * values1[id_to];
    champs[i] = connected[id_to][id_from] ? id_to : -1;
}
else
{
    values[i] = 0;
    champs[i] = -1;
}
for(int place = 1; place < CHAMP_COUNT; place++)
{
    i = (place * CHAMPS_NUM_TO_SHARED + threadIdx.y) * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
    values[i] = 0;
    champs[i] = -1;
}
if(!within_bounds)
    return;
__syncthreads();
2) reduce it
for(int stride = CHAMPS_NUM_TO_SHARED/2; threadIdx.y < stride; stride >>= 1)
{
    merge_2_champs(values, champs, CHAMP_COUNT, id_from, id_to, id_to + stride);
    __syncthreads();
}
3) write the results back
for(int place = 0; place < LOCAL_DESIRED_ACTIVITY; place++)
    champs0[place][id_from] = champs[place * CHAMPS_NUM_TO_SHARED * CHAMPS_NUM_FROM_SHARED + threadIdx.x];
Issue
How do I order (transpose) the elements in the shared array, so that memory access uses the cache better?
Does it matter at this point, or is there much more I can gain from other optimizations?
Would it be better to transpose the edge matrix if I needed to optimize for Task 2? (as far as I understood, there is a symmetry in Task 1, so it doesn't matter).
P.S.
I have delayed unrolling loops and doing the first reduction iteration while loading, since I thought it was too complicated to do before I have explored simpler ways.
For Task 2, it would be nice to not load zero elements, since the array would never need to grow, and only start shrinking once log k steps have been made. This would make it k times more compact in shared memory! But I dread the resulting index math.
Syntax and Correctness
The unusual types are just typedef'ed ints/chars/etc - AFAIK, in GPUs, it makes sense to compactify those as much as possible. I have not run the code yet, no need to check for indexing errors.
Also, I am using CUDA, but I am interested in an OpenCL perspective as well, since I think the best solution should be the same, and I will be using OpenCL in the future anyway.
OK, I think I figured this out.
The two alternatives that I am considering are to have reductions work along the y dimension and be independent along the x dimension, or vice versa (the x dimension being the contiguous one). In any case, the scheduler is able to assemble threads into warps along the x dimension, so some coherence is guaranteed. However, having coherence extend beyond a warp would be great. Also, due to the 2D/3D nature of the shared arrays, one would have to limit the dimensions to 16 or even 8.
To ensure coalescence within a warp, the scheduler has to assemble warps along the x dimension.
If reducing over x dimension, after each iteration, the number of active threads in a warp will halve. However, if reducing over y dimension, it is the number of active warps that will halve.
So, I need to reduce over y.
Unless the transpose (load) is the slowest part, which would be an abnormal case.
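To make that concrete, here is a minimal sketch of a max-reduction over y on a 2D shared tile (my own illustration with assumed 16x16 tile sizes, not the code from the question): rows of threads retire together, so the surviving warps stay mostly fully populated.
#define TILE_X 16
#define TILE_Y 16

__global__ void max_reduce_over_y(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE_Y][TILE_X];
    int x = blockIdx.x * TILE_X + threadIdx.x;
    int y = blockIdx.y * TILE_Y + threadIdx.y;

    // coalesced load: consecutive threadIdx.x -> consecutive global addresses
    tile[threadIdx.y][threadIdx.x] =
        (x < width && y < height) ? in[y * width + x] : 0.0f;
    __syncthreads();

    for (int stride = TILE_Y / 2; stride > 0; stride >>= 1) {
        if (threadIdx.y < stride)
            tile[threadIdx.y][threadIdx.x] =
                fmaxf(tile[threadIdx.y][threadIdx.x],
                      tile[threadIdx.y + stride][threadIdx.x]);
        __syncthreads();  // every thread reaches this, so the barrier is safe
    }

    if (threadIdx.y == 0 && x < width)
        out[blockIdx.y * width + x] = tile[0][threadIdx.x];
}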
Coalesced buffer reads really matter; kernels can be 32x slower if you don't do them. It can be worth doing a re-arrangement pass if it means being able to do them (of course, the re-arrangement pass needs to be coalesced as well, but you can often leverage shared local memory to do this).

Make CURAND generate different random numbers from a uniform distribution

I am trying to use the cuRAND library to generate random numbers from 0 to 100 which are completely independent of each other. Hence I am giving time as the seed to each thread and specifying "id = threadIdx.x + blockDim.x * blockIdx.x" as the sequence and offset.
Then, after getting the random number as a float, I multiply it by 100 and take its integer value.
Now, the problem I am facing is that it gets the same random number (11) for threads [0,0] and [0,1], no matter how many times I run the code. I am unable to understand what I am doing wrong. Please help.
I am pasting my code below:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <curand_kernel.h>
#include "util/cuPrintf.cu"
#include <time.h>
#define NE WA*HA //Total number of random numbers
#define WA 2 // Matrix A width
#define HA 2 // Matrix A height
#define SAMPLE 100 //Sample number
#define BLOCK_SIZE 2 //Block size
__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
    int id = threadIdx.x + blockIdx.x + blockDim.x;
    curand_init ( seed, id, id, &state[id] );
}

__global__ void generate( curandState* globalState, float* randomMatrix )
{
    int ind = threadIdx.x + blockIdx.x * blockDim.x;
    if(ind < NE){
        curandState localState = globalState[ind];
        float stopId = curand_uniform(&localState) * SAMPLE;
        cuPrintf("Float random value is : %f",stopId);
        int stop = stopId;
        cuPrintf("Random number %d\n",stop);
        for(int i = 0; i < SAMPLE; i++){
            if(i == stop){
                float random = curand_normal( &localState );
                cuPrintf("Random Value %f\t",random);
                randomMatrix[ind] = random;
                break;
            }
        }
        globalState[ind] = localState;
    }
}
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
    // 1. allocate host memory for matrix A
    unsigned int size_A = WA * HA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    float* h_A = (float*) malloc(mem_size_A);
    time_t t;

    // 2. allocate device memory
    float* d_A;
    cudaMalloc((void**) &d_A, mem_size_A);

    // 3. create random states
    curandState* devStates;
    cudaMalloc ( &devStates, size_A*sizeof( curandState ) );

    // 4. setup seeds
    int n_blocks = size_A/BLOCK_SIZE;
    time(&t);
    printf("\nTime is : %u\n",(unsigned long) t);
    setup_kernel <<< n_blocks, BLOCK_SIZE >>> ( devStates, (unsigned long) t );

    // 4. generate random numbers
    cudaPrintfInit();
    generate <<< n_blocks, BLOCK_SIZE >>> ( devStates, d_A );
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();

    // 5. copy result from device to host
    cudaMemcpy(h_A, d_A, mem_size_A, cudaMemcpyDeviceToHost);

    // 6. print out the results
    printf("\n\nMatrix A (Results)\n");
    for(int i = 0; i < size_A; i++)
    {
        printf("%f ", h_A[i]);
        if(((i + 1) % WA) == 0)
            printf("\n");
    }
    printf("\n");

    // 7. clean up memory
    free(h_A);
    cudaFree(d_A);
}
Output that I get is :
Time is : 1347857063
[0, 0]: Float random value is : 11.675105[0, 0]: Random number 11
[0, 0]: Random Value 0.358356 [0, 1]: Float random value is : 11.675105[0, 1]: Random number 11
[0, 1]: Random Value 0.358356 [1, 0]: Float random value is : 63.840496[1, 0]: Random number 63
[1, 0]: Random Value 0.696459 [1, 1]: Float random value is : 44.712799[1, 1]: Random number 44
[1, 1]: Random Value 0.735049
There are a few things wrong here; I'm addressing the first ones to get you started:
General points
Please check the return values of all CUDA API calls, see here for more info.
Please run cuda-memcheck to check for obvious things like out-of-bounds accesses.
Specific points
When allocating space for the RNG state, you should have space for one state per thread (not one per matrix element as you have now).
Your thread ID calculation in setup_kernel() is wrong, should be threadIdx.x + blockIdx.x * blockDim.x (* instead of +).
You use the thread ID as the sequence number as well as the offset; you should just set the offset to zero, as described in the cuRAND manual:
For the highest quality parallel pseudorandom number generation, each
experiment should be assigned a unique seed. Within an experiment,
each thread of computation should be assigned a unique sequence
number.
Finally, you're running two threads per block, which is incredibly inefficient. Check out the CUDA C Programming Guide, in the "maximize utilization" section, for more information; you should be looking to launch a multiple of 32 threads per block (e.g. 128, 256) and a large number of blocks (e.g. tens of thousands). If your problem is small then consider running multiple problems at once (either batched in a single kernel launch or as kernels in different streams to get concurrent execution).
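Putting the specific points above together, a corrected setup kernel might look roughly like this (a sketch of mine, not the answerer's exact code):
__global__ void setup_kernel(curandState *state, unsigned long seed)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;  // '*' rather than '+'
    curand_init(seed, id, 0, &state[id]);            // sequence = id, offset = 0
}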
