I inherited some CUDA code that I need to work on, but some of the indexing in it is confusing me.
A simple example would be the normalisation of data. Say we have a shared array A[2*N], which is a matrix of shape 2xN that has been unrolled into a flat array. Then we have the normalisation means and standard deviations: norm_means[2] and norm_stds[2]. The goal is to normalise the data in A in parallel. A minimal example would be:
__global__ void normalise(float *data, float *norm, float *std) {
    int tdy = threadIdx.y;
    for (int i=tdy; i<D; i+=blockDim.y)
        data[i] = data[i] * norm[i] + std[i];
}
int main(int argc, char **argv) {
    // generate data
    int N=100;
    int D=2;
    MatrixXd A = MatrixXd::Random(N*D,1);
    MatrixXd norm_means = MatrixXd::Random(D,1);
    MatrixXd norm_stds = MatrixXd::Random(D,1);
    // transfer data to device
    float* A_d;
    float* norm_means_d;
    float* norm_stds_d;
    cudaMalloc((void **)&A_d, N * D * sizeof(float));
    cudaMalloc((void **)&norm_means_d, D * sizeof(float));
    cudaMalloc((void **)&norm_stds_d, D * sizeof(float));
    cudaMemcpy(A_d, A.data(), D * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(norm_means_d, norm_means.data(), D * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(norm_stds_d, norm_stds.data(), D * sizeof(float), cudaMemcpyHostToDevice);
    // Setup execution
    const int BLOCKSIZE_X = 8;
    const int BLOCKSIZE_Y = 16;
    const int GRIDSIZE_X = (N-1)/BLOCKSIZE_X + 1;
    dim3 dimBlock(BLOCKSIZE_X, BLOCKSIZE_Y, 1);
    dim3 dimGrid(GRIDSIZE_X, 1, 1);
    normalise<<<dimGrid, dimBlock, 0>>>(A_d, norm_means_d, norm_stds_d);
}
Note that I am using Eigen for the matrix generation. I have omitted the includes for brevity.
Through some magic, the code above works and achieves the desired result. However, the CUDA kernel does not make any sense to me: the for loop should stop after one execution, since i > D after the first iteration, but it doesn't?
If I change the kernel to something that makes more sense to me, e.g.
__global__ void normalise(float *data, float *norm, float *std) {
    int tdy = threadIdx.y;
    for (int i=0; i<D; i++)
        data[tdy + i*blockDim.y] = data[tdy + i*blockDim.y] * norm[i] + std[i];
}
the program stops working and just outputs gibberish data.
Can somebody explain why I get this behaviour?
PS. I am very new to CUDA

It is indeed senseless to have a 2-dimensional kernel to perform an elementwise operation on an array. There is also no reason to work in blocks of size 8x16. But your modified kernel uses the second dimension (y) only; that's probably why it doesn't work. You probably needed to use the first dimension (x) only.
However - it would be reasonable to use the Y dimension for the actual second dimension, e.g. something like this:
__global__ void normalize(
    float* __restrict__ data,
    const float* __restrict__ norm,
    const float* __restrict__ std)
{
    // N (the row length) is assumed to be visible here, e.g. as a compile-time constant.
    auto pos = threadIdx.x + blockDim.x * blockIdx.x;
    auto d = threadIdx.y + blockDim.y * blockIdx.y; // or maybe just threadIdx.y
    data[pos + d * N] = data[pos + d * N] * norm[d] + std[d];
}
Other points to consider:
I added __restrict__ to your pointers. Always do that when relevant: it promises the compiler that the pointers do not alias, which lets it optimise loads and stores more aggressively.
It's a good idea to have a single thread work on more than one element of data, but you should make that happen in the longer dimension, where the thread can reuse its norm and std values rather than read them from memory every time.
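As a rough host-side illustration of the indexing discussed above (plain C++, no GPU needed; `visited_indices` and `normalise_rows` are hypothetical helper names, not part of the posted code): the first function lists the indices one thread of the original loop visits, and the second mirrors what the 2D kernel computes while reading norm[d] and std[d] only once per row. With D = 2 and blockDim.y = 16, threads 0 and 1 each run a single iteration and every other thread runs none, so the posted loop only ever touches data[0] and data[1].

```cpp
#include <vector>

// Indices visited by one thread running
//   for (int i = tdy; i < bound; i += stride)
// where tdy plays the role of threadIdx.y and stride that of blockDim.y.
std::vector<int> visited_indices(int tdy, int stride, int bound) {
    std::vector<int> out;
    for (int i = tdy; i < bound; i += stride) out.push_back(i);
    return out;
}

// Sequential stand-in for the grid of (pos, d) threads in the 2D kernel:
// row d of the unrolled D x N matrix is scaled by norm[d] and shifted by
// shift[d] (the kernel's "std" argument).
void normalise_rows(std::vector<float>& data,
                    const std::vector<float>& norm,
                    const std::vector<float>& shift,
                    int N, int D) {
    for (int d = 0; d < D; ++d) {
        const float scale = norm[d];    // read once per row, reused for every pos
        const float offset = shift[d];
        for (int pos = 0; pos < N; ++pos)
            data[pos + d * N] = data[pos + d * N] * scale + offset;
    }
}
```

For example, `visited_indices(0, 16, 40)` yields 0, 16 and 32, which is the usual strided pattern once the loop bound exceeds the block size.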


CUDA Initialize Array on Device

I am very new to CUDA and I am trying to initialize an array on the device and return the result back to the host to print out to show if it was correctly initialized. I am doing this because the end goal is a dot product solution in which I multiply two arrays together, storing the results in another array and then summing up the entire thing so that I only need to return the host one value.
In the code I am working on, all I am trying to do is check whether I am initializing the array correctly. I am trying to create an array of size N following the pattern 1,2,3,4,5,6,7,8,1,2,3....
This is the code that I've written. It compiles without issue, but when I run it the terminal hangs and I have no clue why. Could someone help me out here? I'm so incredibly confused :\
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#define ARRAY_SIZE 100
#define BLOCK_SIZE 32
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int temp;
    if(temp != 8){
        a_d[x] = temp;
    } else {
        a_d[x] = temp;
        temp = 1;
    }
}
int main (int argc, char *argv[])
{
    //declare pointers for arrays
    int *a_d, *b_d, *c_d, *sum_h, *sum_d,a_h[ARRAY_SIZE];
    //set space for device variables
    cudaMalloc((void**) &a_d, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void**) &b_d, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void**) &c_d, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void**) &sum_d, sizeof(int));
    // set execution configuration
    dim3 dimblock (BLOCK_SIZE);
    dim3 dimgrid (ARRAY_SIZE/BLOCK_SIZE);
    // actual computation: call the kernel
    cu_kernel <<<dimgrid, dimblock>>> (a_d,b_d,c_d,ARRAY_SIZE);
    cudaError_t result;
    // transfer results back to host
    result = cudaMemcpy (a_h, a_d, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
    if (result != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed.");
    }
    // print reversed array
    printf ("Final state of the array:\n");
    for (int i =0; i < ARRAY_SIZE; i++) {
        printf ("%d ", a_h[i]);
    }
    printf ("\n");
}
There are at least 3 issues with your kernel code.
1) You are using the shared memory variable temp without initializing it.
2) You are not resolving the order in which threads access a shared variable, so the reads and writes race with each other.
3) You are imagining (perhaps) a particular order of thread execution, and CUDA provides no guarantees in that area.
The first item seems self-evident; however, naive methods to initialize the variable in a multi-threaded environment like CUDA are not going to work. Firstly, we have the multi-threaded access pattern again. Furthermore, in a multi-block scenario, shared memory in one block is logically distinct from shared memory in another block.
Rather than wrestle with mechanisms unsuited to create the pattern you desire, (informed by notions carried over from a serial processing environment), I would simply do something trivial like this to create the pattern you desire:
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < size) a_d[x] = (x&7) + 1;
}
Are there other ways to do it? Certainly.
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int temp;
    if (!threadIdx.x) temp = blockIdx.x*blockDim.x;
    __syncthreads(); // make thread 0's write to temp visible before anyone reads it
    if (x < size) a_d[x] = ((temp+threadIdx.x) & 7) + 1;
}
You can get as fancy as you like.
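These changes will still leave a few values unwritten (typically zero) at the end of the array: ARRAY_SIZE/BLOCK_SIZE = 100/32 rounds down to 3 blocks, so only 96 of the 100 elements are covered. Fixing that requires changes to your grid sizing. There are many questions about this already, or study a sample code like vectorAdd.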
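A minimal host-side sketch of the grid-sizing fix mentioned above, assuming the usual round-up division (plain C++; `blocks_for` and `emulate_pattern` are hypothetical names used only for illustration). The second function replays the `x < size` guarded kernel body over every thread of every block, so the surplus threads in the last block are visibly skipped.

```cpp
#include <vector>

// Number of blocks needed so that blocks * block_size >= n (rounds up).
int blocks_for(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}

// Host-side emulation of the kernel body a_d[x] = (x & 7) + 1 with the
// x < size guard, run over every thread of every block in turn.
std::vector<int> emulate_pattern(int size, int block_size) {
    std::vector<int> a(size, 0);
    const int blocks = blocks_for(size, block_size);
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < block_size; ++t) {
            const int x = b * block_size + t;   // global thread index
            if (x < size) a[x] = (x & 7) + 1;    // guard skips the surplus threads
        }
    }
    return a;
}
```

With ARRAY_SIZE 100 and BLOCK_SIZE 32 this gives 4 blocks (128 threads) instead of 3, and the guard keeps the extra 28 threads from writing past the end of the array.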

Cuda matrix addition

I have written the following code to sum two 4x4 matrices in cuda.
__global__ void Matrix_add(double* a, double* b, double* c,int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    int index = row * n + col;
    if(col<n && row <n)
        c[index] = a[index] + b[index];
}
int main()
{
    int n=4;
    double **h_a;
    double **h_b;
    double **h_c;
    double *d_a, *d_b, *d_c;
    int size = n*n*sizeof(double);
    h_a = (double **) malloc(n*sizeof(double*));
    h_b = (double **) malloc(n*sizeof(double*));
    h_c = (double **) malloc(n*sizeof(double*));
    int t=0;
    for (t=0;t<n;t++)
    {
        h_a[t]= (double *)malloc(n*sizeof(double));
        h_b[t]= (double *)malloc(n*sizeof(double));
        h_c[t]= (double *)malloc(n*sizeof(double));
    }
    int i=0,j=0;
    // ... (host initialisation, cudaMalloc and cudaMemcpy calls not shown) ...
    dim3 dimBlock(4,4);
    dim3 dimGrid(1,1);
    Matrix_add<<<dimGrid, dimBlock>>>(d_a,d_b,d_c,n);
    for( j=0;j<n;j++)
    {
        // ... (loop body not shown) ...
    }
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
The result of this addition should be a 4x4 all-ones matrix, but in the result all the elements of the matrix are 0. I also get this message after getting the result:
Segmentation fault (core dumped)
Can anyone please help me to find out the problem.
Thank you
Your host arrays (h_a, h_b, h_c) are not contiguous in memory, so your initial cudaMemcpy() calls will copy garbage into GPU memory (apparently zeros in your case).
The reason is that your host arrays are not actually flat, but are instead represented as arrays of pointers, presumably to fake two-dimensional arrays in C. Copying n*n doubles to or from an array that only holds n pointers also runs past the end of that allocation, which is probably where the segmentation fault comes from. In any case, you either need to be more careful with your cudaMemcpy() calls and copy the host arrays row by row, or use a flat representation on the host.
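One way to get the flat representation is sketched below in plain C++ (`flat_index` and `flatten` are hypothetical helper names, not part of the posted code). It copies an array-of-row-pointers matrix into a single contiguous buffer using the same row * n + col formula as the kernel; the buffer's `data()` pointer could then be handed to a single cudaMemcpy of n*n*sizeof(double) bytes.

```cpp
#include <cstddef>
#include <vector>

// Flat, row-major storage for an n x n matrix: element (row, col) lives at
// index row * n + col, the same formula the kernel uses.
inline int flat_index(int row, int col, int n) { return row * n + col; }

// Copy an array-of-row-pointers matrix (as built with one malloc per row)
// into one contiguous buffer, row by row.
std::vector<double> flatten(double* const* rows, int n) {
    std::vector<double> flat(static_cast<std::size_t>(n) * n);
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            flat[flat_index(r, c, n)] = rows[r][c];
    return flat;
}
```

Allocating the host matrices flat from the start avoids the extra copy; the sketch only shows how the two layouts relate.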

Optimising Matrix Multiplication OpenCL - Purpose: learn how to manage memory

I'm new to OpenCL and trying to understand how to optimise matrix multiplication, in order to become familiar with the various paradigms. The current code is below.
I'm multiplying matrices A and B. I store a row of A in private memory to start with (because each work item uses it), and a column of B in local memory (because each work group uses it).
1) The code is currently incorrect. I'm struggling to see how to use the local work ids to get correct code, and I can't find my mistake. I'm basing myself on http://www.cs.bris.ac.uk/home/simonm/workshops/OpenCL_lecture3.pdf (slide 27), but it seems that this is wrong, as they don't make use of loc_size in their inner loop.
2) Are there any other optimisations you would suggest for this code?
__kernel void mmul(
    __global int* C,
    __global int* A,
    __global int* B,
    const int rA,
    const int rB,
    const int cC,
    __local char* local_mem)
{
    int k,ty;
    int tx = get_global_id(0);
    int loctx = get_local_id(0);
    int loc_size = get_local_size(0);
    int value = 0 ;
    int tmp_array[1000];
    for(k=0;k<rB;k++) {
        tmp_array[k] = A[tx * cA + k] ;
    }
    for (ty=0 ; ty < cC ; ty++) {
        for (k = loctx ; k < rB ; k+=loc_size) {
            local_mem[k] = B[ty + k * cC] ;
        }
        value = 0 ;
        for(k=0;k<rB;k+=1) {
            int i = loctx + k*loc_size;
            value += tmp_array[k] * local_mem[i];
        }
        C[ty + (tx * cC)] = value;
    }
}
where I set the global and local work items as follows
const size_t globalWorkItems[1] = {result_row};
const size_t localWorkItems[1] = {(size_t)local_wi_size};
local_wi_size is result_row/number of compute units (such that result_row % compute units == 0)
Your code is pretty close, but the indexing into the local memory array is actually simpler than you think. You have a row in private memory and a column in local memory, and you need to compute the dot product of these two vectors. You just need to sum row[k]*col[k], for k = 0 up to N-1 (where N is the vector length, rB here):
for(k=0;k<rB;k+=1) {
    value += tmp_array[k] * local_mem[k];
}
There's actually a second, more subtle bug that is also present in the example solution given on the slides you are using. Since you are reading and writing local memory inside a loop, you actually need two barriers, in order to make sure that work-items writing to local memory on iteration i don't overwrite values that are being read by other work-items executing iteration i-1.
Therefore, the full code for your kernel (tested and working), should look something like this:
__kernel void mmul(
    __global int* C,
    __global int* A,
    __global int* B,
    const int rA,
    const int rB,
    const int cC,
    __local char* local_mem)
{
    int k,ty;
    int tx = get_global_id(0);
    int loctx = get_local_id(0);
    int loc_size = get_local_size(0);
    int value = 0;
    int tmp_array[1000];
    for(k=0;k<rB;k++) {
        tmp_array[k] = A[tx * rB + k] ; // A has rB columns (columns of A == rows of B)
    }
    for (ty=0 ; ty < cC ; ty++) {
        for (k = loctx ; k < rB ; k+=loc_size) {
            local_mem[k] = B[ty + k * cC];
        }
        barrier(CLK_LOCAL_MEM_FENCE); // First barrier to ensure writes have finished
        value = 0;
        for(k=0;k<rB;k+=1) {
            value += tmp_array[k] * local_mem[k];
        }
        C[ty + (tx * cC)] = value;
        barrier(CLK_LOCAL_MEM_FENCE); // Second barrier to ensure reads have finished
    }
}
You can find the full set of exercises and solutions that go with the slides you are looking at on the HandsOnOpenCL GitHub page. There's also a more complete set of slides from the same tutorial available here, which go on to show a much more optimised matrix multiply example that uses a blocking approach to better exploit temporal and spatial locality. The aforementioned missing barrier bug has been fixed in the example solution code, but not on the slides (yet).
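To check the output of a kernel like the one above, a plain host-side reference can be handy. The sketch below is ordinary C++ rather than OpenCL, and `reference_mmul` is a hypothetical name; it assumes row-major int matrices with A of size rA x rB, B of size rB x cC and C of size rA x cC, and uses the same index expressions as the kernel so the two results can be compared element by element.

```cpp
#include <cstddef>
#include <vector>

// Reference C = A * B for row-major int matrices, using the same indexing as
// the kernel: A is rA x rB, B is rB x cC, C is rA x cC.
std::vector<int> reference_mmul(const std::vector<int>& A,
                                const std::vector<int>& B,
                                int rA, int rB, int cC) {
    std::vector<int> C(static_cast<std::size_t>(rA) * cC, 0);
    for (int tx = 0; tx < rA; ++tx) {         // one row of A per "work-item"
        for (int ty = 0; ty < cC; ++ty) {     // one column of B per outer iteration
            int value = 0;
            for (int k = 0; k < rB; ++k)      // dot product of row tx and column ty
                value += A[tx * rB + k] * B[ty + k * cC];
            C[ty + tx * cC] = value;
        }
    }
    return C;
}
```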

Optimizing CUDA interpolation

I have developed the following interpolation with CUDA and I am looking for a way of improving it. For some reasons, I don't want to use CUDA textures.
The other point I have noticed is that, for some unknown reason, the interpolation is not performed on the whole vector if the size of the vector is greater than the number of threads (for example, with a vector of size 1000 and 512 threads). A thread does its first job and that's all. I would like to optimize the singleInterp function.
Here is my code:
__device__ float singleInterp(float* data, float x, int lx_data) {
    float res = 0;
    int i1=0;
    int j=lx_data;
    int imid;
    // bisection: narrow [i1, j] until it brackets x
    while (j>i1+1)
    {
        imid = (int)(i1+j+1)/2;
        if (data[imid]<x)
            i1=imid;
        else
            j=imid;
    }
    if (i1==j)
        res = data[i1+lx_data];
    else
        res =__fmaf_rn( __fdividef(data[j+lx_data]-data[i1+lx_data],(data[j]-data[i1])),x-data[i1], data[i1+lx_data]);
    return res;
}
__global__ void linearInterpolation(float* data, float* x_in, int lx_data) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int index = i;
    if (index < lx_data)
        x_in[index] = singleInterp(data, x_in[index], lx_data);
}
It seems that you are interested in 1D linear interpolation. I have previously faced the problem of optimizing this kind of interpolation and ended up with the following code:
__global__ void linear_interpolation_kernel_function_GPU(double* __restrict__ result_d, const double* __restrict__ data_d, const double* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    double reg_x_out = x_out_d[j/2]+M/2;
    int k = floor(reg_x_out);
    double a = (reg_x_out)-floor(reg_x_out);
    double dk = data_d[2*k+(j&1)];
    double dkp1 = data_d[2*k+2+(j&1)];
    result_d[j] = a * dkp1 + (-dk * a + dk);
}
The data are assumed to be sampled at integer nodes between -M/2 and M/2.
The code is "equivalent" to 1D texture interpolation, as explained at the following web-page. For the 1D linear texture interpolation, see Fig. 13 of the CUDA-Programming-Guide. For comparisons between different solutions, please see the following thread.
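For reference, the arithmetic in the kernel above can be sketched on the host for the simpler case of real-valued samples. This is plain C++ with a hypothetical helper name `lerp_at`; the kernel itself appears to index interleaved pairs through 2*k + (j&1), and that reading is an assumption, so the sketch drops it. It also assumes M is even and that x lies in [-M/2, M/2) so that both neighbouring nodes exist.

```cpp
#include <cmath>
#include <vector>

// Linear interpolation of real samples taken at the integer nodes
// -M/2, ..., M/2 (so data.size() == M + 1, with M even).
// x must satisfy -M/2 <= x < M/2 so that data[k] and data[k + 1] both exist.
double lerp_at(const std::vector<double>& data, double x, int M) {
    const double shifted = x + M / 2;                  // move node -M/2 to index 0
    const int k = static_cast<int>(std::floor(shifted));
    const double a = shifted - std::floor(shifted);    // fractional part in [0, 1)
    const double dk = data[k];
    const double dkp1 = data[k + 1];
    return a * dkp1 + (-dk * a + dk);                  // equals dk + a * (dkp1 - dk)
}
```

The last line keeps the kernel's grouping of terms; algebraically it is the familiar dk + a * (dkp1 - dk).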

Concurrently initializing many arrays with random numbers using Curand and CUDA kernel

I am trying to initialize 100 elements of each of these parallel arrays with randomly generated numbers concurrently on the GPU. However, my routine is not producing a variety of random numbers. When I debug the code in Visual Studio I see one number for every element in the array. The object of this code is to optimize the CImg FilledTriangles routine to use the GPU where it can.
What am I doing wrong and how can I fix it? Here is my code:
__global__ void initCurand(curandState* state, unsigned long seed)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &state[idx]);
}
// CUDA kernel that will execute 100 threads in parallel
__global__ void initializeArrays(float* posx, float* posy,float* rayon, float* veloc, float* opacity
    ,float * angle, unsigned char** color, int height, int width, curandState* state){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curandState localState = state[idx];
    posx[idx] = (float)(curand_uniform(&localState)*width);
    posy[idx] = (float)(curand_uniform(&localState)*height);
    rayon[idx] = (float)(10 + curand_uniform(&localState)*50);
    angle[idx] = (float)(curand_uniform(&localState)*360);
    veloc[idx] = (float)(curand_uniform(&localState)*20 - 10);
    color[idx][0] = (unsigned char)(curand_uniform(&localState)*255);
    color[idx][1] = (unsigned char)(curand_uniform(&localState)*255);
    color[idx][2] = (unsigned char)(curand_uniform(&localState)*255);
    opacity[idx] = (float)(0.3 + 1.5*curand_uniform(&localState));
}
Here is the host code that prepares and calls these kernels. I am trying to create 100 threads (one for each element) in one block of the grid.
// launch grid of threads
dim3 dimBlock(100);
dim3 dimGrid(1);
initCurand<<<dimBlock,dimGrid>>>(devState, unsigned(time(nullptr)));
// synchronize the device and the host
initializeArrays<<<dimBlock, dimGrid>>>(d_posx, d_posy, d_rayon, d_veloc, d_opacity, d_angle,d_color, img0.height(), img0.width(), devState);
// Define random properties (pos, size, colors, ..) for all triangles that will be displayed.
float posx[100], posy[100], rayon[100], angle[100], veloc[100], opacity[100];
// Define the same properties but for the device
float* d_posx;
float* d_posy;
float* d_rayon;
float* d_angle;
float* d_veloc;
float* d_opacity;
//unsigned char d_color[100][3];
unsigned char** d_color;
curandState* devState;
cudaError_t err;
// allocate memory on the device for the device arrays
err = cudaMalloc((void**)&d_posx, 100 * sizeof(float));
err = cudaMalloc((void**)&d_posy, 100 * sizeof(float));
err = cudaMalloc((void**)&d_rayon, 100 * sizeof(float));
err = cudaMalloc((void**)&d_angle, 100 * sizeof(float));
err = cudaMalloc((void**)&d_veloc, 100 * sizeof(float));
err = cudaMalloc((void**)&d_opacity, 100 * sizeof(float));
err = cudaMalloc((void**)&devState, 100*sizeof(curandState));
size_t pitch;
//allocated the device memory for source array
err = cudaMallocPitch(&d_color, &pitch, 3 * sizeof(unsigned char),100);
Getting the results:
// get the populated arrays back to the host for use
err = cudaMemcpy(posx,d_posx, 100 * sizeof(float), cudaMemcpyDeviceToHost);
err = cudaMemcpy(posy,d_posy, 100 * sizeof(float), cudaMemcpyDeviceToHost);
err = cudaMemcpy(rayon,d_rayon, 100 * sizeof(float), cudaMemcpyDeviceToHost);
err = cudaMemcpy(veloc,d_veloc, 100 * sizeof(float), cudaMemcpyDeviceToHost);
err = cudaMemcpy(opacity,d_opacity, 100 * sizeof(float), cudaMemcpyDeviceToHost);
err = cudaMemcpy(angle,d_angle, 100 * sizeof(float), cudaMemcpyDeviceToHost);
err = cudaMemcpy2D(color,pitch,d_color,100, 100 *sizeof(unsigned char),3, cudaMemcpyDeviceToHost);
You will definitely need to make a change from this:
err = cudaMalloc((void**)&devState, 100*sizeof(float));
to this:
err = cudaMalloc((void**)&devState, 100*sizeof(curandState));
If you ran your code through cuda-memcheck, you would have discovered this. Your initCurand kernel had plenty of out-of-bounds accesses due to this.
You should also be doing error checking on all CUDA API calls and all kernel launches. I believe your second kernel call is failing due to a messed up operation on your color[][] array.
Normally when we create an array with cudaMallocPitch, we need to access it using the pitch parameter. C doubly-subscripted arrays by themselves won't work, because C has no inherent knowledge of the actual array width.
I was able to fix it by making the following changes:
__global__ void initializeArrays(float* posx, float* posy,float* rayon, float* veloc, float* opacity,float * angle, unsigned char* color, int height, int width, curandState* state, size_t pitch){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curandState localState = state[idx];
    posx[idx] = (float)(curand_uniform(&localState)*width);
    posy[idx] = (float)(curand_uniform(&localState)*height);
    rayon[idx] = (float)(10 + curand_uniform(&localState)*50);
    angle[idx] = (float)(curand_uniform(&localState)*360);
    veloc[idx] = (float)(curand_uniform(&localState)*20 - 10);
    color[idx*pitch] = (unsigned char)(curand_uniform(&localState)*255);
    color[(idx*pitch)+1] = (unsigned char)(curand_uniform(&localState)*255);
    color[(idx*pitch)+2] = (unsigned char)(curand_uniform(&localState)*255);
    opacity[idx] = (float)(0.3 + 1.5*curand_uniform(&localState));
}

// launch configuration is <<<grid, block>>>: one block of 100 threads
initializeArrays<<<dimGrid, dimBlock>>>(d_posx, d_posy, d_rayon, d_veloc, d_opacity, d_angle,d_color, img0.height(), img0.width(), devState, pitch);

unsigned char* d_color;
With those changes, I was able to eliminate the errors I found, and the code spat out various random values. I haven't inspected all the values, but that should get you started.
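The pitched addressing used for color above can be illustrated on the host with an ordinary byte buffer. This is a plain C++ sketch with hypothetical helper names (`pitched_at`, `write_rgb`); a real pitch value comes from cudaMallocPitch, whereas the caller here would simply pick one for demonstration. For element types wider than one byte, the row offset still has to be applied in bytes before converting to the element pointer type.

```cpp
#include <cstddef>
#include <vector>

// Address of byte (row, col) in a pitched 2D allocation: each row starts
// pitch bytes after the previous one, and pitch may be larger than the row
// width because of alignment padding.
inline unsigned char* pitched_at(unsigned char* base, std::size_t pitch,
                                 std::size_t row, std::size_t col) {
    return base + row * pitch + col;
}

// Host-side stand-in for what the kernel does per idx: write three colour
// bytes into row idx of a pitched buffer that is 3 bytes wide.
void write_rgb(unsigned char* base, std::size_t pitch, std::size_t idx,
               unsigned char r, unsigned char g, unsigned char b) {
    *pitched_at(base, pitch, idx, 0) = r;
    *pitched_at(base, pitch, idx, 1) = g;
    *pitched_at(base, pitch, idx, 2) = b;
}
```

With a made-up pitch of 8 bytes, row 2 starts at byte 16, so its three colour bytes land at offsets 16, 17 and 18 and the remaining bytes of the row are padding.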
