Is the following code correct when passing the random generator state (CUDA toolkit 3.2, curand.lib) by reference in the functions CalculateValue(curandState *localState) and GetExponential(curandState *localState)?
Thanks
__device__ double GetExponential(curandState *localState) {
    double u1 = curand_uniform_double(localState);
    return -log(u1); /* exponential variate via inverse transform */
}

__device__ double CalculateValue(curandState *localState) {
    double x = GetExponential(localState);
    return x;
}
__global__ void RunMonteCarloKernel(curandState *state, double *results) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[i];
    results[i] = CalculateValue(&localState);
    /* Copy state back to global memory */
    state[i] = localState;
}

__global__ void setup_kernel(curandState *state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets a different seed, a different sequence number, and no offset */
    curand_init(i, i, 0, &state[i]);
}
int main(void) {
    double *devResults;
    curandState *devStates;
    /* Allocate space for results and prng states on device */
    CUDA_CALL(cudaMalloc((void **)&devResults, totalThreads * sizeof(double)));
    CUDA_CALL(cudaMalloc((void **)&devStates, totalThreads * sizeof(curandState)));
    /* Setup prng states */
    setup_kernel<<<totalBlocks, threadsPerBlock>>>(devStates);
    for (int i = 0; i < 1000; i++) {
        RunMonteCarloKernel<<<totalBlocks, threadsPerBlock>>>(devStates, devResults);
    }
}
Is there a problem? It looks ok.
You may want to check out the EstimatePiInlineP sample which is in the MonteCarloCURAND directory of the 3.2 SDK. It uses C++ style pass by reference to avoid taking the address of a local variable. You would need to store the state back to memory at the end of the kernel (as you do in your code).
Passing by C++ reference can assist the compiler by clearly showing that the function can operate on the data directly in the original registers. Taking the address of a local variable or array on a GPU can be detrimental to performance if the compiler cannot be certain that all threads handle the pointer identically (i.e. perform identical operations on it), in which case it will spill the data to local memory. It'll work, but it may be slower.
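For illustration only (this is not the SDK sample itself), a minimal sketch of the question's functions rewritten with C++ pass by reference might look like this; the -log(u1) transform is an assumption about what GetExponential is meant to compute:

/* Sketch: pass-by-reference variant (assumed shape, not taken from EstimatePiInlineP) */
__device__ double GetExponential(curandState &localState) {
    double u1 = curand_uniform_double(&localState); /* the library call still takes a pointer */
    return -log(u1); /* assumed exponential transform */
}

__device__ double CalculateValue(curandState &localState) {
    return GetExponential(localState);
}

__global__ void RunMonteCarloKernel(curandState *state, double *results) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    curandState localState = state[i];       /* copy state into registers */
    results[i] = CalculateValue(localState); /* no address of a local variable is taken */
    state[i] = localState;                   /* store state back at the end of the kernel */
}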
I am very new to CUDA and I am trying to initialize an array on the device and return the result back to the host to print out, to show whether it was correctly initialized. I am doing this because the end goal is a dot product solution in which I multiply two arrays together, store the results in another array, and then sum up the entire thing so that I only need to return a single value to the host.
In the code I am working on, all I am trying to do is see whether I am initializing the array correctly. I am trying to create an array of size N following the pattern 1,2,3,4,5,6,7,8,1,2,3...
This is the code that I've written. It compiles without issue, but when I run it the terminal hangs and I have no clue why. Could someone help me out here? I'm so incredibly confused :\
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#define ARRAY_SIZE 100
#define BLOCK_SIZE 32
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
__shared__ int temp;
if(temp != 8){
a_d[x] = temp;
temp++;
} else {
a_d[x] = temp;
temp = 1;
}
}
int main (int argc, char *argv[])
{
//declare pointers for arrays
int *a_d, *b_d, *c_d, *sum_h, *sum_d,a_h[ARRAY_SIZE];
//set space for device variables
cudaMalloc((void**) &a_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &b_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &c_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &sum_d, sizeof(int));
// set execution configuration
dim3 dimblock (BLOCK_SIZE);
dim3 dimgrid (ARRAY_SIZE/BLOCK_SIZE);
// actual computation: call the kernel
cu_kernel <<<dimgrid, dimblock>>> (a_d,b_d,c_d,ARRAY_SIZE);
cudaError_t result;
// transfer results back to host
result = cudaMemcpy (a_h, a_d, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
if (result != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed.");
exit(1);
}
// print reversed array
printf ("Final state of the array:\n");
for (int i =0; i < ARRAY_SIZE; i++) {
printf ("%d ", a_h[i]);
}
printf ("\n");
}
There are at least 3 issues with your kernel code.
you are using shared memory variable temp without initializing it.
you are not resolving the order in which threads access a shared variable as discussed here.
you are imagining (perhaps) a particular order of thread execution, and CUDA provides no guarantees in that area
The first item seems self-evident; however, naive methods of initializing it in a multi-threaded environment like CUDA are not going to work. Firstly, there is the multi-threaded access pattern again. Furthermore, in a multi-block scenario, shared memory in one block is logically distinct from shared memory in another block.
Rather than wrestle with mechanisms unsuited to the task (informed by notions carried over from a serial processing environment), I would simply do something trivial like this to create the pattern you desire:
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
if (x < size) a_d[x] = (x&7) + 1;
}
Are there other ways to do it? Certainly.
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
__shared__ int temp;
if (!threadIdx.x) temp = blockIdx.x*blockDim.x;
__syncthreads();
if (x < size) a_d[x] = ((temp+threadIdx.x) & 7) + 1;
}
You can get as fancy as you like.
These changes will still leave a few values at zero at the end of the array; fixing that would require changes to your grid sizing. There are many questions about this already, or you can study a sample code like vectorAdd.
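For reference, a minimal sketch of the usual grid-sizing fix (round up so the whole array is covered; this pairs with the bounds check already present in the kernels above):

// Sketch: launch configuration covering all ARRAY_SIZE elements
dim3 dimblock (BLOCK_SIZE);
dim3 dimgrid ((ARRAY_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE); // round up instead of truncating
cu_kernel <<<dimgrid, dimblock>>> (a_d, b_d, c_d, ARRAY_SIZE);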
I inherited some CUDA code that I need to work on but some of the indexing done in it is confusing me.
A simple example would be the normalisation of data. Say we have a shared array A[2*N], which is a matrix of shape 2xN that has been unrolled into an array. Then we have the normalisation means and standard deviations: norm_means[2] and norm_stds[2]. The goal is to normalise the data in A in parallel. A minimal example would be:
__global__ void normalise(float *data, float *norm, float *std) {
int tdy = threadIdx.y;
for (int i=tdy; i<D; i+=blockDim.y)
data[i] = data[i] * norm[i] + std[i];
}
int main(int argc, char **argv) {
// generate data
int N=100;
int D=2;
MatrixXd A = MatrixXd::Random(N*D,1);
MatrixXd norm_means = MatrixXd::Random(D,1);
MatrixXd norm_stds = MatrixXd::Random(D,1);
// transfer data to device
float* A_d;
float* norm_means_d;
float* norm_stds_d;
cudaMalloc((void **)&A_d, N * D * sizeof(float));
cudaMalloc((void **)&norm_means_d, D * sizeof(float));
cudaMalloc((void **)&norm_stds_d, D * sizeof(float));
cudaMemcpy(A_d, A.data(), D * N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(norm_means_d, norm_means.data(), D * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(norm_stds_d, norm_stds.data(), D * sizeof(float), cudaMemcpyHostToDevice);
// Setup execution
const int BLOCKSIZE_X = 8;
const int BLOCKSIZE_Y = 16;
const int GRIDSIZE_X = (N-1)/BLOCKSIZE_X + 1;
dim3 dimBlock(BLOCKSIZE_X, BLOCKSIZE_Y, 1);
dim3 dimGrid(GRIDSIZE_X, 1, 1);
normalise<<<dimGrid, dimBlock, 0>>>(A_d, norm_means_d, norm_stds_d);
}
Note that I am using Eigen for the matrix generation. I have omitted the includes for brevity.
The code above, through some magic, works and achieves the desired results. However, the CUDA kernel function does not make any sense to me, because the for loop should stop after one execution, since i > D after the first iteration... but it doesn't?
If I change the kernel to something that makes more sense to me, e.g.
__global__ void normalise(float *data, float *norm, float *std) {
int tdy = threadIdx.y;
for (int i=0; i<D; i++)
data[tdy + i*blockDim.y] = data[tdy + i*blockDim.y] * norm[i] + std[i];
}
the program stops working and just outputs gibberish data.
Can somebody explain why I get this behaviour?
PS. I am very new to CUDA
It is indeed senseless to have a 2-dimensional kernel to perform an elementwise operation on an array. There is also no reason to work in blocks of size 8x16. But your modified kernel uses the second dimension (y) only; that's probably why it doesn't work. You probably needed to use the first dimension (x) only.
However - it would be reasonable to use the Y dimension for the actual second dimension, e.g. something like this:
__global__ void normalize(
    float* __restrict__ data,
    const float* __restrict__ norm,
    const float* __restrict__ std)
{
    auto pos = threadIdx.x + blockDim.x * blockIdx.x;
    auto d = threadIdx.y + blockDim.y * blockIdx.y; // or maybe just threadIdx.y;
    data[pos + d * N] = data[pos + d * N] * norm[d] + std[d]; // N assumed to be a compile-time constant
}
Other points to consider:
I added __restrict__ to your pointers. Always do that when relevant; here's why.
It's a good idea to have a single thread work on more than one element of data - but you should make that happen in the longer dimension, where the thread can reuse its norm and std values rather than read them from memory every time (see the sketch below).
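A minimal sketch of that idea: a grid-stride loop over the long N dimension with one grid row per matrix row. The kernel name, the n parameter, and the launch shape are assumptions for illustration, not part of the original code:

__global__ void normalise_strided(float* __restrict__ data,
                                  const float* __restrict__ norm,
                                  const float* __restrict__ std,
                                  int n) /* length of the long dimension (assumed parameter) */
{
    int d = blockIdx.y;  /* one grid row per matrix row; launch with gridDim.y == D */
    float m = norm[d];   /* read the per-row constants once per thread */
    float s = std[d];
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x)
        data[i + d * n] = data[i + d * n] * m + s;
}

It would be launched with something like normalise_strided<<<dim3(GRIDSIZE_X, D), BLOCKSIZE_X>>>(A_d, norm_means_d, norm_stds_d, N);.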
This is the device code I have written so far.
__global__ void syndrom(int *d_s, int *d_cx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x + 1;
    int t2 = 5460;
    int N_BCH = 16383;
    if (tid <= t2) {
        d_s[tid] = 0;
        for (int j = 0; j < N_BCH; j++) {
            if (d_cx[j] != 0) {
                d_s[tid] ^= d_alpha_to[(tid * j) % N_BCH];
            }
        }
        d_s[tid] = d_index_of[d_s[tid]];
    }
}
I call it from the host with:
dim3 grid(96);
dim3 block(256);
But the speed is not very good; I would like help improving it. Thanks.
This is not a complete and verifiable example, which you are expected to provide here on StackOverflow (for example, what is d_alpha_to?), but I can still make a few suggestions:
Use more threads instead of having each thread iterate a very large number of times. The way GPU work parallelizes is by saturating the processors with threads that are ready to perform more computation.
Don't operate on (the same place in) global memory repeatedly. Put d_s[tid] in a local variable (which will be placed in a register), work on it there, and when you're done, write it back (see the sketch after this list). Accessing global memory is obviously much, much slower than accessing registers.
Decorate your pointers with __restrict__ (and make d_cx a const pointer). Read more about __restrict__ here.
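A sketch applying the second and third suggestions to the kernel from the question (d_alpha_to and d_index_of are assumed to be __device__ lookup tables, as implied by the original code):

/* Sketch only: register accumulation plus __restrict__/const decoration */
__global__ void syndrom(int* __restrict__ d_s, const int* __restrict__ d_cx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x + 1;
    const int t2 = 5460;
    const int N_BCH = 16383;
    if (tid <= t2) {
        int s = 0;                                  /* accumulate in a register */
        for (int j = 0; j < N_BCH; j++) {
            if (d_cx[j] != 0) {
                s ^= d_alpha_to[(tid * j) % N_BCH]; /* assumed __device__ table */
            }
        }
        d_s[tid] = d_index_of[s];                   /* single write back to global memory */
    }
}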
I am in the process of implementing multithreading on an NVIDIA GeForce GT 650M GPU for a simulation I have created. In order to make sure everything works properly, I have created some side code to test that everything works. At one point I need to update a vector of variables (they can all be updated separately).
Here is the gist of it:
__device__
int doComplexMath(float x, float y)
{
    return x+y;
}
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y, vector<complex<long double> > *z)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        z[i] = doComplexMath(*x, *y);
}
int main(void)
{
    int iGAMAf = 1<<10;
    float *x, *y;
    vector<complex<long double> > VEL(iGAMAf,zero);
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, sizeof(float));
    cudaMallocManaged(&y, sizeof(float));
    cudaMallocManaged(&VEL, iGAMAf*sizeof(vector<complex<long double> >));
    // initialize x and y on the host
    *x = 1.0f;
    *y = 2.0f;
    // Run kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (iGAMAf + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(iGAMAf, x, y, *VEL);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    return 0;
}
I am trying to allocate unified memory (memory accessible from the GPU and CPU). When compiling using nvcc, I get the following error:
error: no instance of overloaded function "cudaMallocManaged" matches the argument list
argument types are: (std::__1::vector<std::__1::complex<long double>, std::__1::allocator<std::__1::complex<long double>>> *, unsigned long)
How can I overload the function properly in CUDA to use this type with multithreading?
It isn't possible to do what you are trying to do.
To allocate a vector using managed memory you would have to write your own allocator which satisfies the standard allocator requirements and calls cudaMallocManaged under the hood. You can then instantiate a std::vector using your allocator class.
Also note that your CUDA kernel code is broken in that you can't use std::vector in device code.
Note that although the question has managed memory in view, this is applicable to other types of CUDA allocation such as pinned allocation.
As another alternative, suggested here, you could consider using a thrust host vector in lieu of std::vector and use a custom allocator with it. A worked example is here in the case of pinned allocator (cudaMallocHost/cudaHostAlloc).
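For illustration, a minimal sketch of such an allocator (the name managed_allocator is made up for this example, and error handling is omitted):

#include <vector>
#include <cuda_runtime.h>

template <class T>
struct managed_allocator {
    using value_type = T;
    managed_allocator() = default;
    template <class U> managed_allocator(const managed_allocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        cudaMallocManaged(&p, n * sizeof(T)); // back the vector's storage with managed memory
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { cudaFree(p); }
};

template <class T, class U>
bool operator==(const managed_allocator<T>&, const managed_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const managed_allocator<T>&, const managed_allocator<U>&) { return false; }

// usage: std::vector<float, managed_allocator<float>> v(1024);

Even with such an allocator, the vector object itself still cannot be used in device code; only the raw pointer from v.data() can be passed to a kernel.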
How do I initialize device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize the values to anything except 0. The code for cudaMemset looks like below, where value is initialized to 5.
cudaMemset(devPtr,value,number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset ( void * devPtr,
int value,
size_t count
)
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking for is that each byte of devPtr will be set to 5. If devPtr were an array of integers, the result would be that each integer word would have the value 84215045 (i.e. 0x05050505). This is probably not what you had in mind.
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
int tidx = threadIdx.x + blockDim.x * blockIdx.x;
int stride = blockDim.x * gridDim.x;
for(; tidx < nwords; tidx += stride)
devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
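For example, a usage sketch along those lines (the grid and block sizes here are arbitrary choices):

int *devPtr;
const size_t nwords = 1 << 20;                      // number of ints, not bytes
cudaMalloc((void **)&devPtr, nwords * sizeof(int));
initKernel<int><<<256, 256>>>(devPtr, 5, nwords);   // every int becomes 5, unlike cudaMemset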
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing, but for half and full 32-bit word types. If you need to set 64-bit or larger types (so doubles or vector types), your best option is to use your own kernel.
I also needed a solution to this question and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid (for(; tidx < nwords; tidx += stride)), nor, for that matter, the kernel invocation and the counter-intuitive word sizes.
Therefore I created a much simpler monolithic generic kernel and customized it with strides, i.e. you may use it to initialize a matrix in multiple ways, e.g. setting rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
const size_t n, const size_t incx) {
int tid = threadIdx.x + blockDim.x * blockIdx.x;
if (tid*incx < n) {
a[tid*incx] = value;
}
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
dim3 gridDim(number_of_blocks, 1);
dim3 blockDim(BLOCK_SIZE, 1);
kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}