CUDA's Mersenne Twister for an arbitrary number of threads

CUDA's implementation of the Mersenne Twister (MT) random number generator is limited to 256 threads per block and 200 blocks per grid, i.e., the maximal number of threads is 51200.
Therefore, it is not possible to launch the kernel that uses the MT with
kernel<<<blocksPerGrid, threadsPerBlock>>>(devMTGPStates, ...)
where
int blocksPerGrid = (n+threadsPerBlock-1)/threadsPerBlock;
and n is the total number of threads.
What is the best way to use the MT for threads > 51200?
My approach is to use constant values for blocksPerGrid and threadsPerBlock, e.g., <<<128,128>>>, and to use the following in the kernel code:
__global__ void kernel(curandStateMtgp32 *state, int n, ...) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    while (id < n) {
        float x = curand_normal(&state[blockIdx.x]);
        /* some more calls to curand_normal() followed
           by the algorithm that works with the data */
        id += blockDim.x * gridDim.x;
    }
}
I am not sure whether this is the correct way to do it, or whether it can influence the MT state in an undesired way.
Thank you.

I suggest you read the CURAND documentation carefully and thoroughly.
The MT API will be most efficient when using 256 threads per block with up to 64 blocks to generate numbers.
If you need more than that, you have a variety of options:
1. Simply generate more numbers from the existing state-set (i.e. 64 blocks, 256 threads), and distribute these numbers amongst the threads that need them.
2. Use more than a single state per block (but this does not allow you to exceed the overall limit within a state-set; it just addresses the need for a single block).
3. Create multiple MT generators with independent seeds (and therefore independent state-sets).

Generally, I don't see a problem with the kernel that you've outlined, and it's roughly in line with choice 1 above. However, it does not allow you to exceed 51200 threads. (Your example has <<<128, 128>>>, so 16384 threads.)
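For the third option, here is a minimal sketch of what creating two independent state-sets could look like (my addition, not part of the original answer; it reuses the GRIDSIZE constant and error-check macros from the worked example below, and the two seeds 1234 and 5678 are arbitrary):

// --- Two MTGP32 generators with independent seeds, back to back in one array
curandStateMtgp32 *devStates;
gpuErrchk(cudaMalloc(&devStates, 2 * GRIDSIZE * sizeof(curandStateMtgp32)));
mtgp32_kernel_params *devKernelParams;
gpuErrchk(cudaMalloc(&devKernelParams, sizeof(mtgp32_kernel_params)));
CURAND_CALL(curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams));
// First generator: state-sets 0 .. GRIDSIZE-1, seeded with 1234
CURAND_CALL(curandMakeMTGP32KernelState(devStates,            mtgp32dc_params_fast_11213, devKernelParams, GRIDSIZE, 1234));
// Second generator: state-sets GRIDSIZE .. 2*GRIDSIZE-1, independent seed 5678
CURAND_CALL(curandMakeMTGP32KernelState(devStates + GRIDSIZE, mtgp32dc_params_fast_11213, devKernelParams, GRIDSIZE, 5678));

In a kernel, block b of the second generator would then draw from &devStates[GRIDSIZE + b].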

Following Robert's answer, below I provide a fully worked example of using cuRAND's Mersenne Twister for an arbitrary number of threads. I use Robert's first option: generating more numbers from the existing state-set and distributing them amongst the threads that need them.
// --- Generate random numbers with cuRAND's Mersenne Twister
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <curand_kernel.h>
/* include MTGP host helper functions */
#include <curand_mtgp32_host.h>

#define BLOCKSIZE 256
#define GRIDSIZE  64

/*******************/
/* GPU ERROR CHECK */
/*******************/
#define gpuErrchk(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

#define CURAND_CALL(x) do { if((x) != CURAND_STATUS_SUCCESS) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

/*******************/
/* iDivUp FUNCTION */
/*******************/
__host__ __device__ int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }

/*********************/
/* GENERATION KERNEL */
/*********************/
__global__ void generate_kernel(curandStateMtgp32 * __restrict__ state, float * __restrict__ result, const int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int k = tid; k < N; k += blockDim.x * gridDim.x)
        result[k] = curand_uniform(&state[blockIdx.x]);
}

/********/
/* MAIN */
/********/
int main()
{
    const int N = 217 * 123;

    // --- Allocate space for results on host
    float *hostResults = (float *)malloc(N * sizeof(float));

    // --- Allocate and initialize space for results on device
    float *devResults; gpuErrchk(cudaMalloc(&devResults, N * sizeof(float)));
    gpuErrchk(cudaMemset(devResults, 0, N * sizeof(float)));

    // --- Setup the pseudorandom number generator
    curandStateMtgp32 *devMTGPStates; gpuErrchk(cudaMalloc(&devMTGPStates, GRIDSIZE * sizeof(curandStateMtgp32)));
    mtgp32_kernel_params *devKernelParams; gpuErrchk(cudaMalloc(&devKernelParams, sizeof(mtgp32_kernel_params)));
    CURAND_CALL(curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams));
    //CURAND_CALL(curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213, devKernelParams, GRIDSIZE, 1234));
    CURAND_CALL(curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213, devKernelParams, GRIDSIZE, time(NULL)));

    // --- Generate pseudo-random sequence and copy to the host
    generate_kernel<<<GRIDSIZE, BLOCKSIZE>>>(devMTGPStates, devResults, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    gpuErrchk(cudaMemcpy(hostResults, devResults, N * sizeof(float), cudaMemcpyDeviceToHost));

    // --- Print results
    //for (int i = 0; i < N; i++) {
    for (int i = 0; i < 10; i++) {
        printf("%f\n", hostResults[i]);
    }

    // --- Cleanup
    gpuErrchk(cudaFree(devMTGPStates));
    gpuErrchk(cudaFree(devResults));
    free(hostResults);

    return 0;
}

Related

How to divide n numbers among N processors using OpenMP

I was given a task to perform a CREW sort in parallel programming. As the first step, I have an array of size n and N processors; I need to divide the elements among the N processors, sort each part sequentially, and merge them back. How can I do this in OpenMP? I am new to OpenMP, so any resources for solving this problem would be helpful.
This is what I wrote off the top of my head. It might not be optimal and is not tested, but it should give you a direction for how to handle such a problem.
#include <stddef.h>
#include <stdlib.h>
#include <omp.h>

ptrdiff_t min(ptrdiff_t a, ptrdiff_t b) {
    return ((a > b) ? b : a);
}

void inplace_sequential_sort(double *data, ptrdiff_t n) { /* ... */ }
void inplace_merge(double *data, ptrdiff_t n1, ptrdiff_t n2) { /* ... */ }

void inplace_parallel_sort(double *data, ptrdiff_t n) {
    if (n < 2)
        return;

    /* allocate memory for helper array */
    int const max_threads = omp_get_max_threads();
    ptrdiff_t *my_n = calloc(max_threads, sizeof(*my_n));
    if (!my_n) { /* ... */ }

    #pragma omp parallel default(none) \
            shared(n, data, my_n)
    {
        /* get thread ID and actual number of threads */
        int const tid = omp_get_thread_num();
        int const N = omp_get_num_threads();

        /* distribute data among threads */
        ptrdiff_t const max_elem_per_thread = ((n + N - 1) / N);
        ptrdiff_t const my_begin = min(tid * max_elem_per_thread, n);
        my_n[tid] = min(n - my_begin, max_elem_per_thread);
        if (my_n[tid] > 1)
            inplace_sequential_sort(data + my_begin, my_n[tid]);

        /* merge sorted data sections (parallel reduction algorithm) */
        for (ptrdiff_t stride = 1; stride < N; stride *= 2) {
            #pragma omp barrier
            if (tid % (2 * stride) == 0 && my_begin + my_n[tid] != n) {
                inplace_merge(data + my_begin, my_n[tid], my_n[tid + stride]);
                my_n[tid] += my_n[tid + stride];
            }
        }
    } /* end of parallel region */

    free(my_n);
}
I assumed that you want a C solution (not C++ or Fortran) and that you want to sort the data in place. This is a very basic solution; OpenMP can do much more (e.g. tasking). The functions inplace_sequential_sort() and inplace_merge() have to be provided; a possible sketch is given below.
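For completeness, here is one minimal way the two helpers could be written (my own sketch, not part of the original answer; it assumes double data, uses qsort for the sequential sort, and allocates a temporary buffer for the merge, with error handling omitted):

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

void inplace_sequential_sort(double *data, ptrdiff_t n) {
    qsort(data, (size_t)n, sizeof(*data), cmp_double);
}

/* merge the two adjacent sorted runs data[0..n1) and data[n1..n1+n2) */
void inplace_merge(double *data, ptrdiff_t n1, ptrdiff_t n2) {
    double *tmp = malloc((size_t)(n1 + n2) * sizeof(*tmp));
    ptrdiff_t i = 0, j = n1, k = 0;
    while (i < n1 && j < n1 + n2)
        tmp[k++] = (data[i] <= data[j]) ? data[i++] : data[j++];
    while (i < n1)      tmp[k++] = data[i++];
    while (j < n1 + n2) tmp[k++] = data[j++];
    memcpy(data, tmp, (size_t)(n1 + n2) * sizeof(*data));
    free(tmp);
}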

curand_uniform not deterministic?

I want to generate pseudo-random numbers on a CUDA device in a deterministic way: if I run the program twice, I expect exactly the same results, given that the program uses a hardcoded seed. Following the examples provided by NVIDIA (https://docs.nvidia.com/cuda/curand/device-api-overview.html#device-api-example), I would expect exactly the described behavior.
However, I get different results when running the exact same code multiple times. Is there a way to get pseudo-random numbers in a deterministic way, as I described?
The following example code shows my problem:
#include <iostream>
#include <cuda.h>
#include <curand_kernel.h>

__global__ void setup_kernel(curandState *state)
{
    auto id = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(123456, id, 0, &state[id]);
}

__global__ void draw_numbers(curandState *state, float* results)
{
    auto id = threadIdx.x + blockIdx.x * blockDim.x;

    // Copy state
    curandState localState = state[id % 1024];

    // Generate random number
    results[id] = curand_uniform(&localState);

    // Copy back state
    state[id % 1024] = localState;
}

int main(int argc, char* argv[])
{
    // Setup
    curandState* dStates;
    cudaMalloc((void **) &dStates, sizeof(curandState) * 1024);
    setup_kernel<<<1024, 1>>>(dStates);

    // Random numbers
    float* devResults;
    cudaMalloc((void **) &devResults, sizeof(float) * 16 * 1024);
    float *hostResults = (float*) calloc(16 * 1024, sizeof(float));

    // Call draw random numbers
    draw_numbers<<<1024, 16>>>(dStates, devResults);

    // Copy results
    cudaMemcpy(hostResults, devResults, 16 * 1024 * sizeof(float), cudaMemcpyDeviceToHost);

    // Output number 12345
    ::std::cout << "12345 is: " << hostResults[12345] << ::std::endl;

    return 0;
}
Compiling and running the code produces different output on my machine:
$ nvcc -std=c++11 curand.cu && ./a.out && ./a.out && ./a.out
12345 is: 0.8059
12345 is: 0.53454
12345 is: 0.382981
As I said, I would expect the same output three times in this example.
curand_uniform does deterministically depend on the state it is provided.
Thanks to the comments by Robert Crovella, I now see that the error was in relying on the thread execution order. Simply not reusing the state would result in the same "random" numbers whenever the draw_numbers kernel is called multiple times, which is not an option for me either.
My guess is that the best solution in my case is to launch only 1024 threads (as many curandState as are set up) and to generate multiple random numbers in each thread (in my example, 16 per thread). This way I get different random numbers on multiple calls within the program, but the same numbers for every program launch.
Updated code:
#include <iostream>
#include <cuda.h>
#include <curand_kernel.h>

__global__ void setup_kernel(curandState *state)
{
    auto id = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(123456, id, 0, &state[id]);
}

__global__ void draw_numbers(curandState *state, float* results, int runs)
{
    auto id = threadIdx.x + blockIdx.x * blockDim.x;

    // Copy state
    curandState localState = state[id];

    // Generate random numbers
    for (int i = 0; i < runs; ++i)
    {
        results[id + i * 1024] = curand_uniform(&localState);
    }

    // Copy back state
    state[id] = localState;
}

int main(int argc, char* argv[])
{
    // Setup
    curandState* dStates;
    cudaMalloc((void **) &dStates, sizeof(curandState) * 1024);
    setup_kernel<<<1024, 1>>>(dStates);

    // Random numbers
    float* devResults;
    cudaMalloc((void **) &devResults, sizeof(float) * 16 * 1024);
    float *hostResults = (float*) calloc(16 * 1024, sizeof(float));

    // Call draw random numbers
    draw_numbers<<<16, 64>>>(dStates, devResults, 16);

    // Copy results
    cudaMemcpy(hostResults, devResults, 16 * 1024 * sizeof(float), cudaMemcpyDeviceToHost);

    // Output number 12345
    ::std::cout << "12345 is " << hostResults[12345];

    // Call draw random numbers (again)
    draw_numbers<<<16, 64>>>(dStates, devResults, 16);

    // Copy results
    cudaMemcpy(hostResults, devResults, 16 * 1024 * sizeof(float), cudaMemcpyDeviceToHost);

    // Output number 12345 again
    ::std::cout << " and " << hostResults[12345] << ::std::endl;

    return 0;
}
This produces the following output:
$ nvcc -std=c++11 curand.cu && ./a.out && ./a.out && ./a.out
12345 is 0.164181 and 0.295907
12345 is 0.164181 and 0.295907
12345 is 0.164181 and 0.295907
which serves exactly my use case.

How can I increase the limit of generated prime numbers with Sieve of Eratosthenes?

What do I need to change in my program to be able to compute primes up to a higher limit?
Currently my algorithm works only with numbers up to 85 million. In my opinion, it should work with numbers up to 3 billion.
I'm writing my own implementation of the Sieve of Eratosthenes in CUDA, and I've hit a wall.
So far the algorithm seems to work fine for small numbers (below 85 million).
However, when I try to compute prime numbers up to 100 million, 2 billion, or 3 billion, the system freezes (while it's computing on the CUDA device); then, after a few seconds, my Linux machine goes back to normal (unfrozen), but the CUDA program crashes with the following error message:
CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"
I have a GTX 780 (3 GB) and I'm allocating the sieve in a char array, so if I were to compute prime numbers up to 100,000, it would allocate 100,000 bytes on the device.
I assumed that the GPU would allow up to 3 billion numbers, since it has 3 GB of memory; however, it only lets me do 85 million tops (85 million bytes = 0.08 GB).
This is my prime.cu code:
#include <stdio.h>
#include <helper_cuda.h> // checkCudaErrors() - NVIDIA_CUDA-6.0_Samples/common/inc
// #include <cuda.h>
// #include <cuda_runtime_api.h>
// #include <cuda_runtime.h>

typedef unsigned long long int uint64_t;

/******************************************************************************
 * kernel that initializes the 1st couple of values in the primes array.
 ******************************************************************************/
__global__ static void sieveInitCUDA(char* primes)
{
    primes[0] = 1; // value of 1 means the number is NOT prime
    primes[1] = 1; // numbers "0" and "1" are not prime numbers
}

/******************************************************************************
 * kernel for sieving the even numbers starting at 4.
 ******************************************************************************/
__global__ static void sieveEvenNumbersCUDA(char* primes, uint64_t max)
{
    uint64_t index = blockIdx.x * blockDim.x + threadIdx.x + threadIdx.x + 4;
    if (index < max)
        primes[index] = 1;
}

/******************************************************************************
 * kernel for finding prime numbers using the sieve of eratosthenes
 * - primes: an array of bools. initially all numbers are set to "0".
 *           A "0" value means that the number at that index is prime.
 * - max: the max size of the primes array
 * - maxRoot: the sqrt of max (the other input). we don't wanna make all threads
 *   compute this over and over again, so it's being passed in
 ******************************************************************************/
__global__ static void sieveOfEratosthenesCUDA(char *primes, uint64_t max,
                                               const uint64_t maxRoot)
{
    // get the starting index, sieve only odds starting at 3
    // 3,5,7,9,11,13...
    /* int index = blockIdx.x * blockDim.x + threadIdx.x + threadIdx.x + 3; */

    // apparently the following indexing usage is faster than the one above. Hmm
    int index = blockIdx.x * blockDim.x + threadIdx.x + 3;

    // make sure index won't go out of bounds, also don't start the execution
    // on numbers that are already composite
    if (index < maxRoot && primes[index] == 0)
    {
        // mark off the composite numbers
        for (int j = index * index; j < max; j += index)
        {
            primes[j] = 1;
        }
    }
}

/******************************************************************************
 * checkDevice()
 ******************************************************************************/
__host__ int checkDevice()
{
    // query the Device and decide on the block size
    int devID = 0; // the default device ID
    cudaError_t error;
    cudaDeviceProp deviceProp;
    error = cudaGetDevice(&devID);
    if (error != cudaSuccess)
    {
        printf("CUDA Device not ready or not supported\n");
        printf("%s: cudaGetDevice returned error code %d, line(%d)\n", __FILE__, error, __LINE__);
        exit(EXIT_FAILURE);
    }
    error = cudaGetDeviceProperties(&deviceProp, devID);
    if (deviceProp.computeMode == cudaComputeModeProhibited || error != cudaSuccess)
    {
        printf("CUDA device ComputeMode is prohibited or failed to getDeviceProperties\n");
        return EXIT_FAILURE;
    }

    // Use a larger block size for Fermi and above (see compute capability)
    return (deviceProp.major < 2) ? 16 : 32;
}

/******************************************************************************
 * genPrimesOnDevice
 * - inputs: limit - the largest prime that should be computed
 *           primes - an array of size [limit], initialized to 0
 ******************************************************************************/
__host__ void genPrimesOnDevice(char* primes, uint64_t max)
{
    int blockSize = checkDevice();
    if (blockSize == EXIT_FAILURE)
        return;

    char* d_Primes = NULL;
    int sizePrimes = sizeof(char) * max;
    uint64_t maxRoot = sqrt(max);

    // allocate the primes on the device and set them to 0
    checkCudaErrors(cudaMalloc(&d_Primes, sizePrimes));
    checkCudaErrors(cudaMemset(d_Primes, 0, sizePrimes));

    // make sure that there are no errors...
    checkCudaErrors(cudaPeekAtLastError());

    // setup the execution configuration
    dim3 dimBlock(blockSize);
    dim3 dimGrid((maxRoot + dimBlock.x) / dimBlock.x);
    dim3 dimGridEvens(((max + dimBlock.x) / dimBlock.x) / 2);

    //////// debug
    #ifdef DEBUG
    printf("dimBlock(%d, %d, %d)\n", dimBlock.x, dimBlock.y, dimBlock.z);
    printf("dimGrid(%d, %d, %d)\n", dimGrid.x, dimGrid.y, dimGrid.z);
    printf("dimGridEvens(%d, %d, %d)\n", dimGridEvens.x, dimGridEvens.y, dimGridEvens.z);
    #endif

    // call the kernel
    // NOTE: no need to synchronize after each kernel
    // http://stackoverflow.com/a/11889641/2261947
    sieveInitCUDA<<<1, 1>>>(d_Primes); // launch a single thread to initialize
    sieveEvenNumbersCUDA<<<dimGridEvens, dimBlock>>>(d_Primes, max);
    sieveOfEratosthenesCUDA<<<dimGrid, dimBlock>>>(d_Primes, max, maxRoot);

    // check for kernel errors
    checkCudaErrors(cudaPeekAtLastError());
    checkCudaErrors(cudaDeviceSynchronize());

    // copy the results back
    checkCudaErrors(cudaMemcpy(primes, d_Primes, sizePrimes, cudaMemcpyDeviceToHost));

    // no memory leaks
    checkCudaErrors(cudaFree(d_Primes));
}
To test this code:
int main()
{
    int max = 85000000; // 85 million
    char* primes = (char*)malloc(max);
    // check that it allocated correctly...
    memset(primes, 0, max);

    genPrimesOnDevice(primes, max);

    // if you wish to display results:
    for (uint64_t i = 0; i < max; i++)
    {
        if (primes[i] == 0) // if the value is '0', then the number is prime
        {
            std::cout << i; // use printf if you are using c
            if ((i + 1) != max)
                std::cout << ", ";
        }
    }

    free(primes);
}
This error:
CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"
doesn't necessarily mean anything other than that your kernel is taking too long. It's not necessarily a numerical limit or a computational error, but a system-imposed limit on the amount of time your kernel is allowed to run. Both Linux and Windows can have such watchdog timers.
If you want to work around it in the Linux case, review this document.
You don't mention it, but I assume your GTX 780 is also hosting a (the) display. In that case, there is a time limit on kernels by default. If you can use another device for the display, then reconfigure your machine so that X does not use the GTX 780, as described in the link. If you do not have another GPU to use for the display, then the only option is to modify the interactivity setting indicated in the linked document, if you want to run long-running kernels. In this situation, the keyboard/mouse/display will become non-responsive while the kernel is running. If your kernel should happen to run too long, it can be difficult to recover the machine, and a hard reboot may be required. (You could also SSH into the machine and kill the process that is using the GPU for CUDA.)
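As a quick check (my addition, not part of the original answer), you can ask the CUDA runtime whether it sees such a run-time limit on your device; the kernelExecTimeoutEnabled field of cudaDeviceProp reports exactly this:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // device 0, i.e. the GTX 780 here
    printf("Kernel run-time limit (watchdog) enabled: %s\n",
           prop.kernelExecTimeoutEnabled ? "yes" : "no");
    return 0;
}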

Sorting many small arrays in CUDA

I am implementing a median filter in CUDA. For a particular pixel, I extract its neighbors corresponding to a window around the pixel, say an N x N (3 x 3) window, and now have an array of N x N elements. I do not envision using a window of more than 10 x 10 elements for my application.
This array is now locally present in the kernel and already loaded into device memory. From previous SO posts that I have read, the most common sorting algorithms are implemented in Thrust. But Thrust can only be called from the host. Thread: Thrust inside user written kernels
Is there a quick and efficient way to sort a small array of N x N elements inside the kernel?
If the number of elements is fixed and small, you can use sorting networks (http://pages.ripco.net/~jgamble/nw.html). They provide a fixed number of compare/swap operations for a fixed number of elements (e.g. 19 compare/swap operations for 8 elements).
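To illustrate the idea (my sketch, not part of the original answer), here is what such a network looks like in device code for a fixed size of 4 elements; the generator linked above produces the compare/swap sequence for other sizes:

// One compare/swap stage of a sorting network
__device__ inline void cswap(float &a, float &b)
{
    if (a > b) { float t = a; a = b; b = t; }
}

// Sorting network for exactly 4 elements: 5 fixed compare/swap operations,
// no data-dependent control flow, well suited to per-thread arrays
__device__ void sort4(float v[4])
{
    cswap(v[0], v[1]); cswap(v[2], v[3]);
    cswap(v[0], v[2]); cswap(v[1], v[3]);
    cswap(v[1], v[2]);
}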
Your problem is sorting many small arrays in CUDA.
Following Robert's suggestion in his comment, CUB offers a possible solution to this problem. Below I report an example that was constructed around Robert's code at cub BlockRadixSort: how to deal with large tile size or sort multiple tiles?.
The idea is to assign the small arrays to be sorted to different thread blocks and then use cub::BlockRadixSort to sort each array. Two versions are provided: one sorting the small arrays directly from global memory and one first loading them into shared memory.
Let me finally note that your statement that CUDA Thrust is not callable from within kernels is no longer true. The post Thrust inside user written kernels you linked to has been updated with other answers.
#include <cub/cub.cuh>
#include <stdio.h>
#include <stdlib.h>
#include "Utilities.cuh"

using namespace cub;

/**********************************/
/* CUB BLOCKSORT KERNEL NO SHARED */
/**********************************/
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
    // --- Specialize BlockLoad, BlockStore, and BlockRadixSort collective types
    typedef cub::BlockLoad      <int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_TRANSPOSE>  BlockLoadT;
    typedef cub::BlockStore     <int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_STORE_TRANSPOSE> BlockStoreT;
    typedef cub::BlockRadixSort <int , BLOCK_THREADS, ITEMS_PER_THREAD>                        BlockRadixSortT;

    // --- Allocate type-safe, repurposable shared memory for collectives
    __shared__ union {
        typename BlockLoadT     ::TempStorage load;
        typename BlockStoreT    ::TempStorage store;
        typename BlockRadixSortT::TempStorage sort;
    } temp_storage;

    // --- Obtain this block's segment of consecutive keys (blocked across threads)
    int thread_keys[ITEMS_PER_THREAD];
    int block_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, thread_keys);
    __syncthreads();

    // --- Collectively sort the keys
    BlockRadixSortT(temp_storage.sort).Sort(thread_keys);
    __syncthreads();

    // --- Store the sorted segment
    BlockStoreT(temp_storage.store).Store(d_out + block_offset, thread_keys);
}

/*******************************/
/* CUB BLOCKSORT KERNEL SHARED */
/*******************************/
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void shared_BlockSortKernel(int *d_in, int *d_out)
{
    // --- Shared memory allocation
    __shared__ int sharedMemoryArray[BLOCK_THREADS * ITEMS_PER_THREAD];

    // --- Specialize the BlockRadixSort collective type
    typedef cub::BlockRadixSort <int , BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;

    // --- Allocate type-safe, repurposable shared memory for collectives
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;

    int block_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);

    // --- Load data to shared memory
    for (int k = 0; k < ITEMS_PER_THREAD; k++)
        sharedMemoryArray[threadIdx.x * ITEMS_PER_THREAD + k] = d_in[block_offset + threadIdx.x * ITEMS_PER_THREAD + k];
    __syncthreads();

    // --- Collectively sort the keys
    BlockRadixSortT(temp_storage).Sort(*static_cast<int(*)[ITEMS_PER_THREAD]>(static_cast<void*>(sharedMemoryArray + (threadIdx.x * ITEMS_PER_THREAD))));
    __syncthreads();

    // --- Write data back to global memory
    for (int k = 0; k < ITEMS_PER_THREAD; k++)
        d_out[block_offset + threadIdx.x * ITEMS_PER_THREAD + k] = sharedMemoryArray[threadIdx.x * ITEMS_PER_THREAD + k];
}

/********/
/* MAIN */
/********/
int main() {

    const int numElemsPerArray  = 8;
    const int numArrays         = 4;
    const int N                 = numArrays * numElemsPerArray;
    const int numElemsPerThread = 4;
    const int RANGE             = N * numElemsPerThread;

    // --- Allocating and initializing the data on the host
    int *h_data = (int *)malloc(N * sizeof(int));
    for (int i = 0 ; i < N; i++) h_data[i] = rand() % RANGE;

    // --- Allocating the results on the host
    int *h_result1 = (int *)malloc(N * sizeof(int));
    int *h_result2 = (int *)malloc(N * sizeof(int));

    // --- Allocating space for data and results on device
    int *d_in;   gpuErrchk(cudaMalloc((void **)&d_in,   N * sizeof(int)));
    int *d_out1; gpuErrchk(cudaMalloc((void **)&d_out1, N * sizeof(int)));
    int *d_out2; gpuErrchk(cudaMalloc((void **)&d_out2, N * sizeof(int)));

    // --- BlockSortKernel no shared
    gpuErrchk(cudaMemcpy(d_in, h_data, N*sizeof(int), cudaMemcpyHostToDevice));
    BlockSortKernel<N / numArrays / numElemsPerThread, numElemsPerThread><<<numArrays, numElemsPerArray / numElemsPerThread>>>(d_in, d_out1);
    gpuErrchk(cudaMemcpy(h_result1, d_out1, N*sizeof(int), cudaMemcpyDeviceToHost));

    printf("BlockSortKernel no shared\n\n");
    for (int k = 0; k < numArrays; k++)
        for (int i = 0; i < numElemsPerArray; i++)
            printf("Array nr. %i; Element nr. %i; Value %i\n", k, i, h_result1[k * numElemsPerArray + i]);

    // --- BlockSortKernel with shared
    gpuErrchk(cudaMemcpy(d_in, h_data, N*sizeof(int), cudaMemcpyHostToDevice));
    shared_BlockSortKernel<N / numArrays / numElemsPerThread, numElemsPerThread><<<numArrays, numElemsPerArray / numElemsPerThread>>>(d_in, d_out2);
    gpuErrchk(cudaMemcpy(h_result2, d_out2, N*sizeof(int), cudaMemcpyDeviceToHost));

    printf("\n\nBlockSortKernel with shared\n\n");
    for (int k = 0; k < numArrays; k++)
        for (int i = 0; i < numElemsPerArray; i++)
            printf("Array nr. %i; Element nr. %i; Value %i\n", k, i, h_result2[k * numElemsPerArray + i]);

    return 0;
}
If you are using CUDA 5.x, you can use dynamic parallelism: you can launch a child kernel from within your filter kernel to do the sorting. As for how to sort in CUDA, you can use some of the techniques above.
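As a rough illustration of that suggestion (my sketch, with hypothetical names; it assumes compute capability 3.5+ and compilation with -rdc=true, and a trivial insertion sort standing in for the real sorting work):

// Child kernel: single-thread insertion sort of one small window
__global__ void child_sort(float *window, int n)
{
    for (int i = 1; i < n; i++) {
        float key = window[i];
        int j = i - 1;
        while (j >= 0 && window[j] > key) { window[j + 1] = window[j]; j--; }
        window[j + 1] = key;
    }
}

// Parent kernel: each thread launches a child grid for its own window
__global__ void filter_kernel(float *windows, int numWindows, int windowSize)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w < numWindows)
        child_sort<<<1, 1>>>(windows + w * windowSize, windowSize);
}

Note that one child launch per thread can quickly exhaust the device's pending-launch buffer, so for a real median filter a per-thread sorting network (as in the first answer) is usually the cheaper choice.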

Make CURAND generate different random numbers from a uniform distribution

I am trying to use the CURAND library to generate random numbers from 0 to 100 which are completely independent of each other. Hence, I am giving the time as a seed to each thread and specifying id = threadIdx.x + blockDim.x * blockIdx.x as both the sequence number and the offset.
Then, after getting the random number as a float, I multiply it by 100 and take its integer value.
Now, the problem I am facing is that threads [0,0] and [0,1] get the same random number (11), no matter how many times I run the code. I am unable to understand what I am doing wrong. Please help.
I am pasting my code below:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <curand_kernel.h>
#include "util/cuPrintf.cu"
#include <time.h>

#define NE WA*HA     //Total number of random numbers
#define WA 2         // Matrix A width
#define HA 2         // Matrix A height
#define SAMPLE 100   //Sample number
#define BLOCK_SIZE 2 //Block size

__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
    int id = threadIdx.x + blockIdx.x + blockDim.x;
    curand_init ( seed, id, id, &state[id] );
}

__global__ void generate( curandState* globalState, float* randomMatrix )
{
    int ind = threadIdx.x + blockIdx.x * blockDim.x;
    if(ind < NE){
        curandState localState = globalState[ind];
        float stopId = curand_uniform(&localState) * SAMPLE;
        cuPrintf("Float random value is : %f",stopId);
        int stop = stopId;
        cuPrintf("Random number %d\n",stop);
        for(int i = 0; i < SAMPLE; i++){
            if(i == stop){
                float random = curand_normal( &localState );
                cuPrintf("Random Value %f\t",random);
                randomMatrix[ind] = random;
                break;
            }
        }
        globalState[ind] = localState;
    }
}

/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
    // 1. allocate host memory for matrix A
    unsigned int size_A = WA * HA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    float* h_A = (float* ) malloc(mem_size_A);
    time_t t;

    // 2. allocate device memory
    float* d_A;
    cudaMalloc((void**) &d_A, mem_size_A);

    // 3. create random states
    curandState* devStates;
    cudaMalloc ( &devStates, size_A*sizeof( curandState ) );

    // 4. setup seeds
    int n_blocks = size_A/BLOCK_SIZE;
    time(&t);
    printf("\nTime is : %u\n",(unsigned long) t);
    setup_kernel <<< n_blocks, BLOCK_SIZE >>> ( devStates, (unsigned long) t );

    // 4. generate random numbers
    cudaPrintfInit();
    generate <<< n_blocks, BLOCK_SIZE >>> ( devStates, d_A );
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();

    // 5. copy result from device to host
    cudaMemcpy(h_A, d_A, mem_size_A, cudaMemcpyDeviceToHost);

    // 6. print out the results
    printf("\n\nMatrix A (Results)\n");
    for(int i = 0; i < size_A; i++)
    {
        printf("%f ", h_A[i]);
        if(((i + 1) % WA) == 0)
            printf("\n");
    }
    printf("\n");

    // 7. clean up memory
    free(h_A);
    cudaFree(d_A);
}
The output that I get is:
Time is : 1347857063
[0, 0]: Float random value is : 11.675105[0, 0]: Random number 11
[0, 0]: Random Value 0.358356 [0, 1]: Float random value is : 11.675105[0, 1]: Random number 11
[0, 1]: Random Value 0.358356 [1, 0]: Float random value is : 63.840496[1, 0]: Random number 63
[1, 0]: Random Value 0.696459 [1, 1]: Float random value is : 44.712799[1, 1]: Random number 44
[1, 1]: Random Value 0.735049
There are a few things wrong here, I'm addressing the first ones here to get you started:
General points
Please check the return values of all CUDA API calls, see here for more info.
Please run cuda-memcheck to check for obvious things like out-of-bounds accesses.
Specific points
When allocating space for the RNG state, you should have space for one state per thread (not one per matrix element as you have now).
Your thread ID calculation in setup_kernel() is wrong, should be threadIdx.x + blockIdx.x * blockDim.x (* instead of +).
You use the thread ID as the sequence number as well as the offset; you should just set the offset to zero, as described in the cuRAND manual:
For the highest quality parallel pseudorandom number generation, each
experiment should be assigned a unique seed. Within an experiment,
each thread of computation should be assigned a unique sequence
number.
Finally, you're running two threads per block, which is incredibly inefficient. Check out the CUDA C Programming Guide, in the "maximize utilization" section, for more information, but you should be looking to launch a multiple of 32 threads per block (e.g. 128, 256) and a large number of blocks (e.g. tens of thousands). If your problem is small, then consider running multiple problems at once (either batched in a single kernel launch or as kernels in different streams to get concurrent execution).
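Putting the specific points together, the setup kernel would look something like this (a sketch based on the points above; the launch configuration should also be changed to larger blocks, as just described):

__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
    // corrected thread ID calculation (* instead of +)
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // one seed per experiment, a unique sequence number per thread, offset 0
    curand_init ( seed, id, 0, &state[id] );
}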
