CURAND - generating more than one quasirandom number per thread - random

I wish to generate a very large set of quasirandom numbers. (By 'very large', I mean much larger than the maximum number of concurrent threads any current CUDA device can support,
requiring each thread to loop, or for the kernel to be launched with a large grid size. And I want quasirandom for their low-discrepancy properties.) For pseudorandoms, where each call to curand_init can take a different sequence parameter, this seems simple.
For generating N quasirandom numbers, where N is greater than gridDim.x * blockDim.x, is there a solution more efficient than either
Running curand_init N times for N states, giving each call a unique offset in [0, N);
Running curand_init only gridDim.x * blockDim.x times for that number of states, but giving each call an offset of e.g. 10*threadID, if I expect each thread to have to generate 10 numbers?
(Ignoring any overhead due to large offsets, i.e. ignoring skip_ahead().)
I had a look at the code in the CUDA 6.0 samples, and MC_EstimatePiInlineQ appeared to do what I was looking for in two dimensions. However, when the number of points to generate exceeds gridDim.x * blockDim.x, I believe this code actually produces the same points multiple times. This is an issue since gridDim.x is not necessarily large enough to fit the problem size in this example; it is tuned to target roughly 10 blocks per multiprocessor on the device.
The relevant code is below (slightly altered for brevity):
// RNG init kernel
template <typename rngState_t, typename rngDirectionVectors_t>
__global__ void initRNG(rngState_t *const rngStates,
rngDirectionVectors_t *const rngDirections)
{
// Determine thread ID
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int step = gridDim.x * blockDim.x;
// Initialise the RNG
curand_init(rngDirections[0], tid, &rngStates[tid]);
curand_init(rngDirections[1], tid, &rngStates[tid + step]);
}
and,
// Estimator kernel
template <typename Real, typename rngState_t>
__global__ void computeValue(unsigned int *const results,
rngState_t *const rngStates,
const unsigned int numSims)
{
// Determine thread ID
unsigned int bid = blockIdx.x;
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int step = gridDim.x * blockDim.x;
// Initialise the RNG
rngState_t localState1 = rngStates[tid];
rngState_t localState2 = rngStates[tid + step];
// Count the number of points which lie inside the unit quarter-circle
unsigned int pointsInside = 0;
for (unsigned int i = tid ; i < numSims ; i += step)
{
Real x = curand_uniform(&localState1);
Real y = curand_uniform(&localState2);
// Do something.
}
// Do some more.
}
Suppose gridDim.x * blockDim.x < N, then at least thread tid = 0 will loop twice in the for. In its second run, it will generate the second random number relative to its initializing offset of 0; this is equivalent to the first random number relative to an initializing offset of 1, which is exactly what tid = 1 did the first time. So the point already has already been generated! This is true for all threads except the one with the highest tid (i.e. some multiple of gridDim.x * blockDim.x), if it is even looping more than once. At best this is useless work, and for my use-case it will be harmful.
I have created a stripped-down version of the mentioned example, based on some hypothetical device, where we have only 4 threads per block, and only 2 blocks, but wish to generate 16 points. Note that lines 9-15 of the output are identical to lines 2-8. Only line 16 is a new point.

This is just a case of reading the docs, but in practice I've found it can indeed be substantially faster to limit the number of states you generate.
This corresponds to option 2 in the question: each thread's offset to curand_init should be n * tid where n is at least as great as the number of random numbers you wish each thread to generate. If that isn't know at state-generation, you can instead use skip_ahead(n * tid, &state) before calling curand, curand_uniform etc.

Related

warp shuffling to reduction of arrays with any length

I am working on a Cuda kernel which performs vector dot product (A x B). I assumed that the length of each vector is multiple of 32 (32,64, ...) and defined the block size to be equal to the length of the array. Each thread in the block multiplies one element of A to the corresponding element of B (thread i ==>psum = A[i]xB[i]). After multiplication, I used the following functions which used warp shuffling technique to perform reduction and calculate the sum all multiplications.
__inline__ __device__
float warpReduceSum(float val) {
int warpSize =32;
for (int offset = warpSize/2; offset > 0; offset /= 2)
val += __shfl_down(val, offset);
return val;
}
__inline__ __device__
float blockReduceSum(float val) {
static __shared__ int shared[32]; // Shared mem for 32 partial sums
int lane = threadIdx.x % warpSize;
int wid = threadIdx.x / warpSize;
val = warpReduceSum(val); // Each warp performs partial reduction
if (lane==0)
shared[wid]=val; // Write reduced value to shared memory
__syncthreads(); // Wait for all partial reductions
//read from shared memory only if that warp existed
val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
if (wid==0)
val = warpReduceSum(val); // Final reduce within first warp
return val;
}
I simply call blockReduceSum(psum) which psum is the multiplication of two elements by a thread.
This approach doesn't work when the length of the array is not multiple of 32, so my question is, can we change this code so that it also works for any length? or is it impossible because if the length of the array is not multiple of 32, some warps have elements belonging more than one array?
First of all, depending on the GPU you are using, performing dot product with just 1 block will probably not be very efficient (as long as you are not batching several dot products in 1 kernel, each done by a single block).
To answer your question: you can reuse the code you have written by just calling your kernel with the number of threads being the closest multiple of 32 higher than N (length of the array) and introducing if statement before calling to blockReduceSum that would like this:
__global__ void kernel(float * A, float * B, int N) {
float psum = 0;
if(threadIdx.x < N) //threadIDx.x because your are using single block, you will need to change it to more general id once you move to multiple blocks
psum = A[threadIdx.x] * B[threadIdx.x];
blockReduceSum(psum);
//The rest of computation
}
That way, threads that do not have array element associated with them, but that need to be there due to use of __shfl, will contribute 0 to the sum.

Generate random number within a function with cuRAND without preallocation

Is it possible to generate random numbers within a device function without preallocate all the states? I would like to generate and use them in "realtime". I need to use them for Monte Carlo simulations what are the most suitable for this purpose? The number generated below are single precision is it possible to have them in double precision?
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
__global__ void cudaRand(float *d_out, unsigned long seed)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
curandState state;
curand_init(seed, i, 0, &state);
d_out[i] = curand_uniform(&state);
}
int main(int argc, char** argv)
{
size_t N = 1 << 4;
float *v = new float[N];
float *d_out;
cudaMalloc((void**)&d_out, N * sizeof(float));
// generate random numbers
cudaRand << < 1, N >> > (d_out, time(NULL));
cudaMemcpy(v, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
for (size_t i = 0; i < N; i++)
{
printf("out: %f\n", v[i]);
}
cudaFree(d_out);
delete[] v;
return 0;
}
UPDATE
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <ctime>
__global__ void cudaRand(double *d_out)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
curandState state;
curand_init((unsigned long long)clock() + i, 0, 0, &state);
d_out[i] = curand_uniform_double(&state);
}
int main(int argc, char** argv)
{
size_t N = 1 << 4;
double *h_v = new double[N];
double *d_out;
cudaMalloc((void**)&d_out, N * sizeof(double));
// generate random numbers
cudaRand << < 1, N >> > (d_out);
cudaMemcpy(h_v, d_out, N * sizeof(double), cudaMemcpyDeviceToHost);
for (size_t i = 0; i < N; i++)
printf("out: %f\n", h_v[i]);
cudaFree(d_out);
delete[] h_v;
return 0;
}
How I was dealing with the similar situation in the past, within __device__/__global__ function:
int tId = threadIdx.x + (blockIdx.x * blockDim.x);
curandState state;
curand_init((unsigned long long)clock() + tId, 0, 0, &state);
double rand1 = curand_uniform_double(&state);
double rand2 = curand_uniform_double(&state);
So just use curand_uniform_double for generating random doubles and also I believe you don't want the same seed for all of the threads, thats what I am trying to achieve by using clock() + tId instead. This way the odds of having the same rand1/rand2 in any of the two threads are close to nil.
EDIT:
However, based on below comments, proposed approach may perhaps lead to biased result:
JackOLantern pointed me to this part of curand documentation:
Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences.
Also there is a devtalk thread devoted to how to improve performance of curand_init in which the proposed solution to speed up the curand initialization is:
One thing you can do is use different seeds for each thread and a fixed subsequence of 0 and offset of 0.
But the same poster is later stating:
The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don't know of any problems like this, and even if they do exist they will most likely be rare.
So it is basically up to you whether you want better performance (as I did) or 1000% unbiased results. If that is what you desire, then solution proposed by JackOLantern is the correct one, i.e. initialize curand as:
curand_init((unsigned long long)clock(), tId, 0, &state)
Using not 0 value for offset and subsequence parameters is, however, decreasing performance. For more info on these parameters you may review this SO thread and also curand documentation.
I see that JackOLantern stated in comment that:
I would say it is not recommandable to call curand_init and curand_uniform_double from within the same kernel from two reasons ........ Second, curand_init initializes the pseudorandom number generator and sets all of its parameters, so I'm afraid your approach will be somewhat slow.
I was dealing with this in my thesis on several pages, tried various approaches to get different random numbers in each thread and creating curandState in each of the threads turned out to be the most viable solution for me. I needed to generate ~10 random numbers in each thread and among others I tried:
developing my own simple random number generator (Linear Congruential Generator) whose intialization was basically for free, however, the performance suffered greatly when generating numbers, so in the end having curandState in each thread turned out to be superior,
pre-allocating curandStates and reusing them - this was memory heavy and when I decreased number of preallocated states then I had to use non zero values for offset/subsequence parameters of curand_uniform_double in order to get rid of bias which led to decreased performance when generating numbers.
So after making thorough analysis I decided to indeed call curand_init and curand_uniform_double in each thread. The only problem was with the amount of registry that these states were occupying so I had to be careful with the block sizes not to exceed the max number of registry available to each block.
Thats what I have to say about provided solution which I was finally able to test and it is working just fine on my machine/GPU. I run the code from UPDATE section in the above question and 16 different random numbers were displayed in the console correctly. Therefore I advise you to properly perform error checking after executing kernel to see what went wrong inside. This topic is very well covered in this SO thread.

Need help explaining some CUDA performance results

We have been experimenting with different histogramming algorithms on a CUDA GPU. Most of the results I can explain, but we noticed some really weird features of which I have no clue what is causing them.
Kernels
The weird stuff happens in a data-parallel implementation. This means that the data is distributed over the threads. Each thread looks at a subset (ideally just 1) of the data, and adds its contribution to a histogram in global memory, which requires atomic operations.
__global__ void histogram1(float *data, uint *hist, uint n, float xMin, float binWidth, uin\
t nBins)
{
uint const nThreads = blockDim.x * gridDim.x;
uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
uint idx = tid;
while (idx < n)
{
float x = data[idx];
uint bin = (x - xMin) / binWidth;
atomicAdd(hist + bin, 1);
idx += nThreads;
}
}
As a first optimization, each block first constructs a partial histogram in shared memory before doing a reduction of partial histograms to obtain the final result in global memory. The code is pretty straightforward, and I believe that it's very similar to that used in Cuda By Example.
__global__ void histogram2(float *data, uint *hist, uint n,
float xMin, float binWidth, uint nBins)
{
extern __shared__ uint partialHist[]; // size = nBins * sizeof(uint)
uint const nThreads = blockDim.x * gridDim.x;
uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
// initialize shared memory to 0
uint idx = threadIdx.x;
while (idx < nBins)
{
partialHist[idx] = 0;
idx += blockDim.x;
}
__syncthreads();
// Calculate partial histogram (in shared mem)
idx = tid;
while (idx < n)
{
float x = data[idx];
uint bin = (x - xMin) / binWidth;
atomicAdd(partialHist + bin, 1);
idx += nThreads;
}
__syncthreads();
// Compute resulting total (global) histogram
idx = threadIdx.x;
while (idx < nBins)
{
atomicAdd(hist + idx, partialHist[idx]);
idx += blockDim.x;
}
}
Results
Speedup vs n
We benchmarked these two kernels to see how they behave as a function of n, which is the number of datapoints. The data was uniform randomly distributed. In the figure below, HIST_DP_1 is the unoptimized trivial version, whereas HIST_DP_2 is the one using shared memory to speed things up:
The timings have been taken relative to the CPU performance, and the weird stuff happens for very large datasets. The optimizing function, instead of flattening out like the unoptimized version, starts to improve again (relatively). We'd expect that for large datasets, the occupancy of our card will be near 100%, which would mean that from that point on the performance would scale linearly, like the CPU (and indeed the unoptimized blue curve).
The behavior could be due to the fact that the chance of having two threads performing an atomic operation on the same bin in shared/global memory going to zero for large data-sets, but in that case we would expect the drop to be in different places for different nBins. This is not what we observe, the drop is in all three panels at around 10^7 bins. What is happening here? Some complicated caching effect? Or is it something obvious that we missed?
Speedup vs nBins
To have a closer look at the behavior as a function of the number of bins, we fixed our dataset at 10^4 (10^5 in one case), and ran the algorithms for many different bin-numbers.
As a reference we also generated some non-random data. The red graph shows the results for perfectly sorted data, whereas the light-blue line corresponds to a dataset in which every value was identical (maximal congestion in the atomic operations). The question is obvious: what is the discontinuity doing there?
System Setup
NVidia Tesla M2075, driver 319.37
Cuda 5.5
Intel(R) Xeon(R) CPU E5-2603 0 # 1.80GHz
Thanks for your help!
EDIT: Reproduction Case
As requested: a compiling, runnable reproduction case. The code is quite long, which is why I didn't include it in the first place. The snippet is available on snipplr. To make your life even more easy, I'll include a little shell-script to run it for the same settings I used, and an Octave script to produce the plots.
Shell script
#!/bin/bash
runs=100
# format: [n] [nBins] [t_cpu] [t_gpu1] [t_gpu2]
for nBins in 100 1000 10000
do
for n in 10 50 100 200 500 1000 2000 5000 10000 50000 100000 500000 1000000 10000000 100000000
do
echo -n "$n $nBins "
./repro $n $nBins $runs
done
done
Octave script
T = load('repro.txt');
bins = unique(T(:,2));
t = cell(1, numel(bins));
for i = 1:numel(bins)
t{i} = T(T(:,2) == bins(i), :);
subplot(2, numel(bins), i);
loglog(t{i}(:,1), t{i}(:,3:5))
title(sprintf("nBins = %d", bins(i)));
legend("cpu", "gpu1", "gpu2");
subplot(2, numel(bins), i + numel(bins));
loglog(t{i}(:,1), t{i}(:,4)./t{i}(:,3), ...
t{i}(:,1), t{i}(:,5)./t{i}(:,3));
title("relative");
legend("gpu1/cpu", "gpu2/cpu");
end
Absolute Timings
Absolute timings show that it's not the CPU slowing down. Instead, the GPU is speeding up relatively:
Regarding question 1:
This is not what we observe, the drop is in all three panels at around 10^7 bins. What is happening here? Some complicated caching effect? Or is it something obvious that we missed?
This drop is due to the limit you've set on the maximum number of blocks (1<<14 == 16384). At n=10^7 gpuBench2 the limit has kicked in, and each thread starts processing multiple elements. At n=10^8 each thread works on 12 (sometimes 11) elements. If you remove this cap you can see that your performance continues to flatline.
Why is this faster? Multiple elements per thread allows for latency of the load from data to be hidden much better, especially in the case with 10000 bins where you are only able to fit one block on to each SM due to the high shared memory usage. In this case, every element in the block will reach the global load at around the same time, and none will be able to continue until it has completed its load. By having multiple elements we can pipeline these loads, getting many elements per thread for the latency of one.
(You don't see this in gupBench1 as it is not latency bound, but bandwidth bound to L2. You can see this very quickly if you import the output of nvprof into the visual profiler)
Regarding question 2:
The question is obvious: what is the discontinuity doing there?
I don't have a Fermi to hand, and I can't reproduce this on my Kepler, so I'd assume it's something that is Fermi specific. That's the danger of answering questions with two parts, I suppose!

CUDA: PRNG generator and state

I have been playing around with cuRAND library to understand the concept of a generator and state. The term state has yet puzzled me so I would like to clarify my understanding with the help of some code.
int main () {
float *hData, *dData;
curandState *devState;
size_t nThreads = 4;
size_t nRows = 10;
size_t n = nRows * nThreads;
hData = new float[n];
cudaMalloc((void**)&dData, n * sizeof(float));
cudaMemset(dData, 0, n * sizeof(float));
cudaMalloc((void**)&devState, nThreads * sizeof(curandState));
initCurand <<< 1, nThreads >>> (devState, 1234);
cudaDeviceSynchronize();
testrand <<< 1, nThreads >>> (devState, dData, nRows);
cudaDeviceSynchronize();
cudaMemcpy(hData, dData, n*sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(devState);
cudaFree(dData);
free(hData);
}
Below are the kernels.
__global__ void initCurand(curandState *state, unsigned long seed)
{
curand_init(seed, threadIdx.x, 0, &state[threadIdx.x]);
}
__global__ void testrand(curandState *state, float *d1, int rows)
{
int idx = threadIdx.x;
int stride = blockDim.x;
for (int i = 0; i < rows; i++)
{
d1[idx + i * stride] = curand_uniform(&state[idx]);
}
}
Simply put, there are as many states as the number of threads. Each thread consume 10 random numbers from THEIR respective state. Here is the output:
0.145468 0.820181 0.550399 0.294830
0.434899 0.926417 0.811845 0.308556
0.870710 0.511765 0.782640 0.620706
0.455165 0.537594 0.742539 0.535606
0.857093 0.809246 0.541354 0.497212
0.582418 0.017524 0.195556 0.898062
0.201404 0.449338 0.006050 0.041652
0.786745 0.799349 0.093755 0.994597
0.300772 0.136307 0.648018 0.970036
0.366787 0.377424 0.096621 0.495483
Can I therefore conclude that state is actually a generator itself? Each generator/state generates a uniform distribution of random number and different state maintain a distance such that the numbers generated do not repeat. ==> Q1
From the cuRAND guide, I learned that creating a lot of states are compute intensive. also experimented to see that. How can generate random numbers using a single state (variable) such that each thread consumes different random number (or the next random number is queue) from the given (single) distribution. ==> Q2
e.g. I have four threads and single state variable which generates a (hypothetical) distribution of random numbers as follows:
2 6 100 26 81 72 78 21 33 57 19 32 ...
-------- 1st Cycle ------ 2nd Cycle ------
Thread_1: - 2 --------- 81 -
Thread_2: - 6 --------- 72 -
Thread_3: - 100 ------ 78 -
Thread_4: - 26 -------- 21 -
Is this possible using a single state variable?
Q1:
state is data. It represents all the data (such as seeds, sequence positions, subsequence positions, etc.) required to generate a random number.
A generator is a function. It operates on an instance of state, and creates a random number (according to the generator type), and also updates (modifies) the state.
Q2:
We generate multiple states, so that multiple threads can simultaneously generate random numbers. It does not make sense to talk about a single state servicing multiple threads, if the threads are generating random numbers in parallel. Since the state is updated by the generator (see Q1) during the RNG process, you would have some threads reading from the state and other threads writing to the state, which would be a race condition, and would likely lead to corruption of the state.
What you are describing, where thread 1-4 generate 2,6,100, and 26 from a single state on the first cycle is not possible.

Make CURAND generate different random numbers from a uniform distribution

I am trying to use CURAND library to generate random numbers which are completely independent of each other from 0 to 100. Hence I am giving time as seed to each thread and specifying the "id = threadIdx.x + blockDim.x * blockIdx.x" as sequence and offset .
Then after getting the random number as float, I multiply it by 100 and take its integer value.
Now, the problem I am facing is that its getting the same random number for the thread [0,0] and [0,1], no matter how many times I run the code which is 11. I am unable to understand what am I doing wrong. Please help.
I am pasting my code below:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include<curand_kernel.h>
#include "util/cuPrintf.cu"
#include<time.h>
#define NE WA*HA //Total number of random numbers
#define WA 2 // Matrix A width
#define HA 2 // Matrix A height
#define SAMPLE 100 //Sample number
#define BLOCK_SIZE 2 //Block size
__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
int id = threadIdx.x + blockIdx.x + blockDim.x;
curand_init ( seed, id , id, &state[id] );
}
__global__ void generate( curandState* globalState, float* randomMatrix )
{
int ind = threadIdx.x + blockIdx.x * blockDim.x;
if(ind < NE){
curandState localState = globalState[ind];
float stopId = curand_uniform(&localState) * SAMPLE;
cuPrintf("Float random value is : %f",stopId);
int stop = stopId ;
cuPrintf("Random number %d\n",stop);
for(int i = 0; i < SAMPLE; i++){
if(i == stop){
float random = curand_normal( &localState );
cuPrintf("Random Value %f\t",random);
randomMatrix[ind] = random;
break;
}
}
globalState[ind] = localState;
}
}
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
// 1. allocate host memory for matrix A
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float* ) malloc(mem_size_A);
time_t t;
// 2. allocate device memory
float* d_A;
cudaMalloc((void**) &d_A, mem_size_A);
// 3. create random states
curandState* devStates;
cudaMalloc ( &devStates, size_A*sizeof( curandState ) );
// 4. setup seeds
int n_blocks = size_A/BLOCK_SIZE;
time(&t);
printf("\nTime is : %u\n",(unsigned long) t);
setup_kernel <<< n_blocks, BLOCK_SIZE >>> ( devStates, (unsigned long) t );
// 4. generate random numbers
cudaPrintfInit();
generate <<< n_blocks, BLOCK_SIZE >>> ( devStates,d_A );
cudaPrintfDisplay(stdout, true);
cudaPrintfEnd();
// 5. copy result from device to host
cudaMemcpy(h_A, d_A, mem_size_A, cudaMemcpyDeviceToHost);
// 6. print out the results
printf("\n\nMatrix A (Results)\n");
for(int i = 0; i < size_A; i++)
{
printf("%f ", h_A[i]);
if(((i + 1) % WA) == 0)
printf("\n");
}
printf("\n");
// 7. clean up memory
free(h_A);
cudaFree(d_A);
}
Output that I get is :
Time is : 1347857063
[0, 0]: Float random value is : 11.675105[0, 0]: Random number 11
[0, 0]: Random Value 0.358356 [0, 1]: Float random value is : 11.675105[0, 1]: Random number 11
[0, 1]: Random Value 0.358356 [1, 0]: Float random value is : 63.840496[1, 0]: Random number 63
[1, 0]: Random Value 0.696459 [1, 1]: Float random value is : 44.712799[1, 1]: Random number 44
[1, 1]: Random Value 0.735049
There are a few things wrong here, I'm addressing the first ones here to get you started:
General points
Please check the return values of all CUDA API calls, see here for more info.
Please run cuda-memcheck to check for obvious things like out-of-bounds accesses.
Specific points
When allocating space for the RNG state, you should have space for one state per thread (not one per matrix element as you have now).
Your thread ID calculation in setup_kernel() is wrong, should be threadIdx.x + blockIdx.x * blockDim.x (* instead of +).
You use the thread ID as the sequence number as well as the offset, you should just set the offset to zero as described in the cuRAND manual:
For the highest quality parallel pseudorandom number generation, each
experiment should be assigned a unique seed. Within an experiment,
each thread of computation should be assigned a unique sequence
number.
Finally you're running two threads per block, that's incredibly inefficient. Check out the CUDA C Programming Guide, in the "maximize utilization" section for more information, but you should be looking to launch a multiple of 32 threads per block (e.g. 128, 256) and a large number of blocks (e.g. tens of thousands). If you're problem is small then consider running multiple problems at once (either batched in a single kernel launch or as kernels in different streams to get concurrent execution).

Resources