Generate random numbers within a device function with cuRAND without preallocation

Is it possible to generate random numbers within a device function without preallocating all the states? I would like to generate and use them in "real time". I need them for Monte Carlo simulations: which generators are most suitable for this purpose? Also, the numbers generated below are single precision; is it possible to get them in double precision?
#include <iostream>
#include <ctime>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>

__global__ void cudaRand(float *d_out, unsigned long seed)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init(seed, i, 0, &state);
    d_out[i] = curand_uniform(&state);
}

int main(int argc, char** argv)
{
    size_t N = 1 << 4;
    float *v = new float[N];
    float *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(float));

    // generate random numbers
    cudaRand<<<1, N>>>(d_out, time(NULL));

    cudaMemcpy(v, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < N; i++)
    {
        printf("out: %f\n", v[i]);
    }

    cudaFree(d_out);
    delete[] v;
    return 0;
}
UPDATE
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <ctime>

__global__ void cudaRand(double *d_out)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init((unsigned long long)clock() + i, 0, 0, &state);
    d_out[i] = curand_uniform_double(&state);
}

int main(int argc, char** argv)
{
    size_t N = 1 << 4;
    double *h_v = new double[N];
    double *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(double));

    // generate random numbers
    cudaRand<<<1, N>>>(d_out);

    cudaMemcpy(h_v, d_out, N * sizeof(double), cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < N; i++)
        printf("out: %f\n", h_v[i]);

    cudaFree(d_out);
    delete[] h_v;
    return 0;
}

Here is how I dealt with a similar situation in the past, within a __device__/__global__ function:
int tId = threadIdx.x + (blockIdx.x * blockDim.x);
curandState state;
curand_init((unsigned long long)clock() + tId, 0, 0, &state);
double rand1 = curand_uniform_double(&state);
double rand2 = curand_uniform_double(&state);
So just use curand_uniform_double to generate random doubles. I also believe you don't want the same seed for all of the threads; that's what I am trying to achieve by using clock() + tId instead. This way the odds of having the same rand1/rand2 in any two threads are close to nil.
EDIT:
However, based on the comments below, the proposed approach may lead to biased results:
JackOLantern pointed me to this part of the curand documentation:
Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences.
There is also a devtalk thread devoted to improving the performance of curand_init, in which the proposed way to speed up curand initialization is:
One thing you can do is use different seeds for each thread and a fixed subsequence of 0 and offset of 0.
But the same poster later states:
The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don't know of any problems like this, and even if they do exist they will most likely be rare.
So it is basically up to you whether you want better performance (as I did) or fully unbiased results. If the latter is what you desire, then the solution proposed by JackOLantern is the correct one, i.e. initialize curand as:
curand_init((unsigned long long)clock(), tId, 0, &state)
Using non-zero values for the offset and subsequence parameters, however, decreases performance. For more info on these parameters you may review this SO thread and also the curand documentation.
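For illustration, a minimal kernel using this same-seed/distinct-subsequence initialization might look like the following (a sketch; the kernel name and the one-draw-per-thread layout are mine):
__global__ void cudaRandRecommended(double *d_out, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    // Same seed in every thread; the distinct subsequence (tid) yields
    // statistically independent streams, at a higher initialization cost.
    curand_init(seed, tid, 0, &state);
    d_out[tid] = curand_uniform_double(&state);
}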
I see that JackOLantern stated in a comment that:
I would say it is not recommendable to call curand_init and curand_uniform_double from within the same kernel for two reasons [...] Second, curand_init initializes the pseudorandom number generator and sets all of its parameters, so I'm afraid your approach will be somewhat slow.
I dealt with this over several pages in my thesis. I tried various approaches to get different random numbers in each thread, and creating a curandState in each thread turned out to be the most viable solution for me. I needed to generate ~10 random numbers in each thread and, among other approaches, I tried:
- developing my own simple random number generator (a Linear Congruential Generator) whose initialization was basically free; however, performance suffered greatly when generating numbers, so in the end having a curandState in each thread turned out to be superior;
- pre-allocating curandStates and reusing them; this was memory-heavy, and when I decreased the number of preallocated states I had to use non-zero values for the offset/subsequence parameters of curand_uniform_double in order to get rid of bias, which decreased performance when generating numbers.
So after making a thorough analysis I decided to indeed call curand_init and curand_uniform_double in each thread. The only problem was the number of registers these states occupy, so I had to be careful with block sizes not to exceed the maximum number of registers available per block.
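If you hit the same register-pressure limit, one option is to let the runtime suggest a block size. A minimal sketch, assuming the cudaRand kernel from the UPDATE section and CUDA 6.5 or later:
int minGridSize = 0, blockSize = 0;
// The occupancy calculator accounts for the kernel's register and
// shared-memory usage on the current device.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, cudaRand, 0, 0);
int gridSize = ((int)N + blockSize - 1) / blockSize;
// Add an index bounds check inside the kernel if the launch overshoots N.
cudaRand<<<gridSize, blockSize>>>(d_out);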
That's what I have to say about the provided solution, which I was finally able to test: it works just fine on my machine/GPU. I ran the code from the UPDATE section of the question above and 16 different random numbers were displayed in the console correctly. Therefore I advise you to properly perform error checking after executing the kernel to see what went wrong inside. This topic is very well covered in this SO thread.
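For reference, the error-checking idiom from that thread looks roughly like this (a sketch of the widely used macro; needs <stdio.h> and <stdlib.h>):
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Usage after a kernel launch:
// cudaRand<<<1, N>>>(d_out);
// gpuErrchk(cudaPeekAtLastError());
// gpuErrchk(cudaDeviceSynchronize());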

Related

CURAND - generating more than one quasirandom number per thread

I wish to generate a very large set of quasirandom numbers. (By 'very large', I mean much larger than the maximum number of concurrent threads any current CUDA device can support, requiring each thread to loop or the kernel to be launched with a large grid size. And I want quasirandom numbers for their low-discrepancy properties.) For pseudorandoms, where each call to curand_init can take a different sequence parameter, this seems simple.
For generating N quasirandom numbers, where N is greater than gridDim.x * blockDim.x, is there a solution more efficient than either:
1. running curand_init N times for N states, giving each call a unique offset in [0, N); or
2. running curand_init only gridDim.x * blockDim.x times for that number of states, but giving each call an offset of e.g. 10*threadID, if I expect each thread to have to generate 10 numbers?
(Ignoring any overhead due to large offsets, i.e. ignoring skipahead().)
I had a look at the code in the CUDA 6.0 samples, and MC_EstimatePiInlineQ appeared to do what I was looking for in two dimensions. However, when the number of points to generate exceeds gridDim.x * blockDim.x, I believe this code actually produces the same points multiple times. This is an issue since gridDim.x is not necessarily large enough to fit the problem size in this example; it is tuned to target roughly 10 blocks per multiprocessor on the device.
The relevant code is below (slightly altered for brevity):
// RNG init kernel
template <typename rngState_t, typename rngDirectionVectors_t>
__global__ void initRNG(rngState_t *const rngStates,
                        rngDirectionVectors_t *const rngDirections)
{
    // Determine thread ID
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int step = gridDim.x * blockDim.x;
    // Initialise the RNG
    curand_init(rngDirections[0], tid, &rngStates[tid]);
    curand_init(rngDirections[1], tid, &rngStates[tid + step]);
}
and,
// Estimator kernel
template <typename Real, typename rngState_t>
__global__ void computeValue(unsigned int *const results,
                             rngState_t *const rngStates,
                             const unsigned int numSims)
{
    // Determine thread ID
    unsigned int bid = blockIdx.x;
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int step = gridDim.x * blockDim.x;
    // Initialise the RNG
    rngState_t localState1 = rngStates[tid];
    rngState_t localState2 = rngStates[tid + step];
    // Count the number of points which lie inside the unit quarter-circle
    unsigned int pointsInside = 0;
    for (unsigned int i = tid; i < numSims; i += step)
    {
        Real x = curand_uniform(&localState1);
        Real y = curand_uniform(&localState2);
        // Do something.
    }
    // Do some more.
}
Suppose gridDim.x * blockDim.x < N; then at least thread tid = 0 will loop twice in the for. In its second pass it generates the second random number relative to its initializing offset of 0, which is equivalent to the first random number relative to an initializing offset of 1, and that is exactly what tid = 1 produced in its first pass. So the point has already been generated! This holds for every thread except the one with the highest tid (i.e. some multiple of gridDim.x * blockDim.x), if it even loops more than once. At best this is useless work, and for my use case it will be harmful.
I have created a stripped-down version of the mentioned example, based on some hypothetical device, where we have only 4 threads per block, and only 2 blocks, but wish to generate 16 points. Note that lines 9-15 of the output are identical to lines 2-8. Only line 16 is a new point.
This is just a case of reading the docs, but in practice I've found it can indeed be substantially faster to limit the number of states you generate.
This corresponds to option 2 in the question: each thread's offset to curand_init should be n * tid, where n is at least as great as the number of random numbers each thread will generate. If n isn't known at state-initialization time, you can instead call skipahead(n * tid, &state) before calling curand, curand_uniform, etc.
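A minimal sketch of that initialization for a Sobol32 generator (the kernel name and the n parameter are mine; assumes <curand_kernel.h>):
__global__ void initSobolChunked(curandStateSobol32_t *states,
                                 curandDirectionVectors32_t *dirs,
                                 unsigned int n) // n = draws per thread
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Offset n * tid gives each thread its own disjoint n-element chunk
    // of the low-discrepancy sequence, so no point is generated twice.
    curand_init(dirs[0], n * tid, &states[tid]);
}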

Parallel multiplication of many small matrices by fixed vector

The situation is the following: I have a large number (thousands) of elements, each given by a small matrix of dimensions 4x2, 9x3 ... you get the idea. All matrices have the same dimensions.
I want to multiply each of these matrices with a fixed vector of precalculated values. In short:
for(i = 1...n)
X[i] = M[i] . N;
What is the best approach to do this in parallel using Thrust? How do I lay out my data in memory?
NB: There might be specialized, more suitable libraries to do this on GPUs. I'm interested in Thrust because it allows me to deploy to different backends, not just CUDA.
One possible approach:
1. Flatten the arrays (matrices) into a single data vector. This is an advantageous step for enabling general Thrust processing anyway.
2. Use a strided-range mechanism to take your scaling vector and extend it to the overall length of your flattened data vector.
3. Use thrust::transform with thrust::multiplies to multiply the two vectors together.
If you need to access the matrices later out of your flattened data vector (or result vector), you can do so with pointer arithmetic or a combination of fancy iterators.
If you need to re-use the extended scaling vector, you may want to use the method outlined in step 2 exactly (i.e. create an actual vector using that method, length = N matrices, repeated). If you are only doing this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (modulo the length of your matrix in elements), followed by a permutation iterator to index into your original scaling vector (length = 1 matrix).
The following example implements the above, without using the strided range iterator method:
#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>

#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024

struct my_modulo_functor : public thrust::unary_function<int, int>
{
    __host__ __device__
    int operator()(int idx) {
        return idx % (H_MAT * W_MAT); }
};

int main(){

    thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
    thrust::host_vector<int> scale(H_MAT*W_MAT);
    // synthetic; instead flatten/copy matrices into data vector
    for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
    for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;

    thrust::device_vector<int> d_data = data;
    thrust::device_vector<int> d_scale = scale;
    thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);

    thrust::transform(d_data.begin(), d_data.end(),
                      thrust::make_permutation_iterator(d_scale.begin(),
                        thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_modulo_functor())),
                      d_result.begin(), thrust::multiplies<int>());

    thrust::host_vector<int> result = d_result;

    for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++)
        if (result[i] != data[i] * scale[i % (H_MAT*W_MAT)]) {
            std::cout << "Mismatch at: " << i
                      << " cpu result: " << (data[i] * scale[i % (H_MAT*W_MAT)])
                      << " gpu result: " << result[i] << std::endl;
            return 1; }

    std::cout << "Success!" << std::endl;
    return 0;
}
EDIT: Responding to a question below:
The benefit of fancy iterators (i.e. transform(numbers, iterator)) is that they often allow the elimination of extra data copies/movement, as compared to assembling the second vector explicitly (which requires extra steps and data movement) and then passing it to transform(numbers, other numbers). If you're only going to use the second vector once, then the fancy iterators will generally be better. If you're going to use it again, then you may want to assemble it explicitly. This presentation is instructive, in particular the section on "Fusion".
For a one-time use, the overhead of assembling the second vector on the fly using fancy iterators and the functor is generally lower than explicitly creating a new vector and then passing that new vector to the transform routine.
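If the extended scaling vector will be reused, a sketch of assembling it explicitly (reusing my_modulo_functor from the example above; needs <thrust/gather.h>):
// Materialize scale[i % (H_MAT*W_MAT)] into a full-length vector once,
// then reuse d_scale_big in as many transform calls as needed.
thrust::device_vector<int> d_scale_big(N_MAT * H_MAT * W_MAT);
thrust::gather(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_modulo_functor()),
               thrust::make_transform_iterator(thrust::counting_iterator<int>(N_MAT * H_MAT * W_MAT), my_modulo_functor()),
               d_scale.begin(),
               d_scale_big.begin());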
When looking for a software library specifically built for multiplying small matrices, one may have a look at https://github.com/hfp/libxsmm. Below, the code requests a specialized matrix kernel according to the typical GEMM parameters (note that some limitations apply).
double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
int prefetch = LIBXSMM_PREFETCH_AUTO;
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;
xmm = libxsmm_dmmdispatch(m, n, k,
&m/*lda*/, &k/*ldb*/, &m/*ldc*/,
&alpha, &beta, &flags, &prefetch);
Given the above code, one can proceed and run "xmm" for an entire series of (small) matrices without a particular data structure (below code also uses "prefetch locations").
if (0 < n) { /* check that n is at least 1 */
  /* a, b, c and asize, bsize, csize are assumed to be set up elsewhere */
  #pragma omp parallel for private(i)
  for (i = 0; i < (n - 1); ++i) {
    const double *const ai = a + i * asize;
    const double *const bi = b + i * bsize;
    double *const ci = c + i * csize;
    xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
  }
  xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
      /* pseudo prefetch for last element of batch (avoids page fault) */
      a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}
In addition to the manual loop control as shown above, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see ReadTheDocs). The latter is useful if data structures exist that describe the series of operands (A, B, and C matrices).
There are two reasons why this library gives superior performance: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while calculating the current product.
(If one is looking for something that blends well with C/C++, this library supports that. However, it does not target CUDA/Thrust.)

Correct OpenMP pragmas for pi Monte Carlo in C with a non-thread-safe random number generator

I need some help parallelizing the pi calculation with the Monte Carlo method using OpenMP, given a random number generator that is not thread-safe.
First: This SO thread didn't help me.
My own attempt is the #pragma omp statements below. I thought the i, x and y variables should be initialized by each thread and should therefore be private. z is the sum of all hits inside the circle, so it should be summed after the implied barrier at the end of the for loop.
I think the main problem is the static state variable of the random number generator. I made a critical section around the calls to it, so that only one thread at a time can execute them. But the Pi estimate doesn't improve with higher iteration counts.
Note: I should not use another RNG, but it's okay to make small changes to this one.
#include <stdio.h>
#include <omp.h>

double rng_doub(double range); /* defined below */

int main(int argc, char *argv[]) {
    int i, z = 0, threads = 8, iters = 100000;
    double x, y, pi;

    #pragma omp parallel firstprivate(i,x,y) reduction(+:z) num_threads(threads)
    for (i = 0; i < iters; ++i) {
        #pragma omp critical
        {
            x = rng_doub(1.0);
            y = rng_doub(1.0);
        }
        if ((x*x + y*y) <= 1.0)
            z++;
    }

    pi = ((double) z / (double) (iters*threads)) * 4.0;
    printf("Pi: %lf\n", pi);
    return 0;
}
This RNG actually comes as a separate include file, but as I'm not sure I created the header file correctly, I integrated it into the main program file, so I have only one .c file.
#define RNG_MOD 741025

int rng_int(void) {
    static int state = 0;
    return (state = (1366 * state + 150889) % RNG_MOD);
}

double rng_doub(double range) {
    return ((double) rng_int()) / (double) ((RNG_MOD - 1)/range);
}
I've also tried making the static int state global, but it doesn't change my result; maybe I did it wrong. So could you please help me make the correct changes? Thank you very much!
Your original linear congruential PRNG has a cycle length of 49400, therefore you are only getting 24700 unique test points (each point consumes two numbers from the sequence). This is a terrible generator to be used for any kind of Monte Carlo simulation. Even if you make 100000000 trials, you won't get any closer to the true value of Pi, because you are simply repeating the same points over and over again; as a result both the final value of z and iters are multiplied by the same constant, which cancels in the end during the division.
The per-thread seed introduced by Z boson improves the situation a little, with the number of unique points increasing with the total number of OpenMP threads. The increase is not linear: if the seed of one PRNG falls within the sequence of another PRNG, both PRNGs produce the same sequence shifted by no more than 49400 elements. Given the cycle length, each PRNG covers 49400/RNG_MOD = 6.7% of the total output range, and that is the probability of two PRNGs being synchronised. There are a total of RNG_MOD/49400 = 15 unique sequences possible. It basically means that in the best-case seeding scenario you won't be able to get past 2 x 15 = 30 threads, as any further thread would simply repeat the result of some of the others. The multiplier 2 comes from the fact that each point uses two elements from the sequence, and therefore it is possible to get a different set of points by shifting the sequence by one element.
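As a side note, the cycle-length claim is easy to verify with a throwaway host program (a sketch; since gcd(1366, RNG_MOD) = 1 the map is a bijection, so iterating from seed 0 must eventually return to 0):
#include <stdio.h>
#define RNG_MOD 741025

int main(void) {
    int state = 0;
    long steps = 0;
    do {
        state = (1366 * state + 150889) % RNG_MOD;
        ++steps;
    } while (state != 0);
    printf("cycle length from seed 0: %ld\n", steps); /* 49400 per the analysis above */
    return 0;
}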
The ultimate solution is to completely drop your PRNG and use something like the Mersenne twister MT19937, which has a cycle length of 2^19937 − 1 and a very strong seeding algorithm. If you are not able to use another PRNG, as you state in your question, at least modify the constants of the LCG to match those used in rand():
int rng_int(void) {
    static int state = 1;
    // & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
    return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}
Note that rand() is not a good PRNG either; it is still bad, just a little better than the one used in your code.
Try the code below. It makes a private state for each thread. I did something similar with the rand_r function in Why does calculation with OpenMP take 100x more time than with a single thread?
Edit: I updated my code using some of Hristo's suggestions. I used threadprivate for the first time, and I also used a better rand function which gives a better estimate of pi, but it's still not good enough.
One strange thing was that I had to define the function rng_int after the threadprivate directive, otherwise I got the error "error: 'state' declared 'threadprivate' after first use". I should probably ask a question about this.
//gcc -O3 -Wall -pedantic -fopenmp main.c
#include <omp.h>
#include <stdio.h>

#define RNG_MOD 0x80000000
int state;
int rng_int(void);
double rng_doub(double range);

int main() {
    int i, numIn, n;
    double x, y, pi;

    n = 1 << 30;
    numIn = 0;

    #pragma omp threadprivate(state)
    #pragma omp parallel private(x, y) reduction(+:numIn)
    {
        state = 25234 + 17 * omp_get_thread_num();
        #pragma omp for
        for (i = 0; i <= n; i++) {
            x = (double)rng_doub(1.0);
            y = (double)rng_doub(1.0);
            if (x*x + y*y <= 1) numIn++;
        }
    }

    pi = 4. * numIn / n;
    printf("asdf pi %f\n", pi);
    return 0;
}

int rng_int(void) {
    // & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
    return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}

double rng_doub(double range) {
    return ((double)rng_int()) / (((double)RNG_MOD)/range);
}
You can see the results (and edit and run the code) at http://coliru.stacked-crooked.com/a/23c1753a1b7d1b0d

Using both CUDA GPU devices and zero-copy pinned memory

I am using the CUSP library for sparse matrix multiplication on a CUDA machine. My current code is:
#include <cusp/coo_matrix.h>
#include <cusp/multiply.h>
#include <cusp/print.h>
#include <cusp/transpose.h>
#include <stdio.h>

#define CATAGORY_PER_SCAN 1000
#define TOTAL_CATAGORY 100000
#define MAX_SIZE 1000000
#define ELEMENTS_PER_CATAGORY 10000
#define ELEMENTS_PER_TEST_CATAGORY 1000
#define INPUT_VECTOR 1000
#define TOTAL_ELEMENTS ELEMENTS_PER_CATAGORY * CATAGORY_PER_SCAN
#define TOTAL_TEST_ELEMENTS ELEMENTS_PER_TEST_CATAGORY * INPUT_VECTOR

int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    cusp::coo_matrix<long long int, double, cusp::host_memory> A(CATAGORY_PER_SCAN, MAX_SIZE, TOTAL_ELEMENTS);
    cusp::coo_matrix<long long int, double, cusp::host_memory> B(MAX_SIZE, INPUT_VECTOR, TOTAL_TEST_ELEMENTS);

    for (int i = 0; i < ELEMENTS_PER_TEST_CATAGORY; i++) {
        for (int j = 0; j < INPUT_VECTOR; j++) {
            int index = i * INPUT_VECTOR + j;
            B.row_indices[index] = i; B.column_indices[index] = j; B.values[index] = i;
        }
    }
    for (int i = 0; i < CATAGORY_PER_SCAN; i++) {
        for (int j = 0; j < ELEMENTS_PER_CATAGORY; j++) {
            int index = i * ELEMENTS_PER_CATAGORY + j;
            A.row_indices[index] = i; A.column_indices[index] = j; A.values[index] = i;
        }
    }
    /* cusp::print(A);
       cusp::print(B); */

    // test vector
    cusp::coo_matrix<long int, double, cusp::device_memory> A_d = A;
    cusp::coo_matrix<long int, double, cusp::device_memory> B_d = B;
    // allocate output vector
    cusp::coo_matrix<int, double, cusp::device_memory> y_d(CATAGORY_PER_SCAN, INPUT_VECTOR, CATAGORY_PER_SCAN * INPUT_VECTOR);

    cusp::multiply(A_d, B_d, y_d);
    cusp::coo_matrix<int, double, cusp::host_memory> y = y_d;

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
    printf("time elapsed %f ms\n", elapsedTime);
    return 0;
}
The cusp::multiply function uses only one GPU (as far as I understand). I would like to know:
1. How can I use cudaSetDevice() to run the same program on both GPUs (one cusp::multiply per GPU)?
2. How can I measure the total time accurately?
3. How can I use zero-copy pinned memory with this library, given that I can allocate such memory myself?
1. How can I use cudaSetDevice() to run the same program on both GPUs?
If you mean "How can I perform a single cusp::multiply operation using two GPUs", the answer is that you can't.
EDIT:
For the case where you want to run two separate CUSP sparse matrix-matrix products on different GPUs, it is possible to simply wrap the operation in a loop and call cudaSetDevice before the transfers and the cusp::multiply call. You will probably not, however, get any speedup by doing so. I think I am correct in saying that both the memory transfers and the cusp::multiply operations are blocking calls, so the host CPU will stall until they are finished. Because of this, the calls for different GPUs cannot overlap, and there will be no speedup over performing the same operation on a single GPU twice. If you were willing to use a multithreaded application and have a host CPU with multiple cores, you could probably still run them in parallel, but it won't be as straightforward host code as it seems you are hoping for.
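A minimal sketch of that loop (assuming two pre-built host-memory matrices A_h[2] and B_h[2]; the names and output sizing follow the question's pattern):
for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev); // select the GPU before transfers and multiply
    cusp::coo_matrix<int, double, cusp::device_memory> A_d = A_h[dev];
    cusp::coo_matrix<int, double, cusp::device_memory> B_d = B_h[dev];
    cusp::coo_matrix<int, double, cusp::device_memory> y_d(A_h[dev].num_rows,
        B_h[dev].num_cols, A_h[dev].num_rows * B_h[dev].num_cols);
    cusp::multiply(A_d, B_d, y_d); // blocking, so the GPUs run one after another
}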
2. How can I measure the total time accurately?
The cuda_event approach you have now is the most accurate way of measuring the execution time of a single kernel. If you had a hypothetical multi-GPU scheme, then the sum of the events from each GPU context would be the total execution time of the kernels. If, by total time, you mean the "wallclock" time to complete the operation, then you would need to use a host timer around the whole multi-GPU segment of your code. I vaguely recall that it might be possible in the latest versions of CUDA to synchronize between events in streams from different contexts in some circumstances, so a CUDA event based timer might still be usable in such a scenario.
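A wall-clock timing sketch for the whole multi-GPU segment (using C++11 <chrono>; synchronize every device before stopping the timer):
auto t0 = std::chrono::high_resolution_clock::now();
// ... run the work on all GPUs here ...
for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize(); // make sure this device has finished
}
auto t1 = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
printf("wallclock: %f ms\n", ms);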
3. How can I use zero-copy pinned memory with this library?
To the best of my knowledge that isn't possible. The underlying Thrust library which CUSP uses can support containers in zero-copy memory, but CUSP doesn't expose the necessary mechanisms in the standard matrix constructors to allocate a CUSP sparse matrix in zero-copy memory.

Odd "bank conflict"-type behavior with memcpy in CUDA global memory

I have distilled a performance issue down to the code shown below. This code takes an array of 128,000 64-byte structures ("Rule"s) and scatters them within another array. For example, if SCATTERSIZE is 10, then the code will copy ("scatter") 128,000 of these structures from the "small" array where they are stored contiguously at indices 0, 1, 2, ..., 127999, and place them at indices 0, 10, 20, 30, ..., 1279990 within the "big" array.
Here's what I can't figure out: On a device of compute capability 1.3 (Tesla C1060) performance suffers dramatically whenever SCATTERSIZE is a multiple of 16. And on a device of compute capability 2.0 (Tesla C2075) performance suffers quite a bit whenever SCATTERSIZE is a multiple of 24.
I don't think this can be a shared memory-bank thing, since I'm not using shared memory. And I don't think it can be related to coalescing. Using the commandline profiler and inspecting the "gputime" entry, I find a 300% increase in runtime on the 1.3 device, and a 40% increase in runtime on the 2.0 device, for the bad SCATTERSIZEs. I'm stumped. Here is the code:
#include <stdio.h>
#include <cuda.h>
#include <stdint.h>

typedef struct{
    float a[4][4];
} Rule;

#ifndef SCATTERSIZE
#define SCATTERSIZE 96
#endif

__global__ void gokernel(Rule* b, Rule* s){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    memcpy(&b[idx * SCATTERSIZE], &s[idx], sizeof(Rule));
}

int main(void){
    int blocksPerGrid = 1000;
    int threadsPerBlock = 128;
    int numThreads = blocksPerGrid * threadsPerBlock;
    printf("blocksPerGrid = %d, SCATTERSIZE = %d\n", blocksPerGrid, SCATTERSIZE);

    Rule* small;
    Rule* big;
    cudaError_t err = cudaMalloc(&big, numThreads * 128 * sizeof(Rule));
    printf("Malloc big: %s\n", cudaGetErrorString(err));
    err = cudaMalloc(&small, numThreads * sizeof(Rule));
    printf("Malloc small: %s\n", cudaGetErrorString(err));

    gokernel<<<blocksPerGrid, threadsPerBlock>>>(big, small);
    err = cudaThreadSynchronize();
    printf("Kernel launch: %s\n", cudaGetErrorString(err));
}
Because the implementation of __device__ memcpy is hidden (it is a compiler built-in), it's hard to say what the cause is exactly. One hunch (thanks to njuffa on this one) is that it is what's known as partition camping, where addresses from many threads are mapping to one or a few physical DRAM partitions rather than being spread across them.
On SM_1_2/SM_1_3 GPUs partition camping could be quite bad depending on the memory access stride, but this has been improved starting with SM_2_0 devices, which would explain why the effect is less pronounced there.
You can often work around this effect by adding some padding to the arrays to avoid the offending offsets, but depending on your computation it may not be worth it.
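For example, a padding sketch along those lines (the one-struct pad is arbitrary; any amount that moves the stride off the bad multiples should do, at the cost of allocating 'big' with the padded stride):
#define PADDED_STRIDE (SCATTERSIZE + 1)

__global__ void gokernel_padded(Rule* b, Rule* s){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The padded stride shifts each write off the partition-aligned offsets.
    memcpy(&b[idx * PADDED_STRIDE], &s[idx], sizeof(Rule));
}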
