Cannot understand how the JCuda cuLaunchKernel function works - jcuda

I am trying to understand how to use Cuda in Java. I am using jCuda.
Everything was fine until I came across an example containing the code:
// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{numElements}),
Pointer.to(deviceInputA),
Pointer.to(deviceInputB),
Pointer.to(deviceOutput)
);
The kernel function prototype is:
__global__ void add(int n, float *a, float *b, float *sum)
The question is:
In terms of C, doesn't it look as if we are passing something like this?
(***n, ***a, ***b, ***sum)
So basically, do we always have to have:
Pointer kernelParameters = Pointer.to( double pointer, double pointer, ...)???
Thank you

The cuLaunchKernel function of JCuda corresponds to the cuLaunchKernel function of CUDA. The signature of this function in CUDA is
CUresult cuLaunchKernel(
CUfunction f,
unsigned int gridDimX,
unsigned int gridDimY,
unsigned int gridDimZ,
unsigned int blockDimX,
unsigned int blockDimY,
unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void** kernelParams,
void** extra)
where the kernelParams is the only parameter that is relevant for this question. The documentation says
Kernel parameters can be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied.
The key point here is the last sentence: The elements of the kernelParams array are not the actual kernel parameters. They only point to the actual kernel parameters.
And indeed, this has the odd effect that for a kernel that receives a single float *pointer, you could basically set up the kernel parameters as follows:
float *pointer= allocateSomeDeviceMemory();
float** pointerToPointer = &pointer;
float*** pointerToPointerToPointer = &pointerToPointer;
void **kernelParams = (void**)pointerToPointerToPointer;
(This is just to make clear that this is indeed a pointer to a pointer to a pointer. In reality, you wouldn't write it like that.)
Now, the "structure" of the kernel parameters is basically the same for JCuda and for CUDA. Of course you can not take "the address of a pointer" in Java, but the number of indirections is the same. Imagine you have a kernel like this:
__global__ void example(int value, float *pointer)
In the CUDA C API, you can then define the kernel parameters as follows:
int value = 123;
float *pointer= allocateSomeDeviceMemory();
int* pointerToValue = &value;
float** pointerToPointer = &pointer;
void *kernelParams[] = {
pointerToValue,
pointerToPointer
};
The setup is done analogously in the JCuda Java API:
int value = 123;
Pointer pointer= allocateSomeDeviceMemory();
Pointer pointerToValue = Pointer.to(new int[]{value});
Pointer pointerToPointer = Pointer.to(pointer);
Pointer kernelParameters = Pointer.to(
pointerToValue,
pointerToPointer
);
The main difference that is relevant here is that you can write this a bit more concisely in C, using the address operator &:
void *kernelParams[] = {
&value, // This can be imagined as a pointer to an int
&pointer // This can be imagined as a pointer to a pointer
};
But this is basically the same as in the example that you provided:
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{value}), // A pointer to an int
Pointer.to(pointer) // A pointer to a pointer
);
Again, the key point is that with something like
void *kernelParams[] = {
&value,
};
or
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{value})
);
you are not passing the value to the kernel directly. Instead, you are telling CUDA: "Here is an array of pointers. The first pointer points to an int value. Copy the value from this memory location, and use it as the actual value for the kernel call".
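To make the copying step concrete, here is a small host-only sketch in C++. It does not call CUDA at all: copyKernelArguments is a hypothetical stand-in for what the launch does with kernelParams, and ExampleArgs and demoLaunch are names made up for this illustration. It only shows that each array entry is read as "the address of the argument", not as the argument itself.

```cpp
#include <cstring>

// Hypothetical stand-in for the arguments that
// __global__ void example(int value, float *pointer) would receive.
struct ExampleArgs {
    int value;
    float* pointer;
};

// Hypothetical model of the launch: every argument is copied out of
// the memory region that kernelParams[i] points to.
ExampleArgs copyKernelArguments(void** kernelParams) {
    ExampleArgs args{};
    std::memcpy(&args.value, kernelParams[0], sizeof(int));
    std::memcpy(&args.pointer, kernelParams[1], sizeof(float*));
    return args;
}

// Sets up the parameter array as in the answer and hands it to the model.
ExampleArgs demoLaunch(int value, float* pointer) {
    void* kernelParams[] = {
        &value,   // pointer to an int
        &pointer  // pointer to a pointer
    };
    return copyKernelArguments(kernelParams);
}
```

Calling demoLaunch(123, somePointer) gives back exactly 123 and somePointer, which is why the extra level of indirection does not change what the kernel sees.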

Related

assign pointer to constant pointer in c++

I have a constant pointer cp that points to A and a non-constant pointer p that points to B. I would say that I can assign cp to p, i.e. p=cp, because that way both cp and p point to A, and that I cannot do the opposite, cp=p, because that would say that cp should point to B, but cp is a constant pointer, so I cannot change what it points to.
I tried this simple code but the result is the opposite. Can someone please explain to me which version is correct?
std::vector<int> v;
v.push_back(0);
auto cp = v.cbegin(); // .cbegin() is constant
auto p = v.begin(); // .begin() is non constant
Now if I write cp=p the compiler doesn't report an error, but if I write p=cp it does.
cbegin() returns an iterator that acts like a pointer to something constant. You can change that iterator to point to something else of the same constant type.
You're confusing this with a pointer, which is constant, to something that is not.
This is hard to see here, but the difference is between
const int* cp; // pointer to a constant value, but can point to something else
int* const pc; // pointer is constant, value can change
const int* const cpc; // pointer cannot be changed, value it points to cannot be changed
You can never make a "pointer that points to something that's not const" point at something that is const, because that would mean you could change what is const by dereferencing the pointer!
const int value = 5; // can never change value
const int value2 = 10; // can never change value
const int* cp = &value; // our pointer to const int points at a const int
*cp = 6; // error: the value of something const can't be changed
cp = &value2; // fine, because we're pointing at a const int
int* p = &value; // error: trying to point a pointer to non-const to a const, which would allow us to:
*p = 7; // which should be illegal.
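The same rule can be checked at compile time. The following is only a sketch: the alias and constant names are made up for this example, and the "false" results for the iterator case rely on how the common standard library implementations define their iterators rather than on a hard guarantee in the standard.

```cpp
#include <type_traits>
#include <vector>

using Iter = std::vector<int>::iterator;
using ConstIter = std::vector<int>::const_iterator;

// Adding const to what is pointed at is allowed (cp = p) ...
constexpr bool kIterToConstIter = std::is_assignable<ConstIter&, Iter>::value;
constexpr bool kPtrToConstPtr = std::is_convertible<int*, const int*>::value;

// ... dropping it is not (p = cp).
constexpr bool kConstIterToIter = std::is_assignable<Iter&, ConstIter>::value;
constexpr bool kConstPtrToPtr = std::is_convertible<const int*, int*>::value;

// Runtime counterpart of the question's snippet.
int readThroughConstIterator(std::vector<int>& v) {
    ConstIter cp = v.cbegin();
    Iter p = v.begin();
    cp = p;     // fine: cp may be re-pointed, it just cannot modify the element
    // p = cp;  // would not compile: it would allow writing to a const element
    return *cp;
}
```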

Can I convert a non-const function argument to const and set the size of array?

Arrays require a constant to initialize the size. Hence, int iarr[10]
I thought I could possibly take a non-const argument and convert it to const then use it for an array size
int run(int const& size);
int run(int const& size)
{
const int csize = size;
constexpr int cesize = csize;
std::array<int, cesize> arr;
}
This, unfortunately doesn't work and I thought of using const_cast as
int run(int& size);
int run(int& size)
{
const int val = const_cast<int&>(size);
constexpr int cesize = val;
std::array<int, cesize> arr;
}
and this won't work either. I've read through a few SO posts to see if I can find anything
cannot-convert-argument-from-int-to-const-int
c-function-pass-non-const-argument-to-const-reference-parameter
what-does-a-const-cast-do-differently
Is there a way to ensure the argument is const when used as an initializer for the size of an array?
EDIT: I'm not asking why I can't initialize an array with a non-const. I'm asking how to initialize an array from a non-const function argument. Hence, initialize-array-size-from-another-array-value is not the question I am asking. I already know I can't do this but there may be a way and answer has been provided below.
std::array is a non-resizable container whose size is known at compile-time.
If you know your size values at compile-time, you can pass the value as a non-type template argument:
template <int Size>
int run()
{
std::array<int, Size> arr;
}
It can be used as follows:
run<5>();
Note that Size needs to be a constant expression.
If you do not know your sizes at compile-time, use std::vector instead of std::array:
int run(int size)
{
std::vector<int> arr;
arr.resize(size); // or `reserve`, depending on your needs
}
std::vector is a contiguous container that can be resized at run-time.
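Put side by side, the two options look roughly like this. This is a sketch; runFixed, runDynamic and doubled are names invented for the example. Anything that is a constant expression, including the result of a constexpr function, can be used as the template argument, while a plain function argument cannot.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Size known at compile time: it is part of the type.
template <std::size_t Size>
std::size_t runFixed() {
    std::array<int, Size> arr{};
    return arr.size();
}

// A constexpr function call still counts as a constant expression.
constexpr std::size_t doubled(std::size_t n) { return 2 * n; }

// Size known only at run time: use std::vector.
std::size_t runDynamic(std::size_t size) {
    std::vector<int> arr(size);  // size value-initialized elements
    return arr.size();
}
```

For instance, runFixed<doubled(3)>() compiles, but runFixed<size>() with a function parameter size does not.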
I'm asking how to initialize an array from a non-const function argument.
As you saw, it is not possible to initialize an array size with a variable, because you need to specify the size of the array at compile time.
To solve your problem you should use std::vector, which works like an array but can be resized at run time. You can handle the vector as if you were handling an array, using the operator [], for example:
class MyClass
{
vector<char> myVector;
public:
MyClass();
void resizeMyArray(int newSize);
char getCharAt(int index);
};
MyClass::MyClass():
myVector(0) // initialize the vector with 0 elements
{
}
void MyClass::resizeMyArray(int newSize)
{
myVector.clear();
myVector.resize(newSize, 0x00);
}
char MyClass::getCharAt(int index)
{
return myVector[index];
}
For more information check this link: http://www.cplusplus.com/reference/vector/vector/
Update: Also, consider that std::array can't be resized, as this link says:
Arrays are fixed-size sequence containers: they hold a specific number of elements ordered in a strict linear sequence.

Why can an rvalue reference be bound to a non-reference type?

int main()
{
int rx = 0;
int ry = std::move(rx); //here is the point of my question
int lx = 0;
int ly = &lx; //(2) obviously doesn't compile
std::cin.ignore();
}
I'm a little lost with this aspect of rvalues. I can't understand how we can bind std::move(rx) to ry: std::move(rx) is a reference to an rvalue, so I believed that this kind of expression could only be bound to a reference type, as is the case for lvalue references and as illustrated in (2).
References != address-of operator.
int& ly = lx; // reference
int* ly = &lx; // pointer
std::move obtains an rvalue reference to its argument and converts it to an xvalue. [1]
Which in turn can be copied to ry.
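The expression int ry = std::move(rx); does not "bind" rx to ry. std::move only casts rx to an rvalue, which tells the compiler that the contents of rx may be moved into ry. For an int, that move is just a copy and rx keeps its value; for a class type with a move constructor, the source is left in a valid but unspecified state.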
This is especially useful when functions return by value:
std::vector<int> foo() {
std::vector<int> v = {1,2,3,4};
return v;
}
std::vector<int> u = foo();
At return v the compiler notices that v is no longer needed and that it can actually use it directly as u without doing a deep copy of the vector contents.
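A small sketch of the difference between moving an int and moving a vector (MoveResult and demoMove are names made up for this example):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct MoveResult {
    int sourceInt;           // rx after the move
    int targetInt;           // ry
    bool bufferReused;       // did u take over v's storage?
    std::size_t targetSize;  // number of elements in u
};

MoveResult demoMove() {
    int rx = 7;
    int ry = std::move(rx);  // int has no move constructor: plain copy

    std::vector<int> v = {1, 2, 3, 4};
    const int* before = v.data();
    std::vector<int> u = std::move(v);  // move constructor takes over v's buffer

    return MoveResult{rx, ry, u.data() == before, u.size()};
}
```

Here rx is still 7 afterwards, while u ends up owning the buffer that v allocated, so no element is copied.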

Passing a struct including a pointer to another struct, to kernel in CUDA

I have two structs as
struct collapsed {
char **seq;
int num;
};
struct data {
collapsed *x;
int num;
int numblocks;
int *blocksizes;
float *regmult;
float *learnmult;
};
I am passing it to my kernel as follows:
__global__ void KERNEL(data* X,...){
...
collapsed x = X->x[0]; // GIVES CUDA_EXCEPTION_1: Lane Illegal Address
}
data X;
//init X
data *X_dev;
cudaMalloc((data **) & X_dev, sizeof(data));
cudaMemcpy(X_dev, &X, sizeof(data), cudaMemcpyHostToDevice);
KERNEL<<<...>>>(X_dev,...);
This code gives CUDA_EXCEPTION_1: Lane Illegal Address in the kernel code. What is wrong, or what is the right way to do it? Any ideas?
You're dereferencing a host pointer on the device.
X is a valid device pointer.
But when you copied the X struct to the device, you copied its member x along with it, and x holds a host pointer. When you dereference that pointer:
collapsed x = X->x[0];
^ this is dereferencing the x pointer
the device code throws an error.
More detail is given here as well as instructions on how to fix it.

Initialize device array in CUDA

How do I initialize device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize to any value except 0. The code for cudaMemset looks like below, where value is initialized to 5.
cudaMemset(devPtr,value,number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset ( void * devPtr,
int value,
size_t count
)
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking to happen is that each byte of devPtr will be set to 5. If devPtr was an array of integers, the result would be that each integer word has the value 84215045 (0x05050505). This is probably not what you had in mind.
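Since cudaMemset behaves like the standard memset in this respect, the number can be reproduced on the host without any CUDA involved. A minimal sketch (fillWordWithByte is a name made up for this example; a fixed 32-bit type is used so the result does not depend on the size of int):

```cpp
#include <cstdint>
#include <cstring>

// Fills every byte of a 32-bit word with the same byte value,
// which is what a byte-wise memset does to each integer in an array.
std::uint32_t fillWordWithByte(int value) {
    std::uint32_t word = 0;
    std::memset(&word, value, sizeof(word));  // value is converted to unsigned char
    return word;
}
```

fillWordWithByte(5) gives 0x05050505, i.e. 84215045, and only byte patterns such as 0 or 0xFF happen to produce a "sensible" integer.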
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
int tidx = threadIdx.x + blockDim.x * blockIdx.x;
int stride = blockDim.x * gridDim.x;
for(; tidx < nwords; tidx += stride)
devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway: using that API call results in a kernel launch which is not too different from what I posted above.
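To see why the loop in initKernel covers every element for any grid and block size, the indexing can be replayed on the host. This is only a sequential model written for illustration (simulateInitKernel and countEqual are made-up names, and real threads run concurrently rather than in nested loops), but the index arithmetic is the same as in the kernel above.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of initKernel: the two outer loops enumerate the
// simulated (blockIdx.x, threadIdx.x) pairs, the inner loop is the
// loop each thread runs in the kernel.
template <typename T>
void simulateInitKernel(std::vector<T>& data, T val,
                        std::size_t gridDim, std::size_t blockDim) {
    const std::size_t stride = blockDim * gridDim;  // total number of threads
    for (std::size_t blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (std::size_t threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            for (std::size_t tidx = threadIdx + blockDim * blockIdx;
                 tidx < data.size(); tidx += stride) {
                data[tidx] = val;
            }
        }
    }
}

// Counts how many elements hold the expected value.
template <typename T>
std::size_t countEqual(const std::vector<T>& data, T val) {
    std::size_t count = 0;
    for (const T& x : data) {
        if (x == val) ++count;
    }
    return count;
}
```

With 1000 elements and only 2 blocks of 32 threads, each simulated thread handles elements tidx, tidx + 64, tidx + 128 and so on, and all 1000 elements still end up initialized.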
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing for 16-bit and 32-bit word types. If you need to set 64-bit or larger types (so doubles or vector types), your best option is to use your own kernel.
I also needed a solution to this question and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid blocks with for(; tidx < nwords; tidx += stride), nor, for that matter, the kernel invocation or the counter-intuitive use of word sizes.
Therefore I created a much simpler monolithic generic kernel and customized it with strides i.e. you may use it to initialize a matrix in multiple ways e.g. set rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
const size_t n, const size_t incx) {
int tid = threadIdx.x + blockDim.x * blockIdx.x;
if (tid*incx < n) {
a[tid*incx] = value;
}
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
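const size_t count = (n + incx - 1) / incx; // elements touched: indices 0, incx, 2*incx, ... below n
int number_of_blocks = (count + BLOCK_SIZE - 1) / BLOCK_SIZE; // BLOCK_SIZE is assumed to be defined elsewhere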
dim3 gridDim(number_of_blocks, 1);
dim3 blockDim(BLOCK_SIZE, 1);
kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
