Allocation and usage of a CUDA device variable in different functions - memory-management

I am quite new to CUDA and I have a question regarding memory management for an object. I have one member function that loads the data to the device, and when another member function is called, the computation is carried out.
I have read some parts of the NVIDIA programming guide and some SO questions, but they do the data copying and the computing in a single function, so there is no need for multiple functions there.
Some more specifications:
The data is read once. I do not know the data size at compile time, therefore I need a dynamic allocation. My current device has compute capability 2.1 (it will soon be upgraded to 6.1).
I want to copy the data in a first function and use the data in a different function. For example:
__constant__ int dev_size;
__device__ float* dev_data; // <- not sure about this
/* kernel */
__global__ void computeSomething(float* dev_output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < dev_size)
    {
        dev_output[idx] = dev_data[idx] * 100; // some computation
    }
}
// function 1
void OBJECT::copyVolumeToGPU(int size, float* data)
{
    cudaMalloc(&dev_data, size * sizeof(float));
    cudaMemcpy(dev_data, data, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(dev_size, &size, sizeof(int)); // source argument must be a pointer
}
// function 2
void OBJECT::computeSmthOnDevice(int size)
{
    // allocate output array
    auto host_output = new float[size];
    float* dev_output;
    cudaMalloc(&dev_output, size * sizeof(float));
    int block = 256;
    int grid = (size + block - 1) / block; // integer round-up
    computeSomething<<<grid, block>>>(dev_output);
    cudaMemcpy(host_output, dev_output, size * sizeof(float), cudaMemcpyDeviceToHost);
    /* ... do something with output ... */
    delete[] host_output;
    cudaFree(dev_output);
}
Error checking (gpuErrChk) is done as described in https://stackoverflow.com/a/14038590/3921660, but it is omitted in this example.
Can I copy the data using a __device__ pointer (like __device__ float* dev_data;)?

Generally, your idea is workable, but this:
cudaMalloc(&dev_data, size * sizeof(float));
is not legal. It is not legal to take the address of a __device__ item in host code. So if you know the size at compile time, the easiest approach is to convert this to a static allocation, e.g.
__device__ float dev_data[1000];
If you really want to make this a dynamically allocated __device__ pointer, then you will need to use a method such as described here, which involves using cudaMalloc on an ordinary device pointer in host code that acts as a "temporary", then copying that "temporary" pointer to the __device__ pointer via cudaMemcpyToSymbol. And when you want to copy data to/from that particular allocation, you would use cudaMemcpy to/from the temporary pointer in host code.
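A minimal sketch of that method, reusing the declarations and names from the question (error checking omitted, as above):
void OBJECT::copyVolumeToGPU(int size, float* data)
{
    float* temp = nullptr;                   // "temporary" device pointer living in host code
    cudaMalloc(&temp, size * sizeof(float)); // legal: temp is an ordinary pointer
    cudaMemcpy(temp, data, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(dev_data, &temp, sizeof(float*)); // store the pointer *value* in the __device__ symbol
    cudaMemcpyToSymbol(dev_size, &size, sizeof(int));
    // keep temp around (or read it back later with cudaMemcpyFromSymbol)
    // for subsequent cudaMemcpy/cudaFree calls on this allocation
}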
Note that for the purposes of "communicating" data from one function to the next, or one kernel to the next, there's no reason you couldn't just use an ordinary dynamically allocated pointer from cudaMalloc, and pass that pointer around to wherever you need it. You can even pass it via a global variable to any host function that needs it, like an ordinary global pointer. For kernels, however, you would still need to pass such a global pointer to the kernel via a kernel argument.
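For example, a minimal sketch of that simpler pattern, reusing the question's names (the kernel now receives the data pointer and size as arguments):
float* dev_data = nullptr; // ordinary global (or member) pointer in host code

__global__ void computeSomething(const float* data, float* dev_output, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
        dev_output[idx] = data[idx] * 100; // some computation
}

void OBJECT::copyVolumeToGPU(int size, float* data)
{
    cudaMalloc(&dev_data, size * sizeof(float)); // legal: dev_data is a plain pointer
    cudaMemcpy(dev_data, data, size * sizeof(float), cudaMemcpyHostToDevice);
}
// later, in computeSmthOnDevice:
// computeSomething<<<grid, block>>>(dev_data, dev_output, size);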

Related

Create array of struct scatterlist from buffer

I am trying to build an array of type "struct scatterlist" from a buffer pointed to by a kernel virtual address (I know the byte size of the buffer, but it may be large). Ideally I would like to have a function like init_sg_array_from_buf:
void my_function(void *buffer, int buffer_length)
{
    struct scatterlist *sg;
    int sg_count;
    sg_count = init_sg_array_from_buf(buffer, buffer_length, sg);
}
Which function in the scatterlist API does something similar? Currently the only possibility I see is to manually determine the number of pages spanned by the buffer, as sketched below. Windows has a kernel macro called "ADDRESS_AND_SIZE_TO_SPAN_PAGES", but I didn't even manage to find something like that in the Linux kernel.
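A minimal sketch of that manual approach, assuming a lowmem, virtually contiguous buffer (init_sg_array_from_buf is the hypothetical helper named above, not an existing scatterlist API function):
#include <linux/mm.h>
#include <linux/scatterlist.h>

static int init_sg_array_from_buf(void *buffer, int buffer_length,
                                  struct scatterlist *sg)
{
    unsigned int offset = offset_in_page(buffer);
    /* page count, analogous to ADDRESS_AND_SIZE_TO_SPAN_PAGES */
    int nr_pages = DIV_ROUND_UP(offset + buffer_length, PAGE_SIZE);
    int i;

    sg_init_table(sg, nr_pages);
    for (i = 0; i < nr_pages; i++) {
        unsigned int len = min_t(unsigned int, buffer_length,
                                 PAGE_SIZE - offset);

        sg_set_page(&sg[i], virt_to_page(buffer), len, offset);
        buffer += len;
        buffer_length -= len;
        offset = 0;
    }
    return nr_pages;
}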

Structure defined in a dynamically loaded library

I am dynamically loading cudart (the CUDA runtime library) to access just the cudaGetDeviceProperties function. It requires two arguments:
a cudaDeviceProp structure, which is defined in a header of the runtime library;
an integer which represents the device ID.
I am not including the cuda_runtime.h header, in order not to pull in extra constants, macros, enums, classes, etc. that I do not want to use.
However, I need the cudaDeviceProp structure. Is there a way to get it without redefining it? I wrote the following code:
struct cudaDeviceProp;
class CudaRTGPUInfoDL
{
    typedef int (*CudaDriverVersion)(int*);
    typedef int (*CudaRunTimeVersion)(int*);
    typedef int (*CudaDeviceProperties)(cudaDeviceProp*, int);
public:
    struct Properties
    {
        char name[256];           /**< ASCII string identifying device */
        size_t totalGlobalMem;    /**< Global memory available on device in bytes */
        size_t sharedMemPerBlock; /**< Shared memory available per block in bytes */
        int regsPerBlock;         /**< 32-bit registers available per block */
        int warpSize;             /**< Warp size in threads */
        size_t memPitch;          /**< Maximum pitch in bytes allowed by memory copies */
        /* ... tons of members follow ... */
    };
public:
    CudaRTGPUInfoDL();
    ~CudaRTGPUInfoDL();
    int getCudaDriverVersion();
    int getCudaRunTimeVersion();
    const Properties& getCudaDeviceProperties();
private:
    QLibrary library;
private:
    CudaDriverVersion cuDriverVer;
    CudaRunTimeVersion cuRTVer;
    CudaDeviceProperties cuDeviceProp;
    Properties properties;
};
As you can see, I simply "copy-pasted" the declaration of the structure.
In order to get the GPU properties, I simply use this method:
const CudaRTGPUInfoDL::Properties& CudaRTGPUInfoDL::getCudaDeviceProperties()
{
    // Unsafe, but needed.
    cuDeviceProp(reinterpret_cast<cudaDeviceProp*>(&properties), 0);
    return properties;
}
Thanks for your answers.
If you need the structure to be complete, you should define it (probably by including the appropriate header).
If you're just going to be passing around references or pointers, such as in the method you show, then it doesn't need to be complete and can just be forward declared:
struct cudaDeviceProp;
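A brief illustration of what the incomplete type does and does not allow (hypothetical code, not from the question):
struct cudaDeviceProp; // incomplete (forward-declared) type

typedef int (*CudaDeviceProperties)(cudaDeviceProp*, int);

int queryDevice(CudaDeviceProperties fn, cudaDeviceProp* props)
{
    return fn(props, 0); // fine: only the pointer value is passed around
    // props->warpSize;  // error: member access requires the complete type
}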

dma_common_mmap documentation to let user read/write physical address

I am trying to write a Linux kernel module that maps some address back to the user using dma_common_mmap(). I then want the user to mmap it and read/write the address space.
My main problem is that I can't find any documentation for dma_common_mmap(); does it exist? I have googled but couldn't find out how to use it to let the user read/write the address.
The documentation for dma_common_mmap() doesn't exist. But you can look at the Doxygen-style comment for the dma_mmap_attrs() function:
/**
 * dma_mmap_attrs - map a coherent DMA allocation into user space
 * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
 * @vma: vm_area_struct describing requested user mapping
 * @cpu_addr: kernel CPU-view address returned from dma_alloc_attrs
 * @handle: device-view address returned from dma_alloc_attrs
 * @size: size of memory originally requested in dma_alloc_attrs
 * @attrs: attributes of mapping properties requested in dma_alloc_attrs
 *
 * Map a coherent DMA buffer previously allocated by dma_alloc_attrs
 * into user space. The coherent DMA buffer must not be freed by the
 * driver until the user space mapping has been released.
 */
static inline int
dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma, void *cpu_addr,
               dma_addr_t dma_addr, size_t size, struct dma_attrs *attrs)
{
    struct dma_map_ops *ops = get_dma_ops(dev);

    BUG_ON(!ops);
    if (ops->mmap)
        return ops->mmap(dev, vma, cpu_addr, dma_addr, size, attrs);
    return dma_common_mmap(dev, vma, cpu_addr, dma_addr, size);
}
#define dma_mmap_coherent(d, v, c, h, s) dma_mmap_attrs(d, v, c, h, s, NULL)
dma_mmap_attrs() in turn calls dma_common_mmap(), so all of the documentation (except for the attrs param) applies to dma_common_mmap() as-is.
EDIT
I think you should use dma_mmap_coherent() (along with dma_alloc_coherent()), which does pretty much the same as dma_common_mmap() (see the code above). See this example to get some idea of how to use it, both on the kernel side and in user space. See also how dma_mmap_coherent() is used in the ALSA kernel code, in the snd_pcm_lib_default_mmap() function.
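To make the shape of this concrete, here is a minimal sketch (my assumptions, not taken from the linked example) of a char device's .mmap handler that exposes a coherent DMA buffer to user space:
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

static struct device *my_dev;    /* set in probe() */
static void *my_cpu_addr;        /* allocated in probe() with:              */
static dma_addr_t my_dma_handle; /* dma_alloc_coherent(my_dev, MY_BUF_SIZE, */
#define MY_BUF_SIZE PAGE_SIZE    /*     &my_dma_handle, GFP_KERNEL)         */

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    /* User space mmap()s the device file, then reads/writes the buffer. */
    return dma_mmap_coherent(my_dev, vma, my_cpu_addr, my_dma_handle,
                             MY_BUF_SIZE);
}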

Why doesn't boost::lockfree::spsc_queue have emplace?

The regular std::vector has emplace_back, which avoids an unnecessary copy. Is there a reason spsc_queue doesn't support this? Is it impossible to do emplace with lock-free queues for some reason?
I'm not a Boost library implementer or maintainer, so the rationale for not including an emplace member function is beyond my knowledge, but it isn't too difficult to implement yourself if you really need it.
The spsc_queue has a base class of either compile_time_sized_ringbuffer or runtime_sized_ringbuffer, depending on whether the size of the queue is known at compile time. These two classes maintain the actual buffer used, with the obvious differences between a dynamic buffer and a compile-time buffer, but in this case they delegate their push member functions to a common base class - ringbuffer_base.
The ringbuffer_base::push function is relatively easy to grok:
bool push(T const & t, T * buffer, size_t max_size)
{
    const size_t write_index = write_index_.load(memory_order_relaxed); // only written from push thread
    const size_t next = next_index(write_index, max_size);
    if (next == read_index_.load(memory_order_acquire))
        return false; /* ringbuffer is full */
    new (buffer + write_index) T(t); // copy-construct
    write_index_.store(next, memory_order_release);
    return true;
}
The index of the slot where the next item should be stored is read with a relaxed load (which is safe, since the intended use of this class is a single producer thread calling push), and the appropriate next index is computed. The function then checks that everything is in bounds (with a load-acquire for appropriate synchronization with the thread that calls pop), but the main statement we're interested in is:
new (buffer + write_index) T(t); // copy-construct
which performs a placement-new copy construction into the buffer. There's nothing inherently thread-unsafe about instead passing along some parameters and constructing a T directly from viable constructor arguments. I wrote the following snippet and made the necessary changes throughout the derived classes to appropriately delegate the work up to the base class:
template <typename... Args>
std::enable_if_t<std::is_constructible<T, Args...>::value, bool>
emplace(T * buffer, size_t max_size, Args&&... args)
{
    const size_t write_index = write_index_.load(memory_order_relaxed); // only written from push thread
    const size_t next = next_index(write_index, max_size);
    if (next == read_index_.load(memory_order_acquire))
        return false; /* ringbuffer is full */
    new (buffer + write_index) T(std::forward<Args>(args)...); // emplace
    write_index_.store(next, memory_order_release);
    return true;
}
Perhaps the only differences are making sure that the arguments passed in Args... can actually be used to construct a T, and of course doing the emplacement via std::forward instead of a copy construction.
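Hypothetical usage, assuming the derived-class plumbing has been added so that spsc_queue itself exposes emplace():
struct Event {
    int id;
    std::string name;
    Event() = default;
    Event(int i, std::string n) : id(i), name(std::move(n)) {}
};

boost::lockfree::spsc_queue<Event, boost::lockfree::capacity<128>> queue;
queue.emplace(42, "sensor"); // constructs Event(42, "sensor") in place; no temporary Event is copied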

How to use arrays in program (global) scope in OpenCL

AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized, globally scoped, and in the constant address space (as specified in section 6.5.3 of the OpenCL specification). If the size of an array is below 64 kB, it is placed in hardware constant buffers; otherwise, it uses global memory. An example of this is a lookup table for math functions.
I want to use such a "globally scoped constant array". I have this code in pure C:
#define SIZE 101
int *reciprocal_table;
int reciprocal(int number){
    return reciprocal_table[number];
}
void kernel(int *output)
{
    for(int i = 0; i < SIZE; i++)
        output[i] = reciprocal(i);
}
I want to port it to OpenCL:
__kernel void kernel(__global int *output){
    int gid = get_global_id(0);
    output[gid] = reciprocal(gid);
}
int reciprocal(int number){
    return reciprocal_table[number];
}
What should I do with the global variable reciprocal_table? If I try to add __global or __constant to it, I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from the kernel to reciprocal. Is it possible to initialize the global variable somehow? I know that I can hard-code it into the source, but does another way exist?
P.S. I'm using AMD OpenCL
UPD: The above code is just an example. My real code is much more complex, with a lot of functions, so I want an array in program scope that can be used in all of them.
UPD2: Changed the example code and added a citation from the Programming Guide.
#define SIZE 2
int constant array[SIZE] = {0, 1};
kernel void
foo (global int* input,
     global int* output)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initialization of the array, but then you would not know what's in the array, and since it's in the constant address space, you could not assign any values to it.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE: Now that I fully understand the question:
One thing that worked for me is to define the array in the constant address space and then override it by passing a kernel parameter (constant int* foo in the code below) with the same name.
That produced correct results only on the GPU device. The AMD CPU device and the Intel CPU device did not override the array's address. It is also probably not compliant with the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};
int
baz (int i)
{
    return foo[i];
}
kernel void
bar (global int* input,
     global int* output,
     constant int* foo)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1}, this produces {2, 4} on my HD 7850 device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OCL implementation (AMD, Intel). So I cannot stress enough how much I personally would NOT do this; it's only a matter of time before it breaks.
Another way to achieve this would be to generate header (.h) files on the host at runtime with the definition of the array (or predefine them) and pass them to the kernel upon compilation via a compiler option, as sketched below. This, of course, requires recompilation of the clProgram/clKernel for every different LUT.
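A minimal sketch of that compiler-option route (the value list here is just an illustration):
// Host side: bake the table definition into the build options, then
// rebuild the program whenever the LUT changes.
char options[256];
snprintf(options, sizeof(options), "-DSIZE=%d -DLOOKUP=%d,%d", 2, 0, 1);
clBuildProgram(program, 1, &device, options, NULL, NULL);

// Kernel side:
// __constant int reciprocal_table[SIZE] = { LOOKUP };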
I struggled to get this to work in my own program some time ago.
I did not find any way to initialize a constant or global scope array from the host via clEnqueueWriteBuffer or the like. The only way is to write it explicitly in your .cl source file.
So my trick to initialize it from the host is to use the fact that you actually compile your OpenCL source from the host, which also means you can alter your src.cl source before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory)
double func(int idx) {
    return lookup[idx];
}
__kernel void ker1(__global double *in, __global double *out)
{
    /* ... do something ... */
    double t = func(i);
    /* ... */
}
Notice that the lookup table is initialized with the LOOKUP placeholder.
Then, in the host program, before compiling your OpenCL code:
compute the values of my lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc(10000);
int count = sprintf(buf, "#define LOOKUP "); // actual source generation!
for (int i = 0; i < SIZE; i++)
    count += sprintf(buf + count, "%g, ", host_values[i]);
count += sprintf(buf + count, "\n");
then read the content of your source file src.cl and place it right at buf+count.
you now have a source file with an explicitly defined lookup table that you just computed from the host.
compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, err);
voilà!
It looks like "array" is a look-up table of sorts. You'll need clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.
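A minimal sketch of that approach (context, queue, kernel, and host_table are assumed to already exist; the table then arrives in the kernel as a constant int* argument):
cl_int err;
cl_mem lut = clCreateBuffer(context, CL_MEM_READ_ONLY,
                            SIZE * sizeof(cl_int), NULL, &err);
err = clEnqueueWriteBuffer(queue, lut, CL_TRUE, 0, SIZE * sizeof(cl_int),
                           host_table, 0, NULL, NULL);
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &lut);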
