CUDA dynamic parallelism: Access child kernel results in global memory

CUDA dynamic parallelism: Access child kernel results in global memory - memory-management

I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:
int aPayloads[32];
// Compute aPayloads start values here
int* aGlobalPayloads = nullptr;
cudaMalloc(&aGlobalPayloads, (sizeof(int) *32));
cudaMemcpyAsync(aGlobalPayloads, aPayloads, (sizeof(int)*32), cudaMemcpyDeviceToDevice));
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
// Access results in payload array here
Assuming that I do things right so far, what is the fastest way to access the results in aGlobalPayloads after kernel execution? (I tried cudaMemcpy() to copy aGlobalPayloads back to aPayloads but cudaMemcpy() is not allowed in device code).

You can directly access the data in aGlobalPayloads from your parent kernel code, without any copying:
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
int myval = aGlobalPayloads[0];
I'd encourage careful error checking (Read the whole accepted answer here). You do it in device code the same way as in host code. The programming guide states: "May not pass in local or shared memory pointers". Your usage of aPayloads is a local memory pointer.
If for some reason you want that data to be explicitly put back in your local array, you can use in-kernel memcpy for that:
memcpy(aPayloads, aGlobalPayloads, sizeof(int)*32);
int myval = aPayloads[0]; // retrieves the same value
(that is also how I would fix the issue I mention in item 2 - use in-kernel memcpy)

Related

Using Thrust Functions with raw pointers: Controlling the allocation of memory

I have a question regarding the thrust library when using CUDA.
I am using a thrust function, i.e. exclusive_scan, and I want to use raw pointers. I am using raw (device) pointers because I want to have full control of when the memory is allocated and deallocated.
After the function call, I will hand over the pointer to another data structure and then free the memory in either the destructor of this data structure, or in the next function call, when I recompute my (device) pointers. I came across for example this problem here now, which recommends to wrap the data structure in a device_vector. But then I run into the problem that the memory is freed once my device_vector goes out of scope, which I do not want. Having the device pointer globally is also not an option, since I am hacking code, i.e. it is used as a buffer and I would have to rewrite a lot if I wanted to do something like that.
Does anyone have a good workaround regarding this? The only chance I do see right now is to rewrite the thrust-function on my own, only using raw device-pointers.
EDIT: I misread, I can wrap it in a device_ptr instead of a device_vector.
Asking further though, how could I solve this if there wasn't the option of using a device_ptr?

There is no problem using plain pointers in thrust methods.
For data on the device do:
....
struct DoSomething {
__device__ int operator()(int item) { return 1; }
};
int* IntData;
cudaMalloc(&IntData, sizeof(int) * count);
auto dev_data = device_pointer_cast(IntData);
thrust::generate(dev_data, dev_data + count, DoSomething());
thrust::sort(dev_data, dev_data + count);
....
cudaFree(IntData);
For data on the host use plain malloc/free and raw_pointer_cast instead of device_pointer_cast.
See: thrust: Memory management

Allocate swappable memory in linux kernel

Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?

You can create a file in the internal shm shared memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(filp)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);

Answer to your question is a simple No, or Yes with a complex modification to kernel source.
First, to enable swapping out, you have to ask yourself what is happening when kswapd is swapping out. Essentially it will walk through all the processes and make a decision whether its memory can be swapped out or not. And all these memory have the hardware mode of ring 3. So SMAP essentially forbid it from being read as data or executed as program in the kernel (ring 0):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
And check your distros "CONFIG_X86_SMAP", for mine Ubuntu it is default to "y" which is the case for past few years.
But if you keep your memory as a kernel address (ring 0), then you may need to consider changing the kswapd operation to trigger swapout of kernel addresses. Whick kernel addresses to walk first? And what if the address is part of the kswapd's kernel operation? The complexities involved is huge.
And next is to consider the swap in operation: When the memory read is attempted and it's "not present" bit is enabled, then hardware exception will trigger linux kernel memory fault handler (which is __do_page_fault()).
And looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and there after how it handler the kernel addresses (do_kern_address_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
which essentially is just reporting as error for possible scenario. If you want to enable kernel address pagefaulting, then this path has to be modified.
And note too that the SMAP check (inside smap_violation) is done in the user address pagefaulting (do_usr_addr_fault()).

Successfully de-referenced userspace pointer in kernel space without using copy_from_user()

There's a bug in a driver within our company's codebase that's been there for years.
Basically we make calls to the driver via ioctls. The data passed between userspace and driver space is stored in a struct, and the pointer to the data is fed into the ioctl. The driver is responsible to dereference the pointer by using copy_from_user(). But this code hasn't been doing that for years, instead just dereferencing the userspace pointer. And so far (that I know of) it hasn't caused any issues until now.
I'm wondering how this code hasn't caused any problems for so long? In what case will dereferencing a pointer in kernel space straight from userspace not cause a problem?
In userspace
struct InfoToDriver_t data;
data.cmd = DRV_SET_THE_CLOCK;
data.speed = 1000;
ioctl(driverFd, DEVICE_XX_DRIVER_MODIFY, &data);
In the driver
device_xx_driver_ioctl_handler (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg)
{
struct InfoToDriver_t *user_data;
switch(cmd)
{
case DEVICE_XX_DRIVER_MODIFY:
// what we've been doing for years, BAD
// But somehow never caused a kernel oops until now
user_data = (InfoToDriver_t *)arg;
if (user_data->cmd == DRV_SET_THE_CLOCK)
{ .... }
// what we're supposed to do
copy_from_user(user_data, (void *)arg, sizeof(InfoToDriver_t));
if (user_data->cmd == DRV_SET_THE_CLOCK)
{ ... }

A possible answer is, this depends on the architecture. As you have seen, on a sane architecture (such as x86 or x86-64) simply dereferencing __user pointers just works. But Linux pretends to support every possible architecture, there are architectures where simple dereference does not work. Otherwise copy_to/from_user won't existed.
Another reason for copy_to/from_user is possibility that usermode side modifies its memory simultaneously with the kernel side (in another thread). You cannot assume that the content of usermode memory is frozen while accessing it from kernel. Rougue usermode code can use this to attack the kernel. For example, you can probe the pointer to output data before executing the work, but when you get to copy the result back to usermode, this pointer is already invalid. Oops. The copy_to_user API ensures (should ensure) that the kernel won't crash during the copy, instead the guilty application will be killed.
A safer approach is to copy the whole usermode data structure into kernel (aka 'capture'), check this copy for consistency.
The bottom line... if this driver is proven to work well on certain architecture, and there are no plans to port it, there's no urgency to change it. But check carefully robustness of the kernel code, if capture of the usermode data is needed, or problem may arise during copying from usermode.

unique_ptr heap and stack allocation

Raw pointers can point to objects allocated on the stack or on the heap.
Heap allocation example:
// heap allocation
int* rawPtr = new int(100);
std::cout << *rawPtr << std::endl; // 100
Stack allocation example:
int i = 100;
int* rawPtr = &i;
std::cout << *rawPtr << std::endl; // 100
Heap allocation using auto_ptr example:
int* rawPtr = new int(100);
std::unique_ptr<int> uPtr(rawPtr);
std::cout << *uPtr << std::endl; // 100
Stack allocation using auto_ptr example:
int i = 100;
int* rawPtr = &i;
std::unique_ptr<int> uPtr(rawPtr); // runtime error
Are 'smart pointers' intended to be used to point to dynamically created objects on the heap? For C++11, are we supposed to continue using raw pointers for pointing to stack allocated objects? Thank you.

Smart pointers are usually used to point to objects allocated with new and deleted with delete. They don't have to be used this way, but that would seem to be the intent, if we want to guess the intended use of the language constructs.
The reason your code crashes in the last example is because of the "deleted with delete" part. When it goes out of scope, the unique_ptr will try to delete the object it has a pointer to. Since it was allocated on the stack, this fails. Just as if you had written, delete rawPtr;
Since one usually uses smart pointers with heap objects, there is a function to allocate on the heap and convert to a smart pointer all in one go. std::unique_ptr<int> uPtr = make_unique<int>(100); will perform the actions of the first two lines of your third example. There is also a matching make_shared for shared pointers.
It is possible to use smart pointers with stack objects. What you do is specify the deleter used by the smart pointer, providing one that does not call delete. Since it's a stack variable and nothing need be done to delete it, the deleter could do nothing. Which makes one ask, what's the point of the smart pointer then, if all it does is call a function that does nothing? Which is why you don't commonly see smart pointers used with stack objects. But here's an example that shows some usefulness.
{
char buf[32];
auto erase_buf = [](char *p) { memset(p, 0, sizeof(buf)); };
std::unique_ptr<char, decltype(erase_buf)> passwd(buf, erase_buf);
get_password(passwd.get());
check_password(passwd.get());
}
// The deleter will get called since passwd has gone out of scope.
// This will erase the memory in buf so that the password doesn't live
// on the stack any longer than it needs to. This also works for
// exceptions! Placing memset() at the end wouldn't catch that.

The runtime error is due to the fact that delete was called on a memory location that was never allocated with new.
If an object has already been created with dynamic storage duration (typically implemented as creation on a 'heap') then a 'smart pointer' will not behave correctly as demonstrated by the runtime error.
Are 'smart pointers' intended to be used to point to dynamically
created objects on the heap? For C++11, are we supposed to continue
using raw pointers for pointing to stack allocated objects?
As for what one is supposed to do, well, it helps to think of the storage duration and specifically how the object was created.
If the object has automatic storage duration (stack) then avoid taking the address and use references. The ownership does not belong with the pointer and a reference makes the ownership clearer.
If the object has dynamic storage duration (heap) then a smart pointer is the way to go as it can then manage the ownership.
So for the last example, the following would be better (pointer owns the int):
auto uPtr = std::make_unique<int>(100);
The uPtr will have automatic storage duration and will call the destructor when it goes out of scope. The int will have dynamic storage duration (heap) and will be deleteed by the smart pointer.
One could generally avoid using new and delete and avoid using raw pointers. With make_unique and make_shared, new isn't required.

Are 'smart pointers' intended to be used to point to dynamically created objects on the heap?
They are intended for heap-allocated objects to prevent leaks.
The guideline for C++ is to use plain pointers to refer to a single object (but not own it). The owner of the object holds it by value, in a container or via a smart pointer.

Are 'smart pointers' intended to be used to point to dynamically created objects on the heap?
Yes, but that's just the default. Notice that std::unique_ptr has a constructor (no. (3)/(4) on that page) which takes a pointer that you have obtained somehow, and a "deleter" that you provide. In this case the unique pointer will not do anything with the heap (unless your deleter does so).
For C++11, are we supposed to continue using raw pointers for pointing to stack allocated objects? Thank you.
You should use raw pointers in code that does not "own" the pointer - does not need to concern itself with allocation or deallocation; and that is regardless of whether you're pointing into the heap or the stack or elsewhere.
Another place to use it is when you're implementing some class that has a complex ownership pattern, for protected/private members.
PS: Please, forget about std::auto_ptr... pretend it never existed :-)

CUDA: can thread creates separate copy of all the data?

I have very basic question which i fail to understand after going through documents. I am facing this issue while executing one of my project as the output i get is totally corrupted and i believe problem is either with memory allocation or with thread sync.
ok the question is:
Can every thread creates separate copy of all the variables and pointers passed to the kernal function ? or it just creates copy of variable but the pointers we pass that memory is shared amoung all threads.
e.g.
int main()
{
const int DC4_SIZE = 3;
const int DC4_BYTES = DC4_SIZE * sizeof(float);
float * dDC4_in;
float * dDC4_out;
float hDC4_out[DC4_SIZE];
float hDC4_out[DC4_SIZE];
gpuErrchk(cudaMalloc((void**) &dDC4_in, DC4_BYTES));
gpuErrchk(cudaMalloc((void**) &dDC4_out, DC4_BYTES));
// dc4 initialization function on host which allocates some values to DC4[] array
gpuErrchk(cudaMemcpy(dDC4_in, hDC4_in, DC4_BYTES, cudaMemcpyHostToDevice));
mykernel<<<10,128>>>(VolDepth,dDC4_in);
cudaMemcpy(hDC4_out, dDC4_out, DC4_BYTES, cudaMemcpyDeviceToHost);
}
__global__ void mykernel(float VolDepth,float * dDC4_in,float * dDC4_out)
{
for(int index =0 to end)
dDC4_out[index]=dDC4_in[index] * VolDepth;
}
so i am passing dDC4_in and dDC4_out pointers to GPU with dDC4_in initialized with some values and computing dDC4_out and copying back to host,
so does my all 1280 threads will have separate dDC4_in/out copies or they all will work on same copy on GPU overwriting the values of other threads?

global memory is shared by all threads in a grid. The parameters you pass to your kernel (that you've allocated with cudaMalloc) are in the global memory space.
Threads do have their own memory (local memory), but in your example dDC4_in and dDC4_out are shared by all of your threads.
As a general run-down (taken from the CUDA Best Practices documentation):
On the DRAM side:
Local memory (and registers) is per-thread, shared memory is per-block, and global, constant, and texture are per-grid.
In addition, global/constant/texture memory can be read and modified on the host, while local and shared memory are only around for the duration of your kernel. That is, if you have some important information in your local or shared memory and your kernel finishes, that memory is reclaimed and your information lost. Also, this means that the only way to get data into your kernel from the host is via global/constant/texture memory.
Anyways, in your case it's a bit hard to recommend how to fix your code, because you don't take threads into account at all. Not only that, in the code you posted, you're only passing 2 arguments to your kernel (which takes 3 parameters), so it's no surprise your results are somewhat lacking. Even if your code were valid, you would have every thread looping from 0 to end and writing the to the same location in memory (which would be serialized, but you wouldn't know which write would be the last one to go through). In addition to that race condition, you have every thread doing the same computation; each of your 1280 threads will execute that for loop and perform the same steps. You have to decide on a mapping of threads to data elements, divide up the work in your kernel based on your thread to element mapping, and perform your computation based on that.
e.g. if you have a 1 thread : 1 element mapping,
__global__ void mykernel(float VolDepth,float * dDC4_in,float * dDC4_out)
{
int index = threadIdx.x + blockIdx.x*blockDim.x;
dDC4_out[index]=dDC4_in[index] * VolDepth;
}
of course this would also necessitate changing your kernel launch configuration to have the correct number of threads, and if the threads and elements aren't exact multiples, you'll want some added bounds checking in your kernel.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio