CUDA: does each thread create a separate copy of all the data? - memory-management

I have a very basic question which I failed to understand after going through the documentation. I am facing this issue while running one of my projects: the output I get is totally corrupted, and I believe the problem is either with memory allocation or with thread synchronization.
OK, the question is:
Does every thread create a separate copy of all the variables and pointers passed to the kernel function? Or does it only create a copy of the variables, while the memory that the pointers point to is shared among all threads?
e.g.
int main()
{
    const int DC4_SIZE = 3;
    const int DC4_BYTES = DC4_SIZE * sizeof(float);
    float *dDC4_in;
    float *dDC4_out;
    float hDC4_in[DC4_SIZE];
    float hDC4_out[DC4_SIZE];
    gpuErrchk(cudaMalloc((void**) &dDC4_in, DC4_BYTES));
    gpuErrchk(cudaMalloc((void**) &dDC4_out, DC4_BYTES));
    // DC4 initialization function on the host which assigns some values to the hDC4_in[] array
    gpuErrchk(cudaMemcpy(dDC4_in, hDC4_in, DC4_BYTES, cudaMemcpyHostToDevice));
    mykernel<<<10,128>>>(VolDepth, dDC4_in);
    cudaMemcpy(hDC4_out, dDC4_out, DC4_BYTES, cudaMemcpyDeviceToHost);
}

__global__ void mykernel(float VolDepth, float *dDC4_in, float *dDC4_out)
{
    for (int index = 0 to end)   // pseudocode: loop over the elements
        dDC4_out[index] = dDC4_in[index] * VolDepth;
}
So I am passing the dDC4_in and dDC4_out pointers to the GPU, with dDC4_in initialized with some values, computing dDC4_out, and copying it back to the host.
So will all my 1280 threads have separate dDC4_in/out copies, or will they all work on the same copy on the GPU, overwriting the values written by other threads?

Global memory is shared by all threads in a grid. The parameters you pass to your kernel (that you've allocated with cudaMalloc) are in the global memory space.
Threads do have their own memory (local memory), but in your example dDC4_in and dDC4_out are shared by all of your threads.
As a general run-down (taken from the CUDA Best Practices documentation): local memory (and registers) is per-thread, shared memory is per-block, and global, constant, and texture memory are per-grid.
In addition, global/constant/texture memory can be read and modified on the host, while local and shared memory only exist for the duration of your kernel. That is, if you have some important information in your local or shared memory and your kernel finishes, that memory is reclaimed and your information is lost. This also means that the only way to get data into your kernel from the host is via global/constant/texture memory.
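For illustration, a minimal sketch (the kernel and variable names here are made up, and it assumes a single block of up to 128 threads) of where each of these spaces shows up inside a kernel:
__constant__ float cScale;                        // constant memory: per-grid, written from the host

__global__ void scopes_demo(const float *gIn, float *gOut)   // gIn/gOut point into global memory (per-grid)
{
    __shared__ float sTile[128];                  // shared memory: per-block, gone when the kernel ends
    float local = gIn[threadIdx.x];               // local variable: per-thread (register/local memory)
    sTile[threadIdx.x] = local * cScale;
    __syncthreads();                              // make the block's shared-memory writes visible
    gOut[threadIdx.x] = sTile[threadIdx.x];
}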
Anyway, in your case it's a bit hard to recommend how to fix your code, because you don't take threads into account at all. Not only that, in the code you posted, you're only passing 2 arguments to your kernel (which takes 3 parameters), so it's no surprise your results are somewhat lacking. Even if your code were valid, you would have every thread looping from 0 to end and writing to the same locations in memory (which would be serialized, but you wouldn't know which write would be the last one to go through). In addition to that race condition, you have every thread doing the same computation; each of your 1280 threads would execute that for loop and perform the same steps. You have to decide on a mapping of threads to data elements, divide up the work in your kernel based on your thread-to-element mapping, and perform your computation based on that.
e.g. if you have a 1 thread : 1 element mapping,
__global__ void mykernel(float VolDepth, float *dDC4_in, float *dDC4_out)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    dDC4_out[index] = dDC4_in[index] * VolDepth;
}
Of course this would also necessitate changing your kernel launch configuration to have the correct number of threads, and if the thread count and element count aren't exact multiples, you'll want some added bounds checking in your kernel.
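For instance, a hedged sketch of what the bounds-checked kernel and a matching launch could look like (N, the total element count, is a parameter I'm introducing here for illustration):
__global__ void mykernel(float VolDepth, float *dDC4_in, float *dDC4_out, int N)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < N)                                // extra threads in the last block do nothing
        dDC4_out[index] = dDC4_in[index] * VolDepth;
}

// launch with enough blocks to cover N elements, rounding up
int threads = 128;
int blocks = (N + threads - 1) / threads;
mykernel<<<blocks, threads>>>(VolDepth, dDC4_in, dDC4_out, N);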

Related

Linux `alloc_pages_node` not incrementing `_refcount` for all allocated pages

When allocating physically contiguous memory with alloc_pages_node in Linux v6.0, the _refcount in struct page is not incremented for all of the allocated pages: only the first page of the allocation has its _refcount incremented.
Is this correct/intended behavior?
Is this function only intended to be used in particular use cases/in a particular way such that the incorrect _refcount is accounted for?
Context: alloc_pages* are a series of functions in the kernel intended for allocating a physically contiguous set of pages (see the kernel documentation). These functions return a pointer to the struct page corresponding to the first page of the allocated region.
I am using this function during early boot (in fact while setting up the stacks for the init process and for kthreadd).
By this point, the buddy-allocator is functional and usable.
Similar APIs (ignoring the need for physical contiguity) such as vmalloc increment the _refcount for all allocated pages.
This is the code I am running. The output is also listed below.
Code
order = get_order(nr_pages << PAGE_SHIFT);
p = alloc_pages_node(node, gfp_mask, order);
if (!p)
    return;
for (i = 0; i < nr_pages; i++, p++)
    printk("_refcount = %d\n", page_ref_count(p));
Output
_refcount = 1
_refcount = 0
_refcount = 0
...
Arguments
gfp_mask is (THREADINFO_GFP & ~__GFP_ACCOUNT) | __GFP_NOWARN | __GFP_HIGHMEM.
The first part, THREADINFO_GFP & ~__GFP_ACCOUNT, comes from alloc_thread_stack_node.
__vmalloc_area_node adds __GFP_NOWARN | __GFP_HIGHMEM.
order = get_order(nr_pages << PAGE_SHIFT) = 2, since nr_pages is 4.
Is this correct/intended behavior?
Yes, this is normal. Page allocations of order higher than 0 are effectively considered as a single "high-order" page(1) by the buddy allocator, so functions such as alloc_pages() and __free_pages(), which operate on both order-0 and high-order pages, only care about the reference count of the first page.
Upon allocation (alloc_pages), only the first struct page of the group gets its refcount initialized. Upon deallocation (__free_pages), the refcount of the first page is decremented and tested: if it reaches zero, the whole group of pages gets actually freed(2). When this happens, a sanity check is also performed on every single page to ensure that the reference count is zero.
If you intend to allocate multiple pages at once, but then manage them separately, you will need to split them using split_page(), which effectively "enables" reference counting for every single struct page and initializes its refcount to 1. You can then use __free_pages(p, 0) (or __free_page()) on each page separately.(3)
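As a rough sketch of that allocate/split/free-individually pattern (not from the original answer; GFP_KERNEL, the order-2 size, and the demo function are assumptions for illustration):
#include <linux/gfp.h>
#include <linux/mm.h>

static int alloc_split_free_demo(void)
{
    unsigned int order = 2;                       /* 4 physically contiguous pages */
    unsigned int i;
    struct page *p = alloc_pages(GFP_KERNEL, order);

    if (!p)
        return -ENOMEM;

    split_page(p, order);                         /* each struct page now has _refcount == 1 */

    /* ... the pages can now be managed and refcounted individually ... */

    for (i = 0; i < (1u << order); i++)
        __free_page(p + i);                       /* same as __free_pages(p + i, 0) */

    return 0;
}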
Similar APIs (ignoring the need for physical contiguity) such as vmalloc increment the _refcount for all allocated pages.
Whether to allocate single order-0 pages or do a higher-order allocation is a choice that depends on the semantics of the specific memory allocation API. Problem is, these semantics can often change based on the actual API usage in kernel code(4). Indeed as of now vmalloc() splits the high-order page obtained from alloc_pages() using split_page(), but this was only a recent change done because some of its callers were relying on the allocated pages to be independent (e.g., doing their own reference counting).
(1) Not to be confused with compound pages, although their refcounting is performed in the same way, i.e. only the first page (PageHead()) is refcounted.
(2) It is actually a little bit more complex than that, all pages except the first are freed regardless of the refcount of the first, to avoid memory leaks in rare situations, see this relevant commit. The refcount sanity check on all the freed pages is done anyway.
(3) Note that allocating high-order pages and then splitting them into order-0 pages is generally not a good idea, as you can guess from the comment on top of split_page(): "Note: this is probably too low level an operation for use in drivers. Please consult with lkml before using this in your driver." - This is because high-order allocations are harder to satisfy than order-0 allocations, and breaking high-order page blocks only makes it even harder.
(4) Welcome to the magic world of kernel APIs I guess. Much like Hogwarts' staircases, they like to change.

CUDA dynamic parallelism: Access child kernel results in global memory

I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:
int aPayloads[32];
// Compute aPayloads start values here
int* aGlobalPayloads = nullptr;
cudaMalloc(&aGlobalPayloads, (sizeof(int) *32));
cudaMemcpyAsync(aGlobalPayloads, aPayloads, (sizeof(int)*32), cudaMemcpyDeviceToDevice);
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
// Access results in payload array here
Assuming that I do things right so far, what is the fastest way to access the results in aGlobalPayloads after kernel execution? (I tried cudaMemcpy() to copy aGlobalPayloads back to aPayloads but cudaMemcpy() is not allowed in device code).
You can directly access the data in aGlobalPayloads from your parent kernel code, without any copying:
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
int myval = aGlobalPayloads[0];
I'd encourage two things: careful error checking (read the whole accepted answer here; you do it in device code the same way as in host code), and a closer look at your cudaMemcpyAsync() call. The programming guide states: "May not pass in local or shared memory pointers". Your aPayloads array is a local memory pointer, so passing it to cudaMemcpyAsync() in device code is not valid.
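A minimal sketch of what that device-side error checking around the child launch could look like (assuming cudaGetLastError()/cudaGetErrorString() from the device runtime, with dynamic parallelism enabled):
mykernel<<<1, 1>>>(aGlobalPayloads);
cudaError_t err = cudaGetLastError();             // did the launch itself fail?
if (err != cudaSuccess)
    printf("child launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();                    // wait for the child; returns its execution status
if (err != cudaSuccess)
    printf("child kernel error: %s\n", cudaGetErrorString(err));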
If for some reason you want that data to be explicitly put back in your local array, you can use in-kernel memcpy for that:
memcpy(aPayloads, aGlobalPayloads, sizeof(int)*32);
int myval = aPayloads[0]; // retrieves the same value
(that is also how I would fix the aPayloads issue mentioned above with cudaMemcpyAsync() - use an in-kernel memcpy instead)

Does moving data from global memory to shared memory stall the thread?

__shared__ float smem[2];
smem[0] = global_memory[0];
smem[1] = global_memory[1];
/*process smem[0]...*/
/*process smem[1]...*/
My question is, does smem[1] = global_memory[1]; block computation on smem[0]?
In Cuda thread scheduling - latency hiding and Cuda global memory load and store, they say a memory read will not stall the thread until the read data is actually used. Does storing it to shared memory count as "using the data"? Should I do something like this:
__shared__ float smem[2];
float a = global_memory[0];
float b = global_memory[1];
smem[0] = a;
/* process smem[0]*/
smem[1] = b;
/* process smem[1]*/
Or perhaps the compiler does it for me? But then does it use extra registers?
Yes, in the general case this would block the CUDA thread:
smem[0] = global_memory[0];
the reason is that this operation would be broken into two steps:
LDG Rx, [Ry]
STS [Rz], Rx
The first SASS instruction loads from global memory. This operation does not block the CUDA thread. It can be issued to the LD/ST unit, and the thread can continue. However the register target of that operation (Rx) is tracked, and if any instruction needs to use the value from Rx, the CUDA thread will stall at that point.
Of course the very next instruction is the STS (store shared) instruction that will use the value from Rx, so the CUDA thread will stall at that point (until the global load is satisfied).
Of course it's possible that the compiler may reorder the instructions so that the STS instruction occurs later, but there is no guarantee of that. Regardless, wherever the compiler places the STS instruction, the CUDA thread will stall at that point, until the global load is completed. For the example you have given, I think it's quite likely that the compiler would create code that looks like this:
LDG Rx, [Ry]
LDG Rw, [Ry+1]
STS [Rz], Rx
STS [Rz+1], Rw
In other words, I think it's likely that the compiler would organize these loads such that both global loads can be issued before a possible stall occurs. However, there is no guarantee of this; the specific behavior for your code can only be deduced by studying the actual SASS, but in the general case we should assume the possibility of a thread stall.
Yes, if you can break up the loads and stores as you have shown in your code, then this operation:
float b = global_memory[1];
should not block this operation:
smem[0] = a;
/* process smem[0]*/
Having said all that, CUDA introduced a new mechanism to address this scenario in CUDA 11, supported by devices of compute capability 8.0 and higher (so, all Ampere GPUs at this time). This new feature is referred to as asynchronous copy of data from global to shared memory. It allows for these copy operations to proceed without stalling CUDA threads. However this feature requires proper use of a barrier to make sure that when you need to actually use the data in shared memory, it is present.
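A hedged sketch of that mechanism using the cooperative groups memcpy_async API (the kernel name and pointer names are mine; this needs CUDA 11+ and is hardware-accelerated on compute capability 8.0 and higher):
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void process(const float *global_memory)
{
    __shared__ float smem[2];
    auto block = cg::this_thread_block();

    // start the global->shared copy without stalling the threads
    cg::memcpy_async(block, smem, global_memory, sizeof(float) * 2);

    // ... independent work can be done here while the copy is in flight ...

    cg::wait(block);              // barrier: smem contents are valid only after this
    /* process smem[0], smem[1] ... */
}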

Allocate swappable memory in linux kernel

Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?
You can create a file in the internal shm shared memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(p)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);
The answer to your question is a simple no, or a yes that requires a complex modification to the kernel source.
First, to enable swapping out, you have to ask yourself what happens when kswapd swaps pages out. Essentially it walks through all the processes and decides whether their memory can be swapped out or not. All of that memory is user-mode (ring 3) memory, so SMAP essentially forbids it from being read as data or executed as code in the kernel (ring 0):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
And check your distro's "CONFIG_X86_SMAP"; on my Ubuntu it defaults to "y", as it has for the past few years.
But if you keep your memory at a kernel address (ring 0), then you need to consider changing the kswapd operation to trigger swap-out of kernel addresses. Which kernel addresses should be walked first? And what if the address is part of kswapd's own kernel operation? The complexity involved is huge.
Next, consider the swap-in operation: when a memory read is attempted and the page's "not present" bit is set, the hardware exception triggers the Linux kernel's memory fault handler (which is __do_page_fault()).
And looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and thereafter how it handles kernel addresses (do_kern_addr_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
which essentially just reports an error for every possible scenario. If you want to enable page faulting on kernel addresses, this path has to be modified.
Note too that the SMAP check (inside smap_violation) is done in the user-address fault path (do_user_addr_fault()).

memory pool usage (boost::pool) for variable sized buffers?

The bottleneck of my current project is heap allocation: profiling showed that one critical thread spends about 50% of its time with/in the new operator.
The application cannot use stack memory here and needs to allocate many instances of one central job structure, a custom job/buffer implementation: small and short-lived, but variable in size. The objects are themselves heap-allocated std::shared_ptr/std::weak_ptr objects and carry a classic C-array (char*) payload.
Depending on the runtime configuration and workload in different parts, 300k-500k objects might get created and be in use at the same time (but this should usually not happen). Since it is an x64 application, memory fragmentation isn't that big a deal (but it might become one when x86 is also targeted).
To increase speed and packet throughput, and also to be safe from memory fragmentation in the future, I was thinking about using some memory management pool, which led me to boost::pool.
Almost all examples use fixed-size objects... but I'm unsure how to deal with a variable-length payload. A simplified object like the one below could be created using a boost::pool, but I'm unsure what to do with the payload. Is it usable with a boost::pool at all?
class job {
public:
    static std::shared_ptr<job> newObj();
private:
    delegate_t call;
    args_t *args;
    unsigned char *payload;
    size_t payload_size;
};
Usually the objects are destroyed when all references to the shared_ptr go out of scope, and I wouldn't want to change the shared_ptr back to a raw C pointer. Deferred destruction of the objects should also help performance and, from what I read, should work better with a boost::pool. I haven't found whether the pool supports interaction with smart pointers. The alternative, but quirky, way would be to save a reference to the shared_ptr on creation together with the pool and release them in blocks.
Does anyone have experience with the two: boost::pool usage with variable-sized objects, and smart pointer interaction?
Thank you!
