CUDA: Does passing arguments to a kernel slow the kernel launch much? - gpgpu

CUDA beginner here.
In my code I am currently launching kernels many times in a loop in the host code (because I need synchronization between blocks), so I wondered whether I could optimize the kernel launch.
My kernel launches look something like this:
MyKernel<<<blocks,threadsperblock>>>(double_ptr, double_ptr, int N, double x);
To launch a kernel, some signal obviously has to go from the CPU to the GPU, but I'm wondering whether passing the arguments makes this process noticeably slower.
The arguments to the kernel are the same every single time, so perhaps I could save time by copying them once and accessing them in the kernel by a name defined by
__device__ int N;
<and somehow (how?) copy the value to this name N on the GPU once>
and simply launch the kernel with no arguments, like this:
MyKernel<<<blocks,threadsperblock>>>();
Will this make my program any faster?
What is the best way of doing this?
AFAIK the arguments are stored in some constant global memory. How can I make sure that the manually transferred values are stored in memory that is as fast or faster?
Thanks in advance for any help.

I would expect the benefits of such an optimization to be rather small. On sane platforms (i.e. anything other than WDDM), kernel launch overhead is only on the order of 10-20 microseconds, so there probably isn't a lot of scope for improvement.
Having said that, if you want to try, the logical way to do this is with constant memory. Define each argument as a __constant__ symbol at translation-unit scope, then use cudaMemcpyToSymbol to copy values from the host to device constant memory.
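For example, a minimal sketch of that approach (the c_* symbol names, the kernel body, and the launch_loop helper are made up for illustration; error checking is omitted):
__constant__ double *c_in;
__constant__ double *c_out;
__constant__ int     c_N;
__constant__ double  c_x;

__global__ void MyKernel()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < c_N)
        c_out[i] = c_x * c_in[i];   // arbitrary work, just for illustration
}

// Host side: in_d and out_d are device pointers obtained from cudaMalloc.
void launch_loop(double *in_d, double *out_d, int N, double x,
                 int blocks, int threadsperblock, int iterations)
{
    // Copy each value into its __constant__ symbol once...
    cudaMemcpyToSymbol(c_in,  &in_d,  sizeof(in_d));
    cudaMemcpyToSymbol(c_out, &out_d, sizeof(out_d));
    cudaMemcpyToSymbol(c_N,   &N,     sizeof(N));
    cudaMemcpyToSymbol(c_x,   &x,     sizeof(x));

    // ...then launch with no arguments inside the loop.
    for (int it = 0; it < iterations; ++it)
        MyKernel<<<blocks, threadsperblock>>>();
}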

Simple answer: no.
To be more elaborate: you need to send some signals from the host to the GPU anyway to launch the kernel itself, and at that point a few more bytes of parameter data no longer matter.

Related

What happens when multiple GPU threads in a single warp/wave attempt to write to the same shared memory location?

I've been learning about parallel/GPU programming a lot recently, and I've encountered a situation that's stumped me. What happens when two threads in a warp/wave attempt to write to the same exact location in shared memory? Specifically, I'm confused as to how this can occur when warp threads each execute the exact same instruction at the same time (to my understanding).
For instance, say you dispatch a shader that runs 32 threads, the size of a normal non-AMD warp. Assuming no dynamic branching (which, as I understand it, will normally call up a second warp to execute the branched code? I could be very wrong about that), what happens if we have every single thread try to write to a single location in shared memory?
Though I believe my question applies to any kind of GPU code, here's a simple example in HLSL:
groupshared uint test_target;
#pragma kernel WarpWriteTest
[numthreads(32, 1, 1)]
void WarpWriteTest (uint thread_id: SV_GroupIndex) {
    test_target = thread_id;
}
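For comparison, I think the rough CUDA equivalent of the above would be something like this (an untested sketch on my part):
// All 32 threads of a single warp store their index to the same shared word.
__global__ void WarpWriteTest(unsigned int *out)
{
    __shared__ unsigned int test_target;

    test_target = threadIdx.x;   // every thread writes to the same location
    __syncthreads();

    if (threadIdx.x == 0)
        *out = test_target;      // which value "wins" is unpredictable
}
// launched as WarpWriteTest<<<1, 32>>>(d_out);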
I understand this is almost certainly implementation-specific, but I'm just curious what would generally happen in a situation like this. Obviously, you'd end up with an unpredictable value stored in test_target, but what I'm really curious about is what happens on a hardware level. Does the entire warp have to wait until every write is complete, at which point it will continue executing code in lockstep (and would this result in noticeable latency)? Or is there some other mechanism to GPU shared memory/cache that I'm not understanding?
Let me clarify: I'm not asking what happens when multiple threads try to access a value in global memory/DRAM. I'd be curious to know that too, but my question is specifically concerned with the shared memory of a threadgroup. I also apologize if this information is readily available somewhere else; as anyone reading this might know, GPU terminology in general can be very nebulous and non-standardized, so I've had difficulty even knowing what I should be looking for.
Thank you so much!

Ways to invoke Linux kernel memory allocation?

I am examining how the kernel memory allocators (SLAB and SLUB) work. To exercise them, I need to trigger kernel memory allocations from a user-land program.
The obvious way would be to call fork(), which creates new process instances, for which the kernel must maintain PCB structures, and those require a fair amount of memory.
But that is as far as my ideas go. I would rather not limit my experiments to merely calling fork() and tracing it with SystemTap. Are there other convenient ways to do something similar, ideally involving kernel objects (other than proc_t) with different characteristics, the most important of which is their size?
Thanks.
SLUB is just a more efficient way (in comparison with SLAB) of managing the cache objects; it is more or less the same thing. There is material available on why SLUB was introduced and on what exactly the slab allocator is. Now, on to tracing what exactly happens in the kernel and how to trace it:
The easier but less efficient way is to read the source code, but for that you need to know where to start in the source.
Another, more accurate way is to write a driver that allocates memory using kmem_cache_create() and then call it from your user program. Now you have a well-defined starting point: use kgdb and step through the entire sequence.

How does CUDA handle multiple updates to memory address?

I have written a CUDA kernel in which each thread makes an update to a particular memory address (an int-sized value). Some threads might want to update this address simultaneously.
How does CUDA handle this? Does the operation become atomic? Does this increase the latency of my application in any way? If so, how?
The operation does not become atomic, and it is essentially undefined behavior. When two or more threads write to the same location, one of the values will end up in the location, but there is no way to predict which one.
It can be especially problematic if you are both reading and writing, such as when incrementing a variable.
CUDA provides a set of atomic operations to help.
You may also use other coding techniques such as parallel reductions, to help when there are multiple updates to the same location, such as finding a max or min value.
If you don't care about the order of the updates, this should not be a performance issue on newer GPUs, which automatically condense simultaneous writes or reads to a single location in global or shared memory, but this is also not specified behavior.
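For example, the difference between a plain read-modify-write and an atomic one looks like this (a minimal sketch; the histogram-style kernel and its names are only for illustration):
__global__ void histogram(const unsigned int *values, int n,
                          unsigned int *bins, unsigned int nbins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int b = values[i] % nbins;
        // bins[b] += 1;          // unsafe: colliding threads give unpredictable results
        atomicAdd(&bins[b], 1u);  // safe: the hardware serializes the updates
    }
}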

CUDA global static data alternative?

I'm building a toolkit that offers different algorithms in CUDA. However, many of these algorithms use static constant global data that will be used by all threads, declared this way for example:
static __device__ __constant__ real buf[MAX_NB];
My problem is that if I include all the .cuh files in the library, then when the library is initialized all this memory will be allocated on the device, even though the user might want to use only one of these algorithms. Is there any way around this? Will I absolutely have to use the typical dynamically allocated memory?
I want the fastest constant memory possible that can be used by all threads at runtime. Any ideas?
Thanks!
All the constant memory declared in a .cu file is allocated when its module is loaded (when the .cubin is generated and run, each .cu file belongs to a different module). Therefore, to use many different kernels that use constant memory, you have to divide them across .cu files so as not to overflow constant memory. The usual maximum is 64 KB. Source: http://forums.nvidia.com/index.php?showtopic=185993
Have you looked into texture memory? I believe it is tricky, but it can be quite fast and can be allocated dynamically.
If you can't use textures, I've been brainstorming, and the only thing I can think of for constant memory is to allocate a single constant array: something hopefully smaller than all of the constants in all of the headers combined, but big enough for what anyone would need in a maximal use case. Then you can load different values into this array for different needs.
(I'm assuming you've confirmed that allocating constant memory for the entire library is a problem. Is it insufficient space, or long initialization times, or what?)
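A minimal sketch of that single-shared-array suggestion, assuming a double-precision buffer and made-up names (a real library would use its own real type and sizes):
// One shared __constant__ buffer, sized for the largest single algorithm,
// instead of a separate __constant__ array in every algorithm's header.
#define SHARED_CONST_MAX 4096
__constant__ double shared_const_buf[SHARED_CONST_MAX];

// Host-side helper: before running a given algorithm, load that algorithm's
// coefficients into the shared buffer (count must be <= SHARED_CONST_MAX).
void load_algorithm_constants(const double *host_coeffs, size_t count)
{
    cudaMemcpyToSymbol(shared_const_buf, host_coeffs, count * sizeof(double));
}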

Question about syntax for using constant cache

Hey all, I didn't see much in the way of syntax for __constant variable allocation in OpenCL in the guides from Nvidia.
When I call clCreateBuffer, do I have to give it the flag CL_MEM_READ_ONLY? It doesn't seem to mind that I set it to CL_MEM_READ_WRITE for now, though I bet trying to write to constant cache in the kernel will screw something up.
Are there any gotchas or special things I need to remember to do on the host side? If I declare the argument as __constant in the device kernel code, am I good to go with using the constant cache variable so long as I don't write to it?
Yes, that's basically it. You have to keep in mind that the constant cache has a size limit of 64 KB, though. Since the __constant address space is inherently read-only, the compiler should complain if you try to write to it.
Unfortunately, __constant memory is altogether a bit buggy in NVidia's implementation. Occasionally the compiler will emit wrong code and reads from constant memory simply return zero. As of the 260.x driver series they have not fixed the problem.
