Instruction transfer between CPU and GPU - gpgpu

I'm looking for information on how the CPU moves program code to the GPU in GPGPU computation. The Internet is full of manuals about data transfer, but not about instruction/program loading.
The question is: the program is handled by the CPU, which "configures" the GPU with the appropriate flags on each computing unit to perform a given operation. After that, data is transferred and processed. How is the first operation done? How are instructions issued to the GPU? Are the instructions somehow packed to take advantage of the bus bandwidth? I may have overlooked something fundamental, so any additional information is welcome.

There is indeed not much information about it, but you overestimate the effect.
The whole kernel code is loaded onto GPU only once (at worst once-per-kernel-invocation, but it looks like it is actually once-per-application-run, see below), and then is executed completely on the GPU without any intervention from CPU. So, whole kernel code is copied in one chunk somewhere before kernel invocation. To estimate code size, the .cubin size of all GPU code of our home-made MD package (52 kernels, some of which are > 150 lines of code) is only 91 KiB, so it's safe to assume that in pretty much all the cases the code transfer time is negligible.
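To put a rough number on "negligible", here is a back-of-the-envelope sketch of the transfer time for 91 KiB of code. The bus bandwidth used here is purely an assumption for illustration (it is not a measured figure), and fixed latency and driver overhead are ignored.

```python
# Rough estimate of how long it takes to copy compiled kernel code to the GPU.
# The code size comes from the answer above; the bus bandwidth is an assumption.

CODE_SIZE_BYTES = 91 * 1024          # 91 KiB of .cubin, as quoted above
ASSUMED_BANDWIDTH_BYTES_PER_S = 8e9  # hypothetical effective host-to-device bandwidth


def transfer_time_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move size_bytes over a link, ignoring latency and setup cost."""
    return size_bytes / bandwidth_bytes_per_s


code_transfer_us = transfer_time_seconds(
    CODE_SIZE_BYTES, ASSUMED_BANDWIDTH_BYTES_PER_S
) * 1e6
print(f"estimated code transfer time: {code_transfer_us:.1f} microseconds")
```

Under that assumed bandwidth the copy takes on the order of ten microseconds, and it happens once, which is why it rarely shows up next to the data transfers.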
Here is the information I've found in the official docs:
In the CUDA Driver API, the code is loaded onto the device at the time you call the cuModuleLoad function:
The CUDA driver API does not attempt to lazily allocate the resources
needed by a module; if the memory for functions and data
(constant and global) needed by the module cannot be allocated,
cuModuleLoad() fails
Theoretically, you might have to unload the module and then load it again, if you have several modules which use too much constant (or statically allocated global) memory to be loaded simultaneously, but it's quite uncommon, and you usually call cuModuleLoad only once per application launch, right after context creation.
The CUDA Runtime API does not provide any means of controlling module loading/unloading, but it looks like all the necessary code is loaded onto the device during its initialization.
The OpenCL specs are not as specific as the CUDA Driver API, but the code is likely (wild guessing involved) copied to the device at the clBuildProgram stage.

Related

Why cannot a kernel be launched with the reason of too many register use when there is a register spilling mechanism?

1) When does a kernel start to spill registers to local memory?
2) When there are not enough registers, how does the CUDA runtime decide not to launch a kernel and throw a "too many resources requested" error? How many registers are enough to launch a kernel?
3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?
1) When does a kernel start to spill registers to local memory?
This is entirely under control of the compiler. It is not performed by the runtime, and there are no dynamic runtime decisions about it. When your code reaches the point of a spill, it means that the compiler has inserted an instruction like:
STL [R0], R1
In this case, R1 is being stored to local memory, the local memory address given in R0. This would be a spill store. (After that instruction, R1 could be used for/loaded with something else.) The compiler knows when it has done this, of course, and so it can report the number of spill loads and spill stores it has chosen to use/make. You can get this information (along with register usage, and other information) using the -Xptxas=-v compiler switch.
Unless you restrict it (see below), the compiler makes decisions about register usage with performance as its first priority, and pays less attention to how many registers are actually used.
2) When there are not enough registers, how does the CUDA runtime decide not to launch a kernel and throw a "too many resources requested" error? How many registers are enough to launch a kernel?
At compile time, the compiler has no idea how your kernel will be launched. It does not know what your launch configuration will be (number of blocks, number of threads per block, amount of dynamically allocated shared memory, etc.). In fact, the compilation process mostly proceeds as if the thing being compiled is a single thread.
During compilation, the compiler makes a bunch of static decisions about register assignments (how and where registers will be used). CUDA has binary utilities that can help with understanding this. Register assignments don't change at runtime, are not in any way dynamic, and therefore are entirely determined at compile time. Therefore, at the completion of compilation for a given device code function, it is generally possible to determine how many registers are needed. The compiler includes this information in the binary compiled object.
At runtime, at the point of kernel launch, the CUDA runtime now knows:
How many registers (per thread) are needed for a given kernel
What device we are running on, and therefore what the aggregate limits are
What the launch configuration is (blocks, threads)
Assembling these 3 pieces of information means the runtime can immediately know if there is or will be enough "register space" for the launch. Roughly speaking, the pass/fail arithmetic is if the launch would satisfy this inequality:
registers_per_thread*threads_per_block <= max_registers_per_multiprocessor
There is granularity to be considered in this inequality as well. Registers are often allocated in groups of 2 or 4 at runtime, i.e. the registers_per_thread quantity may need to be rounded up to the next whole-number multiple of something like 2 or 4 before the inequality test is applied. The registers_per_thread quantity is ascertained by the compiler as already described. The threads_per_block quantity comes from your kernel launch configuration. The max_registers_per_multiprocessor quantity is machine-readable (i.e. it is a function of the GPU you are running on). You can see how to retrieve that quantity yourself, if you wish, by studying the deviceQuery CUDA sample code.
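The pass/fail arithmetic above can be sketched in a few lines. This is only a model of the rough inequality given in this answer, not the actual check the CUDA runtime performs; the function names, the default granularity of 4, and the example numbers are assumptions chosen for illustration.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the next whole-number multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def launch_fits(registers_per_thread: int,
                threads_per_block: int,
                max_registers_per_multiprocessor: int,
                allocation_granularity: int = 4) -> bool:
    """Rough register-space test for a kernel launch.

    registers_per_thread is rounded up to the allocation granularity
    (assumed to be 4 here) before the inequality is applied.
    """
    rounded = round_up(registers_per_thread, allocation_granularity)
    return rounded * threads_per_block <= max_registers_per_multiprocessor


# Hypothetical device with 65536 registers per multiprocessor.
print(launch_fits(64, 1024, 65536))  # 64 * 1024 = 65536, exactly fits
print(launch_fits(65, 1024, 65536))  # 65 rounds up to 68; 68 * 1024 is too many
print(launch_fits(65, 512, 65536))   # the same kernel fits with smaller blocks
```

The last two calls show why the same compiled kernel can fail with one launch configuration and succeed with another: the per-thread register count is fixed at compile time, and only threads_per_block changes.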
3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?
I reiterate that register assignment, including any spill decisions, is an entirely static compile-time process. No runtime decisions or alterations are made. The register assignment is entirely inspectable from the compiled code. Therefore, since no adjustments can be made at runtime, no changes could be made to allow an arbitrary launch. Any such change would require recompilation of the code. While this might be theoretically possible, it is not currently implemented in CUDA. Furthermore, it could lead to variable and perhaps unpredictable performance, so there might be reasons not to do it.
It's possible to make all kernels "launchable" (with respect to register limitations) by suitably restricting the compiler's choices about register assignment. __launch_bounds__ and the compiler switch -maxrregcount are a couple of ways to achieve this. CUDA provides both an occupancy calculator and an occupancy API to help with this process.

What library or API should I use to implement a linux kernel module doing asynchronous IO?

First I will describe the environment of my PC, the background of my question, and my problem; then I will state my exact question.
Environment:
OS: Ubuntu 16.04
Kernel: 4.17.1
CPU: i7-6700k
Memory: 8GB DRAM
Storage: SSD 120GB
Background:
I'm trying to optimize the Linux kernel for my specific application. The following is the abstract logic of this application.
1. Call malloc to allocate a memory region of exactly 4KB (the page size).
2. Copy predefined data (also 4KB in size) to the allocated memory.
3. Do computation
4. Free allocated memory space.
This sequence occurs several thousand to ten thousand times a second.
So I think that copying the predefined data to the allocated memory with memcpy() thousands of times every second is very inefficient. But I cannot modify the code of this application.
My problem:
I want these copies to be done asynchronously by a kernel module, using as few CPU cycles as possible. So I'm trying to implement a kernel module that copies this predefined data to free page frames asynchronously in the kernel and manages a pool of page frames that already hold the predefined data. When my application requests a page frame, the kernel will hand out a page frame from this pool.
To copy data asynchronously, I first considered DMA, but the Intel idma64 of my CPU cannot copy data memory-to-memory asynchronously. Now I'm trying to copy this data from secondary storage (SSD) to memory. I found that there is a library for asynchronous IO in Linux named libaio.
My question:
1. Can I use the libaio library in a kernel module? If not, what library or APIs do I have to use to copy asynchronously in my kernel module?
2. Will libaio (or something else) really do the copies without consuming CPU cycles?
I don't think you need to write a kernel module. A user space thread pool of CPU-pinned threads working with a collection of memory maps of files will be as efficient as is possible to implement. Just be careful of "TLB shootdown", i.e. avoid modifying the address space of the process, and throw as much virtual address space as you can at the problem to avoid that. Add perhaps a little hinting to the kernel, via madvise(), as to which written pages will never be used again, and you should be optimal. Sufficiently many threads will maximise queue depth to the SSD; you want to aim for QD8 to QD16, and you should easily saturate an NVMe link whilst keeping CPU usage below 100%.
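As a structural illustration only, the shape of that design (a pool of worker threads reading pages out of a memory-mapped file and hinting the kernel afterwards) might look like the sketch below. It is written in Python for brevity, so it shows the layout rather than the performance: there is no CPU pinning, the pages are only read rather than written, and the page count, worker count, and helper names are all hypothetical. The madvise() hint is applied only where the platform exposes it.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = mmap.PAGESIZE  # use the system page size so madvise() ranges are aligned
NUM_PAGES = 64             # hypothetical size of the predefined-data file, in pages
NUM_WORKERS = 8            # hypothetical; stands in for the targeted queue depth


def checksum_page(mapping: mmap.mmap, page_index: int) -> int:
    """Read one page from the mapping, then tell the kernel we are done with it."""
    start = page_index * PAGE_SIZE
    page = mapping[start:start + PAGE_SIZE]  # faults the page in if needed
    result = sum(page)
    # Hint that this page will not be needed again (not available everywhere).
    if hasattr(mapping, "madvise") and hasattr(mmap, "MADV_DONTNEED"):
        mapping.madvise(mmap.MADV_DONTNEED, start, PAGE_SIZE)
    return result


def process_file(path: str) -> list:
    """Map the file once and let a pool of threads work through its pages."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
        with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
            return list(pool.map(lambda i: checksum_page(mapping, i),
                                 range(NUM_PAGES)))


def make_test_file() -> str:
    """Create a temporary file where page i is filled with the byte i % 256."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(NUM_PAGES):
            f.write(bytes([i % 256]) * PAGE_SIZE)
    return path


path = make_test_file()
try:
    checksums = process_file(path)
finally:
    os.remove(path)
print(checksums[:4])
```

The file is mapped once and never remapped while the workers run, which is the point of the "avoid modifying the address space" advice: the per-page hint changes what the kernel keeps resident, not the layout of the mapping itself.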
Things get harder if you have many NVMe-linked SSDs. You may need to consider replacing Linux with something with more scalable storage i/o, but there is a throughput vs scalability tradeoff there. Windows and FreeBSD will scale better with lots of devices if you partition the work up right, but Linux will do much better with a few devices. Good luck!

How does cache affect while a same kernel is being launched repeatedly

I recently started learning OpenCL and have a question about the interaction between caches and kernels in OpenCL. I am writing a program to measure the latency of accessing main memory (bypassing caches). Therefore, I am wondering whether cache memory is cleared automatically after a kernel finishes executing, or whether its contents remain and are used when the same kernel is executed repeatedly.
Thanks!
For AMD Radeon GCN GPUs, the L1 and L2 caches persist across kernel launches, whether of the same kernel or of different kernels. A kernel can use cached data from any other kernel. Additionally, Local Memory inside a Compute Unit is not cleared/zeroed between kernel runs (more precisely, between work-group runs). This means you have to initialize local variables. The same should apply for NVIDIA/CUDA devices and generic SIMD CPUs.
That being said, OpenCL does not know or define different levels of cache; caches are vendor specific. Any functionality that handles or manages caching is a vendor-specific extension.
To test latency, use a pseudo-random number generator in your kernel, and read random memory addresses. Use 2 kernels, the 1st one pollutes all caches, the 2nd one then does the actual latency measuring.
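To make the "read random memory addresses" idea concrete, here is a host-side Python model of the index sequence such a kernel could generate. The choice of generator is an assumption, not something this answer specifies: it is a linear congruential generator using the well-known Numerical Recipes constants, reduced modulo a power-of-two buffer length so that every slot is visited exactly once per pass. In a real kernel, each work-item would compute these indices itself and load from global memory at each one.

```python
def lcg_indices(buffer_len: int, seed: int = 1):
    """Yield buffer_len pseudo-random indices in [0, buffer_len).

    Uses x -> (a*x + c) mod buffer_len with a = 1664525, c = 1013904223.
    Because a % 4 == 1 and c is odd, the sequence has full period for any
    power-of-two modulus >= 4, so each index appears exactly once per pass.
    """
    if buffer_len < 4 or buffer_len & (buffer_len - 1):
        raise ValueError("buffer_len must be a power of two >= 4")
    state = seed % buffer_len
    for _ in range(buffer_len):
        state = (1664525 * state + 1013904223) % buffer_len
        yield state


# Model of the access pattern over a hypothetical 1024-element buffer.
buffer = list(range(1024))
visited = [buffer[i] for i in lcg_indices(len(buffer))]
print(visited[:8])
```

Visiting every slot once per pass means no address is re-read early, so a buffer larger than the caches keeps the measured loads going out to main memory.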
In the OpenCL memory hierarchy there are NO "caches" (in the CPU sense). In OpenCL there are different kinds of memory that you can control with some instructions. Here you can have a look at what I mean:
The fastest memories are Private memory and Local memory. You can declare variables in these memory spaces and control them, moving them in the way that you prefer. You should be careful, because data in Local memory is shared among the work-items of a "block" (work-group), while data in Private memory is visible only to the thread (work-item) that owns it. Here you can find a lot more information.
So if you run a kernel repeatedly, you can store your variables in the memory that you prefer, and you will notice that variables in private memory are really fast compared with the other options.

kmalloc'ed memory is slow

We have an app that requires ~1MB buffers for a hardware device to fill, so we wrote a kernel module that allocates buffers using kmalloc(). We did not use dma_alloc_coherent() because we need to manipulate the buffers and therefore wanted them to be cached (we flush the cache when needed). One of the manipulations is that the kernel module copies one buffer to another buffer. In timing these copies we see that it takes about 2ms to copy a buffer. The time does not include any cache flushing.
As this seemed slow we wrote a standard userspace test app, that used malloc() to create 1MB buffers and copied them. The userspace copies took about .5ms, which is about the correct time to move this amount of memory on the processor/memory config we are using.
Things we tried: To make sure it wasn't a different memcpy() in kernel space and user space, we wrote our own NEON-optimized copy, but it made no difference. We changed the buffer size from 100KB to 10MB, and that made no difference either. All times were averaged over 10 copies and were always very consistent. The timing routine used gettimeofday() in userspace.
The only thing we can think of is that the data cache is set up differently for kmalloc()'ed memory than for malloc()'ed memory.
We are working on an iMX6 ARM with a Linaro kernel.
The kmalloc() memory will be contiguous in physical space. The user space memory will definitely not be (mlock() may result in something closer to contiguous). If you have several SDRAM chips, it is possible that your memory controller allows pipelining or multiple-issue reads/writes to different chips simultaneously. It may even be faster with multiple banks. vmalloc() will not use contiguous pages. You should be able to write a test to swap kmalloc() with vmalloc(). If something has changed with the newer ARMs and the cache is not VIVT, the difference in physical addresses could cause cache (aliasing?) effects on some processors.
I do not think that the caches are set up differently for kernel memory versus user memory, at least with 2.6.34 variants, but the memory may come from different pools. Also, for a memcpy() a large cache is not needed; you just need enough to make sure the SDRAM will burst.
Another issue is peripherals. For instance, a large graphics buffer on one chip may be stealing cycles via DMA. If you can change your machine file or device table to disable as many drivers as possible, this can be eliminated. This combined with the pipelining could account for the type of slow-down observed.
I believe this is a platform issue. If it were strictly Linux, I think that one of the millions of users would have encountered it. However, you haven't given a specific version of Linux. It could be an ARM-based issue, so I tagged it as such. I think it is your platform/ARM combination, simply because otherwise others would observe this. Can you also provide the specific machine file or device table that your design was based upon, and the Linux version?

How does the size of managed code affect memory footprint?

I have been tasked with reducing the memory footprint of a Windows CE 5.0 application. I came across Rob Tiffany's highly cited article, which recommends using a managed DLL to keep the code out of the process's slot. But there is something I don't understand.
The article says that
The JIT compiler is running in your slot and it pulls in IL from the 1
GB space as needed to compile the current call stack.
This means that all the code in the managed DLL can potentially end up in the process's slot eventually. While this will help other processes by not loading the code in the common area, how does it help this process? FWIW the article does mention that
It also reduces the amount of memory that has to be allocated inside your
My only thought is that just as the code is pulled into the slot it is also pushed/swapped out. But that is just a wild guess and probably completely false.
CF assemblies aren't loaded into the process slot like native DLLs are. They're actually accessed as memory-mapped files. This means that the size of the DLL is effectively irrelevant.
The managed heap also lies in shared memory, not your process slot, so object allocations are far less likely to cause process slot fragmentation or OOM's.
The JITter also doesn't just JIT and hold forever. It compiles what is necessary, and during a GC may very well pitch compiled code that is not being used, or that hasn't been used in a while. You're never going to see an entire assembly JITted and pulled into the process slot (well, if it's a small assembly maybe, but it's certainly not typical).
Obviously some process slot memory has to be used to create some pointers, stack storage, etc etc, but by and large managed code has way less impact on the process slot limitations than native code. Of course you can still hit the limit with large stacks, P/Invokes, native allocations and the like.
In my experience, the area where people most often get into trouble with CF apps and memory is GDI objects and drawing. Bitmaps take up a lot of memory. Even though it's largely in shared memory, creating lots of them (along with brushes, pens, etc.) and not caching and reusing them is what most often gives a managed app a large memory footprint.
For a bit more detail this MSDN webcast on Compact Framework Memory Management, while old, is still very relevant.
