I'm building a toolkit that offers different algorithms in CUDA. However, many of these algorithms use static constant global data that will be used by all threads, declared this way for example:
static __device__ __constant__ real buf[MAX_NB];
My problem is that if I include all the .cuh files in the library, then when the library is instantiated all of this memory will be allocated on the device, even though the user might want to use only one of these algorithms. Is there any way around this? Will I absolutely have to use the typical dynamically allocated memory?
I want the fastest constant memory possible that can be used by all threads at runtime. Any ideas?
Thanks!
All the constant memory in a .cu file is allocated at launch (when the .cubin is generated and run, each .cu belongs to a different module)! Therefore, to use many different kernels that use constant memory, you have to split them across separate .cu files so as not to overflow constant memory. The usual maximum is 64 KB. Source: http://forums.nvidia.com/index.php?showtopic=185993
Have you looked into texture memory? I believe it is tricky but that it can be quite fast and can be allocated dynamically.
If you can't use textures, I've been brainstorming, and the only thing I can think of for constant memory is to allocate a single constant array... some amount that is hopefully less than all of the constants in all of the headers, but big enough for what anyone would need in a maximal use case. Then you can load different values into this array for different needs.
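If it helps, here's a rough sketch of that idea (the symbol name, element type, and size are placeholders I made up): one __constant__ array sized for the worst case, refilled from the host before whichever algorithm's kernel you launch next.

#include <cuda_runtime.h>

#define SHARED_CONST_MAX 4096                        // assumed worst-case element count
static __device__ __constant__ float const_buf[SHARED_CONST_MAX];

// Each algorithm copies its own table into the same symbol before launching its kernel.
cudaError_t load_const_buf(const float* host_data, size_t count) {
    return cudaMemcpyToSymbol(const_buf, host_data, count * sizeof(float));
}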
(I'm assuming you've confirmed that allocating constant memory for the entire library is a problem. Is it insufficient space, or long initialization times, or what?)
Due to system limitations, suppose that I can only allocate memory from a heap once (for example with std::allocator or some other more general C++11 compliant allocator).
This single allocation will take a large memory block.
Then I want to use containers and dynamic memory but all restricted to the previously allocated block of memory.
I managed to write a very simple allocator that incrementally "gives out" memory by shifting a pointer.
In this allocator, deallocate is a no-op, so memory is never returned to the block.
One can obviously do better than this.
In other words, I want a managed heap.
Reusing this block memory in a sequence is a hard problem because one needs to manage discontinuous free segments, defragmentation, (optional) thread-safety, etc.
What is the name of this pattern? For some time I thought this was a pool allocator, but it seems that term refers to something else (reusing small objects).
What features or standard libraries of C++ can I use to either implement and administer such allocation, or at least build my own with little effort?
I expected to find something in Boost.
But Boost.Pool is something else, and it looks like something like this is implemented for a specific purpose in Boost.Interprocess, but it doesn't seem to be easy to use and I have a hard time understanding it outside its prototypical use (such as interprocess shared memory).
Otherwise, the closest thing I found is this https://www.boost.org/doc/libs/1_41_0/libs/pool/doc/interfaces/pool_alloc.html , but it seems that ::new can be called several times.
Example code:
int main() {
    UserBlockAllocator<double> a(new double[1000], 1000);
    {
        std::vector<double, UserBlockAllocator<double>> v0(600, a);
    } // v0 returns memory to block managed by a
    std::vector<double, UserBlockAllocator<double>> v1(600, a);
    std::vector<double, UserBlockAllocator<double>> v2(600, a); // out of memory
}
This pattern is referred to as an arena allocator or a stack allocator. If I understand the std::pmr stuff correctly, a std::pmr::monotonic_buffer_resource is related to that, but I have never tried it.
With those keywords you should find more material, but I have no experience with these tools.
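For what it's worth, here is a minimal sketch with std::pmr (C++17, untested by me) showing the monotonic behaviour: one big block up front, deallocation a no-op until the resource is destroyed.

#include <cstddef>
#include <memory_resource>
#include <vector>

int main() {
    // one large block obtained up front; every container allocation is carved from it
    static std::byte buffer[1 << 16];
    std::pmr::monotonic_buffer_resource arena{buffer, sizeof(buffer),
                                              std::pmr::null_memory_resource()};

    std::pmr::vector<double> v0{&arena};
    v0.resize(600);                    // memory comes from 'buffer'
    // deallocate is a no-op; the memory is reclaimed only when 'arena' goes away
}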
Note that it is easy to successfully deallocate the most recent allocation.
A powerful pattern is the composition of allocators, as described in an entertaining talk by Andrei Alexandrescu at CppCon 2015. If you want to build your own tool, you might consider the combination of a FreeListAllocator (43:18) on top of your StackAllocator (35:42). This way, you may solve the problem of how to manage discontinuous free segments (as you describe it).
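To give an idea (this is just my rough sketch of the composition, not Alexandrescu's code): a fixed-size free list layered on top of an upstream arena resource, so freed blocks get reused before new memory is carved from the arena. It assumes BlockSize is at least sizeof(void*) and glosses over alignment.

#include <cstddef>
#include <memory_resource>

template <std::size_t BlockSize>                  // assumes BlockSize >= sizeof(void*)
class FreeListResource : public std::pmr::memory_resource {
    struct Node { Node* next; };
    Node* free_head_ = nullptr;
    std::pmr::memory_resource* upstream_;         // e.g. a monotonic_buffer_resource
public:
    explicit FreeListResource(std::pmr::memory_resource* upstream) : upstream_(upstream) {}
private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        if (bytes <= BlockSize && free_head_) {   // reuse a previously freed block
            Node* n = free_head_;
            free_head_ = n->next;
            return n;
        }
        return upstream_->allocate(bytes <= BlockSize ? BlockSize : bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t) override {
        if (bytes <= BlockSize) {                 // small blocks become reusable
            auto* n = static_cast<Node*>(p);
            n->next = free_head_;
            free_head_ = n;
        }                                         // larger blocks stay in the arena
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};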
Which ecosystems allow creating multiple heaps right now?
Is it possible to have multiple heaps in java?
garbage collection and memory management in Erlang
Is there any benefit to use multiple heaps for memory management purposes?
AppDomains don't create new heaps (there is still one heap for all domains). So, what does one need to do to launch several different GCs inside a single process?
Which syntactic primitives does one need to create? How should a runtime support those primitives?
Which ecosystems allow creating multiple heaps right now?
One obvious answer would be "C++" (feel free to fill in surrounding pieces as you see fit, if you don't consider a language to be an "ecosystem" in itself).
C++ allows you to specify heaps along a few different axes. One is by the type of an object--you can specify allocation for a particular type by overloading operator new and operator delete for that type:
class Foo {
    static void *operator new(size_t size);
    static void operator delete(void *block, size_t size);
};
It's then up to you to connect these heap management functions to an actual source of memory. You might allocate that via ::operator new, or you might (for example) go directly to the OS, such as with something like GlobalAlloc or VirtualAlloc on Windows, sbrk on UNIX-like systems, or just have pre-specified blocks of memory on a bare-metal embedded system.
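For instance, a purely illustrative sketch where Foo's allocations come out of a small dedicated bump-pointer pool instead of the global heap (a real heap manager would track and reuse freed blocks):

#include <cstddef>
#include <new>

class Foo {
    // tiny bump-pointer pool; blocks are only reclaimed when the whole pool is reset
    alignas(std::max_align_t) static inline std::byte pool[4096];
    static inline std::size_t used = 0;
public:
    static void* operator new(std::size_t size) {
        if (used + size > sizeof(pool)) throw std::bad_alloc{};
        void* p = pool + used;
        used += size;
        return p;
    }
    static void operator delete(void*, std::size_t) noexcept {
        // no-op in this sketch
    }
};

int main() {
    Foo* f = new Foo;     // allocated from Foo's private pool, not the global heap
    delete f;
}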
Along a somewhat different axis, all the containers in the C++ standard library allocate and free memory via Allocator classes. The Allocator for any particular collection is specified as a template parameter, so (for example) a declaration for std::vector looks something like this:
template <class T, class Alloc = std::allocator<T>>
class vector {
    // ...
};
This lets you specify a heap that will be used to allocate objects in that collection. Much as with operator new and operator delete, this really only specifies the interface by which the collection will allocate and free memory--it's up to you to connect that to code that actually manages the heap.
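A minimal allocator satisfying the standard requirements looks roughly like this; the name MyHeapAllocator and the use of ::operator new as the backing store are placeholders for whatever heap you actually want to route through:

#include <cstddef>
#include <new>
#include <vector>

template <class T>
struct MyHeapAllocator {
    using value_type = T;

    MyHeapAllocator() = default;
    template <class U> MyHeapAllocator(const MyHeapAllocator<U>&) {}

    T* allocate(std::size_t n) {
        // swap this call for whatever actually manages your chosen heap
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};

template <class T, class U>
bool operator==(const MyHeapAllocator<T>&, const MyHeapAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const MyHeapAllocator<T>&, const MyHeapAllocator<U>&) { return false; }

int main() {
    std::vector<int, MyHeapAllocator<int>> v{1, 2, 3};   // allocations go through MyHeapAllocator
}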
Garbage Collection
As far as garbage collection goes: I personally find it annoying, and advise against its use as a general rule. The problem is that while it can (at least from one perspective) fix some types of problems with memory management, it does nothing to help with the management of other resources--and (unfortunately) I haven't seen anything like a tracing collector for file handles, network sockets, database connections, and so on. RAII provides a uniform method for dealing with resource management in general.
That said, if you really insist on using GC, C++ does support that as well. Prior to C++11, GC was entirely usable on a practical level, but led to what was technically undefined behavior under a few obscure circumstances, such as:
storing a pointer in a file, and reading it back in, or
modifying the bits of a pointer, later un-doing that modification
...and later taking the re-constituted pointer and dereferencing it. Obviously, while the pointer wasn't visible to the CPU, the pointed-to block of memory became eligible for GC, so the later dereference caused problems. C++11 defined these circumstances and added a few library calls (e.g., declare_reachable, undeclare_reachable) to deal with them (e.g., if you call declare_reachable(block);, that block is not eligible for collection, regardless of whether a pointer to it is visible). As such, if you want to use GC with C++ you can, and the bounds of defined behavior are thoroughly specified. The only problem is that essentially no code ever calls declare_reachable and/or undeclare_reachable, so in real use they're likely to be of little or no help (but pointer swizzling and/or storage in a file are sufficiently rare that this is unlikely to pose a real problem).
For a practical example, you might want to look at the Boehm-Demers-Weiser collector (if you haven't already).
Does anyone know how to declare that an array of data in ArrayFire should be stored in shared memory instead of global memory? Is this possible? I have a small set of data that needs to be randomly accessible by all threads. It's a constant look-up table that should be available for the life of the application. Maybe I am just missing the obvious or something, but reading the ArrayFire docs and googling have not turned up any info on how I tell ArrayFire that my data needs to go into shared memory.
In CUDA, shared memory (local memory in OpenCL) is a very fast type of memory located on the GPU. It has the same lifetime as a thread block and can only be accessed by threads in the same thread block. It therefore cannot be used to store persistent data that needs to be used by multiple kernels, even in raw CUDA. You might want to look into constant or texture memory to implement a look-up table (LUT). These memory types are usually better suited to the access patterns you typically encounter with a LUT.
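For reference, in raw CUDA a constant-memory LUT looks roughly like the sketch below (names and sizes are made up); with ArrayFire you would only need something like this inside your own custom kernel:

#include <cuda_runtime.h>

#define LUT_SIZE 256
__constant__ float d_lut[LUT_SIZE];              // lives for the life of the CUDA context

__global__ void apply_lut(const unsigned char* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = d_lut[in[i]];            // all threads read through the constant cache
}

void upload_lut(const float* host_lut) {
    // copy the table once; every subsequent kernel launch can read it
    cudaMemcpyToSymbol(d_lut, host_lut, LUT_SIZE * sizeof(float));
}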
ArrayFire has a high-level API which makes GPU programming easy, with one of the fastest implementations of many commonly used functions. With ArrayFire you will not be able to specify which type of memory is created, but you are free to use the data in your own kernel. If you are using one of our functions then it is very likely we will make use of shared/texture/constant memory where it makes sense.
Umar
Disclosure: I am one of the developers of ArrayFire
OpenCL is of course designed to abstract away the details of hardware implementation, so going down too much of a rabbit hole with respect to worrying about how the hardware is configured is probably a bad idea.
Having said that, I am wondering how much local memory is efficient to use for any particular kernel. For example, if I have a work group which contains 64 work items, then presumably more than one of these work groups may run simultaneously within a compute unit. However, the local memory size returned by CL_DEVICE_LOCAL_MEM_SIZE queries applies to the whole compute unit, whereas it would be more useful if this information were per work group. Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative? Is tuning by hand the only way to go? To me that means that you are only tuning for one GPU model.
Lastly, I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less? I hear that if you allocate too much then data is just placed in global memory. Is there a way of determining if this is the case?
Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
Not in one step, but you can compute it. First, you need to know how much local memory a workgroup will need. To do so, you can use clGetKernelWorkGroupInfo with the flag CL_KERNEL_LOCAL_MEM_SIZE (strictly speaking it's the local memory required by one kernel). Since you know how much local memory there is per compute unit, you can know the maximum number of workgroups that can coexist on one compute unit.
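Something along these lines, as a host-side sketch (kernel and device are assumed to have been created already, and this is only an upper bound from local memory, ignoring registers and thread limits):

#include <CL/cl.h>

// Rough upper bound on workgroups per compute unit based on local memory alone.
cl_ulong max_groups_by_local_mem(cl_kernel kernel, cl_device_id device) {
    cl_ulong per_group = 0;                      // local memory needed by one workgroup
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(per_group), &per_group, NULL);

    cl_ulong per_cu = 0;                         // total local memory per compute unit
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(per_cu), &per_cu, NULL);

    return per_group ? per_cu / per_group : 0;
}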
Actually, this is not that simple. You have to take into consideration other parameters, such as the max number of threads that can reside on one compute unit.
This is a problem of occupancy (which you should try to maximize). Unfortunately, occupancy will vary depending on the underlying architecture.
AMD publishes an article on how to compute occupancy for different architectures here.
NVIDIA provides an xls sheet that computes the occupancy for different architectures.
Not all the necessary information to do the calculation can be queried with OCL (if I recall correctly), but nothing stops you from storing info about different architectures in your application.
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative?
It is quite rigid, and with clGetKernelWorkGroupInfo you don't need to do that. However there is something about CL_KERNEL_LOCAL_MEM_SIZE that needs to be taken into account:
If the local memory size, for any pointer argument to the kernel
declared with the __local address qualifier, is not specified, its
size is assumed to be 0.
Since you might need to compute dynamically the size of the necessary local memory per workgroup, here is a workaround based on the fact that the kernels are compiled in JIT.
You can define a constant in your kernel file and then use the -D option to set its value (previously computed) when calling clBuildProgram.
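For example (a sketch; program, device and the macro name TILE_SIZE are assumptions on my part):

#include <CL/cl.h>
#include <stdio.h>

// Bake the dynamically computed local-array size into the kernel at JIT time.
// In the kernel source:  __local float scratch[TILE_SIZE];
void build_with_tile_size(cl_program program, cl_device_id device, int tile_size) {
    char options[64];
    snprintf(options, sizeof(options), "-D TILE_SIZE=%d", tile_size);
    clBuildProgram(program, 1, &device, options, NULL, NULL);
}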
I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less?
Again, CL_KERNEL_LOCAL_MEM_SIZE is the answer. The standard states:
This includes local memory that may be needed by an implementation to
execute the kernel...
If your work is fairly independent and doesn't re-use input data, you can safely ignore everything about work groups and shared local memory. However, if your work items can share any input data (a classic example is a 3x3 or 5x5 convolution that re-reads input data), then the optimal implementation will need shared local memory. Non-independent work can also benefit. One way to think of shared local memory is as a programmer-managed cache.
I'm working on the development of Boot, which will be embedded in a PROM chip for a project.
I was tasked with making an estimation of the final memory size that the software will probably take but I've never done this before.
I searched a bit around and I'm thinking about doing the following:
Counting all the variables; this size goes directly into the total
Estimating the number of lines of code each function will take (the code hasn't been written yet)
Finding an approximate number of asm instructions per C instruction
Total size = total number of lines of code * average asm instructions per C instruction * 32 bits
My solution could very well be bogus, I hope someone will be able to help.
In principle, you are on the right track:
You need to distinguish between several types of memory footprint:
Stack
Dynamic memory (malloc, new, etc.)
Initialised variables
Un-initialised variables
Code
Stack is mostly impacted by recursion, local variables and function parameters.
Dynamic memory (heap) is obvious and also probably not relevant to you - so I'll ignore it for now.
Initialised variables are interesting since you need to count them twice - once for the program footprint on the PROM (similar to code and constants) and once for the RAM footprint.
Un-initialised variables obviously go toward the RAM, and counting their size is almost good enough (you also need to consider alignment and padding).
The hardest to estimate is the code, or what goes into PROM. You need to count constants and local variables as well as the code; the code itself is more or less what you suspect (after adding padding, alignment, function call overhead, interrupt vector initialisation, etc.), but many things can make it larger than expected, such as inline functions, library functions (many seemingly trivial operations involve such functions), casting, etc.
One way of answering the question would be from experience or assessment of existing code with similar functionality. However, there will be a number of factors that affect code size:
Target architecture and instruction set.
Compiler and compiler options used.
Library code usage.
Capability of development staff.
Required functionality.
The "development of Boot" tells us nothing about the requirements or functionality of your boot process. This will have the greatest affect on code size. As an example of how target can make a difference, 8-bit targets typically have greater code density, but generate more code for arithmetic on larger data types, while on say an ARM target where you can select between Thumb and ARM instruction sets, the code density will change significantly.
If you have no prior experience or representative code base to work from, then I suggest you perform a few experiments to get some metrics you can work with:
Build an empty application - just an empty main() function if using C or C++; that will give you the basic fixed overhead of the runtime start-up.
If you are using library code, that will probably take a significant amount of space; add dummy calls to all library interfaces you will make use of in the final application, that will tell you how much code will be taken up by library code (assuming the library code is not in-lined).
Thereafter it will depend on functionality; you might implement a subset of the required functionality, and then estimate what proportion of the final build that might constitute.
Regarding your suggestions, remember that variables do not occupy space in ROM, though any constant initialisers will do so. Typically a boot-loader can use all available RAM because the application start-up will re-establish a new runtime environment for itself, discarding the boot-loader environment and variables.
If you were to provide details of functionality and target, you may be able to leverage the experience of the community in estimating the required resources. For example I might be able to tell you (from experience) that a boot-loader with support for Flash programming that loads via a UART using XMODEM protocol on an ARM7 using ARM instruction set will fit in 4k Bytes, or that adding support for loading via SD card may add a further 6Kb, and say USB Virtual Comm Port a further 4Kb. However your requirements are possibly unique and you will have to determine the resource load for yourself somehow.