Optimizing local memory use with OpenCL - memory-management

OpenCL is, of course, designed to abstract away the details of the hardware implementation, so going too far down the rabbit hole of worrying about how the hardware is configured is probably a bad idea.
Having said that, I am wondering how much local memory it is efficient to use for any particular kernel. For example, if I have a work group containing 64 work items, then presumably more than one such group may run simultaneously within a compute unit. However, it seems that the local memory size returned by CL_DEVICE_LOCAL_MEM_SIZE queries applies to the whole compute unit, whereas it would be more useful if this information were per work group. Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
I had thought that keeping my work group's memory usage below one quarter of the total local memory size was a good idea. Is this too conservative? Is tuning by hand the only way to go? To me that means you end up tuning for only one GPU model.
Lastly, I would like to know whether the whole local memory size is available for user allocations, or whether there are other system overheads that make it less. I hear that if you allocate too much, data is just placed in global memory. Is there a way of determining if this is the case?

Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
Not in one step, but you can compute it. First, you need to know how much local memory a workgroup will need. To do so, you can use clGetKernelWorkGroupInfo with the flag CL_KERNEL_LOCAL_MEM_SIZE (strictly speaking, it is the local memory required by one kernel). Since you know how much local memory there is per compute unit, you can work out the maximum number of workgroups that can coexist on one compute unit.
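A minimal sketch of that calculation, assuming `kernel` and `device` are an already-built cl_kernel and its cl_device_id (error handling omitted; the function name is illustrative):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: estimate how many work-groups fit on one compute unit based on
 * local memory alone. Other limits (registers, max resident work-items per
 * compute unit, etc.) are deliberately ignored here. */
void estimate_groups_per_cu(cl_kernel kernel, cl_device_id device)
{
    cl_ulong device_local = 0, kernel_local = 0;

    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(device_local), &device_local, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(kernel_local), &kernel_local, NULL);

    if (kernel_local > 0)
        printf("At most %llu work-group(s) per compute unit (local memory only)\n",
               (unsigned long long)(device_local / kernel_local));
    else
        printf("Kernel reports no statically known local memory usage\n");
}
```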
Actually, this is not that simple. You have to take into consideration other parameters, such as the max number of threads that can reside on one compute unit.
This is a problem of occupancy (which you should try to maximize). Unfortunately, occupancy will vary depending on the underlying architecture.
AMD publishes an article on how to compute occupancy for the different architectures here.
NVIDIA provides an xls sheet that computes the occupancy for the different architectures.
Not all the necessary information to do the calculation can be queried with OCL (if I recall correctly), but nothing stops you from storing info about different architectures in your application.
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative?
It is quite rigid, and with clGetKernelWorkGroupInfo you don't need to do that. However there is something about CL_KERNEL_LOCAL_MEM_SIZE that needs to be taken into account:
If the local memory size, for any pointer argument to the kernel declared with the __local address qualifier, is not specified, its size is assumed to be 0.
Since you might need to compute the size of the necessary local memory per workgroup dynamically, here is a workaround based on the fact that kernels are JIT-compiled.
You can define a constant in your kernel file and then use the -D option to set its value (computed beforehand) when calling clBuildProgram.
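For instance, a hedged sketch of that workaround; the macro name TILE_SIZE and the helper function are made up for illustration:

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: inject a host-computed local-memory size into the kernel at
 * JIT-compile time. The kernel source would then declare something like
 * __local float tile[TILE_SIZE]; (TILE_SIZE is a hypothetical macro name). */
cl_int build_with_tile_size(cl_program program, cl_device_id device,
                            size_t tile_size)
{
    char options[64];
    snprintf(options, sizeof(options), "-D TILE_SIZE=%zu", tile_size);
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}
```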
I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less?
Again, CL_KERNEL_LOCAL_MEM_SIZE is the answer. The standard states:
This includes local memory that may be needed by an implementation to execute the kernel...

If your work is fairly independent and doesn't re-use input data, you can safely ignore everything about work groups and shared local memory. However, if your work items can share any input data (a classic example is a 3x3 or 5x5 convolution that re-reads input data), then the optimal implementation will need shared local memory. Non-independent work can also benefit. One way to think of shared local memory is as a programmer-managed cache.
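As a rough illustration of that pattern, here is a hedged OpenCL C sketch of a 3x3 box filter that stages an input tile (plus a one-pixel halo) in local memory. It assumes a TILE x TILE work-group size; the kernel name, tile size and border handling are illustrative, not taken from the original question.

```c
// Sketch only: each work-group computes a TILE x TILE block of output and
// stages a (TILE+2) x (TILE+2) input tile (1-pixel halo) in local memory.
// Borders are clamped; assumes the local size is TILE x TILE.
#define TILE 16

__kernel void box3x3(__global const float *in, __global float *out,
                     const int width, const int height)
{
    __local float tile[TILE + 2][TILE + 2];

    const int gx = get_global_id(0);
    const int gy = get_global_id(1);
    const int lx = get_local_id(0);
    const int ly = get_local_id(1);

    // Cooperative load: each work-item loads its own pixel plus, near the
    // tile edges, part of the halo. The loops stride by the work-group size
    // so the whole (TILE+2)^2 tile gets filled.
    for (int y = ly; y < TILE + 2; y += TILE) {
        for (int x = lx; x < TILE + 2; x += TILE) {
            int sx = clamp(gx - lx + x - 1, 0, width - 1);
            int sy = clamp(gy - ly + y - 1, 0, height - 1);
            tile[y][x] = in[sy * width + sx];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gx >= width || gy >= height)
        return;

    // Each output pixel now reads its 3x3 neighbourhood from local memory
    // instead of issuing 9 global reads.
    float sum = 0.0f;
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            sum += tile[ly + dy][lx + dx];

    out[gy * width + gx] = sum / 9.0f;
}
```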

Related

Restricting virtual address range of a process?

On Linux x86_64, I have a simple application for which I track all memory accesses with Intel's PIN. The program uses only "a bit" of memory, most of it for dynamically allocated matrices (I've bisected the right value with ulimit). However, the memory accesses span the whole range of the VM address space: low addresses for what I presume are global variables in the code, high addresses for the malloc()ed arrays.
There's a huge gap in the middle, and even in the high addresses the range is between 0x7fff4e1a37f4 and 0x7fea3af99000, which is much larger than what I would assume my application to use in total.
The post-processing that I need to do on the memory accesses deals very badly with these sparse accesses, so I'm looking for a way to restrict the virtual address range available to the process so that "it just fits", and accesses will show addresses between 0 and some more reasonable value for dynamically allocated memory (somewhere around the 40 Mb that I've discovered through ulimit).
Q: Is there an easy way to limit the available address space (and hence implicitly, available memory) to an individual process on Linux, ideally from the command line on a per-process basis?
Further notes:
I can link my application statically.
Even if I limit the memory with ulimit, the process still uses the full VM address range (not entirely unexpected).
I know about /proc/${pid}/maps, but would rather avoid writing wrappers to deal with it, and it's not obvious how I would actually use the data in there.
I've heard about prelink (which may not apply to my static binary, but only to libraries?) and can imagine that there are more intrusive ways to interfere with malloc(), but these solutions are too far out of my expertise to evaluate their usefulness (Limiting the heap area's Virtual address range, https://stackoverflow.com/a/29960208/60462).
If there's no simple command-line solution, instead of going for any elaborate hack, I'll probably just wing it in the post-processing and "normalize" the addresses (e.g. via a few lines of Perl).
ld.so(8) lists LD_PREFER_MAP_32BIT_EXEC as a way to restrict the address space to the lower 2 GiB, which is significantly less than the normal 64-bit address space, but possibly not small enough for your purposes.
It may also be possible to use the prctl(2) PR_SET_MM options to control the addresses of different parts of the program to solve your problem.
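A very rough, illustrative sketch of the prctl route follows; it is not a drop-in solution. PR_SET_MM requires CAP_SYS_RESOURCE, the kernel validates the addresses you pass, and the value used below is only a placeholder:

```c
#define _GNU_SOURCE
#include <sys/prctl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder: re-assert the current program break via PR_SET_MM_BRK.
     * A real use would pass a carefully chosen address to relocate parts of
     * the address-space layout; see prctl(2) for the other PR_SET_MM_*
     * fields (start/end of data, stack, brk, ...). */
    unsigned long cur_brk = (unsigned long)sbrk(0);

    if (prctl(PR_SET_MM, PR_SET_MM_BRK, cur_brk, 0, 0) != 0)
        perror("prctl(PR_SET_MM, PR_SET_MM_BRK)");

    return 0;
}
```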

Are OpenCL workgroups executed simultaneously?

My understanding was that each workgroup is executed on the GPU, and then the next one is executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operates on it directly, no conflicts arise.
If a workgroup loads a chunk into local memory, performs some computation, and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to execute only one workgroup at a time, or rearrange the execution order, so that I don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck to one core on a 32-core GPU you're not getting much use of the device.
You need to set the global size and dimensions to that of a single work group, and enqueue a new NDRange for each group. Essentially, you break up the call to your kernel into many smaller calls. Make sure your command queue does not allow out-of-order execution, so that the kernel launches run one after another.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
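A hedged sketch of that approach for the one-dimensional case (error handling and kernel/queue setup omitted; the function name and the use of a global work offset are illustrative):

```c
#include <CL/cl.h>

/* Sketch: each launch covers exactly one work-group, offset to a different
 * chunk of the data. On an in-order queue the launches execute one after
 * another, so only one work-group touches global memory at a time. */
void run_groups_serially(cl_command_queue queue, cl_kernel kernel,
                         size_t group_size, size_t num_groups)
{
    for (size_t g = 0; g < num_groups; ++g) {
        size_t offset = g * group_size;   /* where this chunk starts     */
        size_t global = group_size;       /* exactly one work-group wide */
        size_t local  = group_size;

        clEnqueueNDRangeKernel(queue, kernel, 1, &offset,
                               &global, &local, 0, NULL, NULL);
    }
    clFinish(queue); /* wait for the whole series to complete */
}
```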
Yes, the groups can be executed in parallel; this is normally a very good thing. Here is a related question.
The number of workgroups that can be concurrently launched on a ComputeUnit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, important ones being vector-registers and workgroup-level-memory** (called LDS for AMD and shared memory for Nvidia). If you want to launch just one workgroup on the CU/SMX, make sure that the workgroup consumes a bulk of these resources and blocks further workgroups on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs - a GPU normally has multiple of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.

OpenCL: work group concept

I don't really understand the purpose of Work-Groups in OpenCL.
I understand that they are a group of Work Items (supposedly, hardware threads), which get executed in parallel.
However, why is there this need for a coarser subdivision? Wouldn't it be OK to have only the grid of threads (and, de facto, only one work-group)?
Should a Work-Group exactly map to a physical core? For example, the TESLA c1060 card is said to have 240 cores. How would the Work-Groups map to this?
Also, as far as I understand, work-items inside a work group can be synchronized thanks to memory fences. Can work-groups synchronize, or is that even needed? Do they talk to each other via shared memory, or is this only for work items (not sure on this one)?
Part of the confusion here, I think, comes down to terminology. What GPU people often call cores aren't really cores, and what GPU people often call threads are threads only in a certain sense.
Cores
A core, in GPU marketing terms, may refer to something like a CPU core, or it may refer to a single lane of a SIMD unit - in effect a single-core x86 CPU would be four cores of this simpler type. This is why GPU core counts can be so high. It isn't really a fair comparison; you have to divide by 16, 32 or a similar number to get a more directly comparable core count.
Work-items
Each work-item in OpenCL is a thread in terms of its control flow, and its memory model. The hardware may run multiple work-items on a single thread, and you can easily picture this by imagining four OpenCL work-items operating on the separate lanes of an SSE vector. It would simply be compiler trickery that achieves that, and on GPUs it tends to be a mixture of compiler trickery and hardware assistance. OpenCL 2.0 actually exposes this underlying hardware thread concept through sub-groups, so there is another level of hierarchy to deal with.
Work-groups
Each work-group contains a set of work-items that must be able to make progress in the presence of barriers. In practice this means that it is a set, all of whose state is able to exist at the same time, such that when a synchronization primitive is encountered there is little overhead in switching between them and there is a guarantee that the switch is possible.
A work-group must map to a single compute unit, which realistically means an entire work-group fits on a single entity that CPU people would call a core - CUDA would call it a multiprocessor (depending on the generation), AMD a compute unit and others have different names. This locality of execution leads to more efficient synchronization, but it also means that the set of work-items can have access to locally constructed memory units. They are expected to communicate frequently, or barriers wouldn't be used, and to make this communication efficient there may be local caches (similar to a CPU L1) or scratchpad memories (local memory in OpenCL).
As long as barriers are used, work-groups can synchronize internally, between work-items, using local memory, or by using global memory. Work-groups cannot synchronize with each other and the standard makes no guarantees on forward progress of work-groups relative to each other, which makes building portable locking and synchronization primitives effectively impossible.
A lot of this is due to history rather than design. GPU hardware has long been designed to construct vector threads and assign them to execution units in a fashion that optimally processes triangles. OpenCL falls out of generalising that hardware to be useful for other things, but not generalising it so much that it becomes inefficient to implement.
There are already a lot of good answers; for a further understanding of the terminology of OpenCL, this paper ("An Introduction to the OpenCL Programming Model" by Jonathan Tompson and Kristofer Schlachter) describes all the concepts very well.
Use of work-groups allows more optimization by the kernel compilers, because data is not transferred between work-groups. Depending on the OpenCL device used, there might be caches that can be used for local variables, resulting in faster data accesses. If there were only one work-group, local variables would be just the same as global variables, which would lead to slower data accesses.
Also, usually OpenCL devices use Single Instruction Multiple Data (SIMD) extensions to achieve good parallelism. One work group can be run in parallel with SIMD extensions.
Should a Work-Group exactly map to a physical core?
I think the only way to find the fastest work-group size is to try different work-group sizes. It is also possible to query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE from the device with clGetKernelWorkGroupInfo. The fastest size should be a multiple of that.
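For example, a minimal sketch of that query (the kernel must already be built for the device; the function name is illustrative):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: query the preferred work-group size multiple for a built kernel,
 * then benchmark work-group sizes that are multiples of it. */
size_t preferred_multiple(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("Preferred work-group size multiple: %zu\n", multiple);
    return multiple;
}
```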
Can work-groups synchronize, or is that even needed?
Work-groups cannot be synchronized with each other. That way there are no data dependencies between them, and they can also be run sequentially if that is considered the fastest way to run them. To achieve the same effect as synchronization between work-groups, the kernel needs to be split into multiple kernels; values can be passed between the kernels with buffers.
One benefit of work groups is that they enable using shared local memory as a programmer-defined cache. A value read from global memory can be stored in shared work-group local memory and then accessed quickly by any work item in the work group. A good example is the game of life: each cell depends on itself and the 8 around it. If each work item read this information you'd have 9x global memory reads. By using work groups and shared local memory you can approach 1x global memory reads (only approach it, since there are redundant reads at the tile edges).

OpenCL Buffer caching behaviour

I've always been wondering about the caching behaviour of global data in OpenCL.
Let's say I have a pointer to global memory in a kernel.
Now I read the location the pointer points to.
Later in the kernel I might need the same data again, so I read it again through the pointer.
Now the question is, will this data be cached, or will it be reread from global memory every single time because other threads could have modified it?
If it's not cached, then I'd have to make a local copy every time so I don't lose tons of performance by constantly accessing global memory.
I know this might be vendor specific, but what do the specs say about this?
There is some caching, but the key to great GPU compute performance is to move data that is accessed many times into private or shared local memory and not re-read it from global memory. In a way, you can think of this as "you control the caching". In OpenCL this would be done in your kernel (in parallel!), followed by a memory barrier (to ensure all work items have finished the copy); after that, your algorithm has access to the data in fast memory. See the matrix multiply example (since each column and row contributes to multiple output values, copying them to shared local memory accelerates the algorithm).
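A hedged OpenCL C sketch of that matrix-multiply pattern (square N x N matrices, N assumed to be a multiple of TILE, work-group size TILE x TILE; kernel and variable names are illustrative):

```c
// Sketch: C = A * B, with each work-group staging TILE x TILE sub-blocks of
// A and B in local memory so each element is read from global memory
// N/TILE times instead of N times.
#define TILE 16

__kernel void matmul_tiled(__global const float *A, __global const float *B,
                           __global float *C, const int N)
{
    const int row  = get_global_id(1);
    const int col  = get_global_id(0);
    const int lrow = get_local_id(1);
    const int lcol = get_local_id(0);

    __local float Asub[TILE][TILE];
    __local float Bsub[TILE][TILE];

    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative copy of one tile of A and one tile of B.
        Asub[lrow][lcol] = A[row * N + (t * TILE + lcol)];
        Bsub[lrow][lcol] = B[(t * TILE + lrow) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);   // tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += Asub[lrow][k] * Bsub[k][lcol];
        barrier(CLK_LOCAL_MEM_FENCE);   // done with these tiles
    }
    C[row * N + col] = acc;
}
```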
Those who want the benefits of local caching for work-items within a work-group, for example on FPGAs, can read this paper by Andrew Ling at IWOCL 2017: https://www.iwocl.org/wp-content/uploads/iwocl2017-andrew-ling-fpga-sdk.pdf. It is a good example of correct usage of local caches and clever communication for dataflow computing. Those who want the convenience of a cache in a parallel peer-to-peer setting, and still want the hardware to do this for them, should consider POWER8 or POWER9 chips. These conveniences come at a cost: to cache global or virtual memory, the cluster interconnect may have to provide several TB/s of bandwidth. The real question is: what is the value of caching for dataflow compute (e.g. ML), especially on clusters, versus reducing communication and increasing data reuse by other means?

CUDA: When to use shared memory and when to rely on L1 caching?

Since Compute Capability 2.0 (Fermi) was released, I've been wondering whether there are any use cases left for shared memory. That is, when is it better to use shared memory than just let L1 perform its magic in the background?
Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modifications?
To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1), and synchronize with __threadfence_block()? The latter option should be easier to implement since it doesn't have to deal with two different locations for the values, and it should be faster because there is no explicit copying from global to shared memory. Since the data gets cached in L1, threads don't have to wait for data to actually make it all the way out to global memory.
With shared memory, one is guaranteed that a value that was put there remains there throughout the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it's better to cache such rarely used data in shared memory than to let the L1 manage it based on the usage pattern that the algorithm actually has?
Two big reasons why automatic caching is less efficient than manually managed scratchpad memory (this applies to CPUs as well):
Parallel accesses to random addresses are more efficient. Example: histogramming. Let's say you want to increment N bins, each more than 256 bytes apart. Then, due to coalescing rules, that will result in N serial reads/writes, since global and cache memory is organized in large ~256-byte blocks. Shared memory doesn't have that problem.
Also, to access global memory you have to do virtual-to-physical address translation. Having a TLB that can do lots of translations in parallel would be quite expensive. I haven't seen any SIMD architecture that actually does vector loads/stores in parallel, and I believe this is the reason why.
A scratchpad avoids writing back dead values to memory, which wastes bandwidth and power. Example: in an image processing pipeline, you don't want your intermediate images to get flushed to memory.
Also, according to an NVIDIA employee, current L1 caches are write-through (they immediately write to the L2 cache), which will slow down your program.
So basically, the caches get in the way if you really want performance.
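To make the histogram point concrete, here is a hedged OpenCL C sketch of the same privatization idea (the CUDA shared-memory version is analogous). NUM_BINS and the kernel name are illustrative, and it assumes 8-bit input data and OpenCL 1.1 atomics:

```c
// Sketch: per-work-group histogram privatization. Random-address increments
// hit fast local memory; global memory sees only one atomic add per bin per
// work-group when the partial histogram is merged at the end.
#define NUM_BINS 256

__kernel void histogram_local(__global const uchar *data, const int n,
                              __global uint *hist)
{
    __local uint local_hist[NUM_BINS];

    const int lid = get_local_id(0);
    const int lsz = get_local_size(0);
    const int gid = get_global_id(0);
    const int gsz = get_global_size(0);

    // Zero the work-group's private histogram.
    for (int b = lid; b < NUM_BINS; b += lsz)
        local_hist[b] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Grid-stride loop over the input; increments land in local memory.
    for (int i = gid; i < n; i += gsz)
        atomic_inc(&local_hist[data[i]]);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Merge the partial histogram into the global one.
    for (int b = lid; b < NUM_BINS; b += lsz)
        atomic_add(&hist[b], local_hist[b]);
}
```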
As far as I know, the L1 cache in a GPU behaves much like the cache in a CPU. So your comment that "This is as opposed to values in L1, which get evicted if they are not used often enough" doesn't make much sense to me.
Data in the L1 cache isn't evicted when it isn't used often enough. Usually it is evicted when a request is made for a memory region that wasn't previously in cache and whose address resolves to one that is already in use. I don't know the exact caching algorithm employed by NVIDIA, but assuming a regular n-way set-associative cache, each memory entry can only be cached in a small subset of the entire cache, based on its address.
I suppose this may also answer your question. With shared memory, you get full control over what gets stored where, while with a cache everything is done automatically. Even though the compiler and the GPU can still be very clever in optimizing memory accesses, you can sometimes still find a better way, since you're the one who knows what input will be given and which threads will do what (to a certain extent, of course).
Caching data through several memory layers always needs to follow a cache-coherency protocol. There are several such protocols, and the decision on which one is the most suitable is always a trade-off.
You can have a look at some examples:
Related to GPUs
Generally for computing units
I don't want to go into many details, because it is a huge domain and I am not an expert. What I want to point out is that in a shared-memory system (here the term shared does not refer to the so-called shared memory of GPUs) where many compute units (CUs) need data concurrently, there is a memory protocol that attempts to keep the data close to the units so that they can fetch it as fast as possible. In the example of a GPU, when many threads in the same SM (streaming multiprocessor) access the same data, there should be coherency in the sense that if thread 1 reads a chunk of bytes from global memory and in the next cycle thread 2 is going to access that data, then an efficient implementation would be one in which thread 2 is aware that the data is already in the L1 cache and can access it fast. This is what the cache coherency protocol attempts to achieve: to let all compute units be up to date with what data exist in the caches L1, L2 and so on.
However, keeping threads up to date, or in other words keeping threads in coherent states, comes at some cost, which is essentially lost cycles.
In CUDA, by defining memory as shared rather than relying on the L1 cache, you free it from that coherency protocol. So access to that memory (which is physically the same piece of whatever material it is) is direct and does not implicitly invoke the coherency protocol.
I don't know exactly how fast this should be, and I didn't perform any such benchmark, but the idea is that since you no longer pay for this protocol, the access should be faster!
Of course, the shared memory on NVIDIA GPUs is split into banks, and anyone who wants to use it for performance improvement should have a look at this first. The reason is bank conflicts, which occur when two threads access the same bank and cause serialization of the accesses, but that's another topic (link).

Resources