OpenCL: work group concept - parallel-processing

I don't really understand the purpose of Work-Groups in OpenCL.
I understand that they are a group of Work Items (supposedly, hardware threads), which ones get executed in parallel.
However, why is there this need of coarser subdivision ? Wouldn't it be OK to have only the grid of threads (and, de facto, only one W-G)?
Should a Work-Group exactly map to a physical core ? For example, the TESLA c1060 card is said to have 240 cores. How would the Work-Groups map to this??
Also, as far as I understand, work-items inside a work group can be synchronized thanks to memory fences. Can work-groups synchronize or is that even needed ? Do they talk to each other via shared memory or is this only for work items (not sure on this one)?

Part of the confusion here I think comes down to terminology. What GPU people often call cores, aren't really, and what GPU people often call threads are only in a certain sense.
Cores
A core, in GPU marketing terms may refer to something like a CPU core, or it may refer to a single lane of a SIMD unit - in effect a single core x86 CPU would be four cores of this simpler type. This is why GPU core counts can be so high. It isn't really a fair comparison, you have to divide by 16, 32 or a similar number to get a more directly comparable core count.
Work-items
Each work-item in OpenCL is a thread in terms of its control flow, and its memory model. The hardware may run multiple work-items on a single thread, and you can easily picture this by imagining four OpenCL work-items operating on the separate lanes of an SSE vector. It would simply be compiler trickery that achieves that, and on GPUs it tends to be a mixture of compiler trickery and hardware assistance. OpenCL 2.0 actually exposes this underlying hardware thread concept through sub-groups, so there is another level of hierarchy to deal with.
Work-groups
Each work-group contains a set of work-items that must be able to make progress in the presence of barriers. In practice this means that it is a set, all of whose state is able to exist at the same time, such that when a synchronization primitive is encountered there is little overhead in switching between them and there is a guarantee that the switch is possible.
A work-group must map to a single compute unit, which realistically means an entire work-group fits on a single entity that CPU people would call a core - CUDA would call it a multiprocessor (depending on the generation), AMD a compute unit and others have different names. This locality of execution leads to more efficient synchronization, but it also means that the set of work-items can have access to locally constructed memory units. They are expected to communicate frequently, or barriers wouldn't be used, and to make this communication efficient there may be local caches (similar to a CPU L1) or scratchpad memories (local memory in OpenCL).
As long as barriers are used, work-groups can synchronize internally, between work-items, using local memory, or by using global memory. Work-groups cannot synchronize with each other and the standard makes no guarantees on forward progress of work-groups relative to each other, which makes building portable locking and synchronization primitives effectively impossible.
A lot of this is due to history rather than design. GPU hardware has long been designed to construct vector threads and assign them to execution units in a fashion that optimally processes triangles. OpenCL falls out of generalising that hardware to be useful for other things, but not generalising it so much that it becomes inefficient to implement.

There are already alot of good answers, for further understanding of the terminology of OpenCL this paper ("An Introduction to the OpenCL Programming Model" by Jonathan Tompson and Kristofer Schlachter) actually describes all the concepts very well.

Use of the work-groups allows more optimization for the kernel compilers. This is because data is not transferred between work-groups. Depending on used OpenCL device, there might be caches that can be used for local variables to result faster data accesses. If there is only one work-group, local variables would be just the same as global variables which would lead to slower data accesses.
Also, usually OpenCL devices use Single Instruction Multiple Data (SIMD) extensions to achieve good parallelism. One work group can be run in parallel with SIMD extensions.
Should a Work-Group exactly map to a physical core ?
I think that, only way to find the fastest work-group size, is to try different work-group sizes. It is also possible to query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE from the device with clGetKernelWorkGroupInfo. The fastest size should be multiple of that.
Can work-groups synchronize or is that even needed ?
Work-groups cannot be synchronized. This way there is no data dependencies between them and they can also be run sequentially, if that is considered to be the fastest way to run them. To achieve same result, than synchronization between work-groups, kernel needs to split into multiple kernels. Variables can be transferred between the kernels with buffers.

One benefit of work groups is they enable using shared local memory as a programmer-defined cache. A value read from global memory can be stored in shared work-group local memory and then accessed quickly by any work item in the work group. A good example is the game of life: each cell depends on itself and the 8 around it. If each work item read this information you'd have 9x global memory reads. By using work groups and shared local memory you can approach 1x global memory reads (only approach since there is redundant reads at the edges).

Related

How does cache affect while a same kernel is being launched repeatedly

I recently start learning OpenCL and have a question about interaction between cache and kernel in OpenCL. I am writing a program to measure a latency for accessing main memory.(bypassing caches) Therefore, I am wondering whether cache memory is cleared automatically after a kernel execution is finished or it will be remained and be used while the same kernel is executed repeatedly?
Thanks!
For AMD Radeon GCNs, L1 and L2 cache is persistent between all kernels and all different kernels. A kernel can use cached data from any other kernel. Additionally, Local Memory inside a Compute Unit is not cleared/zeroed between kernel runs (more precisely, between work-group runs). This means you have to initialize local variables. The same should apply for nVidia/CUDA devices and generic SIMD CPUs.
That being said, OpenCL does not know or define different level of caches, caches are vendor specific. Any functionality that handles or manages caching is a vendor specific extension.
To test latency, use a pseudo-random number generator in your kernel, and read random memory addresses. Use 2 kernels, the 1st one pollutes all caches, the 2nd one then does the actual latency measuring.
in OpenCL memory hierarchy there are NO "caches" (in the sense of CPU). In OpenCL there are different kind of memories that you can controll with some instructions. Here you can have a look on what I mean:
The fastest memories are the Private memory and the Local Memory. You can declare Variables in this memory space and controll, moving them in the way that you prefer. You should be careful because in the Local memory you can share data among "Block" and data inside the Privite is visible only by the Thread. Here you can find a lot of other informations.
So if you run repeatedly a kernel you can store your variables in the memory that you prefer and you will notice that if the variables are in the privite mamory you will be realy fast in comparison with the other solutions.

Are OpenCL workgroups executed simultaneously?

My understanding was, that each workgroup is executed on the GPU and then the next one is executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operate on it directly, no conflicts arise.
If the workgroup loads chunk into local memory, performe some computation and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to only execute one workgroup at once or rearrange the execution order, so that I somehow don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck to one core on a 32-core GPU you're not getting much use of the device.
You need to set the global size and dimensions to that of a single work group, and enqueue a new NDRange for each group. Essentially, breaking up the call to your kernel into many smaller calls. Make sure your command queue is not allowing out of order execution, so that the kernel calls are blocking.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
Yes, the groups can be executed in parallel; this is normally a very good thing. Here is a related question.
The number of workgroups that can be concurrently launched on a ComputeUnit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, important ones being vector-registers and workgroup-level-memory** (called LDS for AMD and shared memory for Nvidia). If you want to launch just one workgroup on the CU/SMX, make sure that the workgroup consumes a bulk of these resources and blocks further workgroups on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs - a GPU normally has multiple of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.

Optimizing local memory use with OpenCL

OpenCL is of course designed to abstract away the details of hardware implementation, so going down too much of a rabbit hole with respect to worrying about how the hardware is configured is probably a bad idea.
Having said that, I am wondering how much local memory is efficient to use for any particular kernel. For example if I have a work group which contains 64 work items then presumably more than one of these may simultaneously run within a compute unit. However it seems that the local memory size as returned by CL_DEVICE_LOCAL_MEM_SIZE queries is applicable to the whole compute unit, whereas it would be more useful if this information was for the work group. Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative? Is tuning by hand the only way to go? To me that means that you are only tuning for one GPU model.
Lastly, I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less? I hear that if you allocate too much then data is just placed in global memory. Is there a way of determining if this is the case?
Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
Not in one step, but you can compute it. First, you need to know how much local memory a workgroup will need. To do so, you can use clGetKernelWorkGroupInfo with the flag CL_KERNEL_LOCAL_MEM_SIZE (strictly speaking it's the local memory required by one kernel). Since you know how much local memory there is per compute unit, you can know the maximum number of workgroups that can coexist on one compute unit.
Actually, this is not that simple. You have to take into consideration other parameters, such as the max number of threads that can reside on one compute unit.
This is a problem of occupancy (that you should try to maximize). Unfortunately, occupancy will vary depending of the underlying architecture.
AMD publish an article on how to compute occupancy for different architectures here.
NVIDIA provide an xls sheet that compute the occupancy for the different architectures.
Not all the necessary information to do the calculation can be queried with OCL (if I recall correctly), but nothing stops you from storing info about different architectures in your application.
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative?
It is quite rigid, and with clGetKernelWorkGroupInfo you don't need to do that. However there is something about CL_KERNEL_LOCAL_MEM_SIZE that needs to be taken into account:
If the local memory size, for any pointer argument to the kernel
declared with the __local address qualifier, is not specified, its
size is assumed to be 0.
Since you might need to compute dynamically the size of the necessary local memory per workgroup, here is a workaround based on the fact that the kernels are compiled in JIT.
You can define a constant in you kernel file and then use the -D option to set its value (previously computed) when calling clBuildProgram.
I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less?
Again CL_KERNEL_LOCAL_MEM_SIZE is the answer. the standard states:
This includes local memory that may be needed by an implementation to
execute the kernel...
If your work is fairly independent and doesn't re-use input data you can safely ignore everything about work groups and shared local memory. However, if your work items can share any input data (classic example is a 3x3 or 5x5 convolution that re-reads input data) then the optimal implementation will need shared local memory. Non-independent work can also benefit. One way to think of shared local memory is programmer-managed cache.

CPU SIMD vs GPU SIMD?

GPU uses the SIMD paradigm, that is, the same portion of code will be executed in parallel, and applied to various elements of a data set.
However, CPU also uses SIMD, and provide instruction-level parallelism. For example, as far as I know, SSE-like instructions will process data elements with parallelism.
While the SIMD paradigm seems to be used differently in GPU and CPU, does GPUs have more SIMD power than CPUs?
In which way the parallel computational capabilities in a CPU are 'weaker' than the ones in a GPU?
Both CPUs & GPUs provide SIMD with the most standard conceptual unit being 16 bytes/128 bits; for example a Vector of 4 floats (x,y,z,w).
Simplifying:
CPUs then parallelize more through pipelining future instructions so they proceed faster through a program. Then next step is multiple cores which run independent programs.
GPUs on the other hand parallelize by continuing the SIMD approach and executing the same program multiple times; both by pure SIMD where a set of programs execute in lock step (which is why branching is bad on a GPU, as both sides of an if statement must execute; and one result be thrown away so that the lock step programs proceed at the same rate); and also by single program, multiple data (SPMD) where groups of the sets of identical programs proceed in parallel but not necessarily in lock step.
The GPU approach is great where the exact same processing needs be applied to large volumes of data; for example a million vertices than need to be transformed in the same way, or many million pixels that need the processing to produce their colour. Assuming they don't become data block/pipeline stalled, GPUs programs general offer more predictable time bound execution due to its restrictions; which again is good for temporal parallelism e.g. the programs need to repeat their cycle at a certain rate for example 60 times a second (16ms) for 60 fps.
The CPU approach however is better for decisioning and performing multiple different tasks at the same time and dealing with changing inputs and requests.
Apart from its many other uses and purposes, the CPU is used to orchestrate work for the GPU to perform.
It's a similar idea, it goes kind of like this (very informally speaking):
The CPU has a set amount of functions that can run on packed values. Depending on your brand and version of your CPU, you might have access to SSE2, 3, 4, 3dnow, etc, and each of them gives you access to more and more functions. You're limited by the register size and the larger data types you work with the less values you can use in parallel. You can freely mix and match SIMD instructions with traditional x86/x64 instructions.
The GPU lets you write your entire pipeline for each pixel of a texture. The texture size doesn't depend on your pipeline length, ie the number of values you can affect in one cycle isn't dependant on anything but your GPU, and the functions you can chain (your pixel shader) can be pretty much anything. It's somewhat more rigid though in that the setup and readback of your values is somewhat slower, and it's a one shot process (load values, run shader, read values), you can't massage them at all besides that, so you actually need to use a lot of values for it to be worth it.

Why don't GPU libraries support automated function composition?

Intel's Integrated Performance Primitives (IPP) library has a feature called Deferred Mode Image Processing (DMIP). It lets you specify a sequence of functions, composes the functions, and applies the composed function to an array via cache-friendly tiled processing. This gives better performance than naively iterating through the whole array for each function.
It seems like this technique would benefit code running on a GPU as well. There are many GPU libraries available, such as NVIDIA Performance Primitives (NPP), but none seem to have a feature like DMIP. Am I missing something? Or is there a reason that GPU libraries would not benefit from automated function composition?
GPU programming has similar concepts to DMIP function composition on CPU.
Although it's not easy to be "automated" on GPU (some 3rd party lib may be able to do it), manually doing it is easier than CPU programming (see Thrust example below).
Two main features of DMIP:
processing by the image fragments so data can fit into cache;
parallel processing to different fragments or execution of different independent branches of a graph.
When applying a sequence of basic operations on a large image,
feature 1 omits RAM read/write between basic operations. All the read/write is done in cache, and feature 2 can utilize multi-core CPU.
The similar concept of DMIP feature 1 for GPGPU is kernel fusion. Instead of applying multiple kernels of the basic operation to image data, one could combine basic operations in one kernel to avoid multiple GPU global memory read/write.
A manually kernel fusion example using Thrust can be found in page 26 of this slides.
It seems the library ArrayFire has made notable efforts on automaic kernel fusion.
The similar concept of DMIP feature 2 for GPGPU is concurrent kernel execution.
This feature enlarge the bandwidth requirement, but most GPGPU programs are already bandwidth bound. So the concurrent kernel execution is not likely to be used very often.
CPU cache vs. GPGPU shared mem/cache
CPU cache omits the RAM read/write in DMIP, while for a fused kernel of GPGPU, registers do the same thing. Since a thread of CPU in DMIP processes on a small image fragment, but a thread of GPGPU often processes only one pixel. A few registers are large enough to buffer the data of a GPU thread.
For image processing, GPGPU shared mem/cache is often used when the result pixel depends on surrounding pixels. Image smoothing/filtering is a typical example requiring GPGPU shared mem/cache.

Resources