My understanding was, that each workgroup is executed on the GPU and then the next one is executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operate on it directly, no conflicts arise.
If the workgroup loads chunk into local memory, performe some computation and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to only execute one workgroup at once or rearrange the execution order, so that I somehow don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck to one core on a 32-core GPU you're not getting much use of the device.
You need to set the global size and dimensions to that of a single work group, and enqueue a new NDRange for each group. Essentially, breaking up the call to your kernel into many smaller calls. Make sure your command queue is not allowing out of order execution, so that the kernel calls are blocking.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
Yes, the groups can be executed in parallel; this is normally a very good thing. Here is a related question.
The number of workgroups that can be concurrently launched on a ComputeUnit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, important ones being vector-registers and workgroup-level-memory** (called LDS for AMD and shared memory for Nvidia). If you want to launch just one workgroup on the CU/SMX, make sure that the workgroup consumes a bulk of these resources and blocks further workgroups on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs - a GPU normally has multiple of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.
Related
I wanted to ask you a question regarding the consistency of the cache memory.
If I have a sequential program, I shouldn't have cache consistency problems because in any case the instructions are executed sequentially and consequently there is no danger that several processors will write the same memory location at the same time, in case there are is the shared memory.
Different case is the situation where I have a parallel program, so it runs on multiple processors and there is a high probability that there are cache consistency problems.
Quite right?
In a single-threaded program, unless otherwise programmed, it doesn't change the thread by itself, except if OS does (and when it does, all the same thread-states are re-loaded from memory into that cache so there is no problem about coherence in there).
In a multi-threaded program, an update on same variable found on other caches needs to inform those caches somehow. This causes a re-flow of data through all other caches. Maybe it's not a blocking effect on same thread but once user wants only updated values, the synchronization / locking will see a performance hit. Especially when there are also other variables being updated on very close addresses such that they're in same cache-line. That's why using 20-byte elements for locking resolution is worse than using 128-byte elements in an array of locks.
If CPUs did not have coherence, multi-threading wouldn't work efficiently. So, for some versions, they chose to broadcast an update to all caches (as in Snoop cache). But this is not efficient on high number of cores. If 1000 cores existed in same CPU, it would require a 1000-way broadcasting logic consuming a lot of area of circuitry. So they break the problem into smaller parts and add other ways like directory-based coherence & multiple chunks of multiple cores. But this adds more latency for the coherence.
On the other hand, many GPUs do not implement automatic cache coherence because
the algorithm given by developer is generally embarrassingly parallel with only few points of synchronization and multiple blocks of threads do not require to communicate with other blocks (when they do, they go through a common cache by developer's choice of instructions anyway)
there are thousands of streaming pipelines (not real cores) that just need to make memory requests efficiently or else there wouldn't be enough space for that many pipelines
high throughput is required instead of low-latency (no need for implicit coherence anywhere)
so multi-processors in a GPU are designed to do completely independent work from each other and adding automatic coherence would add little performance (if not subtract). When developer needs to synchronize data between multiple threads in GPU in same block, there are instructions for this and not using these do not make any valid data update. So it's just an optional cache coherence in GPU.
When creating a logical device, we can specify multiple queue for one family queue index as we can see in the VkDeviceQueueCreateInfo, and we can pass an array of this struct in the VkDeviceCreateInfo.
So my first clue to, for instance, create a transfer queue and a graphics queue, is to use the same logical device with two different VkDeviceQueueCreateInfo at device creation.
But, can I create two logical devices from the same physical device with a different VkDeviceQueueCreateInfo (one for graphics, and one for transfer) ?
And if yes, what could be the benefit or the bad idea to make one or the other solution ?
Generally speaking, when assessing possible ways to do things in Vulkan, you should pick the way that seems to require the least stuff.
In this case, you're trying to select between multiple queues and multiple devices. Well, the multi-queue method obviously requires less stuff; you have one device with many queues (in theory), rather than many devices with many queues (one from each device). Same number of queues, but more devices. So pick the one with less.
The Vulkan API is not trying to trick you into taking the slower path. If using multiple devices with one queue per-device were the best option, then Vulkan wouldn't have multiple queues as an option at all.
To get into more detail, you say that you want to do memory transfers outside of graphics operations. OK, fine.
Physical devices do not have to provide multiple queues. Some devices provide exactly one queue family, which can have exactly one VkQueue created from it. Obviously if a device allows for multiple queues, you should use them. But if it only allows for one, then you might have reason to think that you should just create multiple devices and work that way.
Even in this case, do not do this.
Here's the thing: if a GPU could actually do multiple operations independently such that they overlap... they'd expose multiple queues. The fact that a physical device does not do so therefore suggests that independent execution of different operations simply is not possible at the GPU level.
This means that, even if you use multiple devices, the transfer and graphics operation will almost certainly execute in some order. That is, whichever vkQueueSubmit is issued first will be the one that does its work first.
So using multiple devices gives you no actual GPU execution overlap (in theory). You've gained nothing, and you've lost explicit control over the order in which these operations are issued.
Now, it may be that the execution of a transfer operation on the graphics queue will not inhibit the execution of rendering commands on that same queue. That is, transfer operations can start, then rendering commands can start while the transfer completes via DMA or something. So they start executing in order, but finish executing in any order.
Even if that's the case, working across devices doesn't give you any advantages here. As previously noted, you lose control over the order in which these commands are submitted. Graphics commands tend to hog the command queue, while a single transfer command could (on such a system) be processed and then execute in the background while processing unrelated commands. In such a case, it's important to send any transfer commands before graphics commands for a particular frame.
And if you have two devices, you have to have 2 vkQueueSubmit calls rather than one. And vkQueueSubmit calls are not known for being fast.
There are a host of other reasons not to try multi-device stuff for this. For example, if later rendering operations need access to the transferred data, this means you need external memory and external synchronization primitives to synchronize access between the devices. And so on.
I don't really understand the purpose of Work-Groups in OpenCL.
I understand that they are a group of Work Items (supposedly, hardware threads), which ones get executed in parallel.
However, why is there this need of coarser subdivision ? Wouldn't it be OK to have only the grid of threads (and, de facto, only one W-G)?
Should a Work-Group exactly map to a physical core ? For example, the TESLA c1060 card is said to have 240 cores. How would the Work-Groups map to this??
Also, as far as I understand, work-items inside a work group can be synchronized thanks to memory fences. Can work-groups synchronize or is that even needed ? Do they talk to each other via shared memory or is this only for work items (not sure on this one)?
Part of the confusion here I think comes down to terminology. What GPU people often call cores, aren't really, and what GPU people often call threads are only in a certain sense.
Cores
A core, in GPU marketing terms may refer to something like a CPU core, or it may refer to a single lane of a SIMD unit - in effect a single core x86 CPU would be four cores of this simpler type. This is why GPU core counts can be so high. It isn't really a fair comparison, you have to divide by 16, 32 or a similar number to get a more directly comparable core count.
Work-items
Each work-item in OpenCL is a thread in terms of its control flow, and its memory model. The hardware may run multiple work-items on a single thread, and you can easily picture this by imagining four OpenCL work-items operating on the separate lanes of an SSE vector. It would simply be compiler trickery that achieves that, and on GPUs it tends to be a mixture of compiler trickery and hardware assistance. OpenCL 2.0 actually exposes this underlying hardware thread concept through sub-groups, so there is another level of hierarchy to deal with.
Work-groups
Each work-group contains a set of work-items that must be able to make progress in the presence of barriers. In practice this means that it is a set, all of whose state is able to exist at the same time, such that when a synchronization primitive is encountered there is little overhead in switching between them and there is a guarantee that the switch is possible.
A work-group must map to a single compute unit, which realistically means an entire work-group fits on a single entity that CPU people would call a core - CUDA would call it a multiprocessor (depending on the generation), AMD a compute unit and others have different names. This locality of execution leads to more efficient synchronization, but it also means that the set of work-items can have access to locally constructed memory units. They are expected to communicate frequently, or barriers wouldn't be used, and to make this communication efficient there may be local caches (similar to a CPU L1) or scratchpad memories (local memory in OpenCL).
As long as barriers are used, work-groups can synchronize internally, between work-items, using local memory, or by using global memory. Work-groups cannot synchronize with each other and the standard makes no guarantees on forward progress of work-groups relative to each other, which makes building portable locking and synchronization primitives effectively impossible.
A lot of this is due to history rather than design. GPU hardware has long been designed to construct vector threads and assign them to execution units in a fashion that optimally processes triangles. OpenCL falls out of generalising that hardware to be useful for other things, but not generalising it so much that it becomes inefficient to implement.
There are already alot of good answers, for further understanding of the terminology of OpenCL this paper ("An Introduction to the OpenCL Programming Model" by Jonathan Tompson and Kristofer Schlachter) actually describes all the concepts very well.
Use of the work-groups allows more optimization for the kernel compilers. This is because data is not transferred between work-groups. Depending on used OpenCL device, there might be caches that can be used for local variables to result faster data accesses. If there is only one work-group, local variables would be just the same as global variables which would lead to slower data accesses.
Also, usually OpenCL devices use Single Instruction Multiple Data (SIMD) extensions to achieve good parallelism. One work group can be run in parallel with SIMD extensions.
Should a Work-Group exactly map to a physical core ?
I think that, only way to find the fastest work-group size, is to try different work-group sizes. It is also possible to query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE from the device with clGetKernelWorkGroupInfo. The fastest size should be multiple of that.
Can work-groups synchronize or is that even needed ?
Work-groups cannot be synchronized. This way there is no data dependencies between them and they can also be run sequentially, if that is considered to be the fastest way to run them. To achieve same result, than synchronization between work-groups, kernel needs to split into multiple kernels. Variables can be transferred between the kernels with buffers.
One benefit of work groups is they enable using shared local memory as a programmer-defined cache. A value read from global memory can be stored in shared work-group local memory and then accessed quickly by any work item in the work group. A good example is the game of life: each cell depends on itself and the 8 around it. If each work item read this information you'd have 9x global memory reads. By using work groups and shared local memory you can approach 1x global memory reads (only approach since there is redundant reads at the edges).
I have written a CUDA kernel in which each thread makes an update to a particular memory address (with int size). Some threads might want to update this address simultaneously.
How does CUDA handle this? Does the operation become atomic? Does this increase the latency of my application in any way? If so, how?
The operation does not become atomic, and it is essentially undefined behavior. When two or more threads write to the same location, one of the values will end up in the location, but there is no way to predict which one.
It can be especially problematic if you are reading and writing, such as to increment a variable.
CUDA provides a set of atomic operations to help.
You may also use other coding techniques such as parallel reductions, to help when there are multiple updates to the same location, such as finding a max or min value.
If you don't care about the order of updates, it should not be a performance issue for newer GPUs which automatically condense writes or reads to a single location in global memory or shared memory, but this is also not specified behavior.
I am implementing a kernel function in which the memory from the host side is transferred to kernel.The kernel has three functions.. Is it possible to share the same memory buffers with the kernels at different times ??
Yes, multiple kernels can use the same memory objects, as long as there is no risk for the kernels to be executed at the same time. It is the case for the usual "single command queue not created with out of order execution".
Yes, I do this with my ray tracer. I have three kernels. A preprocessor which changes geometry, a ray tracer , and a post processor which does image processing. I share memory buffers with all three of them. I make sure the kernels finish before I start the next one.
You can share memory without any problem. If the memory is read only you can even use that memory object as an input for 2 kernels running concurrently (ie: different GPUs/same context).
However, if you want to overwrite the memory zones, then be careful and use events to sync your kernels. I strongly recomend the events mechanism, since it enables parallel I/O read and writes to the memory zones in another queue.