I seem to have read somewhere that all work-items within a work-group execute concurrently. I have also read that a work-group is handled by a streaming multiprocessor. But what if the work-group size is chosen such that the number of work-items exceeds the number of streaming processors in a streaming multiprocessor (Nvidia)? Then they cannot be executed concurrently?
You are referring to different levels, where "concurrently" has different interpretations:
- the OpenCL execution model, a specification describing how tasks are executed on the device. In this model, all work-items in an ND-range kernel command execute concurrently, and work-items within a work-group can communicate using local memory and can be synchronized.
- the hardware processing the tasks. In this case, "concurrently" is usually interpreted as "executing an instruction at the same time", but it could also be interpreted as "in flight in a core at the same time" (a larger set of work-items).
Related
My understanding was that each workgroup is executed on the GPU and then the next one is executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operates on it directly, no conflicts arise.
If a workgroup loads a chunk into local memory, performs some computation, and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to execute only one workgroup at a time, or rearrange the execution order so that I somehow don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize, you would be better off launching just one work-group and writing a loop inside it to serialize the series of work chunks in that same work-group (see the sketch below). This is often a good strategy in general, even with multiple work-groups.
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck on one core of a 32-core GPU, you're not getting much use out of the device.
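A minimal sketch of that single-work-group loop, in OpenCL C. The kernel name, the 256-element tile, and the chunk layout are illustrative assumptions, not something taken from the question above; the kernel is meant to be launched with exactly one work-group:

    __kernel void process_chunks(__global float *buf,
                                 uint num_chunks,
                                 uint chunk_size)      /* assumes chunk_size <= 256 */
    {
        __local float tile[256];
        for (uint c = 0; c < num_chunks; ++c) {        /* chunks run strictly in order */
            size_t i = get_local_id(0);
            if (i < chunk_size)
                tile[i] = buf[c * chunk_size + i];     /* stage the chunk locally */
            barrier(CLK_LOCAL_MEM_FENCE);              /* whole group sees the chunk */
            /* ... compute on tile ... */
            if (i < chunk_size)
                buf[c * chunk_size + i] = tile[i];     /* write the result back */
            barrier(CLK_GLOBAL_MEM_FENCE);             /* finish writes before the next chunk */
        }
    }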
You need to set the global size and dimensions to those of a single work-group and enqueue a new NDRange for each group - essentially breaking the call to your kernel up into many smaller calls. Make sure your command queue does not allow out-of-order execution, so that each kernel launch completes before the next one starts.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
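A minimal host-side sketch of that approach. Here `queue` (created in-order), `kernel`, and `num_groups` are assumed to already exist, and the group-index argument at index 1 is purely illustrative:

    size_t local_size  = 64;
    size_t global_size = local_size;           /* global == local => exactly one group */
    for (cl_uint g = 0; g < num_groups; ++g) {
        clSetKernelArg(kernel, 1, sizeof(cl_uint), &g);   /* tell the kernel which chunk it owns */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size,
                               0, NULL, NULL);
    }
    clFinish(queue);   /* on an in-order queue the launches run back to back */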
Yes, the groups can be executed in parallel; this is normally a very good thing. Here is a related question.
The number of workgroups that can be concurrently launched on a Compute Unit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, the important ones being vector registers and workgroup-level memory** (called LDS by AMD and shared memory by Nvidia). If you want to launch just one workgroup per CU/SMX, make sure that the workgroup consumes the bulk of these resources and blocks further workgroups on the same CU/SMX (see the sketch below). You would, however, still have other workgroups executing on other CUs/SMXs - a GPU normally has multiple of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.
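A hedged sketch of what "consuming the bulk of these resources" could look like in OpenCL C: a kernel that claims a large slab of local memory, so that a second workgroup no longer fits on the same CU/SMX. The 32 KiB figure is an assumption; check CL_DEVICE_LOCAL_MEM_SIZE on your device:

    __kernel void greedy_group(__global float *buf)
    {
        __local float hog[8192];            /* 8192 * 4 B = 32 KiB of LDS/shared memory */
        size_t i = get_local_id(0);         /* assumes work-group size <= 8192 */
        hog[i] = buf[get_global_id(0)];     /* touch it so it is not optimized away */
        barrier(CLK_LOCAL_MEM_FENCE);
        buf[get_global_id(0)] = hog[i];
    }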
I'm not quite clear on the actual meaning of CL_DEVICE_LOCAL_MEM_SIZE, which is acquired through the clGetDeviceInfo function. Does this value indicate the total sum of all the available local memory on a certain device, or the upper limit of local memory that a work-group can use?
TL;DR: Per compute unit, hence also the maximum allocatable to a single work-group.
This value is the amount of local memory available on each compute unit in the device. Since a work-group is assigned to a single compute unit, this is also the maximum amount of local memory that any work-group can have.
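For reference, a minimal sketch of the query itself; `device` is assumed to be a valid cl_device_id obtained earlier:

    #include <CL/cl.h>
    #include <stdio.h>

    void print_local_mem(cl_device_id device)
    {
        cl_ulong local_mem = 0;   /* CL_DEVICE_LOCAL_MEM_SIZE is a cl_ulong, in bytes */
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem), &local_mem, NULL);
        printf("local memory per compute unit: %llu bytes\n",
               (unsigned long long)local_mem);
    }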
For performance reasons on many GPUs, it is usually desirable to have multiple work-groups running on each compute unit concurrently (to hide memory access latency, for example). If one work-group uses all of the available local memory, the device will not be able to schedule any other work-groups onto the same compute unit until it has finished. If possible, it is recommended to limit the amount of local memory each work-group uses (to e.g. a quarter of the total local memory) to allow multiple work-groups to run on the same compute unit concurrently.
I am now studying parallel computing and algorithms, and I am a little bit confused about the terms concurrent execution and simultaneous execution.
What is the difference between these terms? When do we have to use concurrent and when do we have to use simultaneous in parallel computing?
Simultaneous execution is about utilizing multiple resources (cores, hardware threads, etc.) in order to perform multiple tasks at the same time. The tasks don't have to interact in any way; you may have two different applications running simultaneously on two different cores, for example, or on the same core.
The art of designing systems to be able to perform multiple tasks at the same time can be said to deal with simultaneous execution. Hyper-threading, for example, is also called "SMT", simultaneous multi-threading, since it deals with the ability to run two threads with their full contexts at the same time on a single core. (This is Intel's approach; AMD has a slightly different solution - see Difference between intel and AMD multithreading.)
Concurrency is a term residing on a higher level of abstraction, relating to the OS world. It's a property of your execution environment in which you have multiple tasks that may be executed over time, while you have no control over the order or even the form of interleaving in which they're performed. It doesn't really matter if they operate simultaneously on multiple cores, on one core with SMT, or even on a single-threaded core with some preemption mechanism and some scheduling algorithm that breaks the tasks into chunks and constantly swaps between them. The important thing here is that concurrency forces you to design your tasks in a way that guarantees correctness (especially if they interact or share data) on any type of system with any order or interleaving.
If the task is designed correctly (with proper locking, barriers, semaphores, and anything guaranteeing correct data flow) and the OS does its job properly (saving states on context switch for example or clearing caches and shooting down TLB entries when needed), then it can run with any form of execution model "under the hood".
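As a minimal illustration (a sketch, not tied to any particular system mentioned above): a counter shared by two tasks is only correct under arbitrary interleaving if access to it is serialized, for example with a mutex:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; ++i) {
            pthread_mutex_lock(&lock);     /* correct whether the two workers run */
            counter++;                     /* simultaneously on two cores or      */
            pthread_mutex_unlock(&lock);   /* interleaved on one                  */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%ld\n", counter);          /* always 2000000, on any schedule */
        return 0;
    }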
Since you're referring to parallel algorithms, the proper term for you is probably concurrent execution.
There are quite a lot of examples in this thread (with additional links to sources; I won't copy them here to avoid plagiarism): What is the difference between concurrency and parallelism?
I am a bit confused about how it is possible that warps diverge and need to be synchronized via the __syncthreads() function. All threads in a block execute the same code in SIMT fashion. How can it be that they are not in sync? Is it related to the scheduler? Do the different warps get different computing times? And why is there an overhead when using __syncthreads()?
Let's say we have 12 different warps in a block and 3 of them have finished their work. Are they now idling while the other warps get their computation time? Or do they still get computation time to execute the __syncthreads() function?
First let's be careful with terminology. Warp divergence refers to threads within a single warp that take different execution paths, due to control structures in the code (if, while, etc.) Your question really has to do with warps and warp scheduling.
Although the SIMT model might suggest that all threads execute in lockstep, this is not the case. First of all, threads within different blocks are completely independent. They may execute in any order with respect to each other. For your question about threads within the same block, let's first observe that a block can have up to 1024 (or perhaps more) threads, but today's SMs (the SM or SMX is the "engine" inside the GPU that processes a threadblock) don't have 1024 CUDA cores, so it's not even theoretically possible for an SM to execute all threads of a threadblock in lockstep. Note that a single threadblock executes on a single SM, not across all (or more than one) SMs simultaneously. So even if a machine has 512 or more total CUDA cores, they cannot all be used to handle the threads of a single threadblock, because a single threadblock executes on a single SM. (One reason for this is so that SM-specific resources, like shared memory, can be accessible to all threads within a threadblock.)
So what happens? It turns out each SM has a warp scheduler. A warp is nothing more than a collection of 32 threads that gets grouped together, scheduled together, and executed together. If a threadblock has 1024 threads then it has 32 warps of 32 threads per warp. Now, for example, on Fermi, an SM has 32 CUDA cores, so it is reasonable to think about an SM executing a warp in lockstep (and that is what happens, on Fermi). By lockstep, I mean that (ignoring the case of warp divergence, and also certain aspects of instruction-level-parallelism, I'm trying to keep the explanation simple here...) no instruction in the warp is executed until the previous instruction has been executed by all threads in the warp. So a Fermi SM can only actually be executing one of the warps in a threadblock at any given instant. All other warps in that threadblock are queued up, ready to go, waiting.
Now, when the execution of a warp hits a stall for any reason, the warp scheduler is free to move that warp out and bring another ready-to-go warp in (this new warp might not even be from the same threadblock, but I digress.) Hopefully by now you can see that if a threadblock has more than 32 threads in it, not all the threads are actually getting executed in lockstep. Some warps are proceeding ahead of other warps.
This behavior is normally desirable, except when it isn't. There are times when you do not want any thread in the threadblock to proceed beyond a certain point, until a condition is met. This is what __syncthreads() is for. For example, you might be copying data from global to shared memory, and you don't want any of the threadblock data processing to commence until shared memory has been properly populated. __syncthreads() ensures that all threads have had a chance to copy their data element(s) before any thread can proceed beyond the barrier and presumably begin computations on the data that is now resident in shared memory.
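A minimal CUDA sketch of that pattern; the kernel and buffer names are illustrative, and it assumes a block size of 256:

    __global__ void staged_kernel(const float *in, float *out, int n)
    {
        __shared__ float tile[256];
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (g < n) ? in[g] : 0.0f;   // every thread stages one slot
        __syncthreads();             // barrier: the whole tile is populated before
                                     // any warp in the block moves on
        if (g < n)                   // now it is safe to read another thread's slot
            out[g] = tile[blockDim.x - 1 - threadIdx.x];
    }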
The overhead of __syncthreads() comes in two flavors. First of all, there's a very small cost just to process the machine-level instructions associated with this built-in function. Second, __syncthreads() will normally have the effect of forcing the warp scheduler and SM to shuffle through all the warps in the threadblock until each warp has met the barrier. If this is useful, great. But if it's not needed, then you're spending time doing something that isn't needed. Thus the advice not to just liberally sprinkle __syncthreads() through your code: use it sparingly and where needed. If you can craft an algorithm that uses it less than another, that algorithm may be better (faster).
What is the difference between parallel processing and multi-core processing?
Parallel and multi-core processing both refer to the same thing: the ability to execute code at the same time (on more than one core/CPU/machine). So in this sense, multi-core is just a means of doing parallel processing.
On the other hand, concurrency (which is probably what you mean by parallel processing) refers to having multiple units of execution (threads or processes) that are interleaved. This can happen either on a single-core CPU, on many cores/CPUs, or even on many machines (clusters).
Summing up: multi-core is a subset of parallel processing, and concurrency can occur with or without parallelism. The field that studies this is distributed systems or distributed computing.
Parallel processing just refers to a program running more than 1 part simultaneously, usually with the different parts communicating in some way. This might be on multiple cores, multiple threads on one core (which is really simulated parallel processing), multiple CPUs, or even multiple machines.
Multicore processing is usually a subset of parallel processing.
Multicore processing means code working on more than one "core" of a single CPU chip. A core is like a little processor within a processor. So making code work for multicore processing will nearly always be about the parallelization aspect (though it would also include removing any core-specific assumptions, which you shouldn't normally have anyway).
As far as algorithm design goes, if it is correct from a parallel-processing point of view, it will be correct on multicore.
However, if you need to optimise your code to get it to run as fast as possible "in parallel", then the differences between multicore, multi-CPU, multi-machine, and vectorised will make a big difference.
Parallel processing can be done inside a single core with multiple threads.
Multi-core processing means distributing those threads across the multiple cores of a CPU.